TWO EXTENSIONS OF LEVENE'S TEST OF HOMOGENEITY OF VARIANCES: A MONTE CARLO INVESTIGATION OF ACTUAL-NOMINAL ERROR RATE AND RELATIVE POWER

Dissertation for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
JAMES DOUGLAS MAAS
1975

This is to certify that the thesis entitled TWO EXTENSIONS OF LEVENE'S TEST OF HOMOGENEITY OF VARIANCES: A MONTE CARLO INVESTIGATION OF ACTUAL-NOMINAL ERROR RATE AND RELATIVE POWER, presented by James Douglas Maas, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Secondary Education and Curriculum.

Major professor
Date: 7 February 1975

ABSTRACT

TWO EXTENSIONS OF LEVENE'S TEST OF HOMOGENEITY OF VARIANCES: A MONTE CARLO INVESTIGATION OF ACTUAL-NOMINAL ERROR RATE AND RELATIVE POWER

BY

James Douglas Maas

This study was motivated by the need for a good K-sample test for equal variability. Well over a dozen tests have been suggested in the literature but all suffer from one or more of the following ailments: (1) poor correspondence between nominal type I error rate and empirical type I error rate if the normality assumption is violated; (2) low power when this correspondence is good; (3) lack of post hoc procedures associated with the method.

Of all the tests which have been considered in the recent literature, only two have shown an acceptable correspondence between the nominal and empirical type I error rates, no matter what distribution is sampled from. These are Box's test and Moses' test. Box's test was found to be more powerful than Moses' test for all cases considered. Unfortunately, neither of these procedures is desirable for sample sizes much less than fifteen. Of the remaining tests, one suggested by Levene showed the most promise.
Levene's test consists of an analysis of variance on the absolute deviations of the observations from their sample means. The average amount by which the empirical and nominal type I error rates differed was relatively low for this test; however, Levene's test consistently gave somewhat liberal estimates of the nominal alpha.

Both the Kruskal-Wallis test and the Normal Scores test of location are known to be somewhat more conservative than is ANOVA for most distributions sampled. Usually little power is sacrificed when performing the Kruskal-Wallis test or the Normal Scores test. It was thought that the conservative nature of these tests might counter the liberal nature of Levene's test to produce a test statistic which was acceptable in its empirical type I error rate. Both the Kruskal-Wallis extension and the Normal Scores extension of Levene's test would hopefully have power comparable to that of Levene's test. Thus, the thrust of this study was a Monte Carlo investigation of the properties of both the Kruskal-Wallis extension and the Normal Scores extension of Levene's test for equal variability.

Two, three and four levels of the independent variable were considered for various small sample sizes. The properties of the test statistics were observed when sampling from normal, uniform and exponential distributions. For each of the cases a simulated analysis was repeated one thousand times and the number of rejections of the null hypothesis of equal variability was counted using a series of commonly employed nominal alpha levels. Both nominal-empirical alpha level fit and power were of concern in this study.

When samples were taken from either uniform or normal distributions, the results were in the direction predicted. Both extensions of Levene's test produced better estimates of type I error rate than did Levene's test.
The Normal Scores extension proved to give generally better estimates of type I error rate and power than did the Kruskal-Wallis extension. For $\alpha = .10$ and $\alpha = .05$, both extensions were still somewhat liberal and for $\alpha = .01$ they were slightly conservative. In these same situations, Levene's test had slightly more power; however, some of the advantage with respect to power may be attributable to the slightly greater liberality of Levene's test.

When the exponential distribution was sampled from, the empirical estimates of $\alpha$ were all very liberal. With this poor quality fit, the power comparisons were meaningless. With the failure of these techniques for the exponential distribution, the good K-sample test for equal variability which is somewhat distribution free had not been found. This failure indirectly suggested the use of a similar but slightly different technique.

The properties of a modified form of Levene's test and of the two extensions were observed in a mini-study. The test statistics were computed as before, with the exception of deviating observations from sample medians rather than sample means. The median deviation technique was employed for the three sample design with six observations per sample when sampling from normal distributions and when sampling from exponential distributions.

The results of this mini-study were quite encouraging. For $\alpha = .10$ and $\alpha = .05$, the empirical type I error rates of the Levene type test were quite close to the nominal level when sampling from either distribution. However, for $\alpha = .01$ the empirical type I error rates were too liberal. With the exception of Box's test and Moses' test, this is the only time that the type I error rate was well behaved when sampling from either a distribution with high kurtosis or with heavy skew.
Although the Levene type test using the median was not without power, it appears to be considerably less powerful than the appropriate conventional tests when sampling from normal distributions. The Kruskal-Wallis median extension and the Normal Scores median extension of the Levene type test did not fare nearly as well in either their empirical estimates of type I error rate or power.

TWO EXTENSIONS OF LEVENE'S TEST OF HOMOGENEITY OF VARIANCES: A MONTE CARLO INVESTIGATION OF ACTUAL-NOMINAL ERROR RATE AND RELATIVE POWER

BY

James Douglas Maas

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Secondary Education and Curriculum

1975

ACKNOWLEDGMENTS

One seldom gets an opportunity to publicly say thanks to the people who have profoundly and positively affected his life. I would like to take this space to do just that.

To Dr. Maryellen McSweeney, who directed this dissertation effort, thank you for the faith that you showed in me. It has been a pleasure to work closely with a tireless, brilliant and dedicated teacher.

To Dr. Ted Ward, who chaired my doctoral committee, my thanks for being so understanding. My co-teaching experience with you will remain among my most pleasant memories.

To Dr. Howard Teitelbaum, Dr. Marvin Grandstaff and Dr. Andrew Porter, my thanks for your constant guidance and encouragement as committee members.

A doctoral program would be very difficult without good teachers and good friends. All of these people I hold in the highest regard not only as intellectual leaders but also as friends.

Finally, to my wife Jean, a super thanks to a girl who truly earned her Ph.T. Degree. Thanks for keeping me on the right track to the very end. Hopefully, the future will be bright for you because of all the hardships you endured the past four years.

TABLE OF CONTENTS

Chapter

I. RATIONALE
      An Observation
      An Example
      Analysis Techniques Available
      Two Proposed Tests
      The Research Questions
      An Overview
II. REVIEW OF THE LITERATURE
      K Sample Tests of Scale (K>2)
      Levene's Test of Scale
      K Sample Tests of Location (K>2)
      Summary
III. THE DESIGN
IV. THE GENERATORS USED
      Generation of Pseudorandom Numbers
      Generation of Uniform Random Variates
      Generation of Exponential Random Variates
      Generation of Normal Random Variates
      Generation of Inverse Normal Scores
      Generation of F Values
V. THE RESULTS
      Sampling from Normal Distributions
      Sampling from Uniform Distributions
      Sampling from Exponential Distributions
      Agreement Rates of Kruskal-Wallis and Normal Scores Extensions
      A Modification
      Summary of Results
VI. SUMMARY, CONCLUSIONS AND DISCUSSION
      Summary
      Findings
      Discussion
SELECTED BIBLIOGRAPHY

LIST OF TABLES

Monte Carlo studies of tests of scale in recent literature

Average empirical estimates of type I error rates when $\alpha = .05$ for twelve K-sample tests of scale based on Monte Carlo studies found in the literature

Observed number of type I errors in ten thousand tests, given a nominal level of .05

Empirical estimates of type I error rates using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from normal distributions with $\sigma_1 = \sigma_2 = \cdots = \sigma_K = 1$
Empirical estimates of power using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from normal distributions with $\sigma_1 = \sigma_2 = \cdots = \sigma_{K-1} = 1$, $\sigma_K = 2$

Average empirical estimates of type I error rates for the four tests considered when sampling from normal distributions

Average empirical estimates of power for the four tests considered when sampling from normal distributions

Average distance of the empirical estimate of $\alpha$ from the expected value of $\alpha$ for all designs considered when sampling from normal distributions

Empirical estimates of type I error rates using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from uniform distributions with $\sigma_1 = \sigma_2 = \cdots = \sigma_K = 1$

Empirical estimates of power using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from uniform distributions with $\sigma_1 = \sigma_2 = \cdots = \sigma_{K-1} = 1$, $\sigma_K = 2$

Average empirical estimates of type I error rates for the four tests considered when sampling from uniform distributions

Average empirical estimates of power for the four tests considered when sampling from uniform distributions

Average distance of the empirical estimates of $\alpha$ from the expected value of $\alpha$ for all designs considered when sampling from uniform distributions

Empirical estimates of type I error rates obtained when sampling from uniform distributions for the Kruskal-Wallis extension and the Normal Scores extension
Empirical estimates of type I error rates using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from exponential distributions with $\sigma_1 = \sigma_2 = \cdots = \sigma_K = 1$

Empirical estimates of power using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from exponential distributions with $\sigma_1 = \sigma_2 = \cdots = \sigma_{K-1} = 1$, $\sigma_K = 2$

Average distance of the empirical estimates of $\alpha$ from the expected value of $\alpha$ for all designs considered when sampling from exponential distributions

Kruskal-Wallis extension and Normal Scores extension agreement rates for one thousand trials

Kruskal-Wallis extension and Normal Scores extension agreement rates for one thousand trials

Empirical type I error rate and power obtained by three variations of Levene's test and the Kruskal-Wallis and Normal Scores extensions where a three-cell design (6, 6, 6) was used when sampling from exponential distributions with $\sigma_1$, $\sigma_2$ and $\sigma_3$ as tabled

Empirical type I error rate and power obtained by two variations of Levene's test and the Kruskal-Wallis and Normal Scores extensions where a three-cell design (6, 6, 6) was used when sampling from normal distributions with $\sigma_1$, $\sigma_2$ and $\sigma_3$ as tabled

CHAPTER I

RATIONALE

An Observation

In educational research, analysis problems often arise because the real world is not neatly structured to meet the assumptions required by the models that researchers have available to them. When faced with a situation which is known not to conform to the models available, one can react in different ways. The most common approach seems to be to use the model knowing the assumptions are probably being violated.
If the test is robust with respect to violation of these assumptions, then this approach is quite sound. In many of the common situations robustness has been investigated and this information is available to the researcher.

Should the test be nonrobust, one has two alternatives if he wishes to make inferential statements. The model and its associated test of significance could be used without knowledge of how close the actual error rates are to the nominal error rates when the particular violation is made. For many researchers this may be the only route available.

If the violation of assumptions seems critical, another option consists of the development of a new model which does not demand these assumptions. This may be as "simple" as a data transformation or as complex as the development of a new test statistic and its distributional properties. Often this is a difficult task with many dead ends. Occasionally the model is effective and all ends on a happy note.

An Example

Recently, an analysis situation was encountered which did not conform to a known model (26). A researcher, teaching prospective counselors to use behavioral objectives effectively in the counseling situation, wished to compare the effectiveness of three techniques. Thirty-six subjects were randomly assigned to one of three conditions. Group A received (1) an article in which the importance of behavioral objectives was emphasized, (2) a manual which explained conditions, criterion and terminal behavior statements and which gave illustrations of each and (3) a rating form to be used in comparing situations as to their use of these objectives. Group B received the article and manual but not the rating form. Group C received only the article.

After getting acquainted with the above materials, each subject was placed in a situation in which he was asked to compare each of three pairs of audiotapes on their use of behavioral objectives.
For each pair he was to pick the better of the two. Group A was able to use the rating form to assist them whereas Groups B and C were asked to make their choice without the use of this rating form. In fact, at no point in the experiment did Groups B and C have access to the rating form. In each group half of the subjects received feedback on their pair selections and half did not. Again, those to get feedback were selected at random.

After this experience each subject was placed in a simulated counseling situation. The dependent variable of interest was the number of uses of behavioral objective statements per unit time that was observed during the simulated situation. The design was:

                      Group A    Group B    Group C
                        T1         T2         T3
Feedback      (F1)    XXX XXX    XXX XXX    XXX XXX
No Feedback   (F2)    XXX XXX    XXX XXX    XXX XXX

Thirty-six subjects were randomly assigned to the six cells so that there were six observations per cell. Two of the research questions of interest were:

1. Does Treatment A produce a more homogeneous group than Treatments B and/or C?
2. Are those groups receiving feedback more homogeneous than those groups not receiving it?

Analysis Techniques Available

A search was made for a good test which was created specifically to answer questions about homogeneity of variances for designs with two or more design variables. The following three criteria were used to judge the quality of prospective multifactor tests of equal variability.

First, there must be good agreement between the nominal $\alpha$ and actual $\alpha$. This means that the probability of rejecting the null hypothesis is actually $\alpha$ when the test is performed at the nominal $\alpha$ level.

Second, the test must possess good power. This means that there is a high probability of rejecting the null hypothesis when it is not true.

Third, the test should have post hoc procedures associated with it. This means that if the null hypothesis is rejected, techniques are available to locate the sources of rejection and to estimate their magnitude.
The search for a multifactor test of variance homogeneity meeting these criteria was fruitless. Those multifactor tests which were identified suffered from serious limitations. The tests developed by Russell and Bradley (53), Han (23) and Shukla (53) are restricted to multifactor designs with no interaction effects and with a single source of variance heterogeneity. The test proposed by Overall and Woodward (49) is less restrictive in its design specifications, but is believed to be sensitive to population nonnormality. Simultaneous post hoc confidence interval procedures are not available for any of these tests except that of Overall and Woodward, and the small sample properties of the tests have not been investigated for nonnormal data.

Although the literature on multifactor tests for variance homogeneity was limited and the tests were disappointing, the literature on methods of comparing variances for single factor designs was somewhat more extensive. The above design could be considered as a single factor design if the six cells were treated as six levels of one design variable as pictured below:

T1F1    T1F2    T2F1    T2F2    T3F1    T3F2
XXX     XXX     XXX     XXX     XXX     XXX
XXX     XXX     XXX     XXX     XXX     XXX

Bartlett, Cochran, Hartley, Cadwell, Kendall, Scheffé, Bargmann, Box, Levene and others have addressed themselves to the development of analysis techniques for testing several populations for equality of variance. However, all of these methods suffer from one or more of the following ailments: (1) poor correspondence between nominal type I error rate and actual type I error rate if the normality assumption is violated; (2) low power when this correspondence is good; (3) lack of post hoc procedures associated with the method.

In a 1972 article, Gartside (15) compared six different methods of testing for homogeneity of variance.
These six methods are (1) Bartlett's statistic, (2) a modification of Bartlett's statistic, (3) Cochran's statistic, (4) Cadwell's statistic, (5) Box's statistic and (6) Bargmann's modification of Box's statistic. When sampling from normal distributions with equal sample sizes, he found that for all six methods the actual type I error rate matched the nominal type I error rate very closely. However, under the condition of nonnormality but equal sample size all but Box's test were markedly high in their actual type I error rates. Unfortunately Box's test proved to have less power under these same conditions. For one alternative ($\sigma_1^2 = C^0 = 1$, $\sigma_2^2 = C^1 = C$, ..., $\sigma_K^2 = C^{K-1}$) Bartlett's statistic was more powerful and for another ($\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_{K-1}^2 = 1$, $\sigma_K^2 = C$) Cochran's was more powerful. As a result of this article one could conclude that the traditional tests left much to be desired.

In 1960 Levene devised a test based on ANOVA of absolute deviations from the mean (36). Using Monte Carlo techniques, Levene showed, for equal sized samples of ten or twenty per cell with two and three cell designs, that the empirical type I error rate and nominal type I error rate practically coincided. This was true for normal, uniform and double exponential distributions. At the nominal .05 level, the largest empirical probability obtained was .063. Levene also investigated the power of his test for rejecting $H_0$ and found it to be quite satisfactory. Assuming normal distributions, the approximate efficiency of Levene's test with respect to Bartlett's test ranged from .83 to .90 depending on the value of alpha and the alternative hypothesis considered. In addition to these properties, Levene's test, employing a fixed effects ANOVA model, has techniques available to locate specific differences in variability should $H_0$ be rejected.
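As an illustration of the computation just described, the following is a minimal sketch of Levene's statistic in Python, assuming each cell is supplied as a list of numbers. The function name `levene_f` and the example data are hypothetical and are not taken from Levene's paper or from this study. Only the F ratio and its degrees of freedom are returned; the critical value would still have to be looked up in an F table.

```python
from statistics import mean


def levene_f(groups):
    """Return (F, df_between, df_within) for Levene's test.

    F is the one-way ANOVA ratio computed on the absolute deviations
    of each observation from its own cell mean.
    """
    # z scores: absolute deviation of each observation from its cell mean.
    z = []
    for cell in groups:
        m = mean(cell)
        z.append([abs(x - m) for x in cell])

    k = len(z)
    n_total = sum(len(cell) for cell in z)
    cell_means = [mean(cell) for cell in z]
    grand_mean = sum(sum(cell) for cell in z) / n_total

    # Between-cells and within-cells sums of squares of the z scores.
    ss_between = sum(len(cell) * (cm - grand_mean) ** 2
                     for cell, cm in zip(z, cell_means))
    ss_within = sum((v - cm) ** 2
                    for cell, cm in zip(z, cell_means) for v in cell)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, k - 1, n_total - k


# Hypothetical data: the second cell is twice as spread out as the first.
f_ratio, df1, df2 = levene_f([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
print(f_ratio, df1, df2)
```

Because the statistic is an ordinary ANOVA on transformed scores, the usual fixed effects post hoc machinery can in principle be applied to the same absolute deviations.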
This test went relatively unnoticed by applied statisticians until a 1966 American Educational Research Journal (17) article by Glass brought it to the forefront. From 1972 to 1974, three articles were written indicating that Levene's test has inflated type I empirical probabilities for cases in which the cell sizes are small or the number of cells is large. This indicates a need for more exploration into possible K sample tests which have (a) correspondence of nominal and empirical type I error rate better than that of Levene's test, (b) available post hoc procedures, (c) more power than Levene's test and (d) efficiency close to Bartlett's test in the situation for which Bartlett's test was designed.

Two Proposed Tests

In multicell tests of location, ANOVA is traditionally the test used. It has been found to be robust with respect to violation of the normality assumption and, if cell sizes are equal, the homoscedasticity assumption (18). However, ANOVA has some genuine competitors. The Kruskal-Wallis (33) test is one which is rather insensitive to differences in the skewness, kurtosis or scale of the K population distributions. This test is insensitive to nonlocation alternatives and has a high asymptotic efficiency relative to the ANOVA F test as a test for location alternatives (.955 when the assumptions of normality and common variance are satisfied). If the underlying distributions are nonnormal the efficiency relative to the ANOVA F test often equals or exceeds one. Two such nonnormal distributions for which the Kruskal-Wallis test is at least as efficient as the ANOVA F test are the uniform and the exponential distribution.

The normal scores test is yet another good competitor with ANOVA (39). When the assumptions of normality and common variance are met, the asymptotic efficiency of the normal scores test relative to the ANOVA F test is one.
It is more efficient than the Kruskal-Wallis test when the samples have been drawn from the uniform or exponential distributions (27). Both the Kruskal-Wallis test and the normal scores test have been shown to be somewhat conservative in the empirical estimates of type I error rate.

It has been stated that (1) Levene's test appears to be an efficient test relative to Bartlett's test, (2) Levene's test is somewhat liberal in its empirical estimates of actual type I error rate, but (3) the Kruskal-Wallis test and normal scores test are very good competitors to ANOVA in tests of location and are even more efficient than ANOVA if the normality and homoscedasticity assumptions are not met, and (4) the Kruskal-Wallis test and the normal scores test are somewhat conservative in their empirical estimates of actual type I error rate. This would suggest two other possible methods of testing K populations for differences in variance. The first is a Kruskal-Wallis extension of Levene's test and the second is a normal scores extension of Levene's test.

First, consider Levene's test. What Levene suggests is to calculate $z_{ij}$, the absolute value of the difference of each observation in a given cell from the cell mean. For observation i in cell j,

$$z_{ij} = |X_{ij} - \bar{X}_{\cdot j}|.$$

Then perform a one-way ANOVA on the $z_{ij}$ scores. The test statistic is the ratio of $MS_{between}$ to $MS_{within}$ of the transformed $z_{ij}$ scores. In a J cell design with a grand total of N observations, if this F-ratio exceeds the $(1-\alpha)$ percentile point of the central F distribution which has $J-1$ and $N-J$ degrees of freedom, one concludes, at the $\alpha$ significance level, that the populations are not identical in variability.

The first extension suggested is based on the Kruskal-Wallis statistic. This extension involves the following steps:

1. Calculate $z_{ij} = |X_{ij} - \bar{X}_{\cdot j}|$ for each individual.
2. Rank order these $z_{ij}$ scores from 1 to N and replace each $z_{ij}$ with its corresponding rank $r_{ij}$.
3.
Obtain the mean rank $\bar{r}_j$ for each cell.
4. Compute the Kruskal-Wallis statistic

$$H = \frac{12}{N(N+1)} \sum_{j=1}^{K} n_j \bar{r}_j^{2} - 3(N+1).$$

5. If the H statistic exceeds the $(1-\alpha)$ percentile point of the central $\chi^2$ distribution with $K-1$ degrees of freedom, conclude, at the $\alpha$ significance level, that the populations are not identical in variability.

The second extension that is suggested is a K-sample normal scores extension of Levene's test. The following steps are used in this extension:

1. Calculate $z_{ij}$ scores for each individual.
2. Rank order the $z_{ij}$ scores from 1 to N and replace each $z_{ij}$ with its corresponding rank $r_{ij}$.
3. Replace each $r_{ij}$ with its inverse normal score $\phi_{ij} = \Phi^{-1}(r_{ij}/(N+1))$, where $\Phi^{-1}(p)$ denotes the point on the standard normal distribution which cuts off the cumulative probability p.
4. Calculate the test statistic W where

$$W = \frac{(N-1) \sum_{j} \left[ \left( \sum_{i} \phi_{ij} \right)^{2} / n_j \right]}{\sum_{i} \sum_{j} \phi_{ij}^{2}}.$$

5. If W exceeds the $(1-\alpha)$ percentile point of the central $\chi^2$ distribution with $K-1$ degrees of freedom, conclude, at the $\alpha$ significance level, that the populations are not identical in variability.

The Research Questions

The exact questions to be answered by this study are: For the Kruskal-Wallis extension of Levene's test and the Normal Scores extension of Levene's test, (1) How do the nominal type I error rate and the actual type I error rate correspond? (2) How powerful are these tests with respect to Levene's test? and (3) For the two-sample case how powerful are these tests with respect to the traditional parametric test $F = s_1^2/s_2^2$?

Two methods, the analytic and the empirical, can be used to obtain the sampling distribution of test statistics such as F used in Levene's case or H or W in the two extensions suggested. When possible the analytic method is preferred because the distributions involved are exact and free from sampling error.
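The H and W statistics defined in the numbered steps above can be sketched in Python as follows. This is an illustration under stated assumptions and is not the program used in the study. The function names are hypothetical, tied scores are given midranks (the steps above do not say how ties are handled), and the inverse normal score is taken from the standard library's `NormalDist.inv_cdf`.

```python
from statistics import NormalDist, mean


def abs_dev_ranks(groups):
    """Rank |x - cell mean| over all N scores; return the ranks by cell.

    Assumption: tied scores share the average of the ranks they span.
    """
    z = []
    for j, cell in enumerate(groups):
        m = mean(cell)
        z.extend((abs(x - m), j) for x in cell)
    order = sorted(range(len(z)), key=lambda idx: z[idx][0])
    ranks = [0.0] * len(z)
    i = 0
    while i < len(order):
        t = i
        while t + 1 < len(order) and z[order[t + 1]][0] == z[order[i]][0]:
            t += 1
        midrank = (i + t) / 2 + 1  # sorted positions i..t hold ranks i+1..t+1
        for s in range(i, t + 1):
            ranks[order[s]] = midrank
        i = t + 1
    per_cell = [[] for _ in groups]
    for (_, j), r in zip(z, ranks):
        per_cell[j].append(r)
    return per_cell


def kw_extension(groups):
    """H: the Kruskal-Wallis statistic on ranked absolute deviations."""
    ranks = abs_dev_ranks(groups)
    n = sum(len(cell) for cell in ranks)
    weighted = sum(len(cell) * mean(cell) ** 2 for cell in ranks)
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


def normal_scores_extension(groups):
    """W: the normal scores statistic on ranked absolute deviations."""
    ranks = abs_dev_ranks(groups)
    n = sum(len(cell) for cell in ranks)
    inv = NormalDist().inv_cdf
    scores = [[inv(r / (n + 1)) for r in cell] for cell in ranks]
    numerator = sum(sum(cell) ** 2 / len(cell) for cell in scores)
    denominator = sum(s * s for cell in scores for s in cell)
    return (n - 1) * numerator / denominator


# Hypothetical data: the second cell is far more variable than the first.
cells = [[0, 1, 5], [0, 10, 50]]
print(kw_extension(cells), normal_scores_extension(cells))
```

Both values would be referred to the chi-square distribution with K-1 degrees of freedom, as in step 5 of each procedure.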
Levene was forced to use empirical Monte Carlo techniques "because of the well known difficulty of obtaining the distribution of F from a non-normal population analytically." The difficulty involved absolute value integration and further complications which he mentions (36, p. 280). For the same reasons, the questions of this study must be answered by an empirical sampling (Monte Carlo) method rather than analytically.

An Overview

The current tests of scale, including Levene's test, are reviewed in Chapter II. In addition, the properties of ANOVA, the Kruskal-Wallis test and the Normal Scores test are compared. The primary questions of the study and the techniques used to answer them are found in Chapter III. These techniques involve the generation of uniform, normal and exponential random variates, inverse normal scores and some F values. The generators used are discussed in Chapter IV. The results of the study are found in Chapter V. These results coupled with a finding of Miller (42) suggested that a modification of Levene's technique might prove promising. A mini-study exploring this modification and the results of the mini-study are also found in Chapter V. The study is summarized and the primary conclusions are listed in Chapter VI. A discussion follows in which other possible techniques for testing equal variability are explored.

CHAPTER II

REVIEW OF THE LITERATURE

K Sample Tests of Scale (K>2)

Since 1931, statisticians have been searching for a good test for homogeneity of variability for K populations. One important criterion for a good test is that it be relatively distribution free. This means that no matter what distributions are sampled, the proportion of rejections, when the null hypothesis of equal variance is true, is close to the nominal level of significance of the test. If more than one test should have this property, then the one with the greatest power is typically preferred.
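The empirical sampling approach, and the criterion that the proportion of rejections under a true null hypothesis be close to the nominal level, can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: it uses Python's built-in pseudorandom generator rather than the generators of Chapter IV, a three-cell normal design with six observations per cell, the Kruskal-Wallis extension as the test statistic, and the tabled upper 5 percent point of chi-square with 2 degrees of freedom (5.991). Ties are ignored because continuous variates are sampled.

```python
import random
from statistics import mean

CHI2_CRIT_DF2_05 = 5.991  # upper 5% point of chi-square with 2 df


def kw_extension(groups):
    """Kruskal-Wallis statistic on ranked |x - cell mean| (no tie handling)."""
    means = [mean(cell) for cell in groups]
    z = sorted((abs(x - means[j]), j)
               for j, cell in enumerate(groups) for x in cell)
    n = len(z)
    rank_sums = [0.0] * len(groups)
    for rank, (_, j) in enumerate(z, start=1):
        rank_sums[j] += rank
    # n_j * (mean rank)^2 equals (rank sum)^2 / n_j.
    weighted = sum(rs * rs / len(cell) for rs, cell in zip(rank_sums, groups))
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


def empirical_rejection_rate(sigmas, n_per_cell=6, trials=1000, seed=1975):
    """Proportion of trials in which H exceeds the nominal .05 critical value.

    `sigmas` must list exactly three standard deviations, because the
    critical value above is for K - 1 = 2 degrees of freedom.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        groups = [[rng.gauss(0.0, s) for _ in range(n_per_cell)]
                  for s in sigmas]
        if kw_extension(groups) > CHI2_CRIT_DF2_05:
            rejections += 1
    return rejections / trials


# Null hypothesis true: the rate estimates the actual type I error rate.
print(empirical_rejection_rate([1.0, 1.0, 1.0]))
# One deviant cell: the rate estimates power against that alternative.
print(empirical_rejection_rate([1.0, 1.0, 4.0]))
```

Comparing the first printed proportion with .05 is the nominal-empirical fit check; the second gives a power estimate for one hypothetical alternative, which is only meaningful when the fit is acceptable.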
In this section, tests proposed by Bartlett (1), Hartley (24), Cochran (31), Cadwell (7), Box (3), Box and Andersen (4), Moses (44), Burr (12), Miller (42) and Levene (36) are reviewed. The small sample properties of these tests have been examined by means of Monte Carlo techniques by Games et al. (14), Gartside (15), Hall (21), Layard (34), Levene (36), Miller (42) and Winslow and Arnold (63). A summary of the tests each studied, the distributions sampled, the designs used, and the power alternatives used is displayed in Table 2-1.

[Table 2-1. Monte Carlo studies of tests of scale in recent literature.]

Gartside (15) claimed that the first approach to the K sample test of equal variability was made in 1931 by J. Neyman and E. Pearson using the likelihood ratio statistic approximately distributed as $\chi^2_{K-1}$. Bartlett (1) modified this statistic to improve the approximation to the chi-square distribution. His test statistic is:

$$B = \frac{(N-K)\,\ln\!\left(\sum v_i s_i^2 / (N-K)\right) - \sum v_i \ln s_i^2}{1 + \dfrac{1}{3(K-1)}\left(\sum \dfrac{1}{v_i} - \dfrac{1}{N-K}\right)}$$

where:

N = total sample size
K = number of samples
$s_i^2$ = variance of sample i
and $v_i$ = degrees of freedom upon which $s_i^2$ is based.

This statistic is referred to the chi-square distribution with $K-1$ degrees of freedom. Needless to say, the computations involved with this test are quite tedious.

Box (3) first noticed that if nonnormal populations are sampled when using Bartlett's test, the probability of a type I error may be much larger or smaller than the nominal $\alpha$. The direction of the departure from the nominal significance level depends on the kurtosis of the populations sampled. This was further confirmed by Layard (34), Hall (22), Gartside (15) and Games et al. (14). Layard looked at four sample designs with equal cell sizes of ten and twenty-five. He sampled from normal, uniform and double exponential distributions. Hall sampled from these same three distributions and looked at five sample designs with equal cell sizes of ten and twenty-five. In addition he sampled from a skew double exponential distribution and a sixth power distribution. Gartside sampled from the normal and the Weibull distributions for a variety of designs with equal cell sizes of four and sixteen. Games et al. looked at three sample designs with equal cell sizes of six and eighteen and sampled from normal, uniform, slightly skewed, moderately skewed, extremely skewed, and symmetric leptokurtic distributions.

Hartley (24) developed the statistic $F_{max}$ = (largest of K variances) / (smallest of K variances). The $F_{max}$ distribution is tabled in Kirk's text (31). Box (3) showed that as with Bartlett's test, this test will be liberal when sampling from leptokurtic distributions and conservative when sampling from platykurtic distributions. The Monte Carlo studies of Games et al.
(14) and Gartside (15) support this contention.

Cochran (31) devised a relatively simple test statistic, the ratio of the largest of the K cell variances to the sum of all the cell variances. Critical values of Cochran's statistic are tabled in Kirk (31). Unfortunately, Box (3) showed once again that with nonnormal distributions, this test has the same problems as Bartlett's test and Hartley's test. The Monte Carlo studies of Gartside (15) and Games et al. (14) support the claim of Box.

Cadwell (7) devised a statistic which is the ratio of the largest of the sample ranges to the smallest of the sample ranges. It is referred to the χ² distribution. Gartside (15) found this test to be too liberal when sampling from the Weibull distribution. This was the only distribution he considered other than the normal distribution.

Box and Andersen (4) used permutation theory to modify Bartlett's test. They define the statistic:

M1 = B (1 + G/2)

where:
B = Bartlett's statistic
and G = a function of the sample estimate of the kurtosis.

They investigated the behavior of their test and found that for normal, uniform and double exponential distributions it gave empirical probabilities closer to the nominal α than Bartlett's test when the equal variance hypothesis was true. This study was based upon only two hundred samples. Using a three-cell design with six observations per cell, Games et al. (14) found the Box-Andersen test was too liberal for four out of the six distributions considered in their study. Surprisingly, they found the test to be very conservative when sampling from the uniform distribution, whereas Box and Andersen had found their test to be somewhat liberal for this case. Hall (22) also found the Box-Andersen statistic very liberal for all five distributions he used when sampling ten observations per sample from a five-sample design.
The test was found to behave better with twenty-five observations per sample; however, the empirical estimates of type I error rates ranged from .025 to .087 depending upon the distribution sampled. Miller's (42) results were similar to those of Hall. The only difference between the studies of Miller and Hall was the number of samples in the design; Miller chose two samples, and Hall chose five samples.

Bartlett and Kendall (2) suggested a test using ln s², and Scheffé (51) described this test in his classic text. Box (3) further explored the nature of this test, which has been cited in the recent literature as the Box test. The test consists of breaking each of the K samples into subsamples, computing ln s² for each subsample, and doing an ANOVA on the variable ln s² to test equality of the population variances. Bartlett and Kendall warned that the quality of the test may be poor for small cell sizes. Miller (42) looked at the quality of this test for two-sample designs with twenty-five observations per sample. Each sample was subdivided into five subsamples each of size five. For the distributions Miller sampled, the test appeared to be well behaved at α = .05 and α = .01. The largest empirical discrepancy from the nominal α was .009 for the skew double exponential distribution at nominal α = .05.

Games et al. (14) looked at the quality of this test for three-sample designs with eighteen observations per sample and two different subsampling patterns. The first subsampling pattern consisted of subdividing the eighteen observations within a sample into nine subsamples of two each; the second used six subsamples of three each. They considered six different distributions with varying degrees of skew and kurtosis. Both subsampling patterns yielded consistently conservative estimated probabilities of type I errors. The power estimates were considerably higher for the second variation than for the first. Games et al.
concluded that the second variation had reasonable power for all populations sampled from at α = .05.

Moses (44) proposed a nonparametric modification of the Box test. It is simply the Kruskal-Wallis test applied to the values of ln s² obtained from subsamples as in the Box test. For the two-sample situation with twenty-five observations per cell, Miller (42) found that with subsamples of size five in each sample, the Moses test was somewhat conservative in its estimates of type I error rates and was consistently less powerful than Box's test. For the five-sample design with twenty-five observations per sample, Hall (21) obtained results similar to those of Miller. Winslow and Arnold (63) used two-sample designs with equal sample sizes of 6, 12, 18 and 24 with different subset sizes. They sampled from the normal distribution and obtained somewhat less conservative results.

Miller (42) proposed a jackknife technique to test variance equality in the two-sample design, and Layard (34) extended this test to the K-sample design. This test involves computing a one-way ANOVA on the $U_{ij}$'s where:

$$U_{ij} = n_i \ln s_i^2 - (n_i - 1) \ln s_{i(j)}^2$$

where:
$n_i$ = the number of observations in cell i
and $s_i^2$ = the sample variance in cell i
and

$$s_{i(j)}^2 = \frac{\sum_{k \ne j} \left(x_{ik} - \bar{x}_{i(j)}\right)^2}{n_i - 2}$$

where:

$$\bar{x}_{i(j)} = \frac{\sum_{k \ne j} x_{ik}}{n_i - 1}$$

Miller found that for two-sample designs with twenty-five observations per sample, the empirical estimates of α = .05 ranged from .029 to .082 depending on the distribution sampled; with ten observations per cell the estimates ranged from .020 to .126. The power was consistently higher than that obtained using Box's test. In a nearly identical study using five samples, Hall obtained almost identical results. Layard's (34) results were similar when four samples were observed.

Burr and Foster (12) introduced the statistic:

$$Q = \frac{\sum s_K^4}{\left(\sum s_K^2\right)^2}$$

where:
$s_K$ = the standard deviation in sample K.

Games et al. (14) found it to be a very unstable estimator of type I error rates.
For α = .05, the empirical estimates ranged from .001 to .371, depending on the distribution sampled.

The average empirical estimates of type I error rates for twelve different K-sample tests based on the Monte Carlo studies cited in Table 2-1 are displayed in Table 2-2 for a series of distributions with differing skewness and kurtosis. It can be observed that the only statistics which have reasonable estimates of type I error rates for all the distributions considered are Box's test and Moses' test. In the studies of Hall (22) and Miller (42), Box's test was always a better estimator of type I error rate and was always more powerful than was Moses' test.

In addition to the previously described tests is Levene's test (36), which is a focal point of this study.

Levene's Test of Scale

In 1960, Levene (36) suggested a new technique for testing for equal variability in K populations. His test consists of an ANOVA on the absolute deviations of the observations from their sample means. One aspect of his study was a Monte Carlo investigation of the empirical probability of obtaining a significant F-ratio under the null hypothesis: σ₁ = σ₂ = . . . = σ_K. He restricted his study to two-, four- and ten-sample designs with equal sample sizes of ten or twenty. The distributions sampled from were the normal, the uniform, and the double exponential.

[Table 2-2. Average empirical estimates of type I error rates for twelve K-sample tests, by distribution sampled, based on the Monte Carlo studies found in the literature. The table body is not legible.]

When Levene's statistic is calculated, an ANOVA is performed on the absolute deviations ($z_{ij}$) from the sample means:

$$z_{ij} = \left| x_{ij} - \bar{x}_{i\cdot} \right|$$

The assumptions necessary to use the F distribution with the ANOVA model are:

1. The observations are drawn from normally distributed populations.
2. The observations represent random samples from the populations.
3. The variances of the populations are equal.

If all samples are drawn randomly from identical populations, then assumptions (2) and (3) are met for the $z_{ij}$ scores; however, the first assumption is not likely to be met. The $z_{ij}$ scores are unlikely to be normally distributed. Their distribution will be a function of the distribution of the $x_{ij}$ scores.

What then will be the effect of the violation of the first assumption? Glass et al. (18) thoroughly review the consequences of the failure to meet this assumption. In summary, they state that skewed populations have very little effect on either the level of significance or the power of the fixed-effects ANOVA F-test. With respect to kurtosis, they indicate that when the populations sampled from are leptokurtic, the actual α is less than the nominal α.
The opposite is true if platykurtic distributions are sampled from. However, they indicate that these effects are slight. The actual power is less than the nominal power when the populations sampled from are platykurtic, but is greater when the populations sampled from are leptokurtic. These effects can be substantial for small n's.

With these generally fine properties of ANOVA, why then should the Kruskal-Wallis extension and the Normal Scores extension of Levene's test be considered? There are three reasons:

1. They make much less restrictive assumptions with respect to the type of distribution sampled from.
2. They have been shown to be competitive with the F test with respect to power.
3. They are generally somewhat conservative tests (recall the liberal nature of Levene's test).

These claims will now be documented.

The Kruskal-Wallis technique tests the null hypothesis that the K samples come from identical populations with respect to averages. Valid use of the technique requires that the variable of interest have an underlying continuous distribution and be measured on at least an ordinal level (8). The Normal Scores tests consider the same hypothesis and make the same assumptions.

It should be noted that there are three forms of the Normal Scores test. They are the Bell-Doksum test, the Terry-Hoeffding test and the Van der Waerden test (8). The Bell-Doksum test has the disadvantage that the test statistic depends upon the particular random normal deviates selected. Two people may both use this test to analyze the same set of data and get different results. The Terry-Hoeffding test requires the use of special tables of expected normal order statistics which are conveniently tabled only for N.150. The Van der Waerden test uses the r/(n+1) quantile of a standard normal random variable as the replacement for the score with rank r. The quantiles are widely tabled and can be approximated by interpolation in standard normal tables if desired.
This is the type of test which is used in this study.

The relative power of ANOVA, the Kruskal-Wallis test and the Normal Scores test has been investigated asymptotically and for small samples. Hodges and Lehmann (27) show that the Pitman asymptotic efficiency of the Kruskal-Wallis test with respect to the ANOVA F test is greater than or equal to .864 irrespective of the distribution sampled from. Likewise, the Pitman efficiency of the Normal Scores test with respect to the ANOVA F test is always greater than or equal to 1. They also show that the efficiency of the Kruskal-Wallis test with respect to the Normal Scores test is somewhere between 0 and 6/π = 1.91, depending on the distribution sampled from. Their proofs are for the two-sample case.

While it is reassuring that the Kruskal-Wallis test and the Normal Scores test are asymptotically competitive with the ANOVA F test, the focus of this study is on situations involving small sample sizes.

Klotz (32, p. 631) studied the small sample power and efficiency of the one-sample Wilcoxon and Normal Scores tests for normal shift alternatives when N ≤ 10. He concluded: "Because of the extremely high efficiency of the nonparametric tests relative to the t in the region of interest, it is the author's opinion that the nonparametric tests would be preferred to the t in almost all practical situations."

Vander Laan and Oosterhoff (59) compared the Wilcoxon and the Normal Scores test for the two-sample case of equal samples of size six. They sampled from normal distributions and found, for various normal-shift alternatives at α = .05, that the Normal Scores test was almost always slightly more powerful than the Wilcoxon test and was always slightly less powerful than the t test. Vander Laan (58) also compared the exact powers using two-sample designs with samples of size six when sampling from exponential distributions and also from uniform distributions.
In both cases he got results consistent with those obtained when sampling from normal distributions.

Neave and Granger (46, p. 509), when sampling twenty or forty observations from normal or from bimodal asymmetric distributions, concluded:

    Over the range of situations investigated, the normal scores test gave the most satisfactory results, followed closely by the Wilcoxon rank-sum test. Even when the populations were normally distributed, these tests were only slightly inferior to the t test and naturally were much superior in the cases of non-normal populations.

Leaverton and Busch (35), sampling from normal distributions using equal sample sizes of 4, 7, 9, 11, 13 and 25, found that the Wilcoxon test compares favorably to the t even for small samples. Using their power curves as reference, they question the widespread use of the ordinary two-sample t test.

McSweeney and Penfield (39) provided tables comparing the power of the Kruskal-Wallis test and the Normal Scores test for three-sample designs with equal sample sizes of 5, 6, 8, 10 and 12. When sampling from uniform distributions, the Normal Scores test displayed somewhat greater empirical power than the Kruskal-Wallis test. However, when samples were taken from normal distributions, the Normal Scores test was perhaps slightly less powerful than the Kruskal-Wallis test.

The fact that the Kruskal-Wallis test and the Normal Scores test are somewhat conservative for small sample situations has been shown in a series of Monte Carlo studies. Kruskal and Wallis (33) first found their statistic to be slightly conservative for small n. Gabriel and Lachenbruch (13) confirmed this when sampling from three-sample designs using various small equal sample sizes. They found the test to be somewhat conservative for almost all of the cases they considered for α = .10, .05 and .01.

McSweeney (38) sampled from three-sample designs with equal sample sizes of 5, 6, 8, 10 and 12.
She indicated that the chi-square approximation was good although slightly conservative for both the Kruskal-Wallis test and the Normal Scores test. These results (39, p. 187) seemed to hold when sampling from either normal or uniform distributions. The Normal Scores test usually was more conservative than was the Kruskal-Wallis test.

Neave and Granger (46, p. 513) observed the same type of results when using normal or bimodal distributions. They found in addition that the Normal Scores test using inverse scores rounded to one decimal place of accuracy gave results as good as those having four places of accuracy.

Summary

Twelve K-sample tests of scale which have been suggested since 1937 have been examined. These tests have been compared, a few at a time, in recent Monte Carlo studies. With the exception of Box's test and Moses' test, all have been shown to be markedly liberal when sampling from certain types of distributions. For all cases considered, Box's test has been shown to be relatively weak when compared to many of the other tests in situations for which the other tests were designed. Levene's z test was found to be slightly liberal for most distributions considered but was extremely liberal when sampling from a leptokurtic distribution with extreme skew. The test appeared to have satisfactory power, with empirical efficiencies relative to Bartlett's test ranging from .83 to .90.

Levene's test involves an ANOVA on the absolute deviations from the sample means. Two other nonparametric tests which are competitors to ANOVA were examined. Both the Kruskal-Wallis test and the Normal Scores test were shown to be competitive with the ANOVA F test with respect to power against shift alternatives while being somewhat more conservative than ANOVA.
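The three competing statistics can be sketched in a few lines. The following is a minimal illustration, not the program used in the study: the function names are hypothetical, the Normal Scores statistic is taken in its Van der Waerden form, and both rank statistics are written as (N-1) times the between-cell sum of squares of the scores over their total sum of squares, which reduces to the usual Kruskal-Wallis H when the scores are ranks.

```python
from statistics import NormalDist, mean

def abs_deviations(groups):
    """z_ij = |x_ij - mean_i|, taken from each cell's own sample mean."""
    out = []
    for g in groups:
        m = mean(g)
        out.append([abs(x - m) for x in g])
    return out

def anova_f(groups):
    """One-way fixed-effects ANOVA F ratio."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    means = [mean(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def midranks(values):
    """Ranks 1..N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_score_statistic(groups, score):
    """(N - 1) * between-cell SS / total SS, computed on scores of the ranks."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    scores = [score(r, n_total) for r in midranks(pooled)]
    centre = mean(scores)
    total_ss = sum((a - centre) ** 2 for a in scores)
    between_ss, start = 0.0, 0
    for g in groups:
        cell = scores[start:start + len(g)]
        between_ss += len(g) * (mean(cell) - centre) ** 2
        start += len(g)
    return (n_total - 1) * between_ss / total_ss

def levene(groups):
    """Levene's statistic: ANOVA F on the absolute deviations."""
    return anova_f(abs_deviations(groups))

def kruskal_wallis_extension(groups):
    """Kruskal-Wallis H computed on the absolute deviations."""
    return rank_score_statistic(abs_deviations(groups), lambda r, n: r)

def normal_scores_extension(groups):
    """Van der Waerden statistic on the absolute deviations (assumed form)."""
    inv = NormalDist().inv_cdf
    return rank_score_statistic(abs_deviations(groups),
                                lambda r, n: inv(r / (n + 1)))
```

Under these assumptions Levene's F would be referred to the F distribution with K-1 and N-K degrees of freedom, and the two rank statistics to the chi-square distribution with K-1 degrees of freedom.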
When the Kruskal-Wallis test or the Normal Scores test is used in place of Levene's ANOVA technique, perhaps the conservative nature of these nonparametric techniques might counter the liberal nature of Levene's ANOVA to produce a test statistic which has reasonable type I error rates and relatively good power.

CHAPTER III

THE DESIGN

The questions answered by this study were: For the Kruskal-Wallis extension of Levene's test and the Normal Scores extension of Levene's test,

1. How do the nominal type I error rate and the empirical type I error rate correspond?
2. How powerful are these tests with respect to Levene's test?
3. For the two-sample case, how powerful are these tests with respect to the traditional parametric test F = s₁²/s₂²?

There are a series of concerns which must be addressed when speaking to these questions. Among them are:

1. How many samples should be chosen for each simulation?
2. Which alpha levels will be considered?
3. From which distributions should the $X_{ij}$ come?
4. How many levels of the independent variable will be considered?
5. What should the cell sizes be?
6. Which alternative hypotheses should be selected when looking at relative power?

For each of the cases to be mentioned, a simulated analysis was repeated one thousand times and the number of rejections was counted using a series of commonly employed alpha levels. The alpha levels considered were .10, .05, and .01, since these are the most common levels selected by researchers.

The use of one thousand repetitions somewhat compensates for the disturbing effects of random sampling. The standard errors associated with the empirical estimates of type I error rates are approximately .00949 for α = .10, .00689 for α = .05 and .00315 for α = .01. The standard error associated with a given power estimate is always less than .0158. The error rate estimates and power estimates obtained are therefore reasonably close to the true values.
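The standard errors quoted above are consistent with the binomial formula sqrt(p(1 - p)/n) with n = 1000 repetitions. A minimal check, in which the function name `binomial_se` is only illustrative:

```python
from math import sqrt

def binomial_se(p, reps=1000):
    """Standard error of an empirical proportion based on `reps` independent trials."""
    return sqrt(p * (1 - p) / reps)

# Type I error estimates: p is the nominal alpha.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: SE = {binomial_se(alpha):.5f}")

# Power estimates: the standard error is largest when the true power is .5.
print(f"upper bound for any power estimate: {binomial_se(0.5):.4f}")
```

The bound of .0158 for power estimates follows because p(1 - p) is greatest at p = .5.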
The distributions considered to test error rates and power were the normal, the uniform and the exponential. It is known that tests for variance are sensitive to nonzero kurtosis. The normal distribution, with zero kurtosis, was selected in order that direct comparisons might be made with established parametric tests. The uniform distribution was selected to represent extreme flatness (platykurtosis) and the exponential distribution was selected to represent extreme peakedness (leptokurtosis) and extreme skew. By the selection of distributions from both ends and the middle of the kurtosis spectrum, the results should apply to most distributions of practical utility in research. Following are diagrams of the distributions of X and of |X - μ| for the above selections.

[Diagrams: distribution of X and distribution of |X - μ| for the normal, uniform and exponential distributions.]

Reasonably good fit of nominal type I error rates and empirical type I error rates would be expected if indeed the distribution sampled from was the distribution of absolute deviations from the population mean. In fact the uniform situation has been considered in a somewhat different application in the McSweeney-Penfield (39) study. The empirical estimates were found to be slightly conservative for both the Kruskal-Wallis and the Normal Scores test. Unfortunately, their findings are not applicable in this study, since it is concerned with absolute deviations from the sample means, not the population means. The shapes of these |X - X̄| distributions are unknown.

Deviating from the sample means rather than the population means, especially when sampling from distributions with a heavy skew such as the exponential, might result in increased variability in the sampling distribution of the statistic. This seems especially likely in situations involving small cell sizes.
Because of the large amount of time used by computers in ranking procedures, the scope of the questions asked was limited to situations where the total N was relatively small. The following situations were considered.

Case No.                                     1   2   3   4   5   6   7   8   9
No. of levels of the independent variable    2   2   2   3   3   3   4   4   4
Cell size                                    9  12  18   6   8  12   4   6   9
Total N                                     18  24  36  18  24  36  16  24  36

The nine cases were selected for the following reasons:

1. Since for each set of levels of the independent variable the cell sizes increase, the relationship between cell size and power could be examined.
2. Since in cases 1 and 9, cases 2 and 6, and cases 4 and 8 the cell size remains constant as the number of cells increases, some evidence was available to indicate the relationship between the number of cells and the power for a given cell size.

For all nine cases, the normal, uniform and exponential distributions were observed when comparing Levene's test, the Kruskal-Wallis extension of Levene's test and the Normal Scores extension of Levene's test for nominal-empirical error rate agreement and for relative power. In addition, for cases 1, 2 and 3, these three tests were compared with the standard parametric F test where F = s₁²/s₂².

Two situations were considered:

(a) σ₁ = σ₂ = . . . = σ_K = 1 and
(b) σ₁ = σ₂ = . . . = σ_{K-1} = 1, σ_K = 2.

The first situation permitted study of the nominal-empirical type I error rate question. The second situation gave some evidence as to the relative power of the different tests when the slippage alternative is used. This is an alternative frequently used in Monte Carlo studies and was chosen for convenience.

A total of 9 x 3 x 2 = 54 computer simulations of one thousand samples each were made. (In order to use computer time efficiently, both situations were observed using the same set of one thousand observations.)
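One such simulation can be sketched as follows. This is only an illustration under stated assumptions, not the original FORTRAN program: Levene's F stands in for any of the statistics, the critical value is supplied by the caller, and reusing each null sample with its last cell doubled is one plausible reading of how both situations shared the same observations. The names `levene_f` and `rejection_rates` are hypothetical.

```python
import random
from statistics import mean

def levene_f(groups):
    """ANOVA F ratio computed on absolute deviations from each cell mean."""
    z = []
    for g in groups:
        m = mean(g)
        z.append([abs(x - m) for x in g])
    n_total = sum(len(c) for c in z)
    cell_means = [mean(c) for c in z]
    grand = sum(sum(c) for c in z) / n_total
    ss_between = sum(len(c) * (m - grand) ** 2 for c, m in zip(z, cell_means))
    ss_within = sum((v - m) ** 2 for c, m in zip(z, cell_means) for v in c)
    return (ss_between / (len(z) - 1)) / (ss_within / (n_total - len(z)))

def rejection_rates(draw, cells, cell_size, critical_value, reps=1000, seed=1):
    """Return (empirical alpha, empirical power) from one set of samples."""
    rng = random.Random(seed)
    null_hits = alt_hits = 0
    for _ in range(reps):
        sample = [[draw(rng) for _ in range(cell_size)] for _ in range(cells)]
        # Situation (a): every cell has sigma = 1.
        null_hits += levene_f(sample) > critical_value
        # Situation (b): doubling the last cell's scores doubles its sigma.
        slipped = sample[:-1] + [[2 * x for x in sample[-1]]]
        alt_hits += levene_f(slipped) > critical_value
    return null_hits / reps, alt_hits / reps

# Case 1 (two cells of nine) with normal scores; 4.49 is the tabled
# upper 5 percent point of F with 1 and 16 degrees of freedom.
alpha_hat, power_hat = rejection_rates(
    lambda rng: rng.gauss(0.0, 1.0), cells=2, cell_size=9, critical_value=4.49
)
print(alpha_hat, power_hat)
```

Swapping `rng.gauss(0.0, 1.0)` for `rng.random()` or `rng.expovariate(1.0)` would give uniform or exponential scores; note that these have unit variance only in the exponential case, so the uniform draw would need rescaling to match situation (a) exactly.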
The results of the cases involving three cells, in which the uniform distribution was considered, allowed a comparison with the McSweeney-Penfield study. They considered the relative quality of the Kruskal-Wallis test and the Normal Scores test when sampling from normal and uniform distributions.

When computing Levene's statistic and the statistics associated with the two extensions, absolute deviations of observations from their cell means are calculated. If, when sampling from the uniform distribution, these deviations were taken from population means, the distribution of deviations would also be uniform. The study of the two extensions would then be identical with that of McSweeney and Penfield. However, in this study deviations were taken from sample means, and these deviations are not uniformly distributed. By comparison of the two studies, the effect of deviating from sample means rather than population means could be observed.

An additional (case 4) exponential situation was run in which the absolute deviations were made from the population means, to explore further the effect of high skew and kurtosis on absolute deviations from sample means.

One other (case 4) exponential situation was run in which the absolute deviations were made from the sample medians. It was felt that the median would be affected less than the mean by an extreme observation, and thus have less effect on the resulting ranks. The empirical alpha level might then coincide more closely with the nominal alpha level. This last experiment was repeated using the normal distribution. Miller's (42) finding that a median-deviated Levene-type statistic was asymptotically distribution free provided added incentive to examine the small sample properties of this statistic.

CHAPTER IV

THE GENERATORS USED

In this chapter the different generating procedures used in the study and the tests to which they were subjected are described.
The following needed to be generated: (1) pseudorandom numbers between zero and one, (2) uniform random variates, (3) exponential random variates, (4) normal random variates, (5) inverse normal scores and (6) some F values.

Generation of Pseudorandom Numbers

In this study many thousand uniform, exponential and normal random variates needed to be generated. The quality of the procedures in all three of these situations depended upon a program which generated highly dependable pseudorandom numbers. It has been stated that:

    . . . an acceptable method for generating random numbers must yield sequences of numbers which are (1) uniformly distributed, (2) statistically independent, (3) reproducible and (4) nonrepeating for any desired length. Furthermore, such a method must also be capable of (5) generating random numbers at high rates of speed, yet (6) requiring a minimum amount of computer memory capacity. (45, p. 46)

There were three possible routes that could have been taken to generate pseudorandom numbers: (1) a manual method, (2) the use of a library table or (3) a computer method. Of these three alternatives, the only one which was feasible for a study of this magnitude was the use of a computer. The computer used was the IBM 360-30 machine at Ferris State College.

The most common methods used to generate random numbers on a digital computer are called congruential methods, of which three types could have been used: (1) the multiplicative method, (2) the additive method or (3) the mixed method. Since the multiplicative congruential method has been found to behave well statistically (29, p. 240), it was selected for use in this study.

The procedure used in generating random numbers using the multiplicative congruential method involves five steps (45, p. 52).

1. Choose any odd number as a starting value $n_0$.
2. Choose a value for a. This value should be close to ($2^{b/2} \pm 3$), where b is the number of bits in the largest possible integer using FORTRAN.
With the compiler used, b = 32, so that ($2^{16} + 3$) was the choice made for a. This choice minimizes the first-order serial correlation between the pseudorandom numbers.
3. Compute ($a \cdot n_i$) using fixed point integer arithmetic. For the first number generated, i = 0. The product will consist of 2b = 64 bits. The 32 low-order bits represent $n_{i+1}$, as the integer multiplication instruction in FORTRAN automatically discards the high-order 32 bits.
4. Calculate $r_{i+1} = n_{i+1}/2^{32}$ to obtain a uniformly distributed variate on the unit interval.
5. Increment i by 1 and repeat (3) and (4). This cycle is continued until i reaches N, the number of random numbers desired.

Some preliminary testing of the multiplicative congruential generator was performed. To test for uniformity, ten thousand numbers were generated by this technique. The numbers generated were sorted into twenty categories of size .05 each. A chi-square test of goodness of fit to the uniform distribution yielded a χ² = 18.184 which, when referred to the chi-square distribution with 19 degrees of freedom, indicated no reason to disbelieve that uniformly distributed numbers were being generated (.50 [...]

[...] .5, let $z = \sqrt{-2 \ln (1 - u)}$. Use

$$z - \frac{a_0 + a_1 z + a_2 z^2}{1 + b_1 z + b_2 z^2 + b_3 z^3}$$

with the constants $a_j$ (j = 0, 1, 2) and $b_j$ (j = 1, 2, 3) defined as before to obtain the desired approximation.

Generation of F Values

When the two-sample variance ratio test or Levene's test is used, the sample test statistic is compared with a value of the F distribution with m and n degrees of freedom. For many values of m and n, tabled values of the F distribution are readily available. In this study the available values were used whenever possible; however, many of the m and n pairings necessary were not tabled for the α level desired.

A computer program to obtain either F values or their associated probabilities was written in FORTRAN by Clark Holloway and W. B. Capp in 1959 and revised by R. J. McKelvey in 1961 (28).
Its use was suggested by Linda Glendening (20). This program works in either of two directions: if m, n and F are entered, the corresponding probability value p is calculated; if m, n and p are entered, the corresponding F value is calculated. The second option was desirable in this study; however, the time taken to calculate a given F when running in background on the IBM 360-30 computer was close to 30 minutes in the three runs attempted.

The first direction was found to run many times faster than the second. Consequently, a main program was written in which values of m, n and an F which was known to be somewhat smaller than the desired F value were entered. The Holloway-Capp-McKelvey program was used as a subroutine. The entered F value was repeatedly incremented in steps of .10 until a p higher than desired was obtained. At this point, the last low estimate was incremented in steps of .01. The procedure was repeated with the final increment of .001. This gave F value estimates to three decimal places.

To test the quality of this procedure, four known tabled values of F were subjected to it. In all cases the calculated F value was identical to the value found in the tables.

CHAPTER V

THE RESULTS

The basic questions of this study are concerned with the relative quality of four different tests of homogeneity of variance: (1) the standard F ratio, (2) Levene's test, (3) a Kruskal-Wallis extension of Levene's test and (4) the Normal Scores extension of Levene's test. Which test has the best correspondence between nominal type I error rate and empirical type I error rate? Which test has the best ability to correctly reject the null hypothesis when it is appropriate to do so?

Three different distributions were considered, and for each distribution comparisons were made between nominal and empirical type I error rates. The nominal error rates used were α = .10, α = .05 and α = .01.
The distributions considered were the normal, the uniform and the exponential distribution. In this chapter, the results of the study are reported first for the normal distribution, then for the uniform distribution and finally for the exponential distribution. These results suggested a modification using absolute deviations from the median rather than the mean. Two simulations were performed using this modification; the results of these simulations are reported. The chapter ends with a short summary of the results.

Sampling from Normal Distributions

The results of the simulation using random samples from the normal distribution are presented in two tables. Displayed in Table 5-1 is the empirical type I error rate for the four tests using one thousand random samples from normal distributions with σ1 = σ2 = ... = σK = 1. The standard errors associated with the estimates for the null case are approximately .00949 for α = .10, .00689 for α = .05 and .00315 for α = .01. The power of the four tests using random samples from normal distributions with σ1 = σ2 = ... = σK-1 = 1, σK = 2, is displayed in Table 5-2. The size of the standard error associated with a given power estimate is always less than .0158.

It can be observed (Table 5-2) that, for a fixed number of cells, as the cell size increased the power significantly increased. For a fixed cell size, as the number of cells increased, Levene's test always became significantly more powerful. This relationship did not seem to hold for either extension of Levene's test, as sometimes the power increased and sometimes it decreased with an increasing number of cells. When considering the extensions, 83 percent of these increases or decreases were within the .95 confidence interval of ordinary sampling variability.

TABLE 5-1.--Empirical estimates of type I error rates using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from normal distributions with σ1 = σ2 = ... = σK = 1 (tabled values illegible)

TABLE 5-2.--Empirical estimates of power using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from normal distributions with σ1 = σ2 = ... = σK-1 = 1, σK = 2 (tabled values illegible)

The average empirical estimates of α are presented in Table 5-3. Averages over the various cell sizes were used to condense the larger tables to a manageable size. By averaging, the extremes become somewhat obscured.

First consider the two-cell designs. The three alternatives to the standard F test appear to be somewhat liberal in their estimates of nominal α. For all three nominal α levels considered, Levene's test was the most liberal with the Normal Scores extension the least liberal.
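The standard errors quoted earlier in this section are consistent with treating each rejection rate as a binomial proportion estimated from one thousand replications. The short sketch below reproduces them; the function name is hypothetical, and the binomial formula is an assumption about how the figures were obtained, though it matches the reported values to the digits given.

```python
import math


def proportion_se(p, replications=1000):
    """Standard error of a rejection rate p estimated from independent replications."""
    return math.sqrt(p * (1.0 - p) / replications)


# Null case: the true rejection rate is taken to be the nominal alpha.
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(proportion_se(alpha), 5))

# Power case: the standard error is largest when the true power is .50.
print(round(proportion_se(0.50), 4))
```

The bound for power estimates comes from the fact that p(1 - p) is maximized at p = .50.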
The standard F test not only gives the best empirical estimate of type I error rates, it also has the highest power (Table 5-4). Levene's test is more powerful than either of the extensions, which are approximately equally powerful.

To compare Levene's test and its two extensions further, consider the average empirical estimates of type I error rates for the three- and four-cell designs (Table 5-3). As in the two-cell designs, Levene's test is the most liberal and the Normal Scores extension the ...

TABLE 5-3.--Averaged empirical estimates of type I error rates for the four tests considered when sampling from normal distributions (tabled values illegible)

TABLE 5-4.--Averaged empirical estimates of power for the four tests considered when sampling from normal distributions (tabled values illegible)

TABLE 5-6.--Empirical estimates of type I error rates using the F test, Levene's ... (caption truncated; tabled values illegible)

TABLE 5-7.--Empirical estimates of power using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from uniform distributions with σ1 = σ2 = ... = σK-1 = 1, σK = 2 (tabled values illegible)

TABLE 5-8.--Averaged empirical estimates of type I error rates for the four tests considered when sampling from uniform distributions (tabled values illegible)

TABLE 5-9.--Averaged empirical estimates of power for the four tests considered when sampling from uniform distributions (tabled values illegible)

... the Normal Scores extension the least. Nineteen times out of twenty-seven the Normal Scores extension estimate was closer to nominal α than the Kruskal-Wallis extension estimate. Twenty-two times it was closer than the Levene estimate. It is clear that for the situations involving sampling from the uniform distribution, the Normal Scores extension provides better estimates of nominal α than does either Levene's test or the Kruskal-Wallis extension. This is also borne out when considering the average distance that the empirical estimates are from the nominal α.

TABLE 5-10.--Average distance of the empirical estimates of α from the expected value of α for all designs considered when sampling from uniform distributions

Nominal α    Levene's Test    Kruskal-Wallis Extension    Normal Scores Extension
.100         .036             .036                        .028
.050         .027             .020                        .014
.010         .009             .003                        .003

As happened when samples were chosen from normal distributions using three- and four-cell designs, Levene's test was the most powerful of the three tests considered (Table 5-9), and the Normal Scores extension was slightly more powerful than the Kruskal-Wallis extension.

The empirical type I error rates obtained when sampling from the uniform distribution for the Kruskal-Wallis extension and the Normal Scores extension can be observed in Table 5-11. Both statistics were calculated on the basis of observations deviated from their respective population means and from sample means. When deviations were taken from the sample means, the estimates of type I error rates for α = .10 and α = .05 were quite high. This indicated that the sample mean is not a robust estimator of the population mean, especially for the smaller sample sizes.

TABLE 5-11.--Empirical estimates of type I error rates obtained when sampling from uniform distributions for the Kruskal-Wallis extension and the Normal Scores extension. Deviations are taken from (A) population means* and (B) sample means

                     α = .10         α = .05         α = .01
Cell Size            K       N       K       N       K       N
6, 6, 6        A     .084    .084    .037    .036    .004    .004
               B     .133    .131    .064    .059    .009    .010
8, 8, 8        A     .095    .094    .045    .041    .006    .005
               B     .137    .128    .069    .067    .006    .005
12, 12, 12     A     .107    .102    .047    .045    .009    .008
               B     .114    .106    .058    .050    .005    .004

* Obtained from McSweeney-Penfield (38, p. 187) study.

Sampling from Exponential Distributions

The results of the simulation using random samples from the exponential distribution are presented in two tables. Displayed in Table 5-12 is the empirical type I error rate for the four tests using one thousand random samples from exponential distributions with σ1 = σ2 = ... = σK = 1. The power of the four tests using random samples from exponential distributions with σ1 = σ2 = ... = σK-1 = 1, σK = 2, is displayed in Table 5-13.
For all four tests used, the empirical estimates of α are very liberal. The poor nature of the fit is summarized in Table 5-14. When 100 rejections are expected, the closest any test comes is 243 rejections. With 50 expected, the closest is 151 and when 10 rejections are expected, the closest is 51. With a fit this poor, it makes little sense to consider the question of relative power.

Agreement Rates of Kruskal-Wallis and Normal Scores Extensions

So far gross comparisons of the empirical type I error rates have been made across tests and for each of three distributions. The availability of one extensive set of test results permitted consideration of how the Kruskal-Wallis and Normal Scores extensions dealt with the same data set.

TABLE 5-12.--Empirical estimates of type I error rates using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from exponential distributions with σ1 = σ2 = ... = σK = 1 (tabled values illegible)

TABLE 5-13.--Empirical estimates of power using the F test, Levene's test, the Kruskal-Wallis extension and the Normal Scores extension when sampling from exponential distributions with σ1 = σ2 = ... = σK-1 = 1, σK = 2 (tabled values illegible)

TABLE 5-14.--Average distance of the empirical estimates of α from the expected value of α for all designs considered when sampling from exponential distributions

Nominal α    Levene's Test    Kruskal-Wallis Extension    Normal Scores Extension    F Test*
.100         .315             .369                        .354                       .336
.050         .203             .259                        .239                       .243
.010         .074             .088                        .074                       .171

* Only two-cell designs considered.

The number of agreements of the Kruskal-Wallis and Normal Scores extensions can be observed in Table 5-15 for the situation in which the null hypothesis is true and α = .10. The rate of agreement lies between 95.4 percent and 98.9 percent. The same type of information is displayed in Table 5-16 for the situation in which the alternative hypothesis is true. In this case, the rate of agreement lies between 94.3 percent and 97.8 percent.

A Modification

The extremely liberal situation which existed when sampling from the exponential distribution strongly suggests that deviating from the sample means does not produce the most desirable results. The effect a single large value can have on the sample mean produces an instability which strongly affects the deviations.

TABLES 5-15 and 5-16 (tabled values illegible)

TABLE 5-17.--Empirical type I error rate and power obtained by three variations of Levene's test and the Kruskal-Wallis and Normal Scores extensions where a three-cell design (6, 6, 6) was used when sampling from exponential distributions with σ1, σ2 and σ3 as tabled (tabled values illegible)

For this reason, and because an outlier has much less effect on the median than on the mean, another run was made to see if absolute deviations from the median might prove promising. The exponential distribution was used since this is the situation in which all previous attempts had failed. The results of this run are also shown in Table 5-17.

From this one run, it appears that deviating from the sample medians rather than the sample means produces a much better fit of empirical α to the nominal α. This is especially true for Levene's test at the nominal α levels .10 and .05. However, the fit at α = .01 remains poor even though considerably improved from the former case. The power is substantially reduced, even relative to the more conservative population mean deviated scores, when median deviation is used. Hence, the loss of power is not totally a consequence of obtaining a better actual alpha level.

To see if this technique would perform well when sampling from other distributions, a final run was made in which samples were taken from normal distributions. The results are reported in Table 5-18. Once again, the Levene type test performed well. For α = .05 and α = .10, the empirical estimates of type I error rates are no longer liberal. However, as in the case of sampling from exponential distributions, the estimates are not as close if α = .01.

TABLE 5-18.--Empirical type I error rate and power obtained by two variations of Levene's test and the Kruskal-Wallis and Normal Scores extensions where a three-cell design (6, 6, 6) was used when sampling from normal distributions with σ1, σ2 and σ3 as tabled (tabled values illegible)

Summary of Results

Three questions were of concern in this study. The first question was: For the Kruskal-Wallis extension of Levene's test and the Normal Scores extension of Levene's test, how do the nominal type I error rate and the empirical type I error rate correspond? When sampling from normal or uniform distributions with nominal α's of .10 and .05, both extensions gave slightly liberal estimates of the nominal α. The Normal Scores extension consistently gave the better estimate. When the nominal α was .01, the extensions gave, on the average, slightly conservative estimates. At this α level, they seem to have nearly the same quality. The strongest criticism of Levene's test has been its liberal nature. Both extensions are less liberal in their observed type I error rates than Levene's test.

When the exponential distribution was considered, all observed type I error rates were extremely high, no matter which of the four tests was used. This showed the undesirable effect an outlier can have on a sample mean. Since outliers have little effect on a median, two runs were made in which deviations were taken from sample medians rather than sample means. When exponential distributions or normal distributions were used with this technique, a good fit was found for the Levene type test when using the nominal α's of .10 and .05. However, the technique was also too liberal for α = .01.
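The contrast between the sample mean and the sample median under a single extreme value can be illustrated with a small made-up sample. The data and the helper name below are purely illustrative assumptions and are not taken from the study.

```python
from statistics import mean, median


def abs_deviations(sample, center):
    """Absolute deviations of each observation from center(sample)."""
    c = center(sample)
    return [abs(x - c) for x in sample]


clean = [1, 2, 3, 4, 5, 6]          # hypothetical sample without an outlier
with_outlier = [1, 2, 3, 4, 5, 50]  # same sample with its largest value inflated

for label, sample in (("clean", clean), ("outlier", with_outlier)):
    print(label, "mean =", round(mean(sample), 3), "median =", median(sample))
    print("  |x - mean|  :", [round(d, 2) for d in abs_deviations(sample, mean)])
    print("  |x - median|:", [round(d, 2) for d in abs_deviations(sample, median)])
```

In this example the outlier leaves the median unchanged, so only one median-based deviation changes, whereas every mean-based deviation is shifted.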
The second question to be addressed in this study was: How powerful are the Kruskal-Wallis extension and the Normal Scores extension with respect to Levene's test? In view of the poor observed type I error rates for the exponential distributions, the power comparison was made only when sampling from normal or uniform distributions. When deviating from sample means, the Normal Scores extension was found in most cases to be slightly more powerful than the Kruskal-Wallis extension. Both extensions were less powerful than Levene's test.

The third question of concern was: For the two-sample case, how powerful are these extensions with respect to the traditional F test? Although both extensions were more stable across distributions in the observed type I error rate than the traditional variance ratio F test, they have somewhat less power, at least when sampling from normal or uniform distributions.

CHAPTER VI

SUMMARY, CONCLUSIONS AND DISCUSSION

Summary

This study was motivated by the need for a good K-sample test for equal variability. Well over a dozen tests have been suggested in the literature but all suffer from one or more of the following ailments: (1) Poor correspondence between nominal type I error rate and empirical type I error rate if the normality assumption is violated; (2) Low power when this correspondence is good; (3) Lack of post hoc procedures associated with the method.

Of all the tests which have been considered in the recent literature, only two have shown an acceptable correspondence between the nominal and empirical type I error rates, no matter what distribution is sampled from. These are Box's test and Moses' test. Box's test was found to be more powerful than Moses' test for all cases considered. Unfortunately, neither of these procedures is desirable for sample sizes much less than fifteen. This is because both tests involve breaking each of the K samples into subsamples and computing ln s² for each subsample.
Of the remaining tests, one suggested by Levene showed the most promise. Levene's test consists of an analysis of variance on the absolute deviations of the observations from their sample means. The average amount by which the empirical and nominal type I error rates differed was relatively low for this test; however, Levene's test consistently gave somewhat liberal estimates of the nominal alpha. Both the Kruskal-Wallis test and the Normal Scores test of location are known to be somewhat more conservative than is ANOVA for most distributions sampled. Usually little power is sacrificed when performing the Kruskal-Wallis test or the Normal Scores test. It was thought that the conservative nature of these tests might counter the liberal nature of Levene's test to produce a test statistic which was acceptable in its empirical type I error rate. Both the Kruskal-Wallis extension and the Normal Scores extension of Levene's test would hopefully have power comparable to that of Levene's test. Thus, the thrust of this study was a Monte Carlo investigation of the properties of both the Kruskal-Wallis extension and the Normal Scores extension of Levene's test for equal variability.

Two, three and four levels of the independent variable were considered for various small sample sizes. The properties of the test statistics were observed when sampling from normal, uniform and exponential distributions. For each of the cases a simulated analysis was repeated one thousand times and the number of rejections of the null hypothesis of equal variability was counted using a series of commonly employed nominal alpha levels. Both nominal-empirical alpha level fit and power were of concern in this study.

When samples were taken from either uniform or normal distributions, the results were in the direction predicted. Both extensions of Levene's test produced better estimates of type I error rate than did Levene's test.
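To make the construction of the statistics concrete, the sketch below computes Levene's statistic (a one-way ANOVA F ratio on absolute deviations from the sample means) and a Kruskal-Wallis H on the same absolute deviations. The function names are hypothetical, the H statistic is computed with midranks and without a tie correction (an assumption), and the Normal Scores extension is omitted because its exact scoring is not specified in this section. Each statistic would then be referred to its usual critical value.

```python
from statistics import mean


def abs_dev_from_mean(groups):
    """Replace each observation by its absolute deviation from its sample mean."""
    result = []
    for g in groups:
        m = mean(g)
        result.append([abs(x - m) for x in g])
    return result


def anova_f(groups):
    """One-way ANOVA F ratio (between mean square over within mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [mean(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H using midranks for ties, with no tie correction."""
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    total = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)


samples = [[1, 2, 3], [2, 4, 6], [1, 5, 9]]  # made-up data for illustration
z = abs_dev_from_mean(samples)
print("Levene F:", anova_f(z))
print("Kruskal-Wallis extension H:", kruskal_wallis_h(z))
```

Both statistics start from the same transformed scores; only the final location test applied to those scores differs.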
The Normal Scores extension proved to give generally better estimates of type I error rate and power than did the Kruskal-Wallis extension. For α = .10 and α = .05, both extensions were still somewhat liberal and for α = .01 they were slightly conservative. In these same situations, Levene's test had slightly more power; however, some of the advantage with respect to power may be attributable to the slightly greater liberality of Levene's test.

When the exponential distribution was sampled from, the empirical estimates of α were all very liberal. With this poor quality fit, the power comparisons were meaningless. With the failure of these techniques for the exponential distribution, the good K-sample test for equal variability which is somewhat distribution free had not been found. This failure indirectly suggested the use of a similar but slightly different technique.

The properties of a modified form of Levene's test and of the two extensions were observed in a mini-study. The test statistics were computed as before, with the exception of deviating observations from sample medians rather than sample means. The median deviation technique was employed for the three-sample design with six observations per sample when sampling from normal distributions and when sampling from exponential distributions.

The results of this mini-study were quite encouraging. For α = .10 and α = .05, the empirical type I error rates of the Levene type test were quite close to the nominal level when sampling from either distribution. However, for α = .01 the empirical type I error rates were too liberal. With the exception of Box's test and Moses' test, this is the only time that the type I error rate was well behaved when sampling from either a distribution with high kurtosis or with heavy skew. Although the Levene type test using the median was not without power, it appears to be less powerful than the appropriate conventional tests when sampling from normal distributions.
The Kruskal-Wallis median extension and the Normal Scores median extension of the Levene type test did not fare nearly as well in either their empirical estimates of type I error rate or power.

Findings

The first seven findings result from the main study; the last two findings are from a mini-study performed as a follow-up to the main study. The latter must be considered tentative in light of the limited nature of the mini-study.

When sampling from normal distributions or uniform distributions:

1. Both the Kruskal-Wallis extension and the Normal Scores extension of Levene's test gave better estimates of type I error rates than did Levene's test.

2. In general the Normal Scores extension gave slightly better estimates of type I error rates than did the Kruskal-Wallis extension.

3. Both extensions were generally somewhat liberal in their estimates of type I error rates at α = .10 and α = .05 and somewhat conservative for α = .01.

4. Although Levene's test was more powerful than either extension, a part of this extra power came from the liberal nature of the test.

5. The Normal Scores extension was generally slightly more powerful than was the Kruskal-Wallis extension.

6. The two-sample F ratio was more powerful than Levene's test and its extensions when sampling from normal distributions. However, when sampling from the uniform distributions, Levene's test was more powerful than the F ratio.

When sampling from exponential distributions:

7. Levene's test and the two extensions all gave considerably liberal estimates of type I error rates.

When deviating from the sample medians rather than the sample means, and when sampling from normal or exponential distributions:

8. The Levene type test gave good estimates of type I error rates for α = .10 and α = .05 but was somewhat liberal for α = .01.

9. The two extensions gave poorer estimates of type I error rates than did the Levene test and they were also less powerful.
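The kind of simulation behind findings 7 and 8 can be sketched as follows: draw three samples of six from an exponential distribution under the null hypothesis, compute the Levene type statistic with deviations from either the sample mean or the sample median, and count rejections over one thousand replications. This is a minimal illustration rather than the study's program: the function names, the seed, and the use of Python's built-in generator are assumptions, and 3.68 is the tabled upper 5 percent point of F with 2 and 15 degrees of freedom. Degenerate samples with zero within-group variation are not handled.

```python
import random
from statistics import mean, median

F_CRIT_05_2_15 = 3.68  # tabled upper 5% point of F with 2 and 15 df


def levene_f(groups, center):
    """ANOVA F ratio on absolute deviations from center(sample)."""
    devs = []
    for g in groups:
        c = center(g)
        devs.append([abs(x - c) for x in g])
    k = len(devs)
    n_total = sum(len(d) for d in devs)
    means = [mean(d) for d in devs]
    grand = sum(sum(d) for d in devs) / n_total
    ss_between = sum(len(d) * (m - grand) ** 2 for d, m in zip(devs, means))
    ss_within = sum((z - m) ** 2 for d, m in zip(devs, means) for z in d)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rejection_rate(center, draw, reps=1000, cells=(6, 6, 6),
                   crit=F_CRIT_05_2_15, seed=1975):
    """Proportion of replications in which the statistic exceeds crit."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(reps):
        groups = [[draw(rng) for _ in range(n)] for n in cells]
        if levene_f(groups, center) > crit:
            rejections += 1
    return rejections / reps


def exponential(rng):
    """One draw from an exponential distribution with unit scale."""
    return rng.expovariate(1.0)


print("mean-deviated  :", rejection_rate(mean, exponential))
print("median-deviated:", rejection_rate(median, exponential))
```

Because both calls use the same seed, the two variants are compared on identical simulated samples, which removes one source of noise from the comparison.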
Discussion

This study was originally prompted by the practical need to test a hypothesis of equal variability in a design which had several independent variables. No general multi-factor tests that were insensitive to non-normality could be identified, and those single factor tests which were identified proved to be defective in one or more ways. Nevertheless, some guidance can be given to the researcher who needs a pragmatic answer to the question of how to proceed with his data analysis.

If the sample sizes are reasonably large and of equal size, the researcher can treat the study design as if it consisted of a single factor and can use Box's test. It has been shown to give reasonably good estimates of type I error rate for a variety of distributions when the sample sizes are at least fifteen. This test has post hoc procedures available since it is an ANOVA of ln s². Its primary drawback appears to be the sacrifice in power necessitated by the grouping of observations to form the ln s² which are the units of analysis.

If the researcher should have good evidence that the distributions sampled are normal, then he could use Bartlett's test or Hartley's test in the case of a single factor study, or perhaps Overall and Woodward's test for a multifactor study. The first two tests are considerably more powerful than Box's test. The researcher must realize that these tests are extremely sensitive to violation of the normality assumption and should not be used unless the case for normality is strong.

Other tests could be recommended for specific distributions, but the researcher seldom, if ever, has knowledge of the type of distributions his samples have been drawn from. He could plot his data or calculate the first four moments to get an idea of the type of distribution he might be working with, but this will only give him a vague idea. Therefore, in most situations he should use a test which is relatively distribution free.
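The suggestion of calculating the first four moments can be sketched in a few lines. The function name and the choice of divisor n (rather than n - 1) are assumptions made for illustration; with small samples such estimates are rough, which agrees with the caution expressed above.

```python
from statistics import mean


def sample_moments(xs):
    """Mean, variance, skewness and kurtosis computed from central moments.

    Central moments use divisor n. Kurtosis is the plain ratio m4 / m2**2,
    which equals 3 for a normal distribution, 1.8 for a uniform distribution
    and 9 for an exponential distribution.
    """
    n = len(xs)
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return {
        "mean": m,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


print(sample_moments([1, 2, 3, 4, 5]))  # made-up symmetric sample
```

Skewness near 2 together with kurtosis near 9 would point toward an exponential-like population, the case in which the mean-deviated tests behaved worst in this study.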
Recent literature suggests the use of Levene's test but in this study and others Levene's test has been shown to be too liberal for small sample sizes irrespective of the distribution sampled. The results of this study suggest that the Normal Scores extension of Levene's test would be a more reasonable test to consider. However, even this test is very poor should the populations sampled be exponential in nature.

In the mini-study used as a follow-up to the main study, the median-deviated Levene type test showed considerable promise as being a valid contender. It is the only small sample test which appears to be well behaved at α = .10 and α = .05 for heavily skewed and leptokurtic distributions. The only distributions considered in this study were the exponential and the normal distributions in the case of the three-sample design with six observations per sample.

Shortly after the completion of this study, an article by Brown and Forsythe (6) provided more evidence to support the case of a median-deviated Levene type statistic. They considered balanced and unbalanced two-sample designs with sample sizes of ten, twenty or forty. They sampled from normal, Student's t on 4 degrees of freedom and chi-square on 4 degrees of freedom distributions. For all situations considered, the empirical estimates of α were very good and usually slightly conservative. Although the power of this test was somewhat lower than that of the variance ratio F test, it was competitive with it.

The use of median-deviated scores is supportable not only on the basis of these empirical studies but also on the basis of Miller's (42) analytic work. Miller has proven that Levene's test is asymptotically distribution free if the deviations are taken from the sample medians. This indicates that as n → ∞, the null distribution of the test statistic will be invariant no matter what the distributions sampled look like.
Whether this property holds for small sample sizes is a question for further empirical investigation.

The median-deviated Levene type test appears to be more promising than Box's test. Brown and Forsythe have supplied some evidence that the test is well behaved for unbalanced designs. In Gartside's (15) study, Box's test was shown to be considerably liberal if the design used was not a balanced one.

In light of the encouraging findings in the mini-study, the study of Brown and Forsythe and the analytic work of Miller, it is critical that the properties of this test be further explored. The test should be examined with a series of designs for a wide range of distributions with differing values of skewness and kurtosis.

Although the median-deviated statistic appears promising, other tests of better quality may emerge. Perhaps the use of a 25 percent trimmed mean would provide better estimates of type I error rate or better power than the use of the median. The 25 percent trimmed mean is the mean of the observations remaining after deleting the 25 percent largest and the 25 percent smallest values in that sample. The median could be considered a 50 percent trimmed mean. What trimmed mean would be the best is speculative.

The effect extreme values have on the sample mean seems to be the reason Levene's test is too liberal. Perhaps some transformation of the absolute deviations might reduce the effect of these extreme scores. Levene attempted log z and √z but found both to give even more liberal estimates of type I error rate than the original z's. A transformation which might work wonders with one distribution could well be counterproductive with another. It seems unlikely that there is a transformation which would work well for a wide variety of distributions.

The Jackknife technique proposed by Miller (42) seems to give somewhat liberal estimates for most distributions.
This again could be due to the nonrobust quality of the arithmetic mean as an estimator of central location. Perhaps the Jackknife principle could be extended with a different estimator of central location. If the median were used, this would involve an ANOVA on $v_{ij}$ scores where for observation j in sample i:

(1) $$v_{ij} = n_i \log t_i^2 - (n_i - 1) \log t_{i(j)}^2$$

(2) where $$t_i^2 = \frac{1}{n_i - 1} \sum_{j} (X_{ij} - m_i)^2$$

(3) with $m_i$ the median for all $n_i$ observations

(4) and $$t_{i(j)}^2 = \frac{1}{n_i - 2} \sum_{k \neq j} (X_{ik} - m_{i(j)})^2$$

(5) with $m_{i(j)}$ the median for the $(n_i - 1)$ observations remaining when the jth observation is removed.

Without the use of a computer, this technique would be unbearably tedious to consider. Even with a computer, the large amount of ranking involved would make this a relatively expensive statistic to compute.

The other hope for a good test of variance homogeneity might be an extension of one of the two-sample nonparametric tests of scale. Of those available, Mood's test has better asymptotic relative efficiency than either the Siegel-Tukey test or the Freund-Ansari test (5). However, these three tests have disadvantages in the two-sample case. No exact tables have been prepared for Mood's test so it is referred to the normal distribution; thus the test is probably not too accurate if n is small. But the test fails on more serious grounds than this. For example, if the two sample sizes are equal and the two populations do not overlap, the test is certain to accept the hypothesis of equal variance whether or not it is really true. It has been suggested that this might be corrected by aligning the sample medians, but Bradley (5, p. 120) points out that even "if the influence of unequal locations can be completely and satisfactorily eliminated, it is easy to invent populations having identical medians and identical dispersion indices of a given type but having shapes whose differences would cause the test to reject."
Unfortunately, the same vulnerability to differences in shape applies to the other two nonparametric tests mentioned.

SELECTED BIBLIOGRAPHY

1. Bartlett, M. S. "Properties of Sufficiency and Statistical Tests." Proceedings of the Royal Society of London, Series A, 160 (1937): 268-82.
2. ______, and Kendall, D. G. "The Statistical Analysis of Variance-Heterogeneity and the Logarithmic Transformation." Journal of the Royal Statistical Society, Series B, 8 (1946): 128-38.
3. Box, G. E. P. "Non-normality and Tests on Variances." Biometrika, 40 (1953): 318-35.
4. ______, and Andersen, S. L. "Permutation Theory in the Derivation of Robust Criteria and the Study of Departures from Assumption." Journal of the Royal Statistical Society, Series B, 17 (1955): 1-26.
5. Bradley, J. V. Distribution-Free Statistical Tests. Englewood Cliffs, N.J.: Prentice Hall, Inc., 1968.
6. Brown, M. B., and Forsythe, A. B. "Robust Tests for the Equality of Variances." Journal of the American Statistical Association, 69 (1974): 364-67.
7. Cadwell, J. H. "Approximating to the Distributions of Measures of Dispersion by a Power of χ²." Biometrika, 40 (1953): 336-46.
8. Conover, W. J. Practical Nonparametric Statistics. New York: John Wiley and Sons, Inc., 1971.
9. Coveyou, R. R. "Serial Correlation in the Generation of Pseudo-Random Numbers." Journal for the Association of Computing Machinery, 7 (1960): 72-74.
10. Feder, P. I. "On the Nonrobustness of the Jackknife and Box-Andersen Procedures for Estimating Variances in Small Samples." Paper presented at the Annual Meeting of the American Statistical Association, New York, December, 1973.
11. Fisher, R. A., and Yates, F. Statistical Tables for Biological, Agricultural and Medical Research. New York: Hafner Publishing Company, Inc., 1948.
12. Foster, L. A. "Testing for Equality of Variances." Dissertation Abstracts, 26 (1965): 1060B.
13. Gabriel, K. R., and Lachenbruch, P. A. "Non-Parametric ANOVA in Small Samples: A Monte Carlo Study of the Adequacy of the Asymptotic Approximation." Biometrics, 25 (1969): 593-96.
14. Games, P. A., Winkler, H. B., and Probert, D. A. "Robust Tests for Homogeneity of Variance." Educational and Psychological Measurement, 32 (1972): 887-910.
15. Gartside, P. S. "A Study of Methods for Comparing Several Variances." Journal of the American Statistical Association, 67 (1972): 342-46.
16. Gehan, E. A., and Thomas, D. C. "The Performance of Some Two Sample Tests in Small Samples With and Without Censoring." Biometrika, 56 (1969): 127-32.
17. Glass, G. V. "Testing Homogeneity of Variances." American Educational Research Journal, 3 (1966): 187-90.
18. ______, Peckham, P. D., and Sanders, J. R. "Consequences of Failure to Meet Assumptions Underlying the Fixed Effects Analysis of Variance and Covariance." Review of Educational Research, 42 (1972): 237-88.
19. ______, and Stanley, J. C. Statistical Methods in Education and Psychology. Englewood Cliffs, N.J.: Prentice-Hall, 1970.
20. Glendening, L. Personal Communication.
21. Greenberger, M. "An a Priori Determination of Serial Correlation in Computer Generated Random Numbers." Mathematics of Computations, 15 (1961): 383-89.
22. Hall, I. J. "Some Comparisons of Tests for Equality of Variances." Journal of Statistical Computation and Simulation, 1 (1972): 183-94.
23. Han, C. "Testing the Homogeneity of Variances in a Two-Way Classification." Biometrics, 25 (1969): 153-58.
24. Hartley, H. O. "The Maximum F-Ratio as a Short Cut Test for Heterogeneity of Variance." Biometrika, 37 (1950): 308-12.
25. Hastings, C., Jr. Approximations for Digital Computers. Princeton, N.J.: Princeton University Press, 1955.
26. Hector, M. A. "Evaluation of an Instructional Model for Teaching Counselor Trainees How to Establish Behavioral Objectives in Counseling." Ph.D. dissertation, Michigan State University, 1973.
27. Hodges, J. L., Jr., and Lehmann, E. L. "Comparison of the Normal Scores and Wilcoxon Test." Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley and Los Angeles: University of California Press, 1961.
28. Holloway, C., and Capp, W. B. F-Distribution Generator: A Fortran IV Program. 1959. Revised by McKerey, R. J., 1961.
29. Hull, T. E., and Dobell, A. R. "Random Number Generators." SIAM Review, 4 (1962): 230-54.
30. ______, and ______. "Mixed Congruential Random Number Generators for Binary Machines." Journal for the Association of Computing Machinery, 11 (1964): 31-40.
31. Kirk, R. Experimental Design: Procedures for the Behavioral Sciences. Belmont, Ca.: Wadsworth, 1968.
32. Klotz, J. "Small Sample Power and Efficiency for the One Sample Wilcoxon and Normal Scores Tests." Annals of Mathematical Statistics, 34 (1963): 624-32.
33. Kruskal, W. H., and Wallis, W. A. "Use of Ranks in One-Criterion Variance Analysis." Journal of the American Statistical Association, 47 (1952): 583-621.
34. Layard, M. W. J. "Robust Large Sample Tests for Homogeneity of Variances." Journal of the American Statistical Association, 68 (1973): 195-98.
35. Leaverton, P., and Busch, J. J. "Small Sample Power Curves for the Two Sample Location Problem." Technometrics, 11 (1969): 229-307.
36. Levene, H. "Robust Tests for Equality of Variances." Contributions to Probability and Statistics. Edited by I. Olkin et al. Stanford, Ca.: Stanford University Press, 1960.
37. MacLaren, M. D., and Marsaglia, G. "Uniform Random Number Generators." Journal for the Association of Computing Machinery, 12 (1965): 83-89.
38. McSweeney, M. T. "An Empirical Study of Two Proposed Nonparametric Tests for Main Effects and Interaction." Dissertation Abstracts, 28 (1968): 4005A-4006A.
39. ______, and Penfield, D. "The Normal Scores Test for the c-Sample Problem." The British Journal of Mathematical and Statistical Psychology, 22 (1969): 177-92.
40. Merrington, M., and Thompson, C. M. "Tables of Percentage Points of the Inverted Beta (F) Distribution." Biometrika, 33 (1943): 73-88.
41. Miller, A. J. Letter to the Editor. Technometrics, 14 (1972): 507.
42. Miller, R. G., Jr. "Jackknifing Variances." Annals of Mathematical Statistics, 39 (1968):
43. Mood, A. M. "On the Asymptotic Efficiency of Certain Nonparametric Two-Sample Tests." Annals of Mathematical Statistics, 25 (1954): 514-22.
44. Moses, L. E. "Rank Tests of Dispersion." Annals of Mathematical Statistics, 34 (1963): 973-83.
45. Naylor, T. H., Balintfy, J. L., Burdick, D. S., and Chu, K. Computer Simulation Techniques. New York: John Wiley and Sons, Inc., 1966.
46. Neave, H. R., and Granger, C. W. J. "A Monte Carlo Study Comparing Various Two Sample Tests for Differences in Mean." Technometrics, 10 (1968): 509-22.
47. Neel, J. H., and Stallings, W. M. "A Monte Carlo Study of Levene's Test of Homogeneity of Variance: Empirical Frequencies of Type I Error in Normal Distributions." Paper presented at the American Educational Research Association Convention, Chicago, April, 1974.
48. Neyman, J., and Pearson, E. S. "On the Problem of K Samples." Bulletin de l'Academie Polonaise des Sciences et des Lettres, June 1931, pp. 460-81.
49. Overall, J. E., and Woodward, J. A. "A Simple Test for Heterogeneity of Variance in Complex Factorial Designs." Psychometrika, 39 (1974): 311-18.
50. Pratt, J. W. "Robustness of Some Procedures for the Two-Sample Location Problem." Journal of the American Statistical Association, 59 (1964): 665-80.
51. Scheffé, H. The Analysis of Variance. New York: John Wiley and Sons, Inc., 1959.
52. Shorack, G. R. "Nonparametric Tests and Estimation of Scale in the 2-Sample Problem." Dissertation Abstracts, 26 (1966): 6751B.
53. Shukla, G. K. "An Invariant Test for the Homogeneity of Variances in a Two-Way Classification." Biometrics, 28 (1972): 1063-72.
54. Siegel, S. Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill Book Company, Inc., 1956.
55. ______, and Tukey, J. W. "A Nonparametric Sum of the Ranks Procedure for Relative Spread in Unpaired Samples." Journal of the American Statistical Association, 55 (1957): 429-45.
56. Sytsma, S. Personal Communication.
57. Terry, M. E. "Some Rank Order Tests Which Are Most Powerful Against Specific Parametric Alternatives." Annals of Mathematical Statistics, 23 (1952):
58. Van Der Laan, P. "Exact Power of Some Rank Tests." Publication de l'Institut de Statistique de l'Université de Paris, 13 (1964): 211-34.
59. ______, and Oosterhoff, J. "Monte Carlo Estimation of the Powers of the Distribution-Free Two Sample Tests of Wilcoxon, Vander Waerden and Terry and Comparison of These Powers." Statistica Neerlandica, 19 (1965): 265-75.
60. ______, and ______. "Experimental Determination of the Power Functions of the Two Sample Rank Tests of Wilcoxon, Vander Waerden and Terry by Monte Carlo Techniques-I." Normal Parent Distributions. Statistica Neerlandica, 21 (1967): 55-68.
61. Weber, J. M. "The Heuristic Explication of a Large-Sample Normal Scores Test for Interaction." British Journal of Mathematical and Statistical Psychology, 25 (1972): 246-56.
62. Wheeler, D. J. "An Alternative to an F Test on Variances." Dissertation Abstracts, 31 (1971): 6334B.
63. Winslow, S. S., and Arnold, J. C. "A Stochastic Simulation Study of a Rank-Like Test for Dispersion Which Is Distribution-Free Under Location Differences." Journal of Statistical Computation and Simulation, 1 (1972): 315-29.