This is to certify that the thesis entitled "A Monte Carlo Evaluation of Ridge Regression as an Alternative to Ordinary Least Squares," presented by Bryan Walter Coyle, has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor

Date

A MONTE CARLO EVALUATION OF RIDGE REGRESSION AS AN ALTERNATIVE TO ORDINARY LEAST SQUARES

By

Bryan Walter Coyle

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1979

ABSTRACT

A MONTE CARLO EVALUATION OF RIDGE REGRESSION AS AN ALTERNATIVE TO ORDINARY LEAST SQUARES

By Bryan Walter Coyle

This study investigated a proposed modification of ordinary least squares (OLS) multiple regression. Conventional OLS is generally used to combine the information present among a set of variables so as to optimize the prediction of a criterion variable in the original sample and to provide an equation for use in subsequent samples without the necessity of re-estimation. In addition, the predictor weights estimated are frequently used to infer the functional characteristics of the system which produced the data. Hoerl and Kennard (1970a) have suggested deliberately introducing a statistical bias into the OLS estimation procedure in an attempt to increase the predictive robustness and structural accuracy of ordinary least squares in collinear data sets. Their method, termed ridge regression, was compared with unit weighting (Schmidt, 1971) and with OLS in a Monte Carlo experiment based on three data matrices drawn from the literature. It was concluded that the ridge technique can outperform OLS in situations where the collinearity is high and consistent across all predictors. When the collinearity is concentrated in subsets of the predictor matrix, ridge regression is dominated by OLS. Consistent with Schmidt (1971), when sample sizes are small relative to the number of predictors, no suppressors are present, and only prediction as opposed to structural interpretation is relevant, unit weighting is to be preferred.

Approved:

Date:

Thesis Committee:
Neal Schmitt, Chairperson
Raymond Frankmann
Ralph Levine

ACKNOWLEDGMENTS

I wish to thank Dr. Neal Schmitt, my chairperson, for both his assistance on this and many other projects and for his sage counsel and friendship over several years. Thanks are also due to Dr. Raymond Frankmann and Dr. Ralph Levine for their advice and guidance as committee members and teachers. Finally, I thank my wife, Phyllis, without whose love and support this and many other endeavors would have little meaning.

TABLE OF CONTENTS

LIST OF TABLES

Chapter
  I. INTRODUCTION
  II. MULTIPLE LINEAR REGRESSION
      The Model
      Assumptions of the Regression Model
      The Correlation Model
      Multicollinearity
      Alternatives to Ordinary Least Squares
      Ridge Regression
  III. METHOD
      The Population Matrices
      Samples
      Equation Estimation
      Data Analysis
  IV. RESULTS AND DISCUSSION
  V. SUMMARY

BIBLIOGRAPHY

LIST OF TABLES

1. Two Predictor Regression Results with Varying Intercorrelation and Validity
2. Illustrative Multicollinearity and Eigenanalysis Values
3. Population Matrix Based on 6000 Cases--HOPOP
4. Population Matrix Based on 6000 Cases--HIPOP
5. Population Matrix Based on 6000 Cases--LOPOP
6. Eigenvectors of the Population Matrices
7. Initial R² -- LS, RIDGE, BU, VU, PU
8. Mean Initial R² Superiority of Least Squares over RIDGE, BU, VU, PU
9. Cross-Validated R² -- LS, RIDGE, BU, VU
10. Mean Cross-Validated R² Superiority of Least Squares over RIDGE, BU, VU
11. Equation Mean Square Errors
12. Ratio of Average LS_cv to Average RIDGE_cv
13. HIPOP Precision Statistics
14. HOPOP Precision Statistics
15. LOPOP Precision Statistics
16. Mean Differences of Precision Statistics Between Least Squares and RIDGE

CHAPTER I

INTRODUCTION

Linear composites are commonly used in psychology, education, and in the social sciences generally for the purpose of combining the information present among a group of variables into a single variable. While many methods of forming a composite have been proposed and utilized (Blum & Naylor, 1968; Burket, 1964; Claudy, 1972; Lawshe & Schucker, 1959), multiple linear regression is by far the most commonly used combinatorial method. This is especially true since the advent of digital computers and widely available "canned" programs, which save the researcher the tedium of hand calculation and often the concomitant necessity of considering the applicability of this method to the research problem at hand.

Linear composites, whether formed by multiple regression or other techniques to be discussed subsequently, are generally used for either predictive or descriptive purposes. In the former case one is interested in creating the composite so as to maximize its correlation with some external variable, usually designated the criterion. Examples of this usage would be predicting a person's future academic standing from past records or estimating the probability of job success on the basis of a composite formed from qualification tests, interview data, and previous employment history. Descriptive uses of linear composites, also termed structural interpretation, involve assessing the degree of change produced in the criterion variable by a unit change in one or more of those variables which form the composite.

As multiple linear regression is the most frequently used combinatorial scheme, at least when the number of available cases is large relative to the number of indicator variables intended to form the composite, its assumptions and use will be discussed first. Limitations inherent in this model are presented as well as the major alternatives to it. Various criteria that have been proposed to evaluate the optimality of combination rules are then contrasted with a modified regression approach, termed ridge regression (Hoerl & Kennard, 1970a, 1970b). The empirical performance of this method was assessed in a Monte Carlo design employing three data sets with different degrees of intervariable relationships.
From each of these populations, 25 samples at each of five sample sizes were randomly drawn. Ridge regression and ordinary least squares (multiple regression) were then employed on each of these 375 samples, as were three different methods of simple unit weighting (Schmidt, 1971). For each method the predictive efficiency in the initial sample as well as the long-term efficiency in the population was evaluated. In addition, the structural accuracy, with respect to the precision of parameter estimation, of ridge regression and ordinary least squares (OLS) was compared.

CHAPTER II

MULTIPLE LINEAR REGRESSION

The Model

An optimal method for obtaining estimates of criterion values as a function of predictor score levels would be the following (Burket, 1964): Select all conceptually relevant variables not statistically independent of the criterion. Measurements on these predictors and the criterion should be obtained on a sufficiently large number of cases (termed the validation sample) such that all possible combinations of score levels are represented. The criterion prediction for a particular case would be the criterion mean of all cases in the validation sample having the same predictor profile.

In practice this idealized system is not generally workable because of the large sample size required to insure stable parameter estimates for every possible predictor profile. What is necessary then is to make simplifying assumptions and adopt a system which will provide fairly accurate predictions of criterion performance over a wide range of possible predictor profiles despite the unavailability of some of them. The assumption most frequently employed in the behavioral sciences is that there exists an approximate functional relationship (most often presumed to be linear, although this is not necessary) between the predictors and criterion. The functional form relating these is estimated in the sample at hand by the method of multiple linear regression or ordinary least squares, which assures two important properties: (1) the sum of squared residuals between the actual criterion values and those predicted from the weighted profile components will be minimized for the validation sample; and (2) the correlation between these two score sets will be the maximum obtainable for this sample (Draper & Smith, 1966; Li, 1974).

The method of least squares and its properties may be summarized with the following notation. The linear regression function relating the dependent variable (Y) to one or more independent predictor variables (X) is, for the ith case,

(1) y_i = α + β_1 x_i1 + β_2 x_i2 + . . . + β_p x_ip + ε_i

where:

y_i = criterion score for the ith subject in the sample,
α = a scaling constant used to adjust for differences in origin between the y and x variables; also termed the intercept,
β_j = a partial regression coefficient used to weight the jth predictor variable,
x_ij = the ith individual's observed score on the jth predictor,
ε_i = error in prediction for the ith subject,
ŷ_i = predicted criterion score for the ith subject.

Thus,

(2) ε_i = y_i - (α + Σ_{j=1}^{p} β_j x_ij) = y_i - ŷ_i.

The properties of OLS noted above are then

(3) Σ_i (y_i - ŷ_i)² = minimum,

(4) r_yŷ = maximum,

with these properties holding for the N cases in the sample on which the weights were estimated. The correlation of equation (4) is referred to as the multiple correlation or, if squared, the coefficient of determination of the weighted predictor composite with the criterion.
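As a concrete illustration of equations (1) through (4), the short sketch below (in Python, with simulated data; the simulation and variable names are purely illustrative and form no part of the thesis itself) obtains the weights from the normal equations and checks the two least squares properties in the estimation sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated validation sample: N cases on p predictors (illustrative only).
N, p = 200, 4
X = rng.normal(size=(N, p))
y = X @ np.array([0.5, 0.3, -0.2, 0.1]) + rng.normal(size=N)

# Solve the normal equations for the intercept and the p weights.
Xa = np.column_stack([np.ones(N), X])
coef = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

y_hat = Xa @ coef
residuals = y - y_hat

# Property (3): the residual sum of squares is the minimum attainable here.
print("residual SS:", residuals @ residuals)

# Property (4): the correlation of y with y-hat is the multiple correlation.
R = np.corrcoef(y, y_hat)[0, 1]
print("multiple R:", R, "  R squared:", R ** 2)
```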
The model for estimating the weights is more easily presented in matrix terms, and that notation will be established here. Proofs of the derivations are available in Draper and Smith (1966), Finn (1974), and Scheffe (1959). Without loss of generality the observations on all variables are assumed to be standardized, so that the constant term (α) in the general model is identically zero. Let y be a column vector of N criterion observations, X be an N x p matrix with rank p less than N, each row representing one case's observations on the p predictor variables, ε be a column vector of N uncorrelated errors with mean zero and variance σ², and β be a column vector of p population regression coefficients. The general linear model presented in (1) becomes

(5) y = Xβ + ε.

Because of the assumptions concerning errors, E(ε) = 0 and E(εε') = σ²I, the criterion vector y has the expectation

(6) E(y) = Xβ

and the covariance matrix

(7) E[(y - Xβ)(y - Xβ)'] = σ²I.

If β̂ are the sample estimates of the population regression coefficients, β, and ŷ are the predicted criterion scores based on these same sample estimates, then

(8) β̂ = (X'X)⁻¹X'y

and

(9) ŷ = Xβ̂.

Because the variables have been standardized, (X'X) is in the form of a zero-order correlation matrix among the predictors and X'y is the vector of predictor-criterion correlations or validities. The estimates of the population regression coefficients have the expectation

(10) E(β̂) = β

and the covariance matrix

(11) E[(β̂ - β)(β̂ - β)'] = σ²(X'X)⁻¹.

β̂ is the "best" estimate of the population vector β in that the sum of squared errors in prediction is minimized in the sample. This can be demonstrated (Finn, 1974, p. 96) by considering any other estimate β* where β* = β̂ + d and d is the vector of discrepancies between β̂ and the alternative estimate. The sum of squared errors with β* replacing β̂ is

(ε'ε)* = (y - Xβ*)'(y - Xβ*)
       = [(y - Xβ̂) - Xd]'[(y - Xβ̂) - Xd]
       = (y - Xβ̂)'(y - Xβ̂) - 2d'X'(y - Xβ̂) + d'X'Xd.

The first term is ε̂'ε̂; the second term is zero since, from equation (8),

d'X'(y - Xβ̂) = d'X'y - d'X'X(X'X)⁻¹X'y = 0.

The third term is positive, as it represents the sum of the squared elements of Xd. Thus the residuals (ε'ε) are inflated anytime one departs from β̂ as the estimate of population weights derived from the sample.

The variance of these minimized residuals in the standardized case is one minus the coefficient of multiple determination (R²), that value which expresses the squared correlation between the optimally weighted predictor combination and the criterion. Where X'X is in the form of a correlation matrix, equivalent formulae for R² are (Burket, 1964; Overall & Klett, 1972)

(12) R² = y'X(X'X)⁻¹X'y,

(13) R² = β̂'X'y.

R², R, or (1 - R²) are commonly presented as indices of the predictive efficiency of the multiple regression model in the estimation sample.

Although the multiple linear regression model presented above has been used extensively throughout the sciences for the purposes of prediction and structural interpretation, its assumptions are often poorly understood. Cureton (1950) considered that, "It is doubtful that any other statistical techniques have been so generally and widely misused and misinterpreted as have those of multiple correlation" (p. 690). The situation is perhaps worse today with the wider availability of canned computer programs for regression.

Assumptions of the Regression Model

The simplest set of crucial assumptions (Johnston, 1972, p. 122) necessary to estimate the β vector in the model y = Xβ + ε are three in number:
(14) E(εε') = σ²I,

(15) X is a set of fixed values,

(16) X has rank p less than N.

The requirement of (14) is that the error or disturbance values have constant variance--a property referred to as homoscedasticity. The diagonal nature of the symmetric matrix E(εε') implies that the covariance between any pair of error terms be zero. Fulfillment of this assumption can often be evaluated by visual examination of residual value plots (Draper & Smith, 1966, ch. 3). Failure to meet this assumption most often occurs in time series analysis or when the linear model fitted is inappropriate for the set of observations at hand (i.e., there exists a nonlinear relation between the predictors and the criterion). As the assumption can generally be adequately met by the inclusion of appropriate quadratic terms, by the inclusion of linear or higher order terms in time, or by suitable data transformations such as the arcsin, square root, or log transforms before analysis (Tukey, 1949; Winer, 1971), the consequences of failure to meet this assumption will not be considered further here.

The second essential assumption (15) is more germane to the purpose of the present paper. Regression theory requires that the X matrix be a set of values fixed by the experimenter, exactly as are the levels of independent variables at which observations on y, the criterion, are taken in fixed effects analysis of variance designs (Binder, 1959). Implicit in this assumption is the requirement that the X values be free of measurement error. This means that in repeated sampling of criterion values the only source of variation is attributable to the vector of disturbances, ε. If this assumption is met, β̂ is an unbiased linear estimator (Johnston, 1972, pp. 18-23). Effects of violations of this assumption are discussed below under the correlation model.

The third assumption (16) states that X must be of full rank equal to p, the number of predictors. If the rank of X is less than the number of predictors, the β vector is indeterminate and no unique solution to the normal equations exists. As will be discussed under the heading of "multicollinearity," problems can also arise when this assumption is only approximately met.

The Correlation Model

While data transformations or deletion of some predictors have been found in many cases to adequately compensate for violations of assumptions (14) and (16), failure to obtain fixed predictor values requires an alternative model. Traditionally, multiple regression techniques have been applied in precisely those situations where the control required to obtain fixed-X values cannot be insured (Cohen, 1968). Data sets analyzed by means of OLS are typified by subjects' test scores, historical records, and in general by data that is not collected according to a design for the systematic evaluation of criterion scores obtained at preselected levels of the independent variables. In this type of situation the correlational model for the predictors is more appropriate than is the regression model. The latter is based on the assumption that only the disturbance vector ε is subject to sampling error--an assumption that is rarely met in applied multiple regression situations. The correlation or random-X model assumes that the predictors and the criterion are random variables sampled from a joint multivariate normal distribution.
Regardless of the distributional form of the disturbances (and hence of the y values), the OLS method provides "best"--i.e., minimum variance, unbiased--estimators of the population β values. While the fixed-X or regression model makes no assumptions about the distribution of the predictor variables, it does require a normal error assumption to permit inferential tests. This assumption is based on empirical evaluations of the robustness of the t and F statistics against moderate departures from normality (Neter & Wasserman, 1974). When this assumption is met the β̂ estimates are maximum likelihood estimates of the true, population weights with the same best linear unbiased properties as the least squares values (Herzberg, 1969; Neter & Wasserman, 1974).

While both models would appear to provide the necessary data for inferential uses of multiple regression results, it is clear that the correlational model is almost always more appropriate. Under the null hypothesis of zero multiple correlation the distributional theory is identical for the two models. However, in applications, especially those for predictive purposes such as in personnel selection, the null hypothesis is rarely true (Burket, 1964). The extreme complexity of the correlational model in cases where the null hypothesis does not obtain has led most investigators to use the fixed-X model in the hope that there will be little practical difference in the results derived (Burket, 1964; Claudy, 1972; Cohen & Cohen, 1975; Neter & Wasserman, 1974).

While this subject has not received a great deal of attention in the literature, it would appear that application of fixed-X procedures to random-X data affects the suitability of the weight estimates thus derived. It has been demonstrated by Berkson (1950), Geary (1953), and Rao and Miller (1971) that if the predictor variables are not held at preselected values and are subject to errors of measurement, the beta weights will in fact be biased estimates of the population values. It is thus not necessarily the case that beta weights derived on a sample of finite size, such that they maximize the multiple correlation in the sample, are the "best" estimates of the parameter vector β (Claudy, 1972). "Best" here refers not to the minimum variance properties of least squares estimators but rather to the minimization of the difference between the true population multiple correlation (ρ) and the estimate of it (R) obtained by application of the sample weights to the population.

Application of the fixed-X regression model to data collected under the assumptions of the random-X model results in an over-fitting of the regression surface to the sample data. In practice this means that the beta weights are optimized on the idiosyncrasies caused by sampling and measurement error in the estimation sample. Accordingly, when these weights are applied in a new sample or in the population, the resulting multiple correlation will be lower than the initial estimate. This general problem has been termed "shrinkage" and has been considered by numerous authors. For instance, formulae have been evaluated to estimate this shrinkage (see Schmitt, Coyle, & Rauschenberger, 1977, for a comparison of the major formulae) and alternatives to OLS have been proposed (Claudy, 1972; Cureton, 1962; Herzberg, 1969; Lawshe & Schucker, 1959; Schmidt, 1971).
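One commonly cited shrinkage adjustment, usually associated with Wherry (1931), estimates the population coefficient of determination from the sample value. The sketch below is offered only as an illustration of the kind of formula compared by Schmitt, Coyle, and Rauschenberger (1977); it is not the analysis performed in this thesis.

```python
def wherry_adjusted_r2(r2: float, n: int, p: int) -> float:
    """Shrinkage-adjusted R squared: 1 - (1 - R^2)(N - 1)/(N - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Example: a sample R^2 of .40 with N = 50 cases and p = 10 predictors
# shrinks to roughly .25 as an estimate of the population value.
print(round(wherry_adjusted_r2(0.40, 50, 10), 3))
```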
While accurate estimation of the multiple correlation is of primary interest for predictive uses of multiple regression techniques, it does not touch upon the second purpose of multiple regression: structural interpretation. The final consideration relative to assumption (15) is germane to this purpose. Application of fixed-X OLS to random-X data subject to sampling and measurement error inflates the variability among the optimizing weights without regard to the true variance of the parameter values. It is not the standard error or variance of any single weight estimate which is referred to here but rather the dispersion of the p weights calculated on the p predictors in a sample. This effect is attributable to the sample values being subject not only to the variance of the population weights but also to the error variance generated by the less than perfectly reliable measurement of predictor scores. Awareness of this artifact has led to such proposals as averaging the beta weights obtained in a random split of the sample (Claudy, 1972) and using as a β estimate the least deviant (from zero) of the two weights obtained in a fifty-fifty split (Cureton, 1962). The relative merits of several such alternatives to OLS are discussed subsequent to the further examination of the implications of assumption (16) in the following section.

Multicollinearity

The assumption of equation (16), that X, the predictor matrix, has rank p < N, actually has implicit two requirements. The first concerns the ratio of the sample size available to the number of predictors for which weights must be estimated, and the second involves the number of linear dependencies among the predictor set. The question of sample size is common to any attempt to establish a statistical estimate of a parameter--the greater the number of cases upon which the estimate is derived, the more stable it will be. In the case of OLS using standardized data the requirement is that p ≤ N, or else the model is considered to be overdefined and a unique solution for β is not possible. In point of fact one generally desires that N be much greater than the number of predictors, for as was demonstrated by Wishart (1931),

(17) E(R²) = ρ² + (p/(N - 1))(1 - ρ²),

where ρ represents the population multiple correlation. In the case where the null hypothesis of no predictor-criterion correlation holds in the population, equation (17) shows that the sample value will be inflated. Setting ρ² to zero yields

(18) E(R²) = p/(N - 1).

It is from this equation that the various shrinkage estimators have been derived (Darlington, 1968; Lord, 1950; Nicholson, 1960; Wherry, 1931). From the above formula it is obvious that the extent to which R² overestimates ρ² varies directly with the number of predictors and inversely with the sample size. These characteristics are important when one compares the efficiency of OLS and alternative estimators (Schmidt, 1971) in a variety of practical situations.

Throughout the long history of multiple regression usage in the sciences, practitioners have come to appreciate its robustness in the face of violations of some underlying assumptions and have, in many cases, developed remedial procedures to correct unsuitable data before analysis. Examples of this would be the Durbin-Watson (1950) test statistic for autocorrelated error terms with the attendant suggestions of Cochrane and Orcutt (1949) as to their correction.
Similarly, Bartlett's variance homogeneity test has given rise to a number of data transformations suitable to different types of heteroscedasticity (Winer, 1971). The development of detection and correction methods for problems of multicollinearity in regression models has not yet reached the level of rote application of specified test statistics which in turn could provide evidence as to appropriate alterations to be made (Farrar & Glauber, 1967). In fact, while economists have apparently been aware of the difficulties inherent in highly correlated predictor sets for some time, it seems that others in the social and behavioral sciences have frequently labeled such a concern as being of "theoretical interest only" and thereby dismissed it from consideration in their applied work. As shall be demonstrated, multicollinearity can cause some very practical problems to arise (Darlington, 1968).

While a variety of definitions of multicollinearity exist in the literature, many of them are more symptomatic than definitive. The definition used here is attributable to Johnston (1972) and Silvey (1969). If one considers the predictor matrix X of dimensions (N x p), a linear dependence is said to exist between the column vectors x_1, x_2, ..., x_p if there exist constants a_1, a_2, ..., a_p, not all zero, such that

(19) Σ_{j=1}^{p} a_j x_j = 0.

When (19) holds for some subset of the column vectors of X (and thus for the matrix as a whole), multicollinearity is said to exist. In this case beta estimates cannot be obtained, as the predictor matrix is singular and thus its inverse does not exist (equation 8). However, even when (19) does not obtain exactly but rather is only approximately true, multicollinearity is still a relevant problem for the data analyst. Thus the question of collinearity is one of severity, or the degree of departure from orthogonal variates (Kmenta, 1971; Mason, Gunst, & Webster, 1975).

There are three primary sources of highly collinear data sets (Mason, Gunst, & Webster, 1975). The first involves an overdefined model--one where there exist more predictors than observations. The difficulties caused by cases in which this is approximately true were discussed above. When faced with such a situation the analyst must (a) eliminate some predictors, (b) use grouped subsets of predictors, or (c) utilize some form of principal components regression. There are deficiencies inherent in each of these solutions and they will be discussed in the following section. The latter two sources of collinearity, sampling techniques and physical constraints on the model, are quite similar and can be presented together. These situations arise when the data have been sampled from only a subspace of the predictor variable domain or when some predictors' values are restricted to a near exact relationship with other variables in the X matrix. In the former case data observations can be added from the undersampled area of the domain, if indeed the investigator is aware of the problem, which can usually be identified through eigenvector analysis (Silvey, 1969). When practical constraints eliminate this alternative or when the problem cannot be identified as attributable to undersampling, few remedial measures are available.

The effects of approximate multicollinearity have been presented by Johnston (1972) as follows:

1. The precision of estimation falls so that it becomes very difficult, if not impossible, to disentangle the relative influences of the various x variables.
This loss of precision has three aspects: specific estimates may have very large errors; these errors may be highly correlated, one with another; and the sampling variances of the coefficients will be very large.

2. Investigators are sometimes led to drop variables from an analysis because their coefficients are not significantly different from zero, but the true situation may be not that a variable has no effect but simply that the set of sample data has not enabled us to pick it up.

3. Estimates of coefficients become very sensitive to particular sets of sample data, and the addition of a few more observations can sometimes produce dramatic shifts in some of the coefficients (p. 160).

The first difficulty has been well documented and illustrated by Darlington (1968), while the latter two consequences are familiar to psychologists under the general rubric of "bouncing betas." The detection and analysis of these effects will be considered for the two predictor case, although all results are applicable to the case of any number of predictors as long as equation (19) is not exactly satisfied.

The effects of collinearity on estimates can best be seen by considering the inverse of the correlation matrix. Equation (8) can be written for the standardized model with C = (X'X)⁻¹ as

(20)  [ β̂_y1.2 ]   [ c_11  c_12 ] [ r_y1 ]
      [ β̂_y2.1 ] = [ c_21  c_22 ] [ r_y2 ]

where r_yj represents the validity coefficient for the jth predictor. From (20) it is evident that in the case of uncorrelated predictors (c_12 = c_21 = 0), the validity coefficient is the beta for any one predictor, as the predictor matrix is then an identity matrix:

(21)  [ β̂_y1.2 ]   [ 1.0  0.0 ] [ r_y1 ]
      [ β̂_y2.1 ] = [ 0.0  1.0 ] [ r_y2 ]

When the predictor intercorrelation is not equal to zero the inverse matrix is of the form

(22)  C = (X'X)⁻¹ = (1/(1 - r_12²)) [   1     -r_12 ]
                                    [ -r_12     1   ]

This is the matrix formulation of the familiar computational solution for beta weights with two predictors:

(23)  β̂_1 = (r_y1 - r_y2 r_12)/(1 - r_12²),
      β̂_2 = (r_y2 - r_y1 r_12)/(1 - r_12²).

Equation (23) is a well-known result and illustrates that as (19) becomes exact (i.e., r_12 or r_12² → 1.0) the diagonal elements of the inverse matrix approach infinity (1.0 ≤ c_ii → ∞). Several consequences follow from this. The limiting case of intercorrelation is derived by assuming r_y1 = r_y2, which is justified since as r_12 → 1.0 each predictor's validity must become equal (Klein & Nakamura, 1962; Sastry, 1970). This further implies that as r_12 → 1.0 the slightest discrepancy in the magnitude of the validity coefficients will result in the beta weights being approximately equal but opposite in sign. Obviously, the slight discrepancies causing this are sample specific and due to sampling error and lack of perfect reliability in measurement. This is the "bouncing beta" problem (McDonald & Schwing, 1972; Swindel, 1974; Wampler, 1970), demonstrated by sign reversals and, in the case of multiple predictors, by dramatic shifts in the magnitude of weights in different samples from the same population (Johnston, 1972; Wherry, 1975). Table 1 provides sample calculations demonstrating the effects of varying r_12 and discrepant versus equal validities.
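A short sketch of equation (23), run with the same validities and intercorrelations used in Table 1 below, makes the sign reversal under near-perfect collinearity easy to reproduce (Python; the function is illustrative and not part of the original analysis).

```python
def two_predictor_betas(r_y1: float, r_y2: float, r_12: float):
    """Standardized regression weights from equation (23) and R^2 from (13)."""
    denom = 1.0 - r_12 ** 2
    b1 = (r_y1 - r_y2 * r_12) / denom
    b2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = b1 * r_y1 + b2 * r_y2
    return b1, b2, r_squared

# Equal validities: both weights shrink smoothly as r_12 rises.
print(two_predictor_betas(0.50, 0.50, 0.90))   # about (0.26, 0.26, 0.26)
# A validity difference of only .01 combined with r_12 = .99 flips one sign.
print(two_predictor_betas(0.51, 0.50, 0.99))   # about (0.75, -0.25, ...)
```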
Table 1

Two Predictor Regression Results with Varying Intercorrelation and Validity

r_y1   r_y2   r_12   β_y1.2   β_y2.1   Variance   Covariance   Multiple R
.50    .50    .00    .50      .50      0.50        0.0         .710
.50    .50    .30    .38      .38      0.68       -0.2         .620
.50    .50    .60    .31      .31      1.08       -0.65        .560
.50    .50    .90    .26      .26      3.89       -3.51        .510
.50    .50    .95    .256     .256     7.63       -7.25        .506
.50    .50    .99    .25      .25      37.64      -37.26       .501
.51    .50    .00    .51      .50      .49         0.0         .714
.51    .50    .30    .396     .381     .688       -0.2         .626
.51    .50    .60    .328     .303     1.064      -0.639       .565
.51    .50    .90    .316     .216     3.845      -3.461       .519
.51    .50    .95    .358     .159     7.563      -7.185       .512
.51    .50    .99    .75      -.246    36.11      -36.74       .511

While it is possible for a beta weight to be underestimated, the gross inflation of the diagonal elements in the inverse matrix due to high collinearity generally results in betas with large absolute values without regard to the true population value. While these are "correct" values, their counterintuitive signs and magnitudes again have prompted the discussion of alternative estimators to OLS (Churchill, 1971; Hoerl & Kennard, 1970a; Klein & Nakamura, 1962). The notion that these inflated betas are potentially poor estimates is attributable to the effects of collinearity on the variance-covariance of the beta weights. Equation (11) can be expressed (with r_12 = α and omitting a function of sample size) as

(24)  Var β̂ = (σ²_ε/(1 - α²)) [ 1.0   -α  ]
                               [ -α    1.0 ]

so that

(25)  Var β̂_1 = Var β̂_2 = σ²_ε/(1 - α²)

and

(26)  Cov β̂_1β̂_2 = -α σ²_ε/(1 - α²).

As multicollinearity, here expressed as α, increases, it is evident that the sampling variances of the estimated coefficients increase. For example, as α increases from .5 to .9 the sampling variance increases by over 300 percent, while α = .95 gives an increment of 750 percent (Johnston, 1972). It should be noted, though, that poor precision in the estimation of individual coefficients does not imply that the linear combination of predictors is correspondingly poor. This apparent anomaly is evidenced in the two predictor case with positive α by the negative covariance of the estimates. This means that if one beta weight is overestimated, another in the same sample with which it is positively correlated will also be overestimated in absolute value but will have the opposite sign. The higher the correlation between the two variables, the more pronounced will be this tendency to compensate for errors in estimation. In the extreme case of r_12 = 1.0 any pair of weights with the same sum will be exactly equivalent--for instance, weights of -4.0 and 6.0 or 3.0 and -1.0.¹ This effect is exemplified by Darlington (1968) and proven by Mason, Gunst, and Webster (1975) for the greater than two predictors case.

¹This is true despite the fact that perfect collinearity (r_12 = 1.0) makes inversion of the predictor matrix impossible (Darlington, 1968).

The above discussion has illustrated the nature of the problems cited by Johnston (1972) as being attributable to the effects of multicollinearity in the predictor set. Yet large predictor sets with, as a rule, validities above .25 or .30 are the norm rather than the exception in most MR applications. With higher validities and p greater than three or four it becomes inevitable that the deleterious effects of multicollinearity will be felt. While this problem is of minimal interest for purely predictive MR uses, it should be carefully considered when structural interpretation is the goal. In this setting the magnitudes and sampling variances of weight estimates can lead to erroneous conclusions as to the importance or predictive utility of individual variables.
Thus it is important to consider ways of detecting multicollinearity, assessing its impact, and hopefully discovering solutions to the problems it poses.

Numerous techniques have been proposed for detecting multicollinearity, and the more important will be discussed. The simplest available operational definition of unacceptable collinearity is the arbitrary establishment of a maximum permissible value for predictor intercorrelation. Aside from the arbitrariness inherent in this approach, it shares the faults of the next proposal to be presented. Klein (1962) suggests that ". . . intercorrelation or multicollinearity is not a problem unless it is high relative to the overall degree of multiple correlation . . ." (p. 101). Despite its intuitive appeal this rule of thumb is not valid. Farrar and Glauber (1967), while providing a geometric rationale for the rule, point out that perfect collinearity, or the case of a completely singular predictor matrix, is perfectly compatible with low pairwise correlations. A set of dummy coded contrast vectors such as commonly used for the analysis of variance, whose non-zero elements exhaust the sample space, would fulfill these requirements (Cohen, 1967; Cohen & Cohen, 1975). For the same reasons, measures based on average intercorrelations (Cureton, 1971; Kaiser, 1968; Meyer, 1975) are inadequate warnings of severe multicollinearity.

A measure presented by Kmenta (1971) is the coefficient of determination R²_(j), which is obtained by regressing the criterion variable on all predictors excluding x_j. If a high degree of collinearity is present in the data, the discrepancy between R²_(j) and the coefficient of determination for the full predictor set will be quite small. However, a small difference may simply be reflective of the worthlessness of x_j as a predictor variable. This is illustrated by Darlington's (1968) suggestion of this exact comparison for estimating the importance of individual predictors. Furthermore, this measure does not depict the nature of the collinearity, i.e., which variables are involved in the relationships.

Another measure with the same limitations as R²_(j) is based on the F statistic obtained from fitting the full model and the F statistics obtained by deleting one variable at a time from the equation. If the overall F is significant and the individual F tests are not, multicollinearity is indicated. However, this occurrence is unusual even with high collinearity (Mason, Gunst, & Webster, 1975), and, like the previous measure, the nature of the collinearity is not specifiable.

A single measure which summarizes the collinearity present in the entire predictor matrix, again without providing information as to its nature, is provided by the determinant, symbolized |X'X|. As (X'X) is in standardized form, 0 ≤ |X'X| ≤ 1.0, while if a linear dependence satisfying equation (19) exists the determinant is equal to zero. This measure provides at least an ordinal indicant of the presence of multicollinearity, although the collinearity could be attributable to one or several very small latent roots. Under the assumption of multivariate normality (not generally tenable in the assumed fixed-X case, as discussed earlier), work by Wilks (1932) and Bartlett (1950) indicates that a chi-square test of the departure of the determinant from zero is possible.
Further, the determinant obtained by deleting one variable or set of variables from the matrix forms an F ratio with the determinant of the full p-order matrix. These tests are however very sensitive to departures from normality and are of sufficient complexity to discourage their frequent use (Farrar & Glauber, 1967). Much the same information can be obtained more readily by the methods to be discussed next.

Johnston's (1972, pp. 162-163) conclusion that the standard error of beta weight estimates should give adequate warning of the presence of multicollinearity can be extended to provide more exact information. The standard error of a single beta weight, β̂_i, is defined to be

(27)  S.E. β̂_i = sqrt[ C_ii (1 - R²_y.1,2,...,p) / (N - p) ]

where C_ii is the diagonal element of (X'X)⁻¹ corresponding to the ith predictor. This measure provides an intra-matrix indication of collinearity but still does not facilitate inter-matrix comparisons. A more useful measure is the C_ii component of the standard error formula, which indicates collinearity without reference to the coefficient of determination for the full equation. This diagonal element of the inverse intercorrelation matrix of predictors (actually a transformation of it) has been termed the variance inflation factor by Marquardt (1970) and has been employed as an indicant of multicollinearity by Marquardt and Snee (1975) and Snee (1973).

This element, C_ii, and the off-diagonal values of the matrix can be expressed in terms of more familiar quantities to demonstrate their utility. If the symbol s_i.j,...,p is used to represent the square root of the residual variance obtained when any one predictor is regressed on the remaining p - 1 predictors (i.e., s_i.j,...,p = sqrt(1 - R²_i.j,...,p)), then C_ii is the reciprocal of this variance--C_ii = 1/(s_i.j,...,p)². This provides a convenient means of assessing multicollinearity (Farrar & Glauber, 1967), as the squared multiple correlation of each predictor regressed on those remaining is implicit in the inverse matrix of (X'X). The relationship is

(28)  R²_i.j,...,p = 1 - 1/C_ii.

Thus if perfect collinearity (19) exists, the C_ii element will be infinitely large and the matrix is seen as singular. For matrices which do not exactly satisfy (19), the natural range of C_ii is simply greater than or equal to one, with equality obtaining in the orthogonal variate case. If a single high collinearity exists (high pairwise correlations), the large C_ii and C_jj will indicate which variables are involved. However, if collinearities involving several variables are present, one must look to the off-diagonal elements of (X'X)⁻¹ for more information. An off-diagonal element C_ij is defined as

(29)  C_ij = -r_ij.k,...,p / [(s_i.j,...,p)(s_j.i,k,...,p)].

The numerator is a partial correlation of an order two less than the rank of the full matrix. It may be noted that inversion of the correlation matrix provides a means of quickly obtaining all of the highest order partial correlations by the formula

(30)  r_ij.k,...,p = -C_ij / sqrt(C_ii C_jj).

Especially in cases of non-overlapping groups of multicollinearities, consideration of the diagonal and off-diagonal elements of the inverse allows one to locate the variables contributing to the problem. Marquardt (1970), Mason, Gunst, and Webster (1975), and Farrar and Glauber (1967) consider this indication of collinearity to be the best available. Gordon (1967) illustrates the effects on these values produced by varying intercorrelation and subset size.
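The diagonal of the inverse correlation matrix can be computed directly; the brief sketch below (Python) recovers each C_ii and the implied R²_i.j,...,p of equation (28) for the three-variable matrix shown later in Table 2, and its output agrees closely with that table (small differences reflect the rounding of the input correlations).

```python
import numpy as np

# Three-predictor correlation matrix from Table 2 (Cooley & Lohnes, 1971).
Rxx = np.array([
    [1.00, 0.67, -0.10],
    [0.67, 1.00, -0.29],
    [-0.10, -0.29, 1.00],
])

C = np.linalg.inv(Rxx)
vif = np.diag(C)                # C_ii, the variance inflation factors
r2_on_rest = 1.0 - 1.0 / vif    # equation (28): R^2 of each predictor on the rest

print("determinant |X'X|:", round(np.linalg.det(Rxx), 4))
print("C_ii (variance inflation factors):", vif.round(2))
print("R^2 of each predictor on the others:", r2_on_rest.round(2))
```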
There is yet one improvement which can be suggested to further facilitate the interpretation of multicollinear matrices. Once the presence of high collinearity has been established by means of one or several high C_ii values or, in a summary manner, by the existence of a near-zero determinant, one is still interested in accurately pinpointing the contributions to the problem--essentially in specifying the coefficients of equation (19). A procedure which enables this serves basically as the stepping stone for rank-reduction alternatives to OLS or, alternatively, as an exploratory statistical method in its own right. Eigenanalysis is basic to all expositions of principal components or factor analysis, but its utility has not been widely appreciated by users of multiple regression.

Eigenanalysis is essentially a procedure for extracting from a matrix the successive vectors of weights which, when applied to the original variables, will produce linear combinations of maximum variance. These eigen or characteristic vectors, as they are also called, are subject to two conditions. First, they are restricted to unit length, i.e., v_i'v_i = 1.0. Secondly, each vector must maximize the residual variance extracted from the matrix subject to the condition that it is orthogonal to all other vectors. For the case of an intercorrelation matrix (X'X), the matrix equation to be solved is

(31)  ((X'X) - λ_i I)v_i = 0

where λ_i represents the characteristic root or variance of its associated vector of coefficients, v_i, and is obtained from

(32)  V'(X'X)V = Λ, diagonal.

Solution of this equation (Cooley & Lohnes, 1971; Finn, 1974; Tatsuoka, 1971) yields a set of p roots and a matrix of dimension p x p containing the coefficient vectors. Some attributes of these values can be of use in assessing the effects of multicollinearity. For a matrix of orthogonal standardized variates each characteristic root is equal to exactly one. Therefore, if r_ij = 0 for all i ≠ j, then λ_i = 1.0 for all i, and

(33)  Σ_{i=1}^{p} λ_i = p, the matrix rank.

If the vectors are not orthogonal the sum of the roots must still equal the variance of the full matrix--thus (33) is true in all cases. However, with correlated variates the first one or several eigenvectors extracted will exhaust much of the variance and the later eigenvalues will approach zero. In the case of perfect collinearity (19), one or more of the roots will in fact be equal to or less than zero. Thus each eigenvalue is an indicant of the degree of collinearity present in the matrix, and the inflation of the sum of the reciprocals of the roots away from p provides another matrix-wide summary of the severity.

Of more interest than merely another summary measure are the elements of the vectors associated with small eigenvalues. These coefficients, just as in factor analytic interpretations, show which variables are the major contributors to the definition of the vector. Thus large positive or negative coefficients in a vector with a small eigenvalue indicate which variables are contributing the most to the lack of orthogonality (Marquardt & Snee, 1975; Mason, Gunst, & Webster, 1975; Snee, 1973; Webster, Gunst, & Mason, 1971).

A relationship of interest (Snee, 1973) involves an alternative method of computing the diagonal elements of the correlation matrix inverse. Because the eigenvector matrix is columnwise orthogonal and of unit length (V'V = VV' = I), equation (32) can be rearranged to give

(34)  (X'X) = VΛV'.
Using a matrix theorem for inverses ((ABC)⁻¹ = C⁻¹B⁻¹A⁻¹; Dorf, 1969), it is evident that

(35)  (X'X)⁻¹ = VΛ⁻¹V'

and

(36)  C_ii = v_i1² λ_1⁻¹ + v_i2² λ_2⁻¹ + . . . + v_ip² λ_p⁻¹.

From this equation the significance of characteristic roots less than 1.0 is immediately obvious. The basis of the standard error for any one beta weight (28) is a direct function of the spectrum of eigenvalues for the matrix upon which it is computed. An eigenanalysis and other multicollinearity statistics discussed above are presented in Table 2, based on a numerical example taken from Cooley and Lohnes (1971). With only three variables and a rather simple pattern of interdependence, the source of the collinearity is readily apparent. In more complex analyses, however, the information provided by large loadings (-.66 and .72 in v_3) and associated small roots (.304) can be valuable.

Table 2

Illustrative Multicollinearity and Eigenanalysis Values

(X'X) =   1.00    .67   -.10
           .67   1.00   -.29
          -.10   -.29   1.00

Determinant |X'X| = .4987

(X'X)⁻¹ = C =   1.84   -1.28   -.18
               -1.28    1.98    .44
                -.18     .44   1.11

Eigenvectors V =   .64    .38   -.66
                   .69    .10    .72
                  -.34    .91    .20

Eigenvalues λ_i     % of Trace
  1.768               59.0
  0.927               30.9
  0.304               10.1

Σ_{i=1}^{p} λ_i⁻¹ = 4.924

Because of the problems with established methods of assessing multicollinearity outlined above, it is suggested that eigenanalysis be performed on any data set in which high collinearity is suspected. Inspection of the eigenvector values should allow a researcher to pinpoint likely problem variables or sets of variables. Once the severity of the multicollinearity present in a matrix has been assessed, ways should be considered for handling the deleterious effects it can have on weight estimates. Numerous methods have been presented in the statistical, sociological, and econometrics literature, and several will be discussed here.

Alternatives to Ordinary Least Squares

The two basic uses to which multiple regression estimation of weighting coefficients are applied are again relevant here. The majority of alternatives to OLS (including derivations based on the OLS procedure) are directed at maximizing the sample equation's multiple R or else its expected value on cross-validation, subject to such constraints as computational ease or the availability of adequately large data samples. Thus, many alternatives are explicitly concerned only with prediction, and several in fact make structural interpretation
If excessive collinearity attributable to an undersampling of the regions of the data domain is evidenced by either evaluation of the eigenvectors or joint variable distributions (Webster, Gunst, & Mason, 1974), little choice remains other than to acquire observations on a larger N. Rank reduction procedures have occasionally been employed in the last mentioned case, although interpretation may be vastly complicated. These procedures may be classified basically as either based on a posteriori orthogonalization of the data vectors or on evaluation of successive partial validities as variables are included or deleted from the predictor set. Virtually all orthogonalization models attempt to eliminate specific variance (in the factor analytic sense) from the 32 predictor intercorrelation matrix. Thus, one approach in this area is the application of a principal components analysis of the predictors (normalized eigenvector values). Based on a scree test (Cattell, 1966) or the meaningfulness of the resulting components, an arbitrary number (Jeffers, 1966; Jolife, 1972, 1973) are retained and matrix transformations are used to re-estimate variable scores for individual subjects. These scores are then used in the usual regression computa- tions. Examples of this method have become fairly common since the advent of readily available computers to carry out the tedious matrix manipulations (Gunst, Mason, & Webster, 1975; Jeffers, 1966; Massey, 1974, Schmitt & Coyle, 1976). Variations on this approach have utilized the characteristic vectors as predictors (Gunst, Mason, & Webster, 1971), inserted communality estimates in the R matrix (Horst, 1941), and attempted to estimate the R.1 matrix (Guttman, 1958) rather than R itself. One method (Burket, 1964) augments the predictor matrix with the vector of criterion validities before principal axes orthogonaliza- tion. Finally, all analyses based on components or axes may also be subjected to rotation (for example, varimax or quartimax) before being used to re-estimate subjects' scores. If a principal components analysis is employed and the number of retained components is the same as the number of original variables it can be shown that the multiple regression equation derived will be identical to that obtainable from the raw variables (Darlington, 1968; Herzberg, 1968). Therefore, only cases in which fewer factors are extracted than the number of original variables can potentially be of interest. The argument in favor of such rank reduction is usually based on the well known fact that if the variables being factored 33 contain substantial error variance, it will tend to be concentrated in the vectors associated with small roots. Potentially serious problems exist in applications of any of these methods. The distributional theory is exceedingly complex for those analyses employing communality estimates and this leaves significance testing of derived weights a virtually intractable problem (Burket, 1964). Further it is possible that the factor accounting for the least variance in the predictor set, and therefore the prime candidate for omission, in fact correlates perfectly with the criterion (Darlington, 1968). Description of the variables re-estimated from factor matrices is also generally compli- cated by "intermediate" loadings, and the indeterminancy of factor scores up to a linear transformation and this in turn obfuscates efforts to interpret the subsequent regression equation. 
Variable deletion based on various criteria has been proposed when degrees of freedom are limited or when collinearity is a problem (Draper & Smith, 1966). In all cases deletion procedures attempt to maximize the validity of the initial equation subject to specified constraints, and are thus not amenable to instances in which structural interpretation of a full rank predictor matrix is of interest. As Darlington (1968) notes, removing the variable with the smallest beta weight is not guaranteed to produce the equation with the highest population validity for that rank model. Accretion methods of variable selection begin with the variable having the highest zero-order validity and then in successive steps add those variables which will give the greatest increase in the multiple R for the equation (Draper & Smith, 1966). Horst and MacEwan (1960) suggest the reverse of this procedure and note that the two methods--forward selection and backward elimination--will not in general yield the same equations. Both procedures are terminated on the basis of arbitrary criteria such as validity increment or number of variables included. Stepwise regression is essentially identical to forward selection, but additionally it tests at each step all variables already in the equation. If, because of the inclusion of subsequent predictors, a variable's partial correlation has fallen below a specified value, it is then eliminated and the procedure continues to evaluate the remaining candidates for the equation (Nie, Hull, Jenkins, Steinbrenner, & Bent, 1975). Numerous variations on these three basic approaches have been suggested (Anderson & Fruchter, 1960; Burket, 1964; Furnival & Wilson, 1974; Hocking & Leslie, 1967; LaMotte & Hocking, 1970; Rock, Linn, Evans, & Patrick, 1970), but in general these three have been preferred. The case of interest in the present paper is that in which one wishes, for either predictive or interpretative purposes, to obtain an equation based on all p variables upon which data were collected. Accordingly, the relative merits and problems of reduced rank procedures will not be discussed further.

The selection and application of non-least squares estimated weights has received a great deal of attention, especially in the psychological decision-making literature (Einhorn & Hogarth, 1975). In general the motivation for development of alternative weighting strategies has had three facets: (a) especially when computations must be done by hand, the complexity of the work necessary to calculate OLS weights is prohibitive; (b) high collinearity situations combined with less than perfect reliability of measurement make it quite likely that arbitrary weights will outperform unstable beta estimates in subsequent usage; (c) situations in which it is desired to correlate a linear function of the predictors with a criterion but a sufficiently large sample on which to estimate beta weights is not available. A variety of combinatorial schema have been evaluated in the literature (Claudy, 1972; Lawshe & Schucker, 1959; Trattner, 1963; Wesman & Bennett, 1959), such as raw score addition so that variables are weighted by their standard deviations, addition of standardized scores, weighting by the reciprocal of the standard deviation, and weighting by the validity coefficient. The consensus has developed that, for situations in which N is less than approximately 50, equal weights are superior or only slightly inferior to OLS weights regardless of the number of predictors.
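As an illustration of the equal-weight idea only (the three sign-assignment variants actually compared in this thesis are described later), the sketch below forms a unit-weighted composite of standardized predictors, with signs taken from the zero-order validities, and correlates it with the criterion.

```python
import numpy as np

def unit_weight_r(X, y):
    """Correlation of the criterion with a composite of standardized
    predictors given weights of +1 or -1 (signed by zero-order validity)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    validities = np.array([np.corrcoef(Xs[:, j], y)[0, 1] for j in range(X.shape[1])])
    composite = Xs @ np.sign(validities)
    return np.corrcoef(composite, y)[0, 1]

# Illustrative small-sample check (N small relative to p).
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
y = X @ np.array([0.4, 0.3, 0.3, 0.2, 0.1, 0.1]) + rng.normal(size=40)
print("unit-weight r:", round(unit_weight_r(X, y), 3))
```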
The comparison made is generally between the multiple R obtained from application of the original beta weights in a cross-validational sample (a set of cases not included in the original estimation of the weights) and the multiple R produced by unit weights. A comprehensive Monte Carlo study of the empirical performance of unit weights versus sample beta weights when they are validated in the population was done by Schmidt (1971). For 40 combinations of N and p, compared on a variety of correlation matrices sampled from the literature, he demonstrated that the maximal superiority (in terms of obtained R² values) of beta weights averaged over 100 samples was only .083. When suppressor variables were removed this maximum dropped to .039. Both maximum beta versus unit weight discrepancies were in fact obtained in the populations themselves, where the beta weights were error free, i.e., parameter values rather than sample estimates. No other weighting scheme has been shown to be so consistently comparable to the performance of beta weights.

The results provided by Schmidt's analysis are in accord with the suggestions of Einhorn and Hogarth (1975) and Dawes and Corrigan (1974), who derived their conclusions from comparative studies of the human decision making process. Dawes and Corrigan summarize their results by stating that to obtain stable prediction equations in situations where all variables are subject to error it is necessary simply to select relevant predictors, determine their sign, make all predictors comparable, and then add. While this solution appears at first to be a panacea for the numerous problems encountered in MR, Roose and Doherty (1976) noted several difficulties in attempting to apply these suggestions. They found selecting the variables without the use of some sort of stepwise procedure an arduous task. Nor had they any manner of determining the predictor signs a priori; those they used were based on the validity coefficients for the selected variables. In their words (Roose & Doherty, 1976), ". . . the success of unit weighting as demonstrated in the present study rested upon crutches fashioned from the very MR procedure bested by unit weighting" (p. 245). Wainer (1975, 1976) has formulated the expected loss attributable to the use of unit rather than OLS weights and noted that for practical purposes the loss is so small that the OLS procedure is not justifiable. Again, though, his derivations assume some sort of selection and sign assignment a priori, no suppressors, as well as a maximal spread in the beta weights of only .5--conditions which it is frequently impossible to meet.

If one is interested in full rank multiple prediction, it would appear that unit weighting is the viable alternative to MR. While structural interpretation is not possible, except on the gross level of
The problems with standard multiple regression which prompted researchers to consider unit weighting and various orthogonalization schema have recently given rise to a modified OLS methodology. The details of this method, termed ridge regression, are discussed next.

Ridge Regression

Hoerl (1962) originally proposed this modified regression method specifically to deal with the problems of severe multicollinearity discussed above. In exemplifying the method of ridge analysis (Hoerl & Kennard, 1970a, 1970b), the errors associated with non-experimentally collected data are noted: (X'X) is not nearly a unit, or identity, matrix. Weighting coefficients derived from such a matrix are often of incorrect sign and have inflated values, as was noted before. The undesirable nature of such weights is expressed by Hoerl and Kennard (1970a):

. . . the least squares estimates [which] often do not make sense when put into the context of the physics, chemistry, and engineering of the process which is generating the data. In such cases, one is forced to treat the estimated predicting function as a black box or to drop factors to destroy the correlation bonds among the x_i used to form X'X. Both these alternatives are unsatisfactory if the original intent was to use the estimated predictor for control and optimization (p. 55).

The suggestion offered in such cases is the use of

(37)  \hat{\beta}^{*} = (X'X + kI)^{-1} X'y, \qquad k \geq 0

for estimating beta weights rather than equation (8). This procedure, it is claimed, modifies the weight values such that they are less extreme in absolute value and thus necessarily have reduced variance. The technique can also be used to generate a trace of the effects of increasing k values on the coefficients, which portrays the differential effect on each. Hoerl and Kennard (1970a) contend that by reducing the variability of the coefficients a more accurate estimate of the parameter values can be obtained, although the resultant estimates are biased. The derivation of this approach considers the variance of the coefficient vector \hat{\beta} (equation 11) and notes that the expected value of the squared distance (L^2) from \hat{\beta} to \beta is

(38)  E(L^2) = \sigma^2 \, \mathrm{Trace}\,(X'X)^{-1}

or equivalently,

(39)  E(\hat{\beta}'\hat{\beta}) = \beta'\beta + \sigma^2 \, \mathrm{Trace}\,(X'X)^{-1}.

Hoerl and Kennard also demonstrate that (38) is equivalent to

(40)  E(L^2) = \sigma^2 \sum_{i=1}^{p} \lambda_i^{-1}.

The lower bound for the average squared distance between the sample coefficients and the parameters is given by σ²/λ_min. This corresponds to the previous discussion of multicollinearity, wherein it was noted that small eigenvalues are one of the best indicants of unstable weights. The authors' suggestion of augmenting the diagonal of (X'X) with small positive quantities (0 ≤ k ≤ 1) has the effect of decreasing the diagonal elements of the predictor inverse matrix. This in turn deflates the absolute values of the beta weights and reduces their collective variance. Hoerl and Kennard (1970a, p. 60) demonstrate that the expected squared distance from \hat{\beta}^{*} to \beta is composed of two elements: the total variance of the parameter estimates and the square of the bias introduced by the non-least squares computation (equation 37). Thus when k = 0 and OLS estimates are calculated, the bias is zero. The authors show with an existence theorem that it is possible to select k greater than zero, accept a little bias, and, without greatly inflating the residual error variance for the equation, obtain \beta estimates with substantially lower mean square error (L^2). The problem is then one of selecting an optimal k value for use in any one matrix.
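Equation (37) and the trace term in equations (38) and (40) can be made concrete with a few lines of linear algebra. The sketch below (Python/NumPy) works in the standardized, correlation-matrix form used throughout this thesis; the two-predictor correlation values are an invented illustration, and the helper names are mine rather than anything from the original program.

    import numpy as np

    def ridge_beta(R_xx, r_xy, k):
        """Standardized ridge weights (R_xx + kI)^-1 r_xy, as in equation (37); k = 0 gives OLS."""
        p = R_xx.shape[0]
        return np.linalg.solve(R_xx + k * np.eye(p), r_xy)

    def expected_sq_distance_ols(R_xx, sigma2):
        """E(L^2) = sigma^2 * trace(R_xx^-1) = sigma^2 * sum(1/lambda_i), equations (38) and (40)."""
        lam = np.linalg.eigvalsh(R_xx)
        return sigma2 * np.sum(1.0 / lam)

    # toy correlation structure: two nearly redundant predictors
    R_xx = np.array([[1.00, 0.95],
                     [0.95, 1.00]])
    r_xy = np.array([0.60, 0.58])

    print("OLS weights    :", ridge_beta(R_xx, r_xy, 0.0))
    print("ridge, k = .10 :", ridge_beta(R_xx, r_xy, 0.1))
    print("E(L^2) for OLS (sigma^2 = 1):", expected_sq_distance_ols(R_xx, sigma2=1.0))

The OLS weights split the nearly redundant pair into one large and one small (or wrong-signed) coefficient, while even a small k pulls both toward moderate, similar values; the large trace term shows why the OLS solution is expected to wander far from the parameters.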
The suggestion of Hoerl and Kennard (1970b) is to use a graphic display of the effects of increasing k and note the point at which four conditions are met: (a) the characteristics of the graph will be those of an orthogonal system, (b) coefficients will have reasonable absolute values, (c) coefficients with apparently incorrect signs at k = 0 will have changed to correct signs, and (d) the residual sum of squares will not have inflated to an unreasonable value. The plot (Hoerl & Kennard, 1970b, Figure 2, p. 72) shows the values of each of 10 coefficients plotted against the value of k in equation (37) which produced them. They would advocate selecting the beta weights produced by the equation with a k of approximately .25, that is, after the point of maximum decline in absolute value is passed and the coefficients are seen to be visually stable. Assessments which Hoerl and Kennard (1970b) make on the basis of this graph exemplify the utility of the procedure:

(i) The coefficients from the ordinary least squares are undoubtedly overestimated. At least, they are collectively not stable. It is unlikely that another set of y's would give coefficients like these. Moving a short distance from the least squares point k = 0 shows a rapid decrease in absolute value of at least two of them, namely, those for factors 5 and 6. Figure 2 shows the decrease in the squared length of the coefficient vector with k. When k = .1, it is 43.3% of its original value; for an orthogonal system it would be 83%.

(ii) Factor 5 has the negative coefficient with the largest value. But the addition of k > 0 quickly drives it toward zero and it then becomes positive. Such action should not be surprising, especially when it is compared with the action of factor 6. Factor 6 also decreases rapidly but stabilizes and does not go down to zero. Factors 5 and 6 have a simple correlation coefficient of 0.84, which says that to a first approximation they are the same factor but with different names. It would be surprising if their true effects were opposite in sign. (Without a knowledge of the underlying technology, no definitive statement can be made.) The covariance of -4.33 is driving them apart so that they are opposite in sign. The phenomenon observed here is not atypical. Positive coefficients for highly correlated factors can be stable as a sum, especially when they are correlated to various degrees with other factors.

(iii) The correlations with other factors cause factor 1 to be underestimated. At k = 0 factor 1 is the second least important negative factor. But with the addition of k > 0 it increases in absolute value. The other negative factors are slightly overestimated, and when sufficient k > 0 has been added to stabilize the system, factor 1 becomes the most important negative factor.

(iv) Factor 7 is overestimated and is driven toward zero.

(v) At a value of k in the interval (0.2, 0.3) the system has stabilized, and coefficients chosen from a k in this range will undoubtedly be closer to \beta and more stable for prediction than the least squares coefficients or some subset of them (pp. 71-72).
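The trace Hoerl and Kennard describe is simply the family of curves obtained by re-solving equation (37) over a grid of k values. A minimal sketch follows (Python/NumPy with matplotlib assumed available); the three-predictor correlation matrix is a made-up stand-in, not the ten-factor example quoted above.

    import numpy as np
    import matplotlib.pyplot as plt

    # illustrative standardized data: R_xx = predictor intercorrelations, r_xy = validities
    R_xx = np.array([[1.00, 0.84, 0.30],
                     [0.84, 1.00, 0.25],
                     [0.30, 0.25, 1.00]])
    r_xy = np.array([0.50, 0.45, 0.30])

    ks = np.arange(0.0, 1.01, 0.01)
    trace = np.array([np.linalg.solve(R_xx + k * np.eye(3), r_xy) for k in ks])

    for j in range(trace.shape[1]):
        plt.plot(ks, trace[:, j], label=f"predictor {j + 1}")
    plt.axhline(0.0, color="grey", linewidth=0.5)
    plt.xlabel("k")
    plt.ylabel("ridge coefficient")
    plt.title("Ridge trace: coefficients stabilize as k increases")
    plt.legend()
    plt.show()

Reading the plot in the manner of the quoted passage amounts to finding the smallest k beyond which the curves flatten and no coefficient retains an implausible sign or magnitude.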
Several authors have utilized the ridge regression (RR) technique. Churchill (1975) used 3001 cases selected in samples of size 50 and calculated ridge coefficients for 13 predictors. His results demonstrated a departure from the parameter values 1.7 times higher for OLS as opposed to RR. Vinod (1976), who used a modified RR method which selected arbitrary k values based on rank reduction analyses, Marquardt and Snee (1975), McDonald and Schwing (1973), and Snee (1973) all reported superiority of RR over OLS. Several researchers have attempted to develop point estimates of k (Baldwin, 1975; Hoerl & Kennard, 1976; Lawless & Wang, 1976; McDonald & Galarneau, 1975; Newhouse & Oman, 1971), but these attempts uniformly assume that k is non-stochastic (Coniffe & Stone, 1973; Smith, 1976)--an inadmissible assumption. Further, these more exhaustive studies do not invariably demonstrate RR as superior to OLS. Thus, while virtually all investigators consider RR to be an instructive mode of analysis, and most contend that it is preferable to OLS in all nonorthogonal situations, it still remains as much an art (the selection of k) as a science.

The most practical suggestion is probably Marquardt's (1970) variance inflation factor (VIF), which was mentioned earlier. This value for the ith predictor is the ith diagonal element of the matrix (X'X + kI)^{-1}(X'X)(X'X + kI)^{-1}. Just as the diagonal elements of (X'X)^{-1} in standard form range from one to infinity as collinearity increases, so do these VIF values. Marquardt's suggestion is that k be selected at the point where these values are ". . . reasonable, certainly less than 10 . . ." (p. 609). Evaluation of the VIF, along with the eigenvector weights associated with small eigenvalues (Snee, 1973; Webster, Gunst, & Mason, 1974), would appear to be the most reasonable way of ascertaining which VIF's should be deflated the most and therefore which k value should be selected.

The research reported here proposed to evaluate the relative efficiency of ridge regression as compared with ordinary least squares. In addition, unit weighting was contrasted with these methods both because of its demonstrated utility and because it should in fact be most efficient in exactly the high collinearity situations for which RR was proposed (Wainer, 1976; Wainer & Thissen, 1976).

CHAPTER III

METHOD

Consideration of the possible approaches to these comparisons favored a Monte Carlo study in which the sample size and collinearity could be controlled. Accordingly, three matrices were selected from the literature. The factor structure for each of these matrices was input to the Ohio State Correlated Score Generation Program (Wherry, Naylor, Wherry, & Fallis, 1965), which produces multiple random samples corresponding to the structure. Generated samples from each of the three matrices were pooled to form three populations of 6000 cases each. Each matrix selected had 10 predictors, so that a total of 165 correlation coefficients were estimated. The maximum obtained discrepancy between a target and an estimated population correlation was .031.

The Population Matrices

The first matrix selected was used as an example by Hoerl and Kennard (1970a) and was taken from Gorman and Toman (1966). This matrix (hereafter referenced as HOPOP, Table 3) was selected both because of its previous use as an RR supportive example and because of its broad range of predictor intercorrelations. Two other matrices were selected so as to broaden the scope of the comparisons.
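The population-generation step relied on the Wherry et al. (1965) program. A present-day reader can approximate the same step by factoring the target correlation matrix and transforming independent normal deviates, as in the sketch below (Python/NumPy). Only the pooled size of 6000 cases comes from the text; the function name, the seed, and the placeholder three-variable target matrix are assumptions made for illustration.

    import numpy as np

    def generate_population(R_target, n_cases, seed=0):
        """Draw n_cases multivariate-normal scores whose population correlation matrix is R_target."""
        rng = np.random.default_rng(seed)
        L = np.linalg.cholesky(R_target)          # R_target must be positive definite
        Z = rng.normal(size=(n_cases, R_target.shape[0]))
        return Z @ L.T

    # an 11-variable target (10 predictors plus criterion) would be supplied here
    R_target = np.eye(3) * 0.5 + 0.5              # placeholder 3-variable matrix
    pop = generate_population(R_target, n_cases=6000)

    R_obtained = np.corrcoef(pop, rowvar=False)
    print("largest |target - obtained| correlation:",
          np.abs(R_obtained - R_target).max().round(3))

The final print mirrors the check reported in the text, where the largest discrepancy between a target and an obtained population correlation was .031.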
These matrices were considered more typical of those generally encountered in psychological and sociological applications.

[Table 3. Population matrix based on the Hoerl-Kennard data (taken from Hoerl & Kennard, 1970a): predictor intercorrelations, criterion validities, population beta weights, multiple R, R2, determinant, and eigenvalues for the 10 predictors.]

Table 4 illustrates the high average intercorrelation matrix reproduced from the factor structure of a matrix employed by Rock, Linn, Evans, and Patrick (1970) and originally taken from Klein and Evans (1969). Table 5 is the low average intercorrelation matrix used by Rock et al. (1970) and taken from Klein and Evans (1968). These two matrices (HIPOP and LOPOP, respectively), incorporating two other predictors which were deleted for the present research, were selected by Rock et al. (1970) to evaluate four methods of predictor selection because of their representativeness. It was felt that these three data sets constituted a reasonable sample from the domain of possible matrices of interest to researchers in the social sciences. The HOPOP matrix, with its negative intercorrelations and validities, is atypical of most psychological data but does characterize occurrences in the economics and management literature. Additionally, its use by Hoerl and Kennard (1970a) as an RR example without benefit of comparison with other techniques warrants its inclusion.

Eigenanalyses of the HIPOP and LOPOP matrices (Table 6) illustrate their salient features. Both data sets differ from HOPOP in that their ranges of intercorrelation are more restricted, typifying the data encountered in psychological and measurement studies. The first eigenvalue of the HIPOP matrix accounts for 67 percent of the total variance, while the first four roots of the LOPOP data set account for only 63 percent of its variance. Thus, by any accepted definition, the HIPOP matrix would be considered highly multicollinear while the LOPOP matrix is less severely afflicted. The fact that six of its roots combined account for less than 37 percent of the possible variance, however, indicates that weight estimation is likely to be adversely affected.

[Table 4. Population matrix based on the high average intercorrelation (HIPOP) data (taken from Rock, Linn, Evans, & Patrick, 1970): predictor intercorrelations, criterion validities, population beta weights, multiple R, R2, determinant, and eigenvalues.]

[Table 5. Population matrix based on the low average intercorrelation (LOPOP) data (taken from Rock, Linn, Evans, & Patrick, 1970): predictor intercorrelations, criterion validities, population beta weights, multiple R, R2, determinant, and eigenvalues.]
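The eigenanalysis summaries quoted above (for example, the first HIPOP root accounting for 67 percent of the predictor variance) come directly from the spectrum of the predictor correlation matrix. A short sketch of that computation (Python/NumPy); the matrix shown is an arbitrary example, not HIPOP or LOPOP:

    import numpy as np

    def eigen_summary(R_xx):
        """Eigenvalues of a predictor correlation matrix and cumulative variance proportions."""
        lam = np.sort(np.linalg.eigvalsh(R_xx))[::-1]     # largest root first
        prop = lam / lam.sum()                            # lam.sum() equals the number of predictors
        return lam, np.cumsum(prop)

    R_xx = np.array([[1.00, 0.70, 0.60],
                     [0.70, 1.00, 0.65],
                     [0.60, 0.65, 1.00]])
    lam, cum = eigen_summary(R_xx)
    print("eigenvalues           :", lam.round(3))
    print("cumulative proportion :", cum.round(3))

A dominant first root with several very small trailing roots is the pattern labeled highly multicollinear in the discussion above.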
Table 6
Eigenvectors of the Population Matricesa

            HIPOP
Variable
 1    .31  .38 -.38 -.23  .09  .34  .57 -.29  .05  .17
 2    .31  .05  .37 -.16 -.84  .07  .12  .04  .07  .09
 3    .34  .03  .23  .05  .09  .12  .39 -.36  .71  .09
 4    .34  .06  .16  .34  .19 -.18  .06 -.46  .37  .56
 5    .33  .09  .17  .24  .25  .54  .01  .63  .02  .19
 6    .33  .09  .10  .30  .07 -.65  .37  .30  .21  .30
 7    .35 -.13  .08  .03  .08  .09  .32 -.13  .53  .66
 8    .30  .42 -.40 -.37 -.03 -.31  .48  .23  .08  .21
 9    .26 -.55 -.64  .34 -.28  .06  .04  .02  .11  .09
10    .28 -.58  .16 -.63  .29 -.11  .18  .07  .02  .16

            HOPOP
 1   -.41  .36 -.14  .07  .02 -.36  .15  .03  .60  .40
 2    .01  .10  .77  .09  .33 -.22  .29  .37  .04  .11
 3   -.39  .14  .07  .20  .02  .61  .42  .48  .08  .02
 4   -.06 -.02 -.36 -.50  .76  .02  .01  .19  .05  .04
 5    .47 -.06  .00  .16  .18  .03  .26 -.03  .69  .41
 6    .46 -.28  .11 -.04  .05  .05  .19  .20  .05  .78
 7    .21  .59  .12  .14  .30  .40  .09 -.52  .09  .19
 8   -.25 -.57  .09  .07  .12  .43  .51 -.24  .28  .07
 9    .04  .17  .30 -.80 -.36  .22  .02 -.03  .25  .04
10    .36  .24 -.36  .09 -.21  .22  .59  .48  .01  .06

            LOPOP
 1    .24  .46  .35  .01 -.36  .43  .46  .15  .07  .23
 2    .07  .73  .22 -.14  .01 -.34  .46 -.11  .01  .23
 3    .46 -.13 -.04  .14 -.08  .24  .04 -.03  .02  .82
 4    .44  .03 -.11  .03  .22  .10  .15  .12  .80  .23
 5    .38  .12 -.40  .11  .20 -.06  .16  .57  .49  .18
 6    .36  .09 -.04 -.04  .43 -.39  .55 -.45  .13  .07
 7    .27 -.25  .14  .19 -.61 -.64  .04  .14  .08  .09
 8    .13 -.32  .61 -.54  .29 -.07  .02  .35  .07  .06
 9    .11  .02 -.48 -.79 -.35  .02  .05 -.10  .02  .01
10    .39 -.23  .17  .02 -.10  .26  .47 -.52  .29  .33

aEach population contained 6000 cases.

Samples

Twenty-five samples of sizes 30, 60, 90, 120, and 200 were drawn from each population using a random sampling procedure available in the SPSS package (Nie et al., 1975). These sample sizes were selected to span the range of values for which unit weights have been demonstrated to be superior to OLS (Schmidt, 1971). Each sample (375 in all) was standardized and input to a program written by the author for the necessary least squares, unit weight, and ridge regression computations.

Equation Estimation

Five equations were estimated in each of the samples. OLS weights and the multiple R they produced were calculated according to equation (8). Three unit weight equations were also estimated in each sample. The first equation was produced by assigning the signs of the fallible sample beta weights (BU) to the unit weighting coefficients. The sign of each validity coefficient (VU) in each sample was also used to determine the unit weight signs. Third, the infallible population beta weight signs (PU) were employed. This third method implies that the investigator has prior information as to the correct signs, presumably on the basis of previous experience with the variables. These three methods were selected because, in an applied situation, they correspond to the manner in which one would generally determine the unit signs. For the ridge regression equation in each sample, the value of k in equation (37) was determined by the following rule: select the largest k possible (in steps of .01) with the restriction that no diagonal element of (X'X + kI)^{-1}(X'X)(X'X + kI)^{-1} is less than 1.0. In keeping with the earlier discussion of VIF's (Marquardt, 1970; Snee, 1973), pilot work was done experimenting with a variety of selection rules based on these values. It was found that the above rule always selected reasonable k values at a point just slightly lower than a visual trace examination would suggest.
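The selection rule just stated is easy to restate as a loop over k. The sketch below (Python/NumPy) is a reconstruction of the rule as described, not the author's original program, and the function names and example matrix are mine: it computes the ridge VIFs, the diagonal of (R + kI)^{-1} R (R + kI)^{-1}, at each step of .01 and keeps the largest k for which no VIF has fallen below 1.0.

    import numpy as np

    def ridge_vifs(R_xx, k):
        """Diagonal of (R + kI)^-1 R (R + kI)^-1: Marquardt's VIFs for the ridge solution."""
        A = np.linalg.inv(R_xx + k * np.eye(R_xx.shape[0]))
        return np.diag(A @ R_xx @ A)

    def select_k(R_xx, step=0.01, k_max=1.0):
        """Largest k (in steps of .01) such that no ridge VIF is less than 1.0."""
        best = 0.0
        for k in np.arange(0.0, k_max + step / 2, step):
            if ridge_vifs(R_xx, k).min() >= 1.0:
                best = k
            else:
                break
        return best

    R_xx = np.array([[1.00, 0.90, 0.30],
                     [0.90, 1.00, 0.35],
                     [0.30, 0.35, 1.00]])
    k_star = select_k(R_xx)
    print("selected k     :", round(k_star, 2))
    print("VIFs at that k :", ridge_vifs(R_xx, k_star).round(2))

At k = 0 every VIF is at least 1.0 (they equal the diagonal of R^{-1}); because they shrink as k grows, the rule stops just before any predictor's VIF drops below the orthogonal value of 1.0.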
This criterion is also in keeping with more recent analytical attempts to define k, which have generally found that less bias (i.e., small k values) can adequately handle the problems of multicollinearity (Guilkey & Murphy, 1975; McDonald & Galarneau, 1975).

Data Analysis

Virtually all studies evaluating ridge regression to date have, at least implicitly, been concerned only with structural interpretation. Previous Monte Carlo studies (Hoerl, Kennard, & Baldwin, 1976; McDonald & Galarneau, 1975) which had available the true parameter values of \beta based their evaluations on the mean square error (MSE) criterion, i.e., \sum_{i=1}^{p} (\hat{\beta}_i - \beta_i)^2, with the \hat{\beta}_i produced by either OLS or RR. While this comparison statistic accurately reflects the average precision of the \hat{\beta}_i point estimates, it does not provide for assessment of the predictive utility of the overall linear combination. It is possible that while one method of estimating the coefficients will have a lower MSE than another, the predictive utility of the latter will be superior. Because of this consideration, the predictive ability of all five equations was evaluated as well as the MSE. As the RR procedure necessarily decrements the coefficient of determination in the estimation sample as compared to that of OLS, these initial R and R2 values were evaluated. Due to the overfitting of the regression surface in the estimation sample discussed earlier, a practical measure of an equation's utility is its performance in a cross-validation sample. However, as Schmidt (1971) has noted, a researcher is not interested in how a set of weights does in a single random replication sample but rather in how it performs in the long run, i.e., how it compares with the predictive utility of the infallible population weights. Accordingly, the equations estimated for each sample were cross-validated in the populations from which they were drawn. The formula for the cross-validated multiple R is (Nunnally, 1967)

(41)  R_w = \frac{w'(X'y)_{pop}}{\sqrt{w'(X'X)_{pop}\,w}}

where w is the appropriate vector of unit, RR, or OLS weights.

The final comparison statistic, like MSE, is applicable only to the RR and OLS results. The coefficient of variation proposed by Churchill (1975) is calculated by dividing the square root of the MSE for a single coefficient by the true parameter value. These coefficients of variation (CV) can then be averaged over predictors, sample sizes, and/or populations for summary purposes.
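Both comparison statistics can be written directly from the population moment matrices. The sketch below (Python/NumPy) uses generic function names and invented two-predictor values purely for illustration; it evaluates equation (41) for any fixed weight vector and computes the MSE criterion for a stack of coefficient estimates.

    import numpy as np

    def population_cross_R(w, Rxx_pop, rxy_pop):
        """Equation (41): validity of the fixed composite Xw in the population,
        R_w = w'r_xy / sqrt(w'R_xx w), with the moments in correlation (standardized) form."""
        return (w @ rxy_pop) / np.sqrt(w @ Rxx_pop @ w)

    def mean_square_error(beta_hats, beta_true):
        """MSE criterion: average over replicated samples of sum_i (b_i - beta_i)^2."""
        d = np.asarray(beta_hats) - beta_true          # one row of estimates per sample
        return np.mean(np.sum(d ** 2, axis=1))

    # illustrative population moments only
    Rxx_pop = np.array([[1.0, 0.8],
                        [0.8, 1.0]])
    rxy_pop = np.array([0.60, 0.55])

    w_unit = np.array([1.0, 1.0])
    w_ols = np.linalg.solve(Rxx_pop, rxy_pop)           # infallible population weights
    print("R_w, unit weights      :", round(population_cross_R(w_unit, Rxx_pop, rxy_pop), 3))
    print("R_w, population weights:", round(population_cross_R(w_ols, Rxx_pop, rxy_pop), 3))
    # mean_square_error would be applied to the 25 sets of sample estimates in each design cell

Because the weights are treated as fixed, equation (41) needs only the population matrices, which is what makes long-run cross-validation in the generated populations feasible.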
CHAPTER IV

RESULTS AND DISCUSSION

Estimation sample results for the five equations discussed above are presented in terms of the obtained mean coefficients of determination in Table 7. As noted above, the bias factor due to the k value in equation (37) results in a lower R2 for ridge regression than for ordinary least squares in all cases. The average values of these differences for 25 samples are presented for all sample sizes and for the three populations in Table 8. The magnitude of the positive values in Table 8 reflects the higher average obtained R2 values for OLS over 25 samples for each of the four alternative weighting methods. It should be noted that the lower obtained values for RR equations (lower by approximately .038, on average) will in general make them better estimates of the population cross-validity, due to the overfitting of sampling error present in the estimation sample. Possible exceptions to this conclusion are cases in which either an OLS or RR initial equation cross-validates upward in terms of R2. This in fact occurred, on the average, for samples of sizes 90, 120, and 200 drawn from the HOPOP matrix and estimated by RR. The special nature of this population will be discussed below. A final difference to be noted between OLS and RR in Table 7 is the reduced range of R2 estimates provided by RR over the different sample sizes.

Table 7
Initial R2 - LS, RIDGE, BU, VU, PUa

                            Sample Size
Population   Equation     30     60     90     120    200
HIPOP        LS          .663   .615   .578   .544   .527
             RIDGE       .587   .562   .540   .513   .502
             BU          .415   .478   .468   .463   .463
             VU          .466   .485   .486   .471   .470
             PU          .483   .483   .493   .477   .476
HOPOP        LS          .933   .918   .911   .907   .902
             RIDGE       .869   .864   .857   .858   .856
             BU          .641   .672   .683   .692   .696
             VU          .688   .642   .636   .651   .674
             PU          .698   .700   .700   .697   .696
LOPOP        LS          .434   .307   .265   .228   .208
             RIDGE       .395   .291   .257   .222   .204
             BU          .241   .198   .191   .161   .158
             VU          .238   .184   .169   .154   .146
             PU          .125   .149   .150   .150   .155

Note. All entries are mean values based on 25 samples.
aLS = ordinary least squares; RIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients; PU = unit weights with signs determined by infallible population beta weights.

Table 8
Mean Initial R2 Superiority of Least Squares Over RIDGE, BU, VU, PUa

                            Sample Size
Population   Equation     30     60     90     120    200    Average
HIPOP        RIDGE       .076   .053   .038   .031   .025    .045
             BU          .248   .137   .100   .081   .063    .128
             VU          .197   .130   .092   .071   .057    .109
             PU          .180   .132   .085   .067   .051    .103
HOPOP        RIDGE       .064   .054   .054   .049   .046    .053
             BU          .292   .246   .228   .215   .206    .237
             VU          .245   .276   .275   .256   .228    .256
             PU          .235   .218   .211   .210   .206    .216
LOPOP        RIDGE       .039   .016   .008   .006   .004    .015
             BU          .193   .109   .074   .067   .050    .099
             VU          .196   .123   .096   .074   .062    .110
             PU          .305   .158   .115   .078   .053    .142
AVERAGE      RIDGE       .060   .041   .033   .029   .025    .038
             BU          .244   .164   .137   .121   .106    .155
             VU          .213   .176   .154   .134   .116    .159
             PU          .240   .169   .137   .118   .103    .154

Note. All entries are mean values based on 25 samples.
aRIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients; PU = unit weights with signs determined by infallible population beta weights.

It appears, then, that RR is somewhat less sensitive to the size of the sample in which weights are estimated than is OLS. Over the three populations and five sample sizes, the range of ridge estimated coefficients of determination is approximately 37 percent less than that of the OLS estimates. All three unit weight equations demonstrate the same relative indifference to sample size and, in some cases to be discussed below, provide better estimates of actual utility than either OLS or RR.

For predictive purposes the R2 obtained in the initial sample is typically not of interest beyond indicating whether the linear combination of predictor variables has any utility at all. Cross-validated (typically in only a single holdout sample) or formula-estimated coefficients of determination are the usual criteria for utility decisions. In general, the latter approach has been shown to be preferable (Schmitt, Coyle, & Rauschenberger, 1977); however, in a Monte Carlo study such as considered here, one has available the actual population matrix, which obviates the need for estimates of long-term cross-validated efficiency. Table 9 presents the results, for the four relevant equation types, of applying the sample estimates to the population from which the data were drawn.
As this step concerns validation of sample-dependent values, the unit weight equations signed by the infallible population beta weights (PU in Tables 7 and 8) are not evaluated. Table 10 contains the average differences between OLS and the RR equations, the unit weights signed by the sample validity coefficients (VU), and the weights signed by the sample beta weights (BU). Negative entries in Table 10 indicate that the equation in question obtained a higher average cross-validated R2 than did OLS for the same population and sample size.

Table 9
Cross-Validated R2 - LS, RIDGE, BU, VUa

                            Sample Size
Population   Equation     30     60     90     120    200
HIPOP        LS          .345   .405   .447   .466   .481
             RIDGE       .428   .456   .474   .482   .489
             BU          .231   .326   .358   .390   .412
             VU          .465   .465   .465   .465   .465
HOPOP        LS          .845   .872   .879   .886   .890
             RIDGE       .838   .858   .867   .875   .880
             BU          .574   .640   .673   .684   .697
             VU          .601   .566   .588   .612   .659
LOPOP        LS          .050   .085   .111   .130   .147
             RIDGE       .056   .089   .113   .132   .148
             BU          .040   .056   .075   .092   .113
             VU          .093   .116   .124   .128   .135

Note. All entries are mean values based on 25 samples.
aLS = ordinary least squares; RIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients.

Table 10
Mean Cross-Validated R2 Superiority of Least Squares Over RIDGE, BU, VUa

                            Sample Size
Population   Equation     30     60     90     120    200    Average
HIPOP        RIDGE      -.083  -.051  -.027  -.016  -.008   -.037
             BU          .144   .079   .089   .076   .069    .085
             VU         -.120  -.060  -.018   .001   .016   -.036
HOPOP        RIDGE       .007   .014   .012   .011   .010    .011
             BU          .271   .232   .206   .202   .193    .221
             VU          .244   .306   .291   .274   .231    .269
LOPOP        RIDGE      -.006  -.004  -.002  -.002  -.001   -.003
             BU          .010   .029   .036   .038   .034    .029
             VU         -.043  -.031  -.013   .002   .012   -.015
AVERAGE      RIDGE      -.027  -.014  -.006  -.002   .000   -.010
             BU          .132   .113   .110   .105   .099    .112
             VU          .027   .072   .087   .092   .086    .073

Note. All entries are mean values based on 25 samples.
aRIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients.

Inspection of Tables 9 and 10 shows that the BU equations are generally the poorest while RR provides the best average results. These results are more evident if one ignores the obtained values for the Hoerl and Kennard (1970b) population. In the high and low intercorrelation populations, ridge regression outperforms OLS by a small margin (.02) for all sample sizes. In the HOPOP matrix the situation is reversed, with OLS demonstrating a slight superiority (.01) over RR.

As Tables 7 through 10 concern predictive utility rather than structural interpretation, it is at this point that the efficiency of the various unit weighting schemes must be considered. Schmidt (1971) noted that with simulated data such as presented here, violations of the assumptions of multiple regression (linearity, homogeneity, and normality of conditional variances) cannot occur. Such violations apparently occur in approximately 20 percent of actual empirical data sets (Sevier, 1957; Schmidt, 1971; Tupes, 1964), and their effect is to attenuate the predictive utility of OLS. Therefore, in this simulation, differences between OLS obtained R2 values and those of unit weights should be taken as maximal estimates. In practice, OLS will be somewhat less efficient than is indicated here. In the HIPOP and LOPOP matrices (Tables 9 and 10) the results for the unit weighting methods are similar to those reported by Schmidt (1971).
As concluded in that study, when no suppressor effects are present, a sample size of approximately 180 is necessary before OLS will demonstrate a distinct superiority over unit weights. In Table 9 the high and low intercorrelation populations show OLS to be useful upon cross-validation in the range between 120 and 200 cases. It is also concluded from these tables (9 and 10) that signing unit weights with the sign of the sample beta weight estimate is not generally advantageous. This is congruent with Hoerl and Kennard's (1970a, 1970b) rationale for RR; that is, when collinearity is high, the sample beta weights will frequently exhibit incorrect signs and indicate excessive suppressor effects. Thus, when previous experience with the variables permits one to decide the sign of the unit weight for each predictor, these signs should be employed. This method is, conceptually at least, preferable to both the BU and VU sign assignment, as it is independent of sampling fluctuations. In practice, many uses of MR involve variables (as predictors or criteria) for which one could not confidently decide on their signs before analysis (Roose & Doherty, 1976). The conclusion to be drawn from this study is that the next best alternative is to use the sign of each predictor's zero-order validity coefficient.

The Hoerl and Kennard (1970b) population matrix (HOPOP in Tables 7 through 10) presents several contradictions to the above mentioned conclusions. This population is not typical of those encountered in social science data; its coefficient of determination is higher than the norm, four of the validities are negative, and five variables in the population are identified as suppressors (see Table 3). It is ironic that this matrix was chosen by Hoerl and Kennard (1970b) as an example of the advantages of RR over OLS. Across all sample sizes investigated in this study, RR equations based on random samples from the HOPOP matrix are dominated by OLS. Ordinary least squares also demonstrates higher cross-validated coefficients of determination than do either beta weight or validity signed unit weights. With regard to the RR results, it must be concluded that the biasing factor of equation (37) "overcorrected" the weights of some predictors in this population and thus reduced the cross-validated R2. This occurrence emphasizes the need for an analytical determination of an optimal biasing parameter (k) which ideally could adopt different values for different predictors. This would seem to be indicated as advantageous for sample data from a matrix such as HOPOP, where the collinearity is not uniform across the predictors. A mixture of high and low pairwise intercorrelations (Table 3) presumably requires a variable bias factor. This conclusion is supported by the results for the HIPOP and LOPOP matrices, both of which demonstrated RR as superior to OLS upon cross-validation of the sample equations in the population. These discrepancies further demonstrate that the determinant is an insufficient indicant of the degree of collinearity insofar as its value might be used to determine whether OLS or RR should be applied. The LOPOP population actually has a determinant 48 times larger (indicating less severe multicollinearity) than the HOPOP matrix, yet RR was superior on the LOPOP samples and not on the HOPOP samples.

Conclusions as to the predictive utility of these various combinatorial schema would seem to be as follows:
1. In agreement with Schmidt (1971), unit weights should be employed in samples of under approximately 200 cases.

2. In the absence of prior knowledge, validity coefficients should be used to determine the sign of each predictor's unit weight.

3. The predictive utility of ridge regression, while superior to OLS and unit weights in some instances, would not seem to be great enough to warrant its use. If an analytic determination of the bias parameter k is developed, ridge regression would seem to be practical for analyses in which the intercorrelation is both "high" and consistent throughout the matrix and sample size is not very large. While this study is not conclusive, it appears that RR is most useful for predictive purposes in the same sample size range as are unit weights.

The focus of the discussion now turns to consideration of the accuracy of weight estimation by the OLS and RR methods, as opposed to the predictive utility of their respective linear combinations. Table 11 presents the average mean square error (MSE) of estimation, the average bias factor (k), and the calculated average sum of reciprocals of the eigenvalues for each population and sample size. It should be recalled that the selection of the value for k determines, along with sample-specific collinearity, the value which will result for MSE. Thus, as long as an analytic solution for the bias factor is not available, individuals may rightly argue for the appropriateness of values other than those employed here. It is considered, however, that the method of determining k employed in this study yields reasonable results which are consistent with published uses of RR. Further, as noted by Churchill (1975), a Monte Carlo study is potentially susceptible to the criticism of optimizing the selection of the bias factor so as to conform to the population specifications. Thus, it is argued that the arbitrariness of a "reasonable" selection rule such as used herein will permit greater generalizability of results as we await a solution to the problem of analytically optimizing k on the basis of sample information only.

The mean square error values in Table 11 can be interpreted as a summary measure of the accuracy with which weights were estimated.

Table 11
Equation Mean Square Errorsa

                        HIPOP                  HOPOP                  LOPOP
Sample
Size    Equation   MSE    Σλ^-1    k      MSE    Σλ^-1    k      MSE    Σλ^-1    k
30      LS        .085    51.15          .020    54.46          .064    20.47
        RIDGE     .023    23.67   .15    .041    27.04   .08    .043    16.56   .07
60      LS        .489   125.34          .009    43.78          .027    16.45
        RIDGE     .033    23.01   .15    .029    26.85   .06    .022    14.75   .05
90      LS        .023    36.15          .006    39.48          .016    15.54
        RIDGE     .009    20.87   .14    .025    25.64   .06    .014    14.52   .03
120     LS        .013    34.75          .004    38.89          .010    15.19
        RIDGE     .006    20.43   .14    .021    26.42   .05    .009    14.27   .03
200     LS        .008    32.80          .002    37.05          .006    14.43
        RIDGE     .004    19.87   .15    .018    26.23   .05    .005    13.84   .02

Note. All entries are mean values based on 25 samples.
aLS = ordinary least squares; RIDGE = ridge regression; Σλ^-1 = sum of the reciprocals of the sample eigenvalues; k = average value of k in the expression (X'X + kI)^{-1}X'y.

MSE is equal to the sampling variance of the estimates about their mean plus the squared bias. Thus, the smaller the MSE value for a particular cell, the better the average beta estimate was when one averages errors over the 10 coefficients in each sample and the 25 samples per cell.
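The decomposition just stated, mean square error as sampling variance plus squared bias, can be verified per coefficient from the replicated sample estimates in a cell. The sketch below (Python/NumPy) is a hedged illustration: the array shapes and the fabricated estimates are assumptions, not the thesis' data files.

    import numpy as np

    def precision_stats(beta_hats, beta_true):
        """Per-coefficient MSE, sampling variance, and squared bias over replicated samples.
        beta_hats: (n_samples, p) array of estimates; beta_true: length-p parameter vector."""
        beta_hats = np.asarray(beta_hats)
        mean_hat = beta_hats.mean(axis=0)
        variance = beta_hats.var(axis=0)                  # spread of estimates about their own mean
        bias_sq = (mean_hat - beta_true) ** 2             # squared bias of the average estimate
        mse = ((beta_hats - beta_true) ** 2).mean(axis=0)
        return mse, variance, bias_sq                     # mse == variance + bias_sq

    # toy check: fabricated estimates for two coefficients over 25 samples
    rng = np.random.default_rng(3)
    est = rng.normal(loc=[0.45, 0.20], scale=[0.10, 0.05], size=(25, 2))
    mse, var, b2 = precision_stats(est, beta_true=np.array([0.50, 0.25]))
    print("identity holds:", np.allclose(mse, var + b2))

Summing the per-coefficient MSE values over the 10 predictors and averaging over the 25 samples in a cell gives a quantity of the kind tabled above.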
It is seen in Table 11 that RR had a smaller MSE than OLS in all HIPOP and LOPOP samples. This improvement in accuracy appears to diminish as the sample size available for estimation increases, similar to the results for predictive utility presented above. While it was concluded earlier that unit weights were preferable to RR estimates for predictive use when sample size is less than approximately 180, the same is not true here. Structural interpretation of regression estimates makes explicit the intent to characterize a system or process as a function of the magnitude of the weight estimates. The substitution of arbitrary weights (i.e., unit weights) may not deter predictive use of the system's indicators, but it necessarily eliminates the possibility of assessing their individual utilities.

The improvement in MSE attributable to RR is substantial in most cases. The outlying value of .489 for the least squares mean MSE at a sample size of 60 is attributable to one random sample's extreme beta estimates. Omitting this one sample and calculating the same statistics for OLS on 24 samples yields MSE = .033 and Σλ^-1 = 39.36; the Σλ^-1 and k for RR remain unchanged, with k at .15. It is interesting to note that this one sample's extreme beta estimates were adequately handled by the RR technique using the decision rule for k selection discussed above. Over all sample sizes in the high intercorrelation matrix, the MSE due to use of RR weight estimates is approximately 91 percent less than that generated by OLS (the value is 65 percent if the one aberrant sample from the sample size 60 cell is removed). In the LOPOP matrices RR is 24 percent more accurate overall. In the HOPOP matrices, as was the case with predictive utility, RR is dominated at all sample sizes by OLS. For these samples OLS is 69 percent more efficient than RR. The conclusion is therefore similar to that for predictive considerations: for the appropriate matrices (high collinearity which is consistent across the matrix) ridge regression can provide improved weight estimation, especially for small sample sizes. Even in subjectively low intercorrelation cases (LOPOP), RR will not be worse than OLS, although the extra computational labor may not be worth the slight gain in estimation precision.

Table 11 also lists the values computed for the sum of the reciprocals of the eigenvalues with and without the biasing factor (OLS and RR solutions, respectively). This value was assessed by Hoerl and Kennard (1970b) as an indication of the degree to which orthogonalization had been achieved by the RR technique. If the predictors utilized had in fact been uncorrelated, this value, as demonstrated earlier, would equal 10.0 or, equivalently, p, the number of predictors. RR over all HIPOP matrices demonstrated a 61 percent reduction (44 percent with the above noted sample omitted from the sample size 60 cell) in the sum of eigenvalue reciprocals as compared to OLS. In the LOPOP matrices the figure was 10 percent, while in the HOPOP samples the reciprocals were 38 percent smaller. These results demonstrate the inappropriateness of this value (the sum of reciprocals of eigenvalues) as an indicant of the utility of the RR technique. As noted for the HOPOP matrices, reduction in the size of this value is an artifact of the application of equation (37) and does not necessarily imply either enhanced precision of estimation or superior predictive ability.
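One reading of the "sum of reciprocals with and without the biasing factor" in Table 11 is trace[(X'X)^-1] for OLS versus trace[(X'X + kI)^-1] for the ridge solution; the one-liner below (Python/NumPy, with an arbitrary correlation matrix) makes that comparison explicit. This is an interpretation of the table note, not a quotation of the author's computation.

    import numpy as np

    R_xx = np.array([[1.00, 0.85, 0.80],
                     [0.85, 1.00, 0.75],
                     [0.80, 0.75, 1.00]])
    lam = np.linalg.eigvalsh(R_xx)
    k = 0.10

    print("OLS   sum of reciprocals:", np.sum(1.0 / lam).round(2))        # trace of (X'X)^-1
    print("ridge sum of reciprocals:", np.sum(1.0 / (lam + k)).round(2))  # trace of (X'X + kI)^-1
    print("orthogonal ideal        :", R_xx.shape[0])                     # equals p when X'X = I

Because adding k to every eigenvalue mechanically shrinks the sum toward p, the reduction says little by itself about estimation precision, which is the point made above for the HOPOP samples.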
Churchill's (1975) modified coefficient of variation can also be used to assess the advantages of one technique relative to another. This value (CV) is calculated by dividing the square root of an estimate's MSE by the absolute value of the parameter it is intended to estimate. These values can then be averaged for summary purposes. Table 12 presents the ratio of the average CV produced by OLS to that of RR. Again, deleting the single outlying sample from the HIPOP 60 cell reduces the value reported in Table 12 to approximately 1.71. These results are consistent with the conclusions drawn on the basis of MSE comparisons: RR is substantially more accurate at small sample sizes with high, consistent collinearity than is the OLS technique. This dominance diminishes as sample size increases, and it is further decremented for lower collinearity samples such as those represented by LOPOP. The Hoerl and Kennard (1970b) population again demonstrates the superiority of OLS at all sample sizes investigated.

Table 12
Ratio of Average LS CV to Average RIDGE CVa

                       Population
Sample Size     HIPOP     HOPOP     LOPOP
30              1.820      .919     1.182
60              3.194      .789     1.088
90              1.530      .706     1.053
120             1.460      .621     1.049
200             1.238      .532     1.032

Note. All entries are mean values for 25 samples averaged over 10 beta weights per sample.
aLS = ordinary least squares; RIDGE = ridge regression; CV = coefficient of variation.

Tables 13 through 15 present the relevant precision statistics for each coefficient for the HIPOP, HOPOP, and LOPOP matrices, respectively. Table 16 presents the differences between OLS and RR precision statistics pooled over sample sizes. It is interesting to note in these tables that RR produces a smaller bias in estimation for virtually all coefficients at all sample sizes for the HIPOP and LOPOP sets than does OLS, despite the inclusion of k in equation (37) as a deliberate biasing factor. The exceptions among the 100 bias estimates are seven values found among the LOPOP matrices for samples of sizes 120 and 200. In these cases OLS and RR produce identical (to three places of accuracy)

[Table 13. Precision statistics (MSE, variance, squared bias, and CV) for each of the 10 coefficients, HIPOP population, for LS and RIDGE at each sample size; all entries are mean values based on 25 samples.]
[Table 14. Precision statistics (MSE, variance, squared bias, and CV) for each of the 10 coefficients, HOPOP population, for LS and RIDGE at each sample size; all entries are mean values based on 25 samples.]
[Table 15. Precision statistics (MSE, variance, squared bias, and CV) for each of the 10 coefficients, LOPOP population, for LS and RIDGE at each sample size; all entries are mean values based on 25 samples.]

[Table 16. Differences between LS and RIDGE precision statistics for each coefficient in each population; all entries are mean values based on the differences between sets of 25 samples pooled over sample sizes.]