This is to certify that the thesis entitled "A Monte Carlo Evaluation of Ridge Regression as an Alternative to Ordinary Least Squares," presented by Bryan Walter Coyle, has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor

Date

A MONTE CARLO EVALUATION OF RIDGE REGRESSION AS AN ALTERNATIVE TO ORDINARY LEAST SQUARES

By

Bryan Walter Coyle

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1979

ABSTRACT

A MONTE CARLO EVALUATION OF RIDGE REGRESSION AS AN ALTERNATIVE TO ORDINARY LEAST SQUARES

By Bryan Walter Coyle

This study investigated a proposed modification of ordinary least squares (OLS) multiple regression. Conventional OLS is generally used to combine the information present among a set of variables so as to optimize the prediction of a criterion variable in the original sample and to provide an equation for use in subsequent samples without the necessity of re-estimation. In addition, the predictor weights estimated are frequently used to infer the functional characteristics of the system which produced the data. Hoerl and Kennard (1970a) have suggested deliberately introducing a statistical bias into the OLS estimation procedure in an attempt to increase the predictive robustness and structural accuracy of ordinary least squares in collinear data sets. Their method, termed ridge regression, was compared with unit weighting (Schmidt, 1971) and with OLS in a Monte Carlo experiment based on three data matrices drawn from the literature. It was concluded that the ridge technique can outperform OLS in situations where the collinearity is high and consistent across all predictors. When the collinearity is concentrated in subsets of the predictor matrix, ridge regression is dominated by OLS. Consistent with Schmidt (1971), when sample sizes are small relative to the number of predictors, no suppressors are present, and only prediction as opposed to structural interpretation is relevant, unit weighting is to be preferred.

Approved:

Date:

Thesis Committee:
Neal Schmitt, Chairperson
Raymond Frankmann
Ralph Levine

ACKNOWLEDGMENTS

I wish to thank Dr. Neal Schmitt, my chairperson, for both his assistance on this and many other projects and for his sage counsel and friendship over several years. Thanks are also due to Dr. Raymond Frankmann and Dr. Ralph Levine for their advice and guidance as committee members and teachers. Finally, I thank my wife, Phyllis, without whose love and support this and many other endeavors would have little meaning.

TABLE OF CONTENTS

LIST OF TABLES

Chapter
  I. INTRODUCTION
  II. MULTIPLE LINEAR REGRESSION
      The Model
      Assumptions of the Regression Model
      The Correlation Model
      Multicollinearity
      Alternatives to Ordinary Least Squares
      Ridge Regression
  III. METHOD
      The Population Matrices
      Samples
      Equation Estimation
      Data Analysis
  IV. RESULTS AND DISCUSSION
  V. SUMMARY

BIBLIOGRAPHY

LIST OF TABLES

1. Two Predictor Regression Results with Varying Intercorrelation and Validity
2. Illustrative Multicollinearity and Eigenanalysis Values
3. Population Matrix Based on 6000 Cases--HOPOP
4. Population Matrix Based on 6000 Cases--HIPOP
5. Population Matrix Based on 6000 Cases--LOPOP
6. Eigenvectors of the Population Matrices
7. Initial R² -- LS, RIDGE, BU, VU, PU
8. Mean Initial R² Superiority of Least Squares over RIDGE, BU, VU, PU
9. Cross-Validated R² -- LS, RIDGE, BU, VU
10. Mean Cross-Validated R² Superiority of Least Squares over RIDGE, BU, VU
11. Equation Mean Square Errors
12. Ratio of Average LS_cv to Average RIDGE_cv
13. HIPOP Precision Statistics
14. HOPOP Precision Statistics
15. LOPOP Precision Statistics
16. Mean Differences of Precision Statistics Between Least Squares and RIDGE

CHAPTER I

INTRODUCTION

Linear composites are commonly used in psychology, education, and in the social sciences generally for the purpose of combining the information present among a group of variables into a single variable. While many methods of forming a composite have been proposed and utilized (Blum & Naylor, 1968; Burket, 1964; Claudy, 1972; Lawshe & Schucker, 1959), multiple linear regression is by far the most commonly used combinatorial method. This is especially true since the advent of digital computers and widely available "canned" programs, which save the researcher the tedium of hand calculation and often the concomitant necessity of considering the applicability of this method to the research problem at hand.

Linear composites, whether formed by multiple regression or other techniques to be discussed subsequently, are generally used for either predictive or descriptive purposes. In the former case one is interested in creating the composite so as to maximize its correlation with some external variable, usually designated the criterion. Examples of this usage would be predicting a person's future academic standing from past records or estimating the probability of job success on the basis of a composite formed from qualification tests, interview data, and previous employment history. Descriptive uses of linear composites, also termed structural interpretation, involve assessing the degree of change produced in the criterion variable by a unit change in one or more of those variables which form the composite.

As multiple linear regression is the most frequently used combinatorial scheme, at least when the number of available cases is large relative to the number of indicator variables intended to form the composite, its assumptions and use will be discussed first. Limitations inherent in this model are presented as well as the major alternatives to it. Various criteria that have been proposed to evaluate the optimality of combination rules are then contrasted with a modified regression approach, termed ridge regression (Hoerl & Kennard, 1970a, 1970b). The empirical performance of this method was assessed in a Monte Carlo design employing three data sets with different degrees of intervariable relationships.
From each of these populations, 25 samples at each of five sample sizes were randomly drawn. Ridge regression and ordinary least squares (multiple regression) were then employed on each of these 375 samples, as were three different methods of simple unit weighting (Schmidt, 1971). For each method the predictive efficiency in the initial sample as well as the long-term efficiency in the population was evaluated. In addition, the structural accuracy, with respect to the precision of parameter estimation, of ridge regression and ordinary least squares (OLS) was compared.

CHAPTER II

MULTIPLE LINEAR REGRESSION

The Model

An optimal method for obtaining estimates of criterion values as a function of predictor score levels would be the following (Burket, 1964): Select all conceptually relevant variables not statistically independent of the criterion. Measurements on these predictors and the criterion should be obtained on a sufficiently large number of cases (termed the validation sample) such that all possible combinations of score levels are represented. The criterion prediction for a particular case would be the criterion mean of all cases in the validation sample having the same predictor profile.

In practice this idealized system is not generally workable because of the large sample size required to insure stable parameter estimates for every possible predictor profile. What is necessary then is to make simplifying assumptions and adopt a system which will provide fairly accurate predictions of criterion performance over a wide range of possible predictor profiles despite the unavailability of some of them. The assumption most frequently employed in the behavioral sciences is that there exists an approximate functional relationship (most often presumed to be linear, although this is not necessary) between the predictors and criterion. The functional form relating these is estimated in the sample at hand by the method of multiple linear regression or ordinary least squares, which assures two important properties: (1) the sum of squared residuals between the actual criterion values and those predicted from the weighted profile components will be minimized for the validation sample; and (2) the correlation between these two score sets will be the maximum obtainable for this sample (Draper & Smith, 1966; Li, 1974).

The method of least squares and its properties may be summarized with the following notation. The linear regression function relating the dependent variable (Y) to one or more independent predictor variables (X) is, for the ith case,

(1) y_i = α + β_1 x_i1 + β_2 x_i2 + . . . + β_p x_ip + ε_i

where:

y_i = criterion score for the ith subject in the sample,
α = a scaling constant used to adjust for differences in origin between the y and x variables; also termed the intercept,
β_j = a partial regression coefficient used to weight the jth predictor variable,
x_ij = the ith individual's observed score on the jth predictor,
ε_i = error in prediction for the ith subject,
ŷ_i = predicted criterion score for the ith subject.

Thus,

(2) ε_i = y_i - (α + Σ_{j=1}^{p} β_j x_ij) = y_i - ŷ_i.

The properties of OLS noted above are then

(3) Σ_i (y_i - ŷ_i)² = minimum,

(4) r_yŷ = maximum,

with these properties holding for the N cases in the sample on which the weights were estimated. The correlation of equation (4) is referred to as the multiple correlation or, if squared, the coefficient of determination of the weighted predictor composite with the criterion.
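As a concrete illustration of equations (1) through (4), the short sketch below (in Python, with simulated data; the simulation and variable names are purely illustrative and form no part of the thesis itself) obtains the weights from the normal equations and checks the two least squares properties in the estimation sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated validation sample: N cases on p predictors (illustrative only).
N, p = 200, 4
X = rng.normal(size=(N, p))
y = X @ np.array([0.5, 0.3, -0.2, 0.1]) + rng.normal(size=N)

# Solve the normal equations for the intercept and the p weights.
Xa = np.column_stack([np.ones(N), X])
coef = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

y_hat = Xa @ coef
residuals = y - y_hat

# Property (3): the residual sum of squares is the minimum attainable here.
print("residual SS:", residuals @ residuals)

# Property (4): the correlation of y with y-hat is the multiple correlation.
R = np.corrcoef(y, y_hat)[0, 1]
print("multiple R:", R, "  R squared:", R ** 2)
```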
The model for estimating the weights is more easily presented in matrix terms, and that notation will be established here. Proofs of the derivations are available in Draper and Smith (1966), Finn (1974), and Scheffe (1959). Without loss of generality the observations on all variables are assumed to be standardized, so that the constant term (α) in the general model is identically zero. Let y be a column vector of N criterion observations, X be an N x p matrix with rank p less than N, each row representing one case's observations on the p predictor variables, ε be a column vector of N uncorrelated errors with mean zero and variance σ², and β be a column vector of p population regression coefficients. The general linear model presented in (1) becomes

(5) y = Xβ + ε.

Because of the assumptions concerning errors, E(ε) = 0 and E(εε') = σ²I, the criterion vector y has the expectation

(6) E(y) = Xβ

and the covariance matrix

(7) E[(y - Xβ)(y - Xβ)'] = σ²I.

If β̂ are the sample estimates of the population regression coefficients, β, and ŷ are the predicted criterion scores based on these same sample estimates, then

(8) β̂ = (X'X)⁻¹X'y

and

(9) ŷ = Xβ̂.

Because the variables have been standardized, (X'X) is in the form of a zero-order correlation matrix among the predictors and X'y is the vector of predictor-criterion correlations or validities. The estimates of the population regression coefficients have the expectation

(10) E(β̂) = β

and the covariance matrix

(11) E[(β̂ - β)(β̂ - β)'] = σ²(X'X)⁻¹.

β̂ is the "best" estimate of the population vector β in that the sum of squared errors in prediction is minimized in the sample. This can be demonstrated (Finn, 1974, p. 96) by considering any other estimate β* where β* = β̂ + d and d is the vector of discrepancies between β̂ and the alternative estimate. The sum of squared errors with β* replacing β̂ is

(ε'ε)* = (y - Xβ*)'(y - Xβ*)
       = [(y - Xβ̂) - Xd]'[(y - Xβ̂) - Xd]
       = (y - Xβ̂)'(y - Xβ̂) - 2d'X'(y - Xβ̂) + d'X'Xd.

The first term is ε̂'ε̂; the second term is zero since, from equation (8),

d'X'(y - Xβ̂) = d'X'y - d'X'X(X'X)⁻¹X'y = 0.

The third term is positive, as it represents the sum of the squared elements of Xd. Thus the residuals (ε'ε) are inflated anytime one departs from β̂ as the estimate of population weights derived from the sample.

The variance of these minimized residuals in the standardized case is one minus the coefficient of multiple determination (R²), that value which expresses the squared correlation between the optimally weighted predictor combination and the criterion. Where X'X is in the form of a correlation matrix, equivalent formulae for R² are (Burket, 1964; Overall & Klett, 1972)

(12) R² = y'X(X'X)⁻¹X'y,

(13) R² = β̂'X'y.

R², R, or (1 - R²) are commonly presented as indices of the predictive efficiency of the multiple regression model in the estimation sample.

Although the multiple linear regression model presented above has been used extensively throughout the sciences for the purposes of prediction and structural interpretation, its assumptions are often poorly understood. Cureton (1950) considered that, "It is doubtful that any other statistical techniques have been so generally and widely misused and misinterpreted as have those of multiple correlation" (p. 690). The situation is perhaps worse today with the wider availability of canned computer programs for regression.

Assumptions of the Regression Model

The simplest set of crucial assumptions (Johnston, 1972, p. 122) necessary to estimate the β vector in the model y = Xβ + ε are three in number:
(14) E(εε') = σ²I,

(15) X is a set of fixed values,

(16) X has rank p less than N.

The requirement of (14) is that the error or disturbance values have constant variance--a property referred to as homoscedasticity. The diagonal nature of the symmetric matrix E(εε') implies that the covariance between any pair of error terms be zero. Fulfillment of this assumption can often be evaluated by visual examination of residual value plots (Draper & Smith, 1966, ch. 3). Failure to meet this assumption most often occurs in time series analysis or when the linear model fitted is inappropriate for the set of observations at hand (i.e., there exists a nonlinear relation between the predictors and the criterion). As the assumption can generally be adequately met by the inclusion of appropriate quadratic terms, by the inclusion of linear or higher order terms in time, or by suitable data transformations such as the arcsin, square root, or log transforms before analysis (Tukey, 1949; Winer, 1971), the consequences of failure to meet this assumption will not be considered further here.

The second essential assumption (15) is more germane to the purpose of the present paper. Regression theory requires that the X matrix be a set of values fixed by the experimenter, exactly as are the levels of independent variables at which observations on y, the criterion, are taken in fixed effects analysis of variance designs (Binder, 1959). Implicit in this assumption is the requirement that the X values be free of measurement error. This means that in repeated sampling of criterion values the only source of variation is attributable to the vector of disturbances, ε. If this assumption is met, β̂ is an unbiased linear estimator (Johnston, 1972, pp. 18-23). Effects of violations of this assumption are discussed below under the correlation model.

The third assumption (16) states that X must be of full rank equal to p, the number of predictors. If the rank of X is less than the number of predictors, the β vector is indeterminate and no unique solution to the normal equations exists. As will be discussed under the heading of "multicollinearity," problems can also arise when this assumption is only approximately met.

The Correlation Model

While data transformations or deletion of some predictors have been found in many cases to adequately compensate for violations of assumptions (14) and (16), failure to obtain fixed predictor values requires an alternative model. Traditionally, multiple regression techniques have been applied in precisely those situations where the control required to obtain fixed-X values cannot be insured (Cohen, 1968). Data sets analyzed by means of OLS are typified by subjects' test scores, historical records, and in general by data that is not collected according to a design for the systematic evaluation of criterion scores obtained at preselected levels of the independent variables. In this type of situation the correlational model for the predictors is more appropriate than is the regression model. The latter is based on the assumption that only the disturbance vector ε is subject to sampling error--an assumption that is rarely met in applied multiple regression situations. The correlation or random-X model assumes that the predictors and the criterion are random variables sampled from a joint multivariate normal distribution.
Regardless of the distributional form of the disturbances (and hence of the y values), the OLS method provides "best"--i.e., minimum variance, unbiased--estimators of the population β values. While the fixed-X or regression model makes no assumptions about the distribution of the predictor variables, it does require a normal error assumption to permit inferential tests. This assumption is based on empirical evaluations of the robustness of the t and F statistics against moderate departures from normality (Neter & Wasserman, 1974). When this assumption is met the β̂ estimates are maximum likelihood estimates of the true, population weights with the same best linear unbiased properties as the least squares values (Herzberg, 1969; Neter & Wasserman, 1974).

While both models would appear to provide the necessary data for inferential uses of multiple regression results, it is clear that the correlational model is almost always more appropriate. Under the null hypothesis of zero multiple correlation the distributional theory is identical for the two models. However, in applications, especially those for predictive purposes such as in personnel selection, the null hypothesis is rarely true (Burket, 1964). The extreme complexity of the correlational model in cases where the null hypothesis does not obtain has led most investigators to use the fixed-X model in the hope that there will be little practical difference in the results derived (Burket, 1964; Claudy, 1972; Cohen & Cohen, 1975; Neter & Wasserman, 1974).

While this subject has not received a great deal of attention in the literature, it would appear that application of fixed-X procedures to random-X data affects the suitability of the weight estimates thus derived. It has been demonstrated by Berkson (1950), Geary (1953), and Rao and Miller (1971) that if the predictor variables are not held at preselected values and are subject to errors of measurement, the beta weights will in fact be biased estimates of the population values. It is thus not necessarily the case that beta weights derived on a sample of finite size, such that they maximize the multiple correlation in the sample, are the "best" estimates of the parameter vector β (Claudy, 1972). "Best" here refers not to the minimum variance properties of least squares estimators but rather to the minimization of the difference between the true population multiple correlation (ρ) and the estimate of it (R) obtained by application of the sample weights to the population.

Application of the fixed-X regression model to data collected under the assumptions of the random-X model results in an over-fitting of the regression surface to the sample data. In practice this means that the beta weights are optimized on the idiosyncrasies caused by sampling and measurement error in the estimation sample. Accordingly, when these weights are applied in a new sample or in the population, the resulting multiple correlation will be lower than the initial estimate. This general problem has been termed "shrinkage" and has been considered by numerous authors. For instance, formulae have been evaluated to estimate this shrinkage (see Schmitt, Coyle, & Rauschenberger, 1977, for a comparison of the major formulae) and alternatives to OLS have been proposed (Claudy, 1972; Cureton, 1962; Herzberg, 1969; Lawshe & Schucker, 1959; Schmidt, 1971).
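One commonly cited shrinkage adjustment, usually associated with Wherry (1931), estimates the population coefficient of determination from the sample value. The sketch below is offered only as an illustration of the kind of formula compared by Schmitt, Coyle, and Rauschenberger (1977); it is not the analysis performed in this thesis.

```python
def wherry_adjusted_r2(r2: float, n: int, p: int) -> float:
    """Shrinkage-adjusted R squared: 1 - (1 - R^2)(N - 1)/(N - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Example: a sample R^2 of .40 with N = 50 cases and p = 10 predictors
# shrinks to roughly .25 as an estimate of the population value.
print(round(wherry_adjusted_r2(0.40, 50, 10), 3))
```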
While accurate estimation of the multiple correlation is of primary interest for predictive uses of multiple regression techniques, it does not touch upon the second purpose of multiple regression: structural interpretation. The final consideration relative to assumption (15) is germane to this purpose. Application of fixed-X OLS to random-X data subject to sampling and measurement error inflates the variability among the optimizing weights without regard to the true variance of the parameter values. It is not the standard error or variance of any single weight estimate which is referred to here but rather the dispersion of the p weights calculated on the p predictors in a sample. This effect is attributable to the sample values being subject not only to the variance of the population weights but also to the error variance generated by the less than perfectly reliable measurement of predictor scores. Awareness of this artifact has led to such proposals as averaging the beta weights obtained in a random split of the sample (Claudy, 1972) and using as a β estimate the least deviant (from zero) of the two weights obtained in a fifty-fifty split (Cureton, 1962). The relative merits of several such alternatives to OLS are discussed subsequent to the further examination of the implications of assumption (16) in the following section.

Multicollinearity

The assumption of equation (16), that X, the predictor matrix, has rank p < N, actually has implicit two requirements. The first concerns the ratio of the sample size available to the number of predictors for which weights must be estimated, and the second involves the number of linear dependencies among the predictor set. The question of sample size is common to any attempt to establish a statistical estimate of a parameter--the greater the number of cases upon which the estimate is derived, the more stable it will be. In the case of OLS using standardized data the requirement is that p ≤ N, or else the model is considered to be overdefined and a unique solution for β is not possible. In point of fact one generally desires that N be much greater than the number of predictors, for as was demonstrated by Wishart (1931),

(17) E(R²) = ρ² + (p/(N - 1))(1 - ρ²),

where ρ represents the population multiple correlation. In the case where the null hypothesis of no predictor-criterion correlation holds in the population, equation (17) shows that the sample value will be inflated. Setting ρ² to zero yields

(18) E(R²) = p/(N - 1).

It is from this equation that the various shrinkage estimators have been derived (Darlington, 1968; Lord, 1950; Nicholson, 1960; Wherry, 1931). From the above formula it is obvious that the extent to which R² overestimates ρ² varies directly with the number of predictors and inversely with the sample size. These characteristics are important when one compares the efficiency of OLS and alternative estimators (Schmidt, 1971) in a variety of practical situations.

Throughout the long history of multiple regression usage in the sciences, practitioners have come to appreciate its robustness in the face of violations of some underlying assumptions and have, in many cases, developed remedial procedures to correct unsuitable data before analysis. Examples of this would be the Durbin-Watson (1950) test statistic for autocorrelated error terms with the attendant suggestions of Cochrane and Orcutt (1949) as to their correction.
Similarly, Bartlett's variance homogeneity test has given rise to a number of data transformations suitable to different types of heteroscedasticity (Winer, 1971). The development of detection and correction methods for problems of multicollinearity in regression models has not yet reached the level of rote application of specified test statistics which in turn could provide evidence as to appropriate alterations to be made (Farrar & Glauber, 1967). In fact, while economists have apparently been aware of the difficulties inherent in highly correlated predictor sets for some time, it seems that others in the social and behavioral sciences have frequently labeled such a concern as being of "theoretical interest only" and thereby dismissed it from consideration in their applied work. As shall be demonstrated, multicollinearity can cause some very practical problems to arise (Darlington, 1968).

While a variety of definitions of multicollinearity exist in the literature, many of them are more symptomatic than definitive. The definition used here is attributable to Johnston (1972) and Silvey (1969). If one considers the predictor matrix X of dimensions (N x p), a linear dependence is said to exist between the column vectors x_1, x_2, ..., x_p if there exist constants a_1, a_2, ..., a_p, not all zero, such that

(19) Σ_{j=1}^{p} a_j x_j = 0.

When (19) holds for some subset of the column vectors of X (and thus for the matrix as a whole), multicollinearity is said to exist. In this case beta estimates cannot be obtained, as the predictor matrix is singular and thus its inverse does not exist (equation 8). However, even when (19) does not obtain exactly but rather is only approximately true, multicollinearity is still a relevant problem for the data analyst. Thus the question of collinearity is one of severity, or the degree of departure from orthogonal variates (Kmenta, 1971; Mason, Gunst, & Webster, 1975).

There are three primary sources of highly collinear data sets (Mason, Gunst, & Webster, 1975). The first involves an overdefined model--one where there exist more predictors than observations. The difficulties caused by cases in which this is approximately true were discussed above. When faced with such a situation the analyst must (a) eliminate some predictors, (b) use grouped subsets of predictors, or (c) utilize some form of principal components regression. There are deficiencies inherent in each of these solutions and they will be discussed in the following section. The latter two sources of collinearity, sampling techniques and physical constraints on the model, are quite similar and can be presented together. These situations arise when the data have been sampled from only a subspace of the predictor variable domain or when some predictors' values are restricted to a near exact relationship with other variables in the X matrix. In the former case data observations can be added from the undersampled area of the domain, if indeed the investigator is aware of the problem, which can usually be identified through eigenvector analysis (Silvey, 1969). When practical constraints eliminate this alternative or when the problem cannot be identified as attributable to undersampling, few remedial measures are available.

The effects of approximate multicollinearity have been presented by Johnston (1972) as follows:

1. The precision of estimation falls so that it becomes very difficult, if not impossible, to disentangle the relative influences of the various x variables.
This loss of precision has three aspects: specific estimates may have very large errors; these errors may be highly correlated, one with another; and the sampling variances of the coefficients will be very large.

2. Investigators are sometimes led to drop variables from an analysis because their coefficients are not significantly different from zero, but the true situation may be not that a variable has no effect but simply that the set of sample data has not enabled us to pick it up.

3. Estimates of coefficients become very sensitive to particular sets of sample data, and the addition of a few more observations can sometimes produce dramatic shifts in some of the coefficients (p. 160).

The first difficulty has been well documented and illustrated by Darlington (1968), while the latter two consequences are familiar to psychologists under the general rubric of "bouncing betas." The detection and analysis of these effects will be considered for the two predictor case, although all results are applicable to the case of any number of predictors as long as equation (19) is not exactly satisfied.

The effects of collinearity on estimates can best be seen by considering the inverse of the correlation matrix. Equation (8) can be written for the standardized model with C = (X'X)⁻¹ as

(20)  [ β̂_y1.2 ]   [ c_11  c_12 ] [ r_y1 ]
      [ β̂_y2.1 ] = [ c_21  c_22 ] [ r_y2 ]

where r_yj represents the validity coefficient for the jth predictor. From (20) it is evident that in the case of uncorrelated predictors (c_12 = c_21 = 0), the validity coefficient is the beta for any one predictor, as the predictor matrix is then an identity matrix:

(21)  [ β̂_y1.2 ]   [ 1.0  0.0 ] [ r_y1 ]
      [ β̂_y2.1 ] = [ 0.0  1.0 ] [ r_y2 ]

When the predictor intercorrelation is not equal to zero the inverse matrix is of the form

(22)  C = (X'X)⁻¹ = (1/(1 - r_12²)) [   1     -r_12 ]
                                    [ -r_12     1   ]

This is the matrix formulation of the familiar computational solution for beta weights with two predictors:

(23)  β̂_1 = (r_y1 - r_y2 r_12)/(1 - r_12²),
      β̂_2 = (r_y2 - r_y1 r_12)/(1 - r_12²).

Equation (23) is a well-known result and illustrates that as (19) becomes exact (i.e., r_12 or r_12² → 1.0) the diagonal elements of the inverse matrix approach infinity (1.0 ≤ c_ii → ∞). Several consequences follow from this. The limiting case of intercorrelation is derived by assuming r_y1 = r_y2, which is justified since as r_12 → 1.0 each predictor's validity must become equal (Klein & Nakamura, 1962; Sastry, 1970). This further implies that as r_12 → 1.0 the slightest discrepancy in the magnitude of the validity coefficients will result in the beta weights being approximately equal but opposite in sign. Obviously, the slight discrepancies causing this are sample specific and due to sampling error and lack of perfect reliability in measurement. This is the "bouncing beta" problem (McDonald & Schwing, 1972; Swindel, 1974; Wampler, 1970), demonstrated by sign reversals and, in the case of multiple predictors, by dramatic shifts in the magnitude of weights in different samples from the same population (Johnston, 1972; Wherry, 1975). Table 1 provides sample calculations demonstrating the effects of varying r_12 and discrepant versus equal validities.
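A short sketch of equation (23), run with the same validities and intercorrelations used in Table 1 below, makes the sign reversal under near-perfect collinearity easy to reproduce (Python; the function is illustrative and not part of the original analysis).

```python
def two_predictor_betas(r_y1: float, r_y2: float, r_12: float):
    """Standardized regression weights from equation (23) and R^2 from (13)."""
    denom = 1.0 - r_12 ** 2
    b1 = (r_y1 - r_y2 * r_12) / denom
    b2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = b1 * r_y1 + b2 * r_y2
    return b1, b2, r_squared

# Equal validities: both weights shrink smoothly as r_12 rises.
print(two_predictor_betas(0.50, 0.50, 0.90))   # about (0.26, 0.26, 0.26)
# A validity difference of only .01 combined with r_12 = .99 flips one sign.
print(two_predictor_betas(0.51, 0.50, 0.99))   # about (0.75, -0.25, ...)
```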
Table 1

Two Predictor Regression Results with Varying Intercorrelation and Validity

r_y1   r_y2   r_12   β_y1.2   β_y2.1   Variance   Covariance   Multiple R
.50    .50    .00    .50      .50      0.50        0.0         .710
.50    .50    .30    .38      .38      0.68       -0.2         .620
.50    .50    .60    .31      .31      1.08       -0.65        .560
.50    .50    .90    .26      .26      3.89       -3.51        .510
.50    .50    .95    .256     .256     7.63       -7.25        .506
.50    .50    .99    .25      .25      37.64      -37.26       .501
.51    .50    .00    .51      .50      .49         0.0         .714
.51    .50    .30    .396     .381     .688       -0.2         .626
.51    .50    .60    .328     .303     1.064      -0.639       .565
.51    .50    .90    .316     .216     3.845      -3.461       .519
.51    .50    .95    .358     .159     7.563      -7.185       .512
.51    .50    .99    .75      -.246    36.11      -36.74       .511

While it is possible for a beta weight to be underestimated, the gross inflation of the diagonal elements in the inverse matrix due to high collinearity generally results in betas with large absolute values without regard to the true population value. While these are "correct" values, their counterintuitive signs and magnitudes again have prompted the discussion of alternative estimators to OLS (Churchill, 1971; Hoerl & Kennard, 1970a; Klein & Nakamura, 1962). The notion that these inflated betas are potentially poor estimates is attributable to the effects of collinearity on the variance-covariance of the beta weights. Equation (11) can be expressed (with r_12 = α and omitting a function of sample size) as

(24)  Var β̂ = (σ²_ε/(1 - α²)) [ 1.0   -α  ]
                               [ -α    1.0 ]

so that

(25)  Var β̂_1 = Var β̂_2 = σ²_ε/(1 - α²)

and

(26)  Cov β̂_1β̂_2 = -α σ²_ε/(1 - α²).

As multicollinearity, here expressed as α, increases, it is evident that the sampling variances of the estimated coefficients increase. For example, as α increases from .5 to .9 the sampling variance increases by over 300 percent, while α = .95 gives an increment of 750 percent (Johnston, 1972). It should be noted, though, that poor precision in the estimation of individual coefficients does not imply that the linear combination of predictors is correspondingly poor. This apparent anomaly is evidenced in the two predictor case with positive α by the negative covariance of the estimates. This means that if one beta weight is overestimated, another in the same sample with which it is positively correlated will also be overestimated in absolute value but will have the opposite sign. The higher the correlation between the two variables, the more pronounced will be this tendency to compensate for errors in estimation. In the extreme case of r_12 = 1.0 any pair of weights with the same sum will be exactly equivalent--for instance, weights of -4.0 and 6.0 or 3.0 and -1.0.¹ This effect is exemplified by Darlington (1968) and proven by Mason, Gunst, and Webster (1975) for the greater than two predictors case.

¹This is true despite the fact that perfect collinearity (r_12 = 1.0) makes inversion of the predictor matrix impossible (Darlington, 1968).

The above discussion has illustrated the nature of the problems cited by Johnston (1972) as being attributable to the effects of multicollinearity in the predictor set. Yet large predictor sets with, as a rule, validities above .25 or .30 are the norm rather than the exception in most MR applications. With higher validities and p greater than three or four it becomes inevitable that the deleterious effects of multicollinearity will be felt. While this problem is of minimal interest for purely predictive MR uses, it should be carefully considered when structural interpretation is the goal. In this setting the magnitudes and sampling variances of weight estimates can lead to erroneous conclusions as to the importance or predictive utility of individual variables.
Thus it is important to consider ways of detecting multicollinearity, assessing its impact, and hopefully discovering solutions to the problems it poses.

Numerous techniques have been proposed for detecting multicollinearity, and the more important will be discussed. The simplest available operational definition of unacceptable collinearity is the arbitrary establishment of a maximum permissible value for predictor intercorrelation. Aside from the arbitrariness inherent in this approach, it shares the faults of the next proposal to be presented. Klein (1962) suggests that ". . . intercorrelation or multicollinearity is not a problem unless it is high relative to the overall degree of multiple correlation . . ." (p. 101). Despite its intuitive appeal this rule of thumb is not valid. Farrar and Glauber (1967), while providing a geometric rationale for the rule, point out that perfect collinearity, or the case of a completely singular predictor matrix, is perfectly compatible with low pairwise correlations. A set of dummy coded contrast vectors such as commonly used for the analysis of variance, whose non-zero elements exhaust the sample space, would fulfill these requirements (Cohen, 1967; Cohen & Cohen, 1975). For the same reasons, measures based on average intercorrelations (Cureton, 1971; Kaiser, 1968; Meyer, 1975) are inadequate warnings of severe multicollinearity.

A measure presented by Kmenta (1971) is the coefficient of determination R²_(j), which is obtained by regressing the criterion variable on all predictors excluding x_j. If a high degree of collinearity is present in the data, the discrepancy between R²_(j) and the coefficient of determination for the full predictor set will be quite small. However, a small difference may simply be reflective of the worthlessness of x_j as a predictor variable. This is illustrated by Darlington's (1968) suggestion of this exact comparison for estimating the importance of individual predictors. Furthermore, this measure does not depict the nature of the collinearity, i.e., which variables are involved in the relationships.

Another measure with the same limitations as R²_(j) is based on the F statistic obtained from fitting the full model and the F statistics obtained by deleting one variable at a time from the equation. If the overall F is significant and the individual F tests are not, multicollinearity is indicated. However, this occurrence is unusual even with high collinearity (Mason, Gunst, & Webster, 1975), and, like the previous measure, the nature of the collinearity is not specifiable.

A single measure which summarizes the collinearity present in the entire predictor matrix, again without providing information as to its nature, is provided by the determinant, symbolized |X'X|. As (X'X) is in standardized form, 0 ≤ |X'X| ≤ 1.0, while if a linear dependence satisfying equation (19) exists the determinant is equal to zero. This measure provides at least an ordinal indicant of the presence of multicollinearity, although the collinearity could be attributable to one or several very small latent roots. Under the assumption of multivariate normality (not generally tenable in the assumed fixed-X case, as discussed earlier), work by Wilks (1932) and Bartlett (1950) indicates that a chi-square test of the departure of the determinant from zero is possible.
Further, the determinant obtained by deleting one variable or set of variables from the matrix forms an F ratio with the determinant of the full p-order matrix. These tests are however very sensitive to departures from normality and are of sufficient complexity to discourage their frequent use (Farrar & Glauber, 1967). Much the same information can be obtained more readily by the methods to be discussed next.

Johnston's (1972, pp. 162-163) conclusion that the standard error of beta weight estimates should give adequate warning of the presence of multicollinearity can be extended to provide more exact information. The standard error of a single beta weight, β̂_i, is defined to be

(27)  S.E. β̂_i = sqrt[ C_ii (1 - R²_y.1,2,...,p) / (N - p) ]

where C_ii is the diagonal element of (X'X)⁻¹ corresponding to the ith predictor. This measure provides an intra-matrix indication of collinearity but still does not facilitate inter-matrix comparisons. A more useful measure is the C_ii component of the standard error formula, which indicates collinearity without reference to the coefficient of determination for the full equation. This diagonal element of the inverse intercorrelation matrix of predictors (actually a transformation of it) has been termed the variance inflation factor by Marquardt (1970) and has been employed as an indicant of multicollinearity by Marquardt and Snee (1975) and Snee (1973).

This element, C_ii, and the off-diagonal values of the matrix can be expressed in terms of more familiar quantities to demonstrate their utility. If the symbol s_i.j,...,p is used to represent the square root of the residual variance obtained when any one predictor is regressed on the remaining p - 1 predictors (i.e., s_i.j,...,p = sqrt(1 - R²_i.j,...,p)), then C_ii is the reciprocal of this variance--C_ii = 1/(s_i.j,...,p)². This provides a convenient means of assessing multicollinearity (Farrar & Glauber, 1967), as the squared multiple correlation of each predictor regressed on those remaining is implicit in the inverse matrix of (X'X). The relationship is

(28)  R²_i.j,...,p = 1 - 1/C_ii.

Thus if perfect collinearity (19) exists, the C_ii element will be infinitely large and the matrix is seen as singular. For matrices which do not exactly satisfy (19), the natural range of C_ii is simply greater than or equal to one, with equality obtaining in the orthogonal variate case. If a single high collinearity exists (high pairwise correlations), the large C_ii and C_jj will indicate which variables are involved. However, if collinearities involving several variables are present, one must look to the off-diagonal elements of (X'X)⁻¹ for more information. An off-diagonal element C_ij is defined as

(29)  C_ij = -r_ij.k,...,p / [(s_i.j,...,p)(s_j.i,k,...,p)].

The numerator is a partial correlation of an order two less than the rank of the full matrix. It may be noted that inversion of the correlation matrix provides a means of quickly obtaining all of the highest order partial correlations by the formula

(30)  r_ij.k,...,p = -C_ij / sqrt(C_ii C_jj).

Especially in cases of non-overlapping groups of multicollinearities, consideration of the diagonal and off-diagonal elements of the inverse allows one to locate the variables contributing to the problem. Marquardt (1970), Mason, Gunst, and Webster (1975), and Farrar and Glauber (1967) consider this indication of collinearity to be the best available. Gordon (1967) illustrates the effects on these values produced by varying intercorrelation and subset size.
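The diagonal of the inverse correlation matrix can be computed directly; the brief sketch below (Python) recovers each C_ii and the implied R²_i.j,...,p of equation (28) for the three-variable matrix shown later in Table 2, and its output agrees closely with that table (small differences reflect the rounding of the input correlations).

```python
import numpy as np

# Three-predictor correlation matrix from Table 2 (Cooley & Lohnes, 1971).
Rxx = np.array([
    [1.00, 0.67, -0.10],
    [0.67, 1.00, -0.29],
    [-0.10, -0.29, 1.00],
])

C = np.linalg.inv(Rxx)
vif = np.diag(C)                # C_ii, the variance inflation factors
r2_on_rest = 1.0 - 1.0 / vif    # equation (28): R^2 of each predictor on the rest

print("determinant |X'X|:", round(np.linalg.det(Rxx), 4))
print("C_ii (variance inflation factors):", vif.round(2))
print("R^2 of each predictor on the others:", r2_on_rest.round(2))
```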
There is yet one improvement which can be suggested to further facilitate the interpretation of multicollinear matrices. Once the presence of high collinearity has been established by means of one or several high C_ii values or, in a summary manner, by the existence of a near-zero determinant, one is still interested in accurately pinpointing the contributions to the problem--essentially in specifying the coefficients of equation (19). A procedure which enables this serves basically as the stepping stone for rank-reduction alternatives to OLS or, alternatively, as an exploratory statistical method in its own right. Eigenanalysis is basic to all expositions of principal components or factor analysis, but its utility has not been widely appreciated by users of multiple regression.

Eigenanalysis is essentially a procedure for extracting from a matrix the successive vectors of weights which, when applied to the original variables, will produce linear combinations of maximum variance. These eigen or characteristic vectors, as they are also called, are subject to two conditions. First, they are restricted to unit length, i.e., v_i'v_i = 1.0. Secondly, each vector must maximize the residual variance extracted from the matrix subject to the condition that it is orthogonal to all other vectors. For the case of an intercorrelation matrix (X'X), the matrix equation to be solved is

(31)  ((X'X) - λ_i I)v_i = 0

where λ_i represents the characteristic root or variance of its associated vector of coefficients, v_i, and is obtained from

(32)  V'(X'X)V = Λ, diagonal.

Solution of this equation (Cooley & Lohnes, 1971; Finn, 1974; Tatsuoka, 1971) yields a set of p roots and a matrix of dimension p x p containing the coefficient vectors. Some attributes of these values can be of use in assessing the effects of multicollinearity. For a matrix of orthogonal standardized variates each characteristic root is equal to exactly one. Therefore, if r_ij = 0 for all i ≠ j, then λ_i = 1.0 for all i, and

(33)  Σ_{i=1}^{p} λ_i = p, the matrix rank.

If the vectors are not orthogonal the sum of the roots must still equal the variance of the full matrix--thus (33) is true in all cases. However, with correlated variates the first one or several eigenvectors extracted will exhaust much of the variance and the later eigenvalues will approach zero. In the case of perfect collinearity (19), one or more of the roots will in fact be equal to or less than zero. Thus each eigenvalue is an indicant of the degree of collinearity present in the matrix, and the inflation of the sum of the reciprocals of the roots away from p provides another matrix-wide summary of the severity.

Of more interest than merely another summary measure are the elements of the vectors associated with small eigenvalues. These coefficients, just as in factor analytic interpretations, show which variables are the major contributors to the definition of the vector. Thus large positive or negative coefficients in a vector with a small eigenvalue indicate which variables are contributing the most to the lack of orthogonality (Marquardt & Snee, 1975; Mason, Gunst, & Webster, 1975; Snee, 1973; Webster, Gunst, & Mason, 1971).

A relationship of interest (Snee, 1973) involves an alternative method of computing the diagonal elements of the correlation matrix inverse. Because the eigenvector matrix is columnwise orthogonal and of unit length (V'V = VV' = I), equation (32) can be rearranged to give

(34)  (X'X) = VΛV'.
Using a matrix theorem for inverses ((ABC)⁻¹ = C⁻¹B⁻¹A⁻¹; Dorf, 1969), it is evident that

(35)  (X'X)⁻¹ = VΛ⁻¹V'

and

(36)  C_ii = v_i1² λ_1⁻¹ + v_i2² λ_2⁻¹ + . . . + v_ip² λ_p⁻¹.

From this equation the significance of characteristic roots less than 1.0 is immediately obvious. The basis of the standard error for any one beta weight (28) is a direct function of the spectrum of eigenvalues for the matrix upon which it is computed. An eigenanalysis and other multicollinearity statistics discussed above are presented in Table 2, based on a numerical example taken from Cooley and Lohnes (1971). With only three variables and a rather simple pattern of interdependence, the source of the collinearity is readily apparent. In more complex analyses, however, the information provided by large loadings (-.66 and .72 in v_3) and associated small roots (.304) can be valuable.

Table 2

Illustrative Multicollinearity and Eigenanalysis Values

(X'X) =   1.00    .67   -.10
           .67   1.00   -.29
          -.10   -.29   1.00

Determinant |X'X| = .4987

(X'X)⁻¹ = C =   1.84   -1.28   -.18
               -1.28    1.98    .44
                -.18     .44   1.11

Eigenvectors V =   .64    .38   -.66
                   .69    .10    .72
                  -.34    .91    .20

Eigenvalues λ_i     % of Trace
  1.768               59.0
  0.927               30.9
  0.304               10.1

Σ_{i=1}^{p} λ_i⁻¹ = 4.924

Because of the problems with established methods of assessing multicollinearity outlined above, it is suggested that eigenanalysis be performed on any data set in which high collinearity is suspected. Inspection of the eigenvector values should allow a researcher to pinpoint likely problem variables or sets of variables. Once the severity of the multicollinearity present in a matrix has been assessed, ways should be considered for handling the deleterious effects it can have on weight estimates. Numerous methods have been presented in the statistical, sociological, and econometrics literature, and several will be discussed here.

Alternatives to Ordinary Least Squares

The two basic uses to which multiple regression estimation of weighting coefficients are applied are again relevant here. The majority of alternatives to OLS (including derivations based on the OLS procedure) are directed at maximizing the sample equation's multiple R or else its expected value on cross-validation, subject to such constraints as computational ease or the availability of adequately large data samples. Thus, many alternatives are explicitly concerned only with prediction, and several in fact make structural interpretation
If excessive collinearity attributable to an undersampling of the regions of the data domain is evidenced by either evaluation of the eigenvectors or joint variable distributions (Webster, Gunst, & Mason, 1974), little choice remains other than to acquire observations on a larger N. Rank reduction procedures have occasionally been employed in the last mentioned case, although interpretation may be vastly complicated. These procedures may be classified basically as either based on a posteriori orthogonalization of the data vectors or on evaluation of successive partial validities as variables are included or deleted from the predictor set. Virtually all orthogonalization models attempt to eliminate specific variance (in the factor analytic sense) from the 32 predictor intercorrelation matrix. Thus, one approach in this area is the application of a principal components analysis of the predictors (normalized eigenvector values). Based on a scree test (Cattell, 1966) or the meaningfulness of the resulting components, an arbitrary number (Jeffers, 1966; Jolife, 1972, 1973) are retained and matrix transformations are used to re-estimate variable scores for individual subjects. These scores are then used in the usual regression computa- tions. Examples of this method have become fairly common since the advent of readily available computers to carry out the tedious matrix manipulations (Gunst, Mason, & Webster, 1975; Jeffers, 1966; Massey, 1974, Schmitt & Coyle, 1976). Variations on this approach have utilized the characteristic vectors as predictors (Gunst, Mason, & Webster, 1971), inserted communality estimates in the R matrix (Horst, 1941), and attempted to estimate the R.1 matrix (Guttman, 1958) rather than R itself. One method (Burket, 1964) augments the predictor matrix with the vector of criterion validities before principal axes orthogonaliza- tion. Finally, all analyses based on components or axes may also be subjected to rotation (for example, varimax or quartimax) before being used to re-estimate subjects' scores. If a principal components analysis is employed and the number of retained components is the same as the number of original variables it can be shown that the multiple regression equation derived will be identical to that obtainable from the raw variables (Darlington, 1968; Herzberg, 1968). Therefore, only cases in which fewer factors are extracted than the number of original variables can potentially be of interest. The argument in favor of such rank reduction is usually based on the well known fact that if the variables being factored 33 contain substantial error variance, it will tend to be concentrated in the vectors associated with small roots. Potentially serious problems exist in applications of any of these methods. The distributional theory is exceedingly complex for those analyses employing communality estimates and this leaves significance testing of derived weights a virtually intractable problem (Burket, 1964). Further it is possible that the factor accounting for the least variance in the predictor set, and therefore the prime candidate for omission, in fact correlates perfectly with the criterion (Darlington, 1968). Description of the variables re-estimated from factor matrices is also generally compli- cated by "intermediate" loadings, and the indeterminancy of factor scores up to a linear transformation and this in turn obfuscates efforts to interpret the subsequent regression equation. 
Variable deletion based on various criteria has been proposed when degrees of freedom are limited or when collinearity is a problem (Draper & Smith, 1966). In all cases deletion procedures attempt to maximize the validity of the initial equation subject to specified constraints, and are thus not amenable to instances in which structural interpretation of a full rank predictor matrix is of interest. As Darlington (1968) notes, removing the variable with the smallest beta weight is not guaranteed to produce the equation with the highest population validity for that rank model. Accretion methods of variable selection begin with the variable having the highest zero-order validity and then in successive steps add those variables which will give the greatest increase in the multiple R for the equation (Draper & Smith, 1966). Horst and MacEwan (1960) suggest the reverse of this procedure and note that the two methods--forward selection and backward elimination--will not in general yield the same equations. Both procedures are terminated on the basis of arbitrary criteria such as validity increment or number of variables included. Stepwise regression is essentially identical to forward selection, but additionally it tests at each step all variables already in the equation. If, because of the inclusion of subsequent predictors, a variable's partial correlation has fallen below a specified value, it is then eliminated and the procedure continues to evaluate the remaining candidates for the equation (Nie, Hull, Jenkins, Steinbrenner, & Bent, 1975). Numerous variations on these three basic approaches have been suggested (Anderson & Fruchter, 1960; Burket, 1964; Furnival & Wilson, 1974; Hocking & Leslie, 1967; LaMotte & Hocking, 1970; Rock, Linn, Evans, & Patrick, 1970), but in general these three have been preferred. The case of interest in the present paper is that in which one wishes, for either predictive or interpretative purposes, to obtain an equation based on all p variables upon which data were collected. Accordingly, the relative merits and problems of reduced rank procedures will not be discussed further.

The selection and application of non-least squares estimated weights has received a great deal of attention, especially in the psychological decision-making literature (Einhorn & Hogarth, 1975). In general the motivation for development of alternative weighting strategies has had three facets: (a) especially when computations must be done by hand, the complexity of the work necessary to calculate OLS weights is prohibitive; (b) high collinearity situations combined with less than perfect reliability of measurement make it quite likely that arbitrary weights will outperform unstable beta estimates in subsequent usage; (c) situations in which it is desired to correlate a linear function of the predictors with a criterion but a sufficiently large sample on which to estimate beta weights is not available. A variety of combinatorial schema have been evaluated in the literature (Claudy, 1972; Lawshe & Schucker, 1959; Trattner, 1963; Wesman & Bennett, 1959), such as raw score addition so that variables are weighted by their standard deviations, addition of standardized scores, weighting by the reciprocal of the standard deviation, and weighting by the validity coefficient. The consensus has developed that, for situations in which N is less than approximately 50, equal weights are superior or only slightly inferior to OLS weights regardless of the number of predictors.
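As an illustration of the equal-weight idea only (the three sign-assignment variants actually compared in this thesis are described later), the sketch below forms a unit-weighted composite of standardized predictors, with signs taken from the zero-order validities, and correlates it with the criterion.

```python
import numpy as np

def unit_weight_r(X, y):
    """Correlation of the criterion with a composite of standardized
    predictors given weights of +1 or -1 (signed by zero-order validity)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    validities = np.array([np.corrcoef(Xs[:, j], y)[0, 1] for j in range(X.shape[1])])
    composite = Xs @ np.sign(validities)
    return np.corrcoef(composite, y)[0, 1]

# Illustrative small-sample check (N small relative to p).
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
y = X @ np.array([0.4, 0.3, 0.3, 0.2, 0.1, 0.1]) + rng.normal(size=40)
print("unit-weight r:", round(unit_weight_r(X, y), 3))
```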
The comparison made is generally between the multiple R obtained from application of the original beta weights in a cross-validational sample (a set of cases not included in the original estimation of the weights) and the multiple R produced by unit weights. A comprehensive Monte Carlo study of the empirical performance of unit weights versus sample beta weights when they are validated in the population was done by Schmidt (1971). For 40 combinations of N and p, compared on a variety of correlation matrices sampled from the literature, he demonstrated that the maximal superiority (in terms of obtained R² values) of beta weights averaged over 100 samples was only .083. When suppressor variables were removed this maximum dropped to .039. Both maximum beta versus unit weight discrepancies were in fact obtained in the populations themselves, where the beta weights were error free, i.e., parameter values rather than sample estimates. No other weighting scheme has been shown to be so consistently comparable to the performance of beta weights.

The results provided by Schmidt's analysis are in accord with the suggestions of Einhorn and Hogarth (1975) and Dawes and Corrigan (1974), who derived their conclusions from comparative studies of the human decision making process. Dawes and Corrigan summarize their results by stating that to obtain stable prediction equations in situations where all variables are subject to error it is necessary simply to select relevant predictors, determine their sign, make all predictors comparable, and then add. While this solution appears at first to be a panacea for the numerous problems encountered in MR, Roose and Doherty (1976) noted several difficulties in attempting to apply these suggestions. They found selecting the variables without the use of some sort of stepwise procedure an arduous task. Nor had they any manner of determining the predictor signs a priori; those they used were based on the validity coefficients for the selected variables. In their words (Roose & Doherty, 1976), ". . . the success of unit weighting as demonstrated in the present study rested upon crutches fashioned from the very MR procedure bested by unit weighting" (p. 245). Wainer (1975, 1976) has formulated the expected loss attributable to the use of unit rather than OLS weights and noted that for practical purposes the loss is so small that the OLS procedure is not justifiable. Again, though, his derivations assume some sort of selection and sign assignment a priori, no suppressors, as well as a maximal spread in the beta weights of only .5--conditions which it is frequently impossible to meet.

If one is interested in full rank multiple prediction, it would appear that unit weighting is the viable alternative to MR. While structural interpretation is not possible, except on the gross level of
The problems with standard multiple regression which prompted researchers to consider unit weighting and various orthogonalization schema have recently given rise to a modified OLS methodology. The details of this method, termed ridge regression, are discussed next.

Ridge Regression

Hoerl (1962) originally proposed this modified regression method specifically to deal with the problems of severe multicollinearity discussed above. In exemplifying the method of ridge analysis (Hoerl & Kennard, 1970a, 1970b), the errors associated with non-experimentally collected data are noted: (X'X) is not nearly a unit, or identity, matrix. Weighting coefficients derived from such a matrix are often of incorrect sign and have inflated values, as was noted before. The undesirable nature of such weights is expressed by Hoerl and Kennard (1970a):

. . . the least squares estimates [which] often do not make sense when put into the context of the physics, chemistry, and engineering of the process which is generating the data. In such cases, one is forced to treat the estimated predicting function as a black box or to drop factors to destroy the correlation bonds among the x_i used to form X'X. Both these alternatives are unsatisfactory if the original intent was to use the estimated predictor for control and optimization (p. 55).

The suggestion offered in such cases is the use of

(37)  \hat{\beta}^{*} = (X'X + kI)^{-1} X'y, \qquad k \geq 0

for estimating beta weights rather than equation (8). This procedure, it is claimed, modifies the weight values such that they are less extreme in absolute value and thus necessarily have reduced variance. The technique can also be used to generate a trace of the effects of increasing k values on the coefficients, which portrays the differential effect on each. Hoerl and Kennard (1970a) contend that by reducing the variability of the coefficients a more accurate estimate of the parameter values can be obtained, although the resultant estimates are biased. The derivation of this approach considers the variance of the coefficient vector \hat{\beta} (equation 11) and notes that the expected value of the squared distance (L^2) from \hat{\beta} to \beta is

(38)  E(L^2) = \sigma^2 \, \mathrm{Trace}\,(X'X)^{-1}

or equivalently,

(39)  E(\hat{\beta}'\hat{\beta}) = \beta'\beta + \sigma^2 \, \mathrm{Trace}\,(X'X)^{-1}.

Hoerl and Kennard also demonstrate that (38) is equivalent to

(40)  E(L^2) = \sigma^2 \sum_{i=1}^{p} \lambda_i^{-1}.

The lower bound for the average squared distance between the sample coefficients and the parameters is given by σ²/λ_min. This corresponds to the previous discussion of multicollinearity, wherein it was noted that small eigenvalues are one of the best indicants of unstable weights. The authors' suggestion of augmenting the diagonal of (X'X) with small positive quantities (0 ≤ k ≤ 1) has the effect of decreasing the diagonal elements of the predictor inverse matrix. This in turn deflates the absolute values of the beta weights and reduces their collective variance. Hoerl and Kennard (1970a, p. 60) demonstrate that the expected squared distance from \hat{\beta}^{*} to \beta is composed of two elements: the total variance of the parameter estimates and the square of the bias introduced by the non-least squares computation (equation 37). Thus when k = 0 and OLS estimates are calculated, the bias is zero. The authors show with an existence theorem that it is possible to select k greater than zero, accept a little bias, and, without greatly inflating the residual error variance for the equation, obtain \beta estimates with substantially lower mean square error (L^2). The problem is then one of selecting an optimal k value for use in any one matrix.
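Equation (37) and the trace term in equations (38) and (40) can be made concrete with a few lines of linear algebra. The sketch below (Python/NumPy) works in the standardized, correlation-matrix form used throughout this thesis; the two-predictor correlation values are an invented illustration, and the helper names are mine rather than anything from the original program.

    import numpy as np

    def ridge_beta(R_xx, r_xy, k):
        """Standardized ridge weights (R_xx + kI)^-1 r_xy, as in equation (37); k = 0 gives OLS."""
        p = R_xx.shape[0]
        return np.linalg.solve(R_xx + k * np.eye(p), r_xy)

    def expected_sq_distance_ols(R_xx, sigma2):
        """E(L^2) = sigma^2 * trace(R_xx^-1) = sigma^2 * sum(1/lambda_i), equations (38) and (40)."""
        lam = np.linalg.eigvalsh(R_xx)
        return sigma2 * np.sum(1.0 / lam)

    # toy correlation structure: two nearly redundant predictors
    R_xx = np.array([[1.00, 0.95],
                     [0.95, 1.00]])
    r_xy = np.array([0.60, 0.58])

    print("OLS weights    :", ridge_beta(R_xx, r_xy, 0.0))
    print("ridge, k = .10 :", ridge_beta(R_xx, r_xy, 0.1))
    print("E(L^2) for OLS (sigma^2 = 1):", expected_sq_distance_ols(R_xx, sigma2=1.0))

The OLS weights split the nearly redundant pair into one large and one small (or wrong-signed) coefficient, while even a small k pulls both toward moderate, similar values; the large trace term shows why the OLS solution is expected to wander far from the parameters.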
The suggestion of Hoerl and Kennard (1970b) is to use a graphic display of the effects of increasing k and note the point at which four conditions are met: (a) the characteristics of the graph will be those of an orthogonal system, (b) coefficients will have reasonable absolute values, (c) coefficients with apparently incorrect signs at k = 0 will have changed to correct signs, and (d) the residual sum of squares will not have inflated to an unreasonable value. The plot (Hoerl & Kennard, 1970b, Figure 2, p. 72) shows the values of each of 10 coefficients plotted against the value of k in equation (37) which produced them. They would advocate selecting the beta weights produced by the equation with a k of approximately .25, that is, after the point of maximum decline in absolute value is passed and the coefficients are seen to be visually stable. Assessments which Hoerl and Kennard (1970b) make on the basis of this graph exemplify the utility of the procedure:

(i) The coefficients from the ordinary least squares are undoubtedly overestimated. At least, they are collectively not stable. It is unlikely that another set of y's would give coefficients like these. Moving a short distance from the least squares point k = 0 shows a rapid decrease in absolute value of at least two of them, namely, those for factors 5 and 6. Figure 2 shows the decrease in the squared length of the coefficient vector with k. When k = .1, it is 43.3% of its original value; for an orthogonal system it would be 83%.

(ii) Factor 5 has the negative coefficient with the largest value. But the addition of k > 0 quickly drives it toward zero and it then becomes positive. Such action should not be surprising, especially when it is compared with the action of factor 6. Factor 6 also decreases rapidly but stabilizes and does not go down to zero. Factors 5 and 6 have a simple correlation coefficient of 0.84, which says that to a first approximation they are the same factor but with different names. It would be surprising if their true effects were opposite in sign. (Without a knowledge of the underlying technology, no definitive statement can be made.) The covariance of -4.33 is driving them apart so that they are opposite in sign. The phenomenon observed here is not atypical. Positive coefficients for highly correlated factors can be stable as a sum, especially when they are correlated to various degrees with other factors.

(iii) The correlations with other factors cause factor 1 to be underestimated. At k = 0 factor 1 is the second least important negative factor. But with the addition of k > 0 it increases in absolute value. The other negative factors are slightly overestimated, and when sufficient k > 0 has been added to stabilize the system, factor 1 becomes the most important negative factor.

(iv) Factor 7 is overestimated and is driven toward zero.

(v) At a value of k in the interval (0.2, 0.3) the system has stabilized, and coefficients chosen from a k in this range will undoubtedly be closer to \beta and more stable for prediction than the least squares coefficients or some subset of them (pp. 71-72).
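The trace Hoerl and Kennard describe is simply the family of curves obtained by re-solving equation (37) over a grid of k values. A minimal sketch follows (Python/NumPy with matplotlib assumed available); the three-predictor correlation matrix is a made-up stand-in, not the ten-factor example quoted above.

    import numpy as np
    import matplotlib.pyplot as plt

    # illustrative standardized data: R_xx = predictor intercorrelations, r_xy = validities
    R_xx = np.array([[1.00, 0.84, 0.30],
                     [0.84, 1.00, 0.25],
                     [0.30, 0.25, 1.00]])
    r_xy = np.array([0.50, 0.45, 0.30])

    ks = np.arange(0.0, 1.01, 0.01)
    trace = np.array([np.linalg.solve(R_xx + k * np.eye(3), r_xy) for k in ks])

    for j in range(trace.shape[1]):
        plt.plot(ks, trace[:, j], label=f"predictor {j + 1}")
    plt.axhline(0.0, color="grey", linewidth=0.5)
    plt.xlabel("k")
    plt.ylabel("ridge coefficient")
    plt.title("Ridge trace: coefficients stabilize as k increases")
    plt.legend()
    plt.show()

Reading the plot in the manner of the quoted passage amounts to finding the smallest k beyond which the curves flatten and no coefficient retains an implausible sign or magnitude.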
Several authors have utilized the ridge regression (RR) technique. Churchill (1975) used 3001 cases selected in samples of size 50 and calculated ridge coefficients for 13 predictors. His results demonstrated a departure from the parameter values 1.7 times higher for OLS as opposed to RR. Vinod (1976), who used a modified RR method which selected arbitrary k values based on rank reduction analyses, Marquardt and Snee (1975), McDonald and Schwing (1973), and Snee (1973) all reported superiority of RR over OLS. Several researchers have attempted to develop point estimates of k (Baldwin, 1975; Hoerl & Kennard, 1976; Lawless & Wang, 1976; McDonald & Galarneau, 1975; Newhouse & Oman, 1971), but these attempts uniformly assume that k is non-stochastic (Coniffe & Stone, 1973; Smith, 1976)--an inadmissible assumption. Further, these more exhaustive studies do not invariably demonstrate RR as superior to OLS. Thus, while virtually all investigators consider RR to be an instructive mode of analysis, and most contend that it is preferable to OLS in all nonorthogonal situations, it still remains as much an art (the selection of k) as a science.

The most practical suggestion is probably Marquardt's (1970) variance inflation factor (VIF), which was mentioned earlier. This value for the ith predictor is the ith diagonal element of the matrix (X'X + kI)^{-1}(X'X)(X'X + kI)^{-1}. Just as the diagonal elements of (X'X)^{-1} in standard form range from one to infinity as collinearity increases, so do these VIF values. Marquardt's suggestion is that k be selected at the point where these values are ". . . reasonable, certainly less than 10 . . ." (p. 609). Evaluation of the VIF, along with the eigenvector weights associated with small eigenvalues (Snee, 1973; Webster, Gunst, & Mason, 1974), would appear to be the most reasonable way of ascertaining which VIF's should be deflated the most and therefore which k value should be selected.

The research reported here proposed to evaluate the relative efficiency of ridge regression as compared with ordinary least squares. In addition, unit weighting was contrasted with these methods both because of its demonstrated utility and because it should in fact be most efficient in exactly the high collinearity situations for which RR was proposed (Wainer, 1976; Wainer & Thissen, 1976).

CHAPTER III

METHOD

Consideration of the possible approaches to these comparisons favored a Monte Carlo study in which the sample size and collinearity could be controlled. Accordingly, three matrices were selected from the literature. The factor structure for each of these matrices was input to the Ohio State Correlated Score Generation Program (Wherry, Naylor, Wherry, & Fallis, 1965), which produces multiple random samples corresponding to the structure. Generated samples from each of the three matrices were pooled to form three populations of 6000 cases each. Each matrix selected had 10 predictors, so that a total of 165 correlation coefficients were estimated. The maximum obtained discrepancy between a target and an estimated population correlation was .031.

The Population Matrices

The first matrix selected was used as an example by Hoerl and Kennard (1970a) and was taken from Gorman and Toman (1966). This matrix (hereafter referenced as HOPOP, Table 3) was selected both because of its previous use as an RR supportive example and because of its broad range of predictor intercorrelations. Two other matrices were selected so as to broaden the scope of the comparisons.
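The population-generation step relied on the Wherry et al. (1965) program. A present-day reader can approximate the same step by factoring the target correlation matrix and transforming independent normal deviates, as in the sketch below (Python/NumPy). Only the pooled size of 6000 cases comes from the text; the function name, the seed, and the placeholder three-variable target matrix are assumptions made for illustration.

    import numpy as np

    def generate_population(R_target, n_cases, seed=0):
        """Draw n_cases multivariate-normal scores whose population correlation matrix is R_target."""
        rng = np.random.default_rng(seed)
        L = np.linalg.cholesky(R_target)          # R_target must be positive definite
        Z = rng.normal(size=(n_cases, R_target.shape[0]))
        return Z @ L.T

    # an 11-variable target (10 predictors plus criterion) would be supplied here
    R_target = np.eye(3) * 0.5 + 0.5              # placeholder 3-variable matrix
    pop = generate_population(R_target, n_cases=6000)

    R_obtained = np.corrcoef(pop, rowvar=False)
    print("largest |target - obtained| correlation:",
          np.abs(R_obtained - R_target).max().round(3))

The final print mirrors the check reported in the text, where the largest discrepancy between a target and an obtained population correlation was .031.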
These matrices were considered more typical of those generally encountered in psychological and sociological applications.

[Table 3. Population matrix based on the Hoerl-Kennard data (taken from Hoerl & Kennard, 1970a): predictor intercorrelations, criterion validities, population beta weights, multiple R, R2, determinant, and eigenvalues for the 10 predictors.]

Table 4 illustrates the high average intercorrelation matrix reproduced from the factor structure of a matrix employed by Rock, Linn, Evans, and Patrick (1970) and originally taken from Klein and Evans (1969). Table 5 is the low average intercorrelation matrix used by Rock et al. (1970) and taken from Klein and Evans (1968). These two matrices (HIPOP and LOPOP, respectively), incorporating two other predictors which were deleted for the present research, were selected by Rock et al. (1970) to evaluate four methods of predictor selection because of their representativeness. It was felt that these three data sets constituted a reasonable sample from the domain of possible matrices of interest to researchers in the social sciences. The HOPOP matrix, with its negative intercorrelations and validities, is atypical of most psychological data but does characterize occurrences in the economics and management literature. Additionally, its use by Hoerl and Kennard (1970a) as an RR example without benefit of comparison with other techniques warrants its inclusion.

Eigenanalyses of the HIPOP and LOPOP matrices (Table 6) illustrate their salient features. Both data sets differ from HOPOP in that their ranges of intercorrelation are more restricted, typifying the data encountered in psychological and measurement studies. The first eigenvalue of the HIPOP matrix accounts for 67 percent of the total variance, while the first four roots of the LOPOP data set account for only 63 percent of its variance. Thus, by any accepted definition, the HIPOP matrix would be considered highly multicollinear while the LOPOP matrix is less severely afflicted. The fact that six of its roots combined account for less than 37 percent of the possible variance, however, indicates that weight estimation is likely to be adversely affected.

[Table 4. Population matrix based on the high average intercorrelation (HIPOP) data (taken from Rock, Linn, Evans, & Patrick, 1970): predictor intercorrelations, criterion validities, population beta weights, multiple R, R2, determinant, and eigenvalues.]

[Table 5. Population matrix based on the low average intercorrelation (LOPOP) data (taken from Rock, Linn, Evans, & Patrick, 1970): predictor intercorrelations, criterion validities, population beta weights, multiple R, R2, determinant, and eigenvalues.]
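The eigenanalysis summaries quoted above (for example, the first HIPOP root accounting for 67 percent of the predictor variance) come directly from the spectrum of the predictor correlation matrix. A short sketch of that computation (Python/NumPy); the matrix shown is an arbitrary example, not HIPOP or LOPOP:

    import numpy as np

    def eigen_summary(R_xx):
        """Eigenvalues of a predictor correlation matrix and cumulative variance proportions."""
        lam = np.sort(np.linalg.eigvalsh(R_xx))[::-1]     # largest root first
        prop = lam / lam.sum()                            # lam.sum() equals the number of predictors
        return lam, np.cumsum(prop)

    R_xx = np.array([[1.00, 0.70, 0.60],
                     [0.70, 1.00, 0.65],
                     [0.60, 0.65, 1.00]])
    lam, cum = eigen_summary(R_xx)
    print("eigenvalues           :", lam.round(3))
    print("cumulative proportion :", cum.round(3))

A dominant first root with several very small trailing roots is the pattern labeled highly multicollinear in the discussion above.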
Table 6
Eigenvectors of the Population Matricesa

            HIPOP
Variable
 1    .31  .38 -.38 -.23  .09  .34  .57 -.29  .05  .17
 2    .31  .05  .37 -.16 -.84  .07  .12  .04  .07  .09
 3    .34  .03  .23  .05  .09  .12  .39 -.36  .71  .09
 4    .34  .06  .16  .34  .19 -.18  .06 -.46  .37  .56
 5    .33  .09  .17  .24  .25  .54  .01  .63  .02  .19
 6    .33  .09  .10  .30  .07 -.65  .37  .30  .21  .30
 7    .35 -.13  .08  .03  .08  .09  .32 -.13  .53  .66
 8    .30  .42 -.40 -.37 -.03 -.31  .48  .23  .08  .21
 9    .26 -.55 -.64  .34 -.28  .06  .04  .02  .11  .09
10    .28 -.58  .16 -.63  .29 -.11  .18  .07  .02  .16

            HOPOP
 1   -.41  .36 -.14  .07  .02 -.36  .15  .03  .60  .40
 2    .01  .10  .77  .09  .33 -.22  .29  .37  .04  .11
 3   -.39  .14  .07  .20  .02  .61  .42  .48  .08  .02
 4   -.06 -.02 -.36 -.50  .76  .02  .01  .19  .05  .04
 5    .47 -.06  .00  .16  .18  .03  .26 -.03  .69  .41
 6    .46 -.28  .11 -.04  .05  .05  .19  .20  .05  .78
 7    .21  .59  .12  .14  .30  .40  .09 -.52  .09  .19
 8   -.25 -.57  .09  .07  .12  .43  .51 -.24  .28  .07
 9    .04  .17  .30 -.80 -.36  .22  .02 -.03  .25  .04
10    .36  .24 -.36  .09 -.21  .22  .59  .48  .01  .06

            LOPOP
 1    .24  .46  .35  .01 -.36  .43  .46  .15  .07  .23
 2    .07  .73  .22 -.14  .01 -.34  .46 -.11  .01  .23
 3    .46 -.13 -.04  .14 -.08  .24  .04 -.03  .02  .82
 4    .44  .03 -.11  .03  .22  .10  .15  .12  .80  .23
 5    .38  .12 -.40  .11  .20 -.06  .16  .57  .49  .18
 6    .36  .09 -.04 -.04  .43 -.39  .55 -.45  .13  .07
 7    .27 -.25  .14  .19 -.61 -.64  .04  .14  .08  .09
 8    .13 -.32  .61 -.54  .29 -.07  .02  .35  .07  .06
 9    .11  .02 -.48 -.79 -.35  .02  .05 -.10  .02  .01
10    .39 -.23  .17  .02 -.10  .26  .47 -.52  .29  .33

aEach population contained 6000 cases.

Samples

Twenty-five samples of sizes 30, 60, 90, 120, and 200 were drawn from each population using a random sampling procedure available in the SPSS package (Nie et al., 1975). These sample sizes were selected to span the range of values for which unit weights have been demonstrated to be superior to OLS (Schmidt, 1971). Each sample (375 in all) was standardized and input to a program written by the author for the necessary least squares, unit weight, and ridge regression computations.

Equation Estimation

Five equations were estimated in each of the samples. OLS weights and the multiple R they produced were calculated according to equation (8). Three unit weight equations were also estimated in each sample. The first equation was produced by assigning the signs of the fallible sample beta weights (BU) to the unit weighting coefficients. The sign of each validity coefficient (VU) in each sample was also used to determine the unit weight signs. Third, the infallible population beta weight signs (PU) were employed. This third method implies that the investigator has prior information as to the correct signs, presumably on the basis of previous experience with the variables. These three methods were selected because, in an applied situation, they correspond to the manner in which one would generally determine the unit signs. For the ridge regression equation in each sample, the value of k in equation (37) was determined by the following rule: select the largest k possible (in steps of .01) with the restriction that no diagonal element of (X'X + kI)^{-1}(X'X)(X'X + kI)^{-1} is less than 1.0. In keeping with the earlier discussion of VIF's (Marquardt, 1970; Snee, 1973), pilot work was done experimenting with a variety of selection rules based on these values. It was found that the above rule always selected reasonable k values at a point just slightly lower than a visual trace examination would suggest.
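The selection rule just stated is easy to restate as a loop over k. The sketch below (Python/NumPy) is a reconstruction of the rule as described, not the author's original program, and the function names and example matrix are mine: it computes the ridge VIFs, the diagonal of (R + kI)^{-1} R (R + kI)^{-1}, at each step of .01 and keeps the largest k for which no VIF has fallen below 1.0.

    import numpy as np

    def ridge_vifs(R_xx, k):
        """Diagonal of (R + kI)^-1 R (R + kI)^-1: Marquardt's VIFs for the ridge solution."""
        A = np.linalg.inv(R_xx + k * np.eye(R_xx.shape[0]))
        return np.diag(A @ R_xx @ A)

    def select_k(R_xx, step=0.01, k_max=1.0):
        """Largest k (in steps of .01) such that no ridge VIF is less than 1.0."""
        best = 0.0
        for k in np.arange(0.0, k_max + step / 2, step):
            if ridge_vifs(R_xx, k).min() >= 1.0:
                best = k
            else:
                break
        return best

    R_xx = np.array([[1.00, 0.90, 0.30],
                     [0.90, 1.00, 0.35],
                     [0.30, 0.35, 1.00]])
    k_star = select_k(R_xx)
    print("selected k     :", round(k_star, 2))
    print("VIFs at that k :", ridge_vifs(R_xx, k_star).round(2))

At k = 0 every VIF is at least 1.0 (they equal the diagonal of R^{-1}); because they shrink as k grows, the rule stops just before any predictor's VIF drops below the orthogonal value of 1.0.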
This criterion is also in keeping with more recent analytical attempts to define k, which have generally found that less bias (i.e., small k values) can adequately handle the problems of multicollinearity (Guilkey & Murphy, 1975; McDonald & Galarneau, 1975).

Data Analysis

Virtually all studies evaluating ridge regression to date have, at least implicitly, been concerned only with structural interpretation. Previous Monte Carlo studies (Hoerl, Kennard, & Baldwin, 1976; McDonald & Galarneau, 1975) which had available the true parameter values of \beta based their evaluations on the mean square error (MSE) criterion, i.e., \sum_{i=1}^{p} (\hat{\beta}_i - \beta_i)^2, with the \hat{\beta}_i produced by either OLS or RR. While this comparison statistic accurately reflects the average precision of the \hat{\beta}_i point estimates, it does not provide for assessment of the predictive utility of the overall linear combination. It is possible that while one method of estimating the coefficients will have a lower MSE than another, the predictive utility of the latter will be superior. Because of this consideration, the predictive ability of all five equations was evaluated as well as the MSE. As the RR procedure necessarily decrements the coefficient of determination in the estimation sample as compared to that of OLS, these initial R and R2 values were evaluated. Due to the overfitting of the regression surface in the estimation sample discussed earlier, a practical measure of an equation's utility is its performance in a cross-validation sample. However, as Schmidt (1971) has noted, a researcher is not interested in how a set of weights does in a single random replication sample but rather in how it performs in the long run, i.e., how it compares with the predictive utility of the infallible population weights. Accordingly, the equations estimated for each sample were cross-validated in the populations from which they were drawn. The formula for the cross-validated multiple R is (Nunnally, 1967)

(41)  R_w = \frac{w'(X'y)_{pop}}{\sqrt{w'(X'X)_{pop}\,w}}

where w is the appropriate vector of unit, RR, or OLS weights.

The final comparison statistic, like MSE, is applicable only to the RR and OLS results. The coefficient of variation proposed by Churchill (1975) is calculated by dividing the square root of the MSE for a single coefficient by the true parameter value. These coefficients of variation (CV) can then be averaged over predictors, sample sizes, and/or populations for summary purposes.
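Both comparison statistics can be written directly from the population moment matrices. The sketch below (Python/NumPy) uses generic function names and invented two-predictor values purely for illustration; it evaluates equation (41) for any fixed weight vector and computes the MSE criterion for a stack of coefficient estimates.

    import numpy as np

    def population_cross_R(w, Rxx_pop, rxy_pop):
        """Equation (41): validity of the fixed composite Xw in the population,
        R_w = w'r_xy / sqrt(w'R_xx w), with the moments in correlation (standardized) form."""
        return (w @ rxy_pop) / np.sqrt(w @ Rxx_pop @ w)

    def mean_square_error(beta_hats, beta_true):
        """MSE criterion: average over replicated samples of sum_i (b_i - beta_i)^2."""
        d = np.asarray(beta_hats) - beta_true          # one row of estimates per sample
        return np.mean(np.sum(d ** 2, axis=1))

    # illustrative population moments only
    Rxx_pop = np.array([[1.0, 0.8],
                        [0.8, 1.0]])
    rxy_pop = np.array([0.60, 0.55])

    w_unit = np.array([1.0, 1.0])
    w_ols = np.linalg.solve(Rxx_pop, rxy_pop)           # infallible population weights
    print("R_w, unit weights      :", round(population_cross_R(w_unit, Rxx_pop, rxy_pop), 3))
    print("R_w, population weights:", round(population_cross_R(w_ols, Rxx_pop, rxy_pop), 3))
    # mean_square_error would be applied to the 25 sets of sample estimates in each design cell

Because the weights are treated as fixed, equation (41) needs only the population matrices, which is what makes long-run cross-validation in the generated populations feasible.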
CHAPTER IV

RESULTS AND DISCUSSION

Estimation sample results for the five equations discussed above are presented in terms of the obtained mean coefficients of determination in Table 7. As noted above, the bias factor due to the k value in equation (37) results in a lower R2 for ridge regression than for ordinary least squares in all cases. The average values of these differences for 25 samples are presented for all sample sizes and for the three populations in Table 8. The magnitude of the positive values in Table 8 reflects the higher average obtained R2 values for OLS over 25 samples for each of the four alternative weighting methods. It should be noted that the lower obtained values for RR equations (lower by approximately .038, on average) will in general make them better estimates of the population cross-validity, due to the overfitting of sampling error present in the estimation sample. Possible exceptions to this conclusion are cases in which either an OLS or RR initial equation cross-validates upward in terms of R2. This in fact occurred, on the average, for samples of sizes 90, 120, and 200 drawn from the HOPOP matrix and estimated by RR. The special nature of this population will be discussed below. A final difference to be noted between OLS and RR in Table 7 is the reduced range of R2 estimates provided by RR over the different sample sizes.

Table 7
Initial R2 - LS, RIDGE, BU, VU, PUa

                            Sample Size
Population   Equation     30     60     90     120    200
HIPOP        LS          .663   .615   .578   .544   .527
             RIDGE       .587   .562   .540   .513   .502
             BU          .415   .478   .468   .463   .463
             VU          .466   .485   .486   .471   .470
             PU          .483   .483   .493   .477   .476
HOPOP        LS          .933   .918   .911   .907   .902
             RIDGE       .869   .864   .857   .858   .856
             BU          .641   .672   .683   .692   .696
             VU          .688   .642   .636   .651   .674
             PU          .698   .700   .700   .697   .696
LOPOP        LS          .434   .307   .265   .228   .208
             RIDGE       .395   .291   .257   .222   .204
             BU          .241   .198   .191   .161   .158
             VU          .238   .184   .169   .154   .146
             PU          .125   .149   .150   .150   .155

Note. All entries are mean values based on 25 samples.
aLS = ordinary least squares; RIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients; PU = unit weights with signs determined by infallible population beta weights.

Table 8
Mean Initial R2 Superiority of Least Squares Over RIDGE, BU, VU, PUa

                            Sample Size
Population   Equation     30     60     90     120    200    Average
HIPOP        RIDGE       .076   .053   .038   .031   .025    .045
             BU          .248   .137   .100   .081   .063    .128
             VU          .197   .130   .092   .071   .057    .109
             PU          .180   .132   .085   .067   .051    .103
HOPOP        RIDGE       .064   .054   .054   .049   .046    .053
             BU          .292   .246   .228   .215   .206    .237
             VU          .245   .276   .275   .256   .228    .256
             PU          .235   .218   .211   .210   .206    .216
LOPOP        RIDGE       .039   .016   .008   .006   .004    .015
             BU          .193   .109   .074   .067   .050    .099
             VU          .196   .123   .096   .074   .062    .110
             PU          .305   .158   .115   .078   .053    .142
AVERAGE      RIDGE       .060   .041   .033   .029   .025    .038
             BU          .244   .164   .137   .121   .106    .155
             VU          .213   .176   .154   .134   .116    .159
             PU          .240   .169   .137   .118   .103    .154

Note. All entries are mean values based on 25 samples.
aRIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients; PU = unit weights with signs determined by infallible population beta weights.

It appears, then, that RR is somewhat less sensitive to the size of the sample in which weights are estimated than is OLS. Over the three populations and five sample sizes, the range of ridge estimated coefficients of determination is approximately 37 percent less than that of the OLS estimates. All three unit weight equations demonstrate the same relative indifference to sample size and, in some cases to be discussed below, provide better estimates of actual utility than either OLS or RR.

For predictive purposes the R2 obtained in the initial sample is typically not of interest beyond indicating whether the linear combination of predictor variables has any utility at all. Cross-validated (typically in only a single holdout sample) or formula-estimated coefficients of determination are the usual criteria for utility decisions. In general, the latter approach has been shown to be preferable (Schmitt, Coyle, & Rauschenberger, 1977); however, in a Monte Carlo study such as considered here, one has available the actual population matrix, which obviates the need for estimates of long-term cross-validated efficiency. Table 9 presents the results, for the four relevant equation types, of applying the sample estimates to the population from which the data were drawn.
As this step concerns validation of sample-dependent values, the unit weight equations signed by the infallible population beta weights (PU in Tables 7 and 8) are not evaluated. Table 10 contains the average differences between OLS and the RR equations, the unit weights signed by the sample validity coefficients (VU), and the weights signed by the sample beta weights (BU). Negative entries in Table 10 indicate that the equation in question obtained a higher average cross-validated R2 than did OLS for the same population and sample size.

Table 9
Cross-Validated R2 - LS, RIDGE, BU, VUa

                            Sample Size
Population   Equation     30     60     90     120    200
HIPOP        LS          .345   .405   .447   .466   .481
             RIDGE       .428   .456   .474   .482   .489
             BU          .231   .326   .358   .390   .412
             VU          .465   .465   .465   .465   .465
HOPOP        LS          .845   .872   .879   .886   .890
             RIDGE       .838   .858   .867   .875   .880
             BU          .574   .640   .673   .684   .697
             VU          .601   .566   .588   .612   .659
LOPOP        LS          .050   .085   .111   .130   .147
             RIDGE       .056   .089   .113   .132   .148
             BU          .040   .056   .075   .092   .113
             VU          .093   .116   .124   .128   .135

Note. All entries are mean values based on 25 samples.
aLS = ordinary least squares; RIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients.

Table 10
Mean Cross-Validated R2 Superiority of Least Squares Over RIDGE, BU, VUa

                            Sample Size
Population   Equation     30     60     90     120    200    Average
HIPOP        RIDGE      -.083  -.051  -.027  -.016  -.008   -.037
             BU          .144   .079   .089   .076   .069    .085
             VU         -.120  -.060  -.018   .001   .016   -.036
HOPOP        RIDGE       .007   .014   .012   .011   .010    .011
             BU          .271   .232   .206   .202   .193    .221
             VU          .244   .306   .291   .274   .231    .269
LOPOP        RIDGE      -.006  -.004  -.002  -.002  -.001   -.003
             BU          .010   .029   .036   .038   .034    .029
             VU         -.043  -.031  -.013   .002   .012   -.015
AVERAGE      RIDGE      -.027  -.014  -.006  -.002   .000   -.010
             BU          .132   .113   .110   .105   .099    .112
             VU          .027   .072   .087   .092   .086    .073

Note. All entries are mean values based on 25 samples.
aRIDGE = ridge regression; BU = unit weights with signs determined by sample beta weights; VU = unit weights with signs determined by sample validity coefficients.

Inspection of Tables 9 and 10 shows that the BU equations are generally the poorest while RR provides the best average results. These results are more evident if one ignores the obtained values for the Hoerl and Kennard (1970b) population. In the high and low intercorrelation populations, ridge regression outperforms OLS by a small margin (.02) for all sample sizes. In the HOPOP matrix the situation is reversed, with OLS demonstrating a slight superiority (.01) over RR.

As Tables 7 through 10 concern predictive utility rather than structural interpretation, it is at this point that the efficiency of the various unit weighting schemes must be considered. Schmidt (1971) noted that with simulated data such as presented here, violations of the assumptions of multiple regression (linearity, homogeneity, and normality of conditional variances) cannot occur. Such violations apparently occur in approximately 20 percent of actual empirical data sets (Sevier, 1957; Schmidt, 1971; Tupes, 1964), and their effect is to attenuate the predictive utility of OLS. Therefore, in this simulation, differences between OLS obtained R2 values and those of unit weights should be taken as maximal estimates. In practice, OLS will be somewhat less efficient than is indicated here. In the HIPOP and LOPOP matrices (Tables 9 and 10) the results for the unit weighting methods are similar to those reported by Schmidt (1971).
As concluded in that study, when no suppressor effects are present, a sample size of approximately 180 is necessary before OLS will demonstrate a distinct superiority over unit weights. In Table 9 the high and low intercorrelation populations show OLS to be useful upon cross-validation in the range between 120 and 200 cases. It is also concluded from these tables (9 and 10) that signing unit weights with the sign of the sample beta weight estimate is not generally advantageous. This is congruent with Hoerl and Kennard's (1970a, 1970b) rationale for RR; that is, when collinearity is high, the sample beta weights will frequently exhibit incorrect signs and indicate excessive suppressor effects. Thus, when previous experience with the variables permits one to decide the sign of the unit weight for each predictor, these signs should be employed. This method is, conceptually at least, preferable to both the BU and VU sign assignment, as it is independent of sampling fluctuations. In practice, many uses of MR involve variables (as predictors or criteria) for which one could not confidently decide on their signs before analysis (Roose & Doherty, 1976). The conclusion to be drawn from this study is that the next best alternative is to use the sign of each predictor's zero-order validity coefficient.

The Hoerl and Kennard (1970b) population matrix (HOPOP in Tables 7 through 10) presents several contradictions to the above mentioned conclusions. This population is not typical of those encountered in social science data; its coefficient of determination is higher than the norm, four of the validities are negative, and five variables in the population are identified as suppressors (see Table 3). It is ironic that this matrix was chosen by Hoerl and Kennard (1970b) as an example of the advantages of RR over OLS. Across all sample sizes investigated in this study, RR equations based on random samples from the HOPOP matrix are dominated by OLS. Ordinary least squares also demonstrates higher cross-validated coefficients of determination than do either beta weight or validity signed unit weights. With regard to the RR results, it must be concluded that the biasing factor of equation (37) "overcorrected" the weights of some predictors in this population and thus reduced the cross-validated R2. This occurrence emphasizes the need for an analytical determination of an optimal biasing parameter (k) which ideally could adopt different values for different predictors. This would seem to be indicated as advantageous for sample data from a matrix such as HOPOP, where the collinearity is not uniform across the predictors. A mixture of high and low pairwise intercorrelations (Table 3) presumably requires a variable bias factor. This conclusion is supported by the results for the HIPOP and LOPOP matrices, both of which demonstrated RR as superior to OLS upon cross-validation of the sample equations in the population. These discrepancies further demonstrate that the determinant is an insufficient indicant of the degree of collinearity insofar as its value might be used to determine whether OLS or RR should be applied. The LOPOP population actually has a determinant 48 times larger (indicating less severe multicollinearity) than the HOPOP matrix, yet RR was superior on the LOPOP samples and not on the HOPOP samples.

Conclusions as to the predictive utility of these various combinatorial schema would seem to be as follows:
1. In agreement with Schmidt (1971), unit weights should be employed in samples of under approximately 200 cases.

2. In the absence of prior knowledge, validity coefficients should be used to determine the sign of each predictor's unit weight.

3. The predictive utility of ridge regression, while superior to OLS and unit weights in some instances, would not seem to be great enough to warrant its use. If an analytic determination of the bias parameter k is developed, ridge regression would seem to be practical for analyses in which the intercorrelation is both "high" and consistent throughout the matrix and sample size is not very large. While this study is not conclusive, it appears that RR is most useful for predictive purposes in the same sample size range as are unit weights.

The focus of the discussion now turns to consideration of the accuracy of weight estimation by the OLS and RR methods, as opposed to the predictive utility of their respective linear combinations. Table 11 presents the average mean square error (MSE) of estimation, the average bias factor (k), and the calculated average sum of reciprocals of the eigenvalues for each population and sample size. It should be recalled that the selection of the value for k determines, along with sample-specific collinearity, the value which will result for MSE. Thus, as long as an analytic solution for the bias factor is not available, individuals may rightly argue for the appropriateness of values other than those employed here. It is considered, however, that the method of determining k employed in this study yields reasonable results which are consistent with published uses of RR. Further, as noted by Churchill (1975), a Monte Carlo study is potentially susceptible to the criticism of optimizing the selection of the bias factor so as to conform to the population specifications. Thus, it is argued that the arbitrariness of a "reasonable" selection rule such as used herein will permit greater generalizability of results as we await a solution to the problem of analytically optimizing k on the basis of sample information only.

The mean square error values in Table 11 can be interpreted as a summary measure of the accuracy with which weights were estimated.

Table 11
Equation Mean Square Errorsa

                        HIPOP                  HOPOP                  LOPOP
Sample
Size    Equation   MSE    Σλ^-1    k      MSE    Σλ^-1    k      MSE    Σλ^-1    k
30      LS        .085    51.15          .020    54.46          .064    20.47
        RIDGE     .023    23.67   .15    .041    27.04   .08    .043    16.56   .07
60      LS        .489   125.34          .009    43.78          .027    16.45
        RIDGE     .033    23.01   .15    .029    26.85   .06    .022    14.75   .05
90      LS        .023    36.15          .006    39.48          .016    15.54
        RIDGE     .009    20.87   .14    .025    25.64   .06    .014    14.52   .03
120     LS        .013    34.75          .004    38.89          .010    15.19
        RIDGE     .006    20.43   .14    .021    26.42   .05    .009    14.27   .03
200     LS        .008    32.80          .002    37.05          .006    14.43
        RIDGE     .004    19.87   .15    .018    26.23   .05    .005    13.84   .02

Note. All entries are mean values based on 25 samples.
aLS = ordinary least squares; RIDGE = ridge regression; Σλ^-1 = sum of the reciprocals of the sample eigenvalues; k = average value of k in the expression (X'X + kI)^{-1}X'y.

MSE is equal to the sampling variance of the estimates about their mean plus the squared bias. Thus, the smaller the MSE value for a particular cell, the better the average beta estimate was when one averages errors over the 10 coefficients in each sample and the 25 samples per cell.
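The decomposition just stated, mean square error as sampling variance plus squared bias, can be verified per coefficient from the replicated sample estimates in a cell. The sketch below (Python/NumPy) is a hedged illustration: the array shapes and the fabricated estimates are assumptions, not the thesis' data files.

    import numpy as np

    def precision_stats(beta_hats, beta_true):
        """Per-coefficient MSE, sampling variance, and squared bias over replicated samples.
        beta_hats: (n_samples, p) array of estimates; beta_true: length-p parameter vector."""
        beta_hats = np.asarray(beta_hats)
        mean_hat = beta_hats.mean(axis=0)
        variance = beta_hats.var(axis=0)                  # spread of estimates about their own mean
        bias_sq = (mean_hat - beta_true) ** 2             # squared bias of the average estimate
        mse = ((beta_hats - beta_true) ** 2).mean(axis=0)
        return mse, variance, bias_sq                     # mse == variance + bias_sq

    # toy check: fabricated estimates for two coefficients over 25 samples
    rng = np.random.default_rng(3)
    est = rng.normal(loc=[0.45, 0.20], scale=[0.10, 0.05], size=(25, 2))
    mse, var, b2 = precision_stats(est, beta_true=np.array([0.50, 0.25]))
    print("identity holds:", np.allclose(mse, var + b2))

Summing the per-coefficient MSE values over the 10 predictors and averaging over the 25 samples in a cell gives a quantity of the kind tabled above.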
It is seen in Table 11 that RR had a smaller MSE than OLS in all HIPOP and LOPOP samples. This improvement in accuracy appears to diminish as the sample size available for estimation increases, similar to the results for predictive utility presented above. While it was concluded earlier that unit weights were preferable to RR estimates for predictive use when sample size is less than approximately 180, the same is not true here. Structural interpretation of regression estimates makes explicit the intent to characterize a system or process as a function of the magnitude of the weight estimates. The substitution of arbitrary weights (i.e., unit weights) may not deter predictive use of the system's indicators, but it necessarily eliminates the possibility of assessing their individual utilities.

The improvement in MSE attributable to RR is substantial in most cases. The outlying value of .489 for the least squares mean MSE at a sample size of 60 is attributable to one random sample's extreme beta estimates. Omitting this one sample and calculating the same statistics for OLS on 24 samples yields MSE = .033 and Σλ^-1 = 39.36; the Σλ^-1 and k for RR remain unchanged, with k at .15. It is interesting to note that this one sample's extreme beta estimates were adequately handled by the RR technique using the decision rule for k selection discussed above. Over all sample sizes in the high intercorrelation matrix, the MSE due to use of RR weight estimates is approximately 91 percent less than that generated by OLS (the value is 65 percent if the one aberrant sample from the sample size 60 cell is removed). In the LOPOP matrices RR is 24 percent more accurate overall. In the HOPOP matrices, as was the case with predictive utility, RR is dominated at all sample sizes by OLS. For these samples OLS is 69 percent more efficient than RR. The conclusion is therefore similar to that for predictive considerations: for the appropriate matrices (high collinearity which is consistent across the matrix) ridge regression can provide improved weight estimation, especially for small sample sizes. Even in subjectively low intercorrelation cases (LOPOP), RR will not be worse than OLS, although the extra computational labor may not be worth the slight gain in estimation precision.

Table 11 also lists the values computed for the sum of the reciprocals of the eigenvalues with and without the biasing factor (OLS and RR solutions, respectively). This value was assessed by Hoerl and Kennard (1970b) as an indication of the degree to which orthogonalization had been achieved by the RR technique. If the predictors utilized had in fact been uncorrelated, this value, as demonstrated earlier, would equal 10.0 or, equivalently, p, the number of predictors. RR over all HIPOP matrices demonstrated a 61 percent reduction (44 percent with the above noted sample omitted from the sample size 60 cell) in the sum of eigenvalue reciprocals as compared to OLS. In the LOPOP matrices the figure was 10 percent, while in the HOPOP samples the reciprocals were 38 percent smaller. These results demonstrate the inappropriateness of this value (the sum of reciprocals of eigenvalues) as an indicant of the utility of the RR technique. As noted for the HOPOP matrices, reduction in the size of this value is an artifact of the application of equation (37) and does not necessarily imply either enhanced precision of estimation or superior predictive ability.
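One reading of the "sum of reciprocals with and without the biasing factor" in Table 11 is trace[(X'X)^-1] for OLS versus trace[(X'X + kI)^-1] for the ridge solution; the one-liner below (Python/NumPy, with an arbitrary correlation matrix) makes that comparison explicit. This is an interpretation of the table note, not a quotation of the author's computation.

    import numpy as np

    R_xx = np.array([[1.00, 0.85, 0.80],
                     [0.85, 1.00, 0.75],
                     [0.80, 0.75, 1.00]])
    lam = np.linalg.eigvalsh(R_xx)
    k = 0.10

    print("OLS   sum of reciprocals:", np.sum(1.0 / lam).round(2))        # trace of (X'X)^-1
    print("ridge sum of reciprocals:", np.sum(1.0 / (lam + k)).round(2))  # trace of (X'X + kI)^-1
    print("orthogonal ideal        :", R_xx.shape[0])                     # equals p when X'X = I

Because adding k to every eigenvalue mechanically shrinks the sum toward p, the reduction says little by itself about estimation precision, which is the point made above for the HOPOP samples.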
Churchill's (1975) modified coefficient of variation can also be used to assess the advantages of one technique relative to another. This value (CV) is calculated by dividing the square root of an estimate's MSE by the absolute value of the parameter it is intended to estimate. These values can then be averaged for summary purposes. Table 12 presents the ratio of the average CV produced by OLS to that of RR. Again, deleting the single outlying sample from the HIPOP 60 cell reduces the value reported in Table 12 to approximately 1.71. These results are consistent with the conclusions drawn on the basis of MSE comparisons: RR is substantially more accurate at small sample sizes with high, consistent collinearity than is the OLS technique. This dominance diminishes as sample size increases, and it is further decremented for lower collinearity samples such as those represented by LOPOP. The Hoerl and Kennard (1970b) population again demonstrates the superiority of OLS at all sample sizes investigated.

Table 12
Ratio of Average LS CV to Average RIDGE CVa

                       Population
Sample Size     HIPOP     HOPOP     LOPOP
30              1.820      .919     1.182
60              3.194      .789     1.088
90              1.530      .706     1.053
120             1.460      .621     1.049
200             1.238      .532     1.032

Note. All entries are mean values for 25 samples averaged over 10 beta weights per sample.
aLS = ordinary least squares; RIDGE = ridge regression; CV = coefficient of variation.

Tables 13 through 15 present the relevant precision statistics for each coefficient for the HIPOP, HOPOP, and LOPOP matrices, respectively. Table 16 presents the differences between OLS and RR precision statistics pooled over sample sizes. It is interesting to note in these tables that RR produces a smaller bias in estimation for virtually all coefficients at all sample sizes for the HIPOP and LOPOP sets than does OLS, despite the inclusion of k in equation (37) as a deliberate biasing factor. The exceptions among the 100 bias estimates are seven values found among the LOPOP matrices for samples of sizes 120 and 200. In these cases OLS and RR produce identical (to three places of accuracy)

[Table 13. Precision statistics (MSE, variance, squared bias, and CV) for each of the 10 coefficients, HIPOP population, for LS and RIDGE at each sample size; all entries are mean values based on 25 samples.]
[Table 14. Precision statistics (MSE, variance, squared bias, and CV) for each of the 10 coefficients, HOPOP population, for LS and RIDGE at each sample size; all entries are mean values based on 25 samples.]
[Table 15. Precision statistics (MSE, variance, squared bias, and CV) for each of the 10 coefficients, LOPOP population, for LS and RIDGE at each sample size; all entries are mean values based on 25 samples.]

[Table 16. Differences between LS and RIDGE precision statistics for each coefficient in each population; all entries are mean values based on the differences between sets of 25 samples pooled over sample sizes.]