ABSTRACT

IMPLICATIONS OF THREE DEFINITIONS OF TEST BIAS IN THE VALIDATION OF AN APPRENTICE PROGRAM SELECTION PROCEDURE

By

Felicia Williams Seaton

The EEOC guidelines on employee selection procedures (1970) require the examination of currently used selection tests for evidence of test validity and possible bias against minority groups. No consensus in the scientific literature has been reached on the most appropriate definition of test bias. The present research examined 13 predictors used to select applicants for an apprentice training program for evidence of validity and fairness, using the Cleary (1968), Thorndike (1971), and Darlington (1971) definitions of fairness to minority applicants. Sixty-three third- and fourth-year apprentice trainees volunteered to participate in the present study. Validity was measured by the relationship between selection predictors and a job sample performance test, a paper-and-pencil achievement test, and a composite score derived from performance and achievement scores. Correlations between predictors and criterion measures were computed separately for majority and minority apprentices, and the significance of the difference between validity coefficients for majority and minority apprentices was tested to determine if differential validity existed. Analyses of test bias were performed using the Cleary, Thorndike, and Darlington definitions. Results showed that for the total sample of apprentices, five of the thirteen predictors were adequately valid measures of the criteria. Differential validity was not found to occur more often than could be expected by chance alone. Differences in mean test and criterion scores for majority and minority apprentices were significant at the .05 level.
The Cleary definition specifies that a fair test can neither over- nor underpredict performance for members of any subgroup. Applying that definition to the present data, all tests were found to be biased in favor of the minority group (i.e., minority criterion scores were overpredicted by test scores). The Thorndike definition states that a test is fair if the proportion of the minority group selected using the test is the same as the proportion who could be successful on the job. Using this definition, eight tests were biased in favor of the minority group, while the other five were fair tests. The Darlington definition requires that test scores for the minority group not differ from those of any other group of applicants with the same criterion score. By this definition, three of the tests are biased against the minority group, and two tests are biased in favor of the minority group.

Legal implications of test validity and test bias were discussed, and suggestions were given for further research on the issue of the differential effects of selection on population subgroups.

IMPLICATIONS OF THREE DEFINITIONS OF TEST BIAS IN THE VALIDATION OF AN APPRENTICE PROGRAM SELECTION PROCEDURE

By

Felicia Williams Seaton

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1975

ACKNOWLEDGEMENTS

This study was made possible through a grant funded by the U.S. Department of Labor. Special thanks go to Mr. William Main and Mr. Milt Murto of Chrysler Personnel Division for their valuable assistance and cooperation in the gathering of data for the present study. I am also indebted to Alan Greenthal and other graduate students who worked with me on this project.

My deepest appreciation goes to my chairman and advisor, Dr. Frank L. Schmidt, who has been my greatest source of guidance and encouragement throughout my graduate program. Without the expert assistance of Dr. John E. Hunter and his wife, Ronda, the data analysis and statistical portions of this research would have never been completed. I would also like to thank my other committee member, Dr. Neal Schmitt, whose many suggestions have been extremely helpful.

Finally, a special debt of gratitude is due to my husband, Carlos, who suffered through many difficult hours with me, and to my parents, whose confidence in my success never wavered, and whose prayers and inspiration were always there when times were hardest. It is to my loving parents that I dedicate this thesis.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ... ii
LIST OF TABLES ... iv
LIST OF FIGURES ... vi
INTRODUCTION ... 1
REVIEW OF THE LITERATURE ... 5
METHODS ... 31
    Data Source for Criteria and Predictors ... 31
    Data Analysis ... 42
RESULTS
    Differential Validity ... 58
    Test Bias ... 65
DISCUSSION ... 89
    Validity of Selection Tests ... 89
    Differential Validity ... 93
    Test Bias ... 95
    Conclusions ... 99
LIST OF REFERENCES ... 105
APPENDIX ... 108
LIST OF TABLES

1. Correlations Between Predictors and Performance Criterion Subscores ... 48
2. Correlations Between Predictors and Performance, Achievement, and the Composite Criterion ... 50
3. Predictor Validities Corrected for Attenuation in the Criterion ... 52
4. Predictor Validity Corrected for Restriction in Range ... 54
5. Predictor Validities Corrected for Attenuation in the Criterion and Restriction in Range ... 57
6. Evidence of Differential Validity for Majority and Minority Apprentices ... 59
7. Predictor Intercorrelations Before and After Selection ... 64
8. Correlation Between Criterion Measures and Race ... 66
9. Majority-Minority Mean Differences in Predictor Score ... 68
10. Correlations Between Predictors and Race for Restricted and Unrestricted Samples ... 70
11. Test Bias Statistics--Performance Criterion ... 71
12. Test Bias Statistics--Achievement Criterion ... 73
13. Test Bias Statistics--The Composite Criterion ... 75
14. Majority-Minority Mean Test Scores for Restricted and Unrestricted Samples ... 85
15. Changes in Standard Deviations from Restricted to Unrestricted Samples of Majority and Minority Apprentices ... 86
A1. Report of Test Results for Apprentice Candidates ... 108
A2. Apprenticeship Evaluation Standards ... 109
A3. Report of Selection Results for Apprentice Candidates ... 111

LIST OF FIGURES

1. Illustration of Thorndike's argument against the traditional definition of test fairness ... 13
2. Prediction of black and white criterion scores from the white regression equation (Schmidt and Hunter, 1974) ... 16
3. Prediction of black and white criterion scores from separate regression equations (Schmidt and Hunter, 1974) ... 18
4. Case B. Regression lines are identical for majority and minority, but the minority mean is lower (Cole, 1972) ... 27
5. Effects of selection on a continuous distribution of test scores ... 81
6. Difference in the restricted test distributions of two groups as a function of different selection ratios ... 84

INTRODUCTION

Discrimination in the selection and employment of the disadvantaged has been a topic of widespread investigation in the last decade. The development of the Equal Employment Opportunity Commission (EEOC) in 1964, and the publication of its guidelines for Employee Selection Procedures in 1970, have focused the attention of employers, personnel psychologists, and federal court judges on the use of tests and other selection procedures which may have an adverse impact on the selection of minorities and other groups protected by the Civil Rights Act of 1964.

A major issue to which a substantial portion of the EEOC guidelines (1970 and 1974) is devoted is the need for the validation of currently used selection procedures. While many professionals in both psychology and business recognized a need for validation, until the early 1970's few seemed to be doing much in the direction of completing test validation studies (Enneis, 1971; Wallace et al., 1970). Maintenance of test validation information became a requirement of the 1970 EEOC guidelines. Failure of an employer to successfully validate a selection procedure can result in the loss of a suit in which a minority employee files discrimination charges (as in the well publicized Griggs vs. Duke Power case of 1971), or in severe penalties from the EEOC.
As a result, employers (especially large companies) across the nation have devoted a great deal more attention to the validation of their selection procedures. While the guidelines make it clear that employers must maintain evidence of test validity in the face of a charge of adverse impact, there exists some confusion on the issue of what constitutes "adverse impact." Currently, two major phenomena have been considered: differential validity and test bias.

Differential validity occurs when the difference between the validity coefficient for one group and that for another is significant. This issue has received a disproportionate amount of attention in past research. Humphreys (1973) pointed out that reports of between-group validity differences in the literature have often used a test for the significance of test validity for each group separately (i.e., a test for single-group validity) rather than testing for the significance of the difference between the validities for the two groups. Humphreys concluded that this confusion of the two types of validity differences has contributed greatly to the inflation of estimates of the actual number of cases of differential validity occurring in employee selection. Schmidt, Berner, and Hunter (1973) showed that when differences between majority and minority sample sizes are taken into account, the frequency of single-group validity reported in the literature is not greater than would be expected by chance alone. These research findings and others (Boehm, 1972; Bray and Moses, 1972; Ruch, 1972) suggest that failure to find significant differences in test validity for majority and minority groups does not preclude the possibility that a test could still discriminate unfairly against the minority group.

Another issue relevant to adverse impact is the question of whether or not a test is unfair or biased against minorities. To the present time, there has been no general consensus on a definition for the phenomenon of test bias, although several models have been proposed. Cleary (1968) proposed a definition specifying that a fair test can neither over- nor underpredict performance for members of any subgroup. Thorndike (1971) stated that a test is fair only if the proportion of a population subgroup selected using the test is the same as the proportion who could be successful on the job. A third definition, proposed by Darlington (1971), maintains that a test is fair only when, within any given group of applicants with the same criterion score, test scores are not different for cultural subgroups.

The present research examines the three major definitions of test bias proposed by Cleary, Thorndike, and Darlington, as they apply to the selection procedure currently used by Chrysler Automotive Corporation to select applicants for the Apprentice Training Program. Validity of each of the individual selection tests, as measured by their relationships to job performance and achievement criteria, is assessed in accordance with EEOC guideline regulations. Each test is examined for evidence of bias under each of the three definitions of test bias, and the implications of the three models are compared. In addition, the possible effects of prior selection on the data are explored.
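Since differential validity concerns the significance of the difference between two validity coefficients, not the significance of each coefficient taken separately, the distinction Humphreys drew can be made concrete. The sketch below is illustrative only, not the thesis's own computation: the validities and sample sizes are hypothetical, and the test shown is the standard Fisher r-to-z comparison of two independent correlations.

```python
# Two-tailed test of H0: the population validity is the same in both groups.
# All numeric inputs below are hypothetical.
import math
from scipy.stats import norm

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def differential_validity_test(r_majority, n_majority, r_minority, n_minority):
    """z test for the difference between two independent correlations."""
    diff = fisher_z(r_majority) - fisher_z(r_minority)
    se = math.sqrt(1 / (n_majority - 3) + 1 / (n_minority - 3))
    z = diff / se
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

# Example: r = .40 for 43 majority and r = .15 for 20 minority apprentices.
z, p = differential_validity_test(0.40, 43, 0.15, 20)
print(f"z = {z:.2f}, two-tailed p = {p:.3f}")
```

With minority samples as small as those typical of validation research, the standard error is dominated by the smaller group, which is one reason chance alone can account for many of the differences reported in the literature.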
REVIEW OF THE LITERATURE

The question of fairness in employee selection procedures is one which has centered around definitions of the commonly used terms "discrimination" and "adverse impact." The major source for the legal clarification of these concepts has been the EEOC Guidelines for Employee Selection Procedures, in which discrimination is defined as "the use of any test which adversely affects the hiring, promotion, or transfer of classes protected under Title VII unless (a) the test has been validated and a high degree of utility shown, and (b) it can be demonstrated that suitable alternative procedures are unavailable for use" (EEOC, 1970). Evidence of discrimination includes instances of higher rejection rates for minority candidates than for non-minority candidates, a situation implying adverse impact for the minority group. Specifically, "adverse impact" is defined as the use of a selection procedure which results in the selection of members of any racial, ethnic, or sex group at a lower rate than members of other groups. The newest draft of the Uniform Guidelines on Employee Selection Procedures (EEOC, 1974) goes on to state a specific rate for any group selected (4/5 of that of the majority group) below which the selection procedure will be considered to have an adverse impact.

A basic problem in the EEOC clarification of these major concepts is the failure to distinguish between differential validity and test bias as determinants of discrimination and adverse impact. One source of confusion is the use of the terms differential prediction and differential validity interchangeably. In the case in which differential prediction refers to significantly different regression lines for different subgroups, the term more accurately represents test bias, not differential validity.

Although differential validity and test bias may at first appear very similar (a source of confusion to the EEOC), they are in fact very different in their frequency of occurrence and in their consequences for both employer and employee in the selection situation. The issue of differential validity has been widely researched in recent years, and attempts have been made to reach a consensus on the terminology and implications of the concept. There remains little consensus, however, on the terms and definition which should be used to express the important "fairness in testing" phenomenon. The terms "test bias" and "test fairness" are most commonly used in the current literature, and are often used interchangeably (thus a test may equally be called fair or unbiased). The terms "culture fair" and "culture free" have also been seen in earlier literature, but are often equated with the simplest definition of fairness in testing. A notable exception is the use of the term "culture fairness" by Thorndike (1971) and Darlington (1971), in which the concept is representative of the more sophisticated definition of the phenomenon of fairness, referred to in the present research as test bias or test fairness.1

The simplest definition of test bias is linked with the original use of the terms culture fair and culture free. As the terms were first used, a culture fair test was one in which the mean test score was the same for all subgroups. This definition made the a priori assumption that all groups were the same on the variable being measured, and that any differences which occurred were due to measurement error.
This eliminates the possibility of any real between-group differences on psychological traits, and it is inconsistent with empirical evidence revealed in years of psychological testing. The culture-free/culture fair definition represented a very early attempt to establish evidence of test bias against minority subgroups, and advocated the use of nonverbal tests which were supposedly inherently more fair to minorities. Research has shown, however, that nonverbal tests do not always create additional fairness for the disadvantaged, and have the potential of enhancing between-group differences (Arvey, 1972; Bray and Moses, 1972).

1 In the context of the present research, the terms test bias and test fairness refer to the way in which a test is used to make selection decisions, not to the inherent characteristics of the test itself.

In light of the results rejecting its major premises, the culture-free concept can be seen as an unsatisfactory definition of test fairness.

The next definition of test bias to emerge was proposed by Anne Cleary in 1968. This definition, the first to be discussed in detail, has come to be accepted as the traditional definition of test bias, and has been widely endorsed by educational researchers, industrial psychologists concerned with selection, textbook authors, and governmental agencies. The first court judgement directly concerned with the issue of test bias was decided in March, 1975. One outcome of this case (Cortez vs. Rosen) was the decision that only the Cleary definition of test fairness meets the EEOC requirement for fairness.

Cleary (1968) conducted a study of bias in the Scholastic Aptitude Test (SAT) against blacks, using her own definition of test bias. According to Cleary:

A test is biased for members of a subgroup of the population if, in the prediction of a criterion for which the test was designed, consistent non-zero errors of prediction are made for members of the subgroup. In other words, the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup (p. 115).

Using this definition, Cleary looked for test bias by studying predictive validity in terms of the regression of the criteria on the test. Data were gathered from three colleges, using GPA as the criterion. The two hypotheses tested were: (1) slopes will be equal for blacks and whites, and (2) there will be equal intercepts for blacks and whites. Assuming equal standard deviations for both races, the first hypothesis meant equal validity for both groups. The second hypothesis was more fundamental to Cleary's definition of test bias, for unequal intercepts would mean that consistent non-zero errors of prediction were being made when one regression line was used to predict for both groups.

A basic concern at the time of the Cleary study was that norms on the SAT (and other educational measures) developed primarily on white students might lead to an underprediction of educational success for black students. This would clearly be an unfair situation for blacks, and the solution Cleary proposed to alleviate the consistent errors being made was the use of separate regression equations for blacks and whites. The use of separate equations for each subgroup would then provide the most accurate estimate possible for each group.

The first hypothesis of equal slopes was not rejected in any of the three colleges.
The second hypothesis of equal intercepts was rejected in only one of the three colleges, but the direction of bias was unexpected: in that college, the SAT tended to overpredict, rather than underpredict, black GPA. While overprediction is still a form of bias in the strict sense of the Cleary definition, its implications are not as severe in considerations of discrimination against blacks. Cleary's conclusion was that little evidence exists of test bias against blacks in the SAT.

The Cleary definition (also called the regression model) has been the most widely accepted definition of test bias within a predictive context. It has been the theoretical basis for a substantial number of educational studies of bias (e.g., Cleary, 1968; Davis and Temp, 1971; Linn and Werts, 1971) and employment studies (e.g., Bach et al., 1971; Gael and Grant, 1972).

The model of bias presented by Cleary has a statistical as well as an intuitive appeal. In a sense, the use of a Cleary-defined fair test is "fair" to the institution because it maximizes prediction accuracy, thereby selecting applicants with the highest average criterion scores. With this definition, the highest quality work force (i.e., those most likely to actually succeed) will be selected. This is a major plus for this definition from an institutional standpoint.

A general (but never formally stated) argument against the Cleary definition claimed that there is a problem in defining fairness by the Cleary model when a less than perfectly reliable test is used. The concern is that the effect of unreliability in a test would be to produce artifactual differences between subgroup regression lines, leading to the labeling of a test as biased simply because of its unreliability. In this case the bias would favor the minority group. Hunter and Schmidt (1975) recognized this as a false issue, however. They demonstrated that an unreliable test would in fact be biased against better qualified applicants, but this bias is not racial in nature. It is, rather, a bias against the more qualified applicant, regardless of race, because it would be those applicants just above the test cut-off who would be rejected because of test unreliability. They also point out that the argument that tests are biased against blacks because they are unreliable is false, because in the case of .00 reliability, a test becomes a random selection device which would select blacks in proportion to the number of black applicants, and hence might well select blacks in proportion to population quotas. Thus, this common argument against the Cleary model is not only false, it is exactly opposite the truth.

The Cleary definition adequately fulfills the requirement of the ethical position of selecting "the best man for the job," and is fair to the individual because a fair test will never over- or underpredict that individual's performance. From a very different ethical standpoint, however, the Cleary definition is not satisfactory. Thorndike's objection to the Cleary definition represents such an ethical standpoint.

Thorndike (1971) argued that a test fair by the traditional definition of equal regression equations is unfair to the lower scoring minority as a whole, because the proportion qualifying on the test will be smaller, relative to the higher scoring group, than the proportion that will reach any specified level of criterion performance. The problem he identified is illustrated in Figure 1.
The situation in Figure 1 depicts a test fair by the Cleary definition: the regression of the criterion on the test is identical for the two groups (shown by a solid line with a slope = .25). Thorndike noted, however, that the difference in the means on the test is substantially larger than the difference in the means on the criterion variable. This difference has a notable effect on selection decisions. Suppose, for example, the mean of the majority group on the criterion was used as a success-failure cut-off. As can be seen in Figure 1, 50% of the majority group in this example would be successful, and about 30% of the minority group would succeed. Based on prediction of the criterion from knowledge of test scores alone, however, 50% of the majority but less than 5% of the minority would be selected with a "fair test"!

In light of this situation, Thorndike proposed a second definition of test bias, in which he states:

An alternate definition would specify that the qualifying scores on a test should be set at levels that will qualify applicants in the two groups in proportion to the fraction of the two groups reaching a specified level of criterion performance (p. 63).

[Figure 1. Illustration of Thorndike's argument against the traditional definition of test fairness: criterion scores plotted against test scores for the majority and minority groups.]

By adopting this definition, Thorndike suggests that a fair test will provide each group the same opportunity for admission to training or to a job as would be presented by the proportion of the group falling above a specified criterion score. Stated differently, the same proportion will be selected on a fair test as would be selected if the criterion itself were used to determine selection, or as if the test had perfect validity.

The "group fairness" definition proposed by Thorndike has an appeal very different from Cleary's individualist approach. In one sense, it does seem only fair to select from a group in proportion to the number of individuals in that group who could in fact perform a job successfully. In the example above (Figure 1), a Thorndike-fair test would, by selecting on criterion success cut-offs, select a markedly greater proportion of minorities than would the Cleary-fair test.

One of the major criticisms of the Thorndike definition is that it is not statistically optimal. It represents a probability matching approach where the specific a priori probability of being selected is the random probability that anyone from the group will be successful. If every "minority" group (for example, the low scoring majority) claimed that they should be selected in proportion to the number of individuals in that group who could be successful on the criterion, differential selection and placement would become a virtual impossibility, and many of the advantages of the use of valid selection procedures would be lost.
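Operationally, Thorndike's rule sets a separate qualifying score for each group: the score above which the same fraction of the group falls as reaches the criterion success cut-off. The following is a minimal sketch assuming normally distributed test scores; the 50% and 30% success rates echo the Figure 1 example, while the test means and standard deviations are hypothetical.

```python
# Thorndike-fair qualifying scores for two groups. The success proportions
# follow the Figure 1 discussion; the distribution parameters are assumed.
from scipy.stats import norm

def thorndike_cutoff(test_mean, test_sd, p_successful):
    """Test score exceeded by exactly p_successful of the group."""
    return norm.ppf(1 - p_successful, loc=test_mean, scale=test_sd)

majority_cut = thorndike_cutoff(test_mean=50, test_sd=10, p_successful=0.50)
minority_cut = thorndike_cutoff(test_mean=30, test_sd=10, p_successful=0.30)
print(f"majority qualifying score: {majority_cut:.1f}")  # 50.0
print(f"minority qualifying score: {minority_cut:.1f}")  # about 35.2
```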
From the Cleary position, such a test is both statistically inaccurate and ethically biased against the majority. A very practical criticism of the Thorndike definition follows from the above discussion. With this definition there could be a greater incidence of placing individuals in roles for which they are not suited (Hunter and Schmidt, 1975). This could be a different type of unfairness to minorities who if hired would be unsuccessful, especially if failure has some critical effect for employee, employer, or both. Schmidt and Hunter (1974) examined the different practical implications of the two definitions of test bias. In the example they presented (Figure 2), they made the assumption that the difference (in standard deviation units) in the black and white means on a given test is equal to 16 -(> Figure 2. Prediction of black and white criterion scores from the white regression equation (Schmidt and Hunter, 1974). 17 the racial difference on the criterion. That is, E(XW - E(XB) - E(YW) - E(YB) = 1 SDK = lsnY mean, subscripts W, B = white, black, respectively, X = 10. (Where E = the test, Y = criterion). The criterion variable Y is not shown, however. Instead Y, the values of Y predicted from the regression equation f. If RXY is .50, the white regression equation is YW = .SOX + 25. Under these circumstances, the test will overpredict for blacks (pre- are given, and E(YW) - E(YB) = 1 SD dicting a mean score of Y = 45 instead of Y = 40). This makes the test biased by the Cleary definition. But because the mean differences between groups are equal on test and criterion, it is fair by the Thorndike procedure. With the use of the Thorndike "fair" test. if the selection cut-Off were placed at the majority mean, 50% of the majority and 16% of the minority would be selected. If, however, separate equations were used to fit the Cleary definition of a fair test, the result would be as shown in Figure 3. Now the test neither over- nor underpredicts for either group, but this "fair" test selects only 2.3% of the minority group. More generally, Schmidt and Hunter conclude: "For all selection ratios, selection on the basis of a test meeting Cleary's definition results in the acceptance of a markedly smaller percentage of minority applicants than selection on a test meeting Thorndike's definition" (Schmidt and Hunter,l974). . 1 | | 18 -<> Figure 3. Prediction of black and white criterion scores from separate regression equations (Schmidt and Hunter, 1974). 19 In the analysis of bias studies in educational testing, Schmidt and Hunter (1974) found that many of the studies which had found the test used to be biased to favor minor- ities (in the direction of overprediction for blacks) by the Cleary definition, were nonetheless biased against blacks by the Throndike definition of equal mean differences on test and criterion. Similiar results were also found in employment studies reviewed by Schmidt and Hunter (1974). Other evidence of overprediction for blacks and Thorndike bias in the same data has been reported by Linn (1973). Schmidt and Hunter (1974) propose that, holding dif- ferences between groupsconstant, the larger the validity, the smaller the overprediction by the Cleary definition need be for a test to be fair by the Thorndike definition. Alternately, holding validity constant, the smaller the difference in subgroup criterion score, the smaller the magnitude of overprediction required. 
Noting the conflict of existing definitions of test fairness, Darlington (1971) approached the issue (which he and Thorndike both called "culture fairness") by first translating the positions represented by Cleary and Thorndike to a common denominator and, within the same language, proposing three other definitions. This was accomplished by stating all definitions in terms of correlational analysis.

Darlington started with the criterion variable Y and a predictor X, and defined a third variable C, which denotes group membership (e.g., majority - minority). With these terms defined, Darlington then gave four definitions of test fairness in terms of the correlations among these three variables. In each case, an equation was defined by the degree to which test X discriminated among cultural groups. For the purpose of his analysis, Darlington assumed that both groups have equal standard deviations on test and criterion and equal validity (i.e., parallel regression lines).

The first definition was a restatement of the traditional (Cleary) definition, where

1) RCX = RCY / RXY.

Stated in other terms, there can be no differences between races beyond that produced by differences on X, so that the partial correlation of group membership with the criterion, with the test partialed out (RCY.X), should be zero. In other words, a test is fair by Cleary if knowledge of group membership cannot be used to increase prediction accuracy. Fairness will be maximized by selecting people with the highest criterion scores by this definition, and the most valid test will (implicitly) be the most fair.

Darlington's second definition approximated Thorndike's position. In terms of the three variables,

2) RCX = RCY.

For a test to be fair by this definition, any racial differences which exist on the predictor will be equal (in SD units) to the racial differences which exist on the criterion. Hunter and Schmidt (1975) note that this is a correct statement of Thorndike's position if the common regression equation is used to select all applicants. If separate regression lines are used for different subgroups, an alternate definition must be given to correctly represent Thorndike's position.

Darlington then developed a third definition of test bias which requires

3) RCX = RCY * RXY.

The argument supporting definition three first assumes that successful performance on the criterion is related to a composite of many abilities, as is the ability to do well on a test. If the partial correlation between test and race, with the criterion partialed out (RCX.Y), is not 0, then the test must be tapping abilities which show large racial differences which are not relevant to the criterion. Such a test would be biased. This definition reverses the roles of X and Y as independent and dependent variables, respectively, and suggests the analysis of the regression of test on criterion rather than vice versa, as proposed in definitions 1 and 2.

Darlington's third definition received little attention until a novel argument in its favor was presented by Cole (1972). In the model proposed, Cole introduced a "fairness regardless of group" definition of test fairness which assures that, as a group, individuals who can achieve a satisfactory criterion score have the same probability of being selected regardless of group membership.
Her "equal opportunity model" (EOM) is thus a case of the requirement of definition three that RCX.Y = 0.

Darlington and Cole's definition of test fairness, which requires equal regression of test on criterion for all subgroups, is the same as the Cleary definition with the roles of test and criterion reversed. The two definitions are in no way equivalent, however, because the regression of X on Y is not the same as the regression of Y on X unless either (1) no racial differences exist on either test or criterion or (2) validity is perfect.

The fourth definition is simply stated:

4) RCX = 0.

This is the simplest definition of culture fairness. Stated in correlational terms, the test is not allowed to correlate at all with culture. This definition has already been discussed as oversimplistic and unrealistic in light of real-world selection situations.

Not completely satisfied with any of the first four definitions of test bias, Darlington proposed yet a fifth definition. In developing his fifth definition of test fairness, Darlington decided that it seemed unfair to use the white regression line to predict scores for blacks when consistent errors of prediction would be made; on the other hand, he disagreed with using race as a predictor, which he said is essentially what is done when separate regression equations are used. Darlington discussed the conflicting definitions of fairness in terms of "a conflict between the two goals of low cultural discrimination and high validity" (Darlington, 1971). He decided that the balance between the two goals required a value judgement concerning the relative importance of each.

Darlington proposed a procedure for operationalizing this required subjective judgement. He proposed that the administrator be asked to specify a number which represents the difference between criterion scores required to equalize the worth of a minority and a majority applicant. That is, he recommended the use of a number k such that a majority applicant would be worth exactly the same as a minority applicant whose criterion score is k units lower. Instead of trying to predict the criterion score Y, he suggested using the value Y - kC as the criterion in order to give a special value or advantage to hiring minorities (where C is scored 0 - 1 for minority - majority, respectively). Darlington then proposed that Y - kC, rather than Y, be the variable to be maximized in the selection process. If the regression of Y - kC on X is then fair by the Cleary definition, the test is "culturally optimal." If k = 0, Darlington's and Cleary's definitions are equivalent.

Darlington's definition five differs from Thorndike's first in format: where Thorndike's definition would set different cutting scores for different cultural groups, Darlington's system adds points to minority scores and uses the same cutting score. Both definitions are forms of quotas. The second, more fundamental difference is stated by Darlington:

Thorndike recommends, at least implicitly, a mechanical determination of the ratios of subjects selected from different cultural groups, while we recommend that these ratios be determined (indirectly) by a subjective, policy-level decision concerning the relative importance of validity maximization and cultural fairness (1971, p. 71).

Evaluations of Darlington's definition of "cultural optimality" have not tended to be highly favorable, due for the most part to his insistence on the subjective determination of the value of k.
Hunter and Schmidt (1975) point out that, from a mathematical point of view, subtracting a constant from white criterion scores is equivalent to adding the same constant to black scores without changing the prediction equation. Thus Darlington's procedure simply amounts to adding a constant to black scores so that more will be admitted. They conclude that this is just "an esoteric and uncontrolled method of setting quotas" (Hunter and Schmidt, 1975).

In Linn's (1973) opinion, although Darlington's proposal is of theoretical interest, in practice it seems unlikely that institutions will ever formalize the procedure by actually picking an explicit value of k and then trying to maximize the value Y - kC. Hunter and Schmidt (1975) suggest that when the Cleary definition is employed without the use of separate regression equations, the employer has implicitly picked a value of k equal to the distance between the majority and minority regression lines.

Darlington's definitions three and five, since they are not equivalent to either Cleary's or Thorndike's definitions, present still another perspective on the potential implications of test bias in employee selection. Cole (1972) examined the implications of these four definitions of bias, plus two others,2 in an attempt to identify the values and beliefs about fairness and the actual procedures to be followed to alleviate bias according to each model. In the five selection situations examined by Cole, the following assumptions were made: 1) only two subgroups (i.e., majority and minority) existed, 2) only one predictor was used (although she states that the results are identical for multiple prediction), and 3) a selection ratio of .20 applied.

2 The two other models were the quota model and the employer's model proposed by Einhorn and Bass (1971).

In case A, a selection situation exists in which regression lines are parallel, with the minority line above the majority, and with the minority intercept larger than that of the majority. This situation rarely if ever occurs in practice, and need not be discussed further here.

In case B (Figure 4), one regression line serves for both subgroups, with the minority means on both the predictor and criterion being smaller than the majority means. This situation is very common in the literature on selection, and is the most relevant example which Cole gives. Under these circumstances, the Darlington model (5) of fairness (with k set at .5) will select 20% of the minority group; Thorndike, 13.3%; Cole (or Darlington 3), 16.4%; and Cleary, 4.5%. It should be noted that for this situation, Cole's selection of the value for k was completely arbitrary. Had k = 0 been chosen as the culturally optimal constant, the test would have been fair by Darlington's definition (since Darlington's definition agrees with Cleary's for k = 0). The effect of the larger k value chosen by Cole makes Darlington's model select a markedly larger percentage (20%) of the minority group than would the use of the value k = 0, which would select 4.5%.

[Figure 4. Case B: regression lines are identical for the majority and minority groups, but the minority mean is lower (Cole, 1972).]

The last three cases presented by Cole all assume differential validity (i.e., different regression slopes and intercepts in the presence of equal standard deviations). However, since differential validity has not typically been found in employee selection procedures, these cases will not be discussed here.
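How directly the selected minority proportion depends on the choice of k can be seen in a small simulation. This sketch is illustrative only: the group mix, score distributions, and values of k are hypothetical, and the simulated scores stand in for the predicted criterion.

```python
# Darlington's procedure as Hunter and Schmidt characterize it: add a
# constant k to minority scores, then apply one cutting score to everyone.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
minority = rng.random(n) < 0.25                  # C: group membership
score = np.where(minority,
                 rng.normal(40, 10, n),          # minority distribution
                 rng.normal(50, 10, n))          # majority distribution

def minority_share_selected(score, minority, k, selection_ratio=0.20):
    """Top-fraction selection after adding k points to minority scores."""
    adjusted = score + k * minority
    cut = np.quantile(adjusted, 1 - selection_ratio)
    selected = adjusted >= cut
    return minority[selected].mean()

for k in (0, 5, 10):
    print(f"k = {k:2d}: minority share of selectees = "
          f"{minority_share_selected(score, minority, k):.1%}")
```

The minority share of those selected rises monotonically with k, which is the sense in which the model's outcome is set by a policy-level choice rather than by the data.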
52* ABO.—02F 28 been found in employee selection procedures, these cases will not be discuSsed here. Each of the four models of test bias reviewed has its own strengths, weaknesses, and implications for employee .selection, but none provides a completely satisfactory definition of test bias. The Cleary definition is statis- tically Optimal and will select the highest quality work force, but selects the lowest proportion of the minority group, and is especially disadvantageous to minorities when validity is low. The Thorndike definition will select a higher percentage of the minority group, but is not statistically Optimal and will not select the most quali- fied work force. Use of Darlington's definition three will select the highest proportion of minority groups, but the basic assumptions of casuality behind the definition have not been proven, and the prediction of test score form knowledge of ability (i.e., criterion score) is not the normal practice in employment testing. Darlington's goal of achieving "cultural optimality" by using Y - kC in the regression equation makes the proportion of the minority group selected using his definition depend directly on the value of k which is used. His proposal for the subjective estimation of the value k, however, makes his fifth defini- tion another method of setting quotas which is not statis- tically optimal. For these reasons, Darlington's definition 29 5 is not a useful model for the evaluationof test bias in the present research. The conclusion to be drawn from a review of the literature on test bias is at once obvious and illusive. .On the one hand, it is apparent that, with the infrequent occurrence of differential validity in the testing situation, test bias is a current and very real issue related to the problems of discrimination and adverse impact. At the same time, however, there is little general consensus as to what constitutes test bias. Three major definitions of test bias (Cleary, Thorndike, and Darlington 3) have been examined here. Each will conflict with the other under normal selection conditions. Each has different implications for both employer and employee. One deficiency found in the literature reviewed was the absence of a sound comparison of the three major defini- tions using actual data from an empirical study. In an attempt to remedy this deficiency, the present research aimed to clarify the specific implications of each of the three models of test bias when each was applied to a selection procedure used to select employees for an apprentice training program. Each of the predictors in the selection battery was examined first for validity, and then for evidence of bias under each definition of test bias. Because data for the present study were collected on a 30 sample which had been selected on the basis of scores on the selection predictors, the effects of prior selection on test validity and test bias were also considered. 
METHODS

Data Source for Criteria and Predictors

The data source for the present study was a recent study (Schmidt, Greenthal, Berner, Hunter, and Williams, 1974) which constructed a job sample performance test for the skilled trades.3 Based on the knowledge that blacks and other disadvantaged groups score significantly below national norms on most paper and pencil tests, it was suggested that those paper and pencil tests could be tapping the determinants of job proficiency on which racial differences are largest or, conversely, that they may fail to tap determinants of job success on which differences are smaller or perhaps non-existent. If this were true, then Schmidt et al. reasoned that a well-constructed job sample performance test would show smaller racial differences than the traditional paper and pencil tests.

Hunter et al. (1975) reported that, in general, performance tests do show smaller racial differences than paper and pencil tests. Hence, if the content of a job sample performance test is essentially identical to job content, it should show racial differences smaller than those of paper and pencil tests, and in fact no different from actual job performance differences. Because of the nature of its content and the methods used to determine content, a well-constructed job sample test is, ipso facto, content valid (meeting EEOC regulations). The performance measure which they developed provided an adequate criterion against which to measure validity and test bias in the present research.

3 This study was performed for the Department of Labor, in an attempt to develop a content valid and reliable measurement of performance, and to evaluate the practical and economic feasibility of the use of that performance measure.

A thorough job analysis was the logical first step in the development of the performance test. This had already been done by a previous researcher (Oriel, 1974). Supervisors and journeymen were asked to indicate which tasks apprentices were expected to master by the end of their first year. These tasks and other job analysis information were used to delineate 31 "activity modules"--independent performances, each of which incorporated a number of the tasks identified by supervisors and journeymen. Twenty-three shop modules were selected from the 31 activity modules to be included in the study by Schmidt et al. A slightly modified list of these 23 modules was then presented to 21 local experienced machinist-journeymen for rankings for each machine on difficulty, frequency of occurrence, and importance for both apprentice and journeyman positions.

The next step in the construction of the performance test was the determination of the number of independent evaluation scores which could be derived from each module. Several distinct advantages were seen in the use of the method of end-product evaluation rather than specific behavior or process evaluations. These advantages included greater interjudge reliability, greater practical feasibility, and higher validity for the former method. Tolerance and finish were the two end-product evaluations used to measure performance in the study by Schmidt et al.

Each task was next executed by a machinist consultant to the project, and the information obtained was used to construct two tentative forms of the test. These forms were combined to yield a tentative final form, which was then pilot tested in a local machine shop, and the final form for the performance test was drawn up.
A major goal of the Schmidt et al. performance measurement procedure was the evaluation of the ability to perform specific tasks which are central to the machine trades occupation. It was not the intent of that study to construct a global measure of job performance, such as the overall value of the individual to the organization. For this reason, only a limited number of specific skills and abilities were measured by the job sample performance test, while more general behaviors, such as absence rate, frequency of large errors, task sequences, etc., were not used to evaluate performance.

Four paper-and-pencil aptitude tests and one paper-and-pencil achievement test were also administered to apprentices to test the hypothesis of smaller between-group mean differences for the performance test. Selected were the Wonderlic Personnel Test (a test of general ability), the Minnesota Paper Form Board (a test of spatial aptitude), the Purdue Industrial Training Classification Test (an arithmetic reasoning test), and the Bennett Mechanical Aptitude Test.

The achievement measure used was the Machines Trade Achievement Test of the Ohio Trade and Industrial Education Achievement Tests (1960 series). The test was developed by Ohio State University for the Ohio State Department of Education for use in connection with the state's vocational and technical education programs. It was carefully constructed using content validation procedures, and was judged content-valid by the project's expert machinist consultant. Internal consistency reliability measured by KR-20 was .84 (Schmidt et al., 1974). To provide a contrast to performance measures, and to add an indication of measured achievement, the Ohio Achievement Test was used as a criterion against which to judge test validity and test bias in the present research.

Between November 1973 and June 1974, the Performance Measurement Procedure (PMP) was administered to 87 apprentices (including 68 from Chrysler) with the equivalent of one year's machinist training. Apprentices were tested in groups of five. A short presentation was given to each group explaining the nature of the project. Following the presentation, each apprentice was given a piece of stock, a task drawing, and identifying labels. All task instructions were recorded on cassette tapes, and tape players with earphones were placed at each task station. Average time for the performance test was just over four hours. When time permitted, the five paper and pencil tests were given when the PMP was completed, but when the eight-hour allotment was insufficient, the paper-and-pencil tests were administered to small groups of employees at a later date.

Evaluation was made on the basis of the end-product tolerance and finish measurements and the task time required.4 Each piece was measured for tolerance by judges (non-professionals in the skilled trades) using highly precise dial-read micrometers and calipers. The system developed to analyze the results of the absolute measures recorded by each judge was a three-point system, in which each piece was judged within tolerance, outside the first but within the second less stringent tolerance, or outside both. Finish was determined by comparing each end product with "bench-marks" (workpieces chosen in advance to specify quality levels of finish), so that relative rather than absolute judgements were required.

4 No time limits were imposed, but since the instructions emphasized speed as well as quality, work time was recorded as a third score.
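The three-point tolerance system can be written as a simple scoring rule. In the sketch below, the point values (2/1/0) and the example tolerance bands are assumptions for illustration; the study reports only the three ordered categories.

```python
# Three-point tolerance scoring: within the stringent tolerance, within the
# second less stringent tolerance, or outside both. Point values assumed.
def tolerance_score(measured, target, tol_strict, tol_loose):
    """Score one workpiece dimension against two tolerance bands."""
    deviation = abs(measured - target)
    if deviation <= tol_strict:
        return 2    # within the first (stringent) tolerance
    if deviation <= tol_loose:
        return 1    # outside the first, within the second tolerance
    return 0        # outside both tolerances

# Hypothetical dimension: target 1.500 in., +/-0.002 strict, +/-0.005 loose.
print(tolerance_score(1.5014, 1.500, 0.002, 0.005))  # 2
print(tolerance_score(1.5040, 1.500, 0.002, 0.005))  # 1
print(tolerance_score(1.5090, 1.500, 0.002, 0.005))  # 0
```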
The results of the study by Schmidt et al. showed high inter-rater and internal consistency reliability for the Performance Measurement Procedure. In the case of all but one written test, there were large and significant differences between the majority and minority groups. Except for tolerance scores on the lathe, no significant differences were found on any dimension of the tolerance or finish scores of minority and majority apprentices. However, minority apprentices did take longer on the average to complete the performance test.

The fact that there were differences in time but not in quality of performance is consistent with the assumption that the apprentices used an internal rule of "always take time for quality." Had the men been forced to produce at a given pace, the time difference would have been translated into quality differences. This created a problem in the use of the performance test as a criterion: how shall time and quality be weighted? This in turn depends on the actual job for which the test is to be used. In a job where the criterion is "produce the best possible piece regardless of time," the proper weight for time would be 0. But in a task where wide discrepancies in quality are still "acceptable," production is proportional to speed and the two quality scores should receive 0 weight.

Schmidt et al. arbitrarily defined a "total performance score" by adding the standard score for tolerance to the standard score for finish and subtracting the standard score for time. That is, they arbitrarily weighted quality twice as heavily as time. This study will consider each score separately (as well as the Schmidt et al. total score) in order to permit inferences to either case.

The five criterion measures used for the present study are: three performance test subscores (total tolerance score, total finish score, and total time score), total performance score, and total score on the Ohio Machines Trades Achievement Test. In addition, a composite criterion score was computed by combining performance and achievement scores, such that the total of the three performance measures was weighted equally with the total score on the Ohio test, and the two totals were combined to equal one composite score.

The predictors in the present research are the selection variables used to determine entrance into the Chrysler Apprentice Training Program. This selection procedure consists of a screening test, a battery of four paper and pencil tests, and eight background variables which attempt to assess relevant school and work experience. Samples of the forms used to report all data maintained on an individual applicant are presented in Tables A1-A3 of the Appendix.

The first test encountered by the applicant is the screening test, called the Chrysler Personnel Test. This is a five-minute multiple-choice test composed of items on arithmetic and general reasoning ability. A minimum score of 15 on this screening test is required for an applicant to continue in the selection process.

Upon successful completion of the screening test, applicants are rescheduled to take the Apprentice Test Battery, which consists of four tests: the Tool Knowledge and Shop Arithmetic tests from the SRA Mechanical Aptitude Test Battery, the Arithmetic Test Form B developed by Chrysler, and the Survey of Object Visualization Test. All four tests are administered to applicant groups at one sitting. Total testing time is approximately 2 1/2 hours.
The SRA Tool Knowledge test is a multiple-choice test of 45 questions, each with five response choices. It is designed to measure knowledge of the different tools, machines, and other devices used most often by mechanical tradesmen. Each question shows a picture of the device, followed by the question "This is a _____" or "This is used to . . ." The time for the test is ten minutes, and the score is the number correct.

The SRA Shop Arithmetic Test is a multiple-choice test designed to measure skill and aptitude in work with numbers. This fifteen-minute test has 24 questions, each with four responses. The questions require skill in addition, subtraction, multiplication, and division, as well as the reading of diagram charts. Total score equals the number right. The lower limit for reliability (by the Kuder-Richardson 21 formula) for the total score of the SRA Mechanical Aptitude Battery is .83 (the test manual does not report reliabilities for separate tests).

The Arithmetic Test Form B was developed by the Chrysler Personnel Division in 1961. It is a ten-minute, 45-item multiple-choice test designed to measure skill in basic arithmetic. Each question has four possible solutions. Score equals the total number right. No reliability data were available on this test.

The fourth test, the Survey of Object Visualization, is composed of 44 items. In each item, there is a drawing of a flat pattern followed by four objects, one of which correctly represents what the object would look like when folded up. This test is designed to measure the ability to predict how an object will look when its shape and position are changed. Total score equals the number correct minus one third the number incorrect. Split-halves reliability based on 266 cases (corrected by the Spearman-Brown formula) is reported in the test manual to be .91.

Raw scores from each of the four tests are converted to percentiles, and the percentiles are then averaged to give a composite battery score for each applicant (see Table A1, Appendix).5 This composite score (averaged percentile) is then multiplied by two to give the total score on the Apprentice Test Battery. The maximum number of points is thus 200.

5 It should be noted that this average of percentile scores is not itself a percentile score.

Applicants also receive points on eight other dimensions. Until September 1, 1973, a high school diploma or GED equivalent was required before application could be made for the Apprentice Training Program.6 Points are awarded on a five-point scale for high school grades in mathematics, science, and other related courses, according to the grade received. Similarly, points are awarded for scores on the five tests of the GED according to score. The maximum number of points awarded for GED and high school grades is 100.

6 This requirement has only been provisionally dropped for a one-year period. It does not affect the current report.

Points are also awarded to applicants for general course work applicable to apprenticeship which was taken after high school. Points are awarded on a five-point system according to grade, to a maximum of 10 courses and 50 points. A maximum of 25 points can be accredited to an applicant for Armed Forces work in a related field.
Corporate Service Points are given to Chrysler employees for each year of service up to five years, and one point for each five years or fraction thereof following the first five years.

Total points possible on the entire apprentice selection procedure (including the test battery and other evaluation standards) equals 445. The total points required to qualify for a position on the priority list for Apprentice Training is 115.7

7 Up until February 1, 1971 an interview was required for each applicant. Maximum points for the interview was fifty, and the total qualifying score was 150. When the interview was discovered to be unreliable and unnecessary, it was dropped, and the average interview score (35) was subtracted from the total qualifying score.

Subjects

Sixty-eight apprentices from the Detroit Chrysler plants with the requisite amount of machine training volunteered to participate in this study. Most subjects were third- and fourth-year tool-and-die apprentices. Twenty of the 68 subjects were minority (29.4%); one female took part in the study, making it impractical to try to break down results on the basis of sex. Five of the 68 subjects (3 minority and 2 majority) failed to complete all parts of the performance test. Their scores were not included in the statistical analysis portions of the present research.

Data Analysis

One of the major statistical analyses in this research concerns the verification of the validity of each of the predictors (i.e., tests) used to select applicants for the apprentice program. This analysis involved the examination of the correlation coefficient which describes the relationship between each of the 14 selection tests and each of six criterion measures. Correlations were computed separately for minority and majority apprentices, as well as for the entire sample of apprentices.

The EEOC guidelines state that validity is demonstrated by a correlation between test and criteria which is significant at the .05 level. The present research does not limit the term "valid test" to the statistical demonstration of validity proposed by the EEOC, although significance tests were performed on predictor validities and the results were noted.8 Analyses of test bias were performed on all tests regardless of the statistical significance of their relationship to the criteria.

The validity of a test is defined as the correlation between the test and the criterion in the applicant population (i.e., before selection takes place). However, in the present study, data on the criteria could be obtained only for the restricted population of persons who were selected using the predictor tests. Restriction in range due to prior selection tends to produce underestimates of validity in the applicant pool.

8 In analyses using a very small sample size, adherence to an alpha level of .05 minimizes the probability that a test which is really not valid will be found valid. At the same time, however, the probability of finding a test invalid (by a statistical test) when it in fact is valid may be quite high. For example, if the true validity of a test is .25 and a sample of 60 is drawn, the probability that the sample validity will not be significant at the .05 level is .50. That is, half the sample coefficients would be incorrectly interpreted as showing the test to be invalid. In a sample of 17, the probability of incorrectly assuming that the test is invalid would be .80. If a test is incorrectly assumed to be valid and is used for selection, then the prior random selection procedure is being replaced by a new random selection procedure, and hence has neither a good nor a bad effect. But if a valid test is incorrectly labeled "invalid" and hence rejected, then a potentially useful selection procedure is being replaced by a random selection procedure at great cost. Thus, only the second kind of error is really relevant, and preoccupation with maintaining a small alpha level may be highly counterproductive.
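The probabilities cited in footnote 8 can be checked with the Fisher z approximation to the sampling distribution of a correlation; a minimal sketch (Python):

```python
import numpy as np
from scipy.stats import norm

def prob_nonsignificant(rho, n, alpha=0.05):
    """Probability that a sample correlation from a population with
    true validity rho fails a two-tailed significance test at the
    given alpha level, via the Fisher z approximation."""
    z_rho = np.arctanh(rho)               # Fisher transform of rho
    se = 1.0 / np.sqrt(n - 3)             # standard error of z
    z_crit = norm.ppf(1.0 - alpha / 2.0)  # two-tailed critical value
    power = norm.sf(z_crit - z_rho / se) + norm.cdf(-z_crit - z_rho / se)
    return 1.0 - power

# Approximately .51 and .84 -- close to the .50 and .80 cited in
# footnote 8, which presumably used coarser tabled values.
print(round(prob_nonsignificant(0.25, 60), 2))
print(round(prob_nonsignificant(0.25, 17), 2))
```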
To obtain an estimate of unrestricted validity, a correction for restriction in range was performed. As will be shown below, there is no formula in the literature which is exactly appropriate for correcting for restriction in range in the present data. However, Thorndike's (1971) Case II formula was used to provide an indication of the size of the restriction.9

9 This formula is given by:

    rho_xy = v * r_xy / sqrt(v^2 * r_xy^2 + (1 - r_xy^2)),

where v = SD unrestricted / SD restricted.

This correction was accomplished by utilizing the standard deviations for each test computed on an unrestricted sample (i.e., a sample in which both passing and failing scores were included in the computed standard deviations) of scores of 200 applicants drawn at random from the apprentice program applicant record files.

A final analysis in the validation portion of the present research was a test for differential validity. Although the EEOC guidelines require the use of a .05 level of significance for evidence of differential validity, the body of evidence against the probability of finding differential validity more frequently than could occur by chance alone suggests the use of a more stringent level. Therefore, a .01 level was used.

A major goal of this research was the determination of the fairness or bias of each predictor. A strategy for determining test bias was proposed by Potthoff (1966). For the analysis of predictor bias by the three definitions, the following procedure was devised. All tests of statistical significance used a level of .05 and were performed on the uncorrected data. The first two tests were of the significance of the correlations between criterion measures and race, and between predictor score and race. The results of these tests provide an indication of the relative influence that racial group membership has on criterion and predictor scores.

For both the Cleary and Darlington definitions, determination of test bias was derived from the examination of partial correlations. For the Cleary model, the null hypothesis for test fairness is r_yc.x = 0. This formula assumes that x and y are measured without error and that there is no restriction in range. While formulas exist for correction for attenuation in x and y, no appropriate formulas exist for correction for all cases of restriction in range (as will be shown below). Hence, the statistical test was simply run on the uncorrected partial correlation.

The Thorndike model of test bias requires that the difference in standard score units between the majority-minority mean difference on the test and the majority-minority mean difference on the criterion equal 0. That is, this definition of fairness states: (Xbar_W - Xbar_B) - (Ybar_W - Ybar_B) = 0 (where Xbar_W and Ybar_W = the means for the majority on test score and criterion score, respectively, and Xbar_B and Ybar_B = the mean test score and criterion score, respectively, for the minority group).
In order for the test of this hypothesis to be completely accurate, X and Y must be expressed in standard score form. This is done implicitly when the test used to specify fairness is r_cy - r_cx = 0. Where this difference was positive and significant, the Thorndike definition labeled the test unfair, or biased against the minority group.10

10 This assumes that the majority group is scored high and the minority group low on C. If the majority is scored high (e.g., 1) and the minority low (e.g., 0), then a negative difference in r_cy - r_cx would specify bias against the minority group.

The null hypothesis for test fairness using Darlington's definition (3) requires r_cx.y = 0.

For the analysis of test fairness by both the Darlington and Thorndike definitions, it was necessary to make a correction for unreliability in the criterion measures. This procedure, called correction for attenuation, provides the best estimate of the magnitude of true differences (i.e., the differences after controlling for the effects of random error) in minority and majority criterion scores. This correction was necessary because random error or unreliability in a criterion measure acts to obscure or reduce true group differences. This correction proved to be more complex than expected, however, and was found to be infeasible in the present research, as will be discussed below.

RESULTS

The product-moment coefficients of correlation between predictor scores and scores on each of the three performance subtests are presented in Table 1. In concurrence with EEOC requirements, validity is reported separately by racial group. Because of the large difference in the sample sizes of majority and minority apprentices (47 and 16, respectively), the test of the significance of validity coefficients for each group separately is not especially meaningful in the present research, and will not be discussed. A more meaningful analysis of between-group differences in validity will be presented later. For the interested reader, the test of significance (at the .05 level) of the validity reported for the majority group requires a correlation greater than or equal to .285. For the minority group validity must be greater than or equal to .50 for statistical significance.

Correlation coefficients computed on the total sample which reached the .05 significance level required by the EEOC for demonstration of validity are shown in Table 1 by an asterisk (*). No scores were reported for any of the subjects in the total sample for Armed Forces work, so that predictor was dropped from all further analyses.

[Table 1, Correlations Between Predictors and Performance Criterion Subscores, is too badly garbled in this copy to reproduce. For each predictor it reported correlations with the tolerance, finish, and time subscores for majority (W) and minority (B) apprentices and for the total sample; time was reverse scored so that a low time yielded a favorable score, and *p < .05.]
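The validity coefficients reported in Table 1 (and in the tables that follow) are ordinary Pearson correlations computed separately by group and for the total sample; a minimal sketch (Python, with hypothetical array names):

```python
import numpy as np
from scipy import stats

def validities_by_group(predictor, criterion, race):
    """Pearson validity and two-tailed p value, reported separately
    by racial group (as the EEOC guidelines require) and for the
    total sample.  `race` is an array of labels such as 'W' or 'B';
    `predictor` and `criterion` are numeric arrays of equal length."""
    results = {}
    for group in np.unique(race):
        mask = race == group
        results[group] = stats.pearsonr(predictor[mask], criterion[mask])
    results["total"] = stats.pearsonr(predictor, criterion)
    return results
```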
Significant validity coefficients were obtained for three predictors of tolerance score. The test battery, GED, and high school others correlated .25, -.25, and .28, respectively, with tolerance scores. Only work experience correlated significantly with finish scores (r = -.28), and corporate service points (CSP) correlated significantly with time scores (r = -.34). None of the other 31 correlation coefficients were significant at the .05 level.

Correlations between the predictors and performance, achievement, and the composite criterion computed for the total sample are shown in Table 2. The demonstrated relationship between the predictor scores and performance criterion scores was weak. None of the correlations reached statistical significance at the .05 level. The test battery, post high school training, work experience, and corporate service points showed the highest correlations with performance scores, showing r's of .20, -.19, -.19, and -.19, respectively.

Selection predictors were more highly correlated with achievement criterion score. A correlation of .34 was seen between the test battery and achievement, while high school math and high school others correlated .31 and .27, respectively, with achievement score. All three correlations were significant at the .05 level. Significant validity coefficients were also obtained for the correlations between the test battery and composite score (r = .33) and high school math and the composite score (r = .30).

Table 2
Correlations Between Predictors and Performance, Achievement, and the Composite Criterion (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .14              .08
shop math                 .08               .13              .13
arithmetic                .12               .20              .20
tool knowledge            .10               .16              .16
object vis.               .10              -.05              .03
battery                   .20               .34*             .33*
G.E.D.                   -.08              -.07             -.09
H.S. math                 .19               .31*             .30*
H.S. science              .06               .18              .15
H.S. others               .10               .27*             .23
P.H.S. training          -.19              -.10             -.18
Work experience          -.19              -.09             -.17
C.S.P.                   -.19              -.11             -.19

*p < .05

One finding which was somewhat contrary to expectation was the low validity demonstrated by the screening test and the four tests which combined to make up the test battery. None of these five tests correlated significantly with any of the six criterion measures. Slightly less than significant validity was obtained from the correlations of the screening test with the finish and time subscores (-.18 and .18, respectively), shop math with time score (r = .22), Chrysler arithmetic with achievement score (r = .20), tool knowledge with tolerance (r = .19), and object visualization with tolerance (r = .20). A correlation greater than or equal to .25 was required for statistical significance.

The sample validity coefficients were corrected for attenuation due to unreliability in the criterion, and the results are presented in Table 3. Estimates of reliability using the coefficient alpha formula for the three criterion measures were .84 for the Ohio Achievement Test, .80 for the composite score, and .64 for total performance score. Estimates of corrected correlations between the test battery and performance, achievement, and the composite criterion were .25, .38, and .37, respectively. Correlations between high school math and the three criterion measures were next highest, showing r = .24 for performance, r = .34 for achievement, and r = .34 for the composite score.
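The correction used to produce Table 3 divides each observed validity by the square root of the criterion reliability (attenuation in the criterion only); a minimal sketch:

```python
def disattenuate(r_xy, r_yy):
    """Correct a validity coefficient for unreliability in the
    criterion only: r_xy / sqrt(r_yy), where r_yy is the criterion
    reliability (coefficient alpha in the text)."""
    return r_xy / r_yy ** 0.5

# Battery validities with the reliabilities reported above:
print(round(disattenuate(0.20, 0.64), 2))  # 0.25, as in Table 3
print(round(disattenuate(0.33, 0.80), 2))  # 0.37, as in Table 3
print(round(disattenuate(0.34, 0.84), 2))  # 0.37; Table 3 shows .38,
# presumably computed from the unrounded sample correlation
```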
High school others correlated .29 with achievement and .26 with the composite score. All other predictor-criterion correlations remained below .25.

Table 3
Predictor Validities Corrected for Attenuation in the Criterion (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .15              .09
shop math                 .10               .14              .15
arithmetic                .15               .22              .22
tool knowledge            .13               .17              .18
object vis.               .13              -.05              .03
battery                   .25               .38              .37
G.E.D.                   -.10              -.08             -.10
H.S. math                 .24               .34              .34
H.S. science              .08               .20              .17
H.S. others               .13               .29              .26
P.H.S. training          -.24              -.11             -.20
Work experience          -.24              -.10             -.19
C.S.P.                   -.24              -.12             -.21

To investigate the possibility that the low predictor validities found in the first analysis could be due to restriction in range resulting from prior selection, the proper correction was made. Several problems were encountered in carrying out this correction, however. In the sample data drawn from Chrysler's record files, high school math, science, and other scores were all included under one score labeled "high school transcript," making the unrestricted standard deviations for the three separate scores unavailable. In addition, no scores were reported for related work experience in the random sample of unrestricted scores, so that no correction could be made for this variable.

Corrections for restriction in range were performed statistically on the remaining predictor-criterion correlations, and the resulting estimates of predictor validity are presented in Table 4. The battery showed the most notable change from restricted to unrestricted correlations. Corrected for restriction, correlations between the battery and performance, achievement, and composite criteria were .34, .54, and .53, respectively. The effect of the correction for restriction in range was much less marked for correlations involving the five other paper and pencil tests. For the correlation between the screening test and achievement score, the estimate of validity was raised from .14 to .19 after this correction.

Table 4
Predictor Validity Corrected for Restriction in Range (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .19              .11
shop math                 .11               .17              .17
arithmetic                .14               .24              .24
tool knowledge            .14               .22              .22
object vis.               .13              -.07              .04
battery                   .34               .54              .53
G.E.D.                   -.07              -.06             -.08
P.H.S. training          -.26              -.14             -.25
C.S.P.                   -.17              -.10             -.17

Similarly, correlations between tool knowledge and the achievement and composite criteria were increased from .16 to .22 with this correction. Validities for shop math, arithmetic and object visualization changed little, and remained low. Estimates of validity decreased for the correlations between GED and C.S.P. (corporate service points) and the three measures because the estimated unrestricted standard deviations were smaller than the restricted standard deviations. This caused the estimated unrestricted validity coefficients to be smaller than the restricted coefficients.11 A logical explanation for this result assumes that the large standard deviations obtained for the present sample must have been due to sampling error.

11 The formula for restriction is given in footnote 9.

Correlations between GED, post high school training and corporate service points (CSP) and the three criterion measures were unexpectedly negative, and there is no immediately obvious explanation for these results.
The correlations between GED and the performance, achievement, and composite criteria could conceivably be a function of sampling error; that is, supposing the true correlations between GED and the three criteria were all 0, sampling error could produce low negative correlations of -.07, -.06, and -.08 for performance, achievement, and composite criteria, respectively. For the other two variables, however, there appear to be other factor(s) acting in addition to sampling error to produce the negative validities. The results of efforts to discover these possible factors will be presented in subsequent analyses.

Predictor validities corrected first for criterion attenuation and additionally for restriction in range are given in Table 5. Estimates of validity for the test battery after both corrections were .42, .59, and .58 for the prediction of performance, achievement, and composite criteria, respectively. Correlations of high school math (corrected only for attenuation) with all three criterion measures show the next highest validity, equal to .24, .34, and .34 for performance, achievement, and composite criteria, respectively. The test battery and high school math were the two best single predictors of performance, achievement, and the composite criteria. While no other predictors showed a notable relationship to performance criteria after both corrections were made (ignoring for the moment the negative validities of the last three predictors), estimates of validity for the correlations of arithmetic and tool knowledge with achievement and the composite criterion were both greater than .22.

Table 5
Predictor Validities Corrected for Attenuation in the Criterion and Restriction in Range (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .21              .13
shop math                 .13               .19              .20
arithmetic                .18               .26              .26
tool knowledge            .18               .23              .24
object vis.               .17              -.07              .04
battery                   .42               .59              .58
G.E.D.                   -.09              -.07             -.09
H.S. math                 .24*              .34*             .34*
H.S. science              .08*              .20*             .17*
H.S. others               .13*              .29*             .26*
P.H.S. training          -.33              -.15             -.28
Work experience          -.24*             -.10*            -.19*
C.S.P.                   -.22              -.19             -.11

*Corrected for criterion attenuation only.

Differential Validity

The data presented in Table 6 speak to the hypothesis that differential validity in selection data does not occur more often than could be expected by chance. Out of 72 comparisons, six cases of differential validity (involving screening, tool knowledge, and the test battery) for majority and minority apprentices were found. The probability of obtaining six significant correlations out of 72 possible correlations is less than .10 (Brozek and Tiede, 1952). Having defined chance at the .05 level, these results are consistent with previous research findings.

A closer look at the separate validity coefficients obtained for majority and minority apprentices reveals some serious departures from results reported in the literature on test validation and differential validity. Although statistically the number of cases of significant differential validity found was not greater than could have occurred by chance alone, the difference in magnitude and sign (positive-negative) of validities for majority and minority apprentices for some predictors suggests the operation of some phenomenon in the present data which has a differential effect on majority and minority apprentices.
Indicative of the hypothesized differential influence of some factor in the present data is the difference in the correlations for minority apprentices between the shop math and arithmetic tests and the three criteria.

[Table 6, Evidence of Differential Validity for Majority and Minority Apprentices, is too badly garbled in this copy to reproduce. For each predictor it reported validity coefficients for majority (r_W) and minority (r_B) apprentices against the total performance, Ohio achievement, and composite criteria, together with the significance of each majority-minority difference (most marked N.S.).]

Theoretically, since all three predictors are measures of mathematical ability, it would be expected that all would correlate similarly with criterion measures. As Table 6 illustrates, however, shop math and arithmetic correlate negatively with criterion scores for minority apprentices, while minority high school math scores correlate highly in a positive direction with criterion scores. Using the achievement criterion as an example, the correlation with high school math is .71, while the correlations between achievement and the shop math and arithmetic tests are -.44 and -.25, respectively. This effect is seen only for minority apprentices, while validities for majority apprentices for the same three tests behave as predicted. In general, the first six paper and pencil tests correlate negatively with criterion measures for minority apprentices only, while minority high school transcript scores correlate highly in a positive direction with minority criterion scores.

It is given that, with the small sample sizes for minority and majority apprentices (16 and 47, respectively), a great deal of error will be reflected in sample estimates of validity. It would not appear, however, that sampling variability alone could account for the markedly different validities observed. Restriction in range due to prior selection is another factor known to be operating in the present data (Table 4), but its differential effects for majority and minority apprentices are less well known.

The usual effect of restriction in range due to prior selection in a given sample is the reduction of obtained post-selection validities to near 0. Formulas have been created to correct for this type of restriction (see footnote 9), and they are applicable to the case where selection is done on a single variable. In the case where selection is done on more than one variable and where selection ratios differ for subgroups, it is conceivable that the effect of selection on predictor validity will be different for each group. The question which remains to be answered is: what types of validity would be expected to result if selection restriction did in fact have a differential effect for minority and majority groups? If it could be shown that the resulting validities for a very highly selected group (i.e., a group with a very low selection ratio) could be spuriously negative, one possible explanation for the extreme validities found for minority apprentices will have been found. The following example shows how this could occur.
First, the following assumptions are made: (1) the selection ratio for the given group = .50 (for computational convenience), (2) the correlation in the applicant pool between a predictor X and the criterion Z (i.e., rho_xz) = .50, (3) rho_xy = 0, (4) rho_yz = 0, and (5) selection is based on the combined scores of predictor X and another predictor Y. The task is then to calculate the post-selection correlations r_xy, r_xz, and r_yz.

Results using the above assumptions showed obtained post-selection correlations of r_xy = -.47, r_xz = .43, and r_yz = -.20. When rho_xz is raised to .7, the obtained post-selection correlations are r_xy = -.47, r_xz = .63, and r_yz = -.29. Thus the effect of restriction in range was not to decrease r_yz toward 0, but to change it from rho_yz = 0 to r_yz = -.29.

In a third example, the assumption that rho_xy = 0 is changed to rho_xy = .30. Given rho_xz = .5 and rho_yz = .15, the post-selection correlations are r_xy = r_yz = -.20 and r_xz = .48.

Following from the preceding examples it was hypothesized that the negative correlations observed for minority apprentices between many of the predictors and criterion measures could be an artifact of prior selection. Since there was no way to directly compare r_xz's for restricted and unrestricted samples (because no criterion data were collected for unrestricted samples), another way to substantiate the hypothesis is to compare the correlations between predictors (i.e., r_xy, in the notation of the above example) in the present study with predictor intercorrelations obtained from an unselected sample. If there is a major change in sign and magnitude of correlations from selected to unselected samples, support will be given to the conclusion that selection operates in a similar manner in the data of the present study and the hypothetical examples.

Table 7 presents such a comparison of the correlations between high school transcript scores and six other selection predictors for selected and unselected samples. Results of the comparisons in Table 7 are consistent with the hypothesis. Unrestricted predictor intercorrelations for minority apprentices were positive, as in the third example above, yet correlations between predictors after selection were consistently negative. Following the line of reasoning of the hypothesized model, if selection affects the predictor-criterion correlations as predicted, spurious negative validities would be observed in the present data for minority apprentices. That is in fact exactly what occurred.

Since selection was based on the composite test score, both majority and minority groups are affected by restriction in range due to prior selection. The effects of selection were not as severe for the majority group, however, because their selection ratio (SR = .30) was higher. As can be seen in Tables 1 and 6, the effect for the majority group was not great enough to produce post-selection correlation coefficients that were negative. The results of the analysis showing the differential effects of selection for majority and minority groups cast suspicion on the applicability of the corrections for restriction in range for the total group shown in Tables 4 and 5.

[Table 7, Predictor Intercorrelations Before and After Selection, is too badly garbled in this copy to reproduce. For majority and minority apprentices separately, it compared the correlations of the high school composite (the sum of the unrestricted math, science, and others scores), math, science, and others scores with the other selection predictors before and after selection.]
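The worked example above is easy to verify by simulation: draw applicants from a population with the stated correlations, select the top half on X + Y, and recompute the correlations. A minimal Monte Carlo sketch (Python; the assumption of a multivariate normal applicant pool is mine):

```python
import numpy as np

def post_selection_corrs(rho_xz, rho_xy=0.0, rho_yz=0.0,
                         sr=0.50, n=1_000_000, seed=0):
    """Correlations among predictors X, Y and criterion Z after
    selecting the top `sr` proportion on the composite X + Y."""
    cov = np.array([[1.0, rho_xy, rho_xz],
                    [rho_xy, 1.0, rho_yz],
                    [rho_xz, rho_yz, 1.0]])
    rng = np.random.default_rng(seed)
    x, y, z = rng.multivariate_normal(np.zeros(3), cov, n).T
    keep = (x + y) >= np.quantile(x + y, 1.0 - sr)
    r = np.corrcoef(np.vstack([x[keep], y[keep], z[keep]]))
    return {"r_xy": r[0, 1], "r_xz": r[0, 2], "r_yz": r[1, 2]}

# First example: rho_xz = .50, rho_xy = rho_yz = 0, SR = .50.
# Yields approximately r_xy = -.47, r_xz = .43, r_yz = -.20,
# matching the values reported in the text.
print(post_selection_corrs(rho_xz=0.50))
```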
Test Bias

Significant majority-minority criterion differences. Due to the fact that the extent of the differential effect of prior selection on criterion mean scores for majority and minority apprentices is unknown, an analysis of mean criterion score differences based on post-selection data would be tenuous. A more meaningful and stable comparison of racial differences in criterion score may be obtained from the examination of the point-biserial correlation between criterion and race. Results of the comparison of all six criteria on levels of correlation with race (r_cy) are presented in Table 8.

Finish subscores of the performance test show the least correlation with race. Neither the correlation between tolerance subscores and race nor that between finish score and race is significant. For total time score (reflected),12 the correlation with race of -.36 was significant at the .01 level. Since total performance was the composite of tolerance, finish, and time scores, it also showed a small but significant correlation with race. This correlation (-.28) was smaller than both the correlation between achievement and race (-.42) and that between the composite criterion and race (-.43), both of which were significant at the .01 level.

12 For all correlations involving time scores, time was reflected so that the most desirable (i.e., positive) outcome was a low time score rather than a high one. This explains the negative correlation obtained for time score and race (where majority were scored 1, minority 2).

TABLE 8
Correlation Between Criterion Measures and Race

Criterion              r_cy
Tolerance             -.18
Finish                 .12
Time                  -.36**
Total Performance     -.28*
Ohio Achievement      -.42**
Composite             -.43**

*p < .05  **p < .01

Significant majority-minority differences in predictor scores. Racial differences in mean predictor score are shown in Table 9. As was noted for criterion mean differences, the predictor mean differences presented in Table 9 are markedly affected by prior selection. A better estimate of the influence of race on predictor scores is achieved through the analysis of the last column of this table, the correlations between predictor and race (r_cx).

Examination of predictor-race correlations showed six of the thirteen predictors to be significantly correlated with race. The screening test, shop math, arithmetic, and test battery showed significant negative relationships to race, indicating that selected majority apprentices scored higher than minority apprentices on these predictors. Post high school training and corporate service points were positively correlated with race, indicating that selected minority apprentices on the average scored higher than did the majority on these variables.

TABLE 9
Majority-Minority Mean Differences in Predictor Score

Predictor           Xbar_W     Xbar_B      D       D_SD     r
Screening           19.489     17.250     2.24     .66    -.26*
Shop Math           12.702     11.062     1.64     .67    -.25*
Arithmetic          23.787     19.312     4.48    1.12    -.35**
Tool Knowledge      31.681     32.875    -1.194    .25     .09
Object Vis.         28.894     27.375     1.519    .18    -.08
Battery            118.553     98.812    19.741    .81    -.33**
GED                  2.617      1.625      .992    .50    -.11
H.S. Math            9.830      9.500      .33     .05    -.02
H.S. Science         2.979      2.000      .979    .41    -.12
H.S. Others         11.128      9.688     1.44     .17    -.06
P.H.S. Training      1.447      6.813    -5.37     .04     .31*
C.S.P.               3.383      4.938    -1.56     .79     .33**

*p < .05  **p < .01

Note: Xbar_W and Xbar_B are means for the majority and minority groups, respectively. D and D_SD are the group differences in raw-score units and in minority-group standard deviation units, respectively; r is the point-biserial correlation between group membership and score.
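The quantities in Table 9 can be computed directly from the selected sample; a minimal sketch (Python), taking D_SD as the raw difference expressed in minority-group standard deviation units, per the table note:

```python
import numpy as np
from scipy import stats

def predictor_race_stats(score, minority):
    """Table 9 statistics for one predictor.  `minority` is a boolean
    array marking minority apprentices; majority is coded 1 and
    minority 2, as in the text, so higher majority scores produce
    negative point-biserial correlations."""
    score = np.asarray(score, dtype=float)
    minority = np.asarray(minority, dtype=bool)
    mean_w, mean_b = score[~minority].mean(), score[minority].mean()
    d = mean_w - mean_b
    d_sd = d / score[minority].std(ddof=1)   # minority-group SD units
    r, p = stats.pearsonr(np.where(minority, 2, 1), score)
    return {"mean_W": mean_w, "mean_B": mean_b,
            "D": d, "D_SD": d_sd, "r": r, "p": p}
```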
An estimate of majority-minority predictor score differences in the applicant pool (i.e., before selection) was calculated on an unrestricted random sample of 198 test scores, and the results are presented in Table 10. To assure comparability between the restricted and unrestricted r_cx's, a correction was made which took into account the different ratios of minority to majority apprentices in each sample (16/47 versus 98/99 for the restricted and unrestricted samples, respectively). While only six predictors showed significant correlations with race in the restricted sample, all but one predictor showed a large correlation with race in the equated unrestricted sample. The effect of restriction on the correlation between test and race was largest for the test battery and the screening test.

TABLE 10
Correlations between Predictors and Race for Restricted and Unrestricted Samples

                     Restricted    Equated Unrestricted
Predictor               r_cx               r_cx
Screen                 -.26*              -.13
Tool Knowledge          .09               -.38
Shop Math              -.25*              -.35
Arithmetic             -.35*              -.34
Object Vis.            -.08               -.31
Battery                -.33*              -.46
H.S. Transcript        -.07               -.23
P.H.S. Training         .31*               .31
GED                    -.11               -.20
CSP                     .33*               .28

*p < .05

Three definitions of test bias. The analyses of test bias using the Cleary, Thorndike, and Darlington (3) definitions are presented in Tables 11-13. Results were examined first for the three most valid predictors: the test battery, high school math, and high school others.

According to the Cleary definition, these three tests are all biased in favor of the minority in the prediction of achievement and composite criterion scores. That is, in all cases the majority regression line would overpredict minority criterion scores. As a predictor of performance, the test battery is a fair test for both majority and minority apprentices. High school math and others, however, are biased in favor of the minority in the prediction of performance scores.

[Tables 11, 12, and 13, Test Bias Statistics for the Performance, Achievement, and Composite Criteria, respectively (N = 63), are too badly garbled in this copy to reproduce. For each predictor, each table reported the Cleary statistic (r_yc.x), the Thorndike statistic (r_cy - r_cx), and the Darlington statistic (r_cx.y), each with its fairness decision: fair, +biased (biased in favor of the minority group), or -biased (biased against the minority group); *p < .05, **p < .01.]
According to Thorndike, "fair" means that the difference in SD units between majority and minority mean scores on the test is not significantly different from the difference in majority and minority mean scores on the criterion. Applying the Thorndike definition, the test battery is also considered a fair measure of performance for minority apprentices. In addition, this model finds the test battery to be a fair predictor of the achievement and composite criteria. High school math and science are also fair predictors of performance criterion scores, but are both biased in favor of the minority when used to predict achievement and the composite criterion. This direction of bias, in the Thorndike model, means that the differences between majority and minority on the test are actually smaller than the differences between majority and minority on the criterion. Since the correlations are not corrected for restriction in range, the smaller racial difference in mean score on the test is most probably an artifact of selection.

Darlington's definition three found the test battery, as a predictor of performance criteria, to be biased in favor of minority apprentices. That is, given equal criterion scores for majority and minority apprentices, the mean test score will be higher for the minority. High school math and others are fair predictors of the performance, achievement, and composite criteria, and the test battery is, by the Darlington definition, a fair predictor of achievement and composite criterion scores. Darlington's "fair" means that in a subset of apprentices with the same criterion score, the mean test scores for the majority and minority groups are the same.
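For reference, the three test statistics applied in Tables 11-13 can all be written in terms of the zero-order correlations among race (c), predictor (x), and criterion (y); a minimal sketch (Python):

```python
import numpy as np

def partial_r(r_ab, r_a_ctl, r_b_ctl):
    """First-order partial correlation between a and b, holding the
    control variable constant, from the three zero-order r's."""
    return (r_ab - r_a_ctl * r_b_ctl) / np.sqrt(
        (1.0 - r_a_ctl**2) * (1.0 - r_b_ctl**2))

def bias_statistics(r_cy, r_cx, r_xy):
    """Fairness statistics for one predictor.
    Cleary:     H0 is r_yc.x = 0 (no systematic over/underprediction)
    Thorndike:  H0 is r_cy - r_cx = 0 (equal standardized mean gaps)
    Darlington: H0 is r_cx.y = 0 (no race-test relation among
                apprentices with the same criterion score)"""
    return {"cleary":     partial_r(r_cy, r_cx, r_xy),
            "thorndike":  r_cy - r_cx,
            "darlington": partial_r(r_cx, r_cy, r_xy)}
```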
More generally, the analyses presented in Tables 11-13 demonstrate the hypothesized lack of agreement between the Cleary, Thorndike, and Darlington definitions of test bias. In the total of 39 cases presented, agreement between the three definitions occurred only 4 times; the predictors post high school training and corporate service points were found, by all three definitions, to be biased measures of achievement and composite criterion scores. No consensus was reached in any case, however, on the magnitude of test bias.

A Cleary-defined fair test. The strict interpretation of the Cleary definition of test bias requires the elimination of consistent non-zero errors of prediction (i.e., bias) by the use of separate regression equations for majority and minority apprentices. The determination of the separate regression equations for majority and minority apprentices in the present data, however, was not feasible due to (1) the small sample size of minority apprentices, and the great amount of random error which would consequently influence the prediction equations for that group, (2) the differential effect of selection on minority and majority apprentices, as was seen in the differences in subgroup validity obtained for the present selected sample of apprentices, and (3) the spurious negative correlations between predictors and criteria for the minority group, which would affect minority regression line slopes. For the above reasons, the separate regression equations for majority and minority apprentices are not presented.

Statistical corrections for the Thorndike and Darlington models. What would the test bias decisions have been had the data been unrestricted in range due to prior selection and had perfectly reliable criterion measures been used? It was stated earlier that corrections for restriction in range and criterion attenuation would be statistically performed on validity coefficients, and the Thorndike and Darlington models of test bias reapplied to answer this question. Several unforeseen theoretical issues arose, however, in the application of these corrections.

In the Thorndike model, the correction for restriction in range involves the correction of the correlations r_cy, r_cx, and r_xy13 (where c = race, x = predictor, and y = criterion). It has already been noted that no method exists by which to correct r_cy for restriction in range in the present data. Prior to beginning this investigation there was no known evidence in the literature indicating that r_cx and r_xy could not be corrected for restriction in range.

13 In addition to r_cy and r_cx, r_xy must be corrected for restriction in range because it appears in the denominator of Darlington's formula for test bias.

The usual correction of the correlation r_xy for restriction in range assumes that (1) selection is on the variable X, and (2) Y is homoscedastic on X (i.e., the variance of Y about the regression of Y on X is the same for all values of X). This is illustrated in Figure 5. The key to the formula is the fact that Y is still linear on X after selection and Y is homoscedastic on X after selection. The standard error of estimate and the slope of the regression line are exactly the same in both the restricted and unrestricted populations (Figure 5). Furthermore, all of the above conditions are true regardless of the selection ratio.
In this case, the unrestricted correlation is given by:

    rho_xy = v * r_xy / sqrt(v^2 * r_xy^2 + (1 - r_xy^2))

where v = unrestricted SD(x) / restricted SD(x).

[Figure 5. Effects of selection on a continuous distribution of test scores: the criterion is plotted against the test score, with the regression line E(Y|X), its error band, and the selection cut-off marked.]

The formula above will fail if (1) Y is not linear or not homoscedastic on X, or (2) X is not the variable on which selection was made. For example, suppose there are two predictors X and Y and the criterion is Z. If people are selected on X, then the unrestricted validity rho_xz is related to the restricted validity r_xz by the formula above, i.e.,

    rho_xz = v * r_xz / sqrt(v^2 * r_xz^2 + (1 - r_xz^2))

where v = unrestricted SD(x) / restricted SD(x). But the other unrestricted validity coefficient rho_yz is given by a completely different formula:

    rho_yz = (r_yz + u * r_xy * r_xz) / (sqrt(1 + u * r_xy^2) * sqrt(1 + u * r_xz^2))

where u = v^2 - 1.

Thus there is no one formula that is suitable in all contexts for correction for restriction in range. Actually, neither of the formulas above is suitable for the present study. In the present study, the selection was not done on any one of the predictors, but was done on a composite of all of them. Development of formulas for this case is in progress (Hunter, Schmidt, and Seaton, in preparation), although they would be of little use in the present study in any case: the formulas rely on the multiple regression of the criterion onto all of the predictors. In the present case this would require the accurate estimation of a 9-variable multiple regression equation for each criterion, given only 63 observations. This is an extremely unsatisfactory ratio of observations to variables.

In the case of checking for test bias there is a further complication: not all the variables are continuous. In the case where one of the measured variables is dichotomous, as in the point-biserial correlation between race and test score, the effect of selection upon test score distributions for different subgroups will vary with the selection ratio used for each subgroup. Figure 6 presents the situation occurring in the present research, where the selection ratio for the majority = .30 and the selection ratio for the minority = .08.

[Figure 6. Difference in the restricted test distributions of two groups as a function of different selection ratios: minority (SR = .08) and majority (SR = .30) test score distributions shown with a common selection cut-off.]

Using a common cut-off point, it may be seen that test scores of selected minority applicants will tend to be concentrated much more closely to the cut-off point than will scores for the selected majority. This in turn should affect the differences in means and standard deviations between restricted and unrestricted samples differentially for minority and majority, and will thus affect the point-biserial correlations r_xc and r_yc. The formulas for such changes in point-biserial correlations are grossly different from those above for continuous variables and yield very different results. In the example chosen to resemble the present data, the point-biserial correlations changed much less than the continuous formulas would predict.
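The two corrections for continuous variables given above are straightforward to state in code; a minimal sketch (Python), using the notation of the text (v = unrestricted over restricted SD of the selection variable, u = v^2 - 1):

```python
import numpy as np

def case2_correction(r_xy, v):
    """Unrestricted correlation when selection was made directly
    on x; v = unrestricted SD(x) / restricted SD(x)."""
    return v * r_xy / np.sqrt(v**2 * r_xy**2 + (1.0 - r_xy**2))

def third_variable_correction(r_yz, r_xy, r_xz, v):
    """Unrestricted correlation of y with z when selection was made
    on a third variable x, from the three restricted correlations."""
    u = v**2 - 1.0
    return (r_yz + u * r_xy * r_xz) / (
        np.sqrt(1.0 + u * r_xy**2) * np.sqrt(1.0 + u * r_xz**2))
```

As the text emphasizes, neither function applies when selection is made on a composite of all the predictors, or when one of the variables is dichotomous, which is precisely the situation in this study.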
Tables 14 and 15 present the test score mean and standard deviation differences, respectively, between restricted and unrestricted samples for majority and minority apprentices.

TABLE 14
Majority-Minority Mean Test Scores for Restricted and Unrestricted Samples

                        Black                      White
Predictor        Xbar_RES  Xbar_UN    D     Xbar_RES  Xbar_UN    D
Screening          17.25    12.18    5.07     19.49    13.78    5.71
Tool Knowledge     32.87    26.06    6.81     31.68    32.54    -.86
Shop Math          11.06     9.50    1.56     12.70    12.37     .33
Arithmetic         19.31    13.45    5.86     23.79    18.69    5.10
Object Vis.        27.37    17.39    9.98     28.89    24.11    4.78
Battery            98.81    55.37   43.44    118.55    99.43   19.12
H.S. Transcript    21.19    17.54    3.65     23.94    22.68    1.26
P.H.S. Training     6.81    17.00  -10.19      1.45     9.33   -7.89
GED                 1.62     5.71   -4.09      2.62     7.11   -3.10
C.S.P.              4.94     4.18     .75      3.38     3.12     .25

TABLE 15
Changes in Standard Deviations from Restricted to Unrestricted Samples of Majority and Minority Apprentices

                        Minority                   Majority
Predictor         SD_UN   SD_RES   dSD_B     SD_UN   SD_RES   dSD_W
Screening          5.31    3.38     1.93      5.12    3.60     1.52
Shop Math          3.61    2.46     1.15      3.59    2.94      .65
Arithmetic         6.13    4.00     2.13      5.11    5.64     -.47
Tool Knowledge     8.46    4.78     3.68      6.49    6.29      .20
Object Vis.       10.17    8.47     1.70      9.71    7.41     1.97
Battery           38.82   24.35    14.47     42.09   24.12    17.97
GED                3.15    1.96     1.19      3.35    4.37    -1.02
H.S. Transcript    9.91   15.83    -5.92     11.94   15.49    -3.55
P.H.S. Training   14.61   13.71      .90      7.66    1.71     5.95
C.S.P.             1.64    1.93     -.34      1.95    1.94      .01

These results indicate that while restriction in range due to prior selection affects both majority and minority apprentices, the consequences of selection are more severe for minority apprentices. Because the two subgroups are affected differently by selection, the traditional correction of the correlation between predictor and race for restriction in range is not appropriate.

Because selection has been demonstrated to have such a differential effect on majority and minority apprentice scores, suspicion is cast also on the applicability of the correction of the validity coefficient for the total sample (r_xy for the total group) for restriction in range. The interaction of restriction in range due to prior selection and the differential effects of the selection ratios employed must have an influence on corrected validities for the total group. The exact extent of this influence is unknown at present.

In summary, the question of what the test bias decisions using all three bias models would have been had the data not been restricted due to prior selection cannot be answered, because of the inapplicability of the traditional correction for restriction in range to the correction of the correlations between predictor and race and between predictor and criterion. Because unreliability in the criterion measure interacts with and is affected by restriction in range, the question of what the test bias decisions would have been had the criterion been free of unreliability also cannot be answered at this time.14

14 Many of the analyses planned in the introductory sections of the present research could not be performed because of problems arising from restriction in range. After the present study had been finalized, however, it was noted that these analyses could be performed when the composite test score is used to predict the criteria. Results of these analyses are available in a separate document from the author of the present research, or from Dr. John E. Hunter, at Michigan State University.

DISCUSSION

Validity of Selection Tests

Test validity was assessed in accordance with EEOC guideline regulations.
The results of the validity analysis using a statistically significant correlation between predictor and criteria as evidence of "demonstrated validity" gave a very misleading impression of the extent to which the selection predictors measure the performance and achievement criteria. Statistical significance varies as a function of sample size, the chosen alpha level, and, in the present study, restriction in range. Results of the validation of the test battery serve to illustrate the inapplicability of statistical tests under these circumstances. The estimate of the correlation in the present sample between the test battery and the performance test was .20. According to the test at the .05 level, this validity is not significant. Corrected for restriction in range and criterion attenuation (thus approaching a more accurate estimate of the population correlation), however, this correlation was .42. This validity is certainly "significant" by any practical definition of significance, but according to the first statistical test,15 the test battery is not a valid measure of performance.

15 Tests of statistical significance should not be applied to corrected data.

Although the error of failing to reject an invalid test is minimized by adherence to an alpha level of .05, the probability of finding a test to be "invalid" in a sample of 63 when it is in fact valid in the population is very high. In the selection situation it is the latter (Type II) error which holds the most serious consequences for the employer. Statistical significance testing at a given alpha level is thus not the appropriate measure of test validity, especially with small samples.

The interest of the present research was to estimate the validity of the selection predictors in the population, rather than specifically limiting validity to the present sample. While statistical tests of the significance of validity were performed to satisfy legal guidelines, a validity coefficient which was not "significant" was not necessarily "not valid."

The test battery was one predictor which showed a high degree of relationship to the criterion measures. Correlations obtained in the present research between the test battery and performance, achievement, and the composite criterion were .20, .34, and .33, respectively.
High school others correlated .13, .29, and .26 respectively, with performance, achieve- ment, and the composite criterion. Evidence suggests that these correlations would probably not have been much 92 higher if corrected for restriction in range. The restricted variance of the high school composite differed little from the unrestricted variance, and this relation- ship is probably true of the individual scores as well. ' Arithmetic and tool knowledge tests appeared to be adequate predictors of achievement and composite criteria when corrections were made for attenuation and restriction in range. In general, the procedures Chrysler used in employee selection appear to tap better the abilities measured by~ the Ohio achievement test than those measured by the performance measurement procedure. Perhaps this is because the Ohio test is largely a test of cognitive abilities, while the performance test has a sizable psychomotor compo- nent. It should be noted that neither the Ohio test nor the performance measurement procedure is a complete measure of total production on the job. Rather, they are measures of specific aptitudes and skills necessary for successful production on the job. A more global measure of production would need to include measures of absenteeism, attitude toward the job, and other motivational variables not tapped by these tests. It was not the purpose of the present study to try to choose a criterion which was a complete measure total success on the job. As measure of specific aspects of job success, the criterion measures 93 used in the present study were probably much better than the typically used supervisor rating criterion, which are so heavily biased by the worker's personality. Results of the validation of the selection procedures currently used to select applicants for the apprentice program must be carefully interpreted within the context of the present study. Absolute magnitude of validity was expected to be low because of the effects of restriction in range due to prior selection on predictor validity. While corrections for this effect were made on the corre- lations between predictors and criteria, these corrections provide only estimates of the validities which would have been obtained had the sample used not been selected on the basis of test scores. Keeping in mind the general limita- tions of the present study, it may be generally concluded that for the total sample of majority and minority appren- tices, the test battery, high school math, high school others, arithmetic, and tool knowledge tests were the most valid predictors of total group criterion scores. Differential Validity Results of the present study were consistent with the body of evidence in the literature suggesting that signi- ficant subgroup differences in validity coefficients do not exist for majority and minority subgroups more often than could be expected by chance. Support for this 94 conclusion was weak, however, due to the existence of many cases in the present study of large but non-significant differences in the magnitude and sign (positive-negative). between predictor validities for majority and minority (apprentices. The high negative correlations for minority apprentices between several predictors and the criterion measures warranted on analysis of subgroup differences in validity beyond the usual statistical search for differen- tial validity. 
An example was given in which it was shown that selection made on the basis of the composite score of two predictors could lead to a post-selection correlation between one predictor and the criterion which was spuriously negative. The hypothesized model was, of necessity, oversimplified. The selection ratio assumed (SR = .5) was not comparable to either the selection ratio for majority (SR = .3) or minority (SR = .08) apprentices in the present data, but was chosen for computational convenience. To the author's knowledge no formula exists by which to incorporate the different selection ratios for minority and majority apprentices into the hypothesized model. In addition, the model assumes selection on two predictors, while selection in the present study was based on a composite of thirteen items. Furthermore, the effect of selection on predictor validity in the situation where both predictors are correlated with the criterion in the unselected sample was not considered. The amount of work which would be required to expand the simple model to include more complex variables and situations is much beyond the scope of the present research. Therefore, it is only suggested that the hypothesized model could account for the negative correlations between predictors and criteria for minority apprentices observed in the present data.

Test Bias

Significant differences between majority and minority mean criterion scores were found for four of the six criterion measures. Total performance, achievement, and the composite scores all showed a significant relationship to race. The direction of the correlations indicated that majority apprentices scored higher, on the average, than did minority apprentices. These differences in mean criterion scores are consistent with past research findings, although the differences found in the present research were somewhat larger than average. Schmidt et al. (1975) found an average difference of .5 standard deviations to occur in most employment research. In the present study the effect of selection on criterion scores is unmeasured, but it almost certainly affects the results and increases the degree of caution with which they should be interpreted.

Subgroup differences in mean predictor score were found, as predicted. In the present selected sample, correlations between predictor and race were significant for six of the thirteen predictors. It was known a priori that restriction in range due to prior selection would affect mean differences in predictor score for majority and minority apprentices. Correlations between race and predictor were therefore computed on data from a sample of unselected apprentice applicants and the results compared. More consistent with the results of previous studies, these estimates of the relationship between predictor and race showed that for all of the thirteen predictors, differences in majority and minority test scores were greater than 0.

All predictor-race correlations were negative (indicating that majority apprentices scored higher on the average than minority apprentices) except the correlations between post high school training and race, and corporate service points and race. The lowest correlation was -.13, between the screening test and race, while the highest correlation was -.35, between arithmetic and race.

Results of the application of the Cleary, Thorndike and Darlington definitions of test bias to the apprentice program selection procedures were much in line with the results of previous studies and theoretical prediction.
The Cleary definition labeled all predictors of the achievement and composite criteria as biased in favor of minority apprentices. The same type of bias in favor of the minority, or overprediction, was found by Cleary (1968), Davis and Temp (1971), and Temp (1971) in educational studies and by Bray and Moses (1972) and Guinn, Tupes, and Alley (1970) in employment testing studies. When performance was taken as the criterion, only three out of thirteen predictors were fair by the Cleary definition.

The test bias analysis using the Thorndike definition was, as predicted, very different from the analysis which used the Cleary definition. Only three tests were biased predictors of performance by the Thorndike definition of test bias, while for the Cleary model all but three predictors of performance were biased. (The three predictors were not the same ones in both cases.) A similar lack of consensus between the two models was found for predictors of the achievement and composite criteria. The direction of bias by the Thorndike definition was also in favor of the minority group. The Thorndike model concerns itself most with situations where differences in subgroup scores are larger on the test than on the criteria, the case found most often in the literature on employment and educational testing. A possible explanation for the present study's findings of bias in favor of the minority is that the effect of prior selection on subgroup differences on the test was greater than the effect on subgroup differences on the criterion.

Suspicion was cast on the applicability of the correction of predictor validities for the total sample when selection effects were shown to be different for majority and minority apprentices. For this reason, the Thorndike and Darlington analyses which called for the use of unrestricted estimates of validity were deemed inappropriate.

Schmidt et al. (1975) predicted that the analysis of test bias using the Darlington definition would find predictors to be almost invariably biased against minorities. This was not true in the present study. Instead, tests were most often found to be fair by the Darlington definition. When test bias was found to occur by this definition, it was most often bias against the minority group. Post high school training and corporate service points, however, were biased by the Darlington definition in favor of the minority group.

The lack of consensus among the three definitions of test bias indicated by theoretical comparisons of the models was also found to hold in comparisons of the three definitions based on actual data. In 35 out of 39 cases, a test which was biased as a predictor of a given criterion by one definition was not defined as biased by both of the other two definitions. The four cases of agreement involved post high school training and corporate service points. All definitions agreed that these predictors were biased in favor of the minority group when used to predict achievement and the composite criterion.

Conclusions

Two questions which the present research sought to answer were: (1) are the selection procedures currently used to select applicants for the apprentice training program at Chrysler Automotive Corporation valid predictors of performance and achievement criteria? and (2) are these predictors fair to both majority and minority apprentices?
The answers to both questions cannot be directly generalized beyond the context of the present study because of the small sample size upon which the results are based and the differential effect of selection on majority and minority apprentices within that sample. The general findings of the present study were, however, consistent with prior research findings. The test battery, the major selection tool in the Chrysler procedure, proved to be a very valid measure of the performance and achievement criteria used in the present research. High school math and high school others were two other predictors which, along with the test battery, were significantly correlated with criterion measures. Arithmetic and tool knowledge tests also showed adequate correlations with achievement and composite criterion scores. The other eight predictors did not show a high degree of relationship to criterion measures.

The most important implication of the results of this validation study concerns the continued use of these selection tests. Only three predictors in the selection procedure satisfy legal (i.e., EEOC) requirements for the demonstration of test validity, but a total of five of the thirteen predictors showed correlations (corrected for attenuation and restriction in range) equal to or above correlations usually reported in the employment testing literature. These five predictors account for a possible 300 out of 445 total points on the selection procedure. From a legal point of view, it is not the validity of the individual tests which is of most concern, but the validity of the total selection procedure. In the present study it was not technically feasible to investigate the validity of the total score on the Chrysler selection procedure, but the test battery composite score gives a good indication of what that validity might be. The battery was a statistically and practically valid measure of the performance and achievement criteria. Corrected for attenuation and restriction in range, correlations between the test battery and performance, achievement, and the composite criterion were .42, .59, and .58, respectively. For decision-making purposes, it is the characteristics of the test battery, rather than the individual tests, which are of greatest practical and legal importance.

Results supporting the continued use of the five most valid predictors in the Chrysler selection battery may only be considered suggestive. Evidence of potential adverse impact of the selection battery on minority apprentices was seen to exist in the markedly different proportions of majority and minority applicants selected with this procedure. The results of the present study would not satisfy legal requirements for the test validity of ten of the thirteen selection predictors in the face of charges of adverse impact. This is not to say, however, that the relationships between selection predictors and performance and achievement criteria have no practical meaning.

Several alternatives are available to the employer in this situation. Another validation study could be proposed, using a larger sample size, to substantiate the validity of the selection procedure. Total score on the selection procedure should be included as a predictor, since selection is based on the composite of predictor scores rather than on separate cut-off scores on each predictor. A predictive validity design would be especially useful in avoiding the problems of prior selection encountered in the present research.
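Before turning to a second alternative, it may help to make the corrections quoted above explicit. The following is a minimal computational sketch of the classical corrections for restriction in range (Thorndike's Case II) and for criterion unreliability; the numerical inputs are hypothetical illustrations, not the correlations observed in this study.

```python
import math

def correct_for_range_restriction(r: float, u: float) -> float:
    """Classical (Thorndike Case II) correction for direct range restriction.
    r is the validity observed in the selected sample; u is the ratio of the
    unrestricted to the restricted predictor standard deviation."""
    return r * u / math.sqrt(1 + r**2 * (u**2 - 1))

def correct_for_attenuation(r: float, r_yy: float) -> float:
    """Correction for unreliability in the criterion only."""
    return r / math.sqrt(r_yy)

# Hypothetical values for illustration only.
r_observed = 0.30  # validity in the restricted (selected) sample
u = 1.5            # assumed applicant-to-incumbent predictor SD ratio
r_yy = 0.80        # assumed criterion reliability

r_corrected = correct_for_attenuation(
    correct_for_range_restriction(r_observed, u), r_yy)
print(round(r_corrected, 2))  # about .48 under these assumptions
```

The order shown, range restriction first and then attenuation, is one common convention; the sketch is meant only to make the operations concrete.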
As a second alternative, the present strategy of selection on the basis of a composite score of thirteen predictors could be abandoned, and attempts made to find a more valid strategy. Specifically, the present selection battery could be modified by using only those selection predictors which are valid measures of criterion performance and eliminating the unnecessary and invalid predictors. The present study would suggest that the test battery and high school transcript scores be included in the new selection strategy. Careful attention should be given to the development of a more valid screening measure.

The question of test fairness also has an implication for the continued use of selection tests. Judgments of fairness will depend on the definition used. The Cleary definition found most selection tests to be biased in favor of the minority group. If the interest of the employer is to assure that tests do not discriminate unfairly between races, and not necessarily to assure the most accurate prediction possible for individuals of both groups, then the continued use of selection tests will not adversely affect the hiring opportunities of minorities. If the main concern is the maximization of prediction accuracy through the selection of individuals who will have the highest probability of success, however, the Cleary model requires the use of separate prediction equations where differential validity was found for minority and majority applicants, or a multiple regression equation otherwise. With the latter strategy, far fewer than 8 percent of minority applicants would be selected with the use of a "fair" test. Use of the Thorndike definition to set predictor score cut-offs for majority and minority applicants would result in a higher selection ratio for the minority group than the use of the Cleary model. Use of the Darlington definition would result in the selection of a still higher proportion of minority applicants. As the proportion of minorities selected increases, however, prediction accuracy decreases. The final decision of which model of test bias is "best" rests on the relative value given to the cost of reduced validity in comparison to the undesirability of a very low rate of minority selection.

The present study sought to take into account the effects of restriction in range in the data due to prior selection, but did not anticipate the differential effects of selection on the data for majority and minority apprentices. A major implication for further research is that future studies of concurrent validity can no longer assume that selection effects will be the same for majority and minority apprentices, especially in the common situation where the proportions of applicants selected from each subgroup are different. (This is always the case when one cut-off is used and differences in mean test scores exist.) Investigations of concurrent validity should also include a careful consideration of the possibility that different subgroup selection ratios will have a differential effect on the validities and mean predictor and criterion scores for each subgroup. Use of a predictive rather than a concurrent validity strategy will lessen this problem and provide a better foundation for test bias analyses. Although restriction in range can also occur in predictive validation, the degree of restriction will often be less in the predictive strategy.

LIST OF REFERENCES

Arvey, R. D. Some comments on culture fair tests.
Personnel Psychology, 1972, 25, 433-448.

Baehr, M.; Sanders, D.; Froemel, E.; and Furcon, J. The prediction of performance for black and for white police patrolmen. Professional Psychology, 1971, 2, 46-57.

Boehm, V. R. Negro-white differences in validity of employment and training selection procedures: Summary of research evidence. Journal of Applied Psychology, 1972, 56, 33-39.

Bray, D. W. and Moses, J. L. Personnel selection. Annual Review of Psychology, 1972, 23, 545-576.

Brozek, J. and Tiede, K. Reliable and questionable significance in a series of statistical tests. Psychological Bulletin, 1952, 49, 339-341.

Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.

Cole, N. S. Bias in selection. ACT Research Report No. 51. Iowa City, Iowa: American College Testing Program, 1972.

Darlington, R. B. Another look at "culture fairness." Journal of Educational Measurement, 1971, 8, 71-82.

Davis, J. A. and Temp, G. Is the SAT biased against black students? College Board Review, Fall 1971, No. 81, 4-9.

Einhorn, H. J. and Bass, A. R. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 1971, 75, 261-269.

Enneis, W. H. The EEOC guidelines on testing. In The Law and Personnel Testing. Pittsburgh: University of Pittsburgh Press, 1971, 9-14.

Equal Employment Opportunity Commission. Guidelines on employee selection procedures. Federal Register, August 1970, 35, 12333-12336.

Equal Employment Opportunity Coordinating Council. Uniform guidelines on employee selection. Staff Committee Draft (mimeograph), June 24, 1974.

Gael, S. and Grant, D. Employment test validation for minority and non-minority telephone company service representatives. Journal of Applied Psychology, 1972, 56, 135-139.

Guinn, N.; Tupes, E. C.; and Alley, W. E. Cultural subgroup differences in the relationships between Air Force aptitude composites and training criteria. Technical Report 70-35. Lackland Air Force Base, Texas: Human Resources Research Center, 1970.

Hunter, J. E.; Schmidt, F. L.; and Seaton, F. W. Problems in doing validity studies on selected populations. In preparation.

Hunter, J. E. and Schmidt, F. L. A critical analysis of the statistical and ethical implications of various definitions of test bias. Paper presented at the Midwest Society of Multivariate Experimental Psychology, Chicago, Illinois, May 2, 1975.

Hunter, J. E.; Schmidt, F. L.; and Rauschenberger, J. M. Fairness of psychological tests: Implications of three definitions of selection utility and minority hiring. Michigan State University, mimeograph copy, 1975.

Linn, R. L. Fair test use in selection. Review of Educational Research, 1973, 43, 139-161.

Linn, R. L. and Werts, C. E. Considerations for studies of test bias. Journal of Educational Measurement, 1971, 8, 1-4.

Oriel, A. E. A performance based individualized training system for technical and apprentice training: A pilot study. Final report to Manpower Administration, U.S. Department of Labor, Contract No. 82-17-71-48, 1974.

Potthoff, R. F. Statistical aspects of the problem of biases in psychological tests. Institute of Statistics Mimeograph Series No. 479. Chapel Hill: University of North Carolina, 1966.

Ruch, W. W. Statistical, legal and moral problems in following the EEOC guidelines. Presented at the Symposium on Differential Validity and EEOC Testing Guidelines, WPA, Oregon, April 28, 1972.

Schmidt, F. L.
Comments on EEOC June 1974 draft uniform guidelines on employee selection procedures. Mimeograph copy, August 1974.

Schmidt, F. L.; Berner, J.; and Hunter, J. E. Racial differences in validity of employment tests: Reality or illusion? Journal of Applied Psychology, 1973, 58, 5-9.

Schmidt, F.; Greenthal, A.; Berner, J.; Hunter, J.; and Williams, F. A performance measurement feasibility study: Implications for manpower policy. Final report to Manpower Administration, U.S. Department of Labor, Contract No. 82-17-71-48, 1974.

Schmidt, F. and Hunter, J. Racial and ethnic bias in psychological tests: Divergent implications of two definitions of test bias. American Psychologist, 1974, 29, 1-8.

Temp, G. Test bias: Validity of the SAT for blacks and whites in thirteen integrated institutions. Journal of Educational Measurement, 1971, 8, 245-251.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.

Wallace, P.; Kissinger, B.; and Reynolds, B. Testing of minority group applicants for employment. Personnel Testing and Equal Employment Opportunity. Washington, D.C.: U.S. Government Printing Office, 1970, 1-11.

APPENDIX