ABSTRACT

IMPLICATIONS OF THREE DEFINITIONS OF TEST BIAS IN THE VALIDATION OF AN APPRENTICE PROGRAM SELECTION PROCEDURE

By

Felicia Williams Seaton

The EEOC guidelines on employee selection procedures (1970) require the examination of currently used selection tests for evidence of test validity and possible bias against minority groups. No consensus in the scientific literature has been reached on the most appropriate definition of test bias. The present research examined 13 predictors used to select applicants for an apprentice training program for evidence of validity and fairness, using the Cleary (1968), Thorndike (1971), and Darlington (1971) definitions of fairness to minority applicants. Sixty-three third- and fourth-year apprentice trainees volunteered to participate in the present study. Validity was measured by the relationship between selection predictors and a job sample performance test, a paper-and-pencil achievement test, and a composite score derived from performance and achievement scores. Correlations between predictors and criterion measures were computed separately for majority and minority apprentices, and the significance of the difference between validity coefficients for majority and minority apprentices was tested to determine if differential validity existed. Analyses of test bias were performed using the Cleary, Thorndike, and Darlington definitions. Results showed that for the total sample of apprentices, five of the thirteen predictors were adequately valid measures of the criteria. Differential validity was not found to occur more often than could be expected by chance alone. Differences in mean test and criterion scores for majority and minority apprentices were significant at the .05 level.
The Cleary definition specifies that a fair test can neither over- nor underpredict performance for members of any subgroup. Applying that definition to the present data, all tests were found to be biased in favor of the minority group (i.e., minority criterion scores were overpredicted by test scores). The Thorndike definition states that a test is fair if the proportion of the minority group selected using the test is the same as the proportion who could be successful on the job. Using this definition, eight tests were biased in favor of the minority group, while the other five were fair tests. The Darlington definition requires that test scores for the minority group not differ from those of any other group of applicants with the same criterion score. By this definition, three of the tests are biased against the minority group, and two tests are biased in favor of the minority group.

Legal implications of test validity and test bias were discussed, and suggestions were given for further research on the issue of the differential effects of selection on population subgroups.

IMPLICATIONS OF THREE DEFINITIONS OF TEST BIAS IN THE VALIDATION OF AN APPRENTICE PROGRAM SELECTION PROCEDURE

By

Felicia Williams Seaton

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1975

ACKNOWLEDGEMENTS

This study was made possible through a grant funded by the U.S. Department of Labor. Special thanks go to Mr. William Main and Mr. Milt Murto of Chrysler Personnel Division for their valuable assistance and cooperation in the gathering of data for the present study. I am also indebted to Alan Greenthal and other graduate students who worked with me on this project.

My deepest appreciation goes to my chairman and advisor, Dr. Frank L. Schmidt, who has been my greatest source of guidance and encouragement throughout my graduate program. Without the expert assistance of Dr. John E. Hunter and his wife, Ronda, the data analysis and statistical portions of this research would have never been completed. I would also like to thank my other committee member, Dr. Neal Schmitt, whose many suggestions have been extremely helpful.

Finally, a special debt of gratitude is due to my husband, Carlos, who suffered through many difficult hours with me, and to my parents, whose confidence in my success never wavered, and whose prayers and inspiration were always there when times were hardest. It is to my loving parents that I dedicate this thesis.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ... ii
LIST OF TABLES ... iv
LIST OF FIGURES ... vi
INTRODUCTION ... 1
REVIEW OF THE LITERATURE ... 5
METHODS ... 31
    Data Source for Criteria and Predictors ... 31
    Data Analysis ... 42
RESULTS
    Differential Validity ... 58
    Test Bias ... 65
DISCUSSION ... 89
    Validity of Selection Tests ... 89
    Differential Validity ... 93
    Test Bias ... 95
    Conclusions ... 99
LIST OF REFERENCES ... 105
APPENDIX ... 108
LIST OF TABLES

1. Correlations Between Predictors and Performance Criterion Subscores ... 48
2. Correlations Between Predictors and Performance, Achievement, and the Composite Criterion ... 50
3. Predictor Validities Corrected for Attenuation in the Criterion ... 52
4. Predictor Validity Corrected for Restriction in Range ... 54
5. Predictor Validities Corrected for Attenuation in the Criterion and Restriction in Range ... 57
6. Evidence of Differential Validity for Majority and Minority Apprentices ... 59
7. Predictor Intercorrelations Before and After Selection ... 64
8. Correlation Between Criterion Measures and Race ... 66
9. Majority-Minority Mean Differences in Predictor Score ... 68
10. Correlations Between Predictors and Race for Restricted and Unrestricted Samples ... 70
11. Test Bias Statistics--Performance Criterion ... 71
12. Test Bias Statistics--Achievement Criterion ... 73
13. Test Bias Statistics--The Composite Criterion ... 75
14. Majority-Minority Mean Test Scores for Restricted and Unrestricted Samples ... 85
15. Changes in Standard Deviations from Restricted to Unrestricted Samples of Majority and Minority Apprentices ... 86
A1. Report of Test Results for Apprentice Candidates ... 108
A2. Apprenticeship Evaluation Standards ... 109
A3. Report of Selection Results for Apprentice Candidates ... 111

LIST OF FIGURES

1. Illustration of Thorndike's argument against the traditional definition of test fairness ... 13
2. Prediction of black and white criterion scores from the white regression equation (Schmidt and Hunter, 1974) ... 16
3. Prediction of black and white criterion scores from separate regression equations (Schmidt and Hunter, 1974) ... 18
4. Case B. Regression lines are identical for majority and minority, but the minority mean is lower (Cole, 1972) ... 27
5. Effects of selection on a continuous distribution of test scores ... 81
6. Difference in the restricted test distributions of two groups as a function of different selection ratios ... 84

INTRODUCTION

Discrimination in the selection and employment of the disadvantaged has been a topic of widespread investigation in the last decade. The development of the Equal Employment Opportunity Commission (EEOC) in 1964, and the publication of its guidelines for Employee Selection Procedures in 1970, have focused the attention of employers, personnel psychologists, and federal court judges on the use of tests and other selection procedures which may have an adverse impact on the selection of minorities and other groups protected by the Civil Rights Act of 1964.

A major issue to which a substantial portion of the EEOC guidelines (1970 and 1974) is devoted is the need for the validation of currently used selection procedures. While many professionals in both psychology and business recognized a need for validation, until the early 1970's few seemed to be doing much in the direction of completing test validation studies (Enneis, 1971; Wallace et al., 1970). Maintenance of test validation information became a requirement of the 1970 EEOC guidelines. Failure of an employer to successfully validate a selection procedure can result in the loss of a suit in which a minority employee files discrimination charges (as in the well publicized Griggs vs. Duke Power case of 1971), or in severe penalties from the EEOC.
As a result, employers (especially large companies) across the nation have devoted a great deal more attention to the validation of their selection procedures. While the guidelines make it clear that employers must maintain evidence of test validity in the face of a charge of adverse impact, there exists some confusion on the issue of what constitutes "adverse impact." Currently, two major phenomena have been considered: differential validity and test bias.

Differential validity occurs when the difference between the validity coefficient for one group and that for another is significant. This issue has received a disproportionate amount of attention in past research. Humphreys (1973) pointed out that reports of between-group validity differences in the literature have often used a test for the significance of test validity for each group separately (i.e., a test for single-group validity) rather than testing for the significance of the difference between the validities for the two groups. Humphreys concluded that this confusion of the two types of validity differences has contributed greatly to the inflation of estimates of the actual number of cases of differential validity occurring in employee selection. Schmidt, Berner, and Hunter (1973) showed that when differences between majority and minority sample sizes are taken into account, the frequency of single-group validity reported in the literature is not greater than would be expected by chance alone. These research findings and others (Boehm, 1972; Bray and Moses, 1972; Ruch, 1972) suggest that failure to find significant differences in test validity for majority and minority groups does not preclude the possibility that a test could still discriminate unfairly against the minority group.

Another issue relevant to adverse impact is the question of whether or not a test is unfair or biased against minorities. To the present time, there has been no general consensus on a definition for the phenomenon of test bias, although several models have been proposed. Cleary (1968) proposed a definition specifying that a fair test can neither over- nor underpredict performance for members of any subgroup. Thorndike (1971) stated that a test is fair only if the proportion of a population subgroup selected using the test is the same as the proportion who could be successful on the job. A third definition, proposed by Darlington (1971), maintains that a test is fair only when, within any given group of applicants with the same criterion score, test scores are not different for cultural subgroups.

The present research examines the three major definitions of test bias proposed by Cleary, Thorndike, and Darlington, as they apply to the selection procedure currently used by Chrysler Automotive Corporation to select applicants for the Apprentice Training Program. Validity of each of the individual selection tests, as measured by their relationships to job performance and achievement criteria, is assessed in accordance with EEOC guideline regulations. Each test is examined for evidence of bias under each of the three definitions of test bias, and the implications of the three models are compared. In addition, the possible effects of prior selection on the data are explored.
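Since differential validity concerns the significance of the difference between two validity coefficients, not the significance of each coefficient taken separately, the distinction Humphreys drew can be made concrete. The sketch below is illustrative only, not the thesis's own computation: the validities and sample sizes are hypothetical, and the test shown is the standard Fisher r-to-z comparison of two independent correlations.

```python
# Two-tailed test of H0: the population validity is the same in both groups.
# All numeric inputs below are hypothetical.
import math
from scipy.stats import norm

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def differential_validity_test(r_majority, n_majority, r_minority, n_minority):
    """z test for the difference between two independent correlations."""
    diff = fisher_z(r_majority) - fisher_z(r_minority)
    se = math.sqrt(1 / (n_majority - 3) + 1 / (n_minority - 3))
    z = diff / se
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

# Example: r = .40 for 43 majority and r = .15 for 20 minority apprentices.
z, p = differential_validity_test(0.40, 43, 0.15, 20)
print(f"z = {z:.2f}, two-tailed p = {p:.3f}")
```

With minority samples as small as those typical of validation research, the standard error is dominated by the smaller group, which is one reason chance alone can account for many of the differences reported in the literature.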
REVIEW OF THE LITERATURE

The question of fairness in employee selection procedures is one which has centered around definitions of the commonly used terms "discrimination" and "adverse impact." The major source for the legal clarification of these concepts has been the EEOC Guidelines for Employee Selection Procedures, in which discrimination is defined as "the use of any test which adversely affects the hiring, promotion, or transfer of classes protected under Title VII unless (a) the test has been validated and a high degree of utility shown, and (b) it can be demonstrated that suitable alternative procedures are unavailable for use" (EEOC, 1970). Evidence of discrimination includes instances of higher rejection rates for minority candidates than for non-minority candidates, a situation implying adverse impact for the minority group. Specifically, "adverse impact" is defined as the use of a selection procedure which results in the selection of members of any racial, ethnic, or sex group at a lower rate than members of other groups. The newest draft of the Uniform Guidelines on Employee Selection Procedures (EEOC, 1974) goes on to state a specific rate for any group selected (4/5 of that of the majority group) below which the selection procedure will be considered to have an adverse impact.

A basic problem in the EEOC clarification of these major concepts is the failure to distinguish between differential validity and test bias as determinants of discrimination and adverse impact. One source of confusion is the use of the terms differential prediction and differential validity interchangeably. In the case in which differential prediction refers to significantly different regression lines for different subgroups, the term more accurately represents test bias, not differential validity.

Although differential validity and test bias may at first appear very similar (a source of confusion to the EEOC), they are in fact very different in their frequency of occurrence and in their consequences for both employer and employee in the selection situation. The issue of differential validity has been widely researched in recent years, and attempts have been made to reach a consensus on the terminology and implications of the concept. There remains little consensus, however, on the terms and definition which should be used to express the important "fairness in testing" phenomenon. The terms "test bias" and "test fairness" are most commonly used in the current literature, and are often used interchangeably (thus a test may equally be called fair or unbiased). The terms "culture fair" and "culture free" have also been seen in earlier literature, but are often equated with the simplest definition of fairness in testing. A notable exception is the use of the term "culture fairness" by Thorndike (1971) and Darlington (1971), in which the concept is representative of the more sophisticated definition of the phenomenon of fairness, referred to in the present research as test bias or test fairness.1

The simplest definition of test bias is linked with the original use of the terms culture fair and culture free. As the terms were first used, a culture fair test was one in which the mean test score was the same for all subgroups. This definition made the a priori assumption that all groups were the same on the variable being measured, and that any differences which occurred were due to measurement error.
This eliminates the possibility of any real between-group differences on psychological traits, and it is inconsistent with empirical evidence revealed in years of psychological testing. The culture-free/culture fair definition represented a very early attempt to establish evidence of test bias against minority subgroups, and advocated the use of nonverbal tests which were supposedly inherently more fair to minorities. Research has shown, however, that nonverbal tests do not always create additional fairness for the disadvantaged, and have the potential of enhancing between-group differences (Arvey, 1972; Bray and Moses, 1972).

1 In the context of the present research, the terms test bias and test fairness refer to the way in which a test is used to make selection decisions, not to the inherent characteristics of the test itself.

In light of the results rejecting its major premises, the culture-free concept can be seen as an unsatisfactory definition of test fairness.

The next definition of test bias to emerge was proposed by Anne Cleary in 1968. This definition, the first to be discussed in detail, has come to be accepted as the traditional definition of test bias, and has been widely endorsed by educational researchers, industrial psychologists concerned with selection, textbook authors, and governmental agencies. The first court judgement directly concerned with the issue of test bias was decided in March, 1975. One outcome of this case (Cortez vs. Rosen) was the decision that only the Cleary definition of test fairness meets the EEOC requirement for fairness.

Cleary (1968) conducted a study of bias in the Scholastic Aptitude Test (SAT) against blacks, using her own definition of test bias. According to Cleary:

A test is biased for members of a subgroup of the population if, in the prediction of a criterion for which the test was designed, consistent non-zero errors of prediction are made for members of the subgroup. In other words, the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup (p. 115).

Using this definition, Cleary looked for test bias by studying predictive validity in terms of the regression of the criteria on the test. Data were gathered from three colleges, using GPA as the criterion. The two hypotheses tested were: (1) slopes will be equal for blacks and whites, and (2) there will be equal intercepts for blacks and whites. Assuming equal standard deviations for both races, the first hypothesis meant equal validity for both groups. The second hypothesis was more fundamental to Cleary's definition of test bias, for unequal intercepts would mean that consistent non-zero errors of prediction were being made when one regression line was used to predict for both groups.

A basic concern at the time of the Cleary study was that norms on the SAT (and other educational measures) developed primarily on white students might lead to an underprediction of educational success for black students. This would clearly be an unfair situation for blacks, and the solution Cleary proposed to alleviate the consistent errors being made was the use of separate regression equations for blacks and whites. The use of separate equations for each subgroup would then provide the most accurate estimate possible for each group.

The first hypothesis of equal slopes was not rejected in any of the three colleges.
The second hypothesis of equal intercepts was rejected in only one of the three colleges, but the direction of bias was unexpected: in that college, the SAT tended to overpredict, rather than underpredict, black GPA. While overprediction is still a form of bias in the strict sense of the Cleary definition, its implications are not as severe in considerations of discrimination against blacks. Cleary's conclusion was that little evidence exists of test bias against blacks in the SAT.

The Cleary definition (also called the regression model) has been the most widely accepted definition of test bias within a predictive context. It has been the theoretical basis for a substantial number of educational studies of bias (e.g., Cleary, 1968; Davis and Temp, 1971; Linn and Werts, 1971) and employment studies (e.g., Bach et al., 1971; Gael and Grant, 1972).

The model of bias presented by Cleary has a statistical as well as an intuitive appeal. In a sense, the use of a Cleary-defined fair test is "fair" to the institution because it maximizes prediction accuracy, thereby selecting applicants with the highest average criterion scores. With this definition, the highest quality work force (i.e., those most likely to actually succeed) will be selected. This is a major plus for this definition from an institutional standpoint.

A general (but never formally stated) argument against the Cleary definition claimed that there is a problem in defining fairness by the Cleary model when a less than perfectly reliable test is used. The concern is that the effect of unreliability in a test would be to produce artifactual differences between subgroup regression lines, leading to the labeling of a test as biased simply because of its unreliability. In this case the bias would favor the minority group. Hunter and Schmidt (1975) recognized this as a false issue, however. They demonstrated that an unreliable test would in fact be biased against better qualified applicants, but this bias is not racial in nature. It is, rather, a bias against the more qualified applicant, regardless of race, because it would be those applicants just above the test cut-off who would be rejected because of test unreliability. They also point out that the argument that tests are biased against blacks because they are unreliable is false, because in the case of .00 reliability, a test becomes a random selection device which would select blacks in proportion to the number of black applicants, and hence might well select blacks in proportion to population quotas. Thus, this common argument against the Cleary model is not only false, it is exactly opposite the truth.

The Cleary definition adequately fulfills the requirement of the ethical position of selecting "the best man for the job," and is fair to the individual because a fair test will never over- or underpredict that individual's performance. From a very different ethical standpoint, however, the Cleary definition is not satisfactory. Thorndike's objection to the Cleary definition represents such an ethical standpoint.

Thorndike (1971) argued that a test fair by the traditional definition of equal regression equations is unfair to the lower scoring minority as a whole, because the proportion qualifying on the test will be smaller, relative to the higher scoring group, than the proportion that will reach any specified level of criterion performance. The problem he identified is illustrated in Figure 1.
The situation in Figure 1 depicts a test fair by the Cleary definition: the regression of the criterion on the test is identical for the two groups (shown by a solid line with a slope = .25). Thorndike noted, however, that the difference in the means on the test is substantially larger than the difference in the means on the criterion variable. This difference has a notable effect on selection decisions. Suppose, for example, the mean of the majority group on the criterion was used as a success-failure cut-off. As can be seen in Figure 1, 50% of the majority group in this example would be successful, and about 30% of the minority group would succeed. Based on prediction of the criterion from knowledge of test scores alone, however, 50% of the majority but less than 5% of the minority would be selected with a "fair test"!

In light of this situation, Thorndike proposed a second definition of test bias, in which he states:

An alternate definition would specify that the qualifying scores on a test should be set at levels that will qualify applicants in the two groups in proportion to the fraction of the two groups reaching a specified level of criterion performance (p. 63).

[Figure 1. Illustration of Thorndike's argument against the traditional definition of test fairness: criterion scores plotted against test scores for the majority and minority groups.]

By adopting this definition, Thorndike suggests that a fair test will provide each group the same opportunity for admission to training or to a job as would be presented by the proportion of the group falling above a specified criterion score. Stated differently, the same proportion will be selected on a fair test as would be selected if the criterion itself were used to determine selection, or as if the test had perfect validity.

The "group fairness" definition proposed by Thorndike has an appeal very different from Cleary's individualist approach. In one sense, it does seem only fair to select from a group in proportion to the number of individuals in that group who could in fact perform a job successfully. In the example above (Figure 1), a Thorndike-fair test would, by selecting on criterion success cut-offs, select a markedly greater proportion of minorities than would the Cleary-fair test.

One of the major criticisms of the Thorndike definition is that it is not statistically optimal. It represents a probability matching approach where the specific a priori probability of being selected is the random probability that anyone from the group will be successful. If every "minority" group (for example, the low scoring majority) claimed that they should be selected in proportion to the number of individuals in that group who could be successful on the criterion, differential selection and placement would become a virtual impossibility, and many of the advantages of the use of valid selection procedures would be lost.
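Operationally, Thorndike's rule sets a separate qualifying score for each group: the score above which the same fraction of the group falls as reaches the criterion success cut-off. The following is a minimal sketch assuming normally distributed test scores; the 50% and 30% success rates echo the Figure 1 example, while the test means and standard deviations are hypothetical.

```python
# Thorndike-fair qualifying scores for two groups. The success proportions
# follow the Figure 1 discussion; the distribution parameters are assumed.
from scipy.stats import norm

def thorndike_cutoff(test_mean, test_sd, p_successful):
    """Test score exceeded by exactly p_successful of the group."""
    return norm.ppf(1 - p_successful, loc=test_mean, scale=test_sd)

majority_cut = thorndike_cutoff(test_mean=50, test_sd=10, p_successful=0.50)
minority_cut = thorndike_cutoff(test_mean=30, test_sd=10, p_successful=0.30)
print(f"majority qualifying score: {majority_cut:.1f}")  # 50.0
print(f"minority qualifying score: {minority_cut:.1f}")  # about 35.2
```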
From the Cleary position, such a test is both statistically inaccurate and ethically biased against the majority. A very practical criticism of the Thorndike definition follows from the above discussion. With this definition there could be a greater incidence of placing individuals in roles for which they are not suited (Hunter and Schmidt, 1975). This could be a different type of unfairness to minorities who if hired would be unsuccessful, especially if failure has some critical effect for employee, employer, or both. Schmidt and Hunter (1974) examined the different practical implications of the two definitions of test bias. In the example they presented (Figure 2), they made the assumption that the difference (in standard deviation units) in the black and white means on a given test is equal to 16 -(> Figure 2. Prediction of black and white criterion scores from the white regression equation (Schmidt and Hunter, 1974). 17 the racial difference on the criterion. That is, E(XW - E(XB) - E(YW) - E(YB) = 1 SDK = lsnY mean, subscripts W, B = white, black, respectively, X = 10. (Where E = the test, Y = criterion). The criterion variable Y is not shown, however. Instead Y, the values of Y predicted from the regression equation f. If RXY is .50, the white regression equation is YW = .SOX + 25. Under these circumstances, the test will overpredict for blacks (pre- are given, and E(YW) - E(YB) = 1 SD dicting a mean score of Y = 45 instead of Y = 40). This makes the test biased by the Cleary definition. But because the mean differences between groups are equal on test and criterion, it is fair by the Thorndike procedure. With the use of the Thorndike "fair" test. if the selection cut-Off were placed at the majority mean, 50% of the majority and 16% of the minority would be selected. If, however, separate equations were used to fit the Cleary definition of a fair test, the result would be as shown in Figure 3. Now the test neither over- nor underpredicts for either group, but this "fair" test selects only 2.3% of the minority group. More generally, Schmidt and Hunter conclude: "For all selection ratios, selection on the basis of a test meeting Cleary's definition results in the acceptance of a markedly smaller percentage of minority applicants than selection on a test meeting Thorndike's definition" (Schmidt and Hunter,l974). . 1 | | 18 -<> Figure 3. Prediction of black and white criterion scores from separate regression equations (Schmidt and Hunter, 1974). 19 In the analysis of bias studies in educational testing, Schmidt and Hunter (1974) found that many of the studies which had found the test used to be biased to favor minor- ities (in the direction of overprediction for blacks) by the Cleary definition, were nonetheless biased against blacks by the Throndike definition of equal mean differences on test and criterion. Similiar results were also found in employment studies reviewed by Schmidt and Hunter (1974). Other evidence of overprediction for blacks and Thorndike bias in the same data has been reported by Linn (1973). Schmidt and Hunter (1974) propose that, holding dif- ferences between groupsconstant, the larger the validity, the smaller the overprediction by the Cleary definition need be for a test to be fair by the Thorndike definition. Alternately, holding validity constant, the smaller the difference in subgroup criterion score, the smaller the magnitude of overprediction required. 
Noting the conflict of existing definitions of test fairness, Darlington (1971) approached the issue (which he and Thorndike both called "culture fairness") by first translating the positions represented by Cleary and Thorndike to a common denominator and, within the same language, proposing three other definitions. This was accomplished by stating all definitions in terms of correlational analysis.

Darlington started with the criterion variable Y and a predictor X, and defined a third variable C, which denotes group membership (e.g., majority - minority). With these terms defined, Darlington then gave four definitions of test fairness in terms of the correlations among these three variables. In each case, an equation was defined by the degree to which test X discriminated among cultural groups. For the purpose of his analysis, Darlington assumed that both groups have equal standard deviations on test and criterion and equal validity (i.e., parallel regression lines).

The first definition was a restatement of the traditional (Cleary) definition, where

1) RCX = RCY / RXY.

Stated in other terms, there can be no differences between races beyond that produced by differences on X, so that the partial correlation of group membership with the criterion, with the test partialed out (RCY.X), should be zero. In other words, a test is fair by Cleary if knowledge of group membership cannot be used to increase prediction accuracy. Fairness will be maximized by selecting people with the highest criterion scores by this definition, and the most valid test will (implicitly) be the most fair.

Darlington's second definition approximated Thorndike's position. In terms of the three variables,

2) RCX = RCY.

For a test to be fair by this definition, any racial differences which exist on the predictor will be equal (in SD units) to the racial differences which exist on the criterion. Hunter and Schmidt (1975) note that this is a correct statement of Thorndike's position if the common regression equation is used to select all applicants. If separate regression lines are used for different subgroups, an alternate definition must be given to correctly represent Thorndike's position.

Darlington then developed a third definition of test bias which requires

3) RCX = RCY * RXY.

The argument supporting definition three first assumes that successful performance on the criterion is related to a composite of many abilities, as is the ability to do well on a test. If the partial correlation between test and race, with the criterion partialed out (RCX.Y), is not 0, then the test must be tapping abilities which show large racial differences which are not relevant to the criterion. Such a test would be biased. This definition reverses the roles of X and Y as independent and dependent variables, respectively, and suggests the analysis of the regression of test on criterion rather than vice versa, as proposed in definitions 1 and 2.

Darlington's third definition received little attention until a novel argument in its favor was presented by Cole (1972). In the model proposed, Cole introduced a "fairness regardless of group" definition of test fairness which assures that, as a group, individuals who can achieve a satisfactory criterion score have the same probability of being selected regardless of group membership.
Her "equal opportunity model" (EOM) is thus a case of the requirement of definition three that RCX.Y = 0.

Darlington and Cole's definition of test fairness, which requires equal regression of test on criterion for all subgroups, is the same as the Cleary definition with the roles of test and criterion reversed. The two definitions are in no way equivalent, however, because the regression of X on Y is not the same as the regression of Y on X unless either (1) no racial differences exist on either test or criterion or (2) validity is perfect.

The fourth definition is simply stated:

4) RCX = 0.

This is the simplest definition of culture fairness. Stated in correlational terms, the test is not allowed to correlate at all with culture. This definition has already been discussed as oversimplistic and unrealistic in light of real-world selection situations.

Not completely satisfied with any of the first four definitions of test bias, Darlington proposed yet a fifth definition. In developing his fifth definition of test fairness, Darlington decided that it seemed unfair to use the white regression line to predict scores for blacks when consistent errors of prediction would be made; on the other hand, he disagreed with using race as a predictor, which he said is essentially what is done when separate regression equations are used. Darlington discussed the conflicting definitions of fairness in terms of "a conflict between the two goals of low cultural discrimination and high validity" (Darlington, 1971). He decided that the balance between the two goals required a value judgement concerning the relative importance of each.

Darlington proposed a procedure for operationalizing this required subjective judgement. He proposed that the administrator be asked to specify a number which represents the difference between criterion scores required to equalize the worth of a minority and a majority applicant. That is, he recommended the use of a number k such that a majority applicant would be worth exactly the same as a minority applicant whose criterion score is k units lower. Instead of trying to predict the criterion score Y, he suggested using the value Y - kC as the criterion in order to give a special value or advantage to hiring minorities (where C is scored 0 - 1 for minority - majority, respectively). Darlington then proposed that Y - kC, rather than Y, be the variable to be maximized in the selection process. If the regression of Y - kC on X is then fair by the Cleary definition, the test is "culturally optimal." If k = 0, Darlington's and Cleary's definitions are equivalent.

Darlington's definition five differs from Thorndike's first in format: where Thorndike's definition would set different cutting scores for different cultural groups, Darlington's system adds points to minority scores and uses the same cutting score. Both definitions are forms of quotas. The second, more fundamental difference is stated by Darlington:

Thorndike recommends, at least implicitly, a mechanical determination of the ratios of subjects selected from different cultural groups, while we recommend that these ratios be determined (indirectly) by a subjective, policy-level decision concerning the relative importance of validity maximization and cultural fairness (1971, p. 71).

Evaluations of Darlington's definition of "cultural optimality" have not tended to be highly favorable, due for the most part to his insistence on the subjective determination of the value of k.
Hunter and Schmidt (1975) point out that, from a mathematical point of view, subtracting a constant from white criterion scores is equivalent to adding the same constant to black scores without changing the prediction equation. Thus Darlington's procedure simply amounts to adding a constant to black scores so that more will be admitted. They conclude that this is just "an esoteric and uncontrolled method of setting quotas" (Hunter and Schmidt, 1975).

In Linn's (1973) opinion, although Darlington's proposal is of theoretical interest, in practice it seems unlikely that institutions will ever formalize the procedure by actually picking an explicit value of k and then trying to maximize the value Y - kC. Hunter and Schmidt (1975) suggest that when the Cleary definition is employed without the use of separate regression equations, the employer has implicitly picked a value of k equal to the distance between the majority and minority regression lines.

Darlington's definitions three and five, since they are not equivalent to either Cleary's or Thorndike's definitions, present still another perspective on the potential implications of test bias in employee selection. Cole (1972) examined the implications of these four definitions of bias, plus two others,2 in an attempt to identify the values and beliefs about fairness and the actual procedures to be followed to alleviate bias according to each model. In the five selection situations examined by Cole, the following assumptions were made: 1) only two subgroups (i.e., majority and minority) existed, 2) only one predictor was used (although she states that the results are identical for multiple prediction), and 3) a selection ratio of .20 applied.

2 The two other models were the quota model and the employer's model proposed by Einhorn and Bass (1971).

In case A, a selection situation exists in which regression lines are parallel, with the minority line above the majority, and with the minority intercept larger than that of the majority. This situation rarely if ever occurs in practice, and need not be discussed further here.

In case B (Figure 4), one regression line serves for both subgroups, with the minority means on both the predictor and criterion being smaller than the majority means. This situation is very common in the literature on selection, and is the most relevant example which Cole gives. Under these circumstances, the Darlington model (5) of fairness (with k set at .5) will select 20% of the minority group; Thorndike, 13.3%; Cole (or Darlington 3), 16.4%; and Cleary, 4.5%. It should be noted that for this situation, Cole's selection of the value for k was completely arbitrary. Had k = 0 been chosen as the culturally optimal constant, the test would have been fair by Darlington's definition (since Darlington's definition agrees with Cleary's for k = 0). The effect of the larger k value chosen by Cole makes Darlington's model select a markedly larger percentage (20%) of the minority group than would the use of the value k = 0, which would select 4.5%.

[Figure 4. Case B: regression lines are identical for the majority and minority groups, but the minority mean is lower (Cole, 1972).]

The last three cases presented by Cole all assume differential validity (i.e., different regression slopes and intercepts in the presence of equal standard deviations). However, since differential validity has not typically been found in employee selection procedures, these cases will not be discussed here.
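How directly the selected minority proportion depends on the choice of k can be seen in a small simulation. This sketch is illustrative only: the group mix, score distributions, and values of k are hypothetical, and the simulated scores stand in for the predicted criterion.

```python
# Darlington's procedure as Hunter and Schmidt characterize it: add a
# constant k to minority scores, then apply one cutting score to everyone.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
minority = rng.random(n) < 0.25                  # C: group membership
score = np.where(minority,
                 rng.normal(40, 10, n),          # minority distribution
                 rng.normal(50, 10, n))          # majority distribution

def minority_share_selected(score, minority, k, selection_ratio=0.20):
    """Top-fraction selection after adding k points to minority scores."""
    adjusted = score + k * minority
    cut = np.quantile(adjusted, 1 - selection_ratio)
    selected = adjusted >= cut
    return minority[selected].mean()

for k in (0, 5, 10):
    print(f"k = {k:2d}: minority share of selectees = "
          f"{minority_share_selected(score, minority, k):.1%}")
```

The minority share of those selected rises monotonically with k, which is the sense in which the model's outcome is set by a policy-level choice rather than by the data.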
52* ABO.—02F 28 been found in employee selection procedures, these cases will not be discuSsed here. Each of the four models of test bias reviewed has its own strengths, weaknesses, and implications for employee .selection, but none provides a completely satisfactory definition of test bias. The Cleary definition is statis- tically Optimal and will select the highest quality work force, but selects the lowest proportion of the minority group, and is especially disadvantageous to minorities when validity is low. The Thorndike definition will select a higher percentage of the minority group, but is not statistically Optimal and will not select the most quali- fied work force. Use of Darlington's definition three will select the highest proportion of minority groups, but the basic assumptions of casuality behind the definition have not been proven, and the prediction of test score form knowledge of ability (i.e., criterion score) is not the normal practice in employment testing. Darlington's goal of achieving "cultural optimality" by using Y - kC in the regression equation makes the proportion of the minority group selected using his definition depend directly on the value of k which is used. His proposal for the subjective estimation of the value k, however, makes his fifth defini- tion another method of setting quotas which is not statis- tically optimal. For these reasons, Darlington's definition 29 5 is not a useful model for the evaluationof test bias in the present research. The conclusion to be drawn from a review of the literature on test bias is at once obvious and illusive. .On the one hand, it is apparent that, with the infrequent occurrence of differential validity in the testing situation, test bias is a current and very real issue related to the problems of discrimination and adverse impact. At the same time, however, there is little general consensus as to what constitutes test bias. Three major definitions of test bias (Cleary, Thorndike, and Darlington 3) have been examined here. Each will conflict with the other under normal selection conditions. Each has different implications for both employer and employee. One deficiency found in the literature reviewed was the absence of a sound comparison of the three major defini- tions using actual data from an empirical study. In an attempt to remedy this deficiency, the present research aimed to clarify the specific implications of each of the three models of test bias when each was applied to a selection procedure used to select employees for an apprentice training program. Each of the predictors in the selection battery was examined first for validity, and then for evidence of bias under each definition of test bias. Because data for the present study were collected on a 30 sample which had been selected on the basis of scores on the selection predictors, the effects of prior selection on test validity and test bias were also considered. 
METHODS

Data Source for Criteria and Predictors

The data source for the present study was a recent study (Schmidt, Greenthal, Berner, Hunter, and Williams, 1974) which constructed a job sample performance test for the skilled trades.3 Based on the knowledge that blacks and other disadvantaged groups score significantly below national norms on most paper and pencil tests, it was suggested that those paper and pencil tests could be tapping the determinants of job proficiency on which racial differences are largest or, conversely, that they may fail to tap determinants of job success on which differences are smaller or perhaps non-existent. If this were true, then Schmidt et al. reasoned that a well-constructed job sample performance test would show smaller racial differences than the traditional paper and pencil tests.

Hunter et al. (1975) reported that, in general, performance tests do show smaller racial differences than paper and pencil tests. Hence, if the content of a job sample performance test is essentially identical to job content, it should show racial differences smaller than those of paper and pencil tests, and in fact no different from actual job performance differences. Because of the nature of its content and the methods used to determine content, a well-constructed job sample test is, ipso facto, content valid (meeting EEOC regulations). The performance measure which they developed provided an adequate criterion against which to measure validity and test bias in the present research.

3 This study was performed for the Department of Labor, in an attempt to develop a content valid and reliable measurement of performance, and to evaluate the practical and economic feasibility of the use of that performance measure.

A thorough job analysis was the logical first step in the development of the performance test. This had already been done by a previous researcher (Oriel, 1974). Supervisors and journeymen were asked to indicate which tasks apprentices were expected to master by the end of their first year. These tasks and other job analysis information were used to delineate 31 "activity modules"--independent performances, each of which incorporated a number of the tasks identified by supervisors and journeymen. Twenty-three shop modules were selected from the 31 activity modules to be included in the study by Schmidt et al. A slightly modified list of these 23 modules was then presented to 21 local experienced machinist-journeymen for rankings for each machine on difficulty, frequency of occurrence, and importance for both apprentice and journeyman positions.

The next step in the construction of the performance test was the determination of the number of independent evaluation scores which could be derived from each module. Several distinct advantages were seen in the use of the method of end-product evaluation rather than specific behavior or process evaluations. These advantages included greater interjudge reliability, greater practical feasibility, and higher validity for the former method. Tolerance and finish were the two end-product evaluations used to measure performance in the study by Schmidt et al.

Each task was next executed by a machinist consultant to the project, and the information obtained was used to construct two tentative forms of the test. These forms were combined to yield a tentative final form, which was then pilot tested in a local machine shop, and the final form for the performance test was drawn up.
A major goal of the Schmidt et al. performance measurement procedure was the evaluation of the ability to perform specific tasks which are central to the machine trades occupation. It was not the intent of that study to construct a global measure of job performance, such as the overall value of the individual to the organization. For this reason, only a limited number of specific skills and abilities were measured by the job sample performance test, while more general behaviors, such as absence rate, frequency of large errors, task sequences, etc., were not used to evaluate performance.

Four paper-and-pencil aptitude tests and one paper-and-pencil achievement test were also administered to apprentices to test the hypothesis of smaller between-group mean differences for the performance test. Selected were the Wonderlic Personnel Test (a test of general ability), the Minnesota Paper Form Board (a test of spatial aptitude), the Purdue Industrial Training Classification Test (an arithmetic reasoning test), and the Bennett Mechanical Aptitude Test.

The achievement measure used was the Machines Trade Achievement Test of the Ohio Trade and Industrial Education Achievement Tests (1960 series). The test was developed by Ohio State University for the Ohio State Department of Education for use in connection with the state's vocational and technical education programs. It was carefully constructed using content validation procedures, and was judged content-valid by the project's expert machinist consultant. Internal consistency reliability measured by KR-20 was .84 (Schmidt et al., 1974). To provide a contrast to performance measures, and to add an indication of measured achievement, the Ohio Achievement Test was used as a criterion against which to judge test validity and test bias in the present research.

Between November 1973 and June 1974, the Performance Measurement Procedure (PMP) was administered to 87 apprentices (including 68 from Chrysler) with the equivalent of one year's machinist training. Apprentices were tested in groups of five. A short presentation was given to each group explaining the nature of the project. Following the presentation, each apprentice was given a piece of stock, a task drawing, and identifying labels. All task instructions were recorded on cassette tapes, and tape players with earphones were placed at each task station. Average time for the performance test was just over four hours. When time permitted, the five paper and pencil tests were given when the PMP was completed, but when the eight-hour allotment was insufficient, the paper-and-pencil tests were administered to small groups of employees at a later date.

Evaluation was made on the basis of the end-product tolerance and finish measurements and the task time required.4 Each piece was measured for tolerance by judges (non-professionals in the skilled trades) using highly precise dial-read micrometers and calipers. The system developed to analyze the results of the absolute measures recorded by each judge was a three-point system, in which each piece was judged within tolerance, outside the first but within the second less stringent tolerance, or outside both. Finish was determined by comparing each end product with "bench-marks" (workpieces chosen in advance to specify quality levels of finish), so that relative rather than absolute judgements were required.

4 No time limits were imposed, but since the instructions emphasized speed as well as quality, work time was recorded as a third score.
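The three-point tolerance system can be written as a simple scoring rule. In the sketch below, the point values (2/1/0) and the example tolerance bands are assumptions for illustration; the study reports only the three ordered categories.

```python
# Three-point tolerance scoring: within the stringent tolerance, within the
# second less stringent tolerance, or outside both. Point values assumed.
def tolerance_score(measured, target, tol_strict, tol_loose):
    """Score one workpiece dimension against two tolerance bands."""
    deviation = abs(measured - target)
    if deviation <= tol_strict:
        return 2    # within the first (stringent) tolerance
    if deviation <= tol_loose:
        return 1    # outside the first, within the second tolerance
    return 0        # outside both tolerances

# Hypothetical dimension: target 1.500 in., +/-0.002 strict, +/-0.005 loose.
print(tolerance_score(1.5014, 1.500, 0.002, 0.005))  # 2
print(tolerance_score(1.5040, 1.500, 0.002, 0.005))  # 1
print(tolerance_score(1.5090, 1.500, 0.002, 0.005))  # 0
```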
The results of the study by Schmidt et al. showed high inter-rater and internal consistency reliability for the Performance Measurement Procedure. In the case of all but one written test, there were large and significant differences between the majority and minority groups. Except for tolerance scores on the lathe, no significant differences were found on any dimension of the tolerance or finish scores of minority and majority apprentices. However, minority apprentices did take longer on the average to complete the performance test.

The fact that there were differences in time but not in quality of performance is consistent with the assumption that the apprentices used an internal rule of "always take time for quality." Had the men been forced to produce at a given pace, the time difference would have been translated into quality differences. This created a problem in the use of the performance test as a criterion: how shall time and quality be weighted? This in turn depends on the actual job for which the test is to be used. In a job where the criterion is "produce the best possible piece regardless of time," the proper weight for time would be 0. But in a task where wide discrepancies in quality are still "acceptable," production is proportional to speed and the two quality scores should receive 0 weight.

Schmidt et al. arbitrarily defined a "total performance score" by adding the standard score for tolerance to the standard score for finish and subtracting the standard score for time. That is, they arbitrarily weighted quality twice as heavily as time. This study will consider each score separately (as well as the Schmidt et al. total score) in order to permit inferences to either case.

The five criterion measures used for the present study are: three performance test subscores (total tolerance score, total finish score, and total time score), total performance score, and total score on the Ohio Machines Trades Achievement Test. In addition, a composite criterion score was computed by combining performance and achievement scores, such that the total of the three performance measures was weighted equally with the total score on the Ohio test, and the two totals were combined to equal one composite score.

The predictors in the present research are the selection variables used to determine entrance into the Chrysler Apprentice Training Program. This selection procedure consists of a screening test, a battery of four paper and pencil tests, and eight background variables which attempt to assess relevant school and work experience. Samples of the forms used to report all data maintained on an individual applicant are presented in Tables A1-A3 of the Appendix.

The first test encountered by the applicant is the screening test, called the Chrysler Personnel Test. This is a five-minute multiple-choice test composed of items on arithmetic and general reasoning ability. A minimum score of 15 on this screening test is required for an applicant to continue in the selection process.

Upon successful completion of the screening test, applicants are rescheduled to take the Apprentice Test Battery, which consists of four tests: the Tool Knowledge and Shop Arithmetic tests from the SRA Mechanical Aptitude Test Battery, the Arithmetic Test Form B developed by Chrysler, and the Survey of Object Visualization Test. All four tests are administered to applicant groups at one sitting. Total testing time is approximately 2 1/2 hours.
The SRA Tool Knowledge test is a multiple-choice test of 45 questions, each with five response choices. It is designed to measure knowledge of the different tools, machines, and other devices used most often by mechanical tradesmen. Each question shows a picture of the device, followed by the question "This is a _____" or "This is used to . . ." The time for the test is ten minutes, and the score is the number correct.

The SRA Shop Arithmetic Test is a multiple-choice test designed to measure skill and aptitude in work with numbers. This fifteen-minute test has 24 questions, each with four responses. The questions require skill in addition, subtraction, multiplication, and division, as well as the reading of diagram charts. Total score equals the number right. The lower limit for reliability (by the Kuder-Richardson 21 formula) for the total score of the SRA Mechanical Aptitude Battery is .83 (the test manual does not report reliabilities for separate tests).

The Arithmetic Test Form B was developed by the Chrysler Personnel Division in 1961. It is a ten-minute, 45-item multiple-choice test designed to measure skill in basic arithmetic. Each question has four possible solutions. Score equals the total number right. No reliability data were available on this test.

The fourth test, the Survey of Object Visualization, is composed of 44 items. In each item, there is a drawing of a flat pattern followed by four objects, one of which correctly represents what the object would look like when folded up. This test is designed to measure the ability to predict how an object will look when its shape and position are changed. Total score equals the number correct minus one third the number incorrect. Split-halves reliability based on 266 cases (corrected by the Spearman-Brown formula) is reported in the test manual to be .91.

Raw scores from each of the four tests are converted to percentiles, and the percentiles are then averaged to give a composite battery score for each applicant (see Table A1, Appendix).5 This composite score (averaged percentile) is then multiplied by two to give the total score on the Apprentice Test Battery. The maximum number of points is thus 200.

5 It should be noted that this average of percentile scores is not itself a percentile score.

Applicants also receive points on eight other dimensions. Until September 1, 1973, a high school diploma or GED equivalent was required before application could be made for the Apprentice Training Program.6 Points are awarded on a five-point scale for high school grades in mathematics, science, and other related courses, according to the grade received. Similarly, points are awarded for scores on the five tests of the GED according to score. The maximum number of points awarded for GED and high school grades is 100.

6 This requirement has only been provisionally dropped for a one-year period. It does not affect the current report.

Points are also awarded to applicants for general course work applicable to apprenticeship which was taken after high school. Points are awarded on a five-point system according to grade, to a maximum of 10 courses and 50 points. A maximum of 25 points can be accredited to an applicant for Armed Forces work in a related field.
Corporate Service Points are given to Chrysler employees for each year of service up to five years, and one point for each five years or fraction thereof following the first five years.

Total points possible on the entire apprentice selection procedure (including the test battery and other evaluation standards) equals 445. The total points required to qualify for a position on the priority list for Apprentice Training is 115.7

7 Up until February 1, 1971 an interview was required for each applicant. Maximum points for the interview was fifty, and the total qualifying score was 150. When the interview was discovered to be unreliable and unnecessary, it was dropped, and the average interview score (35) was subtracted from the total qualifying score.

Subjects

Sixty-eight apprentices from the Detroit Chrysler plants with the requisite amount of machine training volunteered to participate in this study. Most subjects were third- and fourth-year tool-and-die apprentices. Twenty of the 68 subjects were minority (29.4%); one female took part in the study, making it impractical to try to break down results on the basis of sex. Five of the 68 subjects (3 minority and 2 majority) failed to complete all parts of the performance test. Their scores were not included in the statistical analysis portions of the present research.

Data Analysis

One of the major statistical analyses in this research concerns the verification of the validity of each of the predictors (i.e., tests) used to select applicants for the apprentice program. This analysis involved the examination of the correlation coefficient which describes the relationship between each of the 14 selection tests and each of six criterion measures. Correlations were computed separately for minority and majority apprentices, as well as for the entire sample of apprentices.

The EEOC guidelines state that validity is demonstrated by a correlation between test and criteria which is significant at the .05 level. The present research does not limit the term "valid test" to the statistical demonstration of validity proposed by the EEOC, although significance tests were performed on predictor validities and the results were noted.8 Analyses of test bias were performed on all tests regardless of the statistical significance of their relationship to the criteria.

The validity of a test is defined as the correlation between the test and the criterion in the applicant population (i.e., before selection takes place). However, in the present study, data on the criteria could be obtained only for the restricted population of persons who were selected using the predictor tests. Restriction in range due to prior selection tends to produce underestimates of validity in the applicant pool.

8 In analyses using a very small sample size, adherence to an alpha level of .05 minimizes the probability that a test which is really not valid will be found valid. At the same time, however, the probability of finding a test invalid (by a statistical test) when it in fact is valid may be quite high. For example, if the true validity of a test is .25 and a sample of 60 is drawn, the probability that the sample validity will not be significant at the .05 level is .50. That is, half the sample coefficients would be incorrectly interpreted as showing the test to be invalid. In a sample of 17, the probability of incorrectly assuming that the test is invalid would be .80. If a test is incorrectly assumed to be valid and is used for selection, then the prior random selection procedure is being replaced by a new random selection procedure, and hence has neither a good nor a bad effect. But if a valid test is incorrectly labeled "invalid" and hence rejected, then a potentially useful selection procedure is being replaced by a random selection procedure at great cost. Thus, only the second kind of error is really relevant, and preoccupation with maintaining a small alpha level may be highly counterproductive.
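The probabilities cited in footnote 8 can be checked with the Fisher z approximation to the sampling distribution of a correlation; a minimal sketch (Python):

```python
import numpy as np
from scipy.stats import norm

def prob_nonsignificant(rho, n, alpha=0.05):
    """Probability that a sample correlation from a population with
    true validity rho fails a two-tailed significance test at the
    given alpha level, via the Fisher z approximation."""
    z_rho = np.arctanh(rho)               # Fisher transform of rho
    se = 1.0 / np.sqrt(n - 3)             # standard error of z
    z_crit = norm.ppf(1.0 - alpha / 2.0)  # two-tailed critical value
    power = norm.sf(z_crit - z_rho / se) + norm.cdf(-z_crit - z_rho / se)
    return 1.0 - power

# Approximately .51 and .84 -- close to the .50 and .80 cited in
# footnote 8, which presumably used coarser tabled values.
print(round(prob_nonsignificant(0.25, 60), 2))
print(round(prob_nonsignificant(0.25, 17), 2))
```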
To obtain an estimate of unrestricted validity, a correction for restriction in range was performed. As will be shown below, there is no formula in the literature which is exactly appropriate for correcting for restriction in range in the present data. However, Thorndike's (1971) Case II formula was used to provide an indication of the size of the restriction.9

9 This formula is given by:

    rho_xy = v * r_xy / sqrt(v^2 * r_xy^2 + (1 - r_xy^2)),

where v = SD unrestricted / SD restricted.

This correction was accomplished by utilizing the standard deviations for each test computed on an unrestricted sample (i.e., a sample in which both passing and failing scores were included in the computed standard deviations) of scores of 200 applicants drawn at random from the apprentice program applicant record files.

A final analysis in the validation portion of the present research was a test for differential validity. Although the EEOC guidelines require the use of a .05 level of significance for evidence of differential validity, the body of evidence against the probability of finding differential validity more frequently than could occur by chance alone suggests the use of a more stringent level. Therefore, a .01 level was used.

A major goal of this research was the determination of the fairness or bias of each predictor. A strategy for determining test bias was proposed by Potthoff (1966). For the analysis of predictor bias by the three definitions, the following procedure was devised. All tests of statistical significance used a level of .05 and were performed on the uncorrected data. The first two tests were of the significance of the correlations between criterion measures and race, and between predictor score and race. The results of these tests provide an indication of the relative influence that racial group membership has on criterion and predictor scores.

For both the Cleary and Darlington definitions, determination of test bias was derived from the examination of partial correlations. For the Cleary model, the null hypothesis for test fairness is r_yc.x = 0. This formula assumes that x and y are measured without error and that there is no restriction in range. While formulas exist for correction for attenuation in x and y, no appropriate formulas exist for correction for all cases of restriction in range (as will be shown below). Hence, the statistical test was simply run on the uncorrected partial correlation.

The Thorndike model of test bias requires that the difference in standard score units between the majority-minority mean difference on the test and the majority-minority mean difference on the criterion equal 0. That is, this definition of fairness states: (Xbar_W - Xbar_B) - (Ybar_W - Ybar_B) = 0 (where Xbar_W and Ybar_W = the means for the majority on test score and criterion score, respectively, and Xbar_B and Ybar_B = the mean test score and criterion score, respectively, for the minority group).
In order for the test of this hypothesis to be completely accurate, X and Y must be expressed in standard score form. This is done implicitly when the test used to specify fairness is r_cy - r_cx = 0. Where this difference was positive and significant, the Thorndike definition labeled the test unfair, or biased against the minority group.10

10 This assumes that the majority group is scored high and the minority group low on C. If the majority is scored high (e.g., 1) and the minority low (e.g., 0), then a negative difference in r_cy - r_cx would specify bias against the minority group.

The null hypothesis for test fairness using Darlington's definition (3) requires r_cx.y = 0.

For the analysis of test fairness by both the Darlington and Thorndike definitions, it was necessary to make a correction for unreliability in the criterion measures. This procedure, called correction for attenuation, provides the best estimate of the magnitude of true differences (i.e., the differences after controlling for the effects of random error) in minority and majority criterion scores. This correction was necessary because random error or unreliability in a criterion measure acts to obscure or reduce true group differences. This correction proved to be more complex than expected, however, and was found to be infeasible in the present research, as will be discussed below.

RESULTS

The product-moment coefficients of correlation between predictor scores and scores on each of the three performance subtests are presented in Table 1. In concurrence with EEOC requirements, validity is reported separately by racial group. Because of the large difference in the sample sizes of majority and minority apprentices (47 and 16, respectively), the test of the significance of validity coefficients for each group separately is not especially meaningful in the present research, and will not be discussed. A more meaningful analysis of between-group differences in validity will be presented later. For the interested reader, the test of significance (at the .05 level) of the validity reported for the majority group requires a correlation greater than or equal to .285. For the minority group validity must be greater than or equal to .50 for statistical significance.

Correlation coefficients computed on the total sample which reached the .05 significance level required by the EEOC for demonstration of validity are shown in Table 1 by an asterisk (*). No scores were reported for any of the subjects in the total sample for Armed Forces work, so that predictor was dropped from all further analyses.

[Table 1, Correlations Between Predictors and Performance Criterion Subscores, is too badly garbled in this copy to reproduce. For each predictor it reported correlations with the tolerance, finish, and time subscores for majority (W) and minority (B) apprentices and for the total sample; time was reverse scored so that a low time yielded a favorable score, and *p < .05.]
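The validity coefficients reported in Table 1 (and in the tables that follow) are ordinary Pearson correlations computed separately by group and for the total sample; a minimal sketch (Python, with hypothetical array names):

```python
import numpy as np
from scipy import stats

def validities_by_group(predictor, criterion, race):
    """Pearson validity and two-tailed p value, reported separately
    by racial group (as the EEOC guidelines require) and for the
    total sample.  `race` is an array of labels such as 'W' or 'B';
    `predictor` and `criterion` are numeric arrays of equal length."""
    results = {}
    for group in np.unique(race):
        mask = race == group
        results[group] = stats.pearsonr(predictor[mask], criterion[mask])
    results["total"] = stats.pearsonr(predictor, criterion)
    return results
```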
Significant validity coefficients were obtained for three predictors of tolerance score. The test battery, GED, and high school others correlated .25, -.25, and .28, respectively, with tolerance scores. Only work experience correlated significantly with finish scores (r = -.28), and corporate service points (CSP) correlated significantly with time scores (r = -.34). None of the other 31 correlation coefficients were significant at the .05 level.

Correlations between the predictors and performance, achievement, and the composite criterion computed for the total sample are shown in Table 2. The demonstrated relationship between the predictor scores and performance criterion scores was weak. None of the correlations reached statistical significance at the .05 level. The test battery, post high school training, work experience, and corporate service points showed the highest correlations with performance scores, showing r's of .20, -.19, -.19, and -.19, respectively.

Selection predictors were more highly correlated with achievement criterion score. A correlation of .34 was seen between the test battery and achievement, while high school math and high school others correlated .31 and .27, respectively, with achievement score. All three correlations were significant at the .05 level. Significant validity coefficients were also obtained for the correlations between the test battery and composite score (r = .33) and high school math and the composite score (r = .30).

Table 2
Correlations Between Predictors and Performance, Achievement, and the Composite Criterion (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .14              .08
shop math                 .08               .13              .13
arithmetic                .12               .20              .20
tool knowledge            .10               .16              .16
object vis.               .10              -.05              .03
battery                   .20               .34*             .33*
G.E.D.                   -.08              -.07             -.09
H.S. math                 .19               .31*             .30*
H.S. science              .06               .18              .15
H.S. others               .10               .27*             .23
P.H.S. training          -.19              -.10             -.18
Work experience          -.19              -.09             -.17
C.S.P.                   -.19              -.11             -.19

*p < .05

One finding which was somewhat contrary to expectation was the low validity demonstrated by the screening test and the four tests which combined to make up the test battery. None of these five tests correlated significantly with any of the six criterion measures. Slightly less than significant validity was obtained from the correlations of the screening test with the finish and time subscores (-.18 and .18, respectively), shop math with time score (r = .22), Chrysler arithmetic with achievement score (r = .20), tool knowledge with tolerance (r = .19), and object visualization with tolerance (r = .20). A correlation greater than or equal to .25 was required for statistical significance.

The sample validity coefficients were corrected for attenuation due to unreliability in the criterion, and the results are presented in Table 3. Estimates of reliability using the coefficient alpha formula for the three criterion measures were .84 for the Ohio Achievement Test, .80 for the composite score, and .64 for total performance score. Estimates of corrected correlations between the test battery and performance, achievement, and the composite criterion were .25, .38, and .37, respectively. Correlations between high school math and the three criterion measures were next highest, showing r = .24 for performance, r = .34 for achievement, and r = .34 for the composite score.
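The correction used to produce Table 3 divides each observed validity by the square root of the criterion reliability (attenuation in the criterion only); a minimal sketch:

```python
def disattenuate(r_xy, r_yy):
    """Correct a validity coefficient for unreliability in the
    criterion only: r_xy / sqrt(r_yy), where r_yy is the criterion
    reliability (coefficient alpha in the text)."""
    return r_xy / r_yy ** 0.5

# Battery validities with the reliabilities reported above:
print(round(disattenuate(0.20, 0.64), 2))  # 0.25, as in Table 3
print(round(disattenuate(0.33, 0.80), 2))  # 0.37, as in Table 3
print(round(disattenuate(0.34, 0.84), 2))  # 0.37; Table 3 shows .38,
# presumably computed from the unrounded sample correlation
```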
High school others correlated .29 with achievement and .26 with the composite score. All other predictor-criterion correlations remained below .25.

Table 3
Predictor Validities Corrected for Attenuation in the Criterion (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .15              .09
shop math                 .10               .14              .15
arithmetic                .15               .22              .22
tool knowledge            .13               .17              .18
object vis.               .13              -.05              .03
battery                   .25               .38              .37
G.E.D.                   -.10              -.08             -.10
H.S. math                 .24               .34              .34
H.S. science              .08               .20              .17
H.S. others               .13               .29              .26
P.H.S. training          -.24              -.11             -.20
Work experience          -.24              -.10             -.19
C.S.P.                   -.24              -.12             -.21

To investigate the possibility that the low predictor validities found in the first analysis could be due to restriction in range resulting from prior selection, the proper correction was made. Several problems were encountered in carrying out this correction, however. In the sample data drawn from Chrysler's record files, high school math, science, and other scores were all included under one score labeled "high school transcript," making the unrestricted standard deviations for the three separate scores unavailable. In addition, no scores were reported for related work experience in the random sample of unrestricted scores, so that no correction could be made for this variable.

Corrections for restriction in range were performed statistically on the remaining predictor-criterion correlations, and the resulting estimates of predictor validity are presented in Table 4. The battery showed the most notable change from restricted to unrestricted correlations. Corrected for restriction, correlations between the battery and performance, achievement, and composite criteria were .34, .54, and .53, respectively. The effect of the correction for restriction in range was much less marked for correlations involving the five other paper and pencil tests. For the correlation between the screening test and achievement score, the estimate of validity was raised from .14 to .19 after this correction.

Table 4
Predictor Validity Corrected for Restriction in Range (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .19              .11
shop math                 .11               .17              .17
arithmetic                .14               .24              .24
tool knowledge            .14               .22              .22
object vis.               .13              -.07              .04
battery                   .34               .54              .53
G.E.D.                   -.07              -.06             -.08
P.H.S. training          -.26              -.14             -.25
C.S.P.                   -.17              -.10             -.17

Similarly, correlations between tool knowledge and the achievement and composite criteria were increased from .16 to .22 with this correction. Validities for shop math, arithmetic and object visualization changed little, and remained low. Estimates of validity decreased for the correlations between GED and C.S.P. (corporate service points) and the three measures because the estimated unrestricted standard deviations were smaller than the restricted standard deviations. This caused the estimated unrestricted validity coefficients to be smaller than the restricted coefficients.11 A logical explanation for this result assumes that the large standard deviations obtained for the present sample must have been due to sampling error.

11 The formula for restriction is given in footnote 9.

Correlations between GED, post high school training and corporate service points (CSP) and the three criterion measures were unexpectedly negative, and there is no immediately obvious explanation for these results.
The correlations between GED and the performance, achievement, and composite criteria could conceivably be a function of sampling error; that is, supposing the true correlations between GED and the three criteria were all 0, sampling error could produce low negative correlations of -.07, -.06, and -.08 for performance, achievement, and composite criteria, respectively. For the other two variables, however, there appear to be other factor(s) acting in addition to sampling error to produce the negative validities. The results of efforts to discover these possible factors will be presented in subsequent analyses.

Predictor validities corrected first for criterion attenuation and additionally for restriction in range are given in Table 5. Estimates of validity for the test battery after both corrections were .42, .59, and .58 for the prediction of performance, achievement, and composite criteria, respectively. Correlations of high school math (corrected only for attenuation) with all three criterion measures show the next highest validity, equal to .24, .34, and .34 for performance, achievement, and composite criteria, respectively. The test battery and high school math were the two best single predictors of performance, achievement, and the composite criteria. While no other predictors showed a notable relationship to performance criteria after both corrections were made (ignoring for the moment the negative validities of the last three predictors), estimates of validity for the correlations of arithmetic and tool knowledge with achievement and the composite criterion were both greater than .22.

Table 5
Predictor Validities Corrected for Attenuation in the Criterion and Restriction in Range (N = 63)

Predictor          Total Performance   Ohio Achievement   Composite
screening                -.01               .21              .13
shop math                 .13               .19              .20
arithmetic                .18               .26              .26
tool knowledge            .18               .23              .24
object vis.               .17              -.07              .04
battery                   .42               .59              .58
G.E.D.                   -.09              -.07             -.09
H.S. math                 .24*              .34*             .34*
H.S. science              .08*              .20*             .17*
H.S. others               .13*              .29*             .26*
P.H.S. training          -.33              -.15             -.28
Work experience          -.24*             -.10*            -.19*
C.S.P.                   -.22              -.19             -.11

*Corrected for criterion attenuation only.

Differential Validity

The data presented in Table 6 speak to the hypothesis that differential validity in selection data does not occur more often than could be expected by chance. Out of 72 comparisons, six cases of differential validity (involving screening, tool knowledge, and the test battery) for majority and minority apprentices were found. The probability of obtaining six significant correlations out of 72 possible correlations is less than .10 (Brozek and Tiede, 1952). Having defined chance at the .05 level, these results are consistent with previous research findings.

A closer look at the separate validity coefficients obtained for majority and minority apprentices reveals some serious departures from results reported in the literature on test validation and differential validity. Although statistically the number of cases of significant differential validity found was not greater than could have occurred by chance alone, the difference in magnitude and sign (positive-negative) of validities for majority and minority apprentices for some predictors suggests the operation of some phenomenon in the present data which has a differential effect on majority and minority apprentices.
Indicative of the hypothesized differential influence of some factor in the present data is the difference in the correlations for minority apprentices between the shop math and arithmetic tests and the three criteria.

[Table 6, Evidence of Differential Validity for Majority and Minority Apprentices, is too badly garbled in this copy to reproduce. For each predictor it reported validity coefficients for majority (r_W) and minority (r_B) apprentices against the total performance, Ohio achievement, and composite criteria, together with the significance of each majority-minority difference (most marked N.S.).]

Theoretically, since all three predictors are measures of mathematical ability, it would be expected that all would correlate similarly with criterion measures. As Table 6 illustrates, however, shop math and arithmetic correlate negatively with criterion scores for minority apprentices, while minority high school math scores correlate highly in a positive direction with criterion scores. Using the achievement criterion as an example, the correlation with high school math is .71, while the correlations between achievement and the shop math and arithmetic tests are -.44 and -.25, respectively. This effect is seen only for minority apprentices, while validities for majority apprentices for the same three tests behave as predicted. In general, the first six paper and pencil tests correlate negatively with criterion measures for minority apprentices only, while minority high school transcript scores correlate highly in a positive direction with minority criterion scores.

It is given that, with the small sample sizes for minority and majority apprentices (16 and 47, respectively), a great deal of error will be reflected in sample estimates of validity. It would not appear, however, that sampling variability alone could account for the markedly different validities observed. Restriction in range due to prior selection is another factor known to be operating in the present data (Table 4), but its differential effects for majority and minority apprentices are less well known.

The usual effect of restriction in range due to prior selection in a given sample is the reduction of obtained post-selection validities to near 0. Formulas have been created to correct for this type of restriction (see footnote 9), and they are applicable to the case where selection is done on a single variable. In the case where selection is done on more than one variable and where selection ratios differ for subgroups, it is conceivable that the effect of selection on predictor validity will be different for each group. The question which remains to be answered is: what types of validity would be expected to result if selection restriction did in fact have a differential effect for minority and majority groups? If it could be shown that the resulting validities for a very highly selected group (i.e., a group with a very low selection ratio) could be spuriously negative, one possible explanation for the extreme validities found for minority apprentices will have been found. The following example shows how this could occur.
First, the following assumptions are made: (1) the selection ratio for the given group = .50 (for computational convenience), (2) the correlation in the applicant pool between a predictor X and the criterion Z (i.e., rho_xz) = .50, (3) rho_xy = 0, (4) rho_yz = 0, and (5) selection is based on the combined scores of predictor X and another predictor Y. The task is then to calculate the post-selection correlations r_xy, r_xz, and r_yz.

Results using the above assumptions showed obtained post-selection correlations of r_xy = -.47, r_xz = .43, and r_yz = -.20. When rho_xz is raised to .7, the obtained post-selection correlations are r_xy = -.47, r_xz = .63, and r_yz = -.29. Thus the effect of restriction in range was not to decrease r_yz toward 0, but to change it from rho_yz = 0 to r_yz = -.29.

In a third example, the assumption that rho_xy = 0 is changed to rho_xy = .30. Given rho_xz = .5 and rho_yz = .15, the post-selection correlations are r_xy = r_yz = -.20 and r_xz = .48.

Following from the preceding examples it was hypothesized that the negative correlations observed for minority apprentices between many of the predictors and criterion measures could be an artifact of prior selection. Since there was no way to directly compare r_xz's for restricted and unrestricted samples (because no criterion data were collected for unrestricted samples), another way to substantiate the hypothesis is to compare the correlations between predictors (i.e., r_xy, in the notation of the above example) in the present study with predictor intercorrelations obtained from an unselected sample. If there is a major change in sign and magnitude of correlations from selected to unselected samples, support will be given to the conclusion that selection operates in a similar manner in the data of the present study and the hypothetical examples.

Table 7 presents such a comparison of the correlations between high school transcript scores and six other selection predictors for selected and unselected samples. Results of the comparisons in Table 7 are consistent with the hypothesis. Unrestricted predictor intercorrelations for minority apprentices were positive, as in the third example above, yet correlations between predictors after selection were consistently negative. Following the line of reasoning of the hypothesized model, if selection affects the predictor-criterion correlations as predicted, spurious negative validities would be observed in the present data for minority apprentices. That is in fact exactly what occurred.

Since selection was based on the composite test score, both majority and minority groups are affected by restriction in range due to prior selection. The effects of selection were not as severe for the majority group, however, because their selection ratio (SR = .30) was higher. As can be seen in Tables 1 and 6, the effect for the majority group was not great enough to produce post-selection correlation coefficients that were negative. The results of the analysis showing the differential effects of selection for majority and minority groups cast suspicion on the applicability of the corrections for restriction in range for the total group shown in Tables 4 and 5.

[Table 7, Predictor Intercorrelations Before and After Selection, is too badly garbled in this copy to reproduce. For majority and minority apprentices separately, it compared the correlations of the high school composite (the sum of the unrestricted math, science, and others scores), math, science, and others scores with the other selection predictors before and after selection.]
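The worked example above is easy to verify by simulation: draw applicants from a population with the stated correlations, select the top half on X + Y, and recompute the correlations. A minimal Monte Carlo sketch (Python; the assumption of a multivariate normal applicant pool is mine):

```python
import numpy as np

def post_selection_corrs(rho_xz, rho_xy=0.0, rho_yz=0.0,
                         sr=0.50, n=1_000_000, seed=0):
    """Correlations among predictors X, Y and criterion Z after
    selecting the top `sr` proportion on the composite X + Y."""
    cov = np.array([[1.0, rho_xy, rho_xz],
                    [rho_xy, 1.0, rho_yz],
                    [rho_xz, rho_yz, 1.0]])
    rng = np.random.default_rng(seed)
    x, y, z = rng.multivariate_normal(np.zeros(3), cov, n).T
    keep = (x + y) >= np.quantile(x + y, 1.0 - sr)
    r = np.corrcoef(np.vstack([x[keep], y[keep], z[keep]]))
    return {"r_xy": r[0, 1], "r_xz": r[0, 2], "r_yz": r[1, 2]}

# First example: rho_xz = .50, rho_xy = rho_yz = 0, SR = .50.
# Yields approximately r_xy = -.47, r_xz = .43, r_yz = -.20,
# matching the values reported in the text.
print(post_selection_corrs(rho_xz=0.50))
```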
Test Bias

Significant majority-minority criterion differences. Due to the fact that the extent of the differential effect of prior selection on criterion mean scores for majority and minority apprentices is unknown, an analysis of mean criterion score differences based on post-selection data would be tenuous. A more meaningful and stable comparison of racial differences in criterion score may be obtained from the examination of the point-biserial correlation between criterion and race. Results of the comparison of all six criteria on levels of correlation with race (r_cy) are presented in Table 8.

Finish subscores of the performance test show the least correlation with race. Neither the correlation between tolerance subscores and race nor that between finish score and race is significant. For total time score (reflected),12 the correlation with race of -.36 was significant at the .01 level. Since total performance was the composite of tolerance, finish, and time scores, it also showed a small but significant correlation with race. This correlation (-.28) was smaller than both the correlation between achievement and race (-.42) and that between the composite criterion and race (-.43), both of which were significant at the .01 level.

12 For all correlations involving time scores, time was reflected so that the most desirable (i.e., positive) outcome was a low time score rather than a high one. This explains the negative correlation obtained for time score and race (where majority were scored 1, minority 2).

TABLE 8
Correlation Between Criterion Measures and Race

Criterion              r_cy
Tolerance             -.18
Finish                 .12
Time                  -.36**
Total Performance     -.28*
Ohio Achievement      -.42**
Composite             -.43**

*p < .05  **p < .01

Significant majority-minority differences in predictor scores. Racial differences in mean predictor score are shown in Table 9. As was noted for criterion mean differences, the predictor mean differences presented in Table 9 are markedly affected by prior selection. A better estimate of the influence of race on predictor scores is achieved through the analysis of the last column of this table, the correlations between predictor and race (r_cx).

Examination of predictor-race correlations showed six of the thirteen predictors to be significantly correlated with race. The screening test, shop math, arithmetic, and test battery showed significant negative relationships to race, indicating that selected majority apprentices scored higher than minority apprentices on these predictors. Post high school training and corporate service points were positively correlated with race, indicating that selected minority apprentices on the average scored higher than did the majority on these variables.

TABLE 9
Majority-Minority Mean Differences in Predictor Score

Predictor           Xbar_W     Xbar_B      D       D_SD     r
Screening           19.489     17.250     2.24     .66    -.26*
Shop Math           12.702     11.062     1.64     .67    -.25*
Arithmetic          23.787     19.312     4.48    1.12    -.35**
Tool Knowledge      31.681     32.875    -1.194    .25     .09
Object Vis.         28.894     27.375     1.519    .18    -.08
Battery            118.553     98.812    19.741    .81    -.33**
GED                  2.617      1.625      .992    .50    -.11
H.S. Math            9.830      9.500      .33     .05    -.02
H.S. Science         2.979      2.000      .979    .41    -.12
H.S. Others         11.128      9.688     1.44     .17    -.06
P.H.S. Training      1.447      6.813    -5.37     .04     .31*
C.S.P.               3.383      4.938    -1.56     .79     .33**

*p < .05  **p < .01

Note: Xbar_W and Xbar_B are means for the majority and minority groups, respectively. D and D_SD are the group differences in raw-score units and in minority-group standard deviation units, respectively; r is the point-biserial correlation between group membership and score.
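The quantities in Table 9 can be computed directly from the selected sample; a minimal sketch (Python), taking D_SD as the raw difference expressed in minority-group standard deviation units, per the table note:

```python
import numpy as np
from scipy import stats

def predictor_race_stats(score, minority):
    """Table 9 statistics for one predictor.  `minority` is a boolean
    array marking minority apprentices; majority is coded 1 and
    minority 2, as in the text, so higher majority scores produce
    negative point-biserial correlations."""
    score = np.asarray(score, dtype=float)
    minority = np.asarray(minority, dtype=bool)
    mean_w, mean_b = score[~minority].mean(), score[minority].mean()
    d = mean_w - mean_b
    d_sd = d / score[minority].std(ddof=1)   # minority-group SD units
    r, p = stats.pearsonr(np.where(minority, 2, 1), score)
    return {"mean_W": mean_w, "mean_B": mean_b,
            "D": d, "D_SD": d_sd, "r": r, "p": p}
```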
An estimate of majority-minority predictor score differences in the applicant pool (i.e., before selection) was calculated on an unrestricted random sample of 198 test scores, and the results are presented in Table 10. To assure comparability between the restricted and unrestricted r_cx's, a correction was made which took into account the different ratios of minority to majority apprentices in each sample (16/47 versus 98/99 for the restricted and unrestricted samples, respectively). While only six predictors showed significant correlations with race in the restricted sample, all but one predictor showed a large correlation with race in the equated unrestricted sample. The effect of restriction on the correlation between test and race was largest for the test battery and the screening test.

TABLE 10
Correlations between Predictors and Race for Restricted and Unrestricted Samples

                     Restricted    Equated Unrestricted
Predictor               r_cx               r_cx
Screen                 -.26*              -.13
Tool Knowledge          .09               -.38
Shop Math              -.25*              -.35
Arithmetic             -.35*              -.34
Object Vis.            -.08               -.31
Battery                -.33*              -.46
H.S. Transcript        -.07               -.23
P.H.S. Training         .31*               .31
GED                    -.11               -.20
CSP                     .33*               .28

*p < .05

Three definitions of test bias. The analyses of test bias using the Cleary, Thorndike, and Darlington (3) definitions are presented in Tables 11-13. Results were examined first for the three most valid predictors: the test battery, high school math, and high school others.

According to the Cleary definition, these three tests are all biased in favor of the minority in the prediction of achievement and composite criterion scores. That is, in all cases the majority regression line would overpredict minority criterion scores. As a predictor of performance, the test battery is a fair test for both majority and minority apprentices. High school math and others, however, are biased in favor of the minority in the prediction of performance scores.

[Tables 11, 12, and 13, Test Bias Statistics for the Performance, Achievement, and Composite Criteria, respectively (N = 63), are too badly garbled in this copy to reproduce. For each predictor, each table reported the Cleary statistic (r_yc.x), the Thorndike statistic (r_cy - r_cx), and the Darlington statistic (r_cx.y), each with its fairness decision: fair, +biased (biased in favor of the minority group), or -biased (biased against the minority group); *p < .05, **p < .01.]
According to Thorndike, "fair" means that the difference in SD units between majority and minority mean scores on the test is not significantly different from the difference in majority and minority mean scores on the criterion. Applying the Thorndike definition, the test battery is also considered a fair measure of performance for minority apprentices. In addition, this model finds the test battery to be a fair predictor of the achievement and composite criteria. High school math and science are also fair predictors of performance criterion scores, but are both biased in favor of the minority when used to predict achievement and the composite criterion. This direction of bias, in the Thorndike model, means that the differences between majority and minority on the test are actually smaller than the differences between majority and minority on the criterion. Since the correlations are not corrected for restriction in range, the smaller racial difference in mean score on the test is most probably an artifact of selection.

Darlington's definition three found the test battery, as a predictor of performance criteria, to be biased in favor of minority apprentices. That is, given equal criterion scores for majority and minority apprentices, the mean test score will be higher for the minority. High school math and others are fair predictors of the performance, achievement, and composite criteria, and the test battery is, by the Darlington definition, a fair predictor of achievement and composite criterion scores. Darlington's "fair" means that in a subset of apprentices with the same criterion score, the mean test scores for the majority and minority groups are the same.
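For reference, the three test statistics applied in Tables 11-13 can all be written in terms of the zero-order correlations among race (c), predictor (x), and criterion (y); a minimal sketch (Python):

```python
import numpy as np

def partial_r(r_ab, r_a_ctl, r_b_ctl):
    """First-order partial correlation between a and b, holding the
    control variable constant, from the three zero-order r's."""
    return (r_ab - r_a_ctl * r_b_ctl) / np.sqrt(
        (1.0 - r_a_ctl**2) * (1.0 - r_b_ctl**2))

def bias_statistics(r_cy, r_cx, r_xy):
    """Fairness statistics for one predictor.
    Cleary:     H0 is r_yc.x = 0 (no systematic over/underprediction)
    Thorndike:  H0 is r_cy - r_cx = 0 (equal standardized mean gaps)
    Darlington: H0 is r_cx.y = 0 (no race-test relation among
                apprentices with the same criterion score)"""
    return {"cleary":     partial_r(r_cy, r_cx, r_xy),
            "thorndike":  r_cy - r_cx,
            "darlington": partial_r(r_cx, r_cy, r_xy)}
```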
More generally, the analyses presented in Tables 11-13 demonstrate the hypothesized lack of agreement between the Cleary, Thorndike, and Darlington definitions of test bias. In the total of 39 cases presented, agreement between the three definitions occurred only 4 times; the predictors post high school training and corporate service points were found, by all three definitions, to be biased measures of achievement and composite criterion scores. No consensus was reached in any case, however, on the magnitude of test bias.

A Cleary-defined fair test. The strict interpretation of the Cleary definition of test bias requires the elimination of consistent non-zero errors of prediction (i.e., bias) by the use of separate regression equations for majority and minority apprentices. The determination of the separate regression equations for majority and minority apprentices in the present data, however, was not feasible due to (1) the small sample size of minority apprentices, and the great amount of random error which would consequently influence the prediction equations for that group, (2) the differential effect of selection on minority and majority apprentices, as was seen in the differences in subgroup validity obtained for the present selected sample of apprentices, and (3) the spurious negative correlations between predictors and criteria for the minority group, which would affect minority regression line slopes. For the above reasons, the separate regression equations for majority and minority apprentices are not presented.

Statistical corrections for the Thorndike and Darlington models. What would the test bias decisions have been had the data been unrestricted in range due to prior selection and had perfectly reliable criterion measures been used? It was stated earlier that corrections for restriction in range and criterion attenuation would be statistically performed on validity coefficients, and the Thorndike and Darlington models of test bias reapplied to answer this question. Several unforeseen theoretical issues arose, however, in the application of these corrections.

In the Thorndike model, the correction for restriction in range involves the correction of the correlations r_cy, r_cx, and r_xy13 (where c = race, x = predictor, and y = criterion). It has already been noted that no method exists by which to correct r_cy for restriction in range in the present data. Prior to beginning this investigation there was no known evidence in the literature indicating that r_cx and r_xy could not be corrected for restriction in range.

13 In addition to r_cy and r_cx, r_xy must be corrected for restriction in range because it appears in the denominator of Darlington's formula for test bias.

The usual correction of the correlation r_xy for restriction in range assumes that (1) selection is on the variable X, and (2) Y is homoscedastic on X (i.e., the variance of Y about the regression of Y on X is the same for all values of X). This is illustrated in Figure 5. The key to the formula is the fact that Y is still linear on X after selection and Y is homoscedastic on X after selection. The standard error of estimate and the slope of the regression line are exactly the same in both the restricted and unrestricted populations (Figure 5). Furthermore, all of the above conditions are true regardless of the selection ratio.
In this case, the unrestricted correlation is given by:

    rho_xy = v * r_xy / sqrt(v^2 * r_xy^2 + (1 - r_xy^2))

where v = unrestricted SD(x) / restricted SD(x).

[Figure 5. Effects of selection on a continuous distribution of test scores: the criterion is plotted against the test score, with the regression line E(Y|X), its error band, and the selection cut-off marked.]

The formula above will fail if (1) Y is not linear or not homoscedastic on X, or (2) X is not the variable on which selection was made. For example, suppose there are two predictors X and Y and the criterion is Z. If people are selected on X, then the unrestricted validity rho_xz is related to the restricted validity r_xz by the formula above, i.e.,

    rho_xz = v * r_xz / sqrt(v^2 * r_xz^2 + (1 - r_xz^2))

where v = unrestricted SD(x) / restricted SD(x). But the other unrestricted validity coefficient rho_yz is given by a completely different formula:

    rho_yz = (r_yz + u * r_xy * r_xz) / (sqrt(1 + u * r_xy^2) * sqrt(1 + u * r_xz^2))

where u = v^2 - 1.

Thus there is no one formula that is suitable in all contexts for correction for restriction in range. Actually, neither of the formulas above is suitable for the present study. In the present study, the selection was not done on any one of the predictors, but was done on a composite of all of them. Development of formulas for this case is in progress (Hunter, Schmidt, and Seaton, in preparation), although they would be of little use in the present study in any case: the formulas rely on the multiple regression of the criterion onto all of the predictors. In the present case this would require the accurate estimation of a 9-variable multiple regression equation for each criterion, given only 63 observations. This is an extremely unsatisfactory ratio of observations to variables.

In the case of checking for test bias there is a further complication: not all the variables are continuous. In the case where one of the measured variables is dichotomous, as in the point-biserial correlation between race and test score, the effect of selection upon test score distributions for different subgroups will vary with the selection ratio used for each subgroup. Figure 6 presents the situation occurring in the present research, where the selection ratio for the majority = .30 and the selection ratio for the minority = .08.

[Figure 6. Difference in the restricted test distributions of two groups as a function of different selection ratios: minority (SR = .08) and majority (SR = .30) test score distributions shown with a common selection cut-off.]

Using a common cut-off point, it may be seen that test scores of selected minority applicants will tend to be concentrated much more closely to the cut-off point than will scores for the selected majority. This in turn should affect the differences in means and standard deviations between restricted and unrestricted samples differentially for minority and majority, and will thus affect the point-biserial correlations r_xc and r_yc. The formulas for such changes in point-biserial correlations are grossly different from those above for continuous variables and yield very different results. In the example chosen to resemble the present data, the point-biserial correlations changed much less than the continuous formulas would predict.
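The two corrections for continuous variables given above are straightforward to state in code; a minimal sketch (Python), using the notation of the text (v = unrestricted over restricted SD of the selection variable, u = v^2 - 1):

```python
import numpy as np

def case2_correction(r_xy, v):
    """Unrestricted correlation when selection was made directly
    on x; v = unrestricted SD(x) / restricted SD(x)."""
    return v * r_xy / np.sqrt(v**2 * r_xy**2 + (1.0 - r_xy**2))

def third_variable_correction(r_yz, r_xy, r_xz, v):
    """Unrestricted correlation of y with z when selection was made
    on a third variable x, from the three restricted correlations."""
    u = v**2 - 1.0
    return (r_yz + u * r_xy * r_xz) / (
        np.sqrt(1.0 + u * r_xy**2) * np.sqrt(1.0 + u * r_xz**2))
```

As the text emphasizes, neither function applies when selection is made on a composite of all the predictors, or when one of the variables is dichotomous, which is precisely the situation in this study.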
Tables 14 and 15 present the test score mean and standard deviation differences, respectively, between restricted and unrestricted samples for majority and minority apprentices.

TABLE 14
Majority-Minority Mean Test Scores for Restricted and Unrestricted Samples

                        Black                      White
Predictor        Xbar_RES  Xbar_UN    D     Xbar_RES  Xbar_UN    D
Screening          17.25    12.18    5.07     19.49    13.78    5.71
Tool Knowledge     32.87    26.06    6.81     31.68    32.54    -.86
Shop Math          11.06     9.50    1.56     12.70    12.37     .33
Arithmetic         19.31    13.45    5.86     23.79    18.69    5.10
Object Vis.        27.37    17.39    9.98     28.89    24.11    4.78
Battery            98.81    55.37   43.44    118.55    99.43   19.12
H.S. Transcript    21.19    17.54    3.65     23.94    22.68    1.26
P.H.S. Training     6.81    17.00  -10.19      1.45     9.33   -7.89
GED                 1.62     5.71   -4.09      2.62     7.11   -3.10
C.S.P.              4.94     4.18     .75      3.38     3.12     .25

TABLE 15
Changes in Standard Deviations from Restricted to Unrestricted Samples of Majority and Minority Apprentices

                        Minority                   Majority
Predictor         SD_UN   SD_RES   dSD_B     SD_UN   SD_RES   dSD_W
Screening          5.31    3.38     1.93      5.12    3.60     1.52
Shop Math          3.61    2.46     1.15      3.59    2.94      .65
Arithmetic         6.13    4.00     2.13      5.11    5.64     -.47
Tool Knowledge     8.46    4.78     3.68      6.49    6.29      .20
Object Vis.       10.17    8.47     1.70      9.71    7.41     1.97
Battery           38.82   24.35    14.47     42.09   24.12    17.97
GED                3.15    1.96     1.19      3.35    4.37    -1.02
H.S. Transcript    9.91   15.83    -5.92     11.94   15.49    -3.55
P.H.S. Training   14.61   13.71      .90      7.66    1.71     5.95
C.S.P.             1.64    1.93     -.34      1.95    1.94      .01

These results indicate that while restriction in range due to prior selection affects both majority and minority apprentices, the consequences of selection are more severe for minority apprentices. Because the two subgroups are affected differently by selection, the traditional correction of the correlation between predictor and race for restriction in range is not appropriate.

Because selection has been demonstrated to have such a differential effect on majority and minority apprentice scores, suspicion is cast also on the applicability of the correction of the validity coefficient for the total sample (r_xy for the total group) for restriction in range. The interaction of restriction in range due to prior selection and the differential effects of the selection ratios employed must have an influence on corrected validities for the total group. The exact extent of this influence is unknown at present.

In summary, the question of what the test bias decisions using all three bias models would have been had the data not been restricted due to prior selection cannot be answered, because of the inapplicability of the traditional correction for restriction in range to the correction of the correlations between predictor and race and between predictor and criterion. Because unreliability in the criterion measure interacts with and is affected by restriction in range, the question of what the test bias decisions would have been had the criterion been free of unreliability also cannot be answered at this time.14

14 Many of the analyses planned in the introductory sections of the present research could not be performed because of problems arising from restriction in range. After the present study had been finalized, however, it was noted that these analyses could be performed when the composite test score is used to predict the criteria. Results of these analyses are available in a separate document from the author of the present research, or from Dr. John E. Hunter, at Michigan State University.

DISCUSSION

Validity of Selection Tests

Test validity was assessed in accordance with EEOC guideline regulations.
The results of the validity analysis using a statistically significant correlation between predictor and criteria as evidence of "demonstrated validity" gave a very misleading impression of the extent to which the selection predictors measure the performance and achievement criteria. Statistical significance varies as a function of sample size, the chosen alpha level, and, in the present study, restriction in range. Results of the validation of the test battery serve to illustrate the inapplicability of statistical tests under these circumstances. The estimate of the correlation in the present sample between the test battery and the performance test was .20. According to the test at the .05 level, this validity is not significant. Corrected for restriction in range and criterion attenuation (thus approaching a more accurate estimate of the population correlation), however, this correlation was .42. This validity is certainly "significant" by any practical definition of significance, but according to the first statistical test,15 the test battery is not a valid measure of performance.

15 Tests of statistical significance should not be applied to corrected data.

Although the error of failing to reject an invalid test is minimized by adherence to an alpha level of .05, the probability of finding a test to be "invalid" in a sample of 63 when it is in fact valid in the population is very high. In the selection situation it is the latter (Type II) error which holds the most serious consequences for the employer. Statistical significance testing at a given alpha level is thus not the appropriate measure of test validity, especially with small samples.

The interest of the present research was to estimate the validity of the selection predictors in the population, rather than specifically limiting validity to the present sample. While statistical tests of the significance of validity were performed to satisfy legal guidelines, a validity coefficient which was not "significant" was not necessarily "not valid."

The test battery was one predictor which showed a high degree of relationship to the criterion measures. Correlations obtained in the present research between the test battery and performance, achievement, and the composite criterion were .20, .34, and .33, respectively.
High school others correlated .13, .29, and .26 respectively, with performance, achieve- ment, and the composite criterion. Evidence suggests that these correlations would probably not have been much 92 higher if corrected for restriction in range. The restricted variance of the high school composite differed little from the unrestricted variance, and this relation- ship is probably true of the individual scores as well. ' Arithmetic and tool knowledge tests appeared to be adequate predictors of achievement and composite criteria when corrections were made for attenuation and restriction in range. In general, the procedures Chrysler used in employee selection appear to tap better the abilities measured by~ the Ohio achievement test than those measured by the performance measurement procedure. Perhaps this is because the Ohio test is largely a test of cognitive abilities, while the performance test has a sizable psychomotor compo- nent. It should be noted that neither the Ohio test nor the performance measurement procedure is a complete measure of total production on the job. Rather, they are measures of specific aptitudes and skills necessary for successful production on the job. A more global measure of production would need to include measures of absenteeism, attitude toward the job, and other motivational variables not tapped by these tests. It was not the purpose of the present study to try to choose a criterion which was a complete measure total success on the job. As measure of specific aspects of job success, the criterion measures 93 used in the present study were probably much better than the typically used supervisor rating criterion, which are so heavily biased by the worker's personality. Results of the validation of the selection procedures currently used to select applicants for the apprentice program must be carefully interpreted within the context of the present study. Absolute magnitude of validity was expected to be low because of the effects of restriction in range due to prior selection on predictor validity. While corrections for this effect were made on the corre- lations between predictors and criteria, these corrections provide only estimates of the validities which would have been obtained had the sample used not been selected on the basis of test scores. Keeping in mind the general limita- tions of the present study, it may be generally concluded that for the total sample of majority and minority appren- tices, the test battery, high school math, high school others, arithmetic, and tool knowledge tests were the most valid predictors of total group criterion scores. Differential Validity Results of the present study were consistent with the body of evidence in the literature suggesting that signi- ficant subgroup differences in validity coefficients do not exist for majority and minority subgroups more often than could be expected by chance. Support for this 94 conclusion was weak, however, due to the existence of many cases in the present study of large but non-significant differences in the magnitude and sign (positive-negative). between predictor validities for majority and minority (apprentices. The high negative correlations for minority apprentices between several predictors and the criterion measures warranted on analysis of subgroup differences in validity beyond the usual statistical search for differen- tial validity. 
An example was given in which it was shown that selection made on the basis of the composite score of two predictors could lead to a post-selection correlation between one predictor and the criterion which was spuriously negative. The hypothesized model was, of necessity, oversimplified. The selection ratio assumed (SR = .5) was not comparable to either the selection ratio for majority (SR = .3) or minority (SR = .08) apprentices in the present data, but was chosen for computational convenience. To the author's knowledge no formula exists by which to incorporate the different selection ratios for minority and majority apprentices into the hypothesized model. In addition, the model assumes selection on two predictors, while selection in the present study was based on a composite of thirteen items. Furthermore, the effect of selection on predictor validity in the situation where both predictors are correlated with the criterion in the unselected sample was not considered. The amount of work which would be required to expand the simple model to include more complex variables and situations is much beyond the scope of the present research. Therefore, it is only suggested that the hypothesized model could account for the negative correlations between predictors and criteria for minority apprentices observed in the present data.

Test Bias

Significant differences between majority and minority mean criterion scores were found for four of the six criterion measures. Total performance, achievement, and the composite scores all showed a significant relationship to race. The direction of the correlations indicated that majority apprentices scored higher, on the average, than did minority apprentices. These differences in mean criterion scores are consistent with past research findings, although the differences found in the present research were somewhat larger than average. Schmidt et al. (1975) found an average difference of .5 standard deviations to occur in most employment research. In the present study the effect of selection on criterion scores is unmeasured, but it almost certainly affects the results and increases the degree of caution with which they should be interpreted.

Subgroup differences in mean predictor score were found, as predicted. In the present selected sample, correlations between predictor and race were significant for six of the thirteen predictors. It was known a priori that restriction in range due to prior selection would affect mean differences in predictor score for majority and minority apprentices. Correlations between race and predictor were therefore computed on data from a sample of unselected apprentice applicants and the results compared. More consistent with the results of previous studies, these estimates of the relationship between predictor and race showed that for all of the thirteen predictors, differences in majority and minority test scores were greater than 0.

All predictor-race correlations were negative (indicating that majority apprentices scored higher on the average than minority apprentices) except the correlations between post high school training and race, and corporate service points and race. The lowest correlation was -.13, between the screening test and race, while the highest correlation was -.35, between arithmetic and race.

Results of the application of the Cleary, Thorndike and Darlington definitions of test bias to the apprentice program selection procedures were much in line with the results of previous studies and theoretical prediction.
The Cleary definition labeled all predictors of the achievement and composite criteria as biased in favor of minority apprentices. The same type of bias in favor of the minority, or overprediction, was found by Cleary (1968), Davis and Temp (1971), and Temp (1971) in educational studies and by Bray and Moses (1972) and Guinn, Tupes, and Alley (1970) in employment testing studies. When performance was taken as the criterion, only three out of thirteen predictors were fair by the Cleary definition.

The test bias analysis using the Thorndike definition was, as predicted, very different from the analysis which used the Cleary definition. Only three tests were biased predictors of performance by the Thorndike definition of test bias, while for the Cleary model all but three predictors of performance were biased. (The three predictors were not the same ones in both cases.) A similar lack of consensus between the two models was found for predictors of the achievement and composite criteria. The direction of bias by the Thorndike definition was also in favor of the minority group. The Thorndike model concerns itself most with situations where differences in subgroup scores are larger on the test than on the criteria, the case found most often in the literature on employment and educational testing. A possible explanation for the present study's findings of bias in favor of the minority is that the effect of prior selection on subgroup differences on the test was greater than the effect on subgroup differences on the criterion.

Suspicion was cast on the applicability of the correction of predictor validities for the total sample when selection effects were shown to be different for majority and minority apprentices. For this reason, the Thorndike and Darlington analyses which called for the use of unrestricted estimates of validity were deemed inappropriate.

Schmidt et al. (1975) predicted that the analysis of test bias using the Darlington definition would find predictors to be almost invariably biased against minorities. This was not true in the present study. Instead, tests were most often found to be fair by the Darlington definition. When test bias was found to occur by this definition, it was most often bias against the minority group. Post high school training and corporate service points, however, were biased by the Darlington definition in favor of the minority group.

The lack of consensus among the three definitions of test bias indicated by theoretical comparisons of the models was also found to hold in comparisons of the three definitions based on actual data. In 35 out of 39 cases, a test which was biased as a predictor of a given criterion by one definition was not defined as biased by both of the other two definitions. The four cases of agreement involved post high school training and corporate service points. All definitions agreed that these predictors were biased in favor of the minority group when used to predict achievement and the composite criterion.

Conclusions

Two questions which the present research sought to answer were: (1) are the selection procedures currently used to select applicants for the apprentice training program at Chrysler Automotive Corporation valid predictors of performance and achievement criteria? and (2) are these predictors fair to both majority and minority apprentices?
The answers to both questions cannot be directly generalized beyond the context of the present study because of the small sample size upon which the results are based and the differential effect of selection on majority and minority apprentices within that sample. The general findings of the present study were, however, consistent with prior research findings. The test battery, the major selection tool in the Chrysler procedure, proved to be a very valid measure of the performance and achievement criteria used in the present research. High school math and high school others were two other predictors which, along with the test battery, were significantly correlated with criterion measures. Arithmetic and tool knowledge tests also showed adequate correlations with achievement and composite criterion scores. The other eight predictors did not show a high degree of relationship to criterion measures.

The most important implication of the results of this validation study concerns the continued use of these selection tests. Only three predictors in the selection procedure satisfy legal (i.e., EEOC) requirements for the demonstration of test validity, but a total of five of the thirteen predictors showed correlations (corrected for attenuation and restriction in range) equal to or above correlations usually reported in the employment testing literature. These five predictors account for a possible 300 out of 445 total points on the selection procedure. From a legal point of view, it is not the validity of the individual tests which is of most concern, but the validity of the total selection procedure. In the present study it was not technically feasible to investigate the validity of the total score on the Chrysler selection procedure, but the test battery composite score gives a good indication of what that validity might be. The battery was a statistically and practically valid measure of the performance and achievement criteria. Corrected for attenuation and restriction in range, correlations between the test battery and performance, achievement, and the composite criterion were .42, .59, and .58, respectively. For decision-making purposes, it is the characteristics of the test battery, rather than the individual tests, which are of greatest practical and legal importance.

Results supporting the continued use of the five most valid predictors in the Chrysler selection battery may only be considered suggestive. Evidence of potential adverse impact of the selection battery on minority apprentices was seen to exist in the markedly different proportions of majority and minority applicants selected with this procedure. The results of the present study would not satisfy legal requirements for the test validity of ten of the thirteen selection predictors in the face of charges of adverse impact. This is not to say, however, that the relationships between selection predictors and performance and achievement criteria have no practical meaning.

Several alternatives are available to the employer in this situation. Another validation study could be proposed, using a larger sample size, to substantiate the validity of the selection procedure. Total score on the selection procedure should be included as a predictor, since selection is based on the composite of predictor scores rather than on separate cut-off scores on each predictor. A predictive validity design would be especially useful in avoiding the problems of prior selection encountered in the present research.
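Before turning to a second alternative, it may help to make the corrections quoted above explicit. The following is a minimal computational sketch of the classical corrections for restriction in range (Thorndike's Case II) and for criterion unreliability; the numerical inputs are hypothetical illustrations, not the correlations observed in this study.

```python
import math

def correct_for_range_restriction(r: float, u: float) -> float:
    """Classical (Thorndike Case II) correction for direct range restriction.
    r is the validity observed in the selected sample; u is the ratio of the
    unrestricted to the restricted predictor standard deviation."""
    return r * u / math.sqrt(1 + r**2 * (u**2 - 1))

def correct_for_attenuation(r: float, r_yy: float) -> float:
    """Correction for unreliability in the criterion only."""
    return r / math.sqrt(r_yy)

# Hypothetical values for illustration only.
r_observed = 0.30  # validity in the restricted (selected) sample
u = 1.5            # assumed applicant-to-incumbent predictor SD ratio
r_yy = 0.80        # assumed criterion reliability

r_corrected = correct_for_attenuation(
    correct_for_range_restriction(r_observed, u), r_yy)
print(round(r_corrected, 2))  # about .48 under these assumptions
```

The order shown, range restriction first and then attenuation, is one common convention; the sketch is meant only to make the operations concrete.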
As a second alternative, the present strategy of selection on the basis of a composite score of thirteen predictors could be abandoned, and attempts made to find a more valid strategy. Specifically, the present selection battery could be modified by using only those selection predictors which are valid measures of criterion performance and eliminating the unnecessary and invalid predictors. The present study would suggest that the test battery and high school transcript scores be included in the new selection strategy. Careful attention should be given to the development of a more valid screening measure.

The question of test fairness also has an implication for the continued use of selection tests. Judgments of fairness will depend on the definition used. The Cleary definition found most selection tests to be biased in favor of the minority group. If the interest of the employer is to assure that tests do not discriminate unfairly between races, and not necessarily to assure the most accurate prediction possible for individuals of both groups, then the continued use of selection tests will not adversely affect the hiring opportunities of minorities. If the main concern is the maximization of prediction accuracy through the selection of individuals who will have the highest probability of success, however, the Cleary model requires the use of separate prediction equations where differential validity was found for minority and majority applicants, or a multiple regression equation otherwise. With the latter strategy, far fewer than 8 percent of minority applicants would be selected with the use of a "fair" test. Use of the Thorndike definition to set predictor score cut-offs for majority and minority applicants would result in a higher selection ratio for the minority group than the use of the Cleary model. Use of the Darlington definition would result in the selection of a still higher proportion of minority applicants. As the proportion of minorities selected increases, however, prediction accuracy decreases. The final decision of which model of test bias is "best" rests on the relative value given to the cost of reduced validity in comparison to the undesirability of a very low rate of minority selection.

The present study sought to take into account the effects of restriction in range in the data due to prior selection, but did not anticipate the differential effects of selection on the data for majority and minority apprentices. A major implication for further research is that future studies of concurrent validity can no longer assume that selection effects will be the same for majority and minority apprentices, especially in the common situation where the proportions of applicants selected from each subgroup are different. (This is always the case when one cut-off is used and differences in mean test scores exist.) Investigations of concurrent validity should also include a careful consideration of the possibility that different subgroup selection ratios will have a differential effect on the validities and mean predictor and criterion scores for each subgroup. Use of a predictive rather than a concurrent validity strategy will lessen this problem and provide a better foundation for test bias analyses. Although restriction in range can also occur in predictive validation, the degree of restriction will often be less in the predictive strategy.

LIST OF REFERENCES

Arvey, R. D. Some comments on culture fair tests.
Personnel Psychology, 1972, 25, 433-448.

Baehr, M.; Sanders, D.; Froemel, E.; and Furcon, J. The prediction of performance for black and for white police patrolmen. Professional Psychology, 1971, 2, 46-57.

Boehm, V. R. Negro-white differences in validity of employment and training selection procedures: Summary of research evidence. Journal of Applied Psychology, 1972, 56, 33-39.

Bray, D. W. and Moses, J. L. Personnel selection. Annual Review of Psychology, 1972, 23, 545-576.

Brozek, J. and Tiede, K. Reliable and questionable significance in a series of statistical tests. Psychological Bulletin, 1952, 49, 339-341.

Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.

Cole, N. S. Bias in selection. ACT Research Report No. 51. Iowa City, Iowa: American College Testing Program, 1972.

Darlington, R. B. Another look at "culture fairness." Journal of Educational Measurement, 1971, 8, 71-82.

Davis, J. A. and Temp, G. Is the SAT biased against black students? College Board Review, Fall 1971, No. 81, 4-9.

Einhorn, H. J. and Bass, A. R. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 1971, 75, 261-269.

Enneis, W. H. The EEOC guidelines on testing. In The Law and Personnel Testing. Pittsburgh: University of Pittsburgh Press, 1971, 9-14.

Equal Employment Opportunity Commission. Guidelines on employee selection procedures. Federal Register, August 1970, 35, 12333-12336.

Equal Employment Opportunity Coordinating Council. Uniform guidelines on employee selection. Staff Committee Draft (mimeograph), June 24, 1974.

Gael, S. and Grant, D. Employment test validation for minority and non-minority telephone company service representatives. Journal of Applied Psychology, 1972, 56, 135-139.

Guinn, N.; Tupes, E. C.; and Alley, W. E. Cultural subgroup differences in the relationships between Air Force aptitude composites and training criteria. Technical Report 70-35. Lackland Air Force Base, Texas: Human Resources Research Center, 1970.

Hunter, J. E.; Schmidt, F. L.; and Seaton, F. W. Problems in doing validity studies on selected populations. In preparation.

Hunter, J. E. and Schmidt, F. L. A critical analysis of the statistical and ethical implications of various definitions of test bias. Paper presented at the Midwest Society of Multivariate Experimental Psychology, Chicago, Illinois, May 2, 1975.

Hunter, J. E.; Schmidt, F. L.; and Rauschenberger, J. M. Fairness of psychological tests: Implications of three definitions of selection utility and minority hiring. Michigan State University, mimeograph copy, 1975.

Linn, R. L. Fair test use in selection. Review of Educational Research, 1973, 43, 139-161.

Linn, R. L. and Werts, C. E. Considerations for studies of test bias. Journal of Educational Measurement, 1971, 8, 1-4.

Oriel, A. E. A performance based individualized training system for technical and apprentice training: A pilot study. Final report to Manpower Administration, U.S. Department of Labor, Contract No. 82-17-71-48, 1974.

Potthoff, R. F. Statistical aspects of the problem of biases in psychological tests. Institute of Statistics Mimeograph Series No. 479. Chapel Hill: University of North Carolina, 1966.

Ruch, W. W. Statistical, legal and moral problems in following the EEOC guidelines. Presented at the Symposium on Differential Validity and EEOC Testing Guidelines, WPA, Oregon, April 28, 1972.

Schmidt, F. L.
Comments on EEOC June 1974 draft uniform guidelines on employee selection procedures. Mimeograph copy, August 1974.

Schmidt, F. L.; Berner, J.; and Hunter, J. E. Racial differences in validity of employment tests: Reality or illusion? Journal of Applied Psychology, 1973, 58, 5-9.

Schmidt, F.; Greenthal, A.; Berner, J.; Hunter, J.; and Williams, F. A performance measurement feasibility study: Implications for manpower policy. Final report to Manpower Administration, U.S. Department of Labor, Contract No. 82-17-71-48, 1974.

Schmidt, F. and Hunter, J. Racial and ethnic bias in psychological tests: Divergent implications of two definitions of test bias. American Psychologist, 1974, 29, 1-8.

Temp, G. Test bias: Validity of the SAT for blacks and whites in thirteen integrated institutions. Journal of Educational Measurement, 1971, 8, 245-251.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.

Wallace, P.; Kissinger, B.; and Reynolds, B. Testing of minority group applicants for employment. Personnel Testing and Equal Employment Opportunity. Washington, D.C.: U.S. Government Printing Office, 1970, 1-11.

APPENDIX