This is to certify that the thesis entitled

VIDEO-BASED VERSUS PAPER-AND-PENCIL METHOD OF ASSESSMENT IN SITUATIONAL JUDGEMENT TESTS: SUBGROUP DIFFERENCES IN PERFORMANCE AND EXAMINEE REACTIONS

presented by DAVID CHAN has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor: Neal Schmitt
Date: March 20, 1996

VIDEO-BASED VERSUS PAPER-AND-PENCIL METHOD OF ASSESSMENT IN SITUATIONAL JUDGEMENT TESTS: SUBGROUP DIFFERENCES IN PERFORMANCE AND EXAMINEE REACTIONS

By

David Chan

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1996

ABSTRACT

VIDEO-BASED VERSUS PAPER-AND-PENCIL METHOD OF ASSESSMENT IN SITUATIONAL JUDGEMENT TESTS: SUBGROUP DIFFERENCES IN PERFORMANCE AND EXAMINEE REACTIONS

By David Chan

Based on a conceptual distinction between test content and method of testing, the present study examined several theoretically and practically important effects relating race, reading comprehension, method of assessment, face validity perceptions, and performance on a situational judgement test, using a sample of 241 psychology undergraduates (113 Blacks; 128 Whites). Results showed that the Black-White differences in situational judgement test performance and in face validity reactions to the test were substantially smaller in the video-based method of testing than in the paper-and-pencil method. The Race X Method interaction effect on test performance was attributable to differences in reading comprehension and face validity reactions associated with race and method of testing. Implications of the findings are discussed in the context of research on adverse impact and examinee test reactions.

Dedicated to Scpum Bin Kasmm and Kong Kin Seng

ACKNOWLEDGEMENTS

This thesis could not have been successfully completed without the encouragement and support of a number of people. The most important individual is Neal Schmitt, the chair of my thesis committee. Neal has given me valuable guidance and support throughout the research process. He impresses me not just because of his professional expertise, but also because of his enormous efforts and great patience in the development of graduate students. The experience of working with Neal on the thesis and other research projects has made a great impact on my professional life. Neal has personified my ideal professor. Dan Ilgen and Rick DeShon, the other two members of my thesis committee, have made very significant contributions to my graduate training.
My first encounter with Dan was actually "on paper": he authored the I/O psychology textbook I used in my undergraduate days back in Singapore. In fact, Neal and Dan were the primary reasons for my choice of graduate school. In retrospect, travelling thousands of miles across the ocean to Michigan State was well worth it. Dan's "motivational" seminar has had lasting motivational effects on me. Rick was the first professor I worked with at Michigan State. His zeal for his work has always impressed me. From Neal and Rick, I have come to appreciate the meaning of "research interests". Two other individuals, Kevin Ford and Steve Kozlowski, have indirectly contributed to the present piece of work. Both were instrumental in developing my fundamental expertise in I/O psychology, without which I could not have completed the thesis so smoothly. Finally, I must thank two groups of people who have provided me valuable social support. To fellow I/O graduate students at Michigan State, I am grateful for making me comfortable in a foreign land. To my dear friends in Singapore, I am grateful for their concern and constant updates on things back home.

David Chan

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
    Overview
    Two Conflicting Goals in Personnel Selection
    Subgroup Differences on Selection Tests
    Attempts at Reducing Adverse Impact
    Work Sample Tests: Validity and Adverse Impact
    The Logic and Problems in Development and Use of Work Samples
    Assessment Centers
    The Goldstein et al. (1993) Study
    The Present Study
    The Logic of Situational Judgement Tests
    Situational Judgement Tests: Simulation Fidelity and Predictive Validity
    Situational Judgement Tests and Adverse Impact
    Video-Based Situational Judgement Tests
        Hypothesis 1
        Hypothesis 2
        Hypothesis 3
    Examinee Test Reactions
    Face Validity and Predictive Validity Perceptions
        Hypotheses 4
    Face Validity and Method of Testing
        Hypothesis 5
    Subgroup Membership and Face Validity
        Hypothesis 6
    Factor Invariance across Methods of Testing
METHOD
    Examinees
    Development of Situational Judgement Test
    Measures of Examinee Test Reactions
    Reading Comprehension, Cognitive Ability, and Personality Tests
    Design
    Procedure
    Analyses
RESULTS
    Relationships between Race, Reading Comprehension, Method of Assessment, and Performance on Situational Judgement Test
    Factorial Invariance across Method Groups
    Effects of Method of Assessment on Differential Subgroup Performance on Individual Situational Judgement Constructs
    Face Validity and Predictive Validity Perceptions
    Face Validity and Method of Testing
    Subgroup Membership and Face Validity
    Relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance
DISCUSSION
    Method-Content Distinction
    Test Reactions
    Factorial Invariance of Test Responses across Assessment Methods
    Measurement Errors and Effect Size Estimates
    Limitations and Future Research
    Conclusion
REFERENCES
Appendix A. A Priori Power Analyses
Appendix B. Example of a Paper-and-Pencil Vignette
Appendix C. Test Reactions Questionnaire
Appendix D. Means, Standard Deviations, Reliabilities, and Intercorrelations of Study Variables broken down by Race
Appendix E. Covariance Matrices of Indicators for Situational Judgement Factors
Appendix F. Covariance Matrices of Indicators for Situational Judgement Factors and Personality Factors

LIST OF TABLES

Table 1 - Means, Standard Deviations, Reliabilities, and Intercorrelations of Study Variables
Table 2 - Summary of Hierarchical Regressions of Situational Judgement Test Performance on Race, Reading Comprehension, and Method of Assessment (N = 241)
Table 3 - Means, Standard Deviations, Reliabilities, and Intercorrelations (Observed and Corrected) of Situational Judgement Scales broken down by Method Groups
Table 4 - Fit Indices Associated with Multiple-Group Confirmatory Factor Analytic Models Tested in Assessment of Measurement Invariance of Situational Judgement Scores across Paper-and-Pencil Method Group (N = 121) and Video-Based Method Group (N = 120)
Table 5 - Means, Standard Deviations, and Intercorrelations between Situational Judgement Indicators and Personality Indicators broken down by Method Groups
Table 6 - Situational Judgement Factors: Subgroup Means, Standard Deviations, and Associated Effect Sizes for Paper-and-Pencil Method and Video-Based Method of Assessment
Table 7 - Hierarchical Regressions for Face Validity Perceptions and Situational Judgement Test Performance (N = 241)
Table 8 - Means, Standard Deviations, Reliabilities, and Intercorrelations of Study Variables broken down by Race

LIST OF FIGURES

Figure 1 - Hypothesis 1: Predicted Race X Method Interaction on Situational Judgement Test Performance
Figure 2 - Hypothesis 2: Predicted Method X Reading Comprehension Interaction on Situational Judgement Test Performance
Figure 3 - Hypothesis 3: Method X Reading Comprehension Interaction as an Explanation for Race X Method Interaction on Situational Judgement Test Performance
Figure 4 - Hypothesis 6: Predicted Race X Method Interaction on Face Validity Perceptions of Situational Judgement Test
Figure 5 - Race X Method Interaction on Situational Judgement Test Performance
Figure 6 - Method X Reading Comprehension Interaction on Situational Judgement Test Performance
Figure 7 - Race X Method Interaction on Situational Judgement Test Performance after Controlling for Effect of Method X Reading Comprehension Interaction
Figure 8 - Confirmatory Factor Analytic Model with Associated Common Metric Standardized Factor Loadings and Factor Correlations for Both Method Groups (* p < .05)
Figure 9 - Race X Method Interaction on Face Validity Perceptions of Situational Judgement Test
Figure 10 - Race X Method Interaction on Situational Judgement Test Performance after Controlling for Effects of Method X Reading Comprehension Interaction and Face Validity Perceptions
Figure 11 - Relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance

INTRODUCTION

Overview

The present study examines the effects of a video-based versus a paper-and-pencil method of assessment on adverse impact and examinee reactions in a situational judgement test. The dependent variables of interest are test performance and examinee test reactions. Making the important distinction between test method and test content (Hunter & Hunter, 1984), test content is held constant across the two methods of testing so as to isolate subgroup differences on the dependent variables due solely to test method. The research problem leading to the present study will first be identified, and the theoretical issues and practical concerns in personnel selection constituting the research problem will be explicated. Two conflicting goals in personnel selection will first be noted. The issue of adverse impact is then discussed, and attempts to reduce adverse impact in selection are reviewed from the research on work samples and assessment centers. This review leads to the focal selection procedure in the present study, namely the situational judgement test, which is becoming increasingly popular in the research and practice of personnel selection. The relationship between the logic of the test and its associated levels of adverse impact is discussed. The recent research on examinee test reactions is then introduced. The frequently neglected but important issue of differential subgroup attitudes is examined and related to the distinction between test content and method of testing. Based on the literature review and conceptual analysis in the Introduction, hypotheses for the present study are presented. The hypotheses were tested in a sample of 241 undergraduates (113 Blacks, 128 Whites). Results supported the hypotheses. Limitations, contributions, and implications of the present study are discussed.

Two Conflicting Goals in Personnel Selection

A crucial element in the achievement of organizational goals is the selection of individuals with high ability to perform their jobs. Hence, the primary focus of personnel selection research and personnel selection procedures has always been the maximization of predictive efficiency by identifying and selecting individuals with the highest job-relevant ability.
There has been a vast amount of empirical research on the validity and utility of selection procedures. Meta-analyses of these primary findings indicate that for a wide variety of jobs, valid measures of job-relevant ability dimensions can be developed and used to select high-potential individuals. For example, paper-and-pencil measures of cognitive ability are valid predictors of most jobs in the U.S. economy (Hunter & Hunter, 1984; Schmidt & Hunter, 1981). Assessment centers have consistently demonstrated validities for jobs involving managerial skills (Gaugler, Rosenthal, Thornton, & Bentson, 1987). Work samples (Schmitt, Gooding, Noe, & Kirsch, 1984) and biographical information (Reilly & Chao, 1982) are valid predictors of important job outcomes, and even interviews, when rigorously structured and administered, appear to be valid measures of job-relevant dimensions (McDaniel, Whetzel, Schmidt, & Maurer, 1994). Utility studies have also shown that valid selection procedures can make substantial economic contributions to organizational productivity (e.g., Boudreau, 1983).

However, organizational productivity is not the only goal to be considered by the employer when selecting individuals. Schmitt and Noe (1986) noted that at least since the passage of the Civil Rights Act of 1964, political and legal demands have forced employers to consider a second and frequently conflicting goal, namely, equal employment opportunities for various subgroups (minorities and women) in American society. In 1965, President Johnson issued Executive Order 11246, which required that all Federal contractors and subcontractors take affirmative action to ensure that employees are treated without regard to race, color, sex, religion, and national origin. This order, the passage of the Civil Rights Act of 1964, and subsequent court cases concerning charges of discriminatory use of tests constituted the zeitgeist for personnel researchers examining differences in validity of selection procedures across subgroups. Schmitt and Noe (1986) provided a summary of the research and issues on subgroup differences in test performance and differences in validity of tests across subgroups, including both differential validity and differential prediction. Whereas there is not much data regarding subgroup differences in validities of other predictors, the findings on subgroup differences (in particular, Black-White differences) in performance on paper-and-pencil measures of cognitive ability are well established.

There is little evidence of a Black-White difference in validity coefficients for paper-and-pencil measures of cognitive ability (i.e., little evidence of differential validity). Differential validity occurs when there is a significant difference between observed validities for two subgroups. Reviews of research have shown that differential validity is generally absent (Jensen, 1980; Linn, 1978) and, when it is observed, the validity differences between Blacks and Whites are small and trivial (Cascio, 1982; Hunter, Schmidt, & Hunter, 1979). Moreover, Bobko and Bartlett (1978) have successfully argued that differential validity per se would not be a sufficient indicator of test bias. For example, different subgroup validity coefficients may result when two groups differ in variability even when their prediction systems are identical.
Although there is little evidence of differential validity, there is extensive research demonstrating a sizable difference in test means between Black and White subgroups, with Blacks on average scoring about one standard deviation below Whites (e.g., Hunter & Hunter, 1984; Loehlin, Lindzey, & Spuhler, 1975; Schmidt, Greenthal, Hunter, Berner, & Seaton, 1977). Despite the absence of differences in subgroup validity coefficients, the use of paper-and-pencil measures of cognitive ability to select in a manner that optimizes predicted performance will still result in the hiring of a small number of Blacks relative to Whites, because the Black subgroup mean score on the test is substantially lower than the White subgroup mean score. Hence, there is a conflict between the optimization of predicted performance (i.e., the goal of organizational productivity) and the goal of equal subgroup representation in selection.

A similar conflict arises with respect to the assessment of differential prediction, which is more directly related to issues of test bias than is differential validity. Differential prediction has become the accepted way of evaluating test bias among most psychometricians. Evaluation of differential prediction involves consideration of validity coefficients, standard errors of estimate, and the regression line describing the predictor-criterion relationship. Predictions of performance are made using regression equations. According to the Cleary (1968) model of test bias, which is endorsed by both the Uniform Guidelines on Employee Selection Procedures (1978) and the Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 1987), a test is biased when a common regression equation results in either over- or under-prediction of subgroup performance; that is, a test is biased when there is differential prediction. Over-prediction for a protected minority group resulting from the use of a common regression line indicates test bias in the psychometric sense but is generally not considered a problem of fairness (SIOP, 1987). Hence, whereas test bias is a technical, psychometric issue, fairness is a social notion involving consideration of valued outcomes (SIOP, 1978).

The Cleary (1968) approach requires the use of separate subgroup regression equations when the equations are significantly different. The use of separate equations to provide a single rank order of applicants based on predicted scores will result in hiring the best qualified individuals, hence optimizing predicted performance, but it will result in the selection of unequal proportions of members of various subgroups when subgroup mean performance differs. Schmitt and Noe (1986) noted that most research evidence indicates that the use of a single common equation results in slight over-prediction of minority group performance, whereas the use of separate equations results in average predicted performance for subgroups that is identical to the actual performance difference, hence satisfying Cleary's (1968) criterion. Schmitt and Noe (1986) have also shown that the use of separate regression equations as prescribed by Cleary (1968) will result in selecting relatively few members of the lower-scoring group (which is frequently the minority group) at all levels of selection ratios. In short, paper-and-pencil measures of cognitive ability are valid predictors of job performance and are generally unbiased toward minority subgroup members in the sense that their predicted performance matches their actual performance.
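To make the differential prediction logic concrete, the following sketch (Python, using simulated scores; the numbers, variable names, and the simple least-squares fit are illustrative assumptions rather than the analyses reported in this thesis) fits a common regression equation to a pooled sample and compares each subgroup's predicted criterion mean with its actual mean. A positive discrepancy indicates over-prediction in Cleary's (1968) sense; if the two subgroup regression equations were identical, both discrepancies would be near zero.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated predictor (test) and criterion (performance) scores.
    # The minority group is given a slightly lower criterion intercept here
    # purely to show how over-prediction by a common equation would appear.
    x_majority = rng.normal(100, 15, 500)
    x_minority = rng.normal(85, 15, 500)
    y_majority = 0.5 * x_majority + rng.normal(0, 10, 500)
    y_minority = 0.5 * x_minority - 3 + rng.normal(0, 10, 500)

    x_all = np.concatenate([x_majority, x_minority])
    y_all = np.concatenate([y_majority, y_minority])

    # Common regression equation fitted to the pooled sample.
    slope, intercept = np.polyfit(x_all, y_all, 1)

    for label, x, y in [("Majority", x_majority, y_majority),
                        ("Minority", x_minority, y_minority)]:
        predicted_mean = slope * x.mean() + intercept
        discrepancy = predicted_mean - y.mean()  # > 0 means over-prediction
        print(f"{label}: predicted mean {predicted_mean:.2f}, "
              f"actual mean {y.mean():.2f}, discrepancy {discrepancy:+.2f}")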
However, sizable subgroup differences in test performance exist (with Blacks scoring on average one standard deviation below Whites). Top-down selection on the basis of test scores results in the hiring of relatively small proportions of minority subgroup members. In most cases, the use of paper-and-pencil measures of cognitive ability in selection produces a high level of "adverse impact" on minority hiring rates (e.g., Hunter & Hunter, 1984; Schmidt et al., 1977), defined by the Uniform Guidelines on Employee Selection Procedures (1978) as failure to meet the 4/5 rule; that is, the ratio of the proportion of minority applicants hired to the proportion of majority applicants hired should not be lower than 4/5.

The conflict between the goal of organizational productivity and the goal of equal subgroup representation prompted personnel researchers to try to develop valid predictors of performance that have levels of adverse impact lower than those associated with traditional paper-and-pencil measures of cognitive ability. A promising approach is the search for alternative predictor constructs. In this approach, researchers attempt to go beyond the construct of cognitive ability as assessed by traditional paper-and-pencil measures to measure other job-relevant abilities and attributes. The logic of the construct-oriented approach to reducing adverse impact in selection is that paper-and-pencil measures of cognitive ability, while valid, may be measuring those determinants of job success on which subgroup differences are largest and, conversely, may fail to measure important determinants of job success on which such differences are smaller or nonexistent (Schmidt et al., 1977). However, the majority of studies involving a search for alternative predictors have not adopted a construct-oriented approach. Instead, efforts have been directed at the development of alternative selection methods such as work samples, assessment centers, and biodata, and these efforts were often atheoretical. As argued later, this neglect of constructs resulted in a serious confound between method of testing and test content in assessment, which has largely hindered our understanding of the nature of subgroup differences in performance on selection instruments. Many of these issues are best illustrated with the development and use of work sample tests as alternative predictors (to paper-and-pencil cognitive ability tests) of job performance.

The next section will summarize the research on the validity and adverse impact of work sample tests. The logic of work sample tests and the problems associated with their development and use will then be explicated. Assessment centers, an alternative predictor closely related to work samples, will also be discussed to illustrate several issues and problems concerning the reduction of adverse impact. The discussion will lead to the consideration of situational judgement tests and their relationships to adverse impact and examinee test reactions, which are the subject of the present study.

In work sample tests, examinees are required to perform the same behaviors that they would be required to perform on the job. Several reviews have demonstrated that work sample tests can be at least as predictive of job performance as paper-and-pencil cognitive ability tests. Hunter and Hunter (1984) found that paper-and-pencil cognitive ability tests were about equally as valid as work sample tests. Schmitt et al.'s (1984) meta-analysis found that the validities of work samples were superior to those of biodata and cognitive ability tests.
With respect to adverse impact, work samples appear to be advantageous compared to cognitive ability tests in that the mean difference between the scores of majority and minority subgroup members is typically smaller for work samples (Brugnoli, Campion, & Basen, 1979; Cascio & Phillips, 1979; Schmidt et al., 1977; Schmitt, Clause, & Pulakos, 1996; Wigdor & Green, 1991). For example, Schmidt et al. (1977) compared the adverse impact of a content-valid work sample test of metal trade skills to that of a well-constructed, content-valid paper-and-pencil achievement test for the same technical area. They found the typical one standard deviation Black-White subgroup difference for the paper-and-pencil test but found no significant difference between Blacks and Whites for the work sample test. Bernardin's (1984) meta-analytic review of Black-White differences on work sample tests found an average difference of .54 standard deviation units favoring Whites.

In order to explain the positive results of the work sample test regarding its validity and small adverse impact, one needs to examine the logic of the development and use of work samples. The interest in work samples can in part be traced to Wernimont and Campbell (1968), who contended that samples of the kinds of behaviors actually required to be performed on the job would predict future job performance better than scores on typical cognitive ability tests. The authors argued that scores on ability tests are merely "signs" which are less similar to, and hence less related to, actual job performance than are "samples" of the work on the job. The implicit assumption is that the more similar a test is to the actual job, the higher the validity of the test. In accounting for the predictive success of work sample tests, Asher and Sciarrino (1974) stated that a strong relationship between the content of the job and the content of the selection method must exist for high predictive validity to occur. Smith and George (1992) argued that Asher and Sciarrino's (1974) "point-to-point" validation theory of work samples can be used as an explanation for the success and failure of most selection methods.

However, the notion of "similarity" between test and actual job has never been sufficiently explicated in most studies which examined tests purportedly similar to actual job content. This is certainly true in the case of work sample tests. Given that performance on most jobs is multidimensional, a work sample successfully replicating a portion of the job is almost always multidimensional. However, little if any work has been directed at understanding the nature of the constructs measured in work sample tests. Schmitt et al. (1996) reviewed studies on subgroup differences published from 1964 to 1994 in three major journals concerned with personnel selection and attempted to ascertain the nature of the constructs measured and the methods used in those studies. With regard to work samples, the authors found that it was almost never clear what construct(s) were measured. In the same review, the authors also noted that the data available regarding the lower adverse impact associated with work samples relative to paper-and-pencil cognitive ability tests are not very useful in providing an understanding of the reasons for the reduction in adverse impact. This is due to an inherent confound between method and test content in almost all studies comparing subgroup differences on the two types of tests. In these studies, work samples and cognitive ability tests differed in the method of testing
(e.g., paper-and-pencil versus actual task performance) and, presumably, in the nature of the constructs measured (e.g., general cognitive ability versus interpersonal-oriented dimensions) due to different item content between the two tests.

The distinction between method and content (Hunter & Hunter, 1984) is crucial to the study of reduction in adverse impact. If method and test content are disconfounded in a study, then subgroup differences due to method and subgroup differences due to test content can be isolated. In principle, we can then reduce or even eliminate adverse impact by changing the method of testing or the test content, depending on the job-relevance of the given constructs. For example, two different methods of testing may have the same test content measuring the same job-relevant construct, but one method may produce less adverse impact than the other. Adverse impact due solely to method of testing can then be eliminated by using the method with lower adverse impact, assuming that method is job-irrelevant. On the other hand, by controlling method, we may be able to ascertain different test contents that differ in the size of the subgroup differences they produce. For example, subgroup differences may be smaller for test content tapping interpersonal skills than for test content tapping cognitive constructs (Hough, 1994; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990). Assuming both types of constructs are job-relevant, adverse impact can be reduced and validity can be increased by expanding the predictor space beyond the measurement of cognitive constructs to include the measurement of interpersonal skill constructs. The present study differentiates method from content by comparing two different testing methods (paper-and-pencil versus video-based assessment) with the same set of test items. The importance of the method-content distinction in the present study will be elaborated later. In short, in terms of our understanding of the smaller adverse impact associated with work samples relative to cognitive ability tests, more work is certainly needed on the nature of the constructs measured in work samples, their representativeness of the job, and issues relating to the physical fidelity and psychological fidelity of the simulation (McHenry & Schmitt, 1994).

There are several practical problems that have limited the use of work samples. Despite their validity and low adverse impact, many organizations have not incorporated work samples into their selection procedures because of the high cost of testing. Work samples are often expensive to develop and administer, especially when raters are required. Many work sample tests are administered one on one by a test administrator who often has to score the results by hand (McHenry & Schmitt, 1994). To ensure reliability, more raters are required, which increases the cost of testing. Costs are further increased when complex administration and scoring procedures demand rigorous assessor training (Wigdor & Green, 1991). In certain cases, work sample tests may not be practical because of the potential danger to the applicant inherent in the tasks. Jobs involving high physical demands may be least practical for the development of work sample tests, and yet these may be the jobs where work samples are most predictive.
Finally, some jobs may be sufficiently technical and involve a substantial amount of job-specific knowledge such that it would not be possible to develop a work sample that is representative of a significant portion of the job and at the same time applicable to applicants (who do not have the knowledge and skills of experienced incumbents).

Assessment Centers

Several issues and problems concerning the reduction of adverse impact using alternative predictors can be illustrated with the development and use of a selection instrument closely related to work samples, namely, the assessment center. Although primary research and reviews in personnel selection have almost always treated assessment centers as a type of predictor distinct from work samples, the two predictors have much in common. Both assessment centers and work samples are based on a behavioral sampling assumption, and they share the basic tenet of the behavioral consistency approach that the best predictor of future performance is present or past performance or behavior of the same type. Both are simulations in the sense that the task stimuli are constructed such that they mimic actual job situations and elicit responses which are purported indicators of how assessees would handle the task situations if they were actually occurring on the job. Assessment centers are more like "samples" than like "signs" in the sense distinguished by Wernimont and Campbell (1968). Both work samples and assessment centers are almost always multidimensional, reflecting the multidimensionality of the target job they mimic. Both also tend to have high face validity. Both often require trained raters who are also subject matter experts on the target job, and both are expensive to develop and administer. The distinguishing feature of the assessment center is its multiexercise-multirater methodology. Also, although in principle the multiexercise-multirater methodology can be applied to almost any job, assessment centers have historically been restricted to the assessment of general managerial dimensions. Because it typically assesses general managerial dimensions as opposed to job-specific technical knowledge and skills, the assessment center is less likely to have the problem of inapplicability to inexperienced applicants faced by many work samples, alluded to earlier.

With respect to validity and adverse impact, research on assessment centers has demonstrated a pattern of findings similar to that of work samples. Like work samples, validities obtained for assessment centers are at least comparable to those observed for cognitive ability tests. At least two meta-analyses have found substantial validities for assessment centers. Schmitt et al. (1984) found an average validity of .41 across 21 studies. Gaugler et al. (1987) obtained an average validity of .34 based on 107 validity coefficients for various performance criteria from 50 studies. Like work samples, typical Black-White subgroup differences in assessment center performance are also substantially smaller than the one standard deviation difference observed for cognitive ability tests (e.g., Huck & Bray, 1976). Based on a sample of 2,910 candidates who were assessed for school administrator positions in 25 different assessment centers using the same set of exercises and dimensions, Schmitt (1993) found significant mean differences between Black and White subgroup members for 10 of 13 dimensions, ranging between two-thirds and three-fourths of a standard deviation in favor of Whites.
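Because subgroup differences throughout this review are expressed in standard deviation units, a brief illustration of that computation may be helpful. The sketch below (Python, with hypothetical summary statistics rather than data from the studies cited) forms the standardized mean difference using the pooled within-group standard deviation.

    import math

    def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
        """Pooled within-group standard deviation."""
        return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

    def standardized_difference(mean1, sd1, n1, mean2, sd2, n2):
        """Standardized mean difference d (positive values favor group 1)."""
        return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

    # Hypothetical dimension ratings for two subgroups.
    d = standardized_difference(mean1=3.6, sd1=0.8, n1=2000,
                                mean2=3.0, sd2=0.8, n2=900)
    print(f"d = {d:.2f} standard deviation units")  # prints d = 0.75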
In short, assessment centers by no means eliminate adverse impact, but they tend to show Black-White differences substantially smaller than those for cognitive ability tests. Whereas studies of work samples have tended to neglect the issue of constructs, a substantial amount of research has been devoted to the study of the construct validity of the dimensional ratings in assessment centers. However, our understanding of the nature of the constructs tapped in assessment centers is no better than in the case of work samples. Multitrait-multimethod studies have consistently reported low construct validity of dimensional ratings, and factor analyses of these ratings have produced "exercise factors" rather than dimensional factors (e.g., Chan, in press; Sackett & Dreher, 1982; Sackett & Harris, 1983; Schneider & Schmitt, 1992; Turnage & Muchinsky, 1982). In describing the lack of construct validity in assessment center research, Klimoski and Brickner (1987) noted that we know assessment centers work in the sense that they have predictive validity, but we do not know why, insofar as we have little understanding of the nature of the constructs tapped by assessor ratings.

Just as in the case of work samples, it is tempting to attribute the smaller subgroup difference observed in assessment center performance (relative to paper-and-pencil cognitive ability tests) to the nature of the constructs tapped by the test content. As with work samples, one may hypothesize that the multidimensionality of assessment centers includes both cognitive and non-cognitive constructs (e.g., interpersonal dimensions) and that subgroup differences on non-cognitive constructs may be smaller or even non-existent compared to cognitive constructs, such that the overall ratings in assessment centers exhibit lower adverse impact relative to paper-and-pencil measures of cognitive constructs. However, as mentioned earlier, the test of such a hypothesis would require a design eliminating the method-content confound in the comparison between assessment center performance and performance on cognitive ability tests.

Unfortunately, a fully crossed content by method factorial design is often not feasible. For example, in a paper-and-pencil methodology, it is difficult to develop test content tapping many of the usual assessment center dimensions (e.g., leadership, decisiveness) and sometimes impossible to do so (e.g., oral communication). Schmitt et al. (1996) reviewed studies on subgroup differences and found no study which employed the method by content design. However, they did find one unpublished study (Goldstein, Braverman, & Chung, 1993) reporting subgroup differences measured using different methods. The Goldstein et al. (1993) study will now be described in order to discuss the core issues associated with the method by content design approach to examining subgroup differences. Some of the problems with the design used in Goldstein et al. (1993) will be addressed in the present study.

The Goldstein et al. (1993) Study

The purpose of Goldstein et al. (1993) was to examine the effects of different testing methods on subgroup differences. The authors attempted to address the "method versus content" issue by developing four tests that purportedly assess the same six abilities. The sample consisted of 29 Whites and 13 Blacks who were being assessed for promotion in a police organization. The four tests used, which were construed as work samples by the authors, were similar to the typical exercises in an assessment center.
They were a written in-basket test, a role-play exercise in which the examinee conducts a performance appraisal counseling session, a simulation planning exercise requiring the examinee to develop contingency plans for a hypothetical event, and a simulation exercise in which the examinee supervises activities associated with the event that he or she had prepared for in the simulation planning exercise. The six abilities assessed across all four tests were the ability to pay attention to details, to adjust communication to the level of understanding of the other person, to communicate using proper grammar and wording, to put materials in a logical sequence, to adjust action or decision in light of new information, and to maintain composure in stressful situations.

Citing Helms (1992), Goldstein et al. argued that the African-centered values and beliefs of Blacks emphasize communalism, movement, and orality, which would in turn influence their test-taking performance. Accordingly, Blacks have a disadvantage on paper-and-pencil tests compared to Whites due to the strong written component required for successful performance on such measures. The written component is construed as a requirement of the test method and is not part of the construct intended to be assessed by the test content. The authors hypothesized that a testing method requiring a written response mode favors Whites over Blacks, whereas tests that are more interactive, behaviorally oriented, and aurally-/orally-oriented would exhibit less adverse impact. Hence, it was predicted that the written in-basket test would have a higher level of adverse impact relative to the other three tests, which were more interactive, behavioral, and aural/oral in nature. The results were consistent with the hypothesis. The written in-basket test had a substantially higher level of adverse impact (.47 to .87, average = .65) when compared to the simulation planning exercise (.41 to .64, average = .48) and the simulation exercise (.22 to .36, average = .30). For the role play, which is presumably the most interactive-oriented exercise, Blacks performed better than Whites (.38 to .64, average = .58).

Schmitt et al. (1996) noted several limitations of Goldstein et al.'s (1993) study. The sample sizes were small, with only 13 Blacks and 29 Whites. No reliability estimates were reported for the various measures. With low reliabilities, true subgroup differences will not be detected. It is possible that some of the more interactive measures (e.g., the simulation exercise) are substantially less reliable than paper-and-pencil measures, such that true subgroup differences were not detected on the former. That is, it was not clear if Goldstein et al.'s (1993) findings were due to true subgroup differences or simply an artifact of differential reliability in measurement. In the present study, which compared two methods of assessment in a situational judgement test, the reliabilities of each measurement method were estimated so that effect sizes could be corrected for unreliability in measurement. Adequate sample sizes were also employed to ensure sufficient power. Schmitt et al. (1996) also noted that there was no evidence establishing the equivalence of constructs across methods in Goldstein et al.'s (1993) study.
This is an important concern because, as argued earlier, the adequacy of a method by content design for isolating method sources and content sources of subgroup differences presupposes an equivalence of constructs across methods when test content is held constant across methods. In Goldstein et al. (1993), the content of the task stimuli (i.e., test content) appeared to be quite different across test methods. For example, it was not clear if the "ability to maintain composure under stressful situations" elicited by the preparation of memos in the in-basket test (and rated by assessors) was in fact the same construct as the purportedly same dimension elicited (and rated) by the interactions in the counseling situation of the role-play exercise. In the present study on situational judgement, the issue of construct equivalence was addressed by administering the same test items using two different methods of stimulus presentation and empirically testing the factorial invariance of test responses across the two methods.

Another limitation of Goldstein et al. (1993) was that the ability dimensions described were relatively specific, and their results may not be generalizable to the broader psychological constructs of interest typically assessed by the common predictor instruments in personnel selection, such as cognitive ability tests, personality measures, and work samples. The present study on situational judgement employed more global constructs, such as the interpersonal skill dimensions of conflict resolution and empathy. In order to provide a more rigorous test of the hypothesis that a significant amount of the Black-White difference in performance on paper-and-pencil tests is due solely to the reading/written requirements inherent in the method of testing and independent of the construct measured, the present study also administered a reading comprehension test to both Blacks and Whites. The hypothesis would predict lower reading comprehension scores for Blacks and that the Black-White subgroup difference in performance on the paper-and-pencil method of testing will be reduced when reading comprehension is controlled.

The Present Study

The present study examined the effects of a video-based versus a paper-and-pencil method of assessment on adverse impact and examinee test reactions in a situational judgement test. With respect to adverse impact, test content (and, presumably, the constructs measured) was held constant across two different methods of testing so as to isolate subgroup differences due solely to test methods. As mentioned earlier, construct equivalence across methods was empirically tested. Reliabilities of measurement were estimated to obtain corrected effect size estimates. A reading comprehension test was administered to provide an additional test of the hypothesis that a significant amount of the Black-White difference in performance on paper-and-pencil tests is due solely to the reading/written requirements inherent in the method of testing, independent of the test content. As discussed thus far, the study addressed the issues and problems associated with evaluating the effect of test method and test content on the size of subgroup differences in test performance. The use of the present situational judgement test circumvented many of the conceptual and practical problems associated with typical work samples and assessment centers explicated earlier. The logic and research on situational judgement tests and their relationship to adverse impact will be discussed next.
Examinee test reactions, the second dependent variable in the present study, will then be introduced. The study of test reactions has become increasingly important in recent personnel selection research, and the links between test reactions and the method-content distinction will be explicated.

The Logic of Situational Judgement Tests

In a typical situational judgement test, examinees are presented with a hypothetical scenario describing a work situation in which a problem has arisen. The work situation may be a possible actual situation on the target job or a situation constructed such that it is psychologically isomorphic to an actual situation. The latter would address the problem faced by typical work samples concerning the inapplicability of test items to inexperienced applicants due to the requirement of job-specific knowledge and experience on some jobs. Either way, the work situations on the test are developed on the basis of job analysis data, often including a critical-incident analysis involving subject matter experts. The individual situational judgement problem is almost always multidimensional in nature, in the sense that an adequate solution or handling of the problem would involve several ability and skill dimensions. Alternative responses are presented to the examinee following the description of the situation. Examinees' scores on the test are computed based on their endorsement of the responses. In tests employing a forced-choice format, examinees are typically asked to choose the most effective response, or to choose the most effective and the least effective response. In another format (the format used in the present study), examinees are asked to rate each response in terms of its effectiveness, usually using some form of a Likert-type scale. The scoring key is developed from prior effectiveness ratings of response alternatives obtained from subject matter experts. The decision rules for identifying the most or least effective response or for arriving at the score for each effectiveness rating given by examinees vary from test to test. Regardless of the precise rules used, statistical analyses and sometimes content analyses are performed on the subject matter expert ratings to ensure reliability and agreement in the ratings used for the development of the scoring key.

Often, the objective of developing a situational test is to sample behaviors from the domain of job performance rather than to measure any particular construct or predispositional sign. Hence, like work samples, situational judgement tests are more like "samples" than like "signs". However, Motowidlo, Dunnette, and Carter (1990) noted that it would be interesting to discover what constructs are measured by the test. The importance of construct orientation and the distinction between method and content for the examination of adverse impact has been discussed earlier. Identifying the nature of the constructs measured in situational judgement tests will provide a better understanding of the causes of adverse impact and help in the development of ways of reducing the level of adverse impact associated with a given selection instrument.

Situational Judgement Tests: Simulation Fidelity and Predictive Validity

Work samples, assessment centers, and situational judgement tests may all be construed as forms of simulations.
In these simulations, task stimuli are constructed such that they mimic actual job situations and elicit responses which are purported indicators of how assessees would handle the task situations if they were actually occurring on the job (Motowidlo et al., 1990). Work samples are on the high end of the continuum of simulation fidelity because they use very realistic materials to represent the task situation, and examinees may respond in a manner almost identical to the way they would if they were actually on the job. As tests move toward the low end of the fidelity continuum, stimuli and responses are less faithful approximations of actual job stimuli and responses. The situational interview (Latham & Saari, 1984; Latham, Saari, Pursell, & Campion, 1980; Weekley & Gier, 1987) is a well-known example of a simulation on the lower end of the fidelity continuum. Latham et al. (1980) reported a situational interview with a validity of .46, and Latham and Saari (1984) reported a validity of .14.

Motowidlo et al. (1990) developed a paper-and-pencil type of situational judgement test which they termed a "low-fidelity" simulation. In this test, the task stimulus (i.e., the work situation) is presented in written form, and examinees are required to endorse alternative responses also described in written form. The test resembles similar situational inventories developed in early research, such as the Supervisory Practices Test (Bruce & Learner, 1958), the "How Supervise?" (File & Remmer, 1971), and the Leadership Evaluation and Development Scale (Tenopyr, 1969). The paper-and-pencil method of administering the situational judgement test in the present study is a type of "low-fidelity" simulation with a format similar to the test developed by Motowidlo et al. (1990), except that instead of a forced-choice response format, the present test requires examinees to give effectiveness ratings for each of the alternative responses.

Motowidlo et al. (1990) noted that although simulations with higher fidelity should be better predictors of actual job performance than those with lower fidelity according to the basic tenet of behavioral consistency, there have been no systematic studies of the relationship between differences in fidelity and incremental predictive value. High-fidelity simulations such as work samples and assessment centers are expensive to develop and administer, and the cost of developing such simulations may not offset the gain in predictive value over lower fidelity simulations (Motowidlo et al., 1990). Whereas it is expensive and often not feasible to administer work samples or assessment centers to a large group of examinees in one testing session, situational judgement tests can be administered to relatively large numbers of examinees in one session. In the case of a paper-and-pencil format of the test, the scale of testing effort and expense is identical to that of traditional paper-and-pencil measures of cognitive ability or personality. Moreover, work samples and assessment centers almost always require substantial involvement of subject matter experts for rating or scoring of individual examinee performance at the time of testing, and ongoing assessor training costs can be high. On the other hand, the primary involvement of subject matter experts in the situational judgement test is in the development of the test stimuli (work situations) and the scoring key.
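As a brief aside on how such a subject-matter-expert key can be applied at scoring time, the minimal sketch below (Python, with a hypothetical item, hypothetical ratings, and one simple distance-based scoring rule; the actual scoring procedure of the present test is described in the Method section) scores an examinee's effectiveness ratings for one situation against the mean ratings provided by subject matter experts. Other keys award points only for agreement on the most and least effective responses; the choice of rule is part of test development rather than test administration.

    # Hypothetical effectiveness ratings (1 = very ineffective, 7 = very effective)
    # for the four response options of one situational judgement item.
    sme_key = {"a": 6.2, "b": 2.1, "c": 4.8, "d": 3.0}   # mean SME ratings
    examinee = {"a": 7, "b": 2, "c": 5, "d": 5}          # one examinee's ratings

    # One simple rule: the closer the examinee's ratings are to the SME means,
    # the higher the item score (negative mean absolute deviation).
    item_score = -sum(abs(examinee[r] - sme_key[r]) for r in sme_key) / len(sme_key)
    print(f"Item score: {item_score:.2f}")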
Hence, from a practical viewpoint, it is worthwhile to explore the predictive validity of low-fidelity simulations such as the situational judgement test. Using a sample of approximately 120 management incumbents, Motowidlo et al. (1990) found positive validities for their low-fidelity situational judgement test in predicting supervisory ratings of performance (.28 to .37, p < .01). Further evidence of validity for the test was provided by Motowidlo and Tippins (1993), in which two studies were reported. Study 1 employed a predictive validation design and found an average validity of .25 predicting supervisory performance ratings in a sample of 36 management applicants. Study 2 employed a concurrent validation design and found an average validity of .20 predicting supervisory performance ratings in a sample of 109 to 128 marketing incumbents. Pulakos, Schmitt, and Keenan (1994) developed a situational judgement test similar in format to Motowidlo et al.'s (1990) low-fidelity simulation test. Using a sample of incumbents from a large federal investigative agency, they found significant validities for the test in predicting two performance criteria, namely, investigative proficiency (.20) and effort and professionalism (.13).

Motowidlo et al. (1990) found a Black-White difference of .21 standard deviation favoring Whites in their sample of incumbents and a difference of .38 standard deviation favoring Whites in their sample of applicants. Although these differences were nonsignificant, there is reason for caution against concluding that situational judgement tests successfully eliminate adverse impact. The numbers of Blacks in Motowidlo et al.'s (1990) samples were small (ranging from 21 to 31), and the power to detect a difference of .5 standard deviation was only between 47% and 68% (Cohen, 1977). Of the two studies reported in Motowidlo and Tippins (1993), one provided no information on Black-White differences, as the sample of Blacks was too small for subgroup analysis (N = 16). The other study reported that Blacks scored lower than Whites by .38 standard deviation (44 Blacks vs. 178 Whites). Weighting the Black-White differences reported in Motowidlo et al. (1990) and Motowidlo and Tippins (1993) by their sample sizes yielded an average adverse impact of .32 standard deviation (total of 97 Blacks vs. 378 Whites). In the situational judgement test developed by Pulakos, Schmitt, and Keenan (1994), Blacks scored lower than Whites by .41 standard deviation (100 Blacks vs. 259 Whites).

The above review showed that adverse impact levels of the paper-and-pencil type of situational judgement test appear to be substantially lower than the typical one standard deviation for cognitive ability tests, but the size of the Black-White difference is still at least moderate and is practically significant. A primary purpose of this study was to examine the possibility of reducing the Black-White difference on the situational judgement test by simply changing the method of stimulus presentation from paper-and-pencil delivery to video-based delivery while keeping test content constant. The theoretical rationale for this hypothesis was explicated earlier in the discussion of the Goldstein et al. (1993) study. By replacing the paper-and-pencil method, which requires reading comprehension, with the more interactive, behavioral, and orally-/aurally-oriented video-based method, the Black-White difference in test performance should be reduced.
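The .32 standard deviation figure reported above is a sample-size-weighted average of study-level differences. The short sketch below illustrates that arithmetic (Python; the sample sizes attached to each d are assumed for illustration only and are not the exact subsample sizes of the studies cited).

    def weighted_mean_d(effects):
        """Sample-size-weighted mean of study-level effect sizes.

        `effects` is a list of (d, n) pairs, where n is the total sample size
        on which each d is based.
        """
        total_n = sum(n for _, n in effects)
        return sum(d * n for d, n in effects) / total_n

    # Hypothetical study-level Black-White differences (d) and sample sizes.
    studies = [(0.21, 150), (0.38, 103), (0.38, 222)]
    print(f"Weighted average d = {weighted_mean_d(studies):.2f}")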
Although the advantages of using video-based testing in personnel selection were alluded to as early as Thorndike (1949), its actual use is relatively new, and there is an insufficient research base evaluating the psychometric properties and adverse impact of the assessment method. However, the few studies conducted did report some encouraging results for a video-based method of presenting the situational judgement test. Based on a KSAO analysis of 50 customer service jobs, Wilson Learning (1990) developed a video-based situational judgement test for the assessment of customer service skills. Using performance ratings as the criterion, the test was found to have a validity of .40 for a sample of 126 Canadian employees and .34 for a sample of 60 American employees. In another video-based test developed for transit operator selection, Smiderle, Perry, & Cronshaw (1994) reported a significant negative validity using number of complaints as the criterion, but no significant correlations were found between test scores and two other criteria, namely, commendations and a performance composite. Dalessio (1994) also found a significant average validity of .17 for a video-based test predicting turnover a year later using several samples of insurance agents (total N = 677). The present author located only one published study reporting the adverse impact level of a video-based situational judgement test. Smiderle et al. (1994) found no significant Black-White difference in test performance (46 Blacks vs. 267 Whites). However, the result was not corrected for unreliability of measurement. The low reliability of the test (alpha = .47) certainly attenuated the true Black-White difference. Moreover, the present author performed a power analysis (Cohen, 1988) on the data and found that the study had only a power of approximately 59% to detect a moderate effect size (d = .5) at α = .05. Hence, more research is needed to ascertain the adverse impact level of video-based situational judgement tests. The present study examined Black-White differences in performance on a video-based assessment and compared them with the difference on a paper-and-pencil format of the same situational judgement test. A priori power analyses were conducted to ensure adequate sample sizes, and reliabilities of the two measurements were estimated to correct for attenuation due to unreliability. The present study developed two formats of a single situational judgement test, differing in the method of testing (video-based versus paper-and-pencil presentation of the work situations) with test content held constant. As discussed earlier, Helms (1992) theorized that African-centered values and beliefs of Blacks emphasize communalism, movement, and orality at the expense of reading comprehension. The lack of emphasis on reading comprehension in turn influences their test-taking performance, resulting in Blacks having a disadvantage on paper-and-pencil tests compared to Whites because of the strong written component required for successful performance on such measures. Reviews have accumulated extensive research evidence showing a significant and substantial Black-White difference on paper-and-pencil measures of cognitively oriented constructs in favor of Whites; that is, a high level of adverse impact exists. Results from Motowidlo et al. (1990), Motowidlo & Tippins (1993), and Pulakos et al. (1994) indicated that Blacks also score lower than Whites on a paper-and-pencil type of situational judgement test.
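The power figures cited in this review follow standard two-group power calculations (Cohen, 1977, 1988). The sketch below uses a normal approximation to the power of a two-tailed independent-groups comparison; the group sizes are placeholders, and the approximation is not claimed to reproduce the exact percentages reported above.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(d, n1, n2, alpha=0.05):
    """Approximate power of a two-tailed independent-groups test to detect
    a standardized mean difference d, via the normal approximation."""
    ncp = d * sqrt(n1 * n2 / (n1 + n2))      # noncentrality parameter
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

# Placeholder subgroup sizes, not the exact Ns from the studies reviewed above.
print(round(power_two_sample(d=0.5, n1=30, n2=30), 2))
```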
Prior to testing the primary hypotheses concerning effects of test method on adverse impact, it was necessary in the present study to first replicate the previous findings that Blacks perform significantly more poorly than Whites on a situational judgement test presented in a paper-and-pencil format. Goldstein et al. (1993) and Schmitt et al. (1996) have argued that a testing method loaded with a strong reading/written component would tend to favor Whites over Blacks, whereas tests that are more interactive, behaviorally oriented, and aurally-/orally-oriented would exhibit less adverse impact. Based on this argument and Helms's (1992) theory, it was predicted that for performance on the situational judgement test,

H1: There will be a Race X Method interaction effect on test performance such that the Black-White difference in performance (favoring Whites) will be substantially smaller in the video-based method of testing than in the paper-and-pencil method.

The nature of the expected interaction is depicted in Figure 1.

[Figure 1. Hypothesis 1: Predicted Race X Method interaction on situational judgement test performance (predicted mean test performance by method of assessment for Black and White examinees).]

It was argued earlier in the paper that a significant amount of the Black-White difference in performance on paper-and-pencil tests could be due solely to the reading comprehension requirements inherent in the method of testing, independent of the test content. Two hypotheses were derived from this argument. One hypothesis related performance on the test to the method of testing and individuals' reading comprehension ability, whereas the other related test performance to method of testing, reading comprehension, and racial subgroup membership. With respect to method of testing and reading comprehension ability, it was expected that an individual's performance on the situational judgement test would be affected by his or her reading comprehension ability when the test was administered using the paper-and-pencil method, but that no such effect would exist when the test was administered using the video-based method. Hence, it was predicted that for performance on the situational judgement test,

H2: There will be a Method X Reading Comprehension interaction effect on test performance such that performance will be positively and significantly correlated with reading comprehension ability in the paper-and-pencil method of testing, whereas no significant correlation between test performance and reading comprehension ability will occur in the video-based method.

The nature of the expected interaction is depicted in Figure 2.

[Figure 2. Hypothesis 2: Predicted Method X Reading Comprehension interaction on situational judgement test performance.]

Previous research has shown that a Black-White subgroup difference exists on reading comprehension tests favoring Whites (e.g., Matthews, 1991; Scott, 1987). A Black-White difference in reading comprehension scores favoring Whites was expected to be replicated in the present study. Thus, a significant amount of the Black-White difference in performance on paper-and-pencil tests could be due solely to the reading comprehension requirements inherent in the method of testing, independent of the test content. That is, a substantial amount of the Race X Method interaction effect on test performance hypothesized in H1 could be due solely to the Method X Reading Comprehension interaction effect on test performance hypothesized in H2. Hence, it was predicted that:

H3: The Race X Method interaction effect on situational judgement test performance will diminish substantially after controlling for the Method X Reading Comprehension interaction effect on test performance.
The predicted relationship between the Race X Method interaction and the Method X Reading Comprehension interaction on test performance is depicted in Figure 3.

[Figure 3. Hypothesis 3: Method X Reading Comprehension interaction as an explanation for the Race X Method interaction on situational judgement test performance.]

Examinee Test Reactions

Research on test validity and adverse impact has tended to examine predictor adequacy and fairness from the organizational and psychometric perspectives. Recent research in personnel selection has begun to focus more attention on applicant reactions or examinee attitudes to selection procedures (e.g., Arvey, Strickland, Drauden, & Martin, 1990; Gilliland, 1993; Gilliland, 1994; Macan, Avedon, Paese, & Smith, 1994; Schmitt & Gilliland, 1992; Schmitt, Gilliland, Landis, & Devine, 1993). This notion of perceived test adequacy and fairness has been termed "social validity" (Schuler, 1993), "impact validity" (Iles & Robertson, 1989), and the "social side of selection" (Herriot, 1989). Examinee reactions to selection procedures can be organizationally relevant in that they may affect applicant and employee behaviors (Arvey & Sackett, 1993; Gilliland, 1994). Premack and Wanous (1985) and Robertson and Smith (1989) argued that assessment situations serve as a preview of the organization, and Schuler and Fruhner (1993) noted that selection instruments can be used as instruments for personnel marketing. Smither, Reilly, Millsap, Pearlman, and Stoffey (1993) outlined three possible practical effects of applicant reactions. First, reactions can indirectly influence applicant pursuit or acceptance of job offers through organizational attractiveness. Second, reactions may relate to the likelihood of litigation and the success of the defense of the selection procedure. Third, reactions may indirectly affect both validity and utility, through motivation in test performance and loss of qualified applicants, respectively. In short, examinee test reactions are of interest because they constitute a critical component of the recruitment-selection process.

Face Validity and Predictive Validity Perceptions

Whereas there has been extensive research on the validity of work samples and assessment centers, there are few studies which systematically investigate examinee reactions to these predictors, such as face validity and predictive validity perceptions. Schmitt and Gilliland (1992), Gilliland (1993), and Gilliland (1994) have attempted to relate organizational theories of distributive and procedural justice to examinee test reactions, and Schuler (1993) has proposed a model of social validity. However, in general, the research on examinee reactions to selection procedures has been fragmented and atheoretical. Most research has focused on describing examinee attitudes or reactions to different selection tests and comparing reactions across tests. There is a need to integrate studies of examinee attitudes into the broader selection framework. In the present study, the investigation of examinee test reactions is integrated into the selection framework by analyzing attitudes by race and examining relationships between attitudes, adverse impact, and method of testing. Research on examinee reactions has focused on face validity, and little effort has been directed to investigating perceived predictive validity.
Face validity (also known as perceived content validity) refers to the extent to which examinees perceive the content of the selection procedure to be related to the content of the job. Perceived predictive validity refers to the extent to which examinees perceive that the procedure predicts future performance, regardless of face validity (Smither et al., 1993). Whereas face validity and perceived predictive validity are conceptually distinct, their empirical relationship is less clear. Although it is intuitively plausible to expect face validity to be highly correlated with perceived predictive validity, there has been little evidence demonstrating the correlation. The author located only one study which directly examined their empirical relationship. Smither et al. (1993) found a significant correlation of .36 (p < .01) between face validity and perceived predictive validity for a civil service examination. However, interpretations were problematic because the examination consisted of a variety of selection procedures (mainly paper-and-pencil measures of job knowledge and cognitive ability) and the sample of applicants was assessed for a variety of jobs ranging from entry-level to professional positions. In the present study, the relationship between face validity and perceived predictive validity was examined separately for four different tests. Using the job of a production worker as the frame of reference, it was predicted, within each of four different tests (a situational judgement test administered either in paper-and-pencil format or in video-based format, a reading comprehension test, a personality test, and a cognitive ability test), that:

H4: Predictive validity perceptions will be strongly and positively correlated with face validity perceptions.

Face Validity and Method of Testing

Both work sample tests and assessment centers appear to have high face validity. Research has shown that selection procedures involving simulations elicit more favorable examinee reactions than those using paper-and-pencil measures (Dodd, 1977; Macan et al., 1994; Schmidt et al., 1977; Smither et al., 1993). Schmidt et al. (1977) reported that perceptions of work sample tests were more favorable than those of paper-and-pencil measures of cognitive ability. Dodd (1977) found that assessees had positive reactions to the face validity aspects of an assessment center. Macan et al. (1994) found that examinees perceived the assessment center as more face valid than cognitive ability tests. The high face validity and favorable examinee attitudes for work samples and assessment centers are often attributed to their realistic test situations and similarity to the target job, that is, their high simulation fidelity. Smither et al. (1993) found that procedures involving simulations were generally perceived more favorably than paper-and-pencil measures. However, it is not clear which aspects of these tests are responsible for the positive reactions. Previous studies comparing examinee reactions across tests (e.g., across assessment centers and cognitive ability tests) have been limited in increasing our understanding of examinee reactions because of the method-content confound across tests.
By comparing two different methods of measurement with test content held constant, the present study was able to examine any differences in reactions attributable solely to test method. The assumption that simulation fidelity, or concreteness of the test stimulus, is positively related to examinee reactions would suggest that, in the present study, the video-based method of administering the situational judgement test would be perceived more favorably than the paper-and-pencil method even when test content remained the same, because the video-based method was a more concrete representation with higher simulation fidelity than the paper-and-pencil method. There is some evidence of positive examinee reactions to a video-based method of testing (e.g., Dyer, Desmarais, & Midkiff, 1993). Hence, it was predicted that:

H5: Face validity perceptions of the situational judgement test will be significantly higher when the test is administered in the video-based method than when it is administered in the paper-and-pencil method.

Within the same test, it is possible for examinees to hold, simultaneously, low predictive validity perceptions and high face validity perceptions. Doing well on a test whose content is related to the job tasks (i.e., a highly face valid test) does not always guarantee successful job performance, because successful performance has multiple determinants, many of which have little, if anything, to do with test performance. Unlike face validity, perceived predictive validity is less dependent on the test content or other test characteristics. There was no clear theoretical rationale for relating differences in test methods to differences in perceived determinants of successful job performance. Hence, no formal hypothesis was formulated for any effect of method of testing on perceived predictive validity.

Subgroup Differences in Test Reactions

The relationship between racial subgroup membership and reactions to selection tests is clearly an important practical issue. If examinee reactions affect subsequent examinee behaviors which are organizationally relevant, then any differential subgroup test reactions may explain some of the variance in job performance and behavior or in test performance associated with race. Almost all systematic differences in behaviors across racial groups have important economic and socio-political implications for the organization. Few studies have analyzed examinee reactions by racial subgroup membership. Schmidt et al. (1977) found no Black-White differences in reactions to work sample tests and cognitive ability tests. The lack of a significant Black-White difference in attitudes toward the cognitive ability test is somewhat puzzling. Given that Blacks perform more poorly than Whites on cognitive ability tests, and assuming that the method of measurement (i.e., paper-and-pencil) tends to be consistent with the cultural values, beliefs, and experiences of Whites but inconsistent with those of Blacks (Helms, 1992; Goldstein et al., 1993; Schmitt et al., 1996), one would predict that Blacks would report less favorable attitudes than Whites regarding paper-and-pencil cognitive ability tests. On the other hand, there is no reason to expect a Black-White difference in attitudes toward selection procedures involving more realistic materials or concrete representations. The confound between test method and test content in Schmidt et al.'s (1977) comparison between work samples and a cognitive ability test could not provide a rigorous test of the hypothesis of a Black-White difference relating to method of measurement.
This hypothesis could be more directly tested in the present study because content was held constant across two different methods of measurement. It was predicted that:

H6: There will be a Race X Method interaction effect on face validity perceptions of the situational judgement test such that the difference in face validity perceptions reported by Blacks and Whites will be greater in the paper-and-pencil method than in the video-based method.

The nature of the expected interaction is depicted in Figure 4.

[Figure 4. Hypothesis 6: Predicted Race X Method interaction on face validity perceptions of the situational judgement test.]

Whereas racial subgroup membership was expected to interact with method of testing to affect face validity perceptions because of differential subgroup experiences with test characteristics, it was less clear whether these subgroup experiences were relevant to predictive validity perceptions. There was no clear theoretical rationale for relating either these differences in subgroup experiences or differences in method of testing to differences in perceived determinants of successful job performance. Hence, no formal hypothesis was formulated for any Race X Method interaction effect on predictive validity perceptions.

Evidence of factorial invariance of responses to the situational judgement test across the two method groups would indicate that the same constructs were indeed being measured when test content was held constant across the two different methods of assessment. In addition, establishing factorial invariance across the two method groups would allow meaningful comparisons to be made between the paper-and-pencil method group and the video-based group of examinees with regard to their situational judgement scores. Factorial invariance was construed and assessed both internally (i.e., within the test) and externally (i.e., in terms of relationships with variables external to the test). Internally, factorial invariance was construed in terms of measurement invariance. Externally, factorial invariance was construed in terms of nomological invariance (or external parallelism). Measurement invariance exists when the numerical values across the two groups are on the same measurement scale (Drasgow, 1984, 1987). In the absence of measurement invariance (i.e., when numerical values across groups are not on the same measurement scale), group differences in mean test scores or in patterns of correlations of the test with external variables are substantively misleading. Nomological invariance, or external parallelism, across groups exists when the groups exhibit similar patterns of correlations between the test (or factors measured by the test) and external variables. To establish nomological invariance, independent established measures of personality constructs were administered in the present study for the purpose of relating them to scores on the two versions of the situational judgement test. It was anticipated that both versions would have similar patterns of correlations with the personality constructs.

Examinees

Examinees were introductory psychology undergraduates who participated in the study for extra course credit. A series of power analyses (Cohen, 1988) was performed for each hypothesis to determine the required sample size (see Appendix A for the series of power analyses). For each analysis, the power desired was .80 assuming a small effect size (see Cohen, 1988) at α = .05.
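An a priori analysis of the kind just described inverts the power calculation: given the desired power, α, and an assumed effect size, it returns the required sample size. The sketch below uses a normal approximation for an equal-n two-group comparison and is illustrative only; the study's own analyses were conducted per hypothesis (see Appendix A) and are not reproduced by this simplified example, whose effect-size value is a placeholder.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n needed for a two-tailed independent-groups test
    to reach the desired power (equal group sizes, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Placeholder effect size chosen only to show the mechanics of the calculation.
print(n_per_group(d=0.35))
```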
The power analyses revealed that 240 subjects were required A total of 244 undergraduates participated in the study and 241 provided usable data (113 Blacks, 128 Whites; 63.9% females). The incomplete and rmusable responses from 3 examinees were excluded from all analyses performed The video-based version of the situational judgement test used in the present study was a pilot version of a video-based situations assessment test developed by a large US-based human resources consultancy firm The test was developed by the firm as part of a comprehensive test battery for a consortium. The simulation focused on two broad functional areas namely, work habits and interpersonal skills. Each area was defined in terms of two performance factors. Work habits was defined in terms of work comnritrnent and work quality. Interpersonal skills was defined in terms of conflict management and empathy. The videotape included one practice video vignette 45 46 and 12 actual video vignettes spanning a range of common situations likely to be encountered in today's semiskilled and skilled blue collar work place. Each vignette depicted employees interacting on the job and described an interpersonal or work- related problem for one of the employees. At the end of each vignette, examinees were asked what action the employee should take to resolve the problem. A series of possible responses (ranging 9 to 14 responses per vignette) was presented in written form on the answer booklet. For each possible response, examinees were asked to rate its appropriateness on a 6-point rating scale fiom magnum to W. The pilot version of the test had a total of 126 items. On the basis of an item content analyis, the human resornce experts at the consultancy firm edited the test and produced a final version with a total of 63 items measuring the four aprimi factors, namely, work commitment (11 items), work quality (19 items), conflict management (17 items), and empathy (16 items). The pilot version of the test was administered in the present study because the final edited version was not available at the time of study. However, only the 63 items identified in the final version of the test were used in the computation of the total situational judgement score and the analyses involving the four amen factors. The consultancy firm developed a rational scoring key fiom the ratings of 25 job content experts. Each point on the rating scale was assigned a score of 0, 1, or 2 according to the percentage oferrdorsernerrt by the experts. A score of2 was assigned when endorsement was 50% or greater, 1 when endorsement was 25% to 49.99%, and 0 when endorsement was less than 25% 47 Based on the written script of the videotape which described the essential visual elements of the vignette, and the mrrator's speech and dialogue between the characters both in verbatim, the present author developed a paper-and-pencil format of the test. In this paper-and-pencil measure, each of the vignettes (1 practice and 12 actual vignettes) was presented in written form. The written vignette was described in the third-person perspective (as opposed to a dialogue) similar in form to the typical paper-and-pencil type of situational judgemmt test used in previous research (e. g., Motowidlo et al., 1990; Pulakos et al., 1994). The substantive content of each written vignette was identical to the corresponding video vignette. 
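Before turning to the rating procedure, the rational scoring key described above (a key value of 0, 1, or 2 assigned to each scale point from the percentage of expert endorsement) can be expressed directly. In the sketch below, the function names and endorsement percentages are illustrative and are not the consultancy firm's actual data.

```python
def key_score(endorsement_pct: float) -> int:
    """Convert the percentage of job content experts endorsing a scale point
    into the 0/1/2 key value described in the text: >= 50% -> 2,
    25% to 49.99% -> 1, < 25% -> 0."""
    if endorsement_pct >= 50.0:
        return 2
    if endorsement_pct >= 25.0:
        return 1
    return 0

def build_key(endorsements):
    """endorsements: {item_id: {scale_point: pct of experts endorsing that point}}."""
    return {item: {pt: key_score(pct) for pt, pct in pts.items()}
            for item, pts in endorsements.items()}

# Hypothetical endorsement percentages for one item's six scale points.
example = {"item_01": {1: 2.0, 2: 8.0, 3: 30.0, 4: 55.0, 5: 40.0, 6: 10.0}}
print(build_key(example))
```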
After reading each vignette, examinees gave their ratings on an answer booklet similar to the one used in the video-based method containing the same response items. The scoring key for the paper-and-pencil version of the test was identical to the one used in the video-based version The video-bmed administration and the paper-and-pencil administration each hadatotaltestingtime lasting45 minutes. Appendiprresents anexample ofthe vignettes and possible responses used in the paper-and-pencil method I l [E . I B . Face validity and predictive validity perceptions were each assessed by a 5- itern measure adapted fiom part ofa questionnaire used in Snrither et al. (1993). To provide a fiarne of reference, exarrrinees were asked to give ratings on the items concerning relationships between the test and the job of a production worker working in a team-based situation. It was further stated that to do the job well, the worker had 48 to be both technically competent and able to relate to others effectively. Ratings were anchored on a 6-point Likert-type scale from W to stronglxagrm The questionnaire is shown in Appendix C. Three widely used paper-and-pencil measures of established psychological constructs were administered to all examinees. Reading comprehension was assessed using the Comprehension subtest of the W (Form G, Brown, Bennet, & Hanna, 1993). The test was developed for use with high school and college students and it has been widely used in psychology and education for the assessment of reading comprehension. Form G (published in 1993) is one of the two parallel forms in the fifth edition of the test that was published originally in 1929. The comprehension subtest is a multiple-choice forrrmt test in which examinees read 8 passages and respond to a total of 36 five-answer multiple choice questions. Administration time is 20 minutes. The test-retest reliabilities of the comprehension subtestreportedinthetestmanualrangedfiom.75to.82. . Cognitive ability was assessed using the W (Wonderlic & Assoc, 1984). The Wonderlic test is a general cognitive test for industrial use (for reviews, see Schmidt, 1985; Schoenfeldt, 1985). It is a 12-minute test consisting of 50 items with a variety of verbal, numerical, and some spatial content, and it yields a single total score. Test-retest reliabilities ranged fiom .703 to .903. Personality constructs assessed were the "Big-five" dimensions measured using 49 the NEO-FFI (Costa & McCrae, 1992), a short version (i.e., 60 items) of the NEQBI (Costa & McRae, 1985). The dimensions assessed by the test are non-clinical constructs and include conscientiousness, agreeableness, neuroticism, openness to experience, and extraversion. The test contains a total of60 items each scored on a 5- point Likert-type scale ranging from W to W. Each of the 5 dimensions is measured by 12 items. The time for completion of the test is 30 minutes. Evidence of criterion-related validity and construct validity for the NEQBI havebeendocumentedinareviewbyDigman(1990)mrdreponedinCostaand Mche (1992). Design The study employed primarily a 2 X 2 between-subjects factorial design with performance on the situational judgement test and examinee test reactions (face validity and perceived predictive validity) as the dependent variables. The two independent variables were Race (Blacks vs. Whites) and Method (video-based vs. paper-and-pencil). 
Assignment of examinees to the Method condition was random with the restriction tlmt examinees in the same testing session were administered the same method. The number of examinees per condition was approximately equal (Black-Video = 51, Black-Paper = 62, White-Video = 69, White-Paper = 59). The paper-and-pencil measures of reading comprehension, cognitive ability, and the Bi g- Five personality constructs were administered to all examinees. 50 liocedme ExamineesweretestedhraclassmomsettingingroupsmngingbetweenSand 19 individuals. In the video-based method condition, the video vignettes were presented on a 25" television positioned in a manner such that all examinees could watch and listen to the videotape clearly. Instructions for the test were given on the videotape by a narrator. The instructions began with an example vignette as a practice item. The narrator first described the setting of the work situation and introduced the characters. Thevignwewasthenpresented. Attheendofthevignette, thevideo fiamefi'ozeandexanfineeswereaskedtoopenflreanswerbooklettoflreexample situation section and indicate the efl‘ectiveness of each possible response described in written form using the 6-point rating scale. Examinees had 2 minutes to complete ratings for all the responses pertaining to the vignette. After clarifying any questions regarding the manner ofcompleting the test, the actual test began There were a total of12videovignettesontheactualtestandeachvignettewasprecededbyanarrator introduction. Examinees had 2 minutes to complete ratings for the associated responses. After the video-based test which lasted approximately 45 minutes, examinees were asked to complete a questionnaire regarding their perceptions of the test. ThequesfionnairewasflreexmnmeeattimdesmeasmewhichconsistedofflreS items assessing face validity and the 5 items assessing perceived predictive validity. Examinees then completed a series of three paper-and-pencil measures including the wonderlieflersomellest, the NEW and the NEQEEI administered in counterbalanced order across test sessions. The same examinee 51 attitude questionnaire was administered following completion of each of the three paper-and-pencil measures. In the paper-and-pencil method condition, examinees were presented with the paper-and-pencil version of the situational judgemert test. The instructions for the test were written on the first page of the test booklet. The same example vignette preceded the 12 actual test vignettes. Examinees were given 45 minutes to complete the test. After the test, the rest of the session was identical to the video-based method session. Examinees completed the examinee attitudes questionnaire for the situational judgement test, and then followed by the three paper-and—pencil measures administered in cormterbalanced order across test sessions with an examinee attitude questionnaire following completion of each measure. In both conditions, subjects were thoroughly debriefed and thanked for their participation. The total testing time per session for each condition was approximately 2 hours. Analyses Effect size estimates ((1 statistic) for subgroup differences in performance on the situational judgment test were conrputed by subtracting the majority test mean from the minority test mean and dividing the difference by the pooled standard deviation. Hence, negative effect sizes indicated that Blacks scored lower than Whites whereas positive effect sizes indicated the reverse. 
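The d statistic defined in this paragraph can be computed as follows. The sign convention matches the one just stated (minority mean minus majority mean, divided by the pooled standard deviation); the scores in the example are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(minority, majority):
    """Standardized subgroup difference: (minority mean - majority mean) / pooled SD.
    Negative values indicate that the minority group scored lower."""
    n1, n2 = len(minority), len(majority)
    s1, s2 = stdev(minority), stdev(majority)
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(minority) - mean(majority)) / pooled_sd

# Illustrative scores only, not study data.
print(round(cohens_d([52, 55, 60, 58], [61, 64, 59, 66]), 2))
```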
Sex, Race, andMethodweredummycodedGemales=0, Males: 1; Whites: O, Blacks = 1; paper-and—pencil = O, video-based = 1) and the other study variables 52 were treated as continuous variables. Hierarchical regression analyses were used to test the interaction effects hypothesized in H1, H2, H3, and H6. Conelational analyses were used to test H4, and an independent-samples t—test was used to test H5. Multiple-groups covariance structure modeling using LISREL 8 (Joreskog & Sorbom, 1993) was used to assess measurement invariance and nomological invariance of the situational judgement test across the two method groups. Measurement invariance was tested by simultaneously comparing confirmatory factor analytic models across groups. It is widely accepted that measurement invariance is established when the factor loading matrix is invariant across groups (Alwin & Jackson, 1981; Sorbom, 1974). A more stringent criterion for measurement invariance is when both factor loadings and error variances of measures are invariant across groups. Nomological invariance was tested by comparing, across groups, the structural relationships between each situational judgement factor to the set of Bi g-Five personality factors. Nomological invariance is established when structrnal relationships are invariant across groups. The fit of a model was assessed using the )6 statistic and a variety of fit indices. The )6 statistic is the most widely used measure of model fit in organizational research (James & James, 1989; Kelloway, 1996). The main disadvantage of the x2 is its high sensitivity to sample size such that with large sample sizes, most models will produce statistically significant )8 values resulting in rejection of these models even if they are theoretically reasonable. Hence, most researchers 53 also rely on a variety ofaltemate fit indices to reduce the dependence on sample size when assessing model fit. Because the various indices differ on their specific assumptions, the use of multiple indices when evaluating a model can provide convergent evidence in the assessment of model fit In the present study, the indices used included J oreskog and Sorbom's (1989) goodness-of-fit index (GFI) and adjusted goodness-of-fit index (AGFI), Bentler‘s (1990) comparative fit index (CFI), Bentler and Bonett’s (1980) non-normed fit index (NNFI), Joreskog and Sorbom's (1986) standardized root mean square residual (standardized RMSR), and Steiger's (1990) root mean square error of approximation (RMSEA). Both GFI and AGFI are widely used indices of fit based on the comparison of observed and estimated covariances (see Kelloway, 1996). The AGFI is a parsirnonous fit which adjusts the GFI for the degrees of freedom in the model, that is, it takes into consideration the fact that a model always increases in fit as the number of fiee parameters to be estimated approaches the number of independent pieces of information available for estimation. The CPI and NNFI measure how well the model fits relative to a baseline model, usually the independence (i.e., null) model. The values of GFI, AGFI, CFI, and NNFI range fiom 0 to 1.0 with values approaching 1.0 indicating a good fit to the data. The present study used the convention of larger than .90 as an indication of good fit. The standardized RMSR is a measure of the average standardized residuals of the predicted covariance matrix fiom the observed covariance matrix. Values approaching 0 indicate a good fit to the data. 
The conventional value of less than .10 54 wasusedasanindicationofgoodfitinthepresentstudy. TheRMSEAisameasure of the average size of the fitted residuals per degree of freedom. Following Browne and Cudeck (1993), the present study considered a value of .05 or less as indicating a close fit; between .05 and .10 as a moderate fit; and more than .10 as a poor fit. The )6 difference test (me), obtained by calculating the difierence in the models' respective x2 with degrees of freedom equal to the difference in the models' respective degrees of fieedom, was used to compare the statistical significance of difference in fit between nested models. Table 1 presents the means, standard deviations, reliability estimates, and interconelations of all the study variables. The same statistics broken down by racial subgroups are reported in Appendix D. As shown in Table 1, the internal consistency reliability estimates (Cronbach's or) for the measures used in the present study were in acceptable ranges. The reliability estimates reported for the two versions of the situational judgement test are underestimates because of the multidimensional nature of the test. An inspection of Table 1 showed bivariate support for the major hypotheses. Race was more highly correlated with situational test performance when the test was administered in the paper-and—pencil method than when administered in the video- based method. Consistent with previous research, race was correlated with reading comprehension. With regard to test reactions, face validity perceptions and predictive validity perceptions were positively correlated for each of the four different tests (i.e., situational judgemmt, reading comprehension, cognitive ability, personality) used in the study. For the situational judgement test, face validity perceptions were correlated with the method of assessment. Also, race was more highly correlated with face validity perceptions when the situational judgement test was administered in the paper- and-pencil method than when administered in the video-based method. Each hypothesis will be addressed directly and in a multivariate sense in the following sections. 55 56 82580 ~ 03$. QC em oo 9 en 8- Na- 3 as 88 mum? .w 83 a 8 8 z- 8. 8 Se 39. 828 .5 Ge 8 9. mm- 3 2- m3 2% magma 6 Ge :- 8- «E 3.9. a> .m Ge :1 S- one can an .4 8 8 on S. 8,2 .m 8 Ms. em. 57. N on on 8:52 ._ .39» Nasseateeze2:2awsenemas amass: .... .. ........... . z $1.5... . 2 a 03¢ 57 326.80 3 035,—. 68 we we 2 mm co me 3 8 we 5. ea. 2 mo 8 «a 3. no No- 8 Ned mad“ OmZA—mmn— .om GD cm on S S n3 8. no. 2 mot m3- 8 N3 8 co. no co 5. no. omd $.0— OmZ-m—U-m0§m .: QC 3 8 3c. 3 3 en 3m. .5. mo. mwfi wodm meO .2 9‘8 ca- an. cm. 3 wT so om- 3 36 3.NN Ogm—Z .a an 3 ON 3 M: S 3 m— 3 2 N3 Z S a m h o n v m N _ Gm mane—z 58 .82 u 26 822922 e8 .82 u a B>-mo .22 u .6 82-295 .22 u a 8.2.82.2 .22 u 8 am é neon... :2 u z 8 Ban on menses :5 é 8o§p=8§a e8 gouge names... ago: see as E can anomaoaeuaa as do 8385 assessees as B> 9a an. é a ensfieo .2388 0:383 see 5:532 $32.32 a. 2 some? 2000 Em E85 8 9:82:80 03 83828 32538 5.. 8858.82 E 0.3 82:23:33 .3238 2a 3252mm»: new macaw—oboe as as season 2.1.28.2 .3253 2832 583882 eumoaomeaéem 58033er 881883 as 852 e8 .xmm .2025: .95 5223 2,382 8:83.293 562$ Ban—"8% Homeo>§xmu>§m ”meaguzfio usee.85...Zuompmz neofififiwfimmmg ”msgouaomaaoonomzoo ”ensues; Ecmz nemz 5288388 ”2.2232 successeezémm smegma; 38832 BeaéoeSua> engages... 38355 :23 esteem"?— eosaao a 882.2052 ”838% .8 smuxmm use .8832 383% a assess co Begun—05m: ”Bataan“ 8§E> . 
68 on em 2 mm em ow 02 :V S- 3 S- n2 5. 02. no mo 3 8 mg- we no- 9am on: ZOOU.Qm~—m.- 38 m— 8 me he 3 co 5 no no 02. 8 8. ms. 3 8 8 8 mm- 8 no $6 3.9 ZOOU.m~U>_U xoflm I om aouewroyed rsel ueew perogperd 63 H2 predicted that a Method X Reading Comprehension interaction effect on situational judgement test performance such that performance will be positively and significantly correlated with reading comprehension ability in the paper-and—pencil method of testing whereas no significant correlation between test performance and reading comprehension ability will 000m in the video-based method As shown in Table 2, entering Method and Reading Comprehension as a single block in step 1 of the regression of test performance on these factors accounted for 12% of the variance, p < .05. The Method X Reading Comprehension interaction term was entered in step 2 which resulted in a significant increase in variance accounted for, AR2 = .03, Adf = 1, p < .05. A plot of the interaction (Cohen & Cohen 1988) as depicted in Figure 6 showed that test perforrmnce and reading comprehension were positively correlated in the paper-and-pencil method of testing but they were nearly uncon'elated in the video- based method Hence, H2 was supported. .mocmctorwd ..mmh EmEmmczw 6:25:25 :0 52852:. 53:93an0 mcfimmm x 35.22 .m Saul szc: Um cc co_mcmcmano mcfimmm U052). nommmomuSt. poems. __ocmn_-ucm..m%n_.a. o.m+ (srrun ps u!) eoueuuoped reel ueew perogperd 65 H3 predicted that the Race X Method interaction effect on situational judgement test performance would diminish after controlling for the effect of the Method X Reading Comprehension interaction. As shown in Table 2, Race, Reading Comprehension, Method, and Method X Reading Comprehension interaction were entered as a single block in step 1 of the regression of test performance on race, reading comprehension, and method of assessment. The block accounted for 19% of the variance, p < .05. Entering the Race X Method interaction term in step 2 provided a signifieant but small increase in variance accounted for, AR2 = .01, Adf= 1, p < .05. The proportion of variance in test performance accounted for by the Race X Method interaction obtained in H1 diminished substantially fiom 4% to a small (though still statistically significant, p < .05) 1% once the effect of Method X Reading Comprehension on test performance was controlled. Hence, H3 was supported Figure 7 depicts the nature of the Race X Method interaction mmmlling for the effect of Method X Reading Comprehension interaction on test performance. Compared to Figure 5, Figure 7 shows that the Race X Method interaction effect was dampened to some extent after controlling for the effect of Method X Reading Comprehension interaction. In summary, the regression analyses provided support for the first three hypotheses. There was a Race X Method interaction effect on situational judgement test perforrmnce such that the Black-White performance difference (favoring Whites) was substantially smaller in the video-based method of testing than in the paper-and- perrcil method A Method X Reading Comprehension interaction also existed such 00:00.25 chcocmfiEoo @5003”. x 00:55. .5 Beam .2 05:02:00 5:0 mocmEBtmd 50h #:0600030 0002035 :0 cozomaoflc. 005.22 x comm N 059m Ememmmm< ..0 00505. 
033-0005 __ocmn_-0cm..mn_0n_ 02E>>_H_ x005 I eoueuuoped rsel ueew petogperd 67 that test performance was positively correlated with reading comprehension ability in the paper-and-pencil method but that they were nearly uncorrelated in the video-based method As shown in the regression results for H3, this Method X Reading Comprehension interaction accormted for a substantial portion of the Race X Method interaction effect on test performance. 68 EactmiallmtarimacmsiMethodflroups Table 3 presents the means, standard deviations, and both observed and corrected (for scale tmreliability) interconelations of the form a priori scales on the situational judgement test, broken down by method groups. Not surprisingly, internal consistency estimates of reliabilities (Cronbach's or) were low due to the relatively small number of items on each scale and the dichotomous (with a few trichotomous) scoring of the items. However, scale reliabilities were substantially higher than inter- scale correlations which provided some preliminary evidence for discriminant validity. Inter-scale conelations remained low, relative to scale reliabilities, even after correcting for unreliability in each scale. Multiple-group covariance structure analysis was used to provide a more rigorous test for the discriminant validity of the four a priori scales and to assess factorial invariance across method groups. Table 3 69 Means SD 1 2 3 4 Situational Judgement Scales 1. Conflict 11.84 3.81 (.46) .49 .23 .36 2. Empathy 9.47 3.65 .24 (.53) 70 28 3. Quality 10.09 2.50 .08 .26 (.26) .27 4. Commitment 7.98 3.54 .18 .15 .10 (.53) 1. Conflict 12.29 3.64 (.40) .48 .49 .30 2. Empathy 10.44 3.03 .15 (.24) .85 .49 3. Quality 10.74 2.78 .13 .27 (.42) .22 4. Commitment 9.38 3.33 .12 .15 .09 (.39) Note. Cronbach’s a reliabilities are in parentheses. Observed correlations are below diagonals and corrected (for unreliability) correlations are above diagonals. 70 Measmmmlnvariance. As described earlier, factorial invariance referred to both measurement invariance and nomological invariance. To formulate measmement models for the test of measurement invariance, items within each of the four scales were first randomly sorted into three sets comprised of approximately equal numbers of items. Item scores were writ-weighted and summed within each set to create three trait indicators (also lmown as observed indicators) for each latent trait variable purportedly measured by each scale (i.e., each situational judgement factor), giving a total of 12 trait indicators. A factor loading was arbitrarily set to 1.0 for one of the three indicators for each latent trait variable in order to scale that latent trait variable (Bollen, 1989). Appendix E presents the 12 X 12 observed covariance matrix among trait indicators for each of the two method groups. Table 4 presents the fit indices associated with the series of nested confirmatory factor analytic models fit to the two observed covariance matrices. Also presented in this table are chi-square difference tests associated with relevant model comparisons. 71 won—£88 v 038. 883? 8.8 Ba m8§ta>oo 88mm 95 www.588— 88£ 89cm mo. mo. mm. mm. me. co. m o2: m2 m> m2 mod worm? 888$ cfiSoboU 80m .92 .32—anal» 8.8 new .mwfiuaB 88am £35880 883 can mo. mo. ow. mm. mm. a. ArmmSm NE m> 2.4 on: wammfi @883 caflobouueom .NE decadent» 8.8 28 $532 88am new no. mo. on. 8. 3. mm. as ...ovd: 88$ 88.80 omwfim 42 <85 ease. as Ezz Eu Ec< E0 :8 N3 acmfiaaou 382 .8 "x 8:20 383. 288868.,“ 382 . "1. . .1 .4: .) ....,.ur.... ..n "1.. 
S .... .2 . .-.-..“r ...“ ...... .... ....... “....“ r. o .4..-.. ....-. ..‘4 o .... ...4. .... . wot.) .fl ..4 o . do n... .o n o: ... ‘4 . ‘. ....340 a .< o. v 033. 72 .39? mo. mo. mm. am. we. om. mo. mo. ow. ca. Va. cm. on vmdfi m2 m> 92 0 23 m2 m> 3.4 Na mos v2 m> m2 88.3888 88% use .8288? 8.8 .meBB 883 88m 03 $83 888$ eBay—80 80m .22 .885580 88am 08m $853? 8.8 28 $598— 88& Beam OS :63 £8me 682280 Son 62 3 9m 522 EU EO< ED :8 N3 88888 $82 .8 "x 385 m88< 883%on 882 73 A single general factor model in which factor loadings and error variances were fieely estimated across method groups (Model M1) was first fit to the covariance matrices as the baseline measurement model. The single general factor model provided a marginal fit to the data, )8 = 181.40, df=109, p < .05, GFI = .88, AGFI = .91, CFI = .66, NNFI = .59, standardized RMSR = .09, RMSFA = .05. Hence, there was no strong evidence of unidimensionalty in the situational judgement test. A four factor model in which factor covariance, factor loadings, and error variances were freely estimated across method groups (Model M2) was next fit to the data The model provided a signifieant increase in fit over the single factor model, sz =57.92, Adf=9,p<.05, andarwsonable fittothedataas indicatedby the fit indices. To test for measurement invariance across method groups, Model MZ was compared to a more parsimonous model (i.e., with higher degrees of fieedom) in which factor loadings were constrained to be equal across groups. The more parsimonous model (i.e., Model M3) continued to provide a reasonable fit to the data and the decrease in model fit from Model N12 to Model MB was nonsignificant, A76 = 10.20, Adf = 8, ns. Hence, equality of factor loadings across method groups was established Model MB was compared to Model M4, a yet more parsimonous model in whichbothfactorloadingsanderrorvarianceswere constrainedtobeequal across groups. Model M4 provided a good fit as indicated by the fit indices and the decrease inmodel fit, asmeasmedbythexzdifi‘erencetest, fromModelM3toModelM4was nonsignificant, sz = 7.03, Adf = 12, us. That is, using the stringent criterion of both 74 equal factor loadings and equal error variances, measurement invariance across the two method groups was established To examine the structrn'al aspects of the confirmatory factor analytic models (i.e., the relationships among latent trait variables), Model M4 was compared to Model M5 in which between-group equality constraints were imposed on factor covariances. In Model M5, the six factor covariances were constrained to be equal across the two method groups. That is, Model M4 and Model M5 difi‘ered only with respect to structural relations among the latent trait variables; for each model, the factor loadings and error variances were constrained to be equal across the two method groups. Model M5 provided a good fit to the data, )8 = 142.72, df = 126, n.s., GFI = .90, AGFI = .94, CFI = .92, NNFI = .92, standardized RMSR = .09, RMSEA = .02. The decrease in model fit from Model M4 to Model M5 was nonsignificant, Ax2 = 2.01, Adf = 6, ns. Comparison between Model M5 and Model M2 also revealed that as a whole, none of the equality constraints on factor covariances, factor loadings, and error variances significantly decreased model fit. Hence, Model M5 was selected as the most adequate measurement model. Figure 8 depicts Model M5 with its associated common metric factor loadings and factor conelations. All factor loadings were statistically signifieant, p < .05. 
Of the six factor correlations, five were statistically significant, p < .05. Full measurement invariance across method groups (i.e., full internal factorial invariance) was established in terms of error variances, factor loadings and factor covariances. 75 .Go. v m *v 8580 8:82 :8m 8m meouflobov 88mm 98 $5884 885m eofieemeefim mEoo NEOO FEOO *mm. EmEzEEoo *mm. *9». menu szc «g. _.N. maEm NaEc EEQ *ON. 11V. 2.82 88:80 685803».fl HEB E82 88225. 8823 boemahweou .me 2wa mcoo NCOO Fcoo *vm. *vc. *3. 76 Nomologicallnyariance. Table 5 presents, for each method group, the means, standard deviations, and interconelations between the 12 situational judgement indicators and the 5 personality indicators. 77 8:588 n 833. end mmd bod mot S.- we. Nor mo. 2 . mo.- S.- 2 .- 3. co. Hm. 3 . no. No. Rum no. mo. 2.- 3. mod mo. 8.- S.- no. 34» mo. 2. 2.- no. RN 5.- 2. no.- mo. ohm 5. mo. mo.- 3. 0H.- ové No. oo. mo.- om. cfim mwd a. 3. mm. 2. no.- No.- om. om. 3. 3.- mwd mu. 3. 5.- cm. wfi. wofi wad mfiw one omd Neda ovdn 3.4m hwdm hmdm menu—z >§m meO OMDNZ mmMO< UmZOU :Sna 8883.88 3% mace «88 EB 30 ~20 :30 Ram 88m Ram «.80 ~80 :80 mm was: . u‘..oo .-.. 1.0.. c. d ..o‘. ....“.¢ ‘ . . .4 .4. Hi. a . Q .1. 3 n 22g 78 dengfixmuarém amusemenzfio ”8885028ng ”amoaozfipefimmmca. ”mmuegoueomoéoUuUmZOU USEEEEoUnEoU 3330 ”mfiamEm—uaam noaeooueoo 6838.58 mo~n§> .334 a: 3: E :4 2: «3 m3 9; SA a: Se 28 8 3m N3 mam m3 men m3. e3 2a 2:. now o; 3.4 ”so: 8.- 2. S. 2.- 2.- 8.- 8.- 8.- 8.- 2.- S.- 8.- 98 mean >588 5.- 8.- a. 8.- M:- 8- 2.. 8. 8. mm. :. s. new 3.8 E8 2. so. No. 8.- fl. so. 2. 8.- 8.- S. 3. 8.- a: ”in 9:82 8. no. 2. 8. 8. mo. 8.- 3.- S. 2. 8.- 8. one 3.8 meme... 3.- 2. 8. S. 8. 8.- 2.- 8.- 8. 8.- 8.- 8. $8 3.8 828 sfiuzv Sam-82> Re of v: as m: 3: Re 42 $4 an «2 92 am meow «88 :80 38 3.0 :5 28m «mam Ram 28 28 :50 am ”so: 79 The present author had planned to use a multiple-group covariance structure analysis approach to testing equality of structural relationships between latent variables across groups (J oreskog & Sorbom, 1993) to assess nomological invariance (external parallelism) of the four situational judgement factors (in reference to the Big-Five personality factors) across method groups. Full nomological invariance is achieved whmboflrequalityofpmameteresfimatesandequafityoferrorsmeachofthe structural equations relating the respective situational judgement factor to the set of personality factors are established across method groups. However, as shown in Table 5, the observed correlations between the situational judgement indicators and the persomlity indicators were trivial, fluctuating around 0. Contrary to the author’s expectation, it appeared that the personality factors measured in the present study were not related to the situational judgement factors. Therefore, it was not meaningful to test for external parallelism using the five personality factors as external reference variables. A multiple-group covariance structure analysis was attempted, but failed to reject a model specifying between-group equality in structural parameters. The failure to reject was due to the lack of correlation between situational judgement factors and the external reference variables used for both method groups, and not beeause of a between-group similarity in the patterns of external correlation (i.e., external parallelism). Because of the low conelations between situational judgement scales and personality measures, it was formd that structural parameter estimates in both method groups were trivial. 
Three nested models were fit to the 17 X 17 observed covariance matrix relating the 12 indicators for the situational judgement factors and the 5 indicators for the personality factors for each method group (covariance matrices are reported in Appendix F). For all three models, measurement aspects were held constant so that effects of structural and measurement differences were not confounded in model comparisons. Constraining measurement aspects also resulted in the comparison of more parsimonious models by reducing the number of parameters to be estimated simultaneously. For measurement aspects of the four situational judgement factors, factor covariances, factor loadings, and error variances of observed indicators were constrained to be equal across method groups (i.e., the measurement model specified by Model M5). For measurement aspects of the five personality factors, the error variances of observed indicators are not identified parameters and cannot be estimated, because there was only one observed indicator (i.e., Big-Five subscale) per factor. Rather than assuming that the indicators were infallible measures by fixing error variances to zero, the identification problem was solved by fixing the error variance of each indicator to a value derived from its internal consistency estimate of reliability r_xx (Cronbach's α), using the formula

σ_ε² = (1 − r_xx) σ_x²     (1)

where σ_ε² is the error variance of the indicator and σ_x² is the variance of the indicator (Joreskog & Sorbom, 1993).

Model N1 freely estimated across method groups both the structural parameters and the error terms in each of the four structural equations relating the respective situational judgement factor to the set of five personality factors. The model provided a good fit to the data, χ² = 164.84, df = 217, n.s., GFI = .92, AGFI = .94. The model was compared to the more parsimonious Model N2, which similarly allowed error terms to vary freely but specified structural parameter estimates to be invariant across method groups. A χ² difference test showed that the decrease in model fit from Model N1 to Model N2 was nonsignificant, Δχ² = .72, Δdf = 20, n.s. Hence, structural parameter estimates did not differ significantly across method groups. Model N2 was compared to Model N3, which specified both structural parameter estimates and error terms in the structural equations to be invariant across method groups. The decrease in model fit from Model N2 to Model N3 was nonsignificant, Δχ² = 2.08, Δdf = 4, n.s. That is, error terms in the structural equations did not differ significantly across groups. However, for all three models, an inspection of the common metric standardized regressions of each situational judgement factor on the five personality factors revealed trivial parameter estimates fluctuating around 0, within the range between -.07 and +.04. Therefore, whereas the nested model comparisons indicated equality of the structural equations relating the respective situational judgement factor to the set of personality factors, this equality should not be construed as evidence for nomological invariance across method groups (i.e., external factorial invariance/parallelism). Instead, the equality of structural regressions was a result of near-zero correlations between the situational judgement factors and the external reference variables (i.e., the Big-Five personality measures) selected for the assessment of external parallelism.
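Two computations recur in the model comparisons reported above: fixing an indicator's error variance from its reliability (Equation 1) and evaluating a χ² difference between nested models. A minimal sketch of both follows; the reliability and variance passed to the first function are placeholders, while the Δχ² and Δdf passed to the second are the Model N1 versus Model N2 values reported in the text.

```python
from scipy.stats import chi2

def fixed_error_variance(reliability, indicator_variance):
    """Equation 1: error variance = (1 - r_xx) * observed indicator variance."""
    return (1.0 - reliability) * indicator_variance

def chi_square_difference_p(delta_chi2, delta_df):
    """p-value of the chi-square difference test for nested covariance models."""
    return chi2.sf(delta_chi2, delta_df)

print(fixed_error_variance(0.75, 4.0))              # placeholder alpha and variance
print(round(chi_square_difference_p(0.72, 20), 3))  # Model N1 vs. N2: nonsignificant
```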
Black-White Differences on Situational Judgement Constructs Across Methods of Assessment

The establishment of factorial invariance of responses to the situational judgement test, in terms of full measurement invariance across method groups, supported the meaningfulness of between-method comparisons of subgroup performance at the level of the individual constructs measured by the test. Factor scores for each of the four situational judgement factors were computed for all examinees based on the factor loadings in the measurement model (Model M5). Because the factors are latent variables free of measurement error in the observed indicators, comparisons of Black-White differences in factor scores provide more accurate estimates (i.e., disattenuated for unreliability in the measures) of the effect of method of assessment on adverse impact in the situational judgement test. Table 6 presents, for each of the four situational judgement factors, the subgroup factor means, standard deviations, and associated d statistics for each of the two methods of assessment. As shown in the table, the paper-and-pencil method produced substantial Black-White differences in performance favoring Whites on each of the four constructs, as indicated by the d statistic (Conflict = -.70; Empathy = -.43; Quality = -.35; Commitment = -.63). These Black-White differences were substantially reduced in the video-based method (Conflict = .02; Empathy = -.18; Quality = .06; Commitment = -.36), with d differences across methods ranging from .27 to .72. In fact, in the video-based method, Black-White differences were not statistically significant for any of the four factors.

[Table 6. Subgroup factor means, standard deviations, and effect sizes (d) for the four situational judgement factors by race and method of assessment. *p < .05.]

To summarize, nomological invariance of responses to the situational judgement test across the two method groups could not be tested because of near-zero correlations between the situational judgement factors and the external reference personality variables. However, factorial invariance in terms of full measurement invariance across methods was established. Measurement invariance supported the meaningfulness of between-method comparisons of racial subgroup performance at the level of individual constructs disattenuated for measurement error. For each construct, there was a large Black-White performance difference favoring Whites in the paper-and-pencil method. These performance differences were substantially reduced in the video-based method.

H4 predicted that, for each of the four different tests administered in the present study, predictive validity perceptions would be strongly and positively correlated with face validity perceptions.
H4 predicted that for each of the four different tests administered in the present study, predictive validity perceptions would be strongly and positively correlated with face validity perceptions. Results showed that correlations between the two perceptions were significant (p < .05), positive, and substantial for all four tests (paper-and-pencil situational judgement, r = .28, N = 121; video-based situational judgement, r = .24, N = 120; reading comprehension, r = .60, N = 241; personality, r = .48, N = 241; cognitive ability, r = .70, N = 241). For each test, the correlation between the two types of perceptions was substantially lower than the reliability estimates (Cronbach's alpha) of the respective perception measures (paper-and-pencil situational judgement, Face r_xx = .90, Predictive r_xx = .75; video-based situational judgement, Face r_xx = .78, Predictive r_xx = .81; reading comprehension, Face r_xx = .88, Predictive r_xx = .90; personality, Face r_xx = .76, Predictive r_xx = .86; cognitive ability, Face r_xx = .81, Predictive r_xx = .86). This provided evidence of discriminant validity for the two types of perceptions. H4 was supported.

H5 predicted that face validity perceptions of the situational judgement test would be significantly higher when it is administered in the video-based method than when it is administered in the paper-and-pencil method. Results of an independent sample t-test supported the hypothesis; the video-based method received significantly higher mean face validity ratings (M = 19.69, SD = 2.96, N = 120) than the paper-and-pencil method (M = 17.84, SD = 4.36, N = 121), t(237) = 3.87, p < .05.

H6 predicted that the difference in face validity perceptions on the situational judgement test reported by Blacks and Whites would be greater in the paper-and-pencil method than in the video-based method. To test this Race X Method interaction, a hierarchical regression of face validity perceptions was performed. As shown in Table 7, Race and Method were entered as a single block in step 1 of the regression and accounted for 12% of the variance in perceptions, p < .05. Entering the Race X Method interaction term in step 2 of the regression resulted in a significant increase in variance accounted for, ΔR² = .04, Δdf = 1, p < .05.

Table 7
Hierarchical Regressions of Face Validity Perceptions and Situational Judgement Test Performance (N = 241)

Criteria and Predictors                                        R²     df    ΔR²    Δdf    ΔF

Face Validity Perceptions
  Step 1. Race, Method                                         .120    2                  16.52*
  Step 2. Race X Method                                        .160    3    .040    1     12.13*

Test Performance
  Step 1. Race, Reading, Method, Reading X Method,
          Face Validity                                        .219    5                  13.20*
  Step 2. Race X Method                                        .227    6    .008    1      2.50

*p < .05.

Figure 9 depicts the nature of the interaction in terms of differences in subgroup mean perceptions. As shown in the figure, the Black-White difference in face validity perceptions on the situational judgement test was greater in the paper-and-pencil method than in the video-based method. To assess the practical significance of the statistically significant Race X Method interaction, effect sizes for subgroup differences were computed using the d statistic. A substantial Black-White difference in perceptions of four-fifths of a standard deviation, with Whites reporting higher face validity, was found on the paper-and-pencil version of the situational judgement test, d = -.80. The Black-White difference in perceptions was reduced substantially to a practically trivial one-ninth of a standard deviation in the video-based version of the test, d = -.11. Hence, H6 was supported.

Figure 9. Race X Method interaction on face validity perceptions of the situational judgement test (predicted mean face validity perceptions for Black and White examinees under the paper-and-pencil and video-based methods of assessment).
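A rough sketch of the hierarchical regression logic used to test H6 appears below. It is not the analysis code used in the study: the data are simulated, the 0/1 coding of race and method is an assumption, and the ΔF for the interaction step is computed from the change in R² as in Cohen and Cohen (1983).

```python
# Hedged sketch of the hierarchical (moderated) regression for H6: does the
# Race X Method term add variance beyond Race and Method?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
n = 241
df = pd.DataFrame({
    "race": rng.integers(0, 2, n),      # 0 = Black, 1 = White (assumed coding)
    "method": rng.integers(0, 2, n),    # 0 = paper-and-pencil, 1 = video-based
})
# Simulated perceptions with a larger White-Black gap in the paper-and-pencil method
df["face_validity"] = 17 + df.method + df.race * (1 - df.method) + rng.normal(0, 3, n)

step1 = smf.ols("face_validity ~ race + method", data=df).fit()
step2 = smf.ols("face_validity ~ race + method + race:method", data=df).fit()

delta_r2 = step2.rsquared - step1.rsquared
delta_f = delta_r2 / (1 - step2.rsquared) * step2.df_resid   # 1 df in the numerator
p_value = f_dist.sf(delta_f, 1, step2.df_resid)
print(round(delta_r2, 3), round(delta_f, 2), round(p_value, 4))
```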
Face validity perceptions and performance on the situational judgement test were significantly correlated, r = .33, p < .05. Because there was a Race X Method interaction effect on face validity perceptions, it appeared likely that face validity perceptions could explain the remaining portion of the Race X Method interaction effect on situational judgement test performance not attributable to the Method X Reading Comprehension interaction on test performance (see results for H3). It should be noted that this result was not hypothesized. A hierarchical regression was performed to examine if face validity perceptions could account for the remaining unaccounted portion of the Race X Method interaction on test performance. As shown in Table 7, the variables Race, Reading Comprehension, Method of Assessment, Method X Reading Comprehension interaction, and Face Validity Perceptions were entered as a single block in step 1 of the regression of test performance and accounted for 22% of the variance, p < .05. The Race X Method interaction term was then entered in step 2 of the regression, which did not account for unique variance, ΔR² = .008, n.s. Figure 10 depicts the plot of the Race X Method interaction on test performance controlling for the effects of both the Method X Reading Comprehension interaction and Face Validity Perceptions. Compared to Figures 5 and 7, Figure 10 shows that the Race X Method interaction disappeared after controlling for the effects of Method X Reading Comprehension and Face Validity Perceptions.

Figure 10. Race X Method interaction on situational judgement test performance after controlling for the effects of the Method X Reading Comprehension interaction and Face Validity Perceptions (predicted mean test performance for Black and White examinees under the paper-and-pencil and video-based methods of assessment).

One implication of these results is that the use of a video-based method of item presentation might have had a "motivational" effect on Black examinees that affected their performance on the test. This idea will be discussed further below. Figure 11 summarizes the relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance.

Figure 11. Relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance.

DISCUSSION

The present study has established several theoretically and practically important effects relating race, reading comprehension, method of assessment, face validity perceptions, and performance on a situational judgement test. As predicted by H1, race and the method of assessment interact to affect situational judgement test performance such that the Black-White performance difference (favoring Whites) is substantially smaller in the video-based method of testing than in the paper-and-pencil method. As predicted by H2, the method of assessment also interacts with examinees' reading comprehension ability such that test performance positively correlates with reading comprehension ability in the paper-and-pencil method but performance and reading comprehension are nearly uncorrelated in the video-based method. The results for H3 supported the argument that this Method X Reading Comprehension interaction accounts for a substantial portion of the Race X Method interaction effect on test performance.
Another set of important results involved examinee reactions to the situational judgement test. As predicted by H5, face validity perceptions of the test are significantly higher when it is administered in the video-based method than when it is administered in the paper-and-pencil method. In addition, race and the method of assessment interact to affect face validity perceptions in the manner predicted by H6. The difference in face validity perceptions reported by Blacks and Whites (with Whites giving higher face validity ratings) is greater in the paper-and-pencil method than in the video-based method. Finally, the results also suggest that face validity perceptions may explain the remaining portion of the Race X Method interaction effect on situational judgement test performance not attributable to the Method X Reading Comprehension interaction on test performance.

The implications and contributions of the present study to the research on subgroup differences in test performance and test reactions extend beyond the study of situational judgement tests. The issues revolve around the relationships between the method-content distinction and subgroup differences in test performance and test reactions. These issues will be discussed next in terms of conceptual, methodological, and practical implications.

Conceptual Implications

A fundamental contribution of the present study is the emphasis on the distinction between test method and test content. By disconfounding method and content in the present study, subgroup differences due to method and subgroup differences due to content can be isolated. By holding test content constant, the Race X Method interaction effect on test performance obtained in the present study shows that two different methods of testing measuring the same job-relevant content may have differential adverse impact. In principle, adverse impact due solely to method of testing can be eliminated by using the method with lower adverse impact, assuming that method is job-irrelevant.

Schmitt et al. (1996) argued that a significant amount of the Black-White difference in performance on paper-and-pencil tests might be due solely to the reading/written requirements inherent in the method of testing and independent of the test content. As discussed earlier, Goldstein et al.'s (1993) attempt to show that the method of testing can affect differences in subgroup test performance has several problems. The method versus content distinction made in the present study enables an empirical test of Schmitt et al.'s (1996) argument. In addition, the inclusion of a standard reading comprehension test in the study allows a direct test of the notion of a Method X Reading Comprehension interaction effect on test performance. The present findings regarding H1, H2, and H3 also support the argument that race differences in test scores may be partly the result of differences in the reading requirements associated with the method of testing.

Examinee Test Reactions

The present study contributes to the recent research on examinee reactions toward selection procedures. The only study which attempted to examine the relationship between face validity and predictive validity perceptions is Smither et al. (1993). As discussed earlier in the paper, interpretations of that study's findings are problematic because the perceptions measured were based on an examination consisting of a variety of selection procedures and the examinees used were applicants assessed for a variety of jobs ranging from entry-level to professional positions.
The present study avoided these problems by using the job of a production worker as the frame of reference and examining the relationship between face validity and predictive validity perceptions separately for four different tests. As predicted by H4, face validity perceptions and predictive validity perceptions are positively and strongly correlated. In addition, for each test, the correlation between the two types of perceptions was substantially lower than the internal consistency reliability estimates of the respective perception measures, therefore providing evidence of discriminant validity for the two types of perceptions.

In the present study, the investigation of examinee test reactions is integrated into the broader selection framework by analyzing subgroup differences in test reactions and examining their relationship to adverse impact and method of testing. Previous studies which simply compared and described mean differences in attitudes or reactions across tests have been limited in increasing our understanding of test reactions due to the method-content confound across tests. The method-content distinction helps clarify the aspects of tests responsible for examinee reactions. The results of the present study show that without varying test item content, the method of testing per se can affect face validity perceptions, including subgroup differences in these perceptions.

A serendipitous finding (insofar as the results were not hypothesized) in the present study relates to the role of face validity perceptions in explaining the remaining portion of the Race X Method interaction effect on situational judgement test performance not attributable to the Method X Reading Comprehension interaction on test performance. Race and method of assessment interact to affect face validity perceptions, which in turn affect test performance. In other words, subgroup differences in reading comprehension may account for a substantial portion of the Black-White difference in test performance in a paper-and-pencil method of assessment. In addition, a nontrivial part of the adverse impact could be due to the fact that the paper-and-pencil method of assessment elicits lower face validity perceptions from Black examinees relative to White examinees. This lowered face validity may have a negative motivational and performance effect on Black examinees.

The present results regarding the relationships between race, method of assessment, face validity, and test performance contribute to the recent research on test reactions. Face validity perceptions constitute an important dimension of test reactions. Some researchers have argued that low face validity could result in biased or inaccurate test scores and reduce the operational validity of a selection procedure (e.g., Cascio, 1987; Robertson & Kandola, 1982; Smither et al., 1993). Chan, Schmitt, DeShon, Clause, and Delbridge (under review) provided evidence that face validity perceptions affect test-taking motivation, which in turn affects cognitive test performance. Chan et al. also found that the typical Black-White difference in test performance was partially mediated by differences in face validity perceptions and test-taking motivation. Arvey et al. (1990) argued that the traditional model of cognitive test performance as simply a function of ability plus error is probably incorrect and that researchers have tended to focus exclusively on the ability dimension and have ignored the effort dimension or motivational aspects of test performance.
A similar argument may apply to performance on situational judgement tests. The present results suggest that the Black-White difference in performance on a paper-and-pencil situational judgement test could be decomposed into an ability component (i.e., reading comprehension differences) and a motivational component (i.e., face validity perception differences). However, a difference between the situational judgement test and the traditional cognitive ability test is that in the former, the ability (i.e., reading comprehension) dimension is often not part of the construct space intended to be measured by the test and is therefore job-irrelevant.

Chan et al. argued that an important practical implication of their findings was that face validity of a test represents a practical means of reducing adverse impact of many traditional paper-and-pencil measures because it is possible to write test items that reflect a credible face valid relationship to the performance of jobs for which examinees are being assessed. The present study found that the manipulation of the method of test item presentation resulted in changes in face validity perceptions, including changes in the size of the Black-White difference in these perceptions. It is plausible that these changes in perceptions in turn affected the Black-White difference in test performance. Whereas it is possible to affect face validity perceptions by writing credible items, the present findings suggest that simply changing the method of item presentation without changing item content may have substantial effects on subgroup differences in face validity perceptions and test performance.

Although the present results from the regression analyses are consistent with the idea that face validity perceptions affect test performance, it is also possible that test performance affects face validity perceptions. Chan et al. suggested that examinees' performance on a cognitive ability test may influence subsequent responses to face validity items. A self-serving mechanism may operate for reported face validity such that there exists a tendency for examinees to attribute poor test performance to low face validity of the test. Poor performance on a test whose content is perceived as unrelated to the content of the job is more self-serving than when test content is perceived as related to the content of the job. However, a self-serving bias explanation is a weaker argument in the case of performance on situational judgement tests than in the case of performance on traditional cognitive ability tests. This is because it is more difficult for an examinee to have knowledge or an estimate of his or her performance level on a situational judgement test compared to a cognitive ability test. It is not the purpose of the present study to address the causal relationships between face validity perceptions and test performance. The present data relating face validity perceptions and test performance are correlational in nature and causal inferences are not possible. Future research should consider experimental designs for manipulating test reactions and examining if Black-White differences in test performance can be reduced by changes in test reactions.

Methodological Implications

Although method and content are conceptually distinct, it is often difficult to separate the two empirically. The Goldstein et al. (1993) study discussed earlier in this paper illustrates the methodological difficulty in isolating the effects of method from the effects of test content and vice versa.
The present study suggests that one way to tease out the two different effects is to examine a common set of test items across different methods of testing. By holding test item content constant, the same intended constructs are presumably held constant across methods. However, holding item content constant does not guarantee that the same constructs are measured across method groups. Measurement invariance of responses to the test items is critical and needs to be established. In the absence of established measurement invariance, there is no support for meaningful between-method comparisons of test scores. As demonstrated in the present study, measurement invariance can be tested using the multiple-group approach to confirmatory factor analysis. Ideally, the researcher should have a priori scales for the constructs of interest so that he or she can proceed to test for equality of relevant parameter estimates (e.g., factor loadings, factor covariances, error variances) across method groups in a theory-driven manner.

Another way to test if the same constructs are measured across different method groups while holding test items constant is through the assessment of nomological invariance. The idea is similar to the assessment of external parallelism in the classical psychometric development of test items. Given a set of external reference variables, some of which are expected (for some conceptual reasons) to be empirically related to the constructs measured on the test whereas others are not, we have evidence of factorial invariance of responses to the test across method groups if both groups exhibit the same patterns of correlations between test constructs and external variables. In the present study, nomological invariance of test responses across the two method groups could not be tested because of near-zero correlations between situational judgement factors and the external reference personality variables. Therefore, the researcher should base the search and choice of external variables on solid theoretical grounds and the relevant previous empirical literature. Of course, this is often not easy because it presupposes that the researcher has little difficulty in explicating the nature of the constructs of interest on the test examined, which may not always be the case.

The mean differences obtained in the present study between racial subgroups and between methods indicate the presence of reading comprehension and some motivational differences associated with race and method. It should be noted that these mean differences reflect level differences on the situational judgement factors due to the effects of reading comprehension and motivational differences. Mean differences are consistent with factorial invariance of test responses across method groups (in terms of both measurement invariance and nomological invariance). The same construct can be measured in two groups though the groups may differ with respect to their level on the construct. Measurement invariance can coexist with mean differences because differences in factor means across method groups are independent of the equality of item-factor loadings, error variances, and factor covariances across method groups. Nomological invariance can coexist with mean differences because differences in factor means across method groups are independent of the equality of correlations between factors and external reference variables.
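The following sketch illustrates, with invented data and variable names, what an external-parallelism (nomological invariance) check of the kind described above could look like: correlations between factor scores and external reference variables are computed separately for the two method groups and their patterns compared.

```python
# Rough sketch of an external-parallelism check. All data, factor names, and
# external variables are invented for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

def factor_external_correlations(group_df, factor_cols, external_cols):
    """Correlations of each test factor with each external reference variable."""
    return group_df[factor_cols + external_cols].corr().loc[factor_cols, external_cols]

factors = ["conflict", "empathy"]             # hypothetical factor scores
externals = ["extraversion", "agreeableness"] # hypothetical reference variables
data = pd.DataFrame(rng.normal(size=(241, 4)), columns=factors + externals)
data["method"] = rng.integers(0, 2, 241)      # 0 = paper-and-pencil, 1 = video-based

paper = factor_external_correlations(data[data.method == 0], factors, externals)
video = factor_external_correlations(data[data.method == 1], factors, externals)
print((paper - video).abs().round(2))         # similar patterns suggest parallelism
```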
Another methodological issue concerns the need to correct effect size estimates (for subgroup differences) for attenuation due to unreliability. The majority of previous studies comparing adverse impact across selection procedures failed to report reliability estimates for the various measures or failed to correct effect size estimates for attenuation due to unreliability of measurement. With low reliabilities, true subgroup differences will not be detected. For studies reporting differential adverse impact across measures based on uncorrected effect size estimates, it is not clear if the results are due to true subgroup differences or simply an artifact of differential reliability in measurement. In the case of situational judgement tests, the difficulty is compounded because Cronbach's alpha, the most readily available reliability estimate, may not be an appropriate reliability index due to the multidimensional nature of these tests. Test-retest reliability is hard to obtain because it requires at least two separate administrations of the same test to the same examinees. Parallel form reliability is often not feasible because it requires the use of different item content, which raises the issue of construct equivalence and complicates the interpretation of corrected estimates. The present study suggests that one way to examine corrected effect size estimates for the multidimensional situational judgement test is to compute, for all examinees, factor scores for each situational judgement factor based on the factor loadings in the conceptually derived and empirically validated measurement model. Because the factors are latent variables free of measurement errors in the observed indicators, comparisons of method group and racial subgroup differences in factor scores provide more accurate estimates (i.e., disattenuated for unreliability in measures) of the effect of method of assessment on test performance and adverse impact.
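To make the attenuation point concrete, the short sketch below applies the classical correction of a standardized mean difference for unreliability (d divided by the square root of reliability). It is shown only as an illustration of how unreliability shrinks observed differences; the present study disattenuates through latent factor scores rather than through this formula, and the numbers are hypothetical.

```python
# Simple illustration of why unreliability shrinks observed subgroup differences.
import math

def disattenuated_d(observed_d, reliability):
    """Correct a standardized mean difference for unreliability of the measure."""
    return observed_d / math.sqrt(reliability)

observed_d = -0.50   # hypothetical observed Black-White difference
for alpha in (0.90, 0.70, 0.50):
    print(alpha, round(disattenuated_d(observed_d, alpha), 2))
# Lower reliability -> the same true difference appears smaller in observed scores.
```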
At least three limitations of the present findings should be noted. The first limitation concerns the generalizability of the findings relating to face validity perceptions. There are settings in which all examinees are likely to report that all tests are highly face valid. Examples of these settings include testing situations of actual job applicants or incumbents in which the stakes for successful test performance are high (e.g., assessment for hiring or promotion). It is very unlikely that an applicant taking a selection test for a job to which he or she desires to be hired will report low face validity on the test. In these high stake situations, self-presentation concerns may restrict reported face validity to high ratings when examinees perceive that test reactions may be used as inputs to individual selection decisions. This is most likely to happen when examinees do not have confidence that face validity responses are anonymous. In such settings, restriction of range limits the effect size estimates associated with face validity perceptions. However, it should be noted that in many of these settings, the assessment of face validity is likely to have low construct validity. Future research on the face validity of different testing methods should be sensitive to the nature of the samples used and the setting of the test assessment situation. Theories and measures of social desirability and self-presentation concerns may be relevant in certain high stake settings.

A second limitation concerns the nature of the constructs measured in the situational judgement test. Although the study addressed limitations in previous research by focusing on a priori situational judgement factors, correcting for measurement errors, and establishing factorial invariance of test responses across methods, more work needs to be done on construct validation. At this point, it is premature to use scores on individual situational judgement factors (at least those measured in this study) for any individual diagnostic or decision purpose. Future research should be explicit in the preoperational constitutive definitions of the relevant constructs in order to guide the development of appropriate measures (i.e., writing valid items). Finally, nomological invariance was not tested in the present study due to the inappropriate choice of external reference variables. In future research, factorial invariance of test responses across method groups in terms of both measurement invariance and nomological invariance should be empirically established and not merely assumed.

The focus of the present study was not on test bias as defined by the Cleary model (Cleary, 1968). No criterion performance data were collected to examine differential prediction across racial subgroups and method groups. From a practical perspective, future research should examine potential relationships between differential prediction and method effects on subgroup differences in test performance and face validity perceptions (or other motivational variables). For example, consider the use of test scores on the paper-and-pencil version of the situational judgement test in the present study as a predictor of job performance. If reading comprehension is job-irrelevant and uncorrelated with actual job performance, then using a common regression line based on the regression of job performance on situational judgement test scores would likely result in an over-prediction for White examinees and under-prediction for Black examinees. That is, test bias in the Cleary sense would occur.

Conclusion

The present study contributes to the sparse research on video testing in personnel selection and the research on situational judgement testing in particular. As mentioned early in the paper, the only published study reporting the adverse impact level of a video-based situational judgement test (Smiderle et al., 1994) did not correct for unreliability of measurement. The present study reports corrected estimates and isolates the method and content sources of subgroup differences in video testing. With the increasing popularity of video testing, clarifying the nature of its associated adverse impact levels becomes important from a legal and socio-political perspective. With the exception of relatively higher costs in test development due to video production, the video-based method shares the same practical benefits with the paper-and-pencil format of the situational judgement test, including the scale of testing which allows a large number of examinees in one session. Moreover, the video-based method is more realistic and concrete than the paper-and-pencil method. The method also elicits less adverse impact, more favorable face validity reactions in general, and smaller subgroup differences in these reactions in particular. In addition, the video-based method is generally less expensive than such high fidelity simulations as work samples and assessment centers.
Hence, from a practical perspective, it is worthwhile to invest more research effort in video-based testing and to compare the method with the traditional paper-and-pencil method of assessment for the same test. The method-content distinction made in the present study provides a conceptual and methodological basis for formulating future study designs.

REFERENCES

Alwin, D. F., & Jackson, D. J. (1981). Application of simultaneous factor analysis to issues of factorial invariance. In D. J. Jackson & E. F. Borgatta (Eds.), Factor analysis and measurement in sociological research (pp. 249-279). Beverly Hills, CA: Sage.

Arvey, R. D., & Sackett, P. R. (1993). Fairness in selection: Current developments and perspectives. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 171-202). San Francisco: Jossey-Bass.

Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test taking. Personnel Psychology, 43, 695-716.

Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.

Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.

Bernardin, H. J. (1984). Paper presented at the 44th Annual Meeting of the Academy of Management, Boston.

Bobko, P., & Bartlett, C. J. (1978). Subgroup validities: Differential definitions and differential prediction. Journal of Applied Psychology, 63, 12-14.

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Boudreau, J. (1983). Economic considerations in estimating the utility of human resource productivity improvement programs. Personnel Psychology, 36, 551-576.

Brown, J., Bennett, J., & Hanna, G. (1993). Nelson-Denny Reading Test. Lombard, IL: Riverside.

Bruce, M. M., & Learner, D. B. (1958). A supervisory practices test. Personnel Psychology, 11, 207-216.

Brugnoli, G. A., Campion, J. E., & Basen, J. A. (1979). Racial bias in the use of work samples for personnel selection. Journal of Applied Psychology, 64, 119-123.

Cascio, W. F. (1982). Costing human resources: The financial impact of behavior in organizations. Boston, MA: Kent.

Cascio, W. F., & Phillips, N. (1979). Performance testing: A rose among thorns? Personnel Psychology, 32, 751-766.

Cascio, W. F. (1987). Applied psychology in personnel management. Englewood Cliffs, NJ: Prentice-Hall.

Chan, D. (in press). Criterion and construct validation of an assessment center.

Chan, D., Schmitt, N., DeShon, R. P., Clause, C. S., & Delbridge, K. Reactions to cognitive ability tests: The relationships between race, test performance, face validity perceptions, and test-taking motivation. Manuscript under review.

Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5, 115-124.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Costa, P. T., Jr., & McCrae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.

Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.

Dalessio, A. T. (1994). Predicting insurance agent turnover using a video-based situational judgement test. Journal of Business and Psychology, 9, 23-32.

Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417-440.

Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are central issues. Psychological Bulletin, 95, 134-135.

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72, 19-29.
Dyer, P. J., Desmarais, L. B., & Midkiff, K. R. (1993). Multimedia employment testing. Paper presented at the Eighth Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco.

File, Q. W., & Remmers, H. H. (1971). How Supervise? manual. New York: Psychological Corporation.

Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493-511.

Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective. Academy of Management Review, 18, 694-734.

Gilliland, S. W. (1994). Effects of procedural and distributive justice on reactions to a selection system. Journal of Applied Psychology, 79, 691-701.

Goldstein, H. W., Braverman, E. P., & Chung, B. (1993). Paper presented at the Eighth Annual Conference of the Society for Industrial and Organizational Psychology, Montreal, Canada.

Helms, J. E. (1992). Why is there no study of cultural equivalence in standardized cognitive ability testing? American Psychologist, 47, 1083-1101.

Herriot, P. (1989). In M. Smith & I. Robertson (Eds.), Advances in selection and assessment. New York: Wiley.

Huck, J. R., & Bray, D. W. (1976). Management assessment center evaluations and subsequent job performance of black and white females. Personnel Psychology, 29, 13-30.

Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.

Hunter, J. E., Schmidt, F. L., & Hunter, R. F. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721-735.

Iles, P. A., & Robertson, I. T. (1989). The impact of personnel selection procedures on candidates. In P. Herriot (Ed.), Assessment and selection in organizations (pp. 257-271). Chichester, England: Wiley.

James, L. R., & James, L. A. (1989). Causal modeling in organizational research. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 371-404). Chichester, UK: Wiley.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

Joreskog, K., & Sorbom, D. (1986). LISREL VI: Analysis of linear structural relationships. Mooresville, IN: Scientific Software.

Joreskog, K., & Sorbom, D. (1989). LISREL 7: A guide to the program and applications. Chicago: SPSS.

Joreskog, K., & Sorbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software.

Kelloway, E. K. (1996). Common practices in structural equation modeling. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 141-180). Chichester, UK: Wiley.

Klimoski, R. J., & Brickner, M. (1987). Why do assessment centers work? The puzzle of assessment center validity. Personnel Psychology, 40, 243-260.

Latham, G. P., & Saari, L. M. (1984). Do people do what they say? Further studies on the situational interview. Journal of Applied Psychology, 69, 569-573.

Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). The situational interview. Journal of Applied Psychology, 65, 422-427.

Linn, R. L. (1978). Single group validity, differential validity, and differential prediction. Journal of Applied Psychology, 63, 507-514.

Loehlin, J. C., Lindzey, G., & Spuhler, J. M. (1975). Race differences in intelligence. San Francisco: Freeman.

Macan, T. H., Avedon, M. J., Paese, M., & Smith, D. E. (1994). The effects of applicants' reactions to cognitive ability tests and an assessment center. Personnel Psychology, 47, 715-738.

Matthews, D. B. (1991). Learning styles research: Implications for increasing students in teacher education programs. 228-236.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599-616.
McHenry, J. J., & Schmitt, N. (1994). Multimedia testing. In M. G. Rumsey, C. B. Walker, & J. H. Harris (Eds.), Personnel selection and classification. Hillsdale, NJ: Erlbaum.

Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.

Motowidlo, S. J., & Tippins, N. (1993). Further studies of the low-fidelity simulation in the form of a situational inventory. Journal of Occupational and Organizational Psychology, 66, 337-344.

Premack, S. L., & Wanous, J. P. (1985). A meta-analysis of realistic job preview experiments. Journal of Applied Psychology, 70, 706-719.

Pulakos, E. D., Schmitt, N., & Keenan, P. A. (1994). (Report FR-PRD-94-20). Alexandria, VA: Human Resources Research Organization.

Reilly, R. R., & Chao, G. T. (1982). Validity and fairness of some alternative employee selection procedures. Personnel Psychology, 35, 1-62.

Robertson, I. T., & Smith, M. (1989). Personnel selection methods. In M. Smith & I. T. Robertson (Eds.), Advances in selection and assessment (pp. 89-112). New York: Wiley.

Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67, 401-410.

Sackett, P. R., & Harris, M. M. (1983). Paper presented at the American Psychological Association Convention, Anaheim, CA.

Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.

Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job samples vs. paper-and-pencil trade and technical tests: Adverse impact and examinee attitudes. Personnel Psychology, 30, 187-197.

Schmitt, N. (1993). Group composition, gender, and race effects on assessment center ratings. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives. Hillsdale, NJ: Erlbaum.

Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences associated with different measures of some common job relevant constructs. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology. New York: Wiley.

Schmitt, N., & Gilliland, S. W. (1992). Beyond differential prediction: Fairness in selection. In D. Saunders (Ed.), New approaches to employee management (Vol. 1, pp. 21-46). Greenwich, CT: JAI Press.

Schmitt, N., Gilliland, S. W., Landis, R. S., & Devine, D. (1993). Computer-based testing applied to selection of secretarial applicants. Personnel Psychology, 46, 149-165.

Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta-analyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.

Schmitt, N., & Noe, R. A. (1986). Personnel selection and equal employment opportunity. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology. New York: Wiley.

Schneider, J., & Schmitt, N. (1992). An exercise design approach to understanding assessment center dimension and exercise constructs. Journal of Applied Psychology, 77, 32-41.

Schuler, H. (1993). Social validity of selection situations: A concept and some empirical results. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives. Hillsdale, NJ: Erlbaum.

Schuler, H., & Fruhner, R. (1993). Effects of assessment center participation on self-esteem and on evaluation of the selection situation. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives. Hillsdale, NJ: Erlbaum.

Scott, R. (1987). Gender and race achievement profiles of Black and White third-grade students. Journal of Psychology, 121, 629-634.

Smith, M., & George, D. (1992). Selection methods. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology. New York: Wiley.

Smiderle, D., Perry, B. A., & Cronshaw, S. F. (1994). Evaluation of video-based assessment in transit operator selection. Journal of Business and Psychology, 9, 3-22.

Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant reactions to selection procedures. Personnel Psychology, 46, 49-76.
Society for Industrial and Organizational Psychology. (1987). Principles for the validation and use of personnel selection procedures (3rd ed.). College Park, MD: Author.

Sorbom, D. (1974). A general method for studying differences in factor means and factor structures between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.

Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173-180.

Tenopyr, M. L. (1969). The comparative validity of selected leadership skills relative to success in production management. Personnel Psychology, 22, 77-85.

Turnage, J. J., & Muchinsky, P. M. (1984). A comparison of the predictive validity of assessment center evaluations versus traditional measures in forecasting supervisory job performance: Interpretive implications of criterion distortion for the assessment center. Journal of Applied Psychology, 69, 595-602.

Uniform Guidelines on Employee Selection Procedures (1978). Federal Register, 43, 38290-38315.

Weekley, J. A., & Gier, J. A. (1987). Reliability and validity of the situational interview for a sales position. Journal of Applied Psychology, 72, 484-487.

Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372-376.

Wigdor, A. K., & Green, B. F., Jr. (1991). Performance assessment for the workplace. Washington, DC: National Academy Press.

Wilson Learning (1990). (TAB). Longwood, FL: Wilson Learning.

Wonderlic, E. F., and Associates (1984). Wonderlic Personnel Test manual. Northfield, IL: E. F. Wonderlic & Associates.

APPENDICES

APPENDIX A

Statistical Power Analyses

For each of the following power analyses, the desired power was fixed at .80 and alpha was fixed at .05. Expected effect sizes were construed as "small" effect sizes (Cohen, 1988). (The f² and n* computations described under H1 are illustrated in the sketch that follows this appendix.)

H1: H1 tests the unique variance accounted for by the Race X Method term over and above the set of control variables consisting of Race and Method (Set A). A small ΔR² of .03 was arbitrarily expected. The expected R² for the entire set of predictors (Set A + Race X Method term) was arbitrarily fixed at a conservative value of .10. Using Cohen and Cohen's (1983) formula for effect size f², we have

f² = ΔR²/(1 - R²) = .03/(1 - .10) = .033

According to Cohen (1988), an f² value of .033 is construed as a small effect size. Cohen and Cohen's (1983) formula for the required sample size n* is as follows:

n* = L/f² + k + 1

where k refers to the df for the unique source of variance. We have k = 1. From the table of L values in Cohen and Cohen (1983), we have L = 7.85. Therefore, we have n* = (7.85/.033) + 1 + 1 = 239.8.

H2: H2 tests the unique variance accounted for by the Method X Reading Comprehension term over and above the set of control variables consisting of Method and Reading Comprehension (Set A). The same assumptions as in H1 were made, which resulted in the same sample size requirement (n* = 239.8).

H3: H3 tests the unique variance accounted for by the Race X Method term over and above the set of control variables consisting of Race, Reading Comprehension, Method, and Reading Comprehension X Method (Set A). The same assumptions were made as in H1 except that R² was fixed at a higher (but nevertheless conservative) value of .15 because of the larger number of variables in Set A. Using the same formulae as in H1 resulted in a required sample size of 226.3.
H4: H4 tests the significance of the Pearson correlation coefficient between Face Validity Perceptions and Predictive Validity Perceptions. A small effect size of r = .20 was arbitrarily expected. Based on Cohen's (1988) tables of sample size requirements for correlation coefficients, a desired power of .80 at alpha = .05 indicated that a sample size of 194 was required.

H5: H5 tests the difference in mean face validity perceptions between the paper-and-pencil method and the video-based method. A conservative d value of .30 was arbitrarily expected. Based on Cohen's (1988) tables of sample size requirements for t tests between means, a desired power of .80 at alpha = .05 indicated that a sample size of 175 was required.

H6: H6 tests the unique variance accounted for by the Race X Method term over and above the set of control variables consisting of Race and Method (Set A). The same assumptions as in H1 were made for ΔR² and R² for Set A. Using the same formulae as in H1 resulted in a required sample size of 239.8.
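The sketch below reproduces the H1 computation from Appendix A under the stated assumptions (ΔR² = .03, R² = .10, k = 1, and L = 7.85 from Cohen and Cohen's table for power = .80 and alpha = .05). It is only a check of the arithmetic, not code used in the study.

```python
# Hedged sketch of the Appendix A sample-size computation for H1 (Cohen & Cohen, 1983):
# effect size f^2 = delta-R^2 / (1 - R^2), then n* = L / f^2 + k + 1.
def required_n(delta_r2, total_r2, L, k):
    f_squared = round(delta_r2 / (1.0 - total_r2), 3)   # .033, rounded as in the text
    return f_squared, L / f_squared + k + 1

f2, n_star = required_n(delta_r2=0.03, total_r2=0.10, L=7.85, k=1)
print(f2, round(n_star, 1))   # -> 0.033 and about 239.9, essentially the 239.8 reported
```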
APPENDIX B

EXAMPLE OF A PAPER-AND-PENCIL VIGNETTE

The following is an example of a written vignette on the test booklet and some possible responses on the answer booklet in the paper-and-pencil version of the situational judgement test.

Example of a written vignette on the test booklet.

SITUATION 1

Jerry and Dennis are discussing how they should go about checking the machinery in the building. Jerry told Dennis that they should start at the West end of the building and work their way East so that the more important machinery will be taken care of first. Dennis disagreed as he thinks that since the East end is on break right now, it would be much faster to start East and work their way West. Jerry said that he has never seen anyone doing it that way and besides, the machinery at the West end is more critical. Dennis continues to disagree and thinks that they should start at the East end. Jerry can respond in a number of ways. For each possible response described in the answer booklet, indicate its effectiveness on the rating scale provided.

Examples of possible responses on the answer booklet.

After you have read SITUATION 1, rate the effectiveness of the responses below from Jerry's perspective.

1. Ask your supervisor to decide which method is better.
2. Convince Dennis that your method is best.
3. Agree to use Dennis' method.
4. Split the work in half. Each of you use your own method.
5. Tell Dennis that you will use his method for a while, but you will switch if it looks like your method is best.
6. Tell Dennis that he needs to listen carefully to your ideas.
7. Compromise. Use Dennis' method today and your method next time.
8. Demand that Dennis use your method.

For each possible response, the following rating scale is provided: VERY INEFFECTIVE / INEFFECTIVE / SOMEWHAT INEFFECTIVE / SOMEWHAT EFFECTIVE / EFFECTIVE / VERY EFFECTIVE.

APPENDIX C

TEST REACTIONS QUESTIONNAIRE

QUESTIONNAIRE ON THE TEST THAT YOU HAVE JUST COMPLETED

Consider the job of a production worker which requires working in team-based situations. To do the job well, the worker has to be technically competent and also be able to relate to other persons effectively. For such a job, indicate how much you agree or disagree with the following statements about the test that you have just completed by circling the appropriate number on the rating scale provided. Each statement was followed by the same 5-point scale: 1 = STRONGLY DISAGREE, 2 = DISAGREE, 3 = NEITHER AGREE NOR DISAGREE, 4 = AGREE, 5 = STRONGLY AGREE.

1. I did not understand what the test had to do with the job.
2. I could not see any relationship between the test and what I think is required by the job tasks.
3. It would be obvious to anyone that the test is related to the job tasks.
4. The actual content of the test was clearly similar to the job tasks.
5. There was no real connection between the test and the job tasks.
6. Failing to pass the test clearly indicates that you can't do the job.
7. I am confident that the test can predict how well an applicant will perform on the job.
8. My performance on the test was a good indicator of my ability to do the job.
9. Applicants who perform well on the test are more likely to perform well on the job than applicants who perform poorly.
10. The employer can tell a lot about the applicant's ability to do the job from the results of the test.

APPENDIX D

Table 8 (continued). Means, standard deviations, and intercorrelations of study variables by method of assessment.