This is to certify that the thesis entitled

VIDEO-BASED VERSUS PAPER-AND-PENCIL METHOD OF ASSESSMENT IN SITUATIONAL JUDGEMENT TESTS: SUBGROUP DIFFERENCES IN PERFORMANCE AND EXAMINEE REACTIONS

presented by DAVID CHAN has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor: Neal Schmitt
Date: March 20, 1996

VIDEO-BASED VERSUS PAPER-AND-PENCIL METHOD OF ASSESSMENT IN SITUATIONAL JUDGEMENT TESTS: SUBGROUP DIFFERENCES IN PERFORMANCE AND EXAMINEE REACTIONS

By

David Chan

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1996

ABSTRACT

VIDEO-BASED VERSUS PAPER-AND-PENCIL METHOD OF ASSESSMENT IN SITUATIONAL JUDGEMENT TESTS: SUBGROUP DIFFERENCES IN PERFORMANCE AND EXAMINEE REACTIONS

By David Chan

Based on a conceptual distinction between test content and method of testing, the present study examined several theoretically and practically important effects relating race, reading comprehension, method of assessment, face validity perceptions, and performance on a situational judgement test, using a sample of 241 psychology undergraduates (113 Blacks; 128 Whites). Results showed that the Black-White differences in situational judgement test performance and in face validity reactions to the test were substantially smaller in the video-based method of testing than in the paper-and-pencil method. The Race X Method interaction effect on test performance was attributable to differences in reading comprehension and face validity reactions associated with race and method of testing. Implications of the findings are discussed in the context of research on adverse impact and examinee test reactions.

Dedicated to Scpum Bin Kasmm and Kong Kin Seng

ACKNOWLEDGEMENTS

This thesis could not have been successfully completed without the encouragement and support of a number of people. The most important individual is Neal Schmitt, the chair of my thesis committee. Neal has given me valuable guidance and support throughout the research process. He impresses me not just because of his professional expertise, but also because of his enormous efforts and great patience in the development of graduate students. The experience of working with Neal on the thesis and other research projects has made a great impact on my professional life. Neal has personified my ideal professor. Dan Ilgen and Rick DeShon, the other two members of my thesis committee, have made very significant contributions to my graduate training.
My first encounter with Dan was actually "on paper": he authored the I/O psychology textbook I used in my undergraduate days back in Singapore. In fact, Neal and Dan were the primary reasons for my choice of graduate school. In retrospect, travelling thousands of miles across the ocean to Michigan State was well worth it. Dan's "motivational" seminar has had lasting motivational effects on me. Rick was the first professor I worked with at Michigan State. His zeal for his work has always impressed me. From Neal and Rick, I have come to appreciate the meaning of "research interests". Two other individuals, Kevin Ford and Steve Kozlowski, have indirectly contributed to the present piece of work. Both were instrumental in developing my fundamental expertise in I/O psychology, without which I could not have completed the thesis so smoothly. Finally, I must thank two groups of people who have provided me valuable social support. To fellow I/O graduate students at Michigan State, I am grateful for making me comfortable in a foreign land. To my dear friends in Singapore, I am grateful for their concern and constant updates on things back home.

David Chan

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
    Overview
    Two Conflicting Goals in Personnel Selection
    Subgroup Differences on Selection Tests
    Attempts at Reducing Adverse Impact
    Work Sample Tests: Validity and Adverse Impact
    The Logic and Problems in Development and Use of Work Samples
    Assessment Centers
    The Goldstein et al. (1993) Study
    The Present Study
    The Logic of Situational Judgement Tests
    Situational Judgement Tests: Simulation Fidelity and Predictive Validity
    Situational Judgement Tests and Adverse Impact
    Video-Based Situational Judgement Tests
        Hypothesis 1
        Hypothesis 2
        Hypothesis 3
    Examinee Test Reactions
    Face Validity and Predictive Validity Perceptions
        Hypotheses 4
    Face Validity and Method of Testing
        Hypothesis 5
    Subgroup Membership and Face Validity
        Hypothesis 6
    Factor Invariance across Methods of Testing
METHOD
    Examinees
    Development of Situational Judgement Test
    Measures of Examinee Test Reactions
    Reading Comprehension, Cognitive Ability, and Personality Tests
    Design
    Procedure
    Analyses
RESULTS
    Relationships between Race, Reading Comprehension, Method of Assessment, and Performance on Situational Judgement Test
    Factorial Invariance across Method Groups
    Effects of Method of Assessment on Differential Subgroup Performance on Individual Situational Judgement Constructs
    Face Validity and Predictive Validity Perceptions
    Face Validity and Method of Testing
    Subgroup Membership and Face Validity
    Relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance
DISCUSSION
    Method-Content Distinction
    Test Reactions
    Factorial Invariance of Test Responses across Assessment Methods
    Measurement Errors and Effect Size Estimates
    Limitations and Future Research
    Conclusion
REFERENCES
Appendix A. A Priori Power Analyses
Appendix B. Example of a Paper-and-Pencil Vignette
Appendix C. Test Reactions Questionnaire
Appendix D. Means, Standard Deviations, Reliabilities, and Intercorrelations of Study Variables broken down by Race
Appendix E. Covariance Matrices of Indicators for Situational Judgement Factors
Appendix F. Covariance Matrices of Indicators for Situational Judgement Factors and Personality Factors

LIST OF TABLES

Table 1 - Means, Standard Deviations, Reliabilities, and Intercorrelations of Study Variables
Table 2 - Summary of Hierarchical Regressions of Situational Judgement Test Performance on Race, Reading Comprehension, and Method of Assessment (N = 241)
Table 3 - Means, Standard Deviations, Reliabilities, and Intercorrelations (Observed and Corrected) of Situational Judgement Scales broken down by Method Groups
Table 4 - Fit Indices Associated with Multiple-Group Confirmatory Factor Analytic Models Tested in Assessment of Measurement Invariance of Situational Judgement Scores across Paper-and-Pencil Method Group (N = 121) and Video-Based Method Group (N = 120)
Table 5 - Means, Standard Deviations, and Intercorrelations between Situational Judgement Indicators and Personality Indicators broken down by Method Groups
Table 6 - Situational Judgement Factors: Subgroup Means, Standard Deviations, and Associated Effect Sizes for Paper-and-Pencil Method and Video-Based Method of Assessment
Table 7 - Hierarchical Regressions for Face Validity Perceptions and Situational Judgement Test Performance (N = 241)
Table 8 - Means, Standard Deviations, Reliabilities, and Intercorrelations of Study Variables broken down by Race

LIST OF FIGURES

Figure 1 - Hypothesis 1: Predicted Race X Method Interaction on Situational Judgement Test Performance
Figure 2 - Hypothesis 2: Predicted Method X Reading Comprehension Interaction on Situational Judgement Test Performance
Figure 3 - Hypothesis 3: Method X Reading Comprehension Interaction as an Explanation for Race X Method Interaction on Situational Judgement Test Performance
Figure 4 - Hypothesis 6: Predicted Race X Method Interaction on Face Validity Perceptions of Situational Judgement Test
Figure 5 - Race X Method Interaction on Situational Judgement Test Performance
Figure 6 - Method X Reading Comprehension Interaction on Situational Judgement Test Performance
Figure 7 - Race X Method Interaction on Situational Judgement Test Performance after Controlling for Effect of Method X Reading Comprehension Interaction
Figure 8 - Confirmatory Factor Analytic Model with Associated Common Metric Standardized Factor Loadings and Factor Correlations for Both Method Groups (* p < .05)
Figure 9 - Race X Method Interaction on Face Validity Perceptions of Situational Judgement Test
Figure 10 - Race X Method Interaction on Situational Judgement Test Performance after Controlling for Effects of Method X Reading Comprehension Interaction and Face Validity Perceptions
Figure 11 - Relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance

INTRODUCTION

Overview

The present study examines the effects of a video-based versus a paper-and-pencil method of assessment on adverse impact and examinee reactions in a situational judgement test. The dependent variables of interest are test performance and examinee test reactions. Making the important distinction between test method and test content (Hunter & Hunter, 1984), test content is held constant across the two methods of testing so as to isolate subgroup differences on the dependent variables due solely to test method. The research problem leading to the present study will first be identified, and the theoretical issues and practical concerns in personnel selection constituting the research problem will be explicated. Two conflicting goals in personnel selection will first be noted. The issue of adverse impact is then discussed, and attempts to reduce adverse impact in selection are reviewed from the research on work samples and assessment centers. This review leads to the focal selection procedure in the present study, namely the situational judgement test, which is becoming increasingly popular in the research and practice of personnel selection. The relationship between the logic of the test and its associated levels of adverse impact is discussed. The recent research on examinee test reactions is then introduced. The frequently neglected but important issue of differential subgroup attitudes is examined and related to the distinction between test content and method of testing. Based on the literature review and conceptual analysis in the Introduction, hypotheses for the present study are presented. The hypotheses were tested in a sample of 241 undergraduates (113 Blacks, 128 Whites). Results supported the hypotheses. Limitations, contributions, and implications of the present study are discussed.

Two Conflicting Goals in Personnel Selection

A crucial element in the achievement of organizational goals is the selection of individuals with high ability to perform their jobs. Hence, the primary focus of personnel selection research and personnel selection procedures has always been the maximization of predictive efficiency by identifying and selecting individuals with the highest job-relevant ability.
There has been a vast amount of empirical research on the validity and utility of selection procedures. Meta-analyses of these primary findings indicate that for a wide variety of jobs, valid measures of job-relevant ability dimensions can be developed and used to select high-potential individuals. For example, paper-and-pencil measures of cognitive ability are valid predictors of most jobs in the U.S. economy (Hunter & Hunter, 1984; Schmidt & Hunter, 1981). Assessment centers have consistently demonstrated validities for jobs involving managerial skills (Gaugler, Rosenthal, Thornton, & Bentson, 1987). Work samples (Schmitt, Gooding, Noe, & Kirsch, 1984) and biographical information (Reilly & Chao, 1982) are valid predictors of important job outcomes, and even interviews, when rigorously structured and administered, appear to be valid measures of job-relevant dimensions (McDaniel, Whetzel, Schmidt, & Maurer, 1994). Utility studies have also shown that valid selection procedures can make substantial economic contributions to organizational productivity (e.g., Boudreau, 1983).

However, organizational productivity is not the only goal to be considered by the employer when selecting individuals. Schmitt and Noe (1986) noted that at least since the passage of the Civil Rights Act of 1964, political and legal demands have forced employers to consider a second and frequently conflicting goal, namely, equal employment opportunities for various subgroups (minorities and women) in American society. In 1965, President Johnson issued Executive Order 11246, which required that all Federal contractors and subcontractors take affirmative action to ensure that employees are treated without regard to race, color, sex, religion, and national origin. This order, the passage of the Civil Rights Act of 1964, and subsequent court cases concerning charges of discriminatory use of tests constituted the zeitgeist for personnel researchers examining differences in validity of selection procedures across subgroups. Schmitt and Noe (1986) provided a summary of the research and issues on subgroup differences in test performance and differences in validity of tests across subgroups, including both differential validity and differential prediction. Whereas there is not much data regarding subgroup differences in validities of other predictors, the findings on subgroup differences (in particular, Black-White differences) in performance on paper-and-pencil measures of cognitive ability are well established.

There is little evidence of a Black-White difference in validity coefficients for paper-and-pencil measures of cognitive ability (i.e., little evidence of differential validity). Differential validity occurs when there is a significant difference between observed validities for two subgroups. Reviews of research have shown that differential validity is generally absent (Jensen, 1980; Linn, 1978) and, when it is observed, the validity differences between Blacks and Whites are small and trivial (Cascio, 1982; Hunter, Schmidt, & Hunter, 1979). Moreover, Bobko and Bartlett (1978) have successfully argued that differential validity per se would not be a sufficient indicator of test bias. For example, different subgroup validity coefficients may result when two groups differ in variability even when their prediction systems are identical.
Although there is little evidence of differential validity, there is extensive research demonstrating a sizable difference in test means between Black and White subgroups, with Blacks on average scoring about one standard deviation below Whites (e.g., Hunter & Hunter, 1984; Loehlin, Lindzey, & Spuhler, 1975; Schmidt, Greenthal, Hunter, Berner, & Seaton, 1977). Despite the absence of differences in subgroup validity coefficients, the use of paper-and-pencil measures of cognitive ability to select in a manner that optimizes predicted performance will still result in the hiring of a small number of Blacks relative to Whites, because the Black subgroup mean score on the test is substantially lower than the White subgroup mean score. Hence, there is a conflict between the optimization of predicted performance (i.e., the goal of organizational productivity) and the goal of equal subgroup representation in selection.

A similar conflict arises with respect to the assessment of differential prediction, which is more directly related to issues of test bias than is differential validity. Differential prediction has become the accepted way of evaluating test bias among most psychometricians. Evaluation of differential prediction involves consideration of validity coefficients, standard errors of estimate, and the regression line describing the predictor-criterion relationship. Predictions of performance are made using regression equations. According to the Cleary (1968) model of test bias, which is endorsed by both the Uniform Guidelines on Employee Selection Procedures (1978) and the Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 1987), a test is biased when a common regression equation results in either over- or under-prediction of subgroup performance; that is, a test is biased when there is differential prediction. Over-prediction for a protected minority group resulting from the use of a common regression line indicates test bias in the psychometric sense but is generally not considered a problem of fairness (SIOP, 1987). Hence, whereas test bias is a technical, psychometric issue, fairness is a social notion involving consideration of valued outcomes (SIOP, 1978).

The Cleary (1968) approach requires the use of separate subgroup regression equations when the equations are significantly different. The use of separate equations to provide a single rank order of applicants based on predicted scores will result in hiring the best qualified individuals, hence optimizing predicted performance, but it will result in the selection of unequal proportions of members of various subgroups when subgroup mean performance differs. Schmitt and Noe (1986) noted that most research evidence indicates that the use of a single common equation results in slight over-prediction of minority group performance, whereas the use of separate equations results in average predicted performance for subgroups that is identical to the actual performance difference, hence satisfying Cleary's (1968) criterion. Schmitt and Noe (1986) have also shown that the use of separate regression equations as prescribed by Cleary (1968) will result in selecting relatively few members of the lower-scoring group (which is frequently the minority group) at all levels of selection ratios. In short, paper-and-pencil measures of cognitive ability are valid predictors of job performance and are generally unbiased toward minority subgroup members in the sense that their predicted performance matches their actual performance.
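To make the differential prediction logic concrete, the following sketch (Python, using simulated scores; the numbers, variable names, and the simple least-squares fit are illustrative assumptions rather than the analyses reported in this thesis) fits a common regression equation to a pooled sample and compares each subgroup's predicted criterion mean with its actual mean. A positive discrepancy indicates over-prediction in Cleary's (1968) sense; if the two subgroup regression equations were identical, both discrepancies would be near zero.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated predictor (test) and criterion (performance) scores.
    # The minority group is given a slightly lower criterion intercept here
    # purely to show how over-prediction by a common equation would appear.
    x_majority = rng.normal(100, 15, 500)
    x_minority = rng.normal(85, 15, 500)
    y_majority = 0.5 * x_majority + rng.normal(0, 10, 500)
    y_minority = 0.5 * x_minority - 3 + rng.normal(0, 10, 500)

    x_all = np.concatenate([x_majority, x_minority])
    y_all = np.concatenate([y_majority, y_minority])

    # Common regression equation fitted to the pooled sample.
    slope, intercept = np.polyfit(x_all, y_all, 1)

    for label, x, y in [("Majority", x_majority, y_majority),
                        ("Minority", x_minority, y_minority)]:
        predicted_mean = slope * x.mean() + intercept
        discrepancy = predicted_mean - y.mean()  # > 0 means over-prediction
        print(f"{label}: predicted mean {predicted_mean:.2f}, "
              f"actual mean {y.mean():.2f}, discrepancy {discrepancy:+.2f}")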
However, sizable subgroup differences in test performance exist (with Blacks scoring on average one standard deviation below Whites). Top-down selection on the basis of test scores results in the hiring of relatively small proportions of minority subgroup members. In most cases, the use of paper-and-pencil measures of cognitive ability in selection produces a high level of "adverse impact" on minority hiring rates (e.g., Hunter & Hunter, 1984; Schmidt et al., 1977), defined by the Uniform Guidelines on Employee Selection Procedures (1978) as failure to meet the 4/5 rule; that is, the ratio of the proportion of minority applicants hired to the proportion of majority applicants hired should not be lower than 4/5.

The conflict between the goal of organizational productivity and the goal of equal subgroup representation prompted personnel researchers to try to develop valid predictors of performance that have levels of adverse impact lower than those associated with traditional paper-and-pencil measures of cognitive ability. A promising approach is the search for alternative predictor constructs. In this approach, researchers attempt to go beyond the construct of cognitive ability as assessed by traditional paper-and-pencil measures to measure other job-relevant abilities and attributes. The logic of the construct-oriented approach to reducing adverse impact in selection is that paper-and-pencil measures of cognitive ability, while valid, may be measuring those determinants of job success on which subgroup differences are largest and, conversely, may fail to measure important determinants of job success on which such differences are smaller or nonexistent (Schmidt et al., 1977). However, the majority of studies involving a search for alternative predictors have not adopted a construct-oriented approach. Instead, efforts have been directed at the development of alternative selection methods such as work samples, assessment centers, and biodata, and these efforts were often atheoretical. As argued later, this neglect of constructs resulted in a serious confound between method of testing and test content in assessment, which has largely hindered our understanding of the nature of subgroup differences in performance on selection instruments. Many of these issues are best illustrated with the development and use of work sample tests as alternative predictors (to paper-and-pencil cognitive ability tests) of job performance.

The next section will summarize the research on the validity and adverse impact of work sample tests. The logic of work sample tests and the problems associated with their development and use will then be explicated. Assessment centers, an alternative predictor closely related to work samples, will also be discussed to illustrate several issues and problems concerning the reduction of adverse impact. The discussion will lead to the consideration of situational judgement tests and their relationships to adverse impact and examinee test reactions, which are the subject of the present study.

In work sample tests, examinees are required to perform the same behaviors that they would be required to perform on the job. Several reviews have demonstrated that work sample tests can be at least as predictive of job performance as paper-and-pencil cognitive ability tests. Hunter and Hunter (1984) found that paper-and-pencil cognitive ability tests were about equally as valid as work sample tests. Schmitt et al.'s (1984) meta-analysis found that the validities of work samples were superior to those of biodata and cognitive ability tests.
With respect to adverse impact, work samples appear to be advantageous compared to cognitive ability tests in that the mean difference between the scores of majority and minority subgroup members is typically smaller for work samples (Brugnoli, Campion, & Basen, 1979; Cascio & Phillips, 1979; Schmidt et al., 1977; Schmitt, Clause, & Pulakos, 1996; Wigdor & Green, 1991). For example, Schmidt et al. (1977) compared the adverse impact of a content-valid work sample test of metal trade skills to that of a well-constructed, content-valid paper-and-pencil achievement test for the same technical area. They found the typical one standard deviation Black-White subgroup difference for the paper-and-pencil test but found no significant difference between Blacks and Whites for the work sample test. Bernardin's (1984) meta-analytic review of Black-White differences on work sample tests found an average difference of .54 standard deviation units favoring Whites.

In order to explain the positive results of the work sample test regarding its validity and small adverse impact, one needs to examine the logic of the development and use of work samples. The interest in work samples can in part be traced to Wernimont and Campbell (1968), who contended that samples of the kinds of behaviors actually required to be performed on the job would predict future job performance better than scores on typical cognitive ability tests. The authors argued that scores on ability tests are merely "signs" which are less similar to, and hence less related to, actual job performance than are "samples" of the work on the job. The implicit assumption is that the more similar a test is to the actual job, the higher the validity of the test. In accounting for the predictive success of work sample tests, Asher and Sciarrino (1974) stated that a strong relationship between the content of the job and the content of the selection method must exist for high predictive validity to occur. Smith and George (1992) argued that Asher and Sciarrino's (1974) "point-to-point" validation theory of work samples can be used as an explanation for the success and failure of most selection methods.

However, the notion of "similarity" between test and actual job has never been sufficiently explicated in most studies which examined tests purportedly similar to actual job content. This is certainly true in the case of work sample tests. Given that performance on most jobs is multidimensional, a work sample successfully replicating a portion of the job is almost always multidimensional. However, little if any work has been directed at understanding the nature of the constructs measured in work sample tests. Schmitt et al. (1996) reviewed studies on subgroup differences published from 1964 to 1994 in three major journals concerned with personnel selection and attempted to ascertain the nature of the constructs measured and the methods used in those studies. With regard to work samples, the authors found that it was almost never clear what construct(s) were measured. In the same review, the authors also noted that the data available regarding the lower adverse impact associated with work samples relative to paper-and-pencil cognitive ability tests are not very useful in providing an understanding of the reasons for the reduction in adverse impact. This is due to an inherent confound between method and test content in almost all studies comparing subgroup differences on the two types of tests. In these studies, work samples and cognitive ability tests differed in the method of testing
(e.g., paper-and-pencil versus actual task performance) and, presumably, in the nature of the constructs measured (e.g., general cognitive ability versus interpersonal-oriented dimensions) due to different item content between the two tests.

The distinction between method and content (Hunter & Hunter, 1984) is crucial to the study of reduction in adverse impact. If method and test content are disconfounded in a study, then subgroup differences due to method and subgroup differences due to test content can be isolated. In principle, we can then reduce or even eliminate adverse impact by changing the method of testing or the test content, depending on the job-relevance of the given constructs. For example, two different methods of testing may have the same test content measuring the same job-relevant construct, but one method may produce less adverse impact than the other. Adverse impact due solely to method of testing can then be eliminated by using the method with lower adverse impact, assuming that method is job-irrelevant. On the other hand, by controlling method, we may be able to ascertain different test contents that differ in the size of the subgroup differences they produce. For example, subgroup differences may be smaller for test content tapping interpersonal skills than for test content tapping cognitive constructs (Hough, 1994; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990). Assuming both types of constructs are job-relevant, adverse impact can be reduced and validity can be increased by expanding the predictor space beyond the measurement of cognitive constructs to include the measurement of interpersonal skill constructs. The present study differentiates method from content by comparing two different testing methods (paper-and-pencil versus video-based assessment) with the same set of test items. The importance of the method-content distinction in the present study will be elaborated later. In short, in terms of our understanding of the smaller adverse impact associated with work samples relative to cognitive ability tests, more work is certainly needed on the nature of the constructs measured in work samples, their representativeness of the job, and issues relating to the physical fidelity and psychological fidelity of the simulation (McHenry & Schmitt, 1994).

There are several practical problems that have limited the use of work samples. Despite their validity and low adverse impact, many organizations have not incorporated work samples into their selection procedures because of the high cost of testing. Work samples are often expensive to develop and administer, especially when raters are required. Many work sample tests are administered one on one by a test administrator who often has to score the results by hand (McHenry & Schmitt, 1994). To ensure reliability, more raters are required, which increases the cost of testing. Costs are further increased when complex administration and scoring procedures demand rigorous assessor training (Wigdor & Green, 1991). In certain cases, work sample tests may not be practical because of the potential danger to the applicant inherent in the tasks. Jobs involving high physical demands may be least practical for the development of work sample tests, and yet these may be the jobs where work samples are most predictive.
Finally, some jobs may be sufficiently technical and involve a substantial amount of job-specific knowledge such that it would not be possible to develop a work sample that is representative of a significant portion of the job and at the same time applicable to applicants (who do not have the knowledge and skills of experienced incumbents).

Assessment Centers

Several issues and problems concerning the reduction of adverse impact using alternative predictors can be illustrated with the development and use of a selection instrument closely related to work samples, namely, the assessment center. Although primary research and reviews in personnel selection have almost always treated assessment centers as a type of predictor distinct from work samples, the two predictors have much in common. Both assessment centers and work samples are based on a behavioral sampling assumption, and they share the basic tenet of the behavioral consistency approach that the best predictor of future performance is present or past performance or behavior of the same type. Both are simulations in the sense that the task stimuli are constructed such that they mimic actual job situations and elicit responses which are purported indicators of how assessees would handle the task situations if they were actually occurring on the job. Assessment centers are more like "samples" than like "signs" in the sense distinguished by Wernimont and Campbell (1968). Both work samples and assessment centers are almost always multidimensional, reflecting the multidimensionality of the target job they mimic. Both also tend to have high face validity. Both often require trained raters who are also subject matter experts on the target job, and both are expensive to develop and administer. The distinguishing feature of the assessment center is its multiexercise-multirater methodology. Also, although in principle the multiexercise-multirater methodology can be applied to almost any job, assessment centers have historically been restricted to the assessment of general managerial dimensions. Because it typically assesses general managerial dimensions as opposed to job-specific technical knowledge and skills, the assessment center is less likely to have the problem of inapplicability to inexperienced applicants faced by many work samples, alluded to earlier.

With respect to validity and adverse impact, research on assessment centers has demonstrated a pattern of findings similar to that of work samples. Like work samples, validities obtained for assessment centers are at least comparable to those observed for cognitive ability tests. At least two meta-analyses have found substantial validities for assessment centers. Schmitt et al. (1984) found an average validity of .41 across 21 studies. Gaugler et al. (1987) obtained an average validity of .34 based on 107 validity coefficients for various performance criteria from 50 studies. Like work samples, typical Black-White subgroup differences in assessment center performance are also substantially smaller than the one standard deviation difference observed for cognitive ability tests (e.g., Huck & Bray, 1976). Based on a sample of 2,910 candidates who were assessed for school administrator positions in 25 different assessment centers using the same set of exercises and dimensions, Schmitt (1993) found significant mean differences between Black and White subgroup members for 10 of 13 dimensions, ranging between two-thirds and three-fourths of a standard deviation in favor of Whites.
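Because subgroup differences throughout this review are expressed in standard deviation units, a brief illustration of that computation may be helpful. The sketch below (Python, with hypothetical summary statistics rather than data from the studies cited) forms the standardized mean difference using the pooled within-group standard deviation.

    import math

    def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
        """Pooled within-group standard deviation."""
        return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

    def standardized_difference(mean1, sd1, n1, mean2, sd2, n2):
        """Standardized mean difference d (positive values favor group 1)."""
        return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

    # Hypothetical dimension ratings for two subgroups.
    d = standardized_difference(mean1=3.6, sd1=0.8, n1=2000,
                                mean2=3.0, sd2=0.8, n2=900)
    print(f"d = {d:.2f} standard deviation units")  # prints d = 0.75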
In short, assessment centers by no means eliminate adverse impact, but they tend to show Black-White differences substantially smaller than those for cognitive ability tests. Whereas studies of work samples have tended to neglect the issue of constructs, a substantial amount of research has been devoted to the study of the construct validity of the dimensional ratings in assessment centers. However, our understanding of the nature of the constructs tapped in assessment centers is no better than in the case of work samples. Multitrait-multimethod studies have consistently reported low construct validity of dimensional ratings, and factor analyses of these ratings have produced "exercise factors" rather than dimensional factors (e.g., Chan, in press; Sackett & Dreher, 1982; Sackett & Harris, 1983; Schneider & Schmitt, 1992; Turnage & Muchinsky, 1982). In describing the lack of construct validity in assessment center research, Klimoski and Brickner (1987) noted that we know assessment centers work in the sense that they have predictive validity, but we do not know why, insofar as we have little understanding of the nature of the constructs tapped by assessor ratings.

Just as in the case of work samples, it is tempting to attribute the smaller subgroup difference observed in assessment center performance (relative to paper-and-pencil cognitive ability tests) to the nature of the constructs tapped by the test content. As with work samples, one may hypothesize that the multidimensionality of assessment centers includes both cognitive and non-cognitive constructs (e.g., interpersonal dimensions) and that subgroup differences on non-cognitive constructs may be smaller or even non-existent compared to cognitive constructs, such that the overall ratings in assessment centers exhibit lower adverse impact relative to paper-and-pencil measures of cognitive constructs. However, as mentioned earlier, the test of such a hypothesis would require a design eliminating the method-content confound in the comparison between assessment center performance and performance on cognitive ability tests.

Unfortunately, a fully crossed content by method factorial design is often not feasible. For example, in a paper-and-pencil methodology, it is difficult to develop test content tapping many of the usual assessment center dimensions (e.g., leadership, decisiveness) and sometimes impossible to do so (e.g., oral communication). Schmitt et al. (1996) reviewed studies on subgroup differences and found no study which employed the method by content design. However, they did find one unpublished study (Goldstein, Braverman, & Chung, 1993) reporting subgroup differences measured using different methods. The Goldstein et al. (1993) study will now be described in order to discuss the core issues associated with the method by content design approach to examining subgroup differences. Some of the problems with the design used in Goldstein et al. (1993) will be addressed in the present study.

The Goldstein et al. (1993) Study

The purpose of Goldstein et al. (1993) was to examine the effects of different testing methods on subgroup differences. The authors attempted to address the "method versus content" issue by developing four tests that purportedly assess the same six abilities. The sample consisted of 29 Whites and 13 Blacks who were being assessed for promotion in a police organization. The four tests used, which were construed as work samples by the authors, were similar to the typical exercises in an assessment center.
They were a written in-basket test, a role-play exercise in which the examinee conducts a performance appraisal counseling session, a simulation planning exercise requiring the examinee to develop contingency plans for a hypothetical event, and a simulation exercise in which the examinee supervises activities associated with the event that he or she had prepared for in the simulation planning exercise. The six abilities assessed across all four tests were the ability to pay attention to details, to adjust communication to the level of understanding of the other person, to communicate using proper grammar and wording, to put materials in a logical sequence, to adjust action or decision in light of new information, and to maintain composure in stressful situations.

Citing Helms (1992), Goldstein et al. argued that the African-centered values and beliefs of Blacks emphasize communalism, movement, and orality, which would in turn influence their test-taking performance. Accordingly, Blacks have a disadvantage on paper-and-pencil tests compared to Whites due to the strong written component required for successful performance on such measures. The written component is construed as a requirement of the test method and is not part of the construct intended to be assessed by the test content. The authors hypothesized that a testing method requiring a written response mode favors Whites over Blacks, whereas tests that are more interactive, behaviorally oriented, and aurally-/orally-oriented would exhibit less adverse impact. Hence, it was predicted that the written in-basket test would have a higher level of adverse impact relative to the other three tests, which were more interactive, behavioral, and aural/oral in nature. The results were consistent with the hypothesis. The written in-basket test had a substantially higher level of adverse impact (.47 to .87, average = .65) when compared to the simulation planning exercise (.41 to .64, average = .48) and the simulation exercise (.22 to .36, average = .30). For the role play, which is presumably the most interactive-oriented exercise, Blacks performed better than Whites (.38 to .64, average = .58).

Schmitt et al. (1996) noted several limitations of Goldstein et al.'s (1993) study. The sample sizes were small, with only 13 Blacks and 29 Whites. No reliability estimates were reported for the various measures. With low reliabilities, true subgroup differences will not be detected. It is possible that some of the more interactive measures (e.g., the simulation exercise) are substantially less reliable than paper-and-pencil measures, such that true subgroup differences were not detected on the former. That is, it was not clear if Goldstein et al.'s (1993) findings were due to true subgroup differences or simply an artifact of differential reliability in measurement. In the present study, which compared two methods of assessment in a situational judgement test, the reliabilities of each measurement method were estimated so that effect sizes could be corrected for unreliability in measurement. Adequate sample sizes were also employed to ensure sufficient power. Schmitt et al. (1996) also noted that there was no evidence establishing the equivalence of constructs across methods in Goldstein et al.'s (1993) study.
This is an important concern because, as argued earlier, the adequacy of a method by content design for isolating method sources and content sources of subgroup differences presupposes an equivalence of constructs across methods when test content is held constant across methods. In Goldstein et al. (1993), the content of the task stimuli (i.e., test content) appeared to be quite different across test methods. For example, it was not clear if the "ability to maintain composure under stressful situations" elicited by the preparation of memos in the in-basket test (and rated by assessors) was in fact the same construct as the purportedly same dimension elicited (and rated) by the interactions in the counseling situation of the role-play exercise. In the present study on situational judgement, the issue of construct equivalence was addressed by administering the same test items using two different methods of stimulus presentation and empirically testing the factorial invariance of test responses across the two methods.

Another limitation of Goldstein et al. (1993) was that the ability dimensions described were relatively specific, and their results may not be generalizable to the broader psychological constructs of interest typically assessed by the common predictor instruments in personnel selection, such as cognitive ability tests, personality measures, and work samples. The present study on situational judgement employed more global constructs, such as the interpersonal skill dimensions of conflict resolution and empathy. In order to provide a more rigorous test of the hypothesis that a significant amount of the Black-White difference in performance on paper-and-pencil tests is due solely to the reading/written requirements inherent in the method of testing and independent of the construct measured, the present study also administered a reading comprehension test to both Blacks and Whites. The hypothesis would predict lower reading comprehension scores for Blacks and that the Black-White subgroup difference in performance on the paper-and-pencil method of testing will be reduced when reading comprehension is controlled.

The Present Study

The present study examined the effects of a video-based versus a paper-and-pencil method of assessment on adverse impact and examinee test reactions in a situational judgement test. With respect to adverse impact, test content (and, presumably, the constructs measured) was held constant across two different methods of testing so as to isolate subgroup differences due solely to test methods. As mentioned earlier, construct equivalence across methods was empirically tested. Reliabilities of measurement were estimated to obtain corrected effect size estimates. A reading comprehension test was administered to provide an additional test of the hypothesis that a significant amount of the Black-White difference in performance on paper-and-pencil tests is due solely to the reading/written requirements inherent in the method of testing, independent of the test content. As discussed thus far, the study addressed the issues and problems associated with evaluating the effect of test method and test content on the size of subgroup differences in test performance. The use of the present situational judgement test circumvented many of the conceptual and practical problems associated with typical work samples and assessment centers explicated earlier. The logic and research on situational judgement tests and their relationship to adverse impact will be discussed next.
Examinee test reactions, the second dependent variable in the present study, will then be introduced. The study of test reactions has become increasingly important in recent personnel selection research, and the links between test reactions and the method-content distinction will be explicated.

The Logic of Situational Judgement Tests

In a typical situational judgement test, examinees are presented with a hypothetical scenario describing a work situation in which a problem has arisen. The work situation may be a possible actual situation on the target job or a situation constructed such that it is psychologically isomorphic to an actual situation. The latter would address the problem faced by typical work samples concerning the inapplicability of test items to inexperienced applicants due to the requirement of job-specific knowledge and experience on some jobs. Either way, the work situations on the test are developed on the basis of job analysis data, often including a critical-incident analysis involving subject matter experts. The individual situational judgement problem is almost always multidimensional in nature, in the sense that an adequate solution or handling of the problem would involve several ability and skill dimensions. Alternative responses are presented to the examinee following the description of the situation. Examinees' scores on the test are computed based on their endorsement of the responses. In tests employing a forced-choice format, examinees are typically asked to choose the most effective response, or to choose the most effective and the least effective response. In another format (the format used in the present study), examinees are asked to rate each response in terms of its effectiveness, usually using some form of a Likert-type scale. The scoring key is developed from prior effectiveness ratings of response alternatives obtained from subject matter experts. The decision rules for identifying the most or least effective response or for arriving at the score for each effectiveness rating given by examinees vary from test to test. Regardless of the precise rules used, statistical analyses and sometimes content analyses are performed on the subject matter expert ratings to ensure reliability and agreement in the ratings used for the development of the scoring key.

Often, the objective of developing a situational test is to sample behaviors from the domain of job performance rather than to measure any particular construct or predispositional sign. Hence, like work samples, situational judgement tests are more like "samples" than like "signs". However, Motowidlo, Dunnette, and Carter (1990) noted that it would be interesting to discover what constructs are measured by the test. The importance of construct orientation and the distinction between method and content for the examination of adverse impact has been discussed earlier. Identifying the nature of the constructs measured in situational judgement tests will provide a better understanding of the causes of adverse impact and help in the development of ways of reducing the level of adverse impact associated with a given selection instrument.

Situational Judgement Tests: Simulation Fidelity and Predictive Validity

Work samples, assessment centers, and situational judgement tests may all be construed as forms of simulations.
In these simulations, task stimuli are constructed such that they mimic actual job situations and elicit responses which are purported indicators of how assessees would handle the task situations if they were actually occurring on the job (Motowidlo et al., 1990). Work samples are on the high end of the continuum of simulation fidelity because they use very realistic materials to represent the task situation, and examinees may respond in a manner almost identical to the way they would if they were actually on the job. As tests move toward the low end of the fidelity continuum, stimuli and responses are less faithful approximations of actual job stimuli and responses. The situational interview (Latham & Saari, 1984; Latham, Saari, Pursell, & Campion, 1980; Weekley & Gier, 1987) is a well-known example of a simulation on the lower end of the fidelity continuum. Latham et al. (1980) reported a situational interview with a validity of .46, and Latham and Saari (1984) reported a validity of .14.

Motowidlo et al. (1990) developed a paper-and-pencil type of situational judgement test which they termed a "low-fidelity" simulation. In this test, the task stimulus (i.e., the work situation) is presented in written form, and examinees are required to endorse alternative responses also described in written form. The test resembles similar situational inventories developed in early research, such as the Supervisory Practices Test (Bruce & Learner, 1958), the "How Supervise?" (File & Remmer, 1971), and the Leadership Evaluation and Development Scale (Tenopyr, 1969). The paper-and-pencil method of administering the situational judgement test in the present study is a type of "low-fidelity" simulation with a format similar to the test developed by Motowidlo et al. (1990), except that instead of a forced-choice response format, the present test requires examinees to give effectiveness ratings for each of the alternative responses.

Motowidlo et al. (1990) noted that although simulations with higher fidelity should be better predictors of actual job performance than those with lower fidelity according to the basic tenet of behavioral consistency, there have been no systematic studies of the relationship between differences in fidelity and incremental predictive value. High-fidelity simulations such as work samples and assessment centers are expensive to develop and administer, and the cost of developing such simulations may not offset the gain in predictive value over lower fidelity simulations (Motowidlo et al., 1990). Whereas it is expensive and often not feasible to administer work samples or assessment centers to a large group of examinees in one testing session, situational judgement tests can be administered to relatively large numbers of examinees in one session. In the case of a paper-and-pencil format of the test, the scale of testing effort and expense is identical to that of traditional paper-and-pencil measures of cognitive ability or personality. Moreover, work samples and assessment centers almost always require substantial involvement of subject matter experts for rating or scoring of individual examinee performance at the time of testing, and ongoing assessor training costs can be high. On the other hand, the primary involvement of subject matter experts in the situational judgement test is in the development of the test stimuli (work situations) and the scoring key.
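As a brief aside on how such a subject-matter-expert key can be applied at scoring time, the minimal sketch below (Python, with a hypothetical item, hypothetical ratings, and one simple distance-based scoring rule; the actual scoring procedure of the present test is described in the Method section) scores an examinee's effectiveness ratings for one situation against the mean ratings provided by subject matter experts. Other keys award points only for agreement on the most and least effective responses; the choice of rule is part of test development rather than test administration.

    # Hypothetical effectiveness ratings (1 = very ineffective, 7 = very effective)
    # for the four response options of one situational judgement item.
    sme_key = {"a": 6.2, "b": 2.1, "c": 4.8, "d": 3.0}   # mean SME ratings
    examinee = {"a": 7, "b": 2, "c": 5, "d": 5}          # one examinee's ratings

    # One simple rule: the closer the examinee's ratings are to the SME means,
    # the higher the item score (negative mean absolute deviation).
    item_score = -sum(abs(examinee[r] - sme_key[r]) for r in sme_key) / len(sme_key)
    print(f"Item score: {item_score:.2f}")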
Hence, from a practical viewpoint, it is worthwhile to explore the predictive validity of low-fidelity simulations such as the situational judgement test. Using a sample of approximately 120 management incumbents, Motowidlo et al. (1990) found positive validities for their low-fidelity situational judgement test in predicting supervisory ratings of performance (.28 to .37, p < .01). Further evidence of validity for the test was provided by Motowidlo and Tippins (1993), in which two studies were reported. Study 1 employed a predictive validation design and found an average validity of .25 predicting supervisory performance ratings in a sample of 36 management applicants. Study 2 employed a concurrent validation design and found an average validity of .20 predicting supervisory performance ratings in a sample of 109 to 128 marketing incumbents. Pulakos, Schmitt, and Keenan (1994) developed a situational judgement test similar in format to Motowidlo et al.'s (1990) low-fidelity simulation test. Using a sample of incumbents from a large federal investigative agency, they found significant validities for the test in predicting two performance criteria, namely, investigative proficiency (.20) and effort and professionalism (.13).

Motowidlo et al. (1990) found a Black-White difference of .21 standard deviation favoring Whites in their sample of incumbents and a difference of .38 standard deviation favoring Whites in their sample of applicants. Although these differences were nonsignificant, there is reason for caution against concluding that situational judgement tests successfully eliminate adverse impact. The numbers of Blacks in Motowidlo et al.'s (1990) samples were small (ranging from 21 to 31), and the power to detect a difference of .5 standard deviation was only between 47% and 68% (Cohen, 1977). Of the two studies reported in Motowidlo and Tippins (1993), one provided no information on Black-White differences, as the sample of Blacks was too small for subgroup analysis (N = 16). The other study reported that Blacks scored lower than Whites by .38 standard deviation (44 Blacks vs. 178 Whites). Weighting the Black-White differences reported in Motowidlo et al. (1990) and Motowidlo and Tippins (1993) by their sample sizes yielded an average adverse impact of .32 standard deviation (total of 97 Blacks vs. 378 Whites). In the situational judgement test developed by Pulakos, Schmitt, and Keenan (1994), Blacks scored lower than Whites by .41 standard deviation (100 Blacks vs. 259 Whites).

The above review showed that adverse impact levels of the paper-and-pencil type of situational judgement test appear to be substantially lower than the typical one standard deviation for cognitive ability tests, but the size of the Black-White difference is still at least moderate and is practically significant. A primary purpose of this study was to examine the possibility of reducing the Black-White difference on the situational judgement test by simply changing the method of stimulus presentation from paper-and-pencil delivery to video-based delivery while keeping test content constant. The theoretical rationale for this hypothesis was explicated earlier in the discussion of the Goldstein et al. (1993) study. By replacing the paper-and-pencil method, which requires reading comprehension, with the more interactive, behavioral, and orally-/aurally-oriented video-based method, the Black-White difference in test performance should be reduced.
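The .32 standard deviation figure reported above is a sample-size-weighted average of study-level differences. The short sketch below illustrates that arithmetic (Python; the sample sizes attached to each d are assumed for illustration only and are not the exact subsample sizes of the studies cited).

    def weighted_mean_d(effects):
        """Sample-size-weighted mean of study-level effect sizes.

        `effects` is a list of (d, n) pairs, where n is the total sample size
        on which each d is based.
        """
        total_n = sum(n for _, n in effects)
        return sum(d * n for d, n in effects) / total_n

    # Hypothetical study-level Black-White differences (d) and sample sizes.
    studies = [(0.21, 150), (0.38, 103), (0.38, 222)]
    print(f"Weighted average d = {weighted_mean_d(studies):.2f}")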
Although the advantages of using video-based testing in personnel selection were alluded to as early as Thorndike (1949), its actual use is relatively new, and there is an insufficient research base evaluating the psychometric properties and adverse impact of the assessment method. However, the few studies conducted did report some encouraging results for a video-based method of presenting the situational judgement test. Based on a KSAO analysis of 50 customer service jobs, Wilson Learning (1990) developed a video-based situational judgement test for the assessment of customer service skills. Using performance ratings as the criterion, the test was found to have a validity of .40 for a sample of 126 Canadian employees and .34 for a sample of 60 American employees. In another video-based test developed for transit operator selection, Smiderle, Perry, & Cronshaw (1994) reported a significant negative validity using number of complaints as the criterion, but no significant correlations were found between test scores and two other criteria, namely, commendations and a performance composite. Dalessio (1994) also found a significant average validity of .17 for a video-based test predicting turnover a year later using several samples of insurance agents (total N = 677). The present author located only one published study reporting the adverse impact level of a video-based situational judgement test. Smiderle et al. (1994) found no significant Black-White difference in test performance (46 Blacks vs. 267 Whites). However, the result was not corrected for unreliability of measurement. The low reliability of the test (alpha = .47) certainly attenuated the true Black-White difference. Moreover, the present author performed a power analysis (Cohen, 1988) on the data and found that the study had only a power of approximately 59% to detect a moderate effect size (d = .5) at α = .05. Hence, more research is needed to ascertain the adverse impact level of video-based situational judgement tests. The present study examined Black-White differences in performance on a video-based assessment and compared them with the difference on a paper-and-pencil format of the same situational judgement test. A priori power analyses were conducted to ensure adequate sample sizes, and reliabilities of the two measurements were estimated to correct for attenuation due to unreliability. The present study developed two formats of a single situational judgement test, differing in the method of testing (video-based versus paper-and-pencil presentation of the work situations) with test content held constant. As discussed earlier, Helms (1992) theorized that African-centered values and beliefs of Blacks emphasize communalism, movement, and orality at the expense of reading comprehension. The lack of emphasis on reading comprehension in turn influences their test-taking performance, resulting in Blacks having a disadvantage on paper-and-pencil tests compared to Whites because of the strong written component required for successful performance on such measures. Reviews have accumulated extensive research evidence showing a significant and substantial Black-White difference on paper-and-pencil measures of cognitively oriented constructs in favor of Whites; that is, a high level of adverse impact exists. Results from Motowidlo et al. (1990), Motowidlo & Tippins (1993), and Pulakos et al. (1994) indicated that Blacks also score lower than Whites on a paper-and-pencil type of situational judgement test.
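The power figures cited in this review follow standard two-group power calculations (Cohen, 1977, 1988). The sketch below uses a normal approximation to the power of a two-tailed independent-groups comparison; the group sizes are placeholders, and the approximation is not claimed to reproduce the exact percentages reported above.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(d, n1, n2, alpha=0.05):
    """Approximate power of a two-tailed independent-groups test to detect
    a standardized mean difference d, via the normal approximation."""
    ncp = d * sqrt(n1 * n2 / (n1 + n2))      # noncentrality parameter
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

# Placeholder subgroup sizes, not the exact Ns from the studies reviewed above.
print(round(power_two_sample(d=0.5, n1=30, n2=30), 2))
```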
Prior to testing the primary hypotheses concerning effects of test method on adverse impact, it was necessary in the present study to first replicate the previous findings that Blacks perform significantly more poorly than Whites on a situational judgement test presented in a paper-and-pencil format. Goldstein et al. (1993) and Schmitt et al. (1996) have argued that a testing method loaded with a strong reading/written component would tend to favor Whites over Blacks, whereas tests that are more interactive, behaviorally oriented, and aurally-/orally-oriented would exhibit less adverse impact. Based on this argument and Helms's (1992) theory, it was predicted that for performance on the situational judgement test,

H1: There will be a Race X Method interaction effect on test performance such that the Black-White difference in performance (favoring Whites) will be substantially smaller in the video-based method of testing than in the paper-and-pencil method.

The nature of the expected interaction is depicted in Figure 1.

[Figure 1. Hypothesis 1: Predicted Race X Method interaction on situational judgement test performance (predicted mean test performance by method of assessment for Black and White examinees).]

It was argued earlier in the paper that a significant amount of the Black-White difference in performance on paper-and-pencil tests could be due solely to the reading comprehension requirements inherent in the method of testing, independent of the test content. Two hypotheses were derived from this argument. One hypothesis related performance on the test to the method of testing and individuals' reading comprehension ability, whereas the other related test performance to method of testing, reading comprehension, and racial subgroup membership. With respect to method of testing and reading comprehension ability, it was expected that an individual's performance on the situational judgement test would be affected by his or her reading comprehension ability when the test was administered using the paper-and-pencil method, but that no such effect would exist when the test was administered using the video-based method. Hence, it was predicted that for performance on the situational judgement test,

H2: There will be a Method X Reading Comprehension interaction effect on test performance such that performance will be positively and significantly correlated with reading comprehension ability in the paper-and-pencil method of testing, whereas no significant correlation between test performance and reading comprehension ability will occur in the video-based method.

The nature of the expected interaction is depicted in Figure 2.

[Figure 2. Hypothesis 2: Predicted Method X Reading Comprehension interaction on situational judgement test performance.]

Previous research has shown that a Black-White subgroup difference exists on reading comprehension tests favoring Whites (e.g., Matthews, 1991; Scott, 1987). A Black-White difference in reading comprehension scores favoring Whites was expected to be replicated in the present study. Thus, a significant amount of the Black-White difference in performance on paper-and-pencil tests could be due solely to the reading comprehension requirements inherent in the method of testing, independent of the test content. That is, a substantial amount of the Race X Method interaction effect on test performance hypothesized in H1 could be due solely to the Method X Reading Comprehension interaction effect on test performance hypothesized in H2. Hence, it was predicted that:

H3: The Race X Method interaction effect on situational judgement test performance will diminish substantially after controlling for the Method X Reading Comprehension interaction effect on test performance.
The predicted relationship between the Race X Method interaction and the Method X Reading Comprehension interaction on test performance is depicted in Figure 3.

[Figure 3. Hypothesis 3: Method X Reading Comprehension interaction as an explanation for the Race X Method interaction on situational judgement test performance.]

Examinee Test Reactions

Research on test validity and adverse impact has tended to examine predictor adequacy and fairness from the organizational and psychometric perspectives. Recent research in personnel selection has begun to focus more attention on applicant reactions or examinee attitudes to selection procedures (e.g., Arvey, Strickland, Drauden, & Martin, 1990; Gilliland, 1993; Gilliland, 1994; Macan, Avedon, Paese, & Smith, 1994; Schmitt & Gilliland, 1992; Schmitt, Gilliland, Landis, & Devine, 1993). This notion of perceived test adequacy and fairness has been termed "social validity" (Schuler, 1993), "impact validity" (Iles & Robertson, 1989), and the "social side of selection" (Herriot, 1989). Examinee reactions to selection procedures can be organizationally relevant in that they may affect applicant and employee behaviors (Arvey & Sackett, 1993; Gilliland, 1994). Premack and Wanous (1985) and Robertson and Smith (1989) argued that assessment situations serve as a preview of the organization, and Schuler and Fruhner (1993) noted that selection instruments can be used as instruments for personnel marketing. Smither, Reilly, Millsap, Pearlman, and Stoffey (1993) outlined three possible practical effects of applicant reactions. First, reactions can indirectly influence applicant pursuit or acceptance of job offers through organizational attractiveness. Second, reactions may relate to the likelihood of litigation and the success of the defense of the selection procedure. Third, reactions may indirectly affect both validity and utility, through motivation in test performance and loss of qualified applicants, respectively. In short, examinee test reactions are of interest because they constitute a critical component of the recruitment-selection process.

Face Validity and Predictive Validity Perceptions

Whereas there has been extensive research on the validity of work samples and assessment centers, there are few studies which systematically investigate examinee reactions to these predictors, such as face validity and predictive validity perceptions. Schmitt and Gilliland (1992), Gilliland (1993), and Gilliland (1994) have attempted to relate organizational theories of distributive and procedural justice to examinee test reactions, and Schuler (1993) has proposed a model of social validity. However, in general, the research on examinee reactions to selection procedures has been fragmented and atheoretical. Most research has focused on describing examinee attitudes or reactions to different selection tests and comparing reactions across tests. There is a need to integrate studies of examinee attitudes into the broader selection framework. In the present study, the investigation of examinee test reactions is integrated into the selection framework by analyzing attitudes by race and examining relationships between attitudes, adverse impact, and method of testing. Research on examinee reactions has focused on face validity, and little effort has been directed to investigating perceived predictive validity.
Face validity (also known as perceived content validity) refers to the extent to which examinees perceive the content of the selection procedure to be related to the content of the job. Perceived predictive validity refers to the extent to which examinees perceive that the procedure predicts future performance, regardless of face validity (Smither et al., 1993). Whereas face validity and perceived predictive validity are conceptually distinct, their empirical relationship is less clear. Although it is intuitively plausible to expect face validity to be highly correlated with perceived predictive validity, there has been little evidence demonstrating the correlation. The author located only one study which directly examined their empirical relationship. Smither et al. (1993) found a significant correlation of .36 (p < .01) between face validity and perceived predictive validity for a civil service examination. However, interpretations were problematic because the examination consisted of a variety of selection procedures (mainly paper-and-pencil measures of job knowledge and cognitive ability) and the sample of applicants was assessed for a variety of jobs ranging from entry-level to professional positions. In the present study, the relationship between face validity and perceived predictive validity was examined separately for four different tests. Using the job of a production worker as the frame of reference, it was predicted, within each of four different tests (a situational judgement test administered either in paper-and-pencil format or in video-based format, a reading comprehension test, a personality test, and a cognitive ability test), that:

H4: Predictive validity perceptions will be strongly and positively correlated with face validity perceptions.

Face Validity and Method of Testing

Both work sample tests and assessment centers appear to have high face validity. Research has shown that selection procedures involving simulations elicit more favorable examinee reactions than those using paper-and-pencil measures (Dodd, 1977; Macan et al., 1994; Schmidt et al., 1977; Smither et al., 1993). Schmidt et al. (1977) reported that perceptions of work sample tests were more favorable than those of paper-and-pencil measures of cognitive ability. Dodd (1977) found that assessees had positive reactions to the face validity aspects of an assessment center. Macan et al. (1994) found that examinees perceived the assessment center as more face valid than cognitive ability tests. The high face validity and favorable examinee attitudes for work samples and assessment centers are often attributed to their realistic test situations and similarity to the target job, that is, their high simulation fidelity. Smither et al. (1993) found that procedures involving simulations were generally perceived more favorably than paper-and-pencil measures. However, it is not clear which aspects of these tests are responsible for the positive reactions. Previous studies comparing examinee reactions across tests (e.g., across assessment centers and cognitive ability tests) have been limited in increasing our understanding of examinee reactions because of the method-content confound across tests.
By comparing two different methods of measurement with test content held constant, the present study was able to examine any differences in reactions attributable solely to test method. The assumption that simulation fidelity, or concreteness of the test stimulus, is positively related to examinee reactions would suggest that, in the present study, the video-based method of administering the situational judgement test would be perceived more favorably than the paper-and-pencil method even when test content remained the same, because the video-based method was a more concrete representation with higher simulation fidelity than the paper-and-pencil method. There is some evidence of positive examinee reactions to a video-based method of testing (e.g., Dyer, Desmarais, & Midkiff, 1993). Hence, it was predicted that:

H5: Face validity perceptions of the situational judgement test will be significantly higher when the test is administered in the video-based method than when it is administered in the paper-and-pencil method.

Within the same test, it is possible for examinees to hold, simultaneously, low predictive validity perceptions and high face validity perceptions. Doing well on a test whose content is related to the job tasks (i.e., a highly face valid test) does not always guarantee successful job performance, because successful performance has multiple determinants, many of which have little, if anything, to do with test performance. Unlike face validity, perceived predictive validity is less dependent on the test content or other test characteristics. There was no clear theoretical rationale for relating differences in test methods to differences in perceived determinants of successful job performance. Hence, no formal hypothesis was formulated for any effect of method of testing on perceived predictive validity.

Subgroup Differences in Test Reactions

The relationship between racial subgroup membership and reactions to selection tests is clearly an important practical issue. If examinee reactions affect subsequent examinee behaviors which are organizationally relevant, then any differential subgroup test reactions may explain some of the variance in job performance and behavior or in test performance associated with race. Almost all systematic differences in behaviors across racial groups have important economic and socio-political implications for the organization. Few studies have analyzed examinee reactions by racial subgroup membership. Schmidt et al. (1977) found no Black-White differences in reactions to work sample tests and cognitive ability tests. The lack of a significant Black-White difference in attitudes toward the cognitive ability test is somewhat puzzling. Given that Blacks perform more poorly than Whites on cognitive ability tests, and assuming that the method of measurement (i.e., paper-and-pencil) tends to be consistent with the cultural values, beliefs, and experiences of Whites but inconsistent with those of Blacks (Helms, 1992; Goldstein et al., 1993; Schmitt et al., 1996), one would predict that Blacks would report less favorable attitudes than Whites regarding paper-and-pencil cognitive ability tests. On the other hand, there is no reason to expect a Black-White difference in attitudes toward selection procedures involving more realistic materials or concrete representations. The confound between test method and test content in Schmidt et al.'s (1977) comparison between work samples and a cognitive ability test could not provide a rigorous test of the hypothesis of a Black-White difference relating to method of measurement.
This hypothesis could be more directly tested in the present study because content was held constant across two different methods of measurement. It was predicted that:

H6: There will be a Race X Method interaction effect on face validity perceptions of the situational judgement test such that the difference in face validity perceptions reported by Blacks and Whites will be greater in the paper-and-pencil method than in the video-based method.

The nature of the expected interaction is depicted in Figure 4.

[Figure 4. Hypothesis 6: Predicted Race X Method interaction on face validity perceptions of the situational judgement test.]

Whereas racial subgroup membership was expected to interact with method of testing to affect face validity perceptions because of differential subgroup experiences with test characteristics, it was less clear whether these subgroup experiences were relevant to predictive validity perceptions. There was no clear theoretical rationale for relating either these differences in subgroup experiences or differences in method of testing to differences in perceived determinants of successful job performance. Hence, no formal hypothesis was formulated for any Race X Method interaction effect on predictive validity perceptions.

Evidence of factorial invariance of responses to the situational judgement test across the two method groups would indicate that the same constructs were indeed being measured when test content was held constant across the two different methods of assessment. In addition, establishing factorial invariance across the two method groups would allow meaningful comparisons to be made between the paper-and-pencil method group and the video-based group of examinees with regard to their situational judgement scores. Factorial invariance was construed and assessed both internally (i.e., within the test) and externally (i.e., in terms of relationships with variables external to the test). Internally, factorial invariance was construed in terms of measurement invariance. Externally, factorial invariance was construed in terms of nomological invariance (or external parallelism). Measurement invariance exists when the numerical values across the two groups are on the same measurement scale (Drasgow, 1984, 1987). In the absence of measurement invariance (i.e., when numerical values across groups are not on the same measurement scale), group differences in mean test scores or in patterns of correlations of the test with external variables are substantively misleading. Nomological invariance, or external parallelism, across groups exists when the groups exhibit similar patterns of correlations between the test (or factors measured by the test) and external variables. To establish nomological invariance, independent established measures of personality constructs were administered in the present study for the purpose of relating them to scores on the two versions of the situational judgement test. It was anticipated that both versions would have similar patterns of correlations with the personality constructs.

Examinees

Examinees were introductory psychology undergraduates who participated in the study for extra course credit. A series of power analyses (Cohen, 1988) was performed for each hypothesis to determine the required sample size (see Appendix A for the series of power analyses). For each analysis, the power desired was .80 assuming a small effect size (see Cohen, 1988) at α = .05.
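An a priori analysis of the kind just described inverts the power calculation: given the desired power, α, and an assumed effect size, it returns the required sample size. The sketch below uses a normal approximation for an equal-n two-group comparison and is illustrative only; the study's own analyses were conducted per hypothesis (see Appendix A) and are not reproduced by this simplified example, whose effect-size value is a placeholder.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n needed for a two-tailed independent-groups test
    to reach the desired power (equal group sizes, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Placeholder effect size chosen only to show the mechanics of the calculation.
print(n_per_group(d=0.35))
```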
The power analyses revealed that 240 subjects were required A total of 244 undergraduates participated in the study and 241 provided usable data (113 Blacks, 128 Whites; 63.9% females). The incomplete and rmusable responses from 3 examinees were excluded from all analyses performed The video-based version of the situational judgement test used in the present study was a pilot version of a video-based situations assessment test developed by a large US-based human resources consultancy firm The test was developed by the firm as part of a comprehensive test battery for a consortium. The simulation focused on two broad functional areas namely, work habits and interpersonal skills. Each area was defined in terms of two performance factors. Work habits was defined in terms of work comnritrnent and work quality. Interpersonal skills was defined in terms of conflict management and empathy. The videotape included one practice video vignette 45 46 and 12 actual video vignettes spanning a range of common situations likely to be encountered in today's semiskilled and skilled blue collar work place. Each vignette depicted employees interacting on the job and described an interpersonal or work- related problem for one of the employees. At the end of each vignette, examinees were asked what action the employee should take to resolve the problem. A series of possible responses (ranging 9 to 14 responses per vignette) was presented in written form on the answer booklet. For each possible response, examinees were asked to rate its appropriateness on a 6-point rating scale fiom magnum to W. The pilot version of the test had a total of 126 items. On the basis of an item content analyis, the human resornce experts at the consultancy firm edited the test and produced a final version with a total of 63 items measuring the four aprimi factors, namely, work commitment (11 items), work quality (19 items), conflict management (17 items), and empathy (16 items). The pilot version of the test was administered in the present study because the final edited version was not available at the time of study. However, only the 63 items identified in the final version of the test were used in the computation of the total situational judgement score and the analyses involving the four amen factors. The consultancy firm developed a rational scoring key fiom the ratings of 25 job content experts. Each point on the rating scale was assigned a score of 0, 1, or 2 according to the percentage oferrdorsernerrt by the experts. A score of2 was assigned when endorsement was 50% or greater, 1 when endorsement was 25% to 49.99%, and 0 when endorsement was less than 25% 47 Based on the written script of the videotape which described the essential visual elements of the vignette, and the mrrator's speech and dialogue between the characters both in verbatim, the present author developed a paper-and-pencil format of the test. In this paper-and-pencil measure, each of the vignettes (1 practice and 12 actual vignettes) was presented in written form. The written vignette was described in the third-person perspective (as opposed to a dialogue) similar in form to the typical paper-and-pencil type of situational judgemmt test used in previous research (e. g., Motowidlo et al., 1990; Pulakos et al., 1994). The substantive content of each written vignette was identical to the corresponding video vignette. 
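Before turning to the rating procedure, the rational scoring key described above (a key value of 0, 1, or 2 assigned to each scale point from the percentage of expert endorsement) can be expressed directly. In the sketch below, the function names and endorsement percentages are illustrative and are not the consultancy firm's actual data.

```python
def key_score(endorsement_pct: float) -> int:
    """Convert the percentage of job content experts endorsing a scale point
    into the 0/1/2 key value described in the text: >= 50% -> 2,
    25% to 49.99% -> 1, < 25% -> 0."""
    if endorsement_pct >= 50.0:
        return 2
    if endorsement_pct >= 25.0:
        return 1
    return 0

def build_key(endorsements):
    """endorsements: {item_id: {scale_point: pct of experts endorsing that point}}."""
    return {item: {pt: key_score(pct) for pt, pct in pts.items()}
            for item, pts in endorsements.items()}

# Hypothetical endorsement percentages for one item's six scale points.
example = {"item_01": {1: 2.0, 2: 8.0, 3: 30.0, 4: 55.0, 5: 40.0, 6: 10.0}}
print(build_key(example))
```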
After reading each vignette, examinees gave their ratings on an answer booklet similar to the one used in the video-based method containing the same response items. The scoring key for the paper-and-pencil version of the test was identical to the one used in the video-based version The video-bmed administration and the paper-and-pencil administration each hadatotaltestingtime lasting45 minutes. Appendiprresents anexample ofthe vignettes and possible responses used in the paper-and-pencil method I l [E . I B . Face validity and predictive validity perceptions were each assessed by a 5- itern measure adapted fiom part ofa questionnaire used in Snrither et al. (1993). To provide a fiarne of reference, exarrrinees were asked to give ratings on the items concerning relationships between the test and the job of a production worker working in a team-based situation. It was further stated that to do the job well, the worker had 48 to be both technically competent and able to relate to others effectively. Ratings were anchored on a 6-point Likert-type scale from W to stronglxagrm The questionnaire is shown in Appendix C. Three widely used paper-and-pencil measures of established psychological constructs were administered to all examinees. Reading comprehension was assessed using the Comprehension subtest of the W (Form G, Brown, Bennet, & Hanna, 1993). The test was developed for use with high school and college students and it has been widely used in psychology and education for the assessment of reading comprehension. Form G (published in 1993) is one of the two parallel forms in the fifth edition of the test that was published originally in 1929. The comprehension subtest is a multiple-choice forrrmt test in which examinees read 8 passages and respond to a total of 36 five-answer multiple choice questions. Administration time is 20 minutes. The test-retest reliabilities of the comprehension subtestreportedinthetestmanualrangedfiom.75to.82. . Cognitive ability was assessed using the W (Wonderlic & Assoc, 1984). The Wonderlic test is a general cognitive test for industrial use (for reviews, see Schmidt, 1985; Schoenfeldt, 1985). It is a 12-minute test consisting of 50 items with a variety of verbal, numerical, and some spatial content, and it yields a single total score. Test-retest reliabilities ranged fiom .703 to .903. Personality constructs assessed were the "Big-five" dimensions measured using 49 the NEO-FFI (Costa & McCrae, 1992), a short version (i.e., 60 items) of the NEQBI (Costa & McRae, 1985). The dimensions assessed by the test are non-clinical constructs and include conscientiousness, agreeableness, neuroticism, openness to experience, and extraversion. The test contains a total of60 items each scored on a 5- point Likert-type scale ranging from W to W. Each of the 5 dimensions is measured by 12 items. The time for completion of the test is 30 minutes. Evidence of criterion-related validity and construct validity for the NEQBI havebeendocumentedinareviewbyDigman(1990)mrdreponedinCostaand Mche (1992). Design The study employed primarily a 2 X 2 between-subjects factorial design with performance on the situational judgement test and examinee test reactions (face validity and perceived predictive validity) as the dependent variables. The two independent variables were Race (Blacks vs. Whites) and Method (video-based vs. paper-and-pencil). 
Assignment of examinees to the Method condition was random with the restriction tlmt examinees in the same testing session were administered the same method. The number of examinees per condition was approximately equal (Black-Video = 51, Black-Paper = 62, White-Video = 69, White-Paper = 59). The paper-and-pencil measures of reading comprehension, cognitive ability, and the Bi g- Five personality constructs were administered to all examinees. 50 liocedme ExamineesweretestedhraclassmomsettingingroupsmngingbetweenSand 19 individuals. In the video-based method condition, the video vignettes were presented on a 25" television positioned in a manner such that all examinees could watch and listen to the videotape clearly. Instructions for the test were given on the videotape by a narrator. The instructions began with an example vignette as a practice item. The narrator first described the setting of the work situation and introduced the characters. Thevignwewasthenpresented. Attheendofthevignette, thevideo fiamefi'ozeandexanfineeswereaskedtoopenflreanswerbooklettoflreexample situation section and indicate the efl‘ectiveness of each possible response described in written form using the 6-point rating scale. Examinees had 2 minutes to complete ratings for all the responses pertaining to the vignette. After clarifying any questions regarding the manner ofcompleting the test, the actual test began There were a total of12videovignettesontheactualtestandeachvignettewasprecededbyanarrator introduction. Examinees had 2 minutes to complete ratings for the associated responses. After the video-based test which lasted approximately 45 minutes, examinees were asked to complete a questionnaire regarding their perceptions of the test. ThequesfionnairewasflreexmnmeeattimdesmeasmewhichconsistedofflreS items assessing face validity and the 5 items assessing perceived predictive validity. Examinees then completed a series of three paper-and-pencil measures including the wonderlieflersomellest, the NEW and the NEQEEI administered in counterbalanced order across test sessions. The same examinee 51 attitude questionnaire was administered following completion of each of the three paper-and-pencil measures. In the paper-and-pencil method condition, examinees were presented with the paper-and-pencil version of the situational judgemert test. The instructions for the test were written on the first page of the test booklet. The same example vignette preceded the 12 actual test vignettes. Examinees were given 45 minutes to complete the test. After the test, the rest of the session was identical to the video-based method session. Examinees completed the examinee attitudes questionnaire for the situational judgement test, and then followed by the three paper-and—pencil measures administered in cormterbalanced order across test sessions with an examinee attitude questionnaire following completion of each measure. In both conditions, subjects were thoroughly debriefed and thanked for their participation. The total testing time per session for each condition was approximately 2 hours. Analyses Effect size estimates ((1 statistic) for subgroup differences in performance on the situational judgment test were conrputed by subtracting the majority test mean from the minority test mean and dividing the difference by the pooled standard deviation. Hence, negative effect sizes indicated that Blacks scored lower than Whites whereas positive effect sizes indicated the reverse. 
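The d statistic defined in this paragraph can be computed as follows. The sign convention matches the one just stated (minority mean minus majority mean, divided by the pooled standard deviation); the scores in the example are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(minority, majority):
    """Standardized subgroup difference: (minority mean - majority mean) / pooled SD.
    Negative values indicate that the minority group scored lower."""
    n1, n2 = len(minority), len(majority)
    s1, s2 = stdev(minority), stdev(majority)
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(minority) - mean(majority)) / pooled_sd

# Illustrative scores only, not study data.
print(round(cohens_d([52, 55, 60, 58], [61, 64, 59, 66]), 2))
```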
Sex, Race, andMethodweredummycodedGemales=0, Males: 1; Whites: O, Blacks = 1; paper-and—pencil = O, video-based = 1) and the other study variables 52 were treated as continuous variables. Hierarchical regression analyses were used to test the interaction effects hypothesized in H1, H2, H3, and H6. Conelational analyses were used to test H4, and an independent-samples t—test was used to test H5. Multiple-groups covariance structure modeling using LISREL 8 (Joreskog & Sorbom, 1993) was used to assess measurement invariance and nomological invariance of the situational judgement test across the two method groups. Measurement invariance was tested by simultaneously comparing confirmatory factor analytic models across groups. It is widely accepted that measurement invariance is established when the factor loading matrix is invariant across groups (Alwin & Jackson, 1981; Sorbom, 1974). A more stringent criterion for measurement invariance is when both factor loadings and error variances of measures are invariant across groups. Nomological invariance was tested by comparing, across groups, the structural relationships between each situational judgement factor to the set of Bi g-Five personality factors. Nomological invariance is established when structrnal relationships are invariant across groups. The fit of a model was assessed using the )6 statistic and a variety of fit indices. The )6 statistic is the most widely used measure of model fit in organizational research (James & James, 1989; Kelloway, 1996). The main disadvantage of the x2 is its high sensitivity to sample size such that with large sample sizes, most models will produce statistically significant )8 values resulting in rejection of these models even if they are theoretically reasonable. Hence, most researchers 53 also rely on a variety ofaltemate fit indices to reduce the dependence on sample size when assessing model fit. Because the various indices differ on their specific assumptions, the use of multiple indices when evaluating a model can provide convergent evidence in the assessment of model fit In the present study, the indices used included J oreskog and Sorbom's (1989) goodness-of-fit index (GFI) and adjusted goodness-of-fit index (AGFI), Bentler‘s (1990) comparative fit index (CFI), Bentler and Bonett’s (1980) non-normed fit index (NNFI), Joreskog and Sorbom's (1986) standardized root mean square residual (standardized RMSR), and Steiger's (1990) root mean square error of approximation (RMSEA). Both GFI and AGFI are widely used indices of fit based on the comparison of observed and estimated covariances (see Kelloway, 1996). The AGFI is a parsirnonous fit which adjusts the GFI for the degrees of freedom in the model, that is, it takes into consideration the fact that a model always increases in fit as the number of fiee parameters to be estimated approaches the number of independent pieces of information available for estimation. The CPI and NNFI measure how well the model fits relative to a baseline model, usually the independence (i.e., null) model. The values of GFI, AGFI, CFI, and NNFI range fiom 0 to 1.0 with values approaching 1.0 indicating a good fit to the data. The present study used the convention of larger than .90 as an indication of good fit. The standardized RMSR is a measure of the average standardized residuals of the predicted covariance matrix fiom the observed covariance matrix. Values approaching 0 indicate a good fit to the data. 
The conventional value of less than .10 54 wasusedasanindicationofgoodfitinthepresentstudy. TheRMSEAisameasure of the average size of the fitted residuals per degree of freedom. Following Browne and Cudeck (1993), the present study considered a value of .05 or less as indicating a close fit; between .05 and .10 as a moderate fit; and more than .10 as a poor fit. The )6 difference test (me), obtained by calculating the difierence in the models' respective x2 with degrees of freedom equal to the difference in the models' respective degrees of fieedom, was used to compare the statistical significance of difference in fit between nested models. Table 1 presents the means, standard deviations, reliability estimates, and interconelations of all the study variables. The same statistics broken down by racial subgroups are reported in Appendix D. As shown in Table 1, the internal consistency reliability estimates (Cronbach's or) for the measures used in the present study were in acceptable ranges. The reliability estimates reported for the two versions of the situational judgement test are underestimates because of the multidimensional nature of the test. An inspection of Table 1 showed bivariate support for the major hypotheses. Race was more highly correlated with situational test performance when the test was administered in the paper-and—pencil method than when administered in the video- based method. Consistent with previous research, race was correlated with reading comprehension. With regard to test reactions, face validity perceptions and predictive validity perceptions were positively correlated for each of the four different tests (i.e., situational judgemmt, reading comprehension, cognitive ability, personality) used in the study. For the situational judgement test, face validity perceptions were correlated with the method of assessment. Also, race was more highly correlated with face validity perceptions when the situational judgement test was administered in the paper- and-pencil method than when administered in the video-based method. Each hypothesis will be addressed directly and in a multivariate sense in the following sections. 55 56 82580 ~ 03$. QC em oo 9 en 8- Na- 3 as 88 mum? .w 83 a 8 8 z- 8. 8 Se 39. 828 .5 Ge 8 9. mm- 3 2- m3 2% magma 6 Ge :- 8- «E 3.9. a> .m Ge :1 S- one can an .4 8 8 on S. 8,2 .m 8 Ms. em. 57. N on on 8:52 ._ .39» Nasseateeze2:2awsenemas amass: .... .. ........... . z $1.5... . 2 a 03¢ 57 326.80 3 035,—. 68 we we 2 mm co me 3 8 we 5. ea. 2 mo 8 «a 3. no No- 8 Ned mad“ OmZA—mmn— .om GD cm on S S n3 8. no. 2 mot m3- 8 N3 8 co. no co 5. no. omd $.0— OmZ-m—U-m0§m .: QC 3 8 3c. 3 3 en 3m. .5. mo. mwfi wodm meO .2 9‘8 ca- an. cm. 3 wT so om- 3 36 3.NN Ogm—Z .a an 3 ON 3 M: S 3 m— 3 2 N3 Z S a m h o n v m N _ Gm mane—z 58 .82 u 26 822922 e8 .82 u a B>-mo .22 u .6 82-295 .22 u a 8.2.82.2 .22 u 8 am é neon... :2 u z 8 Ban on menses :5 é 8o§p=8§a e8 gouge names... ago: see as E can anomaoaeuaa as do 8385 assessees as B> 9a an. é a ensfieo .2388 0:383 see 5:532 $32.32 a. 2 some? 2000 Em E85 8 9:82:80 03 83828 32538 5.. 8858.82 E 0.3 82:23:33 .3238 2a 3252mm»: new macaw—oboe as as season 2.1.28.2 .3253 2832 583882 eumoaomeaéem 58033er 881883 as 852 e8 .xmm .2025: .95 5223 2,382 8:83.293 562$ Ban—"8% Homeo>§xmu>§m ”meaguzfio usee.85...Zuompmz neofififiwfimmmg ”msgouaomaaoonomzoo ”ensues; Ecmz nemz 5288388 ”2.2232 successeezémm smegma; 38832 BeaéoeSua> engages... 38355 :23 esteem"?— eosaao a 882.2052 ”838% .8 smuxmm use .8832 383% a assess co Begun—05m: ”Bataan“ 8§E> . 
68 on em 2 mm em ow 02 :V S- 3 S- n2 5. 02. no mo 3 8 mg- we no- 9am on: ZOOU.Qm~—m.- 38 m— 8 me he 3 co 5 no no 02. 8 8. ms. 3 8 8 8 mm- 8 no $6 3.9 ZOOU.m~U>_U xoflm I om aouewroyed rsel ueew perogperd 63 H2 predicted that a Method X Reading Comprehension interaction effect on situational judgement test performance such that performance will be positively and significantly correlated with reading comprehension ability in the paper-and—pencil method of testing whereas no significant correlation between test performance and reading comprehension ability will 000m in the video-based method As shown in Table 2, entering Method and Reading Comprehension as a single block in step 1 of the regression of test performance on these factors accounted for 12% of the variance, p < .05. The Method X Reading Comprehension interaction term was entered in step 2 which resulted in a significant increase in variance accounted for, AR2 = .03, Adf = 1, p < .05. A plot of the interaction (Cohen & Cohen 1988) as depicted in Figure 6 showed that test perforrmnce and reading comprehension were positively correlated in the paper-and-pencil method of testing but they were nearly uncon'elated in the video- based method Hence, H2 was supported. .mocmctorwd ..mmh EmEmmczw 6:25:25 :0 52852:. 53:93an0 mcfimmm x 35.22 .m Saul szc: Um cc co_mcmcmano mcfimmm U052). nommmomuSt. poems. __ocmn_-ucm..m%n_.a. o.m+ (srrun ps u!) eoueuuoped reel ueew perogperd 65 H3 predicted that the Race X Method interaction effect on situational judgement test performance would diminish after controlling for the effect of the Method X Reading Comprehension interaction. As shown in Table 2, Race, Reading Comprehension, Method, and Method X Reading Comprehension interaction were entered as a single block in step 1 of the regression of test performance on race, reading comprehension, and method of assessment. The block accounted for 19% of the variance, p < .05. Entering the Race X Method interaction term in step 2 provided a signifieant but small increase in variance accounted for, AR2 = .01, Adf= 1, p < .05. The proportion of variance in test performance accounted for by the Race X Method interaction obtained in H1 diminished substantially fiom 4% to a small (though still statistically significant, p < .05) 1% once the effect of Method X Reading Comprehension on test performance was controlled. Hence, H3 was supported Figure 7 depicts the nature of the Race X Method interaction mmmlling for the effect of Method X Reading Comprehension interaction on test performance. Compared to Figure 5, Figure 7 shows that the Race X Method interaction effect was dampened to some extent after controlling for the effect of Method X Reading Comprehension interaction. In summary, the regression analyses provided support for the first three hypotheses. There was a Race X Method interaction effect on situational judgement test perforrmnce such that the Black-White performance difference (favoring Whites) was substantially smaller in the video-based method of testing than in the paper-and- perrcil method A Method X Reading Comprehension interaction also existed such 00:00.25 chcocmfiEoo @5003”. x 00:55. .5 Beam .2 05:02:00 5:0 mocmEBtmd 50h #:0600030 0002035 :0 cozomaoflc. 005.22 x comm N 059m Ememmmm< ..0 00505. 
033-0005 __ocmn_-0cm..mn_0n_ 02E>>_H_ x005 I eoueuuoped rsel ueew petogperd 67 that test performance was positively correlated with reading comprehension ability in the paper-and-pencil method but that they were nearly uncorrelated in the video-based method As shown in the regression results for H3, this Method X Reading Comprehension interaction accormted for a substantial portion of the Race X Method interaction effect on test performance. 68 EactmiallmtarimacmsiMethodflroups Table 3 presents the means, standard deviations, and both observed and corrected (for scale tmreliability) interconelations of the form a priori scales on the situational judgement test, broken down by method groups. Not surprisingly, internal consistency estimates of reliabilities (Cronbach's or) were low due to the relatively small number of items on each scale and the dichotomous (with a few trichotomous) scoring of the items. However, scale reliabilities were substantially higher than inter- scale correlations which provided some preliminary evidence for discriminant validity. Inter-scale conelations remained low, relative to scale reliabilities, even after correcting for unreliability in each scale. Multiple-group covariance structure analysis was used to provide a more rigorous test for the discriminant validity of the four a priori scales and to assess factorial invariance across method groups. Table 3 69 Means SD 1 2 3 4 Situational Judgement Scales 1. Conflict 11.84 3.81 (.46) .49 .23 .36 2. Empathy 9.47 3.65 .24 (.53) 70 28 3. Quality 10.09 2.50 .08 .26 (.26) .27 4. Commitment 7.98 3.54 .18 .15 .10 (.53) 1. Conflict 12.29 3.64 (.40) .48 .49 .30 2. Empathy 10.44 3.03 .15 (.24) .85 .49 3. Quality 10.74 2.78 .13 .27 (.42) .22 4. Commitment 9.38 3.33 .12 .15 .09 (.39) Note. Cronbach’s a reliabilities are in parentheses. Observed correlations are below diagonals and corrected (for unreliability) correlations are above diagonals. 70 Measmmmlnvariance. As described earlier, factorial invariance referred to both measurement invariance and nomological invariance. To formulate measmement models for the test of measurement invariance, items within each of the four scales were first randomly sorted into three sets comprised of approximately equal numbers of items. Item scores were writ-weighted and summed within each set to create three trait indicators (also lmown as observed indicators) for each latent trait variable purportedly measured by each scale (i.e., each situational judgement factor), giving a total of 12 trait indicators. A factor loading was arbitrarily set to 1.0 for one of the three indicators for each latent trait variable in order to scale that latent trait variable (Bollen, 1989). Appendix E presents the 12 X 12 observed covariance matrix among trait indicators for each of the two method groups. Table 4 presents the fit indices associated with the series of nested confirmatory factor analytic models fit to the two observed covariance matrices. Also presented in this table are chi-square difference tests associated with relevant model comparisons. 71 won—£88 v 038. 883? 8.8 Ba m8§ta>oo 88mm 95 www.588— 88£ 89cm mo. mo. mm. mm. me. co. m o2: m2 m> m2 mod worm? 888$ cfiSoboU 80m .92 .32—anal» 8.8 new .mwfiuaB 88am £35880 883 can mo. mo. ow. mm. mm. a. ArmmSm NE m> 2.4 on: wammfi @883 caflobouueom .NE decadent» 8.8 28 $532 88am new no. mo. on. 8. 3. mm. as ...ovd: 88$ 88.80 omwfim 42 <85 ease. as Ezz Eu Ec< E0 :8 N3 acmfiaaou 382 .8 "x 8:20 383. 288868.,“ 382 . "1. . .1 .4: .) ....,.ur.... ..n "1.. 
S .... .2 . .-.-..“r ...“ ...... .... ....... “....“ r. o .4..-.. ....-. ..‘4 o .... ...4. .... . wot.) .fl ..4 o . do n... .o n o: ... ‘4 . ‘. ....340 a .< o. v 033. 72 .39? mo. mo. mm. am. we. om. mo. mo. ow. ca. Va. cm. on vmdfi m2 m> 92 0 23 m2 m> 3.4 Na mos v2 m> m2 88.3888 88% use .8288? 8.8 .meBB 883 88m 03 $83 888$ eBay—80 80m .22 .885580 88am 08m $853? 8.8 28 $598— 88& Beam OS :63 £8me 682280 Son 62 3 9m 522 EU EO< ED :8 N3 88888 $82 .8 "x 385 m88< 883%on 882 73 A single general factor model in which factor loadings and error variances were fieely estimated across method groups (Model M1) was first fit to the covariance matrices as the baseline measurement model. The single general factor model provided a marginal fit to the data, )8 = 181.40, df=109, p < .05, GFI = .88, AGFI = .91, CFI = .66, NNFI = .59, standardized RMSR = .09, RMSFA = .05. Hence, there was no strong evidence of unidimensionalty in the situational judgement test. A four factor model in which factor covariance, factor loadings, and error variances were freely estimated across method groups (Model M2) was next fit to the data The model provided a signifieant increase in fit over the single factor model, sz =57.92, Adf=9,p<.05, andarwsonable fittothedataas indicatedby the fit indices. To test for measurement invariance across method groups, Model MZ was compared to a more parsimonous model (i.e., with higher degrees of fieedom) in which factor loadings were constrained to be equal across groups. The more parsimonous model (i.e., Model M3) continued to provide a reasonable fit to the data and the decrease in model fit from Model N12 to Model MB was nonsignificant, A76 = 10.20, Adf = 8, ns. Hence, equality of factor loadings across method groups was established Model MB was compared to Model M4, a yet more parsimonous model in whichbothfactorloadingsanderrorvarianceswere constrainedtobeequal across groups. Model M4 provided a good fit as indicated by the fit indices and the decrease inmodel fit, asmeasmedbythexzdifi‘erencetest, fromModelM3toModelM4was nonsignificant, sz = 7.03, Adf = 12, us. That is, using the stringent criterion of both 74 equal factor loadings and equal error variances, measurement invariance across the two method groups was established To examine the structrn'al aspects of the confirmatory factor analytic models (i.e., the relationships among latent trait variables), Model M4 was compared to Model M5 in which between-group equality constraints were imposed on factor covariances. In Model M5, the six factor covariances were constrained to be equal across the two method groups. That is, Model M4 and Model M5 difi‘ered only with respect to structural relations among the latent trait variables; for each model, the factor loadings and error variances were constrained to be equal across the two method groups. Model M5 provided a good fit to the data, )8 = 142.72, df = 126, n.s., GFI = .90, AGFI = .94, CFI = .92, NNFI = .92, standardized RMSR = .09, RMSEA = .02. The decrease in model fit from Model M4 to Model M5 was nonsignificant, Ax2 = 2.01, Adf = 6, ns. Comparison between Model M5 and Model M2 also revealed that as a whole, none of the equality constraints on factor covariances, factor loadings, and error variances significantly decreased model fit. Hence, Model M5 was selected as the most adequate measurement model. Figure 8 depicts Model M5 with its associated common metric factor loadings and factor conelations. All factor loadings were statistically signifieant, p < .05. 
Of the six factor correlations, five were statistically significant, p < .05. Full measurement invariance across method groups (i.e., full internal factorial invariance) was established in terms of error variances, factor loadings and factor covariances. 75 .Go. v m *v 8580 8:82 :8m 8m meouflobov 88mm 98 $5884 885m eofieemeefim mEoo NEOO FEOO *mm. EmEzEEoo *mm. *9». menu szc «g. _.N. maEm NaEc EEQ *ON. 11V. 2.82 88:80 685803».fl HEB E82 88225. 8823 boemahweou .me 2wa mcoo NCOO Fcoo *vm. *vc. *3. 76 Nomologicallnyariance. Table 5 presents, for each method group, the means, standard deviations, and interconelations between the 12 situational judgement indicators and the 5 personality indicators. 77 8:588 n 833. end mmd bod mot S.- we. Nor mo. 2 . mo.- S.- 2 .- 3. co. Hm. 3 . no. No. Rum no. mo. 2.- 3. mod mo. 8.- S.- no. 34» mo. 2. 2.- no. RN 5.- 2. no.- mo. ohm 5. mo. mo.- 3. 0H.- ové No. oo. mo.- om. cfim mwd a. 3. mm. 2. no.- No.- om. om. 3. 3.- mwd mu. 3. 5.- cm. wfi. wofi wad mfiw one omd Neda ovdn 3.4m hwdm hmdm menu—z >§m meO OMDNZ mmMO< UmZOU :Sna 8883.88 3% mace «88 EB 30 ~20 :30 Ram 88m Ram «.80 ~80 :80 mm was: . u‘..oo .-.. 1.0.. c. d ..o‘. ....“.¢ ‘ . . .4 .4. Hi. a . Q .1. 3 n 22g 78 dengfixmuarém amusemenzfio ”8885028ng ”amoaozfipefimmmca. ”mmuegoueomoéoUuUmZOU USEEEEoUnEoU 3330 ”mfiamEm—uaam noaeooueoo 6838.58 mo~n§> .334 a: 3: E :4 2: «3 m3 9; SA a: Se 28 8 3m N3 mam m3 men m3. e3 2a 2:. now o; 3.4 ”so: 8.- 2. S. 2.- 2.- 8.- 8.- 8.- 8.- 2.- S.- 8.- 98 mean >588 5.- 8.- a. 8.- M:- 8- 2.. 8. 8. mm. :. s. new 3.8 E8 2. so. No. 8.- fl. so. 2. 8.- 8.- S. 3. 8.- a: ”in 9:82 8. no. 2. 8. 8. mo. 8.- 3.- S. 2. 8.- 8. one 3.8 meme... 3.- 2. 8. S. 8. 8.- 2.- 8.- 8. 8.- 8.- 8. $8 3.8 828 sfiuzv Sam-82> Re of v: as m: 3: Re 42 $4 an «2 92 am meow «88 :80 38 3.0 :5 28m «mam Ram 28 28 :50 am ”so: 79 The present author had planned to use a multiple-group covariance structure analysis approach to testing equality of structural relationships between latent variables across groups (J oreskog & Sorbom, 1993) to assess nomological invariance (external parallelism) of the four situational judgement factors (in reference to the Big-Five personality factors) across method groups. Full nomological invariance is achieved whmboflrequalityofpmameteresfimatesandequafityoferrorsmeachofthe structural equations relating the respective situational judgement factor to the set of personality factors are established across method groups. However, as shown in Table 5, the observed correlations between the situational judgement indicators and the persomlity indicators were trivial, fluctuating around 0. Contrary to the author’s expectation, it appeared that the personality factors measured in the present study were not related to the situational judgement factors. Therefore, it was not meaningful to test for external parallelism using the five personality factors as external reference variables. A multiple-group covariance structure analysis was attempted, but failed to reject a model specifying between-group equality in structural parameters. The failure to reject was due to the lack of correlation between situational judgement factors and the external reference variables used for both method groups, and not beeause of a between-group similarity in the patterns of external correlation (i.e., external parallelism). Because of the low conelations between situational judgement scales and personality measures, it was formd that structural parameter estimates in both method groups were trivial. 
Three nested models were fit to the 17 X 17 observed covariance matrix relating the 12 indicators for the situational judgement factors and the 5 indicators for the personality factors for each method group (covariance matrices are reported in Appendix F). For all three models, measurement aspects were held constant so that effects of structural and measurement differences were not confounded in model comparisons. Constraining measurement aspects also resulted in the comparison of more parsimonious models by reducing the number of parameters to be estimated simultaneously. For measurement aspects of the four situational judgement factors, factor covariances, factor loadings, and error variances of observed indicators were constrained to be equal across method groups (i.e., the measurement model specified by Model M5). For measurement aspects of the five personality factors, the error variances of observed indicators are not identified parameters and cannot be estimated, because there was only one observed indicator (i.e., Big-Five subscale) per factor. Rather than assuming that the indicators were infallible measures by fixing error variances to zero, the identification problem was solved by fixing the error variance of each indicator to a value derived from its internal consistency estimate of reliability r_xx (Cronbach's α), using the formula

σ_ε² = (1 − r_xx) σ_x²     (1)

where σ_ε² is the error variance of the indicator and σ_x² is the variance of the indicator (Joreskog & Sorbom, 1993).

Model N1 freely estimated across method groups both the structural parameters and the error terms in each of the four structural equations relating the respective situational judgement factor to the set of five personality factors. The model provided a good fit to the data, χ² = 164.84, df = 217, n.s., GFI = .92, AGFI = .94. The model was compared to the more parsimonious Model N2, which similarly allowed error terms to vary freely but specified structural parameter estimates to be invariant across method groups. A χ² difference test showed that the decrease in model fit from Model N1 to Model N2 was nonsignificant, Δχ² = .72, Δdf = 20, n.s. Hence, structural parameter estimates did not differ significantly across method groups. Model N2 was compared to Model N3, which specified both structural parameter estimates and error terms in the structural equations to be invariant across method groups. The decrease in model fit from Model N2 to Model N3 was nonsignificant, Δχ² = 2.08, Δdf = 4, n.s. That is, error terms in the structural equations did not differ significantly across groups. However, for all three models, an inspection of the common metric standardized regressions of each situational judgement factor on the five personality factors revealed trivial parameter estimates fluctuating around 0, within the range between -.07 and +.04. Therefore, whereas the nested model comparisons indicated equality of the structural equations relating the respective situational judgement factor to the set of personality factors, this equality should not be construed as evidence for nomological invariance across method groups (i.e., external factorial invariance/parallelism). Instead, the equality of structural regressions was a result of near-zero correlations between the situational judgement factors and the external reference variables (i.e., the Big-Five personality measures) selected for the assessment of external parallelism.
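Two computations recur in the model comparisons reported above: fixing an indicator's error variance from its reliability (Equation 1) and evaluating a χ² difference between nested models. A minimal sketch of both follows; the reliability and variance passed to the first function are placeholders, while the Δχ² and Δdf passed to the second are the Model N1 versus Model N2 values reported in the text.

```python
from scipy.stats import chi2

def fixed_error_variance(reliability, indicator_variance):
    """Equation 1: error variance = (1 - r_xx) * observed indicator variance."""
    return (1.0 - reliability) * indicator_variance

def chi_square_difference_p(delta_chi2, delta_df):
    """p-value of the chi-square difference test for nested covariance models."""
    return chi2.sf(delta_chi2, delta_df)

print(fixed_error_variance(0.75, 4.0))              # placeholder alpha and variance
print(round(chi_square_difference_p(0.72, 20), 3))  # Model N1 vs. N2: nonsignificant
```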
Black-White Differences on Situational Judgement Constructs Across Methods of Assessment

The establishment of factorial invariance of responses to the situational judgement test, in terms of full measurement invariance across method groups, supported the meaningfulness of between-method comparisons of subgroup performance at the level of the individual constructs measured by the test. Factor scores for each of the four situational judgement factors were computed for all examinees based on the factor loadings in the measurement model (Model M5). Because the factors are latent variables free of measurement error in the observed indicators, comparisons of Black-White differences in factor scores provide more accurate estimates (i.e., disattenuated for unreliability in the measures) of the effect of method of assessment on adverse impact in the situational judgement test. Table 6 presents, for each of the four situational judgement factors, the subgroup factor means, standard deviations, and associated d statistics for each of the two methods of assessment. As shown in the table, the paper-and-pencil method produced substantial Black-White differences in performance favoring Whites on each of the four constructs, as indicated by the d statistic (Conflict = -.70; Empathy = -.43; Quality = -.35; Commitment = -.63). These Black-White differences were substantially reduced in the video-based method (Conflict = .02; Empathy = -.18; Quality = .06; Commitment = -.36), with d differences across methods ranging from .27 to .72. In fact, in the video-based method, Black-White differences were not statistically significant for any of the four factors.

[Table 6. Subgroup factor means, standard deviations, and effect sizes (d) for the four situational judgement factors by race and method of assessment. *p < .05.]

To summarize, nomological invariance of responses to the situational judgement test across the two method groups could not be tested because of near-zero correlations between the situational judgement factors and the external reference personality variables. However, factorial invariance in terms of full measurement invariance across methods was established. Measurement invariance supported the meaningfulness of between-method comparisons of racial subgroup performance at the level of individual constructs disattenuated for measurement error. For each construct, there was a large Black-White performance difference favoring Whites in the paper-and-pencil method. These performance differences were substantially reduced in the video-based method.

H4 predicted that, for each of the four different tests administered in the present study, predictive validity perceptions would be strongly and positively correlated with face validity perceptions.
H4 predicted that for each of the four different tests administered in the present study, predictive validity perceptions would be strongly and positively correlated with face validity perceptions. Results showed that correlations between the two perceptions were significant (p < .05), positive, and substantial for all four tests (paper-and-pencil situational judgement, r = .28, N = 121; video-based situational judgement, r = .24, N = 120; reading comprehension, r = .60, N = 241; personality, r = .48, N = 241; cognitive ability, r = .70, N = 241). For each test, the correlation between the two types of perceptions was substantially lower than the reliability estimates (Cronbach's alpha) of the respective perception measures (paper-and-pencil situational judgement, Face r_xx = .90, Predictive r_xx = .75; video-based situational judgement, Face r_xx = .78, Predictive r_xx = .81; reading comprehension, Face r_xx = .88, Predictive r_xx = .90; personality, Face r_xx = .76, Predictive r_xx = .86; cognitive ability, Face r_xx = .81, Predictive r_xx = .86). This provided evidence of discriminant validity for the two types of perceptions. H4 was supported.

H5 predicted that face validity perceptions of the situational judgement test would be significantly higher when it is administered in the video-based method than when it is administered in the paper-and-pencil method. Results of an independent sample t-test supported the hypothesis; the video-based method received significantly higher mean face validity ratings (M = 19.69, SD = 2.96, N = 120) than the paper-and-pencil method (M = 17.84, SD = 4.36, N = 121), t(237) = 3.87, p < .05.

H6 predicted that the difference in face validity perceptions on the situational judgement test reported by Blacks and Whites would be greater in the paper-and-pencil method than in the video-based method. To test this Race X Method interaction, a hierarchical regression of face validity perceptions was performed. As shown in Table 7, Race and Method were entered as a single block in step 1 of the regression and accounted for 12% of the variance in perceptions, p < .05. Entering the Race X Method interaction term in step 2 of the regression resulted in a significant increase in variance accounted for, ΔR² = .04, Δdf = 1, p < .05.

Table 7
Hierarchical Regressions of Face Validity Perceptions and Situational Judgement Test Performance (N = 241)

Criteria and Predictors                                        R²     df    ΔR²    Δdf    ΔF

Face Validity Perceptions
  Step 1. Race, Method                                         .120    2                  16.52*
  Step 2. Race X Method                                        .160    3    .040    1     12.13*

Test Performance
  Step 1. Race, Reading, Method, Reading X Method,
          Face Validity                                        .219    5                  13.20*
  Step 2. Race X Method                                        .227    6    .008    1      2.50

*p < .05.

Figure 9 depicts the nature of the interaction in terms of differences in subgroup mean perceptions. As shown in the figure, the Black-White difference in face validity perceptions on the situational judgement test was greater in the paper-and-pencil method than in the video-based method. To assess the practical significance of the statistically significant Race X Method interaction, effect sizes for subgroup differences were computed using the d statistic. A substantial Black-White difference in perceptions of four-fifths of a standard deviation, with Whites reporting higher face validity, was found on the paper-and-pencil version of the situational judgement test, d = -.80. The Black-White difference in perceptions was reduced substantially to a practically trivial one-ninth of a standard deviation in the video-based version of the test, d = -.11. Hence, H6 was supported.

Figure 9. Race X Method interaction on face validity perceptions of the situational judgement test (predicted mean face validity perceptions for Black and White examinees under the paper-and-pencil and video-based methods of assessment).
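A rough sketch of the hierarchical regression logic used to test H6 appears below. It is not the analysis code used in the study: the data are simulated, the 0/1 coding of race and method is an assumption, and the ΔF for the interaction step is computed from the change in R² as in Cohen and Cohen (1983).

```python
# Hedged sketch of the hierarchical (moderated) regression for H6: does the
# Race X Method term add variance beyond Race and Method?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
n = 241
df = pd.DataFrame({
    "race": rng.integers(0, 2, n),      # 0 = Black, 1 = White (assumed coding)
    "method": rng.integers(0, 2, n),    # 0 = paper-and-pencil, 1 = video-based
})
# Simulated perceptions with a larger White-Black gap in the paper-and-pencil method
df["face_validity"] = 17 + df.method + df.race * (1 - df.method) + rng.normal(0, 3, n)

step1 = smf.ols("face_validity ~ race + method", data=df).fit()
step2 = smf.ols("face_validity ~ race + method + race:method", data=df).fit()

delta_r2 = step2.rsquared - step1.rsquared
delta_f = delta_r2 / (1 - step2.rsquared) * step2.df_resid   # 1 df in the numerator
p_value = f_dist.sf(delta_f, 1, step2.df_resid)
print(round(delta_r2, 3), round(delta_f, 2), round(p_value, 4))
```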
Face validity perceptions and performance on the situational judgement test were significantly correlated, r = .33, p < .05. Because there was a Race X Method interaction effect on face validity perceptions, it appeared likely that face validity perceptions could explain the remaining portion of the Race X Method interaction effect on situational judgement test performance not attributable to the Method X Reading Comprehension interaction on test performance (see results for H3). It should be noted that this result was not hypothesized. A hierarchical regression was performed to examine if face validity perceptions could account for the remaining unaccounted portion of the Race X Method interaction on test performance. As shown in Table 7, the variables Race, Reading Comprehension, Method of Assessment, Method X Reading Comprehension interaction, and Face Validity Perceptions were entered as a single block in step 1 of the regression of test performance and accounted for 22% of the variance, p < .05. The Race X Method interaction term was then entered in step 2 of the regression, which did not account for unique variance, ΔR² = .008, n.s. Figure 10 depicts the plot of the Race X Method interaction on test performance controlling for the effects of both the Method X Reading Comprehension interaction and Face Validity Perceptions. Compared to Figures 5 and 7, Figure 10 shows that the Race X Method interaction disappeared after controlling for the effects of Method X Reading Comprehension and Face Validity Perceptions.

Figure 10. Race X Method interaction on situational judgement test performance after controlling for the effects of the Method X Reading Comprehension interaction and Face Validity Perceptions (predicted mean test performance for Black and White examinees under the paper-and-pencil and video-based methods of assessment).

One implication of these results is that the use of a video-based method of item presentation might have had a "motivational" effect on Black examinees that affected their performance on the test. This idea will be discussed further below. Figure 11 summarizes the relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance.

Figure 11. Relationships between Race, Reading Comprehension, Method of Assessment, Face Validity Perceptions, and Situational Judgement Test Performance.

DISCUSSION

The present study has established several theoretically and practically important effects relating race, reading comprehension, method of assessment, face validity perceptions, and performance on a situational judgement test. As predicted by H1, race and the method of assessment interact to affect situational judgement test performance such that the Black-White performance difference (favoring Whites) is substantially smaller in the video-based method of testing than in the paper-and-pencil method. As predicted by H2, the method of assessment also interacts with examinees' reading comprehension ability such that test performance positively correlates with reading comprehension ability in the paper-and-pencil method but performance and reading comprehension are nearly uncorrelated in the video-based method. The results for H3 supported the argument that this Method X Reading Comprehension interaction accounts for a substantial portion of the Race X Method interaction effect on test performance.
Another set of important results involved examinee reactions to the situational judgement test. As predicted by H5, face validity perceptions of the test are significantly higher when it is administered in the video-based method than when it is administered in the paper-and-pencil method. In addition, race and the method of assessment interact to affect face validity perceptions in the manner predicted by H6. The difference in face validity perceptions reported by Blacks and Whites (with Whites giving higher face validity ratings) is greater in the paper-and-pencil method than in the video-based method. Finally, the results also suggest that face validity perceptions may explain the remaining portion of the Race X Method interaction effect on situational judgement test performance not attributable to the Method X Reading Comprehension interaction on test performance.

The implications and contributions of the present study to the research on subgroup differences in test performance and test reactions extend beyond the study of situational judgement tests. The issues revolve around the relationships between the method-content distinction and subgroup differences in test performance and test reactions. These issues will be discussed next in terms of conceptual, methodological, and practical implications.

Conceptual Implications

A fundamental contribution of the present study is the emphasis on the distinction between test method and test content. By disconfounding method and content in the present study, subgroup differences due to method and subgroup differences due to content can be isolated. By holding test content constant, the Race X Method interaction effect on test performance obtained in the present study shows that two different methods of testing measuring the same job-relevant content may have differential adverse impact. In principle, adverse impact due solely to method of testing can be eliminated by using the method with lower adverse impact, assuming that method is job-irrelevant.

Schmitt et al. (1996) argued that a significant amount of the Black-White difference in performance on paper-and-pencil tests might be due solely to the reading/written requirements inherent in the method of testing and independent of the test content. As discussed earlier, Goldstein et al.'s (1993) attempt to show that the method of testing can affect differences in subgroup test performance has several problems. The method versus content distinction made in the present study enables an empirical test of Schmitt et al.'s (1996) argument. In addition, the inclusion of a standard reading comprehension test in the study allows a direct test of the notion of a Method X Reading Comprehension interaction effect on test performance. The present findings regarding H1, H2, and H3 also support the argument that race differences in test scores may be partly the result of differences in the reading requirements associated with the method of testing.

Examinee Test Reactions

The present study contributes to the recent research on examinee reactions toward selection procedures. The only study which attempted to examine the relationship between face validity and predictive validity perceptions is Smither et al. (1993). As discussed earlier in the paper, interpretations of that study's findings are problematic because the perceptions measured were based on an examination consisting of a variety of selection procedures and the examinees used were applicants assessed for a variety of jobs ranging from entry-level to professional positions.
The present study avoided these problems by using the job of a production worker as the frame of reference and examining the relationship between face validity and predictive validity perceptions separately for four different tests. As predicted by H4, face validity perceptions and predictive validity perceptions are positively and strongly correlated. In addition, for each test, the correlation between the two types of perceptions was substantially lower than the internal consistency reliability estimates of the respective perception measures, therefore providing evidence of discriminant validity for the two types of perceptions.

In the present study, the investigation of examinee test reactions is integrated into the broader selection framework by analyzing subgroup differences in test reactions and examining their relationship to adverse impact and method of testing. Previous studies which simply compared and described mean differences in attitudes or reactions across tests have been limited in increasing our understanding of test reactions due to the method-content confound across tests. The method-content distinction helps clarify the aspects of tests responsible for examinee reactions. The results of the present study show that without varying test item content, the method of testing per se can affect face validity perceptions, including subgroup differences in these perceptions.

A serendipitous finding (insofar as the results were not hypothesized) in the present study relates to the role of face validity perceptions in explaining the remaining portion of the Race X Method interaction effect on situational judgement test performance not attributable to the Method X Reading Comprehension interaction on test performance. Race and method of assessment interact to affect face validity perceptions, which in turn affect test performance. In other words, subgroup differences in reading comprehension may account for a substantial portion of the Black-White difference in test performance in a paper-and-pencil method of assessment. In addition, a nontrivial part of the adverse impact could be due to the fact that the paper-and-pencil method of assessment elicits lower face validity perceptions from Black examinees relative to White examinees. This lowered face validity may have a negative motivational and performance effect on Black examinees.

The present results regarding the relationships between race, method of assessment, face validity, and test performance contribute to the recent research on test reactions. Face validity perceptions constitute an important dimension of test reactions. Some researchers have argued that low face validity could result in biased or inaccurate test scores and reduce the operational validity of a selection procedure (e.g., Cascio, 1987; Robertson & Kandola, 1982; Smither et al., 1993). Chan, Schmitt, DeShon, Clause, and Delbridge (under review) provided evidence that face validity perceptions affect test-taking motivation, which in turn affects cognitive test performance. Chan et al. also found that the typical Black-White difference in test performance was partially mediated by differences in face validity perceptions and test-taking motivation. Arvey et al. (1990) argued that the traditional model of cognitive test performance as simply a function of ability plus error is probably incorrect and that researchers have tended to focus exclusively on the ability dimension and have ignored the effort dimension or motivational aspects of test performance.
A similar argument may apply to performance on situational judgement tests. The present results suggest that the Black-White difference in performance on a paper-and-pencil situational judgement test could be decomposed into an ability component (i.e., reading comprehension differences) and a motivational component (i.e., face validity perception differences). However, a difference between the situational judgement test and the traditional cognitive ability test is that in the former, the ability (i.e., reading comprehension) dimension is often not part of the construct space intended to be measured by the test and is therefore job-irrelevant.

Chan et al. argued that an important practical implication of their findings was that face validity of a test represents a practical means of reducing adverse impact of many traditional paper-and-pencil measures because it is possible to write test items that reflect a credible face valid relationship to the performance of jobs for which examinees are being assessed. The present study found that the manipulation of the method of test item presentation resulted in changes in face validity perceptions, including changes in the size of the Black-White difference in these perceptions. It is plausible that these changes in perceptions in turn affected the Black-White difference in test performance. Whereas it is possible to affect face validity perceptions by writing credible items, the present findings suggest that simply changing the method of item presentation without changing item content may have substantial effects on subgroup differences in face validity perceptions and test performance.

Although the present results from the regression analyses are consistent with the idea that face validity perceptions affect test performance, it is also possible that test performance affects face validity perceptions. Chan et al. suggested that examinees' performance on a cognitive ability test may influence subsequent responses to face validity items. A self-serving mechanism may operate for reported face validity such that there exists a tendency for examinees to attribute poor test performance to low face validity of the test. Poor performance on a test whose content is perceived as unrelated to the content of the job is more self-serving than when test content is perceived as related to the content of the job. However, a self-serving bias explanation is a weaker argument in the case of performance on situational judgement tests than in the case of performance on traditional cognitive ability tests. This is because it is more difficult for an examinee to have knowledge or an estimate of his or her performance level on a situational judgement test compared to a cognitive ability test. It is not the purpose of the present study to address the causal relationships between face validity perceptions and test performance. The present data relating face validity perceptions and test performance are correlational in nature and causal inferences are not possible. Future research should consider experimental designs for manipulating test reactions and examining if Black-White differences in test performance can be reduced by changes in test reactions.

Methodological Implications

Although method and content are conceptually distinct, it is often difficult to separate the two empirically. The Goldstein et al. (1993) study discussed earlier in this paper illustrates the methodological difficulty in isolating the effects of method from the effects of test content and vice versa.
The present study suggests that one way to tease out the two different effects is to examine a common set of test items across different methods of testing. By holding test item content constant, the same intended constructs are presumably held constant across methods. However, holding item content constant does not guarantee that the same constructs are measured across method groups. Measurement invariance of responses to the test items is critical and needs to be established. In the absence of established measurement invariance, there is no support for meaningful between-method comparisons of test scores. As demonstrated in the present study, measurement invariance can be tested using the multiple-group approach to confirmatory factor analysis. Ideally, the researcher should have a priori scales for the constructs of interest so that he or she can proceed to test for equality of relevant parameter estimates (e.g., factor loadings, factor covariances, error variances) across method groups in a theory-driven manner.

Another way to test if the same constructs are measured across different method groups while holding test items constant is through the assessment of nomological invariance. The idea is similar to the assessment of external parallelism in the classical psychometric development of test items. Given a set of external reference variables, some of which are expected (for some conceptual reasons) to be empirically related to the constructs measured on the test whereas others are not, we have evidence of factorial invariance of responses to the test across method groups if both groups exhibit the same patterns of correlations between test constructs and external variables. In the present study, nomological invariance of test responses across the two method groups could not be tested because of near-zero correlations between situational judgement factors and the external reference personality variables. Therefore, the researcher should base the search and choice of external variables on solid theoretical grounds and the relevant previous empirical literature. Of course, this is often not easy because it presupposes that the researcher has little difficulty in explicating the nature of the constructs of interest on the test examined, which may not always be the case.

The mean differences obtained in the present study between racial subgroups and between methods indicate the presence of reading comprehension and some motivational differences associated with race and method. It should be noted that these mean differences reflect level differences on the situational judgement factors due to the effects of reading comprehension and motivational differences. Mean differences are consistent with factorial invariance of test responses across method groups (in terms of both measurement invariance and nomological invariance). The same construct can be measured in two groups though the groups may differ with respect to their level on the construct. Measurement invariance can coexist with mean differences because differences in factor means across method groups are independent of the equality of item-factor loadings, error variances, and factor covariances across method groups. Nomological invariance can coexist with mean differences because differences in factor means across method groups are independent of the equality of correlations between factors and external reference variables.
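The following sketch illustrates, with invented data and variable names, what an external-parallelism (nomological invariance) check of the kind described above could look like: correlations between factor scores and external reference variables are computed separately for the two method groups and their patterns compared.

```python
# Rough sketch of an external-parallelism check. All data, factor names, and
# external variables are invented for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

def factor_external_correlations(group_df, factor_cols, external_cols):
    """Correlations of each test factor with each external reference variable."""
    return group_df[factor_cols + external_cols].corr().loc[factor_cols, external_cols]

factors = ["conflict", "empathy"]             # hypothetical factor scores
externals = ["extraversion", "agreeableness"] # hypothetical reference variables
data = pd.DataFrame(rng.normal(size=(241, 4)), columns=factors + externals)
data["method"] = rng.integers(0, 2, 241)      # 0 = paper-and-pencil, 1 = video-based

paper = factor_external_correlations(data[data.method == 0], factors, externals)
video = factor_external_correlations(data[data.method == 1], factors, externals)
print((paper - video).abs().round(2))         # similar patterns suggest parallelism
```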
Another methodological issue concerns the need to correct effect size estimates (for subgroup differences) for attenuation due to unreliability. The majority of previous studies comparing adverse impact across selection procedures failed to report reliability estimates for the various measures or failed to correct effect size estimates for attenuation due to unreliability of measurement. With low reliabilities, true subgroup differences will not be detected. For studies reporting differential adverse impact across measures based on uncorrected effect size estimates, it is not clear if the results are due to true subgroup differences or simply an artifact of differential reliability in measurement. In the case of situational judgement tests, the difficulty is compounded because Cronbach's alpha, the most readily available reliability estimate, may not be an appropriate reliability index due to the multidimensional nature of these tests. Test-retest reliability is hard to obtain because it requires at least two separate administrations of the same test to the same examinees. Parallel form reliability is often not feasible because it requires the use of different item content, which raises the issue of construct equivalence and complicates the interpretation of corrected estimates. The present study suggests that one way to examine corrected effect size estimates for the multidimensional situational judgement test is to compute, for all examinees, factor scores for each situational judgement factor based on the factor loadings in the conceptually derived and empirically validated measurement model. Because the factors are latent variables free of measurement errors in the observed indicators, comparisons of method group and racial subgroup differences in factor scores provide more accurate estimates (i.e., disattenuated for unreliability in measures) of the effect of method of assessment on test performance and adverse impact.
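To make the attenuation point concrete, the short sketch below applies the classical correction of a standardized mean difference for unreliability (d divided by the square root of reliability). It is shown only as an illustration of how unreliability shrinks observed differences; the present study disattenuates through latent factor scores rather than through this formula, and the numbers are hypothetical.

```python
# Simple illustration of why unreliability shrinks observed subgroup differences.
import math

def disattenuated_d(observed_d, reliability):
    """Correct a standardized mean difference for unreliability of the measure."""
    return observed_d / math.sqrt(reliability)

observed_d = -0.50   # hypothetical observed Black-White difference
for alpha in (0.90, 0.70, 0.50):
    print(alpha, round(disattenuated_d(observed_d, alpha), 2))
# Lower reliability -> the same true difference appears smaller in observed scores.
```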
At least three limitations of the present findings should be noted. The first limitation concerns the generalizability of the findings relating to face validity perceptions. There are settings in which all examinees are likely to report that all tests are highly face valid. Examples of these settings include testing situations of actual job applicants or incumbents in which the stakes for successful test performance are high (e.g., assessment for hiring or promotion). It is very unlikely that an applicant taking a selection test for a job to which he or she desires to be hired will report low face validity on the test. In these high stake situations, self-presentation concerns may restrict reported face validity to high ratings when examinees perceive that test reactions may be used as inputs to individual selection decisions. This is most likely to happen when examinees do not have confidence that face validity responses are anonymous. In such settings, restriction of range limits the effect size estimates associated with face validity perceptions. However, it should be noted that in many of these settings, the assessment of face validity is likely to have low construct validity. Future research on the face validity of different testing methods should be sensitive to the nature of the samples used and the setting of the test assessment situation. Theories and measures of social desirability and self-presentation concerns may be relevant in certain high stake settings.

A second limitation concerns the nature of the constructs measured in the situational judgement test. Although the study addressed limitations in previous research by focusing on a priori situational judgement factors, correcting for measurement errors, and establishing factorial invariance of test responses across methods, more work needs to be done on construct validation. At this point, it is premature to use scores on individual situational judgement factors (at least those measured in this study) for any individual diagnostic or decision purpose. Future research should be explicit in the preoperational constitutive definitions of the relevant constructs in order to guide the development of appropriate measures (i.e., writing valid items). Finally, nomological invariance was not tested in the present study due to the inappropriate choice of external reference variables. In future research, factorial invariance of test responses across method groups in terms of both measurement invariance and nomological invariance should be empirically established and not merely assumed.

The focus of the present study was not on test bias as defined by the Cleary model (Cleary, 1968). No criterion performance data were collected to examine differential prediction across racial subgroups and method groups. From a practical perspective, future research should examine potential relationships between differential prediction and method effects on subgroup differences in test performance and face validity perceptions (or other motivational variables). For example, consider the use of test scores on the paper-and-pencil version of the situational judgement test in the present study as a predictor of job performance. If reading comprehension is job-irrelevant and uncorrelated with actual job performance, then using a common regression line based on the regression of job performance on situational judgement test scores would likely result in an over-prediction for White examinees and under-prediction for Black examinees. That is, test bias in the Cleary sense would occur.

Conclusion

The present study contributes to the sparse research on video testing in personnel selection and the research on situational judgement testing in particular. As mentioned early in the paper, the only published study reporting the adverse impact level of a video-based situational judgement test (Smiderle et al., 1994) did not correct for unreliability of measurement. The present study reports corrected estimates and isolates the method and content sources of subgroup differences in video testing. With the increasing popularity of video testing, clarifying the nature of its associated adverse impact levels becomes important from a legal and socio-political perspective. With the exception of relatively higher costs in test development due to video production, the video-based method shares the same practical benefits with the paper-and-pencil format of the situational judgement test, including the scale of testing which allows a large number of examinees in one session. Moreover, the video-based method is more realistic and concrete than the paper-and-pencil method. The method also elicits less adverse impact, more favorable face validity reactions in general, and smaller subgroup differences in these reactions in particular. In addition, the video-based method is generally less expensive than such high fidelity simulations as work samples and assessment centers.
Hence, from a practical perspective, it is worthwhile to invest more research effort in video-based testing and to compare the method with the traditional paper-and-pencil method of assessment for the same test. The method-content distinction made in the present study provides a conceptual and methodological basis for formulating future study designs.

REFERENCES

Alwin, D. F., & Jackson, D. J. (1981). Application of simultaneous factor analysis to issues of factorial invariance. In D. J. Jackson & E. F. Borgatta (Eds.), Factor analysis and measurement in sociological research (pp. 249-279). Beverly Hills, CA: Sage.

Arvey, R. D., & Sackett, P. R. (1993). Fairness in selection: Current developments and perspectives. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 171-202). San Francisco: Jossey-Bass.

Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test taking. Personnel Psychology, 43, 695-716.

Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.

Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.

Bernardin, H. J. (1984). Paper presented at the 44th Annual Meeting of the Academy of Management, Boston.

Bobko, P., & Bartlett, C. J. (1978). Subgroup validities: Differential definitions and differential prediction. Journal of Applied Psychology, 63, 12-14.

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Boudreau, J. (1983). Economic considerations in estimating the utility of human resource productivity improvement programs. Personnel Psychology, 36, 551-576.

Brown, J., Bennett, J., & Hanna, G. (1993). Nelson-Denny Reading Test. Lombard, IL: Riverside.

Bruce, M. M., & Learner, D. B. (1958). A supervisory practices test. Personnel Psychology, 11, 207-216.

Brugnoli, G. A., Campion, J. E., & Basen, J. A. (1979). Racial bias in the use of work samples for personnel selection. Journal of Applied Psychology, 64, 119-123.

Cascio, W. F. (1982). Costing human resources: The financial impact of behavior in organizations. Boston, MA: Kent.

Cascio, W. F., & Phillips, N. (1979). Performance testing: A rose among thorns? Personnel Psychology, 32, 751-766.

Cascio, W. F. (1987). Applied psychology in personnel management. Englewood Cliffs, NJ: Prentice-Hall.

Chan, D. (in press). Criterion and construct validation of an assessment center.

Chan, D., Schmitt, N., DeShon, R. P., Clause, C. S., & Delbridge, K. Reactions to cognitive ability tests: The relationships between race, test performance, face validity perceptions, and test-taking motivation. Manuscript under review.

Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5, 115-124.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Costa, P. T., Jr., & McCrae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.

Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.

Dalessio, A. T. (1994). Predicting insurance agent turnover using a video-based situational judgement test. Journal of Business and Psychology, 9, 23-32.

Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417-440.

Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are central issues. Psychological Bulletin, 95, 134-135.

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72, 19-29.
Dyer, P. J., Desmarais, L. B., & Midkiff, K. R. (1993). Multimedia employment testing. Paper presented at the Eighth Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco.

File, Q. W., & Remmers, H. H. (1971). How Supervise? manual. New York: Psychological Corporation.

Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493-511.

Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective. Academy of Management Review, 18, 694-734.

Gilliland, S. W. (1994). Effects of procedural and distributive justice on reactions to a selection system. Journal of Applied Psychology, 79, 691-701.

Goldstein, H. W., Braverman, E. P., & Chung, B. (1993). Paper presented at the Eighth Annual Conference of the Society for Industrial and Organizational Psychology, Montreal, Canada.

Helms, J. E. (1992). Why is there no study of cultural equivalence in standardized cognitive ability testing? American Psychologist, 47, 1083-1101.

Herriot, P. (1989). In M. Smith & I. Robertson (Eds.), Advances in selection and assessment. New York: Wiley.

Huck, J. R., & Bray, D. W. (1976). Management assessment center evaluations and subsequent job performance of black and white females. Personnel Psychology, 29, 13-30.

Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.

Hunter, J. E., Schmidt, F. L., & Hunter, R. F. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721-735.

Iles, P. A., & Robertson, I. T. (1989). The impact of personnel selection procedures on candidates. In P. Herriot (Ed.), Assessment and selection in organizations (pp. 257-271). Chichester, England: Wiley.

James, L. R., & James, L. A. (1989). Causal modeling in organizational research. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 371-404). Chichester, UK: Wiley.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

Joreskog, K., & Sorbom, D. (1986). LISREL VI: Analysis of linear structural relationships. Mooresville, IN: Scientific Software.

Joreskog, K., & Sorbom, D. (1989). LISREL 7: A guide to the program and applications. Chicago: SPSS.

Joreskog, K., & Sorbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software.

Kelloway, E. K. (1996). Common practices in structural equation modeling. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 141-180). Chichester, UK: Wiley.

Klimoski, R. J., & Brickner, M. (1987). Why do assessment centers work? The puzzle of assessment center validity. Personnel Psychology, 40, 243-260.

Latham, G. P., & Saari, L. M. (1984). Do people do what they say? Further studies on the situational interview. Journal of Applied Psychology, 69, 569-573.

Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). The situational interview. Journal of Applied Psychology, 65, 422-427.

Linn, R. L. (1978). Single group validity, differential validity, and differential prediction. Journal of Applied Psychology, 63, 507-514.

Loehlin, J. C., Lindzey, G., & Spuhler, J. M. (1975). Race differences in intelligence. San Francisco: Freeman.

Macan, T. H., Avedon, M. J., Paese, M., & Smith, D. E. (1994). The effects of applicants' reactions to cognitive ability tests and an assessment center. Personnel Psychology, 47, 715-738.

Matthews, D. B. (1991). Learning styles research: Implications for increasing students in teacher education programs. 228-236.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599-616.
McHenry, J. J., & Schmitt, N. (1994). Multimedia testing. In M. G. Rumsey, C. B. Walker, & J. H. Harris (Eds.), Personnel selection and classification. Hillsdale, NJ: Erlbaum.

Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.

Motowidlo, S. J., & Tippins, N. (1993). Further studies of the low-fidelity simulation in the form of a situational inventory. Journal of Occupational and Organizational Psychology, 66, 337-344.

Premack, S. L., & Wanous, J. P. (1985). A meta-analysis of realistic job preview experiments. Journal of Applied Psychology, 70, 706-719.

Pulakos, E. D., Schmitt, N., & Keenan, P. A. (1994). (Report FR-PRD-94-20). Alexandria, VA: Human Resources Research Organization.

Reilly, R. R., & Chao, G. T. (1982). Validity and fairness of some alternative employee selection procedures. Personnel Psychology, 35, 1-62.

Robertson, I. T., & Smith, M. (1989). Personnel selection methods. In M. Smith & I. T. Robertson (Eds.), Advances in selection and assessment (pp. 89-112). New York: Wiley.

Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67, 401-410.

Sackett, P. R., & Harris, M. M. (1983). Paper presented at the American Psychological Association Convention, Anaheim, CA.

Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.

Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job samples vs. paper-and-pencil trade and technical tests: Adverse impact and examinee attitudes. Personnel Psychology, 30, 187-197.

Schmitt, N. (1993). Group composition, gender, and race effects on assessment center ratings. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives. Hillsdale, NJ: Erlbaum.

Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences associated with different measures of some common job relevant constructs. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology. New York: Wiley.

Schmitt, N., & Gilliland, S. W. (1992). Beyond differential prediction: Fairness in selection. In D. Saunders (Ed.), New approaches to employee management (Vol. 1, pp. 21-46). Greenwich, CT: JAI Press.

Schmitt, N., Gilliland, S. W., Landis, R. S., & Devine, D. (1993). Computer-based testing applied to selection of secretarial applicants. Personnel Psychology, 46, 149-165.

Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta-analyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.

Schmitt, N., & Noe, R. A. (1986). Personnel selection and equal employment opportunity. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology. New York: Wiley.

Schneider, J., & Schmitt, N. (1992). An exercise design approach to understanding assessment center dimension and exercise constructs. Journal of Applied Psychology, 77, 32-41.

Schuler, H. (1993). Social validity of selection situations: A concept and some empirical results. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives. Hillsdale, NJ: Erlbaum.

Schuler, H., & Fruhner, R. (1993). Effects of assessment center participation on self-esteem and on evaluation of the selection situation. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives. Hillsdale, NJ: Erlbaum.

Scott, R. (1987). Gender and race achievement profiles of Black and White third-grade students. Journal of Psychology, 121, 629-634.

Smith, M., & George, D. (1992). Selection methods. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology. New York: Wiley.

Smiderle, D., Perry, B. A., & Cronshaw, S. F. (1994). Evaluation of video-based assessment in transit operator selection. Journal of Business and Psychology, 9, 3-22.

Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant reactions to selection procedures. Personnel Psychology, 46, 49-76.
Society for Industrial and Organizational Psychology. (1987). Principles for the validation and use of personnel selection procedures (3rd ed.). College Park, MD: Author.

Sorbom, D. (1974). A general method for studying differences in factor means and factor structures between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.

Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173-180.

Tenopyr, M. L. (1969). The comparative validity of selected leadership skills relative to success in production management. Personnel Psychology, 22, 77-85.

Turnage, J. J., & Muchinsky, P. M. (1984). A comparison of the predictive validity of assessment center evaluations versus traditional measures in forecasting supervisory job performance: Interpretive implications of criterion distortion for the assessment center. Journal of Applied Psychology, 69, 595-602.

Uniform Guidelines on Employee Selection Procedures (1978). Federal Register, 43, 38290-38315.

Weekley, J. A., & Gier, J. A. (1987). Reliability and validity of the situational interview for a sales position. Journal of Applied Psychology, 72, 484-487.

Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372-376.

Wigdor, A. K., & Green, B. F., Jr. (1991). Performance assessment for the workplace. Washington, DC: National Academy Press.

Wilson Learning (1990). (TAB). Longwood, FL: Wilson Learning.

Wonderlic, E. F., and Associates (1984). Wonderlic Personnel Test manual. Northfield, IL: E. F. Wonderlic & Associates.

APPENDICES

APPENDIX A

Statistical Power Analyses

For each of the following power analyses, the desired power was fixed at .80 and alpha was fixed at .05. Expected effect sizes were construed as "small" effect sizes (Cohen, 1988). (The f² and n* computations described under H1 are illustrated in the sketch that follows this appendix.)

H1: H1 tests the unique variance accounted for by the Race X Method term over and above the set of control variables consisting of Race and Method (Set A). A small ΔR² of .03 was arbitrarily expected. The expected R² for the entire set of predictors (Set A + Race X Method term) was arbitrarily fixed at a conservative value of .10. Using Cohen and Cohen's (1983) formula for effect size f², we have

f² = ΔR²/(1 - R²) = .03/(1 - .10) = .033

According to Cohen (1988), an f² value of .033 is construed as a small effect size. Cohen and Cohen's (1983) formula for the required sample size n* is as follows:

n* = L/f² + k + 1

where k refers to the df for the unique source of variance. We have k = 1. From the table of L values in Cohen and Cohen (1983), we have L = 7.85. Therefore, we have n* = (7.85/.033) + 1 + 1 = 239.8.

H2: H2 tests the unique variance accounted for by the Method X Reading Comprehension term over and above the set of control variables consisting of Method and Reading Comprehension (Set A). The same assumptions as in H1 were made, which resulted in the same sample size requirement (n* = 239.8).

H3: H3 tests the unique variance accounted for by the Race X Method term over and above the set of control variables consisting of Race, Reading Comprehension, Method, and Reading Comprehension X Method (Set A). The same assumptions were made as in H1 except that R² was fixed at a higher (but nevertheless conservative) value of .15 because of the larger number of variables in Set A. Using the same formulae as in H1 resulted in a required sample size of 226.3.
H4: H4 tests the significance of the Pearson correlation coefficient between Face Validity Perceptions and Predictive Validity Perceptions. A small effect size of r = .20 was arbitrarily expected. Based on Cohen's (1988) tables of sample size requirements for correlation coefficients, a desired power of .80 at alpha = .05 indicated that a sample size of 194 was required.

H5: H5 tests the difference in mean face validity perceptions between the paper-and-pencil method and the video-based method. A conservative d value of .30 was arbitrarily expected. Based on Cohen's (1988) tables of sample size requirements for t tests between means, a desired power of .80 at alpha = .05 indicated that a sample size of 175 was required.

H6: H6 tests the unique variance accounted for by the Race X Method term over and above the set of control variables consisting of Race and Method (Set A). The same assumptions as in H1 were made for ΔR² and R² for Set A. Using the same formulae as in H1 resulted in a required sample size of 239.8.
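The sketch below reproduces the H1 computation from Appendix A under the stated assumptions (ΔR² = .03, R² = .10, k = 1, and L = 7.85 from Cohen and Cohen's table for power = .80 and alpha = .05). It is only a check of the arithmetic, not code used in the study.

```python
# Hedged sketch of the Appendix A sample-size computation for H1 (Cohen & Cohen, 1983):
# effect size f^2 = delta-R^2 / (1 - R^2), then n* = L / f^2 + k + 1.
def required_n(delta_r2, total_r2, L, k):
    f_squared = round(delta_r2 / (1.0 - total_r2), 3)   # .033, rounded as in the text
    return f_squared, L / f_squared + k + 1

f2, n_star = required_n(delta_r2=0.03, total_r2=0.10, L=7.85, k=1)
print(f2, round(n_star, 1))   # -> 0.033 and about 239.9, essentially the 239.8 reported
```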
APPENDIX B

EXAMPLE OF A PAPER-AND-PENCIL VIGNETTE

The following is an example of a written vignette on the test booklet and some possible responses on the answer booklet in the paper-and-pencil version of the situational judgement test.

Example of a written vignette on the test booklet.

SITUATION 1

Jerry and Dennis are discussing how they should go about checking the machinery in the building. Jerry told Dennis that they should start at the West end of the building and work their way East so that the more important machinery will be taken care of first. Dennis disagreed as he thinks that since the East end is on break right now, it would be much faster to start East and work their way West. Jerry said that he has never seen anyone doing it that way and besides, the machinery at the West end is more critical. Dennis continues to disagree and thinks that they should start at the East end. Jerry can respond in a number of ways. For each possible response described in the answer booklet, indicate its effectiveness on the rating scale provided.

Examples of possible responses on the answer booklet.

After you have read SITUATION 1, rate the effectiveness of the responses below from Jerry's perspective.

1. Ask your supervisor to decide which method is better.
2. Convince Dennis that your method is best.
3. Agree to use Dennis' method.
4. Split the work in half. Each of you use your own method.
5. Tell Dennis that you will use his method for a while, but you will switch if it looks like your method is best.
6. Tell Dennis that he needs to listen carefully to your ideas.
7. Compromise. Use Dennis' method today and your method next time.
8. Demand that Dennis use your method.

For each possible response, the following rating scale is provided: VERY INEFFECTIVE / INEFFECTIVE / SOMEWHAT INEFFECTIVE / SOMEWHAT EFFECTIVE / EFFECTIVE / VERY EFFECTIVE.

APPENDIX C

TEST REACTIONS QUESTIONNAIRE

QUESTIONNAIRE ON THE TEST THAT YOU HAVE JUST COMPLETED

Consider the job of a production worker which requires working in team-based situations. To do the job well, the worker has to be technically competent and also be able to relate to other persons effectively. For such a job, indicate how much you agree or disagree with the following statements about the test that you have just completed by circling the appropriate number on the rating scale provided. Each statement was followed by the same 5-point scale: 1 = STRONGLY DISAGREE, 2 = DISAGREE, 3 = NEITHER AGREE NOR DISAGREE, 4 = AGREE, 5 = STRONGLY AGREE.

1. I did not understand what the test had to do with the job.
2. I could not see any relationship between the test and what I think is required by the job tasks.
3. It would be obvious to anyone that the test is related to the job tasks.
4. The actual content of the test was clearly similar to the job tasks.
5. There was no real connection between the test and the job tasks.
6. Failing to pass the test clearly indicates that you can't do the job.
7. I am confident that the test can predict how well an applicant will perform on the job.
8. My performance on the test was a good indicator of my ability to do the job.
9. Applicants who perform well on the test are more likely to perform well on the job than applicants who perform poorly.
10. The employer can tell a lot about the applicant's ability to do the job from the results of the test.

APPENDIX D

Table 8 (continued). Means, standard deviations, and intercorrelations of study variables by method of assessment.