This is to certify that the dissertation entitled "Influences of Information Acquisition and Method of Rating, and In-Role versus Extra-Role Behaviors on Rater Accuracy, Halo, Type and Amount of Search," presented by Jon Michael Werner, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Organizational Behavior.

Major professor

MSU is an Affirmative Action/Equal Opportunity Institution


INFLUENCES OF INFORMATION ACQUISITION AND METHOD OF RATING, AND IN-ROLE VERSUS EXTRA-ROLE BEHAVIORS ON RATER ACCURACY, HALO, TYPE AND AMOUNT OF SEARCH

By

Jon Michael Werner

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Management

1992


ABSTRACT

INFLUENCES OF INFORMATION ACQUISITION AND METHOD OF RATING, AND IN-ROLE VERSUS EXTRA-ROLE BEHAVIORS ON RATER ACCURACY, HALO, TYPE AND AMOUNT OF SEARCH

By

Jon Michael Werner

This study extends cognitively-oriented performance appraisal research by addressing two primary questions: 1) does the manner in which rating scales are organized affect the type of ratings made? and 2) what behaviors do raters consider relevant when rating job performance? To study the first question, a 2 by 2 factorial design was used, where method of rating (by person versus by dimension) and prior knowledge of format were manipulated as between-subjects variables. The second question was addressed by creating performance dimensions which captured either in-role or extra-role (i.e., organizational citizenship) behaviors. The performance levels of six ratees were experimentally manipulated as within-subjects variables, using three levels of in-role and two levels of extra-role performance. Subjects were 116 supervisors from a large university, randomly assigned to between-subjects condition. A computer simulation was devised, which asked raters to search for performance information, and then make ratings for each ratee. Hypotheses were tested using overall performance ratings, halo, accuracy, type of search, and amount of search as dependent measures. Method of rating had little impact on overall ratings but had a large impact on two measures of halo.
Halo was lowest when rating by dimension, and when subjects had no prior knowledge of format type. Results were mixed for the effects of these variables on rater accuracy. In general, accuracy was better when rating by dimension, but effect sizes were small and inconsistent. No effects were found for these manipulations on type or amount of search. Level of in-role performance, level of extra-role performance, and their interaction each explained statistically significant amounts of rating variance. Halo was increased for ratees exhibiting high levels of extra-role performance, but contradictory results were obtained for two accuracy measures (stereotype and differential accuracy). Finally, level of in-role performance had some impact on amount of search by ratee. Overall, results from this study provided partial support for the first research question, and strong support for the latter. A proposed interaction between method of rating and level of extra-role behavior was not supported. Implications are discussed.


TO DAD

You were gone too soon, there was so much more that I longed for... Yet, your accomplishments have been my unspoken model and inspiration. Though I'll not hear your "well done!" on this earth, I know this moment pleases you; to you I dedicate this labor of mine.


ACKNOWLEDGEMENTS

Faculty, family, friends - so many have guided and walked with me to this point. Words of gratitude are definitely in order.

To my committee: John Hollenbeck, Dan Ilgen, and Ken Wexley. Your helpful input and direction were invaluable. Ken, you've taught me so much. Without your early urging and encouragement, I wouldn't have returned to the doctoral program. Thanks for helping me discover a life I really do love! Dan, your ability to balance career accomplishments with faith and family spell "Career Success/Personal Success". You've shown me that Korman's warning can be heeded. Though at times I would have enjoyed less "critique" in your feedback, I see now that your input led me to produce a better product. Thanks for helping me scale back and focus on a more manageable project! John, your ability to impact our field so quickly never ceases to amaze me. I've gained so much from being around you the past five years. Your on-going feedback and guidance have helped me get through this process. Thanks!

To my family: Mom, you've sacrificed so much for me, and for our life together. Words cannot express my gratitude for all you've done. Thanks, too, for instilling in me a love of learning, and a love of the written word. Barbara, in you I truly found the best mate (for me). Your constant support and willingness to forego for the sake of my education made this long process so much more enjoyable. I couldn't have done this without you. Hans and Noelle, the hugs and kisses and love and laughter more than make up for disrupted nights, crying, and occasional bouts of the grumpies. Thanks for showing me that there is life beyond my word processor.

To my friends: Anne, Blair, Dass, Ellen, Kathy, Pat, and Peggy - I wish I had more fully appreciated what a wonderful cohort you were! To Tim and Cheryl, Larry and Karen, Lowell and Arva - you showed me true fellowship, and that it was worth the risk to take an "inside look". And to Audrey, Bill, and all my new colleagues at USC: you weren't supposed to make it onto the acknowledgements page of my dissertation, but that's life, and I'm sure glad to be a part of such a great group of people.
Finally, special thanks are in order to Stephen Gilliland, who adapted the computer program to run my simulation, and who made it easy for me to transfer the data I needed into SPSS-X files. Your computer expertise saved me hours and hours of grunt work. Also, Kay Butcher and Mike Rice provided invaluable assistance to me in lining up the supervisors for both phases of my project. THANKS TO YOU ALL!!


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: INTRODUCTION
  Overview
  Search and Rating Scale Format
  In-Role versus Extra-Role Behaviors
  Contributions of this Research
  Search and Method of Rating
  In-Role vs. Extra-Role Behaviors

CHAPTER 2: LITERATURE REVIEW AND HYPOTHESES
  Past Research on Error and Accuracy
  Error
  Accuracy
  Cognitively-Based Performance Appraisal Research
  The Borman/Murphy Stream
  University of South Carolina Stream
  Person- vs. Task-Blocking in Information Acquisition
  Rating Format vs. Advanced Knowledge of that Format
  Process Tracing Research
  Hypotheses Concerning Information Acquisition/Method of Rating
  Overall Ratings
  Error
  Accuracy
  Mediating (Process) Variables
  In-Role Versus Extra-Role Performance
  Measuring Performance
  Organizational Citizenship Behavior
  OCB and Performance Appraisal
  Hypotheses Concerning In-Role and Extra-Role Behaviors
  Level of In-Role Behaviors and Accuracy
  Level of In-Role Behaviors and Amount of Search
  Extra-Role Behaviors, Error, and Accuracy
  Interactions Between the In-Role and Extra-Role Manipulations
  Hypotheses Concerning Method of Rating, OCB, and Accuracy
  Summary

CHAPTER 3: METHOD
  Overview of Methodology
  Participants
  Power Analysis
  Procedure
  Deriving the Content of the Study
  The Primary Study
  Constraints on Information Search
  Variables
  Overall Performance Ratings
  Error
  Accuracy
  Type of Search
  Amount of Search
  Data Analysis

CHAPTER 4: RESULTS
  Content Derivation for the Study
  Results from the Primary Study
  Overall Ratings
  Halo Effect
  Accuracy
  Correlational Accuracy
  Distance Accuracy
  Dickinson's MANOVA Approach
  Type of Search
  Amount of Search
  Dickinson's Approach to the Within-Subject Manipulations
  In-Role Performance
  Extra-Role Performance
  Interaction of In-Role and Extra-Role Performance
  Method of Rating, OCBI, and Accuracy

CHAPTER 5: DISCUSSION
  Hypotheses Concerning Type of Format
  Overall Ratings
  Halo
  Accuracy
  Correlational Accuracy
  Distance Accuracy
  Process Variables
  Type of Search
  Amount of Search
  Hypotheses Concerning Type of Performance
  In-Role Performance
  Differential Accuracy
  Amount of Search
  Extra-Role Performance
  Halo
  Accuracy
  Interaction of In-Role and Extra-Role Performance
  Hypotheses Concerning Method of Rating, OCBI, and Accuracy
  Summary and Directions for Future Research
  Strengths and Limitations of the Current Study
  Strengths
  Limitations
  General Conclusions from this Study
  Type of Format
  Type of Performance Dimension
  Directions for Future Research
  Type of Format
  Type of Performance Dimension
  Process Issues
  Setting/Contextual Issues

APPENDIX A

LIST OF REFERENCES


LIST OF TABLES

Table 1: Mean Importance Ratings Given to the Six Performance Dimensions
Table 2: Background Information for Subject Matter Experts and Primary Sample
Table 3: True Scores from the 15 Subject Matter Experts
Table 4: Overall Ratings by Ratee: True Scores, Whole Sample, and by Condition
Table 5: Median and Mean Intercorrelations Among Dimensions, Across Ratees
Table 6: Mean Ratings by Ratee and Dimension: Whole Sample, and by Condition
Table 7: Analysis of Variance for Prior Knowledge and Format Type on Accuracy
Table 8: Means, Standard Deviations, and Breakdowns Concerning Type of Search
Table 9: Analysis of Variance for In-Role and Extra-Role Performance
Table 10: Means and Standard Deviations for Orthonormal Contrasts by Ratee
Table 11: Means, Rank Order, and Standard Deviations for Amount of Search by Ratee
Table 12: Median and Mean Intercorrelations by OCBI
Table 13: Means and Standard Deviations for SA and DA by Level of OCBI
Table 14: Means and S.D.'s for SA and DA by Method of Rating and Level of OCBI
Table 15: MANOVA for Method of Rating, OCBI, and Stereotype Accuracy
Table 16: MANOVA for Method of Rating, OCBI, and Differential Accuracy
Table 17: Amount and Order of Search, by Ratee and by Dimension

LIST OF FIGURES

FIGURE 1: Overview of the Research Variables
FIGURE 2: A Model of the Influences of Person- versus Dimension-Blocking on Performance Appraisal Ratings
FIGURE 3: The Between-Subject Manipulations of Format Type and Prior Knowledge of Format
FIGURE 4: Predictions Concerning Type of Format, Advanced Knowledge of Format, Halo, and Accuracy
FIGURE 5: Predictions Concerning Type of Search and Amount of Search
FIGURE 6: Sample Performance Dimensions to be Rated
FIGURE 7: Target Profiles for Hypothetical Ratees
FIGURE 8: Predicted Relationships between Method of Rating, Favorability of OCB Information, and Accuracy
FIGURE 9: Contrast-Coded Variables for this Research
FIGURE 10: Summary of the Variables from Hypotheses 1-8 Predicted to be Statistically Significant
FIGURE 11: Median Intercorrelations Among Dimensions (Halo)
FIGURE 12: Results for Stereotype Accuracy (as measured by SACORR)
FIGURE 13: Results for Distance Score Accuracy Components
FIGURE 14: Results for Overall Cronbach Accuracy
FIGURE 15: Interaction of In-Role and Extra-Role Performance
FIGURE 16: Interaction of In-Role and Extra-Role Performance, by Type of Format


CHAPTER 1: INTRODUCTION

Overview

Few areas in personnel psychology and human resource management have been as heavily researched as performance appraisal (Nathan & Tippins, 1990). A good portion of this research has focused on halo effects (e.g., Cooper, 1981; Jacobs & Kozlowski, 1985; Murphy & Reynolds, 1988), rating accuracy (e.g., Lord, 1985; Murphy & Balzer, 1986; Padgett & Ilgen, 1989), or the relationship between halo and accuracy (e.g., Becker & Cardy, 1986; Smither & Reilly, 1987; Fisicaro, 1988; Murphy & Balzer, 1989). While a considerable amount has been learned from such research, a criticism of much psychometric research is that not enough has been learned concerning the process by which raters form their appraisal judgments (Landy & Farr, 1980).

Since 1980, extensive effort has been expended studying cognitive influences on performance appraisal (Ilgen & Feldman, 1983; DeNisi, Cafferty, & Meglino, 1984; Ilgen, Barnes-Farrell, & McKellin, in press). For example, Ilgen et al. (in press) summarized much of the cognitively-oriented performance appraisal research conducted in the 1980s. They discussed research under three broad phases of information processing, i.e., a) search or acquisition of information, b) categorization, organization, and storage of information in memory, and c) retrieval and integration of information, followed by judgments formed on the basis of this information. Within each phase, Ilgen et al. (in press) summarized the effects found from four sources: raters, ratees, rating scales, and the setting or context in which appraisal took place.

The current research follows considerable previous research (cf., Murphy & Balzer, 1986; Dickinson, 1987) on the accuracy of performance appraisal ratings, since accuracy is a primary goal of any appraisal system (Feldman, 1986). However, this study goes beyond past research, and is expected to shed increased light on the process by which raters form their appraisal judgments.

An important issue addressed by this research is the domain of individual ratee behaviors (and other characteristics) which raters consider relevant when rating job performance. Most current approaches to performance appraisal recommend that raters (and rating scales) focus on measurable job behaviors and/or tangible results (Latham & Wexley, 1981; Odiorne, 1965). Organ (1977), on the other hand, argued that practicing managers view "performance" as more than simply fulfilling one's job duties. The latter has been labelled in-role behaviors (Williams, 1988). Such in-role behaviors are obviously important to managerial ratings of performance. However, according to Organ, acts that are spontaneous, and generally beyond one's regular role requirements, are also important to managers when they evaluate ratee performance. Such extra-role behaviors have been labelled "organizational citizenship behaviors", or OCB (Bateman & Organ, 1983).
While much of the subsequent research on such extra-role or citizenship behaviors has not focused on performance appraisal per se, recent research by Orr, Sackett, and Mercer (1989), and MacKenzie, Podsakoff, and Fetter (1991) strongly suggests that managers can and do evaluate both in-role and extra-role (OCB) behaviors when making appraisal judgments. Because of these findings, both in-role and extra-role performance are of interest in the present study.

This chapter begins by describing research propositions relating to search and rating scale format. Such research focuses more on aspects of in-role performance. It is based on the premise that the manner in which rating scales are organized affects the type of ratings made. Since Symonds (1925), various researchers have suggested that ratings made one dimension at a time would exhibit less halo effect than those made in the traditional, one-person-at-a-time manner. While research on this notion has been disappointing in its demonstrated effect on halo (Cooper, 1981), recent cognitively-oriented research suggests that scale format organizes or "frames" the rating problem, and that such framing effects are likely to have a greater impact on the manner in which raters search for performance-relevant information (DeNisi & Williams, 1988). That is to say, such framing is expected to affect the rating process, such that raters using a person-blocked scale are expected to search for information in a person-blocked manner, and raters using a dimension-blocked scale will search for information in a dimension-blocked fashion. These different search strategies are then expected to influence measures of rater accuracy.

Such questions are important to address, but they do not get at the issue just mentioned concerning whether recent performance appraisal efforts have downplayed or ignored performance dimensions which managers consider important when making their appraisal ratings. For example, Williams and Hummert (1990) studied the constructs used by supervisors and clerical employees to define productive and unproductive work behavior. Both groups described behaviors corresponding to 9 of the 12 dimensions used on that organization's rating scale. However, a number of dimensions and behaviors were mentioned which were not a part of the rating scale, e.g., helping others when one's own work is complete, going the "extra mile" to complete a task, and speaking well of the organization when off the job. As will be demonstrated below, such behaviors are very similar to the extra-role behaviors discussed by Organ (1977; 1988b), Orr et al. (1989), and MacKenzie et al. (1991). This supports the argument that it is important to study the effects of both in-role and extra-role performance when researching the performance appraisal process. In the present context, this means studying the effects of in-role and extra-role performance on measures of rater search and accuracy. In effect, this aspect of the research asks whether more dimensions (in-role and extra-role) should be measured in performance appraisal than is presently recommended (e.g., Cascio, 1989). It is expected that search and accuracy measures will be influenced by the experimental manipulations of both in-role and extra-role performance. These issues will be addressed in this chapter following the discussion of rater search and rating scale format manipulations.
As indicated, this research goes beyond measuring a static dependent variable such as rater accuracy, and will also study the cognitive processes used by raters as they search for information and make subsequent appraisal ratings (Landy & Farr, 1980). A relatively recent methodology known as process tracing (Ford et al., 1989) will be used to examine the processes used by raters when they search for performance information under different framing conditions (i.e., person- vs. dimension-oriented scale formats). This methodology, as well as the computer simulation which will be used, are discussed in greater detail below. Briefly, variables such as the amount and type of search are expected to be influenced by the manipulations of person- versus dimension-oriented rating scale formats. Further, it is expected that search will be influenced by manipulations of in-role and extra-role performance, and that interactions will occur between the manipulations of search and extra-role performance.

In the remainder of this chapter, the above ideas will be sketched out in greater detail. Following this, the contributions which this research can make to the performance appraisal literature are discussed.

Search and Rating Scale Format

The first major issue in this research is the most psychometric in nature, having to do with whether the method of rating influences rater accuracy and the amount of halo observed. In 1925, Symonds suggested that halo errors would decrease if raters made ratings of all individuals on one trait or dimension before moving on to the next dimension. Some support for this was found by Stevens and Wonderlic (1934). Cooper (1981), in his influential work on halo, cited four studies which tested for differences between raters who rated "by person" (i.e., in the typical manner) versus "by category" (Taylor & Hastman, 1956; Johnson, 1963; Blumberg, DeSoto, & Kuethe, 1966; Brown, 1968). None of these studies found significant differences between the methods in the amount of halo observed. However, recent cognitively-oriented research (discussed more fully in the next chapter) indicates that method of information acquisition does influence various measures of rating accuracy. In this case, whether information is gathered "by person" or "by dimension" can be viewed as an acquisition strategy. However, even though Cooper (1981) falls within the cognitive perspective, in none of the studies cited by him did researchers measure accuracy.

Current research makes frequent use of expert "true scores" and the four distance score components of rater accuracy discussed by Cronbach (1955; e.g., Dickinson, 1987; Smither & Reilly, 1987); i.e., elevation - the mean rating given by a rater, across all ratees and dimensions; differential elevation - the mean rating given to each ratee, across dimensions; stereotype accuracy - the mean rating given to each dimension, across ratees; and differential accuracy - the ratings of each ratee's performance on each dimension (see Chapter 2). The present research will also measure halo, so as to compare results with previous research. Of greater concern, however, is the impact of manipulating rating scale format and prior knowledge of that format on measures of rating accuracy. Cronbach's (1955) measures of accuracy will be used as primary dependent variables. Feldman (1981) predicted that the structure of performance appraisal forms would interact with the manner in which raters categorize, store, recall, and integrate performance information.
Further, DeNisi et al. (1984) and DeNisi and Summers (1986) proposed that introducing rating scales prior to the observation of performance helps raters use the scale format as a cue (or frame) for organizing information in memory. This prediction will be tested by the present research, i.e., is it differences in rating scale format per se which influences rater accuracy and halo, or is it rater knowledge of that format prior to information acquisition which influences rater accuracy and error? The rationale for this manipulation is expressed in DeNisi and K. Williams (1988). Manipulating the structure of the rating scale format (person vs. dimension) is expected to influence halo and accuracy primarily via its influence on the retrieval of information from memory, whereas providing knowledge of the scale format prior to search is expected to affect rater search strategies and subsequent information storage in memory. This experimental manipulation should provide the "more appropriate test" of the effects of rating by person versus dimension called for by DeNisi and Williams (1988, p. 119).

A number of recent studies have utilized computer-controlled information display boards to examine differences in search/acquisition strategies (Williams et al., 1985; Cafferty et al., 1986). This approach comes out of research on decision making, and is often labeled "process tracing" research (Ford, Schmitt, Schechtman, Hults, & Doherty, 1989). Such an approach allows the researcher to examine the pattern and depth of each rater's information search. For example, subjects who know in advance that they will rate by person (or dimension) are expected to search for information in a manner consistent with that format. Also, a person-blocked format is expected to increase the salience of global impressions, which is then expected to decrease the overall amount of search undertaken (Hastie & Park, 1986; Srull & Wyer, 1989). A process tracing approach is ideally suited to studying the effects of the experimental manipulations on each rater's search strategies. Unlike much previous performance appraisal research, a process tracing methodology is designed to allow the researcher to map the manner in which raters actively seek and process information (Ilgen & Feldman, 1983).

In-Role versus Extra-Role Behaviors

The second broad issue in this research concerns the extent to which raters are influenced by both the in-role and extra-role (OCB) behaviors exhibited by ratees. If managers consider extra-role behaviors when making performance ratings, yet these behaviors don't fit within typical performance appraisal dimensions, then these behaviors are having an undetermined effect on rater search activities and accuracy. In the present study, ratee levels of in-role and extra-role performance will be experimentally manipulated. For reasons discussed below, the presence of such extra-role, or citizenship information is expected to influence rater search strategies, as well as indices of rater accuracy.

Most behavioral approaches to performance appraisal currently in use emphasize tying the appraisal instrument as closely as possible to the job description for the job in question (e.g., Latham & Wexley, 1981). Recently, however, concern has been expressed that even well-done job analyses may miss many behaviors which managers consider important in evaluating employee performance, behaviors that are not technically part of the job requirements (e.g., cooperation, contribution to morale). Orr et al.
(1989) found that the majority of the managers in their policy-capturing study used information on both prescribed (in-role) and citizenship behaviors when estimating the dollar value of performance (SDy). Their intent was not to show the relative importance of each type of behavior for estimates of the dollar value of performance, but rather to show that citizenship behaviors played a role when managers made such estimations.

MacKenzie, Podsakoff, and Fetter (1991) obtained measures of objective sales performance for insurance agents, then collected managers' evaluations of their subordinates' citizenship behaviors, as well as subjective performance evaluations. They used the objective indices as indicators of "in-role" performance and, unsurprisingly, found a sizable (r = .48) correlation between objective performance and subjective rating. What was noteworthy about this study was that a LISREL analysis including both objective and OCB measures of performance accounted for 44% of the variance in subjective ratings. This was much higher than the 7% mean R-squared reported by Heneman (1986) in his meta-analysis of the objective-subjective performance link, and would seem to indicate that OCB added considerably to the explained variance in supervisory ratings. Further, two aspects of citizenship (factors labelled "Altruism" and "Civic Virtue") explained as much variance in subjective ratings as did the objective performance measures. If these findings are robust across other samples, it bolsters Organ's (1988a) argument that managers view performance as more than simply "in-role" behavior.

While MacKenzie et al. (1991) studied OCB within a performance appraisal context, their operationalization of "in-role performance" was in fact a "results" measure of performance, i.e., they used three measures of tangible organizational outcomes. This is much different from the behavioral approach taken by most citizenship research (Smith, Organ, & Near, 1983; Bateman & Organ, 1983; Williams, 1988). Orr et al. (1989) had managers rate 13 traits and behaviors, yet their research was not strictly a performance appraisal study. Thus, it would be valuable to replicate the findings of MacKenzie et al. and Orr et al. in a performance appraisal setting where in-role and extra-role behaviors are systematically varied (thus keeping the measurements of in-role and extra-role performance as similar as possible).

Further, there is value in embedding a study of in-role versus extra-role behaviors within the present research framework, with its emphasis on search, accuracy, and halo. First, Organ (1990a) proposed that OCB would explain some portion of the halo effect observed in subjective ratings of performance. For example, if two secretaries are equivalent in terms of in-role performance, but one is known to go the "extra mile" to complete a task, while the other declares that "If I have to work an extra ten minutes, I better get paid for it!" (Williams & Hummert, 1990), the prediction is that such extra-role behaviors will influence managers' general ratings of performance (positively and negatively, respectively). No research was located which measured the effects of citizenship behaviors on measures of halo or rating accuracy.

Second, at least two past studies of OCB could have profited by an explicit measurement of rater accuracy. For example, in their seminal work on OCB, Smith et al. (1983) found an unexpected effect for organizational department on each of their OCB factors.
In their discussion, they attributed this to different response sets (i.e., amounts of leniency) among raters. Similarly, Orr et al. noted that their study picked up "real individual differences in perceptions of variability of performance" (1989, p. 39). Such effects can be successfully captured using Cronbach's (1955) measure of elevation. The value of using Cronbach's four distance score measures of accuracy is that the effects of response styles (e.g., leniency and strictness) can be separated from other aspects of rating accuracy (Funder, 1987; Sulsky & Balzer, 1988). In the current research context, it may be that some raters are inherently strict or lenient, regardless of how rating format, in-role, or extra-role performance are manipulated. In this case, elevation is viewed as a nuisance variable, which can be controlled or partialled out, so that the more interesting effects of the manipulations on the other aspects of accuracy can be measured (see Chapter 2).

Third, and most importantly for the current research, interactions can be hypothesized between search/method of rating, the presence or absence of OCB information, and Cronbach's (1955) measures of accuracy. Concerning search and method of rating, the research summarized by DeNisi and Williams (1988) would predict that raters rating by person would be more accurate on a measure of differential elevation (rank ordering ratees), but less accurate on measures of stereotype accuracy (ranking dimensions) and differential accuracy (the dimension x ratee interaction) than raters rating by dimension. Concerning extra-role behaviors, MacKenzie et al. (1991) proposed that OCB information may trigger the increased use of categories and schema by raters (cf., Fiske, 1981; Feldman, 1981), which would be expected to decrease accuracy, at least for measures of stereotype and differential accuracy. Thus, at least two of the differences in accuracy predicted by DeNisi and Williams (1988) are expected to be even stronger when OCB information is present versus when it is not.

The expectation of an interaction between the manipulations of method of rating and extra-role performance is the primary justification for simultaneously studying the effects of both on measures of search, accuracy, and halo. The experiment to be described below integrates and extends several distinct lines of prior research. The contributions of the present research to the performance appraisal literature are discussed next.

Contributions of the Present Research

Search and Method of Rating

The present study makes several practical and theoretical contributions to our understanding of the performance appraisal process, as well as the role of extra-role behaviors within that process. First, research has been quite limited concerning whether method of rating (by person versus by dimension) influences rating accuracy or halo. This is the case despite a long history of interest in the topic in personnel psychology (Symonds, 1925; Stevens & Wonderlic, 1934), as well as in educational measurement. The process of making performance ratings can be compared to the educational practice of grading essay examinations one exam at a time versus one question at a time. Most educational measurement specialists advocate the latter approach, as illustrated by Mehrens and Lehmann (1973), who wrote that "To reduce the halo effect..., we strongly recommend that teachers grade one question at a time rather than one paper (containing several responses) at one time" (p. 234).
Yet, in spite of the widespread acceptance of the superiority of grading "by question" (e.g., Coffman, 1971; Coker, Kolstad, & Sosa, 1988), no research studies were located which tested the effects of one method against the other.

In personnel psychology research, the research stream summarized by DeNisi and Williams (1988) suggests that different methods of rating should influence different measures of accuracy. DeNisi and Williams (1988) proposed that structuring information by person increases a rater's ability to measure the overall proficiency of each worker (differential elevation), but decreases the ability to accurately rate within-ratee differences on various performance dimensions (differential accuracy). If such effects are found in the present study, this would lead to important practical recommendations concerning method of rating and the purpose of appraisal (DeNisi, Cafferty, & Meglino, 1984). If the primary outcome desired from appraisal is an accurate rank ordering of candidates for purposes of promotion or merit increase, then the traditional method of rating by person would be recommended (to increase differential elevation; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982). If, however, the primary purpose of appraisal is for developmental purposes, i.e., providing feedback to employees about their performance on specific job dimensions, then rating by dimension would be recommended (to increase stereotype and differential accuracy; cf. Dickinson, 1987). This raises the interesting prospect that a single method of rating may be inadequate for multiple appraisal purposes (DeNisi & Williams, 1988, p. 139). Regardless of how the results come out, more research is needed on the effects of rating method on rater search, accuracy, and halo.

In-Role vs. Extra-Role Behaviors

The second major contribution of this research lies in its ability to measure the effects of in-role versus extra-role behaviors on performance ratings in a controlled laboratory environment. This is expected to strengthen the findings of Orr et al. (1989) and MacKenzie et al. (1991) that citizenship behaviors explain important variance in ratings beyond that explained by in-role performance. The implication of this is that most approaches to job analysis and performance appraisal are incomplete, because minimal to no attention has been paid to non-prescribed behaviors such as citizenship behaviors (MacKenzie et al., 1991). If results come out as predicted, this strengthens Organ's (1977, 1988a) argument that we need to change the way we view and define "performance".

Returning to the framework laid out by Ilgen et al. (in press) and discussed above, it can be seen that the current research focuses most heavily on the acquisition or search phase of information processing, although issues of the organization of information in memory, the retrieval of information from memory, and subsequent performance judgments are also addressed. Additionally, three of the four sources of variance cited by these authors are studied: a) rating scales (person vs. dimension formats), b) raters (i.e., their search strategies), and c) ratee effects (manipulations of ratee levels of in-role and extra-role performance). The importance of the final source of variation, setting or context, has been described or documented by numerous recent researchers (Mohrman & Lawler, 1983; Longenecker, Sims, & Gioia, 1987; Padgett, 1988; Longenecker, 1989).
However, it goes beyond the scope of the current research to study the effects of context on rater search and accuracy, and thus, as important as they are, these variables will not be explicitly measured or manipulated in the present research.

The preceding pages have summarized briefly the need for research on the effects of search and method of rating, as well as in-role and extra-role behavior, on rater search strategies and accuracy. The following figure shows this research in broad outline form, and the primary independent and dependent variables which will be manipulated or measured (see Figure 1). This will be explicated more clearly in Chapter 2.

[Figure 1: Overview of the Research Variables. The figure diagrams the independent variables (person- vs. dimension-blocked format, prior knowledge of format type, level of in-role performance, and level of OCB information) and the dependent variables (type of search, amount of search, elevation, differential elevation, stereotype accuracy, differential accuracy, halo, and overall performance ratings).]


CHAPTER 2: LITERATURE REVIEW AND HYPOTHESES

This chapter is organized as follows. First, research on rater effects (errors) and accuracy is briefly reviewed, including a presentation of Cronbach's accuracy measures. Second, recent performance appraisal research from a cognitive perspective (e.g., DeNisi et al., 1984) is discussed in some detail. The hypotheses concerning the accuracy of ratings made by person versus by dimension are then drawn from this research. Third, issues related to the proper measurement of performance are presented, followed by hypotheses on the effects of in-role versus extra-role behaviors on accuracy, error, and search measures. Following this, hypotheses are presented outlining the proposed interactions between method of rating, extra-role performance, and accuracy. Finally, a summary of the present research is provided.

Past Research on Error and Accuracy

Appraising employee performance continues to be of widespread interest to both researchers and practitioners in the field of personnel/human resource management (Wexley & Klimoski, 1984; Locher & Teel, 1988). As stated previously, early research on performance appraisal tended to emphasize either psychometric issues, i.e., how to make the rating instrument less prone to bias (cf., Landy & Farr, 1980), or training raters to avoid bias (e.g., Latham, Wexley, & Pursell, 1975; Bernardin & Pence, 1980). The most common dependent variables in such research were various measures of halo, leniency, or range restriction. Such measures were thought to provide indirect measures of rater accuracy. Recently, however, this view has been increasingly challenged (Cooper, 1981; Becker & Cardy, 1986; Fisicaro, 1988; Sulsky & Balzer, 1988; Murphy & Balzer, 1989).
An increasing body of literature, however, suggests that observed intercorrelations among dimensions are sometimes higher and sometimes lower than the true intercorrelations (Fisicaro, 1988; Murphy 6: Reynolds, 1988; Murphy 6: Jake, 1989). As noted by Murphy and Jake (1989), it is hard to see how a reliance on overall impressions could lead raters to W the true relations among dimensions. An important point related to the above discussion is that most halo measures do not take into consideration the true intercorrelations among dimensions (Murphy & Balzer, 1989). When true intercorrelations are 19 unknown, it is impossible to state unequivocally whether a high observed intercorrelation is halo egrg; or not. This is true for measures of leniency and range restriction as well. Murphy and Balzer (1989) conducted a meta-analysis using the raw data from nine performance appraisal studies where Cronbach's (1955) measures of accuracy were available. The mean correlation between six rater error measures and the Cronbach accuracy measures was r - .05 (r - .06, when corrected- for attenuation and sampling error). Murphy and Balzer argued that the most likely explanation for this weak link between error and accuracy measures was the inability of typical error measures to take true intercorrelations into account. They recommended that researchers discontinue use of error measures as indirect indicators of rating accuracy. AQQBEEQX- Beginning with Borman (1977; 1979) and Murphy, Garcia, Kerkar, Martin, & Balzer (1982), direct measures of rating accuracy have increasingly been used in performance appraisal research. For example, Murphy et a1. (1982) developed videotaped performance vignettes, then obtained "true scores” from expert raters who had enhanced opportunities to view ratee performance. The accuracy of subjects' ratings were then measured in comparison to these true scores using formulas developed by Cronbach (1955). Cronbach (1955) criticized earlier research on person perception because it relied on a global accuracy measure often referred to as D2 (Sulsky & Balzer, 1988). Such an index measures the squared difference between subject ratings and true scores averaged across all ratees and 20 dimensions (see Appendix A for all formulas). Cronbach argued that this index combines different aspects of rating accuracy, each of which can be useful in and of itself. Cronbach proposed that the overall distance between rater ratings and true scores be decomposed into four independent accuracy scores. From Cronbach and others (Murphy et a1, 1982; Dickinson, 1987; Kenny & Albright, 1987; Sulsky 6: Balzer, 1988), these distance score components can be defined as follows: Elavatiog: the component of accuracy due to the average or mean rating given by a rater, across all ratees and dimensions. This can be viewed as a general response set, in that it affects the way a rater rates every target on every trait. In ANOVA terminology, this is synonymous with the differential grand mean. Differential Elevation: the component of accuracy associated with the average rating given to each ratee, across all performance dimensions. This reflects a rater's ability to order ratees in comparison to their overall differences as specified by the means of their target ratings. In ANOVA terms, this is the differential main effect for ratees. Stareotype Accuraty: the component of accuracy associated with the average rating given to each performance dimension, across ratees. 
Cognitively-Based Performance Appraisal Research

Much of the performance appraisal research in the 1980s was more explicitly cognitive in orientation than earlier research, primarily because earlier research, while not without some measure of success, was still disappointing in all that it left unanswered (Landy, Zedeck, & Cleveland, 1983). Too often, raters of performance were viewed as "black boxes", where stimuli were presented, and responses recorded, yet little was learned about how raters formed their performance judgments (Wexley & Klimoski, 1984). Landy and Farr (1980) reviewed past research on performance rating, and concluded that researchers could more fruitfully focus on the cognitive processes of raters, in order to generate "more substantive propositions concerning where rater biases come from" (p. 96). Their paper seems to have generated considerable research and theorizing (cf., Feldman, 1981, 1986; Ilgen & Feldman, 1983; Wexley & Klimoski, 1984; DeNisi, Cafferty, & Meglino, 1984; DeNisi & Williams, 1988; Ilgen et al., in press). For example, DeNisi and Williams (1988) reviewed numerous cognitive models of the appraisal process, including those of Wherry (1952; Wherry & Bartlett, 1982), Feldman (1981), Ilgen and Feldman (1983), and DeNisi, Cafferty, and Meglino (1984). While these models differ in what they emphasize and the particular components included in each model, they are similar in proposing a multi-step process, where raters acquire, store, recall, and then combine performance information to make judgments.

In this section of the chapter, three related research streams will be highlighted. The first stream follows from Borman and Murphy, and has emphasized the Cronbach measures of accuracy as major dependent variables. The second stream comes out of the University of South Carolina (e.g., DeNisi et al., 1984), and has emphasized issues related to information acquisition, storage, and retrieval. Finally, research utilizing a methodology known as process tracing will be discussed. With this design, raters actively search for information, and the researcher can map the amount and type of search undertaken by each rater. This methodology will be used to study the effects of the experimental manipulations on rater search strategies (see Figure 1). Use of this methodology allows the current research to go beyond static measures of accuracy or halo, to tap aspects of the cognitive processes used by raters when making their ratings (Landy & Farr, 1980).

The Borman/Murphy Stream

Despite a long tradition of study, research on the accuracy of person perception "came to an abrupt and nearly complete halt after the publication of Cronbach's methodological critique" (Funder, 1987, p. 77).
Funder (1987) noted, however, that a limited amount of accuracy research continued after Cronbach (1955), most notably by industrial psychologists. Much of the credit for this can be attributed to Walter Borman. As noted above, Borman (1977) developed videotapes and expert true scores for use in a laboratory rating task. Citing Cronbach (1955), Borman (1977) argued that differential accuracy was the most appropriate measure for evaluating performance judgments. He defined differential accuracy (DA) as the ability to correctly rank order a target person's standing on a given trait (dimension), and measured this with a measure which correlated ratings with true scores for each dimension. However, Borman's DA measure is not equivalent to Cronbach's notion of differential accuracy, either in Cronbach's distance or correlational formula (see Appendix A). In fact, Borman's DA measure is insensitive to the distances between ratings and true scores, whereas Cronbach's distance measure is sensitive to this (Sulsky & Balzer, 1988). Further, it is important to view accuracy as a multidimensional phenomenon (Becker & Cardy, 1986). This was not done by Borman (1977; 1979), but has been done in research following Murphy et al. (1982).

Murphy et al. (1982) also developed a set of videotapes, which portrayed lecture vignettes by various graduate teaching assistants. Similar to Borman (1977), true scores were obtained by expert raters who had multiple opportunities to view the tapes. Unlike Borman, however, all four Cronbach measures of accuracy were calculated. Further, student raters were asked to make two types of ratings: a behavioral frequency rating, and a global evaluation. The frequency rating was similar to a behavioral observation scale (BOS; Latham & Wexley, 1981), where raters reported the frequency of observing 12 key behaviors. The global performance evaluation utilized Likert scales to evaluate eight trait-like dimensions of performance, and resembled a typical graphic rating scale. The major purpose of Murphy et al. (1982) was to demonstrate a significant relationship between the accuracy of observing behavior and the accuracy of rating overall performance. Support for this was found in that, for all of the Cronbach measures except stereotype accuracy, there were significant correlations between the accuracy of the frequency rating and the corresponding accuracy measure for the global evaluation. For the purposes of the present study, the major contributions of Murphy et al. (1982) are three-fold: a) establishing the first link between information acquisition activities and rating accuracy (DeNisi & Williams, 1988); b) successfully reintroducing all of Cronbach's accuracy measures into the personnel psychology literature; and c) developing videotapes which facilitated considerable further research on performance appraisal accuracy (see Murphy & Balzer, 1989).

Two studies which have used the Murphy tapes will be briefly discussed (Murphy & Balzer, 1986; Murphy, Philbin, & Adams, 1989). First, Murphy and Balzer (1986) predicted that a one-day delay between observation and rating would lead to increased halo (due to increased reliance on general impressions), and that this would lead to decreased accuracy. Elevation measures were not included in this study, since there was little evidence that memory-based ratings made under delayed conditions would be more or less lenient than those obtained immediately after observing performance. Results from this study indicated that the delayed ratings were more highly intercorrelated than those made immediately after observation; they were also significantly higher than the average "true" intercorrelations [e.g., true r = .44; observed r (delay condition) = .64]. However, for measures of stereotype and differential accuracy, the delayed ratings were more accurate than the immediate ratings. This was presented as further evidence for the weakness of error measures (e.g., halo) as primary indicators of rating accuracy.

The final study to be discussed by Murphy and his associates is Murphy, Philbin, and Adams (1989), which built on the two studies just mentioned. Murphy et al. (1989) compared the accuracy of immediate ratings with those made one, three, and seven days after observation of the tapes. They also varied the purpose of observation, i.e., half the subjects were told that performance evaluation was their sole task, while the remaining subjects were told that their primary task was to learn the content of the lectures, with performance evaluation as a secondary task. It was predicted that those for whom performance evaluation was a primary task would be more accurate than those focusing more on the lecture content. Somewhat disappointingly, this result was found only for the stereotype accuracy measure of the frequency ratings; none of the global performance appraisal measures showed significant differences. There was also an interaction between purpose of observation and delay, such that the greater stereotype accuracy of the "performance evaluation primary" subjects washed out by the seventh day. Murphy et al. (1989) interpreted this by utilizing the distinction between on-line and memory-based judgments presented by Hastie and Park (1986).

Overall, Murphy and his associates have emphasized memory issues in information processing. The present research will focus more particularly on information acquisition strategies, drawing on the research presented below from the University of South Carolina. What is instrumental in the work of Murphy and his associates is their refinement and use of the Cronbach (1955) accuracy measures as major dependent variables. Each Cronbach measure provides unique and valuable information concerning rater accuracy. Accordingly, this research will emphasize these measures as primary dependent variables. Other accuracy measures have been used in performance appraisal research (e.g., Lord, 1985; Padgett & Ilgen, 1989); however, the Cronbach measures are the most widely used (Sulsky & Balzer, 1988; Murphy & Balzer, 1989). Also of importance is the fact that none of the South Carolina research has utilized Cronbach's measures (cf., DeNisi & Williams, 1988). It is argued below that the Cronbach measures better capture the variables of interest in the South Carolina research, while providing additional information as well. Use of these measures is most conducive to tackling the "pervasive accuracy problem in performance appraisal research" (DeNisi & Williams, 1988, p. 109).

University of South Carolina Stream

One of the larger research streams dealing with cognitive influences on performance appraisal comes out of the University of South Carolina (DeNisi et al., 1983; K. Williams et al., 1985; Cafferty et al., 1986; DeNisi et al., 1989; K. Williams et al., 1990). Much of this research has focused on information acquisition strategies, although storage and retrieval issues have been addressed as well. Research on information acquisition will first be reviewed, followed by a discussion of the effects of method of rating (Symonds, 1925).
Research on information acquisition will first be reviewed, followed by a discussion of the effects of method of rating (Symonds, 1925).

Information Acquisition Strategies. DeNisi, Cafferty, Williams, Blencoe, and Meglino (1983) used a computer information board to determine what type of information raters would seek out in evaluating the performance of four hypothetical ratees. The highest percentage of raters (44%) organized their search by ratee, i.e., they sought information about how one worker performed across different tasks before seeking information on another worker's performance. Thirty percent of the raters organized their search by task, i.e., comparing different workers on one task before moving to another task. Eighteen percent of the raters tended to seek information on the repeated performance of one ratee on the same task, whereas 8% displayed no discernible pattern. From this and related research (Williams, DeNisi, Blencoe, & Cafferty, 1985), three information acquisition strategies have been identified: person-blocked, task-blocked, and nonblocked.

Cafferty, DeNisi, and Williams (1986, Study 1) found a similar rater preference for information about one person's performance across tasks before moving to another target. In Study 2, all performance incidents were presented to raters (rather than allowing for differential search). One-third of the subjects viewed the incidents arranged by ratee (person-blocked); one-third of the subjects viewed the incidents arranged by task (task-blocked); and one-third of the subjects viewed the incidents in a mixed or nonblocked manner. Cafferty et al. found that subjects in the person- and task-blocked conditions exhibited significantly more clustering of behaviors in recall according to their respective pattern of presentation. Subjects in the person-blocked condition recalled significantly more items than subjects in the other two conditions; however, they also recalled more incorrect items than those in the task-blocked condition (what Pitre & Sims, 1987, referred to as "gap filling"). Contrary to predictions, there was no difference by presentation pattern in the overall performance ratings given. However, the task-blocked condition led to significantly greater intra-ratee discriminability, i.e., actual performance differences on different tasks by each ratee were noted more accurately in the task-blocked condition than in the other two conditions (this corresponds to Cronbach's differential accuracy measure).

It has been speculated that person-blocking leads to greater reliance on global impressions, which in turn leads to greater halo bias (DeNisi & Williams, 1988). Blocking by task (or dimension) is expected to reduce this effect. Other research, however, makes this less than clear-cut. Specifically, Williams, DeNisi, Meglino, and Cafferty (1986) looked at initial appraisal purpose and subsequent performance ratings. Subjects viewed one of two videotapes, where the performance of carpentry tasks was blocked either by person or by task. In a memory-based rating made two days later, subjects who had previously made "deservedness" ratings of each ratee for extra work were only able to differentiate among ratee proficiency levels when they had acquired information in a person-blocked manner. Thus, while Cafferty et al. (1986) leads to the conclusion that task or dimension blocking leads to greater rating accuracy, Williams et al. (1986) appeared to contradict this.
DeNisi and Williams (1988) suggested that the distinction may be one of intra- versus inter-ratee discriminability, i.e., "While the use of person categories may increase impression formation tendencies and reduce intra-ratee variability, it may increase the rater's ability to assess the overall proficiency of each worker" (p. 139). In Cronbach's terminology, the prediction is that person blocking leads to greater accuracy than task blocking in terms of differential elevation (correctly rank ordering ratees), but less accuracy in terms of differential accuracy (correctly noting ratee patterns of performance).

A final study relevant to the discussion of person- versus task-blocking is DeNisi, Robbins, and Cafferty (1989). These authors measured the effects of diary keeping on subsequent recall and rating accuracy. Videotapes were assembled of individuals performing carpentry tasks, such that no two consecutive segments portrayed the same carpenter, the same task, or the same performance level, i.e., a nonblocked presentation pattern. Subjects were instructed to keep diaries either by person, by task, or "free" (as they wished); a fourth no-diary condition served as a control. DeNisi et al. (1989) found significant clustering in recall: more person clustering with person diaries, more task clustering with task diaries, and the lowest clustering in the no-diary condition. No differences in the accuracy of overall ratings were found between the person- and task-diary conditions. Also, contrary to the prediction of DeNisi and Williams (1988) above, intra-ratee discriminability (differential accuracy) was highest in the person-diary condition. One explanation for this is that DeNisi et al. (1989) made only small changes in ratee performance over a small number of tasks. When larger memory demands are placed on raters (as will be done in the present study), results are expected to follow Cafferty et al. (1986), i.e., greater differential accuracy for task/dimension blocking. At any rate, as DeNisi and Williams concluded (p. 146), the way in which raters organize information in memory does affect rating accuracy; however, more research is needed to determine what, if any, is the best type of organization for different appraisal purposes (i.e., by task, by person, or another schema not yet tested).

Rating Format versus Advanced Knowledge of that Format. The above research demonstrates that the manner in which raters acquire performance information influences various measures of rating accuracy. This is related to the question of whether method of rating influences accuracy; however, the two processes are conceptually distinct. Drawing on a model presented by DeNisi and Williams (1988, p. 137), it is proposed that an intervention prior to information search will affect acquisition strategies and subsequent information storage in memory, whereas an intervention focusing on the method of making ratings (by person versus by dimension) will primarily influence the retrieval of information already stored in memory (see Figure 2). The processes portrayed in Figure 2 can be described as follows. The manner in which individuals search for information influences the manner in which that information is subjectively organized in memory; person and task blocking are thought to be two types of schema used to organize information.
Next, all raters (regardless of how they subjectively organize information) are thought to store information in memory using both impressionistic and behavioral codes (Hastie & Park, 1986; Srull & Wyer, 1989; Williams et al., 1990). It is predicted that blocking patterns influence the salience of impressionistic versus behavioral data, i.e., impressionistic codes will be more salient under person-blocking, with behavioral codes more salient under task/dimension-blocking (DeNisi & Williams, 1988). These effects are then thought to influence judgment tasks, such as those involved in rating performance. Finally, once ratings have been made, these ratings are thought to influence the subsequent categorization of performance, as well as future information search patterns (cf. Murphy, Balzer, Lockhart, & Eisenmann, 1985).

[Figure 2: A Model of the Influence of Person- Versus Dimension-Oriented Information on Performance Appraisal Ratings. The figure links processing objectives, information search patterns, the subjective organization of performance information in memory (impressionistic and behavioral codes), the categorization of performance, and the ratings given.]

Figure 2 can be used as a framework for the research already presented. First, arguments for the superiority of rating by dimension versus by person (Symonds, 1925; Stevens & Wonderlic, 1934; Cooper, 1981) focus on the end of the process depicted in Figure 2. Even if implicitly, it is assumed that forcing a dimension structure on ratings will reduce halo and increase accuracy. By default, this approach assumes that method of rating will influence accuracy via its influence on retrieval processes. On the other hand, the research on information acquisition focuses on the beginning of the process shown in Figure 2, arguing that accuracy will be most influenced by issues related to information acquisition and storage.

While many additional questions could be addressed building on the South Carolina research, this study will focus on just one of the issues they raised, i.e., the extent to which raters with prior knowledge of the rating format to be used when making ratings are more accurate than those who learn of the format just prior to the rating task. The assumption is that those who know the rating format prior to information acquisition will use that format as a cue or frame when organizing information in memory (Williams et al., 1990). In terms of Figure 2, this prior knowledge has been placed in the "Processing Objectives" box, under the assumption that this knowledge will directly influence information search patterns and the subjective organization of information in memory.

In this research, issues related to method of rating and information acquisition will be addressed in a 2 x 2 between-subjects design. Half of the subjects will make their ratings "by person", half will rate "by dimension" (from here on, the term "dimension-blocked" will be used instead of "task-blocked", to denote an organizing pattern or format centered around appraisal dimensions, rather than the traditional person-centered approach; Cooper, 1981). Further, half of the subjects in each group will be shown the format to be used prior to the opportunity to search for information (see Figure 3). This design allows for tests of the main effects for rating format and prior knowledge of that format, as well as any interactions, for each Cronbach accuracy measure. As noted earlier, Cooper (1981) cited four studies where the expected halo differences when rating by person versus by dimension were not found.
What is interesting, however, is that in each of these studies, the rating format was presented only at the time when ratings were required. Thus, the effects of prior knowledge of the rating format were not tested (DeNisi & Williams, 1988).

There is mixed evidence concerning whether familiarity with the rating scale per se leads to differences in halo or accuracy. Positive results were obtained by Bernardin and Walter (1977), who found that raters who received behaviorally-anchored rating scales (BARS) before observing behavior demonstrated less halo and greater inter-rater reliability than raters who received the scales after observation.

                                        Format
                                By Person        By Dimension
    Advance Knowledge    Yes    Condition 1      Condition 2
    of Format            No     Condition 3      Condition 4

    Note: The within-subjects factors are manipulated within each of the
    four between-subjects conditions.

    Figure 3: The Between-Subjects Manipulations of Format Type and Prior
    Knowledge of Format

On the other hand, Cardy et al. (1987) found no effects for familiarizing raters with the BARS scales used in their study. However, these authors defined accuracy utilizing Borman's differential accuracy measure, rather than the more complete Cronbach measures. Also, the results of both Bernardin and Walter (1977) and Cardy et al. (1987) relate only to ratings made by person; their research was not designed to tap person versus dimension differences.

DeNisi and Summers (1986) provided an initial answer to this question. Using the same videotaped carpentry tasks as Williams et al. (1986) and DeNisi et al. (1989) above, they constructed two different rating scales, one focusing on the tasks or behaviors performed, and the other on general trait dimensions such as motivation, neatness, and attention to detail. Additionally, subjects were shown the rating scale they would use either prior to or after observing the videotapes. Results were that subjects with advanced knowledge of the rating format demonstrated an increased level of organization in memory, as well as increased accuracy of recall. However, a major weakness of this study was that overall rating accuracy could only be calculated for subjects in the "task scale" condition, since no "true scores" (in the Cronbach sense) were available for the trait ratings. DeNisi and Summers (1986) found that, within the task scale condition, rating accuracy was greatest for those with advanced knowledge of the scale. What is needed, however, and what the current study does, is to compare the effects of both rating format and prior knowledge of that format on rating accuracy.

Before discussing research propositions concerning information acquisition and method of rating, some previous process tracing research will be briefly discussed. As noted above, this approach allows the researcher to measure the amount and type of search engaged in by each rater. This should greatly aid in understanding the effects of the experimental manipulations on rater search and accuracy. In the current research, raters will have access to a person by dimension matrix for a number of ratees. It is helpful to understand this methodology before discussing the specific predictions made in this study.

Process Tracing Research. Several of the studies mentioned above used computer-controlled information boards to examine differences in search/acquisition strategies (DeNisi et al., 1983; Williams et al., 1985, Study 2; Cafferty et al., 1986, Study 1).
This methodology comes out of research on decision making, and is generally referred to as "process tracing" research (see Ford et a1. , 1989, for a recent comprehensive review). A major advantage of an information board methodology is that it allows for the examination of the decision maker's depth and pattern of information search. Even though cognitive models of performance appraisal view raters as active seekers and processors of ratee performance information (e.g., Ilgen 6: Feldman, 1983) , the implicit assumption in much previous research (e.g., Zedeck 6: Kafry, 1977) is that raters are passive receptors of information, not playing an active role by expending effort to obtain information about each ratee. Kozlowski and Ford (1991) argued that this provides limited fidelity to the way raters actually receive information, 37 and that process tracing is needed in order to investigate the manner in which raters attend to and acquire performance-relevant information. Kozlowski and Ford (1991) recently completed two studies of rater acquisition strategies using a computer-controlled information board. Subjects rated the performance of 12 police officers, six of whom were consistently good performers, and six of whom where consistently poor performers. Prior to the information search task at the computer, subjects received "personnel files” for each ratee, where the amount of information available on each ratee was systematically varied, i.e., the number of "critical incidents” listed for each ratee ranged from zero to 18 items (ratee performance level and prior ratee information were within- subjects factors). Subjects were then instructed to search the computer for as little or as much additional information as necessary to make accurate ratings; information was available in a ratee by dimension matrix. In Study 1, subjects were placed under four levels of search constraint, i.e., they were told they could access up to 25%, 50%, 75%, or 100% of the available items. Kozlowski and Ford reported three major findings. First, the amount of prior ratee information influenced.search, such that the greater the prior information available, the less the subsequent search. Second, as expected, search constraint influenced acquisition, with those under lower constraints seeking significantly more items. Third, a performance level x prior ratee information interaction indicated a tendency for raters to seek more information for poor versus good performers, when the level of prior information was high. 38 In Study 2, Kozlowski and Ford used only the 25% and 100% search constraints, but added four levels of time delay between viewing the personnel files and the opportunity to search and then make appraisal ratings (0, 1, 4, and 7 days). The results largely replicated those in Study 1, as well as showing a tendency for raters in the no-delay condition to seek more information when prior information was low than subjects in the delayed conditions. This last finding is consistent with the notion that memory-based information is more influenced by general evaluations or impressions (Hastie & Park, 1986). A limitation of their study noted by Kozlowski and Ford is that the complexity of their design prevented them from varying ratee performance levels. Because ratee performance was either consistently good or poor, there was no point in linking patterns of information acquisition to particular rating outcomes. 
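To make the information-board methodology concrete, the following is a minimal sketch, in Python, of the kind of record such a board produces. It is purely illustrative: the class and variable names are invented, and the actual software used in the studies reviewed above and in the present study (described in Chapter 3) was custom-built. The point is simply that every cell a rater opens in the ratee-by-dimension matrix can be logged in order, and that a search constraint caps how much of the matrix may be examined.

    class InformationBoard:
        """Hypothetical sketch of a computer-controlled information board."""

        def __init__(self, incidents, search_constraint=1.0):
            # incidents maps (ratee, dimension) pairs to behavioral incidents
            self.incidents = incidents
            # Search constraint: maximum proportion of cells that may be opened
            self.max_accesses = int(search_constraint * len(incidents))
            # Ordered log of (ratee, dimension) accesses; the raw material for
            # process-tracing analyses of type and amount of search
            self.access_log = []

        def open_cell(self, ratee, dimension):
            """Return the incident stored in a cell, recording the access."""
            if len(self.access_log) >= self.max_accesses:
                raise RuntimeError("Search constraint reached")
            self.access_log.append((ratee, dimension))
            return self.incidents[(ratee, dimension)]

    # Example: a person-blocked search, examining one ratee across two dimensions
    board = InformationBoard({
        ("Ratee 1", "Attendance"): "Arrived on time every day this week",
        ("Ratee 1", "Typing"):     "Completed a long report with no errors",
    })
    board.open_cell("Ratee 1", "Attendance")
    board.open_cell("Ratee 1", "Typing")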
Differences in within-ratee performance are needed in order to address issues of rating accuracy, at least for the Cronbach (1955) accuracy measures (cf. Padgett & Ilgen, 1989). Similar to Kozlowski and Ford (1991), a computer-controlled information board will be used in the present research. Ratee performance patterns will be constructed to portray high, average, and low levels of overall (in-role) performance. In addition, within-ratee performance differences on various dimensions will be built in, in order to meaningfully test for rater differences in stereotype and differential accuracy.

Hypotheses Concerning Information Acquisition/Method of Rating

In this section, the hypotheses relating to information acquisition and method of rating will be presented, broken down by dependent variable. Hypotheses concerning the two process variables will be discussed after this.

Overall Ratings. The South Carolina research has produced conflicting results concerning the effects of acquiring information by person versus dimension on the accuracy of overall (or summary) performance ratings given to each ratee. For example, Williams et al. (1986) found some increase in overall rating accuracy for subjects in their person-blocked condition. DeNisi et al. (1989) found that keeping a diary of any sort (i.e., person-blocked, task-blocked, or free) resulted in greater accuracy than the no-diary condition; however, no differences were observed between the person- and task-blocked conditions. Finally, Cafferty et al. (1986) found no differences in overall ratings between their person-blocked, task-blocked, and mixed conditions. Intuitively, one might expect the accuracy of overall ratings to be better for those subjects who rate by person (i.e., a pattern similar to the one presented below for differential elevation). However, the results from the studies just cited are sufficiently discouraging so as to necessitate making no prediction regarding overall ratings.

Error. Measures of halo will be computed so that the results of this study can be compared with previous studies (Bernardin & Walter, 1977; Cooper, 1981; Murphy & Balzer, 1989; Murphy & Jako, 1989). Only one prediction will be made in the current study (see Figure 4), i.e.,

H1: Halo effect - Subjects rating by person will exhibit more halo than those rating by dimension.

Given the results of previous studies described by Cooper (1981), this effect is not expected when subjects are unaware of the rating format to be used; however, when the format is known, greater halo is expected when rating by person, with lower halo when rating by dimension.

[Figure 4: Predictions Concerning Type of Format, Advanced Knowledge of Format, and the halo and accuracy measures (elevation, differential elevation, stereotype accuracy, and differential accuracy).]

Accuracy. Of major concern in this research are the effects of the between-subjects manipulations on three of Cronbach's distance score accuracy measures. The fourth aspect of accuracy, elevation, will be measured, but similar to prior research (Murphy & Balzer, 1986; Murphy et al., 1989), no differences between groups are expected based on format or prior knowledge of format. For the remaining measures, the general prediction is that ratings made by person will be more accurate in terms of differential elevation, but less accurate in terms of stereotype and differential accuracy than ratings made by dimension (DeNisi & Williams, 1988).
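Because the Cronbach (1955) distance-score components figure so centrally in the hypotheses that follow, a brief computational sketch may be helpful. The Python functions below implement the standard decomposition of the rating-minus-true-score distance into elevation, differential elevation, stereotype accuracy, and differential accuracy (as described by Sulsky & Balzer, 1988), along with one common operationalization of halo (the median intercorrelation among dimension ratings). The function and variable names are illustrative only, not those used in this study; squared components are returned, and lower values indicate greater accuracy.

    import numpy as np

    def cronbach_components(ratings, true_scores):
        """Cronbach's (1955) distance-score accuracy components.

        ratings and true_scores are ratee-by-dimension arrays.
        Lower values indicate greater accuracy.
        """
        x = np.asarray(ratings, dtype=float)
        t = np.asarray(true_scores, dtype=float)

        x_gm, t_gm = x.mean(), t.mean()                       # grand means
        x_ratee, t_ratee = x.mean(axis=1), t.mean(axis=1)     # ratee (row) means
        x_dim, t_dim = x.mean(axis=0), t.mean(axis=0)         # dimension (column) means

        # Elevation: error in the overall level of the ratings
        E2 = (x_gm - t_gm) ** 2
        # Differential elevation: error in ordering the ratees overall
        DE2 = np.mean(((x_ratee - x_gm) - (t_ratee - t_gm)) ** 2)
        # Stereotype accuracy: error in ordering the dimensions overall
        SA2 = np.mean(((x_dim - x_gm) - (t_dim - t_gm)) ** 2)
        # Differential accuracy: error in each ratee's pattern across dimensions
        x_res = x - x_ratee[:, None] - x_dim[None, :] + x_gm
        t_res = t - t_ratee[:, None] - t_dim[None, :] + t_gm
        DA2 = np.mean((x_res - t_res) ** 2)

        return {"elevation": E2, "differential elevation": DE2,
                "stereotype accuracy": SA2, "differential accuracy": DA2}

    def halo_index(ratings):
        """Median intercorrelation among dimensions, computed across ratees."""
        r = np.corrcoef(np.asarray(ratings, dtype=float), rowvar=False)
        return np.median(r[np.triu_indices_from(r, k=1)])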
What the general prediction above fails to make clear, however, is the extent to which these accuracy effects are brought about by the rating format per se versus prior knowledge of that format. Summarizing the research described above, the research proposition guiding this study is that there will be an effect for rating format, but that this will operate primarily in conjunction with prior knowledge of that format. These predicted effects and interactions are shown graphically in Figure 4 (note that the Cronbach distance score measures depict the distance away from true score estimates; therefore, the lower the values, the greater the rater accuracy). Specifically, the predictions are:

H2a: Differential Elevation - Subjects rating by person will be more accurate in correctly rank ordering ratees than subjects rating by dimension. Further, there will be a disordinal interaction (Keppel, 1982), such that accuracy is best for subjects with advanced knowledge that they will be rating by person, and worst for subjects with advanced knowledge that they will be rating by dimension.

For subjects knowing they will rate by dimension, the expectation is that this knowledge will induce raters to spend so much time attending to ratee performance on specific dimensions that they will "lose sight" of the overall rank ordering of ratees. The predictions for stereotype accuracy and differential accuracy are essentially the reverse of that for differential elevation, i.e.,

H2b: Stereotype Accuracy - Subjects rating by dimension will be more accurate in correctly ranking the dimensions than subjects rating by person. There will be a disordinal interaction, such that accuracy is best when subjects know in advance that they will rate by dimension, and worst when subjects know in advance that they will rate by person.

Advanced knowledge of a person-blocked rating format is expected to increase raters' reliance on overall impressions, thus decreasing the attention given to ratee performance on specific dimensions. Similarly,

H2c: Differential Accuracy - Subjects rating by dimension will be more accurate in correctly noting individual ratee patterns of performance than subjects rating by person. There will be a disordinal interaction, such that accuracy is best when subjects know in advance that they will rate by dimension, and worst when subjects know in advance that they will rate by person.

Cronbach (1955) also proposed correlational measures of rating accuracy. For completeness, these measures will be calculated, and the above three hypotheses (H2a-H2c) will be tested with these measures as well. However, the primary focus will be on the distance score measures, as these are the most widely used (Murphy & Balzer, 1989), and they correspond most closely to the generally accepted notion of rating accuracy (in contrast to rating validity; Sulsky & Balzer, 1988).

Process Variables. A distinct advantage of process tracing research is that it allows the researcher to examine the pattern and depth of each rater's information search (Ford et al., 1989). In the present context, it can be tested whether prior knowledge of the rating format affects the pattern or type of search, as well as the depth or amount of search exhibited by each rater. Concerning type of search, prior research demonstrates that, for a majority of raters, their "natural tendencies" are to search for information by person (DeNisi et al., 1983; Cafferty et al., 1986).
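As an illustration of how these two process variables can be operationalized from an information-board access log, the sketch below computes a simple type-of-search index (the proportion of transitions that stay with the same ratee, versus the same dimension) and an amount-of-search index (the proportion of available cells that were opened). These particular indices are offered only as an assumption-laden illustration; the measures actually used in this study are defined in Chapter 3.

    def search_metrics(access_log, n_ratees, n_dimensions):
        """Summarize type and amount of search from an ordered access log.

        access_log is a list of (ratee, dimension) tuples in the order accessed.
        """
        transitions = list(zip(access_log, access_log[1:]))
        n_trans = max(len(transitions), 1)
        same_ratee = sum(1 for (r1, _), (r2, _) in transitions if r1 == r2)
        same_dim = sum(1 for (_, d1), (_, d2) in transitions if d1 == d2)

        return {
            # Person-blocked search keeps returning to the same ratee
            "person_blocked_rate": same_ratee / n_trans,
            # Dimension-blocked search keeps returning to the same dimension
            "dimension_blocked_rate": same_dim / n_trans,
            # Depth of search: share of the ratee-by-dimension matrix examined
            "amount_of_search": len(set(access_log)) / (n_ratees * n_dimensions),
        }

    # Example: a rater who works through Ratee 1 before turning to Ratee 2
    log = [("R1", "Attendance"), ("R1", "Typing"), ("R2", "Attendance")]
    print(search_metrics(log, n_ratees=6, n_dimensions=6))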
Given these natural tendencies, subjects without prior knowledge of the format to be used are expected to display a general tendency to search by person. Subjects who receive advanced knowledge concerning the format to be used are expected to display stronger tendencies to search for information in a manner consistent with that format. Figure 5 displays the type of interaction that is predicted.

[Figure 5: Predictions Concerning Type of Search and Amount of Search, by format type (person versus dimension) and prior knowledge of format.]

H3: Type of Search - Prior knowledge of format type will influence subsequent search activities, such that those knowing they will rate by person will search more by person, and those knowing they will rate by dimension will search more by dimension. Raters without such prior knowledge will be more likely to search for information by person.

Concerning amount of search, previous research has found a general tendency on the part of most raters to make appraisal decisions based upon a limited amount of information. For example, in DeNisi et al. (1983), subjects who were allowed unlimited search of performance information requested fewer than half the available items before making their appraisal decisions. Kozlowski and Ford (1991) found similar results. Subjects in their unlimited search condition searched for an average of 62% and 43% of available items (in Studies 1 and 2, respectively). This same search tendency is expected in the present study as well. In particular, this limited search is expected for those without prior knowledge of the rating format. However, providing prior knowledge of either a person or a dimension format is expected to differentially affect amount of search (Figure 5). Building on research in cognitive psychology (Hastie & Park, 1986; Srull & Wyer, 1989; DeNisi & Williams, 1988), it is expected that subjects who know they will rate by person will be more likely to store information in memory "by person". This is expected to increase the salience of global impressions (i.e., the impressionistic codes presented in Figure 2), which is then expected to decrease the overall amount of information search. On the other hand, those who know in advance that they will be rating by dimension are expected to display an above average amount of search.

H4: Amount of Search - All subjects will display a tendency to voluntarily end their information search before they have accessed all available information. This tendency will be more pronounced for subjects who know they will rate by person, and less pronounced for those who know they will rate by dimension.

Once again, a strength of the process tracing methodology is that it allows the researcher to measure depth and pattern of search in some detail. It is hoped that the results for these two "process" variables will be useful in explaining the accuracy results that are obtained, as well as testing the underlying assumptions presented in Figure 2 (i.e., the dotted boxes for subjective organization in memory and categorization of performance).

In-Role Versus Extra-Role Behaviors

The preceding pages have summarized the research and hypotheses concerning the effects of information acquisition and method of rating on rater accuracy and error measures, as well as on amount and type of search. This section of the chapter focuses on the influences of in-role and extra-role performance on these same variables.
While the preceding pages have proposed effects for person— versus dimension-oriented formats on search and accuracy measures, the following pages address the issue of whether the right dimensions of performance are being measured in most current approaches to performance appraisal. Following Organ (1988b), it is expected that dimensions capturing both in-role and extra-role performance will be used by managers when they make their ratings. Further, both search and accuracy measures are expected to be influenced by these experimental manipulations. As stated in Chapter 1, the rationale for studying search/method of rating and in—role/extra-role performance in the same study is the proposed interaction between method of rating, extra-role performance, and accuracy. This interaction is discussed below, after the literature and hypotheses concerning in—role and extra-role performance have been presented. As depicted in Figure 1, the level of in—role and extra-role performance will be manipulated as within-subj ect factors. Issues related to measuring performance and organizational citizenship will be discussed next, followed by specific hypotheses relating to these variables. P rf an It is axiomatic that there ought to be a strong link between the content of performance appraisal instruments and the content of the job(s) 47 being appraised (Wexley' & 'Yukl, 1984). For example, the Unifgtm Qaidelinas on Eaplgyee SeLettioa gracedataa (1978) state: ”There shall be a job analysis which includes an analysis of the important work behaviors required for successful performance... "(Sec. 14.C.2). Many efforts have been made to link performance appraisal instruments directly to written job descriptions (e.g., Buford, Burkhalter, & Jacobs, 1988). Indeed, researchers over the past 30 years have spent considerable effort trying to improve behavior— and results- oriented measures of performance (Landy & Farr, 1980; Latham & Wexley, 1981). Among other things, this research has sought to increase rating accuracy, improve feedback to employees, and ensure that employers comply with legal requirements. Concerning this latter point, policy-capturing research has demonstrated that organizations are more likely to win employment discrimination lawsuits in court if their appraisal system is based on measures of behaviors or results, rather than on measures of broad traits (Feild & Holley, 1982; Werner, 1992). Given that behavior and results measures of performance have been advocated since the 19603 (e.g., Smith & Kendall, 1963; Odiorne, 1965), and that current legal guidelines and court decisions emphasize the importance of organizations using such measures, the use of graphic- or trait-oriented rating scales would seem hard to justify. Yet, trait- oriented scales continue to be widely used. When Werner (1992) combined his data with that from Feild and Holley (1982), over 60% of the codable cases (66 of 107) involved ttait-Qriented appraisal formats. Further, Locher and Teel (1988) surveyed over 300 organizations, and found that 48 almost 70% of them used either graphic ratings scales or essay evaluation of employees. Why do graphic or trait-oriented rating scales continue to be so widely used? Certainly, much of the explanation for this is that such scales are inexpensive and easy to administer. 
Additionally, some managers and organizations may not know about the legal ramifications of performance appraisal, or may feel that the likelihood of getting taken to court for a poorly designed appraisal system is relatively small (Werner, 1992). However, it is questionable whether such explanations fully capture the discrepancy between what has been recommended in this area and what is actually practiced. Two additional explanations are proposed: 1) trait ratings are often preferred.by raters because they parallel the manner in which individuals form impressions and retain information in memory (Cantor & Mischel, 1977; Srull & Wyer, 1989), and 2) managers (and other raters of performance) often have definitions of ”performance” that go beyond an employee's performance of his or her stated job duties (Organ, 1977), i.e., what is here labelled ”inerole performance”. As developed below, trait-oriented scales are thought to capture elements of employee ”citizenship behaviors" (Bateman & Organ, 1983) which managers consider important for overall effectiveness, yet which may not fit precisely within the role-prescribed behaviors of a written job description. thanizational Citiaeaahia Bahavtot. Organ (1977) listed a number of things organizations are likely to value beyond some minimally- acceptable level of productivity, including employee predictability, cooperation, and general tendencies toward compliance. Such behaviors 49 were described as "the glue which holds collective endeavors together” (p. 50). Later, such behaviors were labelled organizational citizenship behaviors, or OCB, and this was thought to include such actions as cooperating with coworkers, working to improve the organization, and accepting special orders without complaint (Smith, Organ, 6: Near, 1983; Bateman 6: Organ, 1983; Organ 6: Konovsky, 1989) . Katz (1964; Katz & Kahn, 1966) described three types of behaviors which were considered essential for a functioning organization: a) people must be induced to enter and remain with the organization; b) they must reliably carry out specific role or job requirements; and c) there also needs to be innovative and spontaneous activity that goes beyond role prescriptions. Citizenship behaviors are at times part of an employee's role or job, e.g. , courtesy toward customers may be prescribed behavior for sales personnel (Brief 5: Motowidlo, 1986). However, most citizenship behavior is viewed as "supra-role”, i.e., behavior which cannot be prescribed or required in advance for a given job (Bateman 6: Organ, 1983) . Recently, Organ defined OCB as “individual behavior that is discretionary, not directly or explicitly recognized by the formal reward system, and in the aggregate promotes the efficient and effective functioning of the organization. . . . (T)he behavior is not an enforceable requirement of the role or job description, ... it is rather a matter of personal choice, such that its omission is not generally understood as punishable. . . (and) returns (to the individual should) not be contractually guaranteed by any specific policies and procedures" (Organ, 1988b, pp. 4—5). 50 Organizational citizenship behavior can be distinguished from the broader concept of prosocial organizational behavior. 
Brief and Motowidlo (1986) reviewed prosocial organizational behavior, and argued that such behaviors can be: a) either functional or dysfunctional for the organization; b) either role—prescribed or extra-role; and c) directed at various targets (coworkers, customers, the organization as a whole). Given Organ's definition above, OCB can be viewed as a subset of prosocial organizational behavior which is functional for the organization and generally extra-role in nature. There is still an on—going debate as to exactly what constitutes OCB. Early research by Bateman and Organ (1983) viewed it as a unidimensional construct. However, a factor analysis of a different OCB measure by Smith et al. (1983) identified two factors: Altruism and Generalized Compliance. Altruism included behaviors "directly and intentionally aimed at helping a specific person in face-to-face situations" (p. 657), whereas Generalized Compliance ”pertains to a more impersonal form of conscientiousness that does not provide immediate aid to any one specific person, but rather is indirectly helpful to others involved in the system" (p. 657). Organ (1988b) later stated that this factor is more aptly labelled ”Conscientiousness", to put a greater emphasis on its inner-directedness. Recently, Organ (1988b) proposed five categories of OCB: Altruism, Conscientiousness , Sportsmanship , Courtesy , and Civic Virtue . Sportsmanship was identified by Williams, Podsakoff, and Huber (1986) in a reanalysis of the Bateman and Organ (1983) scale, and is characterized primarily by actions which people refrain from doing (e.g., avoiding 51 complaining and petty grievances). Courtesy is conceived of as ”touching base” with others who will be affected by a decision, passing along information, briefing, reminders, etc. Finally, Civic Virtue is defined as active participation in the political life of the organization (Graham, 19 86), and includes involving oneself in understanding and discussing current organizational issues and concerns. Support for this OCB categorization scheme was found by MacKenzie et al. (1991). In a different vein, Graham (1986; 1989) drew on writings in classical political philosophy to propose four major factors of OCB. The first two factors, Interpersonal Helping and Personal Industry, are similar to the Altruism and Conscientiousness factors described above. Graham's third factor, Individual Initiative, is an operationalization of organizational participation, and parallels what Organ (1988b) labelled Civic Virtue. The fourth factor is called Loyalty, and involves defending the interests and reputation of the organization to outsiders. Support for this four-factor structure was found in separate factor analyses by Graham (1989) and Karambayya (1990). In an attempt to make sense of these different categorization schemes for OCB, the current research will use the categories proposed by L. Williams (1988). Williams (1988) collapsed the above categories from Organ (1988b) and Graham (1986) into two broader categories which represented: l) OCBs which benefit the general maniaatian (e.g., carrying out role requirements well beyond the norm or minimum required levels), and 2) OCBs which immediately benefit specific indivtduala (e.g- , helping another person with an organizationally—relevant problem), whiCh ultimately contributes to the organization. These two categories 52 were labelled OCBO and OCBI, respectively. 
It should be noted that these categories parallel Bateman and Organ's (1983) factors of Conscientiousness and Altruism. Their primary advantage, however, is parsimony, in that they are broad enough to encompass the other OCB factors which have been proposed. For example, Organ's Sportsmanship and Civic Virtue factors, and Graham's Individual Initiative and Loyalty factors can all be incorporated into the OCBO factor, whereas Organ's Courtesy factor fits into the OCBI factor. Williams (1988) developed questionnaires designed to tap in-role behaviors (IRB) , organizational citizenship behaviors directed toward the organization as a whole (OCBO), and organizational citizenship behaviors directed toward specific individuals (OCBI). Factor analyses were conducted on samples of self-ratings, peer ratings, and supervisor ratings (the self-ratings were from employed MBA students). For each sample, a clean three-factor solution emerged which strongly supported the a priori categories. For the present research, the significance of Williams (1988) is two-fold: a) two broad categories of organizational citizenship behavior were identified, and b) Williams (1988) was the first to demonstrate empirically that survey measures of OCB were capturing something distinct from traditional in-role performance. Past research that had attempted to do this produced equivocal results (O'Reilly 6: Chatman, 1986; Puffer, 1987). The importance of this is that Williams' (1988) results strengthen the argument that OCB measures are doing more than simply capturing "old wine in new wineskins", i.e. , OCB measures are not merely replicating supervisors' ratings of in-role performance. This 53 is of considerable importance for the present study, where OCB is incorporated within a performance appraisal context. QQB aad Eatfgtaange Apatataal. Much of the research and writing to date on organizational citizenship behavior has emphasized definitional issues (Smith et a1., 1983; Graham, 1986, 1989; Organ, 1988b; VanDyne 6: Cummings, 1990), or the relationship of OCB to such perceived antecedents as satisfaction and organizational commitment (Bateman 6: Organ, 1983; O'Reilly 6: Chatman, 1986; Organ, 1988a; L. Williams, 1988). Very little of this work has pertained directly to performance appraisal. The following paragraphs describe the research and theorizing on OCB which most directly relates to the current study. Puffer (1987) measured the relationships between prosocial behaviors, noncompliant behaviors, and sales commission for a sample of retail sales personnel whose pay was based entirely on commission. In essence, prosocial behaviors were positive citizenship behaviors (e.g., handling a postsale problem for another salesperson), and noncompliant behaviors were negative or anti-citizenship behaviors (e.g., not doing one's "fair share" of customer phone calling; cf. , Fisher 6: Locke, 1990). Puffer found that sales performance (adjusted for hours worked and standardized by store) correlated .16 with prosocial behavior, and —.23 with noncompliant behavior (both p < .05). The former correlation is identical to what was obtained by MacKenzie et a1. (1991) when they correlated their Altruism factor with their measure of objective performance. These results suggest that there 13 a relationship between citizenship behaviors and objective measures of performance; the strength of this relationship, however, appears to be fairly modest. Organ (1988b, 54 p. 
41) interpreted this as evidence for the distinctness of OCB, i.e., that it is related to, but not the same thing as objective, in-role performance. If raters can reliably distinguish between in-role and extra-role performance (L. Williams, 1988), and these two types of behaviors are at most only moderately intercorrelated (Puffer, 1987; MacKenzie et a1. , in press), then the next thing that would be useful to establish is that raters do in fact utilize both types of information when making performance ratings. As indicated in Chapter 1, this issue has been partially addressed by Orr et a1. (1989) and MacKenzie et a1. (1991). For example, Orr et a1. (1989) conducted a policy-capturing study, where 17 managers assigned a dollar value of performance to 50 hypothetical ratees. Ratee profiles varied on 13 dimensions of work behavior, 10 of which were prescribed or in-role behaviors, and three of which were considered to be citizenship behaviors (team cooperation, contribution to morale, and company orientation). Ten raters had significant regression weights on the citizenship behaviors, indicating that they used the citizenship behaviors (in addition to the in—role behaviors) to estimate the dollar value of performance. These raters also demonstrated a higher R2 (.81 vs. .68, p < .05) than the remaining raters, indicating that using both types of cues facilitated them in accounting for more of the variance in dollar value of performance. Orr et a1. (1989) demonstrated in the context of utility analysis than many managers consider citizenship behaviors when evaluating the value of employee performance. MacKenzie et al. (1991) reported a similar finding in a performance appraisal context. MacKenzie et al. used a large 55 sample of insurance agents from an organization where three objective measures of performance were already in use. Managers of these agents were first asked to subjectively rate each agent's performance. Next, they rated the organizational citizenship of each agent. The OCB questionnaire included four factors: Altruism, Civic Virtue, Courtesy, and Sportsmanship. A LISREL analysis revealed that Objective Performance, Altruism, and Civic Virtue had significant relationships with the subjective ratings (p < .01), with an overall R2 of .45. The two OCB factors explained as much variance in the ratings as did the objective performance ratings. Further, followbup analyses revealed that this was not due to same—source (common method) bias between the ratings of performance and OCB. A weakness of MacKenzie et a1. (1991) is that objective performance measures are necessarily deficient (Wexley &'Yukl, 1984); much that would be considered "in-role performance” will not be picked up by such measures. MacKenzie et a1. urged that a replication of their study be conducted with other samples. The current study will do some of this, utilizing a different test of the same notion. In line with past research on OCB (Smith et a1., 1983; Bateman & Organ, 1983; L. Williams, 1988), this study will focus on inrrole versus extra-role tahaviots. Specifically, these behaviors will.be available to raters in.the dimension by ratee matrix described above. Keeping the behavioral focus consistent for both inrrole and extra-role performance is more in line with Organ's (1988b) conception of these variables. It is hoped that, by providing a different test of the conceptualization underlying MacKenzie et a1. (1991), the results of this study will supplement their results. 
56 In his book on organizational citizenship, Organ (1988b) discussed how OCB might influence the performance appraisal process. Organ argued for relatively simple performance appraisal forms, with six or fewer dimensions to be rated. According to Organ, the first dimension or two should focus on quantitative inrrole productivity and technical excellence. The next dimensions "might capture facets of contribution that straddle the 'boundary ‘between in—role performance and. certain categories of OCB; attendance, punctuality, and rule-compliance come to mind" (p. 92). The remaining dimensions would be broader, focusing on such things as cooperation, collegiality, or voluntary contribution of time and effort. Organ stated that these dimensions ”however characterized, would enable the rater to give fair credit in a general, global sense for many other forms of OCB" (p. 92). Finally, Organ proposed that an overall rating be provided to assess each employee's "total contribution". An example of the dimensions which might be included on such a form is presented in Figure 6. It is interesting to note the direct parallel between L. Williams (1988) three categories of behavior (IRB, OCBO, and OCBI) and the dimensions suggested by Organ (1988b). Inrrole behaviors (IRB) capture the dimensions of quantitative productivity and technical excellence. OCBO would include factors such as punctuality and rule- compliance (in fact, these are two of the seven items which Williams used to define his OCBO factor). OCBI, on the other hand, is the broadest factor, and would capture issues similar to Smith et al.'s (1983) Altruism factor. 57 InrRole Productivity Technical Excellence Attendance Rule-Compliance Cooperation Voluntary Contribution of Time and Effort Williams' (1988) Behavioral Qategatiaa IRB IRB OCBO OCBO OCBI OCBI Figure 6: Sample Performance Appraisal Dimensions to be Rated (from Organ, 1988b) 58 In this study, the recommendations and research of Williams (1988) and Organ (1988b) will be followed when deriving the dimensions to be rated, as well as the critical incidents included in the dimension x ratee matrix available to each rater (see Chapter 3 for details). Briefly, subject matter experts from the target organization 'will identify organizationally-relevant dimensions for a particular job, similar to those listed in Figure 6. Two dimensions will be included for each of William's (1988) three categories of behavior (IRB, OCBO, and OCBI). Critical behavioral incidents will be generated for the job in question, which will then be collected into various configurations to represent different levels of'in—role and.extra-role performance by the hypothetical ratees. It should be noted that the kind of dimensions which Organ (1988b) proposed combine elements of both trait- and behaviorally-oriented appraisal formats. In spite of the long-standing view that it is better to measure behaviors and results than broad traits, such "combination approaches" have recently appeared in the human resource management literature (e.g., Levy, 1989). In defense of this, Kavanagh (1971) argued that trait ratings should not be discarded if they help account for total variance in appraisal ratings. More recently, Rice (1985) noted a movement toward greater use of trait ratings because, in spite of their subjectivity, many managers consider traits to be vital to their overall ability to rate employee performance. 
This trend is evident in.assessment center research, where such broad dimensions as "energy", "initiative”, and "motivation" are often used (Thornton & Byam, 1982). Interestingly, however, the developers of assessment center procedures generally seek to define these dimensions as behaviorally as possible. 59 Previous research on the impact of in-role versus citizenship behaviors on utility estimates (Orr et a1. , 1989) has also utilized both trait and behavioral dimensions. A problem with Orr et a1. (1989), however, is that in—role performance was defined by specific work behaviors, while OCB was defined by more trait-like terms such as ”team cooperation” and ”contribution to morale". This potential confound would seem to be inherent in this type of research, given that Organ (1988b) intended for raters to consider "general tendencies" when rating employee citizenship. The current research will seek to minimize this confound as much as possible by defining both in—role and extra-role dimensions in behavioral terms (similar to assessment center procedures). It is thought that this will reduce the subjectivity of trait ratings (Wexley 6s Yukl, 1984), and reduce the likelihood that any differences observed between ratings of inrrole and extra—role performance are attributable primarily to differences between rating behaviors versus rating traits. A major advantage of using dimensions which capture elements of both ratee traits and behaviors is that this overcomes the difficulties observed by DeNisi and Summers (1986) . As noted above, DeNisi and Summers (1986) devised both behavioral and trait-oriented rating scales. The problem, however, was that they had no expert true scores for their trait ratings, and therefore were unable to compute accuracy scores for this condition, or compare results across conditions. In the present study, all raters will rate the same dimensions; thus, a comparison of accuracy results by condition will be possible (see Figure 3). In light of the preceding pages, hypotheses are now presented concerning the predicted effects of the performance manipulations on 60 accuracy, error, and.the process variables. First, main.effects for these within-subjects factors are discussed, i.e., level of inrrole and extra- role performance. This is followed by hypotheses concerning expected interactions between the performance manipulations, as well as between the performance manipulations and search/method of rating. 0 e e In- ole a d tr -Ro e e viors The following predictions concern tataa effects, without regard to the between-subj ects conditions in which a rater participated. Hypotheses are presented for both inrrole and extra-role performance. Predicted interactions are discussed after this. v - e ehavio a c ac . Level of in—role performance has been manipulated in a number of previous performance appraisal studies (DeNisi & Stevens, 1981; Karl & Wexley, 1989; Padgett & Ilgen, 1989; Kozlowski 6: Ford, 1991). Padgett and Ilgen (1989) and Kozlowski and Ford (1991) utilized two levels of performance (high/low), whereas DeNisi and Stevens (1981) and Karl and Wexley (1989) used three levels (high, average, and low). The current research will follow the latter studies in using three levels of inrrole performance. Ayerage ratee performance is thought to represent a more ambiguous stimulus than either high or low levels of performance (Wexley, Yukl, Kovacs, & Sanders, 1972). This is relevant to the present study, in that DeNisi et a1. 
(1984) predicted that accuracy would be greatest for high and low performance, and worst when rating average performance. The research to date, however, has demonstrated only that high performance is rated more accurately than either average or low performance (DeNisi 6: Stevens, 1981; Karl 6: Wexley, 1989). This would lead to the following prediction: 61 H5a: Differential accuracy will be greater (i.e., closer to the expert true scores) for ratees exhibiting high levels of in— role behaviors (IRB) than for ratees exhibiting average or low levels of IRB. It is thought that this aspect of Cronbach's (1955) accuracy best captures the manner in which accuracy has been characterized in previous research. The other Cronbach measures are not expected to be differentially influenced by level of IRB. Laval 9_f la-Role Behavi‘ ora and Mount ot Saarah. Kozlowski and Ford (1991) and Padgett and Ilgen (1989) both found that raters sought more information on poor performers than high performers. DeNisi and Stevens (1981) interpreted their accuracy results (just cited) by claiming that average and low performing ratees require more attention on the part of raters. DeNisi and Stevens (1981) did not measure amount of search in their study, but the implication from their research would be that: HSb: The amount of search will be greatest for ratees exhibiting average or low levels of IRB, and lowest for ratees exhibiting high levels of IRB. At first glance, this hypothesis appears contradictory to Hypothesis 5a, i.e. , that raters will search more, but be mag accurate in rating average and low performers. However, this apparent paradox will be retained, because it is felt that rating average and low performers is a more difficult task and that, in spite of increased search, accuracy will be lower for these ratees in comparison to ratees exhibiting high in—role performance (DeNisi 6: Stevens, 1981). Extra-Role Behaviors, Ettor, and Accutacy. In this study, OCB will be manipulated by having one ratee within each IRB performance level 62 demonstrate either highly favorable or neutral levels of OCBI (see Figure 7). From Organ (1990a), it is thought that the observed intercorrelation among dimensions (halo) will be higher for ratees where favorable OCB information is present than for those where OCB information is neutral, i.e., H6a: More halo will be observed for ratees where highly favorable OCBI information is ‘present than for ratees where OCBI information is neutral. If this effect carried over to accuracy measures (which is not clear given the results of Murphy and Balzer, 1989), then similar predictions can be made for stereotype and differential accuracy. H6b: Stereotype accurady will be worse for ratees where highly favorable OCBI information is present than for ratees where OCBI information is neutral. H6c: Differential accuracy will be worse for ratees where highly favorable OCBI information is present than for ratees where OCBI information is neutral. These are the only two Cronbach accuracy' measures for ‘which predictions are made, since it is not clear that OCB information would have any effect on the rank ordering of ratees (i.e., differential elevation). These predictions are made despite the cautions of Murphy and Balzer (1989) because this is a controlled laboratory study'where the true intercorrelations among dimensions will be known (Fisicaro, 1988). The weak link between error and accuracy noted.by Murphy and Balzer (1989) is not expected to hold in this particular instance. 
Figure 7: Target Profiles for Hypothetical Ratees

    Ratee 1: high in-role performance; high OCBI (consistent)
    Ratee 2: high in-role performance; neutral OCBI (inconsistent)
    Ratee 3: average in-role performance; high OCBI (inconsistent)
    Ratee 4: average in-role performance; neutral OCBI (consistent)
    Ratee 5: low in-role performance; high OCBI (inconsistent)
    Ratee 6: low in-role performance; neutral OCBI (consistent)

Interactions Between In-Role and Extra-Role Behaviors. The combinations which can be formed by the within-subjects factors are shown in Figure 7. Hypothetical ratee profiles will be constructed to match these six combinations. It can be seen that the consistency of information is highest for Ratees 1, 4, and 6, and lowest for Ratees 2, 3, and 5. Padgett and Ilgen (1989) predicted that when ratee performance information is inconsistent, raters will seek more information, yet be less accurate on a measure of differential elevation. They obtained their predicted results for accuracy, but not for amount of search. As both hypotheses are plausible, both will be tested again in the current research.

For amount of search, results in the present study should differ from Padgett and Ilgen (1989) for several reasons. First, before search, Padgett and Ilgen (1989) required subjects to view five videotaped vignettes for each of their four ratees. This requirement may have had the unintended consequence of limiting further search (mean number of vignettes searched was 8.76, out of 17 available for each ratee). In the current study, raters will have little prior information concerning ratees, and thus will need to access the computer to obtain the information they need to make their ratings. Second, in Padgett and Ilgen (1989), search was more labor-intensive and time-consuming. In the present study, information will be accessed quickly on the computer, and this increased search (beyond that observed by Padgett & Ilgen, 1989) is expected to aid raters in better noting inconsistent performance. Finally, to increase the realism of the task, Padgett and Ilgen (1989) had subjects simultaneously complete an in-basket exercise. One drawback of this manipulation is that it may have also served generally to limit subject-initiated search.

H7a: The amount of search will be greater for ratees where performance information is inconsistent concerning IRB and OCBI than for ratees where such information is consistent.

H7b: Differential elevation will be worse (less accurate) for ratees where performance information is inconsistent concerning IRB and OCBI than for ratees where IRB and OCBI information is consistent (Padgett & Ilgen, 1989).

Interactions Between Method of Rating and Extra-Role Behaviors. MacKenzie et al. (1991) speculated that the presence of OCB information would trigger the increased use of categories and schema by raters. This greater reliance on general impressions (or impressionistic codes; Hastie & Park, 1986) is expected to decrease rater accuracy in comparison to the accuracy of ratings made when such OCB information is absent (or neutral, as in the present study). Rather than proposing messy and improbable four- or five-way interactions utilizing accuracy and all independent variables, this study will focus on two three-way interactions between method of rating, OCB, and accuracy. Since the accuracy predictions concern the presence of OCB information (e.g., Organ, 1990a), level of in-role performance will be ignored in these predictions. Also, predictions for method of rating will be made by collapsing across prior knowledge conditions. Thus, the predictions concern the main effects for type of format, the favorability of OCBI information, and accuracy. The influence of OCB information is expected to primarily influence Cronbach's measures of stereotype and
The influence of OCB information is expected to primarily affect Cronbach's measures of stereotype and differential accuracy. These effects are presented graphically in Figure 8, and verbally below.

H8a: Stereotype accuracy - Two main effects and no interaction will be observed between method of rating, type of OCBI information available, and stereotype accuracy. Subjects rating by dimension will be more accurate than subjects rating by person. In addition, accuracy will be greater for ratees with neutral OCBI information than for ratees with favorable (i.e., more salient) OCBI information.

H8b: Differential accuracy - Two main effects and no interaction will be observed between method of rating, type of OCBI information, and differential accuracy. Subjects rating by dimension will be more accurate than subjects rating by person. In addition, accuracy will be greater for ratees with neutral OCBI information than for ratees with favorable OCBI information.

Summary

Accuracy has been of fundamental concern to psychologists and other researchers of organizational behavior for decades (Funder, 1987). This emphasis on accuracy is no less strong in the area of performance appraisal. This study will take widely-used accuracy measures, and then utilize a more recent methodological approach known as process tracing (Ford et al., 1989) to address a number of questions concerning method of rating and accuracy. Hypotheses concerning information acquisition and method of rating have been drawn from recent advances and applications of cognitive psychology (cf., DeNisi & Williams, 1988). A second set of questions concerns the manner in which raters utilize both in-role and extra-role behaviors when evaluating ratee performance. These hypotheses deal with areas relatively untouched by most previous research on organizational citizenship behaviors (Organ, 1988b; 1990a).

[Figure 8: Predicted results for stereotype accuracy and differential accuracy, by method of rating (person- versus dimension-blocked) and favorability of OCBI information (favorable versus neutral)]

Additionally, interactions between method of rating, extra-role performance, and accuracy are proposed. Previous research on information acquisition has noted that experienced raters may use different cognitive processes than naive (student) raters (Williams et al., 1986; DeNisi et al., 1989). This study will explore some of these issues, and should provide answers that are of more than academic interest to "real world" raters of performance. The next chapter lays out the specifics of the methodology used in this study.

CHAPTER 3: METHOD

Overview

This chapter lays out the sample, procedures, variables, and data analysis used in this study. The manner in which the content of the study was derived is presented first, followed by a discussion of the procedures used in the primary study. The primary study entailed allowing subjects to search a computer information board, and then asking them to provide ratings for six hypothetical ratees. The appropriate data analytic methods are presented for testing the various hypotheses presented in Chapter 2.

Participants

Subjects in this study were supervisors at Michigan State University. The study was conducted with the consent and cooperation of the university's Director of Personnel Administration, as well as the leadership of the Michigan State University Administrative Professional Supervisors Association (APSA).
The APSA represents over 850 supervisors employed by the university, and covers virtually all employees on campus with supervisory responsibilities. The subject matter experts were also drawn from this same pool of APSA supervisors.

Power Analysis

An a priori power analysis was conducted to determine the sample size needed in order to have adequate power to detect significant effects. Crucial variables for such calculations are the expected effect sizes for the experimental manipulations. Previous research using Cronbach's (1955) accuracy measures has generated medium (eta² = .06) to large (eta² = .20) effects using other manipulations (cf., Murphy et al., 1989; Padgett & Ilgen, 1989). The South Carolina researchers, while not using Cronbach's measures, have typically generated "medium" effects for their accuracy measures (as specified by Cohen, 1988). DeNisi and Stevens' (1988) manipulation of prior knowledge of format produced a sizable effect on accuracy (d > .50). Finally, the manipulations and measures of in-role and extra-role performance in MacKenzie et al. (1991) and Orr et al. (1989) produced very large effects; i.e., in MacKenzie et al. (1991), overall R² was .45; in Orr et al. (1989), R² = .74. Taken together, the above findings suggested that an overall R² of .60 was reasonable for the four independent variables and three interactions in this study.

What is also crucial for a power analysis, however, is the estimation of the unique variance (sr²) which each individual variable or interaction is expected to explain (Cohen and Cohen, 1983, p. 118). Using an overall R² = .60 with seven variables (4 independent variables and 3 interactions), an alpha of .05, and power of .80, various calculations can be made. If sr² = .03, then 113 subjects will be required; if sr² = .04, then 87 subjects will be required; and if sr² = .05, 71 subjects will be required. Given the likelihood that at least some of the variables or interactions would produce small effects, it was decided to include 116 subjects in this research (i.e., 29 subjects per between-subjects condition; cf., Figure 3). This provided adequate power to detect significant effects if they existed. A computational sketch of these sample size figures is shown below.
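The sketch that follows is an illustrative, modern reconstruction of the calculation just described, not the procedure actually used for the study. It follows the Cohen and Cohen (1983) framing: a single predictor (numerator df = 1) is tested over and above the remaining six, with the full-model R² assumed to be .60; all function names are illustrative.

```python
# A minimal sketch of the a priori power analysis, assuming a single added
# predictor (df_num = 1) among seven total predictors and full-model R^2 = .60.
from scipy.stats import f as f_dist, ncf

def power_for_n(n, sr2, r2_full=0.60, n_predictors=7, df_num=1, alpha=0.05):
    """Approximate power to detect a squared semipartial correlation sr2 with n subjects."""
    f2 = sr2 / (1.0 - r2_full)          # Cohen's effect size f-squared
    df_den = n - n_predictors - 1       # error df for the full model
    ncp = f2 * (df_num + df_den + 1)    # noncentrality parameter (Cohen's L)
    f_crit = f_dist.ppf(1.0 - alpha, df_num, df_den)
    return ncf.sf(f_crit, df_num, df_den, ncp)

def required_n(sr2, target_power=0.80):
    """Smallest n reaching the target power for the given sr2."""
    n = 10
    while power_for_n(n, sr2) < target_power:
        n += 1
    return n

if __name__ == "__main__":
    for sr2 in (0.03, 0.04, 0.05):
        print(sr2, required_n(sr2))   # roughly 113, 87, and 71 subjects
```

Under these assumptions the routine reproduces, to within rounding, the sample sizes of 113, 87, and 71 cited above.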
Deriving the Content of the Study

Procedures outlined by Latham and Wexley (1981) and Padgett and Ilgen (1989) were adapted for use in the present study. Following Padgett and Ilgen (1989), a general secretarial position was chosen as the "focal job" (all ratees were assumed to be holding a similar position in the organization). Almost all subjects in the study were currently evaluating the performance of one or more secretaries in their work unit; thus, this was a position which was highly familiar to study participants.

Questionnaires were first given to a pre-sample of university supervisors, asking them to rate the importance of various performance dimensions for a secretarial position. The dimensions were selected based on the discussions of Organ (1988b) and Williams (1988). Two criteria were used to select the dimensions: a) widespread agreement by the supervisors as to the relevance of the dimension to the organization and job in question, and b) each of Williams' (1988) behavioral categories (IRB, OCBO, and OCBI) was represented by two dimensions on the final list of dimensions selected (cf., Figure 6).

A second group of APSA supervisors was then recruited to serve as "subject matter experts", i.e., a group who generated and evaluated critical behavioral incidents, assigned each incident to its appropriate dimension, and then assigned final ratings to each incident and hypothetical ratee (producing the "true scores" for the Cronbach accuracy measures). Previous performance appraisal research using expert raters has varied in the number of raters used to generate such true scores, from a low of five raters in Cardy et al. (1987), to a high of 25 raters in Karl and Wexley (1989). Five studies used the videotapes and true scores from Murphy et al. (1982), where true scores were derived from 13 raters. The mean number of raters used in the 13 studies located was 12.5. From this, it was decided to collect ratings from 15 raters to generate the true scores for this study. Subject matter experts (SMEs) were volunteers, as well as supervisors who were recommended to the author by his contacts on the APSA board.

The design of the primary study requires that there be one critical incident for each ratee on each performance dimension; thus, with six ratees and six dimensions, a total of 36 critical incidents were required. These incidents were generated as follows. First, incidents were drawn from the research of Padgett and Ilgen (1989). Padgett and Ilgen (1989) interviewed secretaries at the same university, and generated numerous critical incidents. Many of these incidents were appropriate for the present research as well. Second, incidents were generated by the researcher, drawing on materials from Organ (1988b), L. Williams (1988), S. Williams and Hummert (1990), as well as on discussions with office personnel at the university. From these sources, a questionnaire was assembled containing a large number of critical incidents. The subject matter experts were asked to evaluate each incident on a seven-point Likert scale, and then go back through the incidents to assign each incident to its appropriate dimension of job performance.

Padgett and Ilgen (1989) used two criteria to evaluate agreement among their 10 expert raters: a) 70% agreement among the raters concerning which dimension was represented by a particular incident, and b) a standard deviation of less than 1.25 across ratings of the level of performance portrayed by each incident. These evaluation criteria were used in the present study as well.

Once ambiguous items were removed, the remaining incidents were modified as necessary to ensure that the target levels of in-role and extra-role performance were adequately portrayed. These final edited incidents were sent back to the SMEs for them to again rate the level of performance represented by each item. The items were assembled so as to portray six hypothetical ratees (see Figure 7). The subject matter experts were asked to rate each "ratee" on each dimension, as well as provide an overall performance rating for each ratee. The means of these SME ratings were then used as the "true scores" in Cronbach's (1955) accuracy measures.

Operationally, only two of the three categories of behavior discussed in Chapter 2 were manipulated. First, in-role behavior (IRB) was manipulated to portray high, average, and low levels of performance. Second, OCBI was manipulated to portray favorable versus neutral (i.e., average) manifestations of citizenship directed toward specific individuals (Williams, 1988; Puffer, 1987).
OCBO was tied to the level of in-role performance, and thus not independently manipulated, for two reasons: 1) this simplified the research design and data analysis, and 2) as Organ (1988b) noted, these types of dimensions "straddle the boundary" between in-role performance and OCB. VanDyne and Cummings (1990) noted that the difference between in-role and extra-role behavior is often extremely subtle; this seemed particularly true for OCBO-type dimensions. Thus, to maximize the distinction between in-role performance and OCB, only the OCBI form of citizenship was independently manipulated.

Procedure

Computer software developed by the Michigan State University Psychology Department was modified for use in the current study. This software was used by Kozlowski and Ford (1991), and allows the rater to search a computer information board for information on up to 12 ratees on 12 dimensions. In the current study, a 6 x 6 ratee by dimension matrix was developed, using the dimensions and hypothetical ratees just described. The software is designed so that a different ordering of ratees and dimensions is presented to each subject, thus reducing the likelihood of order effects confounding the results of the study.

Subjects were contacted individually by telephone to solicit their involvement in the study. When a meeting time was agreed upon, subjects were run individually in their own offices. The process tracing software is sufficiently flexible that it can be used on almost all IBM-compatible computers; thus, supervisors could choose to complete the project on their own computer, or on a laptop computer provided by the researcher. Supervisors were randomly assigned to one of the four between-subjects conditions (in Figure 3) by following a set order of proceeding through the between-subjects conditions, i.e., whatever condition was "up next" when a particular supervisor agreed to participate was the condition that supervisor received. Data collection was stopped once data had been collected from 116 supervisors.

In all conditions, supervisors were told that this research concerned ways to improve the accuracy of performance appraisal ratings, as well as determining the proper content for performance appraisal. The use of rater diaries as a method to increase rater accuracy was discussed, and subjects were told to assume that a supervisor whom they knew and respected had faithfully recorded behavioral incidents for each of her six subordinates. They were to assume that this other supervisor no longer worked for the university, and that they were now to complete the appraisals for these six "employees". They were to assume that the available items accurately portrayed behaviors observed by this supervisor, but that it was up to them (the subjects) to determine the ratings for each employee on each dimension, as well as to assign an overall rating to each employee. Their agreement to participate in the study was solicited at this point.

Next, the six performance dimensions, as well as the seven-point Likert rating scale to be used in the study, were presented to all subjects. Those subjects with no advance knowledge of the rating format, i.e., whether it would be person- or dimension-blocked (Conditions 3 and 4; Figure 3), were told only that they would make ratings for these ratees once they had completed their computer search. No mention of person- versus dimension-blocking was made until after the computer search had been completed.
At that point, they made their ratings using the appropriate format for their condition. Subjects in Conditions 1 and 2 (advance knowledge of a person- or dimension-blocked form, respectively) were shown the appropriate format prior to their computer search. Operationally, this was built into the computer programs for these conditions at the appropriate place, so as to maximize similarity across "runs" of the manipulation. Subject discussion of the rating format was discouraged at this point. Primarily, it was stated that a major issue in this research concerned the accuracy of ratings made with this particular rating format. Once subjects in Conditions 1 and 2 had completed their computer search, they made their ratings in the same manner (i.e., using the same formats) as subjects in Conditions 3 and 4, respectively.

In all conditions, subjects conducted their information search at their own pace, and note-taking was permitted. At any point, subjects could terminate their search by responding to the appropriate prompt on the computer screen. If they chose to search for 28 items (the maximum available), then the computer program automatically moved them into the rating phase of the study. All ratings were collected on the computer, using the format appropriate for each condition. Once subjects completed the rating task, they were asked to complete a brief "Final Questionnaire", which solicited several pieces of background information. This information (e.g., years of supervisory experience) was compared with the responses of the subject matter expert group, as well as among the four between-subjects conditions (as a check on the random assignment of subjects to conditions).

Constraints on Information Search. A final design issue to address in this research is whether any constraints should be placed on raters in terms of the amount of search they can undertake. DeNisi et al. (1984) discussed time pressures as an important variable influencing search strategies. DeNisi et al. (1983) found that subjects allowed to select only 9 of 25 available items (36%) produced a much different search pattern than those for whom search was unlimited. In their unlimited search condition, subjects preferred seeking information by person, followed by a preference for searching by task. In their constrained condition, over half the subjects displayed a mixed search strategy (neither person- nor task-blocked). Kozlowski and Ford (in press, Study 1) provided a stronger test of the effects of constraints on information acquisition. They allowed subjects to access 25%, 50%, 75%, or 100% of the available items, and found a large effect for search constraint.

In this study, a constraint intended to be moderate was used, i.e., in all conditions, subjects were allowed to access 28 of the available 36 critical incidents. This meant that each subject was able to access up to 78% of the information available on all the ratees. A constraint is useful in that it more closely resembles a "real" appraisal setting (where search is not unlimited). However, a constraint that was too severe was expected to limit the amount of rater variability in search patterns which was observed. A modest constraint addresses the concerns of DeNisi et al. (1984), without restricting variance on the amount of rater search too severely.
Further, such a constraint was expected to "force" raters to choose information from those dimensions which they felt were most important to them before making their ratings, which was useful in looking at the "in-role" versus "extra-role" issues raised in this research.

Variables

Overall Performance Ratings. These are the overall ratings provided by each subject for each hypothetical ratee. Means were calculated for each ratee separately, for the four between-subjects conditions, and for all ratees combined.

Error. Murphy and Balzer (1989) discussed two measures of halo, i.e., the median correlation between performance dimensions, computed across ratees (MEDCORR), and the variance of the ratings assigned to each ratee, averaged across ratees (VARRAT). Further, Murphy and Jako (1989) argue that when the number of observations is limited (in this case, the number of ratees), then a more stable estimate of halo is the mean correlation between performance dimensions (MNCORR). This value was also computed for each of the between-subjects conditions.

Accuracy. The primary dependent variables in this research were Cronbach's (1955) measures of accuracy. These were defined verbally above (pp. 18-19), and the formulas are provided in Appendix A. Most previous research has utilized the deviation score formulas (cf., Murphy & Balzer, 1989). With this approach, each accuracy score is a squared deviation score measuring some aspect of the differences between subject ratings and expert true scores for the hypothetical ratees. Computer programs to calculate these scores will be adapted from Balzer (Note 1). Dickinson, Hedge, Johnson, & Silverhart (1990) have presented an alternative means of analyzing accuracy scores (see the Data Analysis section below), but this method also relies on the notion of deviation scores (Dickinson, Note 2). Cronbach (1955) also presented formulas for differential elevation, stereotype accuracy, and differential accuracy which used variances and correlations. Becker and Cardy (1986) argued that these accuracy measures also carry individually meaningful information about rater accuracy. As Sulsky and Balzer (1988) point out, these latter measures are not sensitive to the distances between subject and true score ratings, and thus are not strictly measures of rating accuracy. They do, however, provide evidence of rating validity (Sulsky & Balzer, 1988), and thus, these three correlational forms of rating accuracy will also be calculated.
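As an illustration only (not the Balzer, Note 1, programs used in the study), the sketch below shows one way the halo indices just described (MEDCORR, MNCORR, VARRAT) and Cronbach's (1955) squared deviation-score components could be computed for a single rater, assuming the rater's ratings and the SME true scores are each stored as a ratee-by-dimension NumPy array. Published formulations differ in minor details (for example, whether square roots are taken before reporting), so this should be read as a sketch of the general logic rather than the exact operational formulas.

```python
import numpy as np
from itertools import combinations

def halo_indices(x):
    """x: ratings matrix, rows = ratees, columns = performance dimensions."""
    n_dims = x.shape[1]
    rs = [np.corrcoef(x[:, j], x[:, k])[0, 1]
          for j, k in combinations(range(n_dims), 2)]
    medcorr = np.median(rs)                       # MEDCORR (Murphy & Balzer, 1989)
    mncorr = np.mean(rs)                          # MNCORR  (Murphy & Jako, 1989)
    varrat = np.mean(np.var(x, axis=1, ddof=1))   # VARRAT: within-ratee variance, averaged
    return medcorr, mncorr, varrat

def cronbach_components(x, t):
    """Squared deviation-score components (Cronbach, 1955) for one rater.
    x and t are ratee-by-dimension matrices of subject ratings and true scores."""
    d = x - t                                     # discrepancy matrix
    grand = d.mean()
    ratee_dev = d.mean(axis=1) - grand            # ratee main effects of the discrepancy
    dim_dev = d.mean(axis=0) - grand              # dimension main effects of the discrepancy
    residual = (d - d.mean(axis=1, keepdims=True)
                  - d.mean(axis=0, keepdims=True) + grand)
    el2 = grand ** 2                              # elevation
    de2 = np.mean(ratee_dev ** 2)                 # differential elevation
    sa2 = np.mean(dim_dev ** 2)                   # stereotype accuracy
    da2 = np.mean(residual ** 2)                  # differential accuracy
    return el2, de2, sa2, da2
```

Square roots of the four components would then be taken for reporting, following the convention described in Chapter 4.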
Type of Search. Cafferty et al. (1986) measured type of search by studying the types of transitions subjects made when they chose which performance incidents to view. A person-blocked transition is one where subjects request another incident for the same ratee. A dimension-blocked transition is one where subjects ask for an incident for a different ratee on the same performance dimension. A mixed or nonblocked transition is one where subjects change both ratee and dimension when making their next choice (such "shifts" will be ignored in the data analysis). The number of each of these transitions will be calculated for each subject. Payne (1976) used the following formula to determine whether each subject's search pattern was person- or dimension-blocked:

    # of Person-Blocked Transitions - # of Dimension-Blocked Transitions
    ---------------------------------------------------------------------
    # of Person-Blocked Transitions + # of Dimension-Blocked Transitions

This formula ignores the mixed or nonblocked transitions, but does usefully collapse information on type of search into one value which resembles a correlation coefficient. If a subject made only person-blocked transitions, their score on this measure would be +1.00; if a subject made only dimension-blocked transitions, their score would be -1.00; and if a subject made an equal number of person- and dimension-blocked transitions, their score would be 0.00. Payne's (1976) measure was used for all tests concerning type of search in this research.
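A minimal sketch of this index is given below, assuming a rater's search log is available as an ordered list of (ratee, dimension) cells; the function name and data layout are illustrative and are not part of the process tracing software described above.

```python
def payne_index(accesses):
    """accesses: the (ratee, dimension) cells a rater viewed, in order.
    Returns +1.00 for purely person-blocked search, -1.00 for purely
    dimension-blocked search, 0.00 for an even mix (Payne, 1976)."""
    person = dimension = 0
    for (r1, d1), (r2, d2) in zip(accesses, accesses[1:]):
        if r1 == r2 and d1 != d2:
            person += 1        # same ratee, new dimension: person-blocked
        elif d1 == d2 and r1 != r2:
            dimension += 1     # same dimension, new ratee: dimension-blocked
        # transitions that change both ratee and dimension are ignored
    if person + dimension == 0:
        return 0.0
    return (person - dimension) / (person + dimension)
```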
Amount of Search. Amount of search was calculated as the number of items subjects requested before indicating that they were willing to make performance ratings. Theoretically, this value could range from zero to 28. These amount of search values were used to test Hypothesis 4b.

Data Analysis

In Chapter Two, seven dependent variables were proposed for this research, i.e., differential elevation (DEL), stereotype accuracy (SA), differential accuracy (DA), halo, overall performance ratings, amount of search, and type of search. The appropriate design for this research has been labeled "subjects within groups by conditions" by Cohen and Cohen (1983). This design is required because both between- and within-subjects factors are manipulated. Other labels for such a design include "split plot" and "subjects nested within groups repeatedly measured on a factor". In this research, the "groups" are the four between-subjects conditions presented in Figure 3 (format type, and prior knowledge of format); different subjects are nested within each of these groups. The repeated measures are the within-subjects manipulations of in-role and extra-role performance; all subjects, regardless of between-subjects manipulation, were presented with the same matrix of hypothetical ratees. In effect, each "ratee" should be viewed as a "condition" in Cohen and Cohen's (1983) terms, where each ratee represents a different pairing of the levels of the within-subjects manipulations. Thus, this research has subjects nested within four groups, all viewing six within-subjects "conditions" (i.e., ratees). Figure 9 (Section B) depicts this graphically.

Seven variables were created in order to test the various main effects and interactions posited by Hypotheses 1-8 (see Figure 9).

FIGURE 9: Contrast-Coded Variables for this Research

A. Between-Subjects Manipulations (Hypotheses 1-4):
   X1 = prior knowledge of format (yes = 1/2; no = -1/2)
   X2 = type of format (by person = 1/2; by dimension = -1/2)
   X3 = X1X2 interaction

                    X1      X2      X3
   Condition 1      1/2     1/2     1/4
   Condition 2      1/2    -1/2    -1/4
   Condition 3     -1/2     1/2    -1/4
   Condition 4     -1/2    -1/2     1/4

B. Within-Subjects Manipulations (Hypotheses 5-7):
   X4 = level of in-role performance (high = 1; average or low = -1/2)
   X5 = level of extra-role performance (positive = 1/2; neutral = -1/2)
   X6 = X4X5 interaction
   [The original figure also tabled the X4, X5, and X6 codes, and their crossings with X1-X3, for each of the six ratees within each condition.]

C. Relationship Between Type of Format and Level of Extra-Role Performance (Hypothesis 8):
   X7 = X2X5 interaction (the product of the X2 and X5 codes; e.g., 1/4 for high-OCBI ratees rated in the person-blocked conditions, -1/4 for neutral-OCBI ratees in those conditions, with the signs reversed in the dimension-blocked conditions)

Variables X1 and X2 portray information concerning prior knowledge of rating format and type of format, respectively. Variable X3 carries their interaction (i.e., X1X2). This was the interaction of interest in Hypotheses 1-4. Variables X4 and X5 represent the manipulations of in-role and extra-role performance, respectively. Variable X6 represents the interaction between these two variables (X6 = X4X5). This is the interaction of interest for Hypotheses 7a and 7b. A final variable, X7, portrays the interaction between X2 (type of format) and X5 (level of OCBI information present), i.e., X7 = X2X5. This is the interaction of interest in Hypotheses 8a and 8b.

As shown in Figure 9, contrast coding was used to represent the nominal values portrayed by these seven variables. This type of coding is most appropriate for factorial designs where interactions are of particular interest, since meaningful regression coefficients, as well as partial and semi-partial correlations, can be obtained for both the proposed main effects and their interactions (Cohen & Cohen, 1983). For most variables, this coding is straightforward. For X4, level of in-role performance, values were assigned in accordance with past research results (see Chapter 2).

Various hierarchical regression analyses were planned to test the predictions of Hypotheses 1-8. Cohen and Cohen (1983) noted that hierarchical regression is the counterpart to the analysis of covariance. Both procedures allow the researcher to statistically control sources of variation which are irrelevant to the research questions of interest. In this study, the general procedure was to enter the "extraneous" variables into the regression equation in Step One of the process, with the variables of interest for a particular hypothesis entered in Step Two. Significant regression coefficients were anticipated for the variables in Step Two, depending on the specific hypotheses being tested.

Figure 10 summarizes the predictors for this research project. For Hypotheses 1-4, variables X4 through X7 were entered first, followed by variables X1 through X3. Following the hypotheses in Chapter 2, when the dependent variable was differential elevation, stereotype accuracy, or differential accuracy, each of the variables X1, X2, and X3 was expected to have a significant regression coefficient. For halo, amount of search, and type of search, only X1 and X3 were expected to have significant regression coefficients. For overall performance ratings, none of the regression coefficients were expected to be statistically significant. For Hypotheses 5-7, variables X1, X2, X3, and X7 were entered in Step One, followed by the variables of interest for each dependent variable (see Figure 10). Finally, for Hypotheses 8a and 8b, variables X1, X3, X4, and X6 were entered first, followed by variables X2, X5, and X7. As mentioned previously, a significant regression coefficient was not expected for X7 for any of the dependent variables to be tested.
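The two-step procedure just described can be illustrated with a small sketch. It assumes the dependent variable and the seven contrast-coded predictors are available as NumPy arrays, one observation per row, and uses ordinary least squares to compare the Step One and Step Two models; the function name and variable layout are illustrative and this is not the analysis code actually used in the study.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f as f_dist

def hierarchical_step(y, step1_vars, step2_vars):
    """Step One: 'extraneous' contrast codes; Step Two: codes of interest.
    Returns the R-squared increment, its F test, and the full-model coefficients
    (Cohen & Cohen, 1983)."""
    X1 = sm.add_constant(np.column_stack(step1_vars))
    X2 = sm.add_constant(np.column_stack(list(step1_vars) + list(step2_vars)))
    fit1 = sm.OLS(y, X1).fit()
    fit2 = sm.OLS(y, X2).fit()
    k_added = len(step2_vars)
    df_err = len(y) - X2.shape[1]
    r2_inc = fit2.rsquared - fit1.rsquared
    f_inc = (r2_inc / k_added) / ((1.0 - fit2.rsquared) / df_err)
    p_inc = f_dist.sf(f_inc, k_added, df_err)
    return r2_inc, f_inc, p_inc, fit2.params

# Illustrative call for Hypothesis 1 (halo as the dependent variable):
# r2_inc, f_inc, p_inc, betas = hierarchical_step(halo, [x4, x5, x6, x7], [x1, x2, x3])
```

When a single variable is added in Step Two, the R-squared increment equals that variable's squared semipartial correlation (sr²), which is the quantity used in the power analysis above.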
FIGURE 10: Summary of the Variables from Hypotheses 1-8 Predicted to be Statistically Significant

Dependent Variable          Predictors entered (hypothesis in parentheses)
Overall Performance Rating  X1, X2, X3
Halo                        B* X1 (H1), X2 (H1), B* X3 (H1), B* X5 (H6a)
Differential Elevation      B* X1 (H2a), B* X2 (H2a), B* X3 (H2a), B* X5 (H7b)
Stereotype Accuracy         B* X1 (H2b), B* X2 (H2b), B* X3 (H2b), B* X5 (H6b), X7 (H8a)
Differential Accuracy       B* X1 (H2c), B* X2 (H2c), B* X3 (H2c), B* X4 (H5a), B* X5 (H6c), X7 (H8b)
Type of Search              B* X1 (H3), X2 (H3), B* X3 (H3)
Amount of Search            B* X1 (H4), X2 (H4), B* X3 (H4), B* X4 (H5b), B* X5 (H7a)

Note: B* = this coefficient is predicted to be statistically significant.

CHAPTER 4: RESULTS

This chapter first documents how the content and true scores were derived for this study, and then presents the results of the primary analyses. Results are presented in turn for each of the hypotheses discussed above in Chapter 2.

Content Derivation for the Study

Questionnaires were given to 10 university supervisors, asking them to rate the importance of six performance dimensions for a secretarial position. As can be seen in Table 1, all six a priori dimensions were rated as highly important by these supervisors (nine completed questionnaires were returned). Job Knowledge and Accuracy of Work was rated as most important (mean = 6.67), with the lowest importance ratings given to Extra Effort/Initiative (mean = 5.56). Supervisors were also asked to record any other dimensions which they felt should be measured, but were not being captured by these six dimensions. Four supervisors responded to this open-ended question. Three wrote that there should be some measure of "attitude" or helpfulness included; one wrote that "interpersonal skills" were necessary; and one listed a need for independent judgment or decision making. Given the high means and low standard deviations in Table 1, as well as the low level of response to the open-ended question, it was decided that these dimensions met the criteria listed in Chapter 3, and could thus be used when deriving the content for the primary study.

TABLE 1: Mean Importance Ratings Given to the Six Performance Dimensions
(1 = not at all important; 7 = very important)

                                                                        Mean   S.D.
A. Job Knowledge and Accuracy of Work (possessing the necessary
   knowledge and skills to perform the job; accuracy and
   thoroughness of work)                                                6.67    .50
B. Productivity (amount of work completed; ability to efficiently
   organize work)                                                       6.11    .93
C. Dependability/Attendance (infrequent tardiness, unscheduled
   absences, etc.)                                                      6.33    .71
D. Following Policies and Regulations (following all necessary
   rules, regulations, policies, and procedures)                        6.56    .73
E. Cooperation and Teamwork (providing assistance and support to
   others; coordinating work with others)                               6.22   1.20
F. Extra Effort/Initiative (takes on extra tasks when needed, goes
   the "extra mile")                                                    5.56   1.13

A second group of APSA supervisors was recruited to serve as "subject matter experts", i.e., the group who would generate and evaluate critical behavioral incidents, assign each incident to its appropriate dimension, and then make final ratings for each incident and hypothetical ratee (generating the "true score" measures). Supervisors were recruited from several sources to be subject matter experts (SMEs). Five supervisors volunteered at an APSA monthly meeting. The remainder were recommended to the researcher by one of these five individuals (two of whom were APSA board members). The primary criterion for inclusion in the SME group was a high degree of willingness to assist in such a project.
Similar to previous research using subject matter experts, supervisors were deemed "expert" largely because they were provided an extended opportunity to review the relevant critical incidents (cf., Borman, 1977; Sulsky & Balzer, 1988). The SME group had worked for the university an average of 17 years, with an average of 12.5 years of total supervisory experience; both figures were higher than those reported by supervisors in the primary sample. Means and standard deviations for these background information items are given in Table 2. T-tests were conducted comparing the means for the subject matter experts with those from the primary sample of 116 supervisors. None of these comparisons were statistically significant, although the t-test for years of supervisory experience approached statistical significance (t = 1.76, p < .10). Those in the SME group had an average of three years more total supervisory experience than those in the primary sample.

One of the concerns that Sulsky and Balzer (1988) expressed about performance appraisal accuracy research is the level of "expertise" actually possessed by subject matter experts. In the current research, it is not claimed that the raters in the subject matter expert group were better supervisors or better raters than the remaining supervisors, only that their increased involvement with the study material allowed them a better opportunity to make informed ratings.

Table 2: Background Information for Subject Matter Experts and Primary Sample
(mean values; standard deviations in parentheses; -- = mean not recoverable)

                              SMEs      Primary     Condi-     Condi-     Condi-     Condi-
Item                          (n=15)    (n=116)     tion 1     tion 2     tion 3     tion 4
Years worked for university   17.13     15.72       15.69      16.17      15.10      15.90
                              (7.11)    (6.97)      (7.45)     (6.85)     (6.91)     (7.00)
Years of supervisory          12.53     --          10.24      --         --         --
  experience                  (5.91)    (6.85)      (6.79)     (5.42)     (6.57)     (8.50)
Years in present position     --        --          --         --         --         --
                              (4.30)    (5.94)      (5.77)     (5.69)     (6.11)     (6.43)
Raters' favorableness to      --        --          --         --         --         --
  rating process              (1.53)    (1.25)      (1.02)     (1.49)     (1.48)     (.99)
Perceived favorableness of    --        --          --         --         --         --
  raters' actual employees
  to rating process           (1.06)    (1.32)      (1.31)     (1.39)     (1.48)     (1.07)

The 36 critical incidents were generated as follows. First, as hoped, the majority of incidents were culled from the research of Padgett and Ilgen (1989). Padgett and Ilgen (1989) interviewed five secretaries who worked at the same university, and generated 101 critical incidents. The means and standard deviations for these incidents were provided to the researcher by Padgett (Note 3). Over 40 incidents could be adapted to fit the dimensions and levels of intended performance in the present study. Second, approximately 20 incidents were generated by the author, drawing on materials from Padgett and Ilgen (1989), Organ (1988b), L. Williams (1988), S. Williams and Hummert (1990), as well as on discussions with the office supervisor and secretaries in his own department. From these sources, a questionnaire was assembled containing 56 critical incidents. The 15 SMEs were asked to evaluate each incident on a seven-point Likert scale, and then go back through the incidents to assign each incident to its appropriate dimension of job performance (see Table 1).

The two criteria from Padgett and Ilgen (1989) were used to evaluate agreement among SME raters. Based on this, 14 of the 56 incidents had to be discarded because of low agreement among the SMEs concerning the dimension portrayed by that incident.
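For illustration, the two screening rules just applied can be expressed as a short function. The data layout (a list of dimension assignments and a list of seven-point level ratings for each incident) is assumed for the example and is not taken from the study materials.

```python
import statistics

def keep_incident(dimension_votes, level_ratings, intended_dim,
                  min_agreement=0.70, max_sd=1.25):
    """Apply the two Padgett and Ilgen (1989) screening criteria to one incident:
    a) at least 70% of SMEs assign the incident to its intended dimension, and
    b) the standard deviation of the level ratings is below 1.25."""
    agreement = dimension_votes.count(intended_dim) / len(dimension_votes)
    sd = statistics.stdev(level_ratings)   # sample SD of the 7-point ratings
    return agreement >= min_agreement and sd < max_sd

# Illustrative call:
# keep_incident(["Productivity"] * 12 + ["Extra Effort"] * 3,
#               [5, 6, 5, 6, 5, 5, 6, 5, 6, 5, 5, 6, 5, 6, 5],
#               intended_dim="Productivity")
```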
The second criterion (low standard deviations) was not a major issue in the current study, as only eight of the 56 incidents had standard deviations for the SME ratings exceeding 1.00, and only one exceeded 1.25.

A more difficult issue for the current study was finding appropriate incidents for each dimension, where a suitable number of incidents portrayed the desired or target level of performance (e.g., high, average, or low; see Figure 7). When the initial profiles of hypothetical ratees were compiled, it was discovered that there were no suitable incidents remaining which portrayed high levels of "Following Rules and Regulations" or average levels of "Extra Effort and Initiative". Thus, additional incidents were generated, and these were then rated by a smaller group of eight supervisors. This procedure generated the four incidents needed to complete the profiles.

Once the ambiguous, redundant, or otherwise non-optimal items had been removed, the remaining 36 incidents were used as the content of the study. The final wording of some of the items was modified, in an effort to better capture targeted levels of performance. As stated previously, these items were assembled so as to portray six hypothetical ratees. These composites were presented to the SMEs, who were asked to rate each "ratee" on each dimension, as well as provide an overall performance rating for each ratee. Since an important question in this research was whether ratings made "by person" were different from ratings made "by dimension", the SMEs were also shown the same incidents arranged by dimension prior to making their ratings of each incident and ratee. The means of these SME ratings were then used as the "true scores" in Cronbach's (1955) accuracy measures.

Table 3: True Scores from the 15 Subject Matter Experts

Ratee   Dimension                          Target Level   Mean Rating
Pat     Job Knowledge & Accuracy           high           5.13
        Productivity                       high           4.87
        Dependability/Attendance           high           5.27
        Following Policies & Procedures    high           5.20
        Cooperation & Teamwork             high           5.40
        Extra Effort/Initiative            high           5.80
        OVERALL                                           5.47
Chris   Job Knowledge & Accuracy           high           4.53
        Productivity                       high           4.87
        Dependability/Attendance           high           5.20
        Following Policies & Procedures    high           5.93
        Cooperation & Teamwork             average        4.33
        Extra Effort/Initiative            average        3.60
        OVERALL                                           4.87
Terry   Job Knowledge & Accuracy           average        4.53
        Productivity                       average        4.13
        Dependability/Attendance           average        4.07
        Following Policies & Procedures    average        4.40
        Cooperation & Teamwork             high           4.93
        Extra Effort/Initiative            high           5.27
        OVERALL                                           4.60
Kim     Job Knowledge & Accuracy           average        4.00
        Productivity                       average        4.13
        Dependability/Attendance           average        4.00
        Following Policies & Procedures    average        4.07
        Cooperation & Teamwork             average        4.73
        Extra Effort/Initiative            average        4.80
        OVERALL                                           4.20
Jody    Job Knowledge & Accuracy           low            2.60
        Productivity                       low            2.60
        Dependability/Attendance           low            2.40
        Following Policies & Procedures    low            2.93
        Cooperation & Teamwork             high           4.40
        Extra Effort/Initiative            high           5.00
        OVERALL                                           3.07
Lynn    Job Knowledge & Accuracy           low            2.53
        Productivity                       low            1.93
        Dependability/Attendance           low            1.86
        Following Policies & Procedures    low            3.60
        Cooperation & Teamwork             average        4.00
        Extra Effort/Initiative            average        4.43
        OVERALL                                           3.00

As can be seen in Table 3, there was a good match between the targeted levels of performance and the mean ratings provided by the subject matter experts. Ratings for Pat (high in-role, high extra-role), Chris (high in-role, average extra-role), and Lynn (low in-role, average extra-role) were particularly close to target levels.
For the remaining ratees, the match was very good for 14 of the 18 available incidents. The four exceptions were: a) Terry (average in-role, high extra-role) was rated higher by the SMEs than desired on Job Knowledge & Accuracy of Work; b) Kim (average in-role, average extra-role) was rated higher than desired on both extra-role dimensions; and c) Jody (low in-role, high extra-role) was rated lower than desired for Cooperation and Teamwork. It is not thought that the results presented below were strongly affected by these deviations from the intended target levels of performance (e.g., see Table 10 below).

Next, the results from the primary study are presented, first for the between-subjects manipulations (Hypotheses 1-4), then for the within-subjects manipulations (Hypotheses 5-7), and finally for the proposed interaction between method of rating, level of OCBI, and accuracy (Hypotheses 8a and 8b).

Results from the Primary Study

Overall Ratings

Both subject matter experts and supervisors in the primary study made ratings of overall performance for each ratee. Means for these ratings are provided in Table 4.

Table 4: Overall Ratings by Ratee: True Scores, Whole Sample, and by Condition
(mean ratings; deviations from SME true scores in parentheses)

                               SMEs     Primary Sample   Condition 1    Condition 2    Condition 3    Condition 4
Ratee                          (n=15)   (n=116)          (n=29)         (n=29)         (n=29)         (n=29)
Pat (high IRB, high OCBI)      5.47     5.54 (+.07)      5.69 (+.22)    5.72 (+.25)    5.28 (-.19)    5.48 (+.01)
Chris (high IRB, avg. OCBI)    4.87     4.85 (-.02)      4.86 (-.01)    4.86 (-.01)    5.00 (+.13)    4.69 (-.18)
Terry (avg. IRB, high OCBI)    4.60     4.53 (-.07)      4.72 (+.12)    4.41 (-.19)    4.55 (-.05)    4.41 (-.19)
Kim (avg. IRB, avg. OCBI)      4.20     4.21 (+.01)      4.31 (+.11)    4.07 (-.13)    4.21 (+.01)    4.24 (+.04)
Jody (low IRB, high OCBI)      3.07     3.16 (+.09)      3.21 (+.14)    3.03 (-.04)    3.31 (+.24)    3.10 (+.03)
Lynn (low IRB, avg. OCBI)      3.00     3.12 (+.12)      3.10 (+.10)    3.07 (+.07)    3.17 (+.17)    3.14 (+.14)
Mean (all six ratees)          4.20     4.235 (+.035)    4.316 (+.116)  4.195 (-.005)  4.253 (+.053)  4.178 (-.022)

Several points can be made concerning these overall ratings. First, the mean ratings given to each ratee by the primary sample (n = 116) are extremely close to those given by the subject matter experts. On average, study participants gave slightly higher ratings than those given by the SMEs (mean discrepancy = +.035 across all six ratees), but none of the differences between the SME ratings and the ratings from the whole sample were statistically significant. From this, it can be inferred that intended performance levels were well captured by the critical incidents used in this study.

Second, there were some differences between the conditions in the overall ratings given to each ratee, but again, these were quite small in magnitude. Subjects with prior knowledge of format gave slightly higher ratings than the true scores (mean discrepancies from the true scores were +.056, versus +.012 when no prior knowledge of format was provided), and subjects rating by person rated higher than those rating by dimension (mean discrepancies of +.085 versus -.013), but neither of these differences reached statistical significance. Analyses of variance for each of the six ratees by prior knowledge (X1) and format type (X2) produced only one statistically significant difference: subjects with prior knowledge of format (Conditions 1 and 2) rated Pat higher than those without prior knowledge of format (p < .05, eta² = .04).
Third, a regression analysis was performed on these overall ratings, using the seven contrast-coded variables discussed in Chapter 3. This does not directly address any of the hypotheses in the study, but does serve as a rough manipulation check for both the between- and within-subjects manipulations. These seven variables accounted for 41% of the variance in overall ratings. As would be expected from the values presented in Table 4, none of the beta weights for the between-subjects manipulations were statistically significant (X1, X2, or their interaction). However, each of the within-subjects manipulations was significant beyond the p < .001 level. Level of in-role performance (X4) was the dominant variable, with a squared semi-partial correlation (sr²) of .371. Level of extra-role performance (X5) accounted for an additional .025 of overall variance, while the X4*X5 interaction explained an additional .012 in rating variance. The final variable, X7, had no effect on overall ratings. This pattern of relationships is indicative of the results presented below relevant to the specific research hypotheses.

As expected, then, the between-subjects manipulations of format type and prior knowledge of format had only minimal impact on overall performance ratings. Viewed in the most positive light, this could be taken as evidence for the high level of skill possessed by all the raters in this study at making overall ratings of performance. Given the high experience levels presented in Table 2, this is a plausible explanation. Despite this, however, questions remain to be addressed concerning the halo and accuracy observed in these ratings.

Halo

Hypothesis 1 predicted that there would be more halo exhibited when ratings were made by person, but that this effect would occur only when subjects had prior knowledge of rating format. Results concerning this hypothesis are shown in Table 5. As predicted, there was a sizable difference in the median intercorrelations depending on format type, with higher intercorrelations (indicating greater halo) in the person-blocked conditions (MEDCORR = .42 vs. .32, t = 5.07, p < .001). There was also an unexpected effect for prior knowledge, with higher intercorrelations in the conditions where prior knowledge of format type was provided (MEDCORR = .40 vs. .34, t = 3.24, p < .01). These effects are shown graphically in Figure 11. As can be seen, there are two main effects, and no interaction.

Table 5: Median and Mean Intercorrelations Among Dimensions, Across Ratees

                                                 MEDCORR    S.D.      MNCORR     S.D.
Condition 1 (Prior Knowledge of
  Person-Blocked Format)                         .44922     .09925    .44726     .09438
Condition 2 (Prior Knowledge of
  Dimension-Blocked Format)                      .35173     .21621    .37743     .18484
Condition 3 (No Prior Knowledge of
  Person-Blocked Format)                         .38382     .15827    .38995     .16765
Condition 4 (No Prior Knowledge of
  Dimension-Blocked Format)                      .29560     .15737    .29748     .13507
Subject Matter Experts (True Scores)             .34275     .32994    .28978     .20267

Murphy and Jako (1989) argued that with a small number of ratees, the average or mean intercorrelations provide a more stable estimate of halo. Repeating these calculations using the mean intercorrelations produced similar, though slightly weaker, results. Intercorrelations were higher with the person-blocked format (MNCORR = .41 vs. .34, t = 2.98, p < .01), and with prior knowledge of format type (MNCORR = .40 vs. .34, t = 2.44, p < .05). Murphy and Balzer (1989) also presented a halo measure that used the variance of the ratings assigned to ratees, averaged across ratees. None of the analyses using this measure of halo (VARRAT) produced differences which were statistically significant. In all cases, standard deviations were very large, and the instability spoken of by Murphy and Jako (1989) seemed to be a major influence on this.

[Figure 11: Median intercorrelations among dimensions (halo), by method of rating (person versus dimension) and prior knowledge of format (yes versus no)]

Overall, then, there is partial support for Hypothesis 1, in that halo was higher in the person-blocked conditions. The next dependent variables to address are those relating more directly to accuracy.

Accuracy

The mean ratings for each ratee on each dimension are presented in Table 6. Comparing the true score means to the means for the whole sample reveals far greater differences between the rating sources than did the overall ratings presented in Table 4. Only 16 of the 36 ratings were reasonably close to the true score ratings. Twelve ratings were markedly higher than the true scores, and eight ratings were considerably lower than the true scores (a difference of approximately +/- .30 is significant at p < .05; if a more stringent alpha, such as p < .01, is used so as to reduce the likelihood of Type I error per comparison, 10 ratings still differ significantly from the true scores). In all cases, these differences were in the direction of the mean rating given to that ratee, which corresponds to the halo evidence presented above.

A regression analysis similar to that done on the overall ratings was conducted on these ratings, with similar results. The seven contrast-coded variables from this study accounted for 28% of the variance in ratings given. The only significant beta weights were those for in-role performance (sr² = .253), extra-role performance (sr² = .021), and their interaction (sr² = .005).

Hypotheses 2a - 2c can be tested using either the correlational accuracy or the distance score formulations from Cronbach (1955). Results will be reported first using the correlational accuracy formulations, then using the distance score formulations.

Table 6: Mean Ratings by Ratee and Dimension: Whole Sample, and by Condition

Ratee/                 SMEs    Primary   Condi-   Condi-   Condi-   Condi-
Dimension              (n=15)  Sample    tion 1   tion 2   tion 3   tion 4
                               (n=116)   (n=29)   (n=29)   (n=29)   (n=29)
Pat
  Job Knowl.           5.13    5.42      5.24     5.72     5.24     5.48
  Productiv.           4.87    5.35      5.17     5.66     5.31     5.28
  Dependab.            5.27    5.46      5.59     5.55     5.38     5.31
  Policies             5.20    5.20      4.97     5.34     5.45     5.03
  Teamwork             5.40    5.39      5.49     5.62     5.14     5.34
  Ex.Effort            5.80    5.46      5.38     5.62     5.34     5.48
Chris
  Job Knowl.           4.53    4.95      4.97     5.10     4.97     4.76
  Productiv.           4.87    4.90      4.79     5.10     5.07     4.62
  Dependab.            5.20    5.16      5.24     5.17     5.21     5.03
  Policies             5.93    5.02      4.93     5.03     5.14     4.97
  Teamwork             4.33    4.43      4.45     4.41     4.41     4.45
  Ex.Effort            3.60    4.47      4.52     4.34     4.62     4.38
Terry
  Job Knowl.           4.53    4.59      4.52     4.66     4.69     4.48
  Productiv.           4.13    4.28      4.34     4.14     4.38     4.24
  Dependab.            4.07    4.58      4.76     4.48     4.66     4.41
  Policies             4.40    4.50      4.59     4.55     4.55     4.31
  Teamwork             4.93    4.82      4.93     4.62     4.79     4.93
  Ex.Effort            5.27    4.84      4.90     4.72     5.10     4.66
Kim
  Job Knowl.           4.00    4.03      4.00     4.07     4.14     3.90
  Productiv.           4.13    4.21      4.34     4.24     4.10     4.14
  Dependab.            4.00    4.02      3.97     4.03     4.21     3.86
  Policies             4.07    4.16      4.31     4.03     4.24     4.07
  Teamwork             4.73    4.49      4.52     4.45     4.52     4.48
  Ex.Effort            4.80    4.53      4.62     4.48     4.66     4.38
Jody
  Job Knowl.           2.60    2.99      3.28     2.97     3.00     2.72
  Productiv.           2.60    2.91      2.86     2.97     2.90     2.90
  Dependab.            2.40    2.68      2.59     2.76     2.66     2.72
  Policies             2.93    3.41      3.41     3.41     3.45     3.38
  Teamwork             4.40    4.17      4.17     4.10     4.17     4.24
  Ex.Effort            5.00    4.06      4.24     3.72     4.14     4.14
Lynn
  Job Knowl.           2.53    2.89      2.97     2.72     2.93     2.93
  Productiv.           1.93    2.79      2.79     2.79     2.69     2.90
  Dependab.            1.86    2.69      --       2.76     2.72     2.72
  Policies             3.60    3.65      3.72     3.62     3.59     3.66
  Teamwork             4.00    3.88      3.69     4.03     3.76     4.03
  Ex.Effort            4.43    3.85      3.79     3.66     4.07     3.90

As stated above, the correlational approach is more aptly viewed as a measure of rating validity (Sulsky & Balzer, 1988).

Correlational Accuracy. Hypothesis 2a concerned differential elevation, proposing that accuracy would be higher for those rating by person than for those rating by dimension, and that accuracy would be highest for those knowing in advance that they were going to rate by person, and lowest for those knowing in advance that they were going to rate by dimension. This hypothesis was not supported. The differential elevation correlation (DECORR) was very high in all conditions, ranging from r = .885 in Condition 1 to r = .922 in Condition 4 (mean DECORR across conditions = .903, s.d. = .086). Analyses of variance using DECORR as the dependent variable, and X1, X2, and their interaction as the independent variables, revealed only a marginally significant effect for type of format, where DECORR was higher for those who rated by dimension (r = .888 by person, versus r = .917 by dimension, p < .10). This is opposite to what was predicted for differential elevation, where it was expected that those rating by person would be better able to correctly rank order ratees.

Staying with correlational accuracy, but moving to Hypotheses 2b and 2c, the opposite predictions were made for stereotype accuracy (SACORR) and differential accuracy (DACORR). It was expected that accuracy would be greater when rating by dimension, and that it would be best with prior knowledge of a dimension-blocked format, and worst with prior knowledge of a person-blocked format. These hypotheses were also not confirmed. For stereotype accuracy (H2b), correlations were lower and much more varied than for DECORR (mean SACORR across conditions = .594, s.d. = .312). Analysis of variance revealed a significant main effect for prior knowledge of format, with greater correlational accuracy for those without prior knowledge of format type (r = .659, versus r = .528 for those with prior knowledge, p < .05, eta² = .044). Also, the X1*X2 interaction approached statistical significance (p = .068, eta² = .028). This interaction can be seen in Figure 12. As Figure 12 indicates, accuracy was best for those in Condition 4, and worst for those in Condition 2.

Hypothesis 2c concerned differential accuracy. The mean DACORR across the four conditions was r = .444, with a standard deviation of .212. There was almost no variance across conditions, from a low of r = .435 in Condition 1 to a high of r = .453 in Condition 3. Analysis of variance revealed no significant effects for the between-subjects manipulations or their interaction. Overall, then, the correlational accuracy measures provided partial support for the efficacy of rating by dimension. This support was weak, however, and not consistent with the hypotheses put forth in the study.

Distance Accuracy. Hypotheses 2a - 2c can also be analyzed using the distance score formulations presented by Cronbach (1955). As Murphy and Balzer (1989) demonstrated, this has been the most common operationalization of rater accuracy in the performance appraisal literature. Regression analyses were conducted using the seven contrast-coded variables discussed in Chapter 3.
Five separate analyses were conducted, using elevation, differential elevation, stereotype accuracy, differential accuracy, and overall accuracy as the dependent variables.

[Figure 12: Results for stereotype accuracy correlations (SACORR), by method of rating (person versus dimension) and prior knowledge of format (yes versus no)]

[Figure 13: Results for the distance score accuracy components (elevation, differential elevation, stereotype accuracy, and differential accuracy), by method of rating and prior knowledge of format]

For stereotype accuracy (Hypothesis 2b), accuracy was greater for subjects without prior knowledge of format (sr² = .031) and when rating by dimension (p < .001, sr² = .02). The X1*X2 interaction was also significant (p < .001, sr² = .019). As Figure 13 indicates, subjects in Condition 2 were least accurate in terms of stereotype accuracy. This is opposite from what Hypothesis 2b predicted. Multiple R² for this equation was .07.

Concerning differential accuracy and Hypothesis 2c, there was a statistically significant effect for format type which was in line with the hypothesis, i.e., accuracy was better when rating by dimension (p < .05, sr² = .001). Again, however, the amount of explained variance was extremely small (R² = .002).

The Cronbach (1955) accuracy measures produce a squared value for each component, i.e., EL², DE², SA², and DA². Following Murphy et al. (1982), it has become customary to use and report the square roots of these values as the accuracy results for each component. These are the values depicted in Figure 13, and used in the above analyses. It is also possible to sum these squared values, and then take the square root of that sum as a measure of overall accuracy. This value is presented in Figure 14. As can be seen, the pattern is very similar to that found for differential elevation: accuracy is greater without prior knowledge of format (p < .001, sr² = .015), and when rating by dimension (p < .001, sr² = .003). The interaction is also statistically significant (p < .05, sr² = .001), such that accuracy is best for those in Condition 4. Multiple R² for this equation was .02.

[Figure 14: Results for overall distance score accuracy, by method of rating and prior knowledge of format]

As discussed in Chapter 3, the reason for utilizing the above regression analyses was to simultaneously test for the effects of both the between-subjects and within-subjects manipulations. Unfortunately, when these regression analyses were conducted, the beta weights for the within-subjects variables were all zero. In retrospect, it can be seen that this is a mathematical necessity, given the way these dependent and independent variables were constructed. The dependent variables come from Cronbach (1955), and the formulations used in this study were adapted from Balzer (Note 1). These formulations produce one value per subject for each of the accuracy components, i.e., one EL² score, one DE² score, one SA² score, and one DA² score. The contrast-coded variables (X4, X5, X6) were designed to capture in-role performance, extra-role performance, and their interaction, and are by nature orthogonal. Such orthogonal variables cannot explain any variance in a single (point) value. Thus, the regression analyses presented above were problematic, in that they did not accomplish the purpose set out for them in Chapter 3. Fortunately, another way of measuring accuracy was located, which is capable of measuring much of what was intended by the above regression analyses. This approach comes from Dickinson (1987; Dickinson et al.
, 1990). Dickipsop's MANOVA Approach. Dickinson (1987) laid out an analysis of variance design to test the accuracy of performance ratings. The differences between performance ratings from a given sample and true score ratings are built into the MANOVA calculations using orthonormal contrasts. The overall differences between the true scores and the sample is picked up as a "rating sources" source of variation. This corresponds to Cronbach's (1955) concept of elevation. The other primary sources of Variation are for "ratee", "dimension”, and the "ratee x dimension" 1rlteraction. These correspond to Cronbach's (1955) conceptions for d1- fferential elevation, stereotype accuracy, and differential accuracy, 110 respectively. Thus, all of Cronbach's (1955) distance score accuracy components are included in a.mu1tiple analysis of variance. Such a design allows the researcher to simultaneously test between? and withinrsubject manipulations. Dickinson et a1. (1990) utilized this framework to test the effects of cognitive modeling and feedback in rater accuracy training. In Experiment 2, they used eight between-subject conditions, and tested the effects of these manipulations on ratings made for seven ratees across three performance dimensions. Substantively, Dickinson et al. (1990) found that almost all of the variation explained in their ratings was due to differences between the samples in the levels of the ratings they'made; almost none of this could be explained by the between-subject manipulations or their interactions. After discussions with Dickinson (Note 2), it was determined that this design framework was also applicable to the present study, and the results of this analysis are presented next. Table 7 displays the results of this analysis. Since the ratees and dimensions were selected by the researcher, it assumes a fixed effects model (Kirk, 1982). The analysis was done using;MANOVA on SPSS-X. SPSS—X provides values for partial eta-squared as an estimate of effect size. ,As noted by Cohen (1988), partial eta—squared overestimates actual effect size. It is, however, a consistent measure, and does provide a reasonable estimate of the relative effects of each source of variation. As can'be seen, most of the variation in these orthonormal contrasts was explained by differences between the samples in terms of a) their overall rating levels (i.e., ratings sources, or elevation), b) their ratings for the ratees (differential elevation), c) their ratings for the 111 Table 7 Analysis of Variance for Prior Knowledge and Format Type on Accuracy Partial Spppppppi df MS F—ratio §p§£_ Rating Sources 1 11.92 6.28* .056 Prior Knowledge (X1) 1 .26 .14 .001 Format Type (X2) 1 .91 .48 .004 X1 x X2 1 1.08 .57 .005 Raters/Condition (R/C) 112 1.90 Ratees (E) 5 2.93 3.91** .034 E x X1 5 .34 .46 .004 E x X2 5 1.17 1.56 .014 E x X1 x X2 5 .46 .62 .005 E x R/C 560 .75 Dimensions (D) 5 21.53 87.14*** .438 D x X1 5 .25 1.01 .009 D x X2 5 .50 2.04 .018 D x X1 x X2 5 .28 1.15 .010 D x R/C 560 .25 Ratees x Dimensions 25 9.15 51.11*** .313 E x D x X1 25 .16 .88 .008 E x D x X2 25 .20 1.14 .010 E x D x X1 x X2 25 .24 1.37 .012 E x D x R/C 2800 .18 .10 .05 .01 *** .001 X- 'U'U'O'd AAAA 112 dimensions (stereotype accuracy), and d) by the ratee x dimension interaction (differential accuracy). 
Consistent with previous analyses, the smallest difference between the ratings of the subject matter experts and the whole sample was in their ratings of the six ratees; while this difference was significant (p < .01), partial eta² was .03. Similarly, differences in overall rating levels, or elevation, also had a small but statistically significant influence on accuracy (p < .05, partial eta² = .06). The greatest amount of variation was explained by differences between the sources in their ratings of the six dimensions used in the study (p < .001). Partial eta² for dimensions (i.e., stereotype accuracy) was a sizable .44. Differences between the rating sources in their ratee x dimension interactions also accounted for a large amount of variation (p < .001, partial eta² = .31). As seen by both the F-ratios and the partial eta-squared values, the between-subject manipulations had only a negligible effect on these sources of variation.

The tests of Hypothesis 2a are the "ratee" interactions for ratee by format type (E x X2) and ratee by prior knowledge by format type (E x X1 x X2). As seen in Table 7, both of these interactions had small effects which were not statistically significant. Hypothesis 2b is tested by two of the dimension interactions (i.e., D x X2, and D x X1 x X2). The interaction between dimensions and format type approached statistical significance (p < .10, partial eta² = .02), but was not as predicted by Hypothesis 2b. Accuracy was worse when rating by dimension; again, this was due to the poor showing in this regard by subjects in Condition 2. The ratee by dimension by format type interaction (D x E x X2) and the ratee by dimension by prior knowledge by format type interaction (D x E x X1 x X2) test Hypothesis 2c. Both of these interactions had small, non-significant results.

This analysis of variance procedure (Dickinson, 1987) can also be used to test the hypotheses relevant to the within-subject manipulations (Hypotheses 5-7). Results from these analyses will be presented below. Two primary findings should be drawn from the above discussion: a) this MANOVA procedure corroborates the generally weak findings for the between-subjects manipulations, and b) most of the variance was explained by rating level differences between the subject matter experts and the sample which were not strongly related to the study's hypotheses. This was particularly true in regards to how each group viewed the dimensions and the ratee by dimension interaction present in this study.

Type of Search

Hypothesis 3 predicted that subjects would display a general preference for information search by person, and that this would be stronger for subjects in Condition 1, and less pronounced for subjects in Condition 2. Means, standard deviations, and breakdowns by condition are presented in Table 8. As can be seen, subjects exhibited a strong preference for a person-oriented search pattern. Eighty-two percent of all study participants searched for information in a person-blocked fashion; it is not evident, however, that the prior knowledge manipulation had any impact on subsequent type of search.
Table 8
Means, Standard Deviations, and Breakdowns Concerning Type of Search

                Number of people           Payne's type of
                searching by:              search measure
Condition       Person     Dimension       Mean      S.D.
1                 24           5            .538      .710
2                 22           7            .537      .695
3                 23           6            .446      .699
4                 26           3            .680      .523
OVERALL           95          21            .551      .658

The two groups with prior knowledge of rating format (Conditions 1 and 2) had near identical values on Payne's (1976) type of search measure, indicating a strong preference for searching for information by person. Subjects in the no prior knowledge conditions were slightly more varied in their type of search, but still strongly person-oriented in their search patterns. Analysis of variance using Payne's measure as the dependent variable, and X1, X2, and their interaction as the independent variables, produced no statistically significant differences between the conditions. Thus, the key aspect of Hypothesis 3 (that prior knowledge of format type would influence subsequent search) was not supported.

The above results led to the question of whether type of search is more of an individual difference variable, which is less subject to manipulation than was thought prior to data collection. This post hoc explanation is strengthened by the large number of extreme values found for the Payne measure; i.e., in general, people either strongly preferred seeking information by person or by dimension, regardless of what condition they were in.

In Chapter 1, Figure 1 posited a potential direct link between type of search and accuracy. No specific hypotheses were formulated regarding this linkage, but for completeness, such analyses were also conducted. First, type of search was dichotomized as either person- or dimension-oriented. This new variable was then entered into analysis of variance equations patterned after Dickinson et al. (1990) and Table 7 (instead of X1 and X2). There was no evidence from this analysis that person- versus dimension-blocked searching influenced any of the four accuracy measures; i.e., it cannot be said based on these data that one type of search pattern led to greater accuracy than the other. A second alternative explanation is that what is most important for rating accuracy is that there be congruence between the manner in which a subject searches for information and the type of format which they then use to make their ratings. Cafferty (Note 4) has preliminary evidence of such a link in a similar process tracing research project. This was tested in the current study by dividing subjects into "congruent" or "incongruent" search/format groups (for congruents, their search pattern and format type matched; for incongruents, they did not). Again, there were no statistically significant differences explained by this breakdown. Thus, a link between type of search and accuracy cannot be supported based on the results from the current study.

Amount of Search

Hypothesis 4 predicted that subjects would voluntarily end their information search before they had accessed all available items. Practically, it was expected that subjects would choose to end their search before they had reached the maximum allowable of 28 items. This hypothesis received no support whatsoever from the current research, in that 104 out of 116 subjects chose to access the maximum of 28 possible items. The lowest amount of search by any study participant was 22 items, and the means in the four conditions were all between 27.65 and 27.83.
Obviously, the analysis of variance on amount of search by X1, X2, and their interaction was not statistically significant. In retrospect, it seems clear that the design and information constraints employed in this study kept Hypothesis 4 from receiving a fair test in this instance. This concludes the results for the between-subject manipulations. The next issue to address is the effect of the within-subject manipulations of in-role and extra-role performance.

Approach to the Within-Subject Manipulations

With some modification, the analysis of variance procedure described above from Dickinson (e.g., Table 7) can also be used to measure the effects of the in-role and extra-role manipulations. From Kirk (1982), the appropriate experimental design for this study is the SPF-pr*qt, where there are two between-subject blocks and two within-subject blocks. What is necessary for this design, however, is that there be one data point for each "ratee". This requires collapsing across the dimensions in this study to derive one mean rating for each ratee, rather than an individual rating for each of the six dimensions. Once this is done, orthonormal contrasts can be computed and used in a multiple analysis of variance similar to that described above. It should be noted that these results do not correspond directly to the Cronbach (1955) measures of accuracy. They do, however, provide a sense of the overall impact of the within-subject manipulations, and will be followed up with more specific tests of the hypotheses in this study.

The results of this analysis are presented in Table 9. As can be seen, in-role performance had a small, but statistically significant, independent effect on these pooled accuracy ratings (partial eta² = .034). By itself, extra-role performance did not have a statistically significant effect on the accuracy of these ratings, but the interaction of in-role and extra-role performance was significant (p < .01, partial eta² = .056). Further, the three-way interaction of in-role performance, extra-role performance, and format type approached significance (p < .10, partial eta² = .022). These interactions are depicted in Figures 15 and 16. The values depicted are orthonormal contrasts, where the differences between subject ratings and true score ratings have been divided by the square root of 2 (Dickinson, Note 2; Kirk, 1982). The means and standard deviations by ratee are given in Table 10. Tukey's Honest Significant Difference (HSD) procedure revealed that the only difference between mean ratings which was statistically significant was that between the ratings of Kim and Lynn (Glass & Hopkins, 1984; Dickinson et al., 1990).

Table 9
Analysis of Variance for In-Role and Extra-Role Performance

Source                              df      MS     F-ratio   Partial Eta²
Rating Sources                       1     1.99      6.28*        .056
Prior Knowledge (X1)                 1      .04       .14         .001
Format Type (X2)                     1      .15       .48         .004
X1 x X2                              1      .18       .57         .005
Raters/Condition (R/C)             112      .32
In-Role Performance (In)             2      .60      3.98*        .034
In x X1                              2      .09       .61         .005
In x X2                              2      .25      1.68         .015
In x X1 x X2                         2      .19      1.26         .011
In x R/C                           224      .15
Extra-Role Performance (Ex)          1      .04       .28         .002
Ex x X1                              1      .04       .29         .003
Ex x X2                              1      .01       .10         .001
Ex x X1 x X2                         1      .00       .01         .000
Ex x R/C                           112      .14
In-Role x Extra-Role                 2      .60      6.61**       .056
In x Ex x X1                         2      .03       .34         .003
In x Ex x X2                         2      .23      2.49         .022
In x Ex x X1 x X2                    2      .00       .03         .000
In x Ex x R/C                      224      .09

Note. p < .10 (marginal); * p < .05; ** p < .01; *** p < .001.
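Since the effect sizes in Tables 7 and 9 are reported as partial eta-squared values, it may help to note how those values relate to the F-ratios and degrees of freedom shown in the same rows. The sketch below applies the standard conversion, partial eta² = SS_effect / (SS_effect + SS_error) = (F x df_effect) / (F x df_effect + df_error), to a few entries as a consistency check. It is illustrative only and is not a reanalysis of the study's data.

```python
def partial_eta_squared(F, df_effect, df_error):
    """Partial eta-squared recovered from an F-ratio:
    SS_effect / (SS_effect + SS_error) = F*df_effect / (F*df_effect + df_error)."""
    return (F * df_effect) / (F * df_effect + df_error)

# Entries read from Tables 7 and 9 (values match the tables within rounding):
print(round(partial_eta_squared(87.14, 5, 560), 3))  # Table 7, Dimensions            -> 0.438
print(round(partial_eta_squared(3.98, 2, 224), 3))   # Table 9, In-Role Performance   -> 0.034
print(round(partial_eta_squared(6.61, 2, 224), 3))   # Table 9, In-Role x Extra-Role  -> 0.056
```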
[Figure 15. Interaction of In-Role and Extra-Role Performance: mean orthonormal contrast values plotted for each ratee, by level of in-role and extra-role performance. Graphic not reproduced.]

[Figure 16. Interaction of In-Role and Extra-Role Performance, by Type of Format. Graphic not reproduced.]

Table 10
Means and Standard Deviations for Orthonormal Contrasts by Ratee

Ratee                             Mean      S.D.
Pat (high IRB, high OCBI)         .074      .472
Chris (high IRB, ave. OCBI)       .054      .432
Terry (ave. IRB, high OCBI)       .032      .449
Kim (ave. IRB, ave. OCBI)        -.035      .273
Jody (low IRB, high OCBI)         .034      .343
Lynn (low IRB, ave. OCBI)         .164      .361

The specific hypotheses regarding in-role and extra-role performance are addressed next.

In-Role Performance

Hypothesis 5a predicted that differential accuracy would be greater for ratees exhibiting high levels of in-role behaviors (IRB) than for ratees exhibiting average or low levels of IRB. As Figure 15 demonstrates, accuracy was worst for Lynn (low IRB, average OCBI), but was next worst for the two ratees high in IRB (Pat and Chris). Accuracy was best for the two ratees with average levels of IRB. When the Cronbach (1955) differential accuracy measure was recalculated to test this hypothesis, there was a statistically significant difference between the means, but in the opposite direction from that proposed by Hypothesis 5a. The mean DA value for the high IRB ratees was .66, as compared to .54 for the average and low IRB ratees (t = -8.21, p < .001). Thus, Hypothesis 5a was not supported.

Hypothesis 5b predicted that there would be more search for ratees with average or low levels of in-role behavior than for those with high IRB. The mean amount of search by ratee is shown in Table 11.

Table 11
Means, Rank Order, and Standard Deviations for Amount of Search by Ratee

Ratee                             Mean (Rank)     S.D.
Pat (high IRB, high OCBI)         4.34 (5)        1.04
Chris (high IRB, ave. OCBI)       4.63 (4)        1.47
Terry (ave. IRB, high OCBI)       4.32 (6)        1.23
Kim (ave. IRB, ave. OCBI)         4.70 (3)        1.55
Jody (low IRB, high OCBI)         4.78 (2)        1.40
Lynn (low IRB, ave. OCBI)         4.97 (1)        1.11

Search was higher for the lower "performing" ratees. Comparing the mean for the high IRB ratees (4.48) to that for the others (4.69) showed a small difference in the direction of the hypothesis. A repeated measures multiple analysis of variance using level of in-role performance as the repeated measure demonstrated that this difference between the means approached statistical significance (F = 3.04, p < .10, partial eta² = .026). Thus, Hypothesis 5b received modest support from these findings.

Extra-Role Performance

Hypothesis 6a predicted that halo would be higher for ratees who exhibited high levels of extra-role behaviors (OCBI). Table 12 gives the means and standard deviations relative to this hypothesis.

Table 12
Median and Mean Intercorrelations by OCBI

                                      MEDCORR     S.D.       MNCORR     S.D.
High OCBI (Pat, Terry, Jody)          .43883     .17011     .44073     .22101
Average OCBI (Chris, Kim, Lynn)       .30136     .15977     .31533     .22076

T-tests comparing the median intercorrelations revealed strong differences between ratees who were high versus average on OCBI (t = 7.88, p < .001). This finding held up with the mean intercorrelations as well (t = 5.37, p < .001). Thus, there was strong support for the hypothesis that there would be more halo with the presence of highly favorable extra-role behaviors.

Hypotheses 6b and 6c predicted that this halo effect would carry over into differences between the ratee groups on stereotype accuracy and differential accuracy. Means and standard deviations for these values can be seen in Table 13.

Table 13
Means and Standard Deviations for SA and DA by Level of OCBI

                                      St. Acc.    S.D.     Diff. Acc.    S.D.
High OCBI (Pat, Terry, Jody)           .4822      .162       .5317       .146
Average OCBI (Chris, Kim, Lynn)        .4291      .144       .7436       .129

Hypothesis 6b was supported, in that stereotype accuracy was worse for ratees high on OCBI. A repeated measures MANOVA using level of OCBI as the repeated measure was statistically significant (F = 10.49, p < .01, partial eta² = .084). However, Hypothesis 6c was not supported, in that differential accuracy was substantially worse for ratees who were average (or neutral) on OCBI (F = 217.39, p < .001, partial eta² = .654). There is, then, only mixed support for the notion that halo effects would carry over to accuracy effects (at least, in the manner hypothesized in this study).

In-Role and Extra-Role Interactions

Tables 9 and 10 above, as well as Figure 15, showed that at a general level, the accuracy of ratings was affected by the interaction of in-role and extra-role performance. In this study, subjects were least accurate in rating Lynn (low IRB, average OCBI). In general, their ratings of all the ratees were higher than the ratings given by the subject matter experts, with the exception of Kim, where the mean rating from the sample was lower than the true scores. Hypotheses 7a and 7b concerned two specific interactions that were predicted based on past performance appraisal research.

Hypothesis 7a predicted that subjects would search for more information on ratees who were inconsistent in the levels of in-role and extra-role behaviors exhibited than for ratees where performance information was consistent concerning in-role and extra-role performance. Referring back to Table 11 above, ratees Pat, Kim, and Lynn were considered "consistent", and Chris, Terry, and Jody as "inconsistent". The mean for "consistent" ratees was 4.67; for "inconsistent", it was 4.58. This difference was in the opposite direction from what was expected; however, a repeated measures MANOVA revealed that this difference was not statistically significant. Overall, then, Hypothesis 7a was not supported.

Hypothesis 7b predicted that differential elevation would be worse for inconsistent ratees than for ratees who were consistent in their performance levels. This was tested by calculating separate differential elevation values for each group of ratees, and then analyzing these in a repeated measures MANOVA. The mean DE for the consistent ratees was .394; for the inconsistent ratees, it was .406. This difference was in the direction hypothesized (lower scores indicating greater accuracy), but was not significant (p = .64), and only accounted for a very small proportion of rating variance (partial eta² = .002). Thus, Hypothesis 7b was also not supported. There is no evidence in this study that the consistency of ratee performance information impacted rater search or accuracy.

Method of Rating, OCBI, and Accuracy

The final pair of hypotheses concerned the relationship between person- versus dimension-blocked rating, level of extra-role behavior, and accuracy.
It was expected that the greater reliance on general impressions when rating by person would be enhanced when subjects rated ratees displaying high levels of OCBI. As depicted in Figure 8, two main effects and no interaction were predicted, both for stereotype accuracy and for differential accuracy. Means and standard deviations relevant to Hypotheses 8a and 8b are given in Table 14. The results of the repeated measures MANOVA are presented in Table 15.

Table 14
Means and S.D.'s for SA and DA by Method of Rating and Level of OCBI

Stereotype Accuracy
                                      By Person            By Dimension
                                      Mean     S.D.        Mean     S.D.
High OCBI (Pat, Terry, Jody)          .4642    .151        .5003    .172
Average OCBI (Chris, Kim, Lynn)       .4355    .135        .4227    .153

Differential Accuracy
                                      By Person            By Dimension
                                      Mean     S.D.        Mean     S.D.
High OCBI (Pat, Terry, Jody)          .5387    .153        .5247    .139
Average OCBI (Chris, Kim, Lynn)       .7376    .143        .7497    .116

Consistent with prior analyses, there was no main effect for method of rating. There was a main effect for level of extra-role behavior, which accounted for 8.5% of the variance. There was some indication of an interaction between method of rating and level of OCBI (partial eta² = .019), but this did not attain statistical significance (p = .14). Overall, then, there was little support for Hypothesis 8a.

Table 15
MANOVA for Method of Rating, OCBI, and Stereotype Accuracy

Source                        df      MS     F-ratio   Partial Eta²
Method of Rating (X2)          1     .01       .25         .002
Raters/Condition (R/C)       114     .03
Level of OCBI (OCBI)           1     .16     10.60**       .085
X2 x OCBI                      1     .03      2.25         .019
OCBI x R/C                   114     .02

Table 16
MANOVA for Method of Rating, OCBI, and Differential Accuracy

Source                        df      MS     F-ratio   Partial Eta²
Method of Rating (X2)          1     .00       .00         .000
Raters/Condition (R/C)       114     .03
Level of OCBI (OCBI)           1    2.61    217.06***      .656
X2 x OCBI                      1     .01       .82         .007
OCBI x R/C                   114     .01

Note. * p < .05; ** p < .01; *** p < .001.

Hypothesis 8b also predicted two main effects and no interaction between method of rating and level of OCBI, using differential accuracy as the dependent variable. The means were presented above in Table 14, and the results of a repeated measures MANOVA are given in Table 16. Once again, there was no main effect for method of rating. Level of OCBI had a sizable main effect, with a partial eta² of .656. As noted in Hypothesis 6c above, however, this was in the opposite direction from what was predicted. Finally, the interaction between method of rating and OCBI was not significant. Thus, there was no support for Hypothesis 8b.

CHAPTER 5: DISCUSSION

Three broad questions guided this research: a) does the method of making ratings (by person versus by dimension) influence various process and outcome variables?, b) do experienced raters make use of information concerning both in-role and extra-role behavior when making performance appraisal ratings?, and c) is there an interaction between rating format and the level of extra-role performance demonstrated by ratees?

Hypotheses 1-4 tested issues relevant to the first question. Results were strongest for the influences of type of format and prior knowledge of that format on measures of halo. Results were mixed concerning the effects of these variables on various conceptualizations of rater accuracy. Results were weakest concerning any effects of these between-subject manipulations on the two process variables (type and amount of search). Hypotheses 5-7 tested the effects of in-role and extra-role behavior on both process and outcome variables. As with Hypotheses 1-4, results were mixed concerning the specific hypotheses in this study.
Results clearly supported the proposed link between level of extra-role behavior and halo. Level of extra-role behavior was also significantly related to both stereotype and differential accuracy, although the latter was in the opposite direction from what was hypothesized. Finally, there was some evidence that level of in-role behavior influenced the amount of search made for particular ratees. When results from various aspects of this study are combined, there is strong support for the notion that these raters utilized both in-role and extra-role information when making their ratings.

Finally, Hypothesis 8 tested for the interaction between method of rating, level of extra-role behavior directed toward other individuals, and two accuracy measures. For both stereotype and differential accuracy, there was a significant main effect for level of extra-role behavior, no main effect for method of rating, and no interaction.

In this chapter, Hypotheses 1-7 are discussed separately, under the broad headings of "Type of Format" and "Type of Performance Dimensions". Following this, the proposed interaction between method of rating and extra-role behavior is discussed. The chapter concludes with a summary and discussion of directions for future research.

Hypotheses Concerning Type of Format

Overall Ratings

Because of the conflicting findings from previous research (Williams et al., 1986; Cafferty et al., 1986; DeNisi et al., 1989), no formal hypotheses were advanced in this study concerning the effects of type of format or prior knowledge of format on the overall ratings given to each ratee. In this study, those subjects rating by person were less accurate than those rating by dimension. Averaging across ratees, the person-blocked conditions rated .085 higher than the true scores, while the dimension-blocked conditions rated .013 lower than the true scores. While the difference between these two values was not statistically significant, at the least these findings do not support the argument from Williams et al. (1986) that overall ratings should be more accurate when ratings are made by person.

Table 4 did not report the standard deviations by ratee and condition, but these ranged from .37 to .88, with an average standard deviation across ratees and conditions of .69. Thus, there was a fair amount of variance in these ratings. What is remarkable, however, is how closely the mean ratings from the primary sample matched the true scores assigned to each ratee. Given the high levels of work experience reported in Table 2, it is plausible to argue that rating overall performance levels is something at which this sample is quite proficient. Type of format seemed to make little difference in the ability of these raters to measure overall performance levels.

Halo

One of the strongest effects found in this study was that for type of format on both the median and mean intercorrelation among ratings of dimensions, across ratees. This was somewhat surprising, in part because the four previous studies which made such a format distinction and used halo as a dependent variable found no significant differences between ratings made by person versus by category or dimension (cf. Cooper, 1981). Thus, this effect was only expected when raters were told in advance which format they would be using to make their ratings. Instead, as Table 5 and Figure 11 made clear, there was also a significant main effect for prior knowledge of format. Halo was highest in Condition 1 (prior knowledge of a person-blocked format), and
next highest in Condition 3 (also personrblocked format, but with no prior knowledge). Halo in Condition 2 (prior knowledge, dimension format) was slightly lower than Condition 3. Halo was lowest of all in Condition 4 (no prior knowledge, dimension format). 132 Table 5 also reports the median and mean intercorrelation among dimensions for the subject matter experts. This corresponds to Cooper's (1981) notion of "true halo”, i.e., how much intercorrelation among these dimensions there “should be". As can be seen, the mean intercorrelation for the subject matter experts was .2898. Condition 4 had a mean intercorrelation which was quite close to this (.2975), but all the other conditions had MNCORR values which were considerably higher, indicating that the experimental manipulations had increased the level of halo beyond the true intercorrelations. The median intercorrelation for the true scores was .3428, which was higher than the mean intercorrelation. Conditions 1, 2 and 3 had MEDCORR ‘values above this, while the MEDCORR.value for Condition 4 was below this (.2956). It would seem that unexpectedly having to make performance ratings by dimension caused raters in Condition 4 to see less intercorrelation among dimensions than was "actually there" (at least, according to the subject matter experts). While speculative, it is possible that such a scenario led raters to underestimate the true relations among dimensions (cf., Murphy 6: Jako, 1989). In any case, comparing across the two halo measures, raters in Condition 4 had halo values which were closest to the subject matter experts. It is not entirely clear why the results for the halo effect were so much stronger in this study than has been found in previous research. One possibility has to do with the use of a search constraint in the present research. As discussed above in Chapter 3, subjects were allowed to access up to 28 of the 36 available critical incidents. This constraint was selected to emulate a real appraisal setting, where search is not 133 unlimited. Also, it was expected that the constraint would force raters to think about which information.they'most wanted.to have before they'made their ratings. Observation of the subjects while they underwent the search task, as well as evidence to be presented below under ”Type of Performance Dimensions" would indicate that the constraint achieved this objective for the vast majority of subjects. Unfortunately, such forced selectivity came at the cost of "forced halo”, i.e., without any information on ratee performance on certain dimensions, subjects were forced to rely on the information they had gained concerning ratee performance on other dimensions. In this case, the best response raters could make (in terms of optimally responding to the situation presented to them) was to make at least eight "haloed" ratings. Viewed in retrospect, it seems surprising that the halo measures for the primary sample aren't more divergent from the true scores than they are. In any case, it is possible that this "forced halo" may have served to accentuate the differences between ratings made by person versus by dimension. Accuracy Correlational Accuracy. Some of the most puzzling findings in this study concern the impact of type of format and prior knowledge of format on measures of rater accuracy. 
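Before reviewing those findings, it may help to recall how the correlational forms of accuracy differ from the distance scores. The sketch below shows one common formulation of the three correlational components (cf. Sulsky & Balzer, 1988): the differential elevation correlation relates a rater's ratee means to the true ratee means, the stereotype accuracy correlation does the same across dimension means, and the differential accuracy correlation relates the ratee-by-dimension residuals. The function and example data are illustrative assumptions only; the exact formulas used in this study are those described in Chapter 3.

```python
import numpy as np

def correlational_accuracy(ratings, true_scores):
    """Correlational analogs of differential elevation, stereotype accuracy,
    and differential accuracy for one rater (higher r = greater accuracy)."""
    X = np.asarray(ratings, dtype=float)
    T = np.asarray(true_scores, dtype=float)
    corr = lambda a, b: float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
    # residuals after removing ratee and dimension effects (ratee x dimension pattern)
    resid = lambda M: M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

    return {
        "DECORR": corr(X.mean(axis=1), T.mean(axis=1)),   # rank ordering of ratees
        "SACORR": corr(X.mean(axis=0), T.mean(axis=0)),   # rank ordering of dimensions
        "DACORR": corr(resid(X), resid(T)),               # ratee x dimension pattern
    }

# Illustrative example only: invented 6 x 6 ratings.
rng = np.random.default_rng(1)
true = rng.integers(2, 7, size=(6, 6)).astype(float)
observed = true + rng.normal(0.0, 0.8, size=(6, 6))
print(correlational_accuracy(observed, true))
```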
First, using the correlational forms of accuracy, the magnitude of these correlations was quite impressive; i.e., across the four conditions, r = .90 for the differential elevation correlation, r = .59 for the stereotype accuracy correlation, and r = .44 for the differential accuracy correlation (all p < .001). This indicates that this sample of supervisors was a) extremely good at capturing the proper rank ordering of ratees, b) quite good at capturing the levels of performance demonstrated on the various dimensions, and c) relatively good at capturing the "true" performance of the various ratees on the various dimensions (according to the true scores). Type of format and prior knowledge of format had only a modest impact on these correlational measures. Contrary to expectations, subjects rating by dimension were slightly more accurate in rank ordering ratees (DECORR) than those rating by person (p < .10). Also, subjects with prior knowledge of the dimension format were worst at rank ordering performance on the different dimensions (SACORR), while subjects without prior knowledge of a dimension-blocked format were best (p < .10). These findings are not strong enough or clear enough to argue that rating accuracy was substantially affected by method of rating (Sulsky & Balzer, 1988).

Distance Accuracy. Moving to the more commonly used distance score formulations, accuracy was greater when rating by dimension for three of the four accuracy components, i.e., for elevation, differential elevation, and differential accuracy (Figure 13). This was also true for overall accuracy (Figure 14). However, the magnitude of these effects was small (sr² never exceeded .02), and only the effect for differential accuracy was in line with the predictions made in this study. Even in retrospect, the author finds it hard to explain why the results for differential elevation and stereotype accuracy came out opposite from what was predicted. Concerning rank ordering the ratees, it would seem that a 6 x 6 matrix of ratees and dimensions must not have been large enough for raters to lose sight of overall ratee performance levels (see page 35 above). But why subjects in Condition 4 would turn out to be best at rank ordering ratees, or why subjects in Condition 2 would be worst at ranking performance on the dimensions, is far from clear.

The results concerning prior knowledge of format were also not as anticipated. It was expected that prior knowledge would impact rater accuracy primarily in conjunction with type of format, i.e., through the various interactions presented in Figure 4. Instead, prior knowledge had a stronger independent effect than was hypothesized. Moreover, regardless of condition, subjects with prior knowledge of format were generally less accurate than subjects without prior knowledge (see Figure 13). Using distance scores, the main effect for prior knowledge was statistically significant for differential elevation, stereotype accuracy, and overall accuracy. As was shown in Chapter 4 and will be discussed below, the prior knowledge manipulation did not have its intended impact on the two process variables (type and amount of search). Thus, the projected link between prior knowledge and search could not be established. Yet, in some manner, prior knowledge was affecting the accuracy of the ratings given. It would seem that, in this study at least, viewing the appraisal format prior to search proved to be a disruptive influence in the rating process.
One possible explanation for this is that the manipulation led subjects in Conditions 1 and 2 to expend more energy focusing on the particular format which was presented to them, rather than on viewing and rating the performance incidents. In a “real” appraisal setting, of course, it should be clear well in advance of the rating task what type of format raters will use to make their ratings. Thus, the impact of prior knowledge on accuracy should be less salient in an applied setting. 136 Nonetheless, it is not particularly encouraging that, in general, subjects in this study who were "surprised” by the appraisal format they received were mm accurate than those who knew their respective format in advance. Returning to the information processing literature, an explanation can be found for the general superiority of subjects in Condition 4 (no prior knowledge of format, and then a dimension-oriented format). Subjects in this condition exhibited less halo than subjects in the other conditions, and were also generally more accurate (cf. , Figures 13 and 14). Ilgen and Feldman (1983), Lord (1985) and others have written extensively about the differences between the "automatic" and "controlled” processing of information. Automatic processing is the dominant mode of information processing, whereby individuals engage in little conscious monitoring or processing of information (e. g. , this is typical when driving a car; Ilgen 6: Feldman, 1983) . Controlled processing takes place when information processing is specifically under the conscious control or mediation of the individual. Such processing is generally brought forth to deal with some “effortful or problematic" situation (Ilgen & Feldman, 1983, p. 156). The above description fits the current study quite well. Lord (1985) argued that most attention, storage, and retrieval of information from memory is governed by automatic processing. This and prior research have shown raters' strong tendencies to search for information and make ratings "by person". This is most likely the "automatic" response to such an information processing task. In contrast, subjects in Condition 4 were " j olted" by an unexpected rating format. It is possible that unexpectedly having to rate by dimension caused subjects in this condition to engage in 137 far more controlled processing (before making their ratings) than did subjects in the other conditions. Such an explanation is post hoc and speculative, but is worthy of further research. To summarize the above findings, then, does type of format influence rater accuracy? The best answer from the distance and correlational accuracy results would seem to be that there is some influence, in that ratings made by dimension are slightly more accurate. But the results were not strong, and.were not consistent with prior research or theory in this area. Several points can be made concerning the generally weak findings concerning the between-subject manipulations and rater accuracy. The first two points draw on the analyses patterned after Dickinson (1987). As Table 7 demonstrated, most of the variance in these ratings was not explained by type of format or prior knowledge of format. There were sizable differences in how the primary sample and the subject matter experts viewed the performance levels demonstrated by these ratees on these dimensions, but these differences were only modestly related to the betweenrsubject manipulations. 
It is possible that the search constraint used in this study not only served to increase halo (as mentioned above), but also added significant "noise" (or guessing) which weakened overall accuracy results, for these and earlier accuracy analyses.

Second, the observed power values for Table 7 were much lower than expected. The average observed power value across the different sources of variance was .58, well below the .80 specified in the a priori power analysis. The experiment was thus designed and carried out with enough statistical power to be able to detect "small" effect sizes (as specified by Cohen, 1988); unfortunately, however, actual effect sizes came out even smaller than what was specified in the a priori power analysis. This raises the question of whether the format manipulations used in this study have any meaningful effect on measures of accuracy.

Third, an explanation for these generally weak effects which has not yet been mentioned is that, similar to prior research (Padgett & Ilgen, 1989; DeNisi, Robbins, & Cafferty, 1989), raters were allowed to take notes as they went through the simulation. The idea behind this was to make the experience less taxing on subjects' memories, as several raters in the pilot sample made it clear that they did not want to be in a "testing" environment. However, it is possible that such note taking also served to level or minimize differences between the conditions. Based on the author's observation of the manner in which raters completed the project, it would seem that, for many raters, filling in the ratings at the end of the simulation was largely a matter of being able to read one's notes "down" or "across", i.e., by dimension or by person. This may have weakened the differences that might otherwise have been observed between conditions. In the South Carolina research (summarized by DeNisi & Williams, 1988), "medium" effect sizes (Cohen, 1988) have typically been observed as a result of their person- versus task-blocking manipulations. In that research stream, however, much more attention has been given to recall issues than was done in the present research. Note taking has obviously been excluded from such studies (with the exception of DeNisi et al., 1989). It is worth further investigation whether a concern for both recall and rating in the same study produces stronger effects for such manipulations than was found in the present study (cf. Figure 2).

In conclusion, then, there was some general advantage to rating by dimension in terms of accuracy. However, results were not particularly supportive of the hypotheses for the specific Cronbach accuracy measures. Thus, this study does little to answer the question of whether these different formats aid or hinder raters in making intra- versus inter-ratee discriminations (cf. p. 23 above).

The Process Variables

Type of Search. It was also expected that the between-subject manipulations would influence the type and amount of search undertaken by subjects, at least for those with advance knowledge of the format they would be using. Such was not the case. Across the four conditions, the mean number of person-blocked transitions was 15.2; the mean number of dimension-blocked transitions was 4.6; and the mean number of nonblocked transitions was 6.6. This clearly supports earlier findings that raters have natural tendencies to search for information by person (DeNisi et al., 1983; Cafferty et al., 1986).
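For readers unfamiliar with the Payne (1976) measure referred to above, the sketch below illustrates one common way such a type-of-search index can be computed from the ordered sequence of information requests: transitions that stay with the same ratee count as person-blocked, transitions that stay with the same dimension count as dimension-blocked, and the index is their normalized difference, so that +1 indicates purely person-blocked search and -1 purely dimension-blocked search. The scoring details and the example sequence are illustrative assumptions; the exact formulation used in this study is the one described in Chapter 3.

```python
def payne_search_index(accesses):
    """Type-of-search index in the spirit of Payne (1976).

    accesses: ordered list of (ratee, dimension) information requests.
    Returns (index, person_transitions, dimension_transitions). Transitions
    that change both ratee and dimension ("nonblocked") are not counted.
    """
    person = dimension = 0
    for (r1, d1), (r2, d2) in zip(accesses, accesses[1:]):
        if r1 == r2 and d1 != d2:
            person += 1        # stayed with the same ratee
        elif d1 == d2 and r1 != r2:
            dimension += 1     # stayed with the same dimension
    blocked = person + dimension
    index = (person - dimension) / blocked if blocked else 0.0
    return index, person, dimension

# Illustrative sequence only: a rater who works through two ratees one at a
# time, then compares three ratees on a single dimension.
sequence = ([("Pat", d) for d in range(1, 7)] +
            [("Kim", d) for d in range(1, 7)] +
            [("Pat", 2), ("Jody", 2), ("Lynn", 2)])
print(payne_search_index(sequence))   # positive index: person-oriented search
```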
However, there was no indication that those who knew they would rate by dimension changed their search pattern to fit a dimension—blocked format, or that those knowing they would rate by person searched more strongly by person. In fact, there seemed to be a small minority (18%) fairly evenly spread across the conditions who consciously chose to search for information by dimension. This group was neither more nor less accurate in their ratings than the majority who searched for information by person. Further, the congruence of search style and format type was also not predictive of rater accuracy. Overall, then, the manipulations of type of format and prior knowledge of format had no impact on the type of search raters engaged in. 140 It is possible that the prior knowledge manipulation was too subtle to dislodge raters from their natural search tendencies. However, the effects of the prior knowledge manipulation on measures of halo and accuracy described above would indicate that this manipulation did have some effect on raters. What it did not do, however, was cause them to change their search strategies. Follow-up interviews with selected subjects from the primary study will be conducted at a later date to seek to explain why this occurred. Amount of Search. 'Two significant flaws in the design.of this study were that a) there were not enough performance incidents, and b) not enough.search was allowed to adequately test the hypothesized.relationship between type of format, prior knowledge, and amount of search. For example, as a target stimulus, DeNisi et a1. (1983) used performance incidents for four workers on four tasks, with four incidents available for each worker on each task (64 total incidents). Thus, they'had a depth of search.per task aspect to their design whichuwas lacking in.the present study. It would have been practically difficult to come up with more performance incidents which matched the performance dimensions and performance levels required in this study, but at.a minimum, it seems that there should have been two performance incidents available for each ratee on each dimension (for 72 total incidents). With. more incidents available, and a different search constraint, differences in amount of search would have been more likely to surface. Another difference between this study and previous research is that, in previous work (DeNisi et a1., 1983; Kozlowski & Ford, 1991), raters were only asked to make ratings of overall performance, whereas in this 141 study, raters were asked to evaluate ratees on each dimension, as well as provide an overall rating. With no prior knowledge of ratee performance (as was available in Kozlowski & Ford, 1991), subjects needed all the information they could get in order to make informed ratings on each dimension. Also, as has been discussed above, the imposed search constraint served as an additional limit on the variability of search. Although the search constraint was intended to be "moderate", when combined with the factors just described, it seemed to constrain rater variability of search far more than was intended. Thus, amount of search did not get a fair test in the current research. Future research will have to address whether knowledge of format influences the amount of search undertaken. In summary, the manipulations of type of format and.prior knowledge of format had.a strong impact on measures of halo, some impact on measures of accuracy, and no discernable impact on the process variables. 
Next, the hypotheses concerning the within-subject manipulations are discussed.

Hypotheses Concerning Type of Performance Dimensions

The second broad question guiding this research was whether experienced raters make use of information concerning both in-role and extra-role performance when making appraisal ratings. Support for this notion can be found in several findings from this study. First, as one might expect, the regression analyses using the contrast-coded variables on the overall ratings (Table 4), as well as on the 36 ratings of ratee performance on particular dimensions (Table 6), revealed that the dominant influence on both types of ratings was in-role performance. It is both logically and legally defensible that such behaviors are a primary determinant of performance ratings (Werner, 1992). However, extra-role performance and the interaction of in-role and extra-role performance had significant beta weights in both regression equations. These were, in fact, much smaller than the beta weights for in-role performance. Also, the impact of extra-role performance was much smaller in this study than was found by MacKenzie et al. (1991). Nonetheless, the extra-role manipulations explained statistically significant amounts of rating variance beyond that explained by the in-role performance manipulations.

Further support for the idea that experienced raters are concerned with information about both in-role and extra-role performance comes from additional data which were collected for each ratee in the study. The computer program used in this study recorded four items of information which are relevant here: a) the amount of search engaged in for each ratee; b) the amount of search engaged in for each dimension; c) the order in which subjects sought information concerning the six ratees; and d) the order in which subjects sought information concerning the six performance dimensions. Both amount of search by ratee and by dimension and order of search by ratee and by dimension give some sense of the relative importance of these various targets to this sample of supervisors. As mentioned above, every time the computer simulation was run, a different order appeared for both ratees and dimensions, so any differences in search should be because of intentional choices by the subjects, rather than because of any order effects inherent in the presentation of information. Results for these variables are found in Table 17.

Table 17
Amount and Order of Search, by Ratee and by Dimension

                                       Amount of Search        Order of Search*
Ratee                                  Mean (Rank)    S.D.     Mean (Rank)    S.D.
Pat (high, high)                       4.34 (5)       1.04     3.62 (5.5)     1.71
Chris (high, ave.)                     4.63 (4)       1.47     3.58 (4)       1.68
Terry (ave., high)                     4.32 (6)       1.23     3.50 (2)       1.81
Kim (ave., ave.)                       4.70 (3)       1.55     3.24 (1)       1.61
Jody (low, high)                       4.78 (2)       1.40     3.57 (3)       1.87
Lynn (low, ave.)                       4.97 (1)       1.11     3.62 (5.5)     1.79

                                       Amount of Search        Order of Search*
Dimension                              Mean (Rank)    S.D.     (Rank)
1) Job Knowledge & Accuracy            5.35 (1)       1.45     (1)
2) Productivity                        5.21 (2)       1.34     (3)
3) Dependability/Attendance            5.02 (3)       1.53     (2)
4) Following Policies & Procedures     3.71 (6)       1.95     (5)
5) Cooperation & Teamwork              4.53 (4)       1.65     (4)
6) Extra Effort/Initiative             3.91 (5)       1.84     (6)

* Lower values for order of search indicate earlier search.

The values for amount of search by ratee are the same as those presented in Table 11. A repeated measures MANOVA using amount of search as the repeated measure was statistically significant (F = 3.69, p < .01, partial eta² = .03).
Post hoc comparisons using Tukey's Honest Significant Difference (HSD) test revealed that the amount of search for Pat and Terry was significantly lower than that for Lynn (Glass & Hopkins, 1984). In the right columns of Table 17 are the values for order of search by ratee. A repeated measures MANOVA using order of search 3 as the repeated measure revealed no statistically significant differences. This is desirable, because it indicates that there were no order effects for ratees. A totally random order of viewing ratees would have given all ratees values of 3.5. As can be seen, the obtained order of search values were all quite close to this. The bottom half of Table 17 contains the values most relevant to the current discussion. There was considerable variation in the amount of search by dimension. A repeated measures MANOVA using amount of search as the repeated measure was highly significant (F - 17.24, p < .001, partial etaz - .13). Subjects were most interested in information on the two in— role dimensions, and least interested in information concerning "Following Policies and Procedures” . Tukey's HSD test revealed that dimensions 1 and 2 differed significantly from dimensions 4, 5, and 6; dimension 3 differed significantly from dimensions 4 and 6; and dimension 4 differed significantly from dimension 5 (see Table 17). Finally, the lower right columns display the order of search by dimension. On average, these supervisors looked first at information concerning job knowledge and accuracy, attendance, and productivity. 145 .After this, their order of search tended to be: cooperation and teamwork, following policies and procedures, and then extra effort and initiative. A repeated measures MANOVA using order of search as the repeated measure was statistically significant (F - 11.72, p < .001, partial etazl- .09). Tukey's HSD test revealed that dimensions 1, 2, and 3 differed significantly from dimensions 4 and 6, and dimensions 1 and 3 differed significantly from dimension 5. Looking at the raw values and rankings for the dimensions, the combined results from the analyses for amount and order of search would indicate that, for this job, subjects focused most on the top three dimensions (job knowledge, productivity, and attendance). There was a fairly clear distinction between the values for these three dimensions and the last three dimensions. Once again, this is not an order effect, as different subjects saw the dimensions presented in different orders. This finding is interesting, in that attendance and following policies and.procedures were both intended to capture L. Williams' (1988) construct of "organizational citizenship behavior directed toward the organization" (OCBO). Similarly, these two dimensions correspond to Organ's (1988b) Conscientiousness construct. Yet, as Organ (1988b) also noted, such dimensions "straddle the boundary" between in—role and extra- role performance. For the position of secretary at this university, attendance appeared very much.to be viewed.as an in-role behavior, whereas following policies and procedures (or "rule compliance", in Organ's terms) was treated.more like the other extra-role dimensions. 
Thus, despite the fact that performance levels for inrrole behavior and OCBO were yoked together, and that performance levels for the OCBI (or Altruism) 146 dimensions of teamwork and extra effort were manipulated independently of the other dimensions, subjects in this study treated rule compliance differently than the first three dimensions, and.more like the latter two (OCBI) dimensions. This provides evidence of the “boundary straddling" nature of these OCBO behaviors (Organ, 1988b). The above analyses can also be viewed as evidence of the problems inherent in Organ's (1988b; 1990a) use of the terms "in-role" and "extra- role" behavior. As Organ would admit (1988b), these terms are imprecise and hard to pin down. Further research is needed to help clarify these terms. Two recent developments in the organizational behavior literature are useful in this regard. First, in the last few years, there has been a resurgence in interest in the use of personality traits in personnel selection (cf., Hollenbeck, Brief, Whitener, & Pauli, 1988; Day & Silverman, 1989). Recently, three empirical articles have appeared in the literature which compared 'various personality' dimensions 'with. a ‘number of' criterion measures (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Tett, Jackson, & Rothstein, 1991; Barrick & Mount, 1991). Hough et a1. (1990) used six personality dimensions (from Hogan, 1982) with a large military sample, and found that the Conscientiousness subscale of their Dependability dimension was unrelated to a measure of technical proficiency (r - .02), but significantly related to a measure of personal discipline (r - .23, uncorrected for unreliability, p < .01). Tett et a1. (1991) conducted a meta—analysis using quite restrictive criteria for inclusion in their study. For the Conscientiousness dimension, they found a mean correlation with criterion measures of .12 (r - .18, corrected for unreliability). 147 Using a much larger sample of studies and subjects, Barrick and Mount (1991) found slightly stronger results for the Conscientiousness dimension (r - .13; when corrected for unreliability, r - .22). Of interest in Barrick and Mount (1991), however, is that validity coefficients for Conscientiousness were stable across five occupational categories and three types of criterion measures. Further, the validity coefficients for Conscientiousness were consistently the highest of'any'of the personality dimensions they measured (including Extraversion, Emotional Stability, Agreeableness, and Openness to Experience). Thus, from all three of the above studies, it would seem that Conscientiousness is important to the accomplishment of important work tasks in all jobs (Barrick & Mount, 1991). FUture research on organizational citizenship behavior must take note of the research just presented. It is likely that Conscientiousness will demonstrate a moderate, but statistically significant relationship with "inerole" behavior. Such an interrelationship must be considered as Organ's theoretical work is refined and expanded in the future. A second recent development in the organizational behavior literature was the publication of'a chapter on job design and.roles in.the latest edition of the andboo of dust a1 0 ani atio a (Ilgen & Hollenbeck, 1991). Ilgen and Hollenbeck (1991) documented how two relatively non-overlapping literatures have developed in the past, one focusing on "jobs" (i.e., job analysis and design), and the other on "roles" (e.g., role conflict and ambiguity). 
These authors sought to integrate these two related literatures, and presented a theory of job-role differentiation.

Ilgen and Hollenbeck (1991) defined a job as a set of task elements grouped together under one job title and designed to be performed by one individual. They referred to the "designed" or "official" tasks of the job as established task elements. In addition to these established task elements, there are also emergent task elements. These emergent task elements develop as the individual seeks to carry out his or her job in a particular work setting. These emergent task elements are more subjective, personal, and dynamic than the established task elements, and are also specified by more social sources than are the established task elements (Ilgen & Hollenbeck, 1991). Roles, then, are larger sets of task elements, which contain both established and emergent task elements. Thus, two individuals working on the same job may have very different roles, because of differences in skills, personality, tenure, etc.

It can be seen that what Ilgen and Hollenbeck (1991) referred to as established task elements is very similar to Organ's (1988b) concept of in-role behavior, whereas the emergent task elements bear some similarity to extra-role or citizenship behaviors. The value of Ilgen and Hollenbeck's approach, however, is that they discussed how a task element which begins as a part of a larger "role set" can over time be incorporated into the formal job itself, e.g., as job analysts seek to codify (or institutionalize) what is done on a given job. Thus, instead of speaking of in-role behaviors (as Organ and this research have done), it is more precise to speak of "in-job behaviors", i.e., those task elements which are clearly spelled out in a job specification. What this research has called extra-role behaviors would then be more properly labelled "extra-job behaviors". Such extra-job behaviors may or may not be expected for a given role, but again, as such role expectations become institutionalized through the job analysis procedure, these behaviors would cease to be emergent and become established task elements of a given job. Future research in this area must be more precise in defining the terminology used to describe different behaviors, and would profit from incorporating the job-role distinction presented by Ilgen and Hollenbeck (1991).

Overall, then, these supplementary analyses demonstrated that subjects in this study were most concerned with performance on the in-role (or in-job) performance dimensions. The extra-role (OCBI) dimensions were searched less and later than the other dimensions. Nonetheless, it is important to note that this information was sought and used by these supervisors. Anecdotally, no supervisor in this study objected to the use of any of the performance dimensions included in this study. More importantly, the means for teamwork and extra effort were a long way from zero. Given the search constraint imposed in this study, supervisors could have chosen to ignore one or both of the OCBI dimensions. In fact, only four percent of supervisors sought no information on either dimension 5 or 6 (the mean percentage of supervisors who skipped a dimension was three percent across all six dimensions). Instead, most supervisors sought out information on these dimensions and, as the regression analyses discussed above illustrated, extra-role performance explained statistically significant variance beyond that explained by in-role performance.
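One way to formalize the "variance beyond" claim in the preceding sentence is a hierarchical use of the contrast-coded predictors: the in-role terms are entered first, and the increment in R² from adding the extra-role term (and the interaction) is then tested. The sketch below illustrates that logic on invented data; the particular contrast codes, numbers, and variable names are assumptions for illustration only and are not the codes (X4-X6) or the data used in this study.

```python
import numpy as np

def r_squared(predictors, y):
    """R-squared from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()

# Invented illustration: one overall rating per ratee (6 ratees x 116 raters).
rng = np.random.default_rng(2)
in_role  = np.tile([1, 1, 0, 0, -1, -1], 116)    # high / average / low (one plausible coding)
extra    = np.tile([1, -1, 1, -1, 1, -1], 116)   # high versus average OCBI
interact = in_role * extra
ratings  = 4 + 1.2 * in_role + 0.3 * extra + 0.2 * interact + rng.normal(0, 1, in_role.size)

r2_in   = r_squared(in_role.reshape(-1, 1), ratings)
r2_full = r_squared(np.column_stack([in_role, extra, interact]), ratings)
delta   = r2_full - r2_in                        # variance beyond in-role performance
n, k_full, k_added = ratings.size, 3, 2
F_change = (delta / k_added) / ((1 - r2_full) / (n - k_full - 1))
print(f"R2 (in-role only) = {r2_in:.3f}, R2 (full) = {r2_full:.3f}, "
      f"increment = {delta:.3f}, F for the increment = {F_change:.2f}")
```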
So, all in all, these analyses support the notion that experienced raters use both types of information when making performance ratings. Next, the discussion will turn to the specific hypotheses advanced in this study relevant to type of performance dimension.

In-Role Performance

Differential Accuracy. It was expected that subjects would be most accurate in rating the performance of ratees exhibiting high levels of in-role behaviors or IRB (DeNisi & Stevens, 1981; Karl & Wexley, 1989). Instead, when Cronbach differential accuracy measures were calculated separately for ratees exhibiting high versus average and low levels of in-role behavior, it was found that accuracy was worse for the high IRB ratees (p < .001). The analysis of variance procedure presented in Tables 9 and 10 above (Dickinson, 1987; Dickinson et al., 1990) shed light on this, in that subjects in this study were closest to the expert true scores for ratees who were average in their levels of IRB. This is not consistent with Wexley et al. (1972), who found that average performance seemed to present raters with a more ambiguous stimulus than either high or low levels of performance.

One explanation for this difference is that Wexley et al. (1972) conducted their study in the context of selection, whereas in the present study, ratings were made in a performance appraisal setting. Hiring decisions are straightforward for applicants who are clearly high or low in their exhibited (or expected) performance levels. Ambiguity is highest for applicants in the middle range of performance. In contrast, in a performance appraisal rating task, it is easy and common to rate the majority of ratees as average. This holds true even if rating inflation has caused the typical mean rating in an organization to rise above the midpoint of the organization's rating scale. It is harder and less common to make ratings that are either unusually high or low. In fact, at the time this research was conducted, it was standard practice in this organization to require supervisors to put into writing reasons why they were assigning ratings at the extremes of the organization's rating scale. Such documentation was not required for mid-level ratings, leading to rather pronounced central tendencies in the actual ratings assigned.

For the ratings in this study, Table 10 provided the standard deviations by ratee for the orthonormal contrasts used in Table 9. This showed that the standard deviation was lowest, and thus rater agreement was highest, for Kim, the ratee who was average on both IRB and OCBI. Standard deviations were somewhat higher for the two low IRB ratees, and highest for the three "above average" ratees. It is possible that differences of opinion among the raters concerning what rating is appropriate for very high (and very low) performance contributed to the differential accuracy findings reported above. At the least, the findings from this study cast doubt on the idea that high performing ratees will be rated more accurately in a performance appraisal setting. Further research on this issue is needed.

Amount of Search. As predicted, raters searched more for ratees who were average or low on IRB (p < .10, partial eta² = .026). Observing the differences in Table 11, however, one can see that the distinction was primarily between the amount of search for low IRB performers (4.88) versus the amount of search for average and high IRB performers (4.51 and 4.49, respectively).
Also, comparing these results with Table 10, one can see that, contrary to predictions, raters searched more but were less accurate for the average and low IRB ratees combined. However, this is somewhat deceptive, in that these subjects searched for the largest amount of information on Lynn (low IRB, average OCBI), but were also least accurate in rating Lynn's performance. Thus, the relationship here is not straightforward, and needs to be studied further to be better understood.

Extra-Role Performance and Halo. From Organ (1990a), it was hypothesized that observed intercorrelations would be higher for ratees who demonstrated higher levels of extra-role behaviors (i.e., OCBI). Results in Table 12 strongly supported this. Using both the mean and median intercorrelations, halo was over 40% higher for ratees with high OCBI than for ratees with average OCBI levels (p < .001 for both measures). This raises an interesting issue. The descriptive question is whether experienced raters use both in-role and extra-role information when making performance ratings. The results from this study, as well as previous research (MacKenzie et al., 1991; Orr et al., 1989), strongly support the idea that raters use both types of information. The normative question is whether they "ought to" be using such information, i.e., is extra-role information contributing to halo error, and thus something to be minimized or controlled in the rating process, or is it a valuable but neglected source of information, which should be included in the rating process to explain more rating variance? In this study, halo was considerably higher for the high OCBI ratees. It was hoped that the issue of whether this was error or not would be answered by the results for the two accuracy measures presented below. Unfortunately, a clear answer does not emerge from these results.

Accuracy. Table 13 revealed that, like the effects for halo, stereotype accuracy was worse for ratees high in OCBI. This was a rather large effect (p < .001, partial eta² = .084), and would seem to indicate that the presence of clearly favorable extra-role information led raters to be less accurate in rating the performance levels intended for these dimensions. However, the results for differential accuracy were just the opposite, and even stronger in magnitude. The ability of raters to match the true score ratings for each ratee on each dimension was much worse for the average OCBI ratees (.53 for high OCBI, .74 for average OCBI). Partial eta² for this difference was .65. It is not clear why this effect was so strong in favor of the high OCBI ratees. It does, however, leave open the question of whether the halo effect reported above is necessarily a "bad" thing. Future research needs to address the normative issue of whether extra-role information should be included in the performance appraisal rating process.

Interaction Between In-Role and Extra-Role Performance

It was expected that when the level of performance demonstrated by ratees was inconsistent between the in-role and extra-role dimensions, raters would search more for the inconsistent ratees, but be less accurate on a measure of differential elevation (Padgett & Ilgen, 1989). Neither of these hypotheses was supported in the present study. Concerning amount of search, subjects actually searched for somewhat more information on the consistent ratees, although this difference was small and not statistically significant.
Based on comments made to the researcher by numerous subjects after they completed the project, there is anecdotal evidence that many subjects were very aware of the inconsistencies included in the ratee composites (particularly for Jody, the low IRB, high OCBI ratee). Yet, there is no indication that such awareness affected any of the measures collected in this study. It would seem that amount of search was affected more by the level of in-role performance (or overall performance) than by the consistency of the performance information available.

Unlike Padgett and Ilgen (1989), there was also no effect for consistency of information on differential elevation. In their study, Padgett and Ilgen (1989) were not concerned with issues of in-role versus extra-role performance dimensions. Also, Padgett and Ilgen provided raters with between eight and twelve performance incidents for each ratee on each performance dimension. Thus, inconsistency of performance may have been more salient in their study than in the present one. In any case, in the present study, the ability to rank order ratee performance was not affected by the consistency of information. It should be noted that Table 9 did indicate a moderate in-role by extra-role interaction using orthonormal contrasts between the true scores and the subjects' ratings (partial eta² = .056). As Table 6 indicated (and Figure 15 demonstrated), this was largely because subjects rated Lynn much higher than the true scores on the in-role dimensions, and Kim lower than the true scores on the extra-role dimensions. Thus, despite the lack of findings for Hypotheses 7a and 7b, it is expected that future research will detect meaningful in-role by extra-role interactions, and that the accuracy of ratings will be affected by the consistency of performance information.

Hypotheses Concerning Method of Rating, OCBI, and Accuracy

Two hypotheses were put forward in this study concerning type of format, level of OCBI, and accuracy. Specifically, for stereotype and differential accuracy, it was expected that there would be two main effects and no interaction for each dependent variable. Subjects were expected to be more accurate when rating by dimension, and when rating the ratees with neutral levels of OCBI. Results did not support either hypothesis. For stereotype accuracy, there was a main effect as predicted for level of OCBI, but no main effect for type of format. Also, the interaction between method of rating and level of OCBI was not significant. For differential accuracy, there was no main effect for type of format, a large main effect for level of OCBI (in favor of the high OCBI ratees), and no interaction. Obviously, for both dependent variables, the lack of significant effects for the between-subjects manipulations (Hypotheses 2b and 2c) removed any likelihood of finding significant effects consistent with Hypotheses 8a and 8b. Before further research is conducted in which both type of format and type of performance dimension are manipulated simultaneously, it is imperative that it first be established that type of format has a meaningful impact on such dependent variables as were used in this study. Until then, tests such as those just described are premature.

Strengths, Weaknesses, and Directions for Future Research

This chapter will conclude with a discussion of strengths and weaknesses of the current study, general conclusions that can be drawn from this study, and future research needs in this area.

Strengths and Limitations of the Study

Strengths.
A number of strengths can be highlighted in the way this research was designed and carried out. First, while most earlier research has studied either halo (e.g., Jacobs & Kozlowski, 1985) or accuracy (e.g., Murphy & Balzer, 1986), this study measured both halo and accuracy in attempting to answer substantive questions concerning type of format and type of performance dimension. It is true that the findings for halo versus accuracy were often discrepant, and that this added considerable ambiguity to the interpretation of the study's results. However, if only one measure had been used as a primary dependent variable (i.e., either halo or accuracy), this would have led to faulty conclusions concerning the magnitude and direction of the study's findings. So, despite the increased ambiguity brought on by using both variables, this is preferable to relying on one versus the other. Indirectly, this study lends support to Murphy and Balzer (1989), who argued that halo and accuracy are only weakly related.

A second strength of this study was the linking together of several distinct research streams to measure rater search processes and accuracy simultaneously. The conceptual background for this study was drawn largely from the University of South Carolina stream (DeNisi & Williams, 1988); the accuracy dependent variables were drawn from Cronbach (1955) and Murphy et al. (1982); finally, the process variables were drawn from Ford and his colleagues (Ford et al., 1989; Kozlowski & Ford, 1991). Although the results in this study did not come out as intended for most of the hypotheses concerning type and amount of search, it is nonetheless desirable to measure such process variables. Future research should continue to measure such variables, in hopes of better understanding the (cognitive) reasons why raters make the ratings they do (Landy & Farr, 1980; Ilgen et al., in press).

Third, this study was able to get beyond the "trait versus behavior" dilemma by measuring both in-role and extra-role performance in terms of behavioral critical incidents. This solved the measurement problems experienced by DeNisi and Summers (1986), and also the apparent confound in Orr et al. (1989), where in-role performance was measured in terms of behaviors while citizenship or extra-role performance was measured in terms of traits. Describing all performance dimensions in behavioral terms is a practical advance that should be utilized in future research.

A fourth strength of this study was the use of a computerized information board to easily and efficiently collect data from experienced raters in a large organization. It took most subjects less than an hour to complete all aspects of this project. Use of computerized simulations should definitely continue in the future. Also, the fact that such an experiment could be carried out with raters possessing almost ten years of supervisory experience lends strong practical support to the finding that experienced raters used both in-role and extra-role performance information when making their ratings.

Limitations. This study was also not without its limitations. Several of these have been discussed above. In this section, six limitations or problems will be discussed, in terms of the way this research project was designed or carried out. An obvious weakness in this study concerned the search constraint. Subjects were asked to make 42 ratings, i.e., rating all six ratees on all six dimensions, plus rating the overall performance of each ratee.
Subjects were told that there was only one item of information available for each ratee on each dimension (36 total), and that they would be able to access 28 of those items. This scenario clearly contributed to the lack of variance concerning amount of search, where the mean amount of search was 27.7 items, with a distribution that was strongly negatively skewed. It is likely that this constraint contributed as well to the halo and accuracy results described above. Since ratings of each ratee on each dimension are needed for the Cronbach accuracy measures, future research should have more items available for each ratee on each dimension (e.g., DeNisi et al., 1983; Padgett & Ilgen, 1989). Also, any search constraint which is used in the future should be pilot tested to ensure that it does not serve as an artificial ceiling limiting variance in the amount of search undertaken by subjects.

A second critical flaw in this study was the failure to note the inherent conflict between the proposed design and method of data analysis and the structure of the primary (Cronbach) dependent variables. A "subjects within groups by conditions" design (Cohen & Cohen, 1983) was proposed, to be tested using hierarchical regression. Unfortunately, the orthogonal, contrast-coded variables for the within-subjects factors were not capable of explaining any variance in the Cronbach measures, since these measures produce single accuracy values for each subject, i.e., one elevation value per subject, one differential elevation value per subject, etc. As presented in Chapter 4, other analyses were conducted which better matched the hypotheses of this study. However, these other data analytic approaches carried with them their own advantages and disadvantages. In particular, the Dickinson MANOVA approach described above (Dickinson, 1987; Dickinson et al., 1990) required the use of a split-plot factorial design (Kirk, 1982). This design is particularly powerful in detecting effects for within-subjects manipulations, but is less powerful in detecting effects for between-subjects manipulations (Kirk, 1982). This summary statement from Kirk (1982) corresponds directly to the power levels observed for the variables presented in Tables 7 and 9, i.e., observed power was extremely low for the between-subjects manipulations, and considerably higher for the within-subjects manipulations. Future research will need to take this into consideration, specifically in determining the proper number of subjects per condition to detect effects for type of format, if such effects exist (again, this experiment did not establish that there is anything more than a very small effect for person- versus dimension-blocked formats).

Although not as severe, a third limitation of the present study was that it assumed a fixed-effects model, i.e., conclusions about the manipulations apply only to the treatment levels used in this experiment (Kirk, 1982). This is not so important for type of format, prior knowledge of format, or for level of ratee performance. It is more of an issue, however, for the dimensions chosen in this study. These dimensions were deemed meaningful to this organization and sample (e.g., Table 1), and adequately captured the constructs of in-role and extra-role performance drawn from previous research (Organ, 1988b; Williams, 1988). However, it is possible that results would vary with the use of other dimensions, or other conceptualizations of in-role and extra-role performance. As recommended by Ilgen et al.
(in press), future process or cognitively-oriented performance appraisal research must also deal with content issues. The current study takes steps in this direction, but needs to be combined with and followed by considerably more related research (cf. McDonald, in press).

A fourth possible limitation of the current research concerns the quality of the true scores generated by the subject matter experts. It is possible that some of the weakness of results for the accuracy measures was due to deficiencies in the true scores. For example, the subject matter experts made almost no use of the extreme values on the rating scale (1 and 7). They were, however, stricter than the primary sample in rating the in-role performance of Jody and Lynn (the low IRB performers). It is possible that, despite extended opportunity to evaluate the critical incidents, the 15 subject matter experts were not as "expert" as desired in making such ratings. Unlike some previous research (Karl & Wexley, 1989; Padgett & Ilgen, 1989), expert raters in this study were not provided rater training prior to the rating task. Such training is recommended in future research.

A fifth point is related to the above, and concerns whether the deviation of the true scores from the intended target levels of performance may have weakened overall study results. As Table 3 indicated, subject matter ratings deviated markedly from desired levels for four of the 36 ratings. However, two points can be made which would indicate that these deviations did not substantially affect the results of this study: a) such deviations should have had their strongest effect on the within-subjects manipulations of in-role and extra-role performance, yet results in this study were strongest for precisely these variables; b) the subject matter experts best captured the intended levels of performance for ratees Pat, Chris, and Lynn, yet, turning to the results in Table 10, the accuracy of these orthonormal contrasts was worst for these same three ratees. It does not seem, then, that accuracy (or the lack thereof) can be explained by the deviations of the true scores from intended target levels of performance.

A final limitation of this study is self-evident by this point in the manuscript: the design of this study was too "busy," with many complex and interrelated hypotheses. Interpretation of the results was often difficult because of all the different things going on in this study. As noted above in the discussion of Hypotheses 8a and 8b, it is expected that future research will proceed more fruitfully when type of format and type of performance dimension are first studied separately. Also, unless a better way can be found to concurrently measure the effects of level of performance (Hypothesis 5) and consistency of performance information (Hypothesis 7), these variables should not be tested in the same study.

General Conclusions From This Study

Having laid out both the strengths and weaknesses of this research project, discussion will now turn to what can be learned from the results of this study. The discussion in this section will focus on general conclusions related to the broad questions presented at the outset of this chapter.

Type of Format. In terms of rater accuracy, there was a slight advantage to making ratings by dimension. Of the three correlational measures of accuracy, the only significant main effect for type of format was for the differential elevation correlation (DECORR), in favor of rating by dimension.
Three of the four distance score accuracy formulations had significant main effects in favor of rating by dimension; only stereotype accuracy had a significant main effect in favor of rating by person. The overall distance score accuracy (Cronbach, 1955) also demonstrated a statistically significant main effect for rating by dimension. Unfortunately, these effects were very small in magnitude. None of these effects uniquely explained more than 2% of the rating variance.

Such small effect sizes can be viewed in two different ways. First, since a review of the relevant literature indicated that prior research had demonstrated small to medium effect sizes (Cohen, 1988) for similar manipulations, one might argue that the design limitations noted above served to constrain the size of the effects observed in this study. This may in fact have occurred, which leaves open the possibility that future research which is more specifically focused on format issues will demonstrate effects of greater practical significance. Also, as noted above, it is likely that studying the effects of recall and rating together will increase the effect sizes observed (DeNisi & Williams, 1988). The second way to view such small effect sizes is to conclude that this is all that such manipulations are capable of producing. Ilgen et al. (in press) reviewed over 50 research articles under the heading "Performance appraisal process research in the 1980s," and concluded that cognitive processes have accounted for only a limited amount of variance in appraisal ratings. It may be that the results of the current research provided (unwelcome) support for the somber conclusions of Ilgen et al. (in press), and that the enthusiasm of the past decade for studying cognitive processes in performance appraisal has been stronger than the results to date would indicate is warranted. In the author's opinion, it is too soon to curtail such process research in the area of performance appraisal. It is hoped that, as both content and process issues are studied in future research, more substantive results will be forthcoming as well. Still, the cautionary note from Ilgen and his colleagues should be heeded. Time will tell if this line of research has "reached a point of diminishing returns" (Ilgen et al., in press).

A final issue to mention in this section concerns the purpose of the performance appraisal rating. DeNisi et al. (1984) and others have discussed this as an important variable influencing appraisal ratings. As noted in Chapter 1, it was hoped that the pattern of results in this study would have practical implications for different appraisal purposes, i.e., if a correct ranking of employees was desired for a promotion decision, then rating by person could be recommended, but if developmental feedback for employees was needed, then rating by dimension would be viewed as superior. Clearly, the direction of results for the hypotheses related to type of format was sufficiently discouraging that no implications can be drawn concerning format and the purpose of appraisal. Purpose of appraisal is an important content issue that must be considered in performance appraisal research (Ilgen et al., in press). Sadly, the current research did not make the contribution expected concerning the relative effectiveness of different appraisal formats for making within- versus across-ratee discriminations (DeNisi & Williams, 1988).

Type of Performance Dimension.
In contrast to the findings concerning type of format, the results for type of performance dimension were considerably stronger and more robust. This sample of experienced raters focused most on information concerning ratee behaviors which can be classified as in-role (or in-job; Ilgen & Hollenbeck, 1991), i.e., those behaviors most closely associated with the narrow job duties found in most job descriptions. These behaviors were also the dominant influence on the appraisal ratings made by these raters. However, the two citizenship (OCBI) dimensions of cooperation and extra effort were also used by these raters. Such extra-role (extra-job) behaviors explained small, but statistically significant, amounts of rating variance, as did the interaction of in-role and extra-role performance. This clearly supports Organ's (1988a) contention that practicing managers view performance as more than simply in-role behaviors. Even within the confines of this relatively controlled laboratory study, there was evidence that raters were interested in extra-role dimensions which "give fair credit in a general, global sense for many other forms of OCB" (Organ, 1988b). In this context, it would seem that there is merit in an appraisal system which focuses most on job-relevant behaviors, but also gives the rater an opportunity to evaluate broader trait- or citizenship-oriented dimensions as well. Interestingly, this is precisely the direction in which this university is heading with its newly implemented appraisal system for staff employees.

Overall, then, this study supports the findings from Orr et al. (1989) and MacKenzie et al. (1991) that managers use both in-role and extra-role information when making appraisal ratings. It is argued that the current study provides the best direct test of this hypothesis to date, since Orr et al. (1989) asked raters to make utility estimates, and MacKenzie et al. (1991) used a narrow, sales commission-oriented measure to tap "objective" performance. In this study, the effects for extra-role performance were not as strong as those observed by MacKenzie et al. (1991). They are, however, more likely to parallel what would be found in an actual appraisal setting, where raters must somehow simultaneously account for both aspects of performance when making their ratings.

The question still remains as to whether the halo effect observed for ratees with high levels of OCBI is error or not, and thus whether bringing extra-role information more explicitly into the appraisal process is desirable or not. Organ (1988b) would argue for such a broadening of the appraisal domain; other researchers would clearly disagree. It is the author's opinion that such extra-role information should be included in the appraisal process, since such information explains relevant rating variance, and is desirable for the effective functioning of the organization as a whole. Of course, there will always be a tension here, since once something is rated, it may no longer be "extra-role," i.e., it may become an expected job/role requirement (e.g., obligatory attendance at social functions). However, if the extra-role dimensions remain at the level of generality recommended by Organ (1988b), such a migration of specific behaviors from one category to the other should be less likely. Hopefully, such questions will be better addressed by future research which, as mentioned above, also incorporates the theoretical work of Ilgen and Hollenbeck (1991).
Directions for Future Research

Since future research ideas have already been raised throughout this chapter, this section will only highlight questions raised by this research, which will hopefully be answered in the future. Questions will be grouped under the headings of type of format, type of performance dimension, process issues, and setting/contextual issues.

Type of Format

** What impact does type of format have on rater accuracy: zero, small, or larger than the effects observed in this study?

** What would happen if a simplified version of this study were run, where raters were not allowed to take notes, but were forced to rely more on their memories (i.e., an emphasis on both recall and rating)?

Type of Performance Dimension

** Do these findings for in-role versus extra-role performance generalize beyond this setting, these ratees, and these dimensions?

** Should extra-role behaviors be included in the formal performance appraisal process, i.e., is this adding error, or explaining relevant variance in the process?

** Does level of performance impact the accuracy of ratings in an appraisal (in contrast to a selection) setting?

Process Issues

** Why didn't the raters given prior knowledge of the format they would be using adapt their search pattern to fit that format?

** What results would be obtained in a replication where there was more (and less constrained) search allowed?

Setting / Contextual Issues

Ilgen et al. (in press) discussed the literature in this area under three broad headings: a) acquisition of information, b) organization and storage of that information, and c) retrieval, integration, and evaluation of that information. Additionally, under each heading, they discussed the relevant literature as it related to four sources of variation, i.e., variance due to ratees, raters, rating scales, and the setting or context in which the appraisal took place. The current study has emphasized issues of information acquisition, as this related to the subsequent ratings given. Manipulations were made of ratee levels of performance and rating scale format, and rater search patterns were measured as well. The one source of variation not addressed at all in this research was the setting in which appraisal takes place. Padgett (1988) and others (Longenecker et al., 1987) have documented the importance of these variables as well. For example, Padgett (1988) found that many of the raters in her study inflated the actual ratings given to their subordinates beyond their "real" ratings for these employees (collected at a later time, in confidence, by the researcher). Further, this rating inflation could be predicted by raters' beliefs about their ability to be open and honest when rating their subordinates. Padgett (1988) referred to this as rater motivation to rate accurately. Longenecker et al. (1987) discussed similar processes in terms of the "political" aspects of performance rating. Whatever the label, such processes are not issues of cognitive processing (Ilgen et al., in press), yet they are extremely salient in most applied appraisal settings. Future research must deal with these political or motivational issues as well. It may be, as Longenecker et al. (1987) argued, that all the emphasis in our field on rater accuracy has been misplaced, i.e., what if raters are able, but not motivated, to rate accurately, due to organizational or other contextual factors? As Ilgen et al.
(in press) suggested, performance appraisal research should advance more rapidly if we expand the types of variables included in our models of the appraisal process. We still do not know the extent to which accuracy is jointly influenced by both the rater's ability and motivation to rate accurately. Future research needs to address such issues.

APPENDIX A

Formulas Used to Calculate Accuracy and Error Scores

Accuracy (Cronbach). For mathematical reasons, Cronbach (1955) utilized the squared differences between subject ratings and true score ratings. The overall measure of rater accuracy, $D^2$, represents the squared difference between subject ratings ($x$) and true scores ($t$), averaged across $n$ ratees and $k$ dimensions:

$$D^2 = \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - t_{ij})^2$$

This overall measure can be broken down into four components: elevation (E), differential elevation (DE), stereotype accuracy (SA), and differential accuracy (DA). The last three components can be expressed in terms of squared differences, as well as in terms of variances and correlations. It is thought that the two forms of measuring DE, SA, and DA carry unique information about each of these types of accuracy (Becker & Cardy, 1986; Sulsky & Balzer, 1988). These formulas are:

$$E^2 = (\bar{x}_{..} - \bar{t}_{..})^2$$

$$DE^2 = \frac{1}{n} \sum_{i} \left[ (\bar{x}_{i.} - \bar{x}_{..}) - (\bar{t}_{i.} - \bar{t}_{..}) \right]^2, \quad \text{or equivalently,} \quad DE^2 = \sigma^2_{\bar{x}_{i.}} + \sigma^2_{\bar{t}_{i.}} - 2\,\sigma_{\bar{x}_{i.}}\sigma_{\bar{t}_{i.}}\, r_{\bar{x}_{i.}\bar{t}_{i.}}$$

$$SA^2 = \frac{1}{k} \sum_{j} \left[ (\bar{x}_{.j} - \bar{x}_{..}) - (\bar{t}_{.j} - \bar{t}_{..}) \right]^2, \quad \text{or equivalently,} \quad SA^2 = \sigma^2_{\bar{x}_{.j}} + \sigma^2_{\bar{t}_{.j}} - 2\,\sigma_{\bar{x}_{.j}}\sigma_{\bar{t}_{.j}}\, r_{\bar{x}_{.j}\bar{t}_{.j}}$$

$$DA^2 = \frac{1}{nk} \sum_{i} \sum_{j} \left[ (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}) - (t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}) \right]^2, \quad \text{or equivalently,} \quad DA^2 = \sigma^2_a + \sigma^2_b - 2\,\sigma_a\sigma_b\, r_{ab}$$

where $a = x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}$ and $b = t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}$; $x_{ij}$ and $t_{ij}$ are the rating and true score for ratee $i$ on dimension $j$; $\bar{x}_{i.}$ and $\bar{t}_{i.}$ are the mean rating and mean true score for ratee $i$; $\bar{x}_{.j}$ and $\bar{t}_{.j}$ are the mean rating and mean true score for dimension $j$; and $\bar{x}_{..}$ and $\bar{t}_{..}$ are the mean rating and mean true score over all ratees and dimensions. Overall rating accuracy equals the sum of the above four difference scores, i.e., $ACC^2 = E^2 + DE^2 + SA^2 + DA^2$ (see Cronbach, 1955).

Differential Accuracy (Borman). Borman sought to measure a rater's ability to distinguish among ratees on a number of performance dimensions. His formula is:

$$\text{Borman's DA} = \frac{1}{d} \sum_{j=1}^{d} z_{r_j}$$

where $d$ refers to the number of dimensions and $r_j$ refers to the correlation between ratings and true scores for a particular dimension, transformed to a $z$ score ($z_{r_j}$). The per-dimension correlation yields a DA score for each dimension; an overall DA score is then computed by averaging the correlations across dimensions using Fisher's r to z transformation. Borman's DA measure is not equivalent to Cronbach's DA measure, either in Cronbach's distance score or in his variance/correlational formulation (Becker & Cardy, 1986; Sulsky & Balzer, 1988).

Error Measures. Drawing from Saal, Downey, and Lahey (1980), Murphy and Balzer (1989) described six error measures that have been used in past research:

MEDCORR: the median correlation between performance dimensions, over ratees (halo);

VARRAT: the variance of the ratings assigned to each ratee, averaged across ratees (halo);

MEAN: the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint (leniency);

SKEW: the skew of the distribution of ratings over ratees and dimensions (leniency);

SD: the standard deviation of the rating distribution, over ratees and dimensions (range restriction); and

KURT: the kurtosis of the rating distribution over ratees and dimensions (range restriction).
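Because the Cronbach components are easy to misread in prose, the following minimal sketch shows one way the decomposition (and the MEDCORR halo index just listed) could be computed. It is an illustration only, assuming ratings and true scores are held as n-by-k NumPy arrays; the function names (cronbach_components, median_intercorrelation_halo) and the synthetic example data are invented for this sketch and are not taken from the dissertation or its analysis programs.

# Minimal illustrative sketch (not the study's scoring code) of Cronbach's (1955)
# accuracy components and a median-intercorrelation halo index.
import numpy as np

def cronbach_components(x, t):
    """Return E^2, DE^2, SA^2, DA^2, and overall D^2 for n x k ratings x and true scores t."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    xg, tg = x.mean(), t.mean()                    # grand means
    xr, tr = x.mean(axis=1), t.mean(axis=1)        # ratee (row) means
    xd, td = x.mean(axis=0), t.mean(axis=0)        # dimension (column) means

    e2 = (xg - tg) ** 2                                           # elevation
    de2 = np.mean(((xr - xg) - (tr - tg)) ** 2)                   # differential elevation
    sa2 = np.mean(((xd - xg) - (td - tg)) ** 2)                   # stereotype accuracy
    a = x - xr[:, None] - xd[None, :] + xg                        # doubly-centered ratings
    b = t - tr[:, None] - td[None, :] + tg                        # doubly-centered true scores
    da2 = np.mean((a - b) ** 2)                                   # differential accuracy
    d2 = np.mean((x - t) ** 2)                                    # overall squared distance
    return e2, de2, sa2, da2, d2

def median_intercorrelation_halo(x):
    """Observed halo as the median correlation between dimensions, computed over ratees."""
    r = np.corrcoef(np.asarray(x, float), rowvar=False)           # k x k dimension intercorrelations
    return np.median(r[np.triu_indices_from(r, k=1)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true = rng.integers(2, 7, size=(6, 6)).astype(float)          # 6 ratees x 6 dimensions, 7-point scale
    ratings = np.clip(true + rng.normal(0, 0.8, true.shape), 1, 7)
    e2, de2, sa2, da2, d2 = cronbach_components(ratings, true)
    print(e2, de2, sa2, da2, d2)                                  # d2 equals e2 + de2 + sa2 + da2
    print(median_intercorrelation_halo(ratings))

Because the four components are built from orthogonal pieces of the rating-minus-true-score matrix, the printed D^2 value will equal the sum of the four components exactly, which provides a quick internal check on any implementation of these formulas.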
Fisicaro (1988) recommended that the VARRAT measure for halo just described be amended to take into consideration the true halo, or intercorrelation among dimensions. Using standard deviations instead of variances, Fisicaro (1988) first presented a halo measure focusing only on observed halo, as follows:

$$HO_{sd} = \frac{1}{n} \sum_{k=1}^{n} SD_{x_k}$$

where the standard deviation of ratings is computed across dimensions for each ratee, and then averaged across ratees. His "improved" halo measure is as follows:

$$HE_{sd} = \frac{1}{n} \sum_{k=1}^{n} \left( SD_{t_k} - SD_{x_k} \right)$$

This formula takes into account the true intercorrelation among dimensions. Finally, Fisicaro (1988) recommended the use of an absolute measure of halo error, which would reflect an overall tendency to make an error, i.e.,

$$AHE_{sd} = \frac{1}{n} \sum_{k=1}^{n} \left| SD_{t_k} - SD_{x_k} \right|$$

LIST OF REFERENCES

Note 1. Balzer, W.K. Personal communication. November, 1990.

Note 2. Dickinson, T.L. Personal communication. October, 1991.

Note 3. Padgett, M.Y. Personal communication. March, 1991.

Note 4. Cafferty, T.P. Personal communication. May, 1991.

Barrick, M.R., & Mount, M.K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1-26.

Bateman, T.S., & Organ, D.W. (1983). Job satisfaction and the good soldier: The relationship between affect and employee "citizenship." Academy of Management Journal, 26, 587-595.

Becker, B.E., & Cardy, R.L. (1986). Influence of halo error on appraisal effectiveness: A conceptual and empirical reconsideration. Journal of Applied Psychology, 71, 662-671.

Bernardin, H.J., & Pence, E.C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60-66.

Bernardin, H.J., & Walter, C.S. (1977). The effects of rater training and diary-keeping on psychometric error in ratings. Journal of Applied Psychology, 62, 63-69.

Blumberg, H.H., DeSoto, C.B., & Kuethe, J.L. (1966). Evaluation of rating scale formats. Personnel Psychology, 19, 243-259.

Borman, W.C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238-252.

Borman, W.C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410-421.

Brief, A.P., & Motowidlo, S.J. (1986). Prosocial organizational behaviors. Academy of Management Review, 11, 710-725.

Brown, E.M. (1968). Influence of training, method, and relationship on the halo effect. Journal of Applied Psychology, 52, 195-199.

Buford, J.A., Jr., Burkhalter, B.B., & Jacobs, G.T. (1988). Link job descriptions to performance appraisals. Personnel Journal, June, 132-140.

Cafferty, T.P., DeNisi, A.S., & Williams, K.J. (1986). Search and retrieval patterns for performance information: Effects on evaluations of multiple targets. Journal of Personality and Social Psychology, 50, 676-683.

Cantor, N., & Mischel, W. (1977). Traits as prototypes: Effects on recognition memory. Journal of Personality and Social Psychology, 35, 38-48.

Cardy, R.L., Bernardin, H.J., Abbott, J.G., Senderak, M.P., & Taylor, K. (1987). The effects of individual performance schemata and dimension familiarization on rating accuracy. Journal of Occupational Psychology, 60, 197-205.

Cascio, W.F. (1989). Managing human resources: Productivity, quality of work life, profits. New York: McGraw-Hill.

Coffman, W.E. (1971). On the reliability of ratings of essay examinations in English. Research in the Teaching of English, 5, 24-37.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences, Second Edition. Hillsdale, NJ: Erlbaum.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences, Second Edition. Hillsdale, NJ: Erlbaum.

Coker, D.R., Kolstad, R.K., & Sosa, A.H. (1988). Improving essay tests: Structuring the items and scoring responses. Clearing House, 61, 253-255.

Cooper, W.H. (1981). Ubiquitous halo. Psychological Bulletin, 90, 218-244.

Cronbach, L.J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 52, 177-193.

Davis, M.S. (1971). That's interesting! Towards a phenomenology of sociology and a sociology of phenomenology. Philosophy of the Social Sciences, 1, 309-344.

Day, D.V., & Silverman, S.B. (1989). Personality and job performance: Evidence of incremental validity. Personnel Psychology, 42, 25-36.

DeNisi, A.S., Cafferty, T.P., & Meglino, B.M. (1984). A cognitive view of the performance appraisal process: A model and research propositions. Organizational Behavior and Human Performance, 33, 360-396.

DeNisi, A.S., Cafferty, T.P., Williams, K.J., Blencoe, A.G., & Meglino, B.M. (1983). Rater information acquisition strategies: Two preliminary experiments. Academy of Management Proceedings, 169-172.

DeNisi, A.S., Robbins, T., & Cafferty, T.P. (1989). Organization of information used for performance appraisals: Role of diary-keeping. Journal of Applied Psychology, 74, 124-129.

DeNisi, A.S., & Stevens, G.E. (1981). Profiles of performance, performance evaluations, and personnel decisions. Academy of Management Journal, 24, 592-602.

DeNisi, A.S., & Summers, T.P. (1986). Rating forms and the organization of information: A cognitive role for appraisal instruments. Paper presented at the National Academy of Management Meetings, Chicago.

DeNisi, A.S., & Williams, K.J. (1988). Cognitive approaches to performance appraisal. In K.M. Rowland & G.R. Ferris (Eds.), Research in personnel and human resources management (Vol. 6, pp. 109-155). Greenwich, CT: JAI Press.

Dickinson, T.L. (1987). Designs for evaluating the validity and accuracy of performance ratings. Organizational Behavior and Human Decision Processes, 40, 1-21.

Dickinson, T.L., Hedge, J.W., Johnson, R.L., & Silverhart, T.A. (1990). Work performance ratings: Cognitive modeling and feedback principles in rater accuracy training. Technical Report AFHRL-TP-89-61, Air Force Human Resources Laboratory.

Feild, H.S., & Holley, W.H. (1982). The relationship of performance appraisal system characteristics to verdicts in selected employment discrimination cases. Academy of Management Journal, 25, 392-406.

Feldman, J.M. (1981). Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 66, 127-148.

Feldman, J.M. (1986). Instrumentation and training for performance appraisal: A perceptual-cognitive viewpoint. In K.M. Rowland & G.R. Ferris (Eds.), Research in personnel and human resources management (Vol. 4, pp. 45-99). Greenwich, CT: JAI Press.

Fisher, C.D., & Locke, E.A. (1990). Bad citizenship behaviors: Giving what you get. Paper presented at the National Academy of Management Meetings, San Francisco.

Fisicaro, S.A. (1988). A reexamination of the relation between halo error and accuracy. Journal of Applied Psychology, 73, 239-244.

Fiske, S.T. (1981). Social cognition and affect. In J. Harvey (Ed.), Cognition, social behavior, and the environment. Reading, MA: Addison-Wesley.

Ford, J.K., Schmitt, N., Schechtman, S.L., Hults, B.M., & Doherty, M.L. (1989). Process tracing methods: Contributions, problems and neglected research questions. Organizational Behavior and Human Decision Processes, 43, 75-117.

Funder, D.C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75-90.

Glass, G.V., & Hopkins, K.D. (1984). Statistical methods in education and psychology, Second edition. Englewood Cliffs, NJ: Prentice-Hall.

Graham, J.W. (1986). Organizational citizenship behavior informed by political theory. Paper presented at the National Academy of Management Meeting, Chicago, August, 1986.

Graham, J.W. (1989). Organizational citizenship behavior: Construct redefinition, operationalization, and validation. Unpublished manuscript, Department of Management, Loyola University of Chicago.

Hastie, R., & Park, B. (1986). The relationship between memory and judgment depends on whether the judgment task is memory-based or on-line. Psychological Review, 93, 258-268.

Heneman, R.L. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. Personnel Psychology, 39, 811-826.

Hogan, R. (1982). A socioanalytic theory of personality. In M.M. Page (Ed.), 1982 Nebraska Symposium on Motivation (pp. 55-89). Lincoln: University of Nebraska Press.

Hollenbeck, J.R., Brief, A.P., Whitener, E.M., & Pauli, K.E. (1988). An empirical note on the interaction of personality and aptitude in personnel selection. Journal of Management, 14, 441-451.

Hough, L.M., Eaton, N.K., Dunnette, M.D., Kamp, J.D., & McCloy, R.A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology (Monograph), 75, 581-595.

Ilgen, D.R., Barnes-Farrell, J.L., & McKellen, D.B. (in press). Performance appraisal process research in the 1980s: What has it contributed to appraisals in use? Organizational Behavior and Human Decision Processes.

Ilgen, D.R., & Feldman, J.M. (1983). Performance appraisal: A process focus. In B. Staw & L. Cummings (Eds.), Research in Organizational Behavior (Vol. 5, pp. 141-197). Greenwich, CT: JAI Press.

Ilgen, D.R., & Hollenbeck, J.R. (1991). The structure of work: Job design and roles. In M.D. Dunnette (Ed.), Handbook of Industrial and Organizational Psychology, Second Edition. Chicago: Rand McNally.

Jacobs, R., & Kozlowski, S.W.J. (1985). A closer look at halo error in performance ratings. Academy of Management Journal, 28, 210-212.

Johnson, D.M. (1963). Reanalysis of experimental halo effects. Journal of Applied Psychology, 47, 46-47.

Karambayya, R. (1990). Contextual predictors of organizational citizenship behavior. Academy of Management Proceedings, 221-225.

Karl, K.A., & Wexley, K.N. (1989). Patterns of performance and rating frequency: Influences on the assessment of performance. Journal of Management, 15, 5-20.

Katz, D. (1964). The motivational basis of organizational behavior. Behavioral Science, 9, 131-133.

Katz, D., & Kahn, R.L. (1966). The social psychology of organizations. New York: Wiley.

Kavanagh, M.J. (1971). The content issue in performance appraisal: A review. Personnel Psychology, 24, 653-668.

Kenny, D.A., & Albright, L. (1987). Accuracy in interpersonal perception: A social relations analysis. Psychological Bulletin, 102, 390-402.

Keppel, G. (1982). Design and analysis: A researcher's handbook, Second edition. Englewood Cliffs, NJ: Prentice-Hall.

Kirk, R.E. (1982). Experimental design: Procedures for the behavioral sciences, Second edition. Belmont, CA: Brooks/Cole.

Kozlowski, S.W.J., & Ford, J.K. (1991). Rater information acquisition processes: Tracing the effects of prior knowledge, performance level, search constraint, and memory demand. Organizational Behavior and Human Decision Processes, 50, 282-301.

Landy, F.J., & Farr, J.L. (1980). Performance rating. Psychological Bulletin, 87, 72-107.

Landy, F.J., Zedeck, S., & Cleveland, J. (Eds.) (1983). Performance measurement and theory. Hillsdale, NJ: Erlbaum Associates.

Latham, G.P., & Wexley, K.N. (1981). Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley.

Latham, G.P., Wexley, K.N., & Pursell, E.D. (1975). Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 60, 550-555.

Levy, M. (1989). Almost perfect performance appraisals. Personnel Journal, April, 76-83.

Locher, A.H., & Teel, K.S. (1988). Appraisal trends. Personnel Journal, 67(9), 139-145.

Longenecker, C.O. (1989). Truth or consequences: Politics and performance appraisals. Business Horizons, November-December, 76-82.

Longenecker, C.O., Sims, H.P., Jr., & Gioia, D.A. (1987). Behind the mask: The politics of employee appraisal. Academy of Management Executive, 1, 183-193.

Lord, R.G. (1985). Accuracy in behavioral measurement: An alternative definition based on raters' cognitive schema and signal detection theory. Journal of Applied Psychology, 70, 66-71.

MacKenzie, S.B., Podsakoff, P.M., & Fetter, R. (1991). Organizational citizenship behavior and objective productivity as determinants of managerial evaluations of salespersons' performance. Organizational Behavior and Human Decision Processes, 50, 123-150.

McDonald, T. (in press). The effect of dimension content on observation and ratings of job performance. Organizational Behavior and Human Decision Processes.

Mehrens, W.A., & Lehmann, I.J. (1973). Measurement and evaluation in education and psychology. New York: Holt, Rinehart, & Winston.

Mohrman, A.M., & Lawler, E.E. (1983). Motivation and performance appraisal behavior. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory. Hillsdale, NJ: Erlbaum Associates.

Murphy, K.R., & Balzer, W.K. (1986). Systematic distortions in memory-based behavior ratings and performance evaluations: Consequences for rating accuracy. Journal of Applied Psychology, 71, 39-44.

Murphy, K.R., & Balzer, W.K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74, 619-624.

Murphy, K.R., Balzer, W.K., Lockhart, M.C., & Eisenmann, E.J. (1985). Effects of previous performance on evaluations of present performance. Journal of Applied Psychology, 70, 72-84.

Murphy, K.R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W.K. (1982). Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 67, 320-325.

Murphy, K.R., & Jako, R. (1989). Under what conditions are observed intercorrelations greater or smaller than true intercorrelations? Journal of Applied Psychology, 74, 827-830.

Murphy, K.R., Philbin, T.A., & Adams, S.R. (1989). Effect of purpose of observation on accuracy of immediate and delayed performance ratings. Organizational Behavior and Human Decision Processes, 43, 336-354.

Murphy, K.R., & Reynolds, D.H. (1988). Does true halo affect observed halo? Journal of Applied Psychology, 73, 235-238.

Nathan, B.R., & Tippins, N. (1990). The consequences of halo "error" in performance ratings: A field study of the moderating effect of halo on test validation results. Journal of Applied Psychology, 75, 290-296.

Odiorne, G. (1965). Management by objectives. New York: Pitman Publishing.

O'Reilly, C., III, & Chatman, J. (1986). Organizational commitment and psychological attachment: The effects of compliance, identification, and internalization on prosocial behavior. Journal of Applied Psychology, 71, 492-499.

Organ, D.W. (1977). A reappraisal and reinterpretation of the satisfaction-causes-performance hypothesis. Academy of Management Review, 2, 46-53.

Organ, D.W. (1988a). A restatement of the satisfaction-performance hypothesis. Journal of Management, 14, 547-557.

Organ, D.W. (1988b). Organizational citizenship behavior: The good soldier syndrome. Lexington, MA: Lexington Books.

Organ, D.W. (1990a). The motivational basis of organizational citizenship behavior. In B. Staw & L. Cummings (Eds.), Research in Organizational Behavior (Vol. 12, pp. 43-72). Greenwich, CT: JAI Press.

Organ, D.W. (1990b). Fairness, productivity, and organizational citizenship behavior: Trade-offs in student and manager pay decisions. Paper presented at the National Academy of Management Meeting, San Francisco, August, 1990.

Organ, D.W., & Konovsky, M. (1989). Cognitive versus affective determinants of organizational citizenship behavior. Journal of Applied Psychology, 74, 157-164.

Orr, J.M., Sackett, P.R., & Mercer, M. (1989). The role of prescribed and nonprescribed behaviors in estimating the dollar value of performance. Journal of Applied Psychology, 74, 34-40.

Padgett, M.Y. (1988). Performance appraisal in context: Motivational influences on performance ratings. Unpublished Ph.D. dissertation, Department of Management, Michigan State University.

Padgett, M.Y., & Ilgen, D.R. (1989). The impact of ratee performance characteristics on rater cognitive processes and alternative measures of rater accuracy. Organizational Behavior and Human Decision Processes, 44, 232-260.

Payne, J.W. (1976). Task complexity and contingent processing in decision making: An information search and protocol analysis. Organizational Behavior and Human Performance, 16, 366-387.

Pitre, E., & Sims, H.P., Jr. (1987). The thinking organization: How patterns of thought determine organizational culture. National Productivity Review, Autumn, 340-347.

Puffer, S.M. (1987). Prosocial behavior, noncompliant behavior, and work performance among commission salespeople. Journal of Applied Psychology, 72, 615-621.

Rice, B. (1985). Performance review: The job nobody likes. Psychology Today, September, 30-36.

Saal, F.E., Downey, R.G., & Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413-428.

Smith, C.A., Organ, D.W., & Near, J.P. (1983). Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68, 653-663.

Smith, P., & Kendall, L.M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-155.

Smither, J.W., & Reilly, R.R. (1987). True intercorrelation among job components, time delay in rating, and rater intelligence as determinants of accuracy in performance ratings. Organizational Behavior and Human Decision Processes, 40, 369-391.

Srull, T.K., & Wyer, R.S., Jr. (1989). Person memory and judgment. Psychological Review, 96, 58-83.

Stevens, S.N., & Wonderlic, E.F. (1934). An effective revision of the rating technique. Personnel Journal, 13, 125-134.

Sulsky, L.M., & Balzer, W.K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497-506.

Symonds, P.M. (1925). Notes on rating. Journal of Applied Psychology, 9, 188-195.

Taylor, E.K., & Hastman, R. (1956). Relation of format and administration to the characteristics of graphic rating scales. Personnel Psychology, 9, 181-206.

Tett, R.E., Jackson, D.N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703-742.

Thorndike, E.L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25-29.

Thornton, G.C., III, & Byham, W.C. (1982). Assessment centers and managerial performance. New York: Academic Press.

VanDyne, L., & Cummings, L.L. (1990). Extra-role behaviors: The need for construct and definitional clarity. Paper presented at the National Academy of Management Meeting, San Francisco, August, 1990.

Werner, J.M. (1992). Predicting U.S. Courts of Appeals decisions involving performance appraisal: Updating Feild & Holley for the 1980s. Manuscript under revision.

Wexley, K.N., & Klimoski, R. (1984). Performance appraisal: An update. In K.M. Rowland & G.R. Ferris (Eds.), Research in personnel and human resource management (Vol. 2, pp. 35-79). Greenwich, CT: JAI Press, Inc.

Wexley, K.N., & Yukl, G.A. (1984). Organizational behavior and personnel psychology. Homewood, IL: Irwin.

Wexley, K.N., Yukl, G.A., Kovacs, S.Z., & Sanders, R.E. (1972). Importance of contrast effects in employment interviews. Journal of Applied Psychology, 56, 45-48.

Wherry, R.J. (1952). The control of bias in ratings: A theory of rating. Columbus: The Ohio State Research Foundation.

Wherry, R.J., & Bartlett, C.J. (1982). The control of bias in ratings: A theory of rating. Personnel Psychology, 35, 521-551.

Williams, K.J., Cafferty, T.P., & DeNisi, A.S. (1990). The effect of performance appraisal salience on recall and ratings. Organizational Behavior and Human Decision Processes, 46, 217-239.

Williams, K.J., DeNisi, A.S., Blencoe, A.G., & Cafferty, T.P. (1985). The role of appraisal purpose: Effects of purpose on information acquisition and utilization. Organizational Behavior and Human Decision Processes, 36, 314-339.

Williams, K.J., DeNisi, A.S., Meglino, B.M., & Cafferty, T.P. (1986). Initial decisions and subsequent performance ratings. Journal of Applied Psychology, 71, 189-195.

Williams, L.J. (1988). Affective and nonaffective components of job satisfaction and organizational commitment as determinants of organizational citizenship behaviors. Unpublished Ph.D. dissertation, Department of Management, Indiana University.

Williams, L.J., Podsakoff, P.M., & Huber, V. (1986). Determinants of organizational citizenship behaviors: A structural equation analysis with cross-validation. Paper presented at the National Academy of Management Meetings, Chicago.

Williams, S.L., & Hummert, M.L. (1990). Evaluating performance appraisal instrument dimensions using construct analysis. Journal of Business Communication, 27, 117-135.

Zedeck, S., & Kafry, D. (1977). Capturing rater policies for processing evaluation data. Organizational Behavior and Human Performance, 18, 269-294.