This is to certify that the dissertation entitled "Influences of Information Acquisition and Method of Rating, and In-Role versus Extra-Role Behaviors on Rater Accuracy, Halo, Type and Amount of Search," presented by Jon Michael Werner, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Organizational Behavior.

Major professor

MSU is an Affirmative Action/Equal Opportunity Institution


INFLUENCES OF INFORMATION ACQUISITION AND METHOD OF RATING, AND IN-ROLE VERSUS EXTRA-ROLE BEHAVIORS ON RATER ACCURACY, HALO, TYPE AND AMOUNT OF SEARCH

By

Jon Michael Werner

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Management

1992


ABSTRACT

INFLUENCES OF INFORMATION ACQUISITION AND METHOD OF RATING, AND IN-ROLE VERSUS EXTRA-ROLE BEHAVIORS ON RATER ACCURACY, HALO, TYPE AND AMOUNT OF SEARCH

By

Jon Michael Werner

This study extends cognitively-oriented performance appraisal research by addressing two primary questions: 1) does the manner in which rating scales are organized affect the type of ratings made? and 2) what behaviors do raters consider relevant when rating job performance? To study the first question, a 2 by 2 factorial design was used, where method of rating (by person versus by dimension) and prior knowledge of format were manipulated as between-subjects variables. The second question was addressed by creating performance dimensions which captured either in-role or extra-role (i.e., organizational citizenship) behaviors. The performance levels of six ratees were experimentally manipulated as within-subjects variables, using three levels of in-role and two levels of extra-role performance. Subjects were 116 supervisors from a large university, randomly assigned to between-subjects condition. A computer simulation was devised, which asked raters to search for performance information, and then make ratings for each ratee. Hypotheses were tested using overall performance ratings, halo, accuracy, type of search, and amount of search as dependent measures. Method of rating had little impact on overall ratings but had a large impact on two measures of halo.
Halo was lowest when rating by dimension, and when subjects had no prior knowledge of format type. Results were mixed for the effects of these variables on rater accuracy. In general, accuracy was better when rating by dimension, but effect sizes were small and inconsistent. No effects were found for these manipulations on type or amount of search. Level of in-role performance, level of extra-role performance, and their interaction each explained statistically significant amounts of rating variance. Halo was increased for ratees exhibiting high levels of extra-role performance, but contradictory results were obtained for two accuracy measures (stereotype and differential accuracy). Finally, level of in-role performance had some impact on amount of search by ratee. Overall, results from this study provided partial support for the first research question, and strong support for the latter. A proposed interaction between method of rating and level of extra-role behavior was not supported. Implications are discussed.


TO DAD

You were gone too soon, there was so much more that I longed for... Yet, your accomplishments have been my unspoken model and inspiration. Though I'll not hear your "well done!" on this earth, I know this moment pleases you; to you I dedicate this labor of mine.


ACKNOWLEDGEMENTS

Faculty, family, friends - so many have guided and walked with me to this point. Words of gratitude are definitely in order.

To my committee: John Hollenbeck, Dan Ilgen, and Ken Wexley. Your helpful input and direction were invaluable. Ken, you've taught me so much. Without your early urging and encouragement, I wouldn't have returned to the doctoral program. Thanks for helping me discover a life I really do love! Dan, your ability to balance career accomplishments with faith and family spell "Career Success/Personal Success". You've shown me that Korman's warning can be heeded. Though at times I would have enjoyed less "critique" in your feedback, I see now that your input led me to produce a better product. Thanks for helping me scale back and focus on a more manageable project! John, your ability to impact our field so quickly never ceases to amaze me. I've gained so much from being around you the past five years. Your on-going feedback and guidance have helped me get through this process. Thanks!

To my family: Mom, you've sacrificed so much for me, and for our life together. Words cannot express my gratitude for all you've done. Thanks, too, for instilling in me a love of learning, and a love of the written word. Barbara, in you I truly found the best mate (for me). Your constant support and willingness to forego for the sake of my education made this long process so much more enjoyable. I couldn't have done this without you. Hans and Noelle, the hugs and kisses and love and laughter more than make up for disrupted nights, crying, and occasional bouts of the grumpies. Thanks for showing me that there is life beyond my word processor.

To my friends: Anne, Blair, Dass, Ellen, Kathy, Pat, and Peggy - I wish I had more fully appreciated what a wonderful cohort you were! To Tim and Cheryl, Larry and Karen, Lowell and Arva - you showed me true fellowship, and that it was worth the risk to take an "inside look". And to Audrey, Bill, and all my new colleagues at USC: you weren't supposed to make it onto the acknowledgements page of my dissertation, but that's life, and I'm sure glad to be a part of such a great group of people.
Finally, special thanks are in order to Stephen Gilliland, who adapted the computer program to run my simulation, and who made it easy for me to transfer the data I needed into SPSS-X files. Your computer expertise saved me hours and hours of grunt work. Also, Kay Butcher and Mike Rice provided invaluable assistance to me in lining up the supervisors for both phases of my project. THANKS TO YOU ALL!!


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: INTRODUCTION
  Overview
  Search and Rating Scale Format
  In-Role versus Extra-Role Behaviors
  Contributions of this Research
  Search and Method of Rating
  In-Role vs. Extra-Role Behaviors

CHAPTER 2: LITERATURE REVIEW AND HYPOTHESES
  Past Research on Error and Accuracy
  Error
  Accuracy
  Cognitively-Based Performance Appraisal Research
  The Borman/Murphy Stream
  University of South Carolina Stream
  Person- vs. Task-Blocking in Information Acquisition
  Rating Format vs. Advanced Knowledge of that Format
  Process Tracing Research
  Hypotheses Concerning Information Acquisition/Method of Rating
  Overall Ratings
  Error
  Accuracy
  Mediating (Process) Variables
  In-Role Versus Extra-Role Performance
  Measuring Performance
  Organizational Citizenship Behavior
  OCB and Performance Appraisal
  Hypotheses Concerning In-Role and Extra-Role Behaviors
  Level of In-Role Behaviors and Accuracy
  Level of In-Role Behaviors and Amount of Search
  Extra-Role Behaviors, Error, and Accuracy
  Interactions Between the In-Role and Extra-Role Manipulations
  Hypotheses Concerning Method of Rating, OCB, and Accuracy
  Summary

CHAPTER 3: METHOD
  Overview of Methodology
  Participants
  Power Analysis
  Procedure
  Deriving the Content of the Study
  The Primary Study
  Constraints on Information Search
  Variables
  Overall Performance Ratings
  Error
  Accuracy
  Type of Search
  Amount of Search
  Data Analysis

CHAPTER 4: RESULTS
  Content Derivation for the Study
  Results from the Primary Study
  Overall Ratings
  Halo Effect
  Accuracy
  Correlational Accuracy
  Distance Accuracy
  Dickinson's MANOVA Approach
  Type of Search
  Amount of Search
  Dickinson's Approach to the Within-Subject Manipulations
  In-Role Performance
  Extra-Role Performance
  Interaction of In-Role and Extra-Role Performance
  Method of Rating, OCBI, and Accuracy

CHAPTER 5: DISCUSSION
  Hypotheses Concerning Type of Format
  Overall Ratings
  Halo
  Accuracy
  Correlational Accuracy
  Distance Accuracy
  Process Variables
  Type of Search
  Amount of Search
  Hypotheses Concerning Type of Performance
  In-Role Performance
  Differential Accuracy
  Amount of Search
  Extra-Role Performance
  Halo
  Accuracy
  Interaction of In-Role and Extra-Role Performance
  Hypotheses Concerning Method of Rating, OCBI, and Accuracy
  Summary and Directions for Future Research
  Strengths and Limitations of the Current Study
  Strengths
  Limitations
  General Conclusions from this Study
  Type of Format
  Type of Performance Dimension
  Directions for Future Research
  Type of Format
  Type of Performance Dimension
  Process Issues
  Setting/Contextual Issues

APPENDIX A

LIST OF REFERENCES


LIST OF TABLES

Table 1: Mean Importance Ratings Given to the Six Performance Dimensions
Table 2: Background Information for Subject Matter Experts and Primary Sample
Table 3: True Scores from the 15 Subject Matter Experts
Table 4: Overall Ratings by Ratee: True Scores, Whole Sample, and by Condition
Table 5: Median and Mean Intercorrelations Among Dimensions, Across Ratees
Table 6: Mean Ratings by Ratee and Dimension: Whole Sample, and by Condition
Table 7: Analysis of Variance for Prior Knowledge and Format Type on Accuracy
Table 8: Means, Standard Deviations, and Breakdowns Concerning Type of Search
Table 9: Analysis of Variance for In-Role and Extra-Role Performance
Table 10: Means and Standard Deviations for Orthonormal Contrasts by Ratee
Table 11: Means, Rank Order, and Standard Deviations for Amount of Search by Ratee
Table 12: Median and Mean Intercorrelations by OCBI
Table 13: Means and Standard Deviations for SA and DA by Level of OCBI
Table 14: Means and S.D.'s for SA and DA by Method of Rating and Level of OCBI
Table 15: MANOVA for Method of Rating, OCBI, and Stereotype Accuracy
Table 16: MANOVA for Method of Rating, OCBI, and Differential Accuracy
Table 17: Amount and Order of Search, by Ratee and by Dimension

LIST OF FIGURES

FIGURE 1: Overview of the Research Variables
FIGURE 2: A Model of the Influences of Person- versus Dimension-Blocking on Performance Appraisal Ratings
FIGURE 3: The Between-Subject Manipulations of Format Type and Prior Knowledge of Format
FIGURE 4: Predictions Concerning Type of Format, Advanced Knowledge of Format, Halo, and Accuracy
FIGURE 5: Predictions Concerning Type of Search and Amount of Search
FIGURE 6: Sample Performance Dimensions to be Rated
FIGURE 7: Target Profiles for Hypothetical Ratees
FIGURE 8: Predicted Relationships between Method of Rating, Favorability of OCB Information, and Accuracy
FIGURE 9: Contrast-Coded Variables for this Research
FIGURE 10: Summary of the Variables from Hypotheses 1-8 Predicted to be Statistically Significant
FIGURE 11: Median Intercorrelations Among Dimensions (Halo)
FIGURE 12: Results for Stereotype Accuracy (as measured by SACORR)
FIGURE 13: Results for Distance Score Accuracy Components
FIGURE 14: Results for Overall Cronbach Accuracy
FIGURE 15: Interaction of In-Role and Extra-Role Performance
FIGURE 16: Interaction of In-Role and Extra-Role Performance, by Type of Format


CHAPTER 1: INTRODUCTION

Overview

Few areas in personnel psychology and human resource management have been as heavily researched as performance appraisal (Nathan & Tippins, 1990). A good portion of this research has focused on halo effects (e.g., Cooper, 1981; Jacobs & Kozlowski, 1985; Murphy & Reynolds, 1988), rating accuracy (e.g., Lord, 1985; Murphy & Balzer, 1986; Padgett & Ilgen, 1989), or the relationship between halo and accuracy (e.g., Becker & Cardy, 1986; Smither & Reilly, 1987; Fisicaro, 1988; Murphy & Balzer, 1989). While a considerable amount has been learned from such research, a criticism of much psychometric research is that not enough has been learned concerning the process by which raters form their appraisal judgments (Landy & Farr, 1980).

Since 1980, extensive effort has been expended studying cognitive influences on performance appraisal (Ilgen & Feldman, 1983; DeNisi, Cafferty, & Meglino, 1984; Ilgen, Barnes-Farrell, & McKellin, in press). For example, Ilgen et al. (in press) summarized much of the cognitively-oriented performance appraisal research conducted in the 1980s. They discussed research under three broad phases of information processing, i.e., a) search or acquisition of information, b) categorization, organization, and storage of information in memory, and c) retrieval and integration of information, followed by judgments formed on the basis of this information. Within each phase, Ilgen et al. (in press) summarized the effects found from four sources: raters, ratees, rating scales, and the setting or context in which appraisal took place.

The current research follows considerable previous research (cf., Murphy & Balzer, 1986; Dickinson, 1987) on the accuracy of performance appraisal ratings, since accuracy is a primary goal of any appraisal system (Feldman, 1986). However, this study goes beyond past research, and is expected to shed increased light on the process by which raters form their appraisal judgments.

An important issue addressed by this research is the domain of individual ratee behaviors (and other characteristics) which raters consider relevant when rating job performance. Most current approaches to performance appraisal recommend that raters (and rating scales) focus on measurable job behaviors and/or tangible results (Latham & Wexley, 1981; Odiorne, 1965). Organ (1977), on the other hand, argued that practicing managers view "performance" as more than simply fulfilling one's job duties. The latter has been labelled in-role behaviors (Williams, 1988). Such in-role behaviors are obviously important to managerial ratings of performance. However, according to Organ, acts that are spontaneous, and generally beyond one's regular role requirements, are also important to managers when they evaluate ratee performance. Such extra-role behaviors have been labelled "organizational citizenship behaviors", or OCB (Bateman & Organ, 1983).
While much of the subsequent research on such extra-role or citizenship behaviors has not focused on performance appraisal per se, recent research by Orr, Sackett, and Mercer (1989), and MacKenzie, Podsakoff, and Fetter (1991) strongly suggests that managers can and do evaluate both in-role and extra-role (OCB) behaviors when making appraisal judgments. Because of these findings, both in-role and extra-role performance are of interest in the present study.

This chapter begins by describing research propositions relating to search and rating scale format. Such research focuses more on aspects of in-role performance. It is based on the premise that the manner in which rating scales are organized affects the type of ratings made. Since Symonds (1925), various researchers have suggested that ratings made one dimension at a time would exhibit less halo effect than those made in the traditional, one-person-at-a-time manner. While research on this notion has been disappointing in its demonstrated effect on halo (Cooper, 1981), recent cognitively-oriented research suggests that scale format organizes or "frames" the rating problem, and that such framing effects are likely to have a greater impact on the manner in which raters search for performance-relevant information (DeNisi & Williams, 1988). That is to say, such framing is expected to affect the rating process, such that raters using a person-blocked scale are expected to search for information in a person-blocked manner, and raters using a dimension-blocked scale will search for information in a dimension-blocked fashion. These different search strategies are then expected to influence measures of rater accuracy.

Such questions are important to address, but they do not get at the issue just mentioned concerning whether recent performance appraisal efforts have downplayed or ignored performance dimensions which managers consider important when making their appraisal ratings. For example, Williams and Hummert (1990) studied the constructs used by supervisors and clerical employees to define productive and unproductive work behavior. Both groups described behaviors corresponding to 9 of the 12 dimensions used on that organization's rating scale. However, a number of dimensions and behaviors were mentioned which were not a part of the rating scale, e.g., helping others when one's own work is complete, going the "extra mile" to complete a task, and speaking well of the organization when off the job. As will be demonstrated below, such behaviors are very similar to the extra-role behaviors discussed by Organ (1977; 1988b), Orr et al. (1989), and MacKenzie et al. (1991). This supports the argument that it is important to study the effects of both in-role and extra-role performance when researching the performance appraisal process. In the present context, this means studying the effects of in-role and extra-role performance on measures of rater search and accuracy. In effect, this aspect of the research asks whether more dimensions (in-role and extra-role) should be measured in performance appraisal than is presently recommended (e.g., Cascio, 1989). It is expected that search and accuracy measures will be influenced by the experimental manipulations of both in-role and extra-role performance. These issues will be addressed in this chapter following the discussion of rater search and rating scale format manipulations.
As indicated, this research goes beyond measuring a static dependent variable such as rater accuracy, and will also study the cognitive processes used by raters as they search for information and make subsequent appraisal ratings (Landy & Farr, 1980). A relatively recent methodology known as process tracing (Ford et al., 1989) will be used to examine the processes used by raters when they search for performance information under different framing conditions (i.e., person- vs. dimension-oriented scale formats). This methodology, as well as the computer simulation which will be used, are discussed in greater detail below. Briefly, variables such as the amount and type of search are expected to be influenced by the manipulations of person- versus dimension-oriented rating scale formats. Further, it is expected that search will be influenced by manipulations of in-role and extra-role performance, and that interactions will occur between the manipulations of search and extra-role performance.

In the remainder of this chapter, the above ideas will be sketched out in greater detail. Following this, the contributions which this research can make to the performance appraisal literature are discussed.

Search and Rating Scale Format

The first major issue in this research is the most psychometric in nature, having to do with whether the method of rating influences rater accuracy and the amount of halo observed. In 1925, Symonds suggested that halo errors would decrease if raters made ratings of all individuals on one trait or dimension before moving on to the next dimension. Some support for this was found by Stevens and Wonderlic (1934). Cooper (1981), in his influential work on halo, cited four studies which tested for differences between raters who rated "by person" (i.e., in the typical manner) versus "by category" (Taylor & Hastman, 1956; Johnson, 1963; Blumberg, DeSoto, & Kuethe, 1966; Brown, 1968). None of these studies found significant differences between the methods in the amount of halo observed. However, recent cognitively-oriented research (discussed more fully in the next chapter) indicates that method of information acquisition does influence various measures of rating accuracy. In this case, whether information is gathered "by person" or "by dimension" can be viewed as an acquisition strategy. However, even though Cooper (1981) falls within the cognitive perspective, in none of the studies cited by him did researchers measure accuracy.

Current research makes frequent use of expert "true scores" and the four distance score components of rater accuracy discussed by Cronbach (1955; e.g., Dickinson, 1987; Smither & Reilly, 1987); i.e., elevation - the mean rating given by a rater, across all ratees and dimensions; differential elevation - the mean rating given to each ratee, across dimensions; stereotype accuracy - the mean rating given to each dimension, across ratees; and differential accuracy - the ratings of each ratee's performance on each dimension (see Chapter 2). The present research will also measure halo, so as to compare results with previous research. Of greater concern, however, is the impact of manipulating rating scale format and prior knowledge of that format on measures of rating accuracy. Cronbach's (1955) measures of accuracy will be used as primary dependent variables. Feldman (1981) predicted that the structure of performance appraisal forms would interact with the manner in which raters categorize, store, recall, and integrate performance information.
Further, DeNisi et al. (1984) and DeNisi and Summers (1986) proposed that introducing rating scales prior to the observation of performance helps raters use the scale format as a cue (or frame) for organizing information in memory. This prediction will be tested by the present research, i.e., is it differences in rating scale format per se which influences rater accuracy and halo, or is it rater knowledge of that format prior to information acquisition which influences rater accuracy and error? The rationale for this manipulation is expressed in DeNisi and K. Williams (1988). Manipulating the structure of the rating scale format (person vs. dimension) is expected to influence halo and accuracy primarily via its influence on the retrieval of information from memory, whereas providing knowledge of the scale format prior to search is expected to affect rater search strategies and subsequent information storage in memory. This experimental manipulation should provide the "more appropriate test" of the effects of rating by person versus dimension called for by DeNisi and Williams (1988, p. 119).

A number of recent studies have utilized computer-controlled information display boards to examine differences in search/acquisition strategies (Williams et al., 1985; Cafferty et al., 1986). This approach comes out of research on decision making, and is often labeled "process tracing" research (Ford, Schmitt, Schechtman, Hults, & Doherty, 1989). Such an approach allows the researcher to examine the pattern and depth of each rater's information search. For example, subjects who know in advance that they will rate by person (or dimension) are expected to search for information in a manner consistent with that format. Also, a person-blocked format is expected to increase the salience of global impressions, which is then expected to decrease the overall amount of search undertaken (Hastie & Park, 1986; Srull & Wyer, 1989). A process tracing approach is ideally suited to studying the effects of the experimental manipulations on each rater's search strategies. Unlike much previous performance appraisal research, a process tracing methodology is designed to allow the researcher to map the manner in which raters actively seek and process information (Ilgen & Feldman, 1983).

In-Role versus Extra-Role Behaviors

The second broad issue in this research concerns the extent to which raters are influenced by both the in-role and extra-role (OCB) behaviors exhibited by ratees. If managers consider extra-role behaviors when making performance ratings, yet these behaviors don't fit within typical performance appraisal dimensions, then these behaviors are having an undetermined effect on rater search activities and accuracy. In the present study, ratee levels of in-role and extra-role performance will be experimentally manipulated. For reasons discussed below, the presence of such extra-role, or citizenship information is expected to influence rater search strategies, as well as indices of rater accuracy.

Most behavioral approaches to performance appraisal currently in use emphasize tying the appraisal instrument as closely as possible to the job description for the job in question (e.g., Latham & Wexley, 1981). Recently, however, concern has been expressed that even well-done job analyses may miss many behaviors which managers consider important in evaluating employee performance, behaviors that are not technically part of the job requirements (e.g., cooperation, contribution to morale). Orr et al.
(1989) found that the majority of the managers in their policy-capturing study used information on both prescribed (in-role) and citizenship behaviors when estimating the dollar value of performance (SDy). Their intent was not to show the relative importance of each type of behavior for estimates of the dollar value of performance, but rather to show that citizenship behaviors played a role when managers made such estimations.

MacKenzie, Podsakoff, and Fetter (1991) obtained measures of objective sales performance for insurance agents, then collected managers' evaluations of their subordinates' citizenship behaviors, as well as subjective performance evaluations. They used the objective indices as indicators of "in-role" performance and, unsurprisingly, found a sizable (r = .48) correlation between objective performance and subjective rating. What was noteworthy about this study was that a LISREL analysis including both objective and OCB measures of performance accounted for 44% of the variance in subjective ratings. This was much higher than the 7% mean R-squared reported by Heneman (1986) in his meta-analysis of the objective-subjective performance link, and would seem to indicate that OCB added considerably to the explained variance in supervisory ratings. Further, two aspects of citizenship (factors labelled "Altruism" and "Civic Virtue") explained as much variance in subjective ratings as did the objective performance measures. If these findings are robust across other samples, it bolsters Organ's (1988a) argument that managers view performance as more than simply "in-role" behavior.

While MacKenzie et al. (1991) studied OCB within a performance appraisal context, their operationalization of "in-role performance" was in fact a "results" measure of performance, i.e., they used three measures of tangible organizational outcomes. This is much different from the behavioral approach taken by most citizenship research (Smith, Organ, & Near, 1983; Bateman & Organ, 1983; Williams, 1988). Orr et al. (1989) had managers rate 13 traits and behaviors, yet their research was not strictly a performance appraisal study. Thus, it would be valuable to replicate the findings of MacKenzie et al. and Orr et al. in a performance appraisal setting where in-role and extra-role behaviors are systematically varied (thus keeping the measurements of in-role and extra-role performance as similar as possible).

Further, there is value in embedding a study of in-role versus extra-role behaviors within the present research framework, with its emphasis on search, accuracy, and halo. First, Organ (1990a) proposed that OCB would explain some portion of the halo effect observed in subjective ratings of performance. For example, if two secretaries are equivalent in terms of in-role performance, but one is known to go the "extra mile" to complete a task, while the other declares that "If I have to work an extra ten minutes, I better get paid for it!" (Williams & Hummert, 1990), the prediction is that such extra-role behaviors will influence managers' general ratings of performance (positively and negatively, respectively). No research was located which measured the effects of citizenship behaviors on measures of halo or rating accuracy.

Second, at least two past studies of OCB could have profited by an explicit measurement of rater accuracy. For example, in their seminal work on OCB, Smith et al. (1983) found an unexpected effect for organizational department on each of their OCB factors.
In their discussion, they attributed this to different response sets (i.e., amounts of leniency) among raters. Similarly, Orr et al. noted that their study picked up "real individual differences in perceptions of variability of performance" (1989, p. 39). Such effects can be successfully captured using Cronbach's (1955) measure of elevation. The value of using Cronbach's four distance score measures of accuracy is that the effects of response styles (e.g., leniency and strictness) can be separated from other aspects of rating accuracy (Funder, 1987; Sulsky & Balzer, 1988). In the current research context, it may be that some raters are inherently strict or lenient, regardless of how rating format, in-role, or extra-role performance are manipulated. In this case, elevation is viewed as a nuisance variable, which can be controlled or partialled out, so that the more interesting effects of the manipulations on the other aspects of accuracy can be measured (see Chapter 2).

Third, and most importantly for the current research, interactions can be hypothesized between search/method of rating, the presence or absence of OCB information, and Cronbach's (1955) measures of accuracy. Concerning search and method of rating, the research summarized by DeNisi and Williams (1988) would predict that raters rating by person would be more accurate on a measure of differential elevation (rank ordering ratees), but less accurate on measures of stereotype accuracy (ranking dimensions) and differential accuracy (the dimension x ratee interaction) than raters rating by dimension. Concerning extra-role behaviors, MacKenzie et al. (1991) proposed that OCB information may trigger the increased use of categories and schema by raters (cf., Fiske, 1981; Feldman, 1981), which would be expected to decrease accuracy, at least for measures of stereotype and differential accuracy. Thus, at least two of the differences in accuracy predicted by DeNisi and Williams (1988) are expected to be even stronger when OCB information is present versus when it is not.

The expectation of an interaction between the manipulations of method of rating and extra-role performance is the primary justification for simultaneously studying the effects of both on measures of search, accuracy, and halo. The experiment to be described below integrates and extends several distinct lines of prior research. The contributions of the present research to the performance appraisal literature are discussed next.

Contributions of the Present Research

Search and Method of Rating

The present study makes several practical and theoretical contributions to our understanding of the performance appraisal process, as well as the role of extra-role behaviors within that process. First, research has been quite limited concerning whether method of rating (by person versus by dimension) influences rating accuracy or halo. This is the case despite a long history of interest in the topic in personnel psychology (Symonds, 1925; Stevens & Wonderlic, 1934), as well as in educational measurement. The process of making performance ratings can be compared to the educational practice of grading essay examinations one exam at a time versus one question at a time. Most educational measurement specialists advocate the latter approach, as illustrated by Mehrens and Lehmann (1973), who wrote that "To reduce the halo effect..., we strongly recommend that teachers grade one question at a time rather than one paper (containing several responses) at one time" (p. 234).
Yet, in spite of the widespread acceptance of the superiority of grading "by question" (e.g., Coffman, 1971; Coker, Kolstad, & Sosa, 1988), no research studies were located which tested the effects of one method against the other.

In personnel psychology research, the research stream summarized by DeNisi and Williams (1988) suggests that different methods of rating should influence different measures of accuracy. DeNisi and Williams (1988) proposed that structuring information by person increases a rater's ability to measure the overall proficiency of each worker (differential elevation), but decreases the ability to accurately rate within-ratee differences on various performance dimensions (differential accuracy). If such effects are found in the present study, this would lead to important practical recommendations concerning method of rating and the purpose of appraisal (DeNisi, Cafferty, & Meglino, 1984). If the primary outcome desired from appraisal is an accurate rank ordering of candidates for purposes of promotion or merit increase, then the traditional method of rating by person would be recommended (to increase differential elevation; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982). If, however, the primary purpose of appraisal is for developmental purposes, i.e., providing feedback to employees about their performance on specific job dimensions, then rating by dimension would be recommended (to increase stereotype and differential accuracy; cf. Dickinson, 1987). This raises the interesting prospect that a single method of rating may be inadequate for multiple appraisal purposes (DeNisi & Williams, 1988, p. 139). Regardless of how the results come out, more research is needed on the effects of rating method on rater search, accuracy, and halo.

In-Role vs. Extra-Role Behaviors

The second major contribution of this research lies in its ability to measure the effects of in-role versus extra-role behaviors on performance ratings in a controlled laboratory environment. This is expected to strengthen the findings of Orr et al. (1989) and MacKenzie et al. (1991) that citizenship behaviors explain important variance in ratings beyond that explained by in-role performance. The implication of this is that most approaches to job analysis and performance appraisal are incomplete, because minimal to no attention has been paid to non-prescribed behaviors such as citizenship behaviors (MacKenzie et al., 1991). If results come out as predicted, this strengthens Organ's (1977, 1988a) argument that we need to change the way we view and define "performance".

Returning to the framework laid out by Ilgen et al. (in press) and discussed above, it can be seen that the current research focuses most heavily on the acquisition or search phase of information processing, although issues of the organization of information in memory, the retrieval of information from memory, and subsequent performance judgments are also addressed. Additionally, three of the four sources of variance cited by these authors are studied: a) rating scales (person vs. dimension formats), b) raters (i.e., their search strategies), and c) ratee effects (manipulations of ratee levels of in-role and extra-role performance). The importance of the final source of variation, setting or context, has been described or documented by numerous recent researchers (Mohrman & Lawler, 1983; Longenecker, Sims, & Gioia, 1987; Padgett, 1988; Longenecker, 1989).
However, it goes beyond the scope of the current research to study the effects of context on rater search and accuracy, and thus, as important as they are, these variables will not be explicitly measured or manipulated in the present research.

The preceding pages have summarized briefly the need for research on the effects of search and method of rating, as well as in-role and extra-role behavior, on rater search strategies and accuracy. The following figure shows this research in broad outline form, and the primary independent and dependent variables which will be manipulated or measured (see Figure 1). This will be explicated more clearly in Chapter 2.

[Figure 1: Overview of the Research Variables. The figure diagrams the independent variables (person- vs. dimension-blocked format, prior knowledge of format type, level of in-role performance, and level of OCB information) and the dependent variables (type of search, amount of search, elevation, differential elevation, stereotype accuracy, differential accuracy, halo, and overall performance ratings).]


CHAPTER 2: LITERATURE REVIEW AND HYPOTHESES

This chapter is organized as follows. First, research on rater effects (errors) and accuracy is briefly reviewed, including a presentation of Cronbach's accuracy measures. Second, recent performance appraisal research from a cognitive perspective (e.g., DeNisi et al., 1984) is discussed in some detail. The hypotheses concerning the accuracy of ratings made by person versus by dimension are then drawn from this research. Third, issues related to the proper measurement of performance are presented, followed by hypotheses on the effects of in-role versus extra-role behaviors on accuracy, error, and search measures. Following this, hypotheses are presented outlining the proposed interactions between method of rating, extra-role performance, and accuracy. Finally, a summary of the present research is provided.

Past Research on Error and Accuracy

Appraising employee performance continues to be of widespread interest to both researchers and practitioners in the field of personnel/human resource management (Wexley & Klimoski, 1984; Locher & Teel, 1988). As stated previously, early research on performance appraisal tended to emphasize either psychometric issues, i.e., how to make the rating instrument less prone to bias (cf., Landy & Farr, 1980), or training raters to avoid bias (e.g., Latham, Wexley, & Pursell, 1975; Bernardin & Pence, 1980). The most common dependent variables in such research were various measures of halo, leniency, or range restriction. Such measures were thought to provide indirect measures of rater accuracy. Recently, however, this view has been increasingly challenged (Cooper, 1981; Becker & Cardy, 1986; Fisicaro, 1988; Sulsky & Balzer, 1988; Murphy & Balzer, 1989).
An increasing body of literature, however, suggests that observed intercorrelations among dimensions are sometimes higher and sometimes lower than the true intercorrelations (Fisicaro, 1988; Murphy 6: Reynolds, 1988; Murphy 6: Jake, 1989). As noted by Murphy and Jake (1989), it is hard to see how a reliance on overall impressions could lead raters to W the true relations among dimensions. An important point related to the above discussion is that most halo measures do not take into consideration the true intercorrelations among dimensions (Murphy & Balzer, 1989). When true intercorrelations are 19 unknown, it is impossible to state unequivocally whether a high observed intercorrelation is halo egrg; or not. This is true for measures of leniency and range restriction as well. Murphy and Balzer (1989) conducted a meta-analysis using the raw data from nine performance appraisal studies where Cronbach's (1955) measures of accuracy were available. The mean correlation between six rater error measures and the Cronbach accuracy measures was r - .05 (r - .06, when corrected- for attenuation and sampling error). Murphy and Balzer argued that the most likely explanation for this weak link between error and accuracy measures was the inability of typical error measures to take true intercorrelations into account. They recommended that researchers discontinue use of error measures as indirect indicators of rating accuracy. AQQBEEQX- Beginning with Borman (1977; 1979) and Murphy, Garcia, Kerkar, Martin, & Balzer (1982), direct measures of rating accuracy have increasingly been used in performance appraisal research. For example, Murphy et a1. (1982) developed videotaped performance vignettes, then obtained "true scores” from expert raters who had enhanced opportunities to view ratee performance. The accuracy of subjects' ratings were then measured in comparison to these true scores using formulas developed by Cronbach (1955). Cronbach (1955) criticized earlier research on person perception because it relied on a global accuracy measure often referred to as D2 (Sulsky & Balzer, 1988). Such an index measures the squared difference between subject ratings and true scores averaged across all ratees and 20 dimensions (see Appendix A for all formulas). Cronbach argued that this index combines different aspects of rating accuracy, each of which can be useful in and of itself. Cronbach proposed that the overall distance between rater ratings and true scores be decomposed into four independent accuracy scores. From Cronbach and others (Murphy et a1, 1982; Dickinson, 1987; Kenny & Albright, 1987; Sulsky 6: Balzer, 1988), these distance score components can be defined as follows: Elavatiog: the component of accuracy due to the average or mean rating given by a rater, across all ratees and dimensions. This can be viewed as a general response set, in that it affects the way a rater rates every target on every trait. In ANOVA terminology, this is synonymous with the differential grand mean. Differential Elevation: the component of accuracy associated with the average rating given to each ratee, across all performance dimensions. This reflects a rater's ability to order ratees in comparison to their overall differences as specified by the means of their target ratings. In ANOVA terms, this is the differential main effect for ratees. Stareotype Accuraty: the component of accuracy associated with the average rating given to each performance dimension, across ratees. 
Cognitively-Based Performance Appraisal Research

Much of the performance appraisal research in the 1980s was more explicitly cognitive in orientation than earlier research, primarily because earlier research, while not without some measure of success, was still disappointing in all that it left unanswered (Landy, Zedeck, & Cleveland, 1983). Too often, raters of performance were viewed as "black boxes", where stimuli were presented, and responses recorded, yet little was learned about how raters formed their performance judgments (Wexley & Klimoski, 1984). Landy and Farr (1980) reviewed past research on performance rating, and concluded that researchers could more fruitfully focus on the cognitive processes of raters, in order to generate "more substantive propositions concerning where rater biases come from" (p. 96). Their paper seems to have generated considerable research and theorizing (cf., Feldman, 1981, 1986; Ilgen & Feldman, 1983; Wexley & Klimoski, 1984; DeNisi, Cafferty, & Meglino, 1984; DeNisi & Williams, 1988; Ilgen et al., in press). For example, DeNisi and Williams (1988) reviewed numerous cognitive models of the appraisal process, including those of Wherry (1952; Wherry & Bartlett, 1982), Feldman (1981), Ilgen and Feldman (1983), and DeNisi, Cafferty, and Meglino (1984). While these models differ in what they emphasize and the particular components included in each model, they are similar in proposing a multi-step process, where raters acquire, store, recall, and then combine performance information to make judgments.

In this section of the chapter, three related research streams will be highlighted. The first stream follows from Borman and Murphy, and has emphasized the Cronbach measures of accuracy as major dependent variables. The second stream comes out of the University of South Carolina (e.g., DeNisi et al., 1984), and has emphasized issues related to information acquisition, storage, and retrieval. Finally, research utilizing a methodology known as process tracing will be discussed. With this design, raters actively search for information, and the researcher can map the amount and type of search undertaken by each rater. This methodology will be used to study the effects of the experimental manipulations on rater search strategies (see Figure 1). Use of this methodology allows the current research to go beyond static measures of accuracy or halo, to tap aspects of the cognitive processes used by raters when making their ratings (Landy & Farr, 1980).

The Borman/Murphy Stream

Despite a long tradition of study, research on the accuracy of person perception "came to an abrupt and nearly complete halt after the publication of Cronbach's methodological critique" (Funder, 1987, p. 77).
Funder (1987) noted, however, that a limited amount of accuracy research continued after Cronbach (1955), most notably by industrial psychologists. Much of the credit for this can be attributed to Walter Borman. As noted above, Borman (1977) developed videotapes and expert true scores for use in a laboratory rating task. Citing Cronbach (1955), Borman (1977) argued that differential accuracy was the most appropriate measure for evaluating performance judgments. He defined differential accuracy (DA) as the ability to correctly rank order a target person's standing on a given trait (dimension), and measured this with a measure which correlated ratings with true scores for each dimension. However, Borman's DA measure is not equivalent to Cronbach's notion of differential accuracy, either in Cronbach's distance or correlational formula (see Appendix A). In fact, Borman's DA measure is insensitive to the distances between ratings and true scores, whereas Cronbach's distance measure is sensitive to this (Sulsky & Balzer, 1988). Further, it is important to view accuracy as a multidimensional phenomenon (Becker & Cardy, 1986). This was not done by Borman (1977; 1979), but has been done in research following Murphy et al. (1982).

Murphy et al. (1982) also developed a set of videotapes, which portrayed lecture vignettes by various graduate teaching assistants. Similar to Borman (1977), true scores were obtained by expert raters who had multiple opportunities to view the tapes. Unlike Borman, however, all four Cronbach measures of accuracy were calculated. Further, student raters were asked to make two types of ratings: a behavioral frequency rating, and a global evaluation. The frequency rating was similar to a behavioral observation scale (BOS; Latham & Wexley, 1981), where raters reported the frequency of observing 12 key behaviors. The global performance evaluation utilized Likert scales to evaluate eight trait-like dimensions of performance, and resembled a typical graphic rating scale. The major purpose of Murphy et al. (1982) was to demonstrate a significant relationship between the accuracy of observing behavior and the accuracy of rating overall performance. Support for this was found in that, for all of the Cronbach measures except stereotype accuracy, there were significant correlations between the accuracy of the frequency rating and the corresponding accuracy measure for the global evaluation. For the purposes of the present study, the major contributions of Murphy et al. (1982) are three-fold: a) establishing the first link between information acquisition activities and rating accuracy (DeNisi & Williams, 1988); b) successfully reintroducing all of Cronbach's accuracy measures into the personnel psychology literature; and c) developing videotapes which facilitated considerable further research on performance appraisal accuracy (see Murphy & Balzer, 1989).

Two studies which have used the Murphy tapes will be briefly discussed (Murphy & Balzer, 1986; Murphy, Philbin, & Adams, 1989). First, Murphy and Balzer (1986) predicted that a one-day delay between observation and rating would lead to increased halo (due to increased reliance on general impressions), and that this would lead to decreased accuracy. Elevation measures were not included in this study, since there was little evidence that memory-based ratings made under delayed conditions would be more or less lenient than those obtained immediately after observing performance. Results from this study indicated that the delayed ratings were more highly intercorrelated than those made immediately after observation; they were also significantly higher than the average "true" intercorrelations [e.g., true r = .44; observed r (delay condition) = .64]. However, for measures of stereotype and differential accuracy, the delayed ratings were more accurate than the immediate ratings. This was presented as further evidence for the weakness of error measures (e.g., halo) as primary indicators of rating accuracy.

The final study to be discussed by Murphy and his associates is Murphy, Philbin, and Adams (1989), which built on the two studies just mentioned. Murphy et al. (1989) compared the accuracy of immediate ratings with those made one, three, and seven days after observation of the tapes. They also varied the purpose of observation, i.e., half the subjects were told that performance evaluation was their sole task, while the remaining subjects were told that their primary task was to learn the content of the lectures, with performance evaluation as a secondary task. It was predicted that those for whom performance evaluation was a primary task would be more accurate than those focusing more on the lecture content. Somewhat disappointingly, this result was found only for the stereotype accuracy measure of the frequency ratings; none of the global performance appraisal measures showed significant differences. There was also an interaction between purpose of observation and delay, such that the greater stereotype accuracy of the "performance evaluation primary" subjects washed out by the seventh day. Murphy et al. (1989) interpreted this by utilizing the distinction between on-line and memory-based judgments presented by Hastie and Park (1986).

Overall, Murphy and his associates have emphasized memory issues in information processing. The present research will focus more particularly on information acquisition strategies, drawing on the research presented below from the University of South Carolina. What is instrumental in the work of Murphy and his associates is their refinement and use of the Cronbach (1955) accuracy measures as major dependent variables. Each Cronbach measure provides unique and valuable information concerning rater accuracy. Accordingly, this research will emphasize these measures as primary dependent variables. Other accuracy measures have been used in performance appraisal research (e.g., Lord, 1985; Padgett & Ilgen, 1989); however, the Cronbach measures are the most widely used (Sulsky & Balzer, 1988; Murphy & Balzer, 1989). Also of importance is the fact that none of the South Carolina research has utilized Cronbach's measures (cf., DeNisi & Williams, 1988). It is argued below that the Cronbach measures better capture the variables of interest in the South Carolina research, while providing additional information as well. Use of these measures is most conducive to tackling the "pervasive accuracy problem in performance appraisal research" (DeNisi & Williams, 1988, p. 109).

University of South Carolina Stream

One of the larger research streams dealing with cognitive influences on performance appraisal comes out of the University of South Carolina (DeNisi et al., 1983; K. Williams et al., 1985; Cafferty et al., 1986; DeNisi et al., 1989; K. Williams et al., 1990). Much of this research has focused on information acquisition strategies, although storage and retrieval issues have been addressed as well. Research on information acquisition will first be reviewed, followed by a discussion of the effects of method of rating (Symonds, 1925).
Research on information acquisition will first be reviewed, followed by a discussion of the effects of method of rating (Symonds, 1925).

Information Acquisition Strategies. DeNisi, Cafferty, Williams, Blencoe, and Meglino (1983) used a computer information board to determine what type of information raters would seek out in evaluating the performance of four hypothetical ratees. The highest percentage of raters (44%) organized their search by ratee, i.e., they sought information about how one worker performed across different tasks before seeking information on another worker's performance. Thirty percent of the raters organized their search by task, i.e., comparing different workers on one task before moving to another task. Eighteen percent of the raters tended to seek information on the repeated performance of one ratee on the same task, whereas 8% displayed no discernible pattern. From this and related research (Williams, DeNisi, Blencoe, & Cafferty, 1985), three information acquisition strategies have been identified: person-blocked, task-blocked, and nonblocked.

Cafferty, DeNisi, and Williams (1986, Study 1) found a similar rater preference for information about one person's performance across tasks before moving to another target. In Study 2, all performance incidents were presented to raters (rather than allowing for differential search). One-third of the subjects viewed the incidents arranged by ratee (person-blocked); one-third of the subjects viewed the incidents arranged by task (task-blocked); and one-third of the subjects viewed the incidents in a mixed or nonblocked manner. Cafferty et al. found that subjects in the person- and task-blocked conditions exhibited significantly more clustering of behaviors in recall according to their respective pattern of presentation. Subjects in the person-blocked condition recalled significantly more items than subjects in the other two conditions; however, they also recalled more incorrect items than those in the task-blocked condition (what Pitre & Sims, 1987, referred to as "gap filling"). Contrary to predictions, there was no difference by presentation pattern in the overall performance ratings given. However, the task-blocked condition led to significantly greater intra-ratee discriminability, i.e., actual performance differences on different tasks by each ratee were noted more accurately in the task-blocked condition than in the other two conditions (this corresponds to Cronbach's differential accuracy measure).

It has been speculated that person-blocking leads to greater reliance on global impressions, which in turn leads to greater halo bias (DeNisi & Williams, 1988). Blocking by task (or dimension) is expected to reduce this effect. Other research, however, makes this less than clear-cut. Specifically, Williams, DeNisi, Meglino, and Cafferty (1986) looked at initial appraisal purpose and subsequent performance ratings. Subjects viewed one of two videotapes, where the performance of carpentry tasks was blocked either by person or by task. In a memory-based rating made two days later, subjects who had previously made "deservedness" ratings of each ratee for extra work were only able to differentiate among ratee proficiency levels when they had acquired information in a person-blocked manner. Thus, while Cafferty et al. (1986) leads to the conclusion that task or dimension blocking leads to greater rating accuracy, Williams et al. (1986) appeared to contradict this.
DeNisi and Williams (1988) suggested that the distinction may be one of intra- versus inter-ratee discriminability, i.e., "While the use of person categories may increase impression formation tendencies and reduce intra-ratee variability, it may increase the rater's ability to assess the overall proficiency of each worker" (p. 139). In Cronbach's terminology, the prediction is that person blocking leads to greater accuracy than task blocking in terms of differential elevation (correctly rank ordering ratees), but less accuracy in terms of differential accuracy (correctly noting ratee patterns of performance).

A final study relevant to the discussion of person- versus task-blocking is DeNisi, Robbins, and Cafferty (1989). These authors measured the effects of diary keeping on subsequent recall and rating accuracy. Videotapes were assembled of individuals performing carpentry tasks, such that no two consecutive segments portrayed the same carpenter, the same task, or the same performance level, i.e., a nonblocked presentation pattern. Subjects were instructed to keep diaries either by person, by task, or "free" (as they wished); a fourth no-diary condition served as a control. DeNisi et al. (1989) found significant clustering in recall: more person clustering with person diaries, more task clustering with task diaries, and the lowest clustering in the no-diary condition. No differences in the accuracy of overall ratings were found between the person- and task-diary conditions. Also, contrary to the prediction of DeNisi and Williams (1988) above, intra-ratee discriminability (differential accuracy) was highest in the person-diary condition. One explanation for this is that DeNisi et al. (1989) made only small changes in ratee performance over a small number of tasks. When larger memory demands are placed on raters (as will be done in the present study), results are expected to follow Cafferty et al. (1986), i.e., greater differential accuracy for task/dimension blocking. At any rate, as DeNisi and Williams concluded (p. 146), the way in which raters organize information in memory does affect rating accuracy; however, more research is needed to determine what, if any, is the best type of organization for different appraisal purposes (i.e., by task, by person, or another schema not yet tested).

Rating Format versus Advanced Knowledge of that Format. The above research demonstrates that the manner in which raters acquire performance information influences various measures of rating accuracy. This is related to the question of whether method of rating influences accuracy; however, the two processes are conceptually distinct. Drawing on a model presented by DeNisi and Williams (1988, p. 137), it is proposed that an intervention prior to information search will affect acquisition strategies and subsequent information storage in memory, whereas an intervention focusing on the method of making ratings (by person versus by dimension) will primarily influence the retrieval of information already stored in memory (see Figure 2). The processes portrayed in Figure 2 can be described as follows. The manner in which individuals search for information influences the manner in which that information is subjectively organized in memory; person and task blocking are thought to be two types of schema used to organize information.
Next, all raters (regardless of how they subjectively organize information) are thought to store information in memory using both impressionistic and behavioral codes (Hastie & Park, 1986; Srull & Wyer, 1989; Williams et al., 1990). It is predicted that blocking patterns influence the salience of impressionistic versus behavioral data, i.e., impressionistic codes will be more salient under person-blocking, with behavioral codes more salient under task/dimension-blocking (DeNisi & Williams, 1988). These effects are then thought to influence judgment tasks, such as those involved in rating performance. Finally, once ratings have been made, these ratings are thought to influence the subsequent categorization of performance, as well as future information search patterns (cf. Murphy, Balzer, Lockhart, & Eisenmann, 1985).

[Figure 2: A Model of the Influence of Person- Versus Dimension-Oriented Information on Performance Appraisal Ratings. The figure links processing objectives, information search patterns, the subjective organization of performance information in memory (impressionistic and behavioral codes), the categorization of performance, and the ratings given.]

Figure 2 can be used as a framework for the research already presented. First, arguments for the superiority of rating by dimension versus by person (Symonds, 1925; Stevens & Wonderlic, 1934; Cooper, 1981) focus on the end of the process depicted in Figure 2. Even if implicitly, it is assumed that forcing a dimension structure on ratings will reduce halo and increase accuracy. By default, this approach assumes that method of rating will influence accuracy via its influence on retrieval processes. On the other hand, the research on information acquisition focuses on the beginning of the process shown in Figure 2, arguing that accuracy will be most influenced by issues related to information acquisition and storage.

While many additional questions could be addressed building on the South Carolina research, this study will focus on just one of the issues they raised, i.e., the extent to which raters with prior knowledge of the rating format to be used when making ratings are more accurate than those who learn of the format just prior to the rating task. The assumption is that those who know the rating format prior to information acquisition will use that format as a cue or frame when organizing information in memory (Williams et al., 1990). In terms of Figure 2, this prior knowledge has been placed in the "Processing Objectives" box, under the assumption that this knowledge will directly influence information search patterns and the subjective organization of information in memory.

In this research, issues related to method of rating and information acquisition will be addressed in a 2 x 2 between-subjects design. Half of the subjects will make their ratings "by person", half will rate "by dimension" (from here on, the term "dimension-blocked" will be used instead of "task-blocked", to denote an organizing pattern or format centered around appraisal dimensions, rather than the traditional person-centered approach; Cooper, 1981). Further, half of the subjects in each group will be shown the format to be used prior to the opportunity to search for information (see Figure 3). This design allows for tests of the main effects for rating format and prior knowledge of that format, as well as any interactions, for each Cronbach accuracy measure. As noted earlier, Cooper (1981) cited four studies where the expected halo differences when rating by person versus by dimension were not found.
What is interesting, however, is that in each of these studies, the rating format was presented only at the time when ratings were required. Thus, the effects of prior knowledge of the rating format were not tested (DeNisi & Williams, 1988).

There is mixed evidence concerning whether familiarity with the rating scale per se leads to differences in halo or accuracy. Positive results were obtained by Bernardin and Walter (1977), who found that raters who received behaviorally-anchored rating scales (BARS) before observing behavior demonstrated less halo and greater inter-rater reliability than raters who received the scales after observation.

                                        Format
                                By Person        By Dimension
    Advance Knowledge    Yes    Condition 1      Condition 2
    of Format            No     Condition 3      Condition 4

    Note: The within-subjects factors are manipulated within each of the
    four between-subjects conditions.

    Figure 3: The Between-Subjects Manipulations of Format Type and Prior
    Knowledge of Format

On the other hand, Cardy et al. (1987) found no effects for familiarizing raters with the BARS scales used in their study. However, these authors defined accuracy utilizing Borman's differential accuracy measure, rather than the more complete Cronbach measures. Also, the results of both Bernardin and Walter (1977) and Cardy et al. (1987) relate only to ratings made by person; their research was not designed to tap person versus dimension differences.

DeNisi and Summers (1986) provided an initial answer to this question. Using the same videotaped carpentry tasks as Williams et al. (1986) and DeNisi et al. (1989) above, they constructed two different rating scales, one focusing on the tasks or behaviors performed, and the other on general trait dimensions such as motivation, neatness, and attention to detail. Additionally, subjects were shown the rating scale they would use either prior to or after observing the videotapes. Results were that subjects with advanced knowledge of the rating format demonstrated an increased level of organization in memory, as well as increased accuracy of recall. However, a major weakness of this study was that overall rating accuracy could only be calculated for subjects in the "task scale" condition, since no "true scores" (in the Cronbach sense) were available for the trait ratings. DeNisi and Summers (1986) found that, within the task scale condition, rating accuracy was greatest for those with advanced knowledge of the scale. What is needed, however, and what the current study does, is to compare the effects of both rating format and prior knowledge of that format on rating accuracy.

Before discussing research propositions concerning information acquisition and method of rating, some previous process tracing research will be briefly discussed. As noted above, this approach allows the researcher to measure the amount and type of search engaged in by each rater. This should greatly aid in understanding the effects of the experimental manipulations on rater search and accuracy. In the current research, raters will have access to a person by dimension matrix for a number of ratees. It is helpful to understand this methodology before discussing the specific predictions made in this study.

Process Tracing Research. Several of the studies mentioned above used computer-controlled information boards to examine differences in search/acquisition strategies (DeNisi et al., 1983; Williams et al., 1985, Study 2; Cafferty et al., 1986, Study 1).
This methodology comes out of research on decision making, and is generally referred to as "process tracing" research (see Ford et a1. , 1989, for a recent comprehensive review). A major advantage of an information board methodology is that it allows for the examination of the decision maker's depth and pattern of information search. Even though cognitive models of performance appraisal view raters as active seekers and processors of ratee performance information (e.g., Ilgen 6: Feldman, 1983) , the implicit assumption in much previous research (e.g., Zedeck 6: Kafry, 1977) is that raters are passive receptors of information, not playing an active role by expending effort to obtain information about each ratee. Kozlowski and Ford (1991) argued that this provides limited fidelity to the way raters actually receive information, 37 and that process tracing is needed in order to investigate the manner in which raters attend to and acquire performance-relevant information. Kozlowski and Ford (1991) recently completed two studies of rater acquisition strategies using a computer-controlled information board. Subjects rated the performance of 12 police officers, six of whom were consistently good performers, and six of whom where consistently poor performers. Prior to the information search task at the computer, subjects received "personnel files” for each ratee, where the amount of information available on each ratee was systematically varied, i.e., the number of "critical incidents” listed for each ratee ranged from zero to 18 items (ratee performance level and prior ratee information were within- subjects factors). Subjects were then instructed to search the computer for as little or as much additional information as necessary to make accurate ratings; information was available in a ratee by dimension matrix. In Study 1, subjects were placed under four levels of search constraint, i.e., they were told they could access up to 25%, 50%, 75%, or 100% of the available items. Kozlowski and Ford reported three major findings. First, the amount of prior ratee information influenced.search, such that the greater the prior information available, the less the subsequent search. Second, as expected, search constraint influenced acquisition, with those under lower constraints seeking significantly more items. Third, a performance level x prior ratee information interaction indicated a tendency for raters to seek more information for poor versus good performers, when the level of prior information was high. 38 In Study 2, Kozlowski and Ford used only the 25% and 100% search constraints, but added four levels of time delay between viewing the personnel files and the opportunity to search and then make appraisal ratings (0, 1, 4, and 7 days). The results largely replicated those in Study 1, as well as showing a tendency for raters in the no-delay condition to seek more information when prior information was low than subjects in the delayed conditions. This last finding is consistent with the notion that memory-based information is more influenced by general evaluations or impressions (Hastie & Park, 1986). A limitation of their study noted by Kozlowski and Ford is that the complexity of their design prevented them from varying ratee performance levels. Because ratee performance was either consistently good or poor, there was no point in linking patterns of information acquisition to particular rating outcomes. 
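To make the information-board methodology concrete, the following is a minimal sketch, in Python, of the kind of record such a board produces. It is purely illustrative: the class and variable names are invented, and the actual software used in the studies reviewed above and in the present study (described in Chapter 3) was custom-built. The point is simply that every cell a rater opens in the ratee-by-dimension matrix can be logged in order, and that a search constraint caps how much of the matrix may be examined.

    class InformationBoard:
        """Hypothetical sketch of a computer-controlled information board."""

        def __init__(self, incidents, search_constraint=1.0):
            # incidents maps (ratee, dimension) pairs to behavioral incidents
            self.incidents = incidents
            # Search constraint: maximum proportion of cells that may be opened
            self.max_accesses = int(search_constraint * len(incidents))
            # Ordered log of (ratee, dimension) accesses; the raw material for
            # process-tracing analyses of type and amount of search
            self.access_log = []

        def open_cell(self, ratee, dimension):
            """Return the incident stored in a cell, recording the access."""
            if len(self.access_log) >= self.max_accesses:
                raise RuntimeError("Search constraint reached")
            self.access_log.append((ratee, dimension))
            return self.incidents[(ratee, dimension)]

    # Example: a person-blocked search, examining one ratee across two dimensions
    board = InformationBoard({
        ("Ratee 1", "Attendance"): "Arrived on time every day this week",
        ("Ratee 1", "Typing"):     "Completed a long report with no errors",
    })
    board.open_cell("Ratee 1", "Attendance")
    board.open_cell("Ratee 1", "Typing")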
Differences in within-ratee performance are needed in order to address issues of rating accuracy, at least for the Cronbach (1955) accuracy measures (cf. Padgett & Ilgen, 1989). Similar to Kozlowski and Ford (1991), a computer-controlled information board will be used in the present research. Ratee performance patterns will be constructed to portray high, average, and low levels of overall (in-role) performance. In addition, within-ratee performance differences on various dimensions will be built in, in order to meaningfully test for rater differences in stereotype and differential accuracy.

Hypotheses Concerning Information Acquisition/Method of Rating

In this section, the hypotheses relating to information acquisition and method of rating will be presented, broken down by dependent variable. Hypotheses concerning the two process variables will be discussed after this.

Overall Ratings. The South Carolina research has produced conflicting results concerning the effects of acquiring information by person versus dimension on the accuracy of overall (or summary) performance ratings given to each ratee. For example, Williams et al. (1986) found some increase in overall rating accuracy for subjects in their person-blocked condition. DeNisi et al. (1989) found that keeping a diary of any sort (i.e., person-blocked, task-blocked, or free) resulted in greater accuracy than the no-diary condition; however, no differences were observed between the person- and task-blocked conditions. Finally, Cafferty et al. (1986) found no differences in overall ratings between their person-blocked, task-blocked, and mixed conditions. Intuitively, one might expect the accuracy of overall ratings to be better for those subjects who rate by person (i.e., a pattern similar to the one presented below for differential elevation). However, the results from the studies just cited are sufficiently discouraging so as to necessitate making no prediction regarding overall ratings.

Error. Measures of halo will be computed so that the results of this study can be compared with previous studies (Bernardin & Walter, 1977; Cooper, 1981; Murphy & Balzer, 1989; Murphy & Jako, 1989). Only one prediction will be made in the current study (see Figure 4), i.e.,

H1: Halo effect - Subjects rating by person will exhibit more halo than those rating by dimension.

Given the results of previous studies described by Cooper (1981), this effect is not expected when subjects are unaware of the rating format to be used; however, when the format is known, greater halo is expected when rating by person, with lower halo when rating by dimension.

[Figure 4: Predictions Concerning Type of Format, Advanced Knowledge of Format, and the halo and accuracy measures (elevation, differential elevation, stereotype accuracy, and differential accuracy).]

Accuracy. Of major concern in this research are the effects of the between-subjects manipulations on three of Cronbach's distance score accuracy measures. The fourth aspect of accuracy, elevation, will be measured, but similar to prior research (Murphy & Balzer, 1986; Murphy et al., 1989), no differences between groups are expected based on format or prior knowledge of format. For the remaining measures, the general prediction is that ratings made by person will be more accurate in terms of differential elevation, but less accurate in terms of stereotype and differential accuracy than ratings made by dimension (DeNisi & Williams, 1988).
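Because the Cronbach (1955) distance-score components figure so centrally in the hypotheses that follow, a brief computational sketch may be helpful. The Python functions below implement the standard decomposition of the rating-minus-true-score distance into elevation, differential elevation, stereotype accuracy, and differential accuracy (as described by Sulsky & Balzer, 1988), along with one common operationalization of halo (the median intercorrelation among dimension ratings). The function and variable names are illustrative only, not those used in this study; squared components are returned, and lower values indicate greater accuracy.

    import numpy as np

    def cronbach_components(ratings, true_scores):
        """Cronbach's (1955) distance-score accuracy components.

        ratings and true_scores are ratee-by-dimension arrays.
        Lower values indicate greater accuracy.
        """
        x = np.asarray(ratings, dtype=float)
        t = np.asarray(true_scores, dtype=float)

        x_gm, t_gm = x.mean(), t.mean()                       # grand means
        x_ratee, t_ratee = x.mean(axis=1), t.mean(axis=1)     # ratee (row) means
        x_dim, t_dim = x.mean(axis=0), t.mean(axis=0)         # dimension (column) means

        # Elevation: error in the overall level of the ratings
        E2 = (x_gm - t_gm) ** 2
        # Differential elevation: error in ordering the ratees overall
        DE2 = np.mean(((x_ratee - x_gm) - (t_ratee - t_gm)) ** 2)
        # Stereotype accuracy: error in ordering the dimensions overall
        SA2 = np.mean(((x_dim - x_gm) - (t_dim - t_gm)) ** 2)
        # Differential accuracy: error in each ratee's pattern across dimensions
        x_res = x - x_ratee[:, None] - x_dim[None, :] + x_gm
        t_res = t - t_ratee[:, None] - t_dim[None, :] + t_gm
        DA2 = np.mean((x_res - t_res) ** 2)

        return {"elevation": E2, "differential elevation": DE2,
                "stereotype accuracy": SA2, "differential accuracy": DA2}

    def halo_index(ratings):
        """Median intercorrelation among dimensions, computed across ratees."""
        r = np.corrcoef(np.asarray(ratings, dtype=float), rowvar=False)
        return np.median(r[np.triu_indices_from(r, k=1)])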
What the general prediction above fails to make clear, however, is the extent to which these accuracy effects are brought about by the rating format per se versus prior knowledge of that format. Summarizing the research described above, the research proposition guiding this study is that there will be an effect for rating format, but that this will operate primarily in conjunction with prior knowledge of that format. These predicted effects and interactions are shown graphically in Figure 4 (note that the Cronbach distance score measures depict the distance away from true score estimates; therefore, the lower the values, the greater the rater accuracy). Specifically, the predictions are:

H2a: Differential Elevation - Subjects rating by person will be more accurate in correctly rank ordering ratees than subjects rating by dimension. Further, there will be a disordinal interaction (Keppel, 1982), such that accuracy is best for subjects with advanced knowledge that they will be rating by person, and worst for subjects with advanced knowledge that they will be rating by dimension.

For subjects knowing they will rate by dimension, the expectation is that this knowledge will induce raters to spend so much time attending to ratee performance on specific dimensions that they will "lose sight" of the overall rank ordering of ratees. The predictions for stereotype accuracy and differential accuracy are essentially the reverse of that for differential elevation, i.e.,

H2b: Stereotype Accuracy - Subjects rating by dimension will be more accurate in correctly ranking the dimensions than subjects rating by person. There will be a disordinal interaction, such that accuracy is best when subjects know in advance that they will rate by dimension, and worst when subjects know in advance that they will rate by person.

Advanced knowledge of a person-blocked rating format is expected to increase raters' reliance on overall impressions, thus decreasing the attention given to ratee performance on specific dimensions. Similarly,

H2c: Differential Accuracy - Subjects rating by dimension will be more accurate in correctly noting individual ratee patterns of performance than subjects rating by person. There will be a disordinal interaction, such that accuracy is best when subjects know in advance that they will rate by dimension, and worst when subjects know in advance that they will rate by person.

Cronbach (1955) also proposed correlational measures of rating accuracy. For completeness, these measures will be calculated, and the above three hypotheses (H2a-H2c) will be tested with these measures as well. However, the primary focus will be on the distance score measures, as these are the most widely used (Murphy & Balzer, 1989), and they correspond most closely to the generally accepted notion of rating accuracy (in contrast to rating validity; Sulsky & Balzer, 1988).

Process Variables. A distinct advantage of process tracing research is that it allows the researcher to examine the pattern and depth of each rater's information search (Ford et al., 1989). In the present context, it can be tested whether prior knowledge of the rating format affects the pattern or type of search, as well as the depth or amount of search exhibited by each rater. Concerning type of search, prior research demonstrates that, for a majority of raters, their "natural tendencies" are to search for information by person (DeNisi et al., 1983; Cafferty et al., 1986).
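As an illustration of how these two process variables can be operationalized from an information-board access log, the sketch below computes a simple type-of-search index (the proportion of transitions that stay with the same ratee, versus the same dimension) and an amount-of-search index (the proportion of available cells that were opened). These particular indices are offered only as an assumption-laden illustration; the measures actually used in this study are defined in Chapter 3.

    def search_metrics(access_log, n_ratees, n_dimensions):
        """Summarize type and amount of search from an ordered access log.

        access_log is a list of (ratee, dimension) tuples in the order accessed.
        """
        transitions = list(zip(access_log, access_log[1:]))
        n_trans = max(len(transitions), 1)
        same_ratee = sum(1 for (r1, _), (r2, _) in transitions if r1 == r2)
        same_dim = sum(1 for (_, d1), (_, d2) in transitions if d1 == d2)

        return {
            # Person-blocked search keeps returning to the same ratee
            "person_blocked_rate": same_ratee / n_trans,
            # Dimension-blocked search keeps returning to the same dimension
            "dimension_blocked_rate": same_dim / n_trans,
            # Depth of search: share of the ratee-by-dimension matrix examined
            "amount_of_search": len(set(access_log)) / (n_ratees * n_dimensions),
        }

    # Example: a rater who works through Ratee 1 before turning to Ratee 2
    log = [("R1", "Attendance"), ("R1", "Typing"), ("R2", "Attendance")]
    print(search_metrics(log, n_ratees=6, n_dimensions=6))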
Given these natural tendencies, subjects without prior knowledge of the format to be used are expected to display a general tendency to search by person. Subjects who receive advanced knowledge concerning the format to be used are expected to display stronger tendencies to search for information in a manner consistent with that format. Figure 5 displays the type of interaction that is predicted.

[Figure 5: Predictions Concerning Type of Search and Amount of Search, by format type (person versus dimension) and prior knowledge of format.]

H3: Type of Search - Prior knowledge of format type will influence subsequent search activities, such that those knowing they will rate by person will search more by person, and those knowing they will rate by dimension will search more by dimension. Raters without such prior knowledge will be more likely to search for information by person.

Concerning amount of search, previous research has found a general tendency on the part of most raters to make appraisal decisions based upon a limited amount of information. For example, in DeNisi et al. (1983), subjects who were allowed unlimited search of performance information requested fewer than half the available items before making their appraisal decisions. Kozlowski and Ford (1991) found similar results. Subjects in their unlimited search condition searched for an average of 62% and 43% of available items (in Studies 1 and 2, respectively). This same search tendency is expected in the present study as well. In particular, this limited search is expected for those without prior knowledge of the rating format. However, providing prior knowledge of either a person or a dimension format is expected to differentially affect amount of search (Figure 5). Building on research in cognitive psychology (Hastie & Park, 1986; Srull & Wyer, 1989; DeNisi & Williams, 1988), it is expected that subjects who know they will rate by person will be more likely to store information in memory "by person". This is expected to increase the salience of global impressions (i.e., the impressionistic codes presented in Figure 2), which is then expected to decrease the overall amount of information search. On the other hand, those who know in advance that they will be rating by dimension are expected to display an above average amount of search.

H4: Amount of Search - All subjects will display a tendency to voluntarily end their information search before they have accessed all available information. This tendency will be more pronounced for subjects who know they will rate by person, and less pronounced for those who know they will rate by dimension.

Once again, a strength of the process tracing methodology is that it allows the researcher to measure depth and pattern of search in some detail. It is hoped that the results for these two "process" variables will be useful in explaining the accuracy results that are obtained, as well as testing the underlying assumptions presented in Figure 2 (i.e., the dotted boxes for subjective organization in memory and categorization of performance).

In-Role Versus Extra-Role Behaviors

The preceding pages have summarized the research and hypotheses concerning the effects of information acquisition and method of rating on rater accuracy and error measures, as well as on amount and type of search. This section of the chapter focuses on the influences of in-role and extra-role performance on these same variables.
While the preceding pages have proposed effects for person— versus dimension-oriented formats on search and accuracy measures, the following pages address the issue of whether the right dimensions of performance are being measured in most current approaches to performance appraisal. Following Organ (1988b), it is expected that dimensions capturing both in-role and extra-role performance will be used by managers when they make their ratings. Further, both search and accuracy measures are expected to be influenced by these experimental manipulations. As stated in Chapter 1, the rationale for studying search/method of rating and in—role/extra-role performance in the same study is the proposed interaction between method of rating, extra-role performance, and accuracy. This interaction is discussed below, after the literature and hypotheses concerning in—role and extra-role performance have been presented. As depicted in Figure 1, the level of in—role and extra-role performance will be manipulated as within-subj ect factors. Issues related to measuring performance and organizational citizenship will be discussed next, followed by specific hypotheses relating to these variables. P rf an It is axiomatic that there ought to be a strong link between the content of performance appraisal instruments and the content of the job(s) 47 being appraised (Wexley' & 'Yukl, 1984). For example, the Unifgtm Qaidelinas on Eaplgyee SeLettioa gracedataa (1978) state: ”There shall be a job analysis which includes an analysis of the important work behaviors required for successful performance... "(Sec. 14.C.2). Many efforts have been made to link performance appraisal instruments directly to written job descriptions (e.g., Buford, Burkhalter, & Jacobs, 1988). Indeed, researchers over the past 30 years have spent considerable effort trying to improve behavior— and results- oriented measures of performance (Landy & Farr, 1980; Latham & Wexley, 1981). Among other things, this research has sought to increase rating accuracy, improve feedback to employees, and ensure that employers comply with legal requirements. Concerning this latter point, policy-capturing research has demonstrated that organizations are more likely to win employment discrimination lawsuits in court if their appraisal system is based on measures of behaviors or results, rather than on measures of broad traits (Feild & Holley, 1982; Werner, 1992). Given that behavior and results measures of performance have been advocated since the 19603 (e.g., Smith & Kendall, 1963; Odiorne, 1965), and that current legal guidelines and court decisions emphasize the importance of organizations using such measures, the use of graphic- or trait-oriented rating scales would seem hard to justify. Yet, trait- oriented scales continue to be widely used. When Werner (1992) combined his data with that from Feild and Holley (1982), over 60% of the codable cases (66 of 107) involved ttait-Qriented appraisal formats. Further, Locher and Teel (1988) surveyed over 300 organizations, and found that 48 almost 70% of them used either graphic ratings scales or essay evaluation of employees. Why do graphic or trait-oriented rating scales continue to be so widely used? Certainly, much of the explanation for this is that such scales are inexpensive and easy to administer. 
Additionally, some managers and organizations may not know about the legal ramifications of performance appraisal, or may feel that the likelihood of getting taken to court for a poorly designed appraisal system is relatively small (Werner, 1992). However, it is questionable whether such explanations fully capture the discrepancy between what has been recommended in this area and what is actually practiced. Two additional explanations are proposed: 1) trait ratings are often preferred.by raters because they parallel the manner in which individuals form impressions and retain information in memory (Cantor & Mischel, 1977; Srull & Wyer, 1989), and 2) managers (and other raters of performance) often have definitions of ”performance” that go beyond an employee's performance of his or her stated job duties (Organ, 1977), i.e., what is here labelled ”inerole performance”. As developed below, trait-oriented scales are thought to capture elements of employee ”citizenship behaviors" (Bateman & Organ, 1983) which managers consider important for overall effectiveness, yet which may not fit precisely within the role-prescribed behaviors of a written job description. thanizational Citiaeaahia Bahavtot. Organ (1977) listed a number of things organizations are likely to value beyond some minimally- acceptable level of productivity, including employee predictability, cooperation, and general tendencies toward compliance. Such behaviors 49 were described as "the glue which holds collective endeavors together” (p. 50). Later, such behaviors were labelled organizational citizenship behaviors, or OCB, and this was thought to include such actions as cooperating with coworkers, working to improve the organization, and accepting special orders without complaint (Smith, Organ, 6: Near, 1983; Bateman 6: Organ, 1983; Organ 6: Konovsky, 1989) . Katz (1964; Katz & Kahn, 1966) described three types of behaviors which were considered essential for a functioning organization: a) people must be induced to enter and remain with the organization; b) they must reliably carry out specific role or job requirements; and c) there also needs to be innovative and spontaneous activity that goes beyond role prescriptions. Citizenship behaviors are at times part of an employee's role or job, e.g. , courtesy toward customers may be prescribed behavior for sales personnel (Brief 5: Motowidlo, 1986). However, most citizenship behavior is viewed as "supra-role”, i.e., behavior which cannot be prescribed or required in advance for a given job (Bateman 6: Organ, 1983) . Recently, Organ defined OCB as “individual behavior that is discretionary, not directly or explicitly recognized by the formal reward system, and in the aggregate promotes the efficient and effective functioning of the organization. . . . (T)he behavior is not an enforceable requirement of the role or job description, ... it is rather a matter of personal choice, such that its omission is not generally understood as punishable. . . (and) returns (to the individual should) not be contractually guaranteed by any specific policies and procedures" (Organ, 1988b, pp. 4—5). 50 Organizational citizenship behavior can be distinguished from the broader concept of prosocial organizational behavior. 
Brief and Motowidlo (1986) reviewed prosocial organizational behavior, and argued that such behaviors can be: a) either functional or dysfunctional for the organization; b) either role—prescribed or extra-role; and c) directed at various targets (coworkers, customers, the organization as a whole). Given Organ's definition above, OCB can be viewed as a subset of prosocial organizational behavior which is functional for the organization and generally extra-role in nature. There is still an on—going debate as to exactly what constitutes OCB. Early research by Bateman and Organ (1983) viewed it as a unidimensional construct. However, a factor analysis of a different OCB measure by Smith et al. (1983) identified two factors: Altruism and Generalized Compliance. Altruism included behaviors "directly and intentionally aimed at helping a specific person in face-to-face situations" (p. 657), whereas Generalized Compliance ”pertains to a more impersonal form of conscientiousness that does not provide immediate aid to any one specific person, but rather is indirectly helpful to others involved in the system" (p. 657). Organ (1988b) later stated that this factor is more aptly labelled ”Conscientiousness", to put a greater emphasis on its inner-directedness. Recently, Organ (1988b) proposed five categories of OCB: Altruism, Conscientiousness , Sportsmanship , Courtesy , and Civic Virtue . Sportsmanship was identified by Williams, Podsakoff, and Huber (1986) in a reanalysis of the Bateman and Organ (1983) scale, and is characterized primarily by actions which people refrain from doing (e.g., avoiding 51 complaining and petty grievances). Courtesy is conceived of as ”touching base” with others who will be affected by a decision, passing along information, briefing, reminders, etc. Finally, Civic Virtue is defined as active participation in the political life of the organization (Graham, 19 86), and includes involving oneself in understanding and discussing current organizational issues and concerns. Support for this OCB categorization scheme was found by MacKenzie et al. (1991). In a different vein, Graham (1986; 1989) drew on writings in classical political philosophy to propose four major factors of OCB. The first two factors, Interpersonal Helping and Personal Industry, are similar to the Altruism and Conscientiousness factors described above. Graham's third factor, Individual Initiative, is an operationalization of organizational participation, and parallels what Organ (1988b) labelled Civic Virtue. The fourth factor is called Loyalty, and involves defending the interests and reputation of the organization to outsiders. Support for this four-factor structure was found in separate factor analyses by Graham (1989) and Karambayya (1990). In an attempt to make sense of these different categorization schemes for OCB, the current research will use the categories proposed by L. Williams (1988). Williams (1988) collapsed the above categories from Organ (1988b) and Graham (1986) into two broader categories which represented: l) OCBs which benefit the general maniaatian (e.g., carrying out role requirements well beyond the norm or minimum required levels), and 2) OCBs which immediately benefit specific indivtduala (e.g- , helping another person with an organizationally—relevant problem), whiCh ultimately contributes to the organization. These two categories 52 were labelled OCBO and OCBI, respectively. 
It should be noted that these categories parallel Bateman and Organ's (1983) factors of Conscientiousness and Altruism. Their primary advantage, however, is parsimony, in that they are broad enough to encompass the other OCB factors which have been proposed. For example, Organ's Sportsmanship and Civic Virtue factors, and Graham's Individual Initiative and Loyalty factors can all be incorporated into the OCBO factor, whereas Organ's Courtesy factor fits into the OCBI factor. Williams (1988) developed questionnaires designed to tap in-role behaviors (IRB) , organizational citizenship behaviors directed toward the organization as a whole (OCBO), and organizational citizenship behaviors directed toward specific individuals (OCBI). Factor analyses were conducted on samples of self-ratings, peer ratings, and supervisor ratings (the self-ratings were from employed MBA students). For each sample, a clean three-factor solution emerged which strongly supported the a priori categories. For the present research, the significance of Williams (1988) is two-fold: a) two broad categories of organizational citizenship behavior were identified, and b) Williams (1988) was the first to demonstrate empirically that survey measures of OCB were capturing something distinct from traditional in-role performance. Past research that had attempted to do this produced equivocal results (O'Reilly 6: Chatman, 1986; Puffer, 1987). The importance of this is that Williams' (1988) results strengthen the argument that OCB measures are doing more than simply capturing "old wine in new wineskins", i.e. , OCB measures are not merely replicating supervisors' ratings of in-role performance. This 53 is of considerable importance for the present study, where OCB is incorporated within a performance appraisal context. QQB aad Eatfgtaange Apatataal. Much of the research and writing to date on organizational citizenship behavior has emphasized definitional issues (Smith et a1., 1983; Graham, 1986, 1989; Organ, 1988b; VanDyne 6: Cummings, 1990), or the relationship of OCB to such perceived antecedents as satisfaction and organizational commitment (Bateman 6: Organ, 1983; O'Reilly 6: Chatman, 1986; Organ, 1988a; L. Williams, 1988). Very little of this work has pertained directly to performance appraisal. The following paragraphs describe the research and theorizing on OCB which most directly relates to the current study. Puffer (1987) measured the relationships between prosocial behaviors, noncompliant behaviors, and sales commission for a sample of retail sales personnel whose pay was based entirely on commission. In essence, prosocial behaviors were positive citizenship behaviors (e.g., handling a postsale problem for another salesperson), and noncompliant behaviors were negative or anti-citizenship behaviors (e.g., not doing one's "fair share" of customer phone calling; cf. , Fisher 6: Locke, 1990). Puffer found that sales performance (adjusted for hours worked and standardized by store) correlated .16 with prosocial behavior, and —.23 with noncompliant behavior (both p < .05). The former correlation is identical to what was obtained by MacKenzie et a1. (1991) when they correlated their Altruism factor with their measure of objective performance. These results suggest that there 13 a relationship between citizenship behaviors and objective measures of performance; the strength of this relationship, however, appears to be fairly modest. Organ (1988b, 54 p. 
41) interpreted this as evidence for the distinctness of OCB, i.e., that it is related to, but not the same thing as objective, in-role performance. If raters can reliably distinguish between in-role and extra-role performance (L. Williams, 1988), and these two types of behaviors are at most only moderately intercorrelated (Puffer, 1987; MacKenzie et a1. , in press), then the next thing that would be useful to establish is that raters do in fact utilize both types of information when making performance ratings. As indicated in Chapter 1, this issue has been partially addressed by Orr et a1. (1989) and MacKenzie et a1. (1991). For example, Orr et a1. (1989) conducted a policy-capturing study, where 17 managers assigned a dollar value of performance to 50 hypothetical ratees. Ratee profiles varied on 13 dimensions of work behavior, 10 of which were prescribed or in-role behaviors, and three of which were considered to be citizenship behaviors (team cooperation, contribution to morale, and company orientation). Ten raters had significant regression weights on the citizenship behaviors, indicating that they used the citizenship behaviors (in addition to the in—role behaviors) to estimate the dollar value of performance. These raters also demonstrated a higher R2 (.81 vs. .68, p < .05) than the remaining raters, indicating that using both types of cues facilitated them in accounting for more of the variance in dollar value of performance. Orr et a1. (1989) demonstrated in the context of utility analysis than many managers consider citizenship behaviors when evaluating the value of employee performance. MacKenzie et al. (1991) reported a similar finding in a performance appraisal context. MacKenzie et al. used a large 55 sample of insurance agents from an organization where three objective measures of performance were already in use. Managers of these agents were first asked to subjectively rate each agent's performance. Next, they rated the organizational citizenship of each agent. The OCB questionnaire included four factors: Altruism, Civic Virtue, Courtesy, and Sportsmanship. A LISREL analysis revealed that Objective Performance, Altruism, and Civic Virtue had significant relationships with the subjective ratings (p < .01), with an overall R2 of .45. The two OCB factors explained as much variance in the ratings as did the objective performance ratings. Further, followbup analyses revealed that this was not due to same—source (common method) bias between the ratings of performance and OCB. A weakness of MacKenzie et a1. (1991) is that objective performance measures are necessarily deficient (Wexley &'Yukl, 1984); much that would be considered "in-role performance” will not be picked up by such measures. MacKenzie et a1. urged that a replication of their study be conducted with other samples. The current study will do some of this, utilizing a different test of the same notion. In line with past research on OCB (Smith et a1., 1983; Bateman & Organ, 1983; L. Williams, 1988), this study will focus on inrrole versus extra-role tahaviots. Specifically, these behaviors will.be available to raters in.the dimension by ratee matrix described above. Keeping the behavioral focus consistent for both inrrole and extra-role performance is more in line with Organ's (1988b) conception of these variables. It is hoped that, by providing a different test of the conceptualization underlying MacKenzie et a1. (1991), the results of this study will supplement their results. 
56 In his book on organizational citizenship, Organ (1988b) discussed how OCB might influence the performance appraisal process. Organ argued for relatively simple performance appraisal forms, with six or fewer dimensions to be rated. According to Organ, the first dimension or two should focus on quantitative inrrole productivity and technical excellence. The next dimensions "might capture facets of contribution that straddle the 'boundary ‘between in—role performance and. certain categories of OCB; attendance, punctuality, and rule-compliance come to mind" (p. 92). The remaining dimensions would be broader, focusing on such things as cooperation, collegiality, or voluntary contribution of time and effort. Organ stated that these dimensions ”however characterized, would enable the rater to give fair credit in a general, global sense for many other forms of OCB" (p. 92). Finally, Organ proposed that an overall rating be provided to assess each employee's "total contribution". An example of the dimensions which might be included on such a form is presented in Figure 6. It is interesting to note the direct parallel between L. Williams (1988) three categories of behavior (IRB, OCBO, and OCBI) and the dimensions suggested by Organ (1988b). Inrrole behaviors (IRB) capture the dimensions of quantitative productivity and technical excellence. OCBO would include factors such as punctuality and rule- compliance (in fact, these are two of the seven items which Williams used to define his OCBO factor). OCBI, on the other hand, is the broadest factor, and would capture issues similar to Smith et al.'s (1983) Altruism factor. 57 InrRole Productivity Technical Excellence Attendance Rule-Compliance Cooperation Voluntary Contribution of Time and Effort Williams' (1988) Behavioral Qategatiaa IRB IRB OCBO OCBO OCBI OCBI Figure 6: Sample Performance Appraisal Dimensions to be Rated (from Organ, 1988b) 58 In this study, the recommendations and research of Williams (1988) and Organ (1988b) will be followed when deriving the dimensions to be rated, as well as the critical incidents included in the dimension x ratee matrix available to each rater (see Chapter 3 for details). Briefly, subject matter experts from the target organization 'will identify organizationally-relevant dimensions for a particular job, similar to those listed in Figure 6. Two dimensions will be included for each of William's (1988) three categories of behavior (IRB, OCBO, and OCBI). Critical behavioral incidents will be generated for the job in question, which will then be collected into various configurations to represent different levels of'in—role and.extra-role performance by the hypothetical ratees. It should be noted that the kind of dimensions which Organ (1988b) proposed combine elements of both trait- and behaviorally-oriented appraisal formats. In spite of the long-standing view that it is better to measure behaviors and results than broad traits, such "combination approaches" have recently appeared in the human resource management literature (e.g., Levy, 1989). In defense of this, Kavanagh (1971) argued that trait ratings should not be discarded if they help account for total variance in appraisal ratings. More recently, Rice (1985) noted a movement toward greater use of trait ratings because, in spite of their subjectivity, many managers consider traits to be vital to their overall ability to rate employee performance. 
This trend is evident in.assessment center research, where such broad dimensions as "energy", "initiative”, and "motivation" are often used (Thornton & Byam, 1982). Interestingly, however, the developers of assessment center procedures generally seek to define these dimensions as behaviorally as possible. 59 Previous research on the impact of in-role versus citizenship behaviors on utility estimates (Orr et a1. , 1989) has also utilized both trait and behavioral dimensions. A problem with Orr et a1. (1989), however, is that in—role performance was defined by specific work behaviors, while OCB was defined by more trait-like terms such as ”team cooperation” and ”contribution to morale". This potential confound would seem to be inherent in this type of research, given that Organ (1988b) intended for raters to consider "general tendencies" when rating employee citizenship. The current research will seek to minimize this confound as much as possible by defining both in—role and extra-role dimensions in behavioral terms (similar to assessment center procedures). It is thought that this will reduce the subjectivity of trait ratings (Wexley 6s Yukl, 1984), and reduce the likelihood that any differences observed between ratings of inrrole and extra—role performance are attributable primarily to differences between rating behaviors versus rating traits. A major advantage of using dimensions which capture elements of both ratee traits and behaviors is that this overcomes the difficulties observed by DeNisi and Summers (1986) . As noted above, DeNisi and Summers (1986) devised both behavioral and trait-oriented rating scales. The problem, however, was that they had no expert true scores for their trait ratings, and therefore were unable to compute accuracy scores for this condition, or compare results across conditions. In the present study, all raters will rate the same dimensions; thus, a comparison of accuracy results by condition will be possible (see Figure 3). In light of the preceding pages, hypotheses are now presented concerning the predicted effects of the performance manipulations on 60 accuracy, error, and.the process variables. First, main.effects for these within-subjects factors are discussed, i.e., level of inrrole and extra- role performance. This is followed by hypotheses concerning expected interactions between the performance manipulations, as well as between the performance manipulations and search/method of rating. 0 e e In- ole a d tr -Ro e e viors The following predictions concern tataa effects, without regard to the between-subj ects conditions in which a rater participated. Hypotheses are presented for both inrrole and extra-role performance. Predicted interactions are discussed after this. v - e ehavio a c ac . Level of in—role performance has been manipulated in a number of previous performance appraisal studies (DeNisi & Stevens, 1981; Karl & Wexley, 1989; Padgett & Ilgen, 1989; Kozlowski 6: Ford, 1991). Padgett and Ilgen (1989) and Kozlowski and Ford (1991) utilized two levels of performance (high/low), whereas DeNisi and Stevens (1981) and Karl and Wexley (1989) used three levels (high, average, and low). The current research will follow the latter studies in using three levels of inrrole performance. Ayerage ratee performance is thought to represent a more ambiguous stimulus than either high or low levels of performance (Wexley, Yukl, Kovacs, & Sanders, 1972). This is relevant to the present study, in that DeNisi et a1. 
(1984) predicted that accuracy would be greatest for high and low performance, and worst when rating average performance. The research to date, however, has demonstrated only that high performance is rated more accurately than either average or low performance (DeNisi 6: Stevens, 1981; Karl 6: Wexley, 1989). This would lead to the following prediction: 61 H5a: Differential accuracy will be greater (i.e., closer to the expert true scores) for ratees exhibiting high levels of in— role behaviors (IRB) than for ratees exhibiting average or low levels of IRB. It is thought that this aspect of Cronbach's (1955) accuracy best captures the manner in which accuracy has been characterized in previous research. The other Cronbach measures are not expected to be differentially influenced by level of IRB. Laval 9_f la-Role Behavi‘ ora and Mount ot Saarah. Kozlowski and Ford (1991) and Padgett and Ilgen (1989) both found that raters sought more information on poor performers than high performers. DeNisi and Stevens (1981) interpreted their accuracy results (just cited) by claiming that average and low performing ratees require more attention on the part of raters. DeNisi and Stevens (1981) did not measure amount of search in their study, but the implication from their research would be that: HSb: The amount of search will be greatest for ratees exhibiting average or low levels of IRB, and lowest for ratees exhibiting high levels of IRB. At first glance, this hypothesis appears contradictory to Hypothesis 5a, i.e. , that raters will search more, but be mag accurate in rating average and low performers. However, this apparent paradox will be retained, because it is felt that rating average and low performers is a more difficult task and that, in spite of increased search, accuracy will be lower for these ratees in comparison to ratees exhibiting high in—role performance (DeNisi 6: Stevens, 1981). Extra-Role Behaviors, Ettor, and Accutacy. In this study, OCB will be manipulated by having one ratee within each IRB performance level 62 demonstrate either highly favorable or neutral levels of OCBI (see Figure 7). From Organ (1990a), it is thought that the observed intercorrelation among dimensions (halo) will be higher for ratees where favorable OCB information is present than for those where OCB information is neutral, i.e., H6a: More halo will be observed for ratees where highly favorable OCBI information is ‘present than for ratees where OCBI information is neutral. If this effect carried over to accuracy measures (which is not clear given the results of Murphy and Balzer, 1989), then similar predictions can be made for stereotype and differential accuracy. H6b: Stereotype accurady will be worse for ratees where highly favorable OCBI information is present than for ratees where OCBI information is neutral. H6c: Differential accuracy will be worse for ratees where highly favorable OCBI information is present than for ratees where OCBI information is neutral. These are the only two Cronbach accuracy' measures for ‘which predictions are made, since it is not clear that OCB information would have any effect on the rank ordering of ratees (i.e., differential elevation). These predictions are made despite the cautions of Murphy and Balzer (1989) because this is a controlled laboratory study'where the true intercorrelations among dimensions will be known (Fisicaro, 1988). The weak link between error and accuracy noted.by Murphy and Balzer (1989) is not expected to hold in this particular instance. 
Figure 7: Target Profiles for Hypothetical Ratees

    Ratee 1: high in-role performance; high OCBI (consistent)
    Ratee 2: high in-role performance; neutral OCBI (inconsistent)
    Ratee 3: average in-role performance; high OCBI (inconsistent)
    Ratee 4: average in-role performance; neutral OCBI (consistent)
    Ratee 5: low in-role performance; high OCBI (inconsistent)
    Ratee 6: low in-role performance; neutral OCBI (consistent)

Interactions Between In-Role and Extra-Role Behaviors. The combinations which can be formed by the within-subjects factors are shown in Figure 7. Hypothetical ratee profiles will be constructed to match these six combinations. It can be seen that the consistency of information is highest for Ratees 1, 4, and 6, and lowest for Ratees 2, 3, and 5. Padgett and Ilgen (1989) predicted that when ratee performance information is inconsistent, raters will seek more information, yet be less accurate on a measure of differential elevation. They obtained their predicted results for accuracy, but not for amount of search. As both hypotheses are plausible, both will be tested again in the current research.

For amount of search, results in the present study should differ from Padgett and Ilgen (1989) for several reasons. First, before search, Padgett and Ilgen (1989) required subjects to view five videotaped vignettes for each of their four ratees. This requirement may have had the unintended consequence of limiting further search (mean number of vignettes searched was 8.76, out of 17 available for each ratee). In the current study, raters will have little prior information concerning ratees, and thus will need to access the computer to obtain the information they need to make their ratings. Second, in Padgett and Ilgen (1989), search was more labor-intensive and time-consuming. In the present study, information will be accessed quickly on the computer, and this increased search (beyond that observed by Padgett & Ilgen, 1989) is expected to aid raters in better noting inconsistent performance. Finally, to increase the realism of the task, Padgett and Ilgen (1989) had subjects simultaneously complete an in-basket exercise. One drawback of this manipulation is that it may have also served generally to limit subject-initiated search.

H7a: The amount of search will be greater for ratees where performance information is inconsistent concerning IRB and OCBI than for ratees where such information is consistent.

H7b: Differential elevation will be worse (less accurate) for ratees where performance information is inconsistent concerning IRB and OCBI than for ratees where IRB and OCBI information is consistent (Padgett & Ilgen, 1989).

Interactions Between Method of Rating and Extra-Role Behaviors. MacKenzie et al. (1991) speculated that the presence of OCB information would trigger the increased use of categories and schema by raters. This greater reliance on general impressions (or impressionistic codes; Hastie & Park, 1986) is expected to decrease rater accuracy in comparison to the accuracy of ratings made when such OCB information is absent (or neutral, as in the present study). Rather than proposing messy and improbable four- or five-way interactions utilizing accuracy and all independent variables, this study will focus on two three-way interactions between method of rating, OCB, and accuracy. Since the accuracy predictions concern the presence of OCB information (e.g., Organ, 1990a), level of in-role performance will be ignored in these predictions. Also, predictions for method of rating will be made by collapsing across prior knowledge conditions. Thus, the predictions concern the main effects for type of format, the favorability of OCBI information, and accuracy. The influence of OCB information is expected to primarily influence Cronbach's measures of stereotype and
The influence of OCB information is expected to primarily affect Cronbach's measures of stereotype and differential accuracy. These effects are presented graphically in Figure 8, and verbally below.

H8a: Stereotype accuracy - Two main effects and no interaction will be observed between method of rating, type of OCBI information available, and stereotype accuracy. Subjects rating by dimension will be more accurate than subjects rating by person. In addition, accuracy will be greater for ratees with neutral OCBI information than for ratees with favorable (i.e., more salient) OCBI information.

H8b: Differential accuracy - Two main effects and no interaction will be observed between method of rating, type of OCBI information, and differential accuracy. Subjects rating by dimension will be more accurate than subjects rating by person. In addition, accuracy will be greater for ratees with neutral OCBI information than for ratees with favorable OCBI information.

Summary

Accuracy has been of fundamental concern to psychologists and other researchers of organizational behavior for decades (Funder, 1987). This emphasis on accuracy is no less strong in the area of performance appraisal. This study will take widely-used accuracy measures, and then utilize a more recent methodological approach known as process tracing (Ford et al., 1989) to address a number of questions concerning method of rating and accuracy. Hypotheses concerning information acquisition and method of rating have been drawn from recent advances and applications of cognitive psychology (cf., DeNisi & Williams, 1988). A second set of questions concerns the manner in which raters utilize both in-role and extra-role behaviors when evaluating ratee performance. These hypotheses deal with areas relatively untouched by most previous research on organizational citizenship behaviors (Organ, 1988b; 1990a).

[Figure 8: Predicted results for stereotype accuracy and differential accuracy, by method of rating (person- versus dimension-blocked) and favorability of OCBI information (favorable versus neutral)]

Additionally, interactions between method of rating, extra-role performance, and accuracy are proposed. Previous research on information acquisition has noted that experienced raters may use different cognitive processes than naive (student) raters (Williams et al., 1986; DeNisi et al., 1989). This study will explore some of these issues, and should provide answers that are of more than academic interest to "real world" raters of performance. The next chapter lays out the specifics of the methodology used in this study.

CHAPTER 3: METHOD

Overview

This chapter lays out the sample, procedures, variables, and data analysis used in this study. The manner in which the content of the study was derived is presented first, followed by a discussion of the procedures used in the primary study. The primary study entailed allowing subjects to search a computer information board, and then asking them to provide ratings for six hypothetical ratees. The appropriate data analytic methods are presented for testing the various hypotheses presented in Chapter 2.

Participants

Subjects in this study were supervisors at Michigan State University. The study was conducted with the consent and cooperation of the university's Director of Personnel Administration, as well as the leadership of the Michigan State University Administrative Professional Supervisors Association (APSA).
The APSA represents over 850 supervisors employed by the university, and covers virtually all employees on campus with supervisory responsibilities. The subject matter experts were also drawn from this same pool of APSA supervisors.

Power Analysis

An a priori power analysis was conducted to determine the sample size needed in order to have adequate power to detect significant effects. Crucial variables for such calculations are the expected effect sizes for the experimental manipulations. Previous research using Cronbach's (1955) accuracy measures has generated medium (eta² = .06) to large (eta² = .20) effects using other manipulations (cf., Murphy et al., 1989; Padgett & Ilgen, 1989). The South Carolina researchers, while not using Cronbach's measures, have typically generated "medium" effects for their accuracy measures (as specified by Cohen, 1988). DeNisi and Stevens' (1988) manipulation of prior knowledge of format produced a sizable effect on accuracy (d > .50). Finally, the manipulations and measures of in-role and extra-role performance in MacKenzie et al. (1991) and Orr et al. (1989) produced very large effects; i.e., in MacKenzie et al. (1991), overall R² was .45; in Orr et al. (1989), R² = .74. Taken together, the above findings suggested that an overall R² of .60 was reasonable for the four independent variables and three interactions in this study.

What is also crucial for a power analysis, however, is the estimation of the unique variance (sr²) which each individual variable or interaction is expected to explain (Cohen and Cohen, 1983, p. 118). Using an overall R² = .60 with seven variables (4 independent variables and 3 interactions), an alpha of .05, and power of .80, various calculations can be made. If sr² = .03, then 113 subjects will be required; if sr² = .04, then 87 subjects will be required; and if sr² = .05, 71 subjects will be required. Given the likelihood that at least some of the variables or interactions would produce small effects, it was decided to include 116 subjects in this research (i.e., 29 subjects per between-subjects condition; cf., Figure 3). This provided adequate power to detect significant effects if they existed. A computational sketch of these sample size figures is shown below.
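The sketch that follows is an illustrative, modern reconstruction of the calculation just described, not the procedure actually used for the study. It follows the Cohen and Cohen (1983) framing: a single predictor (numerator df = 1) is tested over and above the remaining six, with the full-model R² assumed to be .60; all function names are illustrative.

```python
# A minimal sketch of the a priori power analysis, assuming a single added
# predictor (df_num = 1) among seven total predictors and full-model R^2 = .60.
from scipy.stats import f as f_dist, ncf

def power_for_n(n, sr2, r2_full=0.60, n_predictors=7, df_num=1, alpha=0.05):
    """Approximate power to detect a squared semipartial correlation sr2 with n subjects."""
    f2 = sr2 / (1.0 - r2_full)          # Cohen's effect size f-squared
    df_den = n - n_predictors - 1       # error df for the full model
    ncp = f2 * (df_num + df_den + 1)    # noncentrality parameter (Cohen's L)
    f_crit = f_dist.ppf(1.0 - alpha, df_num, df_den)
    return ncf.sf(f_crit, df_num, df_den, ncp)

def required_n(sr2, target_power=0.80):
    """Smallest n reaching the target power for the given sr2."""
    n = 10
    while power_for_n(n, sr2) < target_power:
        n += 1
    return n

if __name__ == "__main__":
    for sr2 in (0.03, 0.04, 0.05):
        print(sr2, required_n(sr2))   # roughly 113, 87, and 71 subjects
```

Under these assumptions the routine reproduces, to within rounding, the sample sizes of 113, 87, and 71 cited above.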
Deriving the Content of the Study

Procedures outlined by Latham and Wexley (1981) and Padgett and Ilgen (1989) were adapted for use in the present study. Following Padgett and Ilgen (1989), a general secretarial position was chosen as the "focal job" (all ratees were assumed to be holding a similar position in the organization). Almost all subjects in the study were currently evaluating the performance of one or more secretaries in their work unit; thus, this was a position which was highly familiar to study participants.

Questionnaires were first given to a pre-sample of university supervisors, asking them to rate the importance of various performance dimensions for a secretarial position. The dimensions were selected based on the discussions of Organ (1988b) and Williams (1988). Two criteria were used to select the dimensions: a) widespread agreement by the supervisors as to the relevance of the dimension to the organization and job in question, and b) each of Williams' (1988) behavioral categories (IRB, OCBO, and OCBI) was represented by two dimensions on the final list of dimensions selected (cf., Figure 6).

A second group of APSA supervisors was then recruited to serve as "subject matter experts", i.e., a group who generated and evaluated critical behavioral incidents, assigned each incident to its appropriate dimension, and then assigned final ratings to each incident and hypothetical ratee (producing the "true scores" for the Cronbach accuracy measures). Previous performance appraisal research using expert raters has varied in the number of raters used to generate such true scores, from a low of five raters in Cardy et al. (1987), to a high of 25 raters in Karl and Wexley (1989). Five studies used the videotapes and true scores from Murphy et al. (1982), where true scores were derived from 13 raters. The mean number of raters used in the 13 studies located was 12.5. From this, it was decided to collect ratings from 15 raters to generate the true scores for this study. Subject matter experts (SMEs) were volunteers, as well as supervisors who were recommended to the author by his contacts on the APSA board.

The design of the primary study requires that there be one critical incident for each ratee on each performance dimension; thus, with six ratees and six dimensions, a total of 36 critical incidents were required. These incidents were generated as follows. First, incidents were drawn from the research of Padgett and Ilgen (1989). Padgett and Ilgen (1989) interviewed secretaries at the same university, and generated numerous critical incidents. Many of these incidents were appropriate for the present research as well. Second, incidents were generated by the researcher, drawing on materials from Organ (1988b), L. Williams (1988), S. Williams and Hummert (1990), as well as on discussions with office personnel at the university. From these sources, a questionnaire was assembled containing a large number of critical incidents. The subject matter experts were asked to evaluate each incident on a seven-point Likert scale, and then go back through the incidents to assign each incident to its appropriate dimension of job performance.

Padgett and Ilgen (1989) used two criteria to evaluate agreement among their 10 expert raters: a) 70% agreement among the raters concerning which dimension was represented by a particular incident, and b) a standard deviation of less than 1.25 across ratings of the level of performance portrayed by each incident. These evaluation criteria were used in the present study as well.

Once ambiguous items were removed, the remaining incidents were modified as necessary to ensure that the target levels of in-role and extra-role performance were adequately portrayed. These final edited incidents were sent back to the SMEs for them to again rate the level of performance represented by each item. The items were assembled so as to portray six hypothetical ratees (see Figure 7). The subject matter experts were asked to rate each "ratee" on each dimension, as well as provide an overall performance rating for each ratee. The means of these SME ratings were then used as the "true scores" in Cronbach's (1955) accuracy measures.

Operationally, only two of the three categories of behavior discussed in Chapter 2 were manipulated. First, in-role behavior (IRB) was manipulated to portray high, average, and low levels of performance. Second, OCBI was manipulated to portray favorable versus neutral (i.e., average) manifestations of citizenship directed toward specific individuals (Williams, 1988; Puffer, 1987).
OCBO was tied to the level of in-role performance, and thus not independently manipulated, for two reasons: 1) this simplified the research design and data analysis, and 2) as Organ (1988b) noted, these types of dimensions "straddle the boundary" between in-role performance and OCB. VanDyne and Cummings (1990) noted that the difference between in-role and extra-role behavior is often extremely subtle; this seemed particularly true for OCBO-type dimensions. Thus, to maximize the distinction between in-role performance and OCB, only the OCBI form of citizenship was independently manipulated.

Procedure

Computer software developed by the Michigan State University Psychology Department was modified for use in the current study. This software was used by Kozlowski and Ford (1991), and allows the rater to search a computer information board for information on up to 12 ratees on 12 dimensions. In the current study, a 6 x 6 ratee by dimension matrix was developed, using the dimensions and hypothetical ratees just described. The software is designed so that a different ordering of ratees and dimensions is presented to each subject, thus reducing the likelihood of order effects confounding the results of the study.

Subjects were contacted individually by telephone to solicit their involvement in the study. When a meeting time was agreed upon, subjects were run individually in their own offices. The process tracing software is sufficiently flexible that it can be used on almost all IBM-compatible computers; thus, supervisors could choose to complete the project on their own computer, or on a laptop computer provided by the researcher. Supervisors were randomly assigned to one of the four between-subjects conditions (in Figure 3) by following a set order of proceeding through the between-subjects conditions, i.e., whatever condition was "up next" when a particular supervisor agreed to participate was the condition that supervisor received. Data collection was stopped once data had been collected from 116 supervisors.

In all conditions, supervisors were told that this research concerned ways to improve the accuracy of performance appraisal ratings, as well as determining the proper content for performance appraisal. The use of rater diaries as a method to increase rater accuracy was discussed, and subjects were told to assume that a supervisor whom they knew and respected had faithfully recorded behavioral incidents for each of her six subordinates. They were to assume that this other supervisor no longer worked for the university, and that they were now to complete the appraisals for these six "employees". They were to assume that the available items accurately portrayed behaviors observed by this supervisor, but that it was up to them (the subjects) to determine the ratings for each employee on each dimension, as well as to assign an overall rating to each employee. Their agreement to participate in the study was solicited at this point.

Next, the six performance dimensions, as well as the seven-point Likert rating scale to be used in the study, were presented to all subjects. Those subjects with no advance knowledge of the rating format, i.e., whether it would be person- or dimension-blocked (Conditions 3 and 4; Figure 3), were told only that they would make ratings for these ratees once they had completed their computer search. No mention of person- versus dimension-blocking was made until after the computer search had been completed.
At that point, they made their ratings using the appropriate format for their condition. Subjects in Conditions 1 and 2 (advance knowledge of a person- or dimension-blocked form, respectively) were shown the appropriate format prior to their computer search. Operationally, this was built into the computer programs for these conditions at the appropriate place, so as to maximize similarity across "runs" of the manipulation. Subject discussion of the rating format was discouraged at this point. Primarily, it was stated that a major issue in this research concerned the accuracy of ratings made with this particular rating format. Once subjects in Conditions 1 and 2 had completed their computer search, they made their ratings in the same manner (i.e., using the same formats) as subjects in Conditions 3 and 4, respectively.

In all conditions, subjects conducted their information search at their own pace, and note-taking was permitted. At any point, subjects could terminate their search by responding to the appropriate prompt on the computer screen. If they chose to search for 28 items (the maximum available), then the computer program automatically moved them into the rating phase of the study. All ratings were collected on the computer, using the format appropriate for each condition. Once subjects completed the rating task, they were asked to complete a brief "Final Questionnaire", which solicited several pieces of background information. This information (e.g., years of supervisory experience) was compared with the responses of the subject matter expert group, as well as among the four between-subjects conditions (as a check on the random assignment of subjects to conditions).

Constraints on Information Search. A final design issue to address in this research is whether any constraints should be placed on raters in terms of the amount of search they can undertake. DeNisi et al. (1984) discussed time pressures as an important variable influencing search strategies. DeNisi et al. (1983) found that subjects allowed to select only 9 of 25 available items (36%) produced a much different search pattern than those for whom search was unlimited. In their unlimited search condition, subjects preferred seeking information by person, followed by a preference for searching by task. In their constrained condition, over half the subjects displayed a mixed search strategy (neither person- nor task-blocked). Kozlowski and Ford (in press, Study 1) provided a stronger test of the effects of constraints on information acquisition. They allowed subjects to access 25%, 50%, 75%, or 100% of the available items, and found a large effect for search constraint.

In this study, a constraint intended to be moderate was used, i.e., in all conditions, subjects were allowed to access 28 of the available 36 critical incidents. This meant that each subject was able to access up to 78% of the information available on all the ratees. A constraint is useful in that it more closely resembles a "real" appraisal setting (where search is not unlimited). However, a constraint that was too severe was expected to limit the amount of rater variability in search patterns which was observed. A modest constraint addresses the concerns of DeNisi et al. (1984), without restricting variance on the amount of rater search too severely.
Further, such a constraint was expected to "force" raters to choose information from those dimensions which they felt were most important to them before making their ratings, which was useful in looking at the "in-role" versus "extra-role" issues raised in this research.

Variables

Overall Performance Ratings. These are the overall ratings provided by each subject for each hypothetical ratee. Means were calculated for each ratee separately, for the four between-subjects conditions, and for all ratees combined.

Error. Murphy and Balzer (1989) discussed two measures of halo, i.e., the median correlation between performance dimensions, computed across ratees (MEDCORR), and the variance of the ratings assigned to each ratee, averaged across ratees (VARRAT). Further, Murphy and Jako (1989) argue that when the number of observations is limited (in this case, the number of ratees), then a more stable estimate of halo is the mean correlation between performance dimensions (MNCORR). This value was also computed for each of the between-subjects conditions.

Accuracy. The primary dependent variables in this research were Cronbach's (1955) measures of accuracy. These were defined verbally above (pp. 18-19), and the formulas are provided in Appendix A. Most previous research has utilized the deviation score formulas (cf., Murphy & Balzer, 1989). With this approach, each accuracy score is a squared deviation score measuring some aspect of the differences between subject ratings and expert true scores for the hypothetical ratees. Computer programs to calculate these scores will be adapted from Balzer (Note 1). Dickinson, Hedge, Johnson, & Silverhart (1990) have presented an alternative means of analyzing accuracy scores (see the Data Analysis section below), but this method also relies on the notion of deviation scores (Dickinson, Note 2). Cronbach (1955) also presented formulas for differential elevation, stereotype accuracy, and differential accuracy which used variances and correlations. Becker and Cardy (1986) argued that these accuracy measures also carry individually meaningful information about rater accuracy. As Sulsky and Balzer (1988) point out, these latter measures are not sensitive to the distances between subject and true score ratings, and thus are not strictly measures of rating accuracy. They do, however, provide evidence of rating validity (Sulsky & Balzer, 1988), and thus, these three correlational forms of rating accuracy will also be calculated.
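As an illustration only (not the Balzer, Note 1, programs used in the study), the sketch below shows one way the halo indices just described (MEDCORR, MNCORR, VARRAT) and Cronbach's (1955) squared deviation-score components could be computed for a single rater, assuming the rater's ratings and the SME true scores are each stored as a ratee-by-dimension NumPy array. Published formulations differ in minor details (for example, whether square roots are taken before reporting), so this should be read as a sketch of the general logic rather than the exact operational formulas.

```python
import numpy as np
from itertools import combinations

def halo_indices(x):
    """x: ratings matrix, rows = ratees, columns = performance dimensions."""
    n_dims = x.shape[1]
    rs = [np.corrcoef(x[:, j], x[:, k])[0, 1]
          for j, k in combinations(range(n_dims), 2)]
    medcorr = np.median(rs)                       # MEDCORR (Murphy & Balzer, 1989)
    mncorr = np.mean(rs)                          # MNCORR  (Murphy & Jako, 1989)
    varrat = np.mean(np.var(x, axis=1, ddof=1))   # VARRAT: within-ratee variance, averaged
    return medcorr, mncorr, varrat

def cronbach_components(x, t):
    """Squared deviation-score components (Cronbach, 1955) for one rater.
    x and t are ratee-by-dimension matrices of subject ratings and true scores."""
    d = x - t                                     # discrepancy matrix
    grand = d.mean()
    ratee_dev = d.mean(axis=1) - grand            # ratee main effects of the discrepancy
    dim_dev = d.mean(axis=0) - grand              # dimension main effects of the discrepancy
    residual = (d - d.mean(axis=1, keepdims=True)
                  - d.mean(axis=0, keepdims=True) + grand)
    el2 = grand ** 2                              # elevation
    de2 = np.mean(ratee_dev ** 2)                 # differential elevation
    sa2 = np.mean(dim_dev ** 2)                   # stereotype accuracy
    da2 = np.mean(residual ** 2)                  # differential accuracy
    return el2, de2, sa2, da2
```

Square roots of the four components would then be taken for reporting, following the convention described in Chapter 4.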
Type of Search. Cafferty et al. (1986) measured type of search by studying the types of transitions subjects made when they chose which performance incidents to view. A person-blocked transition is one where subjects request another incident for the same ratee. A dimension-blocked transition is one where subjects ask for an incident for a different ratee on the same performance dimension. A mixed or nonblocked transition is one where subjects change both ratee and dimension when making their next choice (such "shifts" will be ignored in the data analysis). The number of each of these transitions will be calculated for each subject. Payne (1976) used the following formula to determine whether each subject's search pattern was person- or dimension-blocked:

    # of Person-Blocked Transitions - # of Dimension-Blocked Transitions
    ---------------------------------------------------------------------
    # of Person-Blocked Transitions + # of Dimension-Blocked Transitions

This formula ignores the mixed or nonblocked transitions, but does usefully collapse information on type of search into one value which resembles a correlation coefficient. If a subject made only person-blocked transitions, their score on this measure would be +1.00; if a subject made only dimension-blocked transitions, their score would be -1.00; and if a subject made an equal number of person- and dimension-blocked transitions, their score would be 0.00. Payne's (1976) measure was used for all tests concerning type of search in this research.
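A minimal sketch of this index is given below, assuming a rater's search log is available as an ordered list of (ratee, dimension) cells; the function name and data layout are illustrative and are not part of the process tracing software described above.

```python
def payne_index(accesses):
    """accesses: the (ratee, dimension) cells a rater viewed, in order.
    Returns +1.00 for purely person-blocked search, -1.00 for purely
    dimension-blocked search, 0.00 for an even mix (Payne, 1976)."""
    person = dimension = 0
    for (r1, d1), (r2, d2) in zip(accesses, accesses[1:]):
        if r1 == r2 and d1 != d2:
            person += 1        # same ratee, new dimension: person-blocked
        elif d1 == d2 and r1 != r2:
            dimension += 1     # same dimension, new ratee: dimension-blocked
        # transitions that change both ratee and dimension are ignored
    if person + dimension == 0:
        return 0.0
    return (person - dimension) / (person + dimension)
```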
Amount of Search. Amount of search was calculated as the number of items subjects requested before indicating that they were willing to make performance ratings. Theoretically, this value could range from zero to 28. These amount of search values were used to test Hypothesis 4b.

Data Analysis

In Chapter Two, seven dependent variables were proposed for this research, i.e., differential elevation (DEL), stereotype accuracy (SA), differential accuracy (DA), halo, overall performance ratings, amount of search, and type of search. The appropriate design for this research has been labeled "subjects within groups by conditions" by Cohen and Cohen (1983). This design is required because both between- and within-subjects factors are manipulated. Other labels for such a design include "split plot" and "subjects nested within groups repeatedly measured on a factor". In this research, the "groups" are the four between-subjects conditions presented in Figure 3 (format type, and prior knowledge of format); different subjects are nested within each of these groups. The repeated measures are the within-subjects manipulations of in-role and extra-role performance; all subjects, regardless of between-subjects manipulation, were presented with the same matrix of hypothetical ratees. In effect, each "ratee" should be viewed as a "condition" in Cohen and Cohen's (1983) terms, where each ratee represents a different pairing of the levels of the within-subjects manipulations. Thus, this research has subjects nested within four groups, all viewing six within-subjects "conditions" (i.e., ratees). Figure 9 (Section B) depicts this graphically.

Seven variables were created in order to test the various main effects and interactions posited by Hypotheses 1-8 (see Figure 9).

FIGURE 9: Contrast-Coded Variables for this Research

A. Between-Subjects Manipulations (Hypotheses 1-4):
   X1 = prior knowledge of format (yes = 1/2; no = -1/2)
   X2 = type of format (by person = 1/2; by dimension = -1/2)
   X3 = X1X2 interaction

                    X1      X2      X3
   Condition 1      1/2     1/2     1/4
   Condition 2      1/2    -1/2    -1/4
   Condition 3     -1/2     1/2    -1/4
   Condition 4     -1/2    -1/2     1/4

B. Within-Subjects Manipulations (Hypotheses 5-7):
   X4 = level of in-role performance (high = 1; average or low = -1/2)
   X5 = level of extra-role performance (positive = 1/2; neutral = -1/2)
   X6 = X4X5 interaction
   [The original figure also tabled the X4, X5, and X6 codes, and their crossings with X1-X3, for each of the six ratees within each condition.]

C. Relationship Between Type of Format and Level of Extra-Role Performance (Hypothesis 8):
   X7 = X2X5 interaction (the product of the X2 and X5 codes; e.g., 1/4 for high-OCBI ratees rated in the person-blocked conditions, -1/4 for neutral-OCBI ratees in those conditions, with the signs reversed in the dimension-blocked conditions)

Variables X1 and X2 portray information concerning prior knowledge of rating format and type of format, respectively. Variable X3 carries their interaction (i.e., X1X2). This was the interaction of interest in Hypotheses 1-4. Variables X4 and X5 represent the manipulations of in-role and extra-role performance, respectively. Variable X6 represents the interaction between these two variables (X6 = X4X5). This is the interaction of interest for Hypotheses 7a and 7b. A final variable, X7, portrays the interaction between X2 (type of format) and X5 (level of OCBI information present), i.e., X7 = X2X5. This is the interaction of interest in Hypotheses 8a and 8b.

As shown in Figure 9, contrast coding was used to represent the nominal values portrayed by these seven variables. This type of coding is most appropriate for factorial designs where interactions are of particular interest, since meaningful regression coefficients, as well as partial and semi-partial correlations, can be obtained for both the proposed main effects and their interactions (Cohen & Cohen, 1983). For most variables, this coding is straightforward. For X4, level of in-role performance, values were assigned in accordance with past research results (see Chapter 2).

Various hierarchical regression analyses were planned to test the predictions of Hypotheses 1-8. Cohen and Cohen (1983) noted that hierarchical regression is the counterpart to the analysis of covariance. Both procedures allow the researcher to statistically control sources of variation which are irrelevant to the research questions of interest. In this study, the general procedure was to enter the "extraneous" variables into the regression equation in Step One of the process, with the variables of interest for a particular hypothesis entered in Step Two. Significant regression coefficients were anticipated for the variables in Step Two, depending on the specific hypotheses being tested.

Figure 10 summarizes the predictors for this research project. For Hypotheses 1-4, variables X4 through X7 were entered first, followed by variables X1 through X3. Following the hypotheses in Chapter 2, when the dependent variable was differential elevation, stereotype accuracy, or differential accuracy, each of the variables X1, X2, and X3 was expected to have a significant regression coefficient. For halo, amount of search, and type of search, only X1 and X3 were expected to have significant regression coefficients. For overall performance ratings, none of the regression coefficients were expected to be statistically significant. For Hypotheses 5-7, variables X1, X2, X3, and X7 were entered in Step One, followed by the variables of interest for each dependent variable (see Figure 10). Finally, for Hypotheses 8a and 8b, variables X1, X3, X4, and X6 were entered first, followed by variables X2, X5, and X7. As mentioned previously, a significant regression coefficient was not expected for X7 for any of the dependent variables to be tested.
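The two-step procedure just described can be illustrated with a small sketch. It assumes the dependent variable and the seven contrast-coded predictors are available as NumPy arrays, one observation per row, and uses ordinary least squares to compare the Step One and Step Two models; the function name and variable layout are illustrative and this is not the analysis code actually used in the study.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f as f_dist

def hierarchical_step(y, step1_vars, step2_vars):
    """Step One: 'extraneous' contrast codes; Step Two: codes of interest.
    Returns the R-squared increment, its F test, and the full-model coefficients
    (Cohen & Cohen, 1983)."""
    X1 = sm.add_constant(np.column_stack(step1_vars))
    X2 = sm.add_constant(np.column_stack(list(step1_vars) + list(step2_vars)))
    fit1 = sm.OLS(y, X1).fit()
    fit2 = sm.OLS(y, X2).fit()
    k_added = len(step2_vars)
    df_err = len(y) - X2.shape[1]
    r2_inc = fit2.rsquared - fit1.rsquared
    f_inc = (r2_inc / k_added) / ((1.0 - fit2.rsquared) / df_err)
    p_inc = f_dist.sf(f_inc, k_added, df_err)
    return r2_inc, f_inc, p_inc, fit2.params

# Illustrative call for Hypothesis 1 (halo as the dependent variable):
# r2_inc, f_inc, p_inc, betas = hierarchical_step(halo, [x4, x5, x6, x7], [x1, x2, x3])
```

When a single variable is added in Step Two, the R-squared increment equals that variable's squared semipartial correlation (sr²), which is the quantity used in the power analysis above.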
FIGURE 10: Summary of the Variables from Hypotheses 1-8 Predicted to be Statistically Significant

Dependent Variable          Predictors entered (hypothesis in parentheses)
Overall Performance Rating  X1, X2, X3
Halo                        B* X1 (H1), X2 (H1), B* X3 (H1), B* X5 (H6a)
Differential Elevation      B* X1 (H2a), B* X2 (H2a), B* X3 (H2a), B* X5 (H7b)
Stereotype Accuracy         B* X1 (H2b), B* X2 (H2b), B* X3 (H2b), B* X5 (H6b), X7 (H8a)
Differential Accuracy       B* X1 (H2c), B* X2 (H2c), B* X3 (H2c), B* X4 (H5a), B* X5 (H6c), X7 (H8b)
Type of Search              B* X1 (H3), X2 (H3), B* X3 (H3)
Amount of Search            B* X1 (H4), X2 (H4), B* X3 (H4), B* X4 (H5b), B* X5 (H7a)

Note: B* = this coefficient is predicted to be statistically significant.

CHAPTER 4: RESULTS

This chapter first documents how the content and true scores were derived for this study, and then presents the results of the primary analyses. Results are presented in turn for each of the hypotheses discussed above in Chapter 2.

Content Derivation for the Study

Questionnaires were given to 10 university supervisors, asking them to rate the importance of six performance dimensions for a secretarial position. As can be seen in Table 1, all six a priori dimensions were rated as highly important by these supervisors (nine completed questionnaires were returned). Job Knowledge and Accuracy of Work was rated as most important (mean = 6.67), with the lowest importance ratings given to Extra Effort/Initiative (mean = 5.56). Supervisors were also asked to record any other dimensions which they felt should be measured, but were not being captured by these six dimensions. Four supervisors responded to this open-ended question. Three wrote that there should be some measure of "attitude" or helpfulness included; one wrote that "interpersonal skills" were necessary; and one listed a need for independent judgment or decision making. Given the high means and low standard deviations in Table 1, as well as the low level of response to the open-ended question, it was decided that these dimensions met the criteria listed in Chapter 3, and could thus be used when deriving the content for the primary study.

TABLE 1: Mean Importance Ratings Given to the Six Performance Dimensions
(1 = not at all important; 7 = very important)

                                                                        Mean   S.D.
A. Job Knowledge and Accuracy of Work (possessing the necessary
   knowledge and skills to perform the job; accuracy and
   thoroughness of work)                                                6.67    .50
B. Productivity (amount of work completed; ability to efficiently
   organize work)                                                       6.11    .93
C. Dependability/Attendance (infrequent tardiness, unscheduled
   absences, etc.)                                                      6.33    .71
D. Following Policies and Regulations (following all necessary
   rules, regulations, policies, and procedures)                        6.56    .73
E. Cooperation and Teamwork (providing assistance and support to
   others; coordinating work with others)                               6.22   1.20
F. Extra Effort/Initiative (takes on extra tasks when needed, goes
   the "extra mile")                                                    5.56   1.13

A second group of APSA supervisors was recruited to serve as "subject matter experts", i.e., the group who would generate and evaluate critical behavioral incidents, assign each incident to its appropriate dimension, and then make final ratings for each incident and hypothetical ratee (generating the "true score" measures). Supervisors were recruited from several sources to be subject matter experts (SMEs). Five supervisors volunteered at an APSA monthly meeting. The remainder were recommended to the researcher by one of these five individuals (two of whom were APSA board members). The primary criterion for inclusion in the SME group was a high degree of willingness to assist in such a project.
Similar to previous research using subject matter experts, supervisors were deemed "expert" largely because they were provided an extended opportunity to review the relevant critical incidents (cf., Borman, 1977; Sulsky & Balzer, 1988). The SME group had worked for the university an average of 17 years, with an average of 12.5 years of total supervisory experience; both figures were higher than those reported by supervisors in the primary sample. Means and standard deviations for these background information items are given in Table 2. T-tests were conducted comparing the means for the subject matter experts with those from the primary sample of 116 supervisors. None of these comparisons were statistically significant, although the t-test for years of supervisory experience approached statistical significance (t = 1.76, p < .10). Those in the SME group had an average of three years more total supervisory experience than those in the primary sample.

One of the concerns that Sulsky and Balzer (1988) expressed about performance appraisal accuracy research is the level of "expertise" actually possessed by subject matter experts. In the current research, it is not claimed that the raters in the subject matter expert group were better supervisors or better raters than the remaining supervisors, only that their increased involvement with the study material allowed them a better opportunity to make informed ratings.

Table 2: Background Information for Subject Matter Experts and Primary Sample
(mean values; standard deviations in parentheses; -- = mean not recoverable)

                              SMEs      Primary     Condi-     Condi-     Condi-     Condi-
Item                          (n=15)    (n=116)     tion 1     tion 2     tion 3     tion 4
Years worked for university   17.13     15.72       15.69      16.17      15.10      15.90
                              (7.11)    (6.97)      (7.45)     (6.85)     (6.91)     (7.00)
Years of supervisory          12.53     --          10.24      --         --         --
  experience                  (5.91)    (6.85)      (6.79)     (5.42)     (6.57)     (8.50)
Years in present position     --        --          --         --         --         --
                              (4.30)    (5.94)      (5.77)     (5.69)     (6.11)     (6.43)
Raters' favorableness to      --        --          --         --         --         --
  rating process              (1.53)    (1.25)      (1.02)     (1.49)     (1.48)     (.99)
Perceived favorableness of    --        --          --         --         --         --
  raters' actual employees
  to rating process           (1.06)    (1.32)      (1.31)     (1.39)     (1.48)     (1.07)

The 36 critical incidents were generated as follows. First, as hoped, the majority of incidents were culled from the research of Padgett and Ilgen (1989). Padgett and Ilgen (1989) interviewed five secretaries who worked at the same university, and generated 101 critical incidents. The means and standard deviations for these incidents were provided to the researcher by Padgett (Note 3). Over 40 incidents could be adapted to fit the dimensions and levels of intended performance in the present study. Second, approximately 20 incidents were generated by the author, drawing on materials from Padgett and Ilgen (1989), Organ (1988b), L. Williams (1988), S. Williams and Hummert (1990), as well as on discussions with the office supervisor and secretaries in his own department. From these sources, a questionnaire was assembled containing 56 critical incidents. The 15 SMEs were asked to evaluate each incident on a seven-point Likert scale, and then go back through the incidents to assign each incident to its appropriate dimension of job performance (see Table 1).

The two criteria from Padgett and Ilgen (1989) were used to evaluate agreement among SME raters. Based on this, 14 of the 56 incidents had to be discarded because of low agreement among the SMEs concerning the dimension portrayed by that incident.
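For illustration, the two screening rules just applied can be expressed as a short function. The data layout (a list of dimension assignments and a list of seven-point level ratings for each incident) is assumed for the example and is not taken from the study materials.

```python
import statistics

def keep_incident(dimension_votes, level_ratings, intended_dim,
                  min_agreement=0.70, max_sd=1.25):
    """Apply the two Padgett and Ilgen (1989) screening criteria to one incident:
    a) at least 70% of SMEs assign the incident to its intended dimension, and
    b) the standard deviation of the level ratings is below 1.25."""
    agreement = dimension_votes.count(intended_dim) / len(dimension_votes)
    sd = statistics.stdev(level_ratings)   # sample SD of the 7-point ratings
    return agreement >= min_agreement and sd < max_sd

# Illustrative call:
# keep_incident(["Productivity"] * 12 + ["Extra Effort"] * 3,
#               [5, 6, 5, 6, 5, 5, 6, 5, 6, 5, 5, 6, 5, 6, 5],
#               intended_dim="Productivity")
```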
The second criterion (low standard deviations) was not a major issue in the current study, as only eight of the 56 incidents had standard deviations for the SME ratings exceeding 1.00, and only one exceeded 1.25.

A more difficult issue for the current study was finding appropriate incidents for each dimension, where a suitable number of incidents portrayed the desired or target level of performance (e.g., high, average, or low; see Figure 7). When the initial profiles of hypothetical ratees were compiled, it was discovered that there were no suitable incidents remaining which portrayed high levels of "Following Rules and Regulations" or average levels of "Extra Effort and Initiative". Thus, additional incidents were generated, and these were then rated by a smaller group of eight supervisors. This procedure generated the four incidents needed to complete the profiles.

Once the ambiguous, redundant, or otherwise non-optimal items had been removed, the remaining 36 incidents were used as the content of the study. The final wording of some of the items was modified, in an effort to better capture targeted levels of performance. As stated previously, these items were assembled so as to portray six hypothetical ratees. These composites were presented to the SMEs, who were asked to rate each "ratee" on each dimension, as well as provide an overall performance rating for each ratee. Since an important question in this research was whether ratings made "by person" were different from ratings made "by dimension", the SMEs were also shown the same incidents arranged by dimension prior to making their ratings of each incident and ratee. The means of these SME ratings were then used as the "true scores" in Cronbach's (1955) accuracy measures.

Table 3: True Scores from the 15 Subject Matter Experts

Ratee   Dimension                          Target Level   Mean Rating
Pat     Job Knowledge & Accuracy           high           5.13
        Productivity                       high           4.87
        Dependability/Attendance           high           5.27
        Following Policies & Procedures    high           5.20
        Cooperation & Teamwork             high           5.40
        Extra Effort/Initiative            high           5.80
        OVERALL                                           5.47
Chris   Job Knowledge & Accuracy           high           4.53
        Productivity                       high           4.87
        Dependability/Attendance           high           5.20
        Following Policies & Procedures    high           5.93
        Cooperation & Teamwork             average        4.33
        Extra Effort/Initiative            average        3.60
        OVERALL                                           4.87
Terry   Job Knowledge & Accuracy           average        4.53
        Productivity                       average        4.13
        Dependability/Attendance           average        4.07
        Following Policies & Procedures    average        4.40
        Cooperation & Teamwork             high           4.93
        Extra Effort/Initiative            high           5.27
        OVERALL                                           4.60
Kim     Job Knowledge & Accuracy           average        4.00
        Productivity                       average        4.13
        Dependability/Attendance           average        4.00
        Following Policies & Procedures    average        4.07
        Cooperation & Teamwork             average        4.73
        Extra Effort/Initiative            average        4.80
        OVERALL                                           4.20
Jody    Job Knowledge & Accuracy           low            2.60
        Productivity                       low            2.60
        Dependability/Attendance           low            2.40
        Following Policies & Procedures    low            2.93
        Cooperation & Teamwork             high           4.40
        Extra Effort/Initiative            high           5.00
        OVERALL                                           3.07
Lynn    Job Knowledge & Accuracy           low            2.53
        Productivity                       low            1.93
        Dependability/Attendance           low            1.86
        Following Policies & Procedures    low            3.60
        Cooperation & Teamwork             average        4.00
        Extra Effort/Initiative            average        4.43
        OVERALL                                           3.00

As can be seen in Table 3, there was a good match between the targeted levels of performance and the mean ratings provided by the subject matter experts. Ratings for Pat (high in-role, high extra-role), Chris (high in-role, average extra-role), and Lynn (low in-role, average extra-role) were particularly close to target levels.
For the remaining ratees, the match was very good for 14 of the 18 available incidents. The four exceptions were: a) Terry (average in-role, high extra-role) was rated higher by the SMEs than desired on Job Knowledge & Accuracy of Work; b) Kim (average in-role, average extra-role) was rated higher than desired on both extra-role dimensions; and c) Jody (low in-role, high extra-role) was rated lower than desired for Cooperation and Teamwork. It is not thought that the results presented below were strongly affected by these deviations from the intended target levels of performance (e.g., see Table 10 below).

Next, the results from the primary study are presented, first for the between-subjects manipulations (Hypotheses 1-4), then for the within-subjects manipulations (Hypotheses 5-7), and finally for the proposed interaction between method of rating, level of OCBI, and accuracy (Hypotheses 8a and 8b).

Results from the Primary Study

Overall Ratings

Both subject matter experts and supervisors in the primary study made ratings of overall performance for each ratee. Means for these ratings are provided in Table 4.

Table 4: Overall Ratings by Ratee: True Scores, Whole Sample, and by Condition
(mean ratings; deviations from SME true scores in parentheses)

                               SMEs     Primary Sample   Condition 1    Condition 2    Condition 3    Condition 4
Ratee                          (n=15)   (n=116)          (n=29)         (n=29)         (n=29)         (n=29)
Pat (high IRB, high OCBI)      5.47     5.54 (+.07)      5.69 (+.22)    5.72 (+.25)    5.28 (-.19)    5.48 (+.01)
Chris (high IRB, avg. OCBI)    4.87     4.85 (-.02)      4.86 (-.01)    4.86 (-.01)    5.00 (+.13)    4.69 (-.18)
Terry (avg. IRB, high OCBI)    4.60     4.53 (-.07)      4.72 (+.12)    4.41 (-.19)    4.55 (-.05)    4.41 (-.19)
Kim (avg. IRB, avg. OCBI)      4.20     4.21 (+.01)      4.31 (+.11)    4.07 (-.13)    4.21 (+.01)    4.24 (+.04)
Jody (low IRB, high OCBI)      3.07     3.16 (+.09)      3.21 (+.14)    3.03 (-.04)    3.31 (+.24)    3.10 (+.03)
Lynn (low IRB, avg. OCBI)      3.00     3.12 (+.12)      3.10 (+.10)    3.07 (+.07)    3.17 (+.17)    3.14 (+.14)
Mean (all six ratees)          4.20     4.235 (+.035)    4.316 (+.116)  4.195 (-.005)  4.253 (+.053)  4.178 (-.022)

Several points can be made concerning these overall ratings. First, the mean ratings given to each ratee by the primary sample (n = 116) are extremely close to those given by the subject matter experts. On average, study participants gave slightly higher ratings than those given by the SMEs (mean discrepancy = +.035 across all six ratees), but none of the differences between the SME ratings and the ratings from the whole sample were statistically significant. From this, it can be inferred that intended performance levels were well captured by the critical incidents used in this study.

Second, there were some differences between the conditions in the overall ratings given to each ratee, but again, these were quite small in magnitude. Subjects with prior knowledge of format gave slightly higher ratings than the true scores (mean discrepancies from the true scores were +.056, versus +.012 when no prior knowledge of format was provided), and subjects rating by person rated higher than those rating by dimension (mean discrepancies of +.085 versus -.013), but neither of these differences reached statistical significance. Analyses of variance for each of the six ratees by prior knowledge (X1) and format type (X2) produced only one statistically significant difference: subjects with prior knowledge of format (Conditions 1 and 2) rated Pat higher than those without prior knowledge of format (p < .05, eta² = .04).
Third, a regression analysis was performed on these overall ratings, using the seven contrast-coded variables discussed in Chapter 3. This does not directly address any of the hypotheses in the study, but does serve as a rough manipulation check for both the between- and within-subjects manipulations. These seven variables accounted for 41% of the variance in overall ratings. As would be expected from the values presented in Table 4, none of the beta weights for the between-subjects manipulations were statistically significant (X1, X2, or their interaction). However, each of the within-subjects manipulations was significant beyond the p < .001 level. Level of in-role performance (X4) was the dominant variable, with a squared semi-partial correlation (sr²) of .371. Level of extra-role performance (X5) accounted for an additional .025 of overall variance, while the X4*X5 interaction explained an additional .012 in rating variance. The final variable, X7, had no effect on overall ratings. This pattern of relationships is indicative of the results presented below relevant to the specific research hypotheses.

As expected, then, the between-subjects manipulations of format type and prior knowledge of format had only minimal impact on overall performance ratings. Viewed in the most positive light, this could be taken as evidence for the high level of skill possessed by all the raters in this study at making overall ratings of performance. Given the high experience levels presented in Table 2, this is a plausible explanation. Despite this, however, questions remain to be addressed concerning the halo and accuracy observed in these ratings.

Halo

Hypothesis 1 predicted that there would be more halo exhibited when ratings were made by person, but that this effect would occur only when subjects had prior knowledge of rating format. Results concerning this hypothesis are shown in Table 5. As predicted, there was a sizable difference in the median intercorrelations depending on format type, with higher intercorrelations (indicating greater halo) in the person-blocked conditions (MEDCORR = .42 vs. .32, t = 5.07, p < .001). There was also an unexpected effect for prior knowledge, with higher intercorrelations in the conditions where prior knowledge of format type was provided (MEDCORR = .40 vs. .34, t = 3.24, p < .01). These effects are shown graphically in Figure 11. As can be seen, there are two main effects, and no interaction.

Table 5: Median and Mean Intercorrelations Among Dimensions, Across Ratees

                                                 MEDCORR    S.D.      MNCORR     S.D.
Condition 1 (Prior Knowledge of
  Person-Blocked Format)                         .44922     .09925    .44726     .09438
Condition 2 (Prior Knowledge of
  Dimension-Blocked Format)                      .35173     .21621    .37743     .18484
Condition 3 (No Prior Knowledge of
  Person-Blocked Format)                         .38382     .15827    .38995     .16765
Condition 4 (No Prior Knowledge of
  Dimension-Blocked Format)                      .29560     .15737    .29748     .13507
Subject Matter Experts (True Scores)             .34275     .32994    .28978     .20267

Murphy and Jako (1989) argued that with a small number of ratees, the average or mean intercorrelations provide a more stable estimate of halo. Repeating these calculations using the mean intercorrelations produced similar, though slightly weaker, results. Intercorrelations were higher with the person-blocked format (MNCORR = .41 vs. .34, t = 2.98, p < .01), and with prior knowledge of format type (MNCORR = .40 vs. .34, t = 2.44, p < .05). Murphy and Balzer (1989) also presented a halo measure that used the variance of the ratings assigned to ratees, averaged across ratees. None of the analyses using this measure of halo (VARRAT) produced differences which were statistically significant. In all cases, standard deviations were very large, and the instability spoken of by Murphy and Jako (1989) seemed to be a major influence on this.

[Figure 11: Median intercorrelations among dimensions (halo), by method of rating (person versus dimension) and prior knowledge of format (yes versus no)]

Overall, then, there is partial support for Hypothesis 1, in that halo was higher in the person-blocked conditions. The next dependent variables to address are those relating more directly to accuracy.

Accuracy

The mean ratings for each ratee on each dimension are presented in Table 6. Comparing the true score means to the means for the whole sample reveals far greater differences between the rating sources than did the overall ratings presented in Table 4. Only 16 of the 36 ratings were reasonably close to the true score ratings. Twelve ratings were markedly higher than the true scores, and eight ratings were considerably lower than the true scores (a difference of approximately +/- .30 is significant at p < .05; if a more stringent alpha, such as p < .01, is used so as to reduce the likelihood of Type I error per comparison, 10 ratings still differ significantly from the true scores). In all cases, these differences were in the direction of the mean rating given to that ratee, which corresponds to the halo evidence presented above.

A regression analysis similar to that done on the overall ratings was conducted on these ratings, with similar results. The seven contrast-coded variables from this study accounted for 28% of the variance in ratings given. The only significant beta weights were those for in-role performance (sr² = .253), extra-role performance (sr² = .021), and their interaction (sr² = .005).

Hypotheses 2a - 2c can be tested using either the correlational accuracy or the distance score formulations from Cronbach (1955). Results will be reported first using the correlational accuracy formulations, then using the distance score formulations.

Table 6: Mean Ratings by Ratee and Dimension: Whole Sample, and by Condition

Ratee/                 SMEs    Primary   Condi-   Condi-   Condi-   Condi-
Dimension              (n=15)  Sample    tion 1   tion 2   tion 3   tion 4
                               (n=116)   (n=29)   (n=29)   (n=29)   (n=29)
Pat
  Job Knowl.           5.13    5.42      5.24     5.72     5.24     5.48
  Productiv.           4.87    5.35      5.17     5.66     5.31     5.28
  Dependab.            5.27    5.46      5.59     5.55     5.38     5.31
  Policies             5.20    5.20      4.97     5.34     5.45     5.03
  Teamwork             5.40    5.39      5.49     5.62     5.14     5.34
  Ex.Effort            5.80    5.46      5.38     5.62     5.34     5.48
Chris
  Job Knowl.           4.53    4.95      4.97     5.10     4.97     4.76
  Productiv.           4.87    4.90      4.79     5.10     5.07     4.62
  Dependab.            5.20    5.16      5.24     5.17     5.21     5.03
  Policies             5.93    5.02      4.93     5.03     5.14     4.97
  Teamwork             4.33    4.43      4.45     4.41     4.41     4.45
  Ex.Effort            3.60    4.47      4.52     4.34     4.62     4.38
Terry
  Job Knowl.           4.53    4.59      4.52     4.66     4.69     4.48
  Productiv.           4.13    4.28      4.34     4.14     4.38     4.24
  Dependab.            4.07    4.58      4.76     4.48     4.66     4.41
  Policies             4.40    4.50      4.59     4.55     4.55     4.31
  Teamwork             4.93    4.82      4.93     4.62     4.79     4.93
  Ex.Effort            5.27    4.84      4.90     4.72     5.10     4.66
Kim
  Job Knowl.           4.00    4.03      4.00     4.07     4.14     3.90
  Productiv.           4.13    4.21      4.34     4.24     4.10     4.14
  Dependab.            4.00    4.02      3.97     4.03     4.21     3.86
  Policies             4.07    4.16      4.31     4.03     4.24     4.07
  Teamwork             4.73    4.49      4.52     4.45     4.52     4.48
  Ex.Effort            4.80    4.53      4.62     4.48     4.66     4.38
Jody
  Job Knowl.           2.60    2.99      3.28     2.97     3.00     2.72
  Productiv.           2.60    2.91      2.86     2.97     2.90     2.90
  Dependab.            2.40    2.68      2.59     2.76     2.66     2.72
  Policies             2.93    3.41      3.41     3.41     3.45     3.38
  Teamwork             4.40    4.17      4.17     4.10     4.17     4.24
  Ex.Effort            5.00    4.06      4.24     3.72     4.14     4.14
Lynn
  Job Knowl.           2.53    2.89      2.97     2.72     2.93     2.93
  Productiv.           1.93    2.79      2.79     2.79     2.69     2.90
  Dependab.            1.86    2.69      --       2.76     2.72     2.72
  Policies             3.60    3.65      3.72     3.62     3.59     3.66
  Teamwork             4.00    3.88      3.69     4.03     3.76     4.03
  Ex.Effort            4.43    3.85      3.79     3.66     4.07     3.90

As stated above, the correlational approach is more aptly viewed as a measure of rating validity (Sulsky & Balzer, 1988).

Correlational Accuracy. Hypothesis 2a concerned differential elevation, proposing that accuracy would be higher for those rating by person than for those rating by dimension, and that accuracy would be highest for those knowing in advance that they were going to rate by person, and lowest for those knowing in advance that they were going to rate by dimension. This hypothesis was not supported. The differential elevation correlation (DECORR) was very high in all conditions, ranging from r = .885 in Condition 1 to r = .922 in Condition 4 (mean DECORR across conditions = .903, s.d. = .086). Analyses of variance using DECORR as the dependent variable, and X1, X2, and their interaction as the independent variables, revealed only a marginally significant effect for type of format, where DECORR was higher for those who rated by dimension (r = .888 by person, versus r = .917 by dimension, p < .10). This is opposite to what was predicted for differential elevation, where it was expected that those rating by person would be better able to correctly rank order ratees.

Staying with correlational accuracy, but moving to Hypotheses 2b and 2c, the opposite predictions were made for stereotype accuracy (SACORR) and differential accuracy (DACORR). It was expected that accuracy would be greater when rating by dimension, and that it would be best with prior knowledge of a dimension-blocked format, and worst with prior knowledge of a person-blocked format. These hypotheses were also not confirmed. For stereotype accuracy (H2b), correlations were lower and much more varied than for DECORR (mean SACORR across conditions = .594, s.d. = .312). Analysis of variance revealed a significant main effect for prior knowledge of format, with greater correlational accuracy for those without prior knowledge of format type (r = .659, versus r = .528 for those with prior knowledge, p < .05, eta² = .044). Also, the X1*X2 interaction approached statistical significance (p = .068, eta² = .028). This interaction can be seen in Figure 12. As Figure 12 indicates, accuracy was best for those in Condition 4, and worst for those in Condition 2.

Hypothesis 2c concerned differential accuracy. The mean DACORR across the four conditions was r = .444, with a standard deviation of .212. There was almost no variance across conditions, from a low of r = .435 in Condition 1 to a high of r = .453 in Condition 3. Analysis of variance revealed no significant effects for the between-subjects manipulations or their interaction. Overall, then, the correlational accuracy measures provided partial support for the efficacy of rating by dimension. This support was weak, however, and not consistent with the hypotheses put forth in the study.

Distance Accuracy. Hypotheses 2a - 2c can also be analyzed using the distance score formulations presented by Cronbach (1955). As Murphy and Balzer (1989) demonstrated, this has been the most common operationalization of rater accuracy in the performance appraisal literature. Regression analyses were conducted using the seven contrast-coded variables discussed in Chapter 3.
Five separate analyses were conducted, using elevation, differential elevation, stereotype accuracy, differential accuracy, and overall accuracy as the dependent variables.

[Figure 12: Results for stereotype accuracy correlations (SACORR), by method of rating (person versus dimension) and prior knowledge of format (yes versus no)]

[Figure 13: Results for the distance score accuracy components (elevation, differential elevation, stereotype accuracy, and differential accuracy), by method of rating and prior knowledge of format]

For stereotype accuracy (Hypothesis 2b), accuracy was greater for subjects without prior knowledge of format (sr² = .031) and when rating by dimension (p < .001, sr² = .02). The X1*X2 interaction was also significant (p < .001, sr² = .019). As Figure 13 indicates, subjects in Condition 2 were least accurate in terms of stereotype accuracy. This is opposite from what Hypothesis 2b predicted. Multiple R² for this equation was .07.

Concerning differential accuracy and Hypothesis 2c, there was a statistically significant effect for format type which was in line with the hypothesis, i.e., accuracy was better when rating by dimension (p < .05, sr² = .001). Again, however, the amount of explained variance was extremely small (R² = .002).

The Cronbach (1955) accuracy measures produce a squared value for each component, i.e., EL², DE², SA², and DA². Following Murphy et al. (1982), it has become customary to use and report the square roots of these values as the accuracy results for each component. These are the values depicted in Figure 13, and used in the above analyses. It is also possible to sum these squared values, and then take the square root of that sum as a measure of overall accuracy. This value is presented in Figure 14. As can be seen, the pattern is very similar to that found for differential elevation: accuracy is greater without prior knowledge of format (p < .001, sr² = .015), and when rating by dimension (p < .001, sr² = .003). The interaction is also statistically significant (p < .05, sr² = .001), such that accuracy is best for those in Condition 4. Multiple R² for this equation was .02.

[Figure 14: Results for overall distance score accuracy, by method of rating and prior knowledge of format]

As discussed in Chapter 3, the reason for utilizing the above regression analyses was to simultaneously test for the effects of both the between-subjects and within-subjects manipulations. Unfortunately, when these regression analyses were conducted, the beta weights for the within-subjects variables were all zero. In retrospect, it can be seen that this is a mathematical necessity, given the way these dependent and independent variables were constructed. The dependent variables come from Cronbach (1955), and the formulations used in this study were adapted from Balzer (Note 1). These formulations produce one value per subject for each of the accuracy components, i.e., one EL² score, one DE² score, one SA² score, and one DA² score. The contrast-coded variables (X4, X5, X6) were designed to capture in-role performance, extra-role performance, and their interaction, and are by nature orthogonal. Such orthogonal variables cannot explain any variance in a single (point) value. Thus, the regression analyses presented above were problematic, in that they did not accomplish the purpose set out for them in Chapter 3. Fortunately, another way of measuring accuracy was located, which is capable of measuring much of what was intended by the above regression analyses. This approach comes from Dickinson (1987; Dickinson et al.
, 1990). Dickipsop's MANOVA Approach. Dickinson (1987) laid out an analysis of variance design to test the accuracy of performance ratings. The differences between performance ratings from a given sample and true score ratings are built into the MANOVA calculations using orthonormal contrasts. The overall differences between the true scores and the sample is picked up as a "rating sources" source of variation. This corresponds to Cronbach's (1955) concept of elevation. The other primary sources of Variation are for "ratee", "dimension”, and the "ratee x dimension" 1rlteraction. These correspond to Cronbach's (1955) conceptions for d1- fferential elevation, stereotype accuracy, and differential accuracy, 110 respectively. Thus, all of Cronbach's (1955) distance score accuracy components are included in a.mu1tiple analysis of variance. Such a design allows the researcher to simultaneously test between? and withinrsubject manipulations. Dickinson et a1. (1990) utilized this framework to test the effects of cognitive modeling and feedback in rater accuracy training. In Experiment 2, they used eight between-subject conditions, and tested the effects of these manipulations on ratings made for seven ratees across three performance dimensions. Substantively, Dickinson et al. (1990) found that almost all of the variation explained in their ratings was due to differences between the samples in the levels of the ratings they'made; almost none of this could be explained by the between-subject manipulations or their interactions. After discussions with Dickinson (Note 2), it was determined that this design framework was also applicable to the present study, and the results of this analysis are presented next. Table 7 displays the results of this analysis. Since the ratees and dimensions were selected by the researcher, it assumes a fixed effects model (Kirk, 1982). The analysis was done using;MANOVA on SPSS-X. SPSS—X provides values for partial eta-squared as an estimate of effect size. ,As noted by Cohen (1988), partial eta—squared overestimates actual effect size. It is, however, a consistent measure, and does provide a reasonable estimate of the relative effects of each source of variation. As can'be seen, most of the variation in these orthonormal contrasts was explained by differences between the samples in terms of a) their overall rating levels (i.e., ratings sources, or elevation), b) their ratings for the ratees (differential elevation), c) their ratings for the 111 Table 7 Analysis of Variance for Prior Knowledge and Format Type on Accuracy Partial Spppppppi df MS F—ratio §p§£_ Rating Sources 1 11.92 6.28* .056 Prior Knowledge (X1) 1 .26 .14 .001 Format Type (X2) 1 .91 .48 .004 X1 x X2 1 1.08 .57 .005 Raters/Condition (R/C) 112 1.90 Ratees (E) 5 2.93 3.91** .034 E x X1 5 .34 .46 .004 E x X2 5 1.17 1.56 .014 E x X1 x X2 5 .46 .62 .005 E x R/C 560 .75 Dimensions (D) 5 21.53 87.14*** .438 D x X1 5 .25 1.01 .009 D x X2 5 .50 2.04 .018 D x X1 x X2 5 .28 1.15 .010 D x R/C 560 .25 Ratees x Dimensions 25 9.15 51.11*** .313 E x D x X1 25 .16 .88 .008 E x D x X2 25 .20 1.14 .010 E x D x X1 x X2 25 .24 1.37 .012 E x D x R/C 2800 .18 .10 .05 .01 *** .001 X- 'U'U'O'd AAAA 112 dimensions (stereotype accuracy), and d) by the ratee x dimension interaction (differential accuracy). 
Consistent with previous analyses, the smallest difference between the ratings of the subject matter experts and the whole sample was in their ratings of the six ratees; while this difference was significant (p < .01), partial eta² was .03. Similarly, differences in overall rating levels, or elevation, also had a small but statistically significant influence on accuracy (p < .05, partial eta² = .06). The greatest amount of variation was explained by differences between the sources in their ratings of the six dimensions used in the study (p < .001). Partial eta² for dimensions (i.e., stereotype accuracy) was a sizable .44. Differences between the rating sources in their ratee x dimension interactions also accounted for a large amount of variation (p < .001, partial eta² = .31). As seen by both the F-ratios and the partial eta-squared values, the between-subject manipulations had only a negligible effect on these sources of variation.

The tests of Hypothesis 2a are the "ratee" interactions for ratee by format type (E x X2) and ratee by prior knowledge by format type (E x X1 x X2). As seen in Table 7, both of these interactions had small effects which were not statistically significant. Hypothesis 2b is tested by two of the dimension interactions (i.e., D x X2, and D x X1 x X2). The interaction between dimensions and format type approached statistical significance (p < .10, partial eta² = .02), but was not as predicted by Hypothesis 2b. Accuracy was worse when rating by dimension; again, this was due to the poor showing in this regard by subjects in Condition 2. The ratee by dimension by format type interaction (D x E x X2) and the ratee by dimension by prior knowledge by format type interaction (D x E x X1 x X2) test Hypothesis 2c. Both of these interactions had small, non-significant results.

This analysis of variance procedure (Dickinson, 1987) can also be used to test the hypotheses relevant to the within-subject manipulations (Hypotheses 5-7). Results from these analyses will be presented below. Two primary findings should be drawn from the above discussion: a) this MANOVA procedure corroborates the generally weak findings for the between-subjects manipulations, and b) most of the variance was explained by rating level differences between the subject matter experts and the sample which were not strongly related to the study's hypotheses. This was particularly true in regards to how each group viewed the dimensions and the ratee by dimension interaction present in this study.

Type of Search

Hypothesis 3 predicted that subjects would display a general preference for information search by person, and that this would be stronger for subjects in Condition 1, and less pronounced for subjects in Condition 2. Means, standard deviations, and breakdowns by condition are presented in Table 8. As can be seen, subjects exhibited a strong preference for a person-oriented search pattern. Eighty-two percent of all study participants searched for information in a person-blocked fashion; it is not evident, however, that the prior knowledge manipulation had any impact on subsequent type of search.
Table 8
Means, Standard Deviations, and Breakdowns Concerning Type of Search

                Number of people           Payne's type of
                searching by:              search measure
Condition       Person     Dimension       Mean      S.D.
1                 24           5            .538      .710
2                 22           7            .537      .695
3                 23           6            .446      .699
4                 26           3            .680      .523
OVERALL           95          21            .551      .658

The two groups with prior knowledge of rating format (Conditions 1 and 2) had near identical values on Payne's (1976) type of search measure, indicating a strong preference for searching for information by person. Subjects in the no prior knowledge conditions were slightly more varied in their type of search, but still strongly person-oriented in their search patterns. Analysis of variance using Payne's measure as the dependent variable, and X1, X2, and their interaction as the independent variables, produced no statistically significant differences between the conditions. Thus, the key aspect of Hypothesis 3 (that prior knowledge of format type would influence subsequent search) was not supported.

The above results led to the question of whether type of search is more of an individual difference variable, which is less subject to manipulation than was thought prior to data collection. This post hoc explanation is strengthened by the large number of extreme values found for the Payne measure; i.e., in general, people either strongly preferred seeking information by person or by dimension, regardless of what condition they were in.

In Chapter 1, Figure 1 posited a potential direct link between type of search and accuracy. No specific hypotheses were formulated regarding this linkage, but for completeness, such analyses were also conducted. First, type of search was dichotomized as either person- or dimension-oriented. This new variable was then entered into analysis of variance equations patterned after Dickinson et al. (1990) and Table 7 (instead of X1 and X2). There was no evidence from this analysis that person- versus dimension-blocked searching influenced any of the four accuracy measures; i.e., it cannot be said based on these data that one type of search pattern led to greater accuracy than the other. A second alternative explanation is that what is most important for rating accuracy is that there be congruence between the manner in which a subject searches for information and the type of format which they then use to make their ratings. Cafferty (Note 4) has preliminary evidence of such a link in a similar process tracing research project. This was tested in the current study by dividing subjects into "congruent" or "incongruent" search/format groups (for congruents, their search pattern and format type matched; for incongruents, they did not). Again, there were no statistically significant differences explained by this breakdown. Thus, a link between type of search and accuracy cannot be supported based on the results from the current study.

Amount of Search

Hypothesis 4 predicted that subjects would voluntarily end their information search before they had accessed all available items. Practically, it was expected that subjects would choose to end their search before they had reached the maximum allowable of 28 items. This hypothesis received no support whatsoever from the current research, in that 104 out of 116 subjects chose to access the maximum of 28 possible items. The lowest amount of search by any study participant was 22 items, and the means in the four conditions were all between 27.65 and 27.83.
Obviously, the analysis of variance on amount of search by X1, X2, and their interaction was not statistically significant. In retrospect, it seems clear that the design and information constraints employed in this study kept Hypothesis 4 from receiving a fair test in this instance. This concludes the results for the between-subject manipulations. The next issue to address is the effect of the within-subject manipulations of in-role and extra-role performance.

Approach to the Within-Subject Manipulations

With some modification, the analysis of variance procedure described above from Dickinson (e.g., Table 7) can also be used to measure the effects of the in-role and extra-role manipulations. From Kirk (1982), the appropriate experimental design for this study is the SPF-pr*qt, where there are two between-subject blocks and two within-subject blocks. What is necessary for this design, however, is that there be one data point for each "ratee". This requires collapsing across the dimensions in this study to derive one mean rating for each ratee, rather than an individual rating for each of the six dimensions. Once this is done, orthonormal contrasts can be computed and used in a multiple analysis of variance similar to that described above. It should be noted that these results do not correspond directly to the Cronbach (1955) measures of accuracy. They do, however, provide a sense of the overall impact of the within-subject manipulations, and will be followed up with more specific tests of the hypotheses in this study.

The results of this analysis are presented in Table 9. As can be seen, in-role performance had a small, but statistically significant, independent effect on these pooled accuracy ratings (partial eta² = .034). By itself, extra-role performance did not have a statistically significant effect on the accuracy of these ratings, but the interaction of in-role and extra-role performance was significant (p < .01, partial eta² = .056). Further, the three-way interaction of in-role performance, extra-role performance, and format type approached significance (p < .10, partial eta² = .022). These interactions are depicted in Figures 15 and 16. The values depicted are orthonormal contrasts, where the differences between subject ratings and true score ratings have been divided by the square root of 2 (Dickinson, Note 2; Kirk, 1982). The means and standard deviations by ratee are given in Table 10. Tukey's Honest Significant Difference (HSD) procedure revealed that the only difference between mean ratings which was statistically significant was that between the ratings of Kim and Lynn (Glass & Hopkins, 1984; Dickinson et al., 1990).

Table 9
Analysis of Variance for In-Role and Extra-Role Performance

Source                              df      MS     F-ratio   Partial Eta²
Rating Sources                       1     1.99      6.28*        .056
Prior Knowledge (X1)                 1      .04       .14         .001
Format Type (X2)                     1      .15       .48         .004
X1 x X2                              1      .18       .57         .005
Raters/Condition (R/C)             112      .32
In-Role Performance (In)             2      .60      3.98*        .034
In x X1                              2      .09       .61         .005
In x X2                              2      .25      1.68         .015
In x X1 x X2                         2      .19      1.26         .011
In x R/C                           224      .15
Extra-Role Performance (Ex)          1      .04       .28         .002
Ex x X1                              1      .04       .29         .003
Ex x X2                              1      .01       .10         .001
Ex x X1 x X2                         1      .00       .01         .000
Ex x R/C                           112      .14
In-Role x Extra-Role                 2      .60      6.61**       .056
In x Ex x X1                         2      .03       .34         .003
In x Ex x X2                         2      .23      2.49         .022
In x Ex x X1 x X2                    2      .00       .03         .000
In x Ex x R/C                      224      .09

Note. p < .10 (marginal); * p < .05; ** p < .01; *** p < .001.
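Since the effect sizes in Tables 7 and 9 are reported as partial eta-squared values, it may help to note how those values relate to the F-ratios and degrees of freedom shown in the same rows. The sketch below applies the standard conversion, partial eta² = SS_effect / (SS_effect + SS_error) = (F x df_effect) / (F x df_effect + df_error), to a few entries as a consistency check. It is illustrative only and is not a reanalysis of the study's data.

```python
def partial_eta_squared(F, df_effect, df_error):
    """Partial eta-squared recovered from an F-ratio:
    SS_effect / (SS_effect + SS_error) = F*df_effect / (F*df_effect + df_error)."""
    return (F * df_effect) / (F * df_effect + df_error)

# Entries read from Tables 7 and 9 (values match the tables within rounding):
print(round(partial_eta_squared(87.14, 5, 560), 3))  # Table 7, Dimensions            -> 0.438
print(round(partial_eta_squared(3.98, 2, 224), 3))   # Table 9, In-Role Performance   -> 0.034
print(round(partial_eta_squared(6.61, 2, 224), 3))   # Table 9, In-Role x Extra-Role  -> 0.056
```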
[Figure 15. Interaction of In-Role and Extra-Role Performance: mean orthonormal contrast values plotted for each ratee, by level of in-role and extra-role performance. Graphic not reproduced.]

[Figure 16. Interaction of In-Role and Extra-Role Performance, by Type of Format. Graphic not reproduced.]

Table 10
Means and Standard Deviations for Orthonormal Contrasts by Ratee

Ratee                             Mean      S.D.
Pat (high IRB, high OCBI)         .074      .472
Chris (high IRB, ave. OCBI)       .054      .432
Terry (ave. IRB, high OCBI)       .032      .449
Kim (ave. IRB, ave. OCBI)        -.035      .273
Jody (low IRB, high OCBI)         .034      .343
Lynn (low IRB, ave. OCBI)         .164      .361

The specific hypotheses regarding in-role and extra-role performance are addressed next.

In-Role Performance

Hypothesis 5a predicted that differential accuracy would be greater for ratees exhibiting high levels of in-role behaviors (IRB) than for ratees exhibiting average or low levels of IRB. As Figure 15 demonstrates, accuracy was worst for Lynn (low IRB, average OCBI), but was next worst for the two ratees high in IRB (Pat and Chris). Accuracy was best for the two ratees with average levels of IRB. When the Cronbach (1955) differential accuracy measure was recalculated to test this hypothesis, there was a statistically significant difference between the means, but in the opposite direction from that proposed by Hypothesis 5a. The mean DA value for the high IRB ratees was .66, as compared to .54 for the average and low IRB ratees (t = -8.21, p < .001). Thus, Hypothesis 5a was not supported.

Hypothesis 5b predicted that there would be more search for ratees with average or low levels of in-role behavior than for those with high IRB. The mean amount of search by ratee is shown in Table 11.

Table 11
Means, Rank Order, and Standard Deviations for Amount of Search by Ratee

Ratee                             Mean (Rank)     S.D.
Pat (high IRB, high OCBI)         4.34 (5)        1.04
Chris (high IRB, ave. OCBI)       4.63 (4)        1.47
Terry (ave. IRB, high OCBI)       4.32 (6)        1.23
Kim (ave. IRB, ave. OCBI)         4.70 (3)        1.55
Jody (low IRB, high OCBI)         4.78 (2)        1.40
Lynn (low IRB, ave. OCBI)         4.97 (1)        1.11

Search was higher for the lower "performing" ratees. Comparing the mean for the high IRB ratees (4.48) to that for the others (4.69) showed a small difference in the direction of the hypothesis. A repeated measures multiple analysis of variance using level of in-role performance as the repeated measure demonstrated that this difference between the means approached statistical significance (F = 3.04, p < .10, partial eta² = .026). Thus, Hypothesis 5b received modest support from these findings.

Extra-Role Performance

Hypothesis 6a predicted that halo would be higher for ratees who exhibited high levels of extra-role behaviors (OCBI). Table 12 gives the means and standard deviations relative to this hypothesis.

Table 12
Median and Mean Intercorrelations by OCBI

                                      MEDCORR     S.D.       MNCORR     S.D.
High OCBI (Pat, Terry, Jody)          .43883     .17011     .44073     .22101
Average OCBI (Chris, Kim, Lynn)       .30136     .15977     .31533     .22076

T-tests comparing the median intercorrelations revealed strong differences between ratees who were high versus average on OCBI (t = 7.88, p < .001). This finding held up with the mean intercorrelations as well (t = 5.37, p < .001). Thus, there was strong support for the hypothesis that there would be more halo with the presence of highly favorable extra-role behaviors.

Hypotheses 6b and 6c predicted that this halo effect would carry over into differences between the ratee groups on stereotype accuracy and differential accuracy. Means and standard deviations for these values can be seen in Table 13.

Table 13
Means and Standard Deviations for SA and DA by Level of OCBI

                                      St. Acc.    S.D.     Diff. Acc.    S.D.
High OCBI (Pat, Terry, Jody)           .4822      .162       .5317       .146
Average OCBI (Chris, Kim, Lynn)        .4291      .144       .7436       .129

Hypothesis 6b was supported, in that stereotype accuracy was worse for ratees high on OCBI. A repeated measures MANOVA using level of OCBI as the repeated measure was statistically significant (F = 10.49, p < .01, partial eta² = .084). However, Hypothesis 6c was not supported, in that differential accuracy was substantially worse for ratees who were average (or neutral) on OCBI (F = 217.39, p < .001, partial eta² = .654). There is, then, only mixed support for the notion that halo effects would carry over to accuracy effects (at least, in the manner hypothesized in this study).

In-Role and Extra-Role Interactions

Tables 9 and 10 above, as well as Figure 15, showed that at a general level, the accuracy of ratings was affected by the interaction of in-role and extra-role performance. In this study, subjects were least accurate in rating Lynn (low IRB, average OCBI). In general, their ratings of all the ratees were higher than the ratings given by the subject matter experts, with the exception of Kim, where the mean rating from the sample was lower than the true scores. Hypotheses 7a and 7b concerned two specific interactions that were predicted based on past performance appraisal research.

Hypothesis 7a predicted that subjects would search for more information on ratees who were inconsistent in the levels of in-role and extra-role behaviors exhibited than for ratees where performance information was consistent concerning in-role and extra-role performance. Referring back to Table 11 above, ratees Pat, Kim, and Lynn were considered "consistent", and Chris, Terry, and Jody as "inconsistent". The mean for "consistent" ratees was 4.67; for "inconsistent", it was 4.58. This difference was in the opposite direction from what was expected; however, a repeated measures MANOVA revealed that this difference was not statistically significant. Overall, then, Hypothesis 7a was not supported.

Hypothesis 7b predicted that differential elevation would be worse for inconsistent ratees than for ratees who were consistent in their performance levels. This was tested by calculating separate differential elevation values for each group of ratees, and then analyzing these in a repeated measures MANOVA. The mean DE for the consistent ratees was .394; for the inconsistent ratees, it was .406. This difference was in the direction hypothesized (lower scores indicating greater accuracy), but was not significant (p = .64), and only accounted for a very small proportion of rating variance (partial eta² = .002). Thus, Hypothesis 7b was also not supported. There is no evidence in this study that the consistency of ratee performance information impacted rater search or accuracy.

Method of Rating, OCBI, and Accuracy

The final pair of hypotheses concerned the relationship between person- versus dimension-blocked rating, level of extra-role behavior, and accuracy.
It was expected that the greater reliance on general impressions when rating by person would be enhanced when subjects rated ratees displaying high levels of OCBI. As depicted in Figure 8, two main effects and no interaction were predicted, both for stereotype accuracy and for differential accuracy. Means and standard deviations relevant to Hypotheses 8a and 8b are given in Table 14. The results of the repeated measures MANOVA are presented in Table 15.

Table 14
Means and S.D.'s for SA and DA by Method of Rating and Level of OCBI

Stereotype Accuracy
                                      By Person            By Dimension
                                      Mean     S.D.        Mean     S.D.
High OCBI (Pat, Terry, Jody)          .4642    .151        .5003    .172
Average OCBI (Chris, Kim, Lynn)       .4355    .135        .4227    .153

Differential Accuracy
                                      By Person            By Dimension
                                      Mean     S.D.        Mean     S.D.
High OCBI (Pat, Terry, Jody)          .5387    .153        .5247    .139
Average OCBI (Chris, Kim, Lynn)       .7376    .143        .7497    .116

Consistent with prior analyses, there was no main effect for method of rating. There was a main effect for level of extra-role behavior, which accounted for 8.5% of the variance. There was some indication of an interaction between method of rating and level of OCBI (partial eta² = .019), but this did not attain statistical significance (p = .14). Overall, then, there was little support for Hypothesis 8a.

Table 15
MANOVA for Method of Rating, OCBI, and Stereotype Accuracy

Source                        df      MS     F-ratio   Partial Eta²
Method of Rating (X2)          1     .01       .25         .002
Raters/Condition (R/C)       114     .03
Level of OCBI (OCBI)           1     .16     10.60**       .085
X2 x OCBI                      1     .03      2.25         .019
OCBI x R/C                   114     .02

Table 16
MANOVA for Method of Rating, OCBI, and Differential Accuracy

Source                        df      MS     F-ratio   Partial Eta²
Method of Rating (X2)          1     .00       .00         .000
Raters/Condition (R/C)       114     .03
Level of OCBI (OCBI)           1    2.61    217.06***      .656
X2 x OCBI                      1     .01       .82         .007
OCBI x R/C                   114     .01

Note. * p < .05; ** p < .01; *** p < .001.

Hypothesis 8b also predicted two main effects and no interaction between method of rating and level of OCBI, using differential accuracy as the dependent variable. The means were presented above in Table 14, and the results of a repeated measures MANOVA are given in Table 16. Once again, there was no main effect for method of rating. Level of OCBI had a sizable main effect, with a partial eta² of .656. As noted in Hypothesis 6c above, however, this was in the opposite direction from what was predicted. Finally, the interaction between method of rating and OCBI was not significant. Thus, there was no support for Hypothesis 8b.

CHAPTER 5: DISCUSSION

Three broad questions guided this research: a) does the method of making ratings (by person versus by dimension) influence various process and outcome variables?, b) do experienced raters make use of information concerning both in-role and extra-role behavior when making performance appraisal ratings?, and c) is there an interaction between rating format and the level of extra-role performance demonstrated by ratees?

Hypotheses 1-4 tested issues relevant to the first question. Results were strongest for the influences of type of format and prior knowledge of that format on measures of halo. Results were mixed concerning the effects of these variables on various conceptualizations of rater accuracy. Results were weakest concerning any effects of these between-subject manipulations on the two process variables (type and amount of search). Hypotheses 5-7 tested the effects of in-role and extra-role behavior on both process and outcome variables. As with Hypotheses 1-4, results were mixed concerning the specific hypotheses in this study.
Results clearly supported the proposed link between level of extra-role behavior and halo. Level of extra-role behavior was also significantly related to both stereotype and differential accuracy, although the latter was in the opposite direction from what was hypothesized. Finally, there was some evidence that level of in-role behavior influenced the amount of search made for particular ratees. When results from various aspects of this study are combined, there is strong support for the notion that these raters utilized both in-role and extra-role information when making their ratings.

Finally, Hypothesis 8 tested for the interaction between method of rating, level of extra-role behavior directed toward other individuals, and two accuracy measures. For both stereotype and differential accuracy, there was a significant main effect for level of extra-role behavior, no main effect for method of rating, and no interaction.

In this chapter, Hypotheses 1-7 are discussed separately, under the broad headings of "Type of Format" and "Type of Performance Dimensions". Following this, the proposed interaction between method of rating and extra-role behavior is discussed. The chapter concludes with a summary and discussion of directions for future research.

Hypotheses Concerning Type of Format

Overall Ratings

Because of the conflicting findings from previous research (Williams et al., 1986; Cafferty et al., 1986; DeNisi et al., 1989), no formal hypotheses were advanced in this study concerning the effects of type of format or prior knowledge of format on the overall ratings given to each ratee. In this study, those subjects rating by person were less accurate than those rating by dimension. Averaging across ratees, the person-blocked conditions rated .085 higher than the true scores, while the dimension-blocked conditions rated .013 lower than the true scores. While the difference between these two values was not statistically significant, at the least these findings do not support the argument from Williams et al. (1986) that overall ratings should be more accurate when ratings are made by person.

Table 4 did not report the standard deviations by ratee and condition, but these ranged from .37 to .88, with an average standard deviation across ratees and conditions of .69. Thus, there was a fair amount of variance in these ratings. What is remarkable, however, is how closely the mean ratings from the primary sample matched the true scores assigned to each ratee. Given the high levels of work experience reported in Table 2, it is plausible to argue that rating overall performance levels is something at which this sample is quite proficient. Type of format seemed to make little difference in the ability of these raters to measure overall performance levels.

Halo

One of the strongest effects found in this study was that for type of format on both the median and mean intercorrelation among ratings of dimensions, across ratees. This was somewhat surprising, in part because the four previous studies which made such a format distinction and used halo as a dependent variable found no significant differences between ratings made by person versus by category or dimension (cf. Cooper, 1981). Thus, this effect was only expected when raters were told in advance which format they would be using to make their ratings. Instead, as Table 5 and Figure 11 made clear, there was also a significant main effect for prior knowledge of format. Halo was highest in Condition 1 (prior knowledge of a person-blocked format), and
next highest in Condition 3 (also personrblocked format, but with no prior knowledge). Halo in Condition 2 (prior knowledge, dimension format) was slightly lower than Condition 3. Halo was lowest of all in Condition 4 (no prior knowledge, dimension format). 132 Table 5 also reports the median and mean intercorrelation among dimensions for the subject matter experts. This corresponds to Cooper's (1981) notion of "true halo”, i.e., how much intercorrelation among these dimensions there “should be". As can be seen, the mean intercorrelation for the subject matter experts was .2898. Condition 4 had a mean intercorrelation which was quite close to this (.2975), but all the other conditions had MNCORR values which were considerably higher, indicating that the experimental manipulations had increased the level of halo beyond the true intercorrelations. The median intercorrelation for the true scores was .3428, which was higher than the mean intercorrelation. Conditions 1, 2 and 3 had MEDCORR ‘values above this, while the MEDCORR.value for Condition 4 was below this (.2956). It would seem that unexpectedly having to make performance ratings by dimension caused raters in Condition 4 to see less intercorrelation among dimensions than was "actually there" (at least, according to the subject matter experts). While speculative, it is possible that such a scenario led raters to underestimate the true relations among dimensions (cf., Murphy 6: Jako, 1989). In any case, comparing across the two halo measures, raters in Condition 4 had halo values which were closest to the subject matter experts. It is not entirely clear why the results for the halo effect were so much stronger in this study than has been found in previous research. One possibility has to do with the use of a search constraint in the present research. As discussed above in Chapter 3, subjects were allowed to access up to 28 of the 36 available critical incidents. This constraint was selected to emulate a real appraisal setting, where search is not 133 unlimited. Also, it was expected that the constraint would force raters to think about which information.they'most wanted.to have before they'made their ratings. Observation of the subjects while they underwent the search task, as well as evidence to be presented below under ”Type of Performance Dimensions" would indicate that the constraint achieved this objective for the vast majority of subjects. Unfortunately, such forced selectivity came at the cost of "forced halo”, i.e., without any information on ratee performance on certain dimensions, subjects were forced to rely on the information they had gained concerning ratee performance on other dimensions. In this case, the best response raters could make (in terms of optimally responding to the situation presented to them) was to make at least eight "haloed" ratings. Viewed in retrospect, it seems surprising that the halo measures for the primary sample aren't more divergent from the true scores than they are. In any case, it is possible that this "forced halo" may have served to accentuate the differences between ratings made by person versus by dimension. Accuracy Correlational Accuracy. Some of the most puzzling findings in this study concern the impact of type of format and prior knowledge of format on measures of rater accuracy. 
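Before reviewing those findings, it may help to recall how the correlational forms of accuracy differ from the distance scores. The sketch below shows one common formulation of the three correlational components (cf. Sulsky & Balzer, 1988): the differential elevation correlation relates a rater's ratee means to the true ratee means, the stereotype accuracy correlation does the same across dimension means, and the differential accuracy correlation relates the ratee-by-dimension residuals. The function and example data are illustrative assumptions only; the exact formulas used in this study are those described in Chapter 3.

```python
import numpy as np

def correlational_accuracy(ratings, true_scores):
    """Correlational analogs of differential elevation, stereotype accuracy,
    and differential accuracy for one rater (higher r = greater accuracy)."""
    X = np.asarray(ratings, dtype=float)
    T = np.asarray(true_scores, dtype=float)
    corr = lambda a, b: float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
    # residuals after removing ratee and dimension effects (ratee x dimension pattern)
    resid = lambda M: M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

    return {
        "DECORR": corr(X.mean(axis=1), T.mean(axis=1)),   # rank ordering of ratees
        "SACORR": corr(X.mean(axis=0), T.mean(axis=0)),   # rank ordering of dimensions
        "DACORR": corr(resid(X), resid(T)),               # ratee x dimension pattern
    }

# Illustrative example only: invented 6 x 6 ratings.
rng = np.random.default_rng(1)
true = rng.integers(2, 7, size=(6, 6)).astype(float)
observed = true + rng.normal(0.0, 0.8, size=(6, 6))
print(correlational_accuracy(observed, true))
```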
First, using the correlational forms of accuracy, the magnitude of these correlations was quite impressive; i.e., across the four conditions, r = .90 for the differential elevation correlation, r = .59 for the stereotype accuracy correlation, and r = .44 for the differential accuracy correlation (all p < .001). This indicates that this sample of supervisors was a) extremely good at capturing the proper rank ordering of ratees, b) quite good at capturing the levels of performance demonstrated on the various dimensions, and c) relatively good at capturing the "true" performance of the various ratees on the various dimensions (according to the true scores). Type of format and prior knowledge of format had only a modest impact on these correlational measures. Contrary to expectations, subjects rating by dimension were slightly more accurate in rank ordering ratees (DECORR) than those rating by person (p < .10). Also, subjects with prior knowledge of the dimension format were worst at rank ordering performance on the different dimensions (SACORR), while subjects without prior knowledge of a dimension-blocked format were best (p < .10). These findings are not strong enough or clear enough to argue that rating accuracy was substantially affected by method of rating (Sulsky & Balzer, 1988).

Distance Accuracy. Moving to the more commonly used distance score formulations, accuracy was greater when rating by dimension for three of the four accuracy components, i.e., for elevation, differential elevation, and differential accuracy (Figure 13). This was also true for overall accuracy (Figure 14). However, the magnitude of these effects was small (sr² never exceeded .02), and only the effect for differential accuracy was in line with the predictions made in this study. Even in retrospect, the author finds it hard to explain why the results for differential elevation and stereotype accuracy came out opposite from what was predicted. Concerning rank ordering the ratees, it would seem that a 6 x 6 matrix of ratees and dimensions must not have been large enough for raters to lose sight of overall ratee performance levels (see page 35 above). But why subjects in Condition 4 would turn out to be best at rank ordering ratees, or why subjects in Condition 2 would be worst at ranking performance on the dimensions, is far from clear.

The results concerning prior knowledge of format were also not as anticipated. It was expected that prior knowledge would impact rater accuracy primarily in conjunction with type of format, i.e., through the various interactions presented in Figure 4. Instead, prior knowledge had a stronger independent effect than was hypothesized. Moreover, regardless of condition, subjects with prior knowledge of format were generally less accurate than subjects without prior knowledge (see Figure 13). Using distance scores, the main effect for prior knowledge was statistically significant for differential elevation, stereotype accuracy, and overall accuracy. As was shown in Chapter 4 and will be discussed below, the prior knowledge manipulation did not have its intended impact on the two process variables (type and amount of search). Thus, the projected link between prior knowledge and search could not be established. Yet, in some manner, prior knowledge was affecting the accuracy of the ratings given. It would seem that, in this study at least, viewing the appraisal format prior to search proved to be a disruptive influence in the rating process.
One possible explanation for this is that the manipulation led subjects in Conditions 1 and 2 to expend more energy focusing on the particular format which was presented to them, rather than on viewing and rating the performance incidents. In a “real” appraisal setting, of course, it should be clear well in advance of the rating task what type of format raters will use to make their ratings. Thus, the impact of prior knowledge on accuracy should be less salient in an applied setting. 136 Nonetheless, it is not particularly encouraging that, in general, subjects in this study who were "surprised” by the appraisal format they received were mm accurate than those who knew their respective format in advance. Returning to the information processing literature, an explanation can be found for the general superiority of subjects in Condition 4 (no prior knowledge of format, and then a dimension-oriented format). Subjects in this condition exhibited less halo than subjects in the other conditions, and were also generally more accurate (cf. , Figures 13 and 14). Ilgen and Feldman (1983), Lord (1985) and others have written extensively about the differences between the "automatic" and "controlled” processing of information. Automatic processing is the dominant mode of information processing, whereby individuals engage in little conscious monitoring or processing of information (e. g. , this is typical when driving a car; Ilgen 6: Feldman, 1983) . Controlled processing takes place when information processing is specifically under the conscious control or mediation of the individual. Such processing is generally brought forth to deal with some “effortful or problematic" situation (Ilgen & Feldman, 1983, p. 156). The above description fits the current study quite well. Lord (1985) argued that most attention, storage, and retrieval of information from memory is governed by automatic processing. This and prior research have shown raters' strong tendencies to search for information and make ratings "by person". This is most likely the "automatic" response to such an information processing task. In contrast, subjects in Condition 4 were " j olted" by an unexpected rating format. It is possible that unexpectedly having to rate by dimension caused subjects in this condition to engage in 137 far more controlled processing (before making their ratings) than did subjects in the other conditions. Such an explanation is post hoc and speculative, but is worthy of further research. To summarize the above findings, then, does type of format influence rater accuracy? The best answer from the distance and correlational accuracy results would seem to be that there is some influence, in that ratings made by dimension are slightly more accurate. But the results were not strong, and.were not consistent with prior research or theory in this area. Several points can be made concerning the generally weak findings concerning the between-subject manipulations and rater accuracy. The first two points draw on the analyses patterned after Dickinson (1987). As Table 7 demonstrated, most of the variance in these ratings was not explained by type of format or prior knowledge of format. There were sizable differences in how the primary sample and the subject matter experts viewed the performance levels demonstrated by these ratees on these dimensions, but these differences were only modestly related to the betweenrsubject manipulations. 
It is possible that the search constraint used in this study not only served to increase halo (as mentioned above), but also added significant "noise" (or guessing) which weakened overall accuracy results, for these and earlier accuracy analyses.

Second, the observed power values for Table 7 were much lower than expected. The average observed power value across the different sources of variance was .58, well below the .80 specified in the a priori power analysis. The experiment was thus designed and carried out with enough statistical power to be able to detect "small" effect sizes (as specified by Cohen, 1988); unfortunately, however, actual effect sizes came out even smaller than what was specified in the a priori power analysis. This raises the question of whether the format manipulations used in this study have any meaningful effect on measures of accuracy.

Third, an explanation for these generally weak effects which has not yet been mentioned is that, similar to prior research (Padgett & Ilgen, 1989; DeNisi, Robbins, & Cafferty, 1989), raters were allowed to take notes as they went through the simulation. The idea behind this was to make the experience less taxing on subjects' memories, as several raters in the pilot sample made it clear that they did not want to be in a "testing" environment. However, it is possible that such note taking also served to level or minimize differences between the conditions. Based on the author's observation of the manner in which raters completed the project, it would seem that, for many raters, filling in the ratings at the end of the simulation was largely a matter of being able to read one's notes "down" or "across", i.e., by dimension or by person. This may have weakened the differences that might otherwise have been observed between conditions. In the South Carolina research (summarized by DeNisi & Williams, 1988), "medium" effect sizes (Cohen, 1988) have typically been observed as a result of their person- versus task-blocking manipulations. In that research stream, however, much more attention has been given to recall issues than was done in the present research. Note taking has obviously been excluded from such studies (with the exception of DeNisi et al., 1989). It is worth further investigation whether a concern for both recall and rating in the same study produces stronger effects for such manipulations than was found in the present study (cf. Figure 2).

In conclusion, then, there was some general advantage to rating by dimension in terms of accuracy. However, results were not particularly supportive of the hypotheses for the specific Cronbach accuracy measures. Thus, this study does little to answer the question of whether these different formats aid or hinder raters in making intra- versus inter-ratee discriminations (cf. p. 23 above).

The Process Variables

Type of Search. It was also expected that the between-subject manipulations would influence the type and amount of search undertaken by subjects, at least for those with advance knowledge of the format they would be using. Such was not the case. Across the four conditions, the mean number of person-blocked transitions was 15.2; the mean number of dimension-blocked transitions was 4.6; and the mean number of nonblocked transitions was 6.6. This clearly supports earlier findings that raters have natural tendencies to search for information by person (DeNisi et al., 1983; Cafferty et al., 1986).
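For readers unfamiliar with the Payne (1976) measure referred to above, the sketch below illustrates one common way such a type-of-search index can be computed from the ordered sequence of information requests: transitions that stay with the same ratee count as person-blocked, transitions that stay with the same dimension count as dimension-blocked, and the index is their normalized difference, so that +1 indicates purely person-blocked search and -1 purely dimension-blocked search. The scoring details and the example sequence are illustrative assumptions; the exact formulation used in this study is the one described in Chapter 3.

```python
def payne_search_index(accesses):
    """Type-of-search index in the spirit of Payne (1976).

    accesses: ordered list of (ratee, dimension) information requests.
    Returns (index, person_transitions, dimension_transitions). Transitions
    that change both ratee and dimension ("nonblocked") are not counted.
    """
    person = dimension = 0
    for (r1, d1), (r2, d2) in zip(accesses, accesses[1:]):
        if r1 == r2 and d1 != d2:
            person += 1        # stayed with the same ratee
        elif d1 == d2 and r1 != r2:
            dimension += 1     # stayed with the same dimension
    blocked = person + dimension
    index = (person - dimension) / blocked if blocked else 0.0
    return index, person, dimension

# Illustrative sequence only: a rater who works through two ratees one at a
# time, then compares three ratees on a single dimension.
sequence = ([("Pat", d) for d in range(1, 7)] +
            [("Kim", d) for d in range(1, 7)] +
            [("Pat", 2), ("Jody", 2), ("Lynn", 2)])
print(payne_search_index(sequence))   # positive index: person-oriented search
```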
However, there was no indication that those who knew they would rate by dimension changed their search pattern to fit a dimension—blocked format, or that those knowing they would rate by person searched more strongly by person. In fact, there seemed to be a small minority (18%) fairly evenly spread across the conditions who consciously chose to search for information by dimension. This group was neither more nor less accurate in their ratings than the majority who searched for information by person. Further, the congruence of search style and format type was also not predictive of rater accuracy. Overall, then, the manipulations of type of format and prior knowledge of format had no impact on the type of search raters engaged in. 140 It is possible that the prior knowledge manipulation was too subtle to dislodge raters from their natural search tendencies. However, the effects of the prior knowledge manipulation on measures of halo and accuracy described above would indicate that this manipulation did have some effect on raters. What it did not do, however, was cause them to change their search strategies. Follow-up interviews with selected subjects from the primary study will be conducted at a later date to seek to explain why this occurred. Amount of Search. 'Two significant flaws in the design.of this study were that a) there were not enough performance incidents, and b) not enough.search was allowed to adequately test the hypothesized.relationship between type of format, prior knowledge, and amount of search. For example, as a target stimulus, DeNisi et a1. (1983) used performance incidents for four workers on four tasks, with four incidents available for each worker on each task (64 total incidents). Thus, they'had a depth of search.per task aspect to their design whichuwas lacking in.the present study. It would have been practically difficult to come up with more performance incidents which matched the performance dimensions and performance levels required in this study, but at.a minimum, it seems that there should have been two performance incidents available for each ratee on each dimension (for 72 total incidents). With. more incidents available, and a different search constraint, differences in amount of search would have been more likely to surface. Another difference between this study and previous research is that, in previous work (DeNisi et a1., 1983; Kozlowski & Ford, 1991), raters were only asked to make ratings of overall performance, whereas in this 141 study, raters were asked to evaluate ratees on each dimension, as well as provide an overall rating. With no prior knowledge of ratee performance (as was available in Kozlowski & Ford, 1991), subjects needed all the information they could get in order to make informed ratings on each dimension. Also, as has been discussed above, the imposed search constraint served as an additional limit on the variability of search. Although the search constraint was intended to be "moderate", when combined with the factors just described, it seemed to constrain rater variability of search far more than was intended. Thus, amount of search did not get a fair test in the current research. Future research will have to address whether knowledge of format influences the amount of search undertaken. In summary, the manipulations of type of format and.prior knowledge of format had.a strong impact on measures of halo, some impact on measures of accuracy, and no discernable impact on the process variables. 
Next, the hypotheses concerning the within-subject manipulations are discussed.

Hypotheses Concerning Type of Performance Dimensions

The second broad question guiding this research was whether experienced raters make use of information concerning both in-role and extra-role performance when making appraisal ratings. Support for this notion can be found in several findings from this study. First, as one might expect, the regression analyses using the contrast-coded variables on the overall ratings (Table 4), as well as on the 36 ratings of ratee performance on particular dimensions (Table 6), revealed that the dominant influence on both types of ratings was in-role performance. It is both logically and legally defensible that such behaviors are a primary determinant of performance ratings (Werner, 1992). However, extra-role performance and the interaction of in-role and extra-role performance had significant beta weights in both regression equations. These were, in fact, much smaller than the beta weights for in-role performance. Also, the impact of extra-role performance was much smaller in this study than was found by MacKenzie et al. (1991). Nonetheless, the extra-role manipulations explained statistically significant amounts of rating variance beyond that explained by the in-role performance manipulations.

Further support for the idea that experienced raters are concerned with information about both in-role and extra-role performance comes from additional data which were collected for each ratee in the study. The computer program used in this study recorded four items of information which are relevant here: a) the amount of search engaged in for each ratee; b) the amount of search engaged in for each dimension; c) the order in which subjects sought information concerning the six ratees; and d) the order in which subjects sought information concerning the six performance dimensions. Both amount of search by ratee and by dimension and order of search by ratee and by dimension give some sense of the relative importance of these various targets to this sample of supervisors. As mentioned above, every time the computer simulation was run, a different order appeared for both ratees and dimensions, so any differences in search should be because of intentional choices by the subjects, rather than because of any order effects inherent in the presentation of information. Results for these variables are found in Table 17.

Table 17
Amount and Order of Search, by Ratee and by Dimension

                                       Amount of Search        Order of Search*
Ratee                                  Mean (Rank)    S.D.     Mean (Rank)    S.D.
Pat (high, high)                       4.34 (5)       1.04     3.62 (5.5)     1.71
Chris (high, ave.)                     4.63 (4)       1.47     3.58 (4)       1.68
Terry (ave., high)                     4.32 (6)       1.23     3.50 (2)       1.81
Kim (ave., ave.)                       4.70 (3)       1.55     3.24 (1)       1.61
Jody (low, high)                       4.78 (2)       1.40     3.57 (3)       1.87
Lynn (low, ave.)                       4.97 (1)       1.11     3.62 (5.5)     1.79

                                       Amount of Search        Order of Search*
Dimension                              Mean (Rank)    S.D.     (Rank)
1) Job Knowledge & Accuracy            5.35 (1)       1.45     (1)
2) Productivity                        5.21 (2)       1.34     (3)
3) Dependability/Attendance            5.02 (3)       1.53     (2)
4) Following Policies & Procedures     3.71 (6)       1.95     (5)
5) Cooperation & Teamwork              4.53 (4)       1.65     (4)
6) Extra Effort/Initiative             3.91 (5)       1.84     (6)

* Lower values for order of search indicate earlier search.

The values for amount of search by ratee are the same as those presented in Table 11. A repeated measures MANOVA using amount of search as the repeated measure was statistically significant (F = 3.69, p < .01, partial eta² = .03).
Post hoc comparisons using Tukey's Honest Significant Difference (HSD) test revealed that the amount of search for Pat and Terry was significantly lower than that for Lynn (Glass & Hopkins, 1984). In the right columns of Table 17 are the values for order of search by ratee. A repeated measures MANOVA using order of search 3 as the repeated measure revealed no statistically significant differences. This is desirable, because it indicates that there were no order effects for ratees. A totally random order of viewing ratees would have given all ratees values of 3.5. As can be seen, the obtained order of search values were all quite close to this. The bottom half of Table 17 contains the values most relevant to the current discussion. There was considerable variation in the amount of search by dimension. A repeated measures MANOVA using amount of search as the repeated measure was highly significant (F - 17.24, p < .001, partial etaz - .13). Subjects were most interested in information on the two in— role dimensions, and least interested in information concerning "Following Policies and Procedures” . Tukey's HSD test revealed that dimensions 1 and 2 differed significantly from dimensions 4, 5, and 6; dimension 3 differed significantly from dimensions 4 and 6; and dimension 4 differed significantly from dimension 5 (see Table 17). Finally, the lower right columns display the order of search by dimension. On average, these supervisors looked first at information concerning job knowledge and accuracy, attendance, and productivity. 145 .After this, their order of search tended to be: cooperation and teamwork, following policies and procedures, and then extra effort and initiative. A repeated measures MANOVA using order of search as the repeated measure was statistically significant (F - 11.72, p < .001, partial etazl- .09). Tukey's HSD test revealed that dimensions 1, 2, and 3 differed significantly from dimensions 4 and 6, and dimensions 1 and 3 differed significantly from dimension 5. Looking at the raw values and rankings for the dimensions, the combined results from the analyses for amount and order of search would indicate that, for this job, subjects focused most on the top three dimensions (job knowledge, productivity, and attendance). There was a fairly clear distinction between the values for these three dimensions and the last three dimensions. Once again, this is not an order effect, as different subjects saw the dimensions presented in different orders. This finding is interesting, in that attendance and following policies and.procedures were both intended to capture L. Williams' (1988) construct of "organizational citizenship behavior directed toward the organization" (OCBO). Similarly, these two dimensions correspond to Organ's (1988b) Conscientiousness construct. Yet, as Organ (1988b) also noted, such dimensions "straddle the boundary" between in—role and extra- role performance. For the position of secretary at this university, attendance appeared very much.to be viewed.as an in-role behavior, whereas following policies and procedures (or "rule compliance", in Organ's terms) was treated.more like the other extra-role dimensions. 
Thus, despite the fact that performance levels for inrrole behavior and OCBO were yoked together, and that performance levels for the OCBI (or Altruism) 146 dimensions of teamwork and extra effort were manipulated independently of the other dimensions, subjects in this study treated rule compliance differently than the first three dimensions, and.more like the latter two (OCBI) dimensions. This provides evidence of the “boundary straddling" nature of these OCBO behaviors (Organ, 1988b). The above analyses can also be viewed as evidence of the problems inherent in Organ's (1988b; 1990a) use of the terms "in-role" and "extra- role" behavior. As Organ would admit (1988b), these terms are imprecise and hard to pin down. Further research is needed to help clarify these terms. Two recent developments in the organizational behavior literature are useful in this regard. First, in the last few years, there has been a resurgence in interest in the use of personality traits in personnel selection (cf., Hollenbeck, Brief, Whitener, & Pauli, 1988; Day & Silverman, 1989). Recently, three empirical articles have appeared in the literature which compared 'various personality' dimensions 'with. a ‘number of' criterion measures (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Tett, Jackson, & Rothstein, 1991; Barrick & Mount, 1991). Hough et a1. (1990) used six personality dimensions (from Hogan, 1982) with a large military sample, and found that the Conscientiousness subscale of their Dependability dimension was unrelated to a measure of technical proficiency (r - .02), but significantly related to a measure of personal discipline (r - .23, uncorrected for unreliability, p < .01). Tett et a1. (1991) conducted a meta—analysis using quite restrictive criteria for inclusion in their study. For the Conscientiousness dimension, they found a mean correlation with criterion measures of .12 (r - .18, corrected for unreliability). 147 Using a much larger sample of studies and subjects, Barrick and Mount (1991) found slightly stronger results for the Conscientiousness dimension (r - .13; when corrected for unreliability, r - .22). Of interest in Barrick and Mount (1991), however, is that validity coefficients for Conscientiousness were stable across five occupational categories and three types of criterion measures. Further, the validity coefficients for Conscientiousness were consistently the highest of'any'of the personality dimensions they measured (including Extraversion, Emotional Stability, Agreeableness, and Openness to Experience). Thus, from all three of the above studies, it would seem that Conscientiousness is important to the accomplishment of important work tasks in all jobs (Barrick & Mount, 1991). FUture research on organizational citizenship behavior must take note of the research just presented. It is likely that Conscientiousness will demonstrate a moderate, but statistically significant relationship with "inerole" behavior. Such an interrelationship must be considered as Organ's theoretical work is refined and expanded in the future. A second recent development in the organizational behavior literature was the publication of'a chapter on job design and.roles in.the latest edition of the andboo of dust a1 0 ani atio a (Ilgen & Hollenbeck, 1991). Ilgen and Hollenbeck (1991) documented how two relatively non-overlapping literatures have developed in the past, one focusing on "jobs" (i.e., job analysis and design), and the other on "roles" (e.g., role conflict and ambiguity). 
These authors sought to integrate these two related literatures, and presented a theory of job-role differentiation.

Ilgen and Hollenbeck (1991) defined a job as a set of task elements grouped together under one job title and designed to be performed by one individual. They referred to the "designed" or "official" tasks of the job as established task elements. In addition to these established task elements, there are also emergent task elements. These emergent task elements develop as the individual seeks to carry out his or her job in a particular work setting. These emergent task elements are more subjective, personal, and dynamic than the established task elements, and are also specified by more social sources than are the established task elements (Ilgen & Hollenbeck, 1991). Roles, then, are larger sets of task elements, which contain both established and emergent task elements. Thus, two individuals working on the same job may have very different roles, because of differences in skills, personality, tenure, etc.

It can be seen that what Ilgen and Hollenbeck (1991) referred to as established task elements is very similar to Organ's (1988b) concept of in-role behavior, whereas the emergent task elements bear some similarity to extra-role or citizenship behaviors. The value of Ilgen and Hollenbeck's approach, however, is that they discussed how a task element which begins as a part of a larger "role set" can over time be incorporated into the formal job itself, e.g., as job analysts seek to codify (or institutionalize) what is done on a given job. Thus, instead of speaking of in-role behaviors (as Organ and this research have done), it is more precise to speak of "in-job behaviors", i.e., those task elements which are clearly spelled out in a job specification. What this research has called extra-role behaviors would then be more properly labelled "extra-job behaviors". Such extra-job behaviors may or may not be expected for a given role, but again, as such role expectations become institutionalized through the job analysis procedure, these behaviors would cease to be emergent and become established task elements of a given job. Future research in this area must be more precise in defining the terminology used to describe different behaviors, and would profit from incorporating the job-role distinction presented by Ilgen and Hollenbeck (1991).

Overall, then, these supplementary analyses demonstrated that subjects in this study were most concerned with performance on the in-role (or in-job) performance dimensions. The extra-role (OCBI) dimensions were searched less and later than the other dimensions. Nonetheless, it is important to note that this information was sought and used by these supervisors. Anecdotally, no supervisor in this study objected to the use of any of the performance dimensions included in this study. More importantly, the means for teamwork and extra effort were a long way from zero. Given the search constraint imposed in this study, supervisors could have chosen to ignore one or both of the OCBI dimensions. In fact, only four percent of supervisors sought no information on either dimension 5 or 6 (the mean percentage of supervisors who skipped a dimension was three percent across all six dimensions). Instead, most supervisors sought out information on these dimensions and, as the regression analyses discussed above illustrated, extra-role performance explained statistically significant variance beyond that explained by in-role performance.
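One way to formalize the "variance beyond" claim in the preceding sentence is a hierarchical use of the contrast-coded predictors: the in-role terms are entered first, and the increment in R² from adding the extra-role term (and the interaction) is then tested. The sketch below illustrates that logic on invented data; the particular contrast codes, numbers, and variable names are assumptions for illustration only and are not the codes (X4-X6) or the data used in this study.

```python
import numpy as np

def r_squared(predictors, y):
    """R-squared from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()

# Invented illustration: one overall rating per ratee (6 ratees x 116 raters).
rng = np.random.default_rng(2)
in_role  = np.tile([1, 1, 0, 0, -1, -1], 116)    # high / average / low (one plausible coding)
extra    = np.tile([1, -1, 1, -1, 1, -1], 116)   # high versus average OCBI
interact = in_role * extra
ratings  = 4 + 1.2 * in_role + 0.3 * extra + 0.2 * interact + rng.normal(0, 1, in_role.size)

r2_in   = r_squared(in_role.reshape(-1, 1), ratings)
r2_full = r_squared(np.column_stack([in_role, extra, interact]), ratings)
delta   = r2_full - r2_in                        # variance beyond in-role performance
n, k_full, k_added = ratings.size, 3, 2
F_change = (delta / k_added) / ((1 - r2_full) / (n - k_full - 1))
print(f"R2 (in-role only) = {r2_in:.3f}, R2 (full) = {r2_full:.3f}, "
      f"increment = {delta:.3f}, F for the increment = {F_change:.2f}")
```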
So, all in all, these analyses support the notion that experienced raters use both types of information when making performance ratings. Next, the discussion will turn to the specific hypotheses advanced in this study relevant to type of performance dimension.

In-Role Performance

Differential Accuracy. It was expected that subjects would be most accurate in rating the performance of ratees exhibiting high levels of in-role behaviors or IRB (DeNisi & Stevens, 1981; Karl & Wexley, 1989). Instead, when Cronbach differential accuracy measures were calculated separately for ratees exhibiting high versus average and low levels of in-role behavior, it was found that accuracy was worse for the high IRB ratees (p < .001). The analysis of variance procedure presented in Tables 9 and 10 above (Dickinson, 1987; Dickinson et al., 1990) shed light on this, in that subjects in this study were closest to the expert true scores for ratees who were average in their levels of IRB. This is not consistent with Wexley et al. (1972), who found that average performance seemed to present raters with a more ambiguous stimulus than either high or low levels of performance.

One explanation for this difference is that Wexley et al. (1972) conducted their study in the context of selection, whereas in the present study, ratings were made in a performance appraisal setting. Hiring decisions are straightforward for applicants who are clearly high or low in their exhibited (or expected) performance levels. Ambiguity is highest for applicants in the middle range of performance. In contrast, in a performance appraisal rating task, it is easy and common to rate the majority of ratees as average. This holds true even if rating inflation has caused the typical mean rating in an organization to rise above the midpoint of the organization's rating scale. It is harder and less common to make ratings that are either unusually high or low. In fact, at the time this research was conducted, it was standard practice in this organization to require supervisors to put into writing reasons why they were assigning ratings at the extremes of the organization's rating scale. Such documentation was not required for mid-level ratings, leading to rather pronounced central tendencies in the actual ratings assigned.

For the ratings in this study, Table 10 provided the standard deviations by ratee for the orthonormal contrasts used in Table 9. This showed that the standard deviation was lowest, and thus rater agreement was highest, for Kim, the ratee who was average on both IRB and OCBI. Standard deviations were somewhat higher for the two low IRB ratees, and highest for the three "above average" ratees. It is possible that differences of opinion among the raters concerning what rating is appropriate for very high (and very low) performance contributed to the differential accuracy findings reported above. At the least, the findings from this study cast doubt on the idea that high performing ratees will be rated more accurately in a performance appraisal setting. Further research on this issue is needed.

Amount of Search. As predicted, raters searched more for ratees who were average or low on IRB (p < .10, partial eta² = .026). Observing the differences in Table 11, however, one can see that the distinction was primarily between the amount of search for low IRB performers (4.88) versus the amount of search for average and high IRB performers (4.51 and 4.49, respectively).
Also, comparing these results with Table 10, one can see that, contrary to predictions, raters searched more but were less accurate for the average and low IRB ratees combined. However, this is somewhat deceptive, in that these subjects searched for the largest amount of information on Lynn (low IRB, average OCBI), but were also least accurate in rating Lynn's performance. Thus, the relationship here is not straightforward, and needs to be studied further to be better understood.

Extra-Role Performance and Halo. From Organ (1990a), it was hypothesized that observed intercorrelations would be higher for ratees who demonstrated higher levels of extra-role behaviors (i.e., OCBI). Results in Table 12 strongly supported this. Using both the mean and median intercorrelations, halo was over 40% higher for ratees with high OCBI than for ratees with average OCBI levels (p < .001 for both measures). This raises an interesting issue. The descriptive question is whether experienced raters use both in-role and extra-role information when making performance ratings. The results from this study, as well as previous research (MacKenzie et al., 1991; Orr et al., 1989), strongly support the idea that raters use both types of information. The normative question is whether they "ought to" be using such information, i.e., is extra-role information contributing to halo error, and thus something to be minimized or controlled in the rating process, or is it a valuable but neglected source of information, which should be included in the rating process to explain more rating variance? In this study, halo was considerably higher for the high OCBI ratees. It was hoped that the issue of whether this was error or not would be answered by the results for the two accuracy measures presented below. Unfortunately, a clear answer does not emerge from these results.

Accuracy. Table 13 revealed that, like the effects for halo, stereotype accuracy was worse for ratees high in OCBI. This was a rather large effect (p < .001, partial eta² = .084), and would seem to indicate that the presence of clearly favorable extra-role information led raters to be less accurate in rating the performance levels intended for these dimensions. However, the results for differential accuracy were just the opposite, and even stronger in magnitude. The ability of raters to match the true score ratings for each ratee on each dimension was much worse for the average OCBI ratees (.53 for high OCBI, .74 for average OCBI). Partial eta² for this difference was .65. It is not clear why this effect was so strong in favor of the high OCBI ratees. It does, however, leave open the question of whether the halo effect reported above is necessarily a "bad" thing. Future research needs to address the normative issue of whether extra-role information should be included in the performance appraisal rating process.

Interaction Between In-Role and Extra-Role Performance

It was expected that when the level of performance demonstrated by ratees was inconsistent between the in-role and extra-role dimensions, raters would search more for the inconsistent ratees, but be less accurate on a measure of differential elevation (Padgett & Ilgen, 1989). Neither of these hypotheses was supported in the present study. Concerning amount of search, subjects actually searched for somewhat more information on the consistent ratees, although this difference was small and not statistically significant.
Based on comments made to the researcher by numerous subjects after they completed the project, there is anecdotal evidence that many subjects were very aware of the inconsistencies included in the ratee composites (particularly for Jody, the low IRB, high OCBI ratee). Yet, there is no indication that such awareness affected any of the measures collected in this study. It would seem that amount of search was affected more by the level of in-role performance (or overall performance) than by the consistency of the performance information available.

Unlike Padgett and Ilgen (1989), there was also no effect for consistency of information on differential elevation. In their study, Padgett and Ilgen (1989) were not concerned with issues of in-role versus extra-role performance dimensions. Also, Padgett and Ilgen provided raters with between eight and twelve performance incidents for each ratee on each performance dimension. Thus, inconsistency of performance may have been more salient in their study than in the present one. In any case, in the present study, the ability to rank order ratee performance was not affected by the consistency of information. It should be noted that Table 9 did indicate a moderate in-role by extra-role interaction using orthonormal contrasts between the true scores and the subjects' ratings (partial eta² = .056). As Table 6 indicated (and Figure 15 demonstrated), this was largely because subjects rated Lynn much higher than the true scores on the in-role dimensions, and Kim lower than the true scores on the extra-role dimensions. Thus, despite the lack of findings for Hypotheses 7a and 7b, it is expected that future research will detect meaningful in-role by extra-role interactions, and that the accuracy of ratings will be affected by the consistency of performance information.

Hypotheses Concerning Method of Rating, OCBI, and Accuracy

Two hypotheses were put forward in this study concerning type of format, level of OCBI, and accuracy. Specifically, for stereotype and differential accuracy, it was expected that there would be two main effects and no interaction for each dependent variable. Subjects were expected to be more accurate when rating by dimension, and when rating the ratees with neutral levels of OCBI. Results did not support either hypothesis. For stereotype accuracy, there was a main effect as predicted for level of OCBI, but no main effect for type of format. Also, the interaction between method of rating and level of OCBI was not significant. For differential accuracy, there was no main effect for type of format, a large main effect for level of OCBI (in favor of the high OCBI ratees), and no interaction. Obviously, for both dependent variables, the lack of significant effects for the between-subjects manipulations (Hypotheses 2b and 2c) removed any likelihood of finding significant effects consistent with Hypotheses 8a and 8b. Before further research is conducted in which both type of format and type of performance dimension are manipulated simultaneously, it is imperative that it first be established that type of format has a meaningful impact on such dependent variables as were used in this study. Until then, tests such as those just described are premature.

Strengths, Weaknesses, and Directions for Future Research

This chapter will conclude with a discussion of strengths and weaknesses of the current study, general conclusions that can be drawn from this study, and future research needs in this area.

Strengths and Limitations of the Study

Strengths.
A number of strengths can be highlighted in the way this research was designed and carried out. First, while most earlier research has studied either halo (e.g., Jacobs & Kozlowski, 1985) or accuracy (e.g., Murphy & Balzer, 1986), this study measured both halo and accuracy in attempting to answer substantive questions concerning type of format and type of performance dimension. It is true that the findings for halo versus accuracy were often discrepant, and that this added considerable ambiguity to the interpretation of the study's results. However, if only one measure had been used as a primary dependent variable (i.e., either halo or accuracy), this would have led to faulty conclusions concerning the magnitude and direction of the study's findings. So, despite the increased ambiguity brought on by using both variables, this is preferable to relying on one versus the other. Indirectly, this study lends support to Murphy and Balzer (1989), who argued that halo and accuracy are only weakly related.

A second strength of this study was the linking together of several distinct research streams to measure rater search processes and accuracy simultaneously. The conceptual background for this study was drawn largely from the University of South Carolina stream (DeNisi & Williams, 1988); the accuracy dependent variables were drawn from Cronbach (1955) and Murphy et al. (1982); finally, the process variables were drawn from Ford and his colleagues (Ford et al., 1989; Kozlowski & Ford, 1991). Although the results in this study did not come out as intended for most of the hypotheses concerning type and amount of search, it is nonetheless desirable to measure such process variables. Future research should continue to measure such variables, in hopes of better understanding the (cognitive) reasons why raters make the ratings they do (Landy & Farr, 1980; Ilgen et al., in press).

Third, this study was able to get beyond the "trait versus behavior" dilemma by measuring both in-role and extra-role performance in terms of behavioral critical incidents. This solved the measurement problems experienced by DeNisi and Summers (1986), and also the apparent confound in Orr et al. (1989), where in-role performance was measured in terms of behaviors while citizenship or extra-role performance was measured in terms of traits. Describing all performance dimensions in behavioral terms is a practical advance that should be utilized in future research.

A fourth strength of this study was the use of a computerized information board to easily and efficiently collect data from experienced raters in a large organization. It took most subjects less than an hour to complete all aspects of this project. Use of computerized simulations should definitely continue in the future. Also, the fact that such an experiment could be carried out with raters possessing almost ten years of supervisory experience lends strong practical support to the finding that experienced raters used both in-role and extra-role performance information when making their ratings.

Limitations. This study was also not without its limitations. Several of these have been discussed above. In this section, six limitations or problems will be discussed, in terms of the way this research project was designed or carried out. An obvious weakness in this study concerned the search constraint. Subjects were asked to make 42 ratings, i.e., rating all six ratees on all six dimensions, plus rating the overall performance of each ratee.
Subjects were told that there was only one item of information available for each ratee on each dimension (36 total), and that they would be able to access 28 of those items. This scenario clearly contributed to the lack of variance concerning amount of search, where the mean amount of search was 27.7 items, with a distribution that was strongly negatively skewed. It is likely that this constraint contributed as well to the halo and accuracy results described above. Since ratings of each ratee on each dimension are needed for the Cronbach accuracy measures, future research should have more items available for each ratee on each dimension (e.g., DeNisi et al., 1983; Padgett & Ilgen, 1989). Also, any search constraint which is used in the future should be pilot tested to ensure that it does not serve as an artificial ceiling limiting variance in the amount of search undertaken by subjects.

A second critical flaw in this study was the failure to note the inherent conflict between the proposed design and method of data analysis and the structure of the primary (Cronbach) dependent variables. A "subjects within groups by conditions" design (Cohen & Cohen, 1983) was proposed, to be tested using hierarchical regression. Unfortunately, the orthogonal, contrast-coded variables for the within-subjects factors were not capable of explaining any variance in the Cronbach measures, since these measures produce single accuracy values for each subject, i.e., one elevation value per subject, one differential elevation value per subject, etc. As presented in Chapter 4, other analyses were conducted which better matched the hypotheses of this study. However, these other data analytic approaches carried with them their own advantages and disadvantages. In particular, the Dickinson MANOVA approach described above (Dickinson, 1987; Dickinson et al., 1990) required the use of a split-plot factorial design (Kirk, 1982). This design is particularly powerful in detecting effects for within-subjects manipulations, but is less powerful in detecting effects for between-subjects manipulations (Kirk, 1982). This summary statement from Kirk (1982) corresponds directly to the power levels observed for the variables presented in Tables 7 and 9, i.e., observed power was extremely low for the between-subjects manipulations, and considerably higher for the within-subjects manipulations. Future research will need to take this into consideration, specifically in determining the proper number of subjects per condition to detect effects for type of format, if such effects exist (again, this experiment did not establish that there is anything more than a very small effect for person- versus dimension-blocked formats).

Although not as severe, a third limitation of the present study was that it assumed a fixed-effects model, i.e., conclusions about the manipulations apply only to the treatment levels used in this experiment (Kirk, 1982). This is not so important for type of format, prior knowledge of format, or for level of ratee performance. It is more of an issue, however, for the dimensions chosen in this study. These dimensions were deemed meaningful to this organization and sample (e.g., Table 1), and adequately captured the constructs of in-role and extra-role performance drawn from previous research (Organ, 1988b; Williams, 1988). However, it is possible that results would vary with the use of other dimensions, or other conceptualizations of in-role and extra-role performance. As recommended by Ilgen et al.
(in press), future process or cognitively-oriented performance appraisal research must also deal with content issues. The current study takes steps in this direction, but needs to be combined with and followed by considerably more related research (cf. McDonald, in press).

A fourth possible limitation of the current research concerns the quality of the true scores generated by the subject matter experts. It is possible that some of the weakness of results for the accuracy measures was due to deficiencies in the true scores. For example, the subject matter experts made almost no use of the extreme values on the rating scale (1 and 7). They were, however, stricter than the primary sample in rating the in-role performance of Jody and Lynn (the low IRB performers). It is possible that, despite extended opportunity to evaluate the critical incidents, the 15 subject matter experts were not as "expert" as desired in making such ratings. Unlike some previous research (Karl & Wexley, 1989; Padgett & Ilgen, 1989), expert raters in this study were not provided rater training prior to the rating task. Such training is recommended in future research.

A fifth point is related to the above, and concerns whether the deviation of the true scores from the intended target levels of performance may have weakened overall study results. As Table 3 indicated, subject matter ratings deviated markedly from desired levels for four of the 36 ratings. However, two points can be made which would indicate that these deviations did not substantially affect the results of this study: a) such deviations should have had their strongest effect on the within-subjects manipulations of in-role and extra-role performance, yet results in this study were strongest for precisely these variables; b) the subject matter experts best captured the intended levels of performance for ratees Pat, Chris, and Lynn, yet, turning to the results in Table 10, the accuracy of these orthonormal contrasts was worst for these same three ratees. It does not seem, then, that accuracy (or the lack thereof) can be explained by the deviations of the true scores from intended target levels of performance.

A final limitation of this study is self-evident by this point in the manuscript: the design of this study was too "busy," with many complex and interrelated hypotheses. Interpretation of the results was often difficult because of all the different things going on in this study. As noted above in the discussion of Hypotheses 8a and 8b, it is expected that future research will proceed more fruitfully when type of format and type of performance dimension are first studied separately. Also, unless a better way can be found to concurrently measure the effects of level of performance (Hypothesis 5) and consistency of performance information (Hypothesis 7), these variables should not be tested in the same study.

General Conclusions From This Study

Having laid out both the strengths and weaknesses of this research project, discussion will now turn to what can be learned from the results of this study. The discussion in this section will focus on general conclusions related to the broad questions presented at the outset of this chapter.

Type of Format. In terms of rater accuracy, there was a slight advantage to making ratings by dimension. Of the three correlational measures of accuracy, the only significant main effect for type of format was for the differential elevation correlation (DECORR), in favor of rating by dimension.
Three of the four distance score accuracy formulations had significant main effects in favor of rating by dimension; only stereotype accuracy had a significant main effect in favor of rating by person. The overall distance score accuracy (Cronbach, 1955) also demonstrated a statistically significant main effect for rating by dimension. Unfortunately, these effects were very small in magnitude. None of these effects uniquely explained more than 2% of the rating variance.

Such small effect sizes can be viewed in two different ways. First, since a review of the relevant literature indicated that prior research had demonstrated small to medium effect sizes (Cohen, 1988) for similar manipulations, one might argue that the design limitations noted above served to constrain the size of the effects observed in this study. This may in fact have occurred, which leaves open the possibility that future research which is more specifically focused on format issues will demonstrate effects of greater practical significance. Also, as noted above, it is likely that studying the effects of recall and rating together will increase the effect sizes observed (DeNisi & Williams, 1988). The second way to view such small effect sizes is to conclude that this is all that such manipulations are capable of producing. Ilgen et al. (in press) reviewed over 50 research articles under the heading "Performance appraisal process research in the 1980s," and concluded that cognitive processes have accounted for only a limited amount of variance in appraisal ratings. It may be that the results of the current research provided (unwelcome) support for the somber conclusions of Ilgen et al. (in press), and that the enthusiasm of the past decade for studying cognitive processes in performance appraisal has been stronger than the results to date would indicate is warranted. In the author's opinion, it is too soon to curtail such process research in the area of performance appraisal. It is hoped that, as both content and process issues are studied in future research, more substantive results will be forthcoming as well. Still, the cautionary note from Ilgen and his colleagues should be heeded. Time will tell if this line of research has "reached a point of diminishing returns" (Ilgen et al., in press).

A final issue to mention in this section concerns the purpose of the performance appraisal rating. DeNisi et al. (1984) and others have discussed this as an important variable influencing appraisal ratings. As noted in Chapter 1, it was hoped that the pattern of results in this study would have practical implications for different appraisal purposes, i.e., if a correct ranking of employees was desired for a promotion decision, then rating by person could be recommended, but if developmental feedback for employees was needed, then rating by dimension would be viewed as superior. Clearly, the direction of results for the hypotheses related to type of format was sufficiently discouraging that no implications can be drawn concerning format and the purpose of appraisal. Purpose of appraisal is an important content issue that must be considered in performance appraisal research (Ilgen et al., in press). Sadly, the current research did not make the contribution expected concerning the relative effectiveness of different appraisal formats for making within- versus across-ratee discriminations (DeNisi & Williams, 1988).

Type of Performance Dimension.
In contrast to the findings concerning type of format, the results for type of performance dimension were considerably stronger and more robust. This sample of experienced raters focused most on information concerning ratee behaviors which can be classified as in-role (or in-job; Ilgen & Hollenbeck, 1991), i.e., those behaviors most closely associated with the narrow job duties found in most job descriptions. These behaviors were also the dominant influence on the appraisal ratings made by these raters. However, the two citizenship (OCBI) dimensions of cooperation and extra effort were also used by these raters. Such extra-role (extra-job) behaviors explained small, but statistically significant, amounts of rating variance, as did the interaction of in-role and extra-role performance. This clearly supports Organ's (1988a) contention that practicing managers view performance as more than simply in-role behaviors. Even within the confines of this relatively controlled laboratory study, there was evidence that raters were interested in extra-role dimensions which "give fair credit in a general, global sense for many other forms of OCB" (Organ, 1988b). In this context, it would seem that there is merit in an appraisal system which focuses most on job-relevant behaviors, but also gives the rater an opportunity to evaluate broader trait- or citizenship-oriented dimensions as well. Interestingly, this is precisely the direction in which this university is heading with its newly implemented appraisal system for staff employees.

Overall, then, this study supports the findings from Orr et al. (1989) and MacKenzie et al. (1991) that managers use both in-role and extra-role information when making appraisal ratings. It is argued that the current study provides the best direct test of this hypothesis to date, since Orr et al. (1989) asked raters to make utility estimates, and MacKenzie et al. (1991) used a narrow, sales commission-oriented measure to tap "objective" performance. In this study, the effects for extra-role performance were not as strong as those observed by MacKenzie et al. (1991). They are, however, more likely to parallel what would be found in an actual appraisal setting, where raters must somehow simultaneously account for both aspects of performance when making their ratings.

The question still remains as to whether the halo effect observed for ratees with high levels of OCBI is error or not, and thus whether bringing extra-role information more explicitly into the appraisal process is desirable or not. Organ (1988b) would argue for such a broadening of the appraisal domain; other researchers would clearly disagree. It is the author's opinion that such extra-role information should be included in the appraisal process, since such information explains relevant rating variance, and is desirable for the effective functioning of the organization as a whole. Of course, there will always be a tension here, since once something is rated, it may no longer be "extra-role," i.e., it may become an expected job/role requirement (e.g., obligatory attendance at social functions). However, if the extra-role dimensions remain at the level of generality recommended by Organ (1988b), such a migration of specific behaviors from one category to the other should be less likely. Hopefully, such questions will be better addressed by future research which, as mentioned above, also incorporates the theoretical work of Ilgen and Hollenbeck (1991).
Directions for Future Research

Since future research ideas have already been raised throughout this chapter, this section will only highlight questions raised by this research, which will hopefully be answered in the future. Questions will be grouped under the headings of type of format, type of performance dimension, process issues, and setting/contextual issues.

Type of Format

** What impact does type of format have on rater accuracy: zero, small, or larger than the effects observed in this study?

** What would happen if a simplified version of this study were run, where raters were not allowed to take notes, but were forced to rely more on their memories (i.e., an emphasis on both recall and rating)?

Type of Performance Dimension

** Do these findings for in-role versus extra-role performance generalize beyond this setting, these ratees, and these dimensions?

** Should extra-role behaviors be included in the formal performance appraisal process, i.e., is this adding error, or explaining relevant variance in the process?

** Does level of performance impact the accuracy of ratings in an appraisal (in contrast to a selection) setting?

Process Issues

** Why didn't the raters given prior knowledge of the format they would be using adapt their search pattern to fit that format?

** What results would be obtained in a replication where there was more (and less constrained) search allowed?

Setting / Contextual Issues

Ilgen et al. (in press) discussed the literature in this area under three broad headings: a) acquisition of information, b) organization and storage of that information, and c) retrieval, integration, and evaluation of that information. Additionally, under each heading, they discussed the relevant literature as it related to four sources of variation, i.e., variance due to ratees, raters, rating scales, and the setting or context in which the appraisal took place. The current study has emphasized issues of information acquisition, as this related to the subsequent ratings given. Manipulations were made of ratee levels of performance and rating scale format, and rater search patterns were measured as well. The one source of variation not addressed at all in this research was the setting in which appraisal takes place. Padgett (1988) and others (Longenecker et al., 1987) have documented the importance of these variables as well. For example, Padgett (1988) found that many of the raters in her study inflated the actual ratings given to their subordinates beyond their "real" ratings for these employees (collected at a later time, in confidence, by the researcher). Further, this rating inflation could be predicted by raters' beliefs about their ability to be open and honest when rating their subordinates. Padgett (1988) referred to this as rater motivation to rate accurately. Longenecker et al. (1987) discussed similar processes in terms of the "political" aspects of performance rating. Whatever the label, such processes are not issues of cognitive processing (Ilgen et al., in press), yet they are extremely salient in most applied appraisal settings. Future research must deal with these political or motivational issues as well. It may be, as Longenecker et al. (1987) argued, that all the emphasis in our field on rater accuracy has been misplaced, i.e., what if raters are able, but not motivated, to rate accurately, due to organizational or other contextual factors? As Ilgen et al.
(in press) suggested, performance appraisal research should advance more rapidly if we expand the types of variables included in our models of the appraisal process. We still do not know the extent to which accuracy is jointly influenced by both the rater's ability and motivation to rate accurately. Future research needs to address such issues.

APPENDIX A

Formulas Used to Calculate Accuracy and Error Scores

Accuracy (Cronbach). For mathematical reasons, Cronbach (1955) utilized the squared differences between subject ratings and true score ratings. The overall measure of rater accuracy, $D^2$, represents the squared difference between subject ratings ($x$) and true scores ($t$), averaged across $n$ ratees and $k$ dimensions:

$$D^2 = \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - t_{ij})^2$$

This overall measure can be broken down into four components: elevation (E), differential elevation (DE), stereotype accuracy (SA), and differential accuracy (DA). The last three components can be expressed in terms of squared differences, as well as in terms of variances and correlations. It is thought that the two forms of measuring DE, SA, and DA carry unique information about each of these types of accuracy (Becker & Cardy, 1986; Sulsky & Balzer, 1988). These formulas are:

$$E^2 = (\bar{x}_{..} - \bar{t}_{..})^2$$

$$DE^2 = \frac{1}{n} \sum_{i} \left[ (\bar{x}_{i.} - \bar{x}_{..}) - (\bar{t}_{i.} - \bar{t}_{..}) \right]^2, \quad \text{or equivalently,} \quad DE^2 = \sigma^2_{\bar{x}_{i.}} + \sigma^2_{\bar{t}_{i.}} - 2\,\sigma_{\bar{x}_{i.}}\sigma_{\bar{t}_{i.}}\, r_{\bar{x}_{i.}\bar{t}_{i.}}$$

$$SA^2 = \frac{1}{k} \sum_{j} \left[ (\bar{x}_{.j} - \bar{x}_{..}) - (\bar{t}_{.j} - \bar{t}_{..}) \right]^2, \quad \text{or equivalently,} \quad SA^2 = \sigma^2_{\bar{x}_{.j}} + \sigma^2_{\bar{t}_{.j}} - 2\,\sigma_{\bar{x}_{.j}}\sigma_{\bar{t}_{.j}}\, r_{\bar{x}_{.j}\bar{t}_{.j}}$$

$$DA^2 = \frac{1}{nk} \sum_{i} \sum_{j} \left[ (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}) - (t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}) \right]^2, \quad \text{or equivalently,} \quad DA^2 = \sigma^2_a + \sigma^2_b - 2\,\sigma_a\sigma_b\, r_{ab}$$

where $a = x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}$ and $b = t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}$; $x_{ij}$ and $t_{ij}$ are the rating and true score for ratee $i$ on dimension $j$; $\bar{x}_{i.}$ and $\bar{t}_{i.}$ are the mean rating and mean true score for ratee $i$; $\bar{x}_{.j}$ and $\bar{t}_{.j}$ are the mean rating and mean true score for dimension $j$; and $\bar{x}_{..}$ and $\bar{t}_{..}$ are the mean rating and mean true score over all ratees and dimensions. Overall rating accuracy equals the sum of the above four difference scores, i.e., $ACC^2 = E^2 + DE^2 + SA^2 + DA^2$ (see Cronbach, 1955).

Differential Accuracy (Borman). Borman sought to measure a rater's ability to distinguish among ratees on a number of performance dimensions. His formula is:

$$\text{Borman's DA} = \frac{1}{d} \sum_{j=1}^{d} z_{r_j}$$

where $d$ refers to the number of dimensions and $r_j$ refers to the correlation between ratings and true scores for a particular dimension, transformed to a $z$ score ($z_{r_j}$). The per-dimension correlation yields a DA score for each dimension; an overall DA score is then computed by averaging the correlations across dimensions using Fisher's r to z transformation. Borman's DA measure is not equivalent to Cronbach's DA measure, either in Cronbach's distance score or in his variance/correlational formulation (Becker & Cardy, 1986; Sulsky & Balzer, 1988).

Error Measures. Drawing from Saal, Downey, and Lahey (1980), Murphy and Balzer (1989) described six error measures that have been used in past research:

MEDCORR: the median correlation between performance dimensions, over ratees (halo);

VARRAT: the variance of the ratings assigned to each ratee, averaged across ratees (halo);

MEAN: the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint (leniency);

SKEW: the skew of the distribution of ratings over ratees and dimensions (leniency);

SD: the standard deviation of the rating distribution, over ratees and dimensions (range restriction); and

KURT: the kurtosis of the rating distribution over ratees and dimensions (range restriction).
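Because the Cronbach components are easy to misread in prose, the following minimal sketch shows one way the decomposition (and the MEDCORR halo index just listed) could be computed. It is an illustration only, assuming ratings and true scores are held as n-by-k NumPy arrays; the function names (cronbach_components, median_intercorrelation_halo) and the synthetic example data are invented for this sketch and are not taken from the dissertation or its analysis programs.

# Minimal illustrative sketch (not the study's scoring code) of Cronbach's (1955)
# accuracy components and a median-intercorrelation halo index.
import numpy as np

def cronbach_components(x, t):
    """Return E^2, DE^2, SA^2, DA^2, and overall D^2 for n x k ratings x and true scores t."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    xg, tg = x.mean(), t.mean()                    # grand means
    xr, tr = x.mean(axis=1), t.mean(axis=1)        # ratee (row) means
    xd, td = x.mean(axis=0), t.mean(axis=0)        # dimension (column) means

    e2 = (xg - tg) ** 2                                           # elevation
    de2 = np.mean(((xr - xg) - (tr - tg)) ** 2)                   # differential elevation
    sa2 = np.mean(((xd - xg) - (td - tg)) ** 2)                   # stereotype accuracy
    a = x - xr[:, None] - xd[None, :] + xg                        # doubly-centered ratings
    b = t - tr[:, None] - td[None, :] + tg                        # doubly-centered true scores
    da2 = np.mean((a - b) ** 2)                                   # differential accuracy
    d2 = np.mean((x - t) ** 2)                                    # overall squared distance
    return e2, de2, sa2, da2, d2

def median_intercorrelation_halo(x):
    """Observed halo as the median correlation between dimensions, computed over ratees."""
    r = np.corrcoef(np.asarray(x, float), rowvar=False)           # k x k dimension intercorrelations
    return np.median(r[np.triu_indices_from(r, k=1)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true = rng.integers(2, 7, size=(6, 6)).astype(float)          # 6 ratees x 6 dimensions, 7-point scale
    ratings = np.clip(true + rng.normal(0, 0.8, true.shape), 1, 7)
    e2, de2, sa2, da2, d2 = cronbach_components(ratings, true)
    print(e2, de2, sa2, da2, d2)                                  # d2 equals e2 + de2 + sa2 + da2
    print(median_intercorrelation_halo(ratings))

Because the four components are built from orthogonal pieces of the rating-minus-true-score matrix, the printed D^2 value will equal the sum of the four components exactly, which provides a quick internal check on any implementation of these formulas.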
Fisicaro (1988) recommended that the VARRAT measure for halo just described be amended to take into consideration the true halo, or intercorrelation among dimensions. Using standard deviations instead of variances, Fisicaro (1988) first presented a halo measure focusing only on observed halo, as follows:

$$HO_{sd} = \frac{1}{n} \sum_{k=1}^{n} SD_{x_k}$$

where the standard deviation of ratings is computed across dimensions for each ratee, and then averaged across ratees. His "improved" halo measure is as follows:

$$HE_{sd} = \frac{1}{n} \sum_{k=1}^{n} \left( SD_{t_k} - SD_{x_k} \right)$$

This formula takes into account the true intercorrelation among dimensions. Finally, Fisicaro (1988) recommended the use of an absolute measure of halo error, which would reflect an overall tendency to make an error, i.e.,

$$AHE_{sd} = \frac{1}{n} \sum_{k=1}^{n} \left| SD_{t_k} - SD_{x_k} \right|$$

LIST OF REFERENCES

Note 1. Balzer, W.K. Personal communication. November, 1990.

Note 2. Dickinson, T.L. Personal communication. October, 1991.

Note 3. Padgett, M.Y. Personal communication. March, 1991.

Note 4. Cafferty, T.P. Personal communication. May, 1991.

Barrick, M.R., & Mount, M.K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1-26.

Bateman, T.S., & Organ, D.W. (1983). Job satisfaction and the good soldier: The relationship between affect and employee "citizenship." Academy of Management Journal, 26, 587-595.

Becker, B.E., & Cardy, R.L. (1986). Influence of halo error on appraisal effectiveness: A conceptual and empirical reconsideration. Journal of Applied Psychology, 71, 662-671.

Bernardin, H.J., & Pence, E.C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60-66.

Bernardin, H.J., & Walter, C.S. (1977). The effects of rater training and diary-keeping on psychometric error in ratings. Journal of Applied Psychology, 62, 63-69.

Blumberg, H.H., DeSoto, C.B., & Kuethe, J.L. (1966). Evaluation of rating scale formats. Personnel Psychology, 19, 243-259.

Borman, W.C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238-252.

Borman, W.C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410-421.

Brief, A.P., & Motowidlo, S.J. (1986). Prosocial organizational behaviors. Academy of Management Review, 11, 710-725.

Brown, E.M. (1968). Influence of training, method, and relationship on the halo effect. Journal of Applied Psychology, 52, 195-199.

Buford, J.A., Jr., Burkhalter, B.B., & Jacobs, G.T. (1988). Link job descriptions to performance appraisals. Personnel Journal, June, 132-140.

Cafferty, T.P., DeNisi, A.S., & Williams, K.J. (1986). Search and retrieval patterns for performance information: Effects on evaluations of multiple targets. Journal of Personality and Social Psychology, 50, 676-683.

Cantor, N., & Mischel, W. (1977). Traits as prototypes: Effects on recognition memory. Journal of Personality and Social Psychology, 35, 38-48.

Cardy, R.L., Bernardin, H.J., Abbott, J.G., Senderak, M.P., & Taylor, K. (1987). The effects of individual performance schemata and dimension familiarization on rating accuracy. Journal of Occupational Psychology, 60, 197-205.

Cascio, W.F. (1989). Managing human resources: Productivity, quality of work life, profits. New York: McGraw-Hill.

Coffman, W.E. (1971). On the reliability of ratings of essay examinations in English. Research in the Teaching of English, 5, 24-37.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences, Second Edition. Hillsdale, NJ: Erlbaum.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences, Second Edition. Hillsdale, NJ: Erlbaum.

Coker, D.R., Kolstad, R.K., & Sosa, A.H. (1988). Improving essay tests: Structuring the items and scoring responses. Clearing House, 61, 253-255.

Cooper, W.H. (1981). Ubiquitous halo. Psychological Bulletin, 90, 218-244.

Cronbach, L.J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 52, 177-193.

Davis, M.S. (1971). That's interesting! Towards a phenomenology of sociology and a sociology of phenomenology. Philosophy of the Social Sciences, 1, 309-344.

Day, D.V., & Silverman, S.B. (1989). Personality and job performance: Evidence of incremental validity. Personnel Psychology, 42, 25-36.

DeNisi, A.S., Cafferty, T.P., & Meglino, B.M. (1984). A cognitive view of the performance appraisal process: A model and research propositions. Organizational Behavior and Human Performance, 33, 360-396.

DeNisi, A.S., Cafferty, T.P., Williams, K.J., Blencoe, A.G., & Meglino, B.M. (1983). Rater information acquisition strategies: Two preliminary experiments. Academy of Management Proceedings, 169-172.

DeNisi, A.S., Robbins, T., & Cafferty, T.P. (1989). Organization of information used for performance appraisals: Role of diary-keeping. Journal of Applied Psychology, 74, 124-129.

DeNisi, A.S., & Stevens, G.E. (1981). Profiles of performance, performance evaluations, and personnel decisions. Academy of Management Journal, 24, 592-602.

DeNisi, A.S., & Summers, T.P. (1986). Rating forms and the organization of information: A cognitive role for appraisal instruments. Paper presented at the National Academy of Management Meetings, Chicago.

DeNisi, A.S., & Williams, K.J. (1988). Cognitive approaches to performance appraisal. In K.M. Rowland & G.R. Ferris (Eds.), Research in personnel and human resources management (Vol. 6, pp. 109-155). Greenwich, CT: JAI Press.

Dickinson, T.L. (1987). Designs for evaluating the validity and accuracy of performance ratings. Organizational Behavior and Human Decision Processes, 40, 1-21.

Dickinson, T.L., Hedge, J.W., Johnson, R.L., & Silverhart, T.A. (1990). Work performance ratings: Cognitive modeling and feedback principles in rater accuracy training. Technical Report AFHRL-TP-89-61, Air Force Human Resources Laboratory.

Feild, H.S., & Holley, W.H. (1982). The relationship of performance appraisal system characteristics to verdicts in selected employment discrimination cases. Academy of Management Journal, 25, 392-406.

Feldman, J.M. (1981). Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 66, 127-148.

Feldman, J.M. (1986). Instrumentation and training for performance appraisal: A perceptual-cognitive viewpoint. In K.M. Rowland & G.R. Ferris (Eds.), Research in personnel and human resources management (Vol. 4, pp. 45-99). Greenwich, CT: JAI Press.

Fisher, C.D., & Locke, E.A. (1990). Bad citizenship behaviors: Giving what you get. Paper presented at the National Academy of Management Meetings, San Francisco.

Fisicaro, S.A. (1988). A reexamination of the relation between halo error and accuracy. Journal of Applied Psychology, 73, 239-244.

Fiske, S.T. (1981). Social cognition and affect. In J. Harvey (Ed.), Cognition, social behavior, and the environment. Reading, MA: Addison-Wesley.

Ford, J.K., Schmitt, N., Schechtman, S.L., Hults, B.M., & Doherty, M.L. (1989). Process tracing methods: Contributions, problems and neglected research questions. Organizational Behavior and Human Decision Processes, 43, 75-117.

Funder, D.C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75-90.

Glass, G.V., & Hopkins, K.D. (1984). Statistical methods in education and psychology, Second edition. Englewood Cliffs, NJ: Prentice-Hall.

Graham, J.W. (1986). Organizational citizenship behavior informed by political theory. Paper presented at the National Academy of Management Meeting, Chicago, August, 1986.

Graham, J.W. (1989). Organizational citizenship behavior: Construct redefinition, operationalization, and validation. Unpublished manuscript, Department of Management, Loyola University of Chicago.

Hastie, R., & Park, B. (1986). The relationship between memory and judgment depends on whether the judgment task is memory-based or on-line. Psychological Review, 93, 258-268.

Heneman, R.L. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. Personnel Psychology, 39, 811-826.

Hogan, R. (1982). A socioanalytic theory of personality. In M.M. Page (Ed.), 1982 Nebraska Symposium on Motivation (pp. 55-89). Lincoln: University of Nebraska Press.

Hollenbeck, J.R., Brief, A.P., Whitener, E.M., & Pauli, K.E. (1988). An empirical note on the interaction of personality and aptitude in personnel selection. Journal of Management, 14, 441-451.

Hough, L.M., Eaton, N.K., Dunnette, M.D., Kamp, J.D., & McCloy, R.A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology (Monograph), 75, 581-595.

Ilgen, D.R., Barnes-Farrell, J.L., & McKellen, D.B. (in press). Performance appraisal process research in the 1980s: What has it contributed to appraisals in use? Organizational Behavior and Human Decision Processes.

Ilgen, D.R., & Feldman, J.M. (1983). Performance appraisal: A process focus. In B. Staw & L. Cummings (Eds.), Research in Organizational Behavior (Vol. 5, pp. 141-197). Greenwich, CT: JAI Press.

Ilgen, D.R., & Hollenbeck, J.R. (1991). The structure of work: Job design and roles. In M.D. Dunnette (Ed.), Handbook of Industrial and Organizational Psychology, Second Edition. Chicago: Rand McNally.

Jacobs, R., & Kozlowski, S.W.J. (1985). A closer look at halo error in performance ratings. Academy of Management Journal, 28, 210-212.

Johnson, D.M. (1963). Reanalysis of experimental halo effects. Journal of Applied Psychology, 47, 46-47.

Karambayya, R. (1990). Contextual predictors of organizational citizenship behavior. Academy of Management Proceedings, 221-225.

Karl, K.A., & Wexley, K.N. (1989). Patterns of performance and rating frequency: Influences on the assessment of performance. Journal of Management, 15, 5-20.

Katz, D. (1964). The motivational basis of organizational behavior. Behavioral Science, 9, 131-133.

Katz, D., & Kahn, R.L. (1966). The social psychology of organizations. New York: Wiley.

Kavanagh, M.J. (1971). The content issue in performance appraisal: A review. Personnel Psychology, 24, 653-668.

Kenny, D.A., & Albright, L. (1987). Accuracy in interpersonal perception: A social relations analysis. Psychological Bulletin, 102, 390-402.

Keppel, G. (1982). Design and analysis: A researcher's handbook, Second edition. Englewood Cliffs, NJ: Prentice-Hall.

Kirk, R.E. (1982). Experimental design: Procedures for the behavioral sciences, Second edition. Belmont, CA: Brooks/Cole.

Kozlowski, S.W.J., & Ford, J.K. (1991). Rater information acquisition processes: Tracing the effects of prior knowledge, performance level, search constraint, and memory demand. Organizational Behavior and Human Decision Processes, 50, 282-301.

Landy, F.J., & Farr, J.L. (1980). Performance rating. Psychological Bulletin, 87, 72-107.

Landy, F.J., Zedeck, S., & Cleveland, J. (Eds.) (1983). Performance measurement and theory. Hillsdale, NJ: Erlbaum Associates.

Latham, G.P., & Wexley, K.N. (1981). Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley.

Latham, G.P., Wexley, K.N., & Pursell, E.D. (1975). Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 60, 550-555.

Levy, M. (1989). Almost perfect performance appraisals. Personnel Journal, April, 76-83.

Locher, A.H., & Teel, K.S. (1988). Appraisal trends. Personnel Journal, 67(9), 139-145.

Longenecker, C.O. (1989). Truth or consequences: Politics and performance appraisals. Business Horizons, November-December, 76-82.

Longenecker, C.O., Sims, H.P., Jr., & Gioia, D.A. (1987). Behind the mask: The politics of employee appraisal. Academy of Management Executive, 1, 183-193.

Lord, R.G. (1985). Accuracy in behavioral measurement: An alternative definition based on raters' cognitive schema and signal detection theory. Journal of Applied Psychology, 70, 66-71.

MacKenzie, S.B., Podsakoff, P.M., & Fetter, R. (1991). Organizational citizenship behavior and objective productivity as determinants of managerial evaluations of salespersons' performance. Organizational Behavior and Human Decision Processes, 50, 123-150.

McDonald, T. (in press). The effect of dimension content on observation and ratings of job performance. Organizational Behavior and Human Decision Processes.

Mehrens, W.A., & Lehmann, I.J. (1973). Measurement and evaluation in education and psychology. New York: Holt, Rinehart, & Winston.

Mohrman, A.M., & Lawler, E.E. (1983). Motivation and performance appraisal behavior. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory. Hillsdale, NJ: Erlbaum Associates.

Murphy, K.R., & Balzer, W.K. (1986). Systematic distortions in memory-based behavior ratings and performance evaluations: Consequences for rating accuracy. Journal of Applied Psychology, 71, 39-44.

Murphy, K.R., & Balzer, W.K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74, 619-624.

Murphy, K.R., Balzer, W.K., Lockhart, M.C., & Eisenmann, E.J. (1985). Effects of previous performance on evaluations of present performance. Journal of Applied Psychology, 70, 72-84.

Murphy, K.R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W.K. (1982). Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 67, 320-325.

Murphy, K.R., & Jako, R. (1989). Under what conditions are observed intercorrelations greater or smaller than true intercorrelations? Journal of Applied Psychology, 74, 827-830.

Murphy, K.R., Philbin, T.A., & Adams, S.R. (1989). Effect of purpose of observation on accuracy of immediate and delayed performance ratings. Organizational Behavior and Human Decision Processes, 43, 336-354.

Murphy, K.R., & Reynolds, D.H. (1988). Does true halo affect observed halo? Journal of Applied Psychology, 73, 235-238.

Nathan, B.R., & Tippins, N. (1990). The consequences of halo "error" in performance ratings: A field study of the moderating effect of halo on test validation results. Journal of Applied Psychology, 75, 290-296.

Odiorne, G. (1965). Management by objectives. New York: Pitman Publishing.

O'Reilly, C., III, & Chatman, J. (1986). Organizational commitment and psychological attachment: The effects of compliance, identification, and internalization on prosocial behavior. Journal of Applied Psychology, 71, 492-499.

Organ, D.W. (1977). A reappraisal and reinterpretation of the satisfaction-causes-performance hypothesis. Academy of Management Review, 2, 46-53.

Organ, D.W. (1988a). A restatement of the satisfaction-performance hypothesis. Journal of Management, 14, 547-557.

Organ, D.W. (1988b). Organizational citizenship behavior: The good soldier syndrome. Lexington, MA: Lexington Books.

Organ, D.W. (1990a). The motivational basis of organizational citizenship behavior. In B. Staw & L. Cummings (Eds.), Research in Organizational Behavior (Vol. 12, pp. 43-72). Greenwich, CT: JAI Press.

Organ, D.W. (1990b). Fairness, productivity, and organizational citizenship behavior: Trade-offs in student and manager pay decisions. Paper presented at the National Academy of Management Meeting, San Francisco, August, 1990.

Organ, D.W., & Konovsky, M. (1989). Cognitive versus affective determinants of organizational citizenship behavior. Journal of Applied Psychology, 74, 157-164.

Orr, J.M., Sackett, P.R., & Mercer, M. (1989). The role of prescribed and nonprescribed behaviors in estimating the dollar value of performance. Journal of Applied Psychology, 74, 34-40.

Padgett, M.Y. (1988). Performance appraisal in context: Motivational influences on performance ratings. Unpublished Ph.D. dissertation, Department of Management, Michigan State University.

Padgett, M.Y., & Ilgen, D.R. (1989). The impact of ratee performance characteristics on rater cognitive processes and alternative measures of rater accuracy. Organizational Behavior and Human Decision Processes, 44, 232-260.

Payne, J.W. (1976). Task complexity and contingent processing in decision making: An information search and protocol analysis. Organizational Behavior and Human Performance, 16, 366-387.

Pitre, E., & Sims, H.P., Jr. (1987). The thinking organization: How patterns of thought determine organizational culture. National Productivity Review, Autumn, 340-347.

Puffer, S.M. (1987). Prosocial behavior, noncompliant behavior, and work performance among commission salespeople. Journal of Applied Psychology, 72, 615-621.

Rice, B. (1985). Performance review: The job nobody likes. Psychology Today, September, 30-36.

Saal, F.E., Downey, R.G., & Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413-428.

Smith, C.A., Organ, D.W., & Near, J.P. (1983). Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68, 653-663.

Smith, P., & Kendall, L.M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-155.

Smither, J.W., & Reilly, R.R. (1987). True intercorrelation among job components, time delay in rating, and rater intelligence as determinants of accuracy in performance ratings. Organizational Behavior and Human Decision Processes, 40, 369-391.

Srull, T.K., & Wyer, R.S., Jr. (1989). Person memory and judgment. Psychological Review, 96, 58-83.

Stevens, S.N., & Wonderlic, E.F. (1934). An effective revision of the rating technique. Personnel Journal, 13, 125-134.

Sulsky, L.M., & Balzer, W.K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497-506.

Symonds, P.M. (1925). Notes on rating. Journal of Applied Psychology, 9, 188-195.

Taylor, E.K., & Hastman, R. (1956). Relation of format and administration to the characteristics of graphic rating scales. Personnel Psychology, 9, 181-206.

Tett, R.E., Jackson, D.N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703-742.

Thorndike, E.L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25-29.

Thornton, G.C., III, & Byham, W.C. (1982). Assessment centers and managerial performance. New York: Academic Press.

VanDyne, L., & Cummings, L.L. (1990). Extra-role behaviors: The need for construct and definitional clarity. Paper presented at the National Academy of Management Meeting, San Francisco, August, 1990.

Werner, J.M. (1992). Predicting U.S. Courts of Appeals decisions involving performance appraisal: Updating Feild & Holley for the 1980s. Manuscript under revision.

Wexley, K.N., & Klimoski, R. (1984). Performance appraisal: An update. In K.M. Rowland & G.R. Ferris (Eds.), Research in personnel and human resource management (Vol. 2, pp. 35-79). Greenwich, CT: JAI Press, Inc.

Wexley, K.N., & Yukl, G.A. (1984). Organizational behavior and personnel psychology. Homewood, IL: Irwin.

Wexley, K.N., Yukl, G.A., Kovacs, S.Z., & Sanders, R.E. (1972). Importance of contrast effects in employment interviews. Journal of Applied Psychology, 56, 45-48.

Wherry, R.J. (1952). The control of bias in ratings: A theory of rating. Columbus: The Ohio State Research Foundation.

Wherry, R.J., & Bartlett, C.J. (1982). The control of bias in ratings: A theory of rating. Personnel Psychology, 35, 521-551.

Williams, K.J., Cafferty, T.P., & DeNisi, A.S. (1990). The effect of performance appraisal salience on recall and ratings. Organizational Behavior and Human Decision Processes, 46, 217-239.

Williams, K.J., DeNisi, A.S., Blencoe, A.G., & Cafferty, T.P. (1985). The role of appraisal purpose: Effects of purpose on information acquisition and utilization. Organizational Behavior and Human Decision Processes, 36, 314-339.

Williams, K.J., DeNisi, A.S., Meglino, B.M., & Cafferty, T.P. (1986). Initial decisions and subsequent performance ratings. Journal of Applied Psychology, 71, 189-195.

Williams, L.J. (1988). Affective and nonaffective components of job satisfaction and organizational commitment as determinants of organizational citizenship behaviors. Unpublished Ph.D. dissertation, Department of Management, Indiana University.

Williams, L.J., Podsakoff, P.M., & Huber, V. (1986). Determinants of organizational citizenship behaviors: A structural equation analysis with cross-validation. Paper presented at the National Academy of Management Meetings, Chicago.

Williams, S.L., & Hummert, M.L. (1990). Evaluating performance appraisal instrument dimensions using construct analysis. Journal of Business Communication, 27, 117-135.

Zedeck, S., & Kafry, D. (1977). Capturing rater policies for processing evaluation data. Organizational Behavior and Human Performance, 18, 269-294.