This is to certify that the dissertation entitled "The Effects of Rating Format and Rater Training on Performance Rating Accuracy and the Motivation to Rate Accurately," presented by Robert Lloyd Heneman, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Labor and Industrial Relations.

Michael L. Moore, Major Professor
February 15, 1984

THE EFFECTS OF RATING FORMAT AND RATER TRAINING ON PERFORMANCE RATING ACCURACY AND THE MOTIVATION TO RATE ACCURATELY

By

Robert Lloyd Heneman

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

School of Labor and Industrial Relations

1984

ABSTRACT

THE EFFECTS OF RATING FORMAT AND RATER TRAINING ON PERFORMANCE RATING ACCURACY AND THE MOTIVATION TO RATE ACCURATELY

By Robert Lloyd Heneman

An important criterion in the evaluation of a performance appraisal system is the accuracy of performance ratings. Two methods of increasing rating accuracy, rating format and rater training, and their impact on the motivation to rate accurately, were considered in this dissertation. It was hypothesized, based upon cognitive processing theory and expectancy theory, that performance rating accuracy and the motivation to rate accurately would be greater when: (1) behaviors rather than traits were rated; (2) observational rather than rater error training was provided; and (3) the rating format and rater training were consistent with one another. Finally, it was expected that there would be a positive correlation between performance rating accuracy and the motivation to rate accurately.

These hypotheses were tested in a laboratory experiment with 87 supervisors from a western utility company. A 2 x 3 factorial design was used. The first factor, rating format, consisted of two levels: behavior scale and trait scale. The second factor, rater training, was defined by three levels: rater error training, observational training, and control group training.

The results of this experiment provided no support for the hypotheses. Instead, it was found that traits were rated more accurately than behaviors. Two conclusions were made on the basis of these data. First, it appears that raters cognitively process performance information using trait oriented schema. Consequently, their ratings are more likely to be accurate when traits rather than behaviors are rated. Second, these findings suggest that raters are highly motivated to make accurate ratings. Therefore, greater emphasis should be placed upon increasing the skill levels rather than the motivational levels of the rater. Both sets of conclusions must, however, be treated as tentative given the methodological limitations associated with this study.

ACKNOWLEDGMENTS

This is the highlight of my experiences as a graduate student. It is an opportunity to formally thank those individuals and institutions that have helped me to complete this dissertation, to achieve my academic goals, and to grow as a person.
While all of the following parties have provided me with assistance in all three of these endeavors, each party has had a unique impact that I would like to acknowledge and thank.

I owe my initial interest and guidance in the field of industrial relations to Milton Derber, Herbert Heneman, Jr., and Kendrith Rowland. The faculties and students at the Institute of Labor and Industrial Relations, University of Illinois, and the School of Labor and Industrial Relations, Michigan State University, helped me to develop an interdisciplinary view of the employment relationship. Several individuals have shared keen insights with me into various subelements of the field: Richard Block, Thomas Patten, Michael Moore, Neal Schmitt, and Kenneth Wexley. In addition, they made many helpful comments and suggestions as members of my dissertation committee.

The University of Illinois at Urbana-Champaign and Michigan State University have provided me with the support and resources that I have needed as a graduate student. Funding provided by the American Compensation Association and the western utility company used in this experiment made the completion of this dissertation possible.

I have received a great deal of administrative assistance from Richard Block, Einar Hardin, and Michael Moore in the preparation of my program and the dissertation. I received technical assistance in various stages of this project from Leslie Corbitt, Rob MacCoun, Paul Reagan, and Bob Taylor. In addition to providing technical assistance, Steven Premack made many insightful comments on an earlier draft of this paper. Michael Moore, Bob Taylor, and Kenneth Wexley have picked me up when I needed it, kicked me in the pants when it was necessary, and patiently provided me with guidance and counsel through the entire project.

Finally, I have been blessed by the strength and encouragement given to me by my Mother and Father; my brother, sister, and their families; Richard and Maribelen Davis; Mado Kreutz; and a number of friends in California, Illinois, Michigan, and Minnesota.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: Introduction and Overview
    Research Objectives
    Importance of the Topic
    Chapters in the Dissertation
        Literature Review
        Models and Hypotheses
        Research Methodology
        Results
        Discussion

CHAPTER 2: Literature Review
    Accuracy Definitions
        Correlation Measures
        Difference Score Measures
        Accuracy Measures Contrasted
    Accuracy Models
        Spool (1978)
        Ilgen (1983)
        DeCotiis and Petit (1978)
        Wherry and Bartlett (1982)
        Accuracy Models Contrasted
    Empirical Studies
        Characteristics of the Rater
            Personality
            Memory capacity
            Values
            General impressions
        Characteristics of the Ratee
            Differential accuracy phenomenon
            Leader behavior
            Performance feedback
        Contextual Variables
            Rater training
            Time delay in rating
            Observation: Amount and method
            Purpose of the rating
            Format and dimensions
        Summary
    True Score Development
    Summary and Conclusions

CHAPTER 3: Models and Hypotheses
    Rating Accuracy
        Cognitive Processing Theories
        Format Effect
        Training Effect
        Format X Training Effect
    Motivation
        Expectancy Theory
        Format Effect
        Training Effect
        Format X Training Effect
    Motivation and Rating Accuracy
    Summary

CHAPTER 4: Research Methodology
    Experimental Design
    Manipulations
        Training Content
            Comparison of the three training programs
            Rater error training
            Observational training
            Control group training
        Rating Format
            Videotape description
            Frequency of behavior scale
            Trait rating scale
    Measures
        Rating Accuracy
        Motivation to Rate Accurately
        Reactions
        Demographic Characteristics
    Subjects
        Sampling Procedures
        Sample Characteristics
    Procedures
    Analysis

CHAPTER 5: Results
    Rating Accuracy
    Motivation
    Motivation and Rating Accuracy
    Summary

CHAPTER 6: Discussion
    Rating Accuracy
    Motivation
    Motivation and Rating Accuracy
    Conclusions

APPENDICES
    A. Frequency of Behavior Scale
    B. Trait Rating Scale
    C. Overall Rating
    D. Motivation to Rate Accurately
    E. Reactions
    F. Demographics

REFERENCES

LIST OF TABLES
1. Scale Reliabilities
2. Means, Standard Deviations, and Intercorrelations for Interval Level Variables and All Subjects
3. Difference of Means Tests for Performance Rating Accuracy and Motivation to Rate Accurately by Sex
4. Difference of Means Tests for Performance Rating Accuracy and Motivation to Rate Accurately by Geographic Location
5. Difference of Means Tests for Performance Rating Accuracy and Motivation to Rate Accurately by Company Training
6. Reaction Means and Standard Deviations by Experimental Condition
7. Analysis of Variance Results for Reactions
8. Rating Accuracy Means and Standard Deviations by Experimental Condition
9. Analysis of Variance Results for Performance Rating Accuracy
10. Motivation to Rate Accurately Means and Standard Deviations by Experimental Condition
11. Analysis of Variance Results for Motivation to Rate Accurately
12. Analysis of Covariance Results for Motivation to Rate Accurately

LIST OF FIGURES
1. Summary of Major Variables and Relationships
2. Experimental Design

CHAPTER 1
Introduction and Overview

This chapter provides a brief introduction to and overview of the entire dissertation. This will be accomplished by stating the research objectives, looking at the reasons why research on this topic is important, and briefly describing the content of each of the five chapters to follow.

Research Objectives

Many organizations rely upon performance ratings for a number of personnel decisions including pay, promotions, and layoffs (Bureau of National Affairs, 1983). In order for a performance appraisal system to be successful it is of major importance for raters to have the skills and motivation necessary to make accurate ratings (Bernardin & Cardy, 1982). Two methods of increasing rating accuracy, rating format and rater training, are considered in this dissertation. Unlike previous research in this area, the following propositions are set forth and are then tested:

1. Performance rating accuracy is a function of the ability and the motivation of the rater. Very little attention has been given to the motivation component. In this study the relationship between rating accuracy and the motivation to rate accurately is assessed.

2. Rating format and rater training affect performance rating accuracy and the motivation to rate accurately. Previous research has ignored the impact of rater training and, to a lesser extent, rating format on the motivation to rate accurately. Consequently, the effects of these two independent variables on the motivation to rate accurately are examined. Expectancy theory is used to explain why this effect is to be expected.
3. Rater training and rating format cannot be considered independent of one another, as has been the case with previous research. The interactive effect of these two variables may account for a significant amount of variance in rating accuracy and the motivation to rate accurately. Cognitive processing theory and expectancy theory are used to explain this hypothesized interactive effect.

4. Given the limited understanding of the cognitive processing of performance information by raters, there has been too much emphasis placed upon minimizing common rater errors like halo and leniency, and not enough emphasis placed upon developing the observational skills of raters (e.g. gathering critical incidents). A new type of training, observational training, is set forth here and is tested against rater error training and a comparison group receiving a general overview of performance appraisal.

Importance of the Topic

The objectives of this dissertation are of importance to both performance appraisal researchers and practitioners. For the former group, this dissertation addresses several recent calls in the literature for future research:

1. Several authors have indicated that a cognitive processing view is needed to better understand the rating process (e.g. Atkin & Conlon, 1978; Borman, 1978; Feldman, 1981; Landy & Farr, 1980; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982). To date there have been very few studies taking this approach. This study is grounded in cognitive processing theory.

2. Researchers have been criticized for taking too narrow a view of rating accuracy by concentrating on either rating format or rater training (Zedeck & Cascio, 1982). A broader view is taken here by looking at the interaction between these two independent variables; the end result is a step towards a contingency view of performance appraisal (Keeley, 1978). The type of rater training to be used can be matched with the type of rating format being used.

3. While several authors have called for rater training programs emphasizing observational skills (Bernardin & Buckley, 1981; Borman, 1979a; Landy & Farr, 1980; Spool, 1978), none have been developed, or at least none have been reported in the published literature. This omission is particularly noteworthy as observation is an important cognitive task confronting the rater (Feldman, 1981). An observational training program was developed for this dissertation based upon the theory of observation set forth by Weick (1968). Moreover, this program was tested against rater error training and a comparison group receiving a general overview of the performance appraisal process.

4. DeCotiis and Petit (1978) and Mohrman and Lawler (1983) have emphasized the importance and determinants of the motivation to rate accurately. Little research has been conducted along these lines. This study examines the impact of rating format and rater training method on the motivation to rate accurately. In turn, the relationship between motivation and rating accuracy is assessed.

This dissertation also addresses the concerns of practitioners as indicated in the following points:

1. Downs and Moscinski (1979) surveyed 67 directors of training in Fortune 250 companies and found that these respondents were very concerned by the fact that there was "subjectivity in the ratings" for those raters using their present appraisal system and that the "appraiser's skills are underdeveloped." These two conclusions emphasize the need for further research on performance rating accuracy and training.
2. Baird (1982) provides three reasons why performance rating accuracy should be of importance to practitioners. First, he feels that accurate ratings are essential to the management of performance. Without an accurate criterion measure, it is impossible to ascertain whether goals have been achieved and very difficult to provide performance or development counseling. Second, he points to the Uniform Guidelines and several court cases which emphasize the importance of accurate ratings. Finally, many human resource management subsystems are dependent upon accurate performance ratings. For instance, it is very difficult for organizations to make the link between employee performance and rewards without accurate ratings (Lawler, 1971).

3. Suspicion concerning the accuracy of ratings may be a stumbling block to getting supervisors to even use rating instruments (McGregor, 1957). It might be possible to begin to overcome this obstacle by providing supervisors with rating formats and training programs that will provide them with the ability and motivation to make accurate ratings.

Chapters in the Dissertation

Presented below is a brief description of the material contained in the next five chapters of this dissertation.

Literature Review

This chapter presents a review of the literature on the accuracy of performance ratings. The following topic areas are covered: definitions of accuracy, models of accuracy, empirical studies of accuracy that have been conducted, and methods used to develop 'true' scores for the calculation of accuracy. Within each of these areas emphasis is placed upon conceptual and methodological issues, and on future research directions.

Models and Hypotheses

The major hypotheses tested in the dissertation, and the models they were deduced from, are presented in this chapter. There are three major sections. The first one is concerned with the accuracy of performance ratings. A brief review of cognitive processing models is presented and, from these models, several hypotheses are developed concerning the impact of rating format and rater training on performance rating accuracy. The second section deals with the motivation to rate accurately. An expectancy model of motivation is discussed as it relates to performance ratings, and hypotheses are generated about the anticipated effects of rating format and rater training on the motivation to rate accurately. In the final section, the expected relationship between performance rating accuracy and the motivation to rate accurately is presented.

Research Methodology

This chapter describes the experimental design used, the subjects recruited, the procedures undertaken to test the hypotheses, and the methods of data analysis. A detailed description of the rating formats, rater training programs, and dependent measures is provided.

Results

The analysis of variance, effect size, reliability, and correlational results are reported in this chapter. A description of the support or lack of support for each hypothesis is presented.

Discussion

In this final chapter a discussion is presented concerning the support or lack of support for the hypotheses, the theoretical and applied implications of this research, the limitations associated with this study, and the directions that future research in this area might take.

CHAPTER 2
Literature Review

Given the importance of accurate performance ratings to human resource managers and to students of performance appraisal, a surprisingly small amount of theory and evidence has been generated on this topic.
In this chapter, the available literature will be reviewed by looking at the definitions of accuracy that have been offered, the models and theories that have been set forth, the empirical studies that have been conducted, and the methods used to develop 'true' scores for the calculation of accuracy.[1] Each of these topics will be covered in turn. At the end of each section the implications will be discussed. These conclusions serve as the stimulus to the theory and hypotheses developed in the next chapter.

[1] It should be noted at the outset that an attempt was made to be comprehensive in reviewing those studies concerned with the accuracy of performance ratings. Given the scope of this project and questions concerning the generalizability of findings, no such attempt was made to review all of those studies concerned with person perception in general or eyewitness accuracy.

Accuracy Definitions

In a very general sense, performance rating accuracy has to do with the relationship between actual employee behavior and employee behavior that has been recorded by a rater. Gordon (1970) offers a more precise definition:

    Accuracy is a function of the total amount of error inherent in an instrument. This includes both variable error, which is measured by an index of dispersion, and constant error, which is a function of the difference in the location of the distributions obtained with the fallible (performance rating) and less fallible (actual employee behavior) instruments (p. 367).

Hence, accuracy is made up of two error components: random and constant error. When these two sources of error variance are minimized, a measure is said to be accurate. Given this conceptualization of rating accuracy, two sets of measures have been set forth to operationalize this construct. The first set of measures is concerned with variable error. More specifically, the correlation between actual employee behavior (true score) and the performance rating of employee behavior (observed score) is calculated. The second major approach is based upon constant error. Here, the distance between the true and observed score is calculated. A narrative description of specific measures within each of these two categories, and uses for each measure, will now be presented.[2]

[2] Readers interested in the development and use of statistical formulas for each measure are referred to Borman (1979a), Cronbach (1955), Murphy et al. (1982), and Wiggins (1973).

Correlational Measures

Cronbach (1955) set forth several definitions of accuracy which focus on variable processes. The first definition, differential elevation, gives a measure of association between the ordering of each employee by their true performance and the ordering of employees made by the rater. This measure of accuracy is important when the rater is required to identify the best performers in his work group (Murphy et al., 1982). He may need to do so for a variety of personnel decisions including merit pay and promotion.

The other two definitions developed by Cronbach (1955), stereotype accuracy and differential accuracy, are somewhat similar. Both are concerned with the degree to which the rater's judgements covary with the true performance profile of the employee(s). Stereotype accuracy is important when, for example, the rater must assess the skill deficiencies of his employees in order to select a training program (Murphy et al., 1982). Hence, this definition is concerned with the performance profile of the group.
On the other hand, differential accuracy focuses on performance profiles for each ratee. As a result, it is an important consideration when the rater is charged with making placement or job assignment decisions (Murphy et al., 1982). In these situations the rater must match the performance of the ratee along a number of dimensions with the performance dimensions required by the job.

Difference Score Measures

There is one major definition of accuracy, elevation (Cronbach, 1955), that takes into explicit consideration the distance between observed and true scores. As originally conceived (Cronbach, 1955), accuracy was defined as the distance between the rater's average score for a group of ratees and the true score average. This definition is of importance when the rater is asked to make distinctions between the performance of work groups within his control (Murphy et al., 1982). For instance, the vice president of human resources may be asked to allocate rewards to various subunits within personnel (e.g. recruitment, compensation, etc.) on the basis of subunit performance.

Other variations of this distance notion also exist. One common variation is a 'hits' or 'misses' definition, where accuracy is defined as the number of correct or incorrect rating judgments made by the rater. This measure may be appropriate, for example, in a discipline situation where the rater is asked to make a judgment concerning the number of times a certain rule infraction occurred. While the number of hits or misses may be important in some situations like discipline, there is often a greater concern with how close the rater's observation is to the true score (Naylor, 1967). Hence, some authors (e.g. Heneman & Wexley, 1983) have used an elevation score where the distance between the rater's score and the true score on each item for each ratee is assessed. This approach would appear to be important when the rater is concerned with performance ratings for employee development purposes. That is, the rater is providing feedback and guidance to each ratee concerning their progress toward a number of predefined goals or behavioral standards.

Accuracy Measures Contrasted

According to Borman (1977), the question as to which definition of rating accuracy is most appropriate is a closed one, as demonstrated by this quote: "Differential accuracy definitely appears to be the most appropriate for assessing the accuracy of performance judgements" (p. 240). It will be argued here that when one looks at the purpose of the rating, the tradeoffs involved in focusing on definitions emphasizing variable or constant error, and various statistical considerations, there is no one best definition.

The examples given for each of the definitions show that the appropriate choice depends upon the purpose of the appraisal. Differential elevation is needed when employees must be rank ordered for personnel decisions. Stereotype accuracy appears to be the proper definition when the training needs of a work group are to be assessed. If the rater is required to make placements or job assignments, the differential accuracy definition is more suitable. When reward allocations are to be made to various groups, the elevation definition offered by Cronbach (1955) is needed. A hits or misses definition can be used when the rater is asked to specify the number of times a certain behavior occurred. Finally, an absolute difference score for each item is appropriate for assessing the accuracy of employee development needs and progress.
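To make the contrast between the correlational and distance families concrete, the sketch below computes a few of the measures just described for a toy set of true and observed ratings. This is a minimal illustration, not taken from the dissertation: the data, the variable names, and the specific operationalizations (a rank-order correlation in the spirit of differential elevation, a mean signed difference for Cronbach's elevation, a hit rate, and a mean absolute item-level difference) are all assumptions made for demonstration purposes.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy data: true scores and one rater's observed scores for five ratees
# rated on three performance dimensions (rows = ratees, columns = dimensions).
true = np.array([[4, 5, 3],
                 [2, 3, 2],
                 [5, 4, 4],
                 [3, 3, 3],
                 [1, 2, 2]], dtype=float)
obs = np.array([[5, 5, 4],
                [3, 3, 3],
                [5, 5, 5],
                [4, 3, 4],
                [2, 3, 2]], dtype=float)

# Correlational family: does the rater order ratees as the true scores do?
rank_acc, _ = spearmanr(true.mean(axis=1), obs.mean(axis=1))

# Distance family: Cronbach-style elevation, the signed distance between the
# rater's grand mean and the true grand mean (constant error).
elevation_error = obs.mean() - true.mean()

# Distance family: a 'hits' definition, the proportion of item-level
# judgments that exactly match the true score.
hit_rate = (obs == true).mean()

# Distance family: mean absolute item-level difference, as in the
# development-oriented elevation score described above.
mean_abs_diff = np.abs(obs - true).mean()

print(f"rank-order accuracy: {rank_acc:.2f}")          # covariation only
print(f"elevation error:     {elevation_error:+.2f}")  # location only
print(f"hit rate:            {hit_rate:.2f}")
print(f"mean |obs - true|:   {mean_abs_diff:.2f}")
```

Note how this hypothetical rater reproduces the true ordering of ratees perfectly (a rank-order accuracy of 1.0) while overrating every ratee (a positive elevation error of about +0.67). This is exactly the dissociation between variable and constant error that runs through the discussion above and below.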
Not only does the purpose of the rating determine the appropriateness of the accuracy measure, but so do considerations concerning variable and constant error (Cronbach, 1955). Using correlational definitions, accuracy is the degree to which true and observed scores covary with one another. While this covariation is important in some situations, it is not always. Even if true and observed scores correlate perfectly with one another, they may be a great distance apart from one another (Tinsley & Weiss, 1975). Thus, for instance, the rater may have an excellent view of the pattern of employee behaviors, but may greatly over- or under-estimate the behaviors. Turning to difference score measures, the emphasis is on how close the observed scores are to the true scores. As a result, the rater's observed scores may, overall, be close to the true scores, but distort the actual pattern of behaviors. Which approach is best depends upon the purposes of the rating. Covariation may be important in some situations while distance may be important in others.

Finally, several statistical issues must be taken into consideration. First, reliability problems with difference scores are well documented (e.g. Cronbach & Furby, 1970). While this problem may not be as severe as initially thought (Rogosa, Brandt, & Zimowski, 1982), it does give an edge to correlation rather than difference score measures of accuracy. Second, Cronbach's (1955) definitions assume that there are multiple ratees (Richards & Cline, 1963). When this is not the case (i.e. in research where only one ratee is rated), the original formulas are no longer applicable. Third, at least two studies (Murphy et al., 1982; Richards & Cline, 1963) have found that some of the measures of these definitions do correlate with one another. Hence, at a practical level, it may be possible to substitute one measure for another.

In summary, correlational and distance definitions of performance rating accuracy have been advanced. Within each approach a number of different measures are possible depending on whether the ratings are averaged across raters, ratees, and dimensions. The choice of a definition and measure depends upon the purpose of the measurement and the rating, and on several methodological considerations. Future research in this area might be directed toward a better understanding of the interrelationships of the measures within and between these two approaches. Moreover, those developing training programs to increase rating accuracy might want to look to these definitions to determine the content of the program. Training to increase accuracy defined by a correlational measure, for example, may need to be different than training to increase a distance measure of accuracy.

Accuracy Models

Four models have been set forth that treat accuracy as the dependent variable. These models will be briefly reviewed, and then a general discussion of the strengths and shortcomings of each model will be presented at the end of this section.

Spool (1978)

Based on the work of Cronbach, Gleser, Nanda, and Rajaratnam (1972), a model of accuracy was presented by Spool (1978). Accuracy was depicted as a function of three factors: recording procedure characteristics, observer characteristics, and conditions of observation. Components of the recording procedure include the format used, the complexity of the format, and how ratings are recorded. Observer characteristics include the age, sex, expectancies, intelligence, and rating experience of the rater.
Conditions of observation include the characteristics of the ratee, the number of ratees, the behaviors that occur, the frequency and rate at which behaviors occur, and the temporal sequencing of behaviors. No attempt was made to predict the strength and the direction of the relationship between these variables and rating accuracy.

Ilgen (1983)

A more specific model of rating accuracy was presented by Ilgen (1983). It was hypothesized that overall rating accuracy is a function of the objectivity of performance standards, the appraiser's knowledge of the dimensions to be rated, the opportunity to observe, and the expectations of the rater for employee performance. These variables are very consistent with those set forth by Spool. These factors are thought to have a direct influence on overall accuracy. Variables which have an indirect influence on overall accuracy, primarily through their impact on the variables just listed, include attributions, sex effects, past experiences, and sex-role expectations.

Ilgen then goes on to describe variables that influence two special errors in rating accuracy: over- and under-estimation. The underestimation of performance is directly influenced by appraiser expectations for appraisee performance and indirectly influenced by past experience. The overestimation of performance was hypothesized to be directly affected by appraiser/appraisee similarity and appraiser expectations for appraisee performance, and indirectly affected by past experiences and sex-role effects. In most cases Ilgen specified the strength and direction of these relationships.

DeCotiis and Petit (1978)

In very general terms, accuracy was conceived of as a function of the rater's ability and motivation, and the availability of appropriate rating standards. The rater's ability, or the "skill with which a rater interprets job behavior" (p. 639), is dependent upon the opportunity to observe ratee behavior, the characteristics of the rater, the training received by the rater, and the availability of appropriate rating standards. Motivation, or what energizes, directs, and sustains energy for accurate ratings, is determined by the perceived consequences of an accurate appraisal for the rater and ratee, the adequacy of the rating format as perceived by the rater, the rating format, the purpose of the appraisal, and the availability of appropriate performance standards.

Finally, the availability of appropriate rating standards was hypothesized not to have a direct effect on accuracy. Instead the authors felt that it indirectly affected accuracy through the motivation and ability of the rater. The availability of appropriate rating standards is a function of the job characteristics and the personality of the ratee, the appraisal format, and the organizational policies and procedures for performance appraisal. The direction of each of these relationships was predicted by the authors.

Wherry and Bartlett (1982)

This psychometric theory of rating accuracy was initially developed by Wherry (1952) and then edited and commented on by Bartlett.
It was assumed that the accuracy of ratings is a function of the following processes: performance by the ratee, observation of the performance by the rater, and recall of the observations by the rater. From these assumptions, a theory of rating accuracy was developed which can be summarized with the following equation (Bartlett, 1983):

\[
Z_R = W_T Z_T + W_B Z_B + W_I Z_I + W_E Z_E \tag{1}
\]

where:

- \(Z_R\) = rating accuracy
- \(W_T Z_T\) = weighted true ability of the rater
- \(W_B Z_B\) = weighted bias of the rater
- \(W_I Z_I\) = weighted environmental influences
- \(W_E Z_E\) = weighted random error

Accurate ratings occur when the weight given to the true ability variable is maximized and the weights given to the other variables are minimized. From a decomposed version of this equation, a total of 17 theorems and 23 corollaries were deduced. Major variables identified in the theorems, which have an impact upon the weights of the variables in the decomposed equation, included the following:

- control over the task by the ratee
- observability of rating scale items
- training concerning what activities are to be rated
- conscious effort to be objective
- checklist of objective cues for the evaluation of performance
- physical features of the scale that facilitate recall
- diary of critical incidents
- importance of the rating to the ratee and society
- knowledge that the rating will have to be justified
- delay time between observation and recall
- intention to remember performance
- rating items that are easily classified into categories
- number and relevancy of previous contacts with the ratee

The strength and direction of the relationship between these variables and rating accuracy were fully specified by the authors. A brief simulation of the weighted-composite logic of Equation (1) follows below.
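The following sketch simulates ratings as weighted composites of a true component, a bias component, an environmental component, and random error, and shows that the correspondence between ratings and true scores grows as the weight on the true component dominates, which is the theory's central claim. This is a toy demonstration of the equation's logic, not part of the Wherry-Bartlett theory itself; the weights and distributions are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # simulated ratings

# Standardized components of Equation (1): true component, rater bias,
# environmental influence, and random error, assumed independent here.
z_t = rng.standard_normal(n)
z_b = rng.standard_normal(n)
z_i = rng.standard_normal(n)
z_e = rng.standard_normal(n)

def simulate_rating(w_t, w_b, w_i, w_e):
    """Rating as the weighted composite Z_R = W_T*Z_T + W_B*Z_B + W_I*Z_I + W_E*Z_E."""
    return w_t * z_t + w_b * z_b + w_i * z_i + w_e * z_e

for w_t in (0.2, 0.5, 0.9):
    # Split the remaining weight evenly so the composite variance stays ~1.
    w_other = np.sqrt((1 - w_t**2) / 3)
    rating = simulate_rating(w_t, w_other, w_other, w_other)
    r = np.corrcoef(rating, z_t)[0, 1]
    print(f"W_T = {w_t:.1f}  ->  corr(rating, true) = {r:.2f}")
```

Under these assumptions the printed correlations track \(W_T\) directly, illustrating the sense in which accuracy is maximized when the true-ability weight dominates the bias, environmental, and error weights.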
Accuracy Models Contrasted

A comparison of these four models reveals a number of commonalities, a number of strengths and weaknesses for each model, and a more comprehensive view of those variables impacting rating accuracy.

In all four models the dependent variable, accuracy, is never fully defined. As shown in the definitions section of this chapter, there are a variety of definitions with very different implications. Predictions are difficult to make using these models because it is difficult to know whether the authors are trying to predict correlational or difference score accuracy. The antecedents of these definitions may not be the same. Ilgen (1983) comes the closest to offering a precise definition of rating accuracy in his development of the determinants of under- and over-estimates of performance. This implies a difference score definition of accuracy.

Another common theme to all four models is the emphasis placed upon the ability of the rater. Ability is defined with factors like the intelligence of the rater (Spool, 1978), the appraiser's knowledge of the performance dimensions (Ilgen, 1983), and the training received by the rater (DeCotiis & Petit, 1978; Wherry & Bartlett, 1982). The obvious proposition here is that the greater the ability of the rater, the greater the accuracy of rating.

Given the importance of the motivation of the rater (Bernardin & Cardy, 1982), it is surprising that only one study, DeCotiis & Petit (1978), explicitly considered motivation as an independent variable. A rater may be fully prepared to make accurate ratings, but have no incentive to initiate or persist at this task. While the inclusion of this variable certainly strengthened the DeCotiis & Petit model, it was weakened by the failure to include the cognitive processes of the rater. These variables were also excluded by Spool (1978) and Ilgen (1983). As Wherry and Bartlett (1982) noted, an essential part of the rating process is the observation, storage, and retrieval of performance information by the rater. As will be shown in the next chapter, these processes may produce inaccurate ratings because of the limited information processing capabilities of the rater.

Finally, the DeCotiis and Petit, and Wherry and Bartlett models pay careful attention to the contextual factors which may account for variance in performance rating accuracy. In particular, attention is given to organizational policies, the composition of the work group, the technology of the work place, the training given to raters, and the rating format. The Spool and Ilgen models disregarded these important features with the exception of the rating format available.

In summary, these models can serve as a guide to future research concerning the accuracy of performance ratings. However, certain additions and refinements must be made to each model. In particular, any model of the rating process must consider the characteristics and performance of the ratee; the values, ability, motivation, and information processing capabilities of the rater; and a number of contextual variables having an impact upon the rater and ratee, including organizational policies, composition of the work group, technology of the work place, training given to the raters, and the rating format. Moreover, the role of feedback needs to be incorporated. One might expect, for example, that the inaccuracy of a supervisor's ratings would have an impact on the performance of the ratee. The ratee might try to conceal or distort behaviors that are being inaccurately perceived, or bring these inaccurate perceptions to the attention of the rater. In turn, these actions by the ratee may have an impact on the accuracy of future ratings. These and other potential feedback loops need to be incorporated into these models.

Empirical Studies

A small number of empirical studies have been conducted to test the various components of these models. As will be seen in the review of these studies presented here, the majority of them have focused on contextual factors and, ironically, have been conducted in the laboratory. These studies are grouped into the following categories: characteristics of the rater, characteristics of the ratee, and contextual variables. Only those studies that treated performance rating accuracy as the dependent variable were reviewed.

Characteristics of the Rater

Personality. Borman (1979) looked at the relationship between individual difference measures for the rater and their differential accuracy scores. The sample was made up of 146 university students. Individual difference variables were measured using the Minnesota Person Perception Battery. These individual difference measures accounted for 17 percent of the rating accuracy variance. The results suggested the following profile for accurate raters. They tend to be stable, dependable, and good-natured persons. Seldom would they be rebellious, arrogant, careless, headstrong, irresponsible, disorderly, or impulsive. In addition, they tend to be characterized as even-tempered, outgoing, patient, affiliative, and mature. Finally, they are likely to be informal, pleasant, logical, unselfish, mature, verbally fluent, conversationally facile, and initiators in social relations.

Borman cautioned the reader that these results are based upon low correlations with differential accuracy scores.
It should also be pointed out that no theory or rationale was given for including the variables in this study. Caution is again advised in the interpretation of these results, although this study does seem to point toward a fruitful line of future inquiry.

Memory capacity. An additional individual difference variable was identified by Rush, Phillips, and Lord (1978). They found that the memory capacity of 144 university students was significantly related to the accuracy of the recall of specific events. High memory capacity subjects, as measured by the Picture-Number Test (MA-1, Educational Testing Service, 1962), were more accurate. Hence, high memory capacity should be added to the performance profile of the accurate rater. Given the high recall demands in the performance rating process (Wherry & Bartlett, 1982), this is to be expected.

Values. The effects of the values held by raters on a difference score measure of accuracy were assessed in a laboratory study by Wexley and Youtz (1983). Female Program Aides (N=23) for a service organization participated in this experiment. The Wrightsman (1964) Philosophy of Human Nature Scales provided measures of the following variables: trustworthiness, independence, altruism, and variability in human nature. After completing these scales, the subjects watched the videotaped performance of a supervisor (Heneman & Wexley, 1983) and then rated the supervisor using a frequency of behavior scale.

The results indicated that accuracy was negatively correlated with the raters' beliefs in other peoples' independence and altruism, and positively correlated with beliefs about their variability. Hence, raters who have strong beliefs in the altruistic and independent nature of man tend to make less accurate ratings, while those who believe in the variability of human nature tend to make more accurate ratings.

General impressions. Nathan and Lord (1983) examined, in a laboratory study with 120 undergraduate subjects, the relationship between the general impression of the lecturer held by the students and the inaccuracy of the students' ratings of the lecturer. The results suggested that general impressions correlated significantly with only some of the different measures of inaccuracy. Hence, it appears that some, but certainly not all, of the incidents recalled are guided by the rater's general impressions. Memory may not always be guided by pre-set categorization schemes, even though this is the prediction made by schematic memory theorists (Alba and Hasher, 1983).

Characteristics of the Ratee

Differential accuracy phenomenon. The effects of the correctness of the behavior observed on accuracy have been studied by Gordon (1970 & 1972). In particular, he identified an effect which he labeled the differential accuracy phenomenon (DAP). This concept suggests that correct behavior (i.e. acceptable or desirable behavior) is likely to be identified more accurately than incorrect behavior (i.e. unacceptable or undesirable behavior).

In his first study (Gordon, 1970), he had 118 managers view the videotaped performance of 19 simulated "agent-prospect" interactions. A number of correct and incorrect behaviors had been built into these tapes. Accuracy was measured as the number of responses where the subject's response matched the correct or incorrect behavior designation in the script. The results indicated that there was a significant main effect for the DAP.
Correct behaviors were rated accurately 88 percent of the time while incorrect behavior was accurately rated about 74 percent of the time. Gordon attributed this phenomenon to the idea that raters tend to overlook incorrect behaviors. Perhaps this is because the identification of incorrect behavior may require the rater to engage in an undesirable task, namely, confronting an employee with a performance problem.

His second study (1972) used the same videotapes and true scores, but this time his subjects were 46 senior marketing students. In this study he also used a one-item measure of how favorable an impression the ratee created. Subjects were assigned to a favorable and an unfavorable condition created by the manipulation of the background data on the ratee given to the raters. The ANOVA results again indicated a significant main effect for the DAP. The accuracy of the correct behaviors was about 89 percent while the accuracy of incorrect behaviors was about 76 percent. This effect accounted for 45 percent of the variance in accuracy! The favorability X DAP effects were nonsignificant. These results indicate that the favorability of the background data on the ratee does not have an effect on accuracy and that the DAP operates independent of favorability.

Although they did not directly test the DAP, some indirect support was generated by Nathan and Lord (1983). They separated 120 undergraduate psychology students into one condition where the majority of critical incidents exhibited by a lecturer on videotape were examples of correct behavior, and another condition where the majority of incidents were examples of incorrect behavior for the same lecturer. A significant main effect was found: raters were more accurate in recalling the number of times a behavior occurred when the majority of behaviors exhibited by the lecturer were correct rather than incorrect.

Leader behavior. Rush et al. (1981) found that the amount of structured behavior used by the leader of a problem solving group was related to accuracy. Undergraduate subjects (N = 144) watched the videotaped performance of a leader in a problem solving group. Accuracy was defined as the number of times the subjects correctly recalled stimulus information on the tape. In one tape, the leader was coached to exhibit a high degree of structuring behavior, and in the other tape was coached to exhibit a low degree. The main effect for this manipulation was significant, and subjects gave more accurate ratings for the highly structured leader. This result is not surprising given the sample of college students, who were probably more familiar with structured behaviors through their classroom experiences. It does, however, suggest the possibility that the rater's familiarity with the ratee's job is a determinant of accuracy.

Performance feedback. In the Rush et al. (1981) study previously described, an additional manipulation took place. The subjects were told, immediately following the videotape, that the problem solving group they had observed was the second best or second worst of 24 groups performing the task, or were not given any information. This manipulation had a significant main effect on rating accuracy. The direction of this relationship was not presented by the authors, nor was an explanation offered. Hence, further discussion of this finding is not possible.

Contextual Variables

Rater training. A number of studies have looked at the relationship between rater training and performance rating accuracy.
Wakeley (1961) conducted two of the original studies in this area. In the first study, 139 undergraduate psychology students were used as subjects. Two measures of rating accuracy were used: accuracy in judging others and ability to judge differences between people. Both measures tested the subject's knowledge of the beliefs and values of the interviewees in a series of four to five minute interviews shown on videotape. Subjects were assigned to six training conditions and a control group; pre and post measures were taken. Training consisted of a very short lecture. The six training conditions emphasized observing self, observing others, inferring individual differences, looking for similarities with others, rating error reduction, or a combination of these five programs. Relative to the control group, only two training conditions increased accuracy: looking for similarities in others and the combination.

In a second study, 31 evening MBA students were assigned to two training conditions and a control group. One training condition was the previous one emphasizing similarities in others, and the second condition was a combination of looking for similarities in others and observing others. Relative to the control group, both training conditions produced significantly higher pre-post accuracy score gains.

While this study is illustrative of the variety of programs that might be used to increase accuracy, a number of limitations preclude firm conclusions. The test-retest and internal consistency coefficients were quite low for the criterion measures. The samples were small and consisted of students. Perhaps the major conclusion to be drawn is that lectures are ineffective in increasing rating accuracy.

Borman has conducted two studies in an attempt to increase the accuracy of raters. In the first study, 90 managers in a large, nationwide insurance company served as subjects (Borman, 1975). Pre and post measures of differential accuracy were taken and no control group was used. Subjects observed the hypothetical performance of a first line supervisor and rated that person using a BES. Training consisted of a five to six minute lecture on halo error. Accuracy was increased for only two of the six BES performance dimensions. Again, the lecture approach to training, with emphasis on halo error this time, was relatively ineffective. When a lecture was used to warn subjects about several rating errors, the same conclusion emerged (Zedeck & Cascio, 1982).

In his second study, Borman (1979b) used a different method of training to increase accuracy. College students (N=123) were assigned to a training condition and a no training condition. Subjects in the training condition were given practice and feedback in eliminating rating errors using three hours of the Latham et al. (1975) training program. The tapes viewed by the subjects consisted of five to nine minute vignettes of the performance of a recruiter and a manager presented in counterbalanced order. Post-test ratings were gathered using four different formats (BES, trait, summated scale, behavior summary), and differential accuracy scores were calculated. The results showed that training had a significant impact on halo error, but did not have a significant impact on accuracy. Hence, at first blush, it appears that not only is a lecture ineffective in increasing accuracy, but so is a sophisticated training program using practice and feedback.
An alternative 3O explanation offered by Fay A Latham (1982) is that the subjects were college students and hence, were not motivated to rate accurately. Another explanation, to be more fully explained in the next chapter, is that the content of the training program was directed toward the elimination of rating errors. Consequently, it is no surprise that this training had an effect on halo error, but not on accuracy. In another study, Bernardin and Pence (1980) used 72 undergraduate psychology students for subjects. These subjects were assigned to two training conditions and a control group. In the first condition, rater error training, rating errors were defined and illustrated with distributions of scores, and a discussion took place concerning desirable and undesirable distributions of scores. In the second condition, rater accuracy training, the subjects received a lecture concerning the multi- dimensionality of performance and the importance of fair, unbiased, and accurate ratings were emphasized. Discussion then took place concerning the dimensions of performance for a classroom instructor and the subjects generated examples of high, medium, and low behaviors for all dimensions. The subjects observed a videotape of a classroom instructor and then gave ratings with a 883. IDifference scores were calcu- lated between subjects scores and scores from untrained, undergraduate students. A post-test only design was used. The results showed that raters given rater accuracy training or no training were significantly more accurate 31 than raters given rater error training. No significant differences were found between the rating accuracy training and control group subjects. These results suggest two possible conclusions. First, as with the Borman (1975) study, lectures concerning desirable and undesirable score distributions do not have an effect on accuracy. Second, the accuracy of ratings is likely to be greater when the emphasis is on accuracy rather than errors. However, this second conclusion must be tempered by the fact that a poor method of rater error training was used. Finally, the results of this study are highly suspect given the fact that untrained undergraduate student ratings were used as true scores. Rater accuracy training was also the focus of a study conducted by Thornton and Zorich (1980). In this study, 170 undergraduate psychology students were assigned to two training conditions and a control group. In the behavioral training condition, the subjects received a lecture where they were told to observe carefully, look for details, take notes, and note specific verbal and nonverbal behaviors. In the error training condition, subjects were lectured on systematic biases in ratings and were given the instructions provided to the behavioral training group. The subjects observed a 45 minute videotape of a leaderless group discussion and made post-test ratings on the occurrence of specific behavioral events. Accuracy was measured as the number of correct responses. 32 The results of this study indicate that subjects receiving error training were significantly more accurate than subjects receiving behavioral training, and that both groups were significantly more accurate than the control group. Caution must be exercised in interpreting these results as once again, true scores were generated by untrained undergraduate students. 
Given this limitation, these results suggest that lecture based training can increase the accuracy of ratings when the ratings call for the correct identification of specific behaviors. In addition, the results suggest that training emphasizing the elimination of rating errors and accurate observation is more effective than training only emphasizing the elimination of rating errors. Pulakos (1983) compared rater error training (RET) and rater accuracy training (RAT) using 108 undergraduate students as subjects. The former program was similar to the one conducted by Latham et. al. (1975) while the latter program used the rating instrument itself as a training tool along with focusing rater attention to the particular job performance dimensions and their corresponding levels of effectiveness. In both cases the rater practiced making ratings and received feedback on the accuracy of their ratings. A completely crossed fixed-factors design was used and consisted of the following conditions: (1) RET alone; (2) RAT alone; (3) RET/RAT together; (4) and no training. 33 When a distance measure of accuracy was calculated, the results indicated that RAT alone or RAT and RET together yielded ratings with higher accuracy than no training or RET alone. In addition, there was no significant difference between no training and RET alone, and RAT alone led to a significant increase in accuracy for three of the five dimensions on the rating scale. When a correlational mea- sure of accuracy was used, the results indicated that the subjects in the RAT alone condition produced the most ac- curate ratings. There was no significant difference between the RET alone and the RET/RAT condition; both conditions, however, produced more accurate ratings than the no training condition. The final study of this type was conducted by Fay and Latham (1982). They assigned 90 busineSs students to a training and no training condition'and to three rating format conditions (BES, 308, and TRAIT). The subjects were given four hours of rater error training using the procedures set forth by Latham et. al. (1975). Accuracy was calculated with respect to halo, contrast, and first - impressions errors using difference scores. Training led to more accurate ratings than ratings in the control group, regardless of the rating format. These results, at first glance, seem to indicate that rater error training in and of itself increases accuracy, unlike the results from the studies by Borman (1975, A 1977b), Bernardin and Pence (1978), and Pulakos (1983). This would be 34 expected given that Borman (1979) and Bernardin and Pence (1978) used a lecture method rather than a method incorporating practice and feedback. However, it does not explain why the Borman (1979b) and Pulakos (1983) training programs did not effect accuracy while in this study it did. All three studies used essentially the same training procedures, although it should be noted that the Fay and Latham (1982) program lasted a longer period of time. In addition to the differences in the amount of training time, two alternative explanations are possible. First, as Fay and Latham indicated, the business students used in their study may be more motivated to rate accurately than the liberal arts students used in the Borman (1979b) and Pulakos (1983) studies. Second, accuracy was defined in different ways. 
In the Borman (1979b) and Pulakos (1983) studies, accuracy was assessed without any consideration given to rating errors, whereas in the Fay and Latham (1982) study, accuracy was defined relative to rating errors. As suggested before, it may be the case that when rater error training is used to increase the accuracy of overall ratings, it is ineffective; when it is used to diminish rating errors, it is effective.

Time delay in rating. Several studies have provided evidence that the accuracy of ratings diminishes as a function of the delay between the observation and rating of performance. In one study (Rush et al., 1981), 144 college students were placed into an immediate rating condition and a 48-hour delay rating condition. The subjects rated the videotaped performance of a leader in a problem-solving group. Accuracy was measured as the number of times the subjects correctly recalled stimulus information on the tape. The results indicated that subjects giving a rating immediately following observation were more accurate than those subjects giving their rating 48 hours after observation. Similar findings have been reported for a 48-hour delay (Nathan and Lord, 1983) and for a delay of up to three weeks (Heneman and Wexley, 1983). In addition, Rush et al. (1981) found that this effect was independent of the memory capacity of the subjects and the type of performance feedback about the ratee given to the rater.

These results underscore the importance of cognitive processing in the rating process. In particular, they point to the futility of the common practice of having supervisors make ratings on a yearly basis. It should be noted, however, that the findings are from laboratory experiments and need to be extended to a field setting. In addition, explanation is needed as to why this effect takes place. The nonsignificant results for memory capacity reported by Rush et al. (1981) suggest that it is not the result of the information processing capacity of the rater. It may then be due to the passage of time itself between observation and the rating, or to the distortions that occur within this period of time (Heneman and Wexley, 1983). These and other explanations need to be further explored.

Observation: amount and method. In a fixed-factors design, Heneman and Wexley (1983) manipulated the amount of information observed by 180 undergraduate business students. In the first condition, subjects watched a 55-minute videotape of a production supervisor interacting with his subordinates in a manufacturing exercise. In the second and third conditions, the subjects viewed a random sample of 60 percent and 20 percent, respectively, of the critical incidents exhibited by the supervisor in the 55-minute tape. This main effect was significant: the subjects' ratings were more accurate the greater the amount of information observed. Future research of this type might hold constant the time or the number of observations in each condition to see which one is responsible for the amount-of-information effect.

Maier and Thurber (1968) examined the manner in which information was presented to the rater. They explored three different methods. Undergraduate psychology students (N = 219) were asked to decide whether a student did or did not cheat on an exam. They were divided into three groups in which they watched and heard a live role-play of the incident, heard a recording of this same role-play on audiotape, or read a transcript of the role-play.
The authors found that the subjects who read or listened to the role-play were significantly more accurate than those who watched and heard the live performance. There was no difference between the raters who read the role-play and those who listened to it. The authors attributed this finding to the fact that the raters in the two most accurate conditions had the opportunity to go back and review what was said. The implication here is that raters may need to make better use of unobtrusive measures of employee performance (e.g., actions described in letters, memos, and reports).

Purpose of the rating. Zedeck and Cascio (1982) found that the purpose of the rating accounted for 19 percent of rating accuracy variance. They assigned 130 undergraduate psychology and business students to three purpose conditions: recommending development, awarding a merit raise, or retaining a probationary employee. The subjects read a 33-paragraph description of the performance of supermarket checkers. Each paragraph contained information on one checker. The results provided the following rank ordering of the ability of raters to discriminate between ratees: retention, development, merit pay. This effect was significant. The implication here is that as the consequences of a decision increase, accuracy decreases. It should be noted, however, that accuracy here refers to discriminability, which is a necessary but not sufficient condition for accuracy.

Format and dimensions. In two of the studies just described, the effects of the rating format on rating accuracy were examined. Borman (1979b) found a significant effect for rating format. However, a significant job x format interaction indicated that there was no one format that was consistently better than the others for both jobs. Fay and Latham (1982) found that raters using behavioral scales (BES and BOS) were more accurate than raters using a trait rating scale. Osburn, Timmreck, and Bigby (1981) demonstrated that specific dimensions relevant to critical job behaviors were rated more accurately than generalized job dimensions by 52 experienced interviewers shown simulated job interviews on videotape. Taken together, these results indicate that formats with specific and behavioral statements are more likely to be rated accurately.

Borman (1977) also looked at the effects of the dimension being rated on performance rating accuracy. He found that on the whole these effects were consistent across rating formats. It would appear that some dimensions of performance are less ambiguous to the rater than others. Unfortunately, Borman did not report which dimensions were most accurately rated.

Rating errors. A disconcerting finding in the literature to date has been the positive relationship observed between halo and accuracy (Borman, 1977 and 1979b; Berman and Kenny, 1976; Warmke, 1980). This finding runs counter to a classic postulate in psychometric theory that predicts that accuracy decreases as halo increases. One explanation for this finding was tested by Cooper (1981), who suggested that this result may be due to unreliability in the halo and accuracy measures. He took the correlations reported in these four studies and corrected them for attenuation using conservative estimates of halo and accuracy reliability. Even after these corrections had been made, he found a median correlation of positive .275. Attenuation does not appear to be the answer.
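For reference, the correction for attenuation that Cooper's reanalysis relies upon is the standard psychometric formula (the notation below is a textbook rendering, not taken from Cooper's article):

$$ r_{c} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} $$

where $r_{xy}$ is the observed halo-accuracy correlation and $r_{xx}$ and $r_{yy}$ are the reliabilities of the halo and accuracy measures. Because the denominator cannot exceed one, the correction can only move an observed correlation farther from zero and can never change its sign; a corrected median that remains positive (.275) therefore shows that measurement unreliability by itself cannot account for the positive halo-accuracy relationship.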
A number of alternative explanations can be offered and need to be researched to resolve this paradox. First, given the small sample sizes used in these studies, the results may be due to sampling error. Second, there may be restriction in range because of the college samples employed. Third, halo may have been present in the performances viewed by subjects. That is, there may have been valid rather than invalid halo (Bartlett, 1983). A laboratory study conducted by Pulakos (1983) provides some support for this explanation. Finally, training to eliminate halo error when it is in fact not an error may decrease accuracy (Bernardin and Pence, 1980).

Summary. In summary, only a few of the propositions contained in the models reviewed in the previous section have been tested. The majority of these findings have dealt with contextual variables. The motivation to rate accurately and the cognitive processes involved in rating, two promising avenues of research as will be described in the next chapter, have received little attention thus far. The most robust finding to date has been the differential accuracy phenomenon (Gordon, 1970 and 1972). On the whole, however, the results have been disappointing. When effect sizes are reported, they seldom exceed .05. Perhaps this is to be expected when contextual variables are studied in a laboratory setting with undergraduate subjects. It appears that more field research is necessary.

True Scores. As shown in the definitions section of this chapter, it is imperative that there be a "true" score in order to define accuracy. A true score is the correct or actual behavior or performance engaged in by the ratee over time. An accuracy score is developed when the relationship between the true and observed scores is calculated. Not only is it quite difficult, if not impossible, to develop a perfect measure of actual employee performance, but even perfect measurement does not guarantee that this true score will be relevant (Thorndike, 1949). That is, this true score may not be related to the ultimate contribution of the employee to the organization. Finally, even if we did develop a perfectly relevant and true score for one employee, it may be one that, in terms of research, is not externally valid. That is, the employee and job may not be representative of the universe to which we wish to generalize our findings. For all these reasons, current research utilizes measures which only approximate true scores and which may or may not be relevant or externally valid. These attempts to develop an approximation to a true score will be briefly discussed in this section, and some suggestions will be offered as to directions to be taken to develop better true scores.3

    3A review of all the studies in which true scores were developed is not presented here. Instead, the studies included in this section are illustrative of the major methods used to develop true scores.

Fay and Latham (1982) developed true scores by using the videotapes constructed by Latham et al. (1975) and described in detail in the methodology chapter of this dissertation. Briefly, these tapes showed applicants being interviewed for a clerk and a management trainee position. True scores were developed by editing the tapes such that sections of the tapes intended to elicit rating errors (e.g., first-impression error) were eliminated. Then, the ratings of these modified tapes provided by 40 upper-level business students were used as true scores. Hence, rather than using "experts" to eliminate rating errors in the videotape, the authors physically removed these errors from the tape. The external validity of these tapes is good, as the jobs (clerk and management trainee) and situation (interview) are familiar to most people.
In addition, if the purpose is to measure the difference between ratings that are free from rating errors and ratings that contain these errors, the method of physically removing these errors rather than relying upon experts appears to be a good one. However, it cannot be assumed that the resultant ratings of this tape represent perfect true scores. They are still an approximation to a true score, as the true scores may be given by experts with inadequate observation skills. In order to deal with this problem, after the judgment errors have been removed from the tapes, the expert raters can be trained in observational skills before making their ratings. In other words, a combination of physically removing errors from the tapes and using expert, trained raters may be the best approach. In the Fay and Latham (1982) study, untrained raters were used.

Videotapes were also used by Borman (1977) to develop true scores. The tapes depicted an interviewing situation and a manager talking with a problem subordinate. Intended true scores were generated using expert judges (their backgrounds were not reported) to estimate true means, standard deviations, and intercorrelations between items. The intraclass correlations of these judgments were .81 and .82 for the recruiter and manager jobs, respectively. Scripts were then written to reflect these expert rater scores, and actors were taped acting out the scripts. Next, fourteen new expert raters (graduate students in psychology and practicing industrial psychologists) observed the videotapes. Before doing so they reviewed the scripts, and they took notes while observing the videotapes. Their ratings were then used as true scores. The median intraclass correlation for each dimension was .93, and the median correlation between expert ratings and intended true scores for each dimension was .93.

Like the Fay and Latham (1982) videotapes, these tapes depicted well-known industrial situations. Through the use of intended true scores, a high degree of realism and control over the behaviors exhibited by the actors was made possible. Moreover, the true score ratings had a high level of reliability - a necessary condition for a true score. Again, however, these tapes were not flawless. In particular, each tape was only five to nine minutes long. In most organizations, impressions of employee performance are formed over much longer periods of time. In addition, the expert raters were not provided with any training for accuracy or rating errors. However, given their positions, it can probably be assumed that they had been exposed to these ideas at some point in their careers.

In another study of accuracy, Bernardin and Pence (1980) used videotapes to construct true scores. The videotapes, developed by Eder, Keaveny, McGramm, and Beatty (1978), depicted critical behaviors exhibited by a classroom instructor. True scores were developed from the ratings of 27 untrained undergraduate students. There are two flaws associated with this method of true score development. First, the generalizability of the job (classroom instructor) to industrial situations is questionable. Second, and more importantly, the experts who gave the ratings that served as true scores did not receive any training, or the authors did not report having provided it.
Hence, the ratings of untrained undergraduate students served as the criterion for trained undergraduate students! A similar problem exists for the true scores developed by Thornton and Zorich (1980).

True scores were also developed by Heneman and Wexley (1983). In this study, 55-, 35-, and 20-minute videotapes were constructed. The tapes depicted the performance of a production supervisor as he interacted with his subordinates in a manufacturing exercise. Graduate students trained in the critical incident technique (Flanagan, 1954) were used to record the frequency of critical incidents. These counts were used as true scores when at least two of the three experts described the incident in the same way and agreed where it occurred on the tape. While the experts were trained, they were not experienced raters, and this may be a limitation of the resultant true scores. Also, only a small number of expert raters were used, which creates reliability problems.

In an interesting study by Maier and Thurber (1968), a number of different methods were used. A role-played interview in which a student was accused of cheating by his instructor was shown live, tape recorded, and transcribed. In each case, the true score was whether the accused student did or did not admit to cheating. In this case, experts were not needed, as an objective true score was possible. While this feature is desirable from a relevance point of view, it is not very similar to a rating situation. Raters are usually asked to make a number of judgments or observe a number of behaviors, not just one. Both the transcript and the audiotape recording were also an unrealistic depiction of the rating process, as it was not possible for the rater to see the ratee.

The videotapes developed by Gordon (1970 and 1972) and Nathan and Lord (1983) are excellent examples of the manipulation of content in videotapes. In both studies, scripts were developed for the actors which systematically manipulated the favorability (good or bad) of the incidents. Similar efforts could be undertaken to manipulate other factors of performance, including the type, frequency, and duration of various behaviors.

In summary, the predominant method of generating true scores is through the viewing of videotaped performance by "experts." Future developers of these scores should use intended true scores and an adequate number of trained and experienced raters. These experts should also have the rating format fully explained to them, have scripts of the performances to be observed, and be familiar with or have the job description for the person being rated. If possible or if necessary, rating errors should be edited from the tapes observed by the experts. Finally, to the extent that the results are to be generalized to industrial settings, classroom instructors should not be used as ratees.

While videotapes offer an important element of control for the development of true scores, they suffer from a lack of realism. In particular, they are of short duration, depict simulated work activities, and do not present live performance. In order to counteract these weaknesses, the performance taped should be from actual rather than simulated situations. It might be possible, for example, to obtain industrial engineering tapes of worker performance. Alternatively, the cameras used to monitor employees and customers in banks and other businesses might be used for this purpose.
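To make the two families of accuracy measures used throughout this chapter concrete, the following sketch shows how a distance (difference score) measure and a correlational measure might be computed once true scores are in hand. It is a minimal illustration with hypothetical numbers, written in modern Python; it does not reproduce the exact scoring procedures of any study reviewed above.

    import numpy as np

    def distance_accuracy(ratings, true_scores):
        # Root mean squared difference between a rater's scores and the
        # true scores; smaller values indicate more accurate ratings.
        ratings = np.asarray(ratings, dtype=float)
        true_scores = np.asarray(true_scores, dtype=float)
        return np.sqrt(np.mean((ratings - true_scores) ** 2))

    def correlational_accuracy(ratings, true_scores):
        # Pearson correlation between a rater's scores and the true
        # scores; larger values indicate more accurate ratings.
        return np.corrcoef(ratings, true_scores)[0, 1]

    # Hypothetical example: one rater's scores on five performance
    # dimensions, alongside expert-derived true scores.
    rater = [7, 5, 6, 8, 4]
    truth = [6, 4, 7, 8, 3]
    print(distance_accuracy(rater, truth))       # about 0.89
    print(correlational_accuracy(rater, truth))  # about 0.91

Note that the two measures can disagree: a rater whose scores track the true ordering perfectly but run two points too high would show excellent correlational accuracy and poor distance accuracy.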
Summary. Performance rating accuracy can be defined by either the correlation or the distance between actual employee behaviors and the employee behaviors recorded by a rater. A comprehensive model of rating accuracy would include characteristics of the rater and ratee, contextual factors, and feedback loops between these sets of variables. The empirical findings reviewed here suggest that ratings are more likely to be accurate when raters have high memory capacity and are familiar with the ratee's job, and when ratees exhibit desirable or acceptable behaviors. Furthermore, these findings suggest that ratings will be more accurate when the context is such that raters receive training, use a rating format with specific and behavioral items, observe a large number of ratee behaviors, minimize the delay between the observation and recording of performance, and use these ratings to make decisions concerning retention and development rather than merit pay.

As can be seen from this summary, the research on performance rating accuracy is very limited. Organizations that wish to increase the accuracy of ratings can take some obvious steps based on these findings, but more attention must be given to this topic if there is to be further progress in the prediction, understanding, and control of rating accuracy. In particular, a number of revisions need to be made to the models of accuracy reviewed here, and direct tests of the hypotheses deduced from these models need to be made. In order to accomplish this, more field research needs to be conducted and more careful attention needs to be devoted to the construction of true scores. Finally, two promising lines of research, those involving the cognitive processing capability of the rater and the motivation to rate accurately, need to be further developed. These two variables will be the focus of the next chapter.

CHAPTER 3
Models and Hypotheses

In the previous chapter it was pointed out that two of the more important processes in the rating task, the cognitive tasks confronting the rater as he processes performance information and the motivation of the rater to rate accurately, have received very little attention in the literature to date. Drawing upon cognitive processing theories and expectancy theory, the usefulness of these two processes in predicting and understanding rating accuracy will be presented in this chapter. From this presentation, a series of testable propositions will be advanced, and then, in the next two chapters, a formal test of these propositions will be described.

The major variables and the relationships to be examined are summarized in Figure 1. Circled numbers in the figure correspond to the hypotheses in the text.

Figure 1. Summary of major variables and relationships. [The figure links the two independent variables, rater training and rating format, to the two dependent variables, rating accuracy and the motivation to rate accurately.]

The information in this chapter is organized in the following fashion. In the first section, a discussion of performance rating accuracy is presented. Within this section, a description of cognitive theories of the rating process is presented and is used to generate hypotheses concerning the impact of rater training and rating format on rating accuracy. The second section looks at the motivation to rate accurately. A description of expectancy theory as it relates to the rating process is presented and is used to make predictions about the effects of rating format and rater training on the motivation to rate accurately.
The final section is concerned with the relationship between rating accuracy and the motivation to rate accurately.

Performance Rating Accuracy

A large number of authors have argued that the cognitive processing tasks undertaken by the rater play an important role in the performance rating process and deserve more careful attention (Atkin and Conlon, 1978; Bartlett and Wherry, 1982; Bernardin and Beatty, 1984; Borman, 1978; Carroll and Schneier, 1982; Cooper, 1981; Feldman, 1981; Heneman and Wexley, 1983; Kraiger, 1983; Landy and Farr, 1980 and 1983; Lopez, 1968; Nathan and Lord, 1983; Murphy et al., 1982; Wherry, 1952). As a result, several models of this process have been developed. These models will be briefly reviewed here so that they can be used to explain the expected effects of rating format and rater training on rating accuracy.

Wherry (1952) and Lopez (1968) appear to be among the first to describe cognitive processes in the rating process. Wherry (1952), as reported in Wherry and Bartlett (1982), felt that this process involved the observation of performance by the rater and the recall of this performance when a rating was to be made. While not acknowledging Wherry's work, Lopez took this model one step further. He suggested that once the rater had recalled the performance observed, the rater then had to "interpret" or make a summary judgment about the ratee. In addition, he built a feedback loop into this process by suggesting that the recall and interpretation of the performance observed influenced what the rater observed in the next round of observations.

More modern theorists have refined this basic model. In particular, Landy and Farr (1980) pointed out that an additional step takes place between observation and recall. This step is one of storage, where the observed performance is organized and integrated with previously stored information for recall at a later time. According to Feldman (1981), storage will be an unconscious process when there are existing categories and a conscious process when new categories need to be formed to store the observations. Cooper (1981) emphasized that these observations are placed into storage in two phases: short- and long-term memory. At each stage, distortion in the memory trace is possible.

Several authors have also refined the judgment stage of the model. Borman (1978) suggested that raters give weights to the various dimensions of performance and then sum the weighted dimension scores to arrive at a final judgment about the ratee. The importance of attributional processes in the judgment stage has also been incorporated into the model (Bernardin and Beatty, 1984; Landy and Farr, 1983; Carroll and Schneier, 1982). Attributions by raters concerning the causes and consequences of ratee behavior can influence the rater's judgments of the ratee and the formation of categories for the storage of observations.

In summary, the following components have been included in current models of the cognitive tasks performed by raters as they process performance information: observation, storage, retrieval, and judgment. While these components quite often take place in a sequential fashion, this is not always the case. Feldman (1981) points out that these components are interacting and cyclical. For example, the earlier discussion of Lopez's (1968) work suggested that the retrieval and judgment stages may have an impact on the observation stage.
Given this general overview, two specific issues which have direct implications for the accuracy of ratings will now be addressed. The first issue concerns the ability of the rater to accurately recall specific behavioral incidents instead of broad categorical events. There are two distinct schools of thought with regard to this issue (Alba and Hasher, 1983; Nathan and Lord, 1983). The "traditionalist" viewpoint, as represented by the works of Bartlett and Wherry (1982), Borman (1978), and Lopez (1968), suggests that raters are able to store and retrieve the originally observed behaviors. To the extent there are a manageable number of observations to process and the demands on memory are not too great, raters should be able to accurately recall their initial observations and make judgments about the ratee's performance on the basis of these observations.

A decidedly different point of view is taken by categorization or schema theorists, as represented by the ideas of Cooper (1981), Feldman (1981), and Murphy, Martin, and Garcia (1982). Because there are so many distinct behavioral observations to process, the ones that are observed, stored, and recalled are based upon predetermined categories or schema developed by the rater through previous experiences. As a result, when ratings are to be made, they are based upon these general categories rather than specific events. In a sense, the specific event is reconstructed from these general categories, and hence, ratings are accurate to the extent this reproduction process is accurate.

At the risk of some oversimplification, these arguments can be extended to predictions concerning the accuracy of two types of rating formats. Borman (1978) has argued that rating formats should be developed to explicitly take into account the cognitive processes of the rater. Given this notion, the traditionalists might contend that a rating format like the BOS or BES, based upon specific critical incidents, would result in more accurate ratings. On the other hand, schema theorists might argue that global trait categories more closely approximate the cognitive processes of the rater and are therefore more likely to lead to accurate ratings. These two propositions have not been tested thus far, but they serve as stimuli for the hypotheses to be advanced in a later part of this chapter. Indirect evidence on this issue has been mixed. In a review of the cognitive psychology literature, Alba and Hasher (1983) found little empirical evidence and a great deal of theory to support the schema theorists, and a large amount of empirical evidence and little theory to support the traditionalist view. In a laboratory investigation of the performance rating process, Nathan and Lord (1983) found some support for both positions. This issue will be returned to in the discussion of rating format and rater training.

Another issue has to do with whether the observation and judgment stages of cognition are distinct from one another. One might argue that they are not. As Feldman (1981) points out, the four stages of cognition previously discussed are cyclical, and thus observation determines judgment and judgment determines observation. It is also possible to argue that they are relatively distinct.
Thornton and Zorich (1980) describe the fundamental differences between observation and judgment in the rating task:

    Prior research in this area has not made clear the distinction between the process of observation and judgment. Judgment processes include the categorization, integration, and evaluation of information. The observation processes are more basic, including the detection, perception, and recall or recognition of specific behavioral events. (p. 351)

Only one study addresses this issue in the context of the performance rating process. Murphy, Martin, and Garcia (1982) found that the correlation between observation and judgment is modest. In this study, observation was measured using frequency-of-observation ratings and judgment was measured using a trait rating scale. The four major formulas for measuring accuracy, developed by Cronbach (1955) and sketched below, were used. The same raters completed both instruments for the same ratee. Only five of the 16 possible correlations (four accuracy measures using frequency-of-observation ratings x four accuracy measures using trait ratings) were significant at the .05 level, and of these five correlations, the magnitude was less than .45 for four of them. Hence, judgment and observation appear to be two separate but related concepts, and they will be treated this way in the discussion of rating format and rater training.
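For the reader's convenience, the four Cronbach (1955) accuracy components mentioned above can be written out. The notation is a common textbook rendering rather than a quotation from Cronbach: $x_{ij}$ denotes the rating given to ratee $i$ ($i = 1, \ldots, n$) on dimension $j$ ($j = 1, \ldots, k$), $t_{ij}$ the corresponding true score, and bars denote means over the replaced subscripts.

$$\text{Elevation} = (\bar{x}_{..} - \bar{t}_{..})^2$$

$$\text{Differential elevation} = \frac{1}{n}\sum_{i}\left[(\bar{x}_{i.} - \bar{x}_{..}) - (\bar{t}_{i.} - \bar{t}_{..})\right]^2$$

$$\text{Stereotype accuracy} = \frac{1}{k}\sum_{j}\left[(\bar{x}_{.j} - \bar{x}_{..}) - (\bar{t}_{.j} - \bar{t}_{..})\right]^2$$

$$\text{Differential accuracy} = \frac{1}{nk}\sum_{i}\sum_{j}\left[(x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}) - (t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..})\right]^2$$

Under this rendering the four components sum to the overall mean squared difference between ratings and true scores, so a rater can be accurate in one sense (e.g., overall level) while inaccurate in another (e.g., the ordering of ratees within dimensions).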
Format Effect

Given the cognitive demands placed upon the rater, there are a number of reasons why ratings obtained with a frequency of observation scale should be more accurate than those obtained with a trait rating scale. First, the former format requires less complex judgments to be made (Feldman, 1981). Observation, storage, and retrieval are needed, but the rater does not have to make complex inferences from this information, as is true with the latter format (Weick, 1968). As a result, the opportunity for judgmental errors is less frequent (Borman, 1983). Skeptics may argue that raters are poor at recalling specific behavioral incidents, but the research evidence previously reviewed does not support this contention.

Second, the frequency of behavior scale has an "objective" criterion against which raters can check their ratings (Feldman, 1981). The criterion is simply whether and how often the behavior occurred. It is probably more difficult for the rater to test his ratings when a trait rating scale item like "dependability" is used.

Third, trait ratings are ambiguously worded and interpreted differently by different raters (Bartlett and Wherry, 1982). Consequently, they are subject to distortion. They are also subject to distortion because traits are not directly observable, whereas frequency of observation items are directly observable and thus less subject to distortion. Campbell (1961) elaborates upon this point:

    The greater the direct accessibility of the stimuli to the sense receptors, the greater the intersubjective verifiability of the observation. The weaker or more intangible, indirect, or abstract the stimulus attribute, the more observations are subject to distortion. (p. 340)

Because traits are not directly observable, inferences must be made from what was observed before a rating can be made (Carroll and Schneier, 1982). These inferences are subject to distortion. Given these considerations, the first hypothesis to be tested in the dissertation is that:

1. The use of a frequency of behavior scale will produce more accurate ratings than will the use of a trait rating scale.

Training Effect

In the review of training programs designed to increase accuracy, presented in the previous chapter, it was pointed out that training which gives raters the opportunity to practice making accurate ratings and to receive feedback on the accuracy of their ratings is more likely to be successful than training that does not offer practice and feedback. Consequently, the following hypothesis is to be tested:

2. Raters given training that provides practice and feedback on the accuracy of their ratings will be more accurate than raters who are not given practice and feedback.

It is also expected that the content of the training program will exert an influence on rating accuracy. Present methods of rater training focus on eliminating judgment errors like halo and leniency (Spool, 1978). An alternative type of training focuses on developing the rater's observational skills. To the extent that prototypical behavioral categories (Feldman, 1981) can be established through training and can then guide the subsequent processing of performance information by the rater, it is expected that raters will have a more accurate information base from which to generate more accurate ratings. The observational skills of the rater will be sharpened, and also, because there is an obvious carry-over from the observation to the judgment stage of cognitive processing (Feldman, 1981), the judgment skills of the rater may also be improved.

Rater error training, which is designed to eliminate judgment errors, is less likely to have as large an impact on both observation and judgment. While there is a carry-over effect from judgment to observation, this effect is probably not as pronounced as the effect of observation on judgment. Before any sort of judgment can be made, some sort of observation must occur. If the initial observation is inaccurate, then the judgment based on that observation is likely to be inaccurate. To the extent this judgment then determines future observations, those observations are also likely to be inaccurate, even if rater training takes place, because of the inaccurate initial observation. As a result, it is hypothesized that:

3. Observational rater training will produce more accurate ratings than will rater error training.

Both of these training programs are described in detail in the methodology chapter of this dissertation. Rater error training appears to be grounded in classic psychometric theory. This theory, as detailed in Bartlett and Wherry (1982), suggests that there are a number of systematic judgment errors (e.g., halo and leniency) that occur as raters process performance information. To the extent these errors can be minimized through training, accuracy should be increased.

The theoretical backdrop for observational training comes from Boice (1983), Flanagan (1944a, 1944b, and 1952), Flanagan and Burns (1957), and Weick (1968 and 1979). According to Weick (1968), observation is a four-stage process. First, there is a selection stage where a decision is made about what to observe. This decision is guided by (1) the preset cognitive categories held by the rater (Weick, 1979); (2) the rating scale or some other "standard operating procedure" to which the rater must adhere (Weick, 1979); and (3) the characteristics (Gibson, 1960) and organization (Kohler, 1956) of the stimuli.

Second, there is a preparation stage where the rater must put himself into an appropriate situation to observe that which has been selected to be observed.
The rater is not a passive recipient of environmental stimuli, but instead engages in a process of enactment (Weick, 1979). In doing so, the rater actively creates the environment in which the rater and ratee interact. Thus, the rater is like a "participant-observer" conducting anthropological research (Firth, 1951). Actions taken by the observer have an impact on the ratee's performance.

Third, the rater must mentally or physically record his observations. This is the recording stage. Finally, there is an encoding stage where the rater must mentally or physically keep track of the frequency of similar observations.

In order for this process to run smoothly and thus produce accurate ratings, Weick (1968) felt that inferential demands upon the rater must be minimized. Compared to classical psychometric theory, then, the objective here is to get raters to minimize the need to make judgments rather than to learn how to avoid making errors in judgment. In order to implement this idea, Boice, Flanagan, and Weick have come up with a number of suggestions, which are listed below:

. The selection stage should emphasize categories that are specific enough to be observed, but not so specific that they place unrealistic recall demands upon the rater.

. Categories should guide the rater on what to observe, but should not be so complex that more attention is paid to the categories than to the ratee. That is, performance on the job is equivocal, and for it to be captured, the categories must also be equivocal (Weick, 1979).

. Observations should be physically recorded to guard against memory loss. Methods to accomplish this are provided by Flanagan and Burns (1957) and Smith (1982).

. Observations should be directly accessible to the rater's senses.

. Raters should put themselves in situations where the behavior to be observed is likely to take place.

. Behaviors rather than traits should be emphasized.

. Incidents that are critical to employee success or failure should be emphasized.

These recommendations are incorporated into the observational training program used in this study.

Format x Training Effect

Given the importance of cognitive processes in the rating task, Borman (1978) has argued that the rating format should be consistent with these processes, and Spool (1978) has argued that rater training programs should model these processes. Taking this view one step further, the contention here is that the rating format and the rater training program must be consistent with one another.

Frequency of behavior scales and observational rater training focus on the observation stage of cognition, while trait rating scales and rater error training are concerned with the judgment stage. If these two stages are somewhat distinct, and the literature reviewed in an earlier section indicates that they are, then the appropriate matching of format and training should be made. If this is not done, and a training program based upon observation (judgment) is paired with a subsequently used rating format based upon judgment (observation), then the elements of the rater training program are unlikely to transfer to the rating task. Consequently, the following hypothesis is offered:

4. The accuracy of ratings will be greater for those raters receiving observational training and a frequency of behavior scale than for those raters receiving observational training and a trait rating scale.
Likewise, the accuracy of ratings will be greater for those raters receiving rater error training and a trait rating scale than for those raters receiving rater error training and a frequency of behavior scale.

The Motivation to Rate Accurately

The importance of the motivation of the rater to rating accuracy has long been recognized (Bayroff, Haggerty, and Rundquist, 1954; Taft, 1955). However, as discussed in the previous chapter, very little attention has been given to this topic. Two notable exceptions are the models set forth by DeCotiis and Petit (1978) and Mohrman and Lawler (1983). Both of these models take an expectancy theory view (Mitchell, 1974; Vroom, 1964) of the rating process. From this perspective, the decision of the rater to initiate and persist in behavior that will lead to accurate ratings is a function of the belief that effort at the rating task will lead to accurate ratings (expectancy perceptions) and the belief that accurate ratings will lead to certain outcomes (instrumentality perceptions).
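Before turning to these models' specific variables, the expectancy theory logic can be summarized in the usual notation (a common textbook rendering of Vroom, 1964, not a formula drawn from DeCotiis and Petit or Mohrman and Lawler):

$$ F = E \times \sum_{j=1}^{n} I_j V_j $$

where $F$ is the motivational force to strive for accurate ratings, $E$ is the perceived probability that effort at the rating task will yield accurate ratings, $I_j$ is the perceived instrumentality of accurate ratings for obtaining outcome $j$, and $V_j$ is the valence of that outcome. The format and training hypotheses developed below are, in these terms, statements about conditions expected to raise $E$.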
as opposed to 64 job-oriented scales, person-oriented scales may be more prone to cast supervisors as judges instead of observers, to (make them less certain of their ratings, and to be less acceptable to them, p. 5693' As a result of these consider- ations it is hypothesized that: 5. The motivation to rate accurately will be less for raters using a trait rating scale than for those using a frequency of observation scale. This hypothesis assumes that instrumentality percep- tions are held constant. When allowed to vary, this hypo- thesis may no longer hold. For example, raters may be more motivated when using a trait rating scale because they are, held less accountable for their ratings. Tainan—Effect Expectancy perceptions should also be strengthened by rater training. Compared to raters receiving no practice or feedback, raters given the opportunity to make ratings and receive feedback on accuracy should have improved skill levels and more self-confidence about these skills (Schneier A Carroll, 1982). Hence, it is hypothesized that: 6. Raters given training that provides practice and feedback on the accuracy of their ratings will be more motivated to rate accurately than raters in a control group that do not receive this practice and feedback. Similarly, the content of the training program is also expected to have an impact on the motivation to rate accu- rately. Given that observational training is less complex 65 (i.e. does not require the rater to learn how to make com- plex inferences about the ratee's behavior), the expectancy of accurate ratings will probably be stronger for those raters trained in observational techniques. Consequently, it is hypothesized that: 7. Those raters given observational training will be higher in motivation to rate accurately than those given rater error training. WW If a training program based upon observation (judgment) is paired with a subsequently used rating format baSed upon judgment (observation), then two things are likely to happen. First, the elements of the rater training program are unlikely to transfer to the rating task. Second, the expectancy that effort at rating will lead to accurate performance ratings is likely to be diminished. Therefore, the following hypothesis is to be tested: 8. The motivation to rate accurately will be greater for those raters receiving observational training and a frequency of behavior scale than for those raters receiving observational training and a trait rating scale. Likewise; the motivation to rate accurately will be greater for those raters receiving rater error training and a trait rating scale than for those raters receiving rater error training and a frequency of behavior scale. 66 W In the previous chapter it was shown that rating accu- racy is not only a function of the rater's ability, but is also a function of the rater‘s motivation to rate accurate- ly. Consequently, it is expected that: 9. There will be a positive correlation between rating accuracy and the motivation to rate accurately. Causation should not be inferred from this hypothesis. As suggested above, it may be the case that motivation causes accuracy. On the other hand, it is equally likely that accuracy causes motivation (Johnson, 1945). That is, if the ratings are perceived by the rater to be accurate, this may lead to more confidence in the ratings, and in turn this confidence may increase motivation. Summau A summary of the hypotheses presented in this chapter is listed below: 1. 
1. The use of a frequency of behavior scale will produce more accurate ratings than will the use of a trait rating scale.

2. Raters given training that provides practice and feedback on the accuracy of their ratings will be more accurate than raters not receiving this practice and feedback.

3. Observational rater training will produce more accurate ratings than will rater error training.

4. The accuracy of ratings will be greater when the rater training and the rating format are consistent with one another.

5. Motivation to rate accurately will be less for raters using a trait rating scale than for those using a frequency of behavior scale.

6. Raters given training that provides practice and feedback on the accuracy of their ratings will be more motivated to rate accurately than raters not receiving this practice and feedback.

7. Raters given observational training will be more motivated to rate accurately than those given rater error training.

8. The motivation to rate accurately will be greater when the rater training and rating format are consistent with one another.

9. There will be a positive correlation between rating accuracy and the motivation to rate accurately.

The research methodology used to test these hypotheses is presented in the next chapter. The chapter following the next one presents the empirical results.

CHAPTER 4
Research Methodology

The research methodology used to test the hypotheses generated in the previous chapter is presented here. Two independent variables were manipulated in a laboratory experiment: the content of the training program and the type of rating format. After training, subjects in the four experimental conditions and the two control groups observed and then rated the videotaped performance of a production supervisor managing two subordinates during a manufacturing exercise. These scores were then used to measure the first dependent variable, rating accuracy. Subjects also completed an instrument designed to measure the second dependent variable, motivation to rate accurately. The following sections in this chapter describe the experimental design, the manipulations of the independent variables, the measurements made, the subjects and procedures used, and the method of data analysis.

Experimental Design

A 2 x 3 factorial design was used and is presented in Figure 2.

                                Rating Format
    Rater Training       Traits                  Behaviors
    Error                Experimental Group 1    Experimental Group 2
    Observation          Experimental Group 3    Experimental Group 4
    Control              Control Group 1         Control Group 2

    Figure 2. Experimental design.

The first factor, rating format, consisted of two levels: frequency of behavior scale and trait rating scale. The second factor, rater training, was defined by three levels: rater error training, observational training, and control group training. Dependent variables were measured after the treatments had been applied and not before. Hence, there were no repeated measures and only a post-test was made. A pre-test was not made in order to prevent any sensitization of the subjects to the test prior to the treatments.
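As an aside on analysis, data from a completely crossed design like this are conventionally examined with a two-way analysis of variance testing both main effects and their interaction. The sketch below is illustrative only; the variable names and data are hypothetical and do not come from this study.

    import numpy as np
    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(0)

    # Hypothetical data: ten raters in each of the six cells of the
    # 2 (format) x 3 (training) design, each with one accuracy score.
    rows = [
        {"format": f, "training": t, "accuracy": rng.normal(0.6, 0.1)}
        for f in ("trait", "behavior")
        for t in ("error", "observation", "control")
        for _ in range(10)
    ]
    df = pd.DataFrame(rows)

    # Fit both main effects and their interaction, mirroring the
    # format, training, and format x training hypotheses.
    model = ols("accuracy ~ C(format) * C(training)", data=df).fit()
    print(anova_lm(model, typ=2))  # F test for each effect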
Manipulations

In this section, a description of each of the independent variables, training content and rating format, will be presented.

Training Content

Three different types of training were given: rater error training, observational training, and control group training. A comparison among the different types of training and a description of each one will now be covered.

Comparison of the three training programs. The content, or "what was presented," was different for each training program. In rater error training, emphasis was placed upon common judgment errors that can occur in the rating process (e.g., halo, leniency) and ways to eliminate them. Observational training focused on a method of establishing prototypical rating categories (e.g., critical behavioral incidents). While the emphasis was different for these two training programs, there was one area of overlap. Both programs encouraged the participants to focus on job-related rather than non-job-related behaviors. Training given to the control groups did not cover rating errors, nor did it cover observation methods. Instead, a general overview of the rating process was presented, which included the definition of performance rating, uses for performance ratings, major court cases, and the performance appraisal interview.

The process, or "how the information was presented," had some similarities and differences across training programs. All three of the training programs lasted approximately three hours. Unlike in the two experimental conditions (rater error training and observational training), the material was presented to the control groups through lecture, discussion, and role playing. The training method used for both rater error training and observational training was set forth by Wexley, Sanders, and Yukl (1975). It has been labeled the most "advanced" rater training program developed (Borman, 1979b), as it is heavily based upon learning theory. Emphasis is placed upon giving raters the opportunity to observe and practice performance rating using videotapes, providing raters feedback on the accuracy of their ratings, and making the program meaningful by using realistic stimuli (Spool, 1978). These procedures and the same videotapes were used for rater error training and observational training.

All three training programs were pre-tested by the trainer using groups of 5-10 undergraduate and graduate students. The training was conducted, reactions and points of confusion were elicited, and steps were taken to correct any deficiencies. The training programs presented in this section are the final versions, after the corrections warranted by the pre-test were made.

In summary, the three training programs were all the same length. The control group differed from the experimental groups in terms of both the content of the training program and the process used to present that content. An identical process was used to present rater error training and observational training. The content of these two training programs, however, was quite different.

Rater error training. Subjects in this experimental condition participated in three exercises. In each exercise, subjects watched a videotape of a hypothetical job candidate being evaluated by an interviewer (Latham and Wexley, 1981). Job applicants were applying for a bookkeeper position and a management trainee position. These jobs and the interviewing situation were chosen for this study because most of the subjects were familiar with these jobs and were responsible for conducting interviews. Subjects were asked, in each exercise, to rate the performance of the job applicant on nine-point scales.
Three points on the scale were anchored with a verbal description: 9 - "would recommend strongly that an offer be made; applicant shows excellent qualifications in all areas"; 5 - "would recommend with reservations that an offer be made; applicant has weak qualifications in some areas"; and 1 - "would recommend that no offer be made; applicant obviously unqualified."

The first exercise was concerned with "first impressions error." This error occurs when the evaluation is primarily based upon the rater's initial reactions to the ratee (Latham and Wexley, 1981). Subjects were first provided with and asked to read a job description and a list of the minimum qualifications necessary for the job of bookkeeper. Next, they watched a videotape of a female applicant as she was interviewed for the bookkeeper job. In the beginning of the videotape, the applicant exhibited some unfavorable characteristics that were not related to the minimum qualifications for the job (e.g., she dropped her checkbook and was uncertain of the name of the position she was applying for). In the last part of the videotape, the applicant made it clear that she did have the necessary qualifications for the job (e.g., she had the appropriate degree and experience).

Subjects were then asked to indicate whether they would hire her for the job using the nine-point scale previously described. After making the rating, the subjects shared their ratings and the reasons for their ratings with the rest of the group. At the end of this stage, the trainer gave them feedback on the accuracy of their individual ratings. It was pointed out that a low rating indicated first impressions error, and this term was defined for the subjects. In the final part of this first exercise, the trainer elicited from the group examples of first impressions error back on the job. Typical examples included the following:

. The performance of a new employee was rated on the basis of his first few days on the job.

. An employee was constantly given difficult and dirty assignments because his performance on a new task was low. As a result, his performance was rated low on all tasks.

The trainer and the group then discussed ways to eliminate this error back on the job. Solutions discussed included the following:

. Snap judgments should not be made concerning the performance of an employee. Judgments should not be made until the end of the appraisal period.

. Work assignments should not be made on the basis of initial impressions of the employee.

The second exercise focused on the "similar-to-me" effect (Latham and Wexley, 1981). This error occurs when a ratee is evaluated more favorably because he is similar to the rater along non-job-related dimensions. Subjects were again asked to read a job description and the minimum qualifications for the job. This time, however, these were for the job of management trainee. The subjects then watched a videotape of a male applicant being interviewed for this position. The videotape showed that there was a high degree of attitudinal and biographical similarity between the interviewer and the interviewee. The interviewee did not, however, meet the minimum qualifications for the job.

After watching this videotape, each subject was asked to make a rating of the applicant using the same nine-point scale. They were also asked to indicate, using this scale, the rating they felt the interviewer would give to the applicant. Each subject then presented his ratings to the rest of the group and gave reasons for making them.
At this point, the last part of the videotape was shown, in which the interviewer committed similar-to-me error by telling the next person scheduled to interview the applicant that this applicant was an excellent candidate because of his similar attitudinal and biographical background. The trainer then gave the subjects feedback on the accuracy of their ratings. It was suggested by the trainer that subjects with high scores for the applicant had committed similar-to-me error, as verified by the justifications they gave for their ratings, and this term was defined.

As a final part of this exercise, the trainer elicited examples of similar-to-me error back on the job. The following items are examples from this discussion:

. The children and spouses of the rater and ratee know and interact with each other.

. The rater and ratee ride together in the same car pool.

. The rater and ratee like to get to work early in order to have a cup of coffee together.

. They both like the same sports teams.

The trainer and the subjects then generated a list of ways to overcome this error back on the job. Examples of these solutions are as follows:

. Establish performance criteria for the job before making a rating.

. The rater should check on his ratings by having other raters, with backgrounds and attitudes different from his own, review his ratings.

. Employees should be rated on how well they perform the job instead of how similar they are to the rater.

The third and final exercise centered around "halo" and "leniency" errors (Latham and Wexley, 1981). Halo error occurs when the rating is based upon someone else's opinion or when one dimension of the employee's performance is generalized to all other dimensions. Leniency occurs when the employee's performance is judged as being high along all dimensions or low along all dimensions when in fact the employee's performance is high on some dimensions and low on others.

The job description, minimum qualifications, and applicant for this job were the same as those used for the previous videotape. This time, however, the interviewer was different. He was the person who came in at the end of the second videotape and heard the previous interviewer rave about how good the applicant was. The new interviewer was impressed with the applicant even though the applicant did not have the necessary qualifications for the job. After watching the videotape, the subjects used the same nine-point scale to indicate whether they would hire the applicant and whether they thought that the interviewer would hire the applicant.

Once the ratings had been made and the subjects had explained the reasons for their ratings, the trainer explained to the subjects that a high rating indicated they had fallen victim to halo error and leniency error, like the interviewer in the videotape. Both of these terms were then defined by the trainer. Following this, the trainer asked for and discussed examples of halo error and leniency error back on the job. These are some examples of halo and leniency error that were discussed:

. The department is doing poorly because of a lack of supervision. The supervisor gives all the employees low ratings so he won't look bad to his boss.

. The engineer is good in the technical matters of the job and is therefore rated high on his managerial responsibilities.

. The rater fails to spend time with the ratee and makes his ratings on the basis of what he hears from others.

Finally, the trainer and the subjects generated a list of ways to eliminate these errors.
Some of these methods are presented here:

- Performance is multidimensional and it is possible for an employee to be high on one dimension and low on another.
- The rater should make his own rating before listening to others that have evaluated the employee.
- The rater should keep notes on what the scale values of the rating scale mean to him.

Observational training. The materials used to conduct this training program were identical to those used in the rater error training program. The same videotapes were shown and in the same order. In addition, the job descriptions, minimum qualifications, and the nine point rating scale remained the same. The steps taken to present the materials were also the same for each exercise. First, the subjects read the job description and minimum qualifications. Second, each subject rated the applicant, presented his ratings to the trainer and the rest of the group, and gave his reasons for making these ratings. Third, the trainer then gave the participants feedback on the accuracy of their ratings and discussed with the subjects why high or low ratings were given. Fourth, the trainer elicited similar examples that the subjects encountered back on their jobs and discussed ways to make more accurate ratings.

In the first exercise, subjects were encouraged to look for behaviors rather than traits when observing performance on the job. A behavior was defined to the subjects as an observable activity that the employee engages in while working at the job. Traits were defined as those personal characteristics of the employee that may be, but usually are not, related to job performance. The trainer told the subjects, and their discussion of the ratings they gave verified this, that low ratings of the applicant usually indicated that the subject was looking at traits of the applicant (e.g., she was clumsy, awkward, uncertain, etc.). A high rating for the applicant, which she merited, indicated that the subject was focusing on behaviors (e.g., she prepared monthly statements and income tax returns for her husband and brother).

The subjects were also asked to provide examples of where traits rather than behaviors were rated back on the job. Typical traits described were: attitude, appearance, and intelligence. A number of critical behaviors back on the job were also brought forth. For instance, the trainer asked the subjects to define a "bad attitude." Responses varied from refusal to carry out an order to high absenteeism. The trainer emphasized that the focus of performance observation should be on these specific and observable behaviors rather than a vague and unreliable category like "attitude."

Finally, the subjects and the trainer generated a list of ways to make sure that behaviors rather than traits were considered when rating employees back on the job. Typical solutions included the following:

- Review performance records before making a rating (e.g., attendance, safety, and output records).
- Develop behavioral performance standards for the job.
- Keep in mind Title VII of the 1964 Civil Rights Act.

In the second exercise, the emphasis was on making ratings based upon behaviors rather than noise factors. Noise was defined as those things that the employee says or does that have little to do with performance, good or bad, on the job.
When the subjects were asked to give the reasons for their high ratings they usually said that it was because the applicant was a "nice person", and, when pushed a little further, because the applicant and interviewer seemed to have so many things in common (e.g., lived in the same part of town). It was emphasized that these were noise factors rather than behaviors that they were paying attention to. Subjects were again asked to think back to their own jobs and come up with examples of noise factors that should not be attended to for performance rating purposes. Examples included giving someone a low rating because they had poor table manners and giving an employee a high rating because they came in early to prepare coffee for the boss.

The subjects were asked to generate a list of ways to insure that attention was given to behaviors rather than noise factors. Typical solutions included the following:

- The rater should not overemphasize or encourage discussion of personal matters on the job. Personal problems could be referred to the Employee Assistance Program.
- Performance requirements for the job should be job related and behavioral.

In the third and final exercise, raters were encouraged to consider critical rather than non-critical behaviors when observing performance. A critical behavior was defined as one that produced such good or bad results that the rater wished that every employee would do it all of the time or never do it. A noncritical behavior was defined as one that is routinely expected of and done by the employee. In the videotape, both the applicant and the interviewer exhibited a number of critical behaviors. For instance, the applicant had not completed his college degree, which was required in the minimum qualifications. On the other hand, the interviewer asked very leading questions where the answer he sought was given in the question. To the extent these critical behaviors and others were overlooked by the subjects, as witnessed by the reasons they gave for their ratings, they tended to mistakenly give high ratings to the applicant.

After this point was made by the trainer, the subjects were asked to think of examples of critical and noncritical behaviors back on the job. Typical examples are as follows:

- It is one thing for an employee to fill out a report, but it is critical when it is done accurately and in a timely manner.
- When setting up improvement plans with an employee it is critical that the supervisor follow up on these plans and provide assistance if it is needed.

The trainer also asked for and helped generate ways to focus the raters' attention on critical behaviors back on the job. Some of the solutions are presented below:

- Actively seek situations where the employee is likely to be or should be engaged in critical behaviors. This does not mean, however, that the rater should be constantly present in those situations, as this may carry a message to the employee that he is not trusted and make the employee resentful (Purcell, 1955).
- Focus on behaviors and outcomes rather than activities.
- Use the critical incident technique (Flanagan, 1954). Look for the situation, the observable activity that the employee was engaged in, and whether it had large consequences for the company, peers, the customer, etc.

Control group training. The training provided to the control groups did not present any material on rating errors or observation, nor did it make use of the videotaped performance of employees.
Instead, a general overview of the performance rating process was presented using a combination of lecture, discussion, and role playing. In the first phase of the workshop, the trainer lectured on the definition of performance ratings and presented the subjects with the results of a survey conducted by Downs and Mocinsky (1979). The survey listed the frequency of use of various performance rating systems in Fortune 500 companies. Also in this first phase, the trainer lectured on what could be rated (traits, behaviors, and results) and provided examples of each one.

In the second phase, the trainer spoke about the uses of performance ratings for the practicing supervisor. The discussion centered around merit pay, promotion, feedback, interviewing, diagnosis of performance problems, training, and coaching. Then the trainer lectured on the characteristics of a good performance appraisal. Emphasis was placed upon reliability, validity, fairness, discriminability, and practicality (Latham and Wexley, 1981). Finally, significant court cases concerning performance appraisal were presented by the instructor. In particular, factors which determined whether the court sided with the plaintiff or the defendant were reviewed, based upon an empirical examination of 66 court cases by Feild and Holley (1982). At the end of this second phase of the training, a general discussion took place concerning two performance rating forms: a graphic rating scale and an MBO-type plan used by their organization. The subjects were asked to compare and contrast each form in terms of how well it met the criteria for a good performance rating system, how well it met the criteria for a legally defensible performance system, and ways that the supervisor could use it.

In the third and final phase of the control group training, the trainer lectured on performance appraisal feedback and goal setting. In particular, emphasis was placed on various ways to conduct a feedback session and when each approach might be appropriate (Wexley, 1982). In addition, the trainer reviewed a list of critical incidents concerning effective and ineffective behaviors when giving performance feedback (Latham & Wexley, 1981) and reviewed specific techniques for setting goals (Locke, Shaw, Saari, & Latham, 1981). Finally, subjects were placed in groups of three, given a completed appraisal form, and asked to practice giving feedback and setting goals by role playing the rater and ratee depicted on the appraisal form. The trainer went from group to group and gave the subjects feedback. At the end of the role playing exercise, a general discussion took place.

It should be emphasized that both the content presented and the process used for the control group training were different from the training given to the experimental conditions. Videotapes were not used, and rating errors and observational techniques were not discussed. The decision to provide the control groups with a treatment was made to eliminate a rival explanation for the data. Had the control groups received no training, any difference in the motivation to rate accurately between the experimental conditions and the control groups could have been attributed simply to the presence or absence of training rather than to the content of the training program or the type of rating format.
By providing the control groups with training, this alternative explanation for the data was minimized.

Rating Format

Two different types of rating scales were developed for the videotaped performance of a production supervisor. A description of the videotape and the two rating scales is presented here.

Videotape description. A manufacturing exercise (Wexley & Jaffee, 1970; Wexley & Nemeroff, 1975) was videotaped (Heneman & Wexley, 1983). This 55 minute exercise required a supervisor and two subordinates to organize and run their business so as to maximize their profit. This work team purchased parts from a supplier, assembled the finished products (i.e., shipping containers) according to specifications, and sold them to a purchaser. During the exercise, the cost of parts and the prices of the finished products varied from one time period to another, thereby changing the margin of profit.

A male manager from a small manufacturing company played the role of the supervisor while two male graduate students served as subordinates. Two weeks in advance of the videotaping these actors were provided with instructions for the exercise. In addition, the graduate students were given a list of two behaviors to be exhibited at any time they felt it was appropriate during the exercise. The supervisor was given a list of 15 behaviors that could be used during the exercise and was told to use these or any other behaviors that fit his own style of supervision. He was told to distribute his behavior evenly over time during the exercise. No other special instructions were provided to any of the actors, except that they were encouraged to act as they normally would. Immediately before the taping of the session, the instructions for the exercise were reviewed and the actors practiced their assigned behaviors.

Frequency of behavior scale. Three graduate students in organizational behavior reviewed the videotape previously described and two additional videotapes of the same exercise where the supervisor worked with different subordinates. These raters recorded the critical incidents exhibited by the supervisor (Flanagan, 1954). Before making these judgments, the three raters were trained in this process. The definition of a critical incident was reviewed, examples were provided, and practice and feedback in making these judgments were given. The instructions for the manufacturing exercise were also reviewed, as were the job duties of the supervisor to be observed. The end product was a list of critical incidents for each rater based on the performance of the production supervisor in each of the three videotapes. The order of the critical incidents was also known, as the raters kept track of the time at which each one occurred.

A frequency of behavior scale was developed directly from these critical incidents and is shown in Appendix A. Items on this scale are those critical incidents where at least two of the three raters described the critical incident the same way and where at least two of the three raters agreed that it occurred at the same time(s) on the videotapes; a sketch of this retention rule follows below. Each item was anchored with a scale from 0 to 4, which represented the number of times that the item was observed. In summary, the frequency of observation rating scale consisted of 21 items describing critical incidents that could be observed on the videotapes. Fifteen of these incidents occurred at least one time on the one videotape used in this study. Each item had a rating scale ranging from 0 to 4 representing the number of times that the respondent observed the occurrence of that critical incident on the part of the supervisor.
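The item retention rule just described is mechanical enough to express in a few lines. The following Python sketch is illustrative only: the incident records and the sample data are hypothetical stand-ins, since the original judgments were matched by hand from the raters' notes.

    from itertools import combinations

    def retain(raters, description):
        """raters: one dict per rater, mapping an incident description to the
        set of times (in minutes) at which that rater saw it occur. Keep the
        incident if at least two raters recorded the same description and at
        least two of those agree on its timing."""
        holders = [r[description] for r in raters if description in r]
        if len(holders) < 2:
            return False
        return any(a == b for a, b in combinations(holders, 2))

    # Hypothetical judgments: two of the three raters agree.
    raters = [
        {"Praised subordinates for good suggestions": {12, 31}},
        {"Praised subordinates for good suggestions": {12, 31}},
        {"Encouraged the subordinates": {12}},
    ]
    print(retain(raters, "Praised subordinates for good suggestions"))  # True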
Trait rating scale. Uhrbrock (1961) developed 2000 scaled items to be used for performance ratings. These items were used to develop the trait rating scale. The author went through this list and eliminated those items that did not pertain to the videotaped performance of the production supervisor or were not described as traits. More specifically, items were eliminated for the following reasons:

- The same item appeared elsewhere on the list.
- The item was written in behavioral terms (e.g., "Generates ideas concerning new work methods").
- A knowledge of results was required (e.g., "Consistently exceeds production standards").
- The employee was compared to other employees (e.g., "Is superior to general run of employees").
- The item related to promotion rather than present performance (e.g., "Ready to be promoted at the earliest opportunity").
- Information about the employee was not available on the videotape (e.g., "Meets new people easily").
- Knowledge of the employee's life outside of work was required (e.g., "Has normal home life").
- Information about the background of the employee was required (e.g., "Has good experience for present job").

Items were also eliminated from the list when the scale values indicated low interrater agreement. Uhrbrock (1961) used 160 student and professional raters to develop scale values for each item. These raters sorted the items into eleven piles, ranging from "favorable" to "unfavorable", to form a Thurstone scale. The mean and standard deviation were reported for each item. Those items with a high standard deviation, 1.0 or greater, were treated here as having low interrater agreement and were therefore eliminated.

Using these procedures, the number of items was reduced from 2000 to 187. This new list of items was then presented to three graduate students in organizational behavior. They were asked to indicate the degree to which each of the 187 items characterized the performance of the production supervisor on the videotape. Each item had a 5 point Likert-type scale benchmarked from "strongly agree" to "strongly disagree."

Before rating the performance of the production supervisor using this scale, the instructions for the manufacturing exercise were reviewed, as were the job duties of the supervisor to be observed. In addition, the graduate students were given the rater error training described in an earlier part of this chapter. Items were retained when all three of the graduate students gave an identical rating for that item; a sketch of these two screens follows below. Twenty items met this criterion and were used for the trait rating scale shown in Appendix B. In summary, the trait rating scale was made up of 20 items that depicted traits associated with the performance of the production supervisor. Each item had a 5 point Likert-type scale with written descriptions ranging from "strongly agree" to "strongly disagree" attached to each point.
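Both screens on the Uhrbrock pool reduce to simple filters. The sketch below is a hypothetical rendering with made-up item records; it assumes, as the logic of Thurstone scaling requires, that the large-standard-deviation items are the low-agreement ones.

    def screen_scale_values(items, sd_cutoff=1.0):
        """Drop items whose Thurstone scale values showed low interrater
        agreement (standard deviation at or above the cutoff)."""
        return [name for name, sd in items if sd < sd_cutoff]

    def retain_unanimous(ratings):
        """Keep an item only when all three graduate-student raters gave it
        an identical 5 point rating of the videotaped supervisor."""
        return [name for name, rs in ratings.items() if len(set(rs)) == 1]

    # Hypothetical scale-value standard deviations and ratings.
    pool = screen_scale_values([("Is lazy.", 0.7), ("Is ambitious.", 1.4)])
    print(pool)  # ['Is lazy.']
    print(retain_unanimous({"Is lazy.": (2, 2, 2),
                            "Is proud of work.": (4, 5, 4)}))  # ['Is lazy.']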
Measures

Four sets of measures were developed for this study: rating accuracy, motivation to rate accurately, reactions, and demographic characteristics. Each one is described in turn.

Rating accuracy. A rating accuracy score was calculated for both the frequency of behavior scale and the trait rating scale using the following procedures. The Director of Staffing, who was responsible for the performance appraisal system at the organization providing the sample, nominated seven experts in the use of performance appraisal at the company. All seven experts were in the personnel department and, like the subjects, they all had subordinates reporting to them. They were unlike the subjects in that they had all received extensive training in performance appraisal prior to the experiment. These experts were blind to the experiment, but were informed that they were needed to make ratings of the videotaped performance of a production supervisor and that these ratings would be used to evaluate a workshop on performance appraisal being conducted at their organization.

Each expert received either observational training or rater error training. After receiving the training, three of these experts rated the performance of the production supervisor using the trait rating scale and four experts rated the supervisor using the frequency of behavior scale. As with subjects in the experiment, the wording of items and scale values were reviewed with them prior to observing the videotape. Unlike subjects in the experiment, however, they were asked to take careful notes as they observed the videotape.

In order to assess the adequacy of the experts' scores, an analysis of interrater agreement was conducted. Agreement was defined as the number of items where each expert's score was two scale points or less away from the other experts' scores for that item. Items that did not meet this criterion were not used in subsequent analyses (items 1, 2, 6, 12, and 16 for the frequency of observation scale and items 4, 13, 18, and 19 for the trait rating scale). As a result, there was perfect interrater agreement for each remaining item given the criterion. Moreover, a nonparametric chi-square test (Lawlis & Lu, 1972), recommended by Tinsley and Weiss (1975), revealed that the interrater agreement was greater than the agreement expected on the basis of chance for both the frequency of observation scale (χ² = 30.50, p < .001) and the trait rating scale (χ² = 11.30, p < .001).

The mean expert scores were then used as true scores to calculate rating accuracy. A formula similar to the ones used by Bernardin and Pence (1980) and Heneman and Wexley (1983) was used. The scoring formula used in the present study is presented in equation 2:

    Accuracy = (D_1 + D_2 + ... + D_N) / N        (2)

where N is the number of items and D_n is the absolute distance of the observed score from the true score on item n. Prior to making these calculations, several of the items were reverse scored.
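Equation 2 and the agreement screen can be made concrete with a small sketch. The numbers below are made up; in the study the true scores were the mean ratings of the four (behavior scale) or three (trait scale) experts on the retained items.

    from itertools import combinations

    def experts_agree(scores, max_gap=2):
        """An item is retained when every pair of expert scores for it lies
        within two scale points of each other."""
        return all(abs(a - b) <= max_gap for a, b in combinations(scores, 2))

    def accuracy(observed, true):
        """Equation 2: mean absolute distance of a subject's ratings from
        the expert true scores (lower = more accurate)."""
        return sum(abs(o - t) for o, t in zip(observed, true)) / len(true)

    expert_scores = [(3, 4, 3, 4), (0, 4, 1, 2)]      # per item, four experts
    kept = [s for s in expert_scores if experts_agree(s)]
    true_scores = [sum(s) / len(s) for s in kept]      # [3.5]
    print(accuracy([2], true_scores))                  # 1.5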
Motivation to rate accurately. The author was unable to locate a motivation to rate accurately scale in the published literature. Consequently, a new eight item measure was constructed and is presented in Appendix D. Items were worded to reflect the degree to which effort at the rating task was perceived by the subjects as leading to accurate ratings and, in turn, whether accurate ratings led to outcomes of value to the subjects. Expectancy perceptions were measured using items 1, 3, 4, 7, and 8, and instrumentality perceptions were measured using items 2, 5, and 6. Each item was anchored with a 5 point Likert-type scale and the points were benchmarked with written descriptions ranging from "strongly agree" to "strongly disagree".

Reactions. A three item measure was constructed to assess the subjects' reactions to the materials presented. This measure was developed so that it could be ascertained whether differences between conditions were due to affective reactions to the workshop rather than the treatment effects. These items are shown in Appendix E. Subjects were asked to indicate their reactions to what was presented and how it was presented using a 5 point Likert-type scale with written descriptions ranging from "poor" to "excellent" attached to each point. They were also asked to indicate whether they would recommend that other supervisors in their company attend the workshop, and responded using a 5 point Likert-type scale ranging from "strongly disagree" to "strongly agree."

Demographic characteristics. A final form was constructed to see to what extent the results were due to the demographic characteristics of the sample rather than the treatment effects. This form is shown in Appendix F. Subjects were asked to indicate their age, sex, educational level, position, the number of subordinates reporting to them, the number of subordinates they rated, whether they had received company training in performance appraisal, and their department and geographic location.

Sample

The subjects in this experiment were 87 supervisors and managers from a western utility company. They were sampled from the population of supervisors and managers for the organization using the following procedure. A cross section of the various departments (e.g., gas and electric) and geographic divisions was taken. Department heads were asked if they would be willing to let their supervisors participate in this project. If the answer was affirmative, then the supervisors in that department were asked if they would be willing to participate in this project. If the answer was yes, then they were included in the sample. In all stages of this procedure, individuals were told that the project consisted of some performance appraisal training. All those involved were "blind" to the purpose of the experiment and the experimental design being used. Subjects were eliminated from the sample if they did not have formal responsibility for the supervision of at least one employee or if they had never completed a performance appraisal form for at least one of their employees.

Sample characteristics. Demographic characteristics of the sample are summarized in this section. The sample consisted of 66 males and 19 females (there were two missing values) with a mean age of 40.73. There were 34 different job titles, with the titles Administrative Supervisor and Supervising Engineer being the most frequent. The median number of employees supervised by the subjects was 11.78 and the median number of employees rated by the subjects was 5.86. Approximately 52 percent of the subjects were from the company's headquarters and approximately 48 percent of the subjects worked in one of the company's four largest divisions. The subjects came from 32 different departments, with the most frequent representation from personnel, engineering, customer services, and gas. The modal education category was "some college, no degree" and approximately 77 percent of the subjects had received some type of performance appraisal training from the company prior to the experiment. Finally, the mean number of years of tenure with the company was 6.19.

Procedures. After the sample was selected, subjects were assigned to one of the experimental conditions or control groups. The subjects were given a list of alternative dates and locations for the training sessions.
They were asked to indicate which sessions they would be available for, and were then randomly assigned to one of the sessions that they could attend. After the six groups had been formed, the experimental conditions and control groups were randomly assigned to these groups.

Outside of differences in the content of the training presented, subjects receiving rater error training or observational training were treated in the same way. The trainer (the author of the dissertation) first introduced himself and presented the major objective of the workshop: "To provide supervisors with some modern, proven, and practical techniques to make more accurate performance ratings." In addition, the trainer defined what was meant by a performance rating, emphasizing that it referred to merit pay ratings, performance review, and promotion review at the subjects' organization.

Second, the trainer defined the term "accuracy" in the context of performance ratings, explained why accuracy is important, and explained what could be done to make more accurate ratings. It was pointed out by the trainer that accuracy was crucial for the acceptance of performance feedback and for "fair" personnel decisions. The point was also made that performance ratings could be, and had been, made more accurate in other places using training with the videotapes to be presented. The subjects were then given a brief overview of the workshop, and the trainer told them that it was important that they contribute and share their ideas and experiences with the trainer and the rest of the subjects.

Third, the trainer presented the reasons for conducting the workshop. The subjects were told that they could expect to develop some new skills in the performance rating area, that their organization was interested in learning whether this training program would increase the accuracy of their supervisors' performance ratings, and that the trainer would be using the results of this training program for his dissertation. Finally, the trainer solicited and answered any questions, issues, or concerns held by the subjects. After they had been answered, the subjects introduced themselves to the trainer and the rest of the subjects.

Essentially the same procedure was followed for the two groups receiving control group training. There were, however, three important differences. First, the objective of the workshop was "To provide supervisors with a working knowledge of the performance rating process." No mention was made of rating accuracy as the objective. Second, the subjects were informed that the workshop would consist of lecture, discussion, and role playing. The use of videotapes in performance rating training was not discussed. Finally, the trainer emphasized that the subjects would gain a better understanding of the rating process. Nothing was said about increasing their rating skills so that accuracy would increase.

After the training had been conducted, the subjects in all experimental conditions and control groups were told that the final stage of the workshop would be to have them observe the performance of a supervisor on videotape and to fill out some questionnaires concerning the performance of that supervisor and their feelings about the workshop. Next, the subjects were given a handout which explained the manufacturing exercise and the job duties of the production supervisor they were about to observe on videotape.
After reading these descriptions, the trainer answered questions concerning the exercise and the job duties of the supervisor. The trainer then put the rating scale the subjects would be using on the overhead projector. Subjects in the frequency of observation scale condition saw the frequency of observation scale and subjects in the trait rating scale condition saw the trait rating scale. Both of these scales were described in an earlier section of this chapter. In each case the trainer told the subjects that after viewing the videotape they would use this scale to rate the performance of the production supervisor, explained the scale, asked them to carefully read each item, and answered any questions concerning the wording of items or benchmark descriptions on the scale. Immediately before showing the videotape, the subjects were asked to pay careful attention to the production supervisor and not to take any notes during or after the videotape. This latter instruction was issued so that the results did not reflect note taking behaviors by the subjects rather than the training content or rating scale treatments.

Immediately following the viewing of the videotape, the trainer passed out a consent form that had been approved by the University Committee on Research Involving Human Subjects (UCRIHS) at Michigan State University. The form stated, and the trainer emphasized, that individual answers would only be seen by the trainer and that the overall results would be reported in an anonymous manner. In addition, the subjects were informed in the letter, and by the trainer, that they would receive a copy of the results of this workshop. The consent forms were then signed, dated, and given to the trainer.

After this form had been signed, the trainer passed out a copy of the rating scale. The subjects were asked to read the instructions and to make their ratings. After the ratings had been made, the subjects were instructed to turn the scale over and not to refer back to it during the remainder of the workshop. The subjects were then provided with a copy of the motivation to rate accurately form. The trainer defined rating accuracy, reviewed the instructions with the subjects, asked them to carefully read each item, and answered any questions they had concerning the wording of items or the benchmark descriptions attached to each scale point. After the subjects had filled out this form, they were instructed to turn the form over and not to refer to it during the remainder of the workshop.

The rating scale was completed prior to the motivation to rate accurately scale for two reasons. First, the hypotheses concerning the motivation to rate accurately assumed that the subjects had used one of the rating scales. Second, the motivation to rate accurately scale may have created an unwanted treatment effect. Those subjects who felt more motivated as a result of filling out the motivation to rate accurately scale may have made more accurate ratings than they would have if they had not completed this scale. In order to eliminate these alternative explanations for the data, the subjects filled out the rating scale before the motivation to rate accurately scale, and in both cases were instructed to turn these scales over and not to refer back to them once they had been completed.

After completing these two scales, two more forms were passed out to the subjects. The first one asked a series of demographic questions about the subjects.
The second one was a reaction questionnaire on which subjects indicated their reactions to the workshop. Both of these forms were described in a previous section of this chapter. Again, the trainer reviewed each form with the subjects and answered any questions. After all four forms had been completed, the subjects were asked to clip them together. The forms were then collected by the trainer, the trainer promised to provide the subjects with a copy and an explanation of the results, and the workshop was concluded.

Analysis

Internal consistency estimates of reliability were assessed using Cronbach's alpha. The hypotheses were tested using a two-way analysis of variance (ANOVA). Because cell sample sizes were neither equal nor proportional to one another, the means were not weighted by sample size (Keppel, 1973). When the F test was significant for a main effect, planned comparisons were used to identify significant differences between the means (Keppel, 1973). Statistical significance was assessed for the planned comparisons using the t statistic (Hays, 1973) and effect sizes were determined using omega squared, ω² (Hays, 1963).

CHAPTER 5

Results

In this chapter the results for the tests of the hypotheses are presented. The analysis of variance, effect size, reliability, and correlational results are reported. A description of the support or lack of support for each hypothesis is presented. In the next chapter, the results are discussed.

Cronbach's alpha is reported in Table 1 for each of the scales used in this study. It can be seen that, with the exception of the instrumentality and expectancy scales, the reliability of these scales is adequate, with coefficients ranging from .65 to .81.

Table 1. Scale Reliabilities

Scale                    Alpha Coefficient
Behavior scale                 .80
Trait scale                    .81
Motivation to rate             .65
Expectancy                     .56
Instrumentality                .41
Reactions                      .65
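As a concreteness check on Table 1, Cronbach's alpha is simple to compute from a subjects-by-items response matrix. The sketch below uses a made-up three-subject, three-item matrix; the study's actual item responses are not reproduced here.

    from statistics import variance

    def cronbach_alpha(rows):
        """rows: one list of item responses per subject."""
        k = len(rows[0])                               # number of items
        item_vars = sum(variance(col) for col in zip(*rows))
        total_var = variance([sum(r) for r in rows])   # variance of scale totals
        return (k / (k - 1)) * (1 - item_vars / total_var)

    print(round(cronbach_alpha([[4, 5, 4], [2, 3, 2], [5, 5, 4]]), 2))  # 0.98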
The means, standard deviations, and intercorrelations are presented for all of the interval-level variables and all of the subjects in Table 2. The results in this table have a number of implications for subsequent data analyses. First, the correlation between the two dependent variables, accuracy and motivation to rate, is low and nonsignificant (r = .03, n.s.). Hence, a separate analysis of variance (ANOVA) was conducted for each dependent variable rather than combining the two variables and conducting a multivariate analysis of variance (MANOVA). Second, the correlations between five demographic variables (employees supervised, employees rated, supervisory experience, age, and education) and the dependent variables (performance rating accuracy and motivation to rate accurately) were all low and nonsignificant. Therefore, these five variables were not treated as covariates in subsequent analyses. Finally, the correlation between reactions and motivation to rate was moderate and significant (r = .33, p < .001). Consequently, the reactions measure was included as a covariate in the analysis of the motivation to rate accurately.

Sex, geographic location, and company training were dropped as covariates in the analyses of the dependent measures. A series of t-tests, shown in Tables 3, 4, and 5, revealed that there were no significant differences, for either rating accuracy or motivation to rate, between males and females, corporate and division employees, and employees trained and not trained in the company's performance appraisal system.

[Tables 2 through 5 are not legible in the available copy. Table 2 reported the means, standard deviations, and intercorrelations for the interval-level variables; Tables 3, 4, and 5 reported the difference-of-means tests for rating accuracy and motivation to rate accurately by sex, geographic location, and company training.]

The means and standard deviations for the reactions measure are presented by experimental condition in Table 6. As can be seen in Table 7, the effects of rater training, rating format, and their interaction were nonsignificant.

Table 6. Reaction Means and Standard Deviations by Experimental Condition

Condition                     n      Mean     SD
Trait Scale
  Error training             10      3.83    .76
  Observation training       11      3.70    .75
  Control group              19      3.42    .75
Behavior Scale
  Error training             12      3.56    .66
  Observation training       18      3.67    .50
  Control group              17      3.65    .65

Table 7. Analysis of Variance Results for Reactions

Source                  df      MS       F      ω²
Rating format (F)        1     .00      .01    .00
Rater training (T)       2     .24      .53    .00
F x T                    2     .44      .98    .00
Subj. w. groups         81     .45

Accuracy

The first four hypotheses treated performance rating accuracy as the dependent variable. In Table 8 the means and standard deviations for this variable are shown by experimental condition.

Table 8. Rating Accuracy Means and Standard Deviations by Experimental Condition

Condition                     n      Mean(a)   SD
Trait Scale
  Error training             10       .76     .16
  Observation training       11       .82     .17
  Control group              19       .87     .21
Behavior Scale
  Error training             12      1.09     .25
  Observation training       16      1.02     .22
  Control group              17      1.02     .21

(a) The lower the score, the more accurate the rating.

The first hypothesis stated that the use of a frequency of behavior scale would produce more accurate ratings than would a trait rating scale. As shown in Table 9, this rating scale effect was significant, F(1,79) = 24.22, p < .001, and accounted for 20 percent of the rating accuracy variance (ω² = .20). However, the hypothesis was not supported, as the means were opposite to the predicted direction (i.e., traits were rated more accurately).

Table 9. Analysis of Variance Results for Performance Rating Accuracy

Source                  df      MS        F       ω²
Rating format (F)        1     .94     24.22*    .20
Rater training (T)       2     .01       .15     .00
F x T                    2     .06      1.55     .00
Subj. w. groups         79     .04

*p < .001.
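The reported effect size can be roughly verified from the F statistic alone. The sketch below uses the F-based approximation to Hays' omega squared for a fixed effect and assumes a total N of 85 (79 error df plus 6 cells); small rounding differences from the tabled value are expected.

    def omega_squared(f, df_effect, n_total):
        """Approximate omega squared for a fixed effect from its F ratio:
        df1*(F - 1) / (df1*(F - 1) + N)."""
        num = df_effect * (f - 1.0)
        return num / (num + n_total)

    print(round(omega_squared(24.22, 1, 85), 2))  # ~0.21, versus the .20 reported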
The second and third hypotheses also failed to receive support. It was predicted that rater error training and observational training would produce more accurate ratings than the control group training and, in turn, that observational training would produce more accurate ratings than rater error training. The main effect for rater training was nonsignificant, F(2,79) = .15, n.s. The fourth hypothesis, that ratings will be more accurate when the rating format and rater training are consistent with one another, was also not confirmed, as the Rating format x Rater training interaction was nonsignificant, F(2,79) = 1.55, n.s.

Motivation

The second set of hypotheses treated the motivation to rate accurately as the dependent variable. From Table 2 it can be seen that the two motivation to rate accurately subscales, instrumentality and expectancy, are significantly correlated with the motivation to rate accurately scale. In addition, as expectancy theory would predict given the training intervention in this experiment, the correlation between expectancy and the motivation to rate accurately was larger than the correlation between instrumentality and the motivation to rate accurately, and was also larger than the correlation between expectancy and instrumentality.

In Table 10 the means and standard deviations for the motivation to rate accurately are shown by experimental condition. The ANOVA and analysis of covariance (ANOCOVA) results are presented in Tables 11 and 12.

Table 10. Motivation to Rate Accurately Means and Standard Deviations by Experimental Condition

Condition                     n      Mean     SD
Trait Scale
  Error training             10      4.23    .44
  Observation training       11      4.24    .39
  Control group              19      4.19    .32
Behavior Scale
  Error training             12      4.12    .38
  Observation training       18      4.18    .40
  Control group              17      4.11    .46

Table 11. Analysis of Variance Results for Motivation to Rate Accurately

Source                  df      MS       F      ω²
Rating format (F)        1     .14      .88    .00
Rater training (T)       2     .03      .19    .00
F x T                    2     .44      .98    .00
Subj. w. groups         81     .16

Table 12. Analysis of Covariance Results for Motivation to Rate Accurately

Source                  MS      df       F
Reactions              1.42      1    10.09*
Rating format (F)       .15      1     1.05
Rater training (T)      .01      2      .09
F x T                   .01      2      .07
Residual                .14     80

*p < .002

Again, there was no support for any of the four hypotheses treating motivation to rate accurately as the dependent variable. The fifth hypothesis stated that the motivation to rate accurately would be greater when a frequency of behavior rather than a trait rating scale was used. This hypothesis was not supported, as indicated by the nonsignificant main effect for rating format, F(1,81) = .88, n.s. Even when the results were adjusted for the reactions covariate this main effect was nonsignificant, F(1,80) = 1.05, n.s.

It was predicted in the sixth and seventh hypotheses that the motivation to rate accurately would be greater for the rater error training and observational training conditions than for the control group, and that the motivation to rate accurately would be greater for observational training than for rater error training. These two hypotheses were not confirmed, as the main effect for rater training was nonsignificant in both the ANOVA, F(2,81) = .19, n.s., and ANOCOVA, F(2,80) = .09, n.s., analyses.

The Rating format x Rater training interaction was nonsignificant when the data were analyzed using ANOVA, F(2,81) = .02, n.s., and ANOCOVA, F(2,80) = .07, n.s. Thus the eighth hypothesis, that the motivation to rate accurately will be greater when the rating format and the rater training are consistent with one another, was not supported.
Accuracy and Motivation

The final hypothesis suggested that there would be a positive correlation between the motivation to rate accurately and performance rating accuracy. From Table 2 it can be seen that the correlation was positive, but very small in magnitude and nonsignificant (r = .03, n.s.).

Summary

In summary, none of the nine hypotheses were supported by the tests of these data. The null hypothesis could be rejected only for the first hypothesis, and there the effect was in a direction opposite to the direction predicted. These results and their implications are discussed in the next chapter.

CHAPTER 6

Discussion

In this final chapter a discussion is presented concerning the lack of support for the hypotheses, the theoretical and applied implications of this research, the limitations associated with this study, and the directions that future research in this area might take. The chapter is organized by the hypotheses associated with each dependent variable and ends with a set of conclusions.

Accuracy

The main effect for rating format was significant when performance rating accuracy was treated as the dependent variable. However, the direction of this relationship was opposite to the direction specified in hypothesis one. Traits rather than behaviors were rated more accurately. There are two potential sets of explanations for this finding.

First, it may be the case that raters process performance information along trait-like dimensions. As schema theories suggest, raters have preset categories to guide the observation, storage, and retrieval of stimuli (Alba & Hasher, 1983). These schema are usually global, trait-like dimensions and are often automatically used (Feldman, 1981). Consequently, it is not surprising that a trait rating scale is more accurately rated, as it more closely approximates the cognitive processes of the rater.

An equally likely set of explanations for this finding centers around some limitations associated with this study. First, the frequency of behavior scale was extremely difficult to use, perhaps more demanding than what is actually required in field settings. Rarely are supervisors called upon to report the exact number of times a subordinate exhibits various behaviors. Second, the videotape observed by the subjects was lengthy and showed a large number of critical behaviors exhibited by the supervisor. To the extent that the subjects failed to pay strict attention to the videotape, and comments made by some of the subjects to the author suggested that they found the videotape to be uninteresting, it would be extremely difficult to keep track of the frequency of critical behaviors. Even if strict attention was given to the videotape, the performance was somewhat unrealistic, as a large number of critical incidents were displayed in a compressed period of time. Finally, the videotape depicted a simulated set of work activities. This may have prevented a transfer of training from the workshop to the rating task. The subjects using the trait rating scale did not have to pay attention to and memorize the frequency of critical behaviors, and therefore they may have been more accurate. These explanations are more consistent with the 'traditional' view of cognitive processing, which suggests that the greater the demands on the memory of the rater, the less accurate the rating (Heneman & Wexley, 1983).

The finding that traits are rated more accurately than behaviors presents some interesting implications.
They must be tempered, however, by the methodological limitations just noted. In a very general sense this result supports Kavanaugh's (1971) contention that traits should not be automatically discounted as the content to be used in a performance appraisal system. He found little evidence to substantiate the claim that behaviors are superior to traits in terms of reliability and validity. Accuracy is also an important criterion in the evaluation of performance appraisal systems (Baird, 1982), and for the present sample, traits were rated more accurately than behaviors. There are, however, a number of additional criteria that must be considered when evaluating performance appraisal systems. In particular, Feild and Holley (1982) have provided evidence which suggests that traits are not defensible in a court of law, Brumback (1972) has argued that traits have little relevance, and Patten (1982) and Latham and Wexley (1981) have reviewed evidence which suggests that traits are poor for employee feedback and development purposes, and for user acceptance. Consequently, an endorsement of trait rating scales is not warranted from this study. Furthermore, the accuracy of other methods of performance appraisal (e.g., MBO and employee comparisons) relative to traits has not been investigated.

At a more theoretical level, the finding that traits are rated more accurately than behaviors has implications for future research. It suggests that schema theories may be useful in coming to a better understanding of the rating process. Raters may automatically process ratings along trait dimensions. Before a firm conclusion like this can be drawn, however, a more direct test of this hypothesis needs to be made. If this hypothesis is confirmed by future research, and to the extent that performance rating systems other than traits are to be used, then more attention must be given to devising methods to shift the rater's schema from traits to behaviors or results.

The second and third hypotheses predicted that ratings would be more accurate for those subjects receiving observational training than for those receiving rater error training and, in turn, that the accuracy of ratings in both of these conditions would be greater than in the control group. These two hypotheses were not supported, as indicated by the nonsignificant main effect for rater training. This lack of support may be due to several factors.

First, the subjects were much older and more experienced than the trainer, and there were no rewards or sanctions associated with attendance at the seminar. Consequently, there may have been a limited amount of learning for the experimental and control groups. Second, the rater error training program was of short duration and, because of this time limitation, did not include an exercise on eliminating the "contrast" effect that normally is included in the program developed by Latham et al. (1975). Evidence reviewed by Spool (1978) suggests that the training program must be of long duration for it to increase accuracy, and Latham and Wexley (1981) argue for the pervasiveness of contrast effects in appraisal judgments. Third, the content of the rater training may need to focus on the actual rating instrument and categories to be used, rather than only focusing on judgment errors or what to observe. Pulakos (1983) found that training which focused on the rating scale produced more accurate ratings than did rater error training.
In this type of training, emphasis is placed upon the transfer of an element (the rating scale) rather than principles (e.g., eliminating common rater errors) of the rating task (Royer, 1979). Perhaps this method is more effective in altering the trait oriented schema used by raters. Fourth, these data indicate that the training was not effective for a work sample test (i.e., the videotaped performance of a supervisor), but do not speak to the issue of whether the training was transferred back to the ratings made on the job. It was not possible to gather these data because of the need to have a 'true' score with which to calculate accuracy.

The transfer of training is, however, an important issue for rater training, and future researchers may wish to examine various methods to increase the transfer of learning. A number of leads have been offered in the literature, including goal setting and positive reinforcement (Anderson & Wexley, 1983; Wexley & Nemeroff, 1975), relapse prevention training where managers learn to identify and cope with situations that may eliminate the newly learned behaviors (Marx, 1982), and making the stimulus material in the training similar to the stimuli faced on the job (Wexley, in press; Wexley & Latham, 1981). In addition, Baumgartel, Sullivan, and Dunn (1978) have identified factors in the climate of the organization (e.g., growth orientation) that facilitate the transfer process. Finally, transfer may be facilitated by monetary or nonmonetary rewards (i.e., holding the person accountable for the transfer of training).

The fourth hypothesis predicted that accuracy would be greater when the rater training program and rating format were consistent with one another. This hypothesis was not confirmed, as the Rating format x Rater training interaction was nonsignificant. It may be the case here that the subjects continued to rely on trait oriented schema after all types of training. This conclusion seems reasonable given the short duration of the training programs. In addition, this result may be due to the possibility that judgment and observation in the rating task are highly intercorrelated with one another. Thus, training on judgment errors is important for an observation based rating task and training on observational skills is important for a rating task requiring judgment.

These explanations have several implications. First, cognitive processing theories may be helpful in coming to an understanding of the rating process, but may have less utility in the design of a program to increase accuracy. Second, because the judgment stage and observation stage of cognition in the rating task are interrelated, developers of rater training programs may wish to emphasize both observation and judgment skills. Finally, more emphasis may need to be placed in rater training programs on getting raters to shift from trait oriented schema to the categories used on the organization's rating scale. Alternatively, the dimensions of performance on the rating scale may need to be labeled using trait definitions. This is a common practice in another area of performance evaluation, the assessment center (Bray, Campbell, & Grant, 1974).

Motivation

The fifth through eighth hypotheses were concerned with the motivation to rate accurately. In particular, it was predicted that:

- Motivation to rate accurately will be less for raters using a trait rating scale than for those using a frequency of behavior scale.
- Raters given training that provides practice and feedback on the accuracy of their ratings will be more motivated to rate accurately than raters not receiving this practice and feedback.
- Raters given observational training will be more motivated to rate accurately than those given rater error training.
- The motivation to rate accurately will be greater when the rater training and rating format are consistent with one another.

The results did not support these hypotheses. Both main effects, rater training and rating format, and the Rater training x Rating format interaction were nonsignificant.

One likely reason for this set of findings is that raters are highly motivated to rate accurately. In the present study, this appeared to be true regardless of the type of rating format used or type of rater training given to the subjects. On a 5 point Likert-type scale, with 1 indicating low motivation and 5 representing high motivation, the mean scores for all six conditions ranged from 4.11 to 4.24, with standard deviations ranging from .32 to .46. Given the importance of performance ratings to the employer and employee, it may be the case that very little prompting is needed to get supervisors to work hard at making accurate ratings.

Another set of reasons for these results has to do with the experiment and the rating scale. First, the experimenter may have introduced unwanted demand effects. The subjects may have given high ratings because they felt that is what the experimenter and/or organization wanted from them. This explanation seems doubtful, however, as the subjects' responses were kept anonymous. Second, the word "accuracy" used in items on the scale may have been misinterpreted by the subjects as meaning how clearly the supervisor could express his opinions about the employee to be rated. It could also be argued that the wording of the items on the scale better reflected the subjects' perceived skills at making accurate ratings than their motivation to rate accurately. Finally, the items were all worded in a highly positive manner, and the subjects may have given what they perceived to be the socially desirable response.

These findings suggest that raters want to make accurate ratings regardless of the type of rating format or rater training received. Before a firm conclusion can be reached, however, more scale development is necessary. In particular, the items need to be reworded, and the definition of the construct may need to be broadened from the motivation to rate accurately to the motivation to rate. This broader definition might encompass all stages of the rating process, including feedback (i.e., the rater may be motivated to make accurate ratings, but not be motivated to feed the ratings back to the employee). Finally, this construct needs to be validated in a field setting with a minimum of demand effects.

Accuracy and Motivation

The final hypothesis stated that there would be a positive correlation between performance rating accuracy and the motivation to rate accurately. The correlation in this study was positive, but small in magnitude and nonsignificant. Given the high motivation to rate accurately values and the low variance, this result is not surprising. It does suggest, however, that the need to consider motivation in rating accuracy models is less important than the need to look at the skills and abilities of the rater to make accurate ratings.
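The nonsignificance is easy to verify by hand. The check below assumes the full analysis sample of roughly 85 subjects; with a correlation this close to zero, the exact n scarcely matters.

    from math import sqrt

    def t_for_r(r, n):
        """The usual t test for a correlation coefficient, on n - 2 df."""
        return r * sqrt(n - 2) / sqrt(1 - r * r)

    # r = .03 yields t of about .27, far below the two-tailed .05 critical
    # value of roughly 1.99 on 83 df.
    print(round(t_for_r(0.03, 85), 2))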
Conclusions

The results of this study suggest that raters cognitively process performance information using trait oriented schema. Consequently, their ratings are likely to be more accurate when a trait rating scale is used rather than a frequency of behavior scale. If a frequency of behavior scale is to be used at the same level of accuracy, it would appear that more emphasis needs to be placed in rater training on shifting the schema used by raters from traits to behaviors.

The findings in this study also suggest that raters are highly motivated to make accurate ratings. Researchers and organizations interested in the prediction, understanding, and control of performance rating accuracy may, therefore, wish to focus more attention on the skills and abilities of the rater to make accurate ratings rather than the motivation of the rater to make accurate ratings. Both sets of conclusions must, however, be treated as tentative given the methodological limitations of this study previously described.

Organizations that would like to increase the accuracy of performance ratings can address this issue in two ways on the basis of these conclusions. First, emphasis should be placed on developing the skill levels rather than the motivational levels of raters as they engage in the rating task. Second, training programs designed to increase the skill levels of raters should focus on the dimensions of performance that are to be rated. If behaviors rather than traits are to be rated, then a program of long duration may be needed to shift the schema used by raters from traits to behaviors.

Finally, this dissertation points to the limitations associated with studies of performance rating accuracy that utilize student samples. It appears that the results from studies using student samples may not generalize to working supervisors. It has been demonstrated, for example, that students make more accurate ratings using behaviors rather than traits (e.g., Fay and Latham, 1982). In the present study, however, it was shown that supervisors rate traits more accurately than behaviors. It may be the case that students are trained, through their experiences in the classroom, to process discrete units of information, whereas supervisors, in a very busy work environment, may rely upon more general, trait-like schema to process information. Consequently, more traditional theories of cognitive processing may be relevant for students, while schema theories may be more applicable for supervisors. More emphasis should be placed upon obtaining industrial samples in future performance rating accuracy research.

APPENDICES

APPENDIX A

Frequency of Behavior Scale

For each of the statements listed below, circle the number that indicates the number of times you saw Jim Bogi, production supervisor, doing the behavior described.

1. Insisted that subordinates build the product in a certain way.
   0 1 2 3 4
2. Brought problems he had working with a subordinate to the subordinate's attention.
   0 1 2 3 4
3. Made sure that subordinates knew what to do while he was away.
   0 1 2 3 4
4. Used good suggestions brought up by subordinates.
   0 1 2 3 4
5. Pitched in and helped subordinates with their work.
   0 1 2 3 4
6. Refused to listen to a subordinate's request.
   0 1 2 3 4
7. Instructed subordinates on the proper quality of the product.
   0 1 2 3 4
8. Listened patiently to a subordinate's gripes.
   0 1 2 3 4
9. Emphasized the need for faster production.
   0 1 2 3 4
Frequency of Behavior Scale (Continued)

10. Solicited subordinate's ideas and opinions on what parts to purchase and what products to sell.  0 1 2 3 4

11. Praised subordinates for good suggestions.  0 1 2 3 4

12. Kept careful track of the profit margin.  0 1 2 3 4

13. Made his supervisory duties clear to subordinates.  0 1 2 3 4

14. Planned in advance the products to be built.  0 1 2 3 4

15. Offered suggestions on the best method to build the product.  0 1 2 3 4

16. Refused to listen to subordinate's suggestions.  0 1 2 3 4

17. Recognized his own weaknesses and asked a subordinate for his assistance in these matters.  0 1 2 3 4

18. Constructively criticized a subordinate when the subordinate made an error.  0 1 2 3 4

19. Gave subordinates work assignments.  0 1 2 3 4

20. Guided subordinates on the products to be manufactured.  0 1 2 3 4

21. Solicited subordinates' opinions and ideas on what product to build.  0 1 2 3 4

APPENDIX B

Trait Rating Scale

Indicate the degree to which you agree or disagree with the following statements concerning the performance of Jim Bogi, production supervisor. Circle one number for each statement.

Response scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree

1. Is enthusiastic about job.  1 2 3 4 5
2. Has poor emotional balance.  1 2 3 4 5
3. Stalls on job.  1 2 3 4 5
4. Cannot be trusted.  1 2 3 4 5
5. Upsets morale.  1 2 3 4 5
6. Is proud of work.  1 2 3 4 5
7. Often forgets.  1 2 3 4 5
8. Is active and energetic.  1 2 3 4 5
9. Seldom sticks to business.  1 2 3 4 5
10. Is self controlled.  1 2 3 4 5
11. Is always complaining.  1 2 3 4 5
12. Is hard to get along with.  1 2 3 4 5
13. Is lazy.  1 2 3 4 5
14. Loses temper easily.  1 2 3 4 5
15. Has common sense.  1 2 3 4 5

Trait Rating Scale (Continued)

16. Worries occasionally.  1 2 3 4 5
17. Lacks initiative.  1 2 3 4 5
18. Sometimes does not fit into group.  1 2 3 4 5
19. Is pessimistic.  1 2 3 4 5
20. Is slow to adjust.  1 2 3 4 5

APPENDIX C

Overall Rating

Overall, how would you rate Jim Bogi's performance? Circle one number.

Poor (1)   Below Average (2)   Average (3)   Above Average (4)   Excellent (5)

APPENDIX D

Motivation to Rate Accurately

Indicate the degree to which you agree or disagree with each of the following statements. Circle one number for each statement.

Response scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree

1. I am able to make accurate performance ratings.  1 2 3 4 5
2. It is important to make accurate performance ratings.  1 2 3 4 5
3. The harder I work at it, the more accurate my performance ratings will be.  1 2 3 4 5
4. I know how to make more accurate performance ratings.  1 2 3 4 5
5. I am concerned about the accuracy of performance ratings.  1 2 3 4 5
6. It is possible for me to make my performance ratings more accurate.  1 2 3 4 5
7. I am interested in making more accurate performance ratings.  1 2 3 4 5
8. I am confident that I can make more accurate performance ratings.  1 2 3 4 5

APPENDIX E

Reactions

Content

Indicate your reactions to the information that was presented to you in this workshop. Circle one number and write in your comments.

Poor (1)   Below Average (2)   Average (3)   Above Average (4)   Excellent (5)

Comments:

Process

Indicate your reactions to how the information in this workshop was presented to you. Circle one number and write in your comments.

Poor (1)   Below Average (2)   Average (3)   Above Average (4)   Excellent (5)

Comments:

Reactions (Continued)

Recommendation

Indicate the degree to which you agree or disagree with the following statement.
I would recommend that other supervisors attend this workshop. Circle one number and write in your comments.

Strongly Disagree (1)   Disagree (2)   Neutral (3)   Agree (4)   Strongly Agree (5)

Comments:

APPENDIX F

Demographics

Please fill in the blank for each of the following statements.

I am ______ years old.
The title of my position is ______.
I supervise a total of ______ employees.
I make performance ratings for a total of ______ employees.
I have been a supervisor for ______ years.
I work in the ______ division.
I work in the ______ department.

Please circle one letter for each of the following statements.

1. I am: (a) Male (b) Female

2. I have been trained on how to use the performance review system. (a) True (b) False

3. My education level is:
   (a) Grade school
   (b) Some high school, no diploma
   (c) High school diploma
   (d) Some college, no degree
   (e) College degree
   (f) Some graduate school, no degree
   (g) Graduate school degree

REFERENCES

Alba, J. W., & Hasher, L. Is memory schematic? Psychological Bulletin, 1983, 93, 203-231.

Anderson, J. G., & Wexley, K. N. Application-based management development. Personnel Administrator, 1983, 28, 39-44.

Atkin, R. S., & Conlon, E. J. Behaviorally anchored rating scales: Some theoretical issues. Academy of Management Review, 1978, 3, 119-128.

Baird, L. S. Why worry about accurate measures? In L. S. Baird, R. W. Beatty, & C. E. Schneier (Eds.), The performance appraisal sourcebook. Amherst, MA: Human Resource Development Press, 1982, 12-16.

Bartlett, C. J. What's the difference between valid and invalid halo? Forced-choice measurement without forcing a choice. Journal of Applied Psychology, 1983, 68, 218-226.

Baumgartel, H., Sullivan, G. J., & Dunn, L. E. How organizational climate and personality affect the pay-off from advanced management training sessions. Kansas Business Review, 1978, 5, 1-10.

Bayroff, A. G., Haggerty, H. R., & Rundquist, E. A. Validity of ratings as related to rating techniques and conditions. Personnel Psychology, 1954, 7, 93-113.

Berman, J. S., & Kenny, D. A. Correlational bias in observer ratings. Journal of Personality and Social Psychology, 1976, 34, 263-273.

Bernardin, H. J., & Beatty, R. W. Performance appraisal: Assessing human behavior at work. Boston: Kent, 1984.

Bernardin, H. J., & Buckley, M. R. Strategies in rater training. Academy of Management Review, 1981, 6, 205-212.

Bernardin, H. J., & Cardy, R. L. Appraisal accuracy: The ability and motivation to remember the past. Public Personnel Management Journal, 1982, 11, 352-357.

Bernardin, H. J., & Pence, E. C. Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 1980, 65, 60-66.

Boice, R. Observational skills. Psychological Bulletin, 1983, 93, 3-29.

Borman, W. C. Effects of instructions to avoid halo error on reliability and validity of performance evaluation ratings. Journal of Applied Psychology, 1975, 60, 556-560.

Borman, W. C. Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 1977, 20, 238-252.

Borman, W. C. Exploring upper limits of reliability and validity in job performance ratings. Journal of Applied Psychology, 1978, 63, 135-144.

Borman, W. C. Individual differences correlates of accuracy in evaluating others' performance. Applied Psychological Measurement, 1979a, 3, 103-115.

Borman, W. C. Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 1979b, 64, 410-421.

Borman, W. C.
Implications of personality theory and research for the rating of work performance in organizations. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory. Hillsdale, NJ: Lawrence Erlbaum, 1983, 127-165.

Bray, D. W., Campbell, R. J., & Grant, D. L. Formative years in business: A long-term AT&T study of managerial lives. New York: Wiley, 1974.

Brumback, G. B. A reply to Kavanagh's "The content issue in performance appraisal: A review." Personnel Psychology, 1972, 25, 567-572.

Bureau of National Affairs. Performance appraisal programs (Personnel Policies Forum Survey No. 135). Washington, DC: The Bureau of National Affairs, Inc., 1983.

Campbell, D. T. The mutual methodological relevance of anthropology and psychology. In F. L. K. Hsu (Ed.), Psychological anthropology. Homewood, IL: Dorsey, 1961.

Carroll, S. J., & Schneier, C. E. Performance appraisal and review systems. Glenview, IL: Scott, Foresman, and Company, 1982.

Cooper, W. H. Ubiquitous halo. Psychological Bulletin, 1981, 90, 218-244.

Cronbach, L. J. Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 1955, 52, 177-193.

Cronbach, L. J., & Furby, L. How should we measure "change" -- or should we? Psychological Bulletin, 1970, 74, 68-80.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.

DeCotiis, T., & Petit, A. The performance appraisal process: A model and some testable propositions. Academy of Management Review, 1978, 3, 635-646.

Downs, C. W., & Moscinsky, P. (1979). A survey of appraisal processes and training in large corporations. Paper presented at the 39th Annual Academy of Management Meetings, Atlanta, GA.

Eder, R. W., Keaveny, T. J., McGann, A. F., & Beatty, R. W. Evaluating faculty performance: An empirical investigation of factors affecting faculty ratings and student satisfaction using alternative rating forms. Proceedings of the 38th Annual Meeting of the Academy of Management, 1978, 13, 23-27.

Fay, C. H., & Latham, G. P. Effects of training and rating scales on rating errors. Personnel Psychology, 1982, 35, 105-116.

Feldman, J. M. Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 1981, 66, 127-148.

Feild, H. S., & Holley, W. The relationship of performance appraisal system characteristics to verdicts in selected employment discrimination cases. Academy of Management Journal, 1982, 25, 392-406.

Firth, R. Elements of social organization (3rd ed.). Boston: Beacon Press, 1961.

Flanagan, J. C. A new approach to evaluating personnel. Personnel, 1949a, 26, 35-42.

Flanagan, J. C. Critical requirements: A new approach to employee evaluation. Personnel Psychology, 1949b, 2, 419-425.

Flanagan, J. C. Principles and procedures in evaluating performance. Personnel, 1952, 28, 373-386.

Flanagan, J. C. The critical incident technique. Psychological Bulletin, 1954, 51, 327-358.

Flanagan, J. C., & Burns, R. K. The employee performance record. Harvard Business Review, 1957, 35, 95-102.

Gibson, J. J. The concept of the stimulus in psychology. American Psychologist, 1960, 15, 694-703.

Gordon, M. E. The effect of the correctness of the behavior observed on the accuracy of ratings. Organizational Behavior and Human Performance, 1970, 5, 366-377.

Gordon, M. E. An examination of the relationship between the accuracy and favorability of ratings. Journal of Applied Psychology, 1972, 56, 49-53.

Hays, W. L.
Statistics for the social sciences. New York: Holt, Rinehart, and Winston, 1963.

Hays, W. L. Statistics for psychologists. New York: Holt, Rinehart, and Winston, 1967.

Heneman, R. L., & Wexley, K. N. The effects of time delay in rating and amount of information observed on performance rating accuracy. Academy of Management Journal, 1983, 26, 677-686.

Ilgen, D. R. Gender issues in performance appraisal: A discussion of O'Leary and Hansen. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory. Hillsdale, NJ: Lawrence Erlbaum, 1983.

Johnson, D. M. A systematic treatment of judgment. Psychological Bulletin, 1945, 42, 193-224.

Kavanagh, M. J. The content issue in performance appraisal: A review. Personnel Psychology, 1971, 24, 653-668.

Keeley, M. A contingency framework for performance evaluation. Academy of Management Review, 1978, 3, 428-438.

Keppel, G. Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall, 1973.

Kohler, W. Gestalt psychology. New York: Liveright, 1956.

Kraiger, K. (1983, March). Cognitive processes in rating bias. Paper presented at the I/O-OB Graduate Student Conference, Chicago, IL.

Landy, F. J., & Farr, J. L. Performance rating. Psychological Bulletin, 1980, 87, 72-107.

Landy, F. J., & Farr, J. L. The measurement of work performance. New York: Academic Press, 1983.

Latham, G. P., Wexley, K. N., & Purcell, E. D. Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 1975, 60, 550-555.

Latham, G. P., & Wexley, K. N. Increasing productivity through performance appraisal. Reading, MA: Addison-Wesley, 1981.

Lawler, E. E. Pay and organizational effectiveness. New York: McGraw-Hill, 1971.

Lawlis, G. F., & Lu, E. Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 1972, 78, 17-20.

Locke, E. A., Shaw, K. N., Saari, L. M., & Latham, G. P. Goal setting and task performance: 1969-1980. Psychological Bulletin, 1981, 90, 125-152.

Lopez, F. M. Evaluating employee performance. Chicago: Public Personnel Association, 1968.

Maier, N. F., & Thurber, J. A. Accuracy of judgments of deception when an interview is watched, heard, and read. Personnel Psychology, 1968, 21, 23-30.

Marx, R. D. Relapse prevention for managerial training: A model for maintenance of behavior change. Academy of Management Review, 1982, 7, 433-441.

McGregor, D. An uneasy look at performance appraisal. Harvard Business Review, 1957, 35, 89-94.

Mitchell, T. R. Expectancy models of job satisfaction, occupational preference and effort: A theoretical, methodological, and empirical appraisal. Psychological Bulletin, 1974, 81, 1053-1077.

Mohrman, A. M., Jr., & Lawler, E. E., III. Motivation and performance appraisal behavior. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory. Hillsdale, NJ: Lawrence Erlbaum, 1983, 173-189.

Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K. Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 1982, 67, 320-325.

Murphy, K. R., Martin, C., & Garcia, M. Do behavioral observation scales measure observation? Journal of Applied Psychology, 1982, 67, 562-567.

Nathan, B. R., & Lord, R. G. Cognitive categorization and dimensional schemata: A process approach to the study of halo in performance ratings. Journal of Applied Psychology, 1983, 68, 102-114.

Naylor, J. C. Some comments on the accuracy and the validity of a cue variable. Journal of Mathematical Psychology, 1967, 4, 154-161.

Osburn, H. G., Timmreck, C., & Bigby, D.
Effect of dimensional relevance on accuracy of simulated hiring decisions by employment interviewers. Journal of Applied Psychology, 1981, 66, 159-165.

Patten, T. H., Jr. A manager's guide to performance appraisal. New York: Free Press, 1982.

Pulakos, E. D. (1983). A comparison of two rater training programs: Error training versus accuracy training. Unpublished master's thesis, Michigan State University, East Lansing, MI.

Purcell, T. V. Observing people. Harvard Business Review, 1955, 33, 90-100.

Richards, J. M., Jr., & Cline, V. B. Accuracy components in person perception scores and the scoring system as an artifact in investigations of the generality of judging ability. Psychological Reports, 1963, 12, 363-373.

Rogosa, D., Brandt, D., & Zimowski, M. A growth curve approach to the measurement of change. Psychological Bulletin, 1982, 92, 726-748.

Royer, J. M. Theories of the transfer of training. Educational Psychologist, 1979, 14, 53-69.

Rush, M. C., Phillips, J. S., & Lord, R. G. Effects of temporal delay in rating of leader behavior descriptions: A laboratory investigation. Journal of Applied Psychology, 1981, 66, 442-450.

Smith, M. Documenting employee performance. In L. S. Baird, R. W. Beatty, & C. E. Schneier (Eds.), The performance appraisal sourcebook. Amherst, MA: Human Resource Development Press, 1982, 94-96.

Spool, M. D. Training programs for observers of behavior: A review. Personnel Psychology, 1978, 31, 853-888.

Taft, R. The ability to judge people. Psychological Bulletin, 1955, 52, 1-23.

Thorndike, R. L. Personnel selection. New York: John Wiley & Sons, 1949.

Thornton, G. C., III, & Zorich, S. Training to improve observer accuracy. Journal of Applied Psychology, 1980, 65, 351-354.

Tinsley, H. A., & Weiss, D. J. Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 1975, 22, 358-376.

Uhrbrock, R. S. 2000 scaled items. Personnel Psychology, 1961, 14, 375-420.

Vroom, V. H. Work and motivation. New York: Wiley, 1964.

Wakeley, J. H. (1964). The effects of specific training on accuracy in judging others. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Warmke, D. L. Effects of accountability procedures upon the utility of peer ratings of present performance (Doctoral dissertation, Ohio State University, 1979). Dissertation Abstracts International, 1980, 40, 4011-B. (University Microfilms No. 80-01,853)

Weick, K. E. Systematic observational methods. In G. Lindzey & E. Aronson (Eds.), The handbook of social psychology. Reading, MA: Addison-Wesley, 1968, 357-451.

Weick, K. E. The social psychology of organizing. Reading, MA: Addison-Wesley, 1979.

Wexley, K. N. Personnel training. Annual Review of Psychology, 1984, in press.

Wexley, K. N. (1982, November). The performance appraisal interview. Paper presented at the Fourth Johns Hopkins University National Symposium on Educational Research, Washington, DC.

Wexley, K. N., & Jaffee, C. L. Evaluation of the telecoaching training method. Journal of Industrial Psychology, 1970, 5, 58-62.

Wexley, K. N., & Latham, G. P. Developing and training human resources in organizations. Glenview, IL: Scott, Foresman, and Company, 1981.

Wexley, K. N., & Nemeroff, W. F. Effectiveness of positive reinforcement and goal setting as methods of management development. Journal of Applied Psychology, 1975, 60, 446-450.

Wexley, K. N., Sanders, R. E., & Yukl, G. A. Training interviewers to eliminate contrast effects in employment interviews. Journal of Applied Psychology, 1973, 57, 233-236.
Wexley, K. N., & Youtz, M. A. (1983). Rater values: Their effects on rating errors and rating accuracy. Unpublished manuscript.

Wherry, R. J. The control of bias in ratings: A theory of rating (PRB Report No. 922, Contract No. DA-99-083, OSA 69). Department of the Army, 1952.

Wherry, R. J., Sr., & Bartlett, C. J. The control of bias in ratings. Personnel Psychology, 1982, 35, 521-551.

Wiggins, J. S. Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley, 1973.

Wrightsman, L. Measurement of philosophies of human nature. Psychological Reports, 1964, 14, 743-751.

Zedeck, S., & Cascio, W. F. Performance appraisal decisions as a function of rater training and purpose of the appraisal. Journal of Applied Psychology, 1982, 67, 752-758.