This is to certify that the thesis entitled "A Comparison of Two Rater Training Programs: Error Training Versus Accuracy Training" presented by Elaine Diane Pulakos has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor

Date

A COMPARISON OF TWO RATER TRAINING PROGRAMS: ERROR TRAINING VERSUS ACCURACY TRAINING

By

Elaine Diane Pulakos

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1983

ABSTRACT

A COMPARISON OF TWO RATER TRAINING PROGRAMS: ERROR TRAINING VERSUS ACCURACY TRAINING

By Elaine Diane Pulakos

The purpose of this research was to investigate the effects of Rater Error Training (RET) and Rater Accuracy Training (RAT) on two rating errors (halo and leniency) and on the accuracy of performance evaluations. Differences in program effectiveness for various job performance dimensions were also assessed. One hundred and eight subjects were randomly assigned to 1 of 4 cells defined by the training treatments (RET, RAT, RET and RAT, no training), and raters evaluated videotaped ratees. The results showed that RAT increased accuracy and decreased leniency, while RET decreased halo but had no effect on leniency or accuracy. The combination of RET and RAT yielded less accurate ratings than RAT alone. Finally, dimension x training interactions suggested that the effectiveness of training strategies cannot be considered independent of the rating format. Implications and directions for future research are discussed.

ACKNOWLEDGMENTS

First and foremost, this project would not have been possible without my parents, who have provided me with a great deal of encouragement as well as the opportunity to pursue my goals. My sincere appreciation also goes to Neal Schmitt for his guidance and support on this particular project and in general. Next, I would like to thank my other committee members, Ken Wexley and John Wagner, for their helpful suggestions. Last, but not least, my gratitude goes to a fourth but unofficial committee member, Arnon E. Reichers, for her moral support, abstract intellectual abilities, and for teaching me a lot about writing and organizing papers. Finally, I would like to dedicate this thesis to my brother, George, who has been with me throughout this entire project. He has helped in more ways than can possibly be mentioned here and has made the frustrating times a little more tolerable.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
    The Rating Process
    The Roles of Attention, Categorization, and Recall in Performance Appraisal
        Attention
        Categorization
        Recall
        Summary
    Rater Error Training: Error versus Accuracy
        Error Training Programs
        Error versus Accuracy
    Rater Accuracy Training
        Rating Process Implications for Accuracy Training
        Design of Rating Formats for Accuracy Training
    Objectives of the Present Research
METHOD
    Subjects
    Experimental Design
    Procedure
    Videotape and Rating Scale Development
        Rating Scales
        Generating Intended "True Scores" for Performers
        Developing and Videotaping Performance
        Obtaining Final True Scores
    Manipulations
        Rater Error Training (RET)
        Rater Accuracy Training (RAT)
        Summary of RET and RAT Training Programs
        Pretesting of Training Programs
    Dependent Variables
        Accuracy
        Halo
        Leniency
    Data Analysis Procedures
RESULTS
    Relationship Between Accuracy and Rating Errors
    Training Effects on Accuracy
        Distance from True Scores
        Differential Accuracy
    Training Effects on Halo
    Training Effects on Leniency
DISCUSSION
    Limitations and Directions for Future Research
    Conclusions
APPENDICES
    A. RATING SCALES
    B. RATER ERROR TRAINING
    C. RATER ACCURACY TRAINING
REFERENCE NOTES
REFERENCES

LIST OF TABLES

1. True Scores of Performance
2. Summary of the Dependent Variables
3. Means, Standard Deviations, and Intercorrelations of Variables
4. Results of the Analysis of Variance for DIST
5. Means and Standard Deviations of DIST
6. Means and Standard Deviations of DA
7. Results of the Analysis of Variance for DA
8. Results of the Analysis of Variance for Halo
9. Means and Standard Deviations of Halo
10. Results of the Analysis of Variance for LEN
11. Means and Standard Deviations of LEN

LIST OF FIGURES

1. Feldman's Rating Process Model
2. Experimental Design
3. Mean Data (DIST) for RET x RAT Interaction
4. Mean Data (DIST) for DIM x Training Interactions
5. Mean Data (DA) for RET x RAT Interaction
6. Mean Data (DA) for DIM x Training Interactions
7. Mean Data (LEN) for DIM x Training Interaction

INTRODUCTION

The most widely used type of instrument for obtaining performance measures is the rating scale (Borman, 1979). Many of the personnel decisions made in organizations rely on the ability of supervisory ratings to discriminate good performers from poor performers. Furthermore, ratings are often the only criteria available for validating selection procedures, promoting employees, and selecting individuals for training programs. A major problem with performance ratings, however, is that they are inevitably contaminated by various rater errors which render them of questionable reliability, validity, and accuracy (Bernardin & Pence, 1980). Specifically, rater errors are faults in judgment that occur in a systematic manner when one individual evaluates another (Latham & Wexley, 1981). Some of the more commonly cited of these are halo, central tendency, leniency, and strictness (Guilford, 1954). The problems imposed by such errors have led many researchers to call for the development of rater or observer training programs to improve the quality of performance evaluations (e.g., DeCotiis & Petit, 1978; Dunnette & Borman, 1979).

Many of the rater training programs to date have been successful in reducing common rating errors such as halo and/or leniency, at least as those errors have been statistically measured (Bernardin, 1978; Bernardin & Walter, 1977; Borman, 1975; Ivancevich, 1979; Latham, Wexley, & Pursell, 1975). A common assumption among these researchers, however, is that reducing psychometric errors will also increase performance rating accuracy (i.e., that error and accuracy negatively covary). Accuracy has been defined as the degree to which ratings are relevant to or correlated with true criterion scores (Dunnette & Borman, 1979). In his review of observer training programs, Spool (1978) concluded that studies assessing training effects indicate that "accuracy in observation can be improved by training raters to minimize rating errors" (pp. 866-867).

Unfortunately, the assumption that errors and accuracy negatively covary has for the most part gone unaddressed empirically. This state of affairs is largely the result of error reduction strategies focusing on rating behavior while largely ignoring the issue of accuracy. Recent rating accuracy research, however, has raised questions regarding the prevailing error/accuracy covariation assumption (Berman & Kenny, 1977; Borman, 1975, 1979; Warmke, 1980). Specifically, the data from this research seem to suggest not only that rating accuracy is largely unaffected by training, but that there may even be a weak positive relationship between certain errors (e.g., halo) and accuracy (Cooper, 1981). These results not only run counter to a basic tenet of psychometric theory (i.e., that error produces inaccuracy), but they raise serious questions regarding the utility of most rater training efforts to date.
Although rater training programs have differed with respect to some of their key components (e.g., level of trainee participation, feedback to participants, amount of practice time allowed), a common core of virtually all training efforts has been a general concern with changing rater response distributions (Bernardin & Pence, 1980). Landy and Farr (1980) and Wherry (Note 1) have proposed a tenable hypothesis for why this focus may have little effect on improving accuracy. Specifically, these authors have suggested that concern with psychometric error distributions alone merely facilitates the learning of new response sets. The programs may thus achieve lower mean ratings (i.e., less leniency) and lower scale intercorrelations (i.e., less halo), but perhaps lower levels of accuracy as well. This reasoning was based on the possibility that skewed ratings and high dimension intercorrelations may reflect reality (Schwab, Heneman, & DeCotiis, 1975). Based upon these arguments, it seems logical that increasing accuracy may require focusing trainee attention directly on accuracy issues rather than concentrating solely on rating errors.

The purpose of the present research was to assess differences in rater training as a function of the orientation of two rater training programs. Rater error training (RET) similar to that developed by Latham et al. (1975) was compared to a type of rater accuracy training (RAT). Rather than training raters to reduce errors per se, the focus of the accuracy training program was to familiarize raters with the rating instrument and to direct their attention to the specific behaviors they would be asked to evaluate. Drawing on literature from cognitive psychology (reviewed below), it was hypothesized that directing attention to appropriate aspects of the rating task itself, and increasing rater familiarity with the instrument in a systematic way, would have the desirable effect of increasing the accuracy of observations.

In summary, the major thesis proposed here is that previous rater training efforts have erroneously focused rater attention on errors rather than on the observation of relevant ratee behaviors. It is further argued that this focus is largely a result of inattention to the cognitive processes involved in the rating task, and that this deficiency may be responsible for the error/accuracy covariation paradox. The next section presents a model which is used as a framework for a discussion of previous research.

The Rating Process

It has recently been suggested that without a better understanding of the cognitive processes involved in performance ratings and the variables influencing them, further gains in accuracy may be difficult to achieve (Cooper, 1981; Landy & Farr, 1980). While there are variations in cognitive process models of ratings, Feldman's (1981) model is both general and specific enough to be used as a basis for the present research. This model proposes that the cognitive processes involved in the rating task are a special case of a more generalized information processing model. Specifically, Feldman conceptualized the performance appraisal process as a combination of four interacting cognitive tasks (see Figure 1 for an illustration of this model). First, the rater must recognize and attend to relevant information concerning those who are being evaluated. Second, the information must be organized and stored for later access. New information must also be integrated with previously gathered data.
The third step involves recalling relevant information in an organized fashion when judgments about performance are required. Finally, the rater must be able to integrate information into some kind of summary evaluation for most appraisal tasks.

[Figure 1. Feldman's Rating Process Model: Attention, Categorization, Recall, and Ratings depicted as interacting components.]

While it may appear that these processes occur precisely in the order depicted in the model, Feldman (1981) cautions that they are interacting, dynamic, and cyclical. Thus, for example, previously formed categories may guide attention to certain stimuli while largely ignoring others, as well as form the basis for subsequent categorizations and recall. What follows is a discussion of each of the components of Feldman's (1981) model as they relate to the present research. It is eventually argued that categories in the cognitive psychology sense are similar to the dimensions of a performance appraisal instrument, and that familiarity with these dimension structures cues raters to attend to relevant information. Relevant categorization and attention to appropriate cues (behaviors) presumably facilitate recall of pertinent information and should therefore be associated with increased rater accuracy.

The following section focuses on previous research dealing with attention, categorization, and recall processes. For the sake of clarity, each of these is discussed under a separate subsection. However, it should be remembered that the relationship among these processes is interacting and cyclical, resulting in some degree of overlap among them throughout the presentation. Following this review, a summarization and critique of previous observer training efforts is undertaken, along with a discussion of the present accuracy training effort.

The Roles of Attention, Categorization, and Recall in Performance Appraisal

Attention

Individuals have a limited capacity to process the vast amount of information available at any given moment, and they must therefore be selective with respect to what is actually attended to on a conscious level (Glass, Holyoak, & Santa, 1979). There is a great deal of research, however, which indicates that the majority of everyday stimuli are processed automatically (Abelson, 1976; Langer, 1978; Nisbett & Wilson, 1977; Schneider & Shiffrin, 1977; Schank & Abelson, 1977; Shiffrin & Schneider, 1977). Race, sex, cues of dress, speech, height, attractiveness, etc., are all stimuli which can be automatically recorded (Feldman, 1981). For example, upon observing a woman, one does not typically ask, "Is that a female, and what difference does it make if she is?" One more generally recognizes sex automatically and unintentionally and thereafter reacts partially in terms of that classification.

Additional research indicates, however, that when cued, subjects can accurately recall those stimuli for which they have been prepared. For example, Averbach and Coriell (1961) conducted an experiment in which two rows of eight letters each were flashed in front of subjects for a tenth of a second. When subjects were subsequently asked to recall as many letters as possible, very few accurate recollections resulted. Subjects were then told to focus their attention on specific positions on the screen (e.g., they were told to focus on the third letter in the third row). The vast majority of the participants were able to accurately recall those stimuli to which their attention had been directed.
The results of similar research by Eriksen and Collins (1969) also showed the positive effects of directed attention in increasing recall accuracy. In a related effort, Treisman and her colleagues (Treisman & Geffen, 1967; Treisman & Riley, 1969) also investigated the effects on recall of directing subjects' attention to specific cues. In a typical experiment, these researchers simultaneously presented students with a list of digits to each ear, only one of which they were told would later have to be repeated. Occasionally, a letter was presented with the digits, and students were instructed that when they heard the letter in either ear, they should tap their desks with a ruler. If the students had been equally aware of both the "attended" and the "unattended" ear, they should have detected the letter equally often in both ears. The results showed that subjects accurately detected about 80 percent of the letters presented to the attended ear and only about 23 percent of the letters presented to the unattended ear.

Finally, Lawrence (1971) used a tachistoscope to flash a list of words at a person one at a time (at a rate of 20 words per second). When a series of words was presented in this way, he found that subjects could accurately read very few, if any, of them. However, he additionally discovered that subjects could be cued in advance to read a particular word. In Lawrence's experiment, subjects were told that one word in the series would be in all capital letters, and they were to focus on that word. The results showed that subjects were better able to identify the "target" word than when no cuing occurred. One conclusion drawn was that individuals' attention could be focused on a particular stimulus object rather than on the entire modality (i.e., everything they saw). Further, because the "target" word was defined by a discriminating feature (i.e., capital letters), participants were able to attend to it more effectively and to recall it more accurately.

Taken as a whole, this research indicates that individuals can be cued to become consciously aware of particular stimulus objects in their sensory fields, and that this increased attention to specific features facilitates recall. A potential caution in interpreting these results in light of their relevance to the present research is that these experiments involved very simple attention and recall tasks (i.e., attending to letters or words presented to subjects for short periods of time). The research proposed here attempts to build upon the theoretical conception discussed above by applying the notion of directed attention to more complex performance evaluation criteria. It is specifically argued that although previous rater training programs have cued raters to consciously attend to "relevant stimuli" (i.e., errors), this focus has been largely insufficient, especially because accuracy is the crucial criterion for judging performance rating quality and should therefore be the focus of rater training. A more complete rationale for this hypothesis is developed subsequent to the discussion of categorization and recall processes.

Categorization

Within a cognitive psychological framework, no discussion of attention per se is complete without a discussion of categorization. This is true not only because these two processes are both components of Feldman's (1981) model, but also because the two concepts are intimately dependent upon each other in actuality.
Bruner (1957, 1958) discusses this interdependence in his contention that conscious attention, hence perception, is the categorization of stimuli whereby individuals assign identity and meaning to an object. That is, individuals attend to and interpret their stimulus environment in terms of the cognitive categories most available to them. As such, hypotheses about category memberships follow from whatever categories the individual most typically uses to organize and make sense of the environment.

A category has been defined as a cognitive structure that partially consists of the representation of some defined stimulus domain. Categories can further be thought of as pyramid-like structures, organized with more general information at the top and more specific information nested within the more general groupings. The lowest level in the hierarchy consists of specific examples of category-relevant objects and events. These organizational properties represent an individual's knowledge of the way in which the world is structured. When a stimulus configuration is encountered in the environment, it is matched to some category, and the ordering of the relations among the elements in the category is imposed on the elements of the stimulus configuration (Marcus, 1977; Minsky, 1975; Tesser, 1978). This process of ordering and structuring the elements of the stimulus is important because it influences the subsequent recall of information and provides the basis for inferences and predictions (Taylor & Crocker, 1981).

While it is beyond the scope of the present proposal to review all of the relevant research involving categorization systems, it is worthwhile to note the central role that categories play in phenomena such as implicit personality theory (i.e., categorizations based on trait labels; Hastorf, Schneider, & Polefka, 1970; Lord, Binning, Rush, & Thomas, 1978) and stereotyping (categorizations based on cues such as race and sex; McArthur & Post, 1977; Taylor & Fiske, 1978). Further, Kelly's (1955) personal construct theory has delineated the sometimes profound individual differences that exist in individual category systems. For example, it has been shown that cultural factors (Triandis, 1964) and individual difference variables such as prejudice and cognitive complexity (Feldman & Hilterman, 1975) make different categories salient for different people. Additionally, situational factors (such as how often or how recently a category has been used; Wyer & Srull, 1980) affect which aspects of a given stimulus person or object will be used in categorization. Evidence supporting the notion that recall is dependent upon the category system employed by the perceiver is discussed in the following section.

Recall

When confronted with a stimulus configuration (e.g., a person, object, or situation), one could conceivably recall any of a variety of stimulus attributes. Information is easier to recall, however, if it is structured in some meaningful way. Further, there is evidence that people structure their observations so as to facilitate recall (Bousfield, 1953). Because categorizations provide a means of structuring and organizing what is observed, it has been suggested that either imposing a category system on stimulus configurations or encountering a stimulus configuration that is a good match to already established categories increases the recall of category-relevant information. This contention has been given empirical support by a number of research efforts.
For example, Taylor, Livingston, and Crocker (1982) presented graduate students in different departments with the academic folder of a hypothetical student. Subjects were later tested on recall: English students recalled more English-relevant material (e.g., English courses, languages, and writing skills), while psychology graduate students recalled more psychology-relevant information (e.g., research experience, psychology courses, and math background), even though the experimental task did not require selective use of this material. Thus, the availability of previously existing category systems seemed to influence recall of the types of information consistent with the categories already in use by the individual. In a study on occupational stereotypes (Cohen, 1977), subjects observed a videotape of a woman performing some daily (non-work) activities, having been told either that she was a waitress or a librarian. In a free recall task, subjects recalled stereotype-consistent information more accurately than irrelevant or inconsistent information. Other research has similarly demonstrated the effects of imposed categorization systems on improving recall of category-relevant information (Picek, Sherman, & Shiffrin, 1975; Potts, 1972; Sulin & Dooling, 1974; Woll & Yopp, 1978).

Recall of events and episodes has also been shown to be selectively improved by the imposition of category systems from external sources. For example, Zadny and Gerard (1974) had subjects observe a videotape of two people poking around an apartment. Some subjects were told the people were anticipating a drug bust and were looking for their dope so they could remove it. Others were told that the two were planning to rob the apartment, while a third group was told the two were waiting for a friend and had become stir crazy. The results showed that subjects remembered more features appropriate to the particular scenario they had been given. Other studies have shown that the presence of a theme predicts which specific items, in a set of information items, will later be accurately recalled (Bower, Black, & Turner, 1979; Frederiksen, 1975; Rumelhart, 1975; Thorndyke, 1977).

In sum, the research reviewed here provides strong evidence that either imposing category structure on stimulus configurations or encountering stimulus configurations that are good matches to existing categories increases overall recall, especially the recall of category-relevant information. The following section summarizes the key ideas presented regarding attention, categorization, and recall. The focus of this summary is directed at the ways in which these cognitive variables may operate to affect the decision processes involved in a performance evaluation task.

Summary

The preceding discussion indicates that the processing of information involves scanning the environment, selecting items to attend to, taking in information about those items, and storing it in some form so that it can be retrieved for later consideration. To select the information that is useful and to process it quickly and efficiently, the perceiver needs selection criteria and guidelines for processing. Personal hypotheses about how the world works (based upon individuals' categorization systems) provide such criteria by "telling" the perceiver what data to look for, how to interpret the data that are found, and what information will be stored for later recall.
A crucial question concerning the application of these ideas to person perception becomes: how do perceivers classify stimulus people into categories? The following scenario is offered in order to explain the inherent relevance of cognitive information processing to a performance evaluation task. Consider, for example, a supervisor who is asked to evaluate a sales employee in terms of interpersonal skills exhibited with customers. The supervisor must first recall events from the past which were attended to and thus incorporated into his or her "theory" of the employee in question. The previously reviewed research indicates that it should be easier to recall examples of behaviors to justify a particular interpersonal skill rating if that category (dimension) had been used by the supervisor in observing his or her personnel. However, if no such classificatory basis for identifying behavior had been used by the supervisor to begin with, the recall cue of "interpersonal skills with customers" should provide little, if any, utility in facilitating recall of employee performance on that particular dimension.

It can thus be seen how attention, categorization, and recall are intimately related. For example, without relevant categorization, attention to relevant cues may be nothing more than random or unconscious. Without meaningful categorization, recall may be impaired. Further, the importance of these processes has direct implications for how people should be trained to observe and evaluate the performance of others. Before a more complete discussion of these implications is undertaken, the next section reviews and critiques previous attempts to train individuals to conduct error-free performance assessments of others.

Rater Error Training: Error versus Accuracy

Rater Training Programs

As previously mentioned, the general assumption underlying most previous rater training programs is that certain rating distributions are ipso facto more desirable than others. For example, ratings at about the same level across dimensions and within ratees are considered an indication of halo error, and raters are encouraged to spread their ratings out across the various dimensions when evaluating others. Similarly, negatively skewed distributions are considered an indication of leniency error, and raters are encouraged to conform more closely to a normal distribution. More specifically, in a very detailed training program (Borman, 1979), ratee performances were shown on videotape and 123 student trainees rated them. Ratings were then placed on a flip chart, and rating distributions were compared and errors discussed. In a much simpler version of this same type of training, Borman (1975) defined halo error and presented a rating distribution showing the error. The training consisted of no more than a lecture advising 90 low- and middle-level managers not to cluster their ratings across dimensions. There was no videotape of performance to use as a criterion in this program. Similar training strategies focusing on the presentation of certain rating distributions as an indicator of error include those developed by Bernardin (1978), Brown (1968), Ivancevich (1979), Levine and Butler (1952), Warmke and Billings (1979), and Bernardin and Boetcher (Note 2). A major problem with this approach to training, however, is that the researchers did not seem to consider whether a skewed distribution, for example, might in reality accurately reflect the performance of certain employees.
With respect to the effectiveness of this type of training in decreasing various errors, the results are inconsistent. Rating errors have successfully been reduced using Borman's (1975) 5-minute lecture to managers, though lectures to student raters failed to produce similar results (Vance, Kuhnert, & Farr, 1978). Longer lectures have produced reductions in halo error (Bernardin, 1978; Brown, 1968) but did not improve foremen's administrative ratings (Levine & Butler, 1952). Similarly, discussion groups focusing on rater errors have proven successful in reducing leniency after 90-minute sessions (Levine & Butler, 1952) but have failed to reduce halo in 2-hour versions (Warmke & Billings, 1979). The only training method to produce consistent decreases in rater errors has been a workshop method developed by Latham et al. (1975), which provides participants with an opportunity to practice observing and rating actual videotaped ratees. This technique has been shown to sharply reduce contrast, halo, similar-to-me, and first impression errors.

Somewhat different approaches to training have similarly produced successes and failures. Bernardin and Walter (1977), for example, found that students who kept diaries of their instructor's teaching performance produced ratings with less leniency and halo than students who had not kept diaries, even though both groups received rater error training. In another study, Taylor and Hastman (1956) found that a treatment in which individual attention was given to supervisor raters during the rating task resulted in less halo.

Unfortunately, all of the studies just reviewed are plagued by one or more deficiencies limiting the usefulness of their results (Spool, 1978). First, the focus on rater behavior in terms of error has left the question of accuracy largely unaddressed. This state of affairs is the result of the general assumption that error and accuracy covary negatively and, hence, that decreasing error should logically increase accuracy. However, this assumption has been questioned by recent research (reviewed below) which indicates that error reduction does not affect accuracy in the anticipated manner. A second limitation of the previous training programs is the lack of an appropriate theoretical basis to guide rater training efforts. Some researchers (e.g., Borman, 1978; Kane & Lawler, 1978; King, Hunter, & Schmidt, 1980; Landy & Farr, 1980) have gone so far as to posit that performance appraisals and performance appraisal training programs are unlikely to improve until an adequate theoretical basis for these processes has been developed and tested.

The following section further addresses these limitations. In light of recent research which questions the prevailing error/accuracy covariation assumption, it is first argued that focusing only on error is too limited a strategy for increasing the accuracy or validity of raters' observations. Previous training efforts are then assessed in terms of Feldman's (1981) rating process model. Based upon the implications of this cognitive perspective, a rationale is developed concerning why strategies to reduce error are largely insufficient for improving accuracy. This argument rests on the fact that previous training methods have not directed attention to, and facilitated appropriate categorization and recall of, relevant employee behaviors, which are central to effective evaluation procedures (Latham & Wexley, 1981; Smith & Kendall, 1963).
Error versus Accuracy

The assumption that error and accuracy covary negatively has been questioned by four recent studies that used varying approximations to true scores as criteria and were thus able to assess rating accuracy. In the first study, Borman (1977) developed normative true scores for job performance dimensions and used Cronbach's (1955) differential accuracy score to operationalize rating accuracy (the correlation between normative true scores and subjects' ratings). Scores reflecting halo, leniency/strictness, and restriction-in-range errors were also computed. The results indicated that although accuracy was not substantially related to any of the errors (rs = .12 to .18), higher halo seemed to covary moderately with higher accuracy. The covariation between accuracy and the other two errors was unclear in that both positive and negative correlations resulted across different jobs. In a second study, which was an extension of the first, Borman (1979) used a variety of rating formats and training procedures to evaluate their effects on halo error and rating accuracy. The results of this research showed that training significantly reduced halo but did not improve accuracy. Two other studies (Berman & Kenny, 1977; Warmke, 1980) similarly revealed a relatively low relationship between halo and accuracy, and equally unclear results concerning the direction of their covariation.

Although these four studies were not primarily designed to assess error/accuracy relations, taken together they suggest a paradox, at least with respect to halo and accuracy. Cooper (1981) has estimated the halo/accuracy relationship by summarizing the data from the four studies just presented. Halo and accuracy were shown to share a median of 8 percent of the variance (a correlation of roughly .28), but the direction was opposite to the prevailing negative covariation assumption (i.e., higher halo and higher accuracy modestly covaried). Research investigating the covariation between accuracy and other rating errors is so limited that conclusions must, as yet, remain speculative. However, one conclusion that can be drawn is that additional research investigating the relationship between error and accuracy is clearly warranted.

To summarize, the research reviewed here suggests that the basic assumption underlying the development of rater training programs to date (that decreasing error will increase accuracy) is questionable. Further, the failure to investigate rating accuracy or validity in program evaluation is unfortunate because accuracy is the crucial criterion for judging the quality of performance ratings. More critical, however, is the possibility that error reduction training does not significantly increase accuracy, leaving the utility of previous rater training efforts seriously in doubt. A plausible explanation as to why strategies to reduce error may not increase accuracy is suggested by Feldman's (1981) rating process model. First, it seems that previous training programs have directed trainees' attention away from the observation of relevant employee behaviors and toward monitoring their own rating behavior in terms of "errors." This seems especially true for those programs which used drawings of rating distributions on flip charts as their focal training tool. Further, error reduction training has not provided raters with an appropriate schema for observing and interpreting behavior; hence, it has done nothing to facilitate accurate recall of raters' observations.
The following section further draws upon the rating process model and its implications for the development of an approach to rater accuracy training.

Rater Accuracy Training

Rating Process Implications for Accuracy Training

Feldman's (1981) rating process model provides a useful theoretical basis from which rater accuracy training can be developed. This model states that the interdependent processes of attention, categorization, and recall play a vital role in performance evaluation. Further, although there are sometimes profound individual differences in the stimuli attended to, the way they are categorized, and what information is recalled, the previously reviewed research on attention, categorization, and recall indicates that these processes can be externally influenced. For example, it has been shown that people can be cued to become consciously aware of certain stimuli in their sensory field, and that this attention increases individuals' ability to accurately recall the information attended to (Averbach & Coriell, 1961; Eriksen & Collins, 1969; Lawrence, 1971; Treisman & Geffen, 1967; Treisman & Riley, 1969). It has also been shown that individuals' category systems direct their attention to particular stimuli and provide the basis for interpreting them (Marcus, 1977; Minsky, 1975; Tesser, 1978). Finally, it has been shown that meaningful category systems can be imposed on people from external sources, and that these facilitate the recall of category-relevant information (e.g., Bower, Black, & Turner, 1979; Potts, 1972; Zadny & Gerard, 1974).

According to this view of the rating process and the research supporting it, certain implications for what kinds of training might increase the accuracy of performance ratings are suggested. First, training focused on standardizing the behaviors attended to or consciously looked for would be important. Second, the model implies the importance of teaching raters a common way of defining, organizing, interpreting, and hence recalling the relevant behaviors that are observed (e.g., a common frame of reference for categorizing different job behaviors and their effectiveness levels should be provided to raters). The model implies, then, that in order to increase the accuracy of performance ratings, rather than (or possibly in addition to) focusing on errors, attention should be focused on training in behavior observation and on creating or imposing an organizing schema to facilitate the storage and recall of relevant observations.

It is interesting to note that these implications are consistent with the contention of Landy and Farr (1980) that raters should develop common frames of reference for rating job performance. These authors further state that rating quality should be improved if appraisers carefully attend to the performance requirements of the job when rating others. Preliminary support for this notion can be found in the industrial/organizational psychology literature, which shows that the use of particular job-relevant categories influences the quality of the interview decisions that are made. Specifically, Langdale and Weitz (1973) and Weiner and Schneiderman (1974) found that when available to interviewers, job information was more readily used in their decisions, and that it served to decrease the effects of irrelevant information for both experienced and inexperienced interviewers.
Thus, being more familiar with the requirements of the job seemed to help focus the interviewers' attention on those applicant qualifications which were more relevant to the person-job fit (Schmitt, 1976). While this conclusion is supported by limited research involving organizational decision processes, results of research from other literatures focusing on the training of behavior observers have generally supported the promise of this approach (Jecker, Maccoby, & Breitrose, 1965; Wahler & Leske, 1973).

Design of Rating Formats for Accuracy Training

Performance evaluation processes which capitalize upon the major elements of Feldman's model and the suggestions of Landy and Farr (1980) are not entirely unrepresented in the fields of industrial psychology and organizational behavior. It must be noted, however, that the originators of these few approaches have not consciously acknowledged their theoretical consistency with the cognitive psychology area in general. Nonetheless, Behavioral Observation Scales (BOS; Latham & Wexley, 1977, 1981) and Behaviorally Anchored Rating Scales (BARS; Smith & Kendall, 1963) seem to be constructed and used in a manner which is consistent with the above implications in many ways. First, the specificity of the behavioral examples in these instruments could be used to cue raters' attention to the relevant performance requirements of the job. Second, the dimensionality of these types of instruments seems analogous to the structure of cognitive categories. Specifically, BARS and BOS are characterized by several job performance dimensions, each of which is further defined by examples of specific employee behaviors and the degree to which these are effective or ineffective. Hence, on a lower level of abstraction, the organization inherent in personal category systems is replicated in these instruments: the general performance dimensions are similar to broad cognitive category domains, and the employee behaviors (which may serve to facilitate the development of dimensional prototypes of effective and ineffective employees) represent more specific information comprising these "categories."

It seems that these instruments would act to impose a common schema or categorization system on raters whereby relevant employee behaviors could be similarly defined, organized, interpreted, and hence accurately recalled. However, the results of many format comparison studies have not shown that any one type of scale is best. For example, although the BARS format is an elegant strategy for developing performance rating scales, little if any psychometric superiority has been evidenced by this approach over others (Bernardin, 1977; Dunnette & Borman, 1978; Schwab, Heneman, & DeCotiis, 1977). In fact, certain types of scales have outperformed BARS at times (Bernardin, Alvares, & Cranny, 1976; DeCotiis, 1977), although this could be due to variation in scale development and scoring procedures not entirely consistent with the original BARS methodology (Bernardin et al., 1977; Borman, 1979). Comparative studies involving BOS are too limited at this time to warrant any conclusions regarding their superiority (or lack thereof) over other formats. In sum, no clear-cut advantage has been found for any one performance rating format.
A potential reason why such behavioral formats have not generally been shown to be superior is that merely instructing people to use a certain format (category system) may not be sufficient to actually impose that category structure on their thinking. The typical practice in an organization that is developing a new performance appraisal system is to include a small subsample of individuals familiar with a job who then aid in developing the performance appraisal dimensions and behaviors. The participation of these individuals could be expected to facilitate their acceptance and use of the category system they mutually conceive of as correct. However, the majority of people who would then be asked to use the new format, but who did not participate in its development, might be less accepting of the new category system. This relative lack of acceptance may be due to simple unfamiliarity with the category system (the general dimensions of job performance) and/or a lack of awareness of the relevant behaviors that attend upon it. Previous research has shown that people do tend to use category systems that are familiar to them (Wyer & Srull, 1980).

Thus, although BARS and BOS formats provide raters with the means to facilitate rating accuracy, persistence in the use of previously learned category systems may reflect a lack of awareness and/or a lack of the requisite motivation to take advantage of the new formats. Indeed, the majority of raters are most likely unaware (on a conscious level) of the category systems they use to evaluate others. Further, if this awareness does exist but raters are not convinced that their personal, familiar categorization processes are inadequate, there is little reason to expect that they will embrace a newly imposed system. Recall, however, that previous laboratory research has shown that individuals are willing to attend to stimuli to which experimenters have directed their attention, and that recall can be stimulated through the use of an imposed category system. It seems reasonable to expect that individuals may be more willing to accept an imposed category system in a laboratory setting than in a field setting. If, however, it can first be shown that subjects can be successfully trained to increase the accuracy of performance appraisals using BARS/BOS and the implications from cognitive psychology, further research in the field which focuses on the unique implementation problems of that setting can be attempted.

Based upon the arguments thus far presented, it seems reasonable that using actual behavioral instruments as a training tool, along with focusing rater attention on the particular job performance dimensions and their corresponding behavioral examples, should promote the development of appropriate category systems for observing employee behavior, provide raters with examples of what constitutes effective and ineffective behavior on each performance dimension (category), and thus facilitate accurate recall of relevant job-related evaluation criteria. In sum, this type of training would not only develop categories more in keeping with the actual job requirements, but the prototypes developed would be based strictly upon relevant employee behaviors. Thus, irrelevant characteristics such as sex, race, and attractiveness would not be included in the category attributes. It seems logical, then, that this strategy would allow behaviors to be noticed, stored, and recalled in a more useful manner.
Objectives of the Present Research

The purpose of the present research was to evaluate differences in rater training as a function of the orientation of the training program. Also of interest was the assessment of any potential differences in program effectiveness for different job performance dimensions. Specifically, Rater Error Training (RET) was compared to Rater Accuracy Training (RAT), which focused on providing raters with an appropriate categorization scheme for attending to and recalling relevant employee behaviors. Further, the present research employed a completely crossed experimental design, whereby some subjects received both forms of training, others received either error or accuracy training, and some received no training. For five behaviorally defined performance dimensions, these treatments were assessed in terms of their effects on rating errors (halo and leniency) and on accuracy in ratings of videotaped ratees.

METHOD

This section describes the subject group, research design, and procedures for conducting the experiment. Also presented are the two training programs and the development of the videotapes and rating formats.

Subjects

Participants in the study were 108 undergraduate students enrolled in an introductory industrial/organizational psychology course at a large midwestern university. The total sample consisted of 58 females and 50 males. Their mean age was 20.64 years, and approximately half (N = 57) reported having previous experience with performance appraisal (either rating the performance of others or having their performance rated). Students were randomly assigned to one of the four experimental groups described under the Experimental Design section below (n = 27 per group). Although the use of student raters raised potential concerns about generalizability to a true rater population, it has been shown that employment decisions made by students in laboratory settings are similar to those made by professional interviewers (Bernstein, Hakel, & Harlan, 1975; Schmitt, 1976). As well as adding credence to the use of college students as raters, this finding suggests that low generalizability may not be a particularly salient problem in the present study.

Experimental Design

A 2 x 2 completely crossed factorial design was used in the present research. The first factor, RET, consisted of two conditions: those who received error training and those who did not. The second factor, RAT, similarly consisted of two conditions: those who did and did not participate in the accuracy training. Those subjects who received both RET and RAT did not, however, participate in the complete version of each training program (described below). This was not possible for both practical and theoretical reasons. First, separate presentations of RET and RAT would have necessitated three hours of training, thereby doubling the duration of the RET/RAT group's training time. Second, this procedure would also have provided students in the RET/RAT condition with twice as much practice using the rating scales and becoming familiar with their behavioral examples and definitions. If, then, the results revealed that the RET/RAT appraisals were more accurate and/or contained less error than those of the other conditions, the question of whether the results were due to the need for both types of training, or whether they were merely a function of increased laboratory time and/or practice, would have remained.
In order to prevent such problems with subsequent interpretation of the data and to ensure equivalence of the training treatments, the RET/RAT program was limited to a one and one-half hour session. This was accomplished by giving students feedback on the accuracy of their ratings as well as by discussing various rating errors and how they might be alleviated. Hence, students in the RET/RAT condition were trained by incorporating the major elements of each individual program without, however, requiring an increase in total training time or additional practice with the instruments. In summary, the following experimental conditions were compared in the present study: (1) RET and RAT; (2) RET only; (3) RAT only; and (4) No Training (see Figure 2 for a diagram of the research design).

Procedure

Two weeks before the data collection was to begin, the research project was explained to the entire class. Students were told that the study involved performance appraisals and that they would be asked to rate videotaped performances of several managers talking with a problem subordinate. Participation in the research was voluntary; however, extra credit points were given to those individuals who agreed to be involved in the study. Subjects placed in training treatments attended their respective programs within the next two weeks. In order to keep group sizes manageable, 12 to 15 students participated in each session. Immediately following the training program(s), subjects observed and rated the videotaped managers. Those subjects in the No Training condition were asked only to observe the videotapes and make their ratings following each manager's performance. After the experiment was completed and the results analyzed, the subjects were fully debriefed.

                  RAT         No RAT
    RET           Group 1     Group 2
    No RET        Group 3     Group 4

    Figure 2. Experimental Design

Videotape and Rating Scale Development

This section presents the procedures, as described by Borman (1977), for developing the videotapes and rating scales used in the present research.

Rating Scales

Performance rating scales for a manager talking with a problem subordinate were developed using behavior scaling methodology (Smith & Kendall, 1963; Dunnette, 1966). Seven-point rating scales were used to represent the following seven dimensions of the manager's job:

1. Structure and control of the interview.
2. Reacting to stress.
3. Obtaining information.
4. Resolving conflict.
5. Developing the subordinate.
6. Establishing and maintaining rapport.
7. Motivating the subordinate.

Each dimension was defined both by an overall defining statement and by scaled behavioral anchors describing the seven different effectiveness levels (see Appendix A for these scales).

Generating Intended "True Scores" for Performers

To make the performances as realistic as possible, "intended true scores" with a preset covariance structure were generated. First, two realistic covariance matrices were formed by asking experts to estimate the "true" means and standard deviations of performance on each dimension and the "true" intercorrelations among dimensions. Profiles reflecting the "correct" covariance structure were then generated for eight ratee performances. More specifically, five expert judges knowledgeable about the job and the concept of correlation were asked to independently estimate the level of correlation expected between each pair of dimensions when the job is actually being performed.
To accomplish this, they used a 1-to-7 scale, where 7 indicated r = 1.00; 6, r = .67; 5, r = .33; 4, r = .00; 3, r = -.33; 2, r = -.67; and 1, r = -1.00. A descriptive estimate of the reliability associated with these judgments was obtained by using an ANOVA procedure to compare the variability in different judges' ratings of the same dimension pairs with the total variance in the judgments. The resulting intraclass correlation for these judgments was .81 (p < .01), suggesting acceptable reliability for the judgment task. Mean ratings (on the 1-to-7 scale) were computed for each dimension pair, and these means were transformed directly to correlation coefficients (e.g., 4.5 was transformed to +.17). Following a procedure outlined by Naylor and Wherry (1965), the resulting correlations, along with dimension means of 4.0 and standard deviations of 1.5, were then used to generate an intended true score matrix for ratees. As an example, presented below are intended performance profiles for two managers:

    Performance Dimension                        Profile 1   Profile 2
    Structure and Control of the Interview         6.0         2.0
    Reacting to Stress                             5.0         3.5
    Obtaining Information                          6.0         2.5
    Resolving Conflict                             6.0         5.0
    Developing the Subordinate                     3.5         3.5
    Establishing and Maintaining Rapport           4.5         6.0
    Motivating the Subordinate                     5.0         2.5

The procedures outlined above thus enabled the development of realistic multidimensional performance profiles for eight individuals on the manager's job.
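To make the generation step concrete, the sketch below shows one way the transformation and profile generation just described could be implemented. It is a minimal illustration under stated assumptions, not the study's actual procedure: the single mean judgment value of 5.0 is hypothetical (the study used the five experts' judgments for each dimension pair), and a standard multivariate-normal draw via a Cholesky factor stands in for the specific Naylor and Wherry (1965) technique, which the thesis does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

DIMENSIONS = [
    "Structure and control of the interview", "Reacting to stress",
    "Obtaining information", "Resolving conflict",
    "Developing the subordinate", "Establishing and maintaining rapport",
    "Motivating the subordinate",
]
N_DIMS, N_RATEES = len(DIMENSIONS), 8

def judgment_to_r(s):
    """Map a mean judgment on the 1-to-7 scale to a correlation:
    7 -> 1.00, 6 -> .67, 5 -> .33, 4 -> .00, ..., 1 -> -1.00,
    i.e., r = (s - 4) / 3, so a mean judgment of 4.5 becomes +.17."""
    return (s - 4.0) / 3.0

# Hypothetical stand-in for the judges' data: assume every dimension pair
# received a mean judgment of 5.0 (r = .33); equal off-diagonal values
# keep the resulting correlation matrix positive definite.
R = np.full((N_DIMS, N_DIMS), judgment_to_r(5.0))
np.fill_diagonal(R, 1.0)

MEAN, SD = 4.0, 1.5          # target dimension means and SDs from the text
cov = (SD ** 2) * R          # with equal SDs, cov_ij = r_ij * SD^2

# Generate ratee profiles with the preset covariance structure via a
# Cholesky factor (a multivariate-normal draw stands in for the exact
# Naylor & Wherry generation procedure).
L = np.linalg.cholesky(cov)
profiles = MEAN + L @ rng.standard_normal((N_DIMS, N_RATEES))
profiles = np.clip(profiles, 1.0, 7.0)   # keep scores on the 1-to-7 scale

for name, row in zip(DIMENSIONS, np.round(profiles, 1)):
    print(f"{name:42s} {row}")
```

Averaging over many such draws would reproduce the intended means, standard deviations, and intercorrelations exactly; any single set of eight profiles, like the study's, only approximates them.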
Developing and Videotaping Performance

Eight scripts were written depicting 5- to 9-minute performances of a manager talking with a problem subordinate. The scripts reflected the performance levels defined by the intended true scores as closely as possible. Eight different actors played the various manager roles while the same actor played the problem subordinate in all eight performances. Each actor was given explicit instruction and ample preparation time to insure close conformance to the scripts during the videotaping.

Obtaining Final True Scores

Fourteen expert raters were selected to evaluate the effectiveness of each performer. Seven of the raters were graduate students in psychology, and the other seven were practicing industrial psychologists working either for a psychological consulting firm or in the personnel research department of a large manufacturing company. All of the raters were very familiar with the performance demands of the job. The scripts were revised as necessary to reflect the verbal behavior actually depicted in the performances, and raters were asked to study these scripts and the rating scales before coming to the rating sessions. Experts' ratings were analyzed using an indirect validation approach. Interrater agreement among the 14 experts was computed for each dimension using intraclass correlations. The resulting eight intraclass correlations ranged from .91 to .98 with a median of .97. Further, correlations between mean expert ratings and intended true scores were all above .70, with a median r = .93. These results indicated considerable agreement between the expert judges and intended true scores. The high interrater agreement obtained for each dimension suggested that the few times that the mean expert ratings did differ somewhat from the intended true scores, the discrepancies were most likely due to the scripts reflecting unintended levels of performance and/or the actors failing to project the intended effectiveness levels. The mean expert ratings (see Table 1) were therefore adopted as the "true scores" for subsequent uses of the tapes.

Table 1. True Scores of Performance

                                           Manager
    Dimension                       1     2     3     4     5     6     7     8
    Structure and Control of
      the Interview               2.79  2.77  6.92  2.07  3.31  4.54  4.38   .08
    Establishing and
      Maintaining Rapport         1.50  5.93  3.26  5.00  3.69  5.23  3.08   .38
    Reacting to Stress            3.57  5.00  5.38  4.29  4.46  4.92  5.15   .85
    Obtaining Information         2.36  4.21  6.15  3.43  1.77  5.69  2.69   .54
    Resolving Conflict            2.07  4.07  5.62  5.00  5.69  4.31  2.85   .08
    Developing the Subordinate    2.71  3.07  3.38  2.93  6.08  6.62  4.54   .38
    Motivating the Subordinate    2.29  4.86  4.62  3.71  5.77  6.15  2.77   .08

Manipulations

Rater Error Training (RET)

Latham et al. (1975) developed a training procedure to help managers become aware of problems in rating employee performance and to reduce various rating errors. The major elements of the Latham et al. workshop training procedure were used to train student raters to provide more error-free performance assessments. The core characteristics of the method include the following:

1. A videotape of a job being performed is first shown to participants.
2. Trainees then evaluate the designated ratee on the videotape using rating scales as provided.
3. Ratings made by participants are placed on a flipchart.
4. Differences between the ratings and reasons for the differences are discussed by trainees.
5. The trainer discusses rating errors made by raters and how they can be avoided.
6. The group then discusses ways of avoiding or overcoming the error being studied.

This general strategy was followed for the present rater error training. Specifically, subjects were shown two of Borman's eight videotapes during the training, and they were asked to evaluate each manager's performance using the rating scales that appear in Appendix A. Because two of the tapes were used as part of the training program, criterion ratings were obtained only on the remaining six videotapes. Subsequent to rating each manager, the trainer discussed subjects' ratings in terms of rating errors such as halo, leniency, central tendency, and contrast effect (see Appendix B for a detailed description of RET). The emphasis of this training was thus focused on producing error-free performance ratings.

Rater Accuracy Training (RAT)

Based on the implications of Feldman's (1981) rating process model as discussed in the introduction, RAT focused on facilitating the development of a common categorization system based upon important job dimensions (which are further defined by specific behaviors) for observing ratee performance. Specifically, those who received the RAT program were first lectured on the multidimensionality of most types of jobs and the need to pay close attention to employee performance in terms of these dimensions. Participants were then given the actual scales they would be using to rate the managers. After discussing the general definitions of each dimension and the behavioral anchors that corresponded to different effectiveness levels, subjects practiced using the rating scales by rating the same two videotaped managers that were used in RET. After the group rated each of the tapes, they discussed their ratings and received feedback on their accuracy. This exercise served to increase the group's attention to the performance dimensions they used to evaluate the managers, and it also served to illustrate various effectiveness levels within each category (see Appendix C for a detailed description of RAT).
In sum, by using the rating instrument itself as a training tool along with focusing rater attention on the particular job performance dimensions and their corresponding levels of effectiveness, the development of appropriate category systems for observing ratee behavior was promoted. Further, this strategy was expected to provide raters with behavioral examples of what constituted effective and ineffective behavior on each performance dimension (category). This development of categories based on actual job requirements was, in turn, hypothesized to facilitate more accurate recall and evaluation of relevant performance criteria.

Summary of RET and RAT Training Programs

Based upon the previous discussion of RET and RAT, it can be seen that both training programs were designed to elicit active trainee participation and to provide raters with practice and feedback on their judgments. Also, both training programs used the same two videotapes and the actual rating scales to train participants. Further, in order to control for variance due to differences in the amount of actual training time, the programs were each developed to last approximately one and one-half hours. In summary, RET and RAT were identical with respect to their training components (i.e., practice and feedback), training tools, and duration. Hence, any differences between the experimental groups could more confidently be attributed to the focus of the training itself (error or accuracy), rather than to differences in variables extraneous to the present research question (e.g., training time, components of training, etc.).

Pretesting of Training Programs

Prior to the experimental treatments, the training programs were each pretested with two groups of 10 to 15 students. These pretests were performed to provide the trainer with practice conducting the sessions and also to discover any potential problems with the programs so that modifications could be made prior to the actual research. While the original conceptualization of RET required no major modifications, the RAT pretests revealed the need for various changes. Based upon interviews with pretest subjects as well as the results of preliminary data analyses, it was evident that subjects did not have enough training time to assimilate the amount of information associated with seven performance dimensions. It was impossible to extend the training sessions because of practical limitations regarding the amount of experimental time available from subjects. Hence, two of the seven dimensions, i.e., Reacting to Stress and Obtaining Information, were deleted from the rating scales in all experimental conditions. These particular categories were excluded because subjects reported difficulty in differentiating the effectiveness levels within them. Given that clearly defined behavioral dimensions were a prerequisite to the accuracy training proposed here, the inclusion of obviously ambiguous dimensions would not have facilitated a reasonable comparison of the techniques.

Dependent Variables

Accuracy

One measure of accuracy was calculated using an approach similar to that used by Bernardin and Pence (1980) and Rush, Phillips, and Lord (1981). This measure, Distance, assessed how close the subject was to the mean true score for each of the five dimensions across ratees. The formula for calculating Distance (DIST) is presented in the equation below:

    A = \sum_{r=1}^{R} D_r / R

where A = accuracy across ratees for each dimension; R = number of ratees (6); and
D_r = the absolute difference of the observed score from the true score for ratee r. For each subject, this analysis resulted in five mean deviation scores across ratees, with lower deviations indicating higher accuracy.

Accuracy was also assessed using Cronbach's (1955) differential accuracy (DA) measure. The DA provided accuracy scores for each rater on each performance dimension by correlating the rater's ratings of the six videotaped target persons on a dimension with mean true scores provided by the expert judges. The Fisher r-to-z transformation was then applied to each DA correlation. Thus, for each subject, these analyses resulted in five z scores across ratees within dimensions (DA), with higher scores indicating higher accuracy.

Halo

Halo is conceptualized as the tendency for raters to restrict their ratings of a target person across job dimensions. Operationally, halo has been discussed in terms of standard deviations across dimensions within ratees (e.g., Borman, 1977). In addition to this measure of halo, analyses were also performed for a second halo measure. First, to test for differences in halo defined in terms of standard deviations, a standard deviation (SD) was computed for each target ratee, thereby reflecting the spread in those ratings across dimensions. A low standard deviation across dimensions indicated high halo and higher standard deviations indicated lower halo. Because of the nonnormal distribution of standard deviations, a logarithmic transformation of the variances was performed before averaging these scores. O'Brien (1978) has recently shown that tests for determining differences in variances using the logarithmic transformation were both robust and powerful.

The second measure of halo (HALOCORR) was calculated in the following way: A correlation matrix was computed between the five dimensions for each subject's ratings of the six ratees. These dimension intercorrelations were then subtracted from the true dimension intercorrelations, yielding 10 difference scores for each subject. Before subtracting the matrices, all correlations were transformed to z scores using Fisher's r-to-z transformation. The difference scores for each subject were then averaged, providing a mean measure of the difference between the true and observed intercorrelations across dimensions. To the degree that this average deviated from zero in a positive direction, the subject's ratings were less correlated than the true ratings. To the degree that this average deviated from zero in a negative direction, greater halo was evidenced.

Leniency

Leniency (LEN) was assessed for each subject by computing the mean ratings for each dimension across the six ratees. This resulted in five leniency scores for each rater. The mean true scores for each dimension were then subtracted from the observed mean ratings, with greater distance (i.e., larger positive difference scores) indicating greater leniency.
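To make the five operational definitions above concrete, here is a minimal sketch for a single rater, assuming `ratings` and `true_scores` are 6 x 5 arrays (ratees x dimensions) and `true_corr` is the 5 x 5 matrix of true dimension intercorrelations; all names are illustrative rather than taken from the original analysis:

```python
import numpy as np

def fisher_z(r):
    return np.arctanh(r)   # Fisher r-to-z transformation

def dist(ratings, true_scores):
    # DIST: mean absolute deviation from true scores across the six ratees,
    # computed separately for each dimension (lower = more accurate).
    return np.abs(ratings - true_scores).mean(axis=0)

def da(ratings, true_scores):
    # DA: for each dimension, correlate the rater's six ratings with the
    # true scores, then apply r-to-z (higher = more accurate).
    return np.array([fisher_z(np.corrcoef(ratings[:, d], true_scores[:, d])[0, 1])
                     for d in range(ratings.shape[1])])

def halo_sd(ratings):
    # SD halo: spread across dimensions within each ratee; the variances are
    # log-transformed (O'Brien, 1978) before averaging.
    return np.log(ratings.var(axis=1, ddof=1)).mean()

def halocorr(ratings, true_corr):
    # HALOCORR: mean difference between true and observed dimension
    # intercorrelations, both r-to-z transformed before subtracting.
    obs_corr = np.corrcoef(ratings, rowvar=False)
    iu = np.triu_indices_from(obs_corr, k=1)   # the 10 unique dimension pairs
    return (fisher_z(true_corr[iu]) - fisher_z(obs_corr[iu])).mean()

def leniency(ratings, true_scores):
    # LEN: observed minus true dimension means across ratees
    # (larger positive values = more lenient).
    return ratings.mean(axis=0) - true_scores.mean(axis=0)
```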
Data Analysis Procedures

For each of the two accuracy measures (i.e., DIST and DA), the experimental groups were compared with a 2 x 2 x 5 (RET x RAT x DIM) fixed-factor analysis of variance (ANOVA) with repeated measures on the dimension factor. This design enabled not only the evaluation of treatment main effects and interactions, but it also allowed the assessment of dimension effects as well as dimension x training interactions. For each of the halo measures (i.e., SD and HALOCORR), a 2 x 2 ANOVA with RET and RAT as fixed factors was performed to assess differences among the experimental groups. Finally, a 2 x 2 x 5 ANOVA with repeated measures on the last (i.e., dimension) factor was used to assess training and dimension effects and interactions for the leniency (LEN) measure. Table 2 presents a summary of the five dependent variables, how they were calculated, and the design used to analyze each.

Table 2. Summary of the Dependent Variables

    Variable    Definition                                 Design
    Accuracy
      DIST      Average distance from true scores for      2 x 2 x 5 ANOVA (RET x RAT x DIM)
                each of the five performance dimensions    with repeated measures on DIM
      DA        Correlation between the true and observed  2 x 2 x 5 ANOVA (RET x RAT x DIM)
                ratings for each of the five dimensions    with repeated measures on DIM
    Halo
      SD        Average standard deviation within ratees   2 x 2 ANOVA (RET x RAT)
      HALOCORR  Average distance between true and observed 2 x 2 ANOVA (RET x RAT)
                dimension intercorrelations
    Leniency
      LEN       Difference between true and observed means 2 x 2 x 5 ANOVA (RET x RAT x DIM)
                for each of the five dimensions            with repeated measures on DIM

RESULTS

Relationships Between Accuracy and Rating Errors

The means, standard deviations, and intercorrelations for subjects' accuracy scores and scores for the two rating errors are presented in Table 3. Also shown are the correlations between the dependent variables, sex, age, and previous experience with performance appraisals. The relationship between the two measures of accuracy (i.e., distance from true scores, DIST, and differential accuracy, DA) was substantial (r = -.79, p < .05). The negative correlation indicated that smaller absolute distances from the true scores were associated with higher correlations between dimension true scores and observed scores. There was also a relatively large amount of overlap between the two halo measures (r = -.77, p < .05). Specifically, those individuals who had larger deviations within ratees (i.e., less halo) had dimension intercorrelations that were lower than the true dimension intercorrelations, whereas those with smaller SD measures had dimension intercorrelations that were greater than the true dimension intercorrelations (i.e., higher halo). Although the correlation between the two accuracy measures and between the two halo measures was high, separate analyses were conducted on each measure so that the present analyses would be comparable with previous research.

Table 3. Means, Standard Deviations, and Intercorrelations of Variables(a)

    Variable(b)     Mean     SD     (1)    (2)    (3)    (4)    (5)    (6)    (7)
    1. DIST         1.13    .62
    2. DA           1.01    .25    -.79
    3. SD           1.05    .24     .07    .04
    4. HALOCORR      .27    .36     .24   -.15   -.77
    5. LEN           .15    .47     .27   -.07   -.15    .11
    6. SEX          1.54    .50    -.05   -.05   -.11    .04    .14
    7. AGE         20.64   2.06     .02   -.05    .09   -.13   -.01   -.13
    8. EXPER        1.53    .50     .07    .08    .09   -.04    .14    .05    .00

    (a) r > .15, p < .05.
    (b) DIST = accuracy measured as distance from true scores; DA = accuracy measured by differential accuracy; SD = halo as the standard deviation within ratees; HALOCORR = halo measured as the average difference between true and observed dimension intercorrelations; LEN = leniency measured as the average difference between observed and true means.

There was virtually no relationship between halo measured in terms of SDs and either the DIST or the DA accuracy measures. Low but statistically significant correlations resulted between the two accuracy measures and HALOCORR. Both of these correlations indicated that lower accuracy was associated with positive deviations from true dimension intercorrelations (i.e., higher halo).
While this finding may at first seem to support the notion that error and accuracy covary negatively, it must be remembered that the HALOCORR measure was based upon the true dimension intercorrelations. Hence, this particular measure of halo was not consistent with previous operationalizations of the error that did not involve the true scores (e.g., Bernardin & Pence, 1980; Borman, 1975, 1979). Leniency (LEN) was not related to the DA measure of accuracy but was significantly related to the DIST measure (r = .27, p < .05). Specifically, more accurate ratings (smaller distances from true scores) were associated with negative deviations from the true means, while leniency (positive deviations from the true means) increased with inaccuracy. Again, however, it must be noted that the leniency measures used here were based on the true means. There was no relationship between leniency and either of the halo measures.

Training Effects on Accuracy

Distance from True Scores

A 2 x 2 x 5 ANOVA (RET x RAT x DIM) with repeated measures on the last (i.e., dimension) factor was performed to assess the effects of training on the DIST measure of accuracy. This design also enabled the assessment of dimension effects as well as training x dimension interactions. The results of the ANOVA (shown in Table 4) revealed a significant main effect for RAT. Inspection of the means in Table 5 indicated that individuals who participated in RAT had significantly more accurate ratings than those who did not receive RAT. More noteworthy, however, was the significant RET x RAT interaction. Analysis of the mean data (presented in Figure 3) suggested that RAT alone produced the most accurate ratings while the no training group was least accurate. Tukey tests specifically revealed that RAT alone or RET/RAT together yielded ratings with higher accuracy than no training or RET alone. Further, there were no differences in accuracy between the no training and RET alone conditions.

A main effect for DIM and two significant training x dimension interactions were observed. The significant RAT x DIM interaction (see Figure 4) revealed differences in the effectiveness of RAT on the appraisal dimensions. With each dimension fixed, evaluations of the simple main effects for designs with repeated measures (Winer, 1971) showed RAT to significantly increase accuracy on only three (i.e., Structuring and Controlling the Interview, Resolving Conflict, and Developing the Subordinate) of the five dimensions. Further analyses for only the RAT group revealed that Structuring and Controlling the Interview (x̄ = .72) was rated more accurately than all other dimensions, while Establishing and Maintaining Rapport (x̄ = 1.24) was rated the least accurately. The same analysis conducted for the NO RAT group showed that Resolving Conflict (x̄ = 1.45) was rated less accurately than the remaining four dimensions.

Table 4. Results of the Analysis of Variance for DIST

    Effect                     df       F       ω²
    RET (A)                     1      .12
    RAT (B)                     1    52.25*    .30
    A x B                       1    10.69*    .06
    Subjects x A x B          104     (.20)
    DIM (C)                     4    12.76*    .06
    A x C                       4     4.11*    .01
    B x C                       4     7.88*    .03
    A x B x C                   4     1.82
    Subjects x A x B x C      416     (.12)

    Note. Numbers in parentheses are the mean square errors associated with the F tests directly above them in the table.
    *p < .05

Table 5. Means and Standard Deviations of DIST

    Variable      NO RET        RET          Totals
    NO RAT      1.33 (.25)   1.22 (.22)    1.27 (.24)
    RAT          .92 (.16)   1.06 (.15)     .99 (.17)
    Totals      1.13 (.29)   1.14 (.21)    1.13 (.25)

    Note. Numbers in parentheses = SDs.
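As a side note on the analysis, the between-subjects portion of Table 4 can be recovered without a full mixed-model routine: averaging each subject's DIST scores over the five dimensions and running a 2 x 2 factorial ANOVA yields the RET, RAT, and RET x RAT tests (the DIM effects would still require a repeated-measures procedure). A minimal sketch, assuming a data frame with one row per subject and hypothetical file and column names:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical layout: columns 'ret' ("RET"/"NO RET"), 'rat' ("RAT"/"NO RAT"),
# and 'dist' (the subject's DIST averaged over the five dimensions).
df = pd.read_csv("dist_scores.csv")

model = ols("dist ~ C(ret) * C(rat)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F tests for RET, RAT, and RET x RAT
```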
[Figure 3. Mean Data (DIST) for RET x RAT Interaction: the cell means of Table 5 plotted for the NO RET and RET groups across the NO RAT and RAT conditions]

[Figure 4. Mean Data (DIST) for DIM x Training Interactions. Dimensions: 1 = Structuring and Controlling the Interview; 2 = Establishing and Maintaining Rapport; 3 = Resolving Conflict; 4 = Motivating the Subordinate; 5 = Developing the Subordinate]

A significant RET x DIM interaction also resulted and is shown in Figure 4. Although an analysis of the simple main effects revealed no significant differences between the dimension means for the two RET conditions, the profiles of these means were different for the RET versus NO RET group. However, further analyses of the means within the two treatment groups did reveal differences in accuracy for particular dimensions. Specifically, for the RET group, Structuring and Controlling the Interview (x̄ = .89) was rated more accurately than the other four dimensions, and with the exception of Establishing and Maintaining Rapport (x̄ = 1.20), Resolving Conflict (x̄ = 1.32) was rated less accurately than the others. For the NO RET group, Establishing and Maintaining Rapport (x̄ = 1.29) was rated with less accuracy than all other dimensions except for Resolving Conflict (x̄ = 1.19).

Differential Accuracy

A 2 x 2 x 5 ANOVA with repeated measures on the last factor was also performed to assess the effects of training and dimensions on the transformed (r-to-z) DA correlations. Cell means and standard deviations are presented in Table 6, and Table 7 contains the results of the ANOVA as well as omega square values for the significant effects.

Table 6. Means and Standard Deviations of DA(a)

    Variable     NO RET      RET       Totals
    NO RAT        .82       1.01        .91
    RAT          1.22       1.01       1.12
    Totals       1.02       1.01       1.01
                 (.30)      (.19)      (.25)

    Note. Numbers in parentheses = SDs.
    (a) Values in the table are based on transformed r-to-z correlations.

Table 7. Results of the Analysis of Variance for DA

    Effect                     df       F       ω²
    RET (A)                     1      .06
    RAT (B)                     1    25.69*    .16
    A x B                       1    27.05*    .17
    Subjects x A x B          104     (.21)
    DIM (C)                     4    21.12*    .11
    A x C                       4     3.17*    .02
    B x C                       4    12.49*    .07
    A x B x C                   4      .67
    Subjects x A x B x C      416     (.21)

    Note. Numbers in parentheses are the mean square errors associated with the F tests directly above them in the table.
    *p < .05

A significant main effect resulted for RAT, whereby those who received training had significantly higher correlations between true and observed dimension scores than those who did not receive accuracy training. A significant RET x RAT interaction also resulted for the DA measure (see Figure 5).

[Figure 5. Mean Data (DA) for RET x RAT Interaction: the cell means of Table 6 plotted for the NO RET and RET groups across the NO RAT and RAT conditions]

The nature of this interaction was somewhat different than the relationship between RET and RAT that resulted with the DIST measure of accuracy. Specifically, Tukey tests revealed that RAT alone was better than any other condition in improving the correlations between observed and true dimension scores. There was no difference in DA correlations for the RET alone versus the RET/RAT condition. However, both RET alone and RET/RAT together produced more accuracy than the no training condition. Interestingly, accuracy was significantly decreased when RET was combined with RAT as compared to the RAT alone condition.
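The Tukey comparisons among the four training cells could be reproduced along the following lines; because Tukey's HSD operates on a single grouping factor, the sketch collapses the 2 x 2 treatments into one four-level cell label (file and column names are again hypothetical):

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("da_scores.csv")         # one r-to-z DA score per subject
df["cell"] = df["ret"] + "/" + df["rat"]  # e.g., "RET/NO RAT"

# Pairwise Tukey HSD tests among the four training conditions.
print(pairwise_tukeyhsd(endog=df["da"], groups=df["cell"], alpha=0.05))
```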
The results of the within-subject dimension analysis were essentially the same as those found with DIST. A significant main effect for DIM and significant RAT x DIM and RET x DIM interactions resulted (see Figure 6). Analysis of simple main effects indicated that the nature of these interactions was similar to those found with DIST. Specifically, RAT increased accuracy on only three of the five rating dimensions. Within the RAT treatment, Structuring and Controlling the Interview (x̄ = 1.55) was rated more accurately than all other dimensions. The accuracy associated with Developing the Subordinate (x̄ = 1.30) was also relatively high, as the ratings on this dimension were closer to true scores than on the remaining three. Finally, Motivating the Subordinate (x̄ = 1.05) was rated more accurately than Establishing and Maintaining Rapport (x̄ = .71) but not more accurately than Resolving Conflict (x̄ = .64). Within the NO RAT group, a significant difference resulted for only one dimension. Specifically, Resolving Conflict (x̄ = .64) was rated with less accuracy than the other four dimensions.

[Figure 6. Mean Data (DA) for DIM x Training Interactions. Dimensions: 1 = Structuring and Controlling the Interview; 2 = Establishing and Maintaining Rapport; 3 = Resolving Conflict; 4 = Motivating the Subordinate; 5 = Developing the Subordinate]

With respect to the RET x DIM interaction, although the simple main effects analyses were again nonsignificant, different profiles of accuracy scores across dimensions occurred in the RET versus the NO RET treatment. Within the RET group itself, however, Structuring and Controlling the Interview (x̄ = 1.34) was the most accurately rated dimension. Developing the Subordinate (x̄ = 1.16) was rated with more accuracy than Resolving Conflict (x̄ = .70) and Establishing and Maintaining Rapport (x̄ = .86). Finally, Motivating the Subordinate (x̄ = .98) was also rated with more accuracy than Resolving Conflict. Within the NO RET group, Establishing and Maintaining Rapport (x̄ = .77) and Resolving Conflict (x̄ = .91) were rated less accurately than the other three dimensions.

Training Effects on Halo

A 2 x 2 ANOVA was performed to assess training effects on each of the halo measures. Results of the first ANOVA (see Table 8) for the SD measure of halo (i.e., average standard deviation within ratees) revealed significant main effects for both RET and RAT. Further inspection of the mean data presented in Table 9 suggested that error training significantly increased the spread in ratings (i.e., decreased halo). Conversely, accuracy training significantly increased halo (i.e., decreased the standard deviations within ratees). Although both main effects were significant, the omega square value associated with the RET effect was substantially larger than that associated with the RAT effect.

Table 8. Results of the Analysis of Variance for Halo

    Effect (SD)             df       F       ω²
    RET (A)                  1    53.02*    .31
    RAT (B)                  1     7.06*    .04
    A x B                    1     1.09
    Subjects x A x B       104     (.04)

    Effect (HALOCORR)       df       F       ω²
    RET (A)                  1    39.49*    .27
    RAT (B)                  1      .08
    A x B                    1     1.87
    Subjects x A x B       104     (.04)

    Note. Numbers in parentheses are the mean square errors associated with the F tests directly above them in the table.
    *p < .05

Table 9. Means and Standard Deviations of Halo
    Variable (SD)         NO RET       RET         Totals
    NO RAT              .95 (.12)   1.30 (.24)   1.11 (.24)
    RAT                 .90 (.20)   1.18 (.19)   1.01 (.23)
    Totals              .92 (.17)   1.21 (.23)   1.05 (.24)

    Variable (HALOCORR)   NO RET       RET         Totals
    NO RAT              .50 (.25)    .05 (.37)    .27 (.38)
    RAT                 .40 (.28)    .11 (.31)    .26 (.33)
    Totals              .45 (.27)    .08 (.34)    .27 (.36)

    Note. Numbers in parentheses = SDs.

The second ANOVA was conducted on the average difference between observed and true dimension intercorrelations (HALOCORR). The results of this ANOVA (also presented in Table 8) showed only a significant main effect for RET. The means and standard deviations associated with this analysis are shown in Table 9. The dimension intercorrelations of those individuals who participated in error training were closer to the true dimension intercorrelations than were those of individuals who did not receive RET. The dimension intercorrelations for the NO RET groups were substantially higher (i.e., more halo) than the true dimension intercorrelations.

Training Effects on Leniency

The main analysis aimed at evaluating training effects on leniency employed the average difference between observed and true means within dimensions in a 2 x 2 x 5 ANOVA, with RET, RAT, and DIM (repeated measures) as fixed factors. Results of that ANOVA (see Table 10) indicated a significant main effect for RAT. Evaluation of the means in Table 11 showed that accuracy training yielded mean dimension ratings that were closer to the true means. Those who did not receive accuracy training tended to rate the managers with more leniency, as evidenced by the positive mean deviation score for the NO RAT group. A significant main effect for DIM and a significant RAT x DIM interaction also resulted (see Figure 7). Tests of simple main effects revealed that RAT was effective in reducing leniency only with respect to Structuring and Controlling the Interview and Establishing and Maintaining Rapport.

Table 10. Results of the Analysis of Variance for LEN

    Effect                     df       F       ω²
    RET (A)                     1      .28
    RAT (B)                     1    16.50*    .13
    A x B                       1      .58
    Subjects x A x B          104     (.96)
    DIM (C)                     4    19.70*    .09
    A x C                       4     1.42
    B x C                       4     3.85*    .01
    A x B x C                   4      .58
    Subjects x A x B x C      416     (.24)

    Note. Numbers in parentheses are the mean square errors associated with the F tests directly above them in the table.
    *p < .05

Table 11. Means and Standard Deviations of LEN

    Variable      NO RET       RET         Totals
    NO RAT      .27 (.45)    .38 (.45)    .33 (.45)
    RAT        -.01 (.40)   -.03 (.44)   -.02 (.42)
    Totals      .13 (.45)    .18 (.45)    .15 (.47)

    Note. Numbers in parentheses = SDs.

[Figure 7. Mean Data (LEN) for DIM x Training Interactions. Dimensions: 1 = Structuring and Controlling the Interview; 2 = Establishing and Maintaining Rapport; 3 = Resolving Conflict; 4 = Motivating the Subordinate; 5 = Developing the Subordinate]

Within the RAT group alone, Developing the Subordinate (x̄ = .17) was rated more leniently than Resolving Conflict (x̄ = -.23) and Structuring and Controlling the Interview (x̄ = -.06). Establishing and Maintaining Rapport (x̄ = .07) was also rated with more leniency than was Resolving Conflict. With respect to the NO RAT group, the least lenient (most severe) ratings were associated with Resolving Conflict (x̄ = -.11). Less leniency was also observed on Motivating the Subordinate (x̄ = .18) as compared to Structuring and Controlling the Interview (x̄ = .46), Establishing and Maintaining Rapport (x̄ = .60), and Developing the Subordinate (x̄ = .49).
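The ω² values reported beside the F tests in Tables 4, 7, 8, and 10 are effect-size estimates. The text does not state exactly which error term entered each calculation, so the sketch below shows the textbook fixed-effects form, ω² = (SS_effect − df_effect · MS_error) / (SS_total + MS_error), with made-up numbers:

```python
def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    # Proportion of total variance attributable to an effect,
    # corrected for the effect's degrees of freedom.
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Illustrative values only; they are not taken from the tables above.
print(round(omega_squared(ss_effect=10.4, df_effect=1,
                          ms_error=0.20, ss_total=33.0), 2))  # -> 0.31
```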
DISCUSSION

The results of the present study suggest that rating accuracy can be improved by training individuals in a manner that is consistent with and facilitates human information processing capabilities. Specifically, it appears that the use of an actual behavioral instrument as a training tool had the effect of providing raters with a common frame-of-reference for evaluating ratee behavior. This was not particularly surprising in that the categories one uses are a function of education and experience (Ilgen & Feldman, 1983). Further, by focusing rater attention on the particular effective, average, and ineffective behaviors that corresponded to each rating dimension, trainees were given easily detectable cues of good and poor performance which were hypothesized to enhance the development of their newly imposed, more specialized category systems. Hence, the increases in accuracy found in the RAT group seem to support the notions of several recent researchers who have suggested that the development of job-relevant category systems, along with their implications for the treatment and evaluation of employees (Swann & Snyder, 1980), are the source of valid variance in performance appraisals (Ilgen & Feldman, 1983).

Further, the effects of RAT in improving accuracy were evidenced regardless of whether accuracy was conceptualized in terms of distance from true scores (DIST) or the correlations between true and observed scores on each dimension (DA). This was not unexpected, however, given the substantial degree of overlap between the two accuracy measures.

This study also lends support to previous research (Bernardin & Pence, 1980; Latham et al., 1975) which has shown that individuals can be trained to reduce psychometric errors in their ratings. Specifically, error training reduced halo measured in terms of standard deviations within ratees (SD) and in terms of differences between true and observed dimension intercorrelations (HALOCORR). Error training did not, however, have any effect on leniency. This result may be an indication that the error training used here was simply not as effective as previous training efforts in reducing leniency. For instance, although the Latham et al. (1975) workshop procedure was followed, their actual training tapes were not used. However, a main effect for RAT was observed on the leniency measure. Specifically, the ratings of those who received accuracy training were closer to the true dimension means than were the ratings of those who did not receive RAT.

Perhaps some of the most interesting findings, however, concerned the RET x RAT interactions associated with the accuracy analyses. First, when RET was combined with RAT, rating accuracy was significantly decreased as measured by DA. Further, although the mean differences were nonsignificant, the average distance from the true scores was somewhat lower in the RAT alone condition compared to the combined condition. A potential explanation for this result may reflect a problem with the RET/RAT training program itself. Specifically, subjects who received both forms of training were presented with twice as much information in the same amount of time as
It thus seems plausible that the RET/RAT subjects may not have been able to efficiently assimilate the amount of information that was required by their particular training program. However, another plausible explanation for the finding that RET and RAT‘ together tended to decrease accuracy is that when RET was combined with RAT, subjects' attention may have been partially diverted away from the observation and evaluation of relevant ratee behaviors to monitoring their own rating behavior. Concern with avoiding the rating errors discussed during the training session may have to some degree compromised the accuracy of their evaluations. In fact, anecdotal evidence obtained from subjects who participated in the RET/RAT treatment suggests that this may have been the case. Several students reported purposely spreading out their ratings in order to "avoid the errors" when they would have preferred rating particular target ratees more uniformly. Given the present research design, however, it is not possible to ascertain which, if either, of these explanations is valid. Future research aimed at clearly delineating the particular effects of combining the types of training employed here certainly seems warranted. In terms of comparing the effects on accuracy of error training versus no training, the results are not entirely conclusive. Concerning the DA measure of accuracy, for example, RET significantly increased accuracy as compared to the no training condition. This result is inconsistent with previous research (e.g., Bernardin 8 Pence, I980: 7i Borman, I979) that has found error training to have no effect on increasing accuracy. On the other hand, and consistent with previous research was the finding of no difference in accuracy (as measured by distance from true scores) between the RET alone and the no training conditions. The question of whether error training is better than no training might be largely dependent upon how one conceptualizes accuracy as well as a function of variations across studies in the particular training strategies and rating scales used. If, for example, our goal is to have ratings that covary accurately with "true scores," then the present results indicate that error training may be better than no training for increasing accuracy. If, however, our goal is to obtain ratings that accurately reflect a ratee's jgygl of performance vis a vis a behavioral rating instrument, then the present results indicate that error training may be ineffective. . Another result to emerge from this study was that the accuracy training employed here was effective on only three of the five rating dimensions. These were: Structuring and Controlling the Interview, Resolving Conflict, and Developing the Subordinate. While only 22;; Egg explanations of this result are possible. it appears that these dimensions may have been more explicitly defined in terms of the particular effective and ineffective behavioral cues corresponding to various performance levels. This explanation seems plausible, especially upon further evaluation and comparison of the behavioral descriptions associated with those dimensions that were affected by RAT versus those that were not. 
For example, the behavioral cues constituting a "7" on Developing the Subordinate (e.g., "setting up a specific developmental program for the subordinate," "making worthwhile developmental suggestions such as enrolling in an interpersonal skills seminar or taking the Dale Carnegie course," and "setting up specific days and times to meet and discuss developmental issues and progress") seem less ambiguous than the cues associated with a "7" on Establishing and Maintaining Rapport (e.g., "effectively bringing out the subordinate's problems through probing but nonthreatening questions" and "discussing the subordinate's problems in a warm and supportive manner"). Similar examples of relatively ambiguous anchors are more prevalent on the two dimensions for which RAT had no effect.

Also of interest was the finding that there were differences observed in accuracy and leniency across the dimensions within particular treatments. This was especially noteworthy because, without exception, previous rater training efforts have focused on the effects of training in general (e.g., Bernardin & Pence, 1980; Borman, 1979; Latham et al., 1975), without giving consideration to potential differences due to specific demands of the rating task itself. It has only recently been suggested, for example, that different rating formats may place different emphasis on the cognitive tasks required by the rater (Murphy, Garcia, Kerkar, Martin, & Balzer, 1982). Implicit in this suggestion is the notion that different training strategies might be necessary dependent on the format used. However, even beyond looking for general format x training interactions, the present results indicate that useful information might be available through further analysis of training effectiveness within particular formats. On a common sense level, just as certain ideas and/or concepts are easier to communicate than others, it may be that certain dimensions and/or traits are easier to train than others. If, indeed, this proves to be true, then assessment of the particular characteristics associated with more easily rated dimensions and/or traits should prove valuable for both rating scale development as well as rater training efforts.

The present study also supports recent assertions that the prevailing error/accuracy negative covariation assumption may not be valid. Although some significant relationships were found between error and accuracy (e.g., between HALOCORR and the two accuracy measures and between DIST and leniency), it must be remembered that these two error measures were based on deviations from the true scores. However, even given the fact that these particular measures were derived from the true scores, their relationship with accuracy was relatively low (average r = .22), revealing only about 5 percent of shared variance. This result is comparable to recent calculations by Cooper (1981), who showed error and accuracy to share a median of only 8 percent of the variance. Further, and perhaps less optimistic with respect to our present means for assessing "errors," is the fact that most previous research has calculated errors either without the benefit of true scores (e.g., Bernardin & Pence, 1980) or without using the true scores that were available (e.g., Borman, 1975, 1979). Calculations similar to those of these researchers made in the present research (i.e., the SD halo measure) revealed no relationship between error and accuracy.
Taken as a whole, there is enough evidence to suggest that a serious reevaluation of our present means for defining and measuring rating "errors" might be warranted. As alluded to in the introduction, serious consideration should be given to the fact that highly intercorrelated dimensions and/or negatively skewed distributions, for example, may be accurate reflections of reality rather than indications of halo and leniency. Hence, researchers in the fields of Industrial Psychology and Organizational Behavior may have to reassess the assumptions that they presently embrace concerning the levels of performance that will be evidenced by a particular individual as well as among individuals within a group.

Limitations and Directions for Future Research

On a practical level, the results presented here indicate that the concern of training ought to be expanded from its exclusive concentration on rating errors to include components that are more directly focused on increasing the accuracy of performance evaluations. Similar sentiments have been echoed by a number of researchers (Borman, 1972; Ilgen & Feldman, 1983) in their contention that further advancement in the area of rater training is unlikely without appropriate attention to a process-centered view of performance appraisal that considers the information processing functions of information gathering, storage, recall, and integration. The present study was primarily concerned with the information gathering and storage components of this process in terms of providing trainees with specific, job-relevant categories for observing and evaluating ratee performance and further developing these categories by focusing on various effectiveness levels within them. However, this research is only a first step towards attempting to increase the accuracy of raters' evaluations. Further, there are several potential limitations to this study which indicate that caution should be exercised in drawing any definitive conclusions based on these data.

First, undergraduate students and not managers were used as raters; consequently, the results can only tentatively be generalized to a true manager/supervisor population. However, recall from the Method section that employment decisions made by students in laboratory settings have been shown to be similar to those made by professional interviewers (Bernstein et al., 1975; Schmitt, 1976). Further, the issues addressed in the present study concerned questions of how humans process and evaluate stimuli in their environments. There is no indication from the cognitive psychology literature that this process is appreciably different for students versus "real world" appraisers of employee performance. What might be appreciably different, though, are the implicit category systems that managers/supervisors have developed versus those of the students. As previously mentioned, the categories that one uses are a function of education and experience. It thus seems logical that the category systems for assessing subordinates already in use by more experienced managers would be better defined than those used by a relatively inexperienced student group. Hence, convincing experienced individuals to accept a newly imposed category system might require somewhat different strategies than those employed here.
Similar to many OD interventions, for example, part of the training program may have to be geared toward assessing the categories already in use by trainees and convincing the "owners" of inappropriate ones that their present means for evaluating employees are somehow inadequate (i.e., a process analogous to "unfreezing"). Perhaps only then can acceptance and use of a newly imposed category system ensue (i.e., change and "refreezing"). It is worthwhile to note, however, that approximately half of the present subjects reported having previous experience with performance appraisals. Hence, the degree to which experience may or may not necessitate changes in the training strategy suggested here can only be evaluated by future research.

Another potential limitation of this study is that the results could be attributed to the demand characteristics of the situation. It is difficult to define, however, what constitutes demand characteristics in a training study. If subjects did change their rating behaviors in accordance with the treatment presented by the experimenter, then "demand characteristics" seem inseparable from a successful training intervention. Further, it is virtually impossible that any subject could have known the true purpose of the research. The experimenter adhered, as closely as possible, to the training programs outlined in Appendices B and C, and no discussion of the study or the hypotheses was undertaken until all data collections were complete. Students were also asked not to discuss their training sessions with others in the class. Anecdotal evidence gathered by the experimenter prior to each session suggests that subjects strictly adhered to this request.

A third potential limitation concerns the fact that observations were made from videotaped rather than live persons. It is doubtful that this limitation is severe, as research reviewed by Lifson (1953) has suggested that filmed performances are rated the same as live performances. Also, in light of the inherent difficulties of obtaining true scores from live performances, potential criticisms associated with the use of videotaped ratees do not seem particularly salient, especially considering the nature of the hypotheses under investigation here.

At a somewhat higher level of abstraction, there are several aspects of the rating process in general as well as specific consequences of categorization that were not explicitly addressed in the present study. These theoretically based limitations are nevertheless important, and they also represent potentially fruitful avenues for future research. One especially relevant issue concerns some of the consequences of categorization for memory. Recent evidence seems to indicate that there may be an upper bound on the degree to which raters are able to recall which specific behaviors a given ratee has exhibited. The reason for this is that categorization is often conceptualized as a process whereby a stimulus object/person is matched to some category prototype. Furthermore, unique behaviors emitted by a particular person become more difficult to remember over time because they are colored in such a manner as to be consistent with characteristics of the prototype to which they were matched (Wyer & Srull, 1970). That is, once a person is categorized vis-à-vis particular behaviors and/or characteristics, the features of the category prototype(s) come to characterize the individual.
Consequently, when a rater is asked to recall information for performance evaluations, some of the information will accurately describe the person in question while other information may not (Cantor & Mischel, 1977, 1979; Sentis & Burnstein, 1979; Spiro, 1977; Tsujimoto, 1978; Tsujimoto, Wilde, & Robertson, 1978; Wyer & Srull, 1980). Multiple categorizations are possible (Ilgen & Feldman, 1983), however, and seem dependent on one's expertise and the degree of differentiation in the observer's category system (Rosch, 1978; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). The present research indicates that training can potentially be used to facilitate the development of a specialized category system that, in turn, can result in more accurate performance evaluations. However, various questions remain concerning the degree to which there may be an upper bound on the accuracy that we can ever hope to achieve. It is also quite likely that other, potentially more effective training strategies can be developed to deal with such apparently problematic issues. Consideration of the prototype matching model and its implications by future researchers may prove valuable in this endeavor.

In summary, the present research needs to be replicated and extended using different raters, possibly other rating instruments as training tools, and variants of the present training procedures that more completely address the limitations of human information processing capabilities. The effects of accuracy training over time must also be evaluated. However, given that accuracy is the crucial criterion for judging the quality of performance evaluations, the results of the present study should be viewed optimistically. They suggest that advances toward effective interventions in the area of rater accuracy training are possible.

Conclusions

In conclusion, the results of this study can be summarized in the following manner. First, RAT had the effect of increasing accuracy and decreasing leniency in subjects' ratings of videotaped managers. Second, RET decreased halo error but had no effect on leniency or accuracy. Although the combination of RET and RAT proved somewhat less accurate than RAT alone, further research is needed to verify this finding. Finally, the dimension x training interactions suggested that the effectiveness of rater training strategies cannot be considered independent of the rating format and/or the rating task itself.

APPENDIX A

RATING SCALES

STRUCTURING AND CONTROLLING THE INTERVIEW

Clearly stating the purpose of the interview; maintaining control over the interview; displaying an organized and prepared approach to the interview versus not discussing the purpose of the interview and displaying a confused approach; allowing the subordinate to control the interview when inappropriate.

High Level Performance
• Outlines clearly the areas to be discussed and skillfully guides the discussion into those areas.
• Displays good preparation for the interview and effectively uses information about the subordinate to conduct a well planned interview.

Average Performance
• States the purpose of the interview but fails to cover some areas he intended to discuss.
• Appears prepared for the interview but at times is unable to control the interview or to guide it into areas planned for discussion.

Low Level Performance
• Fails to indicate the purpose of the interview and appears to be unfamiliar with the file information.
• Appears unprepared for the interview and is unable to control the subordinate in the interview.

ESTABLISHING AND MAINTAINING RAPPORT

Setting the appropriate climate for the interview in a warm, non-threatening manner; being sensitive to the subordinate versus setting a hostile or belligerent climate; being overly friendly or familiar during the interview; displaying insensitivity toward the subordinate.

High Level Performance
• Draws the subordinate out by projecting sincerity and warmth during the interview.
• Discusses the subordinate's problems in a candid but nonthreatening and supportive way.

Average Performance
• Displays some sincerity and warmth toward the subordinate and indicates by his response to the subordinate and his problems that he is reasonably sensitive to the subordinate's work-related problems.
• Uses mechanical means to set the subordinate at ease, i.e., offers coffee.

Low Level Performance
• Projects little feeling or sensitivity toward the subordinate; makes no friendly gestures.
• Is confrontive and inappropriately blunt during the interview.

RESOLVING CONFLICT

Moving effectively to reduce the conflict between Valva and the subordinate; making appropriate commitments and setting realistic goals to insure conflict resolution; providing good advice to the subordinate about his relationships with Valva, subordinates, etc. versus discussing problems too bluntly or lecturing the subordinate ineffectively regarding the resolution of conflict; failing to set goals or make commitments appropriate to effective conflict resolution; providing poor advice to the subordinate about his relationship with Valva, subordinates, etc.

High Level Performance
• Effectively reduces conflict between the subordinate and others by making appropriate and realistic commitments to help the subordinate get along better in the department.
• Provides good advice about solving problems and about improving the subordinate's poor relationships with his subordinates, Valva, etc.

Average Performance
• Puts forth some effort to reduce conflict between the subordinate and others but usually does not commit himself to helping with this conflict resolution.
• Tends to smooth over problems and provide reasonably good advice to the subordinate about conflict situations.

Low Level Performance
• Lectures ineffectively or delivers inappropriate ultimatums to the subordinate about improving his relationships with others or about changing his "attitude" toward people or problems.
• Fails to make commitments to help the subordinate resolve problems or provides poor advice to the subordinate about his relationships with Valva, subordinates, etc.

DEVELOPING THE SUBORDINATE

Offering to help the subordinate develop professionally; displaying interest in the subordinate's professional goals; specifying developmental needs and recommending sound developmental actions versus not offering to aid in the subordinate's professional development; displaying little or no interest in the subordinate's professional growth; failing to make developmental suggestions or providing poor advice regarding the subordinate's professional development.

High Level Performance
• Displays considerable interest in the subordinate's professional development and provides appropriate, high quality developmental suggestions.
• Makes commitments to help professionally in the subordinate's development.
Average Performance
• Provides general developmental suggestions but usually fails to make a personal commitment to aid in the subordinate's professional development.
• Shows moderate interest in the subordinate's development; may direct the subordinate to seek developmental suggestions elsewhere.

Low Level Performance
• Expresses little or no interest in the subordinate's professional development.
• Fails to offer developmental suggestions or provides poor advice regarding the subordinate's professional growth and development.

MOTIVATING THE SUBORDINATE

Providing incentives for the subordinate to stay at GCI and to perform effectively; making commitments to motivate the subordinate to perform his job well, to remain with GCI, and to help GCI accomplish its objectives; supporting the subordinate's excellent past performance versus providing little or no incentive for the subordinate to stay at GCI and perform effectively; failing to make commitments encouraging the subordinate's continued top performance; neglecting to express support of the subordinate's excellent performance record.

High Level Performance
• A high level performer provides encouragement and appropriate incentives to persuade the subordinate to stay with GCI and perform his job effectively.
• A high level performer uses compliments of the subordinate's technical expertise and excellent past performance to motivate the subordinate to meet the objectives of the department.

Average Performance
• Compliments the subordinate appropriately at times but is only moderately effective in using these compliments to encourage high performance, loyalty to GCI, etc.
• Provides some incentives for the subordinate to perform effectively at GCI, but generally makes few if any personal commitments to support the subordinate in his job.

Low Level Performance
• Fails to express support for the subordinate's past performance.
• Provides little or no incentive for the subordinate to remain at GCI.

APPENDIX B

RATER ERROR TRAINING

What follows is a step-by-step procedure for the trainer to follow when conducting RET. The double-spaced text is a detailed script of what the trainer will specifically say during the training. Other directions for the trainer appear in italics.

Today, you will be participating in an error training program that will help you learn how to appraise the job performance of others. Once we have finished the actual training program, I will be showing you videotapes of six managers conducting an interview with a problem subordinate. After we view each of these videotapes, you will be rating each manager on how well he conducted the interview. I will then collect these ratings, go over them, and during a regular class period, I will report back how well you did in making your evaluations.

In order to rate the behaviors of others correctly, there are a few things you must know about how to avoid various common rating errors that can occur when you evaluate others. What I mean by rating error is any systematic fault in judgment that occurs when you appraise another person's performance. More precise definitions as well as specific examples of various errors will be discussed during this training session.

In order to demonstrate how rating errors can occur, we will be viewing two five-minute videotapes similar to those you will be rating after the training program. You will actually rate the managers who appear in these two tapes, and we will discuss your ratings as a group.
I am passing out packets of rating scales that you will use to make your ratings. Do not make your ratings of the manager until the tape has finished. Also, do not take notes until the tape is finished, because you might miss important parts of it. The manager that you are about to see on the first videotape is an example of a very good interviewer, who deals quite well with the problem subordinate.

Show videotape 1. When the tape is finished, ask trainees to put their first name on the first page of the rating scale packet. Give them approximately five minutes to make their evaluations of the manager. Put trainees' names on a flipchart while they are making their ratings. When trainees are finished, ask them to hand in their completed scales. Record the results of each person's ratings next to his/her name on the flipchart. Begin discussing the discrepancies between ratings. Listed below are the true levels of performance for the manager on the first training tape.

    STRUCTURING AND CONTROLLING THE INTERVIEW    3.31
    ESTABLISHING AND MAINTAINING RAPPORT         3.69
    RESOLVING CONFLICT                           5.69
    DEVELOPING THE SUBORDINATE                   6.08
    MOTIVATING THE SUBORDINATE                   5.77

Do not mention these "true scores" to trainees. Use these scores only for your own information to appropriately direct the discussion of various rating errors. First, look for trainees who committed halo error. This error will be evidenced by ratings that are consistently high or low across the five individual rating scales. Try to identify one or more persons whose ratings follow this pattern, and ask them why they rated the manager as they did. One of the following two responses is likely to occur:

1. Trainees may discuss one or two things the manager did that were good or bad. You can infer from this type of response that the trainee's ratings were based on the one or two things mentioned. Make a note of any trainees who back up their ratings with examples of things that occurred early in the interview. These will be used later on as examples of first-impression effect.

2. If the trainee(s) rated the manager high across all the performance scales, s/he might alternatively say the reason was because you (the trainer) had said the manager was an effective performer.

Now try to identify some trainee(s) whose ratings are not consistent across all the rating scales. Ask the person(s) to explain the reasoning behind their ratings. These responses will most likely include both strengths and weaknesses of the manager. In any case, continue the discussion as follows:

What we have just witnessed is an example of one type of rating error called halo. The term "halo" implies that there is a general aura surrounding all the judgments that are made about a particular ratee. What typically happens is that the rater forms a generally favorable or unfavorable impression of the ratee, and then gives the person ratings that are consistent with this good or bad impression. Those of you who rated the manager high just because I told you he was an effective performer committed halo error. Those of you who formed a generally good or bad impression of the manager based on one or two characteristics (and thus gave the manager all high or all low ratings) also committed halo error. One thing that is important to remember is that people are not typically all good or all bad. Because of this, it is essential that you try not to form a general impression when rating others.

Ask trainees what can be done to eliminate halo error. It should be suggested that evaluations be made independently of what raters have heard from others, and that raters make a point of looking for both positives and negatives.

For those of you who did commit halo error, I want you to realize that this is a very common occurrence. Most people do form general impressions of others which do influence subsequent appraisals of their behavior.

Now try to identify trainees who committed central tendency error. This error is characterized by ratings that are concentrated around the middle anchors of the rating scale (i.e., 3, 4, or 5). Ask trainees who committed central tendency error to explain the reasoning behind their ratings. After one or more rationales have been given, continue the discussion as follows:

When all the ratings are concentrated around the middle anchors on the rating scale, this is an example of what is called central tendency error. This error occurs when the rater is afraid to use the extremely good or the extremely bad anchors of the scale, even though the ratee is exhibiting excellent or poor performance.

To summarize where we are at this point, we have discussed two errors that can occur when evaluating the performance of others. These
[Now try to identify trainees who committed central tendency error. This error is characterized by ratings that are concentrated around the middle anchors of the rating scale (i.e., 3, 4, or 5). Ask trainees who committed central tendency error to explain the reasoning behind their ratings. After one or more rationales have been given, continue the discussion as follows:]

When all the ratings are concentrated around the middle anchors on the rating scale, this is an example of what is called central tendency error. This error occurs when the rater is afraid to use the extremely good or the extremely bad anchors of the scale, even though the ratee is exhibiting excellent or poor performance.

To summarize where we are at this point, we have discussed two errors that can occur when evaluating the performance of others. These errors were halo and central tendency.
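Again as an editorial illustration rather than part of the script, central tendency has an equally simple numerical signature; the band boundaries below follow the "middle anchors (i.e., 3, 4, or 5)" named in the trainer direction above and are otherwise assumptions:

    def central_tendency_suspect(ratings, low=3, high=5):
        """Flag possible central tendency error: every rating falls in the
        middle band of the 7-point scale. Boundaries are illustrative."""
        return all(low <= r <= high for r in ratings)

    print(central_tendency_suspect([4, 3, 5, 4, 4]))  # True: avoids the extremes
    print(central_tendency_suspect([2, 4, 6, 5, 3]))  # False: uses the extremes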
What we are going to do now is view a second videotape of another manager interviewing the same problem subordinate. After the tape is finished, you will rate the manager, just as we did with the first videotape.

[Show videotape 2. When the tape is finished, ask trainees to put their first name on the first page of the rating scale packet. Give trainees approximately five minutes to make their evaluations. Put trainees' names on the flipchart while they are making their ratings. Have participants hand in their ratings and record these next to their names on the flipchart. Generate a discussion centering on any discrepancies among the ratings. If any of the trainees compare the second manager to the first during the discussion, this will allow discussion of contrast effects to begin. If no trainee compares the two managers, ask them how they thought the second manager did with respect to the first. Further, ask trainees if they had used the first manager as a comparison point when they rated the second. Once any discussion of comparisons occurs, continue the program as follows:]

If any of you rated the second manager by comparing his performance to the first, you committed a contrast error. More specifically, a contrast error occurs when we evaluate a person by comparing him/her to someone we have just finished rating, instead of evaluating the person on how well s/he has performed independently of others and relative to the job in question.

[Ask trainees what we might do to minimize contrast effects. The suggestions that should be made (either by the trainer or trainees) should include: (1) evaluate the ratee in relation to his/her absolute level of performance, and (2) decide what these absolute levels of performance are before you begin evaluating people.]

There is one final error we will discuss today, and it is concerned with the different tendencies some raters have regardless of the person they are evaluating. For example, if any of you gave both managers generally high ratings, you may, in general, be rating others too leniently. On the other hand, if you gave both managers relatively bad ratings, you may, in general, be rating others too harshly or strictly. Raters who consistently give ratings that are either too high or too low across many ratees are committing leniency/severity error.

The difference between halo and leniency/severity is that halo is person specific: you form certain general impressions of each person, and you therefore rate some people high and some low. With leniency/severity, the problem lies in the fact that you consistently rate all ratees either too high (as in leniency) or too low (as in severity).

[Look at all trainees' ratings for both managers. Select one or more sets of ratings that are indicative of leniency/strictness error and use these as examples for the group. Ask trainees to generate ideas regarding how we might decrease the occurrence of leniency/strictness in ratings. After completion of the second rating exercise, complete the training program as follows:]

All of you should now understand how various rating errors can distort our evaluations of others. You will now be rating six more videotapes of different managers interviewing the same problem subordinate. We will not be discussing these ratings, but I will collect them and evaluate how well you did. The results of this exercise will then be reported back to you during a regular class session. As you are observing the videotapes, keep in mind the rating errors we have discussed and the various ways they might be minimized. Try using these strategies as you view and rate each manager's performance.
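Closing Appendix B with one more editorial sketch: unlike halo, which shows up within a single ratee, leniency/severity shows up as a rater's overall elevation or depression across many ratees. Comparing against the scale midpoint, as below, is an assumption made for illustration; the thesis's own leniency measure (LEN) may be operationalized differently, for example against true scores.

    from statistics import mean

    SCALE_MIDPOINT = 4  # midpoint of the 7-point scales

    def leniency_index(ratings_by_ratee):
        """Mean elevation of one rater's ratings, pooled across all ratees
        and dimensions. Clearly positive suggests leniency; clearly
        negative suggests severity. Illustrative only."""
        pooled = [r for ratee in ratings_by_ratee for r in ratee]
        return mean(pooled) - SCALE_MIDPOINT

    print(leniency_index([[6, 6, 7, 6, 5], [6, 5, 7, 6, 6]]))  # 2.0: lenient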
APPENDIX C

RATER ACCURACY TRAINING

What follows is a step-by-step procedure for the trainer to follow when conducting RAT. The double-spaced text is a detailed script of what the trainer will specifically say during the training. Other directions for the trainer appear in brackets.

Today, you will be participating in a training program that will help you learn how to accurately appraise the job performance of others. Once we have finished the actual training program, I will be showing you videotapes of six managers conducting an interview with a problem subordinate. After we view each of these videotapes, you will be rating each manager on how well he conducted the interview. I will then collect these ratings, go over them, and report back to you during a regular class period how well you did in rating the videotapes.

In order to rate the behavior of others correctly, there are a few things that you must know about how performance appraisal systems are set up. First of all, most jobs can be thought of as consisting of various categories or dimensions of performance. In fact, you can think of any job as a pie that can be cut or divided into various pieces. Whenever we evaluate an employee's job performance, it is very important that we rate the person in terms of important categories of performance, because these pieces or categories are the crucial elements of the job. Therefore, in order to evaluate effectively how people are performing their jobs, it is essential that we rate them on these important dimensions.

As I mentioned before, today we will be rating six managers conducting an interview with a problem subordinate. In order to appraise the performance of these managers, the first thing we must do is identify the important elements of the task that we will be evaluating. There are five performance dimensions that we will be using to rate these six videotaped managers. What I am passing out to you now are the actual rating scales we will be using. You will notice that there are five scales, one corresponding to each important category of performance. What we are going to do now is review each of these categories and what they mean.

The first category we will use to rate the manager's performance is how well s/he STRUCTURES AND CONTROLS the interview with the subordinate. A manager who does a good job with respect to this dimension will do such things as clearly state the purpose of the interview; he will maintain control over the interview; and he will be organized and prepared for the interview. A manager who does not perform well with respect to this category will not discuss the purpose of the interview; he will display a confused approach; and he will allow the subordinate to control the interview at inappropriate times.

[Similarly, go over all of the performance dimensions by giving a global definition of what constitutes effective and ineffective performance on each.]

Now that we have our five performance dimensions and global definitions of each, the next thing I would like to do is give you more specific examples of what constitutes different levels of effective and ineffective performance for each category. As you have probably noticed, corresponding to the scale anchors are examples of what types of behaviors are considered High Level Performance, Average Performance, and Low Level Performance. What I would like to do now is go over specific examples of behaviors corresponding to the different performance levels on each of the five categories. Then we will practice using these scales by rating a videotaped manager conducting an interview with a problem subordinate.

[Go through each of the dimensions by giving specific examples of behavior corresponding to the seven levels of performance.]

As I mentioned, what I would like to do now is give you some practice in using these rating scales. I am going to show you a five-minute videotape, and when the tape is finished, you will rate the manager on the five performance dimensions. Do not take notes while the videotape is playing, because you might miss things that the manager does. As you are watching the tape, though, look for specific effective and ineffective behaviors the manager exhibits that correspond to our five categories of performance. This will help you to remember what the manager actually did and how well he did it (that is, whether it was high, average, or low performance).

[Show videotape 1. When the tape is finished, ask trainees to put their first name on each scale and then give them approximately three minutes to make their ratings. Put trainees' names on a flipchart while they are making their ratings. When they are finished, ask them to hand in their rating for STRUCTURING AND CONTROLLING THE INTERVIEW. Record the results on the flipchart next to each trainee's name. Generate a group discussion that focuses on any discrepancies among trainees. Make sure people discuss which particular manager behaviors they considered in making their rating. Use the scale anchor descriptions to evaluate the effectiveness of each behavior discussed. Also, make sure that any behaviors brought up are legitimate examples that correspond to the dimension in question. Repeat this process for each of the other four dimensions/rating scales.]
[Listed below are the true levels of performance for the manager on the first training tape.

STRUCTURING AND CONTROLLING THE INTERVIEW   3.31
ESTABLISHING AND MAINTAINING RAPPORT        3.69
RESOLVING CONFLICT                          5.69
DEVELOPING THE SUBORDINATE                  6.08
MOTIVATING THE SUBORDINATE                  5.77

Do not directly mention these "true scores" to trainees. Merely deal with each performance dimension by discussing specific behaviors and their effectiveness levels in terms of the dimension descriptions. Tell trainees that they will now rate another videotape of a manager interviewing the same problem subordinate. Show videotape 2. Follow the exact instructions and procedure as you did for the first videotape. The true levels of the manager's performance for the second training tape appear below:

STRUCTURING AND CONTROLLING THE INTERVIEW   2.79
ESTABLISHING AND MAINTAINING RAPPORT        1.50
RESOLVING CONFLICT                          2.07
DEVELOPING THE SUBORDINATE                  2.71
MOTIVATING THE SUBORDINATE                  2.29

After completion of the second rating exercise, summarize and end the training program as follows:]

All of you should now understand how to use these rating scales to evaluate the performance of a manager who is interviewing a problem subordinate. You will now be rating six more videotapes of different managers conducting the same interview. We will not be discussing these ratings, but I will collect them and evaluate how well you did. The results of this exercise will then be reported back to you during a regular class session. As you are observing the videotapes, keep in mind the five categories you will be rating the managers on. As we did during the practice sessions, look for specific behaviors that will help you identify which level of performance the manager is exhibiting. Also, use the anchors that appear on the rating scales themselves to help you justify your final rating decision.
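A final editorial sketch, since accuracy is what this appendix builds toward: the trainer's feedback compares trainees' ratings against the true scores listed above. In the spirit of the thesis's DIST (distance from true scores) measure, though not necessarily its exact formula, accuracy can be summarized as a mean absolute deviation:

    def distance_from_true(ratings, true_scores):
        """Mean absolute distance between one rater's five dimension ratings
        and the corresponding true scores; smaller means more accurate. A
        sketch in the spirit of DIST, not the thesis's exact formula."""
        return sum(abs(r - t) for r, t in zip(ratings, true_scores)) / len(true_scores)

    tape1_true = [3.31, 3.69, 5.69, 6.08, 5.77]  # first training tape, from above
    print(round(distance_from_true([3, 4, 6, 6, 6], tape1_true), 2))  # 0.25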