This is to certify that the thesis entitled THE EFFECTS OF TEST APPROPRIATENESS ON RELIABILITY AND VALIDITY, presented by Catherine S. Clause, has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor: Neal Schmitt

THE EFFECTS OF TEST APPROPRIATENESS ON RELIABILITY AND VALIDITY

By

Catherine S. Clause

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1996

ABSTRACT

THE EFFECTS OF TEST APPROPRIATENESS ON RELIABILITY AND VALIDITY

By Catherine S. Clause

Test appropriateness indicates the degree to which a test accurately represents a person's ability on the construct purported to be measured by the test. Research on test appropriateness has previously concentrated on comparing the accuracy of different indices in detecting inappropriateness under a narrow set of conditions. The present study examines the application of the l_z index of test appropriateness for the purpose of flagging response vectors for removal from analyses to improve estimates of test properties. The effects and the detection of aberrance under a variety of simulated testing conditions were examined. Results indicate that aberrance, as simulated in this study, does not have a large effect on test properties and that the l_z statistic does not adequately detect aberrance when there are a relatively large number of aberrant response vectors in a data set. Potential reasons for these findings and implications for applications of test appropriateness indices are discussed.

Copyright by Catherine Sue Clause 1996

ACKNOWLEDGMENTS

This thesis would not have been possible without the assistance and support of a number of people. First, I would like to thank my committee members, Rick DeShon and Ralph Levine, for their insightful comments that helped me to improve the contents of this thesis. I am very grateful to Tom Peters for writing several of the computer programs that made my data analysis possible, based on the somewhat ambiguous specifications I gave him. I would also like to thank my parents for all of their support and for always encouraging me to do what I needed to do, even if it meant going to graduate school up north. My friends Leslie Hoffman and Kelli Pursell were indispensable during this process, and this is just to make it official: I owe both of you one (or two or twenty). Finally, I want to thank my committee chairperson, Neal Schmitt, for all of the guidance he has given me during my time in graduate school. He is not only one of the most intelligent and hardest working people I know, he is also one of the most patient.
TABLE OF CONTENTS

LIST OF TABLES
OVERVIEW
INTRODUCTION
    Research on Test Appropriateness
    The l_z Statistic
    Applications of Appropriateness Indices
    Research Questions
METHOD
    Data Generation
    Analyses
RESULTS
    Manipulation Checks
    Research Questions 1 and 2: Effects of Aberrance on Reliability
    Research Questions 1 and 2: Effects of Aberrance on Validity
    Research Question 3: l_z Detection Rates and Aberrance Removal
DISCUSSION
    Implications of the Present Study for Previous Research
    A Potential Explanation for the Present and Previous Findings
    Limitations and Contributions
APPENDIX A - Test Summary Statistics
APPENDIX B - Change in Theta Due to 30% Aberrance
APPENDIX C - Change in Item Parameters Due to 30% Aberrance
LIST OF REFERENCES

LIST OF TABLES

Table 1 - Summary Statistics for Aberrance-Free Data Sets
Table 2 - Correlations with l_z Scores for Data Sets with Reliability Near .70
Table 3 - Correlations with l_z Scores for Data Sets with Reliability Near .80
Table 4 - Correlations with l_z Scores for Data Sets with Reliability Near .90
Table 5 - Change in Reliability Due to the Introduction of Aberrance
Table 6 - Change in Validity Due to Aberrance for Reliability Near .70
Table 7 - Change in Validity Due to Aberrance for Reliability Near .80
Table 8 - Change in Validity Due to Aberrance for Reliability Near .90
Table 9 - Average Change in Validity by Reliability Across Aberrance
Table 10 - Average Change in Validity by Reliability Across Validity
Table 11 - Proportion of Total Aberrance Detected Using the l_z Statistic
Table 12 - Change in Reliability Due to Removal of Aberrance
Table 13 - Change in Validity Due to Removal of Aberrance for r_xx Near .70
Table 14 - Change in Validity Due to Removal of Aberrance for r_xx Near .80
Table 15 - Change in Validity Due to Removal of Aberrance for r_xx Near .90
Table 1A - Test Statistics for Reliability Near .70
Table 2A - Test Statistics for Reliability Near .80
Table 3A - Test Statistics for Reliability Near .90
Table 4A - Change in Theta Due to 30% Aberrance
Table 5A - Change in Item Parameters Due to 30% Aberrance

OVERVIEW

Test appropriateness indicates the degree to which a test accurately reflects the standing of a respondent on the construct purported to be measured by the test. Various indices have been proposed to measure the degree to which a test is appropriate for a particular respondent (Levine & Rubin, 1979; Harnisch & Linn, 1981; Birenbaum, 1986). Research relating to test appropriateness has largely concentrated on comparing the accuracy of different indices in detecting particular types of test inappropriateness, or aberrance, in a data set. The recommended uses of these indices include individual-level diagnosis of respondents who may need to be retested or trained in test-taking strategies (Drasgow, 1982a; Harnisch, 1983). Another application mentioned by some researchers is to identify response vectors in a validation sample that may be distorting estimates of the psychometric properties of a test (Parsons, 1983; Birenbaum, 1985). There has been little research demonstrating how useful appropriateness indices are for this latter purpose (for an exception, see Schmitt, Cortina, & Whitney, 1993). This study examines the effects of removing aberrant responders from a test validation sample on estimates of the psychometric properties of the test.
Estimates of the reliability and validity of a test are compared for samples containing no aberrance, samples containing aberrance, and samples with aberrant respondents removed from the analysis. The standardized appropriateness index developed by Drasgow, Levine, and Williams (1985), 1,, is used to flag aberrant response vectors for removal from data sets. 2 Various types and levels of aberrance are simulated to examine the magnitude of the effects of aberrance on test reliability and validity under different conditions. In addition, the efl‘ect of using Iz to identify aberrant response vectors versus working with known levels of aberrance in a data set is examined. This paper begins by describing the concept of test appropriateness and the reason why this concept is of interest in the development and use of multiple choice tests. Next, the previous literature on test appropriateness is reviewed. Areas in need of further investigation are highlighted. Based on the literature review and the discussion of why test appropriateness is of interest, research questions about the efl‘ects of test inappropriateness on estimates of the psychometric properties of tests are developed. Following the presentation of research questions, the study is described, including the procedure for simulating the data sets and the analyses that are used for these data. Results are presented and discussed in terms of the effects of aberrance on test properties and the utility of using the 1, statistic to identify aberrant response vectors for removal from analyses to improve estimates of test reliability and validity. INTRODUCTION The purpose of administering a test is to collect a measure of an ability or trait of interest for predicting or explaining some behavior. In line with this purpose, tests have been described as measuring samples of behavior (Anastasi, 1988). The accuracy of a test for assessing a respondent's standing on the construct of interest is important in a variety of situations, including making selection or placement decisions in an employment context, measuring educational achievement, and clinically diagnosing mental or physical disorders. There are many reasons that a test may be a less than perfect measure of the ability or trait of interest. These reasons may have to do with the test itself (e.g. poor reliability, poorly stated items), such that the test is a poor measure of the construct of interest for all respondents. A test may also be a less than optimal measure of a construct of interest for only a certain proportion of the respondents. There has been a great deal of research on item bias and subgroup difi'erences in test scores and on subgroup difl'erences in criterion prediction using test scores (e.g., Sackett & Wilk, 1994; Schmitt, Clause, & Pulakos, 1996). The implications of this research have been made more apparent with recent legislation such as the Civil Rights Act of 1991 and with controversy such as that generated by the publication of The Bell Curve in 1994. In addition to what is described in this literature, there is another way in which test scores may be suboptimal measures for a subset of respondents. Research on test appropriateness has dealt with the issue of identifying individuals for whom test scores 4 are inaccurate representations of standing on the construct of interest. 
This research has largely focused on ability tests with dichotomously scored items, although implications for personality inventories and continuous-scale measures have been discussed (Parsons, 1983; Reise, 1995). A psychometrically adequate test is considered inappropriate for a particular respondent to the extent that the individual's item responses do not fit with the pattern expected based on group-determined item characteristics and estimates of the individual's ability on the construct of interest (van der Flier, 1977; Rudner, 1983). Test score inappropriateness occurs either when a high-ability respondent answers a subset of relatively easy items incorrectly, or when a low-ability respondent answers a subset of relatively difficult items correctly. A number of possible reasons have been given to explain why a respondent's test score may not be an accurate representation of their standing on the construct of interest (e.g., Wright, 1977; Levine & Rubin, 1979; Levine & Drasgow, 1982; Birenbaum, 1985). l A respondent who is high on the construct of interest measured by the test may receive a spuriously low score due to alignment errors on an answer sheet, use of suboptimal test- taking strategies, and/or careless responding due to fatigue or low motivation to take the test. A respondent who is low on the construct of interest measured by the test may receive a spuriously high score due to obtaining the answers from a high ability respondent and/or from test-specific coaching received prior to taking the test. Inaccurate assessment is problematic for both the individual taking the test and for the individual or group making use of the test scores. For example, in a personnel selection situation, overestimates of ability may result in hiring poorly qualified candidates. This hiring decision may contribute to productivity losses for the organization and to undue stress for the under-qualified individual as they attempt to perform job duties. On the other hand, underestimates of ability may also result in losses 5 of productivity for the organization due to the greater efi‘ort and expense needed to locate qualified applicants and the untapped resources of qualified individuals who are not hired. For the individual taking the test, underestimates of ability lead to the loss of job opportunities. Test score appropriateness is also an issue for test developers. Ifthere are a number of respondents in a validation sample whose test scores are inaccurate representations of their ability levels on the construct measured by the test, then estimates of the psychometric properties of the test may be distorted. If individuals in a validation study are poorly motivated to take the test, for example, then there may be a large number of respondents with spuriously low test scores due to careless or random responding. This may often be the case in concurrent criterion-related validation studies, in which job incumbents are asked to take a selection test and usually receive no benefit for doing so. When test scores derived in the above, or similar, situations are correlated with performance measures, the results may indicate that test scores are not strongly related to performance, when, in fact, the ability measured by the test is a good predictor of performance. Removal of individuals with spuriously low or spuriously high test scores from the validation sample may provide more accurate estimates of the psychometric properties of a test (i.e., reliability and validity). 
This contributes, in turn, to more accurate assessment of the usefulness of the test for personnel decisions such as selection or placement. The purpose of this study is to examine the effect of removing respondents with inappropriate test scores from validation samples on the reliability and validity of the test. This procedure is different from deleting "outliers" from the validation sample because appropriateness scores are not necessarily an indication of outlying test scores in relation to criterion performance. The identification of outliers in a validation sample depends on both the test and the criterion scores of that individual, and removal of outliers will necessarily inflate subsequent estimates of test validity. We have no way of knowing, however, whether the cases removed were inadequately assessed, or whether the scores are accurate and reflect an absence of predictor-criterion relationship. The identification of respondents with inappropriate test scores is completely internal to the test and does not depend on the criterion scores of the respondents (Schmitt, Cortina, & Whitney, 1993). Rather, an appropriateness index indicates the degree to which an individual's responses to particular test items are inconsistent with an estimate of their standing on the construct of interest. This index is calculated from the total set of item responses and the group-determined item parameters for the test (Drasgow & Guertler, 1987). The previous work in this area has found less than promising results when examining the effects of removing aberrant responders on the validity of selection measures (Schmitt, Cortina, & Whitney, 1993; Schmitt, Clause, Whitney, Futch, & Pulakos, 1994). There are several possible reasons for this. The first possibility is that the number of respondents flagged as having inappropriate test scores was too low to have a great effect on the validity of the test in these samples. This could be due either to a low number of aberrant responders in the sample or to low power of the appropriateness index to detect aberrance in the sample. The present study attempts to examine this possibility by simulating data sets with a variety of magnitudes of aberrance present (i.e., proportion of the sample with inappropriate test scores). Comparisons of data sets with no aberrance, data sets with aberrance, and data sets with aberrant response vectors removed are used to determine the effects of magnitude of aberrance and of detection power of the appropriateness index on estimates of test properties. Another possibility is that the effects of aberrant responding on test validity are only indirect, and are thus difficult to detect when only examining validity. Test appropriateness indices may be detecting item response patterns that display deviations from the overall internal consistency of the test, which is only indirectly related to validity. Since validity is linearly related to the square root of test reliability rather than directly to test reliability, large effects of aberrance on reliability may not have a large effect on validity (Parsons, 1983). The present study examines changes in both reliability and validity due to the presence or absence of respondents with inappropriate test scores in the validation sample. This is done in order to better determine the nature of the relation between test score inappropriateness and the psychometric properties of tests.
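The claim that validity depends on the square root of test reliability follows the classical attenuation relation. The formula below is textbook classical test theory rather than an equation appearing in the thesis itself:

r_{xy} = \rho_{T_x T_y} \sqrt{r_{xx}\, r_{yy}}

Here r_{xy} is the observed test-criterion correlation (validity), \rho_{T_x T_y} is the correlation between the true scores of test and criterion, and r_{xx} and r_{yy} are the reliabilities of the test and the criterion. Because the test's reliability enters only through its square root, even a sizable change in reliability moves validity modestly: a decline in r_{xx} from .90 to .80, for example, multiplies the expected observed validity by \sqrt{.80/.90}, or about .94.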
Research on Test Appropriateness Early testing research was based on the assumption that any test of particular content was measuring only the construct of interest and was measuring it in the same way for all respondents (Harnisch & Tatsuoka, 1983; Cortina, 1994). Later research recognized that test scores could be reflecting things other than the individual's standing on the construct of interest. Characteristics of the test, such as the inclusion of extraneous content (e.g., reading ability required on a math test) or the social desirability of particular responses, could influence the way items are answered (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990). Characteristics of the test-takers, such as literacy or familiarity with Western culture or tests, could also influence scores beyond the respondent's ability on the construct of interest (F rederiksen, 1977; van der Flier, 1977). A test score may be thought of as inappropriate to the extent that it reflects something other than the construct of interest. A subset of respondents may have inappropriate test scores if the primary source of variation in item responses is something other than that influencing the test scores of the reference group used to estimate test or item parameters (van der Flier, 1982; Cortina, 1994). Based on this discussion, test inappropriateness can be thought of as an interaction between characteristics of the test 8 leading to the group-determined item parameters for the test and characteristics of the respondent leading to the pattern of item responses for that respondent (Cortina, 1994). There have been numerous indices proposed to measure the degree that a person's test score is an inappropriate indication of their standing on a construct. These indices can be generally divided into two groups. The first group of indices are directly based on observed patterns of correct and incorrect item responses. Examples of indices from this first group include Donlon and Fischer's ( 1968) personal biserial coemcient, van der F lier’s (1977; 1982) [1' index, Hamisch and Linn's (1981) extended caution index, and Tatsuoka and Tatsuoka's (1982) norm-conformity index (N CI). These statistics will not be discussed in detail here, but for a review and comparison of the difl‘erent indices, see Hamisch and Linn (1981). Research has tended to focus on the second group of test appropriateness indices, especially in the last 15 years. These indices are based on item response theory (IRT) models of the test response data. Examples of indices from this group include fit statistics based on the Rasch IRT model, described by Wright and his colleagues (Wright & Panchapakesan, 1969; Wright, 1977). Another example of IRT-based statistics are the extended caution indices presented by Tatsuoka and Linn (1983; Tatsuoka, 1984) that make use of IRT models for deriving probability matrices for sample characteristics. Finally, the appropriateness statistics developed by Levine, Drasgow, and their associates are based on maximum likelihood functions using the three-parameter logistic IRT model (e.g., Levine & Rubin, 1979; Levine & Drasgow, 1982; Drasgow, Levine, & Williams, 1985) Several researchers have compared these IRT-based indices to determine which are the most useful based on specific criteria (Rudner, 1983; Birenbaum, 1985; Drasgow, Levine, & McLaughlin, 1987; 1991). 
This work has been largely statistical in nature, concentrating on the degree to which appropriateness indices can identify aberrant response patterns of different types in a standardized manner across ability levels. This work has provided evidence that the appropriateness indices developed by Drasgow, Levine, and colleagues are among the most accurate in identifying aberrant response patterns. Rudner (1983) reviewed several unstandardized appropriateness statistics and indicated that the index developed by Levine and Rubin (1979), l_0, had high hit rates for classification of aberrant response vectors across a variety of ability levels. One problem with unstandardized appropriateness indices, however, is that they tend to be correlated with total test score (Cortina, 1994), leading to a need for standardization of appropriateness indices. The standardized version of the l_0 index, l_z, has been identified as having the lowest overall misclassification rate for normal and aberrant response patterns among several IRT-based indices (Birenbaum, 1985). Drasgow, Levine, and McLaughlin (1987) compared IRT-based indices to each other and to an optimal appropriateness index developed for research purposes to model the highest detection rates possible for a specific type of aberrance (Drasgow & Levine, 1986; Levine & Drasgow, 1988). Their results indicated that the l_z index, as well as two of the standardized extended caution indices, had detection rates closest to that of the optimal index. Later research demonstrated that while all three of these indices were reasonably well standardized across a broad range of ability levels, the l_z index had slightly higher rates of detection for both aberrant and normal response vectors, particularly for spuriously high response vectors (Drasgow, Levine, & McLaughlin, 1991). Because of the results of the research comparing various appropriateness indices, this study focuses on the l_z statistic as an indicator of aberrance in a response pattern.

The l_z Statistic

The l_z index is the standardized version of the l_0 statistic first presented by Levine and Rubin (1979). The l_0 index is the logarithm of the likelihood function evaluated at the maximizing value of theta (Drasgow, Levine, & Williams, 1985). In other words, the l_0 statistic indicates the degree to which a given response pattern contributes to the maximum likelihood function for the three-parameter IRT model of the test. Small values indicate aberrance because the likelihood of the aberrant response pattern (i.e., very easy items incorrect or very difficult items correct) is low for the level of ability, as estimated by the total response vector and the group-determined item parameters (Drasgow, 1982a). Two steps are required for computing the l_0 statistic (Levine & Drasgow, 1983). The first step is to estimate person and item parameters by fitting an IRT model to the data. In most research on the l_0 statistic, the three-parameter logistic model is used (Birnbaum, 1968). The second step is to use the group-determined item parameters and the estimate of the person's ability on the construct of interest to calculate the l_0 statistic. The index is computed as the logarithm of the compound probability of the correct and incorrect responses given by the individual with a given trait level as estimated by the IRT model (Schmitt et al., 1994). This computation is given in the following formula:
l_0 = \sum_{i=1}^{n} \left\{ u_i \ln P_i(\hat{\theta}) + (1 - u_i) \ln\left[1 - P_i(\hat{\theta})\right] \right\},    (1)

where n refers to the number of items in the test, u_i represents the response of the individual to the ith item (1 = correct, 0 = incorrect), and P_i(\hat{\theta}) is the probability of a correct response to item i given the estimate of the examinee's trait level. Drasgow, Levine, and Williams (1985) point out that the distribution of the l_0 index is unstable across ability levels, with the mean value of the index increasing as ability increases. Other researchers have made similar observations by noting that the l_0 index, along with other unstandardized appropriateness indices, tends to be correlated with ability level (Birenbaum, 1985). Because of this finding, it is necessary to standardize the l_0 index to better ensure the stability of the distribution of this statistic across ability levels. A standardized version of the l_0 statistic, the l_z index, is presented by Drasgow, Levine, and Williams (1985). In order to reduce the dependence of the appropriateness index on the value of theta, the standardization process makes use of the assumption of local independence by assuming that ability as estimated by an IRT model equals the respondent's true ability. The formula for standardizing l_0 is given below:

l_z = \frac{l_0 - E(l_0)}{\left[\mathrm{Var}(l_0)\right]^{1/2}},    (2)

where E(l_0) and Var(l_0) are the expectation and variance of l_0 conditional on the estimated trait level. A multitest extension of this index, l_zm, aggregates l_0, E(l_0), and Var(l_0) across the j individual tests a respondent has taken before standardizing. The multitest version of l_z has the advantage of increasing the number of items included in calculating the appropriateness of a measure for a particular respondent, which would likely increase the accuracy of detecting aberrant response patterns. For longer tests (i.e., greater than 20 items), however, this is less of an issue (Reise & Due, 1991). Another potential limitation of the l_z statistic has to do with the use of estimated theta values rather than true theta values when calculating l_z. Calculation of the l_z statistic requires the assumption that theta as estimated by the IRT model is equal to the respondent's true ability on the construct of interest. This is seldom the case, and so estimates of theta are likely to be higher or lower than true theta levels, producing a distorted estimate of true test score inappropriateness using the l_z statistic. For individuals with aberrant response patterns, estimates of theta are likely to be especially distorted since the person's response vector does not fit the IRT model generated for the test. Because inaccurate estimates of theta produce inaccurate estimates of inappropriateness, it is likely that the l_z statistic works least well when it is most needed: when there are large numbers of aberrant responders in a data set or there is a large amount of aberrance within a single response vector. Although research has not determined the extent of this inaccuracy in data sets for actual or simulated cognitive ability test responses, Reise (1995) used data from a large-scale personality assessment and found that detection of aberrance using l_z is best when there are a large number of test items (i.e., greater than 30 items) and item difficulty levels spread throughout the range of thetas for the respondents.

Applications of Appropriateness Indices

Although the test appropriateness research comparing the aberrance detection rates for different indices is useful for determining which statistic performs the best under certain conditions, there are some notable gaps in this literature.
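Before turning to those gaps, the two-step computation described in the preceding section can be made concrete with a short sketch. This is not the LZCALC program used later in the study; it is a minimal, hypothetical reimplementation that assumes item parameters and an ability estimate are already in hand, and it uses the standard expressions for E(l_0) and Var(l_0) from the appropriateness-measurement literature along with the conventional 1.7 logistic scaling constant.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def l_z(responses, theta_hat, a, b, c):
    """Standardized appropriateness index l_z for a single response vector.

    responses : 0/1 item scores for one respondent
    theta_hat : ability estimate for that respondent
    a, b, c   : item discrimination, difficulty, and pseudo-guessing parameters
    """
    u = np.asarray(responses, dtype=float)
    p = p_3pl(theta_hat, np.asarray(a), np.asarray(b), np.asarray(c))
    q = 1.0 - p

    # Equation (1): log-likelihood of the observed pattern at the estimated theta
    l0 = np.sum(u * np.log(p) + (1.0 - u) * np.log(q))

    # Conditional expectation and variance of l_0 given the estimated theta
    e_l0 = np.sum(p * np.log(p) + q * np.log(q))
    var_l0 = np.sum(p * q * np.log(p / q) ** 2)

    # Equation (2): standardize; large negative values signal aberrance
    return (l0 - e_l0) / np.sqrt(var_l0)
```

Under the cutoff adopted in this study, a response vector would be flagged as aberrant when the returned value is -2.00 or less.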
One question that has been inadequately addressed by the extant literature concerns the practical applications of test appropriateness indices. It is unclear from the previous research what should be done once a subset of respondent's records are flagged as being aberrant. The few recommendations for applications of these statistics have usually focused on the use of indices as signals for the need to either retest aberrant respondents or to find alternate predictors or indicators of the construct of interest (Drasgow, 1982b; Hamisch, 1983; Rudner, 1983; Birenbaum, 1985; Drasgow & Guertler, 1987; Levine & Drasgow, 1988). For example, Hamisch (1983) and Birenbaum (1985) refer to the application of 14 appropriateness indices for identifying examinees whose test scores should be interpreted with extra caution. Levine and Drasgow (1988) include diagnosis of causes of low test scores as a potential application of appropriateness measurement. This focus may in part be explained by the fact that most of the work on test appropriateness indices has been done using widely known, psychometrically sound, tests (e. g., the SAT), or simulations of data from such tests. This limits the need for examining the eflects of aberrance on estimates of test properties. Another potential application of appropriateness indices that has been mentioned in previous research is the use of these statistics to flag aberrant responders in test development or validation samples (Levine & Drasgow, 1982; Rudner, 1983; Parsons, 1983; Birenbaum, 1985; Drasgow & Guertler, 1987). Drasgow & Guertler (1987) point out that when there are a large number of aberrant response vectors in a sample, any subsequent statistical analyses using these test data are likely to be distorted. Rudner (1983) suggests that item try-out and standardization samples can be improved by excluding response vectors from examinees with aberrant response patterns. When response vectors are flagged as aberrant and are subsequently removed from the development or validation sample, test statistics can be recalculated on a smaller, but "more appropriate" sample, in hopes of improving the accuracy of estimates of these statistics. Prior empirical research assessing this application of the 1, index has found less than optimal results for improving the validity of several selection tests by removing aberrant responders from the validation sample (Schmitt, Cortina, & Whitney, 1993; Schmitt et al., 1994). The objective of the present study is to determine the reason for the weak efi‘ects found in the prior research and to assess the conditions under which aberrance affects estimates of test characteristics. The 1, statistic is used in this study to indicate inappropriate test scores, both in order to remain consistent with the previous research 15 and because prior work indicates that this statistic is the most accurate index of test appropriateness currently available (e.g., Drasgow & Levine, 1986). This study expands on the prior research in several ways. The use of simulated data allows the modeling of difi‘erent types and magnitudes of aberrance in the data set and different test properties. This is done in order to better determine the conditions under which aberrance affects the estimation of test properties. The limitations of sample size in prior empirical work in this area are removed by using simulated data. In addition, the efiects of aberrance on test reliability as well as on test validity are examined. 
This is done due to the suggestion by Parsons (1983) that aberrance has a more direct effect on reliability than on validity. The present study also provides an opportrmity to identify the extent that problems with the aberrance detection rate of the 1, statistic may be responsible for the weak efl‘ects of aberrance on test properties in prior research. Comparisons of data sets with and without aberrance indicate the extent that true levels of aberrance distort estimates of the psychometric properties of tests. Comparisons of data sets with aberrance and data sets with response vectors receiving low 1, scores (i.e., below -2.00) removed indicate the extent that the 1, index is detecting the presence of varying levels and types of aberrance within a data set. Reise and colleagues identified some potential problems with the 1, statistic relating to test length and test composition. Although these are real concerns when using the 1, statistic to identify aberrance in real-world situations, they are not the direct focus of this study, and so attempts were made to reduce the impact of these problems. In order to minimize problems with the 1, statistic relating to test length, the data simulated for this study are based on a test length of 50 items, which exceeds the length requirements suggested by Reise and colleagues as necessary to maximize the likelihood of aberrance detection (Reise, 1995; Reise & Due, 1991). To address concerns with detection l 6 associated with test composition, test item difficulties cover a range that matches the range of thetas for test respondents. Reise (1995) reported that although the aberrance detection rate for 1, based on estimated theta is always lower than that for 1, based on true theta values, detection rates are maximized when a test contains items with a range of difficulties matching the range of thetas in a group of respondents. Research mestions Based on the objectives of the study discussed above, several research questions have been formulated to direct the analyses for determining the nature of the relation between aberrance and test properties. Analyses estimate and compare the efi‘ects of difl‘erent aberrance conditions that are expected to occur in actual testing situations on test reliability and validity. Additional analyses determine the accuracy of the 1, statistic for flagging aberrant response vectors for removal from the data set. Varying conditions of aberrance within a data set are simulated to model different testing conditions where respondents are more or less likely to engage in behaviors leading to aberrance in item response patterns. For example, in a concurrent criterion- related validation study where incumbents are not given any sort of incentive to "do their bes " on a selection test, there may be a large proportion of examinees who respond carelessly or cheat on the test. This would result in a large proportion of the validation sample having aberrant response patterns. On the other hand, in a situation such as a group of applicants taking an employment test where examinees are likely to be highly motivated to perform well and may be closely monitored to discourage cheating, there may be much lower levels of aberrance in the sample. The research questions given below are designed to probe the nature of the relationship between aberrance and estimates of test properties and to examine the utility of the 1, statistic as an indicator of aberrance. 
In order to explore these questions, samples with levels of aberrance varying from fairly low (10% aberrant) to high (50% aberrant) 1 7 are simulated, since it is expected that with a larger proportion of aberrance in a sample, there will be a greater effect on estimates of test properties. The "true values" of the validity and reliability of the test (i.e., calculated in data sets with no aberrance) will also be varied across samples, since it is expected that the effects of aberrance on estimates of test properties will vary depending on the "true" levels of validity and reliability of the test. In terms of aberrance detection rates using the 1, statistic, it is expected, based on arguments presented by Reise (1995), that relative detection rates (per cent of total aberrance detected) may be lower in samples with a large degree of aberrance. This pattern of results is expected because data sets with a greater degree of aberrance will likely also have the most overall distortion in theta estimates, which will result in distorted estimates of test inappropriateness. RQl: What is the magnitude of the effect of aberrant response vectors on the reliability and the validity of tests in samples with different proportions of aberrant response vectors? In this study, the magnitude of aberrance is manipulated to explore a range of possibilities where aberrance levels may afl‘ect test properties. Data sets were simulated with varying proportions of the respondents (10%, 20%, 30%, 40%, or 50%) classified as aberrant. These proportions were chosen to represent a range of aberrance that could occur under different testing conditions in order to determine how great aberrance must be to have discernible and practical efi‘ects on test properties. In this case, aberrance classification is based on obtaining a score of -2.00 or less on the 1, statistic, which is consistent with the cutoff used in previous research (Schmitt, Cortina, & Whitney, 1993; Schmitt et al., 1994). This manipulation is somewhat difl’erent from that used in previous research, however, because the focus of this study is on the effects of the overall magnitude of aberrance in a sample on test properties. 1 8 Previous research on aberrance detection rates for appropriateness indices was more concerned with aberrance within a single response vector and thus usually kept the proportion of aberrant response vectors in a sample constant and at relatively low overall levels (e.g., less than 10% of response vectors). Two types of aberrance manipulations are used, spuriously high manipulations and spuriously low manipulations, to determine whether there is a difl‘erential efi‘ect on test properties for difi‘erent types of aberrance. The first type of aberrance manipulation, the spuriously high test score, simulates a situation in which a respondent receives a test score that is higher than their ability on the construct of interest, which could be due to factors such as cheating or coaching prior to taking the test. The second type of aberrance manipulation, the spuriously low test score, simulates a situation in which a respondent receives a test score that is lower than their ability on the construct of interest, which could be due to factors such as alignment errors on an answer sheet or use of suboptimal test-taking strategies. 
It is not expected that the effects of these two types of aberrance on test properties will difl‘er greatly, but both are used in this study to remain consistent with previous research that examines both types of aberrance. In addition to the data sets with a single type of aberrance manipulation (either low or high), a data set that includes a mix of both spuriously low aberrant response vectors and spuriously high response vectors was created. In this data set, a total of 30% of the response vectors are classified as aberrant, with 15% classified as spuriously low aberrant and 15% classified as spuriously high aberrant. This mixed data set was created in order to determine whether having both types of aberrance in a single sample would amount to a cancellation of the effects of each type of aberrance on test properties. In real world testing situations, it is expected that different respondents will exhibit difi‘erent types of aberrance. Although it is not expected that difl’erent types of aberrance will have predictably different efl‘ects on test properties, it is possible that within a single data set, it 1 9 may be difiicult to determine the effects of aberrance on reliability and validity if there are a mixture of ways in which true scores are distorted. For each type of aberrance manipulation, a single level of aberrance is simulated, with 30% of the item responses in a given response vector (i.e., a simulated respondent) changed to reflect aberrance. This approach differs somewhat fi'om previous research that varied the amount of aberrance within single response vectors, because the focus of this study is on overall aberrance within a sample of test respondents rather than within a single response vector. The main reason to vary aberrance within a response vector is to examine and compare the detection rates for different indices, which was the purpose of most of the previous research on appropriateness measurement. Based on prior research (e.g., Drasgow, Levine, & McLaughlin, 1987) indicating the superior detection rates of the 1, index compared to other statistics, particularly for large magnitudes of aberrance (e.g., 30% aberrant item responses within a single response vector), this magnitude of aberrance within a response vector was used for all aberrance manipulations. RQ2: What is the magnitude of the effect of aberrant response vectors on the reliability and validity of tests with different "true values" of reliability and validity (i.e., values calculated in samples with no aberrance)? Three levels of reliability (.70, .80, or .90) and three levels of validity (.15, .30, or .45) are simulated in difl‘erent data sets. These levels were chosen to represent values ranging from those typically found in personnel research to values that are reasonably higher than those typically observed (e.g., Reilly & Chao, 1982; Schmitt, Gooding, Noe, & Kirsch, 1984; Schmidt, Ones, & Hunter, 1992). This will allow some discussion of the likelihood of finding an effect of aberrance on test properties in typical testing situations. The estimates of reliability and validity are computed on a data set of response vectors before any aberrance is introduced, and so represent a "true value" for estimates of reliability and validity for the data set. The value of these test properties will be varied 2 0 across data sets to better understand the effects of aberrance on estimates of these properties across a variety of testing conditions. 
The variation of “true values” of test properties also provides a chance to explore any potential relationship between the effects of aberrance on reliability and the efi‘ects of aberrance on validity (e.g., if the effects of aberrance on validity can be reliably predicted from the effects of aberrance on reliability). RQ3: What is the magnitude of the effect of using the 1z statistic to identify aberrance on the degree of aberrance detected in samples with different proportions of aberrant response vectors? For a given level of aberrance (10%, 20%, 30%, 40%, or 50%), comparisons of test properties are made among three data sets: a data set with no abenance (i.e., before aberrance is introduced), the aberrant data set, and the same data set with response vectors flagged as being abenant using the 1, statistic removed. The comparison between the aberrant data set and the data set with aberrant response vectors removed will provide an estimate of the degree that the 1, statistic is identifying all the aberrant response vectors in a data set. The aberrance detection rate of the 1, statistic has been examined in previous research, but on a smaller scale, with only a single, usually low, level of aberrance present in the data set. Because of this, there was no way to determine from the previous research whether the 1, statistic is differentially efl‘ective at identifying aberrance depending on the degree of aberrance in the data set. The comparisons of reliability and validity among all three data sets will provide an estimate of the degree that using the 1, statistic to flag aberrant response vectors for removal is an effective way to determine the influence of aberrance on test properties. Ifthe reliability and validity of the data set with aberrance removed are more similar to the aberrance-free data set than to the aberrant data set, then the 1, statistic may be a satisfactory indicator of aberrance for the purpose of improving estimates of test 21 properties. This is because, in this scenario, aberrance removal based on the 1, statistic results in a data set relatively free of aberrance, or at least similar to an aberrance-free data set. Ifthe reliability and validity of the data set with aberrance removed are more similar to the aberrant data set than to the aberrance-free data set, the 1, statistic is probably not a satisfactory indicator of aberrance. This is because, in this scenario, not enough of the aberrant response vectors have been flagged for removal in order to improve the accuracy of estimates of test properties to adequately reflect what would have occurred if there were no aberrance in the data set. METHOD Data Generation The data simulation procedure for this study is based on that first used by Levine and Rubin (1979) in studying appropriateness measurement and later used by their colleagues and by other researchers (e.g., Levine & Drasgow, 1982; Rudner, 1983; Noonan, Boss, & Gessaroli, 1992). Although the use of simulated data may not allow for the consideration of all issues arising in the collection of actual data, it does allow for the manipulation and examination of certain conditions in order to determine whether eflects of interest can be demonstrated (Rudner, 1983). For example, by using simulated data, samples of response vectors with no aberrance can be created for comparison purposes, whereas this can not be guaranteed when collecting data in actual testing situations. 
Because the purpose of this study is to determine the effects of varying levels of aberrance on estimates of the psychometric properties of a test, the use of simulated data was deemed appropriate. As described in the discussion of the research questions, the magnitude and type of aberrance were varied across data sets. Five different proportions of aberrance within a sample (10%, 20%, 30%, 40%, or 50%) and two types of aberrance, spuriously high test scores and spuriously low test scores, were simulated. Levels of reliability (.70, .80, or .90) were also varied across data sets. For each data set, three criterion scores were generated, which varied in the magnitude of correlation with total test score (.15, .30, or .45). The variation in correlation represented a variation in the validity of the test. These variations resulted in the simulation of 30 data sets (five proportions of aberrant respondents X two types of aberrance X three levels of reliability) with a single type of aberrance, each of which included three criterion scores representing three levels of validity of the test. Three data sets with no aberrance were created (one for each of the three levels of reliability) to use for comparisons with the aberrant data sets to answer the research questions proposed in this study. In addition, three data sets (again, one for each of the three levels of reliability) with the mixed type of aberrance described earlier (i.e., 30% total aberrance, consisting of 15% spuriously high response vectors and 15% spuriously low response vectors) were created to ensure that the two types of aberrance did not cancel each other out in their effects on test properties or in the calculations of the l_z index. As with the data sets representing a single type of aberrance, the data sets with no aberrance or with mixed types of aberrance also included three criterion scores representing the three levels of validity that were of interest in this study. The variety of testing conditions that were simulated and examined allows some consideration of the plausibility of encountering actual data sets with a level of aberrance that has a practically significant effect on estimates of test properties, and some consideration of the conditions under which this is most likely to occur. Summary statistics for all of the tests simulated based on these varying conditions of aberrance (or lack of aberrance) and varying levels of test reliability are given in Appendix A. Other parameters were held constant across the different data sets in order to minimize variation in factors that are not of central interest in this study. Each data set contains 10,000 response vectors and there are no omissions of data (i.e., a simulation of test conditions under which all respondents answer all items). Although there are polychotomous item response models available that can take omission into account for appropriateness measurement (Drasgow, Levine, & Williams, 1985), the additional modeling and calculations involved in using these models are beyond the scope of this study. The number of test items (50) and the "true values" of item parameters (i.e., based on a sample of response vectors with no aberrance) were also held constant to simulate the conditions under which different groups of respondents are taking the same test. Simulated item responses are all dichotomous.
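The full design can be enumerated directly. The sketch below simply lists the simulated data-set conditions described above; the condition labels are illustrative and are not the file or program names used in the study.

```python
from itertools import product

proportions = [0.10, 0.20, 0.30, 0.40, 0.50]
aberrance_types = ["spuriously_low", "spuriously_high"]
reliabilities = [0.70, 0.80, 0.90]
validities = [0.15, 0.30, 0.45]   # three criterion scores generated within every data set

# 5 proportions x 2 types x 3 reliabilities = 30 single-type aberrant data sets
single_type = list(product(proportions, aberrance_types, reliabilities))

# plus 3 aberrance-free data sets and 3 mixed data sets (15% low + 15% high)
aberrance_free = [("none", 0.0, r) for r in reliabilities]
mixed = [("mixed", 0.30, r) for r in reliabilities]

print(len(single_type), len(aberrance_free), len(mixed))   # 30 3 3 -> 36 data sets in all
```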
The amount of aberrance within a single response vector (for those data sets with any level of overall aberrance) was held constant at 30%, or 15 out of 50 items reflecting aberrant responses within a single response vector. Holding these factors constant reduced the number of data sets that were needed and the complexity of comparisons involved in examining the research questions posed earlier. The three-parameter logistic IRT model was used to estimate item and person parameters (Birnbaum, 1968). The item parameter distributions are based on ranges of parameter values used in previous research by Noonan, Boss, and Gessaroli (1992). Item difficulty was generated from a uniform distribution with a mean of 0 and a standard deviation of 1.00, and the pseudo-guessing parameter was generated from a uniform distribution with a mean of 0.12 and a standard deviation of 0.03. The mean of the distribution of the item discrimination parameter was varied in order to manipulate the level of reliability of the test. To simulate a test with a reliability of .90, the mean of the item discrimination parameter was set at 0.80; for a reliability of .80, the mean was set at 0.47; for a reliability of .70, the mean was set at 0.40. In all cases, the item discrimination parameter was generated from a uniform distribution with a standard deviation of 0.20. Uniform distributions were used to generate item parameter values in order to be consistent with the previous literature and to increase variation across the range of values for the item parameters. This was particularly important in the case of item difficulty, because research has suggested that l_z detection rates may be higher when item difficulties cover a broader range of values (Reise, 1995). Average item parameter values for each of the simulated tests are given in Appendix A. Although there is some random variation in average item difficulty across the different levels of reliability for the tests, this was not expected to cause tremendous problems with interpretation of effects because this study was more concerned with patterns of relative change in test properties (including person and item parameters from the item response theory model) rather than with absolute values for these properties. The "true values" (i.e., as estimated in a data set with no aberrance) of the distribution of the ability parameter (theta) were also held constant across data sets, using a normal distribution of values with a mean of 0 and a standard deviation of 1.00. Lastly, the criterion score distribution was held constant, using a normal distribution with a mean of 3.00 and a standard deviation of 1.00. Holding these two distributions constant simulates a scenario that is likely in many situations, in which respondents vary in their ability on the construct of interest and in their scores on the criterion measure, but over time or across similar settings, group norms (i.e., the characteristics of the distribution of test scores or criterion scores) are relatively similar. The data generation process for each data set began with the simulation of a set of normal response vectors (i.e., with no aberrance introduced). Response vectors for normal examinees were simulated using the IRTDATA program (Johanson, 1992). This program makes use of three separate random number generator seeds for the generation of item parameters, person-ability parameters, and response vectors.
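A minimal sketch of this generation step is shown below; it is not the IRTDATA program itself. Translating a mean/standard-deviation specification into uniform bounds and using the 1.7 logistic scaling constant are assumptions of the sketch rather than details reported in the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def uniform_mean_sd(mean, sd, size, rng):
    """Uniform draws parameterized by mean and standard deviation:
    a uniform on [mean - sd*sqrt(3), mean + sd*sqrt(3)] has exactly that mean and SD."""
    half_width = sd * np.sqrt(3.0)
    return rng.uniform(mean - half_width, mean + half_width, size)

n_persons, n_items = 10_000, 50
disc_mean = 0.80   # 0.80, 0.47, or 0.40 for target reliabilities of .90, .80, or .70

# Item parameters drawn from the distributions described above
a = uniform_mean_sd(disc_mean, 0.20, n_items, rng)   # discrimination
b = uniform_mean_sd(0.00, 1.00, n_items, rng)        # difficulty
c = uniform_mean_sd(0.12, 0.03, n_items, rng)        # pseudo-guessing

# Person abilities: standard normal thetas
theta = rng.normal(0.0, 1.0, n_persons)

# Dichotomous responses under the three-parameter logistic model
D = 1.7
p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta[:, None] - b[None, :])))
responses = (rng.uniform(size=p.shape) < p).astype(int)   # 10,000 x 50 matrix of 0/1 scores
```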
Separate output files are created for raw data (response vectors), respondent characteristics (true score, theta, and number correct), and item characteristics (discrimination, difficulty, and pseudo-guessing parameters). Input parameters for this program include the number of respondents, the number of test items, the desired scaling factor, and the distribution type (uniform or normal), mean, and standard deviation of the item and person parameters. Criterion scores were generated by creating a bivariate distribution with a specified correlation (.15, .30, or .45) between the total test scores in the aberrance-free data set and a random normal deviate. Total test score was calculated as the sum of the dichotomous item responses, scored one to simulate a correct response and zero to simulate an incorrect response. The following formula was used to calculate the criterion score:

C = y \sqrt{1 - r_{xy}^2} + r_{xy} x,    (6)

where C refers to the criterion score, y refers to a random number generated with the desired distribution (here, normal), mean (here, 3.00), and standard deviation (here, 1.00) of the criterion, x refers to the total test score, and r_{xy} refers to the desired correlation between total test score and criterion score (validity). Once the data sets with no aberrance were created (one for each level of reliability), these data sets were used as the basis for creating the data sets with aberrance. Aberrant response vectors were simulated using the BIDEV program (Peters, 1995). This program randomly selects a proportion, specified by the researcher, of the no-aberrance response vectors from the data file created by IRTDATA and changes these vectors to reflect the type and level of aberrance specified. The normal response vectors are then replaced with changed response vectors to create a data file with a specified proportion, type, and level of aberrance. To simulate a response vector with the type of aberrance reflecting a spuriously low test score, 30% of the responses in a normal (i.e., no-aberrance) response vector are randomly sampled. A random number generator is used to simulate guessing over a specified number of option choices (c = number of option choices). No matter what the original item response, the sampled item is rescored with a 1/c chance of being correct (scored as 1) and a (c - 1)/c chance of being incorrect (scored as 0) to simulate a random item response. In this study, items with 5 option choices were simulated, so sampled items were rescored as correct with a probability of .20 (1/5) and rescored as incorrect with a probability of .80 (4/5). To simulate a response vector with the type of aberrance reflecting a spuriously high test score, again 30% of the responses in a no-aberrance response vector are randomly sampled. No matter what the original item response, the sampled item is rescored as correct. Both of these manipulations result in varying degrees of change to a response vector depending on the original distribution of correct and incorrect responses (under the no-aberrance condition) for the items that are selected to be changed. This is because the aberrance being simulated in this study is random, so there is no pattern based on item or respondent characteristics as to which items reflect aberrance. This is consistent with previous research, which also simulated random aberrance within a data set as well as within a single response vector.
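The criterion-generation and aberrance manipulations just described can be sketched as follows. This is a reimplementation sketch, not the BIDEV program; standardizing the total-score component in equation (6) so that the resulting correlation lands near the target r_xy is an assumption of the sketch rather than a reported detail.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def criterion_scores(total_scores, r_xy, mean=3.0, sd=1.0, rng=rng):
    """Criterion generated per equation (6): C = y * sqrt(1 - r_xy**2) + r_xy * x."""
    x = (total_scores - total_scores.mean()) / total_scores.std()   # standardized total score
    y = rng.normal(mean, sd, size=x.shape)                          # random normal component
    return y * np.sqrt(1.0 - r_xy ** 2) + r_xy * x

def make_aberrant(vector, kind, prop_items=0.30, n_options=5, rng=rng):
    """Rescore 30% of the items in one 0/1 response vector as spuriously low or high."""
    v = vector.copy()
    n_change = int(round(prop_items * v.size))
    items = rng.choice(v.size, size=n_change, replace=False)
    if kind == "spuriously_high":
        v[items] = 1                                                 # selected items rescored correct
    elif kind == "spuriously_low":
        # random guessing over n_options choices: 1/n_options chance of a correct response
        v[items] = (rng.uniform(size=n_change) < 1.0 / n_options).astype(int)
    return v
```

Applying make_aberrant to a randomly selected 10% to 50% of the simulated response vectors would reproduce the sample-level aberrance proportions described earlier.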
There is a higher probability of large change in a response vector under the spuriously high manipulation than under the spuriously low manipulation, however, because it is more likely that a response will be changed from incorrect to correct in the spuriously high manipulation than from correct to incorrect in the spuriously low manipulation. This is because in the spuriously high manipulation all incorrect responses that are randomly selected for change are changed to correct. In the spuriously low manipulation, the correct responses that are randomly selected for change have only an 80% chance of being changed to incorrect and a 20% chance of remaining correct (based on a simulation of random responding over five option choices).

Analyses

Analyses consist of the comparison of reliabilities and validities calculated on three types of data sets: a data set with no aberrance (i.e., before aberrance is introduced), an aberrant data set, and the same data set with response vectors flagged as aberrant using the l_z statistic removed. In addition, comparisons are made between aberrant data sets and data sets with aberrance removed using the l_z statistic in order to determine the precision of detection based on the l_z statistic for varying levels of aberrance in a data set. After aberrance was introduced into a data set, item and person parameters for the three-parameter logistic IRT model were estimated using the BILOG program (Mislevy & Bock, 1990). These parameters are used to calculate values for the l_z statistic. It is not necessary to use BILOG on the no-aberrance samples because the IRTDATA program provides estimates of the item and person parameters of the raw data as part of the output. After an aberrance manipulation, however, these parameters are no longer useful for calculating l_z statistic values for the changed response vectors, so BILOG was used to obtain item and person parameter values for the aberrant data sets. After these values are obtained, either by using IRTDATA or BILOG, the LZCALC program (Peters, 1993) was used to calculate l_z statistic values for both the no-aberrance and the aberrant data sets. The output from BILOG is transformed into an input file for the LZCALC program using the XTRACT program (Peters, 1994). The l_z statistic was calculated for each of the response vectors in the no-aberrance data sets in order to ensure that there are no aberrant response vectors generated using the IRTDATA program. This was not expected to be a problem, and this analysis is merely a check to compare the magnitude of aberrance in the aberrance-free samples with that in the aberrance-manipulation samples. Although the level of aberrance in an aberrant sample is set using the BIDEV program, the calculation of the l_z statistic in the aberrant samples serves as a check on the detection rates of this statistic. In addition, correlations were calculated between l_z scores and the total test scores and criterion scores for each of the data sets. These analyses were included in order to ensure that scores on the l_z statistic are not related to ability on the construct of interest measured by the test or to the criteria used to validate the test. This was also not expected to be a problem, given the research showing that the l_z statistic is well standardized across levels of theta, but these results were included as a check on the l_z statistic.
Once the l_z statistic was calculated for each of the data sets, the output files from LZCALC were merged with the raw data files containing item responses, total test scores, and criterion scores. Coefficient alpha (i.e., the indicator of the reliability of the test) and the correlation between the sum of item responses and the criterion score (i.e., the validity of the test) were calculated. Although reliability levels were manipulated and the criterion score was generated to be correlated at a certain level with total test score in the aberrance-free data sets, reliability and validity were calculated on the no-aberrance samples because there is always some random deviation of the data from the exact manipulation values. The reliability (coefficient alpha) and validity (correlation between total test score and criterion score) for the aberrant samples were calculated to determine the effects of aberrance on estimates of these test properties.

Next, for each of the aberrant data sets, response vectors flagged as aberrant were deleted from the data analysis, and reliability and validity were recalculated on the smaller, but "more appropriate," data set. This recalculation serves as a check on the use of the l_z statistic to flag aberrant response vectors for removal from analyses in order to obtain more accurate estimates of test properties. If there is not much change between the reliability and validity values in the aberrant and aberrance-removed data sets, then the l_z statistic may not be very useful for this purpose.

These analyses result in estimates of test properties and in l_z detection rates for the different data sets that can be compared and used to answer the three research questions posed earlier. The comparison of test statistics calculated on the aberrant data sets with test statistics calculated on the data sets without aberrance provides an indication of the extent that aberrance distorts estimates of the psychometric properties of tests. This comparison is used to answer the first research question, which asks about the effects of aberrance on test properties. The comparison of test statistics calculated on data sets with varying proportions and types of aberrance affords an opportunity to determine the extent of aberrance needed in a sample to have an effect on test properties, which is also relevant to the first research question. This comparison is also used to check whether there are differential effects of different types of aberrance on test properties, though these are not expected to occur unless there is a major difference in the detection rate of the l_z statistic for different types of aberrance.

The comparison of estimates of reliability and validity between data sets with different "true values" of these statistics (i.e., estimates from data sets with no aberrance) is used to answer the second research question about the nature of the effects of aberrance on different levels of reliability and validity. This comparison is also relevant for determining the nature of the relationship of effects of aberrance on reliability to effects of aberrance on validity. The comparison between aberrant data sets and the same data sets with aberrant response vectors removed based on l_z statistic values indicates the degree that the l_z statistic is detecting aberrance within the data set.
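A minimal sketch of the reliability and validity calculations just described, including the recalculation after response vectors flagged by l_z ≤ -2.00 are removed, is given below. It assumes the item responses, criterion scores, and l_z scores have already been merged into aligned arrays, as in the merged files described above; the function names and interface are illustrative rather than the original analysis code.

```python
import numpy as np

def coefficient_alpha(items):
    """Coefficient alpha for an (n_respondents x n_items) matrix of 0/1 responses."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)

def test_properties(items, criterion, lz=None, cutoff=-2.00):
    """Return (reliability, validity). If l_z scores are supplied, response
    vectors with l_z <= cutoff are removed before the estimates are computed."""
    items, criterion = np.asarray(items), np.asarray(criterion)
    if lz is not None:
        keep = np.asarray(lz) > cutoff          # drop flagged (aberrant) vectors
        items, criterion = items[keep], criterion[keep]
    total = items.sum(axis=1)                   # total test score
    validity = np.corrcoef(total, criterion)[0, 1]
    return coefficient_alpha(items), validity
```

Calling test_properties once without l_z scores and once with them gives the aberrant and aberrance-removed estimates that are compared against the no-aberrance values in the analyses that follow.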
A comparison of detection across different levels and types of aberrance is used to determine whether the l_z statistic might be differentially effective in different testing situations (characterized by different levels and types of aberrant responding). These comparisons are relevant to the third research question, which concerns the degree of existing aberrance in a data set (in this case, set using the BIDEV program) that the l_z statistic actually detects. In addition, comparing reliability and validity between all three data sets (aberrance-free, aberrant, and aberrance-removed) is also relevant to the third research question because the similarity of estimates of these test properties from aberrance-removed data sets to the other two types of data sets will determine the utility of the l_z statistic for increasing the accuracy of estimates of test properties. If the aberrance-removed estimates are more similar to the aberrant estimates, then the l_z statistic is not very useful. If the aberrance-removed estimates are more similar to the aberrance-free estimates, then using the l_z statistic to flag aberrant response vectors for removal from data analyses may be a good way to increase the accuracy of those analyses.

RESULTS

Manipulation Checks

Several analyses were conducted to ensure that the manipulations used in the data generation process had the desired effects. The first set of analyses centers on the aberrance-free samples, to ensure both that the desired reliability and validity values were achieved and that these data sets were truly free of aberrance, as identified by the l_z statistic. Table 1 contains the calculated coefficient alphas and correlations between total test score and criterion score for each of the aberrance-free data sets (one for each desired level of reliability). The table also includes the number of response vectors in each of the data sets that were flagged as being aberrant based on l_z scores.

Table 1 - Summary Statistics for Aberrance-Free Data Sets

            Actual           Actual Validity             # Vectors
            Reliability    CRIT15   CRIT30   CRIT45    w/ l_z ≤ -2.00
RELI70        .71            .16      .31      .45           0
RELI80        .82            .14      .31      .45           0
RELI90        .90            .15      .30      .45           0

Note. CRIT15 = criterion score generated to be correlated at .15 with total test score, CRIT30 = criterion score generated to be correlated at .30 with total test score, and CRIT45 = criterion score generated to be correlated at .45 with total test score. RELI70 refers to the set of item responses that was generated to have a reliability of .70, RELI80 refers to the set of item responses that was generated to have a reliability of .80, and RELI90 refers to the set of item responses that was generated to have a reliability of .90.

For each of the three data sets, the obtained reliability and validity are adequately close to the desired values (.70, .80, and .90 for reliability; .15, .30, and .45 for validity). These obtained values from the no-aberrance data sets were used in all analyses comparing aberrant and aberrance-free data sets to determine the effect of aberrance on test properties. The last column of Table 1 contains the number of response vectors within each data set that received an l_z score of -2.00 or less. As indicated in the table, none of the response vectors in any of the three aberrance-free data sets received an l_z score indicating inconsistency between item responses and estimated ability level.
In addition to analyzing the aberrance-free samples to ensure that they were actually free of aberrance, additional analyses were performed to see whether 1, scores were related to scores on the test or criteria simulated in this study. One of the advantages of the 1, statistic that has been described in previous research is that it is not related to ability levels on the construct of interest and so it can provide a measure of aberrance within a response vector that is standardized across theta levels (Drasgow & Levine, 1986; Levine & Drasgow, 1988). In order to be able to generalize the results of this study to other research on appropriateness measurement, it was necessary to demonstrate that 1, scores are not related to the test or criterion scores used in this study. Tables 2, 3, and 4 present correlations of 1, scores with total test score and the three criterion scores for data sets with a reliability near .70 (Table 2), with a reliability near .80 (Table 3), and with a reliability near .90 (Table 4). These results are broken down by reliability level in order to simplify the presentation. 3 4 Table 2 - Correlations with 1, Scores for Data Sets with Reliability near .70 Low Aberrance % Aberrance TOT cam 5 CRIT30 CRIT45 10 -.06 -.02 -.04 -.05 20 -.08 -.03 -.07 -03 30 -.1 1 -.03 -.07 -.09 40 -.10 -.03 -.08 -.11 50 -.09 -.04 -.08 -.12 PM Aberrance % Aberrance TOT cams CRIT30 CRIT45 10 -.05 .00 -.00 .01 20 -.O6 .00 .01 .01 30 -.02 .02 .03 .05 40 -.02 .02 .03 .06 50 -.02 .03 .05 .08 Mixed Aberrance % Aberrance TOT cams CRIT30 CRIT45 15 Low, 15 High -.07 -.02 -.03 -.03 Note. % Aberrance refers to the overall percent of aberrance present in the data set, or the percentage of response vectors out of the total data set that are randomly selected for the aberrance manipulation. TOT = total test score. 3 5 Table 3 - Correlations with 1, Scores for Data Sets with Reliability near .80 Low Aberrance % Aberrance TOT CRIT 15 CRIT30 CRIT45 10 -.06 -.00 -.05 -.05 20 -.06 -.02 -.05 -.0‘7 30 -.05 -.02 -.06 -.O9 40 -.10 -.02 -.09 -.ll 50 -.06 -.02 -.08 -.09 High Aberrance % Aberrance TOT CRITl 5 CRIT30 CRIT 45 10 -.03 .02 .00 .02 20 -.04 .03 .02 .05 30 -.02 .02 .03 .06 40 .00 .04 .05 .09 50 -.00 .05 .05 .10 Mixed Aberrance % Aberrance TOT CRIT 15 CRIT30 CRIT45 15 Low, 15 High -.04 .01 -.02 -.01 Note. % Aberrance refers to the overall percent of aberrance present in the data set, or the percentage of response vectors out of the total data set that are randomly selected for the aberrance manipulation. TOT = total test score. 3 6 Table 4 - Correlations with 1, Scores for Data Sets with Reliability near .90 Low Aberrance % Aberrance TOT CRIT15 CRIT30 CRIT45 10 -.06 -.02 -.05 -.07 20 -.06 -.03 -.04 -.08 30 -.08 -.03 -.O8 -.11 40 -.07 -.03 -.09 -.12 50 -.08 -.05 -.09 -.12 flgh Aberrance % Aberrance TOT CRIT15 CRIT30 CRIT45 10 -.05 .01 .01 .01 20 -.07 .01 .02 .03 30 -.05 .02 .05 .06 40 -.08 .03 .04 .07 50 -.02 .03 .07 .11 Mixed Aberrance % Aberrance TOT CRITl 5 CRIT30 CRIT45 15 Low, 15 High -.06 -.02 -.01 -.03 Note. % Aberrance refers to the overall percent of abenance present in the data set, or the percentage of response vectors out of the total data set that are randomly selected for the aberrance manipulation. TOT = total test score. 3 7 The tables show that 1, scores are not highly related either to the test scores or to the criterion scores simulated in this study. 
There was no particular pattern of correlations with respect to the level of reliability of the test, but there was a slight trend toward higher correlations with l_z scores for data sets with larger amounts of aberrance (though this did not occur in all cases). All of the correlations between total test scores and l_z scores are low and negative, with none of the values exceeding a magnitude of -.11 (or about 1% shared variance between total test score and l_z score). Correlations between test scores and l_z scores tended to be slightly higher for data sets that were part of the spuriously low test score manipulation than for data sets that were part of the spuriously high test score manipulation, indicating that the type of aberrance present in a data set may affect the value of the l_z statistic. Overall, these correlations between total test score and l_z are lower than those found in other studies (e.g., Birenbaum, 1985). This means that, in this study, the use of l_z scores to identify aberrant response vectors is not confounded by the theta level represented by the response vector.

Correlations between criterion scores and l_z scores differed based on whether the aberrance manipulation was for a spuriously high test score or for a spuriously low test score. Correlations between criterion scores and l_z scores in the spuriously low manipulation condition were all low and negative. Correlations between criterion scores and l_z scores in the spuriously high manipulation condition were either zero or low and positive. Because criterion scores (which were computed in the aberrance-free data sets) were not altered in the aberrance manipulations, this provides additional evidence that the two types of aberrance manipulations (spuriously low or spuriously high) affect the value of the l_z statistic for a given amount of aberrance in a data set. For both types of aberrance, correlations between criterion score and l_z score tended to be higher for criterion scores that were generated to have a higher correlation with total test score. Overall, these correlations are low enough to suggest that using the l_z statistic to identify aberrance does not result in artificially inflating or reducing validities due to the removal of "outliers" (with extreme criterion scores) from the analysis.

Research Questions 1 and 2: Effects of Aberrance on Reliability

The first two research questions concern the effects of different amounts and types of aberrance on reliability and validity, with the first addressing the effects of different amounts of aberrance in a data set and the second addressing the effects of aberrance on different "true" levels of reliability and validity (as estimated in data sets with no aberrance). The results concerning these two questions are presented together, first for reliability and then for validity.

Table 5 shows the levels of reliability obtained in each of the aberrant data sets and the amount of change in reliability that occurred because of the introduction of a particular amount of aberrance into the data set (i.e., the difference between reliability in the data set after aberrance was introduced and reliability in the aberrance-free data set). Results are broken down separately for the spuriously low and spuriously high manipulations because there is some evidence from the results of the manipulation checks to suggest that the two types of aberrance manipulations may affect the data differently.
Results are also presented separately for each of the three levels of reliability that were simulated in this study (.70, .80, and .90), in order to determine whether there is a relationship between the “true level” of reliability of the test and the efl‘ect of aberrance on estimates of reliability. 3 9 Table 5 -— Change in Reliability Due to the Introduction of Aberrance Low Aberrance RELI70 REL180 RELI90 % Aberrance rx, Arxx rx, Ar,“ rn Afr: No Aberrance .71 .82 .90 10 .71 0 .82 0 .89 -.01 20 .71 0 .82 0 .89 -.01 30 .71 0 .81 -.01 .88 -.02 40 .70 -.01 .80 -.02 .88 -.02 50 .69 -.02 .79 —.03 .87 -.03 High Aberrance RELI70 RELI80 RELI90 % Aberrance r,“ Ar,EL r,, Ar.“ rn Ara N o Aberrance .71 .82 .90 10 .73 .02 .83 .01 .90 0 20 .74 .03 .83 .01 .90 0 30 .75 .04 .83 .01 .90 0 40 .75 .04 .82 0 .89 -.01 50 .74 .03 .82 0 .88 -.02 Mixed Aberrance RELI70 RELI80 RELI90 % Aberrance r,“ Are, r,“ Arx, rn Arxx__ N o Aberrance .71 .82 .90 15 Low, 15 High .70 -.01 .82 0 .89 -.01 Note. r,“ refers to the obtained coeficient alpha in the aberrant sample. A rx, refers to the change in coemcient alpha that is due to the introduction of aberrance (i.e., the difference between coefficient alpha values for the aberrant and abenance-free samples). 4 0 As indicated in the table, the introduction of aberrance did not have a very large efi‘ect on the reliability of the data sets in this study. The largest changes resulted in an increase of 4 points from an aberrance-flee reliability of .71 in the data sets with the 30% spuriously high aberrance manipulation and with the 40% spuriously high aberrance manipulation. The largest negative changes resulted in a decrease of 3 points from aberrance-free reliabilities of .82 and .90 in the data sets with the 50% spuriously low aberrance manipulation. Even though the changes were not large, there are several trends worth noting, particularly among the data sets with a spuriously low manipulation. For the spuriously low manipulation data sets, the amount of change in reliability tended to increase as the amount of aberrance in the data set was greater. The amount of change was also greater for data sets that had a higher “true value” of reliability. For the spuriously high manipulation data sets, the trend of change was less clear, although it seems that the change in reliability becomes less positive or more negative as the amount of aberrance in the data set increases for data sets with a “true value” of reliability of .82 or .90. There was also a trend for the change in reliability to become less positive or more negative as the “true value” of reliability increased. The positive efl‘ect of the spuriously high aberrance manipulation on reliability may seem counterintuitive at first. The expectation, which was at least somewhat confirmed with the results of the spuriously low manipulation, might be that aberrance would lower reliability because it is introducing some type of error variation into the data set. This appears to be what is occurring in the spuriously low manipulation, because this manipulation simulates random responding without regard to item or person parameters. In contrast, the spuriously high manipulation introduces a change that may simulate a reduction in error variation. This is because in the spmiously high manipulation, item responses are changed to correct, regardless of the initial item 41 response in the aberrance-free data set. 
The larger number of correct responses in the data set results in a greater degree of consistency within a response vector and thus may result in a slightly higher reliability value, if reliability is calculated using an internal consistency statistic such as coefficient alpha. As the amount of aberrance and/or the "true value" of reliability in a data set increases, however, the spuriously high manipulation may reach a ceiling effect or a curvilinear effect on the internal consistency of the data. There is also evidence of this sort of trend in Table 5, because the change in reliability in the high aberrance manipulation data sets seems to become less positive and/or more negative as aberrance or "true" reliability increases. Due to the small effects observed in this study, however, this explanation should be taken only as a potential hypothesis for what may be occurring.

The data set with a mix of spuriously high and spuriously low response vectors appeared to more strongly resemble the spuriously low manipulation because of the slight negative effect of this form of aberrance on reliability. Because the effects were so small, it is difficult to determine whether the two types of aberrance canceled each other out in their effect on reliability, although there is some support for this in the largely negative effect of the spuriously low manipulation on reliability and the largely positive effect of the spuriously high manipulation on reliability.

These results can be used to discuss the impact of aberrance on reliability in terms of the first two research questions. To address the first research question, it seems that both the amount and the type of aberrance have a small effect on test reliability. For spuriously low aberrance, as the amount of aberrance increases, the change in reliability becomes more negative. For spuriously high aberrance, as the amount of aberrance increases, the change in reliability has a tendency to become less positive and/or more negative, but this trend is less clear than that for the spuriously low aberrance manipulation.

With respect to the second research question, there was also some evidence that the effect of aberrance differed based on the "true value" of reliability in the sample, particularly when spuriously low aberrance was introduced into a data set. As the "true value" of reliability increased, there was a tendency for a greater amount of change in reliability after the introduction of spuriously low aberrance. When spuriously high aberrance was introduced, there was a slight trend for less positive and/or more negative change in reliability for data sets with higher "true values" of reliability.

Although aberrance had only a small effect on reliability in this study, because of the use of such large samples (n = 10,000 response vectors in each data set), any change in reliability that occurs can be assumed to be true change in population values. Just because the changes are due to actual effects of aberrance on reliability and not sampling error, however, does not mean that these changes have practical significance to test developers who are trying to maximize the accuracy of their estimates of test properties.

Research Questions 1 and 2: Effects of Aberrance on Validity

The same types of analyses that were done to explore the effects of aberrance on reliability were also performed to assess the effects of aberrance on validity.
To address the first research question, trends in the efl‘ects of aberrance on validity were examined for difi‘erent amounts and types of aberrance. To address the second research question, trends in the effects of aberrance on validity were examined for diflerent “true values” of validity and for different “true values” of reliability, in order to see if a relationship between changes in reliability and changes in validity emerged from these analyses. Tables 6, 7, and 8 show the levels of validity obtained in each of the aberrant data sets and the amount of change in validity that occurred because of the introduction of a particular amount of aberrance (i.e., the difference between validity in the data set after aberrance was introduced and the validity in the corresponding aberrance-free data set). Just as with the results for reliability, results for validity are broken down by the type of 43 aberrance (spuriously low or spuriously high), in order to examine potential differences in the efl‘ects of these two types of aberrance on validity. Results are also broken down by the three levels of validity (. l 5, .30, and .45) in order to determine whether there is a relationship between “true level” of validity and the efi‘ect of aberrance on estimates of validity. In addition, results are presented separately for each level of reliability that was simulated (.70, .80, and .90) both in order to simplify the presentation and in order to examine whether the trends in changes in validity vary with the “true level” of reliability that is simulated in the data set. As was true of reliability, the introduction of aberrance did not have a very large efi‘ect on validity, although the changes in validity coemcients tended to be slightly larger than the changes in coefiicient alpha. All of the changes in validity due to aberrance were negative, with the largest change being a decrease of 8 points from a true value validity of .45 and a true value reliability near .70 in the data sets with the 40% spuriously high manipulation and the 50% spuriously high manipulation. 4 4 Table 6 — Change in Validity Due to Aberrance for Reliability Near .70 Low Aberrance CRIT] 5 CRIT30 CRIT 45 % Aberrance rxy Ara r3, Ar,y rxy A’xy No Aberrance .16 .31 .45 10 .15 -.01 .30 -.01 .44 -.01 20 .15 -.01 .28 -.03 .43 -.02 30 .15 -.01 .28 -.03 .42 -.03 40 .14 -.02 .27 -.04 .40 -.05 50 .15 -.01 .27 -.04 .39 -.06 High Aberrance CRIT15 CRIT30 CRIT45 % Aberrance rxy Ara. rxy Ara: rxy A’ :31 No Aberrance .16 .31 .45 10 .15 —.01 .29 -.02 .42 -.03 20 .14 -.02 .27 -.04 .41 -.04 30 .13 -.03 .26 -.05 .39 -.06 40 .14 -.02 .25 -.06 .37 -.08 50 .13 -.03 .25 -.06 .37 -.08 Mixed Aberrance CRIT15 CRIT30 CRIT45 % Aberrance r” Ara, rxy Ara. r3, Ara, No Aberrance .16 .31 .45 15 Low, 15 High .16 0 .29 -.02 .43 -.02 Note. r9. refers to the obtained validity coefficient in the aberrant sample. A r,, refers to the change in validity that is due to the introduction of aberrance (i.e., the difl‘erence between validities for the aberrant and aberrance-free samples). 4 5 Table 7— Change in Validity Due to Aberrance for Reliability Near .80 Low Aberrance CRIT15 CRIT30 CRIT45 % Aberrance rfl Ar” '39» Ar” rxy Ar”, No Aberrance .14 .31 .45 10 .14 0 .30 -.01 .44 - 01 20 .14 0 .29 -.02 .42 -.03 30 .13 -.01 .29 -.02 .41 -.04 40 .13 -.01 .28 -.03 .42 -.03 50 .13 -.01 .27 -.04 .41 -.04 High Aberrance CRIT15 CRIT 30 CRIT45 % Aberrance r3, Ar,I r,, Arxy rxy Ar)? 
No Aberrance .14 .31 .45 10 .14 0 .29 -.02 .44 - 01 20 .13 -.Ol .29 -.02 .42 -.03 30 .14 0 .27 -.04 .42 -.03 40 .13 -.01 .27 -.04 .40 -.05 50 .12 -.02 .27 -.04 .40 -.05 Mixed Aberrance CRIT15 CRIT30 CRIT 45 % Aberrance rxy NW ,0 A,” No Abemnce .14 .31 .45 15 Low, 15 High .14 0 .30 -.01 .44 -.01 Note. I}, refers to the obtained validity coemcient in the aberrant sample. A r,,. refers to the change in validity that is due to the introduction of aberrance (i.e., the difi‘erence between validities for the aberrant and aberrance-free samples). 4 6 Table 8— Change in Validity Due to Aberrance for Reliability Near .90 Low Aberrance ; CRIT15 CRIT30 CRIT45 % Aberrance r3, Ara rxy Ar”. r8, Ara, N o Aberrance .15 .30 .45 10 .15 0 .29 -.01 .44 -.01 20 .14 -.01 .30 0 .44 -.01 30 .15 0 .29 -.01 .43 -.02 40 .15 0 .28 -.02 .42 -.03 50 .14 -.01 .28 -.02 .43 -.02 High Aberrance CRIT15 CRIT30 CRIT45 % Aberrance rm, Ara. ray Afxy :52» ”L No Aberrance .15 .30 .45 10 .14 -.Ol .29 -.01 .44 -.01 20 .14 -.01 .29 -.01 .42 -.03 30 .14 -.01 .28 -.02 .41 -.04 40 .13 -.02 .28 -.02 .41 -.04 50 .14 -.01 .27 -.03 .40 -.05 Mixed Aberrance CRIT15 CRIT30 CRIT45 °/o Aberrance r0, Ara. rxy Aray rxy Ara, N o Aberrance .15 .30 .45 15 Low, 15 High .15 0 .30 0 .44 -.01 Note. 13,, refers to the obtained validity coeficient in the aberrant sample. A r,, refers to the change in validity that is due to the introduction of aberrance (i.e., the difi‘erence between validities for the aberrant and aberrance-flee samples). 4 7 The trends of change in validity appear to be more clear-cut than was the case with changes in reliability. There tended to be a larger decrease in validity with the introduction of spuriously high aberrance than with the introduction of spuriously low aberrance into a data set for each of the “true values” of validity (.15, .30, and .45), but the overall pattern of results was the same for both types of aberrance. In both cases, as the amount of aberrance introduced into the data set increases, there tended to be a greater decrease in the validity coefficient. This was the pattern, again, for each of the “true values” of validity, although there was less of a pattern for validity coeflicients with a “true value” near .15, probably because there was less change in validity for these data sets and so there was a more of a ceiling efl‘ect of variability in change across levels of aberrance. Also true for both types of aberrance, as the “true value” of validity increased, there tended to be a greater decrease in the validity coemcient for each level of aberrance introduced. This pattern of change across “true values” of the validity coeficient is based on absolute change rather than on percentage change in the value of the coemcient. For there to be a pattern of proportional change in the validity coemcients, validities simulated to represent a true value of .45 would have to decrease more than 3 times the number of points that validities simulated to represent a true value of .15 decreased. The changes in validity that occurred in this study represent an approximately even relative change across “true values” of the validity coeficients. For the spuriously low manipulation, the relative changes in validities were about even, on average, across the “true values” of validity (mean A5,, for .15 of -.01, mean A5,, for .30 of -.02, and mean Arxy for .45 of -.03). 
For the spuriously high manipulation, the relative changes in validities simulated to represent a true value of .45 (mean Any of -.04) averaged slightly less than 3 times as large as changes in validities simulated to represent a true value of .15 (mean Any of -.01 5), but validities simulated to represent a true value of .30 (mean Ar,,, 48 of -.03) tended to be about 2 times as large as those in validities simulated to represent a true value of .15. In addition to trends of change in validity across amounts of aberrance and “true values” of validity, there was also a trend in changes in validity across the levels of test reliability that were simulated in the data sets. The general trend was for a smaller effect of aberrance on validity for data sets with a higher “true value” of simulated test reliability. Table 9 displays the amount of change for each of the “true values” of validity averaged across aberrance levels for each level of test reliability that was simulated in the data. There was a tendency toward a smaller decrease in the validity coefficient when the reliability of the test was higher. This pattern held for each level of validity for both spuriously low and spuriously high aberrance manipulations as well as for the data sets with mixed aberrance. Table 10 displays the amount of change in validity for each amormt of aberrance (10%, 20%, 30%, 40%, or 50%) averaged across the three levels of validity for each simulated “true value” of reliability. The pattern of a smaller efl‘ect of aberrance on validity for higher “true values” of reliability is also evident here for both the spuriously low and the spuriously high aberrance manipulations as well as for the data sets with mixed aberrance, except in the case of data sets with the 10% spuriously low aberrance manipulation. Table 9 — Average Change in Validity by Reliability Across Aberrance Low Aberrance “True” Validity RELI70 REL180 RELI90 CRIT15 -.01 -.01 -.00 CRIT30 -.03 -.02 -.01 CRIT45 -.03 -.03 -.02 High Aberrance “True” Validity RELI70 REL180 RELI90 CRIT15 -.02 -.01 -.01 CRIT30 -.05 -.03 -.02 CRIT45 -.06 -.03 -.03 Mixed Aberrance “True” Validity RELI70 RELI80 RELI90 CRIT15 0 0 0 CRIT30 -.02 -.01 0 CRIT45 -.02 -.01 -.01 NLte, Values in the table represent the average amount of change in the validity coeflicient for a particular “true value” of validity (criterion score simulated to be correlated with total test score at approximately .15, .30, or .45) for each simulated level of test reliability (“true” values of .70, .80, .90; obtained values of .71, .82, and .90). Negative values represent a decrease in validity after the introduction of aberrance. Each value for the spuriously high and spuriously low manipulations is averaged across the five levels of aberrance (10%, 20%, 30%, 40%, and 50%). The values for mixed aberrance are based on data sets with 15% low and 15% high aberrance. 
5 0 Table 10 - Average Change in Validity by Reliability Across Validity Low Aberrance % Aberrance RELI70 RELI80 RELI90 10 -.01 -.01 -.01 20 -.02 -.02 -.01 30 -.02 -.02 -.01 40 -.04 -.02 -.02 50 -.04 -.03 -.02 High Aberrance % Aberrance RELI70 RELI80 RELI90 10 -.02 -.01 -.01 20 -.03 -.02 -.02 30 -.05 -.02 -.02 40 -.05 -.03 -.03 50 -.06 -.04 -.03 Mixed Aberrance % Aberrance RELI70 RELI80 RELI90 Low 15, High 15 -.01 -.01 -.00 Mt; Values in the table represent the average amount of change in the validity coefficient for a particular amount of aberrance (10%, 20%, 30%, 40%, or 50% spuriously low or spuriously high aberrance, or 15% low and 15% high mixed aberrance) for each simulated level of test reliability (“true” values of .70, .80, .90; obtained values of .71, .82, and .90). Negative values represent a decrease in validity after the introduction of aberrance. Each value for the spuriously high and spuriously low manipulations is averaged across the three “true values” of validity (.15, .30, or .45). 51 The results of these analyses showing that aberrance has a smaller efl‘ect on validity for larger “true values” of reliability may seem counterintuitive given the previous results showing that aberrance had a larger effect on reliability for larger “true values” of reliability. The reason for these seemingly incongruent findings is that the positive relationship between reliability and validity is stronger than the relationship between magnitude of reliability and effects of aberrance on reliability so that the influence of aberrance on reliability does not play a part in the way that reliability is associated with the efi‘ects of aberrance on validity. In addition, the efl‘ects of aberrance on reliability were so small that any indirect influence aberrance would have on validity through an effect on reliability is not likely to be detected. The reason for the association between higher reliability and less of an efl‘ect of aberrance on validity has to do with the positive relationship between validity and reliability. Tests with higher reliability also have higher validities if all other factors are held constant. If aberrance is increasing some sort of variability in test scores that is not associated with variability in criterion scores, then the aberrance would reduce test validity. For highly reliable tests, however, this increase in variability may be less of a problem, and thus lead to less attenuation in validity coemcients for highly reliable tests then for less reliable tests with the same amount of aberrance. As indicated throughout this discussion, the efl‘ects of aberrance on the data sets with a mix of 15% spuriously low aberrance and 15% spuriously high aberrance show the same pattern as that observed for the data sets with a single type of aberrance. The amount of change in the validity coefiicient, however, tends to be smaller than that observed for either single type of aberrance at the same overall level of aberrance in the data set (i.e., 30%). This smaller effect provides some additional evidence that the two types of aberrance may be canceling each other out in their efl‘ects on validity, although this is less clear than in the case of efi‘ects of aberrance on reliability, because here both 52 spuriously low and spuriously high aberrance reduce validity coemcients, whereas the two types of aberrance had more opposing efl‘ects on reliability. 
The results described in this section can be used to summarize the impact of abenance on validity in terms of the first two research questions. The first research question is concerned with the effect of varying levels of aberrance on validity. The results show a general trend for a greater decrease in validity as the level of aberrance in a data set increases, with larger changes in validity for the introduction of spuriously high aberrance than for the introduction of spuriously low aberrance into a data set. The second research question focuses on the efiea of abenance on varying levels of validity and reliability. The findings show that aberrance has a slightly greater relative efl‘ect on higher “true values” of validity (resulting in larger decreases in higher validities), but that this efi‘ect is lessened for tests with higher levels of reliability. Overall, the effects of aberrance on validity were rather small (though they were larger than the effects of aberrance on reliability), but because of the large sample sizes used in this study, any change in validity can be interpreted as real change in population values. For the most part, however, these changes may be too small to be of practical significance for improving the accuracy of estimates of test validity (i.e., the average change in validity due to the introduction of aberrance was a 6 —- 10% decrease across “true values” of validity). Research sttion 3: 1, Detection Rates and Aberrance Removal The third research question of interest in this study concerns both the degree that the 1, statistic is useful for detecting aberrant response vectors and the degree that using the 1, statistic to flag aberrant response vectors for removal from analyses is a useful way of increasing the accuracy of estimates of test properties. To explore this research question, first the aberrance detection rates for the 1, statistic were examined for the different types and levels of aberrance. Second, 1, scores were used to identify aberrant 5 3 response vectors in each of the data sets. The reliability and validity of the test were recalculated after the response vectors identified as aberrant were removed from the analyses. These new estimates of the test properties were compared to estimates of reliability and validity calculated in corresponding aberrant and aberrance-free samples to see whether removing aberrance from a data set resulted in estimates of reliability and validity that were closer to those calculated on aberrant or aberrance-free data sets. Table 11 displays the proportion of aberrance detected using the 1, statistic out of the total amount of aberrance present in each of the simulated data sets in this study. Response vectors were removed from the analysis if the 1, score for that response vector was less than or equal to -2.00. Results are separated by the type of aberrance and by the “true value” of reliability that was simulated in the data set in order to more easily identify patterns of aberrance detection that may be associated with these parameters. There is no information about the criteria that were simulated in these data sets because criterion scores were maflected by the aberrance manipulations. 5 4 Table 11 — Proportion of Total Aberrance Detected Using the 1, Statistic Low Aberrance % Aberrance RELI70 RELI80 RELI90 10%(1000/10900) .15 (151/1000) .16 (”91000) .31 (”91000) 20% (2000/10,...) 
.10 (190/2000) .10 (2°‘/,ooo) .18 (360/2000) 30% (”m/10.000) .05 (164/3000) .07 (mama) .13 (375/3000) 40% Whom) .05 (210/4000) .0612‘7/4000) .10 (“”1000 50% (moo/10.000) -04 (190/ 5000) ~05 (230/5000) ~07 (340/ 5000) High Aberrance % Aberrance RELI70 REL180 RELI90 10%(1000/10000) 166591000) ~17(m/rooo) 31(307/1000) 20%(2000/10000) sum/2000) 090792000) rum/2000) 30% (mo/10,000) -04 (128/3000) -06 (185/3000) -13 (381/3000) 40% (“m/mm.) .03 (“z/.000) .05 (”s/.000) .09 (352/4000) 50%(’°°°/i.ooo) 0263/5000) .03 (148/5000) .08 (396/5000) Mixed Aberrance % Aberrance RELI70 REL180 RELI90 15% Low, 15% High (mo/10.000) -04 (126/3000) -05 (m/ 3000) -07 (210/3000) Epic; Values in the table represent the proportion of aberrance that was detected using the 1, statistic (i.e., response vectors with I, s -2.00) out of the total number of aberrant response vectors that were simulated in the data set (e.g., for the 20% low abenance condition, 2000 out of 10,000 response vectors were manipulated, so the proportion of aberrance detected is based on a possible total of 2000 vectors). The actual fraction of aberrant response vectors detected is given in parentheses for each data set. 5 5 The table shows that the 1, statistic detected only a very small proportion of the total number of aberrant response vectors in a data set and that the relative proportion of aberrance detected was lower as the total amount of aberrance in the data set increased. The reason for this trend may be that as the proportion of aberrant response vectors in a data set increases, there is a corresponding relative decrease in the number of “normal” response vectors on which the group-determined item parameters for the test are based. This introduces more distortion (or aberrance) into the model of item responses that is used to distinguish between “normal” and aberrant responding and thus likely leads to a greater degree of misclassification of aberrant response vectors as normal. Detection rates for a given amormt of aberrance were higher for data sets with a higher simulated test reliability. This is likely because it is easier to detect aberrant item response patterns for tests that are more internally consistent. For tests with higher internal consistency, the measurement model for the test produces more stringent criteria for determining whether a response vector is aberrant because there is less sampling error in measurement of the construct of interest. For tests with lower internal consistency reliability, response vectors with a greater degree of aberrant item responses would still be classified as normal because more error variation in item responses is classified as being part of the measurement model of the test. Detection of aberrance in the data sets with mixed aberrance was slightly lower than for the data sets that had the same total amount of aberrance of only a single type (i.e., the 30% spuriously low aberrance or the 30% spuriously high aberrance data sets). This may be because the two types of abenance cancel each other out and so a response vector containing item responses that fit both spuriously low and spuriously high types of aberrance may be considered more “normal” (i.e., fit the measurement model of the test better) than response vectors with equal or even slightly smaller amounts of abenance of a single type. 
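The detection analysis summarized in Table 11 can be sketched as follows, assuming the standardized log-likelihood form of the l_z statistic under the three-parameter logistic model and the -2.00 cutoff used throughout this study. This is not the original LZCALC code, and the function names are illustrative; a response vector counts as detected when its l_z score is less than or equal to -2.00.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def lz(u, theta, a, b, c):
    """Standardized log-likelihood index for one 0/1 response vector u,
    given an ability estimate theta and item parameter arrays a, b, c."""
    p = p_3pl(theta, a, b, c)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))          # observed log-likelihood
    mu = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))          # its expectation
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)          # its variance
    return (l0 - mu) / np.sqrt(var)

def detection_rate(lz_scores, manipulated, cutoff=-2.00):
    """Proportion of manipulated (aberrant) response vectors flagged as
    aberrant, i.e., the kind of proportion reported in Table 11."""
    lz_scores = np.asarray(lz_scores)
    manipulated = np.asarray(manipulated, dtype=bool)
    return np.mean(lz_scores[manipulated] <= cutoff)
```

Because the simulation records which response vectors were manipulated, detection_rate can be computed exactly here, whereas in real testing situations the manipulated set is unknown.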
5 6 The second part of the analyses pertaining to the third research question compares estimates of test reliability and validity calculated after the removal of aberrant response vectors to estimates from the same data sets before aberrant response vectors were removed and to aberrance-free data sets with the same “true value” of reliability and validity (where “true values” represent estimates of reliability and validity that are not affected by aberrance). These comparisons are used to determine whether the removal of aberrant response vectors results in estimates of test properties that more closely resemble those from aberrance-free data sets or from data sets that include aberrance. The results of this comparison for test reliability estimates are presented in Table 12, which gives the estimate of reliability for the data set with aberrance, as identified by 1, scores less than or equal to -2.00, removed and the diflerence between this value and the estimate of reliability calculated in aberrance-flee and aberrant samples with the same “true value” of reliability. As in other analyses, results are presented separately for the difl‘erent types of aberrance. 5 7 Table 12 —- Change in Reliability Due to Removal of Aberrance Low Aberrance RELI70 RELI80 RELI90 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AF ree AAb 10% .71 0 0 .83 +01 +01 .90 0 +01 20% .71 0 0 .82 0 0 .90 0 +01 30% .71 0 0 .81 -.01 0 .89 -.01 +01 40% .70 -.01 O .81 -.01 +01 .88 -.02 0 50% .69 -.02 0 .79 —.03 0 .87 -.03 0 HighiAberrance RELI70 RELI80 RELI90 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AFree AAb 10% .73 +02 0 .83 +01 0 .90 0 0 20% .74 +03 0 .83 +01 0 .90 0 0 30% .75 +04 0 .83 +01 0 .90 0 0 40% .75 +04 0 .83 +01 +01 .89 -.01 0 50% .74 +03 0 .82 0 0 .89 -.01 +01 Mixed Aberrance RELI70 RELI80 RELI90 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .71 0 +01 .82 0 0 .89 -.01 0 m Rem refers to the estimate of reliability (rn) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., I, s -2.00) removed fiom the analysis. AF rec refers to the change in rn in the data set with 1,-identified aberrance removed relative to the aberrance-flee data set for the same “true value” of reliability (i.e., the difl‘erence rnRem — rnFree). AAb refers to the change in 13,, in the data set with 1,- identified aberrance removed relative to the data set with aberrance for the same “true value” of reliability (i.e., the difl‘erence rnRem - rnAb). 5 8 The results indicate that using the 1, statistic to identify aberrant response vectors for removal from analyses is not a useful strategy for improving the accuracy of estimates of test reliability. The estimates of reliability calculated on data sets with aberrant response vectors removed are more similar to the estimates of reliability calculated on aberrant data sets than they are to reliability estimates from aberrance-flee data sets. This pattern of results held for all data sets in which there were changes in reliability due to the introduction of aberrance except for the data sets with a “true” reliability of .90 and the 10% spuriously low and 20% spuriously low aberrance manipulations. This means that, for the most part, removing response vectors that received an 1, score of -2.00 or less did not have an effect on estimates of reliability. There are two reasons why removing response vectors with extreme 1, scores did not change estimates of reliability. 
First, the efl‘ects of aberrance on reliability were small to begin with, so there was not a lot of margin for improvement in accuracy of reliability estimates by removing aberrance. Second, as indicated in Table 11, the 1, statistic only identified a small percentage of the response vectors that were manipulated to represent aberrant response patterns, so removing this small percentage of aberrant response vectors did not change whatever effect aberrance did have on reliability (shown in Table 5). Comparisons were also conducted to determine whether removing response vectors identified as aberrant using the 1, statistic would improve the accuracy of estimates of validity. Tables 13, 14, and 15 present the results of these analyses separately for each level of reliability that was simulated in the data sets. Each table gives the three estimates of validity (based on “true values” of .15, .30, and .45) for the data set with aberrance removed and the difi‘erence between these values and the corresponding estimates of validity calculated in aberrance-free and aberrant samples. 5 9 Table 13 - Change in Validity Due to Removal of Aberrance for r,“ Near .70 Low Aberrance CRIT15 CRIT30 CRIT 45 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AF ree AAb 10% .15 -.01 0 .30 -.01 0 .44 -.01 0 20% .15 -.01 0 .29 -.02 +01 .43 -.02 0 30% .15 -.01 0 .28 -.03 O .42 -.03 0 40% .15 -.01 +01 .27 -.04 0 .40 -.05 0 50% .15 -.01 0 .27 -.04 0 .39 -.06 0 High Aberrance RELI70 RELI80 RELI90 Aberrance Rem AFree AAb Rem AF ree AAb Rem AFree AAb 10% .15 -.01 0 .29 -.02 0 .42 -.03 0 20% .14 -.02 0 .27 -.04 0 .41 -.04 0 30% .13 -.03 0 .26 -.05 0 .39 -.06 0 40% .14 -.02 0 .25 -.06 0 .37 -.08 0 50% .13 -.03 0 .25 -.06 0 .36 -.09 -.01 Mixed Aberrance RELI70 RELI80 RELI90 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .16 0 0 .29 -.02 0 .44 -.01 +.01 N_0Le_, Rem refers to the estimate of validity (r,,,) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., 1, S -2.00) removed from the analysis. AF ree refers to the change in r,,, in the data set with I,-identified aberrance removed relative to the aberrance-free data set for the same “true value” of validity (i.e., the difi'erence rvRem - r,,.Free). AAb refers to the change in r,,, in the data set with I,- identified aberrance removed relative to the data set with aberrance for the same “true value” of validity (i.e., the difference r,,Rem - r,,.Ab). 22‘; 6 0 Table 14 — Change in Validity Due to Removal of Aberrance for rn Near .80 Low Aberrance CRIT15 CRIT 30 CRIT45 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 10% .14 0 0 .30 -.01 0 .44 -.01 0 20% .14 0 0 .29 -.02 0 .42 -.03 0 30% .14 0 +01 .29 -.02 0 .41 -.04 0 40% .13 -.01 0 .28 -.03 0 .42 -.03 0 50% .13 -.01 0 .28 -.03 +01 42 -.03 + 01 High Aberrance CRIT15 CRIT 30 CRIT45 Aberrance Rem AF rec AAb Rem AF rec AAb Rem AF rec AAb 10% . 14 0 0 .29 -.02 0 .44 -.01 0 20% .13 -.01 0 .29 -.02 0 .42 -.03 0 30% .13 -.01 -.01 .27 -.04 0 .42 -.03 0 40% .12 -.02 -.01 .27 -.04 0 .40 -.05 0 50% .12 -.02 0 .27 -.04 0 .39 -.06 -.01 Mixed Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .14 0 0 .30 -.01 0 .44 -.01 0 Note. Rem refers to the estimate of validity (my) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., 1, S -2.00) removed from the analysis. 
AFree refers to the change in r,,, in the data set with 1,-identified aberrance removed relative to the aberrance-free data set for the same “true value” of validity (i.e., the difl‘erence r,,.Rem — rWFree). AAb refers to the change in r”, in the data set with 1,- identified aberrance removed relative to the data set with aberrance for the same “true value” of validity (i.e., the difference r,,.Rem - r,,.Ab). 6 1 Table 15 — Change in Validity Due to Removal of Aberrance for r,, Near .90 Low Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AF ree AAb Rem AFree AAb Rem AFree AAb 10% .15 0 0 .30 0 +01 .44 -.01 0 20% .15 0 +01 .30 0 0 .44 -.01 0 30% .15 0 0 .28 -.02 +01 .43 -.02 0 40% .15 0 0 .28 -.02 0 .42 -.03 0 50% .14 -.01 0 .28 -.02 0 .42 -.03 -.01 High Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AF ree AAb 10% .14 -.01 0 .30 0 +01 .44 -.01 0 20% .14 -.01 0 .29 -.01 0 .43 -.02 +01 30% .14 -.01 0 .28 -.02 0 .42 -.03 +01 40% .13 -.02 0 .28 -.02 0 .41 -.04 0 50% .13 -.02 -.01 .26 -.04 -.01 .39 -.06 -.01 Mixed Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .15 0 0 .30 0 0 .44 -.01 0 Mg, Rem refers to the estimate of validity (r9) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., 1, s -2.00) removed fi'om the analysis. AF rec refers to the change in "xy in the data set with I,-identified aberrance removed relative to the aberrance-free data set for the same “true value” of validity (i.e., the difi'erence r,,.Rem — r,,.Free). AAb refers to the change in rer in the data set with 1,- identified aberrance removed relative to the data set with aberrance for the same “true value” of validity (i.e., the difference rgRem — r,,.Ab). 62 The same pattern of results was found across all three “true values” of reliability, so the information shown in Tables 13, 14, and 15 will be discussed together. Similar to the results found for the effect of removing aberrance on reliability, the evidence suggests that removing aberrance fi'om the data sets had little to no effect on estimates of validity. The comparisons in these tables show that the validity coefficients in the data sets with response vectors having 1, scores less than or equal to -2.00 removed were more similar to the validity coefficients in the data sets with aberrance than in the aberrance-flee data sets. This pattern of results held for almost all of the data sets in which there were changes in validity due to the introduction of aberrance, regardless of the level and type of aberrance in the data set or the “true value” of validity simulated. The reasons for the lack of effect for removing response vectors identified as aberrant by the 1, statistic on validity are the same as those given previously to explain the results for reliability. There is a possibility that the effects of aberrance on validity were too small to produce an adequate margin for improvement with the removal of response vectors with extremely low 1, scores. This is less of an issue in the case of validity than it was in the case of reliability, however, because the effects of aberrance on validity were somewhat larger, and the pattern of effects on validity is more clear-cut than the pattern for reliability, both for the aberrant data sets and for the data sets with aberrance identified by the 1, statistic removed. 
The lack of change in validity after response vectors were removed from the analysis is most likely due to the extremely small proportions of aberrant response vectors that were identified by the 1, statistic, as shown in Table 11. These results can be used to draw conclusions relevant to the third research question, which asks about the effects of using the 1, statistic to identify aberrant response vectors for removal from analyses estimating the reliability and validity of a test. Overall, the results of this study indicate that there is little to no change in estimates of reliability and validity from using the 1, statistic to flag aberrant response vectors. The reason for 6 3 this lack of effect may be partially because there was little change in validity and, particularly, in reliability, due to the introduction of aberrance, so there was little room for improvement in these estimates by removing aberrance. The main reason that there was no efiect from removing response vectors with extremely low 1, scores, however, was because only a small proportion of aberrant response vectors for all levels and types of aberrance manipulation were identified as being aberrant based on 1, scores. This means that, based on the results of this study, the 1, statistic is not an adequate indicator of aberrance for the purpose of attempting to improve the accuracy of estimates of test characteristics, regardless of the effects of aberrance on test properties. DISCUSSION The purpose of this study was to examine a proposed application of appropriateness indices as indicators of response vectors in a test validation sample that may be distorting estimates of the psychometric properties of tests. A second and related objective was to determine why previous research on this topic found small to no efl‘ects on validity after removing response vectors identified as aberrant from analyses. The research questions of interest in this study centered arormd the efi‘ects of varying amounts and types of aberrance on varying levels of reliability and validity and the effects of using the 1, statistic to flag aberrant response vectors for removal from analyses to improve estimates of these test properties. This study expanded on the prior research in several ways. Simulated data were used to investigate a variety of situations in which aberrance may afl‘ect test properties. The simulation procedure also removed the limitations of sample size and test length in the prior work and allowed a controlled analysis of the degree that the 1, statistic was detecting aberrance in a data set. In addition, the efiects of aberrance on test reliability as well as the efl‘ects on test validity were examined to attempt to determine the reasons why aberrance does or does not affect test properties in particular ways. Overall, the results of this study indicate that the efi‘ects of aberrance on test properties are small to negligible for all types and amounts of aberrance simulated in this study, although the efl‘ects of aberrance on validity are somewhat larger than the efl'ects of aberrance on reliability. Because of the large sample sizes (n = 10,000) used in the data sets, however, any change in reliability or validity was considered an indicator of 64 65 a true effect of aberrance on test properties. Based on this consideration, several trends in the relationship between aberrance and test properties are worth noting. 
The first research question asks about the effects of varying amounts of aberrance on test reliability and test validity. The effects of aberrance on reliability differed based on the type of aberrance that was simulated in the data set. In both cases, as the amount of aberrance in the data set increased, the estimate of reliability decreased. For the spuriously low manipulation, however, introducing aberrance into a data set always resulted in slightly lower estimates of test reliability than those in data sets with no aberrance. For the spuriously high manipulation, introducing aberrance resulted in changes in reliability estimates from data sets with no aberrance that ranged fi'om slight increases for small amounts of aberrance to slight decreases for larger amounts of aberrance. The trend in effects of aberrance on validity was more homogeneous. There were slightly larger changes in validity for the introduction of spuriously high aberrance than for the introduction of spuriously low aberrance, but the direction of change was the same in both cases. For both types of abenance manipulations, as the amount of aberrance in the data set increased, the estimate of test validity decreased from the “true value” of validity as estimated in data sets with no aberrance. The second research question asks about the effects of aberrance on varying levels of test reliability and test validity. The effects of aberrance on both reliability and validity varied based on the “true value” of the test property, as estimated on data sets with no aberrance. Again, the effects of aberrance on varying levels of reliability differed based on the type of aberrance manipulation that was used. For data sets with the spuriously low aberrance manipulation, there were slightly greater decreases in estimates of reliability for higher “true values” of reliability for any amount of aberrance. For data 66 sets with the spuriously high aberrance manipulation, changes in reliability became more negative and/or less positive as the “true value” of reliability increased. There was a trend in changes in validity coefficients both for changes in the “true value” of validity and for changes in the “true value” of reliability. These trends remained the same for both types of aberrance manipulations. As the “true value” of validity increased, there were larger absolute decreases in the value of the validity coemcient for each amount of aberrance. When relative changes in validity were examined, however, the decreases in coefficients remain relatively constant across levels of validity. As the “true value” of reliability increased, there were smaller decreases in the value of the validity coemcient across all levels of validity and all amounts of aberrance. In addition to the data sets which were simulated to represent only a single type and level of aberrance, a data set was also simulated with a combination of both spuriously low and spuriously high response vectors. This data set was simulated in order to examine the possibility that the two types of aberrance commonly described in research on test appropriateness cancel each other out in effects on test properties. The effects of the mixed types of aberrance on test properties were lower than the effects of either type of aberrance alone for the same overall level of aberrance in the data set, supporting the possibility that the two types of aberrance may cancel each other out in efl‘ecting characteristics of the data set. 
The third research question asked about the aberrance detection rates of the lz statistic and about the effects of using the lz statistic to flag aberrant response vectors for removal from analyses on estimates of reliability and validity. Overall, the detection rates of the lz statistic were very low for all types and amounts of aberrance simulated in the data sets, with less than 20% of the simulated aberrance detected for all but two data sets (31% detection for 10% aberrance of either type at a "true value" of reliability of .90). The proportion of aberrance detected in a data set decreased as the amount of aberrance in the data set increased, while the proportion of aberrance detected increased as the "true value" of reliability increased. Aberrance detection rates for the mixed aberrance data set were slightly lower than for the corresponding data sets with a single type of aberrance, providing further evidence that the effects of spuriously low aberrance and spuriously high aberrance on a data set cancel each other out.

As would be expected based on these low detection rates, removing response vectors that were flagged as aberrant based on lz scores had little to no effect on estimates of test reliability and validity. In almost all cases, estimates of both reliability and validity that were calculated on data sets with response vectors removed were more similar to the data sets with aberrance than to the data sets that were simulated to be free of aberrance. This general pattern held across all combinations of types and amounts of aberrance as well as across "true values" of reliability and validity. These results indicate that the lz statistic is not very effective as an indicator of test inappropriateness for the purpose of removing aberrant response vectors to improve estimates of test properties.

Implications of the Present Study for Previous Research

The results of this study offer several possible reasons for the results of previous research that showed small to no effects on test validity after removing response vectors identified as aberrant based on lz scores. Most likely, a combination of these explanations is responsible for the previous findings.

The first potential reason for the weak effects that were found is that, in most cases, aberrance may not have an effect on test properties that is large enough to make a practically significant difference in the estimation of these values. In this study, the introduction of aberrance into data sets had only a small effect on both reliability (an average change of one to one and one half points) and validity (an average change of two to three points).

The aberrance simulations used in this study mirror those that would be found in real testing situations. A respondent may receive a test score that is spuriously low if they respond carelessly or randomly or if they have a poor understanding of the test instructions or testing procedures in general. A respondent may receive a spuriously high score if they cheat or receive special coaching on the item content or item types that make up the test. Because of this, the small changes in test properties due to aberrance that were found in this study are likely to be similar in magnitude to changes found in other studies using data from real testing situations.
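To make the two manipulations concrete, the following is a minimal sketch, in Python, of how spuriously low and spuriously high response vectors of this kind might be generated. The function name, the use of NumPy, and the rule for selecting which responses are overwritten are illustrative assumptions rather than the exact procedure implemented in the BIDEV program (Peters, 1995) that was used to introduce aberrance in this study.

    import numpy as np

    def make_aberrant(responses, kind, proportion=0.30, rng=None):
        """Return a copy of a 0/1 response vector with a fraction of its
        items overwritten to mimic aberrant responding.

        kind == "low"  : overwrite randomly chosen items with incorrect (0)
                         responses, mimicking careless or random responding.
        kind == "high" : overwrite randomly chosen items with correct (1)
                         responses, mimicking cheating or item pre-knowledge.

        Illustrative assumption: the original study changed 30% of the
        responses (15 of 50 items) in each selected vector, but its exact
        selection rule may have differed from the random choice used here.
        """
        rng = np.random.default_rng() if rng is None else rng
        out = np.array(responses, dtype=int).copy()
        n_change = int(round(proportion * out.size))
        idx = rng.choice(out.size, size=n_change, replace=False)
        out[idx] = 0 if kind == "low" else 1
        return out

    # Example: one 50-item vector, with 30% spuriously low and high versions.
    rng = np.random.default_rng(1996)
    vector = rng.integers(0, 2, size=50)
    low_vec = make_aberrant(vector, "low", rng=rng)
    high_vec = make_aberrant(vector, "high", rng=rng)
    print(vector.sum(), low_vec.sum(), high_vec.sum())  # total score shifts down / up

In this sketch the manipulation acts on total score in the intended direction while leaving the unmanipulated responses untouched, which parallels the way the simulated aberrance was layered onto otherwise model-consistent data in this study.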
In addition, this study found that there were smaller changes in test properties in data sets with aberrant response vectors representing a mix of spuriously high and spuriously low test scores than in data sets with only a single type of aberrant response vector. These findings were interpreted as representing the possibility that different types of aberrance may cancel each other out in their effects on the data set. In real testing situations, it is likely that different types of aberrant responding will occur within the sample of test respondents. If the findings of this study generalize to real testing situations, this means that the changes in test properties due to aberrance will be even smaller than what was demonstrated in these simulations, because the effects of spuriously low aberrance and spuriously high aberrance will cancel out, assuming that the two types of aberrance occur at similar rates within the sample.

The second potential reason for the previous research findings of weak effects of aberrance removal on estimates of test properties is that the lz statistic did not identify enough of the aberrance in the data set for the removal of the flagged response vectors to make a difference in test properties. In real testing situations, the amount of aberrance in the data set is unknown. In this study, however, the use of simulated data allowed the comparison of the amount of aberrance in a data set with the amount of aberrance detected using the lz statistic. Although this was done in previous research on test appropriateness, this study expands on the prior findings by examining varying types and amounts of aberrance that are likely to be found in data sets in real testing situations. In addition, the use of simulated data allowed for the comparison of both aberrant data sets and data sets with aberrance removed, as well as the comparison of both of these to data sets with no aberrance. These comparisons resulted in the finding that the lz statistic is not an adequate indicator of the types and amounts of aberrance simulated in this study and that removing response vectors identified as aberrant based on lz scores had little effect on estimates of test properties. To the extent that the aberrance simulations used in this study generalize to real testing situations, the previous research findings can be explained by the lack of aberrance detected using the lz statistic.

In summary, the proposed application of the lz statistic to improve the accuracy of estimates of test properties is not very useful. This is likely both because of the small effects on test properties of the types of aberrance commonly examined and because of the low detection rates of aberrance using the lz statistic, particularly when there is a large amount of aberrance or when there are mixed types of aberrance in a single data set. These results go against the common-sense assumption that aberrance would affect test properties because of the introduction of response variation that does not fit the model of responses based on classical test theory or item response theory. The next question, then, is what it is about this application of the lz statistic that fails to make a difference in test properties, and/or what it is about the types of aberrance commonly examined in previous research that does not affect test properties.
A Potential Explanation for the Present and Previous Findings

The underlying cause of both the lack of effect of aberrance on test properties and the low aberrance detection rate of the lz statistic may have to do with the way the introduction of aberrance changes the characteristics of the data set. This influences both the way item responses are modeled using item response theory and the way the lz statistic is calculated. One possibility is that the introduction of aberrance into a data set distorts the properties of the data set to the extent that it is difficult to categorize particular response vectors as "normal" or aberrant based on the information provided by the distorted data set. This difficulty in the classification of individual response vectors within a data set has implications for both the influence of aberrance on test properties and the influence of aberrance on the calculation of lz scores.

Although previous research has stated that using item and person parameter values from data sets with aberrance does not produce significant distortion in appropriateness fit statistics (Levine & Drasgow, 1982), this may not always be the case, particularly for data sets with large amounts of aberrance, such as those simulated in this study. In Levine and Drasgow's (1982) work on the effects of aberrance on item and person parameter estimation, only 200 response vectors out of a sample of 3000 underwent a 20% spuriously low aberrance manipulation, which translates to approximately 7% aberrant response vectors within the data set. This is lower than even the lowest percentage of aberrance simulated in the data sets used in this study, and the amount of change within a response vector is also lower than the amount of change simulated in this study (i.e., 30% of responses within a vector). It is thus unclear whether the results of that work hold for the present study.

The value of the lz statistic for a particular response vector is based on the extent to which the response pattern fits an estimate of ability on the construct of interest, which is in turn based on group-determined item parameters from an item response theory model of the data. In data sets with a large number of response vectors that do not fit the ideal response pattern based on the maximum likelihood function, the estimation procedure for the IRT model itself is likely to be distorted. This is because the estimation procedures used by programs like BILOG to estimate IRT models of data assume that the test is unidimensional. As discussed previously, one way of defining test inappropriateness is the extent to which a test is measuring something other than the construct of interest for a subset of respondents. When there is a large proportion of respondents for whom the test is inappropriate, the assumption of unidimensionality is likely violated to the extent that IRT models of the data are distorted, resulting in distorted estimates of both item and person parameters. When these distorted estimates are, in turn, used to calculate lz values, indicating the appropriateness of the item response model for a particular response vector, the response vector may fit the distorted model because of the large number of aberrant response vectors on which the model was based, resulting in low detection rates for the lz statistic.
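As a reminder of how the index depends on those estimates, the following is a minimal sketch of the standardized log-likelihood person-fit index, using the conventional definition (Drasgow, Levine, & Williams, 1985) with three-parameter logistic item response probabilities, as in the BILOG calibrations used here. The function names and the scaling constant D = 1.7 are assumptions for illustration; the values reported in this study were presumably computed with the LZCALC program (Peters, 1993) cited in the references.

    import numpy as np

    def p3pl(theta, a, b, c, D=1.7):
        """Probability of a correct response under the three-parameter
        logistic model: c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
        return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

    def lz(responses, theta, a, b, c):
        """Standardized log-likelihood index lz for one response vector.

        l0      = sum_i [u_i ln P_i + (1 - u_i) ln(1 - P_i)]
        E(l0)   = sum_i [P_i ln P_i + (1 - P_i) ln(1 - P_i)]
        Var(l0) = sum_i P_i (1 - P_i) [ln(P_i / (1 - P_i))]^2
        lz      = (l0 - E(l0)) / sqrt(Var(l0))

        Large negative values flag response patterns that are unlikely
        under the estimated model (i.e., possible aberrance).
        """
        u = np.asarray(responses, dtype=float)
        p = p3pl(theta, np.asarray(a), np.asarray(b), np.asarray(c))
        q = 1.0 - p
        l0 = np.sum(u * np.log(p) + (1.0 - u) * np.log(q))
        e_l0 = np.sum(p * np.log(p) + q * np.log(q))
        var_l0 = np.sum(p * q * np.log(p / q) ** 2)
        return (l0 - e_l0) / np.sqrt(var_l0)

    # Illustrative 5-item example with made-up parameters.
    a = [0.8, 1.0, 1.2, 0.9, 1.1]
    b = [-1.0, -0.5, 0.0, 0.5, 1.0]
    c = [0.2] * 5
    print(lz([1, 1, 1, 0, 0], theta=0.0, a=a, b=b, c=c))  # well-fitting pattern
    print(lz([0, 0, 0, 1, 1], theta=0.0, a=a, b=b, c=c))  # reversed, misfitting pattern

The point relevant to the argument above is that every term in lz depends on the estimated theta and item parameters; when those estimates are themselves distorted by aberrance, the index is standardized against a distorted expectation.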
In addition, internal consistency reliability estimates for the distorted data sets may remain high because responses may be internally consistent for the measurement of more than one construct, resulting in high coefficient alphas but a violation of the unidimensionality assumption needed for accurate IRT models of the data. Support for this explanation comes from Reise (1995), who found that aberrance detection rates using the lz statistic were always lower when using estimates of theta from aberrant data sets than when using estimates of theta from aberrance-free data sets.

A post hoc analysis of the data from this study also supports this explanation of the results. The data sets starting at a "true" reliability of .90 with 30% spuriously low aberrance and with 30% spuriously high aberrance were chosen for this analysis because they represent a moderate amount of aberrance with a high degree of internal consistency before the addition of aberrance. The item and person parameters in these data sets were compared with those for the data set with a "true" reliability of .90 and no aberrance. Both item and person parameters showed some distortion due to the introduction of aberrance, with different patterns of influence for spuriously high and spuriously low aberrance but the same overall magnitude of change for either type. The results of the comparison of person parameters are given in Appendix B, and the results of the comparison of item parameters are given in Appendix C.

Person parameter estimates (i.e., theta, or ability on the construct measured by the test) changed in different ways depending on whether the response vector was chosen for the aberrance manipulation or remained unchanged after the manipulation. Table 4A presents the results of comparisons between estimates of theta based on 30% aberrant and aberrance-free data sets with a "true" reliability of .90. For the spuriously low manipulation, manipulated response vectors showed a decrease in theta following the manipulation, and for the spuriously high manipulation, manipulated response vectors showed an increase in theta following the manipulation. This would be expected, because the purpose of the manipulations was to model response vectors with lower or higher scores, respectively. After the aberrance manipulations, however, estimates of theta changed even for the response vectors that were not part of the manipulation, though these changes were smaller in magnitude than the changes in response vectors that were part of the manipulations. For the spuriously low manipulation, estimates of theta for unchanged response vectors increased, and for the spuriously high manipulation, estimates of theta for unchanged response vectors decreased. The change in estimates of theta for the response vectors that were not part of the manipulation is evidence that the introduction of aberrance into a data set changed the item response theory model of the data, such that person parameters were distorted. Because person parameters were distorted by the introduction of aberrance, it was expected that item parameters would also be distorted, since person and item parameters are estimated simultaneously in an iterative process using the maximum likelihood estimation procedure in BILOG.
Table 5A presents the results of comparisons between estimates of item parameters based on 30% aberrant and aberrance-free data sets with a "true" reliability of .90. For both types of aberrance manipulations, discrimination parameters decreased after the introduction of aberrance. This means that items were providing less information about the relative standing of the response vectors after the introduction of aberrance. Item difficulties increased after the introduction of spuriously low aberrance and decreased after the introduction of spuriously high aberrance. This pattern of results is also expected based on the nature of the aberrance manipulations, because the spuriously low manipulation resulted in a higher number of incorrect answers for an item and the spuriously high manipulation resulted in a higher number of correct answers for an item, as compared to the aberrance-free data set. The guessing parameter showed a mixed pattern of change across both types of aberrance manipulations.

These changes in item and person parameters after the introduction of aberrance suggest that the reason the lz statistic does not identify a large proportion of the aberrant response vectors is that the IRT model of the data is itself corrupted by the presence of aberrance. As there is more aberrance in the data set as a whole, it would be expected that relatively less of that aberrance would be detected, because the IRT model of item responses would reflect more of the aberrance in the data set, and so fewer of the response vectors would differ from the model enough to obtain an extreme lz score. This would explain the low detection rates of the lz statistic in the present study and the tendency for a relatively lower proportion of aberrance to be detected as the overall amount of aberrance in the data set increased. Additional support for this explanation can be found in Appendix A, which contains average item parameter values for each of the tests that were simulated. As the amount of aberrance in a data set increases, the average item discrimination tends to decrease (except in the case of data sets with a "true" reliability of .70). For spuriously low aberrant data sets, as the amount of aberrance increases, the average item difficulty increases. For spuriously high aberrant data sets, as the amount of aberrance increases, the average item difficulty decreases. The guessing parameter does not show as clear a pattern of change across levels of aberrance.

In addition to explaining the low detection rates of aberrance in this study, the distorted IRT model also explains why aberrance, as simulated in this study, did not have a large effect on test properties. The introduction of aberrance caused the item and person parameter estimates to change so that the resulting maximum likelihood model fit the data even with the inclusion of aberrant response vectors. This means that aberrant response vectors were not distinguished from response vectors that were free of aberrance in the IRT model. It is likely, then, that aberrant response vectors were also not distinguished from aberrance-free response vectors when the data were used to calculate test reliability and validity.
An additional way to think about this explanation is in terms of the amount of information provided by the item response theory model. If there is a large amount of aberrance in a data set, the overall amount of information contained in the IRT model may decrease. This means that there is less information with which to determine whether a particular response vector departs from the model. The decrease in item discrimination values observed for both types of aberrance manipulations in the post hoc analyses, and the correspondingly low detection rates for the lz statistic in this study, support this interpretation. In addition, there was a tendency toward higher standard errors for both item parameters and person parameters in the aberrant data sets as compared to the aberrance-free data set. This indicates that there is more error variance (and less meaningful variance) in the IRT model for the aberrant data sets than for the aberrance-free data set.

In addition to explaining the low detection of aberrance using the lz statistic, the lack of information in the IRT model for the aberrant data sets also explains why aberrance, as simulated in this study, did not have a large effect on test properties. If less information is available to determine whether a response vector is aberrant, the information-deficient model may nevertheless fit the data well enough that internal consistency estimates are not greatly affected, which matches the findings of this study. If there is a sufficiently large number of individuals with the same type of aberrance, as in this study, the covariation of responses to items with similar parameter values may be maintained despite the inaccuracy of the response pattern as a measure of the individual's standing on the construct of interest. In the post hoc analyses, despite the changes in the magnitude of item parameters due to the introduction of aberrance, the rank order of the values of these parameters was not drastically altered, providing some support for this explanation.

The results with respect to the small effects of aberrance on validity are slightly more difficult to explain, because aberrance had no influence on criterion scores. Again, however, the information-deficient model may be adequate for assessing the covariation of total test scores with a criterion without being an accurate representation of the respondent's standing on the construct measured by the test. If there is a large number of respondents with the same type of aberrance manipulation, as in this study, the rank order of individuals based on total test score may not change much, which would result in small changes in validity. The post hoc analyses reveal that this seemed to be the case: although the magnitude of estimates of theta changed with the introduction of aberrance, the rank order of response vectors did not change drastically, providing support for this explanation.

This explanation can also account for why there were greater changes in validity than in reliability due to the introduction of aberrance. Validity was based on the relationship of rank orders on test score, which changed due to the introduction of aberrance, with rank orders on criterion score, which did not change. Reliability was based on covariation among items within the test, which changed in similar ways with the introduction of a particular form of aberrance. There was more potential for change in validity than in reliability after aberrance was introduced because the distortion from the introduction of aberrance did not have as uniform an effect on the components of the validity calculation as it did on the components of the reliability calculation.
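The information argument above can be made concrete with the standard three-parameter logistic item information function (Birnbaum, 1968). The sketch below compares the total test information at a fixed ability level before and after a drop in average item discrimination of roughly the size observed in Table 5A. The specific difficulty and guessing values, and the helper names, are illustrative assumptions, not the parameters estimated in this study.

    import numpy as np

    def item_info_3pl(theta, a, b, c, D=1.7):
        """Item information under the 3PL model: I(theta) = P'(theta)^2 / (P * Q),
        where P = c + (1 - c) / (1 + exp(-D * a * (theta - b))) and Q = 1 - P."""
        L = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        P = c + (1.0 - c) * L
        Q = 1.0 - P
        P_prime = D * a * (1.0 - c) * L * (1.0 - L)
        return P_prime ** 2 / (P * Q)

    def test_info(theta, a, b, c):
        """Test information is the sum of item information over the test."""
        return np.sum(item_info_3pl(theta, np.asarray(a), np.asarray(b), np.asarray(c)))

    # Illustrative 50-item test evaluated at theta = 0: lowering the average
    # discrimination from about .84 to .76 (the shift seen in Table 5A) reduces
    # the total information available for separating well-fitting from aberrant
    # response vectors at that ability level.
    n_items, theta0 = 50, 0.0
    b = np.linspace(-2, 2, n_items)
    c = np.full(n_items, 0.16)
    print(test_info(theta0, np.full(n_items, 0.84), b, c))
    print(test_info(theta0, np.full(n_items, 0.76), b, c))

Under these assumed parameter values, the second total is smaller than the first, which is consistent with the interpretation that an aberrance-distorted calibration leaves less information with which the lz statistic can flag misfitting response vectors.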
The pattern of results showing small changes in the rank order of person and item parameters with the introduction of a single type of aberrance might be expected to differ if multiple types of aberrance were present in the same data set. If this were true, there would be reason to expect larger effects of aberrance on reliability and validity in the presence of mixed types of aberrance. This does not match the findings of this study, which showed that mixed aberrance generally had an even smaller effect on reliability and validity than the same overall amount of aberrance of a single type. These results do not necessarily contradict the explanation described here, however, if the effects of the two types of aberrance canceled each other out within the data set. If this were the case, the resulting IRT model would show adequate fit to the data because the effects of the two types of aberrance would largely cancel out. This account is consistent with the smaller effects of mixed aberrance found in this study without negating the lack of change in the rank order of item and person parameters found for the introduction of a single type of aberrance. Additional evidence for this explanation can be found in Appendix A, which gives the average item parameter values for each of the tests simulated in this study. The data sets with mixed aberrance show smaller changes in item difficulty than tests with a single type of aberrance, indicating that the effects of the two types of aberrance may cancel out.

If this explanation is correct, the implication of these findings is that the lz statistic is not an adequate indicator of aberrance when the true values of item and person parameters are unknown and the data set contains a moderate to large proportion of aberrant response vectors. In this case, the aberrant data set must be used to estimate the IRT model of the data that is necessary for the calculation of lz values. This model will be distorted by the presence of aberrance and, as a result, will fit the aberrant data set because the item and person parameters are based on the aberrant data rather than on the "true values" of the item and person parameters. This means that it will be difficult to identify which response vectors are aberrant and which are free of aberrance using scores on the lz statistic. Because the true values of item and person parameters are seldom known in real testing situations, the lz statistic may not be very useful for the purpose of identifying aberrant response vectors for removal from analyses to increase the accuracy of estimates of test properties.

These findings also have implications for the use of the lz index as a diagnostic indicator of individuals who may need special instruction or special testing conditions in order for tests to be representative measures of their standing on the construct of interest. If there is a large number of aberrant responders within the data set, then this application of the lz statistic would be flawed for the same reasons given above. If there are only a few aberrant responders in a large sample, then the lz statistic, as well as some of the other appropriateness indices described earlier, may be more useful for detecting these individuals, because the aberrance would not produce a large distortion in the IRT model of the test data, so the few aberrant individuals in the sample would stand out from the rest of the group. This application should still be used cautiously, however, because there may not be adequate evidence to determine the likely proportion of aberrant responders in a particular sample.
Limitations and Contributions

Although this study sought to explain the reason(s) for the modest results found for one proposed application of the lz statistic, there were several limitations in the procedures used that require caution in the interpretation of the results presented here. As outlined above, the use of item and person parameters from data sets with aberrance present may distort the item response theory model of the data such that the effects of aberrance on test properties and on appropriateness indices are difficult to disentangle. Some discussion of the nature of these effects was offered, but without a comparison with IRT models of the data sets estimated using true item parameter values and/or true values of theta, the exact nature of the influence of the aberrance simulations used here on test properties and appropriateness indices remains unclear.

A second limitation of this study has to do with the way that aberrance was simulated in the data sets. In this study, aberrance was introduced randomly, without regard for the true values of the item or person parameters. In a real testing situation, item difficulties and respondents' ability on the construct of interest are likely to influence the nature of their responses in terms of the types of aberrance described here. For example, a high-ability individual who is answering moderately difficult items will be less likely to guess randomly or to cheat than an individual with low ability who is working on items that are perceived as being too difficult. Because of the other difficulties in identifying the influence of aberrance on test data that were discussed earlier, it is unclear how aberrance that is correlated with ability or item difficulty would influence the results reported here. In addition, all response vector manipulations to simulate aberrant responding involved changing 30% of the item responses within the vector (i.e., 15 out of 50 responses). Although the pattern of results reported here is not expected to differ dramatically if the proportion of responses manipulated within a response vector were changed, this is a possibility that could be explored in future research.

Despite these limitations, the present study makes several contributions to research on test appropriateness. This study examines an application of appropriateness indices that has been discussed in the prior literature but has received little empirical support. The use of simulated data in this study removed the limitations of sample size, test length, and amount of aberrance in the prior research using the lz statistic to flag aberrance for removal from analyses. The present study included a broader range of simulated testing conditions than any of the prior research on appropriateness statistics, such as varying levels and types of aberrance and varying values of test reliability and validity. This procedure allowed the examination of potential reasons for the modest results found in prior research when the response vectors flagged as aberrant using the lz statistic were removed from analyses and validities were recalculated. The results showed that the nature of the estimation process influences the utility of the lz statistic for identifying aberrant response vectors.
In real testing conditions, it may not be possible to achieve the conditions necessary to ensure adequate aberrance detection when there are a large number of aberrant response vectors present. Future research should attempt to determine more clearly how the item response theory estimation process is influenced by the inclusion of varying amounts of aberrance in a data set and under what conditions the lz statistic should be used to identify aberrance in real testing situations.

APPENDICES

APPENDIX A
Test Summary Statistics

Table 1A - Test Statistics for Reliability Near .70

No Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
0             26.91   6.13   .05    .552   .894    .256

Low Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            26.39   6.15   .05    .566   1.006   .262
20            25.90   6.19   .05    .599   1.075   .270
30            25.40   6.19   .05    .596   1.142   .271
40            24.90   6.09   .04    .637   1.229   .281
50            24.39   6.04   .04    .622   1.344   .281

High Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            27.60   6.33   .05    .549   .677    .245
20            28.31   6.47   .05    .524   .549    .239
30            28.98   6.56   .06    .517   .340    .225
40            29.68   6.52   .06    .511   .196    .219
50            30.38   6.42   .05    .469   .081    .219

Mixed Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
15L, 15H      26.99   6.12   .05    .541   .839    .253

Note. r_ii = mean inter-item correlation. a = mean item discrimination. b = mean item difficulty. c = mean item guessing.

Table 2A - Test Statistics for Reliability Near .80

No Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
0             28.45   7.56   .09    .651   .308    .219

Low Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            27.91   7.59   .09    .665   .401    .222
20            27.34   7.54   .08    .656   .518    .229
30            26.77   7.37   .08    .633   .542    .225
40            26.24   7.32   .07    .648   .678    .232
50            25.65   7.13   .07    .632   .777    .237

High Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            29.10   7.60   .09    .639   .180    .211
20            29.74   7.63   .09    .610   .101    .210
30            30.39   7.65   .09    .622   .028    .216
40            31.04   7.50   .09    .587   -.121   .207
50            31.69   7.35   .08    .570   -.205   .210

Mixed Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
15L, 15H      28.46   7.46   .08    .629   .295    .220

Note. r_ii = mean inter-item correlation. a = mean item discrimination. b = mean item difficulty. c = mean item guessing.

Table 3A - Test Statistics for Reliability Near .90

No Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
0             26.92   9.53   .15    .845   .227    .157

Low Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            26.40   9.38   .15    .815   .296    .161
20            25.91   9.30   .14    .786   .353    .163
30            25.40   9.12   .13    .763   .413    .165
40            24.88   8.91   .12    .740   .520    .168
50            24.38   8.67   .12    .699   .574    .168

High Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            27.61   9.52   .15    .832   .171    .157
20            28.28   9.52   .15    .802   .122    .161
30            28.99   9.36   .15    .762   .035    .159
40            29.67   9.18   .14    .744   -.001   .171
50            30.40   8.94   .13    .721   -.111   .166

Mixed Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
15L, 15H      27.00   9.29   .14    .780   .239    .163

Note. r_ii = mean inter-item correlation. a = mean item discrimination. b = mean item difficulty. c = mean item guessing.

APPENDIX B
Change in Theta Due to 30% Aberrance

Table 4A - Change in Theta Due to 30% Aberrance

Total Data Set (n = 10,000)
Aberrance      Mean    SD     Mean change
None           .03     .94    -
30% Low        .01     .93    -.02
30% High       .05     .94    +.02

Changed Vectors Only (n = 3,000)
Aberrance      Mean    SD     Mean change
None (Low)     .03     .94    -
30% Low        -.38    .71    -.41
None (High)    .04     .95    -
30% High       .49     .75    +.45

Unchanged Vectors Only (n = 7,000)
Aberrance      Mean    SD     Mean change
None (Low)     .03     .94    -
30% Low        .17     .96    +.14
None (High)    .03     .94    -
30% High       -.14    .95    -.17
Note. Mean change represents the change in the mean theta value for the response vectors in the aberrant data set relative to the mean theta value in the data set with no aberrance. The comparisons for the changed (unchanged) vectors in the aberrant data sets use the same 3,000 (7,000) response vectors from the aberrance-free data set (i.e., the same vectors before the aberrance manipulation). The set of 3,000 (7,000) vectors is different for the spuriously low and spuriously high manipulations.

APPENDIX C
Change in Item Parameters Due to 30% Aberrance

Table 5A - Change in Item Parameters Due to 30% Aberrance

Item Discrimination (a)
Aberrance      Mean    SD     Mean change
None           .84     .21    -
30% Low        .76     .15    -.08
30% High       .76     .18    -.08

Item Difficulty (b)
Aberrance      Mean    SD     Mean change
None           .23     .93    -
30% Low        .41     .89    +.18
30% High       .04     .89    -.19

Pseudo-Guessing Parameter (c)
Aberrance      Mean    SD     Mean change
None           .16     .04    -
30% Low        .17     .05    +.01
30% High       .16     .04    +.00

Note. Mean change represents the change in the mean item parameter value in the aberrant data set relative to the mean item parameter value in the data set with no aberrance. The comparisons are based on the total data set (n = 10,000).

LIST OF REFERENCES

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan Publishing Company.

Birenbaum, M. (1986). Effect of dissimulation motivation and anxiety on response pattern appropriateness measures. Applied Psychological Measurement, 10, 167-174.

Birenbaum, M. (1985). Comparing the effectiveness of several IRT based appropriateness measures in detecting unusual response patterns. Educational and Psychological Measurement, 45, 523-534.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley Publishing Company.

Cortina, J. M. (1994). On the meaning and measurement of test appropriateness. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Donlon, T. F., & Fischer, F. E. (1968). An index of an individual's agreement with group-determined item difficulties. Educational and Psychological Measurement, 28, 105-113.

Drasgow, F. (1982a). Choice of test model for appropriateness measurement. Applied Psychological Measurement, 6, 297-308.

Drasgow, F. (1982b). Biased test items and differential validity. Psychological Bulletin, 92, 526-531.

Drasgow, F., & Guertler, E. (1987). A decision-theoretic approach to the use of appropriateness measurement for detecting invalid test and scale scores. Journal of Applied Psychology, 72, 10-18.

Drasgow, F., & Levine, M. V. (1986). Optimal detection of certain forms of inappropriate test scores. Applied Psychological Measurement, 10, 59-67.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Frederiksen, N. (1977). How to tell if a test measures the same thing in different cultures. In Y. H. Poortinga (Ed.), Basic Problems in Cross-Cultural Psychology (pp. 14- ). Amsterdam: Swets & Zeitlinger, B.V.

Harnisch, D. L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20, 191-206.
Harnisch, D. L., & Linn, R. L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18, 133-146.

Harnisch, D. L., & Tatsuoka, K. K. (1983). A comparison of appropriateness indices based on item response theory. In R. K. Hambleton (Ed.), Applications of Item Response Theory (pp. 104-122). Vancouver: Educational Research Institute of British Columbia.

Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology, 75, 581-595.

Johanson, G. A. (1992). IRTDATA: An interactive or batch Pascal program for generating logistic item response data. Applied Psychological Measurement, 16, 52.

Levine, M. V., & Drasgow, F. (1988). Optimal appropriateness indices. Psychometrika, 53, 161-176.

Levine, M. V., & Drasgow, F. (1983). Appropriateness measurement: Validating studies and variable ability models. In D. J. Weiss (Ed.), New Horizons in Testing (pp. 109-131). New York: Academic Press.

Levine, M. V., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42-56.

Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple choice test scores. Journal of Educational Statistics, 4, 269-290.

Mislevy, R. J., & Bock, R. D. (1990). Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.

Noonan, B. W., Boss, M. W., & Gessaroli, M. E. (1992). The effect of test length and IRT model on the distribution and stability of three appropriateness indexes. Applied Psychological Measurement, 16, 345-352.

Parsons, C. K. (1983). The identification of people for whom Job Descriptive Index scores are inappropriate. Organizational Behavior and Human Performance, 31, 365-393.

Peters, T. (1995). BIDEV: Introducing aberrance in normal response vectors [Computer program]. East Lansing, MI: Michigan State University, Applications Programming Department.

Peters, T. (1994). XTRACT: Converting data from BILOG for use in calculating appropriateness statistics [Computer program]. East Lansing, MI: Michigan State University, Applications Programming Department.

Peters, T. (1993). LZCALC: Computing appropriateness indices on data sets [Computer program]. East Lansing, MI: Michigan State University, Applications Programming Department.

Reilly, R. R., & Chao, G. T. (1982). Validity and fairness of some alternative employee selection procedures. Personnel Psychology, 35, 1-62.

Reise, S. P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19, 213-229.

Reise, S. P., & Due, A. M. (1991). The influence of test characteristics on the detection of aberrant response patterns. Applied Psychological Measurement, 15, 217-226.

Rudner, L. M. (1983). Individual assessment accuracy. Journal of Educational Measurement, 20, 207-219.

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.

Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43, 627-670.

Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences associated with different measures of some common job-relevant constructs. In C. L. Cooper & I. T. Robertson (Eds.), International Review of Industrial and Organizational Psychology (Vol. 11, pp. 115-139). Sussex, England: John Wiley & Sons, Ltd.

Schmitt, N., Clause, C. S., Whitney, D. J., Futch, C. J., & Pulakos, E. D. (1994). Appropriateness fit, social desirability, carelessness, and test validity. Unpublished manuscript, Michigan State University, East Lansing, MI.

Schmitt, N., Cortina, J. M., & Whitney, D. J. (1993). Appropriateness fit and criterion-related validity. Applied Psychological Measurement, 18, 143-150.

Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Metaanalyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.

Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika, 49, 95-110.

Tatsuoka, K. K., & Linn, R. L. (1983). Indices for detecting unusual patterns: Links between two general approaches and potential applications. Applied Psychological Measurement, 7, 81-96.

Tatsuoka, K. K., & Tatsuoka, M. M. (1982). Detection of aberrant response patterns and their effect on dimensionality. Journal of Educational Statistics, 7, 215-231.

van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13, 267-298.

van der Flier, H. (1977). Environmental factors and deviant response patterns. In Y. H. Poortinga (Ed.), Basic Problems in Cross-Cultural Psychology (pp. 30-35). Amsterdam: Swets & Zeitlinger, B.V.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.