This is to certify that the thesis entitled THE EFFECTS OF TEST APPROPRIATENESS ON RELIABILITY AND VALIDITY, presented by Catherine S. Clause, has been accepted towards fulfillment of the requirements for the M.A. degree in Psychology.

Major professor: Neal Schmitt

THE EFFECTS OF TEST APPROPRIATENESS ON RELIABILITY AND VALIDITY

By

Catherine S. Clause

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Department of Psychology

1996

ABSTRACT

THE EFFECTS OF TEST APPROPRIATENESS ON RELIABILITY AND VALIDITY

By Catherine S. Clause

Test appropriateness indicates the degree to which a test accurately represents a person's ability on the construct purported to be measured by the test. Research on test appropriateness has previously concentrated on comparing the accuracy of different indices in detecting inappropriateness under a narrow set of conditions. The present study examines the application of the l_z index of test appropriateness for the purpose of flagging response vectors for removal from analyses to improve estimates of test properties. The effects and the detection of aberrance under a variety of simulated testing conditions were examined. Results indicate that aberrance, as simulated in this study, does not have a large effect on test properties and that the l_z statistic does not adequately detect aberrance when there are a relatively large number of aberrant response vectors in a data set. Potential reasons for these findings and implications for applications of test appropriateness indices are discussed.

Copyright by Catherine Sue Clause 1996

ACKNOWLEDGMENTS

This thesis would not have been possible without the assistance and support of a number of people. First, I would like to thank my committee members, Rick DeShon and Ralph Levine, for their insightful comments that helped me to improve the contents of this thesis. I am very grateful to Tom Peters for writing several of the computer programs that made my data analysis possible, based on the somewhat ambiguous specifications I gave him. I would also like to thank my parents for all of their support and for always encouraging me to do what I needed to do, even if it meant going to graduate school up north. My friends Leslie Hoffman and Kelli Pursell were indispensable during this process, and this is just to make it official: I owe both of you one (or two or twenty). Finally, I want to thank my committee chairperson, Neal Schmitt, for all of the guidance he has given me during my time in graduate school. He is not only one of the most intelligent and hardest working people I know, he is also one of the most patient.
TABLE OF CONTENTS

LIST OF TABLES
OVERVIEW
INTRODUCTION
    Research on Test Appropriateness
    The l_z Statistic
    Applications of Appropriateness Indices
    Research Questions
METHOD
    Data Generation
    Analyses
RESULTS
    Manipulation Checks
    Research Questions 1 and 2: Effects of Aberrance on Reliability
    Research Questions 1 and 2: Effects of Aberrance on Validity
    Research Question 3: l_z Detection Rates and Aberrance Removal
DISCUSSION
    Implications of the Present Study for Previous Research
    A Potential Explanation for the Present and Previous Findings
    Limitations and Contributions
APPENDIX A - Test Summary Statistics
APPENDIX B - Change in Theta Due to 30% Aberrance
APPENDIX C - Change in Item Parameters Due to 30% Aberrance
LIST OF REFERENCES

LIST OF TABLES

Table 1 - Summary Statistics for Aberrance-Free Data Sets
Table 2 - Correlations with l_z Scores for Data Sets with Reliability Near .70
Table 3 - Correlations with l_z Scores for Data Sets with Reliability Near .80
Table 4 - Correlations with l_z Scores for Data Sets with Reliability Near .90
Table 5 - Change in Reliability Due to the Introduction of Aberrance
Table 6 - Change in Validity Due to Aberrance for Reliability Near .70
Table 7 - Change in Validity Due to Aberrance for Reliability Near .80
Table 8 - Change in Validity Due to Aberrance for Reliability Near .90
Table 9 - Average Change in Validity by Reliability Across Aberrance
Table 10 - Average Change in Validity by Reliability Across Validity
Table 11 - Proportion of Total Aberrance Detected Using the l_z Statistic
Table 12 - Change in Reliability Due to Removal of Aberrance
Table 13 - Change in Validity Due to Removal of Aberrance for r_xx Near .70
Table 14 - Change in Validity Due to Removal of Aberrance for r_xx Near .80
Table 15 - Change in Validity Due to Removal of Aberrance for r_xx Near .90
Table 1A - Test Statistics for Reliability Near .70
Table 2A - Test Statistics for Reliability Near .80
Table 3A - Test Statistics for Reliability Near .90
Table 4A - Change in Theta Due to 30% Aberrance
Table 5A - Change in Item Parameters Due to 30% Aberrance

OVERVIEW

Test appropriateness indicates the degree to which a test accurately reflects the standing of a respondent on the construct purported to be measured by the test. Various indices have been proposed to measure the degree to which a test is appropriate for a particular respondent (Levine & Rubin, 1979; Harnisch & Linn, 1981; Birenbaum, 1986). Research relating to test appropriateness has largely concentrated on comparing the accuracy of different indices in detecting particular types of test inappropriateness, or aberrance, in a data set. The recommended uses of these indices include individual-level diagnosis of respondents who may need to be retested or trained in test-taking strategies (Drasgow, 1982a; Harnisch, 1983). Another application mentioned by some researchers is to identify response vectors in a validation sample that may be distorting estimates of the psychometric properties of a test (Parsons, 1983; Birenbaum, 1985). There has been little research demonstrating how useful appropriateness indices are for this latter purpose (for an exception, see Schmitt, Cortina, & Whitney, 1993). This study examines the effects of removing aberrant responders from a test validation sample on estimates of the psychometric properties of the test.
Estimates of the reliability and validity of a test are compared for samples containing no aberrance, samples containing aberrance, and samples with aberrant respondents removed from the analysis. The standardized appropriateness index developed by Drasgow, Levine, and Williams (1985), 1,, is used to flag aberrant response vectors for removal from data sets. 2 Various types and levels of aberrance are simulated to examine the magnitude of the effects of aberrance on test reliability and validity under different conditions. In addition, the efl‘ect of using Iz to identify aberrant response vectors versus working with known levels of aberrance in a data set is examined. This paper begins by describing the concept of test appropriateness and the reason why this concept is of interest in the development and use of multiple choice tests. Next, the previous literature on test appropriateness is reviewed. Areas in need of further investigation are highlighted. Based on the literature review and the discussion of why test appropriateness is of interest, research questions about the efl‘ects of test inappropriateness on estimates of the psychometric properties of tests are developed. Following the presentation of research questions, the study is described, including the procedure for simulating the data sets and the analyses that are used for these data. Results are presented and discussed in terms of the effects of aberrance on test properties and the utility of using the 1, statistic to identify aberrant response vectors for removal from analyses to improve estimates of test reliability and validity. INTRODUCTION The purpose of administering a test is to collect a measure of an ability or trait of interest for predicting or explaining some behavior. In line with this purpose, tests have been described as measuring samples of behavior (Anastasi, 1988). The accuracy of a test for assessing a respondent's standing on the construct of interest is important in a variety of situations, including making selection or placement decisions in an employment context, measuring educational achievement, and clinically diagnosing mental or physical disorders. There are many reasons that a test may be a less than perfect measure of the ability or trait of interest. These reasons may have to do with the test itself (e.g. poor reliability, poorly stated items), such that the test is a poor measure of the construct of interest for all respondents. A test may also be a less than optimal measure of a construct of interest for only a certain proportion of the respondents. There has been a great deal of research on item bias and subgroup difi'erences in test scores and on subgroup difl'erences in criterion prediction using test scores (e.g., Sackett & Wilk, 1994; Schmitt, Clause, & Pulakos, 1996). The implications of this research have been made more apparent with recent legislation such as the Civil Rights Act of 1991 and with controversy such as that generated by the publication of The Bell Curve in 1994. In addition to what is described in this literature, there is another way in which test scores may be suboptimal measures for a subset of respondents. Research on test appropriateness has dealt with the issue of identifying individuals for whom test scores 4 are inaccurate representations of standing on the construct of interest. 
This research has largely focused on ability tests with dichotomously scored items, although implications for personality inventories and continuous-scale measures have been discussed (Parsons, 1983; Reise, 1995). A psychometrically adequate test is considered inappropriate for a particular respondent to the extent that the individual's item responses do not fit with the pattern expected based on group-determined item characteristics and estimates of the individual's ability on the construct of interest (van der Flier, 1977; Rudner, 1983). Test score inappropriateness occurs either when a high-ability respondent answers a subset of relatively easy items incorrectly, or when a low-ability respondent answers a subset of relatively difficult items correctly. A number of possible reasons have been given to explain why a respondent's test score may not be an accurate representation of their standing on the construct of interest (e.g., Wright, 1977; Levine & Rubin, 1979; Levine & Drasgow, 1982; Birenbaum, 1985). l A respondent who is high on the construct of interest measured by the test may receive a spuriously low score due to alignment errors on an answer sheet, use of suboptimal test- taking strategies, and/or careless responding due to fatigue or low motivation to take the test. A respondent who is low on the construct of interest measured by the test may receive a spuriously high score due to obtaining the answers from a high ability respondent and/or from test-specific coaching received prior to taking the test. Inaccurate assessment is problematic for both the individual taking the test and for the individual or group making use of the test scores. For example, in a personnel selection situation, overestimates of ability may result in hiring poorly qualified candidates. This hiring decision may contribute to productivity losses for the organization and to undue stress for the under-qualified individual as they attempt to perform job duties. On the other hand, underestimates of ability may also result in losses 5 of productivity for the organization due to the greater efi‘ort and expense needed to locate qualified applicants and the untapped resources of qualified individuals who are not hired. For the individual taking the test, underestimates of ability lead to the loss of job opportunities. Test score appropriateness is also an issue for test developers. Ifthere are a number of respondents in a validation sample whose test scores are inaccurate representations of their ability levels on the construct measured by the test, then estimates of the psychometric properties of the test may be distorted. If individuals in a validation study are poorly motivated to take the test, for example, then there may be a large number of respondents with spuriously low test scores due to careless or random responding. This may often be the case in concurrent criterion-related validation studies, in which job incumbents are asked to take a selection test and usually receive no benefit for doing so. When test scores derived in the above, or similar, situations are correlated with performance measures, the results may indicate that test scores are not strongly related to performance, when, in fact, the ability measured by the test is a good predictor of performance. Removal of individuals with spuriously low or spuriously high test scores from the validation sample may provide more accurate estimates of the psychometric properties of a test (i.e., reliability and validity). 
This contributes, in turn, to more accurate assessment of the usefulness of the test for personnel decisions such as selection or placement. The purpose of this study is to examine the effect of removing respondents with inappropriate test scores from validation samples on the reliability and validity of the test. This procedure is different from deleting "outliers" from the validation sample because appropriateness scores are not necessarily an indication of outlying test scores in relation to criterion performance. The identification of outliers in a validation sample depends on both the test and the criterion scores of that individual, and removal of outliers will necessarily inflate subsequent estimates of test validity. We have no way of knowing, however, whether the cases removed were inadequately assessed, or whether the scores are accurate and reflect an absence of predictor-criterion relationship. The identification of respondents with inappropriate test scores is completely internal to the test and does not depend on the criterion scores of the respondents (Schmitt, Cortina, & Whitney, 1993). Rather, an appropriateness index indicates the degree to which an individual's responses to particular test items are inconsistent with an estimate of their standing on the construct of interest. This index is calculated from the total set of item responses and the group-determined item parameters for the test (Drasgow & Guertler, 1987). The previous work in this area has found less than promising results when examining the effects of removing aberrant responders on the validity of selection measures (Schmitt, Cortina, & Whitney, 1993; Schmitt, Clause, Whitney, Futch, & Pulakos, 1994). There are several possible reasons for this. The first possibility is that the number of respondents flagged as having inappropriate test scores was too low to have a great effect on the validity of the test in these samples. This could be due either to a low number of aberrant responders in the sample or to low power of the appropriateness index to detect aberrance in the sample. The present study attempts to examine this possibility by simulating data sets with a variety of magnitudes of aberrance present (i.e., proportion of the sample with inappropriate test scores). Comparisons of data sets with no aberrance, data sets with aberrance, and data sets with aberrant response vectors removed are used to determine the effects of magnitude of aberrance and of detection power of the appropriateness index on estimates of test properties. Another possibility is that the effects of aberrant responding on test validity are only indirect, and are thus difficult to detect when only examining validity. Test appropriateness indices may be detecting item response patterns that display deviations from the overall internal consistency of the test, which is only indirectly related to validity. Since validity is linearly related to the square root of test reliability rather than directly to test reliability, large effects of aberrance on reliability may not have a large effect on validity (Parsons, 1983). The present study examines changes in both reliability and validity due to the presence or absence of respondents with inappropriate test scores in the validation sample. This is done in order to better determine the nature of the relation between test score inappropriateness and the psychometric properties of tests.
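The claim that validity depends on the square root of test reliability follows the classical attenuation relation. The formula below is textbook classical test theory rather than an equation appearing in the thesis itself:

r_{xy} = \rho_{T_x T_y} \sqrt{r_{xx}\, r_{yy}}

Here r_{xy} is the observed test-criterion correlation (validity), \rho_{T_x T_y} is the correlation between the true scores of test and criterion, and r_{xx} and r_{yy} are the reliabilities of the test and the criterion. Because the test's reliability enters only through its square root, even a sizable change in reliability moves validity modestly: a decline in r_{xx} from .90 to .80, for example, multiplies the expected observed validity by \sqrt{.80/.90}, or about .94.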
Research on Test Appropriateness Early testing research was based on the assumption that any test of particular content was measuring only the construct of interest and was measuring it in the same way for all respondents (Harnisch & Tatsuoka, 1983; Cortina, 1994). Later research recognized that test scores could be reflecting things other than the individual's standing on the construct of interest. Characteristics of the test, such as the inclusion of extraneous content (e.g., reading ability required on a math test) or the social desirability of particular responses, could influence the way items are answered (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990). Characteristics of the test-takers, such as literacy or familiarity with Western culture or tests, could also influence scores beyond the respondent's ability on the construct of interest (F rederiksen, 1977; van der Flier, 1977). A test score may be thought of as inappropriate to the extent that it reflects something other than the construct of interest. A subset of respondents may have inappropriate test scores if the primary source of variation in item responses is something other than that influencing the test scores of the reference group used to estimate test or item parameters (van der Flier, 1982; Cortina, 1994). Based on this discussion, test inappropriateness can be thought of as an interaction between characteristics of the test 8 leading to the group-determined item parameters for the test and characteristics of the respondent leading to the pattern of item responses for that respondent (Cortina, 1994). There have been numerous indices proposed to measure the degree that a person's test score is an inappropriate indication of their standing on a construct. These indices can be generally divided into two groups. The first group of indices are directly based on observed patterns of correct and incorrect item responses. Examples of indices from this first group include Donlon and Fischer's ( 1968) personal biserial coemcient, van der F lier’s (1977; 1982) [1' index, Hamisch and Linn's (1981) extended caution index, and Tatsuoka and Tatsuoka's (1982) norm-conformity index (N CI). These statistics will not be discussed in detail here, but for a review and comparison of the difl‘erent indices, see Hamisch and Linn (1981). Research has tended to focus on the second group of test appropriateness indices, especially in the last 15 years. These indices are based on item response theory (IRT) models of the test response data. Examples of indices from this group include fit statistics based on the Rasch IRT model, described by Wright and his colleagues (Wright & Panchapakesan, 1969; Wright, 1977). Another example of IRT-based statistics are the extended caution indices presented by Tatsuoka and Linn (1983; Tatsuoka, 1984) that make use of IRT models for deriving probability matrices for sample characteristics. Finally, the appropriateness statistics developed by Levine, Drasgow, and their associates are based on maximum likelihood functions using the three-parameter logistic IRT model (e.g., Levine & Rubin, 1979; Levine & Drasgow, 1982; Drasgow, Levine, & Williams, 1985) Several researchers have compared these IRT-based indices to determine which are the most useful based on specific criteria (Rudner, 1983; Birenbaum, 1985; Drasgow, Levine, & McLaughlin, 1987; 1991). 
This work has been largely statistical in nature, concentrating on the degree to which appropriateness indices can identify aberrant response patterns of different types in a standardized manner across ability levels. This work has provided evidence that the appropriateness indices developed by Drasgow, Levine, and colleagues are among the most accurate in identifying aberrant response patterns. Rudner (1983) reviewed several unstandardized appropriateness statistics and indicated that the index developed by Levine and Rubin (1979), l_0, had high hit rates for classification of aberrant response vectors across a variety of ability levels. One problem with unstandardized appropriateness indices, however, is that they tend to be correlated with total test score (Cortina, 1994), leading to a need for standardization of appropriateness indices. The standardized version of the l_0 index, l_z, has been identified as having the lowest overall misclassification rate for normal and aberrant response patterns among several IRT-based indices (Birenbaum, 1985). Drasgow, Levine, and McLaughlin (1987) compared IRT-based indices to each other and to an optimal appropriateness index developed for research purposes to model the highest detection rates possible for a specific type of aberrance (Drasgow & Levine, 1986; Levine & Drasgow, 1988). Their results indicated that the l_z index, as well as two of the standardized extended caution indices, had detection rates closest to that of the optimal index. Later research demonstrated that while all three of these indices were reasonably well standardized across a broad range of ability levels, the l_z index had slightly higher rates of detection for both aberrant and normal response vectors, particularly for spuriously high response vectors (Drasgow, Levine, & McLaughlin, 1991). Because of the results of the research comparing various appropriateness indices, this study focuses on the l_z statistic as an indicator of aberrance in a response pattern.

The l_z Statistic

The l_z index is the standardized version of the l_0 statistic first presented by Levine and Rubin (1979). The l_0 index is the logarithm of the likelihood function evaluated at the maximizing value of theta (Drasgow, Levine, & Williams, 1985). In other words, the l_0 statistic indicates the degree to which a given response pattern contributes to the maximum likelihood function for the three-parameter IRT model of the test. Small values indicate aberrance because the likelihood of the aberrant response pattern (i.e., very easy items incorrect or very difficult items correct) is low for the level of ability, as estimated by the total response vector and the group-determined item parameters (Drasgow, 1982a). Two steps are required for computing the l_0 statistic (Levine & Drasgow, 1983). The first step is to estimate person and item parameters by fitting an IRT model to the data. In most research on the l_0 statistic, the three-parameter logistic model is used (Birnbaum, 1968). The second step is to use the group-determined item parameters and the estimate of the person's ability on the construct of interest to calculate the l_0 statistic. The index is computed as the logarithm of the compound probability of the correct and incorrect responses given by the individual with a given trait level as estimated by the IRT model (Schmitt et al., 1994). This computation is given in the following formula:
l_0 = \sum_{i=1}^{n} \left\{ u_i \ln P_i(\hat{\theta}) + (1 - u_i) \ln\left[1 - P_i(\hat{\theta})\right] \right\},    (1)

where n refers to the number of items in the test, u_i represents the response of the individual to the ith item (1 = correct, 0 = incorrect), and P_i(\hat{\theta}) is the probability of a correct response to item i given the estimate of the examinee's trait level. Drasgow, Levine, and Williams (1985) point out that the distribution of the l_0 index is unstable across ability levels, with the mean value of the index increasing as ability increases. Other researchers have made similar observations by noting that the l_0 index, along with other unstandardized appropriateness indices, tends to be correlated with ability level (Birenbaum, 1985). Because of this finding, it is necessary to standardize the l_0 index to better ensure the stability of the distribution of this statistic across ability levels. A standardized version of the l_0 statistic, the l_z index, is presented by Drasgow, Levine, and Williams (1985). In order to reduce the dependence of the appropriateness index on the value of theta, the standardization process makes use of the assumption of local independence by assuming that ability as estimated by an IRT model equals the respondent's true ability. The formula for standardizing l_0 is given below:

l_z = \frac{l_0 - E(l_0)}{\left[\mathrm{Var}(l_0)\right]^{1/2}},    (2)

where E(l_0) and Var(l_0) are the expectation and variance of l_0 conditional on the estimated trait level. A multitest extension of this index, l_zm, aggregates l_0, E(l_0), and Var(l_0) across the j individual tests a respondent has taken before standardizing. The multitest version of l_z has the advantage of increasing the number of items included in calculating the appropriateness of a measure for a particular respondent, which would likely increase the accuracy of detecting aberrant response patterns. For longer tests (i.e., greater than 20 items), however, this is less of an issue (Reise & Due, 1991). Another potential limitation of the l_z statistic has to do with the use of estimated theta values rather than true theta values when calculating l_z. Calculation of the l_z statistic requires the assumption that theta as estimated by the IRT model is equal to the respondent's true ability on the construct of interest. This is seldom the case, and so estimates of theta are likely to be higher or lower than true theta levels, producing a distorted estimate of true test score inappropriateness using the l_z statistic. For individuals with aberrant response patterns, estimates of theta are likely to be especially distorted since the person's response vector does not fit the IRT model generated for the test. Because inaccurate estimates of theta produce inaccurate estimates of inappropriateness, it is likely that the l_z statistic works least well when it is most needed: when there are large numbers of aberrant responders in a data set or there is a large amount of aberrance within a single response vector. Although research has not determined the extent of this inaccuracy in data sets for actual or simulated cognitive ability test responses, Reise (1995) used data from a large-scale personality assessment and found that detection of aberrance using l_z is best when there are a large number of test items (i.e., greater than 30 items) and item difficulty levels spread throughout the range of thetas for the respondents.

Applications of Appropriateness Indices

Although the test appropriateness research comparing the aberrance detection rates for different indices is useful for determining which statistic performs the best under certain conditions, there are some notable gaps in this literature.
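Before turning to those gaps, the two-step computation described in the preceding section can be made concrete with a short sketch. This is not the LZCALC program used later in the study; it is a minimal, hypothetical reimplementation that assumes item parameters and an ability estimate are already in hand, and it uses the standard expressions for E(l_0) and Var(l_0) from the appropriateness-measurement literature along with the conventional 1.7 logistic scaling constant.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def l_z(responses, theta_hat, a, b, c):
    """Standardized appropriateness index l_z for a single response vector.

    responses : 0/1 item scores for one respondent
    theta_hat : ability estimate for that respondent
    a, b, c   : item discrimination, difficulty, and pseudo-guessing parameters
    """
    u = np.asarray(responses, dtype=float)
    p = p_3pl(theta_hat, np.asarray(a), np.asarray(b), np.asarray(c))
    q = 1.0 - p

    # Equation (1): log-likelihood of the observed pattern at the estimated theta
    l0 = np.sum(u * np.log(p) + (1.0 - u) * np.log(q))

    # Conditional expectation and variance of l_0 given the estimated theta
    e_l0 = np.sum(p * np.log(p) + q * np.log(q))
    var_l0 = np.sum(p * q * np.log(p / q) ** 2)

    # Equation (2): standardize; large negative values signal aberrance
    return (l0 - e_l0) / np.sqrt(var_l0)
```

Under the cutoff adopted in this study, a response vector would be flagged as aberrant when the returned value is -2.00 or less.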
One question that has been inadequately addressed by the extant literature concerns the practical applications of test appropriateness indices. It is unclear from the previous research what should be done once a subset of respondent's records are flagged as being aberrant. The few recommendations for applications of these statistics have usually focused on the use of indices as signals for the need to either retest aberrant respondents or to find alternate predictors or indicators of the construct of interest (Drasgow, 1982b; Hamisch, 1983; Rudner, 1983; Birenbaum, 1985; Drasgow & Guertler, 1987; Levine & Drasgow, 1988). For example, Hamisch (1983) and Birenbaum (1985) refer to the application of 14 appropriateness indices for identifying examinees whose test scores should be interpreted with extra caution. Levine and Drasgow (1988) include diagnosis of causes of low test scores as a potential application of appropriateness measurement. This focus may in part be explained by the fact that most of the work on test appropriateness indices has been done using widely known, psychometrically sound, tests (e. g., the SAT), or simulations of data from such tests. This limits the need for examining the eflects of aberrance on estimates of test properties. Another potential application of appropriateness indices that has been mentioned in previous research is the use of these statistics to flag aberrant responders in test development or validation samples (Levine & Drasgow, 1982; Rudner, 1983; Parsons, 1983; Birenbaum, 1985; Drasgow & Guertler, 1987). Drasgow & Guertler (1987) point out that when there are a large number of aberrant response vectors in a sample, any subsequent statistical analyses using these test data are likely to be distorted. Rudner (1983) suggests that item try-out and standardization samples can be improved by excluding response vectors from examinees with aberrant response patterns. When response vectors are flagged as aberrant and are subsequently removed from the development or validation sample, test statistics can be recalculated on a smaller, but "more appropriate" sample, in hopes of improving the accuracy of estimates of these statistics. Prior empirical research assessing this application of the 1, index has found less than optimal results for improving the validity of several selection tests by removing aberrant responders from the validation sample (Schmitt, Cortina, & Whitney, 1993; Schmitt et al., 1994). The objective of the present study is to determine the reason for the weak efi‘ects found in the prior research and to assess the conditions under which aberrance affects estimates of test characteristics. The 1, statistic is used in this study to indicate inappropriate test scores, both in order to remain consistent with the previous research 15 and because prior work indicates that this statistic is the most accurate index of test appropriateness currently available (e.g., Drasgow & Levine, 1986). This study expands on the prior research in several ways. The use of simulated data allows the modeling of difi‘erent types and magnitudes of aberrance in the data set and different test properties. This is done in order to better determine the conditions under which aberrance affects the estimation of test properties. The limitations of sample size in prior empirical work in this area are removed by using simulated data. In addition, the efiects of aberrance on test reliability as well as on test validity are examined. 
This is done due to the suggestion by Parsons (1983) that aberrance has a more direct effect on reliability than on validity. The present study also provides an opportrmity to identify the extent that problems with the aberrance detection rate of the 1, statistic may be responsible for the weak efl‘ects of aberrance on test properties in prior research. Comparisons of data sets with and without aberrance indicate the extent that true levels of aberrance distort estimates of the psychometric properties of tests. Comparisons of data sets with aberrance and data sets with response vectors receiving low 1, scores (i.e., below -2.00) removed indicate the extent that the 1, index is detecting the presence of varying levels and types of aberrance within a data set. Reise and colleagues identified some potential problems with the 1, statistic relating to test length and test composition. Although these are real concerns when using the 1, statistic to identify aberrance in real-world situations, they are not the direct focus of this study, and so attempts were made to reduce the impact of these problems. In order to minimize problems with the 1, statistic relating to test length, the data simulated for this study are based on a test length of 50 items, which exceeds the length requirements suggested by Reise and colleagues as necessary to maximize the likelihood of aberrance detection (Reise, 1995; Reise & Due, 1991). To address concerns with detection l 6 associated with test composition, test item difficulties cover a range that matches the range of thetas for test respondents. Reise (1995) reported that although the aberrance detection rate for 1, based on estimated theta is always lower than that for 1, based on true theta values, detection rates are maximized when a test contains items with a range of difficulties matching the range of thetas in a group of respondents. Research mestions Based on the objectives of the study discussed above, several research questions have been formulated to direct the analyses for determining the nature of the relation between aberrance and test properties. Analyses estimate and compare the efi‘ects of difl‘erent aberrance conditions that are expected to occur in actual testing situations on test reliability and validity. Additional analyses determine the accuracy of the 1, statistic for flagging aberrant response vectors for removal from the data set. Varying conditions of aberrance within a data set are simulated to model different testing conditions where respondents are more or less likely to engage in behaviors leading to aberrance in item response patterns. For example, in a concurrent criterion- related validation study where incumbents are not given any sort of incentive to "do their bes " on a selection test, there may be a large proportion of examinees who respond carelessly or cheat on the test. This would result in a large proportion of the validation sample having aberrant response patterns. On the other hand, in a situation such as a group of applicants taking an employment test where examinees are likely to be highly motivated to perform well and may be closely monitored to discourage cheating, there may be much lower levels of aberrance in the sample. The research questions given below are designed to probe the nature of the relationship between aberrance and estimates of test properties and to examine the utility of the 1, statistic as an indicator of aberrance. 
In order to explore these questions, samples with levels of aberrance varying from fairly low (10% aberrant) to high (50% aberrant) 1 7 are simulated, since it is expected that with a larger proportion of aberrance in a sample, there will be a greater effect on estimates of test properties. The "true values" of the validity and reliability of the test (i.e., calculated in data sets with no aberrance) will also be varied across samples, since it is expected that the effects of aberrance on estimates of test properties will vary depending on the "true" levels of validity and reliability of the test. In terms of aberrance detection rates using the 1, statistic, it is expected, based on arguments presented by Reise (1995), that relative detection rates (per cent of total aberrance detected) may be lower in samples with a large degree of aberrance. This pattern of results is expected because data sets with a greater degree of aberrance will likely also have the most overall distortion in theta estimates, which will result in distorted estimates of test inappropriateness. RQl: What is the magnitude of the effect of aberrant response vectors on the reliability and the validity of tests in samples with different proportions of aberrant response vectors? In this study, the magnitude of aberrance is manipulated to explore a range of possibilities where aberrance levels may afl‘ect test properties. Data sets were simulated with varying proportions of the respondents (10%, 20%, 30%, 40%, or 50%) classified as aberrant. These proportions were chosen to represent a range of aberrance that could occur under different testing conditions in order to determine how great aberrance must be to have discernible and practical efi‘ects on test properties. In this case, aberrance classification is based on obtaining a score of -2.00 or less on the 1, statistic, which is consistent with the cutoff used in previous research (Schmitt, Cortina, & Whitney, 1993; Schmitt et al., 1994). This manipulation is somewhat difl’erent from that used in previous research, however, because the focus of this study is on the effects of the overall magnitude of aberrance in a sample on test properties. 1 8 Previous research on aberrance detection rates for appropriateness indices was more concerned with aberrance within a single response vector and thus usually kept the proportion of aberrant response vectors in a sample constant and at relatively low overall levels (e.g., less than 10% of response vectors). Two types of aberrance manipulations are used, spuriously high manipulations and spuriously low manipulations, to determine whether there is a difl‘erential efi‘ect on test properties for difi‘erent types of aberrance. The first type of aberrance manipulation, the spuriously high test score, simulates a situation in which a respondent receives a test score that is higher than their ability on the construct of interest, which could be due to factors such as cheating or coaching prior to taking the test. The second type of aberrance manipulation, the spuriously low test score, simulates a situation in which a respondent receives a test score that is lower than their ability on the construct of interest, which could be due to factors such as alignment errors on an answer sheet or use of suboptimal test-taking strategies. 
It is not expected that the effects of these two types of aberrance on test properties will difl‘er greatly, but both are used in this study to remain consistent with previous research that examines both types of aberrance. In addition to the data sets with a single type of aberrance manipulation (either low or high), a data set that includes a mix of both spuriously low aberrant response vectors and spuriously high response vectors was created. In this data set, a total of 30% of the response vectors are classified as aberrant, with 15% classified as spuriously low aberrant and 15% classified as spuriously high aberrant. This mixed data set was created in order to determine whether having both types of aberrance in a single sample would amount to a cancellation of the effects of each type of aberrance on test properties. In real world testing situations, it is expected that different respondents will exhibit difi‘erent types of aberrance. Although it is not expected that difl’erent types of aberrance will have predictably different efl‘ects on test properties, it is possible that within a single data set, it 1 9 may be difiicult to determine the effects of aberrance on reliability and validity if there are a mixture of ways in which true scores are distorted. For each type of aberrance manipulation, a single level of aberrance is simulated, with 30% of the item responses in a given response vector (i.e., a simulated respondent) changed to reflect aberrance. This approach differs somewhat fi'om previous research that varied the amount of aberrance within single response vectors, because the focus of this study is on overall aberrance within a sample of test respondents rather than within a single response vector. The main reason to vary aberrance within a response vector is to examine and compare the detection rates for different indices, which was the purpose of most of the previous research on appropriateness measurement. Based on prior research (e.g., Drasgow, Levine, & McLaughlin, 1987) indicating the superior detection rates of the 1, index compared to other statistics, particularly for large magnitudes of aberrance (e.g., 30% aberrant item responses within a single response vector), this magnitude of aberrance within a response vector was used for all aberrance manipulations. RQ2: What is the magnitude of the effect of aberrant response vectors on the reliability and validity of tests with different "true values" of reliability and validity (i.e., values calculated in samples with no aberrance)? Three levels of reliability (.70, .80, or .90) and three levels of validity (.15, .30, or .45) are simulated in difl‘erent data sets. These levels were chosen to represent values ranging from those typically found in personnel research to values that are reasonably higher than those typically observed (e.g., Reilly & Chao, 1982; Schmitt, Gooding, Noe, & Kirsch, 1984; Schmidt, Ones, & Hunter, 1992). This will allow some discussion of the likelihood of finding an effect of aberrance on test properties in typical testing situations. The estimates of reliability and validity are computed on a data set of response vectors before any aberrance is introduced, and so represent a "true value" for estimates of reliability and validity for the data set. The value of these test properties will be varied 2 0 across data sets to better understand the effects of aberrance on estimates of these properties across a variety of testing conditions. 
The variation of “true values” of test properties also provides a chance to explore any potential relationship between the effects of aberrance on reliability and the efi‘ects of aberrance on validity (e.g., if the effects of aberrance on validity can be reliably predicted from the effects of aberrance on reliability). RQ3: What is the magnitude of the effect of using the 1z statistic to identify aberrance on the degree of aberrance detected in samples with different proportions of aberrant response vectors? For a given level of aberrance (10%, 20%, 30%, 40%, or 50%), comparisons of test properties are made among three data sets: a data set with no abenance (i.e., before aberrance is introduced), the aberrant data set, and the same data set with response vectors flagged as being abenant using the 1, statistic removed. The comparison between the aberrant data set and the data set with aberrant response vectors removed will provide an estimate of the degree that the 1, statistic is identifying all the aberrant response vectors in a data set. The aberrance detection rate of the 1, statistic has been examined in previous research, but on a smaller scale, with only a single, usually low, level of aberrance present in the data set. Because of this, there was no way to determine from the previous research whether the 1, statistic is differentially efl‘ective at identifying aberrance depending on the degree of aberrance in the data set. The comparisons of reliability and validity among all three data sets will provide an estimate of the degree that using the 1, statistic to flag aberrant response vectors for removal is an effective way to determine the influence of aberrance on test properties. Ifthe reliability and validity of the data set with aberrance removed are more similar to the aberrance-free data set than to the aberrant data set, then the 1, statistic may be a satisfactory indicator of aberrance for the purpose of improving estimates of test 21 properties. This is because, in this scenario, aberrance removal based on the 1, statistic results in a data set relatively free of aberrance, or at least similar to an aberrance-free data set. Ifthe reliability and validity of the data set with aberrance removed are more similar to the aberrant data set than to the aberrance-free data set, the 1, statistic is probably not a satisfactory indicator of aberrance. This is because, in this scenario, not enough of the aberrant response vectors have been flagged for removal in order to improve the accuracy of estimates of test properties to adequately reflect what would have occurred if there were no aberrance in the data set. METHOD Data Generation The data simulation procedure for this study is based on that first used by Levine and Rubin (1979) in studying appropriateness measurement and later used by their colleagues and by other researchers (e.g., Levine & Drasgow, 1982; Rudner, 1983; Noonan, Boss, & Gessaroli, 1992). Although the use of simulated data may not allow for the consideration of all issues arising in the collection of actual data, it does allow for the manipulation and examination of certain conditions in order to determine whether eflects of interest can be demonstrated (Rudner, 1983). For example, by using simulated data, samples of response vectors with no aberrance can be created for comparison purposes, whereas this can not be guaranteed when collecting data in actual testing situations. 
Because the purpose of this study is to determine the effects of varying levels of aberrance on estimates of the psychometric properties of a test, the use of simulated data was deemed appropriate. As described in the discussion of the research questions, the magnitude and type of aberrance were varied across data sets. Five different proportions of aberrance within a sample (10%, 20%, 30%, 40%, or 50%) and two types of aberrance, spuriously high test scores and spuriously low test scores, were simulated. Levels of reliability (.70, .80, or .90) were also varied across data sets. For each data set, three criterion scores were generated, which varied in the magnitude of correlation with total test score (.15, .30, or .45). The variation in correlation represented a variation in the validity of the test. These variations resulted in the simulation of 30 data sets (five proportions of aberrant respondents X two types of aberrance X three levels of reliability) with a single type of aberrance, each of which included three criterion scores representing three levels of validity of the test. Three data sets with no aberrance were created (one for each of the three levels of reliability) to use for comparisons with the aberrant data sets to answer the research questions proposed in this study. In addition, three data sets (again, one for each of the three levels of reliability) with the mixed type of aberrance described earlier (i.e., 30% total aberrance, consisting of 15% spuriously high response vectors and 15% spuriously low response vectors) were created to ensure that the two types of aberrance did not cancel each other out in their effects on test properties or in the calculations of the l_z index. As with the data sets representing a single type of aberrance, the data sets with no aberrance or with mixed types of aberrance also included three criterion scores representing the three levels of validity that were of interest in this study. The variety of testing conditions that were simulated and examined allows some consideration of the plausibility of encountering actual data sets with a level of aberrance that has a practically significant effect on estimates of test properties, and some consideration of the conditions under which this is most likely to occur. Summary statistics for all of the tests simulated based on these varying conditions of aberrance (or lack of aberrance) and varying levels of test reliability are given in Appendix A. Other parameters were held constant across the different data sets in order to minimize variation in factors that are not of central interest in this study. Each data set contains 10,000 response vectors and there are no omissions of data (i.e., a simulation of test conditions under which all respondents answer all items). Although there are polychotomous item response models available that can take omission into account for appropriateness measurement (Drasgow, Levine, & Williams, 1985), the additional modeling and calculations involved in using these models are beyond the scope of this study. The number of test items (50) and the "true values" of item parameters (i.e., based on a sample of response vectors with no aberrance) were also held constant to simulate the conditions under which different groups of respondents are taking the same test. Simulated item responses are all dichotomous.
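The full design can be enumerated directly. The sketch below simply lists the simulated data-set conditions described above; the condition labels are illustrative and are not the file or program names used in the study.

```python
from itertools import product

proportions = [0.10, 0.20, 0.30, 0.40, 0.50]
aberrance_types = ["spuriously_low", "spuriously_high"]
reliabilities = [0.70, 0.80, 0.90]
validities = [0.15, 0.30, 0.45]   # three criterion scores generated within every data set

# 5 proportions x 2 types x 3 reliabilities = 30 single-type aberrant data sets
single_type = list(product(proportions, aberrance_types, reliabilities))

# plus 3 aberrance-free data sets and 3 mixed data sets (15% low + 15% high)
aberrance_free = [("none", 0.0, r) for r in reliabilities]
mixed = [("mixed", 0.30, r) for r in reliabilities]

print(len(single_type), len(aberrance_free), len(mixed))   # 30 3 3 -> 36 data sets in all
```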
The amount of aberrance within a single response vector (for those data sets with any level of overall aberrance) was held constant at 30%, or 15 out of 50 items reflecting aberrant responses within a single response vector. Holding these factors constant reduced the number of data sets that were needed and the complexity of comparisons involved in examining the research questions posed earlier. The three-parameter logistic IRT model was used to estimate item and person parameters (Birnbaum, 1968). The item parameter distributions are based on ranges of parameter values used in previous research by Noonan, Boss, and Gessaroli (1992). Item difficulty was generated from a uniform distribution with a mean of 0 and a standard deviation of 1.00, and the pseudo-guessing parameter was generated from a uniform distribution with a mean of 0.12 and a standard deviation of 0.03. The mean of the distribution of the item discrimination parameter was varied in order to manipulate the level of reliability of the test. To simulate a test with a reliability of .90, the mean of the item discrimination parameter was set at 0.80; for a reliability of .80, the mean was set at 0.47; for a reliability of .70, the mean was set at 0.40. In all cases, the item discrimination parameter was generated from a uniform distribution with a standard deviation of 0.20. Uniform distributions were used to generate item parameter values in order to be consistent with the previous literature and to increase variation across the range of values for the item parameters. This was particularly important in the case of item difficulty, because research has suggested that l_z detection rates may be higher when item difficulties cover a broader range of values (Reise, 1995). Average item parameter values for each of the simulated tests are given in Appendix A. Although there is some random variation in average item difficulty across the different levels of reliability for the tests, this was not expected to cause tremendous problems with interpretation of effects because this study was more concerned with patterns of relative change in test properties (including person and item parameters from the item response theory model) rather than with absolute values for these properties. The "true values" (i.e., as estimated in a data set with no aberrance) of the distribution of the ability parameter (theta) were also held constant across data sets, using a normal distribution of values with a mean of 0 and a standard deviation of 1.00. Lastly, the criterion score distribution was held constant, using a normal distribution with a mean of 3.00 and a standard deviation of 1.00. Holding these two distributions constant simulates a scenario that is likely in many situations, in which respondents vary in their ability on the construct of interest and in their scores on the criterion measure, but over time or across similar settings, group norms (i.e., the characteristics of the distribution of test scores or criterion scores) are relatively similar. The data generation process for each data set began with the simulation of a set of normal response vectors (i.e., with no aberrance introduced). Response vectors for normal examinees were simulated using the IRTDATA program (Johanson, 1992). This program makes use of three separate random number generator seeds for the generation of item parameters, person-ability parameters, and response vectors.
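A minimal sketch of this generation step is shown below; it is not the IRTDATA program itself. Translating a mean/standard-deviation specification into uniform bounds and using the 1.7 logistic scaling constant are assumptions of the sketch rather than details reported in the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def uniform_mean_sd(mean, sd, size, rng):
    """Uniform draws parameterized by mean and standard deviation:
    a uniform on [mean - sd*sqrt(3), mean + sd*sqrt(3)] has exactly that mean and SD."""
    half_width = sd * np.sqrt(3.0)
    return rng.uniform(mean - half_width, mean + half_width, size)

n_persons, n_items = 10_000, 50
disc_mean = 0.80   # 0.80, 0.47, or 0.40 for target reliabilities of .90, .80, or .70

# Item parameters drawn from the distributions described above
a = uniform_mean_sd(disc_mean, 0.20, n_items, rng)   # discrimination
b = uniform_mean_sd(0.00, 1.00, n_items, rng)        # difficulty
c = uniform_mean_sd(0.12, 0.03, n_items, rng)        # pseudo-guessing

# Person abilities: standard normal thetas
theta = rng.normal(0.0, 1.0, n_persons)

# Dichotomous responses under the three-parameter logistic model
D = 1.7
p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta[:, None] - b[None, :])))
responses = (rng.uniform(size=p.shape) < p).astype(int)   # 10,000 x 50 matrix of 0/1 scores
```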
Separate output files are created for raw data (response vectors), respondent characteristics (true score, theta, and number correct), and item characteristics (discrimination, difficulty, and pseudo-guessing parameters). Input parameters for this program include the number of respondents, the number of test items, the desired scaling factor, and the distribution type (uniform or normal), mean, and standard deviation of the item and person parameters. Criterion scores were generated by creating a bivariate distribution with a specified correlation (.15, .30, or .45) between the total test scores in the aberrance-free data set and a random normal deviate. Total test score was calculated as the sum of the dichotomous item responses, scored one to simulate a correct response and zero to simulate an incorrect response. The following formula was used to calculate the criterion score:

C = y \sqrt{1 - r_{xy}^2} + r_{xy} x,    (6)

where C refers to the criterion score, y refers to a random number generated with the desired distribution (here, normal), mean (here, 3.00), and standard deviation (here, 1.00) of the criterion, x refers to the total test score, and r_{xy} refers to the desired correlation between total test score and criterion score (validity). Once the data sets with no aberrance were created (one for each level of reliability), these data sets were used as the basis for creating the data sets with aberrance. Aberrant response vectors were simulated using the BIDEV program (Peters, 1995). This program randomly selects a proportion, specified by the researcher, of the no-aberrance response vectors from the data file created by IRTDATA and changes these vectors to reflect the type and level of aberrance specified. The normal response vectors are then replaced with changed response vectors to create a data file with a specified proportion, type, and level of aberrance. To simulate a response vector with the type of aberrance reflecting a spuriously low test score, 30% of the responses in a normal (i.e., no-aberrance) response vector are randomly sampled. A random number generator is used to simulate guessing over a specified number of option choices (c = number of option choices). No matter what the original item response, the sampled item is rescored with a 1/c chance of being correct (scored as 1) and a (c - 1)/c chance of being incorrect (scored as 0) to simulate a random item response. In this study, items with 5 option choices were simulated, so sampled items were rescored as correct with a probability of .20 (1/5) and rescored as incorrect with a probability of .80 (4/5). To simulate a response vector with the type of aberrance reflecting a spuriously high test score, again 30% of the responses in a no-aberrance response vector are randomly sampled. No matter what the original item response, the sampled item is rescored as correct. Both of these manipulations result in varying degrees of change to a response vector depending on the original distribution of correct and incorrect responses (under the no-aberrance condition) for the items that are selected to be changed. This is because the aberrance being simulated in this study is random, so there is no pattern based on item or respondent characteristics as to which items reflect aberrance. This is consistent with previous research, which also simulated random aberrance within a data set as well as within a single response vector.
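The criterion-generation and aberrance manipulations just described can be sketched as follows. This is a reimplementation sketch, not the BIDEV program; standardizing the total-score component in equation (6) so that the resulting correlation lands near the target r_xy is an assumption of the sketch rather than a reported detail.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def criterion_scores(total_scores, r_xy, mean=3.0, sd=1.0, rng=rng):
    """Criterion generated per equation (6): C = y * sqrt(1 - r_xy**2) + r_xy * x."""
    x = (total_scores - total_scores.mean()) / total_scores.std()   # standardized total score
    y = rng.normal(mean, sd, size=x.shape)                          # random normal component
    return y * np.sqrt(1.0 - r_xy ** 2) + r_xy * x

def make_aberrant(vector, kind, prop_items=0.30, n_options=5, rng=rng):
    """Rescore 30% of the items in one 0/1 response vector as spuriously low or high."""
    v = vector.copy()
    n_change = int(round(prop_items * v.size))
    items = rng.choice(v.size, size=n_change, replace=False)
    if kind == "spuriously_high":
        v[items] = 1                                                 # selected items rescored correct
    elif kind == "spuriously_low":
        # random guessing over n_options choices: 1/n_options chance of a correct response
        v[items] = (rng.uniform(size=n_change) < 1.0 / n_options).astype(int)
    return v
```

Applying make_aberrant to a randomly selected 10% to 50% of the simulated response vectors would reproduce the sample-level aberrance proportions described earlier.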
There is a higher probability of large change in a response vector under the spuriously high manipulation than under the spuriously low manipulation, however, because it is more likely that a response will be changed from incorrect to correct in the spuriously high manipulation than from correct to incorrect in the spuriously low manipulation. This is because in the spuriously high manipulation all incorrect responses that are randomly selected for change are changed to correct. In the spuriously low manipulation, the correct responses that are randomly selected for change have only an 80% chance of being changed to incorrect and a 20% chance of remaining correct (based on a simulation of random responding over five option choices).

Analyses

Analyses consist of the comparison of reliabilities and validities calculated on three types of data sets: a data set with no aberrance (i.e., before aberrance is introduced), an aberrant data set, and the same data set with response vectors flagged as aberrant using the l_z statistic removed. In addition, comparisons are made between aberrant data sets and data sets with aberrance removed using the l_z statistic in order to determine the precision of detection based on the l_z statistic for varying levels of aberrance in a data set. After aberrance was introduced into a data set, item and person parameters for the three-parameter logistic IRT model were estimated using the BILOG program (Mislevy & Bock, 1990). These parameters are used to calculate values for the l_z statistic. It is not necessary to use BILOG on the no-aberrance samples because the IRTDATA program provides estimates of the item and person parameters of the raw data as part of the output. After an aberrance manipulation, however, these parameters are no longer useful for calculating l_z statistic values for the changed response vectors, so BILOG was used to obtain item and person parameter values for the aberrant data sets. After these values are obtained, either by using IRTDATA or BILOG, the LZCALC program (Peters, 1993) was used to calculate l_z statistic values for both the no-aberrance and the aberrant data sets. The output from BILOG is transformed into an input file for the LZCALC program using the XTRACT program (Peters, 1994). The l_z statistic was calculated for each of the response vectors in the no-aberrance data sets in order to ensure that there are no aberrant response vectors generated using the IRTDATA program. This was not expected to be a problem, and this analysis is merely a check to compare the magnitude of aberrance in the aberrance-free samples with that in the aberrance-manipulation samples. Although the level of aberrance in an aberrant sample is set using the BIDEV program, the calculation of the l_z statistic in the aberrant samples serves as a check on the detection rates of this statistic. In addition, correlations were calculated between l_z scores and the total test scores and criterion scores for each of the data sets. These analyses were included in order to ensure that scores on the l_z statistic are not related to ability on the construct of interest measured by the test or to the criteria used to validate the test. This was also not expected to be a problem, given the research showing that the l_z statistic is well standardized across levels of theta, but these results were included as a check on the l_z statistic.
Once the l_z statistic was calculated for each of the data sets, the output files from LZCALC were merged with the raw data files containing item responses, total test scores, and criterion scores. Coefficient alpha (i.e., the indicator of the reliability of the test) and the correlation between the sum of item responses and the criterion score (i.e., the validity of the test) were calculated. Although reliability levels were manipulated and the criterion score was generated to be correlated at a certain level with total test score in the aberrance-free data sets, reliability and validity were calculated on the no-aberrance samples because there is always some random deviation of the data from the exact manipulation values. The reliability (coefficient alpha) and validity (correlation between total test score and criterion score) for the aberrant samples were calculated to determine the effects of aberrance on estimates of these test properties.

Next, for each of the aberrant data sets, response vectors flagged as aberrant were deleted from the data analysis, and reliability and validity were recalculated on the smaller, but "more appropriate," data set. This recalculation serves as a check on the use of the l_z statistic to flag aberrant response vectors for removal from analyses in order to obtain more accurate estimates of test properties. If there is not much change between the reliability and validity values in the aberrant and aberrance-removed data sets, then the l_z statistic may not be very useful for this purpose.

These analyses result in estimates of test properties and in l_z detection rates for the different data sets that can be compared and used to answer the three research questions posed earlier. The comparison of test statistics calculated on the aberrant data sets with test statistics calculated on the data sets without aberrance provides an indication of the extent that aberrance distorts estimates of the psychometric properties of tests. This comparison is used to answer the first research question, which asks about the effects of aberrance on test properties. The comparison of test statistics calculated on data sets with varying proportions and types of aberrance affords an opportunity to determine the extent of aberrance needed in a sample to have an effect on test properties, which is also relevant to the first research question. This comparison is also used to check whether there are differential effects of different types of aberrance on test properties, though these are not expected to occur unless there is a major difference in the detection rate of the l_z statistic for different types of aberrance.

The comparison of estimates of reliability and validity between data sets with different "true values" of these statistics (i.e., estimates from data sets with no aberrance) is used to answer the second research question about the nature of the effects of aberrance on different levels of reliability and validity. This comparison is also relevant for determining the nature of the relationship of effects of aberrance on reliability to effects of aberrance on validity. The comparison between aberrant data sets and the same data sets with aberrant response vectors removed based on l_z statistic values indicates the degree that the l_z statistic is detecting aberrance within the data set.
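A minimal sketch of the reliability and validity calculations just described, including the recalculation after response vectors flagged by l_z ≤ -2.00 are removed, is given below. It assumes the item responses, criterion scores, and l_z scores have already been merged into aligned arrays, as in the merged files described above; the function names and interface are illustrative rather than the original analysis code.

```python
import numpy as np

def coefficient_alpha(items):
    """Coefficient alpha for an (n_respondents x n_items) matrix of 0/1 responses."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)

def test_properties(items, criterion, lz=None, cutoff=-2.00):
    """Return (reliability, validity). If l_z scores are supplied, response
    vectors with l_z <= cutoff are removed before the estimates are computed."""
    items, criterion = np.asarray(items), np.asarray(criterion)
    if lz is not None:
        keep = np.asarray(lz) > cutoff          # drop flagged (aberrant) vectors
        items, criterion = items[keep], criterion[keep]
    total = items.sum(axis=1)                   # total test score
    validity = np.corrcoef(total, criterion)[0, 1]
    return coefficient_alpha(items), validity
```

Calling test_properties once without l_z scores and once with them gives the aberrant and aberrance-removed estimates that are compared against the no-aberrance values in the analyses that follow.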
A comparison of detection across different levels and types of aberrance is used to determine whether the l_z statistic might be differentially effective in different testing situations (characterized by different levels and types of aberrant responding). These comparisons are relevant to the third research question, which concerns the degree of existing aberrance in a data set (in this case, set using the BIDEV program) that the l_z statistic actually detects. In addition, comparing reliability and validity between all three data sets (aberrance-free, aberrant, and aberrance-removed) is also relevant to the third research question because the similarity of estimates of these test properties from aberrance-removed data sets to the other two types of data sets will determine the utility of the l_z statistic for increasing the accuracy of estimates of test properties. If the aberrance-removed estimates are more similar to the aberrant estimates, then the l_z statistic is not very useful. If the aberrance-removed estimates are more similar to the aberrance-free estimates, then using the l_z statistic to flag aberrant response vectors for removal from data analyses may be a good way to increase the accuracy of those analyses.

RESULTS

Manipulation Checks

Several analyses were conducted to ensure that the manipulations used in the data generation process had the desired effects. The first set of analyses centers on the aberrance-free samples, to ensure both that the desired reliability and validity values were achieved and that these data sets were truly free of aberrance, as identified by the l_z statistic. Table 1 contains the calculated coefficient alphas and correlations between total test score and criterion score for each of the aberrance-free data sets (one for each desired level of reliability). The table also includes the number of response vectors in each of the data sets that were flagged as being aberrant based on l_z scores.

Table 1 - Summary Statistics for Aberrance-Free Data Sets

            Actual           Actual Validity             # Vectors
            Reliability    CRIT15   CRIT30   CRIT45    w/ l_z ≤ -2.00
RELI70        .71            .16      .31      .45           0
RELI80        .82            .14      .31      .45           0
RELI90        .90            .15      .30      .45           0

Note. CRIT15 = criterion score generated to be correlated at .15 with total test score, CRIT30 = criterion score generated to be correlated at .30 with total test score, and CRIT45 = criterion score generated to be correlated at .45 with total test score. RELI70 refers to the set of item responses that was generated to have a reliability of .70, RELI80 refers to the set of item responses that was generated to have a reliability of .80, and RELI90 refers to the set of item responses that was generated to have a reliability of .90.

For each of the three data sets, the obtained reliability and validity are adequately close to the desired values (.70, .80, and .90 for reliability; .15, .30, and .45 for validity). These obtained values from the no-aberrance data sets were used in all analyses comparing aberrant and aberrance-free data sets to determine the effect of aberrance on test properties. The last column of Table 1 contains the number of response vectors within each data set that received an l_z score of -2.00 or less. As indicated in the table, none of the response vectors in any of the three aberrance-free data sets received an l_z score indicating inconsistency between item responses and estimated ability level.
In addition to analyzing the aberrance-free samples to ensure that they were actually free of aberrance, additional analyses were performed to see whether 1, scores were related to scores on the test or criteria simulated in this study. One of the advantages of the 1, statistic that has been described in previous research is that it is not related to ability levels on the construct of interest and so it can provide a measure of aberrance within a response vector that is standardized across theta levels (Drasgow & Levine, 1986; Levine & Drasgow, 1988). In order to be able to generalize the results of this study to other research on appropriateness measurement, it was necessary to demonstrate that 1, scores are not related to the test or criterion scores used in this study. Tables 2, 3, and 4 present correlations of 1, scores with total test score and the three criterion scores for data sets with a reliability near .70 (Table 2), with a reliability near .80 (Table 3), and with a reliability near .90 (Table 4). These results are broken down by reliability level in order to simplify the presentation. 3 4 Table 2 - Correlations with 1, Scores for Data Sets with Reliability near .70 Low Aberrance % Aberrance TOT cam 5 CRIT30 CRIT45 10 -.06 -.02 -.04 -.05 20 -.08 -.03 -.07 -03 30 -.1 1 -.03 -.07 -.09 40 -.10 -.03 -.08 -.11 50 -.09 -.04 -.08 -.12 PM Aberrance % Aberrance TOT cams CRIT30 CRIT45 10 -.05 .00 -.00 .01 20 -.O6 .00 .01 .01 30 -.02 .02 .03 .05 40 -.02 .02 .03 .06 50 -.02 .03 .05 .08 Mixed Aberrance % Aberrance TOT cams CRIT30 CRIT45 15 Low, 15 High -.07 -.02 -.03 -.03 Note. % Aberrance refers to the overall percent of aberrance present in the data set, or the percentage of response vectors out of the total data set that are randomly selected for the aberrance manipulation. TOT = total test score. 3 5 Table 3 - Correlations with 1, Scores for Data Sets with Reliability near .80 Low Aberrance % Aberrance TOT CRIT 15 CRIT30 CRIT45 10 -.06 -.00 -.05 -.05 20 -.06 -.02 -.05 -.0‘7 30 -.05 -.02 -.06 -.O9 40 -.10 -.02 -.09 -.ll 50 -.06 -.02 -.08 -.09 High Aberrance % Aberrance TOT CRITl 5 CRIT30 CRIT 45 10 -.03 .02 .00 .02 20 -.04 .03 .02 .05 30 -.02 .02 .03 .06 40 .00 .04 .05 .09 50 -.00 .05 .05 .10 Mixed Aberrance % Aberrance TOT CRIT 15 CRIT30 CRIT45 15 Low, 15 High -.04 .01 -.02 -.01 Note. % Aberrance refers to the overall percent of aberrance present in the data set, or the percentage of response vectors out of the total data set that are randomly selected for the aberrance manipulation. TOT = total test score. 3 6 Table 4 - Correlations with 1, Scores for Data Sets with Reliability near .90 Low Aberrance % Aberrance TOT CRIT15 CRIT30 CRIT45 10 -.06 -.02 -.05 -.07 20 -.06 -.03 -.04 -.08 30 -.08 -.03 -.O8 -.11 40 -.07 -.03 -.09 -.12 50 -.08 -.05 -.09 -.12 flgh Aberrance % Aberrance TOT CRIT15 CRIT30 CRIT45 10 -.05 .01 .01 .01 20 -.07 .01 .02 .03 30 -.05 .02 .05 .06 40 -.08 .03 .04 .07 50 -.02 .03 .07 .11 Mixed Aberrance % Aberrance TOT CRITl 5 CRIT30 CRIT45 15 Low, 15 High -.06 -.02 -.01 -.03 Note. % Aberrance refers to the overall percent of abenance present in the data set, or the percentage of response vectors out of the total data set that are randomly selected for the aberrance manipulation. TOT = total test score. 3 7 The tables show that 1, scores are not highly related either to the test scores or to the criterion scores simulated in this study. 
There was no particular pattern of correlations with respect to the level of reliability of the test, but there was a slight trend toward higher correlations with l_z scores for data sets with larger amounts of aberrance (though this did not occur in all cases). All of the correlations between total test scores and l_z scores are low and negative, with none of the values exceeding a magnitude of -.11 (or about 1% shared variance between total test score and l_z score). Correlations between test scores and l_z scores tended to be slightly higher for data sets that were part of the spuriously low test score manipulation than for data sets that were part of the spuriously high test score manipulation, indicating that the type of aberrance present in a data set may affect the value of the l_z statistic. Overall, these correlations between total test score and l_z are lower than those found in other studies (e.g., Birenbaum, 1985). This means that, in this study, the use of l_z scores to identify aberrant response vectors is not confounded by the theta level represented by the response vector.

Correlations between criterion scores and l_z scores differed based on whether the aberrance manipulation was for a spuriously high test score or for a spuriously low test score. Correlations between criterion scores and l_z scores in the spuriously low manipulation condition were all low and negative. Correlations between criterion scores and l_z scores in the spuriously high manipulation condition were either zero or low and positive. Because criterion scores (which were computed in the aberrance-free data sets) were not altered in the aberrance manipulations, this provides additional evidence that the two types of aberrance manipulations (spuriously low or spuriously high) affect the value of the l_z statistic for a given amount of aberrance in a data set. For both types of aberrance, correlations between criterion score and l_z score tended to be higher for criterion scores that were generated to have a higher correlation with total test score. Overall, these correlations are low enough to suggest that using the l_z statistic to identify aberrance does not result in artificially inflating or reducing validities due to the removal of "outliers" (with extreme criterion scores) from the analysis.

Research Questions 1 and 2: Effects of Aberrance on Reliability

The first two research questions concern the effects of different amounts and types of aberrance on reliability and validity, with the first addressing the effects of different amounts of aberrance in a data set and the second addressing the effects of aberrance on different "true" levels of reliability and validity (as estimated in data sets with no aberrance). The results concerning these two questions are presented together, first for reliability and then for validity.

Table 5 shows the levels of reliability obtained in each of the aberrant data sets and the amount of change in reliability that occurred because of the introduction of a particular amount of aberrance into the data set (i.e., the difference between reliability in the data set after aberrance was introduced and reliability in the aberrance-free data set). Results are broken down separately for the spuriously low and spuriously high manipulations because there is some evidence from the results of the manipulation checks to suggest that the two types of aberrance manipulations may affect the data differently.
Results are also presented separately for each of the three levels of reliability that were simulated in this study (.70, .80, and .90), in order to determine whether there is a relationship between the “true level” of reliability of the test and the efl‘ect of aberrance on estimates of reliability. 3 9 Table 5 -— Change in Reliability Due to the Introduction of Aberrance Low Aberrance RELI70 REL180 RELI90 % Aberrance rx, Arxx rx, Ar,“ rn Afr: No Aberrance .71 .82 .90 10 .71 0 .82 0 .89 -.01 20 .71 0 .82 0 .89 -.01 30 .71 0 .81 -.01 .88 -.02 40 .70 -.01 .80 -.02 .88 -.02 50 .69 -.02 .79 —.03 .87 -.03 High Aberrance RELI70 RELI80 RELI90 % Aberrance r,“ Ar,EL r,, Ar.“ rn Ara N o Aberrance .71 .82 .90 10 .73 .02 .83 .01 .90 0 20 .74 .03 .83 .01 .90 0 30 .75 .04 .83 .01 .90 0 40 .75 .04 .82 0 .89 -.01 50 .74 .03 .82 0 .88 -.02 Mixed Aberrance RELI70 RELI80 RELI90 % Aberrance r,“ Are, r,“ Arx, rn Arxx__ N o Aberrance .71 .82 .90 15 Low, 15 High .70 -.01 .82 0 .89 -.01 Note. r,“ refers to the obtained coeficient alpha in the aberrant sample. A rx, refers to the change in coemcient alpha that is due to the introduction of aberrance (i.e., the difference between coefficient alpha values for the aberrant and abenance-free samples). 4 0 As indicated in the table, the introduction of aberrance did not have a very large efi‘ect on the reliability of the data sets in this study. The largest changes resulted in an increase of 4 points from an aberrance-flee reliability of .71 in the data sets with the 30% spuriously high aberrance manipulation and with the 40% spuriously high aberrance manipulation. The largest negative changes resulted in a decrease of 3 points from aberrance-free reliabilities of .82 and .90 in the data sets with the 50% spuriously low aberrance manipulation. Even though the changes were not large, there are several trends worth noting, particularly among the data sets with a spuriously low manipulation. For the spuriously low manipulation data sets, the amount of change in reliability tended to increase as the amount of aberrance in the data set was greater. The amount of change was also greater for data sets that had a higher “true value” of reliability. For the spuriously high manipulation data sets, the trend of change was less clear, although it seems that the change in reliability becomes less positive or more negative as the amount of aberrance in the data set increases for data sets with a “true value” of reliability of .82 or .90. There was also a trend for the change in reliability to become less positive or more negative as the “true value” of reliability increased. The positive efl‘ect of the spuriously high aberrance manipulation on reliability may seem counterintuitive at first. The expectation, which was at least somewhat confirmed with the results of the spuriously low manipulation, might be that aberrance would lower reliability because it is introducing some type of error variation into the data set. This appears to be what is occurring in the spuriously low manipulation, because this manipulation simulates random responding without regard to item or person parameters. In contrast, the spuriously high manipulation introduces a change that may simulate a reduction in error variation. This is because in the spmiously high manipulation, item responses are changed to correct, regardless of the initial item 41 response in the aberrance-free data set. 
The larger number of correct responses in the data set results in a greater degree of consistency within a response vector and thus may result in a slightly higher reliability value, if reliability is calculated using an internal consistency statistic such as coefficient alpha. As the amount of aberrance and/or the "true value" of reliability in a data set increases, however, the spuriously high manipulation may reach a ceiling effect or a curvilinear effect on the internal consistency of the data. There is also evidence of this sort of trend in Table 5, because the change in reliability in the high aberrance manipulation data sets seems to become less positive and/or more negative as aberrance or "true" reliability increases. Due to the small effects observed in this study, however, this explanation should be taken only as a potential hypothesis for what may be occurring.

The data set with a mix of spuriously high and spuriously low response vectors appeared to more strongly resemble the spuriously low manipulation because of the slight negative effect of this form of aberrance on reliability. Because the effects were so small, it is difficult to determine whether the two types of aberrance canceled each other out in their effect on reliability, although there is some support for this in the largely negative effect of the spuriously low manipulation on reliability and the largely positive effect of the spuriously high manipulation on reliability.

These results can be used to discuss the impact of aberrance on reliability in terms of the first two research questions. To address the first research question, it seems that both the amount and the type of aberrance have a small effect on test reliability. For spuriously low aberrance, as the amount of aberrance increases, the change in reliability becomes more negative. For spuriously high aberrance, as the amount of aberrance increases, the change in reliability has a tendency to become less positive and/or more negative, but this trend is less clear than that for the spuriously low aberrance manipulation.

With respect to the second research question, there was also some evidence that the effect of aberrance differed based on the "true value" of reliability in the sample, particularly when spuriously low aberrance was introduced into a data set. As the "true value" of reliability increased, there was a tendency for a greater amount of change in reliability after the introduction of spuriously low aberrance. When spuriously high aberrance was introduced, there was a slight trend for less positive and/or more negative change in reliability for data sets with higher "true values" of reliability.

Although aberrance had only a small effect on reliability in this study, because of the use of such large samples (n = 10,000 response vectors in each data set), any change in reliability that occurs can be assumed to be true change in population values. Just because the changes are due to actual effects of aberrance on reliability and not sampling error, however, does not mean that these changes have practical significance to test developers who are trying to maximize the accuracy of their estimates of test properties.

Research Questions 1 and 2: Effects of Aberrance on Validity

The same types of analyses that were done to explore the effects of aberrance on reliability were also performed to assess the effects of aberrance on validity.
To address the first research question, trends in the efl‘ects of aberrance on validity were examined for difi‘erent amounts and types of aberrance. To address the second research question, trends in the effects of aberrance on validity were examined for diflerent “true values” of validity and for different “true values” of reliability, in order to see if a relationship between changes in reliability and changes in validity emerged from these analyses. Tables 6, 7, and 8 show the levels of validity obtained in each of the aberrant data sets and the amount of change in validity that occurred because of the introduction of a particular amount of aberrance (i.e., the difference between validity in the data set after aberrance was introduced and the validity in the corresponding aberrance-free data set). Just as with the results for reliability, results for validity are broken down by the type of 43 aberrance (spuriously low or spuriously high), in order to examine potential differences in the efl‘ects of these two types of aberrance on validity. Results are also broken down by the three levels of validity (. l 5, .30, and .45) in order to determine whether there is a relationship between “true level” of validity and the efi‘ect of aberrance on estimates of validity. In addition, results are presented separately for each level of reliability that was simulated (.70, .80, and .90) both in order to simplify the presentation and in order to examine whether the trends in changes in validity vary with the “true level” of reliability that is simulated in the data set. As was true of reliability, the introduction of aberrance did not have a very large efi‘ect on validity, although the changes in validity coemcients tended to be slightly larger than the changes in coefiicient alpha. All of the changes in validity due to aberrance were negative, with the largest change being a decrease of 8 points from a true value validity of .45 and a true value reliability near .70 in the data sets with the 40% spuriously high manipulation and the 50% spuriously high manipulation. 4 4 Table 6 — Change in Validity Due to Aberrance for Reliability Near .70 Low Aberrance CRIT] 5 CRIT30 CRIT 45 % Aberrance rxy Ara r3, Ar,y rxy A’xy No Aberrance .16 .31 .45 10 .15 -.01 .30 -.01 .44 -.01 20 .15 -.01 .28 -.03 .43 -.02 30 .15 -.01 .28 -.03 .42 -.03 40 .14 -.02 .27 -.04 .40 -.05 50 .15 -.01 .27 -.04 .39 -.06 High Aberrance CRIT15 CRIT30 CRIT45 % Aberrance rxy Ara. rxy Ara: rxy A’ :31 No Aberrance .16 .31 .45 10 .15 —.01 .29 -.02 .42 -.03 20 .14 -.02 .27 -.04 .41 -.04 30 .13 -.03 .26 -.05 .39 -.06 40 .14 -.02 .25 -.06 .37 -.08 50 .13 -.03 .25 -.06 .37 -.08 Mixed Aberrance CRIT15 CRIT30 CRIT45 % Aberrance r” Ara, rxy Ara. r3, Ara, No Aberrance .16 .31 .45 15 Low, 15 High .16 0 .29 -.02 .43 -.02 Note. r9. refers to the obtained validity coefficient in the aberrant sample. A r,, refers to the change in validity that is due to the introduction of aberrance (i.e., the difl‘erence between validities for the aberrant and aberrance-free samples). 4 5 Table 7— Change in Validity Due to Aberrance for Reliability Near .80 Low Aberrance CRIT15 CRIT30 CRIT45 % Aberrance rfl Ar” '39» Ar” rxy Ar”, No Aberrance .14 .31 .45 10 .14 0 .30 -.01 .44 - 01 20 .14 0 .29 -.02 .42 -.03 30 .13 -.01 .29 -.02 .41 -.04 40 .13 -.01 .28 -.03 .42 -.03 50 .13 -.01 .27 -.04 .41 -.04 High Aberrance CRIT15 CRIT 30 CRIT45 % Aberrance r3, Ar,I r,, Arxy rxy Ar)? 
No Aberrance .14 .31 .45 10 .14 0 .29 -.02 .44 - 01 20 .13 -.Ol .29 -.02 .42 -.03 30 .14 0 .27 -.04 .42 -.03 40 .13 -.01 .27 -.04 .40 -.05 50 .12 -.02 .27 -.04 .40 -.05 Mixed Aberrance CRIT15 CRIT30 CRIT 45 % Aberrance rxy NW ,0 A,” No Abemnce .14 .31 .45 15 Low, 15 High .14 0 .30 -.01 .44 -.01 Note. I}, refers to the obtained validity coemcient in the aberrant sample. A r,,. refers to the change in validity that is due to the introduction of aberrance (i.e., the difi‘erence between validities for the aberrant and aberrance-free samples). 4 6 Table 8— Change in Validity Due to Aberrance for Reliability Near .90 Low Aberrance ; CRIT15 CRIT30 CRIT45 % Aberrance r3, Ara rxy Ar”. r8, Ara, N o Aberrance .15 .30 .45 10 .15 0 .29 -.01 .44 -.01 20 .14 -.01 .30 0 .44 -.01 30 .15 0 .29 -.01 .43 -.02 40 .15 0 .28 -.02 .42 -.03 50 .14 -.01 .28 -.02 .43 -.02 High Aberrance CRIT15 CRIT30 CRIT45 % Aberrance rm, Ara. ray Afxy :52» ”L No Aberrance .15 .30 .45 10 .14 -.Ol .29 -.01 .44 -.01 20 .14 -.01 .29 -.01 .42 -.03 30 .14 -.01 .28 -.02 .41 -.04 40 .13 -.02 .28 -.02 .41 -.04 50 .14 -.01 .27 -.03 .40 -.05 Mixed Aberrance CRIT15 CRIT30 CRIT45 °/o Aberrance r0, Ara. rxy Aray rxy Ara, N o Aberrance .15 .30 .45 15 Low, 15 High .15 0 .30 0 .44 -.01 Note. 13,, refers to the obtained validity coeficient in the aberrant sample. A r,, refers to the change in validity that is due to the introduction of aberrance (i.e., the difi‘erence between validities for the aberrant and aberrance-flee samples). 4 7 The trends of change in validity appear to be more clear-cut than was the case with changes in reliability. There tended to be a larger decrease in validity with the introduction of spuriously high aberrance than with the introduction of spuriously low aberrance into a data set for each of the “true values” of validity (.15, .30, and .45), but the overall pattern of results was the same for both types of aberrance. In both cases, as the amount of aberrance introduced into the data set increases, there tended to be a greater decrease in the validity coefficient. This was the pattern, again, for each of the “true values” of validity, although there was less of a pattern for validity coeflicients with a “true value” near .15, probably because there was less change in validity for these data sets and so there was a more of a ceiling efl‘ect of variability in change across levels of aberrance. Also true for both types of aberrance, as the “true value” of validity increased, there tended to be a greater decrease in the validity coemcient for each level of aberrance introduced. This pattern of change across “true values” of the validity coeficient is based on absolute change rather than on percentage change in the value of the coemcient. For there to be a pattern of proportional change in the validity coemcients, validities simulated to represent a true value of .45 would have to decrease more than 3 times the number of points that validities simulated to represent a true value of .15 decreased. The changes in validity that occurred in this study represent an approximately even relative change across “true values” of the validity coeficients. For the spuriously low manipulation, the relative changes in validities were about even, on average, across the “true values” of validity (mean A5,, for .15 of -.01, mean A5,, for .30 of -.02, and mean Arxy for .45 of -.03). 
For the spuriously high manipulation, the relative changes in validities simulated to represent a true value of .45 (mean Any of -.04) averaged slightly less than 3 times as large as changes in validities simulated to represent a true value of .15 (mean Any of -.01 5), but validities simulated to represent a true value of .30 (mean Ar,,, 48 of -.03) tended to be about 2 times as large as those in validities simulated to represent a true value of .15. In addition to trends of change in validity across amounts of aberrance and “true values” of validity, there was also a trend in changes in validity across the levels of test reliability that were simulated in the data sets. The general trend was for a smaller effect of aberrance on validity for data sets with a higher “true value” of simulated test reliability. Table 9 displays the amount of change for each of the “true values” of validity averaged across aberrance levels for each level of test reliability that was simulated in the data. There was a tendency toward a smaller decrease in the validity coefficient when the reliability of the test was higher. This pattern held for each level of validity for both spuriously low and spuriously high aberrance manipulations as well as for the data sets with mixed aberrance. Table 10 displays the amount of change in validity for each amormt of aberrance (10%, 20%, 30%, 40%, or 50%) averaged across the three levels of validity for each simulated “true value” of reliability. The pattern of a smaller efl‘ect of aberrance on validity for higher “true values” of reliability is also evident here for both the spuriously low and the spuriously high aberrance manipulations as well as for the data sets with mixed aberrance, except in the case of data sets with the 10% spuriously low aberrance manipulation. Table 9 — Average Change in Validity by Reliability Across Aberrance Low Aberrance “True” Validity RELI70 REL180 RELI90 CRIT15 -.01 -.01 -.00 CRIT30 -.03 -.02 -.01 CRIT45 -.03 -.03 -.02 High Aberrance “True” Validity RELI70 REL180 RELI90 CRIT15 -.02 -.01 -.01 CRIT30 -.05 -.03 -.02 CRIT45 -.06 -.03 -.03 Mixed Aberrance “True” Validity RELI70 RELI80 RELI90 CRIT15 0 0 0 CRIT30 -.02 -.01 0 CRIT45 -.02 -.01 -.01 NLte, Values in the table represent the average amount of change in the validity coeflicient for a particular “true value” of validity (criterion score simulated to be correlated with total test score at approximately .15, .30, or .45) for each simulated level of test reliability (“true” values of .70, .80, .90; obtained values of .71, .82, and .90). Negative values represent a decrease in validity after the introduction of aberrance. Each value for the spuriously high and spuriously low manipulations is averaged across the five levels of aberrance (10%, 20%, 30%, 40%, and 50%). The values for mixed aberrance are based on data sets with 15% low and 15% high aberrance. 
5 0 Table 10 - Average Change in Validity by Reliability Across Validity Low Aberrance % Aberrance RELI70 RELI80 RELI90 10 -.01 -.01 -.01 20 -.02 -.02 -.01 30 -.02 -.02 -.01 40 -.04 -.02 -.02 50 -.04 -.03 -.02 High Aberrance % Aberrance RELI70 RELI80 RELI90 10 -.02 -.01 -.01 20 -.03 -.02 -.02 30 -.05 -.02 -.02 40 -.05 -.03 -.03 50 -.06 -.04 -.03 Mixed Aberrance % Aberrance RELI70 RELI80 RELI90 Low 15, High 15 -.01 -.01 -.00 Mt; Values in the table represent the average amount of change in the validity coefficient for a particular amount of aberrance (10%, 20%, 30%, 40%, or 50% spuriously low or spuriously high aberrance, or 15% low and 15% high mixed aberrance) for each simulated level of test reliability (“true” values of .70, .80, .90; obtained values of .71, .82, and .90). Negative values represent a decrease in validity after the introduction of aberrance. Each value for the spuriously high and spuriously low manipulations is averaged across the three “true values” of validity (.15, .30, or .45). 51 The results of these analyses showing that aberrance has a smaller efl‘ect on validity for larger “true values” of reliability may seem counterintuitive given the previous results showing that aberrance had a larger effect on reliability for larger “true values” of reliability. The reason for these seemingly incongruent findings is that the positive relationship between reliability and validity is stronger than the relationship between magnitude of reliability and effects of aberrance on reliability so that the influence of aberrance on reliability does not play a part in the way that reliability is associated with the efi‘ects of aberrance on validity. In addition, the efl‘ects of aberrance on reliability were so small that any indirect influence aberrance would have on validity through an effect on reliability is not likely to be detected. The reason for the association between higher reliability and less of an efl‘ect of aberrance on validity has to do with the positive relationship between validity and reliability. Tests with higher reliability also have higher validities if all other factors are held constant. If aberrance is increasing some sort of variability in test scores that is not associated with variability in criterion scores, then the aberrance would reduce test validity. For highly reliable tests, however, this increase in variability may be less of a problem, and thus lead to less attenuation in validity coemcients for highly reliable tests then for less reliable tests with the same amount of aberrance. As indicated throughout this discussion, the efl‘ects of aberrance on the data sets with a mix of 15% spuriously low aberrance and 15% spuriously high aberrance show the same pattern as that observed for the data sets with a single type of aberrance. The amount of change in the validity coefiicient, however, tends to be smaller than that observed for either single type of aberrance at the same overall level of aberrance in the data set (i.e., 30%). This smaller effect provides some additional evidence that the two types of aberrance may be canceling each other out in their efl‘ects on validity, although this is less clear than in the case of efi‘ects of aberrance on reliability, because here both 52 spuriously low and spuriously high aberrance reduce validity coemcients, whereas the two types of aberrance had more opposing efl‘ects on reliability. 
The results described in this section can be used to summarize the impact of abenance on validity in terms of the first two research questions. The first research question is concerned with the effect of varying levels of aberrance on validity. The results show a general trend for a greater decrease in validity as the level of aberrance in a data set increases, with larger changes in validity for the introduction of spuriously high aberrance than for the introduction of spuriously low aberrance into a data set. The second research question focuses on the efiea of abenance on varying levels of validity and reliability. The findings show that aberrance has a slightly greater relative efl‘ect on higher “true values” of validity (resulting in larger decreases in higher validities), but that this efi‘ect is lessened for tests with higher levels of reliability. Overall, the effects of aberrance on validity were rather small (though they were larger than the effects of aberrance on reliability), but because of the large sample sizes used in this study, any change in validity can be interpreted as real change in population values. For the most part, however, these changes may be too small to be of practical significance for improving the accuracy of estimates of test validity (i.e., the average change in validity due to the introduction of aberrance was a 6 —- 10% decrease across “true values” of validity). Research sttion 3: 1, Detection Rates and Aberrance Removal The third research question of interest in this study concerns both the degree that the 1, statistic is useful for detecting aberrant response vectors and the degree that using the 1, statistic to flag aberrant response vectors for removal from analyses is a useful way of increasing the accuracy of estimates of test properties. To explore this research question, first the aberrance detection rates for the 1, statistic were examined for the different types and levels of aberrance. Second, 1, scores were used to identify aberrant 5 3 response vectors in each of the data sets. The reliability and validity of the test were recalculated after the response vectors identified as aberrant were removed from the analyses. These new estimates of the test properties were compared to estimates of reliability and validity calculated in corresponding aberrant and aberrance-free samples to see whether removing aberrance from a data set resulted in estimates of reliability and validity that were closer to those calculated on aberrant or aberrance-free data sets. Table 11 displays the proportion of aberrance detected using the 1, statistic out of the total amount of aberrance present in each of the simulated data sets in this study. Response vectors were removed from the analysis if the 1, score for that response vector was less than or equal to -2.00. Results are separated by the type of aberrance and by the “true value” of reliability that was simulated in the data set in order to more easily identify patterns of aberrance detection that may be associated with these parameters. There is no information about the criteria that were simulated in these data sets because criterion scores were maflected by the aberrance manipulations. 5 4 Table 11 — Proportion of Total Aberrance Detected Using the 1, Statistic Low Aberrance % Aberrance RELI70 RELI80 RELI90 10%(1000/10900) .15 (151/1000) .16 (”91000) .31 (”91000) 20% (2000/10,...) 
.10 (190/2000) .10 (2°‘/,ooo) .18 (360/2000) 30% (”m/10.000) .05 (164/3000) .07 (mama) .13 (375/3000) 40% Whom) .05 (210/4000) .0612‘7/4000) .10 (“”1000 50% (moo/10.000) -04 (190/ 5000) ~05 (230/5000) ~07 (340/ 5000) High Aberrance % Aberrance RELI70 REL180 RELI90 10%(1000/10000) 166591000) ~17(m/rooo) 31(307/1000) 20%(2000/10000) sum/2000) 090792000) rum/2000) 30% (mo/10,000) -04 (128/3000) -06 (185/3000) -13 (381/3000) 40% (“m/mm.) .03 (“z/.000) .05 (”s/.000) .09 (352/4000) 50%(’°°°/i.ooo) 0263/5000) .03 (148/5000) .08 (396/5000) Mixed Aberrance % Aberrance RELI70 REL180 RELI90 15% Low, 15% High (mo/10.000) -04 (126/3000) -05 (m/ 3000) -07 (210/3000) Epic; Values in the table represent the proportion of aberrance that was detected using the 1, statistic (i.e., response vectors with I, s -2.00) out of the total number of aberrant response vectors that were simulated in the data set (e.g., for the 20% low abenance condition, 2000 out of 10,000 response vectors were manipulated, so the proportion of aberrance detected is based on a possible total of 2000 vectors). The actual fraction of aberrant response vectors detected is given in parentheses for each data set. 5 5 The table shows that the 1, statistic detected only a very small proportion of the total number of aberrant response vectors in a data set and that the relative proportion of aberrance detected was lower as the total amount of aberrance in the data set increased. The reason for this trend may be that as the proportion of aberrant response vectors in a data set increases, there is a corresponding relative decrease in the number of “normal” response vectors on which the group-determined item parameters for the test are based. This introduces more distortion (or aberrance) into the model of item responses that is used to distinguish between “normal” and aberrant responding and thus likely leads to a greater degree of misclassification of aberrant response vectors as normal. Detection rates for a given amormt of aberrance were higher for data sets with a higher simulated test reliability. This is likely because it is easier to detect aberrant item response patterns for tests that are more internally consistent. For tests with higher internal consistency, the measurement model for the test produces more stringent criteria for determining whether a response vector is aberrant because there is less sampling error in measurement of the construct of interest. For tests with lower internal consistency reliability, response vectors with a greater degree of aberrant item responses would still be classified as normal because more error variation in item responses is classified as being part of the measurement model of the test. Detection of aberrance in the data sets with mixed aberrance was slightly lower than for the data sets that had the same total amount of aberrance of only a single type (i.e., the 30% spuriously low aberrance or the 30% spuriously high aberrance data sets). This may be because the two types of abenance cancel each other out and so a response vector containing item responses that fit both spuriously low and spuriously high types of aberrance may be considered more “normal” (i.e., fit the measurement model of the test better) than response vectors with equal or even slightly smaller amounts of abenance of a single type. 
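The detection analysis summarized in Table 11 can be sketched as follows, assuming the standardized log-likelihood form of the l_z statistic under the three-parameter logistic model and the -2.00 cutoff used throughout this study. This is not the original LZCALC code, and the function names are illustrative; a response vector counts as detected when its l_z score is less than or equal to -2.00.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def lz(u, theta, a, b, c):
    """Standardized log-likelihood index for one 0/1 response vector u,
    given an ability estimate theta and item parameter arrays a, b, c."""
    p = p_3pl(theta, a, b, c)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))          # observed log-likelihood
    mu = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))          # its expectation
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)          # its variance
    return (l0 - mu) / np.sqrt(var)

def detection_rate(lz_scores, manipulated, cutoff=-2.00):
    """Proportion of manipulated (aberrant) response vectors flagged as
    aberrant, i.e., the kind of proportion reported in Table 11."""
    lz_scores = np.asarray(lz_scores)
    manipulated = np.asarray(manipulated, dtype=bool)
    return np.mean(lz_scores[manipulated] <= cutoff)
```

Because the simulation records which response vectors were manipulated, detection_rate can be computed exactly here, whereas in real testing situations the manipulated set is unknown.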
5 6 The second part of the analyses pertaining to the third research question compares estimates of test reliability and validity calculated after the removal of aberrant response vectors to estimates from the same data sets before aberrant response vectors were removed and to aberrance-free data sets with the same “true value” of reliability and validity (where “true values” represent estimates of reliability and validity that are not affected by aberrance). These comparisons are used to determine whether the removal of aberrant response vectors results in estimates of test properties that more closely resemble those from aberrance-free data sets or from data sets that include aberrance. The results of this comparison for test reliability estimates are presented in Table 12, which gives the estimate of reliability for the data set with aberrance, as identified by 1, scores less than or equal to -2.00, removed and the diflerence between this value and the estimate of reliability calculated in aberrance-flee and aberrant samples with the same “true value” of reliability. As in other analyses, results are presented separately for the difl‘erent types of aberrance. 5 7 Table 12 —- Change in Reliability Due to Removal of Aberrance Low Aberrance RELI70 RELI80 RELI90 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AF ree AAb 10% .71 0 0 .83 +01 +01 .90 0 +01 20% .71 0 0 .82 0 0 .90 0 +01 30% .71 0 0 .81 -.01 0 .89 -.01 +01 40% .70 -.01 O .81 -.01 +01 .88 -.02 0 50% .69 -.02 0 .79 —.03 0 .87 -.03 0 HighiAberrance RELI70 RELI80 RELI90 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AFree AAb 10% .73 +02 0 .83 +01 0 .90 0 0 20% .74 +03 0 .83 +01 0 .90 0 0 30% .75 +04 0 .83 +01 0 .90 0 0 40% .75 +04 0 .83 +01 +01 .89 -.01 0 50% .74 +03 0 .82 0 0 .89 -.01 +01 Mixed Aberrance RELI70 RELI80 RELI90 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .71 0 +01 .82 0 0 .89 -.01 0 m Rem refers to the estimate of reliability (rn) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., I, s -2.00) removed fiom the analysis. AF rec refers to the change in rn in the data set with 1,-identified aberrance removed relative to the aberrance-flee data set for the same “true value” of reliability (i.e., the difl‘erence rnRem — rnFree). AAb refers to the change in 13,, in the data set with 1,- identified aberrance removed relative to the data set with aberrance for the same “true value” of reliability (i.e., the difl‘erence rnRem - rnAb). 5 8 The results indicate that using the 1, statistic to identify aberrant response vectors for removal from analyses is not a useful strategy for improving the accuracy of estimates of test reliability. The estimates of reliability calculated on data sets with aberrant response vectors removed are more similar to the estimates of reliability calculated on aberrant data sets than they are to reliability estimates from aberrance-flee data sets. This pattern of results held for all data sets in which there were changes in reliability due to the introduction of aberrance except for the data sets with a “true” reliability of .90 and the 10% spuriously low and 20% spuriously low aberrance manipulations. This means that, for the most part, removing response vectors that received an 1, score of -2.00 or less did not have an effect on estimates of reliability. There are two reasons why removing response vectors with extreme 1, scores did not change estimates of reliability. 
First, the efl‘ects of aberrance on reliability were small to begin with, so there was not a lot of margin for improvement in accuracy of reliability estimates by removing aberrance. Second, as indicated in Table 11, the 1, statistic only identified a small percentage of the response vectors that were manipulated to represent aberrant response patterns, so removing this small percentage of aberrant response vectors did not change whatever effect aberrance did have on reliability (shown in Table 5). Comparisons were also conducted to determine whether removing response vectors identified as aberrant using the 1, statistic would improve the accuracy of estimates of validity. Tables 13, 14, and 15 present the results of these analyses separately for each level of reliability that was simulated in the data sets. Each table gives the three estimates of validity (based on “true values” of .15, .30, and .45) for the data set with aberrance removed and the difi‘erence between these values and the corresponding estimates of validity calculated in aberrance-free and aberrant samples. 5 9 Table 13 - Change in Validity Due to Removal of Aberrance for r,“ Near .70 Low Aberrance CRIT15 CRIT30 CRIT 45 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AF ree AAb 10% .15 -.01 0 .30 -.01 0 .44 -.01 0 20% .15 -.01 0 .29 -.02 +01 .43 -.02 0 30% .15 -.01 0 .28 -.03 O .42 -.03 0 40% .15 -.01 +01 .27 -.04 0 .40 -.05 0 50% .15 -.01 0 .27 -.04 0 .39 -.06 0 High Aberrance RELI70 RELI80 RELI90 Aberrance Rem AFree AAb Rem AF ree AAb Rem AFree AAb 10% .15 -.01 0 .29 -.02 0 .42 -.03 0 20% .14 -.02 0 .27 -.04 0 .41 -.04 0 30% .13 -.03 0 .26 -.05 0 .39 -.06 0 40% .14 -.02 0 .25 -.06 0 .37 -.08 0 50% .13 -.03 0 .25 -.06 0 .36 -.09 -.01 Mixed Aberrance RELI70 RELI80 RELI90 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .16 0 0 .29 -.02 0 .44 -.01 +.01 N_0Le_, Rem refers to the estimate of validity (r,,,) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., 1, S -2.00) removed from the analysis. AF ree refers to the change in r,,, in the data set with I,-identified aberrance removed relative to the aberrance-free data set for the same “true value” of validity (i.e., the difi'erence rvRem - r,,.Free). AAb refers to the change in r,,, in the data set with I,- identified aberrance removed relative to the data set with aberrance for the same “true value” of validity (i.e., the difference r,,Rem - r,,.Ab). 22‘; 6 0 Table 14 — Change in Validity Due to Removal of Aberrance for rn Near .80 Low Aberrance CRIT15 CRIT 30 CRIT45 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 10% .14 0 0 .30 -.01 0 .44 -.01 0 20% .14 0 0 .29 -.02 0 .42 -.03 0 30% .14 0 +01 .29 -.02 0 .41 -.04 0 40% .13 -.01 0 .28 -.03 0 .42 -.03 0 50% .13 -.01 0 .28 -.03 +01 42 -.03 + 01 High Aberrance CRIT15 CRIT 30 CRIT45 Aberrance Rem AF rec AAb Rem AF rec AAb Rem AF rec AAb 10% . 14 0 0 .29 -.02 0 .44 -.01 0 20% .13 -.01 0 .29 -.02 0 .42 -.03 0 30% .13 -.01 -.01 .27 -.04 0 .42 -.03 0 40% .12 -.02 -.01 .27 -.04 0 .40 -.05 0 50% .12 -.02 0 .27 -.04 0 .39 -.06 -.01 Mixed Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .14 0 0 .30 -.01 0 .44 -.01 0 Note. Rem refers to the estimate of validity (my) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., 1, S -2.00) removed from the analysis. 
AFree refers to the change in r,,, in the data set with 1,-identified aberrance removed relative to the aberrance-free data set for the same “true value” of validity (i.e., the difl‘erence r,,.Rem — rWFree). AAb refers to the change in r”, in the data set with 1,- identified aberrance removed relative to the data set with aberrance for the same “true value” of validity (i.e., the difference r,,.Rem - r,,.Ab). 6 1 Table 15 — Change in Validity Due to Removal of Aberrance for r,, Near .90 Low Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AF ree AAb Rem AFree AAb Rem AFree AAb 10% .15 0 0 .30 0 +01 .44 -.01 0 20% .15 0 +01 .30 0 0 .44 -.01 0 30% .15 0 0 .28 -.02 +01 .43 -.02 0 40% .15 0 0 .28 -.02 0 .42 -.03 0 50% .14 -.01 0 .28 -.02 0 .42 -.03 -.01 High Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AF ree AAb Rem AF ree AAb Rem AF ree AAb 10% .14 -.01 0 .30 0 +01 .44 -.01 0 20% .14 -.01 0 .29 -.01 0 .43 -.02 +01 30% .14 -.01 0 .28 -.02 0 .42 -.03 +01 40% .13 -.02 0 .28 -.02 0 .41 -.04 0 50% .13 -.02 -.01 .26 -.04 -.01 .39 -.06 -.01 Mixed Aberrance CRIT15 CRIT30 CRIT45 Aberrance Rem AFree AAb Rem AFree AAb Rem AFree AAb 15L, 15H .15 0 0 .30 0 0 .44 -.01 0 Mg, Rem refers to the estimate of validity (r9) calculated in the data set with aberrant response vectors identified by the 1, statistic (i.e., 1, s -2.00) removed fi'om the analysis. AF rec refers to the change in "xy in the data set with I,-identified aberrance removed relative to the aberrance-free data set for the same “true value” of validity (i.e., the difi'erence r,,.Rem — r,,.Free). AAb refers to the change in rer in the data set with 1,- identified aberrance removed relative to the data set with aberrance for the same “true value” of validity (i.e., the difference rgRem — r,,.Ab). 62 The same pattern of results was found across all three “true values” of reliability, so the information shown in Tables 13, 14, and 15 will be discussed together. Similar to the results found for the effect of removing aberrance on reliability, the evidence suggests that removing aberrance fi'om the data sets had little to no effect on estimates of validity. The comparisons in these tables show that the validity coefficients in the data sets with response vectors having 1, scores less than or equal to -2.00 removed were more similar to the validity coefficients in the data sets with aberrance than in the aberrance-flee data sets. This pattern of results held for almost all of the data sets in which there were changes in validity due to the introduction of aberrance, regardless of the level and type of aberrance in the data set or the “true value” of validity simulated. The reasons for the lack of effect for removing response vectors identified as aberrant by the 1, statistic on validity are the same as those given previously to explain the results for reliability. There is a possibility that the effects of aberrance on validity were too small to produce an adequate margin for improvement with the removal of response vectors with extremely low 1, scores. This is less of an issue in the case of validity than it was in the case of reliability, however, because the effects of aberrance on validity were somewhat larger, and the pattern of effects on validity is more clear-cut than the pattern for reliability, both for the aberrant data sets and for the data sets with aberrance identified by the 1, statistic removed. 
The lack of change in validity after response vectors were removed from the analysis is most likely due to the extremely small proportions of aberrant response vectors that were identified by the 1, statistic, as shown in Table 11. These results can be used to draw conclusions relevant to the third research question, which asks about the effects of using the 1, statistic to identify aberrant response vectors for removal from analyses estimating the reliability and validity of a test. Overall, the results of this study indicate that there is little to no change in estimates of reliability and validity from using the 1, statistic to flag aberrant response vectors. The reason for 6 3 this lack of effect may be partially because there was little change in validity and, particularly, in reliability, due to the introduction of aberrance, so there was little room for improvement in these estimates by removing aberrance. The main reason that there was no efiect from removing response vectors with extremely low 1, scores, however, was because only a small proportion of aberrant response vectors for all levels and types of aberrance manipulation were identified as being aberrant based on 1, scores. This means that, based on the results of this study, the 1, statistic is not an adequate indicator of aberrance for the purpose of attempting to improve the accuracy of estimates of test characteristics, regardless of the effects of aberrance on test properties. DISCUSSION The purpose of this study was to examine a proposed application of appropriateness indices as indicators of response vectors in a test validation sample that may be distorting estimates of the psychometric properties of tests. A second and related objective was to determine why previous research on this topic found small to no efl‘ects on validity after removing response vectors identified as aberrant from analyses. The research questions of interest in this study centered arormd the efi‘ects of varying amounts and types of aberrance on varying levels of reliability and validity and the effects of using the 1, statistic to flag aberrant response vectors for removal from analyses to improve estimates of these test properties. This study expanded on the prior research in several ways. Simulated data were used to investigate a variety of situations in which aberrance may afl‘ect test properties. The simulation procedure also removed the limitations of sample size and test length in the prior work and allowed a controlled analysis of the degree that the 1, statistic was detecting aberrance in a data set. In addition, the efiects of aberrance on test reliability as well as the efl‘ects on test validity were examined to attempt to determine the reasons why aberrance does or does not affect test properties in particular ways. Overall, the results of this study indicate that the efi‘ects of aberrance on test properties are small to negligible for all types and amounts of aberrance simulated in this study, although the efl‘ects of aberrance on validity are somewhat larger than the efl'ects of aberrance on reliability. Because of the large sample sizes (n = 10,000) used in the data sets, however, any change in reliability or validity was considered an indicator of 64 65 a true effect of aberrance on test properties. Based on this consideration, several trends in the relationship between aberrance and test properties are worth noting. 
The first research question asks about the effects of varying amounts of aberrance on test reliability and test validity. The effects of aberrance on reliability differed based on the type of aberrance that was simulated in the data set. In both cases, as the amount of aberrance in the data set increased, the estimate of reliability decreased. For the spuriously low manipulation, however, introducing aberrance into a data set always resulted in slightly lower estimates of test reliability than those in data sets with no aberrance. For the spuriously high manipulation, introducing aberrance resulted in changes in reliability estimates from data sets with no aberrance that ranged fi'om slight increases for small amounts of aberrance to slight decreases for larger amounts of aberrance. The trend in effects of aberrance on validity was more homogeneous. There were slightly larger changes in validity for the introduction of spuriously high aberrance than for the introduction of spuriously low aberrance, but the direction of change was the same in both cases. For both types of abenance manipulations, as the amount of aberrance in the data set increased, the estimate of test validity decreased from the “true value” of validity as estimated in data sets with no aberrance. The second research question asks about the effects of aberrance on varying levels of test reliability and test validity. The effects of aberrance on both reliability and validity varied based on the “true value” of the test property, as estimated on data sets with no aberrance. Again, the effects of aberrance on varying levels of reliability differed based on the type of aberrance manipulation that was used. For data sets with the spuriously low aberrance manipulation, there were slightly greater decreases in estimates of reliability for higher “true values” of reliability for any amount of aberrance. For data 66 sets with the spuriously high aberrance manipulation, changes in reliability became more negative and/or less positive as the “true value” of reliability increased. There was a trend in changes in validity coefficients both for changes in the “true value” of validity and for changes in the “true value” of reliability. These trends remained the same for both types of aberrance manipulations. As the “true value” of validity increased, there were larger absolute decreases in the value of the validity coemcient for each amount of aberrance. When relative changes in validity were examined, however, the decreases in coefficients remain relatively constant across levels of validity. As the “true value” of reliability increased, there were smaller decreases in the value of the validity coemcient across all levels of validity and all amounts of aberrance. In addition to the data sets which were simulated to represent only a single type and level of aberrance, a data set was also simulated with a combination of both spuriously low and spuriously high response vectors. This data set was simulated in order to examine the possibility that the two types of aberrance commonly described in research on test appropriateness cancel each other out in effects on test properties. The effects of the mixed types of aberrance on test properties were lower than the effects of either type of aberrance alone for the same overall level of aberrance in the data set, supporting the possibility that the two types of aberrance may cancel each other out in efl‘ecting characteristics of the data set. 
The third research question asked about the aberrance detection rates of the lz statistic and about the effects of using the lz statistic to flag aberrant response vectors for removal from analyses on estimates of reliability and validity. Overall, the detection rates of the lz statistic were very low for all types and amounts of aberrance simulated in the data sets, with less than 20% of the simulated aberrance detected for all but two data sets (31% detection for 10% aberrance of either type at a "true value" of reliability of .90). The proportion of aberrance detected in a data set decreased as the amount of aberrance in the data set increased, while the proportion of aberrance detected increased as the "true value" of reliability increased. Aberrance detection rates for the mixed aberrance data set were slightly lower than for the corresponding data sets with a single type of aberrance, providing further evidence that the effects of spuriously low aberrance and spuriously high aberrance on a data set cancel each other out.

As would be expected based on these low detection rates, removing response vectors that were flagged as aberrant based on lz scores had little to no effect on estimates of test reliability and validity. In almost all cases, estimates of both reliability and validity that were calculated on data sets with response vectors removed were more similar to the data sets with aberrance than to the data sets that were simulated to be free of aberrance. This general pattern held across all combinations of types and amounts of aberrance as well as across "true values" of reliability and validity. These results indicate that the lz statistic is not very effective as an indicator of test inappropriateness for the purpose of removing aberrant response vectors to improve estimates of test properties.

Implications of the Present Study for Previous Research

The results of this study offer several possible reasons for the results of previous research that showed small to no effects on test validity after removing response vectors identified as aberrant based on lz scores. Most likely, a combination of these explanations is responsible for the previous findings.

The first potential reason for the weak effects that were found is that, in most cases, aberrance may not have an effect on test properties that is large enough to make a practically significant difference in the estimation of these values. In this study, the introduction of aberrance into data sets had only a small effect on both reliability (an average change of one to one and one half points) and validity (an average change of two to three points).

The aberrance simulations used in this study mirror those that would be found in real testing situations. A respondent may receive a test score that is spuriously low if they respond carelessly or randomly or if they have a poor understanding of the test instructions or testing procedures in general. A respondent may receive a spuriously high score if they cheat or receive special coaching on the item content or item types that make up the test. Because of this, the small changes in test properties due to aberrance that were found in this study are likely to be similar in magnitude to changes found in other studies using data from real testing situations.
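To make the two manipulations concrete, the following is a minimal sketch, in Python, of how spuriously low and spuriously high response vectors of this kind might be generated. The function name, the use of NumPy, and the rule for selecting which responses are overwritten are illustrative assumptions rather than the exact procedure implemented in the BIDEV program (Peters, 1995) that was used to introduce aberrance in this study.

    import numpy as np

    def make_aberrant(responses, kind, proportion=0.30, rng=None):
        """Return a copy of a 0/1 response vector with a fraction of its
        items overwritten to mimic aberrant responding.

        kind == "low"  : overwrite randomly chosen items with incorrect (0)
                         responses, mimicking careless or random responding.
        kind == "high" : overwrite randomly chosen items with correct (1)
                         responses, mimicking cheating or item pre-knowledge.

        Illustrative assumption: the original study changed 30% of the
        responses (15 of 50 items) in each selected vector, but its exact
        selection rule may have differed from the random choice used here.
        """
        rng = np.random.default_rng() if rng is None else rng
        out = np.array(responses, dtype=int).copy()
        n_change = int(round(proportion * out.size))
        idx = rng.choice(out.size, size=n_change, replace=False)
        out[idx] = 0 if kind == "low" else 1
        return out

    # Example: one 50-item vector, with 30% spuriously low and high versions.
    rng = np.random.default_rng(1996)
    vector = rng.integers(0, 2, size=50)
    low_vec = make_aberrant(vector, "low", rng=rng)
    high_vec = make_aberrant(vector, "high", rng=rng)
    print(vector.sum(), low_vec.sum(), high_vec.sum())  # total score shifts down / up

In this sketch the manipulation acts on total score in the intended direction while leaving the unmanipulated responses untouched, which parallels the way the simulated aberrance was layered onto otherwise model-consistent data in this study.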
In addition, this study found that there were smaller changes in test properties in data sets with aberrant response vectors representing a mix of spuriously high and spuriously low test scores than in data sets with only a single type of aberrant response vector. These findings were interpreted as representing the possibility that different types of aberrance may cancel each other out in their effects on the data set. In real testing situations, it is likely that different types of aberrant responding will occur within the sample of test respondents. If the findings of this study generalize to real testing situations, this means that the changes in test properties due to aberrance will be even smaller than what was demonstrated in these simulations, because the effects of spuriously low aberrance and spuriously high aberrance will cancel out, assuming that the two types of aberrance occur at similar rates within the sample.

The second potential reason for the previous research findings of weak effects of aberrance removal on estimates of test properties is that the lz statistic did not identify enough of the aberrance in the data set for the removal of the flagged response vectors to make a difference in test properties. In real testing situations, the amount of aberrance in the data set is unknown. In this study, however, the use of simulated data allowed the comparison of the amount of aberrance in a data set with the amount of aberrance detected using the lz statistic. Although this was done in previous research on test appropriateness, this study expands on the prior findings by examining varying types and amounts of aberrance that are likely to be found in data sets in real testing situations. In addition, the use of simulated data allowed for the comparison of both aberrant data sets and data sets with aberrance removed, as well as the comparison of both of these to data sets with no aberrance. These comparisons resulted in the finding that the lz statistic is not an adequate indicator of the types and amounts of aberrance simulated in this study and that removing response vectors identified as aberrant based on lz scores had little effect on estimates of test properties. To the extent that the aberrance simulations used in this study generalize to real testing situations, the previous research findings can be explained by the lack of aberrance detected using the lz statistic.

In summary, the proposed application of the lz statistic to improve the accuracy of estimates of test properties is not very useful. This is likely both because of the small effects on test properties of the types of aberrance commonly examined and because of the low detection rates of aberrance using the lz statistic, particularly when there is a large amount of aberrance or when there are mixed types of aberrance in a single data set. These results go against the common-sense assumption that aberrance would affect test properties because of the introduction of response variation that does not fit the model of responses based on classical test theory or item response theory. The next question, then, is what it is about this application of the lz statistic that fails to make a difference in test properties, and/or what it is about the types of aberrance commonly examined in previous research that does not affect test properties.
A Potential Explanation for the Present and Previous Findings

The underlying cause of both the lack of effect of aberrance on test properties and the low aberrance detection rate of the lz statistic may have to do with the way the introduction of aberrance changes the characteristics of the data set. This influences both the way item responses are modeled using item response theory and the way the lz statistic is calculated. One possibility is that the introduction of aberrance into a data set distorts the properties of the data set to the extent that it is difficult to categorize particular response vectors as "normal" or aberrant based on the information provided by the distorted data set. This difficulty in the classification of individual response vectors within a data set has implications for both the influence of aberrance on test properties and the influence of aberrance on the calculation of lz scores.

Although previous research has stated that using item and person parameter values from data sets with aberrance does not produce significant distortion in appropriateness fit statistics (Levine & Drasgow, 1982), this may not always be the case, particularly for data sets with large amounts of aberrance, such as those simulated in this study. In Levine and Drasgow's (1982) work on the effects of aberrance on item and person parameter estimation, only 200 response vectors out of a sample of 3000 underwent a 20% spuriously low aberrance manipulation, which translates to approximately 7% aberrant response vectors within the data set. This is lower than even the lowest percentage of aberrance simulated in the data sets used in this study, and the amount of change within a response vector is also lower than the amount of change simulated in this study (i.e., 30% of responses within a vector). It is thus unclear whether the results of that work hold for the present study.

The value of the lz statistic for a particular response vector is based on the extent to which the response pattern fits an estimate of ability on the construct of interest, which is in turn based on group-determined item parameters from an item response theory model of the data. In data sets with a large number of response vectors that do not fit the ideal response pattern based on the maximum likelihood function, the estimation procedure for the IRT model itself is likely to be distorted. This is because the estimation procedures used by programs like BILOG to estimate IRT models of data assume that the test is unidimensional. As discussed previously, one way of defining test inappropriateness is the extent to which a test is measuring something other than the construct of interest for a subset of respondents. When there is a large proportion of respondents for whom the test is inappropriate, the assumption of unidimensionality is likely violated to the extent that IRT models of the data are distorted, resulting in distorted estimates of both item and person parameters. When these distorted estimates are, in turn, used to calculate lz values, indicating the appropriateness of the item response model for a particular response vector, the response vector may fit the distorted model because of the large number of aberrant response vectors on which the model was based, resulting in low detection rates for the lz statistic.
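As a reminder of how the index depends on those estimates, the following is a minimal sketch of the standardized log-likelihood person-fit index, using the conventional definition (Drasgow, Levine, & Williams, 1985) with three-parameter logistic item response probabilities, as in the BILOG calibrations used here. The function names and the scaling constant D = 1.7 are assumptions for illustration; the values reported in this study were presumably computed with the LZCALC program (Peters, 1993) cited in the references.

    import numpy as np

    def p3pl(theta, a, b, c, D=1.7):
        """Probability of a correct response under the three-parameter
        logistic model: c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
        return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

    def lz(responses, theta, a, b, c):
        """Standardized log-likelihood index lz for one response vector.

        l0      = sum_i [u_i ln P_i + (1 - u_i) ln(1 - P_i)]
        E(l0)   = sum_i [P_i ln P_i + (1 - P_i) ln(1 - P_i)]
        Var(l0) = sum_i P_i (1 - P_i) [ln(P_i / (1 - P_i))]^2
        lz      = (l0 - E(l0)) / sqrt(Var(l0))

        Large negative values flag response patterns that are unlikely
        under the estimated model (i.e., possible aberrance).
        """
        u = np.asarray(responses, dtype=float)
        p = p3pl(theta, np.asarray(a), np.asarray(b), np.asarray(c))
        q = 1.0 - p
        l0 = np.sum(u * np.log(p) + (1.0 - u) * np.log(q))
        e_l0 = np.sum(p * np.log(p) + q * np.log(q))
        var_l0 = np.sum(p * q * np.log(p / q) ** 2)
        return (l0 - e_l0) / np.sqrt(var_l0)

    # Illustrative 5-item example with made-up parameters.
    a = [0.8, 1.0, 1.2, 0.9, 1.1]
    b = [-1.0, -0.5, 0.0, 0.5, 1.0]
    c = [0.2] * 5
    print(lz([1, 1, 1, 0, 0], theta=0.0, a=a, b=b, c=c))  # well-fitting pattern
    print(lz([0, 0, 0, 1, 1], theta=0.0, a=a, b=b, c=c))  # reversed, misfitting pattern

The point relevant to the argument above is that every term in lz depends on the estimated theta and item parameters; when those estimates are themselves distorted by aberrance, the index is standardized against a distorted expectation.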
In addition, internal consistency reliability estimates for the distorted data sets may remain high because responses may be internally consistent for the measurement of more than one construct, resulting in high coefficient alphas but a violation of the unidimensionality assumption needed for accurate IRT models of the data. Support for this explanation comes from Reise (1995), who found that aberrance detection rates using the lz statistic were always lower when using estimates of theta from aberrant data sets than when using estimates of theta from aberrance-free data sets.

A post hoc analysis of the data from this study also supports this explanation of the results. The data sets starting at a "true" reliability of .90 with 30% spuriously low aberrance and with 30% spuriously high aberrance were chosen for this analysis because they represent a moderate amount of aberrance with a high degree of internal consistency before the addition of aberrance. The item and person parameters in these data sets were compared with those for the data set with a "true" reliability of .90 and no aberrance. Both item and person parameters showed some distortion due to the introduction of aberrance, with different patterns of influence for spuriously high and spuriously low aberrance but the same overall magnitude of change for either type. The results of the comparison of person parameters are given in Appendix B, and the results of the comparison of item parameters are given in Appendix C.

Person parameter estimates (i.e., theta, or ability on the construct measured by the test) changed in different ways depending on whether the response vector was chosen for the aberrance manipulation or remained unchanged after the manipulation. Table 4A presents the results of comparisons between estimates of theta based on 30% aberrant and aberrance-free data sets with a "true" reliability of .90. For the spuriously low manipulation, manipulated response vectors showed a decrease in theta following the manipulation, and for the spuriously high manipulation, manipulated response vectors showed an increase in theta following the manipulation. This would be expected, because the purpose of the manipulations was to model response vectors with lower or higher scores, respectively. After the aberrance manipulations, however, estimates of theta changed even for the response vectors that were not part of the manipulation, though these changes were smaller in magnitude than the changes in response vectors that were part of the manipulations. For the spuriously low manipulation, estimates of theta for unchanged response vectors increased, and for the spuriously high manipulation, estimates of theta for unchanged response vectors decreased. The change in estimates of theta for the response vectors that were not part of the manipulation is evidence that the introduction of aberrance into a data set changed the item response theory model of the data, such that person parameters were distorted. Because person parameters were distorted by the introduction of aberrance, it was expected that item parameters would also be distorted, since person and item parameters are estimated simultaneously in an iterative process using the maximum likelihood estimation procedure in BILOG.
Table 5A presents the results of comparisons between estimates of item parameters based on 30% aberrant and aberrance-free data sets with a "true" reliability of .90. For both types of aberrance manipulations, discrimination parameters decreased after the introduction of aberrance. This means that items were providing less information about the relative standing of the response vectors after the introduction of aberrance. Item difficulties increased after the introduction of spuriously low aberrance and decreased after the introduction of spuriously high aberrance. This pattern of results is also expected based on the nature of the aberrance manipulations, because the spuriously low manipulation resulted in a higher number of incorrect answers for an item and the spuriously high manipulation resulted in a higher number of correct answers for an item, as compared to the aberrance-free data set. The guessing parameter showed a mixed pattern of change across both types of aberrance manipulations.

These changes in item and person parameters after the introduction of aberrance suggest that the reason the lz statistic does not identify a large proportion of the aberrant response vectors is that the IRT model of the data is itself corrupted by the presence of aberrance. As there is more aberrance in the data set as a whole, it would be expected that relatively less of that aberrance would be detected, because the IRT model of item responses would reflect more of the aberrance in the data set, and so fewer of the response vectors would differ from the model enough to obtain an extreme lz score. This would explain the low detection rates of the lz statistic in the present study and the tendency for a relatively lower proportion of aberrance to be detected as the overall amount of aberrance in the data set increased. Additional support for this explanation can be found in Appendix A, which contains average item parameter values for each of the tests that were simulated. As the amount of aberrance in a data set increases, the average item discrimination tends to decrease (except in the case of data sets with a "true" reliability of .70). For spuriously low aberrant data sets, as the amount of aberrance increases, the average item difficulty increases. For spuriously high aberrant data sets, as the amount of aberrance increases, the average item difficulty decreases. The guessing parameter does not show as clear a pattern of change across levels of aberrance.

In addition to explaining the low detection rates of aberrance in this study, the distorted IRT model also explains why aberrance, as simulated in this study, did not have a large effect on test properties. The introduction of aberrance caused the item and person parameter estimates to change so that the resulting maximum likelihood model fit the data even with the inclusion of aberrant response vectors. This means that aberrant response vectors were not distinguished from response vectors that were free of aberrance in the IRT model. It is likely, then, that aberrant response vectors were also not distinguished from aberrance-free response vectors when the data were used to calculate test reliability and validity.
An additional way to think about this explanation is in terms of the amount of information provided by the item response theory model. If there is a large amount of aberrance in a data set, the overall amount of information contained in the IRT model may decrease. This means that there is less information with which to determine whether a particular response vector departs from the model. The decrease in item discrimination values observed for both types of aberrance manipulations in the post hoc analyses, and the correspondingly low detection rates for the lz statistic in this study, support this interpretation. In addition, there was a tendency toward higher standard errors for both item parameters and person parameters in the aberrant data sets as compared to the aberrance-free data set. This indicates that there is more error variance (and less meaningful variance) in the IRT model for the aberrant data sets than for the aberrance-free data set.

In addition to explaining the low detection of aberrance using the lz statistic, the lack of information in the IRT model for the aberrant data sets also explains why aberrance, as simulated in this study, did not have a large effect on test properties. If less information is available to determine whether a response vector is aberrant, the information-deficient model may nevertheless fit the data well enough that internal consistency estimates are not greatly affected, which matches the findings of this study. If there is a sufficiently large number of individuals with the same type of aberrance, as in this study, the covariation of responses to items with similar parameter values may be maintained despite the inaccuracy of the response pattern as a measure of the individual's standing on the construct of interest. In the post hoc analyses, despite the changes in the magnitude of item parameters due to the introduction of aberrance, the rank order of the values of these parameters was not drastically altered, providing some support for this explanation.

The results with respect to the small effects of aberrance on validity are slightly more difficult to explain, because aberrance had no influence on criterion scores. Again, however, the information-deficient model may be adequate for assessing the covariation of total test scores with a criterion without being an accurate representation of the respondent's standing on the construct measured by the test. If there is a large number of respondents with the same type of aberrance manipulation, as in this study, the rank order of individuals based on total test score may not change much, which would result in small changes in validity. The post hoc analyses reveal that this seemed to be the case: although the magnitude of estimates of theta changed with the introduction of aberrance, the rank order of response vectors did not change drastically, providing support for this explanation.

This explanation can also account for why there were greater changes in validity than in reliability due to the introduction of aberrance. Validity was based on the relationship of rank orders on test score, which changed due to the introduction of aberrance, with rank orders on criterion score, which did not change. Reliability was based on covariation among items within the test, which changed in similar ways with the introduction of a particular form of aberrance. There was more potential for change in validity than in reliability after aberrance was introduced because the distortion from the introduction of aberrance did not have as uniform an effect on the components of the validity calculation as it did on the components of the reliability calculation.
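The information argument above can be made concrete with the standard three-parameter logistic item information function (Birnbaum, 1968). The sketch below compares the total test information at a fixed ability level before and after a drop in average item discrimination of roughly the size observed in Table 5A. The specific difficulty and guessing values, and the helper names, are illustrative assumptions, not the parameters estimated in this study.

    import numpy as np

    def item_info_3pl(theta, a, b, c, D=1.7):
        """Item information under the 3PL model: I(theta) = P'(theta)^2 / (P * Q),
        where P = c + (1 - c) / (1 + exp(-D * a * (theta - b))) and Q = 1 - P."""
        L = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        P = c + (1.0 - c) * L
        Q = 1.0 - P
        P_prime = D * a * (1.0 - c) * L * (1.0 - L)
        return P_prime ** 2 / (P * Q)

    def test_info(theta, a, b, c):
        """Test information is the sum of item information over the test."""
        return np.sum(item_info_3pl(theta, np.asarray(a), np.asarray(b), np.asarray(c)))

    # Illustrative 50-item test evaluated at theta = 0: lowering the average
    # discrimination from about .84 to .76 (the shift seen in Table 5A) reduces
    # the total information available for separating well-fitting from aberrant
    # response vectors at that ability level.
    n_items, theta0 = 50, 0.0
    b = np.linspace(-2, 2, n_items)
    c = np.full(n_items, 0.16)
    print(test_info(theta0, np.full(n_items, 0.84), b, c))
    print(test_info(theta0, np.full(n_items, 0.76), b, c))

Under these assumed parameter values, the second total is smaller than the first, which is consistent with the interpretation that an aberrance-distorted calibration leaves less information with which the lz statistic can flag misfitting response vectors.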
The pattern of results showing small changes in the rank order of person and item parameters with the introduction of a single type of aberrance might be expected to differ if multiple types of aberrance were present in the same data set. If this were true, there would be reason to expect larger effects of aberrance on reliability and validity in the presence of mixed types of aberrance. This does not match the findings of this study, which showed that mixed aberrance generally had an even smaller effect on reliability and validity than the same overall amount of aberrance of a single type. These results do not necessarily contradict the explanation described here, however, if the effects of the two types of aberrance canceled each other out within the data set. If this were the case, the resulting IRT model would show adequate fit to the data because the effects of the two types of aberrance would largely cancel out. This account is consistent with the smaller effects of mixed aberrance found in this study without negating the lack of change in the rank order of item and person parameters found for the introduction of a single type of aberrance. Additional evidence for this explanation can be found in Appendix A, which gives the average item parameter values for each of the tests simulated in this study. The data sets with mixed aberrance show smaller changes in item difficulty than tests with a single type of aberrance, indicating that the effects of the two types of aberrance may cancel out.

If this explanation is correct, the implication of these findings is that the lz statistic is not an adequate indicator of aberrance when the true values of item and person parameters are unknown and the data set contains a moderate to large proportion of aberrant response vectors. In this case, the aberrant data set must be used to estimate the IRT model of the data that is necessary for the calculation of lz values. This model will be distorted by the presence of aberrance and, as a result, will fit the aberrant data set because the item and person parameters are based on the aberrant data rather than on the "true values" of the item and person parameters. This means that it will be difficult to identify which response vectors are aberrant and which are free of aberrance using scores on the lz statistic. Because the true values of item and person parameters are seldom known in real testing situations, the lz statistic may not be very useful for the purpose of identifying aberrant response vectors for removal from analyses to increase the accuracy of estimates of test properties.

These findings also have implications for the use of the lz index as a diagnostic indicator of individuals who may need special instruction or special testing conditions in order for tests to be representative measures of their standing on the construct of interest. If there is a large number of aberrant responders within the data set, then this application of the lz statistic would be flawed for the same reasons given above. If there are only a few aberrant responders in a large sample, then the lz statistic, as well as some of the other appropriateness indices described earlier, may be more useful for detecting these individuals, because the aberrance would not produce a large distortion in the IRT model of the test data, so the few aberrant individuals in the sample would stand out from the rest of the group. This application should still be used cautiously, however, because there may not be adequate evidence to determine the likely proportion of aberrant responders in a particular sample.
Limitations and Contributions

Although this study sought to explain the reason(s) for the modest results found for one proposed application of the lz statistic, there were several limitations in the procedures used that require caution in the interpretation of the results presented here. As outlined above, the use of item and person parameters from data sets with aberrance present may distort the item response theory model of the data such that the effects of aberrance on test properties and on appropriateness indices are difficult to disentangle. Some discussion of the nature of these effects was offered, but without a comparison with IRT models of the data sets estimated using true item parameter values and/or true values of theta, the exact nature of the influence of the aberrance simulations used here on test properties and appropriateness indices remains unclear.

A second limitation of this study has to do with the way that aberrance was simulated in the data sets. In this study, aberrance was introduced randomly, without regard for the true values of the item or person parameters. In a real testing situation, item difficulties and respondents' ability on the construct of interest are likely to influence the nature of their responses in terms of the types of aberrance described here. For example, a high-ability individual who is answering moderately difficult items will be less likely to guess randomly or to cheat than an individual with low ability who is working on items that are perceived as being too difficult. Because of the other difficulties in identifying the influence of aberrance on test data that were discussed earlier, it is unclear how aberrance that is correlated with ability or item difficulty would influence the results reported here. In addition, all response vector manipulations to simulate aberrant responding involved changing 30% of the item responses within the vector (i.e., 15 out of 50 responses). Although the pattern of results reported here is not expected to differ dramatically if the proportion of responses manipulated within a response vector were changed, this is a possibility that could be explored in future research.

Despite these limitations, the present study makes several contributions to research on test appropriateness. This study examines an application of appropriateness indices that has been discussed in the prior literature but has received little empirical support. The use of simulated data in this study removed the limitations of sample size, test length, and amount of aberrance in the prior research using the lz statistic to flag aberrance for removal from analyses. The present study included a broader range of simulated testing conditions than any of the prior research on appropriateness statistics, such as varying levels and types of aberrance and varying values of test reliability and validity. This procedure allowed the examination of potential reasons for the modest results found in prior research when the response vectors flagged as aberrant using the lz statistic were removed from analyses and validities were recalculated. The results showed that the nature of the estimation process influences the utility of the lz statistic for identifying aberrant response vectors.
In real testing conditions, it may not be possible to achieve the conditions necessary to ensure adequate aberrance detection when there are a large number of aberrant response vectors present. Future research should attempt to determine more clearly how the item response theory estimation process is influenced by the inclusion of varying amounts of aberrance in a data set and under what conditions the lz statistic should be used to identify aberrance in real testing situations.

APPENDICES

APPENDIX A
Test Summary Statistics

Table 1A - Test Statistics for Reliability Near .70

No Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
0             26.91   6.13   .05    .552   .894    .256

Low Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            26.39   6.15   .05    .566   1.006   .262
20            25.90   6.19   .05    .599   1.075   .270
30            25.40   6.19   .05    .596   1.142   .271
40            24.90   6.09   .04    .637   1.229   .281
50            24.39   6.04   .04    .622   1.344   .281

High Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            27.60   6.33   .05    .549   .677    .245
20            28.31   6.47   .05    .524   .549    .239
30            28.98   6.56   .06    .517   .340    .225
40            29.68   6.52   .06    .511   .196    .219
50            30.38   6.42   .05    .469   .081    .219

Mixed Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
15L, 15H      26.99   6.12   .05    .541   .839    .253

Note. r_ii = mean inter-item correlation. a = mean item discrimination. b = mean item difficulty. c = mean item guessing.

Table 2A - Test Statistics for Reliability Near .80

No Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
0             28.45   7.56   .09    .651   .308    .219

Low Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            27.91   7.59   .09    .665   .401    .222
20            27.34   7.54   .08    .656   .518    .229
30            26.77   7.37   .08    .633   .542    .225
40            26.24   7.32   .07    .648   .678    .232
50            25.65   7.13   .07    .632   .777    .237

High Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            29.10   7.60   .09    .639   .180    .211
20            29.74   7.63   .09    .610   .101    .210
30            30.39   7.65   .09    .622   .028    .216
40            31.04   7.50   .09    .587   -.121   .207
50            31.69   7.35   .08    .570   -.205   .210

Mixed Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
15L, 15H      28.46   7.46   .08    .629   .295    .220

Note. r_ii = mean inter-item correlation. a = mean item discrimination. b = mean item difficulty. c = mean item guessing.

Table 3A - Test Statistics for Reliability Near .90

No Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
0             26.92   9.53   .15    .845   .227    .157

Low Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            26.40   9.38   .15    .815   .296    .161
20            25.91   9.30   .14    .786   .353    .163
30            25.40   9.12   .13    .763   .413    .165
40            24.88   8.91   .12    .740   .520    .168
50            24.38   8.67   .12    .699   .574    .168

High Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
10            27.61   9.52   .15    .832   .171    .157
20            28.28   9.52   .15    .802   .122    .161
30            28.99   9.36   .15    .762   .035    .159
40            29.67   9.18   .14    .744   -.001   .171
50            30.40   8.94   .13    .721   -.111   .166

Mixed Aberrance
% Aberrance   Mean    SD     r_ii   a      b       c
15L, 15H      27.00   9.29   .14    .780   .239    .163

Note. r_ii = mean inter-item correlation. a = mean item discrimination. b = mean item difficulty. c = mean item guessing.

APPENDIX B
Change in Theta Due to 30% Aberrance

Table 4A - Change in Theta Due to 30% Aberrance

Total Data Set (n = 10,000)
Aberrance      Mean    SD     Mean change
None           .03     .94    -
30% Low        .01     .93    -.02
30% High       .05     .94    +.02

Changed Vectors Only (n = 3,000)
Aberrance      Mean    SD     Mean change
None (Low)     .03     .94    -
30% Low        -.38    .71    -.41
None (High)    .04     .95    -
30% High       .49     .75    +.45

Unchanged Vectors Only (n = 7,000)
Aberrance      Mean    SD     Mean change
None (Low)     .03     .94    -
30% Low        .17     .96    +.14
None (High)    .03     .94    -
30% High       -.14    .95    -.17
Note. Mean change represents the change in the mean theta value for the response vectors in the aberrant data set relative to the mean theta value in the data set with no aberrance. The comparisons for the changed (unchanged) vectors in the aberrant data sets use the same 3,000 (7,000) response vectors from the aberrance-free data set (i.e., the same vectors before the aberrance manipulation). The set of 3,000 (7,000) vectors is different for the spuriously low and spuriously high manipulations.

APPENDIX C
Change in Item Parameters Due to 30% Aberrance

Table 5A - Change in Item Parameters Due to 30% Aberrance

Item Discrimination (a)
Aberrance      Mean    SD     Mean change
None           .84     .21    -
30% Low        .76     .15    -.08
30% High       .76     .18    -.08

Item Difficulty (b)
Aberrance      Mean    SD     Mean change
None           .23     .93    -
30% Low        .41     .89    +.18
30% High       .04     .89    -.19

Pseudo-Guessing Parameter (c)
Aberrance      Mean    SD     Mean change
None           .16     .04    -
30% Low        .17     .05    +.01
30% High       .16     .04    +.00

Note. Mean change represents the change in the mean item parameter value in the aberrant data set relative to the mean item parameter value in the data set with no aberrance. The comparisons are based on the total data set (n = 10,000).

LIST OF REFERENCES

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan Publishing Company.

Birenbaum, M. (1986). Effect of dissimulation motivation and anxiety on response pattern appropriateness measures. Applied Psychological Measurement, 10, 167-174.

Birenbaum, M. (1985). Comparing the effectiveness of several IRT based appropriateness measures in detecting unusual response patterns. Educational and Psychological Measurement, 45, 523-534.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley Publishing Company.

Cortina, J. M. (1994). On the meaning and measurement of test appropriateness. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Donlon, T. F., & Fischer, F. E. (1968). An index of an individual's agreement with group-determined item difficulties. Educational and Psychological Measurement, 28, 105-113.

Drasgow, F. (1982a). Choice of test model for appropriateness measurement. Applied Psychological Measurement, 6, 297-308.

Drasgow, F. (1982b). Biased test items and differential validity. Psychological Bulletin, 92, 526-531.

Drasgow, F., & Guertler, E. (1987). A decision-theoretic approach to the use of appropriateness measurement for detecting invalid test and scale scores. Journal of Applied Psychology, 72, 10-18.

Drasgow, F., & Levine, M. V. (1986). Optimal detection of certain forms of inappropriate test scores. Applied Psychological Measurement, 10, 59-67.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Frederiksen, N. (1977). How to tell if a test measures the same thing in different cultures. In Y. H. Poortinga (Ed.), Basic Problems in Cross-Cultural Psychology (pp. 14- ). Amsterdam: Swets & Zeitlinger, B.V.

Harnisch, D. L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20, 191-206.
Harnisch, D. L., & Linn, R. L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18, 133-146.

Harnisch, D. L., & Tatsuoka, K. K. (1983). A comparison of appropriateness indices based on item response theory. In R. K. Hambleton (Ed.), Applications of Item Response Theory (pp. 104-122). Vancouver: Educational Research Institute of British Columbia.

Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology, 75, 581-595.

Johanson, G. A. (1992). IRTDATA: An interactive or batch Pascal program for generating logistic item response data. Applied Psychological Measurement, 16, 52.

Levine, M. V., & Drasgow, F. (1988). Optimal appropriateness indices. Psychometrika, 53, 161-176.

Levine, M. V., & Drasgow, F. (1983). Appropriateness measurement: Validating studies and variable ability models. In D. J. Weiss (Ed.), New Horizons in Testing (pp. 109-131). New York: Academic Press.

Levine, M. V., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42-56.

Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple choice test scores. Journal of Educational Statistics, 4, 269-290.

Mislevy, R. J., & Bock, R. D. (1990). Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.

Noonan, B. W., Boss, M. W., & Gessaroli, M. E. (1992). The effect of test length and IRT model on the distribution and stability of three appropriateness indexes. Applied Psychological Measurement, 16, 345-352.

Parsons, C. K. (1983). The identification of people for whom Job Descriptive Index scores are inappropriate. Organizational Behavior and Human Performance, 31, 365-393.

Peters, T. (1995). BIDEV: Introducing aberrance in normal response vectors [Computer program]. East Lansing, MI: Michigan State University, Applications Programming Department.

Peters, T. (1994). XTRACT: Converting data from BILOG for use in calculating appropriateness statistics [Computer program]. East Lansing, MI: Michigan State University, Applications Programming Department.

Peters, T. (1993). LZCALC: Computing appropriateness indices on data sets [Computer program]. East Lansing, MI: Michigan State University, Applications Programming Department.

Reilly, R. R., & Chao, G. T. (1982). Validity and fairness of some alternative employee selection procedures. Personnel Psychology, 35, 1-62.

Reise, S. P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19, 213-229.

Reise, S. P., & Due, A. M. (1991). The influence of test characteristics on the detection of aberrant response patterns. Applied Psychological Measurement, 15, 217-226.

Rudner, L. M. (1983). Individual assessment accuracy. Journal of Educational Measurement, 20, 207-219.

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.

Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43, 627-670.

Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences associated with different measures of some common job-relevant constructs. In C. L. Cooper & I. T. Robertson (Eds.), International Review of Industrial and Organizational Psychology (Vol. 11, pp. 115-139). Sussex, England: John Wiley & Sons, Ltd.

Schmitt, N., Clause, C. S., Whitney, D. J., Futch, C. J., & Pulakos, E. D. (1994). Appropriateness fit, social desirability, carelessness, and test validity. Unpublished manuscript, Michigan State University, East Lansing, MI.

Schmitt, N., Cortina, J. M., & Whitney, D. J. (1993). Appropriateness fit and criterion-related validity. Applied Psychological Measurement, 18, 143-150.

Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Metaanalyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.

Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika, 49, 95-110.

Tatsuoka, K. K., & Linn, R. L. (1983). Indices for detecting unusual patterns: Links between two general approaches and potential applications. Applied Psychological Measurement, 7, 81-96.

Tatsuoka, K. K., & Tatsuoka, M. M. (1982). Detection of aberrant response patterns and their effect on dimensionality. Journal of Educational Statistics, 7, 215-231.

van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13, 267-298.

van der Flier, H. (1977). Environmental factors and deviant response patterns. In Y. H. Poortinga (Ed.), Basic Problems in Cross-Cultural Psychology (pp. 30-35). Amsterdam: Swets & Zeitlinger, B.V.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.