BUILDING A BETTER TEST WITH CONFIDENCE (TESTING)

By

Paul G. Curran

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Psychology

2012

ABSTRACT

BUILDING A BETTER TEST WITH CONFIDENCE (TESTING)

By Paul G. Curran

Traditional methods of collecting information from test-takers on multiple choice tests most often involve the dichotomization of individuals into groups based on a simple correct/incorrect criterion. In doing so, information that would further differentiate individuals on ability is lost. This paper shows, through a number of simulations and human subject collections, that this information can be collected and utilized in order to make tests generally more efficient.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
    Attempts at Extra Information
        Background: item response theory
        The nominal response model
        Response-time modeled item response theory
        Confidence testing
            Items with two responses (true/false)
                Likert-type confidence scale methods
                Point allocation confidence methods
            Items with three or more responses
                Eliminate believed incorrect response choices
                Choose all responses believed correct
                Likert-type confidence scale method
                Multiple correct answers to each item
                Point allocation confidence methods
                Degree of confidence (some derivative of percentile)
    Integrated General Form of Confidence Testing Model
    Pilot Hypotheses
CHAPTER 2: PILOT STUDIES
    Pilot Simulation #1 – Comparison of Different Confidence Testing Methods
    Pilot Simulation #2 – Introduction of Error; Comparison to Traditional Testing
    Pilot Simulation #3 – Effect of Test Length
    Pilot Simulation #4 – Effect of Test Difficulty
    Pilot Simulation #5 – Effect of Test Discrimination
    Conclusions of Simulated Pilot Data
    Experimental Hypotheses
    Pilot #6 – Obtaining Items and Item Characteristics
CHAPTER 3: EXPERIMENTAL METHOD
    Participants
    Measures
        Analogy test
        Manipulation check – measure of understanding
        Trait generalized anxiety
        Trait test anxiety
        Trait risk-taking
        Trait cautiousness
        Trait test specific self-efficacy
    Procedure
    Analysis
CHAPTER 4: DISCUSSION
    Practical Implications
    Limitations and Future Research
APPENDICES
    Appendix A – Test Items Selected From Pilot #6
    Appendix B – Manipulation Check
    Appendix C – Generalized Anxiety Items
    Appendix D – Test Anxiety Items from Taylor and Deane (2002)
    Appendix E – Risk Taking Items
    Appendix F – Cautiousness Items
    Appendix G – Self-Efficacy Items
REFERENCES

LIST OF TABLES

Table 1 – A Table of Responses and Weights
Table 2 – A Table of Responses and Integers
Table 3 – Results of Pilot Data Simulation #1
Table 4 – Results of Pilot Data Simulation #2
Table 5 – Test reliability as a function of test length
Table 6 – Effect of test difficulty on correlation with ‘True Score’ for 60-item test
Table 7 – Effect of test discrimination on correlation with ‘True Score’ for 60-item test
Table 8 – Reliability and Validity Across Conditions for 60 Item Test
Table 9 – Reliability by Test Length and Condition
Table 10 – Correlations of Individual Differences with CT Use (5-point condition)
Table 11 – Correlations of Individual Differences with CT Use (Generalized condition)
Table 12 – Correlations of Individual Differences with CT Use, Controlling for Overall Score (5-Point Condition)
Table 13 – Correlations of Individual Differences with CT Use, Controlling for Overall Score (Generalized Condition)
Table 14 – Correlations of Individual Differences with CT Use, Controlling for Overall Score (Both Confidence Conditions)

LIST OF FIGURES

Figure 1 – The Item Characteristic Curve
Figure 2 – Item Characteristic Curve for the Nominal Response Model
Figure 3 – Hierarchical Framework for Modeling Response Time and Response
Figure 4 – Test reliability as a function of test length
Figure 5 – Benefit of confidence testing as a function of test length
Figure 6 – Test reliability as a function of test length and difficulty
Figure 7 – Benefits of confidence testing as a product of test length and difficulty
Figure 8 – Test reliability as a function of test length and discrimination
Figure 9 – Benefits of confidence testing as a product of test length and discrimination
Figure 10 – Distribution of confidence scores in 5-point confidence condition
Figure 11 – Distribution of confidence scores in generalized confidence condition
Figure 12 – Reliability by Test Length and Condition
Figure 13 – Reliability Benefits of Confidence Testing Relative to Control
Figure 14 – Validity (College GPA) by Length
Figure 15 – Validity (ACT score) by Length
Figure 16 – Average Time of Test by Length of Test, by Condition
Figure 17 – Comparison of Item Discrimination by Test and Item Difficulty
Figure 18 – Reliability by Test Difficulty
Figure 19 – Reliability by Test Discrimination
Figure 20 – Reliability by Discrimination (Easier Items)
Figure 21 – Reliability by Discrimination (More Difficult Items)
Figure 22 – Reliability by difficulty on only the 20 easiest items

Introduction

Think back to the last time you were administered a multiple choice ability test.
Unless you were an expert or wholly incompetent in the content area of the test, you likely engaged in some form of strategic guessing. It is probable that on at least one item you found yourself able to eliminate one or more of the responses, leaving yourself with two or more answers that you found to be appealing choices. One of those responses happens to be the correct answer, and one of two things occurs: 1) you choose the correct answer and are given a score of correct even though you still engaged in what was at root guessing; you are now indistinguishable on this item from an expert who knew the answer readily, or 2) you choose the incorrect answer and are given a score of incorrect even though you had some knowledge about the item; you are now indistinguishable on this item from someone who knew nothing of the topic.

Historically, tests have functioned through precisely this mechanism. Standardized tests have evolved for thousands of years, from ancient China to the modern day (Wainer, 2000). The most common methods of testing rely on assigning individuals a score of correct or incorrect on each individual test item. At the same time, the distribution of ability is believed to be better represented as a continuum, such as the number-correct score or theta value gained from a sufficiently long test (Lord, 1980).

In modern times there have been a number of attempts to break this habit of dichotomous correct/incorrect collection. Two current examples of note both rely on an item response theory (IRT) framework: the nominal response model (Bock, 1972; Samejima, 1979) and response-time modeled IRT (van der Linden, 2006). Both of these models operate on the principle that information beyond correct/incorrect distinctions exists and can be collected on any given multiple choice item. One notable flaw of both is that they are exceptionally complex, in both theory and practical use.

A third example of attempts to collect and utilize information beyond correct/incorrect data finds its roots as early as the 1930s. In a 1932 paper, Kate Hevner described a simple addition to a true/false test which produced results both more reliable and more valid than the true/false test given normally. This finding spawned the idea of confidence testing, which persisted for almost 50 years. Confidence testing operated on a simple idea: the level of confidence that an individual had – or was willing to place – in a response could be used to more finely differentiate individuals on ability. Through this method, individuals could be given partial credit on a multiple choice test item instead of traditional correct/incorrect scores.

This method flourished for a time, but over the years more and more inconsistent results began to emerge. The benefits of confidence testing appeared almost entirely variable, and without good explanation the research and method slowly faded away. At its close, confidence testing had become almost unrecognizable from the simple technique put forth by Kate Hevner some half century before. Numerous methods and scoring rules had left the field fragmented, with each subset researching and promoting its own version of the method.
The purpose of this paper is to show that 1) the inconsistent results of confidence testing were related to directly measurable test characteristics, 2) all of the different models of confidence testing are in fact specific cases of a more generalized and translatable model, and 3) confidence information can be collected from test-takers in order to improve accuracy of measurement in multiple choice testing beyond simple correct/incorrect methods.

Attempts at Extra Information

As stated above, there are, and have been, a number of attempts in the last century to collect information above and beyond correct/incorrect at the item level on multiple choice tests. This idea is not unique to multiple choice items; this problem on other types of items has many clear and intuitive everyday solutions. For example, anyone who has taught a class and administered a test with open-ended items (e.g. an essay exam or multi-step mathematical problem) knows that students will put in every effort to ensure that partial credit is given on each of those items.

Partial credit on open-ended items is in fact quite easy from a mathematical perspective – the practical complexities arise instead in the time and effort required to manually score each item, for each person. Large-scale administrations can be so expensive or time-consuming as not to be viable. Multiple-choice items are the solution to this time and effort intensive process; they can be administered to enormous populations and scored with relative ease. The cost comes in the failure to retain the extra information which would have been present if an open-ended item had been asked.

While prior attempts have all met with their own range of problems, there is potential for much to be gained if an adequate solution can be found. Collecting extra information on multiple choice items has been shown to lead to improvements in test reliability and validity in some cases (e.g. Ebel, 1965; Hevner, 1932; Soderquist, 1936). If these improvements were shown to be consistent, this would give the test administrator better precision in estimates of ability, allowing for improved use of tests as selection, job placement, or even class placement tools. An additional and often overlooked benefit is that many of these methods stand not to eliminate guessing, but to allow a valid channel for test-takers to spread their choice across multiple answers. In other words, it gives unsure individuals the option to not engage in guessing at all. Guessing is the recourse of a test-taker who is forced to choose one answer over another.

Before looking ahead to possible solutions to this problem it is important to look back at current and prior models which have already attempted this undertaking. The three most notable are those already mentioned. There are two which require the use of IRT: the nominal response model and response-time modeled IRT, and one which does not: confidence testing.

Background: item response theory.

A brief overview of item response theory is necessary before a discussion of either the nominal response model or response-time modeled IRT. Put concisely: “Item response theory (IRT) is a general framework for specifying mathematical functions that describe the interactions of persons and test items” (Reckase, 2009, p v). These mathematical forms can vary in many ways; it is through the addition, removal, or constraint of parameters that different models are specified.
All (unidimensional) IRT models fit the general idea that the probability of a test-taker’s response on an item is a function of a person parameter (θ) and some number of item parameters (e.g. a, b, c, etc.). Using this information an item characteristic curve (ICC) can be plotted which gives the probability of correct response (for dichotomous items) as a function of ability. This curve will appear slightly different dependent on the number of item parameters used, but the general form is that of a logistic ‘S’ curve, as seen in figure 1.

Consistent in all models, the person parameter (θ) is simply the trait – often ability in some form – to be measured by the test (Lord, 1980). It is a continuous variable, and the scaling of θ itself is arbitrary. Due to this, θ is commonly scaled to have a mean of zero and a standard deviation of one. The variable θ is the x-axis of the item characteristic curve.

One-parameter IRT models (Lord, 1980) and the Rasch model (Rasch, 1960) use, in addition to this person parameter, only one item parameter. This parameter (b) denotes the difficulty of each item and is in reference to the scaling of the θ-parameter. The point of inflexion of the item characteristic curve occurs when θ = b (in figure 1, zero). Keeping the scaling of θ constant, the b-parameter thus shifts the position of the item characteristic curve laterally, without changing the shape of the curve itself. The larger the b-parameter, the harder the item will be.

Two-parameter IRT models (Lord, 1980) incorporate, in addition to the b-parameter, another parameter (a) to account for the difference in discrimination between different items. In one-parameter models each curve has the same slope at the point of inflexion; the a-parameter allows for this slope to vary. Specifically: “Parameter a is proportional to the slope of the curve at the inflexion point [this slope actually is .425*a*(1-c)]” (Lord, 1980, p 13). The c-parameter will be discussed next, but in two-parameter models is set to zero, reducing the equation to: slope = .425 * a. As the a-parameter gets larger the curve becomes steeper, reflecting increased discriminability of the item.

Three-parameter IRT models (Lord, 1980) have become some of the most common in use today. In addition to the a and b-parameters, a third parameter (c) is included to account for guessing. The c-parameter can be thought of as the “probability that a person completely lacking in ability (θ = -∞) will answer the item correctly” (Lord, 1980, p 12). The c-parameter is not relevant for items in which the answer cannot be obtained by guessing, as in these cases it drops out of the equation by falling to zero. The c-parameter changes the item characteristic curve by shifting the lower asymptote. In one and two-parameter models the lower bound of the model is 0; it is impossible to have less than a zero percent chance of answering correctly. In the three-parameter model this lower bound is simply set to c instead of zero.

These four parameters (θ, a, b, and c) form the basis of the majority of IRT models. With this background it is now possible to examine more complex models which incorporate additional parameters.
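For concreteness, the short sketch below (in Python, with invented example values) computes the three-parameter logistic probability of a correct response; the scaling constant D = 1.7 is the conventional choice that yields the inflection-point slope of .425*a*(1-c) quoted from Lord (1980), and setting c to zero (and, further, fixing a) recovers the two- and one-parameter cases.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC: probability of a correct response.

    theta: person ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing). With D = 1.7 the slope at the
    inflection point is .425 * a * (1 - c), as noted in the text.
    Setting c = 0 gives the two-parameter model; fixing a as well gives
    the one-parameter (Rasch-type) case.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Illustrative values only: a moderately discriminating, average-difficulty
# item with a .20 guessing floor (roughly a five-option item).
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a=1.0, b=0.0, c=0.20), 3))
```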
The nominal response model.

“To use the binary models, the data have been dichotomized (correct and incorrect) and the distinct identity of the incorrect alternatives has been lost.” (Thissen & Steinberg, 1984, p 501)

The nominal response model was proposed by Bock (1972), extended by Samejima (1979), and detailed by Thissen and Steinberg (1984). The concept behind the nominal response model is that multiple-choice items can be scored for each response choice in order to recapture information that would otherwise be lost to correct/incorrect dichotomization. It was created in order to regain lost information from incorrect answer choices, the idea being that which incorrect answer is chosen carries information relating to the ability of the test-taker.

The nominal response model is different from more conventional IRT models in that item parameters exist (and are estimated) at the level of the response option. In this way, the number of parameters needed for any given item is dependent on the number of response options for that item (k). Instead of producing only a probability of correct response, the nominal response model produces a probability for each response. In this way the item characteristic curve for every item contains multiple curves, equal in number to the number of response options. In the ideal case the correct response produces a logistic curve, while the other curves are more similar to normal curves. A general item characteristic curve produced by the nominal response model (with k = 4) can be found in figure 2.

The item parameters in use in the nominal response model are: 1) a vector of parameters defining the discrimination of the response options and thus the slope of the curves (a-parameters), 2) a vector of parameters detailing the position of those curves, and 3) a vector of parameters accounting for zero-ability guessing (Thissen & Steinberg, 1984). The first vector is very comparable to conventional a-parameters. The second vector of parameters is comparable to the standard difficulty parameters, as it shifts the curves for each response relative to θ. The nomenclature of these parameters varies, but for the purposes of this paper the terminology of γ-parameters will be adopted. The final vector is used to place individuals who are guessing randomly due to complete lack of ability, and the nomenclature of δ-parameters will be used here. Samejima (1979) originally set this to a vector of constants, but Thissen and Steinberg (1984) make the case that these should instead be estimated.

Unlike conventional IRT models, the nominal response model also requires a number of constraints on these parameters in order to become identifiable. These constraints are on the a and γ-parameters, in that each set must sum to zero over the different response options within each item (Thissen & Steinberg, 1984). In all, this produces a number of free parameters which must be estimated for each item. Thissen and Steinberg (1984, p 503) outline that “the set of free parameters for an item consists of mj ak’s, … mj γk’s, … and (mj – 1) δk’s, … for a total of 3mj – 1 parameters: 11 for a four-alternative multiple-choice item.” [In this terminology, mj = k = number of response options.] For a five-alternative multiple-choice item, 14 parameters need to be estimated, for every item on the test.
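As a rough illustration of how the model assigns a probability to every response option, the sketch below implements Bock’s (1972) form of the model under the sum-to-zero constraints just described; the δ guessing parameters of Samejima (1979) and Thissen and Steinberg (1984) are omitted for brevity, and the item parameter values are invented for illustration rather than estimated from data.

```python
import math

def nominal_response_probs(theta, a, gamma):
    """Bock's (1972) nominal response model, without the guessing (delta)
    extension: P(option k | theta) is proportional to exp(a_k*theta + gamma_k).

    a and gamma are per-response-option vectors; under the identification
    constraints described in the text, each vector sums to zero.
    """
    assert len(a) == len(gamma)
    logits = [a_k * theta + g_k for a_k, g_k in zip(a, gamma)]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented parameters for a four-option item; option 0 is keyed correct,
# so it has the largest slope. Each vector sums to zero.
a     = [1.2, -0.2, -0.4, -0.6]
gamma = [0.5,  0.3, -0.2, -0.6]
for theta in (-2.0, 0.0, 2.0):
    print(theta, [round(p, 2) for p in nominal_response_probs(theta, a, gamma)])
```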
The nominal response model has a number of positive characteristics. It was designed to recapture information lost to correct/incorrect dichotomization, and it does. It allows for the fairly powerful ability to model each individual choice on a multiple-choice test item. This power does not come free, however. Put best by Thissen and Steinberg (1984, p 517, italics added): “The model and its fitting procedures are complex; its use is for ‘serious testing,’ not classroom exams. We have used sample sizes between one and two thousand examinees to estimate the parameters of the model for tests ranging in length from four to 35 items.”

The nominal response model has not obtained widespread use because it was not designed for widespread use. This is a point which will be echoed further in this paper: current attempts at collection and utilization of information beyond simple correct/incorrect dichotomization exist at a level of complexity that puts them out of reach of the majority of test administrators. In a selection context this fact alone puts this technique outside the reach of the majority of small organizations.

Further, the nominal response model does not improve discrimination among those who have chosen the correct response, only among those who choose different incorrect responses. If individuals are predominantly answering correctly, there is little left for the model to capitalize on. Tests must also be constructed to have viable and distinct distracter choices, increasing the effort necessary in the test construction phase. If all incorrect options are similar there is nothing gained from learning which one of them an individual has chosen. The effort required to properly utilize the nominal response model is therefore greater at almost all stages of the testing experience, save for administration.

Response-time modeled item response theory.

Response-time modeled IRT is based on the idea that the time a test-taker uses to complete an item is a valuable source of extra information for estimating both the ability of the test-taker and the characteristics of the test (van der Linden, 2008). “At the level of a fixed person the basic assumption is that the person operates at constant ability and speed. His or her choice of ability and speed level is not free but constrained by a speed-accuracy trade off.” (van der Linden, 2006, p 183). In this way the individual can only operate on a test as quickly as their individual ability allows – those with higher ability can achieve the same level of accuracy as a lower-ability individual, but more quickly.

For the purposes of this paper, an attempt will be made to simplify the description of response-time modeled IRT. Where a conceptual description will suffice it will be favored over one which would be more mathematical. For more in-depth mathematics on response-time modeled IRT the author suggests: van der Linden (2006), van der Linden (2007), van der Linden (2008), and van der Linden, Entink, and Fox (2010).

The concept of modeling response time in an IRT model was first proposed in the late 1970s and early 1980s, though recent papers suggest that it has only recently become viable due to the now ubiquitous nature of computerized testing (van der Linden, 2006). Early work on this method was done by Thissen (1983), Scheiblechner (1985), and Roskam (1987), among others, and the general form was to include additional parameters to estimate relating to test-taker speed of response.

There are a number of papers dealing solely with the estimation of individual response time on a test, the first step to incorporating this time into a larger model. Wim van der Linden (2006) has laid out a (relatively) straightforward structure for modeling the pattern of the log of response time, which includes three parameters very comparable to those found in the two-parameter logistic model. The three new parameters used by van der Linden (2006) are: α, β, and τ.
The τ-parameter is a person parameter, and represents the speed of the test-taker’s responses over the test. It is somewhat similar to θ in other models, at least in use. The β-parameter is an item parameter and represents the ‘time intensity’ or ‘time consumingness’ of a particular item. The relationship between β and τ can be thought of as comparable to the relationship between the θ and b-parameters in the two-parameter logistic model; they are scaled to each other in such a way that the difference between them is what is of actual importance. The α-parameter can be conceptualized as a type of discrimination parameter. It is defined as “the reciprocal of the standard deviation of the normal distribution” and it is stated that “A larger value for α means less dispersion for the log response time distribution on item i for the persons, and hence, better discrimination by the item between distributions of persons with different levels of speed” (van der Linden, 2006, p 185).

These parameters can be measured, estimated, and modeled on their own in order to test the fit of this model. This was the impetus of van der Linden (2006), who showed that these parameters fit to a lognormal model of response time with high precision.
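To make the roles of α, β, and τ concrete, the sketch below evaluates the log-density of an observed response time under a lognormal form in the spirit of van der Linden (2006), in which the log of response time is treated as normal with mean β − τ and standard deviation 1/α; the numerical values are invented, and the sketch deliberately ignores the larger hierarchical model discussed next.

```python
import math

def log_density_response_time(t, tau, alpha, beta):
    """Log-density of response time t under a lognormal model in the spirit of
    van der Linden (2006): ln(t) ~ Normal(mean = beta - tau, sd = 1/alpha).

    tau: person speed; beta: item time intensity; alpha: item discrimination
    for speed (larger alpha means less dispersion in log response time).
    """
    z = alpha * (math.log(t) - (beta - tau))
    return math.log(alpha) - math.log(t) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z

# Invented values: a relatively fast person (tau = 0.5) on a time-intensive
# item (beta = 4.0; median response time about exp(3.5) ≈ 33 seconds).
for t in (15.0, 33.0, 90.0):
    print(t, round(log_density_response_time(t, tau=0.5, alpha=2.0, beta=4.0), 3))
```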
The next step from this modeling of response time is to model it alongside other parameters of the test and individual. A conceptual hierarchical framework is suggested by van der Linden (2007), which is reprinted here as figure 3. In this figure, Uij represents the response vector for the test and Tij represents the response-time vector for the test. Each item has a, b, and c-parameters consistent with the three-parameter logistic model, with the addition of α and β-parameters as discussed. Each person has θ and τ-parameters. The simultaneous estimation of these parameters has been shown to be practically possible, and an empirical study with different subsamples of a sample of 30,000 individuals confirmed that the addition of response time into the model “tends to improve [parameter estimates’] accuracy” (van der Linden et al., 2010, p 344). Simulation work has also shown that the average improvement in estimation tends to range from 5-20% when the correlation between θ and τ is in the range of .5 to .75 (van der Linden et al., 2010). Prior simulation work has also shown ‘substantial’ improvement in estimation of ability specifically for the purposes of item selection in computerized adaptive tests (van der Linden, 2008). Work will no doubt continue on the refinement (e.g. van der Linden, 2010; a framework for standardizing the scaling of response-time parameters) and the use of this technique in practical settings.

There are a number of benefits to the utilization of response time in a testing environment. Because this extra information is linked to a behavioral action such as response time, it is very hard to cheat. As long as a test-taker is attempting to answer the question properly (i.e. not just quickly guessing), the time collected is by definition the minimal time it should take for the test-taker to answer. Additionally, at a very high level this model makes sense in a very simple way: the less you know about something, the longer it will take you to recall information about it in order to respond.

At the same time, there are a number of potential downsides that might limit wide-scale use. Just as proponents of the nominal response model claim it is only for serious applications, so too is it likely that the complexity of response-time modeled IRT, and the demands of sample size required by the estimation, may keep it out of the hands of all but the most skilled and large-scale psychometricians. Response-time modeled IRT is not designed to work on a fifth-grade weekly reading quiz, or a custom test designed for employee selection at a small, specialized organization.

While it is almost impossible to cheat such a model, it is prone to faking, if only in the incorrect direction. For example, test-takers who get distracted or lose interest may spend more time than necessary on particular items, shifting estimates of their ability. Factors such as test anxiety might also cause test-takers to go slower on a test (Sarason & Stoops, 1978). This test anxiety may even be exacerbated by full disclosure to individuals that their speed on the test will factor into the estimate of their ability.

Response-time modeled IRT is also – by design – limited to the realm of computer adaptive-like tests. Test-takers must have a response time for each item, and so each item must be presented alone. Test-takers cannot return to items already completed (without a large change in test design). Capturing response time on each item without the aid of a computer, while not impossible, is practically infeasible. This again limits the tests and situations in which this method can be used. Overall, response-time modeled IRT can be considered a technique that is still in the process of evolution. It is hard to say yet how successful it is or will be, but many factors seem to limit – at the very least – the places where it can be applied.

Confidence testing.

The two models discussed above are relatively modern, and are each predated by a much older field of research which also sought to find methods by which to utilize information beyond a simple correct/incorrect measurement. This field is that of confidence testing, which predates not only the nominal response model and response-time modeled IRT, but item response theory itself.

In the words of Robert Ebel (1965): “In general terms, the examinee is asked to indicate not only what he believes to be the correct answer to a question, but also how certain he is of the correctness of his answer. When his answers are scored he receives more credit for a correct answer given confidently than for one given diffidently. But the penalty for an incorrect answer given confidently is heavy enough to discourage unwarranted pretense of confidence.” (p 49).

Confidence (and the very similar concept of subjective probability) has the benefit of being a long-studied area of psychological processes (e.g. Adams, 1961; Baranski & Petrusic, 1998; Erev, Wallsten, & Budescu, 1994; Koriat, Lichtenstein, & Fischhoff, 1980; McKenzie, Wixted, Noelle, & Gyurjyan, 2001; and Pleskac & Busemeyer, 2010). Individuals have been shown to be capable of making fairly accurate confidence judgments about their actions in decision tasks (Pleskac & Busemeyer, 2010), and it is through this ability of individuals to be self-aware of their own limitations that confidence testing emerged. The premise behind confidence testing is that individuals have the capability to report reliably the confidence they have in the answers they give on any given test. The higher the confidence that an individual has in their answer (assuming it is correct), the higher their ability level should be estimated.
Theoretically, confidence in a multiple-choice item is constrained across each of the possible answers. The simplest case is that of only two response options, the most common example being a true/false item. This is precisely where confidence testing made its debut. Over time it was extended to items with greater than two response options. Many different methods were built in order to practically implement the constraint between response options, producing a number of different collection methods in use for both items with two response options and items with three or more response options. The overview which follows is broken into two main sections: first, examining those models built for true/false and two-response items, and second, examining those models built for items with three or more response options.

Items with two responses (true/false).

Likert-type confidence scale methods.

The first published example of any method of using confidence to provide increased information to test scores was a method described by Kate Hevner in 1932. On her test, Hevner (1932) had test-takers choose one of two responses to each item (in this case true or false). For each item, test-takers also recorded a degree of confidence in that choice, on a scale of one to three. This therefore falls into the category of Likert-type confidence scale collection methods.

Hevner (1932) presented the results of two studies on two different samples. The first sample took only one test, and the second took two tests. These studies tested a number of scoring methods for the data collected from these individuals, including a penalty-based scoring of correct minus incorrect similar to a method tested again only a few years later by Soderquist (1936).

The first sample took one test on music, and the best scoring method was a simple weighting of the correct answers based on the confidence ratings. This method produced a Spearman-Brown reliability of .83, compared to .67 for a traditional number-correct score. Validity was also tested with correlations to two similar tests. The simple confidence-weighted score provided a boost from r = .41 to r = .53 on the first comparison and r = .50 to r = .70 on the second.

Hevner’s (1932) second sample took two tests using the confidence method. On these tests the boost to Spearman-Brown reliability was from .56 to .70 on the first test and from .66 to .80 on the second. Validity was only tested on the first test (though again with two measures) and was boosted from r = .33 to r = .44 and from r = .48 to r = .57.
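The best-performing rule Hevner reports, a simple weighting of correct answers by the stated confidence with no penalty, can be sketched as follows; treating the 1-3 confidence rating itself as the weight for a correct answer (and zero for an incorrect one) is an assumption for illustration, not a detail reported in the original.

```python
def hevner_item_score(chosen, keyed, confidence):
    """Confidence-weighted true/false scoring in the spirit of Hevner (1932):
    a correct answer earns its stated confidence (1-3); an incorrect answer
    earns nothing (the no-penalty rule that performed best).

    chosen / keyed: 'T' or 'F'; confidence: 1, 2, or 3.
    """
    assert confidence in (1, 2, 3)
    return confidence if chosen == keyed else 0

# Illustrative responses to a three-item true/false test: (chosen, keyed, confidence).
responses = [('T', 'T', 3), ('F', 'T', 2), ('F', 'F', 1)]
print(sum(hevner_item_score(*r) for r in responses))  # 3 + 0 + 1 = 4
```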
This 1932 study put forth the basis for a very simple methodology shown to have notable benefits to both reliability and validity. This method was simple enough to be used 80 years ago without any computer aid in administration or scoring. While no formal test-taker perceptions were collected, it was also noted by Hevner (1932) that: “Informal observation among the subjects indicates that the opportunity to express a degree of confidence is a welcome addition to the test, especially when the feeling is one of insecurity.” (p 362)

Hevner’s (1932) general methods (that is, a Likert-type confidence scale on a true-false test) remained without further study until a report of different tests of the method by Robert Ebel in 1965. Ebel (1965) tested this method on a test containing true/false items. Instead of answering true or false on each item, test-takers were provided with five response options ranging from ‘This statement is probably true’ to ‘This statement is probably false.’ Each different choice had the chance of producing different score values based on whether the test-taker was on the correct or incorrect side of indifference. Correct ‘probably true/false’ answers earned test-takers 2 points, but at the risk of 2 points deducted if incorrect. Correct ‘possibly true/false’ answers earned test-takers 1 point, at the risk of 0 points deducted if incorrect. Admitting ignorance and answering ‘I have no basis for response’ earned the test-taker half a point.

In all, Ebel (1965) reported on seven different studies using this method. Three early studies all produced gains in reliability: the first had a gain from .574 with standard testing to .713 with confidence testing, the second from .765 to .828, and the third from .728 to .821. Ebel (1965) computed an ‘improvement factor’ on these tests based on the Spearman-Brown formula to predict how much a traditional test would need to be lengthened in order to achieve the same reliability as the confidence-weighted tests. This ‘improvement factor’ was 1.84, 1.48, and 1.72, respectively.

Following these results Ebel (1965) next reported three other studies conducted on introductory classes in educational measurement from Oct. 1963 to June 1964. Contrary to prior results, these gains to reliability were almost non-existent (.824 traditional, .848 confidence; .790 traditional, .790 confidence; and .8489 traditional, .858 confidence).

Finally, Ebel (1965) attempted another method in July 1964 in which the scoring system was modified based on the idea that the majority of score variance was coming from items answered with the most confidence. The score modification was an attempt to get students to answer only those items in which they had the most confidence, and consisted of a scoring system of: Score = 50 + Right – 2*Wrong. As explained by Ebel (1965): “The uniform base score of 50 represented the expected score from blind guessing and was intended to counteract the naïve feeling that an omitted item necessarily lowers the students score. The R-2*W part of the formula was intended to encourage a student to omit an item unless he felt the odds favoring a correct answer were better than 2 to 1.” (p 52)

In the end, reliability for this method of confidence weighting produced ‘disastrous’ results; the confidence test had a reliability of .611 against the traditional test’s reliability of .837. Ebel (1965) surmised that the risk of full confidence reduced the spread of scores, producing situations where individuals underrated their own confidence, reducing reliability.
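A sketch of the scoring rules just described for Ebel’s (1965) five-option format is given below; the option labels used as inputs are my own shorthand, not labels from the original study.

```python
def ebel_item_score(response, keyed):
    """Scoring in the spirit of Ebel's (1965) five-option true/false format.

    response: one of 'probably_true', 'possibly_true', 'no_basis',
              'possibly_false', 'probably_false' (labels are my shorthand).
    keyed: 'T' or 'F', the keyed truth value of the statement.
    """
    if response == 'no_basis':          # admitted ignorance: a flat half point
        return 0.5
    side = 'T' if response.endswith('true') else 'F'
    strong = response.startswith('probably')
    if side == keyed:
        return 2.0 if strong else 1.0   # confident correct answers pay more
    return -2.0 if strong else 0.0      # only confident errors are penalized

print(ebel_item_score('probably_true', 'T'))   # 2.0
print(ebel_item_score('probably_true', 'F'))   # -2.0
print(ebel_item_score('possibly_false', 'T'))  # 0.0
```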
Point allocation confidence methods.

Only a handful of years after the work of Hevner (1932), Harold Soderquist proposed a variant of confidence testing on true/false tests. Soderquist’s (1936) method had test-takers choose one of the two responses (true/false) and then allocate points based on their confidence in that response. Test-takers could claim 2, 3, or 4 points on the question with the knowledge that incorrect answers would be deducted twice the points claimed. Claiming 1 point in this case collapsed to traditional scoring and allowed individuals a way to not partake in the experimental condition.

The use of a penalty is similar to one of the scoring mechanisms of Hevner (1932), though this was not Hevner’s best method (the case with no penalty was). This method may also strike the reader as very similar to the method used by Ebel (1965), and this inherent similarity will be discussed in detail later in the paper when the equations linking each of these methods are derived.

Soderquist’s (1936) experimental design tested the difference between scoring the same test either with simple correct minus incorrect or with scores computed based on the points allocated. Soderquist found a boost to reliability from .72 with traditional methods to .85 with confidence methods. This was despite the fact that items scored with allocation of points had less data; participants using point allocation attempted 57.75 answers on average per paper, while participants answering normally attempted 74.3 answers on average per paper. This highlights one of the main criticisms that began to appear against confidence testing methods: they take more time and effort from the test-taker than traditional testing alone.

In all, Soderquist (1936) concluded that “The superior reliability shown by the weighted score suggests that weighting for assurance in the true-false test has the same effect as lengthening such a test.” (p 291)
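Soderquist’s (1936) rule can be sketched as follows; scoring a 1-point claim as an ordinary right-minus-wrong item is my reading of “collapsed to traditional scoring” rather than an explicit detail of the original.

```python
def soderquist_item_score(correct, points_claimed):
    """Point-allocation scoring in the spirit of Soderquist (1936).

    points_claimed: 1-4. Claims of 2-4 earn the claimed points if correct and
    lose twice the claimed points if incorrect. A claim of 1 is treated here
    as ordinary right-minus-wrong scoring (+1 / -1), which is an assumption.
    """
    assert points_claimed in (1, 2, 3, 4)
    if points_claimed == 1:
        return 1 if correct else -1
    return points_claimed if correct else -2 * points_claimed

print(soderquist_item_score(True, 4))   # 4
print(soderquist_item_score(False, 4))  # -8
print(soderquist_item_score(False, 1))  # -1
```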
Test-takers therefore had to balance points that could be gained by eliminating more responses against the points that would be lost if a correct response was eliminated. As Coombs (1953) noted, and has been discussed earlier, there is no need for correction for guessing, as it is built into the scoring system. Unfortunately, Coombs (1953) did not have a large enough sample to make claims about reliability or validity. In fact, results were not reported at all; the paper was framed instead as a proposal for this new method of confidence testing. Coombs (1953) did note that “The first time students are exposed to this method they should be given a clear account of the procedure” (p 310), and that “A small scale tryout at the University of Michigan indicated that the majority of the students were favorably inclined” (p 310). It should be noted that one of the main limitations stated in the paper has been eliminated in modern times: this technique produced increased labor in scoring tests, something that today can be easily automated with modern computers. 19 Choose all responses believed correct. Similar in a way to Coombs (1953) is one of the methods tested by Dressel and Schmid (1953), who examined a number of different methods. Instead of eliminating response options that a test-taker believes are incorrect, in Dressel and Schmid’s (1953) ‘free-choice’ method each test-taker was “…informed that each item had one correct answer, but that they should mark as many choices as needed in order to be sure that they had not omitted the correct answer” (p 578). Score for this method was calculated as four times the number of correctly marked answers minus the number of marked incorrect answers: 4*(marked correct) – (marked incorrect). Given that items on this test had five response options, a score of zero is produced for an individual who simply marks every response option as correct. Dressel and Schmid (1953) found that this method of confidence testing produced a small reliability decrement, dropping reliability from .70 on the traditional test to .67 on their ‘freechoice’ method. Likert-type confidence scale method. Dressel and Schmid (1953) also tested a method which they called ‘degree of certainty.’ This method can be viewed as an extension of Hevner (1932) to the three or more response case; the test-taker was “…directed to indicate how certain he was of the single answer he selected as the correct one by using the certainty scale: 1) Positive, 2) Fairly Certain, 3) Rational Guess, and 4) No defensible basis for choice” (Dressel & Schmid, 1953, p 579). The score for this method was based on the amount of confidence placed in that option, relative to whether or not it was the correct choice. An individual who chooses the correct answer can gain more points by placing more confidence in that choice at the risk of losing more points if confidence is placed in an incorrect answer. It is important to note that this method had 20 test-takers only rate one response option on confidence. In this way test-takers could not indicate in any way a second-choice response or rank those remaining. In terms of non-selected responses, those that the test-taker saw as blatantly incorrect were in no way differentiated from those which may have been strong contenders to be selected as correct, in lieu of what was actually chosen. 
Dressel and Schmid (1953) found a small reliability benefit of this method, increasing reliability from .70 on the conventional test to .73 on this ‘degree of certainty’ confidence method. This small difference is comparable in magnitude to Dressel and Schmid’s (1953) freechoice method, and may speak to the range of possible effects for their given test and sample characteristics. Also noteworthy in this method was the fact that individuals were able to complete on average 34.5 ‘degree of certainty’ items over the course of a half hour, very close to the average of 35.2 conventional items that could be completed in the same time. One possible explanation of this result is that the same or similar internal confidence process that is made explicit in the ‘degree of certainty’ items is occurring but unmeasured in the administration of traditional multiple-choice items. Jacobs (1971) also tested a method very similar to that of Dressel and Schmid (1953), and in fact more of a direct extension of those techniques originally used by Hevner (1932). Jacobs administered a multiple-choice test on which test-takers both selected their ‘most correct answer’ as well as their degree of confidence in their response using a three point scale: ‘guess,’ ‘fairly confident,’ and ‘very confident’. Jacobs (1971) provided an experimental test in that two different groups were used. The first group was that of a low penalty condition. These students were told that risk and reward 21 were on a roughly equivalent scale where correct items marked ‘guess/fairly confident/very confident’ would earn 1/2/3 points, respectively, and that incorrect items marked in the same way would cost 0/2/3 points, respectively. In this way the only risk-free choice was an admission of guessing, though in all other situations test-takers only had to risk what they stood to gain. The second group was that of a high penalty condition. For these students, points were earned at the same rate but penalized at 0/4/6 points instead of 0/2/3, twice the risk as could be gained in reward. In all, Jacobs (1971) found that traditional scoring and confidence scoring in the low penalty group were nearly identical in terms of reliability. Traditional scoring had a higher splithalf reliability at .89, compared to the confidence testing split-half reliability of .87. In the high penalty group confidence scoring was drastically worse than traditional testing with a split-half reliability of .39 versus the traditional test split-half reliability of .79. Multiple correct answers to each item. As is perhaps becoming evident, Dressel and Schmid tested a number of different methods in their 1953 paper. Two cases which may or may not be strictly classified as confidence testing were those which dealt with multiple-choice questions containing multiple correct answers. The first of these two examples was the ‘multiple answer test’ in which Dressel and Schmid (1953) “…informed the student that each item might have any number of correct answers, and [that] his score would consist of the number of correctly selected answers minus a correction factor for incorrectly marked answers.” (p 581). 
Likert-type confidence scale method.

Dressel and Schmid (1953) also tested a method which they called ‘degree of certainty.’ This method can be viewed as an extension of Hevner (1932) to the three-or-more-response case; the test-taker was “…directed to indicate how certain he was of the single answer he selected as the correct one by using the certainty scale: 1) Positive, 2) Fairly Certain, 3) Rational Guess, and 4) No defensible basis for choice” (Dressel & Schmid, 1953, p 579). The score for this method was based on the amount of confidence placed in that option, relative to whether or not it was the correct choice. An individual who chooses the correct answer can gain more points by placing more confidence in that choice, at the risk of losing more points if confidence is placed in an incorrect answer. It is important to note that this method had test-takers rate only one response option on confidence. In this way test-takers could not indicate in any way a second-choice response or rank those remaining. In terms of non-selected responses, those that the test-taker saw as blatantly incorrect were in no way differentiated from those which may have been strong contenders to be selected as correct.

Dressel and Schmid (1953) found a small reliability benefit of this method, increasing reliability from .70 on the conventional test to .73 on this ‘degree of certainty’ confidence method. This small difference is comparable in magnitude to Dressel and Schmid’s (1953) free-choice method, and may speak to the range of possible effects for their given test and sample characteristics. Also noteworthy in this method was the fact that individuals were able to complete on average 34.5 ‘degree of certainty’ items over the course of a half hour, very close to the average of 35.2 conventional items that could be completed in the same time. One possible explanation of this result is that the same or similar internal confidence process that is made explicit in the ‘degree of certainty’ items is occurring but unmeasured in the administration of traditional multiple-choice items.

Jacobs (1971) also tested a method very similar to that of Dressel and Schmid (1953), and in fact more of a direct extension of those techniques originally used by Hevner (1932). Jacobs administered a multiple-choice test on which test-takers both selected their ‘most correct answer’ and indicated their degree of confidence in their response using a three-point scale: ‘guess,’ ‘fairly confident,’ and ‘very confident.’

Jacobs (1971) provided an experimental test in that two different groups were used. The first group was a low-penalty condition. These students were told that risk and reward were on a roughly equivalent scale, where correct items marked ‘guess/fairly confident/very confident’ would earn 1/2/3 points, respectively, and incorrect items marked in the same way would cost 0/2/3 points, respectively. In this way the only risk-free choice was an admission of guessing, though in all other situations test-takers only had to risk what they stood to gain. The second group was a high-penalty condition. For these students, points were earned at the same rate but penalized at 0/4/6 points instead of 0/2/3 – twice the risk that could be gained in reward.

In all, Jacobs (1971) found that traditional scoring and confidence scoring in the low-penalty group were nearly identical in terms of reliability. Traditional scoring had a higher split-half reliability at .89, compared to the confidence testing split-half reliability of .87. In the high-penalty group confidence scoring was drastically worse than traditional testing, with a split-half reliability of .39 versus the traditional test split-half reliability of .79.
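Jacobs’ (1971) two scoring conditions can be sketched as follows; the confidence labels used as inputs are my own shorthand for the three-point scale.

```python
def jacobs_item_score(correct, confidence, high_penalty=False):
    """Scoring in the spirit of Jacobs (1971).

    confidence: 'guess', 'fairly_confident', or 'very_confident'.
    Correct answers earn 1/2/3 points; incorrect answers cost 0/2/3 points in
    the low-penalty condition and 0/4/6 points in the high-penalty condition.
    """
    reward  = {'guess': 1, 'fairly_confident': 2, 'very_confident': 3}
    penalty = {'guess': 0, 'fairly_confident': 2, 'very_confident': 3}
    if high_penalty:
        penalty = {'guess': 0, 'fairly_confident': 4, 'very_confident': 6}
    return reward[confidence] if correct else -penalty[confidence]

print(jacobs_item_score(True, 'very_confident'))                      # 3
print(jacobs_item_score(False, 'very_confident'))                     # -3
print(jacobs_item_score(False, 'very_confident', high_penalty=True))  # -6
```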
Multiple correct answers to each item.

As is perhaps becoming evident, Dressel and Schmid tested a number of different methods in their 1953 paper. Two cases which may or may not be strictly classified as confidence testing were those which dealt with multiple-choice questions containing multiple correct answers.

The first of these two examples was the ‘multiple answer test,’ in which Dressel and Schmid (1953) “…informed the student that each item might have any number of correct answers, and [that] his score would consist of the number of correctly selected answers minus a correction factor for incorrectly marked answers.” (p 581). Specifically, the “scoring formula used with these items was the number of answers correctly marked minus the marked incorrect answers.” (p 581)

This method achieved the largest boost to reliability of the methods tested by Dressel and Schmid (1953), producing a reliability of .78 on the ‘multiple answer test’ relative to a reliability of .70 on the traditional test, even with the lowest number of items answered of all methods: 28.2 items were completed on average in this method compared to 35.2 items on average in the traditional test.

This method is unique in that items are unconstrained by the common notion of having one correct response embedded in several incorrect responses. A test-taker can no longer find the one correct response and ignore the others; each response must be examined and decided upon in turn. Each response option is no longer constrained by the others, and is in one sense independent of them. An item asked in this fashion cannot rightly be classified as just one item. In truth the test-taker is responding to k times as many items as he or she completes (where k is the number of response options), as each response option can be considered a dichotomous choice between a correct or incorrect decision by the test-taker. When looked at in this way, test-takers were able to complete 28.2*k two-response items in this method in the same time that those on the traditional test were able to complete 35.2. It may then be no surprise that reliability would be increased, simply through the (conceptual) addition of items.

Dressel and Schmid’s (1953) second method of this type was the ‘two answer test,’ which “…was the conventional test in which one of the incorrect answers was changed to a correct answer so that two out of the five responses were correct. The students were informed of this and also told that their scores would be the number of correct answers” (p 582). This test was thus similar to the ‘multiple answer test’ except that individuals knew the number of correct response options: two. Removed from this method, then, is the necessity to make a decision on each response option. Once two are selected as correct the others are by default classified as incorrect. Not surprisingly, then, the ‘two answer test’ did not produce a reliability as high as the ‘multiple answer test.’ It was scored in a number of ways, with the simplest scoring method (the number-correct score) producing the largest gains: a boost to a reliability of .76 on the ‘two answer test’ versus a reliability of .70 on the traditional test.
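The multiple-answer scoring formula can be sketched as below; reading “answers correctly marked” as the marked options that are also keyed correct is an assumption of the sketch.

```python
def multiple_answer_item_score(marked, keyed_correct):
    """'Multiple answer test' scoring in the spirit of Dressel and Schmid (1953):
    answers correctly marked minus incorrect answers marked. Treating
    'correctly marked' as marked-and-keyed is an assumption.
    """
    return len(marked & keyed_correct) - len(marked - keyed_correct)

# A five-option item with two keyed answers; as the text notes, each option
# is effectively its own true/false decision for the test-taker.
print(multiple_answer_item_score({'A', 'C'}, {'A', 'C'}))       # 2
print(multiple_answer_item_score({'A', 'B'}, {'A', 'C'}))       # 0
print(multiple_answer_item_score({'B', 'D', 'E'}, {'A', 'C'}))  # -3
```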
Point allocation confidence methods.

Michael (1968) based his take on confidence testing in traditional scoring, and specifically in the common use – in traditional scoring – of correction-for-guessing factors. As described, the standard correction for guessing used the format of: [correct – (incorrect/(k – 1))]. Michael’s (1968) alternative to this concept was ‘confidence weighting,’ in which “the examinee distributed 10 points among the options for each item and received the total number of points he assigned to the keyed answer” (p 308).

The sample used by Michael (1968) consisted of eleventh and twelfth grade high school students. Test-takers responded to the same test twice; in the first administration they took the test in the traditional manner and were told to choose an answer for every item even if it involved guessing. In the second administration they were given the same test (with their chosen answers still marked) and told to distribute the confidence weights among all answers on each item. An important note pertaining to the number of points used was that the number of response options (k) was four throughout the test. This made it the case that “the 10-point distribution would not permit examinees to equalize the weights across any given item” (Michael, 1968, p 309). Also, from a mathematical standpoint, “the CW scores were divided by 10 so that a maximum of one point would be available for each item” (Michael, 1968, p 308).

Michael (1968) examined not only how confidence weighting functioned on the entire sample, but also looked at effects on specific sub-groups. On the total sample, confidence weighting provided a boost to split-half reliability from .764 on the traditional form to .840 on the confidence weighting form. This benefit of confidence weighting was slightly higher when both were corrected with the Spearman-Brown formula (.618 traditional to .724 confidence weighting). Michael also noted a reduction in the standard error of measurement, directly related to the increased reliability. These benefits of confidence weighting were fairly consistent across high and low IQ groups, with a change in reliability from .639 with traditional testing to .769 with confidence weighting for the high IQ group (a .130 increase) and a change from .605 to .723 in the low IQ group (a .118 increase).

On the benefits of this method, Michael “concluded that for a standardized multiple-choice examination in social studies, the CW method affords considerable promise in effecting a higher estimate of [reliability] and a lower [standard error of measurement] than does either the [conventional scoring] or [correction for guessing] method. This conclusion held over a wide range of ability irrespective of sex” (1968, p 311).
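Michael’s (1968) confidence-weighting rule can be sketched as follows; the requirement that all 10 points be allocated is taken from the description above, and the example allocations are invented.

```python
def michael_item_score(allocation, keyed):
    """Confidence weighting in the spirit of Michael (1968): distribute 10
    points across the options; the item score is the number of points placed
    on the keyed answer, divided by 10 so the item maximum is one point.
    """
    assert sum(allocation.values()) == 10, "all 10 points must be allocated"
    return allocation.get(keyed, 0) / 10.0

# A four-option item keyed 'C'; with 10 points and four options the weights
# cannot be spread exactly evenly, as Michael notes.
print(michael_item_score({'A': 0, 'B': 2, 'C': 7, 'D': 1}, 'C'))   # 0.7
print(michael_item_score({'A': 10, 'B': 0, 'C': 0, 'D': 0}, 'C'))  # 0.0
```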
Perhaps best summarized by Selig (1972, p18, italics added): “How often, when a student looks at a four or five-choice item, does he want to say to the instructor, ‘I like choice ‘d’ but there is also a great deal of merit in choice ‘a’’? He hesitates, tosses a coin, points his pencil, or does whatever students do in such cases and comes up with ‘d’ or ‘a’ and marks one of them. It happens to be the wrong choice. As in any testing situation there are two ways to look at the results: the teacher’s way and the student’s. In this case, the teacher concludes that the student doesn’t know anything about the item. But the student feels cheated since he did know that one of the two answers was correct.” Confidence testing, at least in its pure forms, offers a large degree of transparency to test takers. In methods such as those involving allocation of points, test-takers know exactly what points they stand to gain if any given response option turns out to be correct. There is face validity to the method relating to the fact that the more willing you are to say that an answer is correct, the more you should know about that answer. And, if nothing else works, individuals have the freedom within the bounds of the model to simply convert any confidence-scored test back into a traditional test by simply always answering with full confidence (as will be shown shortly). Detractors might suggest that a confidence-scored test will take longer than a traditional test to the degree that the traditional test could have simply been lengthened to provide the same 28 benefits without the added hassle. Relating to testing speed, Dressel and Schmid (1953) found that: “…students work at different speeds on the various types [of tests] in this order (greatest to slowest rates): conventional, degree-of-certainty [4-point likert on answer chosen], free-choice [choose as many answers to find one correct answer], two-answer [two correct choices to each item, test-takers choose two answers], and multiple-answer [any number of correct response options, choose as many to be certain all correct are chosen]. It seems that students work about 20 per cent fewer multiple-answer items in a given period of thirty minutes than conventional items.” (p583-584) In this case, a reduction of test-taking speed on the most complex method is still only at a rate of about 20%. Over the half hour time limit the difference between conventional testing and degree-of-certainty (the next fastest method) was less than one item: those in the conventional condition completed 35.2 items on average, and those in degree-of-certainty completed 34.5 on average. This relates to a loss of about three items for every two hours of testing, which is hardly arguable as anything but negligible. Another problem relating to confidence testing was the failure over time to ever standardize a singular technique, or lay down formulas for the translation of confidences obtained in one method to confidences obtained in another. Echternacht (1972) appears to have been the only individual to attempt anything related to this idea; he translated fixed point allocations to probabilities. Individuals given five points (in this case, stars) to allocate across five response options were assumed to have probabilities of correct response corresponding to the number of points (stars) allocated to the correct response divided by five. While this is certainly better than nothing, it can hardly be considered unifying. 
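Echternacht's translation can be sketched directly. The short Python example below assumes only the rule described above (a fixed budget of five points per item, with each option's implied probability equal to its share of that budget); the function name and the example allocation are illustrative, not taken from the original study.

```python
def stars_to_probabilities(stars_per_option, total_points=5):
    """Translate a fixed-point (star) allocation into implied response probabilities,
    following the rule of dividing each option's points by the total points available."""
    if sum(stars_per_option) != total_points:
        raise ValueError("all points must be allocated on each item")
    return [s / total_points for s in stars_per_option]

# A test-taker who splits five stars 3-1-1-0-0 across five options is treated as
# holding probabilities of .6, .2, .2, 0, and 0 of choosing those options.
print(stars_to_probabilities([3, 1, 1, 0, 0]))  # [0.6, 0.2, 0.2, 0.0, 0.0]
```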
29 The lack of this standardization is not a problem inherent with confidence testing itself. This paper will argue and demonstrate that mathematics for translation between methods can be produced with minimal assumptions. Confidence testing suffered from an inability to produce not only consistent gains, but simply consistent results over time. Confidence testing was shown to produce nearly the entire range from detriment to benefit. This paper will use simulated pilot data in order to argue and demonstrate that these differences in results are likely attributable to unexplored (and mostly unreported) test characteristics. Finally, it has been suggested that the proclivity of an individual to use confidence judgments rather than assigning complete confidence to a single response (inconsistent with their actual confidence) could be related to individual difference personality factors of the individual (Hansen, 1971). Hanson reported that “Individuals who indicate a preference for risky options…tend to be more certain in their responses than would be typical for an individual with their knowledge” (Hansen, 1971, p 13). In addition, it was also shown that risk-taking was not correlated with test scores themselves (r = .043, -.025, ns). This, then, is not as disastrous a result as might be expected. It does mean that if confidence testing is producing more reliable and more valid tests, that some individuals will have more reliable and more valid scores than other individuals on the same test. However, those risk-taking individuals (guessers) who have lower reliability and validity will still have the reliability and validity that they would have had on a traditional test. Those who are not engaging in risk-taking (guessing) will gain the benefits of a more reliable and valid test. To make a practical example: in an adaptive test this would mean that risk-takers (guessers) would have to take more items to obtain the same level of reliability and validity as those who were not risk-taking. 30 Hansen (1971) did not find correlations with the use of confidence testing and other personality factors, such as test anxiety, though it is difficult to say that they do not have any relationship. Due to the haphazard nature of confidence testing research at the time, even Hansen’s (1971) risk-taking findings must be taken with some skepticism. If the use and benefit of confidence testing is shown to be related to characteristics of the test that were unmeasured (or at least unreported) at the time of Hansen’s results, then replication and construction of individual difference linkages from the ground up is necessary. That is not to say that prior work is completely without merit. Hansen’s (1971) examination of risk-taking and test anxiety are a good starting point for examination of individual differences as they relate to confidence testing. In fact, years earlier, anecdotal findings of Hevner (1932) noted that “Informal observation among the subjects indicates that the opportunity to express a degree of confidence is a welcome addition to the test, especially when the feeling is one of insecurity” (p 362, italics added). This general insecurity on a test has links to the anxiety aspect of Kanfer and Heggestad’s (1997) motivational taxonomy, and also somewhat to general avoidance motivation (e.g. Elliot, 1999). With only questionable prior research on how these different individual differences might relate to confidence testing, we are left with somewhat of a thought experiment. 
Kanfer and Heggestad (1997) list a number of individual difference traits which fall under the overarching theme of anxiety. Arguably most relevant to this current study are the concepts of general anxiety and specific test anxiety (as already argued for by Hansen, 1971). Presented with a confidence scored test, the question now is how individuals high or low on these behaviors should act. 31 Hansen’s (1971) results for risk-taking seem perfectly reasonable – individuals who score high on risk-taking are more likely to ‘put all their eggs in one basket’, so to speak. They are more willing to take the risk that the answer they are most confident in is the correct answer, thereby placing all confidence in that answer. Individuals who are less risk-taking are more likely to distribute their confidence accurately, hedging against the possibility that their most confident choice might still be incorrect. The prospects for anxiety (both general and specific) are also consistent with reasonable expectation. While not significant, Hansen (1971) found a negative relationship between test anxiety and the amount of confidence individuals were willing to place in singular answers. This means that those individuals high on test anxiety were more likely to use confidence testing, while those low on test anxiety were less likely to use it. This relates back to the idea of anxiety and avoidance, and even risk-taking. In adopting an avoidance oriented stance regarding the test, they should be less likely to take risks by displaying overconfidence. Those who are anxious about testing (or perhaps just anxious in general) should be more diffident about answering with full confidence, and may be more likely to distribute their confidence in a way which guarantees them at least some credit. These individual difference measures, along with others which will be discussed later in the paper, will be collected in order to shed light on how some of these initial constructs of interest may influence confidence testing. 32 Integrated General Form of Confidence Testing Model It appears no overstatement to represent classical confidence testing as a fragmented field. Numerous methods for the collection and utilization of this data have been presented, and for each method which was published there was likely another which was lost to time. Indeed, one of the problems in the field of confidence testing was that a wide range of different techniques were all pitted against one another in an attempt to find which worked best. Stakes were relatively high for those who were looking to sell their method, as can be seen in the case of Shuford and Massengill (1967). What was never discussed, and perhaps never examined, was the fact that each of these methods are in fact collapsed specific cases of a more generalized model, at least from a mathematical standpoint. Using the assumptions already made by the creators of these different models, a generalized model can be created which provides a linkage between them. Such will be the purpose of this section. A mathematical linkage between methods will allow for tests of differences that should arise from psychological processes; if methods are mathematically identical then psychological processes are all that remain in accounting for differences. The confound of the mathematical framework of the different methods can thus be controlled for in analysis. Defining the mathematics of these models opens an entirely new way to do research on and pertaining to them. 
The basic idea that holds across all models of confidence testing is that weights are given to each response option through some translation of collected confidence information into score. At the most basic the score on the item is the amount of weight a test-taker allocates to the correct response, however that may be scaled. As discussed, the ideal (and believed to be improbable) case is that where information is collected relating to the actual probability that a 33 test-taker would choose each different response option. In this ideal case the probability of response maps to the weight each answer has in their choice process. Each method of confidence testing uses a different method to try to approximate this continuous confidence/probability scale. In the case of two response options (r1 and r2, most often true/false), information need only be collected on the response option chosen. Confidence on response one (Cr1) and confidence on response two (Cr2) must sum to one, which means that: Cr2 = 1 – Cr1 and Cr1 = 1 – Cr2. This is not only a property of items with two responses, and in fact it is simply a special case of a constraint which takes place on all items. In all methods of confidence collection examined so far, it is the case that information must be collected explicitly or implicitly on k - 1 of those response options in order to have full confidence information about that item. The weighting on the final response is constrained by the responses to the other response options and can thus be implied from them. Mathematically, once an individual has supplied weights to k - 1 response options the final response option is constrained to be equal to 1 - (∑ (Cr1, Cr2…Crk-1)). In the case where an individual has no knowledge whatsoever concerning the answer to an item (e.g. taking a test in a completely foreign language) the probability of answering correctly can be defined as equivalent to 1/k, which may be recognizable as the starting value in the estimation of the guessing parameter. This probability is not only the probability of choosing correctly, but rather of choosing any response – each of the response options is weighted equally at 1/k. Consider now Coombs (1953) method of eliminating all answers confidently known to be wrong. In this method the individual is simply reducing k through the process of elimination, 34 where the probability of choosing any of the remaining options is now assumed to be 1/(k - e), where e is defined as the number of response options eliminated. Frederic Lord himself considered this possibility in his work on item response theory: “We might next imagine examinees who have just enough ability to eliminate one (or two, or three…) of the incorrect alternatives from consideration, although still lacking any knowledge of the correct answer. Such examinees might be expected to have a chance of 1/(A-1) (or 1/(A-2), 1/(A-3)…) of answering the item correctly, perhaps producing an item response function looking like a staircase.” (Lord, 1980, p 17). Similar to Coombs (1953) is Dressel and Schmid’s (1953) method of choosing as many response options as it takes to be confident that one of them is correct. It is easy to see that this is simply a different presentation of Coombs (1953) basic idea; those answers not chosen in the group of ‘believed correct’ have essentially been eliminated. We can now define c as the number of response options chosen in the group ‘believed correct’, and equivalent to k – e. 
Because c = k – e can be rewritten to show that e = k – c, the probability of choosing any of those answers in the group of ‘believed correct’ can therefore be shown to be equal to 1/(k - (k - c)) or simply 1/c. The allocation of points method (Michael, 1968), fits this model in that the number of points allocated to any given response can be shown to fit the degree of probability of choosing that response. The confidence weight (Cr0) for each response can be set equal to P*(1/p) where P is the number of points allocated to that response and p is the number of points allocated overall. This is consistent with Echternact (1972), who computed probabilities with P*(1/5) on a test, where p was fixed at 5. In actual use, the number of points allocated is not bounded in any way. In fact, p and P can be set to any positive real integer, from one to infinity (P can 35 additionally be set to zero). What is important is the proportion of points allocated to the correct answer relative to the collection of points allocated on that item. Interestingly, this point-based model can be shown to include, as a special case, the models above involving elimination and inclusion. In these cases P is simply fixed at 1 and (k - e) = c = p. Further, traditional testing is also a specific case of this model. Test-takers are given one point to be allocated across all response options; in this case p is fixed at 1 and P is constrained to either 1 or 0 for each response. The equation then collapses to 1/1 or 0/1, or rather 1 and 0: correct or incorrect. In this way, a person taking a confidence scored test can, at-will, turn it into a traditional testing situation. The use of Likert-type scales of confidence also fit a specific case of this model, though a bit more math is required. The simplest case is, again, the case of items with two response options and was the basis for the entire range of models to follow as introduced in Hevner (1932). Hevner (1932) had test-takers report confidence in their response on a three point ordinal scale. If we are assuming that this scale is truly Likert-type (as we have been throughout the paper) then with this distinction this scale also gains the concept of interval measurement. The movement from one level of confidence to the next is assumed to be constant across all levels of confidence. This provides almost all that is needed for deducing the underlying mathematical framework. Before this, however, another very important point must be made. Due to the way Hevner (1932) anchored the Likert-type scale there are actually five weights that can be implied from these data. Confidence choices were, put simply, ‘very sure’, ‘fairly sure’, and ‘not at all sure’; thus a test-taker could be ‘very sure’, ‘fairly sure’ or ‘not at all sure in choice A and ‘very sure’, ‘fairly sure’, or ‘not at all sure’ in choice B. 36 As discussed earlier, the ‘not at all sure’ is an admission of guessing and weights each response with a half probability of choice; .50 if we follow the convention that all confidence weights should sum to 1. Therefore, the weight for ‘not at all sure’ is identical regardless of whether it is in choice A or choice B – they need not even be marked in this case. A full five-point scale is the result: ‘very sure in choice A’, ‘fairly sure in choice A’, ‘not at all sure’, ‘fairly sure in choice B’, and ‘very sure in choice B’. Just as before, implied responses can be made given that k – 1 confidence ratings are collected. 
This produces an implied symmetric reverse scale parallel to this five point scale as follows. Given response of ‘Very sure that choice A is correct’ implies that ‘Very sure that choice B is incorrect’ and weights A at 1 and B at 0. Given response of ‘Fairly sure that choice A is correct’ implies that ‘Fairly sure that choice B is incorrect’ and weights A at .75 and B at .25. Given response of ‘Not sure at all (in either)’ implies that ‘Not sure at all (in other)’ and weights A at .50 and B at .50. Given response of ‘Fairly sure that choice B is correct’ implies that ‘Fairly sure that choice A is incorrect’ and weights A at .25 and B at .75. Given response of ‘Very sure that choice B is correct’ implies that ‘Very sure that choice A is incorrect’ and weights A at 0 and B at 1. Weights are calculated through a translation into the point allocation method calculated above. For any Likert-style scale where L is the number of collected Likert scale points, p can be shown to be equal to (k*(L - 1)), in this case p = 4. The implication of this is that allocation of (k*(L - 1)) points can replicate an L-point Likert-style confidence scale for an item any number of response options. 37 These points allocated to each answer are simply multiples from the above example, and can be calculated by multiplying each of the above allocations (0, .25, .5, .75, and 1) by 4 to come up with the interger points that would be allocated. An individual who is ‘very sure’ that choice A is correct will allocate all four points to choice A and zero to choice B. Being ‘fairly sure’ a participant can hedge their bets and allocate three points to choice A and one to choice B. An admission of guessing comes from the split allocation of two points to each choice. To note is the fact that in the two-response special case it does not make sense to consider ‘choosing’ choice A and then allocating only one point to it. The test-taker instead should have chosen choice B. While these options (e.g. allocating 1 point to choice A) are not directly available to an individual (at least practically) they are still part of the scale as they can be implied. In the extension to cases of three or more response options these situations do become much more practically and explicitly available. The simplest Likert-style confidence method involving three or more response options is that of Dressel and Schmid (1953). Test-takers were “…directed to indicate how certain he was of the single answer he selected as the correct one by using the certainty scale: 1) Positive, 2) Fairly Certain, 3) Rational Guess, and 4) No defensible basis for choice.” (p 579) Still assuming interval scale qualities but for simplicity and space-saving measures, we can cut this down to three confidence options similar to Hevner’s (1932) scale: positive or ‘very sure’, ‘fairly sure’, and ‘pure guess’. One will have to accept on faith (or complete the proof themselves) that an extension of the math can be shown to work for any number of confidence options. This scale (Dressel and Schmid, 1953) was only collected on the answer chosen correct and again only represents a portion of the full implied scale. Weights can be crafted for the 38 answer chosen in the same way as was done to the Hevner (1932) scale. However, without collection of confidence on k -1 responses, assumptions must be made regarding the spread of effectively ‘leftover’ weighting. If an individual is ‘fairly certain’ of response A what does that mean for the remaining options? 
The choice of ‘pure guess’ represents a weight of 1/k in all response options; in the case of k = 2 this broke down to a 50-50 chance. This effectively sets a floor for which confidence in a response, if chosen as correct, should never drop below. In the case of k = 3, the response of ‘pure guess’ instead results in .33 in all response options. It then follows that the other more confident responses should cover the range of the scale from that point (.33) to 1. This information can be found in table 1. The number of points that need to be allocated to match this scale is found by k*(L-1); in this case six. However, this comes with some caveats due to the constraints of giving confidence only in the response which has been chosen. This information can be found in table 2. This caveat is that each step up the Likert scale from ‘pure guess’ to ‘very sure’ results in an increase in allocation of (k - 1) points (in this case two). This is due to the fact that other points need to be allocated to all other responses, at least assuming that points in this fashion can only be distributed as whole integers. To extrapolate to different situations, then, if there were four response options (k = 4) this method would be equivalent to allocating 8 points [(4*(3-1))] in steps of 3 (‘very sure’ = 8 points, ‘fairly sure’ = 5 points, ‘pure guess’ = 2 points). The case of 8 points is an allocation of all to one response, 5 points is an allocation of 5 to one and 1 to each of the other 3, and an allocation of 2 gives 2 to each of the four options. 39 To go back to Dressel and Schmid (1953), their method with a four point Likert scale can be shown to actually be representing the allocation of 9 points if k = 3 [(3*(4-1)], 12 points if k = 4 [(4*(4-1))], or 15 points if k = 5 [5*(4-1)]. The actual number of response options on their test (k) was unable to be discovered. However, an allocation of 15 points is equivalent to probabilities starting at 0 and incrementing to 1 in steps of .065 – much greater resolution than one would expect to be able to collect if asked from a straight probability standpoint. Worth noting is the fact that this is still not as resolute of a scale as that proposed by Shuford and Massengill (1967), who incremented from 0 to 1 in steps of .040. In all, the general form then follows that the weight for any given response can be defined as: Weight ≡ PA * 1/p Where PA is the number of points allocated to that response, and p is the number of points allocated overall. For the representation of the method of exclusion of incorrect answers: p = k – e and PA = 1 Where e is the number of responses excluded and k is the number of response options on each item. For the representation of the method of inclusion of correct answers: p = c and PA = 1 Where c is the number of response options marked correct. For the representation of the method of a traditional test: 40 p=1 For the representation of Likert confidence ratings: p = k*(LMAX – 1) and PA = (L0/LMAX) * p Where L0 is the ascending rank order number of the Likert rating chosen and LMAX is the number of Likert ratings available; the lowest representing pure guessing. This generalized model thereby unifies the highly disparate methods used by the field of confidence testing. Specific forms of the model can thus be translated mathematically from one model to another, and compared. Differences that occur in practical use between two mathematically equated models infer that differences are not mathematical but psychological. 
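As an illustration, the general form and its special cases can be sketched in a few lines of Python. This is only one reading of the model: the function names are invented here, and the Likert translation follows the worked point allocations above (the lowest rating allocating p/k points, each step up adding k - 1 points), which agrees with the printed formula PA = (L0/LMAX) * p only when k equals the number of Likert ratings.

```python
def weight(points_on_response, total_points):
    """General form: the weight on a response option is P_A * (1 / p)."""
    return points_on_response / total_points

def elimination_weight(k, eliminated):
    """Exclusion of incorrect answers (Coombs): P_A = 1 and p = k - e for each remaining option."""
    return weight(1, k - eliminated)

def inclusion_weight(chosen):
    """Inclusion of believed-correct answers (Dressel and Schmid): P_A = 1 and p = c."""
    return weight(1, chosen)

def traditional_weight(marked):
    """Traditional test: p = 1, with P_A = 1 on the marked option and 0 elsewhere."""
    return weight(1 if marked else 0, 1)

def likert_weight(rating, n_ratings, k):
    """Likert confidence on the chosen option, read as an allocation of p = k * (L - 1) points,
    where the lowest rating places p / k points and each step up adds k - 1 points."""
    p = k * (n_ratings - 1)
    points = p // k + (rating - 1) * (k - 1)
    return weight(points, p)

# Eliminating one of four options weights each survivor at 1/3; 'fairly certain' on a
# three-step scale with four options maps to 5 of 8 points, a weight of .625.
print(elimination_weight(k=4, eliminated=1))      # 0.333...
print(likert_weight(rating=2, n_ratings=3, k=4))  # 0.625
```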
Psychological factors can thus be more purely studied without mathematical confounds. It is also the case that one form may be found to be easier to understand or more preferable when compared against other mathematically equivalent forms. Additionally, just as traditional testing is nested in this model so are all the different specific methods. With an overarching framework there is actually no need to constrain a test to one method or another. In the same way that a test-taker can ‘break’ confidence testing by always putting confidence in one answer (thus making it a traditional test), different methods can be used at-will. With a mathematical framework by which to calculate comparable weights across methods individuals can use whatever method they chose – at the item level. Consider a multiple-choice item with four response options. A test-taker may only be able to eliminate one answer that they know is incorrect. This is the exclusion of incorrect method (or a 3-point allocation method) and would set weights at 0/.33/.33./.33. 41 On the next item the test-taker may decide that two of the answers seem correct, and that they can’t decide between them. This is the inclusion method (or a 2-point allocation method) and would set weights at 0/0/.5/.5. On another item the test-taker has narrowed the choice to two answers, but feels slightly better about one of them. They may choose to allocate 3-points shared between the two responses with one of the responses getting 2-points and the other getting only 1. This would set weights at 0/0/.33/.66. On another item the test-taker may have full confidence in their answer and eliminate all other answers, simply select that choice, or allocate N points to it (where N is any real integer), and none to any other answer. All of these scenarios collapse the item to that found on a traditional test, producing a weighting of 0/0/0/1. It is thus the case that in the worst case situation all test-takers collapse all items to traditional items, producing a test that is as reliable and as valid as it normally would have been. In such a situation, all that is lost is the time training the test-takers. Pilot Hypotheses There are a number of mathematical hypotheses following from above arguments that can be examined using simulated data. First and foremost: HP1: Confidence testing will produce a more reliable test than an equivalent traditional test HP2: Confidence testing will produce a more valid test than an equivalent traditional test On the differences between different resolutions of similar methods: 42 HP3:Benefits of confidence testing will be larger the closer the method approaches raw probability data (e.g. 10-point allocation better than 5-point, 5-point better than 2-point, etc) To borrow from Dressel and Schmid, (1953, p 576) it is the accepted belief that “…the student whose response contains an element of guessing will tend to miss enough items over an entire test to differentiate him from the student who responds with complete certainty.” That is, given enough items on a test, a traditional test should begin to converge on true score. In this way: HP4: Benefits of confidence testing will be related to test length in that shorter tests will produce larger gains than longer tests. 
To borrow from Echternacht (1972, p 224, italics added), summarizing De Finetti (1965): “If the examinee were certain of the correct answer, the best response was just that, and the problem disappears; but oftentimes, especially if the item was difficult, the examinee had a degree of uncertainty about his action.” That is, less guessing occurs the easier a test becomes, as test-takers are fully confident in their responses. On more difficult tests confidence testing is able to capitalize on information that is lost to what would become guessing. In this way:

HP5: Benefits of confidence testing will be related to test difficulty in that more difficult tests will produce larger gains than easier tests.

Finally, the discrimination of the items on a test should have an impact on the benefits of confidence testing. This should be due to the fact that items which have high discrimination are able to place individuals accurately into correct and incorrect with few individuals falling into the probabilistic space between (where confidence testing finds its extra information). In this way:

HP6: Benefits of confidence testing will be related to test discrimination in that tests with lower discrimination will produce larger gains than tests with high discrimination.

Pilot Studies

Pilot Simulation #1 – Comparison of Different Confidence Testing Methods

Pilot data was simulated in order to illustrate the fact that useful information is lost about individuals when they are artificially dichotomized on each item into categories of correct/incorrect. Data was simulated for 200 individuals on a 60 item test using the following method:

1) Individuals were sampled from a normal distribution on ability (θ: mean = 0, SD = 1).

2) The test was constructed to be of average difficulty relative to the sample (b-parameter: mean = 0, SD = 1), average discriminability (a-parameter: drawn from lognormal with parameters -.13, .34), and a chance of guessing as might be expected on a multiple choice test with 5 response options (c-parameter: mean = .2, SD = .1).

3) Probability of correct response was computed using the 3-parameter logistic model of the form: P(θ) = c + (1 − c) * e^(Da(θ − b)) / (1 + e^(Da(θ − b)))

4) Uniform random data between 0 and 1 was generated for each item of each individual. These data were then used to create dichotomous correct/incorrect information for each item of each individual through a comparison process. If the generated random number was less than their probability of getting the item correct, they were given a score of correct (1). If the generated random number was higher than their probability of getting the item correct, they were given a score of incorrect (0).

These data were then summed at the individual level to give each individual a total correct score for the test ranging from 0 to 60. This will be considered the control condition of standard or traditional testing. The ideal case of confidence testing assumes that individuals are capable of self-reporting confidences or subjective probabilities that correlate perfectly with their actual probability of getting an item correct. Raw probability of correct response for each item for each individual as computed above could be used as the ideal confidence testing collection, though this ideal case is admitted to be impossible from a practical standpoint. Instead, these probabilities can be used as input into methods which are more likely to approximate data which can be practically collected from an actual human sample.
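A minimal sketch of this generating procedure (steps 1 through 4, including the ideal-case probabilities) is given below in Python. The scaling constant D = 1.7, the clipping of the guessing parameter to keep probabilities in range, and the reading of the lognormal values as log-scale mean and standard deviation are assumptions not stated above; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_people, n_items, D = 200, 60, 1.7   # D = 1.7 scaling constant is an assumption

# Step 1: abilities; Step 2: item parameters as described above
theta = rng.normal(0.0, 1.0, size=n_people)
b = rng.normal(0.0, 1.0, size=n_items)                       # difficulty
a = rng.lognormal(mean=-0.13, sigma=0.34, size=n_items)      # discrimination (log-scale parameters assumed)
c = np.clip(rng.normal(0.20, 0.10, size=n_items), 0.0, 1.0)  # guessing, clipped as an assumption

# Step 3: 3PL probability of a correct response for every person-item pair
z = D * a * (theta[:, None] - b)
p_correct = c + (1.0 - c) * np.exp(z) / (1.0 + np.exp(z))    # the ideal-case confidence data

# Step 4: uniform draws dichotomize each response into correct (1) or incorrect (0)
u = rng.uniform(size=(n_people, n_items))
traditional = (u < p_correct).astype(int)
total_correct = traditional.sum(axis=1)                      # total score, 0 to 60
```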
In order to degrade the ideal case allowing for the fact that individuals are likely to be unable to give their raw probability with exact precision, a number of additional possible outcome matrices were created in similar fashion. These scores represent the share of 100% confidence that they are likely to give to the actual correct answer. What remains (i.e. 100% minus this score) is that confidence which would have been given to incorrect answers. Each method can also produce a “total correct score”: a summation of the probabilities of correct answers for each individual. These different methods are described using the ‘allocation of points’ method due to the fact that the generalized form above can be used to translate any other method to and from this framework.

1) 10-point allocation (Michael, 1968). These data replicate a situation where individuals have 10 points to distribute across all the response choices for a given item. This creates 11 possible confidences for each item: 0%, 10%, 20%, 30%, 40%...90%, 100%. This pattern was simulated by rounding each raw probability (represented from 0 to 1) to the nearest tenths place, producing the above confidences.

2) 5-point allocation (Ebel, 1965). These data replicate a situation where individuals have 5 points to distribute across all the response choices for a given item. This creates 6 possible confidences for each item: 0%, 20%, 40%, 60%, 80%, 100%. This pattern was simulated by rounding each raw probability to the nearest of these values.

3) 3-point allocation (similar to Hevner, 1932). These data replicate a situation where individuals have only 3 points to distribute across all the response choices for a given item. This creates 4 possible confidences for each item: 0%, 33%, 66%, 100%. This pattern was simulated by rounding each raw probability to the nearest of these values.

4) 2-point allocation. These data replicate a response situation where individuals have only 2 points to distribute across all the response choices for a given item. This creates 3 possible confidences for each item: 0%, 50%, and 100%. This pattern was simulated by rounding each raw probability to the nearest of these values.

For each simulated set of data a number of statistics were calculated. A benefit of simulated data is the fact that the “true scores” of individuals are known, as they were used to generate the data. Thus, correlations between the “total correct score” for each test and the original ability (θ) were calculated, as well as the resultant r-square values. This allows for r-square change comparisons relative to the original standard testing model. Further, reliability was computed on the raw item-by-individual matrix of data for each different simulated method. These results are presented in table 3 below.

These results show that from a mathematical standpoint there is extra information above correct and incorrect contained not only in the raw probability scores but in several degraded forms of the data. In terms of relationship with true score, an r-square increase of .0715 and an increase in reliability of .086 can be gained from even finding confidence simply to the degree of 0/.33/.66/1. Further, even a collection of one more piece of information above standard – finding when individuals are torn in decision between two responses (i.e. a confidence collection of 0/.5/1) – produces an almost identical r-square increase of .0692 and reliability increase of .081.
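The rounding rule used to create these degraded conditions can be sketched as follows; pilot simulation #2 below replaces this simple rounding with a probabilistic placement. The function name and example values are illustrative.

```python
import numpy as np

def round_to_allocation(p_correct, n_points):
    """Round each raw probability to the nearest confidence attainable when n_points
    are distributed across the response options (the grid 0, 1/n, 2/n, ..., 1)."""
    return np.round(np.asarray(p_correct) * n_points) / n_points

raw = np.array([0.07, 0.30, 0.62, 0.94])   # ideal-case probabilities for four items
print(round_to_allocation(raw, 10))        # 10-point allocation: [0.1, 0.3, 0.6, 0.9]
print(round_to_allocation(raw, 2))         # 2-point allocation:  [0.0, 0.5, 0.5, 1.0]
```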
Pilot Simulation #2 – Introduction of Error; Comparison to Traditional Testing The results of pilot data simulation #1 show that there does not appear to be appreciable difference between the different number of breaks made to the confidence scale. That is, the trouble of a 10-point allocation (or even of the ideal case of raw probabilities) may not hold much more benefit than simpler methods. This allows for a reduction in the number of methods to be simulated as the simulations become more complex. For the following simulation only the 5-point and 2-point allocation methods will be simulated. The data of the first simulation makes an assumption about the ability of individuals to effectively round their raw probability score to one of the adjacent confidence levels. This was somewhat admissible in the first pilot as comparison was being made between different methods of confidence collection. To fairly compare methods of confidence collection to traditional testing the same random error that is being introduced into the simulated traditional data must also be introduced into the simulated confidence data. 47 In order to incorporate the same error into the confidence data, the same random number matrix that was used to create the traditional scores was used. Recall that this random number matrix was created with a number between .00 and 1.00 for every person and every item. If their probability of obtaining a correct score was greater than this random number they received a score of correct, if their probability was less than this random number they received a score of incorrect. The confidence data can be conceptualized piecewise in a very similar way. For example, in the case of 2-point allocation a person can have three different levels of confidence in the right answer: 0%, 50%, or 100%. This comes from the fact that they can allocate no points to the correct answer (0%), one point to the correct answer (50%), or both points to the correct answer (100%). The simulated data provides a raw probability of correct response which in the ideal case would be the confidence that an individual places in the correct response. In pilot #1 this probability was rounded to the nearest possible level. For example, an individual who had a .30 probability of correct response would be rounded to .50 in the 2-point allocation model. Instead of simply rounding, these generated random numbers can be used in the same fashion as in the traditional simulation. Just as an individual has a probability of getting a score of correct or incorrect in a traditional test, so too should an individual have a probability of obtaining the higher or lower confidence in their answer based on their raw probability. To continue to use the example of .30 probability on a 2-point allocation, the individual should not be simply moved to the nearest level, but placed there probabilistically. 48 Conceptually there are a number of ways to envision this process, all mathematically equivalent. In essence a random number can be drawn from the range of .00 to .50 for comparison and placement – if the probability is higher the individual will be placed at .50, and if the probability is lower than the individual will be placed at .00. The individual with original probability of .30 now has a 60% chance of obtaining a .50 confidence and a 40% chance of obtaining a .00. Another possibility is for the probability scale from (in this case) .00 to .50 to be scaled to represent the full .00 to 1.00 scale. 
This method is also one in which the random number matrix of .00 to 1.00 that has already been generated can be used directly. In this way the noise that is introduced is exactly identical to that which has been introduced into the simulation of the traditional test. Using these results produces confidence scores which can be more fairly compared to those of the traditional test. Results based on these scores can be found in table 4. These results show that even the addition of random noise similar to that used in simulating the traditional test results does not noticeably reduce the possible gains of a confidence testing methodology. Further, similar to the results of pilot #1, there does not seem to be drastic difference between the benefits of 5-point and 2-point allocation. Pilot Simulation #3 – Effect of Test Length While the findings so far are promising, a discussion of confidence testing cannot be complete without taking into account the argument that collecting confidence data involves more time per item than simply collecting dichotomous correct/incorrect data. Simulated confidence testing results also appear to be succumbing to ceiling effects in a way which might be alleviated by a reduction of test length (and thus an overall reduction of reliability). Further, for simplicity, 49 reliability will be used as the initial main outcome in order to determine where and how other outcomes might best be used. Reliability was thus computed for subsets of the full 60 item test, decreasing in increments of 5 items. This reliability trend was computed for the traditional test as well as the 5-point and 2-point allocation methods. These results can be found in table 5, and are graphed in figure 4. It can be seen from these results that test length does in fact have an impact on the possible benefits of confidence testing. While the traditional test takes a steady drop in reliability with the removal of items, the 5-point allocation method seems robustly indifferent to the removal of items, maintaining a reliability of .944 even if only 5 items are administered. Even the 2-point allocation only begins to drop in a substantial way when the test length drops below 20 or so items, and maintains a reliability of .802 for a 5-item test. A different way to look at this, then, is the benefit of 2-point and 5-point allocation confidence methods relative to the traditional test over different length. This benefit by test length can be found in figure 5. These results show that the benefits of confidence testing are most pronounced on shorter tests, and appear to asymptote to some level as the test becomes longer and longer. In fact, for the 5-point allocation line the best fit (r-square = .994) is a power curve with the specification: Y= 1.183*x^(-.691), whose limit as x-> infinity is equal to 0. This shows that in this simulated example the 5-point allocation confidence test will always have a higher reliability than the traditional test, and that this difference will decrease as a function of test length. 50 In reality it is unreasonable to believe that these results will hold to this degree on a human sample, though it does strongly support the idea that – all else equal – confidence gains will be higher on shorter tests than longer ones. Pilot Simulation #4 – Effect of Test Difficulty Of the other possible factors that may moderate the benefits of confidence testing, relative test difficulty is among the most clear. 
If a test becomes too easy for a sample of individuals they become prone to always put 100% confidence in the correct answer. Only in situations where individuals have to hedge their bets is there variance for confidence testing to show benefits. In order to demonstrate this idea the pilot data from simulation #2 (and #3) was transformed to create two new simulated data sets. To simulate a similar but easier test, 1.5 was subtracted from each item difficulty parameter, shifting the mean from 0 to -1.5. To simulate a similar but more difficult test, 1.5 was added to each item difficulty parameter, shifting the mean from 0 to 1.5. This was done in order to make these data as directly comparable to the original data as possible. If new item parameters were generated, differences that arose might be at least partly due to differences in the distribution of those parameters. Simply transforming the numbers to a new mean preserves their distribution. As prior pilot data has shown, test length is an important factor in determining the benefits of confidence testing. Because of this, results were examined across the range of test length in the same way as in pilot #3. Part of these data (for a normal test) has already been shown in pilot #3 (fig. 1). Data on easy and difficult tests can be found in figure 6. 51 This graph illustrates a number of important ideas. First, it can be seen that the 5-point allocation method under both tests (blue lines) is more reliable than the 2-point allocation method (pink lines), which – with one exception – is more reliable than the traditional test (yellow lines). It is also the case that within each method the difficult test is less reliable than the easy test. This drop in reliability within method is in large part (if not wholly) due to increased guessing on the part of individuals. Increased guessing leads to a less reliable test. This is not the whole story. The prediction of this simulation was not that the difficult test would be more reliable, but rather that confidence testing could recover more lost information from a more difficult test. Figure 7 shows the increase in reliability relative to a traditional test when using the 5-point allocation confidence method. This graph shows that the benefit of using confidence testing is actually greatest on the difficult test. This is because confidence testing is recovering information that would otherwise be lost to guessing on the traditional test. The easy test recovers slightly less information than the difficult test relative to the normal test (from pilot #3). Interestingly enough, the benefits of confidence testing on the difficult test don’t appear to drop off as quickly as the normal test with increasing test length. This implies that benefits are left to be had even on longer tests if the test is difficult enough. Correlations with the original theta appear to also mirror these results. While the difficult test produces the lowest r-square values it also produces the largest gain in r-square relative to the traditional testing method. These results can be found in table 6. Pilot Simulation #5 – Effect of Test Discrimination Another factor that should impact confidence benefits is test discrimination. Simply put, information is lost when items are not able to discriminate adequately between those who should 52 get it correct and those who should get it incorrect. 
In these situations of low discrimination the problem is individuals of (relative) average ability who have some probability of getting the item correct, but don’t clearly or consistently fall into the group of those who get items of similar difficulty correct or incorrect. In order to demonstrate this idea the pilot data from simulation #2 (and #3) was transformed to create two new simulated data sets. To simulate a similar but less discriminating test, .20 was subtracted from each item discrimination parameter, shifting the mean from .83 to .63. To simulate a similar but more discriminating test, .20 was added to each item discrimination parameter, shifting the mean from .83 to 1.03. This was done (as with difficulty) in order to make these data as directly comparable to the original data as possible. If new item parameters were generated, differences that arose might be at least partly due to differences in the distribution of those parameters. Simply transforming the numbers to a new mean preserves their distribution.

As prior pilot data has shown, test length is an important factor in determining the benefits of confidence testing. Because of this, results were examined across the range of test length in the same way as in pilot #3. Part of these data (for a normal test) has already been shown in pilot #3 (fig. 1). Data on high and low discriminating tests can be found in figure 8. Similar to the findings of pilot #4 on difficulty, the 5-point allocation method has the highest reliability, followed by the 2-point allocation method and finally the traditional test. The high discriminating tests within each method are more reliable than the low discriminating tests, as would be expected. Again, though, this is not the entire story.

The prediction of this simulation was not that the low discrimination test would be more reliable, but rather that confidence testing could recover more lost information from it. Figure 9 shows the increase in reliability relative to a traditional test when using the 5-point allocation confidence method. Similar to the difficulty results, the low discrimination test has more data – relative to a normal test – which can be recovered through the use of confidence testing. The traditional high discriminating test already has a reasonable reliability, and so not as much data is being lost. Unlike the results of difficulty, the trajectory through length of test appears consistent over levels of discrimination. Thus, confidence testing benefits will decay on longer tests regardless of level of discrimination.

Correlations with the original theta appear to also mirror these results. The low discriminating test produces slightly higher benefit to r-square than a normal test, and the high discriminating test produces slightly lower benefit. In the case of the 5-point allocation confidence method this actually results in a doubling of the benefit (from .0562 to .1125) moving from a high discriminating test to a low discriminating test. These results can be found in table 7.

Conclusions from Simulated Pilot Data

In all, this pilot data supports each of the hypotheses proposed. Each of the confidence testing methods provided benefits to both reliability (HP1) and validity (HP2). As the resolution of the confidence scales increased, so did the benefits from confidence testing (HP3).
Interestingly, though, the largest gain seems to be from traditional testing to a 2-point allocation, suggesting that differences between different methods of confidence testing may not be as large as the difference between traditional testing and any confidence method. As well, this suggests 54 that even using a low number of allocated points may give similar benefits to much more complex methods. Test length was a factor consistent with predictions (HP4). This was at least in part due to ceiling effects; reliability decreased in traditional tests as items were removed while reliability of confidence test results was much more indifferent to the removal of items. Larger gains of confidence testing were found in shorter tests. This suggests that confidence tests may therefore estimate ability in a given test more quickly than through a traditional method. Test difficulty was a factor in that more difficult tests produced larger gains from confidence testing (HP5). This is at least in part simply due to the fact that individuals are less certain about items on more difficult tests. This leaves information available for confidence testing to capitalize on. Test discrimination was a factor in that less discriminating tests produced larger gains from confidence testing (HP6). Graphically this can be explained by an exercise in curve fitting – if an item is highly discriminating it will have an item characteristic curve that is very steep, and at least adequately fit by a single step function between incorrect and correct. The more that the slope of the item characteristic curve is reduced (i.e. the less discriminating the item becomes), the more steps in such a function are needed to adequately fit that curve. Traditional testing offers only one step, but in high discriminating items that is all that is needed. There is no information remaining for confidence testing to utilize. Confidence testing offers step functions with multiple steps (as many as are found at the meeting point of desire and practicality), and is therefore able to capture all the information lost by a single step function in less discriminating items. 55 These pilot results show some of the possible ways that confidence testing may have met its downfall, and the boundary conditions under which it might be the most useful. Unfortunately, very few confidence testing studies reported even a fraction of the information required to retroactively test any hypothesis with more modern techniques such as meta-analysis. It is no surprise, then, that none ever treated this information in experimental design. Experimental Hypotheses Similar to that of the simulated pilot, the main hypotheses for human subjects relate to the benefits that are associated with confidence testing. Specifically: H1: Confidence testing will produce a more reliable test than an equivalent traditional test. H2: Confidence testing will produce a more valid test than an equivalent traditional test. Those findings relating to test length, difficulty, and discrimination should also carry over to human subjects from the pilot data: H3: Benefits of confidence testing will be related to test length in that shorter tests will produce larger gains than longer tests. H4: Benefits of confidence testing will be related to test difficulty in that more difficult tests will produce larger gains than easier tests. 
H5: Benefits of confidence testing will be related to test discrimination in that tests with lower discrimination will produce larger gains than tests with high discrimination. In addition, there are a number of hypotheses that could not be tested in the pilot, and are only reasonable to propose on a human sample. These deal mostly with how individuals will interact with these tests. 56 H6: Confidence testing will take more time than traditional testing for a test containing the same number of items, but reliability per unit of time will still be higher in confidence testing than traditional testing. H7: Confidence testing will take more time than traditional testing for a test containing the same number of items, but validity per unit of time will still be higher in confidence testing than traditional testing. H8: Test-takers will understand how to take a test using established confidence ratings. H9: Test-takers will understand how to take a test using the general form of the confidence model in which many options are available to them. H10: Test-takers will find confidence testing preferable to traditional testing. There are also a number of hypotheses that relate to how individual differences may impact the use of confidence testing, based both on prior attempts in the literature (Hansen, 1971) as well as relationships discussed above. The individual differences to be tested relating to these ideas are general anxiety, test anxiety, risk-taking, and cautiousness (as a contrast to risktaking). H11: Trait generalized anxiety will lead to greater use of confidence testing. H12: Trait test anxiety will lead to greater use of confidence testing. H13: Trait risk-taking will lead to lesser use of confidence testing. H14: Trait cautiousness will lead to greater use of confidence testing. Finally, pilot results have shown that item difficulty plays a role in the use and effect of confidence testing. What is not to be forgotten, however, is that item difficulty alone is unimportant unless held in reference to the ability level of the sample to be tested. As discussed, it is the relative difficulty which is actually driving confidence results. Unlike in simulation, 57 individual test-takers do not have perfect self-concept of their ability level. Instead, perceptual information regarding an individual’s understanding of their ability may in fact be important. Rather, it is not whether or not an individual is highly capable of answering the questions, but rather if they believe they are highly capable. In this way, self-efficacy may have a relationship with confidence testing results. H15: Test specific self-efficacy will lead to lesser use of confidence testing. Pilot # 6 – Obtaining Items and Item Characteristics In order to replicate the above simulated results on human subjects, items need to be selected that differ both on difficulty and discrimination. These items must also be a common multiple-choice format, and understandable to the target population (college students). Unidimensionality of the items is also important in order to avoid confounds of type of knowledge. To meet these requirements, verbal analogy items similar in format to those found on older versions of the SAT and GRE were chosen. In all, 120 items were collected from publically available SAT and GRE practice tests on the Internet. These 120 items were piloted on two groups of individuals, both from the psychology research pool at Michigan State University. 
The first group of individuals consisted of 213 college student participants, and the second group consisted of 195 college student participants. No demographic information was collected from these participants, and as such we have no information on age, race, gender, or anything similar. However, given the population from which these individuals were drawn is identical to the population of the final experimental sample, it is not unreasonable to suggest that 58 this sample is predominantly female (~70-80%), predominantly Caucasian, and mostly between the ages of 18 and 22. The first group of individuals received 100 analogy items. The second group of individuals received 20 analogy items (this group also received antonym items in order to test these items as an alternative to analogies; analogies performed better). Three of the items in the second collection were duplicates of items in the first collection in order to compare the relative ability of the two groups. The differences on these items were fairly small, and considered enough to assume these two samples comparable on ability. In this way they will now be treated as one sample. A more accurate difficulty and discrimination estimate for all items chosen can be recreated from the control condition of the final experiment. Item difficulty was computed as the proportion of individuals who answered each item correctly. Item discrimination was computed as the correlation between the responses on a given item and the total scores for the test. Item difficulty ranged from .135 to .892, and item discrimination ranged from -.064 to .724. Unfortunately, difficulty and discrimination were highly correlated (r = .728). This provides some small potential limitations, though it is also the case that effects of both difficulty and discrimination can be examined while controlling for the other. Further examination of a scatter plot of these data revealed this correlation as an artifact of a more curvilinear relationship biased by an excess of high difficulty (low numerical values of difficulty) items. This relationship should be corrected by selection of items for the following study. In order to maximize the spread on difficulty and discrimination in a way which best allowed for the control of confounds, four groups of items were selected: 1) low difficulty, low discrimination, 2) high difficulty, high discrimination, 3) low difficulty, high discrimination, and 59 4) high difficulty, low discrimination. To do this, items were first ranked on difficulty and discrimination. To find items in the first two groups (low/low, high/high), their ranks on difficulty and discrimination were summed. Helped in part by the correlation between difficulty and discrimination, 15 items with the highest sums and 15 items with the lowest sums were selected. These items were also checked to ensure that they were not simply exceptionally high on one dimension but low on the other. Further, these groupings are simply to pull items that may work the best out of this sample of items. After actual data collection item difficulty and discrimination will be recalculated from the control group. In order to select items for the remaining groups (low/high, high/low), a similar method was used. Instead of a sum, a difference was taken. Items with the highest magnitude positive (15 items) and negative (15 items) difference were thereby selected, bringing the total test up to 60 items. 
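A rough sketch of this rank-based selection is given below. It assumes the simple sum and difference scoring described above; in the actual procedure the extreme groups were also screened by hand, and possible overlap between the sum-based and difference-based groups is not handled here. Names and example data are illustrative.

```python
import numpy as np

def ranks(x):
    """0-based ranks of the values in x (smallest value receives rank 0)."""
    return np.argsort(np.argsort(x))

def select_items(difficulty, discrimination, per_group=15):
    """Pick the four groups described above: extreme rank sums (both characteristics
    at the same end) and extreme rank differences (one high, one low)."""
    r_sum = ranks(difficulty) + ranks(discrimination)
    r_diff = ranks(difficulty) - ranks(discrimination)
    low_both = np.argsort(r_sum)[:per_group]      # lowest combined ranks
    high_both = np.argsort(r_sum)[-per_group:]    # highest combined ranks
    diff_neg = np.argsort(r_diff)[:per_group]     # difficulty rank far below discrimination rank
    diff_pos = np.argsort(r_diff)[-per_group:]    # difficulty rank far above discrimination rank
    return np.concatenate([low_both, high_both, diff_neg, diff_pos])

# With 120 piloted items this yields 60 item indices (duplicates possible if groups overlap).
rng = np.random.default_rng(2)
chosen = select_items(rng.uniform(0.1, 0.9, 120), rng.uniform(-0.1, 0.7, 120))
print(len(chosen))  # 60
```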
While full test statistics cannot be given at this point due to the separate samples, test characteristics can be approximated using the items which were drawn from the first sample (49 of the 60 items). These items show an average difficulty of .522, an average discrimination of .416, and high reliability (α = .902). These 60 items are therefore considered appropriate for use in evaluating confidence testing, and can be found in appendix A, along with corresponding difficulty and discrimination.

Experimental Method

Participants

Participants were drawn from the psychology subject pool at Michigan State University. Overall, 252 individuals completed the online questionnaires. Of these 252 individuals, 197 signed up for and attended the follow-up laboratory session in which test data were collected. Of these 197 individuals, 8 were removed for bad or incomplete data relating to computer and network problems, leaving 189 participants in the full laboratory sample. This sample was predominantly female (76%), Caucasian (78%), and between the ages of 18 and 22 (96%).

Upon arrival in the lab, participants were randomly assigned to one of three conditions: traditional testing, 5-point confidence testing, or a more generalized form of confidence testing (discussed in more detail later in the paper). After removal of bad data, the conditions had the following sample sizes: 1) traditional testing, 65 participants, 2) 5-point confidence testing, 61 participants, 3) generalized confidence testing, 63 participants.

Measures

Analogy test. This test can be found in appendix A, and is the direct result of pilot #6. Initial estimates showed that these 60 items possessed overall average difficulty and discrimination (.512 and .416, respectively), and good reliability (α = .902). In addition, items were selected in order to create maximal spread on difficulty and discrimination, while also allowing for the ability to control for their confounding effects on each other.

Difficulty and discrimination as calculated from the control condition (difficulty = .56, discrimination = .25; N = 65) show different values than pilot #6, especially in terms of discrimination. Unfortunately there is no way to determine what might be different about these two samples to cause this difference, as pilot data were fully anonymous and without any collection of individual difference measures. Reliability was also lower in this sample (α = .761) relative to pilot #6.

The problem of an unexpected correlation between difficulty and discrimination in the pilot data (r = .728) also appears to have been solved by the selection of specific items with a range of difficulty, as the correlation of difficulty and discrimination has been reduced to 0.02, and examination of the scatter plot reveals a weak but expected curvilinear relationship. This confirms that the spurious correlation in the pilot data was in fact an artifact of an overabundance of difficult items. These 60 items are much more balanced across the full range of difficulty.

Means and standard deviations of the confidence ratings are not reported, as any summary statistics relating to these items would have to undergo numerous levels of aggregation. For example, the average rating on each item in the 5-point condition will always be 1, as there are 5 points to distribute across 5 answer choices. Whether an individual rates the item as 1-1-1-1-1 or 0-0-0-0-5, the mean will be the same. The standard deviation will be constrained to be between 0 and 2.33 as well (respectively for these two examples).
The average number of points placed on the correct answers is simply the scores for those items. What can be examined is the general pattern of confidence, when used. Distribution of responses for the entirety of the 5-point confidence condition can be found in figure 10, and responses for the entirety of the generalized confidence condition can be found in figure 11. The use of the zero response in the generalized confidence method was exceptionally low (as might be expected), and so it is not included in the graph. It seems that the most often used response in the 5-point method is zero. This is not surprising, as there should be a majority of answer choices on the test which are clearly wrong to a large number of individuals. This effect is not found in the generalized condition because these responses would be instead found in the ‘eliminate as incorrect’ response. It may at first appear that the number of ‘five’ responses is low in the 5-point condition, but it should be kept in mind that only 20 percent of all responses are actually correct. 62 The generalized confidence condition produces a less clear distribution, as it appears that the modal response is in fact 5. Test-takers are also over-utilizing the low end of the scale and under-utilizing the high end of the scale. This suggests that a 10 point scale may in fact not be necessary. Manipulation check – measure of understanding. The manipulation check consists of a five item measure administered both immediately after training and immediately prior to the end of the experimental session. These items can be found in Appendix B. The five items place test-takers in common situations they might find themselves in on any given item, and instructs them to choose optimal answers for those situations. Trait generalized anxiety. The measure of generalized anxiety is taken from the International Personality Item Pool (Goldberg, Johnson, Eber, Hogan, Ashton, Cloninger, & Gough, 2006), and specifically those items dealing with the facet of anxiety. This scale consists of 10 items, 5 of which are reverse worded. Respondents answered each item on a 5-point Likert-type scale ranging from strongly disagree to strongly agree. The reported Cronbach’s alpha reliability for the scale is .83. Cronbach’s alpha from the full sample of those who took the online measures was .87. Full text of the items can be found in appendix C. Trait test anxiety. The measure of test anxiety is a short form of the Test Anxiety Inventory measure created by Taylor and Deane (2002). This scale consists of 5 items. Respondents answered each item on a 5-point Likert-type scale ranging from strongly disagree to strongly agree. The reported Cronbach’s alpha reliability for the original scale is .87. Cronbach’s alpha from the full sample 63 of those who took the online measures was .898. Full text of the items can be found in appendix D. Trait risk-taking. The measure of risk-taking is taken from the International Personality Item Pool (Goldberg, Johnson, Eber, Hogan, Ashton, Cloninger, & Gough, 2006), and specifically those items dealing with the facet of risk-taking. This scale consists of 10 items, 4 of which are reverse worded. Respondents answered each item on a 5-point Likert-type scale ranging from strongly disagree to strongly agree. The reported Cronbach’s alpha reliability for the scale is .78. Cronbach’s alpha from the full sample of those who took the online measures was .83. Full text of the items can be found in appendix E. Trait cautiousness. 
The measure of cautiousness is taken from the International Personality Item Pool (Goldberg, Johnson, Eber, Hogan, Ashton, Cloninger, & Gough, 2006), and specifically those items dealing with the facet of cautiousness. This scale consists of 10 items, 7 of which are reverse worded. Respondents answered each item on a 5-point Likert-type scale ranging from strongly disagree to strongly agree. The reported Cronbach's alpha reliability for the scale is .76. Cronbach's alpha from the full sample of those who took the online measures was .83. Full text of the items can be found in appendix F.

Trait test specific self-efficacy. The measure of self-efficacy is a modified version of the self-efficacy scale taken from the International Personality Item Pool (Goldberg, Johnson, Eber, Hogan, Ashton, Cloninger, & Gough, 2006). This scale consists of 10 items, 4 of which are reverse worded. Respondents answered each item on a 5-point Likert-type scale ranging from strongly disagree to strongly agree. The reported Cronbach's alpha reliability for the original scale is .78. Cronbach's alpha from the full sample of those who took the online measures was .85. Full text of the items can be found in appendix G.

Procedure

Participants completed a number of individual difference measures online before the experimental session. In addition to the measures above, participants were also asked for their high school GPA, current college GPA, and ACT and SAT scores. After completing the online questionnaires, participants signed up for a lab session in order to complete the testing part of this study.

Upon arrival in the lab, participants were randomly assigned to the traditional testing control condition, the 5-point confidence testing method, or the generalized confidence testing method. All participants received a short training (~3 minutes) on what analogies are and some strategies for solving analogy questions. This part of the training was identical across all groups. In addition:

Participants in the traditional testing condition received the basic analogy test in a standard fashion. They were instructed to simply select one answer for each question, the same as they would on a normal test. No mention was made of confidence testing or confidence scoring.

Participants in the 5-point confidence testing condition were given a short training (~2 minutes) on the concept of 5-point confidence testing. They were told that they had 5 points for each question which could be allocated across the answer choices in any way they desired. The proportion of points they allocated toward the correct answer would be the points they received for that item.

Participants in the generalized confidence testing condition were given a short training (~3 minutes) on the concept of generalized confidence testing. They were told that they had a number of options relating to how they could respond to any given item. These options involved eliminating answers as incorrect, selecting answers as correct, or using a weighting system from 1 to 10 to differentially weight answer choices they believed correct with varying degrees of confidence. This method is thus a combination of methods used by Coombs (1953), Dressel & Schmid (1953), and Michael (1968).

After training, but before the analogy test, participants in the confidence scored groups completed the five item manipulation check. The purpose of this was twofold. First, these five items gave participants time to practice with confidence testing before beginning the actual collection.
Second, answers on these questions were collected in order to test how well participants understood the concept of confidence testing. This same five item manipulation check was also administered at the end of the session for the second reason. All participants were also asked a question at the end of the session relating to how desirable the test they just took was in relation to other tests they had taken in the past. Following this, participants were given a full debriefing about the purposes of the study.

Tests were scored according to the assumptions put forth earlier in this paper, identically to the pilot. Score on the traditional test was the number correct score, out of 60. Score on the 5-point confidence test was the sum total of points that were placed on correct answers throughout the test, divided by 5. This scales responses in this condition so that 1 point maximum can be gained on each item, identical to the traditional test. Scores on each item thus take the values of 0/.2/.4/.6/.8/1, just as in the pilot.

Generalized confidence testing presents a more complex scoring methodology. Following from the scaling arguments made in the development of the general model, several scoring rules can be established from a number of assumptions that should be met. These assumptions are: 1) if all incorrect answers are eliminated and the correct answer is chosen, the participant should receive 1 point, identical to the prior two methods, 2) if a correct answer is eliminated as incorrect, the participant should receive 0 points, identical to the prior two methods, 3) if a diffident response is given between x answer choices (one of them correct), the participant should receive 1/x of the possible points (e.g., half a point if two answers, one third of a point if three answers), 4) if a correct answer is selected among a number of weighted options, the participant should receive x/X of the possible points, where x is the weight given to the correct answer and X is the summed weight given across all answers, and 5) if an answer is generically selected as correct (without weighting) among other weighted answers, that answer's weight should be set to the average value of the other weights.

This system produces a wide array of possible point values for each item, ranging from 0 to 1 and taking on a wide range of the possible numeric values between. The reader is invited to calculate the number of distinct point values possible for each single question. The author will simply acknowledge that it is sufficiently large.

Due to the (relatively) complex nature of the scoring for this method, an alternative technique for scoring was also tested. This method will be identified as generalized confidence scoring (simplified), or similar, so as to distinguish it from the above scoring system. This simplified method is based on the simulated result that such a large degree of resolution might not be necessary in order to achieve gains to test characteristics. Instead of the above calculations involving the numeric weighting, all responses were simply coded as either 1) eliminate as incorrect or 2) select as correct. This second category of select as correct included all answers that received weighting values; the weighting information was effectively discarded and set to equivalent.
This method was then simply scored as one of two outcomes: 1) 0 points if the correct answer was eliminated as incorrect, and 2) 1/x points if the correct answer was selected as correct, where x was the total number of answers that were selected as correct on that question.

Analysis

Perhaps the best way to illustrate the problems with prior confidence testing studies is to first answer hypothesis 1 and hypothesis 2 in the same way they might have been answered in those studies. In this case, reliability and validity were calculated for the full 60-item test in each of the four conditions. Results can be found in table 8. The generalized confidence test produced the most reliable results, and the simplified version of generalized confidence scoring retained almost all of the benefit of the more complex version. Using confidence intervals from bootstrapping these reliability results (1000 iterations), and the process put forth in Payton, Greenstone, & Schenker (2003), it is possible to estimate whether or not these differences in reliability are significant. At 60 items, the generalized confidence test in both forms had significantly higher reliability than the control test. The reliability of these tests was not significantly higher than that of the 5-point confidence test, nor was the 5-point confidence test significantly more reliable than the control.

In terms of validity, two outcomes were used: current college GPA and ACT score. SAT score was collected, but the majority of the sample (> 90%) reported that they did not take the SAT. It was therefore not used for any analysis. High school GPA was also collected but likewise not used. Examination of high school GPA produced a number of unclear results and led to the conclusion that several factors might eliminate it as a strong candidate for validity analysis. One factor is differential scale use due to students coming from different high schools across the country and world. Another factor is weak correlations with presumed correlates such as ACT score; this correlation is reduced to near zero when controlling for college GPA. This, along with moderate correlations with college GPA, suggests that any useful variance that may be present in high school GPA may also be present in college GPA. If both GPA results are tapping the same latent ability construct, college GPA is also more proximal to this study. These factors, as well as a general argument for parsimony, led to the exclusion of high school GPA from further analysis.

The 5-point confidence method performs best in relation to college GPA, though none of the coefficients are significantly different using a Fisher r-to-z transformation to perform a z test. The traditional test was the strongest of the four tests in terms of relationship with ACT score. This is unsurprising, as the items on this test were modeled after items sometimes found on the ACT. This difference was again not significant. In terms of the confidence methods, the generalized methods seemed to perform slightly better than the 5-point method in relation to ACT score. Again, simplification of the generalized method's scoring seems to have little to no effect on test outcomes.

Of course, this is not the whole story with relation to reliability or validity. It is hypothesized that test length and other factors will have a large effect on reliability and validity. Test length can be examined the same way as in the pilot, by artificially shortening the test.
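Before turning to test length, the scoring rules described above (both the full generalized method and its simplified variant, along with the 5-point method) can be made concrete. The following Python sketch is illustrative only: the function names and the encoding of a response as a mapping from answer choice to an elimination flag, a plain selection, or a numeric weight are conveniences assumed here, not the format actually used in data collection, and unmarked choices are treated as eliminated purely for the sake of the sketch.

```python
def score_five_point(points, correct):
    """5-point condition: 5 points are distributed across the answer choices;
    the item score is the share of points placed on the correct answer."""
    return points.get(correct, 0) / 5.0

def score_generalized(response, correct):
    """Full generalized scoring, following assumptions 1 through 5 above.
    `response` maps each marked answer choice to 'eliminate', 'select',
    or a numeric weight from 1 to 10; unmarked choices are treated here
    as eliminated (an assumption for this sketch)."""
    if response.get(correct, "eliminate") == "eliminate":
        return 0.0                                   # assumption 2
    weights = [v for v in response.values() if isinstance(v, (int, float))]
    avg = sum(weights) / len(weights) if weights else 1.0
    # Assumption 5: an unweighted 'select' takes the average of the other weights;
    # with no weights present this reduces to an even split (assumptions 1 and 3).
    numeric = {k: (avg if v == "select" else v)
               for k, v in response.items() if v != "eliminate"}
    return numeric[correct] / sum(numeric.values())  # assumptions 1, 3, 4

def score_generalized_simplified(response, correct):
    """Simplified scoring: weights are discarded, every non-eliminated choice
    counts as 'selected as correct', and the item is worth 1/x if the correct
    answer is among the x selected choices."""
    kept = [k for k, v in response.items() if v != "eliminate"]
    return (1.0 / len(kept)) if correct in kept else 0.0

# Example: weights of 6 on 'B' (correct) and 3 on 'C', everything else eliminated.
print(score_generalized({"B": 6, "C": 3, "A": "eliminate"}, correct="B"))             # 6/9
print(score_generalized_simplified({"B": 6, "C": 3, "A": "eliminate"}, correct="B"))  # 1/2
```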
Table 9 shows the reliability for each test as a function of test length, and figure 12 represents this data graphically. Figure 13 shows the benefits of confidence testing relative to the control condition. 69 These results seem to follow the general findings of the pilot simulations, if with a bit more noise. The relationship between the 5-point confidence test and the control looks the most similar to simulation results: the 5-point confidence test starts at a higher reliability, and the traditional test almost catches up with it through the addition of more items. This has the proposed effect of altering the reported benefits of confidence testing depending on which crosssectional test length is being examined. Two studies with identical tests will show very different results if one uses 25 items (confidence testing boost to reliability ~.20) and the other uses 50 items (confidence testing boost to reliability ~ .05). Unfortunately, the Payton, Greenstone, & Schenker (2003) bootstrapping method shows no significant differences in reliability at any test length between these two methods (5-point confidence and control). Generalized confidence testing, both in the complex and simple forms, appears to outperform the 5-point confidence testing in all situations except on a very short test (5 items). This may indicate that the 5-point method is easier for individuals to understand quickly, while test-takers are still learning the method on the first few items of the generalized confidence test. Once the test reaches 10 items both generalized confidence testing conditions appear more reliable than the 5-point confidence test. Bootstrapping shows that this difference is significant only on the test lengths of 30 items, 35 items, and 40 items. The control and 5-point conditions appear to be converging on the reliability of the generalized confidence method, but this has yet to occur by the point of 60 items. In fact, the reliability that the traditional test reaches at 60 items is roughly the same reliability that was reached by the generalized confidence test somewhere just above 25 items, and by the 5-point confidence testing somewhere just above 40 items. Bootstrapping the reliability of the 70 generalized confidence conditions relative to the reliability of the control condition shows significant improvements on all test lengths except 5 items and 15 items. While not simulated or directly predicted, it is worth examining how the length of the test affects its validity. Figures 14 and 15 show the change in validity over test length across conditions. No consistent interaction effects of time and condition are evident in the validity data. The control test does appear to show a greater relationship with ACT for most of the test, with the generalized confidence methods converging on the same level of relationship around 25 items and surpassing the control test (temporarily) around 10 items. This effect is fairly small, and holds only for the generalized confidence test. The 5-point confidence test appears uniformly weaker than the control test in this relationship across all test lengths. The relationship to college GPA seems to follow a roughly opposite trend. The 5-point confidence method is strongest above 25 items before reaching the level of the control test. The generalized confidence method is slightly better than control on short and long tests, but not those of medium length. 
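Returning briefly to the mechanics of the length analysis: artificially shortening the test and bootstrapping the resulting reliabilities amounts to only a few lines of computation. The sketch below is a minimal illustration with fabricated data; the interval produced here is a simple percentile bootstrap, and the particular rule for judging significance from such intervals is the one attributed in the text to Payton, Greenstone, and Schenker (2003) rather than anything implemented here.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a participants x items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def alpha_by_length(scores, lengths):
    """Reliability of the artificially shortened test: alpha on the first k items."""
    return {k: cronbach_alpha(scores[:, :k]) for k in lengths}

def bootstrap_alpha_ci(scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for alpha, resampling participants."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    boots = [cronbach_alpha(scores[rng.integers(0, n, n)]) for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return lo, hi

# Fabricated item scores standing in for one condition (65 participants, 60 items in [0, 1]).
rng = np.random.default_rng(1)
ability = rng.normal(size=(65, 1))
fake_scores = np.clip(0.5 + 0.15 * ability + 0.15 * rng.normal(size=(65, 60)), 0, 1)

print(alpha_by_length(fake_scores, lengths=[5, 10, 20, 40, 60]))
print(bootstrap_alpha_ci(fake_scores))
```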
Overall, it appears that length may not be as related to the validity of confidence test as it is related to reliability. It has been suggested that any gains to reliability and validity from using a confidence test might simply be offset by the extra time that is spent taking that test. For instance, if a 30 item confidence test takes as long for a test-taker as a 60 item traditional test, then the extra information could have just as easily been collected by giving those 30 extra traditional items. In order to examine how much longer confidence tests take for participants to complete, timing information was collected on all items across all conditions. Figure 16 shows the average 71 amount of time spent by participants in each condition as a factor of the length of the test to that point. It is not surprising that the traditional test takes less time for participants to complete than each of the confidence tests. What is unexpected is how similar the 5-point confidence test and the generalized confidence test are in terms of how long it takes participants to complete them. No two means for these groups are separated by more than 15 seconds at any given test length. This suggests that the thought process that individuals are using to take the 5-point confidence test is the same or very similar to that used to take the generalized confidence test. On average, it took participants a little over 14 minutes (858 seconds) to complete the 60 item traditional test. In this same space of time, participants were able to complete a little over 40 items in either the 5-point confidence condition (average time for 40 items = 810 seconds) or the generalized confidence condition (average time for 40 items = 814 seconds). Looking back at table 9, we can see that the reliability of the traditional test at 60 items is 0.761. This is actually quite close to the 5-point confidence test at 40 items (α = .747) but still lacking against the generalized confidence test at 40 items (α = 0.848). To match the reliability of the 60 item traditional test (average time = 858), participants only needed to take 25 items of the generalized confidence test (average time = 513 seconds). This produces almost identical reliabilities (.761 vs .759), and would give test-takers on the generalized confidence test an extra five minutes to do with what they please. Even considering the additional time spent training participants on how to take the test, participants in the generalized confidence testing condition would still be left with an extra two minutes. In terms of validity, it can also be suggested that when different validity scores are compared across tests, they should not only be compared by the number of items, but the time it 72 takes to complete any given number of items. A number of discrete time intervals can be examined, 200 seconds, 400 seconds, 600 seconds, and 800 seconds. With approximations (conservatively selected to always minimally favor traditional testing), participants can complete roughly 15, 30, 45, and 60 items in those periods of time, respectively, in the control condition. In those same increments of time, participants can complete 10, 20, 30, and 40 items, respectively, in the confidence conditions. College GPA will be examined first, followed by ACT. Examining figure 14, it can be seen that at 200 seconds, the control test has a correlation with GPA of about .25, just about identical with the 5-point confidence test. 
However, the generalized conditions are both quite a bit better, with coefficients around .35. At 400 seconds this seems to shift, as the control test validity is still only slightly higher than .25, but the 5-point method has climbed to nearly .35. The generalized confidence conditions appear slightly weaker than 200 seconds prior, as they are now closer to .33. At 600 seconds, the control test begins to lose validity, and now has a coefficient of around .24. The 5-point confidence test is still quite a bit better at this time interval, as it has climbed to .37, while the generalized methods have dropped to just above .25 at this point, effectively comparable with the control condition. Finally, at 800 seconds, the control condition remains roughly the least valid of any of the tests, right around .24. The generalized confidence conditions are beginning another climb in validity, and are nearly back to a coefficient of .30. The 5-point confidence test is still quite a bit ahead, at nearly .40. In fact, the 5-point confidence test would actually benefit from stopping test-takers slightly earlier (at 35 items instead of 40), as validity at that point is the highest of any test at any time interval, just above .40.

This is the sweet spot, so to speak, for the 5-point method, while the optimal points for the generalized and control conditions are 10 items and 20 items, respectively (validities of .35 and .30). Neither is as high as this maximal 5-point value, which is obtained after about 700 seconds. Time spent after this point seemingly only hurts validity. Based on the general trend, no finite (or infinite!) number of similar items introduced into the generalized or control conditions would ever reach this value.

In relation to ACT, at 200 seconds the generalized methods seem to be performing best, with validity coefficients of around .65. The control test at this point has a validity of around .60, and the 5-point test has a validity coefficient just below .45. This, coincidentally, appears to be the ideal time period for the generalized conditions, as spending more time on the test only hurts validity in relation to ACT after this point. At 400 seconds, the control test is only slightly higher than the generalized test, with coefficients of .65 and around .61, respectively. The 5-point method at this point is slowly climbing, and is at .50. At 600 seconds, the control test has reached a somewhat stable plateau around .70. This is higher than .60 for the generalized confidence method and .55 for the 5-point method. Finally, at 800 seconds, the control test is still at around .70, where extrapolation suggests it will remain with any additional items. The generalized test is around .65 (no different than back at 200 seconds, suggesting that those additional 600 seconds were unnecessary), and the 5-point method has also reached a plateau of just below .60, where it also appears it will stay regardless of any additional items.

In all, then, when scaling for time it appears that the generalized confidence test is ideal in terms of validity relating to ACT, as it reaches a validity of .65 with only 10 items (just over 200 seconds). While the control test eventually gets higher than this, it does not reach this point until around 30 items, which takes fully twice as long (400 seconds).

Aside from time, there were a number of other factors predicted to influence reliability and validity.
Item difficulty and item discrimination can be examined by breaking apart the overall test into subsets of items which were used to produce a test with variance on both of these factors. Due to the differences in difficulty and discrimination between this sample and the pilot sample, difficulty and discrimination were recalculated from the control condition. While the overall characteristics of the test seem to have shifted to some degree, there did not appear to be any items which changed drastically in relation to other items. As a brief aside before moving on to these hypotheses, the computation of difficulty and discrimination from the control condition also raises an interesting (but un-hypothesized) question of the ability to calculate difficulty and discrimination from confidence scored tests. Conceptually, difficulty is the proportion of individuals who answered correctly on any given item. Mathematically, it is simply the mean of all responses (scored 0 and 1) on that item. It follows, then, that the mean of all responses on the confidence forms (which are bounded by 0 and 1) should yield similar values. The confidence tests’ difficulty scores for each item correlate very highly with the difficulty scores obtained from the traditional test (control with 5-point, r = .966; control with generalized, r = .957; control with generalized simplified, r = .957). This suggests that difficulty values generated from confidence scored tests may be able to be directly equated with difficulty values from traditional tests. In the same way, discrimination is nothing more than the correlation between participants’ scores on an item and their scores on the test. Unlike difficulty, the discrimination scores for the confidence tests do not correlate nearly as high with those from the traditional test (control with 5-point, r = .524; control with generalized, r = .358; control with generalized simplified, r = .342). This suggests that different items might be more or less discriminating 75 across these different score methods. Closer examination shows that on average, confidence items appear to be discriminating better than items in the traditional test (control = .252, 5-point = .266, generalized = .306, simplified = .308). It is expected that difficulty should have some impact in this relationship, as easy and difficult items should be less discriminating on a confidence test. Figure 17 shows that this is half correct. This figure shows the trend of item discrimination relative to item difficulty, by test. Only control and generalized confidence were compared, as the effect in the 5-point condition was similar but weaker. This figure gives the clearest picture of what seems to be occurring. On the 30 easiest items, generalized confidence testing is getting more discriminating power out of each item, in some cases to a fairly substantial degree. As difficulty increases, things become less ordered. The confidence items in general appear to be trending downward, and their relationship with the traditional items appears less clear, weakening the correlation between the two. Again, confidence tests appear to have higher discrimination overall, but this result would be drastically different if the test consisted of only easy or only difficult items. This then partly shows how prior research in confidence testing could find drastically different results from one study to the next. Difficulty seems to be having powerful effects that require more examination and potentially future work. 
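The cross-method comparison of item statistics described above amounts to a few lines of computation. The sketch below, again with purely hypothetical arrays standing in for the real response matrices, treats each confidence-scored item exactly like a 0/1 item: difficulty as the item mean and discrimination as the item-total correlation, after which the two sets of item statistics can simply be correlated across items.

```python
import numpy as np

def item_stats(scores):
    """Difficulty (item means) and discrimination (item-total correlations)
    for a participants x items matrix; item scores may be 0/1 or any value in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    difficulty = scores.mean(axis=0)
    total = scores.sum(axis=1)
    discrimination = np.array([np.corrcoef(scores[:, j], total)[0, 1]
                               for j in range(scores.shape[1])])
    return difficulty, discrimination

# Hypothetical control (0/1) and confidence-scored ([0, 1]) matrices for the same 60 items.
rng = np.random.default_rng(2)
control = rng.integers(0, 2, size=(65, 60)).astype(float)
confidence = rng.random((61, 60))

d_ctrl, r_ctrl = item_stats(control)
d_conf, r_conf = item_stats(confidence)

# Agreement of the item statistics across scoring methods (the text reports r = .966
# for difficulty between control and 5-point scoring, but much lower for discrimination).
print(np.corrcoef(d_ctrl, d_conf)[0, 1])
print(np.corrcoef(r_ctrl, r_conf)[0, 1])
```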
The next set of analyses focuses on this question of difficulty and discrimination more closely, so closer scrutiny will not be undertaken yet. It may be required, however, depending on the outcomes of some of the tests of hypotheses relating to difficulty and discrimination.

Breaking apart the test into those items that are easier and more difficult, as well as more or less discriminating, allows for examination of the main effects of difficulty and discrimination, as well as the interactions. Figure 18 shows the difference in reliability between two 30-item subtests, one with easier items and the other with more difficult items. Figure 19 shows the difference in reliability between two 30-item subtests, one with less discriminating items and the other with more discriminating items. While discrimination seems to be acting as predicted in relation to reliability, difficulty is actually showing effects opposite those found in simulations. Confidence testing appears to be showing benefits on the easier items, and detriments on the more difficult items.

To examine the possibility of an interaction effect, figures 20 and 21 show the effect of difficulty, discrimination, and test type on reliability. These figures do suggest that difficulty and discrimination are interacting to some degree with test type and each other in terms of their effects on reliability, though they do not give any insight into the main effect of difficulty. On difficult items, all test types are weakened by low discriminating items. On easier items, confidence tests seem to be more resilient to drops in item discrimination. In fact, the generalized methods seem almost completely indifferent to this change on the easier items, with a reliability on the 15 easy/low discriminating items that is still higher than the reliability of any comparable set of 15 control items.

Given the drastically different results for difficulty relative to the pilot, further examination of the test and of responses was undertaken. The argument from the pilot involving item difficulty is based on the assumption that easy items create a ceiling effect at which point all individuals should be responding with 100% confidence, completely eliminating the need for confidence scores. From there, as items get more difficult, test-takers should use confidence ratings more and more as they become less certain in their answers. In fact, the data suggest that at least part of this aspect of difficulty is true.

For the 30 easiest items, participants on average answered just above 8 items with confidence ratings in the 5-point condition and just below 14 items in the generalized condition. For the 30 hardest items, participants on average answered just below 23 items with confidence ratings in the 5-point condition and just below 24 items in the generalized condition. Participants are using confidence ratings more, as predicted, but the benefits are not emerging. This increased use on these difficult items actually seems to be hurting reliability.

Observed data are therefore not matching the predictions of the simulations in relation to how test-takers use confidence testing on difficult items. One of the assumptions on which the simulations were built was the idea that individuals would reach a point where they had almost no hope of answering the question correctly; in this situation they would spread their confidence evenly across all answers.
To further examine this, two questions were chosen at random to represent a medium difficulty item and a very difficult item. These two items do show again that individuals are using some spread of confidence ratings more on difficult items. On the medium item only three test-takers in the 5-point confidence condition and two in the generalized confidence condition spread their confidence completely evenly. A number more, as expected, spread their confidence across 3 or 4 answers, but were still able to eliminate some choices. On the more difficult item (in fact one of the most difficult on the test), participants should have been much more likely to spread their confidence evenly. While they did this more frequently (16 in the 5-point confidence condition, 12 in the generalized confidence condition), this is well below what would have been predicted by simulation. The difficulty on this item in the control condition was .077, meaning that less than 1 in ten test-takers were able to get this item correct. The difficulty score in the confidence collections was somewhat higher, at .138 in 78 the 5-point confidence condition and .134 in the generalized confidence condition. This sheds light on a number of outcomes that may be occurring on difficult items. A difficulty of .077 means that this item was below a pure chance-based guessing parameter, which on this test would have been .20 (1 in 5). Participants in the control were doing worse than blind guessing, indicating that a good distracter option on this item was likely present. Distracter options were not examined in simulation; at a certain level of difficulty participants were simply assumed to be answering with full randomness. This randomness manifested itself as error in simulations of traditional tests, and this error was tempered by confidence testing, resulting in gains. In simulation, there were no differential assumptions made about items when difficulty dropped below 1/k, and this neglects the fact that in a human sample a break occurs at the point in which individuals begin responding worse than chance. In fact, if more test-takers were using confidence testing rationally, and did not see or fall for distracter items, the difficulty of all items below a certain threshold should bottom out at 1/k (or .20 in this case). This tendency of the scores to cluster at this level is actually evident in the slight rise in computed difficulty in the two confidence collections relative to control. Mathematically this is expected. Individuals who are spreading their confidence in a way similar to traditional tests would be producing an item with difficulty near .07, and those using confidence testing to spread their confidence evenly should be producing an item with difficulty near .20. It is thus not surprising that the computed difficulty of this mix falls somewhere in between, near .13. Overall, this suggests that the effect of difficulty found in simulation is valid only on tests where good distracter choices do not exist, or more specifically, on tests where item difficulties do not naturally drop below 1/k. This can be empirically examined, within limits, as 8 of the 60 79 items have difficulties that fall below .20. Reanalysis on 52 items after removal of these 8 does show minimal improvement in getting confidence tests back to zero loss on difficult items, but not enough of an effect to dwell on much longer. It should go without saying that it is far from reversing the effect to line up with simulation results. 
Part of the problem in testing this specific possibility is something that has been discussed many times to this point in this paper. Item difficulty should not be considered only as a fixed property in a vacuum, but rather as a level that each test-taker tries to surpass with their individual abilities and differences. An item with a difficulty of .22 still shares the same problems as one at .18, just slightly lessened. There is a gray area in which the distribution of relative difficulty drops below this difficulty threshold for some proportion of the sample. For the sake of brevity, then, let us cut to the chase and examine this effect on items which should not have this problem. The 20 easiest items on the test (difficulty ranging from .831 to .969) were divided into two subtests of 10 items each: the easiest items and the slightly more difficult items. Reliability on these sets of items does actually look closer to simulations, and can be found in figure 22.

While this does follow much closer to simulation, and does suggest that confidence testing might have its greatest benefits on items that are not too difficult or too easy, these results must be taken with some healthy skepticism. This is a small set of items relative to the full test, and confidence testing has already been shown to have benefits associated with test length. Test length is accounted for in this examination, but it is still a small number of items. What this does seem to show is that the predicted ceiling effect of easy items seems to be a minor factor for the generalized confidence condition and a slightly larger factor for the 5-point condition.

The impacts of difficulty, discrimination, and test type on validity produce twice as many results as reliability because two different outcomes were used for validation. Overall, results do not appear to support predictions. Differences follow the results reported for reliability fairly consistently (though in general more weakly), with confidence tests performing better on easier items and those with higher discrimination. The effect of discrimination appears to be much weaker, with the traditional test not taking as large of a loss on low discrimination items. High difficulty, low discrimination items still appear to show the weakest validity for all test types. Due to the lack of firm results, the similarity with reliability, and for the sake of simplicity and brevity, these results will not be explicitly discussed (though they are available upon request from the author).

Hypotheses 8 and 9 predicted that participants would understand how confidence scoring works and how to use it on the items in the manipulation check. Due to the different nature of 5-point confidence testing and generalized confidence testing, each will be examined in turn.

The manipulation check on the 5-point confidence test yields fairly quantifiable correct and incorrect answers. For example, participants should respond with '1' to all answers on the item which asks them to respond as if they had no idea which answer was correct. Overall, there were very few deviations from these 'correct' answers.

The first item asked participants to 'Answer this question as if you couldn't decide between answer choice 'A' and answer choice 'B', but felt slightly more confident in answer choice 'B'.' Of 61 participants, 91.8% in the pre-test and 98.4% in the post-test responded with the expected response of 2/3/0/0/0.
Another 4.9% in the pre-test and the remaining 1.6% in the post-test responded with the close response of 1/4/0/0/0. 81 The second item asked participants to ‘Answer this question as if you were fairly certain that answer choice ‘C’ was correct.’ Of 61 participants, 86.9% in the pre-test and 91.8% in the post-test responded with the expected response of 0/0/5/0/0. Of the remaining, the most common mistake was x/x/4/x/x, where a 1 was substituted for one of the ‘x’s. The third item asked participants to ‘Answer this question as if you had no idea as to which answer choice is correct.’ Of 61 participants, 98.4% in both the pre-test and the post-test responded with the expected response of 1/1/1/1/1. The fourth item asked participants to ‘Answer this question as if you were certain that answer choice ‘D’ and answer choice ‘E’ were incorrect, but had no strong idea of which of the remaining was correct.’ Of 61 participants, 85.2% in both the pre-test and post-test responded with the expected response of x/x/x/0/0, where 5 points were distributed as equally as possible (i.e. 2/2/1, 2/1/2, 1/2/2) across the answers marked with x. The fifth item asked participants to ‘Answer this question as if you were fairly certain answer choice ‘A’ was correct, but had some thoughts that there was a chance that answer choice ‘B’ was correct.’ Of the 61 participants, only 65.6% in the pre-test and 73.8% in the post-test responded with the expected answer of 4/1/0/0/0, though an additional 31.1% in the pre-test and 23.0% in the post-test answered with the close response of 3/2/0/0/0. The manipulation check on the generalized confidence test has a wider range of possible answer strings that would work for each question, though each can still be quantified as fitting with the instructions of the question or not. For the first question, 60.3% in pre-test and 74.6% in post-test responded in an appropriate way. For the second question, 68.3% in pre-test and 63.5% in post-test responded in an appropriate way. For the third question, 82.5% in pre-test and 81.0% in post-test responded in 82 an appropriate way. For the fourth question, 85.7% in pre-test and 92.1% in post-test responded in an appropriate way. For the fifth question, 74.6% in pre-test and 73.0% in post-test responded in an appropriate way. Overall, these results are not as promising as the 5-point confidence test. However, the majority of errors on these items seem to be cases where participants left some of the answers blank on a given question. Fortunately, and for whatever reason, this does not appear to be as much of a problem on the main test as on these manipulation check questions. Practically, this may inform that a choice such as ‘eliminate as incorrect’ is not needed, as the answers left blank often appear to be those that should have been eliminated. This is a possible way of simplifying administration of this test method and helping to bring about better understanding of it. Hypothesis 10 predicted that participants would find confidence testing preferable to traditional testing. All participants were asked a question asking them to judge the test they just took in terms of desirability relative to other similar tests that they have taken. Responses were given on a 5-point scale from ‘Much less desirable’ to ‘Much more desirable’. 
A one-way ANOVA shows group differences on this question (F(2,188) = 11.903, p < .001), with least significant difference post-hoc tests showing that participants found the 5-point confidence test (M = 3.59, SD = .761) more desirable than either the traditional test (M = 2.89, SD = .732) or the generalized confidence test (M = 3.16, SD = .919). While participants did not find the generalized confidence test significantly more desirable than a normal test, they at least found it no more objectionable. Because test desirability as measured is arguably only quasi-continuous, a chi-square test of association was also performed examining the relationship between test condition and desirability. This test was also significant (χ2(8, N = 189) = 26.62, p < .01) and supportive of the results of the ANOVA.

It was predicted that a number of individual difference measures would influence how individuals interacted with confidence tests. Specifically, these individual differences were predicted to change the degree to which test-takers would use confidence testing to distribute their confidence across answers, instead of simply placing all their confidence on one answer. A new variable was created in each of the two experimental datasets which represented a count of the number of items on which each participant used confidence testing to distribute points across multiple answers. Table 10 shows the correlation of each of the individual difference measures with the number of items on which confidence testing was used for the 5-point confidence test, and Table 11 shows the same for the generalized confidence test.

Results are somewhat mixed, with no significant correlations but some results trending in the predicted directions. The generalized confidence condition shows a marginal effect of test anxiety in that participants higher on test anxiety are more likely to use confidence testing (as predicted).

There is, however, a potential problem with measuring the 'use of confidence testing' in this way. That problem relates to ideas discussed in the simulations of the effect of difficulty on the benefits of confidence testing. It was suggested that confidence testing may not show benefits on easy tests due to the fact that on very easy tests participants would always (correctly) display 100% confidence in the correct answers. As difficulty is relative to the person, it stands to reason that the number of times a person uses confidence testing should be related to the difficulty of the test, or rather, how difficult the test is relative to their level of ability. The correlation between score on the 60-item test and the number of times that a person used confidence testing is significant in both the 5-point confidence condition (r = -0.573, p < .001) and the generalized confidence condition (r = -0.595, p < .001). This is an expected result and not relevant to the predictions of the individual difference measures. It should therefore be controlled before examining the effects of the individual difference measures. Partial correlations identical to the above analysis, but controlling for participants' scores on the test, can be found in tables 12 and 13. Overall, all effects appear to be negated when controlling for a participant's score on the test.
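The partial correlations reported in these tables can, in principle, be reproduced with a standard residual-based computation. The sketch below partials test score out of both an individual difference measure and the count of items on which confidence testing was used; the variable names and fabricated data are illustrative only and are not drawn from the study's dataset.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z from both."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])
    x_res = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    y_res = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(x_res, y_res)[0, 1]

# Fabricated example: test anxiety, count of items on which confidence was spread,
# and total test score (the control variable), pooled across both confidence conditions.
rng = np.random.default_rng(3)
score = rng.normal(size=124)                     # 61 + 63 hypothetical participants
anxiety = rng.normal(size=124)
ct_use = 30 - 10 * score + rng.normal(size=124)  # lower scorers spread confidence more

print(np.corrcoef(anxiety, ct_use)[0, 1])        # zero-order correlation
print(partial_corr(anxiety, ct_use, score))      # controlling for test score
```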
To examine the possibility that some individual differences might be acting the same way across the forms, and to increase the power of these correlations, the same partial correlations were examined on a dataset containing responses from both the 5-point and generalized confidence condition. These partial correlations can be found in table 14. Even with this increased sample size, there do not appear to be any relationships between individual differences and the use of confidence testing in general. Discussion The following discussion of results will be grouped to first focus on reliability, then validity, then individual differences and person factors. In general, results will be summarized first, then examined more in detail relating to practical implications, limitations, and areas for future study. Hypothesis 1 proposed that confidence testing would produce a more reliable test than a traditional test. This hypothesis is generally supported, with both the 5-point confidence and generalized confidence test producing higher reliabilities than traditional tests of the same length. However, this difference in reliability (while similar in magnitude to prior results), is only significantly higher in the generalized confidence condition. Also noteworthy is that a simplification of the scoring methodology for the generalized confidence test appears to have negligible impact on reliability in relation to the full generalized confidence method, showing that a simpler system using this same general form may be ideal. 85 Hypothesis 3 proposed that the benefits of confidence testing on reliability would be related to the length of the test in that shorter tests would show larger benefits than longer tests. Discounting odd effects at extremely short test length (~5 items), it does appear that test length impacts the benefits of confidence testing similarly to simulated pilot data, supporting this hypothesis. As test length increases it appears that both forms of confidence testing are beginning to converge on zero benefit over traditional testing. This has not occurred by 60 items, though extrapolating by trends over number of items it is reasonable to assume this may occur around 75-80 items for the 5-point condition and 90-100 items for the generalized confidence conditions. Hypothesis 4 proposed that confidence testing would show greater benefits on more difficult tests. This hypothesis was not supported, and in fact results appear to be opposite predictions. Confidence testing performed better than the traditional test on easier items, and worse on more difficult items. This effect was stronger in the generalized confidence conditions. This finding may indicate a number of possibilities, but overall suggests that the relationship between test difficulty and the usefulness of confidence testing may not be as simple as simulations would indicate. It was shown that participants use confidence testing more when items become more difficult, and overall results show that confidence testing has its benefits. On the most difficult items the use of confidence testing is actually hurting reliability, which means that test-takers must be using it incorrectly or irrationally in these situations. This seems to relate at least somewhat to distracter choices on items that lower difficulty of the item to below 1/k. 
One of the missing assumptions of the simulations was that at a certain point test-takers would reach a level of difficulty at which they would find it best to evenly distribute their confidence and cover all bases; below this point individuals would stop 86 displaying knowledge and instead start displaying lack thereof. In simulation, score was assigned probabilistically without taking this disconnect into account. Good distracter choices are beneficial to traditional tests, but results seem to suggest that they cause problems with individuals’ consistent use of confidence testing. These difficult items decrease the number of people evenly spreading confidence. On the most difficult items the rate of this behavior is well below what should be expected, given the extreme difficulties. Even removing some of the most difficult items does not change this outcome much. The problem is that this test was not constructed with distracter choices as an experimental consideration. Instead, they are simply a confound, and one that is difficult to extract without intense analysis of item content relative to ability levels of the participants. The differences are small, but difficult items are the one place that the simplified version of generalized confidence testing actually outperforms the generalized confidence testing as collected. This hints at one of the drivers of this problem; on the hardest items participants seem to be over-thinking things instead of simply evenly distributing their confidence. Through this process of making things more complex, they are adding noise to the data. This reduces reliability (and validity through measurement error). In the generalized condition this might result in more use of (incorrectly applied) differential weighting, which is basically ‘fixed’ in the simplified version. Even so, this is a small piece of the puzzle. Some examination of less difficult items does show relationships more similar to those found in the pilot studies, but the unfortunate truth is that this study was not set up to examine these issues in detail. These results highlight paths for future research, but can do little more in terms of establishing firm conclusions of what might be found. For now, it appears that 87 confidence testing is weakest on extremely difficult items, notably those that are well-written enough to drive the difficulty below chance guessing. Confidence testing does seem to work well on items of easy and medium difficulty, and the potential ceiling effects of very easy items do not appear to pose as much of a problem as predicted. This may simply be because even easy items are not easy to every test-taker. Overall, this raises the possibility that confidence testing’s relationship with difficulty may be more bellshaped than monotonic, with both very easy and very difficult items producing problems of different degrees. Unfortunately, this test lacked groups of items at comparable difficulty, and rather filled a range of difficulty. Examining difficulty as a full spectrum thus involves examining individual items or increasingly small sets of items, something that this study was not designed to be able to do. Due to this it is hard to make comparisons about particular levels of difficulty without getting entangled in individual items, which introduces a host of other problems. Future targeted work (and ideally, an eventual meta-analysis after the collection of a number of primary studies) would be required to fully address this problem. 
The findings on difficulty also raise a question of what confidence really means, especially when the items are more difficult. It is assumed that confidence is being measured through the collection of confidence data on these tests, though it is impossible to say with these data that confidence is actually what is being measured. Given the findings, it does seem most likely that confidence is what is driving responses, but it is also possible that deviation from a pure measure of confidence on difficult items is the driver of the decline in their usefulness. Confirmation of this would require deeper work, examining the thought processes and behavioral motivations that lead test-takers to select one level of confidence over another. Ideally, this would likely manifest as a verbal protocol study.

Hypothesis 5 proposed that confidence testing would show greater benefits relative to control on tests with lower discrimination, and lesser benefits on tests with higher discrimination. The data support this hypothesis. Confidence testing seems to show no benefit to reliability on the high discrimination items (all tests do well). On the low discrimination items the benefits are fairly large, especially in the generalized confidence condition. This implies that the need for highly discriminating items is not as great on a confidence test as it is on a normal test. That is not to say that items should have no discriminating power whatsoever, but rather simply that the demands may not need to be as great.

Hypothesis 6 proposed that gains in reliability are more than simply the product of adding more time to the test. Put another way, a 60-item traditional test should not be compared to a 60-item confidence test, but rather to a test that takes the same amount of time to complete. Results show that the difference in reliability per unit of time between the 5-point confidence test and the control is fairly negligible, but the difference between the generalized confidence test and the control is well in favor of the generalized confidence test.

The 5-point confidence testing method may not show any improvement over a fixed period of time, but that is not the only factor by which it should be judged. The fact that each single item (without any modification to the content of the item or additional effort in its creation) can collect more information is also important, especially in the context of test security and item exposure. For example, suppose that participants can complete 1) 60 items of a traditional test with fixed reliability in a certain time period, or 2) 40 items of a 5-point confidence test with the same reliability in the same time period. If there is no difference in reliability or time, then the only factor of note is that those in the 5-point confidence collection were only exposed to 40 items while those in the traditional test were exposed to 60. This would lower the requirements of item pools, or simply make item pools of the same size more powerful. Thus, hypothesis 6 is strongly supported in relation to the generalized confidence condition, and supported within certain assumptions in the 5-point confidence collection.

Hypotheses 2, 3, 4, 5, and 7 proposed effects of confidence testing on the validity of the test. The test of hypothesis 2, which proposed that confidence tests would show higher validity than traditional tests, showed inconsistent results. Confidence tests showed better prediction in relation to college GPA, but weaker prediction in relation to ACT score.
Hypotheses 2, 3, 4, 5, and 7 proposed effects of confidence testing on the validity of the test. The test of hypothesis 2, which proposed that confidence tests would show higher validity than traditional tests, showed inconsistent results. Confidence tests showed better prediction in relation to college GPA, but weaker prediction in relation to ACT score. The test of hypothesis 3, which proposed that test length would have an interaction effect on these validity results, also showed inconsistent results. Test length seemed to have no easily interpretable impact on the validity of the test, which also means that no fair or strong statements can be made about hypothesis 7. The simple addition of items did not appear to change validity. Changes were more likely related to the addition of specific items which themselves showed better or worse validity. Hypotheses 4 and 5, which proposed that test difficulty and discrimination would have an impact on validity, mirror the results already discussed for reliability in the case of difficulty, but showed no notable effect in the case of discrimination.

Hypotheses 8 and 9 proposed that test-takers would understand how to take a test using each of the different confidence methods. Overall this is supported in both conditions, as the majority of participants seemed able to use this method of testing accurately with even fairly minimal training and experience, though there is a question of whether this process breaks down on very difficult items. It is unclear (and an important question for future work) how many testing sessions it would take for all individuals to become acclimated to using these confidence ratings with complete accuracy. Even minimal one-on-one training would likely be very useful to those who did not immediately understand the process through group training.

Hypothesis 10 proposed that individuals would find confidence testing preferable to traditional testing. Contrary to many of the other score-based results, which showed generalized confidence testing outperforming the 5-point method, participants found the 5-point confidence condition to be the most preferable. There was no significant difference in preference between generalized confidence testing and traditional testing. This provides partial support for hypothesis 10, though an argument can be made that if generalized confidence testing is at least not less preferable than a traditional test, its benefits come at no psychological cost to the test-taker.

The remaining hypotheses related to how individual differences influence the use of confidence testing. Hypotheses 11, 12, 13, 14, and 15 all failed to be supported by the data. Neither confidence condition, nor the combination of the two, showed any significant relationship between the use of confidence testing and scores on individual difference measures. This failure to find any result is not quite as disheartening as it may at first appear. In fact, for confidence testing to be a more widely usable method it is preferable that its use not depend on outside factors. Risk-taking, which was found by Hansen (1971) to be somewhat correlated with use of confidence ratings (r = -.226, p < .05), appears to have a weaker effect in this sample. A larger sample size may have found some of these smaller effects to be significant (note: the overall sample size of the two confidence groups was still larger than that of Hansen's 1971 study), but it is at least possible to rule out any medium to large effects of these variables on confidence testing use. If effects do exist, they are sufficiently small as to not have been picked up by this study, or they represent other constructs that were not measured.
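The claim that medium and large effects can be ruled out can be given a rough quantitative footing with a standard power calculation for a correlation, sketched below (the combined confidence-group sample size of 119 follows from the per-condition Ns reported in the tables; the two-tailed .05 alpha and .80 power are conventional assumptions, not values taken from the study).

import math

def min_detectable_r(n, z_alpha=1.96, z_beta=0.84):
    # Smallest population correlation detectable with probability of about .80
    # at a two-tailed .05 level, using the Fisher z approximation.
    z_r = (z_alpha + z_beta) / math.sqrt(n - 3)
    return math.tanh(z_r)

print(round(min_detectable_r(119), 2))  # ~0.25 for the combined confidence groups
print(round(min_detectable_r(61), 2))   # ~0.35 within a single condition

Correlations much below roughly .25 would therefore have had little chance of reaching significance here, which is consistent with treating the null results as ruling out only medium and large effects.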
It may also be the case that different situations would prompt more individual difference effects. For example, it may be that context is important in measuring some of the individual differences, such as self-efficacy. Instead of test-specific self-efficacy, verbal-test-specific self-efficacy could be examined in future work. These effects may also be weaker because this test was low stakes. On a high-stakes test these effects may become larger and significant, as there may be psychological processes that were not activated by this simple low-stakes setting. While individuals were told to imagine that they were taking a normal test as they would in a class, there is no substitute for actual high-stakes data. This examination of individual differences would also benefit from a more in-depth examination of confidence as individuals actually experience it while taking the test, as might be gained from a verbal protocol analysis. It is also possible that other constructs that were not examined have some bearing on the use of confidence testing. These constructs were examined as a first pass at finding covariates, but they hardly cover the full range of possible individual differences that might be related to testing. The case might also be made that these constructs were simply measured poorly, though such speculation seems unlikely considering the relatively high reliabilities produced by each of the measures.

Practical Implications

Perhaps the main question that needs to be answered by this paper is: "should confidence testing be implemented in practical settings?" The general answer appears to be yes. This of course comes with some call for caution and continued scrutiny, which will be discussed in relation to necessary future research, but confidence testing does stand to provide some practical gains to testing.

A second question is "which version of confidence testing is best, or at least most practically useful?" Each test seems to excel in different areas, and while the benefits of generalized confidence testing appear larger, so are the drawbacks when things do not work out (e.g., highly difficult items). In general, it is the opinion of the author that the best hope is likely to be found in the simplified generalized confidence method, but collected in a simple form instead of with the 10-point weighting. Even if this collection only had the same benefits as the simplified form in this study, test-takers could likely complete it more quickly, increasing the gains when scaled for time. Additionally, this simpler form might give test-takers less room to act irrationally on difficult items. Unfortunately, this can only be examined with future work.

First and foremost, college students with minimal training were able to figure out and use these techniques with a large degree of accuracy. Training can be implemented using a PowerPoint presentation in a matter of minutes, which allows for fairly easy implementation if that is what is required or desired. Additional training could be used to improve understanding, which would also likely grow with more widespread use. Anecdotally, only one participant across both confidence conditions (in the 5-point confidence condition) asked the experimenter for clarification of the technique, and even then only on the first of the practice questions. Participants did not dislike the methods any more than a traditional test, and individual differences did not seem to have any moderate or large impacts on usage of the test.
The concern that confidence testing takes more time than traditional testing also seems somewhat unfounded, as controlling for time did not reveal any detriment substantial enough to make the techniques worse than traditional testing. Above all this, confidence testing does seem to show some noteworthy improvements to tests. The 5-point confidence method shows some boost to test reliability, and even though this benefit is reduced to near zero when accounting for the time spent on the test, it still provides a practical use in terms of test and item security as outlined above. The generalized confidence method showed a boost in reliability even after accounting for time spent per item. This study also represents a worst-case scenario in terms of time spent on each item, as it is a fair assumption that this was the first time these test-takers had ever encountered confidence testing methods. With practice it would be expected that test-takers should get better at taking tests in this fashion, perhaps strengthening time effects in favor of confidence methods.

Validity results were slightly less clear than those for reliability, but still offered some hope for the techniques. Both methods of confidence testing show signs of offering better validity than traditional testing in relation to college GPA, but both appear slightly weaker than traditional testing in relation to ACT score. There are two possible explanations. The first is that the ACT can be considered a benchmark measurement of verbal reasoning. The fact that confidence testing does not correlate as highly with it as the traditional test (which is effectively a replication of the ACT) would then mean that the confidence method is introducing noise that is not important. Oddly, though, this supposed noise correlates better with college GPA than the near-replicate of the ACT (the traditionally scored test) does. The second possibility is that confidence testing better predicts college GPA because it is capturing more useful variance in college GPA. In fact, it is capturing more variance in college GPA than the traditional test, which is again basically a replication of the ACT. By capturing variance in college GPA that the ACT does not measure, the confidence test's correlation with the ACT is lowered. The information that confidence testing captures is viewed as noise in the correlation with the ACT only because the ACT fails to capture this useful variance.
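This second explanation can be illustrated with a small simulation, sketched below (illustrative only; the variable names, weights, and sample size are assumptions and are not estimates from this study's data). GPA is generated from a verbal component plus a second component that the ACT does not measure; the 'traditional' score taps only the verbal component, while the 'confidence' score also taps a little of the second component. The confidence score then correlates more strongly with GPA and slightly less strongly with the ACT, mirroring the observed pattern without the extra information being mere noise.

import random, statistics

def corr(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

random.seed(2)
n = 50_000
verbal = [random.gauss(0, 1) for _ in range(n)]   # what the ACT mainly measures
other = [random.gauss(0, 1) for _ in range(n)]    # GPA-relevant variance the ACT misses

act = [0.9 * v + random.gauss(0, 0.45) for v in verbal]
gpa = [0.6 * v + 0.5 * o + random.gauss(0, 1) for v, o in zip(verbal, other)]
traditional = [0.8 * v + random.gauss(0, 0.6) for v in verbal]
confidence = [0.8 * v + 0.25 * o + random.gauss(0, 0.6) for v, o in zip(verbal, other)]

print("r(traditional, GPA):", round(corr(traditional, gpa), 2))  # ~.38
print("r(confidence, GPA): ", round(corr(confidence, gpa), 2))   # ~.46 (higher)
print("r(traditional, ACT):", round(corr(traditional, act), 2))  # ~.72
print("r(confidence, ACT): ", round(corr(confidence, act), 2))   # ~.69 (slightly lower)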
Further, given that ACT score is itself a tool meant to predict college GPA (for reference, the correlation between the two in the full data is r = .420, p < .001), it may be that the prediction of college GPA holds more weight in this study. These initial results are therefore somewhat promising, but call for continued work. Future studies could use tests that are more directly related to outcomes. Verbal reasoning is a fairly large construct, and one that has the potential to correlate with a number of factors that may not be verbal reasoning. More specific tests of ability (such as task-specific performance or specific tests of skill) may allow a cleaner examination of validity results.

Hypotheses relating to difficulty did not work out exactly as predicted by the simulations. From a practical standpoint, this means that applications of confidence testing should continue to use a wide array of items of varying difficulty as a means of further understanding these impacts (and of protecting against the weaknesses of a test using limited ranges of these characteristics). There is no single set of item types that demonstrates cause to be selected over all others, though it appears that items developed to have good distracter choices might be best left out of non-research use. While it seemed that the optimal difficulty for confidence testing was near .70 in this study, this finding should be replicated in other samples with other tests. Additionally, as difficulty is relative to a number of factors, it may take a number of targeted studies to determine exactly what relationship item difficulty has with the benefits of confidence testing, and how exactly test-taker behavior breaks down at the difficult end of the scale. The addition of some difficult items to any collection may be worth the extra effort for purposes of study, even though they appeared less than ideal on this test and sample.

Deeper perceptual studies examining what confidence means to test-takers across levels of difficulty may also shed light on this issue. As discussed earlier, it might be the case that confidence is being measured by these confidence items only on easy and medium-difficulty items, but not on the most difficult. It may also be that confidence is not being measured at all. There appear to be few ways to address this other than in-depth verbal protocol analysis, and this may be required to fully determine what exactly is being measured by these ratings from a perceptual standpoint.

Hypotheses relating to discrimination were supported in relation to reliability, and tend to indicate that confidence tests (and especially the generalized confidence form) are somewhat indifferent to item discrimination. The generalized confidence test used here returned only slightly weakened reliability on the 30 least discriminating items compared to the 30 most discriminating. The traditional test was reduced from a good test on the most discriminating items to completely unusable on the least discriminating items. This reduces the return on the time and effort spent making sure items are maximally discriminating. The author is by no means suggesting that items should be built without care in any aspect, but simply that less time and effort may need to be spent fine-tuning decent items to make them into good or great items. On a confidence test those same items might already perform at the level of good or great.

There are also a number of practical areas in which confidence testing might show even greater use than found here. Most notable would be computerized adaptive testing. If confidence items collect more information with each item, then a better estimate of a test-taker's ability level can be made earlier in the test. Mindful of the stated problems with the current understanding of difficulty's relation to confidence testing, it is also the case that at least some participants would use an evenly distributed confidence weighting instead of guessing, reducing breaks in the predicted Guttman response pattern. As discussed above, confidence testing would also reduce the strain on item pools, as fewer items would be needed to reach the same results. Item exposure would thus be reduced, increasing test security.

Confidence testing also creates a higher-resolution pattern of data which could be useful in the detection of invalid response patterns and cheating. On a traditional test, a Guttman error in a response string is just a Guttman error – the more of them there are, the more one may begin to suspect that a participant had prior knowledge about some of the harder items. Confidence ratings introduce another level to this. Someone with prior knowledge of the test would not only be getting the most difficult items correct (which could be a product of good guessing on a traditional test), but may also be getting them correct with full confidence – a much less likely prospect. A cheater would then face the dilemma of having to report less than full confidence (and receive less than full points) on some items in order to keep their test from attracting increased scrutiny.
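A toy version of such a screen is sketched below (a hypothetical illustration, not a procedure used in this study; the 'hard item' cutoff, the full-confidence threshold, and the index itself are all assumptions). The idea is simply that a test-taker of middling overall accuracy who answers the very hardest items correctly at full confidence is harder to explain by luck than one who merely guesses well.

def suspicion_index(responses, p_values, hard_cutoff=0.20, full_confidence=5):
    # responses: list of (correct: bool, confidence: int) per item.
    # p_values:  item difficulties (proportion correct in the norm group).
    # Returns (overall accuracy, count of hard items correct at full confidence).
    n_correct = sum(1 for correct, _ in responses if correct)
    accuracy = n_correct / len(responses)
    flagged = sum(
        1
        for (correct, conf), p in zip(responses, p_values)
        if correct and conf >= full_confidence and p <= hard_cutoff
    )
    return accuracy, flagged

# Hypothetical 10-item record: middling overall accuracy, yet the two hardest
# items (p = .05 and .08) were both answered correctly at full confidence.
p_vals = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.08, 0.05]
record = [(True, 5), (True, 4), (True, 5), (False, 2), (True, 3),
          (False, 1), (False, 2), (False, 1), (True, 5), (True, 5)]

print(suspicion_index(record, p_vals))  # (0.6, 2) -> worth a closer look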
As a teaching tool, confidence testing may have potential uses even without reliability or validity benefits. Consider the finding, framed above as a shortcoming, that test-takers spent more time taking the confidence tests than the traditional test. Participants were not simply sitting there wasting time, nor is the extra time they took fully explained by the extra mouse clicks required. Participants who used confidence ratings were, in whatever way, thinking more about the items and their responses. The present test was not ideal for exploiting this, since there was little to be learned while taking it, but other tests – especially in a classroom setting – could be constructed to make the most of this extended period of test-taker engagement. Additionally, an instructor could use student data to see where common mistakes or areas of uncertainty are occurring in order to inform future teaching plans.

Limitations and Future Research

In addition to a number of positive findings, this study does have a number of limitations. These limitations highlight a number of prime areas and ideas for future study. This study used a low-stakes test on a sample of college students. There is absolutely no guarantee that any of these results would transfer to a high-stakes test, and it is impossible to say to what populations these results could be generalized. These results should be tested on populations other than college students, with different types of tests, and in higher-stakes situations. It would be useful to know whether these forms of confidence tests work for younger students or for average members of the workforce, which would allow for implementation in schools and in selection and workplace situations, respectively. Testing in actual classroom studies would also allow for examination of how test-takers use these methods in higher-stakes situations. High-stakes tests might also change the impact of individual difference measures on confidence ratings.

This study also used a test that was narrowly focused on verbal reasoning in the form of analogies. It is unclear whether these same findings should be expected for a reading or math test, and the different ways that individuals view and approach verbal and math tests suggest that differences may appear. Replication on other types of tests would be useful to show that confidence testing works in a variety of situations. A study could use a test similar to this one alongside comparable tests (such as a math test or a subject-specific test) to allow for testing differences between them. Hopefully these differences are minimal, if they exist at all. It is also unclear from this study how these results would hold or change over time as individuals became more familiar with how to take a confidence test. Future study with multiple tests over time on the same sample thus stands to answer a number of questions.
Participants could be administered a series of parallel forms over the course of a few months, similar to what might be expected in a classroom setting (or even using a classroom setting). This would allow test-takers to become accustomed to the method of confidence testing beyond the short period of time they were involved in this study. Examination of confidence tests after multiple sessions of taking them would give a better picture of what the day-to-day results of confidence testing are likely to be in informed populations. As with all tests, individuals will no doubt soon find the best strategies for using confidence testing.

Multiple tests over a span of time would also allow examination of whether or not test-takers internalize more knowledge from the test as a product of making confidence ratings. Specifically, this would be able to show whether test-takers' scores change differentially as a product of taking confidence tests relative to control groups taking traditional tests. Individuals should learn information over time, but it could be tested whether that degree of learning is greater in confidence testing conditions. If confidence testing produces larger gains in learning, that becomes a larger selling point for using confidence tests as a teaching tool.

This study used a test with 60 items, which seems to be a reasonable length for keeping test-taker attention and motivation. However, examining aspects such as difficulty and discrimination (and their interactions) meant comparing shorter subsets of items, sacrificing test length in order to make more comparisons. This sacrifice of depth for breadth was necessary in this study but limits the extent to which some of these effects can be reasonably examined. Notably, the future use (and careful documentation!) of tests of varying degrees of difficulty relative to the sample would also be a good step in untangling exactly how test difficulty is related to the use and benefits of confidence testing, and what the reasonable boundary conditions of use are. A study in which groups of test-takers complete tests of fixed and more stable difficulty, but varying in difficulty between groups, all with confidence testing, would likely produce larger effects that would be easier to detect and clearer to interpret than the spectrum of difficulty in this study.

This test was also limited by the fact that the presence of items with good distracter choices was not experimentally controlled – this demands careful future study. With planning it would not be difficult to use good distracter choices as an experimental manipulation. Two groups could take the same test items, but good distracter choices could be implemented on one test and not the other. Care would also need to be taken to ensure that the distracter choices' effect on discrimination was controlled for during this implementation. Distracter choices should have an impact on item discrimination, though careful and specific construction, manipulation, and selection of items can disentangle this component to allow for a clear picture of the effect on difficulty. It is also possible that a selection of items that do have an impact on discrimination could be examined at the same time in order to determine whether any notable effects on discrimination exist. This is the cleanest way of attempting to understand these impacts, as conclusions drawn from the current study are exploratory at best.
This study is limited by the fact that it only examined two forms of confidence testing, though these were suspected to be among the most powerful. There are therefore a number of future research possibilities related to how different forms of confidence testing may work, and which may be ideal. This study was able to find two forms of data collection that worked relatively well, but there are plenty more in the prior literature. These two were picked with the hope that they might be among the strongest, but this should by no means limit work that tries to show that other implementations are stronger. Future studies could be implemented in a similar fashion to this one, but with different methods of confidence testing.

For example, generalized confidence testing, as collected, appears not to be worth the extra effort beyond collecting a simplified version. The benefits of the generalized condition are also present in the simplified generalized condition. That said, it may be that the simplified version only worked because it was a distillation of a more complex collection. If individuals did not actually have the 10-point scale to work with, it is hard to say just how they might act. This study is limited in its inability to examine this question in full – a high priority is collecting data using this simplified method to see whether it works as well when collected in this form instead of simplified to this form. That is, collecting data where test-takers have the possibility to select answers as correct and eliminate answers as incorrect, but not to rate them further on any additional scale. It is possible that test-takers may find it, in this simplified form, more desirable than the full generalized confidence form. If a collection of the simplified version replicates the benefits found here, it would be extremely easy to implement in almost all testing situations.

The sample size of this study, while sufficient to show most effects with some confidence, was only large enough to rule out medium and large effects of individual differences on the use of confidence testing. It is argued that smaller effects may not be large enough to cause problems, but this study does not have the power to say that they do not exist. It may also be the case that other individual differences that were not measured here have an impact on confidence, so future work should include other individual difference measures when the theoretical case can be made for their inclusion. The fact that this test was low stakes might also have activated different psychological processes than would normally be activated on a higher-stakes test, as would be found in a classroom setting. Replicating these individual difference results on higher-stakes tests and in different settings would be useful in understanding how individual differences may or may not impact confidence ratings. Future work should examine how individual differences (and other measures of those differences) impact the use of confidence testing, with the collection of larger samples a high priority.

Because many of these limitations are distinct, several could be studied in parallel. Some prioritization may still be helpful, however. The highest priority is likely the concrete establishment of one or two forms of confidence testing that work well. These might be the two from this study, or a collection of the simplified version of generalized confidence testing.
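As one concrete illustration of that simplified collection, the sketch below scores a single item when test-takers may only mark each option as believed correct, eliminated, or left alone (the scoring rule shown, which spreads one point of credit evenly across the options marked as possibly correct, is an assumption made for the example; in this study the simplified scores were derived from the full 10-point generalized ratings rather than collected in this form).

def score_simplified(marks, keyed_option):
    # marks: dict mapping option label -> 'correct', 'eliminated', or 'unmarked'.
    # Credit: one point spread evenly over the options marked 'correct';
    # if nothing is marked 'correct', spread it over the non-eliminated options.
    believed = [opt for opt, m in marks.items() if m == "correct"]
    if not believed:
        believed = [opt for opt, m in marks.items() if m != "eliminated"] or list(marks)
    return (1.0 / len(believed)) if keyed_option in believed else 0.0

# Test-taker is torn between A and B, has ruled out D and E (keyed answer: B).
marks = {"A": "correct", "B": "correct", "C": "unmarked",
         "D": "eliminated", "E": "eliminated"}
print(score_simplified(marks, "B"))                              # 0.5
print(score_simplified({o: "unmarked" for o in "ABCDE"}, "B"))   # 0.2, a pure hedge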
A number of small studies could prove useful before moving on to other topics. The next highest priority is understanding just how difficulty impacts confidence testing. This can be accomplished through a number of the studies described above – the first studies should examine full tests of different difficulty, the experimental manipulation of distracter choices, and perhaps even more simulation in order to see whether observed behaviors can be properly modeled. After difficulty is better understood, generalizability is a clear next priority. Showing that confidence tests can be used in a wider range of situations (or how they might function differently in different situations) would help to create more practical uses for these methods. Once these topics are better understood, final priority would likely fall to administering multiple tests over time and a larger attempt to understand how individual differences might play a role.

It is tempting to say that this paper raises as many questions as it answers, and that may well be the case. It is argued that these new questions have at least some direction, and provide concrete steps that can be undertaken to understand confidence testing in new ways. Prior studies did little to understand why results were inconsistent. This study found a number of reasons to explain just why they might have been. Test difficulty seems to be one of the largest factors, and future studies will hopefully shed light on exactly how this can be taken into account to provide the best possible outcomes for test writers, test-takers, and test users.

APPENDICES

Appendix A – Test Items Selected From Pilot #6

Difficulty and discrimination noted in parentheses (difficulty; discrimination). 1. Bird : Nest :: (.49; .36) Dog : Doghouse Squirrel : Tree Beaver : Dam Cat : Litter Box Book : Library 2. Dalmation : Dog :: (.75; .29) Oriole : Bird Horse : Pony Shark : Great White Ant : Insect Stock : Savings 3. Maceration : Liquid :: (.05; .04) Sublimation : Gas Evaporation : Humidity Trail : Path Erosion : Weather Decision : Distraction 4. Bellow : Fury :: (.69; .39) Snicker : Hatred Hiss : Joy Giggle : Dread Yawn : Excitement Gasp : Surprise 5. Mason : Stone :: (.77; .30) Soldier : Weapon Lawyer : Law Blacksmith : Forge Teacher : Pupil Carpenter : Wood 6. Toe : Knee :: Finger : (.80; .41) Arm Wrist Elbow Hand Shoulder 7. Repel : Lure :: (.49; .27) Dismount : Devolve Abrogate : Deny Abridge : Shorten Enervate : Weaken Miscarry : Succeed 8. Fierce : Timid :: Aggressive : (.85; .35) Weird Frigid Assertive Cold Passive 9. Coax : Blandishments :: (.48; .16) Amuse : Platitudes Compel : Threats Deter : Tidings Batter : Insults Exercise : Antics 10. Quiet : Sound :: Darkness : (.95; -.11) Cellar Sun Noise Stillness Light 11. Tall : Enormous :: Short : (.95; .24) Tiny Medium Gigantic Ravenous Hungry 12. Intransigent : Flexibility :: (.08; .37) Transient : Mobility Disinterested : Partisanship Dissimilar : Variation Progressive : Transition Ineluctable : Modality 13. Deference : Respect :: (.22; .15) Admiration : Jealousy Condescension : Hatred Affection : Love Pretence : Truth Gratitude : Charity 14. Tragedy : Drama :: (.37; .06) Farce : Actor Cartoon : Film Prosody : Poem Accident : Ambulance Epigram : Anecdote 15. Butterfly : Caterpillar :: Frog : (.91; .41) Fish Amphibian Frog Toad Tadpole 16. Century : 100 :: Decade : (.97; .28) 1000 10 Millennium Score 1/10 17. Square : Cube :: Circle : (.92; .08) Rhombus Corner Sphere Cylinder Ellipsoid 18.
Prison : Criminals :: Orchard : (.58; .20) Fruit Trees Pickers Grass Oranges 19. Success : Elation :: Failure : (.85; .25) Fear Euphoria Contagion Hospitality Depression 20. Lion : Mammal :: Bumblebee : (.95;.22) Amphibian Hive Aphid Insect Winged 22. Tenet : Theologian :: (.43; .34) Predecessor : Heir Hypothesis : Biologist Recluse : Rivalry Arrogance : Persecution Guitarist : Rock Band 23. Walk : Legs :: (.66; .07) Blink : Eyes Chew : Mouth Dress : Hem Cover : Book Grind : Nose 108 24. Whisper : Shout :: Walk : (.91; -.07) Command Fly Gallop Tiptoe Run 25. Good : Better :: Bad : (.88; .09) Worst Worse Better Badder Best 26. Circle : Sphere :: Square : (.95; .20) Globe Oblong Cube Diamond Box 27. Scintillating : Dullness :: (.68; .42) Erudite : Wisdom Desultory : Error Boisterous : Calm Cautious : Restraint Exalted : Elevation 28. Much : More :: Little : (.88; .10) All Less A Lot Big Small 29. Week : Fortnight :: Solos : (.38; .45) Choir Singers Trio Twins 109 Duet 30. Morbid : Unfavorable :: (.69; .35) Reputable : Favorable Maternal : Unfavorable Disputations : Favorable Vigilant : Unfavorable Lax : Favorable 31. Flower : Bouquet :: Link : (.89; .31) Connect Event Chain Joining Braid 32. Rite : Ceremony :: (.37; .23) Magnitude : Size Affliction : Blessing Clamor : Silence Pall : Clarity Agitation : Calm 33. Hackneyed : Freshness :: (.26; .16) Stale : Porosity Facile : Delicacy Ponderous : Lightness Central : Vitality Relevant : Pertinence 34. Inflate : Bigger :: (.92; .24) Revere : Lower Elongate : Shorter Fluctuate : Longer Mediate : Higher Diminish : Smaller 35. Author : Literate :: (.57; .49) Cynic : Gullible 110 Hothead : Prudent Saint : Notorious Judge : Impartial Doctor : Fallible 36. Attenuate : Signal :: (.11; .19) Exacerbate : Problem Modify : Accent Dampen : Enthusiam Elongate : Line Dramatize : Play 37. Humdrum : Bore :: (.23; .49) Grim : Amuse Nutritious : Sicken Stodgy : Excite Heartrending : Move Pending : Worry 38. Reinforce : Stronger :: (.94; .25) Abound : Lesser Dismantle : Longer Wilt : Higher Shirk : Greater Erode : Weaker 39. Braggart : Modesty :: (.37; .47) Fledgling : Experience Embezzler : Greed Wallflower : Timidity Invalid : Malady Candidate : Ambition 40. Ski : Snow :: (.91; .18) Drive : Car Gold : Putt Dance : Step Skate : Ice Ride : Horse 41. Verify : True :: (.92; .29) 111 Signify : Cheap Purify : Clean Terrify : Confident Ratify : Angry Mortify : Relaxed 42. Tarantula : Spider :: (.35; .39) Mare : Stallion Milk : Cow Fly : Parasite Sheep : Grass Drone : Bee 43. Prosaic : Mundane :: (.25; -.21) Obdurate : Foolish Ascetic : Austere Clamorous : Captive Loquacious : Taciturn Peremptory : Spontaneous 44. Doctor : Hospital :: (.91; .20) Sports Fan : Stadium Cow : Farm Professor : College Criminal : Jail Food : Grocery Store 45. Cub : Bear :: (.97; .34) Piano : Orchestra Puppy : Dog Cat : Kitten Eagle : Predator Fork : Utensil 46. Salacious : Wholesome :: (.34; .48) Religious : Private Expensive : Profligate Conservative : Stoic Mendacious : Truthful Fulsome : Generous 112 47. Elected : Inauguration :: (.52; .57) Enrolled : Graduation Condemned : Execution Chosen : Selection Gathered : Exhibition Appointed : Interview 48. Dividend : Stockholder :: (.14; .32) Patent : Inventor Royalty : Author Wage : Employer Interest : Banker Investment : Investor 49. Mince : Walk :: (.45; .17) Bang : Sound Wave : Gesture Waltz : Dance Simpler : Smile Hike : Run 50. 
Disinterested : Unbiased :: (.18; .32) Indulgent : Intolerant Exhausted : Energetic Languid : Lethargic Unconcerned : Involved Profilgate : Flippant 51. Temper : Hard :: (.08; .17) Mitigate : Severe Provoke : Angry Endorse : Tough Infer : Certain Scrutinize : Clear 52. Stanza : Poem :: (.83; .32) Chaper : Novel Prose : Verse Stave : Music 113 Song : Chorus Overture : Opera 53. Stoic : Fortitude :: (.32; .12) Benefactor : Generosity Heretic : Faith Eccentric : Ineptitude Miser : Charity Soldier : Bravery 54. Ambivalent : Certain :: (.11; -.18) Indifferent : Biased Furtive : Open Impecunious : Voracious Discreet : Careful Munificent : Generous 55. Allay : Suspicion :: (.29; .21) Tend : Plant Impede : Anger Calm : Fear Fell : Tree Exacerbate : Worry 56. Primitive : Sophisticate :: (.14; .09) Employer : Superior Socialite : Recluse Tyro : Expert Native : Inhabitant Applicant : Member 57. Authoritarian : Lenient :: (.38; .47) Philanthropist : Generous Virtuoso : Glamorous Hedonist : Indulgent Servant : Servile Miser : Charitable 58. Perennial : Ephemeral :: (.45; .23) 114 Volatile : Evanescent Mature : Ripe Diurnal : Annual Permanent : Temporary Majestic : Mean 59. Needle : Pine :: (.60; .40) Pin : Cloth Trunk : Tree Flower : Leaf Spine : Cactus Stalk : Root 60. Anecdote : Story :: (.38; .40) Ballad : Song Novel : Chapter Limerick : Poem Prose : Poetry Overture : Opera 115 Appendix B – Manipulation Check 1) Answer this question as if you couldn’t decide between answer choice ‘A’ and answer choice ‘B’, but felt slightly more certain of answer choice ‘B’. a. b. c. d. e. 2) Answer this question as if you were fairly certain that answer choice ‘C’ was correct. a. b. c. d. e. 3) Answer this question as if you had no idea as to which answer choice is correct. a. b. c. d. e. 4) Answer this question as if you were certain that answer choice ‘D’ and answer choice ‘E’ were incorrect, but had no strong idea of which of the remaining was correct. a. b. c. d. e. 5) Answer this question as if you were fairly certain answer choice ‘A’ was correct, but had some thoughts that there was a chance that answer choice ‘B’ was correct. a. b. c. d. e. 116 Appendix C – Generalized Anxiety Items From “The Items in Each of the Preliminary IPIP Scales Measuring Constructs Similar to Those in the NEO-PI-R” http://ipip.ori.org/newNEOKey.htm#Anxiety In general, I… N1: ANXIETY (Alpha = .83) + keyed Worry about things. Fear for the worst. Am afraid of many things. Get stressed out easily. Get caught up in my problems. – keyed Am not easily bothered by things. Am relaxed most of the time. Am not easily disturbed by events. Don't worry about things that have already happened. Adapt easily to new situations. 12345- Strongly Disagree Disagree Neither Agree or Disagree Agree Strongly Agree 117 Appendix D – Test Anxiety Items from Taylor and Deane (2002) 1) 2) 3) 4) 5) 12345- During tests I feel very tense. I wish examinations did not bother me so much. I seem to defeat myself while working on important tests. I feel very panicky when I take an important test. During examinations I get so nervous I forget facts I really know. Strongly Disagree Disagree Neither Agree or Disagree Agree Strongly Agree 118 Appendix E – Risk Taking Items From “The Items in the 15 Preliminary IPIP Scales Measuring Constructs Similar to Those in the Jackson Personality Inventory (JPI-R)” http://ipip.ori.org/newJPI-RKey.htm#Risk-Taking Generally, I… RISK-TAKING (JPI: Risk Taking [Rkt]) [.78] + keyed Enjoy being reckless. Take risks. Seek danger. 
Know how to get around the rules. Am willing to try anything once. Seek adventure. – keyed Would never go hang-gliding or bungee-jumping. Would never make a high risk investment. Stick to the rules. Avoid dangerous situations. 12345- Strongly Disagree Disagree Neither Agree or Disagree Agree Strongly Agree

Appendix F – Cautiousness Items

From "The Items in Each of the Preliminary IPIP Scales Measuring Constructs Similar to Those in the 30 NEO-PI-R Facet Scales" http://ipip.ori.org/newNEOFacetsKey.htm Generally, I… C6: CAUTIOUSNESS (.76) + keyed Avoid mistakes. Choose my words with care. Stick to my chosen path. – keyed Jump into things without thinking. Make rash decisions. Like to act on a whim. Rush into things. Do crazy things. Act without thinking. Often make last-minute plans. 12345- Strongly Disagree Disagree Neither Agree or Disagree Agree Strongly Agree

Appendix G – Self-Efficacy Items

From "The Items in Each of the Preliminary IPIP Scales Measuring Constructs Similar to Those in the 30 NEO-PI-R Facet Scales" http://ipip.ori.org/newNEOFacetsKey.htm When taking standardized tests, I… C1: SELF-EFFICACY (.78) + keyed Complete tasks successfully. Excel in what I do. Handle tasks smoothly. Am sure of my ground. Come up with good solutions. Know how to get things done. – keyed Misjudge situations. Don't understand things. Have little to contribute. Don't see the consequences of things. 12345- Strongly Disagree Disagree Neither Agree or Disagree Agree Strongly Agree

Figure 1 – The Item Characteristic Curve. NOTE: For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
Figure 2 – Item Characteristic Curve for the Nominal Response Model.
Figure 3 – Hierarchical Framework for Modeling Response Time and Response (van der Linden, 2007, p295).
Figure 4: Test reliability as a function of test length. [Line graph: Cronbach's alpha (.300 to 1.000) by number of items (5 to 60) for the 5-point allocation, 2-point allocation, and traditional test.]
Figure 5: Benefit of confidence testing as a function of test length. [Line graph: increase in Cronbach's alpha versus the traditional method by number of items for the 5-point and 2-point allocations.]
Figure 6: Test reliability as a function of test length and difficulty. [Line graph: Cronbach's alpha by number of items for easy and difficult tests under 5-point, 2-point, and traditional scoring.]
Figure 7: Benefits of confidence testing as a product of test length and difficulty. [Line graph: increase in Cronbach's alpha by number of items for easy, normal, and difficult tests under 5-point allocation.]
Figure 8: Test reliability as a function of test length and discrimination. [Line graph: Cronbach's alpha by number of items for low- and high-discrimination tests under 5-point, 2-point, and traditional scoring.]
Figure 9: Benefits of confidence testing as a product of test length and discrimination. [Line graph: increase in Cronbach's alpha by number of items for low-discrimination, normal, and high-discrimination tests under 5-point allocation.]
Figure 10 – Distribution of confidence scores in 5-point confidence condition. [Bar chart: percent of responses at each confidence point, zero through five.]
Figure 11 – Distribution of confidence scores in generalized confidence condition. [Bar chart: percent of responses at each weighting, one through ten.]
Figure 12 – Reliability by Test Length and Condition. [Line graph: Cronbach's alpha by number of items (5 to 60) for the control, 5-point, generalized, and generalized (simplified) conditions.]
Figure 13 – Reliability Benefits of Confidence Testing Relative to Control. [Line graph: increase in Cronbach's alpha relative to control by number of items for the 5-point, generalized, and generalized (simplified) conditions.]
Figure 14 – Validity (College GPA) by Length. [Line graph: validity coefficient against college GPA by number of items, by condition.]
Figure 15 – Validity (ACT score) by Length. [Line graph: validity coefficient against ACT score by number of items, by condition.]
Figure 16 – Average Time of Test by Length of Test, by Condition. [Line graph: test time in seconds by number of items for the control, 5-point, and generalized conditions.]
Figure 17 – Comparison of Item Discrimination by Test and Item Difficulty. [Scatter plot: item discrimination by item difficulty rank for the control and generalized confidence tests.]
Figure 18 – Reliability by Test Difficulty. [Bar chart: Cronbach's alpha for difficult versus easy items, by condition.]
Figure 19 – Reliability by Test Discrimination. [Bar chart: Cronbach's alpha for high- versus low-discrimination items, by condition.]
Figure 20 – Reliability by Discrimination (Easier Items). [Bar chart: Cronbach's alpha for low- versus high-discrimination items among the easier items, by condition.]
Figure 21 – Reliability by Discrimination (More Difficult Items). [Bar chart: Cronbach's alpha for low- versus high-discrimination items among the more difficult items, by condition.]
Figure 22 – Reliability by difficulty on only the 20 easiest items. [Bar chart: Cronbach's alpha for difficult versus easy items among the 20 easiest, by condition.]

Table 1: A Table of Responses and Weights.

Response            Weight for A   Weight for B   Weight for C
Very sure of A          1              0              0
Fairly sure of A        0.66           0.16           0.16
Very sure of B          0              1              0
Fairly sure of B        0.16           0.66           0.16
Very sure of C          0              0              1
Fairly sure of C        0.16           0.16           0.66
Pure guess              0.33           0.33           0.33
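A minimal sketch of how the response-to-weight mapping in Table 1 might be applied in scoring is shown below (the function and input format are hypothetical; the weights are those listed in the table, and an item's score is simply the weight the response places on the keyed option).

# Weights from Table 1: how a test-taker's stated certainty about one option
# translates into credit spread across the three options A, B, and C.
WEIGHTS = {
    ("A", "very sure"):   {"A": 1.00, "B": 0.00, "C": 0.00},
    ("A", "fairly sure"): {"A": 0.66, "B": 0.16, "C": 0.16},
    ("B", "very sure"):   {"A": 0.00, "B": 1.00, "C": 0.00},
    ("B", "fairly sure"): {"A": 0.16, "B": 0.66, "C": 0.16},
    ("C", "very sure"):   {"A": 0.00, "B": 0.00, "C": 1.00},
    ("C", "fairly sure"): {"A": 0.16, "B": 0.16, "C": 0.66},
    (None, "pure guess"): {"A": 0.33, "B": 0.33, "C": 0.33},
}

def item_score(chosen, certainty, keyed):
    # Credit earned on one item: the weight the response places on the keyed option.
    return WEIGHTS[(chosen, certainty)][keyed]

print(item_score("B", "fairly sure", keyed="B"))   # 0.66
print(item_score("A", "very sure", keyed="B"))     # 0.00
print(item_score(None, "pure guess", keyed="B"))   # 0.33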
Table 2: A Table of Responses and Integers.

Response            PA* to A   PA to B   PA to C
Very sure of A          6          0         0
Fairly sure of A        4          1         1
Very sure of B          0          6         0
Fairly sure of B        1          4         1
Very sure of C          0          0         6
Fairly sure of C        1          1         4
Pure guess              2          2         2

* PA = points allocated

Table 3: Results of Pilot Data Simulation #1

Scoring                   Correlation with True Score   R-Square Value   R-Square Change (from standard)   Test Reliability
Standard Scoring                  .9453                     .8937                n/a                             .907
Raw probability                   .9857                     .9717                .0780                           .998
Ten-point allocation              .9851                     .9705                .0768                           .997
Five-point allocation             .9871                     .9744                .0807                           .996
Three-point allocation            .9824                     .9652                .0715                           .993
Two-point allocation              .9812                     .9629                .0692                           .988

Table 4: Results of Pilot Data Simulation #2

Scoring                   Correlation with True Score   R-Square Value   R-Square Change (from standard)   Test Reliability
Standard Scoring                  .9453                     .8937                n/a                             .907
Five-point allocation             .9845                     .9693                .0755                           .994
Two-point allocation              .9770                     .9545                .0608                           .976

Table 5: Test reliability as a function of test length.

Number of items        5     10     15     20     25     30     35     40     45     50     55     60
5-point allocation   .944   .967   .978   .984   .987   .988   .990   .991   .992   .993   .994   .994
2-point allocation   .802   .891   .917   .939   .948   .953   .960   .964   .967   .971   .974   .976
Traditional test     .510   .696   .754   .797   .823   .838   .856   .868   .877   .888   .900   .907

Green shows reliability above .90, yellow above .80, blue above .70, and red below .70. NOTE: For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Table 6: Effect of test difficulty on correlation with 'True Score' for 60-item test.

                            Correlation with True Score   R-Square Value   R-Square Change (from traditional)
Normal Test
  Five-point allocation             .9845                     .9693                .0755
  Two-point allocation              .9770                     .9545                .0608
Easy Test
  Five-point allocation             .9657                     .9326                .0793
  Two-point allocation              .9577                     .9172                .0639
Difficult Test
  Five-point allocation             .9104                     .8289                .1759
  Two-point allocation              .8931                     .7977                .1447

Table 7: Effect of test discrimination on correlation with 'True Score' for 60-item test.

                            Correlation with True Score   R-Square Value   R-Square Change (from traditional)
Normal Test
  Five-point allocation             .9845                     .9693                .0755
  Two-point allocation              .9770                     .9545                .0608
Low Discrim. Test
  Five-point allocation             .9895                     .9792                .1125
  Two-point allocation              .9759                     .9524                .0857
High Discrim. Test
  Five-point allocation             .9803                     .9610                .0562
  Two-point allocation              .9751                     .9509                .0462

Table 8 – Reliability and Validity Across Conditions for 60 Item Test.

Test Form                       Cronbach's Alpha
Control                              0.761
5-Point Confidence                   0.803
Generalized Con.                     0.862
Generalized Con. (Simplified)        0.862

Validity correlations with College GPA and ACT (correlations significant at the p < .05 level starred): 0.237 0.710* 0.385* 0.295* 0.299* 0.571* 0.631* 0.628*

Table 9 – Reliability by Test Length and Condition.

Number of items   Control   5-Point   Generalized   Generalized (Simplified)
5                  0.211     0.405       0.360             0.308
10                 0.371     0.565       0.645             0.624
15                 0.387     0.503       0.614             0.588
20                 0.462     0.614       0.730             0.713
25                 0.445     0.632       0.759             0.746
30                 0.575     0.700       0.799             0.792
35                 0.616     0.704       0.833             0.828
40                 0.672     0.747       0.848             0.843
45                 0.681     0.767       0.870             0.863
50                 0.732     0.790       0.866             0.866
55                 0.729     0.789       0.862             0.863
60                 0.761     0.803       0.862             0.862

Green shows reliability above .80, yellow above .70, blue above .60, and red below .60.

Table 10 – Correlations of Individual Differences with CT Use (5-point condition).
                               Use of Confidence Testing
                               Pearson Correlation   Sig. (2-tailed)    N
Test-Specific Self-Efficacy          -.126                .335          61
General Anxiety                      -.203                .117          61
Test Anxiety                          .149                .250          61
Risk Taking                          -.028                .828          61
Cautiousness                         -.098                .454          61

Table 11 – Correlations of Individual Differences with CT Use (Generalized condition).

                               Use of Confidence Testing
                               Pearson Correlation   Sig. (2-tailed)    N
Test-Specific Self-Efficacy          -.167                .211          58
General Anxiety                      -.039                .772          58
Test Anxiety                          .248                .061          58
Risk Taking                           .170                .202          58
Cautiousness                         -.132                .324          58

Table 12 – Correlations of Individual Differences with CT Use, Controlling for Overall Score (5-Point Condition).

                               Use of Confidence Testing
                               Pearson Correlation   Sig. (2-tailed)    df
Test-Specific Self-Efficacy           .024                .854          58
General Anxiety                      -.214                .101          58
Test Anxiety                          .136                .301          58
Risk Taking                          -.112                .396          58
Cautiousness                          .064                .626          58

Table 13 – Correlations of Individual Differences with CT Use, Controlling for Overall Score (Generalized Condition).

                               Use of Confidence Testing
                               Pearson Correlation   Sig. (2-tailed)    df
Test-Specific Self-Efficacy          -.138                .305          55
General Anxiety                      -.037                .784          55
Test Anxiety                          .100                .459          55
Risk Taking                           .026                .847          55
Cautiousness                         -.056                .681          55

Table 14 – Correlations of Individual Differences with CT Use, Controlling for Overall Score (Both Confidence Conditions).

                               Use of Confidence Testing
                               Pearson Correlation   Sig. (2-tailed)    df
Test-Specific Self-Efficacy          -.054                .564          116
General Anxiety                      -.076                .411          116
Test Anxiety                          .008                .928          116
Risk Taking                          -.023                .804          116
Cautiousness                          .028                .765          116

REFERENCES

Adams, J.K. (1961). Realism of Confidence Judgments. Psychological Review, 68, 33-45.

Baranski, J.V. & Petrusic, W.M. (1998). Probing the Locus of Confidence Judgments: Experiments on the Time to Determine Confidence. Journal of Experimental Psychology: Human Perception and Performance, 3, 929-945.

Bock, R. D. (1972). Estimating item parameters and latent ability when the responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Coombs, C.H. (1953). On the use of objective examinations. Educational and Psychological Measurement, 13, 308-310.

de Finetti, B. (1965). Methods for discriminating levels of partial knowledge concerning a test item. British Journal of Mathematical and Statistical Psychology, 13, 87-123.

Dressel, P.L. & Schmid, J. (1953). Some modifications of the multiple-choice item. Educational and Psychological Measurement, 13, 574-595.

Ebel, R.L. (1965). Confidence Weighting and Test Reliability. Journal of Educational Measurement, 2, 49-57.

Ebel, R.L. (1968). Review of 'Valid Confidence Testing – Demonstration Kit'. Journal of Educational Measurement, 5, 353-354.

Echternacht, G.J. (1972). The use of confidence testing in objective tests. Review of Educational Research, 42, 217-236.

Erev, I., Wallsten, T.S., & Budescu, D.V. (1994). Simultaneous Over- and Underconfidence: The Role of Error in Judgment Processes. Psychological Review, 101, 519-527.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Hambleton, R.K., Roberts, D.M., & Traub, R.E. (1970). A comparison of the reliability and validity of two methods for assessing partial knowledge on a multiple-choice test. Journal of Educational Measurement, 7, 75-82.

Hevner, K. (1932). A method of correcting for guessing in true-false tests and empirical evidence in support of it. The Journal of Social Psychology, 3, 359-362.

Huang, J.L., Curran, P.G., Keeney, J., Poposki, E.M., & DeShon, R.P. (In Press). Detecting and Deterring Insufficient Effort Responding to Surveys. Journal of Business and Psychology.

Jacobs, S.S. (1971). Correlates of unwarranted confidence in responses to objective test items. Journal of Educational Measurement, 8, 15-20.

Johnson, J. A. (2005). Ascertaining the validity of individual protocols from web-based personality inventories. Journal of Research in Personality, 39, 103-129.

Koehler, R.A. (1971). A comparison of the validities of conventional choice testing and various confidence marking procedures. Journal of Educational Measurement, 8, 297-303.

Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for Confidence. Journal of Experimental Psychology: Human Learning and Memory, 6, 107-118.

Krueger, N. & Dickson, P.R. (1994). How believing in ourselves increases risk taking: Perceived Self-Efficacy and Opportunity Recognition. Decision Sciences, 25, 385-400.

Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. New Jersey: Lawrence Erlbaum Associates.

McKenzie, C.R.M., Wixted, J.T., Noelle, D.C., & Gyurjyan, G. (2001). Relation Between Confidence in Yes-No and Forced-Choice Tasks. Journal of Experimental Psychology: General, 130, 140-155.

Michael, J.J. (1968). The reliability of a multiple-choice examination under various test-taking instructions. Journal of Educational Measurement, 5, 307-314.

Pleskac, T.J. & Busemeyer, J.R. (2010). Two-Stage Dynamic Signal Detection: A Theory of Choice, Decision Time, and Confidence. Psychological Review, 117, 864-901.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut.

Reckase, M.D. (2009). Multidimensional Item Response Theory. New York: Springer.

Roskam, E.E. (1987). Toward a psychometric theory of intelligence. In E.E. Roskam & R. Suck (Eds.), Progress in mathematical psychology. Amsterdam: North-Holland.

Samejima, F. (1979). A new family of models for the multiple choice item (Research Report 79-4). Knoxville: University of Tennessee, Department of Psychology.

Sarason, I.G. & Stoops, R. (1978). Test anxiety and the passage of time. Journal of Consulting and Clinical Psychology, 46, 102-109.

Scheiblechner, H. (1985). Psychometric models for speed-test construction: The linear exponential model. In S.E. Embretson (Ed.), Test design: Developments in psychology and education. New York: Academic Press.

Selig, E.R. (1972). Confidence testing comes of age. Training and Development Journal, 26, 18-22.

Shuford, E.H. & Massengill, H.E. (1967). Valid Confidence Testing – Demonstration Kit. Boston: The Shuford-Massengill Corporation.

Soderquist, H.O. (1936). A new method of weighting scores in a true-false test. The Journal of Educational Research, 30, 290-292.
Thissen, D. (1983). Timed testing: An approach using item response theory. In D.J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Thissen, D., & Steinberg, L. (1984). A response model for multiple choice items. Psychometrika, 49, 501-519.

van der Linden, W.J. (2006). A Lognormal Model for Response Times on Test Items. Journal of Educational and Behavioral Statistics, 31, 181-204.

van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308.

van der Linden, W.J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33, 5-20.

van der Linden, W.J. (2010). Linking response-time parameters onto a common scale. Journal of Educational Measurement, 47, 92-114.

van der Linden, W.J., Entink, R.H.K., & Fox, J.P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34, 327-347.

Wainer, H. (2000). Computerized Adaptive Testing: A Primer. New Jersey: Lawrence Erlbaum Associates.