DIGIT SPAN AS A VALIDITY MEASURE IN PEDIATRIC ASSESSMENT

By

Tyler Micheal Ryan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of School Psychology – Doctor of Philosophy

2024

ABSTRACT

Performance validity testing during the neuropsychological assessment of pediatric populations is a growing area of clinical and research interest. Although the literature on validity testing with adult populations is extensive, the development and study of validity measures in pediatric assessment is in its infancy. The current study assessed the clinical utility of three indices derived from the Wechsler Intelligence Scale for Children – Fifth Edition's (WISC-V) Digit Span subtest as embedded performance validity measures. Reliable Digit Span, Reliable Digit Span-Revised, and the Digit Span age-corrected scaled score were assessed in a known-groups design using the Memory Validity Profile (MVP) as a criterion measure. Cutoff scores with adequate sensitivity and specificity were obtained for each of the three metrics, indicating that the WISC-V Digit Span subtest provides clinically relevant data regarding performance validity.

ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to my advisor, mentor, and chair of my committee: Dr. Jodene Fine. I could not have completed this project without her continued support and guidance. Her passion for the field of neuropsychology inspired both my research and career goals, and I cannot thank her enough for providing me with a path of entry to the field. Additionally, words cannot express my gratitude to my committee members: Dr. Ryan Bowles, Dr. Kristin Rispoli, and Dr. Adrea Truckenmiller. Each member of this committee has demonstrated continued flexibility and guidance throughout this journey. Also, I could not have undertaken this project without Dr. Jacobus Donders, who provided access to his clinical research data and served as a guiding hand through study design and analysis.

I would like to extend my sincere thanks to Dr. Roger Lauer and Dr. Renee Lajiness-O'Neill, who served as clinical mentors and provided valuable perspective derived from their combined clinical and research experience. I am also deeply grateful to Dr. Kate Wilson, whose research served as a foundation for this project. Additionally, her compassion and dedication to the care of underserved communities in the field of neuropsychology continually inspire my vision of practice.

I would be remiss in not mentioning my family for their continued support throughout this journey. Specifically, I would like to recognize my lovely wife Sara Ryan for her patience, tolerance, and professional support throughout this journey. This academic endeavor has taken our lives in various unforeseen directions, and her unwavering support and love have carried me through this lengthy trek.

TABLE OF CONTENTS

INTRODUCTION .......................................................................................................................... 1
LITERATURE REVIEW ............................................................................................................... 4
METHOD ..................................................................................................................................... 40
RESULTS ..................................................................................................................................... 55
DISCUSSION ............................................................................................................................... 78
REFERENCES ............................................................................................................................. 88

INTRODUCTION

Neuropsychological assessments are conducted to answer questions related to a client's intellectual functioning and behavior (Baron, 2018). These assessments rely on the use of measures that provide information across various domains of functioning. However, the information provided by these tests lacks clinical utility if the patient provides suboptimal effort. Thus, it is essential that neuropsychologists be able to identify clients who are not working to their full capacity during an assessment. Studies have indicated that clinicians have difficulty accurately identifying which individuals presenting for evaluation are providing adequate effort. As such, interest in the field of validity testing has blossomed.

Validity testing is the process of identifying invalid test performance using specially designed instruments or indices derived from readily administered assessment measures. The earliest examples of these tests date back to the 1940s (Rey, 1941). However, interest in the development, validation, and use of these measures gained popularity in the 1990s (Larrabee & Kirkwood, 2020). Most of the research and clinical interest in validity testing has historically pertained to neuropsychological evaluations of adults. High stakes in remuneration associated with lawsuits over personal injuries fueled the field. Thus, the emphasis on adult populations resulted from the incentive for secondary gain when malingering, or "faking" impairment, during evaluations. Cases involving the evaluation of mild traumatic brain injuries (mTBIs) gained particular interest (Slick et al., 1999). Those attempting to recover damages after accidents resulting in brain injury could be awarded substantial sums when impairment was observed in a neuropsychological evaluation. Incentive to malinger is, and has been, high.

Another reason why validity testing has been largely relegated to adult neuropsychological evaluation is the widely held belief that children lack the capacity for deception, which resulted in little interest in pediatric validity assessment. However, research into child neuropsychological test performance indicates that children may provide suboptimal effort for many reasons (Kirkwood, 2015). Notably, many neuropsychological instruments for child use were translated from adult measures, and this has also occurred with validity measures. Thus, use of adult validity instruments increased dramatically, and measures like the Test of Memory Malingering (TOMM; Tombaugh, 1996) were studied in pediatric populations (Donders, 2005). A recent survey found extremely high rates of validity testing endorsement among pediatric neuropsychologists (Brooks et al., 2016). Of the reported measures, one of the most popular was the Reliable Digit Span (RDS) metric provided by the Digit Span subtest of Wechsler intelligence measures like the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV; Wechsler, 2008). The RDS index is a simple cutoff calculation based on the longest number of digits held reliably (twice) in the Digits Forward and Digits Backward conditions as a raw sum. Over 65% of clinicians endorsed using the RDS metric at least occasionally when testing children (Brooks et al., 2016).
Despite this high level of endorsement, study of the RDS metric in pediatric populations is sparse. To date, only two studies have examined the utility of the RDS metric in pediatric populations. Kirkwood et al. (2011) developed RDS cutoff scores for a sample of children presenting with traumatic brain injuries, and Welsh et al. (2012) examined the utility of RDS in a sample of children with epilepsy. Both of these studies utilized the Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV; Wechsler, 2003). Since these studies were conducted, a new version of the measure, the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014), has been released. This edition fundamentally changed the Digit Span task in ways that potentially change RDS scores and allow for the calculation of a new revised RDS score: Reliable Digit Span-Revised (RDS-R; Schroeder et al., 2012).

As validity testing in pediatric populations gains popularity, the need for empirically validated validity measures increases. The current study assesses the clinical utility of three indices from the Digit Span subtest of the WISC-V in a mixed clinical sample. The results of this study serve to inform pediatric neuropsychologists who endorse use of these instruments in their clinical practice and to guide future research into these instruments.

LITERATURE REVIEW

Validity Testing Overview

Neuropsychological evaluations are conducted to better understand the relation between an individual's brain functioning and their observable performance (Baron, 2018). To better understand this relation, clinicians evaluate clients in several domains including intellectual development, memory, motor functioning, visual functioning, executive skills, academics, and psychosocial development. These evaluations typically include the completion of standardized tests across domains as well as the completion of questionnaires. The scores yielded are essential to clinical decision making, but they are not useful if clients provide suboptimal effort while performing the tasks. Suboptimal effort during an evaluation can result in underestimations of a client's abilities and may significantly distort the clinical picture, and thus the diagnosis, prognosis, and treatment plan.

The clinical picture derived from neuropsychological assessment guides treatment and, in addition, controls downstream resources. Thus, in some cases a person may find it advantageous to perform below their ability level or to exaggerate symptoms of impairment. In such cases, a person is engaging in malingering, meaning intentionally underperforming or exaggerating maladies. Although some individuals intentionally malinger, individuals exaggerate symptoms and underperform during psychological testing for numerous reasons (Sherman et al., 2020; Slick et al., 1999). In cases where money is involved, including cases with litigatory elements, test failure or poor performance may influence the legal process. In some cases, examinees may be discouraged by the process, afraid of failure, or simply stop trying due to fatigue (DeRight & Carone, 2015). Others underperform or exaggerate for fear that the examiner would not otherwise capture the full breadth of their impairment. Thus, accurately capturing effort throughout neuropsychological assessments can have serious financial, social, and treatment consequences for all parties involved.
It is important for neuropsychologists to be able to recognize when performance is genuinely low, and when an examinee is performing below their ability level or embellishing symptoms. To capture instances of both symptom exaggeration and underperformance, separate categories of validity testing have been established (Larrabee, 2012). Symptom validity refers to the self-reporting of symptoms, while performance validity refers to effort during task performance.

Symptom Validity Tests (SVTs). Symptom validity refers to the likelihood that symptoms are being reported truthfully by the client. Symptom validity tests (SVTs) are generally embedded in symptom self-report scales, such as the Minnesota Multiphasic Personality Inventory-3 (MMPI-3; Ben-Porath & Tellegen, 2020) and the Behavior Assessment System for Children – Third Edition (BASC-3; Reynolds & Kamphaus, 2015). Subscales within these instruments assess whether the response patterns have a reasonable probability of being reported by most people. They detect over-reporting and under-reporting of symptoms, random responding, and very unusual responses. Validity subscales are meant to indicate when people are 'faking good,' feigning symptoms, or either not comprehending the questions or willfully refusing to answer them. They provide an indication of how much confidence can be had when interpreting other scales.

Performance Validity Tests (PVTs). Performance validity tests (PVTs) are measures designed to detect suboptimal effort through direct testing. PVTs can be standalone measures, meaning that they are designed specifically for the assessment of performance validity, or embedded measures that are derived from commonly administered tests. Standalone PVTs are commonly disguised as tests of memory, as is the case with the Test of Memory Malingering (TOMM; Tombaugh, 1996) and the Memory Validity Profile (MVP; Sherman & Brooks, 2015b). These measures have strong face validity as measures of memory, meaning that the examinee feels as though they are being given a test of memory. However, the psychometric properties of these instruments and the specific type of memory they sample make them tasks that even those with severe brain disorders and injuries have a high likelihood of passing (McWhirter et al., 2020). Thus, there is a high likelihood that an individual who fails a PVT is doing so intentionally or for reasons other than genuine impairment.

In addition to standalone measures, researchers have developed several metrics from readily administered tests to serve as embedded validity indicators. These are generally subtests within a suite of tasks most often meant, again, to assess memory functioning. Forced-choice memory test conditions are a ruse in which the multiple-choice options are limited to two, so by chance alone any examinee would be expected to be correct at least 50% of the time. For example, the forced-choice condition of the California Verbal Learning Test – Third Edition (CVLT-3; Delis et al., 2017) yields good scores even when other subtasks in the suite do not, and these scores can be achieved by those with a high level of neurological disturbance. Individuals with severe brain injury can score well on this task because recognition of previously experienced sensory exposure is highly preserved in the brain (Erdodi et al., 2018). Thus, embedded validity indicators can be direct scores of tests, or scores derived from measures designed to evaluate other cognitive domains.
Both SVTs and PVTs play a vital role in helping clinicians identify those who may be feigning symptoms or failing to engage during assessment, yet they have been historically underutilized.

Why PVTs and SVTs are Necessary

Despite clinical training, neuropsychologists have low rates of success when determining which clients are performing below their ability level. In a landmark study, Heaton et al. (1978) examined the success rate of neuropsychologists in identifying probable malingerers from blinded clinical case reports. Ten neuropsychologists were sent 32 blinded evaluation reports. Half of the cases were patients with a genuine history of head trauma, while the other half were paid malingerers asked to perform as though they had sustained a head injury. The success rate of clinicians correctly classifying the 32 cases ranged from 50.0% to 68.8%. Of the genuine presentations, clinicians correctly classified between 43.8% and 81.3% of participants. Of the malingering group, clinicians correctly classified between 25.0% and 81.3% of cases. Following this procedure, the researchers conducted two discriminant functions based on neuropsychological test performance on the Halstead-Reitan Battery (HRB; Reitan, 1993) and the MMPI F scale, which detects highly unusual responses linked to over-reporting of symptoms (Hathaway & McKinley, 1951). The neuropsychological discriminant function correctly categorized each of the 32 cases, and the MMPI discriminant function correctly categorized 15 of 16 participants in each group. This study was the first to show that indicators from testing were better predictors of suboptimal performance than clinical judgement alone. Later, the HRB discriminant function was further validated when a group of 40 malingerers and 40 individuals presenting with genuine head trauma were classified with 83.8% true positive cases and 93.8% true negative cases (Mittenberg, 1996).

Recently, Dandachi-Fitzgerald et al. (2017) examined 31 clinicians' ability to identify suboptimal performance across 203 clinical case reports. The sample was divided into two groups, with one group passing both a test of symptom validity and a test of performance validity (n = 173) and the second group failing both validity tests (n = 30). The SVT used in this study was the adapted Dutch version of the Structured Inventory of Malingered Symptomology (SIMS; Merckelbach & Smith, 2003). The performance validity test (PVT) used was the Amsterdam Short-Term Memory Test (ASTM; Schmand & Lindeboom, 2005). Results from the study indicated that the sample of neuropsychologists predicted suboptimal effort for 51 cases (25.1%) and optimal effort for 152 cases (74.9%). Contrary to the researchers' hypothesis that neuropsychologists would underpredict the number of cases with problematic effort, the clinicians significantly overpredicted it. The group of cases identified as non-malingering contained 14 participants (9.2%) who belonged to the noncredible group. Conversely, 35 of the 51 (68.9%) participants identified as showing suboptimal effort passed both validity measures. These results further substantiate the findings of Heaton et al. (1978) and indicate that a clinician's judgement alone is insufficient when making validity determinations. Furthermore, these studies indicate that clinicians may both underestimate and overestimate poor effort. Thus, clients also stand to benefit from validity testing, especially when clinical judgement alone would misattribute their difficulties as malingering.
As a result, a need for specific diagnostic criteria for invalid clinical presentations emerged. This need was particularly pertinent in the fair adjudication of cases involving litigation.

Need for Malingering Criteria

Individuals are commonly referred for neuropsychological evaluation as a part of litigation or other scenarios that present with high potential of secondary gain if impairment is observed. Therefore, neuropsychologists are tasked with providing expert opinion related to the validity of an individual's performance. Slick et al. (1999) developed diagnostic criteria to aid neuropsychologists in clinical and research settings. Slick et al. defined Malingering of Neurocognitive Dysfunction (MND) as "the volitional exaggeration or fabrication of cognitive dysfunction for the purpose of obtaining substantial material gain, or avoiding or escaping formal duty or responsibility" (1999, p. 552). The multifaceted standard used three criteria to determine the likelihood that an individual was malingering for secondary gain. Criterion A was characterized by the degree of secondary gain present for an individual, Criterion B included neuropsychological testing data, and Criterion C was characterized by self-reports.

Criterion B relies on standardized neuropsychological assessment measures. Individuals who performed below chance on forced-choice measures of performance validity were to be characterized as showing definite negative response bias, whereas individuals who failed one or more validated PVTs were to be described as exhibiting probable response bias. Criterion B created a need for further development and validation of PVT measures and made the instruments central to the clinical decision-making process. Criterion C relies on standardized self-report measures. Self-reported symptoms were to be compared to the documented symptom history of the individual, brain functioning, behavior, and third-party symptom reports. This criterion established a need for standardized measures of symptom validity and, like Criterion B, encouraged significant research into validity measures.

Heaton et al. (1978) and Slick et al. (1999) inspired a vast literature base aimed at developing validity measures, but understanding this literature requires a basic knowledge of the commonly used research methodologies and the psychometric properties emphasized during test development. Two primary designs for studying validity tests have emerged that utilize either groups of known malingerers or individuals asked to intentionally give poor effort during evaluation. These designs influence outcomes in validity studies, and each design has known strengths and limitations.

PVT Research Design

Simulation Studies. Empirical study of performance validity tests has historically been performed using two research designs. The first design is the simulation study. In a simulation study, a group of individuals with clinical presentations is compared to a group of individuals without clinical symptoms. Those without clinical presentations are simulators who are coached to perform poorly without being detected. In this way, individuals who intentionally fail measures can be compared directly to a clinical group. The landmark study conducted by Heaton et al. (1978) used a simulation design. Simulation studies are valuable as they include clearly established groups for comparison. However, simulation studies are often criticized for their deviations from real-world application (Larrabee & Kirkwood, 2020).
Individuals who intentionally fail assessment measures during clinical evaluations do so for several reasons. One of the most prominent motivations for suboptimal performance is secondary gain, as is frequently seen in the traumatic brain injury population (Stucky et al., 2020). Individuals in simulation studies, however, lack such motivation. Although this is combatted with compensation for performing poorly without being caught (Heaton et al., 1978), this compensation typically pales in comparison to the secondary gain hinging on the results of a neuropsychological evaluation. This discrepancy is thought to call into question the external validity of simulation studies.

Known-Groups Studies. The second study design commonly used to evaluate PVTs is the known-groups design. In these studies, a clinical group is compared to another group of patients who have failed other established PVTs during evaluation. Within these studies, the suboptimal effort groups also commonly include individuals in situations that allow for secondary gain, such as active litigation status (Larrabee & Kirkwood, 2020). These studies are thought to have higher external validity than simulation studies because participants who fail these PVTs have done so without coaching. However, known-groups studies also present several limitations. Within known-groups research, group designation relies on established performance validity measures. These criterion measures must have strong psychometric properties to correctly categorize performance. If individuals provide optimal effort but fail PVTs due to extraneous factors such as intellectual disability, the known group is compromised. As such, known-groups designs are only as strong as the measures used to establish groups (Schroeder et al., 2019). In practice, the known-groups design is preferred in the development of PVTs because the ecological validity of known groups is significantly stronger than that of simulation studies. However, the limitation associated with the known-groups design is a lack of strong criterion measures for establishing groups. Thus, creating a strong known-groups design requires a fundamental understanding of the psychometric properties of PVTs.

Criteria for Establishing a Performance Validity Test

Performance and symptom validity tests in the field of neuropsychology must meet certain criteria to be considered empirically supported instruments. The most discussed metrics of PVTs are sensitivity and specificity. These metrics indicate the accuracy with which individuals in a simulation study or known-groups study were correctly identified as performing validly or invalidly. However, equally important are the instrument's negative predictive power (NPP) and positive predictive power (PPP). Predictive power refers to the chance that an individual is correctly identified as providing either optimal or suboptimal effort by the PVT. NPP and PPP take into consideration the base rates of suboptimal performance within a given population. Base rates have been studied extensively in adult populations and in populations with specific conditions like mild traumatic brain injuries (mTBIs), but general base rates for the pediatric population vary widely (Kirkwood et al., 2011). In this literature, a "positive" result is a failed PVT (suboptimal effort flagged), and a "negative" result is a passed PVT.

Specificity. Specificity refers to an instrument's ability to correctly classify true negative cases (Larrabee & Kirkwood, 2020). True negative cases are those correctly identified as performing optimally.
Specificity is the most emphasized psychometric property of validity tests in the neuropsychological literature. Consensus has established that validity measures should have at least 90% specificity (Lippa, 2018). This means that of the individuals providing adequate effort, including those presenting with genuine clinical conditions, at least 90% will be correctly identified as performing validly by the measure. Consequently, when specificity is maximized, sensitivity is typically much lower.

Sensitivity. Sensitivity refers to an instrument's ability to correctly classify true positive cases (Larrabee & Kirkwood, 2020). True positive cases are those correctly identified as performing at suboptimal levels. As false positives are reduced by raising specificity, the likelihood of obtaining false negatives increases. This in turn lowers sensitivity. Within the PVT research literature, consensus dictates that cutoff scores for measures be set at the point where specificity reaches 90%, so as to maximize sensitivity at that level (Lippa, 2018). Additionally, a threshold of 40% is considered to be the minimum viable level of sensitivity (Axelrod et al., 2014; Erdodi et al., 2014).

Positive Predictive Power. Positive predictive power refers to the chance that an individual who failed a PVT actually put forth suboptimal effort (Lippa, 2018). Predictive power differs greatly depending on the base rate of suboptimal effort in a population. Thus, in adults, positive predictive power is higher in populations who face potential secondary gain, like those with mTBIs, than in populations with lower base rates. Sensitivity, specificity, and base rate are all required to calculate positive predictive power. Lippa (2018) provides the equation used to calculate positive predictive power:

PPP = (sensitivity × base rate) / [(sensitivity × base rate) + (1 − specificity) × (1 − base rate)].

Negative Predictive Power. Negative predictive power (NPP) refers to the chance that an individual who passed a PVT put forth optimal effort. Like PPP, NPP is dependent on the base rate of suboptimal performance in the target population. Lippa (2018) also provides the equation for the calculation of NPP:

NPP = [specificity × (1 − base rate)] / [(1 − base rate) × specificity + base rate × (1 − sensitivity)].

In summary, an empirically supported PVT must correctly identify 90% of individuals providing adequate effort, but sensitivity should then be prioritized to ensure that the instrument captures as many simulators or known malingerers as possible. Once sensitivity and specificity are established, the two can be used in conjunction with the base rate of suboptimal effort in a population to find the predictive power of the instrument.
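To make the relationship among these quantities concrete, the following minimal Python sketch implements the two equations above. The function names and the example values (90% specificity, 50% sensitivity, and a 20% base rate) are illustrative assumptions for this document, not figures drawn from the studies discussed here.

def positive_predictive_power(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Probability that an examinee who fails the PVT truly gave suboptimal effort."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

def negative_predictive_power(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Probability that an examinee who passes the PVT truly gave adequate effort."""
    true_negatives = specificity * (1 - base_rate)
    false_negatives = (1 - sensitivity) * base_rate
    return true_negatives / (true_negatives + false_negatives)

if __name__ == "__main__":
    # Hypothetical PVT meeting the conventional benchmarks discussed above:
    # specificity of at least .90 and sensitivity of at least .40.
    sens, spec, base = 0.50, 0.90, 0.20
    print(f"PPP = {positive_predictive_power(sens, spec, base):.2f}")  # ~0.56
    print(f"NPP = {negative_predictive_power(sens, spec, base):.2f}")  # ~0.88

With these illustrative values, a failed PVT raises the estimated probability of suboptimal effort from the 20% base rate to roughly 56%, while a passed PVT lowers it to roughly 12%, which is why base rate matters as much as the test's operating characteristics.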
Historical Overview of PVT Research in Adult Populations

As previously discussed, Heaton et al. (1978) was the first study to highlight the need for measures of validity when assessing effort in clinical populations. However, the development of validity measures predates Heaton et al. (1978) by over 30 years. André Rey was at the forefront of neuropsychological validity testing with his development of the Dot Counting Test (Rey, 1941) and the Rey 15-Item Test (Rey, 1964) in the mid-twentieth century. The Rey tests were the first standalone measures of performance validity. However, their use was limited prior to the 1990s.

In her book on assessing feigned impairment, Dr. Kyle Boone opens with an anecdote about her time as a postdoctoral fellow in the mid-1980s (Boone, 2007). Dr. Boone entered the waiting room to meet her patient and found the man talking to his imaginary friend and wearing a necklace made of garlic. Dr. Boone suspected that the patient may be feigning psychiatric symptoms and cognitive deficits, so she consulted her supervisor. Dr. Boone was advised that she should give the Bender Gestalt Test (Bender, 1938) and the Mini-Mental State Examination (Folstein et al., 1975) but was perplexed by the lack of dedicated measures to assess clients providing suboptimal performance. In her study of the literature, Boone discovered the early tests developed by André Rey approximately 40 years earlier. However, the span of time between Rey's work and the evaluation Dr. Boone conducted had yielded little expansion of the validity testing knowledge base.

Rapid growth of PVT and SVT research began in the 1990s, but in the time between Rey's research and Boone's experience in the 1980s, a significant development in validity testing was established (Larrabee & Kirkwood, 2020; Pankratz et al., 1975). In a case study, Pankratz et al. (1975) evaluated a 27-year-old man with a history of symptom exaggeration and manipulation who was being tested for deafness, using a forced-choice design. The man was presented with 100 trials in which a red light and a blue light were shown consecutively for two seconds each. Across the 100 trials, a sound was randomly paired with either the red light or the blue light. The participant was asked to indicate which light was visible when the sound was played. By chance alone, the participant would be expected to answer approximately 50 trials correctly, as there is a 50 percent chance of a correct choice on any given trial. However, the patient answered only 36 trials correctly. The likelihood of receiving a score this low by chance is calculated to be less than 0.4%. Later, Pankratz went on to promote forced-choice assessment for the detection of suboptimal effort (Pankratz, 1979, 1983).

Building upon the work of Pankratz et al. (1975), development of PVTs exploded in the 1990s (Larrabee, 2015). The 1990s produced some of the most well-known validity measures used today, including the TOMM (Tombaugh, 1996) and the Word Memory Test (WMT; Green & Astner, 1995). These tests, and many others, utilize recognition memory to assess performance validity. Recognition memory tasks make strong PVTs because recognition is highly preserved after insult to the brain.

Use of Recognition Memory to Assess Effort

The most used PVTs present with face validity as measures of memory but are really measures of effort. Examples of such tests include the Test of Memory Malingering (TOMM; Tombaugh, 1996) and the Memory Validity Profile (MVP; Sherman & Brooks, 2015b). The TOMM presents the examinee with a series of 50 images at the beginning of each trial. Following the learning phase, the examinee is asked to correctly identify each item in a forced-choice format, with 50 trials presenting the correct stimulus alongside a foil image. Forced choice means that the examinee selects one of two options, so any score below the chance performance level of 50% correct is likely to indicate malingering. The MVP uses a similar multiple-choice approach, presenting an image and then asking the examinee to identify it from three choices immediately following the presentation, setting a chance threshold of 33%.
Similarly, the Word Memory Test (WMT) presents word pairs and asks participants to distinguish presented words from foils in later conditions. The high expected rate of being correct by chance on these instruments is also supplemented by their reliance on recognition memory.

Recognition memory, also commonly referred to as familiarity, refers to an individual's ability to endorse having experienced a stimulus previously (Gazzaniga et al., 2014). Recognition memory is a highly preserved aspect of memory that remains intact when recollection memory is impaired. Recollection is the ability of an individual to retrieve previously learned information. The discrepancy in memory impairment between recollection and recognition is due to the differing cognitive structures involved in each process. Lesion studies indicate that recollection memory relies heavily on the hippocampus (Gazzaniga et al., 2014). When the hippocampus is damaged, impairment is observed in recollection abilities while recognition abilities remain intact. Recognition has been postulated to be more reliant on the perirhinal cortex.

Recognition memory rests on the concept of priming. Priming is the process of changing an individual's response to a stimulus by providing previous exposure to the stimulus (Gazzaniga et al., 2014). Performance validity tests like the TOMM and MVP utilize priming by exposing individuals to a series of stimuli and then asking them to select these stimuli from a series of foils. When a stimulus is primed, individuals are more likely to correctly identify the stimulus when it is presented later. The effects of priming can last months and have been demonstrated through faster completion of word fragment tasks and recognition trials. Famously, this priming effect is preserved in cases of severe brain injury. Most notably, H.M., an individual who underwent bilateral resection of his medial temporal lobes, demonstrated priming effects and recognition despite severe anterograde amnesia (Gazzaniga et al., 2014). Additionally, H.M. displayed typical performance on Digit Span tasks despite impairment in the formation of episodic memory. As such, the highly preserved nature of these memory tasks makes them strong indicators of performance validity.

The high prevalence of genuine and feigned memory impairment also makes these instruments strong measures of validity. Memory impairment is both the most reported symptom during neuropsychological evaluations and the most commonly feigned (Lu et al., 2007). Genuine memory impairments present with typical patterns, but individuals feigning cognitive dysfunction are unlikely to match valid memory impairment patterns. Individuals with genuine cognitive disturbance will show impairments in memory tasks that require free recall, but not in forced- or multiple-choice recall. Forced-choice recognition tasks provide the added benefit of having chance-level cutoffs of performance. Although individuals like H.M. who experience severe cognitive disturbance demonstrate levels of recognition similar to healthy controls, impairment on a recognition task does not automatically indicate invalid performance. However, performance below chance levels does indicate problematic effort, as demonstrated by Pankratz et al. (1975). The unique preservation of recognition memory led to the development of numerous standalone PVTs that presented as measures of memory.
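The below-chance logic described above can be reproduced with a short calculation. The sketch below is a minimal illustration, assuming independent trials and a fixed guessing rate; it computes the exact binomial probability of scoring 36 or fewer correct out of 100 two-choice trials, as in the Pankratz et al. (1975) case, and is not code from any of the cited studies.

from math import comb

def prob_at_or_below(correct: int, trials: int, p_chance: float) -> float:
    """Exact binomial probability of obtaining `correct` or fewer successes by guessing alone."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(correct + 1))

# Pankratz et al. (1975) case: 36 of 100 two-choice trials correct (chance = 50%).
print(f"{prob_at_or_below(36, 100, 0.5):.4f}")  # ~0.0033, i.e., well under 0.4%

# The same function applies to other chance thresholds noted above, for example
# the roughly 33% guessing rate of a three-choice format such as the MVP.

Scoring only 36 correct when guessing alone would yield about 50 is so improbable that it suggests the examinee recognized the correct answers and deliberately avoided them, which is why below-chance performance is treated as evidence of problematic effort.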
Following the development of these standalone measures, a number of embedded validity indicators were established using verbal learning tests, visual memory tests, figure copy tasks, gross motor tasks, and various other neuropsychological measures (Larrabee, 2015). Many of these embedded measures still rely heavily on recognition memory, such as the Rey Complex Figure Test (RCFT; Meyers & Meyers, 1995), the CVLT-3 (Delis et al., 2017), and the Wide Range Assessment of Memory and Learning – Third Edition (WRAML-3; Adams & Sheslow, 2021). Although standalone measures have the advantage of solely measuring performance validity, embedded measures provide a level of convenience to the clinician already administering measures that include embedded indices.

Benefits of Embedded PVT Conditions within Established Memory Tests

During a neuropsychological assessment, clinicians are routinely limited by the amount of time they have face-to-face with a client for testing. Neuropsychologists must sample multiple domains related to the referral question, and clinicians are frequently limited to a single day of testing, if not less, due to insurance restrictions and other extraneous factors. Because memory is already assessed due to the frequency of reported memory impairment, the use of embedded effort indices on tests of memory allows clinicians to make effort determinations without the need for additional testing. These scores are readily available to clinicians and require no additional administration time. In addition to the need for effective time management, clinicians are also encouraged to utilize multiple PVTs throughout the duration of an evaluation. Thus, although many clinicians utilize standalone validity measures, additional validity indices throughout the evaluation aid in assessments of effort at differing time points. If clinicians were required to utilize standalone measures throughout the evaluation, the number of assessments of neuropsychological domains would be further limited. This is especially true in the pediatric population, in which clients tend to fatigue more quickly and sit for less time when testing (Baron, 2018). Interest in pediatric validity testing has only recently emerged, leading to a need for new validity indices in pediatric evaluations.

Need for Validity Testing in Pediatric Samples

Until the 2000s, few empirical studies focused on pediatric validity assessment, and interest in the subject was limited. The lack of interest in this subject was likely due to the widely held belief that children would not and could not feign performance during neuropsychological evaluations (Kirkwood, 2015). However, several studies conducted since the turn of the century have indicated that children can and do feign performance during evaluations. Furthermore, developmental research has supported these findings by demonstrating that children are capable of deception as early as preschool.

Deception in Children

The study of deception in children has focused substantially on both theory of mind development and executive functions. Theory of mind refers to a child's ability to understand that another individual can hold a false belief (Peterson & Peterson, 2015). As such, a child must believe that an individual can believe a false statement before trying to present them with one. Executive functions refer to the skills that facilitate goal-directed behavior and novel task completion.
These skills require a child to abstain from automatic responses and include inhibition, planning, working memory, and mental shifting (Peterson & Peterson, 2015). To deceive, a child must be able to hold the deceptive idea in his or her working memory and refrain from sharing contradictory evidence with the target. As such, children's ability to deceive is thought to align with the development of these skills.

Theory of Mind. Theory of mind is traditionally measured using the false-belief task (Peterson & Peterson, 2015). This task presents the child with a scenario in which a character or individual knows the location of an object. When the character is unaware, another character moves the item. The child is then asked where the first individual will look for the object when they return. If the child has developed theory of mind, they will be able to correctly identify that the character will look in the original, and thus incorrect, location. However, children who lack this skill will assume that the individual will look in the new item location. As such, to be deceitful, a child must first appreciate that people can hold false beliefs. Research consistently demonstrates that the ability to correctly solve the false-belief task emerges by age four (Peterson & Peterson, 2015).

Further research using the unexpected-contents paradigm suggests that young children are initially unable to acknowledge their own false beliefs (Wimmer & Hartl, 1991). In scenarios using the unexpected-contents paradigm, children are presented with a container that is clearly marked to indicate the contents. For example, a child may be shown a box with pictures of pencils on the lid to indicate that the box contains pencils. When asked, the child indicates that he or she believes that the box contains pencils. However, when the box is opened, the contents differ from expectations. In this example, the box may contain blocks. When asked what they believed to be in the box before the reveal, children without the ability to acknowledge false beliefs will indicate that they believed the box contained blocks before opening it, despite having stated pencils earlier. Like the false-belief task, the ability to solve the unexpected-contents paradigm also emerges around age four (Peterson & Peterson, 2015). Thus, one of the fundamental skills related to deception is present as early as preschool.

Executive Functions. Executive functions encompass the skills needed to override automatic thinking and to act intentionally (Diamond, 2013). These skills are required when an individual focuses their attention, develops plans, and inhibits impulsive actions. Of the executive functions, working memory and inhibition are considered fundamental aspects of deception (Peterson & Peterson, 2015). To lie, an individual must act deceptively, hold the deception in his or her working memory, and inhibit the disclosure of information that contradicts the deception. Therefore, children who have not yet developed these executive functions may struggle to engage in deception. Working memory and inhibition can be observed in early infancy, but these skills develop substantially in the preschool years (Garon et al., 2008), and they continue to develop into adolescence and early adulthood (Peterson & Peterson, 2015). In turn, from the time children enter preschool they are capable of deception, and they continue to develop the skills needed to deceive throughout childhood.
These findings contradict the traditional view that children are incapable of feigning performance during testing.

Heaton et al. (1978) demonstrated that clinicians struggle to identify suboptimal performance during assessments in adults, who would be presumed to have well-developed theory of mind and executive function skills. Although children are capable of deception, one might think that clinicians would be sensitive to feigning in this population due to the nascency of the associated skills. Contrary to this idea, studies have consistently indicated that adults have difficulty identifying deception in children (Crossman & Lewis, 2006; Peterson & Peterson, 2015; Talwar & Lee, 2002). Although teachers, parents, and individuals who regularly work with children are better at detecting deception than the average adult, naturalistic observation studies have indicated that adults struggle to detect deception in children as young as three (Crossman & Lewis, 2006; Peterson & Peterson, 2015; Talwar & Crossman, 2011). As children age, the difficulty of identifying their deception increases.

The traditional belief that children are not capable of deception when testing is disproven by the literature base surrounding pediatric deception. Children as young as four have both the theory of mind and the executive functions needed to lie, and these skills develop with age. With the knowledge that children can engage in deception during assessments, it is important to also understand why children might provide suboptimal effort when testing.

Reasons for Suboptimal Effort in Pediatric Samples

Factors that contribute to suboptimal performance in pediatric evaluations are quite diverse. Evidence of this diversity is demonstrated by the rates of low performance across referral questions. As with adult populations, children presenting with mild traumatic brain injuries (mTBIs) exhibit high rates of suboptimal performance. This population is estimated to perform below expectations between 12% and 20% of the time (Kirkwood, 2015). Furthermore, a significant population of interest specific to the pediatric population are those being evaluated for Social Security Disability benefits. Some reports indicate that nearly 60% of this population underperforms expectations during neuropsychological assessments. This is thought to be influenced by parental pressure on children and is commonly referred to as "malingering by proxy" (Kirkwood, 2015).

In addition to financial gain through litigation, several other reasons for poor effort in the pediatric population have been reported. Children with oppositional tendencies may perform poorly intentionally while testing. Additionally, children who are not interested in testing may either fail to attend to test stimuli or fail intentionally to accelerate the testing process. Furthermore, some children may fail due to low self-esteem. Children may even intentionally sabotage test results because they want to be sure that the clinician sees the difficulties they are having in school or other areas of functioning (Kirkwood, 2015).

Studies of pediatric malingering have largely focused on medical samples (e.g., Kirkwood et al., 2011; Welsh et al., 2012). However, there are other populations for whom suboptimal effort may influence treatment and outcomes. Many psychoeducational evaluations are performed each year that change the educational course and outcomes of students across the United States.
Between 2009 and 2021, 15% of students in the United States received special education services, totaling 7.5 million students ages 3-21 (National Center for Education Statistics, 2022). Each of these students requires an initial special education evaluation as well as a reevaluation every three years they receive services. In addition to typical special education evaluations, many seek evaluations to gain accommodations for high-stakes testing like college entrance exams. Despite the high potential for secondary gain in these populations, school psychologists rarely utilize performance validity measures, and performance validity testing is often absent from school psychology training programs (Holcomb, 2018). The lack of PVT use occurs despite school psychologists having access to instruments with performance validity indicators. Notably, there is a subtest available to nearly all who perform psychoeducational and neuropsychological evaluations that has long been used in the adult population as an indicator of effort: the Digit Span subtest of the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014), which is considered the primary instrument for assessing intellect in children. Despite the need for validity assessment in multiple areas of psychological testing, little is known about actual use in pediatric neuropsychological testing, let alone in psychological specialties less familiar with validity testing.

Reported vs. Actual PVT Use in Pediatric Assessments

With a recent increase in interest surrounding PVT and SVT use in children, Brooks et al. (2016) surveyed clinical neuropsychologists across the United States and Canada to evaluate the frequency of PVT and SVT use in pediatric populations. Of the 349 participants who completed the survey, 282 neuropsychologists met the inclusion criteria for analysis. Regarding PVT use in pediatric evaluations, a very high proportion, 92% of respondents, endorsed using at least one PVT per pediatric evaluation. Additionally, 88% endorsed the use of at least one SVT during pediatric evaluations. This rate of reporting exceeded previous surveys of adult practitioners. This seemingly unlikely discrepancy can be partially explained by the increasing significance placed on validity testing in the time between this survey and earlier surveys of adult-population clinicians (Brooks et al., 2016). However, reported and actual use of PVTs in pediatric populations may differ.

Following the publication of the survey results, Macallister et al. (2019) sought to examine the real-world use of PVTs compared to the reported prevalence. Like Brooks et al. (2016), the researchers suspected that the rates reported in the survey far exceeded what they saw in daily neuropsychological practice. The authors collected a convenience sample of reports they had been sent as a part of referrals to their clinic. The reasons for referral for these evaluations were re-evaluation, consultation, and review for medical intervention. Patients in this sample ranged in age from 6 to 17 years, as these ages constitute the school-age range. The report dates ranged from January 2015 to January 2017. These dates are relevant as they nearly coincide with the date of the survey. The final sample consisted of 131 neuropsychological reports from 102 neuropsychologists. Using the evaluation reports, Macallister et al. (2019) obtained PVT use rates that differed significantly from the survey results.
Of the 131 reports reviewed, six documented using PVTs, and each of the six cases was conducted by a separate neuropsychologist, making PVT use across clinicians in the sample 5.88%. In their discussion, the researchers highlight that their sample differs significantly from that of Brooks et al. (2016), as the survey sample was obtained through international recruitment while this study utilized a convenience sample. Additionally, documentation practices for PVTs are not well established, and this may mean that clinicians used these instruments without documenting them. However, the factor presented as most likely contributing to the discrepancy is the social desirability bias inherent in a survey of this kind. Multiple professional organizations within the field of neuropsychology have released position statements arguing for the use of PVTs, including the National Academy of Neuropsychology (NAN; Bush et al., 2005) and the American Academy of Clinical Neuropsychology (Guilmette et al., 2020; Heilbronner et al., 2009). As such, the expectation that PVTs be used may positively skew practitioners' reported use of them despite lower practical application.

An important limitation of the Macallister et al. (2019) study must be addressed. In the abstract, methods, and results sections of the article, the reports are said to be dated 2017 to 2018. This would mean that 5 of 131 reports from the two years preceding the study would equal 4.88% of reports. However, later in the article, when referring to reports from the two preceding years, Macallister et al. (2019) note that 6 of 56 documented PVT use, compared to none from 2001 to 2013. This changes the percentage of PVT use among recent reports to 10.7%. Although the exact percentage of recent reports including PVT use is unclear, both 4.88% and 10.7% fall well below the 92% rate obtained by Brooks et al. (2016).

Regardless of whether the findings of Brooks et al. (2016) or Macallister et al. (2019) more accurately represent the use of PVTs in pediatric practice, further study of PVTs is needed. If use among pediatric clinicians is low, there is little information about the utility of the instruments in practice when working with children. If use rates are high, the problem of empirically unvalidated instruments being used to make clinical decisions becomes more central. One instrument highlighted in the survey as frequently used in the evaluation of performance validity in pediatric populations was the Digit Span subtest of the Wechsler intelligence measures (Brooks et al., 2016). Despite high rates of endorsement, this metric has been studied little with pediatric populations.

History of the Digit Span Task

The Digit Span task is a classical task that requires individuals to hold a string of numerical digits in working memory and repeat them to the examiner immediately following their presentation. Originally, this task involved simply repeating strings of digits exactly as they were presented. Later variations added conditions that involved reciting digits in reverse or numerical order. Initial items contain few digits, but subsequent items add digits to increase difficulty as the examinee progresses through the task. The task is discontinued after a set number of consecutive failures to repeat the digit strings verbatim. Although it appears on each of the full Wechsler intelligence measures, the Digit Span subtest significantly predates the development of these instruments.
Interest in the limited number of digits an individual can repeat is believed to have first been noted in 1871 by Oliver W. Holmes (Richardson, 2007). Holmes noted that most people lost the ability to recite strings when such strings spanned between seven and ten digits. Later, Jacobs assessed the ability of a number of schoolchildren to correctly recite numerals and letters and found that the average span length increased with age (Jacobs, 1887), and so a developmental perspective was introduced. Subsequently, Binet and Simon included a forward digit condition in their original intelligence assessment (Binet & Simon, 1905; Richardson, 2007) and used normative data comprising age-related expectations. When the Binet-Simon scale was adapted into the Stanford-Binet by Terman (Terman, 1916), the task requiring repetition of digits was again included. Additionally, Terman added a backward condition, which required participants to repeat the digits in reverse order.

In 1939, David Wechsler released the Wechsler-Bellevue Intelligence Scale (Wechsler, 1939). This first Wechsler intelligence measure included Digit Span as one of its subtests. The original Digit Span subtest included both a forward and a backward condition. As the Wechsler-Bellevue Intelligence Scale evolved into the Wechsler Adult Intelligence Scale (WAIS; Wechsler, 1955) and the Wechsler Intelligence Scale for Children (WISC; Wechsler, 1949), the Digit Span subtest remained a standard component. In each revision of these two instruments, the Digit Span subtest has been a primary subtest. The one exception to the inclusion of the Digit Span subtest is the Wechsler Abbreviated Scale of Intelligence (WASI; Wechsler, 1999, 2011). Both editions of this instrument have omitted the Digit Span subtest because working memory is not assessed on the WASI.

While early Wechsler instruments included only the forward and backward conditions, the most recent adult and child versions, the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV; Wechsler, 2008) and the WISC-V (Wechsler, 2014), introduced a third component to the Digit Span subtest. The sequencing condition is similar in structure to the forward and backward conditions. However, participants are now tasked with repeating the digits in numerical order from lowest to highest. The raw score for Digit Span on these revised instruments is calculated by adding the raw scores across each of the three conditions. Use of this subtest as an embedded measure of effort has long been present in the adult literature and practice, and it has been used in several forms ranging from the age-corrected scaled score to various other calculations based on raw scores derived from the subtest.

Digit Span Age Corrected Scaled Score (ACSS): An Embedded Effort Indicator

The score for an examinee's Digit Span performance is calculated by assigning points for every string repeated correctly. The sum of these points constitutes the raw score. The raw score is converted to a scaled score based on the mean and standard deviation of a national normative sample. The resulting scaled score is corrected for age because it transforms the raw score using the mean and standard deviation that align with the examinee's age. Scaled scores have a normative mean of 10 and a standard deviation of 3, so an examinee's scaled score reflects how far his or her performance falls from the age-based mean. This age corrected scaled score (ACSS) is the most reported index of examinee performance in clinical evaluations.
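As an illustration of the conversion just described, the short Python sketch below standardizes a raw score against age-band norms and rescales it to the familiar mean-10, SD-3 metric. The normative mean and standard deviation used here are invented for the example and are not WISC-V values.

def age_corrected_scaled_score(raw: float, norm_mean: float, norm_sd: float) -> int:
    """Convert a raw score to a scaled score (mean 10, SD 3) using age-band norms."""
    z = (raw - norm_mean) / norm_sd      # distance from the age-based mean in SD units
    scaled = round(10 + 3 * z)           # rescale to the mean-10, SD-3 metric
    return max(1, min(19, scaled))       # Wechsler scaled scores are bounded at 1 and 19

# Hypothetical example: a raw score of 18 in an age band with mean 24 and SD 3
# falls two standard deviations below the mean, yielding a scaled score of 4.
print(age_corrected_scaled_score(18, norm_mean=24, norm_sd=3))  # 4

The same raw score therefore maps to different scaled scores for examinees of different ages, which is what allows very low ACSS values to be interpreted relative to age-based expectations.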
Although memory disturbance can impair performance on Wechsler Digit Span tasks, studies have demonstrated that those with severe memory impairments, such as individuals with Korsakoff syndrome or those who have undergone surgery, are able to perform typically compared to those without such impairments (Iverson, 2003). Despite this evidence emerging in the 1970s, the practical application of this information in the evaluation of malingering was not examined explicitly until the work of Iverson and Franzen in the 1990s. In a simulation study, Iverson and Franzen (1996) noticed that a Digit Span ACSS cutoff score, meaning the yielded scaled score based on normative data, of 4 correctly classified 77.5% of malingerers and 100% of individuals with clinical presentations. Relatedly, Iverson and Franzen (1994) examined the classification utility of the ACSS score in a sample consisting of normal controls, simulators, and clinical patients suffering from head injury. Normal controls were correctly classified with 100% accuracy across cutoff scaled scores of 3, 4, and 5, all of which fall well below the normative mean of 10. Sensitivity of the ACSS was 60% with a cutoff scaled score of 3, 82.5% with a cutoff scaled score of 4, and 90% with a cutoff scaled score of 5. These initial studies implied that the ACSS may have clinical utility as an embedded PVT measure. However, more sophisticated metrics were sought using raw data from the task.

Development of Reliable Digit Span: An Embedded Effort Indicator

Reliable Digit Span (RDS) was first established by Greiffenstein et al. (1994). The primary objectives of this study were to establish grouping criteria for malingering and to validate popular memory-based PVT measures. RDS differed from the ACSS in that it referenced the longest string of digits recalled perfectly on both trials of the forward and backward conditions. For example, if the longest string an examinee recalled perfectly on both trials of Digits Forward was five digits, the forward value is five. This is added to the longest number of Digits Backward recalled perfectly twice, say three. The RDS would then be eight (5 + 3) digits reliably held.
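The RDS calculation lends itself to a simple worked implementation. The Python sketch below is an illustrative example rather than scoring code from any Wechsler manual: it takes each condition's trial results keyed by span length and returns the sum of the longest spans passed on both trials, and the sample protocol is hypothetical.

def reliable_span(trials_by_length: dict[int, tuple[bool, bool]]) -> int:
    """Longest span length at which both trials were repeated perfectly (0 if none)."""
    passed = [length for length, (trial1, trial2) in trials_by_length.items() if trial1 and trial2]
    return max(passed, default=0)

def reliable_digit_span(forward: dict[int, tuple[bool, bool]],
                        backward: dict[int, tuple[bool, bool]]) -> int:
    """RDS = longest forward span held twice + longest backward span held twice."""
    return reliable_span(forward) + reliable_span(backward)

# Hypothetical protocol matching the example above: five digits reliably held
# forward and three reliably held backward, giving an RDS of 8.
forward = {3: (True, True), 4: (True, True), 5: (True, True), 6: (True, False)}
backward = {2: (True, True), 3: (True, True), 4: (False, False)}
print(reliable_digit_span(forward, backward))  # 8

The RDS-R metric discussed later extends this sum in the same way by adding the corresponding value from the sequencing condition.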
This group was characterized by those who met criteria for the PPSC group and two or more additional criteria: 1) two or more severe ratings on neuropsychological tests compared to peers of similar age and education, 2) an improbable symptom presentation that contradicts records and evidence, 3) inability to work or engage socially one year after injury, and 4) claims of isolated memory loss. The authors utilized several established PVT measures including the Rey Auditory Verbal Learning Test (AVLT; Rey, 1958), the Wechsler Memory Scale (WMS; Wechsler, 1945), the Wechsler Memory Scale Revised (WMS-R; Wechsler, 1987), Rey’s Word Recognition List (WRL; Lezak, 1983), Rey’s 15-Item Memory Test (Rey-15; Lezak, 1983), Rey’s Dot Counting task (Lezak, 1983), and the Portland Digit Recognition Test (PDRT; Binder & Willis, 1991). In addition to these established measures, the authors proposed a new measure using the Digit Span subtest of the Wechsler Adult Intelligence Scale – Revised (WAIS-R; Wechsler, 1981). The measure took the sum of the longest string of digits correctly recited twice on the Forward and Backward conditions of the Digit Span subtest. The first two trials of the Forward condition of 29 the WAIS-R contained strings of three digits while the Backward condition started with strings of two digits. Participants who completed the first set successfully in both conditions would therefore achieve a minimum score of five digits reliably held. Those who failed to complete either of the basal sets were assigned an RDS score of 3. Group comparisons showed that the probable malingerers group scored significantly lower than both the TBI and PCSS groups. To establish cut scores, Greiffenstein et al. (1994) used both a conservative and liberal decision rule. First, the conservative cut score was set to 1.3 standard deviations below the mean performance of the TBI group to achieve 90% specificity. This decision rule resulted in an RDS cutoff score of 7 digits reliably held. The second cutoff score was developed by setting the metric to one standard deviation below the TBI group’s mean score. This decision rule resulted in a cutoff score of 8 digits reliably held. Resulting analyses showed that when compared to the TBI group specificity based on scores of 7 and 8 were 73/54 and sensitivity was 70/82. When compared to the PCSS group, specificity rates of 89/69 respectively were found, and sensitivity was 68/82. These results suggested that RDS may be a viable clinical tool for detecting suboptimal performance during neuropsychological evaluations. Following the study of RDS, an official supplement was released for the WAIS-IV that included normative data for the RDS score (Wechsler, 2009). RDS is currently in use, but changes to the Weschler suite of intelligence instruments have changed the nature of the Digit Span subtest. Thus, a revised version of the RDS has been introduced. Development of Reliable Digit Span – Revised Upon release of the WAIS-IV, the nature of the Digit Span subtest fundamentally changed. In addition to the traditional forward and backward conditions, the sequencing condition was added to the subtest. The sequencing subtest required participants to repeat the 30 string of digits back to the examiner in ordinal order from smallest to largest. The raw score from the sequencing task was added to the raw scores of the forward and backward tasks to develop a Digit Span raw score. This addition also allowed for the expansion of the RDS metric. Spencer et al. 
(2010) were the first to introduce the expanded version of the RDS metric. By adding the length of the longest span correctly repeated consecutively in the sequencing condition to the traditional RDS score, the Reliable Digit Span – Revised (RDS-R) score was developed. In their examination of veterans in a known-groups design, Spencer et al. (2010) found that RDS-R metric with a cutoff score of 11 reliably held presented higher specificity, sensitivity, and predictive power than the traditional RDS measure. Furthermore, a subsequent study conducted by Young et al. (2012) found that the new RDS-R measure exhibited similar utility to the RDS and ACSS metrics in the detection of suboptimal performance in a group of adults referred for evaluation at the VA. Analysis of Digit Span Indices in Adults Following the introduction of the RDS measure, study of its utility in adult populations expanded significantly. Schroeder et al. (2012) conducted a review of the RDS literature ranging from 1994 to 2011 and found 20 studies examining the measure in various clinical populations. When using an RDS cutoff score of six, specificity rates exceeded the 90% recommended threshold in samples including controls, mixed clinical populations, traumatic brain injuries, and simulators. However, RDS specificity was lower in special populations including those with intellectual disabilities, memory disorders, language barriers, and cerebrovascular accidents. As such, RDS was demonstrated to be a promising measure of performance validity in most adult populations. Additionally, preservation of Digit Span tasks in highly amnesic individuals (Gazzaniga et al., 2014) also suggests that the scaled score provided by the Digit Span subtest 31 may have potential as a validity indicator. This evidence is derived entirely from the adult literature however, and adult practices are not always suitable for pediatric populations (Baron, 2018). Thus, an examination of Digit Span in the pediatric literature is warranted. Critical Analysis of Digit Span Indices in Pediatric Samples In a recent survey of 282 pediatric neuropsychologists, 65.3% of respondents indicated that they at least occasionally use the RDS validity measure with children with 22.4% of the sample endorsing using RDS often and 21.4% endorsing using the measure almost always (Brooks et al., 2016). The results of this survey appear promising for the utility of the measure in pediatric assessments. Unfortunately, this high rate of endorsement is problematic when examining the empirical evidence for the use of this measure with children. Despite more than half of the pediatric neuropsychologists sampled endorsing use RDS, only two studies aimed at establishing pediatric cutoff scores have been conducted, and each of these studies utilizes an outdated version of the WISC and focuses on highly specific clinical presentations (Kirkwood et al., 2011; Welsh et al., 2012). The inclusion of the Digit Span subtest on both the WISC-IV (Wechsler, 2003) and the WISC-V (Wechsler, 2014) allows clinicians to develop an RDS score. However, little has been done to establish pediatric cutoff scores, leading to uncertainty about their ethical use and utility. To date, just two studies have been conducted to establish credible metrics for using RDS as a measure of validity in pediatric assessment. The first attempt at standardizing the RDS score in a pediatric population was conducted by Kirkwood et al. (2011). 
This study examined the utility of both RDS and Digit Span ACSS in detecting suboptimal performance in a sample of children referred for neuropsychological evaluation following a TBI. Of the original sample, injury modality included sports related 32 injuries (65%), falls (18%) vehicular injury (11%), and assault (3%). Children were excluded if the evaluation was forensic in nature, neurosurgical intervention was involved, the injury resulted from abuse, or the brain injury was not related to physical trauma. The final sample comprised 274 children ranging in age from 8 to 16. The sample was split into a credible performance group and a noncredible performance group. Children in the credible performance group passed both the Medical Symptom Validity Test (MSVT; Green, 2004) and the TOMM (n = 224). The noncredible performance group consisted of those who failed both the MSVT and the TOMM (n = 37). Thirteen children failed the MSVT but passed the TOMM. This group of children was excluded from analysis as group membership was ambiguous. Group differences were examined using a series of t-tests. Significant differences in performance between groups was found on the Digit Span ACSS (p<.001, d – 1.5), RDS (p<.001, d = 1.2), Digit Span Forward task (p < .001, d = 1.4), and the Digit Span Backward task (p < .001, d = 1.1). Kirkwood et al. (2011) also went on to establish cutoff scores based on the performance of their sample. Optimal cutoff scores were established when specificity met or exceeded 90%. An ACSS cutoff score was established at scaled score ≤5 which resulted in a specificity of 95% and a sensitivity of 51%. An RDS cutoff score established at raw score ≤6 yielded a specificity of 92% and a sensitivity of 51%. Negative and positive predictive power were also reported at a variety of noncredible performance base rates. Kirkwood et al. (2011) found that when using an RDS cutoff score of ≤7, false positive rates were 31%. As a result, more conservative cutoff scores were recommended when using RDS in pediatric populations. Although the work of Kirkwood et al. (2011) was revolutionary as it examined the utility of the Digit Span subtest of the WISC-IV in a pediatric sample, the generalizability of the 33 findings was narrow because the sample consisted exclusively of children presenting with TBI. To examine the utility of the measure in other populations, Welsh et al. (2012) utilized RDS in an epilepsy sample. The study’s sample consisted of 54 children ages 6 to 17 presenting for neuropsychological evaluation with various epilepsy conditions. Most of the sample presented with partial epilepsy syndromes (n = 33) while others presented with generalized epilepsy syndromes (n = 10) and mixed presentations or unspecified syndromes (n = 11). As part of their neuropsychological assessments, each child completed the TOMM and Digit Span subtest of a Wechsler instrument (WISC-IV or WAIS-III). Additionally, participants completed a full or abbreviated Wechsler intelligence measure (WISC-IV, WAIS-III, or WASI). In order to more closely match the IQ scores presented by the WASI, Welsh et al. (2012) utilized the GAI measure of the WISC-IV and the WAIS-III. This also provided the added benefit of removing the Digit Span subtest from analyses of intellectual functioning. Using the cutoff scores provided by Kirkwood et al. (2011), only 65% of the sample passed the RDS metric. These results fall well below the 90% pass rate established in the literature. 
Despite the low rate of RDS success, 90% of the sample validly completed the TOMM at or above cutoff scores. Sensitivity and specificity analysis of the sample’s performance showed that to reach adequate specificity, a cutoff score of ≤3 reliably held digits would be necessary. However, such a low threshold provided poor sensitivity at just 20%. Thus, clinical utility was suspect. Other than the studies conducted by Kirkwood et al. (2011) and Welsh et al. (2012), studies examining the utility of RDS in pediatric populations have been significantly limited or have utilized methodologies other than the preferred simulation and known-groups designs. In a study of 119 children referred for neuropsychological evaluation for attention- 34 deficit/hyperactivity disorder (ADHD), autism spectrum disorder (ASD), specific learning disabilities (SLD), or anxiety/depression, Weiss et al. (2019) compared failure rates on the TOMM, RDS (scores calculated from both WISC-IV and WAIS-IV), and the Discriminability Index from either the California Verbal Learning Test – Children’s Version (CVLT-C; Delis et al., 1994) or the Californian Verbal Learning Test – Second Edition (CVLT-II; Delis et al., 2000). The discriminability index reflects the ability of the examinee to recognize previously heard words in a word list-learning task. It is a forced-choice task in that only a yes/no response is required. However, groups comparisons were not possible in this study as no known group of suboptimal effort was present. Additionally, the study utilized RDS scores from two separate Wechsler instruments and failed to provide the adult cutoff scores used to determine failure on the RDS measure. As such, little information can be taken from this study. Other studies similarly use adult cutoff scores despite evidence suggesting these cutoffs are invalid in younger populations (Kirkwood et al., 2011; Welsh et al., 2012). The early studies conducted by Kirkwood et al. (2011) and Welsh et al. (2012) establish the RDS score as a potential indicator of performance validity. However, discrepancies in cutoff scores across populations yields a need for further study to validate the index. Following the dissemination of these studies, no additional studies were conducted using the WISC-IV. This is likely due, in part, to the release of the revised WISC-V (Wechsler, 2014). RDS on the WISC-V Thus far, most studies of RDS and Digit Span as effort indices in children have utilized the WISC-IV. Little research has been conducted on the utility of these indices when calculated using the current edition of the Weschler child version, the WISC-V. The Digit Span subtest of the WISC-V saw several important revisions that fundamentally change the task (Wechsler, 35 2014). The forward condition added longer digit strings at the end to increase the discriminant ability of the measure at higher ability levels. This is unlikely to affect cutoff scores for RDS as cutoff scores typically focus on the lower end of performance. Changes to the backward condition, however, significantly affect scores at the lower end. An additional short trial was added to the beginning of the task to reduce the task difficulty gradient. As such, RDS scores on the WISC-V may be higher than RDS scores on the previous edition of the instrument. Finally, the WISC-V added the sequencing condition that tasks participants with reciting the digits in ascending order, called Digits Sequencing. 
This new condition significantly influences the ACSS score as this score is now derived from the sum of each of the three conditions. Additionally, the sequencing task allows for the calculation of RDS-R. To date, one study has been published examining Digit Span performance on the WISC- V regarding performance validity. In a study of 130 children referred for neuropsychological evaluation, Ventura et al. (2019) compared WISC-V ACSS, RDS, and RDS-R performance to TOMM performance. Like other limited studies in this area, this study did not include a known or simulated malingering group. Rather, the authors examined failure rates across each measure using predetermined cutoff scores. Results indicated that Digit Span PVT failure rates were much higher than failure of the TOMM. However, lack of empirical support for these cutoff scores calls into question the significance of these findings. Additionally, at the time of this review, an article by Kirk et al. (2020) reports an article currently in press that examines the utility of the WISC-V Digit Span as a PVT. The study uses a known-groups design with the Medical Symptom Validity Test (MSVT; Green, 2004) serving as the criterion measure. The results reported in this review indicate that the ACSS, RDS, and RDS- R measures were all able to provide cutoff scores with strong specificity and sensitivity. As such, 36 the results of the current study should serve to supplement these findings as the second study to examine the WISC-V Digit Span indices in a known-groups design. The Digit Span indices have strong potential as performance validity indicators in the pediatric population for numerous reasons. First, these indices have the advantage of being embedded in many evaluations without the need for additional administration time. Any evaluation that includes the Digit Span task can provide validity information. Additionally, while the Digit Span task is considered a working memory task, recognition memory is not the memory modality assessed by the test. Therefore, upon validation, Digit Span could serve as an embedded validity indicator that does not rely on recognition alone. This diversifies the sources of validity information to the clinician and helps to make more informed decisions regarding effort. A Non-Recognition Based Memory Task to Measure Performance Validity The Digit Span subtest requires individuals to recall presented information, and thus, is not a recognition task. Although Digit Span tasks do require working memory, the actual construct assessed by the task has been a contentious subject throughout its use in intelligence assessments. Originally, Digit Span on early Wechsler instruments included only forward conditions. However, these tasks were too brief make meaningful clinical interpretations. As a result, additional conditions were added to the Digit Span task. Later iterations of the Digit Span task included the backward and sequencing conditions, but some began to question whether the task measured a unified construct. Following factor analysis on the Test of Memory and Learning (TOMAL; Reynolds & Bigler, 1994), Reynolds (1997) examined whether the forward and backward conditions of the Digit Span task should be combined for clinical analysis. Factor analysis showed that the forward and backward conditions 37 loaded on separate factors. Reynolds argued that the forward task was more a measure of verbal working memory while the backward condition was more difficult and potentially required visual-spatial manipulation. 
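Whatever combination of constructs the three conditions tap, the raw-score indices reviewed above are mechanical to compute. As a concrete recap, the Python sketch below derives RDS and RDS-R from hypothetical per-span trial records; it assumes the same two-trials-correct rule for the sequencing condition that is applied to the forward and backward conditions, and it is an illustration rather than any instrument's scoring procedure.

# Illustrative calculation of RDS and RDS-R from hypothetical trial records.
# Each condition maps span length -> (trial 1 correct, trial 2 correct).
def longest_reliable_span(trials):
    """Longest span length for which both trials of the item were repeated correctly."""
    reliable = [span for span, (t1, t2) in trials.items() if t1 and t2]
    return max(reliable, default=0)

forward    = {2: (True, True), 3: (True, True), 4: (True, True), 5: (True, False)}
backward   = {2: (True, True), 3: (True, True), 4: (False, False)}
sequencing = {2: (True, True), 3: (True, False)}

rds = longest_reliable_span(forward) + longest_reliable_span(backward)  # 4 + 3 = 7
rds_r = rds + longest_reliable_span(sequencing)                         # 7 + 2 = 9
print(rds, rds_r)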
A recent study has also indicated that the internal consistency of the combined Digit Span score is closer to 0.70 than the 0.90 reported by the WAIS-IV manual (Gignac et al., 2019). As such, discussion of what aspects of memory the Digit Span task actually measures continues, but the task is distinctly different from the other memory based PVTs due to its lack of recognition requirements. This further strengthens the argument for the use of Digit Span indices as embedded PVTs, but only once certain gaps in the research are addressed. Research Gaps At the time of this study, the use of Digit Span indices from the WISC-V as performance validity indicators is scientifically questionable. Yet, a recent survey of pediatric neuropsychologists found that over 65% of pediatric clinicians reported at least occasional use of the Reliable Digit Span measure with children (Brooks et al., 2016). If true, these results indicate that over half of pediatric clinicians are using an instrument without empirical support to make determinations about client effort. Digit Span PVT indices utilize a measure of working memory that does not rely on recognition memory. As such, the instrument needs to be studied in known-groups designs. These studies require the use of strong criterion measures to establish groups with valid and invalid performance. Known-groups studies utilizing well-established recognition PVTs would provide clarification of the clinical utility of the indices. For RDS and other Digit Span indices to provide meaningful performance validity information when working with pediatric populations, cutoff scores should be established with adequate specificity and sensitivity. Additionally, the negative and positive predictive power of 38 these indices should be examined across estimated rates of suboptimal effort in pediatric populations. Once established, clinicians will have guidance about how to best use these indicators to reliably determine effort in their young clients. Current Study The current study sought to establish cutoff scores of the Digit Span ACSS, RDS, and RDS-R scores of the WISC-V using a mixed clinical sample. The study uses a known groups design to examine the utility of the Digit Span ACSS, RDS, and RDS-R scores in distinguishing between children who successfully passed the Memory Validity Profile (MVP; Sherman & Brooks, 2015b), a stand-alone, recognition memory-based pediatric PVT, from those who failed the MVP during a neuropsychological assessment. In accordance with age ranges set by the WISC-V, participants in this study ranged in age from 6-16 years of age. Additionally, each participant completed the MVP as a part of an outpatient neuropsychological evaluation. Data were collected from a single rehabilitation hospital in the midwestern United States. Given high rates of endorsement for the use of Digit Span indices among pediatric neuropsychologists (Brooks et al., 2016), study of appropriate cutoff scores, sensitivity, specificity, and predictive power is essential to conducting empirically supported work. This study aims to be the first to use a known-groups design to examine the psychometric properties of the ACSS, RDS, and RDS-R using the only standalone PVT designed specifically for use in pediatric populations as the criterion measure. The purpose of this study is to establish cutoff scores in a mixed clinical sample that achieve 90% specificity and at least 40% sensitivity. Additionally, sensitivity and specificity across various scores will be presented. 
Lastly, positive and negative predictive power will be calculated for each index according to estimated base rates in a similar sample (Wilson & Lesica, 2021). METHOD Data Set An extant data set was utilized for the completion of this study. The database included neuropsychological assessment data from individuals presenting to a midwestern rehabilitation hospital outpatient clinic from September 2018 to November 2022. Referral reasons included traumatic brain injuries, anoxia, vascular conditions, tumors, attention-deficit/hyperactivity disorder, emotional disorders, autism spectrum disorder, speech and language concerns, cerebral palsy, myelomeningocele and/or hydrocephalus, and conditions classified as "other". Deidentified assessment results were available for each child who met inclusionary criteria. Participants Inclusion Criteria The extant data set was filtered to include only participants who met the following criteria. First, participants were required to be ages 6-16 at the time of their evaluation to make them eligible to complete both the WISC-V and the MVP. Participants who met these criteria were also required to have completed both assessment measures during their outpatient neuropsychological evaluation. Children who met these two criteria between September 2018 and November 2022 were eligible for participation if they did not meet any exclusionary criteria. Exclusion Criteria Several exclusionary criteria were used for data selection. Children unable to provide informed assent to their evaluation were excluded from analysis. Additionally, children who were not fluent in English were excluded from the study. Lastly, children unable to complete either the WISC-V or the MVP due to significant uncorrected visual or auditory impairments were also excluded from analysis. Group Assignment Participants were sorted into one of two groups: the valid performance group and the invalid performance group. Participants who failed the Memory Validity Profile (MVP; Sherman & Brooks, 2015b) based on an experimental cutoff established by Wilson and Lesica (2021) were placed in the invalid performance group. Participants who passed the MVP based on this cutoff were placed in the valid performance group. Demographics Descriptive statistics for demographic variables are presented in Table 1. Compared to the valid performance group, participants in the invalid performance group were younger in age (p < 0.001, d = 0.79). Differences in parent education, sex, proportion of racial minorities, and reason for referral were not observed (p > 0.05).

Table 1. Demographic Characteristics

Variable                     Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)
Age (months)*                130.71 (35.06)              134.65 (34.9)                 107.83 (26.48)
Parent Education Level       13.10 (2.59)                13.07 (2.58)                  13.3 (2.64)
Sex (n, %)
  Male                       127 (62.25)                 111 (87.40)                   16 (12.60)
  Female                     77 (37.75)                  63 (81.82)                    14 (18.18)
* p < .05

Table 1.
(cont’d) Variable Race (n, %) White Other Reason for Referral (n, %) TBI Anoxia Vascular Tumor ADHD Emotional ASD Complete Valid Invalid Sample Performance Performance (n = 204) (n = 174) (n = 30) 128 (62.75) 110 (63.22) 18 (60.00) 76 (37.25) 64 (36.78) 12 (40.00) 41 (20.10) 39 (22.41) 2 (6.67) 3 (1.47) 3 (1.72) 0 (0.00) 5 (2.45) 5 (2.87) 0 (0.00) 2 (0.98) 2 (1.15) 0 (0.00) 34 (16.67) 28 (16.09) 6 (20.00) 10 (4.90) 9 (5.17) 1 (3.33) 2 (0.98) 2 (1.15) 0 (0.00) Speech/Language 6 (2.94) 4 (2.30) 2 (6.67) Cerebral Palsy 13 (6.37) 9 (5.17) 4 (13.33) Myelomeningocele/Hydrocephalus 5 (2.45) 2 (1.15) 3 (10.00) Other * p < .05 83 (40.69) 71 (40.80) 12 (40.00) 42 Procedures Testing Participants presenting for neuropsychological evaluation during the time span ranging from September 2018 to November 2022 followed typical procedures for neuropsychological assessment. The outpatient neuropsychological evaluation typically lasted a single day. Each evaluation started with a clinical intake interview conducted by either a board-certified clinical neuropsychologist, a postdoctoral fellow, or a practicum student under the direct supervision of either the clinical neuropsychologist or the postdoctoral fellow. Following the intake interview, participants were given a break while the assessment team prepared the evaluation room. Neuropsychological test administration was conducted by either a board-certified clinical neuropsychologist, a postdoctoral fellow, a psychometrist, or a practicum student under direct supervision. Testing lasted from the morning to early afternoon with a lunch break in the middle. Each evaluation concluded with a feedback session with the family to discuss findings and recommendations. Measures Children selected for participation in this study were administered several measures based on the referral question that brought them in for evaluation and their unique presentations. However, all individuals included for analysis were given the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014) and the Memory Validity Profile (MVP; Sherman & Brooks, 2015b). The indices from these measures are discussed below. Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V) The WISC-V (Wechsler, 2014) is a measure of cognitive ability developed for use with children ages 6-16. The WISC-V comprises 21 subtests that measure ability across various 43 cognitive domains. Ten subtests are considered primary subtests as the make up the primary index scores: Verbal Comprehension, Visual Spatial, Fluid Reasoning, Working Memory, and Processing speed. Additionally, the first seven primary subtests make up the Full-Scale Intelligence Quotient. The WISC-V is one of the most widely administered measures of cognitive ability in neuropsychological assessments. Full-Scale Intelligence Quotient. The Full-Scale Intelligence Quotient (FSIQ) is a measure of overall intellectual functioning. The FSIQ comprises seven subtests and is the most reliable index provided by the instrument. The seven subtests included in the FSIQ are Block Design, Similarities, Matrix Reasoning, Digit Span, Coding, Vocabulary, and Figure Weights. The FSIQ is a standardized composite score with a mean of 100 and a standard deviation of 15. Reliability. Split-half reliability coefficients for the FSIQ range from 0.96-0.97 across age ranges. Overall, the average split-half reliability average is r = .96. 
Standard errors of measurement across ages range from 2.6-3.0 with an overall average standard error of measurement of 2.9. Test-retest reliability was calculated as r = .91 (.92 when correcting for sample variability). The standard difference between administrations is reported as d = 0.44. Validity. The FSIQ score of the WISC-V exhibits high levels of concurrent validity as evidenced the correlations of the index with FSIQ measures from other Wechsler intelligence instruments. The correlation between the WISC-V FSIQ measure and the WISC-IV FSIQ measure is r = 0.81 (r = 0.86 when correcting for sample variability). The correlation between the FSIQ measures of the WISC-V and the WAIS-IV is similarly strong at r = 0.84 (r = 0.89 when correcting for sample variability). 44 Application. The FSIQ metric is a commonly used metric of overall intellectual ability in both clinical and research settings. For the purposes of this study, differences in FSIQ between participants in the valid and invalid performance groups were examined to determine whether participants intellectual functioning was related directly to performance on the various PVT measures used in this study. General Ability Index. The General Ability Index (GAI) is one of 13 index scores provided by the WISC-V. The GAI score is like the FSIQ score as both are measures of general cognitive ability. The GAI differs from the FSIQ by omitting subtests of working memory and processing speed. The GAI comprises five of the seven subtests that make up the FSIQ with Digit Span and Coding removed. This measure is interesting in the assessment of Digit Span PVTs as it removes the shared variance between the measure and Digit Span performance. The GAI is a standardized composite score with a mean of 100 and a standard deviation of 15. Reliability. Split-half reliability coefficients for the GAI range from 0.95-0.98 across age ranges. Overall, the average split-half reliability average is r = 0.96. Standard errors of measurement across ages range from 2.36-3.35 with an overall average standard error of measurement of 3.07. Test-retest reliability was calculated as r = 0.89 (0.91 when correcting for sample variability). The standard difference between administrations is reported as d = 0.41. Validity. The GAI score of the WISC-V demonstrates high concurrent validity with previous Wechsler instruments. The correlation between GAI scores on the WISC-V and the WISC-IV is r = 0.80 (r = 0.85 when correcting for sample variability). Similarly, the 45 correlation between GAI scores on the WISC-V and the WAIS-IV is strong at r = 0.74 (r = 0.83 when correcting for sample variability. Application. The GAI metric provides an estimate of overall cognitive ability that is less reliant on processing speed and working memory than the FSIQ score. Additionally, unlike the FSIQ score, the Digit Span subtest is not included in the calculation of the GAI. The GAI measure was used to determine whether intellectual functioning is directly related to the various PVT measures in this study once the contamination from the Digit Span subtest is removed. Digit Span (Dependent Variables). The Digit Span subtest of the WISC-V is a measure of working memory. During Digit Span administration, the examiner reads the participant numbers of increasing length, and the participant is then tasked with repeating them immediately from memory. The Digit Span subtest has three conditions: 1) Forward, 2) Backward, and 3) Sequencing. 
The Forward task requires participants to repeat the numbers in the same order they were read, the Backward task requires recalling numbers in reverse order, and the Sequencing task requires recalling numbers in ascending order. Each item of the subtest includes two trials of equal length, and the number of digits in each item increases until the participant fails two strings within an item. Raw scores are calculated for each task, and these scores are then summed and converted to an age corrected scaled score (ACSS; M = 10, SD = 3). Reliability. Split-half reliability coefficients for the Digit Span ACSS range from 0.89- 0.93 with an average reliability coefficient of rxx a = 0.91. Internal consistency for special groups designated in the WISC-V manual range from 0.83-0.99. Across age groups, the standard error of measurement for the Digit Span subtest ranges from 0.79-0.99 with an 46 overall average of 0.88. Test-retest reliability across all ages is reported as r = 0.79 (r = 0.82 when corrected for sample variability). Standard difference between first and second administration was d = 0.10. Validity. The Digit Span subtest demonstrates strong concurrent validity with Digit Span scores from other Wechsler instruments. The correlation between Digit Span ACSS scores on the WISC-V and the WISC-IV is strong at r = 0.60 (r = 0.65 when correcting for sample variability). The correlation between Digit Span ACSS scores on the WISC-V and the WAIS-IV is stronger than the WISC-IV at r = 0.76 (r = 0.80 when correcting for sample variability). This higher correlation is expected as both the WISC-V and the WAIS-IV saw the introduction of the sequencing condition. Application. The utility of the ACSS score as a PVT was examined by establishing cutoff scores that achieve 90% specificity categorizing MVP performance. Furthermore, positive and negative predictive power of the instrument was calculated. RDS (dependent). Reliable Digit Span (RDS) is calculated by adding the length of the longest item completed perfectly in the Forward condition to the length of the longest item completed perfectly in the Backward condition. This sum is then compared to cutoff scores to make determinations of performance validity. Although RDS is rarely used in pediatric populations, the necessary metrics to calculate the score are present on the WISC-V. Previous studies of the RDS metric in pediatric populations using the WISC-IV demonstrated cutoff scores of ≤ 5 digits held reliably (Kirkwood et al., 2011) and ≤ 3 digits held (Welsh et al., 2012). 47 Application. The utility of the RDS score as a PVT was examined by establishing cutoff scores that achieve 90% specificity categorizing MVP performance. Furthermore, positive and negative predictive power of the instrument were calculated. RDS-R Reliable Digit Span – Revised (RDS-R) is an expansion of the RDS metric originally introduced by Spencer et al. (2010). RDS-R is calculated by adding the length of the longest trial consecutively completed in the sequencing condition to the traditional RDS score. This measure has yet to be examined empirically in pediatric populations. As such, reliability and validity information are not yet available. Application. The utility of the RDS-R score as a PVT was examined by establishing cutoff scores that achieve 90% specificity categorizing MVP performance. Furthermore, positive and negative predictive power of the instrument were calculated. Memory Validity Profile (MVP) (Independent Variable). 
The Memory Validity Profile (MVP) is a performance validity test developed for use with individuals ages 5 to 21. The MVP is normed on both a standardization sample based on the United States census and clinical samples. The MVP comprises both a visual and verbal memory task, and each provides a cutoff score based on age. Additionally, an overall cutoff score is also provided based on age. 48 Reliability. Internal consistency is reported for each score across the standardization sample, clinical sample, and invalid performance sample in which the measure was normed. The MVP manual notes that the internal consistency of the standardization sample is low due to a large majority of participants completing the assessment with 100% accuracy. The resulting coefficient alphas in the standardization sample ranged from unacceptable to poor (Visual: α = .46, Verbal: α = .61, Total α = .64). In the clinical sample, internal consistency fell within the good range (Visual: α = .85, Verbal: α = .84, Total α = .89). In the invalid performance sample, internal consistency ranged from acceptable to good (Visual: α = .79, Verbal α = .78, Total: α = .88). Test-retest reliability for the MVP is presented using Spearman rho correlations as the distribution of scores is not normal. Test-retest reliability for the Visual condition is reported as r = .51 with a standard error of measurement of .3. Test-retest reliability for the verbal condition is reported as r = .36 with a standard error of measurement of .8. The test-retest reliability for total score is r = .41 with a standard error of measurement of .9. Validity. The MVP is a measure of performance validity that passes as a measure of visual and verbal memory. Despite having face validity as a memory measure, correlations with actual measures of memory should be low to indicate that the task does not measure memory. During standardization, the MVP was co-normed with the Child and Adolescent Memory Profile (ChAMP; Sherman & Brooks, 2015a). The correlation between the MVP total score and ChAMP total score was very low at r = 0.17. Additionally, the measure was compared to performance on the WISC-IV to ensure that the test was not a measure of intelligence. The correlation between MVP performance 49 and WISC-IV FSIQ score was low at r = 0.26. Low to moderate correlations were also found with executive function measures and achievement measures. In addition to discriminant validity, the MVP must demonstrate concurrent validity with an established PVT. The MVP was compared to performance on the Test of Memory Malingering (TOMM; Tombaugh, 1996) during development. MVP total score correlated highly with both TOMM Trial 1 (r = 0.83) and TOMM Trial 2 (r = 0.81) Application. Performance on the MVP can be broken into pass and fail distinctions based on raw score and age of the participant. However, in previous studies that have used the MVP as a criterion measure, problems have arisen using the cutoff scores defined by the MVP manual. Wilson and Lesica (2021) examined the failure rate of the MVP in a mixed clinical sample from the same rehabilitation hospital used in this study. Of a sample of 122 children, only two children failed to pass the MVP. As a result, an experimental cutoff score of 30 was used meaning that any score under 31 was considered a failure. 
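Stated as a rule, the resulting group assignment can be sketched in Python as follows; the participant records are hypothetical, and the cutoff is the experimental value described above.

# Minimal sketch of group assignment under the experimental cutoff of 30:
# an MVP Total score of 30 or lower (i.e., under 31) is treated as a failure.
MVP_EXPERIMENTAL_CUTOFF = 30

def assign_group(mvp_total):
    return "invalid performance" if mvp_total <= MVP_EXPERIMENTAL_CUTOFF else "valid performance"

sample = [{"id": 1, "mvp_total": 32}, {"id": 2, "mvp_total": 28}, {"id": 3, "mvp_total": 31}]
print({record["id"]: assign_group(record["mvp_total"]) for record in sample})
# {1: 'valid performance', 2: 'invalid performance', 3: 'valid performance'}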
Wilson and Lesica (2021) also found that at this experimental cutoff, age was a significant factor in determining who passed the MVP with children ages 6-10 failing at much higher rates than children above the age of 10. Group designation in the current study was set using the experimental cutoff established by Wilson and Lesica (2021). Additionally, a second, exploratory analysis was conducted using a more lenient cutoff score of 29 for children ages 6-10. Hypotheses and analyses The proposed study will assess the utility of three Digit Span indices from the WISC-V in the assessment of performance validity during neuropsychological testing. These indices will be analyzed separately, and the research hypotheses presented are separated accordingly. 50 Question 1 Can the WISC-V Digit Span ACSS reliably detect suboptimal effort in a clinical pediatric sample? Hypothesis 1a. Cutoff scores for the ACSS index that reliably distinguish groups with 90% specificity, 40% sensitivity, and an area under the curve metric of at least 0.70 will be established for the sample Analysis. Sensitivity, specificity, and AUC will be calculated at each score observed in the sample to determine which score will be used as a cutoff. The cutoff score will be the lowest score that yields at least 90% specificity while maintaining optimal sensitivity. Hypothesis 1b. ACSS will reliably distinguish participants who failed the MVP from participants who passed the MVP. Analysis. Logistic regression will be conducted to determine whether failure of the ACSS metric based on the obtained cutoff score meaningfully predicts failure of the MVP. Hypothesis 1c. Positive predictive power and negative predictive power across estimated base rates of suboptimal effort in children will provide evidence for the utility of ACSS as a viable PVT in pediatric populations. Analysis. Positive predictive power and negative predictive power for the ACSS metric will be calculated across estimates of base rates in pediatric populations. Predictive power will be calculated using the 15.57% failure base rate obtained by Wilson and Lesica (2021) Rationale. The Digit Span subtest provides the ACSS with typical scoring. This score is norm referenced and factors in performance of other children within a child’s age group. As such, this score provides potentially the soundest performance validity information 51 across age ranges. The ACSS has been demonstrated to be a useful PVT in adult populations (Iverson & Franzen, 1994), and researchers have demonstrated promise using older versions of the WISC (Kirkwood et al., 2011; Welsh et al., 2012). If the ACSS of the WISC-V Digit Span task provides meaningful performance validity information, clinicians are saved time and resources as no additional metric needs to be calculated. Question 2 Can the WISC-V RDS index reliably detect suboptimal effort in a clinical pediatric sample? Hypothesis 2a. Cutoff scores for the RDS metric can be established that achieve 90% specificity, 40% sensitivity, and an AUC metric of 0.70. Analysis. Sensitivity, specificity, and AUC will be calculated at each observed score in the sample to determine if an acceptable cutoff score exists. The cutoff score will be the lowest score that yields at least 90% specificity while maintaining optimal sensitivity. Hypothesis 2b. RDS score will reliably distinguish participants who failed the MVP from participants who passed the MVP. Analysis. 
Logistic regression will be conducted to determine whether failure of the RDS metric based on the obtained cutoff score meaningfully predicts failure of the MVP. Hypothesis 2c: Positive predictive power and negative predictive power across estimated base rates of suboptimal effort in children will provide evidence for the utility of the RDS score as a PVT in pediatric samples. Analysis. Positive predictive power and negative predictive power for the RDS metric will be calculated across estimates of base rates in pediatric populations. Predictive power will be calculated using the 15.57% failure base rate obtained by Wilson and Lesica (2021). 52 Rationale. RDS has been used extensively in adult populations since the development of the measure by (Greiffenstein et al., 1994). Despite this extensive study, the measures application in pediatric populations is limited to a few studies (Kirkwood et al., 2011; Welsh et al., 2012). Furthermore, to date, no published studies have examined the use of RDS on the most recent version of the WISC with a simulation or known groups design. The current study aims to provide a known groups design to examine the clinical utility of the RDS measure as an embedded PVT. Question 3 Can the WISC-V RDS-R index reliably detect suboptimal effort in a clinical pediatric sample? Hypotheses 3a. A cutoff score for the RDS-R metric can be calculated that achieves at least 90% specificity, 40% sensitivity, and an AUC metric of 0.70. Analysis. Sensitivity, specificity, and AUC were calculated at each RDS-R score observed in the clinical sample. The cutoff was set to the lowest score that yielded at least 90% specificity while maintaining optimal sensitivity. Hypothesis 3b. RDS-R score will reliably distinguish participants who failed the MVP from participants who passed the MVP. Analysis. Logistic regression was conducted to determine whether failure of the RDS-R metric based on the obtained cutoff score meaningfully predicted MVP failure. Hypothesis 3c. Positive predictive power and negative predictive power across estimated base rates of suboptimal effort in children will provide evidence for the utility of the RDS score as a PVT in pediatric samples. 53 Analysis. Positive predictive power and negative predictive power for the RDS-R metric will be calculated across estimates of base rates in pediatric populations. The base rate estimates calculated will range from 5-40% in consistency with Kirkwood et al. (2011). Rationale. The most recent revisions of the Wechsler adult and child intelligence scales saw the addition of the Sequencing condition to the Digit Span subtest. This provided an opportunity to expand the RDS measure by adding the length of the longest trials consecutively completed to the traditional RDS score to create the RDS-R. 54 Memory Validity Profile (MVP) Performance by Group RESULTS Table 2 presents the average Memory Validity Profile (MVP) scores of both the valid and invalid performance groups. The invalid performance group underperformed on the MVP compared to the valid performance group on the Visual score (p = .006, d = 1.33), Verbal score (p < .001, d = 3.09), and Total score (p < .001, d = 3.03). These differences remained statistically significant after Bonferroni correction for multiple comparisons. Table 2. 
MVP Scores Across Groups

Variable     Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)   t      p        d
MVP Visual   15.80 (0.75)                15.94 (0.24)                  15.03 (1.69)                   2.92   0.006    1.33
MVP Verbal   15.43 (1.65)                15.94 (0.24)                  12.50 (2.87)                   6.55   <0.001   3.09
MVP Total    31.23 (2.14)                31.87 (0.33)                  27.47 (3.76)                   6.42   <0.001   3.03

Digit Span Performances by Group. Table 3 presents the Digit Span index scores of interest obtained in this sample. The invalid performance group underperformed compared to the valid performance group on the Digit Span Age Corrected Scaled Score (ACSS; p < .001, d = 1.21), Reliable Digit Span (RDS; p < .001, d = 1.50), and Reliable Digit Span Revised (RDS-R; p < .001, d = 1.67). Again, these differences remained statistically significant after Bonferroni correction for multiple comparisons.

Table 3. Digit Span Scores Across Groups

Variable   Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)   t      p        d
ACSS       6.80 (3.27)                 7.33 (3.09)                   3.70 (2.49)                    7.10   <0.001   1.21
RDS        6.66 (2.05)                 7.06 (1.85)                   4.33 (1.63)                    8.30   <0.001   1.50
RDS-R      9.84 (3.37)                 10.56 (2.94)                  5.70 (2.71)                    8.97   <0.001   1.67

Intellectual Index Scores by Performance Group. Furthermore, significant intellectual ability score differences were observed on the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V). The invalid performance group scored significantly lower than the valid performance group on the Verbal Comprehension Index (VCI; p < .001, d = 1.08), Visual Spatial Index (VSI; p < .001, d = .96), Fluid Reasoning Index (FRI; p < .001, d = .87), Working Memory Index (WMI; p < .001, d = 1.18), and Processing Speed Index (PSI; p = .002, d = .72). Furthermore, the invalid performance group scored significantly lower on both intellectual composites of interest in this study: the General Ability Index (GAI; p < .001, d = 1.03) and Full-Scale Intelligence Quotient (FSIQ; p < .001, d = 1.13). Average scores for these indices are provided in Table 4.

Table 4. WISC-V Composite Scores

Variable   Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)   t      p        d
FSIQ       82.27 (16.16)               84.76 (15.36)                 67.83 (12.91)                  6.44   <0.001   1.13
GAI        85.20 (15.66)               87.43 (15.14)                 72.27 (12.13)                  6.08   <0.001   1.03
VCI        87.24 (15.79)               89.58 (15.19)                 73.67 (12.04)                  6.41   <0.001   1.08
VSI        89.43 (15.72)               91.54 (14.92)                 77.20 (14.83)                  4.89   <0.001   0.96
FRI        86.27 (16.37)               88.29 (15.73)                 74.60 (15.28)                  4.51   <0.001   0.87
WMI        84.53 (16.56)               87.18 (15.77)                 69.17 (12.23)                  7.11   <0.001   1.18
PSI        84.25 (16.46)               85.94 (15.57)                 74.43 (18.23)                  3.26   0.002    0.72

The Influence of Age on IQ Scores. Following the observation of significant differences in age and intellectual ability, correlation coefficients were calculated to better understand this relationship. The relationship between age and each of the WISC-V Composite Scores was negligible. These values are provided in Table 5. Figure 1 displays the relationship between age and FSIQ.

Table 5. Correlations Between Age and WISC-V Scores

       FSIQ    GAI     VCI     VSI     FRI    WMI    PSI
Age    -0.03   -0.06   -0.04   -0.12   0.02   0.13   -0.07

Figure 1. Age vs. FSIQ

Question 1: Digit Span Age Corrected Scaled Scores Prediction of Suboptimal Effort

Each Digit Span score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 6.

Hypothesis 1a

A cutoff score for the ACSS measure was established that met and exceeded the sensitivity, specificity, and area under the curve (AUC) thresholds outlined in this study.
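The selection rule behind Table 6 (and Tables 8 and 10 below) follows the analysis plan: sensitivity and specificity are computed at every score observed in the sample, and the chosen cutoff is the lowest score that holds specificity at or above 90% while preserving the best available sensitivity. A minimal Python sketch of that rule, using hypothetical scores and criterion labels rather than study data, is shown below.

# Sketch of the cutoff search: flag scores at or below each candidate cutoff,
# compute sensitivity and specificity against the criterion PVT, and keep the
# lowest cutoff that achieves >= 90% specificity with the best sensitivity.
def sweep_cutoffs(scores, failed_criterion):
    results = []
    for cutoff in sorted(set(scores)):
        flagged = [score <= cutoff for score in scores]
        tp = sum(f and m for f, m in zip(flagged, failed_criterion))
        fp = sum(f and not m for f, m in zip(flagged, failed_criterion))
        fn = sum(not f and m for f, m in zip(flagged, failed_criterion))
        tn = sum(not f and not m for f, m in zip(flagged, failed_criterion))
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        results.append((cutoff, sensitivity, specificity))
    qualifying = [r for r in results if r[2] >= 0.90]
    # max() returns the first maximum, so ties in sensitivity resolve to the lowest cutoff
    return max(qualifying, default=None, key=lambda r: r[1])

scores = [3, 5, 7, 9, 10, 12, 2, 8, 11, 6]                  # hypothetical index scores
failed = [True, True, False, False, False, False, True, False, False, False]
print(sweep_cutoffs(scores, failed))                        # (5, 1.0, 1.0) for this toy data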
A cutoff score of ≤3 yielded a specificity metric of 91%, a sensitivity metric of 60%, and an AUC of 0.75 (0.66 – 0.84). Therefore, a cutoff score of ≤3 results in an embedded PVT with strong psychometric properties. 58 Table 6. Various Digit Span ACSS Cutoffs ACSS Cutoff Sensitivity Specificity AUC (95% CI) LR ≤1 ≤2 ≤3 ≤4 ≤5 ≤6 ≤7 ≤8 ≤9 ≤10 ≤11 ≤12 ≤13 ≤14 ≤15 ≤16 ≤17 ≤18 0.27 0.37 0.60 0.67 0.70 0.83 0.90 0.97 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.97 0.62 (0.53 - 0.70) 9.00 0.95 0.66 (0.57 - 0.75) 7.40 0.91 0.75 (0.66 - 0.85) 6.67 0.83 0.75 (0.66 - 0.84) 3.94 0.76 0.73 (0.64 - 0.83) 2.92 0.57 0.70 (0.62 - 0.78) 1.93 0.46 0.68 (0.61 - 0.75) 1.67 0.30 0.63 (0.59 - 0.68) 1.39 0.22 0.61 (0.58 - 0.64) 1.28 0.14 0.57 (0.55 - 0.60) 1.16 0.10 0.55 (0.53 - 0.57) 1.11 0.05 0.53 (0.51 - 0.54) 1.05 0.04 0.52 (0.51 - 0.53) 1.04 0.01 0.51 (0.50 - 0.51) 1.01 0.01 0.51 (0.50 - 0.51) 1.01 0.01 0.51 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.00 0.50 (0.50 - 0.50) 1.00 *LR = Likelihood ratio 59 Hypothesis 1b Logistic regression was used to better understand the relationship between passing the Digit Span PVT with a cutoff score of ≤3 and passing the MVP with a cutoff score of ≤30. Passing the Digit Span PVT significantly predicted whether participants passed the MVP. The participants who failed the Digit Span PVT were 14.8 times more likely to fail the MVP compared to participants who exceeded the cutoff score of 3 (p < .001). These results can be found in Table 7. Table 7. Predicting MVP Failure Based on ACSS Failure Characteristic OR1 95% CI1 p-value (Intercept) 0.08 0.04, 0.13 <0.001 ACSS Group 14.8 6.18, 37.3 <0.001 1 OR = Odds Ratio, CI = Confidence Interval Hypothesis 1c Positive and negative predictive power were calculated at a base rate of 15.57% according to the results of Wilson and Lesica (2021). Positive predictive power at this cutoff was 55.14% and negative predictive power was 92.50% indicating that an ACSS cutoff of ≤3 would predict valid performance with 92.50% accuracy and invalid performance with 55.14% accuracy in similar mixed clinical samples. 60 Question 2: Reliable Digit Span Prediction of Suboptimal Effort Each Reliable Digit Span (RDS) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 8. Hypothesis 2a A cutoff score for the RDS metric was established that exceeded the outlined sensitivity, specificity, and AUC thresholds. A cutoff score of ≤4 resulted in specificity of 94%, sensitivity of 63%, and an AUC of 0.79 (0.70 – 0.88). These results indicate that an RDS cutoff of ≤4 resulted in an embedded PVT with strong psychometric properties. Table 8. Various RDS Cutoffs RDS Cutoffs Sensitivity Specificity AUC (95% CI) LR ≤2 ≤3 ≤4 ≤5 ≤6 ≤7 ≤8 ≤9 ≤10 ≤11 ≤12 ≤13 0.10 0.37 0.63 0.80 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.54 (0.49 - 0.60) 10.00 0.98 0.67 (0.58 - 0.76) 18.50 0.94 0.79 (0.70 - 0.88) 10.50 0.82 0.77 (0.69 - 0.86) 4.44 0.63 0.73 (0.66 - 0.73) 2.24 0.40 0.70 (0.66 - 0.73) 1.67 0.16 0.58 (0.55 - 0.61) 1.19 0.06 0.53 (0.51 - 0.55) 1.06 0.03 0.52 (0.50 - 0.53) 1.03 0.03 0.51 (0.50 - 0.56) 1.03 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 61 Table 8. 
(cont’d) RDS Cutoffs Sensitivity Specificity AUC (95% CI) LR ≤14 ≤15 1.00 1.00 0.01 0.50 (0.50 - 0.51) 1.01 0.00 0.50 (0.50 - 0.50) 1.00 Hypothesis 2b Logistic regression was used to better understand the relationship between passing the RDS PVT with a cutoff score of ≤4 and passing the MVP with a cutoff score of ≤30. Passing the RDS metric significantly predicted whether participants passed the MVP. The participants who failed the Digit Span PVT were 28.3 times more likely to fail the MVP compared to participants who exceeded the cutoff score of 4 (p < .001). These results can be found in Table 9. Table 9. Predicting MVP Failure Based on RDS Failure Characteristic OR1 95% CI1 p-value (Intercept) 0.07 0.03, 0.12 <0.001 RDSGroup 28.3 11.0, 79.0 <0.001 1 OR = Odds Ratio, CI = Confidence Interval Hypothesis 2c Positive predictive power at this cutoff was 65.94% and negative predictive power was 93.23% indicating that an RDS cutoff of ≤4 would predict valid performance with 93.23% accuracy and invalid performance with 65.94% accuracy in similar mixed clinical samples. 62 Question 3: Reliable Digit Span Revised Prediction of Suboptimal Effort Each Reliable Digit Span Revised (RDS-R) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 10. Hypothesis 3a A cutoff score was established for the RDS-R metric that exceeded the outlined sensitivity, specificity, and AUC thresholds. A cutoff score of ≤6 resulted in a specificity of 94%, a sensitivity of 63%, and an AUC metric of 0.79 (0.70 – 0.86). These results indicate that a cutoff score of ≤6 on the RDS-R metric results in an embedded PVT with strong psychometric properties. Table 10. Various RDS-R Cutpoints RDS Cutoff Sensitivity Specificity AUC (95% CI) LR ≤2 ≤3 ≤4 ≤5 ≤6 ≤7 ≤8 ≤9 ≤10 ≤11 0.07 0.27 0.47 0.53 0.63 0.67 0.83 0.87 0.97 1.00 0.99 0.53 (0.48 - 0.57) 7.00 0.98 0.62 (0.54 - 0.71) 13.50 0.98 0.72 (0.63 - 0.82) 23.50 0.96 0.75 (0.65 - 0.84) 13.25 0.94 0.79 (0.70 - 0.86) 10.50 0.88 0.77 (0.68 - 0.86) 5.58 0.77 0.80 (0.73 - 0.88) 3.61 0.63 0.75 (0.68 - 0.82) 2.35 0.51 0.74 (0.69 - 0.79) 1.98 0.36 0.68 (0.64 - 0.71) 1.56 63 Table 10. (cont’d) RDS Cutoff Sensitivity Specificity AUC (95% CI) LR ≤12 ≤13 ≤14 ≤15 ≤16 ≤17 ≤18 ≤19 ≤20 ≤21 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.25 0.63 (0.59 - 0.66) 1.33 0.13 0.56 (0.54 - 0.59) 1.15 0.08 0.54 (0.52 - 0.56) 1.09 0.05 0.52 (0.51 - 0.53) 1.05 0.03 0.50 (0.50 - 0.51) 1.03 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.00 0.50 (0.50 - 0.50) 1.00 Hypothesis 3b Logistic regression was used to better understand the relationship between passing the RDS-R PVT with a cutoff score of ≤6 and passing the MVP with a cutoff score of ≤30. Passing the RDS-R metric significantly predicted whether participants passed the MVP. The participants who failed the Digit Span PVT were 25.6 times more likely to fail the MVP compared to participants who exceeded the cutoff score of 6 (p < .001). These results can be found in Table 11. 64 Table 11. Predicting MVP Failure Based on RDS-R Failure Characteristic OR1 95% CI1 p-value (Intercept) 0.07 0.03, 0.12 <0.001 RDS-RGroup 25.6 10.1, 69.9 <0.001 1 OR = Odds Ratio, CI = Confidence Interval Hypothesis 3c Positive and negative predictive power were calculated at 15.57% according to the results of Wilson and Lesica (2021). 
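Predictive power at an assumed base rate follows the standard adjustment of sensitivity and specificity. The Python sketch below applies that adjustment using the rounded values reported above for the revised index and the 15.57% base rate; small discrepancies from the values reported next reflect rounding of sensitivity and specificity to two decimals.

# Base-rate adjustment for predictive power, using rounded sensitivity and
# specificity; results therefore differ slightly from the reported values.
def predictive_power(sensitivity, specificity, base_rate):
    ppv = (sensitivity * base_rate) / (
        sensitivity * base_rate + (1 - specificity) * (1 - base_rate))
    npv = (specificity * (1 - base_rate)) / (
        (1 - sensitivity) * base_rate + specificity * (1 - base_rate))
    return ppv, npv

ppv, npv = predictive_power(sensitivity=0.63, specificity=0.94, base_rate=0.1557)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # roughly 66% and 93%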
Positive predictive power at this cutoff was 64.95% and negative predictive power was 93.51% indicating that an RDS-R cutoff of ≤6 would predict valid performance with 93.51% accuracy and invalid performance with 64.95% accuracy in similar mixed clinical samples. Exploratory Analysis: Alternative MVP Cutoff for Performance Groups The difference between the valid performance and invalid performance groups based on age raised questions about the flat cutoff score used on the MVP in the establishment of groups. Much like the results of Wilson and Lesica (2021), the sample in this study saw high rates of MVP failure in participants between the ages of 6 and 10. In their study, Wilson and Lesica observed a failure rate of approximately 21% in children ages 6 to 10 when using a cutoff score of 30. Similarly, children ages 6-10 in this study failed at a rate of 22% when using a cutoff score of 30. As a result, group selection may be more heavily influenced by age than performance validity at the younger end of the sample. 65 To address this concern, a new split cutoff was adopted for exploratory analysis. For this analysis, group selection was made using a cutoff score of 30 for children ages 11 and older, and a cutoff score of 29 was used for children under the age of 11. This reduced the failure rate of the 6–10-year-old portion of the sample to 11%. Table 12 shows the MVP passing rate across age groups using the experimental cutoff of 30, and Table 13 shows the MVP passing rate across age groups when a split cutoff is adopted. For the purposes of this exploratory analysis, a complication arises in the calculation of predictive power. Wilson and Lesica (2021) did not utilize this split cutoff, and as a result, base rates for failure at this level were not available for this sample. As a result, the base rate percentage obtained by Wilson and Lesica (2021) are no longer applicable when redefining group membership, and the exploratory analysis will omit the calculation of predictive power. Table 12. Passing Rates by Age Group When Using A Cutoff Score Of 30 Age Group N Passing Rate 6-10 109 77.98% 11-16 95 93.68% Table 13. Passing Rates by Age Group When Exploratory Split Cutoff Scores Are Used Age Group N Passing Rate 6-10 109 88.99% 11-16 95 93.68% 66 Demographics Descriptive statistics for demographic variables are presented in Table 14. Despite adopting a more lenient cutoff score for the younger children, the invalid performance group remained significantly younger than the valid performance group (p < 0.05, d = 0.49). Differences in parent education, sex, proportion of racial minorities, and reason for referral were not observed (p > 0.05). Table 14. Demographic Variables Across Groups Using Split Cutoffs Variable Age Complete Valid Invalid Sample Performance Performance (n = 204) (n = 186) (n = 18) 130.71 (35.06) 132.21 (35.11) 115.17 (31.33) Parent Education Level 13.10 (2.59) 13.11 (2.56) 13.00 (2.93) Sex (n, %) Male Female Race (n, %) White Other 127 (62.25) 119 (63.98) 8 (44.44) 77 (37.75) 67 (36.02) 10 (55.56) 128 (62.75) 118 (63.44) 10 (55.56) 76 (37.25) 68 (36.56) 8 (44.44) 67 Table 14. 
Table 15 presents the average Memory Validity Profile (MVP) scores of both the valid and invalid performance groups. The invalid performance group underperformed on the MVP compared to the valid performance group on the Visual score (p = .04, d = 1.55), Verbal score (p < .001, d = 5.17), and Total score (p < .001, d = 4.61). Differences in the Verbal and Total scores remained after Bonferroni correction for multiple analyses.

Table 15. MVP Scores Across Groups Using Split Cutoffs

Variable     Complete Sample (n = 204)   Valid Performance (n = 186)   Invalid Performance (n = 18)   t      p        d
MVP Visual   15.80 (0.75)                15.90 (0.35)                  14.83 (2.07)                   2.18   0.04     1.55
MVP Verbal   15.43 (1.65)                15.85 (0.45)                  11.06 (2.84)                   7.16   <0.001   5.17
MVP Total    31.23 (2.14)                31.75 (0.56)                  25.78 (4.07)                   6.23   <0.001   4.61

Table 16 presents the Digit Span index scores of interest obtained in this sample. The invalid performance group underperformed compared to the valid performance group on the Digit Span Age Corrected Scaled Score (ACSS; p < .001, d = 1.03), Reliable Digit Span (RDS; p < .001, d = 1.40), and Reliable Digit Span Revised (RDS-R; p < .001, d = 1.47). Again, these differences remained statistically significant after Bonferroni correction for multiple comparisons.

Table 16. Digit Span Scores Across Groups Using Split Cutoffs

Variable   Complete Sample (n = 204)   Valid Performance (n = 186)   Invalid Performance (n = 18)   t      p        d
ACSS       6.80 (3.27)                 7.09 (3.18)                   3.83 (2.75)                    4.72   <0.001   1.03
RDS        6.66 (2.05)                 6.89 (1.93)                   4.22 (1.73)                    6.17   <0.001   1.40
RDS-R      9.84 (3.37)                 10.25 (3.14)                  5.57 (2.89)                    6.37   <0.001   1.47

Furthermore, significant intellectual ability score differences were observed on the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V). The invalid performance group scored significantly lower than the valid performance group on the Verbal Comprehension Index (VCI; p < .007, d = 1.13), Visual Spatial Index (VSI; p < .007, d = 0.86), Fluid Reasoning Index (FRI; p < .007, d = 0.82), and Working Memory Index (WMI; p < .007, d = 1.05). Significant differences were also observed in both intellectual composites of interest in this study: the General Ability Index (GAI; p < .007, d = 1.00) and Full-Scale Intelligence Quotient (FSIQ; p < .007, d = 1.06). These differences remained significant after Bonferroni correction. Average scores for these indices are provided in Table 17.
Table 17. WISC-V Composite Scores Using Split Cutoffs

Variable   Complete Sample (n = 204)   Valid Performance (n = 186)   Invalid Performance (n = 18)   t      p        d
FSIQ       82.27 (16.16)               83.73 (15.64)                 67.33 (14.05)                  4.67   <0.001   1.06
GAI        85.20 (15.66)               83.72 (15.64)                 71.39 (12.76)                  4.72   <0.001   1.00
VCI        87.24 (15.79)               88.74 (15.24)                 71.78 (13.03)                  5.19   <0.001   1.13
VSI        89.43 (15.72)               90.60 (15.34)                 77.39 (14.85)                  3.59   0.002    0.86
FRI        86.27 (16.37)               87.43 (15.91)                 74.33 (16.75)                  3.18   0.005    0.82
WMI        84.53 (16.56)               86.00 (16.10)                 69.33 (13.66)                  4.86   <0.001   1.05
PSI        84.25 (16.46)               85.12 (15.85)                 75.22 (20.16)                  2.02   0.06     0.61

Alternative Cutoff Question 1: Digit Span Age Corrected Scaled Scores

Each Digit Span score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 18.

Alternative Cutoff Hypothesis 1a.

Using the more lenient split cutoff on the MVP for group assignment did not result in a cutoff score that met the specificity, sensitivity, and AUC thresholds outlined in this study. Specificity of 93% was achieved at a cutoff score of ≤2, but sensitivity fell slightly below the 40% threshold at 39%. The AUC for the Digit Span scaled score with a cutoff score of ≤2 was 0.66 (0.54 - 0.78). These results indicate that the Digit Span ACSS with a cutoff of ≤2 produced an embedded PVT metric with inadequate psychometric properties.

Table 18. Various Digit Span ACSS Cutoffs using Split Cutoffs

ACSS Cutoff   Sensitivity   Specificity   AUC (95% CI)         LR
≤1            0.33          0.96          0.65 (0.65 - 0.76)   8.25
≤2            0.39          0.93          0.66 (0.54 - 0.78)   5.57
≤3            0.56          0.87          0.71 (0.59 - 0.83)   4.31
≤4            0.61          0.79          0.70 (0.58 - 0.82)   2.90
≤5            0.67          0.73          0.70 (0.58 - 0.82)   2.48
≤6            0.78          0.54          0.66 (0.55 - 0.76)   1.70
≤7            0.89          0.44          0.66 (0.58 - 0.74)   1.59
≤8            0.94          0.28          0.61 (0.55 - 0.68)   1.31
≤9            1.00          0.20          0.60 (0.57 - 0.63)   1.25
≤10           1.00          0.13          0.57 (0.54 - 0.59)   1.15
≤11           1.00          0.09          0.55 (0.52 - 0.57)   1.10
≤12           1.00          0.05          0.52 (0.51 - 0.54)   1.05
≤13           1.00          0.04          0.52 (0.51 - 0.53)   1.04
≤14           1.00          0.01          0.51 (0.50 - 0.51)   1.01
≤15           1.00          0.01          0.51 (0.50 - 0.51)   1.01
≤16           1.00          0.01          0.51 (0.50 - 0.51)   1.01
≤17           1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤18           1.00          0.00          0.50 (0.50 - 0.50)   1.00

Alternative Cutoff Hypothesis 1b.

A cutoff score with adequate psychometric properties was not obtained from the ACSS metric when utilizing a split cutoff score with the MVP. However, a logistic regression was run using a cutoff score of ≤2, as this was the highest score to reach 90% specificity. Whether an individual passed the ACSS significantly predicted whether they passed the MVP. Participants who failed the Digit Span PVT were 8.44 times more likely to fail the MVP than participants who exceeded the cutoff score of 2 (p < .001). These results can be found in Table 19. While still significant, the use of a split cutoff score diminished the predictive power of the ACSS PVT.

Table 19. Predicting MVP Failure (Split Cutoff) Based on ACSS

Characteristic   OR¹    95% CI¹      p-value
(Intercept)      0.05   0.02, 0.09   <0.001
ACSSGroup        8.44   3.04, 24.2   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Alternative Cutoff Question 2: Reliable Digit Span

Each Reliable Digit Span (RDS) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 20.
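The cutpoint tables in this section (Tables 18 and 20, and Table 22 below) all follow the same construction: every observed score is treated as a candidate "fail at or below" cutoff and evaluated against the MVP-defined groups. A minimal sketch of that scan is shown below; the data and names are illustrative, the AUC line assumes scikit-learn, and the confidence-interval method used in the study is not reproduced here.

    # Minimal sketch of the cutoff scan behind the cutpoint tables: each observed
    # PVT score is treated as a candidate "<= cutoff" failure rule and scored
    # against MVP-defined groups. Data and names are hypothetical.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def scan_cutoffs(scores, invalid):
        """scores: PVT scores; invalid: 1 = failed the MVP criterion, 0 = passed."""
        scores, invalid = np.asarray(scores), np.asarray(invalid)
        rows = []
        for cutoff in np.unique(scores):
            flagged = scores <= cutoff
            sens = flagged[invalid == 1].mean()        # true positive rate
            spec = (~flagged)[invalid == 0].mean()     # true negative rate
            lr = sens / (1 - spec) if spec < 1 else float("inf")
            rows.append((cutoff, sens, spec, lr))
        # Overall discriminability; scores are negated because lower values
        # indicate invalid performance.
        auc = roc_auc_score(invalid, -scores)
        return rows, auc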
Alternative Cutoff Hypothesis 2a.

Using a more lenient cutoff score for group assignment in younger children still allowed an RDS cutoff score to be established that exceeded the outlined sensitivity, specificity, and AUC thresholds. An RDS cutoff score of ≤4 resulted in specificity of 91%, sensitivity of 67%, and an AUC of 0.79 (0.67 - 0.90). These results indicate that the same ≤4 cutoff score on the RDS index provides strong psychometric properties as an embedded PVT when a lower MVP cutoff score is used for group assignment.

Table 20. Various RDS Cutoffs using Split Cutoffs

RDS Cutoff   Sensitivity   Specificity   AUC (95% CI)         LR
≤2           0.17          0.99          0.58 (0.49 - 0.67)   17.00
≤3           0.39          0.96          0.67 (0.56 - 0.79)   9.75
≤4           0.67          0.91          0.79 (0.67 - 0.90)   7.44
≤5           0.72          0.78          0.75 (0.64 - 0.86)   3.27
≤6           0.83          0.60          0.72 (0.62 - 0.81)   2.07
≤7           1.00          0.37          0.69 (0.65 - 0.72)   1.59
≤8           1.00          0.15          0.58 (0.55 - 0.60)   1.18
≤9           1.00          0.06          0.53 (0.51 - 0.55)   1.06
≤10          1.00          0.03          0.52 (0.50 - 0.53)   1.03
≤11          1.00          0.03          0.51 (0.50 - 0.53)   1.03
≤12          1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤13          1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤14          1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤15          1.00          0.00          0.50 (0.50 - 0.50)   1.00

Alternative Cutoff Hypothesis 2b.

Logistic regression was used to better understand the relationship between passing the RDS PVT with a cutoff score of ≤4 and passing the MVP with the experimental split cutoff score. Whether an individual passed the RDS significantly predicted whether they passed the MVP. Participants who failed the Digit Span PVT were 19.9 times more likely to fail the MVP than participants who exceeded the cutoff score of 4 (p < .001). These results can be found in Table 21. While still significant, the use of a split cutoff score did reduce the predictive power of the RDS metric.

Table 21. Predicting MVP Failure (Split Cutoff) Based on RDS

Characteristic   OR¹    95% CI¹      p-value
(Intercept)      0.04   0.01, 0.07   <0.001
RDSGroup         19.9   6.86, 63.8   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Alternative Cutoff Question 3: Reliable Digit Span Revised

Each Reliable Digit Span Revised (RDS-R) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 22.

Alternative Cutoff Hypothesis 3a.

A cutoff score of ≤6 met the .90 specificity threshold and exceeded the .40 sensitivity threshold, with a specificity of 90% and a sensitivity of 67%. Additionally, the area under the curve was adequate (AUC = 0.78, 95% CI 0.67 - 0.90). This finding indicates that an RDS-R cutoff score of ≤6 remains a viable indicator of performance validity when more lenient MVP cutoff scores are used for children ages 6-10.
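Each AUC in these tables is reported with a 95% confidence interval. The text does not state how those intervals were computed; a case-resampling bootstrap is one common approach, and the sketch below illustrates that approach only, using hypothetical data and names.

    # Hypothetical sketch: a case-resampling bootstrap for a 95% CI around an AUC.
    # This is one common approach; the interval method actually used is not stated.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc_ci(scores, invalid, n_boot=2000, seed=0):
        scores, invalid = np.asarray(scores), np.asarray(invalid)
        rng = np.random.default_rng(seed)
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(scores), len(scores))   # resample cases
            if invalid[idx].min() == invalid[idx].max():
                continue                                      # need both groups present
            aucs.append(roc_auc_score(invalid[idx], -scores[idx]))
        return np.percentile(aucs, [2.5, 97.5])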
Table 22. Various RDS-R Cutpoints using Split Cutoffs

RDS-R Cutpoint   Sensitivity   Specificity   AUC (95% CI)         LR
≤2               0.11          0.99          0.55 (0.48 - 0.63)   11.00
≤3               0.28          0.97          0.62 (0.52 - 0.73)   9.33
≤4               0.50          0.96          0.73 (0.61 - 0.85)   12.50
≤5               0.50          0.92          0.71 (0.59 - 0.83)   6.25
≤6               0.67          0.90          0.78 (0.67 - 0.90)   6.70
≤7               0.67          0.84          0.76 (0.64 - 0.87)   4.19
≤8               0.83          0.73          0.78 (0.69 - 0.88)   3.07
≤9               0.83          0.59          0.71 (0.62 - 0.81)   2.02
≤10              0.94          0.48          0.71 (0.65 - 0.78)   1.81
≤11              1.00          0.33          0.67 (0.63 - 0.70)   1.49
≤12              1.00          0.24          0.62 (0.59 - 0.65)   1.32
≤13              1.00          0.12          0.56 (0.54 - 0.58)   1.14
≤14              1.00          0.08          0.54 (0.52 - 0.56)   1.09
≤15              1.00          0.04          0.52 (0.51 - 0.54)   1.04
≤16              1.00          0.03          0.52 (0.50 - 0.53)   1.03
≤17              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤18              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤19              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤20              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤21              1.00          0.00          0.50 (0.50 - 0.50)   1.00

Alternative Cutoff Hypothesis 3b.

Logistic regression was used to better understand the relationship between passing the RDS-R PVT with a cutoff score of ≤6 and passing the MVP using the experimental split cutoff score. Whether an individual passed the RDS-R significantly predicted whether they passed the MVP. Participants who failed the RDS-R PVT were 18.7 times more likely to fail the MVP than participants who exceeded the cutoff score of 6 (p < .001). These results can be found in Table 23. As with the RDS score, the experimental MVP cutoff reduced the predictive power of the RDS-R score.

Table 23. Predicting MVP Failure (Split Cutoff) Based on RDS-R

Characteristic   OR¹    95% CI¹      p-value
(Intercept)      0.04   0.01, 0.07   <0.001
RDS-RGroup       18.7   6.47, 59.5   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

DISCUSSION

This study sought to establish the utility of various Digit Span indices from the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V) as embedded performance validity indicators for psychological testing. The results indicate that the age-corrected scaled score (ACSS), Reliable Digit Span score (RDS), and Reliable Digit Span Revised score (RDS-R) all served as viable performance validity indicators when an experimental Memory Validity Profile (MVP) cutoff score of ≤30 was used in a mixed clinical sample. Each of the three metrics demonstrated a cutoff score that exceeded 90% specificity and met or exceeded 60% sensitivity. Additionally, each of these cutoff scores exceeded the 0.70 area under the curve (AUC) threshold established at the beginning of the study.

Similar to the concerns raised by Wilson and Lesica (2021), the experimental cutoff score of ≤30 resulted in a disproportionate number of younger children failing the MVP. As a result, this study also examined the utility of each Digit Span index when an age-based split cutoff score was used with the MVP. When group designation was based on this split cutoff, the psychometric properties of the ACSS score became inadequate. However, both the RDS and RDS-R validity indicators maintained strong psychometric properties at the same cutoff scores established in the initial analysis. These findings indicate that the RDS and RDS-R scores may have more clinical utility as embedded validity indicators than the ACSS score.

Logistic regressions indicated that failure of each of these indices greatly increased the likelihood of an individual failing the MVP. When a cutoff score of ≤30 was adopted for the MVP, participants who failed one of the Digit Span validity indicators were 14-28 times more likely to fail the MVP.
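These odds ratios come from simple binary logistic regressions of MVP group on PVT group (Tables 9, 11, 19, 21, and 23). As a rough illustration of that model, the sketch below assumes 0/1 pass/fail indicators and uses statsmodels as one possible tool; the study's actual software and variable names are not specified, and the data shown are hypothetical.

    # Illustrative sketch of the binary logistic regressions behind the odds
    # ratios reported in Tables 9, 11, 19, 21, and 23. Data are hypothetical.
    import numpy as np
    import statsmodels.api as sm

    def pvt_odds_ratio(mvp_failed, pvt_failed):
        """Both arguments are 0/1 arrays: 1 = failed, 0 = passed."""
        X = sm.add_constant(np.asarray(pvt_failed, dtype=float))
        fit = sm.Logit(np.asarray(mvp_failed, dtype=float), X).fit(disp=False)
        odds_ratios = np.exp(fit.params)        # intercept odds and PVT-group OR
        ci = np.exp(fit.conf_int())             # 95% CI on the odds-ratio scale
        return odds_ratios, ci, fit.pvalues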
The index with the strongest predictive ability was the RDS metric, with an odds ratio of 28.3. Adopting a split cutoff for the MVP based on age also resulted in each metric significantly predicting MVP failure. Although the ACSS score failed to reach adequate sensitivity, a cutoff score with adequate specificity was still associated with an 8.44-fold increase in the likelihood of failing the MVP. Both the RDS and RDS-R metrics saw a reduction in predictive power, but both remained significant predictors, with the RDS demonstrating an odds ratio of 19.9 and the RDS-R an odds ratio of 18.7. Each of the three Digit Span indices provided strong predictive power.

Due to the similarity of the current sample and the sample used by Wilson and Lesica (2021), positive and negative predictive power closely followed the sensitivity and specificity metrics obtained. In summary, these findings indicate that the obtained cutoffs would provide 92-94% accuracy in identifying valid performances and 55-66% accuracy in identifying invalid performances in a similar mixed clinical sample. Thus, using Digit Span as a measure of effort for clinically referred children may provide reliable information.

Limitations

Criterion Measure

Most known-groups design studies use stronger criteria for group selection. Best practice indicates that at least two validity indicators should be used for group assignment (Sherman et al., 2020; Slick et al., 1999). Individuals who fail two validity indicators are placed in the invalid performance group, and individuals who pass both indicators are assigned to the valid performance group. Individuals who pass one indicator and fail another are typically excluded from analysis, as the validity of their performance is less clear than that of other participants (Kirkwood et al., 2011). Unfortunately, the pre-existing nature of this sample did not allow an additional validity indicator to be added to the assessment battery given to each individual. The MVP does contain both a visual and a verbal condition, which could in principle be used as separate indicators. However, no experimental cutoffs have been established for the individual indices of the MVP, and the leniency of the manualized cutoffs presented additional problems, described below.

The MVP was the only standalone performance validity indicator administered to all participants in this sample. As a result, group assignment in this study relied entirely on the MVP Total Score, and with a single criterion measure, the strength of that measure must be examined. When the performance of the sample was analyzed, only two of the 204 participants included for analysis failed the MVP under its manualized cutoffs. This made group assignment based on the manualized cutoffs futile for experimental analysis. Instead, a more stringent, experimental cutoff score was used to assign participants to groups. This presented a problem similar to the one experienced by Wilson and Lesica (2021): a universal cutoff score resulted in a larger proportion of younger children failing the MVP. A split cutoff score was therefore adopted for the exploratory analysis, but group differences in age remained after modifying group assignment. The higher failure rates among younger children indicate that the MVP, despite its intentionally simple design, does not function uniformly across ages.
This may be due to differences in attention or in letter and number literacy, as the MVP stimuli include strings of numbers and letters. In addition to using multiple criterion measures, it is also common for known-groups studies to incorporate litigation status in group assignment (Slick et al., 1999). Litigation-related variables are common in PVT research because a large portion of this literature focuses on the traumatic brain injury population, where the potential for monetary gain is high. As with the limited number of criterion measures in this study, litigation status was not known when assigning participants to groups. This may further call into question the integrity of the valid and invalid performance groups.

Age Discrepancies Between Groups

The use of a flat cutoff score across ages for the MVP resulted in a significantly younger invalid performance group. Adopting a more lenient cutoff score for younger children mitigated this discrepancy, but it still remained. As a result, it is possible that age had undue influence on the interpretation of cutoff scores. To assess this influence, the Digit Span cutoff scores would need to be examined for variation across age groups, and it is possible that separate cutoffs would need to be established for different age groups. Unfortunately, the limited sample size of the current study prevented this analysis from being conducted.

Cognitive Ability Discrepancies Between Groups

Significant differences in cognitive ability scores were observed across groups. On average, the invalid performance group demonstrated lower Full-Scale Intelligence Quotient (FSIQ) scores, General Ability Index (GAI) scores, and scores on each of the other primary indices of the WISC-V. However, these discrepancies make sense given the nature of the groups. Individuals in the invalid performance group are thought to have provided inadequate effort during testing. As a result, their scores on assessment measures other than the MVP are likely to be biased downward by invalid performance, and the cognitive ability scores of the invalid performance group are potentially underestimates of that group's actual mean cognitive ability. This illustrates the difficulty of using controls in the development of PVTs when all obtained scores are called into question.

Research Implications

Embedded Digit Span PVTs Show Potential Pending Further Study

The most meaningful implications of this study are those pertaining to future research in pediatric performance validity testing. The results support previous studies demonstrating the potential clinical utility of the Digit Span subtest as a performance validity indicator in pediatric assessment. Previous studies have shown potential for the ACSS and RDS scores as embedded performance validity indicators when taken from WISC-IV administrations (Kirkwood et al., 2011; Welsh et al., 2012). This study is the first to examine the utility of these indices when taken from the updated WISC-V. Additionally, this study demonstrated potential use of the RDS-R scale now available due to the addition of the Sequencing condition to the Digit Span subtest. In order for these indices to have clinical utility, they must first be assessed in multiple samples using stronger criterion measures.
Populations of Interest

Previous studies of the RDS and ACSS metrics in pediatric populations have demonstrated that cutoff scores may differ significantly based on presenting conditions and referral questions (Kirkwood et al., 2011; Welsh et al., 2012). Children with epilepsy have previously demonstrated lower performance on these indices, resulting in lower cutoff scores than those established for children with traumatic brain injuries. The present study examined the utility of these metrics in a mixed clinical sample, providing general cutoff scores. However, larger studies with more statistical power are needed to examine whether cutoff scores differ based on clinical presentation. For example, Kirkwood et al. (2011) found higher cutoff scores on the WISC-IV in a traumatic brain injury population. It is unclear whether these differences resulted from changes in the instruments or from factors intrinsic to each population.

Caution Against Using RDS as a Criterion Measure

Known-groups studies aiming to establish or validate PVTs rely heavily on the criterion measures used to establish groups. As noted previously, the lack of pediatric PVTs makes selection of criterion measures difficult. Future studies should validate the RDS metric from the WISC-V before it serves as a criterion measure in the development of additional indices. Notably, a recent study utilized the WISC-IV cutoff scores established by Kirkwood et al. (2011) as cutoffs for the WISC-V (Butt et al., 2023). The study authors examined the clinical utility of the Figure Weights subtest of the WISC-V as an embedded PVT. Although the easy nature of the initial Figure Weights items theoretically makes the subtest a potential PVT, cutoffs for a new index cannot be anchored to previous cutoffs until those cutoffs have been validated on the newer instrument. The results of this study indicate that RDS cutoff scores need to be rigorously established before the index is used as a criterion measure. Known-groups studies hinge on the strength of the criterion measure used, and the RDS currently lacks the evidence needed to serve in that role. Thus, until strong cutoffs are established and replicated, subsequent PVTs developed using the RDS as a criterion measure will lack validity.

Age Differences on RDS and RDS-R Metrics

Both the RDS and RDS-R scores are unstandardized and are therefore more likely than the ACSS score to differ based on age. As a result, it is possible that different cutoff scores are needed across age groups. The current sample was too small to examine cutoff scores across age groups, and the criterion measure used for group assignment presented its own age complications. Further study across age groups is needed to determine whether stratified cutoff scores are necessary.

Utility of the MVP

The present study also raises questions about the utility of the MVP as both a measure of performance validity and a criterion measure in future PVT studies. Manualized cutoffs established by the MVP publishers yielded extremely low rates of failure, and a flat experimental cutoff across age groups led to more young children than older children failing the assessment. These difficulties are consistent with a previous study of the instrument (Wilson & Lesica, 2021). This study sought to examine whether a split cutoff improved the utility of the MVP as a criterion measure, but the limited sample size prevented strong analysis. Using a split cutoff score led to an invalid performance group with a less than optimal sample size (n = 18).
Additional study is needed to further test the MVP for both clinical and research purposes.

Clinical Implications

Embedded Indicators Requiring No Additional Administration Time

In terms of clinical utility, the largest advantage of the Digit Span indices of interest in this study is that they do not require additional administration time. Those who administer the WISC-V have access to these metrics following a typical administration. Computing the RDS and RDS-R requires only noting, for each condition, the longest string length for which both trials were repeated correctly: the forward and backward conditions for the RDS, with the sequencing condition added for the RDS-R (a brief sketch of this tally appears at the end of this section). The ACSS is readily available to anyone who scores the Digit Span subtest. As a result, these indices provide additional information regarding performance validity with little to no additional examiner time and effort. They also have the advantage of being embedded in one of the most commonly used pediatric measures of cognitive ability. For clinicians who already use other measures, such as clinical neuropsychologists, the result is an additional set of performance validity indicators; for professionals who may not possess or know of other tools, such as school psychologists, it may be a first one. The positive findings of this study suggest that these indices may become valuable to such professionals after further study.

Mixed Clinical Samples

For now, the findings of this study are most promising for populations composed of similar mixed clinical samples. The current study sample was drawn from a rehabilitation hospital and included diverse presenting conditions. As a result, the findings may not generalize to populations with different compositions. However, for professionals working with similar populations, the current findings provide initial evidence that these indices may have clinical utility.

Digit Span Indices Not Ready for Clinical Use as Validity Indicators

As noted earlier, how Digit Span indices are currently being used in pediatric populations is unclear. Previous findings suggest that as many as 65% of pediatric neuropsychologists use Digit Span indices when making determinations about performance validity (Brooks et al., 2016). If this is true, the current study raises several concerns. First, the cutoff scores found in this study are much lower than those used with adult populations (Greiffenstein et al., 1994). They are also much lower than those previously established using the WISC-IV (Kirkwood et al., 2011). Because this study is the first to calculate cutoff scores for each of these indices from the WISC-V in a pediatric population, it is currently unclear what cutoffs clinicians are applying when they use Digit Span indices in performance validity determinations. Use of traditional adult cutoff scores would result in extremely high rates of PVT failure, and use of WISC-IV cutoff scores for the ACSS and RDS could likewise lead to abnormally high rates of failure, as the WISC-V cutoffs established in this study were lower. As a result, none of the PVT metrics examined in this study yet has sufficient validation for routine clinical use. Clinicians should not rely on embedded Digit Span PVTs until they can be further studied and validated in various populations.
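As referenced under the heading on administration time, the RDS and RDS-R tallies are simple enough to sketch in a few lines. The example below is only an illustration of that tally as described above; the data structure and all names are hypothetical, and examiners should rely on the published scoring rules rather than on this sketch.

    # Hypothetical sketch of the RDS / RDS-R tally described earlier. For each
    # condition, 'trials' maps string length -> (trial 1 correct, trial 2 correct).
    def longest_reliable_span(trials):
        """Longest string length with both trials repeated correctly (0 if none)."""
        reliable = [length for length, (t1, t2) in trials.items() if t1 and t2]
        return max(reliable, default=0)

    def reliable_digit_span(forward, backward, sequencing=None):
        rds = longest_reliable_span(forward) + longest_reliable_span(backward)
        if sequencing is None:
            return rds                                    # classic RDS
        return rds + longest_reliable_span(sequencing)    # RDS-R adds Sequencing

    # Example: forward span of 3, backward span of 2, sequencing span of 2
    forward = {2: (True, True), 3: (True, True), 4: (True, False)}
    backward = {2: (True, True), 3: (False, False)}
    sequencing = {2: (True, True), 3: (False, True)}
    print(reliable_digit_span(forward, backward))              # RDS = 5
    print(reliable_digit_span(forward, backward, sequencing))  # RDS-R = 7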
Promising for Clinicians without Other Measures

Despite the RDS, RDS-R, and ACSS metrics lacking the evidence needed for regular clinical use, the findings of this study may still aid professionals who do not have access to other measures of performance validity. As noted previously, school psychologists rarely use performance validity measures in the evaluations they conduct, yet they frequently administer the WISC-V. In contexts where no other PVTs are available, the ≤4 cutoff on the RDS and the ≤6 cutoff on the RDS-R can still provide insight into the validity of an assessment. Each Digit Span condition starts with trials that are two digits in length. Therefore, to perform validly on the RDS and RDS-R metrics using the cutoffs established in this study, a child needs only to reliably repeat one three-digit string in a single condition and the two-digit strings in the remaining conditions. If a child fails to meet this threshold, professionals without other validity measures should question the results of their assessment measures and look to behavior observations, interview data, history forms, and other sources to gain additional context. This study adds to the growing literature on a topic long ignored but widely valued by neuropsychologists. Further research will hopefully bring more certainty to the use of embedded validity indicators in widely used clinical instruments.

Conclusion

This study examined the clinical utility of various embedded performance validity indicators in the Digit Span subtest of the WISC-V. The results indicate that Reliable Digit Span, Reliable Digit Span – Revised, and the age corrected scaled score provided by the Digit Span subtest have promise as future validity indicators. However, additional study is needed before these indices can be reliably used in clinical settings.

REFERENCES

Adams, W., & Sheslow, D. (2021). Wide Range Assessment of Memory and Learning - Third Edition. Pearson.

Axelrod, B. N., Meyers, J. E., & Davis, J. J. (2014). Finger tapping test performance as a measure of performance validity. The Clinical Neuropsychologist, 28(5), 876-888.

Baron, I. S. (2018). Neuropsychological evaluation of the child: Domains, methods, & case studies. Oxford University Press.

Ben-Porath, Y. S., & Tellegen, A. (2020). Minnesota Multiphasic Personality Inventory-3 (MMPI-3). NCS Pearson.

Bender, L. (1938). A visual motor gestalt test and its clinical use. Research Monographs, American Orthopsychiatric Association.

Binder, L. M., & Willis, S. C. (1991). Portland Digit Recognition Test. https://doi.org/http://dx.doi.org/10.1037/t27239-000

Binet, A., & Simon, T. (1905). Méthodes nouvelles pour le diagnostic du niveau intellectual des anormaux. L’Année psychologique, 11, 191-336.

Boone, K. (2007). Assessment of Feigned Cognitive Impairment: A Neuropsychological Perspective. The Guilford Press.

Brooks, B. L., Ploetz, D. M., & Kirkwood, M. W. (2016). A survey of neuropsychologists’ use of validity tests with children and adolescents. Child Neuropsychology, 22(8), 1001-1020. https://doi.org/10.1080/09297049.2015.1075491

Bush, S. S., Policy, N., & Committee, P. (2005). Independent and court-ordered forensic neuropsychological examinations: Official statement of the National Academy of Neuropsychology. Archives of Clinical Neuropsychology, 20(8), 997-1007.

Butt, S., Sellers, A., Ghazarian, S., & Katzenstein, J. (2023).
Embedded Performance Validity Utilizing the WISC-V Figure Weights Subtest International Neuropsychological Society Serendipity and Science Conference, San Diego, CA. Crossman, A. M., & Lewis, M. (2006). Adults' ability to detect children's lying. Behavioral Sciences & the Law, 24(5), 703-715. Dandachi-Fitzgerald, B., Merckelbach, H., & Ponds, R. W. H. M. (2017). Neuropsychologists’ ability to predict distorted symptom presentation. Journal of clinical and experimental neuropsychology, 39(3), 257-264. https://doi.org/10.1080/13803395.2016.1223278 88 Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (1994). The California Verbal Learning Test - Children's Version. The Psychological Corporation. Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (2000). California Verbal Learning Test - Second Edition. The Psychological Corporation. Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (2017). California Verbal Learning Test, Third Edition [California Verbal Learning Test-3]. Pearson. https://doi.org/http://dx.doi.org/10.1037/t79642-000 DeRight, J., & Carone, D. A. (2015, 2015/01/02). Assessment of effort in children: A systematic review. Child Neuropsychology, 21(1), 1-24. https://doi.org/10.1080/09297049.2013.864383 Diamond, A. (2013). Executive Functions. Annual Review of Psychology, 64(1), 135-168. https://doi.org/10.1146/annurev-psych-113011-143750 Donders, J. (2005). Performance on the Test of Memory Malingering in a mixed pediatric sample. Child Neuropsychology, 11(2), 221-227. Erdodi, L. A., Abeare, C. A., Medoff, B., Seke, K. R., Sagar, S., & Kirsch, N. L. (2018). A single error is one too many: The Forced Choice Recognition Trial of the CVLT-II as a measure of performance validity in adults with TBI. Archives of Clinical Neuropsychology, 33(7), 845-860. Erdodi, L. A., Roth, R. M., Kirsch, N. L., Lajiness-O'Neill, R., & Medoff, B. (2014). Aggregating validity indicators embedded in Conners' CPT-II outperforms individual cutoffs at separating valid from invalid performance in adults with traumatic brain injury. Archives of Clinical Neuropsychology, 29(5), 456-466. Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). “Mini-mental state”: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3), 189-198. https://doi.org/https://doi.org/10.1016/0022-3956(75)90026-6 Garon, N., Bryson, S. E., & Smith, I. M. (2008). Executive function in preschoolers: a review using an integrative framework. Psychological bulletin, 134(1), 31. Gazzaniga, M., Ivry, R., & Mangun, G. (2014). Cognitive Neuroscience: The Biology of the Mind (Fourth ed.). W. W. Norton. Gignac, G. E., Reynolds, M. R., & Kovacs, K. (2019). Digit Span subscale scores may be insufficiently reliable for clinical interpretation: distinguishing between stratified coefficient alpha and omega hierarchical. Assessment, 26(8), 1554-1563. Green, P. (2004). Green's Medical Symptom Validity Test (MSVT) for Microsoft Windows (User Manual). Green's Publishing. 89 Green, P., & Astner, K. (1995). The Word Memory Test. Neurobehavioural Associates. Greiffenstein, M. F., Baker, W. J., & Gola, T. (1994). Validation of malingered amnesia measures with a large clinical sample. Psychological Assessment, 6(3), 218. Guilmette, T. J., Sweet, J. J., Hebben, N., Koltai, D., Mahone, E. M., Spiegler, B. J., Stucky, K., & Westerveld, M. (2020). American Academy of Clinical Neuropsychology consensus conference statement on uniform labeling of performance test scores. 
The Clinical Neuropsychologist, 34(3), 437-453. https://doi.org/10.1080/13854046.2020.1722244 Hathaway, S. R., & McKinley, J. C. (1951). Minnesota Multiphasic Personality Inventory; Manual, revised. Heaton, R. K., Smith, H. H., Lehman, R. A., & Vogt, A. T. (1978). Prospects for faking believable deficits on neuropsychological testing. Journal of consulting and clinical psychology, 46(5), 892-900. https://doi.org/http://dx.doi.org/10.1037/0022- 006X.46.5.892 Heilbronner, R. L., Sweet, J. J., Morgan, J. E., Larrabee, G. J., Millis, S. R., & Conference, P. (2009). American Academy of Clinical Neuropsychology Consensus Conference Statement on the Neuropsychological Assessment of Effort, Response Bias, and Malingering. The Clinical Neuropsychologist, 23(7), 1093-1129. https://doi.org/10.1080/13854040903155063 Holcomb, M. J. (2018). Pediatric Performance Validity Testing: State of the Field and Current Research. Journal of Pediatric Neuropsychology, 4(3-4), 83-85. https://doi.org/10.1007/s40817-018-00062-y Iverson, G. (2003). Detecting malingering on the WAIS-III Unusual Digit Span performance patterns in the normal population and in clinical groups. Archives of Clinical Neuropsychology, 18(1), 1-9. https://doi.org/10.1016/s0887-6177(01)00176-7 Iverson, G. L., & Franzen, M. D. (1994). The Recognition Memory Test, Digit Span, and Knox Cube Test as Markers of Malingered Memory Impairment. Assessment, 1(4), 323-334. https://doi.org/10.1177/107319119400100401 Iverson, G. L., & Franzen, M. D. (1996). Using Multiple Objective Memory Procedures to Detect Simulated Malingering. Journal of clinical and experimental neuropsychology, 18(1), 38-51. https://doi.org/10.1080/01688639608408260 Jacobs, J. (1887). Experiments on" prehension". Mind, 12(45), 75-79. Kirk, J. W., Baker, D. A., Kirk, J. J., & Macallister, W. S. (2020). A review of performance and symptom validity testing with pediatric populations. Applied Neuropsychology: Child, 9(4), 292-306. https://doi.org/10.1080/21622965.2020.1750118 90 Kirkwood, M. W. (2015). A Rationale for Performance Validity Testing in Child and Adolescent Assessment. In M. W. Kirkwood (Ed.), Validity testing in child and adolescent assessment: Evaluating exaggeration, feigning, and noncredible effort. The Guilford Press. Kirkwood, M. W., Hargrave, D. D., & Kirk, J. W. (2011). The value of the WISC-IV Digit Span subtest in detecting noncredible performance during pediatric neuropsychological examinations. Archives of Clinical Neuropsychology, 26(5), 377-384. Larrabee, G. J. (2012). Performance Validity and Symptom Validity in Neuropsychological Assessment. Journal of the International Neuropsychological Society, 18(4), 625-630. https://doi.org/10.1017/s1355617712000240 Larrabee, G. J. (2015). Performance and Symptom Validity: A Perspective from the Adult Literature. In M. W. Kirkwood (Ed.), Validity testing in child and adolescent assessment: Evaluating exaggeration, feigning, and noncredible effort (pp. 82-96). The Guilford Press. Larrabee, G. J., & Kirkwood, M. W. (2020). Symptom and Performance Validity Testing. In K. J. Stucky, M. W. Kirkwood, & J. Donders (Eds.), Clinical Neuropsychology Study Guide and Board Review (2nd ed., pp. 214-226). Oxford University Press. Lezak, M. (1983). Neuropsychological Assessment (2nd ed.). Oxford University Press. Lippa, S. M. (2018). Performance validity testing in neuropsychology: A clinical guide, critical review, and update on a rapidly evolving literature. The Clinical Neuropsychologist, 32(3), 391-421. Lu, P. H., Rogers, S. 
A., & Boone, K. B. (2007). Use of Standard Memory Tests to Detect Suspect Effort. In K. B. Boone (Ed.), Assessment of Feigned Cognitive Impairment: A Neuropsychological Perspective. The Guilford Press. Macallister, W. S., Vasserman, M., & Armstrong, K. (2019). Are we documenting performance validity testing in pediatric neuropsychological assessments? A brief report. Child Neuropsychology, 25(8), 1035-1042. https://doi.org/10.1080/09297049.2019.1569606 McWhirter, L., Ritchie, C. W., Stone, J., & Carson, A. (2020). Performance validity test failure in clinical populations—a systematic review. Journal of Neurology, Neurosurgery & Psychiatry, 91(9), 945-952. https://doi.org/10.1136/jnnp-2020-323776 Merckelbach, H., & Smith, G. P. (2003). Diagnostic accuracy of the Structured Inventory of Malingered Symptomatology (SIMS) in detecting instructed malingering. Archives of Clinical Neuropsychology, 18(2), 145-152. Meyers, J. E., & Meyers, K. R. (1995). Rey Complex Figure Test and Recognition Trial. PAR. 91 Mittenberg, W. (1996). Identification of malingered head injury on the Halstead-Reitan battery. Archives of Clinical Neuropsychology, 11(4), 271-281. https://doi.org/10.1016/0887- 6177(95)00040-2 National Center for Education Statistics. (2022). Students with disabilities. Condition of education. US Department of Education, Institute of Education Sciences. Pankratz, L. (1979). Symptom validity testing and symptom retraining: procedures for the assessment and treatment of functional sensory deficits. Journal of consulting and clinical psychology, 47(2), 409. Pankratz, L. (1983). A new technique for the assessment and modification of feigned memory deficit. Perceptual and Motor Skills, 57(2), 367-372. Pankratz, L., Fausti, S. A., & Peed, S. (1975). A forced-choice technique to evaluate deafness in the hysterical or malingering patient. Journal of consulting and clinical psychology, 43(3), 421-422. https://doi.org/http://dx.doi.org/10.1037/h0076722 Peterson, E., & Peterson, R. L. (2015). Understanding Deception from a Developmental Perspective. In M. W. Kirkwood (Ed.), Validity testing in child and adolescent assessment: Evaluating exaggeration, feigning, and noncredible effort. The Guilford Press. Reitan, R. M. (1993). The Halstead-Reitan neuropsychological test battery : theory and clinical interpretation. Second edition. S. Tucson, Arizona : Neuropsychology Press, [1993] ©1993. https://search.library.wisc.edu/catalog/9910209689302121 Rey, A. (1941). L'examen psychologique dans les cas d'encéphalopathie traumatique. (Les problems.). [The psychological examination in cases of traumatic encepholopathy. Problems.]. Archives de Psychologie, 28, 215-285. Rey, A. (1958). Rey Auditory Verbal Learning Test. Western Psychological Services. https://doi.org/http://dx.doi.org/10.1037/t27193-000 Rey, A. (1964). L'Examen clinique en psychologie: par André Rey... 2e édition. Presses universitaires de France (Vendôme, Impr. des PUF). Reynolds, C. R. (1997). Forward and backward memory span should not be combined for clinical analysis. Archives of Clinical Neuropsychology, 12(1), 29-40. Reynolds, C. R., & Bigler, E. D. (1994). Test of Memory and Learning. Pro-ed. Reynolds, C. R., & Kamphaus, R. W. (2015). Behavior Assessment System for Children, Third Edition. 92 Richardson, J. T. E. (2007). Measures of Short-Term Memory: A Historical Review. Cortex, 43(5), 635-650. https://doi.org/10.1016/s0010-9452(08)70493-3 Schmand, B., & Lindeboom, J. (2005). The amsterdam short-term memory test. Manuel. PITS. Schroeder, R. 
W., Martin, P. K., Heinrichs, R. J., & Baade, L. E. (2019). Research methods in performance validity testing studies: Criterion grouping approach impacts study outcomes. The Clinical Neuropsychologist, 33(3), 466-477. https://doi.org/10.1080/13854046.2018.1484517 Schroeder, R. W., Twumasi-Ankrah, P., Baade, L. E., & Marshall, P. S. (2012). Reliable Digit Span: A Systematic Review and Cross-Validation Study. Assessment, 19(1), 21-30. https://doi.org/10.1177/1073191111428764 Sherman, E., & Brooks, B. (2015a). Child and adolescent memory profile (ChAMP). Psychological Assessment Resources, Inc.: Lutz, FL. Sherman, E., & Brooks, B. (2015b). Memory validity profile (MVP). Psychological Assessment Resources, Inc.: Lutz, FL. Sherman, E. M. S., Slick, D. J., & Iverson, G. L. (2020). Multidimensional Malingering Criteria for Neuropsychological Assessment: A 20-Year Update of the Malingered Neuropsychological Dysfunction Criteria. Archives of Clinical Neuropsychology, 35(6), 735-764. https://doi.org/10.1093/arclin/acaa019 Slick, D. J., Sherman, E. M. S., & Iverson, G. L. (1999, 1999/11/01). Diagnostic Criteria for Malingered Neurocognitive Dysfunction: Proposed Standards for Clinical Practice and Research. The Clinical Neuropsychologist, 13(4), 545-561. https://doi.org/10.1076/1385- 4046(199911)13:04;1-Y;FT545 Spencer, R., Tree, H., Drag, L., Pangilinan, P., & Bieliauskas, L. (2010). Extending reliable digit span with the WAIS-IV sequencing task: Preliminary results. Poster presented at the 8th annual meeting for the American Academy of Clinical Neuropsychology Conference, Chicago, IL, Stucky, K., Kirkwood, M. W., & Donders, J. (2020). Traumatic Brain Injury. In K. Stucky, M. W. Kirkwood, & J. Donders (Eds.), Clinical Neuropsychology Study Guide and Board Review (2 ed.). Oxford University Press. Talwar, V., & Crossman, A. (2011). From little white lies to filthy liars: The evolution of honesty and deception in young children. Advances in child development and behavior, 40, 139-179. Talwar, V., & Lee, K. (2002). Development of lying to conceal a transgression: Children’s control of expressive behaviour during verbal deception. International Journal of Behavioral Development, 26(5), 436-444. 93 Terman, L. M. (1916). Stanford-Binet Intelligence Scale: Manual for the Third Revision Form L- M. Houghton Mifflin. https://doi.org/http://dx.doi.org/10.1037/t00012-000 Tombaugh, T. N. (1996). Test of memory malingering: TOMM. Multy-Health Systems. Ventura, L. M., Dedios-Stern, S., Oh, A., & Soble, J. R. (2019). They’re not just little adults: The utility of adult performance validity measures in a mixed clinical pediatric sample. Applied Neuropsychology: Child, 1-11. https://doi.org/10.1080/21622965.2019.1685522 Wechsler, D. (1939). Wechsler-Bellevue Intelligence Scale--Form I. https://doi.org/http://dx.doi.org/10.1037/t06871-000 Wechsler, D. (1945). Wechsler Memory Scale. https://doi.org/http://dx.doi.org/10.1037/t27207- 000 Wechsler, D. (1949). Wechsler Intelligence Scale for Children. The Psychological Corporation. Wechsler, D. (1955). Wechsler Adult Intelligence Scale. The Psychological Corporation. Wechsler, D. (1981). Wechsler Adult Intelligence Scale - Revised. The Psychological Corporation. Wechsler, D. (1987). Wechsler Memory Scale - Revised. The Psychological Corporation. Wechsler, D. (1999). Wechsler Abbreviated Scale of Intelligence. The Psychological Corporation. Wechsler, D. (2003). Wechsler Intelligence Scale for Children--Fourth Edition. Wechsler, D. (2008). 
Wechsler Adult Intelligence Scale--Fourth Edition. Wechsler, D. (2009). Advanced Clinical Solutions for WAIS-IV and WMS-IV. The Psychological Corporation. Wechsler, D. (2011). Wechsler Abbreviated Scale of Intelligence–Second Edition. NCS Pearson. Wechsler, D. (2014). Wechsler Intelligence Scales for Children - Fifth Edition. NCS Person. Weiss, S. J., Blackwell, M. C., Griffith, K. M., Jordan, L. S., & Culotta, V. P. (2019). Performance validity testing in children and adolescents: A descriptive study comparing direct and embedded measures. Applied Neuropsychology: Child, 8(2), 158-162. Welsh, A. J., Bender, H. A., Whitman, L. A., Vasserman, M., & Macallister, W. S. (2012). Clinical Utility of Reliable Digit Span in Assessing Effort in Children and Adolescents 94 with Epilepsy. Archives of Clinical Neuropsychology, 27(7), 735-741. https://doi.org/10.1093/arclin/acs063 Wilson, K., & Lesica, S. (2021). Performance on the Memory Validity Profile in a mixed clinic- referred pediatric sample. Child Neuropsychology, 27(4), 516-531. https://doi.org/10.1080/09297049.2020.1870676 Wimmer, H., & Hartl, M. (1991). Against the Cartesian view on mind: Young children's difficulty with own false beliefs. British Journal of Developmental Psychology, 9(1), 125-138. Young, J. C., Sawyer, R. J., Roper, B. L., & Baughman, B. C. (2012). Expansion and Re- examination of Digit Span Effort Indices on the WAIS-IV. The Clinical Neuropsychologist, 26(1), 147-159. https://doi.org/10.1080/13854046.2011.647083 95