DIGIT SPAN AS A VALIDITY MEASURE IN PEDIATRIC ASSESSMENT

By

Tyler Micheal Ryan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of School Psychology – Doctor of Philosophy

2024

ABSTRACT

Performance validity testing during the neuropsychological assessment of pediatric populations is a growing area of clinical and research interest. Although the literature on validity testing with adult populations is extensive, the development and study of validity measures in pediatric assessment is in its infancy. The current study assessed the clinical utility of three indices derived from the Wechsler Intelligence Scale for Children – Fifth Edition's (WISC-V) Digit Span subtest as embedded performance validity measures. Reliable Digit Span, Reliable Digit Span-Revised, and the Digit Span age-corrected scaled score were assessed in a known-groups design using the Memory Validity Profile (MVP) as a criterion measure. Cutoff scores with adequate sensitivity and specificity were obtained for each of the three metrics, indicating that the WISC-V Digit Span subtest provides clinically relevant data regarding performance validity.

ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to my advisor, mentor, and chair of my committee: Dr. Jodene Fine. I could not have completed this project without her continued support and guidance. Her passion for the field of neuropsychology inspired both my research and career goals, and I cannot thank her enough for providing me with a path of entry to the field. Additionally, words cannot express my gratitude to my committee members: Dr. Ryan Bowles, Dr. Kristin Rispoli, and Dr. Adrea Truckenmiller. Each member of this committee has demonstrated continued flexibility and guidance throughout this journey. Also, I could not have undertaken this project without Dr. Jacobus Donders, who provided access to his clinical research data and served as a guiding hand through study design and analysis.

I would like to extend my sincere thanks to Dr. Roger Lauer and Dr. Renee Lajiness-O'Neill, who served as clinical mentors and provided valuable perspective derived from their combined clinical and research experience. I am also deeply grateful to Dr. Kate Wilson, whose research served as a foundation for this project. Additionally, her compassion and dedication to the care of underserved communities in the field of neuropsychology continually inspire my vision of practice.

I would be remiss in not mentioning my family for their continued support throughout this journey. Specifically, I would like to recognize my lovely wife Sara Ryan for her patience, tolerance, and professional support throughout this journey. This academic endeavor has taken our lives in various unforeseen directions, and her unwavering support and love have carried me through this lengthy trek.

TABLE OF CONTENTS

INTRODUCTION .......................................................................................................................... 1
LITERATURE REVIEW ............................................................................................................... 4
METHOD ..................................................................................................................................... 40
RESULTS ..................................................................................................................................... 55
DISCUSSION ............................................................................................................................... 78
REFERENCES ............................................................................................................................. 88

INTRODUCTION

Neuropsychological assessments are conducted to answer questions related to a client's intellectual functioning and behavior (Baron, 2018). These assessments rely on the use of measures that provide information across various domains of functioning. However, the information provided by these tests lacks clinical utility if the patient provides suboptimal effort. Thus, it is essential that neuropsychologists be able to identify clients who are not working to their full capacity during an assessment. Studies have indicated that clinicians have difficulty accurately identifying which individuals presenting for evaluation are providing adequate effort. As such, interest in the field of validity testing has blossomed.

Validity testing is the process of identifying invalid test performance using specially designed instruments or indices derived from readily administered assessment measures. The earliest examples of these tests date back to the 1940s (Rey, 1941). However, interest in the development, validation, and use of these measures gained popularity in the 1990s (Larrabee & Kirkwood, 2020). Most of the research and clinical interest in validity testing has historically pertained to neuropsychological evaluations of adults. High stakes in remuneration associated with lawsuits over personal injuries fueled the field. Thus, the emphasis on adult populations resulted from the incentive for secondary gain when malingering, or "faking" impairment, during evaluations. Cases involving the evaluation of mild traumatic brain injuries (mTBIs) gained particular interest (Slick et al., 1999). Those attempting to recover damages after accidents resulting in brain injury could be awarded substantial sums when impairment was observed in a neuropsychological evaluation. Incentive to malinger is, and has been, high.

Another reason why validity testing has been largely relegated to adult neuropsychological evaluation is the widely held belief that children lack the capacity for deception, which resulted in little interest in pediatric validity assessment. However, research into child neuropsychological test performance indicates that children may provide suboptimal effort for many reasons (Kirkwood, 2015). Notably, many neuropsychological instruments for child use were translated from adult measures, and this has also occurred with validity measures. Thus, use of adult validity instruments increased dramatically, and measures like the Test of Memory Malingering (TOMM; Tombaugh, 1996) were studied in pediatric populations (Donders, 2005). A recent survey found extremely high rates of validity testing endorsement among pediatric neuropsychologists (Brooks et al., 2016). Of the reported measures, one of the most popular was the Reliable Digit Span (RDS) metric provided by the Digit Span subtest of Wechsler intelligence measures like the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV; Wechsler, 2008). The RDS index is a simple cutoff calculation based on the longest number of digits held reliably (twice) in the Digits Forward and Digits Backward conditions as a raw sum. Over 65% of clinicians endorsed using the RDS metric at least occasionally when testing children (Brooks et al., 2016).
Despite this high level of endorsement, study of the RDS metric in pediatric populations is sparse. To date, only two studies have examined the utility of the RDS metric in pediatric populations. Kirkwood et al. (2011) developed RDS cutoff scores for a sample of children presenting with traumatic brain injuries, and Welsh et al. (2012) examined the utility of RDS in a sample of children with epilepsy. Both of these studies utilized the Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV; Wechsler, 2003). Since these studies were conducted, a new version of the measure, the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014), has been released. This edition fundamentally changed the Digit Span task in ways that potentially change RDS scores and allow for the calculation of a new revised RDS score: Reliable Digit Span-Revised (RDS-R; Schroeder et al., 2012).

As validity testing in pediatric populations gains popularity, the need for empirically validated validity measures increases. The current study assesses the clinical utility of three indices from the Digit Span subtest of the WISC-V in a mixed clinical sample. The results of this study serve to inform pediatric neuropsychologists who endorse use of these instruments in their clinical practice and to guide future research into these instruments.

LITERATURE REVIEW

Validity Testing Overview

Neuropsychological evaluations are conducted to better understand the relation between an individual's brain functioning and their observable performance (Baron, 2018). To better understand this relation, clinicians evaluate clients in several domains including intellectual development, memory, motor functioning, visual functioning, executive skills, academics, and psychosocial development. These evaluations typically include the completion of standardized tests across domains as well as the completion of questionnaires. The scores yielded are essential to clinical decision making, but they are not useful if clients provide suboptimal effort while performing the tasks. Suboptimal effort during an evaluation can result in underestimations of a client's abilities and may significantly distort the clinical picture, and thus the diagnosis, prognosis, and treatment plan.

The clinical picture derived from neuropsychological assessment guides treatment and, in addition, controls downstream resources. Thus, in some cases a person may find it advantageous to perform below their ability level or to exaggerate symptoms of impairment. In such cases, a person is engaging in malingering, meaning intentionally underperforming or exaggerating maladies. Although some individuals intentionally malinger, individuals exaggerate symptoms and underperform during psychological testing for numerous reasons (Sherman et al., 2020; Slick et al., 1999). In cases where money is involved, including cases with litigatory elements, test failure or poor performance may influence the legal process. In some cases, examinees may be discouraged by the process, afraid of failure, or simply stop trying due to fatigue (DeRight & Carone, 2015). Others underperform or exaggerate for fear that the examiner would not otherwise capture the full breadth of their impairment. Thus, accurately capturing effort throughout neuropsychological assessments can have serious financial, social, and treatment consequences for all parties involved.
It is important for neuropsychologists to be able to recognize when performance is genuinely low, and when an examinee is performing below their ability level or embellishing symptoms. To capture instances of both symptom exaggeration and underperformance, separate categories of validity testing have been established (Larrabee, 2012). Symptom validity refers to the self-reporting of symptoms, while performance validity refers to effort during task performance.

Symptom Validity Tests (SVTs). Symptom validity refers to the likelihood that symptoms are being reported truthfully by the client. Symptom validity tests (SVTs) are generally embedded in symptom self-report scales, such as the Minnesota Multiphasic Personality Inventory-3 (MMPI-3; Ben-Porath & Tellegen, 2020) and the Behavior Assessment System for Children – Third Edition (BASC-3; Reynolds & Kamphaus, 2015). Subscales within these instruments assess whether the response patterns have a reasonable probability of being reported by most people. They detect over-reporting and under-reporting of symptoms, random responding, and very unusual responses. Validity subscales are meant to indicate when people are 'faking good,' feigning symptoms, or either not comprehending the questions or willfully refusing to answer them. They provide an indication of how much confidence can be had when interpreting other scales.

Performance Validity Tests (PVTs). Performance validity tests (PVTs) are measures designed to detect suboptimal effort through direct testing. PVTs can be standalone measures, meaning that they are designed specifically for the assessment of performance validity, or embedded measures that are derived from commonly administered tests. Standalone PVTs are commonly disguised as tests of memory, as is the case with the Test of Memory Malingering (TOMM; Tombaugh, 1996) and the Memory Validity Profile (MVP; Sherman & Brooks, 2015b). These measures have strong face validity as measures of memory, meaning that the examinee feels as though they are being given a test of memory. However, the psychometric properties of these instruments and the specific type of memory they sample make them tasks that even those with severe brain disorders and injuries have a high likelihood of passing (McWhirter et al., 2020). Thus, there is a high likelihood that an individual who fails a PVT is doing so intentionally or for reasons other than genuine impairment.

In addition to standalone measures, researchers have developed several metrics from readily administered tests to serve as embedded validity indicators. These are generally subtests within a suite of tasks most often meant, again, to assess memory functioning. Forced-choice memory test conditions are a ruse in which the multiple-choice options are limited to two, so by chance alone any examinee would be expected to be correct at least 50% of the time. For example, the forced-choice condition of the California Verbal Learning Test – Third Edition (CVLT-3; Delis et al., 2017) yields good scores even when other subtasks in the suite do not, and these scores can be achieved by those with a high level of neurological disturbance. Individuals with severe brain injury can score well on this task because recognition of previously experienced sensory exposure is highly preserved in the brain (Erdodi et al., 2018). Thus, embedded validity indicators can be direct scores of tests, or scores derived from measures designed to evaluate other cognitive domains.
Both SVTs and PVTs play a vital role in helping clinicians identify those who may be feigning symptoms or failing to engage during assessment, yet they have been historically underutilized.

Why PVTs and SVTs are Necessary

Despite clinical training, neuropsychologists have low rates of success when determining which clients are performing below their ability level. In a landmark study, Heaton et al. (1978) examined the success rate of neuropsychologists in identifying probable malingerers from blinded clinical case reports. Ten neuropsychologists were sent 32 blinded evaluation reports. Half of the cases were patients with a genuine history of head trauma, while the other half were paid malingerers asked to perform as though they had sustained a head injury. The success rate of clinicians correctly classifying the 32 cases ranged from 50.0% to 68.8%. Of the genuine presentations, clinicians correctly classified between 43.8% and 81.3% of participants. Of the malingering group, clinicians correctly classified between 25.0% and 81.3% of cases. Following this procedure, the researchers conducted two discriminant functions based on neuropsychological test performance on the Halstead-Reitan Battery (HRB; Reitan, 1993) and the MMPI F scale, which detects highly unusual responses linked to over-reporting of symptoms (Hathaway & McKinley, 1951). The neuropsychological discriminant function correctly categorized each of the 32 cases, and the MMPI discriminant function correctly categorized 15 of 16 participants in each group. This study was the first to show that indicators from testing were better predictors of suboptimal performance than clinical judgement alone. Later, the HRB discriminant function was further validated when a group of 40 malingerers and 40 individuals presenting with genuine head trauma were classified with 83.8% true positive cases and 93.8% true negative cases (Mittenberg, 1996).

Recently, Dandachi-Fitzgerald et al. (2017) examined 31 clinicians' ability to identify suboptimal performance across 203 clinical case reports. The sample was divided into two groups, with one group passing both a test of symptom validity and a test of performance validity (n = 173) and the second group failing both validity tests (n = 30). The SVT used in this study was the adapted Dutch version of the Structured Inventory of Malingered Symptomology (SIMS; Merckelbach & Smith, 2003). The performance validity test (PVT) used was the Amsterdam Short-Term Memory Test (ASTM; Schmand & Lindeboom, 2005). Results from the study indicated that the sample of neuropsychologists predicted suboptimal effort for 51 cases (25.1%) and optimal effort for 152 cases (74.9%). Contrary to the researchers' hypothesis that neuropsychologists would underpredict the number of cases with problematic effort, the clinicians significantly overpredicted it. The group of cases identified as non-malingering contained 14 participants (9.2%) who belonged to the noncredible group. Conversely, 35 of the 51 (68.9%) participants identified as showing suboptimal effort passed both validity measures. These results further substantiate the findings of Heaton et al. (1978) and indicate that a clinician's judgement alone is insufficient when making validity determinations. Furthermore, these studies indicate that clinicians may both underestimate and overestimate poor effort. Thus, clients also stand to benefit from validity testing, especially when clinical judgement alone would misattribute their difficulties as malingering.
As a result, a need for specific diagnostic criteria for invalid clinical presentations emerged. This need was particularly pertinent in the fair adjudication of cases involving litigation.

Need for Malingering Criteria

Individuals are commonly referred for neuropsychological evaluation as a part of litigation or other scenarios that present with high potential of secondary gain if impairment is observed. Therefore, neuropsychologists are tasked with providing expert opinion related to the validity of an individual's performance. Slick et al. (1999) developed diagnostic criteria to aid neuropsychologists in clinical and research settings. Slick et al. defined Malingering of Neurocognitive Dysfunction (MND) as "the volitional exaggeration or fabrication of cognitive dysfunction for the purpose of obtaining substantial material gain, or avoiding or escaping formal duty or responsibility" (1999, p. 552). The multifaceted standard used three criteria to determine the likelihood that an individual was malingering for secondary gain. Criterion A was characterized by the degree of secondary gain present for an individual, Criterion B included neuropsychological testing data, and Criterion C was characterized by self-reports.

Criterion B relies on standardized neuropsychological assessment measures. Individuals who performed below chance on forced-choice measures of performance validity were to be characterized as showing definite negative response bias, whereas individuals who failed one or more validated PVTs were to be described as exhibiting probable response bias. Criterion B created a need for further development and validation of PVT measures and made the instruments central to the clinical decision-making process. Criterion C relies on standardized self-report measures. Self-reported symptoms were to be compared to the documented symptom history of the individual, brain functioning, behavior, and third-party symptom reports. This criterion established a need for standardized measures of symptom validity and, like Criterion B, encouraged significant research into validity measures.

Heaton et al. (1978) and Slick et al. (1999) inspired a vast literature base aimed at developing validity measures, but understanding this literature requires a basic knowledge of the commonly used research methodologies and the psychometric properties emphasized during test development. Two primary designs for studying validity tests have emerged that utilize either groups of known malingerers or individuals asked to intentionally give poor effort during evaluation. These designs influence outcomes in validity studies, and each design has known strengths and limitations.

PVT Research Design

Simulation Studies. Empirical study of performance validity tests has historically been performed using two research designs. The first design is the simulation study. In a simulation study, a group of individuals with clinical presentations is compared to a group of individuals without clinical symptoms. Those without clinical presentations are simulators who are coached to perform poorly without being detected. In this way, individuals who intentionally fail measures can be compared directly to a clinical group. The landmark study conducted by Heaton et al. (1978) used a simulation design. Simulation studies are valuable as they include clearly established groups for comparison. However, simulation studies are often criticized for their deviations from real-world application (Larrabee & Kirkwood, 2020).
Individuals who intentionally fail assessment measures during clinical evaluations do so for several reasons. One of the most prominent motivations for suboptimal performance is secondary gain, as is frequently seen in the traumatic brain injury population (Stucky et al., 2020). Individuals in simulation studies, however, lack such motivation. Although this is combatted with compensation for performing poorly without being caught (Heaton et al., 1978), this compensation typically pales in comparison to the secondary gain hinging on the results of a neuropsychological evaluation. This discrepancy is thought to call into question the external validity of simulation studies.

Known-Groups Studies. The second study design commonly used to evaluate PVTs is the known-groups design. In these studies, a clinical group is compared to another group of patients who have failed other established PVTs during evaluation. Within these studies, the suboptimal effort groups also commonly include individuals in situations that allow for secondary gain, such as active litigation status (Larrabee & Kirkwood, 2020). These studies are thought to have higher external validity than simulation studies because participants who fail these PVTs have done so without coaching. However, known-groups studies also present several limitations. Within known-groups research, group designation relies on established performance validity measures. These criterion measures must have strong psychometric properties to correctly categorize performance. If individuals provide optimal effort but fail PVTs due to extraneous factors such as intellectual disability, the known group is compromised. As such, known-groups designs are only as strong as the measures used to establish groups (Schroeder et al., 2019). In practice, the known-groups design is preferred in the development of PVTs because the ecological validity of known groups is significantly stronger than that of simulation studies. However, the limitation associated with the known-groups design is a lack of strong criterion measures for establishing groups. Thus, creating a strong known-groups design requires a fundamental understanding of the psychometric properties of PVTs.

Criteria for Establishing a Performance Validity Test

Performance and symptom validity tests in the field of neuropsychology must meet certain criteria to be considered empirically supported instruments. The most discussed metrics of PVTs are sensitivity and specificity. These metrics indicate the accuracy with which individuals in a simulation study or known-groups study were correctly identified as performing validly or invalidly. However, equally important are the instrument's negative predictive power (NPP) and positive predictive power (PPP). Predictive power refers to the chance that an individual is correctly identified as providing either optimal or suboptimal effort by the PVT. NPP and PPP take into consideration the base rates of suboptimal performance within a given population. Base rates have been studied extensively in adult populations and in populations with specific conditions like mild traumatic brain injuries (mTBIs), but general base rates for the pediatric population vary widely (Kirkwood et al., 2011). In this literature, a "positive" result is a failed PVT (suboptimal effort flagged), and a "negative" result is a passed PVT.

Specificity. Specificity refers to an instrument's ability to correctly classify true negative cases (Larrabee & Kirkwood, 2020). True negative cases are those correctly identified as performing optimally.
Specificity is the most emphasized psychometric property of validity tests in the neuropsychological literature. Consensus has established that validity measures should have at least 90% specificity (Lippa, 2018). This means that of the individuals providing adequate effort, including those presenting with genuine clinical conditions, at least 90% will be correctly identified as performing validly by the measure. Consequently, when specificity is maximized, sensitivity is typically much lower.

Sensitivity. Sensitivity refers to an instrument's ability to correctly classify true positive cases (Larrabee & Kirkwood, 2020). True positive cases are those correctly identified as performing at suboptimal levels. As false positives are reduced by raising specificity, the likelihood of obtaining false negatives increases. This in turn lowers sensitivity. Within the PVT research literature, consensus dictates that cutoff scores for measures be set at the point where specificity reaches 90%, so as to maximize sensitivity at that level (Lippa, 2018). Additionally, a threshold of 40% is considered to be the minimum viable level of sensitivity (Axelrod et al., 2014; Erdodi et al., 2014).

Positive Predictive Power. Positive predictive power refers to the chance that an individual who failed a PVT actually put forth suboptimal effort (Lippa, 2018). Predictive power differs greatly depending on the base rate of suboptimal effort in a population. Thus, in adults, positive predictive power is higher in populations who face potential secondary gain, like those with mTBIs, than in populations with lower base rates. Sensitivity, specificity, and base rate are all required to calculate positive predictive power. Lippa (2018) provides the equation used to calculate positive predictive power:

PPP = (sensitivity × base rate) / [(sensitivity × base rate) + (1 − specificity) × (1 − base rate)].

Negative Predictive Power. Negative predictive power (NPP) refers to the chance that an individual who passed a PVT put forth optimal effort. Like PPP, NPP is dependent on the base rate of suboptimal performance in the target population. Lippa (2018) also provides the equation for the calculation of NPP:

NPP = [specificity × (1 − base rate)] / [(1 − base rate) × specificity + base rate × (1 − sensitivity)].

In summary, an empirically supported PVT must correctly identify 90% of individuals providing adequate effort, but sensitivity should then be prioritized to ensure that the instrument captures as many simulators or known malingerers as possible. Once sensitivity and specificity are established, the two can be used in conjunction with the base rate of suboptimal effort in a population to find the predictive power of the instrument.
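To make the relationship among these quantities concrete, the following minimal Python sketch implements the two equations above. The function names and the example values (90% specificity, 50% sensitivity, and a 20% base rate) are illustrative assumptions for this document, not figures drawn from the studies discussed here.

def positive_predictive_power(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Probability that an examinee who fails the PVT truly gave suboptimal effort."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

def negative_predictive_power(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Probability that an examinee who passes the PVT truly gave adequate effort."""
    true_negatives = specificity * (1 - base_rate)
    false_negatives = (1 - sensitivity) * base_rate
    return true_negatives / (true_negatives + false_negatives)

if __name__ == "__main__":
    # Hypothetical PVT meeting the conventional benchmarks discussed above:
    # specificity of at least .90 and sensitivity of at least .40.
    sens, spec, base = 0.50, 0.90, 0.20
    print(f"PPP = {positive_predictive_power(sens, spec, base):.2f}")  # ~0.56
    print(f"NPP = {negative_predictive_power(sens, spec, base):.2f}")  # ~0.88

With these illustrative values, a failed PVT raises the estimated probability of suboptimal effort from the 20% base rate to roughly 56%, while a passed PVT lowers it to roughly 12%, which is why base rate matters as much as the test's operating characteristics.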
Historical Overview of PVT Research in Adult Populations

As previously discussed, Heaton et al. (1978) was the first study to highlight the need for measures of validity when assessing effort in clinical populations. However, the development of validity measures predates Heaton et al. (1978) by over 30 years. André Rey was at the forefront of neuropsychological validity testing with his development of the Dot Counting Test (Rey, 1941) and the Rey 15-Item Test (Rey, 1964) in the mid-twentieth century. The Rey tests were the first standalone measures of performance validity. However, their use was limited prior to the 1990s.

In her book on assessing feigned impairment, Dr. Kyle Boone opens with an anecdote about her time as a postdoctoral fellow in the mid-1980s (Boone, 2007). Dr. Boone entered the waiting room to meet her patient and found the man talking to his imaginary friend and wearing a necklace made of garlic. Dr. Boone suspected that the patient may be feigning psychiatric symptoms and cognitive deficits, so she consulted her supervisor. Dr. Boone was advised that she should give the Bender Gestalt Test (Bender, 1938) and the Mini-Mental State Examination (Folstein et al., 1975) but was perplexed by the lack of dedicated measures to assess clients providing suboptimal performance. In her study of the literature, Boone discovered the early tests developed by André Rey approximately 40 years earlier. However, the span of time between Rey's work and the evaluation Dr. Boone conducted had yielded little expansion of the validity testing knowledge base.

Rapid growth of PVT and SVT research began in the 1990s, but in the time between Rey's research and Boone's experience in the 1980s, a significant development in validity testing was established (Larrabee & Kirkwood, 2020; Pankratz et al., 1975). In a case study, Pankratz et al. (1975) evaluated a 27-year-old man with a history of symptom exaggeration and manipulation who was being tested for deafness, using a forced-choice design. The man was presented with 100 trials in which a red light and a blue light were shown consecutively for two seconds each. Across the 100 trials, a sound was randomly paired with either the red light or the blue light. The participant was asked to indicate which light was visible when the sound was played. By chance alone, the participant would be expected to answer approximately 50 trials correctly, as there is a 50 percent chance of a correct choice on any given trial. However, the patient answered only 36 trials correctly. The likelihood of receiving a score this low by chance is calculated to be less than 0.4%. Later, Pankratz went on to promote forced-choice assessment for the detection of suboptimal effort (Pankratz, 1979, 1983).

Building upon the work of Pankratz et al. (1975), development of PVTs exploded in the 1990s (Larrabee, 2015). The 1990s produced some of the most well-known validity measures used today, including the TOMM (Tombaugh, 1996) and the Word Memory Test (WMT; Green & Astner, 1995). These tests, and many others, utilize recognition memory to assess performance validity. Recognition memory tasks make strong PVTs because recognition is highly preserved after insult to the brain.

Use of Recognition Memory to Assess Effort

The most used PVTs present with face validity as measures of memory but are really measures of effort. Examples of such tests include the Test of Memory Malingering (TOMM; Tombaugh, 1996) and the Memory Validity Profile (MVP; Sherman & Brooks, 2015b). The TOMM presents the examinee with a series of 50 images at the beginning of each trial. Following the learning phase, the examinee is asked to correctly identify each item in a forced-choice format, with 50 trials presenting the correct stimulus alongside a foil image. Forced choice means that the examinee selects one of two options, so any score below the chance performance level of 50% correct is likely to indicate malingering. The MVP uses a similar multiple-choice approach, presenting an image and then asking the examinee to identify it from three choices immediately following the presentation, setting a chance threshold of 33%.
Similarly, the Word Memory Test (WMT) presents word pairs and asks participants to distinguish presented words from foils in later conditions. The high expected rate of being correct by chance on these instruments is also supplemented by their reliance on recognition memory.

Recognition memory, also commonly referred to as familiarity, refers to an individual's ability to endorse having experienced a stimulus previously (Gazzaniga et al., 2014). Recognition memory is a highly preserved aspect of memory that remains intact when recollection memory is impaired. Recollection is the ability of an individual to retrieve previously learned information. The discrepancy in memory impairment between recollection and recognition is due to the differing cognitive structures involved in each process. Lesion studies indicate that recollection memory relies heavily on the hippocampus (Gazzaniga et al., 2014). When the hippocampus is damaged, impairment is observed in recollection abilities while recognition abilities remain intact. Recognition has been postulated to be more reliant on the perirhinal cortex.

Recognition memory rests on the concept of priming. Priming is the process of changing an individual's response to a stimulus by providing previous exposure to the stimulus (Gazzaniga et al., 2014). Performance validity tests like the TOMM and MVP utilize priming by exposing individuals to a series of stimuli and then asking them to select these stimuli from a series of foils. When a stimulus is primed, individuals are more likely to correctly identify the stimulus when it is presented later. The effects of priming can last months and have been demonstrated through faster completion of word fragment tasks and recognition trials. Famously, this priming effect is preserved in cases of severe brain injury. Most notably, H.M., an individual who underwent bilateral resection of his medial temporal lobes, demonstrated priming effects and recognition despite severe anterograde amnesia (Gazzaniga et al., 2014). Additionally, H.M. displayed typical performance on Digit Span tasks despite impairment in the formation of episodic memory. As such, the highly preserved nature of these memory tasks makes them strong indicators of performance validity.

The high prevalence of genuine and feigned memory impairment also makes these instruments strong measures of validity. Memory impairment is both the most reported symptom during neuropsychological evaluations and the most commonly feigned (Lu et al., 2007). Genuine memory impairments present with typical patterns, but individuals feigning cognitive dysfunction are unlikely to match valid memory impairment patterns. Individuals with genuine cognitive disturbance will show impairments in memory tasks that require free recall, but not in forced- or multiple-choice recall. Forced-choice recognition tasks provide the added benefit of having chance-level cutoffs of performance. Although individuals like H.M. who experience severe cognitive disturbance demonstrate levels of recognition similar to healthy controls, impairment on a recognition task does not automatically indicate invalid performance. However, performance below chance levels does indicate problematic effort, as demonstrated by Pankratz et al. (1975). The unique preservation of recognition memory led to the development of numerous standalone PVTs that presented as measures of memory.
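The below-chance logic described above can be reproduced with a short calculation. The sketch below is a minimal illustration, assuming independent trials and a fixed guessing rate; it computes the exact binomial probability of scoring 36 or fewer correct out of 100 two-choice trials, as in the Pankratz et al. (1975) case, and is not code from any of the cited studies.

from math import comb

def prob_at_or_below(correct: int, trials: int, p_chance: float) -> float:
    """Exact binomial probability of obtaining `correct` or fewer successes by guessing alone."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(correct + 1))

# Pankratz et al. (1975) case: 36 of 100 two-choice trials correct (chance = 50%).
print(f"{prob_at_or_below(36, 100, 0.5):.4f}")  # ~0.0033, i.e., well under 0.4%

# The same function applies to other chance thresholds noted above, for example
# the roughly 33% guessing rate of a three-choice format such as the MVP.

Scoring only 36 correct when guessing alone would yield about 50 is so improbable that it suggests the examinee recognized the correct answers and deliberately avoided them, which is why below-chance performance is treated as evidence of problematic effort.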
Following the development of these standalone measures, a number of embedded validity indicators were established using verbal learning tests, visual memory tests, figure copy tasks, gross motor tasks, and various other neuropsychological measures (Larrabee, 2015). Many of these embedded measures still rely heavily on recognition memory, such as the Rey Complex Figure Test (RCFT; Meyers & Meyers, 1995), the CVLT-3 (Delis et al., 2017), and the Wide Range Assessment of Memory and Learning – Third Edition (WRAML-3; Adams & Sheslow, 2021). Although standalone measures have the advantage of solely measuring performance validity, embedded measures provide a level of convenience to the clinician already administering measures that include embedded indices.

Benefits of Embedded PVT Conditions within Established Memory Tests

During a neuropsychological assessment, clinicians are routinely limited by the amount of time they have face-to-face with a client for testing. Neuropsychologists must sample multiple domains related to the referral question, and clinicians are frequently limited to a single day of testing, if not less, due to insurance restrictions and other extraneous factors. Because memory is already assessed due to the frequency of reported memory impairment, the use of embedded effort indices on tests of memory allows clinicians to make effort determinations without the need for additional testing. These scores are readily available to clinicians and require no additional administration time. In addition to the need for effective time management, clinicians are also encouraged to utilize multiple PVTs throughout the duration of an evaluation. Thus, although many clinicians utilize standalone validity measures, additional validity indices throughout the evaluation aid in assessments of effort at differing time points. If clinicians were required to utilize standalone measures throughout the evaluation, the number of assessments of neuropsychological domains would be further limited. This is especially true in the pediatric population, in which clients tend to fatigue more quickly and sit for less time when testing (Baron, 2018). Interest in pediatric validity testing has only recently emerged, leading to a need for new validity indices in pediatric evaluations.

Need for Validity Testing in Pediatric Samples

Until the 2000s, few empirical studies focused on pediatric validity assessment, and interest in the subject was limited. The lack of interest in this subject was likely due to the widely held belief that children would not and could not feign performance during neuropsychological evaluations (Kirkwood, 2015). However, several studies conducted since the turn of the century have indicated that children can and do feign performance during evaluations. Furthermore, developmental research has supported these findings by demonstrating that children are capable of deception as early as preschool.

Deception in Children

The study of deception in children has focused substantially on both theory of mind development and executive functions. Theory of mind refers to a child's ability to understand that another individual can hold a false belief (Peterson & Peterson, 2015). As such, a child must believe that an individual can believe a false statement before trying to present them with one. Executive functions refer to the skills that facilitate goal-directed behavior and novel task completion.
These skills require a child to abstain from automatic responses and include inhibition, planning, working memory, and mental shifting (Peterson & Peterson, 2015). To deceive, a child must be able to hold the deceptive idea in his or her working memory and refrain from sharing contradictory evidence with the target. As such, children's ability to deceive is thought to align with the development of these skills.

Theory of Mind. Theory of mind is traditionally measured using the false-belief task (Peterson & Peterson, 2015). This task presents the child with a scenario in which a character or individual knows the location of an object. When the character is unaware, another character moves the item. The child is then asked where the first individual will look for the object when they return. If the child has developed theory of mind, they will be able to correctly identify that the character will look in the original, and thus incorrect, location. However, children who lack this skill will assume that the individual will look in the new item location. As such, to be deceitful, a child must first appreciate that people can hold false beliefs. Research consistently demonstrates that the ability to correctly solve the false-belief task emerges by age four (Peterson & Peterson, 2015).

Further research using the unexpected-contents paradigm suggests that young children are initially unable to acknowledge their own false beliefs (Wimmer & Hartl, 1991). In scenarios using the unexpected-contents paradigm, children are presented with a container that is clearly marked to indicate the contents. For example, a child may be shown a box with pictures of pencils on the lid to indicate that the box contains pencils. When asked, the child indicates that he or she believes that the box contains pencils. However, when the box is opened, the contents differ from expectations. In this example, the box may contain blocks. When asked what they believed to be in the box before the reveal, children without the ability to acknowledge false beliefs will indicate that they believed the box contained blocks before opening it, despite having stated pencils earlier. Like the false-belief task, the ability to solve the unexpected-contents paradigm also emerges around age four (Peterson & Peterson, 2015). Thus, one of the fundamental skills related to deception is present as early as preschool.

Executive Functions. Executive functions encompass the skills needed to override automatic thinking and to act intentionally (Diamond, 2013). These skills are required when an individual focuses their attention, develops plans, and inhibits impulsive actions. Of the executive functions, working memory and inhibition are considered fundamental aspects of deception (Peterson & Peterson, 2015). To lie, an individual must act deceptively, hold the deception in his or her working memory, and inhibit the disclosure of information that contradicts the deception. Therefore, children who have not yet developed these executive functions may struggle to engage in deception. Working memory and inhibition can be observed in early infancy, but these skills develop substantially in the preschool years (Garon et al., 2008), and they continue to develop into adolescence and early adulthood (Peterson & Peterson, 2015). In turn, from the time children enter preschool they are capable of deception, and they continue to develop the skills needed to deceive throughout childhood.
These findings contradict the traditional view that children are incapable of feigning performance during testing.

Heaton et al. (1978) demonstrated that clinicians struggle to identify suboptimal performance during assessments in adults, who would be presumed to have well-developed theory of mind and executive function skills. Although children are capable of deception, one might think that clinicians would be sensitive to feigning in this population due to the nascency of the associated skills. Contrary to this idea, studies have consistently indicated that adults have difficulty identifying deception in children (Crossman & Lewis, 2006; Peterson & Peterson, 2015; Talwar & Lee, 2002). Although teachers, parents, and individuals who regularly work with children are better at detecting deception than the average adult, naturalistic observation studies have indicated that adults struggle to detect deception in children as young as three (Crossman & Lewis, 2006; Peterson & Peterson, 2015; Talwar & Crossman, 2011). As children age, the difficulty of identifying their deception increases.

The traditional belief that children are not capable of deception when testing is disproven by the literature base surrounding pediatric deception. Children as young as four have both the theory of mind and the executive functions needed to lie, and these skills develop with age. With the knowledge that children can engage in deception during assessments, it is important to also understand why children might provide suboptimal effort when testing.

Reasons for Suboptimal Effort in Pediatric Samples

Factors that contribute to suboptimal performance in pediatric evaluations are quite diverse. Evidence of this diversity is demonstrated by the rates of low performance across referral questions. As with adult populations, children presenting with mild traumatic brain injuries (mTBIs) exhibit high rates of suboptimal performance. This population is estimated to perform below expectations between 12% and 20% of the time (Kirkwood, 2015). Furthermore, a significant population of interest specific to the pediatric population are those being evaluated for Social Security Disability benefits. Some reports indicate that nearly 60% of this population underperforms expectations during neuropsychological assessments. This is thought to be influenced by parental pressure on children and is commonly referred to as "malingering by proxy" (Kirkwood, 2015).

In addition to financial gain through litigation, several other reasons for poor effort in the pediatric population have been reported. Children with oppositional tendencies may perform poorly intentionally while testing. Additionally, children who are not interested in testing may either fail to attend to test stimuli or fail intentionally to accelerate the testing process. Furthermore, some children may fail due to low self-esteem. Children may even intentionally sabotage test results because they want to be sure that the clinician sees the difficulties they are having in school or other areas of functioning (Kirkwood, 2015).

Studies of pediatric malingering have largely focused on medical samples (e.g., Kirkwood et al., 2011; Welsh et al., 2012). However, there are other populations for whom suboptimal effort may influence treatment and outcomes. Many psychoeducational evaluations are performed each year that change the educational course and outcomes of students across the United States.
Between 2009 and 2021, 15% of students in the United States received special education services, totaling 7.5 million students ages 3-21 (National Center for Education Statistics, 2022). Each of these students requires an initial special education evaluation as well as a reevaluation every three years they receive services. In addition to typical special education evaluations, many seek evaluations to gain accommodations for high-stakes testing like college entrance exams. Despite the high potential for secondary gain in these populations, school psychologists rarely utilize performance validity measures, and performance validity testing is often absent from school psychology training programs (Holcomb, 2018). The lack of PVT use occurs despite school psychologists having access to instruments with performance validity indicators. Notably, there is a subtest available to nearly all who perform psychoeducational and neuropsychological evaluations that has long been used in the adult population as an indicator of effort: the Digit Span subtest of the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014), which is considered the primary instrument for assessing intellect in children. Despite the need for validity assessment in multiple areas of psychological testing, little is known about actual use in pediatric neuropsychological testing, let alone in psychological specialties less familiar with validity testing.

Reported vs. Actual PVT Use in Pediatric Assessments

With a recent increase in interest surrounding PVT and SVT use in children, Brooks et al. (2016) surveyed clinical neuropsychologists across the United States and Canada to evaluate the frequency of PVT and SVT use in pediatric populations. Of the 349 participants who completed the survey, 282 neuropsychologists met the inclusion criteria for analysis. Regarding PVT use in pediatric evaluations, a very high proportion, 92% of respondents, endorsed using at least one PVT per pediatric evaluation. Additionally, 88% endorsed the use of at least one SVT during pediatric evaluations. This rate of reporting exceeded previous surveys of adult practitioners. This seemingly unlikely discrepancy can be partially explained by the increasing significance placed on validity testing in the time between this survey and earlier surveys of adult-population clinicians (Brooks et al., 2016). However, reported and actual use of PVTs in pediatric populations may differ.

Following the publication of the survey results, Macallister et al. (2019) sought to examine the real-world use of PVTs compared to the reported prevalence. Like Brooks et al. (2016), the researchers suspected that the rates reported in the survey far exceeded what they saw in daily neuropsychological practice. The authors collected a convenience sample of reports they had been sent as a part of referrals to their clinic. The reasons for referral for these evaluations were re-evaluation, consultation, and review for medical intervention. Patients in this sample ranged in age from 6 to 17 years, as these ages constitute the school-age range. The report dates ranged from January 2015 to January 2017. These dates are relevant as they nearly coincide with the date of the survey. The final sample consisted of 131 neuropsychological reports from 102 neuropsychologists. Using the evaluation reports, Macallister et al. (2019) obtained PVT use rates that differed significantly from the survey results.
Of the 131 reports reviewed, six documented using PVTs, and each of the six cases was conducted by a separate neuropsychologist, making PVT use across clinicians in the sample 5.88%. In their discussion, the researchers highlight that their sample differs significantly from that of Brooks et al. (2016), as the survey sample was obtained through international recruitment while this study utilized a convenience sample. Additionally, documentation practices for PVTs are not well established, and this may mean that clinicians used these instruments without documenting them. However, the factor presented as most likely contributing to the discrepancy is the social desirability bias inherent in a survey of this kind. Multiple professional organizations within the field of neuropsychology have released position statements arguing for the use of PVTs, including the National Academy of Neuropsychology (NAN; Bush et al., 2005) and the American Academy of Clinical Neuropsychology (Guilmette et al., 2020; Heilbronner et al., 2009). As such, the expectation that PVTs be used may positively skew practitioners' reported use of them despite lower practical application.

An important limitation of the Macallister et al. (2019) study must be addressed. In the abstract, methods, and results sections of the article, the reports are said to be dated 2017 to 2018. This would mean that 5 of 131 reports from the two years preceding the study would equal 4.88% of reports. However, later in the article, when referring to reports from the two preceding years, Macallister et al. (2019) note that 6 of 56 documented PVT use, compared to none from 2001 to 2013. This changes the percentage of PVT use among recent reports to 10.7%. Although the exact percentage of recent reports including PVT use is unclear, both 4.88% and 10.7% fall well below the 92% rate obtained by Brooks et al. (2016).

Regardless of whether the findings of Brooks et al. (2016) or Macallister et al. (2019) more accurately represent the use of PVTs in pediatric practice, further study of PVTs is needed. If use among pediatric clinicians is low, there is little information about the utility of the instruments in practice when working with children. If use rates are high, the problem of empirically unvalidated instruments being used to make clinical decisions becomes more central. One instrument highlighted in the survey as frequently used in the evaluation of performance validity in pediatric populations was the Digit Span subtest of the Wechsler intelligence measures (Brooks et al., 2016). Despite high rates of endorsement, this metric has been studied little with pediatric populations.

History of the Digit Span Task

The Digit Span task is a classical task that requires individuals to hold a string of numerical digits in working memory and repeat them to the examiner immediately following their presentation. Originally, this task involved simply repeating strings of digits exactly as they were presented. Later variations added conditions that involved reciting digits in reverse or numerical order. Initial items contain few digits, but subsequent items add digits to increase difficulty as the examinee progresses through the task. The task is discontinued after a set number of consecutive failures to repeat the digit strings verbatim. Although it appears on each of the full Wechsler intelligence measures, the Digit Span subtest significantly predates the development of these instruments.
Interest in the limited number of digits an individual can repeat is believed to have first been noted in 1871 by Oliver W. Holmes (Richardson, 2007). Holmes noted that most people lost the ability to recite strings when such strings spanned between seven and ten digits. Later, Jacobs assessed the ability of a number of schoolchildren to correctly recite numerals and letters and found that the average span length increased with age (Jacobs, 1887), and so a developmental perspective was introduced. Subsequently, Binet and Simon included a forward digit condition in their original intelligence assessment (Binet & Simon, 1905; Richardson, 2007) and used normative data comprising age-related expectations. When the Binet-Simon scale was adapted into the Stanford-Binet by Terman (Terman, 1916), the task requiring repetition of digits was again included. Additionally, Terman added a backward condition, which required participants to repeat the digits in reverse order.

In 1939, David Wechsler released the Wechsler-Bellevue Intelligence Scale (Wechsler, 1939). This first Wechsler intelligence measure included Digit Span as one of its subtests. The original Digit Span subtest included both a forward and a backward condition. As the Wechsler-Bellevue Intelligence Scale evolved into the Wechsler Adult Intelligence Scale (WAIS; Wechsler, 1955) and the Wechsler Intelligence Scale for Children (WISC; Wechsler, 1949), the Digit Span subtest remained a standard component. In each revision of these two instruments, the Digit Span subtest has been a primary subtest. The one exception to the inclusion of the Digit Span subtest is the Wechsler Abbreviated Scale of Intelligence (WASI; Wechsler, 1999, 2011). Both editions of this instrument have omitted the Digit Span subtest because working memory is not assessed on the WASI.

While early Wechsler instruments included only the forward and backward conditions, the most recent adult and child versions, the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV; Wechsler, 2008) and the WISC-V (Wechsler, 2014), introduced a third component to the Digit Span subtest. The sequencing condition is similar in structure to the forward and backward conditions. However, participants are now tasked with repeating the digits in numerical order from lowest to highest. The raw score for Digit Span on these revised instruments is calculated by adding the raw scores across each of the three conditions. Use of this subtest as an embedded measure of effort has long been present in the adult literature and practice, and it has been used in several forms ranging from the age-corrected scaled score to various other calculations based on raw scores derived from the subtest.

Digit Span Age Corrected Scaled Score (ACSS): An Embedded Effort Indicator

The score for an examinee's Digit Span performance is calculated by assigning points for every string repeated correctly. The sum of these points constitutes the raw score. The raw score is converted to a scaled score based on the mean and standard deviation of a national normative sample. The resulting scaled score is corrected for age because it transforms the raw score using the mean and standard deviation that align with the examinee's age. Scaled scores have a normative mean of 10 and a standard deviation of 3, so an examinee's scaled score reflects how far his or her performance falls from the age-based mean. This age corrected scaled score (ACSS) is the most reported index of examinee performance in clinical evaluations.
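As an illustration of the conversion just described, the short Python sketch below standardizes a raw score against age-band norms and rescales it to the familiar mean-10, SD-3 metric. The normative mean and standard deviation used here are invented for the example and are not WISC-V values.

def age_corrected_scaled_score(raw: float, norm_mean: float, norm_sd: float) -> int:
    """Convert a raw score to a scaled score (mean 10, SD 3) using age-band norms."""
    z = (raw - norm_mean) / norm_sd      # distance from the age-based mean in SD units
    scaled = round(10 + 3 * z)           # rescale to the mean-10, SD-3 metric
    return max(1, min(19, scaled))       # Wechsler scaled scores are bounded at 1 and 19

# Hypothetical example: a raw score of 18 in an age band with mean 24 and SD 3
# falls two standard deviations below the mean, yielding a scaled score of 4.
print(age_corrected_scaled_score(18, norm_mean=24, norm_sd=3))  # 4

The same raw score therefore maps to different scaled scores for examinees of different ages, which is what allows very low ACSS values to be interpreted relative to age-based expectations.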
Although memory disturbance can impair performance on Wechsler Digit Span tasks, studies have demonstrated that those with severe memory impairments, such as individuals with Korsakoff syndrome or those who have undergone surgery, are able to perform typically compared to those without such impairments (Iverson, 2003). Despite this evidence emerging in the 1970s, the practical application of this information in the evaluation of malingering was not examined explicitly until the work of Iverson and Franzen in the 1990s. In a simulation study, Iverson and Franzen (1996) noticed that a Digit Span ACSS cutoff score, meaning the yielded scaled score based on normative data, of 4 correctly classified 77.5% of malingerers and 100% of individuals with clinical presentations. Relatedly, Iverson and Franzen (1994) examined the classification utility of the ACSS score in a sample consisting of normal controls, simulators, and clinical patients suffering from head injury. Normal controls were correctly classified with 100% accuracy across cutoff scaled scores of 3, 4, and 5, all of which fall well below the normative mean of 10. Sensitivity of the ACSS was 60% with a cutoff scaled score of 3, 82.5% with a cutoff scaled score of 4, and 90% with a cutoff scaled score of 5. These initial studies implied that the ACSS may have clinical utility as an embedded PVT measure. However, more sophisticated metrics were sought using raw data from the task.

Development of Reliable Digit Span: An Embedded Effort Indicator

Reliable Digit Span (RDS) was first established by Greiffenstein et al. (1994). The primary objectives of this study were to establish grouping criteria for malingering and to validate popular memory-based PVT measures. RDS differed from the ACSS in that it referenced the longest string of digits recalled perfectly on both trials of the forward and backward conditions. For example, if the longest string an examinee recalled perfectly on both trials of Digits Forward was five digits, the forward value is five. This is added to the longest number of Digits Backward recalled perfectly twice, say three. The RDS would then be eight (5 + 3) digits reliably held.
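The RDS calculation lends itself to a simple worked implementation. The Python sketch below is an illustrative example rather than scoring code from any Wechsler manual: it takes each condition's trial results keyed by span length and returns the sum of the longest spans passed on both trials, and the sample protocol is hypothetical.

def reliable_span(trials_by_length: dict[int, tuple[bool, bool]]) -> int:
    """Longest span length at which both trials were repeated perfectly (0 if none)."""
    passed = [length for length, (trial1, trial2) in trials_by_length.items() if trial1 and trial2]
    return max(passed, default=0)

def reliable_digit_span(forward: dict[int, tuple[bool, bool]],
                        backward: dict[int, tuple[bool, bool]]) -> int:
    """RDS = longest forward span held twice + longest backward span held twice."""
    return reliable_span(forward) + reliable_span(backward)

# Hypothetical protocol matching the example above: five digits reliably held
# forward and three reliably held backward, giving an RDS of 8.
forward = {3: (True, True), 4: (True, True), 5: (True, True), 6: (True, False)}
backward = {2: (True, True), 3: (True, True), 4: (False, False)}
print(reliable_digit_span(forward, backward))  # 8

The RDS-R metric discussed later extends this sum in the same way by adding the corresponding value from the sequencing condition.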
This group was characterized by those who met criteria for the PPSC group and two or more additional criteria: 1) two or more severe ratings on neuropsychological tests compared to peers of similar age and education, 2) an improbable symptom presentation that contradicts records and evidence, 3) inability to work or engage socially one year after injury, and 4) claims of isolated memory loss. The authors utilized several established PVT measures including the Rey Auditory Verbal Learning Test (AVLT; Rey, 1958), the Wechsler Memory Scale (WMS; Wechsler, 1945), the Wechsler Memory Scale Revised (WMS-R; Wechsler, 1987), Rey’s Word Recognition List (WRL; Lezak, 1983), Rey’s 15-Item Memory Test (Rey-15; Lezak, 1983), Rey’s Dot Counting task (Lezak, 1983), and the Portland Digit Recognition Test (PDRT; Binder & Willis, 1991). In addition to these established measures, the authors proposed a new measure using the Digit Span subtest of the Wechsler Adult Intelligence Scale – Revised (WAIS-R; Wechsler, 1981). The measure took the sum of the longest string of digits correctly recited twice on the Forward and Backward conditions of the Digit Span subtest. The first two trials of the Forward condition of 29 the WAIS-R contained strings of three digits while the Backward condition started with strings of two digits. Participants who completed the first set successfully in both conditions would therefore achieve a minimum score of five digits reliably held. Those who failed to complete either of the basal sets were assigned an RDS score of 3. Group comparisons showed that the probable malingerers group scored significantly lower than both the TBI and PCSS groups. To establish cut scores, Greiffenstein et al. (1994) used both a conservative and liberal decision rule. First, the conservative cut score was set to 1.3 standard deviations below the mean performance of the TBI group to achieve 90% specificity. This decision rule resulted in an RDS cutoff score of 7 digits reliably held. The second cutoff score was developed by setting the metric to one standard deviation below the TBI group’s mean score. This decision rule resulted in a cutoff score of 8 digits reliably held. Resulting analyses showed that when compared to the TBI group specificity based on scores of 7 and 8 were 73/54 and sensitivity was 70/82. When compared to the PCSS group, specificity rates of 89/69 respectively were found, and sensitivity was 68/82. These results suggested that RDS may be a viable clinical tool for detecting suboptimal performance during neuropsychological evaluations. Following the study of RDS, an official supplement was released for the WAIS-IV that included normative data for the RDS score (Wechsler, 2009). RDS is currently in use, but changes to the Weschler suite of intelligence instruments have changed the nature of the Digit Span subtest. Thus, a revised version of the RDS has been introduced. Development of Reliable Digit Span – Revised Upon release of the WAIS-IV, the nature of the Digit Span subtest fundamentally changed. In addition to the traditional forward and backward conditions, the sequencing condition was added to the subtest. The sequencing subtest required participants to repeat the 30 string of digits back to the examiner in ordinal order from smallest to largest. The raw score from the sequencing task was added to the raw scores of the forward and backward tasks to develop a Digit Span raw score. This addition also allowed for the expansion of the RDS metric. Spencer et al. 
(2010) were the first to introduce the expanded version of the RDS metric. By adding the length of the longest span correctly repeated consecutively in the sequencing condition to the traditional RDS score, the Reliable Digit Span – Revised (RDS-R) score was developed. In their examination of veterans in a known-groups design, Spencer et al. (2010) found that RDS-R metric with a cutoff score of 11 reliably held presented higher specificity, sensitivity, and predictive power than the traditional RDS measure. Furthermore, a subsequent study conducted by Young et al. (2012) found that the new RDS-R measure exhibited similar utility to the RDS and ACSS metrics in the detection of suboptimal performance in a group of adults referred for evaluation at the VA. Analysis of Digit Span Indices in Adults Following the introduction of the RDS measure, study of its utility in adult populations expanded significantly. Schroeder et al. (2012) conducted a review of the RDS literature ranging from 1994 to 2011 and found 20 studies examining the measure in various clinical populations. When using an RDS cutoff score of six, specificity rates exceeded the 90% recommended threshold in samples including controls, mixed clinical populations, traumatic brain injuries, and simulators. However, RDS specificity was lower in special populations including those with intellectual disabilities, memory disorders, language barriers, and cerebrovascular accidents. As such, RDS was demonstrated to be a promising measure of performance validity in most adult populations. Additionally, preservation of Digit Span tasks in highly amnesic individuals (Gazzaniga et al., 2014) also suggests that the scaled score provided by the Digit Span subtest 31 may have potential as a validity indicator. This evidence is derived entirely from the adult literature however, and adult practices are not always suitable for pediatric populations (Baron, 2018). Thus, an examination of Digit Span in the pediatric literature is warranted. Critical Analysis of Digit Span Indices in Pediatric Samples In a recent survey of 282 pediatric neuropsychologists, 65.3% of respondents indicated that they at least occasionally use the RDS validity measure with children with 22.4% of the sample endorsing using RDS often and 21.4% endorsing using the measure almost always (Brooks et al., 2016). The results of this survey appear promising for the utility of the measure in pediatric assessments. Unfortunately, this high rate of endorsement is problematic when examining the empirical evidence for the use of this measure with children. Despite more than half of the pediatric neuropsychologists sampled endorsing use RDS, only two studies aimed at establishing pediatric cutoff scores have been conducted, and each of these studies utilizes an outdated version of the WISC and focuses on highly specific clinical presentations (Kirkwood et al., 2011; Welsh et al., 2012). The inclusion of the Digit Span subtest on both the WISC-IV (Wechsler, 2003) and the WISC-V (Wechsler, 2014) allows clinicians to develop an RDS score. However, little has been done to establish pediatric cutoff scores, leading to uncertainty about their ethical use and utility. To date, just two studies have been conducted to establish credible metrics for using RDS as a measure of validity in pediatric assessment. The first attempt at standardizing the RDS score in a pediatric population was conducted by Kirkwood et al. (2011). 
This study examined the utility of both RDS and Digit Span ACSS in detecting suboptimal performance in a sample of children referred for neuropsychological evaluation following a TBI. Of the original sample, injury modality included sports related 32 injuries (65%), falls (18%) vehicular injury (11%), and assault (3%). Children were excluded if the evaluation was forensic in nature, neurosurgical intervention was involved, the injury resulted from abuse, or the brain injury was not related to physical trauma. The final sample comprised 274 children ranging in age from 8 to 16. The sample was split into a credible performance group and a noncredible performance group. Children in the credible performance group passed both the Medical Symptom Validity Test (MSVT; Green, 2004) and the TOMM (n = 224). The noncredible performance group consisted of those who failed both the MSVT and the TOMM (n = 37). Thirteen children failed the MSVT but passed the TOMM. This group of children was excluded from analysis as group membership was ambiguous. Group differences were examined using a series of t-tests. Significant differences in performance between groups was found on the Digit Span ACSS (p<.001, d – 1.5), RDS (p<.001, d = 1.2), Digit Span Forward task (p < .001, d = 1.4), and the Digit Span Backward task (p < .001, d = 1.1). Kirkwood et al. (2011) also went on to establish cutoff scores based on the performance of their sample. Optimal cutoff scores were established when specificity met or exceeded 90%. An ACSS cutoff score was established at scaled score ≤5 which resulted in a specificity of 95% and a sensitivity of 51%. An RDS cutoff score established at raw score ≤6 yielded a specificity of 92% and a sensitivity of 51%. Negative and positive predictive power were also reported at a variety of noncredible performance base rates. Kirkwood et al. (2011) found that when using an RDS cutoff score of ≤7, false positive rates were 31%. As a result, more conservative cutoff scores were recommended when using RDS in pediatric populations. Although the work of Kirkwood et al. (2011) was revolutionary as it examined the utility of the Digit Span subtest of the WISC-IV in a pediatric sample, the generalizability of the 33 findings was narrow because the sample consisted exclusively of children presenting with TBI. To examine the utility of the measure in other populations, Welsh et al. (2012) utilized RDS in an epilepsy sample. The study’s sample consisted of 54 children ages 6 to 17 presenting for neuropsychological evaluation with various epilepsy conditions. Most of the sample presented with partial epilepsy syndromes (n = 33) while others presented with generalized epilepsy syndromes (n = 10) and mixed presentations or unspecified syndromes (n = 11). As part of their neuropsychological assessments, each child completed the TOMM and Digit Span subtest of a Wechsler instrument (WISC-IV or WAIS-III). Additionally, participants completed a full or abbreviated Wechsler intelligence measure (WISC-IV, WAIS-III, or WASI). In order to more closely match the IQ scores presented by the WASI, Welsh et al. (2012) utilized the GAI measure of the WISC-IV and the WAIS-III. This also provided the added benefit of removing the Digit Span subtest from analyses of intellectual functioning. Using the cutoff scores provided by Kirkwood et al. (2011), only 65% of the sample passed the RDS metric. These results fall well below the 90% pass rate established in the literature. 
Despite the low rate of RDS success, 90% of the sample validly completed the TOMM at or above cutoff scores. Sensitivity and specificity analysis of the sample’s performance showed that to reach adequate specificity, a cutoff score of ≤3 reliably held digits would be necessary. However, such a low threshold provided poor sensitivity at just 20%. Thus, clinical utility was suspect. Other than the studies conducted by Kirkwood et al. (2011) and Welsh et al. (2012), studies examining the utility of RDS in pediatric populations have been significantly limited or have utilized methodologies other than the preferred simulation and known-groups designs. In a study of 119 children referred for neuropsychological evaluation for attention- 34 deficit/hyperactivity disorder (ADHD), autism spectrum disorder (ASD), specific learning disabilities (SLD), or anxiety/depression, Weiss et al. (2019) compared failure rates on the TOMM, RDS (scores calculated from both WISC-IV and WAIS-IV), and the Discriminability Index from either the California Verbal Learning Test – Children’s Version (CVLT-C; Delis et al., 1994) or the Californian Verbal Learning Test – Second Edition (CVLT-II; Delis et al., 2000). The discriminability index reflects the ability of the examinee to recognize previously heard words in a word list-learning task. It is a forced-choice task in that only a yes/no response is required. However, groups comparisons were not possible in this study as no known group of suboptimal effort was present. Additionally, the study utilized RDS scores from two separate Wechsler instruments and failed to provide the adult cutoff scores used to determine failure on the RDS measure. As such, little information can be taken from this study. Other studies similarly use adult cutoff scores despite evidence suggesting these cutoffs are invalid in younger populations (Kirkwood et al., 2011; Welsh et al., 2012). The early studies conducted by Kirkwood et al. (2011) and Welsh et al. (2012) establish the RDS score as a potential indicator of performance validity. However, discrepancies in cutoff scores across populations yields a need for further study to validate the index. Following the dissemination of these studies, no additional studies were conducted using the WISC-IV. This is likely due, in part, to the release of the revised WISC-V (Wechsler, 2014). RDS on the WISC-V Thus far, most studies of RDS and Digit Span as effort indices in children have utilized the WISC-IV. Little research has been conducted on the utility of these indices when calculated using the current edition of the Weschler child version, the WISC-V. The Digit Span subtest of the WISC-V saw several important revisions that fundamentally change the task (Wechsler, 35 2014). The forward condition added longer digit strings at the end to increase the discriminant ability of the measure at higher ability levels. This is unlikely to affect cutoff scores for RDS as cutoff scores typically focus on the lower end of performance. Changes to the backward condition, however, significantly affect scores at the lower end. An additional short trial was added to the beginning of the task to reduce the task difficulty gradient. As such, RDS scores on the WISC-V may be higher than RDS scores on the previous edition of the instrument. Finally, the WISC-V added the sequencing condition that tasks participants with reciting the digits in ascending order, called Digits Sequencing. 
This new condition significantly influences the ACSS score as this score is now derived from the sum of each of the three conditions. Additionally, the sequencing task allows for the calculation of RDS-R. To date, one study has been published examining Digit Span performance on the WISC- V regarding performance validity. In a study of 130 children referred for neuropsychological evaluation, Ventura et al. (2019) compared WISC-V ACSS, RDS, and RDS-R performance to TOMM performance. Like other limited studies in this area, this study did not include a known or simulated malingering group. Rather, the authors examined failure rates across each measure using predetermined cutoff scores. Results indicated that Digit Span PVT failure rates were much higher than failure of the TOMM. However, lack of empirical support for these cutoff scores calls into question the significance of these findings. Additionally, at the time of this review, an article by Kirk et al. (2020) reports an article currently in press that examines the utility of the WISC-V Digit Span as a PVT. The study uses a known-groups design with the Medical Symptom Validity Test (MSVT; Green, 2004) serving as the criterion measure. The results reported in this review indicate that the ACSS, RDS, and RDS- R measures were all able to provide cutoff scores with strong specificity and sensitivity. As such, 36 the results of the current study should serve to supplement these findings as the second study to examine the WISC-V Digit Span indices in a known-groups design. The Digit Span indices have strong potential as performance validity indicators in the pediatric population for numerous reasons. First, these indices have the advantage of being embedded in many evaluations without the need for additional administration time. Any evaluation that includes the Digit Span task can provide validity information. Additionally, while the Digit Span task is considered a working memory task, recognition memory is not the memory modality assessed by the test. Therefore, upon validation, Digit Span could serve as an embedded validity indicator that does not rely on recognition alone. This diversifies the sources of validity information to the clinician and helps to make more informed decisions regarding effort. A Non-Recognition Based Memory Task to Measure Performance Validity The Digit Span subtest requires individuals to recall presented information, and thus, is not a recognition task. Although Digit Span tasks do require working memory, the actual construct assessed by the task has been a contentious subject throughout its use in intelligence assessments. Originally, Digit Span on early Wechsler instruments included only forward conditions. However, these tasks were too brief make meaningful clinical interpretations. As a result, additional conditions were added to the Digit Span task. Later iterations of the Digit Span task included the backward and sequencing conditions, but some began to question whether the task measured a unified construct. Following factor analysis on the Test of Memory and Learning (TOMAL; Reynolds & Bigler, 1994), Reynolds (1997) examined whether the forward and backward conditions of the Digit Span task should be combined for clinical analysis. Factor analysis showed that the forward and backward conditions 37 loaded on separate factors. Reynolds argued that the forward task was more a measure of verbal working memory while the backward condition was more difficult and potentially required visual-spatial manipulation. 
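Whatever combination of constructs the three conditions tap, the raw-score indices reviewed above are mechanical to compute. As a concrete recap, the Python sketch below derives RDS and RDS-R from hypothetical per-span trial records; it assumes the same two-trials-correct rule for the sequencing condition that is applied to the forward and backward conditions, and it is an illustration rather than any instrument's scoring procedure.

# Illustrative calculation of RDS and RDS-R from hypothetical trial records.
# Each condition maps span length -> (trial 1 correct, trial 2 correct).
def longest_reliable_span(trials):
    """Longest span length for which both trials of the item were repeated correctly."""
    reliable = [span for span, (t1, t2) in trials.items() if t1 and t2]
    return max(reliable, default=0)

forward    = {2: (True, True), 3: (True, True), 4: (True, True), 5: (True, False)}
backward   = {2: (True, True), 3: (True, True), 4: (False, False)}
sequencing = {2: (True, True), 3: (True, False)}

rds = longest_reliable_span(forward) + longest_reliable_span(backward)  # 4 + 3 = 7
rds_r = rds + longest_reliable_span(sequencing)                         # 7 + 2 = 9
print(rds, rds_r)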
A recent study has also indicated that the internal consistency of the combined Digit Span score is closer to 0.70 than the 0.90 reported by the WAIS-IV manual (Gignac et al., 2019). As such, discussion of what aspects of memory the Digit Span task actually measures continues, but the task is distinctly different from the other memory based PVTs due to its lack of recognition requirements. This further strengthens the argument for the use of Digit Span indices as embedded PVTs, but only once certain gaps in the research are addressed. Research Gaps At the time of this study, the use of Digit Span indices from the WISC-V as performance validity indicators is scientifically questionable. Yet, a recent survey of pediatric neuropsychologists found that over 65% of pediatric clinicians reported at least occasional use of the Reliable Digit Span measure with children (Brooks et al., 2016). If true, these results indicate that over half of pediatric clinicians are using an instrument without empirical support to make determinations about client effort. Digit Span PVT indices utilize a measure of working memory that does not rely on recognition memory. As such, the instrument needs to be studied in known-groups designs. These studies require the use of strong criterion measures to establish groups with valid and invalid performance. Known-groups studies utilizing well-established recognition PVTs would provide clarification of the clinical utility of the indices. For RDS and other Digit Span indices to provide meaningful performance validity information when working with pediatric populations, cutoff scores should be established with adequate specificity and sensitivity. Additionally, the negative and positive predictive power of 38 these indices should be examined across estimated rates of suboptimal effort in pediatric populations. Once established, clinicians will have guidance about how to best use these indicators to reliably determine effort in their young clients. Current Study The current study sought to establish cutoff scores of the Digit Span ACSS, RDS, and RDS-R scores of the WISC-V using a mixed clinical sample. The study uses a known groups design to examine the utility of the Digit Span ACSS, RDS, and RDS-R scores in distinguishing between children who successfully passed the Memory Validity Profile (MVP; Sherman & Brooks, 2015b), a stand-alone, recognition memory-based pediatric PVT, from those who failed the MVP during a neuropsychological assessment. In accordance with age ranges set by the WISC-V, participants in this study ranged in age from 6-16 years of age. Additionally, each participant completed the MVP as a part of an outpatient neuropsychological evaluation. Data were collected from a single rehabilitation hospital in the midwestern United States. Given high rates of endorsement for the use of Digit Span indices among pediatric neuropsychologists (Brooks et al., 2016), study of appropriate cutoff scores, sensitivity, specificity, and predictive power is essential to conducting empirically supported work. This study aims to be the first to use a known-groups design to examine the psychometric properties of the ACSS, RDS, and RDS-R using the only standalone PVT designed specifically for use in pediatric populations as the criterion measure. The purpose of this study is to establish cutoff scores in a mixed clinical sample that achieve 90% specificity and at least 40% sensitivity. Additionally, sensitivity and specificity across various scores will be presented. 
Lastly, positive and negative predictive power will be calculated for each index according to estimated base rates in a similar sample (Wilson & Lesica, 2021). METHOD Data Set An extant data set was utilized for the completion of this study. The database included neuropsychological assessment data from individuals presenting to a midwestern rehabilitation hospital outpatient clinic from September 2018 to November 2022. Referral reasons included traumatic brain injuries, anoxia, vascular conditions, tumors, attention-deficit/hyperactivity disorder, emotional disorders, autism spectrum disorder, speech and language concerns, cerebral palsy, myelomeningocele and/or hydrocephalus, and conditions classified as "other". Deidentified assessment results were available for each child who met inclusionary criteria. Participants Inclusion Criteria The extant data set was filtered to include only participants who met the following criteria. First, participants were required to be ages 6-16 at the time of their evaluation to make them eligible to complete both the WISC-V and the MVP. Participants who met these criteria were also required to have completed both assessment measures during their outpatient neuropsychological evaluation. Children who met these two criteria between September 2018 and November 2022 were eligible for participation if they did not meet any exclusionary criteria. Exclusion Criteria Several exclusionary criteria were used for data selection. Children unable to provide informed assent to their evaluation were excluded from analysis. Additionally, children who were not fluent in English were excluded from the study. Lastly, children unable to complete either the WISC-V or the MVP due to significant uncorrected visual or auditory impairments were also excluded from analysis. Group Assignment Participants were sorted into one of two groups: the valid performance group and the invalid performance group. Participants who failed the Memory Validity Profile (MVP; Sherman & Brooks, 2015b) based on an experimental cutoff established by Wilson and Lesica (2021) were placed in the invalid performance group. Participants who passed the MVP based on this cutoff were placed in the valid performance group. Demographics Descriptive statistics for demographic variables are presented in Table 1. Compared to the valid performance group, participants in the invalid performance group were younger in age (p < 0.001, d = 0.79). Differences in parent education, sex, proportion of racial minorities, and reason for referral were not observed (p > 0.05).

Table 1. Demographic Characteristics

Variable                     Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)
Age (months)*                130.71 (35.06)              134.65 (34.9)                 107.83 (26.48)
Parent Education Level       13.10 (2.59)                13.07 (2.58)                  13.3 (2.64)
Sex (n, %)
  Male                       127 (62.25)                 111 (87.40)                   16 (12.60)
  Female                     77 (37.75)                  63 (81.82)                    14 (18.18)
* p < .05

Table 1.
(cont’d) Variable Race (n, %) White Other Reason for Referral (n, %) TBI Anoxia Vascular Tumor ADHD Emotional ASD Complete Valid Invalid Sample Performance Performance (n = 204) (n = 174) (n = 30) 128 (62.75) 110 (63.22) 18 (60.00) 76 (37.25) 64 (36.78) 12 (40.00) 41 (20.10) 39 (22.41) 2 (6.67) 3 (1.47) 3 (1.72) 0 (0.00) 5 (2.45) 5 (2.87) 0 (0.00) 2 (0.98) 2 (1.15) 0 (0.00) 34 (16.67) 28 (16.09) 6 (20.00) 10 (4.90) 9 (5.17) 1 (3.33) 2 (0.98) 2 (1.15) 0 (0.00) Speech/Language 6 (2.94) 4 (2.30) 2 (6.67) Cerebral Palsy 13 (6.37) 9 (5.17) 4 (13.33) Myelomeningocele/Hydrocephalus 5 (2.45) 2 (1.15) 3 (10.00) Other * p < .05 83 (40.69) 71 (40.80) 12 (40.00) 42 Procedures Testing Participants presenting for neuropsychological evaluation during the time span ranging from September 2018 to November 2022 followed typical procedures for neuropsychological assessment. The outpatient neuropsychological evaluation typically lasted a single day. Each evaluation started with a clinical intake interview conducted by either a board-certified clinical neuropsychologist, a postdoctoral fellow, or a practicum student under the direct supervision of either the clinical neuropsychologist or the postdoctoral fellow. Following the intake interview, participants were given a break while the assessment team prepared the evaluation room. Neuropsychological test administration was conducted by either a board-certified clinical neuropsychologist, a postdoctoral fellow, a psychometrist, or a practicum student under direct supervision. Testing lasted from the morning to early afternoon with a lunch break in the middle. Each evaluation concluded with a feedback session with the family to discuss findings and recommendations. Measures Children selected for participation in this study were administered several measures based on the referral question that brought them in for evaluation and their unique presentations. However, all individuals included for analysis were given the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014) and the Memory Validity Profile (MVP; Sherman & Brooks, 2015b). The indices from these measures are discussed below. Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V) The WISC-V (Wechsler, 2014) is a measure of cognitive ability developed for use with children ages 6-16. The WISC-V comprises 21 subtests that measure ability across various 43 cognitive domains. Ten subtests are considered primary subtests as the make up the primary index scores: Verbal Comprehension, Visual Spatial, Fluid Reasoning, Working Memory, and Processing speed. Additionally, the first seven primary subtests make up the Full-Scale Intelligence Quotient. The WISC-V is one of the most widely administered measures of cognitive ability in neuropsychological assessments. Full-Scale Intelligence Quotient. The Full-Scale Intelligence Quotient (FSIQ) is a measure of overall intellectual functioning. The FSIQ comprises seven subtests and is the most reliable index provided by the instrument. The seven subtests included in the FSIQ are Block Design, Similarities, Matrix Reasoning, Digit Span, Coding, Vocabulary, and Figure Weights. The FSIQ is a standardized composite score with a mean of 100 and a standard deviation of 15. Reliability. Split-half reliability coefficients for the FSIQ range from 0.96-0.97 across age ranges. Overall, the average split-half reliability average is r = .96. 
Standard errors of measurement across ages range from 2.6-3.0 with an overall average standard error of measurement of 2.9. Test-retest reliability was calculated as r = .91 (.92 when correcting for sample variability). The standard difference between administrations is reported as d = 0.44. Validity. The FSIQ score of the WISC-V exhibits high levels of concurrent validity as evidenced the correlations of the index with FSIQ measures from other Wechsler intelligence instruments. The correlation between the WISC-V FSIQ measure and the WISC-IV FSIQ measure is r = 0.81 (r = 0.86 when correcting for sample variability). The correlation between the FSIQ measures of the WISC-V and the WAIS-IV is similarly strong at r = 0.84 (r = 0.89 when correcting for sample variability). 44 Application. The FSIQ metric is a commonly used metric of overall intellectual ability in both clinical and research settings. For the purposes of this study, differences in FSIQ between participants in the valid and invalid performance groups were examined to determine whether participants intellectual functioning was related directly to performance on the various PVT measures used in this study. General Ability Index. The General Ability Index (GAI) is one of 13 index scores provided by the WISC-V. The GAI score is like the FSIQ score as both are measures of general cognitive ability. The GAI differs from the FSIQ by omitting subtests of working memory and processing speed. The GAI comprises five of the seven subtests that make up the FSIQ with Digit Span and Coding removed. This measure is interesting in the assessment of Digit Span PVTs as it removes the shared variance between the measure and Digit Span performance. The GAI is a standardized composite score with a mean of 100 and a standard deviation of 15. Reliability. Split-half reliability coefficients for the GAI range from 0.95-0.98 across age ranges. Overall, the average split-half reliability average is r = 0.96. Standard errors of measurement across ages range from 2.36-3.35 with an overall average standard error of measurement of 3.07. Test-retest reliability was calculated as r = 0.89 (0.91 when correcting for sample variability). The standard difference between administrations is reported as d = 0.41. Validity. The GAI score of the WISC-V demonstrates high concurrent validity with previous Wechsler instruments. The correlation between GAI scores on the WISC-V and the WISC-IV is r = 0.80 (r = 0.85 when correcting for sample variability). Similarly, the 45 correlation between GAI scores on the WISC-V and the WAIS-IV is strong at r = 0.74 (r = 0.83 when correcting for sample variability. Application. The GAI metric provides an estimate of overall cognitive ability that is less reliant on processing speed and working memory than the FSIQ score. Additionally, unlike the FSIQ score, the Digit Span subtest is not included in the calculation of the GAI. The GAI measure was used to determine whether intellectual functioning is directly related to the various PVT measures in this study once the contamination from the Digit Span subtest is removed. Digit Span (Dependent Variables). The Digit Span subtest of the WISC-V is a measure of working memory. During Digit Span administration, the examiner reads the participant numbers of increasing length, and the participant is then tasked with repeating them immediately from memory. The Digit Span subtest has three conditions: 1) Forward, 2) Backward, and 3) Sequencing. 
The Forward task requires participants to repeat the numbers in the same order they were read, the Backward task requires recalling numbers in reverse order, and the Sequencing task requires recalling numbers in ascending order. Each item of the subtest includes two trials of equal length, and the number of digits in each item increases until the participant fails two strings within an item. Raw scores are calculated for each task, and these scores are then summed and converted to an age corrected scaled score (ACSS; M = 10, SD = 3). Reliability. Split-half reliability coefficients for the Digit Span ACSS range from 0.89- 0.93 with an average reliability coefficient of rxx a = 0.91. Internal consistency for special groups designated in the WISC-V manual range from 0.83-0.99. Across age groups, the standard error of measurement for the Digit Span subtest ranges from 0.79-0.99 with an 46 overall average of 0.88. Test-retest reliability across all ages is reported as r = 0.79 (r = 0.82 when corrected for sample variability). Standard difference between first and second administration was d = 0.10. Validity. The Digit Span subtest demonstrates strong concurrent validity with Digit Span scores from other Wechsler instruments. The correlation between Digit Span ACSS scores on the WISC-V and the WISC-IV is strong at r = 0.60 (r = 0.65 when correcting for sample variability). The correlation between Digit Span ACSS scores on the WISC-V and the WAIS-IV is stronger than the WISC-IV at r = 0.76 (r = 0.80 when correcting for sample variability). This higher correlation is expected as both the WISC-V and the WAIS-IV saw the introduction of the sequencing condition. Application. The utility of the ACSS score as a PVT was examined by establishing cutoff scores that achieve 90% specificity categorizing MVP performance. Furthermore, positive and negative predictive power of the instrument was calculated. RDS (dependent). Reliable Digit Span (RDS) is calculated by adding the length of the longest item completed perfectly in the Forward condition to the length of the longest item completed perfectly in the Backward condition. This sum is then compared to cutoff scores to make determinations of performance validity. Although RDS is rarely used in pediatric populations, the necessary metrics to calculate the score are present on the WISC-V. Previous studies of the RDS metric in pediatric populations using the WISC-IV demonstrated cutoff scores of ≤ 5 digits held reliably (Kirkwood et al., 2011) and ≤ 3 digits held (Welsh et al., 2012). 47 Application. The utility of the RDS score as a PVT was examined by establishing cutoff scores that achieve 90% specificity categorizing MVP performance. Furthermore, positive and negative predictive power of the instrument were calculated. RDS-R Reliable Digit Span – Revised (RDS-R) is an expansion of the RDS metric originally introduced by Spencer et al. (2010). RDS-R is calculated by adding the length of the longest trial consecutively completed in the sequencing condition to the traditional RDS score. This measure has yet to be examined empirically in pediatric populations. As such, reliability and validity information are not yet available. Application. The utility of the RDS-R score as a PVT was examined by establishing cutoff scores that achieve 90% specificity categorizing MVP performance. Furthermore, positive and negative predictive power of the instrument were calculated. Memory Validity Profile (MVP) (Independent Variable). 
The Memory Validity Profile (MVP) is a performance validity test developed for use with individuals ages 5 to 21. The MVP is normed on both a standardization sample based on the United States census and clinical samples. The MVP comprises both a visual and verbal memory task, and each provides a cutoff score based on age. Additionally, an overall cutoff score is also provided based on age. 48 Reliability. Internal consistency is reported for each score across the standardization sample, clinical sample, and invalid performance sample in which the measure was normed. The MVP manual notes that the internal consistency of the standardization sample is low due to a large majority of participants completing the assessment with 100% accuracy. The resulting coefficient alphas in the standardization sample ranged from unacceptable to poor (Visual: α = .46, Verbal: α = .61, Total α = .64). In the clinical sample, internal consistency fell within the good range (Visual: α = .85, Verbal: α = .84, Total α = .89). In the invalid performance sample, internal consistency ranged from acceptable to good (Visual: α = .79, Verbal α = .78, Total: α = .88). Test-retest reliability for the MVP is presented using Spearman rho correlations as the distribution of scores is not normal. Test-retest reliability for the Visual condition is reported as r = .51 with a standard error of measurement of .3. Test-retest reliability for the verbal condition is reported as r = .36 with a standard error of measurement of .8. The test-retest reliability for total score is r = .41 with a standard error of measurement of .9. Validity. The MVP is a measure of performance validity that passes as a measure of visual and verbal memory. Despite having face validity as a memory measure, correlations with actual measures of memory should be low to indicate that the task does not measure memory. During standardization, the MVP was co-normed with the Child and Adolescent Memory Profile (ChAMP; Sherman & Brooks, 2015a). The correlation between the MVP total score and ChAMP total score was very low at r = 0.17. Additionally, the measure was compared to performance on the WISC-IV to ensure that the test was not a measure of intelligence. The correlation between MVP performance 49 and WISC-IV FSIQ score was low at r = 0.26. Low to moderate correlations were also found with executive function measures and achievement measures. In addition to discriminant validity, the MVP must demonstrate concurrent validity with an established PVT. The MVP was compared to performance on the Test of Memory Malingering (TOMM; Tombaugh, 1996) during development. MVP total score correlated highly with both TOMM Trial 1 (r = 0.83) and TOMM Trial 2 (r = 0.81) Application. Performance on the MVP can be broken into pass and fail distinctions based on raw score and age of the participant. However, in previous studies that have used the MVP as a criterion measure, problems have arisen using the cutoff scores defined by the MVP manual. Wilson and Lesica (2021) examined the failure rate of the MVP in a mixed clinical sample from the same rehabilitation hospital used in this study. Of a sample of 122 children, only two children failed to pass the MVP. As a result, an experimental cutoff score of 30 was used meaning that any score under 31 was considered a failure. 
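Stated as a rule, the resulting group assignment can be sketched in Python as follows; the participant records are hypothetical, and the cutoff is the experimental value described above.

# Minimal sketch of group assignment under the experimental cutoff of 30:
# an MVP Total score of 30 or lower (i.e., under 31) is treated as a failure.
MVP_EXPERIMENTAL_CUTOFF = 30

def assign_group(mvp_total):
    return "invalid performance" if mvp_total <= MVP_EXPERIMENTAL_CUTOFF else "valid performance"

sample = [{"id": 1, "mvp_total": 32}, {"id": 2, "mvp_total": 28}, {"id": 3, "mvp_total": 31}]
print({record["id"]: assign_group(record["mvp_total"]) for record in sample})
# {1: 'valid performance', 2: 'invalid performance', 3: 'valid performance'}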
Wilson and Lesica (2021) also found that at this experimental cutoff, age was a significant factor in determining who passed the MVP with children ages 6-10 failing at much higher rates than children above the age of 10. Group designation in the current study was set using the experimental cutoff established by Wilson and Lesica (2021). Additionally, a second, exploratory analysis was conducted using a more lenient cutoff score of 29 for children ages 6-10. Hypotheses and analyses The proposed study will assess the utility of three Digit Span indices from the WISC-V in the assessment of performance validity during neuropsychological testing. These indices will be analyzed separately, and the research hypotheses presented are separated accordingly. 50 Question 1 Can the WISC-V Digit Span ACSS reliably detect suboptimal effort in a clinical pediatric sample? Hypothesis 1a. Cutoff scores for the ACSS index that reliably distinguish groups with 90% specificity, 40% sensitivity, and an area under the curve metric of at least 0.70 will be established for the sample Analysis. Sensitivity, specificity, and AUC will be calculated at each score observed in the sample to determine which score will be used as a cutoff. The cutoff score will be the lowest score that yields at least 90% specificity while maintaining optimal sensitivity. Hypothesis 1b. ACSS will reliably distinguish participants who failed the MVP from participants who passed the MVP. Analysis. Logistic regression will be conducted to determine whether failure of the ACSS metric based on the obtained cutoff score meaningfully predicts failure of the MVP. Hypothesis 1c. Positive predictive power and negative predictive power across estimated base rates of suboptimal effort in children will provide evidence for the utility of ACSS as a viable PVT in pediatric populations. Analysis. Positive predictive power and negative predictive power for the ACSS metric will be calculated across estimates of base rates in pediatric populations. Predictive power will be calculated using the 15.57% failure base rate obtained by Wilson and Lesica (2021) Rationale. The Digit Span subtest provides the ACSS with typical scoring. This score is norm referenced and factors in performance of other children within a child’s age group. As such, this score provides potentially the soundest performance validity information 51 across age ranges. The ACSS has been demonstrated to be a useful PVT in adult populations (Iverson & Franzen, 1994), and researchers have demonstrated promise using older versions of the WISC (Kirkwood et al., 2011; Welsh et al., 2012). If the ACSS of the WISC-V Digit Span task provides meaningful performance validity information, clinicians are saved time and resources as no additional metric needs to be calculated. Question 2 Can the WISC-V RDS index reliably detect suboptimal effort in a clinical pediatric sample? Hypothesis 2a. Cutoff scores for the RDS metric can be established that achieve 90% specificity, 40% sensitivity, and an AUC metric of 0.70. Analysis. Sensitivity, specificity, and AUC will be calculated at each observed score in the sample to determine if an acceptable cutoff score exists. The cutoff score will be the lowest score that yields at least 90% specificity while maintaining optimal sensitivity. Hypothesis 2b. RDS score will reliably distinguish participants who failed the MVP from participants who passed the MVP. Analysis. 
Logistic regression will be conducted to determine whether failure of the RDS metric based on the obtained cutoff score meaningfully predicts failure of the MVP. Hypothesis 2c: Positive predictive power and negative predictive power across estimated base rates of suboptimal effort in children will provide evidence for the utility of the RDS score as a PVT in pediatric samples. Analysis. Positive predictive power and negative predictive power for the RDS metric will be calculated across estimates of base rates in pediatric populations. Predictive power will be calculated using the 15.57% failure base rate obtained by Wilson and Lesica (2021). 52 Rationale. RDS has been used extensively in adult populations since the development of the measure by (Greiffenstein et al., 1994). Despite this extensive study, the measures application in pediatric populations is limited to a few studies (Kirkwood et al., 2011; Welsh et al., 2012). Furthermore, to date, no published studies have examined the use of RDS on the most recent version of the WISC with a simulation or known groups design. The current study aims to provide a known groups design to examine the clinical utility of the RDS measure as an embedded PVT. Question 3 Can the WISC-V RDS-R index reliably detect suboptimal effort in a clinical pediatric sample? Hypotheses 3a. A cutoff score for the RDS-R metric can be calculated that achieves at least 90% specificity, 40% sensitivity, and an AUC metric of 0.70. Analysis. Sensitivity, specificity, and AUC were calculated at each RDS-R score observed in the clinical sample. The cutoff was set to the lowest score that yielded at least 90% specificity while maintaining optimal sensitivity. Hypothesis 3b. RDS-R score will reliably distinguish participants who failed the MVP from participants who passed the MVP. Analysis. Logistic regression was conducted to determine whether failure of the RDS-R metric based on the obtained cutoff score meaningfully predicted MVP failure. Hypothesis 3c. Positive predictive power and negative predictive power across estimated base rates of suboptimal effort in children will provide evidence for the utility of the RDS score as a PVT in pediatric samples. 53 Analysis. Positive predictive power and negative predictive power for the RDS-R metric will be calculated across estimates of base rates in pediatric populations. The base rate estimates calculated will range from 5-40% in consistency with Kirkwood et al. (2011). Rationale. The most recent revisions of the Wechsler adult and child intelligence scales saw the addition of the Sequencing condition to the Digit Span subtest. This provided an opportunity to expand the RDS measure by adding the length of the longest trials consecutively completed to the traditional RDS score to create the RDS-R. 54 Memory Validity Profile (MVP) Performance by Group RESULTS Table 2 presents the average Memory Validity Profile (MVP) scores of both the valid and invalid performance groups. The invalid performance group underperformed on the MVP compared to the valid performance group on the Visual score (p = .006, d = 1.33), Verbal score (p < .001, d = 3.09), and Total score (p < .001, d = 3.03). These differences remained statistically significant after Bonferroni correction for multiple comparisons. Table 2. 
MVP Scores Across Groups

Variable     Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)   t      p        d
MVP Visual   15.80 (0.75)                15.94 (0.24)                  15.03 (1.69)                   2.92   0.006    1.33
MVP Verbal   15.43 (1.65)                15.94 (0.24)                  12.50 (2.87)                   6.55   <0.001   3.09
MVP Total    31.23 (2.14)                31.87 (0.33)                  27.47 (3.76)                   6.42   <0.001   3.03

Digit Span Performances by Group. Table 3 presents the Digit Span index scores of interest obtained in this sample. The invalid performance group underperformed compared to the valid performance group on the Digit Span Age Corrected Scaled Score (ACSS; p < .001, d = 1.21), Reliable Digit Span (RDS; p < .001, d = 1.50), and Reliable Digit Span Revised (RDS-R; p < .001, d = 1.67). Again, these differences remained statistically significant after Bonferroni correction for multiple comparisons.

Table 3. Digit Span Scores Across Groups

Variable   Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)   t      p        d
ACSS       6.80 (3.27)                 7.33 (3.09)                   3.70 (2.49)                    7.10   <0.001   1.21
RDS        6.66 (2.05)                 7.06 (1.85)                   4.33 (1.63)                    8.30   <0.001   1.50
RDS-R      9.84 (3.37)                 10.56 (2.94)                  5.70 (2.71)                    8.97   <0.001   1.67

Intellectual Index Scores by Performance Group. Furthermore, significant intellectual ability score differences were observed on the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V). The invalid performance group scored significantly lower than the valid performance group on the Verbal Comprehension Index (VCI; p < .001, d = 1.08), Visual Spatial Index (VSI; p < .001, d = .96), Fluid Reasoning Index (FRI; p < .001, d = .87), Working Memory Index (WMI; p < .001, d = 1.18), and Processing Speed Index (PSI; p = .002, d = .72). Furthermore, the invalid performance group scored significantly lower on both intellectual composites of interest in this study: the General Ability Index (GAI; p < .001, d = 1.03) and Full-Scale Intelligence Quotient (FSIQ; p < .001, d = 1.13). Average scores for these indices are provided in Table 4.

Table 4. WISC-V Composite Scores

Variable   Complete Sample (n = 204)   Valid Performance (n = 174)   Invalid Performance (n = 30)   t      p        d
FSIQ       82.27 (16.16)               84.76 (15.36)                 67.83 (12.91)                  6.44   <0.001   1.13
GAI        85.20 (15.66)               87.43 (15.14)                 72.27 (12.13)                  6.08   <0.001   1.03
VCI        87.24 (15.79)               89.58 (15.19)                 73.67 (12.04)                  6.41   <0.001   1.08
VSI        89.43 (15.72)               91.54 (14.92)                 77.20 (14.83)                  4.89   <0.001   0.96
FRI        86.27 (16.37)               88.29 (15.73)                 74.60 (15.28)                  4.51   <0.001   0.87
WMI        84.53 (16.56)               87.18 (15.77)                 69.17 (12.23)                  7.11   <0.001   1.18
PSI        84.25 (16.46)               85.94 (15.57)                 74.43 (18.23)                  3.26   0.002    0.72

The Influence of Age on IQ Scores. Following the observation of significant differences in age and intellectual ability, correlation coefficients were calculated to better understand this relationship. The relationship between age and each of the WISC-V Composite Scores was negligible. These values are provided in Table 5. Figure 1 displays the relationship between age and FSIQ.

Table 5. Correlations Between Age and WISC-V Scores

       FSIQ    GAI     VCI     VSI     FRI    WMI    PSI
Age    -0.03   -0.06   -0.04   -0.12   0.02   0.13   -0.07

Figure 1. Age vs. FSIQ

Question 1: Digit Span Age Corrected Scaled Scores Prediction of Suboptimal Effort

Each Digit Span score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 6.

Hypothesis 1a

A cutoff score for the ACSS measure was established that met and exceeded the sensitivity, specificity, and area under the curve (AUC) thresholds outlined in this study.
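The selection rule behind Table 6 (and Tables 8 and 10 below) follows the analysis plan: sensitivity and specificity are computed at every score observed in the sample, and the chosen cutoff is the lowest score that holds specificity at or above 90% while preserving the best available sensitivity. A minimal Python sketch of that rule, using hypothetical scores and criterion labels rather than study data, is shown below.

# Sketch of the cutoff search: flag scores at or below each candidate cutoff,
# compute sensitivity and specificity against the criterion PVT, and keep the
# lowest cutoff that achieves >= 90% specificity with the best sensitivity.
def sweep_cutoffs(scores, failed_criterion):
    results = []
    for cutoff in sorted(set(scores)):
        flagged = [score <= cutoff for score in scores]
        tp = sum(f and m for f, m in zip(flagged, failed_criterion))
        fp = sum(f and not m for f, m in zip(flagged, failed_criterion))
        fn = sum(not f and m for f, m in zip(flagged, failed_criterion))
        tn = sum(not f and not m for f, m in zip(flagged, failed_criterion))
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        results.append((cutoff, sensitivity, specificity))
    qualifying = [r for r in results if r[2] >= 0.90]
    # max() returns the first maximum, so ties in sensitivity resolve to the lowest cutoff
    return max(qualifying, default=None, key=lambda r: r[1])

scores = [3, 5, 7, 9, 10, 12, 2, 8, 11, 6]                  # hypothetical index scores
failed = [True, True, False, False, False, False, True, False, False, False]
print(sweep_cutoffs(scores, failed))                        # (5, 1.0, 1.0) for this toy data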
A cutoff score of ≤3 yielded a specificity metric of 91%, a sensitivity metric of 60%, and an AUC of 0.75 (0.66 – 0.84). Therefore, a cutoff score of ≤3 results in an embedded PVT with strong psychometric properties. 58 Table 6. Various Digit Span ACSS Cutoffs ACSS Cutoff Sensitivity Specificity AUC (95% CI) LR ≤1 ≤2 ≤3 ≤4 ≤5 ≤6 ≤7 ≤8 ≤9 ≤10 ≤11 ≤12 ≤13 ≤14 ≤15 ≤16 ≤17 ≤18 0.27 0.37 0.60 0.67 0.70 0.83 0.90 0.97 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.97 0.62 (0.53 - 0.70) 9.00 0.95 0.66 (0.57 - 0.75) 7.40 0.91 0.75 (0.66 - 0.85) 6.67 0.83 0.75 (0.66 - 0.84) 3.94 0.76 0.73 (0.64 - 0.83) 2.92 0.57 0.70 (0.62 - 0.78) 1.93 0.46 0.68 (0.61 - 0.75) 1.67 0.30 0.63 (0.59 - 0.68) 1.39 0.22 0.61 (0.58 - 0.64) 1.28 0.14 0.57 (0.55 - 0.60) 1.16 0.10 0.55 (0.53 - 0.57) 1.11 0.05 0.53 (0.51 - 0.54) 1.05 0.04 0.52 (0.51 - 0.53) 1.04 0.01 0.51 (0.50 - 0.51) 1.01 0.01 0.51 (0.50 - 0.51) 1.01 0.01 0.51 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.00 0.50 (0.50 - 0.50) 1.00 *LR = Likelihood ratio 59 Hypothesis 1b Logistic regression was used to better understand the relationship between passing the Digit Span PVT with a cutoff score of ≤3 and passing the MVP with a cutoff score of ≤30. Passing the Digit Span PVT significantly predicted whether participants passed the MVP. The participants who failed the Digit Span PVT were 14.8 times more likely to fail the MVP compared to participants who exceeded the cutoff score of 3 (p < .001). These results can be found in Table 7. Table 7. Predicting MVP Failure Based on ACSS Failure Characteristic OR1 95% CI1 p-value (Intercept) 0.08 0.04, 0.13 <0.001 ACSS Group 14.8 6.18, 37.3 <0.001 1 OR = Odds Ratio, CI = Confidence Interval Hypothesis 1c Positive and negative predictive power were calculated at a base rate of 15.57% according to the results of Wilson and Lesica (2021). Positive predictive power at this cutoff was 55.14% and negative predictive power was 92.50% indicating that an ACSS cutoff of ≤3 would predict valid performance with 92.50% accuracy and invalid performance with 55.14% accuracy in similar mixed clinical samples. 60 Question 2: Reliable Digit Span Prediction of Suboptimal Effort Each Reliable Digit Span (RDS) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 8. Hypothesis 2a A cutoff score for the RDS metric was established that exceeded the outlined sensitivity, specificity, and AUC thresholds. A cutoff score of ≤4 resulted in specificity of 94%, sensitivity of 63%, and an AUC of 0.79 (0.70 – 0.88). These results indicate that an RDS cutoff of ≤4 resulted in an embedded PVT with strong psychometric properties. Table 8. Various RDS Cutoffs RDS Cutoffs Sensitivity Specificity AUC (95% CI) LR ≤2 ≤3 ≤4 ≤5 ≤6 ≤7 ≤8 ≤9 ≤10 ≤11 ≤12 ≤13 0.10 0.37 0.63 0.80 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.54 (0.49 - 0.60) 10.00 0.98 0.67 (0.58 - 0.76) 18.50 0.94 0.79 (0.70 - 0.88) 10.50 0.82 0.77 (0.69 - 0.86) 4.44 0.63 0.73 (0.66 - 0.73) 2.24 0.40 0.70 (0.66 - 0.73) 1.67 0.16 0.58 (0.55 - 0.61) 1.19 0.06 0.53 (0.51 - 0.55) 1.06 0.03 0.52 (0.50 - 0.53) 1.03 0.03 0.51 (0.50 - 0.56) 1.03 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 61 Table 8. 
(cont’d) RDS Cutoffs Sensitivity Specificity AUC (95% CI) LR ≤14 ≤15 1.00 1.00 0.01 0.50 (0.50 - 0.51) 1.01 0.00 0.50 (0.50 - 0.50) 1.00 Hypothesis 2b Logistic regression was used to better understand the relationship between passing the RDS PVT with a cutoff score of ≤4 and passing the MVP with a cutoff score of ≤30. Passing the RDS metric significantly predicted whether participants passed the MVP. The participants who failed the Digit Span PVT were 28.3 times more likely to fail the MVP compared to participants who exceeded the cutoff score of 4 (p < .001). These results can be found in Table 9. Table 9. Predicting MVP Failure Based on RDS Failure Characteristic OR1 95% CI1 p-value (Intercept) 0.07 0.03, 0.12 <0.001 RDSGroup 28.3 11.0, 79.0 <0.001 1 OR = Odds Ratio, CI = Confidence Interval Hypothesis 2c Positive predictive power at this cutoff was 65.94% and negative predictive power was 93.23% indicating that an RDS cutoff of ≤4 would predict valid performance with 93.23% accuracy and invalid performance with 65.94% accuracy in similar mixed clinical samples. 62 Question 3: Reliable Digit Span Revised Prediction of Suboptimal Effort Each Reliable Digit Span Revised (RDS-R) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 10. Hypothesis 3a A cutoff score was established for the RDS-R metric that exceeded the outlined sensitivity, specificity, and AUC thresholds. A cutoff score of ≤6 resulted in a specificity of 94%, a sensitivity of 63%, and an AUC metric of 0.79 (0.70 – 0.86). These results indicate that a cutoff score of ≤6 on the RDS-R metric results in an embedded PVT with strong psychometric properties. Table 10. Various RDS-R Cutpoints RDS Cutoff Sensitivity Specificity AUC (95% CI) LR ≤2 ≤3 ≤4 ≤5 ≤6 ≤7 ≤8 ≤9 ≤10 ≤11 0.07 0.27 0.47 0.53 0.63 0.67 0.83 0.87 0.97 1.00 0.99 0.53 (0.48 - 0.57) 7.00 0.98 0.62 (0.54 - 0.71) 13.50 0.98 0.72 (0.63 - 0.82) 23.50 0.96 0.75 (0.65 - 0.84) 13.25 0.94 0.79 (0.70 - 0.86) 10.50 0.88 0.77 (0.68 - 0.86) 5.58 0.77 0.80 (0.73 - 0.88) 3.61 0.63 0.75 (0.68 - 0.82) 2.35 0.51 0.74 (0.69 - 0.79) 1.98 0.36 0.68 (0.64 - 0.71) 1.56 63 Table 10. (cont’d) RDS Cutoff Sensitivity Specificity AUC (95% CI) LR ≤12 ≤13 ≤14 ≤15 ≤16 ≤17 ≤18 ≤19 ≤20 ≤21 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.25 0.63 (0.59 - 0.66) 1.33 0.13 0.56 (0.54 - 0.59) 1.15 0.08 0.54 (0.52 - 0.56) 1.09 0.05 0.52 (0.51 - 0.53) 1.05 0.03 0.50 (0.50 - 0.51) 1.03 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.01 0.50 (0.50 - 0.51) 1.01 0.00 0.50 (0.50 - 0.50) 1.00 Hypothesis 3b Logistic regression was used to better understand the relationship between passing the RDS-R PVT with a cutoff score of ≤6 and passing the MVP with a cutoff score of ≤30. Passing the RDS-R metric significantly predicted whether participants passed the MVP. The participants who failed the Digit Span PVT were 25.6 times more likely to fail the MVP compared to participants who exceeded the cutoff score of 6 (p < .001). These results can be found in Table 11. 64 Table 11. Predicting MVP Failure Based on RDS-R Failure Characteristic OR1 95% CI1 p-value (Intercept) 0.07 0.03, 0.12 <0.001 RDS-RGroup 25.6 10.1, 69.9 <0.001 1 OR = Odds Ratio, CI = Confidence Interval Hypothesis 3c Positive and negative predictive power were calculated at 15.57% according to the results of Wilson and Lesica (2021). 
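Predictive power at an assumed base rate follows the standard adjustment of sensitivity and specificity. The Python sketch below applies that adjustment using the rounded values reported above for the revised index and the 15.57% base rate; small discrepancies from the values reported next reflect rounding of sensitivity and specificity to two decimals.

# Base-rate adjustment for predictive power, using rounded sensitivity and
# specificity; results therefore differ slightly from the reported values.
def predictive_power(sensitivity, specificity, base_rate):
    ppv = (sensitivity * base_rate) / (
        sensitivity * base_rate + (1 - specificity) * (1 - base_rate))
    npv = (specificity * (1 - base_rate)) / (
        (1 - sensitivity) * base_rate + specificity * (1 - base_rate))
    return ppv, npv

ppv, npv = predictive_power(sensitivity=0.63, specificity=0.94, base_rate=0.1557)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # roughly 66% and 93%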
Positive predictive power at this cutoff was 64.95% and negative predictive power was 93.51% indicating that an RDS-R cutoff of ≤6 would predict valid performance with 93.51% accuracy and invalid performance with 64.95% accuracy in similar mixed clinical samples. Exploratory Analysis: Alternative MVP Cutoff for Performance Groups The difference between the valid performance and invalid performance groups based on age raised questions about the flat cutoff score used on the MVP in the establishment of groups. Much like the results of Wilson and Lesica (2021), the sample in this study saw high rates of MVP failure in participants between the ages of 6 and 10. In their study, Wilson and Lesica observed a failure rate of approximately 21% in children ages 6 to 10 when using a cutoff score of 30. Similarly, children ages 6-10 in this study failed at a rate of 22% when using a cutoff score of 30. As a result, group selection may be more heavily influenced by age than performance validity at the younger end of the sample. 65 To address this concern, a new split cutoff was adopted for exploratory analysis. For this analysis, group selection was made using a cutoff score of 30 for children ages 11 and older, and a cutoff score of 29 was used for children under the age of 11. This reduced the failure rate of the 6–10-year-old portion of the sample to 11%. Table 12 shows the MVP passing rate across age groups using the experimental cutoff of 30, and Table 13 shows the MVP passing rate across age groups when a split cutoff is adopted. For the purposes of this exploratory analysis, a complication arises in the calculation of predictive power. Wilson and Lesica (2021) did not utilize this split cutoff, and as a result, base rates for failure at this level were not available for this sample. As a result, the base rate percentage obtained by Wilson and Lesica (2021) are no longer applicable when redefining group membership, and the exploratory analysis will omit the calculation of predictive power. Table 12. Passing Rates by Age Group When Using A Cutoff Score Of 30 Age Group N Passing Rate 6-10 109 77.98% 11-16 95 93.68% Table 13. Passing Rates by Age Group When Exploratory Split Cutoff Scores Are Used Age Group N Passing Rate 6-10 109 88.99% 11-16 95 93.68% 66 Demographics Descriptive statistics for demographic variables are presented in Table 14. Despite adopting a more lenient cutoff score for the younger children, the invalid performance group remained significantly younger than the valid performance group (p < 0.05, d = 0.49). Differences in parent education, sex, proportion of racial minorities, and reason for referral were not observed (p > 0.05). Table 14. Demographic Variables Across Groups Using Split Cutoffs Variable Age Complete Valid Invalid Sample Performance Performance (n = 204) (n = 186) (n = 18) 130.71 (35.06) 132.21 (35.11) 115.17 (31.33) Parent Education Level 13.10 (2.59) 13.11 (2.56) 13.00 (2.93) Sex (n, %) Male Female Race (n, %) White Other 127 (62.25) 119 (63.98) 8 (44.44) 77 (37.75) 67 (36.02) 10 (55.56) 128 (62.75) 118 (63.44) 10 (55.56) 76 (37.25) 68 (36.56) 8 (44.44) 67 Table 14. 
Table 15 presents the average Memory Validity Profile (MVP) scores of both the valid and invalid performance groups. The invalid performance group underperformed on the MVP compared to the valid performance group on the Visual score (p = .04, d = 1.55), Verbal score (p < .001, d = 5.17), and Total score (p < .001, d = 4.61). Differences in the Verbal and Total scores remained after Bonferroni correction for multiple analyses.

Table 15. MVP Scores Across Groups Using Split Cutoffs

Variable     Complete Sample (n = 204)   Valid Performance (n = 186)   Invalid Performance (n = 18)   t      p        d
MVP Visual   15.80 (0.75)                15.90 (0.35)                  14.83 (2.07)                   2.18   0.04     1.55
MVP Verbal   15.43 (1.65)                15.85 (0.45)                  11.06 (2.84)                   7.16   <0.001   5.17
MVP Total    31.23 (2.14)                31.75 (0.56)                  25.78 (4.07)                   6.23   <0.001   4.61

Table 16 presents the Digit Span index scores of interest obtained in this sample. The invalid performance group underperformed compared to the valid performance group on the Digit Span Age Corrected Scaled Score (ACSS; p < .001, d = 1.03), Reliable Digit Span (RDS; p < .001, d = 1.40), and Reliable Digit Span Revised (RDS-R; p < .001, d = 1.47). Again, these differences remained statistically significant after Bonferroni correction for multiple comparisons.

Table 16. Digit Span Scores Across Groups Using Split Cutoffs

Variable   Complete Sample (n = 204)   Valid Performance (n = 186)   Invalid Performance (n = 18)   t      p        d
ACSS       6.80 (3.27)                 7.09 (3.18)                   3.83 (2.75)                    4.72   <0.001   1.03
RDS        6.66 (2.05)                 6.89 (1.93)                   4.22 (1.73)                    6.17   <0.001   1.40
RDS-R      9.84 (3.37)                 10.25 (3.14)                  5.57 (2.89)                    6.37   <0.001   1.47

Furthermore, significant intellectual ability score differences were observed on the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V). The invalid performance group scored significantly lower than the valid performance group on the Verbal Comprehension Index (VCI; p < .007, d = 1.13), Visual Spatial Index (VSI; p < .007, d = 0.86), Fluid Reasoning Index (FRI; p < .007, d = 0.82), and Working Memory Index (WMI; p < .007, d = 1.05). Significant differences were also observed in both intellectual composites of interest in this study: the General Ability Index (GAI; p < .007, d = 1.00) and Full-Scale Intelligence Quotient (FSIQ; p < .007, d = 1.06). These differences remained significant after Bonferroni correction. Average scores for these indices are provided in Table 17.
Table 17. WISC-V Composite Scores Using Split Cutoffs

Variable   Complete Sample (n = 204)   Valid Performance (n = 186)   Invalid Performance (n = 18)   t      p        d
FSIQ       82.27 (16.16)               83.73 (15.64)                 67.33 (14.05)                  4.67   <0.001   1.06
GAI        85.20 (15.66)               83.72 (15.64)                 71.39 (12.76)                  4.72   <0.001   1.00
VCI        87.24 (15.79)               88.74 (15.24)                 71.78 (13.03)                  5.19   <0.001   1.13
VSI        89.43 (15.72)               90.60 (15.34)                 77.39 (14.85)                  3.59   0.002    0.86
FRI        86.27 (16.37)               87.43 (15.91)                 74.33 (16.75)                  3.18   0.005    0.82
WMI        84.53 (16.56)               86.00 (16.10)                 69.33 (13.66)                  4.86   <0.001   1.05
PSI        84.25 (16.46)               85.12 (15.85)                 75.22 (20.16)                  2.02   0.06     0.61

Alternative Cutoff Question 1: Digit Span Age Corrected Scaled Scores

Each Digit Span score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 18.

Alternative Cutoff Hypothesis 1a.

Using the more lenient split cutoff on the MVP for group assignment did not result in a cutoff score that met the specificity, sensitivity, and AUC thresholds outlined in this study. Specificity of 93% was achieved at a cutoff score of ≤2, but sensitivity fell slightly below the 40% threshold at 39%. The AUC for the Digit Span scaled score with a cutoff score of ≤2 was 0.66 (0.54 - 0.78). These results indicate that the Digit Span ACSS with a cutoff of ≤2 produced an embedded PVT metric with inadequate psychometric properties.

Table 18. Various Digit Span ACSS Cutoffs using Split Cutoffs

ACSS Cutoff   Sensitivity   Specificity   AUC (95% CI)         LR
≤1            0.33          0.96          0.65 (0.65 - 0.76)   8.25
≤2            0.39          0.93          0.66 (0.54 - 0.78)   5.57
≤3            0.56          0.87          0.71 (0.59 - 0.83)   4.31
≤4            0.61          0.79          0.70 (0.58 - 0.82)   2.90
≤5            0.67          0.73          0.70 (0.58 - 0.82)   2.48
≤6            0.78          0.54          0.66 (0.55 - 0.76)   1.70
≤7            0.89          0.44          0.66 (0.58 - 0.74)   1.59
≤8            0.94          0.28          0.61 (0.55 - 0.68)   1.31
≤9            1.00          0.20          0.60 (0.57 - 0.63)   1.25
≤10           1.00          0.13          0.57 (0.54 - 0.59)   1.15
≤11           1.00          0.09          0.55 (0.52 - 0.57)   1.10
≤12           1.00          0.05          0.52 (0.51 - 0.54)   1.05
≤13           1.00          0.04          0.52 (0.51 - 0.53)   1.04
≤14           1.00          0.01          0.51 (0.50 - 0.51)   1.01
≤15           1.00          0.01          0.51 (0.50 - 0.51)   1.01
≤16           1.00          0.01          0.51 (0.50 - 0.51)   1.01
≤17           1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤18           1.00          0.00          0.50 (0.50 - 0.50)   1.00

Alternative Cutoff Hypothesis 1b.

A cutoff score with adequate psychometric properties was not obtained from the ACSS metric when utilizing a split cutoff score with the MVP. However, a logistic regression was run using a cutoff score of ≤2, as this was the highest score to reach 90% specificity. Whether an individual passed the ACSS significantly predicted whether they passed the MVP. Participants who failed the Digit Span PVT were 8.44 times more likely to fail the MVP than participants who exceeded the cutoff score of 2 (p < .001). These results can be found in Table 19. While still significant, the use of a split cutoff score diminished the predictive power of the ACSS PVT.

Table 19. Predicting MVP Failure (Split Cutoff) Based on ACSS

Characteristic   OR¹    95% CI¹      p-value
(Intercept)      0.05   0.02, 0.09   <0.001
ACSSGroup        8.44   3.04, 24.2   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Alternative Cutoff Question 2: Reliable Digit Span

Each Reliable Digit Span (RDS) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 20.
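The cutpoint tables in this section (Tables 18 and 20, and Table 22 below) all follow the same construction: every observed score is treated as a candidate "fail at or below" cutoff and evaluated against the MVP-defined groups. A minimal sketch of that scan is shown below; the data and names are illustrative, the AUC line assumes scikit-learn, and the confidence-interval method used in the study is not reproduced here.

    # Minimal sketch of the cutoff scan behind the cutpoint tables: each observed
    # PVT score is treated as a candidate "<= cutoff" failure rule and scored
    # against MVP-defined groups. Data and names are hypothetical.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def scan_cutoffs(scores, invalid):
        """scores: PVT scores; invalid: 1 = failed the MVP criterion, 0 = passed."""
        scores, invalid = np.asarray(scores), np.asarray(invalid)
        rows = []
        for cutoff in np.unique(scores):
            flagged = scores <= cutoff
            sens = flagged[invalid == 1].mean()        # true positive rate
            spec = (~flagged)[invalid == 0].mean()     # true negative rate
            lr = sens / (1 - spec) if spec < 1 else float("inf")
            rows.append((cutoff, sens, spec, lr))
        # Overall discriminability; scores are negated because lower values
        # indicate invalid performance.
        auc = roc_auc_score(invalid, -scores)
        return rows, auc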
Alternative Cutoff Hypothesis 2a.

Using a more lenient cutoff score for group assignment in younger children still allowed an RDS cutoff score to be established that exceeded the outlined sensitivity, specificity, and AUC thresholds. An RDS cutoff score of ≤4 resulted in specificity of 91%, sensitivity of 67%, and an AUC of 0.79 (0.67 - 0.90). These results indicate that the same ≤4 cutoff score on the RDS index provides strong psychometric properties as an embedded PVT when a lower MVP cutoff score is used for group assignment.

Table 20. Various RDS Cutoffs using Split Cutoffs

RDS Cutoff   Sensitivity   Specificity   AUC (95% CI)         LR
≤2           0.17          0.99          0.58 (0.49 - 0.67)   17.00
≤3           0.39          0.96          0.67 (0.56 - 0.79)   9.75
≤4           0.67          0.91          0.79 (0.67 - 0.90)   7.44
≤5           0.72          0.78          0.75 (0.64 - 0.86)   3.27
≤6           0.83          0.60          0.72 (0.62 - 0.81)   2.07
≤7           1.00          0.37          0.69 (0.65 - 0.72)   1.59
≤8           1.00          0.15          0.58 (0.55 - 0.60)   1.18
≤9           1.00          0.06          0.53 (0.51 - 0.55)   1.06
≤10          1.00          0.03          0.52 (0.50 - 0.53)   1.03
≤11          1.00          0.03          0.51 (0.50 - 0.53)   1.03
≤12          1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤13          1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤14          1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤15          1.00          0.00          0.50 (0.50 - 0.50)   1.00

Alternative Cutoff Hypothesis 2b.

Logistic regression was used to better understand the relationship between passing the RDS PVT with a cutoff score of ≤4 and passing the MVP with the experimental split cutoff score. Whether an individual passed the RDS significantly predicted whether they passed the MVP. Participants who failed the Digit Span PVT were 19.9 times more likely to fail the MVP than participants who exceeded the cutoff score of 4 (p < .001). These results can be found in Table 21. While still significant, the use of a split cutoff score did reduce the predictive power of the RDS metric.

Table 21. Predicting MVP Failure (Split Cutoff) Based on RDS

Characteristic   OR¹    95% CI¹      p-value
(Intercept)      0.04   0.01, 0.07   <0.001
RDSGroup         19.9   6.86, 63.8   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Alternative Cutoff Question 3: Reliable Digit Span Revised

Each Reliable Digit Span Revised (RDS-R) score observed in the sample was analyzed to determine which score provided an optimal cutoff point. The results for each of these scores are provided in Table 22.

Alternative Cutoff Hypothesis 3a.

A cutoff score of ≤6 met the .90 specificity threshold and exceeded the .40 sensitivity threshold, with a specificity of 90% and a sensitivity of 67%. Additionally, the area under the curve was adequate (AUC = 0.78, 95% CI 0.67 - 0.90). This finding indicates that an RDS-R cutoff score of ≤6 remains a viable indicator of performance validity when more lenient MVP cutoff scores are used for children ages 6-10.
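Each AUC in these tables is reported with a 95% confidence interval. The text does not state how those intervals were computed; a case-resampling bootstrap is one common approach, and the sketch below illustrates that approach only, using hypothetical data and names.

    # Hypothetical sketch: a case-resampling bootstrap for a 95% CI around an AUC.
    # This is one common approach; the interval method actually used is not stated.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc_ci(scores, invalid, n_boot=2000, seed=0):
        scores, invalid = np.asarray(scores), np.asarray(invalid)
        rng = np.random.default_rng(seed)
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(scores), len(scores))   # resample cases
            if invalid[idx].min() == invalid[idx].max():
                continue                                      # need both groups present
            aucs.append(roc_auc_score(invalid[idx], -scores[idx]))
        return np.percentile(aucs, [2.5, 97.5])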
Table 22. Various RDS-R Cutpoints using Split Cutoffs

RDS-R Cutpoint   Sensitivity   Specificity   AUC (95% CI)         LR
≤2               0.11          0.99          0.55 (0.48 - 0.63)   11.00
≤3               0.28          0.97          0.62 (0.52 - 0.73)   9.33
≤4               0.50          0.96          0.73 (0.61 - 0.85)   12.50
≤5               0.50          0.92          0.71 (0.59 - 0.83)   6.25
≤6               0.67          0.90          0.78 (0.67 - 0.90)   6.70
≤7               0.67          0.84          0.76 (0.64 - 0.87)   4.19
≤8               0.83          0.73          0.78 (0.69 - 0.88)   3.07
≤9               0.83          0.59          0.71 (0.62 - 0.81)   2.02
≤10              0.94          0.48          0.71 (0.65 - 0.78)   1.81
≤11              1.00          0.33          0.67 (0.63 - 0.70)   1.49
≤12              1.00          0.24          0.62 (0.59 - 0.65)   1.32
≤13              1.00          0.12          0.56 (0.54 - 0.58)   1.14
≤14              1.00          0.08          0.54 (0.52 - 0.56)   1.09
≤15              1.00          0.04          0.52 (0.51 - 0.54)   1.04
≤16              1.00          0.03          0.52 (0.50 - 0.53)   1.03
≤17              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤18              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤19              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤20              1.00          0.01          0.50 (0.50 - 0.51)   1.01
≤21              1.00          0.00          0.50 (0.50 - 0.50)   1.00

Alternative Cutoff Hypothesis 3b.

Logistic regression was used to better understand the relationship between passing the RDS-R PVT with a cutoff score of ≤6 and passing the MVP using the experimental split cutoff score. Whether an individual passed the RDS-R significantly predicted whether they passed the MVP. Participants who failed the RDS-R PVT were 18.7 times more likely to fail the MVP than participants who exceeded the cutoff score of 6 (p < .001). These results can be found in Table 23. As with the RDS score, the experimental MVP cutoff reduced the predictive power of the RDS-R score.

Table 23. Predicting MVP Failure (Split Cutoff) Based on RDS-R

Characteristic   OR¹    95% CI¹      p-value
(Intercept)      0.04   0.01, 0.07   <0.001
RDS-RGroup       18.7   6.47, 59.5   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

DISCUSSION

This study sought to establish the utility of various Digit Span indices from the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V) as embedded performance validity indicators for psychological testing. The results indicate that the age-corrected scaled score (ACSS), Reliable Digit Span score (RDS), and Reliable Digit Span Revised score (RDS-R) all served as viable performance validity indicators when an experimental Memory Validity Profile (MVP) cutoff score of ≤30 was used in a mixed clinical sample. Each of the three metrics demonstrated a cutoff score that exceeded 90% specificity and met or exceeded 60% sensitivity. Additionally, each of these cutoff scores exceeded the 0.70 area under the curve (AUC) threshold established at the beginning of the study.

Similar to the concerns raised by Wilson and Lesica (2021), the experimental cutoff score of ≤30 resulted in a disproportionate number of younger children failing the MVP. As a result, this study also examined the utility of each Digit Span index when an age-based split cutoff score was used with the MVP. When group designation was based on this split cutoff, the psychometric properties of the ACSS score became inadequate. However, both the RDS and RDS-R validity indicators maintained strong psychometric properties at the same cutoff scores established in the initial analysis. These findings indicate that the RDS and RDS-R scores may have more clinical utility as embedded validity indicators than the ACSS score.

Logistic regressions indicated that failure of each of these indices greatly increased the likelihood of an individual failing the MVP. When a cutoff score of ≤30 was adopted for the MVP, participants who failed one of the Digit Span validity indicators were 14-28 times more likely to fail the MVP.
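These odds ratios come from simple binary logistic regressions of MVP group on PVT group (Tables 9, 11, 19, 21, and 23). As a rough illustration of that model, the sketch below assumes 0/1 pass/fail indicators and uses statsmodels as one possible tool; the study's actual software and variable names are not specified, and the data shown are hypothetical.

    # Illustrative sketch of the binary logistic regressions behind the odds
    # ratios reported in Tables 9, 11, 19, 21, and 23. Data are hypothetical.
    import numpy as np
    import statsmodels.api as sm

    def pvt_odds_ratio(mvp_failed, pvt_failed):
        """Both arguments are 0/1 arrays: 1 = failed, 0 = passed."""
        X = sm.add_constant(np.asarray(pvt_failed, dtype=float))
        fit = sm.Logit(np.asarray(mvp_failed, dtype=float), X).fit(disp=False)
        odds_ratios = np.exp(fit.params)        # intercept odds and PVT-group OR
        ci = np.exp(fit.conf_int())             # 95% CI on the odds-ratio scale
        return odds_ratios, ci, fit.pvalues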
The index with the strongest predictive ability was the RDS metric, with an odds ratio of 28.3. Adopting a split cutoff for the MVP based on age also resulted in each metric significantly predicting MVP failure. Although the ACSS score failed to reach adequate sensitivity, a cutoff score with adequate specificity was still associated with an 8.44-fold increase in the likelihood of failing the MVP. Both the RDS and RDS-R metrics saw a reduction in predictive power, but both remained significant predictors, with the RDS demonstrating an odds ratio of 19.9 and the RDS-R an odds ratio of 18.7. Each of the three Digit Span indices provided strong predictive power.

Due to the similarity of the current sample and the sample used by Wilson and Lesica (2021), positive and negative predictive power closely followed the sensitivity and specificity metrics obtained. In summary, these findings indicate that the obtained cutoffs would provide 92-94% accuracy in identifying valid performances and 55-66% accuracy in identifying invalid performances in a similar mixed clinical sample. Thus, using Digit Span as a measure of effort for clinically referred children may provide reliable information.

Limitations

Criterion Measure

Most known-groups design studies use stronger criteria for group selection. Best practice indicates that at least two validity indicators should be used for group assignment (Sherman et al., 2020; Slick et al., 1999). Individuals who fail two validity indicators are placed in the invalid performance group, and individuals who pass both indicators are assigned to the valid performance group. Individuals who pass one indicator and fail another are typically excluded from analysis, as the validity of their performance is less clear than that of other participants (Kirkwood et al., 2011). Unfortunately, the pre-existing nature of this sample did not allow an additional validity indicator to be added to the assessment battery given to each individual. The MVP does contain both a visual and a verbal condition, which could in principle be used as separate indicators. However, no experimental cutoffs have been established for the individual indices of the MVP, and the leniency of the manualized cutoffs presented additional problems, described below.

The MVP was the only standalone performance validity indicator administered to all participants in this sample. As a result, group assignment in this study relied entirely on the MVP Total Score, and with a single criterion measure, the strength of that measure must be examined. When the performance of the sample was analyzed, only two of the 204 participants included for analysis failed the MVP under its manualized cutoffs. This made group assignment based on the manualized cutoffs futile for experimental analysis. Instead, a more stringent, experimental cutoff score was used to assign participants to groups. This presented a problem similar to the one experienced by Wilson and Lesica (2021): a universal cutoff score resulted in a larger proportion of younger children failing the MVP. A split cutoff score was therefore adopted for the exploratory analysis, but group differences in age remained after modifying group assignment. The higher failure rates among younger children indicate that the MVP, despite its intentionally simple design, does not function uniformly across ages.
This may be due to differences in attention or in letter and number literacy, as the MVP stimuli include strings of numbers and letters. In addition to using multiple criterion measures, it is also common for known-groups studies to incorporate litigation status in group assignment (Slick et al., 1999). Litigation-related variables are common in PVT research because a large portion of this literature focuses on the traumatic brain injury population, where the potential for monetary gain is high. As with the limited number of criterion measures in this study, litigation status was not known when assigning participants to groups. This may further call into question the integrity of the valid and invalid performance groups.

Age Discrepancies Between Groups

The use of a flat cutoff score across ages for the MVP resulted in a significantly younger invalid performance group. Adopting a more lenient cutoff score for younger children mitigated this discrepancy, but it still remained. As a result, it is possible that age had undue influence on the interpretation of cutoff scores. To assess this influence, the Digit Span cutoff scores would need to be examined for variation across age groups, and it is possible that separate cutoffs would need to be established for different age groups. Unfortunately, the limited sample size of the current study prevented this analysis from being conducted.

Cognitive Ability Discrepancies Between Groups

Significant differences in cognitive ability scores were observed across groups. On average, the invalid performance group demonstrated lower Full-Scale Intelligence Quotient (FSIQ) scores, General Ability Index (GAI) scores, and scores on each of the other primary indices of the WISC-V. However, these discrepancies make sense given the nature of the groups. Individuals in the invalid performance group are thought to have provided inadequate effort during testing. As a result, their scores on assessment measures other than the MVP are likely to be biased downward by invalid performance, and the cognitive ability scores of the invalid performance group are potentially underestimates of that group's actual mean cognitive ability. This illustrates the difficulty of using controls in the development of PVTs when all obtained scores are called into question.

Research Implications

Embedded Digit Span PVTs Show Potential Pending Further Study

The most meaningful implications of this study are those pertaining to future research in pediatric performance validity testing. The results support previous studies demonstrating the potential clinical utility of the Digit Span subtest as a performance validity indicator in pediatric assessment. Previous studies have shown potential for the ACSS and RDS scores as embedded performance validity indicators when taken from WISC-IV administrations (Kirkwood et al., 2011; Welsh et al., 2012). This study is the first to examine the utility of these indices when taken from the updated WISC-V. Additionally, this study demonstrated potential use of the RDS-R scale now available due to the addition of the Sequencing condition to the Digit Span subtest. In order for these indices to have clinical utility, they must first be assessed in multiple samples using stronger criterion measures.
Populations of Interest

Previous studies of the RDS and ACSS metrics in pediatric populations have demonstrated that cutoff scores may differ significantly based on presenting conditions and referral questions (Kirkwood et al., 2011; Welsh et al., 2012). Children with epilepsy have previously demonstrated lower performance on these indices, resulting in lower cutoff scores than those established for children with traumatic brain injuries. The present study examined the utility of these metrics in a mixed clinical sample, providing general cutoff scores. However, larger studies with more statistical power are needed to examine whether cutoff scores differ based on clinical presentation. For example, Kirkwood et al. (2011) found higher cutoff scores on the WISC-IV in a traumatic brain injury population. It is unclear whether these differences resulted from changes in the instruments or from factors intrinsic to each population.

Caution Against Using RDS as a Criterion Measure

Known-groups studies aiming to establish or validate PVTs rely heavily on the criterion measures used to establish groups. As noted previously, the lack of pediatric PVTs makes selection of criterion measures difficult. Future studies should validate the RDS metric from the WISC-V before it serves as a criterion measure in the development of additional indices. Notably, a recent study utilized the WISC-IV cutoff scores established by Kirkwood et al. (2011) as cutoffs for the WISC-V (Butt et al., 2023). The study authors examined the clinical utility of the Figure Weights subtest of the WISC-V as an embedded PVT. Although the easy nature of the initial Figure Weights items theoretically makes the subtest a potential PVT, cutoffs for a new index cannot be anchored to previous cutoffs until those cutoffs have been validated on the newer instrument. The results of this study indicate that RDS cutoff scores need to be rigorously established before the index is used as a criterion measure. Known-groups studies hinge on the strength of the criterion measure used, and the RDS currently lacks the evidence needed to serve in that role. Thus, until strong cutoffs are established and replicated, subsequent PVTs developed using the RDS as a criterion measure will lack validity.

Age Differences on RDS and RDS-R Metrics

Both the RDS and RDS-R scores are unstandardized and are therefore more likely than the ACSS score to differ based on age. As a result, it is possible that different cutoff scores are needed across age groups. The current sample was too small to examine cutoff scores across age groups, and the criterion measure used for group assignment presented its own age complications. Further study across age groups is needed to determine whether stratified cutoff scores are necessary.

Utility of the MVP

The present study also raises questions about the utility of the MVP as both a measure of performance validity and a criterion measure in future PVT studies. Manualized cutoffs established by the MVP publishers yielded extremely low rates of failure, and a flat experimental cutoff across age groups led to more young children than older children failing the assessment. These difficulties are consistent with a previous study of the instrument (Wilson & Lesica, 2021). This study sought to examine whether a split cutoff improved the utility of the MVP as a criterion measure, but the limited sample size prevented strong analysis. Using a split cutoff score led to an invalid performance group with a less than optimal sample size (n = 18).
Additional study is needed to further test the MVP for both clinical and research purposes.

Clinical Implications

Embedded Indicators Requiring No Additional Administration Time

In terms of clinical utility, the largest advantage of the Digit Span indices of interest in this study is that they do not require additional administration time. Those who administer the WISC-V have access to these metrics following a typical administration. Computing the RDS and RDS-R requires only noting, for each condition, the longest string length for which both trials were repeated correctly: the forward and backward conditions for the RDS, with the sequencing condition added for the RDS-R (a brief sketch of this tally appears at the end of this section). The ACSS is readily available to anyone who scores the Digit Span subtest. As a result, these indices provide additional information regarding performance validity with little to no additional examiner time and effort. They also have the advantage of being embedded in one of the most commonly used pediatric measures of cognitive ability. For clinicians who already use other measures, such as clinical neuropsychologists, the result is an additional set of performance validity indicators; for professionals who may not possess or know of other tools, such as school psychologists, it may be a first one. The positive findings of this study suggest that these indices may become valuable to such professionals after further study.

Mixed Clinical Samples

For now, the findings of this study are most promising for populations composed of similar mixed clinical samples. The current study sample was drawn from a rehabilitation hospital and included diverse presenting conditions. As a result, the findings may not generalize to populations with different compositions. However, for professionals working with similar populations, the current findings provide initial evidence that these indices may have clinical utility.

Digit Span Indices Not Ready for Clinical Use as Validity Indicators

As noted earlier, how Digit Span indices are currently being used in pediatric populations is unclear. Previous findings suggest that as many as 65% of pediatric neuropsychologists use Digit Span indices when making determinations about performance validity (Brooks et al., 2016). If this is true, the current study raises several concerns. First, the cutoff scores found in this study are much lower than those used with adult populations (Greiffenstein et al., 1994). They are also much lower than those previously established using the WISC-IV (Kirkwood et al., 2011). Because this study is the first to calculate cutoff scores for each of these indices from the WISC-V in a pediatric population, it is currently unclear what cutoffs clinicians are applying when they use Digit Span indices in performance validity determinations. Use of traditional adult cutoff scores would result in extremely high rates of PVT failure, and use of WISC-IV cutoff scores for the ACSS and RDS could likewise lead to abnormally high rates of failure, as the WISC-V cutoffs established in this study were lower. As a result, none of the PVT metrics examined in this study yet has sufficient validation for routine clinical use. Clinicians should not rely on embedded Digit Span PVTs until they can be further studied and validated in various populations.
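As referenced under the heading on administration time, the RDS and RDS-R tallies are simple enough to sketch in a few lines. The example below is only an illustration of that tally as described above; the data structure and all names are hypothetical, and examiners should rely on the published scoring rules rather than on this sketch.

    # Hypothetical sketch of the RDS / RDS-R tally described earlier. For each
    # condition, 'trials' maps string length -> (trial 1 correct, trial 2 correct).
    def longest_reliable_span(trials):
        """Longest string length with both trials repeated correctly (0 if none)."""
        reliable = [length for length, (t1, t2) in trials.items() if t1 and t2]
        return max(reliable, default=0)

    def reliable_digit_span(forward, backward, sequencing=None):
        rds = longest_reliable_span(forward) + longest_reliable_span(backward)
        if sequencing is None:
            return rds                                    # classic RDS
        return rds + longest_reliable_span(sequencing)    # RDS-R adds Sequencing

    # Example: forward span of 3, backward span of 2, sequencing span of 2
    forward = {2: (True, True), 3: (True, True), 4: (True, False)}
    backward = {2: (True, True), 3: (False, False)}
    sequencing = {2: (True, True), 3: (False, True)}
    print(reliable_digit_span(forward, backward))              # RDS = 5
    print(reliable_digit_span(forward, backward, sequencing))  # RDS-R = 7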
Promising for Clinicians without Other Measures

Despite the RDS, RDS-R, and ACSS metrics lacking the evidence needed for regular clinical use, the findings of this study may still aid professionals who do not have access to other measures of performance validity. As noted previously, school psychologists rarely use performance validity measures in the evaluations they conduct, yet they frequently administer the WISC-V. In contexts where no other PVTs are available, the ≤4 cutoff on the RDS and the ≤6 cutoff on the RDS-R can still provide insight into the validity of an assessment. Each Digit Span condition starts with trials that are two digits in length. Therefore, to perform validly on the RDS and RDS-R metrics using the cutoffs established in this study, a child needs only to reliably repeat one three-digit string in a single condition and the two-digit strings in the remaining conditions. If a child fails to meet this threshold, professionals without other validity measures should question the results of their assessment measures and look to behavior observations, interview data, history forms, and other sources to gain additional context. This study adds to the growing literature on a topic long ignored but widely valued by neuropsychologists. Further research will hopefully bring more certainty to the use of embedded validity indicators in widely used clinical instruments.

Conclusion

This study examined the clinical utility of various embedded performance validity indicators in the Digit Span subtest of the WISC-V. The results indicate that Reliable Digit Span, Reliable Digit Span – Revised, and the age corrected scaled score provided by the Digit Span subtest have promise as future validity indicators. However, additional study is needed before these indices can be reliably used in clinical settings.

REFERENCES

Adams, W., & Sheslow, D. (2021). Wide Range Assessment of Memory and Learning - Third Edition. Pearson.

Axelrod, B. N., Meyers, J. E., & Davis, J. J. (2014). Finger tapping test performance as a measure of performance validity. The Clinical Neuropsychologist, 28(5), 876-888.

Baron, I. S. (2018). Neuropsychological evaluation of the child: Domains, methods, & case studies. Oxford University Press.

Ben-Porath, Y. S., & Tellegen, A. (2020). Minnesota Multiphasic Personality Inventory-3 (MMPI-3). NCS Pearson.

Bender, L. (1938). A visual motor gestalt test and its clinical use. Research Monographs, American Orthopsychiatric Association.

Binder, L. M., & Willis, S. C. (1991). Portland Digit Recognition Test. https://doi.org/http://dx.doi.org/10.1037/t27239-000

Binet, A., & Simon, T. (1905). Méthodes nouvelles pour le diagnostic du niveau intellectual des anormaux. L’Année psychologique, 11, 191-336.

Boone, K. (2007). Assessment of Feigned Cognitive Impairment: A Neuropsychological Perspective. The Guilford Press.

Brooks, B. L., Ploetz, D. M., & Kirkwood, M. W. (2016). A survey of neuropsychologists’ use of validity tests with children and adolescents. Child Neuropsychology, 22(8), 1001-1020. https://doi.org/10.1080/09297049.2015.1075491

Bush, S. S., Policy, N., & Committee, P. (2005). Independent and court-ordered forensic neuropsychological examinations: Official statement of the National Academy of Neuropsychology. Archives of Clinical Neuropsychology, 20(8), 997-1007.

Butt, S., Sellers, A., Ghazarian, S., & Katzenstein, J. (2023).
Embedded Performance Validity Utilizing the WISC-V Figure Weights Subtest International Neuropsychological Society Serendipity and Science Conference, San Diego, CA. Crossman, A. M., & Lewis, M. (2006). Adults' ability to detect children's lying. Behavioral Sciences & the Law, 24(5), 703-715. Dandachi-Fitzgerald, B., Merckelbach, H., & Ponds, R. W. H. M. (2017). Neuropsychologists’ ability to predict distorted symptom presentation. Journal of clinical and experimental neuropsychology, 39(3), 257-264. https://doi.org/10.1080/13803395.2016.1223278 88 Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (1994). The California Verbal Learning Test - Children's Version. The Psychological Corporation. Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (2000). California Verbal Learning Test - Second Edition. The Psychological Corporation. Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (2017). California Verbal Learning Test, Third Edition [California Verbal Learning Test-3]. Pearson. https://doi.org/http://dx.doi.org/10.1037/t79642-000 DeRight, J., & Carone, D. A. (2015, 2015/01/02). Assessment of effort in children: A systematic review. Child Neuropsychology, 21(1), 1-24. https://doi.org/10.1080/09297049.2013.864383 Diamond, A. (2013). Executive Functions. Annual Review of Psychology, 64(1), 135-168. https://doi.org/10.1146/annurev-psych-113011-143750 Donders, J. (2005). Performance on the Test of Memory Malingering in a mixed pediatric sample. Child Neuropsychology, 11(2), 221-227. Erdodi, L. A., Abeare, C. A., Medoff, B., Seke, K. R., Sagar, S., & Kirsch, N. L. (2018). A single error is one too many: The Forced Choice Recognition Trial of the CVLT-II as a measure of performance validity in adults with TBI. Archives of Clinical Neuropsychology, 33(7), 845-860. Erdodi, L. A., Roth, R. M., Kirsch, N. L., Lajiness-O'Neill, R., & Medoff, B. (2014). Aggregating validity indicators embedded in Conners' CPT-II outperforms individual cutoffs at separating valid from invalid performance in adults with traumatic brain injury. Archives of Clinical Neuropsychology, 29(5), 456-466. Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). “Mini-mental state”: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3), 189-198. https://doi.org/https://doi.org/10.1016/0022-3956(75)90026-6 Garon, N., Bryson, S. E., & Smith, I. M. (2008). Executive function in preschoolers: a review using an integrative framework. Psychological bulletin, 134(1), 31. Gazzaniga, M., Ivry, R., & Mangun, G. (2014). Cognitive Neuroscience: The Biology of the Mind (Fourth ed.). W. W. Norton. Gignac, G. E., Reynolds, M. R., & Kovacs, K. (2019). Digit Span subscale scores may be insufficiently reliable for clinical interpretation: distinguishing between stratified coefficient alpha and omega hierarchical. Assessment, 26(8), 1554-1563. Green, P. (2004). Green's Medical Symptom Validity Test (MSVT) for Microsoft Windows (User Manual). Green's Publishing. 89 Green, P., & Astner, K. (1995). The Word Memory Test. Neurobehavioural Associates. Greiffenstein, M. F., Baker, W. J., & Gola, T. (1994). Validation of malingered amnesia measures with a large clinical sample. Psychological Assessment, 6(3), 218. Guilmette, T. J., Sweet, J. J., Hebben, N., Koltai, D., Mahone, E. M., Spiegler, B. J., Stucky, K., & Westerveld, M. (2020). American Academy of Clinical Neuropsychology consensus conference statement on uniform labeling of performance test scores. 
The Clinical Neuropsychologist, 34(3), 437-453. https://doi.org/10.1080/13854046.2020.1722244 Hathaway, S. R., & McKinley, J. C. (1951). Minnesota Multiphasic Personality Inventory; Manual, revised. Heaton, R. K., Smith, H. H., Lehman, R. A., & Vogt, A. T. (1978). Prospects for faking believable deficits on neuropsychological testing. Journal of consulting and clinical psychology, 46(5), 892-900. https://doi.org/http://dx.doi.org/10.1037/0022- 006X.46.5.892 Heilbronner, R. L., Sweet, J. J., Morgan, J. E., Larrabee, G. J., Millis, S. R., & Conference, P. (2009). American Academy of Clinical Neuropsychology Consensus Conference Statement on the Neuropsychological Assessment of Effort, Response Bias, and Malingering. The Clinical Neuropsychologist, 23(7), 1093-1129. https://doi.org/10.1080/13854040903155063 Holcomb, M. J. (2018). Pediatric Performance Validity Testing: State of the Field and Current Research. Journal of Pediatric Neuropsychology, 4(3-4), 83-85. https://doi.org/10.1007/s40817-018-00062-y Iverson, G. (2003). Detecting malingering on the WAIS-III Unusual Digit Span performance patterns in the normal population and in clinical groups. Archives of Clinical Neuropsychology, 18(1), 1-9. https://doi.org/10.1016/s0887-6177(01)00176-7 Iverson, G. L., & Franzen, M. D. (1994). The Recognition Memory Test, Digit Span, and Knox Cube Test as Markers of Malingered Memory Impairment. Assessment, 1(4), 323-334. https://doi.org/10.1177/107319119400100401 Iverson, G. L., & Franzen, M. D. (1996). Using Multiple Objective Memory Procedures to Detect Simulated Malingering. Journal of clinical and experimental neuropsychology, 18(1), 38-51. https://doi.org/10.1080/01688639608408260 Jacobs, J. (1887). Experiments on" prehension". Mind, 12(45), 75-79. Kirk, J. W., Baker, D. A., Kirk, J. J., & Macallister, W. S. (2020). A review of performance and symptom validity testing with pediatric populations. Applied Neuropsychology: Child, 9(4), 292-306. https://doi.org/10.1080/21622965.2020.1750118 90 Kirkwood, M. W. (2015). A Rationale for Performance Validity Testing in Child and Adolescent Assessment. In M. W. Kirkwood (Ed.), Validity testing in child and adolescent assessment: Evaluating exaggeration, feigning, and noncredible effort. The Guilford Press. Kirkwood, M. W., Hargrave, D. D., & Kirk, J. W. (2011). The value of the WISC-IV Digit Span subtest in detecting noncredible performance during pediatric neuropsychological examinations. Archives of Clinical Neuropsychology, 26(5), 377-384. Larrabee, G. J. (2012). Performance Validity and Symptom Validity in Neuropsychological Assessment. Journal of the International Neuropsychological Society, 18(4), 625-630. https://doi.org/10.1017/s1355617712000240 Larrabee, G. J. (2015). Performance and Symptom Validity: A Perspective from the Adult Literature. In M. W. Kirkwood (Ed.), Validity testing in child and adolescent assessment: Evaluating exaggeration, feigning, and noncredible effort (pp. 82-96). The Guilford Press. Larrabee, G. J., & Kirkwood, M. W. (2020). Symptom and Performance Validity Testing. In K. J. Stucky, M. W. Kirkwood, & J. Donders (Eds.), Clinical Neuropsychology Study Guide and Board Review (2nd ed., pp. 214-226). Oxford University Press. Lezak, M. (1983). Neuropsychological Assessment (2nd ed.). Oxford University Press. Lippa, S. M. (2018). Performance validity testing in neuropsychology: A clinical guide, critical review, and update on a rapidly evolving literature. The Clinical Neuropsychologist, 32(3), 391-421. Lu, P. H., Rogers, S. 
A., & Boone, K. B. (2007). Use of Standard Memory Tests to Detect Suspect Effort. In K. B. Boone (Ed.), Assessment of Feigned Cognitive Impairment: A Neuropsychological Perspective. The Guilford Press. Macallister, W. S., Vasserman, M., & Armstrong, K. (2019). Are we documenting performance validity testing in pediatric neuropsychological assessments? A brief report. Child Neuropsychology, 25(8), 1035-1042. https://doi.org/10.1080/09297049.2019.1569606 McWhirter, L., Ritchie, C. W., Stone, J., & Carson, A. (2020). Performance validity test failure in clinical populations—a systematic review. Journal of Neurology, Neurosurgery & Psychiatry, 91(9), 945-952. https://doi.org/10.1136/jnnp-2020-323776 Merckelbach, H., & Smith, G. P. (2003). Diagnostic accuracy of the Structured Inventory of Malingered Symptomatology (SIMS) in detecting instructed malingering. Archives of Clinical Neuropsychology, 18(2), 145-152. Meyers, J. E., & Meyers, K. R. (1995). Rey Complex Figure Test and Recognition Trial. PAR. 91 Mittenberg, W. (1996). Identification of malingered head injury on the Halstead-Reitan battery. Archives of Clinical Neuropsychology, 11(4), 271-281. https://doi.org/10.1016/0887- 6177(95)00040-2 National Center for Education Statistics. (2022). Students with disabilities. Condition of education. US Department of Education, Institute of Education Sciences. Pankratz, L. (1979). Symptom validity testing and symptom retraining: procedures for the assessment and treatment of functional sensory deficits. Journal of consulting and clinical psychology, 47(2), 409. Pankratz, L. (1983). A new technique for the assessment and modification of feigned memory deficit. Perceptual and Motor Skills, 57(2), 367-372. Pankratz, L., Fausti, S. A., & Peed, S. (1975). A forced-choice technique to evaluate deafness in the hysterical or malingering patient. Journal of consulting and clinical psychology, 43(3), 421-422. https://doi.org/http://dx.doi.org/10.1037/h0076722 Peterson, E., & Peterson, R. L. (2015). Understanding Deception from a Developmental Perspective. In M. W. Kirkwood (Ed.), Validity testing in child and adolescent assessment: Evaluating exaggeration, feigning, and noncredible effort. The Guilford Press. Reitan, R. M. (1993). The Halstead-Reitan neuropsychological test battery : theory and clinical interpretation. Second edition. S. Tucson, Arizona : Neuropsychology Press, [1993] ©1993. https://search.library.wisc.edu/catalog/9910209689302121 Rey, A. (1941). L'examen psychologique dans les cas d'encéphalopathie traumatique. (Les problems.). [The psychological examination in cases of traumatic encepholopathy. Problems.]. Archives de Psychologie, 28, 215-285. Rey, A. (1958). Rey Auditory Verbal Learning Test. Western Psychological Services. https://doi.org/http://dx.doi.org/10.1037/t27193-000 Rey, A. (1964). L'Examen clinique en psychologie: par André Rey... 2e édition. Presses universitaires de France (Vendôme, Impr. des PUF). Reynolds, C. R. (1997). Forward and backward memory span should not be combined for clinical analysis. Archives of Clinical Neuropsychology, 12(1), 29-40. Reynolds, C. R., & Bigler, E. D. (1994). Test of Memory and Learning. Pro-ed. Reynolds, C. R., & Kamphaus, R. W. (2015). Behavior Assessment System for Children, Third Edition. 92 Richardson, J. T. E. (2007). Measures of Short-Term Memory: A Historical Review. Cortex, 43(5), 635-650. https://doi.org/10.1016/s0010-9452(08)70493-3 Schmand, B., & Lindeboom, J. (2005). The amsterdam short-term memory test. Manuel. PITS. Schroeder, R. 
W., Martin, P. K., Heinrichs, R. J., & Baade, L. E. (2019). Research methods in performance validity testing studies: Criterion grouping approach impacts study outcomes. The Clinical Neuropsychologist, 33(3), 466-477. https://doi.org/10.1080/13854046.2018.1484517 Schroeder, R. W., Twumasi-Ankrah, P., Baade, L. E., & Marshall, P. S. (2012). Reliable Digit Span: A Systematic Review and Cross-Validation Study. Assessment, 19(1), 21-30. https://doi.org/10.1177/1073191111428764 Sherman, E., & Brooks, B. (2015a). Child and adolescent memory profile (ChAMP). Psychological Assessment Resources, Inc.: Lutz, FL. Sherman, E., & Brooks, B. (2015b). Memory validity profile (MVP). Psychological Assessment Resources, Inc.: Lutz, FL. Sherman, E. M. S., Slick, D. J., & Iverson, G. L. (2020). Multidimensional Malingering Criteria for Neuropsychological Assessment: A 20-Year Update of the Malingered Neuropsychological Dysfunction Criteria. Archives of Clinical Neuropsychology, 35(6), 735-764. https://doi.org/10.1093/arclin/acaa019 Slick, D. J., Sherman, E. M. S., & Iverson, G. L. (1999, 1999/11/01). Diagnostic Criteria for Malingered Neurocognitive Dysfunction: Proposed Standards for Clinical Practice and Research. The Clinical Neuropsychologist, 13(4), 545-561. https://doi.org/10.1076/1385- 4046(199911)13:04;1-Y;FT545 Spencer, R., Tree, H., Drag, L., Pangilinan, P., & Bieliauskas, L. (2010). Extending reliable digit span with the WAIS-IV sequencing task: Preliminary results. Poster presented at the 8th annual meeting for the American Academy of Clinical Neuropsychology Conference, Chicago, IL, Stucky, K., Kirkwood, M. W., & Donders, J. (2020). Traumatic Brain Injury. In K. Stucky, M. W. Kirkwood, & J. Donders (Eds.), Clinical Neuropsychology Study Guide and Board Review (2 ed.). Oxford University Press. Talwar, V., & Crossman, A. (2011). From little white lies to filthy liars: The evolution of honesty and deception in young children. Advances in child development and behavior, 40, 139-179. Talwar, V., & Lee, K. (2002). Development of lying to conceal a transgression: Children’s control of expressive behaviour during verbal deception. International Journal of Behavioral Development, 26(5), 436-444. 93 Terman, L. M. (1916). Stanford-Binet Intelligence Scale: Manual for the Third Revision Form L- M. Houghton Mifflin. https://doi.org/http://dx.doi.org/10.1037/t00012-000 Tombaugh, T. N. (1996). Test of memory malingering: TOMM. Multy-Health Systems. Ventura, L. M., Dedios-Stern, S., Oh, A., & Soble, J. R. (2019). They’re not just little adults: The utility of adult performance validity measures in a mixed clinical pediatric sample. Applied Neuropsychology: Child, 1-11. https://doi.org/10.1080/21622965.2019.1685522 Wechsler, D. (1939). Wechsler-Bellevue Intelligence Scale--Form I. https://doi.org/http://dx.doi.org/10.1037/t06871-000 Wechsler, D. (1945). Wechsler Memory Scale. https://doi.org/http://dx.doi.org/10.1037/t27207- 000 Wechsler, D. (1949). Wechsler Intelligence Scale for Children. The Psychological Corporation. Wechsler, D. (1955). Wechsler Adult Intelligence Scale. The Psychological Corporation. Wechsler, D. (1981). Wechsler Adult Intelligence Scale - Revised. The Psychological Corporation. Wechsler, D. (1987). Wechsler Memory Scale - Revised. The Psychological Corporation. Wechsler, D. (1999). Wechsler Abbreviated Scale of Intelligence. The Psychological Corporation. Wechsler, D. (2003). Wechsler Intelligence Scale for Children--Fourth Edition. Wechsler, D. (2008). 
Wechsler Adult Intelligence Scale--Fourth Edition. Wechsler, D. (2009). Advanced Clinical Solutions for WAIS-IV and WMS-IV. The Psychological Corporation. Wechsler, D. (2011). Wechsler Abbreviated Scale of Intelligence–Second Edition. NCS Pearson. Wechsler, D. (2014). Wechsler Intelligence Scales for Children - Fifth Edition. NCS Person. Weiss, S. J., Blackwell, M. C., Griffith, K. M., Jordan, L. S., & Culotta, V. P. (2019). Performance validity testing in children and adolescents: A descriptive study comparing direct and embedded measures. Applied Neuropsychology: Child, 8(2), 158-162. Welsh, A. J., Bender, H. A., Whitman, L. A., Vasserman, M., & Macallister, W. S. (2012). Clinical Utility of Reliable Digit Span in Assessing Effort in Children and Adolescents 94 with Epilepsy. Archives of Clinical Neuropsychology, 27(7), 735-741. https://doi.org/10.1093/arclin/acs063 Wilson, K., & Lesica, S. (2021). Performance on the Memory Validity Profile in a mixed clinic- referred pediatric sample. Child Neuropsychology, 27(4), 516-531. https://doi.org/10.1080/09297049.2020.1870676 Wimmer, H., & Hartl, M. (1991). Against the Cartesian view on mind: Young children's difficulty with own false beliefs. British Journal of Developmental Psychology, 9(1), 125-138. Young, J. C., Sawyer, R. J., Roper, B. L., & Baughman, B. C. (2012). Expansion and Re- examination of Digit Span Effort Indices on the WAIS-IV. The Clinical Neuropsychologist, 26(1), 147-159. https://doi.org/10.1080/13854046.2011.647083 95