THE USE OF LARGE LANGUAGE MODELS TO PREDICT ITEM PROPERTIES

By

Francis Smart

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods – Doctor of Philosophy

2024

ABSTRACT

Calibrating items is a crucial yet costly requirement for both new tests and existing ones, as items become outdated due to changing relevance or overexposure. Traditionally, this calibration involves giving items to a large number of participants, a process that requires substantial time and resources. To reduce these costs, researchers have sought alternative calibration methods. Before the emergence of Large Language Models (LLMs), these methods relied mainly on expert opinions or on computational analysis of item features. Yet the accuracy of experts in predicting item performance has varied, and computational approaches often struggle to capture the intricate semantic details of test items. The emergence of LLMs may offer a new avenue for addressing the need for item calibration. These models, popularized by OpenAI's GPT series, have shown remarkable abilities in mimicking complex human thought processes and performing advanced reasoning tasks. Their achievements in passing sophisticated exams and executing cross-language translations underline their potential. However, their capacity for predicting item properties in test calibration has not been thoroughly investigated. Traditional calibration relies heavily on direct human interaction, such as pretesting and expert assessment, or on statistical modeling of item features through resource-intensive machine learning algorithms. This dissertation explores the potential of LLMs to predict item characteristics, a task that has traditionally required human insight or complex statistical models. With the increasing accessibility of high-performance LLMs from organizations like OpenAI, Meta, and Google, and through open-source platforms such as HuggingFace.com, there is promising ground for investigation. This study examines whether LLMs could replace human efforts in item calibration tasks. To evaluate the effectiveness of LLMs in predicting item properties, this dissertation implements a training and testing framework, focusing on assessing both the relative and absolute difficulties of items. It undertakes three theoretical investigations: first, examining the ability of LLMs to predict the relative difficulty of items; second, assessing the feasibility of using multiple LLMs as substitutes for test-takers and using their responses as predictors of item difficulty; and third, applying a search algorithm, guided by LLM predictions of relative difficulty, to ascertain absolute difficulties. The findings indicate that the models are statistically significant predictors of relative item difficulty but offer only modest explanatory power, with adjusted R-squared values of around 5-10%. The application of LLMs to predicting relative item difficulties through pairwise comparisons proves more promising, achieving a pairwise accuracy of about 62% and producing predictions that correlate with item difficulty at between 0.36 and 0.42. This suggests that while LLMs show potential in certain aspects of item calibration, their effectiveness varies depending on the specific task.
This demonstrates a potential promising result that warrants further exploration into the capabilities of LLMs for item calibration, potentially leading to more efficient and cost-effective methods in the field of test development and maintenance. This dissertation is dedicated to my beloved family. To my wife, for her unwavering support through countless adventures; to my parents, for their unconditional love; to my siblings, for always having my back; and to the children we have been privileged to care for in our home. I also extend my gratitude to my in-laws, who have welcomed me warmly into their family. Finally, this work is dedicated to the disenfranchised children we have cared for, as well as those who remain victims of a dysfunctional system. Their resilience in the face of adversity is a constant source of inspiration. iv ACKNOWLEDGEMENTS I extend my deepest gratitude to Dr. Kimberly Kelly, my major professor, for her unwavering support, expert guidance, and insightful feedback. Her dedication was pivotal in the completion of this dissertation. I am also thankful to my committee members, Dr. Kenneth Frank, Dr. Alicia Alonzo, and Dr. Christopher Nye, for their invaluable suggestions and critical insights, which greatly enhanced this work. The financial support from the Institute of Education Sciences, U.S. Department of Education (Award # R305B090011) funded this work and my studies and was crucial and is deeply appreciated. Special thanks to Dr. Nathan Bos for his camaraderie and shared wisdom, which made this endeavor both productive and enjoyable. I would also like to thank Jeremy, a polygrapher at the CIA, as well as Mary Peyton with Montgomery County whose loose relationship with integrity has inspired me to do better, to keep pushing for positive change in the world. Finally, I would like to thank my family and friends for their endless encouragement and understanding. A special thanks to my parents, whose belief in me kept me motivated through challenging times, and to my wife whose patience and support were invaluable. Thank you all for your contributions and support. This dissertation would not have been possible without you. v TABLE OF CONTENTS CHAPTER I: INTRODUCTION ........................................................................................................... 1 CHAPTER II: LITERATURE REVIEW ................................................................................................. 17 CHAPTER III: RESEARCH METHODS ............................................................................................... 47 CHAPTER IV: RESULTS ................................................................................................................... 87 CHAPTER V: DISCUSSION AND CONCLUSION ............................................................................. 104 BIBLIOGRAPHY ............................................................................................................................ 122 APPENDIX A: PROMPT SCORING ................................................................................................ 131 APPENDIX B: PROMPT DESIGN - RELATIVE DIFFICULTY EVALUATION ........................................ 135 APPENDIX C: PROMPT DESIGN - COLLATERAL ITEM INFORMATION .......................................... 144 APPENDIX D: PROMPT SELECTION ALGORITHMS ....................................................................... 
146
APPENDIX E: SENSITIVITY ANALYSIS ........................................................................... 150
APPENDIX F: ALGORITHMS ILLUSTRATED ................................................................... 154

CHAPTER I: INTRODUCTION

1.1 Introduction

Calibrating test items is a crucial yet costly part of developing and maintaining testing programs. As test items age or become overexposed, they must be replaced, demanding the creation and calibration of novel items, a resource-intensive process. Pretesting novel items traditionally involves a large number of test takers, which incurs substantial expense. Consequently, more cost-effective calibration methods are in high demand. Researchers have been investigating alternative calibration methods since the 1950s. These methods have typically hinged on evaluations by subject matter experts (SMEs) or on computational tools focused on observable item features such as word count, diction, and statistical correlations between infrequent word usage and item parameters. While SMEs can assess the nuanced content of items, their predictions of item difficulty and other parameters have been variable (Hambleton et al., 1998). Computational tools, typically traditional machine learning models, rely on having many items to train on, often hundreds to thousands of similar examples. Additionally, computational models are typically complex statistical models that have no predictive power outside of the items they have been trained on. However, recent advancements in natural language processing using Large Language Models (LLMs) may offer a shift in how we approach this challenge. Current developments using LLMs have been significant. These models have the capability to perform complex reasoning and produce clear, human-like responses. Traditionally, item calibration leaned heavily on human intervention, in both pretest-based approaches and expert judgment methods. The goal of this dissertation is to explore to what extent these advanced models might be used to reduce the need for expensive human input in the item calibration process. Large Language Models (LLMs) have shown remarkable capabilities, achieving impressive outcomes in areas beyond their initial training. Notably, without any specific preparation, they have scored in the top percentiles of professional exams like the Bar and LSAT (OpenAI, 2023). Researchers have deployed LLMs in activities somewhat akin to the scope of this dissertation, assessing their ability to tackle a wide array of test items and demonstrating their proficiency in solving problems once believed to necessitate human intelligence and knowledge. This process of solving various problem sets with known outcomes serves as a benchmark for evaluating and comparing LLMs, illustrating their aptitude in emulating human cognition (OpenAI, 2023; Anil et al., 2023; Lewis et al., 2019; Devlin et al., 2018). Despite their proficiency in generating human-like responses, thanks to extensive training on vast datasets of human language, the potential of LLMs to predict test item characteristics has not been thoroughly investigated. The task of assessing item features has traditionally depended on human inputs, either through expert judgment or through empirical data gathered from pretesting, a practice rooted in the seminal works of Thurstone (1925), Zimowski et al. (1996), Lorge & Kruglov (1952), Thorndike (1982), and Bejar (1983).
Alternatively, computational methods have sought to forecast item characteristics by analyzing observable attributes, an approach surveyed by Benedetto et al. (2023). These computational methods typically require large datasets to function effectively. However, the variabilities in 2 expert reliability alongside the substantial data requirements of computational approaches, hint at the benefits an automated method—less reliant on massive datasets—could provide. This dissertation investigates the use of Large Language Models (LLMs) as potential surrogates for certain human tasks in the procedure of item calibration. Given the proven capabilities of LLMs to mimic human responses and their widespread availability through platforms such as OpenAI's GPT-3 and 4, Meta’s Llama 2, Google’s Gemini-Pro, and the resources at HuggingFace.com, there is a promising potential for these models to substitute some of the human efforts traditionally necessary in assessing item attributes. These attributes, previously quantifiable only through extensive manual labor or, to a lesser extent, through intricate statistical models, present a new frontier for LLM application. For the empirical aspect of this investigation, item content and statistical data have been sourced from the National Assessment of Educational Progress (NAEP), focusing on a selected set of 462 released items from Mathematics and Science domains. The selection process favored items amendable to LLM analysis, thus excluding those heavily reliant on visual inputs, referencing current facts, or for which the alternative text, for those visually impaired, is poorly encoded might hinder LLM evaluation. Beyond examining theoretical models, this dissertation delves into practical applications, specifically, the use of a genetic algorithm to enhance the process of selecting prompts when determining the relative difficulty of items. This exploration begins without a predetermined notion of the most effective prompt but proposes to identify it through experimentation with various prompt recommendations. These techniques range from incorporating additional contextual information and adjusting input and output instructions, to 3 modifying the prompt’s 'temperature', all recommended by experts in the field as potentially influencing an optimal prompt design. 1.2 Understanding the Challenges in Item Calibration To build an effective model for analyzing item parameters, it is crucial to define items and their function. Items are tools designed by experts to gauge the underlying qualities of a test-taker, which may include knowledge, abilities, skills, or a mix thereof. The field of test theory lays out the features of evaluations, specifically concentrating on fairness, reliability, and validity. Fairness is rooted in the principle of measurement invariance, which insists that tests should yield comparable results across diverse groups and over time (Mellenbergh, 1989; Wicherts & Dolan, 2010; Van de Schoot et al., 2015). Reliability concerns the consistency with which an item measures the intended construct, defined by Lord & Novick (1968) as the ratio of signal (valid results) to noise (errors in measurement) (Borsboom & Molenaar, 2015). Mohajan (2017) further elaborates on the importance of discerning authentic results from distortions. Validity, however, is more complex; its fundamental definition pertains to whether a tool assesses what it’s supposed to (Borsboom & Molenaar, 2015), yet it can be interpreted in numerous ways. 
An item must closely correspond with the construct domain it is devised to measure for it to be effective. A well-designed test comprises a range of items that jointly span the entirety of the defined construct domain. Importantly, for a test’s validity, the results should be influenced more by the construct domain than by any related, yet distinct domains. Take the assessment of personal agility, for instance: a varied set of physical tasks would be more 4 appropriate than including a driving test portion, since the latter assesses not only similar motor skills but also knowledge of traffic laws, which are extraneous to agility. The reliability of items is essential when creating both items and tests. A reliable test distinguishes clear measurements from any interference or ‘noise’ that may arise from irrelevant characteristics of the construct. For instance, the reliability of a personal agility test featuring an obstacle course could be affected by uncontrollable elements like weather, showing how reliability can be influenced by factors not intended in the test design. While there is a wide recognition that items vary in difficulty, the notion of discrimination—the likelihood that an item will be answered correctly more often by individuals at different ability levels—tends to receive less focus, yet it is vitally important in the calibration of test items. 1.2.1 The Need for Item Parameter Identification In various fields like education, psychology, medicine, and survey design, items are utilized as measuring tools for latent traits or intangible concepts. The effectiveness and reliability of these items as measures are crucial aspects in their development. Comprehending the key features of these items plays a substantial role in several applications. In the context of education, items are typically employed to evaluate performance at student, class, institutional, and national levels. They are instrumental in identifying gaps in students’ knowledge and serve as a vital metric in determining personal and technical readiness. However, overly difficult, or extremely easy items could potentially discourage or disinterest students. Items with low discriminatory power can be less informative about the students’ abilities compared to those with higher discrimination loadings. 5 Another motivational aspect for item parameter identification is its usage in high stakes testing, where fairness and functionality are paramount. In such scenarios, different versions of the same test should offer roughly equivalent measures. This means that scores of two test- takers should be comparable for decision-making purposes, such as allocation of university admission slots or scholarship funds. 1.3 Statement of the Problem Existing methods of item calibration are expensive. These methods most often rely upon the use of pretesting involving hundreds to thousands of test takers from the examinee population. Some existing testing regimes have been able to incorporate a pretesting procedure into their testing paradigms by forcing test takers to take additional items or test sections. Yet this is not a perfect or optimal solution as it forces students to invest effort in solving items that do not contribute to their overall ability estimates. Apart from shifting this cost of test takers, this kind of pretesting introduces a risk factor for which items might be compromised. Apart from pretesting, various approaches have been proposed to aid in the calibration of items. 
These methods can largely be divided into (1) subjective inputs by subject matter experts (SMEs) and, to a lesser extent, non-experts, and (2) computational methods. Inputs by SMEs have been shown to have varying degrees of success, with some features, such as content domain, being strongly predicted while other features, such as the item discrimination parameter, are difficult for subject matter experts to predict. In general, SME-based approaches are expensive as they involve time and effort by highly skilled individuals. Computational approaches, on the other hand, are often costly on the front end, as they involve expert input into designing a predictive computational model, but tend to be much less expensive once established. Like SME-based approaches, computational models have varying degrees of success depending upon the item parameters or features being estimated as well as the content domain of the items.

1.4 Purpose of the Study

In this dissertation I propose the use of an alternative method that takes elements of both the subject matter expert (SME) approach and the computational approach. Large Language Models (LLMs) have been shown to have some success at predicting human-like response patterns in a wide array of scenarios (OpenAI, 2023). They have been shown to be able to generate satisfactory responses to a range of questions previously answerable only by humans, even without having been explicitly trained for the task. As these models are flexible problem solvers capable of handling diverse requests, having been trained on source material spanning vast numbers of documents, there is reason to believe that LLMs might be effective at predicting item parameters.

1.5 A Theoretical Framework

1.5.1 Large Language Models' General Problem-Solving Capabilities

Large Language Models (LLMs) have astonished the world with their ability to flexibly solve a wide array of complex problems for which they were never explicitly trained. Many of these abilities seem to be emergent features of the training procedure. These models are built using transformer-style architectures (Vaswani et al., 2017) pre-trained to predict the next token in a document. Built on numerous deep layers with billions of simulated neurons and weights, these models are often astonishingly good at generating human-like predictions. The very well-known LLM OpenAI's GPT-3 (Brown et al., 2020) demonstrates "extraordinary language comprehension, fluency, and contextual understanding, enabling it to excel across a wide range of NLP tasks" (D'Souza, 2023, p. 1). Its successor, OpenAI's GPT-4 (OpenAI, 2023), is even more proficient and is currently one of the most sophisticated models available to the public. Many of the large technology companies are competing in this space with the aim of developing similarly complex LLMs, such as Google's Gemini-Pro (Anil et al., 2023) and Meta's Llama 2 (Touvron et al., 2023). The capabilities of LLMs as general problem solvers are a relatively new development in the machine learning toolkit, with the concept of a "large language model" transforming from models whose primary goal was spelling and grammar checking in the early 2010s to models that can solve an astonishing variety of complex problems in recent years. As such, these models have been growing exponentially in popularity and are increasingly the subject of scholarly research (Figure 1).
FIGURE 1: TRENDS IN SCHOLARLY ARTICLES CITING LARGE LANGUAGE MODELS
These counts are the returns from Google Scholar searches for each term. When a term has multiple words, the search term is placed in quotes (for example: "Machine Learning" and "Large Language Models").

These models have distinguished themselves by being able to solve a wide range of challenges previously reserved for humans. These include GPT-4 (OpenAI, 2023) scoring in the 90th percentile on the Uniform Bar Exam, the 88th percentile on the LSAT, the 93rd percentile on SAT Math, the 80th percentile on GRE Quantitative, the 99th percentile on GRE Verbal, the 54th percentile on GRE Writing, the 99th-100th percentile on the USABO Semifinal Exam, and within the top 20th percentile on AP exams in Environmental Science, Macroeconomics, Microeconomics, Physics 2, Psychology, Statistics, Government, and US History, as well as numerous other exams. While performing astonishingly well in many areas, in some areas GPT-4 is far from mastery. On Codeforces, a platform that evaluates competitive programming ability, it earned a rating of only 392, which is below the 5th percentile. Similarly, it struggled with LeetCode's most difficult challenges, solving only 3 of 45. GPT-4 scored only in the 6th-12th percentile on the AMC 10, a competition mathematics exam that focuses on innovative and challenging problems. It is noteworthy that GPT-4 is surprisingly good at solving many tasks for which it was not explicitly trained and that it does not seem to benefit significantly from reinforcement learning on similar problems (OpenAI, 2023, Table 8). While many applications of large transformer models are currently being evaluated and deployed in both research and business, this dissertation seeks to test to what extent these models might provide a feasible replacement for human effort in evaluating and estimating item parameters.

1.5.2 Item Difficulty Prediction

The primary focus of this dissertation is item difficulty, both relative and absolute. Relative item difficulty has a long tradition in item property estimation, with early researchers such as Lorge & Kruglov (1952) estimating it along with absolute item difficulty. Absolute item difficulty (D_i) is defined as how likely an item is to be answered incorrectly for a given population. Item i is relatively more difficult than item j if D_i > D_j. In more advanced item modelling methods such as item response theory (IRT), the parameter b_i can often be used interchangeably with D_i when calculating relative item difficulties. It is worth noting, though, that more complex IRT models such as the 3PL allow for more complex response patterns, such that relative item difficulties might change depending on whether D_i or b_i is used to calculate them. Relative difficulties do not capture the scale of difficulty differences, as some items might be much easier or much more difficult than others. By ordering items from lowest difficulty to highest difficulty, a spectrum of rank-ordered relative difficulties can be established. Relative difficulties are in some ways more intuitive than absolute difficulties. Imagine item A having a difficulty of 60% for a 4th grade population while item B has a difficulty of 80% for the same population. These numbers are readily interpretable. The relative difficulty of item B is greater than that of item A for 4th graders.
Yet there is a hidden level of abstraction that is based on the absolute item difficulties being conditional upon the population in question (that of 4th graders). Let’s imagine the same two items being given to 8th graders who now have a new absolute difficulty of 20% for item A and 30% for item B. In this hypothetical example the relative difficulties do not change but that absolute difficulties have changed dramatically with a 40-point change for item A and 50-point change for item B. Now let us imagine we are asking SMEs to estimate the item absolute difficulty for items A and B for 4th graders. To accomplish this, they would need to first imagine the steps involved in solving the problem, have a mental model of a population of 4th graders and finally have to imagine what percentage of that population would get the item correct. On the other hand, asking a SME to rank the relative difficulties of two items involves the SME mapping out mentally the steps involved in solving item A, then item B, and then evaluating if the steps involved in item A are more or less difficult than those required to solve B. Notice that this latter process does not require the SME to have a mental model of the population so long as we can assume that the relative difficulty of items pairs between populations remains constant (Appendix E.2 explores 11 the consistency of the rank correlations of item performance between populations groups for the items evaluated in this dissertation). It is important to note that item difficulty models, both classical and IRT, require absolute item parameters. Fortunately, existing literature provides guidance in methods for mapping item difficulty from relative estimates of item difficulty to absolute (Lorge & Kruglov, 1953) though this dissertation proposes to follow the binary search algorithm similar to that proposed by Attali et al. (2014). 1.5.3 Large Language Model Bias A key consideration when developing or evaluating any method that produces potentially actionable information is any underlying bias in the LLM being used. This area of focus is of high interest in ongoing development of LLMs (Bai et al., 2024; Liu et al., 2024; Rakshit et al., 2024; Bender et al., 2021). These models are trained using often minimally curated data gathered from the web. Concerns of limiting bias in generated responses is seen as a safety consideration as these models might inadvertently reinforce or introduce harmful prejudices or stereotypes against individuals or groups. This kind of overt prejudice could be very harmful. However, the use of LLMs to estimate item difficulty is unlikely to be vulnerable to this kind of prejudice yet there also exists a more subtle form of bias that these models might experience. If for instance most training materials for the models is based on the content generated by a particular group of individuals, then inference made on the basis of that content might be more likely to reflect the perspective of that group rather than that of other groups. Some aggregate statistics suggest that girls tend to have stronger reading, writing, and 12 communication skills than boys (Rieley et al., 2019) whereas boys have some slight advantage over girls in math (McGraw et al., 2006). Free and Reduced-Price Lunch (FRP) ineligible students (not poor) having an advantage over FRP eligible (poor) students (Marchant, 2015). 
To what extent any of these differences is an effect of or simply correlated with gender, race, English as a second language status, disability, socio-economic status, or any of the other many demographic groups students are divided into is a sensitive matter open to ongoing debate. However, being cognizant that though an LLM is a conglomeration of numerous voices it generally only represents a single voice at a given time. As such it might be vulnerable to bias in how it returns responses to a task. One of the methods evaluated in this dissertation requires the LLM to make an evaluation of the relative difficulty between item pairs. In many cases we could imagine that there might be no difference in expected relative difficulties between item pairs based on subpopulation group. This might be the case when the form and content of two different items is similar. However, if one item was to contain very few words while another item has a lengthy reading passage that needed to be interpreted, we might expect that the perspective underlying the behavior of the LLM to be predictive of how relatively difficult or easy it finds these different items. Fortunately, this bias is testable as the items evaluated in this dissertation have varying indicators of their different difficulties based on demographic statistics. LLMs, however, present at least one additional source of bias based on how they find solutions and what kind of challenges they face that are distinct from those faced by students. For example, LLMs, by definition, have a vast trove of linguistic information to draw on. As such they are likely to perform much better on items involving recall or word recognition than 13 students. Conversely items involving some visual content such as graphs, figures, photo, charts, or maps are not content generally comprehensible to purely language based LLMs. As such, items that rely upon visual material are likely to be much more difficult for LLMs to solve than items that are easily coded in pure linguistic terms. It is unclear to what extent the advantages or disadvantages of LLM cognitive strengths would affect the relative difficulty ranking of pairwise items. The effect of this bias will likely be to what extent LLMs hold an internal representation of these items and the solution steps required to solve them and then can compare that internal representation between the two sets of items. Li et al. (2023) trained a GPT model using Othello game transcripts and argued that the model sustains a continuous representation of the state of the game board. Likewise, Gurnee & Tegmark (2023) explored temporal and spatial representations with LLMs and found that the LLM seems to generate geographic encodings mirroring latitude and longitude coordinates. That said both these studies use very tangible and easily bounded representations of an internal space while holding the representation of an item’s complexity seems an order of complexity greater of a challenge. Overall, LLMs are black boxes with numerous parameters creating complex spaces difficult to understand and in the case of propriety models restricted from being directly observed. As the second study in this dissertation involves inferring item difficulty based on how much difficult LLMs have at solving the items, these potential sources of structural bias might be important limiting factors. 14 1.6 Research Questions This dissertation explores the capacity of Large Language Models (LLMs) to generate responses that mimic those of humans. 
Specifically, it investigates whether LLMs can accurately predict outcomes for tasks traditionally undertaken by subject matter experts and test-takers. The research focuses on two principal areas: the estimation of relative difficulty of items by the LLMs and the inference of item parameters based on the performance of a variety of LLMs in attempting to solve these items. 1.6.1 Can LLM Models Predict Relative Item Difficulty In this question I seek to understand to what extent relative item difficulty can be predicted using the current state of LLMs. If LLMs can successfully predict relative item difficulties, then depending upon the accuracy of such predictions this might provide a useful input to other approaches to item calibration or provide sufficient accuracy to greatly reduce the need for additional calibration. 1.6.2 Do LLMS Simulate Student/Test Taker Responses Model complexity, typically measured by the number of parameters in the model (ranging from a low end of 10s of millions to a high end of trillions), is generally perceived as corresponding with a higher ability of the LLMs to solve more difficult problems. In this study I will use model complexity as a proxy for test taker ability to study to what extent item difficulty can be predicted by the success of the LLMs at solving problems. 15 1.6.3 Binary Search Algorithm Estimation of Absolute Item Difficult This dissertation, building on the proposed algorithm of Attali et al. (2014) will test if a series of LLM guided relative difficulty prompts will lead to viable estimates of absolute item difficulty. 1.7 Significance of the Study Should LLMs prove effective, they could serve as a cost-efficient tool for item calibration, potentially reducing the expenses involved in maintaining current testing programs and developing new testing measures. If the parameter estimates generated by LLMs turn out to be less reliable for direct application in high-stakes testing scenarios, their cost-effective nature means that even moderate success could still offer valuable support in lower-stakes contexts, such as online learning environments. Moreover, possessing even approximate estimates of item difficulties could help lower pretesting costs. This potential to streamline the pretesting process aligns with the research area known as "optimal test design," which aims at reducing the financial and logistical burdens of test development. 16 CHAPTER II: LITERATURE REVIEW 2.1 Introduction In this chapter I review the various methods used to estimate item parameters with a particular focus on methods which employ subject matter experts (SMEs). With regards to these studies, I introduce a general conceptual framework intended to aid in explaining why some studies might have been successful while others struggled. In this chapter estimating item difficulty is divided into three buckets: pretesting, SME estimation, and computational methods. Building on these methods, Large Language models (LLMs) offer a potential mechanism for substituting in the input of SMEs or test-takers in these processes. To help set the stage for this I review some of the recent uses of LLMs in education and psychology. 2.2 Search Description Item parameter estimation has a long history in psychometrics, education, and psychology. Formalized first in the defining and estimation of classical test theory, concepts of item difficulty independent of test taker ability and later refined in item response theory (IRT) item calibration has been explored in many forms. 
In this study I explore three research fields in terms of how they approach item calibration: education, psychology, and computer science. In general, education and psychology have relied upon pretesting and SME inputs, whereas computer science approaches have focused on computational methods.

2.3 Conceptual Framework for Relative Item Difficulty Estimation

Many researchers have attempted to use subject matter experts and non-experts (SMEs) to predict relative item difficulty. It is also common for researchers to use expert raters to estimate item difficulty (Attali et al., 2014; Bejar, 1983), drawing on item domain knowledge (personal experience in writing items in that domain). In this paper I introduce the non-parametric prediction equation Ψ(η_i, γ_i, Z_1, Z_2), which estimates the likelihood of successfully classifying an item as more or less difficult than an accompanying item. Ψ is a function of rater i's skill in the construct domain (η_i) as well as rater i's item prediction ability (γ_i). Ψ is also a function of the features of items 1 and 2 (Z_1 and Z_2). A theoretical prediction of the model is that, as SME overall ability increases, predictions of item difficulty will generally improve. This need not be true globally, but we should expect SMEs to perform worse at relative item difficulty estimation if their ability levels are much lower than the item difficulty levels. In practice this means that we might expect a high schooler to be a reasonable estimator of the relative difficulties of basic arithmetic items but not expect a grade schooler to be a good estimator of the relative difficulties of calculus items.

∂Ψ(η_i, γ_i, Z_1, Z_2)/∂η_i > 0

Additionally, we can make the axiomatic assertion that as an SME's item difficulty ranking skill increases, the probability of correctly identifying the items' relative difficulties also increases.

∂Ψ(η_i, γ_i, Z_1, Z_2)/∂γ_i > 0

To make additional predictions we need to specify some additional parameters. At this point it is helpful to specify what relative difficulty is. I will define the relative difficulty R(Z_1, Z_2, Θ) of two items as the difference in the expected probability of getting item 1 correct relative to that of item 2 for a given population Θ.

R(Z_1, Z_2, Θ) = E[P(X_1 = 1) − P(X_2 = 1) | Θ] = E[P(X_1 = 1) | Θ] − E[P(X_2 = 1) | Θ]

In general, it seems intuitively correct that many achievement items which have large differences in relative difficulties would retain the same rank ordering of difficulties regardless of the population studied. This assumption, I argue, is the implicit theoretical basis underlying many relative difficulty estimation studies. For example, an SME might be an effective estimator of the relative difficulty of two math items (12+13=?) and (256+124=?) not because most SMEs simulate the relative difficulties of the two items for the target population (say, 4th graders) but because they can evaluate how difficult the items are for themselves to solve, that is, how many steps would be required, and then assume that the population of interest would face similar challenges. This assumption might not hold up if the average test taker deploys a different strategy than that of the SME. Using this model, I make the first testable prediction: as the absolute size of R increases, the likelihood of correctly predicting relative item difficulties increases.

(1) ∂Ψ(η_i, γ_i, Z_1, Z_2)/∂|R| > 0

Notice that this prediction is irrespective of the underlying item parameter models.
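To fix ideas before additional structure is placed on Ψ, the following minimal sketch illustrates what R(Z_1, Z_2, Θ) represents. It is offered purely as an illustration: the Rasch parameterization, the item difficulty values, and the two population distributions standing in for Θ are all invented for the example rather than drawn from the NAEP data analyzed later.

```python
import numpy as np

def expected_p_correct(b, theta_draws):
    """Population-averaged P(X = 1) under a Rasch model with item difficulty b."""
    return np.mean(1.0 / (1.0 + np.exp(-(theta_draws - b))))

rng = np.random.default_rng(0)

# Invented Rasch difficulties for two hypothetical items.
b_item1, b_item2 = -0.5, 0.4

# Two invented populations (Theta): a lower-ability and a higher-ability group.
theta_groups = {
    "lower-ability group": rng.normal(loc=-0.5, scale=1.0, size=100_000),
    "higher-ability group": rng.normal(loc=1.0, scale=1.0, size=100_000),
}

for label, theta in theta_groups.items():
    p1 = expected_p_correct(b_item1, theta)
    p2 = expected_p_correct(b_item2, theta)
    r = p1 - p2  # R(Z_1, Z_2, Theta) = E[P(X_1 = 1) | Theta] - E[P(X_2 = 1) | Theta]
    print(f"{label}: E[P(X_1=1)] = {p1:.3f}, E[P(X_2=1)] = {p2:.3f}, R = {r:+.3f}")
```

With these invented values, R keeps the same sign in both groups even though its magnitude changes, which is the sense in which rank orderings can remain stable across populations while absolute difficulties shift; prediction (1) then says that the larger |R| is, the easier the pair should be to order correctly.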
To further enrich the model, I include a similarity function S(Z_1, Z_2) that represents the content overlap between the two items. I theorize that the difficulty of items that evaluate knowledge in similar content domains is easier to rank than that of items from dissimilar content domains. The underlying justification for this assertion comes from the theory of cognitive diagnostic models, which explicitly attempt to diagnose and measure the cognitive steps required to solve items. Two items that are similar in content domain are more likely to have similar steps involved in solving them. The difference in these steps, as one item might involve more complexity than another, provides a basis for asserting that the more complex item is more likely to be more difficult. However, when items come from diverse content domains it is harder to infer that one item is more difficult than another, as the steps involved in solving each are unique to that item.

(2) ∂Ψ(η_i, γ_i, Z_1, Z_2)/∂S_{1,2} > 0

I will test an additional hypothesis regarding relative item difficulty estimation. I argue that the difficulties of items that evaluate constructed knowledge, knowledge that is explicitly taught and builds on well-known steps, are easier to estimate than the difficulties of items that evaluate either implicit knowledge or general, non-hierarchical knowledge. The underlying argument for why this would be the case is also built on the cognitive diagnostic approach to item difficulty estimation. When the steps necessary to solve a problem are well known, it is less complex to evaluate relative item difficulty than for items in which knowledge is acquired non-hierarchically. I propose a function C_i, a composite measure of the combined constructed complexity of the items being evaluated. As items get more complex, building on accumulated knowledge in an educational setting, I expect them to get easier to rank.

(3) ∂Ψ(η_i, γ_i, Z_1, Z_2)/∂C_i > 0

I evaluate one additional hypothesis: that items with higher discrimination a will also have relative difficulties that are easier to predict. The driving idea behind this hypothesis is that items with poor discrimination introduce a level of randomness that obscures item properties. That is, items with poor discrimination are items for which low-ability students have a non-trivial chance of answering correctly and high-ability students have a non-trivial chance of answering incorrectly. In the presence of this noise, I hypothesize that the ability to predict difficulty rankings may be compromised.

(4) ∂Ψ(η_i, γ_i, Z_1, Z_2)/∂a_1 > 0 and ∂Ψ(η_i, γ_i, Z_1, Z_2)/∂a_2 > 0

2.4 Review of Research

2.4.1 The Need for Item Calibration

Psychometric items, whether achievement or psychological, are designed to measure one or more mental constructs. Measuring constructs is of particular importance in education, where the acquisition of student knowledge and skills is a primary goal. To build effective instruments for measuring student performance, items are typically written and designed by item-writing experts, evaluated for validity by expert panels, and calibrated on test takers sampled from the population for which the instrument will be used. High-stakes test design and calibration are often quite costly. The SAT, for example, takes between 18 and 30 months to develop a new form, costing approximately $1 million US (Dudley, 2016).
As high-stakes exams are often used to determine eligibility for schools, scholarships, and professional certification, there is always an incentive for bad actors to steal and release items. As a result, many testing regimes are constantly developing and calibrating new replacement items. There is a great potential benefit to finding methods of reducing the effort associated with calibrating new items. Stocking (1990) demonstrates that, if examinee skill levels can be known, examinees can be selected in such a way as to reduce the number required to calibrate an item. The study of optimal test design has been applied to item calibration to select items appropriate for each examinee's estimated ability level (Berger, 1992; Jones and Jin, 1994; Buyske, 2005; Lu, 2014; Zheng, 2014; Van Der Linden and Ren, 2015; Ren et al., 2017; Berger, 2017; Hassan and Miller, 2019; He and Chen, 2020; Hassan and Miller, 2020; among others). This dissertation explores new methods made possible by LLMs to generate item parameter estimates.

2.4.2 Item Calibration Methods/Models

Pretesting

The gold standard method for estimating item parameters is administering those items to populations comparable to those whom the instrument is meant to evaluate. This can often be accomplished by adding additional items or sections to an existing instrument when a testing program has already been established. New testing programs typically would not have access to this low-cost pool of ready test takers. Regardless of the method of initial calibration, much care is needed in monitoring how items perform with ongoing administration to identify potential "item drift," arising either through overexposure or through changes in the testing population. Under pretesting, examinee responses are used to calibrate items relative to other, already calibrated items. How many examinees are required to calibrate an item is a function of several factors, including the required precision of the item being calibrated as well as the complexity of the model being estimated. The simplest IRT model, the Rasch model, might be calibrated with as few as 30 examinees if an estimation precision of ±1 logit is sufficient (Linacre, 1994). However, more complex models or applications requiring higher precision might require thousands of examinees to sufficiently estimate item parameters. Item pretesting is the gold standard for item calibration. It allows for the most direct measure of item performance. It also allows testing professionals to identify variations in item performance across different population groups. Test validity requires that items perform similarly across testing groups, depending only on the latent trait being examined rather than on other factors that might predict item performance. Methods to identify differences in item performance across population groups fall under the literature on "differential item functioning," or simply "DIF." This ability to identify differences in performance by population group is a major advantage of pretesting for which neither expert judges' item rankings, the computational methods reviewed in this dissertation, nor the LLM approaches presented here provide a substitute.
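As a rough illustration of what pretest-based calibration involves computationally, the sketch below simulates Rasch responses for a sample of 30 examinees and recovers approximate item difficulties from proportions correct. It is a toy illustration only, with invented difficulty values; an operational program would instead use a proper estimation routine such as conditional or marginal maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_rasch(n_persons, item_difficulties, rng):
    """Simulate a 0/1 response matrix under a Rasch model with N(0, 1) abilities."""
    theta = rng.normal(size=(n_persons, 1))                 # examinee abilities
    b = np.asarray(item_difficulties)[np.newaxis, :]        # item difficulties
    p_correct = 1.0 / (1.0 + np.exp(-(theta - b)))
    return (rng.uniform(size=p_correct.shape) < p_correct).astype(int)

true_b = np.array([-1.5, -0.5, 0.0, 0.7, 1.4])              # invented pretest items
responses = simulate_rasch(n_persons=30, item_difficulties=true_b, rng=rng)

# Crude difficulty estimate: the negative logit of the observed proportion correct,
# centred so the mean difficulty is zero. This is a CTT-flavoured approximation that
# mainly recovers the ordering of the items; it is not an operational calibration.
p_hat = responses.mean(axis=0).clip(0.02, 0.98)
b_hat = np.log((1 - p_hat) / p_hat)
b_hat -= b_hat.mean()

for bt, bh in zip(true_b, b_hat):
    print(f"true b = {bt:+.2f}   approximate estimate = {bh:+.2f}")
```

Even with only 30 examinees this kind of procedure tends to place the five items in roughly the right order, but the uncertainty in each estimate is substantial, consistent with the coarse ±1 logit precision Linacre (1994) associates with samples of this size and with the need for far larger samples when higher precision or more complex models are required.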
Subject Matter Experts Review Numerous studies starting with Tinkelman (1947) followed by subsequent researchers (Lorge and Kruglov 1952, 1953; Ryan, 1968; Thorndike, 1982; Bejar ,1983; Cross et al., 1984; Melican, 1989; Yao, 1991; Fernandez et al., 2003; Hambleton et al., 2003; Lu et al., 2007; Attali et al., 2014) have attempted to estimate item properties through use of expert human judges. These approaches have had varying levels of success. Typically, individual judges perform 23 poorly when estimating item properties, yet the average performance across judges does better. There is significant variation between studies, however. While finding viable alternatives to pre-testing is desirable, the use of judges tend to be expensive, and it is unclear if providing additional training to them results in better estimates of item properties (Bejar 1983). Currently expert judges are frequently used in scale development and validation (Boateng et al., 2018; Hardesty and Bearden, 2004). It is unclear to what extent they are used to estimate item parameters, such as difficulty and discrimination, in professional testing programs. That said, not all uses of items involve high stakes testing. Many studies use imperfect measures of item difficulty based on item difficulty estimates generated by subject matter experts, fully accepting their lack of precision. Yao (1991) for example examines computer adaptive testing when item parameters are imprecisely estimated while others use item difficulty estimates generated by subject matter experts for the purpose of directing automated tutoring content (Fernandez et al., 2003; Lu et al., 2007). The rest of this section will examine some of the notable results from the use of judges to estimate item properties. Thorndike (1982) has human judges assign item difficulty on a scale between 1 (would be passed by 75% or more of examinees) and 9 (would be passed by 30% or less of examinees). This study used the largest panel of expert judges to review and found reasonable success. Overall, across twenty judges he estimates correlations of 0.83, 0.74, and 0.72 among the average of 20 human raters and the empirical difficulty estimate. This is much higher than the single judge rating of between 0.23 to 0.32. Similar work by other researchers across different item domain fields found item difficulty estimates by subject matter experts correlated with empirical difficulties of between 0 and 0.49 with most estimates 24 having single rater correlation of less than 0.3 (Bejar, 1983; Melican et al., 1989; Cross et al., 1984). The following section presents five different research efforts into the use of item experts (and in one case non-experts) at predicting item features. Overall, these methods show promise but ultimately seem to lack the precision to be adopted into a professional high stake testing environment. TABLE 1: SUMMARY TABLE OF EXPERT JUDGE ESTIMATES The following table shows the results as found in the following papers. These results unfortunately are spotty as each paper presents different estimators of how well their method performed. In this table Interrater Corr and Rank Corr / Corr represents the Interrater Correlation as well as the Rank Correlation and Item Parameter Correlation while MAE represents Mean Absolute Error of the estimators. 
Paper | Year | Method | Item Type | Interrater Corr | Rank Corr / Corr | MAE
Lorge & Kruglov | 1952 | 8 judges (test writing class) | 8th grade arithmetic | .73 / .46 | .83 / .84 | 12.51 / 14.15
Lorge & Kruglov | 1953 | 14 judges (advanced degree in teaching math) | 8th grade arithmetic | | .65 - .74 |
Lorge & Diamond | 1954 | 14 judges (advanced degree in teaching math) | 8th grade arithmetic | | | 23 - 24.1 / 2.1 - 12.2
Bejar | 1981, 1983 | 4 judges | Test for Standard English | .95 - .91 | .16 - .30 |
Mislevy et al. | 1993 | Mixed | Pre-Professional Skills Test | | 0.49 |
Attali et al. | 2014 | 24 ETS judges | 8 subject groups of math items | | .50 - .80 |

Lorge, Kruglov, & Diamond (Item Similarity: High, Complexity: Medium to High)

Early research into estimating item difficulties through expert judges seemed to demonstrate promising results. Lorge & Kruglov (1952) propose and test a method of estimating classical test theory (CTT) item difficulty and relative difficulties. They split their rater pool of eight PhD candidates taking a test writing class into two studies: (1) the raters were provided the CTT difficulties of 30 of the 150 items as a reference, and (2) the raters received no specific reference information and rated all 150 items. The items were 8th grade arithmetic items. They found that the raters in study one were much more highly correlated with each other on the two prediction tasks, ranked item difficulty and CTT difficulty (pass rate), than those who received no reference information. On average, intercorrelations were 0.73 for study one and 0.46 for study two, indicating that having the additional information led to significant improvements in agreement among the judges. Unfortunately, although the judges agreed more when given additional item framing information, they appeared to be no better at predicting true difficulty than judges who did not receive that information. Overall, both groups were good at predicting relative item difficulties, with correlations of 0.84 and 0.83. However, they systematically underestimated absolute item difficulties, even in study one, in which 30 items had absolute item difficulty estimates provided for reference. Lorge & Kruglov (1953) followed up with another study attempting to address the underestimation of item difficulty in a multistage manner. They split the item judges into two groups, A and B, and used two 45-item tests, Test I and Test II. The judges were asked to rank items in terms of difficulty. At the first stage no information was given and judges were asked to estimate the difficulty of the items. In the second stage, group A had 10 items revealed (22% of the items) and was asked to give a new assessment of the non-revealed items' difficulties. This was followed by a third stage of evaluations, in which judges were asked to rank the difficulties of items in Test II. The same second-stage procedure was then repeated with the judges in group B. Overall, this procedure was meant to assess to what extent judges can learn from mistakes and improve their difficulty predictions. Revealing the subset of items did improve the estimates of both relative and absolute difficulties. While this early research was promising, there were and still are concerns as to how well these results map from 8th grade arithmetic items to other types of items.
NAEP for example has five general content areas for eighth grade items: “Number properties and operations”; “Measurement”; “Geometry”; "Data analysis, Statistics, and Probability"; and “Algebra.” Of these, only a small subset of one of these five content domains, “Number properties and operations” even have some items characterized as “arithmetic items,” – though many items would have the necessity of solving arithmetic as part of a solution – finding evaluation. Lorge & Diamond (1954) continued exploring the possibility of using judges to estimate item difficulty building on the Lorge & Kruglov (1953) paper by further examining how relative difficulty rankings of items can be leveraged as a linear projection into absolute difficulty space. Under three simplifying assumptions they find that using the mean linear projection produces better difficulty estimates than taking the average pass rate estimated directly from judges. Using the linear projection, they were able to estimate the mean absolute error of absolute 27 item difficulties as 23-24.1 for the item estimates without revealed item difficulties and 2.1 – 12.2 for the remainder of items after revealing 10 item difficulties. Arbuckle & Cuddy (Item Similarity: High, Item Constructed Complexity: Medium) Arbuckle & Cuddy (1969) in an unrelated study of item difficulty estimation by judges deploy a two-part study attempting to see: 1. if four experienced judges could predict item difficulties for recall items and 2. if naïve judges could predict those same item difficulties. Four experienced student judges with recall items were asked to first estimate how difficult they would find a series of sets of items (160 sets – 100 5 item sets and 60 6 item sets) to recall. Items for which there existed agreement among judges were kept (105 items) and then given to those same judges. The judges’ predictions from the first session were 62% to 72% accurate (50% being random) across the eight judge and item-set pairs giving evidence that experienced examinees could predict item difficulty. Interestingly the four judges were allowed to guess how likely their answers were to be correct and that corresponded with an 84 to 92% accuracy. To test if naïve student judges without experience with recall items could also predict item difficulties, 150 students with no experience with recall items rank how difficult they expected the items to be. In a second study in the same paper, they split the 150 students into two groups who each reviewed or one of two alternative sets of items (15 items each per set). The investigation revealed interesting findings about the subjects' capability to predict recall. In Experiment I, practiced subjects were able to predict their recall with an accuracy that was significantly greater than chance. Experiment II further substantiated that even naïve subjects demonstrated a reliable decrease in recall probability that aligned with their predictions along the "very likely" to "very unlikely" scale. This consistency in prediction and 28 recall, regardless of subjects' prior experience with PA learning, suggested a robust ability of subjects to judge associative strength immediately after presentation. The methodology of immediate predictions and evaluations instructed test takers to minimize the use of rehearsal strategies or other memory aids, focusing instead on the subjects' intrinsic assessment capabilities. 
Interestingly, the study showed that the frequency of non-expert judges' predictions of correct recall was consistent with the judges' assessments of item difficulty. Higher difficulty ratings corresponded to lower predictions of correct recall. The subjects' predictions seemed influenced by the apparent difficulty of the PA pairs. Despite some individual variability among subjects, a correlation was found between the two independent assessments (judges' difficulty ratings and subjects' predictions). This study is of note in that it demonstrates a common task in which both experts and non-experts can estimate item difficulty with some degree of success. However, the study is also limited in that it does not estimate normed item difficulties, nor does it provide other statistics, such as correlations of relative item difficulties or mean error estimates for item predictions, that would allow comparison with other methods.

Bejar (Item Similarity: Medium, Item Constructed Complexity: Low)

Bejar (1981 and 1983) conducted a study intended to encourage item experts to pool their knowledge to produce more precise item estimates. The study was broken into two parts, with the first part dedicated to training expert judges while the second part involved rating items in terms of difficulty, discrimination, and factors that contributed to difficulty. Four professional item writers with between three and 20 years of experience were recruited as raters. The raters were assembled as a group and asked to work individually, writing down their difficulty and discrimination estimates. After the judges revealed their estimates and discussed among themselves the rationale behind their ratings, the raters rated the items a second time. Then difficulty, discrimination, and other statistical information about the items was revealed, including the distribution of responses across students for each distractor as well as the mean criterion score of those choosing each distractor. The raters were instructed to rate items using the delta difficulty index, Δ = Φ⁻¹(1 − p), with Φ⁻¹ being the inverse normal CDF and p the proportion answering correctly. They were also instructed to rate each item using the biserial correlation, r = ((M_R − M_W)/S_T) × (p(1 − p)/y), with M_R and M_W being the mean criterion scores of students getting the item right and wrong, respectively, S_T the standard deviation of the criterion scores, and y the ordinate of the normal density function corresponding to min(p, 1 − p). The raters were trained on three sets of 20 usage items and three sets of 10 sentence correction items taken at random and assembled into booklets. Empirical item statistics were calculated as the estimated equated delta and the estimated biserial correlation based on a random sample of 2,000 students. A total of three rating sessions were held. Interrater reliability was calculated on the ratings before and after each discussion as well as after each session. After discussion, the interrater reliability of the difficulty estimates increased, but between sessions it decreased. The items evaluated dealt with 24 different major error categories in English. These error categories had been identified previously, at the time the items were composed. The sample items tested by Bejar exhibited only 19 of the major error categories that the items were designed to identify.
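Both of the indices just described can be computed directly from pretest response data. The short sketch below is included only as an illustration of the two formulas; the simulated abilities, item parameters, and sample size are invented and are not taken from Bejar's study.

```python
import numpy as np
from scipy.stats import norm

def delta_index(p):
    """Delta difficulty index as defined above: Delta = Phi^{-1}(1 - p).
    (Operational ETS deltas are commonly rescaled to 13 + 4 * Delta.)"""
    return norm.ppf(1 - p)

def biserial(item_correct, criterion):
    """Biserial correlation between a 0/1 item score and a continuous criterion."""
    p = item_correct.mean()
    m_right = criterion[item_correct == 1].mean()
    m_wrong = criterion[item_correct == 0].mean()
    s_total = criterion.std()
    y = norm.pdf(norm.ppf(p))  # ordinate of the normal density at the p-th quantile
    return (m_right - m_wrong) / s_total * (p * (1 - p) / y)

rng = np.random.default_rng(2)
criterion = rng.normal(size=2000)                      # criterion scores for 2,000 examinees
p_model = 1.0 / (1.0 + np.exp(-(criterion - 0.3)))     # invented item response model
item = (rng.uniform(size=2000) < p_model).astype(int)  # simulated 0/1 item responses

print(f"p = {item.mean():.2f}, delta = {delta_index(item.mean()):.2f}, "
      f"biserial = {biserial(item, criterion):.2f}")
```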
The raters were trained on three sets of 20 usage items and three sets of 10 sentence correction items taken at random and assembled into booklets. Empirical item statistics were calculated as the estimated equated delta and the estimated biserial correlation based on a random sample of 2,000 students. A total of three rating sessions were held. Interrater reliability was calculated on the ratings before and after each discussion as well as after each session. After discussion, the interrater reliability of difficulty estimates increased, while between sessions the interrater reliability decreased. The items evaluated dealt with 24 different major error categories in English. These error categories had been previously identified at the time of composition. The error categories in the sample items tested by Bejar exhibited only 19 of the major errors the items are designed to identify. Bejar used these error categories to create error bands in terms of estimation of the mean difficulty and the discrimination. The mean difficulty for most items was within a narrow window. Some error categories were more difficult while a few appeared to be easier than other items. Information on error-type average difficulty and discrimination was provided to the judges for the final item estimation phase, in which all four judges rated two sets of 50 items each. Each set contained 35 usage items followed by 15 sentence correction items. Using the item feature information associated with the type of error, Bejar estimated and projected difficulty and discrimination values. These values were then correlated with the true difficulty and discrimination. Overall, the results were quite mixed, with the judges doing better than the error-projected values in difficulty for one set of items but worse for the other. The highest correlation observed in any of the phases (there were five phases, with each item in phases 1–3 being rated twice) across any of the item types or methods was 0.63, while the correlations between the average rating and the empirical rank were 0.16 and 0.30. Bejar concluded that this is too low for use in a functional evaluation. He suggested that including more judges might help reduce noise and improve parameter estimates, but this approach would be cost prohibitive. An interesting feature of Bejar’s 1981 paper was four reflections by the item judges on how difficult and frustrating it was to predict item properties. It is likely that the items evaluated in this experiment posed a more difficult challenge than estimating the relative difficulty of arithmetic items. Yet, not having to take into account the discomfort of item raters is a distinct advantage of the LLM methods proposed in this dissertation. Mislevy et al. (Item Similarity: Medium, Item Constructed Complexity: Medium to High) Mislevy et al. (1993) present research into estimating item difficulties that combined expert review and annotation, item indexes, and statistical prediction models. They presented a study that explored a statistical methodology for equating tests when traditional methods are constrained by the unavailability of examinee response data. The study postulated that while standard equating practices are reliant on large pools of examinee responses, various alternative sources of data, like content specifications, expert opinions, or theories related to the psychological processes involved in solving test items, could provide valuable insights into item characteristics. Mislevy et al. explored item data from the Pre-Professional Skills Test (now known as the Praxis) from 1985 and 1990 measuring the reading, writing, and math skills of prospective teachers during their college years. One of the ways Mislevy et al. integrated expert judgement into their equating procedure was by leveraging the insights of subject matter experts who were adept at predicting item properties. Mislevy et al. utilized this expert judgement by coding items based on content and cognitive processing features that were then incorporated into their item parameter prediction model. They emphasized that although expert judgement does not account fully for item difficulty variance, it served as a substantial component in the absence of traditional examinee data. The study methodically categorized test items based on their content and cognitive process demands, as judged by experienced item developers.
Items assessed included selection and placement tests that tap into the reading, mathematics, and writing skills of prospective teachers, as evidenced by the analysis of the Pre-Professional Skills Test. The researchers assigned ratings to various item features, reflecting their potential contribution to the difficulty and effectiveness of the items. These features included aspects such as the number of syllables per word, sentence length, the presence of concealed information, and more intricate elements such as the number of rules present in a problem-solving task versus the number needed for its resolution. The success of incorporating subject matter expert inputs into equating was quantified by the variance accounted for in item parameters through multiple regression models. For example, the predictive model Mislevy et al. developed was designed to predict IRT parameters such as item discrimination (slope), difficulty (intercept), and guessing (lower asymptote) by correlating them with collateral information variables provided by the experts. While the authors recognized that the predictions did not match the precision of traditionally obtained item parameter estimates, they were beneficial in forming a tentative equating function in which they presented a representation of uncertainty. Overall, the study demonstrated a pragmatic approach to test equating in conditions where examinee data are sparse or non-existent. The utilization of expert input was hoped to establish a new paradigm that allowed for a fruitful intersection between psychometric theory and practical constraints. Although the resulting psychometric properties are less precise than those derived from large-scale pretesting data, the contribution of subject matter experts' insights offered an alternative method for equating test items. This study is interesting in that it deployed judges in a more complex way than the previous studies, by having judges both rank items and generate additional collateral information on items which can then be used jointly to predict item difficulty. The studies presented in this dissertation build on this work by likewise leveraging LLMs to generate additional features of items and item pairs (collateral information), which are then used to aid in the estimation of item difficulties and in predicting the likelihood of correctly estimating relative item difficulties. Attali et al. (2014) (Item Similarity: Medium to High, Item Constructed Complexity: High) Attali et al. (2014) hypothesized that by asking judges to rank items relative to one another in short item sets, a more precise assessment could be achieved. In their study, a total of 26 subject matter experts from Educational Testing Service (ETS) were employed, including SAT and GRE test developers, experienced item writers, and relatively new item writers. These judges were required to rank sets of SAT mathematics items, which covered eight major content areas such as algebraic problem-solving and geometry. Item ranking was done within each of these content areas. Their procedure had these experts arrange seven items per set in rank order from easiest to hardest, balancing cognitive load against efficiency. The sorting provided indirect information on relative difficulty through 21 paired comparisons per set. To enhance the variety of comparison types, the easiest and hardest items were oversampled among a pool of 28 released multiple-choice items per content area.
The items were then randomly ordered in booklets for the examination. The study's findings suggested that the judges could successfully rank order the items across various content areas, with a median Spearman rank-order correlation of 0.79 between their judgments and actual item difficulties. In the study, the team conducted analyses in two stages. Initially, descriptive analyses of the rankings were performed to assess the correlations between the judges' assessments and the equated delta values of the items. However, these were not deemed unbiased estimates due to the non-random selection of items. Thus, a more robust analysis was conducted focusing on individual paired comparisons. Each complete ranking by a judge led to 21 comparisons, where the primary outcome was whether the judge could accurately identify the harder item. A hierarchical general linear model was applied to the binary outcome of comparison correctness as a function of the empirical difficulty difference between the compared items. The study found little influence of judges' background on their ability to discriminate item difficulties, and the probability of success in these paired comparisons increased on average with the empirical difficulty difference between items. To propose a potential implementation strategy for their comparative judgment approach, Attali et al. outlined a binary search algorithm. This simulated procedure involved judging the difficulty of a novel item, drawn from the same family as a series of anchor items with known difficulties. Starting with an anchor item at the median difficulty level, subsequent anchors were chosen incrementally based on the outcome of each prior comparison, akin to a binary search strategy. This enabled a translation of pairwise comparisons into a numerical estimate of difficulty, with each comparison refining the estimated difficulty level of the novel item. For example, after three comparisons, raters could categorize an item among eight difficulty percentiles.
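To make the anchoring idea concrete, the following is a minimal sketch of a binary-search placement procedure of the kind Attali et al. simulate. It assumes a noiseless comparator and hypothetical anchor difficulties; it is illustrative only and is not their implementation.

```python
def place_item(is_harder_than, anchor_difficulties, n_comparisons=3):
    """Estimate a novel item's difficulty by comparing it against anchor items
    of known difficulty, narrowing the candidate interval after each judgment."""
    anchors = sorted(anchor_difficulties)
    lo, hi = 0, len(anchors)              # candidate interval over the anchor list
    for _ in range(n_comparisons):
        if lo >= hi:
            break
        mid = (lo + hi) // 2
        if is_harder_than(anchors[mid]):  # judgment: is the novel item harder than this anchor?
            lo = mid + 1
        else:
            hi = mid
    # Report the midpoint of the surviving difficulty interval as the estimate.
    lower = anchors[lo - 1] if lo > 0 else anchors[0]
    upper = anchors[hi] if hi < len(anchors) else anchors[-1]
    return (lower + upper) / 2

# Hypothetical example: anchors at difficulties 10..90, a novel item of difficulty 62,
# and a perfectly accurate judge.
anchors = [10, 20, 30, 40, 50, 60, 70, 80, 90]
true_difficulty = 62
print(place_item(lambda d: true_difficulty > d, anchors))
```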
To assess this method's practicality, Attali et al. performed a simulation with 10,000 items and examined correlations between true item difficulties and averaged difficulty judgments from various numbers of comparisons and raters. The findings revealed that the correlation increased with the number of comparisons but plateaued beyond three, whereas the number of raters continued to contribute significantly to the correlation even when more than four or five raters were used. Impressively, the study detailed that five raters utilizing three comparisons could replicate the accuracy of empirical difficulty estimates traditionally derived from a sample of 100 test-takers. This result underscores the potential efficacy and efficiency of using comparative judgments in estimating item difficulties as opposed to relying on large-scale field trials. Overall, this paper presents some successful results, with correlations between judged and true item difficulty at a similar level to that of Lorge and Kruglov (1952). While this is promising and potentially attributable to advances in item-ranking procedures by judges, I suspect the differences in the types of items being evaluated have more explanatory power than the underlying ranking algorithm. One reason is that the “easiest and hardest items were oversampled to increase the likelihood of all types of comparisons across the difficulty spectrum.” While this meant that they could get more precise estimates for the relative difficulty ranks on the lower and upper tails of the difficulty spectrum, it also likely led to a greater number of large pairwise item difficulty differences than would be expected in a random sample of items. Also, like Lorge and Kruglov, Attali et al. were able to confine the relative item ranking by judges to items within a narrow subject window for eight different sets of items from the SAT. These are also items for which the underlying skills are potentially taught in a well-defined and potentially linear manner. This is quite different than the language misuse items evaluated by Bejar (1981, 1983) or the Pre-Professional Skills Test items, covering a range of subjects, evaluated by Mislevy et al. (1993). Computational Models Indexes As human-based methods of estimating item parameters are inherently expensive, alternative means of estimating item parameters have been extensively explored. These range from estimates of reading complexity through linguistic complexity scales to more recent developments in large language models specifically trained for the estimation of item parameters. Linguistic complexity scales such as the Flesch Reading Ease (Flesch, 1948), the Farr-Jenkins-Paterson index, the first computer-implemented readability index (Danielson & Bryan, 1963), and the subsequent Flesch-Kincaid (Kincaid et al., 1975) have long been used as methods of estimating text difficulty and readily map to estimating the difficulty of reading comprehension items with various levels of success (Rafatbakhsh and Ahmadi, 2023). Brown (1998) finds that readability indexes are only weak predictors of the difficulty of cloze items for students who have English as a foreign language, though he does find that various groupings of lexical features, such as average number of syllables per sentence, frequency of words longer than seven characters, and the percentage of function words, combined were strong predictors of difficulty. Freedle and Kostin (1993) test numerous measures of text quality, including vocabulary level, paragraph length, number of paragraphs, abstractness of the text, as well as traditional readability indexes, and find that they are able to predict 46% to 59% of the variance in item difficulty in a smaller set of items, which dropped to 21% to 29% using a larger set of items. While linguistic complexity measures are intuitive and natural predictors of difficulty for reading comprehension items and other linguistic items, it is not clear to what extent they should be significant predictors of item difficulty for items outside of this domain, except insofar as they measure incidental difficulty caused by linguistic complexity tangential to the primary scale being measured. Machine Learning Models In recent years pretrained transformer models have risen in popularity as a potential method for estimating item difficulty. These models often start with a generalized pretrained transformer model such as BERT (Devlin et al., 2018) or BART (Lewis et al., 2019), which is then adapted through “fine-tuning” for a specific use case such as predicting item parameters when item parameters are known for the training data.
This method of training and then validating specialized predictive models has been shown to be generally successful in some use cases, such as predicting construct loading (Hernandez and Nie, 2022). Before the recent advances in LLMs, many machine learning and natural language processing explorations involved developing supervised models trained on source data with a specific output purpose in mind. These models might have been built on a generalized model such as BERT or BART, but their applications are usually extremely specific, such as predicting a specific item feature, typically item difficulty. In the next two subsections I summarize two recent papers with the same lead author, Benedetto: one presents a recent computational model to predict item response parameters, while the other surveys recent computational methods in the field, which has been driven primarily by computer scientists. Benedetto et al. (2020) Benedetto et al. (2020) build and train an NLP model to predict item difficulty and discrimination for multiple-choice items by extracting meaningful features from the items and using them in a predictive model. They introduce a framework for estimating the parameters of newly created items in three steps: 1. estimating latent traits of items, 2. extracting meaningful features from items, and 3. estimating item properties from those features. Their framework allows for these steps to be done separately with different items, though presumably the underlying construct must be unidimensional. They also present an ablation study to support their choice of features. They additionally perform a validation study predicting student responses using estimated item properties as an observable ground truth. They use the two-parameter logistic model (2PL) for their item parameter estimates. They note that most studies like theirs use the item “wrongness,” or CTT difficulty, as the primary predictive variable of interest rather than IRT item difficulty. Item features extracted from the items are stored in a Q matrix which is then used in a linear model to predict item parameters. They use two random forest regressions to predict item difficulty and discrimination. They divide the features to extract into three components: i) readability features, ii) linguistic features, and iii) information retrieval features. The readability features they use are: Flesch Reading Ease (Flesch, 1948), Flesch-Kincaid Grade Level (Kincaid et al., 1975), Automated Readability Index (Senter and Smith, 1967), Gunning FOG Index (Gunning, 1968), Coleman-Liau Index (Coleman, 1965), and SMOG Index (McLaughlin, 1969). Linguistic features are similar to readability features (motivated by DuBay, 2004) and include Word Count Question, Word Count Correct Choice, Word Count Wrong Choice, Sentence Count Question, Sentence Count Correct Choice, Sentence Count Wrong Choice, Average Word Length Question, Question Length divided by Correct Choice Length, and Question Length divided by Wrong Choice Length. For information retrieval features, they make the assertion that the words used in the text must imply a relationship with the latent trait being measured. They preprocess the text using standard NLP steps, then consider the text of the question and the possible choices by grouping the text together. They then use Term Frequency-Inverse Document Frequency (TF-IDF), selecting a two-part threshold tuned with cross-validation to remove both too frequently used words and too uncommon words.
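As an illustration of this kind of feature-extraction pipeline, the sketch below combines readability indexes and TF-IDF features in a random forest regression. It is written in the spirit of this approach rather than being Benedetto et al.'s actual implementation, assumes the textstat and scikit-learn packages, and uses placeholder item texts and difficulty values.

```python
import textstat
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

def readability_features(text):
    # A few of the readability indexes named above, computed with textstat.
    return [
        textstat.flesch_reading_ease(text),
        textstat.flesch_kincaid_grade(text),
        textstat.automated_readability_index(text),
        textstat.gunning_fog(text),
        textstat.coleman_liau_index(text),
        textstat.smog_index(text),
    ]

# Placeholder item texts and known (CTT) difficulties for training.
items = [
    "What is 3/4 of 24?",
    "Which fraction is equivalent to 0.4?",
    "A rectangle has a length of 8 cm and a width of 5 cm. What is its area?",
]
difficulties = [0.35, 0.55, 0.40]

# TF-IDF with a two-sided frequency threshold (drop very common and very rare terms).
tfidf = TfidfVectorizer(min_df=1, max_df=0.9)
X_text = tfidf.fit_transform(items).toarray()
X_read = np.array([readability_features(t) for t in items])
X = np.hstack([X_read, X_text])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, difficulties)
```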
In addition to Random Forest (RF), they also tested Decision Trees (DT), Support Vector Regression (SVR), and Linear Regression (LR). Hyperparameter tuning was performed via 10-fold randomized cross-validation. The results of their experiments were reported in both Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). TABLE 2: DIFFICULTY AND DISCRIMINATION FROM LITERATURE Benedetto et al. (2020) DIFFICULTY DISCRIMINATION VALIDATION MAE .575 .586 .632 RMSE .739 .748 .797 RMSE .753 .826 .804 .752 .599 .779 RF DT SVR LR TEST SET VALIDATION TEST SET MAE .587 .636 .629 .607 RMSE .393 .393 .394 .397 MAE .296 .295 .298 .298 RMSE .369 .375 .379 .378 MAE .287 .290 .296 .293 Overall, their model outperforms other recent models such as Qiu et al. (2019), Huang et al. (2017), and Yaneva et al. (2019), though comparison is difficult to perform as the models are not publicly available and they report RMSE on CTT item difficulty rather than IRT difficulty. Benedetto et al. use a scaling formula, “relative RMSE,” defined as RMSE divided by the range of the difficulty scale (difficulty_max − difficulty_min). Using relative RMSE they find a substantial improvement over other methods of predicting item difficulty. Benedetto et al. (2023) Benedetto et al.'s (2023) survey paper serves as a current reference for understanding the development of computational methods for estimating item difficulty. It reviews 18 computational models, all published since 2015, that predict item difficulty using Natural Language Processing (NLP) techniques. These NLP approaches offer the potential to address some limitations of earlier methods by automating and refining the estimation process, thus enhancing the scalability, objectivity, and consistency of question calibrations. The review presents a taxonomy based on question characteristics, which is pivotal in organizing and comparing different difficulty estimation approaches. Specifically, the categorization distinguishes between Language Assessment (LA) and Content Knowledge Assessment (CKA), while considering the formats and contexts of questions such as reading comprehension, listening comprehension, vocabulary knowledge, and sentence knowledge. By choosing this taxonomy, Benedetto et al. organize recent research, providing a structured means for discussing and comparing varied methodologies in item difficulty estimation. Computational methods for estimating item parameters, which are central in item difficulty estimation, often involve feature extraction from the question texts. Various machine learning models, such as support vector machines (SVMs), random forests, neural networks, and state-of-the-art approaches like transformer-based models (e.g., BERT), have been explored. These methods have shown some success in capturing the semantic representations and syntactic structures of question texts. The survey acknowledges the transition from traditional feature engineering to leveraging pre-trained language models, which allows for higher levels of generalization and has shown promising results in gauging question difficulty for low-stakes applications. The survey by Benedetto et al. also sheds light on the challenges associated with the evaluation and reproducibility of difficulty estimation models. Due to the scarcity of publicly available educational datasets and the privacy concerns tied to such data, the direct comparison of various algorithms remains difficult. As a result, consistent evaluation metrics and standardized protocols for model validation are lacking.
The authors stress the need for more communal data sharing to corroborate the reliability of item difficulty estimation systems in diverse educational contexts. Concluding their survey, Benedetto et al. draw attention to the implications and future directions in the field of question difficulty estimation. While significant strides have been made, there are areas ripe for improvement, such as exploring the effect of multimodal data (e.g., visual content associated with questions), enhancing the interpretability of model predictions, and developing methodologies that can generalize across various domains and languages. Their work relates to models deployed in live educational environments, where having an idea of item difficulty can enhance student learning outcomes. The current state of computational models suggests that they are sufficient for their typical use case, low-stakes online learning environments, but they still need to demonstrate much higher accuracy before being ready for use in high-stakes testing environments. 2.4.3 The Generative Large Language Model Revolution The Use of Generative Large Language Models in Education Advancements in large language models (LLMs), like GPT-3, have paved the way for new research possibilities in education. These models have been leveraged for several educational purposes, including the creation of automated questions (Bezirhan & von Davier, 2023; Raina & Gales, 2022; Wang et al., 2022; Settles et al., 2020; von Davier, 2019), the production of educational materials (Hocky & White, 2022; Moore et al., 2022; Walsh, 2022), the scoring of responses (Mizumoto & Eguchi, 2023; Wu et al., 2023), and providing feedback to students (Matelsky et al., 2023; Peng et al., 2023). The language fluency, adaptability, and user-friendliness exhibited by LLMs have enhanced their role in educational innovations. A key example of this innovation comes from Settles et al. (2020), who effectively employed an LLM to generate a vast number of linguistic items. Subsequently, a second model was used to predict the difficulty of these items by examining features like their length, word log-likelihood, and Fisher score. LLMs excel in various knowledge-based and problem-solving tasks, outperforming expectations in areas where they haven't received explicit training. Rae et al. (2021) put an LLM to the test across 152 varied tasks, with 57 tasks pertaining directly to educational interests, covering subjects like high school chemistry and astronomy. Moreover, White et al. (2023) explored the potential of LLMs in interactive tutoring, assessing their proficiency in solving specialized chemistry coding challenges. Despite the notable advantages of LLMs in enhancing educational experiences, concerns about their misuse remain significant. Critics, such as Rudolph et al. (2023), emphasize the relative ease with which these models could be misappropriated to generate inauthentic student work or for teachers to craft insincere responses, highlighting the need for cautious and considerate implementation in educational settings. The Use of Generative Large Language Models in Content Evaluation The ability of large language models to correctly predict the semantic content of language, combined with their ability to adapt to a variety of challenges, has made them a promising potential tool for estimating content properties. However, the degree to which these models can reliably discern content quality and attributes is still an active area of research. Moore et al.
(2022) attempted to use fine-tuned GPT-3 model variants to identify low-quality student-generated chemistry questions as well as to predict Bloom’s revised taxonomy of item complexity (Yahya et al., 2012; Bloom, 1956) in the context of online learning, with results compared against expert judgement. In terms of distinguishing high- and low-quality items, the fine-tuned GPT-3 model agreed with the expert reviewers only 40% of the time, overestimating item quality in 85 of 86 instances of mismatch between the expert reviewers and the model. They also used a fine-tuned LLM to predict Bloom’s revised taxonomy levels for 120 questions. The model agreed with the expert reviews only 38 out of 120 times (32%). Overall, the model demonstrated excessive optimism with regard to item quality while predicting item complexity labels at a less than desirable level. Yang and Menczer (2023) leverage LLMs to predict the credibility of over 7,000 news sources. ChatGPT produced a Spearman correlation of 0.54 between its ratings and those of experts. It is unclear to what extent this demonstrates the ability of the model to estimate content quality, as it seems to rely on training knowledge about the news sources rather than directly reading articles and predicting their quality from their content. Furthermore, while these results are statistically significant, a correlation of 0.54 still leaves a sizable number of sources misclassified. Bewersdorff et al. (2023) have more success deploying GPT-3 and GPT-4 as raters attempting to identify errors in student experimental protocols. They find that their system successfully identified many types of student errors, such as when students focus only on the expected outcome and not on the dependent variable (accuracy = .9) or when students change trials during an ongoing study (accuracy = 1), but it struggled more with identifying when a student is conducting a valid control study (accuracy = .6). Concerns Over Generative Large Language Models in Content Classification Concern over the reliability of LLMs to classify content is an emerging research field. The recent study by Reiss (2023) provides an analysis of ChatGPT's reliability in the domain of text annotation and classification, both of which are used in this dissertation. The paper, entitled "Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark," addresses the non-deterministic nature of ChatGPT and its implications for the consistency of annotation outputs. Reiss emphasizes the variations that can occur in ChatGPT’s performance even with identical inputs, attributing such fluctuations to the model's inherent randomness and the sensitivity of its responses to different prompt instructions and model parameters, such as temperature. The paper demonstrates how minor modifications in the prompt or changes in the model's temperature settings can result in varied outputs, thereby raising questions about the deterministic reliability that is often assumed of computational methods in text analysis. In conducting the study, Reiss tests and quantifies the influence of these variations using 234 German-language website texts, classified as 'news' or 'not news' with ten different sets of instructions and temperature settings of 0.25 and 1. The author notes that while pooling outputs from repetitions could improve consistency, the consistency based on a single output generation did not meet the Krippendorff's alpha (Krippendorff, 2011) threshold of 0.8 for acceptable reliability. This finding highlights the need for caution in the use of tools such as ChatGPT for content classification. Reiss' investigation offers caveats on the limitations and potential biases that may emerge when using LLMs. The study stresses the importance of majority decision protocols based on multiple repetitions of input configurations to improve consistency, resulting in increased costs.
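The majority-decision protocol Reiss recommends is straightforward to implement. Below is a minimal, hypothetical sketch in which classify_once() stands in for whatever LLM API call is being repeated; it is illustrative rather than the procedure used in this dissertation.

```python
import random
from collections import Counter

def classify_once(text: str) -> str:
    """Stand-in for a single LLM classification call ('news' vs 'not news').
    A real implementation would call an LLM API here."""
    return random.choice(["news", "news", "not news"])  # simulated noisy classifier

def classify_majority(text: str, repetitions: int = 5) -> str:
    """Repeat the same classification several times and keep the majority label,
    trading extra API calls (and cost) for more consistent output."""
    labels = [classify_once(text) for _ in range(repetitions)]
    return Counter(labels).most_common(1)[0][0]

print(classify_majority("Example website text to classify."))
```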
CHAPTER III: RESEARCH METHODS 3.1 Introduction This chapter outlines the methodologies used to investigate the use of large language models (LLMs) for estimating item properties. It details empirical tests to assess the effectiveness of LLMs in predicting item difficulties. The first theoretical exploration involves a model evaluating the probability that an LLM accurately predicts relative item difficulty. This is complemented by a study leveraging multiple LLMs working together to estimate absolute item difficulty, substituting for human test-takers. Finally, the chapter describes an approach that combines the model’s relative item difficulty predictions with a binary search algorithm to predict absolute difficulties. These methods offer novel insights into how LLMs may assume roles traditionally occupied by human judgment and effort in item calibration. 3.2 Experimental Design Training and Testing Framework The training, testing, and validation framework is a common method used in machine learning for model construction. In these approaches, models are flexibly fit and selected based on their predictive performance for the outcome variable. A prime concern with this approach is overfitting, which can result in a loss of inference due to the testing of numerous parameters. The typical solution is to divide the data into two sets: one for developing the model (training) and another for testing the selected model's performance. Although the approach used in this dissertation does not follow traditional machine learning model training, it does experiment with and test different model inputs. The goal is to select inputs that yield the highest performance in terms of the pairwise ranking of items. These inputs, referred to as “prompts,” are known to be highly sensitive to seemingly small changes in their content. 3.3 Research Questions 3.3.1 Study One: Can Generative LLMs Predict Relative Item Difficulty? In this dissertation, I am primarily concerned with whether LLMs can predict relative item difficulties. To test this, I conduct a series of pairwise comparisons between items, prompting the model to infer which item is more difficult. The item data is split into two pools: training and testing. The training data is used to evaluate different prompts, optimizing them to determine which produces the best output performance (see Appendix B). The top-performing prompt is then applied to the testing data, and the output is analyzed in this study and in Study Three. I estimate the ability of LLMs to gauge pairwise relative item difficulty using a linear regression model. This model uses the correct ranking (RankingCorrect_i,j) of an item relative to another as a binary outcome, coded as either 1 (correct) or 0 (incorrect). The following equation is applied at the item-pair level, with subscripts representing the two different items being compared.

(3.3.1.1) RankingCorrect_i,j = β0 +
(Row 2) β1 SecondItemHarder_i,j + β2 |D_i − D_j| + β3 â_i + β4 â_j +
(Row 3) β5 Grade12 + β6 Grade8 + β7 Math +
(Row 4) γ1 Complexity_i + γ2 Complexity_j +
(Row 5) γ3 |Complexity_i − Complexity_j| +
(Row 6) γ4 NumberOfSteps_i + γ5 NumberOfSteps_j +
(Row 7) γ6 |NumberOfSteps_i − NumberOfSteps_j| +
(Row 8) γ7 Similarity_i,j + ε_i,j
In equation (3.3.1.1), β0 is the constant, while the next row is composed of values derived directly from the NAEP data. The binary indicator variable (SecondItemHarder_i,j) is 1 if the second item presented to the LLM is more difficult than the first item. This term captures any bias that might exist if the model is more likely to rank the second item as either easier or more difficult. The second exogenous predictor (|D_i − D_j|) is the absolute difference in empirical difficulties. The assumption is that the greater the difference in difficulty between two items, the more accurately the LLM will rank them. The next two predictors (â_i and â_j) are proxies for the item discrimination parameter (a), which is not provided by NAEP. The hypothesis is that items with stronger discrimination are easier to classify. Items with higher discrimination contain less construct-irrelevant noise, such that those who pass the item are more likely to have higher abilities than those who fail it, relative to an item with the same construct difficulty but a lower discrimination parameter. The proxy variable for discrimination is calculated as the population average ability level of those who answered item i correctly (θ̄_{X_i=1}) minus the population average ability level of those who answered incorrectly (θ̄_{X_i=0}). Thus:

(3.3.1.2) â_i = θ̄_{X_i=1} − θ̄_{X_i=0}

The proxy variables for item discrimination â_i are normalized to a mean of zero and a standard deviation of 1 across all items in each subject, grade, and year combination. The hypothesis is that the coefficients β3 and β4 will be positive, indicating that items with higher discrimination are easier to rank. With NAEP data, the population average ability level for each option selected by test-takers is listed, along with the proportion that selected each option. As a result, calculating θ̄_{X_i=0} takes the additional step of computing the weighted average of ability levels across test takers who chose a distractor d. With P_{i,d} defined as the proportion who chose distractor d and θ̄_{i,d} the average ability of those who chose it, the weighted average is calculated in the following way:

(3.3.1.3) θ̄_{X_i=0} = ( Σ_d P_{i,d} θ̄_{i,d} ) / ( Σ_d P_{i,d} )

Here is an example of the calculation of â for an item in which population average performance metrics are provided for each option. For an item with four options, the average test performance of those who got the item correct is 500, while the averages for those who chose distractors 1, 2, and 3 are 200, 300, and 400, respectively. The proportions choosing each distractor are 5%, 20%, and 15%. The difficulty of the item is 40%, meaning 60% of those taking the item got it correct. The weighted average for those who chose an incorrect option is therefore θ̄_{X_i=0} = (.05 × 200 + .20 × 300 + .15 × 400)/(.05 + .20 + .15) = 325. The proxy variable for a_i in turn is calculated as â_i = θ̄_{X_i=1} − θ̄_{X_i=0} = 500 − 325 = 175. Let us compare this item with another hypothetical item j which has the same difficulty (40%) and the same proportions choosing each distractor (5%, 20%, and 15%) but a different population average performance. Say those who chose the correct answer scored 525 on average and those who chose the distractors scored 100, 200, and 300. Now θ̄_{X_j=0} = (.05 × 100 + .20 × 200 + .15 × 300)/(.05 + .20 + .15) = 225, and the proxy variable for a_j in turn is â_j = θ̄_{X_j=1} − θ̄_{X_j=0} = 525 − 225 = 300. From the above example of items i and j, we can see that â seems to be a reasonable proxy for the discrimination parameter a, in that as the difference in overall performance between those who got an item correct and those who got the item incorrect gets larger, so does â.
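The worked example above can be expressed compactly in code. The following small illustrative sketch (the function name and structure are my own, not part of any NAEP tooling) computes the discrimination proxy of equations (3.3.1.2) and (3.3.1.3) from option-level summary statistics.

```python
def discrimination_proxy(correct_mean, distractor_means, distractor_props):
    """Discrimination proxy a-hat per equations (3.3.1.2)-(3.3.1.3):
    mean ability of correct responders minus the proportion-weighted
    mean ability of those who chose a distractor."""
    weighted_incorrect = (
        sum(p * m for p, m in zip(distractor_props, distractor_means))
        / sum(distractor_props)
    )
    return correct_mean - weighted_incorrect

# Item i from the worked example: a-hat = 500 - 325 = 175
print(discrimination_proxy(500, [200, 300, 400], [0.05, 0.20, 0.15]))
# Item j from the worked example: a-hat = 525 - 225 = 300
print(discrimination_proxy(525, [100, 200, 300], [0.05, 0.20, 0.15]))
```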
The variables in (3.3.1.1) row 2 are all values for which there might be missing or incomplete information if deploying these methods in a test development setting and therefore should be framed as context setting. However, row 3 does not suffer from these issues, as it would be known to the test developer prior to calibration what grade and subject individual items are intended for. The remaining explanatory variables would also be known to the test developer, but they are generated via an initial feature generation step by the LLM. For each item (i), the LLM assigns, in separate prompts, a complexity ranking from 0 to 10 as well as a list of steps required to solve the item. The complexity ranking is directly extracted from the LLM's predicted complexity level when presented with an item (see Appendix C). I hypothesize that items which are more complex will also be items which are easier to rank. Likewise, when item pairs have a large disparity in item complexity (|Complexity_i − Complexity_j|), I hypothesize that these item pairs will also be easier to rank. This is because complexity acts as a proxy of sorts for perceived item difficulty, so pairs with greater complexity disparity are expected to have larger differences in predicted item difficulty. In an equivalent manner to item complexity, the number of steps required to solve an item (NumberOfSteps_i) is generated by the LLM and used as a predictor. Unlike complexity, the number of steps is indirectly calculated by first prompting the LLM to list the steps required to solve the item and then counting the steps provided (see Appendix C). Also unlike complexity, the number of steps has no upper limit. Like higher complexity, a higher number of steps indicates items which are more cognitively demanding and therefore hypothesized to be easier to rank.
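As an illustration of how the model in equation (3.3.1.1) can be estimated, the sketch below fits a linear probability model to pairwise ranking outcomes with statsmodels. The column names and the simulated data frame are hypothetical stand-ins for the actual pairwise comparison data set; this is not the estimation code behind the results reported later.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200  # hypothetical number of ordered item-pair comparisons

# Simulated stand-ins for the predictors in equation (3.3.1.1).
pairs = pd.DataFrame({
    "second_item_harder":  rng.integers(0, 2, n),
    "abs_diff_difficulty": rng.uniform(0, 50, n),
    "a_hat_i":             rng.normal(0, 1, n),
    "a_hat_j":             rng.normal(0, 1, n),
    "grade12":             rng.integers(0, 2, n),
    "grade8":              rng.integers(0, 2, n),
    "math":                rng.integers(0, 2, n),
    "complexity_i":        rng.integers(0, 11, n),
    "complexity_j":        rng.integers(0, 11, n),
    "steps_i":             rng.integers(1, 8, n),
    "steps_j":             rng.integers(1, 8, n),
    "similarity":          rng.uniform(0, 1, n),
})
pairs["abs_diff_complexity"] = (pairs.complexity_i - pairs.complexity_j).abs()
pairs["abs_diff_steps"] = (pairs.steps_i - pairs.steps_j).abs()
# Toy outcome: accuracy rises with the difficulty gap, as hypothesized.
p_correct = 0.5 + 0.008 * pairs.abs_diff_difficulty
pairs["ranking_correct"] = rng.binomial(1, p_correct.clip(0, 1))

# Linear probability model mirroring (3.3.1.1).
model = smf.ols(
    "ranking_correct ~ second_item_harder + abs_diff_difficulty + a_hat_i + a_hat_j"
    " + grade12 + grade8 + math + complexity_i + complexity_j + abs_diff_complexity"
    " + steps_i + steps_j + abs_diff_steps + similarity",
    data=pairs,
).fit()
print(model.summary())
```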
3.3.2 Study Two: Do LLM Response Patterns Simulate Student Responses? In this study, I aim to understand whether multiple large language models (LLMs) working together can predict absolute item difficulty. This approach seeks to leverage the problem-solving abilities of different LLMs to foresee how challenging items will be. The model utilizes performance data from various LLMs as a linear predictor of empirical item difficulty. Alongside the average performance of the various LLMs on each item, I include additional computational scales such as word count (WordCount), Flesch-Kincaid Index (FleschKincaidIndex), and average syllable count (SyllablesPerWord). These predictors, when incorporated into a linear model, provide an equation that can be estimated via least squares regression. Empirical item difficulty (EmpiricalItemDifficulty_i) in this context is the percentage of students who selected a distractor or provided no response.

(3.3.2) EmpiricalItemDifficulty_i = β0 + β1 Grade12 + β2 Grade8 + β3 Math +
β4 WordCount_i + β5 FleschKincaidIndex_i + β6 SyllablesPerWord_i +
γ1 Prompt1Score_GPT3.5,i + γ2 Prompt1Score_GPT4,i + γ3 Prompt1Score_Llama7B,i +
γ4 Prompt1Score_Llama13B,i + γ5 Prompt1Score_Llama70B,i + γ6 Prompt1Score_GeminiPro,i +
κ1 Prompt2Score_GPT3.5,i + κ2 Prompt2Score_GPT4,i + κ3 Prompt2Score_Llama7B,i +
κ4 Prompt2Score_Llama13B,i + κ5 Prompt2Score_Llama70B,i + κ6 Prompt2Score_GeminiPro,i + ε_i

In the above equation, each PromptScore term denotes the mean score, (Σ_{n=1}^{N} Score_n)/N, of that LLM on that item across N repeated attempts. For example, if GPT-3.5 got an item correct on two out of three attempts under prompt 1, then its mean score for that item under that prompt would be 2/3 ≈ .67. Since LLMs do not have memory of previous attempts, we can consider their mean performance scores, derived from repeated attempts on the same item, as estimators of the “true” performance score for that LLM on that item. The use of LLM response APIs in this dissertation is distinct from the use most are familiar with via ChatGPT or other consumer-facing LLM applications. Those applications have processes in which they capture prior prompts and responses and feed those values into subsequent queries. Through this mechanism they have a working memory. This dissertation does not use this kind of continuous prompt feed to generate responses. I predict the coefficients β0 through β6 to be positive, except for the coefficient on Math, for which I have no hypothesis. In general, higher-grade items tend to be more difficult than lower-grade items even for students at that grade level. This model tests whether the difficulty a student faces when attempting an item corresponds to the difficulty the LLM faces with the same item. If LLMs encounter similar challenges attempting items as students do, we would expect the coefficients on the LLM scores (γ1 through γ6 and κ1 through κ6) to be, on average, positive. Due to the similarity in their training data and design, different LLMs may exhibit some collinearity, with their responses correlated. This introduces a concern about near-multicollinearity, which could potentially obscure the effects of individual LLM responses in the model. However, the overall goal of (3.3.2) is less about interpreting the coefficients on LLM performance scores and more about building a model that uses those performance scores to predict empirical item difficulty for a given item (i). As such, the estimation of empirical difficulty as a sum of LLM performance scores can be thought of as a weighted average of scores, with the weights selected via Ordinary Least Squares (OLS) to maximize their ability to predict the empirical item difficulty. The test of how well the LLM performance scores jointly predict item difficulty is an F-test, with F equal to the ratio of explained to unexplained variance. "The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations, provided these inferences are made within the region of observations.” – Kutner et al. (2004) It is worth noting that OLS does not require the coefficients on the explanatory variables to be positive in order to obtain a good prediction of the dependent variable. However, it would be very strange if there were a negative correlation between LLM performance and item difficulty.
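To make the structure of (3.3.2) concrete, the following sketch fits the model with statsmodels OLS on a hypothetical data frame of item-level mean LLM scores and text scales. The column names, the reduced set of LLM score columns, and the simulated values are illustrative stand-ins rather than the actual study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_items = 150  # hypothetical number of calibrated items

items = pd.DataFrame({
    "grade12":        rng.integers(0, 2, n_items),
    "grade8":         rng.integers(0, 2, n_items),
    "math":           rng.integers(0, 2, n_items),
    "word_count":     rng.integers(10, 120, n_items),
    "flesch_kincaid": rng.uniform(2, 12, n_items),
    "syll_per_word":  rng.uniform(1.1, 2.0, n_items),
    # Mean score of each LLM on each item across repeated attempts under prompt 1.
    "p1_gpt35":  rng.uniform(0, 1, n_items),
    "p1_gpt4":   rng.uniform(0, 1, n_items),
    "p1_gemini": rng.uniform(0, 1, n_items),
})
# Toy outcome: percent of students answering incorrectly, loosely tied to the text scales.
items["difficulty"] = (
    20 + 3 * items.flesch_kincaid + 0.2 * items.word_count + rng.normal(0, 8, n_items)
).clip(0, 100)

model = smf.ols(
    "difficulty ~ grade12 + grade8 + math + word_count + flesch_kincaid"
    " + syll_per_word + p1_gpt35 + p1_gpt4 + p1_gemini",
    data=items,
).fit()
print(model.summary())   # coefficients on the LLM score columns
print(model.f_pvalue)    # overall F-test described above
```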
3.3.3 Study Three: Binary Search Estimation of Absolute Item Difficulty In this dissertation, I aim to explore the use of a series of binary relative item difficulty classifications as an alternative method for estimating "absolute" item difficulty. I employ two different methods to estimate absolute item difficulty. The first method relies solely on the pairwise responses from Study One. Method One: Joint Estimation of Item Parameters Coding the pairwise responses as either 1, −1, or 0, we can input them as explanatory variables in a logistic regression with the outcome variable defined as 0 or 1 depending upon whether item i is ranked as more difficult than item j. Using the following procedure, we can generate a matrix of explanatory variables whose coefficients are estimates of item difficulty. The following is the value of item i in column i of the row associated with an item pair:

(3.3.3) X_i = 1 if i is the first item in the pair; −1 if i is the second item; 0 if i is not in the pairwise comparison.

Here is a clarifying example: imagine there are four item pairs evaluated, 1 with 2, 2 with 1, 1 with 3, and 3 with 1. Using the above rule, we can convert those pairs to the following matrix (columns correspond to items 1, 2, and 3):

X = [  1  −1   0
      −1   1   0
       1   0  −1
      −1   0   1 ]

Now we just need to convert the pairwise comparisons returned from the LLM into outcome variables. I do this by coding the dependent variable as 1 if the first item is ranked as more difficult than the second item and 0 otherwise. Imagine that for the above pairings the LLM generated the following difficulty rankings: 1 > 2, 2 < 1, 1 > 3, and 3 > 1. This would be coded as the following Y dependent variable column:

Y = [ 1
      0
      1
      1 ]

With the explanatory variables defined in this way and the dependent variable coded, the model is straightforward to estimate using logistic regression:

Y = logistic(X β)

The estimated coefficient β̂_i is the estimate of the item parameter for item i. Overall, this approach is parsimonious and aligns well with one-parameter logistic (1PL) item response theory estimation procedures. To compare the estimates β̂_i with the empirical difficulties, I convert the empirical difficulties to the same scale as the 1PL model using the sigmoid function. Since both distributions are centered at zero, directly comparing estimates does not present a challenge, though in practice a new batch of items would need to be grouped with anchor items, which would be used to appropriately scale and position the novel items. While this method makes sense in pretesting operations, the next method demonstrates how pre-calibrated items can be more directly leveraged to aid in the empirical estimation of novel item difficulties.
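The following is a minimal sketch of Method One using scikit-learn, reproducing the small example above. The lightly penalized logistic fit and the tiny toy data are illustrative only; they are not the estimation code used for the reported results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Design matrix from the example: rows are ordered pairs (1,2), (2,1), (1,3), (3,1);
# columns are items 1, 2, 3 coded +1 (first item), -1 (second item), 0 (not in pair).
X = np.array([
    [ 1, -1,  0],
    [-1,  1,  0],
    [ 1,  0, -1],
    [-1,  0,  1],
])
# Outcome: 1 if the LLM judged the first item of the pair to be harder.
y = np.array([1, 0, 1, 1])

# Ridge-penalized logistic regression; the penalty also handles the fact that
# difficulties are only identified relative to one another.
model = LogisticRegression(penalty="l2", C=1.0, fit_intercept=False)
model.fit(X, y)
print(model.coef_)   # one relative difficulty estimate per item
```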
Method Two: Individual Estimation of Item Parameters This method draws inspiration from Attali et al. (2014), who suggest using a series of pre-calibrated items to help estimate item difficulties. Their approach approximates a computer adaptive testing procedure by matching incrementally more difficult or less difficult items to rapidly converge on the unknown item difficulty. The primary reason for this expediency is the prohibitive cost of subject matter experts' time. However, since LLMs tend to generate responses affordably, rather than relying on the smaller set of item pairs needed for Attali et al.'s procedure, I suggest using all known item pairs (within subject and grade) as effective item anchors for each unknown item i. This analysis is a simulation of how this procedure might be used to calibrate a novel item or set of items in the context of an already existing calibrated item pool. Rather than using this method to calibrate a set of unknown items, I use it to sequentially estimate each item's difficulty, assuming all other items have known properties. While this procedure might feel contrived, it represents the information available to many testing programs. It would not be unusual for a testing program to have dozens or hundreds of pre-calibrated items to which any newly generated item can be compared. Estimating the difficulty of an item against a set of items with known properties is a bit trickier than Method One. To estimate item i's parameter, we need to input the pairwise comparisons for all pairs containing i into the estimation. The following example demonstrates. Using the same formulation for the X matrix as above, we split it into the column containing the i indicators (X_i) and the matrix not including the i indicators (X_~i). Taking item 1 as the item with unknown parameters and using the same example from Method One, we get:

X_i = [  1
        −1
         1
        −1 ]

and

X_~i = [ −1   0
          1   0
          0  −1
          0   1 ]

Using the same subscript notation for the item i parameter estimate (β̂_i) as well as the known parameters for all other items (β_~i), we can specify the logistic regression in the following way:

Y = logistic(β̂_i X_i + X_~i β_~i)

In this model the only parameter being estimated is β̂_i, not β_~i. This estimation model is more difficult to implement than the model from Method One, as most statistical packages have a logistic regression function built in but do not allow for the direct specification of known coefficients. However, by coding the logistic function directly, inputting the known parameters, and estimating only the unknown parameter, this equation can be executed. To test the overall performance of this estimation procedure, I sequentially estimate the item parameter for each item i individually and then compare these estimates with the sigmoid transformation of the empirical difficulties in the same manner as Method One.
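One convenient way to hold the known item parameters fixed is to pass their contribution in as an offset to a binomial GLM, so that only β̂_i is estimated. The sketch below does this with statsmodels for the small example above, using hypothetical known difficulties for items 2 and 3; it is one possible implementation choice rather than necessarily the one used for the reported results.

```python
import numpy as np
import statsmodels.api as sm

# Pairwise design from the example: column for item 1 (unknown) and columns for items 2, 3 (known).
X_i   = np.array([[ 1], [-1], [ 1], [-1]], dtype=float)          # indicators for the unknown item
X_not = np.array([[-1, 0], [ 1, 0], [ 0, -1], [ 0, 1]], dtype=float)
beta_not = np.array([-0.3, 0.5])        # hypothetical known difficulties for items 2 and 3
y = np.array([1, 0, 1, 1])              # 1 if the first item in the pair was judged harder

# The contribution of the known items enters as a fixed offset, so only the
# coefficient on X_i (the unknown item's difficulty) is estimated.
offset = X_not @ beta_not
model = sm.GLM(y, X_i, family=sm.families.Binomial(), offset=offset).fit()
print(model.params)   # estimated difficulty of item 1 relative to the anchors
```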
3.4 Item Data The data used in this study are taken from the National Assessment of Educational Progress (NAEP). NAEP is the largest standardized assessment of student knowledge within primary and secondary education in the United States and is administered every two years to grades 4, 8, and 12. NAEP was first administered in 1969. It is given to nationally representative samples of students, typically between 15,000 and 26,000 students per administration (Campbell et al., 2000; Rampey et al., 2009). From the NAEP exam, this dissertation focuses on released items in the subjects of Mathematics and Science; across subjects there are a total of 1,633 released NAEP items, 1,012 of which are multiple-choice items. The National Assessment of Educational Progress releases items in nine subjects, seven of which are reported below (Table 3). Two subjects, “Art” and “Technology and Engineering Literacy,” were excluded from consideration due to having very few items. In this dissertation I focus on evaluating the items in Mathematics and Science. Mathematics has the greatest number of items, while the science items cover a diverse range of subjects in natural science. Reading items were excluded from this review as they have already been thoroughly examined in the literature, with findings that passage features are highly predictive of item difficulty. Both US History and Geography items were excluded, as a sizable portion of those items required the use of visual components such as maps or diagrams. While multimodal LLMs (models which take and evaluate visual components) might be suitable to evaluate these items in the future, they are not the focus of this study. Civics and Economics items were excluded due to the small number of items released. The following table lists the full number of items available to the public. Originally, I intended to use a much larger sampling of items. However, when reviewing the items, I found that many of them either had a high dependence on visual components or presented transcription errors requiring manual review, as with the items collected from nationsreportcard.gov (National Center for Education Statistics, 2024). TABLE 3: NATIONAL ASSESSMENT OF EDUCATIONAL PROGRESS ITEMS Civics Economics Mathematics Reading US History Geography 322 121 112 88 104 136 81 1238 472 376 389 448 499 223 424 113 163 148 142 146 136 720 345 246 128 245 268 154 261 80 111 70 90 95 76 112 31 56 25 0 0 112 Science 395 94 120 177 120 142 132 12 197 51 6 93 7 17 17 800* 394 383 246 16 280 107 6 10 213 212* 75 141 Total Count Easy Medium Hard Grade 4 Grade 8 Grade 12 Content Classifications Multiple Choice Short Constructed Response Extended Constructed Response * Items considered for deployment in this study. 54 13 12 79 37 33 38 In this dissertation, I examine multiple-choice items only, as they are the most common item format. They also allow for the exploration of item difficulty under the specifications of Study Two, which the other formats do not, given that a separate procedure would be required to evaluate whether a response is correct or sufficient (as is the case for constructed-response items). All items were initially filtered using an automated system (involving word matching) to remove items anticipated to be a poor match for an LLM due to certain features, such as being excessively visually dependent (e.g., reading a map, interpreting a graph, explaining a figure) or involving facts contemporary to the time of item administration (such as the name of the then-current president).
Of the 212 science and 800 mathematics items, only 462 items survived the initial filter. Most removed items involved some visual component in the form of an image in either the question body, the selection choices, or both. All items have visual components described using Section 508 compliant alternative text. All 462 items surviving the automatic filtering process were then manually evaluated for transcription errors in the formulation of the alternative text or for excessive reliance on visual components. Surprisingly, 10.2% of the items had some kind of substantive error in the transcription of the 508 alternative text which would have made the items either more difficult or impossible to answer correctly. Another 4.5% of items were flagged as having an excessively high dependence on visual components. Nathan Bos, a professor at Johns Hopkins and an employee of MITRE, contributed to this analysis by reviewing NAEP items for errors and visual dependency. The interrater reliability for error flagging was 88%, while the interrater reliability for visual dependency exclusion was 85%. I chose to exclude items which either rater flagged for excessive visual dependency or which contained an alternative text transcription error. After manual review, 388 items remained for evaluation, with 300 of them in mathematics and 88 of them in science. These 388 remaining items were taken from a total of 36 NAEP subject/grade tests administered between 1990 and 2019 (National Center for Education Statistics, 2024). Each of the tests was assigned to either training or testing data. Training data was used to evaluate different prompt options and select the prompt which performed best (see Appendices B, C, and D). Within each test (year, subject, and grade combination), items were paired together to create binary relative difficulty item combinations. Each item pair was evaluated twice, with each of the two items taking a turn in the first position and second position (e.g., “is A more difficult than B” and “is B more difficult than A"). Table 4 shows the summary counts and combination counts across all tests and items.

TABLE 4: TEST ASSIGNMENT INTO TESTING AND TRAINING SUMMARY TABLE
Subject        Type      Count   Binary Combinations
Mathematics    Testing   198     1609
Mathematics    Training  102     594
Science        Testing   60      245
Science        Training  28      104

Tests were assigned either to testing or training such that, when possible, most items were placed in the testing pool. Also, tests were distributed evenly over the time periods, grades, and subjects administered. Table 5 shows the individual test assignment to type groups. The number of binary combinations is calculated using the standard formula F(n, k) = n! / (k! × (n − k)!), with n being the number of items in the test and k being 2, as items are matched in pairs.
TABLE 5: TEST ASSIGNMENT INTO TESTING AND TRAINING
Type       Subject       Grade  Year   Count  Binary Combinations
Training   Mathematics   4      1990   14     91
Testing    Mathematics   4      1992   22     231
Testing    Mathematics   4      1996   7      21
Testing    Mathematics   4      2003   28     378
Training   Mathematics   4      2005   12     66
Training   Mathematics   4      2007   17     136
Training   Mathematics   4      2009   9      36
Testing    Mathematics   4      2011   16     120
Testing    Mathematics   4      2013   12     66
Testing    Mathematics   8      1990   13     78
Training   Mathematics   8      1992   5      10
Training   Mathematics   8      1996   4      6
Testing    Mathematics   8      2003   16     120
Training   Mathematics   8      2005   16     120
Testing    Mathematics   8      2007   19     171
Training   Mathematics   8      2009   3      3
Testing    Mathematics   8      2011   15     105
Testing    Mathematics   8      2013   11     55
Training   Mathematics   12     1990   15     105
Testing    Mathematics   12     1992   19     171
Testing    Mathematics   12     1996   6      15
Testing    Mathematics   12     2005   13     78
Training   Mathematics   12     2009   7      21
Testing    Mathematics   12     2013   1      0
Training   Science       4      2000   9      36
Testing    Science       4      2005   10     45
Testing    Science       4      2009   6      15
Testing    Science       8      2000   9      36
Training   Science       8      2005   5      10
Training   Science       8      2009   3      3
Testing    Science       8      2011   3      3
Testing    Science       8      2019   5      10
Testing    Science       12     2000   7      21
Testing    Science       12     2005   15     105
Training   Science       12     2009   11     55
Testing    Science       12     2019   5      10

3.5 General Item Properties Item Properties Directly Taken / Calculated from NAEP Parameters This section explores some of the item properties used as either explanatory or dependent variables in this dissertation. On average, the item difficulty across all items is 46.2, with items in lower grades being slightly less difficult and items in higher grades being slightly more difficult (Table 6). The most difficult set of items by grade and subject are 12th grade science items. Table 6 shows summary statistics on item difficulty, with lower difficulty indicating that more students got the item correct and higher difficulty indicating that fewer students got the item correct. The data are segmented by grade and subject, by item assignment grouping type (testing or training data), and at the ungrouped level. The columns SD, P10, and P90 refer to the standard deviation, 10th percentile, and 90th percentile respectively.

TABLE 6: ITEM DIFFICULTY PARAMETERS
Group            Min  Mean   Max  SD     P10   Median  P90   Count
4 Mathematics    5    44.76  92   19.16  17.6  46      69.0  137
4 Science        13   39.28  81   18.64  20.6  33      68.8  25
8 Mathematics    6    46.49  84   17.98  21.1  50      66.9  102
8 Science        13   45.60  73   13.69  28.0  47      64.2  25
12 Mathematics   7    46.79  88   20.45  23.0  45      74.0  61
12 Science       30   54.87  85   15.40  32.7  56      78.5  38
Testing          5    46.94  92   18.86  20.7  48      72.0  258
Training         9    44.81  88   18.25  21.0  43      67.3  130
Ungrouped        5    46.22  92   18.69  21    47      71    388

The â parameter is the proxy estimate of the item response theory parameter a and is calculated using equation (3.3.1.2) from the overall population test scores by item choice (NAEP, 2024), taking the difference between the population mean test score of those who answered the item and chose the correct response and the weighted mean test score of those who answered the item and chose an incorrect response. Table 7 shows the average difference in population scores between those who got the item correct and those who got the item incorrect (â).
TABLE 7: â PARAMETER ESTIMATES

Grade   Subject    Type        â min   â mean   â max
4       Math                   0.6     24.99    80.1
4       Science                11.6    25.65    42.4
8       Math                   4.9     39.18    292
8       Science                12.3    21.87    44.2
12      Math                   11.5    49.91    304
12      Science                5.4     22.26    37.1
                   Testing     4.6     32.70    304
                   Training    0.6     31.24    159.5
                   Ungrouped   0.6     32.21    304.0

Item Properties Estimated Using LLMs

Complexity and number of steps were generated by prompting the LLM Gemini-Pro to provide an estimate of the complexity of the item for a student four grades below the item level. The LLM was instructed to assess the item from the perspective of a student four grades lower than the administered level because LLMs generally overestimate the ability of students. Overestimating the ability of students at a particular grade level is a commonly observed bias when interacting with these models. It is unknown why this bias exists, but it is almost certainly rooted in the training corpus. Interestingly, it is a bias that is shared by subject matter experts, who also tend to overestimate the ability of students at particular grade levels (Lorge and Kruglov, 1952 and 1953).

When the LLM assessed items from the perspective of a student four grades lower (with first grade as the floor), the average complexity was 6.77, with items assigned to higher grades being rated as slightly more complex than those assigned to lower grades (Table 8). The number of steps was generated by prompting Gemini-Pro to list the steps required to solve the item and then counting those steps. More work could have been done cleaning the listed steps, as the LLM was prone to generating non-meaningful steps for low-complexity items. When items required very few steps to find a solution, the LLM often generated standard steps that are generally required for all items (for example: "look at the question," "read the potential answers," "think about how to solve the question," etc.). This is the reason there is not more variation between the number of steps required to solve the lower grade items and those required to solve the higher grade items. This filling in of steps with generic content might also be a contributing factor in why science items had more steps listed than mathematics items, for which procedural steps are common (for example: "think about the water table," "conceptualize the relationship between predators and prey," etc.).

Assignment of items to the training or testing type was based on Table 5 and produced distributions of relative item difficulties which were similar (Figure 2). The goal of the assignment was to favor the testing group with more items, to aid final inference and results when possible, while still having sufficient items as well as subject, year, and grade coverage in the training data to make inferences about which prompt templates functioned optimally. Table 8 shows the two item properties, complexity and number of steps, generated independently for each item using Gemini-Pro.
TABLE 8: ITEM COMPLEXITY, NUMBER OF STEPS, AND ALPHA PARAMETERS

                                   Complexity              Number of Steps
Grade   Subject    Type        min   mean   max         min   mean   max
4       Math                   1     5.66   10          2     3.78   9
4       Science                1     7.32   8           1     4.04   6
8       Math                   1     7.13   10          2     4.01   9
8       Science                5     7.64   8           2     4.20   8
12      Math                   3     7.41   9           1     3.82   9
12      Science                6     7.89   9           1     4.13   8
                   Testing     1     6.83   10          1     3.99   9
                   Training    1     6.66   10          2     3.79   9
                   Ungrouped   1     6.77   10          1     3.93   9

FIGURE 2: PAIRWISE RELATIVE DIFFICULTIES BY ASSIGNMENT TYPE
This figure shows that the distributions of pairwise relative item difficulties are similar across the two assignment types.

3.6 Item Properties by Demographic Groups

A key consideration when developing any large-scale assessment is differences in response patterns across distinct groups. Generally speaking, some differences in response patterns can be attributed to "true" variation in underlying population differences, while others might be attributable to unintended features of an item that make it easier or more difficult for certain groups. Using the high-level statistics provided by NAEP, it is unlikely that we can identify which of these factors is driving the differences in response patterns that we observe in the data.

Figure 3 demonstrates that there is a slight difference in response patterns, with females finding items more difficult than males, while Figure 4 shows that, in general, American students who identify as Asian/Pacific Islander find the items the least difficult, followed by White, Hispanic, and Black students.

FIGURE 3: DISTRIBUTION OF ITEM DIFFICULTIES BY GENDER
This figure shows the density of item difficulties stacked by the item difficulty faced by different genders. Curves that load densities further to the left indicate that the population group finds the items easier, while those further to the right indicate the group finds the items more difficult.

FIGURE 4: DISTRIBUTION OF ITEM DIFFICULTIES BY RACE
This figure shows the density of item difficulties stacked by the item difficulty faced by each population group. Curves that load densities further to the left indicate that the population group finds the items easier, while those further to the right indicate the group finds the items more difficult.

3.7 Prompt Template Selection

Following the process outlined in Appendices B, C, and D, binary choice prompt templates were randomly generated across the ten different options. In total 40 prompts were randomly generated. A variety of LLMs were explored, including GPT3.5, GPT4, Google's Gemini-Pro, and several smaller open-source models (Llama 7B, Llama 13B, and Llama 70B). Overall, the difference in performance on the ranking task between GPT3.5, GPT4, and Gemini-Pro was slight. However, Gemini-Pro offered a cost advantage (at this time Gemini-Pro is zero cost for development). As a result, Gemini-Pro was used as the primary model to optimize template design and evaluate LLM pairwise classification performance.

To reduce training noise, for the first generation the training data pairwise groups were split into two equal sized buckets: high contrast pairs and low contrast pairs.¹ High contrast pairs were item pairs in which the difference in difficulty was greater than or equal to 19 percentage points, while low contrast pairs were those in which the difference in difficulty was less than 19 percentage points.

¹ Note that this is similar in some ways to the procedure used by Attali et al. (2014), in which they oversampled the highest and lowest performing items, which would likely lead to larger disparities in the pairwise differences in item difficulty.
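The pair construction and contrast bucketing just described can be sketched as follows. This is illustrative only: the DataFrame `items` and its columns `test_id`, `item_id`, and `difficulty` are hypothetical names, not the dissertation's actual data structures.

```python
from itertools import combinations
import pandas as pd

def make_pairs(items: pd.DataFrame, threshold: float = 19.0) -> pd.DataFrame:
    """Pair items within each test and label each pair as high or low contrast."""
    rows = []
    for test_id, grp in items.groupby("test_id"):
        for a, b in combinations(grp.itertuples(index=False), 2):
            gap = abs(a.difficulty - b.difficulty)
            rows.append({
                "test_id": test_id,
                "item_1": a.item_id,
                "item_2": b.item_id,
                "diff_gap": gap,
                "contrast": "high" if gap >= threshold else "low",
            })
    return pd.DataFrame(rows)

# pairs = make_pairs(items)
# pairs["contrast"].value_counts()
```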
Items which had the same difficulty (NAEP rounds difficulty to whole percentage points) were excluded from the training data, since the accuracy of responses to those pairs could not be determined. The LLM responses to pairwise comparisons are divided into three categories: 1 for correctly ranked, 0 for incorrectly ranked, and null for uninterpretable responses in which the LLM returned a response that could not be evaluated. These responses varied but often amounted to a statement like "the two items have the same difficulty" or "the difference in difficulty of the items could not be determined."

The overall performance of the first forty prompt templates, evaluated on 60 random item pairs, was 59% accurate averaged across all templates when ignoring uninterpretable responses and only 46% accurate when treating uninterpretable responses as misses. However, the top ten performing templates did much better, with an average correct response rate of 68%, and 64% even when counting uninterpretable responses as misses.

The top ten performing templates were then randomly "bred" with each other, creating a total of roughly 30 new children, some of which were dropped as duplicates. These children, in addition to their parents, were tested against another 30 randomly selected high contrast item pairs on top of the 30 already attempted by the parents. The average performance on these item pairs was 63% excluding nulls and 47% when nulls are included. Nulls occur when the LLM does not produce an interpretable binary response, often in the form of statements like "I cannot determine" or "both items are equally difficult." The top ten from this generation had a 72% correct prediction rate excluding nulls and 71% including nulls.

New generations were added in a similar manner, and additional item pairs continued to be added while the evaluation of existing prompt templates was maintained, for a total of four generations and 137 prompt templates. Prompt template performance converged quickly, with the top two performing templates on all the training data coming from the second generation (templates 61 and 60). Overall, the genetic algorithm led to the selection of higher performing templates. All but one template in the top twenty were generated from cross breeding through the genetic algorithm. If the genetic algorithm had acted as a simple random prompt generator, then we would expect roughly 29% of the prompts in the top 20 to be from the first generation rather than the 5% we observe.

For the top twenty performing templates, all binary item pairs in the training data were assigned for evaluation, creating a total of 698 item pairs which were evaluated. These item pairs included items in which the absolute difference in difficulty was less than 19 percentage points. The average correct ranking of item difficulty for the top performing template (template 61) across all the training data was 63%. Taking the top template and evaluating it on the 1854 item pairs in the testing data produces a similar performance of just over 62%. The slight drop in performance might be due to random variation or to slight "overfitting" from evaluating so many templates on the training data. For Studies One and Three, template 61 is used exclusively.
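The generational search just described can be outlined in a few lines of Python. This is a sketch under assumed interfaces: `random_template`, `crossover`, and `accuracy_on_pairs` are hypothetical placeholder functions standing in for the template generator, the breeding step, and the Gemini-Pro pairwise evaluation, and the population sizes simply mirror the counts reported above.

```python
import random

def evolve_templates(eval_pairs, random_template, crossover, accuracy_on_pairs,
                     n_generations=4, pool_size=40, top_k=10, n_children=30):
    """Generational search over prompt-template feature combinations."""
    population = [random_template() for _ in range(pool_size)]  # generation 1: random templates
    scores = {}
    for _ in range(n_generations):
        # Score any template not yet evaluated on the accumulated training item pairs.
        for template in population:
            if template not in scores:
                scores[template] = accuracy_on_pairs(template, eval_pairs)
        # Keep the top performers and "breed" random pairs of them into children.
        parents = sorted(population, key=scores.get, reverse=True)[:top_k]
        children = {crossover(*random.sample(parents, 2)) for _ in range(n_children)}
        population = list(set(parents) | children)  # duplicate children are dropped
    return sorted(scores, key=scores.get, reverse=True)  # best-scoring templates first
```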
Table 9 shows the parameters used in template 61, Table 10 shows an example prompt, and Table 11 shows an example response.

TABLE 9: TEMPLATE 61 PARAMETERS
The following table shows the parameters used in the top performing template (Template 61). See Appendix B for a list of parameter variants.

Variable                     #   Value
Question Naming              4   Question a, Question b
Task Title                   0   None
Persona                      0   None
Task Introduction            2   You will find two different questions ahead. Evaluate their difficulty based on depth and breadth of knowledge required, and cognitive demand.
Context                      4   Item Content Tags: {Content Tags}
N-shot                       0   None
Perspective                  3   Evaluate the content from the perspective of a {max(grade − 4, 1)}th/st grade student.
Task Approach Instructions   4   Detail the sequence of actions needed to solve this relative difficulty ranking challenge.
Item Shared Context          0   None
Task Instructions            4   Here are two questions for your consideration. Please rate them in order of difficulty.
Task Output                  0   The more difficult question is: {{Question 1}}/{{Question 2}}.

To get a sense of how this template works, we can see an example in Table 10 in which two items from Grade 8 in 1990 are compared.

TABLE 10: TEMPLATE 61 MODEL PROMPT EXAMPLE
The following box shows an example of a pairwise relative difficulty task framed using template design 61, followed by some item statistics.

You will find two different questions ahead. Evaluate their difficulty based on depth and breadth of knowledge required, and cognitive demand. Evaluate the content from the perspective of a fourth grade student. Detail the sequence of actions needed to solve this relative difficulty ranking challenge.

Question a:
Item Content Tags: Probability, Ratio, Fraction, Random Selection, Mathematics
There are 15 girls and 11 boys in a mathematics class. If a student is selected at random to run an errand, what is the probability that a boy will be selected?
A. 4/26  B. 11/26  C. 15/26  D. 11/15

Question b:
Item Content Tags: Area, Perimeter, Rectangle, Length, Width
Four rectangles; first is 1 by 1, second is 1 by 3, third is 1 by 2, and the fourth is 1 by 3. A certain rectangle has an area equal to the sum of the areas of the four rectangles shown above. If its length is 4, what is its width?
A. 2  B. 2 1/4  C. 2 1/2  D. 3

Here are two questions for your consideration. Please rate them in order of difficulty.
Please provide your evaluation in the following format:
The more difficult question is: Question a/Question b

Item Year: 1990
Item Grade: 8th
Item 1 ID: 1990-8M7 #18 M017001, Difficulty: 62.0
Item 2 ID: 1990-8M9 #19 M028831, Difficulty: 83.0

TABLE 11: TEMPLATE 61 EXAMPLE RESPONSE

**Question a:**
* Depth and breadth of knowledge required: Requires basic understanding of probability and ratios, including concepts of random selection and proportions.
* Cognitive demand: Level 2 (Basic Application of Skills and Concepts)

**Question b:**
* Depth and breadth of knowledge required: Requires knowledge of geometry and area concepts, including formulas for rectangles (length x width) and understanding of perimeter and area.
* Cognitive demand: Level 3 (Strategic Thinking and Complex Reasoning)

**From the perspective of a 4th/5th grade student:**
Both questions would be considered challenging, but Question b would be more difficult due to the need for more complex understanding of geometry concepts.

**Sequence of actions for difficulty ranking challenge:**
1. Read and analyze each question carefully.
2.
Identify the specific concepts and skills required for each question. 3. Determine the depth and breadth of knowledge required. 4. Evaluate the cognitive demand of each question. 5. Rank the questions in order of difficulty based on the above criteria. **The more difficult question is: Question b** We might wonder how unique the parameters are chosen using template 61 compared with the other top performing templates. Looking at Table 12 we can see that model sixty-one happens to share the same parameters as the mode of the ten top performing models of all models evaluated. All the parameters varied between the top ten performing models except for the N-shot (See Appendix B, Table 34) examples which was universally excluded. Variants of persona, task title, item shared context, and task output were also generally excluded in the top performing models (See Appendix B: Table 30, 31, and 40). Exclusion of example tasks from the item prompt (N-shot) and exclusion of item shared context is particularly interesting as it diverges from the standard recommendations in the 75 literature which often emphasize that the more context for a task provided in the prompt either through examples or additional context the better. TABLE 12: THE PARAMETERS OF THE TOP TEN PERFORMING MODELS Models are listed from best performing (left) to least (right) for pairwise comparison of items using the LLM Gemini-Pro. 106 105 Model Question Numbering Task Title Persona Task Introduction Context N-shot Perspective Task Approach Instructions Item Shared Context Task Instructions Task Output Temperature Generation * Indicates this prompt feature was rejected from final prompt selection. 118 0 4 0 3 4 0 3 3 0 3 0 0 4 60 0 0 0 2 4 0 3 4 0 4 0 0 2 59 1 4 3 3 4 0 0 3 0 3 0 0 2 61 4 0 0 2 4 0 3 4 0 4 0 0 2 51 3 0 0 1 4 0 3 0 0 0 0 0 2 4 0 0 1 0 0 3 0 0 4 0 4 4 3 0 0 1 4 0 3 0 0 4 0 0 4 126 2 0 3 2 0 0 3 4 2 2 0 3 4 49 1 0 4 4 2 0 2 0 0 3 3 3 2 10 Mode 4 0 3 2 0 0 3 4 0 4 0 4 1 [4] [0] * [0] * [2] [4] [0] * [3] [0,4] [0] * [4] [0] * [0] * [1] It should be noted that the top ten prompts shown in Table 12 seem to be the best performing prompts using the LLM Gemini-Pro. Other LLMs or even future releases of Gemini- Pro will have different parameter weights in which a slightly or completely different top prompt template will turn out to be more effective. As such, the contribution of this research is less about the specific prompts which turned out to be more effective and more about the approach of using optimization tools (such as a genetic algorithm) to improve the performance of prompts. There is no reason, based on reviewing the literature on prompt design, that this set of prompts (rejecting context and persona features) would have been expected to be 76 successful. However, by implementing an optimizing search algorithm over different template features the performance of the LLM classifier was able to be significantly improved. 3.8 LLM Pairwise Performance Under Independence There is no reason to think that errors associated with the responses from Large Language Models (LLMs) are independent and identically distributed (IID). However, if they were then there is an appropriate sample size of pairwise comparisons which would lead to a correct estimation of pairwise item difficulty at any given confidence value. Table 13 shows under IID how many pairwise comparisons would be needed in the case of 90% confidence. 
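One way to read the "sample needed" column of the table below is as the smallest number of independent comparisons for which a majority vote would agree with the true ranking at least 90% of the time, given the observed per-comparison accuracy. The following sketch implements that reading of the calculation (my interpretation, not code from the dissertation); the example accuracies are taken from Table 13.

```python
from scipy.stats import binom

def comparisons_needed(p_correct: float, confidence: float = 0.90, max_n: int = 2001) -> int:
    """Smallest odd number of IID comparisons whose majority vote is correct
    with probability >= `confidence`, given per-comparison accuracy `p_correct`."""
    for n in range(1, max_n, 2):                          # odd n avoids ties
        if binom.sf(n // 2, n, p_correct) >= confidence:  # P(more than half correct)
            return n
    return max_n

for p in (0.536, 0.616, 0.725):
    print(p, comparisons_needed(p))
```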
Thus, under IID, even a weak signal such as that found when the relative difference in difficulty is between 1 and 9 could be aggregated into a strong signal.

TABLE 13: PAIRWISE COMPARISON AND IDEAL SAMPLING
This table shows how many random samples for each item would be needed to be 90% confident that the average difficulty is correctly classified.

Relative Difference   Pairwise Counts in Sample   Correct*   Sample needed for 90% Conf.
1-9                   526                         53.6%      311
10-19                 904                         54.2%      223
20-29                 880                         61.6%      31
30-39                 541                         63.0%      23
40-49                 454                         72.5%      7
50-59                 214                         81.8%      3
60-69                 112                         83.9%      3
70-79                 18                          88.9%      3
80+                   4                           100.0%     1

* The correct % is based on the average percent correct for each set of pairwise comparisons.

However, the predictions of LLMs are not independently sampled but instead suffer from correlated error due to shared predictive bias. Looking at just the item pairs in which the difficulty difference is less than 1%, where theoretically the prediction should be close to 50% accurate (random chance), I find that a two-way t-test with a 90% threshold ends up rejecting the null 88% of the time instead of the 10% expected under IID. This is a limitation of using LLMs as expert classifiers.

A benefit of LLMs is that they do not remember their previous response to the same task. Unfortunately, the informational weights which led to that response do not change, though there is stochasticity in the model output; thus, responses and the errors in those responses are serially correlated. LLMs nevertheless have an advantage over human experts, who may remember their previous response and, for repeated classification requests, will likely be biased both by having consistent internal weights (or personal experiences) and by recalling their previous response.

Looking at the pairwise comparisons by grade and subject (Table 14), we see that the average percent correct hovers between 58% and 62% across grades 4, 8, and 12 in both Math and Science, with the single substantive deviation being Math grade 12, for which the pairwise comparisons instead average 73% correct.

TABLE 14: PAIRWISE COMPARISON AND IDEAL SAMPLING

Grade   Subject   Percent Correct   Pairwise Comparisons
4       Math      61.6%             1621
4       Science   59.5%             116
8       Math      59.4%             1040
8       Science   60.9%             92
12      Math      73.0%             514
12      Science   58.1%             270

3.9 Study One: Item Parameters

Parameters used to estimate the models in Study One are either taken from NAEP, taken from the binary predictions of the LLM prompt model, or independently generated using a different prompt (Table 15). The "Correct" variable indicates whether the LLM binary prediction model successfully predicted the relative item difficulty of two items. On average, for the validation data, the model is 62% correct. The minimum for this variable is 50% (under random guessing) while the maximum is 100% under perfect knowledge.

TABLE 15: PAIRWISE COMPARISON ITEM PARAMETERS
This table shows statistics on the various item parameters evaluated in Study One. The |var| refers to the absolute value function while the Δ refers to the difference between the two pairwise values.

Source                               Variable                     Mean    SD      Min     Max
Binary Prediction                    Correct                      0.62    0.48    0       1
NAEP                                 â₁                           0.05    1.15    -2.2    6.49
NAEP                                 â₂                           -0.08   0.94    -2.2    6.49
NAEP                                 Second Item Harder           0.51    0.5     0       1
NAEP                                 Diff₁                        46.28   19.86   5       92
NAEP                                 Diff₂                        47.02   18.72   5       92
Independent LLM Prediction           Complexity 1                 6.41    2.2     1       10
Independent LLM Prediction           Complexity 2                 6.71    1.92    1       10
Independent LLM Prediction           |Complexity Δ|               1.83    2       0       9
Independent LLM Prediction           # of steps 1                 3.86    1.35    1       9
Independent LLM Prediction           # of steps 2                 3.98    1.44    1       9
Independent LLM Prediction           |# of steps Δ|               1.5     1.25    0       8
NAEP                                 |Diff Δ|                     22.12   15.67   1       78
NAEP                                 |Diff Δ| : â₁                0.35    4.66    -14.2   47.37
NAEP                                 |Diff Δ| : â₂                0       3.79    -14.3   50.61
NAEP & Independent LLM Prediction    |Diff Δ| : Complexity 1      13.95   11.43   0.1     64.8
NAEP & Independent LLM Prediction    |Diff Δ| : Complexity 2      14.65   11.52   0.2     69.3
NAEP & Independent LLM Prediction    |Diff Δ| : |Complexity Δ|    4.49    6.97    0       49.5
NAEP & Independent LLM Prediction    |Diff Δ| : # of Steps 1      8.49    6.95    0.2     44.1
NAEP & Independent LLM Prediction    |Diff Δ| : # of Steps 2      8.77    7.34    0.2     54.9
NAEP & Independent LLM Prediction    |Diff Δ| : |# of Steps Δ|    3.5     4.46    0       39.2
NAEP & Independent LLM Prediction    |Diff Δ| : Similarity        5.93    6.81    0.1     52
Many variables are referred to with either a 1 or a 2, indicating that the variable applies to the first or the second item in the binary comparison. The parameters directly derived from NAEP are: the difficulty of the first item (Diff₁) and the difficulty of the second item (Diff₂), the absolute value of the difference in difficulties between the items (|Diff Δ|), and the proxies for the discrimination parameters (â₁ and â₂). Difficulties are oriented in the reverse of classical test theory item difficulties, more in line with item response theory difficulties (the b parameter), such that greater difficulty indicates that more test takers chose an incorrect response. A difficulty of zero indicates that 0 percent of the population taking the item chose a wrong response, while a difficulty of 100 indicates that 100 percent of the population got the item wrong. These difficulties are not used directly in the estimation of classification performance, though they enter the model in two ways: (1) they define what the correct answer to the pairwise comparison is, and (2) their absolute difference enters as an explanatory variable (|Diff Δ|). Though it may appear that a single variable is acting as both an explanatory variable and the dependent variable, there is no issue, as sign(Diff Δ) is independent of |Diff Δ| for all |Diff Δ| > 0. The other NAEP-derived parameters are â₁ and â₂, which refer to the proxy of item discrimination (see equations 3.3.1.2 and 3.3.1.3).

The variables that are independently generated by the LLMs are item complexity, number of steps, and item pair similarity. Item "Complexity" is generated by asking the LLM (Gemini-Pro in this case) to generate an estimate of item complexity between 1 and 10, with ten being most complex. The "Number of Steps" is generated by asking the LLM (Gemini-Pro) to list the steps required to solve the item and then counting those steps. While the previous two properties are generated at the individual item level, "Similarity of Item Pair" is generated at the item pair level by prompting the LLM (GPT 3.5²), "On a scale of 1 to 10 with 1 being very dissimilar and 10 being very similar how similar is the content covered by these different questions?". Like the absolute value of the difference in item difficulty (|Diff Δ|), the absolute value of the difference in item complexity (|Complexity Δ|) and in number of steps (|# of Steps Δ|) is calculated.

² Relative similarities of items ran afoul of Gemini-Pro's overly ambitious "safety rules." Much more frequently than expected this prompt came back with no response, which led to using GPT 3.5 to generate these values.
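A minimal sketch of how the pair-level variables described above can be assembled follows. The DataFrame `pairs` and its per-item column names (`diff_1`, `diff_2`, `complexity_1`, `complexity_2`, `steps_1`, `steps_2`, `a_hat_1`, `a_hat_2`, `similarity`) are hypothetical placeholders, not the dissertation's actual code.

```python
import pandas as pd

def build_pair_features(pairs: pd.DataFrame) -> pd.DataFrame:
    out = pairs.copy()
    # Absolute contrasts between the two items in each pair.
    out["abs_diff_delta"] = (out["diff_1"] - out["diff_2"]).abs()
    out["abs_complexity_delta"] = (out["complexity_1"] - out["complexity_2"]).abs()
    out["abs_steps_delta"] = (out["steps_1"] - out["steps_2"]).abs()
    # Binary flag used to check the LLM's position bias.
    out["second_item_harder"] = (out["diff_2"] > out["diff_1"]).astype(int)
    # Interactions of the difficulty gap with the other item properties.
    for col in ["a_hat_1", "a_hat_2", "complexity_1", "complexity_2",
                "abs_complexity_delta", "steps_1", "steps_2",
                "abs_steps_delta", "similarity"]:
        out[f"abs_diff_delta_x_{col}"] = out["abs_diff_delta"] * out[col]
    return out
```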
3.10 Study Two: Item Parameters and Prompt Design

Unlike the comparison of item pairs (Study One), which used only one LLM (Gemini-Pro), this study explores how a variety of LLMs perform when attempting to solve NAEP items. I then ask the question, "are the items that the LLMs find more difficult also more difficult for students?" Unlike the previous study, in which hundreds of prompt formulations were imagined and explored, this study made use of only two prompts:

- Prompt 1: {{Item Content and Answers}}. What is the correct response?
- Prompt 2: {{Item Content and Answers}}. Please provide a step-by-step explanation of how to get to the solution.

Each model is given each of the 388 items at least three times. Correct answers are scored as 1 and incorrect answers as 0. For each model, the mean score across trials is calculated. Looking at Table 16 we can see the outcomes for the two different prompts. The least complex model, Llama 7b, gets the correct answer only 43% of the time for Prompt 1 and 48% for Prompt 2. The most complex model, GPT4, gets the correct answer 76% of the time under Prompt 1 and 96% of the time under Prompt 2. Most of these models show a statistically significant but small negative correlation between item difficulty and model performance on that item. This provides some modest support that LLMs respond similarly to students and find more difficult items more difficult to solve. The model with the lowest correlation between difficulty and performance was GPT-4, which performed well (above 75% correct) under Prompt 1 and very well (95% correct) under Prompt 2.

TABLE 16: LLMs ATTEMPTING EACH ITEM
This table displays the outcomes from querying a range of Large Language Models (LLMs) using two distinct prompts. Here, the score represents the model's average accuracy, while Corr(Score #, Diff) indicates the correlation between the empirical difficulty of the items and the model's score. It is important to note that with random selection the expected score is 25%. The mean empirical scores for 4th, 8th, and 12th grades are 56.1%, 53.7%, and 50.1% respectively, and 53.7% across all items.

LLM          Score one   Corr(Score one, Diff)   Score two   Corr(Score two, Diff)
GPT 3.5      0.649+      0.125**                 0.851+      0.166***
GPT 4        0.764+      0.067                   0.954+      0.064
Llama 7b     0.430-      0.130**                 0.438-      0.181***
Llama 13b    0.532-      0.137***                0.513-      0.186***
Llama 70b    0.568+      0.094*                  0.666+      0.123**
Gemini-Pro   0.604+      0.114**                 0.813+      0.105**

*10% significance, **5% significance, ***1% significance. + indicates the LLM outperforms the average test taker; - indicates the LLM underperforms the average test taker.

Despite 8 of the 12 LLM-prompt combinations outscoring the average student, the correlations between item difficulty and the model scores (Corr(Score #, Diff)) are relatively small. The largest value is 0.186, for Llama 13b under Prompt 2. To get an idea of whether these performance scores and correlations with item difficulties were in line with what is possible to observe under item response theory, I simulated 388 3PL items with random difficulties drawn from a normal distribution. I also simulated 500 random test takers with theta drawn from a standard normal distribution to estimate the classical test theory difficulty. I then sampled a range of thetas, treating each model as having its own ability parameter, with each theta attempting each item 3 times and the average score taken for that item.
I correlated that score with the empirical item difficulties to find the correlation. Then I repeated the entire simulation 200 times to find 95% confidence intervals and medians for both the score and Corr(Score #, Diff)) under the scenario (Table 17). To my surprise, I found that under a = 0.35 and c = 0.35 (admittedly cherry-picked parameters) it is possible to find matches within the 95 percentiles of both score and ρ for all six LLM models and both prompts except in the case of GPT 4 with prompt 1. It is very difficult to get the observed Corr(Score 1, Diff) of 0.067 while having a score 53% (in the case of θ=- 3.5). This of course does not prove that LLMs performance simulates student performance. However, it does show that the performance of LLMs can approximate those simulated under item response theory at least by these measures. 84 TABLE 17: SIMULATING LLM PERFORMANCE UNDER 3PL This table presents the results of simulating a 3PL model with 388 random items across various ability distributions. It includes 200 random performance samples with different Θ abilities. The columns Q5, Q50, and Q95 represent the 5th, 50th, and 95th percentiles of the scores. A model and prompt pair is considered a match if both the observed score and the empirical correlation with model performance (ρ) fall within the Q5 to Q95 range. Score Q 50 Q 5 Q 95 Q 5 ρ Q 50 Q 95 LLM Match Prompt Match Llama 7B Llama 7B Llama 13B Llama 13B Llama 70B Gemini-Pro GPT 3.5 Llama 70B GPT 4 Gemini-Pro GPT 3.5 0.453 0.465 0.485 0.505 0.527 0.547 0.572 0.597 0.623 0.649 0.678 0.706 0.428 0.443 0.463 0.482 0.501 0.523 0.550 0.570 0.601 0.626 0.655 0.686 Θ = -5.0 Θ = -4.5 Θ = -4.0 Θ = -3.5 Θ = -3.0 Θ = -2.5 Θ = -2.0 Θ = -1.5 Θ = -1.0 Θ = -0.5 Θ = 0.0 Θ = 0.5 Θ = 1.0 Θ = 1.5 Θ = 2.0 Θ = 2.5 Θ = 3.0 Θ = 3.5 Θ = 4.0 Θ = 4.5 Θ = 5.0 Θ = 5.5 Θ = 6.0 Θ = 6.5 Θ = 7.0 Note: Item parameters b drawn from random normal, a = .35, and c = .35. 0.008 0.036 0.046 0.055 0.070 0.080 0.086 0.085 0.115 0.111 0.118 0.113 0.124 0.105 0.112 0.112 0.104 0.105 0.102 0.093 0.078 0.076 0.048 0.044 0.053 0.479 0.490 0.506 0.529 0.547 0.571 0.595 0.620 0.646 0.674 0.702 0.725 0.753 0.778 0.803 0.828 0.845 0.865 0.887 0.901 0.915 0.928 0.939 0.949 0.958 0.173 0.181 0.199 0.203 0.220 0.233 0.241 0.247 0.255 0.275 0.254 0.271 0.274 0.273 0.275 0.267 0.262 0.273 0.256 0.244 0.238 0.235 0.226 0.228 0.200 0.100 0.114 0.119 0.133 0.143 0.151 0.164 0.169 0.181 0.191 0.194 0.194 0.196 0.205 0.194 0.189 0.182 0.193 0.178 0.163 0.164 0.153 0.147 0.133 0.120 0.711 0.741 0.765 0.789 0.814 0.831 0.852 0.873 0.886 0.903 0.916 0.926 0.936 0.734 0.762 0.786 0.809 0.832 0.851 0.869 0.887 0.902 0.916 0.929 0.938 0.947 GPT 4 1 2 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 0 1 1 2 1 85 I will briefly explore to what extent these responses might be leveraged to predict item difficulty. In addition to using LLM performance as an explanatory variable I generate several computational variables including “Word Count,” “Flesch-Kindcaid Index,” and (average) “Syllables per Word” (Table 18). These values are calculated for the full item text as well as the option text though item numbering and option numbering are excluded in these calculations (for example: “Item 1: What is 7% of 200. A. 7, B. 14, D. 20, E. 21” becomes “What is 7% of 200. 7, 14, 20, 21”). 
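A sketch of how such readability features can be computed is shown below. The syllable counter is a crude vowel-group heuristic and the example text is just the illustrative item from the previous paragraph, so treat this as an approximation of the feature construction rather than the exact procedure used here.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    # Standard Flesch-Kincaid grade-level formula.
    fk_grade = 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59
    return {"word_count": n_words,
            "flesch_kincaid_grade": round(fk_grade, 2),
            "syllables_per_word": round(syllables / n_words, 2)}

print(readability_features("What is 7% of 200. 7, 14, 20, 21"))
```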
TABLE 18: DEPENDENT VARIABLES AND EXPLANATORY VARIABLE STATISTICS

Variable               Mean    SD      Min    Max
Easiness               53.78   18.71   8      95
Word Count             38.63   18.22   7      105
Flesch Kincaid Index   4.58    3.43    -2.8   25.2
Syllables per Word     1.29    0.2     0.78   2.2

CHAPTER IV: RESULTS

4.1 Study One: Predicting Item Pair Rankings

In this study, I predict the performance of the classification model using the available explanatory variables. The theoretical justification for these estimates can be found in section 3.3.1. Table 19 shows the results of estimating how likely the LLM (Gemini-Pro) is to correctly rank which item in an item pair is more difficult. For interpretational purposes the results are discussed in terms of the "Model 4" column; the other columns are primarily presented for purposes of sensitivity analysis.

All non-binary variables are standardized (mean of zero and standard deviation of 1), so their coefficients can be directly interpreted as the expected change in the likelihood of the LLM correctly ranking the item pair. For example, the coefficient on the absolute value of the difference in complexity scores (|Complexity Δ|) is 0.0423, which indicates that a one standard deviation difference in the estimated complexity of the two items results in a 4.23 percentage point increase in the likelihood of the LLM correctly ranking the two items. The coefficients of binary variables, on the other hand, are best interpreted as how much moving the binary variable from 0 to 1 changes the expected ranking performance, holding all else equal. For example, the "Second Item Harder" variable is a binary flag which indicates when the second item is harder than the first item. This variable consistently demonstrated a statistically significant positive coefficient of around 0.10-0.11, which means that the LLM is about 10 to 11 percentage points more likely to predict the correct item ranking if the second item in a pair happens to be more difficult. This implies that the LLM exhibits a bias towards predicting the second-listed item as more difficult.

Results from the OLS models, as presented in Table 19, indicate several noteworthy findings. A total of 1,825 item pairs were successfully evaluated by the LLM Gemini-Pro, with the model refusing to return results for 19 pairs due to sensitivity concerns. The absolute difference in item difficulties (|Diff Δ|) showed a statistically significant effect of around an 8 percentage point increase in the likelihood of correctly ranking an item pair per one standard deviation increase in |Diff Δ|. Grade-level variables for Grade 12 showed positive and significant coefficients across models, indicating their relevance in ranking, compared with the omitted category Grade 4. Grade 8 did not have a significantly different ranking performance compared with the omitted category Grade 4. Contrary to expectations, Math items on average are estimated to be around 5 percentage points more difficult to rank than Science items. That Grade 12 items were easier to rank is consistent with my predictions, as these items are less reliant on intuition and more on structured or acquired knowledge, according to the model's estimates.

Table 19 shows the likelihood of correctly ranking pairwise items under different choices of linear explanatory variables. The total number of pairs successfully evaluated is 1825, and each pair is evaluated twice, with each item listed first or second.
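A sketch of the kind of linear probability model summarized in Table 19 is given below, using statsmodels. The DataFrame `pairs`, its column names (including `grade`, `subject`, and `correct`), and the particular variable list are illustrative assumptions carried over from the earlier sketch, not the dissertation's actual estimation code.

```python
import statsmodels.api as sm

# Standardize the non-binary explanatory variables so coefficients read as the
# change in the probability of a correct ranking per one-SD change in the predictor.
continuous = ["abs_diff_delta", "a_hat_1", "a_hat_2",
              "complexity_1", "complexity_2", "abs_complexity_delta"]
X = pairs[continuous].apply(lambda s: (s - s.mean()) / s.std())
X["second_item_harder"] = pairs["second_item_harder"]   # binary flags stay 0/1
X["grade_8"] = (pairs["grade"] == 8).astype(int)
X["grade_12"] = (pairs["grade"] == 12).astype(int)
X["math"] = (pairs["subject"] == "Mathematics").astype(int)
X = sm.add_constant(X)

ols = sm.OLS(pairs["correct"], X).fit()   # correct = 1 if the LLM ranked the pair correctly
print(ols.summary())
```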
The total number of pairs attempted to be evaluated was 1854. Unfortunately, the LLM used for these comparisons (Gemini-Pro) pairs refused to return results for 19 pairs. These pairs triggered its “safety” algorithm as they touched on sensitive topics such as evolutionary selection and gender. 88 TABLE 19: PREDICTING THE LIKELIHOOD OF CORRECTLY RANKING AN ITEM PAIR This table shows a series of ordinary least square fitted model estimates. In most models the majority of coefficients are statistically significant at varying levels. Model: No. observations: R-squared: Adj. r-squared: F-statistic: Prob (f-statistic): Coefficient Estimates: Intercept Second Item Harder |Diff Δ| Grade : 12 Grade : 8 Math 𝑎"! 𝑎"" Complexity 1 Complexity 2 |Complexity Δ| # of steps 1 # of steps 2 |# of Steps Δ| Similarity |Diff Δ|: Complexity 1 |Diff Δ|: Complexity 2 |Diff Δ|:|Complexity Δ| |Diff Δ|: # of steps 1 |Diff Δ|: # of steps 2 |Diff Δ|:|# of steps Δ| |Diff Δ|: 𝑎"! |Diff Δ|: 𝑎"! |Diff Δ|: Similarity 1 3650 0.043 0.043 82.16 1.26E-35 2 3650 0.047 0.046 36.29 2.19E-36 3 3650 0.049 0.047 26.67 6.70E-36 4 3650 0.054 0.051 14.91 1.20E-35 5 365 0.054 0.051 18.79 4.01E-37 0.5706*** 0.5624*** 0.5559*** 0.5497*** 0.5673*** 0.108*** 0.1014*** 0.0995*** 0.1155*** 0.1177*** 0.0858*** 0.0842*** 0.0837*** 0.0789*** -0.076** 0.0787*** 0.0766*** 0.0810*** -0.0034 -0.0602** -0.0168** 0.0089* 0.0006 -0.0607** 0.0046 -0.0474** -0.0174** 0.0083* 0.0175** 0.0067* 0.0423*** 0.0108* 0.0002 0.0086* -0.0049 0.0522*** 0.0574*** 0.0657*** 0.0258** 0.0072 0.0200** -0.0143** 0.0006 -0.0147* 89 The R-squared values across the six models range from 0.043 to 0.059, and the corresponding adjusted R-squared values range from 0.043 to 0.053, highlighting a modest fit to the data. These values, along with highly significant F-statistics (p-values well below the 0.05 threshold), suggest that the models capture a portion of the variability in the LLM's ability to rank item difficulty correctly though their overall predictive accuracy power is low. The inclusion of discrimination proxy variables (𝑎R# and 𝑎R$) demonstrated a complex and difficult to explain effect on the ability to estimate item relative difficulties. I had hypothesized that items with greater discrimination should be easier to rank in terms of difficulty than items with low discrimination. This does not appear to be the case. Item pairs which have higher discrimination in the first item (𝑎R#) leads to a lower ranking accuracy (1.7 percentage points) while a higher discrimination in the second item (𝑎R$) contributes to a higher ranking accuracy (0.8 percentage points). It is interesting that LLM has differences in the item pair rankings due to differences in item discrimination. Overall, though the effects are quite small and contradictory. However, the item parameters for “complexity” (Complexity 1, Complexity 2, as well as the contrast in the complexity |Complexity Δ|), aligned with my theoretical hypotheses. Items which have one standard deviation higher complexity in item 1 or item 2 have between a 1.7 and 0.6 percentage point increase respectively in likelihood of correctly ranking the items. Likewise, when there is a one standard deviation difference in complexity (|Complexity Δ|) between the two items that also corresponds to a 4.2 percentage point increase in the likelihood of correctly ranking the item pair. The complexity coefficients indicate that they are statistically significant predictors of the LLM’s ability to detect relative item difficulty. 
90 In a similar vein, the variable related to number of steps (# of Steps 1, # of Steps 2, and (|# of Steps Δ|) are also positively associated with bettor predictors of item difficulties though these coefficients tend to be very small. Between these two sets of variables, the hypothesis that more complex or involved items are easier to predict relative difficulty is supported. In summary, the empirical results support the theoretical expectations posited in section 3.3.1, with the LLM displaying a measurable ability to predict relative item difficulties. Item discrimination and complexity have been substantiated as significant factors. That said the LLM suffered from a clear bias with items appearing second 10 percentage points more likely to be seen as more difficult than items appearing in the first space. 4.2 Study Two: Do LLM Simulate Student/Test Taker Responses? LLM performance is demonstrated to be a statistically significant if modest predictor of item difficulty (Table 20). Model 2 includes only indicator variables for grade and subject as well as some computational variables, and has an R-squared of 7% and adjusted R-squared of 5.5% while including LLM performance as a predictor under either prompt (Models 3 and 4) increases the R-squared value to 11.9% to 13.1% and the adjusted R-squared value to 9.1% and 10.3% respectively. The individual coefficients on the LLM models are consistently positive and generally statistically significant though prompt specification seems to play an important role. We can interpret the coefficients on prompts directly such that for example under Prompt 1 if Llama 7b got an item correct on all three attempts then we should expect the item to be 6.15 percentage points easier than if it had gotten the item incorrect all three attempts. By adding up the coefficients in the explanatory variables we can get a sense for what the maximum variability in difficulty can be explained through LLM performance. Under Prompt 1 we get a 91 total of 17.25 percentage points and 20.47 for prompt 2 which when items range in difficulty (easiness) between 8 and 95 this is only about 20% of the range possible. TABLE 20: PREDICTING ITEM DIFFICULTY FROM LLM PERFORMANCE The following table has 1-Difficulty as the dependent variable so that the LLM coefficients are positively valued. Model: No. observations: R-squared: Adj. r-squared: F-statistic: Prob (f-statistic): Coefficient Estimates: Intercept Grade 12 Grade 8 Math Word count 1 388 0.017 0.009 2.155 9.29E-02 2 388 0.070 0.055 4.754 1.09E-04 3 388 0.119 0.091 4.237 2.76E-06 56.22*** 60.85*** 56.16*** -4.44** -5.78** -1.76* -2.36* 0.92 -0.86 -0.23*** -0.17 3.30 -3.00* -0.47 -3.40* -0.25*** -0.22 0.25 1.23 3.03* 6.15** 2.81* 1.33 2.70* Flesch Kincaid Index Syllables per word GPT35 GPT4 Llama7b Llama13b Llama70b Gemini pro GPT35 GPT4 Llama7b Llama13b Llama70b Gemini pro *10% significance, **5% significance, ***1% significance 1 t p m o r P 2 t p m o r P 4 386 0.131 0.103 4.683 4.06E-07 51.16*** -3.72* -0.44 -1.90* -0.23*** -0.06 0.05 4.93* 1.13 4.45* 7.30*** -0.12 2.78* 5 386 0.141 0.099 3.35 6.12E-06 51.46*** -3.25* -0.26 -3.21* -0.24*** -0.12 -0.84 0.37 2.70* 3.69* -0.52 -0.68 1.91 4.31* 1.00 3.38* 6.68** -0.20 1.30 The increase in performance going from using either Prompt 1 or Prompt 2 performance to using both Prompt 1 and Prompt 2 performances is slight in terms of R-squared and minimal 92 to negative in terms of adjusted R-squared. 
In terms of the sum of the prompt coefficients, we go from 17.25 for Prompt 1 and 20.47 for Prompt 2 to a combined coefficient sum of 23.91. The reason this improvement is so limited is that model performance is much more closely correlated with the performance of the other models (Table 21) than it is with the difficulty of the items evaluated (Table 16). This finding is not entirely surprising, as in at least one study (Bejar, 1983) expert judges were found to have higher interrater correlations than the correlations between their estimated item rankings and the empirical rankings.

TABLE 21: PARTIAL CORRELATIONS TABLE BETWEEN MODELS
The following table shows the correlations of item-level performance between each model under Prompt 1 and the other models under Prompt 1 or Prompt 2.

                                        Prompt 1
                         GPT 3.5   Llama 7b   Llama 13b   GPT 4    Llama 70b   Gemini-Pro
Prompt 1   GPT 3.5       *         0.319      0.380       0.250    0.401       0.485
Prompt 1   GPT 4         0.250     0.116      0.120       *        0.156       0.451
Prompt 1   Llama 7b      0.319     *          0.426       0.116    0.291       0.267
Prompt 1   Llama 13b     0.380     0.426      *           0.120    0.443       0.241
Prompt 1   Llama 70b     0.401     0.291      0.443       0.156    *           0.363
Prompt 1   Gemini-Pro    0.485     0.267      0.241       0.451    0.363       *
Prompt 2   GPT 3.5       0.421     0.247      0.303       0.046    0.281       0.219
Prompt 2   GPT 4         0.072     0.021      -0.004      0.127    0.008       0.008
Prompt 2   Llama 7b      0.208     0.439      0.360       0.152    0.339       0.200
Prompt 2   Llama 13b     0.335     0.421      0.592       0.155    0.459       0.264
Prompt 2   Llama 70b     0.283     0.305      0.435       0.153    0.467       0.293
Prompt 2   Gemini-Pro    0.299     0.227      0.276       0.155    0.329       0.498
           Mean          0.314     0.280      0.325       0.171    0.322       0.299

*Note: These values are 1 but are left out so as not to distort the mean calculations.

While the correlation between model performance in this study does not imply that the same will hold for future LLMs, it does imply a limitation on how far the current models can be taken with the current prompts. These results suggest that adding additional LLMs or additional prompt variants (in line with those under Prompts 1 and 2), while likely to increase the total explanatory power of the model, will have diminishing marginal effectiveness at predicting item difficulty.

4.3 Study Three: Binary Search Estimation of Absolute Item Difficulty

In this dissertation I propose the use of a series of binary pairwise comparisons to predict item difficulty. These predictions are made through a logistic regression with item indicators as the explanatory variables, with the first item listed coded positively and the second item listed coded negatively. This explanatory indicator matrix had 388 columns with values 0, 1, or -1. The dependent variable, which item was chosen as more difficult, was coded as 1 or 0 depending upon whether that item was listed first or second. No intercept was included in the logistic regression. Using this method I was able to recover coefficient estimates for the 258 items included in the testing data for which I had pairwise estimates of relative difficulty from Study One. This method produced estimates of item difficulty in 1PL item response theory form (Table 22).
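A minimal sketch of this estimator, which is essentially a Bradley-Terry-style logit on signed item indicators, is shown before the results. The DataFrame `comparisons` and its columns `item_1`, `item_2`, and `first_listed_harder` are hypothetical names, and the light ridge penalty is my addition to pin down the otherwise arbitrary common shift in the coefficients; it is not the dissertation's exact estimation routine.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def estimate_difficulties(comparisons: pd.DataFrame) -> pd.Series:
    """Recover relative item difficulties from binary 'which is harder' judgments."""
    items = sorted(set(comparisons["item_1"]) | set(comparisons["item_2"]))
    idx = {item: j for j, item in enumerate(items)}
    # Signed indicator design: +1 for the first-listed item, -1 for the second.
    X = np.zeros((len(comparisons), len(items)))
    for row, (i1, i2) in enumerate(zip(comparisons["item_1"], comparisons["item_2"])):
        X[row, idx[i1]] = 1.0
        X[row, idx[i2]] = -1.0
    y = comparisons["first_listed_harder"].to_numpy()
    # No intercept; the coefficients play the role of 1PL-style difficulties.
    model = LogisticRegression(fit_intercept=False, penalty="l2", C=10.0, max_iter=1000)
    model.fit(X, y)
    return pd.Series(model.coef_[0], index=items)
```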
94 TABLE 22: ESTIMATED ITEM DIFFICULTY Grade Subject Mathematics 4 Science 4 Mathematics 8 8 Science 12 Mathematics Science 12 Mean 0.00 0.00 0.00 0.00 0.00 0.00 Standard Dev 1.36 0.77 1.27 0.95 1.48 1.19 Q 05 Q 25 Q 50 Q 75 Q 95 Count -2.15 -1.04 -2.06 -1.21 -2.20 -1.62 0.97 0.13 0.83 0.00 0.13 1.01 -0.01 0.65 -0.05 1.19 0.74 0.01 -1.02 -0.60 -0.96 -0.67 -1.21 -0.80 2.25 1.03 1.84 1.46 2.49 1.84 Total 85 16 74 17 39 27 258 Superficially the item summary parameters from these items appear acceptable but what we would really like to know is how well these item parameters are correlated with the population performance statistics reported on each item by NAEP. To do this, I correlate performance of the pooled population data as well as demographic groups such as gender, race, and free and reduced-price lunch (FRPL). We can see that the pooled population data is 38.9% correlated with the item parameter estimates (Table 23). We would like to see a high correlation between the estimated item scores and the observed item performances. A correlation that falls for our population between the mid-30s and the low 40s while certainly statistically non-random is lower than could be hoped. 95 TABLE 23: ITEM DIFFICULTY ESTIMATE CORRELATED WITH POPULATION PERFORMANCE This table gives population mean performance scores averaged across the 258 items as well as the correlation of those scores with the estimated item difficulties generated by using the item pairs as indicators of which item is more difficult. The table to the right also presented the results of a bootstrap simulation of the correlation statistics after random resampling 500 draws from the source data. Bootstrapped headers are standard deviation (SD) and quantiles (Q) 25, 50, and 75. Grouping: Population Population Mean Score Corr Between Est. Diff & Pop. Score) Bootstrapped Corr Statistics 0.530 0.544 0.519 0.571 0.421 0.454 0.608 0.540 0.582 0.550 0.567 0.462 0.589 All Gender: Male Gender: Female Race: White Race: Black Race: Hispanic Race: Asian/Pacific Islander Location: City Location: Suburb Location: Town Location: Rural FRPL Eligibility: Eligible FRPL Eligibility: Not eligible FRPL Eligibility: Information not available Pairwise Comparisons Count Bold Indicates the maximum for the group. None of the differences in Correlation by grouping are large enough to be statistically significant at any level. 0.389 0.398 0.343 0.414 0.363 0.377 0.344 0.348 0.354 0.364 0.373 0.339 0.365 0.360 3,660 0.589 0.369 0.074 0.319 Q 50 0.394 0.400 0.347 0.418 0.368 0.384 0.349 0.355 0.358 0.369 0.379 0.347 0.371 SD 0.058 0.058 0.062 0.058 0.062 0.062 0.068 0.113 0.113 0.115 0.112 0.075 0.073 Q 25 0.353 0.363 0.308 0.381 0.325 0.342 0.302 0.283 0.287 0.295 0.307 0.299 0.326 Q 75 0.429 0.440 0.386 0.454 0.403 0.417 0.396 0.433 0.439 0.452 0.460 0.392 0.422 0.415 From Table 23 we can see that some populations have higher correlations between the item estimates and the population performance. These population groups are male, white, rural, and FRPL: Not eligible. These populations do not correspond with the highest performing population except in the case of male. From the bootstrapped standard errors, it would be 96 advisable not to read too much into the different correlations by population group as there is no statistical significance of the difference between these different correlations. 
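The bootstrap behind the right-hand columns of Table 23 can be sketched as follows; the column names `est_difficulty` and `pop_score` are hypothetical placeholders for the estimated item difficulties and the NAEP population scores, and the 500 draws mirror the table caption.

```python
import numpy as np
import pandas as pd

def bootstrap_correlation(df: pd.DataFrame, n_draws: int = 500, seed: int = 0) -> pd.Series:
    """Resample items with replacement and recompute the difficulty/score correlation."""
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(n_draws):
        sample = df.sample(n=len(df), replace=True, random_state=int(rng.integers(1 << 31)))
        corrs.append(sample["est_difficulty"].corr(sample["pop_score"]))
    corrs = pd.Series(corrs)
    # Report the spread statistics used in Table 23: SD and the 25th/50th/75th percentiles.
    return corrs.describe(percentiles=[0.25, 0.5, 0.75])[["std", "25%", "50%", "75%"]]
```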
Even the largest gap which is between white and Asian/Pacific-islander is only .07 of a correlation which is less than one joint standard deviation away (0.089 = (0.058$ + 0.068$)#/$ ). So far in this analysis I have paired each item with each other in each test (segregated by grade, year, and subject) twice with item 1 in the first position in one case and in the second position in the second case. If we were to make the limited assumption that the population performance is stable over time, we can expand our item pairing pool to see to what extent having more items to compare performance against might improve our item difficulty predictions. Table 24 does just this by exploring how predictions of item difficulty change if we pool items across tests to expand our matching pool. “Pooled estimation” match compares all testing items which share grade and subject and estimates all parameters simultaneously. While “Item 2 Known Parameters” provides the All-Student difficulty estimate for item 2 and generates and estimate for item 1 for all items paired with item 1. 97 TABLE 24: POOLING GRADE AND SUBJECT ITEMS OVER YEARS This table shows correlations between the estimated item difficulties pairing up items across any year but keeping within subject and grade. Grouping All Students Gender: Male Gender: Female Race: White Race: Black Race: Hispanic Race: Asian/Pacific Islander Location: City Location: Suburb Location: Town Location: Rural FRPL Eligibility: Eligible FRPL Eligibility: Not eligible FRPL Eligibility: Information not available Pooled Joint Estimation Known Item Parameters (1PL) Pairwise Comparisons Count Correlations Mean Root Squared Error / Mean Absolute Error Pooled Estimation Item 2 Known Parameters Pooled Estimation 0.41 0.41 0.36 0.43 0.40 0.40 0.38 0.42 0.42 0.43 0.44 0.42 0.41 0.42 0.42 0.37 0.44 0.40 0.41 0.40 0.41 0.42 0.43 0.44 0.43 0.43 0.40 1.00 0.98 30,452 0.41 0.98 1.00 30,452 28.4 / 23.5 28.7 / 23.7 29.5 / 24.1 28.8 / 23.8 29.5 / 23.8 28.8 / 23.4 31.1 / 25.7 28.1 / 23.6 29.4 / 24.9 28.2 / 23.6 28.7 / 24.0 28.0 / 23.0 30.0 / 24.9 30.2 / 25.2 0 / 0 8.8 / 7.7 30,452 Item 2 Known Parameters 23.0 / 18.4 23.0 / 18.5 24.2 / 19.0 23.2 / 18.7 25.5 / 20.0 24.2 / 19.2 24.9 / 20.4 22.2 / 17.7 23.0 / 18.8 22.5 / 17.8 22.7 / 18.1 23.3 / 18.2 24.0 / 19.6 24.3 / 19.8 8.8 / 7.7 0 / 0 30,452 The first column of this table shows what happens when we increase our pairwise number of comparisons by pooling across years. This is an 8-fold increase in pairwise comparisons with the number increasing from 3,660 to 30,452. There is no guarantee that increasing the comparisons in this way should work as we are trading increased sample size, and correspondingly decreased sampling error, for the introduction of error due to time related changes in population ability as well as item drift. Fortunately, the gains from pooling items seem to outweigh the introduction of this new error as the pooled estimation correlations 98 across all tested populations in Table 24 are monotonically equal to or greater than those of the correlation column found in Table 23. FIGURE 5: ITEM ESTIMATES MAPPED TO ITEM DIFFICULTY BINS Grouping items into empirical difficulties bins (x axis) and estimated item difficulty (y axis) this figure shows the relationship between the two. The circle represents the mid-point of the empirical bin while the x-axis shows the bin for All Population difficulty. Embedded in the circles are the counts of the number of items in the bin. 
The center line in the box is the median which the upper and lower box parts are the bottom and upper quartiles while the top and bottom lines are maximum and mins as well as a few points for the outliers. While evaluating estimator performance it is worth exploring how well item difficulty varies with items in different subjects and grades. Table 25 explores how the statistics between the general population estimate of item difficulty and the estimated item difficulty from the LLM pairwise comparisons perform. It shows that grades and subject correlations are highly 99 variable with peak performance in grade 12 mathematics with a correlation estimated at around 0.66. While still short of the “target” of 0.8, this point estimate appears promising. Another way of evaluating item parameter estimates is by looking at the root mean squared error (RMSE) as well as the mean absolute error (MAE). Both these measures give a consistent estimate of the expected absolute “difference” for a random item and its estimate using the two methods. RMSE is a strictly convex estimator while MAE is not. Looking at just the best case, MAE with “Item 2 Known Parameters” we see that the estimated MAE is between 17 and 20 percentage points (Table 23) and between 15 and 22 percentage points (Table 24). TABLE 25: ESTIMATED ITEM DIFFICULTY STATISTICS BY GRADE AND SUBJECT This table presents mean correlations, root mean squared error (RMSE), and mean absolute error (MAE). From this table we can see that observed correlation of 12th grade math item difficulty estimates are the most precise with a higher correlation between the all-student difficulty and a lower RMSE and MAE. Mean Correlation Mean Root Squared Error / Mean Absolute Error Grade Subject Item Count Pooled Estimation Item 2 Known Parameters Pooled Estimation Science 4 Mathematics 4 8 Mathematics 8 12 Mathematics 12 Science Science 85 16 74 17 39 27 0.341 0.459 0.385 0.428 0.659 0.256 0.350 0.378 0.371 0.349 0.650 0.271 29.77 / 24.44 25.26 / 21.26 29.42 / 24.91 26.36 / 22.53 24.83 / 20.54 28.95 / 22.99 Item 2 Known Parameters 24.04 / 18.67 25.46 / 21.73 23.24 / 19.13 21.43 / 18.15 19.40 / 15.09 23.13 / 18.22 It is sometimes helpful to look at the QQ plot (quantile-quantile plot) which shows the distributional relationship between two different samples. If both samples are drawn 100 from the same underlying distribution, allowing for different distribution parameters, then the quantile values plotted against each other should form a linear relationship. Looking at Figure 6 we can see that the quantiles matched against each other for both the pooled and the “Item 2 Known Parameters” estimators have a strong linear relationship suggesting that both observed and estimated item parameters are drawn from the same distribution. That said there is a bit of drift moving away from linear on the upper and lower tails of both plots potentially indicating that these methods are less precise when estimating extreme item parameters. FIGURE 6: PAIRWISE COMPARISON ITEM PARAMETERS These figures show a linear relationship between the quantiles of the estimated item parameters and the empirical item parameters. We might be concerned that item difficulties have drifted over time such that using LLM to estimate items might be less or more accurate depending upon the year the items were administered. 
Looking at Table 26, while there appears to be some heterogeneity by year of 101 administration in terms of how effective the LLM estimators are it is unclear to what extent this heterogeneity represents random sampling error. Looking at the overall time trend between average correlation of the estimator and year yields point estimates very close to zero. 102 TABLE 26: ESTIMATED ITEM DIFFICULTY CORRELATIONS BY YEAR The following table shows the correlation between All Student’s item difficulty and item estimates for the pooled item parameter estimation and the item 2 known parameters grouped by year. The correlation between yearly correlations and year is only -.02 indicating that there does not appear to be a statistically significant trend effects on item performance as measured by correlations with true difficulties. Correlations Mean Root Squared Error / Mean Absolute Error Count Pooled Estimation Item 2 Known Parameters Pooled Estimation Item 2 Known Parameters Math Science 0.31 0.44 0.68 0.50 0.45 0.28 0.32 0.69 0.35 0.60 0.35 0.31 0.46 0.65 0.51 0.42 0.30 0.29 0.53 0.35 0.56 0.41 31.16 / 25.19 28.93 / 22.95 26.24 / 23.77 25.19 / 20.71 28.05 / 22.84 27.44 / 22.75 26.34 / 22.25 16.56 / 14.47 32.42 / 27.57 28.43 / 24.27 29.66 / 25.77 26.60 / 21.49 24.30 / 18.76 19.36 / 16.32 21.48 / 17.77 22.84 / 18.07 22.74 / 17.98 21.34 / 16.93 21.65 / 19.28 24.50 / 19.93 21.84 / 17.36 21.98 / 18.45 13 41 13 44 13 19 31 24 16 25 6 3 10 Year 1990 1992 1996 2000 2003 2005 2007 2009 2011 2013 2019 103 CHAPTER V: DISCUSSION AND CONCLUSION 5.1 Discussion This dissertation investigates the potential use of large language models (LLMs) to estimate item difficulty via two distinct methods. The most effective method involves pairwise comparison, where LLMs rank item pairs based on difficulty. The other method uses an ensemble of responses from various LLMs attempting to solve the items, using their success or failure as indicators of item difficulty. Both methods are statistically significant, yet less precise predictors of item fit compared to the requirements need to substitute item pretesting in a professional context. The approach presented here could be compared with three alternative approaches used in practice or explored in the literature. These approaches are pretesting, subject matter expert review, and the deployment of computational models (machine learning approaches). Each of the existing approaches has its limitations. Pretesting, the gold standard, is expensive; subject matter expert reviews are also costly and imprecise; and computational models require a large number of already calibrated items, typically in the thousands, to build a predictive model. The use of LLMs, however, presents an alternative approach to item calibration. While LLMs have not yet been demonstrated to be sufficiently precise for professional testing contexts, they are inexpensive and require few, if any, already calibrated items. This initial exploration of using LLMs to calibrate items has shown mixed results, with the most promising among 12th-grade math items. It appears that within 12th-grade mathematics, some item types are well-suited for LLMs, while others are particularly 104 challenging. My intuition suggests that items relying on visual features may be especially difficult for LLMs. A logical next step would be to identify which item types among 12th-grade math items LLMs predict difficulty well. 
Additionally, we should investigate whether this proficiency extends to other higher-level math or reasoning items, such as those used in college-level math, logic, or computer science classes. Furthermore, can LLMs effectively predict the difficulty of items in a professional testing context, such as the Graduate Record Examination (GRE) quantitative reasoning sections, Law School Admission Test (LSAT) analytical reasoning or logical reasoning items, or other professional exams that include a mathematical analysis component?

Numerous scholarly studies on optimal test design for item calibration (Stocking, 1990; Jones and Jin, 1994; Buyske, 1998; van der Linden and Ren, 2015; He and Chen, 2020, among others) emphasize the cost reduction achieved by combining estimates of test taker ability with estimates of item parameters when selecting items. Therefore, if a professional testing program adopts an LLM precalibration step before pretesting, it is likely to reduce the cost of item calibration. The extent of this cost reduction depends on how well these approaches fit the program's item bank and on the degree to which other item precalibration techniques, such as the computational methods previously discussed, have already been implemented.

Outside of professional testing programs, these techniques have ready applications in online interactive tutoring environments. In these settings, the goal is not primarily to optimize the acquisition of information about student abilities (as it is in computer adaptive testing). Instead, the focus is on providing items that are closely targeted to a student's current ability level, ensuring that the items are neither too easy (which would result in little learning) nor too difficult (which might discourage students). Even noisy difficulty estimates from LLMs, like those generated in this study, could be sufficiently precise for these applications, particularly for novel items. Once an item has been seen by enough test takers, these preliminary estimates can either be discarded or used as priors updated by actual student performance.

In both the professional test development context and the online tutoring context, having a method like that provided by LLMs for generating initial estimates of item difficulty offers significant advantages. The approach is low-cost, poses no exposure risk to examinees, and does not depend on training with existing item types. The following sections discuss several notable features of using LLMs for item calibration, including the costs faced in the current market and limitations such as item security, the challenge of correlated LLM responses, and model bias.

5.1.1 Usage Fees of LLMs

An important feature of LLMs to discuss is their cost. In general, the pairwise approach explored in this paper generates large quantities of text. Generating the pooled data, for example, involved roughly 30 thousand paired prompts, about 5.2 million words of input, and about 4.4 million words of output (Table 27). Because Gemini-Pro was free for development use, this exploration cost nothing; had I used GPT3.5, it would have cost, at current pricing, around $5 for the output and about $2 for the input. The next-generation LLM from OpenAI (GPT4-Turbo) would have cost about $40 for the input and $100 for the output. Under a production plan, Google's Gemini-Pro would have been in the same general range as GPT3.5, at around $4 for the input and $12 for the output.
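These figures, and those in Table 27 below, follow from simple per-token arithmetic. A minimal sketch of that arithmetic (the token-per-word ratio and the per-million-token prices below are assumptions for illustration, not any provider's actual rates):

# Illustrative cost arithmetic; the ratio and prices are placeholder assumptions.
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English text

PRICES = {  # (input $ per 1M tokens, output $ per 1M tokens) -- hypothetical values
    "small_model": (0.50, 1.50),
    "large_model": (10.00, 30.00),
}

def estimate_cost(words_in, words_out, price_in, price_out):
    """Approximate API cost in dollars for a batch of prompts and responses."""
    tokens_in = words_in * TOKENS_PER_WORD
    tokens_out = words_out * TOKENS_PER_WORD
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Pooled pairwise-comparison run: about 5.2M words in and 4.4M words out (Table 27).
for name, (p_in, p_out) in PRICES.items():
    print(name, round(estimate_cost(5_245_852, 4_408_855, p_in, p_out), 2))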
As there is a tremendous amount of ongoing innovation and competitive pressure in this market, I expect new models to continue to be released on an ongoing basis and downward pressure on prices to continue. This breakdown of costs is a lower bound on the actual cost of implementing this study, as many of the prompts ended up being repeated when the response was uninterpretable or when some kind of user error was discovered. In addition, other models were leveraged in this research, such as the various Llama models and the Mistral model, but these models were only lightly used and their cost is even lower than that of GPT3.5.

TABLE 27: COST ESTIMATE OF LLM USAGE

Prompts
               Queries   Words        Characters   Gemini-Pro*  GPT3.5-Turbo  GPT4.0-Turbo  GPT4.0
Pooled Items   30,452    5,245,852    31,614,610   $3.95        $1.97         $39.34        $118.03
Training Data  95,220    10,482,765   60,759,951   $7.59        $3.93         $78.62        $235.86

Responses
               Words        Characters   Gemini-Pro*  GPT3.5-Turbo  GPT4.0-Turbo  GPT4.0
Pooled Items   4,408,855    31,118,358   $11.67       $4.96         $99.20        $198.40
Training Data  5,804,394    41,046,116   $15.39       $6.53         $130.60       $261.20

*Gemini-Pro is currently free for development (research) use.

5.1.2 Limitations

Item Security

A major drawback not yet discussed in this paper is that the large LLM APIs explored here offer no data security: data submitted to the LLM for evaluation may be used for future training. To what extent this weakness can be exploited is not entirely known. As such, it would be too risky for a test developer to expose sensitive items to the LLMs evaluated in this dissertation. This is, however, a well-known limitation of these LLM APIs, and data privacy is likely to be a purchasable feature in professional applications of LLMs. If secure LLMs become available, they could be leveraged to generate estimates of item difficulty in a manner far more secure than the dominant paradigm of item pretesting.

LLMs' Opinions Being Highly Correlated

Pooling the responses from diverse LLMs initially seems like an ideal way of reducing sampling and model error. Using LLMs to build an ensemble model to predict item properties is appealing in that new LLMs are constantly under development and being released, both through APIs and as public open-source models. However, a major limitation revealed in this dissertation is that the responses of LLMs to certain tasks are much more correlated than one would expect. This limitation was a major challenge for using ensemble models to predict item difficulties (Study Two). Unfortunately, while supplementing item performance with additional responses from additional LLMs does appear to improve performance, the improvement has diminishing marginal returns. This is driven by responses from different LLMs being much more highly correlated with each other than with the observed data. Although not presented in this dissertation, I also pooled pairwise responses from GPT3.5, which was much worse at estimating relative item difficulties, with those of Gemini-Pro; pooling across these responses did not improve the pairwise estimates of item difficulty. Given that the errors are positively correlated, this is an unsurprising if disappointing result.

The positive correlation of errors is an interesting discovery that might limit future applications of this technology. Or it might prove to be a transient feature of the current LLM training paradigm, in which various models share much of the same training content (for example, Wikipedia and Reddit), and future LLMs may develop more independent training content such that their responses become more diverse.
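The diminishing returns from pooling can be made concrete with a small simulation (illustrative only, not an analysis from this dissertation; the error correlation used below is an assumed value). With a pairwise error correlation of rho, the variance of the mean of n estimators is sigma^2 * (1/n + (n - 1) * rho / n), which approaches rho * sigma^2 rather than zero as n grows:

import numpy as np

# Simulate pooling n positively correlated error terms and compare the observed
# variance of their mean with the analytic value sigma^2 * (1/n + (n-1)*rho/n).
rng = np.random.default_rng(0)
sigma, rho = 1.0, 0.6  # assumed error scale and cross-LLM error correlation
for n in (1, 2, 5, 10, 50):
    cov = sigma**2 * (np.full((n, n), rho) + (1 - rho) * np.eye(n))
    errors = rng.multivariate_normal(np.zeros(n), cov, size=20_000)
    simulated = errors.mean(axis=1).var()
    analytic = sigma**2 * (1 / n + (n - 1) * rho / n)
    print(n, round(simulated, 3), round(analytic, 3))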
Model Bias

Upon analyzing Tables 4.6.2 and 4.6.3, a discrepancy emerges in the estimated item difficulties across demographic groups. Notably, estimates for the male and white demographic groups exhibit higher correlations and lower average errors, although there is no statistical evidence to confirm significant disparities between these groups and others. Nonetheless, this trend is concerning, echoing the perceived societal privilege often associated with these groups. In an ideal scenario, item difficulty estimators would perform uniformly well across all groups. Cutting against this pattern, students identified as rural are fit better by the model than the more affluent, higher-performing students living in the suburbs. However, the difference in fit for this grouping is small (2 points) compared with that for gender (5 points) and race (4 points). That said, random error will cause any set of point estimates to appear to favor one group, though large sample sizes should reduce the effect of this kind of random error. There is not sufficient data in this study to say one way or the other whether these methods of estimating item difficulty favor or harm any population group. That said, the method is flexible, and the prompt-optimizing algorithm could potentially generate prompts tailored to each population group being studied. This might be done by explicitly attempting to align the LLM to simulate a student of a particular type, such as "you are a female grade X student," or by allowing the prompt selection algorithm to find prompt optima based on the estimated difficulties for that group. Overall, though, this is a topic for future research: it is unknown both how much LLM model bias is a factor and to what extent it can be mitigated through these methods.

5.2 Future Research

5.2.1 Using a Less Visual and More Cognitive Item Bank

This dissertation investigates the potential of using Large Language Models (LLMs) as tools for estimating item parameters. The study focuses on math and science questions at the 4th-, 8th-, and 12th-grade levels, but there are significant constraints to be acknowledged. A primary limitation is that the majority of the National Assessment of Educational Progress (NAEP) items analyzed involve visual elements. Many NAEP items incorporating visuals were excluded from this analysis, and even among the selected items, reliance on pictorial features is common. Given the current limitations of visual LLMs compared with their text-based counterparts, I had to resort to using text descriptions (508 alternative text) to inform the LLM about these visual components. This misalignment in information format may contribute to the less-than-optimal correlation between the LLM-estimated parameters and the actual item parameters. LLMs with "mixed modalities" or vision features are an area of intense ongoing research, and revisiting this study and these items once development in this field has progressed further might be fruitful. My intuition suggests that LLMs might show improved performance when evaluating text-based items, but this is a hypothesis that needs to be tested. Notably, the LLMs performed best with 12th-grade mathematics items.
This performance bump aligned with my a priori prediction that the LLMs would perform well on these items because of the complex, cumulatively developed skills needed in high school-level mathematics. In contrast, many 4th-grade items seem to depend more on intuitive understanding acquired through everyday experience than on structured learning. However, science items at the 12th-grade level were among the worst items to predict. As a possible future application of this research, it would be useful to apply the approach to additional math or related items at the high school or college level to see whether the models continue to perform well in this area.

5.2.2 Optimizing Prompts by Subject and Grade

In the pairwise prompt study explored in this paper, a single prompt was selected that optimized classification in the training data across both subjects and grades 4, 8, and 12. This prompt was selected through a genetic algorithm that evaluated dozens of prompts and cross-bred them to create generations of child prompts. This procedure was highly effective, moving the expected correct classification rate from around 55% to 62%. A more nuanced approach, generating and evaluating prompts specific to each grade and subject combination, would likely have yielded some additional predictive improvement. In this dissertation, the same prompt template was used for both science and math items, and that prompt performed very differently for the two sets of items, with math items being ranked for difficulty much more accurately than science items. As math items were also about four times more numerous, it is possible that this divergence is largely driven by the prompt being optimized for the math items rather than the science items. This amalgamation across subjects may have inadvertently led to a less-than-optimal choice of prompt for the science questions.

Study Two in this dissertation explored only two different prompts for item solving. However, these prompts resulted in quite divergent performance of the underlying models, with large differences in how often the items were correctly solved. This was particularly striking with GPT4, which solved 96% of the items correctly under the second prompt. As the pressure on LLMs is continually to make them better problem solvers, it is likely that future LLMs will continue to improve in performance. As such, using even highly capable LLMs to estimate item difficulty through their incorrect responses might be explored with prompts intended to encourage the model to make mistakes by simulating the non-optimal behaviors of students, either by prompting the models to take on certain personas, such as "as a student rushing through questions thoughtlessly, how would you answer…", or through targeted knowledge gaps, such as "you are a student who confuses the formula for a circle's area with that of the volume; how would you answer…"?

5.2.3 Applying LLM Estimation Methods to Item Clones

The items released by NAEP in both science and mathematics are quite diverse, with topics spanning numerous courses in these subjects, such as biology, environmental science, chemistry, and physics in science, and arithmetic, geometry, and algebra in mathematics. While it is remarkable that LLMs have any predictive power in estimating the relative difficulty of such diverse items, pools of items this diverse might not be the best application of this approach.
Considering these potential issues, a more effective approach for future research could start with a larger and more granularly categorized pool of items. Within such a system, comparisons would preferably be made between items that are more closely related in content. Such a finely tuned method could also be helpful for evaluating "item clones," variations of original test items. These clones can be created by automated generation tools or crafted by item writers. Although the parameters of the original (parent) item might be known, the corresponding parameters of the clones typically are not. For many applications of these clones it is not necessary to pinpoint the exact parameters; it is sufficient to ensure they reasonably approximate those of the original item. Statistical methods can be useful for detecting significant deviation in item clone parameters once an item is in circulation, but early detection of irregularities could potentially be enhanced using LLMs. Leveraging LLMs in this context offers the promise of catching aberrant item behavior before clones are administered, providing a proactive method of maintaining the integrity of an automated testing framework.

5.2.4 Combining LLM-Generated Features with Traditional ML Tools

The use of collateral item information, combined with flexible machine learning models, to aid in the estimation of item parameters was demonstrated by Mislevy et al. (1993). In their paper, however, the collateral information was generated from directly observable item features as well as the input of expert judges. While the knowledge and skills of expert judges cannot easily be replaced, expert review is also costly, and it has been demonstrated here that LLMs can provide a very affordable, if less sophisticated, substitute. An ideal future exploration would be to see to what extent collateral information about items could be generated by LLMs and used to predict item difficulties. Beyond aiding computational models in predicting item difficulties, Stout et al. (2003) suggest that having collateral information could reduce the examinee pool size needed to calibrate items.

5.2.5 Use of LLMs in Cognitive Diagnostic Assessment

This dissertation has shown that LLMs have a remarkable ability to rank items pairwise, with a correct response rate of 62% that outperforms chance. This task, which often challenges even human experts, involves synthesizing information from various knowledge sources and represents a significant cognitive feat by the LLMs. However, the precision they achieve falls short of the standard needed to substitute for standard item calibration protocols in a professional testing context. They also show limitations, such as decreased accuracy in the presence of information overload (as seen in the N-shot examples) and a tendency to show bias toward items presented later in a pair. This bias may be a result of limited attentional focus, or a reflection of common testing practice in which easier items precede more difficult ones.

That said, beyond mere item ranking, these models generate explanations that could contribute more than just estimates of item difficulty. LLMs are capable not only of attempting to solve items but also of outlining the reasoning involved in those attempts. By mining these reasoning attempts, it might be possible to garner insight into the underlying reasoning steps taken by students attempting to solve these items.
These reasoning steps identified by the LLM need not be obscure or complex; they could simply outline some of the procedural features of an item required to find the solution. For example, the two items "21+43=?" and "19+45=?" look largely the same and have the same answer, yet those solving the items by hand know that the second item involves increasing the tens digit by one ("carrying the one") while the first item only involves adding each digit in place. While custom scripts can be written to identify procedural techniques such as "carrying the one" for a given item type, writing and implementing them for the dozens to hundreds of procedural steps typically acquired by students across a variety of item types is a significant burden. Yet being able to identify the procedural rules required by each item could be extremely helpful for designing tests and test performance reports that not only identify students' overall abilities but also precisely diagnose and report procedural weaknesses. This is of course not a new idea in educational measurement, as reflected in the literature on Cognitive Diagnostic Assessment (Bejar, 1984; Huff and Goodman, 2007; Leighton and Gierl, 2007; Sun and Suzuki, 2013; Delgado et al., 2019; among many others), and many of these diagnostic procedures rely upon statistical techniques for inferring cognitive diagnostic features, which at times are difficult to interpret and act on. LLMs, however, present a new opportunity: never has such a low-cost option existed for generating a list of the steps required to solve items. By exploiting this opportunity, testing programs might make significant strides toward the goal of providing actionable information that teachers, students, and parents can use to bridge knowledge gaps.

5.2.6 A Reflection on the "Cognitive Capacity" of LLMs

It occurs to me that while the capabilities of LLMs are much praised and criticized, they are also poorly understood. This is because LLMs are largely "black boxes": their internal complexity is vast, involving billions of interconnected "neurons." They produce remarkable responses that are praised both for their readability and for their ability to find reasonable or correct answers. Yet open questions remain, such as "how do these models solve complex problems?" and "do they possess internal 'cognitive spaces' in which to 'think,' or are their responses entirely limited to syntactic prediction?" While these questions may seem to be ones for their programmers to address through examination of model weights, they may be as difficult to answer as asking neurologists to identify an individual's abilities from patterns of neural activity. If we want to understand the capabilities, predispositions, and attributes of individuals, administering items to them, whether achievement or personality, seems to provide a better understanding than brain scans do. In the same way, items might be leveraged to gain insight into the underlying latent traits of LLMs, not just into how often those LLMs are capable of generating a correct or acceptable response. The performance of LLMs on test items offers an opportunity to probe their underlying cognitive abilities and limitations. Given that LLMs tend to produce responses more like those of other LLMs than like those of the general population, an in-depth analysis of their predictions could shed light on their cognitive strengths and weaknesses.
Such insights are likely to become increasingly valuable as LLMs become more prevalent in educational settings and society, and as remote learning continues to grow. In particular, examining items where LLMs excel or struggle disproportionately compared with students could identify their distinctive cognitive patterns. This understanding could be helpful in the evolving landscape of educational evaluation, where tools like essays and exams face new challenges posed by rapidly advancing technologies. Detailed attention to the discrepancies between LLM and student performance could offer novel strategies to detect and prevent fraudulent behavior in educational assessments, helping to preserve the integrity and utility of these tools.

Exploring how Large Language Models (LLMs) understand and solve items can provide valuable insights. An intriguing research question is which of two distinct techniques LLMs use when producing pairwise rankings of item difficulty. The key possibilities are:

A. Do LLMs have pretrained embeddings of item features related to difficulty that they can leverage? For instance, some components of items might be inherently known to be more challenging, such as "division" being more difficult than "multiplication."

B. Do LLMs perform an internal cognitive mapping of the challenges involved in solving each item in an item pair, using this representation to predict difficulty?

For example, the items "10% of 20 = _" and "10% of 2 = _" superficially involve similar elements (percentages, multiplication, whole numbers). If the LLM relies on just the presence of these components (Method A), it might deem the first item harder due to the larger number 20 compared with 2; typically, items involving 20 and multiplication are more challenging than those involving 2 (e.g., "2*5" vs. "20*5" or "2*345" vs. "20*345"). If, however, the LLM is using an internal working space to solve both items before estimating difficulty (Method B), we might expect it to consider the second item, "10% of 2 =", as more difficult, because its solution involves a decimal number, which the model might view as a more complex concept than the whole number in the first item. Nevertheless, this simple example does not provide sufficient evidence that LLMs have a working solution space for reasoning. It is possible that LLMs are using a set of weights tied to the probability of multiplying single digits by fractions, which could influence their difficulty assessment more than the complexity of multiplying two-digit numbers by percentages. To properly investigate which method LLMs use, we would need to compose more nuanced items that appear largely the same on the surface but differ significantly in actual solving difficulty, thereby providing a clearer basis for analysis.

Probing deeper into how LLMs navigate pairwise item comparisons could reveal the types of items LLMs excel at or struggle with when ranking relative difficulties. Additionally, leveraging LLMs both to predict item difficulty and to attempt to solve items offers a novel mechanism for evaluating their internal problem-solving processes. An interesting follow-up study would examine the alignment between LLMs' difficulty predictions and the actual difficulties they encounter. While not definitive, such a study could shed light on the internal workings of LLMs when they predict item difficulties.
For example, for items that LLMs predict to be more difficult than the empirical estimates from student responses, are those items indeed harder for the LLMs to solve? Similarly, do items predicted to be easier than their empirical difficulties prove simpler for LLMs to handle? This approach involves scrutinizing items whose predicted and empirical difficulties diverge, offering potential insight into how LLMs interpret and tackle problem-solving challenges. If the LLM's difficulty predictions aligned closely with its actual performance in attempting the items, this would suggest that the LLM has fully or partially solved each item internally before providing a difficulty ranking. Conversely, if the predictions did not correspond with the LLM's ability to solve the items, this would indicate that the LLM is relying on embeddings of item features that carry latent difficulty weights. If the LLM does leverage an internal solution space when ranking items, that would have intriguing implications for understanding how LLMs approach other tasks whose outputs are more challenging to evaluate.

5.3 Conclusion

This dissertation examines the feasibility of using Large Language Models (LLMs) for item calibration in test development, with the aim of reducing the extensive costs and resources currently required. It examines the potential for LLMs, such as those from OpenAI's GPT series and Google's Gemini-Pro, to simulate human response patterns and reasoning skills, which in turn might enable them to predict test item characteristics. The data used in this study are 388 math and science items released by NAEP, selected from a much larger set of released items because of their limited dependence on visual components. The study deploys prompt engineering in the form of a genetic algorithm that explores numerous variations of prompts. These variations moved the pairwise item evaluation from an average predictive accuracy of around 55% to one closer to 62%. The use of LLMs to calibrate items is a novel application of LLMs; additionally, this is one of the first studies to deploy genetic algorithms in an educational setting to calibrate the performance of an LLM.

This dissertation proposes and tests two separate predictive models to explore the use of LLMs in predicting item difficulty. The first model estimates how well LLMs can perform pairwise item difficulty rankings. It finds that many of the proposed exogenous variables are statistically significant. However, the model's overall predictive strength is limited, with a maximum adjusted R-squared of less than 5%. This indicates that, while the model is statistically significant, its ability to use the tested item features to predict whether the LLM will successfully rank two items is limited.

In the second exploration, I used the ability of LLMs to solve or fail to solve achievement items as a tool to predict item difficulty. I also predicted item difficulty using computational indices such as word count, the Flesch-Kincaid index, and syllables per word. Among these, word count was the only consistently statistically significant predictor. While the individual performance of LLMs on these items was not very predictive of item difficulty, the overall predictive model was statistically significant, with a maximum adjusted R-squared of around 10%.
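The structure of this second model, which regresses empirical item difficulty on LLM solve/fail indicators and computational text indices, can be sketched as follows. This is illustrative only: the data below are hypothetical and the variable names are assumptions, not the dissertation's dataset or specification.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per item, with empirical difficulty (percent incorrect),
# solve/fail indicators from two LLMs, and computational text indices.
items = pd.DataFrame({
    "difficulty":     [42.0, 63.5, 55.1, 71.2, 38.4, 80.3, 47.9, 66.0],
    "gpt35_correct":  [1, 1, 0, 1, 0, 1, 1, 0],
    "gemini_correct": [1, 0, 0, 1, 1, 1, 0, 0],
    "word_count":     [45, 82, 60, 30, 95, 25, 51, 77],
    "flesch_kincaid": [5.2, 7.8, 6.1, 4.3, 8.9, 3.7, 5.5, 7.1],
})

# OLS of item difficulty on LLM correctness indicators and readability indices.
model = smf.ols(
    "difficulty ~ gpt35_correct + gemini_correct + word_count + flesch_kincaid",
    data=items,
).fit()
print(model.params)
print(model.rsquared_adj)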
The final study presented in this dissertation explored the use of pairwise item difficulty comparisons as a predictor of absolute item difficulties. This method was shown to be far from random, with an average correlation between empirical and estimated item difficulty in the high .30s to low .40s across methods, subjects, and grades. Despite a particularly strong performance in 12th-grade math (a correlation of .66), the method did not achieve the level of correlation (estimated at 0.8) required for professional deployment in testing applications. However, these methods are much lower in cost and easier to deploy than the standard approach, which requires hundreds of students to pretest the items. They are also simpler than the two alternative approaches often explored in the literature: using subject matter experts to rank items or employing computational methods that require training on large precalibrated item banks. While the results of this study are limited, the technology is quite new, and more powerful models are actively being developed. As such, the results of this research should be viewed as a preliminary proof of concept, with the reasonable expectation that these methods will improve in predictive ability as the large language models (LLMs) available in the marketplace also improve.

BIBLIOGRAPHY

Anderson, L.W., & Krathwohl, D.R. (2001). A taxonomy for learning, teaching, and assessing, abridged edition. Boston: Allyn & Bacon.
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., ... & Wu, Y. (2023). Palm 2 technical report. arXiv preprint arXiv:2305.10403.
Arbuckle, T. Y., & Cuddy, L. L. (1969). Discrimination of item strength at time of presentation. Journal of Experimental Psychology, 81(1), 126.
Arvidsson, S., & Axell, J. (2023). Prompt engineering guidelines for LLMs in Requirements Engineering.
Attali, Y., Saldivia, L., Jackson, C., Schuppan, F., & Wanamaker, W. (2014). Estimating item difficulty with comparative judgments. ETS Research Report Series, 2014(2), 1-8.
Bai, X., Wang, A., Sucholutsky, I., & Griffiths, T. L. (2024). Measuring Implicit Bias in Explicitly Unbiased Large Language Models. arXiv preprint arXiv:2402.04105.
Ban, J. C., Hanson, B. A., Wang, T., Yi, Q., & Harris, D. J. (2001). A Comparative Study of On-line Pretest Item—Calibration/Scaling Methods in Computerized Adaptive Testing. Journal of Educational Measurement, 38(3), 191-212.
Bejar, I. I. (1981). Subject Matter Experts' Assessment of Item Statistics. ETS Research Report Series, 1981(2), i-47.
Bejar, I. I. (1983). Subject Matter Experts' Assessment of Item Statistics. Applied Psychological Measurement, 7, 303–310.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March). On the dangers of stochastic parrots: Can language models be too big?🦜. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610-623).
Benedetto, L., Cappelli, A., Turrin, R., & Cremonesi, P. (2020, June). Introducing a framework to assess newly created questions with natural language processing. In International Conference on Artificial Intelligence in Education (pp. 43-54). Cham: Springer International Publishing.
Benedetto, L. (2023). A quantitative study of NLP approaches to question difficulty estimation. arXiv preprint arXiv:2305.10236.
Benedetto, L., Cremonesi, P., Caines, A., Buttery, P., Cappelli, A., Giussani, A., & Turrin, R. (2023). A survey on recent approaches to question difficulty estimation from text.
ACM Computing Surveys, 55(9), 1-37. 122 Berger, M. P. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57, 521-538. Berger, M. P. (2017). Item-calibration designs. In Handbook of item response theory (pp. 3-20). Chapman and Hall/CRC. Bewersdorff, A., Seßler, K., Baur, A., Kasneci, E., & Nerdel, C. (2023). Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters. arXiv preprint arXiv:2308.06088. Bezirhan, U., & von Davier, M. (2023). Automated Reading Passage Generation with OpenAI's Large Language Model. arXiv preprint arXiv:2304.04616. Bloom, B. S. (1956). Taxonomy of Educational Objectives. Boateng, G. O., Neilands, T. B., Frongillo, E. A., Melgar-Quiñonez, H. R., & Young, S. L. (2018). Best practices for developing and validating scales for health, social, and behavioral research: a primer. Frontiers in public health, 6, 149. Borsboom, D., & Molenaar, D. (2015). Psychometrics. Buyske, S. G. (1998). Optimal design for item calibration in computerized adaptive testing: The 2PL case. Lecture Notes-Monograph Series, 115-125. Buyske, S. (2005). Optimal design in educational testing. Applied optimal designs, 1-19. Brown, J. D. (1998). An EFL readability index. Jalt Journal, 20(2), 7-36. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901. Campbell, J. R., Hombo, C. M., & Mazzeo, J. (2000). NAEP 1999 trends in academic progress: Three decades of student performance. ED Pubs, PO Box 1398, Jessup, MD 20794-1398. Coleman, E.B.: On understanding prose: some determiners of its complexity. NSFfinal report GB-2604 (1965) Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364. Cross, L. H., Impara, J. C., Frary, R. B., & Jaeger, R.M. (1984). A comparison of three methods for establishing minimum standards on the National Teacher Examinations. Journal of Educational Measurement, 21, 113–129. 123 Danielson, W. A., & Bryan, S. D. (1963). Computer automation of two readability formulas. Journalism Quarterly, 40(2), 201-206. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. DuBay, W. H. (2004). The principles of readability. Online Submission. Dudley, R. (2016) 'Massive' breach exposes hundreds of questions for upcoming SAT exams Reuters. August 3, 2016 https://www.reuters.com/investigates/special-report/college-sat- security/ D’Souza, J. A Review of Transformer Models. Fernandez, G. (2003, August). Cognitive scaffolding for a web-based adaptive learning environment. In International Conference on Web-Based Learning (pp. 12-20). Berlin, Heidelberg: Springer Berlin Heidelberg. Flesch, R. (1948). A new readability yardstick. Journal of applied psychology, 32(3), 221. Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading item difficulty: implications for construct validity. Language Testing, 10(2), 133-170. doi.org/10.1177/026553229301000203 Gunning, R. (1968). Readability yardsticks. The Technique of Clear Writing. New York: McGraw- Hill. Gurnee, W., & Tegmark, M. (2023). Language models represent space and time. arXiv preprint arXiv:2310.02207. 
Hambleton, R. K., Sireci, S. G., Swaminathan, H., Xing, D., & Rizavi, S. (2003). Anchor-Based Methods for Judgmentally Estimating Item Difficulty Parameters. LSAC Research Report Series. Hardesty, D. M., & Bearden, W. O. (2004). The use of expert judges in scale development: Implications for improving face validity of measures of unobservable constructs. Journal of business research, 57(2), 98-107. Hassan, M. U., & Miller, F. (2019). Optimal item calibration for computerized achievement tests. psychometrika, 84(4), 1101-1128. Hassan, M. U., & Miller, F. (2021). An exchange algorithm for optimal calibration of items in computerized achievement tests. Computational statistics & data analysis, 157, 107177. He, Y., & Chen, P. (2020). Optimal online calibration designs for item replenishment in adaptive testing. psychometrika, 85(1), 35-55. 124 Hernandez, I., & Nie, W. (2022). The AI-IP: Minimizing the guesswork of personality scale item development through artificial intelligence. Personnel Psychology. Hocky, G. M., & White, A. D. (2022). Natural language processing models that automate programming will transform chemistry research and teaching. Digital discovery, 1(2), 79-83. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232. Huang, Z., Liu, Q., Chen, E., Zhao, H., Gao, M., Wei, S., ... & Hu, G. (2017, February). Question Difficulty Prediction for READING Problems in Standard Tests. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1). Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. D. L., ... & Sayed, W. E. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825. Jones, D. H., & Jin, Z. (1994). Optimal sequential designs for on-line item estimation. Psychometrika, 59(1), 59-75. Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Kumar, H., Musabirov, I., Reza, M., Shi, J., Kuzminykh, A., Williams, J. J., & Liut, M. (2023). Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception. arXiv preprint arXiv:2310.13712. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models. McGraw-hill. Krippendorff, K. (2011). Computing Krippendorff's alpha-reliability. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2022). Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382. Linacre, J. M. (1994). Sample size and item calibration stability. Rasch measurement transactions, 7, 328. 125 Liu, Y., Yang, K., Qi, Z., Liu, X., Yu, Y., & Zhai, C. (2024). Prejudice and Caprice: A Statistical Framework for Measuring Social Discrimination in Large Language Models. arXiv preprint arXiv:2402.15481. Lord, F.M., Novick, M.R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Addison-Wesley Lorge, I., & Kruglov, L. (1952). A suggested technique for the improvement of difficulty prediction of test items. 
Educational and Psychological Measurement, 12(4), 554-561. Lorge, I., & Kruglov, L. (1953). The improvement of estimates of test difficulty. Educational and PsychologicalMeasurement, 13, 34–46. Lorge, I., & Diamond, L. K. (1954). The Prediction of Absolute Item Difficulty by Ranking and Estimating Techniques. Educational and Psychological Measurement, 14(2), 365-372. Lu, F., Li, X., Liu, Q., Yang, Z., Tan, G., & He, T. (2007). Research on personalized e-learning system using fuzzy set based clustering algorithm. In Computational Science–ICCS 2007: 7th International Conference, Beijing, China, May 27-30, 2007, Proceedings, Part III 7 (pp. 587-590). Springer Berlin Heidelberg. Lu, H. Y. (2014). Application of optimal designs to item calibration. Plos One, 9(9), e106747. Martins, T., Cunha, J. M., Correia, J., & Machado, P. (2023, April). Towards the Evolution of Prompts with MetaPrompter. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 180-195). Cham: Springer Nature Switzerland. Matelsky, J. K., Parodi, F., Liu, T., Lange, R. D., & Kording, K. P. (2023). A large language model- assisted education tool to provide feedback on open-ended responses. arXiv preprint arXiv:2308.02439. Marchant, G. J. (2015). How plausible is using averaged NAEP values to examine student achievement?. Comprehensive Psychology, 4, 03-CP. McGraw, R., Lubienski, S. T., & Strutchens, M. E. (2006). A closer look at gender in NAEP mathematics achievement and affect data: Intersections with achievement, race/ethnicity, and socioeconomic status. Journal for Research in Mathematics Education, 37(2), 129-150. Mc Laughlin, G.H.: Smog grading-a new readability formula. J. Reading 12(8), 639–646 (1969) Melican, G. J., Mills, C. N., & Plake, B. S. (1989). Accuracy of item performance predictions based on the Nedelsky standard setting method. Educational and Psychological Measurement, 49, 467–478. 126 Mellenbergh, G. J. (1989). Item bias and item response theory. International journal of educational research, 13(2), 127-143. Mislevy, R. J., Sheehan, K. M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30(1), 55-78. Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2), 100050. Mohajan, H. K. (2017). Two criteria for good measurements in research: Validity and reliability. Annals of Spiru Haret University. Economic Series, 17(4), 59-82. Moore, S., Nguyen, H. A., Bier, N., Domadia, T., & Stamper, J. (2022, September). Assessing the quality of student-generated short answer questions using GPT-3. In European conference on technology enhanced learning (pp. 243-257). Cham: Springer International Publishing. National Center for Education Statistics (2024) National Assessment of Educational Progress: Search Questions [Data set]. U.S. Department of Education. https://www.nationsreportcard.gov/nqt/searchquestions OpenAI (2023) Technical Report https://doi.org/10.48550/arXiv.2303.08774 Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., ... & Gao, J. (2023). Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813. Qiu, Z., Wu, X., & Fan, W. (2019, November). Question difficulty prediction for multiple choice problems in medical exams. 
In Proceedings of the 28th acm international conference on information and knowledge management (pp. 139-148). Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., ... & Irving, G. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446. Rafatbakhsh, E., Ahmadi, A. Predicting the difficulty of EFL reading comprehension tests based on linguistic indices. Asian. J. Second. Foreign. Lang. Educ. 8, 41 (2023). https://doi.org/10.1186/s40862-023-00214-4 Raina, V., & Gales, M. (2022). Multiple-choice question generation: Towards an automated assessment framework. arXiv preprint arXiv:2209.11830. Rakshit, A., Singh, S., Keshari, S., Chowdhury, A. G., Jain, V., & Chadha, A. (2024). From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings. arXiv preprint arXiv:2402.11512. 127 Rampey, B. D., Dion, G. S., & Donahue, P. L. (2009). NAEP 2008: Trends in Academic Progress. NCES 2009-479. National Center for Education Statistics. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). sage. Reilly, D., Neumann, D. L., & Andrews, G. (2019). Gender differences in reading and writing achievement: Evidence from the National Assessment of Educational Progress (NAEP). American Psychologist, 74(4), 445. Reiss, M. V. (2023). Testing the reliability of chatgpt for text annotation and classification: A cautionary remark. arXiv preprint arXiv:2304.11085. Ren, H., van der Linden, W. J., & Diao, Q. (2017). Continuous online item calibration: Parameter recovery and item utilization. Psychometrika, 82(2), 498-522. Rudolph, J., Tan, S., & Tan, S. (2023). War of the chatbots: Bard, Bing Chat, ChatGPT, Ernie and beyond. The new AI gold rush and its impact on higher education. Journal of Applied Learning and Teaching, 6(1). Ryan, J. J. (1968). Teacher judgments of test item properties. Journal of Educational Measurement, 5(4), 301-306. Settles, B., T. LaFlair, G., & Hagiwara, M. (2020). Machine learning–driven language assessment. Transactions of the Association for computational Linguistics, 8, 247-263. Senter, R., Smith, E.A.: Automated readability index. Technical Report, Cincinnati University, OH (1967) Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55(3), 461-475. Stout, W., Ackerman, T., Bolt, D., Froelich, A. G., & Heck, D. (2003). On the Use of Collateral Item Response Information to Improve Pretest Item Calibration. LSAC Research Report Series. Thorndike, R. L. (1982). Item and score conversion by pooled judgment. Test equating, 309-317. Thurstone, L.L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16(7), 433–451. https://doi.org/10.1037/h0073357 Tinkelman, S. (1947). Difficulty prediction of test items. Teachers College Contributions to Education. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 128 Tran, K. D., Bui, D. V., & Luong, N. H. (2023, October). Evolving Prompts for Synthetic Image Generation with Genetic Algorithm. In 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR) (pp. 1-6). IEEE. Van de Schoot, R., Schmidt, P., De Beuckelaer, A., Lek, K., & Zondervan-Zwijnenburg, M. (2015). Measurement invariance. Frontiers in psychology, 6, 1064. 
van der Linden, W. J., & Ren, H. (2015). Optimal Bayesian adaptive design for test-item calibration. Psychometrika, 80(2), 263-288. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. Von Davier, M. (2019). Training Optimus prime, MD: Generating medical certification items by fine-tuning OpenAI's gpt2 transformer model. arXiv preprint arXiv:1908.08594. Wang, Z., Valdez, J., Basu Mallick, D., & Baraniuk, R. G. (2022, July). Towards human-like educational question generation with large language models. In International conference on artificial intelligence in education (pp. 153-166). Cham: Springer International Publishing. Walsh, J. (2022). Lesson plan generation using natural language processing: Prompting best practices with OpenAI’s gpt-3 model. White, A. D., Hocky, G. M., Gandhi, H. A., Ansari, M., Cox, S., Wellawatte, G. P., ... & Ccoa, W. J. P. (2023). Assessment of chemistry knowledge in large language models that generate code. Digital Discovery, 2(2), 368-376. Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29(3), 39-47. Wong, M., Ong, Y. S., Gupta, A., Bali, K. K., & Chen, C. (2023, June). Prompt Evolution for Generative AI: A Classifier-Guided Approach. In 2023 IEEE Conference on Artificial Intelligence (CAI) (pp. 226-229). IEEE. Wu, X., He, X., Liu, T., Liu, N., & Zhai, X. (2023, June). Matching exemplar as next sentence prediction (mensp): Zero-shot prompt learning for automatic scoring in science education. In International Conference on Artificial Intelligence in Education (pp. 401-413). Cham: Springer Nature Switzerland. Yao, T. (1991). CAT with a poorly calibrated item bank. Rasch Measurement Transactions, 5(2), 141. Yahya, A. A., Toukal, Z., & Osman, A. (2012). Bloom’s taxonomy–based classification for item bank questions using support vector machines. In Modern advances in intelligent systems and tools (pp. 135-140). Springer Berlin Heidelberg. 129 Yang, K. C., & Menczer, F. (2023). Large language models can rate news outlet credibility. arXiv preprint arXiv:2304.00228. Yaneva, V., Baldwin, P., & Mee, J. (2019, August). Predicting the difficulty of multiple choice questions in a high-stakes medical exam. In Proceedings of the fourteenth workshop on innovative use of NLP for building educational applications (pp. 11-20). Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B., & Yang, Q. (2023, April). Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1-21). Zheng, Y. (2014). New methods of online calibration for item bank replenishment. University of Illinois at Urbana-Champaign. Zimowski, M. F., Muraki, E., Mislevy, R., & Bock, R. D. (1996). Bilog-mg. Multiple group IRT analysis and test maintenance for binary items. 130 APPENDIX A: PROMPT SCORING A.1 Binary Prompts TABLE 28: AUGMENTED CONFUSION MATRIX Confusion Matrix with difficulties 𝐷# > 𝐷$ augmented with uninterpretable responses. 
                     True Condition
Predicted Condition  True                        False
Positive             True Positive (TP)          False Positive (FP)
Negative             True Negative (TN)          False Negative (FN)
Uninterpretable      True Uninterpretable (TU)   False Uninterpretable (FU)

Traditional Accuracy (A) of a model is defined as:

A = (TP + TN) / (TP + TN + FP + FN)

Prompts will be scored based on the uninterpretable-aware accuracy of their classifications, with parsable accuracy defined as:

Parsable Accuracy = (TP + TN) / (TP + TN + FP + FN + TU + FU)

Responses that cannot be parsed are first repeated, either until a response can be parsed or until a reasonable number of attempts (between 3 and 10) has been made. Responses that never achieve an interpretable form are considered incorrect in this study, regardless of whether the LLM provides an accurate answer: a response that cannot be interpreted cannot be used for analysis.

To avoid effects of the order in which items are presented, items are tested in both possible sequences (first D1 then D2, and vice versa). In this context, measuring the LLM's performance based solely on accuracy is appropriate. When the LLM is simply making random guesses, its non-predictive accuracy is 50% minus the probability of generating an uninterpretable response; if all responses were interpretable, the chance of a randomly correct answer would be 50%. Because items are presented equally in both orders and the responses are binary, using accuracy as a metric does not cause the issues it can in traditional machine learning applications, where predicting skewed outcomes can lead to misleadingly high accuracy rates.3

Alternative Measure of Performance: F1-Score

There are other measures of model performance worth considering. One is the F1 score, defined as:

F1 = 2TP / (2TP + FN + FP)

In the case of ranking two items against each other, it is easy to show that Accuracy (A) converges to the F1 score when the model is non-predictive (random). Under a random model, ignoring uninterpretable responses:

E(TP) = E(TN) = E(FP) = E(FN) = 1/4

Therefore, Accuracy would equal:

A = (1/4 + 1/4) / (1/4 + 1/4 + 1/4 + 1/4) = 1/2

which happens to equal:

F1 = 2(1/4) / (2(1/4) + 1/4 + 1/4) = 1/2

We can see that when responses are evaluated symmetrically and the model performs randomly, the expected values satisfy A = F1. As such, I opt to use the more intuitive measure, Accuracy, for evaluating prompt performance.

3 Accuracy can be misleading when binary outcomes are either very likely or very unlikely to occur. For instance, if an event occurs in the data 90% of the time, a model that predicts the event will always occur (P(Y=1) = 1) would appear 90% accurate, even though it actually has no predictive ability.

A.2 Non-Binary Prompts

In prompt formats in which more than two items are ranked for difficulty by the LLM, the returned ranking is split into pairs, each of which is evaluated for correctness. For example, if the prompt "Rank items: 1, 2, 3" returns the ranking 3, 1, 2, then the binary comparisons 3>1, 3>2, and 1>2 are checked. Symmetry is also maintained, with the prompt receiving the items in the reverse order (3, 2, 1) to be ranked.
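A minimal scoring sketch (illustrative only, not the scoring scripts used in this dissertation) showing how parsable accuracy and the F1 score can be computed from tallied outcomes, and how a returned ranking of several items decomposes into the binary comparisons described above:

from itertools import combinations

def parsable_accuracy(tp, tn, fp, fn, tu, fu):
    """Accuracy with uninterpretable responses counted as incorrect."""
    return (tp + tn) / (tp + tn + fp + fn + tu + fu)

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fn + fp)

def ranking_to_pairs(ranking):
    """Decompose a hardest-to-easiest ranking into binary 'harder than' comparisons.
    ranking_to_pairs([3, 1, 2]) -> [(3, 1), (3, 2), (1, 2)]"""
    return list(combinations(ranking, 2))

# Under random, fully interpretable responding the two metrics coincide at 0.5:
print(parsable_accuracy(25, 25, 25, 25, 0, 0))  # 0.5
print(f1_score(25, 25, 25))                     # 0.5
print(ranking_to_pairs([3, 1, 2]))              # [(3, 1), (3, 2), (1, 2)]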
Leveraging Uncertainty

I will reserve for later research potential applications that leverage uncertainty (see Table 29), with uncertainty defined as:

|D1 - D2| < ε, where ε > 0 and "small"

In this kind of analysis, predicted uncertainty would be a response option available to the LLM as one of three options (1: D1 > D2, 2: D2 > D1, 3: D1 ~ D2), with ~ representing undetermined or undeterminable (ambivalence).

TABLE 29: CONFUSION MATRIX WITH AMBIVALENCE, WITH DIFFICULTIES D1 > D2

                     True Condition
Predicted Condition  True                        False                        Uncertain
Positive             True Positive (TP)          False Positive (FP)          Uncertain Positive (UP)
Negative             True Negative (TN)          False Negative (FN)          Uncertain Negative (UN)
Ambivalent           True Ambivalent (TA)        False Ambivalent (FA)        Uncertain Ambivalent (UA)
Uninterpretable      True Uninterpretable (TU)   False Uninterpretable (FU)   Uncertain Uninterpretable (UU)

APPENDIX B: PROMPT DESIGN - RELATIVE DIFFICULTY EVALUATION

It has been widely reported that LLMs tend to be highly sensitive to prompt specifications. These specifications have been explored thoroughly, with numerous potential factors influencing outcomes. Arvidsson and Axell (2023) provide ten themes for prompt engineers to focus on, with recommendations including context, persona, template, and reasoning steps, while Zamfirescu-Pereira et al. (2023) explore some common pitfalls of writing prompts.

This research builds individual prompt templates from a high-level Meta-Template (Figure 7). Some values of this template are filled in for a specific prompt, while other values are taken from the content of the items being evaluated. Each of the variables identified with double curly brackets ({{variable}}) is a variable on the template level, while single curly brackets ({variable}) identify variables on the item level. Variables on the template level may have components that are item invariant, such as {{Task Introduction}}; that vary on the item level, such as {{Item.Context.1}}; or that vary on the Meta-Template level, such as {{Task Output}}, which takes variable values for the variable names {{Question 1}} and {{Question 2}}.

FIGURE 7: META-TEMPLATE
The Meta-Template is a high-level template that is used to construct individual prompt templates. Variable values are identified with curly brackets {} on either of two levels: a single bracket {variable} indicates a variable on the item level, while a double curly bracket {{variable}} indicates a variable on the template level. Variables which are italicized are "optional," meaning their values can take an empty value.

{{Task Title}}
{{Persona}}
{{Task Introduction}}
{{Analysis Instruction}}
{{Perspective}}
{{N-Shot Examples}}
{{Task Approach Instructions}}
{{Item Shared Context}}
{{Question 1}}: {{Item.Context.1}} {Item.Body.1}
{{Question 2}}: {{Item.Context.2}} {Item.Body.2}
{{Task Instructions}}
Please provide your evaluation in the following format:
{{Task Output}}
Model Parameter: Temperature = {{Model Temperature}}

In the Meta-Template there is a total of nine different variables which can be specified on the template level. Each of these variables has a minimum of 3 and up to 8 levels available, which I have generated (Tables 30 through 41). These levels have millions of potential combinations, of which even sampling 1% would be both excessively time consuming and costly. Instead, I generate 20 prompt templates; an example of an individual prompt template can be seen in Figure 8. These templates are random combinations of the variable values, with each non-default value (the white-background rows in the tables) appearing twice and the default value (the gray-background row) used in all other cases.
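As a concrete illustration of how a prompt template is assembled (a minimal sketch under assumed data structures, not the dissertation's implementation; the variant strings are abbreviated from Tables 30 through 41 and the item content is hypothetical), the two-level substitution can be carried out as follows:

# The meta-template, abbreviated from Figure 7: {{...}} marks template-level slots
# and {...} marks item-level slots.
META_TEMPLATE = """{{Task Title}}
{{Persona}}
{{Task Introduction}}
{{Item Shared Context}}
{{Question 1}}: {Item.Body.1}
{{Question 2}}: {Item.Body.2}
{{Task Instructions}}
Please provide your evaluation in the following format:
{{Task Output}}"""

def fill(text, values, open_mark, close_mark):
    """Replace each marked slot with its chosen value."""
    for key, val in values.items():
        text = text.replace(open_mark + key + close_mark, str(val))
    return text

# One assignment of template-level variants (abbreviated from Tables 30 through 41).
template_values = {
    "Task Title": "Assessing Question Difficulty",
    "Persona": "You are a professional item writer.",
    "Task Introduction": "Below are two questions. Review these questions and rank them in terms of difficulty.",
    "Item Shared Context": "Grade: {Grade}, Subject: {Subject}",
    "Question 1": "Question A",
    "Question 2": "Question B",
    "Task Instructions": "Determine which question poses the greater challenge.",
    "Task Output": "The more difficult question is: Question A/Question B",
}

# Hypothetical item-level content for a single item pair.
item_values = {
    "Grade": 12,
    "Subject": "Mathematics",
    "Item.Body.1": "Solve for x: 2x + 3 = 11.",
    "Item.Body.2": "Evaluate the integral of 2x from 0 to 3.",
}

prompt_template = fill(META_TEMPLATE, template_values, "{{", "}}")  # template level
prompt = fill(prompt_template, item_values, "{", "}")               # item level
print(prompt)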
FIGURE 8: AN EXAMPLE OF A PROMPT TEMPLATE READY FOR ITEM CONTENT

Assessing Question Difficulty
You are a professional item writer.
Below are two questions presented for difficulty assessment. Please consider factors such as the complexity of the concepts involved, the amount of information required to answer, and the level of critical thinking or problem-solving skills needed.
Outline the process for resolving this relative difficulty ranking issue.
Grade: {Grade}, Subject: {Subject}
Question A: Bloom's Taxonomy Classification: {Bloom} {Item.Body.1}
Question B: Bloom's Taxonomy Classification: {Bloom} {Item.Body.2}
After reviewing the content and demands of both questions, determine which question poses the greater challenge.
Please provide your evaluation in the following format:
"The more difficult question is: Question A/Question B"
Model Parameter: Temperature = 0.6

The variable {{N-Shot Examples}} refers to items which are given to the LLM as examples of success; the N in N-Shot is the number of examples offered. In this study I test three options: 0-Shot (the default), 1-Shot, and 2-Shot. Each subject and grade/age group has four items randomly selected and reserved for N-Shot examples. These four items are all from the same year and are compared internally within the group for the example, but they will generally not share the same year with the other items from the same group.

The content of Tables 30 through 42, as it enters the template, is filled from a combination of invariant content, item-specific content from the source NAEP items, and content generated through use of a first-stage LLM. The values calculated in the first-stage model are the item values that appear in Table 38; Appendix C goes into detail on how those values are calculated.

TABLE 30: VARIANTS ON {{TASK TITLE}}
In this and the following variant tables, the first row (row 0) is shaded gray in the original, indicating that it is the default value.
0  ""
1  Assessing Question Difficulty
2  Comparative Difficulty Evaluation
3  Difficulty Ranking of Questions
4  Challenge Assessment of Two Questions
5  Complexity Analysis for Question Pairs

TABLE 31: VARIANTS ON {{PERSONA}}
0  ""
1  You are a professional item writer.
2  You are an experienced educational psychologist specializing in test difficulty analysis.
3  You are an academic researcher with expertise in educational assessments.
4  You are a seasoned teacher familiar with standard testing practices.
5  You are a curriculum specialist with a focus on standardized test development.
6  You are a data analyst with a background in comparative educational metrics.
7  You are a grade school tutor.

TABLE 32: VARIANTS OF {{TASK INTRODUCTION}}
0  Below are two questions. Review these questions and rank them in terms of difficulty.
1  Below are two questions presented for difficulty assessment. Please consider factors such as the complexity of the concepts involved, the amount of information required to answer, and the level of critical thinking or problem-solving skills needed.
2  You will find two different questions ahead. Evaluate their difficulty based on depth and breadth of knowledge required, and cognitive demand.
3  Assess the complexity of each question by examining their underlying concepts and the intricacy of the answers they require.
4  Presented here are two inquiries. Please appraise their level of difficulty taking into account the scope of understanding and analysis needed.
5  Compare the following two questions and gauge which one necessitates a higher intellectual effort for resolution.
The content of Tables 30 through 42, as it enters the template, is filled from a combination of invariant content, item-specific content from the source NAEP items, or content generated through use of a first-stage LLM. The values calculated in the first-stage model are the item values that appear in Table 38. The next section of this appendix goes into detail on how those values are calculated.

TABLE 30: VARIANTS ON {{TASK TITLE}}
The first row of this table is shaded gray, indicating it is the "default value."
0  ""
1  Assessing Question Difficulty
2  Comparative Difficulty Evaluation
3  Difficulty Ranking of Questions
4  Challenge Assessment of Two Questions
5  Complexity Analysis for Question Pairs

TABLE 31: VARIANTS ON {{PERSONA}}
0  ""
1  You are a professional item writer.
2  You are an experienced educational psychologist specializing in test difficulty analysis.
3  You are an academic researcher with expertise in educational assessments.
4  You are a seasoned teacher familiar with standard testing practices.
5  You are a curriculum specialist with a focus on standardized test development.
6  You are a data analyst with a background in comparative educational metrics.
7  You are a grade school tutor.

TABLE 32: VARIANTS OF {{TASK INTRODUCTION}}
0  Below are two questions. Review these questions and rank them in terms of difficulty.
1  Below are two questions presented for difficulty assessment. Please consider factors such as the complexity of the concepts involved, the amount of information required to answer, and the level of critical thinking or problem-solving skills needed.
2  You will find two different questions ahead. Evaluate their difficulty based on depth and breadth of knowledge required, and cognitive demand.
3  Assess the complexity of each question by examining their underlying concepts and the intricacy of the answers they require.
4  Presented here are two inquiries. Please appraise their level of difficulty taking into account the scope of understanding and analysis needed.
5  Compare the following two questions and gauge which one necessitates a higher intellectual effort for resolution.

TABLE 33: VARIANTS ON {{PERSPECTIVE}}
0  ""
1  Evaluate the content from the perspective of a {grade}th grade student.
2  Evaluate the content from the perspective of a {grade - 2}th grade student.
3  Evaluate the content from the perspective of a {max(grade - 4, 1)}th/st grade student.
4  Evaluate the content from the perspective of a {max(grade - 8, 1)}th/st grade student.

TABLE 34: VARIANTS OF {{N-SHOT EXAMPLES}}
0  ""
1  "Here is an example of two items that have been correctly ranked:
   Example {{Question 1}}: {Example Item.Body.1 Content}
   Example {{Question 2}}: {Example Item.Body.2 Content}
   {{Task Output}} with correct {{Question 1}}/{{Question 2}}"
2  "Here is an example of two items that have been correctly ranked:
   Example {{Question 1}}: {Example Question 3 Content}
   Example {{Question 2}}: {Example Question 4 Content}
   {{Task Output}} with correct {{Question 1}}/{{Question 2}}"
   "Here is a second example of two items that have been correctly ranked:
   Example {{Question 1}}: {Example Question 3 Content}
   Example {{Question 2}}: {Example Question 4 Content}
   {{Task Output}} with correct {{Question 1}}/{{Question 2}}"

TABLE 35: VARIANTS OF {{TASK APPROACH INSTRUCTIONS}}
The purpose of these prompts is to elicit a step-by-step, or "Chain of Thought," reasoning process. This method has been demonstrated to enhance the performance of LLMs when tackling tasks that require complex reasoning (Wei et al., 2022).
0  ""
1  List the steps involved in solving this relative difficulty ranking problem.
2  Enumerate the procedures required to tackle this relative difficulty ranking problem.
3  Outline the process for resolving this relative difficulty ranking issue.
4  Detail the sequence of actions needed to solve this relative difficulty ranking challenge.
5  Describe the method to be followed in addressing this relative difficulty ranking task.
6  Provide the roadmap for navigating through this relative difficulty ranking problem.

TABLE 36: VARIANTS OF {{ITEM SHARED CONTEXT}}
0  ""
1  Test: NAEP
2  Grade: {Grade} / Age: {Age}
3  Subject: {Subject}
4  Grade: {Grade}, Subject: {Subject}
5  Grade: {Grade}, Subject: {Subject}, Year: {Year}

TABLE 37: VARIANTS OF QUESTION NAMING ({{QUESTION 1}}/{{QUESTION 2}})
1  Item.Body.1, Item.Body.2
2  Question I, Question II
3  Question i, Question ii
4  Question A, Question B
5  Question a, Question b
6  Question Alpha, Question Beta
7  First Question, Second Question

TABLE 38: VARIANTS OF ITEM CONTEXT ({ITEM.CONTEXT.1}/{ITEM.CONTEXT.2})
This content is generated as a first stage of the item calibration from the item contents. The source of each variant is given in parentheses.
0  ""
1  Number of Steps Required to Solve This Problem: {Number of Steps}  (Source: LLM (Item.Body))
2  Steps Required to Solve This Problem: {Steps}  (Source: LLM (Item.Body))
3  Item Complexity Rating: {Complexity}  (Source: LLM (Item.Body))
4  Item Content Tags: {Content Tags}  (Source: LLM (Item.Body))
5  Bloom's Taxonomy Classification: {Bloom.LLM}  (Source: LLM (Item.Body))
6  Bloom's Taxonomy Classification: {Bloom.NAEP}  (Source: F(Item.Context))
7  Item Context: {Item.Context}  (Source: Item.Context)
Note: The function LLM() here refers to the LLM's generative function (see Appendix C for more details), while F() refers to the mapping from Item.Context given in Table 42.
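As a small illustration of how a chosen Table 38 variant could be rendered into the {Item.Context.1}/{Item.Context.2} slots once the first-stage values exist, the sketch below uses a dictionary of format strings. The lower-case field names and the example first-stage values are hypothetical and chosen only for illustration.

```python
# Minimal sketch: render an item-context line for a chosen Table 38 variant
# (illustrative field names; first-stage values assumed already generated).

ITEM_CONTEXT_VARIANTS = {
    0: "",
    1: "Number of Steps Required to Solve This Problem: {number_of_steps}",
    2: "Steps Required to Solve This Problem: {steps}",
    3: "Item Complexity Rating: {complexity}",
    4: "Item Content Tags: {content_tags}",
    5: "Bloom's Taxonomy Classification: {bloom_llm}",
    6: "Bloom's Taxonomy Classification: {bloom_naep}",
    7: "Item Context: {item_context}",
}

def render_item_context(variant: int, first_stage: dict) -> str:
    """Fill the chosen variant with first-stage collateral values for one item."""
    return ITEM_CONTEXT_VARIANTS[variant].format(**first_stage)

# Hypothetical first-stage output for a single item.
first_stage = {
    "number_of_steps": 3, "steps": "identify the unknown; set up the equation; solve",
    "complexity": "Medium Complexity", "content_tags": "algebra; linear equations",
    "bloom_llm": "Apply", "bloom_naep": "Apply", "item_context": "Grade 8 mathematics",
}
print(render_item_context(5, first_stage))  # Bloom's Taxonomy Classification: Apply
```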
TABLE 39: VARIANTS OF {{TASK INSTRUCTIONS}}
0  ""
1  After reviewing the content and demands of both questions, determine which question poses the greater challenge.
2  In the subsequent section, two questions are listed for analysis. Your task is to deduce which involves greater difficulty for a respondent.
3  Compare the difficulties of the questions above in terms of their requisites on knowledge and reasoning.
4  Here are two questions for your consideration. Please rate them in order of difficulty.
5  Examine each question carefully to establish which one is more difficult, requiring more comprehensive depth of understanding and problem-solving ability.

TABLE 40: VARIANTS OF {{TASK OUTPUT}}
0  The more difficult question is: {{Question 1}}/{{Question 2}}.
1  After thorough analysis, it appears that {{Question 1}}/{{Question 2}} is the more demanding question.
2  Upon evaluation, the more difficult question is determined to be {{Question 1}}/{{Question 2}}.
3  Considering all factors, {{Question 1}}/{{Question 2}} stands out as the question with greater difficulty.
4  The assessment concludes that {{Question 1}}/{{Question 2}} poses a higher level of difficulty.
5  Based on the analysis, I have determined that the question which presents the most complexity is {{Question 1}}/{{Question 2}}.

TABLE 41: VARIANTS OF {{MODEL TEMPERATURE}}
0  0.4
1  0
2  0.2
3  0.6
4  0.8
5  1

Some of the template design options require additional content related to the items, generated at an earlier stage (Table 38). This information might help in the task of predicting item difficulty. How this information is generated is explored in Appendix C.

TABLE 42: BLOOM'S TAXONOMY MAPPED TO NAEP CONTENT TAGS
This table shows a mapping from the NAEP content tags to Bloom's Revised Taxonomy (Anderson and Krathwohl, 2001). The map is imprecise, as the original NAEP tags appear somewhat arbitrarily assigned.
Bloom's Taxonomy levels: Remember; Understand; Apply; Analyze; Evaluate; Create.
NAEP Content Tags: locate/recall knowing low historical knowledge and perspective comprehends what is read conceptual understanding forming a general understanding interprets what has been read identifying/describing moderate understanding applying practical reasoning problem solving using science principles using scientific inquiry analyzes what has been read developing interpretation examine content and structure examining content and structure explaining and analyzing identifying science principles scientific investigation critique/evaluate evaluating and analyzing evaluate, take, defend historical analysis and interpretation reasoning integrate/interpret making reader/text connections
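A minimal sketch of how the mapping F() could be implemented as a simple dictionary lookup with a fallback for unmapped tags. The function name is hypothetical, and the two entries shown are illustrative examples only; the full (and imprecise) tag-to-level mapping is the one given in Table 42.

```python
# Minimal sketch: F() as a lookup from NAEP content tags to Bloom's Revised Taxonomy.
# Only two illustrative entries are shown; the full mapping is given in Table 42.

NAEP_TO_BLOOM = {
    "locate/recall": "Remember",       # illustrative entry
    "critique/evaluate": "Evaluate",   # illustrative entry
}

def f_bloom(naep_tag: str, default: str = "Unmapped") -> str:
    """Map a NAEP content tag to a Bloom level, falling back to a default."""
    return NAEP_TO_BLOOM.get(naep_tag.lower().strip(), default)

print(f_bloom("Critique/Evaluate"))  # Evaluate
```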
APPENDIX C: PROMPT DESIGN - COLLATERAL ITEM INFORMATION

In this dissertation several of the models make use of collateral item information, some of it sourced from large language models (LLMs). One use is in the contents of prompts built from Table 38. Following Mislevy et al. (1993), I will refer to this information as "collateral information". Unlike in the relative item difficulty estimation, with collateral item information I do not have empirically validated values to compare the LLM-generated values against. As such, I specify a standard for each piece of collateral information generated.

It is not clear to me what the best scoring criterion is for generating the variable {Steps} (the steps required to solve a question). Kumar et al. (2023) explore four different prompt strategies with LLMs interacting as a tutor with students. They find that while some slight variation existed between student outcomes based on the prompt strategy used, overall the differences were slight. My preliminary explorations on this matter suggest that a general criterion for these variables would be: "concise but sufficient."

This contrasts with the variables {Complexity} and {Bloom.LLM}, where we are dealing with a bias of overconfidence in LLMs. This bias is distinct from the commonly known bias in which models will hallucinate reasonable-sounding responses that have no real-world bearing (Huang et al., 2023). This overconfidence is based on "believing" that a grade-N student should know something that only a fraction of students at that grade actually know. Preliminary experiments with prompting the LLM to evaluate the complexity of items from a {grade - C} perspective seem promising, where C is an integer greater than 1.

TABLE 43: COLLATERAL INFORMATION GENERATED IN A FIRST STAGE
Variable             Description                              Scoring Criteria
{Steps}              Steps Required to Solve This Problem     Concise but sufficient
{Number of Steps}    Number of Steps Required to Solve        Count from {Steps}
{Complexity}         Item Complexity Rating                   Distribution of higher-level complexity scores
{Content Tags}       Item Content Tags                        Concise but sufficient
{Bloom.LLM}          Bloom's Taxonomy Classification          Distribution of higher Bloom scores

APPENDIX D: PROMPT SELECTION ALGORITHMS

D.1 Relative Difficulty Estimation – Non-Binary Limited Genetic Algorithm

This research deploys a multistage prompt selection procedure. The most common approach to prompt authoring is a design-and-debug method in which manually written prompts are passed to an LLM individually and the responses are graded by the prompter. Under this framework the top performing prompts are selected for use. However, it has been widely noted that prompt performance can be greatly improved through the use of prompt optimization algorithms. A common algorithm which has been explored to optimize prompt construction is a genetic algorithm (Martins et al., 2023; Tran et al., 2023; Wong et al., 2023; among others).

Prompt selection for the relative difficulty ranking in this paper will follow a similar approach by generating a set of parent prompts, evaluating those prompts, and comparing them against known outcomes. The top 50% of performing prompts will then "cross-breed," exchanging attributes and producing offspring. These offspring will then be evaluated and bred. This process will continue for a number of generations, until new offspring no longer appear to offer an improvement over the previous generation. As interacting with LLMs is costly in time and money, the population size is far smaller and the number of generations far fewer than typically used in a genetic algorithm.4 Final prompt selection will follow the protocol outlined below.

4 The number of generations suggested for genetic algorithms tends to be in the hundreds to thousands for many applications.
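Before the protocol itself, the sketch below illustrates the kind of cross-breeding step this genetic algorithm relies on: each prompt template is treated as a dictionary of attribute levels, and each child inherits every attribute from one of two parents at random, with duplicates dropped. The attribute names, population sizes, and scoring interface are illustrative assumptions rather than the exact implementation used here.

```python
# Minimal sketch of one genetic-algorithm step for prompt templates
# (illustrative attribute names and sizes).
import random

ATTRIBUTES = ["task_title", "persona", "task_introduction", "perspective",
              "n_shot", "task_approach", "shared_context", "question_naming",
              "item_context", "task_instructions", "task_output", "temperature"]

def crossbreed(parent_a: dict, parent_b: dict) -> dict:
    """Child takes each attribute level from one of the two parents at random."""
    return {attr: random.choice([parent_a[attr], parent_b[attr]]) for attr in ATTRIBUTES}

def next_generation(scored_parents, n_children=30):
    """scored_parents: list of (template_dict, accuracy). Keep top half, breed children."""
    survivors = [t for t, _ in sorted(scored_parents, key=lambda x: -x[1])]
    survivors = survivors[: max(2, len(survivors) // 2)]
    children = [crossbreed(*random.sample(survivors, 2)) for _ in range(n_children)]
    # Drop duplicate children (same combination of attribute levels).
    unique = {tuple(sorted(c.items())): c for c in children}
    return survivors, list(unique.values())

# Usage with two hypothetical scored parents (attribute levels given as table row indices).
p1 = {a: 0 for a in ATTRIBUTES}
p2 = {a: 1 for a in ATTRIBUTES}
survivors, children = next_generation([(p1, 0.61), (p2, 0.58)], n_children=5)
print(len(survivors), len(children))
```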
PROMPT SELECTION PROTOCOL

Using training data I take the following steps.

1. Initial Prompt Models: Prompt models were constructed by varying the features of the prompt model defined in Appendix B, with each non-default attribute randomly selected three times and the default attribute selected for the remaining pairs. This created an initial pool of twenty-four prompt models across the 12 attributes.

2. Generation 1: Using a small number (30) of random item pairs selected from high-contrast items (items in the top 50% of differences in difficulty, 19 percentage points different or greater), the items were evaluated for relative difficulty estimation. Prompts which did not get a parsable response were attempted two additional times to see if the response could be improved.

3. Generation 2: Using a genetic algorithm procedure, the top ten prompts from the last generation were selected for further evaluation. Additionally, approximately thirty children were cross-bred from these ten parents, with any duplicates removed. The surviving parents and roughly thirty children were then evaluated against 30 new random high-contrast item pairs in addition to the 30 random pairs evaluated in Generation 1.

4. Generations 3-4: The procedure outlined in Generation 2 was repeated for Generations 3 and 4, except that in addition to the thirty new high-contrast item pairs, an equal number of low-contrast pairs (items in the bottom 50% of differences in difficulty, less than 19 percentage points) were added.

5. Final Selection: Any remaining unevaluated item pairs in the training data are evaluated by the top ten performing templates. The top prompt from these ten is selected to be evaluated against the testing data.

D.2 Collateral Information and Model Answering Ability

For the generation of collateral information and the evaluation of the ability of diverse LLM models to solve items, this research will deploy an iterative improvement method. I will start with a base prompt and then score it on a subset of items. The performance score for item complexity and Bloom's taxonomy will be based on the following score function:

Score = - Σ_{g ∈ G} ( count(g) + count(unparsable) )²

where count(g) is the count of the item values in each of the categories g of G. When all of the item responses are interpretable, the maximum of this function occurs when there is an equal number of items assigned to each complexity score. Uninterpretable output is heavily penalized because it adds to the negative score factor for each group.

New features will be added to the base prompt. If the score improves, those features will be kept. If it does not improve, those features will be discarded. As part of this research, I will track the iterative changes to the collateral generating prompts over time.
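A minimal sketch of this score function, assuming the LLM responses have already been parsed into category labels, with None marking unparsable output; the category set and the example response lists are illustrative.

```python
# Minimal sketch of the D.2 score function: Score = -sum_g (count(g) + count(unparsable))^2
from collections import Counter

def prompt_score(responses, categories):
    """responses: parsed category labels, with None marking unparsable output."""
    counts = Counter(responses)
    unparsable = counts.get(None, 0)
    return -sum((counts.get(g, 0) + unparsable) ** 2 for g in categories)

# Illustrative complexity categories and parsed responses.
G = ["Very Low", "Low", "Medium", "High", "Very High"]
balanced = ["Very Low", "Low", "Medium", "High", "Very High"] * 2   # even spread
skewed = ["Medium"] * 10                                            # everything in one bin
with_junk = ["Medium"] * 8 + [None, None]                           # two unparsable responses

print(prompt_score(balanced, G), prompt_score(skewed, G), prompt_score(with_junk, G))
# Even spread scores highest (least negative); unparsable output is penalized in every bin.
```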
D.3 Prompt Performance Adjustments

Prompt design for evaluating the relative difficulty of items is elaborated in detail in Appendix B. However, this is only a plan. If the prompt generator template (meta-prompt) does not perform as expected on the training data, then I plan to adjust the template design with the hope of finding a better performing prompt. Any changes made in this way will be documented in the final paper.

FIGURE 9: COMPLEXITY TEMPLATE

How complex would the following problem be for a {grade-4}th/st grade student to solve?
{Item.Body}
Provide a complexity value for a {grade-4}th/st grader attempting to solve this problem as a response (Very Low Complexity, Low Complexity, Medium Complexity, High Complexity, Very High Complexity) by completing the following statement.
"The complexity of the problem is:"
Model Parameter: Temperature = 0.4

APPENDIX E: SENSITIVITY ANALYSIS

E.1 Alternative Model Specifications of Correct Classification

TABLE 44: LOGISTIC REGRESSION MODEL
The dependent variable in each model is whether the relative difficulty of the item pair was correctly predicted, estimated with logistic regression. Each model has 3,650 observations. *10% significance, **5% significance, ***1% significance.
Model 1 (Pseudo-R2 = 0.042): Intercept -0.256***, Second Harder 0.446***, |Diff Δ| 0.025***.
Model 2 (Pseudo-R2 = 0.046): Intercept -0.290***, Second Harder 0.440***, |Diff Δ| 0.025***, Grade: 12 0.371***, Grade: 8 0.007, Math -0.279**.
Model 3 (Pseudo-R2 = 0.047): Intercept -0.304***, Second Harder 0.499***, |Diff Δ| 0.025***, Grade: 12 0.360***, Grade: 8 -0.009, Math -0.275**, Alpha 1 -0.064**, Alpha 2 0.040*.
Model 4 (Pseudo-R2 = 0.052): Intercept -0.967***, Second Harder 0.510***, |Diff Δ| 0.023***, Grade: 12 0.378***, Grade: 8 0.028, Math -0.216**, Alpha 1 -0.069**, Alpha 2 0.036*, Complexity 1 0.032*, Complexity 2 0.018*, |Complexity Δ| 0.096***, Number of steps 1 0.035*, Number of steps 2 0.003, |Number of Steps Δ| 0.034*, Similarity -0.01.
Model 5 (Pseudo-R2 = 0.053): Intercept -0.240***, Second Harder 0.461***, |Diff Δ| -0.027**, |Diff Δ| : Complexity 1 0.002**, |Diff Δ| : Complexity 2 0.003***, |Diff Δ| : |Complexity Δ| 0.005***, |Diff Δ| : Number of steps 1 0.002**, |Diff Δ| : Number of steps 2 0.001, |Diff Δ| : |Number of steps Δ| 0.003**, |Diff Δ| : Alpha 1 -0.002*, |Diff Δ| : Alpha 2 0.000, |Diff Δ| : Similarity -0.001**.

TABLE 45: MULTILEVEL HIERARCHICAL MODEL PREDICTING CORRECT CLASSIFICATION
The dependent variable in each model is whether the relative difficulty of the item pair was correctly predicted. The table shows the results of a multilevel hierarchical model with a random coefficient at the |Diff Δ| grouping level; groups were defined by rounding |Diff Δ| to the nearest 5, with any values greater than 65 assigned to the same group. Explanatory variables have been standardized to ease interpretation. Each model has 3,650 observations. Pseudo-R2 values: Model 1 = 0.046, Model 2 = 0.050, Model 3 = 0.051, Model 4 = 0.057, Model 5 = 0.056. *10% significance, **5% significance, ***1% significance.
Coefficient estimates (rows as in Table 44: Intercept, Second Harder, |Diff Δ|, Grade: 12, Grade: 8, Math, Alpha 1, Alpha 2, Complexity 1, Complexity 2, |Complexity Δ|, Number of steps 1, Number of steps 2, |Number of Steps Δ|, Similarity, and the |Diff Δ| interaction terms):
0.564*** 0.085*** 0.077*** 0.000 -0.061** 0.087*** 0.572*** 0.558*** 0.551*** 0.568*** 0.085*** 0.075*** -0.004 -0.060** -0.017** 0.010* 0.080*** 0.080*** 0.004 -0.048** -0.017** 0.009* 0.018** 0.006* 0.042*** 0.011* 0.000 0.009* -0.074** 0.052*** 0.057*** 0.065*** 0.026** 0.007 0.020** -0.014** 0.001 -0.014*

TABLE 46: ROOT MEAN SQUARE ERROR AND MEAN ABSOLUTE ERRORS
Errors compare each grouping's difficulty estimates against the pooled joint estimation and against the known item parameters (1PL). All population parameters and difficulty estimates are standardized (mean = 0, var = 1).

Grouping                                 RMSE: Pooled   RMSE: 1PL   MAE: Pooled   MAE: 1PL
Pooled Joint Estimation                  0.000          0.363       0.000         0.311
Known Item Parameters (1PL)              0.363          0.000       0.311         0.000
All Students                             1.163          1.292       0.930         1.061
Gender: Male                             1.211          1.356       0.979         1.121
Gender: Female                           1.161          1.252       0.916         1.013
Race: White                              1.202          1.358       0.976         1.129
Race: Black                              1.103          1.104       0.866         0.883
Race: Hispanic                           1.109          1.190       0.880         0.959
Race: Asian/Pacific Islander             1.190          1.322       0.963         1.088
Location: City                           1.079          1.193       0.835         0.948
Location: Suburb                         1.098          1.234       0.852         0.979
Location: Town                           1.044          1.136       0.804         0.889
Location: Rural                          1.082          1.216       0.842         0.967
FRPL Eligibility: Eligible               1.065          1.109       0.834         0.893
FRPL Eligibility: Not eligible           1.116          1.242       0.886         1.011
FRPL Eligibility: Info. not available    1.238          1.413       1.015         1.194
Random normal variables                  1.410          1.410       1.120         1.120
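A minimal sketch of the two error measures reported in Table 46, computed between a set of group-estimated item difficulties and a reference set (both standardized); the array values below are illustrative rather than taken from the study data.

```python
# Minimal sketch: RMSE and MAE between estimated and reference item difficulties.
import math

def rmse(estimates, reference):
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimates, reference)) / len(estimates))

def mae(estimates, reference):
    return sum(abs(e - r) for e, r in zip(estimates, reference)) / len(estimates)

# Illustrative standardized difficulty values for a handful of items.
group_estimates = [-1.2, -0.4, 0.1, 0.8, 1.5]
pooled_reference = [-1.0, -0.5, 0.0, 0.9, 1.4]
print(round(rmse(group_estimates, pooled_reference), 3),
      round(mae(group_estimates, pooled_reference), 3))
```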
E.2 Exploring the Consistency of Rank Correlations Across Groups

Looking at the Spearman rank correlations among the 258 items used in the testing data, I find that the rank correlation of population performance on individual items averages 0.96 (Table 47). This high rank correlation is fairly remarkable when considering the 6.48-point average absolute performance difference that exists between demographic groups (Table 48). When looking at individual demographic pairs, the high rank correlations despite large performance differences appear striking. For example, White and Black students have a 15-point performance gap on average, yet the rank correlation between their item performances is still 0.95.

TABLE 47: RANK CORRELATION MATRIX OF POPULATION PERFORMANCE
This table shows the correlations between the rank difficulty of the 258 items in the testing data set evaluated in this dissertation.
Groups: All, White, Black, Hispanic, Asian/Pacific Islander, Male, Female, Mean.
All 0.99 0.97 0.97 0.93 0.99 0.99 0.97
White 0.99 0.95 0.96 0.93 0.99 0.98 0.97
0.97 0.90 0.95 0.97 0.95 Black Hispanic 0.97 0.95 0.97 0.96 0.97
Asian/Pacific Islander Male 0.99 0.93 0.99 0.93 0.95 0.90 0.96 0.93
Female Mean 0.97 0.97 0.95 0.96 0.99 0.98 0.97 0.97 0.93 0.96 0.97 0.96 0.91 0.96 0.96 0.91 0.94 0.92 0.94 0.96 0.92 0.96 0.97 0.96

TABLE 48: ABSOLUTE DIFFERENCE IN POPULATION PERFORMANCE
This table shows the average difference in performance by population group for the 258 items evaluated in this dissertation.
Groups: All, White, Black, Hispanic, Asian/Pacific Islander, Male, Female, Mean Absolute.
All 4.10 -10.85 -7.59 7.12 1.40 -1.13 4.60
White -4.10 -14.95 -11.70 2.82 -2.54 -5.12 5.89
3.26 17.59 12.31 9.73 9.81 Black Hispanic 10.85 14.95 7.59 11.70 -3.26
Asian/Pacific Islander Male -1.40 -7.12 -2.82 2.54 -12.31 -17.59 -9.11 -14.74
Female 1.13 5.12 -9.73 -6.54 14.74 9.11 6.54 7.56 5.47 -2.53 4.77 8.10 2.53 4.74 -5.47 -8.10 7.98
Mean Abs 4.60 5.89 9.81 7.56 7.98 4.77 4.74 6.48

APPENDIX F: ALGORITHMS ILLUSTRATED

FIGURE 10: ITEM SCRAPING ALGORITHM FLOW CHART
This flow chart shows the steps involved in collecting the NAEP item statistics and primary item content from online sources.

FIGURE 11: PAIRWISE TASK TEMPLATE SELECTION AND EVALUATION FLOW CHART
This flow chart shows a simplified representation of the training of top templates through evaluation on the training data, followed by selection of the top template, which is then evaluated on the testing data.
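As a rough outline of the flow shown in Figure 11, the sketch below scores candidate templates on training item pairs, keeps the most accurate one, and reports its accuracy on held-out testing pairs. The function names and the scoring call are placeholders for the actual prompt-evaluation machinery and are assumptions for illustration only.

```python
# Minimal sketch of the Figure 11 flow: evaluate templates on training pairs,
# select the top template, then evaluate it once on the testing pairs.
# `ask_llm_which_is_harder` is a placeholder for the real prompt/LLM call.

def accuracy(template, pairs, ask_llm_which_is_harder):
    """Share of item pairs whose harder item the template identifies correctly."""
    correct = sum(
        1 for item_1, item_2, harder in pairs
        if ask_llm_which_is_harder(template, item_1, item_2) == harder
    )
    return correct / len(pairs)

def select_and_evaluate(templates, train_pairs, test_pairs, ask_llm_which_is_harder):
    top = max(templates, key=lambda t: accuracy(t, train_pairs, ask_llm_which_is_harder))
    return top, accuracy(top, test_pairs, ask_llm_which_is_harder)

# Toy usage with a fake "LLM" that always answers 1 (for illustration only).
templates = ["template_a", "template_b"]
train_pairs = [("q1", "q2", 1), ("q3", "q4", 2)]
test_pairs = [("q5", "q6", 1)]
print(select_and_evaluate(templates, train_pairs, test_pairs, lambda t, a, b: 1))
```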