THE USE OF LARGE LANGUAGE MODELS TO PREDICT ITEM PROPERTIES
Calibrating items is a crucial yet costly requirement, both for new tests and for existing ones whose items become outdated through changing relevance or overexposure. Traditionally, calibration involves administering items to a large number of participants, a process that requires substantial time and resources. To reduce these costs, researchers have sought alternative calibration methods. Before the emergence of Large Language Models (LLMs), these methods relied mainly on expert opinion or on computational analysis of item features. Yet experts' accuracy in predicting item performance has been inconsistent, and computational approaches often struggle to capture the intricate semantic details of test items.

The emergence of LLMs might offer a new avenue for addressing the need for item calibration. These models, popularized by OpenAI (e.g., the GPT series), have shown remarkable abilities in mimicking complex human thought processes and performing advanced reasoning tasks. Their achievements in passing sophisticated exams and performing cross-language translation underline their potential. However, their capacity to predict item properties for test calibration has not been thoroughly investigated. Traditional calibration relies heavily on direct human involvement, such as pretesting and expert assessment, or on statistical modeling of item features through resource-intensive machine learning algorithms. This dissertation explores the potential of LLMs to predict item characteristics, tasks that have traditionally required human insight or complex statistical models. With the increasing accessibility of high-performance LLMs from organizations such as OpenAI, Meta, and Google, and through open-source platforms such as HuggingFace.com, there is promising ground for investigation. This study examines whether LLMs could replace human effort in item calibration tasks.

To evaluate the effectiveness of LLMs in predicting item properties, this dissertation implements a training and testing framework focused on assessing both the relative and absolute difficulty of items. It undertakes three investigations: first, examining the ability of LLMs to predict the relative difficulty of items; second, assessing the feasibility of using multiple LLMs as substitutes for test-takers and using their responses as predictors of item difficulty; and third, applying a search algorithm, guided by LLM predictions of relative difficulty, to ascertain absolute difficulties.

The findings indicate that the models are statistically significant predictors of relative item difficulty, but with only modest explanatory power (adjusted R-squared values of roughly 5-10%). However, using LLMs to predict relative item difficulty through pairwise comparisons proves more promising, achieving pairwise accuracy of about 62% and yielding correlations between the predictions and item difficulty ranging from 0.36 to 0.42. This suggests that while LLMs show potential in certain aspects of item calibration, their effectiveness varies with the specific task. These promising results warrant further exploration of the capabilities of LLMs for item calibration, which could lead to more efficient and cost-effective methods of test development and maintenance.
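As a rough, illustrative sketch (not the dissertation's actual implementation), the pairwise-comparison idea can be thought of as follows: an LLM is asked which of two items is harder, the verdicts are tallied into per-item "win" counts, and the resulting ordering is compared against calibrated difficulties. The function `judge_harder`, the win-count aggregation, and the Spearman-correlation check below are all assumptions made for illustration only.

```python
import itertools

from scipy.stats import spearmanr


def judge_harder(item_a: str, item_b: str) -> str:
    """Hypothetical LLM judgment: return "A" if item_a seems harder, else "B".

    In practice this would wrap a chat-completion request that shows the model
    both item texts and asks which one test-takers would find more difficult.
    """
    raise NotImplementedError("Replace with an actual LLM API call.")


def pairwise_win_counts(items: list[str]) -> list[int]:
    """Tally how many pairwise comparisons each item wins (is judged harder)."""
    wins = [0] * len(items)
    for i, j in itertools.combinations(range(len(items)), 2):
        if judge_harder(items[i], items[j]) == "A":
            wins[i] += 1
        else:
            wins[j] += 1
    return wins


def rank_agreement(items: list[str], calibrated_difficulties: list[float]) -> float:
    """Spearman correlation between LLM-derived win counts and known difficulties."""
    rho, _ = spearmanr(pairwise_win_counts(items), calibrated_difficulties)
    return rho
```

Under the same assumptions, relative verdicts of this kind could in principle be mapped to absolute difficulty estimates by searching for an item's position within an already-calibrated item bank, in the spirit of the third investigation described above.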
- In Collections: Electronic Theses & Dissertations
- Copyright Status: Attribution-ShareAlike 4.0 International
- Material Type: Theses
- Authors: Smart, Francis
- Thesis Advisors: Kelly, Kimberly
- Committee Members: Nye, Christopher; Alonzo, Alicia; Frank, Kenneth
- Date Published: 2024
- Subjects: Educational evaluation
- Program of Study: Measurement and Quantitative Methods - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 161 pages
- Permalink: https://doi.org/doi:10.25335/pnwc-mw35