TWO STUDIES ON ASSESSING AI-AUGMENTED CREATIVITY WITH LARGE LANGUAGE MODELS

By Jiaoping Chen

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Business Administration – Information Technology Management – Doctor of Philosophy

2025

ABSTRACT

Large Language Models (LLMs) have been increasingly integrated into a variety of tasks, facilitating human endeavors in generating creative outputs, ranging from product ideation to digital artwork. Such novel capabilities of LLMs have ushered in a new era of collaboration between humans and Artificial Intelligence (AI), which has grabbed the attention of researchers and practitioners alike. Thus, in this dissertation, I explore the intersection of emerging LLMs and creativity, with a primary focus on writing tasks. This dissertation includes two studies. In the first study, I examine the impact on perceived creativity of varying levels of generative capabilities of LLMs, namely randomness, which has been overlooked so far and which is manipulated via a quasi-experiment. I find that collaborating with an LLM with high randomness that generates more diverse advice does not necessarily lead to increased perceived creativity of work, as the role of humans matters. Moreover, I explore how the characteristics of human evaluators and their perceived extent of AI use influence their assessments of creativity. In the second study, I focus on growing concerns regarding the potential misuse of generative AI, particularly its capacity to produce plagiarized content. Motivated by the divergent thinking creativity literature using the Divergent Association Task (DAT), I construct DAT(Sent), a metric to proxy semantic dissimilarities within a document, and further propose an effective GPT detector classifier, GPT-DATector. I show that, on average, human-generated contents have a larger DAT(Sent) than AI-generated texts across different writing tasks and datasets. Empirical evaluations demonstrate that the proposed GPT-DATector outperforms state-of-the-art models in terms of prediction performance. Most importantly, GPT-DATector has the potential to reduce bias in the detection of AI-generated text.

ACKNOWLEDGMENTS

I am sincerely grateful to my dissertation chair, Professor Anjana Susarla, whose invaluable support and encouragement have guided me throughout this journey. She always believes in my abilities and gives me the confidence to persevere. Her example as a scholar—committed to continual learning, impactful research, and constant contributions to the academic community—has left a lasting influence on me.

I would like to express my gratitude to my committee members, Professor Chenhui Guo, Professor Quan Zhang, and Professor Musaib Ashraf, for their insightful feedback and generous time. Their guidance has significantly strengthened this dissertation. I also extend my appreciation to Professor Laura Brandimarte and Professor Uttara M. Ananthakrishnan for their mentorship and support throughout my academic path.

It has been an honor to be part of the Eli Broad College of Business at Michigan State University over the past six years. I am grateful for the financial support provided by the Department of Accounting and Information Systems at the Eli Broad College of Business.
I also wish to thank the Center for Ethical and Socially Responsible Leadership (CESRL) at Michigan State University for the research funding that supported the data collection and cloud-based resources essential to this dissertation. Moreover, I want to extend my appreciation to the faculty, staff, my fellow student Krishna Pothugunta, and my office mate Li Zhang from the AIS department. Their support has been important throughout my time in the program. I am especially thankful to my lunchmates, Shuting Wu, Tom Shang, and Sangmok Lee, for their companionship and friendship, which brought encouragement and laughter along the way. I am also deeply grateful to all my friends who have offered their support throughout my Ph.D. studies.

Most importantly, I owe my deepest thanks to my parents, Yuying Ou and Chaoran Chen. Their unconditional love and support have been the foundation of my personal and academic growth. They gave me the freedom to explore, taught me to take responsibility for my choices, and instilled in me the importance of staying positive and empathetic. Their guidance has shaped how I see the world—with curiosity, openness, and care. Lastly, I thank my family, Di Wu and Jackson Wu, who have given me immense love and countless cherished memories throughout my Ph.D. journey. They have stood by me during the challenges and celebrated with me during the milestones, making this journey even more meaningful. Their presence in my life keeps reminding me of the warmth of love and family.

These years of pursuing my Ph.D. have been a journey of personal growth. It has taught me how to rise after each setback, how to navigate moments of solitude, and how to learn quickly and master critical knowledge. Throughout this process, I have constantly shifted between vulnerability and resilience, at times feeling discouraged by stagnant research, while at other times gaining renewed confidence in myself. This journey has required constant self-reflection and honest conversations with myself, helping me recalibrate and continue forward. In the end, despite the challenges, I made it through. Reflecting on these years, I feel more resilient, more open-minded, and more committed to lifelong learning than when I began. I will forever be grateful for my time in Michigan, and it will always hold a special place in my heart.

TABLE OF CONTENTS

CHAPTER 1 Do Large Language Models' Generative Capabilities Boost Creativity? Assessing AI-Augmented Creativity With LLMs
1.1 Introduction
1.2 Related Literature
1.3 Hypotheses Development
1.4 Data, Measures and Models Specification
1.5 Empirical Results
1.6 Discussion and Implications
1.7 Conclusion and Future Research

CHAPTER 2 GPT-DATector: Increasing Accuracy And Decreasing Bias In GPT Detectors Using Creativity Measures
2.1 Introduction
2.2 Related Literature
2.3 Methods and Datasets
2.4 Empirical Results
2.5 Discussion and Conclusion

BIBLIOGRAPHY

APPENDIX
CHAPTER 1
Do Large Language Models' Generative Capabilities Boost Creativity? Assessing AI-Augmented Creativity With LLMs

1.1 Introduction

Large-scale generative language models (LLMs), trained on vast quantities of data, have been adapted to a wide range of downstream applications such as bar exams (Katz et al., 2024) and academic assignments (Stokel-Walker, 2022). Along with the impact of LLMs in enhancing performance on objective, well-defined tasks, a more complex and underexplored domain is understanding their transformative role in human-AI collaborations for subjective, open-ended tasks. One such domain is the creative industries, which encourage more novel and unexpected content. Prior research has begun to explore facilitating human endeavors in generating creative outputs, ranging from product ideation to problem-solving (Wang, Yang, and Sun, 2023; Boussioux et al., 2024). However, less attention has been paid to narrative writing, arguably one of the activities that best represents and distinguishes human intelligence (Arriagada, 2020). Therefore, in this study, we mainly focus on human-AI collaboration in the context of writing creative narratives, particularly examining how three elements (AI artifacts, human creators, and human-AI collaboration modes) affect the perceived creativity of work.

The literature discusses how adopting AI artifacts (e.g., LLM adoption) affects work performance, mostly by perceiving LLMs as static assistants used in the default mode (Fügener et al., 2022; Chen and Chan, 2024). For instance, prior studies examine the extent of creative output produced by AI alone or human-AI collaborations (Wang, Yang, and Sun, 2023; Zhou and Lee, 2024). However, as generative models, LLMs possess a key capability: generating more or less divergent outputs for a given input. Such generative capability can be systematically modulated through the adjustment of model parameters (such as the temperature parameter), thereby designing AI assistants with differentiated capabilities. Despite growing interest in addressing the importance of harnessing LLMs' generative capabilities to power human-AI interaction designs (Wang et al., 2021; Lee, Liang, and Yang, 2022), limited empirical evidence focuses on how various LLMs' generative capabilities influence the perceived creativity of co-produced work. Specifically, in the context of creative writing, to what extent does the LLMs' capability to generate more or less diverse responses affect the perceived creativity of the work? Therefore, to comprehensively explore the impact of AI artifacts with varying LLMs' inherent generative capabilities, particularly randomness, we raise RQ1(a): In the human-AI collaboration process, how do varying generative capabilities of LLMs affect the perceived creativity of work as evaluated by external evaluators?
Importantly, while some people appreciate AI's potential to enhance human productivity (Logg, Minson, and Moore, 2019), others may not be as welcoming to this new type of collaborator: they may feel threatened in their capabilities as humans, resulting in distrust of the output algorithms produce and in what some refer to as algorithm aversion (Dietvorst, Simmons, and Massey, 2018). This aversion is revealed as a hesitation to accept and use advice from algorithms, despite their advice often being fairly good (Kleinmuntz, 1990). In particular, little is known about the impact of humans' algorithm aversion during human-AI collaboration (Brynjolfsson, 2023). Thus, beyond the impact of AI artifacts (e.g., those with varying inherent generative capabilities), we further explore the moderating role of human collaborators' algorithm aversion (e.g., the extent to which human creators utilize LLMs' advice). Accordingly, we ask RQ1(b): In the human-AI collaboration process, how does human creators' algorithm aversion moderate the effect of LLMs' generative capabilities on perceived creativity of work?

Next, we turn our attention to how variations in human-AI collaboration patterns shape creative outcomes. While prior research has primarily drawn on human-IT collaboration and algorithm interactions (Bauer, von Zahn, and Hinz, 2023; Revilla et al., 2023; Shaikh and Vaast, 2023), empirical investigations have largely focused on outcome-based evaluations such as task performance (Zhou and Lee, 2024; Vaccaro, Almaatouq, and Malone, 2024) and user experience (Jakesch et al., 2023; Mirowski et al., 2023; Bauer, von Zahn, and Hinz, 2023). Yet, as AI systems exhibit increasingly generative and interactive capacities that resemble human behavior, questions emerge regarding the applicability of frameworks rooted in human-human collaboration (Goffman, 2017). Specifically, the interaction order literature emphasizes that social situational factors (e.g., autonomy) dominate over individual social structures (e.g., demographics like race). Extending this perspective to human-AI collaboration requires capturing evolving interactional patterns across varying contexts. To this end, we draw on the literature on human-human dyadic collaboration (Damon and Phelps, 1989; Storch, 2002), which identifies four distinct interaction patterns based on levels of equality (i.e., balanced contribution) and mutuality (i.e., interactive engagement). By substituting one human collaborator with an LLM, we seek to explore RQ1(c): In the human-AI collaboration process, for a given level of LLMs' generative capabilities and algorithm aversion among human creators, which interaction pattern(s) result in the highest perceived creativity of the work?

Last, human-AI interactions can alter the ways in which opinions are formed (Jakesch et al., 2023), including in human-involved evaluation processes. As more users engage with LLMs in daily tasks, it is unclear how these experiences may shift what people consider creative. Beyond the human-AI co-creation process, in this study we also examine the human evaluation process by investigating the effect of two individual characteristics: 1) human evaluators' familiarity with LLMs, and 2) their perception of AI use when evaluating work without knowing its origin. Thus, we ask RQ2: Do human evaluators' experience with LLMs and perceived AI use by writers affect the perceived creativity of the work?
To answer these questions, we employ a structured experimental framework by focusing on creative writing tasks under two distinct conditions—low vs high randomness—in the human-AI collaboration process. Randomness is one of the main characteristics that affect LLMs' capabilities: high randomness allows LLMs to generate more diverse advice, while low randomness tends to cause LLMs to generate less varied and more repetitive suggestions. We further design and conduct an online survey to assess the perceived creativity of co-generated narratives by external evaluators.

Our findings provide significant insights into human-AI interaction by considering both collaboration and external evaluation processes. From the human-AI collaboration perspective, we find that simply collaborating with LLMs with greater generative capabilities (i.e., high-randomness LLMs where the AI generates more diverse suggestions) does not lead to increased perceived creativity of work compared to collaborating with low-randomness LLMs. However, the moderating role of human collaborators matters. In high-randomness settings, reduced algorithm aversion, through greater utilization of AI-generated advice, is associated with higher perceived creativity of outputs. Conversely, in low-randomness settings (characterized by less diverse AI recommendations), diminished algorithm aversion correlates with lower perceived creativity. When it comes to varying human-AI collaboration patterns, we find that a human creator with low algorithm aversion, combined with high mutuality and human-led collaboration, is associated with the highest perceived creativity in work. From the human evaluation perspective, we find human evaluators who are more experienced with LLMs or perceive greater AI use (even though the nature of writers is not disclosed) tend to consider narratives as more creative.

We contribute to the growing literature related to human-AI augmented creativity by studying the consequences of collaborating with LLMs with varying inherent generative capabilities. Such findings have implications for efficient deployments of AI assistance in work involving creative tasks, which complement existing research on the synergistic impact of AI and humans (Kane et al., 2021; Chen and Chan, 2024; Boussioux et al., 2024). We also contribute to the human-AI collaboration literature by building up theoretical frameworks motivated by the human-human collaboration literature (Damon and Phelps, 1989; Storch, 2002). We show that greater interactions between human collaborators and AI, particularly when human-led, lead to higher perceived creativity in co-created work. Moreover, our study's findings provide additional evidence for the human evaluation process by considering two individual-level AI-relevant characteristics (i.e., people's experience with LLMs and their perceived AI use of the work). Our work shows some level of human favoritism from the evaluators' side when evaluators are uninformed about the origins of the work they assess, which differs from existing work in which evaluators know whether the work they assess is AI-generated (Yin, Jia, and Wakslak, 2024).

1.2 Related Literature

1.2.1 LLMs' Generative Capabilities During Human-AI Collaboration

LLMs, built on transformer-based learning models trained on vast internet-sourced datasets (e.g., Wikipedia and Google Images) (Minaee et al., 2024), represent a paradigm shift in artificial intelligence.
LLMs operate by learning probability distributions over high-dimensional semantic spaces, enabling them to generate novel, contextually relevant outputs. Such generative capacity positions LLMs as unique tools for augmenting human work, particularly in creative and knowledge-intensive tasks where variability and adaptability are critical (Brynjolfsson, 2023).

Distinguishing model capabilities: Overall capabilities versus generative capabilities   In this study, we first distinguish between LLMs' overall capabilities (differences in performance across model versions of LLMs, such as GPT-4 vs GPT-3) and LLMs' generative capabilities (the ability of a specific LLM, like GPT-3, to govern output diversity by adjusting parameters within the model). The former, LLMs' overall capabilities, has dominated researchers' attention, with studies demonstrating the positive correlation between the size of a language model (i.e., the number of its parameters) and its performance (Brown et al., 2020). Allegedly comprising 1.76 trillion parameters—nearly ten times the 175 billion parameters of GPT-3.5—GPT-4 shows significantly superior performance across a range of academic and professional benchmarks, reflecting substantial gains in natural language processing capabilities attributable to its increased model scale (Brynjolfsson, 2023). However, few studies investigate the influence of the latter, LLMs' inherent generative capabilities, on shaping work performance.

Despite rapid advancements in LLMs, recent literature addresses a paradox in human-AI collaboration: improvements in the first term, LLMs' overall capabilities, do not consistently translate into better collaborative outcomes (Noy and Zhang, 2023; Lee et al., 2023; Chen and Chan, 2024; Li et al., 2024; Boussioux et al., 2024). For example, Lee et al. (2023) find that more advanced LLMs do not consistently enhance joint performance across tasks, while Li et al. (2024) report only a slight gain when users collaborate with GPT-4 versus GPT-3.5. These findings suggest that technological improvements in LLMs' overall capabilities yield limited returns in human-AI collaboration contexts, indicating a complex interaction between the capabilities of AI models and human factors (Noy and Zhang, 2023). While prior work highlights the paradox of LLMs' overall capabilities within human-AI collaboration, it remains unclear whether similar findings exist when it comes to the second term, LLMs' inherent generative capabilities.

The underexplored frontier: LLMs' generative capabilities in the human-AI collaboration   Existing literature has examined LLMs' generative capabilities in isolation, often manipulating parameters such as temperature to assess output diversity in tasks like narrative writing and design (Bellemare-Pepin et al., 2024; Peeperkorn et al., 2024). Research by Ma et al. (2024) explores these capabilities by comparing design solutions generated independently by AI and humans. These studies conceptualize AI as an independent agent and show that higher temperature values generally enhance creativity in AI-generated content. Yet, the impact of LLMs' generative capabilities within human-AI collaboration remains underexplored, despite this being the more prevalent context where the work is co-created through dynamic human-AI interaction. Therefore, our study investigates how varying LLMs' inherent generative capabilities affect human-AI co-creation.
Resolving this is important for theorizing how LLMs can be designed and deployed to complement, rather than conflict with, human creative processes. Specifically, we manipulate LLMs' generative capabilities using two adjustable parameters that can significantly affect the randomness of the text produced by LLMs: temperature and frequency penalty.1 On one hand, "temperature" in GPT acts as a control parameter for the randomness in generating text by flattening or sharpening the probability distribution over tokens. Higher temperature settings yield a flatter probability distribution, increasing lexical diversity and generating less predictable, more divergent text. In contrast, lower temperatures concentrate probability mass on high-likelihood tokens, resulting in more deterministic and predictable outputs. On the other hand, "frequency penalty" discourages repetition by reducing the likelihood of reusing previously generated tokens, thereby indirectly enhancing output diversity and promoting greater variation in the generated text.

1 https://platform.openai.com/docs/api-reference/audio
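To make this manipulation concrete, the sketch below shows how such randomness settings can be requested from a public LLM API. It is a minimal illustration using the current OpenAI Python SDK rather than the original CoAuthor GPT-3 setup; the model name and prompt wording are placeholders, while the two parameter pairs follow the low- and high-randomness settings reported later in Table 1.3 (T=0.3, FP=0 vs T=0.75, FP=1).

```python
# Minimal sketch (not the study's original code): sampling writing
# suggestions under low- vs high-randomness decoding settings.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SETTINGS = {
    "low_randomness": {"temperature": 0.3, "frequency_penalty": 0.0},
    "high_randomness": {"temperature": 0.75, "frequency_penalty": 1.0},
}

def get_suggestions(story_so_far: str, condition: str, n: int = 5) -> list[str]:
    """Request n continuation suggestions under a given randomness condition."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the CoAuthor study used GPT-3
        messages=[{"role": "user",
                   "content": f"Continue this story:\n\n{story_so_far}"}],
        n=n,  # several samples make the diversity difference visible
        max_tokens=30,
        **SETTINGS[condition],
    )
    return [choice.message.content for choice in response.choices]
```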
Moreover, differentiating from prior work using objective assessment (Lee, Liang, and Yang, 2022), we propose a subjective assessment framework for creative narrative evaluation, which is crucial for creative tasks like narrative writing.

1.2.2 IT Affordance and Human Creators' Algorithm Aversion During Human-AI Collaboration

Recent studies examine the impacts of integrating AI into human-centric workflows (e.g., Bauer, von Zahn, and Hinz (2023)). There have been experimental studies (e.g., Wang, Yang, and Sun (2023); Noy and Zhang (2023)) as well as field studies (e.g., Brynjolfsson (2023)), demonstrating that adopting LLMs into humans' work can significantly enhance creative and operational performance. However, the IT affordance literature shows that such benefits are contingent on how users interact with IT artifacts (e.g., AI or LLMs) over time (Markus and Silver, 2008; Majchrzak and Markus, 2012). Improved performance through IT use arises not automatically but through users' selective enactment of affordances. Individuals tend to exhibit algorithm aversion (Dietvorst, Simmons, and Massey, 2015, 2018), judging algorithms negatively a priori, thereby limiting their potential impact. Algorithm aversion is particularly common among experts, whose domain knowledge is often associated with reduced reliance on automated systems (Whitecotton, 1996; Commerford et al., 2022). This reluctance is partly driven by unrecognized overconfidence in intuitive judgment, a cognitive bias that diminishes perceived value in algorithmic support (Eining et al., 1997; Sieck and Arkes, 2005).

1.2.3 Human Evaluators' Characteristics During the Evaluation Process

Next, we examine the evaluation process by reviewing literature on two key characteristics of human evaluators: their experience with algorithms and the influence of human favoritism.

Experience with algorithms   Prior literature differentiates between experience with the decision domain (experts in terms of domain knowledge) and specific experience with algorithmic decision aids (experts in terms of working with algorithms) (Burton, Stein, and Jensen, 2020), yet most research investigates the former, domain expertise. Existing literature documents experts' tendency to prefer human judgment over algorithms in fields where individuals possess professional expertise (Montazemi, 1991; Whitecotton, 1996). Such bias is evident in fields like auditing and radiology, where professionals often disregard algorithmic evidence, particularly when it conflicts with their initial assessments and the AI process is opaque (Commerford et al., 2022; Lebovitz, Lifshitz-Assaf, and Levina, 2022). However, our understanding of how the latter, experience with algorithmic decision aids, affects evaluative judgments remains limited.

Existing research on experience with algorithmic decision aids has explored how it influences trust and usage; however, much of this work has focused on users who directly collaborate with these systems, rather than those who evaluate their outputs. Specifically, studies suggest that human collaborators—individuals who interact directly with algorithmic decision aids—may develop greater trust and reliance on these systems as they become more familiar with them (e.g., Burton, Stein, and Jensen (2020)). However, the evidence is mixed and highly dependent on context (Turel and Kalhan, 2023). Notably, this body of work has paid relatively little attention to human evaluators, those tasked with judging the quality of algorithm-generated content. These evaluators may form their judgments through different cognitive and experiential processes. As LLMs become increasingly embedded in communication and decision-making, it is critical to understand how evaluators' prior experience with these systems shapes their subjective assessments, particularly in domains like creativity where human judgment plays a central role (Jakesch et al., 2023). Thus, while experience has been shown to affect how collaborators engage with algorithmic systems, we still know little about how it influences evaluators' judgments—an important gap given the growing presence of LLMs in shaping human opinion.

Human favoritism   Recent literature suggests that human favoritism underlies biased evaluations against AI-generated content, particularly in subjective tasks (Castelo, Bos, and Lehmann, 2019; Morewedge, 2022). Individuals consistently rate creative outputs (e.g., paintings, poetry, and news articles) as less favorable when attributed to AI (Clerwall, 2017; Ragot, Martin, and Cojean, 2020; Köbis and Mossink, 2021). Similar findings exist for interpersonal communication, where AI-authored empathetic messages lose impact upon disclosure (Yin, Jia, and Wakslak, 2024), and for content quality assessments, which improve with human attribution (Zhang and Gosline, 2023; Millet et al., 2023).

However, there may be scenarios where humans do not disclose their use of AI, and the content created by a human and/or an AI becomes indistinguishable. When facing uncertainty about whether content is generated by humans, AI, or their collaboration, people often rely on quick, intuitive, and heuristic judgments to assess its quality (Hafenbrädl et al., 2016; Jarrahi, 2018). Our understanding of this phenomenon remains incomplete, especially in the case where AI assistance is undisclosed. Therefore, this study seeks to examine whether a bias favoring human-generated content persists even when the authorship source is unknown. Specifically, we investigate whether an increased perception of AI involvement potentially leads to a diminished perceived creativity of the work.

1.2.4 Narrative Creativity

Writing can be characterized as a creative process where the writer actively engages with the evolving text (Emig, 1971).
This is because during this process, writers act as creative thinkers and problem-solvers who manage task constraints while utilizing the creative and linguistic resources available (Sharples, 2002; D'Souza, 2021). The literature has examined the adoption of LLM writing assistants in various creative tasks, viewing LLMs not merely as tools for word prediction or correction but as active co-authors (Lee, Liang, and Yang, 2022; Yang et al., 2022; Yuan et al., 2022). Design characteristics for better interaction with writing assistants can support inspiration (Wang, Yang, and Sun, 2023; Bhat et al., 2023; Lee, Liang, and Yang, 2022), language proficiency (Buschek, Zürn, and Eiband, 2021), shorter and more predictable texts (Arnold, Chauncey, and Gajos, 2020), more standard phrase usage (Buschek, Zürn, and Eiband, 2021), or creative writing (Clark et al., 2018; Yuan et al., 2022). Bhat et al. (2023) examine how writers assess the suggestions provided and incorporate them into various cognitive writing processes.

Table 1.1 Related literature on human-AI collaboration and creativity. For each study, the columns record: (1) the role of LLMs (LLMs' generative capabilities); (2) the role of human collaborators (human collaborators' AI use, measured either as a binary, with AI vs without AI, or as a continuum, the extent of AI use); (3) the role of varying human-AI collaboration patterns (dyadic collaboration modes such as mutuality and equality); and (4) whether human collaborators' characteristics (CC) and/or evaluators' characteristics (EC) are considered.

Wang et al. (2023): (1) yes (AI-only, without human-AI collaboration); (2) binary; (3) no; (4) only CC
Bellemare-Pepin et al. (2024): (1) yes (AI-only, without human-AI collaboration); (2) binary; (3) no; (4) only CC
Lee et al. (2022): (1) yes (in the human-AI collaboration); (2) continuum (as a dependent variable); (3) yes (views each variable independently as a dependent variable); (4) only CC
Brynjolfsson et al. (2023): (1) no; (2) binary; (3) no; (4) only CC
Jakesch et al. (2023): (1) no; (2) binary; (3) no; (4) only CC
Noy and Zhang (2023): (1) no; (2) binary; (3) no; (4) both
Zhang and Gosline (2023): (1) no; (2) binary; (3) no; (4) both
Yin et al. (2024): (1) no; (2) binary; (3) no; (4) only EC
Millet et al. (2023): (1) no; (2) binary; (3) no; (4) only EC
Ragot et al. (2020): (1) no; (2) binary; (3) no; (4) only EC
Köbis and Mossink (2021): (1) no; (2) binary; (3) no; (4) only EC
Logg et al. (2019): (1) no; (2) continuum; (3) no; (4) only CC
Turel and Kalhan (2023): (1) no; (2) continuum; (3) no; (4) only CC
This study: (1) yes (with the human-AI collaboration); (2) continuum (as a moderator); (3) yes (defines four types of human-AI collaboration modes based on these two variables); (4) both

We summarize the empirical studies on human-AI collaboration in Table 1.1.

1.3 Hypotheses Development

We begin this section by outlining the research framework, followed by the development of hypotheses corresponding to each research question. Figure 1.1 summarizes our research framework, which serves as a roadmap for hypothesis development. Despite growing interest in human-AI co-creation, existing research on narrative creativity has not systematically examined how intrinsic characteristics of LLMs, such as their generative capabilities, interact with human creative processes to shape the perceived creativity of co-produced outputs. This gap is particularly crucial given that the so-called "black box" nature of LLMs may obscure important theoretical and practical implications for how AI augments human creativity. To address this, we investigate the relationship between inherent parameters of LLMs (specifically, randomness) and subjective evaluations of creative output.
Our work thereby bridges computational specifications and creative judgments, offering empirical insight into how the generative properties of LLMs influence perceived creativity in co-created narratives.

In addition, we examine algorithm aversion in human-LLM collaboration by moving beyond the traditional expert/non-expert lens. Specifically, we investigate how individuals differentially utilize LLM-generated advice during creative tasks, where empirical evidence remains limited. While some collaborators avoid algorithmic input despite its potential benefits, others may over-rely on it, potentially stifling creativity.

While prior literature has examined the effects of disclosed AI authorship (Zhang and Gosline, 2023) or human authorship (Gnewuch et al., 2024), many real-world settings involve ambiguity around AI involvement. When the origin of content is unknown, individuals often rely on heuristic judgments (Hafenbrädl et al., 2016; Jarrahi, 2018). Therefore, our work complements this stream of literature by examining whether perceived AI involvement lowers perceived creativity even when authorship is undisclosed. We investigate if a human-favoring bias persists under uncertainty, offering new insight into evaluative dynamics in AI-assisted creative contexts.

We first focus on the human-AI co-creation and examine whether LLMs' generative capability affects the perceived creativity of the final co-created work (H1a), and how human creators' utilization of LLMs' suggestions moderates this relationship (H1b). Next, we study how human-AI collaboration patterns shape the perceived creativity of work (H1c), controlling for LLMs' generative capabilities and algorithm aversion among human creators. Finally, we turn to the human evaluation process and examine how human evaluators' characteristics (i.e., their familiarity with LLMs and perception of AI use) influence the perceived creativity of work (H2a, H2b). Hypotheses are developed in the following subsections.

Figure 1.1 Research framework.

1.3.1 Within the Human-AI Collaboration Process

In creative writing, the generation of creative narratives involves not only organizing ideas into coherent plots but also writing stories with novelty and originality (McKee, 1997; Amabile, 2018). When LLMs are set to high randomness, they produce more diverse and unexpected suggestions (Roemmele and Gordon, 2018). When interacting with such suggestions, this greater diversity can augment human writers' divergent thinking—an essential component of creative writing (Runco and Acar, 2012)—by expanding their cognitive repertoire, stimulating curiosity, and encouraging the exploration of alternative approaches. In this context, human writers can integrate or synthesize these unique suggestions into their narratives, thereby enhancing the overall perceived creativity of the stories.

In contrast, when LLMs operate in a low randomness setting, their generated responses tend to be more predictable and conservative. Such outputs reinforce established patterns and limit the availability of novel ideas, which in turn encourages writers to adhere to familiar cognitive pathways rather than exploring unconventional perspectives. This reduction in cognitive variability has been shown to diminish the potential for creative ideation (Chakrabarty et al., 2024). As a result, the overall creativity of the narratives may diminish when human writers collaborate with low-randomness LLMs, compared to those generated in collaboration with high-randomness models.
Accordingly, we hypothesize that in the context of human-AI co-creation, configuring LLMs with high randomness will serve as a catalyst for more innovative narrative construction, yielding stories that external evaluators judge as more creative.

H1a: Collaborating with high randomness LLMs, which generate more diverse suggestions, leads to increased perceived creativity of work as assessed by external evaluators, compared to collaborating with low randomness LLMs.

While high-randomness LLMs can offer a wide range of novel suggestions, human creators ultimately retain the right to decide which ideas to adopt or discard during the creation process. When it comes to tasks related to creative intelligence, individuals tend to exhibit algorithm aversion, stemming from concerns about personal identity and the longstanding belief that creativity is uniquely human (Morewedge, 2022). Thus, although collaborating with high-randomness LLMs exposes creators to a diverse range of unexpected suggestions, the enhanced perceived creativity of the final work depends on their actual integration of these unconventional ideas with their own insights.

Conversely, when collaborating with low-randomness LLMs, human creators tend to receive a narrower set of conventional suggestions that may discourage further creative exploration. According to cognitive miser theory (Orbell and Dawes, 1991), individuals prefer to solve problems in simpler and less effortful ways rather than in more complex and demanding ones. As the ease of accessing assistance from LLMs increases, people who heavily rely on AI to complete tasks might choose to reduce their engagement and efforts in the human-AI collaboration process. As a result, in the context of creative writing, heavy reliance on the outputs of low-randomness LLMs can inadvertently restrict creative exploration and diminish the perceived creativity of the final output. Therefore, we hypothesize that the effect of LLMs' generative capabilities (i.e., high vs low randomness) on the perceived creativity of work is moderated by the level of humans' utilization of LLMs' generated advice. Specifically,

H1b: Greater utilization of advice generated by high (low) randomness LLMs leads to increased (decreased) perceived creativity of work as assessed by external evaluators.

Given that human creators tend to exhibit comparable levels of algorithm aversion, the nature of human-AI collaboration itself may play an important role in facilitating creative outcomes. Mutuality, defined as the iterative, active engagement between humans and AI, allows humans to dynamically interact with AI-generated suggestions. When human creators keep interacting with AI outputs, they can adapt and incorporate a diverse range of ideas, thereby establishing a productive feedback loop that enhances the overall creativity of work. Thus, a high level of mutuality (i.e., frequently seeking AI assistance and making deliberate selections) facilitates the merging of human intuition with AI's innovative suggestions, leading to greater creative work compared to a passive, one-off use of AI systems (Wan et al., 2024; Boussioux et al., 2024). In contrast, limited interaction between humans and AI restricts this iterative process, hindering the effective transformation of AI suggestions into innovative solutions and ultimately diminishing the perceived creativity of work.
In creative writing tasks, for instance, human writers who actively collaborate with high-randomness LLMs are better positioned to explore a wide range of novel ideas and integrate these insights into their work. Thus, a high degree of mutuality in the human-AI collaboration tends to result in outputs that are perceived as more creative.

Furthermore, certain essential aspects of creativity production, such as the courage to face uncertainty and adversity, are uniquely human and beyond the capabilities of AI (May, 1994). The courage to create stems from existential struggles and personal experience, which AI may inherently lack. Although advanced language models like Claude and GPT-4 have demonstrated competence in divergent thinking and problem-solving, they still underperform in creative writing tasks (Sun et al., 2024). However, when humans take an active, leading role (i.e., revising AI-generated outputs), they can infuse the collaborative process with these uniquely human qualities. Therefore, on top of greater mutuality/interaction between humans and AI, such human-led collaboration not only leverages AI's generative potential but also integrates the creator's characteristics related to creativity production (i.e., the essential courage to create), enhancing the perceived creativity of the final output (Lockhart, 2024). Accordingly, assuming the same level of LLMs' generative capabilities and human creators' algorithm aversion, we hypothesize that:

H1c: For a given level of LLMs' generative capabilities and human creators' algorithm aversion, the high-mutuality, high-human-led collaboration mode yields the highest perceived creativity compared to the low-mutuality, low-human-led collaboration.

1.3.2 Within the Human Evaluation Process

Human evaluators who are more familiar with LLMs tend to have a personal trait of openness to experience with technologies, which enhances their appreciation for creative collaborative work. According to Rogers, Singhal, and Quinlan (2014), innovators and early adopters of new technology possess specific traits that make them more receptive to innovations. These traits include open-mindedness and a positive attitude towards change. In our context, people who are more familiar with a new technology (i.e., LLMs) tend to be more open-minded, implying they are curious and open to exploring new ideas and technologies. Open-mindedness can be considered the opposite of algorithm aversion. Open-minded individuals are more likely to evaluate written work based on its inherent qualities (i.e., perceived creativity), regardless of whether it was generated by LLMs. In contrast, those who are less open-minded may harbor prejudices against LLMs, often assigning lower ratings to content that is perceived to be mostly generated by LLMs. Therefore, individuals with more familiarity with LLMs tend to appreciate more the expressive qualities of narratives without concern for the nature of their authorship (human or LLM), leading to a greater appreciation of perceived creativity in the works they assess.

H2a: People who are more familiar with LLMs tend to perceive greater creativity of work, compared to those who are less familiar with LLMs.

When humans act as evaluators who assess the perceived creativity of written work, they are more likely to feel their identity is threatened in areas that are important to their personal identity, such as creative intelligence (Morewedge, 2022).
Furthermore, Ornes (2019) argues that computers challenge the notion of creativity, once thought to be uniquely human. In our context, when it comes to evaluating narratives, creativity serves as a way to demonstrate human uniqueness and intelligence. Thus, despite the nature of the writer (human or LLM) being undisclosed, an increased suspicion of AI involvement in a creative narrative may intensify evaluators' feelings of threatened identity. This, in turn, could lead to a manifestation of algorithm aversion, which in our context is operationalized as diminished perceived creativity of the work.

H2b: When evaluators suspect significant AI involvement in written work, despite the source being undisclosed, they tend to appreciate the work less, perceiving it as less creative.

1.4 Data, Measures and Models Specification

1.4.1 Data

To answer these questions, we employ a quasi-experimental framework by focusing on creative writing tasks under two distinct conditions—low vs high randomness—during the human-AI co-creation. Specifically, we use the CoAuthor dataset (Lee, Liang, and Yang, 2022), which is designed to reveal GPT-3's generative capabilities for human-AI interactive writing. This dataset includes a creative writing task, which involves 10 creative writing prompts from the Writing-Prompts subreddit (see Online Appendix Table A1). The experimental design used by the authors to create the dataset provides an ideal setting to study the impact of AI randomness on creativity. Each writer was randomly assigned to a GPT exhibiting either low or high randomness, set by varying two decoding parameters in the model: temperature and frequency penalty. Thus, for each story, we construct a variable, HighRand, to represent the randomness level of the GPT setting, which is 1 for high randomness conditions and 0 for conditions of low randomness.

Each session started with a prompt, and writers could freely write, request suggestions from GPT-3, accept or dismiss suggestions, and edit accepted suggestions (see Online Appendix Figure A1). Each writer could write a maximum of five stories per prompt. For every writing session, which corresponds to a specific combination of writer and prompt, a GPT configuration was randomly assigned. In other words, the same writer could be assigned to either a high or low LLM randomness setting.

Table 1.2 Overall statistics of creative writing sessions in the CoAuthor dataset: Prompts = 10; Writers = 57; Sessions = 830; Time = 11.6 min; Total Words = 446 words; Queries = 12.8; Acceptance Rate = 75.7%; Written by AI = 26.6%.

Table 1.2 shows summary statistics for the CoAuthor dataset. The dataset contains 830 writing sessions written by 57 writers from Amazon Mechanical Turk. On average, each writing session is 446 words long, contains 12.8 queries to the system, has a GPT acceptance rate of 75.7% (how often writers accepted suggestions from GPT-3), and results in 26.6% of the final text written by GPT (the proportion of the final text written by GPT-3 as opposed to human writers). In other words, on average, 73.4% of final texts were written by humans as opposed to GPT-3.

1.4.2 Measures

Measures used for the human-AI collaboration process   Aligned with Turel and Kalhan (2023), we conceptualize algorithm aversion and appreciation on a continuum by measuring AIUtilization, the extent to which participants incorporate LLMs' responses into their outputs (Li et al., 2024). Specifically, for each narrative, we begin by identifying the shared words between the AI-generated responses and the participant's final creative output (i.e., the final version of the creative narrative). We then calculate the ratio of these shared words to the total word count of the participant's final creative narrative. The value of AIUtilization ranges from 0 to 1, with higher values indicating greater utilization of LLMs in the final co-created narratives.
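Under one plausible reading of this word-overlap measure, AIUtilization can be computed as in the sketch below. This is illustrative rather than the authors' code: it treats "shared words" as lowercase word tokens of the final narrative that also appear anywhere in the pooled AI suggestions, an operationalization the text does not fully pin down.

```python
import re

def ai_utilization(ai_responses: list[str], final_narrative: str) -> float:
    """Share of final-narrative words that also appear in the AI responses.

    Illustrative sketch: tokens are lowercase word forms, and a final-text
    word counts as 'shared' if it occurs in any AI-generated suggestion.
    """
    tokenize = lambda text: re.findall(r"[a-z']+", text.lower())
    ai_vocab: set[str] = set()
    for response in ai_responses:
        ai_vocab.update(tokenize(response))
    final_words = tokenize(final_narrative)
    if not final_words:
        return 0.0
    shared = sum(1 for word in final_words if word in ai_vocab)
    return shared / len(final_words)  # ranges from 0 to 1
```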
Instead of focusing solely on the final narrative output, we capture humans' collaborative abilities along two dimensions (Storch, 2002; Lee, Liang, and Yang, 2022): Mutuality and Equality. The first dimension, Mutuality, represents the extent of a human participant's engagement with the AI agent (e.g., by clicking the help button to request AI assistance, navigating through suggestions provided by the AI, selecting a suggestion, or reopening the interaction screen to further engage with the AI). Following Lee, Liang, and Yang (2022), we measure Mutuality based on the counts of two types of event blocks2: human-AI interaction event blocks (EventBlock(HumanAI)) and human-alone event blocks (EventBlock(HumanOnly)). Specifically, EventBlock(HumanAI) refers to any of four behaviors, including navigating suggestions provided by AI, choosing any of AI's advice, reopening the AI helper, and inserting any texts. EventBlock(HumanOnly) refers to any of three behaviors, including dismissing AI advice, deleting, or inserting any texts. We then construct the variable Mutuality as the percentage of human-AI interaction event block counts relative to the total count of both human-AI interaction event blocks and human-alone event blocks:

\[
\mathrm{Mutuality} = \frac{\sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(HumanAI)}]}{\sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(HumanAI)}] + \sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(HumanOnly)}]} \times 100.
\]

A Mutuality value of 100 indicates the writer relied entirely on AI interaction, while a value of 0 means the writer did not interact with the AI at all.

2 Specifically, for each story-writing session that involves various events such as text insertions, deletions, and cursor movements, we categorize these into event blocks—deterministic, non-overlapping sequences of related events. For example, an event block labeled "choose" might include actions like "suggestion-select" followed by "suggestion-close".

The other dimension, Equality, represents the balance in contributions between the human creator and the AI agent in generating final outputs. Following Lee, Liang, and Yang (2022), we measure Equality based on the counts of two types of event blocks: human-effort event blocks (EventBlock(HumanEfforts)) and AI-effort event blocks (EventBlock(AIEfforts)). Specifically, EventBlock(HumanEfforts) is defined as inserting any texts by human writers, while EventBlock(AIEfforts) is defined as choosing any of AI's advice. Thus, we construct the variable Equality as

\[
\mathrm{Equality} = 1 - \frac{\left| \sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(HumanEfforts)}] - \sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(AIEfforts)}] \right|}{\sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(HumanEfforts)}] + \sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(AIEfforts)}]}.
\]

An Equality value of 1 indicates perfect parity, where human and AI contributions are equal. Conversely, a value of 0 signifies that one party (human or AI) exclusively contributes to the output, while the other makes no contributions. Empirically, we observe that human efforts consistently exceed AI efforts across all writing sessions, i.e., \(\sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(HumanEfforts)}] > \sum_i \mathbf{1}[e_i \in \mathrm{EventBlock(AIEfforts)}]\) for all writing sessions. Thus, in our context, a smaller value of Equality reflects human-led efforts, rather than AI-led efforts.
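The two formulas translate directly into code once each session's event blocks have been classified into the categories above. The sketch below takes those pre-computed counts as inputs; it is a schematic illustration, not the authors' implementation.

```python
def mutuality(n_human_ai_blocks: int, n_human_only_blocks: int) -> float:
    """Percentage of event blocks that involve human-AI interaction."""
    total = n_human_ai_blocks + n_human_only_blocks
    return 100.0 * n_human_ai_blocks / total if total else 0.0

def equality(n_human_effort_blocks: int, n_ai_effort_blocks: int) -> float:
    """One minus the normalized imbalance between human and AI contributions."""
    total = n_human_effort_blocks + n_ai_effort_blocks
    if total == 0:
        return 0.0
    return 1.0 - abs(n_human_effort_blocks - n_ai_effort_blocks) / total

# Example: 30 human-AI interaction blocks vs 20 human-alone blocks, and
# 40 human-insertion blocks vs 10 accepted-AI-suggestion blocks.
print(mutuality(30, 20))  # 60.0
print(equality(40, 10))   # 0.4
```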
As illustrated in Figure 1.2, four distinct quadrants can be constructed based on the median values of these two variables, representing different modes of human-AI collaboration. We aim to explore which collaboration mode is most strongly associated with higher perceived creativity of the work. Detailed descriptions of each collaboration mode are provided in Section 1.4.3.

Figure 1.2 Proposed four human-AI collaboration patterns based on Storch (2002).

Measures used for the evaluation process   To measure the perceived creativity of narratives, we recruited participants from Prolific, an online research platform where people can sign up to do surveys and experiments. At the beginning of the survey, participants are presented with a consent form, which specifies that they will receive compensation of $6 for participating in a study lasting approximately 30 minutes. Only participants who agree to the consent form can continue their surveys. An attention check question is included to ensure participants pay attention to the survey.

Each participant is asked to read five randomly selected stories out of 830. For each narrative, participants need to answer the same set of questions related to their perceptions of creativity and perceived AI use, as described in Table 1.3. Specifically, we construct a proxy of perceived creativity as follows. After reading each story, participants are required to answer four questions adapted from Goncalo, Flynn, and Kim (2010), which assess the level of creativity in the text using a scale ranging from 1 (Strongly Disagree) to 7 (Strongly Agree). The four questions are outlined as follows: a) This article is creative; b) This article is more creative than other articles that have been recently published; c) Other people will think that this article is creative; d) It is unlikely that another author has come up with an article like this before. After examining whether the reliability of the scale and the inter-rater reliability are acceptable, the perceived creativity of each story is calculated based on the average ratings across the four questions and their respondents.
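As a concrete illustration of this aggregation, the sketch below computes the per-story perceived creativity score from a long-format ratings table (column names are hypothetical): each of the four items is first averaged across raters within a story, and the four item means are then averaged.

```python
import pandas as pd

def story_creativity(ratings: pd.DataFrame) -> pd.Series:
    """Average each item across raters, then average the four item means.

    `ratings` is assumed to hold one row per (story, rater), with columns
    story_id and q1..q4 containing the four 7-point creativity items.
    """
    item_means = ratings.groupby("story_id")[["q1", "q2", "q3", "q4"]].mean()
    return item_means.mean(axis=1).rename("perceived_creativity")
```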
After labeling five stories, participants were asked about demographics such as age, highest level of education earned, employment status, and gender. Lastly, participants reported their experience with LLMs by answering questions related to their familiarity with LLMs such as ChatGPT, and their familiarity with algorithms. We placed the two questions regarding familiarity at the end of the survey to prevent them from influencing evaluators' perceptions of creativity during the initial labeling tasks. Table 1.3 summarizes definitions of the variables measured during the human-AI collaboration and evaluation processes.

Table 1.3 Variable definitions.

Human as Creators: Human-AI Collaboration Process
HighRand: High vs low randomness GPT, set by varying two decoding parameters in the model: temperature (T) and frequency penalty (FP). HighRand is 1 if the setting is (T=0.75; FP=1), and 0 if the setting is (T=0.3; FP=0).
AIUtilization: The ratio of the total number of LLM-generated words over the total number of words for the story.
Mutuality: The percentage of human-AI interaction event blocks among all human-AI interaction and human-alone event blocks (see the formula in Section 1.4.2), where EventBlock(HumanAI) refers to any of four behaviors (navigating suggestions provided by AI, choosing any of AI's advice, reopening the AI helper, and inserting any texts), and EventBlock(HumanOnly) refers to any of three behaviors (dismissing AI advice, deleting, or inserting any texts). A Mutuality value of 100 indicates the writer relied entirely on AI interaction, while a value of 0 means the writer did not interact with the AI at all.
Equality: One minus the normalized absolute difference between human-effort and AI-effort event-block counts (see the formula in Section 1.4.2), where EventBlock(HumanEfforts) is defined as inserting any texts by human writers, while EventBlock(AIEfforts) is defined as choosing any of AI's advice. An Equality value of 1 indicates perfect parity, where human and AI contributions are equal. Empirically, in our context, a value of 0 indicates that one party (human-led) exclusively contributes to the output, while the other (AI) makes no contribution, because in all writing sessions total human efforts exceed total AI efforts.
Time: Total writing time (in min).
LogNumQuery: The logarithm of the total number of queries to the LLM.

Human as Evaluators: Evaluation Process
Perceived Creativity: Four items adapted from Goncalo et al. (2010), each on a scale from 1 (Strongly Disagree) to 7 (Strongly Agree): a) This article is creative; b) This article is more creative than other articles that have been recently published; c) Other people will think that this story is creative; d) It is unlikely that another author has come up with a story like this before.
Perceived AI Use: "In your opinion, was the story generated by Artificial Intelligence (AI)?" Responses are recorded as 1 for definitely AI-generated, 2 for maybe AI-generated, and 3 for definitely human-written. For ease of interpretation, we recode Perceived AI Use as 3 if the rater chooses definitely AI-generated, 2 for maybe AI-generated, and 1 for definitely human-written; thus, a higher value indicates a greater perception of AI use.
HF(FamiliarityLLMs): "On a scale from 1 (Not at all familiar) to 5 (Extremely familiar), indicate your level of familiarity with the uses of generative language model tools such as ChatGPT." We code HF(FamiliarityLLMs) as 1 if a rater's self-reported familiarity with LLMs is 4 or 5, and 0 otherwise.
HF(FamiliarityAlg): "On a scale from 1 (Not at all familiar) to 5 (Extremely familiar), indicate your level of familiarity with algorithms and AI." We code HF(FamiliarityAlg) as 1 if a rater's self-reported familiarity with algorithms and AI is 4 or 5, and 0 otherwise.
Age: 1 if the evaluator is 45 years old or older, and 0 otherwise.
Edu: 1 if the evaluator has earned a graduate degree (above a bachelor's degree) or higher, and 0 otherwise.
FullTime: 1 if the evaluator is full-time employed, and 0 otherwise.
Male: 1 if the evaluator is male, and 0 otherwise.
1.4.3 Model Specification

To empirically investigate the impact of LLMs' generative capabilities on perceived creativity (H1a), we estimate the model in Equation (1.1). The dependent variable, \(Y_{ijp}\), is the perceived creativity for story i generated by writer j based on prompt p. The independent variable \(\mathrm{HighRand}_{ijp}\) equals 1 if the GPT exhibits high randomness for story i with writer j and prompt p, and 0 otherwise. \(\mathrm{Controls}_{ijp}\) include two variables: total writing time to proxy the writer's effort (Time), and the logarithm of the total number of queries to proxy the writer's willingness to interact with LLMs (LogNumQuery). We also control for writer fixed effects \(\gamma_j\) to account for writer-invariant characteristics and prompt fixed effects \(\alpha_p\) to account for time-invariant prompt characteristics that might affect perceived outcomes. Finally, we cluster standard errors at the writer level to address the problem of autocorrelation of error terms (Wooldridge, 2003).

\[
Y_{ijp} = \beta_1 \mathrm{HighRand}_{ijp} + \gamma_j + \alpha_p + \mathrm{Controls}_{ijp} + \epsilon_{ijp} \tag{1.1}
\]

To further examine the moderating effect of human creators' degree of algorithm aversion (H1b), we construct a variable, \(\mathrm{AIUtilization}_{ijp}\), that represents the ratio of the total number of LLM-generated words over the total number of words for the final version of story i, which is written by writer j based on prompt p. We also add the interaction term \(\mathrm{HighRand}_{ijp} \times \mathrm{AIUtilization}_{ijp}\), as shown in Equation (1.2).

\[
Y_{ijp} = \beta_1 \mathrm{HighRand}_{ijp} + \beta_2 \mathrm{AIUtilization}_{ijp} + \beta_3 \mathrm{HighRand}_{ijp} \times \mathrm{AIUtilization}_{ijp} + \gamma_j + \alpha_p + \mathrm{Controls}_{ijp} + \epsilon_{ijp} \tag{1.2}
\]

Moreover, to test H1c, we conduct a subsample analysis using Equation (1.3), applying it separately to two distinct settings: high-randomness and low-randomness. We explore which collaboration modes achieve the highest perceived creativity of work, given the same levels of LLMs' generative capabilities and human writers' algorithm aversion. For each setting (high-randomness or low-randomness), we define the dummy variable \(\mathrm{HighAIUtilization}_{ijp}\) based on the median value of \(\mathrm{AIUtilization}_{ijp}\). This variable is assigned a value of 1 if \(\mathrm{AIUtilization}_{ijp}\) is greater than or equal to the median, and 0 if it is below the median. Moreover, for each setting, we define four distinctive collaborative modes (shown in Figure 1.2) based on the median values of \(\mathrm{Mutuality}_{ijp}\) and \(\mathrm{Equality}_{ijp}\). Specifically, we construct four indicator variables, each representing a unique mode of collaboration: \(\mathrm{HighMutCollaboration}_{ijp}\) (Q1), \(\mathrm{HighMutHumanLed}_{ijp}\) (Q2), \(\mathrm{LowMutHumanLed}_{ijp}\) (Q3), and \(\mathrm{LowMutCollaboration}_{ijp}\) (Q4). For example, a writing session characterized by high levels of both mutuality and equality would be classified under \(\mathrm{HighMutCollaboration}_{ijp}\) (Q1), reflecting strong interaction and balanced participation between human creators and AI. Collecting these indicators in the vector \(M_{ijp} = (\mathrm{HighMutCollaboration}_{ijp}, \mathrm{HighMutHumanLed}_{ijp}, \mathrm{LowMutHumanLed}_{ijp}, \mathrm{LowMutCollaboration}_{ijp})'\), the coefficient vector \(\theta_1\) in Equation (1.3) is 4-dimensional, with \(\mathrm{LowMutCollaboration}_{ijp}\) (Q4) serving as the baseline category. We then introduce four interaction terms by multiplying \(\mathrm{HighAIUtilization}_{ijp}\) with each collaboration indicator, making the coefficient vector \(\theta_2\) in Equation (1.3) also 4-dimensional.

\[
Y_{ijp} = \lambda\, \mathrm{HighAIUtilization}_{ijp} + \theta_1' M_{ijp} + \theta_2' \left(\mathrm{HighAIUtilization}_{ijp} \times M_{ijp}\right) + \gamma_j + \alpha_p + \mathrm{Controls}_{ijp} + \epsilon_{ijp} \tag{1.3}
\]
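Specifications of this kind can be estimated with standard tooling. The sketch below uses statsmodels on a story-level DataFrame with hypothetical column and file names; writer and prompt fixed effects enter as categorical dummies, and standard errors are clustered at the writer level as described above. The mode dummies of Equation (1.3) can be added to the formula analogously.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("coauthor_story_level.csv")  # hypothetical file name

# Equation (1.1): perceived creativity on HighRand with controls,
# writer and prompt fixed effects, and writer-clustered standard errors.
formula = "creativity ~ HighRand + Time + LogNumQuery + C(writer_id) + C(prompt_id)"
# Equation (1.2) adds the moderator and its interaction:
# formula = ("creativity ~ HighRand * AIUtilization + Time + LogNumQuery"
#            " + C(writer_id) + C(prompt_id)")
model = smf.ols(formula, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["writer_id"]}
)
print(model.summary())
```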
Lastly, we analyze the human evaluation process by estimating Equation (1.4) at the evaluator-story level. To test H2a, we investigate whether evaluators with lower familiarity with LLMs perceive human–AI collaborations as less creative. The dependent variable Y_{ir} is the perceived creativity of story i rated by evaluator r. We include HF(FamiliarityLLMs)_r, an indicator equal to 1 if evaluator r reports high familiarity with LLMs (Likert scale 4 or 5), and 0 otherwise. The corresponding coefficient β_F captures its average effect on the perceived creativity of work. To test the impact of evaluators' perceived AI use (H2b), we construct PerceivedAIUse_{ir}, representing rater r's perception of AI use when evaluating story i, ranging from 1 to 3, with larger values indicating greater perceived AI use. The coefficient of interest, β_A, reflects its effect on the perceived creativity of work. We also control for story fixed effects τ_i to account for time-invariant story characteristics that might affect perceived outcomes [3]. Controls_r represents a set of evaluator r's demographics, including age, gender, education, and employment status.

Y_{ir} = β_F HF(FamiliarityLLMs)_r + β_A PerceivedAIUse_{ir} + τ_i + Controls_r + ε_{ir}    (1.4)

[3] Because we already control for story fixed effects τ_i, we exclude inherent story-related variables such as HighRand, AIUtilization, and their interaction terms from the equation.

1.5 Empirical Results

1.5.1 Descriptive Statistics

After conducting the survey on Prolific, 29 surveys were disqualified for failing attention checks, 4 were rejected for being completed too quickly, and 4 timed out, leaving 479 survey submissions. Because each participant was assigned five randomly selected stories, it was possible that a small set of stories would not be read. In our case, out of 830 stories, one was read only once, while the remaining 829 stories were read a minimum of two and a maximum of ten times. Thus, we select those 829 stories as the sample for the following analysis.

Before computing the aggregated perceived creativity score for each story, we first determined consistency among raters. Specifically, we obtained an average inter-rater reliability (IRR) of 0.68, which is considered "moderate agreement" (Landis and Koch, 1977; Siegert, Böck, and Wendemuth, 2014). We also computed Cronbach's alpha to assess the correlation among the four questions composing the perceived creativity scale for each story. In our sample, Cronbach's alpha was 0.87, implying that the four questions reliably measure the same construct. Therefore, for each story, we computed the average of each of the four items among raters, and then the average of these four items, using the resulting metric as the unidimensional perceived creativity measure for each story. Table 1.4 and Table 1.5 show the descriptive statistics of variables for the story-level analysis of the writing process (H1a-c) and for the evaluator-story-level analysis of the evaluation process (H2a-b), respectively.

Table 1.4 Summary statistics of variables in the writing process at the story level.

Measures During Human Evaluation Process (Average)
Variable                N     Mean    Std. dev.   Min     Max
Perceived Creativity    829   4.44    0.92        1.5     7

Measures During Human-AI Collaboration Process
Variable                N     Mean    Std. dev.   Min     Max
AIUtilization           829   0.27    0.16        0       0.93
HighRand                829   0.49    0.50        0       1
Mutuality               829   0.38    0.03        0.33    0.48
Equality                829   0.28    0.16        0       0.80
Time                    829   11.62   2.68        5.32    33.33
LogNumQuery             829   2.43    0.65        0       3.99

Table 1.5 Summary statistics of variables in the evaluation process at the rater-story level.

Variable                N       Mean   Std. dev.   Min   Max
Perceived Creativity    2,389   4.43   1.36        1     7
Perceived AI Use        2,389   1.95   0.71        1     3
HF(FamiliarityLLMs)     2,389   0.40   0.49        0     1
Age                     2,389   0.38   0.49        0     1
Edu                     2,389   0.53   0.50        0     1
Fulltime                2,389   0.56   0.50        0     1
Male                    2,389   0.53   0.50        0     1

1.5.2 Hypotheses Tests

Within the Human-AI Collaboration Process (H1a-c Tests)

We first provide model-free evidence for H1a-b. Figure 1.3(a) shows the average perceived creativity of narratives produced under varying levels of GPT randomness during human-AI collaboration.
The results reveal negligible differences in average creativity ratings between the low randomness (4.43) and high randomness (4.45) conditions. This minimal difference (Δ = 0.02) suggests that increased LLM randomness does not directly enhance the perceived creativity of co-generated outputs. These findings fail to support H1a, which posited that collaborating with an LLM of higher randomness would yield more creative narratives.

Moreover, for ease of visualization, we categorize the sample narratives into two groups, high and low degree of AI utilization, based on the median value (0.23) of the continuous variable AIUtilization. Figure 1.3(b) shows the average perceived creativity for different levels of GPT randomness and writers' utilization of AI advice during human-AI co-creation. Under low GPT randomness, narratives that heavily utilize the LLM's advice (average of 4.32) are perceived as less creative than those with minimal LLM utilization (average of 4.53). Conversely, under high GPT randomness, narratives with high AI utilization (average of 4.48) are perceived as slightly more creative than those with low AI utilization (average of 4.41). This pattern indicates that the level of human creators' AI utilization shapes the impact of GPT randomness during human-AI co-creation.

Figure 1.3 Model-free evidence of H1a-b. Average perceived creativity of narratives for different levels of GPT randomness (Figure 1.3(a), left) and writers' AI utilization (Figure 1.3(b), right) during the human-AI collaboration.

Column (1) of Table 1.6 presents the estimation results from Equation (1.1), where the primary independent variable is HighRand. The coefficient of HighRand in Column (1) is positive but insignificant, indicating a lack of evidence to support H1a. During human-AI collaboration, increasing an LLM's randomness is not sufficient to generate more creative narratives. The lack of a statistically meaningful effect therefore calls into question assumptions about the influence of LLMs' inherent generative capabilities (e.g., randomness) on creative outcomes in human-AI collaboration, highlighting the need to reconsider how these model attributes interact with human creative processes.

We then investigate the moderating role of human creators' algorithm aversion using Equation (1.2), which adds AIUtilization and the interaction term HighRand × AIUtilization. The results are shown in Column (2) of Table 1.6. The estimated coefficient for HighRand (-0.223) is negative and statistically significant. When human creators bring full algorithm aversion, defined as a rejection of AI-generated inputs in creative decision-making (i.e., AIUtilization = 0), into the creative process, collaborating with a high-randomness AI may hinder their ability to craft creative narratives. A potential mechanism is that high-randomness AI, which prioritizes divergent, unpredictable outputs, amplifies cognitive friction for human creators already predisposed to dismiss algorithmic contributions. Such diverse AI advice further imposes a high technology overload on creators (Bunjak, Černe, and Popovič, 2021), who must reconcile the dissonance between their intent and the AI's incongruent suggestions, leading to burnout and lower perceived creativity of work.
Importantly, the coefficient of the interaction term HighRand × AIUtilization (0.965) is positive and statistically significant in Column (2) with full controls, suggesting a significant moderating effect of human creators' algorithm aversion during human-AI collaboration. When the LLM generates more diverse suggestions (in the high-randomness GPT setting), stories generated with greater AI utilization are perceived as more creative than narratives with less AI utilization. Taken together, we find that the impact of LLMs' generative capabilities on the perceived creativity of work varies with human writers' utilization of the LLM's suggestions, which supports H1b.

Table 1.6 Effects of LLM's generative capabilities and human creators' AI utilization on perceived creativity of work (testing H1a and H1b).

Dependent variable: Perceived Creativity. Column (1): effect of LLMs' randomness; Column (2): effect of LLMs' randomness and AI utilization.

                           (1)              (2)
HighRand                   0.016 (0.060)    -0.223** (0.107)
AIUtilization                               -1.026* (0.575)
HighRand × AIUtilization                    0.965*** (0.333)
Controls                   Yes              Yes
Prompt FE, Writer FE       Yes              Yes
Observations               821              821
R²                         0.093            0.101

Standard errors in parentheses; * p < 0.10, ** p < 0.05, *** p < 0.01. Controls include writers' total writing time and the logarithm of the total number of requests to the LLM, along with prompt and writer fixed effects. Cluster-robust standard errors at the individual writer level are shown in parentheses.

We further investigate which human-AI collaboration modes trigger greater perceived creativity of work (H1c). Table 1.7 presents the estimation results of Equation (1.3). In both subsamples, high-randomness LLMs (Column 1) and low-randomness LLMs (Column 2), the interaction term HighAIUtilization × HighMutHumanLed is positive and statistically significant. Therefore, for a given level of LLMs' generative capabilities and relatively high levels of human creators' AI utilization, the high-mutuality, high-human-led collaboration mode yields the highest perceived creativity compared to the low-mutuality, low-human-led collaboration (Q4), which supports H1c. In addition, compared to the baseline interaction mode (LowMutCollaboration, Q4), the second most effective human-AI interaction pattern is HighMutCollaboration. This is evidenced by the positive and statistically significant coefficients of the interaction term HighAIUtilization × HighMutCollaboration, which are 0.969 in the high-randomness setting (Column 1) and 0.905 in the low-randomness subsample (Column 2). These findings suggest that greater mutuality in human-AI interactions, whether through the HighMutHumanLed or the HighMutCollaboration mode, enhances the perceived creativity of the work.

Table 1.7 Effects of varying human-AI collaboration modes on perceived creativity of work (testing H1c).

Dependent variable: Perceived Creativity. Column (1): high-randomness subsample; Column (2): low-randomness subsample.

                                            (1)                 (2)
HighAIUtilization                           -0.426 (0.268)      -1.026** (0.575)
HighAIUtilization × HighMutCollaboration    0.969*** (0.331)    0.905** (0.430)
HighAIUtilization × HighMutHumanLed         1.316*** (0.415)    1.164*** (0.376)
HighAIUtilization × LowMutHumanLed          0.629* (0.363)      0.725 (0.333)
Controls                                    Yes                 Yes
Prompt FE, Writer FE                        Yes                 Yes
Observations                                415                 396
R²                                          0.211               0.141

Standard errors in parentheses; * p < 0.10, ** p < 0.05, *** p < 0.01. Controls include writers' total writing time and the logarithm of the total number of requests to the LLM, along with prompt and writer fixed effects. Cluster-robust standard errors at the individual writer level are shown in parentheses.
Within the Human Evaluation Process (H2a-b Tests)

So far, our findings primarily stem from the collaborative work-generation process involving humans and LLMs. In this subsection, we investigate the evaluation process of narratives by focusing on external human evaluators: specifically, whether humans exhibit a diminished appreciation for human-AI collaborative work when they are less familiar with LLMs (H2a) or when they believe the work is more likely to have been generated by AI (H2b). Unlike the previous analyses, which are at the story level, the following analysis is conducted at the evaluator-story level.

Table 1.8 presents model-free evidence for H2a-H2b: the average perceived creativity among evaluators, categorized by their level of familiarity with LLMs and their perceived AI use.

Table 1.8 Model-free evidence for H2a-H2b. Average perceived creativity among evaluators with high/low familiarity with LLMs and high/low perceived AI use.

                            HF(FamiliarityLLMs)      Perceived AI Use
                            Low      High            High     Low
Mean Perceived Creativity   4.36     4.53            3.94     4.57

Note: HF(FamiliarityLLMs) is 1 if a rater's familiarity with LLMs is 4 or 5, and 0 otherwise. For ease of visualization, PerceivedAIUse is considered high when it equals 3, and low when it equals 1 or 2.

Evaluators who are highly familiar with LLMs assign a higher average creativity rating (4.53) than those who are less familiar (4.36), supporting H2a. Moreover, narratives perceived to have low AI use generally receive higher perceived creativity ratings than those with high perceived AI use, supporting H2b.

Model estimation results are shown in Table 1.9. The coefficients of HF(FamiliarityLLMs) are positive and statistically significant in Columns (1-2), without and with story fixed effects, respectively. These findings demonstrate that evaluators who are more familiar with LLMs tend to perceive greater creativity in the evaluated work than those who are less familiar, which supports H2a. Furthermore, Columns (3-4) present negative and significant coefficients of PerceivedAIUse, both without and with story fixed effects (-0.357 and -0.327, respectively). The results indicate that when evaluators suspect significant AI involvement in the creation of written work, despite the source being undisclosed, they tend to appreciate the work less, perceiving it as less creative, thus supporting H2b. Our findings complement the existing literature on human favoritism (Yin, Jia, and Wakslak, 2024) by providing evidence that human biases can affect judgment, especially when the origin of creators is not disclosed. Lastly, the results remain consistent in Column (5) when both characteristics are considered simultaneously. Thus, our study provides evidence that individuals' interaction with LLMs can also affect their decision-making processes, particularly in evaluative judgments related to creativity assessment.

Table 1.9 Effects of evaluators' familiarity with LLMs and perceived AI use on perceived creativity of work as assessed by external evaluators. The analysis is at the evaluator-story level (testing H2a and H2b).
Dependent variable: Perceived Creativity. Columns (1)-(2): effect of evaluators' familiarity with LLMs; Columns (3)-(4): effect of evaluators' perceived AI use; Column (5): both factors.

                     (1)                (2)                (3)                (4)                (5)
HF(FamiliarityLLMs)  0.218*** (0.058)   0.144** (0.068)                                          0.158** (0.067)
Perceived AI Use                                           -0.357*** (0.038)  -0.327*** (0.046)  -0.330*** (0.046)
Age                  0.300*** (0.058)   0.348*** (0.069)   0.256*** (0.057)   0.303*** (0.068)   0.323*** (0.068)
FullTime             0.219*** (0.058)   0.215*** (0.069)   0.229*** (0.058)   0.213*** (0.068)   0.210*** (0.068)
Edu(AboveBachelor)   -0.163*** (0.057)  -0.148** (0.066)   -0.154*** (0.056)  -0.132** (0.065)   -0.139** (0.065)
Male                 -0.090 (0.057)     -0.002 (0.068)     -0.066 (0.055)     0.013 (0.066)      -0.004 (0.066)
Story FE             No                 Yes                No                 Yes                Yes
Observations         2,389              2,389              2,389              2,389              2,389
R²                   0.020              0.430              0.049              0.446              0.448

Standard errors in parentheses; * p < 0.10, ** p < 0.05, *** p < 0.01.

Regarding other demographic traits, the significant positive coefficients of Age and FullTime (Columns 1-5) suggest that individuals aged 45 and above, as well as those employed full-time, are more likely to report higher perceived creativity of work. Conversely, the significant negative coefficient for Edu suggests that higher levels of education are associated with lower perceived creativity. This result is consistent with Benedek et al. (2021), which found that individuals with more years of education exhibit lower agreement with creativity myths, implying a lower propensity to perceive creativity in general. Moreover, we do not observe heterogeneous effects across gender, indicating that perceived creativity does not differ systematically between male and female evaluators.

1.6 Discussion and Implications

Building on prior research on human-in-the-loop decision-making (Zanzotto, 2019; Ge et al., 2021; Fügener et al., 2022; Bauer, von Zahn, and Hinz, 2023), our findings show that the effect of LLMs' inherent generative capabilities on perceived work creativity is crucially moderated by human creators' utilization of AI. Specifically, a higher-randomness LLM enhances the perceived creativity of output only when the human writer exhibits lower algorithm aversion and is thus more receptive to the LLM's suggestions. In addition, our findings indicate that effective human–AI collaboration in creative tasks hinges on active human engagement, highlighting the importance of interaction beyond the capabilities of the model itself. Through an analysis of varying collaboration patterns, we show that both the intensity of human–AI interaction and the extent of human-led contribution significantly affect the perceived creativity of the final output. Specifically, when human writers engage with LLMs more frequently (e.g., by asking follow-up questions and refining responses) and contribute more to the task (e.g., by inserting more text), the resulting work is viewed as more creative. These results suggest that creativity is enhanced not only through iterative engagement with AI, but also when humans maintain a leading role in the creative process. Collectively, our study points to the importance of designing AI-assisted workflows that encourage sustained and dynamic human–AI collaboration to maximize the creative potential of LLMs.

Furthermore, as LLMs become more involved in everyday activities, understanding how individual-level characteristics related to LLMs shape human judgment becomes important.
In this study, we examine how two specific evaluator attributes, their familiarity with AI and their perception of AI use in a given task, affect assessments of creativity in narrative writing. Our analysis yields two main insights. First, individuals with greater experience using LLMs are more likely to perceive higher levels of creativity in the texts they evaluate, suggesting that familiarity with AI may enhance sensitivity to creative nuances. Second, even when the authorship source is undisclosed, evaluators' assumptions about whether AI was involved significantly shape their judgments of creativity. This finding extends existing research on human favoritism by examining contexts where AI participation is ambiguous or inferred rather than explicitly stated. Together, these results underscore a critical need for caution: human evaluations of creative work may be systematically shaped by both prior exposure to AI and subjective assumptions about its authorship.

1.6.1 Implications for Research

Our research makes several important contributions to the growing literature on human–AI collaboration in creative domains. While prior studies have examined the role of humans in shaping creative outcomes through prompting strategies (Wu, Terry, and Cai, 2022) and the functional roles individuals play in co-creation (i.e., ghostwriters who contribute content and sounding boards who provide evaluative feedback) (Chen and Chan, 2024), less attention has been paid to the inherent generative capabilities of LLMs themselves. LLMs are not passive tools; they possess generative properties that meaningfully influence creative outcomes. Recognizing LLMs as active agents in the creative process is essential to understanding the dynamics of co-creation. Our study is the first to empirically investigate how intrinsic characteristics of LLMs (particularly randomness) affect the perceived creativity of co-produced content. These findings suggest the need to consider both human strategies and model-driven variability in interpreting divergent collaborative creativity.

In addition, we extend the literature on algorithm aversion in human–AI collaboration (Logg, Minson, and Moore, 2019; Fügener et al., 2022; Turel and Kalhan, 2023) by moving beyond its traditional emphases to investigate aversion's moderating role in shaping the perceived creativity of co-created work. Rather than conceptualizing algorithm aversion as a binary trait, we reconceptualize it as a continuous measure, captured by the extent to which human creators integrate AI-generated suggestions into their final work (Turel and Kalhan, 2023). Our findings reveal a nuanced interaction between AI system design and user behavior: when GPT operates with high randomness, greater AI utilization leads to increased perceived creativity. In contrast, under low randomness (i.e., more deterministic responses), increased AI utilization corresponds with lower creativity evaluations. These results suggest that aligning the degree of human openness to algorithmic output with LLMs' internal generative properties is critical for optimizing co-creative outcomes.

Moreover, while prior work has categorized modes of human–AI collaboration (Revilla et al., 2023), our research extends theories of human–human co-creation (Storch, 2002) to the human–AI domain.
We focus specifically on the dimensions of mutuality and equality: high mutuality ensures iterative alignment between human inputs and AI outputs, while maintaining a human-led collaboration mode safeguards contextual appropriateness and maximizes the creativity of the work.

We further contribute to the literature on human-AI collaboration by introducing a dual-perspective framework for studying LLMs, shifting analytical attention from the process of AI-assisted creation to the human evaluation of AI-involved content. Specifically, we empirically examine how two evaluator-level factors, familiarity with LLMs and perceived AI involvement (in the absence of explicit disclosure), shape creativity assessments. Extending prior work on AI's impact on human judgment (Jakesch, Hancock, and Naaman, 2023) and human favoritism (Logg, Minson, and Moore, 2019; Morewedge, 2022), our findings reveal that higher LLM familiarity and lower perceived AI use are both associated with increased perceived creativity of work. These results point to the presence of evaluative biases in human–AI co-creation and highlight the critical role of individual experience and perception in shaping judgments of creative work.

1.6.2 Implications for Practice

Our research findings have important practical implications for policymakers and managers designing and implementing human-AI collaboration systems (Anthony, Bechky, and Fayard, 2023), particularly in tasks involving creative production and subjective evaluation. For policymakers and managers engaged in creative production tasks, our findings challenge a widely held assumption about LLMs: that greater randomness in their outputs inherently leads to more creative results. While it is commonly believed that increasing randomness fosters divergent thinking and enhances creative performance, our empirical evidence suggests otherwise. We find that higher randomness does not consistently translate into higher perceived creativity. Therefore, managers should be cautious about using randomness as a default lever for boosting creativity in AI-assisted work. Instead, we promote the development of structured frameworks that strategically adjust LLM parameters to align with the specific requirements of a given creative task. Thoughtful calibration, rather than blind reliance on randomness, is more likely to yield meaningful and effective creative outcomes.

Additionally, we encourage policymakers and managers to reconceptualize AI as a collaborative partner rather than a passive tool (Anthony, Bechky, and Fayard, 2023), drawing on the human-human collaboration literature to inform human-AI interaction design. By incorporating relational constructs such as mutuality and equality, our study extends prior work and reveals that these dimensions critically shape co-creative processes, which are often overlooked in outcome-focused studies (Wu et al., 2021; Sun et al., 2024). Our findings suggest that organizations should design AI systems that promote iterative interaction (e.g., click-to-refine prompts) and train employees to engage with AI as a "creative sparring partner." This relational framing offers a foundation for future research across modalities and cultural contexts, where collaboration norms may diverge.

For policymakers and managers involved in evaluating creative work, our research highlights the important role of human perception in assessing AI-assisted outputs.
Specifically, we find that an evaluator's familiarity with LLMs significantly shapes their creativity judgments. Individuals with more experience using LLMs are generally more receptive to AI-generated contributions, rating them more favorably. In contrast, when evaluators merely perceive that AI was involved, regardless of whether it actually was, they often rate the output as less creative. These findings highlight the influence of cognitive biases and individual traits, such as a preference for human effort or skepticism toward AI, on subjective evaluation. As LLMs become more deeply integrated into routine workflows while humans remain central to interpretation and judgment, it is essential for organizations to recognize and actively manage these biases. Doing so is key to ensuring fair, consistent, and credible evaluations of AI-generated content.

1.6.3 Limitations

While this study provides valuable insights into the effect of LLMs' inherent generative capabilities in creative domains, it is important to acknowledge several limitations. Our analysis focuses exclusively on a single model (OpenAI's GPT-3) and a specific creative writing task. Future research should examine more recent LLMs to assess the generalizability of these findings. Moreover, the study employs fixed initial prompts and does not explore how prompt engineering might interact with model behavior. Subsequent work could investigate how variations in prompt design (e.g., Boussioux et al. (2024)), combined with different levels of model randomness, influence performance in human-AI collaborative settings. Finally, the optimal degree of randomness may vary across domains and task contexts. For example, industries such as financial services may benefit from low randomness to maintain precision and reliability, whereas sectors like advertising or entertainment may derive value from higher randomness to enhance creativity and user engagement. Future empirical research across a broader range of domains is necessary to better understand these dynamics.

1.7 Conclusion and Future Research

It has been suggested that processes once considered uniquely human, such as creativity and intuition, are increasingly being augmented by the speed, scalability, and analytical power of AI (e.g., Vaccaro, Almaatouq, and Malone (2024)). Given that the contours of creative tasks are significantly broadened by the widespread deployment of AI models, what are the frameworks by which we understand AI-human collaboration in creative tasks? Our findings provide insights into the process of human-AI collaboration by examining the role of LLMs' inherent generative capabilities and the mutuality of human-AI interaction in creative task performance. While these inherent capabilities enable more efficient deployment of AI assistance in creativity-driven workflows, our findings indicate that such parametric attributes alone (e.g., adjusting LLMs' inherent generative capabilities) may be insufficient to achieve optimal creative outcomes within human-AI collaboration. With increasing evidence that human-AI collaboration is key to the future of work, our study takes a dual perspective of mutuality and equality and extends prior theories of human collaboration to humans collaborating with AI.

Several future research directions could be explored based on our findings.
Firstly, future work could examine the generalizability of our findings on human-AI interaction patterns across other forms of creative tasks, such as marketing ideation, artistic co-creation, and product innovation. Such work would help delineate the boundary conditions of collaborative creativity involving LLMs. Additionally, future research could explore the role of LLMs' generative capabilities and varying degrees of mutuality in human-AI collaboration in business contexts that, while not fully creative, still rely heavily on human judgment. For instance, tasks such as inventory ordering by human managers may be influenced by the nature of AI recommendations (Lu et al., 2025). Understanding how different configurations of AI-human interdependence affect decision quality in such contexts remains an open and important question. Another valuable direction for research could focus on the heterogeneous effects of LLMs on human collaborators. Research could investigate how LLMs influence the perceived creativity of outcomes, considering other psychological aspects of human collaborators, such as irrationality (e.g., Shen, Jiang, and Zheng (2025)). Lastly, we consider the temperature parameter in this study, but future empirical research could assess how variations in other model-level parameters (e.g., fine-tuning, alignment processes) shape collaborative dynamics and outcomes in human-AI interactions. Given the evolving capabilities and deployment contexts of generative AI technologies, such research aligns well with the practice of responsible AI (Susarla et al., 2023).

CHAPTER 2

GPT-DATector: Increasing Accuracy And Decreasing Bias In GPT Detectors Using Creativity Measures

2.1 Introduction

Generative Artificial Intelligence (GenAI) tools such as ChatGPT have shown tremendous promise in supporting human endeavors across a wide range of domains, from e-commerce (Ghaffari, Yousefimehr, and Ghatee, 2024) and healthcare (Sharma et al., 2023) to education (Koltovskaia, 2020; Bubeck et al., 2023; Zhang et al., 2023; Yang et al., 2023). Such human-AI collaboration can potentially enrich the quality and creativity of work beyond what is produced by humans alone (Brynjolfsson, Li, and Raymond, 2025; Zhou and Lee, 2024). For instance, human-AI collaboration outperforms either humans or AI alone in image classification when the AI delegates tasks to humans; human delegation to AI, however, is less effective due to humans' limited metaknowledge, or the inability to accurately assess their own capabilities, leading to less optimal delegation (Fügener et al., 2022). However, when not effectively implemented or monitored, human-AI collaboration may turn into total reliance of humans on GenAI, which is especially concerning in fields such as education, where the proper development and nurturing of critical skills in the younger generation is at stake.

To address this challenge, one approach is to rethink conventional forms of educational assessment (Yeadon et al., 2023), advocating for changes to traditional assignments rather than relying solely on writing tasks, something GenAI can independently excel at. A complementary approach involves equipping educators with the tools to distinguish between human-written and GenAI-generated content.
Given the limitations of human judgment in detecting AI-generated content (Jakesch, Hancock, and Naaman, 2023), as well as the biases in existing automated detectors (Liang et al., 2023), it is essential to develop effective automated GenAI detection tools across a wide range of educational writing tasks to mitigate potential academic misconduct and to ensure students develop fundamental critical thinking and writing skills (Susnjak and McIntosh, 2024).

Here, we focus on the latter approach, which maintains conventional text-based school assignments, such as argumentative essays, and aims at detecting the use of GenAI. Industry has recently created and open-sourced technology to watermark AI-generated content (Dathathri et al., 2024), but this technology does nothing to solve the problem of identifying all the content that was generated by GenAI before its availability; and until regulatory frameworks are defined and implemented, the level of adoption remains unclear. While there is growing interest in academia and industry in advancing research on detecting text generated by GenAI (Tian and Cui, 2023; Venkatraman, Uchendu, and Lee, 2024), current detection tools are flawed, and the evolving developments of GenAI present a continuous challenge to effectively differentiating between human- and AI-generated content (Sadasivan et al., 2023). For instance, OpenAI launched a classifier in January 2023, engineered to differentiate between texts generated by humans and those generated by AI. However, this tool was discontinued in July 2023 due to its low accuracy [1], indicating the inherent difficulties in developing effective GPT detectors. Furthermore, current GenAI detectors demonstrate bias against non-native English speakers in educational contexts such as writing argumentative essays (Liang et al., 2023). As a result, these detectors face at least two major challenges, accuracy and fairness, indicating the need for improvements in model performance, especially in educational settings.

[1] https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text

In this work, we propose that, by identifying creativity features that distinguish the human from the GenAI writing process, features that make the two inherently different irrespective of a human's native language or the sophistication of the GenAI model, one can exploit such features to build an effective classifier.

Thus, let us start with an analysis of a fundamental difference between human-written and AI-generated content: their purpose. AI-generated content relies on prediction: Generative Pre-trained Transformers, or GPTs, are powerful neural networks predicting the most likely (however long) set of words to appear after a textual input or prompt. More generally, GenAI predicts language in terms of high-dimensional probability distributions over all the possible words (tokens) the network was trained on. The result is, naturally, a series of words that are somehow "expected", in the sense that they are, statistically speaking, likely to be seen. One of the major issues of GenAI is, in fact, the repetition problem, or the generation of text with redundant rather than new, "unexpected" segments (Fan, Lewis, and Dauphin, 2018; Holtzman et al., 2020) [2]. In contrast, the purpose of human-written content is to put into words an inner thought or feeling, often driven by exploration, aiming not just to convey information but also to reflect on complex ideas, express divergent thinking, and introduce new perspectives.
Therefore, to build an effective GenAI detection tool, we propose the use of a set of features, motivated by the creativity literature, that captures content divergence, specifically semantic dissimilarities among sentences, which stem from the inherent differences in writing purpose between humans and GenAI. We then design a GenAI detection framework by incorporating such creativity-related features into any current GenAI detector. Lastly, we evaluate the proposed GenAI detector, focusing on performance improvements in both accuracy and fairness.

[2] This repetition problem refers to the generation of outputs with redundant segments. Fu et al. (2021) illustrate this with an example where an AI produces a sentence with unnecessary repetition: "Though it is still unfinished, but I like it but I like it but I like ...". The root cause of this problem, however, remains unidentified (Welleck et al., 2019).

Prior literature on divergent thinking and creativity introduced an objective way to assess human verbal creativity based on the underlying assumption that creative individuals choose words with larger semantic distances between them (Olson et al., 2021). In one study, participants were asked to generate seven unrelated nouns, from which a Divergent Association Task score, or DAT, was calculated at the word level, representing the semantic distance between each pair of words (Olson et al., 2021). More creative people (assessed by standard measures of creativity) generated nouns with higher DAT. Building on this stream of work, we propose extending the DAT metric from the word level to a sentence-level analysis, conceptualizing a new metric to capture nuances in sentence dissimilarity within a text. We build on prior work on creativity and equip the concept with a sentence embedding measure [3] (Le and Mikolov, 2014). We introduce the DAT score at the sentence level, DAT(Sent), as a metric to proxy the degree of semantic similarity among sentences within a document. Specifically, DAT(Sent) calculates the semantic distances among all possible pairs of sentences within a document through the sentence-embedding approach. A lower DAT(Sent) indicates smaller distances among sentence embeddings, or greater similarity among sentences in a document, while a higher DAT(Sent) indicates higher divergence. Unlike previous research that compares entire documents to a median or centroid reference within a group, which can change with the addition of new samples (Doshi and Hauser, 2024), DAT(Sent) is a more stable metric that is independent of external samples. Following the idea of capturing content divergence through sentence embeddings, we propose three additional features to measure sentence divergence within each document: one global [4] content divergence metric at the sentence level, Variance(Sent), and two local [5] content divergence metrics that track dynamic sentence similarities, Diff(Sent) and Diff2(Sent).

[3] "Word embeddings are a way of representing words as vectors in a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words" (https://www.ibm.com/topics/word-embeddings). Compared to word embeddings, sentence embeddings provide a more comprehensive representation of the semantic meaning of text (Le and Mikolov, 2014).
[4] In this study, the term "global" indicates semantic distances among all sentence embeddings in general, considering the set of all sentences.
[5] In this study, the term "local" indicates semantic distances among subsets of sentence embeddings. For instance, the metric Diff(Sent) considers the semantic distances between the embeddings of each pair of consecutive sentences.
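To make these features concrete, the following is a minimal sketch that computes DAT(Sent), Variance(Sent), Diff(Sent), and Diff2(Sent) from a matrix of precomputed sentence embeddings (formal definitions appear in Section 2.3.1); using cosine distance for the variance term is an assumption for illustration.

import numpy as np
from itertools import combinations

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sentence_divergence_features(embs):
    # embs: (n, d) array with one precomputed embedding per sentence.
    n = embs.shape[0]
    pairwise = [cosine_distance(embs[i], embs[j])
                for i, j in combinations(range(n), 2)]
    dat_sent = 100.0 * np.mean(pairwise)            # DAT(Sent), scaled to 0-100
    centroid = embs.mean(axis=0)
    var_sent = np.mean([cosine_distance(e, centroid) ** 2 for e in embs])
    consecutive = [cosine_distance(embs[i - 1], embs[i]) for i in range(1, n)]
    diff_sent = np.mean(consecutive)                # Diff(Sent)
    diff2_sent = np.mean(np.square(consecutive))    # Diff2(Sent)
    return dat_sent, var_sent, diff_sent, diff2_sent

embs = np.random.default_rng(0).random((6, 1536))   # placeholder embeddings
print(sentence_divergence_features(embs))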
Beyond semantic divergence among sentences, we construct a metric, DAT(Word), that captures semantic divergence at the word level by generalizing the DAT metric (Olson et al., 2021) (see Section 2.3.1, Measurements).

Based on these metrics, we design a flexible two-stage classifier, GPT-DATector, as illustrated in Figure 2.1. The first stage involves any existing GenAI detector, followed by a second stage utilizing logistic regression that integrates our proposed metrics, designed to distinguish between human- and AI-generated texts with higher accuracy. GPT-DATector is therefore a flexible and interpretable model inspired by the literature on human creativity: it integrates an existing GenAI detector as the first-stage model for initial detection with features that capture content divergence at both the sentence and word levels. In this study, we choose two state-of-the-art GPT detectors as the first-stage model: the closed-source black-box detector GPTZero (Tian and Cui, 2023) and the open-source GPT-who (Venkatraman, Uchendu, and Lee, 2024).

Figure 2.1 Overview of proposed two-stage GPT detector: GPT-DATector.

Our primary hypothesis is that humans tend to generate text with larger DAT(Sent) than GenAI; that is, given a group of texts generated by humans and another generated by GenAI, the human-written group will show greater semantic distances among sentences than the AI-generated group. If this is true, incorporating such semantic features at the sentence level can improve the performance of existing GenAI detectors. We test this hypothesis using various writing tasks in traditional school assignments, specifically focusing on argumentative essays and story generation. To construct the samples used for training the GenAI detector, we include essays or stories produced by GenAI alone for each dataset (see Section 2.3.3). After building GPT-DATector, we compare both its prediction performance and its bias against non-native English writers to those of existing detectors. We evaluate prediction performance using two main metrics: accuracy and the area under the receiver operating characteristic curve (AUCROC) (see Section 2.3.4). To further evaluate GPT-DATector's ability to mitigate bias against non-native English speakers, we use an additional test set (Liang et al., 2023) on which other GenAI detectors have shown significant bias. We evaluate fairness using two key metrics: Disparate Impact (DI), which captures the ratio of favorable outcomes between underprivileged and privileged groups, and Equal Opportunity Difference (EOD), which measures the difference in true positive rates across these groups. More details can be found in Section 2.3.4.
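For clarity, a minimal sketch of these two fairness metrics follows; the coding of groups and the convention that the favorable outcome is being classified as human-written (label 0) are assumptions for illustration.

import numpy as np

def disparate_impact(y_pred, group):
    # Ratio of favorable outcomes (classified as human-written, label 0)
    # for the underprivileged group (group == 1) vs. the privileged group.
    favorable = (y_pred == 0)
    return favorable[group == 1].mean() / favorable[group == 0].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    # Difference in true positive rates between the two groups.
    def tpr(g):
        positives = (y_true == 1) & (group == g)
        return (y_pred[positives] == 1).mean()
    return tpr(1) - tpr(0)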
Our work yields several interesting findings. First, our study shows significant differences in the distribution of sentence dissimilarity, DAT(Sent), between texts generated by humans and by AI. On average, human-written texts exhibit greater DAT(Sent), indicating greater dissimilarity across sentences than AI-generated texts. Interestingly, we find moderate semantic dissimilarity among sentences for texts generated by humans and AI together: the average DAT(Sent) for texts generated by human+AI is larger than in AI-generated stories but smaller than in human-generated stories. In addition, we provide empirical evidence that our proposed GPT-DATector outperforms current state-of-the-art detectors, and this result is robust across various writing tasks and types of GenAI. Last, but most importantly, beyond achieving better prediction performance, our results show that GPT-DATector can reduce bias in evaluating essays by non-native English speakers when tested on an additional dataset.

This essay is organized as follows. The next section begins with a review of prior literature on the linguistic characteristics of GenAI-generated texts and existing detection methodologies. Section 2.3 introduces the proposed metrics, the detection framework, and the datasets used for empirical validation. Section 2.4 presents and interprets the results, demonstrating how and why the proposed metrics effectively distinguish between human- and AI-generated texts, while providing evidence that the framework enhances the prediction accuracy of current GenAI detection models. Finally, Section 2.5 concludes with a summary of findings and a discussion of this research.

2.2 Related Literature

In this section, we review two key streams of literature: the linguistic features of texts generated by GenAI and the existing methodologies for detecting such AI-generated content.

2.2.1 Linguistic Characteristics of GenAI-generated Text

Our work is related to the literature on computational linguistic features that distinguish GenAI-generated from human-written texts (Beresneva, 2016; Ma et al., 2023). One of those features is perplexity, which measures the average uncertainty of an LLM in predicting the next word in a sequence. In general, lower perplexity is observed in AI-generated content. For instance, given the sentence "Hi there, I am an AI", an LLM is likely to predict the continuation of this sentence with the word "assistant", resulting in a lower perplexity score [6]. In contrast, if the next word is "detector", the perplexity of the sentence would increase significantly, indicating a higher probability of being written by a human. Moreover, Ma et al. (2023) further demonstrate that the lower perplexity observed in AI-generated content follows from the training objectives of LLMs. Specifically, these objectives optimize the model's ability to produce text sequences with high probability, leading to lower perplexity. However, human-written content extends beyond mere generation to "do things", such as organizing complex information or persuading.

[6] This example comes from https://support.gptzero.me/hc/en-us/articles/151300702305

While perplexity provides a measure of the average uncertainty inherent in LLMs, it does not account for their dynamic characteristics. Specifically, perplexity weights each word prediction equally, overlooking the bursty nature of language, wherein certain words or phrases are more prevalent in particular contexts. Therefore, the concept of burstiness is incorporated into GPT detectors to capture patterns of word distribution and occurrence within generated texts. LLMs tend to apply uniform rules for word selection, resulting in lower burstiness. In contrast, a higher value of burstiness is often indicative of human-written content. Thus, burstiness is also considered an important factor, among several others, in GPT classifiers such as GPTZero.
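As an illustration of the perplexity feature discussed above, the following sketch scores a text with an open model (GPT-2 via Hugging Face transformers); this is a generic stand-in for illustration, not the proprietary pipeline used by detectors such as GPTZero.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()  # lower values suggest more "expected" text

print(perplexity("Hi there, I am an AI assistant."))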
Recent research includes other linguistic features, such as distributions of part-of-speech (POS) tags and named entity (NE) tags, in LLM classifiers (Fröhling and Zubiaga, 2021; Crothers, Japkowicz, and Viktor, 2023). Such work is motivated by observed differences in POS tag distributions between human- and AI-generated texts (Radford et al., 2019; See et al., 2019). Moreover, See et al. (2019) examine the differences in POS distributions between model-generated and human texts across LLM parameters, discovering that as K (a parameter related to vocabulary size [7]) increases, the lexical diversity of LLMs approaches that of human-generated text. Specifically, as K nears the vocabulary size, the POS distribution of LLM-generated texts closely fits that of human text, for instance, generating more precise POS categories like Numeral and Proper Noun. However, those findings predate the development of ChatGPT.

[7] They generate stories using top-k sampling, where the value of K ranges from 1 to the vocabulary size. Top-k sampling samples tokens with the highest probabilities until the specified number of tokens is reached.

Martínez et al. (2024) further evaluate the lexical richness of texts generated by the latest LLMs, such as GPT-3.5 and GPT-4.0, suggesting that ChatGPT-4 produces texts with greater lexical richness than ChatGPT-3.5. In that study, lexical richness is computed from the total number of words and the count of distinct words, following Tweedie and Baayen (1998) and Van Hout and Vermeer (2007). Since there remains room for comparative analysis of lexical richness between human-written and AI-generated texts (Martínez et al., 2024), our study addresses this gap by introducing a novel methodology for estimating lexical richness. Specifically, we propose a semantic distance metric that employs the word-embedding approach to serve as a proxy for the lexical richness of texts. This metric aims to provide a more nuanced understanding of the textual complexities inherent in human versus AI-generated content.

Our work extends this stream of literature by considering sentence similarities within the content. Specifically, we propose a metric, DAT(Sent), inspired by the divergent thinking creativity literature (Olson et al., 2021). This metric estimates the degree of semantic dissimilarity among sentences within a piece of text. We do this by splitting the text into sentences and mapping each sentence to a high-dimensional space using sentence embeddings, which provide a more nuanced and comprehensive representation of the meaning of text (Le and Mikolov, 2014). Unlike previous research that compares entire documents to a median or centroid reference within a group, which can change by adding new samples (Doshi and Hauser, 2024), DAT(Sent) is a more stable metric independent of external samples.

2.2.2 Current GenAI Detection Methodologies

Our work is also associated with the literature on GenAI detection. Prior research on GenAI detection includes three approaches: 1) classification methods (Solaiman et al., 2019), 2) watermark techniques (Kirchenbauer et al., 2023), and 3) statistical methods (Tian and Cui, 2023; Mitchell et al., 2023). First, classification methods conceptualize the detection of AI-generated text as a binary classification problem.
In this approach, a classifier is developed and trained to distinguish between machine-generated and human-generated texts (Crothers et al., 2022; Solaiman et al., 2019). For instance, Desaire et al. (2023) propose a GPT detector built from scratch, employing 20 textual features in conjunction with XGBoost, specifically within the realm of scientific chemistry journals. A typical example is the classifier OpenAI launched in January 2023, engineered to differentiate between texts generated by humans and those generated by AI, utilizing a RoBERTa-based model (Solaiman et al., 2019). Nevertheless, this tool was discontinued in July 2023 due to its low accuracy [8], suggesting the inherent difficulties in developing highly effective GPT detectors.

[8] https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text

The second stream of literature focuses on watermark techniques. Post-hoc watermarking techniques can be effectively applied to LLMs and include rule-based approaches (Brassil et al., 1995; Kankanhalli and Hau, 2002) and deep-learning-based strategies (Ueoka, Murawaki, and Kurohashi, 2021; Dai et al., 2022). Kirchenbauer et al. (2023) further introduced a novel approach at inference time, proposing a soft watermarking scheme. This method embeds a watermark in each word of a generated sentence by dividing the vocabulary into distinct lists and sampling the next token in a differentiated manner.

The last stream of literature is statistical methods. The general idea is to identify AI-generated text through the analysis of statistical measures such as entropy (Lavergne, Urvoy, and Yvon, 2008) and perplexity (Beresneva, 2016). These methods implement a threshold on the aforementioned statistics to distinguish AI-generated content. Gehrmann, Strobelt, and Rush (2019) introduced the GLTR visualizer, a tool designed to assist humans in detecting AI-generated text by leveraging entropy, probability, and probability rank. Recently, the release of ChatGPT led to the development of two GPT detectors: the closed-source GPTZero (Tian and Cui, 2023) and the open-source DetectGPT (Mitchell et al., 2023). The closed-source GPTZero is the leading AI detector, with a global user base of over 2.5 million [9]; it was trained on a large, diverse corpus of human-written and AI-generated text, with a focus on English prose. GPTZero is capable of detecting AI-generated content at the sentence, paragraph, and document levels through a comprehensive methodology that incorporates linguistic features, including average perplexity [10] and burstiness [11]. The open-source DetectGPT (Mitchell et al., 2023) uses only log probabilities computed by the model of interest and random perturbations of the passage from another generic pre-trained language model (e.g., T5) (Raffel et al., 2020). AI-text detection is then performed by comparing the log probability of the text and its infilled variants. Despite the many available detectors, the evolving capabilities of GenAI present a continuous challenge to effectively differentiating between human- and AI-generated content.

[9] https://gptzero.me/faq
[10] After each word in the text, our model develops suggestions of what word is coming next. It checks if our suggestions match what is actually in the text.
[11] The burstiness check analyzes how similar the text is to AI patterns of writing. A human-written document will have changes in style and tone throughout the text, whereas AI content remains similar throughout the document.
Therefore, in this study, we design a GenAI detector called GPT-DATector, following the idea of using LLMs to combat LLMs (Verspoor, 2024). Specifically, we propose a flexible two-stage framework in which the first stage can be any existing GPT detector and the second stage is a logistic regression. In the second stage, we incorporate the output of the stage-1 model along with features capturing semantic dissimilarities at both the sentence and word levels. We provide empirical validation to demonstrate that our model achieves higher prediction performance compared to the baseline model.

2.2.3 LLMs and Creativity

Recent research on the creativity of LLMs addresses a multi-faceted construct with two main aspects: LLMs' creative capabilities and their capabilities in creative writing tasks. First, the existing literature examining LLMs' creative capabilities largely adopts frameworks developed for assessing human creativity, including subjective and objective assessments. Subjective approaches often rely on human expert judgments to evaluate the novelty and usefulness of LLM-generated outputs across domains such as artistic expression (Crothers, Viktor, and Japkowicz, 2023). An alternative subjective evaluation takes a cognitive perspective, using creativity tasks such as the Alternate Uses Task (AUT), where creativity is measured by the ability to generate unconventional uses for common objects (Guilford, 1964; Summers-Stay, Voss, and Lukin, 2023; Haase and Hanel, 2023). In contrast, objective assessments focus on the underlying structure of semantic memory to evaluate divergent thinking. Among these approaches, the Divergent Association Task (DAT) offers a scalable and valid measure of creativity (Olson et al., 2021) by assessing the semantic distance between words generated by participants (Beaty and Johnson, 2021; Olson et al., 2021). Empirical findings generally demonstrate that LLMs outperform humans in divergent thinking tasks (Bellemare-Pepin et al., 2024; Sun et al., 2025), indicating greater creative capabilities of LLMs.

Furthermore, the literature investigates LLMs in the domain of creative writing, primarily characterizing their function in one of two roles: as autonomous writers of creative content or as collaborative co-authors augmenting human creativity. On the one hand, LLMs have been employed to independently generate creative content such as stories, narratives, and design concepts (Bellemare-Pepin et al., 2024; Peeperkorn et al., 2024). On the other hand, LLMs are increasingly used as collaborative writing co-authors, offering suggestions and ideation support for creative tasks (Lee, Liang, and Yang, 2022; Yang et al., 2022; Yuan et al., 2022). This line of research emphasizes the importance of interface design and interaction dynamics in augmenting human creativity (Clark et al., 2018; Bhat et al., 2023). While LLMs exhibit strong performance in divergent thinking tasks, mixed findings have been obtained regarding their effectiveness in complex creative writing contexts.
Some studies report that LLMs can surpass human performance in specific creative writing scenarios (Bellemare-Pepin et al., 2024), whereas others demonstrate that LLMs can lag in creative writing (Sun et al., 2025). In this study, we extend the literature on LLMs and creativity by extending the DAT framework from word-level to sentence-level semantic distance. Based on these features, we develop DAT-based GPT detectors and demonstrate that our approach outperforms existing detection methods.

2.3 Methods and Datasets

2.3.1 Measurements

The proposed metric for measuring semantic dissimilarities is inspired by the creativity literature, specifically the DAT score from Olson et al. (2021). They present an objective way to assess human creativity by calculating the semantic distance between pairs of words. The underlying assumption is that creative individuals list words with larger semantic distances between them. Specifically, they ask participants to think of seven unrelated words {word₁, ..., word₇}, then map each word into a high-dimensional space via a vector of embeddings. After that, the DAT score is derived as the transformed average of the semantic distances between each pair of words.

Aligned with this literature but equipped with the advantages of sentence embeddings (Le and Mikolov, 2014), in this study we generalize the DAT score to sentence embeddings within a document. Specifically, for each document, we segment it into sentences and then map each sentence into a 1536-dimensional vector [13]. Then, for each document with n sentences and their sentence embeddings {v₁, ..., vₙ}, DAT(Sent) is computed as the average cosine distance between all pairs of sentence embeddings, as defined in Equation (2.1). This involves determining the semantic dissimilarity between each pair of sentences using cosine distance and then averaging these distances. To obtain a measure that ranges from zero to 100, we further multiply the value by 100. A document with a higher DAT(Sent) score exhibits greater semantic dissimilarity between each pair of sentences, and a minimum score of zero indicates that all sentences are identical.

DAT(Sent) = (100 / (n(n - 1))) Σ_{i,j; i ≠ j} CosineDistance(v_i, v_j)    (2.1)

[13] We adopt OpenAI's state-of-the-art transformer-based embedding model "text-embedding-3-small", released in January 2024.

Following the idea of capturing dynamic semantic dissimilarities through sentence embeddings, we further propose three additional semantic features among sentences for each document: (1) Variance(Sent) is the normalized variance of the sentence embeddings, Variance(Sent) = (1/n) Σ_{i=1}^{n} Distance(v_i, μ)², where μ represents the centroid of all sentence embeddings; (2) Diff(Sent) is the average of the distances between every two consecutive sentence embeddings v_{i-1} and v_i, Diff(Sent) = (1/(n-1)) Σ_{i=2}^{n} Distance(v_{i-1}, v_i); and (3) Diff2(Sent) is the average of the squared distances between every two consecutive sentence embeddings, Diff2(Sent) = (1/(n-1)) Σ_{i=2}^{n} Distance(v_{i-1}, v_i)². In the following analysis, we refer to these four sentence-level semantic divergence features (including DAT(Sent)) as "Sentence dissimilarities".

To further capture semantic divergence at the word level, we propose a DAT(Word) metric based on word embeddings (the orange area in Figure 2.2).
To further capture the semantic divergences at the word level, we propose a DAT(Word) metric based on word embeddings (the orange area in Figure 2.2). The difference between our DAT(Word) and the metric used in the creativity literature (Olson et al., 2021) is that our approach considers all unique words in a long piece of text, rather than a limited number (seven) of nouns generated by humans. Specifically, after eliminating stopwords and punctuation and performing lemmatization, we split each piece of text into a unique set of words (i.e., n unique words), rather than focusing solely on nouns as Olson et al. (2021) suggested.
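A minimal sketch of the DAT(Word) preprocessing and computation follows, assuming NLTK for tokenization and lemmatization and a caller-supplied word_vector lookup (e.g., a pretrained embedding table); the function and parameter names are ours, not the study's code.

```python
import string
from itertools import combinations
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")

def dat_word(text, word_vector):
    """DAT(Word): average pairwise cosine distance over the unique content
    words of a text, scaled to [0, 100] (assumes >= 2 in-vocabulary words)."""
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    tokens = nltk.word_tokenize(text.lower())
    # Eliminate stopwords and punctuation, lemmatize, and keep unique words only.
    words = {lemmatizer.lemmatize(t) for t in tokens
             if t not in stops and t not in string.punctuation and t.isalpha()}
    vectors = [word_vector(w) for w in words]
    vectors = [v for v in vectors if v is not None]  # skip out-of-vocabulary words
    distances = [1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
                 for u, v in combinations(vectors, 2)]
    return 100.0 * float(np.mean(distances))
```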
2.3.2 Proposed Two-stage GenAI Detector: GPT-DATector

The proposed two-stage GenAI detector, GPT-DATector, illustrated in Figure 2.1, consists of a first-stage model, flexibly chosen from any existing GenAI detector, and a second-stage logistic regression model. For further clarity, Figure 2.2 presents a more detailed breakdown of the model framework, particularly the components within the second-stage model. Specifically, in the second-stage model, the logistic regression operates on three key feature sets: those derived from the first-stage model, features capturing sentence-level dissimilarities, and features capturing word-level dissimilarities.

Figure 2.2 Proposed two-stage GPT detector: GPT-DATector.

First-stage model. In this study, we utilize two GenAI detectors as the first-stage models: the open-source GPT-who (Venkatraman, Uchendu, and Lee, 2024) and the closed-source (black-box) GPTZero (Tian and Cui, 2023).

We use the open-source GPT-who as a first-stage model in our detector, as it has been shown to outperform several state-of-the-art detectors, including GLTR and DetectGPT, by over 20% across more than 10 domains (Venkatraman, Uchendu, and Lee, 2024). GPT-who employs a psycholinguistically motivated Uniform Information Density (UID)-based feature space, grounded in the theoretical assumption that humans generally distribute information evenly in language production, resulting in smaller fluctuations in the distributions of next-word prediction probabilities (Frank and Jaeger, 2008; Mahowald et al., 2013; Xu and Reitter, 2018). Because empirical analysis shows that machine-generated text tends to have more evenly distributed information, GPT-who, based on this UID approach, effectively distinguishes human-written from machine-generated text across various tasks, domains, and datasets.

In addition to the open-source GPT-who, we adopt the closed-source (black-box) GPTZero (Tian and Cui, 2023) as a first-stage model due to its high accuracy relative to competing detectors. GPTZero employs a multilayered approach encompassing seven modules, including burstiness, perplexity, and an end-to-end deep learning model.14 This design enables GPTZero to capture well-documented characteristics for differentiating human- and AI-generated content. For each text, GPTZero provides a predicted label, which classifies the content as human-generated, AI-generated, or mixed (a blend of human and AI elements).

14 https://gptzero.me/technology

Second-stage model. Our objective is to assess whether incorporating creativity-related features, specifically DAT-based metrics, enhances the predictive accuracy of existing GPT detectors such as GPT-who or GPTZero. Consequently, in the second phase of the modeling process, we evaluate model performance by supplementing the outputs (or features) of these detectors with additional DAT-based features. To further examine the marginal effectiveness of distinct features, we configure models with three feature sets: (1) the stage-1 output alone, serving as the baseline without DAT-based features; (2) the stage-1 output combined with the DAT-based sentence-level features; and (3) the stage-1 output combined with both the DAT-based sentence-level and word-level features.

When conducting the second-stage modeling, we implement a logistic regression model for two-class classification, discerning between human-written and AI-generated documents. In cases involving the CoAuthors dataset, where a "mixed" category (indicating human-AI co-authorship) is present, we apply a multinomial logistic regression model for three-class classification, an extension of logistic regression suited to multi-class scenarios.15

15 A multinomial logistic regression modifies the loss function to cross-entropy loss and adjusts the predicted probability distribution to a multinomial probability distribution, supporting multi-class classification problems.
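The skeleton below illustrates this two-stage design under stated assumptions: stage1_scores stands in for the output of whichever existing detector is used (GPT-who or GPTZero), the placeholder data are random, and the C grid with three-fold cross-validation mirrors the training procedure described later in Section 2.3.4. It is a sketch, not the study's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def build_features(stage1_scores, sentence_feats, word_feats=None):
    """Stack stage-1 detector outputs with DAT-based sentence-level
    (and optionally word-level) dissimilarity features."""
    columns = [np.asarray(stage1_scores).reshape(-1, 1), np.asarray(sentence_feats)]
    if word_feats is not None:
        columns.append(np.asarray(word_feats).reshape(-1, 1))
    return np.hstack(columns)

# Toy placeholder data standing in for real detector scores and DAT features.
rng = np.random.default_rng(0)
X = build_features(rng.random(200), rng.random((200, 4)), rng.random(200))
y = rng.integers(0, 2, size=200)  # 0 = human-written, 1 = AI-generated

# Stage 2: logistic regression. scikit-learn fits a multinomial model
# automatically when y contains three classes (human, AI, mixed); the
# three-class case would also need a multiclass scorer such as "roc_auc_ovr".
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```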
2.3.3 Datasets

We examine various writing tasks common in traditional educational settings, with a particular focus on argumentative essay composition and prompted story generation. For argumentative essays, two datasets are considered. (1) ArguTOEFL(GPT), from ArguGPT (Liu et al., 2023), contains 1,680 human-written TOEFL essays and 1,635 essays generated by 7 recent GenAIs (including many variants of ChatGPT) using prompts from TOEFL11 (Blanchard et al., 2013). (2) HW(GPT3.5) contains 1,800 human-written argumentative essays from The Hewlett Foundation (HW)16 and 1,800 AI-generated essays based on the same prompt. Specifically, the 1,800 human-generated essays were written by U.S. Grade 10 students in response to a specific persuasive prompt, and the 1,800 AI-generated essays were created by GPT-3.5 with the same prompt. Instead of using "gpt-3.5-turbo", which is chatty, we adopt the "gpt-3.5-turbo-instruct" version, which is much terser and more concise.17

For prompted story generation, two datasets related to the Reddit WritingPrompts18 are considered: WP (Li et al., 2023) and CoAuthors (Lee, Liang, and Yang, 2022). (3) WP(GPT), from Li et al. (2023), contains 800 randomly selected human-written stories and 800 AI-generated stories based on a mix of 27 GenAIs. (4) CoAuthors(GPT3.5), based on Lee, Liang, and Yang (2022), enables the exploration of differences in the distribution of DAT(Sent) not only between human- and AI-written samples but also within human+AI generated samples. Specifically, in this dataset, the percentage of AI-generated words is tracked. This allows us to define two types of samples with human involvement: one where the majority of content is human-written (i.e., less than 15% AI-generated, comprising 226 stories), and another where the content is a human+AI blend (i.e., between 15% and 70% AI-generated, comprising 581 stories). Additionally, we include 700 stories generated solely by GPT-3.5. (5) Lastly, we construct an additional dataset, CoAuthors(GPT4), which includes the same human-written and human+AI co-created stories, along with 700 stories generated by GPT-419 alone.

16 It is noted that we focus on essays in the second set written by U.S. Grade 10 students. https://www.kaggle.com/competitions/asap-aes/data
17 https://community.openai.com/t/instructgpt-vs-gpt-3-5-turbo/434241
18 https://www.reddit.com/r/WritingPrompts
19 We use the latest version of the "gpt-4-turbo-preview" model, which was trained using data up to December 2023. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo

Therefore, unlike the first three datasets, which consist of essays or stories written solely by either humans or GenAI, the two CoAuthors datasets, CoAuthors(GPT3.5) and CoAuthors(GPT4), include an additional category featuring stories co-created by human writers in collaboration with GenAI.

Lastly, to further investigate the performance of GPT-DATector in improving fairness, we evaluate it using an additional dataset (Liang et al., 2023), which includes essays written by native and non-native English writers. Specifically, after building GPT-DATector, we compare both its prediction performance and its bias against non-native English writers to those of existing detectors.

2.3.4 Training Process and Evaluation Metrics

Training process. We first split the samples into train and test sets using a 6:4 ratio. During the training process, parameter optimization involves tuning the predefined hyperparameter C (the inverse of regularization strength) over the values [0.01, 0.1, 1, 10, 100]. Moreover, we conduct three-fold cross-validation to select the parameter that maximizes the area under the receiver operating characteristic curve (AUCROC). Once we find the best parameter, we retrain the model and use it to make predictions on the test set. Finally, we compute prediction performance on the test set. It is noted that we include text documents that contain at least one complete sentence and range in length from 50 to 600 words.

Evaluation metrics. We evaluate the predictive performance of GPT-DATector using two metrics: (1) accuracy and (2) the area under the receiver operating characteristic curve (AUCROC). For the two-class classification task, detection accuracy measures the model's ability to correctly classify texts as AI-generated or human-written. However, as detection accuracy can vary with threshold selection, AUCROC is used to evaluate performance across all possible thresholds, providing a comprehensive assessment (Mitchell et al., 2023; Krishna et al., 2024). AUCROC represents the probability that the classifier ranks a randomly chosen positive (AI-generated) sample higher than a randomly chosen negative (human-written) sample, thus capturing both precision and recall. This metric offers a robust evaluation method for detector performance across threshold settings (Mitchell et al., 2023).

For the multiclass classification task on the CoAuthors dataset, which includes three categories, we use two generalized versions of AUCROC: macro-average AUC and weighted-average AUC. The macro-average AUC calculates the AUC for each class separately and then averages these values, giving equal weight to each class regardless of its size. In contrast, the weighted-average AUC adjusts for class imbalance by weighting each class's AUC by the number of actual instances in that class before averaging.

For fairness evaluation, we employ two commonly used metrics: (1) Disparate Impact (DI), which quantifies the ratio of favorable outcomes between the underprivileged and privileged groups. A DI value of 1 indicates equal benefit across groups, values below 1 indicate a bias favoring the privileged group, and values above 1 indicate a bias favoring the underprivileged group. (2) Equal Opportunity Difference (EOD), which measures the difference in true positive rates between these groups. An EOD of 0 denotes equal benefit, values below 0 indicate a bias toward the privileged group, and values above 0 indicate a bias toward the underprivileged group.
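The two fairness metrics can be computed directly from model predictions. In this sketch, the favorable outcome is taken to be a "human-written" classification (so misclassifying a human essay as AI-generated is the unfavorable case); that encoding, and the function names, are illustrative assumptions rather than the study's code.

```python
import numpy as np

def disparate_impact(pred_unpriv, pred_priv, favorable=0):
    """DI: ratio of favorable-outcome rates, underprivileged / privileged.
    1 means equal benefit; values below 1 favor the privileged group."""
    rate_u = np.mean(np.asarray(pred_unpriv) == favorable)
    rate_p = np.mean(np.asarray(pred_priv) == favorable)
    return rate_u / rate_p

def equal_opportunity_difference(true_u, pred_u, true_p, pred_p, favorable=0):
    """EOD: difference in true positive rates (underprivileged - privileged)
    among samples whose true label is the favorable one.
    0 means equal benefit; values below 0 favor the privileged group."""
    true_u, pred_u = np.asarray(true_u), np.asarray(pred_u)
    true_p, pred_p = np.asarray(true_p), np.asarray(pred_p)
    tpr_u = np.mean(pred_u[true_u == favorable] == favorable)
    tpr_p = np.mean(pred_p[true_p == favorable] == favorable)
    return tpr_u - tpr_p
```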
2.4 Empirical Results

2.4.1 Human-generated Texts with Larger Sentence-level Semantic Dissimilarities

Our analysis reveals a significant distinction in sentence-level semantic dissimilarities between human- and AI-generated texts (Figure 2.3). Specifically, compared to AI-generated texts, human-written texts exhibit higher semantic dissimilarity between sentences. We find that the DAT(Sent) distribution differs significantly between the two text types, and this difference is robust across varying writing tasks and datasets. On average, human-written texts demonstrate a higher DAT(Sent) score (indicated in yellow) than AI-generated texts (indicated in blue). The fourth and fifth subplots, representing CoAuthors(GPT3.5) and CoAuthors(GPT4), incorporate results from a set of texts co-created by humans and AI. Analysis reveals that the mean DAT(Sent) for these collaboratively generated stories (shown in gray) falls between that of AI-generated stories (in blue) and human-authored stories (in yellow). This intermediate DAT(Sent) score suggests that human-AI collaborative texts exhibit a moderate degree of sentence-level semantic dissimilarity, bridging the characteristics of both human and AI authorship.

Figure 2.3 Distribution of sentence dissimilarity, DAT(Sent), across human- and AI-generated texts among different datasets.

To assess whether the observed differences in the empirical cumulative distribution functions between each class pair are statistically significant and directional, we perform a one-sided Kolmogorov-Smirnov test. Specifically, the null hypothesis is formulated as H0a: DAT(Sent)_Human < DAT(Sent)_AI. For each data source, the null hypothesis H0a is rejected at the 5% significance level, with p-values of 4.12e-237 (HW(GPT3.5)), 2.93e-143 (ArguTOEFL(GPT)), 1.54e-237 (WP(GPT)), 6.88e-21 (CoAuthors(GPT3.5)), and 3.69e-18 (CoAuthors(GPT4)), respectively. These findings consistently demonstrate that the DAT(Sent) metric for human-written texts is significantly higher than that for AI-generated texts.

To further investigate trends in human-AI collaborative texts, we conduct two additional one-sided Kolmogorov-Smirnov tests across both the CoAuthors(GPT3.5) and CoAuthors(GPT4) datasets. Specifically, we test two null hypotheses: H0b: DAT(Sent)_Human+AI < DAT(Sent)_AI; and H0c: DAT(Sent)_Human < DAT(Sent)_Human+AI. For the CoAuthors(GPT3.5) dataset, we reject H0b at the 5% significance level (p = 3.92e-19), indicating that human+AI generated texts show significantly higher DAT(Sent) values than AI-generated texts. In contrast, when comparing human+AI texts to human-only texts, we find significantly lower DAT(Sent) values in the collaborative texts, rejecting H0c at the 5% level (p = 0.005). These findings are consistent in the CoAuthors(GPT4) dataset: human+AI co-generated texts again demonstrate greater DAT(Sent) than AI-generated texts (rejecting H0b, p = 1.40e-19) and lower DAT(Sent) than texts generated by humans only (rejecting H0c, p = 0.005).
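Directional comparisons of this kind can be run with SciPy's two-sample KS test. The arrays below are random placeholders for per-document DAT(Sent) scores; the direction of alternative="less" reflects that a sample whose empirical CDF lies below another's tends to take larger values.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder scores; in practice these are the per-document DAT(Sent) values.
rng = np.random.default_rng(1)
dat_sent_human = rng.normal(loc=60, scale=8, size=500)
dat_sent_ai = rng.normal(loc=50, scale=8, size=500)

# alternative="less": the alternative hypothesis is that the CDF of the first
# sample lies below that of the second, i.e., human scores tend to be larger.
stat, p_value = ks_2samp(dat_sent_human, dat_sent_ai, alternative="less")
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.2e}")
```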
In summary, our findings reveal clear directional differences in sentence-level semantic dissimilarity, as measured by DAT(Sent), for texts generated solely by AI, solely by humans, and through human-AI collaboration. Specifically, human-generated texts consistently exhibit the highest average DAT(Sent) values, underscoring distinct patterns in semantic structure based on the source of generation. These results suggest that such DAT-related metrics could enhance the detection capabilities of existing GPT classifiers, a potential we explore in the following section by assessing the impact of integrating these features on classifier performance.

2.4.2 Enhanced Predictive Performance by Leveraging Semantic Dissimilarities among Sentences and Words

Figure 2.4a illustrates that our proposed two-stage model, GPT-DATector, which integrates features capturing semantic dissimilarities among sentences (in orange), consistently outperforms the baseline stage-1 model (in blue). Specifically, GPT-DATector, leveraging the open-source GenAI detector GPT-who (Venkatraman, Uchendu, and Lee, 2024) as its initial stage, was evaluated across multiple datasets with distinct feature sets. Prediction performance on test datasets is reported (see Methods, Section 2.3.2). Results show that incorporating the sentence-dissimilarity features, namely DAT(Sent), Var(Sent), Diff(Sent), and Diff2(Sent), significantly improves both accuracy and AUCROC compared to the baseline. The third subplot in Figure 2.4a specifically demonstrates an increase in AUCROC from 0.496 to 0.818 on the WP(GPT) dataset with GPT-DATector.

In addition, we consistently observe enhanced performance in the three-class classification task (distinguishing between human-generated, AI-generated, and collaboratively human+AI generated texts) using the CoAuthors datasets. The fourth subplot in Figure 2.4a illustrates that, compared to the baseline model (in blue), GPT-DATector with sentence-level semantic dissimilarity features (in orange) improves the macro-average AUC from 0.910 to 0.920 and the weighted-average AUC from 0.935 to 0.946 on the CoAuthors(GPT3.5) dataset. Similarly, the fifth subplot in Figure 2.4a reveals a comparable improvement on the CoAuthors(GPT4) dataset, where the macro-average AUC rises from 0.864 to 0.887 and the weighted-average AUC increases from 0.890 to 0.911. Taken together, these consistent enhancements across varying datasets underscore the robustness of our approach.

Furthermore, augmenting the model with DAT(Word) on top of the sentence-level features, thereby capturing semantic dissimilarities at both the word and sentence levels (the green bars in Figure 2.4a), achieves the highest accuracy and AUCROC. This suggests that features encoding content divergence across multiple linguistic levels, sentence and word, substantially enhance the model's capability to distinguish between human- and AI-generated texts. This finding aligns with recent literature advocating the use of LLMs (i.e., in our study, sentence- or word-embeddings) to counter LLMs (i.e., in our study, differentiating texts written by GenAI from texts written by humans) (Verspoor, 2024).

Lastly, to demonstrate the generalizability and flexibility of GPT-DATector, we replace the stage-1 model, the open-source GPT-who, with the closed-source GPTZero. Comparable findings are observed in Figure 2.4b.

Figure 2.4 Test set performance (Accuracy and AUCROC) across different datasets, where the stage-1 model is the open-source GPT-who (Fig. 2.4a) or the closed-source GPTZero (Fig. 2.4b), and the stage-2 model is the (multinomial) logistic regression.
Note: For CoAuthors(GPT3.5) and CoAuthors(GPT4) in the three-class classification scenario (human, AI, and human+AI), we implement multinomial logistic regression, replacing logistic regression as the stage-2 model. Additionally, we substitute the AUCROC evaluation metric with two complementary metrics, macro-average AUC and weighted-average AUC, for a more comprehensive performance assessment.

In the second subplot of Figure 2.4b, analyzing the ArguTOEFL(GPT) dataset,20 GPT-DATector demonstrates enhanced AUCROC values, highlighting its superior predictive capacity, particularly when integrating semantic divergence across both the sentence and word levels (in green).

20 We find a slight decrease in accuracy when incorporating the different feature sets, sentence-level features alone (in orange) or both sentence- and word-level features (in green), compared to the baseline model (in blue). However, accuracy, limited by a fixed probabilistic threshold of 0.5, inadequately captures the model's discriminative ability due to its sensitivity to class imbalance. When evaluated with AUCROC, a more robust performance metric, the model shows a substantial improvement.

Therefore, the proposed GPT-DATector, integrating a comprehensive set of features that capture semantic dissimilarities at both the sentence and word levels, consistently outperforms baseline models (i.e., existing GenAI detectors such as GPT-who or GPTZero) across varying datasets. This enhanced prediction performance highlights the effectiveness of incorporating multi-level semantic features in combating the misuse of GenAI and deepens our understanding of the distinguishing characteristics of human- versus AI-generated texts.
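For the three-class CoAuthors case, the macro- and weighted-average AUCs reported above can be computed with scikit-learn's one-vs-rest generalization of AUCROC. The probability matrix below is a random placeholder standing in for the stage-2 model's predict_proba output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 3, size=300)        # 0 = human, 1 = AI, 2 = human+AI
proba = rng.dirichlet(np.ones(3), size=300)  # placeholder class probabilities

# One-vs-rest AUC, averaged with equal class weight (macro) or by class size (weighted).
macro_auc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")
weighted_auc = roc_auc_score(y_true, proba, multi_class="ovr", average="weighted")
print(f"macro-average AUC = {macro_auc:.3f}, weighted-average AUC = {weighted_auc:.3f}")
```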
2.4.3 Reducing Bias against Non-Native English Writers

So far, we have demonstrated the improved predictive performance of GPT-DATector. In this section, we investigate whether GPT-DATector, which incorporates semantic dissimilarity features at both the sentence and word levels, can improve fairness. Our analysis uses human-generated texts21 from two groups: a privileged group (native English speakers) and an underprivileged group (non-native English speakers).

21 We assess GPT-DATector on a dataset (Liang et al., 2023) containing 91 TOEFL essays by non-native English speakers and 88 essays by U.S. native speakers, which demonstrates that existing GPT detectors exhibit bias against non-native English writers by disproportionately misclassifying their work as AI-generated.

In Section 2.3.1, we propose DAT(Sent) by hypothesizing that, unlike GenAI, human writers tend to display greater exploratory, divergent-thinking characteristics in language production (Figure 2.3). To further assess the effectiveness of DAT(Sent) in mitigating bias across human-written texts from both native and non-native English speakers, we expect similar distributions of DAT(Sent) for each group, reflecting the shared human origin of both sets. Ideally, the distribution of DAT(Sent)_NativeSpeaker should align closely with DAT(Sent)_NonNativeSpeaker, since both groups' texts are generated by humans. Practically, our expectation is that the DAT(Sent) distributions of these groups exhibit greater overlap with one another than do the distributions of features typically used in existing GenAI detectors (i.e., the key feature used in GPT-who, the variance in uniform information density, UID(Var)). Figure 2.5 supports our hypothesis.

First, the left subplot demonstrates that native and non-native English speakers exhibit comparable DAT(Sent) distributions, shown through kernel density estimation. Specifically, the distributions of DAT(Sent) for native English speakers (in yellow) and non-native speakers (in orange) reveal substantial overlap, indicating similar trends in sentence-level semantic dissimilarity across both groups. This overlap aligns with our expectation, as both groups consist of human writers and thus reflect common divergent-thinking capabilities in language generation, irrespective of native-language status.

Figure 2.5 Model-free evidence of DAT(Sent) in mitigating bias against non-native English speakers on an additional dataset (Liang et al., 2023). The two sub-figures represent the kernel density estimation of the DAT(Sent) and UID(Var) distributions, respectively. UID(Var), a key feature used in the baseline model GPT-who, quantifies surprisal in next-word prediction (Hale, 2001; Xu and Reitter, 2018). In contrast, our proposed feature, DAT(Sent), reflects semantic dissimilarities between sentences.

In contrast, the right subplot in Figure 2.5 represents the distribution of UID(Var),22 a key feature used in our baseline model GPT-who,23 indicating an inherent bias within GPT-who. While GPT-who leverages UID-related features to differentiate between human- and AI-generated content, these UID-related features reveal distinct distributions between native and non-native English speakers, despite both groups' texts being human-written. These findings align well with previous research showing that non-native English writers generally demonstrate lower linguistic variability, including reduced lexical richness (Laufer and Nation, 1995), lexical diversity (Jarvis, 2002), and syntactic complexity (Lu, 2011).

22 Specifically, human-written text shows greater variance in surprisal for next-word predictions compared to AI-generated text (Venkatraman, Uchendu, and Lee, 2024). The term "surprisal" refers to how unexpected a word is within a given context (Hale, 2001; Xu and Reitter, 2018); less predictable words have larger surprisal, while highly predictable words carry less information.
23 In the analysis of fairness performance, we focus on open-source detectors like GPT-who. The reason is that, compared to black-box detectors such as GPTZero, which lack explainability and interpretability, open-source detectors like GPT-who include key features that can be analyzed and interpreted.

We further compute Kolmogorov-Smirnov (KS) test statistics24 to investigate whether the distribution of DAT(Sent) exhibits statistically greater similarity between native and non-native English speakers than that of UID(Var). The KS statistic is 0.33 for DAT(Sent) and 0.70 for UID(Var). The lower KS statistic for DAT(Sent) signifies a significantly higher distributional overlap between essays written by native and non-native English speakers in DAT(Sent) than in UID(Var), supporting our hypothesis of greater similarity in DAT(Sent) for human-generated texts.

24 In this case, we conduct two-sided Kolmogorov-Smirnov tests. Specifically, H0d: DAT(Sent)_NativeSpeaker = DAT(Sent)_NonNativeSpeaker; and H0e: UID(Var)_NativeSpeaker = UID(Var)_NonNativeSpeaker. The KS statistic quantifies the maximum difference between the cumulative distribution functions of the two groups, with smaller KS statistics indicating a smaller distance, or closer similarity, between distributions.
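The two-sided KS comparison behind these statistics can be sketched as follows. The arrays are random placeholders for the per-essay DAT(Sent) and UID(Var) values of each group (UID(Var) itself would come from the GPT-who feature pipeline, which is not reproduced here); the sample sizes follow the 91/88 split described above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
# Placeholders for per-essay feature values of the two human groups.
dat_native, dat_nonnative = rng.normal(60, 8, 88), rng.normal(59, 8, 91)
uid_native, uid_nonnative = rng.normal(1.0, 0.2, 88), rng.normal(0.7, 0.2, 91)

# Smaller KS statistics indicate closer distributions (reported: 0.33 vs. 0.70).
ks_dat, p_dat = ks_2samp(dat_native, dat_nonnative)
ks_uid, p_uid = ks_2samp(uid_native, uid_nonnative)
print(f"DAT(Sent): KS = {ks_dat:.2f}; UID(Var): KS = {ks_uid:.2f}")
```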
Table 2.1 Evaluating GPT-DATector on the additional test set (fairness-related metrics).

Training samples              Model                                 DI     EOD     %HumanMisclassifiedAsAI (underprivileged / privileged)
HW(GPT3.5)                    GPT-who (baseline)                    0.58   -0.43   0.43 / 0.0
HW(GPT3.5)                    GPT-DATector with all features (2)    1.00   -0.02   0.02 / 0.0
HW(GPT3.5) + ArguTOEFL(GPT)   GPT-who (baseline)                    0.69   -0.32   0.33 / 0.0
HW(GPT3.5) + ArguTOEFL(GPT)   GPT-DATector with all features        0.85   -0.17   0.17 / 0.0

Note: (1) "Underprivileged" refers to essays written by non-native English speakers, and "privileged" refers to essays written by native English speakers. (2) "All features" denotes the full set of proposed features, including the four sentence-level semantic dissimilarities (DAT(Sent), Var(Sent), Diff(Sent), and Diff2(Sent)) and the word-level semantic dissimilarity DAT(Word).

Lastly, to assess whether GPT-DATector can improve fairness, we train the model on two argumentative datasets, HW(GPT3.5) and ArguTOEFL(GPT),25 and subsequently apply it to the additional dataset of Liang et al. (2023) to classify essays as AI- or human-generated. Table 2.1 reports the results, suggesting the model's effectiveness in enhancing fairness. When training on one argumentative dataset, HW(GPT3.5) only, we find that GPT-DATector with the full set of proposed features improves DI from 0.58 to 1.00 (where 1 indicates equal benefit) and moves EOD from -0.43 to -0.02 (where 0 indicates equal benefit), compared to the baseline model (GPT-who). Similar findings are observed when the model is trained on the two argumentative datasets together, HW(GPT3.5) and ArguTOEFL(GPT).

25 We do this because the additional dataset of Liang et al. (2023) involves argumentative writing tasks, rather than prompted story generation.

Therefore, our proposed GPT-DATector, incorporating sentence- and word-level semantic dissimilarity, effectively promotes fairness between native and non-native English writers by reducing bias against the latter. These results demonstrate that GPT-DATector has significant potential for reducing bias within educational settings such as college admissions.

2.5 Discussion and Conclusion

2.5.1 Findings

GenAI is undergoing rapid evolution, with the continuous emergence of newly developed detection tools. Aligned with the idea of using LLMs to combat LLMs (Verspoor, 2024) and with the divergent-thinking creativity literature (Olson et al., 2021), this study proposes DAT(Sent) as a metric to proxy semantic dissimilarities within a text. We show that, on average, human-generated content has a larger DAT(Sent) than AI-generated texts across different writing tasks and datasets. Moreover, we design a GenAI detector, GPT-DATector, that incorporates a set of features capturing sentence-level and word-level semantic dissimilarity. Empirical validations demonstrate that our proposed GPT-DATector outperforms state-of-the-art models like GPTZero and GPT-who in terms of predictive performance. Most importantly, we find that GPT-DATector can reduce bias against non-native English speakers, as evidenced by its application to an additional dataset.

2.5.2 Contribution

Our work makes several contributions, as described below.
First, our study contributes to the computational linguistics literature by advancing methods to differentiate between GenAI-generated and human-generated texts, integrating insights from divergent-thinking studies. While prior research has predominantly studied metrics such as perplexity (i.e., a measure of uncertainty in predicting word occurrences within a model) (Wallach et al., 2009; Beresneva, 2016), our study aligns with this stream of literature by introducing a novel set of DAT-related metrics leveraging state-of-the-art embedding techniques. We demonstrate that the distribution of DAT(Sent) effectively distinguishes AI-generated texts from human-written ones, providing deeper insights into the linguistic characteristics of GenAI outputs and complementing emerging research on using LLMs to address challenges posed by LLMs (Verspoor, 2024). Additionally, unlike previous approaches that evaluate sentence- or document-level similarity by comparing entire documents to a median or central reference (Doshi and Hauser, 2024; Zhou and Lee, 2024), our proposed metrics can capture the degree of semantic repetition within a document, offering a more robust and insightful evaluation of LLM-generated texts.

Moreover, our study makes a significant contribution to the GenAI detection literature by introducing GPT-DATector, a framework grounded in fundamental distinctions between human- and AI-generated content. While GenAI's capabilities are revolutionary, they also pose significant risks, such as deepfakes, synthetic identities, and highly precise misinformation, highlighting their dual-edged nature (Ferrara, 2024). We demonstrate that the proposed two-stage detection model, GPT-DATector, not only achieves higher accuracy but also mitigates bias compared to existing state-of-the-art models.

Lastly, the proposed two-stage framework has broad applicability, as it can be integrated with a range of state-of-the-art GPT detection methods: the first-stage model can be substituted with any existing GenAI detector, whether black-box or open-source. Therefore, our proposed GPT-DATector offers substantial practical implications by providing a reliable tool for identifying the origin of texts, whether human-written, AI-generated, or a combination of both.

2.5.3 Limitations and Future Work

While our dataset encompasses a range of educational writing tasks such as argumentative and creative writing, it remains limited in scope. Future research could enhance generalizability by incorporating additional writing tasks that are prevalent in educational contexts, such as reflective essays and explanatory writing. Furthermore, improving the effectiveness and reliability of GPT detectors necessitates a comprehensive understanding of human writing styles. This, in turn, requires a sufficiently large and diverse set of human-written samples to ensure the validity of comparisons. Thus, future work should also consider strategies to enhance the representativeness and quality of human-written texts, as this is essential for accurately evaluating and strengthening the generalizability of GPT detection models.

Another limitation of this work is the absence of extensive prompt engineering, which has been shown to significantly influence the quality of GenAI outputs (Nori et al., 2023; Zamfirescu-Pereira et al., 2023).
Our design employed a single prompt, reflecting a typical real-world scenario in which non-expert users interact with LLMs and expect coherent, high-quality responses without iterative refinement (Zamfirescu-Pereira et al., 2023; Sun et al., 2025). Repeated prompting or interactive refinement may produce more human-like text, thereby increasing the difficulty for GenAI detection models. Future research should explore how varying levels of prompt engineering affect both the quality of AI-generated content and the robustness of GPT detection models.

As GenAIs continue to evolve, future LLMs are expected to exhibit significantly enhanced capabilities in interpreting user intent and producing more sophisticated outputs (Acar et al., 2024). Consequently, the effectiveness of GPT detectors may vary depending on the specific model generating the text. The increased linguistic and contextual fluency of newer models poses challenges for detection approaches calibrated to earlier-generation outputs. Therefore, future research should evaluate the performance of the proposed GPT-DATector using texts generated by more advanced LLMs.

2.5.4 Conclusion

In this study, we address the challenge of distinguishing AI-generated texts from those written by humans. Differing from prior work, we begin with a critical observation that is often overlooked: the repetition problem in GenAIs, wherein AI-generated responses contain semantically redundant segments across sentences (Fan, Lewis, and Dauphin, 2018; Holtzman et al., 2020; Fu et al., 2021). Motivated by this phenomenon, and informed by the idea of leveraging LLMs to detect LLM outputs (Verspoor, 2024) and by insights from divergent-thinking and creativity research (Olson et al., 2021), we develop a novel sentence-level semantic metric that quantifies the degree of semantic repetition across sentences. Empirically, we find robust evidence that human-written texts exhibit significantly greater sentence-level semantic dissimilarity than AI-generated ones. These findings reveal a fundamental structural divergence between human- and machine-written content. Importantly, our proposed GPT detector, GPT-DATector, not only improves the detection of AI-generated text but also mitigates bias against non-native English writers, which is a key concern in the fair evaluation of language quality and authorship.

The rapid advancement of GenAI has narrowed the gap between human and AI capabilities, making it increasingly difficult to distinguish between human- and AI-generated texts. However, beyond detection challenges, GenAI may also be reshaping human distinctiveness through continued interaction in the long run. Recent evidence suggests that while LLMs can temporarily enhance creativity, they may hinder users' independent creative thinking once users discontinue AI assistance (Kumar et al., 2025). Furthermore, reliance on GenAI can impair core human capabilities in SAT essay writing: users assisted by ChatGPT show the lowest brain engagement and underperform across neural, linguistic, and behavioral dimensions (Kosmyna et al., 2025). As human writing converges with AI output, the foundational assumptions of AI-detection models that rely on divergence between humans and AI may no longer hold. Therefore, beyond developing the most effective GenAI detection models, it is essential to examine the evolving role of humans in the future of work, particularly in educational contexts,
so that educators can design information systems that preserve and strengthen human cognitive and creative capabilities rather than diminishing them through overreliance on AI.

BIBLIOGRAPHY

Acar, O.A., A. Tuncdogan, D. van Knippenberg, and K.R. Lakhani. 2024. "Collective creativity and innovation: An interdisciplinary review, integration, and research agenda." Journal of Management 50:2119–2151.
Amabile, T.M. 2018. Creativity in context: Update to the social psychology of creativity. Routledge.
Anthony, C., B.A. Bechky, and A.L. Fayard. 2023. ""Collaborating" with AI: Taking a system view to explore the future of work." Organization Science 34:1672–1694.
Arnold, K.C., K. Chauncey, and K.Z. Gajos. 2020. "Predictive text encourages predictable writing." In Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 128–138.
Arriagada, L. 2020. "CG-Art: Demystifying the anthropocentric bias of artistic creativity." Connection Science 32:398–405.
Bauer, K., M. von Zahn, and O. Hinz. 2023. "Expl(AI)ned: The impact of explainable artificial intelligence on users' information processing." Information Systems Research 34:1582–1602.
Beaty, R.E., and D.R. Johnson. 2021. "Automating creativity assessment with SemDis: An open platform for computing semantic distance." Behavior Research Methods 53:757–780.
Bellemare-Pepin, A., F. Lespinasse, P. Thölke, Y. Harel, K. Mathewson, J.A. Olson, Y. Bengio, and K. Jerbi. 2024. "Divergent Creativity in Humans and Large Language Models." arXiv preprint arXiv:2405.13012.
Benedek, M., M. Karstendiek, S.M. Ceh, R.H. Grabner, G. Krammer, I. Lebuda, P.J. Silvia, K.N. Cotter, Y. Li, W. Hu, et al. 2021. "Creativity myths: Prevalence and correlates of misconceptions on creativity." Personality and Individual Differences 182:111068.
Beresneva, D. 2016. "Computer-generated text detection using machine learning: A systematic review." In Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings 21. Springer, pp. 421–426.
Bhat, A., S. Agashe, P. Oberoi, N. Mohile, R. Jangir, and A. Joshi. 2023. "Interacting with next-phrase suggestions: How suggestion systems aid and influence the cognitive processes of writing." In Proceedings of the 28th International Conference on Intelligent User Interfaces, pp. 436–452.
Blanchard, D., J. Tetreault, D. Higgins, A. Cahill, and M. Chodorow. 2013. "TOEFL11: A corpus of non-native English." ETS Research Report Series 2013:i–15.
Boussioux, L., J.N. Lane, M. Zhang, V. Jacimovic, and K.R. Lakhani. 2024. "The crowdless future? Generative AI and creative problem-solving." Organization Science 35:1589–1607.
Brassil, J.T., S. Low, N.F. Maxemchuk, and L. O'Gorman. 1995. "Electronic marking and identification techniques to discourage document copying." IEEE Journal on Selected Areas in Communications 13:1495–1504.
Brown, T., B. Mann, N. Ryder, M. Subbiah, J.D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. 2020. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33:1877–1901.
Brynjolfsson, E. 2023. "The Turing trap: The promise & peril of human-like artificial intelligence." In Augmented Education in the Global Age. Routledge, pp. 103–116.
Brynjolfsson, E., D. Li, and L. Raymond. 2025. "Generative AI at work." The Quarterly Journal of Economics, qjae044.
Bubeck, S., V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y.T. Lee, Y. Li, S. Lundberg, et al. 2023. "Sparks of artificial general intelligence: Early experiments with GPT-4." arXiv preprint arXiv:2303.12712.
Bunjak, A., M. Černe, and A. Popovič. 2021. "Absorbed in technology but digitally overloaded: Interplay effects on gig workers' burnout and creativity." Information & Management 58:103533.
Burton, J.W., M.K. Stein, and T.B. Jensen. 2020. "A systematic review of algorithm aversion in augmented decision making." Journal of Behavioral Decision Making 33:220–239.
Buschek, D., M. Zürn, and M. Eiband. 2021. "The impact of multiple parallel phrase suggestions on email input and composition behaviour of native and non-native English writers." In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–13.
Castelo, N., M.W. Bos, and D.R. Lehmann. 2019. "Task-dependent algorithm aversion." Journal of Marketing Research 56:809–825.
Chakrabarty, T., P. Laban, D. Agarwal, S. Muresan, and C.S. Wu. 2024. "Art or artifice? Large language models and the false promise of creativity." In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–34.
Chen, Z., and J. Chan. 2024. "Large language model in creative work: The role of collaboration modality and user expertise." Management Science 70:9101–9117.
Clark, E., A.S. Ross, C. Tan, Y. Ji, and N.A. Smith. 2018. "Creative writing with a machine in the loop: Case studies on slogans and stories." In 23rd International Conference on Intelligent User Interfaces, pp. 329–340.
Clerwall, C. 2017. "Enter the robot journalist: Users' perceptions of automated content." In The Future of Journalism: In an Age of Digital Media and Economic Uncertainty. Routledge, pp. 165–177.
Commerford, B.P., S.A. Dennis, J.R. Joe, and J.W. Ulla. 2022. "Man versus machine: Complex estimates and auditor reliance on artificial intelligence." Journal of Accounting Research 60:171–201.
Crothers, E., N. Japkowicz, H. Viktor, and P. Branco. 2022. "Adversarial robustness of neural-statistical features in detection of generative transformers." In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1–8.
Crothers, E., N. Japkowicz, and H.L. Viktor. 2023. "Machine-generated text: A comprehensive survey of threat models and detection methods." IEEE Access.
Crothers, E., H. Viktor, and N. Japkowicz. 2023. "In BLOOM: Creativity and Affinity in Artificial Lyrics and Art." arXiv preprint arXiv:2301.05402.
Dai, L., J. Mao, X. Fan, and X. Zhou. 2022. "Deephider: A multi-module and invisibility watermarking scheme for language model." arXiv preprint arXiv:2208.04676.
Damon, W., and E. Phelps. 1989. "Critical distinctions among three approaches to peer education." International Journal of Educational Research 13:9–19.
Dathathri, S., A. See, S. Ghaisas, P.S. Huang, R. McAdam, J. Welbl, V. Bachani, A. Kaskasoli, R. Stanforth, T. Matejovicova, et al. 2024. "Scalable watermarking for identifying large language model outputs." Nature 634:818–823.
Desaire, H., A.E. Chua, M.G. Kim, and D. Hua. 2023. "Accurately detecting AI text when ChatGPT is told to write like a chemist." Cell Reports Physical Science 4.
Dietvorst, B.J., J.P. Simmons, and C. Massey. 2015. "Algorithm aversion: People erroneously avoid algorithms after seeing them err." Journal of Experimental Psychology: General 144:114.
—. 2018. "Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them." Management Science 64:1155–1170.
Doshi, A.R., and O.P. Hauser. 2024. "Generative AI enhances individual creativity but reduces the collective diversity of novel content." Science Advances 10:eadn5290.
D'Souza, R. 2021. "What characterises creativity in narrative writing, and how do we assess it? Research findings from a systematic literature search." Thinking Skills and Creativity 42:100949.
Eining, M.M., D.R. Jones, J.K. Loebbecke, et al. 1997. "Reliance on decision aids: An examination of auditors' assessment of management fraud." Auditing: A Journal of Practice & Theory 16.
Emig, J. 1971. "The composing processes of twelfth graders." National Council of Teachers of English research report.
Fan, A., M. Lewis, and Y. Dauphin. 2018. "Hierarchical neural story generation." arXiv preprint arXiv:1805.04833.
Ferrara, E. 2024. "GenAI against humanity: Nefarious applications of generative artificial intelligence and large language models." Journal of Computational Social Science 7:549–569.
Frank, A.F., and T.F. Jaeger. 2008. "Speaking rationally: Uniform information density as an optimal strategy for language production." In Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 30.
Fröhling, L., and A. Zubiaga. 2021. "Feature-based detection of automated language models: Tackling GPT-2, GPT-3 and Grover." PeerJ Computer Science 7:e443.
Fu, Z., W. Lam, A.M.C. So, and B. Shi. 2021. "A theoretical analysis of the repetition problem in text generation." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 12848–12856.
Fügener, A., J. Grahl, A. Gupta, and W. Ketter. 2022. "Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation." Information Systems Research 33:678–696.
Ge, R., Z. Zheng, X. Tian, and L. Liao. 2021. "Human–robot interaction: When investors adjust the usage of robo-advisors in peer-to-peer lending." Information Systems Research 32:774–785.
Gehrmann, S., H. Strobelt, and A.M. Rush. 2019. "GLTR: Statistical detection and visualization of generated text." arXiv preprint arXiv:1906.04043.
Ghaffari, S., B. Yousefimehr, and M. Ghatee. 2024. "Generative-AI in E-Commerce: Use-Cases and Implementations." In 2024 20th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP). IEEE, pp. 1–5.
Gnewuch, U., S. Morana, O. Hinz, R. Kellner, and A. Maedche. 2024. "More than a bot? The impact of disclosing human involvement on customer interactions with hybrid service agents." Information Systems Research 35:936–955.
Goffman, E. 2017. Interaction ritual: Essays in face-to-face behavior. Routledge.
Goncalo, J.A., F.J. Flynn, and S.H. Kim. 2010. "Are two narcissists better than one? The link between narcissism, perceived creativity, and creative performance." Personality and Social Psychology Bulletin 36:1484–1495.
Guilford, J.P. 1964. "Some new looks at the nature of creative processes." Contributions to Mathematical Psychology. New York: Holt, Rinehart & Winston.
Haase, J., and P.H. Hanel. 2023. "Artificial muses: Generative artificial intelligence chatbots have risen to human-level creativity." Journal of Creativity 33:100066.
Hafenbrädl, S., D. Waeger, J.N. Marewski, and G. Gigerenzer. 2016. "Applied decision making with fast-and-frugal heuristics." Journal of Applied Research in Memory and Cognition 5:215–231.
Hale, J. 2001. "A probabilistic Earley parser as a psycholinguistic model." In Second Meeting of the North American Chapter of the Association for Computational Linguistics.
Holtzman, A., J. Buys, L. Du, M. Forbes, and Y. Choi. 2020. "The curious case of neural text degeneration." International Conference on Learning Representations (ICLR).
Jakesch, M., A. Bhat, D. Buschek, L. Zalmanson, and M. Naaman. 2023. "Co-writing with opinionated language models affects users' views." In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–15.
Jakesch, M., J.T. Hancock, and M. Naaman. 2023. "Human heuristics for AI-generated language are flawed." Proceedings of the National Academy of Sciences 120:e2208839120.
Jarrahi, M.H. 2018. "Artificial intelligence and the future of work: Human-AI symbiosis in organizational decision making." Business Horizons 61:577–586.
Jarvis, S. 2002. "Short texts, best-fitting curves and new measures of lexical diversity." Language Testing 19:57–84.
Kane, G.C., A.G. Young, A. Majchrzak, and S. Ransbotham. 2021. "Avoiding an oppressive future of machine learning: A design theory for emancipatory assistants." MIS Quarterly 45:371–396.
Kankanhalli, M.S., and K. Hau. 2002. "Watermarking of electronic text documents." Electronic Commerce Research 2:169–187.
Katz, D.M., M.J. Bommarito, S. Gao, and P. Arredondo. 2024. "GPT-4 passes the bar exam." Philosophical Transactions of the Royal Society A 382:20230254.
Kirchenbauer, J., J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein. 2023. "A watermark for large language models." In International Conference on Machine Learning. PMLR, pp. 17061–17084.
Kleinmuntz, B. 1990. "Why we still use our heads instead of formulas: Toward an integrative approach." Psychological Bulletin 107:296.
Köbis, N., and L.D. Mossink. 2021. "Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry." Computers in Human Behavior 114:106553.
Koltovskaia, S. 2020. "Student engagement with automated written corrective feedback (AWCF) provided by Grammarly: A multiple case study." Assessing Writing 44:100450.
Kosmyna, N., E. Hauptmann, Y.T. Yuan, J. Situ, X.H. Liao, A.V. Beresnitzky, I. Braunstein, and P. Maes. 2025. "Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task." arXiv preprint arXiv:2506.08872.
Krishna, K., Y. Song, M. Karpinska, J. Wieting, and M. Iyyer. 2024. "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense." Advances in Neural Information Processing Systems 36.
Kumar, H., J. Vincentius, E. Jordan, and A. Anderson. 2025. "Human creativity in the age of LLMs: Randomized experiments on divergent and convergent thinking." In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–18.
Landis, J.R., and G.G. Koch. 1977. "The measurement of observer agreement for categorical data." Biometrics, pp. 159–174.
Laufer, B., and P. Nation. 1995. "Vocabulary size and use: Lexical richness in L2 written production." Applied Linguistics 16:307–322.
Lavergne, T., T. Urvoy, and F. Yvon. 2008. "Detecting Fake Content with Relative Entropy Scoring." PAN 8:4.
Le, Q., and T. Mikolov. 2014. "Distributed representations of sentences and documents." In International Conference on Machine Learning. PMLR, pp. 1188–1196.
Lebovitz, S., H. Lifshitz-Assaf, and N. Levina. 2022. "To engage or not to engage with AI for critical judgments: How professionals deal with opacity when using AI for medical diagnosis." Organization Science 33:126–148.
Lee, M., P. Liang, and Q. Yang. 2022. "CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities." In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–19.
Lee, M., M. Srivastava, A. Hardy, J. Thickstun, E. Durmus, A. Paranjape, I. Gerard-Ursin, X.L. Li, F. Ladhak, F. Rong, et al. 2023. "Evaluating human-language model interaction." Transactions on Machine Learning Research.
Li, N., H. Zhou, W. Deng, J. Liu, F. Liu, and K. Mikel-Hong. 2024. "When Advanced AI Isn't Enough: Human Factors as Drivers of Success in Generative AI-Human Collaborations." Available at SSRN 4738829.
Li, Y., Q. Li, L. Cui, W. Bi, Z. Wang, L. Wang, L. Yang, S. Shi, and Y. Zhang. 2023. "MAGE: Machine-generated Text Detection in the Wild." arXiv e-prints, arXiv:2305.
Liang, W., M. Yuksekgonul, Y. Mao, E. Wu, and J. Zou. 2023. "GPT detectors are biased against non-native English writers." Patterns 4.
Liu, Y., Z. Zhang, W. Zhang, S. Yue, X. Zhao, X. Cheng, Y. Zhang, and H. Hu. 2023. "ArguGPT: Evaluating, understanding and identifying argumentative essays generated by GPT models." arXiv preprint arXiv:2304.07666.
Lockhart, E.N. 2024. "Creativity in the age of AI: The human condition and the limits of machine generation." Journal of Cultural Cognitive Science, pp. 1–6.
Logg, J.M., J.A. Minson, and D.A. Moore. 2019. "Algorithm appreciation: People prefer algorithmic to human judgment." Organizational Behavior and Human Decision Processes 151:90–103.
Lu, X. 2011. "A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers' language development." TESOL Quarterly 45:36–62.
Lu, Y., X. Luo, L. Huang, and D. Wang. 2025. "Can Providing Algorithmic Performance Information Facilitate Humans' Inventory Ordering Behaviors?" Information Systems Research.
Ma, K., D. Grandi, C. McComb, and K. Goucher-Lambert. 2024. "Exploring the Capabilities of Large Language Models for Generating Diverse Design Solutions." arXiv preprint arXiv:2405.02345.
Ma, Y., J. Liu, F. Yi, Q. Cheng, Y. Huang, W. Lu, and X. Liu. 2023. "AI vs. human: Differentiation analysis of scientific content generation." arXiv 2301.
Mahowald, K., E. Fedorenko, S.T. Piantadosi, and E. Gibson. 2013. "Info/information theory: Speakers choose shorter words in predictive contexts." Cognition 126:313–318.
Majchrzak, A., and M.L. Markus. 2012. "Technology affordances and constraints in management information systems (MIS)." Encyclopedia of Management Theory (Ed.: E. Kessler), Sage Publications, Forthcoming.
Markus, M.L., and M.S. Silver. 2008. "A foundation for the study of IT effects: A new look at DeSanctis and Poole's concepts of structural features and spirit." Journal of the Association for Information Systems 9:5.
Martínez, G., J.A. Hernández, J. Conde, P. Reviriego, and E. Merino. 2024. "Beware of Words: Evaluating the Lexical Richness of Conversational Large Language Models." arXiv preprint arXiv:2402.15518.
May, R. 1994. The courage to create. W.W. Norton & Company.
McKee, R. 1997. "Substance, structure, style, and the principles of screenwriting." Alba Editorial.
Millet, K., F. Buehler, G. Du, and M.D. Kokkoris. 2023. "Defending humankind: Anthropocentric bias in the appreciation of AI art." Computers in Human Behavior 143:107707.
Minaee, S., T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao. 2024. "Large language models: A survey." arXiv preprint arXiv:2402.06196.
Mirowski, P., K.W. Mathewson, J. Pittman, and R. Evans. 2023. "Co-writing screenplays and theatre scripts with language models: Evaluation by industry professionals." In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–34.
Mitchell, E., Y. Lee, A. Khazatsky, C.D. Manning, and C. Finn. 2023. "DetectGPT: Zero-shot machine-generated text detection using probability curvature." In International Conference on Machine Learning. PMLR, pp. 24950–24962.
Montazemi, A.R. 1991. "The impact of experience on the design of user interface." International Journal of Man-Machine Studies 34:731–749.
Morewedge, C.K. 2022. "Preference for human, not algorithm aversion." Trends in Cognitive Sciences 26:824–826.
Nori, H., Y.T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu, et al. 2023. "Can generalist foundation models outcompete special-purpose tuning? Case study in medicine." arXiv preprint arXiv:2311.16452.
Noy, S., and W. Zhang. 2023. "Experimental evidence on the productivity effects of generative artificial intelligence." Science 381:187–192.
Olson, J.A., J. Nahas, D. Chmoulevitch, S.J. Cropper, and M.E. Webb. 2021. "Naming unrelated words predicts creativity." Proceedings of the National Academy of Sciences 118:e2022340118.
Orbell, J., and R.M. Dawes. 1991. "A "cognitive miser" theory of cooperators advantage." American Political Science Review 85:515–528.
Ornes, S. 2019. "Computers take art in new directions, challenging the meaning of "creativity"." Proceedings of the National Academy of Sciences 116:4760–4763.
Peeperkorn, M., T. Kouwenhoven, D. Brown, and A. Jordanous. 2024. "Is temperature the creativity parameter of large language models?" arXiv preprint arXiv:2405.00492.
Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. 2019. "Language models are unsupervised multitask learners." OpenAI Blog 1:9.
Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P.J. Liu. 2020. "Exploring the limits of transfer learning with a unified text-to-text transformer." The Journal of Machine Learning Research 21:5485–5551.
Ragot, M., N. Martin, and S. Cojean. 2020. "AI-generated vs. human artworks. A perception bias towards artificial intelligence?" In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–10.
Revilla, E., M.J. Saenz, M. Seifert, and Y. Ma. 2023. "Human–artificial intelligence collaboration in prediction: A field experiment in the retail industry." Journal of Management Information Systems 40:1071–1098.
Roemmele, M., and A.S. Gordon. 2018. "Automated assistance for creative writing with an RNN language model." In Companion Proceedings of the 23rd International Conference on Intelligent User Interfaces, pp. 1–2.
Rogers, E.M., A. Singhal, and M.M. Quinlan. 2014. "Diffusion of innovations." In An Integrated Approach to Communication Theory and Research. Routledge, pp. 432–448.
Runco, M.A., and S. Acar. 2012. "Divergent thinking as an indicator of creative potential." Creativity Research Journal 24:66–75.
Sadasivan, V.S., A. Kumar, S. Balasubramanian, W. Wang, and S. Feizi. 2023. "Can AI-generated text be reliably detected?" arXiv preprint arXiv:2303.11156.
See, A., A. Pappu, R. Saxena, A. Yerukola, and C.D. Manning. 2019. "Do massively pretrained language models make better storytellers?" arXiv preprint arXiv:1909.10705.
Shaikh, M., and E. Vaast. 2023. "Algorithmic interactions in open source work." Information Systems Research 34:744–765.
Sharma, A., I.W. Lin, A.S. Miner, D.C. Atkins, and T. Althoff. 2023. "Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support." Nature Machine Intelligence 5:46–57.
Sharples, M. 2002. How we write: Writing as creative design. Routledge.
Shen, Z., W. Jiang, and Z. Zheng. 2025. "Irrationality-Aware Human Machine Collaboration: Mitigating Alterfactual Irrationality in Copy Trading." Information Systems Research, Ahead of Print.
Sieck, W.R., and H.R. Arkes. 2005. "The recalcitrance of overconfidence and its contribution to decision aid neglect." Journal of Behavioral Decision Making 18:29–53.
Siegert, I., R. Böck, and A. Wendemuth. 2014. "Inter-rater reliability for emotion annotation in human–computer interaction: Comparison and methodological improvements." Journal on Multimodal User Interfaces 8:17–28.
Solaiman, I., M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J.W. Kim, S. Kreps, et al. 2019. "Release strategies and the social impacts of language models." arXiv preprint arXiv:1908.09203.
Stokel-Walker, C. 2022. "AI bot ChatGPT writes smart essays: Should academics worry?" Nature.
Storch, N. 2002. "Patterns of interaction in ESL pair work." Language Learning 52:119–158.
Summers-Stay, D., C.R. Voss, and S.M. Lukin. 2023. "Brainstorm, then select: A generative language model improves its creativity score." In The AAAI-23 Workshop on Creative AI Across Modalities.
Sun, L., Y. Yuan, Y. Yao, Y. Li, H. Zhang, X. Xie, X. Wang, F. Luo, and D. Stillwell. 2024. "Large Language Models show both individual and collective creativity comparable to humans." arXiv preprint arXiv:2412.03151.
—. 2025. "Large Language Models show both individual and collective creativity comparable to humans." Thinking Skills and Creativity, 101870.
Susarla, A., R. Gopal, J.B. Thatcher, and S. Sarker. 2023. "The Janus effect of generative AI: Charting the path for responsible conduct of scholarly activities in information systems." Information Systems Research 34:399–408.
Susnjak, T., and T.R. McIntosh. 2024. "ChatGPT: The end of online exam integrity?" Education Sciences 14:656.
Tian, E., and A. Cui. 2023. "GPTZero: Towards detection of AI-generated text using zero-shot and supervised methods."
Turel, O., and S. Kalhan. 2023. "Prejudiced against the Machine? Implicit Associations and the Transience of Algorithm Aversion." MIS Quarterly 47.
Tweedie, F.J., and R.H. Baayen. 1998. "How variable may a constant be? Measures of lexical richness in perspective." Computers and the Humanities 32:323–352.
Ueoka, H., Y. Murawaki, and S. Kurohashi. 2021. "Frustratingly easy edit-based linguistic steganography with a masked language model." arXiv preprint arXiv:2104.09833.
Vaccaro, M., A. Almaatouq, and T. Malone. 2024. "When combinations of humans and AI are useful: A systematic review and meta-analysis." Nature Human Behaviour, pp. 1–11.
Van Hout, R., and A. Vermeer. 2007. "Comparing measures of lexical richness." Modelling and Assessing Vocabulary Knowledge 93:115.
Venkatraman, S., A. Uchendu, and D. Lee. 2024. “GPT-who: An information density-based machine-generated text detector.” Findings of the Association for Computational Linguistics: NAACL.
Verspoor, K. 2024. ““Fighting fire with fire” – using LLMs to combat LLM hallucinations.” Nature.
Wallach, H.M., I. Murray, R. Salakhutdinov, and D. Mimno. 2009. “Evaluation methods for topic models.” In Proceedings of the 26th Annual International Conference on Machine Learning. pp. 1105–1112.
Wan, Q., S. Hu, Y. Zhang, P. Wang, B. Wen, and Z. Lu. 2024. ““It Felt Like Having a Second Mind”: Investigating Human-AI Co-creativity in Prewriting with Large Language Models.” Proceedings of the ACM on Human-Computer Interaction 8:1–26.
Wang, L., M.I. Mujib, J. Williams, G. Demiris, and J. Huh-Yoo. 2021. “An evaluation of generative pre-training model-based therapy chatbot for caregivers.” arXiv preprint arXiv:2107.13115.
Wang, W., M. Yang, and T. Sun. 2023. “Human-AI Co-Creation in Product Ideation: The Dual View of Quality and Diversity.” Available at SSRN 4668241.
Welleck, S., I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. 2019. “Neural text generation with unlikelihood training.” In International Conference on Learning Representations (ICLR).
Whitecotton, S.M. 1996. “The effects of experience and confidence on decision aid reliance: A causal model.” Behavioral Research in Accounting 8:194–216.
Wooldridge, J.M. 2003. “Cluster-sample methods in applied econometrics.” American Economic Review 93:133–138.
Wu, T., M. Terry, and C.J. Cai. 2022. “AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts.” In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. pp. 1–22.
Wu, Z., D. Ji, K. Yu, X. Zeng, D. Wu, and M. Shidujaman. 2021. “AI creativity and the human-AI co-creation model.” In Human-Computer Interaction. Theory, Methods and Tools: Thematic Area, HCI 2021, Held as Part of the 23rd HCI International Conference, HCII 2021, Virtual Event, July 24–29, 2021, Proceedings, Part I 23. Springer, pp. 171–190.
Xu, Y., and D. Reitter. 2018. “Information density converges in dialogue: Towards an information-theoretic model.” Cognition 170:147–163.
Yang, D., Y. Zhou, Z. Zhang, T.J.J. Li, and R. LC. 2022. “AI as an Active Writer: Interaction strategies with generated text in human-AI collaborative fiction writing.” In Joint Proceedings of the ACM IUI Workshops. CEUR-WS Team, vol. 10, pp. 1–11.
Yang, J., H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu. 2023. “Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond.” ACM Transactions on Knowledge Discovery from Data.
Yeadon, W., O.O. Inyang, A. Mizouri, A. Peach, and C.P. Testrow. 2023. “The death of the short-form physics essay in the coming AI revolution.” Physics Education 58:035027.
Yin, Y., N. Jia, and C.J. Wakslak. 2024. “AI can help people feel heard, but an AI label diminishes this impact.” Proceedings of the National Academy of Sciences 121:e2319112121.
Yuan, A., A. Coenen, E. Reif, and D. Ippolito. 2022. “Wordcraft: story writing with large language models.” In 27th International Conference on Intelligent User Interfaces. pp. 841–852.
Zamfirescu-Pereira, J.D., R.Y. Wong, B. Hartmann, and Q. Yang. 2023. “Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts.” In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–21.
Zanzotto, F.M. 2019. “Human-in-the-loop artificial intelligence.” Journal of Artificial Intelligence Research 64:243–252.
Zhang, C., C. Zhang, C. Li, Y. Qiao, S. Zheng, S.K. Dam, M. Zhang, J.U. Kim, S.T. Kim, J. Choi, et al. 2023. “One small step for generative AI, one giant leap for AGI: A complete survey on ChatGPT in AIGC era.” arXiv preprint arXiv:2304.06488.
Zhang, Y., and R. Gosline. 2023. “Human favoritism, not AI aversion: People’s perceptions (and bias) toward generative AI, human experts, and human–GAI collaboration in persuasive content generation.” Judgment and Decision Making 18:e41.
Zhou, E., and D. Lee. 2024. “Generative artificial intelligence, human creativity, and art.” PNAS Nexus 3:pgae052.

APPENDIX

Figure A1 CoAUTHOR (Lee, Liang, and Yang, 2022), a dataset designed to reveal GPT-3’s generative capabilities for interactive writing. Each session starts with a prompt (black text). Writers then freely write (brown), request suggestions from GPT-3 (blue), accept or dismiss suggestions, and edit accepted suggestions or previous text in any order they choose.
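To make the session structure described in Figure A1 concrete, Listing A1 replays a simplified event log and measures how much of the resulting text was contributed by model suggestions. This is a minimal illustrative sketch only: the Event fields (author, kind, text) and the replay helper are hypothetical stand-ins, not the actual schema of the released CoAUTHOR logs, and suggestion edits are omitted for brevity.

Listing A1 A minimal Python sketch of the accept/dismiss session structure shown in Figure A1.

from dataclasses import dataclass

@dataclass
class Event:
    author: str  # "writer" or "gpt3" (illustrative labels, not the dataset's schema)
    kind: str    # "insert" (accepted text) or "dismiss" (rejected suggestion)
    text: str

def replay(prompt: str, events: list[Event]) -> tuple[str, float]:
    """Replay a session: append accepted text to the document and
    track the share of new characters contributed by the model."""
    doc, ai_chars = prompt, 0
    for e in events:
        if e.kind != "insert":
            continue  # dismissed suggestions never enter the document
        doc += e.text
        if e.author == "gpt3":
            ai_chars += len(e.text)
    new_chars = len(doc) - len(prompt)
    return doc, (ai_chars / new_chars if new_chars else 0.0)

# Example session: one writer sentence, one accepted and one dismissed suggestion.
session = [
    Event("writer", "insert", " She never suspected a thing."),
    Event("gpt3", "insert", " Every date felt strangely familiar."),
    Event("gpt3", "dismiss", " He smiled the same smile."),
]
final_text, ai_share = replay("A woman has been dating guy after guy.", session)
print(f"AI share of newly written text: {ai_share:.0%}")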
Table A1 Ten prompts retrieved from the WritingPrompts subreddit and used with minor modifications. Each entry gives the prompt code, followed by the prompt text and its source URL.

shapeshifter: A woman has been dating guy after guy, but it never seems to work out. She’s unaware that she’s actually been dating the same guy over and over; a shapeshifter who’s fallen for her, and is certain he’s going to get it right this time. (https://www.reddit.com/r/WritingPrompts/comments/7xihva/wp_a_woman_has_been_dating_guy_after_guy_but_it/)

reincarnation: When you die, you appear in a cinema with a number of other people who look like you. You find out that they are your previous reincarnations, and soon you all begin watching your next life on the big screen. (https://www.reddit.com/r/WritingPrompts/comments/7ezd5t/wp_when_you_die_you_appear_in_a_cinema_with_a/)

mana: Humans once wielded formidable magical power. But with over 7 billion of us on the planet now, Mana has spread far too thinly to have any effect. When hostile aliens reduce humanity to a mere fraction, the survivors discover an old power has begun to reawaken once again. (https://www.reddit.com/r/WritingPrompts/comments/7i3bs6/wp_humans_once_wielded_formidable_magical_power/)

obama: You’re Barack Obama. 4 years into your retirement, you awake to find a letter with no return address on your bedside table. It reads “I hope you’ve had a chance to relax Barack... but pack your bags and call the number below. It’s time to start the real job.” Signed simply, “JFK.” (https://www.reddit.com/r/WritingPrompts/comments/6b3rmg/wp_youre_barack_obama_4_months_into_your/)

pig: Once upon a time there was an old mother pig who had one hundred little pigs and not enough food to feed them. So when they were old enough, she sent them out into the world to seek their fortunes. You know the story about the first three little pigs. This is a story about the 92nd little pig. The 92nd little pig built a house out of depleted uranium. And the wolf was like, “dude.” (https://www.reddit.com/r/WritingPrompts/comments/hytfcd/wp_then_the_92nd_little_pig_built_a_house_out_of/)

mattdamon: An alien has kidnapped Matt Damon, not knowing what lengths humanity goes through to retrieve him whenever he goes missing. (https://www.reddit.com/r/WritingPrompts/comments/8p3ora/wp_an_alien_has_kidnapped_matt_damon_not_knowing/)

sideefect: When you’re 28, science discovers a drug that stops all effects of aging, creating immortality. Your government decides to give the drug to all citizens under 26, but you and the rest of the “Lost Generations” are deemed too high-risk. When you’re 85, the side effects are finally discovered. (https://www.reddit.com/r/WritingPrompts/comments/8on59a/wp_when_youre_28_science_discovers_a_drug_that/)

bee: Your entire life, you’ve been told you’re deathly allergic to bees. You’ve always had people protecting you from them, be it your mother or a hired hand. Today, one slips through and lands on your shoulder. You hear a tiny voice say “Your Majesty, what are your orders?” (https://www.reddit.com/r/WritingPrompts/comments/88p6rp/wp_your_entire_life_youve_been_told_youre_deathly/)

dad: All of the “#1 Dad” mugs in the world change to show the actual ranking of Dads suddenly. (https://www.reddit.com/r/WritingPrompts/comments/6gl289/wp_all_of_the_1_dad_mugs_in_the_world_change_to/)

isolation: Following World War III, all the nations of the world agreed to 50 years of strict isolation from one another in order to prevent additional conflicts. 50 years later, the United States comes out of exile, only to learn that no one else went into isolation. (https://www.reddit.com/r/WritingPrompts/comments/585ru9/wp_following_world_war_iii_all_the_nations_of_the/)