u. if ,3. .2. . 1. i .2. I. 1.”. 1 .. ....V.. I .3334 :. , “3.31.. It . _ . P7. 1.: .. 7!... .. 1 Harman.” . 0.. . 3?: .V . fie. ,3: , 4... V . vim :(r: . FF.th 3.. .u In 1 ‘ u Xv”. in a... . ,4 in! ”A O. v.1} . ahfl A LIBRARI’ Michigan State University This is to certify that the dissertation entitled AN EXAMINATION OF THE PEER REVIEW PROCESS IN A LARGE RESEARCH ORGANIZATION presented by RICHARD P. BANGHART has been accepted towards fulfillment of the requirements for the PhD. degree in CEPSE 7%” Major Professor’s Signature Mag/ow Date MSU is an Affirmative Action/Equal Opportunity Institution -.-.--»--o-.--- -.-.-l—.--.----.—.-.-.—~—-— — ~ PLACE IN RETURN Box to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 2/05 p:/ClRC/Date0ue.indd-p.1 AN EXAMINATION OF THE PEER REVIEW PROCESS OF A LARGE RESEARCH ORGANIZATION BY Richard P. Banghart A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology, and Special Education 2006 ABSTRACT AN EXAMINATION OF THE PEER REVIEW PROCESS OF A LARGE RESEARCH ORGANIZATION BY Richard P. Banghart Peer review is the “gold standard” of knowledge, and is the process through which all scientific and scholarly publications pass. Research into peer review is difficult and rare as the process is normally performed in secret and privacy. This study was granted access to a large data set containing over 35,000 individual reviews collected over three years capturing the peer review practice of a large scholarly research organization. Through examination of this large data set, answers to questions about the underlying assumptions of peer review are sought. Do reviewers agree with one another? Do decision makers adhere to reviewers’ findings? Are these findings robust across the three years and the divisions of the research organization? DEDICATION I dedicate this work to my wife, Zara. For more years than should have been necessary, she provided the environment that allowed me to complete this work. iii ACKNOWLEDGMENTS I wish to acknowledge the support of my committee: Yong Zhao, Richard Houang, Bob Floden, and Ann Austin. Each has provided loving encouragement through the process. Many others have contributed their support and ideas. Mark Urban-Lurain spent many hours with me as I struggled with some of the statistical ideas. Ed Wolfe was generous with his time and knowledge. All who helped deserve credit for any good ideas that may appear, but all errors are mine alone. iv TABLE OF CONTENTS LIST OF TABLES ......................................... Vii LIST OF FIGURES ......................................... ix CHAPTER 1 BACKGROUND/PROBLEM .............................. l The Problem ......................................... 1 Significance of the Problem .......................... 2 CHAPTER 2 PREVIOUS STUDIES ON PEER REVIEW ................ 5 Unpacking Peer Review ................................ 9 Definition of Peer Review ...................... 9 Definition of Peer ............................. 10 Peer Review's "Built In" Problems .................. 12 CHAPTER 3 THE CURRENT STUDY .............................. 20 Methodology/Analytical Framework ................... 20 Data Description ................................... 21 Authors and Proposals ............................... 
22 Reviewers and Reviews ............................... 23 Divisions .......................................... 24 From Data to Measurement ........................... 28 Analytic Techniques Addressing Reviewer Agreement ..30 Analysis of Variance (ANOVA) and G Study ...... 30 Perfect Agreement ............................. 31 Harsh/Lenient Rater ........................... 32 Complete Disagreement .......................... 33 Apples with Multiple Criteria Resulting in a Single Latent Trait ........................ 36 G Study ....................................... 37 Monte Carlo Simulation ........................ 43 Marble Draw Example ........................... 44 Analytic Techniques Addressing Decision Making ..... 46 Editorial Choice .............................. 47 Cut Score ..................................... 50 Editorial Reach ............................... 53 Questions/Hypotheses ............................... 54 CHAPTER 4 FINDINGS ...................................... 59 The Broad Picture .................................. 59 Reviewer Summaries by Division and Year ............ 61 Proposal Summaries by Division and Year ............ 63 Author and Reviewer Characteristics ................ 71 The Criteria ....................................... 74 G Study ............................................ 81 Results of Monte Carlo Simulation .................. 82 Editorial Choice ................................... 89 Editorial Reach .................................... 91 Reviewer Characteristics ........................... 92 Author Characteristics ............................. 92 CHAPTER 5 DISCUSSION .................................... 94 Answers to Questions ............................... 94 Reviewer Agreement ............................ 94 Efficiency .................................... 97 Differences among Divisions and Years ......... 98 Decision Based on Score ....................... 99 Implications/Recommendations ....................... 99 Central Tendency .............................. 100 Halo Effect ................................... 101 Editorial Influence ........................... 102 Other Conceptions of Peer Review ............. 102 Limitations ....................................... 103 Softer Science ................................ 104 Proposals, Not Completed Works ............... 104 Future Research ................................... 105 Text Analysis ................................ 105 Editorial Influence .......................... 105 Ultimate Publication ......................... 106 Conclusion ........................................ 106 APPENDIX A Tables of Reviews per Proposal .............. 108 APPENDIX B Figures Showing Accept and Reject by Division and Year .................................... 110 REFERENCES ............................................. 128 vi Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table 8 9 10 ll 12 13 14 15 16 17 18 19 LIST OF TABLES Summary of Proposals, Reviews and Reviewers by Year .......................................... 27 Perfect Agreement ................................ 32 Harsh and Lenient Rating ........................ 33 Complete Disagreement among Raters ............... 34 Variance Components for Rating with Harsh and Lenient Raters ............................... 43 Example of Possible Proposal Scoring ............ 49 Summary of Proposals, Reviews and Reviewers by Year ......................................... 
60 Proposals per Reviewer .......................... 62 Proposals Submitted and Accepted ................ 63 Summary of Reviews per Proposal ................ 65 Reviewers per Division by Year ................. 67 Authors per Division by Year .................... 68 Cross Tabulation of Requested Format and Accepted Format ........................... 69 Proposal Review Criteria Statements and Anchors 71 Author Status Percentage across Divisions and Years ...................................... 72 Summary of Author Years ........................ 72 Summary of Reviewer Status ..................... 73 Summary of Reviewer Years ...................... 73 Ratings Associated with Each Agreement Index .78 vii Table Table Table Table Table Table 20 21 22 Al A2 A3 Z Scores for Each Mean/Index ... Editorial Choice by Division and Editorial Reach across Divisions Years .......................... Reviews per Proposal by Division Reviews per Proposal by Division Reviews per Proposal by Division viii ................ 89 Year .......... 90 and through ................ 91 for 2001 ...... 108 for 2002 ...... 109 for 2003 ..... 109 Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure LIST OF FIGURES 1. Accepted and rejected proposals. .............. 76 2. Frequency of each possible variance of three scores. ................................. 79 3. Mean versus variance of accepted proposals. ....80 4. Mean versus variance of rejected proposals ..... 81 5. Median mean/variance outcome of unweighted Monte Carlo simulation. ....................... 84 6. Random, weighted mean/variance distribution. ..85 7. Sample data mean and variance frequency. ...... 86 8. Histogram of frequency of 1,1,1 in simulated data compared with actual. .................... 87 9. Histogram of frequency of 3,3,4 in simulated data compared with actual. ..................... 88 A1. Accept and reject, Division 1, Year 1 ........ 110 A2. Accept and reject, Division 1, Year 2 ........ 110 A3. Accept and reject, Division 1, Year 3 ........ 111 A4. Accept and reject, Division 2, Year 1 ........ 111 A5. Accept and reject, Division 2, Year 2 ........ 112 A6. Accept and reject, Division 2, Year 3 ........ 112 A7. Accept and reject, Division 3, Year 1 ........ 113 A8. Accept and reject, Division 3, Year 2 ........ 113 A9. Accept and reject, Division 3, Year 3 ........ 114 A10 Accept and reject, Division 4, Year 1 ........ 114 A11 Accept and reject, Division 4, Year 2 ........ 115 ix Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure A12 A13 A14 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept Accept and and and and and and and and and and and and and and and and and and and and and and and reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, reject, Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division Division 11, 11, 11, 12, Year 3 ........ 115 Year 1 ........ 116 Year 2 ........ 116 Year 3 ........ 
117 Year 1 ........ 117 Year 2 ........ 118 Year 3 ........ 118 Year 1 ........ 119 Year 2 ........ 119 Year 3 ........ 120 Year 1 ........ 120 Year 2 ........ 121 Year 3 ........ 121 Year 1 ........ 122 Year 2 ........ 122 Year 3 ........ 123 Year 1 ....... 123 Year 2 ....... 124 Year 3 ....... 124 Year 1 ....... 125 Year 2 ....... 125 Year 3 ....... 126 Year 1 ....... 126 Figure A35 Accept and reject, Division 12, Year 2 ....... 127 Figure A36 Accept and reject, Division 12, Year 3 ....... 127 xi CHAPTER 1 BACKGROUND/PROBLEM The Problem Peer review is the process through which "all important advances in science" (Rennie, Drummond et a1. 1989) pass. It is the "primary institution responsible for processing and evaluating contributions to knowledge." (Lindsey, 1976). However, this “primary institution” has long been challenged for a number of faults. Problems of peer review include: prolonging the time to publication of important findings; expense in time and money and other resources; resistance to accepting innovative ideas; lack of civility in reviewer comments; bias toward accepting work that rejects the null hypothesis; bias toward accepting work from previously published authors (the Matthew effect); poor quality of reviews; theft of ideas by reviewers; and many others. Weller analyzed more than 200 studies on peer reviewing from more than 300 journals. She affirmed, "Peer review's outstanding weakness is that error of judgment, either unintentional or intentional, are sometimes made. Asking someone to volunteer personal time evaluating the work of another, possibly a competitor, by its very nature invites a host of potential problems, anywhere from holding a manuscript and not reviewing it to a careless review to fraudulent behavior." (Weller, 2002) The fundamental issue about peer review is its worthwhileness. In other words, given the dependence that the scholarly and scientific communities have on the peer review process, and its expense in time, money and other resources, it is important to ask: is the peer review process worthwhile? The issue of the worthwhileness of peer review can be viewed from two angles: effectiveness and efficiency. Effectiveness is the degree to which the peer review process allows “good” knowledge claims to be differentiated from “bad” knowledge claims. That is, peer review’s effectiveness is the degree to which we can truly trust it as an effective mechanism to advance knowledge. On the other hand, efficiency is the degree to which effectiveness is achieved at a minimum cost. Significance of the Problem Peer review is widely practiced. In all areas of science, (and, indeed, in other areas of scholarly pursuits) the very definition of acceptable knowledge includes its publication in a peer—reviewed journal. In virtually all areas of knowledge use and generation, peer review is the tool invoked to guarantee the validity of the knowledge. In fact, as research methods change and paradigms rise and fall, one thing consistent since the introduction of the scientific method is the use of the peer—review process. But peer review does much more than differentiate knowledge. Funding agencies use peer review to allocate research dollars. Especially since the rise of government funding of research after World War II, the use of peer review has been extended to determine not only what is to be published, but also what research is to be funded. The use of peer review to guide the allocation of resources for future funding adds a new function to peer review. 
Funding decisions are made in service to policy goals as well as scientific merit. While peer review was initially designed to ensure scientific quality, now the process is employed to ensure alignment with policy goals. This has had the result of causing the direction of research to be influenced by the peer review process in an even more direct way. (Chubin and Hackett, 1990) Peer review affects the quality of knowledge and directions of new research. A report issued by the National Research Council recommends peer review as “the best available mechanism for identifying and supporting high- quality research.” The report goes on to say that beyond its role in promoting high quality research, peer review serves the “development of a culture of rigorous inquiry in the field.” (National Research Council, 2004) Therefore, it is important to better understand the effectiveness and efficiency of peer review. This study sets out to do so by examining a large database of reviews. As such, this study is an opportunity to explore the effectiveness and efficiency of peer review as practiced in evaluating a large number of scholarly works. A large research organization has made available the electronic record of its peer review process in evaluating more than 10,000 scholarly submissions over three years. It is rare to have such data available for close analysis. Peer review is normally practiced in relatively small settings with secrecy and anonymity. The details of the findings of peers are generally not available for inspection. This large data set recording the peer review results of a large number of authors and reviewers can shed light on some of the assumptions of peer review. CHAPTER 2 PREVIOUS STUDIES ON PEER REVIEW Three hundred fifty years ago, when peer review was first employed, it arose out of a commitment to a new kind of rational examination of knowledge claims. Rather than simply accepting the knowledge claims of an author, organizations with an interest in advancing knowledge sought a means of ensuring the validity of the content of their publications. With the development of reproducibility and methodological rigor as standards for supporting one’s findings, it became sensible for those knowledgeable about the scholarly and scientific process to judge the work of their peers. (Chubin and Hackett, 1990) Experimental studies of peer review are rare. (Speck, 1993) The necessity for deception and the diversion of the resources of busy people make the justification of such research tenuous. Peters and Ceci (Peters and Ceci, 1982) provided the most highly cited experimental study of peer review, and exposed what they described as a failure of the peer review system. The researchers re-submitted papers to journals that had previously accepted the same papers within 18 months of their original publication. The papers had been slightly modified by changing author name, institutional affiliation, title, and small changes to the abstract. These changes were made to prevent titles, authors and keywords from appearing in a cursory search of articles. They found that 8 of the 12 articles were rejected on resubmission. Peters and Ceci point out that rejection would be appropriate if reviewers had noted that the manuscripts were not original work, but the review comments indicated that the articles were rejected for “serious methodological flaws.” Observational studies of peer review, while more common than experimental studies, are also sparse and generally performed on a small scale. 
The nature of the peer review process makes even observational studies difficult. It is common for reviews to be submitted anonymously. Then the editor makes a decision based in part on the reviews, but also on other factors (e.g., balance of articles in an issue). Further, the process of peer review is predicated on the variability of the quality of manuscripts and the task of determining it. Because peer review is the method employed to determine that quality, it is impossible to know whether the decision of the peer review process is appropriate. From observational studies, the judgment of the quality of peer review can only be made on the basis of patterns of acceptance and rejection, and cannot examine the method behind the decision making process. More recently, there are some new studies concerning peer-review. The Journal of the American Medical Association has held periodic International Congresses on Peer Review since 1986, most recently in 2005. In the prior Congress in 2001, no findings were presented to indicate that peer review had any effect on the quality of published work, (Rennie, 2002) while other studies indicate that a variety of long understood problems continue to appear in published articles. David Kaplan, a highly cited author, has said, "Despite its importance as the ultimate gatekeeper of scientific publication and funding, peer review is known to engender bias, incompetence, excessive expense, ineffectiveness, and corruption. A surfeit of publications has documented the deficiencies of this system.” (Kaplan, 1995, p. 10) In all the studies of peer review, there is none that examines a large quantity of data over an extended period of time. Most of the studies are on a relatively small scale, involving at most a few hundred reviews. Where larger numbers of manuscripts are examined, the researchers look for trends in acceptance patterns, but not at the inner workings of the review process. That is, they are able to discover resulting biases in the outcome, but they are unable to employ the fine-grained techniques that can reveal the strengths and weaknesses about what is going on “inside” the peer review process at the level of the individual manuscript and individual review. The most recent studies continue to examine publication bias, blind reviewing, statistical errors, ethical issues, and continue to find that the peer review process permits articles to pass through the process to publication with problems intact. In an editorial introducing the JAMA issue devoted to the Fourth International Congress on Peer Review, Drummond Rennie quotes himself from the 1986 Congress: One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. (Rennie, 2002, p. 2759) Rennie went on to say that 16 years later one can continue to find “in abundance” all the problems he spoke of in 1986. 
Unpacking Peer Review Definition of Peer Review The term peer review (as used in this study) refers to the process by which scientific and scholarly work is deemed acceptable for funding or publication. While the process of funding peer review is somewhat different from publication peer review, the two share some essential features. In both cases, the work is submitted to a number of people who are thought to be knowledgeable in the field of research being reviewed. The work is given to the reviewers with the author and institutional affiliation removed. That is, the review is said to be “blind” where the reviewers are intended not to know whose work they are reviewing. The reviewers submit their review of the work to the publisher or funding agency, and from those reviews a decision about whether to accept or reject is made. The author is often given some indication of the reviews. Sometimes the actual reviews are provided, while other times the publisher or funding agency may summarize the content of the reviews. Most often, the names of the reviewers are withheld from the author. Publication and grant peer review are similar also in their allocation of scarce resources. While each kind of peer review seeks to assure that published and funded works are of sufficiently high scientific merit, each also may reject good work because of a lack of funds or publication space. Funding peer review and publication peer review differ in some ways. In funding peer review the outcome is usually acceptance or rejection, while in publication peer review the outcome often includes encouragement to resubmit the work with changes that are suggested by reviewers (or editor). The result is that in publication peer review the process often serves to help an author develop a paper. In this way peer review actually serves to guide authors in shaping their research and publication. Definition of Peer The word “peer” typically means one who is of similar capability and rank, as used in the phrase “a jury of one’s peers.” In the case of publication peer review, the word takes on an additional sense of “one having particular 10 expertise” enabling an informed decision about the knowledge claims being made in a manuscript. In a sense, “peer” can be defined (almost tautologically) as one who has the ability to reliably discern the qualities required of good scientific work. Scientists are expected to participate as reviewers. As part of the culture of science, the fact that all published works are peer reviewed leads all to understand the necessity of competent reviewers. Scientists contribute their time and energy in providing this function--often with no formal training, minimal guidelines, no compensation and very little oversight. In this study the concept of “necessary and sufficient” becomes important in several circumstances. To accomplish peer review effectively and efficiently there are a number of things that may be necessary, but yet might not be sufficient. The implication of this idea is that necessary things, where lacking, are certain to prevent the process from being effective. But even where necessary things are present, those may not be sufficient to produce the desired result. Flour is a necessary, but not sufficient, ingredient for making bread. Flour, salt, sugar, water, shortening, and yeast are both necessary and sufficient ingredients for making bread. 
11 Peer Review's “Built In” Problems Even if peer review were perfectly effective (allowing the publication of all good works, and preventing the publication of all bad works), and perfectly efficient (performing at the lowest possible expense) the process is subject to criticism for some of its inevitable consequences, and some of its traditional implementations. Even at its most efficient, peer review is expensive. Historically, manuscripts were copied and delivered to reviewers who prepared reviews that were copied and delivered back to the journal or funding agency. Even in today’s world of electronic communication, considerable resources are expended in support of the peer review process. Many hours of time are given by some of the most capable scientists and scholars in preparing reviews. In addition to its expense, peer review violates one of the precepts of scientific and scholarly work-- transparency. While research and the presentation of research findings are assumed to be open to inspection and replication, peer review is most often done “blindly.” That is, reviewers read the work of authors without knowing their name or institutional affiliation, and authors never discover who provided the reviews for their work. It is 12 ironic that the process that is intended to guarantee openness and transparency is closed and opaque. Another negative side effect of the peer review process is the delay it causes in bringing new information to the public. Peer review takes weeks or months (and sometimes years) to accomplish. This results in considerable delay before the research can reach publication and be made available to the public. For the knowledge consumer, peer reviewed publication is required before practice can be informed by research findings. “Prudent patient care demands that ultimate judgment await submission of a formal paper and the obligatory process of editorial peer review.” (Soffer, 1980) It’s difficult to know how lives may be affected by the delay of knowledge reaching those who need it. While the intent is that reviewers do not know the authors of the work they are reviewing, the reality is that (especially in highly specialized fields) reviewers are frequently able to infer the author (or the institutional affiliation) owing to the fact that so few people are engaged in such research. In those cases, not only is the “blindness” of the review compromised, but also the highly specialized work of a competitor is revealed to those most capable of taking advantage of the knowledge. Because the 13 author of the work does not know who the reviewer is, it can be difficult for the author to protect “proprietary” information that may be revealed to the reviewer. While there are many problems associated with peer review, and many questions that deserve answers, this study will focus on a narrow set of questions whose answers might possibly be gleaned from the analysis of a large data set. The questions have to do with the mechanics of translating numeric responses of reviewers into decisions of acceptance or rejection. They are about the effectiveness and efficiency of the process as a measurement task. To determine the effectiveness and efficiency of the peer review process, its underlying assumptions must be examined. 
Peer review makes two related assumptions: (a) Manuscripts vary in their quality (i.e., some manuscripts are worth accepting and some are worth rejecting); and (b) “Peers” share knowledge of the criteria used in judging a manuscript, and can discern the extent to which a manuscript adheres to those criteria. The first assumption can reasonably be accepted on its face (no one suggests that all manuscripts deserve to be published), while the second assumption is much more complicated and requires closer examination. The second 14 assumption has within it three words that need to be more fully fleshed out: “peer”, “criteria”, and “discern.” The criteria that peers are presumed to share knowledge of are the defining characteristics of good scholarly or scientific work. Specifically they include: originality, importance, methodology, analysis, and writing. (Wolff, 1970; Wilson, 1979; Armstrong, 1982) These criteria are broad and overlapping, and subject to a variety of interpretations. Because of their breadth, they are often expressed as sets of narrower characteristics each of which is a part of a larger criterion. For example, methodology is sometimes examined as a single construct, and sometimes viewed as the combination of data selection, study design and choice of statistical procedures. At other times, statistical procedures are thought to be part of the analysis criterion. Like the review process, the criteria are rarely questioned. The guidelines that journals offer to reviewers embody these criteria. Surveys of most important criteria for the acceptance of manuscripts universally refer to five general criteria. The number one criterion for a publishable manuscript is its originality. (Hackett and Chubin, 2003) The other criteria are generally understood to include writing, methodology, theory, data, analysis, 15 conclusions, topic choice, and contribution to the field. Each of these qualities is a construct of considerable complexity, increasing the likelihood that reviewers will have difficulty in reliably discerning the degree to which a manuscript manifests these qualities. Identifying multiple criteria as necessary in scientific work implies that these criteria exist in some way independently of one another. Those involved in peer review recognize multiple criteria as contributing to the quality of a paper. While originality is considered the main criterion for acceptance, papers must also be well written, the data involved must be collected properly, and the statistics employed must be properly chosen and interpreted. There is no reason to think that these criteria are inherently related to one another. That is, an author might have collected data well, but that is no reason to assume the writing is good. Some of these criteria may be more important than others. For example, good data collection and analysis might trump punctuation. The knowledge community might be served by the publication of a work that has excellent theoretical and research technique, but includes some poorly phrased sentences. On the other hand, beautifully crafted prose may not redeem a work with invalid 16 statistical procedures producing questionable answers to hypotheses. An effective and efficient peer review system will account for such differences among the criteria. To “discern” in this context takes on the meaning conventionally assigned to a measurement task. 
That is, the job of the reviewer is to measure the degree to which the qualities that are thought to exist in some quantity within each manuscript are present. This also implies that the reviewers are making a shared decision about these inherent qualities. That is, that they have the same understanding of what the qualities are and how they can be expressed in the manuscript. If peer review is to be effective and efficient, it is reasonable to expect both that there will be some differences among reviewers (otherwise there is no need for multiple reviewers), and that they have general agreement among themselves (otherwise there is no confidence they are responding to the qualities that are found in the manuscript being reviewed). A problem with the concept of a peer as used in scientific peer review is that in the case of a truly innovative idea there may be no peers. This can happen in the case of breakthrough kinds of research where the findings challenge the current paradigm. The first researcher who challenges conventional wisdom faces the 17 flywheel effect of the peer review process. Effective and efficient peer review needs a way of recognizing novel developments in a field, or it runs the risk of halting progress in that area of research. A related issue is the difference that is seen between the hard sciences (e.g., physics, astronomy) and the “soft” sciences (e.g., psychology, economics). The hard sciences have clearly defined paradigms. As such, authors and reviewers have clear ideas of what counts as good research and knowledge claims, there is high agreement among reviewers, and the journals have high acceptance rates (authors know what it takes to be accepted). In the soft sciences, the paradigms are not as firmly established: There is much more variability and less agreement among reviewers and authors. As a result, the acceptance rate is much lower in the journals of the soft sciences. (Newman, 1966; Zuckerman and Merton, 1971; Adair, 1982) The most obvious assumption behind peer review is that the quality of submitted work varies, i.e., some work deserves to be published while other work deserves to be rejected. The use of multiple reviewers implies a couple of assumptions. One is that individual reviewers might make errors, but that having multiple reviewers will reveal the idiosyncratic judging of an individual. Another assumption 18 is that reviewers both know what is required of an acceptable work, and can detect those qualities within a work. While some suggest that reviewers are selected with a variety of skills and abilities, and are to review that aspect of the work that relates to their area of expertise, the truth is that reviewers assigned to evaluate a work are all given the same instructions, and asked to rate a work on all of the criteria. Because of this, we should expect general agreement among reviewers. 19 CHAPTER 3 THE CURRENT STUDY Mbthodology/Analytical Framework Many problems with peer review have been identified, and its critical role in the scientific enterprise has been established. Many of the problems of peer review are beyond the scope of this study, however it is possible to focus on an area of importance that is very difficult to address simply because of the paucity of data normally available. By using the raw data of the many thousands of reviewers' ratings and the decisions made, along with information about the reviewers and the authors, the efficiency and effectiveness of the process as a measurement task can be evaluated. 
These issues have been difficult to address because of the lack of data about the responses of individual reviewers across a large number of items being reviewed. This study takes advantage of a large body of data that was collected over three years by a large research organization’s electronic peer review system. The web-based system mediated the assignment of reviewers to proposals that were submitted to the research organization’s annual 20 meeting. The system also collected and stored the reviewers’ responses. The peer review process as practiced by this large research organization has some of the characteristics of both funding and publication peer review. Like funding peer review, this process resulted in acceptance or rejection with no opportunity of resubmitting the proposal with modifications suggested by reviewers. But like publication peer review, the work submitted was justified and funded elsewhere, so the process was unable to direct research energies to the funding targets (the goals of the funding agency). This peer review process allocated the limited resource of a presentation venue. Data Description This section describes the broad outlines of the nature, extent and limitations of the data collected by the organization during their peer review process. First, information that describes characteristics of the proposals submitted, the authors of the proposals, the reviewers and the reviews will be discussed. This is followed with an explanation of how all of these elements are contained within separate divisions of the organization, and repeated for each of three years. 21 Authors and Proposals Authors submitted proposals to the research organization in hopes of being accepted to present their work at the organization’s annual meeting. Proposals were submitted electronically to the organization through the web-based application. Authors intending to submit a proposal Visited the website, and registered on the system. The registration process involved filling in forms on a web page that collected contact information and information about the authors’ institutional affiliation, professional status, and years of membership in the organization. After the author's information was entered into the system, the author then submitted the written proposal in another form on the website, or uploaded a document. Additionally, authors submitted their proposals to be presented in one of several preferred “requested formats.” The formats are ways in which the work is presented to the membership of the organization. The most prestigious format is the “paper presentation,” and accounted for more than 70% of the format requests by authors. Other formats (in order of decreasing prestige) included the “round table” (or “paper discussion”), “poster,” and “new member poster.” There are a variety of other formats available, but they are rarely requested. 22 To summarize, each proposal has the following attributes associated with it: Author Status (Professor, Assistant Professor, Associate Professor, Graduate Student, or Other), Author Years of Membership (<1, 1-5, 6-15, 16— 25, >25), Division, Requested Format, Decision (accepted or rejected), and Accepted Format (if accepted). Reviewers and Reviews Like the authors, reviewers also registered with the web-application, and completed the on-line forms. Each reviewer selected a division, and gave information about his or her professional status and years of organization membership. 
Each reviewer also entered a brief biographical description about his or her interests and expertise. After registering with the web-based system, reviewers were contacted by e-mail and informed when proposals had been assigned to them to be rated. The reviewers then visited the web—based application, logged in to the system, and there were able to read the proposals and respond to ten rating criteria. Their responses consisted of clicking on one of five “radio buttons,” thereby indicating a choice of l to 5 for each of the ten criteria. A comments field was also offered for the reviewers to type extended text comments. 23 Each review was stored electronically and identified by a reviewer identification number (revid), and a proposal identification number (propid). The review consisted of answers to 10 “questions” (q1 — q10) each of which could hold a value of one to five, along with text comments. The system did not compel a response for each criterion, so criteria that received no response were coded in the system as a 0, but not included in any summary statistics. Divisions Each proposal was submitted to one of 12 divisions of the organization. The divisions represent areas of interest within the larger organization. Each reviewer also selected a division for which to review. The divisions of the organization were individually responsible for recruiting reviewers, managing the assignment of proposals to reviewers and making the decisions about acceptance or rejection. Divisions were also able to customize the criteria to allow them to conform to conventions of their areas of interest. Those divisions that chose to exercise that option generally chose only to re-order the criteria. One division for the second and third years of the data substantially modified the criteria. 24 The “chair” of the division was charged with making decisions about acceptance or rejection of proposals, and selecting a presentation format if the proposal was accepted. The web-based system provided several methods of decision support. The chair was able to View summary statistics about reviews given across the division as well as within a given proposal. Specific criteria could be selected to be included in these summary displays. The unique circumstances of this form of peer review process enable a rare look at not only the outcomes, but also the individual findings of each of the reviewers. This, combined with information about the author and the reviewers, allows a fine—grained analysis that may shed light on the peer review process. For analysis in this study, data were extracted from the on-line database that served the research organization's web-based system, and transferred into a local database. To preserve anonymity, all identifying information about individuals was removed, and user id numbers were assigned having no relation to the original identification numbers or names. The data collected through the system includes information about the author, the reviewer, and about the proposal itself. This permits an analysis that takes into 25 account more than just the acceptance or rejection of a proposal. Both authors and reviewers are identified by their institutional affiliation, their professional status, and their years of membership in the organization. The proposals are submitted in two categories, and in a number of different modes of submission that vary in status. The combination of these factors permits the testing of some earlier suggestions about peer review. 
In addition to the individual differences, the divisions serve as a broader classification of the people and works involved. Each of the 12 divisions within the research organization is involved in researching a different area of the larger field. Some of the divisions are involved in researching issues of policy, others focus on legal issues, some are involved in research methodology and measurement and statistics, others explore social and psychological issues, while some are involved in research around practice in the field. Across the divisions, there is a wide range of what counts as appropriate research questions, research methodology, and standards for making truth claims. Yet each division uses the same peer review method to determine which of the submitted proposals will be accepted for presentation. 26 The decision of acceptance or rejection was based on the answers each reviewer gave to each of 10 questions (the 10 criteria). For each criterion the reviewers offered a response of 1, 2, 3, 4 or 5 (or no response). This permits possible mean scores between 1 and 5. The data were maintained without any reduction. That is, each of the responses is available for analysis. The resulting data consist of a total of nearly 34,000 reviews. Table 1 summarizes the data by year, number of proposals, reviewers and reviews. Table 1 Summary of Proposals, Reviews and Reviewers by Year Year Reviews Reviewers Proposals 2001 10248 1860 3206 2002 12310 2641 3900 2003 11382 2179 3969 Three sets of data were available for three consecutive years (2001, 2002 and 2003). While there are some differences among the data sets owing to the continued development and improvement of the web-based system, all three data sets contain information about the proposal (ratings received, decision made, author information), the 27 reviewers (ratings given, years in research organization, professional status), and the criteria (scores given, anchors, prompts). As mentioned, this large collection of data permits a level of analysis that is extremely rare (and perhaps unprecedented) in peer review. Through the analysis of these data it is possible to conceive of the process of peer review as an attempt at measuring the quality of the submitted work. From Data to Measurement Measurement is a process engaged in so frequently that one often fails to remember that it is rare to directly measure the quality of interest. For example, in measuring the temperature, one may instead look at the level of mercury in a thermometer. We trust that the mercury responds to the heat energy in the environment. When measuring the quality of scientific work, we look instead at the answers reviewers have given to a series of questions. The responses to the questions are trusted to reveal an underlying quality inherent in the work. Because there is no direct access to the quality being measured, it is said that the quality is a “latent trait.” Like a latent fingerprint, or a latent photographic image, the latent trait exists unseen, until revealed through the use of 28 other materials, techniques, or analysis. (Bond and Fox, 2001). Through the peer review process, the research organization is seeking to make a determination about whether to accept or reject proposals. The process of seeking numeric scores from multiple reviewers results in a representation of the underlying quality they are seeking to measure. 
In this way the construct of acceptability (what will be called the acceptability quotient, or ‘AQ’) is operationalized through the reviewers’ responses to the questions. The task in analyzing these data is to understand how the multiple criteria combine to create a single decision of acceptance or rejection. It is tempting to think that because a decision is reached about acceptance or rejection through a consideration of the scores given by each reviewer for each criterion, that these scores contribute equally (or at least separately) to the decision. However, other possibilities can be explored through the use of a variety of analytic techniques. Two facts must obtain to have a legitimate measurement process to determine the acceptability of proposals. The first requirement is that reviewers agree with one another. The second requirement is that the decision maker abides by 29 the scoring of the reviewers. Those two requirements guide the analysis of the data in determining the effectiveness of the peer review process in the large research organization. Analytic Techniques Addressing Reviewer Agreement Analysis of variance (ANOVA) and G Study The reliability and efficiency of the peer review process depend on the degree to which peers know and can detect the qualities of a manuscript that earn that manuscript acceptance or rejection. Further, because the process involves multiple reviewers for each item being evaluated, the process assumes that the several reviewers share a common knowledge of the qualities, and a common ability to discern those qualities. The goal of this study is to test those assumptions. Before discussing the specifics of the data under analysis in this study, consider the general ideas behind what kind of evidence would support a conclusion that measurements are reliable. Imagine an exercise in measuring apples. Reviewers are assigned to assess apples for size. The reviewers are given a collection of apples that are assumed to range from small to large. Each apple will be 30 assessed by at least three reviewers, each of whom will judge the apple as small (1), medium (2) or large (3). Further, each reviewer will rate at least three apples. At the end of the measurement process, there will be a set of data that can be looked at from two perspectives: one consisting of the ratings given by each reviewer; the other consisting of the ratings received by each apple. Because each apple has received at least three ratings, each apple’s score can be expressed as a mean and variance. If the reviewers were perfectly consistent, each apple would have a variance of zero, and a mean of exactly one, two or three. Perfect Agreement Table 2 shows the results of a hypothetical apple measuring activity, where apple reviewers are in perfect agreement. Each of the rows (a, b and c) contains the scores given to each apple. The reviewers are indicated by columns x, y and z. Intuitively, there is confidence in scores like this. The fact that the reviewers are all in agreement with one another leads to a belief that they are making their decisions based on common criteria, and lends confidence that the mean for each apple reliably represents its size. In this ideal situation, the variance for each 31 apple is zero, resulting in a mean variance across all apples of zero. The variance for each reviewer’s scores (across the apples) is greater than the variance of scores within each apple. 
The ratio of the mean variance within apples to the mean variance across apples is shown in the lower right corner of Table 2. Where there is perfect agreement among reviewers, this ratio is always zero. Table 2 Perfect Agreement X Y Z Mean Var A 1 1 1 1 O B 2 2 2 2 O O C 3 3 3 3 0 Mean 2 2 2 Var 0.67 0.67 0.67 0.67 Ratio 0 Harsh/Lenient Rater Table 3 shows a situation where there is not perfect agreement among reviewers. In this case reviewer x tends to judge apples as larger, while reviewer 2 judges apples as smaller (reviewer x might be called a lenient reviewer, while 2 is a harsh reviewer). Again, there is still 32 intuitive sense in this scoring pattern, even though there is disagreement among the reviewers. The mean values for each apple can be used in understanding its size, even though only one apple achieves a score of exactly 1, 2 or 3. But the disagreement among reviewers shows up in the ratio figure, as it is now greater than zero. In this case there is still more variance between apples than within apples, but as the ratio approaches one, confidence in the reliability of the reviewers' scores declines. Table 3 Harsh and Lenient Rating X Y Z Mean Var A 2 1 1 1.33 O 22 B 3 2 2 2 33 O 67 0 33 C 1 1 1 1.00 O 22 D 3 3 2 2.67 0 22 Mean 2.25 1.75 1.50 Var 0.25 0.69 0.19 0.375 Ratio 0.89 Complete Disagreement Table 4 shows a scoring situation with no agreement among reviewers. Each apple receives a mean score of two, and a variance resulting from the complete lack of 33 agreement. The variance across the scores given by each reviewer shows less variance than the variance of the scores received by each apple. The resulting ratio is greater than one. It is impossible to infer an apple’s size with such data, and there is no confidence in the reliability of the scores achieved by each apple. Table 4 Complete Disagreement among Raters X Y Z Mean Var A 3 2 1 2 0.67 B 2 1 3 2 0.67 0 67 C 2 3 1 2 0 67 D 1 3 2 2 0 67 Mean 2 2.25 1.75 Var 0.5 0.69 0.69 0.63 Ratio 1.07 To take the example further, imagine a situation where reviewer x always rates apples as a 1, reviewer y always rates apples as a 2 and reviewer 2 rates apples as a 3. In such a situation, all of the variance resides across reviewers, and there is no variance across apples. The implication of such a measurement result is that 34 measurement is completely determined by the rater and is in no way affected by the item being rated. The important idea behind Table 2, Table 3, and Table 4 is that the ratio of mean variances gives an indication of the reliability of the scores received. If the ratio is one or greater, there is no basis for believing that the scores achieved have any relationship to the quality being measured, (i.e., the scores have more to do with the reviewers than with the apple). Ratio values near zero give greater confidence in the reliability of the scores, although that alone is not sufficient to be confident in the quality of the measurement. A ratio less than one is necessary, but not sufficient in establishing confidence in the reliability of measurement. Exactly what the ratio of variances should be to have confidence in a decision is traditionally captured in the F statistic. The F statistic is based on the ratio of variances along with the number of factors and the values they may assume. 
With that information, along with assuming that the data involved conform to a normal distribution, a reliable inference can be made about the likelihood that the two groups’ differences are greater than would be expected by chance alone. 35 Apples with MUltiple Criteria Resulting in a Single Latent Trait The apple-measuring example can be expanded to approach the kind of information that is in the research organization’s data. In the earlier example, each apple received three scores, and each rater measured three apples. In the research organization, each rater offers 10 scores. With the addition of multiple judging criteria comes the possibility for greater precision in determining the underlying trait. In the earlier example the trait being measured was size. This can be extended to a more complex example by assuming measurement of an apple’s marketability. In this case reviewers could be asked to judge apples on several criteria that combined will determine how likely it is to sell an apple in a store. The criteria might include size, color, shape, firmness, tartness, sweetness, lack of blemishes, etc. In combining these criteria to determine an underlying trait of, say, “marketability" some careful thinking will be required. It is likely that apples of more than one color will be sold. So the color criterion will have to be expressed as something like “color appropriate to variety.” Additionally, some criteria might be relatively more or less difficult for an apple to achieve. All apples might be 36 required to be of a minimum size. An apple being larger than that size may not increase its “marketability.” In that case there is a threshold criterion where a score below a certain level causes rejection, and above permits selection. But the likelihood of selection is not increased as the score rises above the threshold. The measurement task is complicated as the number of criteria and number of raters increases. When the measurement task takes into account multiple criteria from multiple raters, additional analytic techniques are useful in disentangling the multiple possible interactions among the criteria. G Study Generalizability theory offers a conceptual framework and a series of statistical techniques to more completely explore multiple contributors to an item’s score. As explicated in the previous examples, in a measurement situation the rater and the rater’s tendency (harshness or lenience) in scoring items must be taken into account. Similarly, where multiple criteria are used, differences among the criteria in their functioning must also be taken into account. 37 In a measurement task, generalizability theory regards the reviewers and the criteria as “facets” of the measurement. A facet is a variable that can contribute to the score achieved in a measurement. In the case of reviewers and proposals, the reviewer is a facet, as is the proposal. Similarly, each criterion is considered a facet as well. In generalizability theory, the item being measured is not usually referred to as a facet, but this is only a matter of terminology. The computations employed in a G study make no such distinction and treat the item being measured as a facet. A generalizability study (G study) seeks to quantify the effect each facet has on the score achieved by the object under measurement. The G study does this through the variance components found in the measurement. 
That is, there is a comparison of variance across proposals (which is construed as contributing to the “true” measure), with the variance across reviewers (which are seen as contributing to “error”). Additional error is also indicated through an interaction of proposal and reviewer. In an ideal measurement situation, all of the variance occurs across proposals. The assumption under such circumstances is that the differences in scores achieved by the proposals are a result of actual differences in 38 proposals, as opposed to differences in the reviewers. In a real measurement situation, there is some variation attributable to reviewers. This variation in reviewers’ ratings can happen owing to several factors. Certain reviewers might simply be less precise in their ratings, resulting in random variation around the proposal’s “true" score. That is, a reviewer in evaluating three proposals of equal quality, might give one a 3, one a 4, and one a 5. Another possibility is that a reviewer might tend to give all proposals a similar score. Finally, a reviewer might be harsh or lenient, consistently rating proposals lower or higher than the proposal's “true” score. These tendencies for reviewers to deviate from a proposal’s “true” score is captured in the reviewer facet variance component. In a given measurement situation there is a certain amount of overall variation in the scores. In the process of performing a G study the overall variance is said to be “partitioned." Portions of the overall variance are assigned to each of the facets. For that reason, as the variance increases for one facet, it decreases in other facets. Ultimately, it is the ratio of variance occurring in the item under measurement with the variance occurring 39 in other facets that allows an assessment of the quality of a measurement. The information obtained through a G study permits a comparison between the variance components of the various facets of measurement. This comparison gives a sense of the distribution (or partitioning) of the overall variance to the other sources of variance. Table 2 can be used to demonstrate the way a G-study is applied to such a situation. Recall that the ANOVA involves the mean variance across apples and the mean variance across raters. The ratio of those means is the measure of the degree to which there is confidence that the raters are responding to traits in the apple as opposed to their own idiosyncrasies. The G-study is said to “partition” the variance among the facets of a measurement. The first step in performing a G-study involves estimating the total variance. The total variance is often called the “sum of squares.” “Sum of squares” in this case actually means “the sum of the square of the difference between the grand mean and an individual score.” (Z(Y—X)2, where )7 is the grand mean of all responses and A7 is an individual response.) In the simple example in Table 2, the grand mean of all scores is 2. Calculating the sum of squares is simple. For apple b there 40 is no contribution to the sum of squares, as each of its scores is equal to the grand mean (LY—uX==O). The difference between the scores on apples a and c, and the grand mean are all 1. l squared is one, so the sum of squares is 6. This captures the total variance in this example. The second step is to calculate the variance for apples and reviewers (in the terminology of generalizability theory, this is called “estimating the variance” of the individual facets). Again the calculations are simple. 
For apples a, b, and c the mean scores are 1, 2, and 3 respectively, while each reviewer has a mean score across apples of 2. The sum of squares across the reviewers is therefore 0, while the sum of squares across the apple means, multiplied by the number of reviewers, is 6. Recall that the total sum of squares was 6. In this case all of the variance is accounted for in the apples, and there is no variance attributable to the reviewers. The G study proceeds to account for degrees of freedom, and converts each sum of squares to a mean square by dividing it by its degrees of freedom. Finally, the variance component is calculated for each of the facets and their interaction. For the interaction effect the variance component is simply the mean square calculated in the previous step (i.e., the sum of squares divided by the degrees of freedom). To calculate the apple variance, the interaction mean square is subtracted from the apple mean square and the result is divided by the number of reviewers. In a parallel fashion, the variance for reviewers is calculated as the interaction mean square subtracted from the reviewer mean square, with the result divided by the number of apples (Brennan, 2001).

When this procedure is performed using the values in the example in Table 2, a value of one is calculated for the apple effect, and a value of 0 for both the reviewer effect and the interaction effect. It is immediately apparent that all of the variance is accounted for in the apples, and reviewers contribute nothing to the variance. This is an ideal situation and leads to confidence in the scoring. When a similar exercise is done using the numbers in Table 3, there is some variance attributable to reviewers. In this case, one reviewer tends to be harsh, while another reviewer tends to be lenient. When the calculations are performed, the results are as shown in Table 5. While there is variance attributable to both the reviewer and the interaction of reviewer and apple, the majority of the variance is still attributable to the apple.

Table 5
Variance Components for Rating with Harsh and Lenient Raters

Effect         Var. Component
Apple          .5833
Reviewer       .1733
Interaction    .1389

Monte Carlo Simulation

While the ratio of variance is an indicator of the quality of peer review, it cannot be used directly to speak to the worth of peer review. This is because the specific ratio found in the data from the research organization lacks a standard against which it can be compared. The question is: What is a good ratio? It is clear from the apples example that some disagreement among reviewers can occur without compromising confidence in the outcome. But how much disagreement can be tolerated? When a statistical technique is employed to analyze data, it essentially addresses the question: What is the likelihood that these results occurred by chance? Because of the many complicating factors around the analysis of this particular data set, it is very difficult to establish a priori what ratios should be expected to provide confidence that the measurement process has yielded results different from chance. A method used to explore such situations is the Monte Carlo simulation.

A Monte Carlo simulation relies upon the power of modern computers to rapidly generate large numbers of random data sets. Each data set can be described with some statistical measures, and the group of data sets can be described with the resulting collection of statistical measures.
That collection of statistical measures can, in turn, be described with statistical measures that can be used to help understand what the range of values might reasonably be. That distribution of statistical measures allows a determination of what might be described as the true population statistics. When the statistic of interest in the sample data set is compared with the distribution of that statistic from the thousands of randomly generated data sets, an estimate can be made of the likelihood that the sample statistic was the result of chance. Marble Draw Example An example may make this clearer. Suppose there is a class of 30 students. Each student is asked to draw one marble four times from a bag containing 10 black and 10 white marbles, replacing the drawn marble after each draw. After all students have performed this exercise it is 44 discovered that 3 of the 30 students drew 4 black marbles. We would like to know how likely it is for that to occur. We can calculate the likelihood assuming that each color has a 50/50 chance of being selected. (.5 * .5 * .5 * .5 = .0625) meaning that 625 times out of 10000 there is the expectation of drawing four black marbles. Using a Monte Carlo simulation, the computer can simulate drawing four marbles 30 times, and count the number of times that result in four black marbles being drawn. The computer can repeat the simulated drawing of marbles many times, and count the number of draws that result in all black marbles. Eventually, this will result in a collection of counts that reveals what the range of the frequency is of drawing four black marbles. In the case of the peer review data contained in the research organization data set, it is much more difficult to calculate theoretical random values. Conventions of scoring may not equally weight each of the scores. That is, it may be that the scores of 1-5 are not equally assigned. In this case it would be inappropriate to generate random scores where all scores have equal likelihood. Instead, the random assignment should be “weighted” in such a way that the actual distribution of scores is approximated. What is really sought is a test of whether the reviewer’s scores 45 are influenced by the proposal (i.e., are reviewers agreeing with one another.) A Monte Carlo simulation in this context could be created that takes into account the frequency of the existing scores in the sample data. In this way, the distribution of values remains the same, but there is confidence that there is no proposal influence on the scores. By generating this random assignment thousands of times, a distribution of statistics is revealed that can be used to establish confidence intervals for the statistics that we see in the sample data. The Monte Carlo model of the data can help establish confidence intervals, but it does nothing to put the data into a measurement construct. That is, from the Monte Carlo simulation it is possible to learn that the sampled data are significantly different from a random outcome, but one cannot infer from that “how different” the groups (or scores are). To move from a statement of difference to a statement of how much difference requires that much more than just the means of scores is taken into account. Analytic Techniques Addressing Decision Making There are two measures of editorial influence that will be considered. The first is one that will be called “editorial choice." The second is one that will be termed 46 “editorial reach.” Each will be described in detail in the following sections. 
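Before turning to those two measures, the weighted simulation just described can be illustrated with a minimal Python sketch. The score weights, the number of simulated proposals, and the number of iterations below are placeholders chosen for illustration, not the organization's actual figures; the logic, however, is the one described above: draw three weighted random scores for every simulated proposal, record the agreement among them, and repeat many times to build a reference distribution.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative placeholders: the relative frequency of each score (1-5) and
# the number of simulated proposals would be taken from the actual data.
score_values = np.array([1, 2, 3, 4, 5])
score_weights = np.array([0.09, 0.18, 0.25, 0.32, 0.16])  # must sum to 1
n_proposals = 5000
n_reviews = 3          # reviews per simulated proposal
n_iterations = 1000    # number of simulated data sets

perfect_agreement_counts = []
for _ in range(n_iterations):
    # Three weighted random scores for every simulated proposal.
    scores = rng.choice(score_values, size=(n_proposals, n_reviews), p=score_weights)
    # Variance of the three scores (0 = perfect agreement), used later as
    # the index of agreement.
    agreement_index = scores.var(axis=1)
    perfect_agreement_counts.append(int((agreement_index == 0).sum()))

counts = np.array(perfect_agreement_counts)
print(f"perfect agreement per {n_proposals} proposals: "
      f"mean {counts.mean():.1f}, sd {counts.std():.1f}, "
      f"range {counts.min()}-{counts.max()}")

Comparing the frequency of any particular mean/agreement combination in the sample data against the distribution of that frequency across iterations is what yields the Z scores reported in Chapter 4.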
Editorial Choice

A "division chair" (acting much like an editor) is charged with deciding which proposals are accepted and which are rejected. A central question about the process is: To what extent do the decision makers adhere to the ratings of the reviewers? Multiple raters each offer scores, which are combined into a single mean score. If the ratings are meaningful, those proposals achieving higher mean scores should be more likely to be accepted than those with lower scores. But even when adhering perfectly to the raters' scores, some degree of editorial choice may be inevitable, for example when forced to select 50 of 100 proposals where proposals 41 through 60 (when rank ordered) have the same score. An editor in this position must select 10 of the 20 proposals. This is forced editorial choice. Other editorial choice is more arbitrary, based on a number of different possible factors. Arbitrary editorial choice occurs when a proposal's score is above or below the cut score, and a decision is made contrary to the indication of the score (i.e., choosing to accept a proposal whose score is below the cut score, or rejecting a proposal whose score is above the cut score). It is important to remember that making either choice compels an instance of the opposite choice, although that instance may be "hidden" in the group of proposals at the cut score.

The following example illustrates these phenomena. Imagine an editor forced to select 5 of the 10 proposals represented in Table 6. Proposals 4, 5, and 6 have the same score. The editor, in order to select 5 proposals, is forced to make an editorial choice among them. Given the need to select 5 proposals with three proposals at the cut score, the editor is said to have three "forced" editorial choices, as the choice among the three cannot be made based on differences in the score.

Table 6
Example of Possible Proposal Scoring
(Ten proposals listed in rank order by score: proposals 1 through 3 score above the cut point, proposals 4, 5, and 6 share the score at the cut point, and proposals 7 through 10 score below it.)

Now assume that the editor decides to accept proposal 7 as one of the five. This means the editor is now compelled to reject a compensating proposal at or above the cut score. Such a selection will be called an "arbitrary" editorial choice. Further, because the scoring was intended to identify the proposals to be accepted, proposal 7 has been "misscored." If a proposal is rejected with a score above the cut score, or accepted with a score below the cut score, that proposal will be said to have been misscored. To make the terminology clear, consider the two options that the editor has after selecting proposal 7. The editor can choose to reject a proposal at the cut score or above the cut score. In either case this will be considered an arbitrary choice. If the editor chooses to reject a proposal above the cut score, that proposal will be said to have been both misscored and an arbitrary choice. If the editor chooses to reject one of the proposals at the cut score, that proposal will not be regarded as misscored, but will be described as an arbitrary choice. The overall editorial choice will be summarized in the following way. Assume the editor chooses to select proposals 1, 2, 3, 6, and 7. In such a case, the editor had 3 forced editorial choices (proposals 4, 5, and 6) imposed by the situation, made two arbitrary choices (proposal 7 and one of the proposals 4, 5, and 6), and decided that one proposal (proposal 7) was misscored.

Cut Score

Now consider the cut score.
In general, when making a decision based on a score there are two methods used depending on the circumstances. In one circumstance (e.g., a driving test for a driver’s license) there is an established cut score above which one is included (receives a driver's license), and below which one is excluded (is denied a license). In the case of a driver’s test, there is 50 no limit to the number of driver licenses available, and all who meet or exceed the cut score receive a license. Another situation occurs where only a limited number of applicants may be accepted (e.g., admission to a competitive academic program). Where only some of the applicants are accepted only those with the highest scores will be included. This is accomplished by putting applicants in rank order according to their scores, and selecting the top applicants until the number of available admissions is met. In this case the cut score is established after a decision is made, and is the lowest score achieved in the group of those accepted. In both methods, the selection process is based on the score achieved. In the case of a selection process like that of a drivers license, a pre-defined cut score is established and selection is based on only the score. In the latter case both the score and the rank order are involved. In the competitive situation (assuming the decision is based entirely on the score) the cut score is determined after the selection process and becomes the lowest score of those accepted. However, many times a combination of the two can occur, as when there is both a minimum acceptable score and a limited number to be accepted. 51 In the large research organization we can assume that a combination approach is used. That is, there are a limited number of proposals that can be accepted, but any accepted proposal must meet some minimum standard. While no explicit cut score may be established, an implicit cut score is created once the decisions are made. In the large research organization the decision is not based exclusively on the score, but also on the basis of editorial choice. That is, the decision-maker has the prerogative to consider factors other than raters’ scores to make the decision to accept or reject. This somewhat complicates the determination of a cut score, as some accepted proposals might have scores lower than some that are rejected. For the purposes of this study a hypothetical cut score will be computed based on the number accepted and the rank order of the scores. That is, once the number (N) of accepted proposals has been determined a cut score will be calculated by “counting down” the rank ordered scores by the number of accepted proposals, and taking that Nth score as the cut score. It is useful to express some measure of editorial influence over the acceptance of proposals. Editorial influence involves: accepting proposals with scores that 52 are below the cut point, rejecting proposals with scores above the cut point, and selecting a portion of the proposals at the cut point. Therefore, a measure of total editorial influence is the percentage of those proposals falling into those three categories. The sum of the high- scoring rejected, low—scoring accepted, and the number of proposals at the cut score (if not all of those proposals were accepted), taken as the percentage of the total number of proposals is the measure of editorial choice. Editorial Reach “Editorial reach” is a measure of how far an editor “reaches” in accepting low—scoring proposals or rejecting high-scoring proposals. 
An editor who accepts very low-scoring proposals exhibits greater reach than does an editor who accepts proposals very near (but below) the cut score. Further, an editor displays greater reach in identifying a proposal as misscored where the reviewers were in agreement, as opposed to where reviewers disagree. For example, accepting a proposal one point below the cut score where the reviewers are in perfect agreement (variance of 0) represents greater reach than accepting a proposal of the same score but with a higher variance (i.e., greater disagreement).

An index of overall editorial reach can be calculated. Such a calculation must take into account three factors: the difference in score between the cut score and the misscored proposal, the agreement index (variance) of the misscored proposal, and the total number of proposals. A formula for such an index is

Reach = (1/N) Σ_{i=1}^{Nm} √((Sc − Xi)²) / (1 + Ai)

where Nm is the number of misscored proposals, Sc is the cut score, Xi is the score of the ith misscored proposal, Ai is the agreement index of the ith proposal, and N is the total number of proposals.

Questions/Hypotheses

The questions to answer are of several kinds. One kind of question, at the simplest level, is about the consistency or fairness of the decision making process. That is, are those who received higher ratings more likely to be selected than those with lower ratings? This question is relatively simple to answer by computing the mean scores achieved and plotting them against the decision made. If the highest scores are selected, then it is a fair inference that the decision made is based on the score. Another indication of consistency is the extent to which reviewers agree with one another in their decisions. This becomes a more complicated question requiring more careful analytical methods to answer. It is reasonable to expect that reviewers will display some degree of agreement with one another. This will be indicated by ANOVA and G study results, as well as by comparison with the agreement found in the Monte Carlo simulation.

The efficiency of the process is another important consideration. Questions about efficiency will explore how the 10 criteria combine to create a score. Are 10 criteria sufficient? Do they reveal multiple factors that contribute to the quality of the proposal? Efficiency is also involved in the number of reviews each proposal receives. Is that number sufficient? Is that number excessive?

In addition to the above questions, this large data set may permit the testing of some suggestions of previous research and theory of scientific knowledge. Are there differences between the divisions of the organization? Can those differences be seen as reflecting the differences seen between the hard and soft sciences in peer review acceptance? Additionally, these data can be analyzed to test the assumption that peer review can be seen as a measurement process. The most basic of statistical tests are likely to show that those proposals that received the highest scores are more likely to be accepted than those with lower scores. Beyond that, the data are likely to show that there is sufficient agreement among reviewers to justify the conclusion that they are responding to differences in the proposals.

The research organization divides its submission process by sub-topics of the organization's larger field, so the proposals were submitted to address issues of a particular sub-topic. The divisions of the organization may have differing ways of approaching the review process.
The literature of scientific paradigms suggests that areas of different focus will reveal different approaches. Specifically, areas with a more established paradigm should show more consistency in the decision making process, while fields with a developing paradigm should show more variation in responses. This data set should be able to address the question: Do the decisions of these sub-groups vary?

The people who rate the proposals and the people who submit the proposals have varying amounts of experience and status. The results of different status or experience can be compared. Do high-status reviewers make decisions different from those of lower-status reviewers? Do high-status authors receive different decisions than low-status authors? Where reviews differ on a given proposal, is the decision of the higher status reviewer more likely to be accepted?

The research organization, like many others engaged in the peer review process, takes great care in evaluating proposals over what are considered separate and distinct criteria. In other measurement contexts it is common for a "halo effect" to develop. The halo effect occurs when an overall impression of the quality of a work influences the judgment of individual criteria. When such an effect occurs, the use of individual criteria loses its effectiveness, and instead each criterion becomes a proxy for a more general assessment of the work. Will such a halo effect be evident in the research organization's reviews?

The literature suggests that those who have published will tend to be published in the future. This makes a certain amount of intuitive sense. Those who have established the ability to produce works that pass the peer review process might reasonably expect to enjoy a greater degree of success than those who have never published. If this is true, then a good hypothesis is that those with greater experience and/or status should score higher and receive higher acceptance rates.

CHAPTER 4
FINDINGS

The Broad Picture

With an understanding of the key terms involved in the data set, and of how they relate to one another, what follows is an examination of the specifics of the data under analysis. The data were collected over three years, and in general the description and analysis are carried out independently for each of the three years and, where appropriate, on a division by division basis. Table 7 summarizes the overall number of proposals, reviews, and reviewers for the three years, along with the mean number of reviews that each proposal received and the mean number of reviews given by each reviewer.

Table 7
Summary of Proposals, Reviews and Reviewers by Year

Year          2001          2002          2003          Total
Proposals     3206          3900          3969          11075
Reviews       10248         12310         11382         33940
Ma (SD)       3.20 (0.74)   3.16 (0.76)   2.87 (0.69)   3.06 (0.74)
Reviewers     1860          2641          2179          6680
Mb (SD)       5.51 (4.67)   4.66 (3.89)   5.23 (4.43)   5.08 (4.31)
aMean reviews per proposal. bMean reviews per reviewer.

Table 7 shows that over the three years there was an increase in the number of proposals submitted and a large change in the number of reviewers. A casual look at the figures in Table 7 shows differences in the means across the years, and ANOVAs comparing the means show the differences to be statistically significant (p < .0001). However, because of the large sample size, statistical significance is fairly easy to achieve. The next section describes the characteristics of the reviewers across the divisions and through the years.
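Before turning to those summaries, a brief sketch illustrates why significance comes so easily at this sample size. The data below are simulated from the year means and a pooled standard deviation close to those in Table 7; the group sizes and random seed are illustrative assumptions, so the output demonstrates the statistical-versus-substantive point rather than reproducing the actual test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated review counts per year using means and a common SD close to
# those reported in Table 7; group sizes are illustrative only.
means = {"2001": 3.20, "2002": 3.16, "2003": 2.87}
sd = 0.74
n_per_year = 3500

groups = [rng.normal(m, sd, n_per_year) for m in means.values()]
f_stat, p_value = stats.f_oneway(*groups)

# Eta-squared: proportion of total variance explained by year.
all_values = np.concatenate(groups)
ss_between = sum(len(g) * (g.mean() - all_values.mean()) ** 2 for g in groups)
ss_total = ((all_values - all_values.mean()) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.1f}, p = {p_value:.2e}, eta^2 = {eta_squared:.3f}")
# With n in the thousands, p is vanishingly small even though year explains
# only a few percent of the variance in the simulated scores.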
Reviewer Summaries by Division and Year

Table 8 sub-divides the information presented in Table 7. While means range from a low of 2.9 proposals per reviewer in Division 8 in 2001 to a high of 6.3 (Division 9, 2002 and 2003), what is notable are the maximum values and the resulting high SDs. These result from a relatively few reviewers submitting many times the average number of reviews. For example, in year 2 one reviewer contributed to the reviews of 41% of the proposals submitted to Division 2.

Table 8
Proposals per Reviewer

Division             2001           2002           2003
1    M (SD)          8.9 (6.800)    4.8 (4.756)    5.6 (4.6)
     Na (Maxb)       199 (38)       204 (29)       148 (28)
2    M (SD)          6.8 (5.684)    5.1 (7.053)    5.4 (4.9)
     Na (Maxb)       138 (28)       131 (54)       134 (31)
3    M (SD)          5.8 (4.399)    3.9 (2.660)    5.0 (3.7)
     Na (Maxb)       496 (36)       609 (24)       512 (28)
4    M (SD)          5.1 (3.391)    4.2 (2.801)    3.8 (2.8)
     Na (Maxb)       216 (22)       293 (23)       237 (21)
5    M (SD)          2.5 (1.526)    4.4 (2.788)    4.6 (3.2)
     Na (Maxb)       80 (8)         86 (18)        71 (17)
6    M (SD)          6.2 (6.575)    4.7 (2.958)    5.7 (3.4)
     Na (Maxb)       27 (35)        49 (15)        30 (17)
7    M (SD)          5.7 (4.528)    5.0 (3.716)    4.7 (3.7)
     Na (Maxb)       112 (34)       204 (24)       179 (26)
8    M (SD)          2.9 (1.616)    4.8 (3.443)    4.7 (2.9)
     Na (Maxb)       95 (9)         130 (16)       103 (15)
9    M (SD)          5.9 (5.757)    6.3 (6.024)    6.3 (5.1)
     Na (Maxb)       75 (22)        67 (21)        71 (20)
10   M (SD)          4.9 (2.897)    4.7 (3.483)    5.6 (3.9)
     Na (Maxb)       144 (14)       195 (23)       200 (24)
11   M (SD)          4.4 (3.422)    5.4 (4.310)    6.1 (6.3)
     Na (Maxb)       227 (25)       526 (49)       385 (76)
12   M (SD)          3.3 (2.390)    4.2 (3.111)    5.8 (4.7)
     Na (Maxb)       51 (13)        147 (16)       109 (28)
aNumber of reviewers. bMaximum proposals per reviewer.

Proposal Summaries by Division and Year

A total of 11,075 proposals were submitted for review across the three years included in this study. Table 9 shows how these proposals were distributed across the divisions for each year, the percentage that each division accepted, and the overall acceptance rate.

Table 9
Proposals Submitted and Accepted

        2001             2002             2003
Div.    N     Accepted   N     Accepted   N     Accepted
1       226   58%        252   63%        302   44%
2       136   44%        200   54%        212   62%
3       640   58%        712   74%        830   60%
4       227   44%        318   55%        274   59%
5       87    68%        126   71%        115   68%
6       55    55%        76    42%        62    47%
7       273   65%        317   51%        443   60%
8       182   67%        211   64%        252   54%
9       106   64%        84    63%        92    49%
10      312   52%        341   60%        351   43%
11      748   43%        990   45%        829   52%
12      214   43%        273   51%        207   50%
Total   3206  53%        3900  58%        3969  55%

Table 10 shows information comparable to that shown in Table 8, but shows reviews per proposal instead of proposals per reviewer. In contrast to the distribution of proposals per reviewer, the number of reviews received by each proposal is constrained to a narrow range.

Table 10
Summary of Reviews per Proposal

Division             2001           2002           2003
1    M (SD)          3.07 (.830)    3.45 (.941)    2.79 (.446)
     N               226            252            302
2    M (SD)          2.84 (.408)    2.94 (.670)    3.13 (.590)
     N               136            200            212
3    M (SD)          3.31 (.748)    3.40 (.764)    3.16 (.564)
     N               640            712            830
4    M (SD)          3.92 (.970)    3.66 (.756)    3.08 (.364)
     N               227            318            274
5    M (SD)          2.67 (.543)    2.91 (.456)    2.72 (.669)
     N               87             126            115
6    M (SD)          3.13 (.336)    2.97 (.431)    3.27 (.632)
     N               55             76             62
7    M (SD)          2.89 (.577)    2.89 (.415)    2.28 (.458)
     N               273            317            443
8    M (SD)          2.65 (.627)    3.06 (.303)    2.10 (.307)
     N               182            211            252
9    M (SD)          4.89 (.318)    5.05 (.344)    4.96 (.205)
     N               106            84             92
10   M (SD)          3.20 (.482)    3.00 (.675)    3.06 (.333)
     N               312            341            351
11   M (SD)          3.12 (.445)    2.92 (.648)    2.72 (.559)
     N               748            990            829
12   M (SD)          2.94 (.316)    2.85 (.378)    2.74 (.659)
     N               214            273            207

Table 11 shows the number of reviewers for each of the divisions for each of the years.
No reviewer rated proposals from more than one division, so the totals shown at the bottom of the table indicate the actual number of reviewers for each year. Across years there is likely to be considerable duplication of reviewers, so summing across the columns will over-estimate the number of reviewers for the three years. Because of continuing development and changes of the electronic system, along with user unfamiliarity, accurate tracking of people across years was not achieved. 66 Table 11 Reviewers per Division by Year 2001 2002 2003 Division N % N % N % l 199 10.7% 204 7.7% 148 6.8% 138 7.4% 131 5.0% 134 6.1% 3 496 26.7% 609 23.1% 512 23.5% 4 216 11.6% 293 11.1% 237 10.9% 5 80 4.3% 86 3.3% 71 3.3% 6 27 1.5% 49 1.9% 30 1.4% 7 112 6.0% 204 7.7% 179 8.2% 8 95 5.1% 130 4.9% 103 4.7% 9 75 4.0% 67 2.5% 71 3.3% 10 144 7.7% 195 7.4% 200 9.2% 11 227 12.2% 526 19.9% 385 17.7% 12 51 2.7% 147 5.6% 109 5.0% Total 1860 100.0% 2641 100.0% 2179 100.0% Table 12 shows the comparable figures for authors by division across the years. Unlike reviewers, some authors submitted proposals to more than one division. Therefore, the totals at the bottom of the table are greater than the number of individual authors. 67 Table 12 Authors per Division by Year Division 2001 2002 2003 1 129 184 119 2 65 128 100 3 520 536 341 4 278 234 27 5 3 105 2 6 43 62 42 7 145 212 58 8 48 160 11 9 98 63 52 10 222 219 138 11 519 654 308 12 155 167 60 Total 2225 2724 1258 Accepted proposals were assigned a presentation format. Examination of the cross tabulation of requested format and assigned format in Table 13 shows that about one third of the accepted proposals were assigned a format other than the one requested. 68 Table 13 Cross Tabulation of Requested Format and Accepted Format Accepted As New Requested Round Member Format Paper Table Poster Poster Other Total Paper 3097 798 420 107 13 4435 Round Table 173 558 45 28 0 804 Poster 95 62 349 27 2 535 New Member 9 7 11 123 0 150 Poster Other 52 54 19 3 41 169 Total 3426 1479 844 288 56 6093 As mentioned, the reviews were made electronically and consisted of a one to five response to each of 10 criteria. Each criterion was presented as a statement or question with a negative anchor (1) and a positive anchor (5), and radio buttons for the choices 1, 2, 3, 4, and 5. The web- based system did not require reviewers to provide a response to each criterion. If no radio button was clicked, the system recorded a zero as the response. Zero responses were coded as missing in this analysis, and also in summaries of data that the system provided to decision makers. The system was programmed with default questions that represented the questions asked of reviewers in the 69 organization before the adoption of the electronic system. The ten criteria and their negative and positive anchors are shown in Table 14. Each person in charge of a division of the organization had the opportunity to customize the “questions” as well as the anchors. However, all responses to the criteria were restricted to a 1 to 5 value, and 1 was always interpreted as least favorable and 5 as most favorable. In each of the three years only one of the divisions took the opportunity to significantly alter the default criteria. In 2001 and 2002 the changes made were minor changes in phrasing or ordering of the criteria. For those years the order of the criteria was adjusted for this analysis. 
In 2003, Division 11 altered the criteria such that direct comparisons of responses to the criteria including that division are impossible. However, a factor analysis of response patterns shows that the "q10" or "overall" question can serve as a proxy for the other questions. The division with the customized criteria showed no difference in the factor analysis (i.e., all criteria loaded on a single factor and q10 loaded most heavily on that factor).

Table 14
Proposal Review Criteria Statements and Anchors

Criterion                          Negative Anchor        Positive Anchor
Choice of problem/topic            Insignificant          Highly significant
Theoretical Framework              Not articulated        Well articulated
Methods                            Not well executed      Well executed
Data source(s)                     Inappropriate          Appropriate
Conclusions/Interpretations        Ungrounded             Well grounded
Quality of writing/organization    Unclear/Unorganized    Clear/Well organized
Contribution to field              Routine                Highly Original
Membership appeal                  Small audience         Large audience
Would you attend this session      No                     Yes
Overall Recommendation             Not acceptable         Outstanding Proposal, Definitely accept

Author and Reviewer Characteristics

Included with the reviews and proposals was information about the author and reviewer. Table 15, Table 16, Table 17, and Table 18 show the distribution of author and reviewer characteristics across all divisions and all three years.

Table 15
Author Status Percentage across Divisions and Years

Status                 Freq.    Percent
Professor              1057     9.5
Associate Professor    1349     12.2
Assistant Professor    2921     26.4
Graduate Student       2534     22.9
Educator               454      4.1
Other                  1252     11.3
Total                  9567     86.4
No answer              1508     13.6
Total                  11075    100.0

Table 16
Summary of Author Years

Years           Freq.    Percent
<1              2485     22.4
1-4             2438     22.0
5-10            1748     15.8
11-20           667      6.0
>20             257      2.3
Not a Member    760      6.9
Total           8355     75.4
No answer       2720     24.6
Total           11075    100.0

Table 17
Summary of Reviewer Status

Status             Freq.    Percent
Professor          981      14.7
Assc. Professor    1127     16.9
Asst. Professor    1995     29.9
Grad. Student      820      12.3
Educator           435      6.5
Other              949      14.2
Total              6307     94.4
Missing            373      5.6
Total              6680     100.0

Table 18
Summary of Reviewer Years

Years           Freq.    Percent
<1              581      8.7
1-4             1732     25.9
5-10            2053     30.7
11-20           1143     17.1
>20             598      9.0
Not a Member    229      3.4
Total           6336     94.9
Missing         344      5.1
Total           6680     100.0

The Criteria

As mentioned, a factor analysis was done of the 10 criteria. The factor analysis allows an understanding of what underlying qualities might be influencing the rating of proposals and how the individual criteria might differentially reflect those qualities. Across all the years and in each division the findings were nearly identical. All of the criteria loaded primarily on a single factor, and in each case (division/year combination) the criterion that most heavily loaded on the single factor was the "Overall" criterion. The only exceptions to the extraction of a single factor were 2002/Division 9 and 2003/Division 4. In both of those cases a second factor accounted for 10% of the variance (compared with 61% of variance accounted for by the primary component in the 2002 case, and 65% in the 2003 case). The correlation matrix for each of the year/division combinations shows a high degree of correlation among all of the criteria, with significance (p < .0001) for every combination. This predominance of a single factor and the high correlation among the criteria suggest that there is only a single underlying factor which the answers to the questions are revealing.
Further, the fact that in all cases the "Overall" (q10) response loads most heavily on that factor permits a simplification of the analysis by focusing on that response as a proxy for the other responses.

A central issue of the analysis is the degree to which reviewers agree with one another. For this reason, it is useful to establish some figure to serve as an index of agreement. Further, two thirds of the proposals received exactly three reviews. This fact is used to limit the analysis to those proposals, thus reducing the number of possible combinations of reviewers' scores. A measure of agreement among reviewers is the variance of the responses given, and this variance can act as the index of agreement. Limiting the analysis to proposals with three reviews results in 9 possible values for the index of agreement. But even after reducing the data in this fashion, there is a challenge in visualizing the remaining data. A variety of aids to visualize and interpret the data are used to approach a solution to this problem.

A scatter plot can allow a number of characteristics to be displayed on a single diagram. Figure 1 is a scatter plot showing the relation among three factors: the mean score (q10mean), the agreement index of the three scores received (q10var), and the decision (accept or reject). Owing to the fact that all proposals in the plot received exactly three scores for the overall criterion, there are relatively few discrete mean scores and agreement indexes possible.

Figure 1. Accepted and rejected proposals.

Along the y-axis is the mean score (across the three reviewers) achieved by a proposal on the "overall" criterion (q10). The x-axis shows the agreement index resulting from the scores received. An agreement index of zero indicates perfect agreement among the reviewers. Notice that directly above the 0.00 point on the x-axis are exactly five centers for scores. These are the 1-5 mean scores. Obviously, if there is perfect agreement among reviewers, the mean score will be the same as the score each reviewer gave. The next variance higher than 0.00 is 0.22. Two clusters of proposals can be seen at that level of variance between each pair of adjacent "complete agreement" scores. These represent scores with two raters in agreement and the third rater differing by only one point (e.g., 4-4-3, 2-2-1, etc.).1 To the right on the plot are indications of greater disagreement among the reviewers. The size of the circles on the scatter plot indicates the number of proposals with the given mean and rater agreement index. Notice also that for proposals with extreme means (i.e., means near 1 or 5), the variance is relatively low. This simply reflects the fact that for a proposal to achieve a high (or low) mean, it is necessary for the reviewers to agree on a relatively high (or low) rating. Conversely, the points on the scatter plot that indicate the highest variance are those points with the mid-range means.

1 Table 19 presents all of the possible combinations of three ratings and the associated agreement index.
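The agreement index values just described can be reproduced with a short Python sketch. It assumes the index is the variance of the three scores taken over the three scores themselves (the population form, which gives 0.22 for a 4-4-3 pattern, consistent with the figure); enumerating every possible combination of three ratings then yields exactly the groupings listed in Table 19 below.

from itertools import combinations_with_replacement
from statistics import pvariance
from collections import defaultdict

# Group every possible combination of three 1-5 ratings by its agreement
# index, defined here as the population variance of the three scores
# (dividing by 3), so that perfect agreement gives an index of 0.
by_index = defaultdict(list)
for combo in combinations_with_replacement(range(1, 6), 3):
    index = round(pvariance(combo), 1)
    by_index[index].append("".join(map(str, combo)))

for index in sorted(by_index):
    print(f"{index:3.1f}  {', '.join(by_index[index])}")
# Prints nine index values (0.0, 0.2, 0.7, 0.9, 1.6, 2.0, 2.7, 2.9, 3.6)
# and the rating combinations associated with each, matching Table 19.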
Table 19
Ratings Associated with Each Agreement Index

Index   Possible Ratings
0.0     111, 222, 333, 444, 555
0.2     112, 122, 223, 233, 334, 344, 445, 455
0.7     123, 234, 345
0.9     113, 133, 224, 244, 335, 355
1.6     124, 134, 235, 245
2.0     114, 144, 225, 255
2.7     135
2.9     125, 145
3.6     115, 155

Figure 2 is a bar graph showing the frequency of each agreement index in the sample data. Each bar represents the number of proposals achieving a particular level of agreement. Of the proposals, 63% had an index among the smallest three agreement indexes. A casual examination of Figure 2 might cause one to believe that there is considerable agreement among reviewers.

Figure 2. Frequency of each agreement index in the sample data.

Figures 3 and 4 show the mean and variance of the accepted and rejected proposals separately. Note that in the plot of the rejected proposals there is only a slight weighting toward the lower portion (lower mean scores), while in the accepted proposals there is a clear preponderance of proposals with higher means.

Figure 3. Mean versus variance of accepted proposals.

Figure 4. Mean versus variance of rejected proposals.

G Study

ANOVA relies upon assumptions about the data to provide meaningful results. Specifically, ANOVA assumes the design to be fully balanced. Referring back to the apple examples, each apple was rated by each reviewer, thus providing a fully balanced design. In contrast, the data under analysis form a matrix that is very sparsely populated. This is owing to the fact that there are many hundreds of reviewers and many hundreds of proposals, but each reviewer rated relatively few proposals and about three reviewers rated each proposal. Generalizability theory provides methods for analyzing some sparse matrices and was applied to the data here. A G study performed on the data resulted in variance components for the facets that were impossible to distinguish from the "noise." This can be interpreted as a failure not of the measurement process, but only of the analytical technique. The data simply are such that a G study offers no insight.

Results of Monte Carlo Simulation

With the failure of the G study, more importance is placed on the Monte Carlo simulation. Recall that the Monte Carlo simulation is employed to understand the agreement and the mean scores that would obtain from a purely random distribution of scores by reviewers. The basis of the random scoring is the pattern of scoring that was found in the original data. In the sample data 8.8% of the scores given were one, 18.1% were two, 24.8% were three, 32.3% were four, and 15.9% were five. In performing the Monte Carlo simulation it is important to reflect this distribution in the generation of the random scores. To accomplish that, each score is produced a corresponding fraction of the time when the random scores are generated.

The importance of properly weighting the generation of the random numbers is illustrated in Figure 5 and Figure 6. Figure 5 shows the result of a Monte Carlo simulation where 7314 simulated proposals were each given simulated ratings of three randomly generated scores. Each randomly generated score had equal likelihood of one, two, three, four, or five. This process was repeated 1000 times.
Each simulated proposal received one combination of mean and variance among the scores. The median number of times a mean/variance combination occurred across the 1000 iterations (for each possible mean/variance combination) is shown in Figure 5. Note the symmetry across the mean scores and the predominance of variances indicating disagreement in the reviews.

Figure 5. Median mean/variance outcome of unweighted Monte Carlo simulation.

Figure 6 shows the corresponding results of a Monte Carlo simulation where scores were generated according to the frequencies found in the sample data. The simulated reviews were made randomly, but the results were weighted to reflect the tendency of reviewers' differential use of the rating scores. Note the differences between Figure 6 and Figure 5. Figure 6 shows much more of a tendency for agreement, as well as an asymmetry with a predominance of mean scores around three and four.

Figure 6. Random, weighted mean/variance distribution.

Figure 7 shows the mean/variance frequency from the sample data. The distribution of mean/variance found in the sample data is very similar to that found in the random, weighted Monte Carlo simulation. This indicates that a large amount of the tendency toward a particular mean score as well as reviewer agreement can be attributed to the tendency of reviewers to center their scores on four. These three figures show the general trends for the distribution of mean/variance combinations. But to be specific and quantitative about the distribution of these scores it is useful to look at the distribution of mean scores and the distribution of agreement indexes independently. Toward that end, a different view of the data can be helpful.

Figure 7. Sample data mean and variance frequency.

Figure 8 displays both the results of the Monte Carlo simulation as well as the sample data for all scores of 1,1,1. The histogram shows the frequency, across the 1000 repetitions, of each particular number of randomly generated scores of 1,1,1 (simulated distribution: M = 5.08, SD = 2.245, N = 1,000). In nearly 200 of the 1000 iterations the score of 1,1,1 occurred four times. The most times that 1,1,1 occurred was 13 times. In the sample data 1,1,1 occurred 61 times. This results in a Z score of 24.4, meaning that it is extremely unlikely (p < .00001) that the number of occurrences of 1,1,1 in the sample data occurred by chance.

Figure 8. Histogram of frequency of 1,1,1 in simulated data compared with actual.

In contrast with the findings displayed in Figure 8, Figure 9 shows the corresponding results of the Monte Carlo simulation for a mean score of 3.33 and an agreement index of 0.2. In the sample data this result occurred 447 times, resulting in a Z score of 0.52. This indicates that there is no statistically significant difference between this result and what one could expect by chance.
Figure 9. Histogram of frequency of 3,3,4 in simulated data compared with actual.

Table 20 presents the corresponding comparison between the Monte Carlo simulation for each mean/variance combination and the Z score of the sample data. The pattern is consistent across the years, indicating that agreement is greater than expected by chance toward the extremes (high-scoring and low-scoring proposals), and very close to chance in the middle of the scoring range.

Table 20
Z Scores for Each Mean/Index

                        Z score
Mean   Index    2001         2002         2003
1.0    .0       15.11***     8.82***      22.08***
1.3    .2       7.73***      8.19***      .46***
1.6    .2       2.11*        5.34***      .61***
2.0    .0       2.85**       2.53**       -.20
2.3    .2       4.13***      2.62**       .77**
2.6    .2       -0.75        1.23         .33
3.0    .0       1.20         1.15         .46**
3.3    .2       1.06         -0.68        .44
3.6    .2       0.11         1.25*        .37*
4.0    .0       2.06*        2.67**       .74**
4.3    .2       4.13***      2.85**       .42***
4.6    .2       4.55***      4.63***      .21***
5.0    .0       5.16***      7.75***      .77***
*p<.05  **p<.001  ***p<.000001

Editorial Choice

Editorial choice is the measure of the portion of proposals (within a division) that were accepted with scores below the cut score or rejected with scores above the cut score. Table 21 presents the editorial choice of each division by year. In divisions with a substantial number of proposals, the range is from nearly 3.5% to about 38% of proposal decisions made by the editor.

Table 21
Editorial Choice by Division and Year

        2001              2002              2003
Div     N     Choice      N     Choice      N     Choice
1       127   19.69%      153   23.53%      224   17.86%
2       108   30.56%      130   12.31%      179   13.41%
3       439   24.15%      410   11.95%      600   10.83%
4       97    18.56%      119   19.33%      236   11.02%
5       61    16.39%      102   10.78%      86    3.49%
6       47    17.02%      62    14.52%      46    13.04%
7       187   24.06%      267   27.72%      119   37.82%
8       115   15.65%      196   8.67%       21    14.29%
9       2     0.00%       -     -           -     -
10      232   5.17%       246   18.70%      308   3.57%
11      619   22.29%      690   15.51%      543   11.05%
12      195   16.92%      229   9.61%       119   9.24%

Editorial Reach

Another measure of editorial influence is editorial "reach." Reach takes into account not only the editorial choice described above, but also the distance from the cut score and the index of agreement among the reviewers. Table 22 details the range of editorial reach across the divisions and through the years.

Table 22
Editorial Reach across Divisions and through Years

          2001             2002             2003
Division  N     Reach      N     Reach      N     Reach
1         127   0.055      153   0.031      224   0.029
2         108   0.124      130   0.022      179   0.034
3         439   0.034      410   0.015      600   0.019
4         97    0.023      119   0.035      236   0.005
5         61    0.031      102   0.015      86    0.007
6         47    0.048      62    0.027      46    0.020
7         187   0.037      267   0.029      119   0.148
8         115   0.037      196   0.005      21    0.030
9         2     0.000      -     -          -     -
10        232   0.007      246   0.016      308   0.007
11        619   0.020      690   0.010      543   0.023
12        195   0.019      229   0.021      119   0.012

Reviewer Characteristics

There is no difference among categories of reviewer in their tendency to rate proposals. The tendencies seen across the categories of reviewer (both professional status and years of membership) do not differ across the groups. Regardless of years of membership or professional status, all categories of reviewers use four about one third of the time, and use three and four over 50% of the time.

Author Characteristics

Author characteristics have some influence on the likelihood of being accepted. Authors vary by professional status and years of membership. Except for "Educator," each of the categories of professional status has an equal likelihood of acceptance (55% are accepted), while the "Educator" status has a 45% likelihood of acceptance. Author membership is also a significant predictor of acceptance.
All member groups are more likely to be accepted than the non—member, and more years of membership increases the likelihood of acceptance, until the years of membership reaches greater than 20 years. This is generally consistent with the literature that suggests that those who 92 have published in the past are more likely to have their work accepted. 93 CHAPTER 5 DISCUSSION In a data set of this size it is especially easy to make the error of conflating statistical significance with substantive significance. Statistical significance can be achieved with a small effect in such a sample size. But to make a sensible interpretation it is necessary to attend both to the probability figures (statistical significance) as well as the magnitude of any effect (substantive significance). The difference between statistical and substantive effect is clearly illustrated in the findings around reviewer agreement. Answers to Questions Reviewer Agreement The goal of the research organization’s peer review process, put most bluntly, is to accept good proposals and to reject bad proposals. Reviewers are thought to know what qualities make one proposal better than another and to be able to detect those qualities. For the process to be meaningful and effective reviewers must agree with one another. If reviewers disagree it is impossible to glean from their reviews anything other than their idiosyncratic 94 opinion. On the other hand, when reviewers agree, there can be confidence that they are responding to some quality within the proposal. If reviewers do not respond to qualities that inhere in the proposal, then peer review is an exercise in futility. Therefore, a central question is: Do reviewers agree? The findings show that the answer to the question is more complicated than a simple yes or no. The data show that agreement among reviewers occurs more frequently than expected by chance. This implies that reviewers do have a shared idea of what makes a proposal acceptable and unacceptable, and are able to discern that quality. In. other words, reviewers are responding to something in the proposal. But the findings also show that agreement of reviewers is not consistent across the rating scale. Reviewers are in greatest agreement about very low-scoring and very high-scoring proposals. There is strong evidence that in the middle range of scores agreement among reviewers is no greater than chance. This is because a large amount of the agreement found in the mid-range scores can be attributed to the fact that 57% of all scores given by reviewers were three or four on the five-point scale. The agreement that was found at the extremes scores is much greater than expected by chance, but it still amounts 95 to a small fraction of the total number of proposals. Over half of the proposals received scores from reviewers with substantial disagreement. From this it is reasonable to conclude that many of the decisions to reject or accept are the result of chance. Further, much of the apparent agreement among reviewers is attributable to the combination of two common rater problems. The reviewer pool tends to be lenient (i.e., reluctant to use scores of one and two), and to have a central tendency (i.e., a preference to rate proposals four). The centering of scores around four necessarily results in the appearance of agreement. But that agreement is an artifact of the distribution of scores and is not indicative of reviewers responding to the individual qualities of proposals. 
While it is the case that agreement among reviewers occurs more than expected by chance, it does not occur more than reasonably expected by those relying on the outcome of the process. It was found through the Monte Carlo simulation that one could expect 35% of the proposals to receive agreement in their ratings without any influence caused by the content of the proposal. That means that “greater than chance” implies only that more than 35% of the ratings show agreement. The fact that 55% of the actual 96 reviews showed significant disagreement casts into doubt the substantive fact of agreement. This is an example of statistical significance failing to translate into substantive significance. Efficiency Another question addressed was that of the efficiency of the peer review process. It was found that the 10 criteria were not being used efficiently. That is, there was no discrimination among the various criteria. There are at least two possible reasons for this. One is reviewers’ tendency to center their scores on four and vary from that score relatively rarely. Not only was this tendency seen across proposals, it was also found across criteria within prOposals. Another possible cause of the lack of discrimination among the criteria is that of a “halo effect” in the reviewers. A halo effect occurs when the reading of a proposal causes an overall impression that results in a similar score in all criteria. In such a case an “overall” score is taken as the score for each of the other criteria. The question remains unanswered as to whether the number of reviewers is an efficient number in determining the quality of the review process. Questions about the 97 efficiency of the number of reviewers were unable to be answered owing to the sparseness of the matrix and subsequent inability to run a valid G study analysis. Differences among Divisions and Years The analysis also focused on the comparison across years and across divisions. This permitted a test of the robustness of the findings, as well of the possible variations among areas of the organization. Across the divisions and through the years, reviewers were found to agree to a similar extent and to make their ratings in a similar range of scores. No significant differences in reviewer behavior were found between divisions or from year to year. Editorial influence varied widely across divisions and through the years. This points out the importance of the editorial role played by the division chair. Some chairs adhered closely to the reviewers’ scoring, while others made liberal use of their editorial prerogative. It is difficult to judge which editorial behavior is more appropriate. Among the proposal scores where most of the editorial choice was made, there is relatively little reviewer agreement, and a large amount of the agreement that is there is likely to be the result of chance. An 98 editor who makes a decision based on the scores is basing it on unreliable scores. Decision Based on Score Not surprisingly, there is a tendency for editors to accept higher scoring proposals and to reject lower scoring proposals. But there is evidence that this is based primarily on rejecting lower scored proposals and making extensive editorial choice about proposals with a mid-range score. As mentioned, there is considerable variation among the editors in their adherence to the reviewers’ scores. Some editors only rarely exercise editorial choice, while others make much more liberal use of it. 
Implications/Recommendations The implications of the findings depend to a great degree on the interpretations of the results and the desires of the research organization. The findings indicate that there are two main scoring problems: a central tendency of raters, and a failure to discriminate among the criteria. These scoring problems have at least two possible causes, each requiring a different solution. 99 Central Tendency Over 55% of the scores given were 3 or 4. This is a problem because it results in a large number of proposals being given similar scores. Because of the similar scores, it is impossible to discriminate among those proposals. Such a scoring pattern is the result of one or both of two possible reasons. One possible cause of the central tendency in scores is that proposals generally fall into that range of score. These might be accurate scores. If this is the case then the solution to the problem is to add precision to the scale by adding more steps, so that what had been scores of three and four are spread over a greater range. The other possible cause of the central tendency is a preference by reviewers to give three and four regardless of the quality of the proposal. If that is the source of the problem, then the solution involves encouraging reviewers to properly use the range of the scale. This problem can be addressed by a combination of greater reviewer training along with a clearer scoring rubric. Instead of providing only negative and positive anchors (for one and five), scoring could be improved by offering descriptors for each of the intermediate scores, thus encouraging use of the entire scoring scale. 100 Deciding which of these solutions will address the problem requires pilot testing both methods with a representative sample of reviewers and proposals. Halo Effect A second problem is the halo effect evidenced by reviewers failing to discriminate among the ten criteria. This is more clearly a problem with reviewer behavior than is the central tendency, but again, the solution depends on the interpretation of the result of this reviewer behavior. It is possible that the halo effect results only in the submission of an accurate overall score (the 10th criterion). Because the ultimate goal of the scores is the binary choice of acceptance or rejection, it is reasonable to reduce the ten criteria to a single holistic score. If such is the case, then a more efficient scoring system might be to have the reviewer offer only the single score. If, on the other hand, a separate and independent score for each of the criteria is desired, then additional training or instruction is required to achieve the desired discrimination among the criteria. This would result in a greater reliance on the editor to appropriately weight the various criteria to achieve the final decision of accept or reject. 101 Editorial Influence In addition to the problems associated with reviewer disagreement, and lack of discrimination among the criteria, editorial choice has significant influence over the results of the peer review process. Editors should be made aware of the degree to which they are choosing to overrule the choices of their reviewers. Further, the organization could provide guidance as to how much editorial influence should be exercised. Instead of making decisions according to the judgment of the reviewers, some editors overrule the decisions of the reviewers in 25% or more of the cases, while other reviewers more diligently adhere to the reviewers’ decisions. 
Other Conceptions of Peer Review This study has conceptualized peer review as a measurement process. As such it conceives peer review as intending to identify a quality within proposals that makes it worthy of acceptance. This is a reasonable conception given the formal process of collecting 30 numeric scores per proposal, and then using those numeric responses in selecting proposals for acceptance. But alternative conceptions are possible. 102 One alternative conception is that the process is as much for the reviewers and the organization as a whole, as it is for the authors and the selection of proposals. Reviewers who are involved in the process are engaged in reading their peer's work, and specifically attending to a set of criteria that are deemed important in scientific work. Reviewers are peers, and as peers are also engaged in the process of producing work that will pass through the peer review process. The act of reviewing others’ work has the effect of reinforcing the standards that are expected of acceptable work, as well as broadening the reviewer by exposing him or her to the work of peers. Limitations While this data set holds a great amount of information capable of providing insight to the peer review process of this large research organization, it also is limited in its ability to illuminate the larger field of peer review. This is for several reasons. Softer Science The general scope of the research done by the large research organization falls within what would be described as a “soft science." That is, the research deals with 103 social and psychological issues as opposed to the “hard sciences” of physics, or mathematics. Because of that, all of the findings should be interpreted within that context. Peer review literature suggests that reviewer behavior may be considerably different in the hard sciences. Prqposals, Not Completed Works This peer review process involves proposals, as opposed to completed works. Because it is understood that the proposals submitted represent work in progress, and because the submissions are made 6 to 8 months before the meeting is held, many proposals are submitted that may lack traits expected of completed work. This fact could lead to several effects. There may be greater acceptance of a work that is known to be one in progress than one that is completed. Perhaps another reason to tend toward lenience is that in editorial peer review as it is commonly practiced, an author receiving negative reviews often has a chance to make revisions and re-submit the work. In the case of this large research organization, there is no such opportunity. 104 Future Research Peer review is a field that calls for much more investigation. The difficulties in researching peer review are numerous, but the consequences of the peer review process are so broad reaching that it is important to better understand the process and to ensure that the process performs the task it is intended to perform. Future research suggested by this study might take several directions. Text Analysis One direction for future research is greater exploration of this study’s data set. While this study looked only at the categorical data around the authors and reviewers, and the discrete scores of the criteria, the data set also contains text comments from each reviewer. Future text analysis may uncover relationships among qualities of the comments, the scores achieved, and the decision made to accept or reject. 
Editorial Influence

There is much more to understand about the role of the editor. Toward that understanding, interviews of editors would be useful in gaining insight into the selection process and into their use of input from reviewers. How do editors view their role in the process? How does an editor make a decision about acceptance or rejection in the face of disparate reviews? What efforts are made at training reviewers?

Ultimate Publication

Many of the proposals submitted to the organization will result in papers that the authors will wish to publish in journals. Comparing the results of the journal peer review process with those of the research organization's process would provide another measure of the effectiveness of both. To accomplish this would require following up on the proposals to determine which ultimately resulted in publication.

Conclusion

The results of this study raise issues of serious concern to those who place the fate of their scientific and scholarly work in the hands of the peer review process. Unless a work is exceptionally good or exceptionally bad, there is little reason to have confidence that the decision is reliable. Reviewers fail to discriminate among proposals with mid-range scores, and editorial choice governs much of the decision-making process. However, these problems are not uncommon in measurement practice and are amenable to solution. A more complete scoring rubric, greater training of reviewers, and common guidance for reviewers can make the process less random.

APPENDIX A Tables of Reviews per Proposal

Tables A1, A2 and A3 detail the number of proposals submitted to each division, the mean number of reviews given per proposal, and the total number of reviews for each of the three years.

Table A1
Reviews per Proposal by Division for 2001

Division   Mean     N*     SD   Reviews
1          3.07    226   .830       694
2          2.84    136   .408       386
3          3.31    640   .748      2120
4          3.92    227   .970       890
5          2.67     87   .543       232
6          3.13     55   .336       172
7          2.89    273   .577       789
8          2.65    182   .627       483
9          4.89    106   .318       518
10         3.20    312   .482       999
11         3.12    748   .445      2335
12         2.94    214   .316       630
Total      3.20   3206   .742     10248
*Number of proposals submitted

Table A2
Reviews per Proposal by Division for 2002

Division   Mean     N*     SD   Reviews
1          3.45    252   .941       870
2          2.94    200   .670       588
3          3.40    712   .764      2421
4          3.66    318   .756      1164
5          2.91    126   .456       367
6          2.97     76   .431       226
7          2.89    317   .415       917
8          3.06    211   .303       645
9          5.05     84   .344       424
10         3.00    341   .675      1024
11         2.92    990   .648      2886
12         2.85    273   .378       778
Total      3.16   3900   .755     12310
*Number of proposals submitted

Table A3
Reviews per Proposal by Division for 2003

Division   Mean     N*     SD   Reviews
1          2.79    302   .446       843
2          3.13    212   .590       663
3          3.16    830   .564      2626
4          3.08    274   .364       844
5          2.72    115   .669       313
6          3.27     62   .632       203
7          2.28    443   .458      1009
8          2.10    252   .307       528
9          4.96     92   .205       456
10         3.06    351   .333      1073
11         2.72    829   .559      2256
12         2.74    207   .659       568
Total      2.87   3969   .686     11382
*Number of proposals submitted

APPENDIX B Figures Showing Accept and Reject by Division and Year

Figures A1 through A36 compare the scores of rejected and accepted proposals for each division and each year. Each figure plots the count of proposals against the mean score on the tenth criterion (q10mean).

Figure A1. Accept and reject, Division 1, Year 1.
Figure A2. Accept and reject, Division 1, Year 2.
Figure A3. Accept and reject, Division 1, Year 3.
Figure A4. Accept and reject, Division 2, Year 1.
Figure A5. Accept and reject, Division 2, Year 2.
Figure A6. Accept and reject, Division 2, Year 3.
Figure A7. Accept and reject, Division 3, Year 1.
Figure A8. Accept and reject, Division 3, Year 2.
Figure A9. Accept and reject, Division 3, Year 3.
Figure A10. Accept and reject, Division 4, Year 1.
Figure A11. Accept and reject, Division 4, Year 2.
Figure A12. Accept and reject, Division 4, Year 3.
Figure A13. Accept and reject, Division 5, Year 1.
Figure A14. Accept and reject, Division 5, Year 2.
Figure A15. Accept and reject, Division 5, Year 3.
Figure A16. Accept and reject, Division 6, Year 1.
Figure A17. Accept and reject, Division 6, Year 2.
Figure A18. Accept and reject, Division 6, Year 3.
Figure A19. Accept and reject, Division 7, Year 1.
Figure A20. Accept and reject, Division 7, Year 2.
Figure A21. Accept and reject, Division 7, Year 3.
Figure A22. Accept and reject, Division 8, Year 1.
Figure A23. Accept and reject, Division 8, Year 2.
Figure A24. Accept and reject, Division 8, Year 3.
Figure A25. Accept and reject, Division 9, Year 1.
Figure A26. Accept and reject, Division 9, Year 2.
Figure A27. Accept and reject, Division 9, Year 3.
Figure A28. Accept and reject, Division 10, Year 1.
Figure A29. Accept and reject, Division 10, Year 2.
Figure A30. Accept and reject, Division 10, Year 3.
Figure A31. Accept and reject, Division 11, Year 1.
Figure A32. Accept and reject, Division 11, Year 2.
Figure A33. Accept and reject, Division 11, Year 3.
Figure A34. Accept and reject, Division 12, Year 1.
Figure A35. Accept and reject, Division 12, Year 2.
Figure A36. Accept and reject, Division 12, Year 3.