THE PAST, PRESENT, AND FUTURE OF GRADUATE ADMISSIONS IN PHYSICS

By

Nicholas T. Young

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Physics – Doctor of Philosophy
Computational Mathematics, Science, and Engineering – Dual Major

2021

ABSTRACT

THE PAST, PRESENT, AND FUTURE OF GRADUATE ADMISSIONS IN PHYSICS

By Nicholas T. Young

While graduate admissions in physics directly affects only a small number of people on an annual basis, the number of people indirectly affected is orders of magnitude greater. Those who complete graduate degrees in physics will go on to become leaders in industry, government, and academia, with the latter educating the next generation of leaders in science and engineering. Given the possibly enormous consequences of our decisions in physics graduate admissions, care should be taken to ensure that the process is working effectively. The evidence, however, suggests it is not. Many inequities exist in the admissions process, unfairly keeping potentially great scientists from pursuing graduate studies. This dissertation seeks to understand what those inequities might be and how we might address them.

First, I study the admissions process in the physics department at a Midwestern, public university using the random forest algorithm, a machine learning method, to understand what drives their process. After finding that test scores and grades drive the process, I show that one of those tests, the physics GRE, does not give applicants the outsized advantage that it is claimed to provide. Given that the components that drove the admissions process contain inequities, the second half of the dissertation explores whether a rubric-based holistic admissions process might be able to address those inequities. Preliminary evidence suggests that it does. Finally, to ensure that the methods used in the previous chapters were appropriate, the dissertation concludes with a simulation study, finding that the methods used might lead to false negatives in conclusions.

Overall, this dissertation suggests that the current graduate admissions process in physics contains inequities and that rubric-based admissions might be able to address them. By addressing those inequities, everyone can be given a fair shot in the admissions process and physics as a discipline can work toward becoming more representative of the population. Failure to act only perpetuates the inequities that have kept and will continue to keep people out of physics.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help and support of too many folks to mention. However, I will attempt to do so here.

First, I would like to thank my dissertation committee for all the feedback and direction they have provided over the past few years in making this dissertation as strong as possible.

Second, I would like to thank my advisor, Danny Caballero, who consistently provided support and challenged me to be the best researcher I could throughout this process. With Danny, whenever I brought up a new research idea or direction, his question was never ‘why do you want to do that?’, but ‘how can I help you achieve it?’ This dissertation would not have been possible without his trust in allowing me to take this dissertation in the directions I wanted.

Third, I would like to thank the many members of PERL who have provided feedback through group meetings, practice presentations, manuscript reviews, and general conversations and provided a sense of community.
I’d especially like to thank Rachel Henderson, who has served as my unofficial co-advisor, for providing insights and new directions that helped me grow as a researcher. Additionally, I’d like to thank the many members outside of PERL who have also provided feedback and inspiration that appear in this dissertation, especially Devin Silvia, whose data visualization class had a significant impact on the figures appearing in this dissertation, and Odd Petter Sand, whose own data visualizations inspired those that appear in the introduction.

Fourth, I would like to thank the many people who provided and curated the data that appears in this dissertation, including Scott Pratt, Kirsten Tollefson, and Remco Zegers for providing data about Michigan State’s graduate program and Julie Posselt and Casey Miller for providing data collected by the Inclusive Graduate Education Network. In addition, I’d like to thank Nicole Verboncoeur and Tabitha Hudson, who spent many hours of their summer reading through applications to extract the necessary data.

Fifth, I would like to thank Kim Crosslan. My experience in the graduate program would not have gone as smoothly as it did without all of the support and assistance Kim provided. Regardless of what issue I encountered or problem I faced, the solution was always only an email or phone call to Kim away.

Sixth, I would like to thank my family for all their support over the years, both prior to and throughout graduate school.

Seventh, I would like to thank the world’s best two dogs, Kali and Xavier. Over the course of doing this dissertation, they have (literally) been by my side providing emotional support in between asking for belly rubs and play time.

Finally, I would like to thank my partner, Sarah. This dissertation would not have been possible without her. From the daily support she provided to her feedback and suggestions, she has helped me grow into a better person and a better researcher. I’m excited to see where our journey will go next.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 THE PAST, PRESENT, AND FUTURE OF GRADUATE ADMISSIONS IN PHYSICS: AN OVERVIEW
1.1 Establishing the need to study graduate admissions in physics
1.2 How this dissertation contributes to the field of physics education research
1.3 How this dissertation contributes to computational mathematics, science, and engineering
1.4 Summaries of the remaining chapters
1.5 Key conclusions in this dissertation
1.6 Key recommendations as a result of this dissertation
1.6.1 Recommendations for departments and graduate admissions committees
1.6.2 Recommendations for PER researchers
1.7 Questions remaining unanswered
CHAPTER 2 IN THE BEGINNING: USING MACHINE LEARNING TO UNDERSTAND PHYSICS GRADUATE SCHOOL ADMISSIONS
2.1 Introduction
2.2 Methods
2.2.1 Data
2.2.2 Describing Undergraduate Institutions
2.2.3 Justifying our choices of institutional factors
2.2.4 Random Forest Model
2.3 Results
2.4 Discussion
2.5 Limitations
2.6 Future Work and Conclusion
CHAPTER 3 FURTHER EVIDENCE AGAINST THE PHYSICS GRE: IT DOES NOT HELP APPLICANTS “STAND OUT”
3.1 Introduction
3.2 Background
3.3 Methods
3.3.1 Data
3.3.2 Probability of admission procedure
3.3.3 Mediation and Moderation Procedure
3.4 Results
3.4.1 Probability of admission results
3.4.2 Mediation and moderation results
3.4.2.1 Physics GRE and GPA
3.4.2.2 Institutional features
3.4.2.3 Demographic features
3.5 Discussion
3.5.1 Research Questions
3.5.2 Limitations and Researcher Decisions
3.6 Future Work
3.7 Conclusion and Implications
CHAPTER 4 RUBRIC-BASED ADMISSIONS: A NEW APPROACH TO GRADUATE ADMISSIONS IN PHYSICS
4.1 Introduction
4.2 Background
4.2.1 A typical admissions process in physics
4.2.2 Holistic Review
4.2.2.1 Noncognitive skills
4.2.2.2 Rubric-Based Review
4.3 Methods
4.3.1 Our Rubric and Applicant Evaluation Process
4.3.2 Participants and Data Collection
4.3.3 Analysis
4.4 Results
4.5 Discussion
4.6 Limitations
4.7 Future Work
4.8 Recommendations for Departments
4.9 Conclusion
CHAPTER 5 A “NEW APPROACH” OR THE SAME APPROACH IN NEW PACKAGING?
5.1 Introduction
5.2 Background
5.3 Methods
5.3.1 Data
5.3.2 Modeling
5.4 Results
5.4.1 Data Set 1a
5.4.2 Using a True Testing Set
5.4.3 Data Set 1b
5.4.4 Tomek Links
5.5 Discussion
5.5.1 Research Questions
5.5.2 Addressing whether our process changed
5.5.3 Limitations affecting our ability to address whether the process changed
5.6 Future Work
5.7 Conclusion
CHAPTER 6 UNDERSTANDING THE BIASES AND LIMITATIONS OF COMPUTATIONAL MODELS: A SIMULATION STUDY
6.1 Introduction
6.2 Background
6.2.1 Paradigms of Statistical Modeling
6.2.2 Explanatory Methods
6.2.2.1 Traditional Logistic Regression
6.2.2.2 Penalized Regression
6.2.3 Predictive Methods
6.2.3.1 Penalized Regression
6.2.3.2 Forest Methods
6.3 Methodology
6.3.1 Data Creation
6.3.2 Procedures
6.3.2.1 Forest Algorithms
6.3.2.2 Regression Algorithms
6.3.3 Neutral Comparison Study Rationale
6.4 Simulation Results
6.4.1 Forest Algorithm Results
6.4.2 Logistic regression results
6.4.3 Penalized regression results
6.4.3.1 Confidence interval approach
6.4.3.2 Bootstrap approach
6.5 Application to Real Data
6.5.1 Methods
6.5.2 Results
6.6 Discussion
6.6.1 Research Questions
6.6.2 Limitations and Researcher Choices
6.6.2.1 Our data sets
6.6.2.2 Hyperparameter tuning
6.6.2.3 Determining Detected Features
6.6.2.4 Assessing Our Models
6.7 Future Work
6.8 Conclusion and Recommendations
APPENDICES
APPENDIX A RANDOM FOREST BACKGROUND
APPENDIX B CHAPTER 3 ANALYSIS OF FEATURES
APPENDIX C CHAPTER 3 SUPPLEMENTAL FIGURES
APPENDIX D PHYSICS GRE SCORING PERCENTILES
APPENDIX E DEPARTMENT OF PHYSICS ADMISSIONS RUBRIC
APPENDIX F CHAPTER 4 SUPPLEMENTAL FIGURES
APPENDIX G CHAPTER 6 SUPPLEMENTAL FIGURES
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Variables used in our model and their scale of measurement
Table 2.2: Minimum, median, and maximum values of the metrics obtained over the 125 hyperparameter combinations
Table 3.1: Summary of the comparisons we analyzed, which group needs to stand out and which does not, and the figure number showing the results
Table 3.2: Counts of applicants by gender and race who provided both GPAs and physics GRE scores
Table 3.3: Distribution of applicants scoring in each physics GRE range by size of institution. ETS only publishes overall score distributions and hence, we cannot report national scores from only domestic students.
Table 3.4: Summary of the mediation and moderation results. * signifies partial mediation is present, ** signifies full mediation is present, † signifies moderation is present. However, no moderation effects were found.
Table 4.1: Percent of Missing Data by Rubric Construct
Table 5.1: The three models compared in this chapter and the data that went into each
Table 5.2: Minimum, median, and maximum values of the metrics obtained over the 125 hyperparameter combinations for models built from data set 1a
Table 5.3: Minimum, median, and maximum values of the metrics obtained over the 125 hyperparameter combinations for the models of data set 1b
Table 5.4: Metrics when using Tomek Links and MICE for each of the three data sets
Table 6.1: Log-F data augmentation example for a two-feature, m = 1 example. The last four rows are the augmented data.
Table 6.2: 2x2 contingency table of fractions for a generic binary feature.
Table 6.3: Examples of changing only one of the feature imbalance, outcome imbalance, or odds ratio for an N = 1,000 dataset.
Table 6.4: Feature and outcome imbalances for the binary features from actual graduate school admission data
Table 6.5: McFadden pseudo-R² values for the explanatory models
Table 6.6: AUC values for the various models on the four data sets
Table 6.7: Summary of advantages and disadvantages for each algorithm used in this study
Table D.1: Scaled physics GRE score and percent of applicants scoring lower than that score.
Table E.1: Rubric used by our department for evaluating applicants. The criteria for a high, medium, and low score are shown.

LIST OF FIGURES

Figure 1.1: Visual representation of the framework presented in Russ and Odden [2] with methods and methodology broken down according to Ding’s genres [12]. Dimensions that are in bold and expanded represent areas this dissertation advances in PER based on their framework, including population, context, and methods and methodology.
Figure 2.1: Averaged AUC feature importances over 30 trials. Physics GRE score, Quantitative GRE score, and undergraduate GPA, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted.
Figure 2.2: Averaged conditional feature importances over 30 trials. Physics GRE score and undergraduate GPA, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted when adjusting for correlations among the features.
Figure 2.3: Plot of all applicants’ physics GRE scores vs their undergraduate GPAs. The background coloring expresses the prediction of the model had an applicant had that score.
Figure 2.4: Proportion of the 125 hyperparameter combinations in which each feature had a given rank. Notice that there is a block of features that range between 1 and 5 and a block of features that rank between 7 and 14.
Figure 3.1: Visual representation of eqs. (3.1) to (3.3). The top graphic shows eq. (3.1) while the bottom graphic shows eqs. (3.2) and (3.3).
Figure 3.2: Visual representation of eqs. (3.5) to (3.7) showing serial mediation with two mediators.
Figure 3.3: Fraction of applicants admitted by undergraduate GPA and physics GRE score. The number of students in each bin is also shown. ‘Any’ corresponds to the corresponding row or column totals. The bin label corresponds to the upper bound of values in the bin exclusive, with the exception of the 4.0 GPA bin which includes 4.0. Values are colored based on whether they are above, below, or equal to the overall admissions rate. Admissions rates within 10% of the overall rate are colored the same as the overall rate. The above and below average colors are based on being above/below the midpoint between the max/min admission fraction and the overall average. These are based on raw numbers and not a statistical test.
Figure 3.4: A condensed version of Fig. 3.3 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score.
Figure 3.5: Fraction of applicants admitted by undergraduate GPA and physics GRE score and split by large or small undergraduate university.
Figure 3.6: Fraction of applicants admitted by undergraduate GPA and physics GRE score and split by selective or non-selective undergraduate university.
Figure 3.7: Fraction of applicants admitted by undergraduate GPA and physics GRE score and split by the applicant’s gender.
Figure 3.8: Fraction of applicants admitted by undergraduate GPA and physics GRE score and split by the applicant’s race.
Figure 3.9: Visual representation of the bootstrapped coefficients in eqs. (3.1) to (3.3). We do find evidence of the physics GRE score mediating the relationship between GPA and admission status.
Figure 3.10: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of the physics GRE score mediating selectivity and admission status but do not find evidence of GPA mediating selectivity and admission status. We do not find evidence of a serial mediating relationship. Statistically significant coefficients are in bold.
Figure 3.11: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of the physics GRE score mediating institution size and admission status but do not find evidence of GPA mediating institution size and admission status. We do not find evidence of a serial mediating relationship. Statistically significant coefficients are in bold.
Figure 3.12: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of the physics GRE score mediating gender and admission status but do not find evidence of GPA mediating gender and admission status. We do not find evidence of a serial mediating relationship. Statistically significant coefficients are in bold.
Figure 3.13: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of GPA mediating race and admission status and a serial mediation effect but do not find evidence of the physics GRE mediating race and admission status. Statistically significant coefficients are in bold.
Figure 3.14: A revised version of Fig. 3.4 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score when the cutoff score for a high physics GRE score is 670. Here, the number of applicants who could benefit from a high physics GRE score is approximately equal to the number of applicants who could be penalized by a low physics GRE score.
Figure 3.15: A revised version of Fig. 3.4 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score when the cutoff score for a high undergraduate GPA is 3.4.
Figure 3.16: A revised version of Fig. 3.4 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score when the cutoff score for a high undergraduate GPA is 3.6. Here, the number of applicants who could benefit from a high physics GRE score is approximately equal to the number of applicants who could be penalized by a low physics GRE score.
Figure 4.1: Faculty ratings of domestic applicants on 18 constructs. In the plot, a larger, darker circle means that more applicants are in that bin. While many applicants are in each level of the academic preparation and test score constructs, few applicants are in the “low” bin of the research, noncognitive skills, and program fit constructs.
Figure 4.2: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant was admitted. The distribution of ratings of all constructs is statistically different for admitted applicants compared to non-admitted applicants. Overall, most admitted applicants were rated “high” while most non-admitted applicants were rated “medium.”
Figure 4.3: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant was male or female. Only three of the constructs showed differences between males and females: physics GRE score, where males scored higher, and community contributions and diversity contributions, where females scored higher.
Figure 4.4: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a more selective or less selective undergraduate university. Only the general GRE and physics GRE scores showed differences.
Figure 4.5: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a university with a larger or smaller physics program. Only the physics GRE score and conscientiousness showed differences between the groups of applicants, with the latter dependent on how a larger physics program is defined.
Figure 5.1: Plot A shows Fig 2.3 with the Tomek Links marked. Filled points represent Tomek Links. Plot B shows the same plot after the Tomek Links have been removed.
Figure 5.2: Averaged AUC feature importances over 30 trials. Physics GRE score, undergraduate GPA, Quantitative GRE score, Verbal GRE score, and proposed research area, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted.
Figure 5.3: Slopeplot showing the ranks of each feature before the implementation of the rubric (left) and after the implementation of the rubric (right) using data sets 0 and 1a respectively. Features toward the top of the plot are more predictive. Features in orange were found to be the meaningful features needed to predict whether the applicant was admitted in their respective model. Notice that the ordering of the more predictive features is largely unchanged. Plot adapted from [216].
Figure 5.4: Averaged conditional feature importances over 30 trials. Physics GRE score and proposed research area, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted once correlations were accounted for.
Figure 5.5: Proportion of the 125 hyperparameter combinations in which each feature had a given rank for data set 1a. Notice that the plot is mostly diagonal and that physics GRE score and GPA are almost always the top two features.
Figure 5.6: Comparison of the testing AUC when A) Data Set 0 is used to train the model and B) when Data Set 1a is used to train the model. Training refers to the training AUC for the model. All error bars are 1 standard error. Results were averaged over 30 trials.
Figure 5.7: Comparison of the testing accuracy when A) Data Set 0 is used to train the model and B) when Data Set 1a is used to train the model. The null accuracy is shown in cyan with the shorter-in-height error bars. All error bars are 1 standard error. Results were averaged over 30 trials.
Figure 5.8: Averaged conditional feature importances over 30 trials for the models of data set 1b. Physics GRE score and quality of work, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted.
Figure 5.9: Averaged conditional feature importances over 30 trials for the models of data set 1b. Physics GRE score and achievement orientation, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted once correlations were accounted for.
Figure 5.10: Proportion of the 125 hyperparameter combinations in which each feature had a given rank for models of data set 1b. Notice that the plot is mostly diagonal and that physics GRE score, achievement orientation, and quality of work are always the top three features.
Figure 5.11: Plot A shows data set 0 with the decision boundary for a model with just the physics GRE score and undergraduate GPA (Fig 2.3) while plot B shows the data with the Tomek Links removed and the resulting decision boundary for the 2D model. Plot C shows the overlap of the admitted regions.
Figure 5.12: Plot A shows data set 1a with the decision boundary for a model with just the physics GRE score and undergraduate GPA. Plot B shows the data with the Tomek Links removed and the resulting decision boundary for the 2D model. Plot C shows the overlap of the admitted regions.
Figure 6.1: Distribution of binary features in the simulated π1+ = 0.5, N = 1,000 model.
Figure 6.2: Distribution of continuous features in the simulated π1+ = 0.5, N = 1,000 model.
Figure 6.3: Importance values for a subset of the random forest models. Feature names shown in black were constructed to be informative while feature names in grey were constructed to be noise. Plot A shows the N=1000 70/30 outcome imbalance case with the standard random forest algorithm and Gini importance, plot B shows the N=1000 50/50 outcome imbalance case with the standard random forest algorithm and accuracy permutation importance, plot C shows the N=100 50/50 outcome imbalance case with the conditional inference forest and AUC-permutation importance, and plot D shows the N=10,000 60/40 outcome imbalance case with conditional inference forest and accuracy-permutation importance. For all of the permutation importances, features with less imbalance tend to have larger importances than more imbalanced features for identical odds ratios.
Figure 6.4: The ranks of the informative features for the four importance measures, grouped by the sample size and outcome imbalance. Noise features are not shown and any feature ranked below a noise feature was assigned a rank of 0. Here, a larger circle reflects a higher rank, meaning the feature was more predictive of the outcome. Overall, features with lower imbalance rank higher than features with higher imbalance for a given odds ratio and the result is not affected by the outcome imbalance or the specific permutation importance or forest algorithm used.
Figure 6.5: Values of the odds ratios and 95% confidence intervals found by logistic regression models compared by outcome imbalance. Our built-in value is represented by the circled plus. Plot A is a sample size of N = 100, plot B is a sample size of N = 1,000, and plot C is a sample size of N = 10,000. Confidence intervals that span beyond the scale are removed from the plot. Note the log scale on the horizontal axis.
Figure 6.6: Analog of Fig. 6.4 but using logistic regression as the algorithm and statistical significance as the criterion for detection, α = 0.05. Plot A uses the Holm-Bonferroni correction to control for multiple tests while plot B uses the uncorrected p-values.
Figure 6.7: 95% confidence intervals for Firth penalized, traditional, and Log-F penalized logistic regression for the N = 100 data sets. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance. Confidence intervals that span beyond the scale are removed from the plot. For higher outcome imbalance, Firth and Log-F penalizations can considerably shrink the confidence intervals.
Figure 6.8: 95% confidence intervals for Firth penalized, traditional, and Log-F penalized logistic regression for the N = 1,000 data sets. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance. For higher outcome imbalance, Firth and Log-F penalizations can shrink the confidence intervals.
Figure 6.9: 95% percentile bootstraps of the odds ratio for Elastic net, Firth, Lasso, Log-F, no, and Ridge penalizations on the N = 100 data. Dots represent the median value. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance.
Figure 6.10: 95% percentile bootstraps of the odds ratio for Elastic net, Firth, Lasso, Log-F, no, and Ridge penalizations on the N = 1,000 data. Dots represent the median value. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance.
Figure 6.11: Comparison of the odds ratio (A), Gini importance (B), and AUC-permutation importance (C) for the features in school 1. Notice that RaceLatinx has a similar odds ratio as RaceBlack and RaceMulti according to (A) but only RaceLatinx is detectable in (C). RaceLatinx is less imbalanced than RaceBlack and RaceMulti.
Figure A.1: The confusion matrix counts the number of each predicted classification by the model and compares that to what the data indicates. In this case, a two-class system with binary classifications leads to a 2 x 2 matrix. For M classes, the matrix continues to be square and grows to be M x M.
Figure A.2: (a) Sample receiver operating characteristic (ROC) curves that demonstrate two models: one that is better than chance (blue) and one that is worse than chance (green). These ROC curves are plotted along with the chance line (orange dotted). Models that are demonstrably better than chance have ROC curves that tend towards the upper-left corner of the space as the arrow indicates. Models that are worse than chance tend towards the bottom-right corner. (b) For both models, the areas under the ROC curves (AUC) are shown (blue and green shading) and computed. AUC provides a measure of the quality of the model. It is indicative of the probability of accurately classifying a random sample from the data.
Figure B.1: Distribution of physics GRE scores and undergraduate GPAs by the size of the undergraduate physics program and institutional selectivity for each applicant.
Figure B.2: Distribution of physics GRE scores and undergraduate GPAs by gender and whether the applicant identified as a member of a racial or ethnic group currently underrepresented in physics.
Figure C.1: Admission fractions of applicants split by their gender and the selectivity of their undergraduate institutions.
Figure C.2: Admission fractions of applicants split by their gender and the size of their undergraduate institutions.
Figure C.3: Admission fractions of applicants split by their race and the selectivity of their undergraduate institutions.
Figure C.4: Admission fractions of applicants split by their race and the size of their undergraduate institutions.
Figure F.1: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant was male or female and whether they were admitted or not.
Figure F.2: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a more selective or less selective undergraduate university and whether they were admitted or not.
Figure F.3: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a university with a larger or smaller physics program and whether they were admitted or not.
Figure G.1: Comparison of the odds ratio, Gini importance, and AUC-permutation importance for the features in the school 2 admit data set.
Figure G.2: Comparison of the odds ratio, Gini importance, and AUC-permutation importance for the features in the school 3 shortlist data set.
Figure G.3: Comparison of the odds ratio, Gini importance, and AUC-permutation importance for the features in the school 3 admit data set.
Figure G.4: Plots of the log-odds vs the average residual in each bin for the four schools. Across all plots, between 20% and 34% of the points fall outside of the confidence intervals, suggesting the logistic regression models might not be fitting the data especially well.

CHAPTER 1
THE PAST, PRESENT, AND FUTURE OF GRADUATE ADMISSIONS IN PHYSICS: AN OVERVIEW

People like us who believe in physics know that the distinction between past, present, and future is only a stubbornly persistent illusion. -Albert Einstein

1.1 Establishing the need to study graduate admissions in physics

At first glance, studying graduate admissions in physics may seem like a small, trivial problem. After all, only a relatively small number of people are directly affected by the physics graduate admissions process. Using the number of test-takers of a commonly required applicant exam in physics, the physics GRE, as a proxy for the number of applicants, approximately 7,000 students apply to physics graduate programs annually [1]. In comparison, the first-year physics courses that are often the focus of physics education research [2] enroll nearly 425,000 students annually [3]. However, the number of people indirectly affected by physics graduate admissions is orders of magnitude larger. The applicants who are admitted to programs will go on to become leaders in academia, industry, and government in fields as diverse as energy, technology, national defense, and medicine [4]. In addition, some of the admitted applicants will go on to become faculty who will train the next generation of scientists, engineers, medical doctors, and science teachers.

Furthermore, graduate admissions has economic consequences for both students and taxpayers. Applicants who are admitted and earn their PhD have higher average salaries than those with only a bachelor’s degree [5], meaning that success in graduate admissions influences earning potential later in life. In addition, as many graduate students are indirectly supported by taxpayers via grants awarded by governmental agencies, departments have a duty to use taxpayer money wisely by admitting applicants who will be successful in their programs. When taking into account tuition, stipend, and overhead, training a single graduate student can cost taxpayers between a quarter and half a million dollars. For the department of physics at Michigan State, the estimated cost is $80,000 a year per graduate student. However, if admitted students are not supported throughout their programs, neither applicants nor taxpayers will see the benefits that can be afforded through graduate study.
Yet, despite the potentially large impacts of graduate admissions in physics, not everyone is given a fair chance in the process. Physics remains largely white and male even as the United States population becomes increasingly less white and higher education becomes less male-dominated [6, 7]. To stay in touch with the demographics of the country and the broader economy, physics as a discipline needs to reevaluate who is allowed to participate. Failing to do so risks physics losing out on the diversity of opinion and perspectives needed to advance as a discipline and do so ethically. Graduate admissions is but one small part of that process. In this dissertation, I will explore the historical approaches to graduate admissions in physics as well as offer a possible route toward achieving those goals of diversity and equity in the process.

1.2 How this dissertation contributes to the field of physics education research

There is no consensus about the subfields of physics education research (PER), though various attempts to define them have been made over the years, both broadly [2, 8–10] and for specific populations and topics [11–13]. My work fits most naturally into Russ and Odden’s framework [2], so I will map it onto that. Russ and Odden classified education research along seven dimensions: discipline, phenomenon, population, context, methods & methodology, theoretical & conceptual framework, and epistemology. This dissertation contributes to PER in the dimensions of population, context, and methods/methodology, so I will focus on those. A visual representation is shown in Fig. 1.1.

[Figure 1.1: Visual representation of the framework presented in Russ and Odden [2] with methods and methodology broken down according to Ding’s genres [12]. Dimensions that are in bold and expanded represent areas this dissertation advances in PER based on their framework, including population, context, and methods and methodology.]

First, this dissertation contributes to PER by focusing on graduate students and the graduate admissions process. Traditionally, the population of PER studies has been undergraduate students in introductory physics courses [2, 13]. More recently, however, PER has expanded to focus on upper-division undergraduate students [14–19], non-physics majors [20, 21], graduate students [22–26], K-12 teachers and students [27–33], instructors, teaching assistants, and learning assistants [34–42], and even institutions themselves [43, 44]. This dissertation then adds to the growing body of work regarding graduate students in physics. Studying graduate admissions specifically has only become an area of inquiry recently, with most of the relevant research published within the last five years [45–56].

Second, this dissertation contributes to PER by focusing on contexts that have received less attention in the literature. When thinking about context, Russ and Odden leverage Bronfenbrenner’s ecological systems theory [57] to describe the environment where the research study takes place. Ecological systems theory envisions individuals existing in a set of systems that describe the contexts and interactions that may inform the development of the individual. For example, the microsystem is the individual’s direct environment, the mesosystem is the set of connections across microsystems, the exosystem is the indirect environment of the individual, and the macrosystem is the societal norms and cultural values relevant to the individual.
Given that PER has traditionally focused on undergraduate students in introductory physics classes, its context is typically the microsystem [2]. Thinking in terms of Nair’s mapping of ecological systems theory onto a physics classroom [58], graduate admissions can be thought of as an exosystem and macrosystem phenomenon. Students do not actively participate in the process but are clearly affected by it (exosystem), and cultural norms about who can do science as well as beliefs about what is required to do physics (e.g., innate brilliance [59]) might affect faculty’s decisions (macrosystem).

Finally, this dissertation contributes to PER by using data analysis techniques that have only recently been incorporated into PER studies, such as machine learning, Tomek Links, and simulation. Traditionally, PER has been divided into qualitative research and quantitative research [2]. Ding [12] then subdivided quantitative research into three genres: measurement, controlled exploration of relations, and data mining, which is the predominant genre in this dissertation. As Romero and Ventura [60] and Cope and Kalantzis [61] note, data mining relies on data that has already been collected and hence, the research questions that can be addressed may be limited. However, instead of thinking of genres of quantitative research, we can think in terms of the specific methods used. Russ and Odden claim that basic statistical techniques and large-N summary and frequency analyses are the most common in PER, with network analysis and methods for large data set analysis becoming increasingly common [2]. To this, I will add the umbrella term of modeling, in which researchers try to develop some quantitative representation of the phenomenon of interest. Modeling can then be further broken down into two broad categories: explanatory modeling and predictive modeling [62]. While explanatory modeling is a traditional PER method (Theobald et al. [63] provide an overview of these methods), predictive models are an emerging area, often using machine learning techniques. Machine learning, a method used throughout this dissertation, is itself an emerging area of PER, with most studies utilizing it having been published in the last five years [64–69]. Finally, as a result of modeling methodologies, a new genre of quantitative PER has started to emerge: simulation. In this genre, researchers create artificial data to study the methods of PER themselves. Simulation studies are still rare in PER, with only a few such studies published to date [70–72]. This dissertation increases that number by one.

1.3 How this dissertation contributes to computational mathematics, science, and engineering

As the department of computational mathematics, science, and engineering is unique to Michigan State University, a research-based framework to describe the types of work conducted under the umbrella of the department does not exist. Instead, I will use the “triple junction” of computation on which it was formed [73]. The department defines the “triple junction” of computation as algorithm development and analysis, high performance computing, and applications to scientific and engineering modeling and data science. This dissertation focuses on the last of these. Data science techniques such as machine learning, feature engineering, and simulation have seen limited use in physics education research. This dissertation then brings data science tools into a new field, adding additional tools to the physics education researcher’s tool kit.
In addition, the interdisciplinary nature of this dissertation, combining computational techniques to answer educational questions, aligns with the broader goals of the department.

1.4 Summaries of the remaining chapters

This dissertation considers the past, present, and future of graduate admissions. Chapters 2 and 3 focus on the past and present, Chapters 4 and 5 focus on the present and future, and Chapter 6 extends across dimensions of time, seeking to understand the limitations of models used in the previous chapters and to be used in future analyses.

In Chapter 2, I analyze 4 years of admissions records to Michigan State University’s graduate physics program using a machine learning method known as random forest. I find that, consistent with surveys of admissions committees and observations of committees, quantitative parts of the application such as GRE scores and undergraduate GPA hold the most weight in the process. In fact, knowing only the applicant’s undergraduate GPA, physics GRE score, and quantitative GRE score was sufficient to predict with 75% accuracy whether an applicant would be admitted.

In Chapter 3, I then explore the physics GRE in more depth and show that a common argument for keeping the physics GRE in the admissions process despite its documented issues, that it helps applicants who might be otherwise missed “stand out,” is not supported by evidence. To reach this conclusion, I analyzed admissions records from five universities with both large-N frequency analysis and mediation and moderation models. Whether I defined applicants who might be missed as having a low GPA, graduating from a less selective undergraduate school, or graduating from a smaller undergraduate program did not affect the conclusion.

Given that the traditional methods of admissions have not resulted in substantial changes in the demographics of physics at the graduate level and those methods contain inequities in and of themselves, graduate admissions does not need to be revised, but rather rethought. Chapters 4 and 5 provide one such approach: rubric-based holistic admissions.

In Chapter 4, I introduce rubric-based holistic admissions, defining it as an approach to admissions where reviewers consider a broad range of applicant characteristics such as academic achievement, research experience, fit with the program, test scores, and noncognitive competencies, and rate applicants on those categories according to a pre-determined scoring rubric. I then consider the department of physics at Michigan State University’s revised admissions process as an example in practice. I compared the distribution of faculty ratings by admission status, sex, and undergraduate background. The results indicate that the rubric does not show any unexpected inequities based on the applicant’s sex or undergraduate background. It did, however, detect systematic issues such as test score differences and differing service-work expectations based on sex.

In Chapter 5, I consider whether our rubric-based holistic admissions is actually a rethinking of graduate admissions or just a revision. That is, does switching from the traditional admissions to rubric-based admissions fundamentally change the process? I again use the random forest method to analyze the admissions data after the implementation of the rubric as well as introduce a new technique to PER, Tomek links.
I first compare the two admissions processes using the same data extracted from applications and then consider what additional insights might exist by looking at the rubric-ratings data. Across four sets of analyses, the results suggest that rubric-based holistic admissions is a rethinking of graduate admissions, though additional work is needed to provide greater confidence in the results.

Finally, in Chapter 6, I consider the modeling techniques I’ve used in the previous chapters, in addition to others, and how they perform under various data distributions encountered in PER and the previous chapters. Across the techniques, I find that the more a binary variable is imbalanced (e.g., an 80/20 split rather than a 50/50 split), the less likely a technique is to find that variable predictive or explanatory of an outcome. I then show that the effect also appears in real PER data by focusing on the admissions processes at three institutions.

1.5 Key conclusions in this dissertation

The results of the chapters then suggest three overarching conclusions.

First, the traditional graduate admissions process in physics is metrics-heavy and dominated by the physics GRE, whose value in the admissions process should be questioned. (Chapters 2 and 3)

Second, rubric-based admissions might offer a possible path forward in terms of making the process more equitable. While the components of the rubric seem to be equitable, we were not able to produce definitive evidence that rubric-based admissions sufficiently departed from the traditional admissions structure. However, the current results are promising. (Chapters 4 and 5)

Finally, modeling techniques in PER have biases that may affect which features are counted as statistically significant or predictive. If these modeling techniques are going to be used more broadly to make educational policy decisions (e.g., in graduate admissions), we need to thoroughly understand the models, including how they work and their limitations and caveats. (Chapter 6)

1.6 Key recommendations as a result of this dissertation

As a result of work done in this dissertation, I propose six key recommendations, three targeted toward departments and graduate admissions committees and three targeted toward researchers.

1.6.1 Recommendations for departments and graduate admissions committees

First, if the physics GRE is being used to identify applicants who might otherwise be missed, I recommend against that. My study in Chapter 3 did not find evidence that the physics GRE allows that to happen in practice.

Second, departments should rethink their graduate admissions process in terms of what applicants are evaluated on, how those are evaluated, and who does the evaluating. Rubric-based admissions, introduced in Chapter 4, seems to offer one way to do so. However, implementing rubric-based holistic admissions requires the department to take an active role in rethinking their process. Departments must decide how to address those three points, as answers will depend on specifics of the department.

Finally, departments should engage in regular self-study of their processes and share the results so that the physics community has an idea of what works and what does not work. Currently, data about admissions practices is either reported in aggregate or individually for a limited number of programs. Greater reporting from many programs will allow for a better picture of how admissions processes are conducted and how they might be made more equitable.
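To make the feature-imbalance effect summarized in the conclusions above concrete (and to motivate the transparency recommendation for researchers that follows), the sketch below builds two binary features with the same odds ratio to the outcome but different marginal splits and compares their permutation importances. This is an illustrative sketch only, assuming a scikit-learn workflow rather than the conditional inference forests and full simulation design used in Chapter 6; the variable names and parameter values are placeholders.

```python
# Illustrative sketch (not the dissertation's simulation code): two binary features
# carry the same coefficient in the data-generating model, but one is split 50/50
# and the other 90/10. The more imbalanced feature tends to receive a smaller
# permutation importance even though its odds ratio is identical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 10_000
x_balanced = rng.binomial(1, 0.5, n)    # 50/50 split
x_imbalanced = rng.binomial(1, 0.1, n)  # 90/10 split
x_noise = rng.binomial(1, 0.5, n)       # unrelated feature

# Identical log-odds coefficients (odds ratio ~ e^1) for both informative features.
logit = -0.5 + 1.0 * x_balanced + 1.0 * x_imbalanced
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([x_balanced, x_imbalanced, x_noise])
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

result = permutation_importance(forest, X, y, n_repeats=20,
                                scoring="roc_auc", random_state=0)
for name, mean in zip(["balanced", "imbalanced", "noise"], result.importances_mean):
    print(f"{name:>10s}: {mean:.4f}")
```

Running a sketch like this typically shows the balanced feature ranked well above the equally informative but imbalanced one, which is the kind of false negative the recommendations below caution against.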
1.6.2 Recommendations for PER researchers

First, researchers should be transparent about the data that goes into their models and how the distribution of those features may affect the results. For researchers and practitioners to evaluate the conclusions presented in papers, they need to understand the data itself. My work in Chapter 6 showed that the split of a binary feature can affect whether an algorithm finds it statistically significant or predictive of an outcome. As a result, it is possible that the literature contains false negatives, where potentially important variables were missed.

Second, researchers should consider more modern techniques for analyzing data such as machine learning and penalized regression. Machine learning techniques are becoming more common in PER but are still a niche area. Penalized regression techniques have seen limited use in PER but appear to handle data as well as, if not better than, standard logistic regression. In addition, researchers should follow the recommendation of Aiken et al. and compare different models to produce the best-fitting model [74].

Finally, researchers should conduct more simulation studies to examine the methods used in PER. Data in PER is often a mix of binary, categorical, and continuous features and hence, algorithms developed in fields that analyze primarily continuous data may not perform as expected. Simulation studies would be able to verify whether this is the case and whether researchers should be concerned. Furthermore, results of such studies might lead to further methodological development.

1.7 Questions remaining unanswered

While this dissertation extends the field’s understanding of graduate admissions in physics, many questions remain unanswered, specifically around rubric-based admissions and equity in graduate admissions, providing an avenue for future work.

First, in order to reach a more definitive conclusion as to whether our department’s admissions process changed, future work could consider alternative approaches to analyzing the data, such as mixed methods. On the quantitative side, these alternative approaches could be more traditional methods in PER like logistic regression or clustering-type methods. The latter may be able to tease out whether there are different “types” of successful applicants and possibly provide evidence as to whether the process became more holistic. On the qualitative side, future work could include interviewing faculty on the admissions committee or observing deliberations in real time. Such data would provide greater insight into the process, especially regarding individual applicants. Quantitative methods try to summarize and simplify the data, causing us to potentially miss data-rich discussions of applicants. Qualitative methods might allow us to see these cases, especially discussions around borderline applicants who might expose what faculty are really valuing in an applicant.
Third, future work should examine other aspects of the graduate admissions process such as who is invited and able to apply to graduate school. Prior work has examined barriers current graduate students experienced when they applied [49]. However, studies such as these ignore students who wanted to apply but for one reason or another, did not. Therefore, future work should explore the barriers these students face in applying and how departments might address them. These might include how to support students in the application process (e.g, financially), how students find out about programs, and how departments advertise their graduate programs. Fourth, future work should continue along the path outlined above and consider undergraduate physics students more broadly in graduate admissions. The evaluation of applications is the final step in the graduate admissions process that could be argued to begin as soon as a student declares a physics major. Between those two steps, potential applicants must decide whether they are interested in attending graduate school, research potential programs and advisors, complete an application, and submit the application. Even after departments have evaluated the applications and made offers, students must decide which program to attend or what to do if they are not accepted to any. All of these are ripe for future study. Finally, to truly make an impact on diversity and equity in physics longer-term, future work needs to extend beyond just the admissions process and consider graduate school as a whole. Departments need to address diversity by examining who is encouraged and invited to apply to graduate school, 10 equity by considering how they evaluate applicants applying to their program and current students in their program and what they are doing to retain the students they did admit, and inclusion by intentionally addressing the climate in their department, being transparent with decisions being made, and creating support structures for students from underrepresented backgrounds. Simply making the admissions process more equitable and admitting more diverse students will not make an impact if corresponding efforts are not made to retain such students and prevent them from being pushed out. Possible areas of future work include studying qualifying and comprehensive exams, student-advisor relationships, mental health, and departmental support for students. Only when diverse students are not only actively welcomed to the academy but also actively retained will real change happen. 11 CHAPTER 2 IN THE BEGINNING: USING MACHINE LEARNING TO UNDERSTAND PHYSICS GRADUATE SCHOOL ADMISSIONS The following chapter is adapted and expanded from its published version in the 2019 Physics Education Research Conference [75]. The published version includes Marcos D. Caballero as the second author. Following the Contributor Roles Taxonomy (CRediT) [76], my roles for this project include conceptualization, formal analysis, methodology, software, validation, visualization, and writing the original draft. 2.1 Introduction Despite other science, technology, engineering, and mathematics (STEM) fields becoming more diverse over the past few decades, physics has lagged behind with only 20% of bachelor’s degrees awarded to women and only 11% awarded to racial minorities [77]. These numbers do not improve when considering graduate degrees, where 20% of doctoral degrees are granted to women and 7% are granted to racial minorities [77]. 
While this underrepresentation has both enrollment and retention causes, this chapter will focus on the factors that may affect enrollment in physics graduate programs. When considering enrollment in physics PhD programs, prior work has found that minority students in physics are less likely to apply to programs if they feel that they will not be admitted based on low GPA or GRE scores, or lack of research experience [49]. Further, given that many graduate programs have application fees, financial concerns might prevent students from applying to graduate programs that they believe they will not be admitted to. Therefore, it is important to understand what matters when applying to physics graduate programs while acknowledging that many factors that are not easily quantifiable matter. Previous research into graduate admissions in physics has tended to take a broad approach, characterizing the graduate admissions process across the United States, both for master’s and PhD 12 programs [47, 50] or focusing on a specific subset of universities such as elite universities [46]. These studies find that faculty consider numerical measures such as undergraduate GPA and GRE scores most important in the admissions process and have been conducted by either observing the admissions process or by surveying faculty about what they believe to be most important in the admissions process. More recent work in physics graduate admissions has explored applicant perceptions of the various components of the application [51]. Missing in this analysis is an investigation of the actual applications of prospective physics graduate students. To our knowledge, there has only been one such study [54]. Given that applications to graduate programs consist of numerical data such as GPA and GRE scores, categorical data such as gender, race, and ethnicity, and open-ended data such as letters of recommendation and personal statements, graduate admissions is an ideal target for machine learning. Indeed, machine learning approaches to understanding graduate admissions have been employed in computer science to study self-reported admissions data [78] and to streamline the review process [79]. Machine learning methods have also been employed more broadly in higher education admissions to predict which admitted students will accept an offer to attend a small liberal arts school [80, 81] and to predict which students are likely to be admitted and to complete their MBA [82]. The goal of this work is to further the study of graduate admissions in physics by analyzing the applications using a machine learning approach. Specifically, we ask what features of an application to this physics graduate program are predictive of admission. Unlike other studies in physics graduate admissions, this work represents a case study of a single institution rather than a broad look at the graduate admissions landscape. However, because physics is regarded as a high consensus discipline, that is, there is large agreement about what counts as legitimate admissions practices [83], we expect our results can generalize to similar doctoral programs. 13 2.2 Methods 2.2.1 Data The data used in this study comes from the admissions records of 512 domestic applicants to the physics and astronomy graduate program at Michigan State University between 2013 and 2016 and would have enrolled between fall 2014 and fall 2017. This time period is before the department implemented the rubric described in Chapter 4. 
Domestic and international applicants do not undergo the same review process and hence we only analyze applications from domestic students. Here, domestic student is defined to be a U.S. citizen or permanent resident. The admissions process is unique at this university in that the applications are not only reviewed by a central committee but also members of the subdisciplines in which the student expresses interest. The data include the applicant’s undergraduate institution and grade point average (GPA), their general and physics GRE scores, and their physics subdisciplines of interest. Per a ballot initiative in the state of Michigan, Michigan State University and the other Michigan public universities are explicitly prohibited from discriminating against or granting preferential treatment to individuals based on race, sex, color, ethnicity, or national origin in education [84]. To comply with this law, our university’s admissions system collects limited demographic data and our department chose not to record the information that was available when evaluating applicants. As such, demographics are not available to us. Overall, 48% of the domestic applicants were offered admission into the program. 2.2.2 Describing Undergraduate Institutions Because the name of the undergraduate institution in itself does not provide useful information to an algorithm, we created new factors to describe characteristics of the institutions. To describe the overall institution, we classified each institution as public or private, whether it is a minority serving institution (MSI), the region of the country it is located in (such as Northeast, Southwest, etc.), and the Barron’s selectivity of the institution, which describes how selective the undergraduate program is. We assume that selectivity serves as a proxy for prestige. Classifications for the 14 Table 2.1: Variables used in our model and their scale of measurement Factor Measurement Scale Undergraduate GPA Continuous Verbal GRE score Continuous Quantitative GRE score Continuous Written GRE score Continuous Physics GRE score Continuous Proposed research area Categorical Application year Categorical Barron’s selectivity Categorical Region of applicant’s undergraduate institution Categorical Type of physics program at applicant’s undergraduate institution Categorical Size of undergraduate physics program at applicant’s Categorical undergraduate institution Size of doctoral physics program at applicant’s undergraduate Categorical institution Applicant attended a minority serving institution Binary Public or Private Binary Output variable: admitted status Binary first three categories were taken from the most recent Carnegie Rankings [85] while the Barron’s classification came from Barron’s Profiles of American Colleges. Because the overall reputation of the applicant’s undergraduate university might not describe the physics program at that university, we also included factors related to the physics program such as the highest physics degree offered at the university and the size of the undergraduate program and PhD program if applicable. The size of the undergraduate and PhD programs were determined by the median number of graduates of the program between the 2012-2013 and 2015-2016 academic years (i.e. the years that applicants applied to the program). The programs were then classified as small, medium-small, medium-large, or large based on which quartile they fell into. 
We used the Roster of Physics Departments with Enrollment and Degree Data to collect this data [86–89]. All factors appearing in our model are shown in Table 2.1 and include the scale of measurement. 15 2.2.3 Justifying our choices of institutional factors Prior work has documented university pedigree is often considered in the application process because institutional quality is assumed to be a proxy for student quality [46, 90]. Here, we measure institutional quality by Barron’s selectivity and public or private status, with the assumption that physics faculty view private universities as more prestigious than public universities. We include region of the applicant’s undergraduate university to account for the fact that the institution being studied is a public university and might therefore show a preference for students from the surrounding region. Prior work has also found faculty exhibit a tendency to admit students like themselves, though it is more common among academics who graduated from elite institutions [46]. Therefore, it is not unreasonable to expect that faculty may prefer to admit students who followed similar paths as they did, meaning students from large, doctoral institutions might be more likely to be admitted than students from smaller institutions. Additionally, we use the size of the undergraduate and PhD programs as proxies for the perceived prestige of the physics department, assuming a more prestigious physics department attracts more students and hence graduates more students. 2.2.4 Random Forest Model To analyze our data, we used the conditional inference forest algorithm, a variant of the random forest algorithm [91] shown to be less biased when the data includes both continuous and categorical variables [92] such as those used in our model (see Table 2.1). Random forest models in general are ensembles of individual decision trees, which use binary splits of the input features in order to make a prediction. The predictions are then averaged over the individual trees to obtain the overall prediction of the random forest. While there are multiple metrics used to assess random forest and other machine learning models, two of the most common are the accuracy and the area under the curve (AUC). The accuracy is simply the proportion of correct predictions made by the model. To ensure that the accuracy isn’t inflated by overtraining, only a fraction of the available data is used to construct the 16 model while the rest is used to test the predictive power. It is this remaining data that is used to calculate the accuracy of the model. The AUC is defined as the area beneath the receiver operator curve of the model, which visualizes the false positive rate against the true positive rate and varies between 0.5 and 1, with values greater than 0.7 signifying an acceptable model [93]. The area describes the proportion of positive cases that are ranked above negative cases in the data set by the model. For example, for our data, the AUC would represent the proportion of all random pairs of admitted and not-admitted applicants in which the admitted applicant is classified as admitted and the not-admitted applicant is classified as not-admitted. In addition to making predictions, the random forest algorithm can determine the importance of each feature to the model, referred to as the feature importance. For this analysis, we use two importance measures. 
First we used the AUC permutation feature importance [94] as it is claimed to be less biased than the accuracy based permutation importance when input features differ in scale (as do our factors listed in Table 2.1) and when the predicted variable is not split evenly between the two outcomes. Under this approach, each feature is randomly permuted and then passed through the model to make a prediction. The AUC is then recorded and the difference between this value and the original AUC is computed. As permuting a feature with more predictive information should result is a worse model than permuting a feature with less predictive information, a larger difference between the original AUC and the AUC with a permuted feature suggests that that feature contained more predictive information. These differences can then be used to create a relative ordering of features. However, if the features are correlated, it is possible that the orderings may be biased or that permutations of one feature might result in unrealistic combinations of features and hence would cause the model to extrapolate performance [95]. For example, if all students who earned perfect scores on the physics GRE also had high GPAs, permuting GPA could cause there to be cases where a perfect physics GRE score goes with a low GPA, which would be outside of the region learned by the model. To prevent that, a conditional importance measure has been proposed in which features 17 are permuted within a subset of similar cases [96]. Because of the correlations between various sections of the GRE, we also used this conditional approach to compute feature importances. Feature importances are derived from the data and hence, are not assumed to follow any statistical distribution. Therefore, there is no simple way to apply the idea of statistical significance to feature importances, though Chapter 6 provides some suggestions. We instead applied the recursive backward elimination technique described in Díaz-Uriarte and Alvarez de Andrés [97] to determine which features are predictive of admission and which are not. When using this technique, the features are ordered according to their importance. A model is then built using all the features and the accuracy is computed. A set fraction of the features with the smallest importances are then removed and a new model is built and the accuracy computed. This process continues until only 2 features are left. The model with the fewest number of features while maintaining an accuracy within a standard error of the highest accuracy across all models built in this process is then the selected model. We will refer to the features used in this selected model as the meaningful features and interpret them as the features that are predictive of the outcome. For more information about random forest models, biases, and feature importance measures, see Appendix A. We chose to apply a random forest model instead of a more traditional technique for classifying data such as logistic regression (as used by Attiyeh and Attiyeh [98] and Posselt et al. [54] to study graduate admissions) due to these feature importances. As feature importances measure all factors on the same scale, that is how much they change the area under the curve, factors of otherwise different scales can be compared. 
This is in contrast to logistic regression where the odds ratio for a continuous variable would measure the change in odds for a unit increase in the variable while the odds ratio for a categorical or binary variable measures the change in odds relative to a reference group. In addition, the feature importances allow for each categorical feature as a whole to be compared to the other features rather than in pairs relative to the reference group. To perform the analysis, we used R [99] and the party package [92, 96, 100] to create a conditional inference forest model. We used 70% of our data to train the model, 500 trees to build √ our forest and used 𝑝 as the number of randomly selected features to use to build each tree, with 𝑝 18 being the total number of features in the model. These values follow recommendations of Svetnik et al. [101]. We ran our model 30 times, randomly selecting 70% of our data for training each time, and averaged the feature importances over runs so that the resulting distribution of individual feature importances would be approximately normal. As the conditional inference forest algorithm has routines built in to handle missing data [102], applicants with missing information were not removed from the data set. However, the conditional importance approach requires there to be no missing values so we used the MICE algorithm [103] to fill impute the missing data in that case, following Nissen et al.’s recommendation for PER [71]. The imputation results were pooled using Rubin’s Rules [104]. In addition, to determine if our model was dependent on our choice of hyperparameters, we also varied the fraction of data to train the model, the number of trees in the forest, and the number of randomly selected features to use to build each tree. We set the training fraction to be either 0.5, 0.6, 0.7, 0.8, or 0.9, the number of trees in the forest to be 50, 100, 500, 1000, or 5000, and √ the number of features used for each tree to be 1, 𝑝, 𝑝/3, 𝑝/2, or 𝑝 for a total of 125 possible combinations (124 new and the original model). These choices are based off findings in Svetnik et al. [101]: namely that the error rates level off once the number of trees is on the order of 102 and their choices of the number of features in each tree. In addition, increasing the training fraction may improve performance as there is more data for the model to learn from. For each combination, we repeated the procedure in the previous paragraph. Due to the computational cost of the conditional permutation approach, we only calculated the AUC-permutation importance. To determine if the changing the hyperparameters affected our models, we computed the minimum, median, and maximum value of each metric over the 125 hyperparameter combinations and relative ordering of the features in each model. 
We chose the minimum, median, and maximum instead of the mean and standard error because 1) we are looking across different models rather than getting repeated measurements of the same things so we cannot assume the results will be normally distributed and 2) we are interested in the best and worst performance achieved under hyperparameter tuning to get a sense of the possible values we can achieve which wouldn’t be 19 Physics GRE Score Quantitative GRE score Grade Point Average Verbal GRE score Proposed Research Area Year of applying Feature Size of UG physics program PhD Barron Selectivity Writing GRE score Size of UG physics program, bach Region of UG program Highest physics degree offered Attended a MSI Attended a public institution 0.000 0.025 0.050 0.075 0.100 AUC Mean Importance Figure 2.1: Averaged AUC feature importances over 30 trials. Physics GRE score, Quantitative GRE score, and undergraduate GPA, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted. possible using the mean and standard error. If our model is largely unaffected by the choice of hyperparameters, we would expect the metrics to show minimal variation and the relative ordering of the features to be largely unchanged. 2.3 Results Across the 30 runs, the average accuracy of our model predicting on the held-out data was 75.6% ± 0.6%, the average training AUC was 0.849 ± 0.002, and the average testing AUC was 0.756 ± .006. As our model’s accuracy is significantly higher than the null accuracy of 52.7%, the percent of students who were not accepted, and our testing AUC is above 0.7, our model can be considered an acceptable model of the data. 20 The feature importances averaged over the 30 runs are shown in Fig. 2.1. We find numerical factors such as the applicant’s score on the physics GRE, the applicant’s score on the quantitative GRE, the applicant’s undergraduate GPA, the applicant’s verbal GRE score, and their proposed research area to be more important in the application process than any factor describing the applicant’s undergraduate institution. Using recursive backward elimination to determine the meaningful factors, we find the applicant’s physics GRE score, quantitative GRE score, and their undergraduate GPA to be the only meaningful factors. To verify that the applicant’s physics GRE score, quantitative GRE score, and undergraduate GPA were indeed the only meaningful factors, we then reran our random forest model 30 times using only these three factors as the predictors. Our average testing accuracy was then 75.4%±0.6% and our testing average area under the curve was 0.754 ± 0.006, which are not statistically different from the values we found using all fourteen factors shown in Table 2.1. When we instead used MICE and the conditional importances, and the metrics were slightly higher, likely because imputing the missing values provided more data for the algorithm to learn from. Specifically, the testing accuracy was 77.1% ± 0.1% and the testing AUC was 0.770 ± 0.001. The conditional feature importances are shown in Fig. 2.2. Compared to Fig. 2.1, we notice that the verbal and quantitative GRE scores are ranked lower than they were when we did not take correlations into account and proposed research area and year of applying are ranked higher than when we did not take correlations into account. The physics GRE and GPA are still ranked highly however, even after taking correlation into account. 
Performing the recursive backward elimination, we find that physics GRE score and GPA are meaningful features and quantitative GRE score no longer is. Using only these two features to create a conditional inference forest on the imputed data, we find that the testing accuracy is 75.7% ± 0.7% and the testing AUC is 0.757 ± 0.007, which are consistent with the full model. As there are only two meaningful features, we can plot the features and see if there does appear to be a separation between admitted and not admitted applicants. To do so we generated all possible pairs of undergraduate GPA and physics GRE scores and ran them through our model to find the 21 Physics GRE Score Grade Point Average Proposed Research Area Year of applying Quantitative GRE score Verbal GRE score Feature Size of UG physics program PhD Barron Selectivity Region of UG program Writing GRE score Size of UG physics program, bach Attended a MSI Highest physics degree offered Attended a public institution 0.000 0.005 0.010 0.015 AUC Mean Importance Figure 2.2: Averaged conditional feature importances over 30 trials. Physics GRE score and undergraduate GPA, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted when adjusting for correlations among the features. Table 2.2: Minimum, median, and maximum values of the metrics obtained over the 125 hyperpa- rameter combinations metric min median max Train AUC 0.824 0.848 0.853 Test AUC 0.726 0.749 0.760 Test Accuracy 0.727 0.750 0.760 Null Accuracy 0.521 0.527 0.556 predicted admissions decision. The result is shown in Fig 2.3. We see that there does appear to be a boundary between admitted and non-admitted students around a physics GRE score of 700, which drops toward 650 for applicants with GPAs above 3.5, providing further evidence that physics GRE score and GPA are predictive of admission at this program. When we test the various hyperparameter combinations, we find similar results. Looking at the 22 Offered Admission? Model−No Model−Yes Data−No Data−Yes 4.0 3.5 Undergraduate GPA 3.0 2.5 2.0 400 500 600 700 800 900 1000 Physics GRE Score Figure 2.3: Plot of all applicant’s physics GRE scores vs their undergraduate GPAs. The background coloring expresses the prediction of the model had an applicant had that score. metrics (Table 2.2), we see that the testing accuracy varies by 3.3 percentage points between the minimum and maximum values and the testing AUC varies by 0.034 between the minimum and maximum values. As the variation is limited and these metrics are still within the acceptable range, the results suggest that our choice of hyperparameters has limited impact on the metrics. When we look at the ranks of the features used in each hyperparameter combination, we also see limited variation. In Fig. 2.4, we notice the plot is mostly diagonal with the presence of two blocks. First, we see that physics GRE score, GPA, quantitative and verbal GRE scores, and proposed research area are always the top five features, regardless of the hyperparameters. Second, we see that the institutional features never rank above a 7, meaning that no combination of hyperparameters can create a model where these features are predictive of admission. In addition, we notice that year of applying is always ranked sixth, serving as a separating feature from the two blocks. 
This 23 Physics GRE Score Grade Point Average Quantitative GRE score Proposed Research Area Verbal GRE score Year of applying Fraction of Trials 1.00 Feature Size of UG physics program PhD 0.75 0.50 Barron Selectivity 0.25 Size of UG physics program, bach 0.00 Writing GRE score Region of UG program Highest physics degree offered Attended a MSI Attended a public institution 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Rank Figure 2.4: Proportion of the 125 hyperparameter combinations in which each feature had a given rank. Notice that there is a block of features that range between 1 and 5 and a block of features that rank between 7 and 14. result is likely due to the fact that there are yearly differences in the fraction of applicants admitted so year is not a noise feature and should be ranked above the noise features. However, knowing the year the applicant applied doesn’t say too much about the applicant themselves and hence, we would expect it to rank below the features like test scores and GPA that do. Looking at the first block, we notice that physics GRE is always the top ranked feature followed by either GPA or quantitative GRE score, with GPA being the more common selection. Furthermore, GPA never ranks lower than third while the quantitative GRE score ranks between second and fourth. For certain choices of hyperparameters, the applicant’s proposed area of research ranks higher than the quantitative GRE score. 2.4 Discussion Perhaps unsurprisingly, we find numerical measures are the most important factors for determining whether a domestic applicant will be accepted into this physics graduate program, consistent with findings that graduate programs with a large number of applicants use numerical measures as a first pass to evaluate applicants [46, 47]. Looking across our analyses, we find physics GRE score and 24 undergraduate GPA are consistently found to be predictive of admission while quantitative GRE score is sometimes found to be predictive based on the choice of hyperparameters. However, once we take correlations among the features into account, the quantitative GRE score is no longer found to be predictive. While we find no evidence of a minimum physics GRE score, we do find evidence of a “rough cutoff” as described in Potvin et al. around 700 [47]. Nevertheless, some students who scored significantly above this threshold were not admitted. While we do not know the reasons why these students were not admitted, Posselt noted that faculty might not admit superior applicants if they do not believe the applicant will actually enroll in their program [46]. Overall, our findings of the meaningful factors for admission to a physics graduate program are consistent with Potvin et al.’s findings obtained by surveying physics graduate admissions directors. Notably, we also find that the physics GRE score, undergraduate GPA, quantitative GRE score, and proposed research area are more important than other factors while the undergraduate institution, GRE written score, and proximity/familiarity are less important factors. While the verbal GRE score was not found to be a meaningful feature, the program studied here appears to place more emphasis on it than the average program. This may be because our study only looked at domestic students while Potvin et al.’s looked at all applicants. 
Because international students also take the TOEFL while domestic students do not and admissions directors ranked the TOEFL as more important than the verbal GRE, the TOEFL may take the place of the verbal GRE and hence lower the perceived value of the verbal GRE relative to other factors. Despite prior work suggesting institutional characteristics play an important role in graduate admissions, we did not find institutional or departmental characteristics to be meaningful to our model. Our result could be due to differences in methodology or due to institutional effects being influential but not dominant factors [98]. Indeed, Posselt suggests institutional factors might be used to differentiate applicants with similar GPAs and GRE scores [46]. Therefore, we might not have found institutional factors to be meaningful because they are used when primary factors such as GPA and physics GRE scores do not sufficiently separate applicants. 25 While we did not have access to other criterion included in the Potvin et al. such as application essays, research experiences, and recommendation letters, we were still able to create a model that correctly predicted whether an applicant would be admitted with approximately 75% accuracy based solely off the applicant’s undergraduate GPA and physics GRE score (and slightly higher if we also included the quantitative GRE score). While undergraduate GPA is a significant predictor of completing a physics PhD, physics GRE is not, as those scoring near the top of the physics GRE only have a 7% higher probability of completing their PhD than those scoring near the bottom [52]. As the GRE is not associated with completing a doctoral degree and is known to favor persons from majoritized groups in science [52, 105], the outsized role of the GRE in the admissions process should be questioned. Indeed, the American Association of Physics Teachers and American Astronomical Society have released recommendations against using the physics GRE in graduate admissions [106, 107]. 2.5 Limitations There are a few limitations to our study. First, the data we used to make our model was not all the data that would be available to a faculty member evaluating an application. In addition, our model did not contain demographic information about the applicants that could also impact the results given the barriers women and people of color face in physics. Therefore, it is possible that meaningful features other than GPA and physics GRE score could lie in the data that was unavailable to us. Second, this study was done at a primarily white institution (PWI). While Kanim and Cid note that having a relatively homogeneous research sample can be valuable for reducing variability, especially in early studies, they also note that exploring the effects of variability can lead to new results and a greater understanding of the results [13]. Thus, while our result might generalize to many physics graduate programs, it might also hide important differences in features predic- tive of admission for applicants of different demographics groups and institutions with different demographics than our own. 26 2.6 Future Work and Conclusion Our work adds to the broader literature about graduate admissions and the process by which applicants are judged. Because minoritized students might not apply to graduate programs if they do not think they will be accepted, elucidating the factors that determine whether an applicant will be accepted is crucial. 
Simply increasing the number of applicants from currently and historically underrepresented groups in physics will not increase their representation unless corresponding efforts are made to admit these students. While our result that test scores and GPA are the most predictive parts of an application in terms of admission aligns with prior work, these results represent only one institution and might not be representative of all United States physics PhD programs. Given the unique structure of the admissions process at this university, graduate programs with a more traditional admissions process might assign different weights to the various parts of an application. Therefore, future work should investigate what features of the applicant drives the admissions process at other institutions. Furthermore, our university has recently moved to a rubric-based admissions format, designed to take into account non-cognitive competencies and program fit in addition to the more traditional admissions criteria such as GPA and GRE scores. Our future work will examine how including these new criteria may change the factors that are most predictive of an applicant being admitted to the program. The results of such analyses are presented in Chapters 4 and 5. 27 CHAPTER 3 FURTHER EVIDENCE AGAINST THE PHYSICS GRE: IT DOES NOT HELP APPLICANTS “STAND OUT” The following chapter was published in Physical Review Physics Education Research in 2021 [108]. The published version includes Marcos D. Caballero as second author. Following the Contributor Roles Taxonomy (CRediT) [76], my roles for this project include conceptualization, formal analysis, methodology, software, validation, visualization, and writing the original draft. 3.1 Introduction While applying to graduate programs requires many components, perhaps none is as scrutinized as the Graduate Records Exam (GRE), and in physics, the physics GRE. Indeed, research into graduate admissions in physics suggests that the physics GRE is one of the most important components of the applications for determining which applicants will be admitted, based on both student and faculty perspectives [47, 51] and analysis of the admissions process [46, 75]. Despite its prominence in the admissions process, the physics GRE is known to be biased against women and people of color in physics [105], resulting in lower average scores compared to white and Asian males. At least one in three programs use a cutoff score [47], with 700 being a common choice [52], meaning applicants from groups already underrepresented in physics graduate programs can be further marginalized as they are less likely to achieve these scores. This is in addition to the observation that many physics students of color already see the GRE as a barrier to applying to graduate school [49, 55, 109]. Further, the physics GRE might not even be useful for determining which applicants will be successful in graduate school. For example, Miller et al. suggest that the physics GRE is not useful for predicting which applicants will earn their PhDs [52]. Additionally, Levesque et al. argue that using the common 50th percentile cutoff score for the physics GRE would have caused admissions committees to reject nearly 30% of students who would later receive a national prize postdoctoral fellowship, which can be viewed as a proxy for research excellence [110]. 
Yet 28 despite evidence suggesting the physics GRE does not predict these typical ways of measuring “success” in graduate school and calls from the American Astronomical Society and the American Association of Physics Teachers to eliminate the physics GRE from admissions [106, 107], most physics graduate programs still require applicants to submit their physics GRE scores. Currently, nearly 90% of physics and astronomy graduate programs still accept the physics GRE, with over half requiring or recommending submitting a score [106]. Of those that do not accept physics GRE scores from applicants, all of the programs are solely astronomy graduate programs or joint physics and astronomy graduate programs. While it is uncertain where removing the physics GRE affects any measure of graduate school success (e.g. completion rate), initial work by Lopez suggests that removing the physics GRE does increase the diversity of applicants [111]. Given these documented issues with the physics GRE, why do departments continue to use it? First, given that many programs are seeing a larger number of applicants, the physics GRE provides a quick way to filter the applications down to a more reasonable number for faculty review. Unlike in undergraduate admissions, graduate admissions tend to be decentralized and done at the departmental level by a faculty committee. Hence, faculty are asked to review applications in addition to their regular teaching and research duties and thus, might not have the time to read the letters of recommendation and applicant essays for every applicant. Second, some faculty view GRE scores as measures of innate intelligence [46, 48] or ability to become a PhD-level scientist [105]. After all, they and other faculty likely had high GRE scores in order to be admitted to graduate school, and may exhibit a survivorship bias, believing that a high GRE score is needed to succeed. Further, physics is seen as a “brilliance-required” field, where innate intelligence is required for success [59]. A third argument, and the most interesting one in terms of the scope of this paper, is that standardized tests such as the physics GRE can help students stand out [112]. The ETS, the creator of the GRE and physics GRE, claims that subject GREs “can help you stand out from other applicants by emphasizing your knowledge and skill level in a specific area” [113]. For example, a student with an average grade point average (GPA) might be able to stand out from other applicants 29 if they did exceptionally well on the physics GRE. In addition, applicants from smaller universities or universities that are not known to the ad- missions committee might benefit from performing well on a standardized measure. For example, the ETS claims that the GRE provides a “common, objective measure to help programs compare students from different backgrounds” [114] and physics admissions committees worry that remov- ing the GRE would limit their ability to compare applicants from different backgrounds [115]. Anecdotally, some faculty claim that a good physics GRE score could aid students from small liberal arts colleges in the admissions process [116]. We already know that GPAs are interpreted in the context of the applicant’s university. 
Posselt has shown that among more prestigious graduate programs, the applicant’s GPA is viewed in the context of their undergraduate institution with high GPAs from prestigious institutions seen favorably, low GPAs from an unknown school as unfavorably, and high GPAs from unknown schools and middle GPAs from prestigious institutions in the middle [117]. Therefore, a standardized test such as the physics GRE could provide an assumed equal comparison for an admissions committee and might allow the applicant from an unknown school to stand out or have a similar chance of admission as an applicant from a more well-known school. Finally, graduate admissions have been documented to be “risk-adverse,” where admissions committees select applicants most likely to complete their program [46, 48]. As applicants from smaller universities may be judged based on how previously enrolled students from their university did in the program [117], a risk adverse admissions committee might be less likely to admit applicants from small universities whose students have previously struggled in their program. However, perhaps a high standardized test score could overcome these perceptions and signal that the applicant might indeed be successful in the program. Our goal then is to focus on the third argument. Does the physics GRE help applicants “stand out” in the admissions process in practice? If that is the case, we would expect those disadvantaged in the admissions process, those who have low GPAs, attended a smaller institution, or identify as part of a group currently underrepresented in physics, to be admitted at similar rates as their more 30 advantaged peers with similar physics GRE scores. Specifically, we ask: 1. How does an applicant’s physics GRE score and undergraduate GPA affect their probability of admission? 2. How are these probabilities of admission affected by an applicant’s undergraduate institution, gender, and race? As Small points out in his critique of admissions and standardized test studies [118], multiple variables rather than just a standardized test might best explain our results and therefore, a framework that allows for substitutions and trade-offs between variables is necessary. Therefore, we ask an additional research question: 3. How might the above relationships be accounted for through mediating and moderating relationships? This paper is organized as follows: Sec. 3.2 provides an overview of mediation and moderation analysis. We then describe our data, how we determined what constitutes “standing out,” and how we implemented mediation and moderation analysis in Sec. 3.3. In Sec. 3.4, we describe our findings and in Sec. 3.5, we use those findings to answer our research questions and explain our limitations and choices which may affect our results. Finally, we describe our future work in Sec. 3.6 and the implications of our work for graduate admissions in physics in Sec. 3.7. 3.2 Background Before we can answer the third research question, it is important to describe what we mean by mediating and moderating relationships. In a mediating relationship, two variables are only related because they are also related to some common third variable. For example, a student who played video games the night before an exam might do poorly because they stayed up playing video games too late and did not get enough sleep. Therefore, video games and doing poorly on the exam are only related due the common factor of lack of sleep. Lack of sleep is then a mediating variable. 
31 In a moderating relationship, the strength of the relationship between two variables depends on some third variable. For example, the relationship between someone liking dogs and owning a dog likely depends on whether they are allergic to dogs. That is, we would expect someone who likes dogs but is allergic to dogs is less likely to own a dog than someone who likes dogs but is not allergic to dogs is. Being allergic to dogs is then a moderating variable. Mathematically, suppose that some input 𝑋 has an effect on output 𝑌 . We would say that some other input 𝑀 mediates the relationship between 𝑋 and 𝑌 if 𝑋 only has an effect on 𝑌 because 𝑋 has an effect on 𝑀 and 𝑀 has an effect on 𝑌 [119]. For a simple case, we can represent these relationships as 𝑌 = 𝑖1 + 𝑐𝑋 (3.1) 𝑀 = 𝑖2 + 𝑎𝑋 (3.2) 𝑌 = 𝑖3 + 𝑐0 𝑋 + 𝑏𝑀 (3.3) where 𝑖 represents the intercepts. These relationships are visually shown in Fig. 3.1. Using this representation, the direct effect of 𝑋 on 𝑌 is represented by 𝑐0 and the indirect effect is represented by 𝑎𝑏. The total effect is then 𝑐0 + 𝑎𝑏, which for a linear regression model is equal to 𝑐. Equivalently, in the case the linear regression, the indirect effect is 𝑐 − 𝑐0. However, if 𝑌 is binary, linear regression is not appropriate and logistic regression should be used instead. In this case, Rijnhart et al. recommend using 𝑎𝑏 as the indirect effect as their simulation studies found the 𝑎𝑏 estimate of the indirect effect exhibited less bias than the 𝑐 − 𝑐0 estimate [120]. To determine if the indirect effect is statistically significant, a common approach is to use a Sobel test. However, simulations suggest that the Sobel test is underpowered and that bootstrapping is a good alternative [121]. Specifically, those simulations find that using the percentiles of a bootstrapped estimate of the indirect effect to estimate the confidence interval is a good compromise 32 Figure 3.1: Visual representation of eqs. (3.1) to (3.3). The top graphic shows eq. (3.1) while the bottom graphic shows eqs. (3.2) and (3.3). between avoiding type I errors while maintaining statistical power. In their approach (which has also been used in PER studies before, e.g., [122]), if 𝑎𝑏 is different than zero, then there is some degree of mediation. More specifically, there are three cases. 1. If 𝑎𝑏 ≠ 0 and 𝑐0 = 0 then 𝑀 fully mediates the relationship between 𝑋 and 𝑌 . 2. If 𝑎𝑏 ≠ 0 and 𝑐0 ≠ 0, then 𝑀 partially mediates the relationship between 𝑋 and 𝑌 . In that case, we can estimate the amount of mediation as the fraction of the total effect attributed to 𝑎𝑏 the indirect effect, 𝑎𝑏+𝑐 0 [123, 124]. 3. If 𝑎𝑏 = 0, then 𝑀 does not mediate the relationship between 𝑋 and 𝑌 . This approach can also be adapted to multiple mediators and these mediators can be predictors of other mediators. An example of this serial mediation case with two mediators is shown in Fig. 3.2. Equations (3.1) to (3.3) can then be modified to be 𝑌 = 𝑖4 + 𝑐𝑋 (3.4) 33 Figure 3.2: Visual representation of eqs. (3.5) to (3.7) showing serial mediation with two mediators. 𝑀1 = 𝑖 5 + 𝑎 1 𝑋 (3.5) 𝑀2 = 𝑖6 + 𝑎 2 𝑋 + 𝑎 3 𝑀1 (3.6) 𝑌 = 𝑖7 + 𝑐0 𝑋 + 𝑏 1 𝑀1 + 𝑏 2 𝑀2 (3.7) In this case, there are three indirect effects. First, there are the indirect effects of the mediators individually, 𝑎 1 𝑏 1 and 𝑎 2 𝑏 2 , and second, there is the indirect effect of the mediators together 𝑎 1 𝑎 3 𝑏 2 . The total indirect effect is then 𝑎 1 𝑏 1 + 𝑎 2 𝑏 2 + 𝑎 1 𝑎 3 𝑏 2 [125]. 
More generally, for 𝑁 mediators, we can generate 𝑁 + 1 equations where the first 𝑁 are of the form 𝑛−1 Õ 𝑀𝑛 = 𝑖 𝑛 + 𝑎 𝑛 𝑋 + 𝑎 𝑗 𝑀𝑗 (3.8) 𝑗=1 and the final equation is of the form Õ𝑁 𝑌 = 𝑖 𝑦 + 𝑐0 𝑋 + 𝑏 𝑗 𝑀𝑗 (3.9) 𝑗=1 34 So far, we’ve assumed that the relationship between the mediator 𝑀 and the output 𝑌 does not depend on any other variables. However, it is possible that the relationship between 𝑀 and 𝑌 could also depend on 𝑋 or some other variable, meaning there is a conditional indirect effect (see Preacher et al. [126]). In the case that the relationship between 𝑀 and 𝑌 depends on 𝑋, we would say that 𝑋 moderates the relationship between 𝑀 and 𝑌 . Practically, this means we must add an interaction term to eq. (3.3), which then becomes [126] 𝑌 = 𝑖3 + 𝑐0 𝑋 + 𝑏 1 𝑀 + 𝑏01 𝑋 𝑀 (3.10) 0 = 𝑖 3 + 𝑐 𝑋 + (𝑏 1 + 𝑏01 𝑋)𝑀 We use the prime on 𝑏 coefficients to denote an interaction coefficient for a mediator while an unprimed 𝑏 coefficient is a coefficient of a mediator. The conditional indirect effect is then 𝑎(𝑏 1 + 𝑏01 𝑋). If 𝑏01 = 0, we would say that there is no moderation and the indirect effect is the standard 𝑎𝑏. In the case that there are multiple mediators, eq. 3.10 can be modified to include multiple mediators and interaction terms for all pairs of variables where moderation may be of interest. In the special case that 𝑋 is binary, eq. (3.10) reduces to 𝑌 = 𝑖𝑥=0 + 𝑏 1 𝑀 when 𝑋 = 0 and 𝑌 = 𝑖 𝑋=1 + (𝑏 1 + 𝑏01 )𝑀 when 𝑋 = 1. Therefore, to test if there is moderation, we can simply regress 𝑀 on 𝑌 given 𝑋 = 0 and again given 𝑋 = 1 and subtract the slopes to calculate 𝑏01 instead of including an interaction term in the model. 3.3 Methods 3.3.1 Data Data for this study comes from the physics departments at five selective, research-intensive, pri- marily white universities. Four of these universities are public and part of the Big Ten Academic Alliance while the remaining university is a private Midwestern university. During the 2017- 2018 and 2018-2019 academic years, graduate admissions committees at these five universities recorded all physics applicants’ undergraduate GPA, GRE scores, undergraduate institution, and 35 demographic information such as gender, race, and domestic status. In addition, the universi- ties recorded whether each applicant made the shortlist, was offered admission, and whether the applicant decided to enroll. Because our study includes all applicants rather than only admitted ap- plicants, we are unlikely to suffer from the range restrictions noted in critiques of other admissions studies (e.g. [118, 127]). However, we do address a possible range restriction in the Limitations and Researcher Decisions section (sec. 3.5.2). Due to different requirements and admissions processes for international students and domestic students (e.g., international students need to submit a test of English proficiency), we only include domestic students in our study. We then remove any applicant for whom a physics GRE and GPA were not recorded, leaving us with 2537 applicants. While we in theory could use multiple imputations to address the data as Nissen et al. recommends [71], faculty reviewing the applications do not, to our knowledge, do this and hence, we would be creating data that wasn’t available in the admissions process. Distributions and analysis of the remaining physics GRE scores and GPAs appear in Appendix B As the applicant’s undergraduate university does not contain meaning in itself, we needed to categorize the institutions. 
We chose to categorize the institutions by their size and their selectivity. We then used the number of physics bachelor’s degrees awarded per year as measure of the size of the university. We assume that universities with more graduates are more well known and hence, would likely be known to the admissions committees. In contrast, universities that produce fewer bachelor’s degrees might not be known to the admissions committees and hence, might be unknown programs. It would then be these applicants from “smaller” programs who might need to “stand out.” We acknowledge that some programs that produce a small number of physics bachelor’s degrees each year might not be unknown to the admissions committees due to previous applicants from such schools or research collaborations or partnerships. However, there is no way in our data to know if this is the case. To determine whether a university should be counted as a “small university”, we used the undergraduate institution names to look up the number of typical physics bachelor’s degrees from 36 Table 3.1: Summary of the comparisons we analyzed, which group needs to stand out and which does not, and the figure number showing the results Group that tends to be Group that tends not to be Variable Figure privileged in admissions privileged in admissions Program size Applicants from physics Applicants from physics Fig. 3.5 programs that rank in top programs that rank in 25% of programs based on bottom 75% of programs yearly graduates based on yearly graduates University selectivity Applicants from universities Applicants from any other Fig. 3.6 ranked as most selective or university (Barron’s Value highly selectively (Barron’s of 3 or lower) Value of 1 or 2) Gender Male applicants Female applicants Fig. 3.7 Race Asian or white applicants Black, Latinx, Multiracial, Fig. 3.8 and Native applicants AIP’s public degree data [128, 129]. As of this writing, degree data for the 2018-2019 academic year was not available, so we used data from the 2016-2017 and 2017-2018 academic years to quantify the number of bachelor’s degrees. Additionally, this would have been the most recent data available when admissions committees would have reviewed applications and many of the applicants would be represented in the data as bachelor degree recipients To account for the institution’s prestige, we used Barron’s Selectivity Index [130]. Barron’s selectivity index is a measure based on the undergraduate acceptance rate of an institution as well as characteristics of its undergraduate incoming classes, such as mean SAT scores, high school GPAs, and class rank. We assume selectivity is a proxy for prestige as prestigious institutions tend to have low acceptance rates and high SAT scores and GPAs from incoming students. In contrast to the AIP data, Barron’s selectivity index applies to the institution as a whole rather than only the physics department. 3.3.2 Probability of admission procedure Determining whether an applicant is more or less likely to be admitted first requires computing admissions probabilities. To do so, we grouped applicants based on their GPAs and physics GRE scores. Prior work has found that the physics GRE score and undergraduate GPA are two of the 37 most important aspects of the applications [46, 47, 75]. Our previous work specifically found that the physics GRE score and undergraduate GPA were able to predict with 75% accuracy whether an applicant would be admitted to one public Midwestern physics graduate program. 
In addition, physics is a “high consensus” discipline, meaning most programs agree on what constitutes a successful applicant [46]. Therefore, despite many other components of the applica- tions that affect whether an applicant will be admitted, we believe using the physics GRE score and undergraduate GPA provides a first-order overview of what admissions committees would use to admit applicants. In order to ensure a reasonable number of applicants in each group to do meaningful analysis, we grouped applicants into bins based on their GPA and physics GRE score. We choose to use GPA bins 0.1 units in width and physics GRE bins 50 points in width. The GPA bins were selected to ensure that that GPAs with the same tenth digit were in a single bin. That is, 3.50 through 3.59 would be in a single bin. All GPAs were already reported on the 4.0 scale and physics GRE scores were reported using the standard 200-990 scale so we did not need to do any conversions. We then computed the fraction of applicants in each bin who were admitted to the program they applied. As we are interested in applicants “standing out,” we frame our results as whether applicants in a bin are admitted at a higher rate than the overall rate (all accepted applicants divided by all applicants). If applicants are admitted at a higher rate than the overall rate, it suggests that these applicants did in fact stand out to the admissions committee. In our framing of “standing out,” we are assuming that graduate admissions operate under a deficit model. That is, due to their privilege, some applicants had better resources, opportunities, or choices available to them and as a result, may appear as better candidates for the program compared to applicants who did not have those available to them. To our knowledge, those with less privilege and/or resources are not directly compared to those with more privilege and/or resources but instead compared to an ideal applicant who often resembles someone from a more privileged background. For example, Owens et al. found that faculty valued advanced course knowledge and programming skills in incoming graduate students [53], which may be more characteristic of applicants coming 38 from better resourced institutions. Using this framing, we created four groups of applicants who might or might not need to stand out, which are summarized in Table 3.1 and explained in detail below. To take into account the size of the institution, we first used the AIP data to determine the national quartile each applicant’s institution ranked in terms of all bachelor’s degree recipients for each of the two years of data. Because not all institutions reported data in both years and the number of graduates could vary significantly between years, we conducted separate analyses first with the highest quartile an institution reached in the two years and second with the lowest quartile the program reached in the two years. For example, if an institution was ranked in the 3rd quartile the first year and the 4th quartile in the second year, our first analysis would use the 4th quartile and our second analysis would use the 3rd quartile. We then define the large programs as those in the 4th quartile and small programs as those in the 1st through 3rd quartiles. We address this choice in the discussion. 
When using Barron's Selectivity Index to take into account the selectivity of the institution, we used Chetty et al.'s [131] five groupings (Ivy League +, remaining most selective institutions, highly selective institutions, selective institutions, and non-selective institutions) as a guide. As there was a single applicant from a non-selective institution, selective and non-selective were grouped into a single category. Because we are interested in smaller, less known programs compared to larger, well-known programs, we took the selective and non-selective group to be our "less selective institution" group and institutions in the first three of Chetty et al.'s categories as our "most selective institutions." This corresponds to grouping institutions with a Barron's Index of 1 or 2 together as the "most selective institutions" and all other values together as the "less selective institutions."

To understand how high physics GRE scores might help applicants identifying as part of a group currently underrepresented in physics, we compared women's admission probability to men's admission probability and the admission probability of applicants of color to that of applicants not of color. While it should be noted that gender is not binary [132], the data the admissions committee recorded is only in terms of the male and female binary and hence, we cannot comment on how high physics GRE scores may impact applicants of other genders. Furthermore, given the limited number of applicants identifying as part of a racial group underrepresented in physics, we combined all Black, Latinx, Multiracial, and Native applicants into a single category, which we will refer to as B/L/M/N following the recommendation of Williams [133]. We acknowledge that this may obscure important distinctions between groups, as Teranishi [134] and Williams suggest. We also acknowledge that applicants identifying as both a marginalized gender and a marginalized race may face additional barriers and hence could stand out differently than an applicant identifying as either a marginalized gender or a marginalized race. However, there are fewer than 50 applicants (~2% of the sample) identifying as members of both a marginalized gender and a marginalized race, limiting statistical power for analysis. Full demographics are shown in Table 3.2. For information about how race and ethnicity categories were constructed and standardized, see Posselt et al. [54], who previously used the 2017-2018 academic year application data from this study in their study.

Table 3.2: Counts of applicants by gender and race who provided both GPAs and physics GRE scores.

Gender       Asian  Black  Latinx  Multi  Native  White  Unreported  Total
Men            247     49      99    166       4   1410         112   2087
Women           56      2      19     26       0    308          28    439
Unreported       1      0       0      1       0      5           4     11
Total          304     51     118    193       4   1723         144   2537

3.3.3 Mediation and Moderation Procedure

Given that both the physics GRE score and undergraduate GPA measure physics knowledge to some degree, we expect these two measures to be correlated with each other. Therefore, we first tested whether the physics GRE has any mediating effects when predicting admission and whether GPA moderates the relationship between the physics GRE and admission; that is, whether one is related to admission only because it influences the other, which in turn influences admission, or whether the strength of the relationship between one and admission is affected by the other. Because admissions status is a binary outcome variable, we need to use logistic regression for eqs. (3.1), (3.3) and (3.10).
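As a rough illustration of the simple mediation setup just described (logistic regression for the binary admission outcome, linear regression for the continuous mediator), the bootstrap estimate of the indirect effect $ab$ might be sketched as below; the bootstrapping details (5000 resamples with replacement and percentile confidence intervals) are given in the next paragraph. This is a hedged sketch under stated assumptions rather than the code used in this chapter: the dataframe and the column names 'gpa', 'pgre', and 'admitted' are hypothetical, and the continuous variables are assumed to be standardized already.

```python
import numpy as np
import statsmodels.formula.api as smf

def bootstrap_indirect_effect(df, n_boot=5000, seed=0):
    """Percentile-bootstrap estimate of the indirect effect a*b for a simple
    mediation model with X = GPA, M = physics GRE score, Y = admission status."""
    rng = np.random.default_rng(seed)
    indirect = np.empty(n_boot)
    for i in range(n_boot):
        boot = df.iloc[rng.integers(0, len(df), len(df))]  # resample with replacement
        # a path: mediator regressed on the independent variable (linear regression).
        a = smf.ols("pgre ~ gpa", data=boot).fit().params["gpa"]
        # b path (alongside the direct effect): admission is binary, so logistic regression.
        b = smf.logit("admitted ~ gpa + pgre", data=boot).fit(disp=0).params["pgre"]
        indirect[i] = a * b
    estimate = indirect.mean()                           # average over bootstrap samples
    lower, upper = np.percentile(indirect, [2.5, 97.5])  # 95% percentile interval
    return estimate, (lower, upper)
```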
When taking an applicant's GPA and physics GRE score into account, we first centered and scaled both variables so they both have means of zero and variances of one. As we are treating GPA and physics GRE scores as continuous, we can use linear regression for eq. (3.2). To estimate the coefficients in eqs. (3.1) to (3.3) and (3.10), we generated 5000 bootstrap samples with replacement, as was done in Hayes and Scharkow [121]. For each trial, we computed the indirect effect $ab$. To estimate each parameter, we took the average over the 5000 bootstrap samples. To get the lower end of the 95% confidence interval, we used the value that corresponded to the 2.5th percentile of the values generated by the bootstrap. Likewise, to get the upper end of the 95% confidence interval, we used the value that corresponded to the 97.5th percentile.

For the institutional features, we treat institutional selectivity and institution size as binary input variables (most selective or less selective, and larger institution or smaller institution), and for the demographic features, we treat gender and race as binary variables. Again, we use B/L/M/N as one category for race and white and Asian as the other. The applicant's physics GRE score and GPA are again treated as continuous mediating and moderating variables. Because the physics GRE score and GPA can both act as mediators and GPA may also influence the physics GRE score, we used a serial mediation model instead of the simple mediation model (eqs. (3.4) to (3.7)). While moderation by the independent variable $X$ can occur for any of the relations between the other variables, only moderating relationships between GPA and admission and between the physics GRE score and admission are within the scope of this work. Therefore, we only include those interaction terms in our models. For all of these analyses, we used the same bootstrapping process used for the simple mediation and moderation cases.

3.4 Results

3.4.1 Probability of admission results

When comparing the GPAs and physics GRE scores of all applicants, we notice that most applicants who are admitted have both high GPAs and high physics GRE scores (Fig. 3.3). Furthermore, while a near perfect GPA or physics GRE score resulted in the highest chance of admission, having either a high GPA or a high physics GRE score and a modest score on the other seemed to still offer an admission fraction around the overall average. However, having a low GPA or a low physics GRE score and a modest score on the other is usually grounds for rejection. Overall admission fractions for a given physics GRE score or GPA are shown in the top and right margins of Fig. 3.3, respectively.

In regard to having a high physics GRE score despite a low GPA, we first note that only a small fraction of all applicants fall in this regime. Second, there appears to be no pattern in terms of a higher than average fraction admitted for these applicants. Some combinations of low GPA and high physics GRE score result in a few applicants being admitted, and hence, an above average fraction of applicants being admitted, while other score combinations have no applicants being admitted, and hence, a below average chance of admission. For example, having a GPA in the 3.3 bin and a physics GRE score in the 1000 bin resulted in an above average fraction admitted, while having a GPA in the 3.4 bin and a physics GRE score in the 1000 bin did not, despite the applicants having a higher GPA.
To further understand whether a high physics GRE score can highlight those with a low GPA, we divided all applicants into high or low GPA and high or low physics GRE score bins (Fig. 3.4). Based on the admission probabilities in Fig. 3.3, a low GPA seems to be below a 3.5, while a high physics GRE score seems to be above 700. However, 700 is a common cutoff score, which could explain why admission probabilities increase after that score. Because hitting the minimum score might not catch the admissions committee's eye, we instead selected a higher score of 880, which represents the 80th percentile (see Appendix D).

From Fig. 3.4, we notice two things. First, among applicants in the low GPA bin, less than half (44%) even make it above the typical cutoff score of 700, and less than 10% of those applicants with low GPAs score 880 or higher. These represent approximately 11% and 2% of all applicants, respectively. Comparing the fraction of admitted applicants in each bin, applicants with high physics GRE scores and low GPAs are admitted at nearly the same rate as applicants with high GPAs and low physics GRE scores. Second, we notice that 16% of all applicants score in the high GPA but low physics GRE score bin. That is, more applicants could be penalized for having a low physics GRE score despite a high GPA than could benefit from a high physics GRE score despite a low GPA.

Figure 3.3: Fraction of applicants admitted by undergraduate GPA and physics GRE score. The number of students in each bin is also shown. 'Any' corresponds to the corresponding row or column totals. The bin label corresponds to the upper bound of values in the bin, exclusive, with the exception of the 4.0 GPA bin, which includes 4.0. Values are colored based on whether they are above, below, or equal to the overall admissions rate. Admissions rates within 10% of the overall rate are colored the same as the overall rate. The above and below average colors are based on being above/below the midpoint between the max/min admission fraction and the overall average. These are based on raw numbers and not a statistical test.

Figure 3.4: A condensed version of Fig. 3.3 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score.

When taking the size of the applicant's undergraduate program (large or small) into account, using either the highest or lowest quartile of bachelor's graduates over the two-year period did not substantially change the results. Therefore, we only present results from the highest quartile reached, which are shown in Fig. 3.5. Due to the much smaller number of applicants per bin, we reduce the number of GPA and physics GRE bins. We use bins of 3.0 or less, which corresponds to a B or lower; 3.0 up to 3.3, a B+; 3.3 up to 3.7, an A-; and 3.7 up to 4.0, an A, under the standard 4.0 scale.

Figure 3.5: Fraction of applicants admitted by undergraduate GPA and physics GRE score, split by large or small undergraduate university.

Figure 3.6: Fraction of applicants admitted by undergraduate GPA and physics GRE score, split by selective or non-selective undergraduate university.

Table 3.3: Distribution of applicants scoring in each physics GRE range by size of institution. ETS only publishes overall score distributions and hence, we cannot report national scores from only domestic students.

Score       Large Schools  Small Schools  Selective Schools  Non-selective Schools  National
[400,500)            0.7%           4.3%               0.8%                   2.1%        9%
[500,600)            7.5%          21.4%               5.9%                  17.7%       19%
[600,700)           15.6%          22.9%              14.2%                  23.3%       20%
[700,800)           21.5%          25.1%              22.1%                  23.0%       19%
[800,900)           29.3%          18.8%              29.7%                  23.1%       16%
[900,990]           25.5%           7.5%              27.3%                  10.7%       17%

Overall, by looking at the bins in the 'Any' row and 'Any' column of Fig. 3.5 and Fig. 3.6, we see that applicants from the largest undergraduate programs are nearly 40% more likely to be admitted (0.28 vs. 0.20), while applicants from selective institutions are nearly 70% more likely to be admitted (0.31 vs. 0.18). Looking at the individual admission fractions, there does not appear to be any advantage for applicants graduating from smaller institutions or less selective institutions. The physics GRE scores and GPAs at which applicants are admitted at higher than average rates are nearly the same for large and small programs and for selective and non-selective programs. Unsurprisingly, these tend to be higher physics GRE scores and higher GPAs. Outside of a few bins with a small number of applicants, no combination of low GPA (B+ or less) and high physics GRE score resulted in an above average admission fraction. For the highest physics GRE scores, 900 and above, applicants from the largest or most selective universities seem to be admitted at a higher rate, and a higher fraction of applicants from large or selective universities achieve these high scores compared to applicants from smaller universities. The fractions of applicants from large, small, selective, and non-selective universities, as well as nationally, achieving each score are shown in Table 3.3. Thus, it appears that even if higher scores did help applicants stand out, the applicants from smaller and less selective schools most in need of standing out are less likely to achieve those scores in the first place.

Finally, the results from grouping by gender and race are shown in Fig. 3.7 and Fig. 3.8.

Figure 3.7: Fraction of applicants admitted by undergraduate GPA and physics GRE score, split by the applicant's gender.

Figure 3.8: Fraction of applicants admitted by undergraduate GPA and physics GRE score, split by the applicant's race.
Interestingly, we find that for most physics GRE scores, women are admitted at higher rates than men with equal scores. Likewise, we find that Black, Latinx, Multiracial, and Native applicants are admitted at higher or similar rates as white or Asian applicants with similar physics GRE scores. The same trend seems to hold for GPA as well. However, a high physics GRE score does not seem to help women with a low GPA. For B/L/M/N applicants, there appear to be a few places where applicants may stand out (such as the 800 physics GRE bin and 3.3 GPA bin). If these applicants were standing out due to their physics GRE score, though, we would expect that pattern to continue for higher physics GRE scores at the same GPA. This does not appear to be the case, suggesting these applicants stood out for a reason other than their physics GRE scores. We address this in our discussion. Because these applicants may have stood out for reasons other than their physics GRE score, we do not discuss any interactions between gender and race and selectivity and institution size. For completeness, plots showing these interactions are included in the supplementary material.

3.4.2 Mediation and moderation results

3.4.2.1 Physics GRE and GPA

A visual representation of our mediation results with the physics GRE score and GPA is shown in Fig. 3.9. We find that all coefficients are statistically different from zero. From Fig. 3.9, we see that an applicant's physics GRE score and GPA have about the same effect on whether the applicant is admitted. Given that applicants who had either a high physics GRE score or a high GPA had about the same chance of being admitted, this is not a surprising result. We also find that the indirect effect is not zero, meaning that there is partial mediation. That is, whether an applicant is admitted depends on both their physics GRE score and their GPA. In terms of the amount of mediation, we find that the indirect effect accounts for nearly 30% of the total effect.

Figure 3.9: Visual representation of the bootstrapped coefficients in eqs. (3.1) to (3.3). We do find evidence of the physics GRE score mediating the relationship between GPA and admission status.

Finally, doing moderation analysis, we find that $b_{01} = 0.024~(-0.114, 0.154)$. As zero is included in the confidence interval, we do not find evidence that GPA moderates the relationship between an applicant's physics GRE score and whether they are admitted. That is, the relationship between an applicant's physics GRE score and whether they are admitted is not influenced by their GPA.

3.4.2.2 Institutional features

A visual depiction of our results is shown in Fig. 3.10 and Fig. 3.11.

Figure 3.10: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of the physics GRE score mediating selectivity and admission status but do not find evidence of GPA mediating selectivity and admission status. We do not find evidence of a serial mediating relationship. Statistically significant coefficients are in bold.

Figure 3.11: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of the physics GRE score mediating institution size and admission status but do not find evidence of GPA mediating institution size and admission status. We do not find evidence of a serial mediating relationship. Statistically significant coefficients are in bold.
We find that the applicant's physics GRE score partially mediates the relationship between the selectivity of their undergraduate institution and whether they were admitted and fully mediates the relationship between their institution's size and whether they were admitted. The fractions of mediation due to the indirect effects from the physics GRE score, $a_2 b_2 / (|a_1 b_1| + |a_2 b_2| + |a_1 a_3 b_2| + |c_0|)$, were 0.25 and 0.533, respectively. In contrast, the applicant's GPA was not found to be a significant mediator in either case (zero was contained in the indirect effects' 95% confidence intervals), meaning that GPA is not a reason that there are differences in admission based on the applicant's undergraduate institution. Additionally, no serial mediation was observed for either case.

When looking at the results of the moderation analysis with the physics GRE as the mediating variable, we find that neither $b_{02}$ value is statistically different from zero ($b_{02,\mathrm{selectivity}} = 0.136~(-0.141, 0.411)$ and $b_{02,\mathrm{size}} = 0.080~(-0.204, 0.361)$), meaning that the relationship between the physics GRE score and admission is the same regardless of the type of institution the applicant attended. Likewise, we do not find evidence of moderation when GPA is the mediating variable. In those cases, $b_{01,\mathrm{selectivity}} = 0.052~(-0.314, 0.394)$ and $b_{01,\mathrm{size}} = -0.143~(-0.630, 0.267)$.

3.4.2.3 Demographic features

Our results are shown visually in Fig. 3.12 and Fig. 3.13. Because we chose woman to be "1" and B/L/M/N to be "1" in our logistic regression equation, some of the coefficients are negative. For example, the negative $a$ coefficient for gender and physics GRE score means that women score lower on the physics GRE than men do. Because the signs depend on our choice of which category is coded as "1" and are in that sense arbitrary, we use the absolute values of $c_0$ and $a_i b_i$ to calculate the fraction of mediation.

Figure 3.12: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of the physics GRE score mediating gender and admission status but do not find evidence of GPA mediating gender and admission status. We do not find evidence of a serial mediating relationship. Statistically significant coefficients are in bold.

Figure 3.13: Visual representation of the bootstrapped coefficients in eqs. (3.5) to (3.7). We do find evidence of GPA mediating race and admission status and a serial mediation effect but do not find evidence of the physics GRE mediating race and admission status. Statistically significant coefficients are in bold.

We find that the applicant's physics GRE score partially mediates the relationship between gender and admission but not race and admission, meaning that gender affects admission in part because it affects physics GRE scores, which affect admission. The fraction of mediation for gender and admission due to the physics GRE score is $|a_2 b_2| / (|a_1 b_1| + |a_2 b_2| + |a_1 a_3 b_2| + |c_0|) = 0.246$. For GPA, we find the opposite. GPA partially mediates the relationship between race and admission but not gender and admission. The fraction of mediation for race and admission due to GPA is $|a_1 b_1| / (|a_1 b_1| + |a_2 b_2| + |a_1 a_3 b_2| + |c_0|) = 0.299$. Likewise, we find a serial mediation effect for race and admission but not gender and admission. That is, admission is affected by race both because admission is related to GPA, which is related to race, and because admission is related to the physics GRE score, which is related to GPA, which is related to race.

When investigating whether any moderation effects exist, we do not find that to be the case. That is, we find that none of the interaction coefficients are statistically different from zero and hence, physics GRE scores and GPAs do not have a differential effect on admission based on the applicant's gender or race. Specifically,
• $b_{02,\mathrm{pGRE,gender}} = 0.154~(-0.154, 0.462)$,
• $b_{01,\mathrm{GPA,gender}} = 0.209~(-0.103, 0.538)$,
• $b_{02,\mathrm{pGRE,race}} = -0.007~(-0.309, 0.314)$,
• $b_{01,\mathrm{GPA,race}} = -0.236~(-0.586, 0.143)$.

All results and interpretations from the mediation and moderation analyses are summarized in Table 3.4.

Table 3.4: Summary of the mediation and moderation results. * signifies partial mediation is present, ** signifies full mediation is present, and † signifies moderation is present; however, no moderation effects were found.

Independent        Mediating     Indirect effect  Moderating effect
GPA                Physics GRE            0.223*              0.024
Selectivity        Physics GRE            0.188*              0.136
Selectivity        GPA                     0.019              0.052
Selectivity        Serial                  0.006                 NA
Institution size   Physics GRE           0.300**              0.080
Institution size   GPA                    -0.015             -0.143
Institution size   Serial                 -0.004                 NA
Gender             Physics GRE           -0.540*              0.154
Gender             GPA                    -0.022              0.209
Gender             Serial                 -0.018                 NA
Race               Physics GRE            -0.049             -0.007
Race               GPA                   -0.285*             -0.236
Race               Serial                -0.110*                 NA
3.5 Discussion

Here, we address each of our research questions and possible limitations or confounding factors.

3.5.1 Research Questions

How do an applicant's physics GRE score and undergraduate GPA affect their probability of admission? We find that scoring highly on the physics GRE and having a high GPA results in the highest chance of admission (Fig. 3.4). Likewise, having a low physics GRE score and a low GPA results in the lowest chance of admission. If either the applicant's physics GRE score or GPA is high while the other is not, the chance of admission is approximately equal, regardless of which one is high. However, the number of applicants with high GPAs but low physics GRE scores is 9 times as large as the number of applicants with low GPAs and high physics GRE scores (i.e., scoring above the 80th percentile; Fig. 3.4). Even if we consider meeting the minimum cutoff score to be a high physics GRE score, the number of applicants who have high GPAs but low physics GRE scores is 1.5 times greater than the number of applicants with low GPAs but high physics GRE scores. Thus, many more high GPA applicants could be penalized by the physics GRE than low GPA applicants could stand out or benefit from a high physics GRE score.

Finally, we note that low-GPA applicants with high physics GRE scores are all admitted at essentially the same rate, regardless of whether they scored in the 700-870 range or the 880-990 range. If these applicants were standing out, we would expect low GPA applicants scoring above 880 to be admitted at a much higher rate than low GPA applicants scoring between 700 and 870. Thus, it is hard to determine whether these applicants actually stood out to the committee or whether they simply met the minimum physics GRE score needed for the committee to review the rest of the application.

How are these probabilities of admission affected by an applicant's undergraduate institution, gender, and race? First, we find that for most physics GRE scores, applicants from larger and smaller institutions are admitted at similar rates (Fig. 3.5). However, for the highest scores (above 900), applicants from larger universities are admitted at higher rates.
Interestingly, for applicants from smaller programs, scoring above 900 does not appear to provide any additional benefit in terms of the fraction of applicants admitted compared to scoring between 800 and 900.

In contrast, applicants from less selective institutions are less likely to be admitted than applicants from more selective institutions for all physics GRE scores above the common cutoff score (Fig. 3.6). That is, the physics GRE does not seem to counteract any potential biases from admissions committees against applicants from less selective institutions. Overall, attending a large or selective institution and scoring highly on the physics GRE does result in a higher chance of admission than scoring highly on the physics GRE and attending a smaller or less selective institution.

It is important to note that there might be selection bias in our data because test-takers with high scores from smaller universities might not choose to apply to these schools. However, this seems unlikely because 1) these programs are highly regarded and hence, would not be "safety schools" for high scoring applicants (as indicated by the many high scoring applicants from large programs applying here), and 2) while there is research suggesting students with low physics GRE scores might view their scores as barriers to applying [49], to our knowledge, there is no evidence that students with high scores do not apply to physics graduate programs. Given that students with low test scores might not apply, it is expected that our data is not representative of test-takers on the lower end of scores (as shown in Table 3.3).

When looking at the demographic variables, we find that women are admitted at higher rates than men with similar scores (Fig. 3.7) and B/L/M/N applicants are also admitted at higher rates than white or Asian applicants (Fig. 3.8). As prior work has shown [105], women and B/L/M/N test-takers tend to score lower than white men on the physics GRE and hence, scoring highly could cause these applicants to stand out to admissions committees.

How might the above relationships be accounted for through mediating and moderating relationships? Our mediation and moderation analysis further supports the results found through the probability of admission procedure. We find that the physics GRE score and GPA have similar regression coefficients when modeling admission, suggesting they have similar effects (Fig. 3.9), and that there is a mediation effect. In addition, we did not find any evidence of moderation. That means the relationship between GPA and admission does not differ based on the applicant's physics GRE score. If a high physics GRE score did help a low-GPA applicant stand out, we would expect to see a moderation effect. Combining the results of the probability of admission analysis and the mediation and moderation analysis, we find that there is mediation but no moderation between an applicant's physics GRE score and their GPA when it comes to admission probability. In practice then, an applicant with a low GPA cannot simply overcome that low GPA by scoring highly on the physics GRE.

When we performed mediation analysis on the institutional factors, we found that the relation between institutional selectivity and admission was partially mediated by the applicant's physics GRE score and the relation between institutional size and admission was fully mediated by the applicant's physics GRE score (Fig. 3.10 and Fig. 3.11).
Neither of these relationships was mediated by the applicant's GPA or mediated serially, however. The results of the mediation analysis show that physics GRE scores seem to explain some of the differences in admission probability based on the applicant's undergraduate institution. Therefore, an applicant from a smaller or less selective institution may be able to stand out by scoring highly on the physics GRE. However, looking at the fraction of applicants admitted by physics GRE score, especially for the highest scores, suggests that is not what happens in practice.

In terms of gender and race, we do find some mediating relationships, but no moderation relationships (Fig. 3.12 and Fig. 3.13). We find that the physics GRE partially mediates the relationship between gender and admission. We also find that GPA, and GPA plus the physics GRE score, partially mediate the relationship between race and admission. That is, some of the differences in admission rates between men and women can be explained by the differences in their physics GRE scores, and some of the differences in admission rates between B/L/M/N applicants and non-B/L/M/N applicants can be explained by differences in their GPAs or in their physics GRE scores and GPAs. These results then suggest that a female or B/L/M/N applicant may be able to stand out by doing well on the physics GRE.

In practice, the probability of admission results do suggest that women and B/L/M/N applicants are admitted at higher rates than their male, white, or Asian peers are. However, as the five programs studied here were interested in increasing their diversity, our data does not allow us to disentangle "standing out" from highlighting. Therefore, our results should be interpreted with caution regarding any claims that the physics GRE may help applicants from groups underrepresented in physics stand out. It should also be noted that women and B/L/M/N applicants are less likely to reach these higher scores than their male, white, and Asian peers. In our data, 75% of men and 72% of white or Asian applicants scored above 700, compared to 45% of women and 57% of B/L/M/N applicants. Thus, even if the physics GRE does allow these applicants to stand out, any potential benefit must be weighed against known scoring discrepancies.

Finally, it could be argued that even though we did not show that the physics GRE helps these applicants "stand out," doing well on the test could still provide some benefit for them in the admissions process. We would agree with that argument, not because of any properties of the test but because of the structure of graduate admissions in physics. In theory, any part of the application could be weighted highly and therefore, doing well on that part would provide some benefit. Given that prior work has established that the physics GRE is weighted highly [46, 47, 51, 75], we would expect that good performance on the test would provide some benefit to applicants. Our goal, however, was not to determine whether a high physics GRE score benefits applicants in any capacity. Instead, our goal was to determine whether a high physics GRE score offers a disproportionate benefit that would justify using it in graduate admissions given the disproportionate harms the physics GRE can cause, and we were unable to show such a benefit in practice.

3.5.2 Limitations and Researcher Decisions

Data Biases. As previously noted, applicants with lower physics GRE test scores may be less likely to apply, resulting in an over-representation of high scoring applicants.
In addition, the programs in this study are well-regarded programs, and there is likely a secondary bias toward applicants with high GPAs and high physics GRE scores applying overall. As a result, our findings may not generalize to graduate programs whose applicants tend to have lower GPAs or lower physics GRE scores.

In addition, it is possible that an applicant could be represented multiple times in the data set, as an applicant could have applied to more than one of the five universities in this study. However, each applicant applies to each program independently and thus, we can treat the applications as separate events for the admission probabilities. On the other hand, results based on distributions, such as Table 3.3 and Fig. B.1 and Fig. B.2, would be affected by duplicates. To see if possible duplicates affected our results, we compared the distributions with and without possible duplicates. We assumed an application represented the same applicant, and hence was a duplicate, if two records had the same physics, verbal, written, and quantitative GRE scores, GPA, undergraduate university, and demographic features, as the chance of all of these matching for a nonduplicate seems exceedingly low. When we compare the distributions with and without possible duplicates, Kolmogorov-Smirnov tests [135] suggest the distributions are not significantly different. Therefore, because we cannot actually determine which applicants are duplicates and excluding possible duplicates does not change our results, we did not remove possible duplicates.

Our choice of low GPA and high physics GRE. While percentiles are available for the physics GRE, what counts as a "high score" is left to interpretation. Even among admissions committees, individual members may have different ideas of what a high score is. In our work, we have taken the common cutoff score of 700 as the minimum possible high score [52]. Even around this minimum score, the number of applicants with low GPAs who could benefit from scoring highly on the physics GRE is less than the number of high GPA applicants who could be penalized by having a score below the cutoff. We find that the number of low GPA applicants who could benefit from a high physics GRE score is greater than the number of high GPA applicants who could be penalized by a low score only when the high score cutoff is less than or equal to 670, which is lower than the typical cutoff score and is around the 43rd percentile (Fig. 3.14). Assuming a high score should be at least above the 50th percentile, our specific choice of a high score does not affect our result that more applicants could be penalized than could benefit.

Figure 3.14: A revised version of Fig. 3.4 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score when the cutoff score for a high physics GRE score is 670. Here, the number of applicants who could benefit from a high physics GRE score is approximately equal to the number of applicants who could be penalized by a low physics GRE score.

The previous argument is also affected by what we consider a high GPA. We have chosen any GPA less than 3.5 to be low based on the results shown in Fig. 3.3,
where applicants with GPAs at or above 3.5 are nearly twice as likely to be admitted as applicants with GPAs below 3.5. If we were to pick a lower threshold, there would be even fewer applicants in the low GPA-high physics GRE score group and more applicants in the high GPA-low physics GRE score group, meaning even more applicants could be penalized rather than stand out. Using a GPA cutoff of 3.4 instead of 3.5, the ratio of applicants who could be penalized to applicants who could stand out changes from the original 9:1 to nearly 19:1 (Fig. 3.15). If we instead picked a higher GPA cutoff such as 3.6, there would be more applicants who could potentially benefit, but even then, the number of applicants who could benefit only exceeds the number of applicants who could be penalized when the high-score cutoff is around a physics GRE score of 730, which is not a high physics GRE score (approximately the 54th percentile) and does not significantly change our results (Fig. 3.16). If we were to pick an even higher GPA cutoff, we would be hard-pressed to justify why anything other than an 'A' GPA is considered a low GPA, especially because admissions committees seem to group applicants with GPAs between 3.5 and 3.6 more closely with applicants with GPAs between 3.7 and 3.8 than with applicants with GPAs between 3.4 and 3.5 (based on the fraction of applicants admitted). Based on our data and the fact that some universities use 3.5 as the only separation between 3.0 and 4.0, using 3.5 seems to represent the best option for separating high and low GPA students. Using any other choice either strengthens our claims or seems unrealistic as a cutoff.

Figure 3.15: A revised version of Fig. 3.4 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score when the cutoff score for a high undergraduate GPA is 3.4.

Figure 3.16: A revised version of Fig. 3.4 showing the fraction of applicants admitted by undergraduate GPA and physics GRE score when the cutoff score for a high undergraduate GPA is 3.6. Here, the number of applicants who could benefit from a high physics GRE score is approximately equal to the number of applicants who could be penalized by a low physics GRE score.

Our choice of non-selective school. We chose to follow a modified version of Chetty et al.'s groupings of programs [131]. However, many large, state universities have a Barron's Selectivity Index of 3 and fall in Chetty et al.'s fourth group. For our analysis, we would have included these large, state institutions as part of the less selective programs. As we are concerned with whether the physics GRE helps applicants stand out, saying that applicants from large, state universities (for example, the University of Colorado-Boulder, the University of Washington, and Michigan State University) may fall in the traditionally missed category may not be correct.
We reran the analysis with these large, state institutions included as part of what we called the most selective programs. We find that the conclusions are then more aligned with the large vs. small program results. Using this grouping, applicants from less selective programs are admitted at similar rates to applicants from more selective programs for most physics GRE scores. However, applicants from more selective institutions with physics GRE scores above 900 are still more likely to be admitted than applicants from less selective institutions with similar physics GRE scores.

In terms of the mediation and moderation analysis, our results would be strengthened under this choice. While the physics GRE score would no longer mediate the relationship between selectivity and admission status, it would moderate the relationship ($b_{02} = 0.311~(0.015, 0.612)$). This positive moderation means that the physics GRE score has a greater effect on admission status for applicants from more selective programs. In terms of standing out arguments, the positive moderation result means that doing well on the physics GRE would provide more of a benefit to applicants from more selective universities and not to applicants from smaller programs, who are the intended beneficiaries of the "standing out" argument. Thus, even though the details change, the overall conclusions are not weakened by changing our groupings. In fact, changing the groupings may strengthen our conclusions instead.

Our choice of a "small" school. We chose small schools to be any university not in the top quartile of yearly bachelor's degrees awarded. We acknowledge that using quartiles is an arbitrary decision. However, when we used halves instead of quartiles to divide large and small schools, our results were unchanged, both in terms of the probability of admission analysis and the mediation and moderation analysis. Using the bottom quartile as small schools and all other programs as large schools would not have yielded insightful results, as less than 2% of applicants would have attended a small school under this choice.

Of the possible physics-specific measures, the number of bachelor's degrees seems most appropriate because programs with more graduates are more likely to be known by admissions committees, simply because there are more students from those programs who might apply. For example, the programs in the top quartile by number of bachelor's graduates produce nearly two-thirds of all physics bachelor's graduates [128, 129]. In addition, we assume that programs with strong physics reputations attract more students and hence, produce more graduates. While this is likely to be more true at the graduate level, not all physics programs offer graduate degrees and hence, using the number of PhDs awarded would not be useful. Thus, we believe the number of bachelor's graduates serves as a rough proxy for physics reputation.

3.6 Future Work

While the five universities included in this study were interested in increasing their diversity and reducing inequities in their programs, their admissions processes still resembled the traditional metrics-based admissions model. Recently, many programs, including the ones studied here, have begun to employ holistic admissions, which looks at the overall application, taking into account non-cognitive competencies and contextualizing the accomplishments of the applicant in terms of the opportunities that were available to them [136, 137].
Often, these holistic admissions processes use rubrics to weight the various components of each application (e.g., see [138, 139]). Evidence from biomedical science graduate programs suggests that the GRE can even be included in holistic admissions without reproducing its known gender and racial biases [140]. Furthermore, the two-tiered approach to holistic admissions used in that work did not significantly increase the workload of admissions committee members. These findings could persuade faculty who are reluctant to remove the GRE, due to its ease of use and supposed ability to measure some innate quality, to try holistic admissions. Whether these results would hold for the decentralized admissions typical in physics and for the physics GRE, though, are still open questions.

Our future work will then examine how our results may be affected when a department uses holistic admissions. In theory, we should no longer see the discrepancies between admitted applicants from large and small programs or from more selective and less selective universities. In addition, the sample rubric developed by the Inclusive Graduate Education Network (as shown in [138]) suggests ranking applicants as high, medium, or low on each part of their application. Therefore, we would expect to see a flatter distribution of admission fractions based on physics GRE scores because, for example, all scores within the 'high' range should be treated equally in the admissions process. Our future work will determine if this is indeed the case.

3.7 Conclusion and Implications

Our work suggests that, in practice, scoring highly on the physics GRE does not help applicants from small or less selective schools or applicants with a low GPA "stand out." Indeed, having a high physics GRE score and a low GPA is no better than having a low physics GRE score and a high GPA in terms of the fraction of applicants admitted. Similarly, for average physics GRE scores, the selectivity or size of the applicant's institution does not offer any advantage. For the highest scores, though, attending a smaller or less selective institution does appear to result in an admissions penalty. We also find that women and B/L/M/N applicants do have higher rates of admission based on physics GRE scores. However, given that the departments included in this study were actively trying to improve the diversity of their graduate student populations [54], we are unable to attribute that standing out to the physics GRE.

While ETS's claim that the physics GRE can help applicants stand out from other applicants may be true in theory, we do not find evidence to support that claim in practice. In fact, our results suggest the opposite: the physics GRE may penalize applicants due to a low score rather than help applicants due to a high score.

As Small points out, facts and data do not unambiguously prescribe a course of action [118], and as others have noted, deciding on such courses of action requires a framework of assumptions and commitments [141]. Thus, we do not make a specific recommendation regarding whether the physics GRE should be kept or removed as a result of our work because the answer to that question depends on the priorities of the department. However, if departments are using the physics GRE to identify applicants who might be missed by other metrics to achieve their admissions priorities, we advise against this practice, as it does not appear to be backed by evidence.

CHAPTER 4

RUBRIC-BASED ADMISSIONS: A NEW APPROACH TO GRADUATE ADMISSIONS IN PHYSICS

This chapter is being drafted as a journal article.
The working manuscript version includes K. Tollefson as the second author, Remco G. T. Zegers as the third author, and Marcos D. Caballero as the fourth author. Following the Contributor Roles Taxonomy (CRediT) [76], my roles for this project include conceptualization, formal analysis, methodology, software, validation, visualization, and writing the original draft.

4.1 Introduction

Female and Black, Latinx, and Indigenous scholars have been and continue to be underrepresented at all levels of physics. The percentage of physics degrees awarded to women has stagnated at around 20% [142], while the percentage of physics degrees awarded to Black, Latinx, and Indigenous students has remained less than 10% despite these students making up a larger portion of the college population than in the past [143]. While there are numerous possibilities for addressing the systematic inequities these scholars face at all levels of academia that limit their participation [144–150], this paper will focus on graduate admissions. Specifically, if we treat graduate admissions as a four-stage process, similar to how O'Meara et al. treat faculty hiring as a four-stage process [151], this paper focuses on the "evaluating candidates" and "short lists and final decisions" stages of the process.

While physics departments may be interested in increasing their diversity, the dominant processes of evaluating applicants for graduate school do not support such aims. Prior work has found that diversity considerations are often secondary when evaluating applicants and are discussed after many diverse candidates have already been cut from the applicant pool [54, 152]. Therefore, increasing diversity and equity during the admissions process requires rethinking the process physics departments use to evaluate applicants.

One promising approach to rethinking the admissions process is holistic review, where a broad range of candidate qualities are considered [137]. In physics, the use of rubric-based review to facilitate such holistic reviews has been gaining traction through the Inclusive Graduate Education Network [153]. Under this approach, applicants are rated on both traditional metrics, such as GPA and test scores, as well as noncognitive skills, such as showing initiative and displaying perseverance, according to a predefined rubric. Such an approach is claimed to ensure that each applicant is treated fairly and that biases by reviewers are checked [154], and hence, it could make graduate admissions more equitable. To our knowledge, however, few studies have examined how these rubrics work in practice and whether they fulfill such aims. Therefore, the goal of this paper is to empirically examine those claims in the context of our department's graduate program. Specifically, our paper addresses three questions related to rubric-based review in our department:

1. How do faculty assign rubric scores to applicants and how do those scores differ between admitted and rejected applicants?

2. How do the scores assigned by faculty differ by the applicant's sex?

3. How do the scores assigned by faculty differ by the type of institution the applicant attended?

As Scherr et al. concluded in their study of graduate admissions practices in physics, many departments are unaware of what other departments do and hence, they might be willing to change their practices if they become aware of successful practices in use elsewhere [48].
Therefore, a secondary goal of this paper is to describe alternative admissions practices in physics and how departments may apply these alternative practices to their own admissions processes.

The rest of the paper is organized as follows. In Sec. 4.2, we provide an overview of holistic review, rubric-based review, and evidence from other fields about their potential for success. In Sec. 4.3, we describe how our department transitioned to rubric-based review, how we collected data relevant to evaluating our admissions process, and how we analyzed such data. In Sec. 4.4, we share results that suggest our rubric does support equitable admissions practices, and in Sec. 4.5, we contextualize our results and examine how our choices as researchers may affect the results. In Sec. 4.6 and Sec. 4.7, we examine the limitations of this study and suggest directions for future work. Finally, in Sec. 4.8, we provide recommendations for departments interested in adopting rubric-based review.

4.2 Background

4.2.1 A typical admissions process in physics

When applying to a physics graduate program in the United States, an applicant will typically submit their undergraduate transcripts, general and physics GRE scores, multiple statements addressing their background, prior preparation, and research interests, and letters of recommendation. A group of physics faculty, the admissions committee, then reviews the applications and offers admission to some of the applicants. Historically, there have been two main approaches to admitting students: emphasizing research or emphasizing grades [45]. More recent work, however, has tended to find that programs, including the one studied in this paper, emphasize grades and test scores over research, both in terms of what faculty say they do [47, 51] and what faculty actually do [46, 75].

Yet, numerous potential equity issues emerge when admissions is focused around test scores and grades. First, there is evidence that GRE scores vary based on gender and race [52, 105] and the type of undergraduate university the test-taker attends [69]. When combined with the practice of using cutoff scores, which Potvin et al. estimate at least 1 in 3 departments do despite the creators of the GRE and physics GRE recommending against it [47], applicants from underrepresented groups in physics may be more likely to not make the first cut. Second, the tests themselves can be a financial burden for students [49]. The cost to take the General GRE is currently $205 in most parts of the world (and up to $255 in some regions) [1], and the cost to take the physics GRE is $150 [155]. In addition, if the applicant applies to more than 4 programs, they must pay $27 per school to send their scores. As Owens et al. note, some students also need to travel to a testing center, which may incur travel or lodging costs [55]. Third, grades vary by applicants' demographics and the type of university they attended. Whitcomb and Singh found that wealthier, continuing-generation, white students earned higher grades and that even the most privileged racially underrepresented students in physics earned lower grades than the least privileged white students [156]. Additionally, grades are not standardized measures across universities, with students at private universities tending to earn slightly higher grades than their peers at public universities [157].

Further, evidence has not necessarily supported these metrics as useful predictors of who will earn their PhD. For example, Miller et al.
found that while grade point averages were useful to some degree for predicting completion, the physics GRE had limited use [52]. More recent evidence suggests that the physics GRE and undergraduate grade point average are only related to PhD completion because they are related to graduate grade point average, which is in turn related to PhD completion [56].

Given the known issues with test scores and GPA, why do programs continue to emphasize them over the qualitative parts of the application? Perhaps the simplest answer is that comparing numbers is quick and convenient [46]. A more nuanced answer might be that the qualitative parts of an application can contain substantial variability in what is addressed, and these parts of an application can have their own inequities (see Woo et al. for an overview [158]). One possible conclusion is that, because all application materials are produced in an inequitable society and therefore contain inequities, there is little point in changing anything. We instead adopt a pragmatic view that some parts of the admissions process are more inequitable than others, and therefore, our goal is to develop methods to minimize or eliminate inequities to the best of our ability in an inequitable society.

4.2.2 Holistic Review

One possible approach to addressing inequities in the admissions process is holistic review, which Kent and McCarthy define as "the consideration of a broad range of candidate qualities including 'noncognitive' or personal attributes" [137]. Here, we will use holistic review to refer to the general process regardless of what tools or systems are used to conduct it. When talking about our department's rubric-based process or similar processes, we will use rubric-based holistic review.

While the idea of holistic admissions is hardly new, its implementation is becoming more common due to both greater awareness that quantitative measures may not accurately predict success in graduate school [159–161] and institutions wanting to use the most predictive measures of success in their programs [137]. In addition, professional societies such as the American Astronomical Society (AAS) have called for programs to implement "evidence-based, systematic, holistic approaches" to graduate admissions [138].

Using holistic review has also been claimed to lead to beneficial outcomes for universities, including increasing diversity and improving student outcomes (see [137]), though most of these studies have happened outside of physics and related fields. For example, Hawkins found that using holistic review increased diversity in a Doctor of Physical Therapy program [162], and in a literature review of predominantly medicine-related fields, Francis et al. found that holistic review generally increased racial and ethnic diversity [163]. For STEM fields, Wilson et al. found that using holistic review in a biomedical science program resulted in applicant assessments that were independent of gender, race, and citizenship status [140], and Pacheco et al. found that using a composite score that included GPA, test scores, research experience, and publications was correlated with earning a university fellowship and a shorter completion time while applicants' test scores and GPAs individually were not [164].

While holistic review shows promise, programs may have concerns about implementing it.
For example, common concerns include limited faculty time to review applications, a lack of data correlating admissions criteria and student success, and limited resources to implement it [137]. In addition, there may be concerns that, because the decisions can be more subjective than using a quantitative measure like a test score, there may be variability based on who reviews the application. However, a study of holistic admissions at the undergraduate level found that only 3% of reviews showed substantial variability in the overall score between reviewers [165], suggesting that in practice variability in the overall rating between reviewers is limited.

4.2.2.1 Noncognitive skills

Regardless of the specifics of a holistic review process, most approaches include some examination of the applicant's noncognitive skills, which may also be referred to as soft skills, personality traits, character traits, or socio-emotional skills depending on the discipline or context [166]. While there are multiple definitions of these (see [167]), we adopt Roberts' definition that noncognitive skills or personality traits are "the relatively enduring patterns of thoughts, feelings, and behaviors that reflect the tendency to respond in certain ways under certain circumstances" [167]. Often these have been operationalized as the Big Five, which are openness to experience, conscientiousness, extroversion, agreeableness, and neuroticism [166, 168], though other categorizations exist. For example, in higher education admissions, Sedlacek proposed eight noncognitive traits, which he defines as things not measured by standardized tests: positive self-concept, realistic self-appraisal, understanding and knowing how to handle racism (the system), preferring long-range over short-term or immediate needs, availability of a strong support person, successful leadership experience, demonstrated community service, and knowledge acquired in or about a field [169].

In terms of their utility, noncognitive skills have been found to be predictive of or correlated with academic success, though these studies have happened outside of the context of physics. At the undergraduate level, noncognitive skills in isolation and in concert with test scores have been found to be more predictive of success and graduation than test scores alone [170–172]. Likewise, at the graduate and professional levels, noncognitive skills have been found to be correlated with GPA and class rank [173, 174], clinical performance [175], and overall success in programs [176, 177], but were not found to be associated with doing well on a licensing exam [178]. Of the individual noncognitive skills, conscientiousness has been found to be most strongly and consistently associated with academic success [179].

In addition to their benefits related to academic success, noncognitive skills can be useful for promoting equity in admissions. For example, including noncognitive skills can increase diversity without harming validity [180, 181], as noncognitive measures have been shown to be just as valid for majoritized and minoritized groups [180, 182, 183]. While including noncognitive skills as part of admissions may seem like a hard ask of faculty, many faculty already acknowledge the usefulness of noncognitive skills in graduate school [180], including in physics [53]. Yet, a pressing concern is how to measure such noncognitive skills accurately. While applicant self-reports or recommender ratings are typical approaches, such methods may result in inflated or skewed ratings [180].
A recent study suggests that even sharing descriptions of noncognitive skills and why they are useful for predicting later success can artificially inflate judgments [184]. Thus, how best to measure such skills is still an active area of inquiry [185].

4.2.2.2 Rubric-Based Review

One promising approach to implementing holistic review is rubric-based review. Under this approach, applicants are evaluated based on a set of pre-defined criteria. By pre-selecting criteria, programs make clear to reviewers what is required for admission and provide a structure for assessing all applicants [46, 154]. This explicitness has been shown to enhance both validity and reliability [158, 186, 187].

In addition, rubrics can help make the admissions process more equitable [46]. By explicitly laying out the review criteria and what is required to achieve each level of the rubric, all applicants can be judged fairly and individual reviewers' expectations can be mitigated [183]. From research into other areas of academic hiring, we know that gender and racial biases exist in the hiring process, including in physics [188, 189]. Specifically in graduate admissions, faculty, including astronomy and physics faculty, have been documented showing preferences for applicants with backgrounds similar to their own or within the same research subfield of their discipline [46]. Thus, rubrics offer a possible route to counter those biases. Indeed, a recent study of admissions for a psychiatry residency program found that using rubric-based holistic review led to more underrepresented applicants receiving an offer to interview compared to the traditional approach [190], while a recent study of grade-school writing found that teachers rated writing lower when it was attributed to a Black author than when it was attributed to a white author but did not find the effect when the teachers were instructed to use a clearly defined rubric [191].

As rubric-based approaches to admission are still relatively new, best practices are still in development. Yet, a few recommendations do exist [154]. First, criteria should be selected before reviewing any applications, with individual programs deciding what qualities are critical for success in their program [138]. Second, rubrics should be coarse-grained, with only a few possible scores for each construct (such as low, medium, or high rather than 1-10), to limit disagreements over scores [183]. Third, each level of the rubric should be clearly defined so that a reviewer can easily determine which score an applicant should get on each construct. These levels should be picked so that each possible score will be received by many applicants [154]. Finally, these criteria and levels should allow for diverse forms of excellence to be counted as achievements so that applicants with non-traditional markers of excellence are not excluded [192]. For example, assessing an applicant's research abilities should go beyond their number of publications or number of years working in a research lab and could instead focus on what the applicant accomplished in the lab or what skills they have.

While rubric-based approaches have received little research attention in physics, they have been successfully incorporated into larger physics graduate program initiatives.
Two of the most well-known initiatives are the Fisk-Vanderbilt program, which graduates one of the largest classes of Black PhD physicists in the nation [193], and the APS Bridge Program, which has successfully admitted and retained graduate students of color at rates higher than the national average [143]. Even though rubrics in admissions were only one of many changes made, these programs suggest that rubric-based review has promise. For a more in-depth review of equitable admissions practices in STEM doctoral programs, we refer the reader to Roberts et al. [194].

4.3 Methods

4.3.1 Our Rubric and Applicant Evaluation Process

In 2018, the Department of Physics and Astronomy at Michigan State University introduced a rubric-based approach to evaluate applications to the graduate program in physics, informed by the Council of Graduate Schools' 2016 report on Holistic Review in Graduate Admissions [137]. The main goal was to improve the identification of strong candidates for the program and to make the selection more equitable, thereby increasing the participation of students from underrepresented groups in the department. In preparation for the introduction of the rubric, Casey Miller and Julie Posselt, the Inclusive Practice Hub Director and Research Hub Director, respectively, of the National Science Foundation-supported Inclusive Graduate Education Network [153], led a workshop with faculty who served at that time on the Graduate Recruiting Committee. This workshop resulted in a selection of five rubric categories, each of which had several subcategories. Applicants are rated with a score of either 0, 1, or 2, corresponding to low, medium, or high, for each subcategory, based on defined criteria for each score (see Appendix E). The subcategory scores are then averaged per category and the category scores summed (with the weights given below) to calculate the overall score (a minimal example of this calculation appears below). The categories, with subcategories in parentheses, are:

• Academic Preparation, with a weight of 25% (Physics coursework, math coursework, other coursework, and academic recognition and honors)

• Research, with a weight of 25% (Variety and duration, quality of work, technical skills, and research disposition)

• Non-cognitive competencies, with a weight of 25% (Achievement orientation, conscientiousness, initiative, and perseverance)

• Fit with program, with a weight of 15% (Fit with research programs of the department, fit to research programs of specific faculty, (prior) commitment to participation in the department/school community, and advocacy for and/or contributions to a diverse, equitable, and inclusive physics community)

• GRE scores, with a weight of 10% (General GRE scores and Physics GRE scores)¹

¹ This category was not used in 2021 and will not be used in 2022 due to the impacts of COVID-19 on students' ability to take these tests.

The choice of these categories and subcategories was based on the discussions in the workshop and advice from the workshop leaders, and included considerations based on experiences during previous recruiting cycles. Another consideration for the choice of the categories is a reasonably close alignment with criteria used at MSU for awarding fellowship packages to students. Therefore, the rubric scoring can also be used for selecting nominations for university fellowships. This is important because fellowship nominations are due shortly after the application deadline (January 1).
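To make the weighted scoring concrete, the sketch below (in R, the language used for the analyses in this dissertation) computes the overall rubric score for one hypothetical applicant. The example ratings and variable names are invented for illustration; only the 0-1-2 scale, the within-category averaging, and the category weights are taken from the description above.

```r
# Overall rubric score for one hypothetical applicant.
# Subcategories are rated 0 (low), 1 (medium), or 2 (high); subcategory ratings
# are averaged within each category and the category averages are combined
# using the weights listed above.

ratings <- list(
  academic     = c(physics = 2, math = 2, other = 1, honors = 0),
  research     = c(variety = 1, quality = 2, technical = 2, disposition = 2),
  noncognitive = c(achievement = 2, conscientiousness = 1, initiative = 2, perseverance = 2),
  fit          = c(research_fit = 1, faculty_fit = 1, community = 2, diversity = 0),
  gre          = c(general = 1, physics = 1)
)

weights <- c(academic = 0.25, research = 0.25, noncognitive = 0.25,
             fit = 0.15, gre = 0.10)

category_means <- sapply(ratings, mean)                      # average within each category
overall_score  <- sum(weights * category_means[names(weights)])
overall_score                                                # ranges from 0 to 2
```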
Applications for the graduate program are submitted to MSU's central application system. All folders with a complete or near-complete application package are reviewed. The applications are divided into several groups, each of which is reviewed by different members of the graduate recruiting committee. This committee has a rotating membership with representation from faculty in all major research directions present in the department. Committee members are instructed in the use of the rubric and provided with the criteria. As part of the review process, they also sort students by their interest in research area(s). The results from the rubric scoring are compiled by the Graduate Program Director. Students whose folders are nearly complete, but whose ranking does not rule out an offer, are contacted and asked to provide the missing information. If that additional information is provided, the rubric scoring is updated.

Subsequently, the spreadsheet is used by committee representatives from each major research area in the department to make a list of students they would like to make an offer to for a position in that specific research area. The number of students who are made an offer depends on the openings available per research area, the number of teaching assistant slots available, and the historical acceptance rates for each research area. Typically, the process results in a list of offers that will be made and a wait list for additional offers that can be made if recruiting targets are not met in the initial round of offers. In this stage of the recruiting process, the match to available positions is revisited, as committee members from specific research areas are better aware than general faculty members of the recruiting needs for that year. In spite of the instruction and criteria provided to reviewers, the scoring is still somewhat subject to differences in reviewing styles and interpretation of the criteria. This is, for example, apparent in a comparison of average summed scores per reviewer. Therefore, this second stage of the review process also allows for another comparison of applications based on the rubric by a few faculty members in each research area. For these reasons, the list of students to whom an offer will be made, or who are put on a wait list, quite closely follows the original rubric scoring, but modifications do occur.

The whole process is organized and overseen by the Graduate Program Director with support from the Graduate Program Secretary. The Graduate Program Director also serves as the point of contact for questions about the use and interpretation of the rubric, reviews applications of likely candidates, and leads the selection of nominations for fellowships. The overall response from faculty who served on the recruiting committee and used the rubric has been positive, as it provides clear guidance for the review process and reduces the impact of differing reviewing styles and of individual biases about which skills are most important for applicants to a physics graduate program. On average, the time spent by individual committee members on reviewing the folders has not increased. Faculty reviewers have provided feedback that it would be better if applicants were first sorted by research area so that the review is done by several faculty from the relevant research areas in the first step. Given the large number of applications and the limitations of the current software used to manage applications, this could not easily be accomplished in the past.
MSU is implementing new software for managing and reviewing applications, which will make presorting applications by research area possible, leading to a considerable increase in the efficiency of the process.

4.3.2 Participants and Data Collection

Data for this study come from compiled records of applicants to our physics graduate program for fall 2018, 2019, and 2020. Most admissions decisions for fall 2020 had already been made before coronavirus accommodations took effect, suggesting at most minimal effects on our data. When applying to the university, applicants submit a general university application, transcripts, test scores, a personal statement, an academic or research statement, and letters of recommendation to a central system. As the current admissions system does not allow for records to be compiled across applicants, two researchers manually extracted the relevant information for this study. The researchers independently extracted data from the first 20 applications and then compared results to ensure they were interpreting the applications in the same way and to agree on conventions for reporting the data. Afterwards, the researchers independently went through the rest of the applications. Through this process, the researchers collected each applicant's demographics, grade point average, GRE scores, degrees earned or in progress, and previous institutions attended. Any information missing from the applications or entered into the application on a non-standard scale (e.g., a GPA on a non-4.0 scale or a GRE score outside of the current scoring range) was treated as missing data for the analysis. As rubric scores are determined by faculty and are not part of the materials applicants submit, aggregated scores were then matched with individual applicants using the applicant IDs. Through this process, we collected data on 826 applicants, 511 of whom were domestic applicants.

4.3.3 Analysis

Because of different application requirements and the availability of institutional data for international and domestic students, we only include domestic students in our study. In addition, we only include applications sufficiently complete that faculty were able to rate them and that were included in the records compiled by the Graduate Program Director, leaving us with 321 domestic applicants for this study.

For our analysis, we were interested in how faculty rate applicants and hence, we computed the fraction of applicants in each level (low, medium, and high) of the rubric. In some cases (<5%), faculty used a rating that was in between levels (e.g., low-medium). Because of this, we performed all subsequent analyses by first rounding up (so low-medium would become medium) and then repeating the analysis by rounding down. First, we computed the fraction of applicants in each level of the rubric for all applicants, all admitted applicants, and all non-admitted applicants. Second, we compared applicants based on demographics by comparing the fraction of applicants in each bin of the rubric. While gender would be more appropriate, the application system only asks applicants about their sex and allows them to choose male or female. Thus, we were only able to compare faculty ratings of males and females. We acknowledge that "females" is not the correct term to use, but as marking female as one's sex does not automatically imply being a woman, we do not believe it is appropriate to assume that someone marking female as their sex is necessarily a woman.
In terms of race, the application system does not allow applicants to enter their race or ethnicity, so we are unable to compare applicants of different races. Finally, we compared applicants from different undergraduate backgrounds because prior work suggests the applicant's background may influence faculty's perceptions of them. For example, faculty may prefer applicants with backgrounds similar to their own [46] and may interpret grade point averages in the context of the applicant's undergraduate program, with high GPAs from more prestigious universities carrying more "weight" than a high GPA from a lesser-known school [117]. In addition, graduate admissions in physics has been characterized as "risk-averse," with faculty preferring to admit applicants who are likely to complete their program rather than take chances on someone who might not [46, 48]. As students from smaller programs may be viewed as higher risk if previous students from those programs struggled [117], it is possible faculty may be less likely to admit students from smaller undergraduate schools.

To characterize an applicant's undergraduate background, we used two measures. First, we used Barron's value, which is a measure of an institution's selectivity based on incoming students' SAT scores, GPA and class rank, and overall acceptance rates. While selectivity is not equivalent to prestige, we treat it as a proxy for prestige, based on the assumption that the most selective institutions are also prestigious institutions. For our analysis, we defined institutions with Barron's values of "most competitive" or "highly competitive" as selective and all other institutions as not selective. Second, we used the number of bachelor's degrees awarded by the physics department at the applicant's undergraduate institution to estimate the size and visibility of the department, with the assumption that a department that grants more degrees is more likely to be known by an admissions committee member. Due to variability in yearly degrees, we used the median number of degrees over the 2016-2017, 2017-2018, and 2018-2019 academic years as the number of bachelor's degrees awarded [3, 128, 129]. We then defined any program that was in the top quartile of physics bachelor's degrees awarded during that period as a large program and all other programs as smaller programs. For reference, the programs we classified as large produced nearly two-thirds of all physics bachelor's degrees over the period.

To perform the comparisons in all cases, we used Fisher's Exact Test to examine whether the rubric score was associated with any of the metrics of interest (admission status, sex, institution selectivity, institution size). We used the standard choice of α = 0.05 to judge claims of statistical significance. Because we did 18 comparisons for each metric of interest, it is likely that at least one would be a false positive by chance. Therefore, we used the Holm-Bonferroni procedure to correct the p-values for multiple comparisons, as it is less conservative than the traditional Bonferroni correction while maintaining statistical power [195]. For cases of missing data, we used pairwise deletion so that we could make the most use of the data we had. While Nissen et al. recommend using multiple imputation for missing data in physics education research studies [71], the goal of this paper is to understand what faculty did as opposed to estimating a larger trend or predicting an outcome. Therefore, we do not believe that using multiple imputation is aligned with the goal of this paper. The percent of missing data per rubric construct is shown in Table 4.1.
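The comparison procedure described above can be sketched in a few lines of R. The data below are simulated stand-ins for the real ratings (the actual construct names and group labels differ); the sketch only illustrates the mechanics of cross-tabulating rubric levels against a grouping variable, running Fisher's Exact Test, and applying the Holm-Bonferroni correction across the 18 constructs.

```r
# Simulated example of the per-construct comparisons described above.
set.seed(42)
n <- 321
constructs <- paste0("construct_", 1:18)

ratings <- as.data.frame(
  replicate(length(constructs),
            sample(c("Low", "Medium", "High"), n, replace = TRUE)),
  stringsAsFactors = TRUE
)
names(ratings) <- constructs
ratings$admitted <- sample(c("No", "Yes"), n, replace = TRUE)

# Fisher's Exact Test per construct; table() drops missing values, which
# mirrors the pairwise deletion used in the analysis.
p_raw <- sapply(constructs, function(con) {
  fisher.test(table(ratings[[con]], ratings$admitted))$p.value
})

# Holm-Bonferroni correction across the 18 comparisons
p_holm <- p.adjust(p_raw, method = "holm")
names(which(p_holm < 0.05))   # constructs that remain significant
```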
4.4 Results

The results are largely unchanged based on whether we rounded up or rounded down when a faculty member gave a rating in between levels of the rubric, so we present only the rounded-up results here.

When we examine the faculty's ratings of all applicants in Fig. 4.1, we notice two overarching trends. First, for traditional measures of academic success such as grades and test scores, faculty tend to rate applicants using all three levels of the rubric. For the academic preparation constructs on the rubric, "high" is the most common rating given by faculty. However, in terms of math and physics course grades, around 25% of applicants still scored in the low bin. Of the academic preparation constructs, academic honors follows a different structure than the others: faculty ratings are bimodal, meaning that applicants either had no academic honors or had multiple academic honors.

Table 4.1: Percent of Missing Data by Rubric Construct

Rubric Construct                 Percent Missing
Physics Coursework               20.0
Math Coursework                  20.2
All Other Coursework             20.2
Academic Honors                  22.1
Variety/Duration of Research      3.4
Quality of Work                   4.4
Technical Skills                  4.1
Research Dispositions             4.7
Achievement Orientation           4.4
Conscientiousness                 4.4
Initiative                        4.0
Perseverance                      4.4
Alignment of Research             7.2
Alignment with Faculty           32.1
Community Contributions           4.0
Diversity Contributions           3.4
General GRE Scores                2.2
Physics GRE Score                 2.5

Second, for the research, noncognitive, and fit constructs, faculty rarely used the "low" level of the rubric, with only three of the twelve constructs in those categories having more than 10% of applicants earning a "low." For research, the most common rating was "high," while for the noncognitive traits, the most common rating varied between "high" and "medium." In terms of the fit constructs, most applicants were rated as either "medium" or "high" for alignment of research, alignment with faculty, and community contributions. In contrast, for the diversity contributions construct, "low" was the most common rating, meaning that many applicants did not discuss how they promote or advocate for diversity in their applications.

When looking at how faculty rate applicants who would later be admitted compared to applicants who would not be admitted, we see statistically significant differences in the distribution of all ratings (Fig. 4.2). Overall, admitted applicants tended to be rated "high" on each construct while non-admitted applicants tended to be rated "medium" on each construct. There were a few exceptions to the general trend, however. For academic honors, diversity contributions, and physics GRE scores, most admitted students were not rated as "high" and 25% of applicants received a "low" score, while for all other coursework, variety/duration of research, and general GRE scores, most non-admitted applicants were rated as "high."

[Figure 4.1: Faculty ratings of domestic applicants on 18 constructs. In the plot, a larger, darker circle means that more applicants are in that bin. While many applicants are in each level of the academic preparation and test score constructs, few applicants are in the "low" bin of the research, noncognitive skills, and program fit constructs.]
When looking at the ratings broken down by sex (Fig. 4.3), we notice that, for both males and females, the results tend to follow the overall patterns: all three levels are used for the academic success and test score constructs, and mainly "medium" and "high" ratings are used for the research, noncognitive skills, and fit with the program constructs. Comparing ratings between males and females, we find that only physics GRE score, community contributions, and diversity contributions showed statistically significant differences. While males tended to score higher on the physics GRE score, females tended to score higher on community contributions and diversity contributions.

[Figure 4.2: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant was admitted. The distribution of ratings of all constructs is statistically different for admitted applicants compared to non-admitted applicants. Overall, most admitted applicants were rated "high" while most non-admitted applicants were rated "medium."]
[Figure 4.3: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant was male or female. Only three of the constructs showed differences between males and females: physics GRE score, where males scored higher, and community contributions and diversity contributions, where females scored higher.]

[Figure 4.4: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a more selective or less selective undergraduate university. Only the general GRE and physics GRE scores showed differences.]
[Figure 4.5: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a university with a larger or smaller physics program. Only the physics GRE score and conscientiousness showed differences between the groups of applicants, with the latter dependent on how a larger physics program is defined.]

As we elaborate on in the discussion, differences in these three constructs do not necessarily mean that faculty are rating males and females differently but instead may be documenting inequities that already exist.

Likewise, when looking at the ratings broken down by the selectivity of the university where the applicant earned their bachelor's degree (Fig. 4.4) or the size of the department where they earned their bachelor's degree (Fig. 4.5), we may also be observing existing inequities reflected in the faculty ratings. For example, applicants from more selective universities only had statistically higher ratings than applicants from less selective universities on the general GRE and physics GRE scores. Similarly, applicants from larger programs had statistically higher ratings on the physics GRE score than applicants from smaller programs did. However, applicants from larger programs were also rated higher on conscientiousness than applicants from smaller programs, though this result is sensitive to how we define a large program. While we could consider interactions between admission status and sex, institutional selectivity, or physics program size, we did not do so given the small sample sizes. For completeness, however, we include those plots in Appendix F.

4.5 Discussion

How do faculty assign rubric scores to applicants and how do those scores differ between admitted and rejected applicants? For academic achievement and test scores, faculty tended to use all three levels of the rubric when assigning scores to applicants. In contrast, faculty tended to use mainly "medium" and "high" when assigning scores to applicants in the research, noncognitive skills, and program fit categories. We argue that this result is more a reflection of the rubric than of how faculty are using the rubric.
Given that grades and test scores are well defined via transcripts and score reports, the rubric constructs measuring them tended to use quantitative criteria to determine which score the applicant would receive. That is, a high test score or high grades would correspond to a "high" rating while a low score or low grades would correspond to a "low" rating. Additionally, as the courses required for a physics degree tend to be similar regardless of the specific program, most applicants will have taken the courses mentioned in the rubric and hence, faculty can rate applicants based on those grades. In contrast, the research, noncognitive skills, and fit with department constructs are less well defined and instead depend on what applicants write in their statements and what information the letters of recommendation contain. This means that not all constructs on the rubric may necessarily be addressed. For example, if an applicant takes quantum mechanics it will certainly appear on their transcript, but if that applicant was also active in departmental service activities, it may not be reflected in any part of the application. As a result, the rubric needs to take into account that an applicant may not display a trait either because they do not exhibit it or because they did not mention it (the instructions for the applicants' statements ask them to address multiple topics that map onto various rubric constructs). Any display of the trait would then place the applicant above the "low" level of the rubric, which would explain why faculty tended to use only the "medium" and "high" ratings.

A reasonable follow-up question is then whether combining "no evidence" with "evidence not presented" as a single level on the rubric represents an issue with the rubric. We argue that it does not, as it provides the best option given the data faculty have available. Applicants are asked to discuss certain topics in their statements that map broadly onto the rubric constructs, but that does not necessarily mean they will. While interviews could be useful in separating "no evidence" cases from "evidence not presented" cases, we worry these would increase admissions committee members' workload.

In terms of comparing admitted and non-admitted applicants, all 18 rubric constructs showed statistically significant differences. Given that the goal of the rubric is to aid faculty in determining whom to admit, we would expect the rubric to show such differences. That all rubric constructs show differences suggests all parts of the rubric are useful for determining whom to admit.

How do the scores assigned by faculty differ by applicant's sex? We found only three constructs on the rubric that showed sex differences: physics GRE score, community contributions, and diversity contributions. Given known scoring gaps on the physics GRE [52], it is not surprising that males are rated more highly than females on the physics GRE score. Given that females perform larger amounts of service work in academia [196], it is also not unexpected that constructs measuring such contributions would show a difference between sexes. Because the constructs that show sex differences are related to effects documented in the literature, we believe that the rubric is reflecting inequities that already exist rather than creating additional ones. Therefore, we conclude that the rubric is not providing an advantage to male or female applicants. Additionally, the constructs of the rubric that do not show differences between sexes also align with what we would expect based on the literature.
The result that physics and math GPA did not differ by sex aligns with the findings of [156], and the result that noncognitive skills did not differ by sex aligns with the general finding that noncognitive skills do not appear to depend on demographics [180, 183].

How do the scores assigned by faculty differ by the type of institution the applicant attended? When we compared applicants based on whether their undergraduate institution was a more or less selective institution, we found that the only constructs that showed differences were the general GRE and physics GRE scores. This result aligns with the results of our previous work investigating physics GRE scores by undergraduate institution type [69, 197]. We note that if we instead define more-selective universities to include large state universities, such as Michigan State University, University of Colorado Boulder, and University of Washington, our results are unchanged. This redefinition is equivalent to considering Barron's values of 1-3 as more selective and everything else as less selective, compared to the definition of more selective as Barron's values of 1 and 2 in the Methods and Results sections.

The interpretation of the results when comparing applicants from larger or smaller physics departments is less straightforward because the results do depend on how we define "larger" and "smaller" departments. When we define larger programs as those that ranked in the top quartile of physics bachelor's degrees granted, as measured by the median number of degrees awarded over the last three years of available data, and round up in-between ratings, we find that the physics GRE score and conscientiousness showed differences between applicants from larger and smaller programs. However, if we instead round down on in-between ratings, only the physics GRE score showed a difference between applicants from larger and smaller programs. Furthermore, alternative definitions of "larger programs" also produced varying results. One could also have reasonably defined "larger" to mean 1) in the top half of physics bachelor's degrees granted as measured by the median number of degrees awarded over the last three years, 2) in the top quartile of physics bachelor's degrees granted as measured by the total number of degrees awarded over the last three years, or 3) in the top half of physics bachelor's degrees granted as measured by the total number of degrees awarded over the last three years. When we also consider rounding up or rounding down in-between ratings, we could make various combinations of physics GRE score, general GRE score, physics coursework, and conscientiousness show a statistically significant difference. The only rubric construct that always showed a statistically significant difference, regardless of how we defined "larger programs," was the physics GRE score. Therefore, the results suggest that applicants from larger physics programs score higher on the physics GRE than applicants from smaller programs do, but the results are inconclusive as to whether other areas of the rubric might show differences based on the size of the physics program the applicant attended. One area that unexpectedly did not show differences, regardless of how we defined "larger program," was the research section. It is often assumed that students at larger programs have more opportunities to engage in research than students at smaller programs. Yet, even if that is true, it does not appear to be reflected in the rubric scores.
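Because the conclusions about program size depend on how "larger program" is defined, the kind of robustness check described above can be automated. The sketch below re-runs the comparison for one construct under four definitions (median versus total degrees over three years, top quartile versus top half); all data and column names are simulated placeholders, not the actual degree counts or faculty ratings.

```r
# Robustness check over definitions of "larger program" (simulated data).
set.seed(1)
n <- 300
apps <- data.frame(
  deg_2017 = rpois(n, 15), deg_2018 = rpois(n, 15), deg_2019 = rpois(n, 15),
  pgre_rating = sample(c("Low", "Medium", "High"), n, replace = TRUE)
)

classify_large <- function(d, summary_fun, cutoff) {
  size <- apply(d[, c("deg_2017", "deg_2018", "deg_2019")], 1, summary_fun)
  size >= quantile(size, cutoff, na.rm = TRUE)   # TRUE = "larger program"
}

definitions <- expand.grid(fun = c("median", "sum"), cutoff = c(0.75, 0.50),
                           stringsAsFactors = FALSE)

definitions$p_value <- mapply(function(fun_name, cutoff) {
  large <- classify_large(apps, get(fun_name), cutoff)
  fisher.test(table(apps$pgre_rating, large))$p.value
}, definitions$fun, definitions$cutoff)

definitions   # one p-value per definition of "larger program"
```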
4.6 Limitations

Our study has four main limitations. First, our study does not include many disadvantaged groups in higher education who might not have the same opportunities as their more privileged peers and hence, may score lower on the rubric. While race is the most obvious omission due to the way our university interprets Proposal 2, our study also does not include a comparison of low-income applicants to higher-income applicants or first-generation applicants to continuing-generation applicants. Additionally, the size of our study does not allow us to explore intersections of these identities and where possible inequities may lie. As Rudolph et al. note, analyses of small sub-groups have insufficient statistical power and could lead to invalid inferences [138]. Hence, we refrained from performing such analyses in this paper.

Second, our data only contained ratings from the initial reviewer and none of the ratings from later reviews. As a result, we were unable to look into differences in how individual faculty members use the rubric. For example, it is possible that one faculty member might systematically rank applicants lower on the rubric constructs than a different faculty member might. With only a single rating per application, we are unable to disentangle differences in faculty ratings and differences in the applicants the faculty members reviewed.

Third, this study included only a single program. Under a more traditional graduate admissions system, physics has been called a "high consensus" discipline [46], meaning that physics faculty tend to agree on what a "quality" applicant is and therefore, a single department's admissions process would be more or less representative of graduate admissions processes in physics. When switching to rubric-based admissions, we cannot necessarily make that same claim. As our rubric was created based on what faculty value, it is not unreasonable to assume that the results would generalize to other departments that also use rubric-based admissions. However, until such processes are evaluated at other departments, we cannot make such a claim.

Fourth, as a result of using only one program, the applicants are likely not representative of the larger population. The data in this study come from 1) people who applied to our program and 2) applicants who had a nearly complete application. Thus, if we consider those with an interest in attending physics graduate school as our population, we first selected on those who applied to graduate school, then selected on those who applied to our program, and finally selected on those who provided enough information in their applications for faculty to evaluate. At each step, we are excluding some of the larger population and thus our claims cannot necessarily be expected to hold for the larger population of potential applicants. For example, anecdotal evidence suggests minoritized applicants are more likely to not complete their applications than majoritized applicants are.

4.7 Future Work

As noted in the limitations, our study only compared rubric scores of males and females and of applicants from larger or more selective programs with applicants from smaller or less selective programs. Future work could then explore how rubric-based admissions may impact other historically and currently underrepresented groups in physics such as Black, Latinx, or Indigenous applicants. Racism, and specifically anti-Black racism, is still prevalent in physics [198–202] and therefore might be reflected in rubric-based admissions.
While physics faculty tend to think of diversity mainly in terms of race [46], we acknowledge that diversity is broader than race, and studies of equity around the rubric should also consider first-generation applicants, low-income applicants, disabled applicants, and veterans. Studies of undergraduate admissions suggest that when extracurriculars and subjective assessments of character and talent gleaned from essays and recommendations are added to the admissions process, existing inequalities may increase [203] and these applicants may become further disadvantaged in the admissions process. Therefore, future work should ensure that rubric-based admissions do increase equity rather than simply provide a new tool to perpetuate existing inequities.

Second, future work should examine how the use of rubrics may affect what parts of an application drive the admissions process. In our prior work, we found that the physics GRE and grade point average were the main drivers of the admissions process [65]. Given the rubric is designed to emphasize more than just grades and test scores, we would hope to see these factors deemphasized under the rubric system. Such a result would suggest that the rubric is fundamentally changing how faculty are reviewing applicants. We explore whether that is the case in Chapter 5.

Third, future work could examine how rubric-based admissions may change the type of applicant admitted and student outcomes. Faculty skeptical of holistic admissions may worry that by deemphasizing grades and test scores, their program is admitting less academically prepared students. Future work can explore if these fears have any merit. Research at the undergraduate level on holistic admissions has found that adding noncognitive traits increased graduation rates, especially among those from disadvantaged backgrounds [204]. At the graduate level, a study of a materials science and engineering program found that after changing their admissions to include noncognitive skills, their incoming students won more university fellowships, though the authors cautioned they could not attribute the increase in fellowships solely to their changes in admissions [205]. Thus, evidence from outside of physics suggests that these fears may be unfounded, but we will not know for sure until physics-specific studies are conducted.

Additionally, future work can examine noncognitive skills in physics more broadly. Physics has been characterized as a brilliance-dominated field [59] and hence, it is not surprising that most studies of success in physics have also focused on cognitive measures such as grades, exam scores, and standardized test scores. While such studies could be useful at all levels of physics, studies at the graduate level are especially important given the limited number of studies exploring the usefulness of noncognitive skills for predicting success in graduate school [138].

Finally, future work around equity in graduate admissions should investigate who is invited to apply to graduate school in the first place, what barriers those who do not apply but wish to do so encounter, and how those barriers may be removed. In previous work, Cochran et al. investigated what barriers applicants to physics graduate school, via the APS Bridge Program, perceived, finding that GRE scores, lack of research experience, low GPA, program deadlines, and application costs were common concerns [49].
Unless we also work to make the application process more equitable, making the evaluation process more equitable will not result in large-scale changes in equity at the graduate level.

4.8 Recommendations for Departments

The results of this study suggest a general recommendation to implement rubrics in physics graduate school admissions. Rubrics can aid the review of applications by standardizing the process and limiting bias, and using rubrics does not appear to increase the time needed to review applications. Of course, simply using a rubric will not result in changes unless it is implemented well. We therefore propose three more specific recommendations.

First, we recommend that admissions committees have multiple members review each application. For a well-constructed rubric, there should be limited uncertainty as to what rating an applicant will receive. However, for constructs that are more subjective in nature, faculty may have differing opinions about what counts as achieving each level. For example, for the quality of work construct on our rubric, what counts as "making significant contributions to the project" might vary based on the reviewer. Therefore, having multiple reviewers can reduce potential bias when reviewing applications.

Second, following the call of others [194, 206, 207], we recommend that members of the admissions committee be from diverse backgrounds and representative of the applicant pool. To accomplish that, departments might also consider adding non-tenure-stream faculty, post-docs, and current graduate students to their admissions committees, providing appropriate recognition and compensation as necessary. Prior work has shown that faculty may prefer to admit applicants like themselves [46] and therefore, a representative admissions committee is needed to ensure that minoritized applicants are given equal consideration.

Finally, we recommend that departments conduct regular self-studies of their graduate admissions processes and share the results. While Rudolph et al. have previously called for departments to conduct self-studies of their admissions processes [138], we believe it is equally important to share the results of those self-studies so that the physics community can know what is and what is not working. This collective knowledge can then be used by all to improve graduate admissions in physics for everyone. For the sharing of results to be impactful, however, the results must be easy to access and easy to understand. While individual departments could post their results on their websites, we believe doing so adds an extra layer of complexity and makes the results harder to access. Instead, we advocate for a centralized system so that departments can easily report their data in a standardized way and practitioners can easily see and compare results across programs. Such a system could be maintained by professional societies such as the American Physical Society or the American Institute of Physics, or by other organizations. A system like this has been designed for research-based assessments [208], but to our knowledge, there exists no such system for graduate admissions.

However, when conducting such a self-study of what is working well and what is not, it is important to consider the question "working well for whom?" As Razack et al.
note, "working well" depends on one's social positioning [192] and therefore, a change that works well for applicants of one background may not be working for applicants of a different background. By considering the "for whom?", the physics community can ensure that changes made are for the benefit of all rather than new methods to continue the existing exclusionary practices in graduate admissions.

4.9 Conclusion

In this paper, we demonstrated that rubric-based admissions are a promising avenue for increasing equity in graduate admissions. We showed that faculty ratings of applicants' grades, research experiences, and noncognitive abilities do not differ based on the applicant's sex or undergraduate background. The differences we did observe in faculty ratings could be explained as reflecting known systematic issues in physics regarding test scores and service work expectations. Based on the results of this study, we recommend that departments use rubric-based holistic review for their graduate admissions process. Multiple people should review each application, and those people should be representative of the applicant pool to limit any bias in the review process. Finally, departments should engage in self-study to see how their graduate admissions process is working and share those results so that the physics community can collectively learn what is and is not working in making graduate admissions more equitable.

CHAPTER 5

A "NEW APPROACH" OR THE SAME APPROACH IN NEW PACKAGING?

This chapter is being drafted as a journal article. The working manuscript version includes Marcos D. Caballero as the second author. Following the Contributor Roles Taxonomy (CRediT) [76], my roles for this project include conceptualization, formal analysis, methodology, software, validation, visualization, and writing the original draft.

5.1 Introduction

Physics departments are increasingly interested in making graduate admissions more equitable, and rubric-based holistic review has been gaining traction as a possible route to do so. Unlike the traditional approach that emphasizes test scores and grades (see Chapter 2), rubric-based admissions extend the criteria of interest to include noncognitive competencies, fit with the department, and research accomplishments. In theory, all applicants are evaluated on the same set of explicit criteria, so the process should be fairer and less biased [183].

In Chapter 4, we introduced our department's approach to rubric-based admissions. The results suggest that our rubric is equitable between male and female applicants and with respect to the size and selectivity of applicants' undergraduate institutions. However, just because the rubric is more equitable does not mean that it made our admissions process more equitable or even changed the factors that drive our admissions process. That is, the rubric might just be a new tool for carrying out the same process.

The goal of this chapter is then to go beyond comparing rubric scores for different applicants and instead consider the admissions process as a whole. Specifically, we ask: how did the introduction of the rubric change our program's admissions process? To operationalize that question, we ask two research questions:

1. How do admissions models before and after the implementation of the rubric differ in terms of predictive ability and meaningful features when our models are based on the data contained in applications?

2. How does using the data produced by faculty when rating applicants using the rubric affect our ability to create admissions models?
To answer these questions, we compare admissions models of the current process, using data from both the faculty ratings and the applications, to the models of the program's initial process that we generated in Chapter 2. From Fig. 2.3, we notice that there are cases where applicants have similar physics GRE scores and GPAs, yet one applicant is accepted while the other is not. Given that cases such as these might add additional challenges to modeling the data, removing such applicants might allow us to better characterize the general trends in the data. We therefore consider a new approach that detects similar applicants with different admissions outcomes and removes them from the data set: Tomek Links [209]. We then ask a third research question:

3. How does using Tomek Links affect our ability to model the admissions data, both before and after the implementation of the rubric?

5.2 Background

While admissions committees use common criteria for initially judging applicants, deliberations over borderline applicants under the traditional process might come down to subtle distinctions that were not applied to other applicants [46]. From a modeling perspective, this means that some applicants might be assessed according to additional criteria and hence, these borderline applicants might not be easily classified by a general model of the admissions process. As a result, including these borderline applicants might cause our model's performance to suffer, while excluding them and instead focusing on more typical applicants could improve model performance. Unfortunately, whether an applicant is a borderline applicant is not recorded in faculty ratings of applicants and hence, we do not know who the borderline applicants are.

To determine who might be a borderline applicant, let us assume there is a predictive model of a graduate admissions process that perfectly separates those who are admitted from those who are not admitted in some n-dimensional application space. We could then say that those applicants who are near the n-dimensional boundary that separates the admitted applicants and not-admitted applicants are borderline applicants. To differentiate borderline applicants in the admissions process from borderline applicants in the modeling process, we will refer to the latter as boundary applicants. Such a definition of boundary applicants is similar to Hoens and Chawla's definition of borderline cases in classification, which are cases where a small change in the features would cause the classification boundary to shift [210].

However, such an approach assumes that those who are admitted and not admitted can be cleanly split in some n-dimensional space and are not intermixed. For a variety of reasons, an applicant with a stellar application might be rejected or an applicant with a weaker application might be admitted, and hence an admitted applicant might fall on the not-admit side of the separating boundary or vice versa. While these applicants might not be borderline in the traditional sense, their admission decision likely would have required deliberation and hence, might have gone through a similar process as a borderline applicant. We should therefore also consider these applicants as borderline applicants in the sense that they might hurt our model's performance.
Perhaps more accurately, we should refer to these applicants as noise applicants, following Hoens and Chawla’s definition of noise cases, which are cases that result from random variation and are not representative of the underlying pattern [210].
While we have operationalized borderline applicants in terms of a model as boundary applicants and noise applicants, we still need a method to determine which applicants these are before constructing any models. Tomek Links offer one possible method, as they identify the boundary or noise cases in a data set [209]. To identify the Tomek Links in a data set, the distances between all cases in the data set are computed. Using the distances, the nearest neighbor of each case is found. Two cases, e.g., case 1 and case 2, form a Tomek Link if and only if case 1 is the nearest neighbor of case 2, case 2 is the nearest neighbor of case 1, and case 1 and case 2 are of different classes. The only way for these conditions to be fulfilled is if case 1 and case 2 are boundary cases or if case 1 or case 2 is a noise case [210]. Therefore, Tomek Links allow us to identify boundary applicants and noise applicants in our data. An example of this approach in practice is shown in Fig. 5.1.
Figure 5.1: Plot A shows Fig 2.3 with the Tomek Links marked. Filled points represent Tomek Links. Plot B shows the same plot after the Tomek Links have been removed.
While Tomek Links have been successfully used in other contexts (e.g., see [211–213]), these approaches have tended to use data augmentation in conjunction with Tomek Links. While data augmentation approaches are valid from a modeling perspective, they might be questionable from an ethics and policy perspective. For example, altering the data set might lead to a model that is highly inaccurate with respect to the underlying process [214]. For our data set, using data augmentation is analogous to creating applicants, and thus our conclusions about how our admissions process might or might not have changed would be based on both real and imaginary applicants. For this reason, we will not use data augmentation.
As we note in our methods, we do impute our data. Readers may view this as a contradiction of the previous paragraph, but we view data imputation and data augmentation as different. Data imputation uses the existing data to fill in the missing values. In the case of multiple imputation, which we use in this study, the filling in happens multiple times in multiple ways so that the results represent the average result across many possible ways the complete data set might have looked. In contrast, data augmentation uses the existing data to create new data rather than fill in “holes” in the data. More generally, data imputation estimates the results as if we knew the values of the missing data, while data augmentation creates new data to simulate a bigger data set.
5.3 Methods
5.3.1 Data
Data for this study comes from applications to our graduate program to enroll in Fall 2014 through Fall 2020. Applicants submitted general and physics GRE scores, transcripts, a personal statement, a research statement, and letters of recommendation.
Starting with the cohort that began our program in Fall 2018, the admissions committee used a rubric to rate applicants on 18 criteria. Those scores are also included in our data. Further details about the data from fall 2014 through fall 2017 are discussed in Chapter 2, and further details about the data from fall 2018 through fall 2020 and the rubric are discussed in Chapter 4.
As a convention, I will refer to data collected before the implementation of the rubric (fall 2014 - fall 2017) as data set 0, following the convention of using “naught” for the initial time in physics, and data collected after the implementation of the rubric (fall 2018 - fall 2020) as data set 1, following the convention of using “1” to mean the next time the thing was measured. Furthermore, data in data set 1 that comes from the applications will be referred to as data set 1a, while data that comes from the faculty ratings using the rubric will be referred to as data set 1b. These are summarized in Table 5.1.
Table 5.1: The three models compared in this chapter and the data that went into each
    Data Set 0: Information pulled from the applications before our department implemented a rubric (2014-2017). Features are shown in Table 2.1. Results are reported in Section 2.3.
    Data Set 1a: Information pulled from the applications after our department implemented a rubric (2018-2020). Uses the same features as model 1. Results are reported in Section 5.4.1.
    Data Set 1b: Rubric ratings generated by faculty as they evaluated applications (2018-2020). Features are shown in Table 4.1. Results are reported in Section 5.4.3.
5.3.2 Modeling
To model our data, we used the cforest algorithm [92, 96, 100] in R [99]. As in Chapter 2, we used 70% of our data to train the model, 500 trees to build the forest, and √p of the p features to construct each tree. We ran each model 30 times, selecting a new training and test set each time, and averaged the results over the runs. For each trial, we calculated the training AUC, testing AUC, testing accuracy, null accuracy, and the permutation AUC importances. At this stage of the analysis, missing data was handled using the default cforest procedures [102].
For data sets 0 and 1a, the same features were used as in Table 2.1, with the size of the physics program factors updated with new data for the post-rubric models. For data set 1b, all features were treated as categorical (0, 1, or 2) and, as in Chapter 4, any values between rubric levels were rounded up.
For both data sets 1a and 1b, we also varied our choice of hyperparameters to determine if our conclusions depended on our modeling choices. As in Chapter 2, we set the training fraction to be either 0.5, 0.6, 0.7, 0.8, or 0.9; the number of trees in the forest to be 50, 100, 500, 1000, or 5000; and the number of features used for each tree to be 1, √p, p/3, p/2, or p, for a total of 125 possible combinations. Each model was grown using the same procedure listed above.
To account for correlations among the features, we also calculated the conditional importances for both data sets 1a and 1b. We used the MICE algorithm with the default choices [103] to impute missing data, calculated the conditional importances, and then pooled the results using Rubin’s Rules [104].
To compute the Tomek Links, we used the TomekClassif function in the UBL package [215]. We first used MICE to impute the data before calculating the Tomek Links using the function defaults, with the exception of the distance metrics. Following the recommendation of the package’s documentation, we used the HVDM distance for data sets 0 and 1a because those data sets contain both categorical and continuous data, and we used the Overlap distance for data set 1b because all features were categorical.
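To make these steps concrete, the following is a minimal sketch of how such a pipeline could look in R. It is not our analysis code: the data frame apps, its binary outcome column admitted, and the specific options are hypothetical placeholders, and package defaults may differ across versions.
    library(party)   # cforest, cforest_unbiased
    library(pROC)    # auc
    library(mice)    # multiple imputation
    library(UBL)     # TomekClassif

    set.seed(1)
    # `apps` is a hypothetical data frame with a factor outcome `admitted` and a
    # mix of continuous and categorical application features.
    p     <- ncol(apps) - 1
    idx   <- sample(nrow(apps), size = round(0.7 * nrow(apps)))
    train <- apps[idx, ]
    test  <- apps[-idx, ]

    # Conditional inference forest: 500 trees, sqrt(p) features per tree
    cf <- cforest(admitted ~ ., data = train,
                  controls = cforest_unbiased(ntree = 500, mtry = floor(sqrt(p))))

    # Testing AUC, testing accuracy, and the null (majority-class) accuracy
    prob_admit <- sapply(predict(cf, newdata = test, type = "prob"),
                         function(pr) pr[2])
    test_auc <- auc(test$admitted, prob_admit)
    test_acc <- mean(predict(cf, newdata = test) == test$admitted)
    null_acc <- max(table(test$admitted)) / nrow(test)

    # Tomek Links: impute with MICE first, then drop the linked cases using the
    # HVDM distance for mixed data ("Overlap" for the all-categorical data set 1b)
    completed <- complete(mice(apps, m = 1, printFlag = FALSE))
    cleaned   <- TomekClassif(admitted ~ ., completed, dist = "HVDM", rem = "both")[[1]]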
After removing the Tomek Links, we ran each model 30 times and averaged results. Results were then pooled using Rubin’s Rules.
5.4 Results
5.4.1 Data Set 1a
Across the 30 runs, the average accuracy of our model predicting on the held-out data was 71.4% ± 0.6%, the average training AUC was 0.720 ± 0.004, and the average testing AUC was 0.626 ± 0.006. Our null accuracy was 66.0%, which suggests that our model is only doing slightly better than if it were to predict everyone was not admitted to our program. The low testing AUC also suggests a poor model.
When looking at the feature importances (Fig. 5.2), we see that the physics GRE score, undergraduate GPA, quantitative and verbal GRE scores, and proposed research area are near the top while the institutional features are near the bottom. Performing the backward elimination, we find that physics GRE score, undergraduate GPA, quantitative and verbal GRE scores, and proposed research area are the meaningful features predictive of admission after the implementation of the rubric.
Figure 5.2: Averaged AUC feature importances over 30 trials. Physics GRE score, undergraduate GPA, Quantitative GRE score, Verbal GRE score and proposed research area, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted.
We can then plot the ranks of the features so that we can compare them to the ranks of the features in data set 0. The resulting slopeplot is shown in Fig. 5.3. We notice that the order of features is largely unchanged for the most predictive features, with only quantitative GRE score and GPA switching places. The major difference between the features predictive of admission in data sets 0 and 1a is the number of meaningful features.
Figure 5.3: Slopeplot showing the ranks of each feature before the implementation of the rubric (left) and after the implementation of the rubric (right) using data sets 0 and 1a respectively. Features toward the top of the plot are more predictive. Features in orange were found to be the meaningful features needed to predict whether the applicant was admitted in their respective model. Notice that the ordering of the more predictive features is largely unchanged. Plot adapted from [216].
When we take correlations among the features into account, however, we notice the set of meaningful features shrinks. As shown in Fig. 5.4, only the applicant’s physics GRE score and proposed research area were found to be predictive of admission. It is also important to note that the quantitative and verbal GRE scores are ranked lower once correlations are accounted for, suggesting that their initial importances were inflated.
Figure 5.4: Averaged conditional feature importances over 30 trials. Physics GRE score and proposed research area, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted once correlations were accounted for.
Given the poor performance of our model, hyperparameter tuning might have improved the model.
While it did to a degree, the testing accuracy was still only a few percentage points above the null accuracy and the testing AUC was still below 0.7 (Table 5.2). Thus, even with hyperparameter tuning, the models of data set 1a were poor.
Table 5.2: Minimum, median, and maximum values of the metrics obtained over the 125 hyperparameter combinations for models built from data set 1a
    Metric          Min     Median   Max
    Train AUC       0.602   0.735    0.749
    Test AUC        0.549   0.633    0.676
    Test Accuracy   0.679   0.712    0.732
    Null Accuracy   0.645   0.661    0.666
Finally, to see how the feature ranks varied based on the hyperparameters, we plotted the occurrence fraction of each rank for each feature (Fig. 5.5). We notice that across the 125 hyperparameter combinations, physics GRE score and GPA are almost always the top two features followed by quantitative and verbal GRE scores. In addition, none of the institutional features ever rank in the upper half of the importances. These results are similar to what we found for data set 0 in Chapter 2.
Figure 5.5: Proportion of the 125 hyperparameter combinations in which each feature had a given rank for data set 1a. Notice that the plot is mostly diagonal and that physics GRE score and GPA are almost always the top two features.
5.4.2 Using a True Testing Set
In addition to looking at the feature order to determine if the admissions process changed, we can compare the performance of the models themselves. If the process didn’t change, then a model built from data set 0 should perform equally well on a data set 0 testing set as on data set 1a, and a model built from data set 1a should perform equally well on a data set 1a testing set as on data set 0. If the process did change, we would expect better performance on the test data pulled from the train/test split than on the other data set.
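A minimal sketch of this cross-data-set comparison is shown below, assuming hypothetical data frames dat0 (data set 0) and dat1a (data set 1a) that share a binary outcome column admitted and the same features; it is illustrative only and not our analysis code.
    library(party)
    library(pROC)

    set.seed(2)
    idx    <- sample(nrow(dat0), size = round(0.7 * nrow(dat0)))
    train0 <- dat0[idx, ]
    test0  <- dat0[-idx, ]

    cf0 <- cforest(admitted ~ ., data = train0,
                   controls = cforest_unbiased(ntree = 500))

    # Helper: probability of admission from a party forest
    prob_admit <- function(model, newdata) {
      sapply(predict(model, newdata = newdata, type = "prob"), function(pr) pr[2])
    }

    # AUC on data set 0's held-out split versus AUC on all of data set 1a;
    # similar values would suggest the underlying process did not change.
    auc_own   <- auc(test0$admitted, prob_admit(cf0, test0))
    auc_other <- auc(dat1a$admitted, prob_admit(cf0, dat1a))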
When looking at the results, which are shown in Figures 5.6 and 5.7, we see that the latter case better describes the data. In Figure 5.6A, we see that the data set 0 test AUC is larger than the data set 1a test AUC, and in Figure 5.7A, we see that the data set 0 test accuracy is larger than the data set 0 null accuracy while the data set 1a test accuracy is smaller than the data set 1a null accuracy. These metrics suggest that the data set 0 model fits data set 0 well but does not fit data set 1a well and therefore, that the process might have changed.
Looking at Figure 5.6B, we see that none of the metrics are especially good. The test AUCs are both in the poor range, suggesting that the model built from data set 1a does not fit that well in the first place. It is then not surprising that the model does not predict data set 0 well. Given that the initial model did not fit the data well, we cannot use the result to make a claim about whether the process changed.
Figure 5.6: Comparison of the testing AUC when A) Data Set 0 is used to train the model and B) when Data Set 1a is used to train the model. Training refers to the training AUC for the model. All error bars are 1 standard error. Results were averaged over 30 trials.
Figure 5.7: Comparison of the testing accuracy when A) Data Set 0 is used to train the model and B) when Data Set 1a is used to train the model. The null accuracy is shown in cyan with the shorter error bars. All error bars are 1 standard error. Results were averaged over 30 trials.
5.4.3 Data Set 1b
Given that after the implementation of the rubric applicants are rated on the rubric constructs, perhaps using the rubric constructs instead of the application data in a model would lead to better performance. Yet, that wasn’t the case. We find that the testing AUC was 0.664 ± 0.007 and the testing accuracy was 0.675 ± 0.007 (null accuracy 0.553 ± 0.006). Given that not all applicants had sufficiently complete applications to be reviewed by faculty and those with incomplete applications tended to be not admitted, the null accuracy is smaller for models of data set 1b than for the models of data set 1a.
Figure 5.8: Averaged AUC feature importances over 30 trials for the models of data set 1b. Physics GRE score and quality of work, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted.
When we looked at the feature importances, the results showed similarities to the importances from the models of data set 1a. From Fig. 5.8, we notice that physics GRE score is still the top feature. However, measures of GPA such as physics coursework, math coursework, and all other coursework tended to be in the lower half of the rankings, alignment of research (the closest construct to proposed research area) was toward the middle of the rankings, and general GRE scores were toward the bottom, despite GPA, proposed research area, and general GRE scores being top-ranking features under the models of data set 1a.
From the figure, we also notice that measures related to research (quality of work, research dispositions, and technical skills) are ranked in the upper half, as are measures of noncognitive skills (achievement orientation, perseverance, and conscientiousness), while measures of fit (diversity contributions, community contributions, and alignment with faculty) are ranked in the bottom half of features. When performing the backward elimination, we find that only physics GRE score and quality of work are selected, suggesting that only these two features are needed to produce similar predictive performance as using all 18 features.
We then repeated the analysis taking correlations between features into account. The result is shown in Fig. 5.9. We notice that the top features are similar, though the rank of quality of work decreased to fourth. Now, physics GRE score and achievement orientation were found to be the meaningful and hence predictive features.
Figure 5.9: Averaged conditional feature importances over 30 trials for the models of data set 1b. Physics GRE score and achievement orientation, appearing in orange, were the factors found to be meaningful and hence predictive of being admitted once correlations were accounted for.
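As a sketch of how the conditional importances could be computed across the MICE imputations and then pooled (the data frame ratings and the number of imputations are hypothetical), one can average the per-imputation importances, which stands in for the Rubin's Rules pooling of the point estimates:
    library(party)
    library(mice)

    set.seed(3)
    # `ratings` is a hypothetical data frame of rubric ratings plus the
    # binary factor outcome `admitted`.
    imputations <- mice(ratings, m = 5, printFlag = FALSE)

    importance_list <- lapply(1:5, function(i) {
      completed <- complete(imputations, i)
      cf <- cforest(admitted ~ ., data = completed,
                    controls = cforest_unbiased(ntree = 500))
      varimp(cf, conditional = TRUE)   # permutation importance, accounting for correlations
    })

    # Pool by averaging each feature's importance over the imputations, then rank
    pooled <- Reduce(`+`, importance_list) / length(importance_list)
    sort(pooled, decreasing = TRUE)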
Finally, we performed hyperparameter tuning to determine if we could create a model with acceptable metrics. Unfortunately, we could not. Even the best AUC among the 125 hyperparameter combinations did not exceed 0.7. The full results are shown in Table 5.3.
Table 5.3: Minimum, median, and maximum values of the metrics obtained over the 125 hyperparameter combinations for the models of data set 1b
    Metric          Min     Median   Max
    Train AUC       0.711   0.767    0.791
    Test AUC        0.654   0.669    0.686
    Test Accuracy   0.660   0.678    0.696
    Null Accuracy   0.559   0.561    0.586
Looking at the feature ranks, we again see a diagonal pattern toward the upper left of the plot (Fig. 5.10), suggesting the same few features are selected as the most predictive. Regardless of our hyperparameter choices, the top three features are the physics GRE score, achievement orientation, and quality of work. However, the pattern becomes less diagonal toward the bottom right, suggesting that these features are more or less noise in the model.
Figure 5.10: Proportion of the 125 hyperparameter combinations in which each feature had a given rank for models of data set 1b. Notice that the plot is mostly diagonal and that physics GRE score, achievement orientation, and quality of work are always the top three features.
5.4.4 Tomek Links
Given the limited ability of the conditional inference forest to model data sets 1a and 1b, we used Tomek Links to remove boundary cases. As we were removing cases, we did not compute importances and focused on the model metrics instead. The results are shown in Table 5.4. As MICE generates new values for each imputation and hence affects which cases are nearest neighbors, the percent of cases dropped for each trial varies.
Table 5.4: Metrics when using Tomek Links and MICE for each of the three data sets
    Metric            Data Set 0       Data Set 1a      Data Set 1b
    Cases Dropped     11%-14%          15%-18%          12%-17%
    Training AUC      0.880 ± 0.004    0.760 ± 0.015    0.779 ± 0.010
    Testing AUC       0.809 ± 0.009    0.670 ± 0.015    0.704 ± 0.014
    Testing Accuracy  0.806 ± 0.009    0.775 ± 0.012    0.717 ± 0.012
    Null Accuracy     0.539 ± 0.006    0.699 ± 0.009    0.575 ± 0.010
First, we notice that for data set 0, using Tomek Links increased the testing AUC and testing accuracy by 0.05 over the original model reported in Chapter 2. In fact, the testing AUC is now about 0.8, which is considered “good” as compared to “fair” for the original model [93]. Likewise, using Tomek Links also results in an approximately 0.05 increase in the testing AUC and testing accuracy for data set 1a. However, the AUC is still in the poor range and the testing accuracy is only slightly better than the null accuracy.
For data set 1b, using Tomek Links increases the testing AUC and testing accuracy by approximately 0.04. This time, the increase to the testing AUC is enough for the model to be classified as “fair”.
To better understand what Tomek Links were doing in the modeling process, we investigated how removing the boundary cases affected the decision boundary. In order to plot the results, we only used the physics GRE score and undergraduate GPA to make a simple model for data sets 0 and 1a. To compute the Tomek Links, we used MICE to create a complete data set first and then found the Tomek Links. As all the data in data set 1b was categorical, a 2D plot of the decision boundary would have yielded limited insight and hence, we did not do so. The results of a single trial are shown in Figures 5.11 and 5.12.
Figure 5.11: Plot A shows data set 0 with the decision boundary for a model with just the physics GRE score and undergraduate GPA (Fig 2.3) while plot B shows the data with the Tomek Links removed and the resulting decision boundary for the 2D model. Plot C shows the overlap of the admitted regions.
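One possible way to produce such a decision-boundary plot is sketched below; the data frame dat2d, its column names pgre and gpa, and the outcome level "Yes" are hypothetical placeholders. The same grid prediction can be repeated after TomekClassif has removed the linked cases to compare the admitted regions.
    library(party)

    # `dat2d` is a hypothetical data frame with columns admitted, pgre, and gpa
    cf2d <- cforest(admitted ~ pgre + gpa, data = dat2d,
                    controls = cforest_unbiased(ntree = 500))

    # Predict over a grid of physics GRE scores and GPAs to trace the boundary
    grid <- expand.grid(pgre = seq(500, 990, by = 10),
                        gpa  = seq(2.0, 4.0, by = 0.05))
    grid$pred <- predict(cf2d, newdata = grid)

    # Shade the predicted admitted region
    plot(grid$pgre, grid$gpa,
         col = ifelse(grid$pred == "Yes", "orange", "grey85"),
         pch = 15, cex = 0.6,
         xlab = "Physics GRE Score", ylab = "Undergraduate GPA")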
From the figures, we see that removing the Tomek Links does affect the boundary. In the case of Figure 5.11, we see the area with limited data in the lower right switches to not admitted and, in general, the overfitting is reduced. In addition, the decision boundary matches more closely what we might expect anecdotally and based on Chapter 3, in that having a higher physics GRE score and GPA is more likely to result in admission, as opposed to having only one of those being stellar.
Likewise, in Figure 5.12, we again see reduced overfitting in the decision boundary. We also see that higher physics GRE scores and GPA are predicted to result in admission, as was the case before the implementation of the rubric. However, the threshold for what counts as a high physics GRE score and GPA seems to be higher after the implementation of the rubric based on the decision boundaries.
Figure 5.12: Plot A shows data set 1a with the decision boundary for a model with just the physics GRE score and undergraduate GPA. Plot B shows the data with the Tomek Links removed and the resulting decision boundary for the 2D model. Plot C shows the overlap of the admitted regions.
5.5 Discussion
Here, we first provide answers to our research questions and then use those answers to address the larger question of whether our department’s admissions process changed.
5.5.1 Research Questions
How do admissions models before and after the implementation of the rubric differ in terms of predictive ability and meaningful features when our models are based on the data contained in applications?
While we were able to model the data before the implementation of the rubric to an acceptable degree, we were unable to do so for the data after the implementation of the rubric. Even after hyperparameter tuning, we were unable to achieve a testing accuracy more than a few percentage points above the null accuracy or a testing AUC above 0.7, suggesting a poor model.
In terms of the meaningful features for data sets 0 and 1a, they were more or less the same. For data set 0, presented in Chapter 2, we found the applicant’s physics GRE score, quantitative GRE score, and GPA to be the meaningful features, while for data set 1a, we found the physics GRE score, GPA, quantitative GRE score, verbal GRE score, and proposed research area to be meaningful. After taking correlations into account, only the physics GRE score and proposed research area were found to be meaningful. However, because conditional inference forests will always return importance values regardless of how well the model fits, we should interpret the data set 1a results with a degree of caution.
Furthermore, if the top features were the same before and after the implementation of the rubric, we would expect a model trained on either data set 0 or data set 1a to work equally well on the other. Yet, that wasn’t what we found. Instead, we found that the model trained on data set 0 did not predict data set 1a well while the model trained on data set 1a did not predict either of the data sets well.
How does using the data produced by faculty when rating applicants using the rubric affect our ability to create admissions models?
While using the rubric features does result in increased metrics compared to the traditional features for the data collected after the implementation of the rubric, the metrics are still outside of the acceptable range. The testing AUC was still below 0.7, but the testing accuracy was greater than the null accuracy by a larger amount than for the model created from data set 1a. However, that result may be explained by data set 1b having a less imbalanced outcome. To see if that was the case, we created a model using the data in data set 1a that corresponded to the applicants in data set 1b.
When we did so, we found that the metrics were comparable, but the original test data set 1b model slightly outperformed this new model ( 0.02 increase in testing AUC and accuracy). Thus, while some of the improvement in metrics might be attributable to the more balanced data set, using the rubric constructs also provided some benefit. In terms of the features, we noticed some similarities and some differences. For the models of data set 1b, the physics GRE was still the top feature. However, measures of the GPA and general GRE scores were ranked in the lower half, suggesting they might not have been as important. Instead, measures of research ability and experience and noncognitive skills tended to be ranked towards the top. Again however, we should interpret the data set 1a results with a degree of caution as the model does not fit the data especially well. How does using Tomek Links affect our ability to model the admissions data, both before and after the implementation of the rubric? Using Tomek Links resulted in improved model performance for all three data sets. For data set 0, using Tomek Links increased the testing AUC over 0.8, which is considered “good,” and for data set 1b, using Tomek Links increased the testing AUC over 0.7, which is considered “fair.” However, while using Tomek Links for data set 1a did improve the testing AUC, it did not do so enough for the model to be considered acceptable. When looking at the decision boundaries for data sets 0 and 1a with and without Tomek Links removed, we found that overfitting appeared to be reduced, suggesting that even if the metrics are 110 not largely improved, there still may be benefits from using Tomek Links. Thus, while the benefits were relatively small, these results suggest that Tomek Links are a promising technique for modeling PER data, especially for data sets where we expect many boundary cases or cases that go against the general trend. For example, if we were to predict who passes an introductory class, Tomek Links might allow us to remove students who earned exam scores around the minimum passing grade and thus might or might not have passed the course or anomalous students who did poorly on the midterms but managed to earn a high grade on the final to pass the class. 5.5.2 Addressing whether our process changed Looking across the research questions, we can now address whether the introduction of the rubric changed our department’s admissions process. Overall, the evidence points in the direction of the process changing. In terms of evidence for the process changing, we find that the models of data sets 1a and 1b do not fit the data well. As we were able to fit the data set 0 models to an acceptable degree using the conditional inference forest algorithm but not the models of data sets 1a or 1b, this result seems to imply that there must be something different about the data sets. Because data set 0 and data set 1a used the same features, it is hard to explain why we could model one well but not the other unless the “true” models of the data were different and hence, the admissions process changed. In addition, a model trained on data set 0 was better able to predict held-out data from data set 0 compared to data set 1a. If the process hadn’t changed, we would have expected the predictive performance to be similar. Finally, using Tomek Links to remove applicants who might have gone against the general trend resulted in minimal increases in the metrics for the models of data sets 1a and 1b. 
If the process did not change, we would expect that removing applicants who might have gone against the overall trend would have led to a better model because we were able to model the admissions data before the implementation of the rubric. Yet, that isn’t what happened, suggesting again there must be 111 something different about the data collected after the implementation of the rubric. 5.5.3 Limitations affecting our ability to address whether the process changed Looking at the results, it is possible that someone could instead believe the results suggest the process did not change. We address those here. In terms of evidence for the process not changing, our results show that the most predictive features are similar regardless of which data set we used. When using data set 0, we found that the physics GRE, quantitative GRE, and GPA were most predictive of admission. Likewise, when looking at data set 1a, we found that the physics GRE, GPA, quantitative GRE, verbal GRE, and proposed area of research were most predictive. Using data set 1b showed the most differences in that the measures of grades and the general GRE scores were in the lower half of the rankings. However, the physics GRE was still the top ranked feature. Yet, both models of the data after the implementation of the rubric did not have acceptable testing metrics, suggesting that we should interpret the feature importance orders with caution. Conditional inference forest models will always produce feature importances regardless of how well the model fits the data. Because the metrics to assess fit are relatively poor, we should not trust the conclusion that the most predictive features are the same. However, it is possible that the low metrics might be a result of the conditional inference forest method not being suited for the data we have. Recent work suggests that the conditional inference forest algorithm does not perform well with missing data [217]. However, when we used MICE to impute the missing data, the models were still not able to produce testing metrics in the acceptable range, suggesting that the missing data was not the issue. In addition, while conditional inference forests were designed to better handle categorical data than traditional random forests do, there could still be issues with categorical data. For example, for data set 1b, there are only three possible values for each feature. Therefore, the model can only split each feature 3 ways, which limits the depth of the trees and the fine tuning of the model. However, when we used the section total (which could take on any integer between 0 and 8), the 112 results did not substantially improve, suggesting that the scale of the data may not be to blame. Even if the number of categories does not matter, the fact that some of the categorical data are discretized continuous features (e.g. physics GRE score, physics coursework) could create problems. Prior work has shown that binning continuous features can lead to a loss of information and over- or under-estimation of effect sizes [218, 219]. It is possible that such an effect is present in our data. However, models built from data sets 1a and 1b both found the physics GRE score to be the top feature even though the physics GRE score was discretized in data set 1b. Because the model metrics were not great (the testing accuracy was only a few percentage points above the null accuracy an the testing AUC was less than 0.7), this rebuttal should be treated with caution. 
On the other hand, the fact that models of data set 1a where discretization wasn’t an issue still had poor metrics suggests that it cannot fully explain the models’ low metrics. It is also possible that the low metrics are not a result of how we handled the data we had, but rather what data we had. It is possible that committee members were using something not included in our data to evaluate applicants and if we had that data, our models of data set 1a and 1b would improve. While such an explanation seems possible for data set 1a, it seems unlikely for data set 1b because members of the department decided what qualities they wanted to evaluate applicants on and added them to the rubric. Finally, it is possible that the low metrics might not be caused by the data or the model and instead, the low metrics could be caused by the admissions process itself. The goal of the rubric is to rate applicants along multiple dimensions, and hence in a holistic manner. If applicants were actually assessed holistically, we would expect that the model would not generalize well because there is no single underlying process. Instead there might be multiple routes an applicant could take to gain admission and hence, the model might encounter difficulties modeling this process. The fact that hyperparameter tuning and Tomek Links did not increase the testing metrics to an acceptable range for models of data set 1a and barely did so for the models of data set 1b supports such an interpretation. However, claiming the process is more holistic based on these results alone is premature, especially given the relatively small number of applicants in data set 1b. Instead, 113 results from other modeling attempts would either need to show poor predictive ability or show evidence of multiple routes to admission to support such a claim. 5.6 Future Work In order to better address the limitations and consider whether our admissions process became most holistic, future work should examine alternative techniques for analyzing the data. First, instead of taking a predictive approach in our analysis, we could take an explanatory approach where we try to understand what inputs may have caused the outcome. Under this approach, whether a feature is related to the outcome is determined by statistical significance rather than its predictive ability [62]. Logistic regression is a common example of this technique in PER. The results of such future work would provide greater insight into why the models did not fit data sets 1a and 1b well. Second, to determine if the process is more holistic, future work could analyze the data using cluster analysis or latent class analysis. While such methods are becoming popular for analyzing learning environments (e.g. see [220, 221]), to our knowledge, such methods are less common in studies of graduate admissions processes. To our knowledge, clustering-like techniques have only been used to understand admissions strategies based on surveys of faculty on admissions committees [45]. If the process is more holistic, such methods might be able to identify clusters of applicants who were admitted for similar reasons. For example, some applicants may be admitted due to stellar academic credentials, others may be admitted due to their research background, while others may be admitted based on which faculty members are seeking new students. Finding or not finding such a result would provide greater clarity as to how the process may have changed. 
To do so however, would likely require a larger data set, especially if there are a large number of driving results for why an applicant is admitted. Finally, future work could take a mixed methods approach by considering qualitative approaches to investigating how our admissions process might have changed. Such qualitative approaches could allow us to observe the admissions process itself (similar to the studies Posselt conducted as 114 documented in [46]) and understand how faculty are evaluating and discussing applicants in real time. In addition, a qualitative approach would allow us to avoid many of the modeling limitations related to the scale of the data and metrics. Alternatively, future work could directly ask faculty who have served on the admissions com- mittee both before and after the implementation of the rubric about their perception of the process at each time. However, we must be careful of faculty’s potential biases when recalling how things were done in the past (see Muggenburg for an overview [222]). For example, given the greater emphasis on diversity and equity in higher education now, faculty’s recall may suffer from post- rationalization [223] where they justify their decisions using reasons that weren’t available at the time but are consistent with their current self image or social desirability [224] where past events may be distorted to conform to current attitudes and norms. 5.7 Conclusion Overall, the results of this initial investigation are suggestive that our admissions process did change after the implementation of the rubric. We were able to model the data from before the implementation of the rubric to a sufficient degree but not the data after the implementation of the rubric. In addition, the model of the admissions process before the implementation of the rubric does not do well predicting the data collected after the implementation of the rubric and vice versa, suggesting that the underlying process did change. However, there are still numerous limitations that need to be addressed before we can make a definitive conclusion, including how we characterize the data and how we model the data. Furthermore, the models of the data following the implementation of the rubric performing poorly suggests that the process might be holistic. In order to make such a conclusion, however, we would need either evidence in favor of the occurrence holistic admissions or stronger evidence that the current admissions process is not easily modeled by known techniques. Such evidence could be obtained through a variety of quantitative or qualitative approaches. In terms of the modeling approaches, Tomek Links seem like a promising technique for future 115 PER studies. While their use was not enough to provide a more conclusive answer to the question of whether our admissions process changed, their use did provide evidence that the data collected after the implementation of the rubric may be modelable to an acceptable level, leaving open the possibility that other methods may be able to model the data and hence, should be explored. Finally, to truly get a sense of whether admissions processes change after the implementation of a rubric or merely use a new tool to do the same process, studies such as these need to be completed in other physics departments. By doing so, we will have a better idea of how rubric- based admissions might change admissions processes and how well our results generalize to other programs. 
116 CHAPTER 6 UNDERSTANDING THE BIASES AND LIMITATIONS OF COMPUTATIONAL MODELS: A SIMULATION STUDY This chapter has been submitted to a journal and is currently under review. The working manuscript version includes Marcos D. Caballero as the second author. Following the Contributor Roles Taxonomy (CRediT) [76], my roles for this project include conceptualization, formal analysis, methodology, software, validation, visualization, and writing the original draft. 6.1 Introduction When working with educational data, we often encounter imbalanced binary input and outcome features, by which we mean the variable is not equally split into its two categories. For example, demographics in science, technology, engineering, and mathematics (STEM) are often imbalanced due to historical and ongoing injustices. While data mining with imbalanced data has been studied extensively [225], less attention has been paid to the types of imbalanced data that appear in discipline-based education research (DBER) and educational data mining (EDM) studies. For example, educational data sets might consist of a single course on the order of a hundred students (and hence a hundred data points) to the entire university or even multiple universities, resulting in hundreds of thousands of data points. Furthermore, educational data often includes continuous, categorical, and binary variables. As a result, an educational data set might contain many features with different imbalances. For specific examples of these occurring in the DBER and EDM literature, we refer the reader to the following papers [65, 226–238]. In the context of logistic regression, which is a popular EDM technique [239] and was a common technique used in the previously cited studies, much work has focused around outcome imbalance and how to work with such data. When the outcome is imbalanced, the regression coefficients and the probabilities generated from the logistic regression model are biased [240]. To correct for these biases, various techniques such as Rare Events Logistic Regression [240], Firth penalized 117 regression [241], and introducing a log-F distributed penalty [242] have been proposed, which we explain in depth in Sec. 6.2. More recently, machine learning techniques have become popular in educational research. One example is random forest [91, 239]. Just as logistic regression has biases that might be relevant to EDM and DBER data, random forest is also known to have such biases. In particular, random forest ranks categorical features with many levels [92] and continuous features higher than categorical features with fewer levels [243] when determining which features are most predictive of an outcome. Most interesting for the context of this paper is a study by Boulesteix et al. [244] building on the work of Nicodemus [243]. In their paper, they systematically varied the amount of predictive information that each binary feature contained as well as the feature imbalance and then used random forest as well as a variant better suited for categorical features, conditional inference forest, [92] to compare how well the algorithms could detect the informative features from the noise. Their key finding was that features with higher imbalances were ranked lower than features with lower imbalances even when they had the same “built-in” amount of predictive information. 
A later study [245] extended the work by including continuous features as well as binary features but only examined the case when none of the features contained predictive information. These studies suggest that when modeling data, our results might be measuring spurious properties of the features rather than their predictive information. In this study, we seek to extend this line of work by considering the data typical of DBER and EDM studies. That is, data that includes a mix of continuous, categorical, and binary features with varying degrees of predictive or explanatory ability, and a binary outcome feature that might be imbalanced. In addition, new techniques for ranking random forest features have been developed, such as the AUC-importance [94], which are designed for imbalanced data sets and hence, might prove fruitful for DBER and EDM research. Finally, we wish to extend the work to regression techniques commonly used for educational data and explore how these biases might manifest in these techniques. Specifically, we ask three research questions: 118 1. How might known random forest feature selection biases change when the outcome is im- balanced as is often the case in EDM and DBER studies, and does the AUC-permutation importance affect those biases? 2. How might known machine learning biases manifest in traditionally explanatory techniques such as logistic regression? 3. How might penalized regression techniques successfully applied in other disciplines be used in EDM and DBER to combat any discovered biases? It should be noted that our overarching goal is to compare existing approaches to analyzing data typically found in DBER and EDM research and not to introduce our own new promising method for analyzing such data. The rest of the paper proceeds as follows. In Sec. 6.2, we provide an overview of the algorithms and approaches we mentioned in the introduction and that we use in the rest of the paper. In Sec. 6.3, we explain how we constructed our simulation data and carried out our neutral comparison simulation study [246]. In Sec. 6.4, we provide the results of our simulation study. In Sec. 6.5, we apply what we learned in the simulation study to a graduate admissions data set from United States universities. In Sec. 6.6, we provide answers to our research questions, compare our findings with similar studies, and consider how our choices might have influenced the results. In Sec. 6.7, we propose future directions for this work, both in terms of the data and algorithms. Finally, in Sec. 6.8, we provide the conclusions from our study and outline a set of recommendations. 6.2 Background Here, we introduce the two paradigms of statistical modeling and then provide an overview of the algorithms we used in our study. 119 6.2.1 Paradigms of Statistical Modeling When discussing modeling data, there are two prominent paradigms, both of which are used in DBER and EDM: prediction and explanation [74,247]. Shmueli [62] provides an overview of these approaches and we summarize the key points here. Explanatory modeling or explanation is focused on the causal effect of some set of inputs 𝑋 on some outcome 𝑌 . That is, given some data set, explanation is concerned with which inputs produce a statistically significant effect when modeling the outcome. Traditional logistic or linear regression are examples of explanatory models. Under this approach, models are evaluated based on how well they fit the data using some statistic. 
In the case of logistic regression or linear regression, common statistics are Pseudo-𝑅² and 𝑅². In contrast, prediction is focused on generating a model for analyzing new data and determining the outcome, and not necessarily the causal effect. Under this paradigm, having two sets of data, one to train the model and one to test the predictive capabilities of the model, is essential to provide an estimate of the model’s predictive ability. Because prediction is not focused on the causal effects, statistical significance has no role in assessing features in predictive models. Instead, features are assessed based on whether they improve predictions of the model. While a feature with a small effect might be statistically significant, it might not have predictive power because a predictive model might perform just as well without the feature as with it. As a corollary to this, we should not expect a model with high explanatory power to necessarily have high predictive power or vice versa, and hence, features with high explanatory power might not have high predictive power.
6.2.2 Explanatory Methods
6.2.2.1 Traditional Logistic Regression
When the outcome, 𝑌, is binary, logistic regression is the standard technique for explanatory modeling. Under this approach, the probability, 𝑝, of finding the outcome of 𝑌 = 1 is given by
\log_b\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \qquad (6.1)
where 𝑥1, 𝑥2, ..., 𝑥𝑛 are the input features and the 𝛽 are the coefficients. Under this formula, logistic regression has a similar form to linear regression. We can rearrange the equation to solve for the odds, which becomes
\mathrm{odds}(x_1, x_2, \dots, x_n) = \frac{p}{1-p} = b^{\,\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n} \qquad (6.2)
where 𝑏 is traditionally the natural base, 𝑒. Under this formulation, it makes sense to talk about the odds ratio (OR), or the change in odds as a result of increasing an input feature 𝑥𝑗 by 1 unit. More formally,
OR_{x_j} = \frac{\mathrm{odds}(x_1, x_2, \dots, x_j + 1, \dots, x_n)}{\mathrm{odds}(x_1, x_2, \dots, x_j, \dots, x_n)} = e^{\beta_j} \qquad (6.3)
which means that the exponentials of the coefficients correspond to the odds ratio for each feature. Notice that the odds ratio is independent of the value of 𝑥𝑗. Because a 𝛽 of 0 means no effect, an odds ratio of 1 is equivalent to no effect [63]. Likewise, an odds ratio greater than 1 means an increase in the odds while an odds ratio less than 1 means a decrease in the odds.
An important caveat to this is what a unit increase is and what the odds ratio is in reference to. Often, continuous features are normalized so that the mean is 0 and the variance is 1, or scaled so that an increase of a unit has a tangible meaning. For example, SAT scores are only reported in multiples of 10, so scoring one point higher on the SAT is meaningless. Instead, the researcher would want to adjust the scale of the scores so that an increase of 1 unit corresponded to 10 points better on the test (or another meaningful increment). For continuous features, what the odds ratio is in reference to is answered by the scale choice.
For categorical features, especially unordered categorical features, the answer is nontrivial. An increase of 1 unit might not be meaningful or even possible (e.g., what would an increase of 1 unit of race mean?). In that case, it is customary to use one-hot encoding and create separate, binary features for each label. For example, for race, we could create 6 features: white, Asian, Black, Latinx, Native, Multi-racial. Under this approach with binary features, an increase of a unit corresponds to changing categories, such as a Black compared to a non-Black student, which depends on the arbitrary choice of which label is assigned 𝑥𝑗 = 1 and which is assigned 𝑥𝑗 = 0. As [63] notes, it is often preferable to invert the odds ratios which are less than 1 to easily compare all odds ratios, which is equivalent to swapping our labels for 𝑥𝑗 = 0 and 𝑥𝑗 = 1.
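As a brief, hypothetical illustration (the data frame df and its columns y, sat, x1, and black are placeholders, not our data), odds ratios in R are obtained by exponentiating the coefficients of a fitted logistic regression:
    # `df` is a hypothetical data frame with a binary outcome y, a continuous
    # feature sat, and binary features x1 and black.
    df$sat10 <- df$sat / 10   # rescale so a one-unit increase means 10 SAT points

    fit <- glm(y ~ sat10 + x1 + black, data = df, family = binomial)

    exp(coef(fit))            # odds ratios; a value of 1 corresponds to no effect
    exp(confint.default(fit)) # Wald confidence intervals on the odds-ratio scale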
6.2.2.2 Penalized Regression
When the data contains issues that might make modeling difficult (i.e., small sample size, correlations, and more features than data points), adding a penalty to logistic regression might be beneficial. This idea is based on the bias-variance trade-off, in which we can increase the bias of the coefficient to reduce its variability or vice versa [248]. As a result, penalized regression can be useful for feature selection, which is often an important first step in EDM [239]. Because logistic regression does not have a closed-form solution while linear regression does, we will present the penalized algorithms in the context of linear regression.
For typical least squares linear regression with 𝑚 features and 𝑛 cases, we are trying to solve the expression
\operatorname*{argmin}_{\beta} \left( \lVert Y - X^{T}\beta \rVert^{2} \right) \qquad (6.4)
where 𝑌 is an 𝑛 × 1 vector of the outputs, 𝛽 is an 𝑚 × 1 vector of coefficients, and 𝑋 is an 𝑚 × 𝑛 matrix of the input data. When we use penalized regression instead, we add a penalty, P, that might depend on the coefficients or data.
\operatorname*{argmin}_{\beta} \left( \lVert Y - X^{T}\beta \rVert^{2} + \mathcal{P}(\beta, X) \right) \qquad (6.5)
In this study, we consider two types of penalization for explanatory methods, Firth and Log-F penalization, although many more exist. See Ensoy et al. [249] for an overview of methods often used in cases of separation, where an input feature perfectly predicts the outcome, or rare events.
Under Firth penalization, we try to combat the asymptotic bias of the coefficient estimates, which inversely depends on the sample size to some power. Specifically, the Firth method adds a penalty that removes the asymptotic bias to order O(𝑛^{-1}), making it especially useful for small data sets [241]. It does so by penalizing the likelihood with the Jeffreys invariant prior [250], which is inversely related to the amount of information in the data. That is, the penalty is larger the less the data allows us to determine the coefficients. For a simple one-feature model, the penalty is equivalent to adding 0.5 to each cell of the 2x2 contingency table of the feature and the outcome [251], making this penalization especially useful in the case of separation. In theory, this penalization should then shrink the confidence intervals of the features with more imbalance because more uncertainty would have resulted in a higher penalty.
The Jeffreys invariant prior is not without issues, such as being dependent on the data, which are summarized in Greenland and Mansournia [242]. To overcome these, Greenland and Mansournia proposed a log-F(𝑚, 𝑚) distributed penalty. The penalty has a tuning parameter, 𝑚, that controls the amount of penalization, with a higher 𝑚 providing more accurate estimates of smaller 𝛽 but less accurate estimates of larger 𝛽. When little is known about the data, Greenland and Mansournia recommend taking 𝑚 = 1 to allow for a wider range of possible values. For a single parameter model, the choice 𝑚 = 1 makes the Log-F penalty equivalent to the Firth penalty.
In addition to overcoming issues with the Jeffreys prior, the log-F penalty can be implemented via data augmentation, meaning that any software capable of performing logistic regression can also do Log-F penalization. For a chosen 𝑚, the researcher adds 𝑚 pairs of rows to their data for each feature, where one row has outcome 𝑌 = 1 and the other has outcome 𝑌 = 0. In the pair of rows, the researcher then selects one feature to have value 1 and all of the other features to have values of 0, with the choice of feature unique to each pair of rows. The weights for each row are set to be 𝑚/2, and any intercept feature should be set to 0 in these added rows. An example of this for a 2-feature model with 𝑚 = 1 is shown in Table 6.1.
Table 6.1: Log-F data augmentation example for a two-feature, m = 1 example. The last four rows are the augmented data.
    Outcome   Feature 1   Feature 2   Intercept   Weight
    1         0.748       0.10        1           1
    ...       ...         ...         ...         ...
    1         1           0           0           1/2
    0         1           0           0           1/2
    1         0           1           0           1/2
    0         0           1           0           1/2
It should be noted that despite the similarity in name, log-F penalized regression has no relation to the recently proposed LogCF framework [252].
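A minimal sketch of the augmentation recipe in Table 6.1 for a hypothetical two-feature data frame df with columns y, x1, and x2 (with m = 1) is shown below; glm may warn about non-integer outcomes, which is expected with the fractional pseudo-row weights.
    m <- 1   # log-F(m, m) penalty strength; m = 1 is the recommended default

    # Pseudo-rows: one (Y = 1, Y = 0) pair per feature, with that feature set to 1,
    # every other feature and the intercept column set to 0, and weight m/2.
    aug <- data.frame(y = c(1, 0, 1, 0),
                      x1 = c(1, 1, 0, 0),
                      x2 = c(0, 0, 1, 1),
                      intercept = 0,
                      w = m / 2)

    real <- data.frame(y = df$y, x1 = df$x1, x2 = df$x2, intercept = 1, w = 1)
    dat  <- rbind(real, aug)

    # `0 +` removes the automatic intercept so the explicit intercept column,
    # which is zeroed out in the pseudo-rows, takes its place.
    fit_logF <- glm(y ~ 0 + intercept + x1 + x2, data = dat,
                    family = binomial, weights = w)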
In addition to overcoming issues with the Jeffreys prior, the log-F penalty can be implemented via data augmentation, meaning that any software capable of performing logistic regression can also do Log-F penalization. For a chosen 𝑚, the researcher adds 𝑚 pairs of rows to their data for 123 Table 6.1: Log-F data augmentation example for a two feature and m=1 example. The last four rows are the augmented data. Outcome Feature 1 Feature 2 Intercept Weight 1 0.748 0.10 1 1 ... ... ... ... ... 1 1 0 0 1/2 0 1 0 0 1/2 1 0 1 0 1/2 0 0 1 0 1/2 each feature, where one row has outcome 𝑌 = 1 and the other has outcome 𝑌 = 0. In the pair of rows, the researcher then selects one feature to have value 1 and all of the other features to have values of 0, with the choice of feature unique to each pair of rows. The weights for each row are set to be 𝑚/2 and any intercept feature should be set to 0 in these added rows. An example of this for a 2-feature model with 𝑚 = 1 is shown in Table 6.1. It should be noted that despite similarity in name, log-F penalized regression has no relation to the recently proposed LogCF framework [252]. 6.2.3 Predictive Methods 6.2.3.1 Penalized Regression In addition to using penalized logistic regression as an explanatory method, there are also penalties designed for using regression as a predictive tool. Two of the most common are Ridge and Lasso, which are described in detail in Hastie, Tibshirani, and Friedman [248]. Again, we present the penalties in the context of linear regression. Ridge penalization adds a penalty to the regression equation proportional to the square of the 𝛽s. argmin 𝛽 ( ||𝑌 − 𝑋 𝑇 𝛽|| 2 + 𝜆||𝛽|| 2 ) (6.6) Equivalently, it requires the sum of the squared 𝛽 coefficients to be less than some value. 124 argmin 𝛽 ( ||𝑌 − 𝑋 𝑇 𝛽|| 2 ) (6.7) Õ 𝑚 subject to 𝛽2𝑗 ≤ 𝑡 (6.8) 𝑗=1 Here, 𝜆, or equivalently 𝑡, controls the degree of penalization, with a higher value associated with a stronger penalty. Ridge penalization is often used in cases of multi-collinearity because it reduces the variability of the coefficients. That is, for two correlated features without penalization, one could be extremely positive and the other extremely negative to offset each other. With the squaring of the coefficients under Ridge penalization, the coefficients can no longer offset each other and hence, must shrink. 1 Mathematically, Ridge penalization is equivalent to scaling each 𝛽 by 1+𝜆 . Instead of penalizing based on the squared 𝛽, we can penalize based on the absolute value of the 𝛽; this is the premise of Lasso penalization. Mathematically, Lasso penalization seeks to solve argmin 𝛽 ( ||𝑌 − 𝑋 𝑇 𝛽|| 2 + 𝜆|𝛽|) (6.9) Equivalently, it requires the sum of the absolute value of the 𝛽 coefficients to be less than some value. argmin 𝛽 ( ||𝑌 − 𝑋 𝑇 𝛽|| 2 ) (6.10) Õ𝑚 subject to |𝛽 𝑗 | ≤ 𝑡 (6.11) 𝑗=1 Again, 𝜆 controls the amount of penalization. Here though, the Lasso penalty is designed for feature selection because it shrinks some 𝛽 to zero while shifting the values of the others. Lasso is not designed for correlated features, and hence, it can encounter issues in those cases. For example, if two features are correlated, either could be shrunk to zero without reducing the accuracy of the model. Therefore, Lasso can exhibit variability concerns under correlation. 125 One way around this is to combine the penalties into a single penalty, which is the idea between Elastic net [253]. Mathematically, the Elastic net penalty is argmin 𝛽 ( ||𝑌 − 𝑋 𝑇 𝛽|| 2 + 𝜆(𝛼||𝛽|| 2 + (1 − 𝛼)|𝛽|)) (6.12) . 
One way around this is to combine the two penalties into a single penalty, which is the idea behind Elastic net [253]. Mathematically, the Elastic net penalty is

    argmin_𝛽 ( ||𝑌 − 𝑋^𝑇 𝛽||² + 𝜆(𝛼||𝛽||² + (1 − 𝛼)|𝛽|) )    (6.12)

where 𝜆 controls the overall penalization and 𝛼 controls the amount of mixing of the Lasso and Ridge penalties, with the special case 𝛼 = 0 reducing to Lasso penalization and 𝛼 = 1 reducing to Ridge regression. While these algorithms are typically used for prediction, various methods for using them in an explanatory manner have been developed, along with corresponding p-values and other feature selection techniques [254–258]. We will only use these algorithms as predictive tools, but we include references to these approaches here for completeness.

6.2.3.2 Forest Methods

Random forest is an ensemble method of decision trees based on the Classification and Regression Trees (CART) framework [91]. For each decision tree, a subset of features, often denoted mtry, is randomly selected and used to predict the outcome. To grow the tree, features are split into two groups, with the specifics of the splits determined by which ones minimize the Gini Index, a measure of variance, the most. After all trees have been grown, the algorithm uses some method of aggregating results, such as a majority vote of the trees, to determine the overall prediction. Because the features are split, categorical features do not need to be one-hot encoded as they would be in logistic regression.

To determine which features are relevant to the prediction, the features are often assessed by the mean decrease in the Gini Index across all trees, with a larger value meaning the feature is more predictive of the outcome. However, Strobl et al. showed that the Gini Index is biased toward continuous features and features with many categories [92]. That is, because continuous features and features with many categories have many possible split points (infinitely many in the continuous case), it is more likely that the algorithm can find an ideal split than it could for a binary feature, which has only one split point. Therefore, these features will be viewed as more important because they appear to better separate the classes. As a result, alternative measures such as accuracy permutation importance have become popular.

To use accuracy permutation importance, each feature is randomly permuted one at a time and the change in predictive accuracy is recorded. The idea is that when a feature that is more predictive of the outcome is permuted, the predictive accuracy will decrease more than when a feature with less predictive information is permuted. As a result, the changes in predictive accuracy can be used to rank the features in the model qualitatively. More recently, an alternative based on the AUC, which is the probability that a positive case ranks higher than a negative case over all possible pairs of positive and negative cases, has been proposed by Janitza et al. [94]. This AUC-permutation importance is claimed to perform better than the accuracy permutation importance measure when the outcome is imbalanced. It is important to note that both of these importances only make sense in the context of the model and relative to each other.
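To illustrate the idea behind accuracy permutation importance, here is a from-scratch sketch on hypothetical data; it is deliberately simplified (package implementations use the out-of-bag cases rather than the training data used here) and is not the implementation inside the forest packages used later in this chapter.

```r
library(randomForest)

set.seed(2)
# Hypothetical data: one informative binary feature, one noise feature
n  <- 500
x1 <- rbinom(n, 1, 0.4)
x2 <- rnorm(n)
y  <- factor(rbinom(n, 1, plogis(-0.5 + 1.2 * x1)))
dat <- data.frame(y, x1 = factor(x1), x2)

fit <- randomForest(y ~ x1 + x2, data = dat)
baseline_acc <- mean(predict(fit, dat) == dat$y)

# Accuracy permutation importance by hand: permute one feature at a time and
# record how much the predictive accuracy drops
perm_importance <- sapply(c("x1", "x2"), function(feature) {
  permuted <- dat
  permuted[[feature]] <- sample(permuted[[feature]])
  baseline_acc - mean(predict(fit, permuted) == dat$y)
})
perm_importance  # larger drop in accuracy -> more predictive feature
```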
Because the Gini Index is also used to create the feature splits, the entire algorithm can be biased when the data contain binary, categorical, and continuous features (as is often the case in DBER and EDM). To correct this problem, Strobl et al. proposed conditional inference forests [92], which are based on the conditional inference framework [259]. Rather than minimize the Gini Index to find ideal splits, conditional inference forests use the conditional inference independence test to determine which feature to split and how to split it. Simulation studies by Strobl et al. have found that using conditional inference forests with subsampling without replacement does, in fact, correct the biases shown by traditional random forest [92]. For more details about these algorithms, see Appendix A.

6.3 Methodology

6.3.1 Data Creation

To conduct our simulation study, we first needed to generate our simulated data. To create binary features with varying degrees of imbalance and information, we considered a 2x2 contingency table, Table 6.2. We used labeling conventions similar to those of Olivier, Bell, and Rapallo for the reader's convenience because we reference their formulas here [260]. For some binary feature 𝑥_𝑗, let the fraction of cases with 𝑥_𝑗 = 0 be 𝜋+0 and the fraction of cases with 𝑥_𝑗 = 1 be 𝜋+1. Likewise, for the binary outcome feature 𝑌, let the fraction of cases with 𝑌 = 0 be 𝜋0+ and the fraction of cases with 𝑌 = 1 be 𝜋1+. Then the feature imbalance is represented by the ratio 𝜋+0 : 𝜋+1 and the outcome imbalance is represented by 𝜋0+ : 𝜋1+. We pick 𝑥_𝑗 = 1 and 𝑌 = 1 to be the minority classes, though the choice is arbitrary.

To quantify the amount of information contained in a feature for predicting or explaining the outcome, we use the odds ratio, which, in the notation of Table 6.2, is

    𝑂𝑅 = (𝜋00/𝜋01) / (𝜋10/𝜋11) = (𝜋00 𝜋11) / (𝜋10 𝜋01)

By specifying the feature imbalance (in the form of 𝜋+1), the outcome imbalance (in the form of 𝜋1+), and the odds ratio, we can uniquely express the values in the 2x2 table. Furthermore, any one of the three can be changed while the remaining two are held constant, allowing us to manipulate the feature imbalance, the outcome imbalance, and the odds ratio systematically. An example with counts for a hypothetical data set with 1000 samples is shown in Table 6.3.

Table 6.2: 2x2 contingency table of fractions for a generic binary feature.

           𝑥_𝑗 = 0   𝑥_𝑗 = 1   Total
  𝑌 = 0    𝜋00       𝜋01       𝜋0+
  𝑌 = 1    𝜋10       𝜋11       𝜋1+
  Total    𝜋+0       𝜋+1       1.0

Table 6.3: Examples of changing only one of the feature imbalance, outcome imbalance, or odds ratio for an N=1000 dataset.

  (a) Reference table
           𝑥_𝑗 = 0   𝑥_𝑗 = 1   Total
  𝑌 = 0    300       200       500
  𝑌 = 1    300       200       500
  Total    600       400       1000

  (b) Changing only the feature imbalance
           𝑥_𝑗 = 0   𝑥_𝑗 = 1   Total
  𝑌 = 0    400       100       500
  𝑌 = 1    400       100       500
  Total    800       200       1000

  (c) Changing only the outcome imbalance
           𝑥_𝑗 = 0   𝑥_𝑗 = 1   Total
  𝑌 = 0    450       300       750
  𝑌 = 1    150       100       250
  Total    600       400       1000

  (d) Changing only the odds ratio
           𝑥_𝑗 = 0   𝑥_𝑗 = 1   Total
  𝑌 = 0    360       140       500
  𝑌 = 1    240       260       500
  Total    600       400       1000

To determine the values in the 2x2 table, we can rearrange the formula for the odds ratio in terms of 𝜋+1, 𝜋1+, and 𝜋11 found in the literature to solve for 𝜋11 [260]. Doing so, we find that

    𝜋11 = [ 1 + (𝜋+1 + 𝜋1+)(𝑂𝑅 − 1) − 𝑄 ] / [ 2(𝑂𝑅 − 1) ]    (6.13)

where

    𝑄 = √[ (1 + (𝜋1+ + 𝜋+1)(𝑂𝑅 − 1))² + 4 𝑂𝑅 (1 − 𝑂𝑅) 𝜋+1 𝜋1+ ]    (6.14)

In the case that 𝑂𝑅 = 1, that is, the feature contains no predictive or explanatory information about the outcome, the expression for 𝜋11 is indeterminate. In that case, the feature and outcome are independent, so 𝜋11 = 𝜋+1 𝜋1+. Once we know 𝜋11, we can use Table 6.2 to compute the remaining values. That is,

    𝜋10 = 𝜋1+ − 𝜋11    (6.15)
    𝜋01 = 𝜋+1 − 𝜋11    (6.16)
    𝜋00 = 1 + 𝜋11 − 𝜋1+ − 𝜋+1    (6.17)

Figure 6.1: Distribution of binary features in the simulated 𝜋1+ = 0.5, 𝑁 = 1,000 model.
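As a sketch of how Eqs. (6.13)–(6.17) can be turned into simulated binary features, the code below is an illustrative reimplementation rather than the original simulation code; the function and variable names are ours.

```r
# Solve for pi11 given the feature imbalance (pi_plus1), the outcome
# imbalance (pi_1plus), and the odds ratio, following Eqs. (6.13)-(6.17)
cell_probabilities <- function(pi_plus1, pi_1plus, OR) {
  if (OR == 1) {
    pi11 <- pi_plus1 * pi_1plus  # independent feature and outcome
  } else {
    Q <- sqrt((1 + (pi_1plus + pi_plus1) * (OR - 1))^2 +
                4 * OR * (1 - OR) * pi_plus1 * pi_1plus)
    pi11 <- (1 + (pi_plus1 + pi_1plus) * (OR - 1) - Q) / (2 * (OR - 1))
  }
  c(pi00 = 1 + pi11 - pi_1plus - pi_plus1,
    pi01 = pi_plus1 - pi11,
    pi10 = pi_1plus - pi11,
    pi11 = pi11)
}

# Example: an OR = 3 feature with a 60/40 imbalance and a 50/50 outcome;
# multiplying by N = 1000 gives counts close to those in Table 6.3(d)
round(1000 * cell_probabilities(pi_plus1 = 0.4, pi_1plus = 0.5, OR = 3))

# Given an outcome vector y, a binary feature with these cell probabilities
# can be drawn from the conditional probabilities P(x = 1 | y)
draw_feature <- function(y, probs) {
  p1_given_y0 <- probs["pi01"] / (probs["pi00"] + probs["pi01"])
  p1_given_y1 <- probs["pi11"] / (probs["pi10"] + probs["pi11"])
  rbinom(length(y), 1, ifelse(y == 1, p1_given_y1, p1_given_y0))
}
```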
To model continuous features, we assumed the features were normally distributed with a separate distribution for each outcome class. For 𝑌 = 0, we modeled the feature as N(0, 1), and for 𝑌 = 1, we modeled the feature as N(𝜇, 1), where 𝜇 was a parameter we controlled. By increasing 𝜇, the distributions have less overlap, and hence the value of a specific point provides more information about the outcome.

For our study, we chose the same feature imbalances and odds ratios as found in Boulesteix et al., which correspond to 𝜋+1 = {0.5, 0.4, 0.25, 0.1, 0.05} and 𝑂𝑅 = {3, 1.5, 1}, creating 15 binary features [244]. We then created five continuous features with 𝜇 = {0.75, 0.50, 0, 0, 0}, for a total of 20 features. As the number of features in DBER and EDM studies is typically on the order of 10, we chose to keep the total number of features on the order of 10 rather than on the order of 100 as in the Boulesteix et al. study [244]. We then generated these features for five outcome imbalances, 𝜋1+ = {0.5, 0.4, 0.3, 0.2, 0.1}, and three sample sizes, 𝑁 = {100; 1,000; 10,000}, for a total of 15 simulated data sets. A visual depiction of the binary features in the 𝜋1+ = 0.5 and 𝑁 = 1,000 case is shown in Fig. 6.1 and a visual depiction of the continuous features in that same case is shown in Fig. 6.2.

Figure 6.2: Distribution of continuous features in the simulated 𝜋1+ = 0.5, 𝑁 = 1,000 model.

6.3.2 Procedures

6.3.2.1 Forest Algorithms

To analyze our data sets using forest algorithms, we first randomly selected 70% of the cases for the training set and kept the remaining 30% for the testing set. Our prior work with random forest suggests that the size of the train/test split does not qualitatively affect the conclusions around variable importance and selected features [65]. We then used the randomForest function from the randomForest package [261] to create random forest models and the cforest function from the party package [92, 96, 259] to create conditional inference forests in R [99]. For both models, we set the number of trees to 500, as that is the default in the cforest algorithm and simulation studies of random forest have found that error rates level off on the order of a few hundred trees [101]. For the number of features per tree, we picked √𝑝, where 𝑝 is the number of features, which is also aligned with the recommendations of Svetnik et al. [101]. We have called this 𝑚 previously to distinguish it from the probability in the logistic model, but we use 𝑝 here because it is the common symbol in the random forest literature.

For the random forest algorithm, we computed the Gini importance and the accuracy permutation importance. For the conditional inference algorithm, we computed the accuracy permutation importance and the AUC permutation importance. We repeated this procedure of splitting the data, running the model, and calculating the importances 30 times so that the resulting distribution of the importances would be approximately normal according to the central limit theorem [262]. Next, we determined the rank of each feature based on its average importance over the 30 runs, where the feature with the largest importance value has rank 1. This type of approach is often used in screening studies to determine relevant features, which is what we are doing here [95].
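A condensed sketch of one iteration of this forest procedure is shown below. It is a simplified illustration under the stated defaults rather than the authors' exact code; the toy data set is hypothetical, and varimpAUC is assumed to be available in the installed version of the party package.

```r
library(randomForest)
library(party)

set.seed(3)
# Toy stand-in for one simulated data set (hypothetical, illustration only)
n  <- 1000
y  <- rbinom(n, 1, 0.5)
b1 <- rbinom(n, 1, ifelse(y == 1, 0.55, 0.40))   # informative binary feature
c1 <- rnorm(n, mean = 0.5 * y)                   # informative continuous feature
sim_data <- data.frame(y = factor(y), b1 = factor(b1), c1 = c1)

# 70/30 train/test split
train_idx <- sample(n, size = 0.7 * n)
train <- sim_data[train_idx, ]

p    <- ncol(sim_data) - 1
mtry <- floor(sqrt(p))

# Standard random forest: Gini and accuracy permutation importances
rf <- randomForest(y ~ ., data = train, ntree = 500, mtry = mtry,
                   importance = TRUE)
gini_imp <- importance(rf, type = 2)   # mean decrease in Gini
acc_imp  <- importance(rf, type = 1)   # mean decrease in accuracy

# Conditional inference forest (subsampling without replacement) and its
# permutation importances
cif <- cforest(y ~ ., data = train,
               controls = cforest_unbiased(ntree = 500, mtry = mtry))
cif_acc_imp <- varimp(cif)       # accuracy permutation importance
cif_auc_imp <- varimpAUC(cif)    # AUC permutation importance (newer party versions)

# Rank the features (rank 1 = largest importance); in the study these ranks
# are averaged over 30 repeated splits
rank(-cif_auc_imp)
```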
To evaluate bias, the Boulesteix et al. paper approached it as the difference from the expected value of zero in the null case and argued that bias when the features have odds ratios different from 1 was not well defined [244]. Because we are interested in selecting features, we can instead create a definition of bias based on the rank of the feature. If a forest algorithm is biased, we would expect features with higher imbalance to have larger ranks (i.e., be farther from 1) than features with identical odds ratios but smaller imbalances.

Using these ranks, we can also define bias in terms of the features detected by the algorithm. Assuming no bias, features with identical odds ratios should be detected at the same rate, regardless of their imbalance. To determine whether a feature was detected, we adopt the convention that detected means different from noise. We define detected as being ranked above the first noise feature, which has 𝑂𝑅 = 1 or 𝜇 = 0. We picked this convention so that it is somewhat analogous to the definition of statistical significance, which for explanatory models is that the probability of obtaining a result at least as extreme as the result observed under the assumption of the null hypothesis is less than some threshold, typically 0.05.

6.3.2.2 Regression Algorithms

To use logistic regression in an explanatory manner, we did not use a train/test split, as that approach is characteristic of a predictive approach; instead, we used all of the data, as is customary for explanatory modeling. To create a logistic regression model, we used the glm function that is part of base R with the option family='binomial' to use logistic instead of linear regression. Because log-F is based on data augmentation, we also used glm for that approach. Prior work suggests that a choice of 𝑚 = 1 performed better than a choice of 𝑚 = 2, and 𝑚 = 1 is a good starting choice when nothing is known about the size of the odds ratios [242, 263]. Even though we "know" the true values of the odds ratios because we built them in, we want to approach the problem as if it were real data for which we have no prior information about the features. We then used the default weights of 𝑚/2 for the log-F model. To run the Firth penalization, we used the brglm function from the brglm package [264, 265]. Per the function's documentation, the choice of pl is irrelevant for logistic regression, so we left it at its default value.

For all three approaches, we used the confint function to compute the confidence intervals. For the Firth penalization, we picked ci.method to be 'mean', as the brglm documentation suggests it is a less conservative approach. We then say that a feature was detected, or statistically significant, if zero is not in the confidence interval (or, when working with odds ratios instead of the raw coefficients, if 1 is not in the interval) [266].

To get a sense of how the odds ratio varied based on the data, we also ran a bootstrapped simulation. That is, we randomly selected 80% of the cases and ran the standard logistic regression, Firth penalization, and log-F penalization models on that data. We did this 10,000 times.
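The explanatory regression pipeline can be sketched as follows. This is a hedged outline using the packages named above, not the original analysis code: the data are hypothetical, and the log-F augmentation rows follow the Table 6.1 recipe with an explicit intercept column so that the intercept can be set to 0 in the added rows.

```r
library(brglm)

set.seed(4)
# Stand-in data for illustration: one binary and one continuous feature
n  <- 200
y  <- rbinom(n, 1, 0.3)
b1 <- rbinom(n, 1, ifelse(y == 1, 0.5, 0.35))
c1 <- rnorm(n, mean = 0.5 * y)
dat <- data.frame(y, b1, c1, intercept = 1, w = 1)

# Standard logistic regression; a feature counts as "detected" when the
# 95% confidence interval for its coefficient excludes zero
fit_glm <- glm(y ~ b1 + c1, family = "binomial", data = dat)
confint(fit_glm)

# Log-F(1,1) penalization via data augmentation (m = 1), following Table 6.1:
# one pseudo-observation pair per feature, weight m/2 = 1/2, intercept set to 0
aug <- data.frame(y = c(1, 0, 1, 0),
                  b1 = c(1, 1, 0, 0),
                  c1 = c(0, 0, 1, 1),
                  intercept = 0, w = 1/2)
fit_logf <- glm(y ~ 0 + intercept + b1 + c1, family = "binomial",
                data = rbind(dat, aug), weights = w)
# (glm warns about non-integer successes because of the 1/2 weights; that
# warning is expected with this augmentation trick)
confint(fit_logf)

# Firth penalization via the brglm package
fit_firth <- brglm(y ~ b1 + c1, family = binomial, data = dat)
```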
To create the Lasso, Ridge, and Elastic net models, we again used a train/test split because these algorithms are designed for prediction rather than explanation. To align with the bootstrapping procedure, we used 80% of the cases for the training data and 20% for the testing data. Because Lasso and Ridge have a single tuning parameter, 𝜆, that controls the amount of penalization, we used the cv.glmnet function to find the optimal value of 𝜆. We then used the glmnet function to train the Lasso and Ridge models with their respective best values of 𝜆 [267]. We again repeated this process 10,000 times. Finally, we used the train function from the caret package to find the best values of 𝛼 and 𝜆 for Elastic net and create the model [268]. We again did this 10,000 times with 80% of the data used as training cases.

To analyze the bootstrapped results and generate confidence intervals for the values of the odds ratios, we used percentile bootstraps [269]. Under this approach, all of the bootstrap estimates are sorted from smallest to largest. For a given 𝛼, the bootstrap confidence interval is the interval lying between the 100 × (𝛼/2) and 100 × (1 − 𝛼/2) percentiles. We chose 𝛼 = 0.05 to form a 95% bootstrapped confidence interval.

6.3.3 Neutral Comparison Study Rationale

Following the call of Boulesteix, Lauer, and Eugster for neutral comparison studies in the computational sciences, we address their three criteria and explain why we believe we have met them [246].

A. The main focus of the article is the comparison itself. This implies that the primary goal of the article is not to introduce a new promising method. As stated in the introduction, we are not introducing a method that we have developed, and the focus of our paper is on comparing different methods rather than showing the usefulness of a certain method.

B. The authors should be reasonably neutral. We have not developed any of the algorithms or techniques used in this study and hence have no stake in which method might perform best. We also have experience using predictive and explanatory methods and have used these techniques in our previous work.

C. The evaluation criteria, methods, and data sets should be chosen in a rational way. Our methods and simulated data are based on a previously published simulation study, so we believe they are rational. We believe our evaluation criterion for detection is rational because it is intuitive, objective, and based on prior approaches. We acknowledge that other approaches do exist, and we address those in the discussion.

6.4 Simulation Results

6.4.1 Forest Algorithm Results

When looking at a subset of the results in Fig. 6.3, we see results similar to those of Boulesteix et al. [244]. That is, for the Gini importance, represented by plot A, continuous features are ranked higher than binary features regardless of whether they are noise features or not. For the permutation importances, we see that more balanced features tend to have larger importances than less balanced features even when they have the same odds ratios. For example, when looking at Fig. 6.3D, we see that the OR=3, 60/40 and OR=3, 50/50 features have much higher importances than the OR=3, 90/10 and OR=3, 95/5 features. Similar trends are shown in plots B and C.

To compare across outcome imbalances, we aggregated all 3 sample sizes and 5 outcome imbalances into a single plot for each importance method. The results are shown in Fig. 6.4. Again, the Gini importance, Fig. 6.4A, shows a preference toward continuous features and against binary features for all sample sizes and outcome imbalances. More specifically, there was not a single sample size or outcome imbalance in which any of the categorical features were detected. For the permutation algorithms, the results are similar regardless of whether the accuracy-based or AUC-based permutation method is used.
Regardless of outcome imbalance and sample size, features with smaller imbalances tend to rank higher than more imbalanced features with identical odds ratios. This result is reflected in the plots by the decreasing dot size from top to bottom in any of the rectangles formed by the dotted lines. In cases of high outcome imbalance, moderately imbalanced features might rank higher than the balanced feature (e.g., the binary OR=3 features for the N=10,000, 90/10 case in Fig. 6.4D), but the most imbalanced features never rank higher than the balanced feature with the same odds ratio. In fact, the OR=1.5, 50/50 feature ranks higher than the OR=3, 95/5 feature for some of the models.

Figure 6.3: Importance values for a subset of the random forest models. Feature names shown in black were constructed to be informative while feature names in grey were constructed to be noise. Plot A shows the N=1000, 70/30 outcome imbalance case with the standard random forest algorithm and Gini importance; plot B shows the N=1000, 50/50 outcome imbalance case with the standard random forest algorithm and accuracy permutation importance; plot C shows the N=100, 50/50 outcome imbalance case with the conditional inference forest and AUC-permutation importance; and plot D shows the N=10,000, 60/40 outcome imbalance case with the conditional inference forest and accuracy-permutation importance. For all of the permutation importances, features with less imbalance tend to have larger importances than more imbalanced features with identical odds ratios.

Figure 6.4: The ranks of the informative features for the four importance measures, grouped by the sample size and outcome imbalance. Noise features are not shown and any feature ranked below a noise feature was assigned a rank of 0. Here, a larger circle reflects a higher rank, meaning the feature was more predictive of the outcome. Overall, features with lower imbalance rank higher than features with higher imbalance for a given odds ratio, and the result is not affected by the outcome imbalance or by the specific permutation importance or forest algorithm used.

When looking at sections of columns of the plots in Fig. 6.4, we notice that most informative features cannot be detected for 𝑁 = 100, regardless of which algorithm is used. In fact, only the less imbalanced OR=3 features and the more predictive continuous feature can be detected, and even then, that depends on the level of outcome imbalance. For the 𝑁 = 1000 case, most of the OR=3 features can be detected. However, only the less imbalanced OR=1.5 features are detected in most cases. Across the three permutation-based importances, there does not appear to be a consistent pattern for which OR=1.5 features are detected based on the outcome imbalance. For the 𝑁 = 10,000 case, nearly all of the features can be detected, the exception being the highly imbalanced OR=1.5, 95/5 feature. Again, there is not a consistent pattern as to when this feature will not be detected based on the outcome imbalance. While the OR=1.5, 95/5 feature is never detected in the 50/50 outcome imbalance case, it is sometimes detected in the 70/30 and 90/10 outcome imbalance cases, making a pattern difficult to generalize based on the outcome imbalance.

6.4.2 Logistic regression results

In addition to detecting features, logistic regression provides an estimate of the odds ratio, which can give us an idea of how accurately the algorithms are modeling the built-in odds ratios.
Because the odds ratios and confidence intervals determine detection, we present those first. The odds ratio results are shown in Fig. 6.5. From the 𝑁 = 100 case, plot A, we see that the 95% confidence intervals for most features span at least an order of magnitude regardless of the feature imbalance or outcome imbalance. However, the width of the confidence interval tends to increase with both increasing feature imbalance and increasing outcome imbalance. For example, for the OR=3, 90/10 and OR=3, 95/5 features with a 90/10 outcome imbalance, the confidence intervals are too wide to fit on a plot that spans 6 orders of magnitude. In some cases, the width of the confidence interval for a balanced feature with a highly imbalanced outcome can be comparable to that of a highly imbalanced feature with a balanced outcome, such as OR=1.5; 60/40 with a 90/10 outcome imbalance and OR=3; 95/5 with a 50/50 outcome balance.

Given the width of the confidence intervals, our built-in value of the odds ratio is always contained in the confidence intervals. However, when looking at the actual estimate of the odds ratio, we see varying degrees of accuracy. For some features, like OR=1.5; 50/50, the 80/20 outcome imbalance produced the most accurate estimate, while for OR=3; 60/40, the 60/40 outcome imbalance produced the most accurate estimate. In general, there was no specific trend in how the discrepancy between the estimated value and the built-in value varied with increasing feature or outcome imbalance. In addition, there was no consistent trend as to whether the estimated odds ratio over- or under-estimated the built-in value.

For the 𝑁 = 1,000 and 𝑁 = 10,000 cases, we notice that the confidence intervals have shrunk considerably and now span on the order of a single order of magnitude. This is true even for the most imbalanced features with the most imbalanced outcome. Nevertheless, the width of the confidence intervals still tends to increase with both increasing feature imbalance and outcome imbalance. As with the 𝑁 = 100 case, the built-in values are included in the confidence intervals and there is no consistent trend as to whether the estimated odds ratio over- or under-estimates the built-in value. We note that there is an exception to this for cont; noise2 with the 50/50 outcome imbalance on the 𝑁 = 10,000 plot, where the noise feature is found to have an odds ratio less than 1.

Next, we can conduct an analysis similar to what we did with the forest algorithms and determine which features are detected by logistic regression. Here, because we are using logistic regression in an explanatory manner, we use the p-value to determine whether a feature is detected, with statistical significance meaning a p-value less than a chosen cutoff, 𝛼. Because we are conducting multiple tests of statistical significance, we should control for false positives. Therefore, we present the results with and without a Holm-Bonferroni correction [195], which is less conservative than the traditional Bonferroni correction and has been used in DBER work before [270–272]. The correction is applied within each data set because, for a study with real data, we would only have one data set. The results are shown in Fig. 6.6.

For 𝑁 = 100, when we apply the Holm-Bonferroni correction, the continuous feature with the largest 𝜇 is the only one to be detected, and even then, only for minor outcome imbalances.
If instead we do not apply any corrections, logistic regression is able to detect a few of the OR=3 features; however, these tend to be the ones with lower imbalances. That is, even with a generous definition of statistical significance, logistic regression is unable to detect features with moderate odds ratios or features with large odds ratios but higher imbalances.

Figure 6.5: Values of the odds ratios and 95% confidence intervals found by the logistic regression models, compared by outcome imbalance. Our built-in value is represented by the circled plus. Plot A is a sample size of 𝑁 = 100, plot B is a sample size of 𝑁 = 1,000, and plot C is a sample size of 𝑁 = 10,000. Confidence intervals that span beyond the scale are removed from the plot. Note the log scale on the horizontal axis.

Figure 6.6: Analog of Fig. 6.4 but using logistic regression as the algorithm and statistical significance as the criterion for detection, 𝛼 = 0.05. Plot A uses the Holm-Bonferroni correction to control for multiple tests while plot B uses the uncorrected p-values.

For 𝑁 = 1000, logistic regression is able to detect both continuous features and most of the OR=3 features, regardless of whether we applied a correction to the p-values or not. Unlike the 𝑁 = 100 case, we are able to detect some of the OR=1.5 features, though only those with lower imbalances, and this depends on whether we apply a correction. When we applied the correction, we were only able to detect two of the OR=1.5 features across any of the five outcome imbalances, while if we did not apply the correction, we could detect ten.

Finally, for 𝑁 = 10,000, we were able to detect all of the informative features, regardless of whether we applied a correction or not. However, one of the continuous noise features was marked as statistically significant in the 50/50 outcome imbalance and the 70/30 outcome imbalance cases. One of these disappeared when we applied the p-value correction while the other did not, suggesting that with enough data, random variations in the data might appear as signals.

6.4.3 Penalized regression results

Given the result from Sec. 6.4.2 that most features are detected for 𝑁 = 10,000 even without correction, we chose to focus on the 𝑁 = 100 and 𝑁 = 1000 cases as areas where penalized regression might offer a benefit. To get a representative picture of how penalized regression might help, we applied the algorithms to the 50/50, 70/30, and 90/10 imbalanced outcome data sets, representing no imbalance, medium imbalance, and high imbalance.

6.4.3.1 Confidence interval approach

Because Firth and Log-F penalized regression are designed for explanatory approaches, we can use them to generate confidence intervals. The results for the 𝑁 = 100 data sets are shown in Fig. 6.7 and the results for the 𝑁 = 1000 data sets are shown in Fig. 6.8. Here, we present only the uncorrected 95% confidence intervals because if we do not find a benefit with the uncorrected confidence intervals, we would not find one with the corrected versions.

For the 𝑁 = 100 case, we notice that the Firth and Log-F penalizations tend to have smaller confidence intervals and, in many cases, are closer to the built-in odds ratio than traditional logistic regression is. For the 50/50 case, all three algorithms produce similar confidence intervals for more balanced features such as OR=3; 50/50. For highly skewed features such as OR=3; 95/5, Firth and Log-F penalizations do shrink the confidence interval, with Log-F appearing to offer a greater benefit.
However, none of the shrinking makes a difference as to whether a feature would be statistically significant compared to traditional logistic regression. When we instead look at the moderately imbalanced 70/30 case, we see similar results. That is, the Firth and Log-F penalizations appear to provide a greater benefit in terms of shrinking the confidence interval for features with greater imbalance, though again, the benefit is not enough to change whether a feature would be detected.

For the highly imbalanced 90/10 case, both penalizations reduce the confidence intervals regardless of the feature's imbalance. The benefits are most clear, however, for the most imbalanced features. For example, for OR=3; 90/10, Log-F penalization reduces the width of the confidence interval by nearly 3 orders of magnitude compared to traditional logistic regression. As in the 50/50 and 70/30 cases, the penalizations do not affect whether a feature would be statistically significant, but they still produce more accurate estimates of the built-in odds ratios than traditional logistic regression does.

Figure 6.7: 95% confidence intervals for Firth penalized, traditional, and Log-F penalized logistic regression for the 𝑁 = 100 data sets. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance. Confidence intervals that span beyond the scale are removed from the plot. For higher outcome imbalance, Firth and Log-F penalizations can considerably shrink the confidence intervals.

Figure 6.8: 95% confidence intervals for Firth penalized, traditional, and Log-F penalized logistic regression for the 𝑁 = 1000 data sets. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance. For higher outcome imbalance, Firth and Log-F penalizations can shrink the confidence intervals.

Looking at the 𝑁 = 1000 results in Fig. 6.8, we notice that the confidence intervals of the penalized regression methods are similar in length to those of traditional logistic regression. This result holds regardless of the feature imbalance or the outcome imbalance. When it comes to estimating our built-in odds ratio, the penalized methods do not offer much of an improvement over traditional logistic regression. Indeed, for less imbalanced features, all three methods tend to provide similar estimates, while for more imbalanced features, there is no clear trend as to which method will provide an estimate closest to the built-in value.

6.4.3.2 Bootstrap approach

In addition to considering only whether the algorithm detects a feature, we can also get a sense of what range the estimated odds ratio will fall in using the five different penalization approaches. The results from the 𝑁 = 100 data sets are shown in Fig. 6.9 and the results from the 𝑁 = 1000 data sets are shown in Fig. 6.10.

From the 𝑁 = 100 plot, we see that the spread of the estimated values varies between the different methods. For higher feature imbalances, traditional logistic regression and Firth penalized regression often have the widest distributions. Because Lasso shrinks the coefficients to zero (or, equivalently, the odds ratios to 1) and Ridge reduces the variance of the estimates, these two methods often have the most compact distributions. Likewise, in terms of the median estimate of the odds ratio, we see variation between the methods.
Because Lasso shrinks estimates and Ridge scales estimates, these two methods underestimate the built-in odds ratio. We also find this behavior with Elastic net, which is a middle ground between the two. However, Elastic net often includes the built-in odds ratio within its interval even when Lasso and Ridge do not. This result is especially true for higher feature and outcome imbalances. Log-F penalization, on the other hand, often takes a middle ground on both the estimates and the distribution width. Regardless of the feature or outcome imbalance, Log-F does not consistently over- or under-estimate the built-in odds ratio and does not have the widest distribution of estimates.

Figure 6.9: 95% percentile bootstraps of the odds ratio for Elastic net, Firth, Lasso, Log-F, no, and Ridge penalizations on the 𝑁 = 100 data. Dots represent the median value. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance.

Figure 6.10: 95% percentile bootstraps of the odds ratio for Elastic net, Firth, Lasso, Log-F, no, and Ridge penalizations on the 𝑁 = 1,000 data. Dots represent the median value. Plot A shows the 50/50 outcome imbalance, plot B shows the 70/30 outcome imbalance, and plot C shows the 90/10 outcome imbalance.

From the 𝑁 = 1000 results shown in Fig. 6.10, we see that the six methods tend to produce similar results for more balanced features, even at higher outcome imbalances. The exception is the Firth penalization for higher imbalance features (e.g., OR=3; 90/10). For these higher imbalance features, the Firth penalization estimates can be nearly an order of magnitude larger than the estimates produced by the other methods. As in the 𝑁 = 100 case, we find that Lasso and Ridge tend to underestimate the built-in odds ratio for the 𝑁 = 1000 case. Unlike the 𝑁 = 100 case, however, the built-in value is included in the bootstrapped confidence interval. In most cases, Elastic net, Log-F, and logistic regression tend to have similar distribution widths and have the built-in odds ratios within their intervals. While Elastic net under-predicts the built-in value, Log-F penalized and traditional logistic regression do not show a consistent pattern as to whether they over- or under-predict the built-in value.

6.5 Application to Real Data

In this section, we apply the results of our simulation study to a graduate admissions data set. Our data set comes from the application records of over 5,000 applicants to 6 Big Ten or Midwestern universities over a two-year period. The data include the applicants' GRE scores, undergraduate GPA, undergraduate university, demographics such as binary gender, race, and domestic status, whether the applicant made the shortlist, and whether the applicant was admitted to the program. Details about these features can be found in Posselt et al. [54]. We can then treat each university as a separate case study, which is an approach we have used in our previous work [75]. Doing so allows us to vary the sample size and the outcome imbalance. For the six programs in the data set, the smallest program had 𝑁 = 140 applicants over the two-year period while the largest had 𝑁 = 1228. When considering whether the applicant made the shortlist or was admitted, the outcome imbalance ranged from 53/47 to 83/17, which means that the sample sizes and outcome imbalances are on the same scale as the data we used in our simulation study.
We then selected four of the twelve possible combinations of shortlist or admit and the six programs, representing a small and a medium-sized data set, each with a more balanced and a less balanced outcome. Specifically, we modelled school 1's admission (N=140, 59/41), school 2's shortlist (N=431, 78/22), and school 3's shortlist and admission (N=1228, 60/40 and 78/22), respectively. In the initial paper using these data, Posselt et al. analyzed shortlist and admission separately, and hence we do so here [54].

6.5.1 Methods

To analyze the real data, we used five approaches. First, we used logistic regression and random forest with the Gini importance as they are the "default" methods. Based on the results of the simulation study, we then chose to use Log-F, as it performed either better than or no worse than Firth; Elastic net, as it performed better than Lasso or Ridge and retains the benefits of both; and conditional inference forest with the AUC importance, as all of the permutation-based importance measures performed similarly.

To mimic the simulation study and know which features were certainly noise, we added four binary noise features (imbalances of 60/40, 75/25, 90/10, and 95/5, which we refer to as BinaryNoise1, BinaryNoise2, BinaryNoise3, and BinaryNoise4) and three continuous noise features. The binary features and their imbalances for the four data sets are shown in Table 6.4.

Table 6.4: Feature and outcome imbalances for the binary features from actual graduate school admission data.

  Feature        School 1   School 2   School 3 Shortlist   School 3 Admit
  Outcome        59/41      83/17      59/41                76/24
  Gender         79/21      81/19      85/15                85/15
  Domestic       NA         71/29      50/50                50/50
  Year           57/43      55/45      51/49                51/49
  Race=Asian     87/13      64/36      52/48                52/48
  Race=Black     96/4       99/1       99/1                 99/1
  Race=Latinx    81/19      91/9       99/1                 99/1
  Race=Multi     96/4       97/3       93/7                 93/7
  BinaryNoise1   60/40      60/40      60/40                60/40
  BinaryNoise2   75/25      75/25      75/25                75/25
  BinaryNoise3   90/10      90/10      90/10                90/10
  BinaryNoise4   95/5       95/5       95/5                 95/5
  N              140        431        1228                 1228

To run the models, we used the same R packages as in the simulation study. However, for real data, we should be interested in how well the model fits and hence need to include some measure of that. For the logistic regression based methods, we used the standard McFadden pseudo-𝑅², implemented in the DescTools package via the PseudoR2 function [273], where a good value is between 0.2 and 0.4 [274]. While other choices of pseudo-𝑅² exist, Menard suggests that there is little reason to prefer one over another, but McFadden's might be preferable because it is intuitive [275].

Table 6.5: McFadden pseudo-𝑅² values for the explanatory models.

                        School 1   School 2   School 3 shortlist   School 3 admit
  Logistic Regression   0.256      0.215      0.199                0.203
  Log-F                 0.252      0.215      0.199                0.203

Table 6.6: AUC values for the various models on the four data sets.

                                         School 1   School 2   School 3 shortlist   School 3 admit
  Logistic Regression                    0.826      0.806      0.790                0.799
  Log-F                                  0.825      0.805      0.790                0.799
  Elastic (Train)                        0.830      0.807      0.785                0.799
  Elastic (Test)                         0.690      0.734      0.771                0.779
  Random Forest (Train)                  0.547      0.529      0.688                0.613
  Random Forest (Test)                   0.564      0.521      0.686                0.616
  Conditional Inference Forest (Train)   0.749      0.517      0.793                0.674
  Conditional Inference Forest (Test)    0.594      0.500      0.681                0.597

To connect the forest methods with the logistic regression methods, we also computed the AUC for each model, which follows the recommendation of Aiken et al. [74]. To do so, we used the auc function from the ModelMetrics package [276]. We interpreted an AUC of at least 0.7 as indicating a good model [93].
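For concreteness, the model-fit checks described above can be computed as in the following sketch; the data set here is a hypothetical stand-in for one school's records, not the actual admissions data.

```r
library(DescTools)
library(ModelMetrics)

set.seed(5)
# Toy stand-in for one school's application data (hypothetical values)
n   <- 300
dat <- data.frame(ugpa = rnorm(n, mean = 3.4, sd = 0.3), gre = rnorm(n))
dat$admit <- rbinom(n, 1, plogis(-1 + 1.5 * scale(dat$ugpa)[, 1]))

# Explanatory fit on all of the data: McFadden pseudo-R^2 and in-sample AUC
fit_all <- glm(admit ~ ugpa + gre, family = "binomial", data = dat)
PseudoR2(fit_all, which = "McFadden")   # rule of thumb: 0.2-0.4 indicates good fit
auc(dat$admit, predict(fit_all, type = "response"))

# Predictive framing: 80/20 split and out-of-sample AUC (rule of thumb: >= 0.7)
train_idx <- sample(n, size = 0.8 * n)
fit_train <- glm(admit ~ ugpa + gre, family = "binomial",
                 data = dat[train_idx, ])
auc(dat$admit[-train_idx],
    predict(fit_train, newdata = dat[-train_idx, ], type = "response"))
```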
For the predictive methods, Elastic net, random forest, and conditional inference forest, we used the same procedure as in the simulation study, except that we now used an 80/20 train/test split for all methods and calculated the AUC on both the training and testing data sets.

6.5.2 Results

First, we present the metrics used to assess our models, which are shown in Table 6.5 and Table 6.6. We notice that, except for school 3 shortlist, all of the pseudo-𝑅² values are within the accepted range (≥ 0.2). When looking at the AUC values, we notice that the regression models outperform the forest models and that, in most cases, the forest models do not produce an AUC in the acceptable range. A review of the physics education research literature found that less than 10% of papers reported out-of-sample metrics, so we cannot say whether these results are typical for this type of data [74]. As our goal is not to make the best model but rather to extract features, we did not do any parameter tuning for the forests. We discuss these metrics further in the discussion.

Figure 6.11: Comparison of the odds ratio (A), Gini importance (B), and AUC-permutation importance (C) for the features in school 1. Notice that RaceLatinx has a similar odds ratio to RaceBlack and RaceMulti according to (A) but only RaceLatinx is detectable in (C). RaceLatinx is less imbalanced than RaceBlack and RaceMulti.

Because the conclusions from the four data sets are similar, we share only the results of school 1 and provide plots for the other data sets in Appendix G for completeness. The results of the algorithms applied to the school 1 data set are shown in Fig. 6.11. When looking at plot A, we notice that Log-F noticeably shrinks the confidence interval for highly skewed features like RaceBlack. In exchange, though, the estimate of the odds ratio is shrunk closer to 𝑂𝑅 = 1 for nearly all of the features. Even though Elastic net is showing the percentile bootstrapped confidence interval instead of the statistical confidence interval, the results tend to be aligned with the other methods. That is, the median value is on the same order of magnitude as the other estimates and the end points of the confidence interval are also on the same order of magnitude as the other estimates.

When comparing the different methods, we see that none of the three algorithms would have led to different conclusions about which features are statistically significant. From plot A, the statistically significant features would be VGRE, UGPA, RaceMulti, RaceLatinx, RaceBlack, and BinaryNoise1. In the case of BinaryNoise1, which is supposed to be a noise feature, we note that due to the random nature of generating the feature, its odds ratio was smaller than 1, and hence the algorithms appear to have detected that small difference. When we move to plot B, we note that the continuous features are all ranked above the binary features, as expected. As a result, all features except for one rank lower than the first noise feature. Finally, when we move to plot C, we notice that only four features are detected, which is fewer than for the regression approaches. Because prediction and explanation have different goals, we would not expect them to identify the same features. Yet multiple approaches identifying the same features suggests that these features are, in fact, distinct from noise. One interesting point to note is that we see some ranking issues based on imbalance.
For example, using a 2x2 contingency table to calculate the theoretical odds ratios, RaceMulti should have an odds ratio of 2.96 while RaceLatinx should have an odds ratio of 2.25. However, because RaceLatinx has an imbalance of 80/20 while RaceMulti has an imbalance of 96/4, RaceLatinx is detected by the AUC-permutation importance while RaceMulti is not.

6.6 Discussion

Here we address our research questions and consider how our choices and approaches might have impacted the conclusions we can draw from this study. We include a summary of the advantages and disadvantages of each algorithm, based on our study and prior work, in Table 6.7.

Table 6.7: Summary of advantages and disadvantages for each algorithm used in this study.

  RF + Gini
    Advantages: Default choice for many random forest implementations.
    Disadvantages: Biased in favor of continuous features, regardless of whether they are informative of the outcome or not.

  RF + accuracy permutation importance, CIF + accuracy permutation importance, CIF + AUC permutation importance
    Advantages: Can be used with continuous and categorical features; categorical features do not need to be binarized; comparable performance to logistic regression for feature selection without needing to check any assumptions.
    Disadvantages: Ability to detect features decreases with increasing feature imbalance and outcome imbalance; questionable performance for small N.

  Logistic Regression
    Advantages: Standard algorithm for classification, implemented in most software; odds ratios have a "real-world" interpretation.
    Disadvantages: Width of confidence interval increases with increased outcome and feature imbalance and can become infinite in some cases.

  Firth penalization
    Advantages: Able to shrink confidence intervals for imbalanced features in small N situations.
    Disadvantages: Not widely implemented in software; advantages compared to logistic regression disappear for larger N.

  Log-F penalization
    Advantages: Able to shrink confidence intervals for imbalanced features in small N situations; based on data augmentation, so no special software is needed; coefficient estimates are similar to those of traditional logistic regression.
    Disadvantages: Advantages compared to logistic regression disappear for larger N.

  Lasso
    Advantages: Shrinks some coefficients to zero, which can be useful for feature selection.
    Disadvantages: Less able to distinguish less informative features from noise compared to the other algorithms in the study.

  Ridge
    Advantages: Effective at shrinking the width of the distribution of estimated odds ratios.
    Disadvantages: All coefficients are scaled by the same amount and are underestimated.

  Elastic Net
    Advantages: Combines the benefits of Lasso and Ridge penalizations, often performing better than either approach individually.
    Disadvantages: Requires hyperparameter tuning to determine the ideal amount of mixing between Lasso and Ridge.

6.6.1 Research Questions

How might known random forest feature selection biases change when the outcome is imbalanced, as is often the case in EDM and DBER studies, and does the AUC-permutation importance affect those biases?

When we vary the outcome imbalance as well as the feature imbalance, we still observe the same general trend as seen in Boulesteix et al. [244]. That is, features with higher imbalance are less likely to be detected than features with lower imbalances but the same odds ratio. In fact, the bias might become worse for high outcome imbalances because it is harder to train a "good" model when most of the cases have the same outcome.
In opposition to the claims of Janitza et al. [94], we do not find the AUC permutation importance to outperform the accuracy permutation importance. In fact, we find that the AUC permutation importance and the accuracy permutation importance perform similarly, regardless of the outcome imbalance. Further, we did not find any consistent differences in the features detected by random forest and conditional inference forest, even though conditional inference forest is supposed to be better suited for categorical data [92].

We also see this preference for features with smaller imbalances in the real data. For example, for school 1, we saw that the more balanced RaceLatinx was detected over the less balanced RaceBlack and RaceMulti even though the theoretical odds ratio of RaceLatinx was smaller than that of the other two features.

Across the real data and simulated data, we see the expected bias with the Gini importance, in which the continuous features are ranked higher than any of the categorical features. This result is most noticeable in Fig. 6.4 plot A, where only continuous features are detected, and in Fig. 6.11 plot B, where all of the continuous noise features outrank all but one feature.

How might known machine learning biases manifest in traditionally explanatory techniques such as logistic regression?

We see similar biases in logistic regression as we see in the random forest for feature selection. For a sample size of 𝑁 = 100 with a multiple comparison correction, we are unable to detect most features, and even without correction, we can only detect low imbalance OR=3 features. The uncorrected results are similar to those of the permutation importances for the forest algorithms. For 𝑁 = 1000, we can detect most OR=3 features and, without correction, low imbalance OR=1.5 features. Again, the uncorrected logistic regression results resemble those of the forest algorithms but seem to be more aligned with the conditional inference forest results than the random forest results. Once we get to a large sample size, 𝑁 = 10,000, we can detect nearly all features, just as we can for the forest algorithms. However, for logistic regression, we also get an occasional false positive. Given the size of the data, it is not unreasonable that the logistic regression model might be picking up on minor differences in the noise features, which it treats as a signal.

With explanatory techniques like logistic regression, we could also investigate how well they estimated the built-in odds ratio. We found that while the built-in value is almost always in the confidence interval, this has more to do with the width of the intervals than with the ability of the algorithms. In general, the confidence interval width increases with feature and outcome imbalance and decreases with sample size. The decrease in width as the sample size increases corresponds to what we would expect based on the conclusions of Nemes et al. [277]. We also observed the same general trend for the real data. Features with higher imbalances tend to have the widest confidence intervals, which can span several orders of magnitude.

How might penalized regression techniques successfully applied in other disciplines be used in EDM and DBER to combat any discovered biases?

While none of the five techniques we tried, Firth penalization, Log-F penalization, Lasso, Ridge, or Elastic net, corrected the bias, they did show promise for use in future EDM and DBER studies.
For explanatory methods, Firth and Log-F were found to shrink the confidence intervals, especially for highly imbalanced features and highly imbalanced outcomes. While Firth can still show wide confidence intervals, the Firth confidence intervals were found to be smaller than those of traditional logistic regression. On the other hand, Log-F provided at worst similar performance to Firth penalization and, for higher imbalances, seemed to shrink the width of the confidence interval more than Firth penalization did. We found that both of these methods were most useful for the smaller data sets, 𝑁 = 100, while for the medium and larger data sets, their performance was similar to that of traditional logistic regression. When it came to the distributions of the estimated odds ratios, Log-F often showed a narrower distribution. While the results were comparable for the small data sets, for medium data sets and features with high imbalance, Firth penalization overestimated the odds ratio and had more variability. Conversely, Log-F produced more accurate and less variable distributions.

For the predictive methods, Lasso, Ridge, and Elastic net were only used in a bootstrap, so we cannot discuss the confidence interval width. We can, however, discuss the distribution of estimated odds ratios. For Lasso and small data sets, we find that many of the features are shrunk to zero, especially for higher imbalances. For example, even for a small, balanced sample, many of the OR=1.5 features were shrunk to zero while the other methods did not treat them as consistent with noise. Elastic net showed similar results, although the effect was not as severe. For Ridge and small data sets, the distribution of the estimated odds ratio was often the narrowest for a given data set. Given that Ridge is designed to shrink the variability of the estimates, this finding is not surprising. For medium data sets, Lasso, Ridge, and Elastic net performed similarly to the other methods in terms of the distribution of estimated odds ratios.

While our results generally agree with other studies, a true comparison is difficult because each study used its own subset of the algorithms, including ones we used in our study as well as ones we did not. Therefore, which algorithm performed best and under what circumstances depends just as much on the algorithms it was compared to as on the algorithm itself. In general, other studies have tended to find that Firth penalization does outperform logistic regression in the case of outcome imbalance [251, 278–280] and that Log-F penalization shows promise when working with imbalanced data and can outperform Firth penalization [263, 281]. Likewise, Pavlou et al. found that Ridge penalization works well except when there are many noise features, while Lasso performs better when there are many noise features but limited correlations, which is consistent with our results [282]. Their study also found that Elastic net seemed to perform well in all cases, which generally matches what we found. In terms of our finding that none of the methods fixed the issues around feature or outcome imbalance, Van Calster reported a similar finding for shrinkage techniques [283]. Specifically, they found that despite working well on average, shrinkage techniques often did not work well on individual data sets, even in cases where the techniques could have provided the most benefit, such as in small sample size or low events per variable cases.
Even though the techniques did not solve any of the issues in our study, they still showed promise for reducing the scale of the confidence intervals and warrant greater adoption by the DBER and EDM communities.

6.6.2 Limitations and Researcher Choices

In this section, we shift our focus from the results of the research questions and instead consider how our choices around constructing the simulated data, tuning or not tuning our models, defining "detected features," and assessing the models might have impacted the conclusions we can draw from this study.

6.6.2.1 Our data sets

For our simulation study, we used the same levels of information as in the Boulesteix et al. study, which we wished to extend [244]. We followed their convention that 𝑂𝑅 = 3 corresponded to a large effect while 𝑂𝑅 = 1.5 corresponded to a moderate effect. However, Olivier noted that what constitutes a large, medium, or small odds ratio depends on the feature imbalance, outcome imbalance, and correlations [260]. Therefore, even though we are using the same odds ratios for the different imbalances, they might not necessarily contain the same amount of predictive or explanatory power in a "large," "medium," or "small" sense.

One noticeable difference between our study and the Boulesteix et al. study was the number of features [244]. While we argued that DBER and EDM studies usually have the number of features on the order of 10 rather than 100, one could argue that we still had too many features given our sample sizes. For example, a rule of thumb is that there should be at least 10 cases of the minority outcome for each feature in the model, referred to as the events per variable [284, 285]. In that case, we would have needed a sample size of at least 400 for the 50/50 outcome imbalance case and a sample size of at least 2,000 for the 90/10 outcome imbalance case. However, recent work has called into question whether this rule of thumb is supported by evidence [278]. Van Smeden et al. found that events per variable did not have a strong relation to the predictive performance of models and instead recommended that a combination of the number of predictors, the total sample size, and the events fraction be used to assess sample size criteria [286]. Likewise, Courvoisier et al. found that logistic regression can encounter problems even when the events per variable are greater than 10 and concluded that there is no single rule for guaranteeing an accurate estimate of parameters for logistic regression [287]. Even if the rule of thumb were true for logistic regression, Pavlou et al. claim that penalized regression is effective when the events per variable are less than 10 [288].

6.6.2.2 Hyperparameter tuning

For our simulation study, we did not do extensive hyperparameter tuning for the forest algorithms. We made this choice because 1) Probst, Wright, and Boulesteix found that random forest is robust against hyperparameter specification, its performance depends less on the hyperparameters than that of other machine learning methods, and its default choice of hyperparameters is often good enough [289], and 2) Couronné, Probst, and Boulesteix state that for a method to become a standard tool (as random forest is in EDM and is becoming in DBER), it needs to be easy to use by researchers without computational backgrounds and cannot involve complex human interaction, which is not true of hyperparameter tuning [290].
In addition, we only did hyperparameter tuning for Lasso, Ridge, and Elastic net because testing multiple values of 𝜆 is built in to the glmnet algorithm that we used to run the models. Even then, recent work suggests that optimizing 𝜆 for small or sparse data sets results in substantial variability of the coefficients and that the found 𝜆 might be negatively correlated with the optimal values, meaning that hyperparameter tuning might not have been advisable for our data in the first place [291].

However, for completeness and to minimize the computation time needed, we did experiment with multiple choices for the number of trees in the forest, 𝑛𝑡𝑟𝑒𝑒, and the number of features used for each tree, 𝑚𝑡𝑟𝑦. For the conditional inference forests with the AUC importance, a sample size of 𝑁 = 1000, and outcome imbalances of 50/50, 60/40, 70/30, 80/20, and 90/10, we tried 𝑛𝑡𝑟𝑒𝑒 = {50, 100, 500, 1000, 5000} and 𝑚𝑡𝑟𝑦 = {1, 𝑝/3, √𝑝, 𝑝/2, 𝑝}, where 𝑝 is the total number of features in the model. We did not find any meaningful differences in which features were selected, and no set of hyperparameters consistently performed better than the default (𝑛𝑡𝑟𝑒𝑒 = 500, 𝑚𝑡𝑟𝑦 = √𝑝). Therefore, we used the default choices throughout the study.

6.6.2.3 Determining Detected Features

For the forest algorithms, we chose the simple and intuitive criterion of whether a feature ranked above the first noise feature to determine which features were detected. We were able to do this because we knew which features were noise and, in the case of the real data, we added features we created to be noise. We could, however, have used a variety of other methods to detect features, though each has its own limitations in the context of our study. See Hapfelmeier and Ulm for an overview of different approaches, some comparisons, and their own novel method [292].

In general, the techniques for feature selection in forest algorithms fall into two broad categories. First, there are elimination techniques that pull out a subset of the features based on some criterion. For example, Díaz-Uriarte and Alvarez de Andrés used a recursive backward elimination technique that removes a certain fraction of features until only 2 remain [97]. The technique then selects the model with the fewest features that performs within 1 standard error of the best model using whatever metric the researcher chooses. These types of methods are not appropriate for this study because they can restrict the features too much. That is, by having some cutoff or elimination procedure, features which contain only a small amount of predictive information could be eliminated even though they are in fact predictive.

The second common approach is to use some type of permutation test to generate a p-value. Under this approach, either the outcome or each individual feature is permuted and then run through the model to produce some metric. This is done a large number of times to get a distribution of the metric. Then the unpermuted data are run through the model to get the actual value of the metric. The p-value is then the fraction of cases where the permuted metric is at least as extreme as the actual value of the metric [293]. This approach has been used in various random forest studies [294, 295] and has been extended into the PIMP heuristic for correcting the Gini importance bias [296].
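As an illustration of this permutation-test idea, here is a minimal sketch with hypothetical data that permutes the outcome to build a null distribution for a single feature's importance; the published implementations cited above differ in their details.

```r
library(randomForest)

set.seed(6)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(rbinom(n, 1, plogis(0.8 * x1)))
dat <- data.frame(y, x1, x2)

# Observed accuracy permutation importance of x1
observed <- importance(randomForest(y ~ ., data = dat, importance = TRUE),
                       type = 1)["x1", ]

# Null distribution: refit with the outcome permuted many times
n_perm <- 200   # kept small here; real applications would use more
null_importance <- replicate(n_perm, {
  permuted <- dat
  permuted$y <- sample(permuted$y)
  importance(randomForest(y ~ ., data = permuted, importance = TRUE),
             type = 1)["x1", ]
})

# Permutation p-value: fraction of null importances at least as extreme
mean(null_importance >= observed)
```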
While these methods provide a p-value analogous to those used in explanatory methods, they can be computationally intensive, as they require the null distributions to be constructed from scratch for each model. As a way to reduce the computational complexity, Janitza, Celik, and Boulesteix proposed that the negative importances, which are assumed to belong to noise features because they make the predictions worse, could be used to construct a null distribution [297]. Under this approach, the distribution of the negative importances is reflected across the axis to create the distribution for positive values. The same procedure as above can then be used to calculate the p-values. While this procedure is computationally feasible, DBER and EDM studies often have on the order of 10 features, which means there are a limited number of features that could have negative importances and thus, mirroring the distribution would be of little use.

6.6.2.4 Assessing Our Models

For our real data, we noticed that many of the models did not produce out-of-sample AUCs in the acceptable range of at least 0.7. Here we try to address why. First, we acknowledge that one type of model should not always perform better than another; this is the basis of the “no free lunch theorems” for optimization [298]. Various studies comparing logistic regression and random forest find a similar result, where which algorithm performs best depends on the data set [290, 299, 300]. Therefore, the fact that logistic regression models perform better than the forest models is not necessarily a problem. In fact, by comparing multiple models and finding that some work better than others, we can have greater confidence that our results are detecting a signal in the data and not just modeling the random variations in the data.

Second, we need to acknowledge that overfitting is happening with the Elastic net and conditional inference forests. This overfitting can be detected by looking for differences in the training and testing set AUCs, where a higher training AUC is characteristic of overfitting. The amount of overfitting seems worse for the smaller data sets, as shown in Table 6.6. This result is not unexpected because with smaller data sets, there are fewer cases to learn from. The noise in the model might then be seen as a signal and treated as though it contains predictive information.

If we look at the other models and their results in Table 6.6, we notice that the forest models and Elastic net perform best on the school 3 data sets, which correspond to the medium-sized data sets in the simulation study and the largest of the real data sets. The forest algorithms specifically perform best on the school 3 shortlist data set, which happens to have a smaller outcome imbalance than school 3 admit. This result suggests that to effectively use the predictive approach, the data set should not be too imbalanced and, based on the results for school 1 and school 2, the amount of data should be on the order of 1,000 cases.

Additionally, the higher AUC for logistic regression and Log-F might be thought of as its own type of overfitting. Because they are not subject to the train/test split of the predictive paradigm, these two methods work with the full data set rather than just 80% of the cases, corresponding to a 25% increase in the amount of data to work with and hence learn from. With the “extra” data, these models might be better able to detect trends in the data and separate them out from noise.
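A minimal sketch of the train-versus-test AUC comparison described above, assuming a hypothetical data frame d with a binary (0/1) outcome y; it uses an elastic net fit from glmnet and AUCs from pROC as stand-ins for the models in Table 6.6.

library(glmnet)
library(pROC)

set.seed(1)
train_idx <- sample(nrow(d), size = floor(0.8 * nrow(d)))  # 80/20 split
x <- model.matrix(y ~ ., data = d)[, -1]                   # drop the intercept column

fit <- cv.glmnet(x[train_idx, ], d$y[train_idx], family = "binomial", alpha = 0.5)

p_train <- as.numeric(predict(fit, x[train_idx, ], type = "response", s = "lambda.min"))
p_test  <- as.numeric(predict(fit, x[-train_idx, ], type = "response", s = "lambda.min"))

# A training AUC well above the testing AUC is the overfitting signature noted above
auc(roc(d$y[train_idx], p_train))
auc(roc(d$y[-train_idx], p_test))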
While there do exist techniques for detecting overfitting in logistic regression, many of them use some type of testing or validation data set. For example, the Copas test of overfitting recommends splitting the data in half, using one half of the data to develop the regression model, using that model with the other half of the data to make predictions of the outcome, and then performing a linear regression with the predictions and actual values, testing whether the coefficient is different from 1 [301]. If it were, that would provide evidence of overfitting. However, this approach is nearly equivalent to using logistic regression in a predictive manner rather than the way it is traditionally used in DBER and EDM.

For a technique that aligns with the explanatory nature of logistic regression, we can examine residual plots. Because logistic regression produces discrete residuals, binned residual plots might be more helpful [302]. Under this approach, cases are divided into bins and the average value in each bin is plotted against the average residual in that bin. This approach allows the otherwise binary residual to take on any value of the form i/n_bin, where n_bin is the number of cases in the bin and i is an integer with −n_bin ≤ i ≤ n_bin. When implemented via the arm package [303], 95% confidence intervals are generated, and we can get an idea of how good the model is by examining what fraction of the binned residuals fall within the intervals. When we do so, we find that the fraction of residuals falling outside of the confidence intervals is between 0.20 for School 3 shortlist and 0.34 for School 3 admit, suggesting the models might, in fact, not fit well. There does not appear to be a pattern based on the sample size or outcome imbalance. The plots are shown in Fig. G.4 in appendix G.

6.7 Future Work

While we considered six approaches to logistic regression, two machine learning algorithms, and three importance measures, these are not the only approaches we could have used. Indeed, these are not even the only logistic regression or random forest techniques we could have used; we chose these algorithms as a starting point. Future work could then consider how other modifications of logistic regression or random forest might improve upon the problems we have identified here.

For example, for logistic regression algorithms, Puhr et al. proposed two modifications to Firth penalization, a post-hoc adjustment of the intercept and iterative data augmentation, that showed promise in their simulation study [304]. Based on their results, they recommend using their methods or penalization by Cauchy priors [305], which we did not include in this study, as better options than Log-F when confidence intervals are of interest. Furthermore, a later study comparing logistic regression, Firth penalization, and the modifications to Firth penalization found that the modifications to Firth's method worked best in terms of parameter estimation bias for rare events and small sample cases [306].
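For readers who want to experiment with these alternatives, the sketch below fits a Firth-penalized model with the logistf package and a Cauchy-prior penalized model with bayesglm from the arm package (one implementation of the Cauchy-prior idea). The data frame d with a binary (0/1) outcome y is hypothetical, and the sketch does not include Puhr et al.'s intercept-corrected or augmented variants.

library(logistf)  # Firth-penalized logistic regression
library(arm)      # bayesglm uses weakly informative Cauchy priors by default

firth_fit  <- logistf(y ~ ., data = d)
cauchy_fit <- bayesglm(y ~ ., data = d, family = binomial(link = "logit"))

summary(firth_fit)   # penalized coefficient estimates and confidence intervals
summary(cauchy_fit)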
In terms of forest algorithms, there are several variants that might be useful for the data we encounter in DBER and EDM. For example, Balanced Random Forest and Weighted Random Forest have been developed for working with imbalanced outcomes [307], and Oblique Random Forests have been developed to allow for diagonal cuts in the feature space rather than the horizontal or vertical cuts allowed under traditional random forest algorithms [308]. In their study, Menze et al. found that Oblique Random Forests outperform traditional random forest when the data are numerical rather than discrete, which might show promise for our data depending on the ratio of numerical features to categorical or binary features [308]. Alternatively, there are non-CART-based approaches to random forest [309]. Loh and Zhou conducted a simulation study of various approaches to random forest and variable importance [310], finding that forests grown using the GUIDE algorithm, which is implemented for both classification [311] and regression [312], were unbiased while the random forest and conditional inference forest approaches we used here were not. In their study, a method was unbiased “if the expected values of its scores are equal when all variables are independent of the response variable,” which would correspond to a case in our study where the odds ratios were 1 for all features. Nevertheless, such an approach might still be worth looking into.

There are also newer importance measures that show promise. In a simulation study, Nembrini, König, and Wright proposed a modification of the Gini importance, which they claim removes its bias toward features with more categories as well as the biases observed here regarding feature imbalance [245]. However, their simulated studies with feature importance only considered null cases in which none of the features were predictive of an outcome. Nevertheless, further study of this approach might be fruitful.

In contrast to the algorithms used to analyze the data, future work should also explore how changes to the data itself might affect algorithm performance. For example, we could use the risk ratio to encode the level of information in a feature instead of the odds ratio. In theory, the risk ratio provides a more intuitive way to quantify the amount of information in a feature because it is based on a ratio of probabilities rather than a ratio of odds. Zhang and Yu proposed a method to convert the odds ratio to a risk ratio [313], though more recent work has called this approach into question and suggests alternatives [314, 315]. Because these two measures are related but not the same, there might be additional insights related to which features are detected based on how we define the amount of “predictiveness” they have.

To better replicate real DBER and EDM data, future work could also explore how the amount of correlation between the features affects the results. In the case of correlated features, new issues with permutations emerge, including the model needing to extrapolate to regions where the model was not trained in order to calculate a feature importance [95]. Various approaches have been developed for correlated data with forest algorithms [95, 96, 316], which warrant future study, especially for the type of data we see in DBER and EDM studies.

Additionally, future work can extend the data beyond binary features and include categorical features. While logistic regression often requires categorical features to be binarized, doing so can cause a loss of information. For example, treating exam responses as correct or incorrect hides information about the specific incorrect answer the student chose and possible patterns [317], and combining demographics into a single “underrepresented” category can hide the struggles of students of different races and ethnicities [318].
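To make the binarization point concrete, the short sketch below contrasts full dummy coding of a multi-category demographic feature with collapsing it into a single binary indicator; the data frame and category labels are hypothetical and purely illustrative.

# Hypothetical multi-category feature
students <- data.frame(
  race = factor(c("Asian", "Black", "Latinx", "Multiracial", "Native", "white"))
)

# Full dummy (one-hot) coding keeps each category distinct
model.matrix(~ race, data = students)

# Collapsing to a single indicator discards the distinctions among the grouped categories
students$blmn <- students$race %in% c("Black", "Latinx", "Multiracial", "Native")
table(students$race, students$blmn)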
As random forest implementations can often handle categorical features directly, future work can consider how the biases explored in this study might manifest in categorical data and how aggregating or segregating features might introduce its own biases.

Finally, future work can consider how both the data and algorithms affect how features are detected. Recently, Pangastuti et al. found that a combination of bagging, boosting, and SMOTE improved random forest's classification ability on a large, imbalanced educational data set [319]. We need to be careful with such approaches, however, because our data is not just data but represents actual students. Therefore, we need to be certain that our conclusions are based on the student data and not on simulated students “created” to make the data easier to analyze.

6.8 Conclusion and Recommendations

Our work suggests that for both predictive and explanatory models, feature and outcome imbalances can cause algorithms to detect different features despite the same built-in amount of information. We found this to be true for random forest, conditional inference forest, and logistic regression, as well as for various penalized regression algorithms. On a practical level, this means that if we are using these algorithms to determine which features might be related to some outcome of interest, we might be introducing false negatives into our results, potentially missing factors that are related to the outcome.

Based on the results of this study, we propose three recommendations for DBER and EDM researchers. First, for smaller data sets with highly imbalanced features, we recommend using a penalized version of logistic regression such as Log-F. Even though Firth penalization was often comparable to Log-F, Firth penalization is not implemented in all statistical software. Log-F, however, can be used with any statistical software that can perform logistic regression because it is based on data augmentation. If the outcome is also imbalanced, it is even more essential to consider penalized approaches.

Second, for medium or large data sets (N ≥ 1,000), traditional logistic regression and random forest or conditional inference forest with a permutation importance perform similarly, so either approach works. While the algorithms still do not perform perfectly, none of them provided a consistent advantage over another. We recommend that researchers first consider whether the research questions are best answered using predictive or explanatory techniques and then which affordances of the algorithms are most relevant to the study.

Finally, we call on researchers to include information about their features in their publications, including the features themselves, their distributions, and, in the case of categorical or binary features, their class frequencies, as others outside the DBER and EDM communities have done in the past [320]. A simple example of how this might be done is shown in Table 6.4. In addition, we recommend that researchers include data set characteristics or so-called “meta-features” as well. Some examples include the sample size, the number of features, the number of numerical features, the number of categorical features, and the percentage of observations in the majority class or outcome balance [290].
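As a concrete, hypothetical illustration of this recommendation, the sketch below assembles class frequencies and a few of the “meta-features” listed above for an arbitrary data frame d with outcome column y; it is not the reporting format used in Table 6.4.

feature_cols <- setdiff(names(d), "y")

# Class frequencies for categorical or binary features; quartiles for numeric ones
lapply(d[feature_cols], function(col) {
  if (is.numeric(col)) quantile(col, na.rm = TRUE) else prop.table(table(col))
})

# Data set "meta-features"
data.frame(
  n               = nrow(d),
  n_features      = length(feature_cols),
  n_numeric       = sum(sapply(d[feature_cols], is.numeric)),
  n_categorical   = sum(!sapply(d[feature_cols], is.numeric)),
  outcome_balance = max(prop.table(table(d$y)))  # fraction in the majority class
)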
Just as there have been calls for increased reporting of demographics in the DBER and EDM communities to understand how results might depend on the sample population or generalize [13, 321], we are calling for the same with the explanatory and predictive models we create, partially addressing some of the questions raised by Knaub, Aiken, and Ding in their analysis of quantitative work in physics education research [322]. By doing so, we hope for greater acknowledgement of possible sources of bias or false negatives in feature selection as a result of the data or algorithms used in DBER and EDM studies.

APPENDICES

APPENDIX A

RANDOM FOREST BACKGROUND

The following appendix comes from the supplemental material of Young et al. [65]. The published version includes Grant Allen, John M. Aiken, Rachel Henderson, and Marcos D. Caballero as co-authors. It is reproduced here without changes.

A.1 The Random Forest algorithm

A.1.1 Decision tree learning

Random forests have their roots in decision tree learning [323]. Decision tree learning uses a set of binary decisions to develop a model for the data set. For our purposes, we focus on classification trees, where the object of the model is a class label (i.e., a particular categorical outcome). The decision tree algorithm is provided with the classes and the data that should predict the classes (i.e., input variables). Conceptually, the decision tree algorithm searches the input variables for the one that best segregates the data into separate classes. That choice of “best” can be user specified, but if left to the algorithm, it will be the variable for which a majority of members of a class appear on one side of the decision (termed “branch”) and not on the other side. For input variables that are continuous data, the algorithm further decides on the binary decision that best splits the data. For example, if “age < 55” was the binary decision, the algorithm both chose “age” as the input variable and “55” as the cut-off. The algorithm continues to make these decisions, splitting the data into more and more branches until all branches terminate in a single class (termed “leaves”) or until a user-specified level. Tracing the path back from any leaf (single class or multiclass) to the starting point shows all the decisions that were made to obtain that leaf. In this sense, decision tree learning is a glass-box algorithm – a researcher can see every step along the path.

Although the researcher can view all parts of the model and how it was constructed, any single decision tree is strongly tied to the data used to construct it. This leads to overfitting of the data [323, 324]. That is, the decision tree algorithm can produce a model that exactly provides unique classifications for the data it is given. As such, applying that model to predict classes in a new data set will often produce false predictions, as the model was so strongly tied to the initial data set on which it was trained. Overfitting is a common problem in machine learning techniques where a single analysis is conducted [325]. To deal with this, some decision trees are pruned [248] – reduced to a smaller size by removing leaves that predict only small classes. However, as computational time has become less expensive, ensemble methods that develop a series of models from random selections of data are a more common method for combating overfitting [326]. The Random Forest algorithm is one such ensemble method, which grows out of decision tree learning.
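Before moving to the ensemble, a single classification tree can make the branch-and-leaf picture concrete. The sketch below grows one tree on a toy data set; the variables (age, score) and the rpart package are illustrative choices, not the data or implementation referenced above.

library(rpart)

set.seed(7)
toy <- data.frame(
  age   = sample(20:70, 200, replace = TRUE),
  score = runif(200)
)
# Tie the outcome loosely to "age < 55" so the tree has a real split to find,
# then flip 10% of the labels so the split is informative but not perfect
label <- ifelse(toy$age < 55, "Yes", "No")
flip  <- runif(200) < 0.1
label[flip] <- ifelse(label[flip] == "Yes", "No", "Yes")
toy$outcome <- factor(label)

tree <- rpart(outcome ~ age + score, data = toy, method = "class")
print(tree)  # each split is a branch; terminal nodes are the leaves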
A.1.2 Random decision trees

A Random Forest is grown from a set of random decision trees and stems from a technique known as “tree bagging.” Tree bagging, or simply bagging, refers to running the decision tree algorithm a specified number of times on a random selection of data with replacement [327]. Data selected to be used in the algorithm is “bagged.” Each time the algorithm runs, it produces a decision tree from a randomly selected set of data. The input variables are scored based on how well they classify the data. Those that continually classify well earn higher scores, while those that do not are given lower scores. The result is a set of input variables that have been tested on a variety of data sets so that those input variables that are most important for classification can be found.

Random forests use the bagging technique, but also randomly select a subset of input variables [91]. That is, data to develop the Random Forest model are randomly selected and only a subset of input variables (again, randomly chosen) are used in the classification. The same scoring procedure for these input variables is used. Both the bagging and Random Forest algorithms combat overfitting by leveraging random selection as opposed to pruning. Conceptually, when randomly sampling data, the trained model will sometimes produce a good model and at other times it will not. By checking the quality of each decision tree model after each run, the algorithms ensure that consistently-appearing predictors from good models are carried into the complete bagging or Random Forest model and others are not. Due to randomly selecting a subset of the input variables in addition to a bootstrapped sample of the data, Random Forests have been shown to have a significant advantage over bagging alone [328].

A.1.3 Tuning Random Forest Parameters

Random forests have two tuning parameters that can be adjusted to obtain a reliable model: n_in – the number of input variables selected at random, and n_trees – the number of trees in the forest. A review of how n_in can affect the predictions of a Random Forest has been conducted by Svetnik et al. [101]. For a variety of choices of n_in, Svetnik et al. compared the error rates (fraction of false positives and false negatives) of predictions and found that for most choices of n_in, error rates were well maintained between 20% and 25%. For lower n_in, the model does not develop enough robust comparisons between different input variables to be reliable, leading to slightly elevated error rates. For higher n_in, the model overfits the training data, leading to higher scores for less important variables, which again produces slightly higher error rates. For a given number of input variables, N, Svetnik et al. suggested n_in include N/2, N/4, and √N, where each is rounded up or down to the nearest integer. They tested these suggested choices and found that all choices produce similar error rates even as the number of input variables is varied between 3 and 100. After 100 variables, the choice n_in = √N performs slightly better than the other choices, but only marginally so (2% difference in error rates).

The number of trees in the forest, n_trees, describes the number of times that the algorithm randomly selects data and input variables to perform the classification task. Here, more is better, but only up to a point, after which adding additional trees does not improve the classification. There is no penalty for running the algorithm many times besides wasting computational resources.
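A minimal sketch of how the suggested n_in (mtry) choices could be compared on held-out data, assuming a hypothetical data frame d with a two-level factor outcome y; it illustrates the comparison only and is not Svetnik et al.'s benchmark.

library(randomForest)

set.seed(3)
test_idx <- sample(nrow(d), size = floor(0.2 * nrow(d)))
train <- d[-test_idx, ]
test  <- d[test_idx, ]

n_vars <- ncol(d) - 1  # number of input variables
mtry_choices <- unique(round(c(n_vars / 2, n_vars / 4, sqrt(n_vars))))

# Misclassification rate on the held-out set for each choice of n_in
sapply(mtry_choices, function(m) {
  rf <- randomForest(y ~ ., data = train, mtry = m, ntree = 500)
  mean(predict(rf, newdata = test) != test$y)
})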
An estimate of the model's performance for a given number of trees is the Out-Of-Bag error (OOB error). When the algorithm selects data to train, it leaves some data “out of the bag.” For any given tree in the forest, we can predict the classifications for the data left out of the bag. The error rate associated with that prediction is an estimate of the Out-Of-Bag error for that tree. The estimate for the total OOB error is the average across all the trees. In their work, Svetnik et al. found that OOB error stabilized when n_trees > 10^2. For 3 orders of magnitude beyond that, the OOB error remained flat at 0.2; that is, they found a ∼20% average misclassification of data left out of the bag for any number of trees from 10^2 to 10^5.

                     Known Classes
                     Yes       No
Model      Yes       N_TP      N_FP
Predicts   No        N_FN      N_TN

Figure A.1: The confusion matrix counts the number of each predicted classification by the model and compares that to what the data indicates. In this case, a two-class system with binary classifications leads to a 2 x 2 matrix. For M classes, the matrix continues to be square and grows to be M x M.

It is common to determine the “best” parameters by using a grid search [329]. Here, a number of random forests are constructed with different combinations of n_in and n_trees to find the combination with the strongest validation scores. Typically, one performs a coarse-grain search allowing the tunable parameters to vary greatly. This is much like searching for the appropriate order of magnitude for each parameter. Once a reasonable range is found for each parameter, a finer-grain search is performed within the bounds determined from the coarse-grained search. This search can continue ad infinitum, but it is typically only done until reasonable estimates of the “best” parameters are found given the expected error and available computational time.

A.2 Validating the Random Forest model

A Random Forest will develop a model of data, but that does not always mean that model is meaningful. Moreover, because that model is developed from a random selection of data and input variables, individual trees in the model might be terrible predictors. By abiding by suggested parameter choices [101], one can be somewhat confident in the model. However, additional validation of the model can be conducted to provide evidence for that confidence. These metrics and the associated curves are developed from how well the model predicts classifications of the test data – that is, the data that was not used to train the model.

Figure A.2: (a) Sample receiver operating characteristic (ROC) curves that demonstrate two models: one that is better than chance (blue) and one that is worse than chance (green). These ROC curves are plotted along with the chance line (orange dotted). Models that are demonstrably better than chance have ROC curves that tend towards the upper-left corner of the space, as the arrow indicates. Models that are worse than chance tend towards the bottom-right corner. (b) For both models, the area under the ROC curves (AUC) is shown (blue and green shading) and computed. AUC provides a measure of the quality of the model. It is indicative of the probability of accurately classifying a random sample from the data.

A.2.1 Confusion Matrix

The simplest tool for understanding how well the Random Forest model predicts classifications in the new data set is the confusion matrix. Most other measures associated with the validity of the model are derived from the confusion matrix.
Conceptually, the confusion matrix keeps track of true positives (N_TP), true negatives (N_TN), false positives (N_FP), and false negatives (N_FN). The sum of all these measures is the total number of observations in the data set that is being tested (N_test):

N_test = N_TP + N_TN + N_FP + N_FN.    (A.1)

For a two-class system (e.g., Yes/No), these values can be organized into the 2 x 2 matrix where the columns describe the known classes in the data set and the rows describe the classes predicted by the model (Fig. A.1). The confusion matrix provides a quick check of the predictions of the Random Forest model. Essentially, a good model will have strong diagonal elements, that is, high numbers of true predictions, and small off-diagonal elements, that is, low numbers of false predictions.

A.2.2 Associated measures

From the confusion matrix, a number of associated measures may be derived. Here we provide those that are common to report in the Random Forest literature. Additional measures exist, but are not reported here.[1]

The accuracy of the model (ACC) is the fraction of true predictions compared to the total number of observations in the test data set,

ACC = (N_TP + N_TN) / N_test.    (A.2)

The accuracy of the model can vary between 0 and 1, with 0.5 being equal to chance predictions. A model that predicts worse than chance will have ACC < 0.5.

The sensitivity, or true positive rate (TPR), compares the number of predicted true positives to the total number of actual positives appearing in the test data,

TPR = N_TP / (N_TP + N_FN).    (A.3)

This rate varies between 0 and 1. For a good model of the data, we expect this number to be closer to 1.

The fall-out, or false positive rate (FPR), compares the number of predicted false positives to the total number of actual negatives in the data,

FPR = N_FP / (N_FP + N_TN).    (A.4)

This rate varies between 0 and 1. For a good model of the data, we expect this number to be closer to 0. Pure guessing (i.e., chance) would yield 0.5 for both of these values. Taken together, these values are plotted for a range of discrimination thresholds in a receiver operating characteristic curve, which indicates how much better (or worse) the model is than chance.

[1] In some publications, certain likelihood ratios (LR+ = TPR/FPR and LR− = FNR/TNR), odds ratios (LR+/LR−), and the F1 score are reported, but ACC, TPR, and FPR are the most commonly reported metrics.

A.2.3 Receiver operating characteristic curve

The receiver operating characteristic curve (ROC curve) provides a visualization of the quality of a binary model [330]. In it, TPR is plotted against FPR for a variety of discrimination thresholds. These thresholds vary from 0 to 1 and describe the probability above which an observation is placed into one class compared to another. Conceptually, the curve shows, at a given probability of classification, the expected rates of true positives compared to false positives. A sample ROC for mock data is plotted in Fig. A.2(a). The chance line, in which TPR and FPR are equal for all thresholds, is plotted in orange. The blue curve is the ROC curve for a model that is better than chance. The hump-shaped curve is common for good models, as this shape is indicative of TPR values above chance for all thresholds. On the other hand, the green curve is the ROC curve for a model that is worse than chance. This shape is characteristic as well, as it is indicative of TPR values below chance for all thresholds.
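The sketch below computes these quantities from hypothetical vectors truth (known classes) and prob (predicted probabilities of “Yes”), mirroring Eqs. (A.1)-(A.4); sweeping the threshold traces out the ROC curve just described.

confusion_metrics <- function(truth, prob, threshold = 0.5) {
  pred <- ifelse(prob >= threshold, "Yes", "No")
  n_tp <- sum(pred == "Yes" & truth == "Yes")
  n_tn <- sum(pred == "No"  & truth == "No")
  n_fp <- sum(pred == "Yes" & truth == "No")
  n_fn <- sum(pred == "No"  & truth == "Yes")
  c(ACC = (n_tp + n_tn) / length(truth),   # Eq. (A.2)
    TPR = n_tp / (n_tp + n_fn),            # Eq. (A.3)
    FPR = n_fp / (n_fp + n_tn))            # Eq. (A.4)
}

# Varying the discrimination threshold from 0 to 1 gives the points of the ROC curve
thresholds <- seq(0, 1, by = 0.05)
roc_points <- t(sapply(thresholds, function(th) confusion_metrics(truth, prob, th)))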
A quantitative measure of the quality of the model is the area under the ROC curve (AUC). This measure is visually represented in Fig. A.2(b), where the AUC for the better-than-chance model is indicated by the blue and green shading taken together. The AUC for the worse-than-chance model is indicated by the green shading alone. AUC is indicative of the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one. From Fig. A.2, the probability for the first model is approximately 0.71 while for the second it is 0.29. AUCs above 0.7 are typically considered reasonable models, with 0.8 and above considered to be good models [93]. A perfect classifier will have an area under the curve of 1 while the chance curve will have an area under the curve of 0.5.

A.3 Feature selection

One of the more useful aspects of machine learning, and of the Random Forest algorithm in particular, is the ability to determine which features are more important than other features in the data. For classification tasks, that means finding the input variables that consistently separate the results into classes. There is an analogy to regression analysis where the important input variables in a Random Forest classifier act as statistically significant correlates with the outcome variable. However, because Random Forests are not rooted in traditional statistical analysis, the important features do not arise from correlation – linear or otherwise.

Feature selection makes use of these important features to reduce the overall number of input variables needed to classify the data. This is similar to using regression models of increasing complexity to find the minimal model that explains the outcomes sufficiently. These important features can be used as the sole input variables and the resulting model can be validated using the techniques described in Sec. A.2. A good reduced model will maintain high accuracy, produce an ROC curve that is still above the chance line for all thresholds, and have an AUC that is similarly well above chance while using the minimum number of features.

To determine the importance of an input variable (termed “feature importance”), the standard Random Forest algorithm (CART) continuously compares how well each input variable in a single decision tree separates the data set into classes. For the CART algorithm, the measure of how well this occurs is either the Gini impurity or the information gain, depending on user selection and choice of tool. For the simplest implementation of the Random Forest classifier, the feature importance is related to the Gini impurity (I_G) [91], which is the total decrease in node impurity. I_G is computed for each input variable (node) and is then averaged for each input variable over all the trees in the forest. Conceptually, I_G for an input variable is the probability of the input variable showing up in a given class multiplied by the probability of a misclassification within that factor, summed over all classes [331]. Thus, the higher I_G for an input variable in a given tree, the less favorable a choice it is for splitting the tree. For example, a high-I_G input variable would not be selected as the input variable for the first branch in a given tree. For this implementation, important features are those which consistently produce the best splits for a large proportion of trees in the forest.
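A brief sketch of extracting Gini-based importances from a fitted forest, again assuming a hypothetical data frame d with a two-level factor outcome y; the randomForest package reports this quantity as the mean decrease in Gini impurity.

library(randomForest)

rf <- randomForest(y ~ ., data = d, ntree = 500, importance = TRUE)

# Total decrease in node (Gini) impurity, averaged over the trees in the forest
gini_imp <- importance(rf, type = 2)
gini_imp[order(gini_imp[, 1], decreasing = TRUE), , drop = FALSE]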
The feature importances are often distributed normally around some mean and are reported with an error that is inversely proportional to the square root of the number of trees in which the input variable was randomly selected for use.

A.4 Bias and improvements

While the CART algorithm and the associated Gini-based feature selection are commonly used in Random Forest classification, both are subject to biases that, for certain kinds of data (including those analyzed in this paper), can lead to inaccurate models. First, the Gini-based feature selection described above is not reliable when the input variables vary in scale of measurement or in the number of categories (possible responses) [92]. This is because variables with more categories can be split into two groups in more ways than variables with fewer categories can and hence, it is more likely a favorable split could be found. Furthermore, Gini-based feature importances can be biased if the variables are correlated [332].

In such cases, accuracy-based permutation variable importances can be used [243, 244]. Accuracy-based permutation variable importances are based on the idea that if an input variable is associated with the outcome variable, then permuting the input variable should break that association, and therefore, the accuracy (ACC) should decrease. The input variables that change the accuracy the most are then said to have the largest variable importance. However, these accuracy-based variable importances are biased when the sample is unbalanced, that is, when the categories in the predicted variable do not occur in equal frequencies [94]. Janitza et al. suggest modifying the accuracy-based permutation variable importances to be based on the AUC instead of accuracy because accuracy is biased toward the majority class while the AUC is not. When applying this modification, they found that the AUC-based permutation feature importances are better able to discriminate between variables that are good predictors and variables that are poor predictors than the accuracy-based permutation feature importances are when the sample is unbalanced. When the sample is balanced, they reported no significant difference between the two methods.

Because the Random Forest algorithm is based on classification and regression trees (CART) [309] and CART uses the Gini impurity to determine the split points, the Random Forest algorithm itself is biased for the same reasons the Gini-based feature importance is. This bias can be corrected using conditional inference forests based on the framework proposed by Hothorn et al. [92, 259]. The conditional inference forest algorithm breaks the determination of split points into two steps, unlike the Random Forest algorithm, which selects the input variable and its split point in a single step. First, the algorithm tests the global null hypothesis that there is no association between any of the input variables and the predicted variable at some predetermined level of significance. If the null hypothesis cannot be rejected, the algorithm terminates. If the null hypothesis can be rejected, the algorithm selects the input variable with the strongest association to the predicted variable as measured by its p-value. The algorithm then splits the input variable into the two groups that maximize a chosen test statistic. Finally, the algorithm returns to the first step, retests the global null hypothesis, and either terminates or continues with the next input variable with the highest association to the predicted variable.
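For reference, a minimal sketch of fitting a conditional inference forest and computing permutation importances with the party package; the AUC-based variant is provided as varimpAUC in recent versions of the package. The data frame d with a two-level factor outcome y is hypothetical, and the settings shown are illustrative, not those used elsewhere in this dissertation.

library(party)

# cforest_unbiased() uses subsampling without replacement, which is needed for
# the unbiasedness property discussed below
cf <- cforest(y ~ ., data = d,
              controls = cforest_unbiased(ntree = 500, mtry = 3))

vi_acc <- varimp(cf)     # accuracy-based permutation importance
vi_auc <- varimpAUC(cf)  # AUC-based variant, useful for unbalanced outcomes
sort(vi_auc, decreasing = TRUE)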
This revised version of the Random Forest algorithm has been found to be unbiased even when the variables vary in scale of measurement or in the number of categories, provided that bootstrapping is not used (subsampling must occur without replacement or the original biases are still present) [92].

A.4.1 Determining “significant” features

While we have detailed various improvements to variable selection, none of these methods can by themselves determine whether the variables are actually important, only how important they are with respect to each other. Various approaches to determine the actually important variables have been proposed, such as selecting the top 10% of the variables ordered by an importance measure [333], selecting all variables above the absolute value of the most negative importance value [328], using recursive backward elimination to select the fewest variables that result in an OOB rate within 1 standard error of the best OOB rate [97], generating a null distribution for each variable and then assigning a p-value based on the fraction of null importances greater than the actual importance value [292], and mirroring the distribution of negative importances to generate an overall null distribution for the importances and again assigning a p-value based on the fraction of null importances greater than the actual importance value [297]. Each of these approaches has its own benefits and problems, such as ease of implementation and the computation time required, and to our knowledge, there is no standard choice of procedure. As we had a limited number of negative importances and limited computational power, we used recursive backward elimination in this study. Since these “significant” factors are not determined by tests of statistical significance but rather by how much the model changes when they are removed, we refer to the selected factors as meaningful factors.

APPENDIX B

CHAPTER 3 ANALYSIS OF FEATURES

In this appendix, we describe the data used to answer research questions 2 and 3 to give the reader a better idea of the distributions of physics GRE scores and GPA in the data set. Because the data are skewed left and exhibit ceiling effects (many applicants have 4.0 GPAs or 990 physics GRE scores), quartiles are used to describe the various features. To maximize the amount of information shown about the data, we use raincloud plots [334, 335], which show the distribution, a density plot, and a traditional box plot. Kolmogorov-Smirnov tests suggest the distributions are not significantly different whether or not we include applicants who may have applied to multiple schools in our data set, so we include possible duplicates in our analysis.
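A small sketch of the kind of two-sample Kolmogorov-Smirnov check described above, assuming a hypothetical data frame apps with a physics GRE score column pgre and a logical flag possible_duplicate; it is illustrative only, not the analysis code.

# Compare the score distribution with and without possible duplicate applicants
with_dups    <- apps$pgre
without_dups <- apps$pgre[!apps$possible_duplicate]

ks.test(with_dups, without_dups)

# Quartiles, since the data are skewed and show ceiling effects
quantile(with_dups, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)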
Fig. B.1 shows the physics GRE scores and undergraduate GPAs of each applicant based on whether they attended a large undergraduate physics program (top 25% nationally in yearly physics bachelor's degrees) or attended a selective university (categorized as most competitive or highly competitive based on Barron's Selectivity Index).

Figure B.1: Distribution of physics GRE scores and undergraduate GPAs by the size of the undergraduate physics program and institutional selectivity for each applicant.

Figure B.2: Distribution of physics GRE scores and undergraduate GPAs by gender and whether the applicant identified as a member of a racial or ethnic group currently underrepresented in physics.

We notice that the physics GRE score distributions are shifted to the right for applicants from large physics departments or selective institutions, signifying higher scores. Indeed, the median physics GRE scores of applicants from large programs or selective institutions are nearly 100 points higher than those of applicants from smaller or less selective institutions. However, in terms of GPA, the median GPA is approximately the same, regardless of whether the applicant graduated from a larger or smaller physics department or attended a more or less selective institution.

Fig. B.2 shows the physics GRE scores and undergraduate GPAs by gender and race. As expected, men score higher on the physics GRE than women do, and Asian and white applicants score higher than Black, Latinx, Multiracial, or Native applicants, though the gaps appear larger than those reported in [52]. When comparing GPAs, we find that men and women have similar GPAs, as recently reported in [156] when comparing men and women's STEM GPAs. Likewise, our data also show a racial GPA gap, with non-B/L/M/N applicants having a median GPA higher than that of B/L/M/N applicants by 0.15.

When looking across both figures, we notice that the physics GRE score distributions of smaller and less selective programs resemble the physics GRE score distributions of women and B/L/M/N applicants, while the physics GRE score distributions of the largest and most selective programs resemble the physics GRE score distributions of men and non-B/L/M/N applicants. To see if gender and race are confounding variables in our analysis, we examined the fraction of women and B/L/M/N applicants in each group.
If this were the case, the smaller and less selective programs should have a greater fraction of women and B/L/M/N applicants than the larger and more selective programs. However, we did not find this to be the case. Applicants from more selective institutions were 16% women while applicants from less selective institutions were 18% women (15% and 14%, respectively, for B/L/M/N applicants). For institution size, applicants from larger institutions were 16% women compared to 21% women from smaller institutions (14% and 17%, respectively, for B/L/M/N applicants). Thus, it does not appear that differences in who attends (in terms of gender and race) larger or more selective institutions are responsible for the observed differences in scores.

APPENDIX C

CHAPTER 3 SUPPLEMENTAL FIGURES

In this appendix, we include plots that show institutional effects by gender and race. As we note in the discussion, the programs included in this study were actively trying to increase the diversity of their graduate programs. Thus, we are unable to determine from our data whether women were admitted at higher rates than men were, and B/L/M/N applicants were admitted at higher rates than non-B/L/M/N applicants were, because they stood out or because the admissions committees highlighted these applicants from the start. Therefore, we do not comment on any possible interactions. For completeness, the plots are shown here.

Figure C.1: Admission fractions of applicants split by their gender and the selectivity of their undergraduate institutions.

Figure C.2: Admission fractions of applicants split by their gender and the size of their undergraduate institutions.

Figure C.3: Admission fractions of applicants split by their race and the selectivity of their undergraduate institutions.

Figure C.4: Admission fractions of applicants split by their race and the size of their undergraduate institutions.

APPENDIX D

PHYSICS GRE SCORING PERCENTILES

The following data come from the GRE Subject Test Interpretive Data [1]. They are included here for readers who wish to see the scoring distribution of the test. Between July 1st, 2017 and June 30th, 2020, 19,955 students took the physics GRE. The average score was 717 with a standard deviation of 165. 23% of test-takers identified as women. Scaled physics GRE scores and the percent of test-takers scoring lower than each scaled score are shown in the table below. Blank cells mean there were no test takers scoring below that scaled score.

Table D.1: Scaled physics GRE score and percent of applicants scoring lower than that score.

Scaled score   Percentile
980            95
960            92
940            88
920            85
900            81
880            77
860            74
840            70
820            67
800            64
780            61
760            58
740            54
720            51
700            47
680            43
660            39
640            36
620            32
600            28
580            24
560            20
540            16
520            12
500            9
480            7
460            4
440            3
420            2
400            1
380            1
360
340
320
300
280
260
240
220
200

APPENDIX E

DEPARTMENT OF PHYSICS ADMISSIONS RUBRIC

This appendix includes the rubric used during the 2019 admissions cycle and defines the features used in Chapters 4 and 5.

Table E.1: Rubric used by our department for evaluating applicants. The criteria for a high, medium, and low score are shown.
Item: Academic Preparation (25%)

Physics Coursework
  High: GPA >= 3.7 (A-) in all core subjects: CM1&2, EM1&2, QM1&2, SM1; if not taken 2nd semester courses yet, are they planning on taking them?
  Medium: GPA >= 3.3 (B+) in all core: CM1&2, EM1&2, QM1&2, SM1; OR GPA >= 3.7 (A-) in CM1, EM1, QM1, SM1 if no 2nd semester courses taken
  Low: GPA >= 3.7 (A-) in EM1 and CM1; GPA >= 3.0 (B) average in other advanced courses; any grades < 2.7 (B-) without explanation

Math Coursework
  High: Real and Complex Analysis, Group Theory with GPA >= 3.5 (A) grades
  Medium: DiffEq, Linear, and a Math Methods course, all with >= 3.5 (A) grades; or more than this with GPA >= 3.0 (B or A) grades
  Low: Bare bones math prep (e.g., up to DiffEq), or low grades regularly on math

Other Coursework
  High: Consistently 3.5 (A) grades with nothing below a 2.5 (B-/C+)
  Medium: Consistently 3.0 (B) grades
  Low: One or more 2.5s (B-/C+)

Academic honors and/or recognitions
  High: multiple honors, e.g., Dept/University Honors; Phi Beta Kappa, etc.
  Medium: one academic award/recognition
  Low: No academic honors in college documented in the application

Item: Research (25%)

Variety/duration
  High: two years in research
  Medium: one year in research; only REUs
  Low: nothing more than coursework laboratories

Quality of work
  High: multiple indications of excellence
  Medium: clearly made significant contributions to the project
  Low: limited intellectual or technical contribution to projects; “button pusher”

Technical skills
  High: a variety of experiment, theory, and/or computational skills
  Medium: has developed only one class of skill (exp or theory or comp)
  Low: nothing more than coursework laboratories

Dispositions
  High: clear commitment to and enthusiasm for research; AND understands what the process entails
  Medium: clear commitment to and enthusiasm for research; OR understands what the process entails
  Low: not clear if they know what they are getting into with a PhD; seems lukewarm about research

Item: Non-Cognitive Competencies (25%)

Achievement Orientation
  High: Consistently strives to improve or meet a high standard of excellence in all areas
  Medium: Has demonstrated a high standard of excellence in selected areas
  Low: No evidence of striving for excellence provided in application or student record

Conscientiousness
  High: Takes responsibility for personal performance, both the good and the bad; AND demonstrates efficiency and organization
  Medium: Takes responsibility for personal performance, both the good and the bad; OR demonstrates efficiency and organization
  Low: No evidence of taking responsibility for performance AND minimal evidence of efficient, organized work

Initiative
  High: Consistently seeks out or acts on opportunities AND takes leadership
  Medium: Consistently seeks out or acts on opportunities OR takes leadership
  Low: Has not sought out or taken advantage of opportunities AND does not have a record of leadership

Perseverance
  High: Application clearly describes successful coping with failures/obstacles
  Medium: Basic or perfunctory description of overcoming challenges
  Low: Application does not describe experience with failure/obstacles

Item: Fit with program (15%)

Research
  High: research interests align with multiple faculty in multiple subfields
  Medium: research interests align with multiple faculty in one subfield
  Low: limited alignment between student interests and faculty expertise

Faculty
  High: someone wants to hire as RA now and/or there is a clear fit with current faculty expertise
  Medium: someone could supervise, but interests do not directly support a faculty member's work
  Low: faculty aligned with applicant's interests are not seeking students

Community
  High: has clearly contributed positively to prior department/school culture, and would do the same for our program
  Medium: some evidence of participating in service activities
  Low: applicant only discusses him/herself; no evidence of engagement in department or university activities

Diversity
  High: applicant has been an active advocate for diversity in physics
  Medium: belongs to an underrepresented identity group; first generation in college or low SES; and/or contributes to another type of diversity the department seeks
  Low: contributions to diversity are unclear from the application

Item: GRE Scores (10%)

General GRE
  High: Verbal (V) and Quantitative (Q) scores >= 75% (or 157 for V and 160 for Q) AND Analytical Writing (AW) >= 4.0
  Medium: V & Q scores >= 75% (or 157 for V and 160 for Q) BUT AW < 4.0
  Low: V or Q score < 75% and AW < 4.0

Physics GRE
  High: >= 75%
  Medium: 50-74%
  Low: < 49%

APPENDIX F

CHAPTER 4 SUPPLEMENTAL FIGURES

Here we present figures showing the results split by admission status and gender, undergraduate institution selectivity, and undergraduate physics program size. Given the relatively small sample sizes, we did not conduct tests of statistical significance.

Figure F.1: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant was male or female and whether they were admitted or not. Panels: (A) rated, admitted male applicants (N=101); (B) rated, admitted female applicants (N=36); (C) rated, not-admitted male applicants (N=151); (D) rated, not-admitted female applicants (N=31).
Figure F.2: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a more selective or less selective undergraduate university and whether they were admitted or not. Panels: (A) rated, admitted applicants from selective undergraduate programs (N=50); (B) rated, admitted applicants from non-selective undergraduate programs (N=47); (C) rated, not admitted applicants from selective undergraduate programs (N=58); (D) rated, not admitted applicants from non-selective undergraduate programs (N=57).
Figure F.3: Faculty ratings of domestic applicants on 18 constructs split by whether the applicant attended a university with a larger or smaller physics program and whether they were admitted or not. Panels: (A) rated, admitted applicants from large undergraduate programs (N=97); (B) rated, admitted applicants from small undergraduate programs (N=34); (C) rated, non-admitted applicants from large undergraduate programs (N=106); (D) rated, non-admitted applicants from small undergraduate programs (N=72).

APPENDIX G

CHAPTER 6 SUPPLEMENTAL FIGURES

Here we provide the plots for the additional three data sets from Sec. 6.5 for completeness. The plots show the same general results as discussed in the main manuscript. We also include the residual plots from the discussion.

Figure G.1: Comparison of the odds ratio, Gini importance, and AUC-permutation importance for the features in the school 2 admit data set.

Figure G.2: Comparison of the odds ratio, Gini importance, and AUC-permutation importance for the features in the school 3 shortlist data set.

Figure G.3: Comparison of the odds ratio, Gini importance, and AUC-permutation importance for the features in the school 3 admit data set.
[Figure G.4 consists of four panels, one each for the School 1, School 2, School 3 Admit, and School 3 Shortlist data sets; each panel plots the average residual in each bin against the fitted log odds.]

Figure G.4: Plots of the log odds vs. the average residual in each bin for the four data sets. Across all plots, between 20% and 34% of the points fall outside of the confidence intervals, suggesting the logistic regression models might not be fitting the data especially well.
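As a rough illustration of the diagnostic summarized in Figure G.4, the following self-contained R sketch, on synthetic data rather than the models fitted here, bins observations by fitted log odds, plots the average raw residual per bin with approximate 95% bounds via arm::binnedplot, and computes the fraction of bins whose average residual falls outside those bounds. The binning scheme and the two-standard-error bounds are simplifying assumptions.

library(arm)  # binnedplot()

set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 + 0.7 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

log_odds  <- qlogis(fitted(fit))   # fitted log odds
raw_resid <- y - fitted(fit)       # observed outcome minus predicted probability

# Binned residual plot: average residual per bin with approximate 95% bounds
binnedplot(log_odds, raw_resid, xlab = "Log odds", ylab = "Average residual")

# Fraction of bins whose average residual falls outside +/- 2 standard errors
nbins <- 20
bins  <- cut(log_odds,
             breaks = quantile(log_odds, seq(0, 1, length.out = nbins + 1)),
             include.lowest = TRUE)
avg_resid <- tapply(raw_resid, bins, mean)
bound     <- tapply(raw_resid, bins, function(r) 2 * sd(r) / sqrt(length(r)))
mean(abs(avg_resid) > bound)       # analogous to the 20%-34% reported for Fig. G.4

A well-calibrated model would leave most binned averages inside the bounds, which is why the 20%-34% figure is read as evidence of imperfect fit.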
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests. Psychological methods, 14(4):323–348, December 2009. [329] James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012. [330] Tom Fawcett. An introduction to ROC analysis. Pattern recognition letters, 27(8):861–874, 2006. Publisher: Elsevier. [331] Matthew Kirk. Thoughtful machine learning: A test-driven approach. " O’Reilly Media, Inc.", 2014. 227 [332] Kristin K. Nicodemus and James D. Malley. Predictor correlation impacts machine learn- ing algorithms: implications for genomic studies. Bioinformatics (Oxford, England), 25(15):1884–1890, August 2009. [333] D. M. Reif, A. A. Motsinger-Reif, B. A. McKinney, M. T. Rock, J. E. Crowe, and J. H. Moore. Integrated analysis of genetic and proteomic data identifies biomarkers associated with adverse events following smallpox vaccination. Genes and Immunity, 10(2):112–119, March 2009. [334] Micah Allen, Davide Poggiali, Kirstie Whitaker, Tom Rhys Marshall, and Rogier A. Kievit. Raincloud plots: a multi-platform tool for robust data visualization. Wellcome Open Re- search, 4:63, April 2019. [335] Kirstie Whitaker, Tom Rhys Marshall, Tim Van Mourik, Paula Andrea Martinez, Davide Poggiali, Hao Ye, and Marius Klug. RainCloudPlots/RainCloudPlots: WellcomeOpenRe- search, August 2019. 228