EXTENDED ENSEMBLE ESTIMATION: A TOOL FOR SENSITIVITY ANALYSIS FOR RESEARCHERS

By

Jordan Tait

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods – Doctor of Philosophy

2022

ABSTRACT

Once research questions are posed, researchers must answer many a priori questions regarding research design before analysis can be performed and any conclusions can be made, including sample selection criteria, data collection method, model specification, and analysis and estimation technique. The choices made by researchers along this forking path of possible specifications, such as model specification and estimation technique, can lead to varying results. On one hand, researchers who are not able to identify the best answers to these questions are faced with a list of plausible specifications and inherent uncertainty in the resulting model estimates, and are tasked with deciding how to best balance alternative specifications. On the other hand, researchers suffer from an "embarrassment of riches" in computational capacity, in that they have more computational power than what is reflected in most journal articles, and the number of alternative analyses researchers could perform has expanded dramatically (Young, 2018). Although it is common to choose one set of specifications and report the resulting estimates in the absence of other specifications, this research proposes a framework called extended ensemble estimation that utilizes the alternative specifications to quantify and visualize the sensitivity of an estimated treatment effect. This paper also proposes a method to combine the estimated treatment effects across specifications into a single estimated treatment effect, weighted by precision. Along with the proposed methodology, this work contains a best practice guide for users in order to best understand the sensitivity of an estimated treatment effect within the extended ensemble estimation framework, a proposed method to update an estimated treatment effect by utilizing alternative specifications, simulated performance within common covariance structures, and a case study application regarding the effects of kindergarten retention on math and reading performance.

Copyright by
JORDAN TAIT
2022

ACKNOWLEDGEMENTS

Though rightfully one of the most challenging and taxing journeys I have endured, I can wholeheartedly say that I do not believe I would be at this point in my life without all of the support I have received from professors, family, friends, and colleagues. It is at this moment that I would like to truly say how thankful I am for having them in my life and how crucial they were throughout my journey, from start to finish.

I would like to thank my fiancée, Christine Pacewicz. Your unconditional love and consistent support gave me the confidence to persist through the toughest moments and reminded me to celebrate the best moments. You always believed in me and that meant more to me than I can ever accurately express. I could not have accomplished this without you.

I would like to give my deepest recognition and thanks to my advisor, Dr. Ken Frank. While professors are not all created equal and few are greater than the sum of their parts, you have been one of the most positively influential people in one of the most challenging times in my life. You routinely showed me how to not only perform good research, but how to be a better person.
As I look forward to mentoring many students throughout my career, I strive to leave such a lasting impression on them as you have left on me. Thank you for always believing in me and for showing me what it means to not only be an excellent advisor and role model, but how to be genuine and good to others.

I would also like to thank my committee, Dr. Jeffery Wooldridge, Dr. Kimberly Kelly, and Dr. Kaitlin Knake. Although it is through their expertise that my research has risen to this point, I can safely say they have all contributed more to my journey than their expertise alone. The many semesters of econometrics with Dr. Wooldridge have been some of the most challenging and rewarding experiences during my time at Michigan State, and experiences that I routinely reference. The opportunity to learn about machine learning under the guidance of Dr. Kelly proved to be a valuable experience for my dissertation, and it is through her support that I garnered the confidence to pursue computational research. With many prior years of teaching experience, I may have experienced the most growth during my time working with Dr. Kaitlin Knake on Teachers in Social Media. The opportunities and experiences Dr. Knake provided helped me grow into a well-rounded professional and mentor for my future students. I cannot thank my committee members enough for their contributions.

Finally, I would like to give a special thanks to my friends and family, especially my parents, Allan Tait and Ginger Tait, and my brothers, Jon Tait and Jarrad Tait. You have always provided me with unconditional support and, through your actions, always served as a reminder of how big an influence a loving family can have. I would not have made it this far if it weren't for you all, and for that, I want to thank you for always being there for me.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Extended Ensemble Estimation
1.3 Literature Review
1.4 Ties to Sensitivity Analysis
1.5 Goals of Extended Ensemble Estimation
1.6 Summary of findings
1.7 Structure of Study

CHAPTER 2 GENERAL METHODOLOGY FOR EXTENDED ENSEMBLE ESTIMATION
2.1 General Framework of Extended Ensemble Estimation
2.2 Estimation Techniques
2.3 The Extended Ensemble Estimate, Weighting for Precision
CHAPTER 3 UPDATING REGRESSION COEFFICIENTS USING EXTENDED ENSEMBLE ESTIMATION
3.1 General Framework for Updating
3.2 Ties to Empirical Bayes Methodology
3.3 Choosing a Weighting Scheme for Updating Regression Coefficients

CHAPTER 4 USER GUIDE FOR BEST PRACTICES WHEN IMPLEMENTING EXTENDED ENSEMBLE ESTIMATION
4.1 Introduction to Best Practices
4.2 Alternative Specifications
4.3 Estimation Techniques
4.4 P-hacking and Cherry Picking

CHAPTER 5 SIMULATIONS
5.1 How to Use Simulation to Inform Extended Ensemble Estimation
5.2 Pre-treatment and Confounding Variables
5.3 Instrumental Variables
5.4 Randomized Control Trials
5.5 Accounting for Selection Bias

CHAPTER 6 CASE STUDY – EFFECTS OF KINDERGARTEN RETENTION ON CHILDREN'S COGNITIVE GROWTH IN READING AND MATHEMATICS
6.1 Introduction
6.2 Description of Case Study
6.3 Extended Ensemble Estimation
6.4 Discussion

BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION

1.1 Motivation

The process of using quantitative methods to answer research questions is multifaceted, requiring due diligence. Once testable hypotheses have been generated, often leveraging existing theory and past research, an encompassing research design must be formulated in order to provide a blueprint that details critical components such as sample selection, measurement instruments, model selection, and estimation technique (De Vaus, 2001). Together, the various choices that could be made at each step of the research design process create model uncertainty (Young, 2018).
That is to say, there is not necessarily a single correct sequence of choices to be made by a researcher for a given study regarding research design. Gelman and Loken (2014) describe this process as a "garden of forking paths," where various sampling methods, models, and estimation techniques could result in thousands of possible specifications. In the complete absence of nefarious motives, we may be unable to identify the most appropriate specifications, resulting in a level of uncertainty from the list of seemingly plausible specifications. On the other hand, Brodeur et al. (2016) claim that researchers have incentives to find statistically significant results, thus favoring specifications that result in statistical significance. In either case, researchers alike seem to be aware that the choices they make at the design level may lead to varying results.

In the case of model selection, omitted variable bias is one common potential issue researchers aim to address. For the Ordinary Least Squares (OLS) estimator, omitted variable bias constitutes a violation of the assumptions of OLS. This particular violation prevents the OLS estimator from converging in probability to the true parameter, causing the OLS estimator to be biased and inconsistent. For example, if an omitted variable is positively correlated with a regressor and the dependent variable, the resulting OLS estimate of the aforementioned regressor will be inflated (Clark, 2005; Greene, 2003; Wooldridge, 2009). If focus were directed toward the estimated effect of a single covariate, e.g., a treatment variable in a randomized control trial, the treatment variable would be the regressor in question regarding omitted variable bias, and its estimate and standard error would come under scrutiny among authors, fellow researchers, and peers in the peer review process.

Of course, model selection is one choice of many that a researcher must make that may directly change the resulting estimate of a regressor of interest. Research on the effects of school suspensions by Craigie (2022) discusses this issue directly, stating:

A pointed evaluation of insubordination/disrespect and student-aimed obscene language infraction does not indicate statistically significant changes in Out of School Suspension (OSS) outcomes in response to the second reform. However, when disorderly conduct is differentiated from the other disruption-specific infractions, the impact of the second reform on OSS duration is now statistically different from zero, increasing the average OSS duration by 0.17 days.

This transparency from Craigie (2022) regarding a common issue is reflective of conversations between authors, researchers, and peer reviewers. In the above case, two different model specifications resulted in two different values for the estimated effect of the treatment variable of interest, ultimately resulting in two different conclusions. While the issue of deciding which specification is most appropriate may seem apparent, the act of choosing one specification and ultimately disregarding the rest of the pool of plausible specifications poses an immediate follow-up issue. As pointed out earlier, the pool of plausible specifications can grow rather quickly, exacerbating the issue of choosing the "right" specification. In the case of published research, one estimate and conclusion are often chosen from the pool of plausible estimates to be reported, in the absence of other estimates. This essentially masks the model uncertainty.
That is, once a specification is made and an estimate is reported, it is not known whether the estimated effect varied wildly across model specifications or remained relatively constant. Specifically, the sensitivity of the estimated effect across model specifications is not frequently reported. What if the original inference does not hold in one of the plausible specifications? What if the specification that overturns the original inference was associated with a relatively large standard error? Does the estimated effect of a treatment variable remain constant across plausible specifications? If the estimated effect of interest varies, are there commonalities among specifications that contribute to variability in estimates? Instead of creating a sharp precipice around a single decision regarding specification, this study proposes a method to utilize the variability of estimates and standard errors, resulting from plausible specifications and estimation techniques, in order to better understand the effect model uncertainty has on the estimated effect of a treatment variable. Visualizing and quantifying the uncertainty across the estimated effects of a treatment variable creates a framework that could stand to promote discourse regarding robustness and sensitivity.

1.2 Extended Ensemble Estimation

In a broader sense, ensemble estimation increases transparency between authors and potentially skeptical readers who want to see more than the author's preferred results (Young, 2018). Extended Ensemble Estimation does this by expanding what estimates are provided to the reader, bridging the gap between the author's research efforts and a skeptic wondering what would have happened under a plausible alternative specification differing from that of the author. More narrowly, Extended Ensemble Estimation is a method of analysis that aims to address the issues of model uncertainty during the quantitative research process, specifically regarding the estimated effect of a treatment variable. Extended Ensemble Estimation is, in part, the process of visualizing, quantifying, and utilizing the uncertainty of the estimated effects of a treatment variable that may arise due to specifications made during the quantitative research process, such as, but not limited to, model specification, estimation technique, and weighting scheme.

Extended Ensemble Estimation occurs after initial analysis has been executed and an initial estimated effect and corresponding standard error for the treatment variable have been reported, as a result of specifications that have been discussed, vetted, and supported through past research, findings, and literature. After the original estimated effect has been documented, a pool of plausible, alternative specifications is required to achieve alternate estimated effects and standard errors of the treatment variable. The specifications may include, but are not limited to, alternative covariate selection, change in estimation technique, or choice of standard error calculation. The quality of the pool of alternative, plausible specifications has been well documented to be a main driver of the quality of ensemble estimation in general, and carefully selected specifications can improve results (Saez-Rodriguez et al., 2016). Specifically, the quality of the alternative models plays a large role in the robustness of model averaging approaches, while increasing the number of alternative reasonable models plays a lesser role (Stumpf, 2021).
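To make the pool of alternative specifications concrete, the following minimal sketch (in Python) illustrates one way this first phase could be organized: fit the same treatment-effect model under every candidate control set and record the estimate and standard error for the treatment variable in each specification. The data frame df, the outcome y, the treatment variable treat, and the candidate controls x1 through x3 are purely illustrative placeholders, not variables from any study discussed in this work.

import itertools
import pandas as pd
import statsmodels.formula.api as smf

def collect_specifications(df, outcome="y", treatment="treat",
                           candidate_controls=("x1", "x2", "x3")):
    # Fit one OLS model per candidate control set and record the treatment
    # coefficient and its standard error for each specification.
    rows = []
    for k in range(len(candidate_controls) + 1):
        for controls in itertools.combinations(candidate_controls, k):
            rhs = " + ".join((treatment,) + controls)
            fit = smf.ols(f"{outcome} ~ {rhs}", data=df).fit()
            rows.append({"controls": controls,
                         "estimate": fit.params[treatment],
                         "std_error": fit.bse[treatment]})
    return pd.DataFrame(rows)

Each row of the resulting table is one specification's estimated treatment effect and standard error; the same loop could be extended to vary subsamples or estimation techniques, which is the form the later chapters assume.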
1.3 Literature Review

Methods that focus on combining numerous models have a relatively long history in various fields, such as computer science, Bayesian statistics, and econometrics (Laan et al., 2017). Statistical learning techniques, such as deep neural networks, have risen in popularity in tandem with the rise in computational power (Su & Chen, 2015), which Young (2018) described as an "embarrassment of riches." Boosting, an ensemble algorithm based in machine learning, shares history with older machine learning techniques that fall under the supervised learning family of algorithms that aim to reduce bias and variance (Breiman, 1996). Bayesian model averaging has roots in theoretical statistics, following the basis of Bayesian statistics in order to achieve similar goals of combining multiple models into a better, single predictive model (Raftery et al., 1997; Raftery et al., 2005; Wasserman, 2000).

1.3.1 Model Averaging in Computer Science

In many computer science-based methods, such as Deep Neural Networks, the main goal is prediction of a dependent variable or multiple dependent variables (Hofman et al., 2021; Murphy, 2012; Wasserman, 2000). Thus, measures of performance rely on and center around the prediction of the dependent variable, such as accuracy, precision, recall, MSE, RMSE, true positive rate, false positive rate, and F1 score. The pool of parameters used to best predict the dependent variable is derived from minimization/maximization techniques, such as gradient descent. In many computer science-based methods, the dataset in question is split into two subsets: a training set and a test set. Gradient descent is applied on the training set to determine the best subset of parameters, along with any possible combination of interactions and layers, that best predicts the dependent variable by minimizing a defined loss function. Once the best parameters, layers, and interactions are determined, the model is used to predict values of the dependent variable within the test set, and these are compared to the actual values of the dependent variable within the test set. The measures of performance are derived directly from these values: the predicted values and the actual values of the dependent variable within the test set.

By construction, a model with an arbitrary number of interaction terms and layers, which may be riddled with unknown dependencies, makes inferential statistics regarding the parameters extremely difficult at best and nearly impossible at worst. That is, it is often not possible to understand the strength of relationships between parameters, to determine which parameters are statistically significant, or to obtain confidence intervals for parameters. Extended Ensemble Estimation is narrower and carefully constructed to focus on inference for a single parameter of interest. Quantifying the sensitivity or robustness of a single treatment effect of interest in order to better serve the conversation around causality is the main objective of Extended Ensemble Estimation.

1.3.2 Bayesian Model Averaging and Model Selection

Wasserman (2000) details Bayesian methods of comparing model performance, as well as averaging predictions from several models. Similar to many methods grounded in computer science, the goal is prediction of a dependent variable and performance is measured at a model level. Wasserman (2000) points out that many Bayesian methods involve the computation of posterior distributions, which are heavily reliant on prior distribution selection.
Although Robust Bayesian methods (Berger, 1990; Berger and Delampady, 1987), which focus on a set of priors as opposed to a single prior, are in the same spirit as Extended Ensemble Estimation, they do not discuss sensitivity or robustness in terms of estimation technique or model selection. Raftery et al. (2005) discuss two approaches to Bayesian model averaging in order to account for model uncertainty. The first approach is to apply Occam's window algorithm (Madigan and Raftery, 1994) to linear regression models, where models are selected based on their ability to predict. Namely, if a model predicts particular data poorly, it is not considered. On the other hand, models with high posterior probabilities are kept for model averaging for the goal of prediction. The second approach is to apply the Markov Chain Monte Carlo model composition of Madigan et al. (1995) by considering a pool of models, plus all models with either one covariate fewer or one extra covariate than those in the pool of models. In this approach, as well as the previous approach, uncertainty due to estimation technique and sensitivity in model selection are not quantified or addressed.

1.3.3 Model Averaging and Ensemble Estimation in Economics

Belloni et al. (2016) explored the problem of generalized linear models in the presence of a pool of possible controls while examining a single effect of interest, which resulted in a method that allows for the estimation of a single parameter of interest that is robust to model selection mistakes regarding control selection. Their method uses a three-step approach: estimating the part of the regression function associated with the controls via post model selection, estimating an optimal instrument via post model selection, and combining these two steps to establish estimating equations that are robust against crude estimation of nuisance functions. One benefit of this technique is the established √n-consistency and asymptotic normality of estimators under high level conditions on nuisance parameters. In a sense, Belloni et al. (2016) proposed a static solution to the covariate selection problem that achieves desirable properties under certain conditions. Similar to the previous approaches, this approach operates in the absence of uncertainty due to estimation technique and does not describe or quantify sensitivity or robustness of specification choices.

1.4 Ties to Sensitivity Analysis

Uncertainty is inherent within statistical inference. Hypothesis testing and significance testing rely on being able to quantify uncertainty in order to make a decision regarding a null hypothesis. Even in the extreme case of randomized control trials, researchers are not able to make deterministic claims regarding treatment effects due to inherent levels of uncertainty. While non-experimental data is often easier to obtain, it is often very difficult or impossible to disentangle correlation from causation, as Holland (1986) points out. On both ends of the spectrum, well-controlled experiments and observational studies both suffer from the inability to confidently control for every possible alternative explanation or to account for every possible confounding effect. Instead of using dichotomous decision making, as is the case with statistical significance, sensitivity analysis takes a more general approach to determine how much conditions must change in order to change the statistical inference at hand.
If the original inference has been rejected based on the conducted hypothesis test, what alternative specifications could lead to a failure to reject the inference, and how similar or different are those alternative specifications from the original specifications? If the same conclusion is made regarding the original inference under alternative specifications, the original inference is said to be robust and may be evidence of a causal relationship. If the original conclusion changes in the presence of an alternative specification, the original inference is said to be sensitive to specification, to the degree to which the alternative specification is similar or different to the original specification. Instead of running into a dead end in the research process by not including particular covariates, the conversation of a potential causal relation can continue by answering the question "How does the estimated effect change under alternative specifications?"

Frank (2000) developed an index that measures the required impact a potential confounding variable would need in order to change the original inference. This process centers around correlations of independent variables, dependent variables, and a posited confounding variable, since hypothesis tests for regression coefficients are equivalent to those of correlations (Cohen, West & Aiken, 2014). The required impact a confounding variable would need in order to change the original inference is given by a simple expression:

\[ k^{\#} = \frac{r_{xy} - r^{\#}}{1 - r^{\#}} \]

In this expression for the threshold for the required impact, $k^{\#}$, the correlation of x and y is given by $r_{xy}$, while $r^{\#}$ denotes the threshold for statistical significance. Frank (2000) also extends this expression to account for additional covariates, g, with the follow-up expression

\[ k^{\#} = \frac{r_{xy \cdot g} - r^{\#}}{1 - r^{\#}} \sqrt{\left(1 - R^{2}_{x \cdot g}\right)\left(1 - R^{2}_{y \cdot g}\right)} \]

In this expression, $R_{x \cdot g}$ and $R_{y \cdot g}$ are the multiple correlations between x and g and between y and g, respectively, while $r_{xy \cdot g}$ is the partial correlation of x and y given g. Frank (2000) argues that quantifying this sensitivity allows the researcher to respond to critiques concerned about lack of control covariates by quantifying how large of an impact the missing covariates must impart in order to change the inference.

Frank (2013) introduces a measure of the bias needed in order to change an inference, namely

\[ \frac{\hat{\beta} - \beta^{\#}}{\hat{\beta}} = 1 - \frac{\beta^{\#}}{\hat{\beta}} \]

where $\beta^{\#}$ is the threshold effect for statistical significance and $\hat{\beta}$ is the estimated effect. In this formulation, this represents the proportion of bias necessary to invalidate an inference. Using a case-replacement framework, this answers the question of how many cases one would need to replace in the data with counterfactual, zero-effect cases in order to change the inference at hand. Frank points out that this method of measuring sensitivity, by expressing sensitivity in terms of the units of observation instead of variables, is more appealing than other forms of sensitivity analysis. For example, in the context of schools, language in terms of students and schools may be more appealing and may facilitate conversations surrounding causality more effectively than language centered around technical details.
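As a purely hypothetical numerical illustration of the proportion-of-bias measure (the numbers below are invented for exposition and are not drawn from any study cited in this work), suppose the estimated effect is $\hat{\beta} = 0.50$ and the threshold for statistical significance is $\beta^{\#} = 0.30$. Then

\[ 1 - \frac{\beta^{\#}}{\hat{\beta}} = 1 - \frac{0.30}{0.50} = 0.40 \]

so roughly 40 percent of the estimate would have to be due to bias, or, in the case-replacement framing, roughly 40 percent of the cases would have to be replaced with zero-effect counterfactual cases, in order to invalidate the inference.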
Work by Emily Oster (2019) examines sensitivity analysis through the lens of selection on observables and unobservables in order to quantify changes in estimated treatment effects and R-squared. In order to correct for biased treatment effects, perhaps due to omitted variables, Oster offers a value, $\delta$, that captures the relative importance of unobservables compared to observables that would be required to invalidate the inference, called the coefficient of proportionality. Namely,

\[ \delta \, \frac{\sigma_{1X}}{\sigma^{2}_{1}} = \frac{\sigma_{2X}}{\sigma^{2}_{2}} \]

where $\sigma_{1X}$ is the covariance between treatment and the observables, $\sigma_{2X}$ is the covariance between treatment and the unobservables, $\sigma^{2}_{1}$ is the variance of the observables, and $\sigma^{2}_{2}$ is the variance of the unobservables. Equal selection, where both are equally important, would be represented by $\delta = 1$. One working assumption is that the relationship between treatment and unobservables can be recovered from the relationship between treatment and observables. Oster notes that although coefficient stability is related to the coefficient of proportionality, it is possible for coefficients to be completely unchanged in the presence of large bias.

Although these forms of sensitivity analysis address issues of bias from unobserved sources, such as omitted variables, they do not account for variation in estimation technique or the precision of the estimated treatment effects across alternative specifications. That is, creating an estimated treatment effect weighted by precision across various alternative specifications, including estimation technique and model selection, remains largely unexplored.

Specifications within the analysis phase fall into one of two categories. The first case is where it is impossible to achieve the desired re-specification. Examples of this type include cases where researchers may not have access to an omitted variable or are not aware of such variables. Even in perfectly controlled experiments that are typically thought of as "gold standards," the search for possible confounding variables never ceases, even when random assignment is possible (Cook, 2002). The second category is where it is possible to examine alternative specifications. These scenarios include, but are not limited to, the ability to change model specification, to change estimation technique, and to consider various subsamples. While the work by Frank (2000, 2013) and Oster (2019) addresses scenarios in the first case, regarding unobservables and counterfactuals, Extended Ensemble Estimation addresses the second case, where changes in specification are possible.

1.5 Goals of Extended Ensemble Estimation

The main objective of Extended Ensemble Estimation is to provide a broader picture in terms of the potential causal relationship between a treatment and outcome by quantifying the robustness and sensitivity of an estimated treatment effect relative to alternative specifications. In order to achieve this objective, there are three primary goals of Extended Ensemble Estimation.

1) Compare the original estimated treatment effect to the estimated treatment effect under plausible alternative specifications.

2) Form a distribution of estimated treatment effects from the plausible alternative specifications, using the shape, center, and spread to quantify robustness or sensitivity.

3) Combine the estimated treatment effects from the original specification and plausible alternative specifications, based on precision, to achieve a single estimated treatment effect, called the Extended Ensemble Estimate.

Accomplishing these goals gives the opportunity to observe how the estimated treatment effect changes in the presence of alternative covariate selection, alternative estimation methods, subsample selection, or other alternative specifications.
By visualizing the changes in estimated treatment effects with an empirical distribution and quantifying the robustness or sensitivity, Extended Ensemble Estimation provides answers to researchers or readers who may question what would happen to the estimated treatment effect under different circumstances than those chosen by the authors.

1.6 Summary of findings

This study proposes a within-study procedure to utilize estimated treatment effects from plausible alternative specifications to better understand the potential causal relationship between a treatment and outcome. The combined and precision-weighted estimate provided by the Extended Ensemble Estimation framework is used in tandem with the empirical distribution of estimates from alternative specifications in order to promote discourse surrounding the potential causality of the estimated treatment effect. While Belloni et al. (2016) propose an estimator that addresses control selection, Extended Ensemble Estimation is able to take other specification choices into account, such as estimation technique. In order to best serve the discourse around the estimated effect of a parameter of interest, the empirical distribution of estimates can be further extended to include other specifications, such as the estimated effect produced by the method of Belloni et al. (2016). By removing potential subjectivity regarding specification, extended ensemble estimation also provides a framework that serves as a safeguard against common statistical pitfalls such as p-hacking and cherry picking.

1.7 Structure of Study

In the next chapter, I will detail the general methodology for Extended Ensemble Estimation, where I will discuss the roles of estimation techniques in general, the distribution of estimated treatment effects, measures of central tendency, and standard errors relating to the estimated treatment effects. Ordinary Least Squares and Instrumental Variables will be compared and contrasted in terms of their application within Extended Ensemble Estimation. The chapter will end with how to incorporate the precision of estimated treatment effects. In Chapter 3, I will discuss a method to update an estimated treatment effect using Extended Ensemble Estimation. I will talk about the general approach to updating an estimated treatment effect, ties to empirical Bayesian methods, and the sensitivity regarding the selection of weighting scheme. Next, Chapter 4 will serve as a guide of best practices during the Extended Ensemble Estimation process. I will discuss how the end user may proceed to best serve the conversation around a potential causal treatment effect while avoiding common statistical pitfalls such as cherry-picking results and p-hacking. Chapter 5 will focus on using simulation to observe how Extended Ensemble Estimation performs under various conditions, including sensitivity regarding sample size, within a Randomized Control Trial ANCOVA design, in the presence of a strong and weak pre-test, and in the presence of a strong and weak instrumental variable. Finally, Chapter 6 will detail the use of Extended Ensemble Estimation through a case study regarding kindergarten retention and work by Hong and Raudenbush (2005).
I will use Extended Ensemble Estimation to compare the estimated effects of kindergarten retention on students' reading and math scale scores reported by Hong and Raudenbush with estimated treatment effects under various alternative specifications, in order to serve the conversation regarding the potential causal effect of retaining kindergarten students on reading and math scores.

CHAPTER 2 GENERAL METHODOLOGY FOR EXTENDED ENSEMBLE ESTIMATION

2.1 General Framework of Extended Ensemble Estimation

This section will lay out the components involved in using extended ensemble estimation. The components discussed regarding the estimated treatment effects will include estimation techniques in general, the distribution of estimated treatment effects, central tendency measures, as well as the standard errors of the estimated treatment effects themselves. This section will also discuss two particular estimation techniques, namely ordinary least squares and instrumental variables. Explanation of these estimation techniques and the role they play in extended ensemble estimation will be followed by how weighting can be utilized, including using the standard errors of the estimated treatment effects to achieve a combined estimated treatment effect that is weighted for precision using a meta-analysis style approach.

2.1.1 Role of Estimation Technique

In order to compare estimates of a treatment effect across various specifications, a choice must be made that will determine how each treatment effect will be estimated, given a specified model. Among the possible choices for estimators are the more common, but not only, Least Squares, Maximum Likelihood, Bayes, and Markov Chain Monte Carlo approaches. Particular desirable properties can help researchers determine which estimator to use, such as unbiasedness, being the minimum variance unbiased estimator (MVUE), or being the best linear unbiased estimator (BLUE). The conditions for some estimators with particular properties may be satisfied in some cases while failing to hold in other scenarios; thus, researchers may pick an estimation technique based on the conditions they can comfortably satisfy, or avoid an estimation technique that is not robust in the presence of failed conditions that can be hard to hold or justify, such as general independence or random sampling. In terms of understanding the sensitivity or robustness of a treatment effect, one may observe different estimated treatment effects based on the chosen estimation technique. In that sense, estimation technique can be taken into account as a specification in the extended ensemble estimation framework by estimating the treatment effect using various estimation techniques in order to observe possible sensitivity or robustness.

2.1.2 Distribution

Once estimation of the treatment effect has been carried out for the various model specifications, visually inspecting the estimated treatment effects as a distribution can reveal any present sensitivity to model specification, or robustness. Namely, a distribution of estimated treatment effects with low variance would be evidence of robustness regarding specification. That is, the estimated treatment effect does not vary far from specification to specification. A distribution of estimated treatment effects with high variance would be evidence of sensitivity regarding specification. That is, the observed estimated treatment effect depends on the particular specification.
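A minimal plotting sketch of this visual inspection is given below. It assumes the per-specification estimates and standard errors have already been collected into a data frame (here called specs, with illustrative column names estimate and std_error) and optionally marks the estimate from the originally chosen specification.

import matplotlib.pyplot as plt

def plot_estimate_distribution(specs, original_estimate=None):
    # Histogram of estimated treatment effects across specifications,
    # weighted by precision (one over the squared standard error).
    weights = 1.0 / specs["std_error"] ** 2
    plt.hist(specs["estimate"], bins=20, weights=weights)
    if original_estimate is not None:
        # Mark the estimate from the originally chosen specification.
        plt.axvline(original_estimate, linestyle="--")
    plt.xlabel("Estimated treatment effect")
    plt.ylabel("Precision-weighted frequency")
    plt.show()

An unweighted histogram (omitting the weights argument) corresponds to the raw distribution discussed in this subsection; the precision-weighted version anticipates the weighting introduced in Section 2.3.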
Multimodal shapes in the distribution can be used to help detect commonalities or differences in the specifications. For example, model specifications that contain a highly predictive covariate in terms of the treatment effect may exhibit similar estimated treatment effects, while model specifications that do not include that covariate may still group together in terms of their estimated treatment effects, but higher or lower than those that included the highly predictive covariate. The commonalities and differences of these two groups of specifications would be visible in the distribution of estimated treatment effects as a bimodal or multimodal shape.

2.1.3 Central Tendency Measures

Characterizing the central tendency of the estimated treatment effects across the various specifications further assists in determining the level of robustness or sensitivity of the estimated treatment effects as compared to the estimated treatment effect from the chosen specification. As complementary measures of center, the mean and median estimated treatment effects can be baseline measures for comparison. The mean estimated treatment effect can stand as an accurate measure of central tendency when the distribution of estimated treatment effects is more symmetric, while the median estimated treatment effect should be considered if the distribution of treatment effects is more skewed, since the mean is generally sensitive to outliers.

2.1.4 Standard Errors of the Estimated Treatment Effects

Choices in specification, particularly model specification, can not only result in various estimated treatment effects but also in various levels of estimation precision. If a particular model specification results in utilizing a smaller subsample of the data, this can directly impact the standard error of the estimated treatment effect. Accounting for the precision of each estimated effect plays a large role in the extended ensemble estimation framework. A single specification that results in a rejection of any treatment effect may not be cause for concern, but may draw extra attention in the extended ensemble estimation framework once compared to numerous other plausible specifications that resulted in opposite conclusions. A single specification that results in a conclusion counter to that of many alternative specifications could arise from an imprecise estimate of the treatment effect, that is, a larger standard error than those of the alternative specifications. So long as the pool of alternative specifications is rich, the precision of the estimated treatment effects can lead to a deeper, more nuanced conversation regarding the actual treatment effect, instead of relying on a single specification to make a decision about the treatment at hand.

2.1.5 Discussion

Obtaining the distribution of estimated treatment effects accomplishes the first and second goals of this study. Using visual and quantitative inspections of the shape, center, and spread helps researchers and readers understand the robustness or sensitivity of the estimated treatment effect under alternative specifications.

2.2 Estimation Techniques

As stated previously, to more fully understand the sensitivity or robustness of a treatment effect, one may want to consider how estimated treatment effects differ based on estimation technique.
That is, a critical piece of Extended Ensemble Estimation is accounting for estimation technique as a specification in the extended ensemble estimation framework by estimating the treatment effect using various estimation techniques. This study will consider two primary estimation techniques: Ordinary Least Squares and an Instrumental Variables approach.

2.2.1 Ordinary Least Squares

One of the most common methods of estimation across fields is least squares, particularly ordinary least squares (OLS). This method estimates unknown parameters by minimizing the sum of squared residuals, or differences between the observed values of the dependent variable and the predicted values of the dependent variable based on the model specification. Among the many benefits of OLS are a closed form solution for estimates that is quickly and easily produced by most entry level software, many desirable properties in terms of estimation under certain assumptions, as well as being fairly robust in the case of unsatisfied assumptions. OLS is a consistent estimator in the case of exogenous predictors that form a matrix with full column rank. The usual OLS variance estimator is consistent when the regressors have finite fourth moments. By the Gauss-Markov Theorem, OLS is the best linear unbiased estimator, in that it achieves the smallest variance among linear unbiased estimators, when the errors are homoscedastic and serially uncorrelated. OLS is equivalent to the maximum likelihood estimator, another popular estimation technique, when the errors are normally distributed with a mean of zero. In the case of endogenous regressors (regressors that are correlated with the error term), OLS produces biased estimates. When endogeneity is present, other estimation techniques may be more desirable, such as an Instrumental Variables approach.

2.2.2 Instrumental Variables

Instrumental variables estimation is a technique that is often used to estimate causal relationships by addressing potential confounding effects and measurement error. This method is often used when controlled experiments are not feasible, such as in observational studies (Angrist & Imbens, 1995). In an observational study, an individual may be more likely to receive treatment than another individual, in turn affecting the resulting outcome. In other words, random assignment does not necessarily hold. The first order condition of Ordinary Least Squares requires the independent variables and the error term to be uncorrelated. If this condition does not hold, Ordinary Least Squares will not provide the causal impact of the independent variable, but instead will produce biased estimates. This first order condition is often known as an exogeneity condition, and an independent variable that satisfies the condition is known as exogenous. In order to handle potential endogeneity, Instrumental Variables estimation hinges on utilizing a variable that is correlated with the endogenous variable, affects the outcome only indirectly through the endogenous variable, and is not correlated with the error term. A variable that is not correlated with the error term does not break the first order condition, but also captures the desired effect if it is correlated with the endogenous variable. A variable that satisfies these conditions is known as an instrumental variable and is said to satisfy the exclusion restriction. Instrumental variable estimation requires estimating multiple models in a sequence, known as stages.
A common technique using instrumental variables requires two modeling steps, and thus is known as two stage least squares. Instrumental variables estimation tends to underperform if the variables used as instruments are weak, that is, if they are poor predictors of the endogenous regressor. Using a weak instrumental variable can result in poor predicted values of the endogenous variable, leading to little variation and a smaller likelihood of predicting the final outcome of interest in the second stage of modeling. Since the endogenous variables and any variables intended to be used as instruments are all observed, the strength of instruments can be tested directly (Stock et al., 2002). It should be noted that when covariates are exogenous, the desirable small sample properties of Ordinary Least Squares can be derived through the moments of the estimator conditional on the covariates. On the other hand, if such properties cannot be easily obtained due to endogenous covariates, inferences using instrumental variables in these scenarios are often based on asymptotic approximations of the sampling distribution of the estimator. A model that is exactly identified produces finite sample estimators with no moments, leading to an estimator that is said to be neither biased nor unbiased, where the size of the test statistic may be significantly distorted and could stray far from the value of the parameter of interest (Nelson & Startz, 1988). In terms of precision, instrumental variables estimation tends to produce larger standard errors when compared to ordinary least squares, but remains consistent in the presence of endogeneity while ordinary least squares is inconsistent. The precision of instrumental variables tends to increase with the strength of the instruments.
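To illustrate how the two estimation techniques can enter the pool of specifications side by side, the sketch below computes the same treatment effect by plain OLS and by a hand-rolled two stage least squares. The variable names (df, y, treat, the instrument z, and the control x1) are illustrative placeholders; in practice a dedicated IV routine would also be used so that the second-stage standard errors receive the proper 2SLS correction, which the naive second stage below does not provide.

import statsmodels.api as sm

def ols_and_2sls_estimates(df, outcome="y", treatment="treat",
                           instrument="z", controls=("x1",)):
    # Plain OLS estimate of the treatment coefficient.
    X_ols = sm.add_constant(df[[treatment, *controls]])
    ols_fit = sm.OLS(df[outcome], X_ols).fit()
    # Stage 1: regress the endogenous treatment on the instrument and controls.
    X1 = sm.add_constant(df[[instrument, *controls]])
    stage1 = sm.OLS(df[treatment], X1).fit()
    # Stage 2: regress the outcome on the predicted treatment and controls.
    df2 = df.assign(treat_hat=stage1.fittedvalues)
    X2 = sm.add_constant(df2[["treat_hat", *controls]])
    stage2 = sm.OLS(df2[outcome], X2).fit()
    return {"ols": ols_fit.params[treatment], "2sls": stage2.params["treat_hat"]}

Each returned coefficient, together with an appropriately computed standard error, would then be one more specification contributed to the distribution described in Section 2.1.2.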
2.3 The Extended Ensemble Estimate, Weighting for Precision

Once specifications have been made and estimation techniques have been selected, the next step in extended ensemble estimation is to account for the precision of the estimated treatment effects across specifications and estimation techniques. As stated previously in this chapter, standard errors can rise or shrink for a variety of reasons. The sample size utilized for estimation may shrink due to model specification, leading to larger standard errors in estimation. As pointed out in the earlier section, estimation techniques may also produce various standard errors depending on the relationships between covariates, dependent variables, and errors. In this vein, part of extended ensemble estimation is to weight each estimated treatment effect by its precision, and this will be accomplished through two approaches. The first approach is directly weighting each estimated treatment effect by its associated standard error when creating the distribution of estimated treatment effects, that is, creating a weighted distribution of estimated treatment effects. The second approach is combining the estimated treatment effects into a single effect, weighting each estimate by its associated standard error. This combined estimated treatment effect, weighted by precision, will be called the Extended Ensemble Estimate.

Meta-analysis techniques for combining estimated effects across studies are commonly used to estimate effects across experiments or observational studies, accounting for both random and fixed effect models (Hedges and Vevea, 1998). The populations of the studies contained within a meta-analysis need not be constant, as this is one of many strengths of meta-analysis. On the other hand, the models within each study are constant. That is, meta-analysis is not well suited to shed light on the sensitivity or robustness of the model specification within a single study. Extended ensemble estimation is intended to be a within-study tool where the population and sample at hand are constant, while the robustness or sensitivity of the estimates produced by various model specifications are the focus. In order to gain the capacity to consider all possible model specifications within a study, Extended Ensemble Estimation adopts a similar technique, but combines estimated effects across specifications within a single experiment or observational study. In this sense, Extended Ensemble Estimation has the capacity to consider all possible specifications, while typical meta-analyses utilize what is already generated, potentially missing important models or alternative specifications.

In the extended ensemble estimation framework, let $j = 1, \ldots, J$ denote the $J$ various specifications and $y_j$ denote the observed value of the treatment effect in the $j$-th specification. The meta-analytic approach is a special case of the general linear mixed effect model with heteroscedastic sampling variances, assumed to be known. This type of model can be fitted by a two step approach outlined in Raudenbush (2009). Let $\theta_j$ denote the unknown true treatment effect, such that

\[ y_j \mid \theta_j \sim N(\theta_j, v_j) \]

In the random-effects model, we assume that $\theta_j \sim N(\mu, \tau^2)$, namely that the true treatment effects are normally distributed with average treatment effect $\mu$ and variance $\tau^2$. This model can be expressed as

\[ y_j = \mu + u_j + e_j \]

where $u_j \sim N(0, \tau^2)$ and $e_j \sim N(0, v_j)$. With this setup, the Extended Ensemble Estimate is denoted by

\[ \hat{\mu}_{EEE} = \frac{\sum_{j=1}^{J} w_j y_j}{\sum_{j=1}^{J} w_j} \]

where $w_j$ denotes the weight of each estimated treatment effect from specification $j$, specifically,

\[ w_j = \frac{1}{\hat{\tau}^2 + v_j} \]

where $\hat{\tau}^2$ denotes an estimate of $\tau^2$, the variance in the true effect across specifications, and $v_j$ denotes the sampling variance for specification $j$.

A special case is the equal-effects model, specifically when $\tau^2 = 0$. In this case, the true treatment effects across specifications are homogeneous and can be written as $\theta_1 = \theta_2 = \cdots = \theta_J = \theta$. The model of this special case can be written

\[ y_j = \theta + e_j \]

where $\theta$ denotes the true treatment effect. In this case, the Extended Ensemble Estimate is denoted by

\[ \hat{\theta}_{EEE} = \frac{\sum_{j=1}^{J} w_j y_j}{\sum_{j=1}^{J} w_j} \]

where $w_j = \frac{1}{v_j}$. In both models, $v_j$ is assumed to be known and is the square of the standard error of the estimated treatment effect. As such, this method of weighting is also known as the inverse-variance method, or variance-known method, in the meta-analysis literature. For reference, the unweighted least squares estimate of the treatment effect (Laird and Mosteller, 1990) can be expressed as

\[ \bar{\theta} = \frac{\sum_{j=1}^{J} y_j}{J} \]

The first step in deriving the Extended Ensemble Estimate is to estimate $\tau^2$ using one of many estimators, including the Hunter-Schmidt estimator (Hunter and Schmidt, 2004), the Hedges estimator (Hedges and Olkin, 1985; Raudenbush, 2009), the DerSimonian-Laird estimator (DerSimonian and Laird, 1986; Raudenbush, 2009), the Sidik-Jonkman estimator (Sidik and Jonkman, 2005a, 2005b), the maximum likelihood or restricted maximum likelihood estimator (Viechtbauer, 2005; Raudenbush, 2009), or the empirical Bayes estimator (Morris, 1983; Berkey et al., 1995). The second step is to use weighted least squares, with the weights $w_j$ formed from $\hat{\tau}^2$ and the $v_j$, to estimate the combined effect. Once the weights $w_j$ are known, $\hat{\theta}_{EEE}$ (or $\hat{\mu}_{EEE}$ in the random-effects case) can be calculated directly in order to achieve the third goal.
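The two step calculation just described can be made concrete with a short sketch. The function below is an illustration of the approach rather than a packaged implementation: it uses the DerSimonian-Laird estimator for $\tau^2$ and inverse-variance weights, and its name and interface are placeholders.

import numpy as np

def extended_ensemble_estimate(estimates, std_errors, random_effects=True):
    # Combine per-specification treatment effects by inverse-variance weighting.
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2     # sampling variances v_j
    w_fixed = 1.0 / v
    # Step 1: DerSimonian-Laird estimate of tau^2, the between-specification variance.
    y_bar = np.sum(w_fixed * y) / np.sum(w_fixed)
    q = np.sum(w_fixed * (y - y_bar) ** 2)
    c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if random_effects else 0.0
    # Step 2: precision weights and the combined Extended Ensemble Estimate.
    w = 1.0 / (tau2 + v)
    eee = np.sum(w * y) / np.sum(w)
    se_eee = np.sqrt(1.0 / np.sum(w))
    return eee, se_eee, tau2

Called on the per-specification estimates and standard errors gathered earlier, the function returns the precision-weighted estimate, its standard error, and the estimated between-specification variance; setting random_effects to False reproduces the equal-effects special case with weights $1/v_j$.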
CHAPTER 3 UPDATING REGRESSION COEFFICIENTS USING EXTENDED ENSEMBLE ESTIMATION

3.1 General Framework for Updating

Once the Extended Ensemble Estimate ($\hat{\beta}_{EEE}$) has been attained, it can be used to update the original estimated treatment effect from the original specification, $\hat{\beta}_{original}$. One way to achieve an updated treatment effect, $\hat{\beta}_{updated}$, is to form the weighted average of $\hat{\beta}_{original}$ and $\hat{\beta}_{EEE}$ as follows:

\[ \hat{\beta}_{updated} = \alpha \, \hat{\beta}_{original} + (1 - \alpha) \, \hat{\beta}_{EEE} \]

where $\alpha$ is used to weight each estimated treatment effect. This can be thought of as updating the original estimated treatment effect by the extended ensemble estimate. The choice of $\alpha$ determines how much to weight the original estimated treatment effect as opposed to the extended ensemble estimate. Possible choices of $\alpha$ and potential consequences will be discussed in Section 3.3. In the sense of using empirical data to update an estimate, this has similarities to empirical Bayesian methods, which will be discussed in the following section.

3.2 Ties to Empirical Bayes Methodology

Bayesian statistical inference refers to the techniques of modeling a parameter of interest, say $\theta$, with a distribution of potential values instead of assuming it is fixed, as in a frequentist approach. The distribution of the parameter of interest, $\theta$, allows for the ability to account for any prior beliefs regarding $\theta$, and thus is often referred to as the prior distribution (Jackman, 2009; Lynch, 2007). Observed data is used to update the prior distribution of $\theta$ by scaling the prior distribution by the likelihood of the observed data, producing a new distribution referred to as the posterior distribution of $\theta$. The posterior distribution of $\theta$ is, by definition, conditional on the observed data, while the prior distribution of $\theta$ is fixed before any data are observed. Once the posterior distribution is known, the unknown parameter $\theta$ is estimated using a single measure of the posterior distribution, often the mean or median, known as the Bayes estimate. Empirical Bayes methods are a subset of methods within this general framework that estimate the prior distribution of $\theta$ using observed data (Casella, 1985; Lynch, 2007; Robbins, 1992).

Extended Ensemble Estimation takes a meta-analysis approach of combining estimated effects in order to produce a single estimate of the treatment effect by weighting each estimated treatment effect by its precision, called the Extended Ensemble Estimate. In this meta-analytic setting, Bayesian methods have a few distinct advantages. The Extended Ensemble Estimate depends on the variance of the true treatment effect across specifications, $\tau^2$. The Bayesian framework allows for the ability to directly model any uncertainty in the estimation of $\tau^2$. Bayesian methods produce full posterior distributions for both $\mu$, the average treatment effect, and $\tau^2$ (Chung et al., 2013; McNeish, 2016). Thus, in general, Bayesian methods allow us to account for any prior knowledge or assumptions we want to incorporate. There are many existing estimators for $\tau^2$, previously discussed in the general methodology of Chapter 2, including the empirical Bayes estimator (Morris, 1983; Berkey et al., 1995). The derivation of this estimator in Berkey et al.
(1995) assumes that $y_j \mid a, D \sim N(X_j a, \; D + s_j^2)$, where $y_j$ is the observed treatment effect in specification $j$, $X_j$ is a row vector that contains the values of the covariates of specification $j$, $a$ is a column vector of regression coefficients, $D$ is the between-specification variance ($\tau^2$ in the previous notation), and $s_j^2$ is the estimated error variance. The estimate of $a$ is given by

\[ \hat{a} = (X' W X)^{-1} X' W y \]

where $W = \mathrm{diag}(w_1, \ldots, w_J)$ is the diagonal matrix of weights

\[ w_j = \frac{1}{\hat{D} + s_j^2} \]

and $\hat{D}$ is an approximately unbiased estimator of $D$.

3.3 Choosing a Weighting Scheme for Updating Regression Coefficients

In the context of Extended Ensemble Estimation, the goal in this chapter is to provide an updated estimated treatment effect using the original estimated treatment effect and the extended ensemble estimate. Frank and Min (2007) adapted a Bayesian methodology for updating indices of robustness in the context of observed and unobserved samples in order to form an ideal estimate. The authors defined the likelihood in terms of the observables and the prior in terms of the sample of potentially unobservable cases. In this sense, they are able to achieve a posterior estimate from updating the prior via the likelihood. Since the significance test for correlations and partial correlations is equivalent to that for regression coefficients (Cohen and Cohen, 1983; Fisher, 1924), the authors worked in terms of correlations and partial correlations. Using the Fisher z transformation (Lee, 1989), transformed sample correlations are approximately normally distributed with variance $1/(n-3)$ and are an unbiased estimate of the Fisher z transformation of the corresponding population correlation. Denoting $z(r)$ as the Fisher z transformation of a sample correlation $r$, the estimated posterior mean can be expressed as a precision-weighted combination of the transformed correlations from the observed and unobserved cases, where $r_{obs}$ is the statistically significant sample correlation for the observed cases, $r_{unobs}$ is the sample correlation coefficient for the unobserved cases, and $r_{ideal}$ is the correlation coefficient for the ideal estimate based on a combination of observed and unobserved cases. Since the weights are proportional to the respective sample sizes $n_{obs}$ and $n_{unobs}$, and the ideal estimate is based on $n_{obs} + n_{unobs}$ cases,

\[ z(r_{ideal}) = \frac{n_{obs}\, z(r_{obs}) + n_{unobs}\, z(r_{unobs})}{n_{obs} + n_{unobs}} \]

Letting $\alpha = \frac{n_{obs}}{n_{obs} + n_{unobs}}$, the authors provide the final estimated posterior mean as

\[ z(r_{ideal}) = \alpha\, z(r_{obs}) + (1 - \alpha)\, z(r_{unobs}) \]

Through the lens of Empirical Bayesian methodology, the posterior distribution for the population correlation, on the Fisher z scale, is $N\!\left(z(r_{ideal}), \; \frac{1}{n_{obs} + n_{unobs}}\right)$. Thus, by using the mean of the posterior, the Empirical Bayes estimate is $z(r_{ideal})$. The variance can be used to quantify robustness by considering what values of $r_{unobs}$ would be necessary for $r_{ideal}$ to fall within a 95 percent highest posterior density (HPD) interval (Frank & Min, 2007).

In the case of the extended ensemble estimate, one can use the standard errors of the estimated treatment effects to define $\alpha$ in a similar way to update the original estimated treatment effect with the extended ensemble estimate. That is,

\[ \alpha = \frac{se_{EEE}^2}{se_{EEE}^2 + se_{original}^2} \]

could be used to weight the original estimated treatment effect and the extended ensemble estimate, where $se_{EEE}$ and $se_{original}$ are the standard errors of the extended ensemble estimate and the original estimated treatment effect, respectively. Using the standard errors in this weighting scheme accounts for estimation efficiency, which is directly related to sample size.
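A small sketch of this updating step, with illustrative names and using the efficiency-based weight just defined:

def update_original_estimate(beta_original, se_original, beta_eee, se_eee):
    # Weight on the original estimate: large when the original estimate is
    # precise relative to the Extended Ensemble Estimate, small otherwise.
    alpha = se_eee ** 2 / (se_eee ** 2 + se_original ** 2)
    beta_updated = alpha * beta_original + (1.0 - alpha) * beta_eee
    return beta_updated, alpha

For example, an original estimate with a standard error twice that of the extended ensemble estimate would receive a weight of $\alpha = 0.2$, so the updated effect would sit much closer to the extended ensemble estimate.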
3.3.1 The Effects of Weighting Scheme

The philosophical choice to represent the unknown parameter of interest, θ, with a distribution rather than a fixed value is a key difference between Bayesian and frequentist methods. It captures the view that progress in science generally comes from learning from past findings, incorporating information from those findings, and recognizing that no study is conducted in the absence of previous research. Bayesian inference requires that the prior knowledge of θ be stated explicitly via the prior distribution, an unconditional distribution representing our prior knowledge regarding θ (Kaplan, 2014). Since the posterior distribution of θ is derived from the prior distribution, the posterior hinges directly on the choice of prior. In some scenarios, we may not have strong prior knowledge of θ. When prior knowledge is completely lacking, one would select a prior that models this directly by choosing a prior distribution in which the possible values of θ are no more or less likely than one another; a uniform distribution is a common choice in this case, and it is an extreme example of a non-informative prior. In other cases where we have prior information that we wish to incorporate, we can select a prior distribution of θ under which some potential values are more or less likely than others. As stated previously, the inferential statistics based on the posterior distribution may change depending on the choice of prior, so one must be careful and deliberate when deciding whether to select an informative or non-informative prior (Kaplan, 2014).

In the case of Extended Ensemble Estimation, the prior knowledge can be thought of as the original estimated treatment effect and its associated variability. To achieve an updated estimated treatment effect, one can use the estimated treatment effect weighted for precision across the various alternative specifications, namely the extended ensemble estimate and its associated variability. Choosing values for ω that utilize the associated standard errors of the estimated treatment effects would allow researchers to weight the original estimated treatment effect and the extended ensemble estimate by the level of uncertainty in each. As significance statements rely directly on the standard errors of estimates, standard errors are often focal points of discussion (Deaton & Cartwright, 2018) and may serve as an initial basis for choosing ω. Formulated in terms of efficiency, one possible weighting scheme is the one given above:

ω = (SE_EEE)² / [ (SE_EEE)² + (SE_original)² ]

It is worth noting that other values of ω could be selected to weight the original estimated treatment effect and the extended ensemble estimate, and that the resulting updated estimated treatment effect, β̂_updated, hinges on the choice of ω. Taking a closer look at the formulation of β̂_updated,

β̂_updated = ω β̂_original + (1 − ω) β̂_EEE,

one can map out the consequences of various choices of ω. Choosing ω = 1 would return the original estimated treatment effect, β̂_original, while choosing ω = 0 would return the extended ensemble estimate, β̂_EEE. Weighting both estimated treatment effects equally would be achieved by choosing ω = 0.5. As this choice impacts the updated treatment effect, the choice of ω should be made carefully and intentionally.
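The short sketch below makes these consequences explicit by evaluating the update at ω = 1, 0.5, 0, and at the standard-error-based weight; the numerical values of the two estimates and their standard errors are hypothetical and used only for illustration.

```python
def updated_effect(beta_original, beta_eee, omega):
    """beta_updated = omega * beta_original + (1 - omega) * beta_eee."""
    return omega * beta_original + (1.0 - omega) * beta_eee

# Hypothetical original and extended ensemble estimates with their standard errors
beta_original, se_original = 0.70, 0.071
beta_eee, se_eee = 0.58, 0.102

# Standard-error-based weight: omega = SE_EEE^2 / (SE_EEE^2 + SE_original^2)
omega_se = se_eee**2 / (se_eee**2 + se_original**2)

for omega in (1.0, 0.5, 0.0, omega_se):
    print(f"omega = {omega:.3f} -> beta_updated = "
          f"{updated_effect(beta_original, beta_eee, omega):.3f}")
# omega = 1 returns the original estimate, omega = 0 the extended ensemble
# estimate, omega = 0.5 weights them equally, and the standard-error-based
# omega leans toward whichever estimate is more precise.
```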
For example, the standard errors of both estimated treatment effects can be used, as in the sketch above, to form a weighting scheme that reflects the efficiency of each estimated treatment effect.

CHAPTER 4 USER GUIDE FOR BEST PRACTICES WHEN IMPLEMENTING EXTENDED ENSEMBLE ESTIMATION

4.1 Introduction to Best Practices

This chapter is intended to guide end users in applying extended ensemble estimation so as to achieve the best possible performance, avoid traditional statistical pitfalls, and best promote the conversation about the potential causality of a treatment effect. The process of Extended Ensemble Estimation, including the estimated treatment effect weighted by precision, is suited to serve the conversation around the potential causal relationship between a treatment variable and an outcome by quantifying the robustness and sensitivity of the estimated treatment effect across alternative specifications. To maximize the effectiveness of Extended Ensemble Estimation, the end user should strive for a pool of alternative specifications that are strongly supported by existing theory, empirical evidence, and past research. Such specifications increase the quality of the pool, while poor specifications that are not vetted can hinder the performance of ensemble estimation. Once the alternative specifications have produced estimated treatment effects, they can be used to find the extended ensemble estimate, weighted for precision. The distribution of estimates, in tandem with the estimate weighted for precision, can be compared to the original proposed specification in order to quantify sensitivity or robustness and to inform the conversation around a potential causal treatment effect. As the pool of alternative specifications grows, the distribution of estimated treatment effects will tend to be smoother rather than discrete, assisting the ability to decipher its shape, center, and spread. As later sections discuss, ensemble estimation techniques can suffer in the presence of poor pools of specifications, so it is generally best to grow the pool of alternative specifications while holding the quality of the specifications as high as possible. Like many quantitative methodologies, the effectiveness of extended ensemble estimation can be impaired by outside influences; the following sections discuss how the end user can mitigate or even eliminate these potential weaknesses.

4.2 Alternative Specifications

Chapter 1 discusses weaknesses of ensemble methods in general, specifically how ensemble techniques are traditionally susceptible to poor performance in the presence of poor alternative specifications. Extended Ensemble techniques utilize alternative specifications, thus poor alternative specifications may hinder ultimate performance (Saez-Rodriguez et al., 2016). A poor alternative specification may show up in the form of large standard errors, perhaps due to a shrinkage in the utilized sample imposed by the model specification. In such a case, this would be reflected in the weighted distribution of estimated treatment effects. The extended ensemble estimate, weighted for precision, would also reflect this by weighting estimated treatment effects with large standard errors less than those with smaller standard errors. In the case of instrumental variables, the strength of the instruments used within a specification may provide a more appropriate weight than standard errors alone, since particular estimation techniques may suffer from larger standard errors.
For instrumental variables, the strength of an instrument provides the user with a sense of confidence regarding the associated standard error. At this point, the user may decide whether to weight estimated treatment effects by standard errors, by the strength of the instruments used, or by a combination of both if applicable and appropriate. In order to keep the quality of the pool of alternative specifications as high as possible, alternative specifications should be derived, at least in part, from past research and empirical evidence. In that sense, the performance of extended ensemble estimation could be hindered by an inflated pool of alternative specifications, and it is in the end user's best interest to keep the quality of the pool of alternative specifications as high as possible. A potential weakness of extended ensemble estimation is a weak pool of alternative specifications. For example, specifications that omit an important covariate may under- or overestimate the treatment effect, hindering the user's ability to assess the sensitivity or robustness of the estimated treatment effect across specifications. In the case of a rich pool of alternative specifications, a strength of Extended Ensemble Estimation is that it leaves no room for spurious results to hide. In that sense, it is the goal of the user to provide a pool of high-quality alternative specifications.

4.3 Estimation Techniques

As discussed in Chapter 2, estimation techniques play a crucial role in extended ensemble estimation because they produce the estimated treatment effects. The observed estimated treatment effects, given a set of data, may differ slightly or substantially due in part to the choice between OLS and Instrumental Variables. While OLS has many desirable properties as an estimator, the strengths of the Instrumental Variables approach can complement potential weaknesses of OLS. Although Instrumental Variables estimation tends to produce larger standard errors than OLS, it remains a consistent estimator in the presence of endogeneity, a phenomenon that results in inconsistency of the OLS estimator. The Extended Ensemble Estimate, weighted for precision, may naturally weight estimates produced by OLS more favorably than those produced by Instrumental Variables because of their tendency toward smaller standard errors, all else being equal. In order to display any sensitivity regarding estimation technique, the end user may follow the extended ensemble procedure twice, obtaining separate estimated treatment effects weighted by precision for each estimation technique.

4.4 P-hacking and Cherry Picking

Extended Ensemble Estimation is a statistical tool, in that its effectiveness can be minimized or maximized by the end user. This methodology, by construction, can help researchers distinguish between spurious estimated treatment effects and potentially causal relationships through the robustness of the estimated treatment effects. So long as the pool of alternative specifications is rich, it also has the ability to display outlying estimated treatment effects, in the case of end users acting in good faith, and the ability to detect potentially cherry-picked results, in the case of end users attempting to support particular positions. Cherry picking refers to the act of selecting individual cases or data that confirm a particular result while ignoring cases that contradict that result, intentionally or unintentionally (Klass, 2008).
If the original estimated treatment effect had been cherry picked, that is, if it were a result that significantly differs from the majority of plausible alternative models, the distribution of estimates produced by the extended ensemble estimation process would be highly variable, suggesting a level of sensitivity of the original result. This assumes that the user presents the other specifications, and resulting estimates, from which they selected. On the other hand, if the original result fell within a reasonable range of the alternative estimates, this would suggest a level of robustness of the original result and would also be supporting evidence against the notion of cherry picking. P-hacking, another common statistical pitfall, occurs when a researcher attempts several statistical analyses or model specifications and then selectively reports those that produce significant results (Brodeur et al., 2016; Simmons et al., 2016; Gadbury & Allison, 2012; John et al., 2012). While end users are able to provide specifications of their choosing, extended ensemble estimation requires a pool of specifications for comparison. It is the responsibility of the user to provide plausible alternative specifications in order to help quantify the robustness or sensitivity of an estimated treatment effect. During the peer review process, reviewers may propose alternative specifications in an attempt to identify potential p-hacking. In this case, extended ensemble estimation would provide the framework to compare the estimated treatment effects of the authors' specifications to the, potentially many, estimated treatment effects of the reviewers' alternative specifications. By requiring alternative model specifications, extended ensemble estimation does not allow the end user to select single specifications that produce desirable results, ultimately providing a level of transparency to the specification phase and to the impact it has on the resulting estimated treatment effect.

CHAPTER 5 SIMULATIONS

5.1 How To Use Simulation to Inform Extended Ensemble Estimation

It is not feasible to account for every possible situation one may encounter during the research process, specifically regarding the estimation phase. To build on the previous chapter's guiding principles for best implementing the extended ensemble estimation technique, this chapter serves as a complement by exploring the performance of extended ensemble estimation across various common scenarios using simulation. I will start by laying out the different scenarios to be explored and the details of the simulation that will be used, and finally I will discuss the performance of extended ensemble estimation in each scenario.

5.2 Pre-treatment and Confounding Variables

When considering the estimated effect of a treatment variable, much time and effort is often spent trying to account for potential confounding variables that may cloud a researcher's ability to draw confident, clear conclusions. Even if one does everything possible to rule out alternative explanations for the estimated effects of a treatment variable, it is extremely difficult to rule out all alternative explanations. In other words, accounting for every possible confounding variable is not only a massive hurdle, but often impossible.
One method to help address unknown variables that may confound estimates of a treatment effect is to include a pre-treatment variable measured during the period in which a potential confounder would also be present, reducing the need to include the confounding variable itself. This variable may not be randomly assigned but should be strongly correlated with the outcome variable. An example of such a variable is an academic pre-test in a school setting, measured during the period in which a potential confounder is also likely to be present. Although a pre-treatment variable of this nature is not a universal cure for all potential confounders, it can be easier to implement than identifying and measuring each potential confounder. Raising concerns about a particular confounder of the treatment would then require evidence that such a confounder was not present during the pre-treatment measurement and thus was not captured by it. When gauging how ensemble estimation works in the presence of a strong pre-treatment variable that may not be randomly assigned, it is sufficient to simulate a variable that is strongly correlated with the treatment variable of interest as well as with the outcome variable. A weak pre-treatment variable that is randomly assigned can be simulated using a variable that is weakly correlated with both the treatment variable and the outcome variable.

5.3 Instrumental Variables

Another method that aims to address potential confounding effects and measurement error, popularized mainly in econometrics, is Instrumental Variables. This method is often used when controlled experiments are not feasible, such as in observational studies (Angrist & Imbens, 1995). In an observational study, one individual may be more likely to receive treatment than another, in turn affecting the resulting outcome; in other words, random assignment does not necessarily hold. In the derivation of Ordinary Least Squares, the first order condition requires the independent variable and the error term to be uncorrelated. If this condition does not hold, Ordinary Least Squares will not provide the causal impact of the independent variable, but only the parameter that makes the resulting errors appear uncorrelated with the independent variable. This situation results in biased and inconsistent Ordinary Least Squares estimates (Bullock et al., 2010). If the correlation between the independent variable and the error term is not zero, the independent variable is known as endogenous. The first order condition that requires this correlation to be zero is often known as an exogeneity condition, and an independent variable that satisfies the condition is known as exogenous. The instrumental variable method handles endogeneity by utilizing a variable that is correlated with the endogenous variable, affects the outcome only indirectly through the endogenous variable, and is not correlated with the error term. A variable that is not correlated with the error term does not break the first order condition, yet still captures the desired effect if it is correlated with the endogenous variable. This variable is called an instrumental variable, and its application requires multiple steps known as stages. A common technique using instrumental variables requires two modeling steps and is thus known as two stage least squares.
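As a brief illustration of the two-stage procedure described above, the sketch below implements a minimal two stage least squares estimator with numpy. The data-generating process, coefficient values, and variable names are hypothetical and chosen only to show how an instrument recovers the treatment effect when OLS does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process with an endogenous treatment:
# the unobserved confounder u affects both treatment x and outcome y,
# while the instrument z affects y only through x.
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument
x = 0.8 * z + 0.6 * u + rng.normal(size=n)    # endogenous treatment
y = 0.5 * x + 0.9 * u + rng.normal(size=n)    # true treatment effect is 0.5

def ols(X, y):
    """OLS coefficients via least squares (intercept in the first column of X)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Naive OLS of y on x is biased upward because x is correlated with u.
b_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: regress the endogenous treatment on the instrument, keep fitted values.
x_hat = np.column_stack([ones, z]) @ ols(np.column_stack([ones, z]), x)
# Stage 2: regress the outcome on the fitted values from stage 1.
b_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"OLS estimate: {b_ols:.3f}, 2SLS estimate: {b_2sls:.3f} (true effect 0.5)")
```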
When gauging how ensemble estimation works in the presence of a strong instrumental variable, it is sufficient to simulate a variable that is strongly correlated with the treatment variable of interest but weakly correlated with the outcome, so that it does not affect the outcome directly. To consider a weak instrument, it is sufficient to simulate a variable that is weakly correlated with both the treatment variable and the outcome variable.

5.4 Randomized Control Trials

A list of commonly encountered designs would not be complete without accounting for randomized control trials. In an attempt to reliably estimate unbiased treatment effects, randomized control trials utilize random assignment between treatment and control groups. When performed with fidelity, this framework allows researchers to attribute any observed difference between the treatment group and the control group to the treatment effect by minimizing any other possible contamination of the treatment effect. Randomized control trials have a long history in medical research, where biased estimates may have long-term consequences of high magnitude. More recently, randomized control trials have spread to other disciplines, such as economics and the social sciences. The strict structure of randomized control trials assists researchers in making a case for causality by allowing confounding effects to be ruled out via the control group. Imbens (2010) summarized common conceptions surrounding randomized control trials by saying, "Randomized experiments do occupy a special place in the hierarchy of evidence, namely at the very top." With that said, randomized control trials are not without drawbacks and faults. Beyond the difficulty of properly carrying out a strict framework and carefully constructed design, they often incur high monetary and time expenses. Randomized control trials can also suffer from a lack of generalizability: a particular trial may only consider a sample from a specific high-risk group to maximize the probability of detecting an effect, which may not be applicable to a low-risk group or to the population as a whole. Randomized control trials may not be practical for urgent health issues where decisions must be made faster than a well-performed trial can permit, and although it is not uncommon for trials to last many years, that still may not be enough to assess long-term treatment effects. Because randomized control trials are often viewed as more credible and rigorous than other methods, other designs often attempt to mimic randomized control trials in order to gain their benefits (Angrist & Pischke, 2010). In this spirit, Extended Ensemble Estimation can be used to help gauge how well a randomized control trial was constructed and conducted by comparing the estimated treatment effect to the effects estimated across various model specifications. Low dispersion of estimated effects across model specifications, and alignment with the trial's estimate, would suggest a more sound randomized control trial implementation, while high dispersion or misalignment would suggest a less sound implementation. That is, if a randomized control trial is well constructed and implemented, the estimated treatment effect should align with the estimated treatment effects across the model specifications, and the estimated effects across model specifications should vary little.
In order to gauge how ensemble estimation works within a randomized framework, it is sufficient to simulate a variable that is weakly correlated with the treatment variable, ideally not correlated with the treatment variable at all in the case of perfect randomization, but strongly correlated with the outcome variable.

5.5 Accounting for Selection Bias

The simulations that follow include an outcome variable (y), a treatment variable (x1), and two additional variables (x2 and x3) that are used to represent the various scenarios described above. Since each model must include the outcome and the treatment variable, there are four possible models resulting from covariate selection in each simulation:

y = β0 + β1 x1
y = β0 + β1 x1 + β2 x2
y = β0 + β1 x1 + β3 x3
y = β0 + β1 x1 + β2 x2 + β3 x3

Based on the relationship established by Heckman (1979) between bias due to nonrandom assignment to treatment conditions and bias due to sample selection, bias due to an omitted variable can be thought of as bias due to sample selection. Furthermore, variability due to sample selection can be thought of as variability due to model specification through included or omitted covariates. In this sense, using such a limited number of controls beyond the treatment variable can address many common concerns, including but not limited to omitted variable bias, sample selection bias, and variability due to sample selection. Here, x2 and its correlations with x1 and y are used to specify the various scenarios, while x3 and its correlations with x1, x2, and y stand in for other potential covariates that may have been missed or left out of the analysis. Model 1 is the base model, using only the treatment variable x1 while ignoring all other controls. Model 2 includes x2 in order to represent potential instruments, pre-treatments, or random assignments. Model 3 includes x3, standing for potentially left-out control variables. Model 4 includes all potential variables. For each scenario below, a correlation matrix using standardized variables is defined. Each simulated scenario utilized the ordinary least squares estimation technique, and the mean, median, and extended ensemble estimate are reported. Specifically, the Cholesky decomposition can be used to generate data under a particular specification, a correlation matrix is then calculated from the generated data, and OLS estimates are obtained from it (Becker, 1992; Becker, 1995; Becker & Aloe, 2019; Sumiati et al., 2020); a sketch of this procedure is given below.
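The following is a minimal sketch of that procedure under stated assumptions: standardized data are generated from a chosen correlation matrix via its Cholesky factor (here the strong pre-treatment structure of section 5.5.1 below), the four candidate models are fit by OLS, and the specification-level estimates are combined with simple precision weights. The sample size, seed, and simplified weighting are illustrative choices rather than the exact procedure behind the reported tables.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Target correlation matrix for (y, x1, x2, x3): strong pre-treatment scenario
R = np.array([[1.0, 0.7, 0.8, 0.2],
              [0.7, 1.0, 0.8, 0.1],
              [0.8, 0.8, 1.0, -0.3],
              [0.2, 0.1, -0.3, 1.0]])

# Generate standardized data with (approximately) the target correlation structure
Z = rng.standard_normal((n, 4))
data = Z @ np.linalg.cholesky(R).T
y, x1, x2, x3 = data.T

def fit_x1(columns):
    """OLS of y on an intercept plus the given columns; returns (beta_x1, se_x1)."""
    X = np.column_stack([np.ones(n)] + columns)   # x1 is always the second column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], np.sqrt(cov[1, 1])

models = {"y ~ 1+x1":       [x1],
          "y ~ 1+x1+x2":    [x1, x2],
          "y ~ 1+x1+x3":    [x1, x3],
          "y ~ 1+x1+x2+x3": [x1, x2, x3]}
results = {name: fit_x1(cols) for name, cols in models.items()}

effects = np.array([b for b, _ in results.values()])
ses = np.array([s for _, s in results.values()])
w = 1.0 / ses**2     # simple precision weights (the full EEE weighting is described in Chapter 2)
print(results)
print("precision-weighted estimate:", np.sum(w * effects) / np.sum(w))
```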
5.5.1 Strong Pre-Treatment

Simulating a variable that is strongly correlated with both the treatment and the outcome, ρ(x1, x2) = ρ(x2, y) = 0.8, representing a strong pre-test, plays a large role in the resulting estimates. When the strong pre-test is omitted, the base effect of the treatment variable, ρ(x1, y) = 0.7, is estimated as β̂1 = 0.7. When the strong pre-test is included, the estimated effect of the treatment variable is β̂1 = 0.17. The mean and median estimated effects of the treatment are 0.28 and 0.43, respectively. The ensemble estimate of the treatment effect is 0.28, with a standard error of 0.2814. In the case of a strong pre-test, the change in estimates reflects the effectiveness of the included pre-test. This change is also displayed by the mean, median, and ensemble estimates. Below are the tables including the correlations, model specifications, estimated effects of the treatment, estimated standard errors of the treatment, mean estimated effect, median estimated effect, and extended ensemble estimate.

Table 5.1.1 Correlation structure for strong pre-treatment
Corr(.,.)    y       x1      x2      x3
y            1       0.7     0.8     0.2
x1           0.7     1       0.8     0.1
x2           0.8     0.8     1      -0.3
x3           0.2     0.1    -0.3     1

Table 5.1.2 Model specifications
Model               X1 Estimate    X1 Standard Error    X1 Est/SE
y ~ 1+x1+x2+x3      -0.44811       0.065671             -6.82359
y ~ 1+x1+x3          0.686869      0.070563              9.734172
y ~ 1+x1+x2          0.166667      0.098601              1.690309
y ~ 1+x1             0.7           0.071414              9.801961

Table 5.1.3 Extended Ensemble Estimation results
          X1 Estimate
Mean      0.276356
Median    0.426768
EEE       0.275402 (0.2814)

5.5.2 Weak Pre-Treatment

Simulating a variable that is weakly correlated with the treatment and outcome, ρ(x1, x2) = ρ(x2, y) = 0.1, representing a weak pre-test, plays a small role in the resulting estimates. When the weak pre-test is omitted, the base effect of the treatment variable, ρ(x1, y) = 0.7, is estimated as β̂1 = 0.7. When the weak pre-test is included, the estimated effect of the treatment variable is β̂1 = 0.696. The mean and median estimated effects of the treatment are both 0.69. The ensemble estimate of the treatment effect is 0.69, with a standard error of 0.1585774. In the case of a weak pre-test, the lack of change in estimates is due to the lack of effectiveness of the included pre-test. That is, the mean, median, and ensemble estimates are robust in the presence of a weak pre-test. Below are the tables including the correlations, model specifications, estimated effects of the treatment, estimated standard errors of the treatment, mean estimated effect, median estimated effect, and extended ensemble estimate.

Table 5.2.1 Correlation structure for weak pre-treatment
Corr(.,.)    y       x1      x2      x3
y            1       0.7     0.1     0.2
x1           0.7     1       0.1     0.1
x2           0.1     0.1     1      -0.3
x3           0.2     0.1    -0.3     1

Table 5.2.2 Model specifications
Model               X1 Estimate    X1 Standard Error    X1 Est/SE
y ~ 1+x1+x2+x3      0.676471       0.070828             9.550863
y ~ 1+x1+x3         0.686869       0.070563             9.734172
y ~ 1+x1+x2         0.69697        0.07171              9.719274
y ~ 1+x1            0.7            0.071414             9.801961

Table 5.2.3 Extended Ensemble Estimation results
          X1 Estimate
Mean      0.690077
Median    0.691919
EEE       0.690033 (0.1585774)

5.5.3 Strong Instrumental Variable

Simulating a variable that is strongly correlated with the treatment but weakly correlated with the outcome, ρ(x1, x2) = 0.8 and ρ(x2, y) = 0.2, representing a strong instrument, plays a large role in the resulting estimates. When the strong instrument is omitted, the base effect of the treatment variable, ρ(x1, y) = 0.7, is estimated as β̂1 = 0.7. When the strong instrument is included, the estimated effect of the treatment variable is β̂1 = 1.5. The mean and median estimated effects of the treatment are 1.2 and 1.1, respectively. The ensemble estimate of the treatment effect is 1.2 with a standard error of 0.3072394. In the case of a strong instrument, the change in estimates reflects the consequence of including a strong instrument as a control when using OLS. This change is also displayed by the mean, median, and ensemble estimates. Below are the tables including the correlations, model specifications, estimated effects of the treatment, estimated standard errors of the treatment, mean estimated effect, median estimated effect, and extended ensemble estimate.

Table 5.3.1 Correlation structure for strong instrument
Corr(.,.)    y       x1      x2      x3
y            1       0.7     0.2     0.2
x1           0.7     1       0.8     0.1
x2           0.2     0.8     1      -0.3
x3           0.2     0.1    -0.3     1

Table 5.3.2 Model specifications
Model               X1 Estimate    X1 Standard Error    X1 Est/SE
y ~ 1+x1+x2+x3      1.900943       0.043394             43.80694
y ~ 1+x1+x3         0.686869       0.070563             9.734172
y ~ 1+x1+x2         1.5            0.06455              23.2379
y ~ 1+x1            0.7            0.071414             9.801961

Table 5.3.3 Extended Ensemble Estimation results
          X1 Estimate
Mean      1.196953
Median    1.1
EEE       1.211574 (0.3072394)

5.5.4 Weak Instrumental Variable

Simulating a variable that is weakly correlated with the treatment and outcome, ρ(x1, x2) = ρ(x2, y) = 0.1, representing a weak instrument, plays a small role in the resulting estimates. When the weak instrument is omitted, the base effect of the treatment variable, ρ(x1, y) = 0.7, is estimated as β̂1 = 0.7. When the weak instrument is included, the estimated effect of the treatment variable is β̂1 = 0.69. The mean and median estimated effects of the treatment are both 0.69. The ensemble estimate of the treatment effect is 0.69 with a standard error of 0.1585774. The lack of change in estimates is due to the lack of effectiveness of the included weak instrument. The lack of change in the mean, median, and ensemble estimates is evidence of robust estimation in the presence of a weak instrument. Below are the tables including the correlations, model specifications, estimated effects of the treatment, estimated standard errors of the treatment, mean estimated effect, median estimated effect, and extended ensemble estimate.

Table 5.4.1 Correlation structure for weak instrument
Corr(.,.)    y       x1      x2      x3
y            1       0.7     0.1     0.2
x1           0.7     1       0.1     0.1
x2           0.1     0.1     1      -0.3
x3           0.2     0.1    -0.3     1

Table 5.4.2 Model specifications
Model               X1 Estimate    X1 Standard Error    X1 Est/SE
y ~ 1+x1+x2+x3      0.676471       0.070828             9.550863
y ~ 1+x1+x3         0.686869       0.070563             9.734172
y ~ 1+x1+x2         0.69697        0.07171              9.719274
y ~ 1+x1            0.7            0.071414             9.801961

Table 5.4.3 Extended Ensemble Estimation results
          X1 Estimate
Mean      0.690077
Median    0.691919
EEE       0.690033 (0.1585774)

5.5.5 Randomized Control Trial

Simulating a variable x2 that is weakly correlated with the treatment and strongly correlated with the outcome, ρ(x1, x2) = 0.2 and ρ(x2, y) = 0.8, representing a randomized control trial with an ANCOVA design, plays a moderate role in the resulting estimates. When the grouping variable is omitted, the base effect of the treatment variable, ρ(x1, y) = 0.7, is estimated as β̂1 = 0.7. When x2 is included, the estimated effect of the treatment variable is β̂1 = 0.56. The mean and median estimated effects of the treatment are 0.62 and 0.63, respectively. The ensemble estimate of the treatment effect is 0.58 with a standard error of 0.102087. The change in estimates is due to the importance of randomization, and the similar mean, median, and ensemble estimates coincide with and confirm the estimate obtained when the randomized design is accounted for. Note that the lower estimated treatment effect from the extended ensemble estimate is due to the weighting by standard errors. The estimated treatment effects of 0.53 and 0.56, whose models include x2 (accounting for the RCT design), are more precise in terms of standard errors (0.0151 and 0.0242, respectively) than the estimated treatment effects of 0.69 and 0.7, whose models exclude x2 (0.0717 and 0.0714, respectively). The Extended Ensemble Estimate, 0.58, results from favoring the more precise estimates, since the Extended Ensemble Estimate is weighted for precision.
If interpreted in terms of information, the Extended Ensemble Estimate will favor estimates that provide more precise information.

Table 5.5.1 Correlation structure for Randomized Control Trial
Corr(.,.)    y       x1      x2      x3
y            1       0.7     0.8     0.1
x1           0.7     1       0.2     0.1
x2           0.8     0.2     1      -0.2
x3           0.1     0.1    -0.2     1

Table 5.5.2 Model specifications
Model               X1 Estimate    X1 Standard Error    X1 Est/SE
y ~ 1+x1+x2+x3      0.534368       0.015052             35.50265
y ~ 1+x1+x3         0.69697        0.07171              9.719274
y ~ 1+x1+x2         0.5625         0.024206             23.2379
y ~ 1+x1            0.7            0.071414             9.801961

Table 5.5.3 Extended Ensemble Estimation results
          X1 Estimate
Mean      0.623459
Median    0.629735
EEE       0.576734 (0.102087)

5.5.6 Discussion of Simulation Results

One feature of extended ensemble estimation is the ability to utilize the precision of each estimate through its standard error. When a strong pre-test is present, the estimated treatment effect decreases from 0.7 to 0.17 with comparable precision. If the pre-test is strong, researchers may want to carefully consider which effect to base conclusions on. The ensemble estimate gives researchers a framework to weigh the two estimates: is the strong pre-test evidence of a lack of an actual treatment effect, or is there a treatment effect worth reporting? The extended ensemble estimate of 0.27 leans toward a lack of treatment effect in the presence of a strong pre-test. In the case of a weak instrument, researchers must be careful to avoid biased and inconsistent estimates. In this scenario, the extended ensemble estimate of 0.69 is evidence favoring the base treatment effect through the weighting for precision. A treatment effect estimated by ordinary least squares in the presence of a weak instrument may suffer from a lack of precision, and thus would receive less weight in the ensemble estimate. In the case of randomization, the change in the estimated effect from 0.7 to 0.56 is evidence of the importance of random assignment. The extended ensemble estimate of 0.58 reflects favoring the more precise estimate of the treatment effect when random assignment is present.

CHAPTER 6 CASE STUDY – EFFECTS OF KINDERGARTEN RETENTION ON CHILDREN'S COGNITIVE GROWTH IN READING AND MATHEMATICS

6.1 Introduction

Retention policy in school has been controversial and unresolved for nearly a century (Dong 2010, Goos, Pipa & Peixoto 2021, Holmes 1989, Jackson 1975, Jimerson 2001, Park, Steiner & Kaplan 2018, Robertson 2021, Shepard 1989). Empirical evidence ranges from negative effects of retention on academic achievement and on personal and social development, to no statistically detectable effect, to a smaller portion of studies showing evidence in support of retention. More recently, emphasis has been put on educational standards and accountability within schools, assisting in the increased popularity of grade retention (Hauser et al. 2004, Jimerson & Kaufman 2003, McCoy and Reynolds 1999). Many states ended social promotion, in which all students are promoted to maintain homogeneity of age within classrooms, by the year 2000, with many schools having adopted grade retention at most grade levels, including kindergarten (Ellwein & Glass 1989, Hauser 1998, Roderick et al. 1999). With empirical evidence varied and recommendations unsettled, North Carolina's retention rate doubled from 1992 to 2002 (Early et al. 2003).
While the evidence and opinions of researchers remain split, the rise of research on kindergarten retention in the last 20 years suggests that researchers agree on the importance of kindergarten and of getting retention policies right in terms of the best possible outcomes for students. The differences across findings extend to methodologies and to how the effects of retention should be estimated. These discrepancies often stem from trying to account for the lack of ability and practicality to use large-scale experimental designs, such as randomized control trials. Existing studies often rely on nonexperimental data, such as observational or cross-sectional data, which raise natural issues when attempting to determine causal effects. In such a situation, where the researcher cannot know how a promoted student would have performed had they been retained, they are forced to estimate a non-observable scenario, referred to as the counterfactual. Specifically, in the context of grade retention, when a student has been retained, the counterfactual represents the scenario in which that student had been promoted instead. Likewise, the counterfactual pertaining to a promoted student would be the case in which they had been retained. Propensity score matching and propensity score stratification are often used to help estimate causal effects associated with counterfactuals by adjusting for potential selection bias (Rosenbaum & Rubin 1983). Widely cited work by Hong & Raudenbush (2005) utilizes propensity stratification, using 207 pre-treatment covariates found to be associated with kindergarten retention in order to estimate a child's likelihood of being retained. Work by Dong in 2010 used a control function in tandem with an instrumental variables approach to compare estimates with Nearest Neighbor Matching, proposed by Abadie & Imbens (2001). These methods rely on many assumptions and model specifications, such as the assumption of unconfoundedness, that is, that there are no unobserved covariates that play a role in selection. Although a rich and extensive set of covariates can be useful when contemplating the assumption of unconfoundedness, it does not guarantee that all necessary covariates have been collected. Together, the resulting estimates produced by these various methodologies may still hinge on choices such as model specification, estimation method, which covariates to use as instruments, which covariates to include when estimating propensity scores, how many strata to use when grouping propensity scores, or whether to utilize case weights. Although meta-analyses have been conducted to compare results across studies on kindergarten retention, work in this area, like that of Hong and Raudenbush (2005), stands as an example that may benefit from a within-study sensitivity analysis, namely Extended Ensemble Estimation, to help quantify the robustness of the estimated effects. Namely, how sensitive are findings in kindergarten retention, such as those of Hong and Raudenbush (2005), to specifications made by the researcher?

6.2 Description of Case Study

Hong and Raudenbush (2005) considered, among other questions, what the effect of kindergarten retention is on those students who were retained. That is, how much more or less would kindergarten retainees have learned had they been promoted as opposed to being retained?
Utilizing data from the Early Childhood Longitudinal Study Kindergarten cohort (ECLS-K) from the US National Center for Education Statistics (NCES), Hong and Raudenbush (2005) considered a nationally representative sample of 20,000 kindergarten students that included a rich set of covariates regarding the students, their families, teachers, and schools across the kindergarten and first-grade years from Fall 1998 to Spring 2000. This rich set of covariates served multiple purposes. The authors had a deep set of covariates to use as controls in order to account for the nested structure of students within schools, as well as to more effectively employ propensity score stratification to address the estimation of the counterfactual. Since a student is either retained or promoted, propensity score stratification uses a fixed set of covariates to compute a propensity score for each student, in this case the student's conditional probability of being retained given pretreatment personal and classroom characteristics, school characteristics, and the residual random effect of the student's pretreatment school. Formally, the estimated individual-level propensity score, π̂_ij, for student i from a retention-policy pretreatment school j is

π̂_ij = Pr(R_ij = 1 | S_j = 1, X_ij, W_j, u_j*)

where R_ij is an indicator of retention for student i, S_j is an indicator of a retention policy for school j, X_ij are the pretreatment personal and classroom characteristics, W_j are school-level characteristics, and u_j* is the residual random effect of the pretreatment school j. It is useful to note that in order for a student to be retained, R_ij = 1, it is necessary for that student's school to have a retention policy, S_j = 1; this also implies that no child was retained in a school without a retention policy. This propensity score served as a gauge of the risk of a student repeating kindergarten. If a student's propensity score was too low to match to a student who had been retained, that student was considered to be at no risk of retention, while students who had such matches were considered to be at risk of retention. To account for the varying degrees of risk of retention, the logits of the estimated propensity scores were split into 15 strata, which were balanced using the 207 pretreatment covariates. It is useful to note that eight retainees in the last defined stratum did not match with any students in the promoted group, thus the analysis utilized the other 14 strata.
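As a simplified, single-level illustration of this stratification step (not the authors' actual multilevel propensity model), the sketch below estimates propensity scores with a logistic regression on hypothetical pretreatment covariates and splits students into strata on the logit scale. All variable names, coefficient values, and the number of strata are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical pretreatment covariates and a retention indicator whose
# probability depends on them (a stand-in for the rich ECLS-K covariate set).
X = rng.normal(size=(n, 5))
p_true = 1 / (1 + np.exp(-(-2.0 + X @ np.array([0.8, -0.5, 0.3, 0.0, 0.2]))))
retained = rng.binomial(1, p_true)

# Estimate each student's propensity of retention given the covariates.
model = LogisticRegression(max_iter=1000).fit(X, retained)
p_hat = model.predict_proba(X)[:, 1]
logit_p = np.log(p_hat / (1 - p_hat))

# Split the logit of the propensity score into strata (15 in Hong & Raudenbush;
# fewer here purely for illustration) using quantile cut points.
n_strata = 5
edges = np.quantile(logit_p, np.linspace(0, 1, n_strata + 1))
strata = np.digitize(logit_p, edges[1:-1])

# Within each stratum, retained and promoted students can then be compared.
for s in range(n_strata):
    mask = strata == s
    print(s, mask.sum(), retained[mask].mean().round(3))
```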
In order to estimate the effect of kindergarten retention, Hong and Raudenbush utilized a two-level hierarchical linear model, specified below:

Y_ij = γ0 + δ R_ij + γ1 logit(π̂)_ij + Σ_s γ_s S_s,ij + γ2 TIME_ij + u_0j + u_1j R_ij + e_ij

Their model included both fixed and random effects to model the outcome variable Y_ij, the assessment score for student i in school j. They considered both math and reading assessments as outcomes, resulting in estimated effects for both math and reading. These outcome variables, found within the ECLS-K data, are scale scores calibrated using item response theory (Hambleton, Swaminathan, & Rogers, 1991). Each subject had up to four repeated assessments over the two study years, which were then equated on the same scale. This standardization allows researchers to assess students' math and reading growth over time and to compare the achievement of students from different grade levels. The fixed-effect predictors for student i in school j included the binary indicator of whether or not student i was a kindergarten retainee, R_ij; the logit of the estimated propensity score, logit(π̂)_ij; the propensity strata, S_s,ij; and the duration of time between the beginning of the school year and when the student took the assessment, TIME_ij. The random effects included an intercept, u_0j, to capture the setting-specific increment to a child's learning outcome, and a coefficient on the retention indicator, u_1j, to capture the setting-specific increment to a child's retention effect. The coefficient of the retention effect, δ, was the main interest in answering the authors' research question: what was the effect of kindergarten retention on those students who were retained? In the extended ensemble estimation framework, retention is the treatment variable of interest, while the effect to be estimated, δ, is the estimate to examine for sensitivity and robustness across possible alternative specifications. Namely, δ is the estimated difference in a child's assessment due to being retained.

6.2.1 Findings from Hong and Raudenbush

Hong and Raudenbush estimated the fixed effect of retention on reading achievement to be -9.01 with a standard error of 0.68. Specifically, if a promoted child had been retained instead, they estimated that the expected reading achievement score would be 9.01 points lower at the end of the treatment year. They estimated the fixed effect of retention on math achievement to be -5.89 with a standard error of 0.50; that is, if a promoted child had been retained instead, the expected math achievement score would be 5.89 points lower at the end of the treatment year. Extended ensemble estimation can be used at this stage to estimate these effects under specifications other than the one chosen by the authors, to gauge the sensitivity and robustness of these estimates. In other words, how much do the negative effects of retention estimated by the authors depend on their particular specifications? And, accounting for the precision of the estimated effects of retention under various specifications, what is the weighted effect of kindergarten retention on math and reading achievement scores?

6.3 Extended Ensemble Estimation

In order to carry out the extended ensemble estimation process, one may start by controlling for different covariates within the modeling stage that may explain observed differences between retainees and promoted students in their assessments. As noted before, ensemble techniques generally perform better when the pool of models is rich but not oversaturated by poor models, so selecting covariates to control for outside of the authors' specifications requires care, as a poor pool of models may render the extended ensemble estimation process less informative. Using the rich set of covariates within the ECLS-K data, nine covariates, including students' prior math scores, reading scores, general knowledge scores, and school ID, can be used to create 512 alternative specifications, as sketched below.
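The following sketch shows how such a specification pool can be generated: every subset of nine candidate control variables yields 2^9 = 512 regressions of the outcome on retention plus that subset. The simulated data, variable names, and the plain OLS fits used here are hypothetical placeholders for the ECLS-K variables and for the authors' multilevel modeling choices.

```python
import numpy as np
from itertools import combinations

def specification_pool(y, treatment, controls):
    """Fit y ~ treatment + every subset of the candidate controls by OLS.

    controls : dict mapping a control's name to its data column
    Returns a list of (included_controls, effect_estimate, standard_error).
    """
    names = list(controls)
    n = len(y)
    results = []
    for k in range(len(names) + 1):                   # subset sizes 0..9 -> 2^9 specs
        for subset in combinations(names, k):
            X = np.column_stack([np.ones(n), treatment] +
                                [controls[c] for c in subset])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
            sigma2 = resid @ resid / (n - X.shape[1])
            se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
            results.append((subset, beta[1], se))     # coefficient on the treatment
    return results

# Hypothetical stand-ins for reading score, retention, and nine candidate controls
rng = np.random.default_rng(0)
n = 1_000
controls = {f"c{i}": rng.normal(size=n) for i in range(9)}
retention = rng.binomial(1, 0.1, size=n).astype(float)
reading = -9.0 * retention + sum(controls.values()) + rng.normal(size=n)

pool = specification_pool(reading, retention, controls)
print(len(pool))                                      # 512 specifications
```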
6.3.1 Results of Extended Ensemble Estimation

The distribution of estimated retention effects for the 512 alternative specifications is shown below. When excluding all controls, the estimated retention effect on reading scores is -21. The mean and median retention effects are both -10. When weighting for precision, the ensemble retention effect is -10.0433 with a standard error of 0.0842. Although the mean and median estimated effects of -10, compared with -21 when excluding all controls, stand as evidence that there are factors that should be accounted for in the relationship between retention and achievement, the estimated effect of retention on reading achievement reported by the authors remained the most conservative estimated effect of retention, even after weighting with the ensemble estimate. It is also worth noting that the difference in estimated retention effects was small when using the wealth of covariates for propensity score stratification as opposed to using pre-test controls, as the mean and median estimated effects were both -10 and the authors' estimated effect was -9. Using the Bayesian framework in Chapter 3, inspired by Frank and Min (2007), an updated retention effect on reading scores can be calculated by utilizing the respective standard errors:

β̂_updated = ω β̂_Hong&Raudenbush + (1 − ω) β̂_EEE

where ω represents the efficiency of the Extended Ensemble Estimate, specifically

ω = (SE_EEE)² / [ (SE_EEE)² + (SE_Hong&Raudenbush)² ] = (0.0842)² / [ (0.0842)² + (0.68)² ] = 0.0151

Thus, the updated estimated retention effect on reading scores is

β̂_updated = (0.0151)(−9.01) + (1 − 0.0151)(−10.0433) = −10.0277

Figure 6.3.1 Distribution of estimated treatment effects

6.4 Discussion

Since students are nested within schools, students in one school may share a number of attributes relevant to retention that are not shared by students at other schools. From a statistical standpoint, this violation of independence may reduce the effective sample size. That is, 500 student observations that are not independent may behave like a much smaller effective sample, depending on how correlated the observations are in terms of measured attributes. Since standard errors shrink as sample size grows, ignoring dependence can result in underestimating variance, or overestimating the accuracy of the effect in question (Raudenbush & Bryk, 2002; Gelman and Hill, 2007). If retention is not independent of school-level characteristics, then controlling for schools with random effects may not be sufficient to eliminate bias; generally, clustering needs to be independent of treatment when accounting for clustering as a random effect. Randomizing treatment to students instead of schools would be one solution to this problem. When the ideal experimental design is not possible, as with kindergarten retention where randomization is not feasible, Theobald and Freeman (2014) detail how controlling for student nonequivalence is crucial. This is demonstrated by Hong and Raudenbush's use of the rich set of student-level covariates. In cases where such a specification is inappropriate or specifications leave out important covariates, such estimates should be the minority of many estimates from a pool of plausible specifications. Correlations across estimated treatment effects may indicate misspecification, specifically in the case of missing important covariates. For example, a large portion of specifications missing an important covariate could all result in under- or overestimated treatment effects. Thus, examining the pool of plausible specifications for such correlated estimates, in order to assess the pool of plausible alternative specifications, should be a priority of the user.
As long as the pool of plausible alternative specifications is rich, the influence of mis-specifications are limited. Simulations could be used to test the sensitivity of missing covariates that impact treatment effects. A step-by-step process to achieve this would start by simulating sample data from a pre-specified model with a known treatment effect. Once the sample data is simulated, one would create specifications to estimate the known treatment effect, proceeding with the extended ensemble estimation process to calculate the distribution of estimated treatment effects and the extended ensemble estimate. In order to test the sensitivity of misspecification, one would create a pool of specifications that, for example, are missing a key covariate. Then, correlations among models that omit the same covariate can be observed over many specifications. Performing extended ensemble estimation on this pool of poor specifications 58 would result in an extended ensemble estimate that could be compared to the known estimated treatment effect. The findings from extended ensemble estimation regarding kindergarten retention suggest that the estimated retention effect on reading achievement reported by the authors contains a moderate level of robustness, while also erroring on the side of conservative, relative to the effects estimated using the extended ensemble estimation approach. They also suggest that since the authors findings were among the most conservative, there is evidence of a measurable effect of kindergarten retention on reading achievement that is significant. 59 BIBLIOGRAPHY Angrist, J., & Imbens, G. (1995). Identification and estimation of local average treatment effects. Angrist, J. D., & Pischke, J. S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of economic perspectives, 24(2), 3-30. Barreto, H., & Howland, F. (2006). Introductory econometrics: using Monte Carlo simulation with Microsoft excel. Cambridge University Press. Becker, B. J. (1992). Using results from replicated studies to estimate linear models. Journal of Educational Statistics, 17(4), 341-362. Becker, B. J. (1995). Corrections to “Using results from replicated studies to estimate linear models”. Journal of Educational and Behavioral Statistics, 20(1), 100-102. Becker, B. J., & Aloe, A. M. (2019). Model-based meta-analysis and related approaches. The handbook of research synthesis and meta-analysis, 339-363. Berger, J. O. (1990). Robust Bayesian analysis: sensitivity to the prior. Journal of statistical planning and inference, 25(3), 303-328. Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 317-335. Berkey, C. S., Hoaglin, D. C., Mosteller, F., & Colditz, G. A. (1995). A random‐effects regression model for meta‐analysis. Statistics in medicine, 14(4), 395-411. Breiman, L. (1996). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA. Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1-32. Bullock, J. G., Green, D. P., & Ha, S. E. (2010). Yes, but what’s the mechanism?(don’t expect an easy answer). Journal of personality and social psychology, 98(4), 550. Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83-87. Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., & Liu, J. 
(2013). A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78(4), 685-709. Clarke, K. A. (2005). The phantom menace: Omitted variable bias in econometric research. Conflict management and peace science, 22(4), 341-352. 60 Cohen, P., West, S. G., & Aiken, L. S. (2014). Applied multiple regression/correlation analysis for the behavioral sciences. Psychology press. Cook, T. D. (2002). Randomized experiments in educational policy research: A critical examination of the reasons the educational evaluation community has offered for not doing them. Educational evaluation and policy analysis, 24(3), 175-199. De Vaus, D. (2001). Research design in social research. Research design in social research, 1- 296. Deaton, A., & Cartwright, N. (2018). Understanding and misunderstanding randomized controlled trials. Social science & medicine, 210, 2-21. DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled clinical trials, 7(3), 177-188. Dong, Y. (2010). Kept back to get ahead? Kindergarten retention and academic performance. European Economic Review, 54(2), 219-236. Early, D., Bushnell, M., Clifford, R., Konanc, E., Maxwell, K., Palsha, S., & Roberts, L. (2003). North Carolina early grade retention in the age of accountability. Chapel Hill: The University of North Carolina, FPG Child Development Institute, 4. Ellwein, M. C., & Glass, G. V. (1989). Ending social promotion in Waterford: Appearances and reality. Flunking grades: Research and policies on retention, 151-173. Fisher, R. A. (1924). 035: The Distribution of the Partial Correlation Coefficient. Frank, K. (2000). Impact of a Confounding Variable on the Inference of a Regression Coefficient. Sociological Methods and Research, 29(2), 147-94. Frank, K. A., Maroulis, S. J., Duong, M. Q., & Kelcey, B. M. (2013). What would it take to change an inference? Using Rubin’s causal model to interpret the robustness of causal inferences. Educational Evaluation and Policy Analysis, 35(4), 437-460. Frank, K., & Min, K. S. (2007). 10. Indices of Robustness for Sample Representation. Sociological Methodology, 37(1), 349-392. Gadbury, G. L., & Allison, D. B. (2012). Inappropriate fiddling with statistical analyses to obtain a desirable p-value: tests to detect its presence in published literature. Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press. 61 Gelman, A., & Loken, E. (2014). The statistical crisis in science data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. American scientist, 102(6), 460. Goos, M., Pipa, J., & Peixoto, F. (2021). Effectiveness of grade retention: A systematic review and meta-analysis. Educational Research Review, 34, 100401. Greene, W. H. (2003). Econometric analysis. Pearson Education India. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Sage. Hauser, R. M. (2000). Should We End Social Promotion? Truth and Consequences. Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica: Journal of the econometric society, 153-161. Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Hedges, L. V., & Vevea, J. L. (1998). Fixed-and random-effects models in meta- analysis. Psychological methods, 3(4), 486. Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., ... & Yarkoni, T. (2021). 
Integrating explanation and prediction in computational social science. Nature, 595(7866), 181-188. Holland, P. W. (1986). Statistics and causal inference. Journal of the American statistical Association, 81(396), 945-960. Hong, G., & Raudenbush, S. W. (2005). Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics. Educational evaluation and policy analysis, 27(3), 205-224. Hong, G., & Yu, B. (2008). Effects of kindergarten retention on children's social-emotional development: an application of propensity score method to multivariate, multilevel data. Developmental Psychology, 44(2), 407. Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Sage. Imbens, G. W. (2010). Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic literature, 48(2), 399-423. Hong, G., & Yu, B. (2007). Early-grade retention and children’s reading and math learning in elementary years. Educational evaluation and policy analysis, 29(4), 239-261. 62 Jackman, S. (2009). Bayesian analysis for the social sciences. John Wiley & Sons. Jimerson, S. R. (2001). Meta-analysis of grade retention research: Implications for practice in the 21st century. School psychology review, 30(3), 420-437. Kaplan, D. (2014). Bayesian statistics for the social sciences. Guilford Publications. Klass, G. (2008). Just plain data analysis: Common statistical fallacies in analyses of social indicator data. Statlit. org, 6. Laan, A., Madirolas, G., & De Polavieja, G. G. (2017). Rescuing collective wisdom when the average group opinion is wrong. Frontiers in Robotics and AI, 4, 56. Laird, N. M., & Mosteller, F. (1990). Some statistical methods for combining experimental results. International journal of technology assessment in health care, 6(1), 5-30. Lee, P. M. (1989). Bayesian statistics (pp. 54-5). London:: Oxford University Press. LeSage, J. P., & Parent, O. (2007). Bayesian model averaging for spatial econometric models. Geographical Analysis, 39(3), 241-267. John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23(5), 524-532. Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists (Vol. 1). New York: Springer. Madigan, D., & Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89(428), 1535-1546. Madigan, D., York, J., & Allard, D. (1995). Bayesian graphical models for discrete data. International Statistical Review/Revue Internationale de Statistique, 215-232. McNeish, D. (2016). On using Bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5), 750-773. Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American statistical Association, 78(381), 47-55. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press. Nelson, C., & Startz, R. (1988). Some further results on the exact small sample properties of the instrumental variable estimator. 63 Oster, E. (2019). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2), 187-204. Park, S., Steiner, P. M., & Kaplan, D. (2018). 
Identification and sensitivity analysis for average causal mediation effects with time-varying treatments and mediators: Investigating the underlying mechanisms of kindergarten retention policy. psychometrika, 83(2), 298-320. Raudenbush, S. W. (2009). Analyzing effect sizes: Random-effects models. The handbook of research synthesis and meta-analysis, 2, 295-316. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). sage. Robbins, H. E. (1992). An empirical Bayes approach to statistics. In Breakthroughs in statistics (pp. 388-394). Springer, New York, NY. Robertson, R. M. (2021). To Retain or Not Retain: A Review of Literature Related to Kindergarten Retention. Online Submission Roderick, M., Bryk, A. S., Jacob, B. A., Easton, J. Q., & Allensworth, E. (1999). Ending Social Promotion: Results from the First Two Years. Charting Reform in Chicago Series 1. Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55. Jimerson, S. R., & Renshaw, T. L. (2012). Retention and social promotion. Principal Leadership, 13(1), 12-16. Saez-Rodriguez, J., Costello, J. C., Friend, S. H., Kellen, M. R., Mangravite, L., Meyer, P., ... & Stolovitzky, G. (2016). Crowdsourcing biomedical research: leveraging communities as innovation engines. Nature Reviews Genetics, 17(8), 470-486. Shepard, L. A., & Smith, M. L. (1987). Effects of kindergarten retention at the end of first grade. Psychology in the Schools, 24(4), 346-357. Sidik, K., & Jonkman, J. N. (2005a). A note on variance estimation in random effects meta- regression. Journal of biopharmaceutical statistics, 15(5), 823-838. Sidik, K., & Jonkman, J. N. (2005b). Simple heterogeneity variance estimation for meta‐ analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2), 367-384. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2016). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. 64 Stock, J. H., Wright, J. H., & Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4), 518-529. Su, H., & Chen, H. (2015). Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239. Sumiati, I., Handoyo, F., & Purwani, S. (2020). Multiple linear regression using Cholesky decomposition. World Scientific News, 140, 12-25. Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random- effects model. Journal of Educational and Behavioral Statistics, 30(3), 261-293. Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of mathematical psychology, 44(1), 92-107. Wooldridge, J. M. (2009). Omitted variable bias: the simple case. Introductory Econometrics: A Modern Approach. Mason, OH: Cengage Learning, 89-93. Young, C. (2018). Model uncertainty and the crisis in science. Socius, 4, 2378023117737206 65