THE BAYESIAN PARADIGM OF ROBUSTNESS INDICES OF CAUSAL INFERENCES

By

Tenglong Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods - Doctor of Philosophy

2018

ABSTRACT

The validity of a causal inference hinges on a research design with both strong internal validity and strong external validity (Shadish et al. 2002). Unfortunately, such designs are rare, so causality is typically inferred from either a small-scale randomized experiment or a large-scale observational study (Schneider et al. 2007). In light of this gap, Frank et al. (2013) proposed the robustness indices of causal inferences, which measure the robustness of a causal inference by quantifying the proportion of the observed sample that would need to be replaced with unfavorable unobserved cases. Drawing on the Bayesian discussion in Frank & Min (2007), this dissertation proposes to develop a Bayesian framework of the robustness indices of causal inferences for causal research with either limited internal validity or limited external validity.

This dissertation has two chapters. The first chapter lays the foundation of the Bayesian paradigm of robustness indices by formally defining the prior as a distribution built on an unobserved sample. For a particular family of prior and likelihood distributions, the posterior can be interpreted as a distribution built on an ideal sample. The Bayesian paradigm of robustness indices of causal inferences focuses on the relationship between the posterior probability of invalidating an inference and the unobserved sample statistics, and its central task is to locate the threshold of an unobserved sample statistic for a given value of the posterior probability of invalidating an inference.
Because the first chapter targets only the simple group-mean-difference estimator, the second chapter extends the Bayesian paradigm of robustness indices to regression models. This dissertation promotes the scientific discourse of causality and critical thinking by linking the probability of invalidating an inference to detailed thought experiments characterized by the thresholds of sufficient statistics pertaining to an unobserved sample.

Copyright by
TENGLONG LI
2018

ACKNOWLEDGEMENTS

Throughout my doctoral study, I have been blessed to receive tremendous help and support from so many people. First and foremost, I am extremely grateful to my advisor Prof. Kenneth Frank, who has instilled in me a scientific attitude and rigorous thinking and imparted to me immense knowledge of causal inference and robustness indices. He has shown me what a good researcher is, not only through his advising but also through his own practice. In particular, this dissertation could not exist, let alone have evolved into its current shape, without his guidance and support. I also owe a debt of gratitude to Prof. Maier, who has offered me many insights and much advice on Bayesian statistics, and to Prof. Konstantopoulos, who has given me invaluable suggestions on causal inference and teaching. I am indebted to Prof. Imberman, who introduced me to the economic perspective on causal inference and the relevant literature, which has truly broadened my horizons.

This is the best opportunity to express my sincere gratitude to my beloved family. I could not have concentrated on my work or made any progress without the support and love of my wife Senmu Zheng. Finally, my parents deserve special thanks for bringing me into this world, encouraging me to pursue my passion, supporting me wholeheartedly and much more.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1: The Bayesian paradigm of robustness indices of causal inferences
  1-Introduction
    1.1-The robustness indices of causal inferences
    1.2-The conceptualization of unobserved sample
    1.3-Previous work on the Bayesian framework of the robustness indices
    1.4-Purposes of this study
  2-The unifying framework of the robustness indices of causal inferences
    2.1-The frequentist recipe
    2.2-The Bayesian recipe
  3-The Bayesian models of robustness indices for internal validity
  4-The Bayesian models of robustness indices for external validity
  5-Statistical threshold and Bayesian models for replacing observed cases
    5.1-Appropriate statistical threshold $\delta^{\#}$ for Bayesian models of robustness indices
    5.2-The Bayesian models of robustness indices for replacing observed cases
  6-Demonstrative examples
    6.1-The Bayesian robustness indices of the effect of OCR on reading achievement
    6.2-The Bayesian robustness indices of the effect of kindergarten retention on reading achievement
  7-Discussion and conclusion
    7.1-Features of the Bayesian paradigm of robustness indices
    7.2-Comparisons with other similar approaches
      7.2.1-The robustness indices in Frank et al. (2013)
      7.2.2-The robustness indices in Frank & Min (2007)
      7.2.3-The bounds on treatment effect in Manski (1990) and Lee (2009)
    7.3-Limitations
    7.4-Conclusion
Chapter 2: The Bayesian paradigm of robustness indices of causal inferences for regression models
  1-Introduction
    1.1-Regression-based causal inference
    1.2-The philosophy of robustness indices
    1.3-Research objectives
  2-The unifying framework of robustness indices for regression-based causal inference
    2.1-Setting and notation
    2.2-The frequentist recipe
    2.3-The Bayesian recipe
  3-Bayesian models of robustness indices for raw data
    3.1-Data and the sample statistics
    3.2-The posterior distribution of  w for raw data
    3.3-Probit models for the probability of invalidating an inference
    3.4-External validity and internal validity
  4-Bayesian models of robustness indices for centered and standardized data
    4.1-For centered data
    4.2-For standardized data
  5-Appropriate statistical threshold and Bayesian models for replacing observed cases
    5.1-Appropriate statistical threshold
    5.2-The Bayesian models of robustness indices for replacing observed cases
  6-Illustrative examples
    6.1-The Bayesian robustness indices of the effect of Open Court Reading on reading achievement
    6.2-The Bayesian robustness indices of the effect of kindergarten retention on reading achievement
  7-Discussion
    7.1-A summary of the Bayesian paradigm of robustness indices for regression-based causal inference
    7.2-Comparisons with other similar approaches
      7.2.1-The impact thresholds in Frank (2000)
      7.2.2-The robustness indices in Frank & Min (2007)
      7.2.3-The robustness indices in Frank et al. (2013)
    7.3-Limitations
    7.4-Conclusion
APPENDICES
  Appendix A: Proofs of Theorem 1 and Theorem 2
  Appendix B: The Algebraic Equivalence Between Theorem 1 and Common Expressions of Regression Coefficients
REFERENCES

LIST OF TABLES

Table 1.1: Thresholds of α when  R is fixed as 0.46
Table 1.2: Thresholds of  R when α is fixed as 1
Table 1.3: Thresholds of α when π is fixed as 0.0617
Table 2.1: Thresholds of $n^{un}$ assuming $\hat{\,\cdot\,}^{un}_{WY} = 0$
Table 2.2: Thresholds of $r^{un}_{WY}$

LIST OF FIGURES

Figure 1.1: The structure of ideal population in Hong & Raudenbush (2005)
Figure 1.2: The structure of ideal population in Borman et al. (2008)
Figure 1.3: The structure of ideal population in both scenarios
Figure 1.4: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008) ( R is fixed as 0.46)
Figure 1.5: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008) (α is fixed as 1)
Figure 1.6: The relationship between  and the probability of invalidating the inference of Hong & Raudenbush (2005)
Figure 1.7: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Hong & Raudenbush (2005)
Figure 2.1: The structure of ideal population in Hong & Raudenbush (2005)
Figure 2.2: The structure of ideal population in Borman et al. (2008)
Figure 2.3: The structure of ideal population in both scenarios
Figure 2.4: The relationship between $n^{un}$ and the probability of invalidating the inference of Borman et al. (2008)
Figure 2.5: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008)
Figure 2.6: The relationship between $r^{un}_{WY}$ and the probability of invalidating the inference of Hong & Raudenbush (2005)
Figure 2.7: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Hong & Raudenbush (2005)

Chapter 1: The Bayesian paradigm of robustness indices of causal inferences

1-Introduction

1.1-The robustness indices of causal inferences

The issues of reproducibility and generalizability have plagued the scientific community.
For example, the Open Science Collaboration (2015) reported that a substantial proportion of the selected psychological studies could not be replicated by other parties. To promote the replicability, and possibly the generalizability, of published research, various scholars and organizations have called for higher standards and more rigorous checks of research designs and statistical procedures. These concerns become even stronger for research that attempts to support causal inferences, since one has to wrestle with both the internal validity and the external validity of the design. Due to the nature of causal inference, researchers can never rule out all possible threats to both internal validity and external validity, so they are often uncertain about the degree to which they can justify or reject their conclusions. In response to this difficulty, different scholars have proposed analyses of the robustness or sensitivity of causal inferences. The robustness indices suggested by Frank et al. (2013) are of particular interest because they arise naturally from the context of empirical research.

The idea of the robustness indices in Frank et al. (2013) is straightforward. There are three key quantities in this framework: the estimated effect $\hat{\delta}$, the threshold $\delta^{\#}$ and the population effect $\delta$. The estimated effect is the effect researchers estimate based on their obtained samples and research designs. The threshold is a fixed value predetermined by the researchers, against which they compare their estimated effect. For instance, in order to claim that attending Catholic high schools enhances the academic achievement of students, one has to obtain an estimated effect of Catholic school attendance on students' test scores and show that it is larger than the threshold set for the study.
Usually this threshold is chosen to coincide with the threshold for statistical significance, given the research hypothesis and the collected sample. The population effect remains unknown, as always in empirical research. According to Frank et al. (2013), the inference is invalid if the following condition is satisfied:

$$\hat{\delta} > \delta^{\#} > \delta \tag{1.1}$$

Or equivalently, if $\beta$ is used to denote the bias:

$$\beta = \hat{\delta} - \delta > \hat{\delta} - \delta^{\#} \tag{1.2}$$

These formulae apply only when a positive effect is inferred. The counterparts of (1.1) and (1.2) for inferring a negative effect follow from the same reasoning:

$$\hat{\delta} < \delta^{\#} < \delta \tag{1.3}$$

$$\beta = \hat{\delta} - \delta < \hat{\delta} - \delta^{\#} \tag{1.4}$$

The remaining arguments of Frank et al. (2013) follow directly from these rules: by partitioning the sample into a part with bias and a part without bias, the robustness indices can be expressed as the proportion of the sample that must be replaced by new data for which the treatment effect is zero. This proportion is interpreted as the replacement necessary to invalidate the inference.

In an empirical context, the estimated effect $\hat{\delta}$ is fixed and the true causal effect $\delta$ is a parameter. The threshold $\delta^{\#}$ could be a subjective choice based on policy implications or an objective choice based on the level of significance, which means $\delta^{\#}$ is not necessarily fixed. Given the natures of $\hat{\delta}$, $\delta$ and $\delta^{\#}$, subtracting $\hat{\delta}$ from both sides of (1.2) and (1.4) simplifies the decision rules further:

$$\begin{cases} \delta < \delta^{\#} & \text{for inferring a positive effect} \\ \delta > \delta^{\#} & \text{for inferring a negative effect} \end{cases} \tag{1.5}$$

Frank et al. (2013) offered two examples, Hong & Raudenbush (2005) and Borman et al. (2008), to illustrate the procedure of quantifying the bias necessary to invalidate an inference using the decision rules above. Hong & Raudenbush (2005) evaluated the effect of kindergarten retention on academic achievement.
In this example, it was impossible to randomly assign the sampled students to the conditions of being retained in kindergarten or being promoted to first grade. According to the Rubin Causal Model (RCM), every sampled student has two potential outcomes: one outcome under the condition of being retained and one outcome under the condition of being promoted. Drawing on the RCM, the only sample that would lead to a true causal inference would consist of the reading scores of all sampled students had they all been retained in kindergarten and the reading scores of all sampled students had they all been promoted to first grade. Such a sample is purely ideal, since no student can be retained and promoted simultaneously. In this case, the bias is induced by the gap between the ideal sample, which consists of the potential reading scores of every sampled student under both retention and promotion, and the observed sample, which has the reading score of every sampled student under either retention or promotion (but not both).

Borman et al. (2008) studied the effect of the Open Court Reading (OCR) curriculum on students' reading achievement by randomly drawing from schools that had expressed strong interest in the program to the publisher of OCR and volunteered for the study. In particular, Frank et al. (2013) pointed out that it would be questionable to generalize the inference based on the observed sample to the population of schools that did not volunteer for the program, since the volunteer schools might benefit more from OCR: they may have had better plans and more experience than the schools that were less attracted to the curriculum and did not volunteer. Consequently, the observed sample might not be representative of the entire population of schools, which includes both volunteer and non-volunteer schools.
In this case, bias is induced by the gap between a random sample of the entire population of schools and the observed sample, which can only represent volunteer schools.

Each of the two examples epitomizes a distinct scenario in which a causal inference is prone to bias and invalidation. Specifically, Hong & Raudenbush (2005) typifies a scenario where external validity is strong because the observed sample is representative of the target population but internal validity is weak due to a lack of randomization. This scenario is referred to as “the first scenario” throughout this paper. On the other hand, Borman et al. (2008) exemplifies a scenario where internal validity is sound because of randomization but external validity is in doubt, as the observed sample can only represent a part of the target population. This scenario is referred to as “the second scenario” henceforth.

1.2-The conceptualization of unobserved sample

The gist of the framework of robustness indices of causal inferences put forth by Frank & Min (2007) and Frank et al. (2013) is that bias $\beta$ is induced by the gap between the observed sample and the sample one is supposed to obtain for the inference and conclusion. In this study, I intend to address this gap through the conceptualization of an unobserved sample. Relying on the Rubin Causal Model, and especially its potential outcome framework, I introduce the following definitions:

Definition 1.1: A real or non-counterfactual outcome refers to an outcome which is observable, i.e., an outcome of a control subject under the condition of control or an outcome of a treated subject under the condition of treatment.

A real outcome in Hong & Raudenbush (2005) could be either the reading score of a retained child, John, under the condition that he was retained in kindergarten, or the reading score of a promoted child, Mary, under the condition that she was promoted to first grade.

Definition 1.2: A counterfactual outcome of a subject refers to an imaginary outcome that would be observed under a condition different from the one this subject actually received.

In Hong & Raudenbush (2005), the counterfactual outcome of John, who was retained in kindergarten, would be his reading score had he been promoted to first grade. Likewise, the counterfactual outcome of Mary, who was promoted to first grade, would be her reading score had she been retained in kindergarten. Next, I define a potential outcome for the first scenario based on definitions 1.1 and 1.2:

Definition 1.3.1: A potential outcome of a subject in the first scenario refers to either his/her real outcome or his/her counterfactual outcome.

In Hong & Raudenbush (2005), every student had two potential outcomes. For example, John (who was actually retained) had two potential outcomes: his reading score under the condition of being retained in kindergarten (real outcome) and his reading score under the condition of being promoted to first grade (counterfactual outcome). Similarly, Mary (who was actually promoted) had two potential outcomes: her reading score under the condition of being promoted to first grade (real outcome) and her reading score under the condition of being retained in kindergarten (counterfactual outcome).

Before I proceed to define a potential outcome for the second scenario, it is vital to appreciate the difference between the two scenarios. In the first scenario, the lack of randomization means that real outcomes and counterfactual outcomes are fundamentally different and therefore should not be treated as equals. In the second scenario, counterfactual outcomes can be considered equivalent to real outcomes in the long run due to randomization. This suggests that the discussion and definition of potential outcomes in the second scenario can be confined to real outcomes only.
Hence, I have the following definition of potential outcomes for the second scenario:

Definition 1.3.2: A potential outcome in the second scenario refers to a real outcome which could potentially be drawn from the target population.

Given that the target population of Borman et al. (2008) consists of both volunteer and non-volunteer schools, a potential outcome in Borman et al. (2008) could be either the mean reading score of a classroom which belonged to a volunteer school in their study or the mean reading score of a classroom that could potentially be drawn from non-volunteer schools. It is worth noting that definition 1.3.2 implies a random assignment of classrooms to the Open Court Reading and control groups in either volunteer schools or non-volunteer schools. Both scenarios share the same definition of ideal population, which is provided next:

Definition 1.4: An ideal population refers to the collection of all possible potential outcomes of the target population.

The ideal population of Hong & Raudenbush (2005) is the collection of reading scores of all U.S. kindergarten children under both conditions of retention and promotion. Likewise, the ideal population of Borman et al. (2008) is the collection of mean reading scores of all U.S. classrooms. I remark here that the ideal population of Hong & Raudenbush (2005) contains counterfactual outcomes while the ideal population of Borman et al. (2008) comprises real outcomes only.

To understand the bias that invalidates causal inference in both scenarios and how it arises, I further decompose an ideal population into two parts, the observed part and the unobserved part, and distinguish them with the following two definitions:

Definition 1.5.1: The unobserved part of an ideal population in the first scenario refers to the collection of all counterfactual outcomes of the target population.
Naturally, the observed part of an ideal population in the first scenario refers to the collection of all real outcomes of the target population.

Definition 1.5.2: The unobserved or non-representable part of an ideal population in the second scenario refers to the collection of all potential outcomes of the part of the target population that cannot be represented by the observed sample. Conversely, the observed or representable part of an ideal population in the second scenario refers to the collection of all potential outcomes of the part of the target population that is deemed to be logically represented by the observed sample.

Again, I use Hong & Raudenbush (2005) and Borman et al. (2008) to make the above two definitions concrete. The unobserved part of the ideal population of Hong & Raudenbush (2005) would be the collection of counterfactual reading scores of all U.S. kindergarten students, i.e., the reading scores under retention of the students who were in fact promoted to first grade and the reading scores under promotion of the students who were in fact retained in kindergarten. The observed part of the ideal population of Hong & Raudenbush (2005) would be the collection of real reading scores of all U.S. kindergarten students, namely the reading scores under promotion of the students who were in fact promoted to first grade and the reading scores under retention of the students who were in fact retained in kindergarten. Furthermore, the unobserved (non-representable) part of the ideal population of Borman et al. (2008) would be the collection of the mean reading scores (real outcomes) of all classrooms in the non-volunteer schools. The observed (representable) part of the ideal population of Borman et al. (2008) would be the collection of the mean reading scores (real outcomes) of all classrooms in the volunteer schools.

Equipped with the preceding definitions, I can now conceptualize an unobserved sample as a random sample from the unobserved part of the ideal population and formalize it with the following definitions:

Definition 1.6.1: An unobserved sample in the first scenario refers to the collection of counterfactual outcomes of all sampled subjects. An unobserved treated sample in the first scenario refers to the collection of counterfactual outcomes of sampled subjects who actually received control, that is, the collection of outcomes of control subjects had they participated in the treatment group instead. An unobserved control sample in the first scenario refers to the collection of counterfactual outcomes of sampled subjects who actually received treatment, i.e., the collection of outcomes of treated subjects had they switched to the control group.

Definition 1.6.2: An unobserved sample in the second scenario refers to an imaginary random sample which is drawn from the non-representable part of an ideal population and consists of real outcomes. I assume a subsequent randomization is carried out on this unobserved sample, and as a result the proportion of treated subjects in this unobserved sample is the same as the proportion of treated subjects in the observed sample. An unobserved treated sample in the second scenario refers to the collection of real outcomes of subjects who were assigned to the treatment group in this imaginary random sample. An unobserved control sample in the second scenario refers to the collection of real outcomes of subjects who were assigned to the control group in this imaginary random sample.

Definition 1.7: An ideal sample refers to the combination of the observed sample and an unobserved sample. An ideal treated sample refers to the combination of the observed treated sample and an unobserved treated sample. An ideal control sample refers to the combination of the observed control sample and an unobserved control sample.
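
As an illustration only, the first-scenario decomposition in definitions 1.6.1 and 1.7 can be sketched in Python. The `Subject` record, the function names and all outcome values below are hypothetical and do not come from Hong & Raudenbush (2005); in practice the counterfactual outcomes are never observed, so the sketch simply supposes they were known. The group-mean difference is used because it is the estimator targeted in this chapter.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Subject:
    treated: bool          # True if the subject actually received treatment
    real: float            # outcome under the condition actually received
    counterfactual: float  # hypothetical outcome under the other condition


def split_samples(subjects):
    """Build the observed, unobserved and ideal samples (first scenario)."""
    observed_treated = [s.real for s in subjects if s.treated]
    observed_control = [s.real for s in subjects if not s.treated]
    # Counterfactuals of control subjects are outcomes under treatment.
    unobserved_treated = [s.counterfactual for s in subjects if not s.treated]
    # Counterfactuals of treated subjects are outcomes under control.
    unobserved_control = [s.counterfactual for s in subjects if s.treated]
    return {
        "observed_treated": observed_treated,
        "observed_control": observed_control,
        "unobserved_treated": unobserved_treated,
        "unobserved_control": unobserved_control,
        # Definition 1.7: ideal = observed combined with unobserved.
        "ideal_treated": observed_treated + unobserved_treated,
        "ideal_control": observed_control + unobserved_control,
    }


def mean_difference(treated, control):
    """Simple group-mean-difference estimate."""
    return mean(treated) - mean(control)


# Hypothetical reading scores; "treated" stands for retained in kindergarten.
subjects = [
    Subject(treated=True, real=40.0, counterfactual=50.0),
    Subject(treated=True, real=44.0, counterfactual=52.0),
    Subject(treated=False, real=56.0, counterfactual=48.0),
    Subject(treated=False, real=60.0, counterfactual=50.0),
]

samples = split_samples(subjects)
observed_diff = mean_difference(samples["observed_treated"],
                                samples["observed_control"])
ideal_diff = mean_difference(samples["ideal_treated"],
                             samples["ideal_control"])
print(observed_diff, ideal_diff)
```

With these made-up numbers the observed mean difference is further from zero than the ideal-sample mean difference, which is the kind of gap the unobserved sample is meant to expose.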

According to definition 1.6.1, an unobserved sample of Hong & Raudenbush (2005) is the collection of counterfactual reading scores of the sampled students in their study. Specifically, this unobserved sample can be decomposed into an unobserved control sample, which is the collection of reading scores of retained students had they all been promoted to first grade, and an unobserved treated sample, which is the collection of reading scores of promoted students had they all been retained in kindergarten. According to definition 1.6.2, an unobserved sample of Borman et al. (2008) is an imaginary sample of classrooms which were randomly drawn from non-volunteer schools and subsequently randomly assigned to the Open Court Reading (OCR) group or the control group. This unobserved sample comprises an unobserved treated sample, which is the collection of mean reading scores of the sampled classrooms in the OCR group, and an unobserved control sample, which is the collection of mean reading scores of the sampled classrooms in the control group.

Figure 1.1 details the structure of the ideal population in Hong & Raudenbush (2005). Notationally, I use $Y$ to denote the outcome. The subscript of $Y$ has two parts separated by a comma: the first part denotes the group this outcome belongs to and the second part denotes the subject this outcome pertains to. The superscript of $Y$ signals the kind of sample this outcome belongs to. For example, the reading score of John (or any other student who was retained in kindergarten) is symbolized by $Y^{ob}_{r,i}$, as John was observed as the $i$th retained student. The conceptualization of an unobserved sample (represented by the arrows with a label ‘1’) requires projecting his reading score had he been promoted to first grade, which is denoted by $Y^{un}_{p,i}$. In this case, $Y^{ob}_{r,i}$ becomes an element of the observed treated sample and $Y^{un}_{p,i}$ is a member of an unobserved control sample.
The reading score of Mary (or any other student who was promoted to first grade) is symbolized by $Y^{ob}_{p,j}$, and the conceptualization of an unobserved sample demands a projection of her reading score had she been retained in kindergarten, which is symbolized by $Y^{un}_{r,j}$. Consequently, $Y^{ob}_{p,j}$ is one element of the observed control sample and $Y^{un}_{r,j}$ is one element of an unobserved treated sample.

Figure 1.1: The structure of the ideal population in Hong & Raudenbush (2005)

Figure 1.2 elaborates on the structure of the ideal population in Borman et al. (2008). $Y^{ob}_{o,i}$ represents the mean reading score of an Open Court Reading classroom sampled from volunteered schools, and it could be any single element of the observed treated sample. $Y^{ob}_{c,j}$ denotes the mean reading score of a control classroom sampled from volunteered schools, and it could be any single element of the observed control sample. To generalize the conclusion of Borman et al. (2008) convincingly to non-volunteered schools, one needs the conceptualization of an unobserved sample (represented by the arrows with a label ‘2’), which is defined as an imaginary random sample of classrooms from non-volunteered schools. After an imaginary random assignment of classrooms in this unobserved sample to Open Court Reading or control, the mean reading score of an Open Court Reading classroom in this unobserved sample is $Y^{un}_{o,k}$, which could be any single element of an unobserved treated sample. The mean reading score of a control classroom in this unobserved sample is $Y^{un}_{c,l}$, which could be any single element of an unobserved control sample. As specified in definition 1.6.2, the proportion of OCR classrooms in this unobserved sample should be equal to the proportion of OCR classrooms in the observed sample.

Figure 1.2: The structure of the ideal population in Borman et al. (2008)

Figure 1.3 synthesizes the two figures above and portrays the structure of the ideal population in both scenarios.
$Y^{ob}_{t,i}$ signifies an outcome that belongs to the observed sample and to subject $i$, who is a member of the treatment group. In other words, $Y^{ob}_{t,i}$ could be associated with any member of the observed treated sample. In the first scenario, the conceptualization of an unobserved sample is tantamount to the projection of a counterfactual outcome (dashed circle) for each real outcome (blue-shaded circle) in the observed sample. For example, for the treated outcome of subject $i$ in the observed sample, it is necessary to project this subject's counterfactual outcome had he participated in the control group (i.e., $Y^{un}_{c,i}$, which is an element of an unobserved control sample). In the second scenario, the conceptualization of an unobserved sample is a process of projecting a random sample in the non-representable part of the ideal population and conceptually forming treatment and control groups within this random sample by random treatment assignment. In this case, the outcomes in the observed sample (blue-shaded circles) and the outcomes in an unobserved sample (solid unshaded circles) are both real outcomes, as they pertain to different subjects (as manifested by their different subscripts $i$, $j$, $k$, $l$).

Figure 1.3: The structure of the ideal population in both scenarios

To summarize, it is worth pointing out that every study is associated with an ideal sample when the robustness of its causal inference is of main concern. An ideal sample is thought to comprise the observed sample and an unobserved sample. The observed sample represents the observed part of the ideal population, while an unobserved sample is thought of as a random sample from the unobserved part of the ideal population, which cannot provide any real observed data even though it is essential for causal inference. Throughout this paper, the observed sample is considered fixed while an unobserved sample is allowed to vary.
Holding the observed sample fixed, the estimate based on the observed sample will be “contaminated” when I expand the observed sample with an unobserved sample. As a result, it is this unobserved sample that alters the sample statistics (see Frank & Min, 2007) and induces bias, which renders internal validity or external validity vulnerable. Therefore, the conceptualization and modeling of an unobserved sample is indispensable in quantifying the robustness of causal inference.

1.3-Previous work on the Bayesian framework of the robustness indices

A Bayesian framework of the robustness indices has been offered by Frank & Min (2007) in a slightly different setting than the robustness indices I have discussed so far. However, their argument about the formation of this Bayesian framework is quite illuminating. Specifically, they defined the sampling distribution of the correlation computed from an unobserved sample as the prior and modeled the sampling distribution of the correlation computed from the observed sample as the likelihood. The prior and likelihood can therefore be combined to generate the posterior distribution in an ordinary Bayesian fashion. Most importantly, the resulting posterior distribution can be interpreted as the sampling distribution of the correlation for an ideal sample, which consists of both the observed sample and an unobserved sample. Such an interpretation is consistent with the fact that the posterior distribution is a compromise between the prior and the likelihood. The Bayesian framework propounded by Frank & Min (2007) lays the foundation for the construction and interpretation of the Bayesian paradigm of the robustness indices in this study.
Fundamentally, the Bayesian framework of Frank & Min (2007) resides in the Bayesian causal inference world pioneered by Rubin (1978), who proposed imputing missing counterfactual outcomes based on their posterior predictive distribution(s) conditional on the assignment mechanism, real outcomes and covariate values. This procedure is implemented by sampling counterfactual outcomes from their posterior predictive distribution(s) and re-estimating the average treatment effect as the mean of individual differences between real outcomes and imputed counterfactual outcomes. It is noteworthy that the Bayesian framework of Frank & Min (2007), just like other literature inheriting Rubin (1978)'s Bayesian perspective on causal problems (Imbens & Rubin, 1997, 2015; Rubin & Zell, 2010; Zajonc, 2012; Espinosa et al., 2016), considers counterfactual outcomes as missing data and imputes a sample of them through an underlying Bayesian model.

1.4-Purposes of this study

The robustness indices are quite user-friendly and suitable for most empirical research, since they inform researchers about how robust their causal inference is to potential design and sampling bias. They do so by asking researchers to imagine a sample with no effect at all and to consider what proportion of the original sample would have to be replaced with such a sample to invalidate the inference. Nevertheless, it would be even more straightforward if one could answer the question “How likely is my inference invalid?” instead of the question “What proportion of my sample must be replaced to alter my conclusion?”, since the former question allows us to directly quantify the robustness of the causal inference as the probability of nullifying the inference.
In fact, research on the probability of replicability/reproducibility of a specific study has been advanced and advocated in different fields, and the probability of invalidating an inference proposed in this paper is essentially a form of probability of replicability, which has become an advisable choice of statistic in scholarly publishing and reporting (see Greenwald et al., 1996; Thompson, 1996; Sohn, 1998; Killeen, 2005; Psychological Science editorial board, 2005; Miller, 2009; Iverson et al., 2010). To express the robustness indices probabilistically, I draw on the Bayesian framework provided by Frank & Min (2007) and extend it to the robustness indices defined in Frank et al. (2013). It is important to note here that Frank & Min focused only on the Bayesian models of the robustness indices for biased sampling, with the correlation coefficient as the measure of effect size. To make this Bayesian framework more comprehensive and applicable to the problems discussed in Frank et al. (2013), I first propose a unifying Bayesian framework of the robustness indices, which is logically identical to the framework put forth by Frank & Min (2007). I will show that this unifying Bayesian framework leads to the posterior distribution of the bias, which allows one to calculate the probability that the bias exceeds its corresponding threshold for a certain study by utilizing the rules of overturning the inference as defined in (1.2) and (1.4). Given that the motivations and mechanisms of the bias concerning internal validity and external validity are different, separate Bayesian models for those two distinct kinds of bias are subsequently developed from the unifying Bayesian framework of the robustness indices. In the following section, I will first present the unifying framework of the robustness indices of causal inferences. This unifying framework contains two recipes for preparing robustness indices, namely a frequentist recipe and a Bayesian recipe.
In the third section, I define the Bayesian models of robustness indices (the Bayesian recipe) specifically for research with limited internal validity. Such Bayesian models will typically be applied to an observational or quasi-experimental study. In the fourth section, I define the Bayesian models of robustness indices for research with limited external validity. This set of Bayesian models has appropriate applications in randomized experiments. The fifth section discusses the appropriate statistical threshold $\delta^{\#}$ for the Bayesian models of robustness indices, as well as replacing observed cases with unobserved ones as an alternative sampling scenario. In the section of demonstrative examples, the robustness of the inferences made by Borman et al. (2008) and Hong & Raudenbush (2005), which has been evaluated by Frank et al. (2013), is reassessed within the corresponding Bayesian frameworks for external validity and internal validity provided in this paper. I conclude this study with a summary of the findings and point out the limitations and their possible implications for future research.

2-The unifying framework of the robustness indices of causal inferences

My discussion throughout this paper of the robustness indices of causal inferences is limited to the following setting: I assume there are only two groups in comparison, i.e., a treatment group whose participants received a treatment of main interest (like OCR or kindergarten retention) and a control group of subjects who did not receive such treatment. I further assume that, contingent on the two-group design, the difference between the mean of all observed treated outcomes and the mean of all observed control outcomes is the estimate of the average treatment effect, $\hat{\delta}$. Throughout the text I adopt the following notation: $\mathbf{Y}_t^{un}$ is an unobserved treated sample and $\mathbf{Y}_c^{un}$ is an unobserved control sample.
Moreover, the observed treated sample and the observed control sample are denoted by $\mathbf{Y}_t^{ob}$ and $\mathbf{Y}_c^{ob}$ respectively. Likewise, the ideal treated sample and the ideal control sample are denoted by $\mathbf{Y}_t^{ideal}$ and $\mathbf{Y}_c^{ideal}$ respectively. The sample means of $\mathbf{Y}_t^{un}$, $\mathbf{Y}_c^{un}$, $\mathbf{Y}_t^{ob}$, $\mathbf{Y}_c^{ob}$ are correspondingly represented by $\bar{Y}_t^{un}$, $\bar{Y}_c^{un}$, $\bar{Y}_t^{ob}$, $\bar{Y}_c^{ob}$. Probabilistically, the value of a single outcome under the condition of treatment (we can call it a treated outcome) can be treated as a random variable $Y_t$, and the value of a single outcome under the condition of control (we can call it a control outcome) can be treated as a random variable $Y_c$. Furthermore, the value of a treated outcome which might appear in the observed sample is a random variable $Y_t^{ob}$, and the value of a control outcome which might appear in the observed sample is also a random variable, $Y_c^{ob}$. The expectations of the distributions of $Y_t$ and $Y_c$ are symbolized by $\mu_t$ and $\mu_c$ respectively. Similarly, $\mu_t^{ob}$ and $\mu_c^{ob}$ stand for the expectations of the distributions of $Y_t^{ob}$ and $Y_c^{ob}$. Finally, $\mu_t^{ideal}$ and $\mu_c^{ideal}$ are two random variables whose distributions are respectively conditional on an ideal treated sample and an ideal control sample (so they are not expectations).

2.1-The frequentist recipe

It is the definition of the bias that motivates the construction of the Bayesian models of the robustness indices (Frank et al., 2013, Appendix). The bias, according to Frank et al. (2013), is uniformly defined as follows:

$$\beta = E[\hat{\delta}] - E[\delta] = \{E[Y_t^{ob}] - E[Y_c^{ob}]\} - \{E[Y_t] - E[Y_c]\} = (\mu_t^{ob} - \mu_c^{ob}) - (\mu_t - \mu_c) \qquad (2.1)$$

To elaborate on the definition above, (2.1) is partitioned into two series of differences. The first series of differences implies that the estimate of the treatment effect based on an observed sample, which is denoted as $\hat{\delta}$ in (2.1), has an expectation equal to $\mu_t^{ob} - \mu_c^{ob}$.
The second series of differences suggests that the estimate of the treatment effect based on an ideal sample, which is represented by $\delta$ in (2.1), should have an expectation equal to $\mu_t - \mu_c$, which is the true treatment effect. The operationalization of the above definition of bias relies on the strategy of molding $\beta$ as a random variable. First, given that $\hat{\delta}$ is supposed to be fixed when considering the bias associated with an estimate of causal effect, I simply use $\bar{Y}_t^{ob} - \bar{Y}_c^{ob}$ to substitute for $\mu_t^{ob} - \mu_c^{ob}$ in (2.1), at the cost of ignoring the random sampling error associated with the observed sample. Second, an unobserved sample needs to be taken into account, as it is the source of the bias $\beta$, as discussed earlier. Third, the estimate of causal effect based on the observed sample, i.e., $\bar{Y}_t^{ob} - \bar{Y}_c^{ob}$, should be compared with the true causal effect $\mu_t - \mu_c$. To achieve this purpose, I model the distributions of $\mu_t^{ideal}$ and $\mu_c^{ideal}$ conditional on an imaginary ideal sample, in order to account for the uncertainty brought by an unobserved sample. Consequently, the bias $\beta$ can be recast as a random variable conditional on $\mathbf{Y}_t^{ideal}$ and $\mathbf{Y}_c^{ideal}$:

$$\beta \mid \mathbf{Y}_t^{ideal}, \mathbf{Y}_c^{ideal} = \bar{Y}_t^{ob} - \bar{Y}_c^{ob} - (\mu_t^{ideal} - \mu_c^{ideal}) \qquad (2.2)$$

It is noteworthy that $\bar{Y}_t^{ob} - \bar{Y}_c^{ob}$ is exactly the estimate of the treatment effect based on the observed sample, i.e., $\hat{\delta}$, and it is unbiased for $\mu_t^{ob} - \mu_c^{ob}$. Here I treat $\bar{Y}_t^{ob} - \bar{Y}_c^{ob}$ as a fixed constant because the observed sample is fixed. The randomness of $\mu_t^{ideal}$ and $\mu_c^{ideal}$ is due to the random sampling error of an ideal sample, a consequence of its imaginary nature. Compared to the original definition of bias in (2.1), the new definition of bias (2.2) has two meaningful distinctions: First, definition (2.2) is a frequentist version of bias which is built on finite samples rather than whole populations.
This permits us to ignore the random sampling error of the observed samples and thereby focus on their nonrandom sampling error in the discussion of robustness indices henceforth. Second, definition (2.2) gives access to a distribution of bias conditional on ideal samples, which makes it feasible to quantify the robustness indices as probabilities of invalidating an inference. The decision rules in (1.5) can be restated, conditional on imaginary ideal samples, as rules of invalidating an inference as follows:

$$\mu_t^{ideal} - \mu_c^{ideal} < \delta^{\#} \quad \text{for inferring positive effects}$$
$$\mu_t^{ideal} - \mu_c^{ideal} > \delta^{\#} \quad \text{for inferring negative effects} \qquad (2.3)$$

given that $\hat{\delta} = \bar{Y}_t^{ob} - \bar{Y}_c^{ob}$ is fixed and statistically significant. Finally, according to (2.3), I propose the robustness indices of causal inferences as probabilities of invalidating an inference as below:

$$P(\mu_t^{ideal} - \mu_c^{ideal} < \delta^{\#}) \quad \text{for inferring positive effects}$$
$$P(\mu_t^{ideal} - \mu_c^{ideal} > \delta^{\#}) \quad \text{for inferring negative effects} \qquad (2.4)$$

2.2-The Bayesian recipe

The frequentist approach is attractive only if unobserved samples become observable. However, this will never happen, which renders the frequentist recipe infeasible in practice. An alternative approach is to conceptualize and model unobserved samples in the prior distributions by Bayesian reasoning, as introduced by Frank & Min (2007). For this purpose, the definition of bias in (2.2) is modified so that it adapts to the Bayesian world:

$$\beta \mid \mathbf{Y}_t^{ob}, \mathbf{Y}_c^{ob} = \bar{Y}_t^{ob} - \bar{Y}_c^{ob} - (\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob}) \qquad (2.5)$$

The main difference between the Bayesian definition (2.5) and the frequentist definition (2.2) is that the former can depend only on the observed sample. In Bayesian inference, it would be illegitimate to treat a parameter as conditional on something unobservable like an ideal sample.
Generically, the Bayesian models which are interpretatively equivalent to the Bayesian framework of Frank & Min (2007) can be formulated as follows:

$$\text{Prior: } \theta \sim F(\boldsymbol{\eta}_0) \qquad \text{Likelihood: } Y \mid \theta \sim G_Y(\theta) \qquad \text{Posterior: } \theta \mid Y \sim F(\boldsymbol{\eta}) \qquad (2.6)$$

where $F(\boldsymbol{\eta}_0)$ is the prior distribution of the parameter $\theta$ with prior parameters $\boldsymbol{\eta}_0$ and $G_Y(\theta)$ is the likelihood function of the outcome $Y$ with the parameter $\theta$. Hoff (2009) (also see Diaconis & Ylvisaker, 1979, 1985) has shown that when $G_Y(\theta)$ belongs to the exponential family and $F(\boldsymbol{\eta}_0)$ is conjugate to $G_Y(\theta)$ (i.e., the posterior distribution $F(\boldsymbol{\eta})$ and the prior distribution $F(\boldsymbol{\eta}_0)$ belong to the same family with different parameter values), the prior distribution can be interpreted as a distribution built on an unobserved sample whose sample size and sufficient statistics are considered as prior parameters. By construction, any member of the exponential family (for example: normal, Poisson, exponential, binomial and multinomial) that has a conjugate prior is appropriate for the Bayesian paradigm of robustness indices of causal inferences; hereafter I only consider the case where the likelihood function and the prior distribution are both normal. (Some common distributions that do not belong to the exponential family include the t distribution, the F distribution, Cauchy, logistic, mixture models and compound distributions like the beta-binomial and Dirichlet-multinomial distributions.)

The construction of the Bayesian paradigm of robustness indices begins with the formulation of the likelihood functions, which are generally described as the distributions of the treated outcome and the control outcome of the ideal population:

$$Y_t \sim N(\mu_t, \sigma_t^2) \qquad Y_c \sim N(\mu_c, \sigma_c^2) \qquad (2.7)$$

The parameters of interest in the likelihood functions (2.7) are $\mu_t$ and $\mu_c$, which are defined as the expected value of the treated outcomes of the ideal population and the expected value of the control outcomes of the ideal population.
The variances of both distributions, denoted by $\sigma_t^2$ and $\sigma_c^2$, are assumed known. The likelihood functions can be thought of as distributions founded on the observed samples, as argued by Frank & Min (2007), even though they are defined for the ideal populations, since in practice they are what the real observed data are fitted to. Bayesian theory stipulates that the parameters of interest, in this case $\mu_t$ and $\mu_c$, should follow some prior distributions, and by the logic of Frank & Min (2007) these prior distributions can be conceived as representations of the prior knowledge one would learn from an unobserved sample. Such prior knowledge is vital and indispensable in modeling the robustness indices of causal inferences, as bias is engendered by an unobserved sample. To elaborate on this, it is imperative to conceptualize an unobserved treated sample whose sample size is $n_t$ and an unobserved control sample whose sample size is $n_c$. The central limit theorem then suggests the following distributions for $\mu_t$ and $\mu_c$ conditional on such an unobserved treated sample and such an unobserved control sample:

$$\mu_t \sim N\!\left(\bar{Y}_t^{un}, \frac{\sigma_t^2}{n_t}\right) \qquad \mu_c \sim N\!\left(\bar{Y}_c^{un}, \frac{\sigma_c^2}{n_c}\right) \qquad (2.8)$$

The distributions in (2.8) are what I am seeking as prior distributions, that is, prior knowledge which is founded on an unobserved sample.
Consolidating the prior distributions (2.8) and the likelihood functions (2.7) gives the complete Bayesian models of robustness indices of causal inferences when the observed sample has $N_t$ subjects in the treatment group and $N_c$ subjects in the control group:

$$\mu_t \sim N\!\left(\bar{Y}_t^{un}, \frac{\sigma_t^2}{n_t}\right), \quad Y_t \sim N(\mu_t, \sigma_t^2), \quad \mu_t \mid \mathbf{Y}_t^{ob} \sim N(\nu_t, \tau_t)$$
$$\mu_c \sim N\!\left(\bar{Y}_c^{un}, \frac{\sigma_c^2}{n_c}\right), \quad Y_c \sim N(\mu_c, \sigma_c^2), \quad \mu_c \mid \mathbf{Y}_c^{ob} \sim N(\nu_c, \tau_c) \qquad (2.9)$$

where the posterior means $\nu_t$, $\nu_c$ and the posterior variances $\tau_t$, $\tau_c$ are:

$$\nu_t = \frac{n_t}{N_t + n_t}\bar{Y}_t^{un} + \frac{N_t}{N_t + n_t}\bar{Y}_t^{ob}, \qquad \tau_t = \frac{\sigma_t^2}{N_t + n_t}$$
$$\nu_c = \frac{n_c}{N_c + n_c}\bar{Y}_c^{un} + \frac{N_c}{N_c + n_c}\bar{Y}_c^{ob}, \qquad \tau_c = \frac{\sigma_c^2}{N_c + n_c} \qquad (2.10)$$

To demonstrate that the posterior distribution is identical to the distribution upon which the frequentist inference relies, it is necessary to present the distributions of $\mu_t^{ideal}$ and $\mu_c^{ideal}$ when an ideal sample is available:

$$\mu_t^{ideal} \sim N\!\left(\frac{n_t}{N_t + n_t}\bar{Y}_t^{un} + \frac{N_t}{N_t + n_t}\bar{Y}_t^{ob}, \; \frac{\sigma_t^2}{N_t + n_t}\right)$$
$$\mu_c^{ideal} \sim N\!\left(\frac{n_c}{N_c + n_c}\bar{Y}_c^{un} + \frac{N_c}{N_c + n_c}\bar{Y}_c^{ob}, \; \frac{\sigma_c^2}{N_c + n_c}\right) \qquad (2.11)$$

The derivation of the distributions in (2.11) is straightforward by the central limit theorem, given that the ideal treated and control sample means are:

$$\bar{Y}_t^{ideal} = \frac{n_t}{N_t + n_t}\bar{Y}_t^{un} + \frac{N_t}{N_t + n_t}\bar{Y}_t^{ob}, \qquad \bar{Y}_c^{ideal} = \frac{n_c}{N_c + n_c}\bar{Y}_c^{un} + \frac{N_c}{N_c + n_c}\bar{Y}_c^{ob} \qquad (2.12)$$

and the variances associated with those means are:

$$Var(\bar{Y}_t^{ideal}) = \frac{\sigma_t^2}{N_t + n_t}, \qquad Var(\bar{Y}_c^{ideal}) = \frac{\sigma_c^2}{N_c + n_c} \qquad (2.13)$$

What (2.11) uncovers is that the posterior distribution in the Bayesian recipe (i.e., the distribution of $\mu_t \mid \mathbf{Y}_t^{ob}$ or $\mu_c \mid \mathbf{Y}_c^{ob}$) and the distribution of the parameter built on an ideal sample in the frequentist recipe (i.e., the distribution of $\mu_t^{ideal}$ or $\mu_c^{ideal}$) are identical when a normal likelihood function with the mean as the only parameter and a normal prior are considered in the Bayesian paradigm. However, I caution readers that this result remains valid only for a certain type of likelihood and prior (exponential family with conjugate prior), and that even in this case the Bayesian recipe and the frequentist recipe are still distinct in many aspects.
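The equivalence between the conjugate posterior in (2.10) and the ideal-sample distribution in (2.11) to (2.13) can be checked numerically. The following is a minimal sketch, assuming a known outcome variance as in the text; the function names and the example numbers are hypothetical and are not taken from any study discussed here.

```python
from typing import NamedTuple


class NormalParams(NamedTuple):
    """Mean and variance of a normal distribution."""
    mean: float
    variance: float


def conjugate_posterior(ybar_un, n_un, ybar_ob, n_ob, sigma2):
    """Posterior of a normal mean with known variance sigma2.

    Prior: N(ybar_un, sigma2 / n_un), i.e. built on an unobserved sample.
    Data:  n_ob observed outcomes with sample mean ybar_ob.
    """
    prior_precision = n_un / sigma2
    data_precision = n_ob / sigma2
    variance = 1.0 / (prior_precision + data_precision)
    mean = variance * (prior_precision * ybar_un + data_precision * ybar_ob)
    return NormalParams(mean, variance)


def ideal_sample_mean_distribution(ybar_un, n_un, ybar_ob, n_ob, sigma2):
    """Distribution of the pooled (ideal) sample mean, as in (2.11)-(2.13)."""
    total = n_un + n_ob
    mean = (n_un * ybar_un + n_ob * ybar_ob) / total
    return NormalParams(mean, sigma2 / total)


# Hypothetical illustration: 50 unobserved and 150 observed treated outcomes.
post = conjugate_posterior(2.0, 50, 5.0, 150, 4.0)
ideal = ideal_sample_mean_distribution(2.0, 50, 5.0, 150, 4.0)
```

Under these assumptions the two routes return the same mean and variance, which is the sense in which the prior sample size and prior mean act as the size and mean of an unobserved sample.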
If independence between the treated outcome $Y_t$ and the control outcome $Y_c$ is posited, the distributions of $\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob}$ and $\beta$ have explicit forms:

$$\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob} \sim N(\nu_t - \nu_c, \; \tau_t + \tau_c)$$
$$\beta \mid \mathbf{Y}_t^{ob}, \mathbf{Y}_c^{ob} \sim N(\bar{Y}_t^{ob} - \bar{Y}_c^{ob} - (\nu_t - \nu_c), \; \tau_t + \tau_c) \qquad (2.14)$$

with the posterior means $\nu_t$, $\nu_c$ and the posterior variances $\tau_t$, $\tau_c$ quantified as in (2.10). An inference is invalidated if one of the following conditions is true:

$$\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob} < \delta^{\#} \quad \text{for inferring positive effects}$$
$$\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob} > \delta^{\#} \quad \text{for inferring negative effects} \qquad (2.15)$$

Capitalizing on the distribution of $\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob}$, the probability of invalidating an inference is defined as follows:

$$P(\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob} < \delta^{\#}) \quad \text{for inferring positive effects}$$
$$P(\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob} > \delta^{\#}) \quad \text{for inferring negative effects} \qquad (2.16)$$

Given the threshold for making an inference and the values of the posterior means and variances, this probability can be directly calculated as a function of those parameters and employed as the measure of robustness for any single study. Furthermore, one can calculate the probabilities of invalidating the inference for different but parallel studies and compare their robustness in terms of those probabilities. I caution readers here that the probability of invalidating an inference should not be confused with the p-value in hypothesis testing. Unfortunately, widespread misinterpretations of the p-value often lead researchers to treat those two distinct indices as parallel ones, even though they in fact tell completely different stories. A particularly relevant misinterpretation in this context is perceiving the p-value as an indicator of the robustness or replicability of an inference; this misinterpretation is scientifically detrimental and blurs the boundary between the p-value and true robustness indices. It is worth emphasizing that the p-value can never become an index of the robustness of any inference, for two main reasons.
First, the p-value only deals with random sampling error: it evaluates the degree to which a similar finding will occur in another equivalent random sample drawn by repeated random sampling. It largely quantifies the significance of a result when random sampling error is the only concern. Nonetheless, random sampling error has never been the focus of the analysis of robustness, since it exists in virtually every study and every inference. Quite the opposite, robustness indices usually highlight errors due to sources other than random sampling, such as nonrandom sampling, nonrandom assignment and omission of important confounding variables. The probability of invalidating an inference is an index of robustness because it takes either nonrandom sampling error or nonrandom assignment error into account by considering prior distributions as ones built on unobserved samples. Second, and equally importantly, the p-value is unqualified as a measure of robustness because it is computed under the assumption that the null hypothesis is true. In contrast, the robustness indices invented by Frank et al. (2013) and the probability of invalidating an inference are useful regardless of the condition specified in the null hypothesis. For example, Frank et al. (2013) mentioned that one can change the null hypothesis and compute the corresponding robustness indices by modifying the threshold value accordingly. The same can be done in computing the posterior probability of invalidating an inference, as it depends not only on the posterior distribution of the average treatment effect but also on the threshold. Testing null hypotheses of nonzero values can always be achieved by adjusting the threshold $\delta^{\#}$. The construction of the probability of invalidating an inference is built on three assumptions.
First, the random sampling error associated with the observed sample is ignored in the Bayesian models, so researchers should be aware that this probability can only indicate how likely an inference is to be invalidated due to bias induced by either nonrandom sampling or nonrandom assignment. Second, the distributions of the treated outcome and the control outcome are assumed to be normal. Third, the treated and the control outcomes are assumed to be independent. In summary, the Bayesian recipe exemplifies the Bayesian framework raised in Frank & Min (2007). A prior distribution, defined as the distribution that carries the information of one's belief about the parameters prior to observing the data, can be conceptualized as the distribution of a focal parameter conditional on an unobserved sample, since it reflects exactly the belief about the inferred parameter and is solely motivated and shaped by such belief. Neither a distribution based on an unobserved sample nor a typical prior distribution in a Bayesian context contains any information about the observed sample. The likelihood function in the Bayesian models, which serves as a generic characterization of the ideal population, is in fact completely driven by the observed sample. Furthermore, the problem of checking the robustness of the inference by varying the mean and sample size of an unobserved sample is transformed into a problem of checking the influence of a prior on its corresponding posterior distribution while holding the observed sample and the likelihood function fixed.
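The probability of invalidating an inference in (2.16) can be sketched in a few lines once the posterior means and variances of the two group means are available. This is an illustrative sketch under the three assumptions listed above (observed sample treated as fixed, normality, independence); the function name and its arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_invalidate(post_mean_t, post_var_t, post_mean_c, post_var_c,
                    threshold, inferred_effect="positive"):
    """Probability that the posterior treatment effect crosses the threshold.

    The difference of two independent normal posteriors is normal with
    mean post_mean_t - post_mean_c and variance post_var_t + post_var_c.
    """
    effect = NormalDist(mu=post_mean_t - post_mean_c,
                        sigma=sqrt(post_var_t + post_var_c))
    below = effect.cdf(threshold)
    if inferred_effect == "positive":
        # A positive inference is invalidated when the effect falls below the threshold.
        return below
    if inferred_effect == "negative":
        # A negative inference is invalidated when the effect rises above the threshold.
        return 1.0 - below
    raise ValueError("inferred_effect must be 'positive' or 'negative'")
```

Varying the prior inputs (the mean and size of the unobserved sample) while holding the observed statistics fixed then traces out how this probability responds to the thought experiment.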
3-The Bayesian models of robustness indices for internal validity

The unifying Bayesian framework of robustness indices of causal inferences can be recast as the Bayesian models of robustness indices for internal validity in particular, by deliberately defining the observed (treated/control) sample and an unobserved (treated/control) sample in the following way, where $T$ and $C$ denote the sets of subjects who actually received treatment and control:

$$\mathbf{Y}_t^{ob} = \{Y^{ob}_{t,i} : i \in T\}, \qquad \mathbf{Y}_t^{un} = \{Y^{un}_{t,j} : j \in C\}$$
$$\mathbf{Y}_c^{ob} = \{Y^{ob}_{c,j} : j \in C\}, \qquad \mathbf{Y}_c^{un} = \{Y^{un}_{c,i} : i \in T\} \qquad (3.1)$$

Definition (3.1) shows that the observed treated sample for studies with questionable internal validity consists of the real outcomes of the subjects who indeed received the treatment. An unobserved treated sample in this case consists of the counterfactual outcomes of the subjects in the control group had they been assigned to the treatment group instead. Likewise, the observed control sample for studies with questionable internal validity consists of the real outcomes of the subjects who actually received the control, and an unobserved control sample for the same studies consists of the counterfactual outcomes of the subjects in the treatment group had they switched to the control group. Definition (3.1) simply restates mathematically definition 1.6.1, which formalizes the concepts of unobserved and observed samples in the first scenario, where internal validity is limited. For example, to conceptualize an unobserved treated sample in the study of Hong & Raudenbush (2005), we need to ask a question like “what if a promoted child had not been promoted in the first place?” and how that would affect his test score. Similarly, an unobserved control sample would answer a question like “what would the academic achievement of a retained student be if he had been promoted?”. Aside from this definition, everything else in the Bayesian models of robustness indices for internal validity remains the same as in the unifying Bayesian framework.
Drawing on (3.1), this model has the identical definition of bias as in (2.5), identical Bayesian formulations as in (2.9) and (2.10), and identical distributions of $\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob}$ and $\beta$ as in (2.14). Moreover, as discussed earlier, the sample sizes of the unobserved treated and control samples are fixed in the case of internal validity. The sample sizes of an unobserved treated sample and the observed control sample should be equal, and the sample sizes of an unobserved control sample and the observed treated sample should be equal as well. To impose these restrictions on the models (2.9) and (2.10), a new set of models is proposed next, with one additional parameter $\pi$ defined as the proportion of subjects who receive the treatment in the whole sample:

$$\mu_t \sim N\!\left(\bar{Y}_t^{un}, \frac{\sigma_t^2}{n_t}\right), \quad Y_t \sim N(\mu_t, \sigma_t^2), \quad \mu_t \mid \mathbf{Y}_t^{ob} \sim N(\nu_t, \tau_t)$$
$$\mu_c \sim N\!\left(\bar{Y}_c^{un}, \frac{\sigma_c^2}{n_c}\right), \quad Y_c \sim N(\mu_c, \sigma_c^2), \quad \mu_c \mid \mathbf{Y}_c^{ob} \sim N(\nu_c, \tau_c) \qquad (3.2)$$

where:

$$n_t = \frac{1 - \pi}{\pi} N_t = N_c, \qquad n_c = \frac{\pi}{1 - \pi} N_c = N_t \qquad (3.3)$$

and the posterior means $\nu_t$, $\nu_c$ and the posterior variances $\tau_t$, $\tau_c$ are:

$$\nu_t = (1 - \pi)\bar{Y}_t^{un} + \pi \bar{Y}_t^{ob}$$
$$\tau_t = \pi \frac{\sigma_t^2}{N_t} = \frac{\sigma_t^2}{N}$$
$$\nu_c = \pi \bar{Y}_c^{un} + (1 - \pi)\bar{Y}_c^{ob}$$
$$\tau_c = (1 - \pi)\frac{\sigma_c^2}{N_c} = \frac{\sigma_c^2}{N} \qquad (3.4)$$

In the formulas above, $N$ is the total observed sample size, i.e., $N = N_t + N_c$. The denominators in the second and fourth equations in (3.4) become $N$ simply because $\pi$ by definition is the ratio between $N_t$ and $N$. Given a designated threshold $\delta^{\#}$ and a chosen decision rule in (2.15), the posterior distribution in (2.14) will naturally generate the probability of invalidating an inference due to limited internal validity, as a function of the parameters in (3.4). It is imperative to keep in mind that this probability is built on three assumptions, namely the assumption of no random sampling error for the observed sample, the normality assumption for the distributions of the treated and control outcomes, and the assumption of independence between the treated outcome and the control outcome.
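The restrictions in (3.3) and the posterior parameters in (3.4) can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary keys and the example numbers are hypothetical, and the outcome variances are taken as known constants, as in the text.

```python
def internal_validity_posterior(ybar_t_ob, ybar_c_ob, ybar_t_un, ybar_c_un,
                                n_t_obs, n_c_obs, sigma2_t, sigma2_c):
    """Posterior means and variances for the internal-validity models."""
    n_total = n_t_obs + n_c_obs
    pi = n_t_obs / n_total  # proportion of treated subjects in the observed sample

    # (3.3): each unobserved sample is as large as the opposite observed group.
    n_t_un = (1 - pi) / pi * n_t_obs  # equals n_c_obs
    n_c_un = pi / (1 - pi) * n_c_obs  # equals n_t_obs

    # (3.4): posterior means are pi-weighted mixtures; variances shrink with N.
    return {
        "pi": pi,
        "n_t_un": n_t_un,
        "n_c_un": n_c_un,
        "mean_t": (1 - pi) * ybar_t_un + pi * ybar_t_ob,
        "var_t": sigma2_t / n_total,
        "mean_c": pi * ybar_c_un + (1 - pi) * ybar_c_ob,
        "var_c": sigma2_c / n_total,
    }


# Hypothetical illustration: 40 treated and 60 control subjects.
params = internal_validity_posterior(
    ybar_t_ob=20.0, ybar_c_ob=8.0, ybar_t_un=10.0, ybar_c_un=12.0,
    n_t_obs=40, n_c_obs=60, sigma2_t=4.0, sigma2_c=9.0,
)
```

The returned means and variances are the inputs needed to evaluate the probability of invalidating an inference for a study with limited internal validity.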
By introducing a new parameter $\alpha$ as the ratio between $\bar{Y}_t^{un}$ and $\bar{Y}_c^{un}$, the relationship between the probability of invalidating an inference due to inadequate internal validity and the parameters mentioned in (3.2) through (3.4) can be shown to be a probit function of the following form, for a targeted negative effect:

$$probit(p) = \sqrt{\frac{N}{\sigma_t^2 + \sigma_c^2}} \left[ \bar{Y}_c^{un}(1 - \pi)\alpha + \pi(\bar{Y}_t^{ob} + \bar{Y}_c^{ob} - \bar{Y}_c^{un}) - \bar{Y}_c^{ob} - \delta^{\#} \right] \qquad (3.5)$$

For a targeted positive effect, we just need to reverse the signs of the coefficient of $\alpha$ and of the constant presented in (3.5), which leads to the equation below:

$$probit(p) = \sqrt{\frac{N}{\sigma_t^2 + \sigma_c^2}} \left[ \bar{Y}_c^{un}(\pi - 1)\alpha - \pi(\bar{Y}_t^{ob} + \bar{Y}_c^{ob} - \bar{Y}_c^{un}) + \bar{Y}_c^{ob} + \delta^{\#} \right] \qquad (3.6)$$

Here $p$ in (3.5) (or (3.6)) symbolizes the probability of invalidating an inference, which is computed as the probability that $\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob}$ is larger (or smaller, depending on the sign of the inferred effect) than a threshold $\delta^{\#}$. This probability is straightforward to obtain, as we have learned that the distribution of $\mu_t \mid \mathbf{Y}_t^{ob} - \mu_c \mid \mathbf{Y}_c^{ob}$ is normal with the mean and variance given by the difference of the posterior means and the sum of the posterior variances described in (3.4). What (3.5) and (3.6) demonstrate is that the probit link function of the probability of invalidating an inference due to limited internal validity is a linear function of $\alpha$. Therefore, given the values of $N$, $\sigma_t^2$, $\sigma_c^2$, $\bar{Y}_c^{un}$, $\bar{Y}_t^{ob}$, $\bar{Y}_c^{ob}$, $\pi$ and the threshold $\delta^{\#}$, the probit of this probability can be explicitly expressed as a linear function of $\alpha$. I will draw on this feature to answer some very meaningful questions, such as how large or small $\alpha$ could be, conditional on a set of values of the parameters $N$, $\sigma_t^2$, $\sigma_c^2$, $\bar{Y}_c^{un}$, $\bar{Y}_t^{ob}$, $\bar{Y}_c^{ob}$, $\pi$, $\delta^{\#}$, such that the probability of invalidating an inference stays below a certain value (for example, 0.3). Normally, one would extract $\bar{Y}_t^{ob}$, $\bar{Y}_c^{ob}$, $N$ from the observed sample and select some fixed constants for $\sigma_t^2$, $\sigma_c^2$.
$\bar{Y}_c^{un}$, the mean of an unobserved control sample, is conceptualized as a number which is not necessarily fixed in this approach. Together with the variable $\alpha$, $\bar{Y}_c^{un}$ characterizes the unobserved treated and control samples, which are of paramount concern in my Bayesian models.

The Bayesian models (3.2)-(3.4) can be recast in terms of the Rubin Causal Model (RCM). Suppose an observational study has $N$ subjects in total, with $N_t$ participants in the treatment group and $N_c$ participants in the control group. In other words, we have $N_t$ observed treated outcomes and $N_c$ observed control outcomes. According to the RCM, every participant in the treatment group would have had a counterfactual outcome had that participant been assigned to the control group. Likewise, every participant in the control group would have had a counterfactual outcome had that participant been assigned to the treatment group. This means there are $N_t$ unobserved control outcomes and $N_c$ unobserved treated outcomes in total. In the Bayesian models of robustness indices for internal validity, the ideal treated sample can be thought of as consisting of the $N_t$ observed treated outcomes (the observed treated sample) and the $N_c$ unobserved treated outcomes (an unobserved treated sample). Similarly, the ideal control sample can be viewed as the combination of the $N_c$ observed control outcomes (the observed control sample) and the $N_t$ unobserved control outcomes (an unobserved control sample).

To summarize the Bayesian models of robustness indices for internal validity presented in this section, it is useful to review the RCM perspective associated with them. The Rubin Causal Model conceptualizes an observational study as a missing data problem and the assignment mechanism as the mechanism by which the missing data are generated. Specifically, following the logic of the RCM, every individual has one observed outcome and one missing outcome.
4-The Bayesian models of robustness indices for external validity

The Bayesian models of robustness indices for external validity, just like the Bayesian models of robustness indices for internal validity, are descendants of the unifying Bayesian framework. There are two key differences between the models for external validity and the models for internal validity. The first key difference is that the two sets of models have distinct definitions of an unobserved (treated/control) sample and the observed (treated/control) sample. For research whose major concern is external validity, the observed (treated/control) sample and an unobserved (treated/control) sample are defined as follows:

$$
\begin{aligned}
Y_t^{ob} &= \{Y_{t,i}^{ob} : i \in R\}, &\quad Y_t^{un} &= \{Y_{t,k}^{un} : k \in R'\} \\
Y_c^{ob} &= \{Y_{c,j}^{ob} : j \in R\}, &\quad Y_c^{un} &= \{Y_{c,l}^{un} : l \in R'\}
\end{aligned}
\tag{4.1}
$$

Definition (4.1) is the mathematical equivalent of definitions 1.6.2, which formalize the concepts of unobserved sample and observed sample in the second scenario, where research has limited external validity. Here I use $R$ to denote the representable part of the ideal population and $R'$ to denote the non-representable part of the ideal population. The second key difference between the models for external validity and internal validity is that one needs a new parameter $\pi_R$ to operationalize the definition in (4.1). $\pi_R$ represents the proportion of the representable part $R$ in the whole ideal population to which an inference is intended to generalize. To quantify $\pi_R$, one needs a judicious conceptualization of the size of $R$ relative to its corresponding ideal population. For example, for Borman et al. (2008), $\pi_R$ would be the proportion of volunteered schools (and arguably schools similar to the volunteered schools) in the population of all U.S. schools.
By this logic, the expectations of the ideal treated and control populations (i.e., $E[Y_t]$ and $E[Y_c]$) can be rewritten as functions of $\pi_R$ as below:

$$
\begin{aligned}
E[Y_t] &= \pi_R\, E[Y_t^{ob}] + (1-\pi_R)\, E[Y_t^{un}] \\
E[Y_c] &= \pi_R\, E[Y_c^{ob}] + (1-\pi_R)\, E[Y_c^{un}]
\end{aligned}
\tag{4.2}
$$

Recall that the ideal treated and control sample means have been presented in (2.11). Their expected values need to match the expectations listed in (4.2), which leads to the following equations:

$$
\begin{aligned}
\bar{Y}_t^{ideal} &= \frac{n_t}{N_t + n_t}\bar{Y}_t^{un} + \frac{N_t}{N_t + n_t}\bar{Y}_t^{ob} = \pi_R \bar{Y}_t^{ob} + (1-\pi_R)\bar{Y}_t^{un} \\
\bar{Y}_c^{ideal} &= \frac{n_c}{N_c + n_c}\bar{Y}_c^{un} + \frac{N_c}{N_c + n_c}\bar{Y}_c^{ob} = \pi_R \bar{Y}_c^{ob} + (1-\pi_R)\bar{Y}_c^{un}
\end{aligned}
\tag{4.3}
$$

Equation (4.3) implies the following constraints on the unobserved sample sizes:

$$
n_t = \frac{1-\pi_R}{\pi_R} N_t, \qquad n_c = \frac{1-\pi_R}{\pi_R} N_c
\tag{4.4}
$$

An appropriate way to conceptualize (4.1) through (4.4) is to envisage an unobserved sample of subjects randomly drawn from the non-representable part of the ideal population, followed by a random assignment that yields the same proportion of treated subjects as in the observed sample. The treated outcomes of the treated subjects in this unobserved sample are grouped as an unobserved treated sample, and the control outcomes of the remaining subjects (that is, those who receive control) form an unobserved control sample. Readers should note that the unobserved treated and control samples are formed differently in the internal validity and external validity scenarios.

Now I construct the Bayesian models of robustness indices for external validity by using the likelihood function in (2.7) and the prior distribution in (2.8). This again yields the same form as the Bayesian formulation in (2.9) and (2.10), as we have already seen in the previous section.
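The constraint in (4.4) and the weighting identity in (4.3) can be checked numerically. The short Python sketch below is illustrative only; the function names are hypothetical, and the inputs in the usage note are example values rather than results.

```python
def unobserved_sizes(pi_r, n_t_obs, n_c_obs):
    """Unobserved treated and control sample sizes implied by (4.4)."""
    ratio = (1 - pi_r) / pi_r
    return ratio * n_t_obs, ratio * n_c_obs


def ideal_mean(ybar_ob, ybar_un, n_obs, n_un):
    """Size-weighted mean of observed and unobserved samples, as in (4.3)."""
    return (n_un * ybar_un + n_obs * ybar_ob) / (n_obs + n_un)
```

With `pi_r = 0.25`, 27 observed treated units and 22 observed control units, `unobserved_sizes` returns 81 and 66, and `ideal_mean(615.0, 611.5, 27, 81)` equals `0.25 * 615.0 + 0.75 * 611.5`, as (4.3) requires.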
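The Bayesian models below again rely on the assumptions of no random sampling error for the observed samples, normality, and independence between treated and control outcomes:

$$
\begin{aligned}
\mu_t &\sim N\!\left(\bar{Y}_t^{un},\ \frac{\sigma_t^2}{n_t}\right), &\quad Y_t &\sim N(\mu_t,\ \sigma_t^2), &\quad \mu_t \mid Y_t^{ob} &\sim N(\tilde{\mu}_t,\ \tilde{\sigma}_t^2) \\
\mu_c &\sim N\!\left(\bar{Y}_c^{un},\ \frac{\sigma_c^2}{n_c}\right), &\quad Y_c &\sim N(\mu_c,\ \sigma_c^2), &\quad \mu_c \mid Y_c^{ob} &\sim N(\tilde{\mu}_c,\ \tilde{\sigma}_c^2)
\end{aligned}
\tag{4.5}
$$

where the posterior means and variances are

$$
\begin{aligned}
\tilde{\mu}_t &= (1-\pi_R)\bar{Y}_t^{un} + \pi_R \bar{Y}_t^{ob}, &\quad \tilde{\sigma}_t^2 &= \pi_R\,\frac{\sigma_t^2}{N_t} \\
\tilde{\mu}_c &= (1-\pi_R)\bar{Y}_c^{un} + \pi_R \bar{Y}_c^{ob}, &\quad \tilde{\sigma}_c^2 &= \pi_R\,\frac{\sigma_c^2}{N_c}
\end{aligned}
\tag{4.6}
$$

As before, once a threshold and a decision rule are chosen, the Bayesian models in (4.5) and (4.6) produce the posterior distributions displayed in (2.14) and the probability of invalidating an inference due to limited external validity as a function of the parameters in (4.6). Again defining $\alpha$ as the ratio of $\bar{Y}_t^{un}$ to $\bar{Y}_c^{un}$, the probit of the probability of invalidating an inference due to limited external validity can be shown to be a nonlinear function of $\alpha$ and $\pi_R$, whose form depends on the sign of the focal treatment effect. When inferring a negative effect, the probit model is:

$$
\operatorname{probit}(p) = \frac{\bar{Y}_c^{un}\alpha\,\pi_R^{-\frac{1}{2}} + \left(\bar{Y}_t^{ob} - \bar{Y}_c^{ob} + \bar{Y}_c^{un}\right)\pi_R^{\frac{1}{2}} - \bar{Y}_c^{un}\alpha\,\pi_R^{\frac{1}{2}} - \left(\bar{Y}_c^{un} + \delta^{\#}\right)\pi_R^{-\frac{1}{2}}}{\sqrt{\dfrac{\sigma_t^2}{N_t} + \dfrac{\sigma_c^2}{N_c}}}
\tag{4.7}
$$

And when inferring a positive effect, the probit model becomes:

$$
\operatorname{probit}(p) = \frac{\bar{Y}_c^{un}\alpha\,\pi_R^{\frac{1}{2}} - \bar{Y}_c^{un}\alpha\,\pi_R^{-\frac{1}{2}} - \left(\bar{Y}_t^{ob} - \bar{Y}_c^{ob} + \bar{Y}_c^{un}\right)\pi_R^{\frac{1}{2}} + \left(\bar{Y}_c^{un} + \delta^{\#}\right)\pi_R^{-\frac{1}{2}}}{\sqrt{\dfrac{\sigma_t^2}{N_t} + \dfrac{\sigma_c^2}{N_c}}}
\tag{4.8}
$$

The probit models in (4.7) and (4.8) share the same features and notation as their counterparts in the case of internal validity. For example, $p$ in (4.7) and (4.8) denotes the probability of invalidating an inference, calculated from the distribution of $\mu_t \mid Y_t^{ob} - \mu_c \mid Y_c^{ob}$, with posterior mean $\tilde{\mu}_t - \tilde{\mu}_c$ and posterior variance $\tilde{\sigma}_t^2 + \tilde{\sigma}_c^2$ from (4.6), as the probability that $\mu_t \mid Y_t^{ob} - \mu_c \mid Y_c^{ob}$ is larger (for a negative effect) or smaller (for a positive effect) than a threshold $\delta^{\#}$. Typically, $\sigma_t^2$ and $\sigma_c^2$ are predetermined, and $N_t$, $N_c$, $\bar{Y}_t^{ob}$ and $\bar{Y}_c^{ob}$ are information contained in the observed sample.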
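As a consistency check on (4.7) and (4.8), the expanded probit can be compared with the standardized distance between the threshold and the posterior mean implied by (4.6). The Python sketch below does this for hypothetical inputs; the function and argument names are assumptions made for illustration and are not part of the original models.

```python
from math import sqrt


def probit_external(alpha, pi_r, ybar_c_un, ybar_t_ob, ybar_c_ob,
                    var_t, var_c, n_t_obs, n_c_obs, delta, positive=True):
    """Probit from the posterior in (4.6): standardized gap to the threshold."""
    effect = ((1 - pi_r) * (alpha - 1) * ybar_c_un
              + pi_r * (ybar_t_ob - ybar_c_ob))        # posterior mean difference
    sd = sqrt(pi_r * (var_t / n_t_obs + var_c / n_c_obs))
    return (delta - effect) / sd if positive else (effect - delta) / sd


def probit_external_expanded(alpha, pi_r, ybar_c_un, ybar_t_ob, ybar_c_ob,
                             var_t, var_c, n_t_obs, n_c_obs, delta):
    """Expanded form for a positive effect, term by term as in (4.8)."""
    v = sqrt(var_t / n_t_obs + var_c / n_c_obs)
    r = sqrt(pi_r)
    numerator = (ybar_c_un * alpha * r
                 - ybar_c_un * alpha / r
                 - (ybar_t_ob - ybar_c_ob + ybar_c_un) * r
                 + (ybar_c_un + delta) / r)
    return numerator / v
```

The two functions agree for a positive effect, and for the same threshold the negative-effect probit in (4.7) is the negative of (4.8).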
Most importantly, I distinguish and summarize the unobserved treated and control samples with $\bar{Y}_c^{un}$ and $\alpha$. The probit models (4.7) and (4.8) can be employed to answer questions like “how large/small does $\alpha$ have to be, conditional on a value of $\pi_R$, such that the probability of invalidating an inference is smaller than 0.3?” or “how large/small does $\pi_R$ have to be, conditional on a value of $\alpha$, such that the probability of invalidating an inference is smaller than 0.2?”, as soon as the values of the parameters $N_t$, $N_c$, $\sigma_t^2$, $\sigma_c^2$, $\bar{Y}_c^{un}$, $\bar{Y}_t^{ob}$, $\bar{Y}_c^{ob}$ and the threshold $\delta^{\#}$ are chosen by the researcher.

From a sampling perspective, the Bayesian models of robustness indices for external validity are tantamount to the following sampling process. The observed sample is first drawn from the representable part of the ideal population and fixed henceforth. Then an unobserved sample is thought to be drawn from the non-representable part of the ideal population, and it is not necessarily fixed. The unobserved sample size, $n = n_t + n_c$, is determined by the observed sample size $N = N_t + N_c$ and $\pi_R$, i.e., $n = \frac{1-\pi_R}{\pi_R} N$. All subjects in this unobserved sample are then randomly assigned to a treatment group or a control group, and I maintain that the proportion of treated subjects in this unobserved sample equals the proportion of treated subjects in the observed sample, that is, $\frac{n_t}{n} = \frac{N_t}{N}$.

I again emphasize that the Bayesian models for internal validity and for external validity differ because they address different central questions. For internal validity, the unobserved part of the ideal population is the collection of counterfactuals brought about by the assignment mechanism. In this case, one does not seek to generalize an inference to other populations of subjects that are not accessible.
For example, the data from the Early Childhood Longitudinal Study Kindergarten (ECLS-K) used by Hong & Raudenbush (2005) are nationally representative, and the paramount concern of that study is the lack of random assignment of kindergarten students to the conditions of being retained or promoted. In this case the Bayesian models of the robustness indices for internal validity are quite appropriate. For external validity, by contrast, the unobserved (non-representable) part of the ideal population is not the counterfactuals, which exist but do not affect the inference. Rather, it arises from overgeneralization, that is, from researchers attempting to generalize their conclusions beyond the populations they have sampled from. For instance, Borman et al. (2008) conducted a cluster randomized trial to examine the efficacy of the OCR curriculum in six schools randomly sampled from the schools that volunteered for this curriculum, and the results of this experiment are intended to apply to students across the whole nation. The unobserved (non-representable) part of the ideal population for Borman et al. (2008) would be the schools in the U.S. that did not volunteer for this research. The Bayesian models of the robustness indices for external validity take this unobserved (non-representable) part of the ideal population into consideration and modify the inference accordingly.

In spite of the important distinctions between the two classes of Bayesian models just discussed, it is pivotal to appreciate what both sets of models have in common when learning and using them. First, the definition of the bias and the rules for judging an inference invalid are the same for both sets of models. They are the starting points for constructing both kinds of Bayesian models, as they form the basis for calculating the probability of invalidating an inference.
Second, both sets of models share the same model structure and the same group of parameters. Specifically, the distribution of $\mu_t$ based on an unobserved treated sample and the distribution of $\mu_c$ based on an unobserved control sample are the priors, and the generic descriptions of the treated outcome $Y_t$ and the control outcome $Y_c$ are the likelihood functions in both kinds of models. The parameters in the Bayesian models include the unobserved treated and control sample means, the observed treated and control sample means, and the variances of $Y_t$ and $Y_c$. In addition, one parameter representing the relative size of an unobserved sample in an ideal sample is needed, and its definition depends on whether internal validity or external validity is the focus. Third, the two classes of Bayesian models are interpreted in nearly the same fashion. That is, we conceptualize an unobserved sample as randomly drawn from the unobserved part of the ideal population, and this unobserved sample is then combined with the observed sample to form an ideal sample. The ideal treated (control) sample mean is the weighted average of the unobserved treated (control) sample mean and the observed treated (control) sample mean, where the weights are the proportions of these samples in the ideal sample. Therefore, the Bayesian models of robustness indices assume one can augment the observed sample with an unobserved sample and update the inference over this augmented sample.

5-Statistical threshold and Bayesian models for replacing observed cases

An empirical researcher who tries to decide whether an inference is invalidated based on the observed sample and a chosen statistical threshold $\delta^{\#}$ can easily fall into an inferential pitfall concerning the Bayesian models of robustness indices.
This occurs when one computes the statistical threshold with the variance of the average treatment effect estimate based on the observed sample instead of the variance based on an ideal sample, and is unaware of the key difference between those two variances. In fact, the Bayesian models of robustness indices of causal inferences discussed so far assume that one obtains an unobserved sample and incorporates it into the observed sample to form an ideal sample. Because the standard deviation of the average treatment effect estimate computed from an ideal sample takes into account the sizes and observations of both the unobserved and the observed samples, it is more appropriate than its counterpart from the observed sample for quantifying a statistical threshold. There are two ways of addressing this issue. The first is to calculate the statistical threshold $\delta^{\#}$ as the product of the chosen critical value of the standard normal distribution (by convention, 1.96) and the standard error of the average treatment effect estimate based on an ideal sample. The second approach redefines an ideal sample as a sample obtained by replacing a proportion of observed cases with an unobserved sample, so that the ideal sample size stays identical to the observed sample size. Consequently, this approach requires a complete shift in sampling perspective, from adding unobserved cases to the observed sample to replacing a part of the observed sample with unobserved cases. Such a shift in sampling perspective further necessitates some modifications of the Bayesian framework.

5.1-Appropriate statistical threshold $\delta^{\#}$ for Bayesian models of robustness indices

Identifying the ideal sample variance of the average treatment effect estimate is a prerequisite for calculating an appropriate statistical threshold for the Bayesian models of robustness indices.
Recall that in (2.14) I presented the distribution of $\mu_t \mid Y_t^{ob} - \mu_c \mid Y_c^{ob}$, which is equivalent to the distribution of the average treatment effect estimate based on an ideal sample. The ideal sample variance of the average treatment effect estimate is therefore given by this distribution as the sum of the posterior variances, $\tilde{\sigma}_t^2 + \tilde{\sigma}_c^2$. For the Bayesian models of robustness indices for internal validity, this sum equals:

$$
\tilde{\sigma}_t^2 + \tilde{\sigma}_c^2 = \frac{\sigma_t^2 + \sigma_c^2}{N}
\tag{5.1}
$$

For the Bayesian models of robustness indices for external validity, it becomes:

$$
\tilde{\sigma}_t^2 + \tilde{\sigma}_c^2 = \pi_R\left(\frac{\sigma_t^2}{N_t} + \frac{\sigma_c^2}{N_c}\right)
\tag{5.2}
$$

Drawing on (5.1), the appropriate statistical threshold $\delta^{\#}$ for the Bayesian models of robustness indices for internal validity should be computed as follows:

$$
\begin{aligned}
\delta^{\#} &= 1.96\sqrt{\frac{\sigma_t^2 + \sigma_c^2}{N}} &&\text{for inferring a positive effect} \\
\delta^{\#} &= -1.96\sqrt{\frac{\sigma_t^2 + \sigma_c^2}{N}} &&\text{for inferring a negative effect}
\end{aligned}
\tag{5.3}
$$

Likewise, the following statistical threshold $\delta^{\#}$ is recommended for the Bayesian models of robustness indices for external validity:

$$
\begin{aligned}
\delta^{\#} &= 1.96\sqrt{\pi_R\left(\frac{\sigma_t^2}{N_t} + \frac{\sigma_c^2}{N_c}\right)} &&\text{for inferring a positive effect} \\
\delta^{\#} &= -1.96\sqrt{\pi_R\left(\frac{\sigma_t^2}{N_t} + \frac{\sigma_c^2}{N_c}\right)} &&\text{for inferring a negative effect}
\end{aligned}
\tag{5.4}
$$

Based on (5.3), the probit models for the probability of invalidating an inference due to limited internal validity are rewritten as follows. For inferring a negative effect:

$$
\operatorname{probit}(p) = \sqrt{\frac{N}{\sigma_t^2 + \sigma_c^2}}\left[\bar{Y}_c^{un}(1-\pi)\alpha + \pi\left(\bar{Y}_t^{ob} + \bar{Y}_c^{ob} - \bar{Y}_c^{un}\right) - \bar{Y}_c^{ob}\right] + 1.96
\tag{5.5}
$$

For inferring a positive effect:

$$
\operatorname{probit}(p) = \sqrt{\frac{N}{\sigma_t^2 + \sigma_c^2}}\left[\bar{Y}_c^{un}(\pi-1)\alpha - \pi\left(\bar{Y}_t^{ob} + \bar{Y}_c^{ob} - \bar{Y}_c^{un}\right) + \bar{Y}_c^{ob}\right] + 1.96
\tag{5.6}
$$

Similarly, plugging in the statistical thresholds in (5.4) updates the probit models for the probability of invalidating an inference due to limited external validity. For inferring a negative effect:

$$
\operatorname{probit}(p) = \frac{\bar{Y}_c^{un}\alpha\,\pi_R^{-\frac{1}{2}} + \left(\bar{Y}_t^{ob} - \bar{Y}_c^{ob} + \bar{Y}_c^{un}\right)\pi_R^{\frac{1}{2}} - \bar{Y}_c^{un}\alpha\,\pi_R^{\frac{1}{2}} - \bar{Y}_c^{un}\pi_R^{-\frac{1}{2}}}{\sqrt{\dfrac{\sigma_t^2}{N_t} + \dfrac{\sigma_c^2}{N_c}}} + 1.96
\tag{5.7}
$$

For inferring a positive effect:

$$
\operatorname{probit}(p) = \frac{\bar{Y}_c^{un}\alpha\,\pi_R^{\frac{1}{2}} - \bar{Y}_c^{un}\alpha\,\pi_R^{-\frac{1}{2}} - \left(\bar{Y}_t^{ob} - \bar{Y}_c^{ob} + \bar{Y}_c^{un}\right)\pi_R^{\frac{1}{2}} + \bar{Y}_c^{un}\pi_R^{-\frac{1}{2}}}{\sqrt{\dfrac{\sigma_t^2}{N_t} + \dfrac{\sigma_c^2}{N_c}}} + 1.96
\tag{5.8}
$$

I recognize that the threshold $\delta^{\#}$ could be non-statistical rather than statistical, as empirical researchers typically set the threshold through a multifaceted and pragmatic decision-making process. The threshold $\delta^{\#}$ tends to be non-statistical when, for example, a benchmark effect size is available through literature review or research synthesis. Therefore, the formulae in (5.1) through (5.4) serve only as guidelines for determining the threshold $\delta^{\#}$ based solely on statistical significance. More general guidance is offered by Frank et al. (2013), who shed light on choosing a threshold based on the transaction costs of proposed actions.

5.2-The Bayesian models of robustness indices for replacing observed cases

Up to now we have examined Bayesian models of robustness indices in which an unobserved sample is modeled by a prior distribution and, correspondingly, an ideal sample is formed and represented by the posterior distribution. Such Bayesian models are tantamount to a sampling procedure in which one first obtains an observed sample and then adds an unobserved sample to it to construct an ideal sample. However, adding unobserved cases is not the only way of generating an ideal sample: an ideal sample can also be formed by replacing a proportion of the observed sample with an unobserved sample, as proposed by Frank & Min (2007). To articulate Bayesian models for replacing a part of the observed sample with an unobserved sample, I introduce the following notation for the sampling scheme of replacing observed cases. For an individual $i$ who joined the treatment group, $I_i^t$ is an indicator of whether that individual is retained in an ideal sample (and thus is not replaced with an unobserved case).
Therefore, when $I_i^t = 1$ this individual $i$ (say his name is Tom) is not replaced with an unobserved case, and when $I_i^t = 0$ Tom belongs to the part of the observed sample which is to be replaced with an unobserved sample. Likewise, $I_j^c$ is a binary indicator of whether an individual $j$ (say her name is Ashley) who participates in the control group (denoted by the superscript ‘c’) is retained in an ideal sample (so she is not replaced with an unobserved case either). Next, I define $s_t$ as an ideal treated sample and $s_c$ as an ideal control sample. Operationally, $s_t$ can be represented by the collection of $I_i^t$, $i = 1, 2, \ldots, N_t$, and $s_c$ by the collection of $I_j^c$, $j = 1, 2, \ldots, N_c$, since these collections tell us which observations are kept in an ideal sample and which are to be replaced with an unobserved sample. Finally, I define $\pi_r$ as the proportion of cases in the observed sample to be replaced with an unobserved sample, so that the unobserved sample size becomes the product of the observed sample size and $\pi_r$:

$$
n_t = \pi_r N_t, \qquad n_c = \pi_r N_c
\tag{5.9}
$$

Given this sampling outlook and these definitions, the Bayesian models of robustness indices of causal inferences for replacing observed cases are formalized as follows:

$$
\begin{aligned}
\mu_t &\sim N\!\left(\bar{Y}_t^{un},\ \frac{\sigma_t^2}{n_t}\right), &\quad Y_i^t \mid I_i^t = 1 &\sim N(\mu_t,\ \sigma_t^2), &\quad \mu_t \mid Y_t^{ob}, s_t &\sim N\!\left(\tilde{\mu}_t^{r},\ (\tilde{\sigma}_t^{r})^2\right) \\
\mu_c &\sim N\!\left(\bar{Y}_c^{un},\ \frac{\sigma_c^2}{n_c}\right), &\quad Y_j^c \mid I_j^c = 1 &\sim N(\mu_c,\ \sigma_c^2), &\quad \mu_c \mid Y_c^{ob}, s_c &\sim N\!\left(\tilde{\mu}_c^{r},\ (\tilde{\sigma}_c^{r})^2\right)
\end{aligned}
\tag{5.10}
$$

where

$$
s_t = [I_1^t, I_2^t, \ldots, I_{N_t}^t], \qquad s_c = [I_1^c, I_2^c, \ldots, I_{N_c}^c]
\tag{5.11}
$$

and

$$
\begin{aligned}
\tilde{\mu}_t^{r} &= \pi_r \bar{Y}_t^{un} + \frac{\sum_{i=1}^{N_t} I_i^t Y_i^t}{N_t}, &\quad (\tilde{\sigma}_t^{r})^2 &= \frac{\sigma_t^2}{N_t} \\
\tilde{\mu}_c^{r} &= \pi_r \bar{Y}_c^{un} + \frac{\sum_{j=1}^{N_c} I_j^c Y_j^c}{N_c}, &\quad (\tilde{\sigma}_c^{r})^2 &= \frac{\sigma_c^2}{N_c}
\end{aligned}
\tag{5.12}
$$

Caution is needed when using the above Bayesian models, since the posterior distribution is conditional not only on the observed sample but also on which observations are kept in an ideal sample and which are exchanged for unobserved cases.
To derive a posterior distribution which depends solely on the observed sample, the expectations of the posterior distributions $\mu_t \mid Y_t^{ob}, s_t$ and $\mu_c \mid Y_c^{ob}, s_c$ can be computed over the distributions of $s_t$ and $s_c$ respectively, which results in the following posterior distributions:

$$
\begin{aligned}
\mu_t \mid Y_t^{ob} &= E_{s_t}\!\left[\mu_t \mid Y_t^{ob}, s_t\right] \sim N\!\left(\tilde{\mu}_t^{r},\ (\tilde{\sigma}_t^{r})^2\right) \\
\mu_c \mid Y_c^{ob} &= E_{s_c}\!\left[\mu_c \mid Y_c^{ob}, s_c\right] \sim N\!\left(\tilde{\mu}_c^{r},\ (\tilde{\sigma}_c^{r})^2\right)
\end{aligned}
\tag{5.13}
$$

where the posterior means and variances now become:

$$
\begin{aligned}
\tilde{\mu}_t^{r} &= \pi_r \bar{Y}_t^{un} + (1-\pi_r)\bar{Y}_t^{ob}, &\quad (\tilde{\sigma}_t^{r})^2 &= \frac{\sigma_t^2}{N_t} \\
\tilde{\mu}_c^{r} &= \pi_r \bar{Y}_c^{un} + (1-\pi_r)\bar{Y}_c^{ob}, &\quad (\tilde{\sigma}_c^{r})^2 &= \frac{\sigma_c^2}{N_c}
\end{aligned}
\tag{5.14}
$$

Guided by the posterior distributions in (5.13) and (5.14), the probability of invalidating an inference is readily obtained through (2.16).

6-Demonstrative examples

6.1-The Bayesian robustness indices of the effect of OCR on reading achievement

According to Borman et al. (2008), the Open Court Reading (OCR) curriculum “has been widely used since 1960s and offers a phonics-based K-6 curriculum that is grounded in the research-based practices cited in the National Reading Panel report (National Reading Panel, 2000).” Therefore, Borman et al. (2008) argued that the OCR program had the potential to enhance instructional quality and thus reading achievement, as it was rooted in research-based practices that had been advanced by federal educational programs like Reading First and No Child Left Behind. To arrive at a reliable inference about the effect of the OCR program on the reading achievement of elementary school students, Borman et al. (2008) designed a multisite, cluster-randomized controlled trial, considering that “OCR has never been evaluated rigorously through a randomized trial”. Borman et al. (2008) randomly drew 6 schools from the schools that had contacted, and shown interest to, SRA/McGraw Hill, the publisher of the OCR curriculum.
Those 6 schools came from six different states (Florida, Georgia, Idaho, Indiana, North Carolina and Texas), and they were considered to be geographically, ethnically and socioeconomically representative of schools in the US. Within each school and each grade level, classrooms were randomly assigned to the group treated with the OCR program or to the control group. With strong confidence in the internal validity of the design, Borman et al. (2008) estimated the effect of the OCR curriculum on student reading composite scores as 7.95, which is statistically significant, with an effect size of 0.16. Based on the result and design, Borman et al. (2008) went on to conclude that “the outcomes from these analyses provided not only evidence of the promising 1-year effects of OCR on students’ reading outcomes but also suggest that these effects may be replicated across varying contexts with rather consistent and positive results”.

Nevertheless, the strong internal validity endowed by randomization cannot preempt the debate about external validity, especially when a conclusion hinges on strong external validity, as does the one made by Borman et al. (2008). As pointed out by Frank et al. (2013), the study population of Borman et al. (2008) essentially consists of schools that volunteered for research on the effect of the OCR program, since Borman et al. (2008) only sampled schools from the list of schools that had reached out to the publisher of the OCR curriculum. It is questionable, however, whether the effect of the OCR program would be the same as the one reported by Borman et al. (2008) if the study were conducted in non-volunteered schools, possibly because volunteered schools were more experienced and more capable of carrying out programs like OCR and therefore might have considered the OCR program particularly advantageous for them. In this case, the effect of the OCR curriculum would apparently be overestimated, and the conclusion drawn by Borman et al.
(2008) may not be warranted for non-volunteered schools. I will next apply the Bayesian model of robustness indices for external validity to Borman et al. (2008) to quantify the robustness of its inference and to identify the situations in which this inference becomes intolerably fragile.

The analysis of the inferential robustness of Borman et al. (2008) starts with the following sampling process. First, Borman et al. (2008) had 27 classrooms randomly sampled and assigned to the OCR group and 22 classrooms randomly sampled and assigned to the control group, as described earlier. In addition, the Bayesian models of robustness indices require one to conceptualize the relative size of the population of volunteered schools within the population of all US schools, which is denoted as $\pi_R$. Suppose $\pi_R$ is thought to be 0.5, that is, roughly half of the US schools were fundamentally different from the volunteered schools, which constitute the observed part of the ideal population in the study of Borman et al. (2008). Furthermore, an imaginary sampling process takes place in the non-representable part of the ideal population for Borman et al. (2008), i.e., the half of US schools that were considerably distinct from the volunteered schools. This imaginary sampling process should be mostly identical to the observed sampling process in Borman et al. (2008), namely drawing 5 or 6 schools (or equivalently 49 classrooms) from the non-representable part of the ideal population and then randomly assigning 27 classrooms to the OCR group and 22 classrooms to the control group in those unobserved sampled schools. In general, for a given $\pi_R$ I conceptualize that $49\cdot\frac{1-\pi_R}{\pi_R}$ classrooms were drawn from the non-representable part of the ideal population of Borman et al. (2008), and subsequently roughly $27\cdot\frac{1-\pi_R}{\pi_R}$ classrooms were randomly assigned to the OCR group (so there should be $22\cdot\frac{1-\pi_R}{\pi_R}$ classrooms in the control group).

Drawing on this imaginary sampling procedure, I conceptualize the mean reading composite score for the classrooms randomly assigned to the control group and randomly sampled from the US schools that were fundamentally different from the volunteered ones in Borman et al. (2008) as 611.5. I further conceptualize the mean reading composite score for the classrooms randomly assigned to the OCR group and randomly drawn from those same schools as $611.5\,\alpha$. The value of 611.5 is chosen because it is the overall mean of the whole sample in Borman et al. (2008), and the case of $\alpha = 1$ typifies the null hypothesis that the average treatment effect is 0. The prior and the likelihood functions for Borman et al. (2008) are built on the information about the unobserved treated and control samples and on Frank et al. (2013) (see pg. 444):

$$
\begin{aligned}
\mu_t &\sim N\!\left(611.5\,\alpha,\ \frac{45}{n_t}\right), &\quad Y_t &\sim N(\mu_t,\ 45) \\
\mu_c &\sim N\!\left(611.5,\ \frac{45}{n_c}\right), &\quad Y_c &\sim N(\mu_c,\ 45)
\end{aligned}
\tag{6.1}
$$

where the unobserved treated sample size $n_t$ and the unobserved control sample size $n_c$ are:

$$
n_t = 27\cdot\frac{1-\pi_R}{\pi_R}, \qquad n_c = 22\cdot\frac{1-\pi_R}{\pi_R}
\tag{6.2}
$$

Next, I capitalize on the probit function established in (5.8) to find the thresholds of $\pi_R$ or $\alpha$ that keep the probability of invalidating the inference made by Borman et al. (2008) below a desired level. The following parameters are contained in the Bayesian model (6.1) and (6.2) and are to be plugged into the probit model (4.8) (also see Frank et al. (2013)):

$$
\bar{Y}_c^{un} = 611.5, \quad \bar{Y}_t^{ob} = 615, \quad \bar{Y}_c^{ob} = 607, \quad \sigma_t^2 = 45, \quad \sigma_c^2 = 45, \quad N_t = 27, \quad N_c = 22
\tag{6.3}
$$

The final step is to quantify an appropriate statistical threshold to account for the added unobserved samples. Plugging the parameter values in (6.3) into the generic expression in (5.4) gives this threshold as $1.96\sqrt{\pi_R\left(\frac{45}{27} + \frac{45}{22}\right)}$. The probit model corresponding to the parameter values assumed in (6.3) is:

$$
\operatorname{probit}(p) = 317.38\,\alpha\,\pi_R^{\frac{1}{2}} - 317.38\,\alpha\,\pi_R^{-\frac{1}{2}} - 321.54\,\pi_R^{\frac{1}{2}} + 317.38\,\pi_R^{-\frac{1}{2}} + 1.96
\tag{6.4}
$$

The above probit function is used in the following fashion. First, one needs to set a desired level for the probability of invalidating the inference made by Borman et al. (2008), for example 0.5. This means the researcher would like to find the threshold for $\alpha$ or $\pi_R$ such that the probability of invalidating the inference of Borman et al. (2008) is smaller than 0.5. Moreover, the threshold for $\alpha$ is conditional on the value of $\pi_R$ and vice versa. Specifically, the threshold for $\alpha$ is first calculated as a function of $\pi_R$ based on the desired level of the probability of invalidating the inference, and is subsequently evaluated at selected values of $\pi_R$ so that it can be expressed as a number instead of a function. The threshold for $\pi_R$ is obtained with the same procedure, except that it is contingent on the value of $\alpha$.

From (6.4), and because $\operatorname{probit}(0.5) = 0$, the boundary that separates the region in which the probability of invalidating the inference is larger than 0.5 from the region in which it is smaller than 0.5 is given by:

$$
317.38\,\alpha\,\pi_R^{\frac{1}{2}} - 317.38\,\alpha\,\pi_R^{-\frac{1}{2}} - 321.54\,\pi_R^{\frac{1}{2}} + 317.38\,\pi_R^{-\frac{1}{2}} < -1.96
\tag{6.5}
$$

More importantly, multiplying (6.5) through by $\pi_R^{0.5}$ leads to the following quadratic inequality in $\pi_R^{0.5}$ when $\alpha$ is a given fixed number:

$$
(317.38\,\alpha - 321.54)\,\pi_R + 1.96\,\pi_R^{0.5} + (317.38 - 317.38\,\alpha) < 0
\tag{6.6}
$$

Assuming $\alpha = 1$, the quadratic inequality (6.6) generates the following lower bound for $\pi_R$ in order to keep the probability of invalidating the inference of Borman et al. (2008) lower than 0.5:

$$
\pi_R > 0.22
\tag{6.7}
$$

which suggests that the proportion of the observed sample in an ideal sample should be larger than 0.22 so as to keep the probability of invalidating the inference of Borman et al. (2008) smaller than 0.5.
From the boundary function (6.5), and noting that $\pi_R^{0.5} - \pi_R^{-0.5} < 0$ for $\pi_R < 1$, the inequality for $\alpha$ can be derived as follows:

$$
\alpha > \frac{321.54\,\pi_R^{0.5} - 317.38\,\pi_R^{-0.5} - 1.96}{317.38\left(\pi_R^{0.5} - \pi_R^{-0.5}\right)}
\tag{6.8}
$$

For instance, conditional on $\pi_R = 0.46$, the above inequality suggests $\alpha$ should be larger than 0.9966 in order to make the probability of invalidating the inference of Borman et al. (2008) smaller than 0.5.

The inequality (6.6) shows that the bounds of $\pi_R$ can be computed as long as a value of $\alpha$ is given, for the purpose of keeping the probability of invalidating the inference of Borman et al. (2008) under 0.5. Conditional on $\alpha = 1$, which means the mean reading composite scores of both the unobserved treated and the unobserved control samples are 611.5, $\pi_R$ needs to be larger than 0.22 for the probability of invalidating the inference made by Borman et al. (2008) to be smaller than 0.5. One interpretation of this lower bound of 0.22 is that one can add an unobserved sample, potentially drawn from the non-volunteered schools, to the observed sample, but this unobserved sample can contain at most 95 OCR classrooms and 78 control classrooms, assuming the effect of Open Court Reading is exactly zero for those unobserved classrooms, i.e., the mean reading scores of those 95 unobserved OCR classrooms and of those 78 unobserved control classrooms are both 611.5. Equally meaningful, the inequality (6.8) suggests that $\alpha$ must be larger than the ratio on the right-hand side, which is a function of $\pi_R$ only, so as to keep the probability of invalidating the inference made by Borman et al. (2008) under 0.5. This threshold of $\alpha$ can be evaluated at any given value of $\pi_R$. For example, one can fix the value of $\pi_R$ at 0.46, and the resulting lower bound for $\alpha$ is 0.9966 in order to make the probability of invalidating the inference made by Borman et al.
(2008) smaller than 0.5, which requires the mean reading score of the classrooms that were randomly sampled from the non-volunteered schools and randomly assigned to Open Court Reading to be at least 609.42. This is about two points lower than the mean reading score of the students in the classrooms that were randomly drawn from the non-volunteered schools and randomly assigned to the control condition.

The threshold of $\pi_R$ (or $\alpha$) can be calculated repeatedly for any desired probability of invalidating the inference, conditional on any fixed sensible value of $\alpha$ (or $\pi_R$). Table 1.1 and Table 1.2 provide thresholds of $\alpha$ and thresholds of $\pi_R$ when the desired level of probability of invalidating the inference ranges from 0.1 to 0.9. They further provide the threshold of the average treatment effect based on an ideal sample, which is just $\mu_t - \mu_c$ in (4.6), to help researchers interpret those levels of probability of invalidating the inference as desired levels of the estimate of the average treatment effect. For instance, $\alpha$ needs to be larger than 1.0017 for the probability of nullifying the inference in Borman et al. (2008) to be smaller than 0.1, holding $\pi_R$ constant at 0.46. Meanwhile, this threshold of $\alpha$ suggests that the estimate of the average treatment effect of OCR in an ideal sample should be larger than 4.24 for the probability of nullifying the inference of Borman et al. (2008) to be lower than 0.1. Choosing a desired level of probability of invalidating an inference, just like choosing a threshold related to a decision about an intervention, policy or program as discussed in Frank et al. (2013), should be based on the features and the specific context of a research design.

Figure 1.4 illustrates the relationship between testing the null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008) when the thresholds of $\alpha$ tabulated in Table 1.1 are iteratively plugged into the probit model (6.4), conditional on $\pi_R = 0.46$.
It is evident that as the threshold of $\alpha$ decreases, the posterior distribution (red curve) moves towards the distribution corresponding to the null hypothesis (black curve). As a result, the posterior probability of invalidating the inference of Borman et al. (2008) grows. Essentially, the posterior probability of invalidating the inference of Borman et al. (2008) is the type II error of retesting the null hypothesis $\mu_t - \mu_c = 0$ against the alternative hypothesis that $\mu_t - \mu_c$ follows the posterior distribution in (2.13), when an unobserved sample randomly drawn from the non-volunteered schools is available and added to their observed sample.

Figure 1.5 unfolds the same relationship between testing the null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008), except that it is built on the thresholds tabulated in Table 1.2 and is conditional on $\alpha = 1$. In Figure 1.5, the statistical threshold as well as the posterior variance decreases when $\pi_R$ decreases, which indicates that the unobserved sample size, and with it the ideal sample size, is growing. We also observe that the posterior distribution shifts towards the distribution corresponding to the null hypothesis when $\pi_R$ decreases.
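The type II error reading above can be made concrete with the values reported in Table 1.1. The sketch below rests on assumptions inferred from that table rather than stated in the text: with $\pi_R$ fixed at 0.46, the estimate based on an ideal sample is taken to be normally distributed around a given mean, the statistical threshold is taken to be the ideal-sample estimate reported at probability 0.5 (2.56), and that threshold is taken to equal 1.96 posterior standard deviations. The function name is hypothetical.

```python
from statistics import NormalDist

# Ideal-sample estimate reported at probability 0.5 in Table 1.1, used here
# as the statistical threshold (an assumption of this sketch).
THRESHOLD = 2.56
# Assumption: the threshold equals 1.96 posterior standard deviations.
POSTERIOR_SD = THRESHOLD / 1.96


def prob_invalidating(ideal_ate: float) -> float:
    """P(estimate < threshold) when the estimate ~ N(ideal_ate, POSTERIOR_SD**2)."""
    return NormalDist(mu=ideal_ate, sigma=POSTERIOR_SD).cdf(THRESHOLD)


# Ideal-sample estimates taken from the 0.1, 0.5 and 0.9 rows of Table 1.1.
for ate in (4.24, 2.56, 0.87):
    print(f"ideal-sample estimate {ate:.2f} -> probability {prob_invalidating(ate):.3f}")
```

Under these assumptions the three probabilities come out close to 0.1, 0.5 and 0.9, matching the rows the estimates were taken from: the further the posterior mean sits above the threshold, the smaller the area of the posterior below it.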
Table 1.1: Thresholds of $\alpha$ when $\pi_R$ is fixed as 0.46

Level of       Threshold    Threshold of the mean of an    Estimate of average treatment
probability    of α         unobserved treated sample      effect based on an ideal sample
0.1            1.0017       612.54                         4.24
0.2            0.9999       611.44                         3.65
0.3            0.9987       610.71                         3.25
0.4            0.9976       610.03                         2.89
0.5            0.9966       609.42                         2.56
0.6            0.9956       608.81                         2.23
0.7            0.9945       608.14                         1.86
0.8            0.9933       607.4                          1.47
0.9            0.9915       606.3                          0.87

Table 1.2: Thresholds of $\pi_R$ when $\alpha$ is fixed as 1

Level of       Threshold    Estimate of average treatment
probability    for π_R      effect based on an ideal sample
0.1            0.6095       4.88
0.2            0.4553       3.64
0.3            0.358        2.86
0.4            0.284        2.27
0.5            0.2228       1.78
0.6            0.1689       1.35
0.7            0.1195       0.96
0.8            0.0725       0.58
0.9            0.0267       0.21

Figure 1.4: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008) ($\pi_R$ is fixed as 0.46)

Figure 1.5: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008) ($\alpha$ is fixed as 1)

6.2-The Bayesian robustness indices of the effect of kindergarten retention on reading achievement

Hong & Raudenbush (2005) and Frank et al. (2013) have pointed out that kindergarten retention is a widespread phenomenon in the US and that its impact could be profound for both promoted children and retained children; it has therefore long been a controversial issue. To resolve this controversy, Hong & Raudenbush (2005) conducted an analysis that combined a multilevel model controlling for propensity scores with additional propensity score stratification to evaluate the effects of kindergarten retention policy and actual kindergarten retention on students’ academic achievement. Such an analysis is necessary and possibly effective for reducing the selection bias due to the lack of randomization in this kind of study.
Drawing on this method, Hong & Raudenbush (2005) estimated the effect of kindergarten retention on students’ reading achievement as -9.01 with a standard error of 0.68, which is tantamount to a significant effect whose size is about 0.67. In light of this considerable effect, Hong & Raudenbush (2005) concluded that “children who were retained would have learned more had they been promoted” and therefore “kindergarten retention treatment leaves most retainees even further behind”.

Nevertheless, the method proposed by Hong & Raudenbush (2005) does not prevent selection bias from persisting, for two reasons. First, propensity score analysis is built on the assumption of ignorability, which basically says that all confounding variables can be observed and controlled in the causal model. However, as argued by Frank et al. (2013), some confounding variables, such as the motivation of a child, may not be measured and controlled in the causal model of Hong & Raudenbush (2005), and this results in a violation of the ignorability assumption and incurs selection bias in their estimate. Second, to ensure that a quasi-experimental design is a plausible approximation of a randomized experiment, the estimated propensity scores need to be good balancing scores, which means that most if not all controlled covariates have to be balanced conditional on the estimated propensity score. Even though Hong & Raudenbush (2005) reported that 97% of the covariates had achieved balance and argued that the existence of the remaining imbalanced covariates “could largely be attributed to the Type I error related to sampling fluctuation”, there is little evidence to show that the imbalance of the remaining 3% of the covariates is due to sampling error and is inconsequential. Most importantly, the credibility of a quasi-experimental design is greatly undermined if the imbalanced covariates happen to be the most influential covariates. (See the draft of Maroulis, Frank & Duong).
In cases where motivation was negatively correlated with kindergarten retention and positively correlated with reading achievement, and promoted children had significantly higher pretest reading scores than retained children did in some propensity score strata, the negative effect of kindergarten retention could be mitigated or even reversed.

The aforementioned innate limitations of quasi-experimental design prompt us to capitalize on the Bayesian model of robustness indices for internal validity to express the robustness of the inference made by Hong & Raudenbush (2005) as the probability of invalidating their inference. Furthermore, for a chosen desired level of this probability (say 0.5), a threshold characterizing an unobserved sample can be computed to determine when the probability of invalidating their inference will exceed this desired level. As in the example of Borman et al. (2008), the underlying sampling process of the Bayesian model of robustness indices for the internal validity of Hong & Raudenbush (2005) is conceptualized as follows:

1-The observed treated sample consists of 471 retained children and the observed control sample consists of 7168 promoted children, according to Hong & Raudenbush (2005).

2-Following the Rubin Causal Model (RCM), the unobserved part of the ideal population is the collection of the reading scores of promoted children had they all been retained instead and the reading scores of retained children had they all been promoted instead. In the terminology of RCM, the unobserved part of the ideal population contains all possible values of the counterfactuals for the promoted students and all possible values of the counterfactuals for the retained students.

3-An unobserved sample should be randomly drawn from the unobserved part of the ideal population. Furthermore, it can be decomposed into an unobserved treated sample and an unobserved control sample.
This unobserved treated sample is the group of reading scores of all sampled promoted children had they been retained instead, and therefore its sample size should be 7168. Likewise, this unobserved control sample is the group of reading scores of all sampled retained children had they been promoted instead, and thus its sample size should be 471.

Based on the above sampling procedure, I assume the mean reading test scores of an unobserved control sample and an unobserved treated sample are 45.2 and $45.2\alpha$ respectively. As mentioned earlier, the case of $\alpha = 1$ corresponds to the null hypothesis which asserts that the average treatment effect of kindergarten retention is 0. Again, the prior and likelihood functions are constructed based on those unobserved samples, Hong & Raudenbush (2005) (pg.216) and Frank et al. (2013) (pg.448):

$$\mu_t \sim N\!\left(45.2\,\alpha,\ \frac{143.26}{n_t}\right), \quad Y_t \sim N(\mu_t,\ 143.26), \quad \mu_c \sim N\!\left(45.2,\ \frac{138.83}{n_c}\right), \quad Y_c \sim N(\mu_c,\ 138.83) \tag{6.9}$$

where:

$$n_t = 7168, \quad n_c = 471 \tag{6.10}$$

To find the threshold of $\alpha$ at which the probability of invalidating the inference of Hong & Raudenbush (2005) switches from being smaller to being larger than a preselected desired value (say 0.5), I utilize the probit model (3.5) and extract the following parametric values from Hong & Raudenbush (2005) and Frank et al. (2013):

$$\bar{Y}_c^{un} = 45.2, \quad \bar{Y}_t^{ob} = 36.77, \quad \bar{Y}_c^{ob} = 45.78, \quad \sigma_t^2 = 143.26, \quad \sigma_c^2 = 138.83, \quad N = 7639, \quad \pi = 0.0617 \tag{6.11}$$

Guided by (5.3) and the parametric values above, the appropriate statistical threshold is determined as $-1.96\sqrt{\dfrac{143.26 + 138.83}{7639}}$, which equals -0.38. Based on (6.11), the probit model can be written explicitly as:

$$\operatorname{probit}(p) = 221.49\,\alpha - 225.09 \tag{6.12}$$

From (6.12), the threshold of $\alpha$ can be located conditional on the parametric values assumed in (6.11), once the desired level of probability is given.
I note here that the threshold of $\alpha$ can be calculated repeatedly for various desired levels of probability of invalidating the inference of Hong & Raudenbush (2005) while holding the values in (6.11) fixed.

Again, when the desired value of probability is set to 0.5, the boundary line separating the region where the probability of invalidating the inference of Hong & Raudenbush (2005) is larger than 0.5 from the region where it is smaller than 0.5 is:

$$221.49\,\alpha - 225.09 = 0 \tag{6.13}$$

It can be learned from (6.13) that $\alpha$ needs to be smaller than 1.0162 to make the probability of invalidating the inference smaller than 0.5, assuming $\pi$ is 0.0617 and the mean reading score of the retained children had all of them been promoted instead is 45.2. Equivalently, this means that $\bar{Y}_t^{un}$, i.e., the mean reading score of the promoted children had all of them been retained instead, has to be smaller than 45.93 for the probability of invalidating the inference to be lower than 0.5. Moreover, this threshold of $\bar{Y}_t^{un}$ can be recast as the threshold of the average treatment effect based on an ideal sample, i.e., $\mu_t - \mu_c$, since it is a function of $\bar{Y}_t^{un}$ and the parametric values in (6.11). In the setting of the current example, the threshold of the average treatment effect based on an ideal sample is -0.38, which is exactly the appropriate statistical threshold.

The threshold of $\alpha$ can be obtained for any given desired level of the probability of invalidating the inference. Table 1.3 tabulates the thresholds of $\alpha$, $\bar{Y}_t^{un}$ and $\mu_t - \mu_c$ when the desired level of the probability of invalidating the inference of Hong & Raudenbush (2005) is 0.1, 0.2, …, 0.9.
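The calculation behind those thresholds can be sketched numerically. The block below inverts the probit model (6.12) for $\alpha$ and then converts each threshold into the mean of the unobserved treated sample and an ideal-sample estimate. The weighting formula is an assumption of this sketch: each ideal-sample mean is taken to be the sample-size-weighted average of the observed and unobserved means, with weight $\pi$ on the observed retained children and $1-\pi$ on the observed promoted children. Function and constant names are hypothetical, and because the coefficients in (6.12) are rounded, the output can differ from the tabulated values in the last digit.

```python
from statistics import NormalDist

PI = 0.0617                     # pi from (6.11): share of observed retained children
Y_T_OB, Y_C_OB = 36.77, 45.78   # observed treated and control means, from (6.11)
Y_C_UN = 45.2                   # mean of the unobserved control sample, from (6.11)


def alpha_upper_bound(p: float) -> float:
    """Largest alpha with probit(p) = 221.49*alpha - 225.09 below its target."""
    return (225.09 + NormalDist().inv_cdf(p)) / 221.49


def ideal_sample_ate(alpha: float) -> float:
    """Ideal-sample treated mean minus control mean (weighting is an assumption)."""
    # The unobserved treated mean is 45.2 * alpha, as assumed in the text.
    treated = PI * Y_T_OB + (1.0 - PI) * Y_C_UN * alpha
    control = (1.0 - PI) * Y_C_OB + PI * Y_C_UN
    return treated - control


for p in (0.1, 0.3, 0.5, 0.9):
    a = alpha_upper_bound(p)
    print(f"p={p:.1f}  alpha<{a:.4f}  unobserved treated mean<{Y_C_UN * a:.2f}  "
          f"ideal-sample estimate={ideal_sample_ate(a):.2f}")
```

Under these assumptions the ideal-sample estimate at probability 0.5 lands on the statistical threshold of about -0.38, which is the correspondence noted above.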
For example, $\alpha$ can be at most 1.0138 to keep the probability of invalidating their inference under 0.3, which indicates that the mean reading score of promoted students had they all been retained needs to be smaller than 45.82 and that the estimate of the average treatment effect obtained from an ideal sample should be more extreme than -0.48, given $\pi$ as 0.0617 and the mean reading score of retained students had they all been promoted as 45.2. However, in empirical research, the decision about the desired level of the probability of nullifying an inference should be a rational choice based on its cost and its policy or behavioral implications, as argued by Frank et al. (2013), rather than a haphazard choice.

One may notice that the thresholds for $\alpha$ in Table 1.3 are all very close to 1, which means that for almost any level of probability of invalidating the inference of Hong & Raudenbush (2005), the means of the unobserved treated and control samples should be roughly equal. This may appear unintuitive, but it is not surprising in the case of Hong & Raudenbush (2005) because their sample size is considerable. For a study with questionable internal validity and a large sample size, the probability of invalidating an inference is typically quite sensitive and rises or drops sharply within a narrow range of $\alpha$, as depicted in Figure 1.6.

Again, one main research goal is to learn the relationship between testing the null hypothesis and the posterior probability of invalidating the inference of Hong & Raudenbush (2005). For this purpose, Figure 1.7 is presented. The general pattern is that when $\alpha$ increases, the posterior distribution moves toward the distribution corresponding to the null hypothesis, and therefore the posterior probability of invalidating the inference of Hong & Raudenbush (2005) becomes larger. As discussed earlier, this relationship parallels the relationship between testing the null hypothesis against the alternative hypothesis and the type II error.
The posterior probability of invalidating the inference of Hong & Raudenbush (2005) is tantamount to the type II error of retesting the null hypothesis $\mu_t - \mu_c = 0$ when an unobserved sample (i.e., a collection of counterfactual outcomes of all their sampled students) is actualized and added to their observed sample.

Table 1.3: Thresholds of $\alpha$ when $\pi$ is fixed as 0.0617

Level of       Threshold    Threshold for the mean of an    Estimate of average treatment
probability    for α        unobserved treated sample       effect based on an ideal sample
0.1            1.0104       45.67                           -0.62
0.2            1.0124       45.76                           -0.54
0.3            1.0138       45.82                           -0.48
0.4            1.015        45.88                           -0.43
0.5            1.0162       45.93                           -0.38
0.6            1.0174       45.99                           -0.33
0.7            1.0186       46.04                           -0.28
0.8            1.02         46.1                            -0.22
0.9            1.022        46.19                           -0.13

Figure 1.6: The relationship between $\alpha$ and the probability of invalidating the inference of Hong & Raudenbush (2005)

Figure 1.7: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Hong & Raudenbush (2005)

7-Discussion and conclusion

7.1-Features of the Bayesian paradigm of robustness indices

The Bayesian paradigm proposed in this paper has some remarkable characteristics. First, it treats the problem of causal inference as a missing data issue, which is exactly the essence of causal inference according to RCM. Specifically, I define the “missing data” as an unobserved sample, which can be thought of as a sample randomly drawn from the unobserved part of the ideal population. The definition of the unobserved sample depends on whether internal or external validity is of central concern to researchers. For example, it could be a random sample from the schools that did not show interest in the study of the OCR curriculum when the external validity of Borman et al. (2008) is challenged, or the counterfactuals of the test scores of kindergarten students in the study of kindergarten retention when the internal validity of Hong & Raudenbush (2005) is disputable.
An ideal sample is formed by adding an unobserved sample to the observed one, and this ideal sample should lead to an unbiased estimate of the treatment effect. Following Frank & Min (2007), the posterior distribution can be interpreted as the distribution of the estimate based on an ideal sample, by treating the distribution of the estimate based on an unobserved sample as the prior distribution and constructing a likelihood function that contains the information of the observed sample. I have demonstrated that it is the posterior distribution that yields the probability of invalidating the inference, and I have expressed this probability as a function of the parameters of the prior and likelihood distributions.

Another notable feature of the Bayesian paradigm is that it is in fact a Bayesian sensitivity analysis, which manipulates the posterior distribution by inputting different informative priors while holding the observed data and the likelihood function fixed. Defining the distribution of the imaginary unobserved sample statistic as the prior distribution is entirely legitimate in the Bayesian world. Recall that a Bayesian model typically requires the prior distribution to be based on one’s belief about the parameter before the data are actually collected and analyzed. My Bayesian models demand a reasonable conjecture about the unobserved part of the ideal population, such that it could be true if the unobserved part of the ideal population could somehow be reached. In Bayesian language, this requires the prior to be subjective (or objective and informative) instead of noninformative, because a noninformative prior will make the probability of invalidating an inference noninformative as well. By this logic, the prior mean is usually a meaningful quantity and the prior variance is usually relatively small in practice.
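The reading of the posterior as a distribution built on an ideal sample can be illustrated with the standard normal-normal conjugate update for a mean with known variance. This is a minimal sketch rather than the dissertation's own derivation: the function name is hypothetical, and the numbers are the retained-children values from the kindergarten retention example (a prior centered at $45.2\alpha$ carrying the weight of 7168 unobserved cases, and 471 observed cases with mean 36.77 and variance 143.26).

```python
def normal_posterior(prior_mean, prior_n, obs_mean, obs_n, sigma2):
    """Conjugate update for a normal mean with known variance sigma2.

    A prior N(prior_mean, sigma2 / prior_n) behaves like prior_n unobserved
    cases, so the posterior pools observed and unobserved cases by sample size.
    """
    total_n = prior_n + obs_n
    post_mean = (prior_n * prior_mean + obs_n * obs_mean) / total_n
    post_var = sigma2 / total_n
    return post_mean, post_var


# Sensitivity analysis: vary the prior (through alpha), hold the data fixed.
for alpha in (1.0, 1.0162):
    mean, var = normal_posterior(45.2 * alpha, 7168, 36.77, 471, 143.26)
    print(f"alpha={alpha}: posterior mean {mean:.2f}, posterior variance {var:.4f}")
```

Only the prior mean changes between the two runs; the posterior variance stays the same because it depends on the combined sample size alone.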
Essentially, the Bayesian paradigm of robustness indices is about checking the influence of the parameters of the prior distribution on the posterior distribution, i.e., how changes in the prior distribution affect the posterior distribution and the inference built on it. This is exactly the spirit of Bayesian sensitivity analysis. I propose the probability of invalidating the inference as a new index of the robustness of causal inference in this paper, since it quantifies the condition under which (prior) and the degree to which (probability) a particular inference is robust.

Furthermore, the Bayesian paradigm enhances the interpretability of the framework of robustness indices. First of all, the underlying sampling process of the Bayesian paradigm can be conceptualized as follows: conditional on the fixed observed sample, one sample can be randomly drawn from the unobserved part of the ideal population and merged into the observed sample to construct an imaginary ideal sample. This procedure can be repeated many times and can thus generate many different ideal samples, with the observed sample remaining the same and fixed. Equally importantly, the expression for the mean of the ideal sample in (2.10) shows that the ideal sample mean can be interpreted as the weighted average of the unobserved sample mean and the observed sample mean, which is consistent with the arguments made by Frank & Min (2007). Specifically, those weights rely solely on the sample sizes of the unobserved and observed samples. As a result, the Bayesian paradigm itself can be interpreted as a sampling scheme in which one ideal sample is composed of heterogeneous subsamples with different sample sizes.

7.2-Comparisons with other similar approaches

7.2.1-The robustness indices in Frank et al. (2013)

By asking the question “what would it take to change your inference”, Frank et al.
(2013) initiated the robustness indices, defined as the proportion of the original data that would have to be replaced with hypothetical data of zero treatment effect in order to invalidate the inference. The Bayesian paradigm has three basic distinctions from the robustness indices in Frank et al. (2013). First, the robustness indices in Frank et al. (2013) and in the Bayesian paradigm are quantified in different forms. Specifically, the robustness indices in the Bayesian paradigm are posterior probabilities of invalidating an inference, rather than the proportion of data that needs to be replaced as in Frank et al. (2013). Second, the derivation of robustness indices is different in the two frameworks. My Bayesian approach adopts distributional thinking, i.e., it demands a distribution built on an unobserved sample randomly drawn from the unobserved part of the ideal population for its expected value. In contrast, Frank et al. (2013) usually focus on the estimate of the average treatment effect in the unobserved part of the ideal population and directly assume it to be a certain value. Third, the sampling mechanisms implied by the two kinds of robustness indices are dissimilar. The Bayesian paradigm, as explained in the last section, draws an unobserved sample and adds it to the observed sample, instead of replacing a portion of the observed sample with a hypothetical unobserved sample, which is the mechanism embedded in Frank et al. (2013). Still, it is noteworthy that both kinds of robustness indices are cognate in that they share the same definition of bias, both consider modeling of the unobserved population to be central, and both quantify the robustness of an inference as a threshold at which it is likely to be overturned.

7.2.2-The robustness indices in Frank & Min (2007)

There are two major differences between my Bayesian paradigm and the robustness indices in Frank & Min (2007).
One is that the Bayesian paradigm of robustness indices addresses questions about both internal validity and external validity, while the paper of Frank & Min is only intended for the question about external validity. Additionally, Frank & Min proposed two sampling schemes that can explain the induction of bias and the derivation of robustness indices, namely neutralization by replacement and neutralization by addition. The robustness indices of Frank et al. (2013) are well situated in the former, while the Bayesian paradigm is well situated in the latter.

7.2.3-The bounds on treatment effect in Manski (1990) and Lee (2009)

Like the robustness indices of causal inferences, the bounds on treatment effect proposed by Manski (1990) and Lee (2009) target the potential bias associated with the point estimate of the treatment effect and highlight the identification issue of such a point estimate due to confounded sample selection. However, the bounding approach differs from the robustness indices of causal inferences in three main aspects. First, the robustness indices are rooted in pragmatism and decision-making, while the bounds on treatment effect are intended for the estimation problem. Specifically, the purpose of judging whether an effect is significant (or equivalently whether a null hypothesis should be rejected) and whether a decision should be made thereupon is better served by the robustness indices of causal inferences. The bounds on treatment effect may better inform researchers about what data and assumptions can and cannot do. Second, the robustness indices of causal inferences rely on thought experiments, i.e., conceptualizations of unobserved samples, while the bounds on treatment effect do not. Third, the robustness indices quantify the robustness of causal inferences through thresholds for invalidating an inference, rather than through the bounds on treatment effect offered by Manski (1990) or Lee (2009).
7.3-Limitations

Although the Bayesian paradigm of robustness indices is a useful tool for quantifying the robustness of a causal inference, I caution readers about its limitations here so that one can decide how to implement it by weighing the gains and risks. First, my Bayesian paradigm has focused exclusively on the estimate of the average treatment effect. That is, the Bayesian paradigm is not intended for the bias in the estimates of the average treatment effect for the treated and the average treatment effect for the control, and consequently it should not be used to quantify the robustness of causal inferences affected by this kind of bias. Generally speaking, the Bayesian paradigm is not suitable for modeling the robustness of causal inferences affected by bias in the estimate of any differential causal effect, i.e., the treatment effect conditional on any covariates. For example, it would be inappropriate to apply the Bayesian paradigm of robustness indices to the inference about the average treatment effect of the OCR curriculum for students with high socioeconomic status or the average treatment effect of kindergarten retention for girls. Second, there are other sources of bias that can undermine causal inferences besides insufficient internal validity and external validity, such as measurement error and violation of SUTVA. I emphasize here that the Bayesian paradigm is designed only for measuring the degree to which a causal inference is affected by its debatable internal validity or external validity.

7.4-Conclusion

The Bayesian paradigm of robustness indices is an addition to the current literature on robustness indices of causal inferences, which aims to bridge statistical inference and causal inference by indicating to researchers when their inferences become too fragile to uphold as the conceptualization of the unobserved sample varies.
Cohen (1990) pointed out that “A successful piece of research doesn’t conclusively settle an issue, it just makes some theoretical proposition to some degree more likely. Only successful future replication in the same and different settings provides an approach to settling the issue” (pg.1311). Indeed, even a statistical inference based on a careful design like Borman et al. (2008) or Hong & Raudenbush (2005) should not be deemed an established causal inference without further inquiry into the sources of bias. Starting from the definition of bias due to limited internal validity or external validity, the Bayesian paradigm of robustness indices asks and answers the question “What would an unobserved sample have to be for the probability of invalidating my inference to be small enough (smaller than a predetermined desired value of mine)?”, or equivalently “How different can I afford an unobserved sample to be from the observed one so that the probability of invalidating my inference is small enough?”. Essentially, the Bayesian paradigm of robustness indices is consistent with the argument of Cohen (1990) in that it quantifies robustness through the modeling of the unobserved sample and thereby simulates replications of the same study in various contexts and the probability that a replication is successful. It is my hope that through the Bayesian paradigm of robustness indices proposed in this paper, researchers are able to cast their conclusions in terms of the degree to which their inferences will be valid and under what circumstances, and thereby contribute to the scientific discourse on a particular causal relationship.

Chapter 2: The Bayesian paradigm of robustness indices of causal inferences for regression models

1-Introduction

1.1-Regression-based causal inference

A causal question is nearly impossible to resolve convincingly unless the research raising the question is deemed to have both indisputable internal validity and indisputable external validity.
Empirically this means that a randomized experiment with a representative sample is a prerequisite for answering any causal question. Under this ideal condition, an extensive literature has justified the use of regression-based causal inference, i.e., the approach of treating the outcome as the dependent variable and a binary treatment indicator as an independent variable in a regression. According to Imbens & Rubin (2015), regression-based causal inference, when subjects are randomly assigned to the treatment and control groups, can generate a consistent and efficient estimate of the true average treatment effect. Some simulation studies have also indicated that regression-based causal inference is as good as any other methodology in causal inference under certain assumptions (Morgan & Winship, 2007; Shadish et al., 2008; Steiner et al., 2010; Imbens & Rubin, 2015).

While regression-based causal inference and its offshoots have been predominant in addressing causal questions, critics have questioned the validity of the inferences produced by this approach (Shadish, Cook and Campbell, 2002). Specifically, when a research design lacks randomization, the validity of regression-based causal inference rests solely on the assumption of strong ignorability (Rosenbaum & Rubin, 1983), which is neither justifiable nor testable (Morgan & Winship, 2007). In this case, it is natural to doubt the internal validity of regression-based causal inference. Moreover, the validity of regression-based causal inference can be shattered even in a randomized experiment, when a research conclusion targets a population of which the observed sample is not fully representative. The cases where regression-based causal inference is potentially invalid can be categorized into two scenarios, which I elaborate next.

The first scenario typically refers to an observational study or quasi-experiment with a representative sample of the target population.
Such research shall be labeled as having strong external validity and yet limited internal validity. The validity of regression-based causal inference hinges on the assumption of strong ignorability. That is, one needs to conjecture, and do one's best to justify, the independence between the treatment and the outcome conditional on a set of measured covariates. In addition, the probabilities of selecting or being assigned to the treatment have to be strictly smaller than 1 and larger than 0 for all subjects, conditional on the same set of measured covariates, in order to identify the average treatment effect. The potential pitfall of conducting regression-based causal inference under the strong ignorability assumption is that one can never prove or disprove this assumption and thus can never completely legitimize the use of regression-based causal inference under it. Some practical issues, such as the overlap of the distributions of propensity scores (or their logits) and the balance of covariates conditional on the propensity score, can still exist and potentially compromise the validity of regression-based causal inference even if the strong ignorability assumption is plausible (Gelman & Hill, 2007). Hong & Raudenbush (2005), which evaluated the impact of kindergarten retention on academic achievement, exemplifies this scenario, as a random assignment of kindergarten children to retention and promotion groups was impossible while a nationally representative sample from the ECLS-K study was available in this research.

The second scenario features any randomized experiment with a nonrandom sample drawn from its target population. A nonrandom sample, as discussed in Wooldridge (2010, 2013), can bias the estimates of regression coefficients and therefore make regression-based causal inference inconsistent and biased for the true average treatment effect.
Gelman & Hill (2007) argued that, in this case “causal inferences are still justified but inferences no longer generalize to the entire population”. They suggested that regression-based causal inference is only valid for an imaginary subpopulation and “further modeling is needed to generalize to any other population”. I will illustrate this scenario by discussing Borman et al. (2008), which conducted a multisite cluster randomized trial to examine the effect of the Open Court Reading (OCR) curriculum with a random sample from the schools that volunteered for this study. Borman et al. (2008) enjoyed strong internal validity brought by randomized assignment to the OCR and control groups and yet suffered from limited external validity, since they attempted to generalize their conclusions to both volunteered and non-volunteered schools.

1.2-The philosophy of robustness indices

The robustness indices proposed by Frank & Min (2007) and Frank et al. (2013) are built on the philosophy that there exists, at least conceptually, an ideal population for any single study planning a causal inference. To elaborate, the following definitions for the first scenario are needed:

Definition 1.1: A real or non-counterfactual observation refers to an observation which is observable, i.e., an observation of a controlled subject under the condition of control or an observation of a treated subject under the condition of treatment.

A real or non-counterfactual observation in Hong & Raudenbush (2005) could be an observation of John, who was retained in kindergarten, or an observation of Mary, who was promoted to first grade. It is noteworthy that those observations are real since one can only obtain John’s observation when he was retained and Mary’s observation when she was promoted.
Definition 1.2: A counterfactual observation of a subject refers to an imaginary observation where the outcome is counterfactual, the group membership is different from what is actually observed, and the covariates' values are identical to the ones in the subject's real observation.

In Hong & Raudenbush (2005), a counterfactual observation of John, who was retained in kindergarten, would be the observation where the outcome was John's potential reading score had he been promoted to first grade, the binary indicator of treatment status was 0 (since he is imagined as a promoted student) and the covariates remained the same as the ones in his real observation. Likewise, the counterfactual observation of Mary, who was promoted to first grade, would be the observation where the outcome was Mary's potential reading score had she been retained in kindergarten, the binary indicator of treatment status was 1 (since she is imagined as a retained student) and the covariates were identical to the ones in her real observation.

Definition 1.3.1: A potential observation of a subject in the first scenario refers to either his/her real observation or his/her counterfactual observation.

In Hong & Raudenbush (2005), every student had two potential observations. For example, John had two potential observations, namely his real observation under the condition of being retained in kindergarten and his counterfactual observation under the condition of being promoted to first grade. Similarly, Mary had two potential observations, which were her real observation under the condition of being promoted to first grade and her counterfactual observation under the condition of being retained in kindergarten.

With regard to the second scenario, definitions and conceptualizations of real and counterfactual observations are unnecessary, since in the long run randomization would guarantee the equivalence between real observations and counterfactual observations.
Still, it is instrumental to offer a different version of the definition of potential observations for the second scenario as follows:

Definition 1.3.2: A potential observation in the second scenario refers to a real observation which could be potentially drawn from the target population.

Given that the target population of Borman et al. (2008) comprises both volunteered and non-volunteered schools, a potential observation in Borman et al. (2008) could be either an observation of a classroom (along with the observations of the students seated in it) which belonged to a volunteered school in their study or an observation of a classroom (along with the observations of the students seated in it) which could be potentially drawn from non-volunteered schools. Building on the previous definitions, the definition of an ideal population is formalized next for both the first scenario and the second scenario:

Definition 1.4: An ideal population refers to the collection of all possible potential observations of the target population.

The operationalization of this definition depends on the specific context of the research and on the scenarios discussed earlier. For example, the ideal population of Hong & Raudenbush (2005) is the collection of potential observations of all kindergarten students in the U.S. Likewise, the ideal population of Borman et al. (2008) is the collection of observations of all classrooms in the volunteered and non-volunteered schools. To understand the bias which invalidates regression-based causal inference in those two scenarios, it is necessary to decompose an ideal population into an unobserved part and an observed part and differentiate between them. For this purpose, I have the following two definitions:

Definition 1.5.1: The unobserved part of an ideal population in the first scenario refers to the collection of all counterfactual observations of the target population.
Naturally, the observed part of an ideal population in the first scenario refers to the collection of all real observations of the target population.

Definition 1.5.2: The unobserved or non-representable part of an ideal population in the second scenario refers to the collection of all potential observations of the part of the target population that cannot be represented by the observed sample. Conversely, the observed or representable part of an ideal population in the second scenario refers to the collection of all potential observations of the part of the target population that is deemed to be logically represented by the observed sample.

According to definition 1.5.1, the unobserved part of the ideal population of Hong & Raudenbush (2005) is the collection of counterfactual observations of all kindergarten students in the U.S. and the observed part of the ideal population of Hong & Raudenbush (2005) is the collection of real observations of all kindergarten students in the U.S. According to definition 1.5.2, the non-representable part of the ideal population of Borman et al. (2008) would be the collection of observations of all classrooms in the non-volunteered schools and the representable part of the ideal population of Borman et al. (2008) would be the collection of observations of all classrooms in the volunteered schools. More importantly, a random sample of the unobserved (non-representable) part of the ideal population is the main target of the Bayesian framework and it is defined as an unobserved sample as follows:

Definition 1.6.1: An unobserved sample in the first scenario refers to the collection of counterfactual observations of sampled subjects.

Definition 1.6.2: An unobserved sample in the second scenario refers to an imaginary random sample which is drawn from the non-representable part of an ideal population and consists of real observations. I assume a subsequent randomization is carried out on this unobserved sample.
Definition 1.7: An ideal sample refers to the combination of the observed sample and an unobserved sample.

Based on definition 1.6.1, an unobserved sample of Hong & Raudenbush (2005) is the collection of all counterfactual observations of their sampled kindergarten students. Even though Hong & Raudenbush (2005) did get a random sample from their target population, i.e., the kindergarten students in America, this unobserved sample was still missing and not ignorable, since their sampled students were not randomly retained in kindergarten or promoted to first grade. Furthermore, as argued by Frank et al. (2013), although Hong & Raudenbush (2005) measured and controlled for most relevant covariates, such as kindergarten children's pretest scores, demographic features, psychological qualities, and family backgrounds, they still might have left some significant confounders unmeasured, such as the children's cognitive abilities and motivations. As a result, the unobserved sample that was missing in Hong & Raudenbush (2005) might not be ignorable conditional on their measured covariates, which amounts to a violation of the strong ignorability assumption and poses a threat to the study's internal validity. Moreover, based on definition 1.6.2, an unobserved sample of Borman et al. (2008) is an imaginary sample of classrooms which were randomly drawn from the non-volunteered schools. By assumption, classrooms in this unobserved sample had already been randomly assigned to either the Open Court Reading group or the control group. I note here that the observed sample of Borman et al. (2008) can only represent the volunteered schools, since it came from six schools which were randomly drawn from the volunteered schools in their study. The non-volunteered schools, which are an indispensable part of their target population, would be represented by this unobserved sample rather than the observed sample of Borman et al.
(2008), and this exhibits the discrepancy between their target population of schools and the population of schools that can be represented by their sample. Because this unobserved sample is missing, the discrepancy constitutes nonrandom sampling from their target population and poses a threat to the study's external validity.

Figure 2.1 shows that the observed sample in Hong & Raudenbush (2005) had two groups: the retention group (students who were retained in kindergarten) and the promotion group (students who were promoted to first grade). For every retained student (the blue-shaded circle $R_i$), there is a counterfactual observation of that student in an unobserved sample (the dashed circle $P_i$) had he been promoted instead. Similarly, an unobserved sample also contains the counterfactual observation of every promoted student (the blue-shaded circle $P_j$) had he been retained instead. An ideal sample is represented by the rectangle formed by conjoining the small rectangle with dashed circles (an unobserved sample) and the small rectangle with solid blue-shaded circles (the observed sample), and it consists of the real and counterfactual observations of all sampled students in Hong & Raudenbush (2005). It is worth noting that those two small rectangles adjoin each other, which indicates that the observed sample and an unobserved sample refer to the same group of subjects and share the same values of the covariates for those subjects in the first scenario.

Figure 2.1: The structure of ideal population in Hong & Raudenbush (2005)

Figure 2.2 displays the ideal population of Borman et al. (2008) as the collection of all real observations of classrooms that could be potentially drawn from American schools. The representable part of this ideal population is the collection of classrooms of schools which volunteered for their research, since the observed sample consists of the classrooms of schools which were randomly drawn from the volunteered schools.
Automatically, the non-representable part of this ideal population is the collection of classrooms of schools which did not volunteer for this research. An unobserved sample in this case is thought of as a random sample from the non-representable part of this ideal population. Classrooms of this unobserved sample are thought to be subsequently randomly assigned to the Open Court Reading group or the control group. An ideal sample is the combination of the small rectangle with solid blue-shaded circles (the observed sample) and the small rectangle with solid unshaded circles (an unobserved sample), and it is composed of real observations of classrooms drawn from volunteered and non-volunteered schools. In figure 2.2 those two small rectangles do not adjoin each other, which reveals that the observed sample and an unobserved sample pertain to different groups of subjects in the second scenario.

Figure 2.2: The structure of ideal population in Borman et al. (2008)

Figure 2.3 synthesizes the above two figures and presents the structure of an ideal population in both scenarios. The rectangle which contains two small blue-shaded circles is the observed sample, whose upper part is the treatment group (denoted by ‘T’ in the upper-right corner) and whose lower part is the control group (denoted by ‘C’ in the lower-right corner). In the first scenario, one needs to conceptualize the counterfactual observation of a treated subject $T_i$ as his observation had he participated in the control group, which is represented by a dashed circle $C_i$. Similarly, the counterfactual observation of a controlled subject $C_j$ is symbolized as a dashed circle $T_j$, which would have been this subject's observation had he participated in the treatment group. The rectangle that contains the dashed circles $C_i$ and $T_j$ is an unobserved sample in the first scenario, and the arrows with a label ‘1’ symbolize the conceptualization of an unobserved sample in the first scenario.
The second scenario implies that the scope of the ideal population can be narrowed from both real and counterfactual observations to real observations only, because of strong internal validity. Due to limited external validity in the second scenario, the same observed sample with blue-shaded circles $T_i$ and $C_j$ is still problematic, as it can only represent the representable part of the ideal population. Therefore, we need another conceptualization, symbolized by the arrows with a label ‘2’, to envision an unobserved sample drawn from the non-representable part of the ideal population. This unobserved sample is thought to be formed by first randomly drawing a sample from the non-representable part of the ideal population and then randomly assigning subjects to the treatment group and the control group. In figure 2.3, a treated subject in this unobserved sample is represented by the solid unshaded circle $T_k$ and a controlled subject in this unobserved sample is represented by the solid unshaded circle $C_l$.

Figure 2.3: The structure of ideal population in both scenarios

The above definitions and arguments have demonstrated that a missing unobserved sample is mainly responsible for the bias that invalidates regression-based causal inference in the first and second scenarios. To quantify the robustness of regression-based causal inference, it is necessary to conceptualize an unobserved sample and shape this conceptualization into a proper statistical model.

1.3-Research objectives

This research is motivated by Frank & Min (2007), which first proposed a Bayesian framework to address the concern of limited external validity by defining a prior distribution in terms of an unobserved sample and a likelihood in terms of the observed sample. They suggested that defining a Bayesian model in this fashion would lead to a posterior distribution that is built upon an ideal sample, which is just the combination of an unobserved sample and the observed sample.
Following their argument, I develop a comprehensive Bayesian framework to address both concerns, limited internal validity and limited external validity, for regression-based causal inference, as summarized above in the two scenarios. Grounded in the philosophy of robustness indices, this Bayesian framework of robustness indices considers a prior as if it is built on an unobserved sample and then proposes a posterior probability of invalidating an inference conditional on the observed sample. This paper is structured as follows: In the second section, I formalize a unifying framework of robustness indices, which has a frequentist recipe and a Bayesian recipe, for regression-based causal inference. In the third section, I discuss in general and in depth how to fit the Bayesian framework of robustness indices to raw data for regression-based causal inference. In the fourth section, I demonstrate that the Bayesian framework of robustness indices can be greatly simplified for centered and standardized data. The fifth section addresses the issue of adjusting the statistical threshold $\delta_w^{\#}$ to the sampling perspective of adding an unobserved sample to the observed sample, by identifying the standard deviation of the estimate based on an ideal sample or by considering replacing a portion of the observed sample with an unobserved sample rather than simply adding an unobserved sample to the observed sample. The sixth section provides detailed applications of my framework to Borman et al. (2008) as well as Hong & Raudenbush (2005). The seventh section concludes this paper with a review of the Bayesian framework of robustness indices for regression-based causal inference and other comparable approaches and a conclusion.
2-The unifying framework of robustness indices for regression-based causal inference

2.1-Setting and notation

The entire discussion in this paper is restricted to the following setting: Every sample, regardless of whether it is observed or not, should contain a vector of outcomes denoted by Y and a matrix of predictors denoted by X. X should have p+2 columns, which means the number of predictors is p+2. Among those p+2 predictors, there is one variable containing all 1s to represent the intercept and a treatment indicator denoted by W. All remaining p predictors are pure covariates, and I label them as $Z_1, Z_2, \ldots, Z_p$ respectively. Moreover, W=1 means a subject receives or selects the treatment, for example, the Open Court Reading curriculum or kindergarten retention. Accordingly, W=0 means a subject receives or selects the control condition. There are only two possible groups in this setting, i.e., treatment and control groups. The outcome Y should be continuous, or at least not a categorical variable. When data of such a structure are available to researchers, I assume they conduct regression-based causal inference to estimate the average treatment effect, which is parameterized as the regression coefficient of W in the regression of Y on X. For example, this regression can be written as $Y_i = \delta_0 + \delta_w W_i + \mathbf{Z}_i \boldsymbol{\gamma}_Z + \varepsilon_i$ for every individual i in the sample, where $\delta_w$ is the regression coefficient of W and represents the average treatment effect. Under this setting, an ideal population, as well as its unobserved and observed parts, consists of observations which all have one value for the outcome Y and p+2 values for the predictors. It follows that the observed sample and an unobserved sample are both collections of observations which all have one value for the outcome and p+2 values for the predictors as well.
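The setting above can be sketched on simulated data. The following Python code is only an illustration under stated assumptions: the sample size, the number of covariates, the true coefficient values and all variable names (`delta_w_hat` and so on) are hypothetical, and the column order of X (intercept, covariates, treatment indicator) is one convenient choice.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed; all data below are simulated
n, p = 500, 3                            # hypothetical sample size and number of pure covariates

Z = rng.normal(size=(n, p))              # pure covariates Z_1, ..., Z_p
W = rng.integers(0, 2, size=n)           # treatment indicator: 1 = treatment, 0 = control
true_intercept, true_effect = 1.0, 2.0   # hypothetical true values
gamma_Z = np.array([0.5, -0.3, 0.8])     # hypothetical covariate coefficients
Y = true_intercept + true_effect * W + Z @ gamma_Z + rng.normal(size=n)

# Predictor matrix with p + 2 columns: a column of 1s, the p covariates, and W.
X = np.column_stack([np.ones(n), Z, W])

# Least squares fit of Y on X; the last entry is the coefficient of W,
# i.e., the regression-based estimate of the average treatment effect.
gamma_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
delta_w_hat = float(gamma_hat[-1])
print(X.shape, round(delta_w_hat, 3))
```

Because the treatment is randomly assigned in this simulation, the estimated coefficient of W should land close to the hypothetical true effect; the two scenarios discussed earlier are precisely the cases where such a guarantee is missing.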
The first notation rule I adhere to throughout this paper is that I use a superscript to inform readers about which sample a statistic pertains to. A statistic or a part of the data with the superscript “ob” comes from the observed sample, while the superscript “un” indicates that it comes from an unobserved sample. A statistic or a part of the data that belongs to an ideal sample, which is just an integration of the observed sample and an unobserved sample and thus can be thought of as a random sample from the ideal population, will be labeled with the superscript “id”. For example, $\mathbf{Y}^{ob}$, $\mathbf{Y}^{un}$ and $\mathbf{Y}^{id}$ refer to the outcomes Y in the observed sample, an unobserved sample and an ideal sample respectively. However, there are some exceptions: A true parameter will not have a superscript. For example, the true regression coefficient of W is symbolized by $\delta_w$. The threshold of the regression coefficient of W for making an inference is denoted by $\delta_w^{\#}$. The second notation rule is that a subscript of a sample statistic describes the variable(s) to which this sample statistic pertains. For instance, the sample covariance between the treatment indicator W and the outcome Y in the observed sample is denoted by $\hat{\sigma}_{WY}^{ob}$ and its counterpart in an unobserved sample is denoted by $\hat{\sigma}_{WY}^{un}$.

2.2-The frequentist recipe

The logical flow of any robustness index always starts at the definition of bias for causal inference. The formal definition of bias for regression-based causal inference, i.e., for the regression coefficient representing an estimate of the average treatment effect, is:

$$\boldsymbol{\beta} = E[\hat{\delta}_w^{ob}] - E[\hat{\delta}_w^{id}] = \{E[Y \mid \mathbf{Z}, W=1]^{ob} - E[Y \mid \mathbf{Z}, W=0]^{ob}\} - \{E[Y \mid \mathbf{Z}, W=1]^{id} - E[Y \mid \mathbf{Z}, W=0]^{id}\} \tag{2.1}$$

In the above definition, $\hat{\delta}_w^{ob}$ is the estimated regression coefficient of the treatment indicator W based on the observed sample and $\hat{\delta}_w^{id}$ is the estimated regression coefficient of the treatment indicator W based on an ideal sample.
Since the observed sample is randomly sampled from the observed part of the ideal population, $\hat{\delta}_w^{ob}$ should be unbiased for the average treatment effect conditional on the covariates X in the observed part of the ideal population, which is represented by the difference within the first pair of curly brackets. Furthermore, as an ideal sample is conceptualized as a random sample from the ideal population, $\hat{\delta}_w^{id}$ should be unbiased for the average treatment effect conditional on the covariates X in the whole ideal population, which is represented by the difference within the second pair of curly brackets.

The next stage is adjusting the definition (2.1) to an empirical research setting where only the observed sample is available. This implies that the observed sample should be treated as fixed. In addition, I assume an ideal sample and an unobserved sample are both known for this frequentist recipe. In light of this principle, the bias for regression-based causal inference becomes:

$$\boldsymbol{\beta} \mid \mathbf{Y}^{id}, \mathbf{X}^{id} = \hat{\delta}_w^{ob} - (\delta_w \mid \mathbf{Y}^{id}, \mathbf{X}^{id}) \tag{2.2}$$

The first term in definition (2.2) is $\hat{\delta}_w^{ob}$, which is fixed and is reported by most empirical research conducting regression-based causal inference. The second term in (2.2) is a random variable characterizing the conditional distribution of the regression coefficient of W based on a known ideal sample $(\mathbf{Y}^{id}, \mathbf{X}^{id})$. This random variable takes into consideration the random sampling error associated with this known ideal sample, just as one can derive the distribution of any regression coefficient from a given observed sample to reflect the uncertainty in sampling and thereby conduct a t-test. The randomness of this random variable is mostly due to the imaginary nature of an unobserved sample. According to (2.2) and Frank et al.
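(2013), an inference will be invalid if:

$$\begin{aligned}
\boldsymbol{\beta} = \hat{\delta}_w^{ob} - (\delta_w \mid \mathbf{Y}^{id}, \mathbf{X}^{id}) &> \hat{\delta}_w^{ob} - \delta_w^{\#} && \text{for inferring a positive effect} \\
\boldsymbol{\beta} = \hat{\delta}_w^{ob} - (\delta_w \mid \mathbf{Y}^{id}, \mathbf{X}^{id}) &< \hat{\delta}_w^{ob} - \delta_w^{\#} && \text{for inferring a negative effect}
\end{aligned} \tag{2.3}$$

Or equivalently, because both $\hat{\delta}_w^{ob}$ and $\delta_w^{\#}$ are constants:

$$\begin{aligned}
\delta_w < \delta_w^{\#} \mid \mathbf{Y}^{id}, \mathbf{X}^{id} &\quad \text{for inferring a positive effect} \\
\delta_w > \delta_w^{\#} \mid \mathbf{Y}^{id}, \mathbf{X}^{id} &\quad \text{for inferring a negative effect}
\end{aligned} \tag{2.4}$$

Drawing on the decision rules (2.4) and the distribution of $\delta_w^{id}$, I propose the following probabilities of invalidating an inference as the robustness indices of regression-based causal inference:

$$\begin{aligned}
P(\delta_w < \delta_w^{\#} \mid \mathbf{Y}^{id}, \mathbf{X}^{id}) &\quad \text{for inferring a positive effect} \\
P(\delta_w > \delta_w^{\#} \mid \mathbf{Y}^{id}, \mathbf{X}^{id}) &\quad \text{for inferring a negative effect}
\end{aligned} \tag{2.5}$$

To express the distribution of $\delta_w$ conditional on an ideal sample and the probability of invalidating an inference explicitly, I formulate the classical linear regression model (CLRM) for the observed sample next:

$$\mathbf{Y}^{ob} = \mathbf{X}^{ob} \boldsymbol{\gamma} + \boldsymbol{\varepsilon}^{ob}, \qquad \boldsymbol{\varepsilon}^{ob} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_{n^{ob}}) \tag{2.6}$$

The CLRM in (2.6) should look familiar to most empirical researchers. For (2.6), the residuals are denoted as $\boldsymbol{\varepsilon}^{ob}$ and the observed sample size is $n^{ob}$. Moreover, the residual variance is $\sigma^2$ and is assumed to be estimated in this context.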
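Based on the CLRM displayed in (2.6), the least squares estimates of the regression coefficients (i.e., $\hat{\boldsymbol{\gamma}}^{ob}$) and a multivariate distribution of the regression coefficients conditional on this observed sample (i.e., $\boldsymbol{\gamma} \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob}$) can be shown as follows:

$$\begin{aligned}
\hat{\boldsymbol{\gamma}}^{ob} &= ((\mathbf{X}^{ob})^T \mathbf{X}^{ob})^{-1} (\mathbf{X}^{ob})^T \mathbf{Y}^{ob} \\
\boldsymbol{\gamma} \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob} &\sim N\big(((\mathbf{X}^{ob})^T \mathbf{X}^{ob})^{-1} (\mathbf{X}^{ob})^T \mathbf{Y}^{ob},\ \sigma^2 ((\mathbf{X}^{ob})^T \mathbf{X}^{ob})^{-1}\big)
\end{aligned} \tag{2.7}$$

Analogously, the CLRM for an unobserved sample, the least squares estimates of the regression coefficients for this unobserved sample and the distribution of the regression coefficients conditional on this unobserved sample are formulated as below:

$$\begin{aligned}
\mathbf{Y}^{un} &= \mathbf{X}^{un} \boldsymbol{\gamma} + \boldsymbol{\varepsilon}^{un}, \qquad \boldsymbol{\varepsilon}^{un} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_{n^{un}}) \\
\hat{\boldsymbol{\gamma}}^{un} &= ((\mathbf{X}^{un})^T \mathbf{X}^{un})^{-1} (\mathbf{X}^{un})^T \mathbf{Y}^{un} \\
\boldsymbol{\gamma} \mid \mathbf{X}^{un}, \mathbf{Y}^{un} &\sim N\big(((\mathbf{X}^{un})^T \mathbf{X}^{un})^{-1} (\mathbf{X}^{un})^T \mathbf{Y}^{un},\ \sigma^2 ((\mathbf{X}^{un})^T \mathbf{X}^{un})^{-1}\big)
\end{aligned} \tag{2.8}$$

It is worth noting that this unobserved sample has a sample size $n^{un}$ which is likely to be different from the observed sample size. However, the residual variance for this CLRM is still $\sigma^2$, which means that the residual variances in the CLRMs for the observed sample and the unobserved sample are equal. This is a core assumption for the derivation of robustness indices and is therefore maintained throughout this paper.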
Finally, the ideal sample is formed by combining the observed and unobserved samples mentioned in the above CLRMs, and the CLRM with regard to this ideal sample can be defined similarly to (2.6) through (2.8):

$$\begin{aligned}
\mathbf{Y}^{id} &= \mathbf{X}^{id} \boldsymbol{\gamma} + \boldsymbol{\varepsilon}^{id}, \qquad \boldsymbol{\varepsilon}^{id} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_{(n^{un}+n^{ob})}) \\
\hat{\boldsymbol{\gamma}}^{id} &= ((\mathbf{X}^{id})^T \mathbf{X}^{id})^{-1} (\mathbf{X}^{id})^T \mathbf{Y}^{id} \\
\boldsymbol{\gamma} \mid \mathbf{X}^{id}, \mathbf{Y}^{id} &\sim N\big(((\mathbf{X}^{id})^T \mathbf{X}^{id})^{-1} (\mathbf{X}^{id})^T \mathbf{Y}^{id},\ \sigma^2 ((\mathbf{X}^{id})^T \mathbf{X}^{id})^{-1}\big)
\end{aligned} \tag{2.9}$$

where the data matrices $\mathbf{Y}^{id}$ and $\mathbf{X}^{id}$ can be understood as block matrices which have the following structures:

$$\mathbf{X}^{id} = \begin{bmatrix} \mathbf{X}^{un} \\ \mathbf{X}^{ob} \end{bmatrix}_{(n^{un}+n^{ob}) \times (p+2)} \qquad \mathbf{Y}^{id} = \begin{bmatrix} \mathbf{Y}^{un} \\ \mathbf{Y}^{ob} \end{bmatrix}_{(n^{un}+n^{ob}) \times 1} \tag{2.10}$$

Therefore, the probability of invalidating an inference can be readily computed from the distribution of $\boldsymbol{\gamma} \mid \mathbf{Y}^{id}, \mathbf{X}^{id}$ in (2.9), provided one can randomly draw an unobserved sample from the unobserved part of the ideal population. Unfortunately, a robustness index is useful only when an unobserved sample is unavailable, and this contradicts the premise of the frequentist recipe. For this reason, I turn to the Bayesian recipe next.

2.3-The Bayesian recipe

The Bayesian recipe starts with a modified version of the definition of bias, which defines the bias for regression-based causal inference as the difference between the least squares estimate of the regression coefficient of the treatment indicator W and the random variable that follows the posterior distribution of the regression coefficient of W:

$$\boldsymbol{\beta} = \hat{\delta}_w^{ob} - (\delta_w \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}) \tag{2.11}$$

The decision rules in the Bayesian recipe for deciding whether an inference is invalid are almost the same as their parallels in the frequentist recipe.
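An inference will be invalid if one of the following conditions is true:

$$\begin{aligned}
\boldsymbol{\beta} = \hat{\delta}_w^{ob} - (\delta_w \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}) &> \hat{\delta}_w^{ob} - \delta_w^{\#} && \text{for inferring a positive effect} \\
\boldsymbol{\beta} = \hat{\delta}_w^{ob} - (\delta_w \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}) &< \hat{\delta}_w^{ob} - \delta_w^{\#} && \text{for inferring a negative effect}
\end{aligned} \tag{2.12}$$

Consequently, the probability of invalidating an inference is accessible through the posterior distribution $\delta_w \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}$, and it is generally:

$$\begin{aligned}
P(\delta_w < \delta_w^{\#} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}) &\quad \text{for inferring a positive effect} \\
P(\delta_w > \delta_w^{\#} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}) &\quad \text{for inferring a negative effect}
\end{aligned} \tag{2.13}$$

In the Bayesian recipe, the bias for regression-based causal inference is built solely on the observed sample. This does not result in a discrepancy between the frequentist recipe and the Bayesian recipe, so long as one chooses to model the prior as the distribution of a focal parameter based on an unobserved sample. To demonstrate this relationship between the two recipes as well as formalize the posterior distribution, the Bayesian model of regression-based causal inference is provided as follows:

$$\begin{aligned}
\boldsymbol{\gamma} &\sim N\big(((\mathbf{X}^{un})^T \mathbf{X}^{un})^{-1} (\mathbf{X}^{un})^T \mathbf{Y}^{un},\ \sigma^2 ((\mathbf{X}^{un})^T \mathbf{X}^{un})^{-1}\big) \\
Y_i \mid \boldsymbol{\gamma}, \mathbf{X}_i &\sim N(\mathbf{X}_i \boldsymbol{\gamma}, \sigma^2) \\
\boldsymbol{\gamma} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob} &\sim N(\boldsymbol{\theta}_{\gamma}, \boldsymbol{\Phi}_{\gamma})
\end{aligned} \tag{2.14}$$

where:

$$\begin{aligned}
\boldsymbol{\theta}_{\gamma} &= ((\mathbf{X}^{un})^T \mathbf{X}^{un} + (\mathbf{X}^{ob})^T \mathbf{X}^{ob})^{-1} ((\mathbf{X}^{un})^T \mathbf{Y}^{un} + (\mathbf{X}^{ob})^T \mathbf{Y}^{ob}) \\
\boldsymbol{\Phi}_{\gamma} &= \sigma^2 ((\mathbf{X}^{un})^T \mathbf{X}^{un} + (\mathbf{X}^{ob})^T \mathbf{X}^{ob})^{-1}
\end{aligned} \tag{2.15}$$

There is nothing special about the formulation of this Bayesian model except the parameterization of the prior distribution. The prior distribution in (2.14) is identical to the distribution of regression coefficients conditional on an unobserved sample, as specified in (2.8). This is in accordance with the Bayesian framework propounded by Frank & Min (2007), given that the prior is defined as a distribution of regression coefficients conditional on an unobserved sample and the likelihood function is only fit to the observed sample.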
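To make the Bayesian recipe concrete, the following Python sketch computes the posterior mean and covariance of the regression coefficients from an imagined unobserved sample and an observed sample, and then the probability of invalidating an inference of a positive effect. It is a sketch under stated assumptions: both samples are simulated, the residual variance is treated as known and equal to 1, the threshold value is arbitrary, and the helper names (`make_sample`, `normal_cdf`) are hypothetical.

```python
import math
import numpy as np

def make_sample(rng, n, effect, p=2):
    """Simulate one sample with X = [1, Z, W]; all values are hypothetical."""
    Z = rng.normal(size=(n, p))
    W = rng.integers(0, 2, size=n)
    X = np.column_stack([np.ones(n), Z, W])
    Y = 1.0 + effect * W + Z @ np.full(p, 0.5) + rng.normal(size=n)
    return X, Y

rng = np.random.default_rng(1)
X_ob, Y_ob = make_sample(rng, n=200, effect=0.6)   # stands in for the observed sample
X_un, Y_un = make_sample(rng, n=100, effect=0.0)   # imagined unobserved sample with a null effect
sigma2 = 1.0                                       # residual variance, assumed known here

# Posterior mean and covariance when the prior is the coefficient distribution
# implied by the unobserved sample and the likelihood uses the observed sample.
A = X_un.T @ X_un + X_ob.T @ X_ob
theta = np.linalg.solve(A, X_un.T @ Y_un + X_ob.T @ Y_ob)
Phi = sigma2 * np.linalg.inv(A)

# The coefficient of W is the last element; its marginal posterior is normal.
post_mean = float(theta[-1])
post_sd = math.sqrt(Phi[-1, -1])

def normal_cdf(x, mean, sd):
    """Normal CDF via the error function (avoids extra dependencies)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

threshold = 0.2   # hypothetical threshold for making the inference
# Probability that the coefficient of W falls below the threshold,
# i.e., the probability of invalidating an inference of a positive effect.
p_invalid = normal_cdf(threshold, post_mean, post_sd)
print(round(post_mean, 3), round(post_sd, 3), round(p_invalid, 3))
```

For an inference of a negative effect, the corresponding quantity would be `1 - normal_cdf(threshold, post_mean, post_sd)`. Changing the effect or the size of the imagined unobserved sample shows how the probability responds to one's belief about that sample.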
The term $\sigma^2$, which denotes the residual variance, appears in both the prior and the likelihood function, which reflects the assumption that the classical linear regression models underlying the prior and the likelihood function are restricted to have the same known residual variance. Most interestingly, the following equations hold because (2.10) shows that the data matrices contained in an ideal sample can be written as block matrices:

$$\begin{aligned}
(\mathbf{X}^{id})^T \mathbf{X}^{id} &= (\mathbf{X}^{un})^T \mathbf{X}^{un} + (\mathbf{X}^{ob})^T \mathbf{X}^{ob} \\
(\mathbf{X}^{id})^T \mathbf{Y}^{id} &= (\mathbf{X}^{un})^T \mathbf{Y}^{un} + (\mathbf{X}^{ob})^T \mathbf{Y}^{ob}
\end{aligned} \tag{2.16}$$

Once the results in (2.16) are plugged into the expressions of the posterior mean and variance in (2.15), the posterior distribution becomes:

$$\boldsymbol{\gamma} \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob} \sim N\big(((\mathbf{X}^{id})^T \mathbf{X}^{id})^{-1} (\mathbf{X}^{id})^T \mathbf{Y}^{id},\ \sigma^2 ((\mathbf{X}^{id})^T \mathbf{X}^{id})^{-1}\big) \tag{2.17}$$

What (2.17) shows is that the posterior distribution will be identical to the distribution of regression coefficients conditional on an ideal sample when one parameterizes the prior distribution as if it is a distribution of regression coefficients conditional on an unobserved sample. However, I caution readers that the frequentist recipe and the Bayesian recipe are conceptually and empirically distinct even though they both arrive at the same model of robustness indices. Conceptually, the frequentist recipe requires an unobserved sample to be a real one, while the Bayesian recipe only considers an unobserved sample as one's subjective belief and shapes this belief into a prior distribution. Empirically, the frequentist approach is infeasible since no unobserved sample is available, whereas the Bayesian approach is practical in the sense that one only needs to transform one's belief about an unobserved sample into a prior distribution so as to make the corresponding posterior distribution qualify as a distribution based on an imaginary ideal sample.
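The equivalence between the posterior and the ideal-sample distribution can be checked numerically. The sketch below is an illustration on simulated data with hypothetical names and coefficient values; it stacks an imagined unobserved sample on top of an observed sample, verifies the two block-matrix identities, and compares the posterior mean with the least squares fit on the stacked sample.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_sample(n, p=2):
    """Simulate one sample with X = [1, Z, W]; the coefficients are hypothetical."""
    Z = rng.normal(size=(n, p))
    W = rng.integers(0, 2, size=n)
    X = np.column_stack([np.ones(n), Z, W])
    Y = X @ np.array([1.0, 0.5, -0.5, 0.3]) + rng.normal(size=n)
    return X, Y

X_un, Y_un = make_sample(80)     # imagined unobserved sample
X_ob, Y_ob = make_sample(150)    # observed sample

# Ideal sample: unobserved rows stacked on top of observed rows.
X_id = np.vstack([X_un, X_ob])
Y_id = np.concatenate([Y_un, Y_ob])

# Cross-product matrices of the stacked data are sums of the per-sample ones.
gram_ok = np.allclose(X_id.T @ X_id, X_un.T @ X_un + X_ob.T @ X_ob)
cross_ok = np.allclose(X_id.T @ Y_id, X_un.T @ Y_un + X_ob.T @ Y_ob)

# Posterior mean built from the two samples versus least squares on the ideal sample.
theta = np.linalg.solve(X_un.T @ X_un + X_ob.T @ X_ob,
                        X_un.T @ Y_un + X_ob.T @ Y_ob)
gamma_hat_id, *_ = np.linalg.lstsq(X_id, Y_id, rcond=None)
means_ok = np.allclose(theta, gamma_hat_id)

print(gram_ok, cross_ok, means_ok)
```

The check only concerns the algebra; it says nothing about whether a particular imagined unobserved sample is a reasonable belief, which remains a substantive judgment.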
It is worth noting that a belief about an unobserved sample is often constantly changing rather than fixed, and in this case the learning goal of the Bayesian recipe is to determine the relationship between the probability of invalidating an inference and the prior parameters, which will be the theme of subsequent sections.

3-Bayesian models of robustness indices for raw data

3.1-Data and the sample statistics

This section primarily focuses on the derivation of the posterior distribution discussed earlier in the unifying Bayesian framework, as if we had collected raw data for both the unobserved sample and the observed sample. By saying “raw data is collected for an unobserved sample”, I point to the construction of a prior distribution which is conceptualized as the distribution of regression coefficients conditional on the imaginary raw data for this unobserved sample. The target is to express the posterior mean and variance as functions of sample statistics built on either an imaginary unobserved sample or the observed sample. To be aligned with the settings of the earlier discussion, the raw data for the observed sample should be in the following form:

$$\begin{aligned}
\mathbf{D}^{ob} &= [\mathbf{Y}_{n^{ob} \times 1},\ \mathbf{X}_{n^{ob} \times (p+2)}] \\
\mathbf{X}^{ob} &= [\mathbf{1}_{n^{ob} \times 1},\ \mathbf{V}_{n^{ob} \times (p+1)}] \\
\mathbf{V}^{ob} &= [\mathbf{Z}_{n^{ob} \times p},\ \mathbf{W}_{n^{ob} \times 1}] \\
\mathbf{Z}^{ob} &= [Z_1, Z_2, \ldots, Z_p]_{n^{ob} \times p}
\end{aligned} \tag{3.1}$$

Some notations in (3.1) need explanation: D refers to the whole data and it is composed of a data vector of the outcome Y and a data matrix of all the predictors X. The data matrix X has two parts: The first part is a constant vector 1 and the second part is the group of predictors V. I further decompose V into a data matrix Z which only includes the pure covariates and the vector of treatment indicators W. The matrix Z has p columns, which means that there are p pure covariates in the raw data.
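By analogy, the raw data for an imaginary unobserved sample is structured as follows:

$$\begin{aligned}
\mathbf{D}^{un} &= [\mathbf{Y}_{n^{un} \times 1},\ \mathbf{X}_{n^{un} \times (p+2)}] \\
\mathbf{X}^{un} &= [\mathbf{1}_{n^{un} \times 1},\ \mathbf{V}_{n^{un} \times (p+1)}] \\
\mathbf{V}^{un} &= [\mathbf{Z}_{n^{un} \times p},\ \mathbf{W}_{n^{un} \times 1}] \\
\mathbf{Z}^{un} &= [Z_1, Z_2, \ldots, Z_p]_{n^{un} \times p}
\end{aligned} \tag{3.2}$$

Finally, the observed sample and an imaginary unobserved sample are consolidated to create an ideal sample which is structured as below (see (2.10) for a reference):

$$\begin{aligned}
\mathbf{D}^{id} &= [\mathbf{Y}_{(n^{un}+n^{ob}) \times 1},\ \mathbf{X}_{(n^{un}+n^{ob}) \times (p+2)}] \\
\mathbf{X}^{id} &= [\mathbf{1}_{(n^{un}+n^{ob}) \times 1},\ \mathbf{V}_{(n^{un}+n^{ob}) \times (p+1)}] \\
\mathbf{V}^{id} &= [\mathbf{Z}_{(n^{un}+n^{ob}) \times p},\ \mathbf{W}_{(n^{un}+n^{ob}) \times 1}] \\
\mathbf{Z}^{id} &= [Z_1, Z_2, \ldots, Z_p]_{(n^{un}+n^{ob}) \times p}
\end{aligned} \tag{3.3}$$

Next, some key sample statistics are introduced based on the aforementioned raw data forms for the unobserved, observed and ideal samples. First, I define the sample mean vectors for the unobserved, observed and ideal samples as:

$$\begin{aligned}
\bar{\mathbf{X}}^{id} &= [1, \bar{Z}_1^{id}, \bar{Z}_2^{id}, \ldots, \bar{Z}_p^{id}, \bar{W}^{id}]_{1 \times (p+2)} & \bar{\mathbf{Z}}^{id} &= [\bar{Z}_1^{id}, \bar{Z}_2^{id}, \ldots, \bar{Z}_p^{id}]_{1 \times p} \\
\bar{\mathbf{X}}^{un} &= [1, \bar{Z}_1^{un}, \bar{Z}_2^{un}, \ldots, \bar{Z}_p^{un}, \bar{W}^{un}]_{1 \times (p+2)} & \bar{\mathbf{Z}}^{un} &= [\bar{Z}_1^{un}, \bar{Z}_2^{un}, \ldots, \bar{Z}_p^{un}]_{1 \times p} \\
\bar{\mathbf{X}}^{ob} &= [1, \bar{Z}_1^{ob}, \bar{Z}_2^{ob}, \ldots, \bar{Z}_p^{ob}, \bar{W}^{ob}]_{1 \times (p+2)} & \bar{\mathbf{Z}}^{ob} &= [\bar{Z}_1^{ob}, \bar{Z}_2^{ob}, \ldots, \bar{Z}_p^{ob}]_{1 \times p}
\end{aligned} \tag{3.4}$$

Recall that my notation rule specifies that a superscript of a statistical term denotes the kind of sample on which it is computed and a subscript of a statistical term denotes the variable(s) it pertains to.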
To abide by this notation rule, the variance-covariance matrix of all the predictors in V for an ideal sample is written as follows:

$$\mathbf{S}_{VV}^{id} = \begin{bmatrix} \mathbf{S}_{ZZ}^{id} & \mathbf{S}_{ZW}^{id} \\ \mathbf{S}_{WZ}^{id} & \hat{\sigma}_{WW}^{id} \end{bmatrix}_{(p+1) \times (p+1)} \tag{3.5}$$

where:

$$\mathbf{S}_{ZZ}^{id} = \begin{bmatrix} \hat{\sigma}_{Z_1 Z_1}^{id} & \cdots & \hat{\sigma}_{Z_1 Z_p}^{id} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{Z_p Z_1}^{id} & \cdots & \hat{\sigma}_{Z_p Z_p}^{id} \end{bmatrix}_{p \times p} \qquad \mathbf{S}_{ZW}^{id} = \begin{bmatrix} \hat{\sigma}_{Z_1 W}^{id} \\ \vdots \\ \hat{\sigma}_{Z_p W}^{id} \end{bmatrix}_{p \times 1} \qquad \mathbf{S}_{WZ}^{id} = \begin{bmatrix} \hat{\sigma}_{Z_1 W}^{id} & \cdots & \hat{\sigma}_{Z_p W}^{id} \end{bmatrix}_{1 \times p} \tag{3.6}$$

Likewise, the vector of covariances between the predictors in V and the outcome Y for an ideal sample is generically written as below:

$$\mathbf{S}_{VY}^{id} = \begin{bmatrix} \mathbf{S}_{ZY}^{id} \\ \hat{\sigma}_{WY}^{id} \end{bmatrix}_{(p+1) \times 1} \tag{3.7}$$

where:

$$\mathbf{S}_{ZY}^{id} = \begin{bmatrix} \hat{\sigma}_{Z_1 Y}^{id} \\ \vdots \\ \hat{\sigma}_{Z_p Y}^{id} \end{bmatrix}_{p \times 1} \tag{3.8}$$

All aforementioned sample covariances and variances are supposed to be computed according to the following formulas:

$$\hat{\sigma}_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad \hat{\sigma}_{xx} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3.9}$$

I emphasize here that the x, y and n in (3.9) are all symbolic and their numeric values depend on the actual context. The x could represent any variable in D except the constant vector 1, and the y, when calculating a covariance, could be any variable in D other than the variable represented by x. The n is the size of a sample, which could be an unobserved one, the observed one, or, in combination, an ideal one. For example, the treatment indicator W could be the x and the outcome Y could be the y, and consequently one would obtain the covariance between W and Y for an ideal sample by replacing the sample size n with the actual ideal sample size $n^{ob} + n^{un}$.

In summary, the notation rule will generally guide readers in interpreting a statistical term, especially a sample variance or covariance, that appears later in this paper.
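As a small illustration of the covariance formula with divisor n, the following Python sketch computes the covariance between W and Y for an ideal sample in two ways: directly from the stacked data, and from the sample sizes, means and covariances of the two subsamples. The second route relies on a standard algebraic identity for pooling moments; the data are simulated and the function name `cov_n` is hypothetical.

```python
import numpy as np

def cov_n(x, y):
    """Sample covariance with divisor n (not n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x - x.mean()) * (y - y.mean())))

rng = np.random.default_rng(3)
n_un, n_ob = 40, 60                          # hypothetical subsample sizes
W_un = rng.integers(0, 2, size=n_un)
Y_un = rng.normal(size=n_un)
W_ob = rng.integers(0, 2, size=n_ob)
Y_ob = 0.5 * W_ob + rng.normal(size=n_ob)

# Direct computation on the stacked (ideal) sample, with n = n_un + n_ob.
W_id = np.concatenate([W_un, W_ob])
Y_id = np.concatenate([Y_un, Y_ob])
sigma_WY_id = cov_n(W_id, Y_id)

# The same quantity from subsample statistics only.
n_id = n_un + n_ob
pooled = ((n_un * cov_n(W_un, Y_un) + n_ob * cov_n(W_ob, Y_ob)) / n_id
          + (n_un * W_un.mean() * Y_un.mean() + n_ob * W_ob.mean() * Y_ob.mean()) / n_id
          - W_id.mean() * Y_id.mean())

print(round(sigma_WY_id, 6), round(float(pooled), 6))
```

Setting both arguments to the same variable, for example `cov_n(W_id, W_id)`, gives the corresponding sample variance with divisor n.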
Although only the variance-covariance matrix of V and the covariance vector between V and Y for an ideal sample are discussed in (3.5) through (3.8), the variance-covariance matrix of V and the covariances between V and Y can be written down almost identically for an unobserved sample or the observed sample; the only change needed is to modify the superscript of every variance and covariance term.

3.2-The posterior distribution of $\beta_w$ for raw data

The posterior distribution of $\beta_w$, i.e., the regression coefficient of the treatment indicator W, is of paramount value for regression-based causal inference since it is the ground on which the inference of a true average treatment effect is carried out. The following theorem bridges this posterior distribution and the sample statistics by recasting its mean and variance as functions of sample variances and covariances for unobserved and observed samples:

Theorem 1. Suppose the CLRMs for unobserved, observed and ideal samples as presented in (2.6) through (2.9) are true and the raw data is in the same format as outlined in (3.1) through (3.3). Then the posterior distribution of $\beta_w$ within the Bayesian framework proposed in (2.14) through (2.17) is as follows:

\[
\beta_w \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob} \sim N\!\left(
\frac{\hat{\sigma}^{id}_{WY} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}}
     {\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}},\
\frac{\sigma^2}{n^{un}+n^{ob}}
\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)^{-1}
\right)
\tag{3.10}
\]

where:

\[
\begin{aligned}
\hat{\sigma}^{id}_{Z_i Z_j} &= \frac{n^{un}\hat{\sigma}^{un}_{Z_i Z_j} + n^{ob}\hat{\sigma}^{ob}_{Z_i Z_j}}{n^{un}+n^{ob}}
 + \frac{n^{un}\bar{Z}_i^{un}\bar{Z}_j^{un} + n^{ob}\bar{Z}_i^{ob}\bar{Z}_j^{ob}}{n^{un}+n^{ob}}
 - \bar{Z}_i^{id}\bar{Z}_j^{id} && \text{for } i, j = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{W Z_i} &= \frac{n^{un}\hat{\sigma}^{un}_{W Z_i} + n^{ob}\hat{\sigma}^{ob}_{W Z_i}}{n^{un}+n^{ob}}
 + \frac{n^{un}\bar{W}^{un}\bar{Z}_i^{un} + n^{ob}\bar{W}^{ob}\bar{Z}_i^{ob}}{n^{un}+n^{ob}}
 - \bar{W}^{id}\bar{Z}_i^{id} && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WW} &= \frac{n^{un}\hat{\sigma}^{un}_{WW} + n^{ob}\hat{\sigma}^{ob}_{WW}}{n^{un}+n^{ob}}
 + \frac{n^{un}(\bar{W}^{un})^2 + n^{ob}(\bar{W}^{ob})^2}{n^{un}+n^{ob}}
 - (\bar{W}^{id})^2 \\
\hat{\sigma}^{id}_{Z_i Y} &= \frac{n^{un}\hat{\sigma}^{un}_{Z_i Y} + n^{ob}\hat{\sigma}^{ob}_{Z_i Y}}{n^{un}+n^{ob}}
 + \frac{n^{un}\bar{Z}_i^{un}\bar{Y}^{un} + n^{ob}\bar{Z}_i^{ob}\bar{Y}^{ob}}{n^{un}+n^{ob}}
 - \bar{Z}_i^{id}\bar{Y}^{id} && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WY} &= \frac{n^{un}\hat{\sigma}^{un}_{WY} + n^{ob}\hat{\sigma}^{ob}_{WY}}{n^{un}+n^{ob}}
 + \frac{n^{un}\bar{W}^{un}\bar{Y}^{un} + n^{ob}\bar{W}^{ob}\bar{Y}^{ob}}{n^{un}+n^{ob}}
 - \bar{W}^{id}\bar{Y}^{id}
\end{aligned}
\tag{3.11}
\]

(Proof in Appendix A. An additional proof of the equivalence between theorem 1 and some common expressions of regression coefficients is provided in Appendix B, to demonstrate how the Bayesian paradigm of robustness indices is connected to regression coefficients, semicorrelations and partial correlations.)

The equations below show how to compute the ideal sample means that appear in (3.11):

\[
\begin{aligned}
\bar{Z}_i^{id} &= \frac{n^{un}\bar{Z}_i^{un} + n^{ob}\bar{Z}_i^{ob}}{n^{un}+n^{ob}} \quad \text{for } i = 1, 2, \ldots, p \\
\bar{W}^{id} &= \frac{n^{un}\bar{W}^{un} + n^{ob}\bar{W}^{ob}}{n^{un}+n^{ob}} \\
\bar{Y}^{id} &= \frac{n^{un}\bar{Y}^{un} + n^{ob}\bar{Y}^{ob}}{n^{un}+n^{ob}}
\end{aligned}
\tag{3.12}
\]

The formulas in (3.11) have a tidier form, as described next (see details in the Appendix of Frank & Min (2007)):

\[
\begin{aligned}
\hat{\sigma}^{id}_{Z_i Z_j} &= \pi\hat{\sigma}^{un}_{Z_i Z_j} + (1-\pi)\hat{\sigma}^{ob}_{Z_i Z_j} + (1-\pi)\pi(\bar{Z}_i^{ob} - \bar{Z}_i^{un})(\bar{Z}_j^{ob} - \bar{Z}_j^{un}) && \text{for } i, j = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{W Z_i} &= \pi\hat{\sigma}^{un}_{W Z_i} + (1-\pi)\hat{\sigma}^{ob}_{W Z_i} + (1-\pi)\pi(\bar{W}^{ob} - \bar{W}^{un})(\bar{Z}_i^{ob} - \bar{Z}_i^{un}) && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WW} &= \pi\hat{\sigma}^{un}_{WW} + (1-\pi)\hat{\sigma}^{ob}_{WW} + (1-\pi)\pi(\bar{W}^{ob} - \bar{W}^{un})^2 \\
\hat{\sigma}^{id}_{Z_i Y} &= \pi\hat{\sigma}^{un}_{Z_i Y} + (1-\pi)\hat{\sigma}^{ob}_{Z_i Y} + (1-\pi)\pi(\bar{Z}_i^{ob} - \bar{Z}_i^{un})(\bar{Y}^{ob} - \bar{Y}^{un}) && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WY} &= \pi\hat{\sigma}^{un}_{WY} + (1-\pi)\hat{\sigma}^{ob}_{WY} + (1-\pi)\pi(\bar{W}^{ob} - \bar{W}^{un})(\bar{Y}^{ob} - \bar{Y}^{un})
\end{aligned}
\tag{3.13}
\]

where:

\[
\pi = \frac{n^{un}}{n^{un}+n^{ob}}
\tag{3.14}
\]

The above formulas, equations and distribution constitute the entire theorem 1, which unveils a cardinal perspective on the evaluation of robustness of regression-based causal inference. First, as suggested by (2.17), the posterior distribution of $\beta_w$ could be conceptualized as the distribution of $\beta_w$ when a whole ideal sample is available. However, such conceptualization hinges on two assumptions. The first is that a CLRM assumption can be made for observed, unobserved and ideal samples. The second is that the residual variances in the CLRMs for observed, unobserved and ideal samples are all equal and known.
Second, theorem 1 shows that the mean and variance of the posterior distribution of $\beta_w$ are functions of sample variances and covariances for an ideal sample, which can be further expressed as functions of sample means, variances and covariances for observed and unobserved samples. Ultimately, by fixing the observed sample statistics (means, variances, covariances and sample size) as well as the residual variance, the posterior mean and variance of $\beta_w$ become functions of unobserved sample statistics alone, such as the size, means, variances and covariances of an unobserved sample. Essentially, those unobserved sample statistics are all parameters of the prior distribution of the unifying Bayesian framework proposed in the last section, and changing their values results in variations in the posterior distribution. This logic of the analysis of robustness is in line with Bayesian sensitivity analysis, which aims to check the influence of the prior distribution on the posterior distribution.

3.3-Probit models for the probability of invalidating an inference

I propose the probability of invalidating an inference, which is based on the posterior distribution of $\beta_w$ offered by theorem 1, as the robustness index for regression-based causal inference. Recall that the probability of invalidating an inference is either the posterior probability of $\beta_w < \beta_w^{\#}$ when inferring a positive effect or the posterior probability of $\beta_w > \beta_w^{\#}$ when inferring a negative effect. Given that the posterior distribution of $\beta_w$ is normal with mean and variance as definitive functions of sample statistics, the probability of invalidating an inference is expected to be a probit function of the sample statistics that appear in (3.10). For this reason, I turn to the next theorem:

Theorem 2. Assume the CLRMs for unobserved, observed and ideal samples that are shown in (2.6) through (2.9) are true and the raw data conforms to the structure defined in (3.1) through (3.3).
Moreover, I assume both the decision threshold $\beta_w^{\#}$ and the common residual variance $\sigma^2$ shared by all CLRMs are given. Then the following probit models hold for the probability of invalidating an inference (denoted by p):

For inferring a positive effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}}}
\left[\beta_w^{\#}\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)
 - \left(\hat{\sigma}^{id}_{WY} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}\right)\right]
\tag{3.15}
\]

For inferring a negative effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}}}
\left[\left(\hat{\sigma}^{id}_{WY} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}\right)
 - \beta_w^{\#}\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)\right]
\tag{3.16}
\]

(Proof in Appendix.)

Although the sample variances and covariances in the probit models above are all based on an ideal sample, they are in fact functions of sample means, variances and covariances based on unobserved and observed samples, as shown by (3.13). The analytical strategy for the probit models above is to isolate a small number of unobserved sample statistics as focal parameters in the analysis of robustness while holding all other observed and unobserved sample statistics fixed. The probit models may turn out to be linear or nonlinear functions of the focal parameters, depending on which focal parameters are chosen. For example, the probit models are linear functions of $\hat{\sigma}^{un}_{WY}$ if all other unobserved and observed sample statistics are held constant. An exemplary question to ask in this case would be “what does $\hat{\sigma}^{un}_{WY}$ need to be in order to make the probability of invalidating an inference smaller than a certain number (say 0.5), holding all other unobserved and observed sample statistics fixed?”. The ensuing subsection details this analytical strategy and discusses its implications for the two scenarios described at the beginning.
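To make the link between the posterior in (3.10) and the probit model in (3.15) concrete, here is a minimal sketch for the special case of a single covariate (p = 1), where the matrix terms reduce to scalars. The function names and all numeric inputs are hypothetical and chosen only for illustration. It evaluates the probability of invalidating an inference of a positive effect in two ways: through the probit expression and directly from the normal posterior.

```python
from math import sqrt
from statistics import NormalDist

def partial_terms(s_ww, s_wz, s_zz, s_zy, s_wy):
    """Scalar (p = 1) versions of the two bracketed terms in (3.10) and (3.15)."""
    d = s_ww - s_wz * s_wz / s_zz      # sigma_WW - S_WZ S_ZZ^{-1} S_ZW
    num = s_wy - s_wz * s_zy / s_zz    # sigma_WY - S_WZ S_ZZ^{-1} S_ZY
    return num, d

def probit_positive(beta_thr, n_id, sigma2, num, d):
    """Right-hand side of (3.15); n_id is the ideal sample size n_un + n_ob."""
    return sqrt(n_id) / (sqrt(sigma2) * sqrt(d)) * (beta_thr * d - num)

# Hypothetical ideal-sample statistics, residual variance and threshold.
beta_thr, n_id, sigma2 = 0.5, 120, 4.0
num, d = partial_terms(s_ww=0.25, s_wz=0.05, s_zz=2.0, s_zy=1.0, s_wy=0.3)

# Route 1: invert the probit link.
p_probit = NormalDist().cdf(probit_positive(beta_thr, n_id, sigma2, num, d))

# Route 2: posterior N(num / d, sigma2 / (n_id * d)) evaluated below the threshold.
posterior = NormalDist(mu=num / d, sigma=sqrt(sigma2 / (n_id * d)))
p_posterior = posterior.cdf(beta_thr)

print(p_probit, p_posterior)
```

Both routes should give the same probability, since (3.15) is simply the standardized distance between the threshold and the posterior mean.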
3.4-External validity and internal validity

It is important to choose a small but appropriate subset of parameters (as focal parameters) from all the sample statistics in the probit models. There are three main reasons for doing so. First, in some cases, some unobserved sample statistics are more meaningful, and therefore more suitable as focal parameters, than others. For example, we might be more interested in the covariance between pretest scores and the Open Court Reading (OCR) curriculum as well as the covariance between posttest scores and OCR in a possible sample of classrooms randomly drawn from the non-volunteered schools, as the pretest is the main covariate in the model and the posttest is the outcome. We might, instead, focus on the parameter $\pi$, i.e., the proportion of unobserved classrooms in an ideal sample of classrooms, while assuming the covariances between posttest scores and OCR as well as between pretest scores and OCR are both 0, which mimics a research question addressed by Frank et al. (2013). Second, the number of possible parameters (unobserved sample statistics) increases quadratically as the number of variables increases. My approach requires one to analyze one focal parameter at a time while holding all others fixed, and subsequently to report the thresholds of invalidating an inference at some levels of probability for this focal parameter. Therefore, it is impractical to treat all possible unobserved sample statistics as focal parameters, as doing so would make the analysis too complex to conduct and understand. The last reason concerns the preferences and constraints on parameters in the two scenarios where regression-based causal inference likely fails. In the first scenario, the targeted research is usually a quasi-experiment or observational study which lacks randomized assignment to groups.
Recall that an unobserved sample in this scenario is defined as the collection of counterfactual observations of all the subjects in the observed sample. A counterfactual observation, according to definition 1.2, is an observation in which everything is the same as the original observation except its values of the treatment indicator and the outcome. Therefore, an unobserved sample has a distinctive structure. If the observed sample in the first scenario has the following structure:

\[
\begin{aligned}
\mathbf{D}^{ob} &= [\mathbf{Y}_{n^{ob}\times 1},\ \mathbf{X}_{n^{ob}\times(p+2)}] \\
\mathbf{X}^{ob} &= [\mathbf{1}_{n^{ob}\times 1},\ \mathbf{V}_{n^{ob}\times(p+1)}] \\
\mathbf{V}^{ob} &= [\mathbf{Z}^{ob}_{n^{ob}\times p},\ \mathbf{W}^{ob}_{n^{ob}\times 1}] \\
\mathbf{Z}^{ob} &= [Z_1, Z_2, \ldots, Z_p]_{n^{ob}\times p}
\end{aligned}
\tag{3.17}
\]

then an unobserved sample will be in the following form:

\[
\begin{aligned}
\mathbf{D}^{un} &= [\mathbf{Y}_{n^{ob}\times 1},\ \mathbf{X}_{n^{ob}\times(p+2)}] \\
\mathbf{X}^{un} &= [\mathbf{1}_{n^{ob}\times 1},\ \mathbf{V}_{n^{ob}\times(p+1)}] \\
\mathbf{V}^{un} &= [\mathbf{Z}^{ob}_{n^{ob}\times p},\ \mathbf{1}_{n^{ob}\times 1} - \mathbf{W}^{ob}_{n^{ob}\times 1}] \\
\mathbf{Z}^{ob} &= [Z_1, Z_2, \ldots, Z_p]_{n^{ob}\times p}
\end{aligned}
\tag{3.18}
\]

As a result, an ideal sample is formed by stacking the data matrix of the unobserved sample over the data matrix of the observed sample:

\[
\begin{aligned}
\mathbf{D}^{id} &= [\mathbf{Y}_{2n^{ob}\times 1},\ \mathbf{X}_{2n^{ob}\times(p+2)}] \\
\mathbf{X}^{id} &= [\mathbf{1}_{2n^{ob}\times 1},\ \mathbf{V}_{2n^{ob}\times(p+1)}] \\
\mathbf{V}^{id} &=
\begin{bmatrix}
\mathbf{Z}^{ob}_{n^{ob}\times p} & \mathbf{1}_{n^{ob}\times 1} - \mathbf{W}^{ob}_{n^{ob}\times 1} \\
\mathbf{Z}^{ob}_{n^{ob}\times p} & \mathbf{W}^{ob}_{n^{ob}\times 1}
\end{bmatrix} \\
\mathbf{Z}^{ob} &= [Z_1, Z_2, \ldots, Z_p]_{n^{ob}\times p}
\end{aligned}
\tag{3.19}
\]

The data format defined in (3.17) through (3.19) has a notable difference from the format defined earlier: the data matrix of Z and the vector of W now carry a superscript “ob”, which signals that both are fixed as portions of the observed sample. For example, the values of the covariates in Z for an unobserved sample in this case are identical to $\mathbf{Z}^{ob}_{n^{ob}\times p}$, i.e., the observed values of the covariates in Z. What $\mathbf{1}_{n^{ob}\times 1} - \mathbf{W}^{ob}_{n^{ob}\times 1}$ indicates is that, in an unobserved sample, every treated subject in the observed sample chooses the control group and every control subject in the observed sample chooses the treatment group.
Raw data defined in this fashion have the following properties:

\[
\begin{aligned}
\mathbf{S}^{id}_{ZZ} &= \mathbf{S}^{un}_{ZZ} = \mathbf{S}^{ob}_{ZZ} \\
\hat{\sigma}^{id}_{WW} &= \hat{\sigma}^{un}_{WW} = \hat{\sigma}^{ob}_{WW} \\
n^{un} &= n^{ob} \\
\mathbf{S}^{id}_{WZ} &= \mathbf{0} \\
\bar{\mathbf{Z}}^{id} &= \bar{\mathbf{Z}}^{un} = \bar{\mathbf{Z}}^{ob} \\
\bar{W}^{un} &= 1 - \bar{W}^{ob} \\
\bar{W}^{id} &= 0.5
\end{aligned}
\tag{3.20}
\]

These constraints on the parameters show that the probit models are greatly simplified when the analysis of robustness is applied to research with limited internal validity, because an unobserved sample is composed of the counterfactuals of the actual observations. Specifically, the probit models (3.15) and (3.16) reduce in the first scenario as follows:

For inferring a positive effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{2n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{ob}_{WW}}}
\left[\beta_w^{\#}\hat{\sigma}^{ob}_{WW} - 0.5\hat{\sigma}^{un}_{WY} - 0.5\hat{\sigma}^{ob}_{WY} - (0.5\bar{W}^{ob} - 0.25)(\bar{Y}^{ob} - \bar{Y}^{un})\right]
\tag{3.21}
\]

For inferring a negative effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{2n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{ob}_{WW}}}
\left[0.5\hat{\sigma}^{un}_{WY} + 0.5\hat{\sigma}^{ob}_{WY} + (0.5\bar{W}^{ob} - 0.25)(\bar{Y}^{ob} - \bar{Y}^{un}) - \beta_w^{\#}\hat{\sigma}^{ob}_{WW}\right]
\tag{3.22}
\]

The above probit models are intended for research with limited internal validity. Notably, the only remaining unobserved parameters are the covariance between the outcome Y and the treatment indicator W in an unobserved sample (i.e., $\hat{\sigma}^{un}_{WY}$) and the unobserved sample mean of the outcome Y (i.e., $\bar{Y}^{un}$). The probit link of the probability of invalidating an inference is now a linear function of $\bar{Y}^{un}$ and $\hat{\sigma}^{un}_{WY}$. The analytical strategy regarding those probit models then becomes straightforward: one can either assume $\bar{Y}^{un}$ is fixed and identify the threshold of $\hat{\sigma}^{un}_{WY}$ that makes the probability of invalidating an inference smaller than a certain number (say 0.5), or locate the threshold of $\bar{Y}^{un}$ that makes the probability of invalidating an inference smaller than a certain number (say 0.5) conditional on a value of $\hat{\sigma}^{un}_{WY}$.
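Because the probit link for the first scenario is linear in the unobserved covariance between W and Y, its threshold has a closed form. The sketch below takes the positive-effect model (3.21) at face value and solves it for that covariance at a chosen probability level. The function names and every numeric setting are hypothetical, picked only to illustrate the calculation.

```python
from math import sqrt
from statistics import NormalDist

def probit_positive(s_wy_un, *, n_ob, sigma2, s_ww_ob, s_wy_ob,
                    w_bar_ob, y_bar_ob, y_bar_un, beta_thr):
    """Right-hand side of (3.21) as a function of the unobserved covariance s_wy_un."""
    scale = sqrt(2 * n_ob) / (sqrt(sigma2) * sqrt(s_ww_ob))
    shift = (0.5 * w_bar_ob - 0.25) * (y_bar_ob - y_bar_un)
    return scale * (beta_thr * s_ww_ob - 0.5 * s_wy_un - 0.5 * s_wy_ob - shift)

def threshold_s_wy_un(level, *, n_ob, sigma2, s_ww_ob, s_wy_ob,
                      w_bar_ob, y_bar_ob, y_bar_un, beta_thr):
    """Unobserved covariance at which the probability of invalidation equals `level`.

    The probit is decreasing in s_wy_un, so any larger covariance keeps the
    probability below `level`.
    """
    z = NormalDist().inv_cdf(level)
    scale = sqrt(2 * n_ob) / (sqrt(sigma2) * sqrt(s_ww_ob))
    shift = (0.5 * w_bar_ob - 0.25) * (y_bar_ob - y_bar_un)
    return 2 * (beta_thr * s_ww_ob - 0.5 * s_wy_ob - shift - z / scale)

# Hypothetical observed statistics, residual variance and decision threshold.
settings = dict(n_ob=100, sigma2=4.0, s_ww_ob=0.25, s_wy_ob=0.5,
                w_bar_ob=0.5, y_bar_ob=10.0, y_bar_un=10.0, beta_thr=0.0)

thr = threshold_s_wy_un(0.5, **settings)
p_at_thr = NormalDist().cdf(probit_positive(thr, **settings))
print(thr, p_at_thr)
```

With these illustrative inputs the threshold is the negative of the observed covariance: the counterfactual sample would have to reverse the observed association completely before the probability of invalidation reaches 0.5.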
In contrast to the first scenario, where the probit models are greatly simplified compared to (3.15) and (3.16), the probit models remain as presented in (3.15) and (3.16) in the second scenario, where a research design has strong internal validity but limited external validity. This means the pool of candidate focal parameters in the subsequent analysis of robustness could be large, and therefore a deliberate choice of focal parameters is necessary. It is worth stressing that a reasonable conceptualization of a possible unobserved sample is a prerequisite for an analysis of robustness. In the example of Borman et al. (2008), an enlightening way to conceptualize it is to ask questions such as “what would the covariance between the posttest scores and OCR have been had they drawn a random sample of classrooms from the non-volunteered schools” or “what would the mean pretest and posttest scores have been had they drawn a random sample of classrooms from the non-volunteered schools”. Ideally, one should extrapolate the unobserved sample size and the means, variances and covariances of all relevant variables in an unobserved sample. In practice, as illustrated later, an unobserved sample statistic is restricted to be equal to its counterpart in the observed sample unless it is of particular interest to the researcher in the analysis of robustness or strong evidence suggests that it should differ from its observed counterpart.

4-Bayesian models of robustness indices for centered and standardized data

4.1-For centered data

The purpose of this section is to demonstrate that, although the Bayesian models of robustness indices may seem complicated and difficult to work with, they can be simplified by treating the data in observed and unobserved samples as centered (or standardized) rather than as raw.
The centered data in observed, unobserved and ideal samples will be in the identical form as presented in (3.1) through (3.3) except that the means of all variables will be zero now. Consequently, the posterior distribution of $\beta_w$ will remain unchanged for centered data with the following instructions for calculating the sample variances and covariances in an ideal sample:

\[
\begin{aligned}
\hat{\sigma}^{id}_{Z_i Z_j} &= \pi\hat{\sigma}^{un}_{Z_i Z_j} + (1-\pi)\hat{\sigma}^{ob}_{Z_i Z_j} && \text{for } i, j = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{W Z_i} &= \pi\hat{\sigma}^{un}_{W Z_i} + (1-\pi)\hat{\sigma}^{ob}_{W Z_i} && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WW} &= \pi\hat{\sigma}^{un}_{WW} + (1-\pi)\hat{\sigma}^{ob}_{WW} \\
\hat{\sigma}^{id}_{Z_i Y} &= \pi\hat{\sigma}^{un}_{Z_i Y} + (1-\pi)\hat{\sigma}^{ob}_{Z_i Y} && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WY} &= \pi\hat{\sigma}^{un}_{WY} + (1-\pi)\hat{\sigma}^{ob}_{WY}
\end{aligned}
\tag{4.1}
\]

Obviously, (4.1) is a reduced form of its counterpart for raw data as displayed in (3.13). The rest of analysis of robustness for centered data is the same as the analysis of robustness for raw data: Built on the probit models derived in (3.15) and (3.16), one can pinpoint the threshold of a focal parameter for the probability of invalidating an inference to be smaller than a certain value (say 0.5), conditional on some values of all other unobserved and observed sample statistics.
I note here that, as a special case of the probit models in (3.15) and (3.16), the probit models for research that lacks internal validity can be further reduced as follows:

For inferring a positive effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{2n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{ob}_{WW}}}
\left[\beta_w^{\#}\hat{\sigma}^{ob}_{WW} - 0.5\hat{\sigma}^{un}_{WY} - 0.5\hat{\sigma}^{ob}_{WY}\right]
\tag{4.2}
\]

For inferring a negative effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{2n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{ob}_{WW}}}
\left[0.5\hat{\sigma}^{un}_{WY} + 0.5\hat{\sigma}^{ob}_{WY} - \beta_w^{\#}\hat{\sigma}^{ob}_{WW}\right]
\tag{4.3}
\]

4.2-For standardized data

If one has the raw data and chooses to standardize all the variables, or if one can assume or infer the standardized coefficients from the results of a study, the data in observed, unobserved and ideal samples can be thought of as standardized. As a result, the posterior mean and variance of $\beta_w$ as well as the probit models become functions of the sample correlation matrices defined below:

\[
\mathbf{R}^{id}_{ZZ} =
\begin{bmatrix}
r^{id}_{Z_1 Z_1} & \cdots & r^{id}_{Z_1 Z_p} \\
\vdots & \ddots & \vdots \\
r^{id}_{Z_p Z_1} & \cdots & r^{id}_{Z_p Z_p}
\end{bmatrix}_{p\times p},
\quad
\mathbf{R}^{id}_{ZW} =
\begin{bmatrix}
r^{id}_{Z_1 W} \\ \vdots \\ r^{id}_{Z_p W}
\end{bmatrix}_{p\times 1},
\quad
\mathbf{R}^{id}_{WZ} =
\begin{bmatrix}
r^{id}_{Z_1 W} & \cdots & r^{id}_{Z_p W}
\end{bmatrix}_{1\times p},
\quad
\mathbf{R}^{id}_{ZY} =
\begin{bmatrix}
r^{id}_{Z_1 Y} \\ \vdots \\ r^{id}_{Z_p Y}
\end{bmatrix}_{p\times 1}
\tag{4.4}
\]

where r denotes the sample correlation between two variables. For example, $r^{id}_{WY}$ is the sample correlation between the treatment indicator W and the outcome Y in an ideal sample.

Theorem 3.
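Provided all assumptions of theorem 1 hold, the posterior distribution of $\beta_w$ for standardized data can be rewritten as follows:

\[
\beta_w \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob} \sim N\!\left(
\frac{r^{id}_{WY} - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZY}}
     {1 - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZW}},\
\frac{\sigma^2}{n^{un}+n^{ob}}
\left(1 - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZW}\right)^{-1}
\right)
\tag{4.5}
\]

where:

\[
\begin{aligned}
\mathbf{R}^{id}_{ZZ} &= \frac{n^{un}\mathbf{R}^{un}_{ZZ} + n^{ob}\mathbf{R}^{ob}_{ZZ}}{n^{un}+n^{ob}} = \pi\mathbf{R}^{un}_{ZZ} + (1-\pi)\mathbf{R}^{ob}_{ZZ} \\
\mathbf{R}^{id}_{WZ} &= \frac{n^{un}\mathbf{R}^{un}_{WZ} + n^{ob}\mathbf{R}^{ob}_{WZ}}{n^{un}+n^{ob}} = \pi\mathbf{R}^{un}_{WZ} + (1-\pi)\mathbf{R}^{ob}_{WZ} \\
\mathbf{R}^{id}_{ZY} &= \frac{n^{un}\mathbf{R}^{un}_{ZY} + n^{ob}\mathbf{R}^{ob}_{ZY}}{n^{un}+n^{ob}} = \pi\mathbf{R}^{un}_{ZY} + (1-\pi)\mathbf{R}^{ob}_{ZY} \\
r^{id}_{WY} &= \frac{n^{un}r^{un}_{WY} + n^{ob}r^{ob}_{WY}}{n^{un}+n^{ob}} = \pi r^{un}_{WY} + (1-\pi)r^{ob}_{WY}
\end{aligned}
\tag{4.6}
\]

Proof: The proof of theorem 3 is straightforward once $\hat{\sigma}^{id}_{WW}$ is set to 1 and every covariance term is replaced by its corresponding correlation term (i.e., a covariance matrix by its corresponding correlation matrix, a covariance by its corresponding correlation) in the distribution proposed by theorem 1.

Theorem 4. Provided all assumptions of theorem 2 are met, the probit models for the probability of invalidating an inference (p) for standardized data can be rewritten as follows:

For inferring a positive effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{1 - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZW}}}
\left[\beta_w^{\#}\left(1 - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZW}\right)
 - \left(r^{id}_{WY} - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZY}\right)\right]
\tag{4.7}
\]

For inferring a negative effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{1 - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZW}}}
\left[\left(r^{id}_{WY} - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZY}\right)
 - \beta_w^{\#}\left(1 - \mathbf{R}^{id}_{WZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZW}\right)\right]
\tag{4.8}
\]

Proof: Just as in the proof of theorem 3, I set $\hat{\sigma}^{id}_{WW}$ to 1 and replace every covariance term by its corresponding correlation term (i.e., a covariance matrix by its corresponding correlation matrix, a covariance by its corresponding correlation) in the probit models proposed by theorem 2.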
Finally, the probit models for research with limited internal validity have quite simple and tidy forms, as below:

For inferring a positive effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{2n^{ob}}}{\sigma}\left[\beta_w^{\#} - 0.5\,r^{un}_{WY} - 0.5\,r^{ob}_{WY}\right]
\tag{4.9}
\]

For inferring a negative effect:

\[
\mathrm{probit}(p) = \frac{\sqrt{2n^{ob}}}{\sigma}\left[0.5\,r^{un}_{WY} + 0.5\,r^{ob}_{WY} - \beta_w^{\#}\right]
\tag{4.10}
\]

5-Appropriate statistical threshold and Bayesian models for replacing observed cases

5.1-Appropriate statistical threshold

The analysis of robustness of causal inferences I have discussed so far can be perceived as a procedure of retesting the hypothesis for a conceptualized ideal sample. Such a hypothesis testing procedure entails a statistical threshold, which is the product of the chosen critical value (traditionally 1.96) and the standard error of the estimate of the treatment effect based on an ideal sample (i.e., $\hat{\beta}_w^{id}$). I emphasize that, unlike the standard error of $\hat{\beta}_w^{ob}$ (the estimate of the treatment effect based on the observed sample), which only considers the observed sample, the standard error of $\hat{\beta}_w^{id}$ accounts for the observations of both the observed sample and the unobserved sample and is thus the appropriate choice.

The standard error of $\hat{\beta}_w^{id}$ is given by theorem 1 as follows:

\[
se(\hat{\beta}_w^{id}) = \sqrt{\frac{\sigma^2}{n^{un}+n^{ob}}\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)^{-1}}
\tag{5.1}
\]

where the computations of $\hat{\sigma}^{id}_{WW}$, $\mathbf{S}^{id}_{WZ}$ and $\mathbf{S}^{id}_{ZZ}$ are guided by (3.11) through (3.14). Furthermore, the statistical threshold of $\hat{\beta}_w^{id}$ for testing the null hypothesis $\beta_w = 0$ at the 0.05 level of significance is calculated as below:

\[
\begin{aligned}
\beta_w^{\#} &= 1.96\sqrt{\frac{\sigma^2}{n^{un}+n^{ob}}\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)^{-1}} && \text{for inferring a positive effect} \\
\beta_w^{\#} &= -1.96\sqrt{\frac{\sigma^2}{n^{un}+n^{ob}}\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)^{-1}} && \text{for inferring a negative effect}
\end{aligned}
\tag{5.2}
\]

With the threshold $\beta_w^{\#}$ given in (5.2), the probit models of the probabilities of invalidating an inference become:

For inferring a positive effect:

\[
\mathrm{probit}(p) = 1.96 - \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}}}
\left(\hat{\sigma}^{id}_{WY} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}\right)
\tag{5.3}
\]

For inferring a negative effect:

\[
\mathrm{probit}(p) = 1.96 + \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}}}
\left(\hat{\sigma}^{id}_{WY} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}\right)
\tag{5.4}
\]

The thresholds offered in (5.2) remain instructive as long as the decision about the threshold $\beta_w^{\#}$ is a purely statistical one. However, other factors, such as the transaction cost of a proposed action, may come into play in determining $\beta_w^{\#}$; a relevant discussion of non-statistical thresholds has been offered by Frank et al. (2013).

5.2-The Bayesian models of robustness indices for replacing observed cases

Frank & Min (2007) proposed two mechanisms of forming an ideal sample. The first is neutralization by addition, which creates an ideal sample by adding an unobserved sample to the existing observed sample. Until now the Bayesian models of robustness indices of causal inferences have centered exclusively on this mechanism. The other is neutralization by replacement, which generates an ideal sample by replacing a part of the observed sample with an unobserved sample. In this case, an ideal sample has the same size as the observed sample, and it contains both observations inherited from the observed sample and observations introduced by an unobserved sample.
In this subsection, a new set of Bayesian models of robustness indices is devoted to neutralization by replacement, as these models supplement the existing Bayesian models and provide alternative conceptualizations and interpretations of the robustness indices. To parameterize the mechanism of neutralization by replacement, the following notation is defined: $I_i$ is the binary indicator of whether the ith observed case is kept in an ideal sample (equivalently, that this case is not replaced with an unobserved case), and s represents the collection of all $I_i$, $i = 1, 2, \ldots, n^{ob}$. Therefore, every s identifies a subsample of the observed sample that is kept in an ideal sample. The Bayesian models of robustness indices of causal inferences for replacing observed cases are defined as follows:

\[
\begin{aligned}
\boldsymbol{\gamma} &\sim N\!\left(((\mathbf{X}^{un})^{T}\mathbf{X}^{un})^{-1}(\mathbf{X}^{un})^{T}\mathbf{Y}^{un},\ \sigma^2((\mathbf{X}^{un})^{T}\mathbf{X}^{un})^{-1}\right) \\
Y_i \mid \boldsymbol{\gamma}, \mathbf{X}_i, I_i &\sim N(\mathbf{X}_i\boldsymbol{\gamma},\ \sigma^2) \\
\boldsymbol{\gamma} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}, s &\sim N(\boldsymbol{\theta}^{s}_{\gamma},\ \boldsymbol{\Phi}^{s}_{\gamma})
\end{aligned}
\tag{5.5}
\]

where:

\[
\begin{aligned}
\boldsymbol{\theta}^{s}_{\gamma} &= \left((\mathbf{X}^{un})^{T}\mathbf{X}^{un} + (\mathbf{X}^{ob|s})^{T}\mathbf{X}^{ob|s}\right)^{-1}\left((\mathbf{X}^{un})^{T}\mathbf{Y}^{un} + (\mathbf{X}^{ob|s})^{T}\mathbf{Y}^{ob|s}\right) \\
\boldsymbol{\Phi}^{s}_{\gamma} &= \sigma^2\left((\mathbf{X}^{un})^{T}\mathbf{X}^{un} + (\mathbf{X}^{ob|s})^{T}\mathbf{X}^{ob|s}\right)^{-1}
\end{aligned}
\tag{5.6}
\]

In (5.6), $\mathbf{X}^{ob|s}$ and $\mathbf{Y}^{ob|s}$ refer to the matrix of covariates and the vector of outcomes, respectively, for the observed cases which are not replaced with unobserved ones (so they are retained in an ideal sample). To obtain the probability density function of the target posterior distribution $\boldsymbol{\gamma} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}$, one should take the expectation of the density of $\boldsymbol{\gamma} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}, s$ over the distribution of the subsample s. In practice this can be done through a Monte Carlo simulation, since a closed theoretical form of $\boldsymbol{\gamma} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}$ may not be available in most cases. Fortunately, under a random sampling procedure, any sample statistic of s should center around the same sample statistic of the whole observed sample. Consequently, the distribution of $\boldsymbol{\gamma} \mid \mathbf{Y}^{ob}, \mathbf{X}^{ob}$ can be approximated by assuming the sample statistics of the retained observed cases are identical to the sample statistics of the whole observed sample.
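The Monte Carlo step mentioned above can be sketched as follows. This is an illustrative simplification rather than the procedure used in this dissertation: it assumes a model with an intercept and the treatment indicator W only (no covariates Z), so that the posterior mean of the coefficient of W is the ratio of the pooled covariance to the pooled variance, and it uses hypothetical toy data and function names. For each randomly drawn subsample s of retained observed cases, the posterior probability of falling below a threshold is computed, and the results are averaged over the draws.

```python
import random
from math import sqrt
from statistics import NormalDist

def mean(x):
    return sum(x) / len(x)

def cov(x, y):
    """Covariance with divisor n, as in (3.9)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / len(x)

def prob_below(w, y, beta_thr, sigma2):
    """Posterior P(beta_w < beta_thr) given one ideal sample, no-covariate model."""
    s_ww = cov(w, w)
    post_mean = cov(w, y) / s_ww
    post_sd = sqrt(sigma2 / (len(w) * s_ww))
    return NormalDist(post_mean, post_sd).cdf(beta_thr)

def mc_prob_below(w_ob, y_ob, w_un, y_un, beta_thr, sigma2, draws=500, seed=0):
    """Average the conditional probability over random subsamples s of retained cases."""
    rng = random.Random(seed)
    n_keep = len(w_ob) - len(w_un)          # replacement keeps the ideal sample at size n_ob
    total = 0.0
    for _ in range(draws):
        kept = rng.sample(range(len(w_ob)), n_keep)
        w = w_un + [w_ob[i] for i in kept]
        y = y_un + [y_ob[i] for i in kept]
        total += prob_below(w, y, beta_thr, sigma2)
    return total / draws

# Hypothetical data: eight observed cases, two of which are replaced by unobserved cases.
w_ob = [1, 1, 1, 1, 0, 0, 0, 0]
y_ob = [6.0, 7.0, 5.5, 6.5, 4.0, 5.0, 4.5, 3.5]
w_un = [1, 0]
y_un = [5.0, 5.0]

p_mc = mc_prob_below(w_ob, y_ob, w_un, y_un, beta_thr=0.5, sigma2=1.0)
print(p_mc)
```

The unobserved cases here include both a treated and a control case, which keeps the pooled variance of W positive in every draw; a more general implementation would need to guard against degenerate subsamples.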
As a result, based on theorem 1, I have:

\[
\beta_w \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob} \sim N\!\left(
\frac{\hat{\sigma}^{id}_{WY} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}}
     {\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}},\
\frac{\sigma^2}{n^{ob}}
\left(\hat{\sigma}^{id}_{WW} - \mathbf{S}^{id}_{WZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZW}\right)^{-1}
\right)
\tag{5.7}
\]

where:

\[
\begin{aligned}
\hat{\sigma}^{id}_{Z_i Z_j} &= \pi\hat{\sigma}^{un}_{Z_i Z_j} + (1-\pi)\hat{\sigma}^{ob}_{Z_i Z_j} + (1-\pi)\pi(\bar{Z}_i^{ob} - \bar{Z}_i^{un})(\bar{Z}_j^{ob} - \bar{Z}_j^{un}) && \text{for } i, j = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{W Z_i} &= \pi\hat{\sigma}^{un}_{W Z_i} + (1-\pi)\hat{\sigma}^{ob}_{W Z_i} + (1-\pi)\pi(\bar{W}^{ob} - \bar{W}^{un})(\bar{Z}_i^{ob} - \bar{Z}_i^{un}) && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WW} &= \pi\hat{\sigma}^{un}_{WW} + (1-\pi)\hat{\sigma}^{ob}_{WW} + (1-\pi)\pi(\bar{W}^{ob} - \bar{W}^{un})^2 \\
\hat{\sigma}^{id}_{Z_i Y} &= \pi\hat{\sigma}^{un}_{Z_i Y} + (1-\pi)\hat{\sigma}^{ob}_{Z_i Y} + (1-\pi)\pi(\bar{Z}_i^{ob} - \bar{Z}_i^{un})(\bar{Y}^{ob} - \bar{Y}^{un}) && \text{for } i = 1, 2, \ldots, p \\
\hat{\sigma}^{id}_{WY} &= \pi\hat{\sigma}^{un}_{WY} + (1-\pi)\hat{\sigma}^{ob}_{WY} + (1-\pi)\pi(\bar{W}^{ob} - \bar{W}^{un})(\bar{Y}^{ob} - \bar{Y}^{un})
\end{aligned}
\tag{5.8}
\]

and:

\[
\pi = \frac{n^{un}}{n^{ob}}
\tag{5.9}
\]

Here $\pi$ symbolizes the proportion of the observed sample to be replaced with an unobserved sample. Drawing on the Bayesian models of robustness indices for replacing observed cases, as formalized in (5.5) through (5.9), we can quantify the probabilities of invalidating an inference and formulate their corresponding probit models, which are identical to (3.15) and (3.16) except for the expression of $\pi$ and the use of $n^{ob}$ as the ideal sample size.

6-Illustrative examples

6.1-The Bayesian robustness indices of the effect of Open Court Reading on reading achievement

The Open Court Reading (OCR) program is a curriculum that is rooted in research-based practices, has been on the market for a long time and has been widely adopted by many districts and schools. Although OCR is potentially a beneficial program because it responds to recommendations from research that focused on developing early reading skills, its effect had never been assessed and confirmed by a randomized experiment. Seeing this, Borman et al. (2008) designed a multisite cluster randomized experiment and randomly drew six schools from those that volunteered for their study. Subsequently, they defined a block as a single grade of one sampled school, and within each block classrooms were randomly assigned to the OCR group or the control group.
Controlling for the pretest scores and block membership, Borman et al. (2008) estimated the effect of OCR as 7.95 (on reading composite scores), which was statistically significant, and went on to conclude that “the outcomes from these analyses provided not only evidence of the promising 1-year effects of OCR on students’ reading outcomes but also suggest that these effects may be replicated across varying contexts with rather consistent and positive results”. Ideally, the findings of Borman et al. (2008) imply that the estimated effect of OCR would be around 7.95 if one were to conduct a large-scale completely randomized experiment, controlling for the pretest scores. In other words, regression-based causal inference based on a design where a large random sample of classrooms from the entire U.S. is available and all sampled classrooms are randomly assigned to the OCR group or the control group would lead to an estimate of the effect of OCR of nearly 7.95, with the mean posttest scores of sampled classrooms as the outcome and the mean pretest scores of sampled classrooms as the covariate. Comparing Borman et al. (2008) to the regression-based causal inference built on this imaginary large-scale completely randomized experiment is a necessary starting point of the analysis of robustness in this paper. Nevertheless, such a comparison is not necessarily plausible, as Borman et al. (2008) only had a random sample of classrooms from the volunteered schools instead of a nationwide random sample of classrooms. Borman et al. (2008) thus had a nonrandom sample from its target population, i.e., the classrooms in the entire U.S. Therefore, Borman et al. (2008) fits the description of the second scenario and we can proceed with its analysis of robustness.

Conforming to my notation rules, data structure and formulations proposed earlier, the pretest score, OCR and the posttest score are the covariate Z, the treatment indicator W and the outcome Y respectively.
Furthermore, the sample statistics of the observed sample in Borman et al. (2008) are fixed as follows:

Means of Z, W and Y and observed sample size:

\[
\bar{Z}^{ob} = 576.62,\quad \bar{W}^{ob} = 0.55,\quad \bar{Y}^{ob} = 609.96,\quad n^{ob} = 49
\tag{6.1}
\]

Covariance matrix of [Z, W, Y]:

\[
\begin{bmatrix}
2079.36 & 0.39 & 1832.2 \\
0.39 & 0.25 & 2.33 \\
1832.2 & 2.33 & 2401
\end{bmatrix}
\tag{6.2}
\]

The next step is crucial: we need to present our research question and make assumptions about an unobserved sample. The purpose of presenting the research question in an analysis of robustness is to isolate the focal parameters that are required to answer it. In this example, I follow the logic of Frank et al. (2013) and ask how many classrooms randomly drawn from non-volunteered schools need to be added to the observed sample of Borman et al. (2008) such that the probability of invalidating their inference is smaller than a prespecified benchmark (say 0.5), assuming the covariance between OCR and posttest is 0 in those classrooms sampled from non-volunteered schools. The meaning of this research question is three-fold. First, an unobserved sample, which in this context is a random sample of classrooms drawn from non-volunteered schools, is to be added to the observed sample so that an ideal sample is constructed. My robustness index, i.e., the probability of invalidating an inference, is computed based on this imaginary ideal sample. Second, this research question involves an assumption that the sample covariance between OCR (W) and the posttest scores (Y) is zero in this unobserved sample. Third, the focal parameter for this research question is the unobserved sample size $n^{un}$. In particular, I am interested in the relationship between $n^{un}$ and the probability of invalidating the inference of Borman et al. (2008), holding all other unobserved and observed sample statistics fixed. To fix every unobserved sample statistic other than the focal parameters, it is inevitable to impose some constraints on them.
As discussed earlier, an unobserved sample statistic can be set to a specific number whenever it is defensible to do so. More often, an unobserved sample statistic is constrained to equal its observed counterpart. In this case, every unobserved statistic other than the focal parameter $n^{un}$ and $\hat{\sigma}_{WY}^{un}$ (which is assumed to be 0) is taken to be identical to its observed counterpart. The constraints imposed on all unobserved sample statistics, as well as the assumptions about the observed sample statistics and the residual variance (i.e., $\sigma^2$), are summarized below:

$$\begin{aligned}
&\hat{\sigma}_{WY}^{un} = 0,\quad \hat{\sigma}_{WY}^{ob} = 2.33 \\
&n^{ob} = 49,\quad \sigma^2 = 32 \\
&\hat{\sigma}_{ZZ}^{un} = \hat{\sigma}_{ZZ}^{ob} = 2079.36 \\
&\hat{\sigma}_{ZY}^{un} = \hat{\sigma}_{ZY}^{ob} = 1832.2 \\
&\hat{\sigma}_{WW}^{un} = \hat{\sigma}_{WW}^{ob} = 0.25 \\
&\hat{\sigma}_{ZW}^{un} = \hat{\sigma}_{ZW}^{ob} = 0.39 \\
&\bar{Z}^{un} = \bar{Z}^{ob} = 576.62 \\
&\bar{Y}^{un} = \bar{Y}^{ob} = 609.96 \\
&\bar{W}^{un} = \bar{W}^{ob} = 0.55
\end{aligned} \tag{6.3}$$

Assuming the statistical threshold in (5.2) is adopted, the parametric values provided in (6.3) lead to the following probit model of the probability of invalidating the inference of Borman et al. (2008):

$$\text{probit}(p) = 1.96 - \frac{40.36}{\sqrt{n^{un} + 49}} + 0.12\sqrt{n^{un} + 49} \tag{6.4}$$

Drawing on the probit model in (6.4), the main analytical strategy is to pinpoint the threshold of $n^{un}$ that makes the probability of invalidating the inference of Borman et al. (2008) smaller than a desired value. For example, if we would like to find the threshold of $n^{un}$ corresponding to a probability smaller than 0.5, the probit model (6.4) needs to be transformed into an inequality with regard to $n^{un}$ as follows:

$$1.96 - \frac{40.36}{\sqrt{n^{un} + 49}} + 0.12\sqrt{n^{un} + 49} < 0 \tag{6.5}$$

The above inequality suggests that the probability of invalidating the inference of Borman et al. (2008) would be smaller than 0.5 as long as $n^{un}$ is not greater than 91. In other words, the probability of invalidating the inference of Borman et al. (2008) is smaller than 0.5 when 91 or fewer classrooms are randomly sampled from the non-volunteered schools, assuming the covariance between OCR and the posttest scores is 0 among those sampled classrooms and all other parameters are fixed as in (6.3). Furthermore, the estimated regression coefficient of the treatment indicator W based on an ideal sample can be calculated from the parametric values in (6.3) and the threshold value of $n^{un}$ as follows:

$$\hat{\beta}_w^{id} = \frac{456.68}{49 + n^{un}} - 1.375 \tag{6.6}$$

whose general form is the following:

$$\hat{\beta}_w^{id} = \frac{\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}}{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}} \tag{6.7}$$

It turns out that the threshold $n^{un} = 91$ corresponds to an estimated regression coefficient of W based on an ideal sample of 1.89, which in turn corresponds to a probability of invalidating the inference of Borman et al. (2008) of 0.5. To gain comprehensive knowledge about the one-to-one relationships among the probability of invalidating the inference of Borman et al. (2008), the threshold of $n^{un}$ and the estimated regression coefficient of W in an ideal sample, it is strongly recommended that the threshold of $n^{un}$ and the estimated regression coefficient of W be calculated repeatedly for various desired values of the probability of invalidating the inference of Borman et al. (2008). The table and graph below illustrate those relationships. Table 2.1 tabulates the thresholds of $n^{un}$ and the associated estimated regression coefficients of W that make the probability of invalidating the inference of Borman et al. (2008) lower than 0.1, 0.2, …, 0.9 respectively. For example, at most 64 classrooms which are randomly drawn from the non-volunteered schools and have a zero sample correlation between OCR and posttest reading scores can be added to the observed sample so as to keep the probability of invalidating the inference of Borman et al. (2008) under 0.3, given the parametric values in (6.3).
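The threshold search described above can be reproduced numerically. The following Python sketch is illustrative only and is not the code behind Table 2.1. It assumes that the statistical threshold in (5.2) sits 1.96 posterior standard deviations above zero (which is what produces the constant 1.96 in (6.4)), it works from the unrounded statistics in (6.3) rather than the rounded coefficients 40.36 and 0.12, and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Observed-sample statistics from (6.1)-(6.3); Z = pretest, W = OCR, Y = posttest.
N_OB = 49
S_ZZ, S_ZW, S_ZY = 2079.36, 0.39, 1832.2
S_WW, S_WY_OB, S_WY_UN = 0.25, 2.33, 0.0
SIGMA = sqrt(32.0)  # residual standard deviation, from sigma^2 = 32


def ideal_coefficient(n_un):
    """Regression coefficient of W in the ideal sample, following (6.7)."""
    n_id = N_OB + n_un
    # Means are equal in both samples, so the ideal covariance is a
    # size-weighted average of the unobserved and observed covariances.
    s_wy_id = (n_un * S_WY_UN + N_OB * S_WY_OB) / n_id
    numerator = s_wy_id - S_ZW * S_ZY / S_ZZ
    denominator = S_WW - S_ZW ** 2 / S_ZZ
    return numerator / denominator


def probit_invalidate(n_un):
    """Probit of the probability of invalidating the inference."""
    n_id = N_OB + n_un
    d = S_WW - S_ZW ** 2 / S_ZZ
    # Assumption: the threshold is 1.96 posterior standard deviations above zero.
    return 1.96 - sqrt(n_id * d) * ideal_coefficient(n_un) / SIGMA


def threshold_n_un(p, n_max=10_000):
    """Largest unobserved sample size that keeps the probability below p."""
    target = NormalDist().inv_cdf(p)
    n_un = 0
    # The probit increases with n_un here, so a linear scan is sufficient.
    while n_un + 1 <= n_max and probit_invalidate(n_un + 1) < target:
        n_un += 1
    return n_un


if __name__ == "__main__":
    n_star = threshold_n_un(0.5)
    print(n_star, round(ideal_coefficient(n_star), 2))
```

Scanning over integer sample sizes, rather than solving the inequality in closed form, keeps the sketch close to the way the thresholds are reported (as whole numbers of classrooms).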
The relationship between $n^{un}$ and the probability of invalidating the inference of Borman et al. (2008) is further delineated in figure 2.4. As expected, they are positively related, which means that adding more sampled classrooms from the non-volunteered schools with zero correlation between OCR and posttest scores to the observed sample will weaken the inference of Borman et al. (2008). In figure 2.5, I present the posterior probability of invalidating the inference of Borman et al. (2008) in the context of null hypothesis testing. The black curves in figure 2.5 are distributions corresponding to the null hypothesis $\beta_w = 0$ and the red curves are distributions of $\beta_w$ conditional on a given ideal sample. Figure 2.5 depicts the same pattern as manifested in Table 2.1: as the unobserved sample size becomes larger, the distribution corresponding to the null hypothesis and the distribution of $\beta_w$ get closer, and therefore the probability of invalidating the inference of Borman et al. (2008) becomes larger as well. The appropriate statistical threshold will, in this case, keep dropping because the ideal sample size keeps growing. The most important lesson of figure 2.5 is that the posterior probability of invalidating the inference of Borman et al. (2008) can be conceptualized as type II error with regard to retesting the null hypothesis $\beta_w = 0$ versus the alternative hypothesis when an unobserved sample can be randomly drawn from the non-volunteered schools and merged with the observed sample of Borman et al. (2008).

Table 2.1: Thresholds of $n^{un}$ assuming $\hat{\sigma}_{WY}^{un} = 0$

Level of probability | Threshold of $n^{un}$ | Estimated regression coefficient of W based on an ideal sample
0.1 | 36 | 4.00
0.2 | 51 | 3.19
0.3 | 64 | 2.67
0.4 | 77 | 2.25
0.5 | 91 | 1.89
0.6 | 107 | 1.55
0.7 | 126 | 1.23
0.8 | 152 | 0.90
0.9 | 195 | 0.50

Figure 2.4: The relationship between $n^{un}$ and the probability of invalidating the inference of Borman et al. (2008)

Figure 2.5: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Borman et al. (2008)

Alternatively, the inference of Borman et al. (2008) could be conceived as built on standardized data instead of raw data, as discussed earlier. For standardized data, the correlation matrix needs to be specified for the observed sample, and in this case it is:

$$\begin{bmatrix} 1 & 0.017 & 0.82 \\ 0.017 & 1 & 0.095 \\ 0.82 & 0.095 & 1 \end{bmatrix} \tag{6.8}$$

for variables [Z, W, Y], where Z refers to the pretest scores, W refers to the OCR curriculum and Y refers to the posttest scores. The same research question raised for raw data can now be asked again for standardized data, i.e., how many sampled classrooms from the non-volunteered schools can be added such that the probability of invalidating the inference of Borman et al. (2008) is lower than 0.5, assuming the correlation between OCR and posttest scores is 0 for those sampled classrooms? Again, I assume all unobserved sample correlations are equal to their observed counterparts except the unobserved sample correlation between OCR and posttest scores. The assumptions about the parameters are formally written as follows:

$$\begin{aligned}
&r_{zy}^{ob} = r_{zy}^{un} = 0.82 \\
&r_{wy}^{ob} = 0.095,\quad r_{wy}^{un} = 0 \\
&r_{zw}^{ob} = r_{zw}^{un} = 0.017 \\
&n^{ob} = 49,\quad \sigma^2 = 0.0133
\end{aligned} \tag{6.9}$$

The probit model for the probability of invalidating the inference of Borman et al. (2008) then becomes explicit:

$$\text{probit}(p) = 1.96 - \frac{40.36}{\sqrt{n^{un} + 49}} + 0.12\sqrt{n^{un} + 49} \tag{6.10}$$

It should not be surprising that the probit model for standardized data in (6.10) is exactly the same as the probit model for raw data in (6.4), just as standardizing variables in a regression model does not change the t-ratio and p-value of any estimated regression coefficient.
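The invariance of the probit model to standardization noted above can be checked numerically. The following is a minimal Python sketch under stated assumptions: one covariate Z, an unobserved sample that matches the observed one except for a zero association between W and Y, and a threshold placed 1.96 posterior standard deviations above zero. The helper name is hypothetical. The standardized inputs are derived from the covariance matrix in (6.2) rather than typed in from (6.8), so the two scales agree up to floating-point error instead of rounding error.

```python
from math import sqrt

N_OB = 49


def probit_invalidate(n_un, s_zz, s_zw, s_zy, s_ww, s_wy_ob, sigma2):
    """Probit of the probability of invalidating the inference (one covariate Z).

    Assumes the unobserved W-Y covariance (or correlation) is zero, all other
    unobserved statistics equal their observed counterparts, and the threshold
    sits 1.96 posterior standard deviations above zero.
    """
    n_id = N_OB + n_un
    s_wy_id = N_OB * s_wy_ob / n_id      # size-weighted average with a zero
    d = s_ww - s_zw ** 2 / s_zz          # variance of W net of Z
    beta_id = (s_wy_id - s_zw * s_zy / s_zz) / d
    return 1.96 - sqrt(n_id * d / sigma2) * beta_id


# Raw scale: covariances from (6.2) and sigma^2 = 32.
raw = dict(s_zz=2079.36, s_zw=0.39, s_zy=1832.2,
           s_ww=0.25, s_wy_ob=2.33, sigma2=32.0)

# Standardized scale: divide each covariance by the two standard deviations,
# and divide the residual variance by the variance of Y.
sd_z, sd_w, sd_y = sqrt(2079.36), sqrt(0.25), sqrt(2401.0)
std = dict(
    s_zz=1.0,
    s_zw=0.39 / (sd_z * sd_w),
    s_zy=1832.2 / (sd_z * sd_y),
    s_ww=1.0,
    s_wy_ob=2.33 / (sd_w * sd_y),
    sigma2=32.0 / 2401.0,  # about 0.0133, as in (6.9)
)

if __name__ == "__main__":
    for n_un in (36, 91, 195):
        print(n_un, probit_invalidate(n_un, **raw), probit_invalidate(n_un, **std))
```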
The equality between (6.4) and (6.10) is not a coincidence: for any given set of parametric values, research question and focal parameter(s), the probit model for the probability of invalidating an inference remains the same regardless of whether the data are raw, centered or standardized. This further implies that, for any given set of parametric values, research question and focal parameter(s), the analysis and results will be identical for raw, centered or standardized data. For this reason, I omit the results pertaining to the probit model for standardized data, as they have already been generated and presented as a product of the probit model for raw data.

6.2-The Bayesian robustness indices of the effect of kindergarten retention on reading achievement

Kindergarten retention is an educational issue that has long been vehemently debated. In an attempt to settle this issue, Hong & Raudenbush (2005) analyzed a nationally representative sample of about 7639 students and 1070 schools and fitted a multilevel model with additional controls for the logits of estimated propensity scores and the propensity score strata. Their estimate of the effect of kindergarten retention on reading achievement was about -9, which was negative and statistically significant. This estimate, according to Hong & Raudenbush (2005), is evidence that “children who were retained would have learned more had they been promoted”. If the finding of Hong & Raudenbush (2005) is indeed the truth, we would expect a regression-based causal inference to generate a similar result had we been able to randomly assign the sampled students in Hong & Raudenbush (2005) to a retention group and a promotion group. This means the estimated regression coefficient of kindergarten retention should be around -9 in a regression where the outcome, the treatment and the covariate are the reading scores, kindergarten retention and the logits of estimated propensity scores respectively.
However, this is not the case, as such random assignment to retention and promotion groups is unrealizable. Therefore, a regression-based causal inference corresponding to Hong & Raudenbush (2005) would be problematic because it is based on a quasi-experiment with a representative sample, and it clearly falls into the first scenario I introduced at the beginning. To recognize the potential bias associated with the analysis of Hong & Raudenbush (2005) and profile the robustness of their inference, we first need to define a potential unobserved sample and its data form for Hong & Raudenbush (2005). As explained in section 3.4, an unobserved sample in the first scenario should be the counterfactuals of the observed sample. Therefore, a potential unobserved sample for Hong & Raudenbush (2005) should be the counterfactuals of all the observed students. By definition, a counterfactual of a retained student is an observation where the outcome is his/her reading score had he/she been promoted instead and his/her treatment status is promotion instead of retention. Likewise, a counterfactual of a promoted student is an observation where the outcome is his/her reading score had he/she been retained instead and his/her treatment status is retention instead of promotion. For simplicity, I further assume that the data have been standardized for both the observed sample and the unobserved sample (counterfactuals), since standardized data produce the same results as raw data.
Due to the special data structures of the unobserved sample and the ideal sample for research with limited internal validity, as covered in (3.18) and (3.19), the following constraints are automatically imposed on the sample sizes and correlations:

$$n^{un} = n^{ob} = 7639,\quad r_{WZ}^{id} = 0 \tag{6.11}$$

My research question for the analysis of robustness of Hong & Raudenbush (2005) is “what would the sample correlation between kindergarten retention and the reading scores have to be in the counterfactuals in order to make the probability of invalidating the inference of Hong & Raudenbush smaller than a desired value (say 0.5)”. The focal parameter suggested by this research question is the unobserved sample correlation between kindergarten retention and the reading scores. Constructing the probit model between the probability of invalidating the inference of Hong & Raudenbush (2005) and this focal parameter is especially simple, as manifested by (4.10), and all relevant parametric values and assumptions are listed here:

$$n^{un} = n^{ob} = 7639,\quad r_{WY}^{ob} = -0.37,\quad \sigma = 0.8 \tag{6.12}$$

Furthermore, I choose the threshold $\beta_w^{\#}$ purely based on statistical significance, and in this case $\beta_w^{\#}$ is proven to be $-1.96\,\sigma/\sqrt{2n^{ob}}$, which equals -0.0127. The probit model for Hong & Raudenbush (2005) then becomes straightforward once (6.12) is given:

$$\text{probit}(p) = 77.25\, r_{WY}^{un} - 26.62 \tag{6.13}$$

This indicates that, for example, the threshold of $r_{WY}^{un}$ that makes the probability of invalidating the inference of Hong & Raudenbush (2005) lower than 0.5 is identified by the following inequality:

$$77.25\, r_{WY}^{un} - 26.62 < 0 \tag{6.14}$$

which pinpoints this threshold as 0.3446. The proper interpretation of this threshold is that the unobserved sample correlation between kindergarten retention and the reading scores in the counterfactuals needs to be smaller than 0.3446 for the probability of invalidating the inference of Hong & Raudenbush (2005) to stay below 0.5.
This threshold also corresponds to an ideal sample correlation between kindergarten retention and the reading scores of -0.0127, which is exactly the threshold of statistical significance calculated for an ideal sample. In general, for research with questionable internal validity, the ideal sample correlation between the treatment indicator W and the outcome Y equals the regression coefficient estimate of the treatment indicator W if the data are standardized, since the correlation between any covariate and W in an ideal sample is 0. The ideal sample correlation between W and Y is computed as follows:

$$r_{WY}^{id} = \frac{r_{WY}^{un} + r_{WY}^{ob}}{2} \tag{6.15}$$

A close examination of the relationship between the probability of invalidating the inference of Hong & Raudenbush (2005) and the unobserved sample correlation between kindergarten retention and reading scores entails repeated calculations of the thresholds of $r_{WY}^{un}$ as well as $r_{WY}^{id}$ for other selected desired values. Table 2.2 lists those thresholds for desired values of this probability ranging from 0.1 to 0.9. For instance, the correlation between kindergarten retention and reading scores in the counterfactuals needs to be smaller than 0.3378 in order to keep the probability of invalidating the inference of Hong & Raudenbush (2005) under 0.3, conditional on the parametric values provided in (6.12). Figure 2.6 shows that this probability rises abruptly from 0 to 1 when $r_{WY}^{un}$ is in the range between 0.32 and 0.36. Why is this probability so sensitive to a minute change of $r_{WY}^{un}$ in the range [0.32, 0.36]? It is most likely due to the large sample size and effect size in Hong & Raudenbush (2005). The large sample size of Hong & Raudenbush (2005) amplifies the slope of $r_{WY}^{un}$ in the probit model, and a strong positive correlation (stronger than 0.32) is needed in an unobserved sample to offset the large negative effect of kindergarten retention found in their observed sample.
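The thresholds discussed above follow directly from the probit model, which is linear in the focal parameter. The Python sketch below is an illustration under stated assumptions, not the author's code: it takes the values in (6.12), assumes the posterior of the coefficient is normal with mean $r_{WY}^{id}$ and variance $\sigma^2/(2n^{ob})$ (which is how the slope 77.25 and intercept -26.62 in (6.13) can be recovered), and uses hypothetical function names.

```python
from math import sqrt
from statistics import NormalDist

N_OB = 7639      # observed sample size; the counterfactual sample has the same size
R_WY_OB = -0.37  # observed correlation between retention and reading scores
SIGMA = 0.8      # residual standard deviation from (6.12)

N_ID = 2 * N_OB
# Threshold for inferring a negative effect (about -0.0127).
BETA_THRESHOLD = -1.96 * SIGMA / sqrt(N_ID)


def ideal_correlation(r_wy_un):
    """Ideal-sample correlation between W and Y, following (6.15)."""
    return (r_wy_un + R_WY_OB) / 2


def probit_invalidate(r_wy_un):
    """Probit of P(beta_w > threshold), assuming beta_w ~ N(r_id, sigma^2 / n_id)."""
    return sqrt(N_ID) * (ideal_correlation(r_wy_un) - BETA_THRESHOLD) / SIGMA


def threshold_r_un(p):
    """Unobserved correlation at which the probability of invalidation equals p."""
    slope = sqrt(N_ID) / (2 * SIGMA)      # roughly 77.25
    intercept = probit_invalidate(0.0)    # roughly -26.62
    return (NormalDist().inv_cdf(p) - intercept) / slope


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.9):
        r_un = threshold_r_un(p)
        print(p, round(r_un, 4), round(ideal_correlation(r_un), 4))
```

Because the model is linear in $r_{WY}^{un}$, each threshold is obtained by inverting a straight line, so no numerical search is needed in this scenario.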
Figure 2.7 depicts the relationship between the posterior probability of invalidating the inference of Hong & Raudenbush (2005) and null hypothesis testing. As the unobserved correlation between kindergarten retention (W) and reading scores (Y) keeps increasing, the gap between the distribution corresponding to the null hypothesis (black curve) and the distribution of $\beta_w$ based on an ideal sample (red curve) keeps shrinking, and consequently the posterior probability of invalidating the inference of Hong & Raudenbush (2005) keeps growing. The appropriate statistical threshold in this case is fixed at -0.0127 since the ideal sample size is constant ($2 \times 7639 = 15278$). Most importantly, figure 2.7 shows that the posterior probability of invalidating the inference of Hong & Raudenbush (2005) can be interpreted as type II error in the context of retesting the null hypothesis $\beta_w = 0$ against the alternative hypothesis when an unobserved sample (i.e., the set of counterfactual observations of all sampled students) is realized and added to their observed sample.

Table 2.2: Thresholds of $r_{WY}^{un}$

Level of probability | Threshold of $r_{WY}^{un}$ | Threshold of $r_{WY}^{id}$
0.1 | 0.328 | -0.021
0.2 | 0.3337 | -0.0182
0.3 | 0.3378 | -0.0161
0.4 | 0.3413 | -0.0144
0.5 | 0.3446 | -0.0127
0.6 | 0.3479 | -0.0111
0.7 | 0.3514 | -0.0093
0.8 | 0.3555 | -0.0073
0.9 | 0.3612 | -0.0044

Figure 2.6: The relationship between $r_{WY}^{un}$ and the probability of invalidating the inference of Hong & Raudenbush (2005)

Figure 2.7: The relationship between testing null hypothesis and the posterior probability of invalidating the inference of Hong & Raudenbush (2005)

7-Discussion

7.1-A summary of the Bayesian paradigm of robustness indices for regression-based causal inference

To summarize, the Bayesian paradigm of robustness indices for regression-based causal inference centers on a probit model of the probability of invalidating an inference. It is a thought experiment built on the observed data.
The observed sample and the sample statistics pertaining to it are fixed throughout the analysis. Doing so is logical because the object of an analysis of robustness is a single analysis whose observed sample ought to be fixed, and this is consistent with the Bayesian reasoning that the same observed sample is fed to the likelihood function irrespective of the prior distribution. The analysis of robustness is a thought experiment because it compels a thorough and detailed conceptualization of an unobserved sample, which is thought of as a random sample from the unobserved part of the ideal population and is expressed by a prior. To accomplish this, a clear definition of the unobserved part of the ideal population is first needed based on the research context. Moreover, ideally with good knowledge of what a random sample of this unobserved part of the ideal population would look like, assumptions about the sample statistics of an unobserved sample are made, and whenever plausible they are assumed to be equal to their observed counterparts. Most importantly, a few unobserved sample statistics are chosen as focal parameters based on a research question, and the probability of invalidating an inference has an explicit probit relationship with the focal parameters given all assumed parametric values and observed sample statistics. The goal of the analysis of robustness is to identify the thresholds of the focal parameters that make the probability of invalidating an inference just below a desired value. It is worth emphasizing that comprehensive knowledge about the robustness of any single study cannot be gained without repeatedly computing the thresholds of the focal parameters for a series of desired values and describing the probit curve between the probability of invalidating an inference and a focal parameter.
The Bayesian paradigm of robustness indices is consistent with the argument made by Frank & Min (2007), who proposed treating the prior as a distribution of a parameter based on an unobserved sample. This is how I frame the Bayesian paradigm of robustness indices for regression-based causal inference in this paper: the prior is defined as a distribution of regression coefficients based on an imaginary unobserved sample whose data structure has been formalized. The likelihood function is defined as a parametric distribution for the outcomes of the target population and is fit to the observed sample. Consequently, the posterior distribution of regression coefficients generated in this fashion has a form that is identical to the distribution of regression coefficients based on an ideal sample, which is simply the combination of an unobserved sample and the observed sample. Built on this posterior distribution, the probit link of the probability of invalidating an inference is a function of prior parameters such as the unobserved sample size, the unobserved sample means and the elements of the unobserved sample variance-covariance matrix. Intrinsically, the analysis of robustness is an exploratory Bayesian sensitivity analysis in which prior parameters are manipulated so that their impacts on the probability of invalidating an inference can be learned. Like the frequentist recipe, the Bayesian paradigm of robustness indices can be interpreted as a two-phase sampling approach. The first phase refers to the analysis where the observed sample is randomly drawn from the observed part of the ideal population and a regression is carried out for the observed sample. The second phase refers to the analysis where an unobserved sample is randomly drawn from the unobserved part of the ideal population and a regression is subsequently performed for this unobserved sample.
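The claim that the posterior coincides with the regression on the ideal sample can be illustrated with a small numerical check. The Python sketch below rests on assumptions that go beyond the text here: a normal prior centered at the unobserved-sample least squares estimate with covariance $\sigma^2 (X_{un}^T X_{un})^{-1}$, a known residual variance, a single predictor W plus an intercept, and toy data that are entirely hypothetical. Under those assumptions the standard conjugate-normal posterior mean is a precision-weighted combination of the two phase-specific estimates, and it matches the least squares fit on the pooled (ideal) sample.

```python
def normal_equations(w, y):
    """Return X'X (2x2) and X'y (length 2) for the design matrix [1, w]."""
    n = len(w)
    sw, sy = sum(w), sum(y)
    sww = sum(v * v for v in w)
    swy = sum(a * b for a, b in zip(w, y))
    return [[n, sw], [sw, sww]], [sy, swy]


def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def posterior_mean(a_un, beta_un, a_ob, beta_ob):
    """Precision-weighted combination of the prior mean and the observed estimate."""
    a_post = [[a_un[i][j] + a_ob[i][j] for j in range(2)] for i in range(2)]
    rhs = [sum(a_un[i][j] * beta_un[j] + a_ob[i][j] * beta_ob[j] for j in range(2))
           for i in range(2)]
    return solve_2x2(a_post, rhs)


# Hypothetical toy data: W is a 0/1 treatment indicator, Y is the outcome.
w_ob, y_ob = [0, 0, 1, 1, 1], [1.0, 2.0, 4.0, 5.0, 6.0]
w_un, y_un = [0, 0, 0, 1, 1], [2.0, 3.0, 4.0, 3.0, 3.5]

a_ob, b_ob = normal_equations(w_ob, y_ob)
a_un, b_un = normal_equations(w_un, y_un)
beta_ob = solve_2x2(a_ob, b_ob)  # first phase: regression on the observed sample
beta_un = solve_2x2(a_un, b_un)  # second phase: regression on the unobserved sample

beta_post = posterior_mean(a_un, beta_un, a_ob, beta_ob)
beta_ideal = solve_2x2(*normal_equations(w_un + w_ob, y_un + y_ob))

if __name__ == "__main__":
    print(beta_post, beta_ideal)
```

The residual variance cancels out of the posterior mean, which is why it does not appear in the sketch; it would matter only for the posterior variance.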
The distribution of regression coefficients in the second phase is equivalent to the prior in the Bayesian recipe, and the distribution of regression coefficients for an ideal sample produced by this two-phase sampling is equivalent to the posterior in the Bayesian recipe.

7.2-Comparisons with other similar approaches

7.2.1-The impact thresholds in Frank (2000)

The impact of an unmeasured variable, defined as the product of the correlation between this variable and the focal predictor and the correlation between this variable and the outcome, is often the subject of discussions about the robustness of causal research. Frank (2000) derived the impact threshold for an unmeasured confounder or suppressor in a multiple regression. The logic is that, given the observed correlation between a focal variable (such as the treatment indicator W) and the outcome, the sample size and the level of significance, the impact threshold tells researchers how large the impact of an unmeasured confounder/suppressor needs to be in order to make an inference invalid. By definition, the impact threshold of an unmeasured confounder defines the boundary beyond which an originally significant regression coefficient becomes insignificant. Moreover, the impact threshold of an unmeasured suppressor defines the boundary beyond which either an originally significant regression coefficient becomes significant in the opposite direction or an originally insignificant regression coefficient becomes significant in either direction. The impact threshold helps conceptualize the robustness of causal research since it can be naturally extended to cases where a regression model has multiple covariates and multiple unmeasured confounders/suppressors and can be evaluated through a reference distribution (see also Pan & Frank 2003, 2004). Logically, I approach the problem of causal inference and its robustness in essentially the same way as Frank (2000) did.
I perceive a disputable causal inference as an inference based on insufficient information. In Frank (2000), the missing piece is the uncontrolled confounders/suppressors which have the potential to invalidate a regression-based causal inference. In the Bayesian paradigm of robustness indices, the missing piece is the missing data, which could be either a potential random sample from the unobserved part of the ideal population or the counterfactuals defined in the Rubin Causal Model. Both approaches ask the same question: “what would this missing piece have to be such that the current inference is no longer established?” By this logic, the threshold of a sufficient statistic or a parameter of main interest characterizing the missing piece is pursued. Even though both the impact threshold and the Bayesian paradigm of robustness indices reside in the context of regression, there is a key difference between their perspectives. The Bayesian paradigm of robustness indices emphasizes a sampling or missing data perspective, as discussed earlier. By defining and differentiating the observed and unobserved parts of the ideal population, an unobserved sample can be conceptualized as a random sample from the unobserved part of the ideal population. My robustness index, the probability of invalidating an inference, is built on this unobserved sample. Frank (2000) accentuates the variable selection problem for observational studies and quasi-experiments when the assumption of unconfoundedness is questioned. This is exactly the theme of the first scenario, and the Bayesian paradigm of robustness indices offers a solution for it, though from a different perspective. Statistically speaking, Frank (2000) is a purely frequentist framework while the Bayesian paradigm of robustness indices is a Bayesian framework with a supplementary frequentist viewpoint.
7.2.2-The robustness indices in Frank & Min (2007)

The Bayesian paradigm of robustness indices is a detailed and comprehensive expansion of the Bayesian framework proposed by Frank & Min (2007). The main principle of Frank & Min (2007) has been maintained throughout this paper: I treat the prior as based on an unobserved sample and the likelihood as shaped by the observed sample. The posterior distribution in the Bayesian paradigm of robustness indices is proven to be a distribution based on an ideal sample, just as theorized in Frank & Min (2007). The Bayesian paradigm of robustness indices has a broader scope than Frank & Min (2007): it applies to both research with limited internal validity and research with limited external validity, whereas the robustness indices of Frank & Min (2007) are designed for research with limited external validity only.

7.2.3-The robustness indices in Frank et al. (2013)

The Bayesian paradigm of robustness indices can be regarded as a Bayesian version of Frank et al. (2013), as they share the same goal of assessing the robustness of research with strong internal validity but weak external validity as well as research with strong external validity but weak internal validity. Some key concepts of the Bayesian paradigm of robustness indices, such as the threshold for making an inference and the decision rule for invalidating an inference, are inherited from Frank et al. (2013). However, the Bayesian paradigm of robustness indices is more probabilistically oriented and requires a more precise and detailed modeling of an unobserved sample than Frank et al. (2013). The robustness index in Frank et al. (2013) is the proportion of the observed sample that a study can afford to have replaced with an unobserved sample where the treatment effect is zero, without nullifying an inference. In contrast, the robustness index in this paper is the probability of invalidating an inference, which is built on an imaginary ideal sample.
This ideal sample is not formed by replacing a portion of the observed sample with an unobserved sample but by adding this unobserved sample to the existing observed sample.

7.3-Limitations

Useful insights about the robustness of regression-based causal inference can be elicited by the proper application of the Bayesian paradigm of robustness indices. Conversely, the Bayesian paradigm of robustness indices can lead to misleading results and baffling conclusions if researchers are unaware of its pitfalls and limitations. I warn readers of two major limitations of the Bayesian paradigm of robustness indices. First, the Bayesian paradigm of robustness indices demonstrated throughout this paper is situated in regression-based causal inference, which by definition treats the estimated regression coefficient of the treatment indicator W as the estimate of the average treatment effect, in a multiple regression where the outcome Y should be continuous (or at least not categorical). This makes the Bayesian paradigm of robustness indices inappropriate for statistical methods such as logistic regression, multinomial logistic regression or any other non-regression methods. It is also counterproductive to apply the Bayesian paradigm of robustness indices to research questions which cannot be answered by the regression coefficient of W. For example, a study seeking answers about the treatment effect for the treated or for the control cannot be satisfied with the regression coefficient of W alone. In general, the Bayesian paradigm of robustness indices is best applied to the two research scenarios considered here, i.e., research with strong internal validity but weak external validity and research with weak internal validity but strong external validity, as long as the aim is to estimate the average treatment effect only.
Another limitation of the Bayesian paradigm of robustness indices is that it cannot assess the robustness of causal research that could be biased by factors other than weak internal validity or weak external validity. Factors such as measurement error and violation of the SUTVA assumption can and often do jeopardize a causal inference. Nevertheless, they are beyond the scope of the Bayesian paradigm of robustness indices.

7.4-Conclusion

A causal relationship can never be established by a single study. Rather, to confirm a causal relationship and accept it as scientific knowledge, many more assessments need to be done by experts in the substantive field, and those assessments are typically “more demanding and meaningful than that of a one-time, stand-alone test of scientific value” (Sohn, 1998). It is my wish that the Bayesian paradigm of robustness indices can equip causal researchers with a framework which allows the assessment of the robustness of a causal inference to be done in a systematic, informative and organized fashion. I believe that the Bayesian paradigm of robustness indices reflects an important and frequently mentioned recommendation that emerged from the discussion of replicability/reproducibility, i.e., the consideration of the prior probabilities of hypotheses. This is exactly the spirit of the analysis of robustness. By treating the prior distribution as a distribution built on an unobserved sample, the regression estimate of the average treatment effect can be repeatedly evaluated and thereby the belief about a causal inference is updated, conditional on different hypotheses about an unobserved sample. As many researchers have pointed out, assigning prior probabilities to all possible hypotheses is likely to be unavoidable in the journey from the long-criticized p-value to a meaningful index of replicability.
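APPENDICES

Appendix A: Proofs of Theorem 1 and Theorem 2

Proof of Theorem 1:

My goal is to derive the least squares estimate of the regression coefficient for W (i.e., $\hat{\beta}_w$) based on an ideal sample, together with the variance of $\hat{\beta}_w$, since the former is the posterior mean and the latter is identical to the posterior variance, as manifested by (2.17). First, I need to define the following ordered data matrices for an ideal sample:

$$\begin{aligned}
\mathbf{D} &= [\mathbf{Y}_{(n^{un}+n^{ob})\times 1},\ \mathbf{X}_{(n^{un}+n^{ob})\times(p+2)}] \\
\mathbf{X} &= [\mathbf{1}_{(n^{un}+n^{ob})\times 1},\ \mathbf{V}_{(n^{un}+n^{ob})\times(p+1)}] \\
\mathbf{V} &= [\mathbf{Z}_{(n^{un}+n^{ob})\times p},\ \mathbf{W}_{(n^{un}+n^{ob})\times 1}] \\
\mathbf{Z} &= [\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_p]_{(n^{un}+n^{ob})\times p}
\end{aligned}$$

and ordered mean vectors:

$$\begin{aligned}
\bar{\mathbf{V}}^{id} &= [\bar{\mathbf{Z}}^{id},\ \bar{W}^{id}]_{1\times(p+1)} \\
\bar{\mathbf{Z}}^{id} &= [\bar{Z}_1^{id}, \bar{Z}_2^{id}, \ldots, \bar{Z}_p^{id}]_{1\times p}
\end{aligned}$$

The matrix $\mathbf{X}^T\mathbf{X}$ for an ideal sample can then be written as the following block matrix:

$$\mathbf{X}^T\mathbf{X} = \begin{bmatrix} n^{un}+n^{ob} & (n^{un}+n^{ob})\bar{\mathbf{V}}^{id} \\ (n^{un}+n^{ob})(\bar{\mathbf{V}}^{id})^T & \mathbf{V}^T\mathbf{V} \end{bmatrix}$$

It turns out that the inverse of $\mathbf{X}^T\mathbf{X}$ has the following form:

$$(\mathbf{X}^T\mathbf{X})^{-1} = \begin{bmatrix} \dfrac{1}{n^{un}+n^{ob}} + \dfrac{1}{n^{un}+n^{ob}}\bar{\mathbf{V}}^{id}(\mathbf{S}_{VV}^{id})^{-1}(\bar{\mathbf{V}}^{id})^T & -\dfrac{1}{n^{un}+n^{ob}}\bar{\mathbf{V}}^{id}(\mathbf{S}_{VV}^{id})^{-1} \\ -\dfrac{1}{n^{un}+n^{ob}}(\mathbf{S}_{VV}^{id})^{-1}(\bar{\mathbf{V}}^{id})^T & \dfrac{1}{n^{un}+n^{ob}}(\mathbf{S}_{VV}^{id})^{-1} \end{bmatrix}$$

It should be clear now that, to determine the explicit form of $(\mathbf{X}^T\mathbf{X})^{-1}$, I need to find $(\mathbf{S}_{VV}^{id})^{-1}$. As the variance-covariance matrix of the vector of predictors V, $\mathbf{S}_{VV}^{id}$ can be expressed as the block matrix in (3.5), whose elements are formalized in (3.6) through (3.8).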
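Consequently, the inverse of $\mathbf{S}_{VV}^{id}$ can be formulated here:

$$(\mathbf{S}_{VV}^{id})^{-1} = \begin{bmatrix} (\mathbf{S}_{ZZ}^{id})^{-1} + (\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}\mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1} & -(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1} \\ -\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}\mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1} & \big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1} \end{bmatrix}$$

Plugging the above matrix $(\mathbf{S}_{VV}^{id})^{-1}$ into the block form of $(\mathbf{X}^T\mathbf{X})^{-1}$ gives the complete $(\mathbf{X}^T\mathbf{X})^{-1}$, whose elements are all ideal sample statistics such as ideal sample variances, ideal sample covariances and ideal sample means. To isolate the estimated regression coefficient for W, I only need the elements in the last row of $(\mathbf{X}^T\mathbf{X})^{-1}$, which are provided next:

$$\begin{aligned}
(\mathbf{X}^T\mathbf{X})^{-1}_{(p+2)\times 1} &= \frac{1}{n^{un}+n^{ob}}\Big[\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}\mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}(\bar{\mathbf{Z}}^{id})^T - \bar{W}^{id}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}\Big] \\
\big[(\mathbf{X}^T\mathbf{X})^{-1}_{(p+2)\times 2}, \ldots, (\mathbf{X}^T\mathbf{X})^{-1}_{(p+2)\times(p+1)}\big] &= -\frac{1}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}\mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1} \\
(\mathbf{X}^T\mathbf{X})^{-1}_{(p+2)\times(p+2)} &= \frac{1}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}
\end{aligned}$$

Because the estimated regression coefficient for W is the last element of $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, i.e., the dot product between the last row of $(\mathbf{X}^T\mathbf{X})^{-1}$ and $\mathbf{X}^T\mathbf{Y}$, the expression of $\mathbf{X}^T\mathbf{Y}$ is also needed here:

$$\mathbf{X}^T\mathbf{Y} = \begin{bmatrix} (n^{un}+n^{ob})\bar{Y}^{id} \\ \mathbf{Z}^T\mathbf{Y} \\ \mathbf{W}^T\mathbf{Y} \end{bmatrix}$$

where:

$$\begin{aligned}
\mathbf{Z}^T\mathbf{Y} &= (n^{un}+n^{ob})\mathbf{S}_{ZY}^{id} + (n^{un}+n^{ob})\bar{Y}^{id}(\bar{\mathbf{Z}}^{id})^T \\
\mathbf{W}^T\mathbf{Y} &= (n^{un}+n^{ob})\hat{\sigma}_{WY}^{id} + (n^{un}+n^{ob})\bar{W}^{id}\bar{Y}^{id}
\end{aligned}$$

Now I can calculate the estimated regression coefficient for W as the dot product between the last row of $(\mathbf{X}^T\mathbf{X})^{-1}$ and the vector $\mathbf{X}^T\mathbf{Y}$.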
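The result is presented below:

$$\hat{\beta}_w = \frac{\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}}{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}}$$

The variance of $\hat{\beta}_w$ is straightforward: it is just the product of the known residual variance $\sigma^2$ and the element in the (p+2)th row and the (p+2)th column of $(\mathbf{X}^T\mathbf{X})^{-1}$:

$$Var(\hat{\beta}_w) = \frac{\sigma^2}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}$$

Taken together, the posterior distribution of $\beta_w$ conditional on the observed sample is given by:

$$\beta_w \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob} \sim N\left(\frac{\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}}{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}},\ \frac{\sigma^2}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}\right)$$

The derivations of the ideal variances/covariances as functions of observed and unobserved sample statistics follow the same reasoning. Here I take the covariance between W and Y in an ideal sample as an example. First of all, I have:

$$\hat{\sigma}_{WY}^{id} = \frac{1}{n^{un}+n^{ob}}\sum_{i=1}^{n^{un}+n^{ob}}(W_i - \bar{W}^{id})(Y_i - \bar{Y}^{id})$$

The above equation can be rearranged and re-expressed as follows:

$$(n^{un}+n^{ob})\hat{\sigma}_{WY}^{id} + (n^{un}+n^{ob})\bar{W}^{id}\bar{Y}^{id} = \sum_{i=1}^{n^{un}+n^{ob}} W_iY_i = \sum_{i=1}^{n^{un}} W_iY_i + \sum_{i=1}^{n^{ob}} W_iY_i$$

Similarly, the following equations are true for the observed sample and an unobserved sample:

$$\begin{aligned}
\sum_{i=1}^{n^{un}} W_iY_i &= n^{un}\hat{\sigma}_{WY}^{un} + n^{un}\bar{W}^{un}\bar{Y}^{un} \\
\sum_{i=1}^{n^{ob}} W_iY_i &= n^{ob}\hat{\sigma}_{WY}^{ob} + n^{ob}\bar{W}^{ob}\bar{Y}^{ob}
\end{aligned}$$

The last three equations can be consolidated into an expanded one as follows:

$$(n^{un}+n^{ob})\hat{\sigma}_{WY}^{id} + (n^{un}+n^{ob})\bar{W}^{id}\bar{Y}^{id} = n^{un}\hat{\sigma}_{WY}^{un} + n^{ob}\hat{\sigma}_{WY}^{ob} + n^{un}\bar{W}^{un}\bar{Y}^{un} + n^{ob}\bar{W}^{ob}\bar{Y}^{ob}$$

Finally, the expression of $\hat{\sigma}_{WY}^{id}$ as a function of unobserved and observed sample statistics can be deduced from the equation above:

$$\begin{aligned}
\hat{\sigma}_{WY}^{id} &= \frac{n^{un}\hat{\sigma}_{WY}^{un} + n^{ob}\hat{\sigma}_{WY}^{ob}}{n^{un}+n^{ob}} + \frac{n^{un}\bar{W}^{un}\bar{Y}^{un} + n^{ob}\bar{W}^{ob}\bar{Y}^{ob}}{n^{un}+n^{ob}} - \bar{W}^{id}\bar{Y}^{id} \\
&= \frac{n^{un}}{n^{un}+n^{ob}}\hat{\sigma}_{WY}^{un} + \frac{n^{ob}}{n^{un}+n^{ob}}\hat{\sigma}_{WY}^{ob} + \frac{n^{un}}{n^{un}+n^{ob}}\bar{W}^{un}\bar{Y}^{un} + \frac{n^{ob}}{n^{un}+n^{ob}}\bar{W}^{ob}\bar{Y}^{ob} - \left(\frac{n^{un}\bar{W}^{un} + n^{ob}\bar{W}^{ob}}{n^{un}+n^{ob}}\right)\left(\frac{n^{un}\bar{Y}^{un} + n^{ob}\bar{Y}^{ob}}{n^{un}+n^{ob}}\right) \\
&= \pi\hat{\sigma}_{WY}^{un} + (1-\pi)\hat{\sigma}_{WY}^{ob} + \pi\bar{W}^{un}\bar{Y}^{un} + (1-\pi)\bar{W}^{ob}\bar{Y}^{ob} - \pi^2\bar{W}^{un}\bar{Y}^{un} - \pi(1-\pi)\bar{W}^{un}\bar{Y}^{ob} - \pi(1-\pi)\bar{W}^{ob}\bar{Y}^{un} - (1-\pi)^2\bar{W}^{ob}\bar{Y}^{ob} \\
&= \pi\hat{\sigma}_{WY}^{un} + (1-\pi)\hat{\sigma}_{WY}^{ob} + \pi(1-\pi)\big[\bar{W}^{un}\bar{Y}^{un} - \bar{W}^{un}\bar{Y}^{ob} - \bar{W}^{ob}\bar{Y}^{un} + \bar{W}^{ob}\bar{Y}^{ob}\big] \\
&= \pi\hat{\sigma}_{WY}^{un} + (1-\pi)\hat{\sigma}_{WY}^{ob} + \pi(1-\pi)(\bar{W}^{ob} - \bar{W}^{un})(\bar{Y}^{ob} - \bar{Y}^{un})
\end{aligned}$$

where:

$$\pi = \frac{n^{un}}{n^{un}+n^{ob}}$$

Proof of Theorem 2:

For inferring a positive effect, the probability of invalidating an inference is:

$$P(\beta_w < \beta_w^{\#} \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob})$$

To recast this probability of invalidating an inference in terms of the cumulative distribution function of the standard normal distribution, I need to standardize the random variable $\beta_w \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob}$:

$$\begin{aligned}
P(\beta_w < \beta_w^{\#} \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob}) &= P\left(\frac{\beta_w - \dfrac{\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}}{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}}}{\sqrt{\dfrac{\sigma^2}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}}} < \frac{\beta_w^{\#} - \dfrac{\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}}{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}}}{\sqrt{\dfrac{\sigma^2}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}}} \,\Bigg|\, \mathbf{X}^{ob}, \mathbf{Y}^{ob}\right) \\
&= \Phi\left(\frac{\beta_w^{\#} - \dfrac{\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}}{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}}}{\sqrt{\dfrac{\sigma^2}{n^{un}+n^{ob}}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big)^{-1}}}\right) \\
&= \Phi\left(\frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}}}\Big[\beta_w^{\#}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big) - \big(\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}\big)\Big]\right)
\end{aligned}$$

Given that the probit function is just the inverse of the cumulative distribution function of the standard normal distribution, plugging either side of the above equation into the probit function leads to the following result:

$$\text{probit}(p) = \frac{\sqrt{n^{un}+n^{ob}}}{\sigma\sqrt{\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}}}\Big[\beta_w^{\#}\big(\hat{\sigma}_{WW}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZW}^{id}\big) - \big(\hat{\sigma}_{WY}^{id} - \mathbf{S}_{WZ}^{id}(\mathbf{S}_{ZZ}^{id})^{-1}\mathbf{S}_{ZY}^{id}\big)\Big]$$

For inferring a negative effect, the probability of invalidating an inference generally becomes:

$$P(\beta_w > \beta_w^{\#} \mid \mathbf{X}^{ob}, \mathbf{Y}^{ob})$$

By the same logic, the probit function of this probability is obtained by first deriving its corresponding cumulative distribution function of the standard normal distribution:

P(  w   #w | Xob , Y ob )  id id   ˆ WY  S idWZ (S idZZ )-1 S idZY 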
ˆ WY  S idWZ (S idZZ )-1 S idZY #       w w id id ˆ WW  S idWZ (S idZZ )-1 S idZW ˆ WW  S idWZ (S idZZ )-1 S idZW ob ob   P  | X ,Y 2 2     id id id -1 id -1 id id id -1 id -1 ˆ ˆ (   S (S ) S ) (   S (S ) S )  un ob WW WZ ZZ ZW  WW WZ ZZ ZW nun  nob  n n  id id     ˆ WY  S idWZ (S idZZ )-1 S idZY ˆ WY  S idWZ (S idZZ )-1 S idZY # #         w w id id ˆ WW  S idWZ (S idZZ )-1 S idZW ˆ WW  S idWZ (S idZZ )-1 S idZW      1   2 2       id id id -1 id -1 id id id -1 id -1 ˆ ˆ (   S (S ) S ) (   S (S ) S )  un ob WW WZ ZZ ZW   un ob WW WZ ZZ ZW   n n   n n    nun  nob id id id -1 id # id id id -1 id   [(ˆ WY  S WZ (S ZZ ) S ZY )   w (ˆ WW  S WZ (S ZZ ) S ZW )]    ˆ id  S id (S id )-1 S id  WW WZ ZZ ZW   Apparently, taking probit operation on both sides of the equation above will generate the following probit function of the probability of invalidating an inference when inferring a negative effect: probit ( p)  nun  nob  ˆ id WW  S (S ) S id WZ id -1 ZZ id ZW id id [(ˆWY  SidWZ (SidZZ )-1 SidZY )   w# (ˆWW  SidWZ (SidZZ )-1 SidZW )] This completes the proof of theorem 2. 148 Appendix B: The Algebraic Equivalence Between Theorem 1 and Common Expressions of Regression Coefficients In this appendix, I will demonstrate the algebraic equivalence between theorem 1 and the common expressions of ordinary least square (OLS) estimates of simple regression coefficient as well as standardized multiple regression coefficients (for two covariates). The common expression of simple regression coefficient is provided below: n ˆx   ( x  x )( y i i 1 i  y) n  (x  x ) i 1 2 i Now I show how theorem 1 is connected to the expression above: First, the distribution in theorem 1 can be treated as the distribution of any single regression coefficient. 
Therefore, for a predictor x, its OLS estimate of the regression coefficient is provided by Theorem 1 as follows:

\[
\hat{\beta}_x = \frac{\hat{\sigma}^{id}_{xy} - \mathbf{S}^{id}_{XZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZY}}{\hat{\sigma}^{id}_{xx} - \mathbf{S}^{id}_{XZ}(\mathbf{S}^{id}_{ZZ})^{-1}\mathbf{S}^{id}_{ZX}}
\]

However, a simple regression model has no covariates Z, so every sample variance-covariance matrix (or vector) that involves Z drops out. The matrices and vectors $\mathbf{S}^{id}_{ZZ}$, $\mathbf{S}^{id}_{XZ}$, $\mathbf{S}^{id}_{ZY}$, $\mathbf{S}^{id}_{ZX}$ therefore all vanish, and the above expression of $\hat{\beta}_x$ becomes:

\[
\hat{\beta}_x = \frac{\hat{\sigma}^{id}_{xy}}{\hat{\sigma}^{id}_{xx}}
\]

Based on the formulae for computing sample variances and sample covariances in (3.9), the following equations can be derived:

\[
\hat{\beta}_x = \frac{\hat{\sigma}^{id}_{xy}}{\hat{\sigma}^{id}_{xx}} = \frac{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\]

This establishes the algebraic equivalence between Theorem 1 and the common expression of the OLS estimate of the simple regression coefficient. Note that in this case $n = n^{un} + n^{ob}$, because all sample statistics pertain to an ideal sample.

Next, I prove the algebraic equivalence between Theorem 1 and the expressions of the OLS estimates of standardized multiple regression coefficients, through an example of regressing y on x1 and x2.
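As a numerical aside on the simple-regression case just established, the following Python sketch checks that the ratio of ideal-sample moments reproduces the textbook OLS slope. It assembles the ideal-sample covariance and variance from the unobserved and observed sub-samples using the pooling formula derived in Appendix A. The data, the sample sizes, and the helper names `cov0` and `ideal_cov` are hypothetical and chosen only for illustration; this is a sketch under those assumptions, not part of the formal argument.

```python
import numpy as np


def cov0(a, b):
    """Sample covariance with divisor n (not n - 1), as used in the appendix."""
    return float(np.mean((a - a.mean()) * (b - b.mean())))


def ideal_cov(a_un, b_un, a_ob, b_ob):
    """Ideal-sample covariance built from unobserved and observed summaries."""
    n_un, n_ob = len(a_un), len(a_ob)
    pi = n_un / (n_un + n_ob)  # share of unobserved cases in the ideal sample
    return (pi * cov0(a_un, b_un)
            + (1 - pi) * cov0(a_ob, b_ob)
            + pi * (1 - pi) * (a_ob.mean() - a_un.mean()) * (b_ob.mean() - b_un.mean()))


rng = np.random.default_rng(0)

# Hypothetical data: 40 unobserved and 60 observed cases with different means.
x_un = rng.normal(1.0, 1.0, 40)
y_un = 0.2 * x_un + rng.normal(0.0, 1.0, 40)
x_ob = rng.normal(0.0, 1.0, 60)
y_ob = 0.8 * x_ob + rng.normal(0.0, 1.0, 60)

# Stack both sub-samples to form the ideal sample.
x_id = np.concatenate([x_un, x_ob])
y_id = np.concatenate([y_un, y_ob])

# Slope from ideal-sample moments assembled out of the two sub-samples.
slope_moments = ideal_cov(x_un, y_un, x_ob, y_ob) / ideal_cov(x_un, x_un, x_ob, x_ob)

# Textbook OLS slope computed directly on the stacked data.
slope_ols = float(np.sum((x_id - x_id.mean()) * (y_id - y_id.mean()))
                  / np.sum((x_id - x_id.mean()) ** 2))

print(slope_moments, slope_ols, np.isclose(slope_moments, slope_ols))
```

Passing the same variable twice to `ideal_cov` gives the ideal-sample variance, so one helper covers both the numerator and the denominator of the slope.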
It is well known that the OLS estimates of the standardized regression coefficients of x1 and x2 have the following forms:

\[
\hat{\beta}_{x_1} = \frac{r^{id}_{x_1 y} - r^{id}_{x_1 x_2}\, r^{id}_{x_2 y}}{1 - (r^{id}_{x_1 x_2})^2}
\]

\[
\hat{\beta}_{x_2} = \frac{r^{id}_{x_2 y} - r^{id}_{x_1 x_2}\, r^{id}_{x_1 y}}{1 - (r^{id}_{x_1 x_2})^2}
\]

The OLS estimate of any single standardized multiple regression coefficient is given by Theorem 3 (the standardized version of Theorem 1) as follows:

\[
\hat{\beta}_x = \frac{r^{id}_{xy} - \mathbf{R}^{id}_{XZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZY}}{1 - \mathbf{R}^{id}_{XZ}(\mathbf{R}^{id}_{ZZ})^{-1}\mathbf{R}^{id}_{ZX}}
\]

To isolate the OLS estimate of the regression coefficient of x1 from the above formula, one treats x1, x2 and y as x, Z and y in the context of Theorem 1 (or Theorem 3), which gives the following facts:

\[
r^{id}_{xy} = r^{id}_{x_1 y}, \qquad \mathbf{R}^{id}_{ZZ} = 1 = r^{id}_{x_2 x_2}, \qquad \mathbf{R}^{id}_{XZ} = \mathbf{R}^{id}_{ZX} = r^{id}_{x_1 x_2}, \qquad \mathbf{R}^{id}_{ZY} = r^{id}_{x_2 y}
\]

With these facts, the OLS estimate of the regression coefficient of x1 offered by Theorem 3 can be rewritten as:

\[
\hat{\beta}_{x_1} = \frac{r^{id}_{x_1 y} - r^{id}_{x_1 x_2}\, r^{id}_{x_2 y}}{1 - (r^{id}_{x_1 x_2})^2}
\]

Similarly, to derive the OLS estimate of the regression coefficient of x2 from Theorem 3, one treats x1, x2 and y as Z, x and y in the context of Theorem 1 (or Theorem 3), which gives the following facts:

\[
r^{id}_{xy} = r^{id}_{x_2 y}, \qquad \mathbf{R}^{id}_{ZZ} = 1 = r^{id}_{x_1 x_1}, \qquad \mathbf{R}^{id}_{XZ} = \mathbf{R}^{id}_{ZX} = r^{id}_{x_1 x_2}, \qquad \mathbf{R}^{id}_{ZY} = r^{id}_{x_1 y}
\]

The OLS estimate of the regression coefficient of x2 then follows directly:

\[
\hat{\beta}_{x_2} = \frac{r^{id}_{x_2 y} - r^{id}_{x_1 x_2}\, r^{id}_{x_1 y}}{1 - (r^{id}_{x_1 x_2})^2}
\]

The expressions of $\hat{\beta}_{x_1}$ and $\hat{\beta}_{x_2}$ derived from Theorem 3 are identical to their common expressions. This confirms the algebraic equivalence between Theorem 3 (and hence Theorem 1) and the common expressions of standardized multiple regression coefficients.

REFERENCES

Borman, G. D., Dowling, B. M., & Schneck, C. (2008). A multi-site cluster randomized field trial of open court reading. Educational Evaluation and Policy Analysis, 30, 389–407.

Diaconis, P., & Ylvisaker, D. (1979). Conjugate priors for exponential families.
The Annals of Statistics, 7, 269–281.

Diaconis, P., & Ylvisaker, D. (1985). Quantifying prior opinion. Bayesian Statistics, 2, 133–156.

Espinosa, V., Dasgupta, T., & Rubin, D. B. (2016). A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes. Technometrics, 58(1), 62–73.

Frank, K. A. (2000). Impact of a confounding variable on a regression coefficient. Sociological Methods & Research, 29(2), 147–194.

Frank, K. A., & Min, K. (2007). Indices of robustness for sample representation. Sociological Methodology, 37, 349–392.

Frank, K. A., Sykes, G., Anagnostopoulos, D., Cannata, M., Chard, L., Krause, A., & McCrory, R. (2008). Extended influence: National board certified teachers as help providers. Educational Evaluation and Policy Analysis, 30, 3–30.

Frank, K. A., Maroulis, S. J., Duong, M. Q., & Kelcey, B. M. (2013). What would it take to change an inference? Using Rubin’s Causal Model to interpret the robustness of causal inferences. Educational Evaluation and Policy Analysis, 35, 437–460.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.

Greenwald, A., Gonzalez, R., Harris, R., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175–183.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.

Hoff, P. D. (2009). A first course in Bayesian statistical methods. New York, NY: Springer Science & Business Media.

Hong, G., & Raudenbush, S. W. (2005). Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics. Educational Evaluation and Policy Analysis, 27, 205–224.

Imbens, G. W., & Rubin, D. B. (1997). Bayesian inference for causal effects in randomized experiments with noncompliance. The Annals of Statistics, 25(1), 305–327.

Imbens, G. W., & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. New York, NY: Cambridge University Press.

Iverson, G. J., Wagenmakers, E. J., & Lee, M. D. (2010). A model-averaging approach to replication: The case of p rep. Psychological Methods, 15(2), 172.

Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345–353.

Lee, D. S. (2009). Training, wages and sample selection: Estimating sharp bounds on treatment effects. The Review of Economic Studies, 76, 1071–1102.

Manski, C. F. (1990). Nonparametric bounds on treatment effects. The American Economic Review, 80, 319–323.

Miller, J. (2009). What is the probability of replicating a statistically significant effect? Psychonomic Bulletin & Review, 16(4), 617–640.

Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge, UK: Cambridge University Press.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). DOI: 10.1126/science.aac4716

Psychological Science editorial board. (2005). Information for contributors. Psychological Science, 16(12).

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6(1), 34–58.

Rubin, D. B., & Zell, E. R. (2010). Dealing with noncompliance and missing outcomes in a randomized trial using Bayesian technology: Prevention of perinatal sepsis clinical trial, Soweto, South Africa. Statistical Methodology, 7, 338–350.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.

Shadish, W. R., Clark, M. H., & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment. Journal of the American Statistical Association, 103, 1334–1344.

Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs. American Educational Research Association.

Sohn, D. (1998). Statistical significance and replicability: Why the former does not presage the latter. Theory & Psychology, 8(3), 291–311.

Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15, 250–267.

Thompson, B. (1996). Research news and comment: AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26–30.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd edition). Cambridge, MA: MIT Press.

Wooldridge, J. M. (2013). Introductory Econometrics: A Modern Approach (5th edition). Mason, OH: South-Western Cengage Learning.

Zajonc, T. (2012). Bayesian inference for dynamic treatment regimes: Mobility, equity, and efficiency in student tracking. Journal of the American Statistical Association, 107, 80–92.