THREE ESSAYS ON SEMIPARAMETRIC ESTIMATORS

By Benjamin Miller

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2023

ABSTRACT

In this dissertation, I develop two semiparametric estimators and consider a variation on an existing semiparametric estimator in the third chapter. In Chapter 1, I develop a model and an estimator for a panel data setting with multiple fractional response variables with a binary endogenous covariate. I develop a two-step technique to obtain consistent estimates of the average partial effects. Then, I provide a variable addition test for endogeneity. I demonstrate using simulations that if the chosen conditional mean function is incorrect, it is still possible to obtain estimates of the average partial effects that are close to the true values. Data from the NLSY97 survey is used to estimate the average partial effect of marriage on how individuals allocate their time within a year. In Chapter 2, I develop a doubly-robust estimator of the quantile treatment effect on the treated (QTT). This estimator can obtain consistent estimates of the QTT using either the propensity score or the conditional cdf of the first-differenced untreated outcomes. Aside from the benefits of obtaining consistent estimates of a QTT when a nuisance function is misspecified, there are also efficiency gains. In addition, assumptions on the smoothness of the nuisance parameters can be relaxed when the estimator is doubly-robust. I also show that asymptotically valid confidence intervals can be constructed using the empirical bootstrap. Then, I demonstrate via simulations that my estimator can produce a sharply lower root mean square error compared to other estimators. I apply my estimator to estimate the effect of increasing the minimum wage on county-level unemployment rates, where I show significant and varied quantile treatment effects.
In Chapter 3, I consider whether a modification of the parametric estimators of the nuisance functions described in Chapter 2 will lead to improved performance compared to existing estimators. In particular, an additional moment will be included to estimate the parameters of the nuisance functions, but only one of those nuisance functions will be used to estimate the quantile treatment effect on the treated (QTT). I show that even if this additional moment is applied, the small-sample performance of the estimator is not improved over the doubly-robust estimator in Chapter 2. This is true regardless of which nuisance function is misspecified.

Copyright by BENJAMIN MILLER 2023

To my parents and grandparents, who gave me an education.

ACKNOWLEDGEMENTS

There are many people that I would like to thank, starting with my family. I am grateful for the support provided by my parents, Martin and Barbara Miller. They made it possible for me to become an economist. I would also like to thank my siblings, Heather Katcher and Edward Miller, for additional encouragement. No member of my immediate family could have had any success were it not for my grandparents, David and Mildred Miller. I owe my education to them as much as anyone else. They both passed away before they could see me earn my doctorate, but I am certain that they would have been proud of my accomplishments. Next, I would like to thank some faculty and staff at Michigan State University. I am grateful for the guidance of my committee members Dr. Jeffrey Wooldridge, Dr. Antonio Galvao, Dr. Kyoo Il Kim, and Dr. Nicole Mason-Wardell. They have provided feedback and advice not only on the technical content of the dissertation, but also on the writing of the dissertation to make it more accessible to an audience outside of econometrics. There are several classmates that I should thank. Without having to pick and choose who gets an acknowledgement, I will begin by thanking my cohort.
Each of them enlivened my time at Michigan State in some small way; however, I did promise an acknowledgment to one classmate in particular. I am grateful to Steven Wu-Chaves. I partially got the idea for my job market paper from a conversation with him. Some people might call this an "inspiration." If that's what it is, then I can only hope that I have inspired him.

TABLE OF CONTENTS

CHAPTER 1 ECONOMETRIC METHODS FOR MULTIPLE FRACTIONAL RESPONSE VARIABLES WITH A BINARY ENDOGENOUS COVARIATE: AN APPLICATION TO TIME-USE DATA
CHAPTER 2 DOUBLY-ROBUST QUANTILE TREATMENT EFFECT ESTIMATION
CHAPTER 3 APPLICATION OF ADDITIONAL MOMENTS TO QUANTILE TREATMENT EFFECT ESTIMATION: A SIMULATION
BIBLIOGRAPHY
APPENDIX A DERIVING THE AVERAGE PARTIAL EFFECTS
APPENDIX B SIMULATION TABLES FOR CHAPTER 1
APPENDIX C APPLICATION TABLES FOR CHAPTER 1
APPENDIX D HIGH-LEVEL ASSUMPTIONS AND PROPOSITIONS FOR NUISANCE FUNCTION ESTIMATION IN CHAPTER 2
APPENDIX E PROOFS OF MAJOR THEOREMS AND PROPOSITIONS FOR CHAPTER 2
APPENDIX F CHAPTER 2 TABLES AND FIGURES
APPENDIX G CHAPTER 3 FIGURES

CHAPTER 1 ECONOMETRIC METHODS FOR MULTIPLE FRACTIONAL RESPONSE VARIABLES WITH A BINARY ENDOGENOUS COVARIATE: AN APPLICATION TO TIME-USE DATA

1.1 Introduction

Fractional response variables are often treated as having a linear conditional mean. This can be done for the purpose of demonstrating that there is a strong association between the fractional random variable and some covariates of interest.
The largest contrast to this would be to take a structural approach and try to derive the exact form of the conditional mean function. The decision to use a fractional response model can serve as a middle point between a linear approximation and a structural approach. When there is a single fractional response variable of interest, the use of a nonlinear conditional mean function can provide a closer approximation to the true relationship between the covariate and the conditional mean than a linear function can. When there are multiple fractional response variables of interest and those variables represent shares in a bundle, then the linear conditional mean function no longer reflects the choices of an agent. If a structural approach is to be taken, then the structural assumptions that might have reflected an agent's decision problem in a single time period no longer hold over multiple periods. There could be heterogeneity amongst agents that influences their decisions. More assumptions or a more complex structural approach might be needed to identify the parameters of interest. In this chapter, I develop a reduced form estimator of the average partial effects for a fixed number of multiple fractional responses. First, I combine the methods of Mullahy (2015) and Papke and Wooldridge (2008) to develop a panel data estimator with a binary endogenous explanatory variable (EEV). This is done using quasi-maximum likelihood estimation (QMLE), combining this with a Mundlak-Chamberlain device, and then integrating out the endogenous component in an approach that is similar to Heckman (1979). My approach is similar to that found in Nam (2014), which handles the case of multiple fractional responses with a continuous endogenous covariate. In that paper the author applies a two-step approach, using a control function in the first step to project the omitted variable onto the error arising from the reduced form model of the continuous EEV.
In the second step, QMLE is then applied to estimate the average partial effects (APEs). Then, I show the asymptotic properties of the estimates of the average partial effects. Although it may not be possible to recover the unscaled coefficients on the covariates, it is possible to estimate and perform inference on the average partial effects. I develop a variable addition test (VAT) for endogeneity. This test is based upon a simpler version of the VAT found in Lin and Wooldridge (2017). Because the methods in this chapter can be computationally intensive to implement, the test is a useful first check: it allows for detection of endogeneity across all of the fractional response variables. I also show that if one of the key identification assumptions fails, it is still possible to obtain estimates of each APE that are reasonably close to the true values. I break this down into several cases, looking at when the distribution of the unobservables is incorrectly specified, thus causing the conditional mean to be incorrectly specified. The cases include when the unobservable variables are asymmetric. Finally, as an application of my estimator I use data from the 1997 National Longitudinal Survey of Youth (NLSY97). This is done to estimate how changes in a binary endogenous variable affect the fraction of an individual's time within a year devoted to work, sleep, and leisure. The marital status of each survey participant is reversible, and the frequency of sexual intercourse by survey participants in a year is used as an instrument, coupled with controls that might be correlated with this frequency. This specific data and the problem of estimating how marital status affects individuals' use of their time are explored within this chapter, while demonstrating how the fractional nature of the dependent variable can be exploited. The chapter is organized as follows. Section 1.2 presents the model and identification assumptions.
Section 1.3 presents the proposed method for obtaining consistent estimates of the average partial effects. Section 1.4 presents the average partial effects, as well as their asymptotic distribution. Section 1.5 presents a VAT for endogeneity. Under the null hypothesis of this test there is no binary EEV, so that the parameters can be estimated using a standard multinomial logit QMLE framework. Section 1.6 presents simulation results under model misspecification. Under the model misspecification, the multinomial logit specification is incorrect, yet the results show that the proposed method will still provide a good approximation of the average partial effects. Section 1.7 contains the empirical application of this estimator to the 1997 National Longitudinal Survey of Youth. Section 1.8 concludes the chapter, where I also consider alternatives to the estimator presented in this chapter.

1.2 Model and Assumptions

I assume that I have a panel of data consisting of a random sample of N subjects across T time periods with L fractional dependent outcomes, with each dependent outcome denoted as $y_{itl}$. Let $x_i = (x_{i1}, \ldots, x_{iT})$ and $z_i = (z_{i1}, \ldots, z_{iT})$. I also assume strict exogeneity of the structural conditional mean. $x_{it}$ is a $1 \times K$ vector of covariates, separate from the $1 \times M$ vector of instruments $z_{it}$. $t_{it}$ is the binary EEV. $e_{it}$ represents omitted variables that change over time and with each subject $i$. The structural conditional mean is

$E[y_{itl} \mid x_i, t_i, z_i, c_i, e_i] = E[y_{itl} \mid x_{it}, t_{it}, c_i, e_{it}] = G(\xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \gamma_l c_i + e_{it})$  (1.1)

where

$t_{it} = 1[z_{it}\delta_z + x_{it}\delta_x + \bar{z}_i\delta_{\bar{z}} + \bar{x}_i\delta_{\bar{x}} + u_{it} \geq 0]$  (1.2)

$c_i = h_i\pi + a_i$  (1.3)

$h_i = (\bar{x}_i\ \bar{z}_i\ \bar{t}_i)$  (1.4)

$0 \leq G(\xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \gamma_l c_i + e_{it}) \leq 1$  (1.5)

$r_{itl} \equiv \gamma_l a_i + e_{it}$  (1.6)

$\sum_{l=1}^{L} G_{itl} = 1$  (1.7)

and $\xi_{tl}$ is a time-varying intercept, where $u_{it}$ is independent of $x_i$, $a_i$, and $z_i$. This is denoted by $u_{it} \perp\!\!\!\perp x_i, z_i, a_i$. $G$ is what I call the structural mean function. It is unknown to the researcher.
It is assumed that the distribution of $u_{it}$, $D(u_{it})$, is known to the researcher. For example, the researcher would know that $u_{it} \sim \mathrm{Normal}(0,1)$, and so $\hat{\delta}$ would be obtained from a pooled probit MLE. Equation (1.3) represents the use of the Chamberlain (1980) device to write the correlated random effect $c_i$ as a projection onto the time-averaged values of the explanatory variables and the instruments. $h_i$ is the vector of these time-averaged values, where $\bar{x}_i = \sum_{t=1}^{T} x_{it}/T$, $\bar{z}_i = \sum_{t=1}^{T} z_{it}/T$, and $\bar{t}_i = \sum_{t=1}^{T} t_{it}/T$. Equation (1.5) represents the restriction that the value of the conditional mean for any values of $x_{it}$, $t_{it}$, $c_i$, and $e_{it}$ must be between zero and one, inclusive. Equation (1.7) represents the constraint that since each $y_{itl}$ is bounded between zero and one and the shares must sum to one, the structural mean values must sum to one for each time period and subject. This constraint rules out the application of a probit model when $L \geq 3$, since there is no guarantee that the choice of a probit conditional mean function would satisfy the constraint. In Equation (1.6), $r_{itl}$ is defined as $\gamma_l$ multiplied by the error from the projection of $c_i$ on $h_i$, plus $e_{it}$. It is important to note that though it is tempting to write $y_{itl}$ as equal to $G(\xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \gamma_l c_i + e_{it})$, this may rule out the case in which $y_{itl}$ takes the values 0 or 1. It should also be noted that, in contrast to Becker (2014), the correlated random effects have different coefficients across $l$. This is similar to how previous approaches have handled a single source of heterogeneity in the multinomial logit model (see Section 16.2.4 of Wooldridge (2010)). Now, let

$r_{itl} = \zeta_l u_{it} + v_{it}$  (1.8)

In effect, I am noting that $r_{itl}$ can be written as a linear function of $u_{it}$ and $v_{it}$, where $v_{it}$ is the error arising from regressing $r_{itl}$ on $u_{it}$. This extends naturally from the model.
Since the correlated random effects approach eliminates the inconsistency arising from the explanatory variables, any endogeneity that remains must enter through $r_{itl}$ via $u_{it}$. Then, $v_{it} \perp\!\!\!\perp x_{it}, z_{it}, u_{it}$. I am implying that if the unobservables are projected onto the random variable that is the source of the endogeneity, then whatever remains should be random noise. I am also going to assume that

$E(y_{itl} \mid h_i, x_i, z_i, t_i, u_i) = E(y_{itl} \mid h_i, x_{it}, t_{it}, u_{it}) = \Lambda(d^v_{itl} + u_{it}\zeta^v_l)$  (1.9)

where

$d_{itl} \equiv \xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \gamma_l h_i\pi$  (1.10)

and

$d^v_{itl} \equiv \xi^v_{tl} + x_{it}\beta^v_l + t_{it}\alpha^v_l + \gamma^v_l h_i\pi^v$  (1.11)

The multinomial logit function that is of interest here is

$\Lambda(d^v_{itl} + u_{it}\zeta^v_l) = \dfrac{e^{d^v_{itl} + u_{it}\zeta^v_l}}{1 + \sum_{k=1}^{L-1} e^{d^v_{itk} + \zeta^v_k u_{it}}}$  (1.12)

Equation (1.11) represents the scaled value of $d_{itl}$. Equation (1.9) represents the assumption that once $v_{it}$ is integrated out, the conditional mean function is equal to the multinomial logit conditional mean function, and the values of the parameters are related, though it is unknown exactly how, to the original structural parameters. The first equality of equation (1.9) represents the strict exogeneity assumption. The original structural parameters cannot be identified, but these transformed parameters can be used to estimate the APEs. While I have not made any direct assumptions about the distribution of $v_{it}$, I have made a somewhat strong assumption, though one that is not out of place in the econometrics literature. In effect, I am placing a restriction upon the combination of $G$ and the random variable $v_{it}$ such that (1.9) holds; however, it is not known what restrictions must be placed upon $G$ and $v_{it}$ so that (1.9) is true. It could also be that the original structural model is incompatible with (1.9), but (1.9) is assumed to hold for the sake of convenience.
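The adding-up constraint in (1.7) is exactly what the multinomial logit mean in (1.12) delivers by construction. As a quick illustration, here is a minimal Python sketch; the helper name `multinomial_logit_mean` is mine, not the chapter's code:

```python
import math

def multinomial_logit_mean(d):
    """Multinomial logit conditional means for categories l = 1, ..., L-1,
    given the linear indices d = [d_1, ..., d_{L-1}]; the base category L
    receives the remaining probability mass 1 / (1 + sum of exponentials)."""
    denom = 1.0 + sum(math.exp(dl) for dl in d)
    shares = [math.exp(dl) / denom for dl in d]
    shares.append(1.0 / denom)  # base category L
    return shares

# With L = 3 (two non-base indices), the implied shares are strictly
# positive and sum to one, satisfying the constraint in (1.7).
shares = multinomial_logit_mean([0.5, -0.2])
```

Any probit-style choice of the mean, by contrast, would not automatically satisfy the adding-up constraint for $L \geq 3$, which is the reason given above for ruling it out.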
This is analogous to assumptions in the generalized estimating equations (GEE) literature, where, when heterogeneity is averaged out, a convenient functional form is chosen, as in Zeger, Liang, and Albert (1988). Petrin and Train (2010) apply this approach in the context of consumer choice by dividing the structural error in the consumer utility function into two parts. Distributional assumptions are made on each part in order to form a mixed logit conditional mean function. A version of this assumption is guaranteed to hold for any fractional variable, and the method that I am proposing can still be applied. As noted by Mullahy (2015), following Woodland (1979), it is always the case for fractional response variables that the conditional mean function should have a form that is based upon an underlying Dirichlet random variable. Then the conditional mean has the form

$E[y_{itl} \mid x_{it}, z_{it}, t_{it}, u_{it}] = \dfrac{z_l(x_{it}, z_{it}, t_{it}, u_{it})/z_L(x_{it}, z_{it}, t_{it}, u_{it})}{1 + \sum_{l=1}^{L-1} z_l(x_{it}, z_{it}, t_{it}, u_{it})/z_L(x_{it}, z_{it}, t_{it}, u_{it})}$  (1.13)

When (1.12) is combined with (1.13), this is equivalent to setting $z_l(x_{it}, z_{it}, t_{it}, u_{it})/z_L(x_{it}, z_{it}, t_{it}, u_{it}) = e^{d^v_{itl} + u_{it}\zeta^v_l}$. The previous assumptions are summarized as follows:

Assumption D.1. $E[y_{itl} \mid x_i, t_i, z_i, c_i, e_i] = E[y_{itl} \mid x_{it}, t_{it}, c_i, e_{it}] = G(\xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \gamma_l c_i + e_{it})$, $t_{it} = 1[z_{it}\delta_z + x_{it}\delta_x + \bar{z}_i\delta_{\bar{z}} + \bar{x}_i\delta_{\bar{x}} + u_{it} \geq 0]$, and $c_i = h_i\pi + a_i$.

Assumption D.2. $0 \leq G(\xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \gamma_l c_i + e_{it}) \leq 1$ and $\sum_{l=1}^{L} G_{itl} = 1$.

Assumption D.3. $E(y_{itl} \mid h_i, x_i, z_i, t_i, u_i) = E(y_{itl} \mid h_i, x_{it}, t_{it}, u_{it}) = \Lambda(d^v_{itl} + u_{it}\zeta^v_l)$.

Assumption D.4. $u_{it} \perp\!\!\!\perp x_i, z_i, a_i$, where $D(u_{it})$ is known.

Assumption D.5. $E[e_{it} \mid x_i, z_i] = 0$, and $\delta_z \neq 0$.

Though Assumption D.3 is not out of place in the econometrics literature, it can be tested. For example, a choice could be made between $e^{d^v_{itl} + u_{it}\zeta^v_l}$ and $z_l(x_{it}, z_{it}, t_{it}, u_{it})$. Then, a test for misspecified moments based upon Rivers and Vuong (2002) can be applied.
This is important because equation (1.13) always holds, so Assumption D.3 can be tested using the Rivers and Vuong test, and an alternative model based upon such a test could be provided if the assumption does not hold. Now, given the previous assumptions of the model, the following is obtained:

$E(y_{itl} \mid h_i, x_{it}, z_{it}, t_{it} = 1) = \int_{q_{it}}^{\infty} \Lambda(d^v_{itl} + u_{it}\zeta^v_l)\, f(u_{it})\, du_{it}$  (1.14)

$E(y_{itl} \mid h_i, x_{it}, z_{it}, t_{it} = 0) = \int_{-\infty}^{q_{it}} \Lambda(d^v_{itl} + u_{it}\zeta^v_l)\, f(u_{it})\, du_{it}$  (1.15)

where $q_{it} = z_{it}\delta_z + x_{it}\delta_x + \bar{z}_i\delta_{\bar{z}} + \bar{x}_i\delta_{\bar{x}}$. Here $\delta_x$, $\delta_z$, and the corresponding coefficients on the time-averaged variables denote the scaled coefficients of $x$, $z$, $\bar{z}$, and $\bar{x}$ in the Bernoulli MLE problem, where, for a given parametric assumption on $u_{it}$, $P(t_{it} = 1 \mid x_{it}, z_{it}, \bar{z}_i, \bar{x}_i) = g(x_{it}, z_{it}, \bar{x}_i, \bar{z}_i; \delta_x, \delta_z, \delta_{\bar{x}}, \delta_{\bar{z}})$.

1.3 Estimation Method

The estimator will be derived from a QMLE problem by pooling across time for each subject. The serial dependence across $\{y_{itl}\}$ is unrestricted. Based upon (1.14) and (1.15), the QMLE problem is

$\max \sum_{i=1}^{N} \sum_{t=1}^{T} \Big\{ t_{it} \sum_{l=1}^{L-1} y_{itl} \log \int_{q_{it}}^{\infty} \Lambda(d^v_{itl} + u_{it}\zeta^v_l) f(u_{it})\, du_{it} + t_{it} \Big(1 - \sum_{l=1}^{L-1} y_{itl}\Big) \log \int_{q_{it}}^{\infty} \Big(1 - \sum_{l=1}^{L-1} \Lambda(d^v_{itl} + u_{it}\zeta^v_l)\Big) f(u_{it})\, du_{it} + (1 - t_{it}) \sum_{l=1}^{L-1} y_{itl} \log \int_{-\infty}^{q_{it}} \Lambda(d^v_{itl} + u_{it}\zeta^v_l) f(u_{it})\, du_{it} + (1 - t_{it}) \Big(1 - \sum_{l=1}^{L-1} y_{itl}\Big) \log \int_{-\infty}^{q_{it}} \Big(1 - \sum_{l=1}^{L-1} \Lambda(d^v_{itl} + u_{it}\zeta^v_l)\Big) f(u_{it})\, du_{it} \Big\}$

where the maximization is over $\beta^v \in \mathbb{R}^{K \times L}$, $\pi \in \mathbb{R}^{K+M+1}$, $\alpha^v_l \in \mathbb{R}$, $\gamma^v \in \mathbb{R}^{L}$, $\xi^v \in \mathbb{R}^{T \times L}$, and $\zeta^v_l$. The maximization problem represents the maximization of the likelihood of a multinomial random variable, where choice $l$ is made with probability $\int_{q_{it}}^{\infty} \Lambda(d^v_{itl} + u_{it}\zeta^v_l) f(u_{it})\, du_{it}$ or $\int_{-\infty}^{q_{it}} \Lambda(d^v_{itl} + u_{it}\zeta^v_l) f(u_{it})\, du_{it}$, given the covariates and the observed value of $t_{it}$. The above objective function handles the endogenous switching problem that arises from the endogenous binary variable.
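Every term of the objective above requires integrals of the form $\int_{q}^{\infty} \Lambda(d + \zeta u) f(u)\, du$ over a truncated normal region. As a sanity check on such terms, here is a minimal Python sketch for the binary-outcome case (L = 2, so $\Lambda$ reduces to the scalar logit), using a simple trapezoid rule on a truncated grid; the function names and the grid width are my illustrative choices, not the chapter's code:

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def logit(v):
    return 1.0 / (1.0 + math.exp(-v))

def truncated_mean(d, zeta, q, upper=True, n=2000, width=8.0):
    """Approximate  int Lambda(d + zeta*u) f(u) du  over [q, inf) or
    (-inf, q] for standard-normal u via a trapezoid rule; the tail beyond
    width standard deviations past q is numerically negligible."""
    a, b = (q, q + width) if upper else (q - width, q)
    h = (b - a) / n
    total = 0.0
    for j in range(n + 1):
        u = a + j * h
        w = 0.5 if j in (0, n) else 1.0  # trapezoid endpoint weights
        total += w * logit(d + zeta * u) * normal_pdf(u)
    return total * h
```

With $\zeta = 0$ the integrand factors, so the result should equal $\Lambda(d)\,[1 - \Phi(q)]$ for the upper region, which gives a closed-form check on the quadrature.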
This method is applied by Wooldridge (2014) in the context of a probit conditional mean function when there exist two fractional dependent variables and a binary endogenous variable. A just-identified GMM system can be set up using the score functions based upon the QMLE problem. Let

$\theta = (\xi^{v\prime}_{tl}, \beta^{v\prime}_l, \alpha^v_l, \gamma^v_l, \pi^{v\prime}, \zeta^v_l)$  (1.16)

$\delta = (\delta_z, \delta_x, \delta_{\bar{x}}, \delta_{\bar{z}})$  (1.17)

$\psi_i(\theta, \delta) = \big( s_{\delta i}(\delta)^{\prime},\ s_{\xi^v i}(\theta; \delta)^{\prime},\ s_{\beta^v i}(\theta; \delta)^{\prime},\ s_{\alpha^v i}(\theta; \delta)^{\prime},\ s_{\gamma^v i}(\theta; \delta)^{\prime},\ s_{\pi^v i}(\theta; \delta)^{\prime},\ s_{\zeta^v i}(\theta; \delta)^{\prime} \big)^{\prime}$  (1.18)

For example, if I assume $u_{it}$ follows a normal distribution and use a logit specification for the conditional mean function, then, with $w_{it} = (z_{it}\ x_{it}\ \bar{z}_i\ \bar{x}_i)$, the first-stage score is

$s_{\delta i}(\delta) = \sum_{t=1}^{T} \dfrac{\phi(w_{it}\delta)\, w_{it}^{\top}\, [t_{it} - \Phi(w_{it}\delta)]}{\Phi(w_{it}\delta)\,[1 - \Phi(w_{it}\delta)]}$

and the score with respect to $\zeta^v_l$ is

$s_{\zeta^v_l i}(\theta; \delta) = \sum_{t=1}^{T} \Bigg\{ t_{it} \sum_{k=1, k \neq l}^{L-1} y_{itk}\, \dfrac{\int_{q_{it}}^{\infty} M(d^v_{itk} + \zeta^v_k u_{it},\, d^v_{itl} + \zeta^v_l u_{it})\, u_{it}\, f(u_{it})\, du_{it}}{\int_{q_{it}}^{\infty} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\, f(u_{it})\, du_{it}} + t_{it}\, y_{itl}\, \dfrac{\int_{q_{it}}^{\infty} \Lambda^{\prime}(d^v_{itl} + \zeta^v_l u_{it})\, u_{it}\, f(u_{it})\, du_{it}}{\int_{q_{it}}^{\infty} \Lambda(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}} + t_{it} \Big(1 - \sum_{k=1}^{L-1} y_{itk}\Big) \dfrac{\int_{q_{it}}^{\infty} P(d^v_{itl} + \zeta^v_l u_{it})\, u_{it}\, f(u_{it})\, du_{it}}{\int_{q_{it}}^{\infty} \big(1 - \sum_{k=1}^{L-1} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\big) f(u_{it})\, du_{it}} + (1 - t_{it})\, y_{itl}\, \dfrac{\int_{-\infty}^{q_{it}} \Lambda^{\prime}(d^v_{itl} + \zeta^v_l u_{it})\, u_{it}\, f(u_{it})\, du_{it}}{\int_{-\infty}^{q_{it}} \Lambda(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}} + (1 - t_{it}) \sum_{k=1, k \neq l}^{L-1} y_{itk}\, \dfrac{\int_{-\infty}^{q_{it}} M(d^v_{itk} + \zeta^v_k u_{it},\, d^v_{itl} + \zeta^v_l u_{it})\, u_{it}\, f(u_{it})\, du_{it}}{\int_{-\infty}^{q_{it}} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\, f(u_{it})\, du_{it}} + (1 - t_{it}) \Big(1 - \sum_{k=1}^{L-1} y_{itk}\Big) \dfrac{\int_{-\infty}^{q_{it}} P(d^v_{itl} + \zeta^v_l u_{it})\, u_{it}\, f(u_{it})\, du_{it}}{\int_{-\infty}^{q_{it}} \big(1 - \sum_{k=1}^{L-1} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\big) f(u_{it})\, du_{it}} \Bigg\}$

while for each intercept,

$s_{\xi^v_{tl} i}(\theta; \delta) = t_{it} \sum_{k=1, k \neq l}^{L-1} y_{itk}\, \dfrac{\int_{q_{it}}^{\infty} M(d^v_{itk} + \zeta^v_k u_{it},\, d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}}{\int_{q_{it}}^{\infty} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\, f(u_{it})\, du_{it}} + t_{it}\, y_{itl}\, \dfrac{\int_{q_{it}}^{\infty} \Lambda^{\prime}(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}}{\int_{q_{it}}^{\infty} \Lambda(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}} + t_{it} \Big(1 - \sum_{k=1}^{L-1} y_{itk}\Big) \dfrac{\int_{q_{it}}^{\infty} P(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}}{\int_{q_{it}}^{\infty} \big(1 - \sum_{k=1}^{L-1} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\big) f(u_{it})\, du_{it}} + (1 - t_{it})\, y_{itl}\, \dfrac{\int_{-\infty}^{q_{it}} \Lambda^{\prime}(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}}{\int_{-\infty}^{q_{it}} \Lambda(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}} + (1 - t_{it}) \sum_{k=1, k \neq l}^{L-1} y_{itk}\, \dfrac{\int_{-\infty}^{q_{it}} M(d^v_{itk} + \zeta^v_k u_{it},\, d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}}{\int_{-\infty}^{q_{it}} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\, f(u_{it})\, du_{it}} + (1 - t_{it}) \Big(1 - \sum_{k=1}^{L-1} y_{itk}\Big) \dfrac{\int_{-\infty}^{q_{it}} P(d^v_{itl} + \zeta^v_l u_{it})\, f(u_{it})\, du_{it}}{\int_{-\infty}^{q_{it}} \big(1 - \sum_{k=1}^{L-1} \Lambda(d^v_{itk} + \zeta^v_k u_{it})\big) f(u_{it})\, du_{it}}$

Note that

$\Lambda^{\prime}(d^v_{itl} + \zeta^v u_{it}) = \dfrac{e^{d^v_{itl} + \zeta^v u_{it}} \big(1 + \sum_{k=1}^{L-1} e^{d^v_{itk} + \zeta^v u_{it}}\big) - e^{2(d^v_{itl} + \zeta^v u_{it})}}{\big(1 + \sum_{k=1}^{L-1} e^{d^v_{itk} + \zeta^v u_{it}}\big)^2}$

$M(d^v_{itk} + \zeta^v u_{it},\, d^v_{itl} + \zeta^v u_{it}) = \dfrac{-e^{d^v_{itl} + \zeta^v u_{it}}\, e^{d^v_{itk} + \zeta^v u_{it}}}{\big(1 + \sum_{k=1}^{L-1} e^{d^v_{itk} + \zeta^v u_{it}}\big)^2}$

$P(d^v_{itk} + \zeta^v u_{it}) = \dfrac{-e^{d^v_{itk} + \zeta^v u_{it}}}{\big(1 + \sum_{k=1}^{L-1} e^{d^v_{itk} + \zeta^v u_{it}}\big)^2}$

These scores, along with the scores that are used to estimate $\delta$, are used to generate corresponding sample moments. Now, the standard regularity assumptions from Newey and McFadden (1994) are made.

Assumption E.1. $\Theta \times \Delta$ is compact.

Assumption E.2. $\{(x_{i1}, \ldots, x_{iT}), (z_{i1}, \ldots, z_{iT}), (t_{i1}, \ldots, t_{iT}), [(y_{i11}, \ldots, y_{i1L}), \ldots, (y_{iT1}, \ldots, y_{iTL})]\}_{i=1}^{N}$ is iid.

Assumption E.3. There exists a unique $(\delta_0, \theta_0) \in \Theta \times \Delta$ such that $E(\psi_i(\theta_0, \delta_0)) = 0$.

Assumption E.4. $E\big(\sup_{(\theta, \delta) \in \Theta \times \Delta} \|\psi_i(\theta, \delta)\|\big) < \infty$.

Assumption E.5. $E(\|\psi_i(\theta_0, \delta_0)\|^2) < \infty$ and $E\big(\big\|\partial \psi_i(\theta_0, \delta_0) / \partial(\theta, \delta)^{\top}\big\|\big) < \infty$.

Then,

$\sqrt{N}\begin{pmatrix} \hat{\delta} - \delta \\ \hat{\theta} - \theta \end{pmatrix} \rightarrow N\big(0,\ (D^{\top}D)^{-1} D^{\top} \Omega D (D^{\top}D)^{-1}\big)$

where $D = E\big[\partial \psi_i(\theta_0, \delta_0) / \partial(\theta, \delta)^{\top}\big]$ and $\Omega = E[\psi_i(\theta_0, \delta_0)\, \psi_i(\theta_0, \delta_0)^{\top}]$. The asymptotic variance can be estimated using $\hat{D} = N^{-1} \sum_{i=1}^{N} \partial \psi_i(\hat{\theta}, \hat{\delta}) / \partial(\theta, \delta)^{\top}$ as a consistent estimator for $D$ and $\hat{\Omega} = N^{-1} \sum_{i=1}^{N} \psi_i(\hat{\theta}, \hat{\delta})\, \psi_i(\hat{\theta}, \hat{\delta})^{\top}$ as a consistent estimator for $\Omega$.
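Once $\hat{D}$ and $\hat{\Omega}$ are in hand, the sandwich variance is pure matrix arithmetic. A schematic two-parameter Python illustration with hand-rolled linear algebra (the matrices below are made up for the example, not estimates from the chapter):

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def sandwich(D, Omega):
    """(D'D)^{-1} D' Omega D (D'D)^{-1}: the asymptotic variance of the
    stacked estimator, with D the Jacobian of the moments and Omega the
    outer product of the moment functions (2x2 case for illustration)."""
    bread = inv2(matmul(transpose(D), D))
    meat = matmul(matmul(transpose(D), Omega), D)
    return matmul(matmul(bread, meat), bread)
```

For a just-identified system with invertible $D$, the sandwich collapses to $D^{-1}\Omega D^{-\top}$; in particular, taking $D$ to be the identity returns $\Omega$ itself, which gives a quick correctness check.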
1.4 Average Partial Effects

In order to separate the average partial effect of the continuous variables from that of the discrete variable, let $\Upsilon_{it} = (h_i, x_{it}, z_{it})$. Let $(\cdot)^j$ denote element $j$ corresponding to the continuous covariate $(\cdot)$. Then (see Appendix A),

$E_{u_{it}}\big[E[y_{itl} \mid \Upsilon, t_{it} = 1, u_{it}] - E[y_{itl} \mid \Upsilon, t_{it} = 0, u_{it}]\big] = \int \Lambda(d^{v,t=1}_{tl} + \zeta^v u_{it}) f(u_{it})\, du_{it} - \int \Lambda(d^{v,t=0}_{tl} + \zeta^v u_{it}) f(u_{it})\, du_{it} = E_{u_{it}}[\varphi_{tl}(x^0, h^0; \theta)] = \Sigma_{tl}(x^0, h^0)$

where

$d^{v,t=1}_{tl} = \xi^v_{tl} + x^0 \beta^v_l + \alpha^v_l + \gamma^v_l h^0 \pi^v$

$d^{v,t=0}_{tl} = \xi^v_{tl} + x^0 \beta^v_l + \gamma^v_l h^0 \pi^v$

$\varphi_{tl}(x^0, h^0; \theta) = \Lambda(d^{v,t=1}_{tl} + \zeta^v u_{it}) - \Lambda(d^{v,t=0}_{tl} + \zeta^v u_{it})$

Similarly, the average partial effect for a continuous explanatory variable is

$E_{u_{it}}\big[\partial E[y_{itl} \mid \Upsilon, t_{it}, u_{it}] / \partial (\cdot)^j\big] = \int_{\mathbb{R}} \partial \Lambda(d^v_{itl} + \zeta^v_l u_{it}) / \partial (\cdot)^j\, f(u_{it})\, du_{it} = \Xi_{tl}(x^0, h^0, t^0)$

In either case, the estimator is similar to the Average Structural Function (ASF) of Blundell and Powell (2001). The expected value of the marginal change in the explanatory variable is taken with respect to the distribution of the unobserved error, the source of the endogeneity that is still present. This leads to the following two theorems.

Theorem 1. Suppose Assumptions D.1–D.5 and E.1–E.5 hold and $E\big(\sup_{(\theta, \delta) \in \Theta \times \Delta} \|\nabla_{\delta, \theta}\, \partial \Lambda(d^{v0}_{tl} + \zeta^v u_{it}) / \partial (\cdot)^j\|\big) < \infty$. Then,

$\sqrt{N}\big(\hat{\Xi}_{tl}(x^0, h^0, t^0) - \Xi_{tl}(x^0, h^0, t^0)\big) \rightarrow N(0, V_{tl})$

where $\hat{\Xi}_{tl}(x^0, h^0, t^0) = N^{-1} \sum_{i=1}^{N} \partial \Lambda(\hat{d}^{v0}_{itl} + \hat{\zeta}^v_l u_{it}) / \partial (\cdot)^j$,

$d^{v0}_{itl} = \xi^v_{tl} + x^0 \beta^v_l + t^0_{it} \alpha^v_l + \gamma^v_l h^0 \pi^v, \qquad \hat{d}^{v0}_{itl} = \hat{\xi}^v_{tl} + x^0 \hat{\beta}^v_l + t^0_{it} \hat{\alpha}^v_l + \hat{\gamma}^v_l h^0 \hat{\pi}^v$

$V_{tl} = E[V_{tli}^{\top} V_{tli}]$

$V_{tli} = \partial \Lambda(d^{v0}_{tl} + \zeta^v u_{it}) / \partial (\cdot)^j - \Xi_{tl}(x^0, h^0, t^0) + E\big[\nabla_{\delta, \theta}\big(\partial \Lambda(d^{v0}_{tl} + \zeta^v u_{it}) / \partial (\cdot)^j\big)\big] K_i$

$K_i = B_0^{-1} G_0^{\top} \psi_i(\theta_0, \delta_0), \qquad B_0 = G_0^{\top} G_0, \qquad G_0 = E[\nabla_{\delta, \theta}\, \psi_i(\theta_0, \delta_0)]$

Theorem 2. Suppose Assumptions D.1–D.5 and E.1–E.5 hold and $E\big(\sup_{(\theta, \delta) \in \Theta \times \Delta} \|\nabla_{\delta, \theta}\, \varphi_{tl}(x^0, h^0; \theta)\|\big) < \infty$. Then,

$\sqrt{N}\big(\hat{\Sigma}_{tl}(x^0, h^0) - \Sigma_{tl}(x^0, h^0)\big) \rightarrow N(0, J_{tl})$

where $\hat{\Sigma}_{tl}(x^0, h^0) = N^{-1} \sum_{i=1}^{N} \big[\Lambda(\hat{d}^{v,t=1}_{tl} + \hat{\zeta}^v u_{it}) - \Lambda(\hat{d}^{v,t=0}_{tl} + \hat{\zeta}^v u_{it})\big]$,

$\hat{d}^{v,t=1}_{tl} = \hat{\xi}^v_{tl} + x^0 \hat{\beta}^v_l + \hat{\alpha}^v_l + \hat{\gamma}^v_l h^0 \hat{\pi}^v, \qquad d^{v,t=0}_{tl} = \xi^v_{tl} + x^0 \beta^v_l + \gamma^v_l h^0 \pi^v$

$J_{tl} = E[J_{tli}^{\top} J_{tli}]$

$J_{tli} = \varphi_{tl}(x^0, h^0; \theta) - \Sigma_{tl}(x^0, h^0) + E[\nabla_{\delta, \theta}\, \varphi_{tl}(x^0, h^0; \theta)] K_i$

Theorems 1 and 2 represent the average partial effect taken while holding $h_i$ at some fixed value. I have kept the notation $u_{it}$ to emphasize that this estimator is particular to a specific time period, and that a single $u_{it}$ is generated for each $i$. Usually, the average partial effect is taken over the distribution of the unobserved heterogeneity. This is done below. The computations required to obtain this average partial effect lead to the generation of multiple draws $u_{itr}$. In contrast to a single $u_{it}$ drawn for each $i$, multiple draws must be made for each $i$. The average partial effect when the expectation is taken over the joint distribution of $c_i$ and $e_{it}$, conditioning on their proxies $h_i$ and $u_{it}$ at the fixed values $x^0$ and $t^0$, is

$E_{h_i, u_{it}}\big[\partial E[y_{itl} \mid x^0, t^0, h_i, u_{it}] / \partial v^j_{it}\big] = \int_{\mathbb{R}} \int_{\mathbb{R}} \partial \Lambda(\xi_{tl} + x^0 \beta_l + t^0 \alpha_l + \gamma_l h_i \pi + \zeta_l u_{it}) / \partial v^j_{it}\, f(u_i \mid h_i) f(h_i)\, dh_i\, du_{it} = \Omega_{tl}(x^0, t^0)$

The corresponding estimator is

$\hat{\Omega}_{tl}(x^0, t^0) = N^{-1} R^{-1} \sum_{i=1}^{N} \sum_{r=1}^{R} \partial \Lambda(\hat{\xi}_{tl} + x^0 \hat{\beta}_l + t^0 \hat{\alpha}_l + \hat{\gamma}_l h_i \hat{\pi} + \hat{\zeta}_l u_{itr}) / \partial v^j_{it}$

Similarly, the APE in this case for the binary EEV is

$\Gamma_{tl}(x^0) = \int_{\mathbb{R}} \int_{\mathbb{R}} \Lambda(\xi_{tl} + x^0 \beta_l + \alpha_l + \gamma_l h_i \pi + \zeta_l u_{it})\, f(u_{it} \mid h_i) f(h_i)\, dh_i\, du_{it} - \int_{\mathbb{R}} \int_{\mathbb{R}} \Lambda(\xi_{tl} + x^0 \beta_l + \gamma_l h_i \pi + \zeta_l u_{it})\, f(u_{it} \mid h_i) f(h_i)\, dh_i\, du_{it}$

In this case, the corresponding estimator is

$\hat{\Gamma}_{tl}(x^0) = N^{-1} R^{-1} \sum_{i=1}^{N} \sum_{r=1}^{R} \big[\Lambda(\hat{\xi}_{tl} + x^0 \hat{\beta}_l + \hat{\alpha}_l + \hat{\gamma}_l h_i \hat{\pi} + \hat{\zeta}_l u_{itr}) - \Lambda(\hat{\xi}_{tl} + x^0 \hat{\beta}_l + \hat{\gamma}_l h_i \hat{\pi} + \hat{\zeta}_l u_{itr})\big]$

Next, I will present the asymptotic distribution of these APEs. First, I will apply a mean value expansion to $\sqrt{N}\, \hat{\Omega}_{tl}(x^0, t^0)$.
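The last estimator above averages the difference of the two conditional means (the binary variable set to one versus zero) over the empirical distribution of $h_i$ and over R simulated draws of $u_{it}$. A minimal Python sketch of that double average for a binary-outcome case (L = 2, scalar coefficients; all coefficient names and values are illustrative placeholders, not the chapter's estimates):

```python
import math
import random

def logit(v):
    return 1.0 / (1.0 + math.exp(-v))

def simulated_ape(x0, alpha, beta, gamma, pi, zeta, h, R=200, seed=7):
    """Simulated APE of the binary EEV: for each subject's time-average
    proxy h_i, draw R standard-normal u's, difference the logit mean with
    the binary variable switched on and off, and average over draws and
    subjects.  Coefficients (alpha, beta, gamma, pi, zeta) are scalars."""
    rng = random.Random(seed)
    total = 0.0
    for hi in h:
        for _ in range(R):
            u = rng.gauss(0.0, 1.0)
            base = x0 * beta + gamma * hi * pi + zeta * u
            total += logit(base + alpha) - logit(base)
    return total / (len(h) * R)
```

As the text notes, the simulation error is under the researcher's control: increasing R shrinks it regardless of the sample size N, since the distribution of $u_{it}$ is known.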
Let

$\sqrt{N}\, \hat{\Omega}_{tl}(x^0, t^0) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \big[\tilde{f}_R(x^0, t^0, h_i; \theta_0) + \nabla_{\delta, \theta} \tilde{f}_R(x^0, t^0, h_i; \tilde{\theta})(\hat{\theta} - \theta_0)\big]$  (1.19)

where $\tilde{f}_R(x^0, t^0, h_i; \theta) = R^{-1} \sum_{r=1}^{R} \partial \Lambda(\xi_{tl} + x^0 \beta_l + t^0 \alpha_l + \gamma_l h_i \pi + \zeta^v u_{itr}) / \partial (\cdot)^j$ and $f(x^0, t^0, h_i; \theta) = E\big[\partial \Lambda(\xi_{tl} + x^0 \beta_l + t^0 \alpha_l + \gamma_l h_i \pi + \zeta^v u_{itr}) / \partial (\cdot)^j \mid h_i\big]$. Now, note that if $R \rightarrow \infty$ along with $N \rightarrow \infty$, then, writing $\eta = (\delta, \theta)$, by the consistency assumptions and $E\big(\sup_{(\theta, \delta) \in \Theta \times \Delta} \|\nabla_{\delta, \theta}\, \partial \Lambda(\xi_{tl} + x^0 \beta_l + t^0 \alpha_l + \gamma_l h_i \pi + \zeta^v u_{it}) / \partial (\cdot)^j\|\big) < \infty$,

$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \nabla_{\delta, \theta} \tilde{f}_R(x^0, t^0, h_i; \tilde{\theta})(\hat{\eta} - \eta_0) = E[\nabla_{\delta, \theta} \tilde{f}_R(x^0, t^0, h_i; \theta_0)] K_N + o_p(1)$  (1.20)

Then,

$\sqrt{N}\big(\hat{\Omega}_{tl}(x^0, t^0) - \Omega_{tl}(x^0, t^0)\big) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \big[\tilde{f}_R(x^0, t^0, h_i; \theta_0) - \Omega_{tl}(x^0, t^0) + E[\nabla_{\delta, \theta} \tilde{f}_R(x^0, t^0, h_i; \theta_0)] K_i + o_p(1)\big]$

Now, note that $\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \big[\tilde{f}_R(x^0, t^0, h_i; \theta_0) - \Omega_{tl}(x^0, t^0) + E[\nabla_{\delta, \theta} \tilde{f}_R(x^0, t^0, h_i; \theta_0)] K_i\big]$ has mean 0. If this is not immediately obvious, note that unlike in Hajivassiliou and Ruud (1994) the simulations are carried out directly upon the partial derivative of the multinomial logit function with respect to some explanatory variable for each $h_i$, as opposed to the simulations being carried out on the likelihood and then taking the gradient. Write $\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \tilde{f}_R(x^0, t^0, h_i; \theta_0) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} f(x^0, t^0, h_i; \theta_0) + A_N + B_N$, where

$A_N = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \big[\tilde{f}_R(x^0, t^0, h_i; \theta_0) - E_{u|h}[\tilde{f}_R(x^0, t^0, h_i; \theta_0)]\big]$  (1.21)

$B_N = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \big[E_{u|h}[\tilde{f}_R(x^0, t^0, h_i; \theta_0)] - f(x^0, t^0, h_i; \theta_0)\big]$  (1.22)

By the definitions of $\tilde{f}_R(x^0, t^0, h_i; \theta_0)$ and $f(x^0, t^0, h_i; \theta_0)$, $A_N$ and $B_N$ both have zero expectation. Then, since $E[\tilde{f}_R(x^0, t^0, h_i; \theta_0)] = E[E[\tilde{f}_R(x^0, t^0, h_i; \theta_0) \mid h_i]] = \Omega_{tl}(x^0, t^0)$, the central limit theorem implies, as $N \rightarrow \infty$,

$\sqrt{N}\big(\hat{\Omega}_{tl}(x^0, t^0) - \Omega_{tl}(x^0, t^0)\big) \rightarrow N(0, J_{tl})$

$J_{tl} = E(J_{tli}^{\top} J_{tli})$

$J_{tli} = \tilde{f}(x^0, t^0, h_i; \theta_0) - \Omega_{tl}(x^0, t^0) + E[\nabla_{\delta, \theta} \tilde{f}(x^0, t^0, h_i; \theta_0)] K_i$

$\tilde{f}(x^0, t^0, h_i; \theta_0) = \partial \Lambda(\xi_{tl} + x^0 \beta_l + t^0 \alpha_l + \gamma_l h_i \pi + \zeta^v u_{itr}) / \partial (\cdot)^j$

Similarly, as $N \rightarrow \infty$,

$\sqrt{N}\big(\hat{\Gamma}_{tl}(x^0) - \Gamma_{tl}(x^0)\big) \rightarrow N(0, L_{tl})$

$L_{tl} = E(L_{tli}^{\top} L_{tli})$

$L_{tli} = f_{\mathrm{diff}}(x^0, h_i; \theta_0) - \Gamma_{tl}(x^0) + E[\nabla_{\delta, \theta} f_{\mathrm{diff}}(x^0, h_i; \theta_0)] K_i$

$f_{\mathrm{diff}}(x^0, h_i; \theta_0) = \Lambda(\xi_{tl} + x^0 \beta_l + \alpha_l + \gamma_l h_i \pi + \zeta^v u_{itr}) - \Lambda(\xi_{tl} + x^0 \beta_l + \gamma_l h_i \pi + \zeta^v u_{itr})$

It is important to note that even though it would appear that the estimators of the true APEs are limited by the sample size, this is not the case. Since the distribution of $u_{it}$ is known, a researcher could simply simulate enough draws of $u_{it}$ until the researcher feels that the estimators are sufficiently close to the true APEs. It is also important to note that these theorems hold for a correctly specified model, that is, when equation (1.9) holds, except for one special case, which will be explored in the simulation section.

1.5 Test for Endogeneity

The method described relies upon distributional assumptions that are separate from the correct choice of the functional form for the conditional mean function. The distribution of $u_{it}$ would have to be correctly chosen. Furthermore, the method itself is computationally intensive; however, a variable addition test (VAT) based upon Wooldridge (2014) can serve as a test of whether the variable $t_{it}$ is endogenous. Given the choice of the distribution of $u_{it}$, the test will utilize a generalized residual as proposed in Gourieroux et al. (1987). The test will rely upon $\zeta_l = 0$ for $l = 1, \ldots, L-1$. The VAT on the generalized residual will be shown in this section to be asymptotically equivalent to the Lagrange Multiplier test under the null hypothesis of no endogeneity. I will show that the Lagrange Multiplier (LM) test statistic has an asymptotic $\chi^2_{L-1}$ distribution under $H_0: \zeta_1 = \cdots = \zeta_{L-1} = 0$. Then, I will show that the VAT is asymptotically equivalent to the LM test, and therefore the VAT statistic also has a $\chi^2_{L-1}$ distribution under $H_0$.
It should be noted that the specific test here is based upon Lin and Wooldridge (2017), though the LM test is not infeasible, since there is no additional error arising from a continuous endogenous variable. Let $d^{vr}_{itl}$ denote the value of $d^v_{itl}$ when $\zeta_l = 0$, and let $\theta^r$ be the estimate of $\theta$ based upon the restricted model. Note that in the restricted model $t_{it}$ is exogenous. Let $\hat{gr}_{it}(\hat{\delta})$ denote the estimate of the generalized residual, which is to be used as a consistent estimator of $gr_{it} \equiv E(u_{it} \mid t_{it}, x_i, z_i)$. Note that I have written the estimate of the generalized residual as a function of $\hat{\delta}$ to emphasize that the estimates are a function of the estimated parameters from the latent variable equation for $t_{it}$. Consider the LM statistic, where the estimates of the restricted model will be plugged into the score from the unrestricted model,

$LM = \Big(\sum_{i=1}^{N} \tilde{S}_{i,\zeta}\Big)^{\top} \tilde{A}_{22} [\tilde{V}_{22}]^{-1} \tilde{A}_{22} \Big(\sum_{i=1}^{N} \tilde{S}_{i,\zeta}\Big) / N$  (1.23)

where

$\tilde{S}_{i,\zeta} \equiv \dfrac{\partial \ln L_i}{\partial \zeta}\Big|_{\theta = \theta^r,\ \zeta = 0}$

$\tilde{A} \equiv \dfrac{1}{N} \begin{pmatrix} \sum_{i=1}^{N} \hat{E}\big(\frac{\partial^2 \ln L_i}{\partial \theta\, \partial \theta^{\top}} \mid t_{it}, x_i, z_i\big) & \sum_{i=1}^{N} \hat{E}\big(\frac{\partial^2 \ln L_i}{\partial \theta\, \partial \zeta^{\top}} \mid t_{it}, x_i, z_i\big) \\ \sum_{i=1}^{N} \hat{E}\big(\frac{\partial^2 \ln L_i}{\partial \zeta\, \partial \theta^{\top}} \mid t_{it}, x_i, z_i\big) & \sum_{i=1}^{N} \hat{E}\big(\frac{\partial^2 \ln L_i}{\partial \zeta\, \partial \zeta^{\top}} \mid t_{it}, x_i, z_i\big) \end{pmatrix}\Bigg|_{\theta = \theta^r,\ \zeta = 0}$

$\tilde{A} = \begin{pmatrix} \tilde{A}_{11} & \tilde{A}_{12} \\ \tilde{A}_{21} & \tilde{A}_{22} \end{pmatrix}, \qquad \tilde{V} = \tilde{A}^{-1} \tilde{B} \tilde{A}^{-1} = \begin{pmatrix} \tilde{V}_{11} & \tilde{V}_{12} \\ \tilde{V}_{21} & \tilde{V}_{22} \end{pmatrix}, \qquad \tilde{B} \equiv \dfrac{1}{N} \sum_{i=1}^{N} \tilde{S}_{i,\zeta} \tilde{S}_{i,\zeta}^{\top}$

In $\tilde{A}$, the expectation is taken with respect to $u_{it}$. It is important to note the use of $\hat{E}(\cdot \mid t_{it}, x_i, z_i)$, which is a function of $\hat{gr}_{it}(\hat{\delta})$. While the distribution of $u_{it} \mid x_i, z_i$ is known, this does not imply knowledge of the distribution of $u_{it} \mid t_{it}, x_i, z_i$. If this distribution were known, then each element of $\tilde{A}$ would be a function of the conditional expectations themselves, as opposed to estimators that are functions of the generalized residuals. The summation represents the application of iterated expectations.
Applying the summation over i and dividing by N serves as a consistent estimator of the unconditional expected value of the second partial derivatives of the likelihood function. The log-likelihood function for an individual i when implementing the VAT is

\[
L_i = \sum_{t=1}^{T} \Bigg[ \sum_{l=1}^{L-1} y_{itl} \log \Lambda\big(d^{v}_{itl} + gr_{it}\tau_l\big) + \Big(1 - \sum_{l=1}^{L-1} y_{itl}\Big) \log\Big(1 - \sum_{l=1}^{L-1} \Lambda\big(d^{v}_{itl} + gr_{it}\tau_l\big)\Big) \Bigg].
\]

To implement the VAT, the procedure is as follows:

Procedure 5.1

1. Generate the generalized residuals \(\widehat{gr}_{it}\) from a first-stage pooled maximum likelihood estimation (e.g., probit estimation) of \(t_{it}\) on \(x_{it}\) and \(z_{it}\). For example, the generalized residuals from probit estimation are
\[
\widehat{gr}_{it} = \frac{\phi(x_{it}\hat{\gamma}_x + z_{it}\hat{\gamma}_z)\big[t_{it} - \Phi(x_{it}\hat{\gamma}_x + z_{it}\hat{\gamma}_z)\big]}{\Phi(x_{it}\hat{\gamma}_x + z_{it}\hat{\gamma}_z)\big(1 - \Phi(x_{it}\hat{\gamma}_x + z_{it}\hat{\gamma}_z)\big)}.
\]

2. Obtain the maximum likelihood estimates of the parameters using the individual log-likelihood function \(L_i\).

3. Obtain the Wald test statistic under \(H_0 : \tau_1 = \dots = \tau_{L-1} = 0\).

Under the null hypothesis, \(\tau_l = 0\) for \(l = 1, \dots, L-1\), so the estimation of \(\widehat{gr}_{it}\) does not affect the asymptotic distribution of the test statistic. The score vector is

\[
S_{i,\tau} = \begin{bmatrix} \dfrac{\partial L_i}{\partial \theta} \\[6pt] \dfrac{\partial L_i}{\partial \tau} \end{bmatrix}
= \big( s_{\xi L_i}^{\top},\; s_{\beta L_i}^{\top},\; s_{\alpha L_i}^{\top},\; s_{\lambda L_i}^{\top},\; s_{\pi L_i}^{\top},\; s_{\tau L_i}^{\top} \big)^{\top}.
\]

Now, since

\[
\sqrt{N}\begin{bmatrix}\hat{\theta}-\theta\\ \hat{\tau}-\tau\end{bmatrix} = N^{-1/2}\sum_{i=1}^{N} A^{-1} S_i + o_p(1).
\]

Under \(H_0 : \tau_1 = \dots\)
= \(\tau_{L-1} = 0\), the Wald test statistic is

\[
W = (\hat{\tau} - \tau)^{\top} (\hat{V}_{22}/N)^{-1} (\hat{\tau} - \tau) = \sqrt{N}(\hat{\tau} - \tau)^{\top}\, \hat{V}_{22}^{-1}\, \sqrt{N}(\hat{\tau} - \tau)
\]

where

\[
\hat{A} = \begin{bmatrix} \hat{A}_{11} & \hat{A}_{12} \\ \hat{A}_{21} & \hat{A}_{22} \end{bmatrix}, \qquad
\hat{V} = \hat{A}^{-1} \hat{B} \hat{A}^{-1} = \begin{bmatrix} \hat{V}_{11} & \hat{V}_{12} \\ \hat{V}_{21} & \hat{V}_{22} \end{bmatrix}, \qquad
\hat{B} = N^{-1} \sum_{i=1}^{N} \tilde{S}_{i,\tau} \tilde{S}_{i,\tau}^{\top},
\]
\[
\hat{A}^{-1} \xrightarrow{p} A^{-1} = \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix}, \qquad
\hat{p}_{itl} = \hat{\xi}_{tl} + x_{it}\hat{\beta}_l + t_{it}\hat{\alpha}_l + \hat{\lambda}_l h_i \hat{\pi} + \hat{\tau}_l \widehat{gr}_{it}.
\]

Then the Wald statistic is

\[
W = \Big(\sum_{i=1}^{N} S_{i,\tau}\Big)^{\top} A^{22}\, \hat{V}_{22}^{-1}\, A^{22} \Big(\sum_{i=1}^{N} S_{i,\tau}\Big) \Big/ N \tag{1.24}
\]

Under the null hypothesis that \(\tau = 0\) and \(\zeta = 0\), \((\hat{\tau} - \tau) \xrightarrow{p} 0\), and \(\sqrt{N}(\hat{\theta} - \theta)\) and \(\sqrt{N}(\tilde{\theta} - \theta)\) converge in distribution. Then \(LM - W \xrightarrow{p} 0\), which implies that the tests are asymptotically equivalent (see Sections 12.6.2 and 12.6.3 in Wooldridge (2010)). This result is almost identical to the result from Lin and Wooldridge (2017), but the form of the test statistic is considerably more complex: the multiple response setting leads to complicated forms of the score and Hessian matrices.

1.6 Simulations

I performed simulations to examine how the estimates of the APEs differ from the true values when equation (1.9) does not hold. The main difficulty in carrying out simulations for the estimator described in this chapter is approximating the integrals contained within the moment conditions that are part of each score function. Two broad classes of methods were considered for approximating the integrals. Monte Carlo integration could be used, but the number of draws needed to approximate the integral may be so large that computation time becomes unacceptably high. Gaussian quadrature was also considered; the specific method used to approximate the integrals was Gauss-Laguerre quadrature.

1.6.1 Data Generating Process

For each simulation, N = 1,000, L = 3, and T = 4. I used 500 replications.
1.6.1.1 Regressors

Within each replication, 1,000 observations at each time period t are generated of \(x_{it}\), \(z_{it}\), \(u_{it}\), and \(v_{it}\), where

\[
x_{it} \sim \mathrm{Normal}(0, 4), \qquad z_{it} \sim \mathrm{Uniform}(0, 1), \qquad u_{it} \sim \mathrm{Normal}(0, 1).
\]

The variance of \(x_{it}\) was chosen to be 4 in order to induce more variation in the data, so that when the minimization problem is performed, a local minimum will not be chosen as the solution over the global minimum. \(v_{it}\) is generated from either a \(\mathrm{Normal}(0,1)\), a \(\mathrm{Logistic}(0,1)\), or a \(\chi^2_1\) distribution. Furthermore,

\[
t_{it} = 1[z_{it}\gamma_z + x_{it}\gamma_x + u_{it} \ge 0], \qquad \gamma_z \in \{0.1, 0.5, 1\}, \qquad \gamma_x = 0,
\]

and

\[
r_{it1} = \zeta_1 u_{it} + v_{it}, \qquad r_{it2} = \zeta_2 u_{it} + v_{it},
\]

where \(\zeta_1 \in \{0.5, 1\}\) and \(\zeta_2 \in \{0.1, 1\}\).

1.6.1.2 The Fractional Responses

The structural mean function G is chosen to be the multinomial logit function, such that

\[
E[y_{itl} \mid x_{it}, t_{it}, h_i, z_{it}, u_{it}] = \frac{e^{\xi_{tl} + x_{it}\beta_l + t_{it}\alpha_l + \lambda_l h_i \pi + \zeta_l u_{it}}}{1 + \sum_{k=1}^{L-1} e^{\xi_{tk} + x_{it}\beta_k + t_{it}\alpha_k + \lambda_k h_i \pi + \zeta_k u_{it}}} \tag{1.25}
\]

\[
E[y_{itL} \mid x_{it}, t_{it}, h_i, z_{it}, u_{it}] = \frac{1}{1 + \sum_{k=1}^{L-1} e^{\xi_{tk} + x_{it}\beta_k + t_{it}\alpha_k + \lambda_k h_i \pi + \zeta_k u_{it}}} \tag{1.26}
\]

where \(L = 3\), \(\beta_1 = 1\), \(\beta_2 = 2\), \(\alpha_1 = 1\), \(\alpha_2 = 2\), \(\lambda_1 = 1\), \(\lambda_2 = 2\), \(\pi = (1, 1, 1)\), and \(\xi_{tl} = 0\) for all t and l. In order to generate the fractional response variables, the following procedure from Nam (2014) is used:

1. Calculate the response probabilities using (1.25), (1.26), and the aforementioned regressors.
2. Draw 100 multinomial outcomes at each i and t among choices 1, 2, and 3, based upon the response probabilities.
3. Count the frequencies at each i and t and obtain the proportion for each outcome.

When \(v_{it} \sim \mathrm{Normal}(0, 1)\), this is a special case in which equation (1.9) does not hold yet the estimates of the APEs will be consistent. This is because the choice of G as the multinomial logit function together with \(u_{it} \sim \mathrm{Normal}(0, 1)\) causes the QMLE problem to be set up as if the researcher were working directly with the structural mean function. In this situation, the estimates of the parameters are the estimates of the structural parameters.
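As a check on the description above, the data generating process for a single time period can be sketched as follows. This is a minimal sketch under simplifying assumptions: the heterogeneity term \(\lambda_l h_i \pi\) and the intercepts \(\xi_{tl}\) are set to zero for brevity, and one value from each parameter grid is picked arbitrarily.

```python
import numpy as np

# Minimal one-period sketch of the DGP: Normal(0, 4) regressor, uniform
# instrument, binary endogenous t_it, and multinomial-logit response
# probabilities as in equations (1.25)-(1.26). Heterogeneity terms are
# omitted here for brevity -- an assumption of this sketch.
rng = np.random.default_rng(42)
N, L = 1000, 3
beta = np.array([1.0, 2.0])
alpha = np.array([1.0, 2.0])
zeta = np.array([0.5, 0.1])
gamma_z, gamma_x = 0.5, 0.0

x = rng.normal(0.0, 2.0, N)          # Normal(0, 4): standard deviation 2
z = rng.uniform(0.0, 1.0, N)
u = rng.normal(0.0, 1.0, N)
t = (gamma_z * z + gamma_x * x + u >= 0).astype(float)

# Multinomial-logit response probabilities for the first L-1 choices,
# with the last choice as the residual category
idx = x[:, None] * beta + t[:, None] * alpha + u[:, None] * zeta
expidx = np.exp(idx)
denom = 1.0 + expidx.sum(axis=1)
probs = np.column_stack([expidx / denom[:, None], 1.0 / denom[:, None]])

# Nam (2014): draw 100 multinomial outcomes per unit and use proportions
y = np.vstack([rng.multinomial(100, p) / 100.0 for p in probs])
```

By construction each row of `y` is a vector of fractions summing to one, which is exactly the multiple fractional response structure the chapter studies.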
In this case, the simulations will serve as a check upon the consistency of the estimates.

1.6.2 Simulation Results

Each replication examines the data at the 25th, 50th, 75th, and 90th percentiles of the data, whereby the first time period is used to examine the APEs at the 25th percentile, the second time period is used for the 50th percentile, and so on. The tables include the results from the 25th, 50th, and 75th percentiles. Although there are no time-varying intercepts in these simulations, presumably a researcher would want to know the APEs at each time period in order to account for time-varying intercepts, and they might want to see how the percentile APEs change given these intercepts. The true APEs are constructed using, within each replication, the data at the aforementioned percentiles and the true values of the parameters. Then, across these replications, the mean value is obtained. The standard errors of the estimates are obtained by calculating the square root of the sample variance across the 500 replications. All simulation tables can be found in the appendix. Table B.1 displays the result that at the lower percentiles and when \(v_{it} \sim \mathrm{Normal}(0, 1)\), the point estimates sometimes underestimate the true APEs; however, at the 50th and 75th percentiles, the point estimates are accurate and precise. While the standard errors are somewhat large on the partial effect corresponding to the binary EEV, they are not so large as to be greater than the estimates. Furthermore, all of the APEs have the correct sign. Differences in the estimates are minor when adjusting the values of \(\gamma_z\), \(\zeta_1\), and \(\zeta_2\). Now, consider when \(v_{it} \sim \mathrm{Logistic}(0, 1)\). In this situation, equation (1.9) does not hold, yet we have a random variable with a probability density function that has heavier tails than the standard normal pdf, and logistic random variables have been used to approximate normal random variables. The results from this simulation are in Table B.2.
The results are similar to the normal case, though the Monte Carlo standard deviations are noticeably larger. Once again, differences in the estimates are minor as the parameter values noted in the tables change. When \(v_{it} \sim \chi^2_1\), the model misspecification is substantial: the distribution of \(v_{it}\) cannot be used to approximate the normal distribution, nor is the distribution symmetric. The results of this simulation are in Table B.3. In some sense, this is an improvement over the logistic case, as the standard errors are often smaller on the APEs. It is important to note that the estimates do not give the incorrect sign at any of the percentiles. This is in contrast to Nam (2014), in which she points out that when a linear control function approach is taken, the estimates of the APEs can take the wrong sign. The estimates of the APEs taken with respect to the joint expectation of \(c_i\) and \(e_{it}\) are also provided. In this case, \(v_{it} \sim \mathrm{Normal}(0, 1)\). The results are provided in Table B.4. The estimated APE of \(t_1\) at the 25th percentile gives a poor estimate of the true APE. It is suspicious that the estimate equals its standard error. There is no reason to think that the estimator would perform poorly at the 25th percentile, especially in light of the full results presented above. It is worth noting that the draws of \(u_{it}\) used to construct these estimates were not made from the correct distribution of \(u_{it} \mid x_{it}, z_{it}, t_{it}\). This was done for two reasons. First, the researcher would not know the distribution of \(u_{itr} \mid h_i\). Second, this would add another layer of complexity to the simulation, and it would only change the values of the percentile means and estimates of the average partial effects. The results would still demonstrate that the estimator is consistent.

1.7 Application

In order to demonstrate the practical usefulness of the estimator, I used data from the NLSY97 project. The project, undertaken by the U.S.
Bureau of Labor Statistics, gathered information on individuals born between 1980 and 1984. The survey data available are based upon eighteen rounds of questioning, from 1997 to 2018. Respondents were asked questions in the areas of employment, education, geography and household information, parental and childhood information, dating and marriage, health, income and assets, attitudes, expectations, activities, crime, and substance use. I constructed the dataset using a balanced panel of respondents from the years 2007, 2009, 2010, and 2011. These were years for which there was data on how many hours each respondent slept on an average night. I multiplied this by the number of days in the year and divided by the total number of hours in that year in order to obtain the fraction of time devoted to sleep in that year. Respondents were also asked each week of each year how many hours they had worked that week. I then multiplied by the number of weeks in the year and divided by the total number of hours in the year in order to obtain the fraction of time in that year devoted to work. The sum of these two fractions is then subtracted from 1, which gives the fraction of time in that year devoted to leisure. The binary endogenous variable is the marital status of the respondent. In the survey, marital status has six categories: neither married nor cohabitating, not married but cohabitating, married and cohabitating, legally separated, divorced, and widowed. This is collapsed into a binary variable indicating whether the respondent is legally married; the respondent is counted as married if they are married and cohabitating or legally separated. The marital status of each respondent is updated each month throughout the survey. Since the unit of time for the panel is a year, marital status is recorded as the status at the end of the year.
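The time-share construction described above amounts to a simple unit conversion. The sketch below uses hypothetical hour values, not NLSY97 data, and assumes a non-leap 365-day year, so the figures are illustrative only.

```python
# Back-of-the-envelope sketch of the time-share construction: nightly
# sleep hours and weekly work hours are converted into fractions of the
# total hours in a (non-leap) year, and leisure is the residual share.
# The input values below are hypothetical, not taken from the data.
HOURS_PER_YEAR = 365 * 24            # 8760 hours in a non-leap year

sleep_hours_per_night = 7.5          # hypothetical survey response
work_hours_per_week = 40.0           # hypothetical survey response

sleep_share = sleep_hours_per_night * 365 / HOURS_PER_YEAR
work_share = work_hours_per_week * 52 / HOURS_PER_YEAR
leisure_share = 1.0 - sleep_share - work_share   # residual category
```

The three shares sum to one by construction, matching the constraint on the multiple fractional responses used throughout the chapter.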
The instrument that I chose for marital status gives insight into working with fractional time variables in the context of time-use data. I created a variable for the number of times that a respondent engaged in sexual intercourse within each year, using data across three survey questions for each year. In one question, respondents are asked if they engaged in sexual intercourse. Respondents are then asked how often they engaged in intercourse. If they respond that they do not know, respondents are then asked to give an estimate of the number of times that they engaged in intercourse. I combined the answers to these survey questions into a single variable equal to the number of times that a respondent engaged in sexual intercourse. The results from the first-stage probit are given in Table C.1. Additional covariates include the level of education of each respondent, whether they live in an urban area, the number of biological children at their residence, and their household income in thousands of US dollars. Health controls are added, such as the number of times that a respondent was treated by a doctor or a nurse during the year, the number of times that a respondent was sick but did not seek treatment from a doctor or a nurse, whether their health affected the amount of work that they engaged in, and whether they have a chronic condition. The z-score on the instrument is approximately 6.00 for the entire sample, and approximately 4 for the male and female subsamples. Table C.2 gives the estimates from assuming a linear probability model for the binary marriage variable. This model was not used to obtain the estimates underlying the APEs, but instead to display the strength of the instrument. The conventional wisdom, based upon Staiger and Stock (1997), is that a sufficiently strong instrument yields a first-stage F-statistic of at least 10.
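The first-stage strength check just described can be sketched as follows, using simulated data rather than the NLSY97 sample: a linear probability model of the binary endogenous variable on an exogenous control and the instrument, with the F-statistic for excluding the single instrument compared against the rule-of-thumb threshold of 10. All variable names and values here are illustrative assumptions.

```python
import numpy as np

# Sketch of a first-stage strength check with one excluded instrument,
# on simulated data. F is the restricted-vs-unrestricted SSR comparison.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)               # exogenous control (hypothetical)
z = rng.normal(size=n)               # instrument (hypothetical)
t = ((0.5 * z + 0.2 * x + rng.normal(size=n)) >= 0).astype(float)

def ssr(y, X):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
X_unrestricted = np.column_stack([ones, x, z])
X_restricted = np.column_stack([ones, x])

ssr_u = ssr(t, X_unrestricted)
ssr_r = ssr(t, X_restricted)
q = 1                                 # one excluded instrument
df = n - X_unrestricted.shape[1]
F = ((ssr_r - ssr_u) / q) / (ssr_u / df)
strong = F >= 10                      # Staiger-Stock rule of thumb
```

With a single excluded instrument this F-statistic is just the square of the first-stage t-statistic, which is why the z-score and F-statistic reported in Tables C.1 and C.2 carry the same information about instrument strength.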
In the full sample the first-stage F-statistic for marriage is approximately 36, while in each subsample the F-statistic is approximately 16. The reasons for this are partly due to the issue of finding a new partner. Levinger and Moles (1979) note that cohabitating couples consider the frequency of sex as well as the ease of finding other partners before dissolving the relationship. Before marriage, the costs of finding a new partner and dissolving the relationship are low. As Oppenheimer (1988) notes, there are considerable search costs to finding a new partner in order to replace lost sexual activity from a relationship. In the absence of children and the legal barriers that must be overcome to end a marriage, these search costs should be lower for unmarried couples. Couples then enter into marriage in order to secure the benefits of the current relationship and facilitate continued emotional investment into their relationships (see Yabiku and Gager (2009)). For married couples, in addition to the higher costs of ending a marriage, the issue of infertility can play a role in ending the relationship. Sexual intercourse can act as the channel through which infertility acts. As Andrews, Abbey, and Halman (1991) note, infertility can lead to lower sexual self-esteem, which in turn leads to a lower frequency of sexual intercourse among married couples. Furthermore, just as for unmarried couples, the frequency of sexual intercourse plays a role in determining whether to end the relationship. In fact, a past national survey revealed that it was the second greatest issue of concern for young married couples (see Yabiku and Gager (2009)). The count of the number of times that each respondent had sex in each year also satisfies the excludability condition, for the following reasons. First, the information given by the variable itself does not communicate the number of hours or fraction of time in a year that a respondent spent engaging in intercourse.
Second, even if this information were known, there is nothing to suggest that the time used on intercourse would not still be counted amongst the broadly defined leisure fraction. In other words, if respondents engaged in less intercourse, there is no reason to think that they would not devote that time instead to other activities that are neither work nor sleep, particularly for the population and time period under study, given the controls. Liu (2000) provides some evidence for this among married couples, in the sense that declines in sexual intercourse amongst this group are due to substitutions away from intercourse towards other goods, services, and activities. Contrast this variable with the marriage variable or the number of children at the respondent's residence. These variables change how much of their time an individual allocates to sleep, leisure, and work, as opposed to merely the activities within those categories. There seems to be a specific relationship between the frequency of sexual intercourse and the raising of young children. Based upon a panel of German households, Schröder and Schmiedeberg (2015) note that sexual frequency declines until a child reaches approximately six years old, and then tends to increase. The inclusion of the number of biological children in the respondent's household is meant to control for some relationship between sexual intercourse and the raising of children. The health variables are included in order to control for possible correlations between sex and health outcomes, which in turn could have an effect on sleeping patterns or hours worked in a year. After all, the frequency of sexual intercourse is correlated with the health of individuals (see, for example, Walfisch, Maoz, and Antonovsky (1984)).
It is important to control for these specific effects, since they are noted in the health and marriage literature to be correlated with sex and would seem to have an effect on how individuals allocate their time. The validity of the instrument might be affected if too many categories of activities are included. As more activities are separated out from the leisure share into additional fractional response variables, it becomes more likely that the instrument will cause a shift in the additional shares. Identification relies in part on a coarse outside option, so care needs to be taken with respect to determining which APEs should be considered. Tables C.3 and C.4 display the APE estimates of marriage over the distribution of the unobserved heterogeneity across respondents. The APE at the 25th percentile evaluates the APE at the 25th percentile of the data in the first time period, the APE at the 50th percentile evaluates the APE at the 50th percentile, and so on, as was done in the simulations. The standard errors are generated using 50 bootstrap replications with a sample size of 100. I chose the value of R to be 300. The intercepts are not time-varying, though they do vary with each fractional response. Within each table I have included estimates using only the male respondents and only the female respondents, in order to determine the effect of marriage upon the use of time by men and women separately. Across the entire sample and the subsamples, the APEs of marriage are significant at α = .05, though the effect upon sleep is not significant at every percentile level. These effects differ depending upon the subsample. Using the entire sample, it would seem that marriage is not associated with a significant effect upon the fraction of time devoted to sleep, but it is associated with a significant negative effect, at each percentile across all time periods, on the fraction of time devoted to work.
For the subsample of men, the APEs are significant and negative for sleep, but positive and significant for work. For women, marriage suggests declines in the fraction of time devoted to both work and sleep, which leads to an increase in what I broadly consider to be leisure, though activities which might have more time devoted to their completion may not be considered "leisurely." The estimated average partial effects of Theorems 1 and 2 are not included in this chapter, though they are included in the code accompanying this chapter. Table C.3 displays the results from applying the estimator of the parameters that I have introduced and the technique of obtaining the APEs that is necessary when a binary covariate is endogenous. Table C.4 displays the results from applying the Mundlak-Chamberlain device, but without integrating out the additional source of endogeneity. Both the magnitudes and signs of the average partial effects differ across the tables. The estimated effect of marriage on the fraction of time devoted to work and the fraction of time devoted to sleep both seem negligible. This statement also applies to the estimates that are based upon the subsample of men. The effects are larger for women, though it would seem that marriage does not lead to declines in the fraction of time devoted to sleep. This suggests that there is a significant source of endogeneity that arises from the individual- and time-specific errors.

1.8 Conclusion

For the number of observations N and the number of draws R, I have provided a \(\sqrt{N}\)-consistent estimator of the APEs for the multiple fractional response setting while allowing for panel data, unobserved heterogeneity, and a binary EEV. An advantage of this approach is that the constraint upon the conditional mean, \(\sum_{l=1}^{L} E(y_{itl} \mid h_i, x_{it}, z_{it}, t_{it}, u_{it}) = 1\), is satisfied, which will not always hold for every choice of the conditional mean function.
For example, a probit conditional mean specification is not guaranteed to satisfy the constraint that \(\sum_{l=1}^{L} E(y_{itl} \mid h_i, x_{it}, z_{it}, t_{it}, u_{it}) = 1\), and assuming a linear model is not any better. At best, a linear model could allow for a system regression to provide linear approximations of the average partial effects, and the estimator would be expensive if differences are taken to eliminate the unobserved heterogeneity. At worst, the APEs could be such poor estimates that they not only fail to reflect the relationship between the dependent variable and the relevant covariate, but also fail to appropriately consider the relationship amongst the dependent variables. If the multinomial logit conditional mean specification (or any specification that satisfies the aforementioned constraint) is chosen, then there is still some choice of estimators. Additional moment conditions could be added to the GMM estimator in order to increase efficiency. A working correlation matrix could be used to obtain a GEE estimator. Both estimators would bring efficiency gains over the estimator I have proposed, but they would be more computationally burdensome. Even in a setting where L = 3 and T = 2, such estimators would increase efficiency but may significantly increase computation time. A separate issue is whether to rely upon this method or some QMLE estimator derived from a multinomial likelihood problem without integrating out the unobserved heterogeneity. A failure to integrate out a source of endogeneity that is persistent across subjects and time periods can lead to insignificant average partial effects. In order to determine whether the full method and estimator in this chapter are necessary, the variable addition test for endogeneity should be applied.

CHAPTER 2
DOUBLY-ROBUST QUANTILE TREATMENT EFFECT ESTIMATION

2.1 Introduction

There has been a recent reevaluation of the effectiveness of difference-in-difference estimators.
Estimates of the average treatment effect on the treated (ATT) can be sensitive to nuisance functions that are based upon the probability of treatment given covariates, known as the propensity score, or upon the conditional mean of the difference in untreated outcomes before and after treatment for the untreated subpopulation. Identification of the ATT relies upon the parallel trends assumption and the overlap assumption. A relaxation of these assumptions involves conditioning on covariates that may affect either the expected treatment status or the change in untreated outcomes. These separate approaches can be combined to obtain a doubly-robust estimator of the ATT. If we are willing to strengthen these assumptions, then we can go further than estimation of the ATT: the QTT can be estimated at any desired quantile. How useful this is depends upon the topic under study. If a researcher is concerned with the median outcome of a variable post-intervention, or with outcomes at the tails of a variable's distribution, then an estimator of the QTT is desired. Observing which quantiles experience a significant treatment effect could then alter policy responses. The assumptions necessary for identification go beyond the parallel trends and overlap assumptions. The parallel trends assumption is strengthened from an assumption of conditional mean independence to an assumption of independence conditional on covariates, and more assumptions are required on the outcome variables across time. In this chapter, I demonstrate that the strength of the assumptions from Callaway and Li (2019) purchases more than the authors may have realized. Using their assumptions, it is possible to generate a doubly-robust estimator of the QTT. This estimator then allows for a relaxation of the assumptions placed upon the nuisance functions. In this case, the nuisance functions are a propensity score and a conditional cdf of the difference in untreated outcomes from before and after treatment.
The double-robustness property allows for a reduced rate of convergence of both nuisance functions. If the nuisance functions are estimated nonparametrically, then depending upon the type of nonparametric estimator that is used, the double-robustness result has another beneficial property: subject to minimal smoothness conditions on the nuisance functions, the rate of convergence itself may not matter. First, in this chapter I provide the identification result that guarantees double-robustness. This result is an extension of the key identification result found in Callaway and Li (2019). The propensity score and the relevant conditional cdf can be misspecified, but not simultaneously. The properties of this estimator are studied, broken up into subcases according to whether the nuisance parameters are estimated nonparametrically or otherwise. As a prerequisite to studying the properties of the QTT estimator, I derive the efficient influence function of the doubly-robust portion of my estimator. The portion that I am referring to is the cdf of the difference in untreated outcomes from the pre- and post-treatment periods for the treated subpopulation, evaluated at an arbitrary real number. It is upon the estimation of this parameter that the double-robustness result is applied. The efficient influence function is considered in the panel data setting. With the efficient influence function in hand, I then demonstrate that the doubly-robust estimator achieves the semiparametric efficiency bound in the panel data setting. This is shown through two separate cases: in the first case the nuisance functions are estimated parametrically; in the second case they are estimated nonparametrically, with the propensity score estimated using the sieve logit estimator and the conditional cdf estimated using a kernel estimator.
Both estimators could be chosen to be sieve or kernel estimators, and this would only change the restrictions on the smoothness of the nuisance functions. Then, I demonstrate through simulations that without the double-robustness correction to the Callaway and Li (2019) estimator, the root mean square error is very large, owing to the large standard error of the estimator. My doubly-robust estimator has root mean square errors that are less than one-third of those of the Callaway and Li (2019) estimator, even if the nuisance functions are misspecified. I also show that when the empirical bootstrap is used for inference, the double-robustness property which allows for the weakening of the assumption that the nuisance functions converge to the truth at the rate of \(o_p(n^{-1/4})\) no longer holds. For this reason, I maintain that the nuisance functions should converge at the rate of \(o_p(n^{-1/4})\). I investigate when this assumption upon the rate of convergence holds for the relevant nuisance functions. Finally, I apply my estimator to county-level unemployment data in order to compare it to the Callaway and Li (2019) estimator. I use the same dataset that is applied in Callaway and Li (2019), and I compare point estimates and confidence bounds between my estimator and theirs.

2.1.1 Literature Review

This chapter draws directly from Callaway and Li (2019) in order to establish the identification result and the form of the QTT estimator, but the idea of a doubly-robust estimator in this context was inspired by other papers in the treatment effects and missing data literature. Sant'Anna and Zhao (2020) examine the properties of the doubly-robust ATT estimator, and Callaway and Sant'Anna (2021) extend these results to the staggered treatment setting while also examining the asymptotic properties of various weighted ATT estimators.
The doubly-robust ATT estimator has existed in the econometrics literature as an example of a doubly-robust estimator in Rothe and Firpo (2013), along with most of its properties in the single treatment period, panel data setting. The doubly-robust estimator combines the regression approach of Heckman, Ichimura, and Todd (1998) and Heckman et al. (1998) with the propensity score matching approach of Abadie (2005), which itself is based upon Horvitz and Thompson (1952). My estimator is similar in that it combines two approaches that are analogous to separate regression and propensity score approaches. All of these approaches take place within the difference-in-difference framework popularized by Card (1990) and Card and Krueger (1994). The literature on quantile treatment effects, when considering either selection on observables or a panel data setting, prominently includes Firpo (2007), Athey and Imbens (2006), Bonhomme and Sauder (2011), and Chernozhukov et al. (2013). These approaches do not consider double-robustness, and their identification assumptions are stronger than the identification assumptions of this chapter. For example, Firpo (2007) sets up an M-estimation problem with weights based upon propensity score matching, but identification depends upon a strong ignorability assumption. The nonparametric logit sieve estimator of Hirano, Imbens, and Ridder (2003) is applied, but the conditions needed for asymptotic normality are considerably restrictive. A result similar to Firpo (2007) in the missing data setting is found in Wooldridge (2007). The double-robustness properties of my estimator are based upon more than the aforementioned doubly-robust estimators of the ATT. A general weighting result for treatment effects is presented in Słoczyński and Wooldridge (2018), and the basis for constructing doubly-robust moment conditions is outlined in Chernozhukov et al. (2016).
The latter work is closely related to the doubly-robust estimator in the missing data setting of Muris (2020). The construction of my doubly-robust estimator is partly based upon the estimators in Sued, Valdora, and Yohai (2020), though they consider a missing data setting and do not discuss the issue of inference. Rothe and Firpo (2019) and Rothe and Firpo (2013) consider the asymptotic properties of doubly-robust estimators when the nuisance functions are estimated using kernel density methods. The work of Fan et al. (2016) is particularly important for this chapter, since it considers doubly-robust estimators with nuisance parameters that are estimated via sieves. I also make one final point about a doubly-robust QTT estimator that exists in the literature. Caracciolo and Furno (2017) propose an estimator of the quantile treatment effect (QTE) that involves taking a quantile of a random variable that is a function of the propensity score and fitted values; however, this estimator only identifies the quantile of interest in a very restrictive case, using unstated assumptions. Their approach partly builds off of Machado and Mata (2005), but that approach involves obtaining unconditional quantiles directly through a random sample over conditional quantiles.

2.1.2 Structure of the Chapter

The chapter is structured as follows. Section 2.2 presents the framework, assumptions, and identification result. Section 2.3 presents the estimator and considers estimation of the nuisance parameters. Section 2.4 considers the large-sample properties of the estimator. Section 2.5 examines the validity of the empirical bootstrap when applying this estimator. Section 2.6 contains a Monte Carlo study that examines the small-sample properties of my estimator at various quantiles, under misspecification of the nuisance functions and in comparison to another estimator.
Section 2.7 includes the application of my estimator to estimating quantile treatment effects on the treated using unemployment data. Section 2.8 concludes the chapter. The mathematical proofs are contained in the appendix, as well as figures and tables.

2.2 Assumptions and Setting

2.2.1 Setting

The setting that I am considering is the panel data setting. As in Callaway and Li (2019), I assume that the data consist of at least three periods, with treatment period $t$ and pre-treatment periods $t-1$ and $t-2$. No unit receives the binary treatment before time $t$. $D = 1$ for unit $i$ if treated at time $t$; $D = 0$ if a unit is never treated. The outcomes $Y_t$, $Y_{t-1}$, and $Y_{t-2}$ are observed, along with covariates $X$. Each unit $i$ has the potential outcomes $Y_{0t}$ and $Y_{1t}$, but these outcomes are random variables whose realizations cannot be observed simultaneously for each unit $i$. The observed outcome $Y_t$ is then expressed as
$$Y_t = D Y_{1t} + (1 - D) Y_{0t}.$$
Let $q_\tau$ denote the $\tau$-quantile of a random variable $Z$, where
$$q_\tau = F_Z^{-1}(\tau) := \inf\{z : F_Z(z) \ge \tau\}$$
and $F_Z(z)$ is the cumulative distribution function (cdf) of $Z$. $F_{Y_{1t}|D=1}$ and $F_{Y_{0t}|D=1}$ denote the cdfs of the treated and untreated outcomes for the treated subpopulation, respectively. The QTT is then defined as
$$QTT(\tau) = F^{-1}_{Y_{1t}|D=1}(\tau) - F^{-1}_{Y_{0t}|D=1}(\tau).$$
Interest in the QTT stems from the ability to identify the effect of an intervention on a treated group, compared to the counterfactual outcome. For example, suppose that a portion of a population receives a Covid-19 vaccine, with the outcome variable being gross income. While we may wish to know the treatment effect at the median ($\tau = 0.5$) for the entire population, the identification assumptions may be too strong to identify such a parameter.¹ With weaker assumptions, we can identify $QTT(0.5)$; that is to say, we can identify the median effect of Covid-19 vaccination on gross income for the treated subpopulation.
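To fix ideas, the generalized inverse and the QTT definition above can be computed directly from two samples. The following is an illustrative Python sketch, not part of the chapter's formal development; the function names are my own, and the "counterfactual" sample stands in for draws from $F_{Y_{0t}|D=1}$, which in practice must be identified as described below.

```python
import numpy as np

def ecdf(sample):
    """Empirical cdf F(z) = n^{-1} * sum 1{Z_i <= z}, returned as a callable."""
    sorted_s = np.sort(sample)
    return lambda z: np.searchsorted(sorted_s, z, side="right") / len(sorted_s)

def quantile_inf(sample, tau):
    """Generalized inverse q_tau = inf{z : F(z) >= tau} over the sample."""
    sorted_s = np.sort(sample)
    n = len(sorted_s)
    k = int(np.ceil(tau * n)) - 1   # smallest index with (k+1)/n >= tau
    return sorted_s[max(k, 0)]

def qtt(treated_sample, counterfactual_sample, tau):
    """QTT(tau) = F^{-1}_{Y1t|D=1}(tau) - F^{-1}_{Y0t|D=1}(tau)."""
    return quantile_inf(treated_sample, tau) - quantile_inf(counterfactual_sample, tau)
```

The generalized inverse (rather than a midpoint or interpolated quantile) matches the $\inf$ definition used throughout the chapter.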
2.2.2 Identification Assumptions

These assumptions are taken directly from Callaway and Li (2019). When presenting these assumptions, I will note how they are comparatively mild when compared to other assumptions in the econometrics literature. Now, let $\Delta Y_t = Y_t - Y_{t-1}$. Then, consider the following assumptions.

Assumption ID.1. The observed data $\{Y_{it}, Y_{it-1}, Y_{it-2}, X_i, D_i\}_{i=1}^n$ are independent and identically distributed draws from the joint distribution $F_{Y_t, Y_{t-1}, Y_{t-2}, X, D}$. In addition, $Y_{it} = D_i Y_{1it} + (1 - D_i) Y_{0it}$, $Y_{it-1} = Y_{0it-1}$, and $Y_{it-2} = Y_{0it-2}$.

Assumption ID.2. Each of the random variables $\Delta Y_t$, $\Delta Y_{t-1}$, $Y_{t-1}$, and $Y_{t-2}$ for the treated group is continuously distributed on its support, with a density that is uniformly bounded from above and bounded away from 0.

Assumption ID.2 ensures the uniqueness of the copula by restricting the outcomes to be continuous. Assumption ID.1 restricts the setting to the panel data setting with a single treatment period. If the copula is not unique, then even with the Copula Stability Assumption we may not be able to identify the cdf of $Y_{0t}|D=1$. The next assumption is the final assumption that I use for identification. Let the support of $X$ be denoted by $\mathcal{X}$.

Assumption ID.3. $p := P(D=1) > 0$ and, for all $x \in \mathcal{X}$, $p(x) := P(D=1|X=x) < 1$.

The first part of the assumption ensures that there is some positive probability of treatment. The second part is the "overlap" assumption that is common in the difference-in-differences literature. This guarantees that for any value of $X$ in $\mathcal{X}$ there is a positive probability of that value appearing in both the control and treatment groups. Without this assumption, the QTT cannot be identified, since $F_{\Delta Y_{0t}|D=1}(y)$ could not be identified for a population that contains those values of $X$ for which the assumption is violated.

¹ In the case of the average treatment effect vs. the average treatment effect on the treated, see Wooldridge (2010).
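In applied work, the overlap condition is often probed by inspecting fitted propensity scores. The following is a minimal illustrative diagnostic in Python; it is my own hypothetical helper, not part of the chapter's formal toolkit, and the trimming threshold `eps` is an arbitrary choice.

```python
import numpy as np

def overlap_diagnostic(pscores, eps=0.01):
    """Summarize fitted propensity scores: values very close to 0 or 1
    signal near-violations of overlap, which make the inverse-probability
    weights used below unstable."""
    pscores = np.asarray(pscores, dtype=float)
    return {
        "min": float(pscores.min()),
        "max": float(pscores.max()),
        "n_flagged": int(np.sum((pscores < eps) | (pscores > 1 - eps))),
    }
```

A nonzero `n_flagged` count suggests trimming or a richer propensity specification before weighting.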
Note that the overlap assumption as stated here is not enough when the estimation of the propensity score is nonparametric. In that case, the propensity score requires sharp upper and lower bounds away from 0 and 1. The next assumption is directly exploited to obtain double-robustness of the estimator.

Assumption ID.4. $\Delta Y_{0t} \perp\!\!\!\perp D \mid X$.

This assumption takes the parallel trends assumption of the difference-in-differences literature,
$$E[\Delta Y_{0t} \mid D=1, X] = E[\Delta Y_{0t} \mid D=0, X],$$
and strengthens it over the entire distribution. The assumption states that the distribution of the change in untreated outcomes is unaffected by the treatment, conditional on the covariates. This assumption is necessary to obtain the estimator of $F_{\Delta Y_{0t}|D=1}(y)$. This assumption, as I will show, also makes the doubly-robust estimator of $F_{\Delta Y_{0t}|D=1}(y)$ the most efficient estimator in the panel data setting. Furthermore, as strong as this assumption is, it is weaker than the strong ignorability assumption of Rosenbaum and Rubin (1983), where $(Y_{0t}, Y_{1t}) \perp\!\!\!\perp D \mid X$, which is applied in Firpo (2007).

The last assumption is known as the "Copula Stability Assumption."

Assumption ID.5 (Copula Stability Assumption). $C_{\Delta Y_{0t},\, Y_{0t-1} \mid D=1, X}(\cdot, \cdot) = C_{\Delta Y_{0t-1},\, Y_{0t-2} \mid D=1, X}(\cdot, \cdot)$.

This assumption is the most controversial assumption used for identification. As explained in Callaway and Li (2019), this assumption requires both the panel data setting and data over three time periods. This assumption does not place any restrictions on the marginal distributions of the variables involved. Instead, by placing restrictions on the copula, we restrict the joint distribution of the variables in question based upon their joint distribution in prior periods. We cannot observe the untreated outcome for the treated subpopulation, but we can jointly observe the outcomes in the periods before treatment for the treated subpopulation.
By an application of Sklar's Theorem, and by writing $Y_{0t} \mid D=1$ as $\Delta Y_{0t} + Y_{0t-1} \mid D=1$, the cdf of $Y_{0t}|D=1$ can be identified using the joint distribution of $Y_{0t-1}|D=1$ and $Y_{0t-2}|D=1$. This assumption is similar to, and perhaps even weaker than, the assumption of stationarity in the time-series setting. No claim is being made that a sequence of random variables has a constant joint distribution over some shift in time. All that is being claimed is that a feature of the joint distribution, the joint dependence between the random variables, is fixed over a limited time period. A form of this assumption has been applied in the measurement error literature. In Cameron et al. (2004), the copula is used to model the difference in count variables, where each count variable represents a different measurement of the same outcome.

2.2.3 Identification

With the identification assumptions in hand, I now present the identification result. In the following notation, $\pi(X)$ denotes the choice of a propensity score, while $p(X)$ denotes the true propensity score. For example, $\pi(X)$ might be chosen to be the standard normal cumulative distribution function when $p(X)$ has a logit functional form. Similarly, $P(\Delta Y_{0t} \le y \mid X)$ is the true conditional cdf of $\Delta Y_{0t}$, while $\tilde{P}(\Delta Y_{0t} \le y \mid X)$ denotes a choice of a conditional cdf. For example, $\tilde{P}(\Delta Y_{0t} \le y \mid X)$ might be chosen to be a conditional Logistic(0,1) cdf when $P(\Delta Y_{0t} \le y \mid X)$ is the standard normal cdf.

Theorem 3. Under Assumptions ID.1-ID.5, and assuming that $\pi(X) = p(X)$ or $\tilde{P}(\Delta Y_{0t} \le y \mid X) = P(\Delta Y_{0t} \le y \mid X)$,
$$F_{Y_{0t}|D=1}(y) = E\left[\mathbf{1}\left\{F^{-1}_{\Delta Y_{0t}|D=1}\big(F_{\Delta Y_{t-1}|D=1}(\Delta Y_{t-1})\big) \le y - F^{-1}_{Y_{t-1}|D=1}\big(F_{Y_{t-2}|D=1}(Y_{t-2})\big)\right\} \,\Big|\, D=1\right]$$
where
$$F_{\Delta Y_{0t}|D=1}(y) = E\left[\frac{1-D}{p}\frac{\pi(X)}{1-\pi(X)}\mathbf{1}\{\Delta Y_t \le y\} - \left(\frac{1-D}{p}\frac{\pi(X)}{1-\pi(X)} - \frac{D}{p}\right)\tilde{P}(\Delta Y_{0t} \le y \mid X)\right] \quad (2.1)$$
if $\pi(X) = p(X)$ almost surely, or $\tilde{P}(\Delta Y_{0t} \le y \mid X) = P(\Delta Y_{0t} \le y \mid X)$ almost surely. Then,
$$QTT(\tau) = F^{-1}_{Y_{1t}|D=1}(\tau) - F^{-1}_{Y_{0t}|D=1}(\tau).$$
The proof of the first part of this result is provided in Callaway and Li (2019).
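To make the doubly-robust moment in (2.1) concrete, its sample analogue can be written in a few lines. The sketch below is illustrative Python, not the chapter's exact estimator: the function name is mine, and the unknown $p$ is replaced by the treated share, as is standard.

```python
import numpy as np

def dr_cdf_untreated_change(d, dy, pscore, cond_cdf_at_y, y):
    """Sample analogue of the doubly-robust moment (2.1) for
    F_{dY0t|D=1}(y): an inverse-propensity term over the untreated plus a
    regression-adjustment term, so either nuisance may be misspecified.
    d: treatment indicators; dy: observed first differences;
    pscore: fitted pi(x_i); cond_cdf_at_y: fitted P(dY0t <= y | x_i)."""
    d = np.asarray(d, dtype=float)
    p_hat = d.mean()                                       # estimate of P(D=1)
    w0 = (1 - d) * pscore / (p_hat * (1 - pscore))         # untreated weight
    w1 = d / p_hat                                         # treated weight
    return np.mean(w0 * ((dy <= y) - cond_cdf_at_y) + w1 * cond_cdf_at_y)
```

Evaluating this function on a grid of $y$ values traces out the estimated cdf, which is then inverted as in Theorem 3.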
For the sake of making this chapter as self-contained as possible, I will outline their argument. First, note that
$$F_{Y_{0t}|D=1}(y) = E\left[\mathbf{1}\{Y_{0t} \le y\} \mid D=1\right] = E\left[\mathbf{1}\{\Delta Y_{0t} + Y_{0t-1} \le y\} \mid D=1\right].$$
This expectation is over the joint distribution of the random variables $\Delta Y_{0t}|D=1$ and $Y_{0t-1}|D=1$, and this joint distribution can be written in terms of the copula and the marginal distributions of $\Delta Y_{0t-1}$ and $Y_{0t-2}$ by Assumption ID.1, the Copula Stability Assumption, and Sklar's Theorem. The result then follows from a change of variables.

The second part of the theorem is the basis for the double-robustness property of the estimator. Either the propensity score or the conditional cdf of $\Delta Y_{0t}$ needs to be correctly specified so that $F_{\Delta Y_{0t}|D=1}(\cdot)$ will be correctly identified; $F_{\Delta Y_{0t}|D=1}(\cdot)$ is in turn needed to identify $F_{Y_{0t}|D=1}(y)$. The intuition behind the double-robustness result is that if the propensity score is correctly specified, then the information provided by the conditional cdf of $\Delta Y_{0t}$ becomes redundant, at least for identification. If $\tilde{P}(\Delta Y_{0t} \le y|X)$ is correctly specified, then the weight that is applied to this conditional cdf filters out the incorrect information that is left over from misspecification of the propensity score.²

2.3 Estimation

This section presents the estimators that can be used to obtain the QTT under the identification assumptions. I present two different estimators whose difference is asymptotically negligible; they differ in how they calculate the weights for the estimation of $F_{\Delta Y_{0t}|D=1}(y)$. The first estimator is
$$\widehat{QTT}(\tau) = \hat{F}^{-1}_{Y_{1t}|D=1}(\tau) - \hat{F}^{-1}_{Y_{0t}|D=1}(\tau)$$
where
$$\hat{F}^{-1}_{Y_{1t}|D=1}(\tau) = \inf\{y : \hat{F}_{Y_{1t}|D=1}(y) \ge \tau\}, \qquad \hat{F}^{-1}_{Y_{0t}|D=1}(\tau) = \inf\{y : \hat{F}_{Y_{0t}|D=1}(y) \ge \tau\}.$$

² Unless otherwise noted, from this point onwards the nuisance functions will be assumed to be correctly specified, so $\pi(x) = p(x)$ and $\tilde{P}(\Delta Y_{0t} \le y|x) = P(\Delta Y_{0t} \le y|x)$.
$$\hat{F}_{Y_{0t}|D=1}(y) = n_D^{-1} \sum_{i \in D} \mathbf{1}\left\{\hat{F}^{-1}_{\Delta Y_{0t}|D=1}\big(\hat{F}_{\Delta Y_{t-1}|D=1}(\Delta Y_{it-1})\big) \le y - \hat{F}^{-1}_{Y_{t-1}|D=1}\big(\hat{F}_{Y_{t-2}|D=1}(Y_{it-2})\big)\right\}$$
where $n_D$ denotes the number of treated observations and
$$\hat{F}_{\Delta Y_{0t}|D=1}(y) = n^{-1}\sum_{i=1}^n\left[\frac{1-D_i}{n^{-1}\sum_{k=1}^n D_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\mathbf{1}\{\Delta Y_{it}\le y\} - \left(\frac{1-D_i}{n^{-1}\sum_{k=1}^n D_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)} - \frac{D_i}{n^{-1}\sum_{k=1}^n D_k}\right)\hat{P}(\Delta Y_{0t}\le y \mid x_i)\right].$$
An alternate estimator of $F_{\Delta Y_{0t}|D=1}(y)$ is
$$\hat{F}_{\Delta Y_{0t}|D=1}(y) = n^{-1}\sum_{i=1}^n\left[\frac{1-D_i}{\hat{l}_0}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\mathbf{1}\{\Delta Y_{it}\le y\} - \left(\frac{1-D_i}{\hat{l}_0}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)} - \frac{D_i}{n^{-1}\sum_{k=1}^n D_k}\right)\hat{P}(\Delta Y_{0t}\le y \mid x_i)\right]$$
where $\hat{l}_0 = n^{-1}\sum_{i=1}^n \frac{\hat\pi(x_i)(1-D_i)}{1-\hat\pi(x_i)}$.

The interesting point here is the estimation of the nuisance functions. If the nuisance functions are estimated parametrically, then under standard assumptions given in the appendix the nuisance functions will converge rapidly enough in probability to guarantee asymptotic normality, since the parameters that index the functions will converge at a sufficiently fast rate. The issue here is that it is unlikely for the nuisance functions to be correctly specified. If the nuisance functions are estimated nonparametrically, then estimation depends upon exactly how they are estimated. For example, suppose that both nuisance functions are estimated using kernel-based methods. The advantage of this, as seen in Rothe and Firpo (2013), is that kernel estimation permits the estimator to be decomposed into a bias term, a first-order stochastic term, and a second-order stochastic term. Depending upon how the bandwidth is chosen, the estimator can converge in probability at a fast enough rate to ensure asymptotic normality, yet at a slower rate than would otherwise be necessary, due to the double-robustness property.

The double-robustness property here does not only mean that the identifying moment is doubly-robust. An implication of this is that the higher-order derivatives are also doubly-robust.
This implies that, like the identifying moment, the derivatives also equal 0 if at least one of the nuisance parameters is correctly specified. This is useful for both sieve and kernel estimation. In the case of kernel estimation, this implies that the bias term in the kernel decomposition equals 0. In the case of sieve estimation, the usefulness of this is that terms in the asymptotic expansion of the doubly-robust estimator can then be bounded by the bracketing integral with respect to the $L_2$ norm over the function space.

2.4 Asymptotic Properties

The key asymptotic properties of the estimator revolve around the asymptotic behavior of the estimator of $F_{\Delta Y_{0t}|D=1}(y)$, in addition to the asymptotic behavior of $\hat\pi(x)$ and $\hat{P}(\Delta Y_{0t} \le y \mid X)$. The limiting behavior of the QTT estimator is unchanged by the doubly-robust estimator. Before the asymptotic behavior of the estimator can be discussed, the following assumption will be introduced:

Assumption NP.1.
$$\sup_{x\in\mathcal{X}}\left|\hat\pi(x) - \pi(x)\right| = o_p(n^{-1/4})$$
$$\sup_{x\in\mathcal{X}}\left\|\hat{P}(\Delta Y_{0t}\le y \mid x) - P(\Delta Y_{0t}\le y \mid x)\right\| = o_p(n^{-1/4})$$

This is the conventional uniform convergence assumption in the literature on nonparametric rates of convergence. It is not necessary when the estimator is doubly-robust, but it is sufficient.³ Parametric assumptions that imply Assumption NP.1 can be found in Appendix A. Nonparametric estimators for each of the nuisance functions, as well as the assumptions necessary to imply Assumption NP.1, can also be found in Appendix A.

Before I establish the asymptotic properties of the estimator, the efficient influence function needs to be found. Besides establishing that the estimator of $F_{\Delta Y_{0t}|D=1}(y)$ is doubly-robust, the efficient influence function will allow us to determine whether the estimator is the most efficient estimator of $F_{\Delta Y_{0t}|D=1}(y)$. Note that this claim of efficiency is only being made under the identification assumptions. If these assumptions do not hold, then the efficiency result fails.
The efficient influence function should also be the identifying moment condition used to estimate $F_{\Delta Y_{0t}|D=1}(y)$ at a fixed value $y$. When the nuisance functions are nonparametrically estimated, asymptotic normality will depend upon a Taylor expansion of the efficient influence function. The efficient influence function is presented in the following theorem:

³ This assumption is mentioned in a footnote of Callaway and Sant'Anna (2021) when the nuisance parameters are estimated nonparametrically, even though their estimator is doubly-robust. It is not necessary.

Theorem 4. Under Assumptions ID.1-ID.4, the efficient influence function is
$$F_\tau(W) = \frac{(1-D)p(X)}{p(1-p(X))}\mathbf{1}\{\Delta Y_t \le y\} - \frac{(1-D)p(X)}{p(1-p(X))}P(\Delta Y_{0t}\le y \mid X) + \frac{D}{p}P(\Delta Y_{0t}\le y \mid X) - \frac{D}{p}F_{\Delta Y_{0t}|D=1}(y),$$
where $p = E[D]$.

Note that if we were to take the expected value of this function, then since $E[D]/p = 1$, the efficient influence function reduces in expectation to the identifying moment condition of $F_{\Delta Y_{0t}|D=1}(y)$ induced by $\hat{F}_{\Delta Y_{0t}|D=1}(y)$. With the efficient influence function in hand, we can proceed to describe the asymptotic behavior of $\hat{F}_{\Delta Y_{0t}|D=1}(y)$; before that takes place, it should be noted how exactly each of the nuisance functions is estimated. I now proceed to the first major distributional result, proving the consistency and asymptotic normality of $\hat{F}_{\Delta Y_{0t}|D=1}(y)$.

Theorem 5. Suppose that Assumptions ID.1-ID.5 and NP.1 hold. Then $\hat{F}_{\Delta Y_{0t}|D=1}(y) \xrightarrow{p} F_{\Delta Y_{0t}|D=1}(y)$, and
$$\sqrt{n}\left(\hat{F}_{\Delta Y_{0t}|D=1}(y) - F_{\Delta Y_{0t}|D=1}(y)\right) \xrightarrow{d} N\!\left(0,\; E\!\left[\psi(D, p(X), P(\Delta Y_{0t}\le y|X); p)^2\right]\right),$$
where
$$\psi(D, p(X), P(\Delta Y_{0t}\le y|X); p) = w_0(D_i, X_i; \theta^*)\,\mathbf{1}\{\Delta Y_{it}\le y\} - \big(w_0(D_i, X_i;\theta^*) - w_1(D_i)\big)\,P(\Delta Y_{0it}\le y \mid X_i;\beta^*) - w_1(D_i)\,F_{\Delta Y_{0t}|D=1}(y)$$
and
$$w_0(D, X;\theta^*) = \frac{1-D}{p}\frac{\pi(X;\theta^*)}{1-\pi(X;\theta^*)}, \qquad w_1(D) = \frac{D}{p}.$$

Note that the result shows that the estimator attains the semiparametric efficiency lower bound, since the asymptotic variance equals the second moment of the efficient influence function of $F_{\Delta Y_{0t}|D=1}(\cdot)$.
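A plug-in estimate of the asymptotic variance in Theorem 5 replaces $\psi$ by its sample analogue, with fitted nuisances in place of the true ones. The following illustrative Python sketch uses my own function name and replaces the unknown $p$ by the treated share; it is a sketch under those assumptions, not the chapter's exact implementation.

```python
import numpy as np

def influence_function_variance(d, dy, pscore, cond_cdf_at_y, y):
    """Sample analogue of E[psi^2] from Theorem 5: build w0 and w1, the
    plug-in cdf estimate, then the empirical influence function psi_i."""
    d = np.asarray(d, dtype=float)
    p_hat = d.mean()
    w0 = ((1 - d) / p_hat) * (pscore / (1 - pscore))
    w1 = d / p_hat
    # doubly-robust plug-in estimate of F_{dY0t|D=1}(y)
    f_hat = np.mean(w0 * ((dy <= y) - cond_cdf_at_y) + w1 * cond_cdf_at_y)
    psi = w0 * (dy <= y) - (w0 - w1) * cond_cdf_at_y - w1 * f_hat
    return np.mean(psi ** 2)
```

Dividing the result by $n$ gives a pointwise variance estimate for $\hat{F}_{\Delta Y_{0t}|D=1}(y)$.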
Now, I will present a central limit theorem result, based upon a similar result in Callaway and Li (2019), which I will later use to establish the limiting behavior of the QTT estimator.

Proposition 1. Suppose Assumptions ID.1-ID.5 and Assumption NP.1 hold. Then
$$\left(\hat{G}_{\Delta Y_{0t}|D=1},\; \hat{G}_{\Delta Y_{t-1}|D=1},\; \tilde{G}_{Y_{0t}|D=1},\; \hat{G}_{Y_t|D=1},\; \hat{G}_{Y_{t-1}|D=1},\; \hat{G}_{Y_{t-2}|D=1}\right) \xrightarrow{d} \left(\mathbb{G}_{\Delta 1}, \mathbb{G}_{\Delta 2}, \tilde{\mathbb{G}}_0, \mathbb{G}_1, \mathbb{G}_3, \mathbb{G}_4\right)$$
in the space
$$\mathcal{S} = \ell^\infty(\mathcal{Y}_{\Delta Y_{0t}|D=1}) \times \ell^\infty(\mathcal{Y}_{\Delta Y_{t-1}|D=1}) \times \ell^\infty(\mathcal{Y}_{Y_{0t}|D=1}) \times \ell^\infty(\mathcal{Y}_{Y_t|D=1}) \times \ell^\infty(\mathcal{Y}_{Y_{t-1}|D=1}) \times \ell^\infty(\mathcal{Y}_{Y_{t-2}|D=1}),$$
where the limit is a tight Gaussian process with mean 0 and covariance $V(y', y) = E[\eta(y')\,\eta(y)']$ for evaluation points $y = (y_1, y_2, y_3, y_4, y_5, y_6)$, with $\eta(y)$ given by
$$\eta(y) = \begin{pmatrix} \psi(D, p(X), P(\Delta Y_{0t}\le y_1|X); p) \\[2pt] \frac{D}{p}\left(\mathbf{1}\{\Delta Y_{t-1}\le y_2\} - F_{\Delta Y_{t-1}|D=1}(y_2)\right) \\[2pt] \frac{D}{p}\left(\mathbf{1}\{\tilde{Y}_t\le y_3\} - F_{Y_{0t}|D=1}(y_3)\right) \\[2pt] \frac{D}{p}\left(\mathbf{1}\{Y_t\le y_4\} - F_{Y_t|D=1}(y_4)\right) \\[2pt] \frac{D}{p}\left(\mathbf{1}\{Y_{t-1}\le y_5\} - F_{Y_{t-1}|D=1}(y_5)\right) \\[2pt] \frac{D}{p}\left(\mathbf{1}\{Y_{t-2}\le y_6\} - F_{Y_{t-2}|D=1}(y_6)\right) \end{pmatrix}$$
where
$$\hat{G}_{\Delta Y_{0t}|D=1}(\cdot) = \sqrt{n}\left(\hat{F}_{\Delta Y_{0t}|D=1}(\cdot) - F_{\Delta Y_{0t}|D=1}(\cdot)\right),$$
$$\tilde{Y}_{it} = F^{-1}_{\Delta Y_{0t}|D=1}\big(F_{\Delta Y_{t-1}|D=1}(\Delta Y_{it-1})\big) + F^{-1}_{Y_{t-1}|D=1}\big(F_{Y_{t-2}|D=1}(Y_{it-2})\big),$$
$$\tilde{F}_{Y_{0t}|D=1}(y) = \frac{1}{n_D}\sum_{i\in D}\mathbf{1}\{\tilde{Y}_{it}\le y\}, \qquad \tilde{G}_{Y_{0t}|D=1}(y) = \sqrt{n}\left(\tilde{F}_{Y_{0t}|D=1}(y) - F_{Y_{0t}|D=1}(y)\right).$$

Proposition SA2 and Theorem SA1 of Callaway and Li (2019) still hold. I reproduce them here in order to account for the change in notation and numbering of the assumptions, and for the technical fact that the estimator of $F_{\Delta Y_{0t}|D=1}(\cdot)$ has changed. Proposition 1 is used to establish the result in Proposition 2.

Proposition 2. Let $\hat{G}_0(y) = \sqrt{n}\big(\hat{F}_{Y_{0t}|D=1}(y) - F_{Y_{0t}|D=1}(y)\big)$ and let $\hat{G}_1(y) = \sqrt{n}\big(\hat{F}_{Y_t|D=1}(y) - F_{Y_t|D=1}(y)\big)$. Suppose Assumptions ID.1-ID.5 and Assumption NP.1 hold. Then
$$(\hat{G}_0, \hat{G}_1) \xrightarrow{d} (G_0, G_1)$$
where $G_0$ and $G_1$ are tight Gaussian processes with mean 0 and almost surely uniformly continuous paths on the space $\mathcal{Y}_{Y_{0t}|D=1} \times \mathcal{Y}_{Y_t|D=1}$, given by $G_1 = \mathbb{G}_1$ and
$$G_0(y) = \tilde{\mathbb{G}}_0(y) + \int\Bigg(\big(\mathbb{G}_{\Delta 1}(K_2(y,v)) - \mathbb{G}_{\Delta 2}(K_3(y,v))\big)\,\frac{f_{\Delta Y_{0t}|D=1}(K_2(y,v))}{f_{\Delta Y_{t-1}|D=1}(K_3(y,v))} + \big(\mathbb{G}_4(v) - \mathbb{G}_3(K_1(v))\big)\,\frac{f_{\Delta Y_{0t}|D=1}(K_2(y,v))\; f_{\Delta Y_{t-1}|Y_{t-2},D=1}(K_3(y,v)\mid v)}{f_{Y_{t-1}|D=1}(K_1(v))\; f_{\Delta Y_{t-1}|D=1}(K_3(y,v))}\Bigg)\, dF_{Y_{t-2}|D=1}(v),$$
where $K_1(v) := F^{-1}_{Y_{t-1}|D=1}\big(F_{Y_{t-2}|D=1}(v)\big)$, $K_2(y,v) := y - K_1(v)$, and $K_3(y,v) := F^{-1}_{\Delta Y_{t-1}|D=1}\big(F_{\Delta Y_{0t}|D=1}(K_2(y,v))\big)$.

The above proposition is then used to establish the limiting behavior of the QTT estimator.

Theorem 6. Suppose $F_{Y_{0t}|D=1}$ admits a positive continuous density $f_{Y_{0t}|D=1}$ on an interval $[a, b]$ containing an $\epsilon$-enlargement of the set $\{F^{-1}_{Y_{0t}|D=1}(\tau) : \tau \in \mathcal{T}\}$. Suppose Assumptions ID.1-ID.5 and Assumption NP.1 hold. Then
$$\sqrt{n}\left(\widehat{QTT}(\tau) - QTT(\tau)\right) \xrightarrow{d} \bar{G}_1(\tau) - \bar{G}_0(\tau)$$
where $(\bar{G}_0(\tau), \bar{G}_1(\tau))$ is a stochastic process in the metric space $(\ell^\infty(\mathcal{T}))^2$ with
$$\bar{G}_0(\tau) = \frac{G_0\big(F^{-1}_{Y_{0t}|D=1}(\tau)\big)}{f_{Y_{0t}|D=1}\big(F^{-1}_{Y_{0t}|D=1}(\tau)\big)}, \qquad \bar{G}_1(\tau) = \frac{G_1\big(F^{-1}_{Y_t|D=1}(\tau)\big)}{f_{Y_t|D=1}\big(F^{-1}_{Y_t|D=1}(\tau)\big)}.$$

The theorem is unchanged from Callaway and Li (2019), since the asymptotic distributions of $\hat{G}_0(y)$ and $\hat{G}_1(y)$ are unchanged by the doubly-robust estimator of $F_{\Delta Y_{0t}|D=1}(y)$. This is analogous to a doubly-robust estimator having the same distribution as other semiparametric two-step estimators.⁴

2.5 The Bootstrap

The standard errors in the application are based upon the empirical bootstrap procedure. I assume that for the bootstrapped nuisance functions, denoted by $*$, the following assumption holds.

Assumption B.1.
$$\sup_{x\in\mathcal{X}}\left|\hat\pi^*(x) - \hat\pi(x)\right| = o_{p^*}(n^{-1/4})$$
$$\sup_{x\in\mathcal{X}}\left\|\hat{P}^*(\Delta Y_{0t}\le \cdot \mid x) - \hat{P}(\Delta Y_{0t}\le \cdot \mid x)\right\| = o_{p^*}(n^{-1/4})$$

This assumption gives the minimum rate of convergence of the difference between the bootstrapped estimator and the original estimator as they tend to zero. By an argument outlined in the appendix, the double-robustness property does not reduce the necessary rate of convergence to achieve asymptotic normality when applying the empirical bootstrap.
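The empirical bootstrap referenced here resamples units (rows) with replacement and re-runs the full estimation on each resample. A generic illustrative sketch follows; the function name and interface are my own, and in practice `statistic` would wrap the entire two-step QTT estimation, including re-fitting the nuisance functions on each draw.

```python
import numpy as np

def empirical_bootstrap(data, statistic, n_boot=200, seed=0):
    """Nonparametric (empirical) bootstrap: draw n indices with
    replacement and recompute the statistic on each resampled dataset."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        stats[b] = statistic(data[idx])
    return stats
```

The resulting draws feed the interquartile-range standard errors and sup-t critical values used in the application below.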
The argument relies upon an expansion of the efficient influence function of the bootstrapped estimate of $F_{\Delta Y_{0t}|D=1}(y)$ for the bootstrapped sample around the original estimates of the nuisance functions. When the expansion using the full sample is around the true functions, the expected values of the pathwise derivatives with respect to the nuisance functions equal zero; this no longer holds at estimates of the true functions. This is analogous to the elimination of the bias term with kernel estimates of nuisance functions when considering the asymptotic behavior of doubly-robust estimators, as shown in Rothe and Firpo (2019). This argument establishes the following proposition.

⁴ See the introduction of Rothe and Firpo (2019).

Proposition 3. Suppose Assumptions ID.1-ID.5 and Assumptions NP.1 and B.1 hold. Then
$$\left(\hat{G}^*_{\Delta Y_{0t}|D=1},\; \hat{G}^*_{\Delta Y_{t-1}|D=1},\; \tilde{G}^*_{Y_{0t}|D=1},\; \hat{G}^*_{Y_t|D=1},\; \hat{G}^*_{Y_{t-1}|D=1},\; \hat{G}^*_{Y_{t-2}|D=1}\right) \xrightarrow{d} \left(\mathbb{G}_{\Delta 1}, \mathbb{G}_{\Delta 2}, \tilde{\mathbb{G}}_0, \mathbb{G}_1, \mathbb{G}_3, \mathbb{G}_4\right)$$
in the space $\mathcal{S}$ of Proposition 1, where $*$ denotes the bootstrap analogue and the limit is the same tight mean-zero Gaussian process as in Proposition 1, with covariance $V(y', y) = E[\eta(y')\,\eta(y)']$ and $\eta(y)$ as given there.

Using the above proposition, I then obtain the asymptotic behavior of the bootstrapped process.

Theorem 7. Under Assumptions ID.1-ID.5, B.1-B.2, and either Assumptions P.1-P.3, or Assumptions NP.1-NP.7 and C.1-C.4,
$$\sqrt{n}\left(\widehat{QTT}(\tau)^* - \widehat{QTT}(\tau)\right) \xrightarrow{d}$$
$$\bar{G}_1(\tau) - \bar{G}_0(\tau),$$
where $(\bar{G}_0(\tau), \bar{G}_1(\tau))$ is a stochastic process in the metric space $(\ell^\infty(\mathcal{T}))^2$ with
$$\bar{G}_0(\tau) = \frac{G_0\big(F^{-1}_{Y_{0t}|D=1}(\tau)\big)}{f_{Y_{0t}|D=1}\big(F^{-1}_{Y_{0t}|D=1}(\tau)\big)}, \qquad \bar{G}_1(\tau) = \frac{G_1\big(F^{-1}_{Y_t|D=1}(\tau)\big)}{f_{Y_t|D=1}\big(F^{-1}_{Y_t|D=1}(\tau)\big)},$$
and $(G_0(\tau), G_1(\tau))$ are as in Proposition 2.

2.6 Simulations

In this section I present the estimation of the QTT at $\tau \in [0.2, 0.8]$, in increments of 0.02. The goal is to demonstrate not only how my estimator performs in small samples, but also how it compares to the estimator of Callaway and Li (2019). I generate the following data generating process with $N = 1000$ and $T = 3$ for 200 iterations:
$$v \sim \text{Normal}(0, 1)$$
$$\eta \mid D=0 \sim \text{Normal}(0, 1), \qquad \eta \mid D=1 \sim \text{Normal}(1, 1)$$
$$X_1 \sim \text{Uniform}(0, 1), \quad X_2 \sim \text{Uniform}(-1, 0), \quad X_3 \sim \text{Uniform}(-2, -1), \quad X_4 \sim \text{Uniform}(-1, 0)$$
$$Y_{t-2} = 0.25X_1 + 0.5X_2 + 0.75X_3 + X_4 + \eta + v_{t-2}$$
$$Y_{t-1} = 1 + 0.5X_1 + 0.75X_2 + X_3 + 1.5X_4 + \eta + v_{t-1}$$
$$Y_{0t} = 2 + 0.25X_1 + 0.5X_2 + 0.75X_3 + X_4 + \eta + v_t$$
$$Y_{1t} = 1.5X_1 + X_2 + 1.5X_3 + X_4 + \eta + v_t$$
$$Y_t = D \times Y_{1t} + (1-D) \times Y_{0t}$$
$$p(X, \delta) = \frac{e^{X\delta}}{1 + e^{X\delta}}, \qquad \delta = (-0.25,\ 0.5,\ -0.75,\ 1)$$

The data generating process is based upon Example 3 in Callaway and Li (2019). The distribution of the covariates is chosen so that, for the given values of the parameters, the probability of treatment and the conditional cdf of $\Delta Y_{0t}$ do not produce fitted values that are too close to 0 and 1, which can cause numerical issues when inverting the estimators. The parameters $\delta$ of the propensity score are estimated via maximum likelihood. The parameters $\beta$ of the conditional cdf of $\Delta Y_{0t}$ are estimated via an OLS regression of $\Delta Y_t$ on $X$ over the untreated subsample, and $\sigma$, the standard deviation of $v_t$, is estimated by the standard deviation of the vector of residuals generated from the OLS regression. The following estimator is applied:
$$\hat{F}_{\Delta Y_{0t}|D=1}(y) = \left[\sum_{i=1}^n \frac{p(x_i, \hat\delta)(1-D_i)}{1-p(x_i,\hat\delta)}\right]^{-1}\sum_{i=1}^n \frac{p(x_i,\hat\delta)(1-D_i)}{1-p(x_i,\hat\delta)}\left[\mathbf{1}\{\Delta Y_i \le y\} - \Phi\!\left(\frac{y - x_i\hat\beta}{\hat\sigma}\right)\right] + n_D^{-1}\sum_{i=1}^n D_i\,\Phi\!\left(\frac{y - x_i\hat\beta}{\hat\sigma}\right)$$
where $\Phi(\cdot)$ is the standard normal cdf, and $\hat\delta$, $\hat\beta$, and $\hat\sigma$ are the aforementioned estimates of the nuisance parameters. Note that this estimator normalizes the weights of the first term, creating a normalized augmented inverse probability weighting estimator. This adds an asymptotically negligible normalization constant while improving the small-sample performance of the estimator.⁵

Misspecification of the propensity score is modeled by choosing the propensity score to be the standard normal cdf. Misspecification of the cdf nuisance function is modeled by choosing the function to be the Logistic(0,1) cdf. In either case, the chosen function resembles the true function over the support of the underlying random variable, but the misspecification is most pronounced in the tails. This misspecification is not far from what a researcher might plausibly fit based upon their data.

The estimator that I propose in this chapter outperforms the Callaway and Li estimator, at least under the data generating process that I used. Figure F.1 shows that under all of the scenarios involving misspecification of the nuisance functions that I considered, there is a similar average absolute bias across the quantile estimates; however, as shown in Figure F.2, under these same scenarios the Callaway and Li estimator has a root-mean-square error (RMSE) of approximately 1.5, regardless of the quantile. Note that these figures are divided into four scenarios, comparing the various forms of misspecification under the doubly-robust estimator to the Callaway and Li estimator.

2.7 Application

In this section, I use my method to study the effect of state-level changes in the minimum wage on county-level unemployment rates.
The purpose of this application is to demonstrate how the standard errors of the estimates using my estimator compare to the standard errors of the estimates using the estimator of Callaway and Li (2019); the application is based upon the application within that paper. Variation in state-level minimum wage laws is exploited alongside variation in county-level observable characteristics, such as differences in population and median income. The goal is to examine the change in the distribution of county-level unemployment rates due to an increase in the minimum wage, and to compare that to the distribution of unemployment rates had there been no change in the minimum wage.

The dataset that is used is taken from the replication materials of Callaway and Li (2019).⁶ They examine a period during which there was variation in state-level minimum wages, but the U.S. federal minimum wage remained flat until the end of the period. The outcome variable is the county-level unemployment rate, which they obtain from the Local Area Unemployment Statistics Database of the Bureau of Labor Statistics. County-level unemployment rates are available monthly, and they choose to use the February unemployment rates from 2005-2007, a month that they felt to be sufficiently far from the federal wage change in July 2007. They merge in county characteristics, the 1997 county median income and the 2000 county population, from the 2000 County Data Book. The treatment group consists of counties in the 11 states (excluding states in the northeast) that increased their minimum wage by the first quarter of 2007.

⁵ For a discussion of the importance of normalization of inverse probability estimators, though not in the context of double-robustness, see Słoczyński, Uysal, and Wooldridge (2022). The main benefit of this normalization is a reduction in the small-sample bias of the estimator.
Counties in 20 states that did not increase their minimum wage by July 2007 form the control group.

The nuisance parameters are estimated parametrically. I assumed a logit specification for the propensity score, with the covariates chosen to be the natural log of county population, the natural log of median county income, the squares of these terms, their interaction, factor variables for the South and West census regions, and the interactions of the factor variables with the other covariates. I assumed a probit specification for the conditional cdf of $\Delta Y_{0t}$, with the parameters estimated using ordinary least squares and assuming homoskedastic errors.

Due to the simultaneous estimation of parameters and construction of confidence intervals, I construct the confidence intervals as in Callaway and Li (2019). I outline the steps as an algorithm below:

1. For each $\tau \in \mathcal{T}$, calculate
$$\hat\Sigma(\tau)^{1/2} = \big(q_{0.75}(\tau) - q_{0.25}(\tau)\big)\big/\big(z_{0.75} - z_{0.25}\big).$$
This is the bootstrap interquartile range divided by the interquartile range of a standard normal random variable, where $\hat\Sigma(\tau)$ is an estimate of the asymptotic variance of $\widehat{QTT}(\tau)$.

2. For bootstrap iterations $b = 1, \ldots, B$, calculate
$$I_b = \sup_{\tau\in\mathcal{T}} \hat\Sigma(\tau)^{-1/2}\left|\sqrt{n}\big(\widehat{QTT}(\tau)_b - \widehat{QTT}(\tau)\big)\right|.$$

3. Calculate $c^B_{1-\alpha}$, the $(1-\alpha)$ quantile of $\{I_b\}_{b=1}^B$.

4. Calculate $\widehat{QTT}(\tau) \pm c^B_{1-\alpha}\,\hat\Sigma(\tau)^{1/2}/\sqrt{n}$.

⁶ The replication materials can be found at https://onlinelibrary.wiley.com/doi/full/10.3982/QE935 .

Figure F.3 shows that when comparing the estimates, there is little difference between my estimator and the Callaway and Li (2019) estimator. The point estimates are close, except at the 90th percentile, and the confidence intervals based upon my estimator are only slightly narrower. This is more revealing than it may seem. My simulations would suggest a sharp reduction in the standard error of the estimates when applying my estimator, but only when the nuisance function estimates are sufficiently close to the truth.
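The four steps above can be sketched compactly. The following is illustrative Python with my own function name; `qtt_boot` is assumed to hold bootstrap draws of the QTT over the grid of quantiles, and the normal-IQR constant $z_{0.75} - z_{0.25}$ is hard-coded to avoid a dependency.

```python
import numpy as np

Z_IQR = 1.3489795003921634   # z_{0.75} - z_{0.25} for a standard normal

def uniform_confidence_band(qtt_hat, qtt_boot, n, alpha=0.05):
    """Steps 1-4 of the algorithm: robust scale from the bootstrap IQR,
    sup-t statistic over tau, then symmetric uniform bands.
    qtt_hat: (T,) point estimates; qtt_boot: (B, T) bootstrap draws."""
    q75 = np.quantile(qtt_boot, 0.75, axis=0)
    q25 = np.quantile(qtt_boot, 0.25, axis=0)
    sigma_half = (q75 - q25) / Z_IQR                           # step 1
    t_b = np.max(np.abs(np.sqrt(n) * (qtt_boot - qtt_hat)) / sigma_half,
                 axis=1)                                        # step 2
    c = np.quantile(t_b, 1 - alpha)                             # step 3
    half = c * sigma_half / np.sqrt(n)                          # step 4
    return qtt_hat - half, qtt_hat + half
```

Because the critical value is a sup over $\tau$, the resulting band covers the whole QTT curve uniformly, not just pointwise.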
My estimator suggests that there is a strong misspecification of the conditional cdf of $\Delta Y_{0t}$.

2.8 Conclusion

I have provided a doubly-robust estimator of the quantile treatment effect on the treated. This estimator relaxes the assumptions on the nuisance functions, allowing for a slower rate of convergence while still achieving the limiting distribution of the QTT estimator. This makes nonparametric estimation of each of the nuisance functions much more viable, since nonparametric estimation requires assumptions upon the differentiability of the nuisance functions, which in turn affects the rate of convergence. As my simulations demonstrate, this leads to a lower RMSE in small samples, particularly when estimating the QTT at the median. Without the double-robustness property, confidence intervals could be so large that the QTT is not statistically different from 0 except at the extremes of the distribution of the difference in treated and untreated outcomes for the treated subpopulation.

It is important to recognize what this estimator is not. It is not a substitute for the doubly-robust estimator of the ATT that is presented in Sant'Anna and Zhao (2020). The assumptions necessary for identification are relaxed: there is no conditional copula assumption, and the parallel trends assumption is weaker than the conditional independence assumption on the difference in untreated outcomes. Instead, the two estimators should be used to complement each other. The ATT should be estimated along with quantile treatment effects on the treated at a variety of quantiles. If the estimate of the ATT is inconsistent with the results presented across the information summarized by the QTTs, then perhaps either the conditional copula assumption or the conditional independence assumption does not hold.
What this estimator should be seen as is part of a middle ground between some of the more nonparametric estimators and estimators that rely entirely upon propensity score matching. In particular, the optimal transport methods of Gunsilius and Xu (2021) and Torous, Gunsilius, and Rigollet (2021) avoid the curse of dimensionality that is common with nonparametric estimation of the propensity score when estimating treatment effects; however, a doubly-robust estimator will relax the smoothness assumptions on the propensity score function in relation to the dimension of the covariate matrix. When supplemented with other doubly-robust estimators in the causal inference literature, my QTT estimator becomes part of a battery of doubly-robust estimators that increase the feasibility of propensity score matching.

CHAPTER 3

APPLICATION OF ADDITIONAL MOMENTS TO QUANTILE TREATMENT EFFECT ESTIMATION: A SIMULATION

3.1 Introduction

In Chapter 2, I presented a doubly-robust estimator of the quantile treatment effect on the treated (QTT). This estimator is robust to misspecification of either the propensity score for the probability of treatment or the conditional cdf of $\Delta Y_{0t}$. In this chapter, I run a simulation to compare the doubly-robust estimator of Chapter 2 with an estimator that relies upon an overidentified system of moment equations to estimate the parameters of the nuisance functions. In other words, though the estimator in Chapter 2 is doubly-robust, it might not lead to vastly improved performance over an estimator that relies upon additional moments to estimate the QTT. An additional moment condition could be applied to estimate the parameters of the nuisance functions, and then either choice of nuisance function could be plugged in to estimate the QTT.

3.1.1 Structure of the Chapter

This chapter is structured as follows. Section 3.2 lays out the assumptions that were contained in Chapter 2.
In particular, I note that the estimator in this chapter relies upon Assumption ID.4 of Chapter 2 to create an overidentified GMM estimator. I also explain how adding additional moments can eventually lead to a QTT estimator that approximates the performance of the estimator in Chapter 2. Section 3.3 contains simulations and interpretations of those simulations. Section 3.4 concludes the chapter.

3.2 Identification

Recall the following assumptions from Chapter 2.

Assumption ID.1. The observed data {ΔY_{it}, Y_{it-1}, Y_{it-2}, X_i, D_i}_{i=1}^n are independent and identically distributed draws from the joint distribution F_{ΔY_t, Y_{t-1}, Y_{t-2}, X, D}. In addition, Y_{it} = D_i Y_{1it} + (1 - D_i) Y_{0it}, Y_{it-1} = Y_{0it-1}, and Y_{it-2} = Y_{0it-2}.

Assumption ID.2. Each of the random variables ΔY_t for the treated group and ΔY_{t-1}, ΔY_{t-2} for the treated group is continuously distributed on its support, with a density that is uniformly bounded from above and bounded away from 0.

Assumption ID.3. p ≔ P(D = 1) > 0 and, for all x ∈ X, p(x) ≔ P(D = 1 | X = x) < 1.

Assumption ID.4. ΔY_{0t} ⊥⊥ D | X.

The last assumption is known as the "Copula Stability Assumption":

Assumption ID.5 (Copula Stability Assumption). C_{ΔY_{0t}, Y_{0t-1} | D=1, X}(·,·) = C_{ΔY_{0t-1}, Y_{0t-2} | D=1, X}(·,·).

As shown in Chapter 2, these assumptions identify QTT(τ). Now, consider the following moment condition:

E[ ((1 - D)/p) (p(X)/(1 - p(X))) 1{ΔY_t ≤ y} - (D/p) P(ΔY_{0t} ≤ y | X) ] = 0.    (3.1)

Note that, unlike in the previous chapter, I am assuming correct specification of the nuisance functions p(X) and P(ΔY_{0t} ≤ y | X). In other words, I am using information from the treated and untreated observations, but not for the purposes of robustness. This information will be used to create an overidentified generalized method of moments (GMM) estimator, which will then be used to obtain QTT(τ). The following system of moments is used to estimate F_{ΔY_{0t}|D=1}(y); this estimator is then used to obtain QTT(τ) as in Chapter 2.
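As a numeric illustration, the sample analogue of moment condition (3.1) should be close to zero when Assumption ID.4 holds and both nuisance functions are correctly specified. The following sketch checks this on a hypothetical data generating process (the propensity score, outcome equation, and sample size are illustrative choices, not taken from the chapter):

```python
import numpy as np
from math import erf

# Hypothetical design: Delta Y_0t depends on X but is independent of D given X,
# so Assumption ID.4 holds by construction.
rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(0, 1, n)
p_x = 1.0 / (1.0 + np.exp(-(X - 0.5)))   # true propensity score p(X)
D = rng.binomial(1, p_x)                 # treatment indicator
dY = X + rng.normal(0, 1, n)             # Delta Y_t = Delta Y_0t for untreated units
p = D.mean()                             # unconditional treatment probability

y = 0.5
# True conditional cdf P(Delta Y_0t <= y | X) = Phi(y - X) under the normal error.
Phi = lambda z: 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))
cdf_x = Phi(y - X)

# Sample analogue of moment condition (3.1); should be near zero in large samples.
moment = np.mean((1 - D) / p * p_x / (1 - p_x) * (dY <= y) - D / p * cdf_x)
print(round(float(moment), 3))
```

Both terms estimate F_{ΔY_{0t}|D=1}(y): the first reweights untreated observations toward the treated covariate distribution, while the second averages the conditional cdf over treated units.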
The system of moment conditions is

( ψ_i(p(X;θ), P(ΔY_{0t} ≤ y|X;γ)) ; s_{p(x)i}(θ) ; s_{P(ΔY_{0t}|X)i}(γ) )',    (3.2)

where

ψ_i(p(X;θ), P(ΔY_{0t} ≤ y|X;γ)) = ((1 - D_i)/p)(p(X_i;θ)/(1 - p(X_i;θ))) 1{ΔY_{it} ≤ y} - (D_i/p) P(ΔY_{0t} ≤ y|X_i;γ),    (3.3)

and s_{p(x)i}(θ) and s_{P(ΔY_{0t}|X)i}(γ) denote the first-order conditions used to estimate the parameters of the nuisance functions. The estimator Q̂TT(τ) is then

Q̂TT(τ) = F̂^{-1}_{Y_{1t}|D=1}(τ) - F̂^{-1}_{Y_{0t}|D=1}(τ),

where

F̂^{-1}_{Y_{1t}|D=1}(τ) = inf{y : F̂_{Y_{1t}|D=1}(y) ≥ τ},
F̂^{-1}_{Y_{0t}|D=1}(τ) = inf{y : F̂_{Y_{0t}|D=1}(y) ≥ τ},
F̂_{Y_{0t}|D=1}(y) = n_D^{-1} Σ_{i∈D} 1{ F̂^{-1}_{ΔY_{0t}|D=1}(F̂_{ΔY_{t-1}|D=1}(ΔY_{it-1})) + F̂^{-1}_{Y_{t-1}|D=1}(F̂_{Y_{t-2}|D=1}(Y_{it-2})) ≤ y },

where n_D denotes the number of treated observations and

F̂_{ΔY_{0t}|D=1}(y) = n^{-1} Σ_{i=1}^n [ ((1 - D_i)/(n^{-1} Σ_{k=1}^n D_k)) (π̂(x_i)/(1 - π̂(x_i))) 1{ΔY_{it} ≤ y} ].    (3.4)

In invoking this moment condition, I am assuming that Assumption ID.4 holds. It is this assumption that allows for reweighting with either the propensity score or the cdf nuisance function. The assumption not only ensures that the moment condition holds; it is also essential for inverting F_{ΔY_{0t}|D=1}(y). For example, suppose that

E[ ((1 - D)/p)(p(X)/(1 - p(X))) 1{ΔY_t ≤ 0.5} - (D/p) P(ΔY_{0t} ≤ 0.5|X) ] = 0

but

E[ ((1 - D)/p)(p(X)/(1 - p(X))) 1{ΔY_t ≤ y} - (D/p) P(ΔY_{0t} ≤ y|X) ] ≠ 0

for y ≠ 0.5. Then the inversion F^{-1}_{ΔY_{0t}|D=1}(F̂_{ΔY_{t-1}|D=1}(ΔY_{it-1})) would be incorrect, by the proof of Theorem 1 in Chapter 2.

The inclusion of this additional moment condition should not be expected to improve the performance of the estimator over the estimator in Chapter 2. What can be expected is that, as more moment conditions are added, the performance of the estimator gets closer to that of the estimator in Chapter 2. This is because all of the information that is communicated through the double-robustness property in Chapter 2 is only partially communicated by additional moment functions.
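The final inversion step above can be sketched in a few lines: given estimated distribution functions for the treated-group potential outcomes, QTT(τ) is a difference of generalized inverses. The draws below are synthetic placeholders standing in for the fitted distributions, not the chapter's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
y1 = rng.normal(1.0, 1.0, 50_000)   # stand-in draws for Y_1t | D = 1
y0 = rng.normal(0.0, 1.0, 50_000)   # stand-in draws for Y_0t | D = 1

def cdf_inverse(sample, tau):
    """Generalized inverse inf{y : F_hat(y) >= tau} of the empirical cdf."""
    s = np.sort(sample)
    k = int(np.ceil(tau * len(s))) - 1
    return s[max(k, 0)]

def qtt(tau):
    return cdf_inverse(y1, tau) - cdf_inverse(y0, tau)

# Under a pure location shift, the QTT is roughly 1 at every quantile.
print(round(qtt(0.5), 2))
```

With heterogeneous treatment effects, qtt(τ) would instead vary across τ, which is what the chapter's grid of quantiles is designed to detect.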
To see this, recall that the doubly-robust moment representation of F_{ΔY_{0t}|D=1}(y) is

F_{ΔY_{0t}|D=1}(y) = E[ ((1 - D)/p)(π(X)/(1 - π(X))) 1{ΔY_t ≤ y} - ( ((1 - D)/p)(π(X)/(1 - π(X))) - D/p ) P̃(ΔY_{0t} ≤ y|X) ].    (3.5)

Now suppose that we have the following system of moments:

( ψ_i(p(X;θ), P(ΔY_{0t} ≤ y_1|X;γ)) ; ψ_i(p(X;θ), P(ΔY_{0t} ≤ y_2|X;γ)) ; s_{p(x)i}(θ) ; s_{P(ΔY_{0t}|X)i}(γ) )',    (3.6)

where y_1 ≠ y_2. A GMM estimator that uses these moments is the most efficient estimator under these moment conditions; however, the first two moment conditions do not communicate more information about Assumption ID.4 than is communicated through (3.5). As more moment conditions are added, the information communicated through all conditions of the form (3.3) approaches the information communicated through (3.5).

3.3 Simulations

In this section I present the estimation of the QTT at τ ∈ [0.2, 0.8], in increments of 0.02. The goal is to demonstrate not only how my estimator performs in small samples, but also how it compares to the estimator of Callaway and Li (2019). I generate the following data generating process with N = 1000 and T = 3 for 200 iterations:

v ∼ Normal(0, 1)
η | D = 0 ∼ Normal(0, 1)
η | D = 1 ∼ Normal(1, 1)
X1 ∼ Uniform(0, 1)
X2 ∼ Uniform(-1, 0)
X3 ∼ Uniform(-2, 1)
X4 ∼ Uniform(-1, 0)
Y_{t-2} = 0.25 X1 + 0.5 X2 + 0.75 X3 + X4 + η + v_{t-2}
Y_{t-1} = 1 + 0.5 X1 + 0.75 X2 + X3 + 1.5 X4 + η + v_{t-1}
Y_{0t} = 2 + 0.25 X1 + 0.5 X2 + 0.75 X3 + X4 + η + v_t
Y_{1t} = 1.5 X1 + X2 + 1.5 X3 + X4 + η + v_t
Y_t = D × Y_{1t} + (1 - D) × Y_{0t}
p(X, γ) = e^{Xγ} / (1 + e^{Xγ}),  γ = [-0.25, 0.5, 0.75, 1]

The data generating process is based upon Example 3 in Callaway and Li (2019). The distribution of the covariates is chosen so that, for the given values of the parameters, the probability of treatment and the conditional cdf of ΔY_{0t} do not produce observed values that are too close to 0 and 1.
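One iteration of a design with this structure can be sketched as follows. Some signs and coefficient values are hard to read in the printed text, so the numbers below are illustrative (covariates are drawn on (0, 1) for simplicity), and only the structure of the design is taken from the chapter:

```python
import numpy as np

# One simulated panel, patterned on the chapter's design: logit treatment
# assignment, a fixed effect eta whose mean shifts with D, and three periods.
rng = np.random.default_rng(2)
N = 1000
X = rng.uniform(0, 1, (N, 4))                   # simplified covariate supports
gamma = np.array([-0.25, 0.5, 0.75, 1.0])       # propensity index coefficients
p_x = 1.0 / (1.0 + np.exp(-X @ gamma))
D = rng.binomial(1, p_x)
eta = rng.normal(D.astype(float), 1.0)          # eta | D=1 ~ N(1,1), else N(0,1)
v = rng.normal(0, 1, (N, 3))                    # v_{t-2}, v_{t-1}, v_t

beta = np.array([0.25, 0.5, 0.75, 1.0])
y_tm2 = X @ beta + eta + v[:, 0]
y_tm1 = 1.0 + X @ np.array([0.5, 0.75, 1.0, 1.5]) + eta + v[:, 1]
y0_t = 2.0 + X @ beta + eta + v[:, 2]
y1_t = X @ np.array([1.5, 1.0, 1.5, 1.0]) + eta + v[:, 2]
y_t = D * y1_t + (1 - D) * y0_t                 # observed outcome in period t

print(y_t.shape, round(float(D.mean()), 2))
```

Running this inside a loop of 200 iterations, and re-estimating the nuisance parameters each time, mirrors the Monte Carlo design described above.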
Values too close to 0 and 1 can cause numerical issues when inverting the estimators. All parameters are estimated using an overidentified GMM estimator. The moment conditions include the score functions taken from a logit MLE; these correspond to estimation of the propensity score. OLS moment conditions are also included and correspond to estimation of the parameters of ΔY_{0t}|X, β, and the standard deviation of v_{0t}, σ. In addition, moment condition (3.1) is included at the point y = 0.5. The following estimator is applied:

F̂_{ΔY_{0t}|D=1}(δ) = [ Σ_{i=1}^n p(x_i, θ̂)(1 - D_i)/(1 - p(x_i, θ̂)) ]^{-1} Σ_{i=1}^n [ p(x_i, θ̂)(1 - D_i)/(1 - p(x_i, θ̂)) ] 1{ΔY_i ≤ δ}.

As in Chapter 2, misspecification of the propensity score means the propensity score is chosen to be the standard normal cdf. Misspecification of the cdf nuisance function is considered when that function is chosen to be the Logistic(0, 1) cdf. The results of the simulation are presented in Figures G.1 and G.2 in the appendix. Figures G.1 and G.2 are divided into three scenarios. In the first scenario, I compare the estimator when all nuisance functions are correctly specified to the estimator when only the cdf nuisance function is correctly specified. In the second scenario, I compare the estimator when the propensity score is correctly specified, but the cdf nuisance function is misspecified, to the estimator when both nuisance functions are misspecified. In the third scenario, I compare the estimator when only the cdf nuisance function is correctly specified to the estimator when neither nuisance function is correctly specified. It is important to note the following. First, as in Chapter 2, the bias of the estimates is small relative to the true values of the quantile treatment effects.
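The reweighted cdf estimator displayed above can be sketched directly: untreated observations are weighted by p̂(x)/(1 - p̂(x)) and the weights normalized. The data generating process below is a hypothetical one chosen only to verify that, with the true propensity score plugged in, the estimator tracks the treated-group distribution of ΔY_{0t}:

```python
import numpy as np

def weighted_cdf(y, dY, D, p_hat):
    """Reweighted cdf estimator of F_{dY0t|D=1}(y) using untreated units only."""
    w = (1 - D) * p_hat / (1 - p_hat)
    return np.sum(w * (dY <= y)) / np.sum(w)

rng = np.random.default_rng(3)
n = 100_000
X = rng.uniform(-1, 1, n)
p_x = 1.0 / (1.0 + np.exp(-X))        # true propensity score
D = rng.binomial(1, p_x)
dY0 = X + rng.normal(0, 1, n)         # Delta Y_0t, independent of D given X

est = weighted_cdf(0.0, dY0, D, p_x)
# Infeasible benchmark: in the simulation we also observe dY0 for treated units.
direct = np.mean(dY0[D == 1] <= 0.0)
print(round(float(est - direct), 3))
```

Swapping `p_x` for a misspecified fit (e.g., a probit when the truth is logit) is how the misspecification scenarios above are implemented in spirit.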
Second, if a comparison is made across figures to Figure F.2, the root mean square error (RMSE) is considerably smaller than that of the Callaway and Li estimator, but larger than that of the doubly-robust estimators in Chapter 2. Assuming correct specification of all of the moments, the RMSE is larger than that of any doubly-robust estimator under the misspecification considered in Chapter 2. Even allowing for misspecification, the GMM-based estimates in this chapter yield an RMSE that is slightly greater than half the size of the RMSE of the Callaway and Li estimators in Chapter 2.

3.4 Conclusion

The GMM-based estimator of this chapter does not seem to perform as well as the doubly-robust estimator of Chapter 2, but its performance under the mild misspecification that is considered is favorable compared to the Callaway and Li estimator. It is not surprising that a lower RMSE than the estimator in Chapter 2 is not reached: the doubly-robust estimator of Chapter 2 takes advantage of the semiparametrically efficient estimator of F(ΔY_{0t}|D = 1). Still, the simulation performance of the GMM-based estimator is an improvement upon the Callaway and Li estimator.

BIBLIOGRAPHY

Abadie, Alberto (2005). "Semiparametric difference-in-differences estimators". In: The Review of Economic Studies 72.1, pp. 1–19.
Andrews, Frank M, Antonia Abbey, and L Jill Halman (1991). "Stress from infertility, marriage factors, and subjective well-being of wives and husbands". In: Journal of Health and Social Behavior, pp. 238–253.
Athey, Susan and Guido W Imbens (2006). "Identification and inference in nonlinear difference-in-differences models". In: Econometrica 74.2, pp. 431–497.
Blundell, Richard and James L Powell (2001). "Endogeneity in nonparametric and semiparametric regression models".
Bonhomme, Stéphane and Ulrich Sauder (2011). "Recovering distributions in difference-in-differences models: A comparison of selective and comprehensive schooling".
In: Review of Economics and Statistics 93.2, pp. 479–494.
Callaway, Brantly and Tong Li (2019). "Quantile treatment effects in difference in differences models with panel data". In: Quantitative Economics 10.4, pp. 1579–1618.
Callaway, Brantly and Pedro HC Sant'Anna (2021). "Difference-in-differences with multiple time periods". In: Journal of Econometrics 225.2, pp. 200–230.
Cameron, A Colin, Tong Li, Pravin K Trivedi, and David M Zimmer (2004). "Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts". In: The Econometrics Journal 7.2, pp. 566–584.
Caracciolo, Francesco and Marilena Furno (2017). "Quantile treatment effect and double robust estimators: an appraisal on the Italian labor market". In: Journal of Economic Studies.
Card, David (1990). "The impact of the Mariel boatlift on the Miami labor market". In: ILR Review 43.2, pp. 245–257.
Card, David and Alan Krueger (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.
Chamberlain, Gary (1980). "Analysis of Covariance with Qualitative Data". In: The Review of Economic Studies 47.1, pp. 225–238.
Chen, Xiaohong (2007). "Large sample sieve estimation of semi-nonparametric models". In: Handbook of Econometrics 6, pp. 5549–5632.
Chen, Xiaohong and Xiaotong Shen (1998). "Sieve extremum estimates for weakly dependent data". In: Econometrica, pp. 289–314.
Chernozhukov, Victor, Iván Fernández-Val, Jinyong Hahn, and Whitney Newey (2013). "Average and quantile effects in nonseparable panel models". In: Econometrica 81.2, pp. 535–580.
Chernozhukov, Victor, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins (2016). "Locally robust semiparametric estimation". In: arXiv preprint arXiv:1608.00033.
Fan, Jianqing, Kosuke Imai, Han Liu, Yang Ning, and Xiaolin Yang (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Tech. rep.
Technical report, Princeton University.
Firpo, Sergio (2007). "Efficient semiparametric estimation of quantile treatment effects". In: Econometrica 75.1, pp. 259–276.
Gourieroux, Christian, Alain Monfort, Eric Renault, and Alain Trognon (1987). "Generalised residuals". In: Journal of Econometrics 34.1-2, pp. 5–32.
Gunsilius, Florian and Yuliang Xu (2021). "Matching for causal effects via multimarginal optimal transport". In: arXiv preprint arXiv:2112.04398.
Hajivassiliou, Vassilis A and Paul A Ruud (1994). "Classical estimation methods for LDV models using simulation". In: Handbook of Econometrics 4, pp. 2383–2441.
Heckman, James J (1979). "Sample selection bias as a specification error". In: Econometrica: Journal of the Econometric Society, pp. 153–161.
Heckman, James J, Hidehiko Ichimura, and Petra Todd (1998). "Matching as an econometric evaluation estimator". In: The Review of Economic Studies 65.2, pp. 261–294.
Heckman, James J, Hidehiko Ichimura, Jeffrey A Smith, and Petra E Todd (1998). Characterizing selection bias using experimental data.
Hirano, Keisuke, Guido W Imbens, and Geert Ridder (2003). "Efficient estimation of average treatment effects using the estimated propensity score". In: Econometrica 71.4, pp. 1161–1189.
Horvitz, Daniel G and Donovan J Thompson (1952). "A generalization of sampling without replacement from a finite universe". In: Journal of the American Statistical Association 47.260, pp. 663–685.
Levinger, George Klaus, Oliver C Moles, et al. (1979). Divorce and separation. Basic Books.
Li, Qi and Jeffrey S Racine (2008). "Nonparametric estimation of conditional CDF and quantile functions with mixed categorical and continuous data". In: Journal of Business & Economic Statistics 26.4, pp. 423–434.
Li, Qi and Jeffrey Scott Racine (2007). Nonparametric econometrics: theory and practice. Princeton University Press.
Lin, Wei and Jeffrey M Wooldridge (2017).
Binary and fractional response models with continuous and binary endogenous explanatory variables. Tech. rep. Working paper, available at http://www.weilinmetrics.com/uploads/5/1/4/0 ?
Liu, Chien (2000). "A theory of marital sexual life". In: Journal of Marriage and Family 62.2, pp. 363–374.
Lorentz, GG (1966). Approximation of Functions, Athena Series. Holt, Rinehart and Winston, New York.
Machado, José AF and José Mata (2005). "Counterfactual decomposition of changes in wage distributions using quantile regression". In: Journal of Applied Econometrics 20.4, pp. 445–465.
Masry, Elias (1996). "Multivariate local polynomial regression for time series: uniform strong consistency and rates". In: Journal of Time Series Analysis 17.6, pp. 571–599.
Mullahy, John (2015). "Multivariate fractional regression estimation of econometric share models". In: Journal of Econometric Methods 4.1, pp. 71–100.
Muris, Chris (2020). "Efficient GMM estimation with incomplete data". In: Review of Economics and Statistics 102.3, pp. 518–530.
Nam, Suhyeon (2014). "Essays in multiple fractional responses with endogenous explanatory variables". PhD thesis. Michigan State University.
Newey, Whitney K and Daniel McFadden (1994). "Large sample estimation and hypothesis testing". In: Handbook of Econometrics, IV, edited by RF Engle and DL McFadden, pp. 2112–2245.
Newey, Whitney K (1990). "Semiparametric efficiency bounds". In: Journal of Applied Econometrics 5.2, pp. 99–135.
Oppenheimer, Valerie Kincade (1988). "A theory of marriage timing". In: American Journal of Sociology 94.3, pp. 563–591.
Papke, Leslie E and Jeffrey M Wooldridge (2008). "Panel data methods for fractional response variables with an application to test pass rates". In: Journal of Econometrics 145.1-2, pp. 121–133.
Petrin, Amil and Kenneth Train (2010). "A control function approach to endogeneity in consumer choice models". In: Journal of Marketing Research 47.1, pp. 3–13.
Rivers, Douglas and Quang Vuong (2002).
"Model selection tests for nonlinear dynamic models". In: The Econometrics Journal 5.1, pp. 1–39.
Rosenbaum, Paul R and Donald B Rubin (1983). "The central role of the propensity score in observational studies for causal effects". In: Biometrika 70.1, pp. 41–55.
Rothe, Christoph and Sergio Firpo (2013). "Semiparametric estimation and inference using doubly robust moment conditions".
Rothe, Christoph and Sergio Firpo (2019). "Properties of doubly robust estimators when nuisance functions are estimated nonparametrically". In: Econometric Theory 35.5, pp. 1048–1087.
Sant'Anna, Pedro HC and Jun Zhao (2020). "Doubly robust difference-in-differences estimators". In: Journal of Econometrics 219.1, pp. 101–122.
Schröder, Jette and Claudia Schmiedeberg (2015). "Effects of relationship duration, cohabitation, and marriage on the frequency of intercourse in couples: Findings from German panel data". In: Social Science Research 52, pp. 72–82.
Słoczyński, Tymon, S Derya Uysal, and Jeffrey M Wooldridge (2022). "Abadie's Kappa and Weighting Estimators of the Local Average Treatment Effect". In: arXiv preprint arXiv:2204.07672.
Słoczyński, Tymon and Jeffrey M Wooldridge (2018). "A general double robustness result for estimating average treatment effects". In: Econometric Theory 34.1, pp. 112–133.
Staiger, Douglas and James H Stock (1997). "Instrumental Variables Regression with Weak Instruments". In: Econometrica 65.3, pp. 557–586.
Sued, Mariela, Marina Valdora, and Víctor Yohai (2020). "Robust doubly protected estimators for quantiles with missing data". In: TEST 29.3, pp. 819–843.
Torous, William, Florian Gunsilius, and Philippe Rigollet (2021). "An Optimal Transport Approach to Causal Inference". In: arXiv preprint arXiv:2108.05858.
van der Vaart, Aad W and Jon A Wellner (1996). "Weak convergence". In: Weak Convergence and Empirical Processes. Springer, pp. 16–28.
van der Vaart, Aad W (2000). Asymptotic Statistics. Vol. 3. Cambridge University Press.
Walfisch, S, B Maoz, and H Antonovsky (1984). "Sexual satisfaction among middle-aged couples: correlation with frequency of intercourse and health status". In: Maturitas 6.3, pp. 285–296.
Woodland, Alan D (1979). "Stochastic specification and the estimation of share equations". In: Journal of Econometrics 10.3, pp. 361–383.
Wooldridge, Jeffrey M (2007). "Inverse probability weighted estimation for general missing data problems". In: Journal of Econometrics 141.2, pp. 1281–1301.
Wooldridge, Jeffrey M (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
Wooldridge, Jeffrey M (2014). "Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables". In: Journal of Econometrics 182.1, pp. 226–234.
Yabiku, Scott T and Constance T Gager (2009). "Sexual frequency and the stability of marital and cohabiting unions". In: Journal of Marriage and Family 71.4, pp. 983–1000.
Zeger, Scott L, Kung-Yee Liang, and Paul S Albert (1988). "Models for longitudinal data: a generalized estimating equation approach". In: Biometrics, pp. 1049–1060.

APPENDIX A
DERIVING THE AVERAGE PARTIAL EFFECTS

Proof of Theorem 1

√N(η̂ - η) = √N (γ̂ - γ ; θ̂ - θ)' = K_N + o_p(1),    (A.1)

where K_N = B_0^{-1} G_0' N^{-1/2} Σ_{i=1}^N ψ_i(θ_0, γ_0). Now, apply a mean value expansion of Ξ̂ around η. Then,

√N Ξ̂_tl(x_0, h_0, t_0) = N^{-1/2} Σ_{i=1}^N [ ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.) + ∇_{γ,θ} ∂Λ(d̃_tl^v + ζ^v u_it)/∂γ_(.) (η̂ - η) ],    (A.2)

where d̃_tl^v is d_tl evaluated at some θ̃ between θ̂ and θ_0. By (46) and the Weak Law of Large Numbers,

N^{-1} Σ_{i=1}^N ∇_{γ,θ} ∂Λ(d̃_tl^v + ζ^v u_it)/∂γ_(.) · √N(η̂ - η) = E[∇_{γ,θ} ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.)] K_N + o_p(1).

Then, after subtracting √N Ξ_tl(x_0, h_0, t_0) from √N Ξ̂_tl(x_0, h_0, t_0), the following is obtained:

√N(Ξ̂_tl(x_0, h_0, t_0) - Ξ_tl(x_0, h_0, t_0)) = N^{-1/2} Σ_{i=1}^N [ ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.)
- Ξ_tl(x_0, h_0, t_0) + E[∇_{γ,θ} ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.)] K_i ] + o_p(1),

where K_i = B_0^{-1} G_0' ψ_i(θ_0, γ_0). Since

E[ ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.) - Ξ_tl(x_0, h_0, t_0) + E[∇_{γ,θ} ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.)] K_i ] = 0,

then by the Central Limit Theorem,

√N(Ξ̂_tl(x_0, h_0, t_0) - Ξ_tl(x_0, h_0, t_0)) → N(0, V_tl),    (A.3)

where V_tl = E[V_tli' V_tli] and V_tli = ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.) - Ξ_tl(x_0, h_0, t_0) + E[∇_{γ,θ} ∂Λ(d_tl^{v0} + ζ^v u_it)/∂γ_(.)] K_i.

The proof of Theorem 2 is similar.

APPENDIX B
SIMULATION TABLES FOR CHAPTER 1

Table B.1: APE: v_it ~ Normal(0, 1)

z = 1, ζ1 = 1, ζ2 = 1
             True APE (percentile)           Mean estimate (s.e.)
Covariate    25        50        75          25               50               75
t1           0.0952    -0.0933   -0.0023     0.0685 (.0204)   -0.0815 (.0323)  -0.0020 (.0011)
t2           0.0352    0.1549    0.0023      0.0281 (.0113)   0.1316 (.0522)   0.0021 (.0011)
x1           0.1365    0.0695    0.0013      0.1207 (.0105)   0.0624 (.0110)   0.0012 (.0003)
x2           0.0820    0.1629    0.0027      0.0694 (.0214)   0.1638 (.0320)   0.0026 (.0006)

z = 0.5, ζ1 = 1, ζ2 = 1
t1           0.0952    -0.0933   -0.0023     0.0734 (.0149)   -0.0815 (.0238)  -0.0024 (.0008)
t2           0.0352    0.1549    0.0023      0.0306 (.0110)   0.1504 (.0409)   0.0024 (.0009)
x1           0.1365    0.0695    0.0013      0.1198 (.0104)   0.0599 (.0102)   0.0012 (.0003)
x2           0.0820    0.1629    0.0027      0.0698 (.0219)   0.1573 (.0298)   0.0026 (.0006)

z = 0.1, ζ1 = 1, ζ2 = 1
t1           0.0952    -0.0933   -0.0023     0.0708 (.0159)   -0.0934 (.0258)  -0.0024 (.0009)
t2           0.0352    0.1549    0.0302      0.1489 (.0114)   0.1504 (.0430)   0.0024 (.0009)
x1           0.1365    0.0695    0.0013      0.1190 (.0105)   0.0601 (.0108)   0.0012 (.0003)
x2           0.0820    0.1629    0.0027      0.0698 (.0220)   0.148 (.0317)    0.0025 (.0006)

Standard errors in parentheses

Table B.2: APE: v_it ~ Logistic(0, 1)

z = 1, ζ1 = 1, ζ2 = 1
             True APE (percentile)           Mean estimate (s.e.)
Covariate    25        50        75          25               50               75
t1           0.0881    -0.0886   -0.0023     0.0299 (.0465)   -0.0721 (.0932)  -0.0031 (.0104)
t2           0.0385    0.1666    0.0024      0.0065 (.0727)   0.1278 (.1819)   0.0036 (.0168)
x1           0.1333    0.0684    0.0013      0.0985 (.0104)   0.0534 (.0185)   0.0014 (.0026)
x2           0.0915    0.1758    0.0027      0.0694 (.0252)   0.1752 (.0404)   0.0037 (.0057)

z = 0.5, ζ1 = 1, ζ2 = 1
t1           0.0881    -0.0886   -0.0023     0.0175 (.0262)   -0.0790 (.0437)  -0.0024 (.0025)
t2           0.0385    0.1666    0.0024      0.0252 (.0175)   0.1289 (.0768)   0.0025 (.0027)
x1           0.1333    0.0684    0.0013      0.0943 (.0069)   0.0479 (.0096)   0.0010 (.0002)
x2           0.0915    0.1758    0.0027      0.0666 (.0214)   0.1711 (.0365)   0.0027 (.0007)

z = 0.1, ζ1 = 1, ζ2 = 1
t1           0.0881    -0.0886   -0.0023     0.0062 (.0230)   -0.0688 (.0441)  -0.0019 (.0015)
t2           0.0385    0.1666    0.0024      0.0205 (.0151)   0.1091 (.0759)   0.0020 (.0016)
x1           0.1333    0.0684    0.0013      0.0921 (.0069)   0.0498 (.0109)   0.0010 (.0003)
x2           0.0915    0.1758    0.0027      0.0633 (.0188)   0.1802 (.0397)   0.0028 (.0007)

Standard errors in parentheses

Table B.3: APE: v_it ~ χ²(1)

z = 1, ζ1 = 1, ζ2 = 1
             True APE (percentile)           Mean estimate (s.e.)
Covariate    25        50        75          25               50               75
t1           0.1116    -0.1004   -0.0023     0.1836 (.0475)   -0.0265 (.0533)  -0.0006 (.0021)
t2           0.0504    0.1300    0.0023      0.0334 (.0473)   0.0547 (.0768)   0.0006 (.0023)
x1           0.1712    0.0701    0.0013      0.1625 (.0104)   0.0576 (.0185)   0.0010 (.0026)
x2           0.1200    0.1503    0.0027      0.1086 (.0308)   0.1508 (.0329)   0.0025 (.0027)

z = 0.5, ζ1 = 1, ζ2 = 1
t1           0.1116    -0.1004   -0.0023     0.2763 (.0268)   -0.0895 (.0246)  -0.0020 (.0007)
t2           0.0504    0.1300    0.0023      0.0724 (.0231)   0.1414 (.0415)   0.0021 (.0008)
x1           0.1712    0.0701    0.0013      0.1730 (.0068)   0.0489 (.0089)   0.0009 (.0002)
x2           0.1200    0.1503    0.0027      0.1311 (.0359)   0.1250 (.0236)   0.0021 (.0005)

z = 0.1, ζ1 = 1, ζ2 = 1
t1           0.1116    -0.1004   -0.0023     0.3166 (.0290)   -0.0919 (.0254)  -0.0021 (.0008)
t2           0.0504    0.1300    0.0023      0.0779 (.0255)   0.1463 (.0422)   0.0021 (.0008)
x1           0.1712    0.0701    0.0013      0.1773 (.0062)   0.0497 (.0096)   0.0009 (.0002)
x2           0.1200    0.1503    0.0027      0.1382 (.0383)   0.1259 (.0251)   0.0021 (.0005)

Standard errors in parentheses

Table B.4: APE: v_it ~ Normal(0, 1)

z = 1, ζ1 = 1, ζ2 = 1
             True APE (percentile)           Mean estimate (s.e.)
Covariate    25        50        75          25               50               75
t1           0.0484    -0.0412   -0.0284     0.0077 (.0077)   -0.0466 (.0082)  -0.0285 (.0058)
t2           0.1474    0.1470    0.0477      0.1256 (.0202)   0.1422 (.0232)   0.0496 (.0104)
x1           0.1338    0.0950    0.0265      0.1173 (.0035)   0.0817 (.0057)   0.0226 (.0033)
x2           0.2112    0.2017    0.0604      0.1979 (.0101)   0.1897 (.0099)   0.0569 (.0081)

Standard errors in parentheses

APPENDIX C
APPLICATION TABLES FOR CHAPTER 1

Table C.1: Probit marriage coefficient estimates

                          (all)              (male only)        (female only)
education                 0.0474 (.0127)     0.0561 (.0176)     0.0418 (.0196)
household income          0.0085 (.0012)     0.0074 (.0016)     0.0069 (.0021)
children at residence     0.5223 (.0368)     0.8566 (.0703)     0.3462 (.0470)
urban                     -0.2565 (.0800)    -0.1025 (.1163)    -0.3888 (.1138)
sexual intercourse        0.0010 (.00017)    0.0008 (.0002)     0.0011 (.0003)
ill w/ treatment          0.0008 (.0227)     0.008 (.0392)      -0.0110 (.0275)
ill w/o treatment         0.0106 (.0189)     0.0292 (.0292)     0.0152 (.0248)
chronic                   0.0936 (.0718)     0.1234 (.1045)     0.1008 (.0984)
work limited by illness   -0.3571 (.1568)    -0.3608 (.2407)    -0.2585 (.1914)
_const                    -1.487 (.2012)     -1.811 (.2725)     -1.118 (.3211)
N                         1388               734                654
NT                        5552               2936               2616

Standard errors in parentheses

Table C.2: Linear marriage coefficient estimates

                          (all)              (male only)        (female only)
education                 0.0155 (.0041)     0.0149 (.0050)     0.0152 (.0070)
household income          0.0029 (.0004)     0.0023 (.0005)     0.0026 (.0008)
children at residence     0.1831 (.0109)     0.2591 (.0140)     0.1280 (.0162)
urban                     -0.0857 (.0270)    -0.0362 (.0350)    -0.1396 (.0411)
sexual intercourse        0.0003 (.000055)   0.0002 (.00006)    0.0004 (.0001)
ill w/ treatment          0.00002 (.0078)    0.0043 (.0122)     -0.0042 (.0100)
ill w/o treatment         0.0031 (.0063)     0.0003 (.0086)     0.0055 (.0090)
chronic                   0.0299 (.0242)     0.0280 (.0319)     0.0356 (.0356)
work limited by illness   -0.1086 (.0458)    -0.0881 (.0536)    -0.0922 (.0650)
_const                    -0.0061 (.0638)    -0.0407 (.0747)    0.0884 (.1146)
N                         1391               736                655
NT                        5564               3944               2620

Standard errors in parentheses

Table C.3: APE Estimates accounting for correlated random effects and a binary EEV

All data                Work                                          Sleep
percentile              25        50        75        90              25             50        75        90
marital status          -0.0121   -0.0173   -0.0260   -0.0390        -0.1244        -0.1311   -0.1346   -0.1421
                        (.0579)   (.0138)   (.0198)   (.0496)        (.0301)        (.0087)   (.0102)   (.0272)
Men only                Work                                          Sleep
marital status          -0.2511   -0.2121   -0.1799   -0.1374        5.38 × 10^-7   -0.0015   -0.0058   -0.0230
                        (.0645)   (.0182)   (.0225)   (.0578)        (.0339)        (.0109)   (.0122)   (.0307)
Women only              Work                                          Sleep
marital status          -0.3194   -0.0952   -0.0214   -0.0014        -0.0159        -0.0388   -0.0344   -0.0146
                        (.0422)   (.0111)   (.0149)   (.0381)        (.0498)        (.0145)   (.0179)   (.0454)

Standard errors in parentheses

Table C.4: APE Estimates w/o integrating out Endogenous Error

All data        Work                                                              Sleep
percentile      25            50            75            90                      25            50            75            90
marital status  2.99 × 10^-4  1.95 × 10^-4  1.27 × 10^-4  1.03 × 10^-4            7.79 × 10^-4  8.92 × 10^-4  2.51 × 10^-4  3.62 × 10^-4
                (.0195)       (.0039)       (.0115)       (.0118)                 (.0067)       (.0013)       (.0028)       (.0034)
Men only        Work                                                              Sleep
marital status  -0.0011       7.33 × 10^-4  5.00 × 10^-4  4.01 × 10^-4            2.55 × 10^-17 1.31 × 10^-35 1.65 × 10^-52 1.93 × 10^-77
                (.0126)       (.0067)       (.0078)       (.0102)                 (.0045)       (.0011)       (.0017)       (.0017)
Women only      Work                                                              Sleep
marital status  2.35 × 10^-4  8.81 × 10^-5  4.54 × 10^-5  4.29 × 10^-5            0.0032        3.11 × 10^-4  1.86 × 10^-5  1.75 × 10^-6
                (.0047)       (.0023)       (.0022)       (.0044)                 (.0322)       (.0095)       (.0134)       (.0134)

Standard errors in parentheses

APPENDIX D
HIGH-LEVEL ASSUMPTIONS AND PROPOSITIONS FOR NUISANCE FUNCTION ESTIMATION IN CHAPTER 2

Assumptions P.1-P.3 are the parametric assumptions that are sufficient to imply Assumption NP.1.

Assumption P.1. (i) G(x; θ) is a parametric model for p(x), where θ ∈ Θ ⊂ R^M and G(x, θ) > 0 for all x ∈ X and θ ∈ Θ, where Θ is compact. (ii) There exists θ_0 ∈ Θ such that p(x) = G(x, θ_0), θ_0 ∈ int(Θ). (iii) G(X; θ) is a.s. twice continuously differentiable in a neighborhood of θ_0, Θ* ⊂ Θ. (iv) θ̂ is a consistent estimator of θ_0 and n^{1/2}(θ̂ - θ_0) = n^{-1/2} Σ_{i=1}^n l_θ(W_i; θ_0) + o_p(1), where W_i = (ΔY_{01}, ΔY_{i1}, D_i, X_i), E[l_θ(W_i; θ_0)] = 0, E[l_θ(W_i; θ_0) l_θ(W_i; θ_0)'] exists and is positive definite, and lim_{δ→0} E[sup_{θ∈Θ*, ||θ-θ_0||≤δ} ||l_θ(W_i; θ) - l_θ(W_i; θ_0)||²] = 0. (v) For some ε > 0, 0 < G(x; θ) ≤ 1 - ε a.s. for all θ ∈ int(Θ).

Assumption P.2.
(i) g(x) = g(x; β) is a parametric model for the conditional mean of ΔY_{0t}, where β ∈ Θ_β ⊂ R^k, Θ_β being compact; (ii) g(X, β) is a.s. continuous at each β ∈ Θ_β; (iii) there exists a unique pseudo-true parameter β* ∈ int(Θ_β); (iv) g(X, β) is a.s. twice continuously differentiable in a neighborhood of β*, Θ_β* ⊂ Θ_β; (v) the estimator β̂ is strongly consistent for β* and satisfies the following linear expansion:

√n(β̂ - β*) = n^{-1/2} Σ_{i=1}^n l_β(W_i; β*) + o_p(1),

where l_β(·; β) is such that E[l_β(W; β*)] = 0, E[l_β(W; β*) l_β(W; β*)'] exists and is positive definite, and lim_{a→0} E[sup_{β∈Θ_β*, ||β-β*||≤a} ||l_β(W; β) - l_β(W; β*)||] = 0.

Assumption P.3. E[||h(W; θ, β)||²] < ∞ and E[sup_{β∈Θ^s, θ∈Θ^s} |ḣ(W; θ, β)|] < ∞, where Θ^s denotes a small neighborhood of θ*, β*, and

h(W; θ, β) = w_0(D, X; θ) 1{ΔY_{ti} ≤ δ} - (w_0(D, X; θ) - w_1(D)) P(ΔY_{0t} ≤ δ, X; β).

These are the standard assumptions found in the literature, such as in Sant'Anna and Zhao (2020). Assumptions P.1 and P.2 imply that the parameters which index p(x; θ) and P(ΔY_{0t} ≤ δ | x; β) are sufficiently smooth and are √n-asymptotically linear. Assumption P.3 is an integrability condition. Assumptions P.1 and P.2 are stronger than Assumption NP.1, while Assumption P.3 is necessary to apply the Weak Law of Large Numbers along with the Central Limit Theorem.

I consider as a nonparametric estimator of the propensity score the sieve logit estimator of Hirano, Imbens, and Ridder (2003), though the proof and the assumptions that I place on that estimator are different, and in some sense relaxed, compared to the conditions in Hirano, Imbens, and Ridder (2003) that are used to prove that the estimator converges uniformly to the true function at o_p(n^{-1/4}).
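The sieve logit idea just introduced can be sketched concretely: approximate the log-odds function by a finite polynomial basis and fit by logit maximum likelihood. The basis order, data generating process, and Newton routine below are illustrative choices, not the dissertation's implementation:

```python
import numpy as np

# Sieve logit sketch: approximate the log-odds m(x) with a polynomial basis
# and estimate the basis coefficients by logit MLE via Newton-Raphson.
rng = np.random.default_rng(4)
n = 20_000
x = rng.uniform(-1, 1, n)
m0 = np.sin(2 * x)                            # true, smooth log-odds function
D = rng.binomial(1, 1 / (1 + np.exp(-m0)))

K = 5
R = np.vander(x, K + 1, increasing=True)      # basis 1, x, ..., x^K

beta = np.zeros(K + 1)
for _ in range(25):                           # Newton steps (concave log-lik.)
    p = 1 / (1 + np.exp(-R @ beta))
    grad = R.T @ (D - p)                      # score of the logit likelihood
    H = R.T @ (R * (p * (1 - p))[:, None])    # negative Hessian
    beta += np.linalg.solve(H, grad)

p_hat = 1 / (1 + np.exp(-R @ beta))
p_true = 1 / (1 + np.exp(-m0))
print(round(float(np.max(np.abs(p_hat - p_true))), 3))
```

Letting the basis order K grow with the sample size, subject to the smoothness and entropy conditions below, is what delivers the o_p(n^{-1/4}) uniform rate.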
When using this estimator, the goal is to approximate p(x) using a series approximation such that m(x) ≈ r̃^K̃(x)' π_K̃, where r̃^K̃(x) = (r_{1K̃}(x), ..., r_{K̃K̃}(x))' and K = K̃ + 1. The estimator is given by

m* = argmax_{m∈H_n} n^{-1} Σ_{i=1}^n [D_i log L(m(x_i)) + (1 - D_i) log(1 - L(m(x_i)))],

where L(a) = exp(a)/(1 + exp(a)) and H_n denotes the sieve space. Let the sieve space be over the s-smooth class of functions, which I denote by

H = Λ_c^s(X) = { m ∈ C^s(X) : sup_{[α]≤s} sup_{x∈X} |D^α m(x)| ≤ c, sup_{[α]=s} sup_{x,y∈X, x≠y} |D^α m(x) - D^α m(y)| / |x - y|_e ≤ c },

where C^s(X) denotes the space of all s-times continuously differentiable functions on X and |·|_e denotes the Euclidean norm. Furthermore, let H_n = { m ∈ H : m(x) = r̃^K̃(x)' π_K̃, |m|_s ≤ c }. I let ||m - m_0||_∞ = sup_{x∈X} |m(x) - m_0(x)| and ℓ(m, x_i) = D_i log L(m(x_i)) + (1 - D_i) log(1 - L(m(x_i))). Let H(w, F_n, ||·||_r) ≔ log(N(w, F_n, ||·||_r)), where N(w, F_n, ||·||_r) is the minimal number of w-balls that cover F_n under ||·||_r, and F_n = { ℓ(m, x_i) - ℓ(m_0, x_i) : ||m - m_0|| ≤ δ, m ∈ H_n }. In addition,

δ_n = inf{ δ ∈ (0, 1) : (1/(√n δ²)) ∫_{bδ²}^{δ} H^{1/2}(w, F_n, ||·||_r) dw ≤ const. },

where b > 0 is a constant.

The assumptions below are sufficient for the consistency of the sieve logit estimator and to satisfy Assumption NP.1. They are based upon conditions in Chen (2007):

Assumption NP.2. (i) E[D log L(m_0(x)) + (1 - D) log(1 - L(m_0(x)))] > -∞, and if E[D log L(m_0(x)) + (1 - D) log(1 - L(m_0(x)))] = -∞, then E[D log L(m(x)) + (1 - D) log(1 - L(m(x)))] < ∞ for all m ∈ H_k \ {m_0}, for all k ≥ 1. (ii) There are functions d(·) and t(·), where d(·) is a non-increasing positive function and t(·) is a positive function, such that for all ε > 0 and for all k ≥ 1, E[D log L(m_0(x)) + (1 - D) log(1 - L(m_0(x)))] - sup_{m∈H_n: ||m-m_0||_∞ ≥ ε} E[D log L(m(x)) + (1 - D) log(1 - L(m(x)))] ≥ d(k) t(ε) > 0.

Assumption NP.3. H_k ⊆ H_{k+1} ⊆ H for all k ≥ 1, and there exists a sequence π_k m_0 ∈ H_k such that ||π_k m_0 - m_0||_∞ → 0 as k → ∞.

Assumption NP.4.
(i) The sieve spaces H_k are compact under ||m_1 - m_2||_∞, where m_1, m_2 ∈ H_k. (ii) lim inf_{k(n)→∞} d(k(n)) > 0, E[ℓ(m, x_i)] is continuous at m = m_0, and E[sup_{m∈H_n} |ℓ(m, x_i)|] is bounded. (iii) E[||x_i||] < ∞.

Assumption NP.5. log(N(δ, H_n, ||·||)) = o(n) for all δ > 0.

Assumption NP.6. There exist p̲ and p̄ such that 0 < p̲ ≤ p(x) ≤ p̄ < 1.

Assumption NP.7. 2a²/(2a + d)² > 1/4, where d is the dimension of X and a = s + α, where m_0(x) is s-times continuously differentiable and |m_0(x) - m_0(y)| ≤ ||x - y||_e^α for x, y ∈ X under the Euclidean norm ||·||_e, for 0 < α ≤ 1.

Assumption NP.2 consists of regularity conditions on the objective function. Assumption NP.3 implies that, on subsets of the entire function space, there exists a sequence of functions whose approximation error converges uniformly to 0 as the subspaces grow in size. Assumption NP.4 implies the existence of a solution at which the objective function is maximized. Assumption NP.5 ensures that the function space does not grow too fast as the sample size increases. Assumption NP.6 strengthens Assumption ID.3 so that the propensity score has upper and lower bounds away from 0 and 1; this is necessary so that the log odds ratio is finite for all x ∈ X. Assumption NP.7 is a restriction on the differentiability and smoothness of the propensity score relative to the dimension of x. This is a weakening of the smoothness assumption in Hirano, Imbens, and Ridder (2003). Using the previous assumptions, I obtain the following result:

Proposition 4. Under Assumptions NP.2-NP.7, ||m̂ - m_0||_∞ = o_p(n^{-1/4}).

The next proof establishes the results for the nonparametric logit sieve estimator as in Hirano, Imbens, and Ridder (2003), but starting from different assumptions. The proof is broken into two parts: first, under a set of regularity conditions, I prove that the estimator is consistent; then, I prove that it achieves the desired rate of convergence. First, I prove the following lemma.
Lemma 1. Suppose $b, c$ are arbitrary constants such that $b, c > 0$ and $b \ne c$. Then
$$\operatorname{sign}\Big(\log\Big(\frac{b}{1+b}\Big) - \log\Big(\frac{c}{1+c}\Big)\Big) \ne \operatorname{sign}\Big(\log\Big(\frac{1}{1+b}\Big) - \log\Big(\frac{1}{1+c}\Big)\Big).$$

Proof. Suppose $b > c$. Then $\log(\tfrac{1}{1+b}) - \log(\tfrac{1}{1+c}) = \log(\tfrac{1+c}{1+b})$. Since $b > c > 0$, then $0 < \tfrac{1+c}{1+b} < 1$, so $\log(\tfrac{1+c}{1+b}) < 0$. Now, suppose that $\log(\tfrac{b}{1+b}) - \log(\tfrac{c}{1+c}) < 0$. Then $\tfrac{b}{1+b} < \tfrac{c}{1+c}$, which implies that $b < c$. This is a contradiction, so $\log(\tfrac{b}{1+b}) - \log(\tfrac{c}{1+c}) > 0$. Now, suppose $c > b$. Then $\log(\tfrac{1+c}{1+b}) > 0$. If $\log(\tfrac{b}{1+b}) - \log(\tfrac{c}{1+c}) > 0$, then $b > c$. This is a contradiction, so $\log(\tfrac{b}{1+b}) - \log(\tfrac{c}{1+c}) < 0$. $\square$

Proof of Proposition 4:

Proof. Note that
$$|\ell(m, x_i) - \ell(m_0, x_i)| = \Big| D_i\Big[\log\frac{\exp(m(x_i))}{1+\exp(m(x_i))} - \log\frac{\exp(m_0(x_i))}{1+\exp(m_0(x_i))}\Big] + (1-D_i)\Big[\log\frac{1}{1+\exp(m(x_i))} - \log\frac{1}{1+\exp(m_0(x_i))}\Big] \Big|.$$
By the preceding lemma, the two bracketed terms have opposite signs, so
$$\Big| D_i[\cdots] + (1-D_i)[\cdots] \Big| \le \Big| \Big[\log\frac{\exp(m(x_i))}{1+\exp(m(x_i))} - \log\frac{\exp(m_0(x_i))}{1+\exp(m_0(x_i))}\Big] - \Big[\log\frac{1}{1+\exp(m(x_i))} - \log\frac{1}{1+\exp(m_0(x_i))}\Big] \Big| = |m(x_i) - m_0(x_i)|.$$
Then $\sup_{m, m_0 \in \mathcal{H} : \|m - m_0\|_\infty \le \delta} |\ell(m, x_i) - \ell(m_0, x_i)| \le \delta$. Hence, Condition (ii) of Theorem 3.5M in Chen (2007) is satisfied. Then, by Theorem 3.5M, $\hat m_n \overset{p}{\to} m_0$ under $\|\cdot\|_\infty$.

Now, to prove the second part of the theorem, consider the $L_2$ metric defined by $\|m - m_0\|_{p,2} = \sqrt{E[(m(x_i) - m_0(x_i))^2]}$. This metric will be used to take advantage of inequalities relating it to $\|\cdot\|_\infty$, to ultimately find the desired rate of uniform convergence. Suppose $\|m - m_0\|_{p,2} \le \epsilon^2$. Note that by the mean value theorem, $\ell(m, x_i) - \ell(m_0, x_i) = \frac{d\ell(\tilde m, x_i)}{dm}[m - m_0]$, where $\tilde m$ lies between $m$ and $m_0$. Condition 3.7 of Chen (2007) is satisfied by the preceding inequality in the first part of this proof.
Lemma 2 in Chen and Shen (1998) implies that $\|m - m_0\|_\infty \le C_1\|m - m_0\|_{p,2}^{2a/(2a+d)}$, where $C_1 > 0$ is a constant. Then Condition 3.8 of Chen (2007) is satisfied with $\sup_{\|m - m_0\| \le \delta}|\ell(m, x_i) - \ell(m_0, x_i)| \le C_1\delta^{2a/(2a+d)}$. Then by Theorem 3.2 in Chen (2007), $\|\hat m - m_0\|_{p,2} = O_p(\epsilon_n)$, with $\epsilon_n = \max\{\delta_n, \|\pi_k m_0 - m_0\|_\infty\}$. Let $u_m = \sup_{m \in \mathcal{H}_n}\|m\|_\infty$, where $\|m\|_\infty = \sup_{x_i \in \mathcal{X}}|m(x_i)|$. Then for all $0 < \epsilon < 1$, $\log N(C_2\epsilon, \mathcal{H}_n, \|\cdot\|_\infty) \le \text{const}\cdot k_n\cdot\log(1 + \tfrac{4u_m}{\epsilon})$ by Lemma 2.5 in van de Geer (2000), where $k_n \uparrow \infty$ as $n \to \infty$ but $\tfrac{k_n}{n} \to 0$. Then,
$$\frac{1}{\sqrt n\,\delta_n^2}\int_{b\delta_n^2}^{\delta_n}\sqrt{H^*(\epsilon, \mathcal{F}_n, \|\cdot\|)}\,d\epsilon \le \frac{1}{\sqrt n\,\delta_n^2}\int_{b\delta_n^2}^{\delta_n}\sqrt{k_n\log\Big(1 + \frac{4u_m}{\epsilon}\Big)}\,d\epsilon \lesssim \frac{1}{\sqrt n\,\delta_n}\sqrt{k_n} \le \text{const}.$$
Then $\delta_n \asymp \sqrt{k_n/n}$, and $\|\pi_k m_0 - m_0\|_\infty = O(k_n^{-a/d})$ by Lorentz (1966). Let $\delta_n \asymp \|\pi_k m_0 - m_0\|_\infty$. Then the optimal rate obtains with $k_n \asymp n^{d/(2a+d)}$, and $\|\hat m_n - m_0\|_{p,2} = O_p(n^{-a/(2a+d)})$. Now, note that $\|\hat m_n - m_0\|_{p,2}^{2a/(2a+d)} = O_p(n^{-2a^2/(2a+d)^2})$. Since $\|m - m_0\|_\infty \le C_1\|m - m_0\|_{p,2}^{2a/(2a+d)}$ and $\|\hat m - m_0\|_\infty = o_p(1)$, then $\|\hat m - m_0\|_\infty = O_p(n^{-2a^2/(2a+d)^2}) = o_p(n^{-1/4})$ by Assumption NP.7. $\square$

Then I can also show, as in Hirano, Imbens, and Ridder (2003), that
$$\sup_{x \in \mathcal{X}}|\hat\pi(x) - p(x)| = \sup_{x \in \mathcal{X}}|L(\hat m(x)) - L(m_0(x))| \lesssim \sup_{x \in \mathcal{X}}|\hat m(x) - m_0(x)| = o_p(n^{-1/4}).$$

The next proof will tackle the case of nonparametric estimation of the conditional CDF, and concerns the asymptotic behavior of the estimator $\hat P(\Delta Y_{0t} \le y|X)$. This estimator is a kernel estimator, though a sieve estimator could also be chosen. The estimator that I have chosen is based upon the estimator of Li and Racine (2008). The assumptions that are needed include (from Li and Racine (2008)):

Assumption C.1. Both $\mu(x)$ and $F(y|x)$ have continuous second-order partial derivatives with respect to $x^c$, where $x^c$ denotes the vector of continuous random variables. For fixed values of $y$ and $x$, $\mu(x) > 0$ and $0 < F(y|x) < 1$.

Assumption C.2.
$w(\cdot)$ is a symmetric, bounded, and compactly supported density function, and $w(\cdot)$ is a Lipschitz function on the compact set $D$.

Assumption C.3. As $n \to \infty$, $h_s \to 0$ for $s = 1, \ldots, q$, $\lambda_s \to 0$ for $s = 1, \ldots, r$, $nh_1\cdots h_q \to \infty$, and $h_0 \to 0$.

Assumption C.4. $F(y|x)$ is twice continuously differentiable in $(y, x^c)$.

Let $|h| = \sum_{s=1}^q h_s$ and $|\lambda| = \sum_{s=1}^r \lambda_s$, where $0 \le \lambda_s \le 1$, and let $W_h(X_i^c, x^c) = \prod_{s=1}^q h_s^{-1}w((X_{is}^c - x_s^c)/h_s)$. Let $\hat\mu(x) = n^{-1}\sum_{i=1}^n W_h(X_i^c, x^c)$ and
$$\tilde F(y|x^c) = \frac{n^{-1}\sum_{i=1}^n G\big(\tfrac{y - \Delta Y_i}{h_0}\big)W_h(X_i^c, x^c)}{\hat\mu(x)}.$$
$G(\cdot)$ is the distribution function with corresponding density function $w(\cdot)$; $h_s$ is the bandwidth associated with the continuous variable $x_s^c$, and $h_0$ is the bandwidth associated with $\Delta Y_i$.¹ We then have the following result:

Proposition 5. Suppose Assumptions C.1–C.4 hold and $h_0 = h$. Then
$$\sup_{x \in D}|\tilde F(y|x^c) - F(y|x^c)| = O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(h^2).$$

The restriction that $h_0 = h$ will be used to achieve the rate of convergence of $o_p(n^{-1/4})$. I need to establish four lemmas before proving Proposition 5. The following lemma is similar to Lemma 1 in Li and Racine (2008).

Lemma 2. Under Assumptions C.1–C.3, $E[\hat\mu(x)] = \mu(x) + O(|h|^2)$.

Proof.
$$E[\hat\mu(x)] = \int \mu(x_i^c)\,W\Big(\frac{x_i^c - x^c}{h}\Big)\prod_{s=1}^q h_s^{-1}\,dx_i^c = \int \mu(x^c + hv)\,k(v)\,dv$$
$$= \int \Big[\mu(x^c) + \sum_{s=1}^q \mu_s(x^c)h_sv_s + \frac{1}{2}\sum_{s=1}^q\sum_{\ell=1}^q \mu_{s\ell}(x^c)h_sh_\ell v_sv_\ell\Big]k(v)\,dv + O(|h|^3)$$
$$= \mu(x^c) + \frac{\kappa}{2}\sum_{s=1}^q \mu_{ss}(x^c)h_s^2 + O(|h|^3) = \mu(x^c) + O(|h|^2),$$
where $\kappa = \int v^2k(v)\,dv$. $\square$

¹ In principle, the estimator here could be the estimator of Li and Racine (2008), where the covariates can be discrete and ordered; however, in order to cite particular theorems from Rothe and Firpo (2013), I am only considering an estimator that allows for continuous covariates, though as noted by Rothe and Firpo (2019), the results could be modified to allow for discrete covariates.

Lemma 3.
Under Assumptions C.1–C.4,
$$E[\hat\mu(x)\tilde F(y|x^c)] = \mu(x)F(y|x^c) + \mu(x)\sum_{s=1}^q h_s^2B_s(y, x) + o(|h|^2) + o(h_0^2).$$

Proof. See Theorem 6.2(i) in Li and Racine (2007). $\square$

Now, the rate of uniform convergence proof largely follows Masry (1996). Furthermore, Li and Racine (2008) show that the optimum (minimizing the integrated mean square error) occurs when $h_1, \ldots, h_q$ all converge to 0 at the same rate. I will denote this common $h$ by $h_{\min}$.

Lemma 4. Under Assumptions C.1–C.4, $\sup_{x \in D}|\hat\mu(x) - \mu(x)| = O_p\big(\tfrac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\big) + O_p(|h|^2)$.

Proof. Note that by the triangle inequality,
$$|\hat\mu(x) - \mu(x)| = |\hat\mu(x) - \mu(x) - E[\hat\mu(x)] + E[\hat\mu(x)]| \le |\mu(x) - E[\hat\mu(x)]| + |E[\hat\mu(x)] - \hat\mu(x)|.$$
By Lemma 2, $|\mu(x) - E[\hat\mu(x)]| = O(|h|^2)$. Then it is sufficient to find the rate for $|E[\hat\mu(x)] - \hat\mu(x)|$. Since $D$ is compact, it can be covered by a finite number $L = L(n)$ of cubes $I_k = I_{n,k}$ with centers $x_{k,n}$ having sides of length $\ell_n$ for $k = 1, \ldots, L(n)$. Clearly $\ell_n = \text{const}/L^{1/d}(n)$. Since $D$ is compact, write
$$\sup_{x \in D}|E[\hat\mu(x)] - \hat\mu(x)| \le \max_{1 \le k \le L_n}\sup_{x \in D\cap I_k}|\hat\mu(x) - \hat\mu(x_{k,n})| + \max_{1 \le k \le L_n}|E[\hat\mu(x_{k,n})] - \hat\mu(x_{k,n})| + \max_{1 \le k \le L_n}\sup_{x \in D\cap I_k}|E[\hat\mu(x_{k,n})] - E[\hat\mu(x)]| \equiv Q_1 + Q_2 + Q_3.$$
Since each kernel is Lipschitz, and the product of bounded Lipschitz functions is a Lipschitz function,
$$Q_1 \le |\hat\mu(x) - \hat\mu(x_{k,n})| \le |W_h(X_i^c, x^c) - W_h(X_i^c, x_{k,n}^c)| \le (C_2/h_{\min}^{q+1})\sup_{x \in D\cap I_k}|x - x_{k,n}| \le C_2\ell_n/h_{\min}^{q+1}.$$
Let $\ell_n = (\ln(n))^{1/2}h^{(q+2)/2}/n^{1/2}$. Then $Q_1 = O((\ln(n)/(nh^q))^{1/2})$. Similarly, $Q_3 = O((\ln(n)/(nh^q))^{1/2})$. Now let $W_n(x) = \hat\mu(x) - E[\hat\mu(x)] = \sum_{i=1}^n Z_{n,i}$, where
$$Z_{n,i} = (nh_{\min}^q)^{-1}\big[W_h(X_i^c, x^c) - E[W_h(X_i^c, x^c)]\big].$$
For $\eta > 0$, we have
$$P[Q_2 > \eta] \le P\big[\max_{1 \le k \le L_n}|W_n(x_{k,n})| > \eta\big] \le P\big[|W_n(x_{1,n})| > \eta \text{ or } \ldots \text{ or } |W_n(x_{L(n),n})| > \eta\big] \le L(n)\sup_{x \in D}P[|W_n(x)| > \eta].$$
Since $\hat\mu(\cdot)$ is bounded, and letting $A = \sup_{x \in D}|\hat\mu(x)|$, we have $|Z_{n,i}| \le 2A/(nh_{\min}^q)$ for all $i = 1, \ldots, n$.
Define $\lambda_n = (nh_{\min}^q\ln(n))^{1/2}$. Then $\lambda_n|Z_{n,i}| \le 2A(\ln(n)/(nh_{\min}^q))^{1/2} \le 1/2$ for all $i = 1, \ldots, n$, for $n$ sufficiently large. Using the inequality $e^x \le 1 + x + x^2$ for $|x| \le 1/2$, we have $e^{\lambda_nZ_{n,i}} \le 1 + \lambda_nZ_{n,i} + \lambda_n^2Z_{n,i}^2$. Hence, $E[e^{\lambda_nZ_{n,i}}] \le 1 + \lambda_n^2E[Z_{n,i}^2] \le e^{\lambda_n^2E[Z_{n,i}^2]}$. Then,
$$P[|W_n(x)| > \eta] = P\Big[\Big|\sum_{i=1}^n Z_{n,i}\Big| > \eta\Big] = P\Big[\sum_{i=1}^n Z_{n,i} > \eta\Big] + P\Big[\sum_{i=1}^n Z_{n,i} < -\eta\Big] \le P\Big[\sum_{i=1}^n Z_{n,i} > \eta\Big] + P\Big[-\sum_{i=1}^n Z_{n,i} > \eta\Big]$$
$$\le e^{-\lambda_n\eta}\Big(E\big[e^{\lambda_n\sum_{i=1}^nZ_{n,i}}\big] + E\big[e^{-\lambda_n\sum_{i=1}^nZ_{n,i}}\big]\Big) \le 2e^{-\lambda_n\eta}e^{\lambda_n^2\sum_{i=1}^nE(Z_{n,i}^2)} \le 2e^{-\lambda_n\eta + A\lambda_n^2/(nh_{\min}^q)}.$$
Then $\sup_{x \in D}P[|W_n(x)| > \eta] \le 2e^{-\lambda_n\eta + A\lambda_n^2/(nh_{\min}^q)}$. Let $\lambda_n\eta = C_3\ln(n)$ and choose $\lambda_n = (nh_{\min}^q\ln(n))^{1/2}$. Then $\lambda_n\eta - A\lambda_n^2/(nh_{\min}^q) = C_3\ln(n) - A\ln(n) = \alpha\ln(n)$, where $\alpha = C_3 - A$. Since $\sup_{x \in D}P[|W_n(x)| > \eta] \le 2e^{-\lambda_n\eta + A\lambda_n^2/(nh_{\min}^q)}$ and $P[Q_2 > \eta] \le L(n)\sup_{x \in D}P[|W_n(x)| > \eta]$, then $P[Q_2 > \eta_n] \le 2L(n)/n^\alpha$. Choose $C_3$ sufficiently large and $L(n)$ such that $\sum_{n=1}^\infty P[|Q_2/\eta_n| > 1] \le 4\sum_{n=1}^\infty L(n)/n^\alpha < \infty$. Then by the Borel–Cantelli lemma, $Q_2 = O_p((\ln(n))^{1/2}/(nh_{\min}^q)^{1/2})$. $\square$

Similarly, by Lemma 3 and by a result analogous to Lemma 4,
$$\sup_{x \in D}|\hat\mu(x)\tilde F(y|x^c) - \mu(x)F(y|x^c)| = O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(h_0^2) + O_p(|h|^2).$$
Then I have the following result.

Proof of Proposition 5:

Proof. Note that $\tilde F(y|x^c) = \frac{\hat\mu(x)\tilde F(y|x^c)}{\hat\mu(x)} = \frac{\hat\mu(x)\tilde F(y|x^c)/\mu(x)}{\hat\mu(x)/\mu(x)}$.
By Lemma 4,
$$\sup_{x \in D}|\hat\mu(x) - \mu(x)| = O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(|h|^2).$$
Then, by Lemma 4,
$$\sup_{x \in D}\Big|\frac{\hat\mu(x)}{\mu(x)} - 1\Big| = \sup_{x \in D}\Big|\frac{\hat\mu(x) - \mu(x)}{\mu(x)}\Big| \le \frac{O_p\big(\tfrac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\big) + O_p(|h|^2)}{\inf_{x \in D}\mu(x)} = O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(|h|^2).$$
Similarly, since
$$\sup_{x \in D}|\hat\mu(x)\tilde F(y|x^c) - \mu(x)F(y|x^c)| = O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(h_0^2) + O_p(|h|^2),$$
then
$$\sup_{x \in D}\Big|\frac{\hat\mu(x)\tilde F(y|x^c)}{\mu(x)} - F(y|x^c)\Big| \le \frac{O_p\big(\tfrac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\big) + O_p(h_0^2) + O_p(|h|^2)}{\inf_{x \in D}\mu(x)} = O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(h_0^2) + O_p(|h|^2).$$
Then,
$$\tilde F(y|x^c) = \frac{\hat\mu(x)\tilde F(y|x^c)/\mu(x)}{\hat\mu(x)/\mu(x)} = \frac{F(y|x^c) + O_p\big(\tfrac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\big) + O_p(h_0^2) + O_p(|h|^2)}{1 + O_p\big(\tfrac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\big) + O_p(|h|^2)} = F(y|x^c) + O_p\Big(\frac{\ln(n)^{1/2}}{(nh^q)^{1/2}}\Big) + O_p(h^2),$$
using $h_0 = h$. $\square$

APPENDIX E

PROOFS OF MAJOR THEOREMS AND PROPOSITIONS FOR CHAPTER 2

The first proof is of the identification result in Theorem 3.

Proof of Theorem 3:

Proof. Note that by Theorem 1 in Callaway and Li (2019), the first portion of the result is proven. All that remains is to show that
$$F_{\Delta Y_{0t}|D=1}(y) = E\Big[\frac{1-D}{p}\frac{\pi(X)}{1-\pi(X)}\,1\{\Delta Y_t \le y\} - \Big(\frac{1-D}{p}\frac{\pi(X)}{1-\pi(X)} - \frac{D}{p}\Big)\tilde P(\Delta Y_{0t} \le y|X)\Big]$$
if $\pi(X) = p(X)$ a.c., or $\tilde P(\Delta Y_{0t} \le y|X) = P(\Delta Y_{0t} \le y|X)$ a.c. Suppose $\pi(X) = p(X)$ a.c. Then,
$$E\Big[\frac{(1-D)p(X)}{p(1-p(X))}\,1\{\Delta Y_t \le y\} - \Big(\frac{(1-D)p(X)}{p(1-p(X))} - \frac{D}{p}\Big)\tilde P(\Delta Y_{0t} \le y|X)\Big]$$
$$= E\Big[\frac{p(X)P(\Delta Y_t \le y|X, D=0)}{p}\Big] - E\Big[\frac{p(X)\tilde P(\Delta Y_{0t} \le y|X)}{p}\Big] + E\Big[\frac{p(X)\tilde P(\Delta Y_{0t} \le y|X)}{p}\Big]$$
$$= E\Big[\frac{p(X)P(\Delta Y_{0t} \le y|X, D=0)}{p}\Big] = E\Big[\frac{p(X)P(\Delta Y_{0t} \le y|X, D=1)}{p}\Big] = E\Big[\frac{P(\Delta Y_{0t} \le y, D=1|X)}{p}\Big] = P(\Delta Y_{0t} \le y|D=1) = F_{\Delta Y_{0t}|D=1}(y).$$
Now, suppose $\pi(X) \ne p(X)$ a.c. and $\tilde P(\Delta Y_{0t} \le y|X) = P(\Delta Y_{0t} \le y|X)$ a.c. Then,
$$E\Big[\frac{(1-D)\pi(X)}{p(1-\pi(X))}\,1\{\Delta Y_t \le y\} - \Big(\frac{(1-D)\pi(X)}{p(1-\pi(X))} - \frac{D}{p}\Big)P(\Delta Y_{0t} \le y|X)\Big]$$
$$= E\Big[\frac{p(X)P(\Delta Y_{0t} \le y|X, D=1)}{p}\Big] + E\Big[\frac{(1-p(X))\pi(X)P(\Delta Y_{0t} \le y|X, D=0)}{p(1-\pi(X))}\Big] - E\Big[\frac{(1-p(X))\pi(X)P(\Delta Y_{0t} \le y|X, D=0)}{p(1-\pi(X))}\Big]$$
$$= E\Big[\frac{P(\Delta Y_{0t} \le y, D=1|X)}{p}\Big] = P(\Delta Y_{0t} \le y|D=1) = F_{\Delta Y_{0t}|D=1}(y). \quad\square$$

The next proof holds in either the parametric or nonparametric subcase, though this proof is for a parametric submodel; the nonparametric submodel proceeds similarly. For the purpose of estimation of $F_{\Delta Y_{0t}|D=1}(y)$, only the period of treatment and the period prior need to be considered; if there are additional pre-treatment and post-treatment periods, they are not relevant to the density of the data that is used to estimate $F_{\Delta Y_{0t}|D=1}(y)$. The proof itself is similar to a result in Sant'Anna and Zhao (2020).

Proof of Theorem 4:

Proof. The density of $(y_t(1), y_t(0), y_{t-1}(0), d, x)$ with respect to some sigma-finite measure on $\mathbb{R}^3 \times \{0,1\} \times \mathbb{R}^k$ is given by
$$\bar f(y_t(1), y_t(0), y_{t-1}(0), d, x) = \bar f(y_t(1), y_t(0), y_{t-1}(0)|D=1, x)^d\,p(x)^d\,\bar f(y_t(1), y_t(0), y_{t-1}(0)|D=0, x)^{1-d}(1-p(x))^{1-d}f(x).$$
The density of the observed data is
$$f(y_t, y_{t-1}, d, x) = f_1(y_t, y_{t-1}|D=1, x)^d\,p(x)^d\,f_0(y_t, y_{t-1}|D=0, x)^{1-d}(1-p(x))^{1-d}f(x),$$
where
$$f_1(\cdot, \cdot|D=1, x) = \int \bar f(\cdot, y_t(0), \cdot|D=1, x)\,dy_t(0), \qquad f_0(\cdot, \cdot|D=0, x) = \int \bar f(y_t(1), \cdot, \cdot|D=0, x)\,dy_t(1).$$
Consider a parametric submodel indexed by a parameter $\theta$,
$$f_\theta(y_t, y_{t-1}, d, x) = f_{1,\theta}(y_t, y_{t-1}|D=1, x)^d\,p_\theta(x)^d\,f_{0,\theta}(y_t, y_{t-1}|D=0, x)^{1-d}(1-p_\theta(x))^{1-d}f_\theta(x),$$
which equals $f(y_t, y_{t-1}, d, x)$ when $\theta = \theta_0$. The score is
$$s_\theta(y_t, y_{t-1}, d, x) = d\,s_{1\theta}(y_t, y_{t-1}|D=1, x) + (1-d)s_{0\theta}(y_t, y_{t-1}|D=0, x) + \dot p_\theta(x)\frac{d - p_\theta(x)}{p_\theta(x)(1-p_\theta(x))} + t_\theta(x),$$
where, for $d = 0, 1$,
$$s_{d\theta}(y_t, y_{t-1}|D=d, x) = \frac{d}{d\theta}\log f_{d,\theta}(y_t, y_{t-1}|D=d, x), \qquad \dot p_\theta(x) = \frac{d}{d\theta}p_\theta(x), \qquad t_\theta(x) = \frac{d}{d\theta}\log f_\theta(x).$$
Then the tangent space is
$$\mathcal{F} = \big\{d\,s_1(y_t, y_{t-1}|D=1, x) + (1-d)s_0(y_t, y_{t-1}|D=0, x) + a(x)(d - p(x)) + t(x)\big\},$$
where $\int s_d(y_t, y_{t-1}|D=d, x)f_d(y_t, y_{t-1}|D=d, x)\,dy_t\,dy_{t-1} = 0$ for all $x$ and $d = 0, 1$, $\int t(x)f(x)\,dx = 0$, and $a(x)$ is any square-integrable function of $x$. Under the assumption that $\Delta Y_{0t} \perp\!\!\!\perp D|X$, note that for $\tau = F_{\Delta Y_{0t}|D=1}(\delta)$,
$$\tau = E[E[1\{\Delta Y_{0t} \le \delta\}|D=1, X]|D=1] = E[E[1\{\Delta Y_{0t} \le \delta\}|D=0, X]|D=1].$$
For the parametric submodel under consideration, I note that
$$\tau(\theta) = \frac{\iiint 1\{y_t \le \delta + y_{t-1}\}\,p_\theta(x)f_{0,\theta}(y_t, y_{t-1}|D=0, x)f_\theta(x)\,dy_t\,dy_{t-1}\,dx}{\int p_\theta(x)f_\theta(x)\,dx}.$$
Then,
$$\frac{\partial\tau(\theta_0)}{\partial\theta} = \frac{\iiint 1\{y_t \le \delta + y_{t-1}\}\,p(x)s_0(y_t, y_{t-1}|D=0, x)f_0(y_t, y_{t-1}|D=0, x)f(x)\,dy_t\,dy_{t-1}\,dx}{p}$$
$$+ \frac{\int P(\Delta Y_{0t} \le \delta|x, D=0)p(x)t(x)f(x)\,dx}{p} + \frac{\int P(\Delta Y_{0t} \le \delta|x, D=0)\dot p(x)f(x)\,dx}{p} - \frac{\tau\int[\dot p(x) + p(x)t(x)]f(x)\,dx}{p}.$$
Let the initial choice of an influence function be
$$F_\tau(Y_t, Y_{t-1}, D, X) = \frac{(1-D)p(X)}{p(1-p(X))}1\{\Delta Y_t \le \delta\} - \frac{(1-D)p(X)}{p(1-p(X))}P(\Delta Y_t \le \delta|X, D=0) + \frac{D}{p}P(\Delta Y_t \le \delta|X, D=0) - \frac{D}{p}\tau$$
$$= \frac{(1-D)p(X)}{p(1-p(X))}1\{\Delta Y_t \le \delta\} - \frac{(1-D)p(X)}{p(1-p(X))}P(\Delta Y_{0t} \le \delta|X) + \frac{D}{p}P(\Delta Y_{0t} \le \delta|X) - \frac{D}{p}\tau.$$
Note that for the parametric submodel with score $s_\theta(y_t, y_{t-1}, d, x)$, I can conclude that $\tau$ is a differentiable parameter since
$$\frac{\partial\tau(\theta_0)}{\partial\theta} = E[F_\tau(Y_t, Y_{t-1}, D, X)\,s_\theta(Y_t, Y_{t-1}, D, X)].$$
Since $F_\tau \in \mathcal{F}$, then by Theorem 3.1 of Newey (1990), $F_\tau(Y_t, Y_{t-1}, D, X)$ is the efficient influence function for $F_{\Delta Y_{0t}|D=1}(\delta)$. $\square$

Proof of Theorem 5:

Proof. Consistency of the estimator, nonparametric case:
$$\hat F_{\Delta Y_{0t}|D=1}(y) = n^{-1}\sum_{i=1}^n\Big[\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\Big)1\{\Delta Y_{ti} \le y\} - \Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)} - \frac{D_i}{n^{-1}\sum_kD_k}\Big)\hat P(\Delta Y_{0t} \le y|x_i)\Big].$$
Suppose that $\hat\pi(x) \overset{p}{\to} \pi(x)$; furthermore, $n^{-1}\sum_{k=1}^nD_k \overset{p}{\to} p$. Then by the WLLN and the Continuous Mapping Theorem,
$$n^{-1}\sum_{i=1}^n\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\Big)1\{\Delta Y_{ti} \le y\} \overset{p}{\to} E\Big[\frac{1-D}{p}\frac{\pi(x)}{1-\pi(x)}1\{\Delta Y_t \le y\}\Big],$$
and
$$n^{-1}\sum_{i=1}^n\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)} - \frac{D_i}{n^{-1}\sum_kD_k}\Big)\hat P(\Delta Y_{0t} \le y|x_i)$$
converges in probability to
$$E\Big[\Big(\frac{1-D_i}{p}\frac{\pi(x_i)}{1-\pi(x_i)} - \frac{D_i}{p}\Big)\tilde P(\Delta Y_{0t} \le y|x)\Big].$$
This implies that $\hat F_{\Delta Y_{0t}|D=1}(y)$ converges in probability to
$$E\Big[\frac{1-D}{p}\frac{\pi(x)}{1-\pi(x)}1\{\Delta Y_t \le y\}\Big] - E\Big[\Big(\frac{1-D}{p}\frac{\pi(x)}{1-\pi(x)} - \frac{D}{p}\Big)\tilde P(\Delta Y_{0t} \le y|x)\Big].$$
If $\pi(X) = p(X)$ a.c. or $\tilde P(\Delta Y_{0t} \le y|X) = P(\Delta Y_{0t} \le y|X)$ a.c., then by the previous theorem this limit equals $F_{\Delta Y_{0t}|D=1}(y)$.

The next proof follows partly from the proof of Theorem 2(b) in Rothe and Firpo (2019). In particular, the object is to expand the doubly robust moment condition and demonstrate that each term converges in probability to 0 at the desired rate. In the parametric case, the proof is very similar to Sant'Anna and Zhao (2020); the proof in the nonparametric case is also similar to Fan et al. (2016) when the nuisance function is estimated using a sieve approach. Now, I will expand
$$\hat F_{\Delta Y_{0t}|D=1}(y) - F_{\Delta Y_{0t}|D=1}(y) = n^{-1}\sum_{i=1}^n\Big[\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\Big)1\{\Delta Y_{ti} \le y\} - \Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)} - \frac{D_i}{n^{-1}\sum_kD_k}\Big)\hat P(\Delta Y_{0t} \le y|x_i)\Big] - F_{\Delta Y_{0t}|D=1}(y).$$
Let
$$\psi_i^1 = \frac{1-D_i}{p(1-\hat\pi(x_i))^2}\big[1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x_i)\big], \qquad \psi_i^2 = -\Big(\frac{1-D_i}{p}\frac{\pi(x_i)}{1-\pi(x_i)} - \frac{D_i}{p}\Big), \qquad \psi_i^{22} = 0,$$
$$\psi_i^{11} = \frac{2(1-D_i)}{p(1-\hat\pi(x_i))^3}\big[1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x_i)\big], \qquad \psi_i^{12} = -\frac{1-D_i}{p(1-\hat\pi(x_i))^2},$$
$$\psi_i^{13} = -\frac{1-D_i}{\hat p^2(1-\pi(x_i))^2}\big[1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x_i)\big], \qquad \psi_i^{23} = \frac{1-D_i}{\hat p^2}\frac{\pi(x_i)}{1-\pi(x_i)} - \frac{D_i}{\hat p^2},$$
$$\psi_i^3 = -\frac{1-D_i}{\hat p^2}\frac{\pi(x_i)}{1-\pi(x_i)}1\{\Delta Y_t \le y\} + \Big(\frac{1-D_i}{\hat p^2}\frac{\pi(x_i)}{1-\pi(x_i)} - \frac{D_i}{\hat p^2}\Big)P(\Delta Y_{0t} \le y|x_i) + \frac{D_i}{\hat p^2}F_{\Delta Y_{0t}|D=1}(y),$$
$$\psi_i^{33} = \frac{2(1-D_i)}{\hat p^3}\frac{\pi(x_i)}{1-\pi(x_i)}1\{\Delta Y_t \le y\} - \Big(\frac{2(1-D_i)}{\hat p^3}\frac{\pi(x_i)}{1-\pi(x_i)} - \frac{2D_i}{\hat p^3}\Big)P(\Delta Y_{0t} \le y|x_i) - \frac{2D_i}{\hat p^3}F_{\Delta Y_{0t}|D=1}(y),$$
and let $\phi_n(\hat p, \hat\pi, \hat P) = n^{-1}\sum_{i=1}^n\phi(D_i, \hat\pi(x_i), \hat P(\Delta Y_{0t} \le y|x_i), \hat p)$ denote the sample moment above. Then,
$$\phi_n(\hat p, \hat\pi, \hat P) - \phi_n(p, \pi, P) = \frac{1}{n}\sum_{i=1}^n\psi_i^1(\hat\pi(x_i) - \pi(x_i)) + \frac{1}{n}\sum_{i=1}^n\psi_i^2(\hat P(\Delta Y_{0t} \le y|x_i) - P(\Delta Y_{0t} \le y|x_i)) + \frac{1}{n}\sum_{i=1}^n\psi_i^3(\hat p - p)$$
$$+ \frac{1}{n}\sum_{i=1}^n\psi_i^{11}(\hat\pi(x_i) - \pi(x_i))^2 + \frac{1}{n}\sum_{i=1}^n\psi_i^{12}(\hat P(\Delta Y_{0t} \le y|x_i) - P(\Delta Y_{0t} \le y|x_i))(\hat\pi(x_i) - \pi(x_i)) + \frac{1}{n}\sum_{i=1}^n\psi_i^{13}(\hat\pi(x_i) - \pi(x_i))(\hat p - p)$$
$$+ \frac{1}{n}\sum_{i=1}^n\psi_i^{23}(\hat P(\Delta Y_{0t} \le y|x_i) - P(\Delta Y_{0t} \le y|x_i))(\hat p - p) + \frac{1}{n}\sum_{i=1}^n\psi_i^{22}(\hat P(\Delta Y_{0t} \le y|x_i) - P(\Delta Y_{0t} \le y|x_i))^2 + \frac{1}{n}\sum_{i=1}^n\psi_i^{33}(\hat p - p)^2$$
$$+ O_p(\|\hat\pi - \pi\|_\infty^3) + O_p(\|\hat P - P\|_\infty^3) + O_p(\|\hat p - p\|^3).$$
It remains to show that each term is $o_p(n^{-1/2})$. Each term other than the first converges either due to Rothe and Firpo (2019) or due to the fact that $|\hat\pi(x_i) - \pi(x_i)| = o_p(n^{-1/4})$. For the first term, let $\mathbb{G}_n(f_0) = n^{1/2}(P_n - P)f_0(D, \Delta Y_t, x)$, where $P_n$ is the empirical measure, $P$ is the expectation, and
$$f_0(D, \Delta Y_t, x) = \frac{(1-D)\big(1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x)\big)}{p(1-\pi(x))^2}(\hat\pi(x) - \pi(x)).$$
Since $\sup_{x \in \mathcal{X}}|\hat\pi(x) - \pi(x)| \lesssim O(k_n^{-a/d}) = o_p(1)$ by Theorem 3.2 in Chen (2007) and the proof of Proposition 4, define $\mathcal{F} = \{f_0 : \|\hat\pi(x) - \pi(x)\|_\infty \le \delta_n\}$, where $\delta_n = Ck_n^{-a/d}$ for some $C > 0$. By Lemma 3 in Rothe and Firpo (2013), $Pf_0(D, \Delta Y_t, x) = 0$. By the Markov inequality and Corollary 19.35 of van der Vaart (2000),
$$n^{-1/2}\sum_{i=1}^n\psi_i^1(\hat\pi(x_i) - \pi(x_i)) \le \sup_{f_0 \in \mathcal{F}}|\mathbb{G}_n(f_0)| \lesssim J_{[]}(\|F_0\|_{p,2}, \mathcal{F}, L_2(p)),$$
where $J_{[]}(\|F_0\|_{p,2}, \mathcal{F}, L_2(p))$ is the bracketing integral and $F_0$ is the envelope function. Since $p$ and $\pi(x)$ are bounded away from 0, then
$$|f_0(D, \Delta Y_t, x)| \lesssim \delta_n\big|1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x)\big| \equiv F_0.$$
Then, since $1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x)$ is bounded by 1, $\|F_0\|_{p,2} \lesssim \delta_n$. Then,
$$\log N_{[]}(\epsilon, \mathcal{F}, L_2(p)) \lesssim \log N_{[]}(\epsilon/\delta_n, \mathcal{F}_0, L_2(p)) \lesssim \log N_{[]}(\epsilon/\delta_n, \Lambda_c^a(\mathcal{X}), L_2(p)) \lesssim (\delta_n/\epsilon)^{d/a},$$
where the last inequality follows by Corollary 2.7.2 in van der Vaart and Wellner (1996). Then,
$$J_{[]}(\|F_0\|_{p,2}, \mathcal{F}, L_2(p)) \lesssim \int_0^{\delta_n}\sqrt{\log N_{[]}(\epsilon, \mathcal{F}, L_2(p))}\,d\epsilon \lesssim \int_0^{\delta_n}(\delta_n/\epsilon)^{d/2a}\,d\epsilon \overset{n \to \infty}{\longrightarrow} 0,$$
where the integral converges to zero since $d/a < 2$ by Assumption NP.7. Then $n^{-1/2}\sum_{i=1}^n\psi_i^1(\hat\pi(x_i) - \pi(x_i)) = o_p(1)$.

Consistency of the estimator, parametric case:
$$\hat F_{\Delta Y_{0t}|D=1}(y) = n^{-1}\sum_{i=1}^n\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\pi(x_i; \hat\gamma)}{1-\pi(x_i; \hat\gamma)}\Big)1\{\Delta Y_{ti} \le y\} - n^{-1}\sum_{i=1}^n\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\pi(x_i; \hat\gamma)}{1-\pi(x_i; \hat\gamma)} - \frac{D_i}{n^{-1}\sum_kD_k}\Big)\hat P(\Delta Y_{0t} \le y|X_i; \hat\beta).$$
Suppose that $\hat\gamma \overset{p}{\to} \gamma^*$ and $\hat\beta \overset{p}{\to} \beta^*$; furthermore, $n^{-1}\sum_{k=1}^nD_k \overset{p}{\to} p$. Then by the WLLN and the Continuous Mapping Theorem,
$$n^{-1}\sum_{i=1}^n\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\pi(x_i; \hat\gamma)}{1-\pi(x_i; \hat\gamma)}\Big)1\{\Delta Y_{ti} \le y\} \overset{p}{\to} E\Big[\frac{1-D}{p}\frac{\pi(x; \gamma^*)}{1-\pi(x; \gamma^*)}1\{\Delta Y_t \le y\}\Big],$$
and
$$n^{-1}\sum_{i=1}^n\Big(\frac{1-D_i}{n^{-1}\sum_kD_k}\frac{\pi(x_i; \hat\gamma)}{1-\pi(x_i; \hat\gamma)} - \frac{D_i}{n^{-1}\sum_kD_k}\Big)\hat P(\Delta Y_{0t} \le y|X_i; \hat\beta)$$
converges in probability to
$$E\Big[\Big(\frac{1-D_i}{p}\frac{\pi(x_i; \gamma^*)}{1-\pi(x_i; \gamma^*)} - \frac{D_i}{p}\Big)\tilde P(\Delta Y_{0t} \le y|x_i; \beta^*)\Big].$$
This implies that $\hat F_{\Delta Y_{0t}|D=1}(y)$ converges in probability to
$$E\Big[\frac{1-D}{p}\frac{\pi(x; \gamma^*)}{1-\pi(x; \gamma^*)}1\{\Delta Y_t \le y\}\Big] - E\Big[\Big(\frac{1-D}{p}\frac{\pi(x; \gamma^*)}{1-\pi(x; \gamma^*)} - \frac{D}{p}\Big)\tilde P(\Delta Y_{0t} \le y|X; \beta^*)\Big].$$
If $\pi(X; \gamma^*) = p(X)$ a.c. or $\tilde P(\Delta Y_{0t} \le y|X; \beta^*) = P(\Delta Y_{0t} \le y|X)$ a.c., then by the previous theorem this limit equals $F_{\Delta Y_{0t}|D=1}(y)$.

Note that
$$\hat F_{\Delta Y_{0t}|D=1}(y) - F_{\Delta Y_{0t}|D=1}(y) = (\widehat{CDF}_1 - CDF_1) - (\widehat{CDF}_2 - CDF_2),$$
where the first difference collects the terms in $1\{\Delta Y_{ti} \le y\}$ and the second collects the terms in the estimated conditional cdf. Let
$$w_0(D, x; \hat\gamma) = \frac{1-D}{n^{-1}\sum_kD_k}\frac{\pi(x; \hat\gamma)}{1-\pi(x; \hat\gamma)}.$$
Then,
$$\sqrt n(\widehat{CDF}_1 - CDF_1) = n^{-1/2}\sum_{i=1}^n\big(w_0(D_i, x_i; \hat\gamma)1\{\Delta Y_{ti} \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big)$$
$$= n^{-1/2}\sum_{i=1}^n\big(\tilde w_0(D_i, x_i; \hat\gamma)1\{\Delta Y_{ti} \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big) - n^{-1/2}\sum_{i=1}^n\big(\tilde w_0(D_i, x_i; \hat\gamma) - 1\big)E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}] + o_p(1)$$
$$= n^{-1/2}\sum_{i=1}^n\tilde w_0(D_i, x_i; \hat\gamma)\big(1\{\Delta Y_{ti} \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big) + o_p(1),$$
where
$$\tilde w_0(D, x; \hat\gamma) = \frac{\pi(x; \hat\gamma)(1-D)}{1-\pi(x; \hat\gamma)}\Big/E\Big[\frac{\pi(X; \gamma^*)(1-D)}{1-\pi(X; \gamma^*)}\Big].$$
Then, I do a second-order Taylor expansion around $\gamma^*$, so that
$$\sqrt n(\widehat{CDF}_1 - CDF_1) = n^{-1/2}\sum_{i=1}^nw_0(D_i, x_i; \gamma^*)\big(1\{\Delta Y_{ti} \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big)$$
$$+ \sqrt n(\hat\gamma - \gamma^*)'\cdot n^{-1}\sum_{i=1}^n\dot w_0(D_i, x_i; \gamma^*)\big(1\{\Delta Y_{ti} \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big) + o_p(1)$$
$$= n^{-1/2}\sum_{i=1}^n\Big(w_0(D_i, x_i; \gamma^*)\big(1\{\Delta Y_{ti} \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big) + l^*(W_i)'\cdot E\big[\alpha(D, x; \gamma^*)\big(1\{\Delta Y_t \le y\} - E[w_0(D, x; \gamma^*)1\{\Delta Y_t \le y\}]\big)\dot\pi(x; \gamma^*)\big]\Big) + o_p(1),$$
where $\sqrt n(\hat\gamma - \gamma^*) = n^{-1/2}\sum_{i=1}^nl^*(W_i) + o_p(1)$, $\dot w_0(D, x; \gamma) = \alpha(D, x; \gamma)\dot\pi(x; \gamma)$, and
$$\alpha(D, x; \gamma) = \frac{1-D}{(1-\pi(x; \gamma))^2}\Big/E\Big[\frac{\pi(x; \gamma^*)(1-D)}{1-\pi(x; \gamma^*)}\Big].$$
Observe that
$$\widehat{CDF}_2 - CDF_2 = (\widehat{CDF}_{21} - CDF_{21}) - (\widehat{CDF}_{22} - CDF_{22}),$$
where $\widehat{CDF}_{21}$ collects the conditional-cdf terms weighted by $w_0$ and $\widehat{CDF}_{22}$ collects those weighted by $w_1(D) = D/(n^{-1}\sum_kD_k)$. Similarly, note that
$$\sqrt n(\widehat{CDF}_{22} - CDF_{22}) = n^{-1/2}\sum_{i=1}^n\Big(w_1(D_i)\big(\tilde P(\Delta Y_{0ti} \le y|x_i; \beta^*) - E[w_1(D)\tilde P(\Delta Y_{0t} \le y|x; \beta^*)]\big) + l^*(W_i)'E\big[w_1(D_i)\dot{\tilde P}(\Delta Y_{0t} \le y|x; \beta^*)\big]\Big) + o_p(1).$$
Furthermore, note that
$$\sqrt n(\widehat{CDF}_{21} - CDF_{21}) = n^{-1/2}\sum_{i=1}^n\Big(w_0(D_i, x_i; \gamma^*)\big(\tilde P(\Delta Y_{0ti} \le y|x_i; \beta^*) - E[w_0(D, x; \gamma^*)\tilde P(\Delta Y_{0ti} \le y|x; \beta^*)]\big)$$
$$+ l^*(W_i)'\cdot E\big[\alpha(D, x; \gamma^*)\big(\tilde P(\Delta Y_{0t} \le y|x; \beta^*) - E[w_0(D, x; \gamma^*)\tilde P(\Delta Y_{0ti} \le y|x; \beta^*)]\big)\dot\pi(x; \gamma^*)\big] + l^*(W_i)'\cdot E\big[w_0(D, x; \gamma^*)\dot{\tilde P}(\Delta Y_{0ti} \le y|x; \beta^*)\big]\Big) + o_p(1).$$
Then, by combining all the asymptotic expansions, I obtain
$$\sqrt n\big(\hat F_{\Delta Y_{0t}|D=1}(y) - F_{\Delta Y_{0t}|D=1}(y)\big) = n^{-1/2}\sum_{i=1}^n\Big(w_0(D_i, x_i; \gamma^*)\big(1\{\Delta Y_{ti} \le y\} - E[w_01\{\Delta Y_t \le y\}]\big) + l^*(W_i)'\cdot E\big[\alpha(D_i, x_i; \gamma^*)\big(1\{\Delta Y_{ti} \le y\} - E[w_01\{\Delta Y_t \le y\}]\big)\dot\pi(x; \gamma^*)\big]$$
$$+ w_1(D_i)\big(\tilde P(\Delta Y_{0ti} \le y|x_i; \beta^*) - E[w_1\tilde P(\Delta Y_{0t} \le y|x; \beta^*)]\big) + l^*(W_i)'E\big[w_1\dot{\tilde P}(\Delta Y_{0t} \le y|x; \beta^*)\big]$$
$$- w_0(D_i, x_i; \gamma^*)\big(\tilde P(\Delta Y_{0ti} \le y|x_i; \beta^*) - E[w_0\tilde P(\Delta Y_{0ti} \le y|x; \beta^*)]\big) - l^*(W_i)'\cdot E\big[\alpha(\gamma^*)\big(\tilde P(\Delta Y_{0t} \le y|x; \beta^*) - E[w_0\tilde P(\Delta Y_{0ti} \le y|x; \beta^*)]\big)\dot\pi(x; \gamma^*)\big] - l^*(W_i)'\cdot E\big[w_0\dot{\tilde P}(\Delta Y_{0ti} \le y|x; \beta^*)\big]\Big) + o_p(1).$$
After simplification, I obtain
$$\sqrt n\big(\hat F_{\Delta Y_{0t}|D=1}(y) - F_{\Delta Y_{0t}|D=1}(y)\big) = n^{-1/2}\sum_{i=1}^n\Big(w_0(D_i, x_i; \gamma^*)\big(1\{\Delta Y_{ti} \le y\} - \tilde P(\Delta Y_{0ti} \le y|x_i; \beta^*) + E[w_0\tilde P(\Delta Y_{0ti} \le y|x; \beta^*)] - E[w_01\{\Delta Y_t \le y\}]\big)$$
$$+ w_1(D_i)\big(\tilde P(\Delta Y_{0ti} \le y|x_i; \beta^*) - E[w_1\tilde P(\Delta Y_{0t} \le y|x; \beta^*)]\big) + l^*(W_i)'E\big[\dot{\tilde P}(\Delta Y_{0ti} \le y|x; \beta^*)(w_1 - w_0)\big]$$
$$+ l^*(W_i)'E\big[\alpha(\gamma^*)\big(1\{\Delta Y_{ti} \le y\} - \tilde P(\Delta Y_{0ti} \le y|x; \beta^*) - E[w_0(1\{\Delta Y_{ti} \le y\} - \tilde P(\Delta Y_{0ti} \le y|x; \beta^*))]\big)\dot\pi(\gamma^*)\big]\Big) + o_p(1).$$
Now, suppose that the propensity score and the CDF of $\Delta Y_{0ti}|X$ are correctly specified. Note that $l^*(W_i)'E[\dot P(\Delta Y_{0ti} \le y; \beta^*)(w_1 - w_0)] = 0$, $E[w_0\,P(\Delta Y_{0ti} \le y|x; \beta^*)] - E[w_01\{\Delta Y_t \le y\}] = 0$, and
$$l^*(W_i)'E\big[\alpha(\gamma^*)\big(1\{\Delta Y_{ti} \le y\} - P(\Delta Y_{0ti} \le y|x; \beta^*) - E[w_0(1\{\Delta Y_{ti} \le y\} - P(\Delta Y_{0ti} \le y|x; \beta^*))]\big)\dot\pi(\gamma^*)\big] = 0.$$
Then,
$$\sqrt n\big(\hat F_{\Delta Y_{0t}|D=1}(y) - F_{\Delta Y_{0t}|D=1}(y)\big) = n^{-1/2}\sum_{i=1}^n\Big[w_0(D_i, x_i; \gamma^*)1\{\Delta Y_{ti} \le y\} - \big(w_0(D_i, x_i; \gamma^*) - w_1(D_i)\big)P(\Delta Y_{0ti} \le y|x_i; \beta^*) - w_1(D_i)F_{\Delta Y_{0t}|D=1}(y)\Big] + o_p(1)$$
$$= n^{-1/2}\sum_{i=1}^n\eta(D_i, x_i, \Delta Y_{0i}, \Delta Y_{1i}) + o_p(1). \quad\square$$

Proof of Proposition 1:

Proof. The result follows from Theorem 3 and the functional central limit theorem for empirical distribution functions. $\square$

Proof of Proposition 2:

Proof. The result follows by Proposition 1, Lemma B.4 in Callaway and Li (2019), and similar arguments used to establish Proposition 4 in Callaway and Li (2019). $\square$

Proof of Theorem 6:

Proof. The result follows from Proposition 2 and Lemma 3.9.23(ii) in van der Vaart and Wellner (1996). $\square$

Outline of proof of Proposition 3:

Consider the asymptotic expansion in the nonparametric case of Theorem 5, and consider each term, assuming that the bootstrap estimate of each nuisance function converges to the estimate of that function based upon the unweighted data. It remains to show that each term is $o_p(n^{-1/2})$. Each term other than the first converges either due to Rothe and Firpo (2019) or due to the fact that $\sup_{x \in \mathcal{X}}|\hat\pi^*(x_i) - \hat\pi(x_i)| = o_p(n^{-1/4})$.
For the first term, let $\mathbb{G}_n(f_1) = n^{1/2}(P_n - P)f_1^*(D, \Delta Y_t, x^*)$, where $P_n$ is the empirical measure, $P$ is the expectation, and
$$f_1^*(D, \Delta Y_t, x) = \frac{(1-D)\big(1\{\Delta Y_t \le y\} - \hat P(\Delta Y_{0t} \le y|x^*)\big)}{p^*(1-\hat\pi(x^*))^2}\big(\hat\pi^*(x_i^*) - \hat\pi(x_i^*)\big).$$
Since $\sup_{x \in \mathcal{X}}|\hat\pi^*(x) - \hat\pi(x)| \lesssim o_p(k_n^{-a/d}) = o_p(1)$ by Theorem 3.2 in Chen (2007) and the previous proof, define $\mathcal{F} = \{f_1 : \|\hat\pi^*(x) - \hat\pi(x)\|_\infty \le \delta_{1n},\ \|\hat P^*(x) - \hat P(x)\|_\infty \le \delta_{2n}\}$, where $\delta_{1n} = Ck_n^{-a/d}$ for some $C > 0$, and $\delta_{2n} = Kn^{-r}$, where $r > 1/2$. Let $\mathcal{G}_1 = \{\pi \in \Lambda_c^a(\mathcal{X}) : \|\pi - \hat\pi\|_\infty \le \delta_{1n}\}$ and $\mathcal{G}_2 = \{P \in \mathcal{M} : \|P - \hat P\|_\infty \le \delta_{2n}\}$, where $\mathcal{M}$ denotes a Hölder space containing an estimate of $P(\Delta Y_{0t} \le y|x)$, such as the kernel estimator mentioned in this text, with smoothness parameter $a_1$ such that $d/a_1 < 2$. Note the following:
$$n^{-1/2}\sum_{i=1}^n\psi_i^1\big(\hat\pi^*(x^*) - \hat\pi(x^*)\big) \le \sup_{f_1 \in \mathcal{F}}|\mathbb{G}_n(f_1)| + n^{1/2}\sup_{f_1 \in \mathcal{F}}|Pf_1|.$$
I will consider the second term first. Since $p$ and $\pi(x)$ are bounded away from 0,
$$n^{1/2}\sup_{f_1 \in \mathcal{F}}|Pf_1| = n^{1/2}\sup_{\pi \in \mathcal{G}_1,\,P \in \mathcal{G}_2}\Big|E\Big[\frac{1-D}{p^*(1-\hat\pi(x^*))^2}\big[1\{\Delta Y_t \le y\} - P(\Delta Y_{0t} \le y|x_i^*)\big]\big(\hat\pi^*(x^*) - \hat\pi(x^*)\big)\Big]\Big|$$
$$\lesssim n^{1/2}\sup_{x \in \mathcal{X}}|\hat\pi^*(x^*) - \hat\pi(x^*)|\sup_{x \in \mathcal{X}}|\hat P^*(x^*) - \hat P(x^*)| \lesssim o_p(1),$$
where the last line follows from Assumption B.1. Now, I will consider the term $\sup_{f_1 \in \mathcal{F}}|\mathbb{G}_n(f_1)|$. Let $F_1 \equiv \delta_n = Ck_n^{-a/d}$, so $\|F_1\|_{P,2} \lesssim \delta_n$. Let $\mathcal{F}_1 = \{f_1 : \|\hat\pi^* - \pi\|_\infty \le C,\ \|\hat P^* - P\|_\infty \le 1\}$. Define $\mathcal{F}_1' = \{\pi \in \Lambda_c^a(\mathcal{X}) + \hat\pi^* : \|\pi\|_{p,2} \le C\}$ and $\mathcal{F}_2' = \{P \in \mathcal{M} + \hat P^* : \|P\|_{p,2} \le 1\}$. Then,
$$\log N_{[]}(\epsilon, \mathcal{F}, L_2(p)) \lesssim \log N_{[]}(\epsilon/\delta_{2n}, \mathcal{F}_1', L_2(p)) + \log N_{[]}(\epsilon/\delta_{2n}, \mathcal{F}_2', L_2(p)) \lesssim \log N_{[]}(\epsilon/\delta_{2n}, \Lambda_c^a(\mathcal{X}), L_2(p)) + \log N_{[]}(\epsilon/\delta_{2n}, \mathcal{M}, L_2(p)) \lesssim (\delta_n/\epsilon)^{d/a} + (\delta_n/\epsilon)^{d/a_1}.$$
This is sufficient to demonstrate that the bracketing integral $J_{[]}(\|F_1\|_{p,2}, \mathcal{F}, L_2(p))$ converges. Then $n^{-1/2}\sum_{i=1}^n\psi_i^1(\hat\pi^*(x^*) - \hat\pi(x^*)) = o_p(1)$. Now, note that
$$E\Big[\Big(\frac{1-D_i}{p^*}\frac{\hat\pi^*(x_i^*)}{1-\hat\pi^*(x_i^*)} - \frac{D_i}{p^*}\Big)\big(\hat P^*(\Delta Y_{0t} \le y|x^*) - \hat P(\Delta Y_{0t} \le y|x^*)\big)\Big]$$
$$\lesssim E\big[(\hat\pi^*(x) - D_i)\big(\hat P^*(\Delta Y_{0t} \le y|x^*) - \hat P(\Delta Y_{0t} \le y|x^*)\big)\big] \lesssim E\big[\big(|\hat\pi^*(x^*) - \hat\pi(x^*)| + |\hat\pi(x^*) - \pi(x)|\big)\big|\hat P^*(\Delta Y_{0t} \le y|x^*) - \hat P(\Delta Y_{0t} \le y|x^*)\big|\big].$$
The result then follows by the same steps used to show that $n^{-1/2}\sum_{i=1}^n\psi_i^1(\hat\pi^*(x^*) - \hat\pi(x^*))$ is $o_p(1)$. Then the main result follows by Theorem 3.6.1 in van der Vaart and Wellner (1996).

Proof of Theorem 7:

Proof. The result follows by Proposition 3, Lemma 3.9.23(ii), and Theorem 3.9.11 in van der Vaart and Wellner (1996). $\square$

APPENDIX F

CHAPTER 2 TABLES AND FIGURES

Figure F.1: Average Absolute Bias

Figure F.1: $QTT_{dr}$ represents the doubly robust estimator. $QTT_{cl}$ represents the Callaway and Li estimator. The graph in the first row and first column represents the scenario when both nuisance functions are correctly specified. The graph in the first row and second column represents when neither nuisance function is correctly specified. The graph in the second row and first column represents when the propensity score is correctly specified but the conditional cdf nuisance function is incorrectly specified. The graph in the second row and second column represents when the propensity score is incorrectly specified but the conditional cdf nuisance function is correctly specified.

Figure F.2: RMSE

Figure F.2: $QTT_{dr}$ represents the doubly robust estimator. $QTT_{cl}$ represents the Callaway and Li estimator. The graph in the first row and first column represents the scenario when both nuisance functions are correctly specified. The graph in the first row and second column represents when neither nuisance function is correctly specified. The graph in the second row and first column represents when the propensity score is correctly specified but the conditional cdf nuisance function is incorrectly specified. The graph in the second row and second column represents when the propensity score is incorrectly specified but the conditional cdf nuisance function is correctly specified.
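The doubly robust estimator $QTT_{dr}$ compared in Figures F.1 and F.2 is built from the moment condition of Theorem 3: an inverse-propensity-weighted average of $1\{\Delta Y_t \le y\}$ over the untreated, recentred by a term in the estimated conditional cdf. The sketch below is a simplified illustration of the counterfactual-cdf step only, under assumptions of my own (a scalar covariate, a logit propensity score, an Epanechnikov-kernel conditional cdf, and a simulated design); all function names are hypothetical.

```python
import numpy as np

def fit_logit(x, d, n_iter=50):
    # Logit propensity score pi(x) = L(a + b*x), fit by Newton-Raphson.
    R = np.column_stack([np.ones_like(x), x])
    g = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(R @ g)))
        hess = R.T @ (R * (p * (1.0 - p))[:, None]) + 1e-8 * np.eye(2)
        g = g + np.linalg.solve(hess, R.T @ (d - p))
    return lambda xx: 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones_like(xx), xx]) @ g)))

def cond_cdf_hat(y, x0, dy0, x_untr, h=0.3, h0=0.3):
    # Kernel-smoothed P(dY0 <= y | X = x0) among the untreated, Epanechnikov weights.
    u = (x_untr - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    v = np.clip((y - dy0) / h0, -1.0, 1.0)
    G = 0.75 * (v - v**3 / 3.0) + 0.5  # integrated Epanechnikov kernel
    return float(np.sum(G * w) / max(np.sum(w), 1e-12))

def dr_cdf(y, dy, d, x, pi_hat):
    # Doubly robust estimate of the counterfactual cdf F_{dY(0)|D=1}(y).
    p = d.mean()
    pi = np.clip(pi_hat(x), 0.01, 0.99)
    w0 = (1.0 - d) * pi / (p * (1.0 - pi))  # reweights untreated toward the treated
    x_untr, dy_untr = x[d == 0], dy[d == 0]
    P = np.array([cond_cdf_hat(y, xi, dy_untr, x_untr) for xi in x])
    return float(np.mean(w0 * (dy <= y) - (w0 - d / p) * P))

# Simulated design: selection on x, untreated outcome dY(0) = x + noise.
rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-1.0, 1.0, n)
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
dy = x + rng.normal(size=n)
F_hat = dr_cdf(0.0, dy, d, x, fit_logit(x, d))
```

The correction term has mean approximately zero when the propensity score is correct, and it is what restores consistency when the propensity score is misspecified but the conditional cdf is correct, which is the mechanism behind the pattern in the lower-right panels of Figures F.1 and F.2.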
Figure F.3: QTT Unemployment Estimates

Figure F.3: The top panel represents estimates of the $QTT(\tau)$ and their confidence intervals using my doubly-robust estimator. The bottom panel represents the estimates of the $QTT(\tau)$ using the Callaway and Li estimator. The blue line represents the curve of point estimates. The red lines represent the 95% confidence bands.

APPENDIX G

CHAPTER 3 FIGURES

Figure G.1: Average Absolute Bias

Figure G.1: $QTT_{gmmpro}$ represents the estimates when both nuisance functions are correctly specified. $QTT_{mispro}$ represents the estimates when the propensity score is correctly specified, but the conditional cdf is incorrectly specified. $QTT_{omispro}$ represents when only the propensity score is misspecified, but the conditional cdf is correctly specified. $QTT_{nomispro}$ represents when neither of the nuisance functions is correctly specified.

Figure G.2: RMSE

Figure G.2: $QTT_{gmmpro}$ represents the estimates when both nuisance functions are correctly specified. $QTT_{mispro}$ represents the estimates when the propensity score is correctly specified, but the conditional cdf is incorrectly specified. $QTT_{omispro}$ represents when only the propensity score is misspecified, but the conditional cdf is correctly specified. $QTT_{nomispro}$ represents when neither of the nuisance functions is correctly specified.
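The confidence bands in Figure F.3 rest on the empirical bootstrap whose validity is established in Proposition 3 and Theorem 7: resample the observed tuples $(\Delta Y_i, D_i, X_i)$ with replacement, recompute the estimator on each draw, and take percentiles. A schematic sketch follows; the stand-in `qtt_hat` is a hypothetical placeholder for the full doubly robust procedure, and the simulated design is my own.

```python
import numpy as np

def qtt_hat(dy, d, x, tau):
    # Stand-in for the doubly robust QTT(tau) estimator: here, simply the
    # difference between treated and untreated tau-quantiles of dY.
    return np.quantile(dy[d == 1], tau) - np.quantile(dy[d == 0], tau)

def bootstrap_band(dy, d, x, tau, B=300, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n = dy.size
    est = qtt_hat(dy, d, x, tau)
    draws = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)  # resample tuples with replacement
        draws[b] = qtt_hat(dy[idx], d[idx], x[idx], tau)
    alpha = 1.0 - level
    lo, hi = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(est), float(lo), float(hi)

# Simulated design: treatment shifts dY by 0.5 at every quantile.
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
d = (rng.uniform(size=n) < 0.5).astype(float)
dy = x + 0.5 * d + rng.normal(size=n)
est, lo, hi = bootstrap_band(dy, d, x, tau=0.5)
```

In the application, this is repeated over a grid of $\tau$ values to trace out the point-estimate curve and the bands shown in Figure F.3.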