OPTIMAL SAMPLING STRATEGIES USING CASE-CONTROL STUDIES FOR BINARY SECONDARY OUTCOMES UNDER BUDGET CONSTRAINTS

By

Liang Wang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Biostatistics - Doctor of Philosophy

2024

ABSTRACT

A case-control study is efficient for investigating the association between outcomes and exposures. After conducting the primary outcome analysis, researchers can utilize the existing case-control study data to perform a secondary outcome analysis. Several methods have been proposed for analyzing secondary outcomes in case-control studies over the past few decades, but few of them have focused on the study design aspect. We propose optimal sampling strategies under a budget constraint for case-control studies with binary and Poisson secondary outcomes. We then extend our optimal sampling strategy by considering a confounder and derive the parameter of interest using doubly-weighted estimating equations. The term "optimal" refers to minimizing the variance of the estimator of the parameter of interest. We elucidate our proposed methods by developing the asymptotic variance of the estimator of the coefficient using weighted estimating equations and doubly-weighted estimating equations. Furthermore, we derive the optimal sampling ratio formulas through the Lagrange multiplier method based on certain monetary constraints. We verify our proposed methods through Monte Carlo simulation studies. Additionally, we apply our methods to empirical epidemiological studies that motivated the method development.

Copyright by
LIANG WANG
2024

I dedicate this dissertation to my family and friends.

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my dissertation committee members: Dr. Zhehui Luo, Dr. Chenxi Li, Dr. Honglei Chen, and Dr. Yuehua Cui for their guidance throughout my dissertation research.
I would also like to thank Michigan State University Institute of Health Policy for their support of my PhD graduate assistantship. I would like to extend my appreciation to my family and friends for their love and emotional support during my PhD journey.

TABLE OF CONTENTS

Chapter 1 Introduction
Chapter 2 Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes under Budget Constraints
Chapter 3 Optimal Sampling Strategies Using Case-Control Studies for Poisson Count Secondary Outcomes under Budget Constraints
Chapter 4 Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes Using Doubly-Weighted Inverse Probability Estimating Equations under Budget Constraints
Chapter 5 Conclusion
BIBLIOGRAPHY
APPENDIX A PROOFS OF CHAPTER 2
APPENDIX B PROOFS OF CHAPTER 3
APPENDIX C PROOFS OF CHAPTER 4

Chapter 1 Introduction

The case-control study is designed to efficiently investigate the association between rare outcomes and exposures.
To form a case group, a sample of individuals with the disease is randomly selected from the target population, while a sample of individuals without the disease is selected to form a control group. The primary outcome is the disease by which the caseness is defined. Researchers can use an existing case-control study dataset to conduct secondary outcome analysis and examine the association between secondary outcomes and covariates.

Numerous methods have been proposed for analyzing secondary outcomes in case-control studies over the past few decades. These methods include analyzing controls only (Nagelkerke et al., 1995) or cases only (Li et al., 2010), and conducting joint analysis of cases and controls while adjusting for the primary outcome (Lee et al., 1997). However, some of these methods have been considered naive as they are valid only under certain circumstances, such as a rare disease assumption (Li et al., 2010), and may produce invalid inferences. When considering the funding limitation for the secondary outcome analysis, the data for the secondary outcome analysis needs to be sampled from the cohort. Because the sampling for the secondary outcome analysis is taken from the cohort, the associations between exposure and the secondary outcome based on these samples can differ from those in the general population (Lin and Zeng, 2009). To overcome this issue, more methods that can provide valid inference have been proposed, including likelihood methods (Lin and Zeng, 2009; Jiang et al., 2006; Ghosh et al., 2013; Brownstein et al., 2022), weighted estimating equations methods (Monsees et al., 2009; Xing et al., 2016; Song et al., 2016; Sofer et al., 2017), bias correction methods (Wang and Shete, 2012; Chen et al., 2013), and semi-parametric methods (Wei et al., 2013; Tchetgen Tchetgen, 2014; Ma and Carroll, 2016).
While current methodological studies on secondary outcomes focus on inferential procedures, the corresponding sampling strategies using case-control studies for secondary outcome analysis have not been investigated. In Chapter 2, we propose an optimal sampling strategy for a case-control study with a binary secondary outcome. We derive the variance of the estimator of the exposure effect for the inverse probability weighted estimating equations and minimize the variance to obtain an optimal ratio of the number of controls to the number of cases, considering study cost constraints.

In Chapter 2, the secondary outcome is presented as a binary variable. However, in reality, there are situations where the secondary outcome of interest is not binary, but a count variable. For instance, in a study by Tchetgen Tchetgen (2014), the secondary outcome of interest was the number of live births, which is typically considered a count outcome. Similarly, in the Pesticides and Sense of Smell (PASS) Study (Shrestha et al., 2019), the cognitive decline scores in the survey could also be treated as a count variable.

Sample size calculation for count data has been extensively investigated. For instance, Lou et al. (2017) derived an analytic sample size formula for comparing rates of change between multiple treatment groups with repeatedly measured count outcomes using generalized estimating equations. Amatya et al. (2013) provided simple sample size expressions for determining the number of clusters in the context of multi-center randomized clinical trials. Zhu and Lakkis (2014) developed an explicit sample size calculation formula based on the likelihood function of the negative binomial model. Wang et al. (2020) provided a closed-form sample size formula accounting for the variability in cluster size in cluster randomized studies.
In addition to deriving analytic sample size calculation formulas, simulation studies have also been used to determine sample size for count data, as demonstrated in the works of Lyles et al. (2007); Aban et al. (2009); Rettiganti and Nagaraja (2012). However, the sampling allocation estimation in the analysis of count secondary outcomes in case-control studies remains understudied. This situation motivated us to explore a sampling strategy for secondary case-control studies with count outcomes in Chapter 3.

In addition, our proposed sampling strategy formulas in Chapter 2 and Chapter 3 were derived by minimizing the variance of the estimator of the exposure effect using inverse probability weighted estimating equations. The weights in the estimating equations of these two chapters were design-based weights, representing the sampling probability for the secondary outcome analysis from the cohort. In Chapter 4, we incorporate the propensity score weights into the estimating equations to create doubly-weighted estimating equations. The propensity score weights, which capture the probability of exposure given the confounder, can be estimated using cohort data. Consequently, the weights in the doubly-weighted estimating equations are the product of the design weights and the propensity score weights. The general propensity score assumptions and the inference of doubly-weighted estimating equations can be found in Negi (2024). It is important to note that in Chapter 4, our focus is not on inference on the marginal effect; rather, we aim to provide optimal sampling designs for case-control studies with binary secondary outcomes given the exposure and a confounder using doubly-weighted estimating equations.

Chapter 2 Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes under Budget Constraints

2.1 Background

Optimal sampling strategies have been fruitfully studied in a variety of epidemiology study designs.
In the case-control study with a binary outcome, Demidenko (2006) found that the optimal ratio of controls to cases for fixed power is equal to the square root of the alternative odds ratio. In another paper related to the unmatched case-control study design, Demidenko (2008) gave the optimal control-case ratio for the test of an interaction between two binary covariates. Morgenstern and Winn (1983) proposed that the optimal sampling ratio is a function of the expected frequency of exposure among controls, the odds ratio, and the unit cost ratio. Nam and Fears (1990) investigated optimal allocation for stratified case-control studies.

The design consideration is immaterial when we only use an existing case-control sample for secondary analysis. However, in the situation where the "true" outcome of interest is not measured in the target population or is too costly to ascertain, but a proxy of the outcome or another variable strongly associated with the true outcome is routinely collected in the target population or ascertainable with little cost, then it may be more efficient to sample observations based on the proxy as if the proxy is the primary outcome in a case-control study and the true outcome of interest is the secondary outcome. The design aspect of secondary outcome analysis in case-control studies has received less attention compared to the inference procedures. In this chapter, we propose a new method to determine the optimal sampling ratio for a secondary case-control study when estimating coefficients of interest using inverse probability weighted estimating equations. The purpose of this paper is twofold. First, we provide a detailed explanation of a novel Lagrange multiplier method by developing the asymptotic variance-covariance matrix of estimators of coefficients obtained from the weighted estimating equations. Second, we showcase the optimal allocation design for a secondary outcome analysis using the method, taking into account monetary constraints.
2.2 Notation and estimation

2.2.1 Notation

Suppose the study cohort (target population) has $i = 1, \ldots, N$ independent subjects. Let $D_i$ be the binary primary disease status for subject $i$ in the cohort, where $D_i = 1$ indicates the presence of the disease and $D_i = 0$ indicates its absence. Denote by $Y_i$ the binary secondary outcome of interest, where $Y_i = 1$ indicates the presence of a secondary disease, and $Y_i = 0$ otherwise. Let $S_i$ be the sampling indicator, with $S_i = 1$ indicating the inclusion of the $i$th subject in the secondary outcome analysis, and $S_i = 0$ otherwise. Assume that there are $N_1$ known subjects in the case group ($D_i = 1$) and $N_0$ known subjects in the control group ($D_i = 0$), where $N_1 + N_0 = N$. Let $n_1$ be the unknown sample size selected from the case group and let $n_0$ be the unknown sample size selected from the control group. The total number of subjects selected among $N$ is $n = n_1 + n_0$. Denote by $X_i = (1, W_i)^T$ the $2 \times 1$ covariates vector for subject $i$, where $W_i$ denotes the binary exposure status for subject $i$, with $W_i = 1$ indicating exposed, and $W_i = 0$ otherwise. Assume that the sampling probability from the study cohort for the secondary outcome analysis depends only on $D_i$, where

$$\Pr(S_i = 1 \mid D_i = d, Y_i, X_i) = \Pr(S_i = 1 \mid D_i = d) = \pi(D_i = d) = \frac{n_d}{N_d}, \quad d = 0, 1.$$

The goal is to find the optimal sampling ratio $n_0/n_1$ when we aim to examine the association between $Y_i$ and $W_i$ in the target population without conditioning on $D_i$. Let $\beta = (\beta_0, \beta_1)^T$ be a $2 \times 1$ vector in the conditional expectation $E(Y_i \mid X_i; \beta) = \mu(X_i; \beta)$, where $g(\mu(X_i; \beta)) = X_i^T \beta$. Since $Y_i$ is a binary outcome, we can use the logit link function for $g(\cdot)$. The mean model $E(Y_i \mid X_i; \beta)$ can be expressed as

$$E(Y_i \mid X_i; \beta) = \frac{e^{X_i^T \beta}}{1 + e^{X_i^T \beta}}.$$

A valid estimator of $\beta$ can be obtained by solving the inverse of sampling probability weighted estimating equations:

$$\sum_{i=1}^{N} U_i(\beta) = 0,$$

where

$$U_i(\beta) = \frac{X_i S_i}{\pi(D_i)} \left[ Y_i - \mu(X_i) \right].$$
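As a hedged illustration of how the weighted estimating equations above can be solved in practice, the following Python sketch simulates a cohort, samples cases and controls with probabilities $n_d/N_d$, and applies Newton-Raphson to $\sum_i U_i(\beta) = 0$. The function names, the seed, the sample sizes, and the data-generating coefficients are illustrative assumptions rather than part of the method itself; the analyses in this dissertation were carried out with the Stata command "gmm".

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_sample(n_cohort=10000, n1=1000, n0=2000, seed=1):
    """Simulate a cohort and return (w, y, weight) for the sampled subjects.

    The coefficients below are illustrative assumptions.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n_cohort):
        w = 1 if rng.random() < 0.17 else 0
        d = 1 if rng.random() < expit(-1.4 + 0.7 * w) else 0
        y = 1 if rng.random() < expit(-1.0 + 0.3 * w + 0.25 * d + 1.0 * w * d) else 0
        cohort.append((w, d, y))
    sample = []
    for d, n_d in ((1, n1), (0, n0)):
        stratum = [row for row in cohort if row[1] == d]
        n_take = min(n_d, len(stratum))
        weight = len(stratum) / n_take          # 1 / pi(d) = N_d / n_d
        for w, _, y in rng.sample(stratum, n_take):
            sample.append((w, y, weight))
    return sample


def solve_ipw_logistic(sample, tol=1e-10, max_iter=50):
    """Newton-Raphson for sum_i weight_i * X_i * (Y_i - mu(X_i; beta)) = 0."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for w, y, wt in sample:
            mu = expit(b0 + b1 * w)
            r = wt * (y - mu)                   # weighted residual
            v = wt * mu * (1.0 - mu)            # weighted derivative of mu
            u0 += r
            u1 += r * w
            h00 += v
            h01 += v * w
            h11 += v * w * w
        det = h00 * h11 - h01 * h01
        step0 = (h11 * u0 - h01 * u1) / det     # first entry of H^{-1} U
        step1 = (h00 * u1 - h01 * u0) / det     # second entry of H^{-1} U
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    beta0_hat, beta1_hat = solve_ipw_logistic(simulate_sample())
    print(beta0_hat, beta1_hat)
```

Because the mean model is saturated in the binary exposure, the Newton solution should coincide with the logits of the weighted outcome proportions within each exposure group, which offers a quick consistency check on the solver.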
Figure 1 shows the secondary case-control study design flowchart.

Figure 1. Secondary Case-control Study Design Flow Chart. [Flowchart: the target population of size $N = N_1 + N_0$ is split into a case group (primary outcome $D_i = 1$, number of cases $N_1$) and a control group (primary outcome $D_i = 0$, number of controls $N_0$); $n_1$ cases and $n_0$ controls are sampled, and the secondary outcome $Y_i$ is measured on the $n = n_1 + n_0$ sampled subjects.]

2.2.2 Variance estimation

In order to determine the optimal sampling ratio for the secondary outcome in a case-control design, we develop the variance of the estimator of the exposure effect, denoted as $Var(\hat\beta_1)$.

Let $\beta^*$ represent the true parameters, and let $\hat\beta = (\hat\beta_0, \hat\beta_1)^T$ denote the solution of the estimating equations. Under some regularity conditions, it follows that $\hat\beta$ is a consistent estimator of $\beta^*$ and is asymptotically normal, $N(\beta^*, V(\beta^*)/N)$, where

$$V(\beta^*) = A(\beta^*)^{-1} B(\beta^*) \left[ A(\beta^*)^{-1} \right]^T.$$

Here, $A(\beta^*) = E\left[ -\frac{\partial}{\partial \beta^T} U(\beta^*) \right]$ and $B(\beta^*) = E\left[ U(\beta^*) U(\beta^*)^T \right]$. Since $\hat\beta$ is a consistent estimator of $\beta^*$, we can consistently estimate $V(\beta^*)/N$ with $V(\hat\beta)/N$. In the current study settings, we only consider the exposure variable in the estimating equation for simplicity. Therefore, it is straightforward to see that $V(\beta)/N$ is a $2 \times 2$ variance-covariance matrix, and the bottom-right element of $V(\beta)/N$ represents $Var(\hat\beta_1)$. We obtain $Var(\hat\beta_1)$ by computing $A(\beta)^{-1} B(\beta) [A(\beta)^{-1}]^T$ directly. It is easy to verify that

$$A(\beta) = E\left[ \frac{S_i}{\pi(D_i)} \cdot \frac{e^{X_i^T \beta}}{\left(1 + e^{X_i^T \beta}\right)^2} X_i X_i^T \right] = E\left[ \frac{e^{X_i^T \beta}}{\left(1 + e^{X_i^T \beta}\right)^2} X_i X_i^T \right].$$

(See Appendix A.1 for the explicit derivation of $A(\beta)$.)

We substitute $U_i(\beta)$ into $B(\beta)$, thus

$$B(\beta) = E\left[ \frac{S_i}{[\pi(D_i)]^2} \left( Y_i - \mu(X_i) \right)^2 X_i X_i^T \right].$$
We further simplify $B(\beta)$ to (see Appendix A.2 for the explicit derivation of $B(\beta)$):

$$B(\beta) = E\left[ \frac{\left( \tilde\mu(X_i, D_i) - \mu(X_i) \right)^2 + \tilde\mu(X_i, D_i) - \tilde\mu(X_i, D_i)^2}{\pi(D_i)} X_i X_i^T \right],$$

where

$$\tilde\mu(X_i, D_i; \alpha) = E(Y_i \mid X_i, D_i; \alpha) = \frac{e^{\alpha_0 + \alpha_1 w_i + \alpha_2 d_i + \alpha_3 w_i d_i}}{1 + e^{\alpha_0 + \alpha_1 w_i + \alpha_2 d_i + \alpha_3 w_i d_i}}$$

is the expectation of the secondary outcome given the primary outcome, the exposure variable and their interaction. $\beta = (\beta_0, \beta_1)^T$ and $\alpha = (\alpha_0, \alpha_1, \alpha_2, \alpha_3)^T$ are the coefficient vectors associated with $\mu(X_i; \beta)$ and $\tilde\mu(X_i, D_i; \alpha)$, respectively. To obtain the optimal sampling ratio, the parameter set $\alpha$ needs to be pre-specified. Because both models are saturated and the parameter of interest is the odds ratio, the functional forms for the two models do not cause issues of compatibility.

2.3 Optimal sampling strategy

Theorem 1. Under the above study design settings, suppose the prevalence of the exposure in the cohort, $P(W = w) = p_w$, and the joint probabilities $P(W = w, D = d) = p_{wd}$ are known for $w$ and $d = 0$ or $1$. Let

$$A(\beta) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \quad \text{and} \quad B(\beta) = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}.$$

Then $Var(\hat\beta_1)$ is given by

$$Var(\hat\beta_1) = \frac{a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22}}{(\det)^2 N},$$

the lower-right element of the variance-covariance matrix $A(\beta)^{-1} B(\beta) [A(\beta)^{-1}]^T$ scaled by $1/N$, where

$$a_{11} = \frac{p_1 e^{\beta_0 + \beta_1}}{\left(1 + e^{\beta_0 + \beta_1}\right)^2} + \frac{(1 - p_1) e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2},$$

$$a_{12} = a_{21} = a_{22} = \frac{p_1 e^{\beta_0 + \beta_1}}{\left(1 + e^{\beta_0 + \beta_1}\right)^2},$$

$$b_{11} = \frac{p_{11} N_1 \tilde{\tilde\mu}(1,1)}{n_1} + \frac{p_{10} N_0 \tilde{\tilde\mu}(1,0)}{n_0} + \frac{p_{01} N_1 \tilde{\tilde\mu}(0,1)}{n_1} + \frac{p_{00} N_0 \tilde{\tilde\mu}(0,0)}{n_0},$$

$$b_{12} = b_{21} = b_{22} = \frac{p_{11} N_1 \tilde{\tilde\mu}(1,1)}{n_1} + \frac{p_{10} N_0 \tilde{\tilde\mu}(1,0)}{n_0},$$

where $\tilde{\tilde\mu}(X_i, D_i) \equiv \left( \tilde\mu(X_i, D_i) - \mu(X_i) \right)^2 + \tilde\mu(X_i, D_i) - \tilde\mu(X_i, D_i)^2$, the shorthands $\tilde{\tilde\mu}(1,1)$, $\tilde{\tilde\mu}(1,0)$, $\tilde{\tilde\mu}(0,1)$, $\tilde{\tilde\mu}(0,0)$ represent the value of the function $\tilde{\tilde\mu}(\cdot, \cdot)$ given the exposure value $w$ and the case-control status $d$, for $w$ and $d = 0$ or $1$, and $\det = a_{11} a_{22} - a_{12} a_{21}$.

Proof. See Appendix A.3 and Appendix A.4.
□

The computation of $Var(\hat\beta_1)$ can be performed directly by plugging ($a_{11}$, $a_{12} = a_{21} = a_{22}$, $b_{11}$, $b_{12} = b_{21} = b_{22}$) into the expression $\left[ a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22} \right] / \left[ (\det)^2 N \right]$.

Consideration of the cost for data collection is a crucial aspect when conducting epidemiological research. Cost can be viewed as a special constraint in sample size calculation. In this paper, we will focus on a scenario where the total cost of the secondary outcome analysis is fixed and known. We make the assumption that the costs for a case and a control are the same.

Proposition 1. Denote the known total cost of the secondary outcome study as $Cost$, and let $c_{per}$ represent the known cost per individual for samples from the case group or the control group. The maximum sample size $n$ for the secondary outcome analysis is given by $Cost / c_{per} = n_0 + n_1 = n$, where $n_0$ is the sample size of the selected controls and $n_1$ is the sample size of the selected cases. Let $R_{OD} = n_0 / n_1$ denote the optimal design-based sampling ratio. Under the study design settings described above, the optimal ratio can be determined as follows:

$$R_{OD} = \frac{n_0}{n_1} = \sqrt{\frac{\zeta(1,0) + \iota(0,0) - \kappa(1,0) + \sigma(1,0)}{\zeta(1,1) + \iota(0,1) - \kappa(1,1) + \sigma(1,1)}},$$

where

$$\zeta(1,1) = a_{21}^2 p_{11} N_1 \tilde{\tilde\mu}(1,1), \quad \zeta(1,0) = a_{21}^2 p_{10} N_0 \tilde{\tilde\mu}(1,0),$$

$$\iota(0,1) = a_{21}^2 p_{01} N_1 \tilde{\tilde\mu}(0,1), \quad \iota(0,0) = a_{21}^2 p_{00} N_0 \tilde{\tilde\mu}(0,0),$$

$$\kappa(1,1) = 2 a_{21} a_{11} p_{11} N_1 \tilde{\tilde\mu}(1,1), \quad \kappa(1,0) = 2 a_{21} a_{11} p_{10} N_0 \tilde{\tilde\mu}(1,0),$$

$$\sigma(1,1) = a_{11}^2 p_{11} N_1 \tilde{\tilde\mu}(1,1), \quad \sigma(1,0) = a_{11}^2 p_{10} N_0 \tilde{\tilde\mu}(1,0).$$

Proof. See Appendix A.5. □

Remark 1. We define the $\tilde{\tilde\mu}(X, D)$ term as the Quasi Mean Squared Error (QMSE). Then,

$$\tilde{\tilde\mu}(X, D) = \underbrace{\Big[ \underbrace{\tilde\mu(X, D) - \mu(X)}_{bias} \Big]^2 + \underbrace{\tilde\mu(X, D) - \tilde\mu(X, D)^2}_{variance}}_{MSE_{x,d}}.$$

Let $q_1 = P(D = 1) = N_1 / N$ and $q_0 = P(D = 0) = N_0 / N$. Then,

$$Var(\hat\beta_1) = \frac{\theta_1 q_1}{(\det)^2 n_1} + \frac{\theta_0 q_0}{(\det)^2 n_0},$$

where

$$\theta_1 = a_{21}^2 \left( p_{11} QMSE_{11} + p_{01} QMSE_{01} \right) + a_{11}^2 p_{11} QMSE_{11} - 2 a_{21} a_{11} p_{11} QMSE_{11},$$

$$\theta_0 = a_{21}^2 \left( p_{10} QMSE_{10} + p_{00} QMSE_{00} \right) + a_{11}^2 p_{10} QMSE_{10} - 2 a_{21} a_{11} p_{10} QMSE_{10}.$$

Thus, the optimal sampling ratio is $R_{OD} = \sqrt{\theta_0 q_0 / (\theta_1 q_1)}$. Since $q_1 \ll q_0$, if $\theta_1 / \theta_0 \approx 1$, then $R_{OD} \approx \sqrt{q_0 / q_1} = \sqrt{N_0 / N_1}$. If $\theta_1$ is small and $\theta_0$ is big, then $R_{OD} > 1$. If $\theta_1$ is big and $\theta_0$ is small, then $R_{OD} < 1$.

2.4 Simulation

We denote the derived variance of $\hat\beta_1$ in Theorem 1 as $Var_{DER}(\hat\beta_1)$. We denote by $Var_{GMM}(\hat\beta_1)$ the variance of $\hat\beta_1$ obtained using the Stata command "gmm". We define the empirical variance as

$$Var_{EMP}(\hat\beta_1) = \frac{1}{1000} \sum_{i=1}^{1000} \left( \hat\beta_1^{i} - \hat\beta_1^{mean} \right)^2,$$

where $\hat\beta_1^{i}$ is the estimator of the exposure effect on one simulated dataset using the Stata command "gmm", and $\hat\beta_1^{mean}$ is the average of $\hat\beta_1^{i}$ across 1,000 simulation runs.

In this section, we conduct simulation studies to verify our derived variance formula. Specifically, we investigate whether our proposed variance $Var_{DER}(\hat\beta_1)$ closely approximates $Var_{EMP}(\hat\beta_1)$ and $Var_{GMM}(\hat\beta_1)$.

We provide the data generating process as follows. We simulate $i = 1, \ldots, 1{,}000$ datasets, with each dataset comprising 10,000 observations (target population, i.e., cohort size $N$). Within each dataset, we first simulate the binary exposure variable $W \sim Bernoulli(0.17)$. We then simulate the binary primary outcome $D$, with

$$E(D \mid X; \gamma) = \frac{e^{\gamma_0 + \gamma_1 w}}{1 + e^{\gamma_0 + \gamma_1 w}},$$

where $\gamma = [\gamma_0, \gamma_1] = [-1.4, 0.7]$. Finally, we simulate the binary secondary outcome $Y$, with

$$E(Y \mid X, D; \alpha) = \frac{e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}}{1 + e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}},$$

where $\alpha = [\alpha_0, \alpha_1, \alpha_2, \alpha_3] = [-1, 0.3, 0.25, 1]$. In each simulated dataset, the prevalence of the primary outcome $D$ and of the secondary outcome $Y$ is around 0.22 and 0.31, respectively. The number of individuals selected based on the study budget is $Cost / c_{per} = n = 3{,}000$.
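The closed-form quantities of Theorem 1, Proposition 1 and Remark 1 can be evaluated numerically for this simulation setting. The Python sketch below is a minimal illustration under stated assumptions: the intercepts are taken as $\gamma_0 = -1.4$ and $\alpha_0 = -1$ (the signs consistent with the reported prevalences of about 0.22 and 0.31), the expected proportions $P(D = d)$ stand in for $N_d / N$, and all function names are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_design(p_w1, gamma, alpha):
    """Return (theta, q, det) as in Remark 1 for binary W, D and Y.

    Assumption: q[d] = P(D = d) is used in place of N_d / N.
    """
    p, mu_t = {}, {}
    for w in (0, 1):
        pw = p_w1 if w == 1 else 1.0 - p_w1
        pd1 = expit(gamma[0] + gamma[1] * w)
        p[(w, 1)], p[(w, 0)] = pw * pd1, pw * (1.0 - pd1)   # joint p_wd
        for d in (0, 1):
            mu_t[(w, d)] = expit(alpha[0] + alpha[1] * w + alpha[2] * d + alpha[3] * w * d)
    # Marginal mean mu(w) = E(Y | W = w), averaging mu_tilde over D given W.
    mu = {w: (p[(w, 0)] * mu_t[(w, 0)] + p[(w, 1)] * mu_t[(w, 1)])
             / (p[(w, 0)] + p[(w, 1)]) for w in (0, 1)}
    # QMSE = squared bias + Bernoulli variance.
    qmse = {k: (m - mu[k[0]]) ** 2 + m * (1.0 - m) for k, m in mu_t.items()}
    a21 = p_w1 * mu[1] * (1.0 - mu[1])                      # a12 = a21 = a22
    a11 = a21 + (1.0 - p_w1) * mu[0] * (1.0 - mu[0])
    det = a11 * a21 - a21 * a21
    theta = {}
    for d in (0, 1):
        t1 = p[(1, d)] * qmse[(1, d)]
        t0 = p[(0, d)] * qmse[(0, d)]
        theta[d] = a21 ** 2 * (t1 + t0) + a11 ** 2 * t1 - 2.0 * a21 * a11 * t1
    q = {d: p[(0, d)] + p[(1, d)] for d in (0, 1)}
    return theta, q, det


def var_beta1(n0, n1, theta, q, det):
    """Var(beta1_hat) for a given allocation (n0 controls, n1 cases)."""
    return (theta[1] * q[1] / n1 + theta[0] * q[0] / n0) / det ** 2


def optimal_ratio(theta, q):
    """R_OD = sqrt(theta0 * q0 / (theta1 * q1))."""
    return math.sqrt(theta[0] * q[0] / (theta[1] * q[1]))


theta, q, det = binary_design(0.17, (-1.4, 0.7), (-1.0, 0.3, 0.25, 1.0))
r_od = optimal_ratio(theta, q)
n = 3000
n1 = n / (1.0 + r_od)
n0 = n - n1
print(round(r_od, 2), var_beta1(n0, n1, theta, q, det))
```

Under these assumptions the computed ratio should land close to the optimal ratio of 2.63 reported for this simulation, and evaluating `var_beta1` at other allocations (for example 1,500 cases and 1,500 controls) gives a direct comparison with the balanced design.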
Table 2.1 presents a comparison of $Var_{EMP}(\hat\beta_1)$, $Var_{GMM}(\hat\beta_1)$, and our proposed variance $Var_{DER}(\hat\beta_1)$ under different sampling ratios with the given parameters. The first column of the table represents the ratio of $n_0$ to $n_1$. The second column displays the empirical variance for each sampling design. The third column shows the average of the variances of the estimators of coefficients obtained using the Stata "gmm" command over 1,000 simulations. The final column shows our proposed variances under different sampling designs. The results in Table 2.1 clearly illustrate that our proposed $Var_{DER}(\hat\beta_1)$ is very close to $Var_{EMP}(\hat\beta_1)$ and $Var_{GMM}(\hat\beta_1)$ across all reasonable sampling ratios. It is evident that the balanced design is not the most efficient choice. By using our proposed optimal sampling formula, we obtained the optimal sampling ratio $R_{OD} = 2.63$.

Table 2.1: The comparison of $Var_{EMP}(\hat\beta_1)$, $Var_{GMM}(\hat\beta_1)$, and the proposed variance $Var_{DER}(\hat\beta_1)$ under different sampling designs.

n0 : n1     Var_EMP (E-2)   Var_GMM (E-2)   Var_DER (E-2)
1 : 1       1.239           1.182           1.179
2 : 1       0.981           0.998           0.996
2.63 : 1    0.989           0.984           0.981
3 : 1       0.957           0.989           0.985
4 : 1       0.981           1.019           1.016
1 : 2       1.729           1.657           1.657
1 : 3       2.238           2.164           2.160
1 : 4       2.796           2.680           2.669

2.5 Numerical illustration

We apply our proposed method to the secondary outcome analysis of the Pesticides and Sense of Smell (PASS) Study, which is an add-on study of the Agricultural Health Study (AHS). The PASS Study aims to better understand the relationship between high pesticide exposure events (HPEE) and olfactory impairment (OI). In the target AHS phase-4 cohort, participants were asked if they had lost their sense of smell. Some literature has shown an association between OI and cognitive decline (Yaffe et al., 2017; Shrestha et al., 2019; Dintica et al., 2019).
Thus, the investigators used the self-reported smell loss to define the sampling strata $D$ and mailed the selected participants the Cognitive Function Instrument (CFI) questionnaire, which is used to define a dichotomous outcome of interest, $Y$, the CFI-based cognitive complaint. We utilized our proposed optimal sampling ratio formula to obtain an efficient study design for the analysis of the secondary outcome under some reasonable scenarios of the strength of association between $D$ and $Y$. Certain parameters were derived from the cohort data, e.g., $p(W = 1) = 0.14$ and $\gamma = [-1.86, 0.397]$. The total cohort size is $N = N_0 + N_1 = 15{,}893 + 2{,}633 = 18{,}526$. We use a range of $\alpha$ values that are meaningful for the following scenarios: the prevalence of $Y$ is either 0.1 or 0.2; the association between $D$ and $Y$ among $W = 0$ is $OR_{YD \mid W=0} = [1.2, 1.4, 1.6]$; and among $W = 1$ it is $OR_{YD \mid W=1} = [1.5, 2.0, 2.5]$. We created Table 2.2 for these scenarios, with column 1 for the prevalence of $Y$, column 2 for $OR_{YD \mid W=0}$, column 3 for $OR_{YD \mid W=1}$, and the last column for the proposed optimal sampling ratio $R_{OD}$ based on the formula with these given parameters. The budget constraint is $25,000 for data collection, and the unit cost is $5.40. We can observe that as the association between $D$ and $Y$ increases, the optimal sampling ratio decreases, resulting in fewer subjects being sampled from the control group.

Table 2.2: Optimal sampling ratio $R_{OD}$ varies with the prevalence of the secondary outcome $Y$ and the association between $D$ and $Y$, while keeping other parameters fixed.

P(Y=1)   OR_YD|W=0   OR_YD|W=1   beta1    alpha0    alpha1   alpha2   alpha3    R_OD
0.1      1.2         1.5         0.659    -2.3      0.6      0.182    0.223     4.66
0.1      1.2         2.0         0.731    -2.35     0.6      0.182    0.511     4.35
0.1      1.2         2.5         0.790    -2.36     0.6      0.182    0.734     4.13
0.1      1.4         1.5         0.630    -2.36     0.6      0.336    0.069     4.62
0.1      1.4         2.0         0.712    -2.37     0.6      0.336    0.375     4.27
0.1      1.4         2.5         0.765    -2.39     0.6      0.336    0.580     4.07
0.1      1.6         1.5         0.611    -2.37     0.6      0.470    -0.065    4.57
0.1      1.6         2.0         0.684    -2.39     0.6      0.470    0.223     4.24
0.1      1.6         2.5         0.738    -2.41     0.6      0.470    0.446     4.02
0.2      1.2         1.5         0.656    -1.515    0.6      0.182    0.223     4.89
0.2      1.2         2.0         0.713    -1.525    0.6      0.182    0.511     4.70
0.2      1.2         2.5         0.765    -1.535    0.6      0.182    0.734     4.58
0.2      1.4         1.5         0.631    -1.535    0.6      0.336    0.069     4.84
0.2      1.4         2.0         0.698    -1.545    0.6      0.336    0.375     4.64
0.2      1.4         2.5         0.742    -1.555    0.6      0.336    0.580     4.53
0.2      1.6         1.5         0.610    -1.555    0.6      0.470    -0.065    4.81
0.2      1.6         2.0         0.670    -1.565    0.6      0.470    0.223     4.61
0.2      1.6         2.5         0.720    -1.575    0.6      0.470    0.446     4.49

2.6 Conclusion

Optimal sampling strategies have been discussed for two-stage (Breslow and Chatterjee, 1999; McNamee, 2005), or so-called two-phase, designs (Reilly, 1996; Breslow and Cain, 1988) for case-control studies. Breslow (2005) points out that two-phase designs and two-stage designs are the same study design with different terminologies. A two-stage case-control design involves determining exposure and outcome for a large sample, but covariates are measured only on a subsample (Hanley et al., 2005). The outcome of interest remains the same for both the first stage and the second stage. However, for secondary outcome analysis in a case-control study, the primary outcome and the secondary outcome are different. Since the study designs are different, our proposed method differs from the methods used in the papers mentioned above. The inference for secondary outcome analysis in case-control studies has been studied over the last decades, with numerous methods being proposed. However, there is no off-the-shelf method for the sampling design when the data are deliberately collected for the secondary outcome. In this paper, we proposed an optimal sampling strategy for the analysis of the secondary outcome using weighted estimating equations.
The term "optimal" refers to the allocation of cases and controls that minimizes the variance of the estimator of interest given an analytic method and a fixed sample size under a budget constraint. We derive the asymptotic variance-covariance matrix of estimators of coefficients using the "Sandwich" variance-covariance matrix of weighted estimating equations estimators. When the variance of the estimator of the coefficient is minimal, the power for the test on the exposure effect is maximal. We provide a sampling formula for achieving an efficient study design for a valid estimation strategy of the effect of interest, namely the inverse probability of sampling weighted estimating equations estimators. For different estimation strategies, the optimal sampling ratio might be different. To verify our provided formula, we conduct simulation studies. The results demonstrate that our formula performs well for both common primary outcomes and secondary outcomes. Interestingly, our findings indicate that the widely used balanced design is not always the most efficient choice for secondary outcome analysis study designs. Therefore, researchers should carefully calculate the sampling ratio when the purpose is a secondary outcome analysis.

Chapter 3 Optimal Sampling Strategies Using Case-Control Studies for Poisson Count Secondary Outcomes under Budget Constraints

3.1 Background

In this chapter, we expand upon our sampling methodology from Chapter 2 by including an explicit sampling ratio formula for case-control studies with count secondary outcomes. With this sampling formula, researchers will be able to achieve an efficient study design for their count secondary outcome analysis. The derivation of the optimal sampling ratio is based on estimating the variance of the estimator of the exposure effect using inverse probability weighted estimating equations.
Our proposed framework, which incorporates inverse probability estimating equations, offers an optimal formula that can accommodate Poisson distributed count data. This chapter is organized as follows: First, we introduce general notations. Second, we derive the variance formula of the exposure effect in Poisson distributed secondary outcomes. Third, we verify our proposed variance formula through Monte Carlo simulations. Finally, we draw conclusions based on our findings.

3.2 Notation and estimation

3.2.1 Notation

Suppose the study cohort (target population) consists of $N$ independent subjects. Let $D_i$ be the binary case-control primary outcome, where $D_i = 1$ or $0$ indicates the presence or absence of the disease. The population sizes of cases and controls are denoted by $N_1$ and $N_0$, respectively. We assume that $N_1$ and $N_0$ are known, and $N_1 + N_0 = N$. Let $Y_i$ represent the Poisson distributed count secondary outcome of interest. $S_i$ is the sampling indicator, with $S_i = 1$ indicating inclusion in the secondary outcome analysis, and $S_i = 0$ otherwise. We define $n_1$ as the unknown sample size selected from the case group and $n_0$ as the unknown sample size selected from the control group. The total number of subjects to be selected from the target population, denoted as $n$, is given by $n = n_1 + n_0$. Our objective is to derive an expression for the sampling ratio $n_0 / n_1$ under the study budget constraint, when we aim to examine the association between $Y_i$ and $W_i$ in the target population without conditioning on $D_i$. This will allow us to determine the exact sample sizes of cases and controls for the secondary outcome analysis. Let $W_i$ denote the binary exposure status, with $W_i = 1$ indicating exposure, and $W_i = 0$ otherwise. Denote by $X_i = (1, W_i)^T$ the $2 \times 1$ covariates vector for each subject $i$.
Similar to Chapter 2, we assume that the sampling probability from the study cohort for the secondary outcome analysis depends only on $D_i$, where

$$\Pr(S_i = 1 \mid D_i = d, Y_i, X_i) = \Pr(S_i = 1 \mid D_i = d) = \pi(d) = \frac{n_d}{N_d}, \quad d = 0, 1.$$

Let $\beta = (\beta_0, \beta_1)^T$ be the $2 \times 1$ coefficients vector. Since $Y_i$ is a Poisson distributed random variable, the mean model $E(Y_i \mid X_i)$ can be expressed as $E(Y_i \mid X_i) = \mu(X_i) = e^{X_i^T \beta}$. A valid estimator of $\beta$ can be obtained by solving the inverse of sampling probability weighted estimating equations:

$$\sum_{i=1}^{N} U_i(\beta) = 0,$$

where

$$U_i(\beta) = \frac{X_i S_i}{\pi(D_i)} \left[ Y_i - \mu(X_i; \beta) \right].$$

(For simplicity, we will not include the subscript $i$ in the following derivations.)

3.2.2 Variance estimation

In this scenario, we consider one exposure variable in the mean model; therefore, it is straightforward to see that the bottom-right element of the variance-covariance matrix represents $Var(\hat\beta_1)$. The general framework of the derivation of the variance-covariance matrix of the parameters of the inverse probability weighted estimating equations can be found in Chapter 2, Section 2. We obtain $Var(\hat\beta_1)$ by computing $A(\beta)^{-1} B(\beta) [A(\beta)^{-1}]^T$ directly. It is easy to verify that

$$A(\beta) = E\left[ -\frac{\partial}{\partial \beta^T} U(\beta) \right] = E\left[ \frac{S}{\pi(D)} \times e^{X^T \beta} X X^T \right]. \qquad (3.2.1)$$

Taking iterated expectations in (3.2.1), then

$$A(\beta) = E\left\{ E\left[ \frac{S}{\pi(D)} \times e^{X^T \beta} X X^T \,\Big|\, D \right] \right\}.$$

Note that $\pi(D) = \Pr(S = 1 \mid D = d)$ only depends on $D$, thus

$$A(\beta) = E\left\{ \frac{E(S \mid D)}{\pi(D)} \times E\left[ e^{X^T \beta} X X^T \mid D \right] \right\}.$$

Since $S$ is a binary variable, $E(S \mid D) = \Pr(S = 1 \mid D = d) = \pi(d)$ under the assumption that the sampling is only conditional on the primary outcome $D$, thus

$$A(\beta) = E\left\{ E\left[ e^{X^T \beta} X X^T \mid D \right] \right\} = E\left[ e^{X^T \beta} X X^T \right].$$

It is also easy to verify that

$$B(\beta) = E\left[ \frac{S^2}{[\pi(D)]^2} \left( Y - \mu(X) \right)^2 X X^T \right].$$
With further simplification of $B(\beta)$ (see Appendix B.1 for the explicit derivation of $B(\beta)$), we have

$$B(\beta) = E\left[ \frac{\left( \tilde\mu(X, D) - \mu(X) \right)^2 + \tilde\mu(X, D)}{\pi(D)} X X^T \right],$$

where $\tilde\mu(X, D; \alpha) = E(Y \mid X, D; \alpha) = e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}$. The parameter sets $\beta = (\beta_0, \beta_1)^T$ and $\alpha = (\alpha_0, \alpha_1, \alpha_2, \alpha_3)^T$ are the coefficient vectors associated with $\mu(X)$ and $\tilde\mu(X, D)$, respectively. We assume $\alpha$ are known parameters for the purpose of calculating the optimal sampling ratio.

We further calculate $A(\beta)$ and $B(\beta)$ by the rule of functional expectation.

$$A(\beta) = E\left[ e^{X^T \beta} X X^T \right] = p_{w=1} e^{\beta_0 + \beta_1} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + [1 - p_{w=1}] e^{\beta_0} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

$$= \begin{bmatrix} p_{w=1} e^{\beta_0 + \beta_1} + [1 - p_{w=1}] e^{\beta_0} & p_{w=1} e^{\beta_0 + \beta_1} \\ p_{w=1} e^{\beta_0 + \beta_1} & p_{w=1} e^{\beta_0 + \beta_1} \end{bmatrix} \equiv \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},$$

where

$$a_{11} = p_{w=1} e^{\beta_0 + \beta_1} + [1 - p_{w=1}] e^{\beta_0}, \quad a_{12} = a_{21} = a_{22} = p_{w=1} e^{\beta_0 + \beta_1},$$

and $p_{w=1}$ is the known prevalence of the exposure in the cohort.

Similarly for $B(\beta)$, we let $\tilde{\tilde\mu}(X, D) \equiv \left( \tilde\mu(X, D) - \mu(X) \right)^2 + \tilde\mu(X, D)$; then

$$B(\beta) = E\left[ \frac{1}{\pi(D)} \tilde{\tilde\mu}(X, D) X X^T \right]$$

$$= \frac{p_{11} \tilde{\tilde\mu}(1,1)}{n_1} N_1 \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + \frac{p_{10} \tilde{\tilde\mu}(1,0)}{n_0} N_0 \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + \frac{p_{01} \tilde{\tilde\mu}(0,1)}{n_1} N_1 \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} + \frac{p_{00} \tilde{\tilde\mu}(0,0)}{n_0} N_0 \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \equiv \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix},$$

$$b_{11} = \frac{p_{11} N_1 \tilde{\tilde\mu}(1,1)}{n_1} + \frac{p_{10} N_0 \tilde{\tilde\mu}(1,0)}{n_0} + \frac{p_{01} N_1 \tilde{\tilde\mu}(0,1)}{n_1} + \frac{p_{00} N_0 \tilde{\tilde\mu}(0,0)}{n_0},$$

$$b_{12} = b_{21} = b_{22} = \frac{p_{11} N_1 \tilde{\tilde\mu}(1,1)}{n_1} + \frac{p_{10} N_0 \tilde{\tilde\mu}(1,0)}{n_0},$$

and $p_{wd}$ is the known joint probability of the exposure and the primary outcome in the cohort. The variance-covariance matrix has the same form as in the binary secondary outcome case of Chapter 2.
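The entries of $A(\beta)$ and $B(\beta)$ above depend only on pre-specified quantities, so the sandwich product can be evaluated numerically and the allocation can be scanned directly. The following Python sketch is an illustration under stated assumptions: it uses the parameter values of the Section 3.4 simulation (exposure prevalence 0.17, $\alpha = (0.01, 0.05, 0.035, 0.11)$, $n = 3{,}000$), takes $\gamma = (-1.4, 0.7)$ with the intercept sign that matches the stated 22% prevalence of $D$, replaces $N_d / N$ by $P(D = d)$, and uses hypothetical function names.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def poisson_var_beta1(n0, n1, p_w1=0.17, gamma=(-1.4, 0.7),
                      alpha=(0.01, 0.05, 0.035, 0.11)):
    """Bottom-right entry of A^{-1} B A^{-1} / N for the Poisson mean model."""
    p, mu_t = {}, {}
    for w in (0, 1):
        pw = p_w1 if w == 1 else 1.0 - p_w1
        pd1 = expit(gamma[0] + gamma[1] * w)
        p[(w, 1)], p[(w, 0)] = pw * pd1, pw * (1.0 - pd1)   # joint p_wd
        for d in (0, 1):
            mu_t[(w, d)] = math.exp(alpha[0] + alpha[1] * w + alpha[2] * d + alpha[3] * w * d)
    # Marginal mean mu(w) = E(Y | W = w), so beta0 = log mu(0), beta0 + beta1 = log mu(1).
    mu = {w: (p[(w, 0)] * mu_t[(w, 0)] + p[(w, 1)] * mu_t[(w, 1)])
             / (p[(w, 0)] + p[(w, 1)]) for w in (0, 1)}
    q = {d: p[(0, d)] + p[(1, d)] for d in (0, 1)}          # assumed to equal N_d / N
    n = {0: n0, 1: n1}
    # Squared bias plus Poisson variance (the variance equals the mean).
    mu_tt = {k: (m - mu[k[0]]) ** 2 + m for k, m in mu_t.items()}
    a22 = p_w1 * mu[1]                                      # a12 = a21 = a22
    a11 = a22 + (1.0 - p_w1) * mu[0]
    # Entries of B / N: p_wd * (N_d / n_d) / N = p_wd * q_d / n_d.
    b22 = sum(p[(1, d)] * q[d] / n[d] * mu_tt[(1, d)] for d in (0, 1))
    b11 = b22 + sum(p[(0, d)] * q[d] / n[d] * mu_tt[(0, d)] for d in (0, 1))
    det = a11 * a22 - a22 * a22
    a_inv = [[a22 / det, -a22 / det], [-a22 / det, a11 / det]]
    b_over_n = [[b11, b22], [b22, b22]]
    sandwich = matmul2(matmul2(a_inv, b_over_n), a_inv)     # A is symmetric
    return sandwich[1][1]


n_total = 3000
best_n0 = min(range(1, n_total),
              key=lambda n0: poisson_var_beta1(n0, n_total - n0))
best_n1 = n_total - best_n0
print(best_n0, best_n1, round(best_n0 / best_n1, 2))
```

A grid search over integer allocations is used here instead of a closed-form minimizer so that the sketch depends only on the matrices $A(\beta)$ and $B(\beta)$; under the stated assumptions the minimizing allocation should fall near the values reported in Section 3.4.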
So

\[ Var(\hat{\beta}_1) = \frac{a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22}}{(\det)^2 N} \]

is the lower corner element of the variance-covariance matrix $\mathbf{A}(\boldsymbol{\beta})^{-1} \mathbf{B}(\boldsymbol{\beta}) [\mathbf{A}(\boldsymbol{\beta})^{-1}]^T$, where $\det = a_{11} a_{22} - a_{12} a_{21}$. For details, see Chapter 2, Theorem 1.

3.3 Optimal sampling strategy

We treat cost as the constraint when we minimize the variance to obtain the optimal sampling ratio. Define the optimal design ratio of controls to cases, $n_0 / n_1$, as $R_{OD}$. We can write $Var(\hat{\beta}_1)$ as a function of $R_{OD}$, that is,

\[ Var(\hat{\beta}_1) = \frac{f_1 R_{OD}^2 + (f_1 + f_2) R_{OD} + f_2}{f_3 R_{OD}}, \]

where

\[ f_1 = f_1(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, p_{w=1}, p_{wd}, N_1, N_0) = \zeta(1,1) + \iota(0,1) - \kappa(1,1) + \varsigma(1,1), \]

\[ f_2 = f_2(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, p_{w=1}, p_{wd}, N_1, N_0) = \zeta(1,0) + \iota(0,0) - \kappa(1,0) + \varsigma(1,0), \]

\[ f_3 = (\det \mathbf{A})^2 \, n N, \qquad w, d = 0, 1, \]

and

\[ \zeta(1,1) = a_{21}^2 \, p_{11} N_1 \tilde{\tilde{\mu}}(1,1), \qquad \zeta(1,0) = a_{21}^2 \, p_{10} N_0 \tilde{\tilde{\mu}}(1,0), \]

\[ \iota(0,1) = a_{21}^2 \, p_{01} N_1 \tilde{\tilde{\mu}}(0,1), \qquad \iota(0,0) = a_{21}^2 \, p_{00} N_0 \tilde{\tilde{\mu}}(0,0), \]

\[ \kappa(1,1) = 2 a_{21} a_{11} \, p_{11} N_1 \tilde{\tilde{\mu}}(1,1), \qquad \kappa(1,0) = 2 a_{21} a_{11} \, p_{10} N_0 \tilde{\tilde{\mu}}(1,0), \]

\[ \varsigma(1,1) = a_{11}^2 \, p_{11} N_1 \tilde{\tilde{\mu}}(1,1), \qquad \varsigma(1,0) = a_{11}^2 \, p_{10} N_0 \tilde{\tilde{\mu}}(1,0). \]

The definitions of $p_{w=1}$, $p_{wd}$, $N_1$, and $N_0$ can be found in Chapter 2. By applying simple algebra (Demidenko, 2008), we can determine that the minimum of $Var(\hat{\beta}_1)$ is achieved when

\[ R_{OD} = \sqrt{\frac{f_2}{f_1}}. \]

3.4 Simulation

To verify our proposed variance formula for Poisson-distributed count secondary outcomes, we conducted Monte Carlo simulations. The three candidate variances, $Var_{DER}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and $Var_{EMP}(\hat{\beta}_1)$, have the same definitions as in Chapter 2, Section 4.

We start by simulating 1,000 datasets, each containing 10,000 observations (the target population, i.e., cohort size $N$). Within each dataset, we first simulate the binary exposure variable $W$, with a prevalence of 17%. We then simulate the binary primary outcome $D$, with

\[ E(D \mid \mathbf{X}) = \frac{e^{\gamma_0 + \gamma_1 w}}{1 + e^{\gamma_0 + \gamma_1 w}}. \]

The parameters $\gamma_0$ and $\gamma_1$ are set to $-1.4$ and $0.7$, respectively.
So, the prevalence of the primary outcome $D$ in each simulated dataset is around 22%. Additionally, we simulate the Poisson-distributed secondary outcome $Y$, with $E(Y \mid \mathbf{X}, D) = e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}$. The parameters $\alpha_0$, $\alpha_1$, $\alpha_2$, and $\alpha_3$ are chosen as 0.01, 0.05, 0.035, and 0.11, respectively. Therefore, the expected value $E(Y)$ and the variance $Var(Y)$ are approximately equal to 1 in each simulated dataset. The true value of $\beta_1$ is 0.094, which can be estimated from a large simulated dataset.

Table 3.1 compares $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and our proposed variance $Var_{DER}(\hat{\beta}_1)$ under different sampling ratios using the given parameter set. The first column of the table gives the ratio of controls to cases. The second column displays the empirical variance of the estimator of the exposure effect for each sampling ratio. The third column shows the average of the variance of the estimator of the exposure effect obtained using the Stata “gmm” command across 1,000 simulation runs. The last column shows our proposed variance formula. When we calculate $Var_{DER}(\hat{\beta}_1)$, we assume that $N_1$ and $N_0$ are fixed, with $N_1 = N \times P(D = 1) = 2{,}205$ and $N_0 = 7{,}795$. Table 3.1 clearly illustrates that our proposed $Var_{DER}(\hat{\beta}_1)$ is very close to $Var_{EMP}(\hat{\beta}_1)$ and $Var_{GMM}(\hat{\beta}_1)$ across all reasonable sampling designs. By applying our optimal sampling formula with the pre-specified parameters, we find that the optimal sampling ratio is $R_{OD} = 2.64$. Given that the number of individuals that can be selected is $n = Cost / c_{per} = 3{,}000$, we can determine the number of controls as $n_0 = 2{,}176$ and the number of cases as $n_1 = 824$.

3.5 Conclusion

When planning research for a secondary case-control study, an important consideration is how to achieve an efficient design.
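The Section 3.4 calculation of $R_{OD} = \sqrt{f_2 / f_1}$ can be sketched in a few lines. This is a minimal illustration rather than the code used for the dissertation, and it rests on one assumption that should be read as such: the marginal means $\mu(w) = E(Y \mid W = w)$ are obtained analytically by averaging $\tilde{\mu}(w, d)$ over $P(D = d \mid W = w)$, instead of using the value of $\beta_1$ estimated from a large simulated dataset. All function and variable names are hypothetical.

```python
import math

# Settings taken from Section 3.4 of the text.
p_w1 = 0.17                        # prevalence of the exposure W
gamma0, gamma1 = -1.4, 0.7         # primary-outcome model for D given W
alpha = (0.01, 0.05, 0.035, 0.11)  # secondary-outcome model for Y given W, D
N1, N0 = 2205, 7795                # cases and controls in the cohort
n_total = 3000                     # n = Cost / c_per


def p_d1(w):
    """P(D = 1 | W = w) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * w)))


def p_joint(w, d):
    """Joint probability p_wd = P(W = w, D = d)."""
    pw = p_w1 if w == 1 else 1.0 - p_w1
    pd = p_d1(w) if d == 1 else 1.0 - p_d1(w)
    return pw * pd


def mu_tilde(w, d):
    """E(Y | W = w, D = d)."""
    a0, a1, a2, a3 = alpha
    return math.exp(a0 + a1 * w + a2 * d + a3 * w * d)


def mu(w):
    """Assumption: E(Y | W = w), averaging mu_tilde over D given W."""
    return mu_tilde(w, 1) * p_d1(w) + mu_tilde(w, 0) * (1.0 - p_d1(w))


def mu_tt(w, d):
    """Double-tilde mu: (mu_tilde - mu)^2 + mu_tilde (Poisson case)."""
    return (mu_tilde(w, d) - mu(w)) ** 2 + mu_tilde(w, d)


a21 = p_w1 * mu(1)
a11 = a21 + (1.0 - p_w1) * mu(0)


def f(d, N_d):
    """zeta + iota - kappa + varsigma for outcome group D = d."""
    zeta = a21 ** 2 * p_joint(1, d) * N_d * mu_tt(1, d)
    iota = a21 ** 2 * p_joint(0, d) * N_d * mu_tt(0, d)
    kappa = 2.0 * a21 * a11 * p_joint(1, d) * N_d * mu_tt(1, d)
    varsigma = a11 ** 2 * p_joint(1, d) * N_d * mu_tt(1, d)
    return zeta + iota - kappa + varsigma


f1, f2 = f(1, N1), f(0, N0)
r_od = math.sqrt(f2 / f1)           # optimal ratio of controls to cases
n1 = round(n_total / (1.0 + r_od))  # cases to sample
n0 = n_total - n1                   # controls to sample
print(f"R_OD = {r_od:.2f}, n0 = {n0}, n1 = {n1}")
```

Under these assumptions the sketch gives a ratio of about 2.64 and the allocation $n_0 = 2{,}176$, $n_1 = 824$, in line with the values reported in the text.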
In this chapter, instead of using simulation studies to estimate the optimal sampling ratio, we provide a closed-form optimal sampling formula for case-control studies with Poisson-distributed secondary outcomes, which is easy to understand and implement. Our proposed sampling formula can assist researchers in achieving efficient epidemiological study designs with count secondary outcomes. Our simulation studies show that the proposed variance formula accurately approximates both the empirical variance and the variance from the Stata “gmm” command.

Table 3.1: The comparison of the empirical variance $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and the proposed variance $Var_{DER}(\hat{\beta}_1)$ under the different sampling designs with a Poisson-distributed secondary outcome $Y$.

n0 : n1     Var_EMP (E-2)   Var_GMM (E-2)   Var_DER (E-2)
1 : 1       0.275           0.261           0.261
2 : 1       0.219           0.220           0.220
2.64 : 1    0.217           0.216           0.217
3 : 1       0.213           0.217           0.217
4 : 1       0.214           0.225           0.224
1 : 2       0.366           0.348           0.366
1 : 3       0.476           0.486           0.477
1 : 4       0.539           0.588           0.590

Chapter 4
Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes Using Doubly-Weighted Inverse Probability Estimating Equations under Budget Constraints

4.1 Background

In the previous chapters, we presented our optimal sampling strategies for secondary case-control studies with different types of secondary outcomes, aiming to achieve an efficient case-control study design. In those scenarios, we focused only on the association between a single exposure variable and the outcome of interest. In this chapter, we extend our proposed optimal sampling strategies to incorporate an additional confounder. The estimators of interest are obtained with doubly-weighted estimating equations; see Chapter 1 for details.

This chapter is organized as follows: First, we derive the variance formula for the estimator of the exposure effect in the context of doubly-weighted estimating equations.
Second, we present the optimal sampling formula, which minimizes the variance of the estimator of the exposure coefficient while accounting for a budget constraint. Third, we verify our proposed sampling formula through Monte Carlo simulations. Fourth, we apply the derived optimal sampling formula to an empirical study. Finally, we draw conclusions based on our findings.

4.2 Notation and estimation

4.2.1 Notation

This chapter employs notation similar to that of the previous chapters. Suppose the study cohort (target population) consists of a total of $N$ independent subjects. We denote the binary case-control primary outcome as $D$, where $D = 1$ indicates the presence of the disease and $D = 0$ indicates its absence. The numbers of cases and controls in the cohort are denoted by $N_1$ and $N_0$, respectively. We assume that $N_1$ and $N_0$ are known, and $N_1 + N_0 = N$. We let $Y_i$ represent the binary secondary outcome of interest. We let $S_i$ be the sampling indicator, with $S_i = 1$ indicating inclusion in the secondary outcome analysis, and $S_i = 0$ otherwise. We define $n_1$ and $n_0$ as the unknown sample sizes to be selected from the case and control groups, respectively. The total number of subjects selected from the target population is denoted by $n$, with $n = n_1 + n_0$. We define the total cost of the secondary outcome study as $Cost$, and we let $c_{per}$ represent the known cost per individual for samples from the case group or the control group. Our objective is to derive an expression for the optimal sampling ratio under a study budget constraint, where $n = Cost / c_{per} = n_1 + n_0$. Let $W_i$ denote the binary exposure status, with $W_i = 1$ indicating exposure, and $W_i = 0$ otherwise. Let $C_i$ denote the binary confounder, with $C_i = 0, 1$. We denote by $\mathbf{X}_i = (1, W_i, C_i)^T$ the $3 \times 1$ vector of covariates for each subject $i$. Following the approach in Chapter 1, we assume that the sampling probability from the study cohort for the secondary outcome analysis depends only on $D_i$.
Thus we have

\[ \Pr(S_i = 1 \mid D_i = d, Y_i, \mathbf{X}_i) = \Pr(S_i = 1 \mid D_i = d) = \pi(d) = \frac{n_d}{N_d}, \qquad d = 0, 1. \]

Let $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^T$ be a $3 \times 1$ vector of parameters of interest. Let the conditional expectation be $E(Y_i \mid \mathbf{X}_i; \boldsymbol{\beta}) = \mu(\mathbf{X}_i; \boldsymbol{\beta})$, where $g(\mu(\mathbf{X}_i; \boldsymbol{\beta})) = \mathbf{X}_i^T \boldsymbol{\beta}$. Since $Y_i$ is a binary outcome, we can use the logit link function for $g(\cdot)$. We let

\[ E(Y_i \mid \mathbf{X}_i; \boldsymbol{\beta}) = \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}}}. \]

We define $P(W_i = w \mid c)$ as the propensity score, where $1 / P(W_i = w \mid c)$ represents the inverse probability weight for the subjects in group $w$, for $w = 0, 1$. An appropriate estimator of $\boldsymbol{\beta}$ can be obtained by solving the doubly-weighted estimating equations

\[ \sum_{i=1}^{N} U_i(\boldsymbol{\beta}) = 0, \qquad U_i(\boldsymbol{\beta}) = \frac{\mathbf{X}_i S_i}{\pi(D_i) \times P(W_i = w_i \mid c_i)} \left[ Y_i - \mu(\mathbf{X}_i) \right]. \]

(For simplicity, we omit the subscript $i$ in the following content.)

4.2.2 Variance estimation and optimal sampling ratio formula derivation

To provide the optimal sampling ratio formula for the secondary outcome in a case-control design, we derive the variance of the estimator of the exposure effect for the doubly-weighted estimating equations. In this chapter, we extend the variance derivation method of the previous chapters to consider both the exposure variable and a binary confounder in the mean model. We denote the variance of the estimator of the exposure effect as $Var(\hat{\beta}_1)$. The variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ is therefore a $3 \times 3$ matrix, whose second diagonal element represents $Var(\hat{\beta}_1)$. We obtain $Var(\hat{\beta}_1)$ by computing $\mathbf{A}(\boldsymbol{\beta})^{-1} \mathbf{B}(\boldsymbol{\beta}) [\mathbf{A}(\boldsymbol{\beta})^{-1}]^T$ directly. It is easy to verify that

\[ \mathbf{A}(\boldsymbol{\beta}) = E\left[ \frac{1}{P(W = w \mid c)} \times \frac{e^{\mathbf{X}^T \boldsymbol{\beta}}}{\left( 1 + e^{\mathbf{X}^T \boldsymbol{\beta}} \right)^2} \mathbf{X} \mathbf{X}^T \right], \]

\[ \mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{\left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2}{\pi(D) \, [P(W = w \mid c)]^2} \, \mathbf{X} \mathbf{X}^T \right]. \]

(See Appendix C.1 and Appendix C.2 for the explicit derivations of $\mathbf{A}(\boldsymbol{\beta})$ and $\mathbf{B}(\boldsymbol{\beta})$.) Here

\[ \tilde{\mu}(\mathbf{X}, D; \boldsymbol{\alpha}) = E(Y \mid \mathbf{X}, D; \boldsymbol{\alpha}) = \frac{e^{\alpha_0 + \alpha_1 w + \alpha_2 c + \alpha_3 d + \alpha_4 w d}}{1 + e^{\alpha_0 + \alpha_1 w + \alpha_2 c + \alpha_3 d + \alpha_4 w d}} \]

represents the expectation of the secondary outcome given the primary outcome, the exposure variable, and the confounder. We assume that this is the true model, and that there is no interaction between $W$ and $C$. The coefficient vectors $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^T$ and $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4)^T$ are associated with $\mu(\mathbf{X})$ and $\tilde{\mu}(\mathbf{X}, D)$, respectively. It is important to note that the parameter set $\boldsymbol{\alpha}$ needs to be specified according to the related literature or expert experience in order to calculate the optimal sampling ratio using our method. The details of the derivation of the optimal sampling ratio $R_{OD}$ can be found in Appendix C.3.

4.3 Simulation

We conducted Monte Carlo simulations to evaluate the performance of the proposed variance formula presented in this chapter. The three candidate variances, $Var_{DER}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and $Var_{EMP}(\hat{\beta}_1)$, have the same definitions as in the previous chapters. We started by simulating 1,000 target population datasets, each containing 10,000 observations. Within each dataset, we first simulated the continuous age variable $C$, with a mean of 64.76 and a standard deviation of 10.82. Then we simulated the binary exposure variable $W$ conditional on $C$, with a prevalence of 17.2%. We then simulated the binary primary outcome $D$ with

\[ E(D \mid \mathbf{X}) = \frac{e^{\gamma_0 + \gamma_1 w + \gamma_2 c}}{1 + e^{\gamma_0 + \gamma_1 w + \gamma_2 c}}. \]

The parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$ are set at 0.1, 1.1, and 0.02, respectively. Thus, the prevalence of the primary outcome $D$ in each simulated dataset is around 24.3%. Additionally, we simulated the binary secondary outcome $Y$ with

\[ E(Y \mid \mathbf{X}, D) = \frac{e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 c + \alpha_4 w d}}{1 + e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 c + \alpha_4 w d}}. \]

The parameters $\alpha_0$, $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are chosen as 0.65, 0.1, 0.01, 1.001, and 0.368, respectively. Therefore, the prevalence of the secondary outcome is around 26.4% in each simulated dataset. The causal DAG for the data generating process can be found in Figure 2.

Figure 2. The causal DAG for the data generating process (nodes: $C$, $W$, $D$, $Y$, and $S = 1$), where $W$ is the exposure, $C$ is the confounder, $D$ is the primary outcome, $Y$ is the secondary outcome, and $S$ is the selection indicator.

Our method requires a dichotomous confounder, so we categorized age $C$ as a binary variable $L$, where $L = 0$ if age is less than or equal to 63, and $L = 1$ if age is greater than 64. The propensity score was estimated with $P(W = 1 \mid L)$ using the cohort data. To obtain the true value of $\boldsymbol{\beta}$, we simulated a large dataset with 10,000,000 observations. We then applied logistic regression for the outcome $Y$ on $W$ and $L$, resulting in an estimated coefficient $\hat{\beta}_1 = 0.177$. It is worth noting that our target parameter of interest is not the marginal causal odds ratio; instead, we are interested in the conditional odds ratio with a binary confounder $L$ in the outcome model, and the propensity score is also estimated using the binary confounder.

Table 4.1 compares $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and our proposed variance $Var_{DER}(\hat{\beta}_1)$ under different sampling ratios using the given parameter sets. The first column of the table gives the ratio of controls to cases. The second column gives the mean of $\hat{\beta}_1$ among the 1,000 simulations. The third column displays the empirical variance of the estimator of the exposure effect for each sampling ratio. The fourth column shows the average of the variance of the estimator of the exposure effect obtained using the Stata “gmm” command across 1,000 simulation runs. The last column shows the value obtained using our proposed method. When calculating $Var_{DER}(\hat{\beta}_1)$, we assume that $N_1$ and $N_0$ are fixed, with $N_1 = N \times P(D = 1) = 2{,}426$ and $N_0 = N - N_1 = 7{,}574$.
We note that this method is equivalent to using each simulated dataset to calculate $Var_{DER}(\hat{\beta}_1)$ and taking the average of $Var_{DER}(\hat{\beta}_1)$ among the 1,000 simulation runs. This equivalence arises because $N_1$ and $N_0$ are calculated based on the expected value of $D$. Table 4.1 clearly illustrates that our proposed $Var_{DER}(\hat{\beta}_1)$ is very close to $Var_{EMP}(\hat{\beta}_1)$ and $Var_{GMM}(\hat{\beta}_1)$ across all reasonable sampling ratios. By applying our optimal sampling formula with the pre-specified parameters, we find that the optimal sampling ratio is $R_{OD} = 1.90$. Given that the number of individuals that can be selected is $n = Cost / c_{per} = 3{,}000$, we can determine the number of controls as $n_0 = 1{,}965$ and the number of cases as $n_1 = 1{,}034$.

Table 4.1: The comparison of the empirical variance $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and the proposed variance $Var_{DER}(\hat{\beta}_1)$ under the different sampling designs with a binary secondary outcome $Y$.

n0 : n1     Mean of beta1_hat   Var_EMP   Var_GMM   Var_DER
1 : 1       0.175               0.013     0.013     0.012
1.9 : 1     0.175               0.011     0.012     0.011
2 : 1       0.177               0.011     0.012     0.011
3 : 1       0.171               0.011     0.012     0.012
4 : 1       0.176               0.014     0.013     0.013
1 : 2       0.176               0.016     0.017     0.017
1 : 3       0.166               0.023     0.022     0.021
1 : 4       0.180               0.028     0.027     0.026

4.4 Empirical illustration

We applied our proposed optimal sampling ratio formula to the Pesticides and Sense of Smell (PASS) Study to develop an efficient study design for analyzing the association between a binary secondary outcome and the exposure, given a confounder. Detailed information about the PASS study can be found in Chapter 2. In this study, the primary outcome of interest is olfactory impairment, denoted as $D$, while the secondary outcome of interest is cognitive decline, denoted as $Y$. The binary exposure variable is the high pesticide exposure event (HPEE), denoted as $W$, and age serves as the confounder, denoted as $L$.
Our purpose is to provide an efficient sampling design for examining the association between $Y$ and $W$ given $L$ in the target population, without conditioning on $D$, using doubly-weighted estimating equations. Our optimal sampling formula requires certain parameters to be specified in advance. Some parameters can be derived from the cohort data. For example, the prevalence of the exposure $W$ in the cohort is 14%. The total sample size in the case-control cohort is $N = N_0 + N_1 = 15{,}893 + 2{,}633 = 18{,}526$. The propensity score and the joint probability of $W$, $L$, and $D$ can also be estimated from the cohort data. The prevalence of $Y$ is fixed at either 0.1 or 0.2. The study budget is $25,000 for data collection and the unit cost is $5.40 for the sampling of cases and controls. We also vary the association between the primary outcome $D$ and the secondary outcome $Y$ in the exposure group, with $OR_{YD \mid W=1} = [1.5, 2.0, 2.5]$, and in the non-exposure group, with $OR_{YD \mid W=0} = [1.2, 1.4, 1.6]$.

Table 4.2 lists the optimal sampling ratios and the exact numbers of samples from cases and controls for the above parameter scenarios, obtained using our proposed optimal sampling formula. The advantage of this “T-table”-like format is that it gives researchers an intuitive understanding of the sampling ratio when they have knowledge of a range of parameters. Furthermore, the table demonstrates that a stronger association between $Y$ and $D$ results in a smaller sampling ratio, meaning that fewer subjects in the control group and more subjects in the case group will be sampled.

Table 4.2: Optimal sample ratio $R_{OD}$ varies with the prevalence of the secondary outcome $Y$ and the association between $D$ and $Y$, while keeping other parameters fixed.

P(Y=1)   OR_YD|W=1   OR_YD|W=0   alpha3   alpha4   R_OD    n0      n1
0.1      1.500       1.200       0.182    0.223    3.796   3,664   965
0.1      2.000       1.200       0.182    0.511    3.556   3,612   1,016
0.1      2.500       1.200       0.182    0.734    3.401   3,557   1,051
0.1      1.500       1.400       0.336    0.069    3.762   3,657   972
0.1      2.000       1.400       0.336    0.375    3.503   3,601   1,028
0.1      2.500       1.400       0.336    0.58     3.365   3,568   1,061
0.1      1.500       1.600       0.470    -0.065   3.735   3,651   978
0.1      2.000       1.600       0.470    0.223    3.489   3,598   1,031
0.1      2.500       1.600       0.470    0.446    3.331   3,560   1,069
0.2      1.500       1.200       0.182    0.223    3.957   3,695   934
0.2      2.000       1.200       0.182    0.511    3.845   3,674   955
0.2      2.500       1.200       0.182    0.734    3.788   3,662   967
0.2      1.500       1.400       0.336    0.069    3.934   3,691   938
0.2      2.000       1.400       0.336    0.375    3.815   3,667   961
0.2      2.500       1.400       0.336    0.58     3.768   3,658   971
0.2      1.500       1.600       0.470    -0.065   3.915   3,687   942
0.2      2.000       1.600       0.470    0.223    3.794   3,664   965
0.2      2.500       1.600       0.470    0.446    3.737   3,651   977

4.5 Conclusion

In Chapter 2 and Chapter 3, we proposed our optimal sampling formula for binary and count secondary outcomes with one exposure variable in the mean model. In this chapter, we extended our optimal sampling formula by considering a binary confounder in the mean model and using doubly-weighted estimating equations. We derived the variance of the estimator of the exposure effect for the doubly-weighted estimating equations and then minimized the variance formula with the cost as a constraint to obtain the optimal sampling formula. To verify our sampling formula, we conducted Monte Carlo simulations and compared our proposed variance formula with the empirical variance and the variance from the Stata "gmm" command. Our results showed that these candidate variances were very close. Finally, we applied our proposed optimal sampling formula to an empirical study and provided an efficient study design.

Chapter 5
Conclusion

In Chapter 2 and Chapter 3, we proposed our optimal sampling formulas using case-control studies for binary and count secondary outcomes with one exposure variable in the mean model.
In Chapter 4, we extended our optimal sampling formula by considering a binary confounder in the mean model and using doubly-weighted estimating equations. We derived the variance of the estimator of the exposure effect for both weighted estimating equations and doubly-weighted estimating equations. Then, we minimized the variance formulas with the study cost as a constraint to obtain the optimal sampling ratios. To verify our proposed optimal sampling ratio formulas, we conducted Monte Carlo simulations and compared our proposed variance formula with the empirical variance and the variance from the Stata "gmm" command. Our simulation results showed that these candidate variances were very close in all simulations in Chapter 2, Chapter 3, and Chapter 4. Finally, we applied our proposed optimal sampling formulas to empirical studies and provided efficient study designs.

There are several interesting directions for future research. First, our proposed sampling strategy considers the inclusion of one additional confounder in the binary case. When more confounders are added, the sampling formula may differ. Second, there are various types of count data, but our formula focuses on Poisson count data. Other count outcomes in epidemiology, such as the number of emergency room visits or the number of falls in nursing homes, may exhibit an excess of zero values. Sample size determination formulas have been proposed for zero-inflated Poisson-distributed outcomes in cluster randomized trials (Zhou et al., 2022). Thus, an intriguing direction would be to determine the optimal sampling ratio for zero-inflated count outcomes or hurdle outcomes in secondary case-control studies. Another interesting extension is to develop an R Shiny app that automatically calculates the optimal sampling ratio when researchers provide specific parameters.
Finally, we assumed that the cost is the same in the case and control groups for the secondary outcome analysis. In practice, there may be situations where the costs of data collection in cases and controls are unequal, in which case the cost constraint needs to be reconsidered. Moreover, in all three chapters, we treat study cost as the constraint, so the total number of subjects that can be selected is fixed. In the future, it would also be interesting to consider power as a constraint: we could obtain the optimal sampling ratio under a minimum required power and explore how variations in power affect the sampling ratio.

BIBLIOGRAPHY

Aban, I. B., Cutter, G. R., & Mavinga, N. (2009). Inferences and power analysis concerning two negative binomial distributions with an application to MRI lesion counts data. Computational Statistics & Data Analysis, 53(3), 820–833.
Amatya, A., Bhaumik, D., & Gibbons, R. D. (2013). Sample size determination for clustered count data. Statistics in Medicine, 32(24), 4162–4179.
Breslow, N. E. (2005). Case–Control Study, Two‐Phase. In P. Armitage & T. Colton (Eds.), Encyclopedia of Biostatistics (1st ed.). Wiley.
Breslow, N. E., & Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika, 75(1), 11–20.
Breslow, N. E., & Chatterjee, N. (1999). Design and analysis of two‐phase studies with binary outcome applied to Wilms tumour prognosis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(4), 457–468.
Brownstein, N. C., Cai, J., Smith, S., Diatchenko, L., Slade, G. D., & Bair, E. (2022). Modeling Secondary Phenotypes Conditional on Genotypes in Case–Control Studies. Stats, 5(1), 203–214.
Chen, H. Y., Kittles, R., & Zhang, W. (2013). Bias correction to secondary trait analysis with case-control design. Statistics in Medicine, 32(9), 1494–1508.
Demidenko, E. (2006). Sample size determination for logistic regression revisited. Statistics in Medicine, 26(18), 3385–3397.
Demidenko, E. (2008). Sample size and optimal design for logistic regression with binary interaction. Statistics in Medicine, 27(1), 36–46.
Dintica, C. S., Marseglia, A., Rizzuto, D., Wang, R., Seubert, J., Arfanakis, K., Bennett, D. A., & Xu, W. (2019). Impaired olfaction is associated with cognitive decline and neurodegeneration in the brain. Neurology, 92(7).
Ghosh, A., Wright, F. A., & Zou, F. (2013). Unified Analysis of Secondary Traits in Case–Control Association Studies. Journal of the American Statistical Association, 108(502), 566–576.
Hanley, J. A., Csizmadi, I., & Collet, J.-P. (2005). Two-Stage Case-Control Studies: Precision of Parameter Estimates and Considerations in Selecting Sample Size. American Journal of Epidemiology, 162(12), 1225–1234.
Jiang, Y., Scott, A. J., & Wild, C. J. (2006). Secondary analysis of case-control data. Statistics in Medicine, 25(8), 1323–1339.
Lee, A. J., McMurchy, L., & Scott, A. J. (1997). Re-using data from case-control studies. Statistics in Medicine, 16(12), 1377–1389.
Li, H., Gail, M. H., Berndt, S., & Chatterjee, N. (2010). Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genetic Epidemiology, 34(5), 427–433.
Lin, D. Y., & Zeng, D. (2009). Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology, 33(3), 256–265.
Lou, Y., Cao, J., Zhang, S., & Ahn, C. (2017). Sample size estimation for a two-group comparison of repeated count outcomes using GEE. Communications in Statistics - Theory and Methods, 46(14), 6743–6753.
Lyles, R. H., Lin, H.-M., & Williamson, J. M. (2007). A practical approach to computing power for generalized linear models with nominal, count, or ordinal responses. Statistics in Medicine, 26(7), 1632–1648.
Ma, Y., & Carroll, R. J. (2016). Semiparametric Estimation in the Secondary Analysis of Case–Control Studies.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1), 127–151.
McNamee, R. (2005). Optimal design and efficiency of two-phase case–control studies with error-prone and error-free exposure measures. Biostatistics, 6(4), 590–603.
Monsees, G. M., Tamimi, R. M., & Kraft, P. (2009). Genome-wide association scans for secondary traits using case-control samples. Genetic Epidemiology, 33(8), 717–728.
Morgenstern, H., & Winn, D. M. (1983). A method for determining the sampling ratio in epidemiologic studies. Statistics in Medicine, 2(3), 387–396.
Nagelkerke, N. J. D., Moses, S., Plummer, F. A., Brunham, R. C., & Fish, D. (1995). Logistic regression in case-control studies: The effect of using independent as dependent variables. Statistics in Medicine, 14(8), 769–775.
Nam, J.-M. (1973). Optimum Sample Sizes for the Comparison of the Control and Treatment. Biometrics, 29(1), 101.
Negi, A. (2024). Doubly weighted M-estimation for nonrandom assignment and missing outcomes. Journal of Causal Inference, 12(1), 20230016.
Reilly, M. (1996). Optimal Sampling Strategies for Two-Stage Studies. American Journal of Epidemiology, 143(1), 92–100.
Rettiganti, M., & Nagaraja, H. N. (2012). Power Analyses for Negative Binomial Models with Application to Multiple Sclerosis Clinical Trials. Journal of Biopharmaceutical Statistics, 22(2), 237–259.
Shrestha, S., Kamel, F., Umbach, D. M., Freeman, L. E. B., Koutros, S., Alavanja, M., Blair, A., Sandler, D. P., & Chen, H. (2019). High Pesticide Exposure Events and Olfactory Impairment among U.S. Farmers. Environmental Health Perspectives, 127(1), 017005.
Sofer, T., Cornelis, M., Kraft, P., & Tchetgen Tchetgen, E. (2017). Control function assisted IPW estimation with a secondary outcome in case-control studies. Statistica Sinica, 27(2), 785–804.
Song, X., Ionita-Laza, I., Liu, M., Reibman, J., & Wei, Y. (2016). A General and Robust Framework for Secondary Traits Analysis. Genetics, 202(4), 1329–1343.
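Tchetgen Tchetgen, E. J. (2014). A general regression framework for a secondary outcome in case-control studies. Biostatistics, 15(1), 117–128.
Wang, J., & Shete, S. (2012). Analysis of Secondary Phenotype Involving the Interactive Effect of the Secondary Phenotype and Genetic Variants on the Primary Disease. Annals of Human Genetics, 76(6), 484–499.
Wang, J., Zhang, S., & Ahn, C. (2020). Sample size calculation for count outcomes in cluster randomization trials with varying cluster sizes. Communications in Statistics - Theory and Methods, 49(1), 116–124.
Wei, J., Carroll, R. J., Müller, U. U., Van Keilegom, I., & Chatterjee, N. (2013). Robust estimation for homoscedastic regression in the secondary analysis of case-control data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 185–206.
Xing, C., McCarthy, J. M., Dupuis, J., Cupples, L. A., Meigs, J. B., Lin, X., & Allen, A. S. (2016). Robust analysis of secondary phenotypes in case-control genetic association studies. Statistics in Medicine, 35(23), 4226–4237.
Yaffe, K., Freimer, D., Chen, H., Asao, K., Rosso, A., Rubin, S., Tranah, G., Cummings, S., & Simonsick, E. (2017). Olfaction and risk of dementia in a biracial cohort of older adults. Neurology, 88(5), 456–462.
Zhou, Z., Li, D., & Zhang, S. (2022). Sample size calculation for cluster randomized trials with zero‐inflated count outcomes. Statistics in Medicine, 41(12), 2191–2204.

APPENDIX A
PROOFS OF CHAPTER 2

A.1 Derivation of $\mathbf{A}(\boldsymbol{\beta})$

Proof. We plug $\frac{\mathbf{X}_i S_i}{\pi(D_i)} [Y_i - \mu(\mathbf{X}_i)]$ into $\mathbf{A}(\boldsymbol{\beta})$, then we have

\[ \mathbf{A}(\boldsymbol{\beta}) = E\left[ -\frac{\partial}{\partial \boldsymbol{\beta}^T} \left\{ \frac{\mathbf{X}_i S_i}{\pi(D_i)} \left( Y_i - \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}}} \right) \right\} \right] = E\left[ \frac{S_i}{\pi(D_i)} \times \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{\left( 1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}} \right)^2} \mathbf{X}_i \mathbf{X}_i^T \right]. \]

Taking iterated expectations, we have

\[ \mathbf{A}(\boldsymbol{\beta}) = E\left\{ E\left[ \frac{S_i}{\pi(D_i)} \times \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{\left( 1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}} \right)^2} \mathbf{X}_i \mathbf{X}_i^T \,\Big|\, D_i \right] \right\}. \]

Note that $\pi(D_i) = \Pr(S_i = 1 \mid D_i = j)$ depends only on $D_i$, thus

\[ \mathbf{A}(\boldsymbol{\beta}) = E\left\{ \frac{E(S_i \mid D_i)}{\pi(D_i)} \times E\left[ \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{\left( 1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}} \right)^2} \mathbf{X}_i \mathbf{X}_i^T \,\Big|\, D_i \right] \right\}. \]

Since $S_i$ is a binary variable, $E(S_i \mid D_i) = \Pr(S_i = 1 \mid D_i = j) = \pi(D_i)$ under the assumption that the sampling probability from the study cohort for the secondary outcome analysis depends only on the primary outcome. Thus

\[ \mathbf{A}(\boldsymbol{\beta}) = E\left\{ E\left[ \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{\left( 1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}} \right)^2} \mathbf{X}_i \mathbf{X}_i^T \,\Big|\, D_i \right] \right\} = E\left[ \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{\left( 1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}} \right)^2} \mathbf{X}_i \mathbf{X}_i^T \right]. \qquad \square \]

A.2 Derivation of $\mathbf{B}(\boldsymbol{\beta})$

Proof. Similar to the derivation of $\mathbf{A}(\boldsymbol{\beta})$, we first plug $\frac{\mathbf{X}_i S_i}{\pi(D_i)} [Y_i - \mu(\mathbf{X}_i)]$ into $\mathbf{B}(\boldsymbol{\beta})$, then we have

\[ \mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{S_i^2}{[\pi(D_i)]^2} \left( Y_i - \mu(\mathbf{X}_i) \right)^2 \mathbf{X}_i \mathbf{X}_i^T \right]. \]

The sampling indicator $S_i$ is a binary variable, thus $S_i^2 = S_i$, and

\[ \mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{S_i}{[\pi(D_i)]^2} \left( Y_i - \mu(\mathbf{X}_i) \right)^2 \mathbf{X}_i \mathbf{X}_i^T \right]. \]

Then take the iterated expectations:

\[ \mathbf{B}(\boldsymbol{\beta}) = E\left\{ E\left[ \frac{S_i}{[\pi(D_i)]^2} \left( Y_i - \mu(\mathbf{X}_i) \right)^2 \mathbf{X}_i \mathbf{X}_i^T \,\Big|\, \mathbf{X}_i, D_i \right] \right\} = E\left\{ \frac{E(S_i \mid D_i)}{[\pi(D_i)]^2} \times E\left[ \left( Y_i - \mu(\mathbf{X}_i) \right)^2 \mathbf{X}_i \mathbf{X}_i^T \mid \mathbf{X}_i, D_i \right] \right\} = E\left\{ \frac{1}{\pi(D_i)} E\left[ \left( Y_i - \mu(\mathbf{X}_i) \right)^2 \mid \mathbf{X}_i, D_i \right] \mathbf{X}_i \mathbf{X}_i^T \right\}. \]

We can simplify the expectation $E\left[ (Y_i - \mu(\mathbf{X}_i))^2 \mid \mathbf{X}_i, D_i \right]$:

\[ E\left[ (Y_i - \mu(\mathbf{X}_i))^2 \mid \mathbf{X}_i, D_i \right] = E\left[ Y_i^2 - 2 Y_i \mu(\mathbf{X}_i) + \mu(\mathbf{X}_i)^2 \mid \mathbf{X}_i, D_i \right] = E\left( Y_i^2 \mid \mathbf{X}_i, D_i \right) - 2 E\left[ Y_i \mu(\mathbf{X}_i) \mid \mathbf{X}_i, D_i \right] + \mu(\mathbf{X}_i)^2 = E\left( Y_i^2 \mid \mathbf{X}_i, D_i \right) - 2 \mu(\mathbf{X}_i) E(Y_i \mid \mathbf{X}_i, D_i) + \mu(\mathbf{X}_i)^2. \]

Under the above study design setting, $Y_i$ is a binary secondary outcome, so

\[ \tilde{\mu}(\mathbf{X}_i, D_i) \equiv E\left( Y_i^2 \mid \mathbf{X}_i, D_i \right) = E(Y_i \mid \mathbf{X}_i, D_i). \]

Plugging $\tilde{\mu}(\mathbf{X}_i, D_i)$ into $E\left[ (Y_i - \mu(\mathbf{X}_i))^2 \mid \mathbf{X}_i, D_i \right]$, we have

\[ E\left[ (Y_i - \mu(\mathbf{X}_i))^2 \mid \mathbf{X}_i, D_i \right] = \tilde{\mu}(\mathbf{X}_i, D_i) - 2 \mu(\mathbf{X}_i) \tilde{\mu}(\mathbf{X}_i, D_i) + \mu(\mathbf{X}_i)^2 \]
\[ = \tilde{\mu}(\mathbf{X}_i, D_i) - 2 \mu(\mathbf{X}_i) \tilde{\mu}(\mathbf{X}_i, D_i) + \mu(\mathbf{X}_i)^2 + \tilde{\mu}(\mathbf{X}_i, D_i)^2 - \tilde{\mu}(\mathbf{X}_i, D_i)^2 \]
\[ = \left[ \tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i) \right]^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2. \]

Now substitute $\left[ \tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i) \right]^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2$ into $\mathbf{B}(\boldsymbol{\beta})$, so

\[ \mathbf{B}(\boldsymbol{\beta}) = E\left\{ \frac{1}{\pi(D_i)} \left[ \left( \tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i) \right)^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2 \right] \mathbf{X}_i \mathbf{X}_i^T \right\}. \]

Denote $\tilde{\tilde{\mu}}(\mathbf{X}_i, D_i) \equiv \left( \tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i) \right)^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2$, then

\[ \mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{\tilde{\tilde{\mu}}(\mathbf{X}_i, D_i)}{\pi(D_i)} \mathbf{X}_i \mathbf{X}_i^T \right]. \qquad \square \]

A.3 Derivation of $\mathbf{A}^{-1} \mathbf{B} [\mathbf{A}^{-1}]^T$

Proof. The inverse of the matrix $\mathbf{A}$ is

\[ \mathbf{A}^{-1} = \frac{1}{\det} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}, \qquad \det = a_{11} a_{22} - a_{12} a_{21}. \]

Then

\[ [\mathbf{A}^{-1}]^T = \frac{1}{\det} \begin{bmatrix} a_{22} & -a_{21} \\ -a_{12} & a_{11} \end{bmatrix}, \]

\[ \mathbf{A}^{-1} \mathbf{B} = \frac{1}{\det} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \times \begin{bmatrix} b_{11} & b_{12} \\ b_{12} & b_{22} \end{bmatrix} = \frac{1}{\det} \begin{bmatrix} a_{22} b_{11} - a_{12} b_{12} & a_{22} b_{12} - a_{12} b_{22} \\ -a_{21} b_{11} + a_{11} b_{12} & -a_{21} b_{12} + a_{11} b_{22} \end{bmatrix}, \]

and

\[ \mathbf{A}^{-1} \mathbf{B} [\mathbf{A}^{-1}]^T = \frac{1}{\det} \begin{bmatrix} a_{22} b_{11} - a_{12} b_{12} & a_{22} b_{12} - a_{12} b_{22} \\ -a_{21} b_{11} + a_{11} b_{12} & -a_{21} b_{12} + a_{11} b_{22} \end{bmatrix} \times \frac{1}{\det} \begin{bmatrix} a_{22} & -a_{21} \\ -a_{12} & a_{11} \end{bmatrix} = \left( \frac{1}{\det} \right)^2 \begin{bmatrix} \Omega_1 & \Omega_2 \\ \Omega_3 & \Omega_4 \end{bmatrix}, \]

where

\[ \Omega_1 = (a_{22} b_{11} - a_{12} b_{12}) a_{22} - (a_{22} b_{12} - a_{12} b_{22}) a_{12}, \]
\[ \Omega_2 = (a_{12} b_{12} - a_{22} b_{11}) a_{21} + (a_{22} b_{12} - a_{12} b_{22}) a_{11}, \]
\[ \Omega_3 = (a_{11} b_{12} - a_{21} b_{11}) a_{22} + (a_{21} b_{12} - a_{11} b_{22}) a_{12}, \]
\[ \Omega_4 = (a_{21} b_{11} - a_{11} b_{12}) a_{21} + (a_{11} b_{22} - a_{21} b_{12}) a_{11}. \qquad \square \]

A.4 Proof of Theorem 1

Proof.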
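By the rule of functional expectation, $E(g(w)) = \sum_{w} g(w)\, p(w)$, we have
\[
\begin{aligned}
\mathbf{A}(\boldsymbol{\beta})
&= E\left[ \frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}_i\mathbf{X}_i^T \right] \\
&= \frac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
 + \frac{[1 - p_{w=1}]\, e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix}
\dfrac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} + \dfrac{[1 - p_{w=1}]\, e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2} & \dfrac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} \\[2ex]
\dfrac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} & \dfrac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2}
\end{bmatrix}
\equiv \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},
\end{aligned}
\]
where
\[
a_{11} = \frac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} + \frac{[1 - p_{w=1}]\, e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2},
\qquad
a_{12} = a_{21} = a_{22} = \frac{p_{w=1}\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2}.
\]
Similarly, applying the same rule to $\mathbf{B}(\boldsymbol{\beta})$, with $\pi(1) = n_1/N_1$ and $\pi(0) = n_0/N_0$, we have
\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left[ \frac{1}{\pi(D_i)}\, \tilde{\tilde{\mu}}(\mathbf{X}_i, D_i)\, \mathbf{X}_i\mathbf{X}_i^T \right] \\
&= \frac{p_{11}\, \tilde{\tilde{\mu}}(1,1)}{n_1/N_1} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
 + \frac{p_{10}\, \tilde{\tilde{\mu}}(1,0)}{n_0/N_0} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
 + \frac{p_{01}\, \tilde{\tilde{\mu}}(0,1)}{n_1/N_1} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
 + \frac{p_{00}\, \tilde{\tilde{\mu}}(0,0)}{n_0/N_0} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} b_{11} & b_{12} \\ b_{12} & b_{22} \end{bmatrix},
\end{aligned}
\]
where
\[
b_{11} = \frac{p_{11} N_1\, \tilde{\tilde{\mu}}(1,1)}{n_1} + \frac{p_{01} N_1\, \tilde{\tilde{\mu}}(0,1)}{n_1} + \frac{p_{10} N_0\, \tilde{\tilde{\mu}}(1,0)}{n_0} + \frac{p_{00} N_0\, \tilde{\tilde{\mu}}(0,0)}{n_0},
\qquad
b_{12} = b_{22} = \frac{p_{11} N_1\, \tilde{\tilde{\mu}}(1,1)}{n_1} + \frac{p_{10} N_0\, \tilde{\tilde{\mu}}(1,0)}{n_0}.
\]
Here $p_{11}, p_{10}, p_{01}, p_{00}$ are the joint probabilities of the exposure variable $W_i$ and the primary outcome $D_i$, with the first index referring to $W_i$ and the second to $D_i$. These joint probabilities can be computed analytically once we specify $\boldsymbol{\gamma} = [\gamma_0, \gamma_1]$, the parameter relating $W_i$ and $D_i$:
\[
\begin{aligned}
p_{11} &= p(D = 1 \mid W = 1) \times p_{w=1} = \frac{e^{\gamma_0+\gamma_1}}{1 + e^{\gamma_0+\gamma_1}}\, p_{w=1}, \\
p_{10} &= p(D = 0 \mid W = 1) \times p_{w=1} = \frac{1}{1 + e^{\gamma_0+\gamma_1}}\, p_{w=1}, \\
p_{01} &= p(D = 1 \mid W = 0) \times (1 - p_{w=1}) = \frac{e^{\gamma_0}}{1 + e^{\gamma_0}}\, (1 - p_{w=1}), \\
p_{00} &= p(D = 0 \mid W = 0) \times (1 - p_{w=1}) = \frac{1}{1 + e^{\gamma_0}}\, (1 - p_{w=1}).
\end{aligned}
\]
We assume $N_1$ and $N_0$ are constant, with $N_1 = N E(D)$ and $N_0 = N - N_1$. It is easy to see that $Var(\hat{\beta}_1)$ is the lower-right element of the covariance matrix $\mathbf{A}^{-1}(\boldsymbol{\beta})\, \mathbf{B}(\boldsymbol{\beta}) \left[\mathbf{A}^{-1}(\boldsymbol{\beta})\right]^T / N$.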
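As a rough numerical illustration of the statement above, the following sketch builds $A$ and $B$ for a binary exposure and reads off the lower-right element of $A^{-1}B[A^{-1}]^T/N$. It is only a sketch under stated assumptions: the function names, the dictionary layout of `p_joint` and `mu_tilde`, and all input values are hypothetical, and $\mu(\mathbf{X})$ is taken to be the logistic mean $e^{\mathbf{X}^T\boldsymbol{\beta}}/(1+e^{\mathbf{X}^T\boldsymbol{\beta}})$.

```python
import math


def logistic_variance(eta):
    """Return e^eta / (1 + e^eta)^2, the weight inside A(beta)."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p * (1.0 - p)


def mu_double_tilde(mu_tilde, mu):
    """(mu_tilde - mu)^2 + mu_tilde - mu_tilde^2 for a binary outcome."""
    return (mu_tilde - mu) ** 2 + mu_tilde - mu_tilde ** 2


def var_beta1(beta0, beta1, p_w1, p_joint, mu_tilde, N1, N0, n1, n0):
    """Lower-right element of A^{-1} B [A^{-1}]^T / N for X = (1, W).

    p_joint[(w, d)]  : joint probability of W = w and D = d (assumed input)
    mu_tilde[(w, d)] : assumed value of E(Y | W = w, D = d)
    """
    N = N1 + N0
    # Entries of A(beta); a12 = a21 = a22 for a binary exposure.
    v1 = p_w1 * logistic_variance(beta0 + beta1)
    v0 = (1.0 - p_w1) * logistic_variance(beta0)
    a11, a12, a21, a22 = v1 + v0, v1, v1, v1
    # Entries of B(beta), using pi(1) = n1 / N1 and pi(0) = n0 / N0.
    inv_pi = {1: N1 / n1, 0: N0 / n0}
    term = {}
    for w in (0, 1):
        mu = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * w)))
        for d in (0, 1):
            m2 = mu_double_tilde(mu_tilde[(w, d)], mu)
            term[(w, d)] = p_joint[(w, d)] * m2 * inv_pi[d]
    b11 = sum(term.values())
    b12 = b22 = term[(1, 1)] + term[(1, 0)]
    # Explicit 2x2 inverse of A.
    det = a11 * a22 - a12 * a21
    inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
    B = [[b11, b12], [b12, b22]]
    # Second row of A^{-1} B, then the (2, 2) element of the sandwich.
    row = [inv[1][0] * B[0][j] + inv[1][1] * B[1][j] for j in range(2)]
    return (row[0] * inv[1][0] + row[1] * inv[1][1]) / N
```

Because every entry of $B$ is proportional to $1/n_1$ or $1/n_0$, halving both sample sizes doubles the returned variance, which gives a quick sanity check on an implementation.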
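Combining the derivation of $A^{-1}B\left[A^{-1}\right]^T$ with the conditions $a_{12} = a_{21} = a_{22}$ and $b_{12} = b_{22}$, we have
\[
Var(\hat{\beta}_1) = \frac{a_{11}^2 b_{22} + a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21}}{\det(A)^2\, N}.
\]
$\square$

A.5 Proof of Proposition 1

Proof. We define
\[
\begin{aligned}
\zeta(1,1) &= a_{21}^2\, p_{11} N_1\, \tilde{\tilde{\mu}}(1,1), &
\zeta(1,0) &= a_{21}^2\, p_{10} N_0\, \tilde{\tilde{\mu}}(1,0), \\
\iota(0,1) &= a_{21}^2\, p_{01} N_1\, \tilde{\tilde{\mu}}(0,1), &
\iota(0,0) &= a_{21}^2\, p_{00} N_0\, \tilde{\tilde{\mu}}(0,0), \\
\kappa(1,1) &= 2 a_{21} a_{11}\, p_{11} N_1\, \tilde{\tilde{\mu}}(1,1), &
\kappa(1,0) &= 2 a_{21} a_{11}\, p_{10} N_0\, \tilde{\tilde{\mu}}(1,0), \\
\varpi(1,1) &= a_{11}^2\, p_{11} N_1\, \tilde{\tilde{\mu}}(1,1), &
\varpi(1,0) &= a_{11}^2\, p_{10} N_0\, \tilde{\tilde{\mu}}(1,0).
\end{aligned}
\]
Then, plugging $a_{11}, a_{21}, a_{12}, a_{22}, b_{11}, b_{12}, b_{22}$ into $Var(\hat{\beta}_1)$, we have
\[
Var(\hat{\beta}_1) = \frac{\left[ \zeta(1,1) + \iota(0,1) - \kappa(1,1) + \varpi(1,1) \right] n_0 + \left[ \zeta(1,0) + \iota(0,0) - \kappa(1,0) + \varpi(1,0) \right] n_1}{\det(A)^2\, n_0 n_1 N}.
\]
To obtain the optimal design, we need to minimize $Var(\hat{\beta}_1)$ subject to the constraint $Cost = c_{per}(n_1 + n_0)$, which is equivalent to $n_0 + n_1 = n$ with $n = Cost / c_{per}$. We can write
\[
ROD \in \operatorname*{arg\,min}_{n_0 + n_1 = n} \frac{T_1 n_0 + T_2 n_1}{\det(A)^2\, n_0 n_1 N},
\]
where $T_1 = \zeta(1,1) + \iota(0,1) - \kappa(1,1) + \varpi(1,1)$ and $T_2 = \zeta(1,0) + \iota(0,0) - \kappa(1,0) + \varpi(1,0)$.

By using the Lagrange multiplier method, we solve
\[
\min \left\{ \frac{T_1 n_0 + T_2 n_1}{\det(A)^2\, n_0 n_1 N} + \lambda (n_0 + n_1 - n) \right\},
\]
where $\lambda$ is the Lagrange multiplier. Let
\[
L = \frac{T_1 n_0 + T_2 n_1}{\det(A)^2\, n_0 n_1 N} + \lambda (n_0 + n_1 - n)
  = \frac{T_1}{\det(A)^2 N\, n_1} + \frac{T_2}{\det(A)^2 N\, n_0} + \lambda (n_0 + n_1 - n),
\]
then
\[
\frac{\partial L}{\partial n_0} = -\frac{T_2}{\det(A)^2 N\, n_0^2} + \lambda,
\qquad
\frac{\partial L}{\partial n_1} = -\frac{T_1}{\det(A)^2 N\, n_1^2} + \lambda.
\]
Setting the above two equations equal to 0, we have
\[
n_0 = \sqrt{\frac{T_2}{\lambda \det(A)^2 N}},
\qquad
n_1 = \sqrt{\frac{T_1}{\lambda \det(A)^2 N}}.
\]
Therefore,
\[
ROD = \frac{n_0}{n_1} = \frac{\sqrt{\zeta(1,0) + \iota(0,0) - \kappa(1,0) + \varpi(1,0)}}{\sqrt{\zeta(1,1) + \iota(0,1) - \kappa(1,1) + \varpi(1,1)}}.
\]
$\square$

APPENDIX B

PROOFS OF CHAPTER 3

B.1 Derivation of $\mathbf{B}(\boldsymbol{\beta})$

Proof. Similar to the derivation of $\mathbf{A}(\boldsymbol{\beta})$, we first plug $\mathbf{X} \frac{S}{\pi(D)} [Y - \mu(\mathbf{X})]$ into $\mathbf{B}(\boldsymbol{\beta})$, then we have
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{S^2}{[\pi(D)]^2}\, (Y - \mu(\mathbf{X}))^2\, \mathbf{X}\mathbf{X}^T \right].
\]
The sampling indicator $S$ is a binary variable, thus $S^2 = S$, and
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{S}{[\pi(D)]^2}\, (Y - \mu(\mathbf{X}))^2\, \mathbf{X}\mathbf{X}^T \right].
\]
Then take the iterated expectations,
\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{ E\left[ \frac{S}{[\pi(D)]^2}\, (Y - \mu(\mathbf{X}))^2\, \mathbf{X}\mathbf{X}^T \,\middle|\, \mathbf{X}, D \right] \right\} \\
&= E\left\{ \frac{E(S \mid D)}{[\pi(D)]^2}\, E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right] \mathbf{X}\mathbf{X}^T \right\} \\
&= E\left\{ \frac{1}{\pi(D)}\, E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right] \mathbf{X}\mathbf{X}^T \right\}.
\end{aligned}
\]
We can simplify the expectation $E\left[ (Y - \mu(\mathbf{X}))^2 \mid \mathbf{X}, D \right]$:
\[
\begin{aligned}
E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right]
&= E\left[ Y^2 - 2Y\mu(\mathbf{X}) + \mu(\mathbf{X})^2 \,\middle|\, \mathbf{X}, D \right] \\
&= E\left[ Y^2 \mid \mathbf{X}, D \right] - 2E\left[ Y\mu(\mathbf{X}) \mid \mathbf{X}, D \right] + E\left[ \mu(\mathbf{X})^2 \mid \mathbf{X}, D \right] \\
&= E\left[ Y^2 \mid \mathbf{X}, D \right] - 2\mu(\mathbf{X})\, E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 .
\end{aligned}
\]
Under the above study design setting, $Y$ is a Poisson-distributed secondary outcome, so
\[
E\left[ Y^2 \mid \mathbf{X}, D \right] = Var(Y \mid \mathbf{X}, D) + \left( E(Y \mid \mathbf{X}, D) \right)^2 = E(Y \mid \mathbf{X}, D) + \left( E(Y \mid \mathbf{X}, D) \right)^2 .
\]
We define $\tilde{\mu}(\mathbf{X}, D) = E(Y \mid \mathbf{X}, D)$. Plugging $E\left[ Y^2 \mid \mathbf{X}, D \right]$ into $E\left[ (Y - \mu(\mathbf{X}))^2 \mid \mathbf{X}, D \right]$, we have
\[
\begin{aligned}
E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right]
&= E\left[ Y^2 \mid \mathbf{X}, D \right] - 2\mu(\mathbf{X})\, E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= E(Y \mid \mathbf{X}, D) + \left( E(Y \mid \mathbf{X}, D) \right)^2 - 2\mu(\mathbf{X})\, E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= \tilde{\mu}(\mathbf{X}, D) + \tilde{\mu}(\mathbf{X}, D)^2 - 2\mu(\mathbf{X})\tilde{\mu}(\mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= \left[ \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right]^2 + \tilde{\mu}(\mathbf{X}, D).
\end{aligned}
\]
Then we have
\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{ \frac{1}{\pi(D)}\, E\left[ \left[ \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right]^2 + \tilde{\mu}(\mathbf{X}, D) \,\middle|\, \mathbf{X}, D \right] \mathbf{X}\mathbf{X}^T \right\} \\
&= E\left\{ \frac{1}{\pi(D)} \left[ \left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D) \right] \mathbf{X}\mathbf{X}^T \right\}.
\end{aligned}
\]
Thus,
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{\left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D)}{\pi(D)}\, \mathbf{X}\mathbf{X}^T \right].
\]
$\square$

APPENDIX C

PROOFS OF CHAPTER 4

C.1 Derivation of $\mathbf{A}(\boldsymbol{\beta})$

In the main text, we defined $W$ as the binary exposure variable and $C$ as the binary confounder, and let $\mathbf{X} = (1, W, C)^T$. The quantity $\frac{1}{P(W = w \mid c)}$ is the inverse probability weight for the subjects in group $w$, for $w = 0, 1$. Plugging the doubly-weighted expression $\mathbf{X} \frac{S}{\pi(D) \times P(W \mid c)} [Y - \mu(\mathbf{X})]$ into $\mathbf{A}(\boldsymbol{\beta})$, we have
\[
\begin{aligned}
\mathbf{A}(\boldsymbol{\beta})
&= E\left[ -\frac{\partial}{\partial \boldsymbol{\beta}^T} \left\{ \mathbf{X}\, \frac{S}{\pi(D) \times P(W \mid c)} \left( Y - \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{1 + e^{\mathbf{X}^T\boldsymbol{\beta}}} \right) \right\} \right] \\
&= E\left[ \frac{S}{\pi(D) \times P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}\mathbf{X}^T \right].
\end{aligned}
\]
Taking iterated expectations, we have
\[
\mathbf{A}(\boldsymbol{\beta}) = E\left\{ E\left[ \frac{S}{\pi(D) \times P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}\mathbf{X}^T \,\middle|\, D \right] \right\}.
\]
Note that $\pi(D) = \Pr(S = 1 \mid D)$ depends only on $D$, thus
\[
\mathbf{A}(\boldsymbol{\beta}) = E\left\{ \frac{E(S \mid D)}{\pi(D)} \times E\left[ \frac{1}{P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}\mathbf{X}^T \,\middle|\, D \right] \right\}.
\]
Since $S$ is a binary variable, $E(S \mid D) = \Pr(S = 1 \mid D) = \pi(D)$, thus
\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{ E\left[ \frac{1}{P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}\mathbf{X}^T \,\middle|\, D \right] \right\}
= E\left[ \frac{1}{P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}\mathbf{X}^T \right].
\]

C.2 Derivation of $\mathbf{B}(\boldsymbol{\beta})$

Similar to the derivation of $\mathbf{A}(\boldsymbol{\beta})$, we first plug the doubly-weighted expression $\mathbf{X} \frac{S}{\pi(D) \times P(W \mid c)} [Y - \mu(\mathbf{X})]$ into $\mathbf{B}(\boldsymbol{\beta})$, then we have
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{S^2}{[\pi(D)]^2\, [P(W \mid c)]^2}\, (Y - \mu(\mathbf{X}))^2\, \mathbf{X}\mathbf{X}^T \right].
\]
The sampling indicator $S$ is a binary variable, thus $S^2 = S$, and
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{S}{[\pi(D)]^2\, [P(W \mid c)]^2}\, (Y - \mu(\mathbf{X}))^2\, \mathbf{X}\mathbf{X}^T \right].
\]
Then take the iterated expectations,
\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{ E\left[ \frac{S}{[P(W \mid c)]^2\, [\pi(D)]^2}\, (Y - \mu(\mathbf{X}))^2\, \mathbf{X}\mathbf{X}^T \,\middle|\, \mathbf{X}, D \right] \right\} \\
&= E\left\{ \frac{E(S \mid D)}{[\pi(D)]^2\, [P(W \mid c)]^2}\, E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right] \mathbf{X}\mathbf{X}^T \right\} \\
&= E\left\{ \frac{1}{\pi(D)\, [P(W \mid c)]^2}\, E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right] \mathbf{X}\mathbf{X}^T \right\}.
\end{aligned}
\]
We can simplify the expectation $E\left[ (Y - \mu(\mathbf{X}))^2 \mid \mathbf{X}, D \right]$:
\[
\begin{aligned}
E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right]
&= E\left[ Y^2 - 2Y\mu(\mathbf{X}) + \mu(\mathbf{X})^2 \,\middle|\, \mathbf{X}, D \right] \\
&= E\left[ Y^2 \mid \mathbf{X}, D \right] - 2E\left[ Y\mu(\mathbf{X}) \mid \mathbf{X}, D \right] + E\left[ \mu(\mathbf{X})^2 \mid \mathbf{X}, D \right] \\
&= E\left[ Y^2 \mid \mathbf{X}, D \right] - 2\mu(\mathbf{X})\, E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 .
\end{aligned}
\]
Under the above study design setting, $Y$ is a binary secondary outcome, so let
\[
\tilde{\mu}(\mathbf{X}, D) \equiv E\left[ Y^2 \mid \mathbf{X}, D \right] = E(Y \mid \mathbf{X}, D).
\]
Then we have
\[
\begin{aligned}
E\left[ (Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D \right]
&= \tilde{\mu}(\mathbf{X}, D) - 2\mu(\mathbf{X})\tilde{\mu}(\mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= \tilde{\mu}(\mathbf{X}, D)^2 - 2\mu(\mathbf{X})\tilde{\mu}(\mathbf{X}, D) + \mu(\mathbf{X})^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2 \\
&= \left[ \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right]^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2 .
\end{aligned}
\]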
So,
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{\left[ \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right]^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2}{[P(W \mid c)]^2\, \pi(D)}\, \mathbf{X}\mathbf{X}^T \right].
\]
Denote $\tilde{\tilde{\mu}}(\mathbf{X}, D) \equiv \left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2$; then
\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[ \frac{\tilde{\tilde{\mu}}(\mathbf{X}, D)}{[P(W \mid c)]^2\, \pi(D)}\, \mathbf{X}\mathbf{X}^T \right].
\]

C.3 Derivation of $A^{-1} B \left[A^{-1}\right]^T$

We denote by $p(w, c)$ the joint probability of $W$ and $C$. By the rule of functional expectation, we have
\[
\begin{aligned}
\mathbf{A}(\boldsymbol{\beta})
&= E\left[ \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c) \left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\, \mathbf{X}\mathbf{X}^T \right]
 = E\left[ \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c) \left(1 + e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2} \begin{bmatrix} 1 & w & c \\ w & w^2 & wc \\ c & wc & c^2 \end{bmatrix} \right] \\
&= \frac{p(w{=}1, c{=}1)}{P(w{=}1 \mid c{=}1)} \frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1 + e^{\beta_0+\beta_1+\beta_2}\right)^2} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
 + \frac{p(w{=}1, c{=}0)}{P(w{=}1 \mid c{=}0)} \frac{e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&\quad + \frac{p(w{=}0, c{=}1)}{P(w{=}0 \mid c{=}1)} \frac{e^{\beta_0+\beta_2}}{\left(1 + e^{\beta_0+\beta_2}\right)^2} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}
 + \frac{p(w{=}0, c{=}0)}{P(w{=}0 \mid c{=}0)} \frac{e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&= \frac{p(c{=}1)\, e^{\beta_0+\beta_1+\beta_2}}{\left(1 + e^{\beta_0+\beta_1+\beta_2}\right)^2} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
 + \frac{p(c{=}0)\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&\quad + \frac{p(c{=}1)\, e^{\beta_0+\beta_2}}{\left(1 + e^{\beta_0+\beta_2}\right)^2} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}
 + \frac{p(c{=}0)\, e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},
\end{aligned}
\]
where the last step uses $p(w, c) / P(w \mid c) = p(c)$. Multiplying each scalar into its matrix and summing entrywise gives
\[
\mathbf{A}(\boldsymbol{\beta}) = \begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix},
\]
where
\[
\begin{aligned}
a &= \frac{p(c{=}1)\, e^{\beta_0+\beta_1+\beta_2}}{\left(1 + e^{\beta_0+\beta_1+\beta_2}\right)^2} + \frac{p(c{=}0)\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2} + \frac{p(c{=}1)\, e^{\beta_0+\beta_2}}{\left(1 + e^{\beta_0+\beta_2}\right)^2} + \frac{p(c{=}0)\, e^{\beta_0}}{\left(1 + e^{\beta_0}\right)^2}, \\
b &= \frac{p(c{=}1)\, e^{\beta_0+\beta_1+\beta_2}}{\left(1 + e^{\beta_0+\beta_1+\beta_2}\right)^2} + \frac{p(c{=}0)\, e^{\beta_0+\beta_1}}{\left(1 + e^{\beta_0+\beta_1}\right)^2}, \\
c &= \frac{p(c{=}1)\, e^{\beta_0+\beta_1+\beta_2}}{\left(1 + e^{\beta_0+\beta_1+\beta_2}\right)^2} + \frac{p(c{=}1)\, e^{\beta_0+\beta_2}}{\left(1 + e^{\beta_0+\beta_2}\right)^2}, \\
d &= \frac{p(c{=}1)\, e^{\beta_0+\beta_1+\beta_2}}{\left(1 + e^{\beta_0+\beta_1+\beta_2}\right)^2}.
\end{aligned}
\]
Note that the matrix entry $c$ is distinct from the confounder value $c$ that appears inside $p(c{=}\cdot)$ and $P(w \mid c)$.

We now introduce some shorthand notation. We define $p_{wcd}$ as the joint probability of $W$, $C$, and $D$, and we define $p_{wc} = p(w \mid c)$.
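The structure of $\mathbf{A}(\boldsymbol{\beta})$ derived above can be checked numerically. The following is a minimal sketch, not the dissertation's code: `build_A`, its arguments, and the parameter values in the usage note are hypothetical. It accumulates the four $(w, c)$ cells directly, using the cancellation $p(w, c)/P(w \mid c) = p(c)$, so that only $p(c{=}1)$ is required as input.

```python
import math


def logistic_variance(eta):
    """Return e^eta / (1 + e^eta)^2."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p * (1.0 - p)


def build_A(beta, p_c1):
    """A(beta) for X = (1, W, C) with weights 1 / P(W = w | c).

    beta : (beta0, beta1, beta2)
    p_c1 : p(c = 1), the marginal probability of the confounder (assumed input)
    """
    b0, b1, b2 = beta
    A = [[0.0] * 3 for _ in range(3)]
    for w in (0, 1):
        for c in (0, 1):
            # p(w, c) / P(w | c) reduces to p(c) for every cell.
            p_c = p_c1 if c == 1 else 1.0 - p_c1
            weight = p_c * logistic_variance(b0 + b1 * w + b2 * c)
            x = (1.0, float(w), float(c))
            for i in range(3):
                for j in range(3):
                    A[i][j] += weight * x[i] * x[j]
    return A
```

For example, `build_A((-1.0, 0.5, 0.3), 0.4)` should return a symmetric matrix whose (1,2) and (2,2) entries coincide (both equal $b$) and whose (1,3) and (3,3) entries coincide (both equal $c$), matching the pattern in the closed form.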
Using the rule of functional expectation on $\mathbf{B}(\boldsymbol{\beta})$, we have
\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left[ \frac{\tilde{\tilde{\mu}}(W, C, D)}{[P(W \mid c)]^2\, \pi(D)}\, \mathbf{X}\mathbf{X}^T \right]
 = E\left[ \frac{\tilde{\tilde{\mu}}(W, C, D)}{[P(W \mid c)]^2\, \pi(D)} \begin{bmatrix} 1 & w & c \\ w & w^2 & wc \\ c & wc & c^2 \end{bmatrix} \right] \\
&= \left[ \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\, \pi(1)} + \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\, \pi(0)} \right] \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
 + \left[ \frac{p_{101}\, \tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\, \pi(1)} + \frac{p_{100}\, \tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\, \pi(0)} \right] \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&\quad + \left[ \frac{p_{011}\, \tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\, \pi(1)} + \frac{p_{010}\, \tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\, \pi(0)} \right] \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}
 + \left[ \frac{p_{001}\, \tilde{\tilde{\mu}}(0,0,1)}{p_{00}^2\, \pi(1)} + \frac{p_{000}\, \tilde{\tilde{\mu}}(0,0,0)}{p_{00}^2\, \pi(0)} \right] \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} f & g & h \\ g & i & j \\ h & j & k \end{bmatrix},
\end{aligned}
\]
where
\[
\begin{aligned}
f &= \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\, \pi(1)} + \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\, \pi(0)}
   + \frac{p_{101}\, \tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\, \pi(1)} + \frac{p_{100}\, \tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\, \pi(0)} \\
  &\quad + \frac{p_{011}\, \tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\, \pi(1)} + \frac{p_{010}\, \tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\, \pi(0)}
   + \frac{p_{001}\, \tilde{\tilde{\mu}}(0,0,1)}{p_{00}^2\, \pi(1)} + \frac{p_{000}\, \tilde{\tilde{\mu}}(0,0,0)}{p_{00}^2\, \pi(0)}, \\
g = i &= \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\, \pi(1)} + \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\, \pi(0)}
   + \frac{p_{101}\, \tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\, \pi(1)} + \frac{p_{100}\, \tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\, \pi(0)}, \\
h = k &= \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\, \pi(1)} + \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\, \pi(0)}
   + \frac{p_{011}\, \tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\, \pi(1)} + \frac{p_{010}\, \tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\, \pi(0)}, \\
j &= \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\, \pi(1)} + \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\, \pi(0)}.
\end{aligned}
\]
We can see that $i = g$ and $h = k$. Thus, we can write
\[
B = \begin{bmatrix} f & g & h \\ g & g & j \\ h & j & h \end{bmatrix}.
\]
Our target is the second diagonal element of $A^{-1} B \left[A^{-1}\right]^T$. We then divide this element by $N$ to obtain the variance of the estimator of the exposure effect, and we minimize this variance to get the optimal sampling ratio.

To compute $A^{-1}$ for
\[
A = \begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix},
\]
we need $\det(A)$ and $\operatorname{adj}(A)$; the details follow. The matrix of minors of $A$ is
\[
\begin{bmatrix}
bc - d^2 & bc - cd & bd - bc \\
bc - cd & ac - c^2 & ad - bc \\
bd - bc & ad - bc & ab - b^2
\end{bmatrix}.
\]
The cofactor matrix of $A$ is
\[
\begin{bmatrix}
bc - d^2 & cd - bc & bd - bc \\
cd - bc & ac - c^2 & bc - ad \\
bd - bc & bc - ad & ab - b^2
\end{bmatrix}.
\]
By transposing the cofactor matrix, we obtain the adjoint matrix of $A$, which is the same as the cofactor matrix because it is symmetric. The determinant of $A$ is
\[
\det(A) = abc + 2bcd - ad^2 - bc^2 - b^2c .
\]
Thus,
\[
A^{-1} = \frac{1}{\det(A)} \operatorname{adj}(A)
= \frac{1}{abc + 2bcd - ad^2 - bc^2 - b^2c}
\begin{bmatrix}
bc - d^2 & cd - bc & bd - bc \\
cd - bc & ac - c^2 & bc - ad \\
bd - bc & bc - ad & ab - b^2
\end{bmatrix}
= \begin{bmatrix} E & G & F \\ G & H & I \\ F & I & J \end{bmatrix},
\]
where
\[
\begin{aligned}
E &= \frac{bc - d^2}{abc + 2bcd - ad^2 - bc^2 - b^2c}, &
F &= \frac{bd - bc}{abc + 2bcd - ad^2 - bc^2 - b^2c}, \\
G &= \frac{cd - bc}{abc + 2bcd - ad^2 - bc^2 - b^2c}, &
H &= \frac{ac - c^2}{abc + 2bcd - ad^2 - bc^2 - b^2c}, \\
I &= \frac{bc - ad}{abc + 2bcd - ad^2 - bc^2 - b^2c}, &
J &= \frac{ab - b^2}{abc + 2bcd - ad^2 - bc^2 - b^2c}.
\end{aligned}
\]
Further,
\[
A^{-1}B = \begin{bmatrix} E & G & F \\ G & H & I \\ F & I & J \end{bmatrix} \times \begin{bmatrix} f & g & h \\ g & g & j \\ h & j & h \end{bmatrix}.
\]
By simple computation, the second row of $A^{-1}B$ is
\[
\begin{bmatrix} Gf + Hg + Ih, & Gg + Hg + Ij, & Gh + Hj + Ih \end{bmatrix}.
\]
Since
\[
\left[A^{-1}\right]^T = \begin{bmatrix} E & G & F \\ G & H & I \\ F & I & J \end{bmatrix},
\]
the second diagonal element of $A^{-1} B \left[A^{-1}\right]^T$ is
\[
G^2 f + 2GHg + 2GIh + H^2 g + 2HIj + I^2 h .
\]
So,
\[
Var(\hat{\beta}_1) = \frac{G^2 f + 2GHg + 2GIh + H^2 g + 2HIj + I^2 h}{N}.
\]
Substituting $\pi(1) = n_1/N_1$ and $\pi(0) = n_0/N_0$ and placing every term over the common denominator $n_1 n_0$, each of $f$, $g$, $h$, $j$ takes the form
\[
\sum_{(w,c) \in \mathcal{S}} \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)\, N_1 n_0 + p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)\, N_0 n_1}{p_{wc}^2\, n_1 n_0},
\]
with $\mathcal{S}_f = \{(1,1), (1,0), (0,1), (0,0)\}$, $\mathcal{S}_g = \{(1,1), (1,0)\}$, $\mathcal{S}_h = \{(1,1), (0,1)\}$, and $\mathcal{S}_j = \{(1,1)\}$. The six terms in the numerator of $Var(\hat{\beta}_1)$ are therefore
\[
\begin{aligned}
G^2 f &= G^2 \sum_{(w,c) \in \mathcal{S}_f} \left[ \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)}{p_{wc}^2\, \pi(1)} + \frac{p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)}{p_{wc}^2\, \pi(0)} \right]
 = G^2 \sum_{(w,c) \in \mathcal{S}_f} \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)\, N_1 n_0 + p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)\, N_0 n_1}{p_{wc}^2\, n_1 n_0}, \\
2GHg &= 2GH \sum_{(w,c) \in \mathcal{S}_g} \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)\, N_1 n_0 + p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)\, N_0 n_1}{p_{wc}^2\, n_1 n_0}, \\
2GIh &= 2GI \sum_{(w,c) \in \mathcal{S}_h} \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)\, N_1 n_0 + p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)\, N_0 n_1}{p_{wc}^2\, n_1 n_0}, \\
H^2 g &= H^2 \sum_{(w,c) \in \mathcal{S}_g} \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)\, N_1 n_0 + p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)\, N_0 n_1}{p_{wc}^2\, n_1 n_0}, \\
2HIj &= 2HI\, \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)\, N_1 n_0 + p_{110}\, \tilde{\tilde{\mu}}(1,1,0)\, N_0 n_1}{p_{11}^2\, n_1 n_0}, \\
I^2 h &= I^2 \sum_{(w,c) \in \mathcal{S}_h} \frac{p_{wc1}\, \tilde{\tilde{\mu}}(w,c,1)\, N_1 n_0 + p_{wc0}\, \tilde{\tilde{\mu}}(w,c,0)\, N_0 n_1}{p_{wc}^2\, n_1 n_0}.
\end{aligned}
\]
After simplification, we can rewrite
\[
Var(\hat{\beta}_1) = \frac{\zeta_0 n_0 + \zeta_1 n_1}{n_0 n_1 N},
\]
where
\[
\begin{aligned}
\zeta_0 &= G^2 \left[ \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2} + \frac{p_{101}\, \tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2} + \frac{p_{011}\, \tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2} + \frac{p_{001}\, \tilde{\tilde{\mu}}(0,0,1)}{p_{00}^2} \right] N_1
 + 2GH \left[ \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2} + \frac{p_{101}\, \tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2} \right] N_1 \\
&\quad + 2GI \left[ \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2} + \frac{p_{011}\, \tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2} \right] N_1
 + H^2 \left[ \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2} + \frac{p_{101}\, \tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2} \right] N_1 \\
&\quad + 2HI\, \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2}\, N_1
 + I^2 \left[ \frac{p_{111}\, \tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2} + \frac{p_{011}\, \tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2} \right] N_1, \\[1ex]
\zeta_1 &= G^2 \left[ \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2} + \frac{p_{100}\, \tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2} + \frac{p_{010}\, \tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2} + \frac{p_{000}\, \tilde{\tilde{\mu}}(0,0,0)}{p_{00}^2} \right] N_0
 + 2GH \left[ \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2} + \frac{p_{100}\, \tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2} \right] N_0 \\
&\quad + 2GI \left[ \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2} + \frac{p_{010}\, \tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2} \right] N_0
 + H^2 \left[ \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2} + \frac{p_{100}\, \tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2} \right] N_0 \\
&\quad + 2HI\, \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2}\, N_0
 + I^2 \left[ \frac{p_{110}\, \tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2} + \frac{p_{010}\, \tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2} \right] N_0 .
\end{aligned}
\]
According to Proposition 1 in Chapter 2, the optimal sampling ratio of $n_0$ to $n_1$ is given by
\[
ROD = \sqrt{\frac{\zeta_1}{\zeta_0}}.
\]
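The final allocation rule can be illustrated with a short sketch. It assumes that $\zeta_0$ and $\zeta_1$ have already been computed and that the budget fixes the total sample size $n = n_0 + n_1$; the function names and the numbers in the usage note are hypothetical and are not taken from the dissertation.

```python
import math


def optimal_allocation(zeta0, zeta1, n):
    """Split a total sample size n into controls n0 and cases n1.

    Uses ROD = n0 / n1 = sqrt(zeta1 / zeta0) together with n0 + n1 = n.
    """
    rod = math.sqrt(zeta1 / zeta0)
    n1 = n / (1.0 + rod)
    n0 = n - n1
    return rod, n0, n1


def variance(zeta0, zeta1, n0, n1, N):
    """Var(beta1_hat) = (zeta0 * n0 + zeta1 * n1) / (n0 * n1 * N)."""
    return (zeta0 * n0 + zeta1 * n1) / (n0 * n1 * N)
```

With the illustrative values `zeta0 = 4`, `zeta1 = 1`, and `n = 300`, the ratio is 0.5, giving 100 controls and 200 cases; moving one subject from either group to the other increases `variance`, consistent with the Lagrange multiplier solution.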