OPTIMAL SAMPLING STRATEGIES USING CASE-CONTROL STUDIES FOR BINARY
SECONDARY OUTCOMES UNDER BUDGET CONSTRAINTS

By

Liang Wang

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Biostatistics—Doctor of Philosophy

2024

ABSTRACT

A case-control study is efficient for investigating the association between outcomes and exposures. After conducting the primary outcome analysis, researchers can utilize the existing case-control study data to perform a secondary outcome analysis. Several methods have been proposed for analyzing secondary outcomes in case-control studies over the past few decades, but few of them have focused on the study design aspect. We propose optimal sampling strategies under a budget constraint for case-control studies with binary and Poisson secondary outcomes. We then extend our optimal sampling strategy by considering a confounder and derive the parameter of interest using doubly-weighted estimating equations. The term "optimal" refers to minimizing the variance of the estimator of the parameter of interest. We elucidate our proposed methods by developing the asymptotic variance of the estimator of the coefficient using weighted estimating equations and doubly-weighted estimating equations. Furthermore, we derive the optimal sampling ratio formulas through the Lagrange multiplier method based on certain monetary constraints. We verify our proposed methods through Monte Carlo simulation studies. Additionally, we apply our methods to empirical epidemiological studies that motivated the method development.

Copyright by
LIANG WANG
2024

I dedicate this dissertation to my family and friends.


ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my dissertation committee members: Dr. Zhehui

Luo, Dr. Chenxi Li, Dr. Honglei Chen, and Dr. Yuehua Cui for their guidance throughout my

dissertation research.

I would also like to thank Michigan State University Institute of Health Policy for their support

of my PhD graduate assistantship.

I would like to extend my appreciation to my family and friends for their love and emotional

support during my PhD journey.


TABLE OF CONTENTS

Chapter 1    Introduction

Chapter 2    Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes under Budget Constraints

Chapter 3    Optimal Sampling Strategies Using Case-Control Studies for Poisson Count Secondary Outcomes under Budget Constraints

Chapter 4    Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes Using Doubly-Weighted Inverse Probability Estimating Equations under Budget Constraints

Chapter 5    Conclusion

BIBLIOGRAPHY

APPENDIX A PROOFS OF CHAPTER 2

APPENDIX B PROOFS OF CHAPTER 3

APPENDIX C PROOFS OF CHAPTER 4

Chapter 1 Introduction

The case-control study is designed to efficiently investigate the association between rare outcomes

and exposures. To form a case group, a sample of individuals with the disease is randomly selected

from the target population, while a sample of individuals without the disease is selected to form a

control group. The primary outcome is the disease by which the caseness is defined. Researchers

can use an existing case-control study dataset to conduct secondary outcome analysis and examine

the association between secondary outcomes and covariates.

Numerous methods have been proposed for analyzing secondary outcomes in case-control stud-

ies over the past few decades. These methods include analyzing controls only (Nagelkerke et al.,

1995) or cases only (Li et al., 2010), and conducting joint analysis of cases and controls while

adjusting for the primary outcome (Lee et al., 1997). However, some of these methods have been

considered naive as they are valid only under certain circumstances, such as a rare disease assump-

tion (Li et al., 2010), and may produce invalid inferences. When considering the funding limitation

for the secondary outcome analysis, the data for the secondary outcome analysis needs to be sam-

pled from the cohort. Because the sampling for the secondary outcome analysis is taken from the

cohort, the associations between exposure and the secondary outcome based on these samples can

differ from those in the general population (Lin and Zeng, 2009). To overcome this issue, more

methods that can provide valid inference have been proposed, including likelihood methods (Lin

and Zeng, 2009; Jiang et al., 2006; Ghosh et al., 2013; Brownstein et al., 2022), weighted esti-

mating equations methods (Monsees et al., 2009; Xing et al., 2016; Song et al., 2016; Sofer et al.,

2017), bias correction methods (Wang and Shete, 2012; Chen et al., 2013), and semi-parametric

methods (Wei et al., 2013; Tchetgen Tchetgen, 2014; Ma and Carroll, 2016).

While current methodological studies on secondary outcomes focus on inferential procedures, the corresponding sampling strategies for secondary outcome analysis in case-control studies have not been investigated. In Chapter 2, we propose an optimal sampling strategy for a case-control study with a binary secondary outcome. We derive the variance of the estimator of the exposure effect for the inverse probability weighted estimating equations and minimize this variance to obtain the optimal ratio of the number of sampled controls to the number of sampled cases, subject to study cost constraints.

In Chapter 2, the secondary outcome is presented as a binary variable. However, in reality,

there are situations where the secondary outcome of interest is not binary, but a count variable.

For instance, in a study by Tchetgen Tchetgen (2014), the secondary outcome of interest was the

number of live births, which is typically considered a count outcome. Similarly, in the Pesticides

and Sense of Smell (PASS) Study (Shrestha et al., 2019), the cognitive decline scores in the survey

could also be treated as a count variable.

Sample size calculation for count data has been extensively investigated. For instance, Lou

et al. (2017) derived an analytic sample size formula for comparing rates of change between multi-

ple treatment groups with repeatedly measured count outcomes using generalized estimating equa-

tions. Amatya et al. (2013) provided simple sample size expressions for determining the number

of clusters in the context of multi-center randomized clinical trials. Zhu and Lakkis (2014) devel-

oped an explicit sample size calculation formula based on the likelihood function of the negative

binomial model. Wang et al. (2020) provided a closed-form sample size formula accounting for

the variability in cluster size in cluster randomized studies. In addition to deriving analytic sam-

ple size calculation formulas, simulation studies have also been used to determine sample size for

count data, as demonstrated in the works of Lyles et al. (2007); Aban et al. (2009); Rettiganti and

Nagaraja (2012). However, the sampling allocation estimation in the analysis of count secondary

outcomes in case-control studies remains understudied. This situation motivated us to explore a

sampling strategy for secondary case-control studies with count outcomes in Chapter 3.

In addition, our proposed sampling strategy formulas in Chapter 2 and Chapter 3 are derived by minimizing the variance of the estimator of the exposure effect using inverse probability weighted estimating equations. The weights in the estimating equations of these two chapters are design-based weights, representing the sampling probability for the secondary outcome analysis from the cohort. In Chapter 4, we incorporate propensity score weights into the estimating equations to create doubly-weighted estimating equations. The propensity score weights, which capture the probability of exposure given the confounder, can be estimated using cohort data. Consequently, the weights in the doubly-weighted estimating equations are the product of the design weights and the propensity score weights. The general propensity score assumptions and the inference of doubly-weighted estimating equations can be found in Negi (2024). It is important to note that in Chapter 4, our focus is not on inference on the marginal effect; rather, we aim to provide an optimal sampling design for case-control studies with binary secondary outcomes given the exposure and a confounder using doubly-weighted estimating equations.


Chapter 2 Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary

Outcomes under Budget Constraints

2.1 Background

Optimal sampling strategies have been fruitfully studied in a variety of epidemiological study designs. In the case-control study with a binary outcome, Demidenko (2006) found that the optimal ratio of controls to cases for fixed power is equal to the square root of the alternative odds ratio. In another paper related to the unmatched case-control study design, Demidenko (2008) gave the optimal control–case ratio for the test of an interaction between two binary covariates. Morgenstern and Winn (1983) proposed that the optimal sampling ratio is a function of the expected frequency of exposure among controls, the odds ratio, and the unit cost ratio. Nam and Fears (1990) investigated optimal allocation for stratified case-control studies.

The design consideration is immaterial when we only use an existing case-control sample for secondary analysis. However, when the “true” outcome of interest is not measured in the target population or is too costly to ascertain, but a proxy of the outcome or another variable strongly associated with the true outcome is routinely collected in the target population or ascertainable at little cost, it may be more efficient to sample observations based on the proxy, as if the proxy were the primary outcome in a case-control study and the true outcome of interest were the secondary outcome. The design aspect of secondary outcome analysis in case-control studies has received less attention than the inference procedures. In this chapter, we propose a new method to determine the optimal sampling ratio for a secondary case-control study when estimating coefficients of interest using inverse probability weighted estimating equations. The purpose of this chapter is twofold. First, we provide a detailed explanation of a novel Lagrange multiplier method by developing the asymptotic variance-covariance matrix of estimators of coefficients obtained from the weighted estimating equations. Second, we showcase the optimal allocation design for a secondary outcome analysis using the method, taking into account monetary constraints.


2.2 Notation and estimation

2.2.1 Notation

Suppose the study cohort (target population) has $i = 1, \dots, N$ independent subjects. Let $D_i$ be the binary primary disease status for subject $i$ in the cohort, where $D_i = 1$ indicates the presence of the disease and $D_i = 0$ indicates its absence. Denote by $Y_i$ the binary secondary outcome of interest, where $Y_i = 1$ indicates the presence of a secondary disease, and $Y_i = 0$ otherwise. Let $S_i$ be the sampling indicator, with $S_i = 1$ indicating the inclusion of the $i$th subject in the secondary outcome analysis, and $S_i = 0$ otherwise. Assume that there are $N_1$ known subjects in the case group ($D_i = 1$) and $N_0$ known subjects in the control group ($D_i = 0$), where $N_1 + N_0 = N$. Let $n_1$ be the unknown sample size selected from the case group and let $n_0$ be the unknown sample size selected from the control group. The total number of subjects selected among $N$ is $n = n_1 + n_0$.

Denote by $\boldsymbol{X}_i = (1, W_i)^T$ the $2 \times 1$ covariates vector for subject $i$, where $W_i$ denotes the binary exposure status for subject $i$, with $W_i = 1$ indicating exposed, and $W_i = 0$ otherwise. Assume that the sampling probability from the study cohort for the secondary outcome analysis depends only on $D_i$, where

$$\Pr(S_i = 1 \mid D_i = d, Y_i, \boldsymbol{X}_i) = \Pr(S_i = 1 \mid D_i = d) = \pi(D_i = d) = \frac{n_d}{N_d}, \quad d = 0, 1.$$

The goal is to find the optimal sampling ratio $n_0/n_1$ when we aim to examine the association between $Y_i$ and $W_i$ in the target population without conditioning on $D_i$. Let $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ be a $2 \times 1$ vector in the conditional expectation $E(Y_i \mid \boldsymbol{X}_i; \boldsymbol{\beta}) = \mu(\boldsymbol{X}_i; \boldsymbol{\beta})$, where $g(\mu(\boldsymbol{X}_i; \boldsymbol{\beta})) = \boldsymbol{X}_i^T \boldsymbol{\beta}$. Since $Y_i$ is a binary outcome, we can use the logit link function for $g(\cdot)$. The mean model $E(Y_i \mid \boldsymbol{X}_i; \boldsymbol{\beta})$ can be expressed as

$$E(Y_i \mid \boldsymbol{X}_i; \boldsymbol{\beta}) = \frac{e^{\boldsymbol{X}_i^T \boldsymbol{\beta}}}{1 + e^{\boldsymbol{X}_i^T \boldsymbol{\beta}}}.$$

A valid estimator of $\boldsymbol{\beta}$ can be obtained by solving the inverse of sampling probability weighted estimating equations:

$$\sum_{i=1}^{N} U_i(\boldsymbol{\beta}) = 0,$$

where

$$U_i(\boldsymbol{\beta}) = \frac{\boldsymbol{X}_i S_i}{\pi(D_i)} \left[ Y_i - \mu(\boldsymbol{X}_i) \right].$$
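The estimating equations above can be illustrated with a minimal Python sketch. It relies on one observation that is ours rather than the dissertation's: because $\boldsymbol{X}_i = (1, W_i)^T$ with a binary $W_i$ makes the logit model saturated, solving the weighted equations amounts to matching the weighted mean of $Y_i$ within each exposure level. The function name `ipw_logit_fit`, the tuple layout `(d, w, y)`, and the toy data are hypothetical.

```python
import math

def ipw_logit_fit(sample, n_by_d, N_by_d):
    """Solve the IPW estimating equations for one binary exposure.

    sample : list of (d, w, y) tuples for the sampled subjects (S_i = 1)
    n_by_d : {0: n_0, 1: n_1}, numbers sampled from controls and cases
    N_by_d : {0: N_0, 1: N_1}, cohort sizes of controls and cases
    """
    # Design weight 1 / pi(d) = N_d / n_d for each sampling stratum.
    weight = {d: N_by_d[d] / n_by_d[d] for d in (0, 1)}
    total = {0: 0.0, 1: 0.0}    # sum of weights by exposure level
    events = {0: 0.0, 1: 0.0}   # weighted sum of Y by exposure level
    for d, w, y in sample:
        total[w] += weight[d]
        events[w] += weight[d] * y
    # Saturated model: the equations reduce to weighted means of Y given W.
    mu0 = events[0] / total[0]
    mu1 = events[1] / total[1]
    beta0 = math.log(mu0 / (1.0 - mu0))
    beta1 = math.log(mu1 / (1.0 - mu1)) - beta0
    return beta0, beta1

# Hypothetical toy sample: 4 of 80 controls and 4 of 20 cases are selected.
toy_sample = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
              (1, 0, 1), (1, 0, 0), (1, 1, 1), (1, 1, 1)]
beta0, beta1 = ipw_logit_fit(toy_sample, {0: 4, 1: 4}, {0: 80, 1: 20})
```

In this toy sample the weighted outcome means are 0.5 among the unexposed and 0.6 among the exposed, so the exposure effect is the log odds ratio $\log(1.5)$.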

Figure 1 shows the secondary case-control study design flowchart.

Figure 1. Secondary Case-control Study Design Flow Chart

[Flowchart: the target population of size $N = N_1 + N_0$ is divided by the primary outcome into a case group ($D_i = 1$, number of cases $N_1$) and a control group ($D_i = 0$, number of controls $N_0$). Samples of $n_1$ cases and $n_0$ controls are drawn, and the secondary outcome $Y_i$ is measured on the $n = n_1 + n_0$ sampled subjects.]

2.2.2 Variance estimation

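In order to determine the optimal sampling ratio for the secondary outcome in a case-control design, we develop the variance of the estimator of the exposure effect, denoted as $Var(\hat\beta_1)$.

Let $\boldsymbol{\beta}^*$ represent the true parameters, and let $\hat{\boldsymbol{\beta}} = (\hat\beta_0, \hat\beta_1)^T$ denote the solution of the estimating equations. Under some regularity conditions, it follows that $\hat{\boldsymbol{\beta}}$ is a consistent estimator of $\boldsymbol{\beta}^*$ and is asymptotically normal, $N(\boldsymbol{\beta}^*, \boldsymbol{V}(\boldsymbol{\beta}^*)/N)$, where $\boldsymbol{V}(\boldsymbol{\beta}^*) = G(\boldsymbol{\beta}^*)^{-1} H(\boldsymbol{\beta}^*) [G(\boldsymbol{\beta}^*)^{-1}]^T$. Here, $G(\boldsymbol{\beta}^*) = E\left[-\frac{\partial}{\partial \boldsymbol{\beta}^T} U(\boldsymbol{\beta}^*)\right]$, and $H(\boldsymbol{\beta}^*) = E\left[U(\boldsymbol{\beta}^*) U(\boldsymbol{\beta}^*)^T\right]$. Since $\hat{\boldsymbol{\beta}}$ is a consistent estimator of $\boldsymbol{\beta}^*$, we can consistently estimate $\boldsymbol{V}(\boldsymbol{\beta}^*)$ with $\boldsymbol{V}(\hat{\boldsymbol{\beta}})$. In the current study settings, we only consider the exposure variable in the estimating equation for simplicity. Therefore, it is straightforward to see that $\boldsymbol{V}(\boldsymbol{\beta})/N$ is a $2 \times 2$ variance-covariance matrix, and the bottom-right element of $\boldsymbol{V}(\boldsymbol{\beta})/N$ represents $Var(\hat\beta_1)$. We obtain $Var(\hat\beta_1)$ by computing $G(\boldsymbol{\beta})^{-1} H(\boldsymbol{\beta}) [G(\boldsymbol{\beta})^{-1}]^T$ directly. It is easy to verify that

$$G(\boldsymbol{\beta}) = E\left[\frac{S_i}{\pi(D_i)} \times \frac{e^{\boldsymbol{X}_i^T \boldsymbol{\beta}}}{\left(1 + e^{\boldsymbol{X}_i^T \boldsymbol{\beta}}\right)^2} \boldsymbol{X}_i \boldsymbol{X}_i^T\right] = E\left[\frac{e^{\boldsymbol{X}_i^T \boldsymbol{\beta}}}{\left(1 + e^{\boldsymbol{X}_i^T \boldsymbol{\beta}}\right)^2} \boldsymbol{X}_i \boldsymbol{X}_i^T\right].$$

(See Appendix A.1 for the explicit derivation of $G(\boldsymbol{\beta})$.)

We substitute $U_i(\boldsymbol{\beta})$ into $H(\boldsymbol{\beta})$, thus

$$H(\boldsymbol{\beta}) = E\left[\frac{S_i}{[\pi(D_i)]^2} \left(Y_i - \mu(\boldsymbol{X}_i)\right)^2 \boldsymbol{X}_i \boldsymbol{X}_i^T\right].$$

We further simplify $H(\boldsymbol{\beta})$ to (see Appendix A.2 for the explicit derivation of $H(\boldsymbol{\beta})$):

$$H(\boldsymbol{\beta}) = E\left[\frac{\left(\tilde\mu(\boldsymbol{X}_i, D_i) - \mu(\boldsymbol{X}_i)\right)^2 + \tilde\mu(\boldsymbol{X}_i, D_i) - \tilde\mu(\boldsymbol{X}_i, D_i)^2}{\pi(D_i)} \boldsymbol{X}_i \boldsymbol{X}_i^T\right],$$

where

$$\tilde\mu(\boldsymbol{X}_i, D_i; \boldsymbol{\alpha}) = E(Y_i \mid \boldsymbol{X}_i, D_i; \boldsymbol{\alpha}) = \frac{e^{\alpha_0 + \alpha_1 w_i + \alpha_2 d_i + \alpha_3 w_i d_i}}{1 + e^{\alpha_0 + \alpha_1 w_i + \alpha_2 d_i + \alpha_3 w_i d_i}}$$

is the expectation of the secondary outcome given the primary outcome, the exposure variable and their interaction. $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ and $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \alpha_2, \alpha_3)^T$ are the coefficient vectors associated with $\mu(\boldsymbol{X}_i; \boldsymbol{\beta})$ and $\tilde\mu(\boldsymbol{X}_i, D_i; \boldsymbol{\alpha})$, respectively. To obtain the optimal sampling ratio, the parameter set $\boldsymbol{\alpha}$ needs to be pre-specified. Because both models are saturated and the parameter of interest is the odds ratio, the functional forms for the two models do not cause issues of compatibility.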
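The simplification of $H(\boldsymbol{\beta})$ rests on the identity $E[(Y - \mu)^2 \mid \boldsymbol{X}, D] = (\tilde\mu - \mu)^2 + \tilde\mu - \tilde\mu^2$ for a binary $Y$ with conditional mean $\tilde\mu$. The short Python sketch below checks this identity numerically by enumerating $Y \in \{0, 1\}$. The function names and the grid of test values are hypothetical and serve only as an illustration.

```python
def expected_squared_residual(mu_tilde, mu):
    """E[(Y - mu)^2 | X, D] for binary Y with P(Y = 1 | X, D) = mu_tilde."""
    return mu_tilde * (1.0 - mu) ** 2 + (1.0 - mu_tilde) * (0.0 - mu) ** 2

def simplified_form(mu_tilde, mu):
    """Numerator inside the simplified H: squared bias plus Bernoulli variance."""
    return (mu_tilde - mu) ** 2 + mu_tilde - mu_tilde ** 2

# Largest discrepancy over a small grid of hypothetical (mu_tilde, mu) pairs.
grid = [0.05, 0.2, 0.5, 0.8, 0.95]
max_gap = max(abs(expected_squared_residual(mt, m) - simplified_form(mt, m))
              for mt in grid for m in grid)
```

The two expressions agree up to floating-point rounding, which is the step that lets the design weight $\pi(D_i)$ appear only once in the denominator of the simplified $H(\boldsymbol{\beta})$.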

2.3 Optimal sampling strategy

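Theorem 1. Under the above study design settings, suppose the prevalence of the exposure in the cohort is known, with $P(W = w) = p_w$ and $P(W = w, D = d) = p_{wd}$, for $w$ and $d = 0$ or $1$. Let

$$A(\boldsymbol{\beta}) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \quad \text{and} \quad B(\boldsymbol{\beta}) = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}.$$

Then $Var(\hat\beta_1)$, the lower right corner element of the variance-covariance matrix $G(\boldsymbol{\beta})^{-1} H(\boldsymbol{\beta}) [G(\boldsymbol{\beta})^{-1}]^T$, is given by

$$\frac{a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22}}{N (\det A)^2},$$

where

$$a_{11} = \frac{p_1 e^{\beta_0 + \beta_1}}{(1 + e^{\beta_0 + \beta_1})^2} + \frac{(1 - p_1) e^{\beta_0}}{(1 + e^{\beta_0})^2},$$

$$a_{12} = a_{21} = a_{22} = \frac{p_1 e^{\beta_0 + \beta_1}}{(1 + e^{\beta_0 + \beta_1})^2},$$

$$b_{11} = \frac{p_{11} N_1 \tilde{\tilde\mu}(1,1)}{n_1} + \frac{p_{10} N_0 \tilde{\tilde\mu}(1,0)}{n_0} + \frac{p_{01} N_1 \tilde{\tilde\mu}(0,1)}{n_1} + \frac{p_{00} N_0 \tilde{\tilde\mu}(0,0)}{n_0},$$

$$b_{12} = b_{21} = b_{22} = \frac{p_{11} N_1 \tilde{\tilde\mu}(1,1)}{n_1} + \frac{p_{10} N_0 \tilde{\tilde\mu}(1,0)}{n_0},$$

where $\tilde{\tilde\mu}(\boldsymbol{X}_i, D_i) \equiv \left(\tilde\mu(\boldsymbol{X}_i, D_i) - \mu(\boldsymbol{X}_i)\right)^2 + \tilde\mu(\boldsymbol{X}_i, D_i) - \tilde\mu(\boldsymbol{X}_i, D_i)^2$, the shorthands $\tilde{\tilde\mu}(1,1)$, $\tilde{\tilde\mu}(1,0)$, $\tilde{\tilde\mu}(0,1)$, $\tilde{\tilde\mu}(0,0)$ represent the value of the function $\tilde{\tilde\mu}(\cdot, \cdot)$ given the exposure value $w$ and the case-control status $d$, for $w$ and $d = 0$ or $1$, and $\det A = a_{11} a_{22} - a_{12} a_{21}$.

Proof. See Appendix A.3 and Appendix A.4. $\square$

The computation of $Var(\hat\beta_1)$ can be performed directly by plugging ($a_{11}$, $a_{12} = a_{21} = a_{22}$, $b_{11}$, $b_{12} = b_{21} = b_{22}$) into the expression

$$\frac{a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22}}{N (\det A)^2}.$$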
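As a numerical sanity check on the closed form in Theorem 1, the following Python sketch compares the bottom-right element of $A^{-1} B (A^{-1})^T$, computed by explicit $2 \times 2$ inversion, with the expression $(a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22}) / (\det A)^2$. The matrices are hypothetical values chosen only for illustration, $B$ is assumed symmetric, and the additional division by $N$ in the theorem is left out because it is common to both sides.

```python
def sandwich_bottom_right(A, B):
    """Bottom-right entry of A^{-1} B (A^{-1})^T for 2x2 nested lists."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    second_row = [-a21 / det, a11 / det]   # second row of A^{-1}
    return sum(second_row[j] * B[j][k] * second_row[k]
               for j in range(2) for k in range(2))

def closed_form(A, B):
    """Closed-form expression from Theorem 1 (without the 1/N factor)."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return (a21 ** 2 * B[0][0] - 2.0 * a11 * B[0][1] * a21
            + a11 ** 2 * B[1][1]) / det ** 2

# Hypothetical matrices with the structure a12 = a21 = a22 and b12 = b21 = b22.
A = [[0.21, 0.04], [0.04, 0.04]]
B = [[3.0, 1.0], [1.0, 1.0]]
gap = abs(sandwich_bottom_right(A, B) - closed_form(A, B))
```

Only the second row of $A^{-1}$ is needed, since the bottom-right element of the sandwich is the quadratic form of that row in $B$.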

Consideration of the cost for data collection is a crucial aspect when conducting epidemiological research. Cost can be viewed as a special constraint in sample size calculation. In this chapter, we focus on a scenario where the total cost of the secondary outcome analysis is fixed and known. We assume that the costs for a case and a control are the same.

Proposition 1. Denote the known total cost of the secondary outcome study as $Cost$, and let $c_{per}$ represent the known cost per individual for samples from the case group or the control group. The maximum sample size $n$ for the secondary outcome analysis is given by $Cost / c_{per} = n_0 + n_1 = n$, where $n_0$ is the sample size of the selected controls, and $n_1$ is the sample size of the selected cases. Let $R_{OD} = n_0 / n_1$ denote the optimal design-based sampling ratio. Under the study design settings described above, the optimal ratio can be determined as follows:

$$R_{OD} = \frac{n_0}{n_1} = \sqrt{\frac{\zeta(1,0) + \iota(0,0) - \kappa(1,0) + s(1,0)}{\zeta(1,1) + \iota(0,1) - \kappa(1,1) + s(1,1)}},$$

where

$$\zeta(1,1) = a_{21}^2 p_{11} N_1 \tilde{\tilde\mu}(1,1), \qquad \zeta(1,0) = a_{21}^2 p_{10} N_0 \tilde{\tilde\mu}(1,0),$$

$$\iota(0,1) = a_{21}^2 p_{01} N_1 \tilde{\tilde\mu}(0,1), \qquad \iota(0,0) = a_{21}^2 p_{00} N_0 \tilde{\tilde\mu}(0,0),$$

$$\kappa(1,1) = 2 a_{21} a_{11} p_{11} N_1 \tilde{\tilde\mu}(1,1), \qquad \kappa(1,0) = 2 a_{21} a_{11} p_{10} N_0 \tilde{\tilde\mu}(1,0),$$

$$s(1,1) = a_{11}^2 p_{11} N_1 \tilde{\tilde\mu}(1,1), \qquad s(1,0) = a_{11}^2 p_{10} N_0 \tilde{\tilde\mu}(1,0).$$

Proof. See Appendix A.5. $\square$
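The Lagrange multiplier argument behind Proposition 1 can be illustrated on a simplified objective. If the variance has the form $c_1/n_1 + c_0/n_0$ and the budget fixes $n_0 + n_1 = n$, the stationarity conditions $c_0/n_0^2 = c_1/n_1^2 = \lambda$ give $n_0/n_1 = \sqrt{c_0/c_1}$. The Python sketch below confirms this with a brute-force search over integer allocations. The constants `c0`, `c1`, and `n` are hypothetical and are not taken from the dissertation.

```python
import math

def allocation_variance(c0, c1, n0, n1):
    """Objective of the form c1/n1 + c0/n0 minimized under n0 + n1 = n."""
    return c0 / n0 + c1 / n1

def best_split_by_search(c0, c1, n):
    """Exhaustive search over integer allocations with n0 + n1 = n."""
    best_n0 = min(range(1, n),
                  key=lambda n0: allocation_variance(c0, c1, n0, n - n0))
    return best_n0, n - best_n0

c0, c1, n = 9.0, 1.0, 400          # hypothetical stratum constants and size
n0, n1 = best_split_by_search(c0, c1, n)
ratio_from_search = n0 / n1
ratio_closed_form = math.sqrt(c0 / c1)
```

With these constants the search and the closed form agree on a ratio of three controls per case, so the stratum with the larger variance contribution receives the larger share of the sample.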

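Remark 1. We define the $\tilde{\tilde\mu}(\boldsymbol{X}, D)$ term as the Quasi Mean Squared Error (QMSE). Then,

$$\tilde{\tilde\mu}(\boldsymbol{X}, D) = \underbrace{\underbrace{\left[\tilde\mu(\boldsymbol{X}, D) - \mu(\boldsymbol{X})\right]^2}_{bias} + \underbrace{\tilde\mu(\boldsymbol{X}, D) - \tilde\mu(\boldsymbol{X}, D)^2}_{variance}}_{QMSE_{x,d}}.$$

Let $q_1 = P(D = 1) = N_1/N$ and $q_0 = P(D = 0) = N_0/N$. Then,

$$Var(\hat\beta_1) = \frac{\theta_1 q_1}{n_1 (\det A)^2} + \frac{\theta_0 q_0}{n_0 (\det A)^2},$$

where

$$\theta_1 = a_{21}^2 \left(p_{11} QMSE_{11} + p_{01} QMSE_{01}\right) + \left(a_{11}^2 - 2 a_{21} a_{11}\right) p_{11} QMSE_{11},$$

$$\theta_0 = a_{21}^2 \left(p_{10} QMSE_{10} + p_{00} QMSE_{00}\right) + \left(a_{11}^2 - 2 a_{21} a_{11}\right) p_{10} QMSE_{10}.$$

Thus, the optimal sampling ratio is $R_{OD} = \sqrt{\theta_0 q_0 / (\theta_1 q_1)}$. Since $q_1 \ll q_0$, if $\theta_1$ is big and $\theta_0$ is small, then $R_{OD} \approx 1$. If $\theta_1$ is small and $\theta_0$ is big, then $R_{OD} < 1$. If $\theta_1/\theta_0 \approx 1$, then $R_{OD} \approx \sqrt{q_0/q_1} = \sqrt{N_0/N_1}$.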
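To make Theorem 1, Proposition 1, and Remark 1 concrete, the following Python sketch evaluates $\theta_0$, $\theta_1$, $R_{OD}$, and $Var(\hat\beta_1)$ from a set of design parameters. It is a sketch under stated assumptions, not the dissertation's own code: the cohort counts $N_d$ are replaced by their expected values $N q_d$, the marginal mean $\mu(w)$ is computed as the value implied by the models for $D$ and $Y$, and all function names are hypothetical. The parameter values are those of the simulation in Section 2.4, so the ratio should be close to the reported 2.63 and the balanced-design variance close to the tabulated 1.179E-2.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def design_terms(p1, gamma, alpha):
    """Return q_d = P(D = d), theta_d (Remark 1) and det A."""
    g0, g1 = gamma
    a0, a1, a2, a3 = alpha
    pw = {0: 1.0 - p1, 1: p1}                        # P(W = w)
    p, mu_t, mu = {}, {}, {}
    for w in (0, 1):
        pd1 = expit(g0 + g1 * w)                     # P(D = 1 | W = w)
        p[w] = {0: pw[w] * (1.0 - pd1), 1: pw[w] * pd1}   # joint p_wd
        mu_t[w] = {d: expit(a0 + a1 * w + a2 * d + a3 * w * d) for d in (0, 1)}
        # Assumption: marginal mean mu(w) = E(Y | W = w) implied by both models.
        mu[w] = sum(p[w][d] * mu_t[w][d] for d in (0, 1)) / pw[w]
    qmse = {w: {d: (mu_t[w][d] - mu[w]) ** 2 + mu_t[w][d] - mu_t[w][d] ** 2
                for d in (0, 1)} for w in (0, 1)}
    a21 = p1 * mu[1] * (1.0 - mu[1])                 # a12 = a21 = a22
    a11 = a21 + (1.0 - p1) * mu[0] * (1.0 - mu[0])
    det_a = a11 * a21 - a21 * a21
    q = {d: p[0][d] + p[1][d] for d in (0, 1)}
    theta = {d: a21 ** 2 * (p[1][d] * qmse[1][d] + p[0][d] * qmse[0][d])
                + (a11 ** 2 - 2.0 * a21 * a11) * p[1][d] * qmse[1][d]
             for d in (0, 1)}
    return q, theta, det_a

def optimal_ratio(p1, gamma, alpha):
    q, theta, _ = design_terms(p1, gamma, alpha)
    return math.sqrt(theta[0] * q[0] / (theta[1] * q[1]))

def variance_beta1(p1, gamma, alpha, n0, n1):
    q, theta, det_a = design_terms(p1, gamma, alpha)
    return (theta[1] * q[1] / n1 + theta[0] * q[0] / n0) / det_a ** 2

# Parameter values of Section 2.4: P(W = 1), gamma, alpha, and n = 3,000.
params = (0.17, (-1.4, 0.7), (-1.0, 0.3, 0.25, 1.0))
r_od = optimal_ratio(*params)
var_balanced = variance_beta1(*params, n0=1500, n1=1500)
```

Because $\theta_d$ and $q_d$ do not depend on the allocation, the same `design_terms` output can be reused to compare any candidate split of a fixed budget.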

2.4 Simulation

We denote the derived variance of $\hat\beta_1$ in Theorem 1 as $Var_{DER}(\hat\beta_1)$. We denote by $Var_{GMM}(\hat\beta_1)$ the variance of $\hat\beta_1$ using the Stata command “gmm”. We define the empirical variance as

$$Var_{EMP}(\hat\beta_1) = \frac{1}{1000 - 1} \sum_{i=1}^{1000} \left(\hat\beta_1^{i} - \hat\beta_1^{mean}\right)^2,$$

where $\hat\beta_1^{i}$ is the estimator of the exposure effect on one simulated dataset using the Stata command “gmm”, and $\hat\beta_1^{mean}$ is the average of $\hat\beta_1^{i}$ across 1000 simulation runs.

In this section, we conduct simulation studies to verify our derived variance formula. Specifically, we investigate whether our proposed variance $Var_{DER}(\hat\beta_1)$ closely approximates $Var_{EMP}(\hat\beta_1)$ and $Var_{GMM}(\hat\beta_1)$.
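The empirical variance defined above is the usual sample variance of the estimates across simulation runs, with divisor equal to the number of runs minus one. A minimal Python sketch follows. The list of estimates is hypothetical and far shorter than the 1,000 runs used in the study, and the standard-library function `statistics.variance` is used only as a cross-check.

```python
import statistics

def empirical_variance(estimates):
    """Sample variance of the estimates across runs (divisor: runs - 1)."""
    mean = sum(estimates) / len(estimates)
    return sum((b - mean) ** 2 for b in estimates) / (len(estimates) - 1)

# Hypothetical exposure-effect estimates from four simulation runs.
beta1_hats = [0.6, 0.7, 0.8, 0.9]
var_emp = empirical_variance(beta1_hats)
var_check = statistics.variance(beta1_hats)
```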

We provide the data generating process as follows. We simulate $i = 1, \dots, 1{,}000$ datasets, with each dataset comprising 10,000 observations (target population, i.e., cohort size $N$). Within each dataset, we first simulate the binary exposure variable $W \sim Bernoulli(0.17)$. We then simulate the binary primary outcome $D$, with

$$E(D \mid \boldsymbol{X}; \boldsymbol{\gamma}) = \frac{e^{\gamma_0 + \gamma_1 w}}{1 + e^{\gamma_0 + \gamma_1 w}}, \quad \text{where } \boldsymbol{\gamma} = [\gamma_0, \gamma_1] = [-1.4, 0.7].$$

Finally, we simulate the binary secondary outcome $Y$, with

$$E(Y \mid \boldsymbol{X}, D; \boldsymbol{\alpha}) = \frac{e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}}{1 + e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}}, \quad \text{where } \boldsymbol{\alpha} = [\alpha_0, \alpha_1, \alpha_2, \alpha_3] = [-1, 0.3, 0.25, 1].$$

In each simulated dataset, the prevalence of the primary outcome $D$ and the secondary outcome $Y$ is around 0.22 and 0.31, respectively. The number of individuals selected based on the study budget is $Cost / c_{per} = n = 3{,}000$. Table 2.1 presents a comparison of $Var_{EMP}(\hat\beta_1)$, $Var_{GMM}(\hat\beta_1)$, and our proposed variance $Var_{DER}(\hat\beta_1)$ under different sampling ratios with the given parameters. The first column of the table represents the ratio of $n_0$ to $n_1$. The second column displays the empirical variance for each sampling design. The third column shows the average of the variances of the estimators of the coefficients obtained using the Stata “gmm” command over 1,000 simulations. The final column shows our proposed variances under different sampling designs. The results in Table 2.1 clearly illustrate that our proposed $Var_{DER}(\hat\beta_1)$ is very close to $Var_{EMP}(\hat\beta_1)$ and $Var_{GMM}(\hat\beta_1)$ across all reasonable sampling ratios. It is evident that the balanced design is not the most efficient choice. By using our proposed optimal sampling formula, we obtained the optimal sampling ratio $R_{OD} = 2.63$.
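The data generating process described in this section can be sketched in Python as follows. The original simulations were run in Stata, so this is only an illustrative re-expression under assumptions: the seed, the function names, and the tuple layout `(d, w, y)` are hypothetical, while the coefficients are those stated in the text.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_cohort(n_subjects, rng):
    """Generate (d, w, y) triples following the models of Section 2.4."""
    cohort = []
    for _ in range(n_subjects):
        w = 1 if rng.random() < 0.17 else 0                      # exposure
        d = 1 if rng.random() < expit(-1.4 + 0.7 * w) else 0     # primary outcome
        y_prob = expit(-1.0 + 0.3 * w + 0.25 * d + 1.0 * w * d)
        y = 1 if rng.random() < y_prob else 0                    # secondary outcome
        cohort.append((d, w, y))
    return cohort

rng = random.Random(2024)            # hypothetical seed, for reproducibility only
cohort = simulate_cohort(10_000, rng)
prev_d = sum(d for d, _, _ in cohort) / len(cohort)
prev_y = sum(y for _, _, y in cohort) / len(cohort)
```

The two prevalences should land near the values of about 0.22 and 0.31 quoted in the text, up to Monte Carlo error.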

Table 2.1: The comparison of $Var_{EMP}(\hat\beta_1)$, $Var_{GMM}(\hat\beta_1)$, and the proposed variance $Var_{DER}(\hat\beta_1)$ under different sampling designs. All variances are reported in units of E-2.

n0 : n1     Var_EMP     Var_GMM     Var_DER
1 : 1       1.239       1.182       1.179
2 : 1       0.981       0.998       0.996
2.63 : 1    0.989       0.984       0.981
3 : 1       0.957       0.989       0.985
4 : 1       0.981       1.019       1.016
1 : 2       1.729       1.657       1.657
1 : 3       2.238       2.164       2.160
1 : 4       2.796       2.680       2.669

2.5 Numerical illustration

We apply our proposed method to the secondary outcome analysis of the Pesticides and Sense of Smell (PASS) Study, which is an add-on study of the Agricultural Health Study (AHS). The PASS Study aims to better understand the relationship between high pesticide exposure events (HPEE) and olfactory impairment (OI). In the target AHS phase-4 cohort, participants were asked if they had lost their sense of smell. Some literature has shown an association between OI and cognitive decline (Yaffe et al., 2017; Shrestha et al., 2019; Dintica et al., 2019). Thus, the investigators used the self-reported smell loss to define the sampling strata $D$ and mailed the selected participants the Cognitive Function Instrument (CFI) questionnaire, which is used to define a dichotomous outcome of interest, $Y$, the CFI-based cognitive complaint. We utilized our proposed optimal sampling ratio formula to obtain an efficient study design for the analysis of the secondary outcome under some reasonable scenarios of the strength of association between $D$ and $Y$. Certain parameters were derived from the cohort data, e.g., $p(W = 1) = 0.14$ and $\boldsymbol{\gamma} = [-1.86, 0.397]$. The total sample size in the case-control cohort is $N = N_0 + N_1 = 15{,}893 + 2{,}633 = 18{,}526$. We use a range of $\boldsymbol{\alpha}$ values that are meaningful for the following scenarios: the prevalence of $Y$ is either 0.1 or 0.2; the association between $D$ and $Y$ among $W = 0$ is $OR_{YD|W=0} = [1.2, 1.4, 1.6]$; and the association among $W = 1$ is $OR_{YD|W=1} = [1.5, 2.0, 2.5]$. We created Table 2.2 for these scenarios, with column 1 for the prevalence of $Y$, column 2 for $OR_{YD|W=0}$, column 3 for $OR_{YD|W=1}$, and the last column for the proposed optimal sampling ratio $R_{OD}$ based on the formula with these given parameters. The budget constraint is $25,000 for data collection, and the unit cost is $5.40. We can observe that as the association between $D$ and $Y$ increases, the optimal sampling ratio decreases, resulting in fewer subjects being sampled from the control group.
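The budget arithmetic in this illustration can be sketched in Python. The budget of $25,000 and the unit cost of $5.40 come from the text, and the ratio 4.66 is the first entry of the $R_{OD}$ column in Table 2.2. Working in integer cents and rounding the number of cases to the nearest integer are assumptions made here to keep the example exact, and the function names are hypothetical.

```python
def affordable_sample_size(budget_cents, unit_cost_cents):
    """Largest n with n * unit cost <= budget, using integer cents."""
    return budget_cents // unit_cost_cents

def split_by_ratio(n, ratio):
    """Split n into (n0, n1) so that n0 / n1 is close to the given ratio."""
    n1 = round(n / (1.0 + ratio))     # assumption: round cases to nearest integer
    return n - n1, n1

n = affordable_sample_size(2_500_000, 540)   # $25,000 budget, $5.40 per subject
n0, n1 = split_by_ratio(n, 4.66)             # R_OD of the first scenario in Table 2.2
```

Under these assumptions the budget supports 4,629 subjects, which are then divided between controls and cases according to the chosen ratio.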


Table 2.2: Optimal sampling ratio $R_{OD}$ varies with the prevalence of the secondary outcome $Y$, and the association between $D$ and $Y$, while keeping other parameters fixed.

P(Y=1)   OR_{YD|W=0}   OR_{YD|W=1}   beta_1   alpha_0   alpha_1   alpha_2   alpha_3   R_OD
0.1      1.2           1.5           0.659    -2.3      0.6       0.182     0.223     4.66
0.1      1.2           2.0           0.731    -2.35     0.6       0.182     0.511     4.35
0.1      1.2           2.5           0.790    -2.36     0.6       0.182     0.734     4.13
0.1      1.4           1.5           0.630    -2.36     0.6       0.336     0.069     4.62
0.1      1.4           2.0           0.712    -2.37     0.6       0.336     0.375     4.27
0.1      1.4           2.5           0.765    -2.39     0.6       0.336     0.580     4.07
0.1      1.6           1.5           0.611    -2.37     0.6       0.470     -0.065    4.57
0.1      1.6           2.0           0.684    -2.39     0.6       0.470     0.223     4.24
0.1      1.6           2.5           0.738    -2.41     0.6       0.470     0.446     4.02
0.2      1.2           1.5           0.656    -1.515    0.6       0.182     0.223     4.89
0.2      1.2           2.0           0.713    -1.525    0.6       0.182     0.511     4.70
0.2      1.2           2.5           0.765    -1.535    0.6       0.182     0.734     4.58
0.2      1.4           1.5           0.631    -1.535    0.6       0.336     0.069     4.84
0.2      1.4           2.0           0.698    -1.545    0.6       0.336     0.375     4.64
0.2      1.4           2.5           0.742    -1.555    0.6       0.336     0.580     4.53
0.2      1.6           1.5           0.610    -1.555    0.6       0.470     -0.065    4.81
0.2      1.6           2.0           0.670    -1.565    0.6       0.470     0.223     4.61
0.2      1.6           2.5           0.720    -1.575    0.6       0.470     0.446     4.49
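The $\alpha_2$ and $\alpha_3$ columns of Table 2.2 follow from the two stratum-specific odds ratios through the model for $\tilde\mu$: $OR_{YD|W=0} = e^{\alpha_2}$ and $OR_{YD|W=1} = e^{\alpha_2 + \alpha_3}$. The Python sketch below reproduces this mapping. The function name is hypothetical.

```python
import math

def interaction_model_coefficients(or_w0, or_w1):
    """Map the two stratum-specific odds ratios to (alpha_2, alpha_3)."""
    alpha2 = math.log(or_w0)              # log OR of Y and D among W = 0
    alpha3 = math.log(or_w1) - alpha2     # extra log OR among W = 1
    return alpha2, alpha3

alpha2_a, alpha3_a = interaction_model_coefficients(1.2, 1.5)
alpha2_b, alpha3_b = interaction_model_coefficients(1.6, 1.5)
```

A negative $\alpha_3$ arises whenever the odds ratio among the exposed is smaller than the odds ratio among the unexposed, as in the scenarios with $OR_{YD|W=0} = 1.6$ and $OR_{YD|W=1} = 1.5$.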

2.6 Conclusion

The optimal sampling strategies have been discussed for two-stage (Breslow and Chatterjee, 1999; McNamee, 2005), or so-called two-phase, designs (Reilly, 1996; Breslow and Cain, 1988) for case-control studies. Breslow (2005) points out that two-phase designs and two-stage designs are the same study design with different terminologies. A two-stage case-control design involves determining exposure and outcome for a large sample, but covariates are measured only on a subsample (Hanley et al., 2005). The outcome of interest remains the same for both the first stage and the second stage. However, in secondary outcome analysis of a case-control study, the primary outcome and the secondary outcome are different. Since the study designs are different, our proposed method differs from the methods used in the papers mentioned above. Inference for secondary outcome analysis in case-control studies has been studied over the last decades, with numerous methods being proposed. However, there is no off-the-shelf method for the sampling design when the data are deliberately collected for the secondary outcome. In this chapter, we proposed an optimal sampling strategy for the analysis of the secondary outcome using weighted estimating equations. The term “optimal” refers to the allocation of cases and controls that minimizes the variance of the estimator of interest given an analytic method and a fixed sample size under a budget constraint. We derive the asymptotic variance-covariance matrix of the estimators of the coefficients using the “sandwich” variance-covariance matrix of weighted estimating equations estimators. When the variance of the estimator of the coefficient is minimal, the power of the test on the exposure effect is maximal. We provide a sampling formula for achieving an efficient study design for a valid estimation strategy of the effect of interest, namely the inverse probability of sampling weighted estimating equations estimators. For different estimation strategies, the optimal sampling ratio might be different. To verify our formula, we conducted simulation studies. The results demonstrate that our formula performs well for both common primary outcomes and secondary outcomes. Interestingly, our findings indicate that the widely used balanced design is not always the most efficient choice for secondary outcome analysis study designs. Therefore, researchers should carefully calculate the sampling ratio when the purpose is a secondary outcome analysis.


Chapter 3 Optimal Sampling Strategies Using Case-Control Studies for Poisson Count Secondary Outcomes under Budget Constraints

3.1 Background

In this chapter, we expand upon our sampling methodology from Chapter 2 by including an explicit sampling ratio formula for case-control studies with count secondary outcomes. With this sampling formula, researchers will be able to achieve an efficient study design for their count secondary outcome analysis. The derivation of the optimal sampling ratio is based on estimating the variance of the estimator of the exposure effect using inverse probability weighted estimating equations. Our proposed framework, which incorporates inverse probability weighted estimating equations, offers an optimal formula that can accommodate Poisson distributed count data. This chapter is organized as follows: First, we introduce general notations. Second, we derive the variance formula of the exposure effect in Poisson distributed secondary outcomes. Third, we verify our proposed variance formula through Monte Carlo simulations. Finally, we draw conclusions based on our findings.

3.2 Notation and estimation

3.2.1 Notation

Suppose the study cohort (target population) consists of $N$ independent subjects. Let $D_i$ be the binary case-control primary outcome, where $D_i = 1$ or $0$ indicates the presence or absence of the disease. The population sizes of cases and controls are denoted by $N_1$ and $N_0$, respectively. We assume that $N_1$ and $N_0$ are known, and $N_1 + N_0 = N$. Let $Y_i$ represent the Poisson-distributed count secondary outcome of interest. $S_i$ is the sampling indicator, with $S_i = 1$ indicating inclusion in the secondary outcome analysis, and $S_i = 0$ otherwise. We define $n_1$ as the unknown sample size selected from the case group and $n_0$ as the unknown sample size selected from the control group. The total number of subjects to be selected from the target population, denoted as $n$, is given by $n = n_1 + n_0$. Our objective is to derive an expression for the sampling ratio $n_0/n_1$ under the study budget constraint, when we aim to examine the association between $Y_i$ and $W_i$ in the target population without conditioning on $D_i$. This will allow us to determine the exact sample sizes of cases and controls for the secondary outcome analysis. Let $W_i$ denote the binary exposure status, with $W_i = 1$ indicating exposure, and $W_i = 0$ otherwise. Denote by $\mathbf{X}_i = (1, W_i)^T$ the $2 \times 1$ covariate vector for each subject $i$. Similar to Chapter 1, we assume that the sampling probability from the study cohort for the secondary outcome analysis depends only on $D_i$, where

$$\Pr(S_i = 1 \mid D_i = d, Y_i, \mathbf{X}_i) = \Pr(S_i = 1 \mid D_i = d) = \pi(d) = \frac{n_d}{N_d}, \quad d = 0, 1.$$

Let $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ be the $2 \times 1$ coefficient vector. Since $Y_i$ is a Poisson-distributed random variable, the mean model $E(Y_i \mid \mathbf{X}_i)$ can be expressed as $E(Y_i \mid \mathbf{X}_i) = \mu(\mathbf{X}_i) = e^{\mathbf{X}_i^T \boldsymbol{\beta}}$. A valid estimator of $\boldsymbol{\beta}$ can be obtained by solving the inverse sampling probability weighted estimating equations:

$$\sum_{i=1}^{N} U_i(\boldsymbol{\beta}) = 0,$$

where

$$U_i(\boldsymbol{\beta}) = \frac{\mathbf{X}_i S_i}{\pi(D_i)} \left[ Y_i - \mu(\mathbf{X}_i; \boldsymbol{\beta}) \right].$$

(For simplicity, we omit the subscript $i$ in the following derivations.)
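The weighted estimating equations above can be solved numerically. The following is a minimal Python sketch, assuming NumPy is available. The function name `solve_ipw_poisson`, the Newton-Raphson iteration, and the toy data are illustrative assumptions and are not the implementation used in this dissertation, which relies on the Stata "gmm" command.

```python
import numpy as np

def solve_ipw_poisson(X, y, s, pi, max_iter=50, tol=1e-10):
    """Solve sum_i X_i * S_i / pi(D_i) * (Y_i - exp(X_i' beta)) = 0 by Newton-Raphson.

    X  : (N, p) design matrix with rows X_i = (1, W_i); first column is the intercept
    y  : (N,) count outcomes Y_i (any finite placeholder for unsampled subjects)
    s  : (N,) sampling indicators S_i (0 or 1)
    pi : (N,) sampling probabilities pi(D_i) = n_d / N_d
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(s, dtype=float) / np.asarray(pi, dtype=float)  # inverse probability weights

    # Start from an intercept-only fit: log of the weighted mean of Y.
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(np.sum(w * y) / np.sum(w))

    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (w * (y - mu))            # sum_i U_i(beta)
        info = (X * (w * mu)[:, None]).T @ X    # minus the derivative of the score
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data (hypothetical): with a single binary exposure the model is saturated,
# so the solution is the log of the weighted mean of Y in each exposure group.
W = np.array([0, 0, 1, 1])
X = np.column_stack([np.ones(4), W])
Y = np.array([1, 3, 2, 6])
S = np.array([1, 1, 1, 1])
PI = np.array([0.5, 1.0, 0.5, 1.0])
beta_hat = solve_ipw_poisson(X, Y, S, PI)
print(beta_hat)
```

In the toy data the weighted group means are 5/3 (unexposed) and 10/3 (exposed), so the fitted intercept and slope can be checked against $\log(5/3)$ and $\log 2$.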

3.2.2 Variance estimation

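In this scenario, we consider one exposure variable in the mean model; therefore, it is straightforward to see that the bottom-right element of the variance-covariance matrix represents $Var(\hat{\beta}_1)$. The general framework for deriving the variance-covariance matrix of the parameters of the inverse probability weighted estimating equations can be found in Chapter 2, Section 2. We obtain $Var(\hat{\beta}_1)$ by computing $\mathbf{A}^{-1}(\boldsymbol{\beta})\, \mathbf{B}(\boldsymbol{\beta})\, [\mathbf{A}^{-1}(\boldsymbol{\beta})]^T$ directly. It is easy to verify that

$$\mathbf{A}(\boldsymbol{\beta}) = E\left\{ -\frac{\partial}{\partial \boldsymbol{\beta}^T} \left[ \frac{\mathbf{X} S}{\pi(D)} \left( Y - e^{\mathbf{X}^T \boldsymbol{\beta}} \right) \right] \right\} = E\left[ \frac{S}{\pi(D)} \times e^{\mathbf{X}^T \boldsymbol{\beta}}\, \mathbf{X}\mathbf{X}^T \right]. \qquad (3.2.1)$$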

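Taking iterated expectations in (3.2.1), then

$$\mathbf{A}(\boldsymbol{\beta}) = E\left\{ E\left[ \frac{S}{\pi(D)} \times e^{\mathbf{X}^T \boldsymbol{\beta}}\, \mathbf{X}\mathbf{X}^T \,\Big|\, D \right] \right\}.$$

Note that $\pi(D) = \Pr(S = 1 \mid D = d)$ only depends on $D$, thus

$$\mathbf{A}(\boldsymbol{\beta}) = E\left\{ \frac{E(S \mid D)}{\pi(D)} \times E\left[ e^{\mathbf{X}^T \boldsymbol{\beta}}\, \mathbf{X}\mathbf{X}^T \,\Big|\, D \right] \right\}.$$

Since $S$ is a binary variable, $E(S \mid D) = \Pr(S = 1 \mid D = d) = \pi(d)$ under the assumption that the sampling is only conditional on the primary outcome $D$, thus

$$\mathbf{A}(\boldsymbol{\beta}) = E\left\{ E\left[ e^{\mathbf{X}^T \boldsymbol{\beta}}\, \mathbf{X}\mathbf{X}^T \,\Big|\, D \right] \right\} = E\left[ e^{\mathbf{X}^T \boldsymbol{\beta}}\, \mathbf{X}\mathbf{X}^T \right].$$

It is also easy to verify that

$$\mathbf{B}(\boldsymbol{\beta}) = E\left\{ \frac{S^2}{[\pi(D)]^2} \left( Y - \mu(\mathbf{X}) \right)^2 \mathbf{X}\mathbf{X}^T \right\}.$$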

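With further simplification of $\mathbf{B}(\boldsymbol{\beta})$ (see Appendix B.1 for the explicit derivation of $\mathbf{B}(\boldsymbol{\beta})$), we have

$$\mathbf{B}(\boldsymbol{\beta}) = E\left\{ \frac{\left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D)}{\pi(D)}\, \mathbf{X}\mathbf{X}^T \right\},$$

where $\tilde{\mu}(\mathbf{X}, D; \boldsymbol{\alpha}) = E(Y \mid \mathbf{X}, D; \boldsymbol{\alpha}) = e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}$. The parameter sets $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ and $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \alpha_2, \alpha_3)^T$ are the coefficient vectors associated with $\mu(\mathbf{X})$ and $\tilde{\mu}(\mathbf{X}, D)$, respectively. We assume that $\boldsymbol{\alpha}$ is known for the purpose of calculating the optimal sampling ratio.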

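We further calculate $\mathbf{A}(\boldsymbol{\beta})$ and $\mathbf{B}(\boldsymbol{\beta})$ by the rule of functional expectation.

$$
\begin{aligned}
\mathbf{A}(\boldsymbol{\beta}) &= E\left[ e^{\mathbf{X}^T \boldsymbol{\beta}}\, \mathbf{X}\mathbf{X}^T \right] \\
&= p_{w=1}\, e^{\beta_0 + \beta_1} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + [1 - p_{w=1}]\, e^{\beta_0} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} p_{w=1}\, e^{\beta_0 + \beta_1} + [1 - p_{w=1}]\, e^{\beta_0} & p_{w=1}\, e^{\beta_0 + \beta_1} \\ p_{w=1}\, e^{\beta_0 + \beta_1} & p_{w=1}\, e^{\beta_0 + \beta_1} \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},
\end{aligned}
$$

where

$$a_{11} = p_{w=1}\, e^{\beta_0 + \beta_1} + [1 - p_{w=1}]\, e^{\beta_0},$$

$$a_{12} = a_{21} = a_{22} = p_{w=1}\, e^{\beta_0 + \beta_1},$$

and $p_{w=1}$ is the known prevalence of the exposure in the cohort.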

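Similarly for $\mathbf{B}(\boldsymbol{\beta})$, we let $\tilde{\tilde{\mu}}(\mathbf{X}, D) \equiv \left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D)$, then

$$
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta}) &= E\left[ \frac{1}{\pi(D)}\, \tilde{\tilde{\mu}}(\mathbf{X}, D)\, \mathbf{X}\mathbf{X}^T \right] \\
&= \frac{p_{11}\, \tilde{\tilde{\mu}}(1, 1)}{n_1 / N_1} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
 + \frac{p_{10}\, \tilde{\tilde{\mu}}(1, 0)}{n_0 / N_0} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
 + \frac{p_{01}\, \tilde{\tilde{\mu}}(0, 1)}{n_1 / N_1} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
 + \frac{p_{00}\, \tilde{\tilde{\mu}}(0, 0)}{n_0 / N_0} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix},
\end{aligned}
$$

where

$$b_{11} = \frac{p_{11} N_1\, \tilde{\tilde{\mu}}(1, 1) + p_{01} N_1\, \tilde{\tilde{\mu}}(0, 1)}{n_1} + \frac{p_{10} N_0\, \tilde{\tilde{\mu}}(1, 0) + p_{00} N_0\, \tilde{\tilde{\mu}}(0, 0)}{n_0},$$

$$b_{12} = b_{21} = b_{22} = \frac{p_{11} N_1\, \tilde{\tilde{\mu}}(1, 1)}{n_1} + \frac{p_{10} N_0\, \tilde{\tilde{\mu}}(1, 0)}{n_0},$$

and $p_{wd}$ is the known joint probability of the exposure ($W = w$) and the primary outcome ($D = d$) in the cohort.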

We have a variance-covariance matrix similar to that of the binary secondary outcome case in Chapter 2. So

$$Var(\hat{\beta}_1) = \frac{a_{21}^2 b_{11} - 2 a_{11} b_{12} a_{21} + a_{11}^2 b_{22}}{(\det A)^2\, N}$$

is the lower-right element of the variance-covariance matrix $\mathbf{A}^{-1}(\boldsymbol{\beta})\, \mathbf{B}(\boldsymbol{\beta})\, [\mathbf{A}^{-1}(\boldsymbol{\beta})]^T$, where $\det A = a_{11} a_{22} - a_{12} a_{21}$. For details, see Chapter 2, Theorem 1.
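The closed-form numerator can be checked against a direct matrix computation. The sketch below, assuming NumPy, compares the lower-right element of the sandwich product with the scalar expression in the $a_{jk}$ and $b_{jk}$ terms. The function names and the numerical inputs (including the matrix `B`) are hypothetical and serve only to exercise the algebra.

```python
import numpy as np

def var_beta1_closed_form(A, B, N):
    """Lower-right element of A^{-1} B (A^{-1})^T / N, written out for 2x2 A and symmetric B."""
    a11, a12 = A[0]
    a21, a22 = A[1]
    b11, b12 = B[0]
    b22 = B[1][1]
    det_a = a11 * a22 - a12 * a21
    return (a21**2 * b11 - 2 * a11 * b12 * a21 + a11**2 * b22) / (det_a**2 * N)

def var_beta1_sandwich(A, B, N):
    """Same quantity computed by explicit matrix inversion and multiplication."""
    A_inv = np.linalg.inv(A)
    return (A_inv @ B @ A_inv.T)[1, 1] / N

# Hypothetical inputs, chosen only to exercise the formula.
p_w1, beta0, beta1 = 0.17, 0.02, 0.09
a_shared = p_w1 * np.exp(beta0 + beta1)            # a12 = a21 = a22
A = np.array([[a_shared + (1 - p_w1) * np.exp(beta0), a_shared],
              [a_shared, a_shared]])
B = np.array([[4.0, 1.5],
              [1.5, 1.5]])                          # b12 = b21 = b22, as in the text
N = 10_000

v_closed = var_beta1_closed_form(A, B, N)
v_matrix = var_beta1_sandwich(A, B, N)
print(v_closed, v_matrix)
```

The two values should agree up to floating-point error, which is a quick sanity check before the scalar expression is used in the optimization step.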

3.3 Optimal sampling strategy

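We consider cost as a special constraint when we minimize the variance to obtain the optimal sampling ratio. Define the optimal design ratio of controls to cases, $n_0/n_1$, as $R_{OD}$. We can write $var(\hat{\beta}_1)$ as a function of $R_{OD}$, that is

$$var(\hat{\beta}_1) = \frac{f_1 R_{OD}^2 + (f_1 + f_2) R_{OD} + f_2}{f_3 R_{OD}},$$

where

$$f_1 = f_1(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, p_{w=1}, p_{wd}, N_1, N_0) = \zeta(1, 1) + \eta(0, 1) - \xi(1, 1) + \tau(1, 1),$$

$$f_2 = f_2(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, p_{w=1}, p_{wd}, N_1, N_0) = \zeta(1, 0) + \eta(0, 0) - \xi(1, 0) + \tau(1, 0),$$

$$f_3 = (\det \mathbf{A})^2\, n N, \quad w, d = 0, 1,$$

and

$$
\begin{aligned}
\zeta(1, 1) &= a_{21}^2\, p_{11} N_1\, \tilde{\tilde{\mu}}(1, 1), & \zeta(1, 0) &= a_{21}^2\, p_{10} N_0\, \tilde{\tilde{\mu}}(1, 0), \\
\eta(0, 1) &= a_{21}^2\, p_{01} N_1\, \tilde{\tilde{\mu}}(0, 1), & \eta(0, 0) &= a_{21}^2\, p_{00} N_0\, \tilde{\tilde{\mu}}(0, 0), \\
\xi(1, 1) &= 2 a_{21} a_{11}\, p_{11} N_1\, \tilde{\tilde{\mu}}(1, 1), & \xi(1, 0) &= 2 a_{21} a_{11}\, p_{10} N_0\, \tilde{\tilde{\mu}}(1, 0), \\
\tau(1, 1) &= a_{11}^2\, p_{11} N_1\, \tilde{\tilde{\mu}}(1, 1), & \tau(1, 0) &= a_{11}^2\, p_{10} N_0\, \tilde{\tilde{\mu}}(1, 0).
\end{aligned}
$$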
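The minimization of this variance expression over $R_{OD}$ can be written out explicitly. The following worked derivation is a sketch that uses only the definitions of $f_1$, $f_2$, and $f_3$ and assumes $f_1, f_2, f_3 > 0$, with $f_3$ free of $R_{OD}$ because the total sample size $n$ is fixed by the budget:

```latex
\begin{aligned}
var(\hat{\beta}_1) &= \frac{1}{f_3}\left[ f_1 R_{OD} + (f_1 + f_2) + \frac{f_2}{R_{OD}} \right], \\
\frac{\partial\, var(\hat{\beta}_1)}{\partial R_{OD}} &= \frac{1}{f_3}\left[ f_1 - \frac{f_2}{R_{OD}^2} \right] = 0
\;\Longrightarrow\; R_{OD} = \sqrt{f_2 / f_1}, \\
\frac{\partial^2\, var(\hat{\beta}_1)}{\partial R_{OD}^2} &= \frac{2 f_2}{f_3 R_{OD}^3} > 0 \quad \text{for } R_{OD} > 0 .
\end{aligned}
```

The positive second derivative indicates that the stationary point is a minimum on $R_{OD} > 0$.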

The definitions of $p_{w=1}$, $p_{wd}$, $N_1$, and $N_0$ can be found in Chapter 2. By applying simple algebra (Demidenko, 2008), we can determine that the minimum of $var(\hat{\beta}_1)$ is achieved when $R_{OD} = \sqrt{f_2 / f_1}$.

3.4 Simulation

To verify our proposed variance formula for Poisson-distributed count secondary outcomes, we conducted Monte Carlo simulations. The three candidate variances, $Var_{DER}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and $Var_{EMP}(\hat{\beta}_1)$, have the same definitions as in Chapter 2, Section 4.

We start by simulating 1,000 datasets, with each dataset containing 10,000 observations (the target population, i.e., cohort size $N$). Within each dataset, we first simulate the binary exposure variable $W$ with a prevalence of 17%. We then simulate the binary primary outcome $D$, with $E(D \mid \mathbf{X}) = \frac{e^{\gamma_0 + \gamma_1 w}}{1 + e^{\gamma_0 + \gamma_1 w}}$. The parameters $\gamma_0$ and $\gamma_1$ are set as $-1.4$ and $0.7$, respectively, so the prevalence of the primary outcome $D$ in each simulated dataset is around 22%. Additionally, we simulate the Poisson-distributed secondary outcome $Y$, with $E(Y \mid \mathbf{X}, D) = e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 w d}$. The parameters $\alpha_0$, $\alpha_1$, $\alpha_2$, $\alpha_3$ are chosen as 0.01, 0.05, 0.035, and 0.11, respectively. Therefore, the expected value $E(Y)$ and variance $Var(Y)$ are approximately equal to 1 in each simulated dataset. The true value of $\beta_1$, estimated from a large simulated dataset, is 0.094. Table 3.1 compares $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and our proposed variance $Var_{DER}(\hat{\beta}_1)$ under different sampling ratios using the given parameter set. The first column of the table represents the ratio of controls to cases. The second column displays the empirical variance of the estimator of the exposure effect for each sampling ratio. The third column shows the average of the variance of the estimator of the exposure effect obtained using the Stata “gmm” command across 1,000 simulation runs. The last column shows our proposed variance formula. When we calculate $Var_{DER}(\hat{\beta}_1)$, we assume $N_1$ and $N_0$ are fixed, with $N_1 = N \times P(D = 1) = 2{,}205$ and $N_0 = 7{,}795$. Table 3.1 clearly illustrates that our proposed $Var_{DER}(\hat{\beta}_1)$ is very close to $Var_{EMP}(\hat{\beta}_1)$ and $Var_{GMM}(\hat{\beta}_1)$ across all reasonable sampling designs. By applying our optimal sampling formula with the provided pre-specified parameters, we find that the optimal sampling ratio is $R_{OD} = 2.64$. Given that the number of individuals that can be selected is $n = Cost / c_{per} = 3{,}000$, we can determine the number of controls as $n_0 = 2{,}176$ and the number of cases as $n_1 = 824$.
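The optimal ratio for this design can be recomputed from the stated inputs. The following Python sketch evaluates $f_1$, $f_2$, and $R_{OD} = \sqrt{f_2/f_1}$ using only the standard library. Several choices are assumptions made for illustration: $\gamma_0$ is taken as $-1.4$ (the value consistent with the reported 22% prevalence of $D$), the marginal means $\mu(w)$ and hence $\boldsymbol{\beta}$ are obtained by averaging $\tilde{\mu}(w, d)$ over $D$ given $W$ rather than from a large simulated dataset, and cases are allocated by rounding $n/(1 + R_{OD})$.

```python
import math

# Design inputs reported for the Chapter 3 simulation.
p_w1 = 0.17                                # exposure prevalence
gamma0, gamma1 = -1.4, 0.7                 # logistic model for D given W (sign of gamma0 assumed)
alpha = (0.01, 0.05, 0.035, 0.11)          # log-linear model for Y given W and D
N1, N0, n = 2205, 7795, 3000               # cohort cases, cohort controls, total sample size

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def mu_tilde(w, d):
    """E(Y | W = w, D = d) under the log-linear model."""
    a0, a1, a2, a3 = alpha
    return math.exp(a0 + a1 * w + a2 * d + a3 * w * d)

def p_d1_given_w(w):
    return expit(gamma0 + gamma1 * w)

def mu(w):
    """E(Y | W = w), obtained by averaging mu_tilde over D given W."""
    p1 = p_d1_given_w(w)
    return (1 - p1) * mu_tilde(w, 0) + p1 * mu_tilde(w, 1)

def p_joint(w, d):
    """Joint probability P(W = w, D = d)."""
    pw = p_w1 if w == 1 else 1 - p_w1
    pd = p_d1_given_w(w) if d == 1 else 1 - p_d1_given_w(w)
    return pw * pd

def mu_tt(w, d):
    """(mu_tilde - mu)^2 + mu_tilde, the double-tilde quantity in B(beta)."""
    return (mu_tilde(w, d) - mu(w)) ** 2 + mu_tilde(w, d)

# With a binary exposure the marginal model is saturated, so beta follows from mu(0), mu(1).
beta0 = math.log(mu(0))
beta1 = math.log(mu(1)) - beta0

a21 = p_w1 * math.exp(beta0 + beta1)
a11 = a21 + (1 - p_w1) * math.exp(beta0)

def f(d, N_d):
    """f_1 (d = 1) or f_2 (d = 0): zeta + eta - xi + tau."""
    zeta = a21**2 * p_joint(1, d) * N_d * mu_tt(1, d)
    eta = a21**2 * p_joint(0, d) * N_d * mu_tt(0, d)
    xi = 2 * a21 * a11 * p_joint(1, d) * N_d * mu_tt(1, d)
    tau = a11**2 * p_joint(1, d) * N_d * mu_tt(1, d)
    return zeta + eta - xi + tau

f1, f2 = f(1, N1), f(0, N0)
r_od = math.sqrt(f2 / f1)
n1 = round(n / (1 + r_od))   # cases
n0 = n - n1                  # controls
print(round(beta1, 3), round(r_od, 2), n0, n1)
```

Under these assumptions the computed ratio and allocation agree with the values reported above, and the implied $\beta_1$ is close to the reported 0.094.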

3.5 Conclusion

When planning research for a secondary case-control study, an important consideration is how to achieve an efficient design. In this chapter, instead of using simulation studies to estimate the optimal sampling ratio, we provide a closed-form optimal sampling formula for case-control studies with Poisson-distributed secondary outcomes, which is easy to understand and implement. Our proposed sampling formula can assist researchers in achieving efficient epidemiological study designs with count secondary outcomes. We conducted simulation studies to verify that our proposed variance formula accurately approximates the empirical variance and the variance from the Stata “gmm” command.

Table 3.1: The comparison of empirical variance $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and the proposed variance $Var_{DER}(\hat{\beta}_1)$ under the different sampling designs with a Poisson-distributed secondary outcome $Y$.

n0 : n1     Var_EMP (E-2)   Var_GMM (E-2)   Var_DER (E-2)
1 : 1       0.275           0.261           0.261
2 : 1       0.219           0.220           0.220
2.64 : 1    0.217           0.216           0.217
3 : 1       0.213           0.217           0.217
4 : 1       0.214           0.225           0.224
1 : 2       0.366           0.348           0.366
1 : 3       0.476           0.486           0.477
1 : 4       0.539           0.588           0.590

Chapter 4 Optimal Sampling Strategies Using Case-Control Studies for Binary Secondary Outcomes Using Doubly-Weighted Inverse Probability Estimating Equations under Budget Constraints

4.1 Background

In the previous chapters, we presented our optimal sampling strategies for secondary case-control studies with different types of secondary outcomes, aiming to achieve an efficient case-control study design. In those scenarios, we focused only on the association between a single exposure variable and the outcome of interest. In this chapter, we extend our proposed optimal sampling strategies to incorporate an additional confounder. The estimators of interest are obtained using doubly-weighted estimating equations; see Chapter 1 for details.

This chapter is organized as follows. First, we derive the variance formula for the estimator of the exposure effect in the context of doubly-weighted estimating equations. Second, we present the optimal sampling formula, which minimizes the variance of the estimator of the exposure coefficient while accounting for a budget constraint. Third, we verify our proposed sampling formula through Monte Carlo simulations. Fourth, we apply the derived optimal sampling formula to an empirical study. Finally, we draw conclusions based on our findings.

4.2 Notation and estimation

4.2.1 Notation

This chapter employs notation similar to that of the previous chapters. Suppose the study cohort (target population) consists of a total of $N$ independent subjects. We denote the binary case-control primary outcome as $D$, where $D = 1$ indicates the presence of the disease and $D = 0$ indicates its absence. The population sizes of cases and controls are denoted by $N_1$ and $N_0$, respectively. We assume that $N_1$ and $N_0$ are known, and $N_1 + N_0 = N$. We let $Y_i$ represent the binary secondary outcome of interest. We let $S_i$ be the sampling indicator, with $S_i = 1$ indicating inclusion in the secondary outcome analysis, and $S_i = 0$ otherwise. We define $n_1$ and $n_0$ as the unknown sample sizes to be selected from the case and control groups, respectively. The total number of subjects selected from the target population is denoted by $n$, with $n = n_1 + n_0$. We define the total cost of the secondary outcome study as $Cost$, and we let $c_{per}$ represent the known cost per individual for samples from the case group or control group. Our objective is to derive an expression for the optimal sampling ratio under a study budget constraint, where $n = Cost / c_{per} = n_1 + n_0$. Let $W_i$ denote the binary exposure status, with $W_i = 1$ indicating exposure, and $W_i = 0$ otherwise. Let $C_i$ denote the binary confounder, with $C_i = 0, 1$. We denote by $\mathbf{X}_i = (1, W_i, C_i)^T$ the $3 \times 1$ vector of covariates for each subject $i$. Following the approach in Chapter 1, we assume that the sampling probability from the study cohort for the secondary outcome analysis depends only on $D_i$. Thus we have

$$\Pr(S_i = 1 \mid D_i = d, Y_i, \mathbf{X}_i) = \Pr(S_i = 1 \mid D_i = d) = \pi(d) = \frac{n_d}{N_d}, \quad d = 0, 1.$$

Let $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^T$ be a $3 \times 1$ vector of parameters of interest. Let the conditional expectation be $E(Y_i \mid \mathbf{X}_i; \boldsymbol{\beta}) = \mu(\mathbf{X}_i; \boldsymbol{\beta})$, where $g(\mu(\mathbf{X}_i; \boldsymbol{\beta})) = \mathbf{X}_i^T \boldsymbol{\beta}$. Since $Y_i$ is a binary outcome, we can use the logit link function for $g(\cdot)$. We let

$$E(Y_i \mid \mathbf{X}_i; \boldsymbol{\beta}) = \frac{e^{\mathbf{X}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{X}_i^T \boldsymbol{\beta}}}.$$

We define $P(W_i = w \mid c)$ as the propensity score, where $\frac{1}{P(W_i = w \mid c)}$ represents the inverse probability weight for the subjects in group $w$, for $w = 0, 1$. An appropriate estimator of $\boldsymbol{\beta}$ can be obtained by solving the doubly-weighted estimating equations:

$$\sum_{i=1}^{N} U_i(\boldsymbol{\beta}) = 0,$$

where

$$U_i(\boldsymbol{\beta}) = \frac{\mathbf{X}_i S_i}{\pi(D_i) \times P(W_i = w_i \mid c_i)} \left[ Y_i - \mu(\mathbf{X}_i) \right].$$

(For simplicity, we omit the subscript $i$ in the following content.)
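The doubly-weighted estimating equations above can also be solved by a Newton-Raphson iteration. The Python sketch below, assuming NumPy, combines the sampling weight $1/\pi(D_i)$ with the propensity weight $1/P(W_i = w_i \mid c_i)$ in a weighted logistic score. The function name, the simulated toy cohort, and every numerical value in it are hypothetical and are not the data-generating process used in this chapter.

```python
import numpy as np

def solve_doubly_weighted_logit(X, y, s, pi, ps, max_iter=50, tol=1e-10):
    """Solve sum_i X_i S_i / (pi(D_i) P(W_i = w_i | c_i)) * (Y_i - expit(X_i' beta)) = 0.

    pi : sampling probability pi(D_i) for each subject
    ps : propensity P(W_i = w_i | c_i) of the exposure level actually observed
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(s, dtype=float) / (np.asarray(pi, dtype=float) * np.asarray(ps, dtype=float))
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - mu))                          # sum_i U_i(beta)
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X     # minus the derivative of the score
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical toy cohort, for illustration only.
rng = np.random.default_rng(2024)
N = 5000
C = rng.binomial(1, 0.5, N)                                   # binary confounder
p_w = np.where(C == 1, 0.25, 0.10)                            # assumed P(W = 1 | C)
W = rng.binomial(1, p_w)
D = rng.binomial(1, 0.25, N)                                  # primary outcome
Y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.3 * W + 0.5 * C))))
pi = np.where(D == 1, 0.6, 0.2)                               # assumed sampling fractions n_d / N_d
S = rng.binomial(1, pi)                                       # sampling indicator
ps = np.where(W == 1, p_w, 1 - p_w)                           # propensity of the observed exposure
X = np.column_stack([np.ones(N), W, C])

beta_hat = solve_doubly_weighted_logit(X, Y, S, pi, ps)
mu_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
residual_score = X.T @ (S / (pi * ps) * (Y - mu_hat))
print(beta_hat, residual_score)
```

The residual score evaluated at the solution should be numerically zero, which confirms that the returned vector solves the estimating equations for the toy data.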

4.2.2 Variance estimation and optimal sampling ratio formula derivation

To provide the optimal sampling ratio formula for the secondary outcome in a case-control design, we derive the variance of the estimator of the exposure effect for the doubly-weighted estimating equations. In this chapter, we extend the variance derivation method used in the previous chapters to consider both the exposure variable and a binary confounder in the mean model. We denote the variance of the estimator of the exposure effect as $Var(\hat{\beta}_1)$. The variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ is therefore a $3 \times 3$ matrix, where the second element on the diagonal represents $Var(\hat{\beta}_1)$. We obtain $Var(\hat{\beta}_1)$ by computing $\mathbf{A}^{-1}(\boldsymbol{\beta})\, \mathbf{B}(\boldsymbol{\beta})\, [\mathbf{A}^{-1}(\boldsymbol{\beta})]^T$ directly. It is easy to verify that

$$\mathbf{A}(\boldsymbol{\beta}) = E\left[ \frac{e^{\mathbf{X}^T \boldsymbol{\beta}}}{P(W = w \mid c) \left( 1 + e^{\mathbf{X}^T \boldsymbol{\beta}} \right)^2}\, \mathbf{X}\mathbf{X}^T \right],$$

$$\mathbf{B}(\boldsymbol{\beta}) = E\left\{ \frac{\left( \tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X}) \right)^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2}{[P(W = w \mid c)]^2\, \pi(D)}\, \mathbf{X}\mathbf{X}^T \right\}.$$

(See Appendix C.1 and Appendix C.2 for the explicit derivations of $\mathbf{A}(\boldsymbol{\beta})$ and $\mathbf{B}(\boldsymbol{\beta})$.)

Here

$$\tilde{\mu}(\mathbf{X}, D; \boldsymbol{\alpha}) = E(Y \mid \mathbf{X}, D; \boldsymbol{\alpha}) = \frac{e^{\alpha_0 + \alpha_1 w + \alpha_2 c + \alpha_3 d + \alpha_4 w d}}{1 + e^{\alpha_0 + \alpha_1 w + \alpha_2 c + \alpha_3 d + \alpha_4 w d}}$$

represents the expectation of the secondary outcome given the primary outcome, the exposure variable, and the confounder. We assume that this model for $E(Y \mid \mathbf{X}, D; \boldsymbol{\alpha})$ is the true model and that there is no interaction between $W$ and $C$. The coefficient vectors $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^T$ and $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4)^T$ are associated with $\mu(\mathbf{X})$ and $\tilde{\mu}(\mathbf{X}, D)$, respectively. It is important to note that the parameter set $\boldsymbol{\alpha}$ needs to be specified according to the related literature or expert experience in order to calculate the optimal sampling ratio using our method. The details of the derivation of the optimal sampling ratio $R_{OD}$ can be found in Appendix C.3.

4.3 Simulation

We conducted Monte Carlo simulations to evaluate the performance of the proposed variance formula presented in this chapter. The three candidate variances, $Var_{DER}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and $Var_{EMP}(\hat{\beta}_1)$, have the same definitions as in the previous chapters. We started by simulating 1,000 target population datasets, each containing 10,000 observations. Within each dataset, we first simulated the continuous age variable $C$, with a mean of 64.76 and a standard deviation of 10.82. Then we simulated the binary exposure variable $W$ conditional on $C$, with a prevalence of 17.2%. We then simulated the binary primary outcome $D$ with $E(D \mid \mathbf{X}) = \frac{e^{\gamma_0 + \gamma_1 w + \gamma_2 c}}{1 + e^{\gamma_0 + \gamma_1 w + \gamma_2 c}}$. The parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$ are set at $-0.1$, $1.1$, and $-0.02$, respectively. Thus, the prevalence of the primary outcome $D$ in each simulated dataset is around 24.3%. Additionally, we simulated the binary secondary outcome $Y$ with $E(Y \mid \mathbf{X}, D) = \frac{e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 c + \alpha_4 w d}}{1 + e^{\alpha_0 + \alpha_1 w + \alpha_2 d + \alpha_3 c + \alpha_4 w d}}$. The parameters $\alpha_0$, $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are chosen as $-0.65$, $0.1$, $-0.01$, $1.001$, and $-0.368$, respectively. Therefore, the prevalence of the secondary outcome is around 26.4% in each simulated dataset. The causal DAG for the data generating process can be found in Figure 2.

Figure 2. The causal DAG for the data generating process. [Figure not reproduced; nodes: C, W, D, Y, and the selection node S = 1.]

Here $W$ is the exposure, $C$ is the confounder, $D$ is the primary outcome, $Y$ is the secondary outcome, and $S$ is the selection indicator. Our method requires a dichotomous confounder, so we categorized age $C$ as a binary variable $L$, where $L = 0$ if age is less than or equal to 63, and $L = 1$ if age is greater than 64. The propensity score was estimated with $P(W = 1 \mid L)$ using the cohort data. To obtain the true value of $\boldsymbol{\beta}$, we simulated a large dataset with 10,000,000 observations. We then applied logistic regression for the outcome $Y$ on $W$ and $L$, resulting in an estimated coefficient $\hat{\beta}_1 = 0.177$. It is interesting to note that our target parameter of interest is not the marginal causal odds ratio; instead, we are interested in the conditional odds ratio with a binary confounder $L$ in the outcome model, and the propensity score is also estimated using the binary confounder.

Table 4.1 compares $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and our proposed variance $Var_{DER}(\hat{\beta}_1)$ under different sampling ratios using the given parameter sets. The first column of the table represents the mean of $\hat{\beta}_1$ among 1,000 simulations. The second column represents the ratio of controls to cases. The third column displays the empirical variance of the estimator of the exposure effect for each sampling ratio. The fourth column shows the average of the variance of the estimator of the exposure effect obtained using the Stata “gmm” command across 1,000 simulation runs. The last column shows the value obtained using our proposed method. When calculating $Var_{DER}(\hat{\beta}_1)$, we assume $N_1$ and $N_0$ are fixed, with $N_1 = N \times P(D = 1) = 2{,}426$ and $N_0 = N - N_1 = 7{,}574$. We note that this method should be equivalent to using each simulated dataset to calculate $Var_{DER}(\hat{\beta}_1)$ and obtaining the average of $Var_{DER}(\hat{\beta}_1)$ among 1,000 simulation runs. This equivalence arises because $N_1$ and $N_0$ are calculated based on the expected value of $D$. Table 4.1 clearly illustrates that our proposed $Var_{DER}(\hat{\beta}_1)$ is very close to $Var_{EMP}(\hat{\beta}_1)$ and $Var_{GMM}(\hat{\beta}_1)$ across all reasonable sampling ratios. By applying our optimal sampling formula with the provided pre-specified parameters, we find that the optimal sampling ratio is $R_{OD} = 1.90$. Given that the number of individuals that can be selected is $n = Cost / c_{per} = 3{,}000$, we can determine the number of controls as $n_0 = 1{,}965$ and the number of cases as $n_1 = 1{,}034$.

Table 4.1: The comparison of empirical variance $Var_{EMP}(\hat{\beta}_1)$, $Var_{GMM}(\hat{\beta}_1)$, and the proposed variance $Var_{DER}(\hat{\beta}_1)$ under the different sampling designs with a binary secondary outcome $Y$.

Mean of beta1-hat   n0 : n1    Var_EMP   Var_GMM   Var_DER
0.175               1 : 1      0.013     0.013     0.012
0.175               1.9 : 1    0.011     0.012     0.011
0.177               2 : 1      0.011     0.012     0.011
0.171               3 : 1      0.011     0.012     0.012
0.176               4 : 1      0.014     0.013     0.013
0.176               1 : 2      0.016     0.017     0.017
0.166               1 : 3      0.023     0.022     0.021
0.180               1 : 4      0.028     0.027     0.026

4.4 Empirical illustration

We applied our proposed optimal sampling ratio formula to the Pesticides and Sense of Smell (PASS) Study to develop an efficient study design for analyzing the association between a binary secondary outcome and the exposure while accounting for a confounder. Detailed information about the PASS study can be found in Chapter 2. In this study, the primary outcome of interest is olfactory impairment, denoted as $D$, while the secondary outcome of interest is cognitive decline, denoted as $Y$. The binary exposure variable is the high pesticide exposure event (HPEE), denoted as $W$, and age serves as the confounder, denoted as $L$. Our purpose is to provide an efficient sampling design that aims to examine the association between $Y$ and $W$ given $L$ in the target population without conditioning on $D$, using doubly-weighted estimating equations. Our optimal sampling formula requires certain parameters to be specified in advance. Some parameters can be derived from the cohort data. For example, the prevalence of the exposure $W$ in the cohort is 14%. The total sample size in the case-control cohort is $N = N_0 + N_1 = 15{,}893 + 2{,}633 = 18{,}526$. The propensity score and the joint probability of $W$, $L$, and $D$ can also be estimated from the cohort data. The prevalence of $Y$ is set to either 0.1 or 0.2. The study budget is $25,000 for data collection, and the unit cost is $5.40 for the sampling of cases and controls. We also vary the association between the primary outcome $D$ and the secondary outcome $Y$ in the exposure group, with $OR_{YD \mid W=1} = [1.5, 2.0, 2.5]$, and in the non-exposure group, with $OR_{YD \mid W=0} = [1.2, 1.4, 1.6]$. Table 4.2 lists the optimal sampling ratios and the exact number of samples from cases and controls for the above parameter scenarios, obtained using our proposed optimal sampling formula. The advantage of the “T-table”-like format is that it provides researchers with an intuitive understanding of the sampling ratio when they have knowledge of a range of parameters. Furthermore, the table demonstrates that a stronger association between $Y$ and $D$ results in a smaller sampling ratio, meaning that fewer subjects in the control group and more subjects in the case group will be sampled.
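Once $R_{OD}$ is known, turning the budget into case and control counts is a short calculation. The Python sketch below uses the budget constraint $n = Cost / c_{per}$ and the ratio $R_{OD} = n_0 / n_1$. The function name `allocate_sample` and the rounding convention (floor the affordable sample size, round the number of cases, assign the remainder to controls) are assumptions, so results for other scenarios may differ from tabulated values by a subject.

```python
import math

def allocate_sample(cost, c_per, r_od):
    """Split the affordable sample n = cost / c_per into controls n0 and cases n1 with n0 / n1 close to r_od."""
    n = math.floor(cost / c_per)        # whole subjects the budget can cover
    n1 = round(n / (1 + r_od))          # cases
    n0 = n - n1                         # controls
    return n0, n1

# Budget of $25,000 at $5.40 per subject, with an optimal ratio of 3.796.
print(allocate_sample(25_000, 5.40, 3.796))
```

For the budget and unit cost stated above and $R_{OD} = 3.796$, this convention yields 3,664 controls and 965 cases.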

Table 4.2: Optimal sample ratio $R_{OD}$ varies with the prevalence of the secondary outcome $Y$, and the association between $D$ and $Y$, while keeping other parameters fixed.

P(Y = 1)   OR_YD|W=0   OR_YD|W=1   alpha3   alpha4    R_OD    n0      n1
0.1        1.200       1.500       0.182    0.223     3.796   3,664   965
0.1        1.200       2.000       0.182    0.511     3.556   3,612   1,016
0.1        1.200       2.500       0.182    0.734     3.401   3,557   1,051
0.1        1.400       1.500       0.336    0.069     3.762   3,657   972
0.1        1.400       2.000       0.336    0.375     3.503   3,601   1,028
0.1        1.400       2.500       0.336    0.580     3.365   3,568   1,061
0.1        1.600       1.500       0.470    -0.065    3.735   3,651   978
0.1        1.600       2.000       0.470    0.223     3.489   3,598   1,031
0.1        1.600       2.500       0.470    0.446     3.331   3,560   1,069
0.2        1.200       1.500       0.182    0.223     3.957   3,695   934
0.2        1.200       2.000       0.182    0.511     3.845   3,674   955
0.2        1.200       2.500       0.182    0.734     3.788   3,662   967
0.2        1.400       1.500       0.336    0.069     3.934   3,691   938
0.2        1.400       2.000       0.336    0.375     3.815   3,667   961
0.2        1.400       2.500       0.336    0.580     3.768   3,658   971
0.2        1.600       1.500       0.470    -0.065    3.915   3,687   942
0.2        1.600       2.000       0.470    0.223     3.794   3,664   965
0.2        1.600       2.500       0.470    0.446     3.737   3,651   977
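The $\alpha_3$ and $\alpha_4$ columns of Table 4.2 follow from the two odds ratios. Assuming the logistic model for $E(Y \mid \mathbf{X}, D; \boldsymbol{\alpha})$ of Section 4.2.2, in which $\alpha_3$ multiplies $d$ and $\alpha_4$ multiplies $wd$, the odds ratio between $Y$ and $D$ is $e^{\alpha_3}$ among the unexposed and $e^{\alpha_3 + \alpha_4}$ among the exposed. The short Python sketch below applies this conversion; the function name is hypothetical.

```python
import math

def alphas_from_odds_ratios(or_w0, or_w1):
    """Return (alpha3, alpha4) implied by the Y-D odds ratios in the unexposed and exposed groups."""
    alpha3 = math.log(or_w0)             # log odds ratio of Y and D when W = 0
    alpha4 = math.log(or_w1 / or_w0)     # interaction: change in the log odds ratio when W = 1
    return alpha3, alpha4

a3, a4 = alphas_from_odds_ratios(1.2, 1.5)
print(round(a3, 3), round(a4, 3))
```

A negative $\alpha_4$ arises whenever the odds ratio among the exposed is smaller than that among the unexposed, as in the scenario with $OR_{YD \mid W=0} = 1.6$ and $OR_{YD \mid W=1} = 1.5$.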

4.5 Conclusion

In Chapter 2 and Chapter 3, we proposed our optimal sampling formula for binary and count secondary outcomes with one exposure variable in the mean model. In this chapter, we extended our optimal sampling formula by considering a binary confounder in the mean model and using doubly-weighted estimating equations. We derived the variance of the estimator of the exposure effect for the doubly-weighted estimating equations and then minimized the variance formula with the cost as a constraint to obtain the optimal sampling formula. To verify our sampling formula, we conducted Monte Carlo simulations and compared our proposed variance formula with the empirical variance and the variance from the Stata "gmm" command. Our results showed that these candidate variances were very close. Finally, we applied our proposed optimal sampling formula to an empirical study and provided an efficient study design.

Chapter 5 Conclusion

In Chapter 2 and Chapter 3, we proposed our optimal sampling formulas using case-control studies for binary and count secondary outcomes with one exposure variable in the mean model. In Chapter 4, we extended our optimal sampling formula by considering a binary confounder in the mean model and using doubly-weighted estimating equations. We derived the variance of the estimator of the exposure effect for weighted estimating equations and doubly-weighted estimating equations. Then, we minimized the variance formulas with the study cost as a constraint to obtain the optimal sampling ratios. To verify our proposed optimal sampling ratio formulas, we conducted Monte Carlo simulations and compared our proposed variance formula with the empirical variance and the variance from the Stata "gmm" command. Our simulation results showed that these candidate variances were very close in all simulations in Chapter 2, Chapter 3, and Chapter 4. Finally, we applied our proposed optimal sampling formulas to empirical studies and provided efficient study designs.

There are several interesting directions for future research. First, our proposed sampling strategy considers the inclusion of one additional confounder in the binary case. When more confounders are added, the sampling formula may differ. Second, there are various types of count data, but our formula focuses on Poisson count data. Other count outcomes in epidemiology, such as the number of emergency room visits or the number of falls in nursing homes, may exhibit an excess of zero values. Sample size determination formulas have been proposed for zero-inflated Poisson-distributed outcomes in cluster randomized trials (Zhou et al., 2022). Thus, an intriguing direction would be to determine the optimal sampling ratio for zero-inflated count outcomes or hurdle outcomes in secondary case-control studies. Another interesting extension is to develop an R Shiny app that can automatically calculate the optimal sampling ratio when researchers provide specific parameters. Finally, we assumed that the cost is the same in the case and control groups for the secondary outcome analysis. In practice, there may be situations where the costs of data collection in cases and controls are unequal, in which case the cost constraint needs to be reconsidered. Meanwhile, in all three chapters, we consider study cost as a constraint, so the total sample size that we can select is fixed. In the future, it would also be interesting to consider power as a constraint: we could obtain the optimal sampling ratio under a minimum required power and explore how variations in power affect the sampling ratio.

BIBLIOGRAPHY 

Aban, I. B., Cutter, G. R., & Mavinga, N. (2009). Inferences and power analysis concerning two negative binomial distributions with an application to MRI lesion counts data. Computational Statistics & Data Analysis, 53(3), 820–833.

Amatya, A., Bhaumik, D., & Gibbons, R. D. (2013). Sample size determination for clustered count data. Statistics in Medicine, 32(24), 4162–4179.

Breslow, N. E. (2005). Case-Control Study, Two-Phase. In P. Armitage & T. Colton (Eds.), Encyclopedia of Biostatistics (1st ed.). Wiley.

Breslow, N. E., & Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika, 75(1), 11–20.

Breslow, N. E., & Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(4), 457–468.

Brownstein, N. C., Cai, J., Smith, S., Diatchenko, L., Slade, G. D., & Bair, E. (2022). Modeling Secondary Phenotypes Conditional on Genotypes in Case-Control Studies. Stats, 5(1), 203–214.

Chen, H. Y., Kittles, R., & Zhang, W. (2013). Bias correction to secondary trait analysis with case-control design. Statistics in Medicine, 32(9), 1494–1508.

Demidenko, E. (2006). Sample size determination for logistic regression revisited. Statistics in Medicine, 26(18), 3385–3397.

Demidenko, E. (2008). Sample size and optimal design for logistic regression with binary interaction. Statistics in Medicine, 27(1), 36–46.

Dintica, C. S., Marseglia, A., Rizzuto, D., Wang, R., Seubert, J., Arfanakis, K., Bennett, D. A., & Xu, W. (2019). Impaired olfaction is associated with cognitive decline and neurodegeneration in the brain. Neurology, 92(7).

Ghosh, A., Wright, F. A., & Zou, F. (2013). Unified Analysis of Secondary Traits in Case-Control Association Studies. Journal of the American Statistical Association, 108(502), 566–576.

Hanley, J. A., Csizmadi, I., & Collet, J.-P. (2005). Two-Stage Case-Control Studies: Precision of Parameter Estimates and Considerations in Selecting Sample Size. American Journal of Epidemiology, 162(12), 1225–1234.

Jiang, Y., Scott, A. J., & Wild, C. J. (2006). Secondary analysis of case-control data. Statistics in Medicine, 25(8), 1323–1339.

Lee, A. J., McMurchy, L., & Scott, A. J. (1997). Re-using data from case-control studies. Statistics in Medicine, 16(12), 1377–1389.

Li, H., Gail, M. H., Berndt, S., & Chatterjee, N. (2010). Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genetic Epidemiology, 34(5), 427–433.

Lin, D. Y., & Zeng, D. (2009). Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology, 33(3), 256–265.

Lou, Y., Cao, J., Zhang, S., & Ahn, C. (2017). Sample size estimation for a two-group comparison of repeated count outcomes using GEE. Communications in Statistics - Theory and Methods, 46(14), 6743–6753.

Lyles, R. H., Lin, H.-M., & Williamson, J. M. (2007). A practical approach to computing power for generalized linear models with nominal, count, or ordinal responses. Statistics in Medicine, 26(7), 1632–1648.

Ma, Y., & Carroll, R. J. (2016). Semiparametric Estimation in the Secondary Analysis of Case-Control Studies. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1), 127–151.

McNamee, R. (2005). Optimal design and efficiency of two-phase case-control studies with error-prone and error-free exposure measures. Biostatistics, 6(4), 590–603.

Monsees, G. M., Tamimi, R. M., & Kraft, P. (2009). Genome-wide association scans for secondary traits using case-control samples. Genetic Epidemiology, 33(8), 717–728.

Morgenstern, H., & Winn, D. M. (1983). A Method for determining the sampling ratio in epidemiologic studies. Statistics in Medicine, 2(3), 387–396.

Nagelkerke, N. J. D., Moses, S., Plummer, F. A., Brunham, R. C., & Fish, D. (1995). Logistic regression in case-control studies: The effect of using independent as dependent variables. Statistics in Medicine, 14(8), 769–775.

Nam, J.-M. (1973). Optimum Sample Sizes for the Comparison of the Control and Treatment. Biometrics, 29(1), 101.

Negi, A. (2024). Doubly weighted M-estimation for nonrandom assignment and missing outcomes. Journal of Causal Inference, 12(1), 20230016.

Reilly, M. (1996). Optimal Sampling Strategies for Two-Stage Studies. American Journal of Epidemiology, 143(1), 92–100.

Rettiganti, M., & Nagaraja, H. N. (2012). Power Analyses for Negative Binomial Models with Application to Multiple Sclerosis Clinical Trials. Journal of Biopharmaceutical Statistics, 22(2), 237–259.

Shrestha, S., Kamel, F., Umbach, D. M., Freeman, L. E. B., Koutros, S., Alavanja, M., Blair, A., Sandler, D. P., & Chen, H. (2019). High Pesticide Exposure Events and Olfactory Impairment among U.S. Farmers. Environmental Health Perspectives, 127(1), 017005.

Sofer, T., Cornelis, M., Kraft, P., & Tchetgen Tchetgen, E. J. (2017). Control function assisted IPW estimation with a secondary outcome in case-control studies. Statistica Sinica, 27(2), 785–804.

Song, X., Ionita-Laza, I., Liu, M., Reibman, J., & Wei, Y. (2016). A General and Robust Framework for Secondary Traits Analysis. Genetics, 202(4), 1329–1343.

Tchetgen Tchetgen, E. J. (2014). A general regression framework for a secondary outcome in case-control studies. Biostatistics, 15(1), 117–128.

Wang, J., & Shete, S. (2012). Analysis of Secondary Phenotype Involving the Interactive Effect of the Secondary Phenotype and Genetic Variants on the Primary Disease. Annals of Human Genetics, 76(6), 484–499.

Wang, J., Zhang, S., & Ahn, C. (2020). Sample size calculation for count outcomes in cluster randomization trials with varying cluster sizes. Communications in Statistics - Theory and Methods, 49(1), 116–124.

Wei, J., Carroll, R. J., Müller, U. U., Van Keilegom, I., & Chatterjee, N. (2013). Robust estimation for homoscedastic regression in the secondary analysis of case-control data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 185–206.

Xing, C., McCarthy, J. M., Dupuis, J., Cupples, L. A., Meigs, J. B., Lin, X., & Allen, A. S. (2016). Robust analysis of secondary phenotypes in case-control genetic association studies. Statistics in Medicine, 35(23), 4226–4237.

Yaffe, K., Freimer, D., Chen, H., Asao, K., Rosso, A., Rubin, S., Tranah, G., Cummings, S., & Simonsick, E. (2017). Olfaction and risk of dementia in a biracial cohort of older adults. Neurology, 88(5), 456–462.

Zhou, Z., Li, D., & Zhang, S. (2022). Sample size calculation for cluster randomized trials with zero-inflated count outcomes. Statistics in Medicine, 41(12), 2191–2204.

APPENDIX A PROOFS OF CHAPTER 2

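A.1 Derivation of $\mathbf{A}(\boldsymbol{\beta})$

Proof. We plug $\mathbf{X}_i \frac{S_i}{\pi(D_i)}\,[Y_i - \mu(\mathbf{X}_i)]$ into $\mathbf{A}(\boldsymbol{\beta})$, then we have

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left[-\frac{\partial}{\partial \boldsymbol{\beta}^T}\left(\mathbf{X}_i \frac{S_i}{\pi(D_i)}\left(Y_i - \frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}}\right)\right)\right]
= E\left[\frac{S_i}{\pi(D_i)} \times \frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}_i\mathbf{X}_i^T\right].
\]

Taking iterated expectations on it, then

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{E\left[\frac{S_i}{\pi(D_i)} \times \frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}_i\mathbf{X}_i^T \,\middle|\, D_i\right]\right\}.
\]

Note that $\pi(D_i) = \Pr(S_i = 1 \mid D_i = j)$ only depends on $D_i$, thus

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{\frac{E(S_i \mid D_i)}{\pi(D_i)} \times E\left[\frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}_i\mathbf{X}_i^T \,\middle|\, D_i\right]\right\}.
\]

Since $S_i$ is a binary variable, then $E(S_i \mid D_i) = \Pr(S_i = 1 \mid D_i = j) = \pi(D_i)$ under the assumption that the sampling probability from the study cohort for the secondary outcome analysis depends only on the primary outcome. Thus

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{E\left[\frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}_i\mathbf{X}_i^T \,\middle|\, D_i\right]\right\}
= E\left[\frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}_i\mathbf{X}_i^T\right].
\]

∎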
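The key step in this derivation is that weighting each sampled subject by the inverse of a sampling probability that depends only on the primary outcome recovers the full-cohort expectation. The following is a minimal simulation sketch of that step, not part of the original proof. The sampling probabilities, the model for the primary outcome, and the function `g(x) = x**2` are all hypothetical choices made for illustration.

```python
import random


def ipw_demo(n=200_000, seed=0):
    """Compare a full-cohort mean with its inverse-probability-weighted analogue."""
    rng = random.Random(seed)
    pi = {1: 0.8, 0: 0.1}  # hypothetical sampling probabilities pi(D)
    full_sum = 0.0
    ipw_sum = 0.0
    for _ in range(n):
        x = rng.random()                               # covariate X ~ Uniform(0, 1)
        d = 1 if rng.random() < 0.2 + 0.5 * x else 0   # primary outcome depends on X
        s = 1 if rng.random() < pi[d] else 0           # sampling depends on D only
        g = x * x                                      # any function of X
        full_sum += g                                  # uses the whole cohort
        ipw_sum += s / pi[d] * g                       # uses sampled subjects only
    return full_sum / n, ipw_sum / n


if __name__ == "__main__":
    full_mean, ipw_mean = ipw_demo()
    print(full_mean, ipw_mean)
```

Under these assumptions the two averages should agree up to Monte Carlo error, even though controls are sampled far less often than cases.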

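A.2 Derivation of $\mathbf{B}(\boldsymbol{\beta})$

Proof. Similar to the derivation of $\mathbf{A}(\boldsymbol{\beta})$, we first plug $\mathbf{X}_i \frac{S_i}{\pi(D_i)}\,[Y_i - \mu(\mathbf{X}_i)]$ into $\mathbf{B}(\boldsymbol{\beta})$, then we have

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{S_i^2}{[\pi(D_i)]^2}\,(Y_i - \mu(\mathbf{X}_i))^2\,\mathbf{X}_i\mathbf{X}_i^T\right].
\]

The sampling indicator variable $S_i$ is a binary variable, thus $S_i^2 = S_i$, then

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{S_i}{[\pi(D_i)]^2}\,(Y_i - \mu(\mathbf{X}_i))^2\,\mathbf{X}_i\mathbf{X}_i^T\right].
\]

Then take the iterated expectations,

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{E\left[\frac{S_i}{[\pi(D_i)]^2}\,(Y_i - \mu(\mathbf{X}_i))^2\,\mathbf{X}_i\mathbf{X}_i^T \,\middle|\, \mathbf{X}_i, D_i\right]\right\} \\
&= E\left\{\frac{E(S_i \mid D_i)}{[\pi(D_i)]^2}\,E\left[(Y_i - \mu(\mathbf{X}_i))^2 \,\middle|\, \mathbf{X}_i, D_i\right]\mathbf{X}_i\mathbf{X}_i^T\right\} \\
&= E\left\{\frac{1}{\pi(D_i)}\,E\left[(Y_i - \mu(\mathbf{X}_i))^2 \,\middle|\, \mathbf{X}_i, D_i\right]\mathbf{X}_i\mathbf{X}_i^T\right\}.
\end{aligned}
\]

We can simplify the expectation $E\left[(Y_i - \mu(\mathbf{X}_i))^2 \mid \mathbf{X}_i, D_i\right]$,

\[
\begin{aligned}
E\left[(Y_i - \mu(\mathbf{X}_i))^2 \,\middle|\, \mathbf{X}_i, D_i\right]
&= E\left[Y_i^2 - 2Y_i\,\mu(\mathbf{X}_i) + \mu(\mathbf{X}_i)^2 \,\middle|\, \mathbf{X}_i, D_i\right] \\
&= E\left(Y_i^2 \mid \mathbf{X}_i, D_i\right) - 2E\left[Y_i\,\mu(\mathbf{X}_i) \mid \mathbf{X}_i, D_i\right] + \mu(\mathbf{X}_i)^2 \\
&= E\left(Y_i^2 \mid \mathbf{X}_i, D_i\right) - 2\mu(\mathbf{X}_i)\,E\left(Y_i \mid \mathbf{X}_i, D_i\right) + \mu(\mathbf{X}_i)^2 .
\end{aligned}
\]

Under the above study design setting, $Y_i$ is a binary secondary outcome, so

\[
\tilde{\mu}(\mathbf{X}_i, D_i) \equiv E\left(Y_i^2 \mid \mathbf{X}_i, D_i\right) = E\left(Y_i \mid \mathbf{X}_i, D_i\right).
\]

Plug $\tilde{\mu}(\mathbf{X}_i, D_i)$ into $E\left[(Y_i - \mu(\mathbf{X}_i))^2 \mid \mathbf{X}_i, D_i\right]$, we have

\[
\begin{aligned}
E\left[(Y_i - \mu(\mathbf{X}_i))^2 \,\middle|\, \mathbf{X}_i, D_i\right]
&= \tilde{\mu}(\mathbf{X}_i, D_i) - 2\mu(\mathbf{X}_i)\,\tilde{\mu}(\mathbf{X}_i, D_i) + \mu(\mathbf{X}_i)^2 \\
&= \tilde{\mu}(\mathbf{X}_i, D_i)^2 - 2\mu(\mathbf{X}_i)\,\tilde{\mu}(\mathbf{X}_i, D_i) + \mu(\mathbf{X}_i)^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2 \\
&= \left[\tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i)\right]^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2 .
\end{aligned}
\]

Now substitute $\left[\tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i)\right]^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2$ into $\mathbf{B}(\boldsymbol{\beta})$, so

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{\frac{1}{\pi(D_i)}\,E\left[\left[\tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i)\right]^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2 \,\middle|\, \mathbf{X}_i, D_i\right]\mathbf{X}_i\mathbf{X}_i^T\right\} \\
&= E\left\{\frac{1}{\pi(D_i)}\left[\left(\tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i)\right)^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2\right]\mathbf{X}_i\mathbf{X}_i^T\right\}.
\end{aligned}
\]

Denote $\tilde{\tilde{\mu}}(\mathbf{X}_i, D_i) \equiv \left(\tilde{\mu}(\mathbf{X}_i, D_i) - \mu(\mathbf{X}_i)\right)^2 + \tilde{\mu}(\mathbf{X}_i, D_i) - \tilde{\mu}(\mathbf{X}_i, D_i)^2$, then

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{\tilde{\tilde{\mu}}(\mathbf{X}_i, D_i)}{\pi(D_i)}\,\mathbf{X}_i\mathbf{X}_i^T\right].
\]

∎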
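The simplification of the conditional second moment for a binary outcome can be checked directly. The sketch below is illustrative only: `mu` stands for the working-model mean and `m` for the conditional mean of the binary outcome given covariates and primary outcome, and both function names are hypothetical.

```python
def second_moment_about(mu, m):
    """E[(Y - mu)^2] for Y ~ Bernoulli(m), summed over the two outcomes."""
    return m * (1.0 - mu) ** 2 + (1.0 - m) * (0.0 - mu) ** 2


def mu_tilde_tilde(mu, m):
    """Closed form from the derivation: (m - mu)^2 + m - m^2."""
    return (m - mu) ** 2 + m - m ** 2


if __name__ == "__main__":
    for mu, m in [(0.2, 0.5), (0.7, 0.1), (0.35, 0.35)]:
        print(mu, m, second_moment_about(mu, m), mu_tilde_tilde(mu, m))
```

When `mu` equals `m` the expression reduces to the usual Bernoulli variance `m * (1 - m)`. The extra squared term captures the gap between the working mean and the conditional mean within a primary-outcome stratum.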

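A.3 Derivation of $\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T$

Proof. The inverse of matrix $\mathbf{A}$ is

\[
\mathbf{A}^{-1} = \frac{1}{\det \mathbf{A}}
\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix},
\]

where $\det \mathbf{A} = a_{11}a_{22} - a_{12}a_{21}$. Then

\[
\left[\mathbf{A}^{-1}\right]^T = \frac{1}{\det \mathbf{A}}
\begin{bmatrix} a_{22} & -a_{21} \\ -a_{12} & a_{11} \end{bmatrix},
\]

and

\[
\mathbf{A}^{-1}\mathbf{B}
= \frac{1}{\det \mathbf{A}}
\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}
\times
\begin{bmatrix} b_{11} & b_{12} \\ b_{12} & b_{22} \end{bmatrix}
= \frac{1}{\det \mathbf{A}}
\begin{bmatrix}
a_{22}b_{11} - a_{12}b_{12} & a_{22}b_{12} - a_{12}b_{22} \\
-a_{21}b_{11} + a_{11}b_{12} & -a_{21}b_{12} + a_{11}b_{22}
\end{bmatrix},
\]

\[
\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T
= \frac{1}{\det \mathbf{A}}
\begin{bmatrix}
a_{22}b_{11} - a_{12}b_{12} & a_{22}b_{12} - a_{12}b_{22} \\
-a_{21}b_{11} + a_{11}b_{12} & -a_{21}b_{12} + a_{11}b_{22}
\end{bmatrix}
\times
\frac{1}{\det \mathbf{A}}
\begin{bmatrix} a_{22} & -a_{21} \\ -a_{12} & a_{11} \end{bmatrix}
= \frac{1}{(\det \mathbf{A})^2}
\begin{bmatrix} \Omega_1 & \Omega_2 \\ \Omega_3 & \Omega_4 \end{bmatrix},
\]

where,

\[
\begin{aligned}
\Omega_1 &= (a_{22}b_{11} - a_{12}b_{12})\,a_{22} - (a_{22}b_{12} - a_{12}b_{22})\,a_{12}, \\
\Omega_2 &= (a_{12}b_{12} - a_{22}b_{11})\,a_{21} + (a_{22}b_{12} - a_{12}b_{22})\,a_{11}, \\
\Omega_3 &= (a_{11}b_{12} - a_{21}b_{11})\,a_{22} + (a_{21}b_{12} - a_{11}b_{22})\,a_{12}, \\
\Omega_4 &= (a_{21}b_{11} - a_{11}b_{12})\,a_{21} + (a_{11}b_{22} - a_{21}b_{12})\,a_{11}.
\end{aligned}
\]

∎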
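The closed-form entries of the 2×2 sandwich matrix can be checked numerically against an explicit inverse-and-multiply computation. The sketch below is only an illustration under that reading of the derivation. The function names and test values are hypothetical, and the second matrix is taken to be symmetric with off-diagonal entry `b12`.

```python
def sandwich_closed_form(a11, a12, a21, a22, b11, b12, b22):
    """Entries Omega_1..Omega_4 divided by det(A)^2."""
    det = a11 * a22 - a12 * a21
    om1 = (a22 * b11 - a12 * b12) * a22 - (a22 * b12 - a12 * b22) * a12
    om2 = (a12 * b12 - a22 * b11) * a21 + (a22 * b12 - a12 * b22) * a11
    om3 = (a11 * b12 - a21 * b11) * a22 + (a21 * b12 - a11 * b22) * a12
    om4 = (a21 * b11 - a11 * b12) * a21 + (a11 * b22 - a21 * b12) * a11
    return [[om1 / det ** 2, om2 / det ** 2], [om3 / det ** 2, om4 / det ** 2]]


def matmul(p, q):
    """Plain 2x2 matrix product."""
    return [[sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def sandwich_direct(a11, a12, a21, a22, b11, b12, b22):
    """Compute A^{-1} B (A^{-1})^T by inverting A explicitly."""
    det = a11 * a22 - a12 * a21
    inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
    inv_t = [[inv[0][0], inv[1][0]], [inv[0][1], inv[1][1]]]
    b = [[b11, b12], [b12, b22]]
    return matmul(matmul(inv, b), inv_t)


if __name__ == "__main__":
    args = (0.30, 0.12, 0.12, 0.12, 2.0, 0.8, 0.8)
    print(sandwich_closed_form(*args))
    print(sandwich_direct(*args))
```

The lower-right entry is the one used for the variance of the exposure coefficient, so it is the entry worth comparing first.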

A.4 Proof of Theory 1

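Proof. By the rule of functional expectation, $E(g(w)) = \sum_{w} g(w)\,p(w)$, we have

\[
\begin{aligned}
\mathbf{A}(\boldsymbol{\beta})
&= E\left[\frac{e^{\mathbf{X}_i^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}_i^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}_i\mathbf{X}_i^T\right] \\
&= p_{w=1}\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}
\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
+ \left[1 - p_{w=1}\right]\frac{e^{\beta_0}}{\left(1+e^{\beta_0}\right)^2}
\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix}
p_{w=1}\,\dfrac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2} + \left[1 - p_{w=1}\right]\dfrac{e^{\beta_0}}{\left(1+e^{\beta_0}\right)^2}
& p_{w=1}\,\dfrac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2} \\[2ex]
p_{w=1}\,\dfrac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}
& p_{w=1}\,\dfrac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}
\end{bmatrix}
\equiv \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},
\end{aligned}
\]

where

\[
a_{11} = p_{w=1}\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2} + \left[1 - p_{w=1}\right]\frac{e^{\beta_0}}{\left(1+e^{\beta_0}\right)^2},
\qquad
a_{12} = a_{21} = a_{22} = p_{w=1}\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}.
\]

Similarly, using the rule of functional expectation on $\mathbf{B}(\boldsymbol{\beta})$, we have,

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left[\frac{\tilde{\tilde{\mu}}(\mathbf{X}_i, D_i)}{\pi(D_i)}\,\mathbf{X}_i\mathbf{X}_i^T\right] \\
&= \left(\frac{p_{11}\,\tilde{\tilde{\mu}}(1,1)}{n_1/N_1} + \frac{p_{10}\,\tilde{\tilde{\mu}}(1,0)}{n_0/N_0}\right)
\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
+ \left(\frac{p_{01}\,\tilde{\tilde{\mu}}(0,1)}{n_1/N_1} + \frac{p_{00}\,\tilde{\tilde{\mu}}(0,0)}{n_0/N_0}\right)
\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
= \begin{bmatrix} b_{11} & b_{12} \\ b_{12} & b_{22} \end{bmatrix},
\end{aligned}
\]

where

\[
\begin{aligned}
b_{11} &= \frac{p_{11}N_1\,\tilde{\tilde{\mu}}(1,1)}{n_1} + \frac{p_{01}N_1\,\tilde{\tilde{\mu}}(0,1)}{n_1} + \frac{p_{10}N_0\,\tilde{\tilde{\mu}}(1,0)}{n_0} + \frac{p_{00}N_0\,\tilde{\tilde{\mu}}(0,0)}{n_0}, \\
b_{12} &= b_{22} = \frac{p_{11}N_1\,\tilde{\tilde{\mu}}(1,1)}{n_1} + \frac{p_{10}N_0\,\tilde{\tilde{\mu}}(1,0)}{n_0}.
\end{aligned}
\]

$p_{11}, p_{10}, p_{01}, p_{00}$ are the joint probabilities of the exposure variable $W_i$ and the primary outcome $D_i$. These joint probabilities can be computed analytically once we specify $\boldsymbol{\gamma} = [\gamma_0, \gamma_1]$, the parameters relating $W_i$ and $D_i$:

\[
\begin{aligned}
p_{11} &= p(D = 1 \mid W = 1) \times p_{w=1} = \frac{e^{\gamma_0+\gamma_1}}{1+e^{\gamma_0+\gamma_1}}\,p_{w=1}, \\
p_{10} &= p(D = 0 \mid W = 1) \times p_{w=1} = \frac{1}{1+e^{\gamma_0+\gamma_1}}\,p_{w=1}, \\
p_{01} &= p(D = 1 \mid W = 0) \times (1 - p_{w=1}) = \frac{e^{\gamma_0}}{1+e^{\gamma_0}}\,(1 - p_{w=1}), \\
p_{00} &= p(D = 0 \mid W = 0) \times (1 - p_{w=1}) = \frac{1}{1+e^{\gamma_0}}\,(1 - p_{w=1}).
\end{aligned}
\]

We assume $N_1$ and $N_0$ are constant with $N_1 = N\,E(D)$ and $N_0 = N - N_1$. It is easy to see that $\mathrm{Var}(\hat{\beta}_1)$ is the lower corner element of the covariance matrix $\mathbf{A}^{-1}(\boldsymbol{\beta})\,\mathbf{B}(\boldsymbol{\beta})\left[\mathbf{A}^{-1}(\boldsymbol{\beta})\right]^T / N$. Combining the derivation of $\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T$ and the condition that $a_{12} = a_{21} = a_{22}$ and $b_{12} = b_{22}$, we have

\[
\mathrm{Var}(\hat{\beta}_1) = \frac{a_{21}^2\,b_{11} - 2a_{11}b_{12}a_{21} + a_{11}^2\,b_{22}}{(\det \mathbf{A})^2\,N}.
\]

∎

A.5 Proof of Proposition 1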

Proof. We define

\[
\begin{aligned}
\zeta(1,1) &= a_{21}^2\,p_{11}N_1\,\tilde{\tilde{\mu}}(1,1), & \zeta(1,0) &= a_{21}^2\,p_{10}N_0\,\tilde{\tilde{\mu}}(1,0), \\
\iota(0,1) &= a_{21}^2\,p_{01}N_1\,\tilde{\tilde{\mu}}(0,1), & \iota(0,0) &= a_{21}^2\,p_{00}N_0\,\tilde{\tilde{\mu}}(0,0), \\
\kappa(1,1) &= 2a_{21}a_{11}\,p_{11}N_1\,\tilde{\tilde{\mu}}(1,1), & \kappa(1,0) &= 2a_{21}a_{11}\,p_{10}N_0\,\tilde{\tilde{\mu}}(1,0), \\
s(1,1) &= a_{11}^2\,p_{11}N_1\,\tilde{\tilde{\mu}}(1,1), & s(1,0) &= a_{11}^2\,p_{10}N_0\,\tilde{\tilde{\mu}}(1,0).
\end{aligned}
\]

Then plug $a_{11}, a_{21}, a_{12}, a_{22}, b_{11}, b_{12}, b_{22}$ into $\mathrm{Var}(\hat{\beta}_1)$, we have

\[
\mathrm{Var}(\hat{\beta}_1)
= \frac{\left[\zeta(1,1) + \iota(0,1) - \kappa(1,1) + s(1,1)\right] n_0 + \left[\zeta(1,0) + \iota(0,0) - \kappa(1,0) + s(1,0)\right] n_1}{(\det \mathbf{A})^2\,n_0 n_1 N}.
\]

To obtain the optimal design, we need to minimize $\mathrm{Var}(\hat{\beta}_1)$ subject to the constraint

\[
\mathrm{Cost} = c_{per}\,(n_1 + n_0),
\]

where this constraint is equivalent to $n_0 + n_1 = n$. We can write,

\[
ROD = \underset{n_0 + n_1 = n}{\arg\min}\ \frac{T_1 n_0 + T_2 n_1}{(\det \mathbf{A})^2\,n_0 n_1 N},
\]

where $T_1 = \zeta(1,1) + \iota(0,1) - \kappa(1,1) + s(1,1)$ and $T_2 = \zeta(1,0) + \iota(0,0) - \kappa(1,0) + s(1,0)$.

By using the Lagrange multiplier method, we have

\[
\min\ \frac{T_1 n_0 + T_2 n_1}{(\det \mathbf{A})^2\,n_0 n_1 N} + \lambda\,(n_0 + n_1 - n),
\]

where $\lambda$ is the Lagrange multiplier. Let

\[
\mathcal{L} = \frac{T_1 n_0 + T_2 n_1}{(\det \mathbf{A})^2\,n_0 n_1 N} + \lambda\,(n_0 + n_1 - n)
= \frac{1}{(\det \mathbf{A})^2 N}\left(\frac{T_1}{n_1} + \frac{T_2}{n_0}\right) + \lambda\,(n_0 + n_1 - n),
\]

then

\[
\frac{\partial \mathcal{L}}{\partial n_0} = -\frac{T_2}{(\det \mathbf{A})^2 N\,n_0^2} + \lambda,
\qquad
\frac{\partial \mathcal{L}}{\partial n_1} = -\frac{T_1}{(\det \mathbf{A})^2 N\,n_1^2} + \lambda.
\]

Setting the above two equations equal to 0, we have

\[
n_0 = \sqrt{\frac{T_2}{\lambda\,(\det \mathbf{A})^2 N}},
\qquad
n_1 = \sqrt{\frac{T_1}{\lambda\,(\det \mathbf{A})^2 N}}.
\]

Therefore,

\[
ROD = \frac{n_0}{n_1}
= \frac{\sqrt{\zeta(1,0) + \iota(0,0) - \kappa(1,0) + s(1,0)}}{\sqrt{\zeta(1,1) + \iota(0,1) - \kappa(1,1) + s(1,1)}}.
\]

∎
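The Lagrange-multiplier result can be sanity-checked by brute force. The sketch below assumes the variance is proportional to `(t1 * n0 + t2 * n1) / (n0 * n1)`, with positive constants `t1` and `t2` standing in for the two bracketed sums in the proof, and the common factor `(det A)^2 N` dropped because it does not affect the minimizer. All names and numbers are hypothetical.

```python
import math


def variance_kernel(n0, n1, t1, t2):
    """Variance up to a positive constant: (t1*n0 + t2*n1) / (n0*n1)."""
    return (t1 * n0 + t2 * n1) / (n0 * n1)


def optimal_ratio(t1, t2):
    """Closed-form optimal ratio n0/n1 = sqrt(t2 / t1)."""
    return math.sqrt(t2 / t1)


def grid_best_n0(n, t1, t2):
    """Exhaustive search over integer splits n0 + n1 = n."""
    return min(range(1, n), key=lambda n0: variance_kernel(n0, n - n0, t1, t2))


if __name__ == "__main__":
    n, t1, t2 = 300, 4.0, 1.0
    n0 = grid_best_n0(n, t1, t2)
    print(n0, n - n0, n0 / (n - n0), optimal_ratio(t1, t2))
```

With these illustrative values the grid search and the closed form select the same split, so a fixed total sample size can be allocated as `n0 = n * r / (1 + r)` with `r = optimal_ratio(t1, t2)`.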

APPENDIX B PROOFS OF CHAPTER 3

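B.1 Derivation of $\mathbf{B}(\boldsymbol{\beta})$

Proof. Similar to the derivation of $\mathbf{A}(\boldsymbol{\beta})$, we first plug $\mathbf{X}\frac{S}{\pi(D)}\,[Y - \mu(\mathbf{X})]$ into $\mathbf{B}(\boldsymbol{\beta})$, then we have

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{S^2}{[\pi(D)]^2}\,(Y - \mu(\mathbf{X}))^2\,\mathbf{X}\mathbf{X}^T\right].
\]

The sampling indicator variable $S$ is a binary variable, thus $S^2 = S$, then

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{S}{[\pi(D)]^2}\,(Y - \mu(\mathbf{X}))^2\,\mathbf{X}\mathbf{X}^T\right].
\]

Then take the iterated expectations,

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{E\left[\frac{S}{[\pi(D)]^2}\,(Y - \mu(\mathbf{X}))^2\,\mathbf{X}\mathbf{X}^T \,\middle|\, \mathbf{X}, D\right]\right\} \\
&= E\left\{\frac{E(S \mid D)}{[\pi(D)]^2}\,E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]\mathbf{X}\mathbf{X}^T\right\} \\
&= E\left\{\frac{1}{\pi(D)}\,E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]\mathbf{X}\mathbf{X}^T\right\}.
\end{aligned}
\]

We can simplify the expectation $E\left[(Y - \mu(\mathbf{X}))^2 \mid \mathbf{X}, D\right]$,

\[
\begin{aligned}
E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]
&= E\left[Y^2 - 2Y\mu(\mathbf{X}) + \mu(\mathbf{X})^2 \,\middle|\, \mathbf{X}, D\right] \\
&= E\left(Y^2 \mid \mathbf{X}, D\right) - 2E\left[Y\mu(\mathbf{X}) \mid \mathbf{X}, D\right] + \mu(\mathbf{X})^2 \\
&= E\left(Y^2 \mid \mathbf{X}, D\right) - 2\mu(\mathbf{X})\,E\left(Y \mid \mathbf{X}, D\right) + \mu(\mathbf{X})^2 .
\end{aligned}
\]

Under the above study design setting, $Y$ is a Poisson distributed secondary outcome, so let

\[
E\left(Y^2 \mid \mathbf{X}, D\right)
= \mathrm{Var}(Y \mid \mathbf{X}, D) + \left(E(Y \mid \mathbf{X}, D)\right)^2
= E(Y \mid \mathbf{X}, D) + \left(E(Y \mid \mathbf{X}, D)\right)^2 .
\]

We define $\tilde{\mu}(\mathbf{X}, D) = E(Y \mid \mathbf{X}, D)$. We plug $E\left(Y^2 \mid \mathbf{X}, D\right)$ into $E\left[(Y - \mu(\mathbf{X}))^2 \mid \mathbf{X}, D\right]$, we have

\[
\begin{aligned}
E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]
&= E\left(Y^2 \mid \mathbf{X}, D\right) - 2\mu(\mathbf{X})\,E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= E(Y \mid \mathbf{X}, D) + \left(E(Y \mid \mathbf{X}, D)\right)^2 - 2\mu(\mathbf{X})\,E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= \tilde{\mu}(\mathbf{X}, D) + \tilde{\mu}(\mathbf{X}, D)^2 - 2\mu(\mathbf{X})\,\tilde{\mu}(\mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= \left[\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right]^2 + \tilde{\mu}(\mathbf{X}, D).
\end{aligned}
\]

Then we have

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{\frac{1}{\pi(D)}\,E\left[\left[\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right]^2 + \tilde{\mu}(\mathbf{X}, D) \,\middle|\, \mathbf{X}, D\right]\mathbf{X}\mathbf{X}^T\right\} \\
&= E\left\{\frac{1}{\pi(D)}\left[\left(\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right)^2 + \tilde{\mu}(\mathbf{X}, D)\right]\mathbf{X}\mathbf{X}^T\right\}.
\end{aligned}
\]

Thus,

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{\left(\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right)^2 + \tilde{\mu}(\mathbf{X}, D)}{\pi(D)}\,\mathbf{X}\mathbf{X}^T\right].
\]

∎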
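For a Poisson outcome the conditional second moment about the working mean reduces to a squared bias term plus the conditional mean. The sketch below checks that identity by summing the Poisson probability mass function directly. Here `mu` stands for the working mean and `m` for the conditional Poisson mean, the names and the truncation point `terms` are hypothetical, and the truncated sum is only accurate for moderate values of `m`.

```python
import math


def poisson_second_moment_about(mu, m, terms=200):
    """E[(Y - mu)^2] for Y ~ Poisson(m), via a truncated sum over the pmf."""
    total = 0.0
    pmf = math.exp(-m)  # P(Y = 0)
    for y in range(terms):
        total += (y - mu) ** 2 * pmf
        pmf *= m / (y + 1)  # recursion: P(Y = y + 1) = P(Y = y) * m / (y + 1)
    return total


def closed_form(mu, m):
    """Closed form from the derivation: (m - mu)^2 + m."""
    return (m - mu) ** 2 + m


if __name__ == "__main__":
    print(poisson_second_moment_about(1.7, 2.5), closed_form(1.7, 2.5))
```

Compared with the binary case, the only change is that the variance term is the mean itself rather than the Bernoulli variance.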

 
APPENDIX C PROOF OF CHAPTER 4

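C.1 Derivation of $\mathbf{A}(\boldsymbol{\beta})$

In the main text, we defined $W$ as the binary exposure variable, $C$ as the binary confounder, and let $\mathbf{X} = (1, W, C)^T$. $\frac{1}{P(W = w \mid c)}$ is the inverse probability weight for the subjects in the group $w$ for $w = 0, 1$. Plugging the doubly weighted expression $\frac{\mathbf{X}S}{\pi(D) \times P(W \mid c)}\,[Y - \mu(\mathbf{X})]$ into $\mathbf{A}(\boldsymbol{\beta})$, we have

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left[-\frac{\partial}{\partial \boldsymbol{\beta}^T}\left(\frac{\mathbf{X}S}{\pi(D) \times P(W \mid c)}\left(Y - \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{1+e^{\mathbf{X}^T\boldsymbol{\beta}}}\right)\right)\right]
= E\left[\frac{S}{\pi(D) \times P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}\mathbf{X}^T\right].
\]

Taking iterated expectations on $\mathbf{A}(\boldsymbol{\beta})$, then

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{E\left[\frac{S}{\pi(D) \times P(W \mid c)} \times \frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{\left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}\mathbf{X}^T \,\middle|\, D\right]\right\}.
\]

Note that $\pi(D) = \Pr(S = 1 \mid D)$ only depends on $D$, thus

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{\frac{E(S \mid D)}{\pi(D)} \times E\left[\frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c) \times \left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}\mathbf{X}^T \,\middle|\, D\right]\right\}.
\]

Since $S$ is a binary variable, then $E(S \mid D) = \Pr(S = 1 \mid D) = \pi(D)$, thus

\[
\mathbf{A}(\boldsymbol{\beta})
= E\left\{E\left[\frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c) \times \left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}\mathbf{X}^T \,\middle|\, D\right]\right\}
= E\left[\frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c) \times \left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}\mathbf{X}^T\right].
\]
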

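C.2 Derivation of $\mathbf{B}(\boldsymbol{\beta})$

Similar to the derivation of $\mathbf{A}(\boldsymbol{\beta})$, we first plug the doubly weighted expression $\frac{\mathbf{X}S}{\pi(D) \times P(W \mid c)}\,[Y - \mu(\mathbf{X})]$ into $\mathbf{B}(\boldsymbol{\beta})$, then we have

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{S^2}{[\pi(D)]^2\,[P(W \mid c)]^2}\,(Y - \mu(\mathbf{X}))^2\,\mathbf{X}\mathbf{X}^T\right].
\]

The sampling indicator variable $S$ is a binary variable, thus $S^2 = S$, then

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{S}{[\pi(D)]^2\,[P(W \mid c)]^2}\,(Y - \mu(\mathbf{X}))^2\,\mathbf{X}\mathbf{X}^T\right].
\]

Then take the iterated expectations,

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left\{E\left[\frac{S}{[\pi(D)]^2\,[P(W \mid c)]^2}\,(Y - \mu(\mathbf{X}))^2\,\mathbf{X}\mathbf{X}^T \,\middle|\, \mathbf{X}, D\right]\right\} \\
&= E\left\{\frac{E(S \mid D)}{[\pi(D)]^2\,[P(W \mid c)]^2}\,E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]\mathbf{X}\mathbf{X}^T\right\} \\
&= E\left\{\frac{1}{\pi(D)\,[P(W \mid c)]^2}\,E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]\mathbf{X}\mathbf{X}^T\right\}.
\end{aligned}
\]

We can simplify the expectation $E\left[(Y - \mu(\mathbf{X}))^2 \mid \mathbf{X}, D\right]$,

\[
\begin{aligned}
E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]
&= E\left[Y^2 - 2Y\mu(\mathbf{X}) + \mu(\mathbf{X})^2 \,\middle|\, \mathbf{X}, D\right] \\
&= E\left(Y^2 \mid \mathbf{X}, D\right) - 2E\left[Y\mu(\mathbf{X}) \mid \mathbf{X}, D\right] + \mu(\mathbf{X})^2 \\
&= E\left(Y^2 \mid \mathbf{X}, D\right) - 2\mu(\mathbf{X})\,E(Y \mid \mathbf{X}, D) + \mu(\mathbf{X})^2 .
\end{aligned}
\]

Under the above study design setting, $Y$ is a binary secondary outcome. So let $\tilde{\mu}(\mathbf{X}, D) \equiv E\left(Y^2 \mid \mathbf{X}, D\right) = E(Y \mid \mathbf{X}, D)$. Then we have

\[
\begin{aligned}
E\left[(Y - \mu(\mathbf{X}))^2 \,\middle|\, \mathbf{X}, D\right]
&= \tilde{\mu}(\mathbf{X}, D) - 2\mu(\mathbf{X})\,\tilde{\mu}(\mathbf{X}, D) + \mu(\mathbf{X})^2 \\
&= \tilde{\mu}(\mathbf{X}, D)^2 - 2\mu(\mathbf{X})\,\tilde{\mu}(\mathbf{X}, D) + \mu(\mathbf{X})^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2 \\
&= \left[\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right]^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2 .
\end{aligned}
\]

So,

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{\left[\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right]^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2}{[P(W \mid c)]^2\,\pi(D)}\,\mathbf{X}\mathbf{X}^T\right].
\]

Denote $\tilde{\tilde{\mu}}(\mathbf{X}, D) \equiv \left(\tilde{\mu}(\mathbf{X}, D) - \mu(\mathbf{X})\right)^2 + \tilde{\mu}(\mathbf{X}, D) - \tilde{\mu}(\mathbf{X}, D)^2$, then

\[
\mathbf{B}(\boldsymbol{\beta}) = E\left[\frac{\tilde{\tilde{\mu}}(\mathbf{X}, D)}{[P(W \mid c)]^2\,\pi(D)}\,\mathbf{X}\mathbf{X}^T\right].
\]

C.3 Derivation of $\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T$
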

We denote $p(w, c)$ as the joint probability of $W$ and $C$. By the rule of functional expectation, we have

\[
\begin{aligned}
\mathbf{A}(\boldsymbol{\beta})
&= E\left[\frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c)\left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}\,\mathbf{X}\mathbf{X}^T\right]
= E\left[\frac{e^{\mathbf{X}^T\boldsymbol{\beta}}}{P(W \mid c)\left(1+e^{\mathbf{X}^T\boldsymbol{\beta}}\right)^2}
\begin{bmatrix} 1 & w & c \\ w & w^2 & wc \\ c & wc & c^2 \end{bmatrix}\right] \\
&= \frac{p(w=1, c=1)}{P(w=1 \mid c=1)}\,\frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1+e^{\beta_0+\beta_1+\beta_2}\right)^2}
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
+ \frac{p(w=1, c=0)}{P(w=1 \mid c=0)}\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}
\begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&\quad + \frac{p(w=0, c=1)}{P(w=0 \mid c=1)}\,\frac{e^{\beta_0+\beta_2}}{\left(1+e^{\beta_0+\beta_2}\right)^2}
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}
+ \frac{p(w=0, c=0)}{P(w=0 \mid c=0)}\,\frac{e^{\beta_0}}{\left(1+e^{\beta_0}\right)^2}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&= p(c=1)\,\frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1+e^{\beta_0+\beta_1+\beta_2}\right)^2}
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
+ p(c=0)\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}
\begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&\quad + p(c=1)\,\frac{e^{\beta_0+\beta_2}}{\left(1+e^{\beta_0+\beta_2}\right)^2}
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}
+ p(c=0)\,\frac{e^{\beta_0}}{\left(1+e^{\beta_0}\right)^2}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix},
\end{aligned}
\]

where

\[
\begin{aligned}
a &= p(c=1)\,\frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1+e^{\beta_0+\beta_1+\beta_2}\right)^2}
+ p(c=0)\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}
+ p(c=1)\,\frac{e^{\beta_0+\beta_2}}{\left(1+e^{\beta_0+\beta_2}\right)^2}
+ p(c=0)\,\frac{e^{\beta_0}}{\left(1+e^{\beta_0}\right)^2}, \\
b &= p(c=1)\,\frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1+e^{\beta_0+\beta_1+\beta_2}\right)^2}
+ p(c=0)\,\frac{e^{\beta_0+\beta_1}}{\left(1+e^{\beta_0+\beta_1}\right)^2}, \\
c &= p(c=1)\,\frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1+e^{\beta_0+\beta_1+\beta_2}\right)^2}
+ p(c=1)\,\frac{e^{\beta_0+\beta_2}}{\left(1+e^{\beta_0+\beta_2}\right)^2}, \\
d &= p(c=1)\,\frac{e^{\beta_0+\beta_1+\beta_2}}{\left(1+e^{\beta_0+\beta_1+\beta_2}\right)^2}.
\end{aligned}
\]


We define some simple notations. We define $p_{wcd}$ as the joint probability of $W$, $C$, and $D$. We define $p_{wc} = p(w \mid c)$. Using the rule of functional expectation on $\mathbf{B}(\boldsymbol{\beta})$, we have,

\[
\begin{aligned}
\mathbf{B}(\boldsymbol{\beta})
&= E\left[\frac{\tilde{\tilde{\mu}}(W, C, D)}{[P(W \mid c)]^2\,\pi(D)}\,\mathbf{X}\mathbf{X}^T\right]
= E\left[\frac{\tilde{\tilde{\mu}}(W, C, D)}{[P(W \mid c)]^2\,\pi(D)}
\begin{bmatrix} 1 & w & c \\ w & w^2 & wc \\ c & wc & c^2 \end{bmatrix}\right] \\
&= \left[\frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}\right]
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
+ \left[\frac{p_{101}\,\tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\,\pi(1)} + \frac{p_{100}\,\tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\,\pi(0)}\right]
\begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&\quad + \left[\frac{p_{011}\,\tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\,\pi(1)} + \frac{p_{010}\,\tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\,\pi(0)}\right]
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}
+ \left[\frac{p_{001}\,\tilde{\tilde{\mu}}(0,0,1)}{p_{00}^2\,\pi(1)} + \frac{p_{000}\,\tilde{\tilde{\mu}}(0,0,0)}{p_{00}^2\,\pi(0)}\right]
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} f & g & h \\ g & i & j \\ h & j & k \end{bmatrix},
\end{aligned}
\]

where

\[
\begin{aligned}
f &= \frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}
+ \frac{p_{101}\,\tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\,\pi(1)} + \frac{p_{100}\,\tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\,\pi(0)} \\
&\quad + \frac{p_{011}\,\tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\,\pi(1)} + \frac{p_{010}\,\tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\,\pi(0)}
+ \frac{p_{001}\,\tilde{\tilde{\mu}}(0,0,1)}{p_{00}^2\,\pi(1)} + \frac{p_{000}\,\tilde{\tilde{\mu}}(0,0,0)}{p_{00}^2\,\pi(0)}, \\
g &= \frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}
+ \frac{p_{101}\,\tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\,\pi(1)} + \frac{p_{100}\,\tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\,\pi(0)}, \\
h &= \frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}
+ \frac{p_{011}\,\tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\,\pi(1)} + \frac{p_{010}\,\tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\,\pi(0)}, \\
i &= \frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}
+ \frac{p_{101}\,\tilde{\tilde{\mu}}(1,0,1)}{p_{10}^2\,\pi(1)} + \frac{p_{100}\,\tilde{\tilde{\mu}}(1,0,0)}{p_{10}^2\,\pi(0)}, \\
j &= \frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}, \\
k &= \frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{p_{11}^2\,\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{p_{11}^2\,\pi(0)}
+ \frac{p_{011}\,\tilde{\tilde{\mu}}(0,1,1)}{p_{01}^2\,\pi(1)} + \frac{p_{010}\,\tilde{\tilde{\mu}}(0,1,0)}{p_{01}^2\,\pi(0)}.
\end{aligned}
\]

We can see that $i = g$ and $h = k$.

Thus, we can write

\[
\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T
= \begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix}^{-1}
\begin{bmatrix} f & g & h \\ g & g & j \\ h & j & h \end{bmatrix}
\left(\begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix}^{-1}\right)^{T}.
\]

Our target is to get the second element of the diagonal of $\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T$. Then, we can divide this element by $N$ to obtain the variance of the estimator of the exposure effect. And we minimize this variance to get the optimal sampling ratio.

To compute $\begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix}^{-1}$, we need to get $\det \mathbf{A}$ and $\mathrm{adj}(\mathbf{A})$. See details below.


The minor of matrix $\mathbf{A}$ is,

\[
\begin{bmatrix}
bc - d^2 & bc - cd & bd - bc \\
bc - cd & ac - c^2 & ad - bc \\
bd - bc & ad - bc & ab - b^2
\end{bmatrix}.
\]

The cofactor matrix of $\mathbf{A}$ is,

\[
\begin{bmatrix}
bc - d^2 & cd - bc & bd - bc \\
cd - bc & ac - c^2 & bc - ad \\
bd - bc & bc - ad & ab - b^2
\end{bmatrix}.
\]

By transposing the cofactor matrix, we obtain the adjoint matrix of $\mathbf{A}$. The adjoint matrix of $\mathbf{A}$ is the same as the cofactor matrix, because it is symmetric.

The determinant of $\mathbf{A}$ is,

\[
\det \mathbf{A} = abc + 2bcd - ad^2 - b^2c - c^2b.
\]

Thus,

\[
\begin{bmatrix} a & b & c \\ b & b & d \\ c & d & c \end{bmatrix}^{-1}
= \frac{1}{\det \mathbf{A}}\,\mathrm{adj}\,\mathbf{A}
= \frac{1}{abc + 2bcd - ad^2 - b^2c - c^2b}
\begin{bmatrix}
bc - d^2 & cd - bc & bd - bc \\
cd - bc & ac - c^2 & bc - ad \\
bd - bc & bc - ad & ab - b^2
\end{bmatrix}
= \begin{bmatrix} E & F & G \\ F & H & I \\ G & I & J \end{bmatrix},
\]

where,

\[
\begin{aligned}
E &= \frac{bc - d^2}{abc + 2bcd - ad^2 - b^2c - c^2b}, &
F &= \frac{cd - bc}{abc + 2bcd - ad^2 - b^2c - c^2b}, \\
G &= \frac{bd - bc}{abc + 2bcd - ad^2 - b^2c - c^2b}, &
H &= \frac{ac - c^2}{abc + 2bcd - ad^2 - b^2c - c^2b}, \\
I &= \frac{bc - ad}{abc + 2bcd - ad^2 - b^2c - c^2b}, &
J &= \frac{ab - b^2}{abc + 2bcd - ad^2 - b^2c - c^2b}.
\end{aligned}
\]

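The determinant and the six distinct entries of the inverse of the symmetric matrix with rows (a, b, c), (b, b, d), (c, d, c) can be verified with exact rational arithmetic. The sketch below is an illustrative check only. The function names and the test values are hypothetical, and the entry labels `E, F, G, H, I, J` follow the layout of the inverse as a symmetric matrix with rows (E, F, G), (F, H, I), (G, I, J).

```python
from fractions import Fraction


def det_and_inverse(a, b, c, d):
    """Return det(A) and the inverse entries (E, F, G, H, I, J) of
    A = [[a, b, c], [b, b, d], [c, d, c]]."""
    a, b, c, d = (Fraction(v) for v in (a, b, c, d))
    det = a * b * c + 2 * b * c * d - a * d * d - b * b * c - c * c * b
    E = (b * c - d * d) / det
    F = (c * d - b * c) / det
    G = (b * d - b * c) / det
    H = (a * c - c * c) / det
    I = (b * c - a * d) / det
    J = (a * b - b * b) / det
    return det, (E, F, G, H, I, J)


def times_inverse(a, b, c, d):
    """Multiply A by the claimed inverse; the result should be the identity."""
    _, (E, F, G, H, I, J) = det_and_inverse(a, b, c, d)
    A = [[a, b, c], [b, b, d], [c, d, c]]
    inv = [[E, F, G], [F, H, I], [G, I, J]]
    return [[sum(A[r][k] * inv[k][col] for k in range(3)) for col in range(3)]
            for r in range(3)]


if __name__ == "__main__":
    print(det_and_inverse(5, 3, 2, 1))
    print(times_inverse(5, 3, 2, 1))
```

Using `Fraction` avoids floating-point noise, so the product with the original matrix can be compared with the identity exactly.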

Further,

\[
\mathbf{A}^{-1}\mathbf{B}
= \begin{bmatrix} E & F & G \\ F & H & I \\ G & I & J \end{bmatrix}
\times
\begin{bmatrix} f & g & h \\ g & g & j \\ h & j & h \end{bmatrix}.
\]

So by simple computation, the second row of $\mathbf{A}^{-1}\mathbf{B}$ is

\[
\left[\,Ff + Hg + Ih,\quad Fg + Hg + Ij,\quad Fh + Hj + Ih\,\right].
\]

Since

\[
\left[\mathbf{A}^{-1}\right]^T = \begin{bmatrix} E & F & G \\ F & H & I \\ G & I & J \end{bmatrix},
\]

thus, the second element in the diagonal of $\mathbf{A}^{-1}\mathbf{B}\left[\mathbf{A}^{-1}\right]^T$ is

\[
F^2 f + 2FHg + 2FIh + H^2 g + 2HIj + I^2 h.
\]

So,

\[
\mathrm{Var}(\hat{\beta}_1) = \frac{F^2 f + 2FHg + 2FIh + H^2 g + 2HIj + I^2 h}{N},
\]

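The second diagonal element of the three-factor product is a quadratic form in the second row of the inverse. The sketch below compares the expanded six-term expression with a direct evaluation of that quadratic form. The function names and numerical values are arbitrary and hypothetical, and the middle matrix is assumed to have rows (f, g, h), (g, g, j), (h, j, h).

```python
def second_diagonal_expanded(F, H, I, f, g, h, j):
    """Expanded form: F^2 f + 2FHg + 2FIh + H^2 g + 2HIj + I^2 h."""
    return (F * F * f + 2 * F * H * g + 2 * F * I * h
            + H * H * g + 2 * H * I * j + I * I * h)


def second_diagonal_direct(F, H, I, f, g, h, j):
    """Quadratic form v B v^T with v = (F, H, I), the second row of the inverse."""
    v = (F, H, I)
    B = [[f, g, h], [g, g, j], [h, j, h]]
    Bv = [sum(B[r][k] * v[k] for k in range(3)) for r in range(3)]
    return sum(v[r] * Bv[r] for r in range(3))


if __name__ == "__main__":
    vals = (-1, 4, 1, 7, 2, 3, 1)
    print(second_diagonal_expanded(*vals), second_diagonal_direct(*vals))
```

Only the second row of the inverse enters the result, which is why the first and third rows never appear in the variance expression.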

where, with $\pi(1) = n_1/N_1$ and $\pi(0) = n_0/N_0$,

\[
\begin{aligned}
F^2 f
&= \sum_{w,c \in \{0,1\}} \frac{F^2}{p_{wc}^2}\left[\frac{p_{wc1}\,\tilde{\tilde{\mu}}(w,c,1)}{\pi(1)} + \frac{p_{wc0}\,\tilde{\tilde{\mu}}(w,c,0)}{\pi(0)}\right]
= \sum_{w,c \in \{0,1\}} \frac{F^2}{p_{wc}^2}\,\frac{p_{wc1}\,\tilde{\tilde{\mu}}(w,c,1)\,N_1 n_0 + p_{wc0}\,\tilde{\tilde{\mu}}(w,c,0)\,N_0 n_1}{n_1 n_0}, \\
2FHg
&= \sum_{c \in \{0,1\}} \frac{2FH}{p_{1c}^2}\left[\frac{p_{1c1}\,\tilde{\tilde{\mu}}(1,c,1)}{\pi(1)} + \frac{p_{1c0}\,\tilde{\tilde{\mu}}(1,c,0)}{\pi(0)}\right]
= \sum_{c \in \{0,1\}} \frac{2FH}{p_{1c}^2}\,\frac{p_{1c1}\,\tilde{\tilde{\mu}}(1,c,1)\,N_1 n_0 + p_{1c0}\,\tilde{\tilde{\mu}}(1,c,0)\,N_0 n_1}{n_1 n_0}, \\
2FIh
&= \sum_{w \in \{0,1\}} \frac{2FI}{p_{w1}^2}\left[\frac{p_{w11}\,\tilde{\tilde{\mu}}(w,1,1)}{\pi(1)} + \frac{p_{w10}\,\tilde{\tilde{\mu}}(w,1,0)}{\pi(0)}\right]
= \sum_{w \in \{0,1\}} \frac{2FI}{p_{w1}^2}\,\frac{p_{w11}\,\tilde{\tilde{\mu}}(w,1,1)\,N_1 n_0 + p_{w10}\,\tilde{\tilde{\mu}}(w,1,0)\,N_0 n_1}{n_1 n_0}, \\
H^2 g
&= \sum_{c \in \{0,1\}} \frac{H^2}{p_{1c}^2}\left[\frac{p_{1c1}\,\tilde{\tilde{\mu}}(1,c,1)}{\pi(1)} + \frac{p_{1c0}\,\tilde{\tilde{\mu}}(1,c,0)}{\pi(0)}\right]
= \sum_{c \in \{0,1\}} \frac{H^2}{p_{1c}^2}\,\frac{p_{1c1}\,\tilde{\tilde{\mu}}(1,c,1)\,N_1 n_0 + p_{1c0}\,\tilde{\tilde{\mu}}(1,c,0)\,N_0 n_1}{n_1 n_0}, \\
2HIj
&= \frac{2HI}{p_{11}^2}\left[\frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)}{\pi(1)} + \frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)}{\pi(0)}\right]
= \frac{2HI}{p_{11}^2}\,\frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)\,N_1 n_0 + p_{110}\,\tilde{\tilde{\mu}}(1,1,0)\,N_0 n_1}{n_1 n_0}, \\
I^2 h
&= \sum_{w \in \{0,1\}} \frac{I^2}{p_{w1}^2}\left[\frac{p_{w11}\,\tilde{\tilde{\mu}}(w,1,1)}{\pi(1)} + \frac{p_{w10}\,\tilde{\tilde{\mu}}(w,1,0)}{\pi(0)}\right]
= \sum_{w \in \{0,1\}} \frac{I^2}{p_{w1}^2}\,\frac{p_{w11}\,\tilde{\tilde{\mu}}(w,1,1)\,N_1 n_0 + p_{w10}\,\tilde{\tilde{\mu}}(w,1,0)\,N_0 n_1}{n_1 n_0}.
\end{aligned}
\]

After simplification, we can rewrite,

\[
\mathrm{Var}(\hat{\beta}_1) = \frac{\zeta_0 n_0 + \zeta_1 n_1}{n_0 n_1 N},
\]

where

\[
\begin{aligned}
\zeta_0 &= \left(F^2 + 2FH + 2FI + H^2 + 2HI + I^2\right)\frac{p_{111}\,\tilde{\tilde{\mu}}(1,1,1)\,N_1}{p_{11}^2}
+ \left(F^2 + 2FH + H^2\right)\frac{p_{101}\,\tilde{\tilde{\mu}}(1,0,1)\,N_1}{p_{10}^2} \\
&\quad + \left(F^2 + 2FI + I^2\right)\frac{p_{011}\,\tilde{\tilde{\mu}}(0,1,1)\,N_1}{p_{01}^2}
+ F^2\,\frac{p_{001}\,\tilde{\tilde{\mu}}(0,0,1)\,N_1}{p_{00}^2}, \\
\zeta_1 &= \left(F^2 + 2FH + 2FI + H^2 + 2HI + I^2\right)\frac{p_{110}\,\tilde{\tilde{\mu}}(1,1,0)\,N_0}{p_{11}^2}
+ \left(F^2 + 2FH + H^2\right)\frac{p_{100}\,\tilde{\tilde{\mu}}(1,0,0)\,N_0}{p_{10}^2} \\
&\quad + \left(F^2 + 2FI + I^2\right)\frac{p_{010}\,\tilde{\tilde{\mu}}(0,1,0)\,N_0}{p_{01}^2}
+ F^2\,\frac{p_{000}\,\tilde{\tilde{\mu}}(0,0,0)\,N_0}{p_{00}^2}.
\end{aligned}
\]

According to Proposition 1 in Chapter 2, the optimal sampling ratio of $n_0$ to $n_1$ is given by

\[
ROD = \sqrt{\frac{\zeta_1}{\zeta_0}}.
\]