ROBUST STATISTICAL METHODS FOR CAUSAL DISCOVERY IN ONE-SAMPLE
MENDELIAN RANDOMIZATION STUDIES

By

Ruxin Shi

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Statistics—Doctor of Philosophy

2025

ABSTRACT

Mendelian Randomization (MR) has become a cornerstone approach for inferring causal relation-

ships in epidemiological and genetic studies by leveraging genetic variants as instrumental variables

(IV). Despite its popularity, conventional MR analyses, particularly those based on two-stage least

squares (TSLS) and conducted within a single sample, face significant methodological challenges.

These include selection-induced winner’s curse and the pervasive problem of weak instruments and

invalid IVs, all of which can undermine the reliability and interpretability of causal effect estimates.

To address these limitations, this dissertation develops a unified and robust MR framework

through a sequence of methodological innovations. First, we introduce MR-SPLIT, a novel adap-

tive sample-splitting and cross-fitting procedure that effectively mitigates biases arising from IV

selection and weak instruments in one-sample MR settings. MR-SPLIT employs multiple sam-

ple splits to further enhance robustness, demonstrating superior performance in bias reduction,

type I error control, and statistical power compared to existing approaches, as validated in exten-

sive simulation studies and real-world data applications. Building on this foundation, we further

propose MR-SPLIT+, which integrates best subset selection to accommodate invalid IVs under a

relaxed plurality rule. MR-SPLIT+ substantially reduces estimation bias due to invalid instruments

while maintaining efficiency and robustness. Simulation results consistently demonstrate that MR-

SPLIT+ outperforms contemporary methods, and real-data analyses confirm its practical reliability

in complex genetic architectures. Recognizing that causal relationships are often bidirectional or

ambiguous, especially within gene expression networks and complex traits, we extend this frame-

work to BiMR-SPLIT+. This method is specifically designed to disentangle bidirectional causality

between pairs of traits, even when the underlying IV assumptions are partially violated. Extensive

simulation studies and application to Drosophila melanogaster data illustrate that BiMR-SPLIT+

not only recapitulates established biological mechanisms, but also identifies novel candidate genes

with potential regulatory roles. This bidirectional MR framework enables more accurate inference

of gene-trait relationships and has broad implications for precision medicine.

Collectively, this dissertation presents a cohesive suite of MR methodologies that systematically

address weak and invalid IVs, IV selection bias, and bidirectional causality. The resulting toolkit

substantially advances the reliability of causal inference in genetic epidemiology and lays the

groundwork for future exploration in complex causal networks as large-scale human datasets

continue to grow.

ACKNOWLEDGEMENTS

The five years of my doctoral journey have passed more quickly than I ever imagined. During this

time, I traveled to many places, met countless people, and was fortunate enough to solve a few

challenging problems. It was a radiant five years, filled with laughter and tears. I am grateful for

every moment when, after falling down, I found the strength to begin again.

First and foremost, I would like to express my sincere gratitude to my advisor, Professor Yuehua

Cui. Without his guidance and support, this dissertation would not have been possible. Whenever

my research reached an impasse, he always helped me discover a new path forward. It has been

one of the greatest privileges of my life to be his student.

I also want to thank my boyfriend, Zhouyu Shen, for his unwavering companionship over these

five years. In a foreign country, during the height of a global pandemic, he took care of me in

countless ways. We made the most of our time together, traveling to many wonderful places during

holidays. His presence made these years exceptionally bright.

I am deeply grateful to my parents, Meizhen Lei and Haifeng Shi, for their constant support.

As their only daughter, I thank them for backing my decision to pursue a PhD abroad, even though

it meant being far from home.

Lastly, I want to thank my two cats, Guoguo and Meimei, whose adorable presence has brought

endless comfort and joy to my daily life. I am also deeply grateful for the music of Chenyu Hua,

which has greatly enriched my spiritual world.

The past is now behind me—it once accompanied the years of my youth. The present and the

future lie ahead, and the world is still vast.

iv

LIST OF ABBREVIATIONS .

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

vi

TABLE OF CONTENTS

CHAPTER 1

1
BACKGROUND AND MOTIVATION . . . . . . . . . . . . . . . . . .
1
1.1 Mendelian Randomization: An Overview . . . . . . . . . . . . . . . . . . . .
2
1.2 Methods .
.
.
.
5
. . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Key Methodological Challenges
1.4 Structure of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.

.

.

.

.

CHAPTER 2

MR-SPLIT - ADDRESSING SELECTION AND WEAK
INSTRUMENT BIAS . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1
.
Introduction .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Statistical Method . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
.
2.3 Simulation Study .
2.4 Case Study: eGFR and aTRH, uACR and aTRH . . . . . . . . . . . . . . . . . 30
. 35
2.5 Discussion .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.

.

.

.

.

.

.

CHAPTER 3

.

.

.

Introduction .

MR-SPLIT+ — ROBUST CAUSAL INFERENCE WITH MANY
WEAK AND INVALID INSTRUMENTS . . . . . . . . . . . . . . . . . 39
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1
.
. . . . . . . . . . . . . . . 40
3.2 Motivation and Scientific Questions (UK Biobank)
. 43
3.3 Methods . .
. .
3.4 Simulation Study .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
. 67
3.6 Discussion .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. .

.

.

.

.

CHAPTER 4

Introduction .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

BIMR-SPLIT+ — BIDIRECTIONAL MR AND CAUSAL
MECHANISM .
.

. 70
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1
4.2 Model and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Simulation Studies
. 84
4.4 Application: Causal Pathway Between Gene Expression and Trait
. 90
4.5 Discussion .

. . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.

.

.

.

.

.

.

.

CHAPTER 5

CONCLUSION AND DISCUSSION . . . . . . . . . . . . . . . . . . . 92
. 92
5.1 Summary of Main Contributions . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Biological Insights
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Limitations of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

BIBLIOGRAPHY .

.

.

.

.

. .

. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

APPENDIX A

SUPPLEMENTARY MATERIALS . . . . . . . . . . . . . . . . . . . . 108

v

LIST OF ABBREVIATIONS

MR

IV

Mendelian Randomization

Instrument Variable

TSLS

Two Stage Least Square

2SLS

Two Stage Least Square

OLS

Ordinary Least Square

LD

linkage disequilibrium

LIML

Limited information maximum likelihood

IVW

Inverse-variance Weighted

CFI

SNP

SIS

CP

Cross-Fitted Instrument

Single Nucleotide Polymorphism

Sure Independence Screening

Coverage Probability

MAF

Minor Allele Frequency

MIO

Mixed Integer Optimization

FNR

FPR

False Negative Rate

False Positive Rate

DAG

Directed Acyclic Graph

vi

CHAPTER 1

BACKGROUND AND MOTIVATION

1.1 Mendelian Randomization: An Overview

Mendelian Randomization (Davey Smith and Hemani, 2014; Lawlor et al., 2008; Greenland,

2000; Davey Smith and Ebrahim, 2003) is a method using genes as instrument variables (IV) to make

causal inference between correlated variables, especially between behavioural, pharmacological or

physiological measures and disease.

It aims to detect the presence of causal effects and, when

present, to provide unbiased estimates of their magnitude. The use of instrumental variables for

detecting causal effect is first proposed in Econometric to deal with endogenous variable (Barro,

1997; Wainwright et al., 2005), which is synonymous with a dependent variable and correlates

with other factors within the system being studied, and has been discussed a lot during recent

years. Since there are some restrictions for the choice of instrumental variables, it is usually not

easy to find a perfect one (Donald and Newey, 2001; Baiocchi et al., 2014). However, unlike

other variables, people’s genotypes are determined only by their parents’ genotypes according to

Mendel’s law (Castle, 1903) and are generally unrelated to those confounding factors that distort

the interpretations of findings from observational epidemiology. Furthermore, disease processes

do not alter germline genotype and therefore associations between genotype and disease outcomes

cannot be affected by reverse causality. Finally, genetic variants that are related to a modifiable

exposure will generally be related to it throughout life from birth to adulthood and therefore their

use in causal inference can also avoid attenuation by errors (regression dilution bias).

This innovative utilization of SNPs as IVs rests on a triad of fundamental assumptions, which

are indispensable for ensuring the validity of MR results (Burgess et al., 2017; Davey Smith and

Ebrahim, 2003). Suppose now we want to make casual inference between the exposure 𝑋 and the

outcome 𝑌 using the instrument variable 𝐺. Conventional instrumental variable analysis requires

that the instruments must meet three conditions:

A1 The IV 𝐺 is associated with the exposure of interest 𝑋.

A2 𝐺 is independent of the confounding factors 𝑈 that confound the association of 𝑋 and the

1

outcome 𝑌 .

A3 𝐺 is independent of outcome 𝑌 given 𝑋 and the confounding factors 𝑈.

1.2 Methods

In this section, we first introduce several widely used methods based on the three core assump-

tions.

1.2.1

Individual-level Data MR

1.2.1.1 Wald method

The Wald estimator (Wald, 1940), or the ratio estimator is the simplest of estimating the causal

effect of the exposure 𝑋 on the outcome 𝑌 , and it uses a single instrumental variable 𝐺. If we

regress 𝑋 and 𝑌 separately on the IV 𝐺,

𝑋 = 𝛽1𝐺 + 𝜀1,

𝑌 = 𝛽2𝐺 + 𝜀2,

and get the estimated value ˆ𝛽1, ˆ𝛽2, then the ratio estimate of the causal effect is

ˆ𝛽 =

ˆ𝛽2
ˆ𝛽1

(1.1)

Intuitively, we can think of the ratio method as saying that the change in the outcome 𝑌 caused by

a unit increase in the exposure 𝑋 is equal to the change in the 𝑋 caused by a unit increase in the

IV 𝐺, scaled by the change in the 𝑋 caused by a unit increase in the 𝐺. To build a confidence

interval for the estimate, we may use a normal approximation. The asymptotic variance of the ratio

estimate (Thomas et al., 2007) is:

ˆ𝜎2

𝛽 =

(cid:1)

𝑣𝑎𝑟 (cid:0) ˆ𝛽2
ˆ𝛽2
1

ˆ𝛽2
2

+

(cid:1)

𝑣𝑎𝑟 (cid:0) ˆ𝛽1
ˆ𝛽4
1

−

(cid:1)

2 ˆ𝛽2 cov (cid:0) ˆ𝛽1, ˆ𝛽2
ˆ𝛽3
1

(1.2)

And the term cov (cid:0) ˆ𝛽1, ˆ𝛽2

(cid:1) will vanish if we estimate ˆ𝛽1 and ˆ𝛽2 from different samples. However,

asymptotic normal approximations for the IV estimate may result in overly narrow confidence

intervals, especially if the sample size is not large or the IV is weak. This is because IV estimates

are not normally distributed. Alternatively, we can use Fieller’s theorem (Fieller, 1954) or bootstrap

method (Efron, 1992).

2

1.2.1.2 Two stage least square

Another popular used method in IV method is the two stage least square (2SLS or TSLS)

regression. Suppose we have 𝑛 individuals, and have the observed data {𝑔𝑖, 𝑥𝑖, 𝑦𝑖; 𝑖 = 1, ..., 𝑛},
𝑔𝑖 ∈ R𝑝, 𝑥𝑖 ∈ R, 𝑦𝑖 ∈ R. Let𝑋 = (𝑥1, ..., 𝑥𝑛)𝑇 , 𝑌 = (𝑦1, ..., 𝑦𝑛)𝑇 , 𝐺 = (𝑔1, ..., 𝑔𝑛)𝑇 ∈ R𝑛×𝑝.

In the first stage, the exposure 𝑋 is regressed on the instrumental variables 𝐺 to obtain the fitted

values of 𝑋, denoted by ˆ𝑋:

𝑋 = 𝐺 𝛽1 + 𝜀1,

ˆ𝛽1 = (𝐺′𝐺)−1𝐺′𝑋,

ˆ𝑋 = 𝐺 ˆ𝛽1 = 𝐻 𝑋,

𝐻 = 𝐺 (𝐺′𝐺)−1𝐺′.

In the second stage, the outcome 𝑌 is regressed on the fitted exposure ˆ𝑋:

𝑌 = ˆ𝑋 𝛽2 + 𝜀2,

ˆ𝛽2 = ( ˆ𝑋′ ˆ𝑋)−1 ˆ𝑋′𝑌 = (𝑋′𝐻 𝑋)−1𝑋′𝐻𝑌 .

(1.3)

(1.4)

(1.5)

(1.6)

Here, ˆ𝛽2 estimates the effect of 𝑋 on 𝑌 that is mediated solely through the component of 𝑋

explained by the instruments 𝐺. Therefore, it is necessary to make sure the assumption (A1) is

true. We can also build a confidence interval based on the variance of 𝛽2.

𝑣𝑎𝑟 (𝛽2) = 𝜎2(𝑋′𝐻 𝑋)−1

(1.7)

ˆ𝜎2 = (𝑌 − ˆ𝑋 𝛽2)′(𝑌 − ˆ𝑋 𝛽2)/(𝑛 − 1)

1.2.1.3 Two sample two stage

Generally, it is hard to get a data set including all variables we need. On the contrast, collecting

two separate samples, in which the first includes the IV 𝐺 and exposure 𝑋 and the second includes

𝐺 and outcome 𝑌 , is much easier. In this situation, we can use a method called two sample two

stage.

Suppose 𝐺 ∈ R𝑛×𝑝, 𝑋 ∈ R𝑛×𝑞. When 𝑝 = 𝑞, we would have 𝐺′𝑋 reversible. With exact

identification, the causal effect we get in Section 1.2.1.2,

ˆ𝛽𝐼𝑉 = (𝐺′𝑋)−1𝐺′𝑌

3

Suppose now we only have two samples {𝑌1, 𝐺1} and {𝑋2, 𝐺2}, where 𝑋2 ∈ R𝑛2×(𝑘+𝑝) and 𝐺2 ∈
R𝑛2×(𝑘+𝑞). When 𝑝 = 𝑞, Angrist and Krueger (Angrist and Krueger, 1999) proposed a consistent

estimation of causal effect to use, which is

ˆ𝛽𝑇 𝑆𝐼𝑉 = (𝐺′
2

𝑋2/𝑛2)−1(𝐺′
𝑌1/𝑛1)
1

(1.8)

Another statistics, valid also when 𝑝 ≠ 𝑞, named the two-sample two-stage least squares(TS2SLS)

estimator is:

ˆ𝛽𝑇 𝑆2𝑆𝐿𝑆 = (𝐺′
2

𝑋2/𝑛2)−1𝐶 (𝐺′
𝑌1/𝑛1),
1

(1.9)

where 𝐶 = (𝐺′
2

𝐺2/𝑛2)(𝐺′
1

𝐺1/𝑛1)−1. Inoue and Solon (2010) have proved that ˆ𝛽𝑇 𝑆2𝑆𝐿𝑆 is supe-

rior than ˆ𝛽𝑇 𝑆𝐼𝑉 . Because the implicit correction for differences between the two samples in the

distribution of 𝐺, matrix 𝐶, yields a gain in asymptotic efficiency.

The standard error of ˆ𝛽𝑇 𝑆2𝑆𝐿𝑆 is

ˆ𝜀2( ˆ𝑋2

′ ˆ𝑋2)−1

(cid:32)

1 +

𝑛1
𝑛2

ˆ𝛽′
𝑇 𝑆2𝑆𝐿𝑆 ˆΣ𝜀1
ˆ𝜀2

ˆ𝛽𝑇 𝑆2𝑆𝐿𝑆

(cid:33)

,

where ˆ𝜀2 is the sample mean squared residual from the second-stage regression, and ˆΣ𝜀1 is a

consistent estimate of the covariance matrix for the first-stage disturbances.

This method can be naturally developed to be used when there is only summary data available.

Further development of the method can be seen in Bowden et al. (2019), Zhao et al. (2020) and

Minelli et al. (2021). It is also important to be aware of several potential limitations when using

summary-level data, see Hartwig et al. (2021, 2016) for further discussion.

1.2.2 Summary-level Data MR

Individual level data on study participants are not always available due to issues of practicality

and confidentiality of data-sharing. Burgess et al. (2013) outlines two approaches for estimating

causal effects using summarized data. Assume that summary statistics are available for multiple

genetic variants, each of which satisfies the IV assumptions. The models for the 𝑘-th variant are

specified as follows:

𝑋 = 𝛽𝑋 𝑘𝐺 𝑘 + 𝜀𝑋 𝑘 ,

Var(𝛽𝑋 𝑘 ) = 𝜎2

𝑋 𝑘 ,

(1.10)

4

𝑌 = 𝛽𝑌 𝑘𝐺 𝑘 + 𝜀𝑌 𝑘 ,

Var(𝛽𝑌 𝑘 ) = 𝜎2

𝑌 𝑘 ,

(1.11)

for 𝑘 = 1, . . . , 𝐾, where 𝑋𝑘 , 𝑌𝑘 , 𝜎2

𝑋 𝑘 , and 𝜎2

𝑌 𝑘 are assumed to be known.

1.2.2.1

Inverse-variance weighted combination of ratio estimates

For each genetic variant 𝑘, the ratio estimate of the casual effect of 𝑋 on 𝑌 is 𝛽𝑌 𝑘 /𝛽𝑋 𝑘 , and

the standard error of the ratio estimate can be approximated using 𝜎𝑌 𝑘 /𝛽𝑋 𝑘 . The inverse-variance

weighted (IVW) estimete of the causal effect combines the ratio estimates using each variant in a

fixed-effect mata analysis model:

ˆ𝛽𝐼𝑉𝑊 =

(cid:205)𝑘 𝛽𝑋 𝑘 𝛽𝑌 𝑘 𝜎−2
𝑌 𝑘
𝑋 𝑘 𝜎−2
𝑌 𝑘

(cid:205)𝑘 𝛽2
√︄

1
(cid:205)𝑘 𝛽2
𝑋 𝑘 𝜎−2
𝑌 𝑘

𝑠𝑒( ˆ𝛽𝐼𝑉𝑊 ) =

(1.12)

(1.13)

1.2.2.2 Likelihood-based method

We can also construct a model by assuming a linear relationship between the risk factor and

outcome and a bivariate normal distribution for the genetic association estimates:

𝛽𝑥𝑘

𝜉𝑘

𝜎2
𝑋 𝑘

𝜌𝜎𝑋 𝑘 𝜎𝑌 𝑘

, (cid:169)
(cid:173)
(cid:173)
(cid:171)
Here the causal effect of 𝑋 on 𝑌 is assumed to be the same 𝛽 for all genetic variants 𝑘 and can be

𝜌𝜎𝑋 𝑘 𝜎𝑌 𝑘

(cid:170)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:172)

(1.14)

∼ N2

𝜎2
𝑌 𝑘

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

𝛽𝜉𝑘

𝛽𝑌 𝑘

estimated by direct maximization of the likelihood or by Bayesian methods.

Simulation results show that the power of estimates come from likelihood-based method is

greater than that from the 2SLS method. And the inverse-level data analysis gives similar point

estimates to an individual-level data analysis and slightly improved power over the likelihood-based

method, but slightly too narrow confidence intervals.

However, there are always situations violating the three assumptions (A1)-(A3), then the meth-

ods above will no longer suitable and may produce large bias.

1.3 Key Methodological Challenges

Violations (Glymour et al., 2012) of the triadic assumptions underpinning MR analysis can

produce biased and unreliable estimates. Specifically, the instrumental variables are called weak if

5

they violate A1 (Staiger and Stock, 1994; Bound et al., 1995). And violation of A2 and A3 will

lead to invalid instruments (Bowden et al., 2015; Kolesár et al., 2015; Hemani et al., 2018). In

the following sections, we will review existing solutions that have been proposed to address these

challenges.

1.3.1 Weak Instruments and Selection Bias

In MR analysis, two main frameworks are commonly used:

two-sample MR analysis with

GWAS summary statistics and one-sample MR analysis with individual-level data. While two-

sample MR analysis has gained popularity due to easy access to public datasets, it comes with a

couple of limitations. Firstly, it relies on marginal estimates of SNP statistics, which can be biased

when not accounting for linkage disequilibrium (LD) properly. Secondly, it lacks the flexibility

to model other causal mechanisms, such as nonlinear causal effects. As a result, there continues

to be a significant interest in the advancement of statistical methods for one-sample MR analysis.

The most popular method used in one-sample MR analysis is the two-stage least squares (2SLS)

approach (Angrist and Krueger, 1991), which is relatively straightforward to implement and can

yield consistent estimates of causal effects. However, the 2SLS estimate can be biased in the

presence of weak instruments (Bound et al., 1995). The bias is in the direction of the confounded

association and can cause inflated false positive rates, particularly when more than one IV is

included in the analysis (Burgess et al., 2019). To date, weak instrument bias still remains one of

the significant concerns in one-sample MR analysis (Burgess et al., 2019).

A potential solution to mitigate the impact of weak IVs is to opt for a two-sample MR analysis.

While this approach might mitigate some biases, it does not eliminate them entirely. Specifically,

bias due to weak instruments in two-sample MR tends to be directed towards the null (Angrist

and Krueger, 1995). Limited information maximum likelihood (LIML) method (Anderson and

Rubin, 1949; Anderson, 2005) was introduced as an alternative to 2SLS when dealing with weak

instruments. Burgess et al. (Burgess et al., 2011) showed that LIML could provide a less biased

estimate compared to 2SLS in the presence of weak instruments, but at the expense of incurring

larger variance. Nevertheless, LIML is still subject to weak instrument problems and its finite

6

sample performance can be poor. Angrist et al. (Angrist et al., 1999) proposed two jackknife

instrumental variables estimators (JIVE) as alternatives to 2SLS and LIML to reduce the bias with

many weak instruments. However, Sören and Matz (Blomquist and Dahlberg, 1999) showed that

neither LIML nor the JIVE estimators perform uniformly better than the 2SLS does in terms of

root mean square error.

In one-sample MR analysis, when the same dataset is used for both IV selection and causal

effect estimation, the “winner’s curse" or IV selection bias emerges as another notable concern in

addition to the weak IV bias issue (Burgess et al., 2019; Jiang et al., 2023). This bias could lead to

biased causal effect estimates and hence inflate false positive rates under the 2SLS IV regression

framework. This is evident in Appendix A.2.3, where it is shown that using the same data (the

whole sample) for both IV selection and causal effect estimation, both LIML and 2SLS methods

amplify bias compared to using half data for selection and the other half for causal effect estimation.

Thus, it is critical to address the IV selection bias issue in one-sample MR analysis.

1.3.2 Pleiotropy and invalid IVs

Pleiotropy, the phenomenon where a single genetic variant influences multiple traits, is common

in biology and genetics. In biological systems, pleiotropy is widespread and reflects the complex

interplay between genes and traits. However, in the context of Mendelian Randomization, pleiotropy

can undermine the validity of genetic variants used as IVs. Specifically, when a genetic variant

affects the outcome not only through the exposure of interest but also via other independent

pathways, it becomes an invalid IV and violates the core assumptions required for valid causal

inference.

Pleiotropy in MR studies is typically categorized as either vertical or horizontal. Vertical

pleiotropy occurs when a genetic variant influences an exposure, which in turn affects the outcome,

and this aligns with the standard MR framework and does not violate IV assumptions. In contrast,

horizontal pleiotropy arises when a genetic variant independently affects both the exposure and the

outcome, leading to violations of the exclusion restriction assumption.

In practical MR analyses, identifying and removing invalid IVs is particularly challenging.

7

Researchers often rely on measures of instrument strength, such as p-values, to select relevant IVs.

However, there is currently no statistically guaranteed procedure to reliably detect and exclude all

invalid instruments. Moreover, when IV selection is performed using individual-level data from a

single sample, selection bias may be introduced, further complicating causal effect estimation.

In recent years, many methods has been proposed to deal with invalid IVs. For summary

statistics, under the InSIDE (Instrument Strength Independent of Direct Effect) assumption (Kolesár

et al., 2015), Bowden et al. (Bowden et al., 2015) introduced MR-Egger regression, which identifies

and corrects for horizontal pleiotropy using the intercept of a regression model. Then in 2016, under

the majority rule, Bowden et al. (2016) proposed a weighted median estimator to provide consistent

causal effect estimates even when up to 50% of the instrumental variables are invalid. Additionally,

Verbanck et al. (2018) developed MR-PRESSO, which detects and corrects for horizontal pleiotropy

by identifying and removing outlier instrumental variables and has been shown to perform best

when horizontal pleiotropy affects less than 50% of the instruments. In 2021, Wang and Kang

(2022) extended the Anderson-Rubin test (Anderson and Rubin, 1949), integrating the Kleibergen

test (Kleibergen, 2002) and conditional likelihood ratio test to accommodate two-sample summary-

data MR, improving robustness against weak and invalid IVs. Patel et al. (2024) introduced the

Focused Instrument Selection method, which optimizes causal effect estimation by selecting invalid

IVs with minimal direct effects under the local-to-zero assumption.

When we are available to individual-level data, there are some more methods can be used.

In 2016, Kang et al. (2016) introduced the sisVIVE method, aimed at estimating causal effects

without requiring complete knowledge of the validity of instrumental variables. The method is

applicable under the majority rule and employs a penalized ℓ1 estimation approach. However,

despite its innovative framework, the method’s accuracy in identifying valid instruments remains

limited, often leading to estimates that still exhibit substantial bias. Furthermore, a key limitation

of this method is its inability to perform inference as it provides an estimate of the causal effect

but does not yield a standard deviation.

In 2021, Windmeijer et al. (2021) proposed the CIIV

method, which relaxes the assumptions on IVs required by sisVIVE. The CIIV method only

8

requires the plurality rule, as introduced by Guo et al. (2018), which requires the valid IVs are the

largest group having the same effects on the outcome. However, this method relies on a strong

association between the IVs and the explanatory variables, meaning that when the IVs are weak, the

accuracy of instrument selection still needs improvement. Apfel and Liang (2024) also proposed a

method for selecting valid IVs using Agglomerative Hierarchical Clustering (AHC), which performs

comparably to the CIIV in terms of selection and shows superior performance when dealing with

multiple exposures. Ye et al. (2024) also proposed the GENIUS-MAWII method, which aims to

provide robust Mendelian randomization inference in the presence of pervasive pleiotropy and a

large number of weak instrumental variables. This approach leverages the heteroscedasticity of the

exposure with respect to the instruments (and covariates) for identification; thus, if the required

heteroscedasticity is absent, the method is not identifiable. Additionally, although GENIUS-

MAWII can handle widespread pleiotropy, it relies on the key assumption that the effects of the

instruments on the exposure and outcome do not interact with unmeasured confounders, which

may not always hold in practice. Lin et al. (2024) proposed a method called WIT, which provides

a detailed discussion on how to identify model parameters in the presence of weak IVs and

employs the MCP penalty to select invalid IVs. Compared to previous methods, WIT significantly

improves the accuracy of identifying invalid IVs. However, it still faces challenges in reliably

constructing a trustworthy confidence interval. The estimates obtained using WIT are notably

unstable, particularly in the presence of weak IVs, where numerous outliers significantly deviating

from the true values may arise, as demonstrated in our subsequent simulations.

1.3.3 Nonlinearity and Multiple Exposures

What we have discussed so far are all assume the relationship between the exposure 𝑋 and

the outcome 𝑌 is linear. But sometimes observational data would suggest a non-linear association

between the exposure 𝑋 and the outcome 𝑌 , for example, alcohol consumption is consistently

reported as having a U-shaped association with cardiovascular events (Marmot and Brunner, 1991).

So it is necessary to extend MR methods to this kind of situation. Here we divide the nonlinear

9

cases into three categories:

𝑦 = 𝑓 (𝑥) 𝛽 + 𝜀

𝑦 = 𝑥 𝛽(𝜃) + 𝜀

𝑦 = ℎ(𝑥, 𝛽) + 𝜀

(1.15)

(1.16)

(1.17)

since it is hard to give a suitable explanation of the parameter 𝛽 in reality in Model 1.16 and 1.17,

what we consider here is majority the first situation, Model 1.15. Besides the methods what we list

here, Burgess et al. (2014) also have provided an approach and done many applications in MR about

the nonlinear exposure–outcome relationship. Singh et al. (2019) proposed kernel instrumental

variable regression (KIV), a nonparametric generalization of 2SLS, and proved in experiments,

KIV outperforms four kinds of methods for nonparametric IV regression.

1.3.3.1 Control function methods

The nonlinear of Model 1.15 is showed in the transformation of 𝑥. A classic method used in

econometrics to address endogeneity is the control function approach (Wooldridge, 2015). This

method is closely related to 2SLS and yields the same solution in linear models. But when the

true model is nonlinear, the control function approach utilizes more information than 2SLS and

can improve the precision of the estimates, albeit with some loss of robustness. For a detailed

comparison of these two methods, see Guo and Small (2016).

In addition, Sulc et al. (2022)

conducted extensive simulations using the control function approach in Mendelian Randomization

and demonstrated its strong performance.

For simplicity, we assume 𝑓 (𝑥) in Model 1.15 is a polynomial function. Suppose the real model

is

𝑋 = 𝐺 𝛽1 + 𝜀𝑥

𝑌 =

𝑘
∑︁

𝑗=0

𝛽2 𝑗 𝑋 𝑗 + 𝜀𝑦

Since the error term 𝜀𝑦 and 𝜀𝑥 can be correlated due to confounders, we split 𝜀𝑦:

𝜀𝑦 =

𝑙
∑︁

𝑗=0

𝛼 𝑗 𝜀 𝑗

𝑥 + 𝜏𝑦

10

(1.18)

(1.19)

(1.20)

Then we have:

𝑘
∑︁

𝑌 =

𝛽2 𝑗 𝑋 𝑗 +

𝑙
∑︁

𝛼 𝑗 𝜀 𝑗

𝑥 + 𝜏𝑦 =

𝑘
∑︁

𝛽2 𝑗 𝑋 𝑗 +

𝑙
∑︁

𝛼 𝑗 (𝑋 − 𝐺 𝛽1) 𝑗 + 𝜏𝑦

(1.21)

𝑗=0
So first we could regress 𝑋 on 𝐺, to get the estimates ˆ𝜀𝑥, then we regress 𝑌 on the transformation

𝑗=0

𝑗=0

𝑗=0

of 𝑋 and ˆ𝜀𝑥 to get the causal effects.

1.3.3.2 A more generalized method

A more generalized method for Model 1.15 is introduced by Li (2019). Unlike the typical

assumptions applied in the linear setting, namely:

• Relevance: Cov(𝐺, 𝑋 | 𝑈) ≠ 0,

• Exclusion restriction: Cov(𝐺, 𝑈) = 0,

. Li proposes an alternative set of assumptions for the nonlinear model:

• Relevance:

• Exclusion restriction:

loss(𝑋, 𝑓 (𝐺)) ≤ 𝜖,

min
𝑓 ∈F

loss(𝑉, 𝑣𝛼 (𝐺)) ≥ loss(𝑉, 0) − 𝜖 ′,

(1.22)

(1.23)

where 𝑉 = 𝑌 −𝑔𝛽 (𝑋), 𝑔𝛽 (𝑋) ∈ arg min𝑔∈G loss(𝑌 , 𝑔(𝑋)), and 𝑣𝛼 (𝐺) ∈ arg min𝑣∈V loss(𝑉, 𝑣(𝐺)).

So in his methodology, stage one is to find:

𝜔 ∈ arg min

𝜔

loss(𝑋, 𝑓𝜔 (𝐺))

This determines ˆ𝑋 = 𝑓 ˆ𝜔 (𝐺). Then in stage two, find

such that loss(𝑉, 𝑣𝛼 (𝐺)) ≥ loss(𝑉, 0) − 𝜀, where 𝑉 = 𝑌 − 𝑔𝛽 ( ˆ𝑋).

𝛽 ∈ arg min

𝛽

loss(𝑌 , 𝑔𝛽 ( ˆ𝑋)),

(1.24)

(1.25)

However, his discussion is limited to the case where the nonlinear model is a generalized additive

model (GAM), that is, ˆ𝑦 = 𝑏0 + 𝑏1 𝑓1(𝑋) + ... + 𝑏 𝑝 𝑓𝑝 (𝑋), where 𝑋 denotes the input variable and 𝑦 is

the target variable. Therefore, future research could explore broader classes of nonlinear functions

building upon his framework.

11

1.4 Structure of the Dissertation

In summary, we have reviewed the fundamental principles and core assumptions of MR methods,

as well as the key challenges currently facing the field. In Chapter 2, we introduce the MR-SPLIT

method, a framework based on 2SLS designed to address selection bias and the weak instrument

problem in one-sample MR analyses. Building on this, Chapter 3 presents MR-SPLIT+, an en-

hanced approach that substantially reduces bias arising from invalid IVs and achieves significantly

higher accuracy in identifying invalid IVs compared to existing methods. In Chapter 4, we fur-

ther extend MR-SPLIT+ to accommodate more complex scenarios, proposing the BiMR-SPLIT+

method for bidirectional MR studies, which offers additional improvements over MR-SPLIT+.

Finally, Chapter 5 summarizes our main contributions and discusses potential directions for future

research.

12

CHAPTER 2

MR-SPLIT - ADDRESSING SELECTION AND WEAK INSTRUMENT BIAS

2.1

Introduction

In one-sample MR analysis, IVs are typically chosen based on a p-value threshold. However, the

usage of a p-value threshold criterion in the selection of IVs is somewhat arbitrary and lacks robust

justification. The 2SLS approach relies on the fitted values from the first stage for estimating causal

effects in the second stage, highlighting the critical role of prediction accuracy and thus questioning

the robustness of models that depend solely on p-value thresholds for validation. Given the typically

vast dimensionality of SNP data, the use of penalized shrinkage methods can effectively mitigate

the winner’s curse effect in one-sample MR analysis. This strategy prioritizes prediction accuracy

and hence provides a potentially more dependable and robust framework for causal inference.

Denault et al. (Denault et al., 2022) introduced a method called ‘Cross-Fitting for Mendelian

Randomization’ (CFMR) to handle the weak instrument issue in one-sample MR analysis, which

consolidates information from multiple IVs into a single IV, termed the Cross-Fitted Instrument

(CFI). CFMR randomly splits a sample into 𝐾 subgroups {𝐼1, · · · , 𝐼𝐾 } and define the complement
of the partition 𝐼𝑘 as 𝐼 𝑐

𝑘 }, it first selects 𝛾𝑘 independent
variants {𝑍1,𝑘 , · · · , 𝑍𝛾𝑘,𝑘 }, and then defines a CFI of the exposure 𝑋 on 𝐼𝑘 , which is the prediction
of 𝑋 on 𝐼𝑘 trained using data with indexes in 𝐼 𝑐
𝑘 .

𝑘 = {1, · · · , 𝑁 ∉ 𝐼𝑘 }. In each subset {𝐼 𝑐

This predicted value can be viewed as a polygenic risk score in risk prediction analysis. Then,

the new CFI is used as the IV to fit the 2SLS model for further causal inference. CFMR consolidates

all the IVs into one single IV (CFI), thus it produces less biased results. However, CFMR does

not completely solve the selection bias issue. Taking 𝐾 = 10 as an example, CFMR employs 9

folds of data for selecting IVs and applies the estimated effects to construct the composite IV in

a separate fold of data. By iterating this process 10 times, the composite IVs across any pair of

folds are constructed with 80% of data in common. Thus, the composite IVs are not constructed

using completely independent data. This could lead to a new manifestation of the winner’s curse

problem. Furthermore, by relying on one CFI as the only IV to represent the collective information

13

of all IVs, there could be potential information loss which further leads to variance inflation and

consequently reduced power (as shown in our theorem and simulation studies).

In general, the selection of IVs involves a bias and variance trade-off when estimating the causal

effect. Using more IVs tends to introduce a larger bias but smaller variance, whereas employing too

few IVs results in a smaller bias but larger variance. Pierce et al. (Pierce et al., 2010) did intensive

simulations to evaluate the power and IV strength requirements for MR analyses based on 2SLS.

They employed four strategies to combine information across IVs and evaluated the consequences of

these strategies on power and overall IV strength, as measured by the first-stage F statistic in 2SLS.

The results suggest that categorizing IVs into major and weak ones and then consolidating the weak

ones into a single IV based on the knowledge of the genetic architecture underlying the exposure,

can mitigate the issue of weak IVs. However, the study identifies a gap in current methodologies: it

does not provide a clear approach for differentiating between major and weak IVs, nor does it offer

a strategy for combining weak IVs in the context of one-sample MR analysis. This highlights an

area for further research and methodology development in the field.

In this chapter, we propose an adaptive Sample-sPLitting method with cross-fitting InstrumenTs

(MR-SPLIT) to address the bias issue of IV selection and weak instruments. This approach can

effectively reduce the number of weak IVs without the loss of much information, thereby enhancing

the performance of causal inference in MR studies by improving the power of causal inference.

Our method has two advantages over the existing ones: 1) It adaptively selects major and weak

IVs, subsequently creating a composite IV from the weaker ones. We theoretically proved that the

variance of the MR-SPLIT estimate is smaller than that of the CFMR estimate under the condition

of one sample split. Simulation results also show that MR-SPLIT can always achieve higher power

and lower RMSE than CFMR; and 2) A multi-sample splitting strategy is further employed to

enhance the robustness of estimation and testing. Extensive simulation studies were conducted to

assess the performance of our method in comparison to its counterparts, including 2SLS, LIML,

and CFMR. Our method offers an efficient and powerful solution for one-sample MR analysis

by addressing two primary sources of bias: IV selection bias and the bias associated with weak

14

instruments.

2.2 Statistical Method

Assume the following structural equation model,

𝑦𝑖 = 𝑥𝑖 𝛽 + 𝜀𝑦𝑖

𝑥𝑖 = 𝐺𝑖·𝛼 + 𝜀𝑥𝑖

where 𝑥𝑖 is the exposure, and 𝑦𝑖 denotes the outcome of the 𝑖th individual. 𝐺𝑖· is a 𝑝-dim vector
of SNP IVs, where 𝐺𝑖· = {𝐺𝑖1, 𝐺𝑖1, . . . , 𝐺𝑖 𝑝} ∈ R𝑝. The error term is denoted by 𝜀𝑖 = (𝜀𝑥𝑖, 𝜀𝑦𝑖) ∼
𝑁 (0, 𝜎2𝑅) where 𝑅12(= 𝜌) is the correlation due to confounding. 𝛽 is the interested causal

effect which needs to be estimated. Suppose we have 𝑁 independent individuals, and denote
𝑌 = (𝑦1, . . . , 𝑦𝑁 )′ ∈ R𝑁×1, 𝑋 = (𝑥1, . . . , 𝑥𝑁 )′ ∈ R𝑁×1, 𝐺 = {𝐺1, . . . , 𝐺 𝑝} ∈ R𝑁×𝑝, where the 𝑗th
IV denoted as 𝐺 𝑗 = (𝐺1 𝑗 , . . . , 𝐺 𝑁 𝑗 )′, 𝑗 = 1, . . . , 𝑝, then we have

𝑌 = 𝑋 𝛽 + 𝜀𝑦

𝑋 = 𝐺𝛼 + 𝜀𝑥

(2.1)

2.2.1 Cross-fitting Instruments with Sample Split

Given the observed data {𝑋, 𝑌 , 𝐺}, we first need to select a valid IV subset from the existing

SNP pool, where the number of SNPs can be much larger than the sample size. To reduce potential

biases and enhance the accuracy of estimates in MR analysis, one can use one sample for the

selection of appropriate IVs and a separate, independent sample for the 2SLS estimation. By doing

so, over-fitting and biases stemming from sample-specific peculiarities, such as the double dipping

issue, can be minimized, leading to more robust and credible causal effect estimates. When only

one sample is available, one simple idea is to randomly split the data into two equal subsets {𝐼1, 𝐼2},

each containing roughly 𝑁/2 samples. Then, one can use one subset (say 𝐼1) to select the IVs and
use the other (say 𝐼2 = 𝐼 𝑐

1) to get the estimates of 𝛽.

For the IV selection, if no prior information about specific SNPs is available, researchers

usually regress the exposure variable on each SNP, and then select those SNPs that yield marginal

p-values smaller than a preset threshold (e.g., 5 × 10−8) followed by LD pruning or LD clumping.

15

However, such a threshold is quite ad hoc and sometimes can be too stringent, prompting the need

for relaxation, as advocated in some studies (Panagiotou et al., 2011). Such strictness can lead

to the exclusion of valid IVs and the loss of valuable information. Conversely, if the threshold

is too lenient, it may result in the selection of an excessive number of SNP IVs, potentially

introducing challenges associated with weak IVs (Burgess et al., 2011). We suggest using some

high-dimensional screening methods such as sure independence screening (SIS) (Fan and Lv, 2008)

to first reduce the SNP dimension from ultra-high to high dimension. Methods like SIS have the

sure screening property in which they ensure that, as the sample size increases, the probability of

including all relevant variables becomes close to one. After this step, shrinkage methods such as

LASSO or adaptive LASSO (Tibshirani, 1996; Zou, 2006) can be employed to select and estimate

non-zero SNP effects. Other penalized methods with different penalty functions such as MCP or

SCAD can also be applied.

After the SNP selection, directly employing these IVs in 2SLS might lead to the issue of

weak instruments, potentially resulting in biased estimate. To mitigate this, we group the IVs

into two groups, major IVs and weak IVs, based on their association strength with the exposure.

Conventionally, the validity of IVs is assessed using 𝐹 statistics. A common benchmark used in

econometrics and statistical literature suggests that an F-statistic exceeding 10 is indicative of strong

instruments, particularly when assessing the strength of a collective set of IVs (Stock and Yogo,

2002; Shea, 1997). However, the determination of the weakness of an individual IV lacks a widely

recognized standard. In this analysis, we employed partial F-statistics with different thresholds as

criteria for selecting major IVs. Generally, the partial F statistic is defined as:

𝐹 =

(cid:0)RSS𝑟 − RSS 𝑓 (cid:1) /𝑝
(cid:0)RSS 𝑓 (cid:1) /(𝑁 − 𝑘 − 1)

where RSS𝑟 and RSS 𝑓 are the residual sums of squares for the reduced and full model, respectively;

𝑁 is the total number of observations; 𝑘 and 𝑝 are the numbers of variables in the reduced and

full model, respectively. This statistic measures how much the addition of 𝑝 variables improves

the model, compared to the increase in complexity these variables bring.

It is a good statistic

for calculating the strength of each IV and is consistent with the commonly used F-statistic for

16

evaluating IV strength. In our model, 𝑝 = 1 because we calculate the partial F statistics for each

IV. The thresholds were set at partial F-statistics greater than 10, 30, and 50. We conducted

a simulation study to compare these three statistics for the purpose of identifying weak IVs, as

detailed in section 2.3.1. Based on the simulation results, it is recommended to use a threshold of

partial 𝐹 > 30 to define major IVs.

2.2.2 Composite IV for Weak Instruments

Following the separation of major and weak IVs, we propose consolidating the weak ones into

a composite IV. Then, the major IVs and the composite IV are included in the 2SLS model to infer

the causal effect (see Fig 2.1 for the flowchart of MR-SPLIT). By only consolidating the weak IVs

into a single instrument, we can substantially reduce the number of IVs in the model while retaining

most of the information they carry.

Denote the selected index of weak IVs as 𝑆𝑘,𝑊 , 𝑘 = 1, 2, where |𝑆𝑘,𝑊 | = 𝑝2 represents the

selected numbers of weak IV using data in 𝐼𝑘 . Taking sample 𝐼1 as an example, let the estimated

effects for the weak IVs on the exposure be denoted as ˆ𝛼1,𝑊 = { ˆ𝛼1, 𝑗 ; 𝑗 ∈ 𝑆1,𝑊 } for data in 𝐼1. Here,

the subscript 1 indicates that this parameter is estimated from subsample 𝐼1, and the subscript 𝑊

signifies that it corresponds to the direct effect of weak IVs on 𝑋 from Eq (2.1). The new composite

IV constructed in sample 𝐼2, ˆ𝐺2,𝑊 , is then defined as

where

ˆ𝐺2,𝑊 =

∑︁

𝑗 ∈𝑆1,𝑊

𝜔 𝑗 𝐺2, 𝑗

𝜔 𝑗 = 𝑠𝑖𝑔𝑛( ˆ𝛼1, 𝑗 )

| ˆ𝛼1, 𝑗 |
(cid:205) 𝑗 ∈𝑆1,𝑊 | ˆ𝛼1, 𝑗 |

(2.2)

(2.3)

In other word, we use weak IVs selected from sample 𝐼1 to construct the new composite IV in

sample 𝐼2. Then, we can use the major IVs and the new composite IV, i.e., {𝐺2,𝑀, ˆ𝐺2,𝑊 }, to get the

cross-fitted exposure in 𝐼2. Here, the subscript 𝑀 represents that 𝐺2,𝑀 is identified as the major IV,

while the subscript 𝑊 signifies that ˆ𝐺2,𝑊 is estimated from the weak IV. The subscript 2 indicates

that these values are obtained from subsample 𝐼2.

17

Figure 2.1 The flow chart of MR-SPLIT with one random split.

Note: The original data is randomly split into two parts indexed by 𝐼1 and 𝐼2. We use data in 𝐼1
to select major and weak IVs, then form the composite IV for the weak ones in 𝐼2 with the weight
𝜔 being calculated based on Eq (2.3), then get the fitted ˆ𝑋2 in 𝐼2. Similarly, we use data in 𝐼2 to
select major and weak IVs, then use data in 𝐼1 to form the composite IV and get the fitted values
ˆ𝑋1. Next, we combine ˆ𝑋1 and ˆ𝑋2 to get ˆ𝑋 = ( ˆ𝑋𝑇
2 )𝑇 and fit the second stage regression model
1
𝑌 ∼ ˆ𝑋 to get the causal estimate ( ˆ𝛽) and its p-value.

, ˆ𝑋𝑇

To clarify logic, for each 𝑘 = 1, 2 we use the subset 𝐼𝑘 to identify the SNP IVs and obtain the

estimated effects ˆ𝛼 for the selected IVs. Then, we categorize them into two groups, major IVs and

weak IVs, using the partial 𝐹-statistic criterion defined earlier. We then combine the weak IVs in

𝐼 𝑐
𝑘 using the estimated weights from 𝐼𝑘 . This approach enables us to avoid overfitting by selecting

IVs and estimating the causal effect using different samples.

18

𝐼1=𝑋1,𝑌1,𝐺1Following the same procedure and use 𝐼2 to select IVs and use 𝐼1 to get ෠𝑋1𝑝1 major IVs indexed by 𝑆1,𝑀𝑝2 weak IVs (𝑆1,𝑊) with estimated effects ො𝛼1,𝑊={ො𝛼1,𝑗;𝑗∈𝑆1,𝑊}Composite IV for weak IVs: ෠𝐺2,𝑊=෍𝑗∈𝑆1,𝑊𝜔𝑗𝐺2,𝑗In 𝐼2, regress 𝑋2~{𝐺2,𝑀,෠𝐺2,𝑊} to get ෠𝑋2 ෠𝑋=෠𝑋1෠𝑋2Originaldata: 𝑋,𝑌,𝐺,𝐺∈𝑅𝑁×𝑝 𝐼2=𝑋2,𝑌2,𝐺2Randomly splitUse 𝐼1 to select major and weak IVsMajor IV: 𝐺2,𝑀={𝐺2,𝑗;𝑗∈𝑆1,𝑀}𝑌~෠𝑋መ𝛽2.2.3 Estimating the causal effect

Once we get the IVs {𝐺 𝑀, ˆ𝐺𝑊 } in each 𝐼𝑘 , 𝑘 = 1, 2, we can then perform the first stage of the

2SLS regression on these IVs to get the cross-fitted exposures ˆ𝑋𝑘 which are then aggregated, i.e.,

(cid:170)
(cid:174)
(cid:174)
(cid:172)
The causal effect can be estimated by regressing 𝑌 on ˆ𝑋 using the whole sample, which is given by

ˆ𝑋 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

∈ R𝑁×1.

ˆ𝑋2

ˆ𝑋1

ˆ𝛽 = ( ˆ𝑋′ ˆ𝑋)−1 ˆ𝑋′𝑌

(2.4)

Cross-fitting allows for the utilization of the entire dataset in estimating causal effects, thereby

circumventing the winner’s curse problem that arises when the same data is employed for both IV

selection and causal effect estimation.

Remark 1: Both MR-SPLIT and CFMR implement a cross-fitting idea with sample splitting,

but the analysis is fundamentally different. MR-SPLIT combines the cross-fitted exposures for fur-

ther causal inference, while CFMR combines cross-fitting instruments for further causal inference.

CFMR first calculates the composite IV in the 𝑘th split sample (denoted as ˜𝐺 𝑘,𝑛𝑘×1), where 𝑛𝑘

denotes the sample size in the 𝑘th split sample, then combines these composite IVs to form the

final composite IV (denoted as ˜𝐺 𝑁×1 by stacking all ˜𝐺 𝑘 ), and finally uses the full data (𝑋, ˜𝐺, 𝑌 ) to

perform 2SLS analysis for causal inference. Each composite IV ˜𝐺 𝑘,𝑛𝑘×1 can be regarded as a poly-

genic risk score based on the 𝑘th split sample. As the dimensions of major and weak IVs identified

from different sample splits are different, CFMR is infeasible to separate the two components and

incorporate them in downstream causal inference.

Remark 2: CFMR uses data in 𝐼 𝑐

𝑘 to select IVs, then calculates the composite IV based on

data in 𝐼𝑘 . Typically, a 10-fold split suffices. On the other hand, MR-SPLIT benefits from a 2-fold

sample splitting. It uses data in 𝐼1 to select and separate major and weak IVs, then forms the cross-

fitting composite IV in 𝐼2. After that, it calculates the cross-fitted exposures based on the major

IV(s) and the cross-fitting composite IV for further causal inference. More data for IV selection

leads to less data to fit the cross-fitted exposure and vice versa. To balance the two components, a

2-fold sample split is recommended.

19

Remark 3: By combining the cross-fitted exposures, Theorem 1 shows that MR-SPLIT pro-

duces estimates with a variance no larger than that of CFMR if both approaches implement a 2-fold

sample split. The proof is given in This finding extends to scenarios involving a 𝑘 (> 2)-fold sample

split for CFMR. Although providing theoretical proof for this result poses a challenge, we have

demonstrated its validity through simulations.

Theorem 1. Let ˆ𝛽𝐶𝐹 𝑀 𝑅 and ˆ𝛽𝑀 𝑅−𝑆𝑃𝐿𝐼𝑇 be the 2SLS estimates obtained respectively by the CFMR

and MR-SPLIT method with a 2-fold random split. Then, ˆ𝛽𝑀 𝑅−𝑆𝑃𝐿𝐼𝑇 is more efficient than ˆ𝛽𝐶𝐹 𝑀 𝑅

in the sense that

𝑣𝑎𝑟 ( ˆ𝛽𝑀 𝑅−𝑆𝑃𝐿𝐼𝑇 ) ≤ 𝑣𝑎𝑟 ( ˆ𝛽𝐶𝐹 𝑀 𝑅)

The proof of Theorem 1 is given in Appendix A.2.

2.2.4 Multiple Sample Splitting and Robustness

Given the inherent uncertainty in single-sample splitting, particularly in cases of limited sample

size, we propose a multiple-splitting strategy to improve the robustness of the approach. We

randomly split data (into two halves) 𝐿 times. For each random split, the same estimation and

testing procedure as described before are conducted. Let 𝑝𝑣𝑎𝑙𝑙 denote the p-value at the 𝑙th

random split. There are different ways to aggregate these 𝐿 p-values. One approach involves

employing the aggregation method for p-values proposed by Wasserman and Roeder (Wasserman

and Roeder, 2009; Dezeure et al., 2015). However, this method has proven to be overly conservative

in our simulations. Another simple way is to use the Cauchy combination rule for correlated p-

values (Liu and Xie, 2019), which is similar to the minimum p-value method but does not require

an intensive resampling procedure to assess the null distribution of the minimum p-value. Given

its computational efficiency, we adopt the Cauchy combination rule to aggregate p-values obtained

from multiple sample splitting. Following (Liu and Xie, 2019), the test statistics is defined as:

𝑇𝑐𝑎𝑢𝑐ℎ𝑦 =

𝐿
∑︁

𝑙=1

𝜔𝑙 tan ((0.5 − 𝑝𝑣𝑎𝑙𝑙)𝜋)

20

where the weights 𝜔𝑙 are non-negative and (cid:205)𝐿
𝑙=1
weight 𝜔𝑙 can be simply chosen as 1/𝐿. The p-value of 𝑇𝑐𝑎𝑢𝑐ℎ𝑦 can be simply approximated by

𝜔𝑙 = 1. If no further information is available, the

p-value =

1
2

− arctan(𝑇𝑐𝑎𝑢𝑐ℎ𝑦)/𝜋

(2.5)

In essence, augmenting the number of sample splits improves result robustness. However, this

enhancement comes with the trade-off of requiring increased computational resources. To pro-

vide general guidance on the number of splitting times, we conducted a simulation study (see

section 2.3.4). The results suggest that conducting multiple splits about 50-60 times is sufficient

to achieve a robust outcome in terms of controlling type I errors and maintaining stable statistical

power. In the case of a large sample size and strong SNP heritability, the splitting time can be

dramatically reduced (see the simulation results).

2.2.5 Algorithmic Details

The detailed algorithm of the MR-SPLIT is given below:

1. For each 𝑙 = 1, · · · , 𝐿 random split, repeat the following steps:

a) Split the sample into two equal subsets {𝐼1, 𝐼2}, i.e., {1, · · · , 𝑁 } = 𝐼1 ∪ 𝐼2 with 𝐼1 ∩ 𝐼2 = ∅
, 𝐼 𝑐
and |𝐼1| = [𝑁/2] and |𝐼2| = 𝑁 − [𝑁/2], and denote the complementary sets as {𝐼 𝑐
2
1

}

accordingly.

b) For each 𝑘 = 1, 2, we use 𝐼 𝑐

𝑘 to select IVs and get the estimated effect size for each IV.

Then, categorize the selected IVs into two distinct groups, major IV(s) and weak IVs,

based on the partial 𝐹 > 30 criterion.

c) In each subset 𝐼𝑘 , combine the weak IVs using the effect size estimated from 𝐼 𝑐

𝑘 following

Eq (2.2). Then regress the exposure variable 𝑋 on the new IVs (major IV(s) + composite

IV) to get the fitted value ˆ𝑋𝑘 .

ˆ𝑋1

d) Denote ˆ𝑋 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)
effect estimate ˆ𝛽𝑙 and the p-value 𝑝𝑣𝑎𝑙𝑙.

ˆ𝑋2

, and do the second stage regression of 𝑌 on ˆ𝑋 to get the causal

21

2. Calculate the Cauchy combination statistics 𝑇cauchy = 1
𝐿

𝑙=1 tan((0.5 − 𝑝𝑣𝑎𝑙𝑙)𝜋), and the
2 − arctan(𝑇𝑐𝑎𝑢𝑐ℎ𝑦)/𝜋. The final aggregated causal effect

(cid:205)𝐿

aggregated p-value as 𝑝𝑣𝑎𝑙 = 1

estimate can be calculated as ˆ𝛽 = 1
𝐿

(cid:205)𝐿

𝑙=1

ˆ𝛽𝑙.

2.3 Simulation Study

We conducted simulations to assess the performance of our method and provided guidance on

the identification of major IVs and selecting an efficient number of sample splitting. Subsequently,

we compared the proposed MR-SPLIT with the existing approaches, including 2SLS, LIML, and

CFMR, across various settings.

2.3.1 Major IV Identification

We applied 3 criteria, 𝐹 > 10, 𝐹 > 30, and 𝐹 > 50, to distinguish the major and weak

IVs under various settings. We randomly generated 300 independent SNPs each with MAF=0.3,

and assumed only 5 SNP had effects on the exposure. The effects of these SNPs were set to

be 𝛽 = (0.4, 0.4, 0.1, 0.05, 0.05)𝜔0, where 𝜔0 was chosen to ensure that these SNPs account

for ℎ2 = {0.15, 0.30, 0.50} of the variation in exposure (ℎ2 can be interpreted as the exposure

heritability). The error term was assumed to follow the standard normal distribution with mean 0

and variance 1. The rest 295 SNPs were assumed to be noises with no effect on the exposure (i.e.,

𝛽 = 0). Then, we followed model (2.1) to simulate the exposure. In this setting, the initial two

SNPs may be regarded as the major IVs, whereas the remaining three are categorized as weaker

ones. However, this differentiation can also be contingent on the signal-to-noise ratio, meaning

that the first two SNPs may not be deemed as the major ones when ℎ2 is low, say ℎ2 = 0.15. And

when the IVs are strong enough, say ℎ2 = 0.5, the three weaker IVs may be regarded as strong

IVs. After applying SIS screening and LASSO estimation on these 300 SNPs, we then used these

three criteria to distinguish the major and weak IVs. Part of the results can be seen in Table 2.1.

As we mentioned before, there were 295 noise SNPs in total. It is possible that some of these

noise SNPs may be incorrectly identified as major IVs. We also summarized these results in the

last column of Table 2.1. More detailed information can be found in Table A.1 and Fig A.1 in

Appendix A.2. Our analysis indicates that employing a partial 𝐹 > 10 threshold to define major

22

IVs is excessively lenient, leading to misidentifying noises as major IVs, particularly in scenarios

with small sample sizes (e.g., 𝑁 = 500). Conversely, a threshold of 𝐹 > 50 proves overly stringent,

failing to recognize SNP 1 and 2 as major IVs in conditions characterized by low sample sizes

and heritability. A threshold of 𝐹 > 30 emerges as a balanced criterion for defining major IVs,

effectively mitigating the aforementioned issues. Thus, we propose to use a partial 𝐹 > 30 threshold

in the selection of major IVs.

Table 2.1 Mean numbers of being identified as major IV using different criteria in 1,000
simulations.

ℎ2

𝑁

500

0.15

1000

2000

500

0.3

1000

2000

500

0.5

1000

2000

Criteria SNP1 SNP2 SNP3 SNP4 SNP5 Noises (×295)*
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50
F>10
F>30
F>50

0.15
0
0
0.35
0
0
0.75
0.1
0
0.25
0
0
0.75
0.2
0.05
1
0.6
0.1
0.8
0.35
0
1
0.8
0.4
1
1
1

1.25
0
0
0.65
0
0
0.6
0
0
1.15
0
0
0.65
0
0
0.6
0
0
1.8
0
0
0.6
0
0
0.4
0
0

0
0
0
0.1
0
0
0.35
0
0
0.25
0
0
0.45
0
0
0.75
0.15
0
0.35
0
0
1
0.5
0.05
1
0.9
0.4

0.55
0.05
0
0.95
0.35
0.15
1
1
0.75
0.95
0.5
0.25
1
1
0.8
1
1
1
1
1
0.9
1
1
1
1
1
1

0.1
0
0
0
0
0
0.55
0
0
0.1
0
0
0.3
0
0
0.9
0.1
0.05
0.4
0.05
0
0.95
0.15
0
1
0.9
0.3

0.5
0.05
0
0.95
0.35
0
1
0.95
0.75
0.8
0.55
0.25
1
1
0.7
1
1
1
1
1
0.9
1
1
1
1
1
1

*The total number of noise SNPs incorrectly identified as major IVs out of the 295 noise SNPs.

23

2.3.2 Comparison with 2SLS and LIML

We compared the proposed MR-SPLIT with the widely-used 2SLS approach and the LIML

method which is particularly designed to address the weak instruments bias issue. We simulated

300 SNPs independently and randomly selected 5 SNPs as the IVs to generate the exposure variable

𝑋. We set ℎ2 = {0.15, 0.3, 0.5} which respectively represent weak, moderate, and strong overall

effect, and 𝜌 = (0.1, 0.2) where 𝜌 = cor(𝜀𝑥𝑖, 𝜀𝑦𝑖) controls the unknown confounding effect.

We set the sample size (𝑁) to 1000. To ensure a fair comparison with 2SLS and LIML, we

only split the sample once (i.e., no multiple splitting). We then used one subset for selecting the

IVs and incorporated the other subset with the selected IVs for estimation. Both 2SLS and LIML

followed the same process but did not differentiate between major and weak IVs for further causal

inference. To check the impact of selection bias for 2SLS and LIML, we also did the analysis using

the whole dataset for both IV selection and causal effect estimation. The simulation settings are

the same as what we previously described. The only difference is that we do not split the sample

and use the whole sample to do the IV selection and estimation. Results for this analysis were

given in Appendix A.2. The respective boxplots, illustrating the distribution of estimations across

1000 simulation iterations, are provided in Figs A.2, A.3 and A.4 in Appendix A.2. It is evident

from the results that using the entire sample for both IV selection and effect estimation results in

estimates with smaller variance but larger bias, leading to a significantly higher type I error rate.

In the following, we only show the results based on sample splitting.

Table 2.2 presents a comparative analysis of the estimation accuracy among MR-SPLIT, LIML,

and 2SLS. It shows that MR-SPLIT can provide estimates with a significantly small bias. In contrast,

the estimates from 2SLS exhibit large bias, especially under weak IV and substantial confounding

effects (e.g., 𝜌 = 0.2). In some of the cases, LIML gives a smaller bias than MR-SPLIT does, but

it has consistently larger variance than MR-SPLIT, leading to a conservative coverage probability

(CP) compared to MR-SPLIT. The variance of 2SLS is uniformly smaller than the other two

methods. However, given its large bias, it has the most poor coverage probability among the three

methods. On the other hand, MR-SPLIT shows consistently good coverage probabilities under

24

different scenarios, showcasing its robust performance under different conditions.

Table 2.2 Simulation comparison between M* (MR-SPLIT), LIML and 2SLS.

ℎ2

𝜌

𝛽

0.15

0.30

0.50

0.1

0.2

0.1

0.2

0.1

0.2

-0.08
0.08
-0.08
0.08
-0.08
0.08
-0.08
0.08
-0.08
0.08
-0.08
0.08

Bias(|𝛽 − ˆ𝛽| × 100)
M* LIML 2SLS
4.86
0.17
0.18
5.05
0.23
0.82
9.45
1.19
0.17
9.44
1.01
0.31
2.4
0.11
0.02
2.94
0.59
0.13
4.77
0.67
0.43
5.03
0.31
0.47
1.12
0.08
0.32
0.82
0.22
0.11
2.08
0.02
0.14
2.07
0.03
0.2

CP*=coverage probability

Est. SE
LIML
0.1776
0.1884
0.1737
0.1770
0.0844
0.0831
0.0840
0.0840
0.0482
0.0474
0.0513
0.0469

2SLS
0.0788
0.0795
0.0782
0.0801
0.0605
0.0600
0.0612
0.0621
0.0430
0.0424
0.0457
0.0423

M*
0.1252
0.1253
0.1230
0.1320
0.0520
0.0515
0.0501
0.0524
0.0329
0.0328
0.0335
0.0318

CP*
LIML 2SLS
0.895
0.827
0.886
0.832
0.740
0.843
0.732
0.825
0.929
0.895
0.919
0.898
0.837
0.904
0.845
0.905
0.945
0.943
0.944
0.931
0.902
0.909
0.924
0.938

M*
0.955
0.957
0.949
0.952
0.958
0.959
0.946
0.947
0.938
0.948
0.942
0.954

Fig 2.2 shows the results of the type I error of the three methods. We can observe that MR-

SPLIT can effectively control Type I errors, even in the presence of strong unknown confounding.

As depicted in Fig 2.2, both LIML and 2SLS methods exhibit much poorer performance than MR-

SPLIT. Notably, 2SLS suffers from poor type I error control when the confounding effect is strong

(i.e., 𝜌 = 0.2), leading to inflated error rates. LIML has high false positive rates when the SNP

effects are weak (i.e., weak instruments with low ℎ2), especially under 𝜌 = 0.2. As the SNP effects

increase, its performance improves; however, it can only effectively control type I errors when

the instrumental variables are strong, as demonstrated in the scenario with ℎ2 = 0.5. Conversely,

MR-SPLIT consistently demonstrates robust type I error control under all conditions, even under

ℎ2 = 0.15 and 𝜌 = 0.2, where 2SLS and LIML exhibit their poorest performance. The inflated type

I error rates lead to inflated statistical power for 2SLS and LIML. Consequently, comparing power

between MR-SPLIT and these two methods may not be a fair comparison; thus, we did not show

the detailed power comparison here. Nevertheless, in the scenario where ℎ2 = 0.5 and 𝜌 = 0.1,

MR-SPLIT still attains the highest power, reaching 0.683 compared to 0.545 for 2SLS and 0.419

for LIML.

25

Figure 2.2 Type I error comparison between MR-SPLIT, 2SLS and LIML. The horizontal dashed
line denotes the 0.05 level.

2.3.3 Comparison with CFMR

We compared our method with CFMR under different simulation scenarios. To ensure a fair

comparison with CFMR, we applied 10-fold CFMR as recommended in the CFMR work, and

2-fold MR-SPLIT with 50 random sample splits. We applied the same procedure for selecting IVs.

While CFMR combined all the selected IVs into a single composite one, our method differentiated

between major and weak IVs using the partial 𝐹 > 30 criterion and only weak IVs were combined

into a composite one. We also followed the simulation settings described in the CFMR work to

ensure a fair comparison. We generated a set of 300 SNPs, and the minor allele frequency is fixed

as 0.3 for all the SNPs. We randomly chose 5 SNP IVs to generate the exposure variable with the

model 𝑋 = (cid:205)5

𝑗=1

𝐺 𝑗 𝛼 𝑗 + 𝜀𝑥, and the outcome with the model 𝑌 = 𝑋 𝛽 + 𝜀𝑦, where

𝜀𝑥

1

0.16

(cid:170)
(cid:174)
(cid:174)
(cid:172)
We set two scenarios to comprehensively compare MR-SPLIT and CFMR:

∼ 𝑁 (cid:169)
(cid:173)
(cid:173)
(cid:171)

0, 5 (cid:169)
(cid:173)
(cid:173)
(cid:171)

0.16

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

𝜀𝑦

1

• Scenario I: The effect sizes of the 5 SNPs are different, i.e., 𝛼 = (0.4, 0.4, 0.1, 0.05, 0.05).

Potentially, SNPs with the effect of 0.4 can be regarded as major IVs and the rest can be

considered as weak ones. This also depends on the SNP heritability level ℎ2.

• Scenario II: The effect sizes of the 5 SNPs are the same, i.e., 𝛼 = (0.2, 0.2, 0.2, 0.2, 0.2). In

26

this case, differentiating between major and weak IVs can be challenging, presenting a less

favorable condition for our method.

In each scenario, we compared the two methods in different aspects by changing the sample size (𝑁 =

{1000, 3000, 5000}), variation in the exposure explained by the SNP IVs (ℎ2 = {0.15, 0.2, 0.3})

and the exposure’s effect size for 𝛽.

Fig 2.3 depicts the type I error control of the two methods in scenario I and scenario II. In

general, the control of type I error for the two methods is highly comparable across different settings

characterized by distinct sample sizes and SNP heritability levels. Though the type I error is a little

inflated for MR-SPLIT under a small sample size (𝑁 = 1000), particularly in scenario II, it controls

the type I error well as the sample size increases.

Figure 2.3 Comparison of type I error between MR-SPLIT and CFMR in Scenario I (top) and II
(bottom). The horizontal dashed line denotes the 0.05 level.

Fig 2.4 shows the results of the power for the two methods in these two scenarios. Regardless of

the settings, MR-SPLIT consistently exhibits higher power than CFMR. This discrepancy becomes

especially noticeable when the IVs are relatively weak (i.e., ℎ2 = 0.15).

In Appendix A.2, we also presented the estimation performance of both methods when 𝛽 = 0

and 0.08 in Figs A.5-A.10. The results reveal minimal difference in the causal effect estimation

27

Figure 2.4 Power comparison between MR-SPLIT and CFMR in Scenario I (top) and II (bottom).

between the two methods. However, a noticeable distinction is the smaller standard error observed

in MR-SPLIT across nearly all the scenarios, resulting in a smaller Root Mean Square Error (RMSE)

(see Fig A.11 in Appendix A.2) and higher statistical power when compared to CFMR. This aligns

well with the theoretical finding in Theorem 1, though the result is proved under a 2-fold sample

split. The findings further underscore the advantages of MR-SPLIT.

In summary, MR-SPLIT consistently demonstrates robust type I error control when compared

to 2SLS and LIML across various simulation settings.

In comparison to CFMR, MR-SPLIT

exhibits superior performance, yielding smaller RMSE and higher statistical power. Even under a

less favorable condition for MR-SPLIT, the type I error can be controlled when the sample size is

reasonably large. The simulation results further corroborate our theoretical finding, consistently

showing that MR-SPLIT results in smaller standard errors for causal effect estimation compared to

CFMR, which leads to higher statistical power when testing for the causal effect.

2.3.4 Multiple Data Splitting

Intuitively, more data splitting should yield more robust results, which, however, would entail

higher computational resource usage. We implemented our methods under different splitting times,

different sample size 𝑁 and different ℎ2 values, to check if we can find an efficient number of

splitting. In our simulations, the true causal effect of the exposure on the outcome is set to equal 0.2

28

(𝛽 = 0.2), and the sample size ranges from 500 to 2000 (𝑁 = 500, 1000, 2000). We did simulations

in Scenario I as described in section 2.3.3. Fig 2.5 demonstrates how the type I error fluctuates

with an increasing number of splits. To obtain a smoother estimate of the type I error rate, we

repeated the simulation 5,000 times Under a small sample size, the type I error rates get stable as

the number of sample splits increases. Though the type I error increases as the sample split times

increase under small sample sizes, this increase is considered acceptable, particularly in light of

the associated boost in power (see Fig 2.5), which is especially pertinent for smaller sample sizes.

Figure 2.5 Type I error under different sample sizes: 𝑁 = 500(left), 1000(middle), 2000(right), and
under different ℎ2: 0.15 (top) and 0.2 (bottom).

Fig 2.6 shows the empirical power under different sample sizes and ℎ2. The type I error and

power results when ℎ2 = 0.3 can be found in Figs A.12 and A.13 in Appendix A.2. When the

sample size is small (𝑁 = 500) and the IVs are relatively weak (ℎ2 = 0.15), the power gets stabilized

after 50 splits. As the sample size increases, there is a decrease in the need for the number of

sample splits to maintain stable power. This indicates that in practical data analysis, it is possible

to estimate the exposure heritability based on the selected SNP IVs, and thereafter determine the

appropriate number of sample splits. In any case, opting for 50 sample splits represents a highly

29

conservative option.

Figure 2.6 Empirical power under different sample sizes: 𝑁 = 500(left), 1000(middle), 2000(right),
and under different ℎ2: 0.15 (top) and 0.2 (bottom).

2.4 Case Study: eGFR and aTRH, uACR and aTRH

We demonstrated the effectiveness of our method by applying it to the Chronic Renal Insuffi-

ciency Cohort (CRIC) dataset, to understand the progression of chronic kidney disease (CKD).

CKD is evaluated utilizing two straightforward tests: a blood test known as the estimated

glomerular filtration rate (eGFR) and a urine test, the urine albumin-creatinine ratio (uACR). Both

eGFR and uACR measure kidney function, with low eGFR and high uACR values indicating

impaired kidney function. In this application, we are interested in evaluating the causal relationship

between CKD and apparent Treatment-Resistant Hypertension (aTRH). aTRH is a condition where

a patient’s blood pressure remains above target levels despite using three different classes of

antihypertensive drugs at optimal doses, typically including a diuretic. The definition of aTRH

also extends to cases where four or more medications are required to effectively control blood

pressure (Judd and Calhoun, 2014). In a recent two-sample MR analysis using summary statistics,

Yu et al. (2020) identified the causal effect of higher kidney function (measured by eGFR estimated

30

from serum creatinine) on lower systolic blood pressure. To date, the causal relationship between

CKD and aTRH and the causal link between them remains to be established (Chen et al., 2019;

Thomas et al., 2016; Kaboré et al., 2017; Kabore et al., 2016). To this end, we utilized eGFR and

uACR as the exposure variable and aTRH as the outcome, applying MR-SPLIT for our analysis.

For comparative purposes, we also employed CFMR and 2SLS on the same dataset. Given that the

outcome variable (aTRH) is binary (0/1) in nature, the LIML method is not suitable in this analysis.

2.4.1 Genetic Data Processing

The original data have 3,541 samples containing 970,342 SNPs. Our initial step involved

removing SNPs with missing rate larger than 10%, resulting in 886,384 SNPs. After excluding

SNPs with minor allele frequency (MAF) lower than 0.05, 762,664 SNPs were left. The next phase

entailed the elimination of SNPs with p-values less than 1e-5 in the Hardy-Weinberg equilibrium

test, which narrowed our SNP count down to 693,848. To ensure the robustness of our genetic

instruments, we then implemented LD pruning. SNPs were filtered out in close LD by considering

pairs of SNPs within a window of 100 kb. If a pair of SNPs has an LD measure (𝑟 2) exceeding 0.64,

one SNP from the pair is removed. After completing all these steps, we were left with 467,597

SNPs.

2.4.2 Causal Analysis

2.4.2.1 Causal effect of eGFR on aTRH

In the initial dataset, eGFR values were obtained on multiple occasions. For consistency

and relevance, we selected the eGFR measurements corresponding to visit number 3, which also

represents the baseline assessment. Following the exclusion of samples with missing values for

either eGFR or aTRH, and then combined with the SNP data, our analysis proceeded with a total

of 𝑁 = 1, 353 samples. A simple logistic regression shows there is a strong association between

aTRH and eGFR (𝑝 < 2 × 10−16). We would like to evaluate if this association is causal. Fig A.20

shows the boxplots of eGFR in aTRH positive and negative groups.

Next, we proceeded with the MR-SPLIT and used SIS for preliminary screening, reducing the

number of SNPs from ultra-high to high. To optimize computational efficiency in the analysis,

31

we first conducted univariate regression of each SNP against the exposure before applying sample

split, using the whole data set. A total of 4,580 SNPs (𝑝 < 0.01) remained for further analysis.

The removed SNPs would most likely be screened out by the SIS procedure in subsequent steps

even after the sample split if not discarded at this stage. For each of the 50 sample splits, we used

the ‘screening’ function from the R package ‘screening’ with the SIS option. The number of SNP

variables retained post-screening adhered to the default setting, which is half the size of the sample.

In this real data analysis, instead of applying the LASSO algorithm to select and estimate SNP

effects, we employed a high-dimensional inference procedure, specifically a LASSO-projection

method which provides debiased coefficient estimates and hence a valid p-value for each coefficient.

This is done by using the ‘lasso.proj’ function in the R ‘hdi’ package(Dezeure et al., 2015). As the

regular LASSO estimates are biased, this approach can give debiased estimates and further provide

p-values for testing each coefficient. To compare the performance of the LASSO-projection with

the regular LASSO, We conducted a simulation (detailed in Section A.2.7 in A.2). The results

show that the LASSO-projection method slightly outperforms LASSO, exhibiting higher power

and better control of the type I error rate and smaller RMSE. After getting the p-values for each

SNP, we retained those with a p-value less than or equal to 0.05. This resulted in an average of 98

IVs out of 50 sample splits. We used the partial 𝐹 > 30 as the criterion to declare major IVs. And

the weak IVs were then combined into a composite IV. Finally, we used both the composite IV and

the major IV(s) to obtain the causal effect estimate and the p-value.

Fig 2.7 shows the p-value distribution and the causal effect estimates out of 50 sample splits. The

majority of p-values obtained from MR-SPLIT are below 0.05, and the majority of the estimated

causal effects ˆ𝛽 is centered around -0.0343 (indicated by the black dashed line). In these 50 sample

splits, there was an average of 98.06 IVs incorporated into the model for the causal effect estimate

and the majority were classified as weak IVs. Among these, an average of 0.54 IVs were identified as

major IVs each time. After aggregating all the results using Cauchy’s combination rule, our method

provided an estimate of ˆ𝛽 = −0.0343 (OR= 0.9663), with an aggregated p-value of 5.96 × 10−5.

We also tried lowering the partial F threshold to 20, which yielded slightly more major IVs than

32

the 𝐹 > 30 threshold (see Fig A.22 in Appendix A.2). Among the 50 sample splits, an average of

4.26 IVs were identified as major IVs each time. The results show that the p-value for MR-SPLIT

improved slightly (from 5.9 × 10−5 to 2.9 × 10−6), but the estimates remained nearly the same

( ˆ𝛽 = −0.0342).

Figure 2.7 Histogram of p-values and causal effect estimates from 50 sample splits when eGFR is
treated as the exposure.

We also applied the CFMR method with a 10-fold split. The CFMR method yielded an average

estimate of ˆ𝛽 = −0.0378 (OR=0.9629), with a p-value of < 1 × 10−5. For reference, simply

conducting the 2SLS method yields an estimate of ˆ𝛽 = −0.0407 (OR= 0.9601), with a p-value

of < 1 × 10−7. The three methods established a consistent causal relationship between eGFR and

aTRH.

2.4.2.2 Causal effect of uACR on aTRH

Following a similar procedure, we excluded samples with incomplete data for either uACR

or aTRH. After merging the remaining data with the SNP data, the dataset was reduced to 1,324

samples. The distribution of uACR is very skewed (to the right) (See Fig A.23 in Appendix A.2).

We opted to do a logarithmic transformation of uACR, denoted as log(uACR). A simple logistic

33

regression shows there is a strong association between log(uACR) and aTRH (𝑝 = 4.6 × 10−12).

Similar procedures as described before were followed for further analysis.

Fig 2.8 shows the p-value distribution as well as the causal effect estimate out of 50 sample

splits with MR-SPLIT. In these 50 sample splits, there were on average 74.3 SNPs selected as IVs

with the majority as weak ones for the causal effect estimate. Among them, an average 0.22 IVs

were identified as major IV each time. After aggregating all the results, the final causal estimate

was ˆ𝛽 = 0.1675 (OR= 1.186), with a p-value of 1.9 × 10−3. For comparison, CFMR provided an

estimate of ˆ𝛽 = 0.1584 (OR= 1.1716), with a p-value of 5.2 × 10−5. The two methods yielded

statistically significant results and presented comparable estimates. While applying 2SLS on the

same dataset, we also observed significant results (p-value=1.3×10−5), albeit with a different causal

effect estimate of ˆ𝛽 = 0.0363.

Figure 2.8 Histogram of p-values and causal effect estimates from 50 sample splits when log(uACR)
is treated as the exposure.

Integrating the results from the two analyses that utilized eGFR and uACR separately as expo-

sures, we infer that there exists a causal relationship between CKD function and aTRH. Specifically,

a lower eGFR and a higher uACR tend to contribute to an increased risk of aTRH. However, we

recognize the limited sample size of this study, which necessitates cautious interpretation of the

causal relationship identified. To assess the possibility of a reverse causal effect, we require a

method capable of accommodating a binary exposure variable, such as aTRH in this context. This

will be explored in our future studies.

34

2.5 Discussion

MR analysis has been an instrumental means in epidemiology studies, enabling the assessment

and revelation of causal connections between exposures or interventions and particular outcomes,

leveraging genetic variants as IVs to mitigate confounding factors. In this study, we introduced

an innovative adaptive sample splitting method known as MR-SPLIT, designed to address the

issue of IV selection bias and weak instruments in the context of one-sample MR analysis using

individual-level data. By a random sample split, we use half sample to select IVs and another

independent half to estimate the causal effect, hence avoiding the winner’s curse problem by using

the same data for IV selection and causal effect estimation. Additionally, we presented a multi-

sample splitting strategy to further enhance the robustness of causal estimation and testing. Our

approach involves the adaptive identification of major and weak IVs and further aggregate weak IVs

to form a composite IV. The final set of IVs comprises the major IV(s) and the composite IV. Such a

strategy, as shown in the theoretical evaluation and simulation results, yields a more efficient causal

estimate than CFMR, thereby enhancing testing power. In addition, MR-SPLIT shows consistently

superior performance in terms of coverage probability. Therefore, MR-SPLIT offers significant

improvements over existing methods by effectively handling weak instruments in one-sample MR

analysis and providing robust results with enhanced statistical power.

In comparison to the traditional 2SLS and LIML methods, MR-SPLIT yields less biased results

and effectively controls type I error, under different simulation settings. Compared to the CFMR

approach, which is designed to tackle weak IV issues, our approach provides estimates with smaller

variance and higher statistical power. In the application to the CRIC dataset, both MR-SPLIT and

CFMR produce highly comparable results. We established the causal impact of kidney function,

as assessed by eGFR and uACR, on aTRH. It is worth noting that both CFMR and MR-SPLIT

not only address the issue of weak instrument bias (i.e. finite-sample bias from IV analysis with a

given set of IVs), they also solve the problem of “winner’s curse" (i.e. bias due to variant selection

in the same dataset as the analysis is performed, in particular under a high-dimensional scenario).

The two sets of biases are related but are conceptually distinct. By employing sample splitting

35

strategies, both methods tackle the two bias issues and offer a solution to one-sample MR analysis.

On the other hand, as shown in our theoretical evaluation as well as the intensive simulation studies,

MR-SPLIT demonstrates superior performance compared to CFMR. Within the proposed sample

splitting strategy, additional tasks such as nonlinear causal estimation can also be executed using

one-sample individual-level data.

In the process of selecting IVs, CFMR recommends employing predictive methodologies,

such as LASSO regression, for their efficacy in enhancing prediction accuracy through variance

minimization. However, this approach often introduces bias in effect estimates, as it may incorporate

SNPs without significant association with the exposure - potentially compromising the relevance

assumption for IVs. On the other hand, 2SLS analysis prioritizes the use of predicted exposure

values in its secondary causal inference phase, underlining the importance of prediction accuracy

for causal estimation. Recent advancements in the realm of high-dimensional statistical inference

offer a promising solution by enabling the evaluation of estimation uncertainty for LASSO-derived

estimates (Dezeure et al., 2015). This is achieved through a de-biasing step that facilitates the

calculation of p-values, thereby presenting an innovative approach for SNP IV selection within the

context of high-dimensional SNP-exposure regressions. This technique allows for the derivation

of p-values for individual SNPs, enabling the validation of IV suitability through a p-value based

method. Unlike traditional practices that determine p-values by fitting each SNP individually in

marginal regressions, this approach fits all SNPs (after the SIS step) in a multiple regression model.

This yields partial SNP effect estimates and hence partial p-values, offering a nuanced perspective

compared to conventional methods. By adopting a p-value threshold criterion (e.g., 𝑝 < 0.05), the

selected SNPs meet the relevance assumption, providing a more robust framework for IV selection.

In our analysis, we observed that the LASSO variable selection technique typically identifies a

greater number of IVs compared to the debiased LASSO method.

If computational resources

are not a limiting factor, we recommend the implementation of the debiased LASSO approach in

practical applications.

While MR-SPLIT offers notable advantages, there is still considerable potential for further

36

enhancement and refinement. In this work, we applied the partial 𝐹 statistics for identifying major

IVs, which does not rule out the application of other measures such as those studied by Stock and

Yogo (2002). Any statistical measure capable of ranking the effect sizes of the selected IVs could be

considered for enhancing the robustness and effectiveness of our approach. It is essential to devise

robust methods for discerning between major and weak IVs. This represents a promising direction

for future research. It is worth mentioning that we do not specify the ratio of major IVs to weak IVs;

their quantities are entirely determined by the data itself, that is, based on the strength calculated

from the IVs. On the other hand, as revealed by the simulation studies, the declaration of major

IVs may vary under different F thresholds and under different sample sizes and SNP heritability

levels. In real applications, the 𝐹 > 30 threshold can be relaxed under a small sample size and

low heritability level. The genomewide SNP heritability can be estimated with software such as

GCTA (Yang et al., 2011).

In addition, we employed a straightforward weighted combination approach to aggregate the

information from all weak IVs into a single composite IV. Other advanced machine learning

techniques could also be borrowed by minimizing information loss which could potentially yield

improved results.

An additional constraint of MR-SPLIT is that we did not take the pleiotropic effects into

consideration, but there are several test statistics available to identify its presence (Greco M et al.,

2015; Sargan, 1958). Studies also show that incorporating the invalid IVs with uncorrelated and

correlated horizontal pleiotropic effects can potentially increase power and decrease bias (Yuan

et al., 2022; Qi and Chatterjee, 2019; Burgess et al., 2020). We will investigate this in Chapter 3.

The concept of sample splitting and cross-fitting instruments introduced in this study has

potential applications beyond the scope of traditional one-sample MR analyses using individual-

level data. For example, this framework can be adapted for use in multiple exposure MR analyses,

where it would involve adapting the existing approach to handle multiple sets of selected IVs

simultaneously. For another example, the proposed framework enables the investigation of potential

non-linear causal relationships through a control function approach while effectively addressing

37

the two bias issues previously mentioned. Accomplishing this task is not feasible with summary

statistics, highlighting the framework’s capability to provide more nuanced insights into causal

mechanisms that cannot be captured by summary-level data.

In essence, the expansion of our

methodology to encompass various types of MR analyses could facilitate innovative research into

causal relationships, opening new avenues for investigation.

38

CHAPTER 3

MR-SPLIT+ — ROBUST CAUSAL INFERENCE WITH MANY WEAK AND INVALID
INSTRUMENTS

3.1

Introduction

Depending on the type of data available, MR analysis methods can broadly be categorized into

two classes: those based on summary statistics and those utilizing individual-level data. The former

have been widely adopted in two-sample MR analysis (Bowden et al., 2015, 2016; Verbanck et al.,

2018), primarily because they avoid privacy concerns and allow for easier access to data. However,

despite the growing availability of summary statistics, their use presents several inherent limitations.

Flexible adjustment for covariates is generally not feasible with summary level data, potentially

compromising the precision of causal estimates. Additionally, the lack of individual level data

necessitates reliance on external reference panels for LD pruning, which may introduce bias. More

complex modeling frameworks, such as nonlinear MR analysis, also require individual level data

and are impractical to implement using summary statistics alone. With the increasing availability of

large scale individual level datasets, such as the UK Biobank data, conducting reliable MR analyses

at the individual level has become increasingly feasible (Millard et al., 2019; Sproviero et al., 2021;

Cheng et al., 2024). These datasets provide rich and detailed genetic and phenotypic information,

offering unparalleled opportunities for improving the validity and robustness of causal inference.

MR-SPLIT is a method proposed by Shi et al. (Shi et al., 2024), to solve the weak IV issue

and also the selection bias in one-sample MR studies. This method provides nearly unbiased

estimates when there are many weak IVs. It also ensures the preservation of statistical power while

effectively controlling the type I error rate. However, it did not address the invalid IV issue, which

is also known as the horizontal pleiotropy in MR analysis. Thus, it is imperative to further improve

this method to effectively address the issue of invalid IVs. This advancement would enhance its

reliability in practical applications, allowing researchers to apply the method with confidence and

reduced concern over assumption violations.

Built upon the MR-SPLIT framework with multiple splitting to address IV selection bias

39

and weak IV bias in one-sample MR analysis, we propose MR-SPLIT+, an enhanced version of

MR-SPLIT. It addresses the invalid IV issue by utilizing a mixed integer optimization algorithm

introduced by Bertsimas et al. (Bertsimas et al., 2016), and further combines it with the modified

Cragg-Donald test (Kolesár, 2018) for testing of overidentifying restrictions. Compared to prior

methods (e.g., sisVIVE, CIIV, or WIT), our method demonstrates greater accuracy in identifying

valid IVs and substantially reduces estimation bias under the relaxed plurality rule. By incorporat-

ing multiple splitting, which enhances estimate robustness and improves reliability, MR-SPLIT+

achieves performance as good as that of the oracle method.

The structure of this chapter is organized as follows. Section 3.2 reviews the UK Biobank, a

large scale repository for individual level genetic data, calling for the need to further explore the rich

data source for causal inference. We then highlight the importance of using positive and negative

controls in MR analysis built upon the large sample size of UKB data. Section 3.3 introduces

the model framework. We first briefly review MR-SPLIT and then present MR-SPLIT+ under

the two-stage least squares (TSLS or 2SLS) framework. We then discuss how multiple splitting

improves estimation accuracy and summarize the methodological framework. Section 3.4 presents

simulation results based on a primary scenario that assumes no noise, thus omitting the IV selection

step and directly identifying invalid IVs. A more complex setting involving numerous noisy IVs,

which reflects real world scenarios where IV selection may introduce selection bias, is provided

in Section 3.4.3 in the Appendix for reference. In Section 3.5, we apply MR-SPLIT+ to the UK

Biobank dataset and demonstrate its utility through the positive and negative control. We also

conduct a mediation analysis to further explore factors mediating the causal pathway. Section 3.6

concludes with a discussion of the method’s strengths, limitations, and future directions.

3.2 Motivation and Scientific Questions (UK Biobank)

3.2.1 UK Biobank Dataset Enables Robust Causal Inference

The UK Biobank is a large scale, population based prospective cohort comprising more than

500,000 individuals (Sudlow et al., 2015).

It provides extensive genotype and phenotype data,

making it an invaluable resource for MR analyses. Participants were genotyped using high density

40

arrays covering over 800,000 markers, including genome-wide SNPs and exome variants, enabling

robust instrument selection for MR studies.

In addition to genetic data, UK Biobank offers a wide range of deeply phenotyped traits derived

from questionnaires, physical measurements, biochemical assays, and linked electronic health

records. The longitudinal follow-up through national health registries, including hospital episodes,

cancer diagnoses, and mortality data, facilitates outcome ascertainment across a broad disease

spectrum. The large sample size, comprehensive phenotyping, and availability of individual-

level data make UK Biobank particularly well suited for one-sample MR frameworks, allowing

for refined exposure-outcome modeling, control of pleiotropy, and implementation of sensitivity

analyses. These strengths further motivate methodological development focused on MR analysis

with individual-level data.

3.2.2 Scientific Questions and Motivation for Method Development

With the large sample size in UKB data, robust evaluation of methodology development be-

comes feasible. This includes using positive and negative controls to assess the power and ro-

bustness of causal inference with one-sample MR analysis. For this purpose, we examined two

exposure–outcome pairs in the UKB data. One pair, body mass index (BMI) and diastolic blood

pressure (DBP), has been widely supported by previous findings and serves as a positive con-

trol(Yusni et al., 2024; He et al., 2000; Linderman et al., 2018). The other, birth weight (BW) and

BMI, is considered a negative control, as it is biologically implausible for BMI measured after age

40 to have any causal effect on birth weight.

3.2.2.1 Using positive control to assess the power of different methods

We leveraged the well-established causal relationship between BMI and DBP to evaluate the

performance of our method and its counterparts. Rather than aiming to re-establish causality, we

used this known association as a benchmark to assess the consistency of causal effect estimates

produced by various methods. Building on the availability of rich individual-level data from UKB

data, we conducted analyses in two groups: the overall cohort and a younger subset of participants.

The large sample size provided sufficient statistical power to perform subgroup comparisons and

41

to examine the stability of method performance across different populations. A robust MR method

is expected to yield consistent results between the two groups, thereby reflecting the underlying

causal relationship.

The detailed analysis procedure can be found in Section 3.5.1. We applied five methods in

total, including ordinary least squares (OLS), naive TSLS, sisVIVE, CIIV, and WIT. The latter

three represent recent methodological developments specifically designed to address the presence

of invalid IVs. Results are presented in Table 3.2 and Table 3.3 . Although the latter three methods

all exhibited strong statistical significance in both groups, their conclusions appear less convincing

upon closer examination. For example, while other methods detected the presence of invalid IVs in

the ‘all group’ results, sisVIVE failed to identify any invalid instruments. Furthermore, sisVIVE

only provides point estimates without accompanying statistical tests, which greatly limits its utility

in practical applications. As for WIT, it exhibited a notable inconsistency: it identified no invalid

IVs among 101 candidates in the ‘young group’, yet detected 51 invalid IVs out of 99 candidates in

the ‘all group’. Such contradictory findings undermine confidence in the conclusions drawn from

WIT. In the case of CIIV, although its results appear relatively consistent, the method relies on the

assumption that instruments are sufficiently strong to ensure valid estimation. In our subsequent

simulations covering a wider range of scenarios, the performance of CIIV was also found to be

unsatisfactory.

3.2.2.2 Using negative control to assess the robustness of different methods

Building on the availability of individual-level data from the UK Biobank, we further designed a

negative control analysis to complement the positive control described earlier. Based on established

biological knowledge, an individual’s BMI measured at the age of 40 or older cannot plausibly

influence their own birth weight, implying the absence of a causal effect from BMI to BW. In this

analysis, we treated BMI as the exposure and BW as the outcome. Given the lack of a biologically

plausible causal pathway, we did not expect to observe a significant causal effect between the two

variables. The negative control setting allows us to further evaluate the robustness of different MR

methods. In most real-world applications, the true causal relationship between an exposure and an

42

outcome is unknown, making it difficult to assess the validity of MR estimates. However, in this

case, the biological implausibility of the exposure-outcome relationship provides a rare opportunity

to benchmark method performance using real data with large sample sizes.

The detailed analysis procedure can be found in Section 3.5.2. Results are presented in Table

3.5. We could find that although WIT is specifically designed to handle the presence of invalid IVs

among many weak instruments, it nonetheless produced counterintuitive results, suggesting a false

causal effect of BMI on birth weight.

Therefore, based on the findings from the two examples above, we recognize an urgent and

essential need to develop a method capable of providing reliable causal estimates using individual-

level data.

Ideally, this method should be sufficiently robust to accommodate a wide range of

practical scenarios, easily interpretable, and preferably built upon the widely used 2SLS framework.

Motivated by these goals, we extend the original MR-SPLIT method and propose MR-SPLIT+.

3.3 Methods

Let 𝑌 ∈ R𝑁×1 be the outcome variable of interest, and 𝑋 ∈ R𝑁×1 the exposure variable, where

𝑁 denotes the sample size. Both 𝑌 and 𝑋 are assumed to be continuous variables. We define

𝐺 ∈ R𝑁×𝑝 as the genetic instruments (i.e., SNPs), where 𝑝 is the number of SNPs. We futher

denote the unknown confounder as 𝑈, which could have effects on both 𝑋 and 𝑌 but unobserved.

The model can be represented as:

𝑈 = 𝐺𝜂1 + 𝜀1,

𝑋 = 𝐺𝜂2 + 𝑈𝜂3 + 𝜀2

= 𝐺 (𝜂2 + 𝜂3𝜂1) + (𝜀2 + 𝜀1𝜂3),

(3.1)

𝑌 = 𝑋 𝛽 + 𝐺𝜂4 + 𝑈𝜂5 + 𝜀3

= 𝑋 𝛽 + 𝐺 (𝜂4 + 𝜂5𝜂1) + (𝜀3 + 𝜀1𝜂5)

We call IVs are invalid if 𝜂4 + 𝜂5𝜂1 ≠ 0. This setting introduces challenges for causal inference, as

standard IV methods assume all instruments affect the outcome solely through the exposure (i.e.,

𝜂4 + 𝜂5𝜂1 = 0). Let 𝛾 = 𝜂2 + 𝜂3𝜂1, 𝛼 = 𝜂4 + 𝜂5𝜂1, 𝜀𝑥 = 𝜀2 + 𝜀1𝜂3 and 𝜀𝑦 = 𝜀3 + 𝜀1𝜂5, the model is

43

simplified as follows,

𝑌 = 𝑋 𝛽 + 𝐺𝛼 + 𝜀𝑦,

𝑋 = 𝐺𝛾 + 𝜀𝑥,

(3.2)

(3.3)

where 𝜀𝑥 and 𝜀𝑦 are error terms assumed to follow normal distributions with mean 0 and

cor(𝜀𝑥, 𝜀𝑦) ≠ 0 due to the influence of unknown confounders.

Invalid IVs are indicated by

𝛼 ≠ 0. Our objective is to develop a robust framework for estimating the causal effect 𝛽, while

simultaneously identifying and accounting for invalid instruments to mitigate bias and improve

inference accuracy.

3.3.1 MR-SPLIT Recap

Before introducing MR-SPLIT+, we first briefly introduce the framework of MR-SPLIT (Shi

et al., 2024), which was designed to solve the weak IV and selection bias issues in one-sample MR

studies. Given the observed data {𝑋, 𝑌 , 𝐺}, a screening method is first applied, such as SIS (Fan

and Lv, 2008), to reduce the number of SNPs from an ultra-high dimension to a more manageable

level as the number of SNPs is usually in the magnitude of 105 or higher. Next, the data sample

is randomly split into two parts. One part is used to select IVs using a shrinkage method such as

LASSO. These IVs are then categorized into major and weak ones, based on the partial 𝐹-statistics,

followed by combining only the weak IVs into a composite one using a weighted approach and

obtaining the cross-fitted exposure. The same procedure is applied to the other half of the data. The

two sets of cross-fitted exposures are then combined together to fit the second stage IV regression

model with the entire sample and to estimate the causal effect. Finally, multiple splits are applied,

and the final estimate is defined as the mean of the estimates obtained from multiple splits. The

final p-value is aggregated through the Cauchy combination test.

3.3.2 The first stage of MR-SPLIT+

MR-SPLIT has been shown to perform well when there are no invalid IVs. In this work, we

propose a rigorous approach to address the invalid IV issues to improve the MR-SPLIT framework.

The first stage of MR-SPLIT+ is illustrated in Figure 3.1a. Similar to the MR-SPLIT approach and

44

given the observed dataset {𝑋, 𝑌 , 𝐺}, we employ screening methods, such as SIS, to reduce the

number of SNP IVs to a manageable size, typically a few hundred.

(a) First stage of MR-SPLIT+.

(b) Second stage of MR-SPLIT+.

Figure 3.1 The two stage MR-SPLIT+ framework.

Next, we split the sample evenly into two subsets and conduct IV selection independently in

each subsample. Regardless of the method researchers choose for IV selection, such as marginal

p-values, LASSO, or adaptive LASSO, we strongly recommend applying the first stage of the CIIV

method to further refine the selected IVs (see details in section 3.3.3.4). Across multiple simulation

studies, this additional step has proven highly effective in mitigating noise while retaining valid

IVs. Since excessive noise often introduces bias in the identification of invalid IVs, incorporating

this refinement can significantly improve estimation accuracy.

After completing IV selection in each subset, the key difference between MR-SPLIT+ from MR-

SPLIT lies in how the selected IVs are treated. Instead of treating major and weak IVs separately,

all IVs are combined into one composite IV. Extensive simulation studies have demonstrated that

dealing with one composite IV leads to better type I error control and more accurate coverage

probabilities. This approach makes practical sense as SNPs usually have weak effects in GWAS

studies, especially under small sample sizes. In the presence of strong IV effect, major IVs can be

retained and dealt separately from the weak ones following the MR-SPLIT procedure. We have

incorporated this option in our code, allowing users to achieve this by varying the threshold of

45

Regress 𝑋2~ 𝐺2,𝑆1 to get ෠𝑋2 𝐼2=𝑋2,𝑌2,𝐺2Regress 𝑋1~ 𝐺1,𝑆2 to get ෠𝑋1 𝐼1=𝑋1,𝑌1,𝐺1෠𝑋=෠𝑋1෠𝑋2Originaldata: 𝑋,𝑌,𝐺,𝐺∈𝑅𝑁×𝑝 Randomly splitUse first stage of CIIV to select relevant IVs for 𝐼2Relevant IVs: 𝑆1∪𝑆2Relevant IVs 𝑆1Use first stage of CIIV to select relevant IVs for 𝐼1Relevant IVs 𝑆2መ𝛽Regress 𝑌 on ෠𝑋 and 𝐺𝐽 Invalid IVs 𝐺𝐽,  J={𝑗:ො𝛼𝑗𝐵𝑒𝑠𝑡≠0}Results from first stage {෠𝑋,𝑌,𝐺𝑆}For each 𝑘, ො𝛼𝑘𝐵𝑒𝑠𝑡=𝑎𝑟𝑔min𝛼𝑌−෨𝑍𝛼22Calculate corresponding test statistics 𝑇𝑘 with p value 𝑝𝑘ො𝛼𝐵𝑒𝑠𝑡=𝑎𝑟𝑔min𝑝𝑘>0.05ො𝛼𝑘𝐵𝑒𝑠𝑡0partial 𝐹-statistics. After the above steps, as in the MR-SPLIT approach, we obtain cross-fitted

exposures in the two subsamples denoted as ˆ𝑋1 and ˆ𝑋2, then combine them to get ˆ𝑋 for the entire

sample.

3.3.3 The second stage of MR-SPLIT+

Assuming that some IVs may be invalid, meaning that the IVs may have a direct effect on 𝑌 ,

i.e. 𝛼 ≠ 0. In the second stage, our primary goal is to accurately identify the invalid IVs selected

in the first stage and include them as covariates in the model to obtain an unbiased causal estimate.

Suppose we obtain the selected IV set 𝐺 𝑆 = 𝐺 𝑆1 ∪ 𝐺 𝑆2 in the first stage.

3.3.3.1

Identifiability of parameters

Intuitively, one might attempt to use a shrinkage method to solve the following function to

identify the invalid IVs:

ˆ𝛼 = arg min
𝛼

∥𝑌 − 𝑋 𝛽 − 𝐺𝛼∥2
2

,

(3.4)

and if ˆ𝛼 𝑗 = 0, 𝐺 𝑗 is considered a valid IV; otherwise, if ˆ𝛼 𝑗 ≠ 0, 𝐺 𝑗 is deemed an invalid IV.

However, we noticed that for any constant 𝑐, we can rewrite Eq. (4.1) as

𝑌 = 𝑋 𝛽 + 𝐺𝛼 + 𝜀𝑦

= 𝑋 (𝛽 + 𝑐) + 𝐺𝛼 + 𝜀𝑦 − 𝑐(𝐺𝛾 + 𝜀𝑥)

(3.5)

= 𝑋 (𝛽 + 𝑐) + 𝐺 (𝛼 − 𝛾𝑐) + (𝜀𝑦 − 𝑐𝜀𝑥)

Hence, for every value 𝑐 = 𝛼 𝑗 /𝛾 𝑗 ≠ 0, 𝑗 = 1, · · · , 𝑝, we could use the estimated parameter

{𝛽 + 𝑐, 𝛼 − 𝛾𝑐} to get the same 𝑌 , and each corresponds to a specific set of invalid IVs, characterized

by distinct values of 𝛼 𝑗 . To ensure parameter identifiability, we impose the Plurality Rule proposed

by Guo et al. (2018), stated as follows:

Assumption 1 (Plurality Rule).

(cid:12)
(cid:12)

(cid:8)𝛼 𝑗 : 𝛼 𝑗 = 0(cid:9)(cid:12)

(cid:12) > max
𝑐≠0

(cid:26)

(cid:12)
(cid:12)
(cid:12)
(cid:12)

𝛼 𝑗 :

𝛼 𝑗
𝛾 𝑗

= 𝑐

(cid:27)(cid:12)
(cid:12)
(cid:12)
(cid:12)

.

(3.6)

This condition ensures that the majority of the instruments are valid, meaning that among all

IV subsets classified by different values of 𝑐 =

𝛼 𝑗
𝛾 𝑗 , the subset of valid IVs is the largest. In fact, this

condition can be further relaxed, as we will discuss in detail in Section 3.3.3.2.

46

Now we can successfully identify the unique set of valid IVs by solving

ˆ𝛼 = arg min
𝛼

∥𝛼∥0

s.t.

∥𝑌 − 𝑋 𝛽 − 𝐺𝛼∥2
2

< 𝛿,

(3.7)

where the ℓ0 norm of the vector 𝛼 counts the number of nonzeros in 𝛼, and 𝛿 is a sufficiently small

prespecified value. In other words, among all possible solution combinations, the true values of

the parameters are the ones that make 𝛼 the sparsest.

Following the work by Lin et al. (2024), we could reformulate Eq. 3.7 to obtain the solutions

as:

ˆ𝛼 = arg min
𝛼

∥𝛼∥0

s.t.

∥𝑌 − (cid:101)𝐺𝛼∥2
2

< 𝛿,

(3.8)

where (cid:101)𝐺 = 𝑀 ˆ𝑋𝐺 = (𝐼 − ˆ𝑋 ( ˆ𝑋′ ˆ𝑋)−1 ˆ𝑋′)𝐺 ∈ R𝑁× ˜𝑝.
exposure obtained from the first stage, while 𝐺 denotes 𝐺 𝑆, the union of IVs selected during the

ˆ𝑋 represents the estimated

In our method,

first stage.

3.3.3.2 Best subset selection to identify invalid IVs

To solve (4.3), we first identify several candidate solution sets and then select the one that

yields the sparsest 𝛼 among these candidates. Given 𝑘 = ∥𝛼∥0, our goal is to solve the following

optimization problem,

min
𝛼

∥𝑌 − (cid:101)𝐺𝛼∥2
2

,

(3.9)

and this is the so-called best subset selection problem (Miller, 2002). The cardinality constraint

||𝛼||0 = 𝑘 makes problem NP-hard. To avoid this intractable problem, previous methods opted to

use surrogate penalty functions for solutions. This is also why the WIT method (Lin et al., 2024)

chose to employ the MCP penalty as an alternative. However, this approach of solving the problem

using an alternative method instead of directly addressing it often entails potential issues, which

can also be observed in the results of our subsequent simulations.

Denote ˆ𝛼𝑜𝑟 be the oracle estimator if we know the true invalid IV set in prior, then we have

ˆ𝛼𝑜𝑟 = ( (cid:101)𝐺𝑇

𝐴0 (cid:101)𝐺 𝐴0)−1

(cid:101)𝐺 𝐴0

𝑌 ,

(3.10)

47

where 𝐴0 = { 𝑗 : 𝛼 𝑗 ≠ 0} and (cid:101)𝐺 𝐴0 is a submatrix of (cid:101)𝐺. This aligns precisely with the form of the

OLS estimator. Before we establish the selection consistency, we need an important assumption

stated below.

Assumption 2 (Necessary Condition for Selection Consistency). There exists a constant 𝑑1 > 0

such that:

𝐶min(𝛼, (cid:101)𝐺) ≥

𝑑1𝜎2 log 𝑝
𝑛

,

where 𝐶min(𝛼, (cid:101)𝐺) ≡ min{𝛼𝐴:𝐴≠𝐴0,| 𝐴|≤| 𝐴0|}
IVs set. | 𝐴0 \ 𝐴| denotes the number of IVs mistakenly omitted from the true invalid IVs set.

𝛼𝐴0 − (cid:101)𝐺 𝐴𝛼𝐴 ∥2. 𝐴0 is the true invalid

𝑛 max(| 𝐴0\𝐴|,1) ∥ (cid:101)𝐺 𝐴0

1

The following Theorem 2 guarantees the selection consistency of our method.

Theorem 2. Suppose ˆ𝛼 is the global minimizer of the following optimization problem:

min
𝛼

∥𝑌 − (cid:101)𝐺𝛼∥2
2

s.t.

∥𝛼∥0 ≤ 𝑘,

where the residual term (cid:101)
𝛼 𝑗 ≠ 0} and ˆ𝐴 = { 𝑗 : ˆ𝛼 𝑗 ≠ 0}. If 𝑘 = | 𝐴0| and Assumption 2 holds, then

𝜀 = 𝑌 − (cid:101)𝐺𝛼 follows a normal distribution 𝑁 (0, 𝜎2𝐼). Denote 𝐴0 = { 𝑗 :

𝑃( ˆ𝐴 ≠ 𝐴0, ˆ𝛼 ≠ ˆ𝛼𝑜𝑟) → 0

as 𝑛, 𝑝 → ∞.

Shen et al. (2012, 2013) demonstrated that under Assumption 2, the constrained ℓ0-method

guarantees selection consistency and oracle parameter estimation, where the estimator is shown to

consistently select the correct variables and converge to the oracle OLS estimator under Assump-

tion 2. This assumption ensures a minimal degree of separation necessary for correctly identifying

invalid IVs, and serves as a fundamental condition under the 𝐿2 metric for any variable selection

method, including LASSO, SCAD, or MCP. The proof of Theorem 2 directly follows the work of

Shen et al. (2012, 2013).

As shown in Theorem 2, when the number of valid IVs 𝑝 − 𝑘 is correctly specified, even when

there exists a group of invalid IVs whose number equals that of the valid IVs, our method can always

achieve selection consistency under certain assumptions. So we only require a relaxed version of

the Plurality Rule, stated as follows:

48

Assumption 3 (Relaxed Plurality Rule).

(cid:12)
(cid:12)

(cid:8)𝛼 𝑗 : 𝛼 𝑗 = 0(cid:9)(cid:12)

(cid:12) ≥ max
𝑐≠0

(cid:26)

(cid:12)
(cid:12)
(cid:12)
(cid:12)

𝛼 𝑗 :

𝛼 𝑗
𝛾 𝑗

= 𝑐

(cid:27)(cid:12)
(cid:12)
(cid:12)
(cid:12)

.

(3.11)

Compared to the original Plurality Rule, which requires that the number of valid IVs strictly

exceeds that of any group of invalid IVs (classified by distinct values of 𝑐 =

𝛼 𝑗
𝛾 𝑗 ), the relaxed version

allows for ties in group sizes. That is, the valid IV group is permitted to have the same cardinality

as one or more invalid IV groups.

In our work, we apply the mixed integer optimization (MIO) approach(Bertsimas et al., 2016)

to obtain the global minimizer of (3.9) subject to the cardinality constraint. There is an R pack-

age available that can implement this approach, provided at https://github.com/ryantibs/

best-subset. The general MIO problem can be formulated as:

min
𝛼

s.t.

𝛼𝑇 𝑄𝛼 + 𝛼𝑇 𝑎

𝐴𝛼 ≤ 𝑏,

𝛼𝑖 ∈ {0, 1},

𝑖 ∈ I,

𝛼 𝑗 ≥ 0,

𝑗 ∉ I,

(3.12)

where 𝑎 ∈ R𝑚, 𝐴 ∈ R𝑘×𝑚, 𝑏 ∈ R𝑘 , 𝑄 ∈ R𝑚×𝑚 and 𝑄 is positive semidefinite. 𝛼 ∈ R𝑚 contains

both discrete (𝛼𝑖, 𝑖 ∈ I) and continuous (𝛼𝑖, 𝑖 ∉ I) variables, with I ⊂ {1, . . . , 𝑚}.

Following (3.12), we can reformulate the minimization problem in (3.9) as:

min
𝛼,𝑧

𝛼⊤( (cid:101)𝐺⊤

(cid:101)𝐺)𝛼 − 2𝛼⊤

(cid:101)𝐺⊤𝑌 + ∥𝑌 ∥2
2

s.t.

(1 − 𝑧𝑖)𝛼𝑖 = 0,

𝑧𝑖 ∈ {0, 1},

˜𝑝
∑︁

𝑖=1

𝑧𝑖 ≤ 𝑘,

− M𝑈 ≤ 𝛼𝑖 ≤ M𝑈,

∥𝛼∥1 ≤ M𝑙,

(3.13)

where 𝑧𝑖 indicates whether 𝛼𝑖 ≠ 0, with (cid:205) ˜𝑝
𝑖=1
If 𝑧𝑖 = 0, then ˜𝐺𝑖 is excluded from the model, implying that 𝐺𝑖 is a valid IV. Consequently, (cid:205) ˜𝑝
𝑖=1

𝑧𝑖 representing the number of nonzero elements in 𝛼.

𝑧𝑖

49

denotes the number of invalid IVs among the selected IV sets. M𝑈 is a constant such that if ˆ𝛼 is

a minimizer of (3.9), then M𝑈 ≥ ∥𝛼∥∞ = max |𝛼𝑖 |. The presence of M𝑈 and M𝑙 could improve

the performance of MIO. There are additional representations of (3.9) discussed in (Bertsimas

et al., 2016), each tailored to different scenarios. The reformulation in (3.13) presented here is

particularly useful when 𝑁 > ˜𝑝 and ˜𝑝 is on the order of hundreds. Bertsimas et al.(Bertsimas et al.,

2016) introduced three methods to estimate M𝑈 and M𝑙. Here, we primarily focus on the third

method, parameter specifications from advanced warm-starts.

Consider the following optimization problem:

𝑔(𝛼)

min
𝛼

subject to

∥𝛼∥0 ≤ 𝑘,

(3.14)

where 𝑔(𝛼) ≥ 0 is convex and has a Lipschitz continuous gradient, i.e., ∥∇𝑔(𝛼) − ∇𝑔( ˜𝛼)∥ ≤

ℓ∥𝛼 − ˜𝛼∥, with ℓ being the Lipschitz constant.

The following Algorithm 3.1 outlines the steps to provide solutions for (3.14).

Algorithm 3.1 Find a stationary point of problem (3.14).
Input: 𝑔(𝛼), parameter 𝐿 > 𝑙, and the convergence tolerance 𝜖.
Output: A first-order stationary solution 𝛼∗.
1: Initialization with 𝛼1 ∈ R𝑝 such that ∥𝛼1∥0 ≤ 𝑘.
2: For 𝑚 ≥ 1

(cid:16)

where 𝜂𝑚 ∈ H𝑘
H𝑘 (𝑐) is defined component-wise as:

𝐿 ∇𝑔(𝛼𝑚)

𝛼𝑚 − 1

𝛼𝑚+1 = 𝜆𝑚𝜂𝑚 + (1 − 𝜆𝑚)𝛼𝑚,
(cid:17)

, with 𝜆𝑚 ∈ arg min𝜆 𝑔(𝜆𝜂𝑚 + (1 − 𝜆)𝛼𝑚). The operator

H𝑘 (𝑐)𝑖 =

(cid:26) 𝑐𝑖,
0,

if 𝑖 ∈ {1, . . . , 𝑘 },
otherwise.

Here, {1, . . . , 𝑘 } represents the indices of the 𝑘 largest absolute values of the vector 𝑐.

3: Repeat step 2 until 𝑔(𝛼𝑚) − 𝑔(𝛼𝑚+1) ≤ 𝜖.

Once we obtain the estimated ˆ𝛼 for (3.14), setting M𝑈 := 𝜏∥ ˆ𝛼∥∞, where 𝜏 is a multiplier greater

than 1 (e.g., 𝜏 ∈ {1.5, 2, 5}), provides a suitable estimate for the parameter M𝑈. Additionally,

defining M𝑙 = 𝑘M𝑈 yields a reasonable upper bound for ∥𝛼∥1. These estimation processes can all

be implemented using the bs() function from the R package bestsubset.

50

3.3.3.3 Test of over-identification

For each 𝑘 = ∥𝛼∥0, we obtain a potential set of invalid IVs by solving the best subset problem

in Section 3.3.3.2. The next step is to determine which of these sets are acceptable, i.e., those that

make ∥𝑌 − (cid:101)𝐺𝛼∥2

2 sufficiently small. Instead of specifying a sufficiently small threshold 𝛿, we adopt

a testing-based approach proposed by Kolesár (2018).

Denote 𝑍 as the selected valid IV set and 𝑊 as the selected invalid IV set. We rewrite Eqs.

(4.1) and (4.6) as following:

(cid:104)

𝑋 𝑌

(cid:105)

(cid:104)

=

𝑍 𝑊

(cid:105)

𝛾

Γ

Ψ1 Ψ2

















(cid:104)
𝑉1 𝑉2

(cid:105)

.

+

(3.15)

The hypothesis we aim to test is the Proportionality Restriction (PR) assumption introduced by

Kolesár (2018), which is stated as follows:

Assumption 4 (Proportionality Restriction). Γ = 𝛾𝛽.

Consider the following statistics:

𝑆 =

1
𝑁 − 𝑘 − 𝑙

𝑌 ′(1𝑁 − 𝑍 𝑍′ − 𝑊 (𝑊 ′𝑊)−1𝑊)𝑌 ,

𝑇 =

1
𝑁

𝑌 ′𝑍 𝑍′𝑌 ,

where 𝑘 and 𝑙 are the number of valid and invalid IVs respectively, and 𝑁 is the sample size. We
have 𝐸 (𝑇 − 𝑘

. Under Assumption 4, we have:

𝑁 𝑆) = X, where X = 1
𝑁

Γ 𝛾

Γ 𝛾

(cid:17)′ (cid:16)

(cid:17)

(cid:16)

(cid:169)
(cid:173)
(cid:173)
(cid:171)
and X22 is the bottom-right submatrix of X. Therefore, testing Assumption 4 is equivalent to testing

X = X22

(cid:170)
(cid:174)
(cid:174)
(cid:172)

𝛽

1

,

𝛽2 𝛽

the following hypotheses:

H0: X is reduced rank,

v.s. H1: X is positive definite.

51

Let 𝜆𝑚𝑖𝑛 denote the minimum eigenvalue of the matrix 𝑆−1𝑇. The test statistic is defined as:

ˆ𝐽𝑀 𝐷 =





0,

if

𝜆𝑚𝑖𝑛 ≤ 𝑘
𝑁 ,

(cid:16)

𝜆𝑚𝑖𝑛 − 𝑘
𝑁

(cid:17) 2

,

otherwise.

The specific distribution of this statistic can be found in Kolesár (2018). Moreover, the readily

available R package manyIV can be employed for this purpose.

3.3.3.4 Refine IV selection with CIIV

In practical applications of one-sample MR analysis, erroneous inclusion of SNPs that do not

affect the exposure (i.e., noise) as IVs is a common issue, particularly when the sample size is small.

Excessive noise in the candidate IV set can significantly reduce the accuracy of identifying invalid

IVs. To address this, we adopt the first-stage filtering procedure from the CIIV method (Windmeijer

et al., 2021), which can effectively filter out noises. While the CIIV first-stage filtering is intended

to exclude uninformative IVs, it is important to note that it does not exclusively retain strong IVs;

weak IVs may also pass through. Nevertheless, this approach provides several advantages:

• It minimizes the inclusion of noises.

• By imposing a less stringent threshold, it avoids selecting only the strongest IVs, allowing for

a balanced selection of strong and weak IVs. This flexibility provides room for addressing the

weak instrument problem using MR-SPLIT+, reducing bias compared to directly applying

CIIV.

Given these benefits, the first-stage filtering procedure is a necessary step for selecting relevant IVs

in one-sample MR analyses. Specifically, the first stage aims to test 𝐻0 : 𝛾 𝑗 = 0, 𝑗 = 1, . . . , 𝑝,

where 𝑝 is the number of selected SNPs. We reject the 𝐻0 if

ˆ𝛾 𝑗
√︁var( ˆ𝛾 𝑗 )
where 𝜔𝑁 = √︁2.01 log{max( 𝑝, 𝑁)}, and var( ˆ𝛾 𝑗 ) can be a robust variance estimator in case of

|𝑡𝛾 𝑗 | =

< 𝜔𝑁 ,

(3.16)

(cid:12)
(cid:12)
(cid:12)
(cid:12)
(cid:12)

(cid:12)
(cid:12)
(cid:12)
(cid:12)
(cid:12)

heteroskedasticity. The framework for the second stage of MR-SPLIT+ is summarized in Figure

3.1b.

52

3.3.4 Multiple sample splitting to enhance stability and robustness

Due to the uncertainty in selecting valid IVs when splitting a sample only once, we implement

a multiple splitting strategy in MR-SPLIT+, to ensure robust results. Suppose we split the sample

𝑇 times, obtaining an estimate of 𝛽𝑡 at each time of split. We choose the estimate that is closest

to the median of the set {𝛽𝑡 : 𝑡 = 1, . . . , 𝑇 } as our final causal effect estimate (see Algorithm 3.2

for the detail). The primary reason we select a single result instead of integrating all split results is

that the process of screening invalid IVs often produces outliers. To ensure robustness, we opt for

the median-type estimate. This also motivates us to avoid using the p-value combination method

employed in MR-SPLIT, and instead use the p-value corresponding to the final estimate as the

p-value for testing the causal effect.

We evaluated the effectiveness of multiple sample splitting under the simulation settings de-

scribed in Section 3.4.3 considering the case with noise IVs. The results, presented in the the

Appendix, indicate that performing multiple splits significantly improves the precision of the es-

timates compared to using a single split, particularly in scenarios with limited sample sizes or

substantial noise.

Algorithm 3.2 summarizes the analytical procedure of MR-SPLIT+.

3.4 Simulation Study

3.4.1 The impact of sample split on the effect estimate

We first evaluated the effectiveness of multiple sample splitting under the simulation settings

described in Section 3.4.3 considering the case with noises. The results indicate that performing

multiple splits significantly improves the precision of the estimate compared to a single split,

particularly in scenarios with limited sample sizes or substantial noise. Figure 3.2 presents the

violin plots of the estimation results under various split times from 1 to 30. Each plot shows

the results under 1000 simulation runs. We also conducted simulations with 𝑁 = 6000, and the

detailed results can be found in Fig. A.24 in the Appendix. When the sample size is large and the

IVs are strong (Case 1 and Case 2), increasing the number of splits has a limited effect on improving

estimation accuracy. However, when IVs are weak or the sample size is small (Case 3 and Case 4),

53

Algorithm 3.2 The analytical procedure of MR-SPLIT+.
Input: {𝑋, 𝑌 , 𝐺}, 𝑋, 𝑌 ∈ R𝑁×1, 𝐺 ∈ R𝑁×𝑝, 𝑝 ≫ 𝑁.
Output: The causal effect ˆ𝛽.
1: Employ screening methods, such as SIS, to reduce the number of SNPs to a manageable size,

typically a few hundred.

2: For 𝑡 = 1, . . . , 𝑇

• Split the sample evenly into two subsets {𝐼1, 𝐼2}.
• In subset 𝐼1, after selecting a candidate IV set, we apply the first stage of CIIV to refine
the IVs and estimate their effects. Then, in 𝐼2, the selected IVs are combined into a
composite IV using the estimated effects as weights. The same process is repeated for
another subset.

• Get estimated ˆ𝑋1 and ˆ𝑋2 from the two subset and combine them into ˆ𝑋. Denote 𝐺 𝑆 ∈

R𝑁× ˜𝑝 as the union set of the two selected IV sets.
• Starting from 𝑘 = 0, set 𝛼𝑘 = arg min𝛼 ∥𝑌 − (cid:101)𝐺𝛼∥2

2, then perform the overidentification
test. If the testing p-value ≤ 0.05, increment 𝑘 by 1 (𝑘 = 𝑘 + 1) and repeat the process
until the testing p-value 𝑝 > 0.05.

• Regress Y on ˆ𝑋 while including the selected invalid IVs as covariates to get the estimated

causal effect ˆ𝛽𝑡.

3: Perform the same sample split procedure 𝑇 times and choose

ˆ𝛽 = arg

min
ˆ𝛽𝑡 ∈{ ˆ𝛽𝑡 :𝑡=1,...,𝑇 }

(cid:12) ˆ𝛽𝑡 − median({ ˆ𝛽𝑡 : 𝑡 = 1, . . . , 𝑇 })(cid:12)
(cid:12)
(cid:12)

as the final causal effect estimate.

the estimates become increasingly stable as the number of splits increases, with a gradual reduction

in variance and outliers gradually vanishing.

Figure 3.2 Violin plots showing the estimation accuracy as the number of sample split increases
under different cases and sample sizes.

54

Additionally, we examined the changes in coverage probability as the number of sample splits

increases (see Figure 3.3). The results indicate that the coverage probability improves as the sam-

ple size increases, under different cases. As the number of sample splits increases, the coverage

probability shows significant improvement in Case 4 when the sample size is small (e.g., 1000).

Under different cases, the coverage probability stabilizes with less than 10 sample splits, indicating

the robustness of the method. under weak IVs or small sample sizes, increasing the split times

significantly improves the coverage probability, bringing it closer to the nominal 95% level. How-

ever, an interesting observation arises in Case 1 with strong IVs: as the sample size increases, the

coverage probability approaches the nominal 95% level at N = 3000. However, when N further

increases to 6000, the coverage probability instead decreases to around 94%. This phenomenon

may be attributed to slight inaccuracies in the variance estimation of our method. As the sample

size grows, the estimated variance decreases, leading to a narrower confidence interval.

If the

variance is slightly underestimated, the confidence interval may become too narrow, resulting in

a coverage probability below the nominal level. This suggests a potential area for methodological

refinement. Nevertheless, given the small deviation, we still consider this result to be within an

acceptable range.

We also presented the results of False Negative Rate (FNR) and False Positive Rate (FPR) for

identifying invalid IVs in Fig. 3.4. The FNR represents the proportion of invalid IVs incorrectly

identified as valid ones, while the FPR is the proportion of valid IVs incorrectly identified as invalid

ones. Similarly, in Case 1 and Case 2, increasing the number of splits did not lead to substantial

improvements in performance. However, in Case 3 and Case 4, where the instruments are relatively

weak, both FPR and FNR exhibited a decreasing trend as the number of splits increased.

In summary, according to the simulation results, even with extremely weak IVs (Case 4),

splitting the sample up to 20-30 times is sufficient to achieve stable results. If the sample size is

large, the number of splits can be reduced.

55

Figure 3.3 Coverage probability under different cases and sample sizes as the number of sample
splits increases from 0 to 30.

3.4.2 Simulation without noise

We compared MR-SPLIT+ with recently developed one-sample MR methods, i.e., sisVIVE (Kang

et al., 2016), CIIV (Windmeijer et al., 2021), WIT (Lin et al., 2024), as well as the oracle TSLS

method which assumes that we know which IVs are valid and invalid in advance.

To ensure a fair comparison, we strictly adhered to nearly all the parameter settings outlined in

the WIT work (Lin et al., 2024). The only difference is that we assumed IVs are independent of

each other, an assumption that can be easily satisfied in real data through LD pruning. Specifically,
to generate data containing 𝑁 samples, we assumed 𝐺 i.i.d.∼ 𝑁 (0, Σ𝐺), where Σ𝐺

𝑖𝑖 = 0.8, 𝑖 = 1, ..., 𝑝,
𝑖 𝑗 = 0, 𝑖 ≠ 𝑗. In this section, we assumed that all 21 IVs affect the exposure 𝑋, with no noise

and Σ𝐺

IVs included. We also considered scenarios that include noise IVs and applied these methods to

select IVs as shown in Section 3.4.3.

56

Figure 3.4 False positive rate and false negative rate as the number of sample splits changes from
0 to 30.

The error terms 𝜀𝑥, 𝜀𝑦 were generated from (𝜀𝑥𝑖, 𝜀𝑦𝑖) i.i.d.∼ 𝑁 (0, Σ), where Σ = (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝑖 = 1, ..., 𝑁. For the effect size of 𝛼 and 𝛾, we considered the following four cases:

0.25 0.3

0.3

1

,

(cid:170)
(cid:174)
(cid:174)
(cid:172)

• Case 1: 𝛼 = (0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:124)

• Case 2: 𝛼 = (0.15, · · · , 0.15
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)
(cid:124)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

), 𝛾 = (0, · · · , 0
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
9
), 𝛾 = (0, · · · , 0
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
9

(cid:123)(cid:122)
21

(cid:123)(cid:122)
21

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
6

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.2, · · · , 0.2
)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)
(cid:123)(cid:122)
(cid:124)
6

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
6

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.2, · · · , 0.2
)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
6

• Case 3: 𝛼 = (0.15, · · · , 0.15
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)
(cid:124)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

),

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:123)(cid:122)
5

, 0.07, · · · , 0.07
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
16
, 0.2, · · · , 0.2
𝛾 = (0, 0, 0, 0.2, 0.1, 0, · · · , 0
, 0.1, · · · , 0.1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
(cid:123)(cid:122)
(cid:125)
(cid:124)
(cid:123)(cid:122)
(cid:125)
(cid:124)
5
5
6
), 𝛾 = (0, · · · , 0
, 0.1, · · · , 0.1
, 0.2, · · · , 0.2
)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
(cid:123)(cid:122)
(cid:125)
(cid:124)
(cid:123)(cid:122)
(cid:125)
(cid:124)
9
6
6

(cid:123)(cid:122)
21

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

)

• Case 4: 𝛼 = (0.07, · · · , 0.07
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)
(cid:124)

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

Case 1 and Case 2 came from the simulation settings of WIT method (Lin et al., 2024). Case 3

and Case 4 examined scenarios where the effects of IVs on both 𝑋 and 𝑌 were weaker compared to

Case 1 and Case 2. In Case 3, the effects of IVs on 𝑋 were heterogeneous, with a small subset of

IVs exhibiting stronger associations. Invalid IVs were evenly distributed across these two groups of

57

IVs. Case 4, on the other hand, is considered an even weaker IV scenario. This adjustment stems

from the fact that the situations originally considered in WIT, when measured by the first-stage

𝐹-statistics, are still far from the commonly considered weak IV threshold (i.e., 𝐹 < 10) (Staiger

and Stock, 1997). Specifically, when 𝑁 = 1000, the 𝐹-statistic is 513 in Case 1 and 73 in Case

2 (see Table 3.1). Therefore, there remains room to further lower the 𝐹-statistics. To this end,

we considered more extreme scenarios in Case 3 and Case 4, which present great challenges. In

addition, since this simulation setting assumes no noise, there is no selection bias to mitigate.

Therefore, we performed only 10 sample splits, which is sufficient in this case.

Table 3.1 Average 𝐹-statistics for the first stage of 2SLS in simulations without noise.

𝑁=1000 𝑁=3000
1536.61
513.65
216.85
73.06
96.90
33.02
47.97
16.68

Case 1
Case 2
Case 3
Case 4

Figure 3.5 shows the violin plots for the estimates obtained from each method under different

cases and sample sizes. Firstly, it is evident that the sisVIVE method exhibits a significantly larger

estimation bias compared to other methods across all cases and sample sizes. This is because

sisVIVE relies on a simple LASSO regression to estimate 𝛼 and assumes that more than half of the

IVs are valid. However, the scenarios we considered violate this assumption. In Case 1 and Case

2, although CIIV and WIT appear to produce estimates clustered around the true value, they also

generate some extreme outliers that deviate significantly from the true value. Comparatively, CIIV

performs slightly better than WIT under strong IV and large sample conditions. This is because CIIV

classifies IVs by constructing confidence intervals for the causal effect using each SNP individually

as an IV. With strong IVs and large samples, the constructed confidence intervals are more reliable.

However, this approach becomes less effective in scenarios with weak IVs. As a result, in Case

3 and Case 4, the bias of CIIV’s estimates progressively increases. While WIT performs slightly

better than CIIV when IVs are weak or sample sizes are small, it still produces many estimates that

deviate substantially from the true value. Furthermore, as the sample size increases, its performance

58

is even worse than that of CIIV. On the other hand, MR-SPLIT+ consistently produces results that

closely match those of the oracle TSLS estimates across all conditions, regardless of whether the

IVs are strong or weak, under different sample sizes.

Figure 3.5 Violin plots of the causal estimates in simulations without noise.

Figure 3.6a shows the absolute bias of the estimates obtained by these methods across various

cases. Notably, the bias produced by MR-SPLIT+ is almost as small as that of the oracle TSLS

estimates, while all other methods exhibit significantly larger biases. CIIV, however, only achieves

comparable results when the sample size is sufficiently large. Figure 3.6b presents the coverage

probabilities obtained by each method. sisVIVE is excluded from this analysis as it does not provide

a way to construct the confidence interval. When the sample size is large (𝑁 = 3000), MR-SPLIT+

achieves nearly 95% coverage probability, even under scenarios with weak IVs (Case 3 and Case

4). For a smaller sample (e.g., 𝑁 = 1000), although MR-SPLIT+ does not reach the 95% coverage

probability, it consistently demonstrates superior performance compared to WIT and CIIV across

all cases.

Figure 3.7 illustrates the False Negative Rate (FNR) and False Positive Rate (FPR) for identifying

invalid IVs. The FNR represents the proportion of invalid IVs incorrectly identified as valid ones,

while the FPR is the proportion of valid IVs incorrectly identified as invalid ones. Controlling the

59

(a) Absolute bias of estimators.

(b) Coverage Probability.

Figure 3.6 Comparison of absolute bias (a) and coverage probability (b) for different methods in
simulations without noise.

FNR is crucial because misclassifying invalid IVs as valid ones directly introduces systematic bias

into the causal effect estimates, potentially leading to severely biased conclusions. In contrast, a

higher FPR, which results in the unnecessary exclusion of valid IVs, primarily affects the efficiency

of the estimation rather than introducing substantial bias. The results demonstrate that MR-

SPLIT+ consistently achieves accurate identification of invalid IVs, maintaining a very low FPR

rate. In contrast, sisVIVE produces significantly higher FPR values followed by WIT. Regarding

the FNR, MR-SPLIT+ achieves values comparable to those of WIT, CIIV, and sisVIVE under small

sample sizes (𝑁 = 1000), while demonstrating substantially lower FNR under larger sample sizes

(𝑁 = 3000).

We also presented the actual breakdown of the selected valid and invalid IVs in Tables A.2

and Table A.3, respectively. These tables were used to compute the FPR and FNR. When the

IVs are strong, as in Case 1 and Case 2, MR-SPLIT+ demonstrated exceptionally high accuracy

in distinguishing between valid and invalid IVs. For instance, in Case 1 with a sample size of

𝑁 = 1000, MR-SPLIT+ identified an average of 9 valid IVs and 0.1 invalid ones as valid IVs while

the true number of valid IVs was 9 in each simulation run, which means it correctly selected all

60

Figure 3.7 Plots of FNR and FPR for IV selection in simulations without noise.

the valid ones and only 0.1 out of 9.1 IVs were misidentified (see Table A.2). In comparison, WIT

identified 7.7 valid and 0.2 invalid IVs as valid IVs. CIIV performed similarly to MR-SPLIT+,

selecting 9 valid and 0.2 invalid IVs as valid IVs. However, in Case 3 and Case 4, the performance

of CIIV deteriorated considerably, particularly when 𝑁 = 1000. Regarding the classification of

invalid IVs (see Table A.3), MR-SPLIT+ showed even stronger performance, with most cases

involving no misclassification of valid IVs as invalid. In contrast, WIT consistently misclassified a

substantial proportion of valid IVs as invalid, indicating a tendency toward over-selection of invalid

instruments.

3.4.3 Simulation with noise IVs

To better mimic real world conditions, we also explored a more realistic scenario by selecting IVs

from a pool of candidates that included both relevant instruments and noise variables. Specifically,

we generated a total of 300 variables and randomly picked 21 to have effects on the exposure 𝑋.

The remaining 279 variables are noise ones. The other settings remained consistent with those

described in Section 4 in the manuscript. To ensure a fair comparison, all methods (except the oracle

TSLS method) used the first stage of CIIV to select relevant IVs for the exposure. Specifically,

MR-SPLIT+ performed selection within each subset, while other methods conducted selection on

the whole sample. To ensure the robustness of MR-SPLIT+ estimations, the sample was split

61

30 times in each scenario. Additionally, all methods were evaluated through 1000 replications to

comprehensively assess their performance.

Figure A.25 in the Appendix presents violin plots of the estimates obtained by each method

under various cases and sample sizes in the presence of noise IVs. The results are generally

consistent with those observed in the absence of noise IVs. For the WIT, CIIV, and sisVIVE

methods, their estimates exhibit larger bias compared to scenarios without noise IVs. In contrast,

MR-SPLIT+ maintains nearly the same level of superior performance, demonstrating robustness

to the presence of noise IVs. This observation is further supported by the results shown in Figures

A.26 and A.27.

Figure A.27 illustrates the FNR and FPR for identifying invalid IVs. The results demonstrate

that in the presence of noise IVs, the performance of MR-SPLIT+ remains comparable to its

performance in the absence of noise, highlighting its superior ability to mitigate the selection bias

issue. In contrast, other methods exhibit a decline in performance compared to noise-free scenarios,

with the deterioration being particularly pronounced in small sample sizes (𝑁 = 1000).

The actual breakdown used to compute FNR and FPR shown in Figure A.27 is presented in

Tables A.4 and A.5 in the Appendix. Overall, MR-SPLIT+ maintains the highest accuracy of

correctly identifying those true valid IVs and of avoiding the incorrect classification of invalid

or noise IVs as valid. For example, in Table A.4, in Case 3 with a sample size of 𝑁 = 3000,

MR-SPLIT+ identified an average of 8.9 valid IVs out of 9 true IVs, while 0.3 invalid IVs and 1.4

noise were misidentified as valid IVs. In comparison, WIT identified only 4.4 true valid IVs out of

9 with a higher misclassification rate for invalid (1.7) and noise IVs (0.2). In all the cases, sisVIVE

has the worst performance.

3.5 Real Data Application

We evaluated two datasets described in Section 3.2. MR-SPLIT+ was evaluated alongside

TSLS, sisVIVE, CIIV, and WIT. The results obtained from MR-SPLIT+ are presented together

with those from the other methods to facilitate a comprehensive performance comparison.

62

3.5.1 Positive control: casual effect of BMI on DBP

The dataset initially comprised 502,505 individuals and included a total of four instances.

Considering both the accuracy of the dataset and the size of the sample, we selected version 1.0

for our analysis, which was conducted between 2012-2013. We filtered the data to include only

samples with complete exposure and outcome information, reducing the sample size to 50,497.

After merging this dataset with genetic data, we obtained a final sample of 39,889 individuals.

Additionally, we calculated the age of participants at the time of measurement based on their ‘birth

year’ and their ‘date attending assessment center’. Among them, 12,022 individuals were aged

between 44 and 59 years. The entire population ranged from 44 to 83 years. Considering the

impact of age on blood pressure, we performed MR analyses separately for the 44–59 age group

(‘young group’) and the overall population (‘all group’).

For the genetic data, we excluded variants with a missing rate above 10% and a MAF below 0.05.

Additionally, LD pruning was applied to remove highly correlated variants, a common practice

in MR analysis. This preprocessing resulted in a final set of 279k SNPs. We initially referred

to the GWAS study on BMI conducted by Locke et al. (Locke et al., 2015), which identified 97

variants with p-values < 5 × 10−8. After matching these variants with our existing genetic data,

we identified 21 candidate IVs. However, their associations with BMI in our sample were weak,

with some even exhibiting p-values greater than 0.05. Thus, we applied the SIS method (Fan and

Lv, 2008) to select the top 80 SNPs mostly associated with BMI as candidate IVs separately for the

‘young group’ and the ‘all group’. After combining the candidate IVs identified earlier from the

GWAS study, we obtained 101 candidate IVs for the ‘young group’ and 99 candidate IVs for the

‘all group’, respectively.

Table 3.2 shows the results of the ‘young group’. In the simplest case of OLS regression, the

two variables demonstrated a very strong correlation (p-value < 2 × 10−16). When estimating

the causal relationship using various MR methods, all approaches yielded p-values below 0.05.

Notably, CIIV and MR-SPLIT+ performed additional filtering of the candidate IVs, removing those

that were potentially too weak or noisy. MR-SPLIT+ retained 46 relevant IVs, while CIIV retained

63

69. For reference, the 𝐹-statistic in the first stage of TSLS was 12.37, indicating that this represents

a relatively weak set of IVs, which could lead to potential weak instrument bias.

Table 3.2 Comparison of results in the ‘young group’ when assessing the causal effect of
log(BMI) on log(DBP).

Method
OLS
TSLS
sisVIVE
CIIV
WIT

ˆ𝛽

pvalue
s.d.
0.2814 0.0067 <1e-15
0.2690 0.0231 <1e-15
0.2690
NA
0.2792 0.0236 <1e-15
0.2675 0.0238 <1e-15
2.7e-06

NA

lower CI upper CI Relevant IVs Valid IVs
0.2683
0.2238
NA
0.2329
0.2208
0.1509

0.2945
0.3142
NA
0.3255
0.3142
0.3669

NA
101
101
69
101
46

NA
101
101
69
101
46

Invalid IVs
NA
0
0
0
0
0

MR-SPLIT+ 0.2589 0.0551

Note: Sample size 𝑁 = 12, 022, Age: 44 ∼ 59, 𝐹-statistic = 12.37 in the first stage of TSLS..

Table 3.3 shows the results of the ‘all group’. Consistent with previous findings, significant

results were obtained across all MR methods when considering the entire population. Among

these, MR-SPLIT+ identified 56 relevant IVs from the 99 candidate ones and filtered 1 invalid IV.

The estimated causal effect was ˆ𝛽 = 0.2133, which is slightly lower than the causal effect estimated

in the ‘young group’.

Table 3.3 Comparison of results in the ‘all group’ when assessing the causal effect of log(BMI) on
log(DBP).

Method
OLS
TSLS
sisVIVE
CIIV
WIT

ˆ𝛽
0.2249
0.1972
0.1972
0.2186
0.0794
MR-SPLIT+ 0.2133

s.d.

pvalue
0.0040 <2e-16
0.0221 <2e-16

NA

NA

0.0244 <2e-16
0.0331 1.7e-02
0.0400 9.6e-08

lower CI upper CI Relevant IVs Valid IVs
0.2170
0.1540
NA
0.1708
0.0145
0.1349

0.2328
0.2404
NA
0.2664
0.1444
0.2916

NA
99
99
65
99
56

NA
99
99
64
48
55

Invalid IVs
NA
0
0
1
51
1

Note: Sample size 𝑁 = 39, 889, Age: 44 ∼ 83, 𝐹-statistic = 15.03 in the first stage of TSLS.

Following the significant causal effect of BMI on DBP, we conducted downstream mediation

analyses to explore potential biological pathways that may underlie this relationship. Candidate

mediators were selected based on prior biological knowledge linking adiposity to blood pressure

regulation. Specifically, we focused on biomarkers representing metabolic, inflammatory, renal,

hematologic, and lipid-related processes. To avoid redundancy and ensure interpretability, we ex-

cluded variables that are strongly collinear with BMI (e.g., waist circumference, hip circumference,

64

Table 3.4 Results of mediation analysis between log(BMI) and log(DBP).

Mediator
RBC
log(CRP)
log(Creatinine)
log(Glucose)
HDL

ˆ𝛽
0.0620
0.0033
0.0208
0.0190
0.0058
Note: ˆ𝛼 denotes the estimated effect from log(BMI) to the mediator, and ˆ𝛽 denotes the estimated
effect from the mediator to log(DBP), controlling for log(BMI). Prop represents the proportion
of the total effect that is mediated through the mediator.

adj. p-value
< 0.0001
0.0165
0.0003
0.0956
0.5142

ˆ𝛼
0.5070
2.5321
0.2205
0.1556
-0.8902

Prop
0.1392
0.0367
0.0201
0.0131
0.0228

and regional fat mass) or lacked clear biological plausibility as mediators. The final set of mediators

included red blood cell count (RBC), C-reactive protein (CRP), serum creatinine, fasting glucose,

and high-density lipoprotein (HDL). These variables were retained due to their well-established

physiological relevance in the context of obesity and cardiovascular regulation. Specifically, RBC

reflects hematologic changes that may influence blood viscosity and vascular resistance; CRP is a

widely used marker of systemic inflammation; creatinine serves as an indicator of renal function,

which is closely linked to blood pressure control; glucose levels capture metabolic disturbances as-

sociated with insulin resistance and sympathetic activation; and HDL represents lipid metabolism,

which has been implicated in the development of hypertension. These mediators collectively re-

flect diverse biological domains through which BMI may exert its influence on diastolic blood

pressure. The analysis was conducted on all available samples at version 1.0. Given the wide age

range, age was included as a covariate in the model. The significance of each candidate mediator

was assessed using the traditional Sobel test (Sobel, 1982), with Bonferroni correction applied for

multiple testing.

Table 3.4 presents the results of the mediation analysis assessing the biological pathways

through which BMI influences DBP. We also constructed a Directed Acyclic Graph (DAG), see

Fig 3.8, to illustrate the variable relationships, including only the statistically significant mediators.

Among the selected mediators, red blood cell count (RBC) mediated the largest proportion of the

effect (13.9%), consistent with the role of obesity in stimulating erythropoiesis, increasing blood

viscosity, and contributing to vascular resistance (He et al., 2023). C-reactive protein (CRP) and

65

serum creatinine were also significant mediators, contributing 3.7% and 2.0% of the total effect,

respectively. These findings support the hypothesis that systemic inflammation and early renal

function impairment are important biological mechanisms underlying obesity-associated increases

in diastolic blood pressure (Coresh et al., 2001). Although glucose and HDL were included based

on their established metabolic and lipid-related roles, their mediation proportions were relatively

modest (1.3% and 2.2%, respectively), and the associations did not reach statistical significance

after multiple testing correction.

Figure 3.8 DAG representing the significant mediation pathways between BMI and DBP.

Overall, these results underscore the multifactorial nature of the BMI–DBP relationship, impli-

cating hematologic, inflammatory, and renal pathways as key contributors.

3.5.2 Negative control: no causal effect of BMI on BW

As in the previous analysis, the initial sample size was 502,505, and we selected version 0.0 for

analysis based on the dataset size, which covers data collected between 2006 and 2010. To more

effectively evaluate the robustness of our method, we focused on the age group with the strongest

association, identified as individuals aged 60 to 71 years. After merging this subset with the genetic

data, the sample size was reduced to 85,248.

OLS regression revealed a significant correlation between log(BMI) and BW, with a p-value

less than 2 × 10−16. For candidate IV selection, similar to the previous approach, we first identified

21 SNPs associated with BMI from the GWAS study (Locke et al., 2015) within our genetic

66

dataset. Subsequently, we selected an additional 80 top SNPs strongly associated with BMI from

the dataset to serve as candidate IVs. Together, we had 101 SNP IVs included in the analysis.

Table 3.5 presents the causal effect estimates obtained using different MR methods. P-values

highlighted in bold font indicate statistically significant results, which, in this context, indicate a

false positive. Specifically, the TSLS method reported an estimate of ˆ𝛽 = 0.2279 with a p-value of

0.0413, suggesting that treating all selected candidate IVs as valid IVs can lead to biased estimates.

Similarly, the WIT method yields an even more biased estimate of ˆ𝛽 = −0.49 with a p-value of

0.044. This bias may stem from the method itself identifying too many invalid IVs, which are

likely misclassified, especially when compared to CIIV and MR-SPLIT+. In contrast, both CIIV

and MR-SPLIT+ produce non-significant results and estimates closer to zero, indicating greater

reliability and alignment with the expected null causal effect compared to the other methods.

Table 3.5 Comparison of results when assessing the causal effect of log(BMI) on birth weight.

Method
OLS
TSLS
sisVIVE
CIIV
WIT
MR-SPLIT+

ˆ𝛽
0.2193
0.2279
0.2279
0.0382
-0.4900
0.1279

s.d.

pvalue
0.0150 <2e-16
0.0413
0.1117
NA
NA
0.8044
0.1542
0.0440
0.2433
0.4767
0.1798

lower CI upper CI Relevant IVs Valid IVs
0.1900
0.0089
NA
-0.2640
-0.9668
-0.2244

0.2486
0.4468
NA
0.3403
-0.0132
0.4803

NA
100
100
35
100
35

NA
100
100
35
23
35

Invalid IVs
NA
0
0
0
77
0

Note: Sample size 𝑁 = 85, 248, Age: 60 ∼ 71, 𝐹-statistic = 15.59 in the first stage of TSLS.

3.6 Discussion

In this chapter, we proposed MR-SPLIT+, an innovative extension of the MR-SPLIT method

in one-sample MR analysis, under the relaxed plurality rule. Building upon the ability to address

selection bias and weak IV issues, MR-SPLIT+ further allows for the handling of invalid IVs, one

of the great challenges in MR analysis. The incorporation of the best subset selection method

and multiple splitting techniques enhances the robustness of the approach, significantly improving

the accuracy of invalid IV identification. This feature makes MR-SPLIT+ a powerful and reliable

tool for one-sample MR analysis. The result of the selection consistency provides a theoretical

guarantee for the method.

In fact, the assumptions required by MR-SPLIT+ can be further relaxed, and that our currently

67

proposed version of the relaxed plurality rule may still be overly restrictive. Due to limitations in

current theoretical development, we are unable to formally establish results for the case where the

specified number of invalid IVs 𝑘 is smaller than the true number of invalid IVs | 𝐴0|. However, in

practice, our method often tends to prioritize the selection of invalid IVs under such misspecification.

In other words, valid IVs are more likely to be shrunk toward zero, and thus excluded from the

model. Moreover, since the selected set of valid IVs must pass an overidentification test to confirm

that it contains only one group of instruments, choosing both valid IVs and invalid IVs (i.e., setting

𝑝 − 𝑘 > 𝑝 − | 𝐴0|) often leads to failure in this test. Consequently, the algorithm tends to increment

𝑘 step by step until the true value is reached. This iterative process implies that, under certain

conditions, our method may still work even when the number of valid IVs is smaller than that of

the invalid ones. Although we are currently unable to provide formal theoretical guarantees for this

behavior, empirical evidence supports our belief that the proposed method has strong robustness

and broad applicability in practical settings.

We conducted simulation studies to compare MR-SPLIT+ with state-of-the-art methods pro-

posed in recent years. The results demonstrate that, regardless of whether IVs are strong or weak,

our method consistently outperforms others, often achieving results comparable to the oracle TSLS

method (assuming the true set of valid IVs is known in advance). The findings from the multiple

splitting procedure further underscore its necessity, as the estimators obtained through repeated

splits yield more robust estimates and coverage probabilities closer to the nominal 95% level.

Though MR-SPLIT+ involves multiple sample splits, it does not necessarily introduce a high

computational cost compared to other approaches. For example, in the analysis of the BMI and

DBP dataset, the ‘young group’ consists of 12,022 samples. MR-SPLIT+ required only 1.39

seconds per split, and 30 splits took only 41.7 seconds. In contrast, WIT took 177.64 seconds,

and sisVIVE required 102.00 seconds. When the sample size increased to 39,889 in the ‘all

group’, MR-SPLIT+ took approximately 9.94 seconds per split, and 30 splits required only 298.3

seconds. In comparison, WIT required 661.91 seconds. Furthermore, as the sample size increases,

researchers can opt to reduce the number of splits in MR-SPLIT+ accordingly, as shown in the

68

simulation study, further saving the computational cost.

While the current method is already well-developed and robust, there remains room for further

refinement to broaden its applicability to more diverse scenarios, which will be investigated in

our future work. For instance, it could be adapted to accommodate binary outcomes and binary

exposures, a task that should be relatively straightforward. Additionally, the method could be

expanded to address bidirectional causal inference, enabling researchers to study reciprocal causal

relationships more effectively. Furthermore, an exciting direction for future research lies in ex-

tending MR-SPLIT+ to construct causal networks involving multiple exposures, facilitating a more

comprehensive understanding of complex causal structures in human diseases.

In summary, MR-SPLIT+ holds immense potential for further development and applications,

making it a versatile and powerful tool for advancing causal inference research.

69

CHAPTER 4

BIMR-SPLIT+ — BIDIRECTIONAL MR AND CAUSAL MECHANISM

4.1

Introduction

Understanding bidirectional or ambiguous causal relationships is essential in many scientific

domains, such as epidemiology, economics, and social sciences.

In practice, it is common to

encounter situations where either two traits may influence each other, or the direction of causality

is unknown. For example, the relationship between physical activity and mental health (Schuch

et al., 2018; Mammen and Faulkner, 2013), or between inflammation and depression (Khandaker

et al., 2014), may involve feedback loops or unclear temporal precedence. A particularly important

application arises in the construction of gene regulatory networks (Albert and Kruglyak, 2015),

where distinguishing between causal gene expression (which influences disease risk) and response

gene expression (which is influenced by the disease) is critical. Accurately identifying causal genes

enables researchers to prioritize therapeutic targets and avoid misdirected interventions that focus

on downstream biomarkers rather than the true drivers of disease. In such cases, robust statistical

methods that can infer or test for potential bidirectional causality are crucial for valid scientific

conclusions and effective policy or intervention design.

A typical approach in MR studies addressing potential bidirectional causality is to simply apply

univariable MR analyses in both directions—treating one trait as the exposure and the other as the

outcome, and then reversing the roles (Davey Smith and Hemani, 2014; Zhao et al., 2023; Maina

et al., 2023). However, this naive strategy often overlooks critical assumptions of MR, especially

the validity of the IVs in both directions. When the same set of genetic variants influences both

traits or when pleiotropy is present, the core IV assumptions may be violated, leading to biased and

misleading causal estimates.

In unidirectional Mendelian Randomization, numerous methods have been developed in recent

years to address the issue of invalid IVs. Notable examples include MR-Egger (Bowden et al.,

2015), sisVIVE (Kang et al., 2016), CIIV (Windmeijer et al., 2021), and WIT (Windmeijer et al.,

2021), each of which relies on specific identifying assumptions. Among them, the plurality rule

70

assumed by CIIV is relatively mild and has been considered advantageous in practice. However, this

assumption is inherently violated in the presence of bidirectional causality. Consider a setting where

the exposure and outcome exert causal effects on each other. In such bidirectional scenarios, the

validity of CIIV’s plurality rule is compromised. Specifically, when the number of SNPs affecting

the outcome exceeds the number of SNPs affecting the exposure, the instruments that primarily

influence the outcome may be mistakenly selected as valid IVs for estimating the causal effect from

exposure to outcome. This misclassification arises because the outcome, in turn, influences the

exposure, thereby inducing reverse associations that distort the instrument strength ranking required

by CIIV. As a result, the plurality rule, which assumes that the largest group of instruments reflects

the true causal direction, no longer holds. We will provide a detailed discussion of this issue in a

later section.

MR-SPLIT+ (Shi et al., 2025) is a recently proposed unified framework designed to address

several key challenges in one sample MR, including selection bias, weak instruments, and the

presence of invalid IVs. Notably, the assumptions underlying MR-SPLIT+ are even more relaxed

than the commonly adopted plurality rule, thereby offering a promising solution in settings with

bidirectional causality, where traditional assumptions often fail. Furthermore, the methodological

structure of MR-SPLIT+ is closely aligned with that of TSLS (Angrist et al., 1996), allowing for

considerable flexibility and adaptability in implementation. This combination of robustness and

flexibility makes MR-SPLIT+ a valuable tool for causal inference in complex MR scenarios.

In this study, inspired by the work Chen (2025), we extend the MR-SPLIT+ framework to the

context of bidirectional MR and named it as BiMR-SPLIT+. Leveraging the inherent flexibility of

the original model, we introduced several methodological modifications that substantially improved

its computational efficiency. To rigorously assess the performance of the proposed approach, we

conducted extensive simulation studies under a wide range of realistic scenarios. Furthermore,

we generalized the method to the construction of causal networks, aiming to capture the mutual

influences between gene expression and complex traits. Due to the challenges associated with

obtaining large-scale human datasets, we applied our approach to a dataset of approximately 180

71

Drosophila melanogaster individuals, focusing on uncovering bidirectional causal relationships

between gene expression levels and phototactic behavior.

4.2 Model and Methodology

Suppose we are interested in the causal effects between 𝑋 ∈ R𝑁×1 and 𝑌 ∈ R𝑁×1, consider the

following models:

𝑈 = 𝐺𝛼𝑢 + 𝜀𝑢

𝑋 = 𝐺𝛼𝑥 + 𝑌 𝛽𝑌 𝑋 + 𝑈𝜂𝑥 + 𝜀𝑥,

𝑌 = 𝐺𝛼𝑦 + 𝑋 𝛽𝑋𝑌 + 𝑈𝜂𝑦 + 𝜀𝑦

(4.1)

where 𝑈 ∈ R𝑁×𝑝𝑢 represents the unobserved confounders that may affect both 𝑋 and 𝑌 . The

parameters 𝛽𝑋𝑌 and 𝛽𝑌 𝑋 denote the causal effects of interest from 𝑋 to 𝑌 and from 𝑌 to 𝑋,

respectively. 𝐺 ∈ R𝑁×𝑝 denotes the matrix of SNPs that may influence 𝑋, 𝑌 , or both. The error

terms 𝜀𝑢, 𝜀𝑥, and 𝜀𝑦 are assumed to follow independent normal distributions. For the 𝑗-th SNP, 𝐺 𝑗

is considered invalid when estimating 𝛽𝑋𝑌 if 𝛼𝑦 + 𝛼𝑢𝜂𝑦 ≠ 0, and invalid when estimating 𝛽𝑌 𝑋 if

𝛼𝑥 + 𝛼𝑢𝜂𝑥 ≠ 0. See Figure 4.1 for an illustration.

Figure 4.1 Bidirectional MR.

To simplify, we rewrite Equations 4.1 as follows:

𝑋 = 𝐺 (𝛼𝑥 + 𝛼𝑢𝜂𝑥) + 𝑌 𝛽𝑌 𝑋 + (𝜀𝑥 + 𝜀𝑢𝜂𝑥)

= 𝐺𝛼1 + 𝑌 𝛽𝑌 𝑋 + 𝜀1,

𝑌 = 𝐺 (𝛼𝑦 + 𝛼𝑢𝜂𝑦) + 𝑋 𝛽𝑋𝑌 + (𝜀𝑦 + 𝜀𝑢𝜂𝑦)

= 𝐺𝛼2 + 𝑋 𝛽𝑋𝑌 + 𝜀2.

(4.2)

where 𝛼1 = 𝛼𝑥 + 𝛼𝑢𝜂𝑥 and 𝛼2 = 𝛼𝑦 + 𝛼𝑢𝜂𝑦. In this setting, 𝐺 𝑗 is considered invalid when

estimating 𝛽𝑋𝑌 if 𝛼2 ≠ 0, and invalid when estimating 𝛽𝑌 𝑋 if 𝛼1 ≠ 0. The error terms 𝜀1 and

72

𝜀2 are assumed to follow a bivariate normal distribution with nonzero covariance, induced by the

presence of unobserved confounders affecting both 𝑋 and 𝑌 .

4.2.1 Stage one

Based on the values of 𝛼1 𝑗 and 𝛼2 𝑗 , we classify SNP 𝑗 into one of the following three categories:

• 𝑆𝑋 = { 𝑗 : 𝛼1 𝑗 ≠ 0, 𝛼2 𝑗 = 0}: valid for 𝑋;

• 𝑆𝑌 = { 𝑗 : 𝛼1 𝑗 = 0, 𝛼2 𝑗 ≠ 0: valid for 𝑌 ;

• 𝑆𝐼 = { 𝑗 : 𝛼1 𝑗 ≠ 0, 𝛼2 𝑗 ≠ 0: invalid for both 𝑋 and 𝑌 .

In the following, we use 𝐺 𝐴 to denote the submatrix of 𝐺 consisting of SNPs indexed by the

set 𝐴, i.e., 𝐺 𝐴 = {𝐺 𝑗 : 𝑗 ∈ 𝐴}. Note that SNPs from all three groups may be selected as relevant

instruments for either 𝑋 or 𝑌 , as the bidirectional causal relationship between 𝑋 and 𝑌 can induce

correlations between 𝐺 𝑗 and both traits, regardless of the true direction of validity.

Nevertheless, while SNPs in 𝐺 𝑆𝑌 ∪𝑆𝐼 are invalid instruments for estimating the causal effect

from 𝑋 to 𝑌 , those that are also relevant for 𝑋 can still be included as covariates to reduce variance

and enhance estimation efficiency. Therefore, in stage one of the MR-SPLIT+ procedure, we

recommend using all selected relevant IVs without excluding potentially invalid ones.

For each direction, we first split the sample evenly into two equally sized subsets, then perform

IV selection separately in both subsets and take the union of the selected IV sets. Let ˆ𝑆𝑋1 denote

the union of selected IVs for 𝑋, and let ˆ𝑋 be the corresponding fitted value. Similarly, let ˆ𝑆𝑌 1

denote the union of selected IVs for 𝑌 , and let ˆ𝑌 be the corresponding fitted value.

4.2.2 Stage two

Following the work of MR-SPLIT+, we identify invalid IVs for 𝑋 by solving:

where (cid:101)𝐺 ˆ𝑆𝑋1

= 𝑀 ˆ𝑋𝐺 ˆ𝑆𝑋1

ˆ𝛼

2, ˆ𝑆𝑋1

∥𝛼

∥0

2, ˆ𝑆𝑋1

= arg min
𝛼
2, ˆ𝑆𝑋1
= (𝐼 − ˆ𝑋 ( ˆ𝑋′ ˆ𝑋)−1 ˆ𝑋′)𝐺 ˆ𝑆𝑋1

.

s.t.

∥𝑌 − (cid:101)𝐺 ˆ𝑆𝑋1

𝛼

2, ˆ𝑆𝑋1

∥2
2

< 𝛿,

(4.3)

Similarly, for the direction from 𝑌 to 𝑋, we identify invalid IVs for 𝑌 by solving:

ˆ𝛼
1, ˆ𝑆𝑌 1

= arg min
𝛼
1, ˆ𝑆𝑌 1

∥𝛼

1, ˆ𝑆𝑌 1

∥0

s.t.

∥ 𝑋 − (cid:101)𝐺 ˆ𝑆𝑌 1

𝛼

1, ˆ𝑆𝑌 1

∥2
2

< 𝛿,

(4.4)

73

where (cid:101)𝐺 ˆ𝑆𝑌 1

= 𝑀 ˆ𝑌 𝐺 ˆ𝑆𝑌 1

= (𝐼 − ˆ𝑌 ( ˆ𝑌 ′ ˆ𝑌 )−1 ˆ𝑌 ′)𝐺 ˆ𝑆𝑌 1

.

Before directly applying the stage two procedure of MR-SPLIT+ to identify invalid IVs, we

first exploit an observation that allows for the rapid pre-screening of a subset of clearly invalid

instruments.

From Equation 4.2, we can derive the following inequalities, as shown in Chen (2025):

Var(𝑋) > 𝛽2

𝑌 𝑋 Var(𝑌 ),

Var(𝑌 ) > 𝛽2

𝑋𝑌 Var(𝑋).

(4.5)

By substituting the first inequality into the second, we obtain:

Var(𝑌 ) > 𝛽2

𝑋𝑌 Var(𝑋) > 𝛽2

𝑋𝑌 𝛽2

𝑌 𝑋 Var(𝑌 ),

which implies that 𝛽𝑋𝑌 𝛽𝑌 𝑋 < 1.

This result holds under the assumption that the causal effects 𝛽𝑋𝑌 and 𝛽𝑌 𝑋 are well-defined and

finite. The condition 𝛽𝑋𝑌 𝛽𝑌 𝑋 < 1 further suggests that a bidirectional feedback system between 𝑋

and 𝑌 cannot exhibit unbounded amplification.

We could also rewrite Equation 4.1 as followings:

(4.6)

(4.7)

𝑋 =

𝑌 =

1
1 − 𝛽𝑌 𝑋 𝛽𝑋𝑌
1
1 − 𝛽𝑌 𝑋 𝛽𝑋𝑌

[𝐺 (𝛼2𝛽𝑌 𝑋 + 𝛼1) + (𝜀2𝛽𝑌 𝑋 + 𝜀1)],

[𝐺 (𝛼1𝛽𝑋𝑌 + 𝛼2) + (𝜀1𝛽𝑋𝑌 + 𝜀2)]

Now consider the correlation between 𝐺 𝑗 and 𝑋, and between 𝐺 𝑗 and 𝑌 :

(cid:18) corr(𝐺 𝑗 , 𝑋)
corr(𝐺 𝑗 , 𝑌 )

(cid:19) 2

=

cov2(𝐺 𝑗 , 𝑋) Var(𝑌 )
cov2(𝐺 𝑗 , 𝑌 ) Var(𝑋)

=

(cid:18) 𝛼2 𝑗 𝛽𝑌 𝑋 + 𝛼1 𝑗
𝛼1 𝑗 𝛽𝑋𝑌 + 𝛼2 𝑗

(cid:19) 2 Var(𝑌 )
Var(𝑋)

.

Suppose 𝐺 𝑗 is a valid instrument for 𝑋, i.e., 𝛼1 𝑗 ≠ 0 and 𝛼2 𝑗 = 0. Then,

(cid:18) corr(𝐺 𝑗 , 𝑋)
corr(𝐺 𝑗 , 𝑌 )

(cid:19) 2

=

1
𝛽2

𝑋𝑌

·

Var(𝑌 )
Var(𝑋)

> 1.

Similarly, if 𝐺 𝑗 is a valid instrument for 𝑌 , i.e., 𝛼2 𝑗 ≠ 0 and 𝛼1 𝑗 = 0, then

(cid:18) corr(𝐺 𝑗 , 𝑋)
corr(𝐺 𝑗 , 𝑌 )

(cid:19) 2

= 𝛽2

𝑌 𝑋 ·

Var(𝑌 )
Var(𝑋)

< 1.

Based on these results, we establish the following proposition:

74

Proposition 1. For each 𝐺 𝑗 , 𝑗 = 1, . . . , 𝑝, if |corr(𝐺 𝑗 , 𝑋)| < |corr(𝐺 𝑗 , 𝑌 )|, then 𝐺 𝑗 cannot be a

valid instrument for 𝑋. Conversely, if |corr(𝐺 𝑗 , 𝑋)| > |corr(𝐺 𝑗 , 𝑌 )|, then 𝐺 𝑗 cannot be a valid

instrument for 𝑌 .

In stage two, prior to applying best subset selection to identify invalid IVs, we incorporate the

insights from Proposition 1 to improve both computational efficiency and accuracy. Specifically, for

each selected IV 𝐺 𝑗 , where 𝑗 ∈ ˆ𝑆1 = ˆ𝑆𝑋1 ∪ ˆ𝑆𝑌 1, we compute its empirical correlation with both 𝑋

and 𝑌 . If |corr(𝐺 𝑗 , 𝑋)| < |corr(𝐺 𝑗 , 𝑌 )|, label 𝐺 𝑗 as invalid for 𝑋, if |corr(𝐺 𝑗 , 𝑋)| > |corr(𝐺 𝑗 , 𝑌 )|,

label 𝐺 𝑗 as invalid for 𝑌 . Let ˆ𝐴𝑋1 denote the set of pre-identified invalid IVs for 𝑋, i.e., ˆ𝐴𝑋1 = { 𝑗 :

|corr(𝐺 𝑗 , 𝑋)| < |corr(𝐺 𝑗 , 𝑌 )|, 𝑗 ∈ ˆ𝑆𝑋1}. And let ˆ𝐴𝑌 1 denote the set of pre-identified invalid IVs

for 𝑌 . Then the remaining IVs are ˆ𝑆𝑋2 = ˆ𝑆𝑋1 \ ˆ𝐴𝑋1 for 𝑋 and ˆ𝑆𝑌 2 = ˆ𝑆𝑌 1 \ ˆ𝐴𝑌 1 for 𝑌 .

So now we could reformulate Problem 4.3 as:

ˆ𝛼

2, ˆ𝑆𝑋2

= arg min
𝛼
2, ˆ𝑆𝑋2

∥𝛼

∥0

2, ˆ𝑆𝑋2

s.t.

∥(cid:101)𝑌 − (cid:101)(cid:101)𝐺 ˆ𝑆𝑋2

𝛼

2, ˆ𝑆𝑋2

∥2
2

< 𝛿,

where (cid:101)𝑌 = 𝑀

(cid:101)𝐺 ˆ𝐴𝑋1
Similarly, reformulate Problem 4.4 as:

𝑌 = (𝐼 − (cid:101)𝐺 ˆ𝐴𝑋1

(cid:101)𝐺 ˆ𝐴𝑋1

( (cid:101)𝐺′

ˆ𝐴𝑋1

)−1 (cid:101)𝐺′

ˆ𝐴𝑋1

)𝑌 , (cid:101)(cid:101)𝐺 ˆ𝑆𝑋2

= 𝑀

(cid:101)𝐺 ˆ𝐴𝑋1

(cid:101)𝐺 ˆ𝑆𝑋2

.

ˆ𝛼
2, ˆ𝑆𝑌 2

= arg min
𝛼
2, ˆ𝑆𝑌 2

∥𝛼

2, ˆ𝑆𝑌 2

∥0

s.t.

∥ (cid:101)𝑋 − (cid:101)(cid:101)𝐺 ˆ𝑆𝑌 2

𝛼

2, ˆ𝑆𝑌 2

∥2
2

< 𝛿,

(4.8)

(4.9)

where (cid:101)𝑋 = 𝑀

𝑋 = (𝐼 − (cid:101)𝐺 ˆ𝐴𝑌 1

( (cid:101)𝐺′

(cid:101)𝐺 ˆ𝐴𝑌 1

)−1 (cid:101)𝐺′

) 𝑋, (cid:101)(cid:101)𝐺 ˆ𝑆𝑌 2

= 𝑀

ˆ𝐴𝑌 1
In summary, based on the criterion established in Proposition 1, we pre-identify a subset of

ˆ𝐴𝑌 1

(cid:101)𝐺 ˆ𝑆𝑌 2

.

(cid:101)𝐺 ˆ𝐴𝑌 1

(cid:101)𝐺 ˆ𝐴𝑌 1

clearly invalid IVs. We then only select invalid IVs within the remaining, ambiguous instruments.

This targeted screening step significantly reduces the search space, leading to substantial improve-

ments in computational speed and estimation performance.

We summarize the algorithm in Algorithm 4.1.

4.3 Simulation Studies

In the simulation study, we aim to mimic realistic scenarios as closely as possible. For each

replicate, we generate a total of 1000 SNPs, among which 30 are randomly selected to be associated

with either 𝑋 or 𝑌 . This implies that 970 SNPs are irrelevant and serve as noise variables. The

75

Algorithm 4.1 The analytical procedure of BiMR-SPLIT+.
Input: {𝑋, 𝑌 , 𝐺}, 𝑋, 𝑌 ∈ R𝑁×1, 𝐺 ∈ R𝑁×𝑝, 𝑝 ≫ 𝑁.
Output: The causal effect ˆ𝛽.
1: Employ screening methods, such as SIS, to reduce the number of SNPs to a manageable size

that less than N, typically a hundred.

2: For 𝑡 = 1, . . . , 𝑇

• Get estimated ˆ𝑋 and ˆ𝑌 by splitting the sample into two evenly subsets. Denote ˆ𝑆𝑋1 and

ˆ𝑆𝑌 1 as the selected IV sets, respectively.

• For each selected SNP 𝑗 ∈ ˆ𝑆𝑋1 ∪ ˆ𝑆𝑌 1, if |corr(𝐺 𝑗 , 𝑋)| < |corr(𝐺 𝑗 , 𝑌 )|, label 𝐺 𝑗 as
invalid for 𝑋, if |corr(𝐺 𝑗 , 𝑋)| > |corr(𝐺 𝑗 , 𝑌 )|, label 𝐺 𝑗 as invalid for 𝑌 ; The remaining
undefined IVs are denoted as ˆ𝑆𝑋2 and ˆ𝑆𝑌 2, respectively.

• Starting from 𝑘 = 0, set 𝛼

2, then perform the
overidentification test. If the testing p-value ≤ 0.05, increment 𝑘 by 1 (i.e., 𝑘 = 𝑘 + 1)
and repeat the process until the testing p-value 𝑝 > 0.05. And similarly for 𝛼

2, ˆ𝑆𝑋2,𝑘 = arg min𝛼

,𝑘 ∥(cid:101)𝑌 − (cid:101)(cid:101)𝐺 ˆ𝑆𝑋2

𝛼
2, ˆ𝑆𝑋2,𝑘 ∥2

2, ˆ𝑆𝑋2

.

1, ˆ𝑆𝑋2

• Regress Y on ˆ𝑋 while including the selected invalid IVs as covariates to get the estimated

causal effect ˆ𝛽𝑋𝑌 ,𝑡.

• Regress X on ˆ𝑌 while including the selected invalid IVs as covariates to get the estimated

causal effect ˆ𝛽𝑌 𝑋,𝑡.

3: Perform the same sample split procedure 𝑇 times and choose

ˆ𝛽𝑋𝑌 = arg

min
ˆ𝛽𝑋𝑌 ,𝑡 ∈{ ˆ𝛽𝑋𝑌 ,𝑡 :𝑡=1,...,𝑇 }

ˆ𝛽𝑌 𝑋 = arg

min
ˆ𝛽𝑌 𝑋,𝑡 ∈{ ˆ𝛽𝑌 𝑋,𝑡 :𝑡=1,...,𝑇 }

(cid:12) ˆ𝛽𝑋𝑌 ,𝑡 − median({ ˆ𝛽𝑋𝑌 ,𝑡 : 𝑡 = 1, . . . , 𝑇 })(cid:12)
(cid:12)
(cid:12) ,

(cid:12) ˆ𝛽𝑌 𝑋,𝑡 − median({ ˆ𝛽𝑌 𝑋,𝑡 : 𝑡 = 1, . . . , 𝑇 })(cid:12)
(cid:12)
(cid:12) ,

as the final causal effect estimates.

phenotypes 𝑋 and 𝑌 are generated according to Model 4.6. The matrix of genotypes 𝐺 is simulated

as independent discrete variables taking values in {0, 1, 2}, representing the allele count under an

additive model, with a fixed MAF of 0.3. The error terms 𝜀1𝑖 and 𝜀2𝑖 are independently generated

across individuals from a bivariate normal distribution with zero mean, unit variances, and a

correlation of 0.3, that is,

(𝜀1𝑖, 𝜀2𝑖)⊤ ∼ N (cid:169)
(cid:173)
(cid:173)
(cid:171)

0, (cid:169)
(cid:173)
(cid:173)
(cid:171)

1

0.3

0.3

1

,

(cid:170)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:172)

𝑖 = 1, . . . , 𝑁.

This correlation reflects the presence of unmeasured confounding between 𝑋 and 𝑌 . We consider

three different sample sizes in the simulation study, with 𝑁 ∈ {1000, 2000, 4000}.

76

To reflect different causal structures commonly encountered in practice, we consider the fol-

lowing three settings for the bidirectional causal effects:

• 𝛽𝑋𝑌 = 0, 𝛽𝑌 𝑋 = 0 (no causal relationship);

• 𝛽𝑋𝑌 = 0.75, 𝛽𝑌 𝑋 = 0 (unidirectional causality from 𝑋 to 𝑌 );

• 𝛽𝑋𝑌 = 0.5, 𝛽𝑌 𝑋 = 1 (bidirectional causality).

For the direct effects of SNPs on the phenotypes, we consider two distinct configurations of the

vectors 𝛼1 and 𝛼2:

• Scenario 1:

𝛼1 = (0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)

(cid:124)

(cid:123)(cid:122)
7

, 0, · · · , 0
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
7

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.3, · · · , 0.3
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.2, · · · , 0.2
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

, 0.1, 0.1, 0.4, 0.4),

𝛼2 = (0, · · · , 0
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
7

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
7

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.2, · · · , 0.2
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.3, · · · , 0.3
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

, 0.4, 0.4, 0.1, 0.1).

• Scenario 2:

𝛼1 = (0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:125)

(cid:124)

(cid:123)(cid:122)
7

, 0, · · · , 0
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
7

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.3, · · · , 0.3
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.2, · · · , 0.2
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.1, · · · , 0.1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

),

𝛼2 = (0, · · · , 0
(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
7

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
7

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.4, · · · , 0.4
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.2, · · · , 0.2
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.3, · · · , 0.3
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)

, 0.1, · · · , 0.1
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
4

).

Scenario 1 satisfies the plurality rule when considering unidirectional MR under no bidirectional

causality. That is, the largest group of selected IVs that generate the same causal effects are the

valid ones.

Scenario 2 represents a more challenging setting in which the plurality rule is violated regardless

of whether bidirectional causality is present. For instance, under the no causal relationship setting

(𝛽𝑋𝑌 = 𝛽𝑌 𝑋 = 0), when estimating 𝛽𝑋𝑌 , the first 7 SNPs are valid instruments with 𝛼1 = 0.4 and

𝛼2 = 0, leading to an estimated causal effect of ˆ𝛽𝑋𝑌 = 0. However, there are 8 other SNPs that

contribute to an estimated ˆ𝛽𝑋𝑌 = 1, including 4 SNPs with 𝛼1 = 𝛼2 = 0.4 and 4 with 𝛼1 = 𝛼2 = 0.1.

77

As a result, the invalid instruments dominate numerically, and the plurality of instruments support

an incorrect causal direction, thereby violating the plurality rule. It is designed to evaluate the

robustness of the BiMR-SPLIT+.

We compare the performance of BiMR-SPLIT+ against the following benchmark methods to

demonstrate its superiority:

• Oracle TSLS, which assumes complete knowledge of the valid and invalid IVs;

• CIIV, a consistent IV selection and estimation method;

• MR-Egger, a widely used method for addressing directional pleiotropy in Mendelian ran-

domization.

4.3.1 Simulation results of scenario 1

Table A.7 in the Appendix presents the simulation results for Scenario 1, where both 𝛽𝑋𝑌 = 0

and 𝛽𝑌 𝑋 = 0. This setting reflects a null causal relationship in both directions. Among the non-

oracle methods, BiMR-SPLIT+ consistently yields bias estimates closest to those of Oracle TSLS,

especially as sample size increases. For example, the bias for BiMR-SPLIT+ reduces from 0.0495

at 𝑁 = 1000 to 0.0030 at 𝑁 = 4000 in the 𝑋 → 𝑌 direction, and from 0.0498 to –0.0028 in the

𝑌 → 𝑋 direction.

BiMR-SPLIT+ also achieves lower RMSE compared to MR-Egger and CIIV across all settings,

demonstrating greater estimation precision. While its coverage probability (CP) is initially below

the nominal 95%, it improves with larger sample size, from 0.68 to 0.87 in the 𝑋 → 𝑌 direction and

from 0.65 to 0.94 in the 𝑌 → 𝑋 direction, indicating that the method becomes increasingly reliable

as data size grows.

In addition, we report the false positive rate (FPR) and false negative rate

(FNR), which respectively measure the proportion of invalid IVs that are incorrectly retained and

valid IVs that are incorrectly excluded. These two results of BiMR-SPLIT+ both are decreasing as

the sample size increasing, which also shows the validity of identifying invalid IVs for this method.

In contrast, MR-Egger exhibits high coverage but at the expense of large RMSE and much

wider confidence intervals (e.g., width up to 1.28), and tends to produce more biased estimates,

78

especially under small sample sizes. Although the plurality rule is not violated in this scenario,

CIIV still performs poorly, with extremely high bias, RMSE approaching or exceeding 0.9, and

very low coverage (as low as 6%–44%). Its performance improves as the sample size increases:

bias gradually decreases, and FPR/FNR indicate more accurate instrument classification. This

suggests that CIIV requires either a sufficiently large sample size or strong instruments to reliably

distinguish between different groups of IVs. When the group separation is weak, the method

struggles to identify the correct set of valid instruments, leading to poor estimation performance.

When considering the setting 𝛽𝑋𝑌 = 0.75 and 𝛽𝑌 𝑋 = 0, the advantages of BiMR-SPLIT+

become even more evident, see Table 4.1.

In both directions (𝑋 → 𝑌 and 𝑌 → 𝑋), BiMR-

SPLIT+ consistently ranks as the second-best method after the Oracle TSLS, and its performance

becomes increasingly comparable to the oracle estimator as the sample size grows. Notably, the

FPR and FNR for classifying invalid and valid IVs approach zero with larger sample sizes, clearly

demonstrating the superiority of BiMR-SPLIT+ over CIIV, whose classification ability nearly fails.

Additionally, the estimation bias of BiMR-SPLIT+ diminishes, and the coverage probability (CP)

converges steadily toward the nominal level of 95%.

For comparison, MR-Egger suffers from substantial bias, and even with its overly wide confi-

dence intervals, it still fails to attain acceptable coverage in the 𝑌 → 𝑋 direction. CIIV also remains

suboptimal, with both FPR and FNR failing to reach ideal levels, indicating that its underlying as-

sumptions for effectively identifying valid instruments are more stringent than those required by

BiMR-SPLIT+, and may be difficult to satisfy in practical applications.

Table 4.2 are the results when 𝛽𝑋𝑌 = 0.5 and 𝛽𝑌 𝑋 = 1. Under this setting, BiMR-SPLIT+

exhibits similar performance as in previous scenarios. Although in the 𝑋 → 𝑌 direction with

𝑁 = 1000, the coverage probability (CP) does not reach the nominal 95% and its FPR is 0.03, both

metrics improve rapidly with larger sample size. When 𝑁 = 2000, the FPR drops to 0.00 and the CP

rises to 100%, matching the performance of the Oracle TSLS. This result highlights two aspects:

first, even a small fraction of misclassified invalid IVs can lead to considerable estimation bias;

second, BiMR-SPLIT+ demonstrates high efficiency in identifying valid and invalid instruments.

79

Table 4.1 Simulation results of scenario 1 when 𝛽𝑋𝑌 = 0.75, 𝛽𝑌 𝑋 = 0.

Settings

N

1000

𝛽𝑋𝑌 = 0.75

2000

4000

1000

𝛽𝑌 𝑋 = 0

2000

4000

FPR FNR
Est.sd RMSE CI Width CP
Bias
Method
0.00
0.00
0.99
0.0219
0.0220
0.0019
Oracle TSLS
0.01
0.13
0.84
0.0466 0.0571
0.0332
BiMR-SPLIT+
NA
0.98 NA
-0.2152 0.2155 0.3044
MR-Egger
0.88
0.63
0.11
0.3744 0.9856
0.9118
CIIV
0.00
1.00
0.0000
0.00
0.0156 0.0156
Oracle TSLS
0.01
0.06
0.88
0.0367 0.0382
0.0107
BiMR-SPLIT+
NA
0.99 NA
-0.1938 0.1574 0.2496
MR-Egger
0.52
0.28
0.44
0.7205 1.0059
0.7027
CIIV
0.00
0.00
0.99
0.0119 0.0119
0.0002
Oracle TSLS
0.00
0.94
BiMR-SPLIT+ -0.0004 0.0236 0.0236
0.01
MR-Egger
NA
1.00 NA
-0.1766 0.1360 0.2229
0.35
0.17
0.62
CIIV
0.5569
0.7739 0.9529
0.00
0.00
0.95
0.0233 0.0234
0.0021
Oracle TSLS
0.01
0.86
BiMR-SPLIT+ -0.0047 0.0296 0.0299
0.02
MR-Egger
NA
0.50 NA
0.1214 0.4031
0.3844
0.80
0.30
0.17
CIIV
0.3990
0.2800 0.4873
0.00
0.00
0.96
0.0162 0.0162
0.0003
Oracle TSLS
0.01
0.91
BiMR-SPLIT+ -0.0079 0.0177 0.0194
0.00
NA
0.67 NA
0.0982 0.3587
0.3450
MR-Egger
0.84
0.28
0.14
0.3577 0.5847
0.4628
CIIV
0.00
0.00
0.95
Oracle TSLS
0.0114 0.0114
0.0002
0.00
0.94
BiMR-SPLIT+ -0.0039 0.0115 0.0121
0.00
NA
0.74 NA
0.0693 0.3475
0.3406
0.62
0.19
0.34
0.5394 0.7415
0.5094

0.1295
0.1690
1.2879
0.1960
0.0913
0.1209
1.1001
0.1797
0.0642
0.0884
1.0024
0.1293
0.0914
0.0881
0.7808
0.0737
0.0644
0.0642
0.7902
0.0575
0.0453
0.0453
0.8062
0.0510

MR-Egger
CIIV

CIIV and MR-Egger show performance patterns consistent with earlier settings. For CIIV,

insufficient accuracy in identifying invalid IVs leads to substantial bias. For MR-Egger, the ex-

cessively wide confidence intervals result in low estimation precision, despite achieving relatively

high coverage in some cases.

4.3.2 Simulation results of scenario 2

Table A.8 in the Appendix presents the simulation results for Scenario 2 under the no causal

relationship setting, where 𝛽𝑋𝑌 = 𝛽𝑌 𝑋 = 0. Across all sample sizes (𝑁 = 1000, 2000, 4000),

BiMR-SPLIT+ achieves consistently low bias and RMSE. For instance, when 𝑁 = 1000, BiMR-

SPLIT+ has a bias of 0.0328 and RMSE of 0.1364, substantially lower than MR-Egger (bias =

–0.1046, RMSE = 0.2440) and CIIV (bias = 0.8653, RMSE = 0.9387).

BiMR-SPLIT+ also maintains coverage probabilities around 90%, with moderate confidence

80

Table 4.2 Simulation results of scenario 1 when 𝛽𝑋𝑌 = 0.5, 𝛽𝑌 𝑋 = 1.

Settings

N

1000

𝛽𝑋𝑌 = 0.5

2000

4000

1000

𝛽𝑌 𝑋 = 1

2000

4000

Bias
Method
Oracle TSLS
0.0023
BiMR-SPLIT+ -0.0342
0.2290
MR-Egger
0.3200
CIIV
0.0007
Oracle TSLS
0.0021
BiMR-SPLIT+
0.2317
MR-Egger
0.3003
CIIV
0.0004
Oracle TSLS
BiMR-SPLIT+ -0.0124
MR-Egger
0.2275
CIIV
0.2721
0.0007
Oracle TSLS
BiMR-SPLIT+ -0.0403
0.0781
MR-Egger
0.2045
CIIV
Oracle TSLS
0.0000
BiMR-SPLIT+ -0.0230
0.0261
MR-Egger
0.2709
CIIV
Oracle TSLS
0.0000
BiMR-SPLIT+ -0.0123
0.0145
0.2953

MR-Egger
CIIV

FPR FNR
Est.sd RMSE CI Width CP
0.00
0.00
1.00
0.0108
0.00
0.03
0.86
0.0208
NA
0.06 NA
0.0468
0.95
0.34
0.04
0.1149
0.00
1.00
0.0078
0.00
0.00
0.00
1.00
0.0078
NA
0.04 NA
0.0383
0.87
0.29
0.12
0.1399
0.00
0.00
1.00
0.0060
0.00
0.99
0.0065
0.00
NA
0.01 NA
0.0245
0.68
0.21
0.30
0.2067
0.00
0.00
1.00
0.0117
0.00
0.94
0.0228
0.05
NA
0.98 NA
0.0732
0.75
0.31
0.22
0.1522
0.00
0.00
1.00
0.0081
0.00
1.00
0.0116
0.01
NA
1.00 NA
0.0564
0.82
0.28
0.16
0.2674
0.00
0.00
1.00
0.0057
0.00
1.00
0.0061
0.00
NA
1.00 NA
0.0470
0.55
0.17
0.41
0.3888

0.0111
0.0400
0.2337
0.3400
0.0078
0.0081
0.2348
0.3312
0.0060
0.0140
0.2288
0.3416
0.0117
0.0463
0.1070
0.2548
0.0081
0.0258
0.0621
0.3805
0.0057
0.0137
0.0492
0.4879

0.1131
0.1080
0.3094
0.0302
0.0801
0.0797
0.3122
0.0231
0.0563
0.0565
0.3189
0.0197
0.1471
0.1409
0.4894
0.0454
0.1037
0.1043
0.4385
0.0406
0.0730
0.0736
0.4406
0.0370

interval widths (e.g., 0.154 at 𝑁 = 1000 for ˆ𝛽𝑋𝑌 ).

In contrast, MR-Egger, although achieving

perfect coverage (e.g., 0.99–1.00), does so at the cost of much wider confidence intervals (e.g.,

from 1.04 to 1.30), indicating low estimation efficiency.

CIIV performs particularly poorly in this null scenario, with severe bias and extremely low

coverage probabilities. The FNRs for CIIV approach 0.86–0.94 across all settings, indicating that

the method fails to preserve the majority of valid instruments. This failure is primarily due to the

violation of the plurality rule under Scenario 2, where the largest group of IVs no longer represents

the valid IVs.

These results indicate that BiMR-SPLIT+ achieves a favorable trade-off between bias, confi-

dence interval efficiency, and coverage, offering reliable Type I error control without being overly

conservative like MR-Egger or unstable like CIIV. Its low false discovery rates (e.g., FPR =

81

0.032–0.10 and FNR = 0.029–0.038) further highlight its robustness in invalid IV selection.

Table 4.3 reports the simulation results for Scenario 2 under the setting 𝛽𝑋𝑌 = 0.75 and 𝛽𝑌 𝑋 = 0,

representing the causal effect only from 𝑋 to 𝑌 .

For the direction from 𝑋 to 𝑌 , across all sample sizes (𝑁 = 1000, 2000, 4000), BiMR-SPLIT+

yields highly accurate estimates of 𝛽𝑋𝑌 , with biases close to zero and consistently low RMSE

values. While MR-Egger achieves nominal coverage near 1.0, it suffers from significantly inflated

RMSE and wide confidence intervals. CIIV, on the other hand, completely fails in this setting due

to violation of the plurality rule, resulting in substantial bias, RMSE exceeding 0.93, and coverage

below 12%.

In the reverse direction from 𝑌 to 𝑋, where no causal effect exists, BiMR-SPLIT+ maintains low

bias and coverage rates close to 90%. In contrast, MR-Egger fails to estimate 𝛽𝑌 𝑋 accurately, with

biases exceeding 0.34, high RMSE values, and coverage probabilities near 60%, indicating serious

false positives. CIIV again performs poorly, with false positive rates (FPR) exceeding 30–40% and

false negative rates (FNR) close to 1.0, reflecting its inability to select valid instruments under this

challenging structure.

Table 4.4 reports the simulation results for Scenario 2 under the bidirectional causal setting,

where 𝛽𝑋𝑌 = 0.5 and 𝛽𝑌 𝑋 = 1.0. This is a particularly challenging case, as invalid IVs are present

for both directions and the plurality rule is violated.

For the direction from 𝑋 → 𝑌 , BiMR-SPLIT+ achieves moderate to good estimation perfor-

mance. Although the bias is somewhat inflated at 𝑁 = 1000 (–0.0724), it improves with increasing

sample size, reaching –0.0117 at 𝑁 = 4000. RMSE decreases steadily from 0.0883 to 0.0133, and

coverage increases from 67% to 100%. Similar to earlier scenarios, although MR-Egger maintains

high coverage across all settings, it suffers from excessively wide confidence intervals, limiting its

estimation efficiency.

For the reverse direction 𝑌 → 𝑋, where the true causal effect is stronger (𝛽𝑌 𝑋 = 1.0), BiMR-

SPLIT+ remains accurate and stable. The bias decreases from –0.0557 at 𝑁 = 1000 to –0.0126

at 𝑁 = 4000, with corresponding reductions in RMSE. In contrast, MR-Egger exhibits substantial

82

Table 4.3 Simulation results of scenario 2 when 𝛽𝑋𝑌 = 0.75, 𝛽𝑌 𝑋 = 0.

Settings

N

1000

𝛽𝑋𝑌 = 0.75

2000

4000

1000

𝛽𝑌 𝑋 = 0

2000

4000

Bias
Method
Oracle TSLS
0.0220
0.0019
BiMR-SPLIT+ -0.0050 0.0443 0.0445
-0.1006 0.2213 0.2429
MR-Egger
0.3642 0.9387
0.8653
CIIV
Oracle TSLS
0.0156 0.0156
0.0000
BiMR-SPLIT+ -0.0083 0.0289 0.0300
-0.0143 0.1435 0.1441
MR-Egger
0.3276 0.9403
0.8815
CIIV
0.0119 0.0119
0.0002
Oracle TSLS
BiMR-SPLIT+ -0.0035 0.0203 0.0206
MR-Egger
0.1112 0.1144
0.0271
CIIV
0.9272
0.2663 0.9646
0.0233 0.0234
0.0021
Oracle TSLS
BiMR-SPLIT+ -0.0071 0.0711 0.0714
MR-Egger
0.1157 0.3686
0.3500
CIIV
0.5651
0.2502 0.6179
0.0162 0.0162
0.0003
Oracle TSLS
BiMR-SPLIT+ -0.0030 0.0589 0.0589
0.0830 0.3483
0.3383
0.1757 0.5875
0.5607
0.0114 0.0114
0.0002
0.0886 0.0892
0.0116
0.0560 0.3496
0.3451
0.0743 0.5824
0.5776

FPR FNR
Est.sd RMSE CI Width CP
0.00
0.00
0.99
0.0219
0.00
0.07
0.96
NA
0.99 NA
0.86
0.69
0.12
0.00
1.00
0.00
0.01
0.04
0.97
NA
1.00 NA
0.89
0.60
0.11
0.00
0.00
0.99
0.00
0.97
0.01
NA
1.00 NA
0.93
0.52
0.06
0.00
0.00
0.95
0.03
0.88
0.02
NA
0.62 NA
0.92
0.39
0.07
0.00
0.00
0.96
0.02
0.91
0.00
NA
0.62 NA
0.95
0.35
0.05
0.00
0.00
0.95
0.03
0.92
0.01
NA
0.56 NA
1.00
0.35
0.00

0.1295
0.1808
1.3059
0.1865
0.0913
0.1275
1.1317
0.1345
0.0642
0.0895
1.0386
0.1028
0.0914
0.0909
0.7787
0.0580
0.0644
0.0643
0.7305
0.0373
0.0453
0.0452
0.7232
0.0236

MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV

bias and poor coverage (as low as 2%), indicating that it may fail to provide reliable estimates in

this direction. CIIV again performs poorly, with severe bias, low coverage, and high false negative

rates, confirming its ineffectiveness under this setting.

In summary, BiMR-SPLIT+ consistently demonstrates strong performance in terms of estima-

tion accuracy, robustness, and instrument selection. Especially in the most challenging bidirectional

causal scenario, where causal effects exist for both directions and the plurality rule is violated,

BiMR-SPLIT+ remains the only method that achieves stable and accurate estimation in both di-

rections. As sample size increases, its bias and RMSE steadily decrease and coverage improves

to approach or reach the nominal level. Taken together, these results highlight the robustness and

adaptability of BiMR-SPLIT+ across a wide range of causal structures and IV validity conditions,

making it a practical and reliable tool for bidirectional MR analysis.

83

Table 4.4 Simulation results of scenario 2 when 𝛽𝑋𝑌 = 0.5, 𝛽𝑌 𝑋 = 1.

Settings

N

1000

𝛽𝑋𝑌 = 0.5

2000

4000

1000

𝛽𝑌 𝑋 = 1

2000

4000

Bias
Method
Oracle TSLS
0.0016
BiMR-SPLIT+ -0.0724
0.0523
MR-Egger
0.2996
CIIV
Oracle TSLS
0.0004
BiMR-SPLIT+ -0.0255
0.0457
MR-Egger
0.3352
CIIV
0.0003
Oracle TSLS
BiMR-SPLIT+ -0.0117
MR-Egger
0.0461
CIIV
0.3303
0.0014
Oracle TSLS
BiMR-SPLIT+ -0.0557
0.2533
MR-Egger
0.2482
CIIV
Oracle TSLS
0.0003
BiMR-SPLIT+ -0.0263
0.2477
MR-Egger
0.2446
CIIV
Oracle TSLS
0.0002
BiMR-SPLIT+ -0.0126
0.2501
0.2512

MR-Egger
CIIV

FPR FNR
Est.sd RMSE CI Width CP
0.00
0.00
1.00
0.0109
0.01
0.02
0.67
0.0422
NA
1.00 NA
0.0699
0.89
0.41
0.08
0.1377
0.00
1.00
0.0078
0.00
0.00
0.00
1.00
0.0093
NA
1.00 NA
0.0461
0.93
0.35
0.06
0.1620
0.00
0.00
1.00
0.0060
0.00
1.00
0.0064
0.00
NA
1.00 NA
0.0373
0.99
0.34
0.01
0.0347
0.00
0.00
1.00
0.0116
0.01
0.57
0.0217
0.01
NA
0.02 NA
0.0459
0.93
0.39
0.06
0.0989
0.00
0.00
1.00
0.0081
0.00
0.94
0.0104
0.00
NA
0.00 NA
0.0331
0.95
0.35
0.05
0.0712
0.00
0.00
1.00
0.0057
0.00
1.00
0.0063
0.00
NA
0.00 NA
0.0200
1.00
0.35
0.00
0.0232

0.0110
0.0838
0.0873
0.3296
0.0078
0.0271
0.0649
0.3722
0.0060
0.0133
0.0592
0.3321
0.0117
0.0598
0.2574
0.2672
0.0081
0.0283
0.2499
0.2547
0.0057
0.0141
0.2509
0.2522

0.1468
0.1560
0.4758
0.0380
0.1037
0.1058
0.4119
0.0279
0.0729
0.0736
0.4006
0.0158
0.1135
0.1166
0.3111
0.0223
0.0801
0.0811
0.2926
0.0145
0.0563
0.0566
0.2882
0.0090

4.4 Application: Causal Pathway Between Gene Expression and Trait

In this section, we demonstrate the practical utility of our method by applying it to a real-world

biological dataset. For now, large-scale individual-level datasets are currently limited in availability,

so we illustrate our approach using a dataset comprising 200 male Drosophila melanogaster samples

as a representative case study. The primary phenotype of interest is phototaxis, which was measured

at two time points: day 4 and day 28 of age. In parallel, gene expression profiles were obtained at

both 1 week and 4 weeks of age. Our goal is to accurately classify gene–trait relationships according

to their causal direction: distinguishing which gene expressions act as causal drivers of phototactic

behavior and which represent reactive responses. By separately analyzing data from young and

aged flies, we aim to uncover age-specific biological mechanisms underlying phototactic regulation

and demonstrate the method’s capacity to resolve directionality in observational transcriptomic-

84

phenotypic associations.

For the gene dataset, We first performed standard quality control procedures on genotype data

from 200 Drosophila melanogaster lines. Variants with a high missing genotype rate (> 10%) were

removed, and only common variants with a MAF ≥ 0.05 were retained. We then conducted LD

pruning using a sliding window approach with an 𝑟 2 threshold of 0.64, resulting in a final dataset

of 931,732 approximately independent SNPs for downstream analysis.

After merging the available gene expression datasets, we obtained expression profiles for 12,510

genes across 180 one-week-old (young) Drosophila samples and 12,361 genes across 176 four-week-

old (aged) samples. To focus on genes most relevant to the behavioral phenotype of interest, we

performed a marginal association analysis between gene expression and phototaxis. Given the

relatively small sample sizes, we adopted a liberal screening threshold of marginal p-value < 0.01

to retain potentially informative features. This resulted in the selection of 71 genes in the young

group and 64 genes in the aged group for downstream analysis, with only one gene expression

‘FBgn0035932’ shared between the two groups.

Next, we applied the BiMR-SPLIT+ method to perform bidirectional MR analyses between

each selected gene expression variable and the phototaxis phenotype in both age groups. Given the

limited sample sizes, we performed 60 random splits when applying BiMR-SPLIT+ to enhance the

stability and reliability of the results.

4.4.1 Young Group Results

Table 4.5 presents the significant gene expression identified by applying BiMR-SPLIT+ to the

young group (i.e., one-week-old Drosophila).

The first columns list the genes identified as causal drivers of phototaxis. Among them, the

most significant finding is FBgn0003733, which exhibits a negative causal effect on phototaxis.

FBgn0003733 corresponds to the torso (tor) gene in Drosophila melanogaster, which encodes a

receptor protein-tyrosine kinase known for its role in embryonic patterning and hormonal regulation

during metamorphosis. Notably, during the larval stage, Torso acts as the receptor for prothoraci-

cotropic hormone (PTTH), which is a key neuroendocrine signal that initiates a cascade controlling

85

Table 4.5 Identified Significant Gene Expressions in Young Flies.

Gene Expression Estimate Std. Error

FBgn0003733
FBgn0031468
FBgn0085273
FBgn0039066
FBgn0043364
FBgn0267819
FBgn0000279
FBgn0033781
FBgn0021765
FBgn0026576

-3.35
1.17
0.95
10.12
3.59
1.50
0.11
-0.01
-0.01
-0.02

1.23
0.45
0.37
4.42
1.58
0.66
0.04
0.01
0.00
0.01

p-value Lower bound Upper bond Causal Mechanism
0.0069
0.0100
0.0103
0.0231
0.0244
0.0257
0.0133
0.0236
0.0314
0.0320

Causal
Causal
Causal
Causal
Causal
Causal
Reactive
Reactive
Reactive
Reactive

-0.95
2.04
1.67
18.78
6.69
2.80
0.19
0.00
0.00
0.00

-5.75
0.29
0.23
1.47
0.49
0.19
0.02
-0.03
-0.02
-0.04

light avoidance behavior (i.e., negative phototaxis), as demonstrated by Yamanaka et al. (2013).

Mechanistically, PTTH binds to the Torso receptor and activates downstream signaling pathways

that modulate the function of two major light-sensing systems: the Bolwig’s organ and class IV

dendritic arborization neurons. These sensory neurons detect ambient light and are inhibited or

modulated by Torso signaling, ultimately promoting larval movement toward darker environments

as they prepare for pupation. This neuroendocrine-driven behavioral adaptation enhances survival

by ensuring that larvae enter the pupal stage in protective, low-light environments.

Furthermore, five other gene expressions were identified as having significant causal effects in

promoting phototactic behavior. FBgn0031468, corresponding to CG2975 (Müller et al., 2005),

encodes a 𝛽-1,3-galactosyltransferase involved in protein O-glycosylation, which is a critical post-

translational modification affecting membrane and synaptic protein function. Although it has not

been directly linked to phototaxis previously, altered glycosylation in light-sensing neurons (Katoh

and Tiemeyer, 2013) may affect receptor stability, localization, or signaling efficiency, ultimately

modulating behavioral response to light. Its enriched expression in early development and adult

males (Brown et al., 2014) further supports its potential contribution to phototactic variation in this

age group. FBgn0039066 encodes EloA, the active subunit of the Elongin complex, which facilitates

transcriptional elongation by RNA polymerase II (Gerber et al., 2004). EloA is highly expressed

in early embryos (Brown et al., 2014) and is localized to central brain structures, suggesting

its involvement in neural gene regulation.

Its upregulation may accelerate the transcription of

genes critical to phototransduction, synaptic plasticity, or sensory processing, thereby enhancing

86

responsiveness to light stimuli and supporting phototactic behavior. FBgn0043364, also known

as cabut (cbt), encodes a zinc-finger transcription factor implicated in BMP signaling and sensory

organ development (Mukherjee et al., 2021). Given its high expression during late embryogenesis

and established roles in neurogenesis, cbt likely supports the differentiation and connectivity of

photoreceptive circuits (Abdelilah-Seyfried et al., 2000). Its causal effect on phototaxis may arise

through developmental programming of light-sensitive neural systems. For the remaining two

genes, FBgn0085273 and FBgn0267819, no well-characterized links to phototactic behavior or

neuro development have been established. However, their reproducible causal relationship with

phototaxis in this dataset suggests that they may represent novel regulatory components or indirect

modulators of light-responsive behavior. Further investigation into their function and expression

dynamics is warranted.

In addition, four gene expressions were identified as reactive responses. First, phototactic

behavior was found to positively regulate the expression of Cecropin C (CecC, FBgn0000279),

an antimicrobial peptide gene involved in the innate immune response (TRYSELIUS et al., 1992;

Gordon et al., 2008; Carboni et al., 2022; Verleyen et al., 2006). This upregulation may reflect

an anticipatory immune response triggered by increased environmental exploration. Young flies

exhibiting higher phototaxis are likely more active and more exposed to microbial threats in

external environments. As a result, the innate immune system may be primed through behavior-

linked signals to express effector genes such as CecC, which encodes a peptide active against

Gram-negative bacteria. Moreover, increased behavioral engagement may activate neuroendocrine

signaling cascades (e.g., Imd, Toll), which intersect with immune transcriptional networks (Davies

et al., 2012).

In contrast, there exist mild suppression of several mitochondrial and cellular maintenance-

related genes, FBgn0033781 (CG13319), FBgn0021765 (CG7113, scully), and FBgn0026576

(Pisd), in response to increased phototactic behavior. These genes are functionally distinct but

share involvement in core biological processes such as proteasome assembly (CG13319), mito-

chondrial tRNA processing and steroid metabolism (scu) (Torroja et al., 1998), and phospholipid

87

biosynthesis in the mitochondrial membrane (Pisd) (Zhao and Wang, 2020). This consistent down-

regulation may reflect a short-term physiological prioritization of sensory and motor functions over

background maintenance tasks. Phototaxis is a behavior that requires sustained visual attention and

locomotion, which may transiently shift transcriptional activity away from mitochondrial biogen-

esis and metabolic housekeeping. Such reallocation of resources in young adults likely represents

a flexible and reversible trade-off, allowing flies to adapt their molecular programs to immediate

behavioral demands. Additionally, enhanced sensory-driven activity could generate mild neural or

metabolic stress signals, especially in energy-intensive tissues like the head or thoracic muscles,

resulting in temporary downregulation of mitochondrial and quality control genes.

4.4.2 Aged Group Results

In the aged Drosophila, RalGPS (FBgn0034158), PNUTS (FBgn0053526), and CG33673

(FBgn0053673) was found to associated with enhanced phototactic behavior, see Table 4.6.

Table 4.6 Identified Significant Gene Expressions in Aged Flies.

Gene Expression Estimate Std. Error

FBgn0034158
FBgn0035317
FBgn0039674
FBgn0053526
FBgn0039640
FBgn0053673
FBgn0032883
FBgn0266967
FBgn0028978
FBgn0085227
FBgn0266819
FBgn0083963
FBgn0035932
FBgn0261058
FBgn0004865

2.42
-2.13
-4.13
3.84
-1.95
4.90
-2.94
-4.80
0.02
-0.03
0.04
0.00
-0.10
0.01
0.01

0.80
0.85
1.75
1.80
0.92
2.38
1.45
2.37
0.01
0.01
0.02
0.00
0.04
0.01
0.00

p-value Lower bound Upper bond Causal Mechanism
0.0029
0.0131
0.0194
0.0347
0.0370
0.0410
0.0442
0.0443
0.0198
0.0221
0.0247
0.0256
0.0261
0.0285
0.0343

Causal
Causal
Causal
Causal
Causal
Causal
Causal
Causal
Reactive
Reactive
Reactive
Reactive
Reactive
Reactive
Reactive

0.85
-3.80
-7.57
0.31
-3.76
0.24
-5.78
-9.45
0.00
-0.06
0.01
0.00
-0.19
0.00
0.00

3.99
-0.47
-0.70
7.37
-0.13
9.55
-0.10
-0.16
0.03
0.00
0.08
0.00
-0.01
0.03
0.02

RalGPS encodes a guanyl-nucleotide exchange factor that regulates Ras/Ral GTPase signaling

and epidermal growth factor receptor (EGFR) pathways (Nászai et al., 2021). In aged flies, in-

creased RalGPS activity may boost synaptic plasticity or neural excitability through ERK pathway

activation (Ferro and Trabalzini, 2010; Impey et al., 1999). These effects could reinforce visual-

88

motor coupling, enabling stronger behavioral responses to light stimuli. PNUTS (Phosphatase 1

Nuclear Targeting Subunit) is involved in gene expression and developmental regulation (Ciur-

ciu et al., 2013).

Its high expression in older flies may help stabilize transcriptional networks

needed for maintaining synaptic integrity or sustaining locomotor readiness, counteracting age-

related decline in neuromotor coordination. CG33673 is predicted to encode a calcium channel

component (Project, 2011), possibly localized to Golgi or plasma membranes. Enhanced calcium

signaling in the aged brain can elevate neuronal firing rates and enhance sensory responsive-

ness (Berridge, 1998).

In the context of phototaxis, calcium influx may facilitate visual circuit

reactivity or downstream motor output (Brini et al., 2014). Together, these genes may act through

distinct but converging pathways to support sensory fidelity and behavioral responsiveness in the

aging nervous system.

Additionally, several genes, Oseg2 (FBgn0035317), CG1907 (FBgn0039674), superdeath

(FBgn0039640), Rhau (FBgn0032883), and the CR45418 (FBgn0266967), were found to sup-

press phototactic behavior. Oseg2 is crucial for intraflagellar transport and the maintenance of

sensory cilium structure (Avidor-Reiss et al., 2004). CG1907 is predicted to encode a mitochon-

drial dicarboxylate transporter involved in the malate-aspartate shuttle (Gene Ontology Curators,

02 ). RHAU helicase (Rhau) encodes a protein responsible for G-quadruplex DNA unwinding (You

et al., 2017; Lattmann et al., 2010). To date, however, there are no published studies directly report-

ing inhibitory effects of these genes on phototactic behavior. Our results may thus represent novel

findings, suggesting previously unrecognized roles for these genes in the regulation of phototaxis

in aged Drosophila. Additionally, the precise biological functions of superdeath and CR45418 in

the context of phototaxis remain unknown, as their roles in specific biological processes have yet

to be characterized.

Table 4.6 also shows that, increased phototactic activity was found to mildly upregulate the

expression of several genes involved in neural function, cellular signaling, and reproductive regula-

tion: FBgn0028978 (tribbles), FBgn0083963 (Neuroligin 3), FBgn0261058 (Seminal fluid protein

38D), and FBgn0004865 (Ecdysone-induced protein 78C).

89

Tribbles (trbl) encodes a protein kinase inhibitor known to regulate MAP kinase signaling and

insulin-like signaling pathways (Das et al., 2014). Enhanced expression of trbl in response to height-

ened phototactic behavior may reflect increased demands on neural signaling pathways, potentially

acting as a feedback mechanism to prevent excessive activation of neuronal signaling cascades and

maintain neural homeostasis. Neuroligin 3 (Nlg3) encodes a synaptic adhesion molecule critical

for synapse formation, stabilization, and neural transmission (Xing et al., 2014). Elevated pho-

totactic behavior, an activity that requires robust neuronal communication and synaptic plasticity,

may drive increased Nlg3 expression to support synaptic strengthening and maintain effective

neurotransmission during heightened sensory processing. Seminal fluid protein 38D (Sfp38D) is

primarily known for its role in reproductive biology (Findlay et al., 2009). Its upregulation, how-

ever, could indicate broader physiological adaptations that link sensory or behavioral activity with

reproductive function, possibly mediated via neuroendocrine signals triggered by increased pho-

totactic activity. Ecdysone-induced protein 78C (Eip78C) is predicted to encode a DNA-binding

transcription factor involved in regulating gene expression in response to hormonal cues (ecdysone

signaling) (Members, 04 ). Mildly increased expression of Eip78C could reflect phototactic-

induced neuroendocrine activation, integrating environmental cues (e.g., increased light exposure)

with transcriptional changes necessary for physiological adaptation, stress response, or metabolic

adjustments in older flies.

For the other two genes found to be mildly inhibited by phototaxis (FBgn0085227 and

FBgn0035932), the underlying biological mechanisms remain unclear. However, our findings

may provide a new perspective for future research.

4.5 Discussion

In this study, we have successfully extended the MR-SPLIT+ framework to bidirectional causal

inference by developing the BiMR-SPLIT+ method. This new algorithm is specifically designed

to address the challenge of invalid instrumental variables (IVs) that arise when the plurality rule

fails in the presence of bidirectional causality. At the same time, BiMR-SPLIT+ further improves

computational efficiency compared to the original MR-SPLIT+ approach.

90

Through comprehensive simulation studies, we focused on two particularly challenging scenar-

ios and compared the performance of BiMR-SPLIT+ to that of oracle TSLS, CIIV, and MR-Egger

methods.

In both settings, BiMR-SPLIT+ consistently provided robust and reliable estimates,

exhibiting strong adaptability to complex real-world conditions. Importantly, it produced the low-

est bias among all methods except the oracle, while maintaining high coverage probabilities for

confidence intervals.

In our empirical application, we successfully applied BiMR-SPLIT+ to a Drosophila dataset

with approximately 180 available samples. The method identified gene expressions with causal

effects on phototaxis in both young and aged fly cohorts, with many findings corroborated by

existing biological literature, thus further validating our approach. As larger-scale individual-level

datasets in humans become available, BiMR-SPLIT+ is well-positioned to elucidate the causal

mechanisms underlying complex diseases, thereby facilitating the identification of true causal

drivers for targeted therapeutic development.

In summary, BiMR-SPLIT+ represents a valuable and generalizable tool for robust bidirectional

causal inference in both experimental and observational genomics studies

Looking forward, our proposed framework for bidirectional causality can be naturally extended

to the construction of causal networks, offering promising opportunities for elucidating more

complex causal mechanisms in future studies. Moreover, owing to the inherent flexibility of this

framework, it can also be readily adapted to accommodate nonlinear causal relationships. Such

extensions have the potential to uncover more intricate causal structures and yield more accurate

causal effect estimates in increasingly complex settings.

In addition, there remains room for

improvement in the construction of confidence intervals within this framework, which will require

further theoretical development.

91

CHAPTER 5

CONCLUSION AND DISCUSSION

5.1 Summary of Main Contributions

This dissertation makes several key contributions to the methodology and application of

Mendelian randomization for causal inference in genomics.

First, we developed the MR-SPLIT method within the 2SLS framework to address two major

challenges in one-sample MR analysis: instrument selection bias and the weak instrument problem.

MR-SPLIT employs adaptive random sample splitting, using one half of the data for IV selection

and the other for causal estimation, thereby avoiding the “winner’s curse” from reusing the same

sample. We further enhanced robustness through multiple sample splitting and aggregation of

weak IVs into composite instruments. Extensive theoretical evaluation and simulation studies

show that MR-SPLIT outperforms traditional methods such as 2SLS and LIML, as well as the

cross-fitting MR (CFMR) approach, in both bias reduction and statistical power. Empirical analysis

with the CRIC dataset further demonstrated its practical utility in establishing the causal role of

kidney function on aTRH. In addition, we explored LASSO and de-biased LASSO methods for IV

selection in high-dimensional settings, recommending the de-biased approach when computational

resources permit.

Building upon MR-SPLIT, we proposed MR-SPLIT+, which further relaxes the plurality rule

to accommodate invalid IVs. By incorporating best subset selection and repeated sample splitting,

MR-SPLIT+ achieves remarkable improvements in the accuracy of invalid IV identification, and

demonstrates strong selection consistency in both theory and practice. Simulation results indicate

that MR-SPLIT+ yields performance close to that of the oracle TSLS method and maintains

high computational efficiency even in large samples. Although MR-SPLIT+ is robust, there

remain opportunities for further generalization, such as handling binary exposures/outcomes and

bidirectional causal relationships. The method also holds promise for constructing causal networks

involving multiple exposures.

Finally, we extended our framework to bidirectional causal inference and developed BiMR-

92

SPLIT+, which enables robust identification of causal effects in the presence of invalid IVs and

bidirectional causality. BiMR-SPLIT+ offers enhanced computational efficiency over MR-SPLIT+

and demonstrates the lowest estimation bias among competing methods (except for the oracle)

while maintaining high coverage probabilities.

Its efficacy was validated both in simulations

and in a Drosophila gene expression application, with findings consistent with known biological

mechanisms.

Collectively, these methodological advances significantly improve the reliability and applicabil-

ity of MR for complex causal inference tasks. Our frameworks provide practical solutions to weak

and invalid IV problems, enable robust bidirectional inference, and offer new avenues for future

methodological developments, such as causal network construction and non-linear MR analysis.

Overall, the methods developed in this dissertation provide a unified and flexible framework for

advancing robust causal inference in genetic epidemiology.

5.2 Biological Insights

Beyond methodological advances, the approaches developed in this dissertation have yielded

valuable insights into important biological questions.

Applying MR-SPLIT to the CRIC dataset, we established a robust causal relationship between

kidney function—as measured by eGFR and uACR—and apparent treatment-resistant hyperten-

sion (aTRH). These findings are consistent with prior observational and clinical evidence, but our

methods provide stronger statistical support by effectively controlling for weak and invalid instru-

ments. The use of debiased IV selection further enhanced the reliability of these conclusions in the

high-dimensional genomic context.

In large-scale analyses with UK Biobank data, MR-SPLIT+ was validated using both positive

and negative control designs. The method not only reproduced known causal effects, such as that

of BMI on diastolic blood pressure, but also demonstrated robustness by correctly identifying the

absence of implausible causal relationships, such as from adult BMI to birth weight. These results

underscore the improved reliability and specificity of MR-SPLIT+ for causal inference in complex,

real-world genomic studies.

93

Furthermore, the extension to bidirectional MR, as implemented in BiMR-SPLIT+, enabled

us to disentangle complex causal relationships between gene expression and behavioral traits in

Drosophila melanogaster. Specifically, our application to phototaxis data identified gene expres-

sions with significant causal effects on phototactic behavior in both young and aged fly cohorts.

Many of the identified candidate genes were corroborated by existing biological literature, while

others represent novel findings that warrant further investigation. These results underscore the

potential of our methods to uncover previously unrecognized mechanisms underlying complex

phenotypes.

Overall, the application of MR-SPLIT, MR-SPLIT+, and BiMR-SPLIT+ to real-world datasets

demonstrates their ability to generate robust and biologically meaningful inferences. These insights

not only validate known biological relationships but also highlight new avenues for functional

genomics and translational research.

5.3 Limitations of the Study

Despite the contributions of this work, several limitations should be acknowledged.

The dataset used in this study is mostly derived from the UK Biobank, a large-scale prospective

cohort study comprising over 500,000 participants aged over 40 years at recruitment. While the

breadth and depth of phenotypic and genotypic data in UK Biobank provide valuable opportunities

for epidemiological research, it is important to acknowledge the limitations arising from its non-

random sampling strategy. Specifically, UK Biobank participants tend to be healthier, older, more

educated, and of higher socioeconomic status compared to the general UK population.

In this

context, although our real data analysis yields internally valid estimates, the potential selection bias

may limit the generalizability of our findings to broader or more diverse populations.

In Chapter 3, the mediation analysis was also limited by the restricted availability of biological

variables in the dataset, which constrained the range of potential mediators. As a result, important

biological pathways such as hormonal regulation, autonomic nervous system activity, or vascular

remodeling could not be assessed. Future studies with richer and more comprehensive biomarker

panels may enable a more complete understanding of the mechanisms linking adiposity to blood

94

pressure.

In the real data application of Chapter 4, the limitation lies in the relatively small sample size.

The largest usable sample we could identify consisted of approximately 200 individuals, which may

limit statistical power and generalizability. Nonetheless, the analysis revealed several interesting and

biologically plausible findings. We believe that as larger datasets become available, the proposed

method has the potential to uncover more nuanced and complex causal relationships. On the

other hand, our analysis focused exclusively on exploring causal effects from gene expressions to

phenotypic traits, without accounting for potential causal relationships among gene expressions

themselves.

In complex biological systems, it is plausible that certain gene expression levels

influence traits indirectly through regulatory interactions with other genes. Ignoring such upstream

or intermediate transcriptional pathways may obscure a more complete understanding of the causal

architecture. Future work could extend the current framework to model gene expression networks

and disentangle direct and indirect effects along regulatory cascades.

5.4 Future Research Directions

As large-scale biobank resources and individual-level genomic data continue to grow, there

is tremendous potential for applying BiMR-SPLIT+ and related methods to a wider range of

biological questions.

In particular, these methods can play an increasingly important role in

uncovering bidirectional causal relationships between complex traits and diseases, and in mapping

gene expression networks that drive disease risk. Such analyses could ultimately contribute to

more precise identification of causal genetic factors, providing new leads for functional studies and

potential drug targets.

On the methodological side, further work is needed to strengthen the theoretical underpinnings

of these approaches. For example, establishing sharper theoretical guarantees, improving the accu-

racy of confidence interval construction, and better understanding the behavior of these estimators

in various practical settings remain open problems. There is also considerable scope for extending

the current framework to accommodate nonlinear relationships, as well as binary exposures and

outcomes, so that the methods can be used in an even broader array of applications.

95

Beyond these directions, new challenges and opportunities will no doubt arise as richer datasets

become available, including the integration of multiple omics layers and the development of robust

sensitivity analyses to assess model assumptions. Continued collaboration between methodological

and applied researchers will be essential to fully realize the potential of these approaches in

advancing both fundamental biology and clinical research.

Taken together, I hope that the work presented in this dissertation will serve as a foundation for

further methodological development and inspire future applications that deepen our understanding

of complex disease mechanisms.

96

BIBLIOGRAPHY

Abdelilah-Seyfried, S., Chan, Y.-M., Zeng, C., Justice, N. J., Younger-Shepherd, S., Sharp, L. E.,
Barbel, S., Meadows, S. A., Jan, L. Y., and Jan, Y. N. (2000). A gain-of-function screen for
genes that affect the development of the drosophila adult external sensory organ. Genetics,
155(2):733–752.

Albert, F. W. and Kruglyak, L. (2015). The role of regulatory variation in complex traits and

disease. Nature Reviews Genetics, 16(4):197–212.

Anderson, T. (2005). Origins of the limited information maximum likelihood and two-stage least

squares estimators. Journal of Econometrics, 127(1):1–16.

Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation in a
complete system of stochastic equations. The Annals of Mathematical Statistics, 20(1):46–63.

Angrist, J. D., Imbens, G. W., and Krueger, A. B. (1999).
estimation. Journal of Applied Econometrics, 14(1):57–67.

Jackknife instrumental variables

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996).

Identification of causal effects using

instrumental variables. Journal of the American statistical Association, 91(434):444–455.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and

earnings? The Quarterly Journal of Economics, 106(4):979–1014.

Angrist, J. D. and Krueger, A. B. (1995). Split-sample instrumental variables estimates of the

return to schooling. Journal of Business & Economic Statistics, 13(2):225–235.

Angrist, J. D. and Krueger, A. B. (1999). Chapter 23 - empirical strategies in labor economics.

volume 3 of Handbook of Labor Economics, pages 1277–1366. Elsevier.

Apfel, N. and Liang, X. (2024). Agglomerative hierarchical clustering for selecting valid instru-

mental variables. Journal of Applied Econometrics, 39(7):1201–1219.

Avidor-Reiss, T., Maer, A. M., Koundakjian, E., Polyanovsky, A., Keil, T., Subramaniam, S.,
and Zuker, C. S. (2004). Decoding cilia function: defining specialized genes required for
compartmentalized cilia biogenesis. Cell, 117(4):527–539.

Baiocchi, M., Cheng, J., and Small, D. S. (2014).

Instrumental variable methods for causal

inference. Statistics in Medicine, 33(13):2297–2340.

Barro, R. J. (1997). Macroeconomics. MIT Press.

Berridge, M. J. (1998). Neuronal calcium signaling. Neuron, 21(1):13–26.

97

Bertsimas, D., King, A., and Mazumder, R. (2016). Best subset selection via a modern optimization

lens. The Annals of Statistics, 44(2):813 – 852.

Blomquist, S. and Dahlberg, M. (1999). Small sample properties of liml and jackknife iv estimators:

Experiments with weak instruments. Journal of Applied Econometrics, 14(1):69–88.

Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation
when the correlation between the instruments and the endogeneous explanatory variable is weak.
Journal of the American Statistical Association, 90(430):443–450.

Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalid
instruments: effect estimation and bias detection through egger regression. International Journal
of Epidemiology, 44(2):512–525.

Bowden, J., Davey Smith, G., Haycock, P. C., and Burgess, S. (2016). Consistent estimation in
mendelian randomization with some invalid instruments using a weighted median estimator.
Genetic Epidemiology, 40(4):304–314.

Bowden, J., Del Greco M, F., Minelli, C., Zhao, Q., Lawlor, D. A., Sheehan, N. A., Thompson, J.,
and Davey Smith, G. (2019). Improving the accuracy of two-sample summary-data mendelian
randomization: moving beyond the nome assumption. International Journal of Epidemiology,
48(3):728–742.

Brini, M., Calì, T., Ottolini, D., and Carafoli, E. (2014). Neuronal calcium signaling: function and

dysfunction. Cellular and Molecular Life Sciences, 71:2787–2814.

Brown, J. B., Boley, N., Eisman, R., May, G. E., Stoiber, M. H., Duff, M. O., Booth, B. W., Wen, J.,
Park, S., Suzuki, A. M., et al. (2014). Diversity and dynamics of the drosophila transcriptome.
Nature, 512(7515):393–399.

Burgess, S., Butterworth, A., and Thompson, S. G. (2013). Mendelian randomization analysis with

multiple genetic variants using summarized data. Genetic Epidemiology, 37(7):658–665.

Burgess, S., Davies, N. M., Thompson, S. G., Consortium, E.-I., et al. (2014). Instrumental variable

analysis with a nonlinear exposure–outcome relationship. Epidemiology, 25(6):877–885.

Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. (2020). A robust and efficient
method for mendelian randomization with hundreds of genetic variants. Nature Communications,
11(1):376.

Burgess, S., Small, D. S., and Thompson, S. G. (2017). A review of instrumental variable estimators

for mendelian randomization. Statistical Methods in Medical Research, 26(5):2333–2355.

Burgess, S., Smith, G. D., Davies, N. M., Dudbridge, F., Gill, D., Glymour, M. M., Hartwig, F. P.,
Kutalik, Z., Holmes, M. V., Minelli, C., et al. (2019). Guidelines for performing mendelian

98

randomization investigations: update for summer 2023. Wellcome Open Research, 4.

Burgess, S., Thompson, S. G., and Collaboration, C. C. G. (2011). Avoiding bias from weak instru-
ments in Mendelian randomization studies. International Journal of Epidemiology, 40(3):755–
764.

Carboni, A. L., Hanson, M. A., Lindsay, S. A., Wasserman, S. A., and Lemaitre, B. (2022).
Cecropins contribute to drosophila host defense against a subset of fungal and gram-negative
bacterial infection. Genetics, 220(1):iyab188.

Castle, W. E. (1903). Mendel’s law of heredity. Science, 18(456):396–406.

Chen, J., Bundy, J. D., Hamm, L. L., Hsu, C.-y., Lash, J., Miller III, E. R., Thomas, G., Cohen, D. L.,
Weir, M. R., Raj, D. S., et al. (2019). Inflammation and apparent treatment-resistant hypertension
in patients with chronic kidney disease: the results from the cric study. Hypertension, 73(4):785–
793.

Chen, S. (2025). Two-sample bi-directional causality between two traits with some invalid ivs in
both directions using gwas summary statistics. Human Genetics and Genomics Advances, page
100449.

Cheng, B., Bai, Y., Liu, L., Meng, P., Cheng, S., Yang, X., Pan, C., Wei, W., Liu, H., Jia, Y., et al.
(2024). Mendelian randomization study of the relationship between blood and urine biomarkers
and schizophrenia in the uk biobank cohort. Communications Medicine, 4(1):40.

Ciurciu, A., Duncalf, L., Jonchere, V., Lansdale, N., Vasieva, O., Glenday, P., Rudenko, A., Vissi,
E., Cobbe, N., Alphey, L., et al. (2013). Pnuts/pp1 regulates rnapii-mediated gene expression
and is necessary for developmental growth. PLOS Genetics, 9(10):e1003885.

Coresh, J., Wei, G. L., McQuillan, G., Brancati, F. L., Levey, A. S., Jones, C., and Klag, M. J.
(2001). Prevalence of high blood pressure and elevated serum creatinine level in the united states:
findings from the third national health and nutrition examination survey (1988-1994). Archives
of Internal Medicine, 161(9):1207–1216.

Das, R., Sebo, Z., Pence, L., and Dobens, L. L. (2014). Drosophila tribbles antagonizes in-
sulin signaling-mediated growth and metabolism via interactions with akt kinase. PLOS One,
9(10):e109530.

Davey Smith, G. and Ebrahim, S. (2003). ‘Mendelian randomization’: can genetic epidemiology
contribute to understanding environmental determinants of disease?*. International Journal of
Epidemiology, 32(1):1–22.

Davey Smith, G. and Hemani, G. (2014). Mendelian randomization: genetic anchors for causal

inference in epidemiological studies. Human Molecular Genetics, 23(R1):R89–R98.

99

Davies, S.-A., Overend, G., Sebastian, S., Cundall, M., Cabrero, P., Dow, J. A., and Terhzaz, S.
(2012). Immune and stress response ‘cross-talk’in the drosophila malpighian tubule. Journal of
Insect Physiology, 58(4):488–497.

Denault, W. R. P., Bohlin, J., Page, C. M., Burgess, S., and Jugessur, A. (2022). Cross-fitted
instrument: A blueprint for one-sample mendelian randomization. PLOS Computational Biology,
18(8):1–21.

Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015). High-dimensional inference:

confidence intervals, p-values and r-software hdi. Statistical Science, pages 533–558.

Donald, S. G. and Newey, W. K. (2001). Choosing the number of instruments. Econometrica,

69(5):1161–1191.

Efron, B. (1992). Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics:

Methodology and distribution, pages 569–593. Springer.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.

Ferro, E. and Trabalzini, L. (2010). Ralgds family members couple ras to ral signalling and that’s

not all. Cellular Signalling, 22(12):1804–1810.

Fieller, E. C. (1954). Some problems in interval estimation. Journal of the Royal Statistical Society

Series B: Statistical Methodology, 16(2):175–185.

Findlay, G. D., MacCoss, M. J., and Swanson, W. J. (2009). Proteomic discovery of previously
unannotated, rapidly evolving seminal fluid genes in drosophila. Genome Research, 19(5):886–
896.

Gene Ontology Curators (2002–). Manual transfer of experimentally-verified manual GO annotation
data to orthologs by curator judgment of sequence similarity. Personal communication to
FlyBase, FlyBase ID: FBrf0255270.

Gerber, M., Eissenberg, J. C., Kong, S., Tenney, K., Conaway, J. W., Conaway, R. C., and Shilatifard,
A. (2004). In vivo requirement of the rna polymerase ii elongation factor elongin a for proper
gene expression and development. Molecular and Cellular Biology, 24(22):9911–9919.

Glymour, M. M., Tchetgen Tchetgen, E. J., and Robins, J. M. (2012). Credible mendelian ran-
domization studies: approaches for evaluating the instrumental variable assumptions. American
Journal of Epidemiology, 175(4):332–339.

Gordon, M. D., Ayres, J. S., Schneider, D. S., and Nusse, R. (2008). Pathogenesis of listeria-
infected drosophila wntd mutants is associated with elevated levels of the novel immunity gene
edin. PLOS Pathogens, 4(7):e1000111.

100

Greco M, F. D., Minelli, C., Sheehan, N. A., and Thompson, J. R. (2015). Detecting pleiotropy
in mendelian randomisation studies with summary data and a continuous outcome. Statistics in
Medicine, 34(21):2926–2940.

Greenland, S. (2000). An introduction to instrumental variables for epidemiologists. International

Journal of Epidemiology, 29(4):722–729.

Guo, Z., Kang, H., Tony Cai, T., and Small, D. S. (2018). Confidence intervals for causal effects
with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 80(4):793–815.

Guo, Z. and Small, D. S. (2016). Control function instrumental variable estimation of nonlinear

causal effect models. Journal of Machine Learning Research, 17(100):1–35.

Hartwig, F. P., Davies, N. M., Hemani, G., and Davey Smith, G. (2016). Two-sample mendelian
randomization: avoiding the downsides of a powerful, widely applicable but potentially fallible
technique.

Hartwig, F. P., Tilling, K., Davey Smith, G., Lawlor, D. A., and Borges, M. C. (2021). Bias
in two-sample mendelian randomization when using heritable covariable-adjusted summary
associations. International Journal of Epidemiology, 50(5):1639–1650.

He, Q., Ding, Z. Y., Fong, D. Y.-T., and Karlberg, J. (2000). Blood pressure is associated with body

mass index in both normal and obese children. Hypertension, 36(2):165–170.

He, Z., Chen, Z., de Borst, M. H., Zhang, Q., of Blood Pressure, I. C., Snieder, H., and Thio, C. H.
(2023). Observational and genetic evidence for bidirectional effects between red blood cell traits
and diastolic blood pressure. American Journal of Hypertension, 36(10):551–560.

Hemani, G., Bowden, J., and Davey Smith, G. (2018). Evaluating the potential role of pleiotropy

in mendelian randomization studies. Human Molecular Genetics, 27(R2):R195–R208.

Impey, S., Obrietan, K., and Storm, D. R. (1999). Making new connections: role of erk/map kinase

signaling in neuronal plasticity. Neuron, 23(1):11–14.

Inoue, A. and Solon, G. (2010). Two-sample instrumental variables estimators. The Review of

Economics and Statistics, 92(3):557–561.

Jiang, T., Gill, D., Butterworth, A. S., and Burgess, S. (2023). An empirical investigation into the
impact of winner’s curse on estimates from mendelian randomization. International Journal of
Epidemiology, 52(4):1209–1219.

Judd, E. and Calhoun, D. (2014). Apparent and true resistant hypertension: definition, prevalence

and outcomes. Journal of Human Hypertension, 28(8):463–468.

101

Kaboré, J., Metzger, M., Helmer, C., Berr, C., Tzourio, C., Drueke, T. B., Massy, Z. A., and
Stengel, B. (2017). Hypertension control, apparent treatment resistance, and outcomes in the
elderly population with chronic kidney disease. Kidney International Reports, 2(2):180–191.

Kabore, J., Metzger, M., Helmer, C., Berr, C., Tzourio, C., Massy, Z. A., and Stengel, B. (2016).
Kidney function decline and apparent treatment-resistant hypertension in the elderly. PLOS One,
11(1):e0146056.

Kang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016).

Instrumental variables estimation
with some invalid instruments and its application to mendelian randomization. Journal of the
American Statistical Association, 111(513):132–144.

Katoh, T. and Tiemeyer, M. (2013). The n’s and o’s of drosophila glycoprotein glycobiology.

Glycoconjugate Journal, 30:57–66.

Khandaker, G. M., Pearson, R. M., Zammit, S., Lewis, G., and Jones, P. B. (2014). Association of
serum interleukin 6 and c-reactive protein in childhood with depression and psychosis in young
adult life: a population-based longitudinal study. JAMA Psychiatry, 71(10):1121–1128.

Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables

regression. Econometrica, 70(5):1781–1803.

Kolesár, M. (2018). Minimum distance approach to inference with many instruments. Journal of

Econometrics, 204(1):86–100.

Kolesár, M., Chetty, R., Friedman, J., Glaeser, E., and Imbens, G. W. (2015). Identification and
inference with many invalid instruments. Journal of Business & Economic Statistics, 33(4):474–
484.

Lattmann, S., Giri, B., Vaughn, J. P., Akman, S. A., and Nagamine, Y. (2010). Role of the amino
terminal rhau-specific motif in the recognition and resolution of guanine quadruplex-rna by the
deah-box rna helicase rhau. Nucleic Acids Research, 38(18):6219–6233.

Lawlor, D. A., Harbord, R. M., Sterne, J. A., Timpson, N., and Davey Smith, G. (2008). Mendelian
randomization: using genes as instruments for making causal inferences in epidemiology. Statis-
tics in Medicine, 27(8):1133–1163.

Li, C. (2019). Rethinking Nonlinear Instrumental Variables. PhD thesis, Duke University.

Lin, Y., Windmeijer, F., Song, X., and Fan, Q. (2024). On the instrumental variable estimation with
many weak and invalid instruments. Journal of the Royal Statistical Society Series B: Statistical
Methodology, 86(4):1068–1088.

Linderman, G. C., Lu, J., Lu, Y., Sun, X., Xu, W., Nasir, K., Schulz, W., Jiang, L., and Krumholz,
H. M. (2018). Association of body mass index with blood pressure among 1.7 million chinese

102

adults. JAMA Network Open, 1(4):e181271–e181271.

Liu, Y. and Xie, J. (2019). Cauchy combination test: A powerful test with analytic p-value calcu-
lation under arbitrary dependency structures. Journal of the American Statistical Association,
115:1–29.

Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C., Vedantam,
S., Buchkovich, M. L., Yang, J., et al. (2015). Genetic studies of body mass index yield new
insights for obesity biology. Nature, 518(7538):197–206.

Maina, J. G., Balkhiyarova, Z., Nouwen, A., Pupko, I., Ulrich, A., Boissel, M., Bonnefond, A.,
Froguel, P., Khamis, A., Prokopenko, I., et al. (2023). Bidirectional mendelian randomization
and multiphenotype gwas show causality and shared pathophysiology between depression and
type 2 diabetes. Diabetes Care, 46(9):1707–1714.

Mammen, G. and Faulkner, G. (2013). Physical activity and the prevention of depression: a
systematic review of prospective studies. American Journal of Preventive Medicine, 45(5):649–
657.

Marmot, M. and Brunner, E. (1991). Alcohol and cardiovascular disease: the status of the u shaped

curve. BMJ: British Medical Journal, 303(6802):565.

Members, I. P. (2004–). Gene ontology annotation through association of interpro records with go

terms. FlyBase ID: FBrf0174215.

Millard, L. A., Davies, N. M., Tilling, K., Gaunt, T. R., and Davey Smith, G. (2019). Searching for
the causal effects of body mass index in over 300 000 participants in uk biobank, using mendelian
randomization. PLOS Genetics, 15(2):e1007951.

Miller, A. (2002). Subset selection in regression. Chapman and Hall/CRC.

Minelli, C., Del Greco M, F., van der Plaat, D. A., Bowden, J., Sheehan, N. A., and Thompson, J.
(2021). The use of two-sample methods for mendelian randomization analyses on single large
datasets. International Journal of Epidemiology, 50(5):1651–1659.

Mukherjee, S., Paricio, N., and Sokol, N. S. (2021). A stress-responsive mirna regulates bmp
signaling to maintain tissue homeostasis. Proceedings of the National Academy of Sciences,
118(21):e2022583118.

Müller, R., Hülsmeier, A. J., Altmann, F., Ten Hagen, K., Tiemeyer, M., and Hennet, T. (2005). Char-
acterization of mucin-type core-1 𝛽1-3 galactosyltransferase homologous enzymes in drosophila
melanogaster. The FEBS Journal, 272(17):4295–4305.

Nászai, M., Bellec, K., Yu, Y., Roman-Fernandez, A., Sandilands, E., Johansson, J., Campbell,
A. D., Norman, J. C., Sansom, O. J., Bryant, D. M., et al. (2021). Ral gtpases mediate egfr-driven

103

intestinal stem cell proliferation and tumourigenesis. Elife, 10:e63807.

Panagiotou, O. A., Ioannidis, J. P. A., and for the Genome-Wide Significance Project (2011). What
should the genome-wide significance threshold be? Empirical replication of borderline genetic
associations. International Journal of Epidemiology, 41(1):273–286.

Patel, A., DiTraglia, F. J., Zuber, V., and Burgess, S. (2024). Selecting invalid instruments to
improve mendelian randomization with two-sample summary data. The Annals of Applied
Statistics, 18(2):1729–1749.

Pierce, B. L., Ahsan, H., and VanderWeele, T. J. (2010). Power and instrument strength requirements
for Mendelian randomization studies using multiple genetic variants. International Journal of
Epidemiology, 40(3):740–752.

Project, G. R. G. (2011). Phylogenetic annotation using the gene ontology. Personal communication

to FlyBase. FlyBase ID: FBrf0258542.

Qi, G. and Chatterjee, N. (2019). Mendelian randomization analysis using mixture models for

robust and efficient estimation of causal effects. Nature Communications, 10(1):1941.

Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables.

Econometrica: Journal of the Econometric Society, pages 393–415.

Schuch, F. B., Vancampfort, D., Firth, J., Rosenbaum, S., Ward, P. B., Silva, E. S., Hallgren, M.,
Ponce De Leon, A., Dunn, A. L., Deslandes, A. C., et al. (2018). Physical activity and incident
depression: a meta-analysis of prospective cohort studies. American Journal of Psychiatry,
175(7):631–648.

Shea, J. (1997). Instrument relevance in multivariate linear models: A simple measure. Review of

Economics and Statistics, 79(2):348–352.

Shen, X., Pan, W., and Zhu, Y. (2012). Likelihood-based selection and sharp parameter estimation.

Journal of the American Statistical Association, 107(497):223–232.

Shen, X., Pan, W., Zhu, Y., and Zhou, H. (2013). On constrained and regularized high-dimensional

regression. Annals of the Institute of Statistical Mathematics, 65(5):807–832.

Shi, R., Wang, L., Burgess, S., and Cui, Y. (2024). Mr-split: a novel method to address selection
and weak instrument bias in one-sample mendelian randomization studies. PLOS Genetics,
20(9):e1011391.

Shi, R., Wang, L., and Cui, Y. (2025). Mr-split+: a unified method for many weak and invalid instru-
ments with selection bias control in one-sample mendelian randomization studies. Manuscript
submitted for publication.

104

Singh, R., Sahani, M., and Gretton, A. (2019). Kernel instrumental variable regression. Advances

in Neural Information Processing Systems, 32.

Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation

models. Sociological Methodology, 13:290–312.

Sproviero, W., Winchester, L., Newby, D., Fernandes, M., Shi, L., Goodday, S. M., Prats-Uribe,
A., Alhambra, D. P., Buckley, N. J., and Nevado-Holgado, A. J. (2021). High blood pressure and
risk of dementia: a two-sample mendelian randomization study in the uk biobank. Biological
psychiatry, 89(8):817–824.

Staiger, D. and Stock, J. H. (1994).

Instrumental variables regression with weak instruments.

Working Paper 151, National Bureau of Economic Research.

Staiger, D. and Stock, J. H. (1997).
Econometrica, 65(3):557–586.

Instrumental variables regression with weak instruments.

Stock, J. H. and Yogo, M. (2002). Testing for weak instruments in linear iv regression.

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green,
J., Landray, M., et al. (2015). Uk biobank: an open access resource for identifying the causes of
a wide range of complex diseases of middle and old age. PLOS Medicine, 12(3):e1001779.

Sulc, J., Sjaarda, J., and Kutalik, Z. (2022). Polynomial mendelian randomization reveals non-linear

causal effects for obesity-related traits. Human Genetics and Genomics Advances, 3(3).

Thomas, D. C., Lawlor, D. A., and Thompson, J. R. (2007). Re: Estimation of bias in nongenetic

observational studies using" mendelian triangulation" by bautista et al.

Thomas, G., Xie, D., Chen, H.-Y., Anderson, A. H., Appel, L. J., Bodana, S., Brecklin, C. S.,
Drawz, P., Flack, J. M., Miller III, E. R., et al. (2016). Prevalence and prognostic significance
of apparent treatment resistant hypertension in chronic kidney disease: report from the chronic
renal insufficiency cohort study. Hypertension, 67(2):387–396.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society: Series B (Methodological), 58(1):267–288.

Torroja, L., Ortuño-Sahagún, D., Ferrús, A., Hämmerle, B., and Barbas, J. A. (1998). scully,
is homologous to mammalian mitochondrial type ii l-3-
an essential gene of drosophila,
hydroxyacyl-coa dehydrogenase/amyloid-𝛽 peptide-binding protein. The Journal of Cell Biology,
141(4):1009–1017.

TRYSELIUS, Y., Samakovlis, C., Kimbrell, D. A., and Hultmark, D. (1992). Cecc, a cecropin
gene expressed during metamorphosis in drosophila pupae. European Journal of Biochemistry,,
204(1):395–399.

105

Verbanck, M., Chen, C.-Y., Neale, B., and Do, R. (2018). Detection of widespread horizontal
pleiotropy in causal relationships inferred from mendelian randomization between complex
traits and diseases. Nature Genetics, 50(5):693–698.

Verleyen, P., Baggerman, G., D’Hertog, W., Vierstraete, E., Husson, S. J., and Schoofs, L. (2006).
Identification of new immune induced molecules in the haemolymph of drosophila melanogaster
by 2d-nanolc ms/ms. Journal of Insect Physiology, 52(4):379–388.

Wainwright, K. et al. (2005). Fundamental methods of mathematical economics. McGraw-Hill.

Wald, A. (1940). The fitting of straight lines if both variables are subject to error. The Annals of

Mathematical Statistics, 11(3):284–300.

Wang, S. and Kang, H. (2022). Weak-instrument robust tests in two-sample summary-data

mendelian randomization. Biometrics, 78(4):1699–1713.

Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of Statistics,

37(5A):2178.

Windmeijer, F., Liang, X., Hartwig, F. P., and Bowden, J. (2021). The confidence interval method
for selecting valid instrumental variables. Journal of the Royal Statistical Society Series B:
Statistical Methodology, 83(4):752–776.

Wooldridge, J. M. (2015). Control function methods in applied econometrics. Journal of Human

Resources, 50(2):420–445.

Xing, G., Gan, G., Chen, D., Sun, M., Yi, J., Lv, H., Han, J., and Xie, W. (2014). Drosophila
neuroligin3 regulates neuromuscular junction development and synaptic differentiation. Journal
of Biological Chemistry, 289(46):31867–31877.

Yamanaka, N., Romero, N. M., Martin, F. A., Rewitz, K. F., Sun, M., O’Connor, M. B., and
Léopold, P. (2013). Neuroendocrine control of drosophila larval light preference. Science,
341(6150):1113–1116.

Yang, J., Lee, S. H., Goddard, M. E., and Visscher, P. M. (2011). Gcta: a tool for genome-wide

complex trait analysis. The American Journal of Human Genetics, 88(1):76–82.

Ye, T., Liu, Z., Sun, B., and Tchetgen Tchetgen, E. (2024). Genius-mawii: for robust mendelian
randomization with many weak invalid instruments. Journal of the Royal Statistical Society
Series B: Statistical Methodology, 86(4):1045–1067.

You, H., Lattmann, S., Rhodes, D., and Yan, J. (2017). Rhau helicase stabilizes g4 in its nucleotide-
free state and destabilizes g4 upon atp hydrolysis. Nucleic Acids Research, 45(1):206–214.

Yu, Z., Coresh, J., Qi, G., Grams, M., Boerwinkle, E., Snieder, H., Teumer, A., Pattaro, C., Köttgen,

106

A., Chatterjee, N., et al. (2020). A bidirectional mendelian randomization study supports causal
effects of kidney function on blood pressure. Kidney international, 98(3):708–716.

Yuan, Z., Liu, L., Guo, P., Yan, R., Xue, F., and Zhou, X. (2022). Likelihood-based mendelian
randomization analysis with automated instrument selection and horizontal pleiotropic modeling.
Science Advances, 8(9):eabl5744.

Yusni, Y., Rahman, S., and Naufal, I. (2024). Positive correlation between body weight and body

mass index with blood pressure in young adults. Narra J, 4(1):e533.

Zhang, F. (2006). The Schur complement and its applications, volume 4. Springer Science &

Business Media.

Zhao, G., Lu, Z., Sun, Y., Kang, Z., Feng, X., Liao, Y., Sun, J., Zhang, Y., Huang, Y., and Yue, W.
(2023). Dissecting the causal association between social or physical inactivity and depression: a
bidirectional two-sample mendelian randomization study. Translational Psychiatry, 13(1):194.

Zhao, H. and Wang, T. (2020). Pe homeostasis rebalanced through mitochondria-er lipid exchange

prevents retinal degeneration in drosophila. PLOS Genetics, 16(10):e1009070.

Zhao, Q., Wang, J., Hemani, G., Bowden, J., and Small, D. S. (2020). Statistical inference in

two-sample summary-data mendelian randomization using robust adjusted profile score.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical

Association, 101(476):1418–1429.

107

APPENDIX A

SUPPLEMENTARY MATERIALS

A.1 Codes

The R codes for MR-SPLIT can be freely accessed at: https://github.com/RuxinShi/MR-SPLIT

The R code to implement MR-SPLIT+ is available at an anonymous GitHub repository (for

peer review): https://anonymous.4open.science/r/MR_SPLIT_plus-CFCB.

A.2 Chapter 2

A.2.1 Proof of Theorem 1

Proof of Theorem For a given sample {𝑋, 𝑌 , 𝐺}, the two stage IV model is defined as,

𝑋 = 𝐺𝛼 + 𝜀1

𝑌 = 𝑋 𝛽 + 𝜀2

(S1)

where (𝜀1, 𝜀2)′ ∼ 𝑁 (0, 𝜎2(

1 𝜌

𝜌 1

)) and the correlation 𝜌 reflects the degree of confounding effect.

Suppose we split the data into two parts, 𝐼1 = {𝑋1, 𝑌1, 𝐺1}, and 𝐼2 = {𝑋2, 𝑌2, 𝐺2}. Each subset

has equal sample size 𝑁/2, where 𝑁 is the total sample size. We first use sample 𝐼1 to identify
major and weak IVs, then use sample 𝐼2 for causal inference. Suppose we have identified 𝑝 (1)
1,𝑊 )′ ∈ R𝑝 (1)
IVs and 𝑝 (1)

2 weak IVs with the estimated effect size denoted as ˆ𝛼1 = ( ˆ𝛼′

1 major
+𝑝 (1)
2

1,𝑀, ˆ𝛼′

1

when regressing exposure 𝑋1 with the SNPs in 𝐺1.

In sample 𝐼2, MR-SPLIT combines the selected weak IVs into a new composite IV and uses it

as an IV along with the major IVs:

ˆ𝐺2 = (𝐺2,𝑀, 𝐺2,𝑊 ˆ𝛼1,𝑊 ) ∈ R 𝑁

2 ×( 𝑝 (1)

1

+1)

(S2)

Then, we can apply the stage one of 2SLS in sample 𝐼2 using these IVs and get the estimates of the

exposure in sample 𝐼2:

ˆ𝑋2 = ˆ𝐺2( ˆ𝐺′
2

ˆ𝐺2)−1 ˆ𝐺′
2

𝑋2 = 𝐻 ˆ𝐺2

𝑋2,

where 𝐻𝑋 = 𝑋 (𝑋′𝑋)−1𝑋′ for any matrix 𝑋.

108

Similarly, we can also get the estimates of the exposure in sample 𝐼1 by using sample 𝐼2 to select

the major and weak IVs:

Let ˆ𝑋 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

ˆ𝑋1

ˆ𝑋2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

, 𝑌 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝑌1

𝑌2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

ˆ𝑋1 = ˆ𝐺1( ˆ𝐺′
1

ˆ𝐺1)−1 ˆ𝐺′
1

𝑋1 = 𝐻 ˆ𝐺1

𝑋1

. In stage two, we get the estimate of MR-SPLIT as

ˆ𝛽 = ( ˆ𝑋′ ˆ𝑋)−1 ˆ𝑋′𝑌

= 𝛽 + (𝑋′
1

𝐻 ˆ𝐺1

𝑋1 + 𝑋′
2

𝐻 ˆ𝐺2

𝑋2)−1(𝑋′
1

𝐻 ˆ𝐺1

𝜀2,1 + 𝑋′
2

𝐻 ˆ𝐺2

𝜀2,2)

and write 𝜀2 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝜀2,1

𝜀2,2

.

(cid:170)
(cid:174)
(cid:174)
(cid:172)

For CFMR, it combines all selected IVs into a single IV. We use the subscript 𝐶 to denote

variables used in CFMR:

ˆ𝐺2,𝐶 = (𝐺2,𝑀 ˆ𝛼1,𝑀, 𝐺2,𝑊 ˆ𝛼1,𝑊 ) = 𝐺2 ˆ𝛼1 ∈ R𝑛×1

Similarly, in sample 𝐼1, we combine 𝐺1 and get:

ˆ𝐺1,𝐶 = 𝐺1 ˆ𝛼2 ∈ R𝑛×1

(S3)

(S4)

For CFMR, let ˆ𝐺𝐶 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

ˆ𝐺1,𝐶

ˆ𝐺2,𝐶

(cid:170)
(cid:174)
(cid:174)
(cid:172)

, 𝑋 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝑋1

𝑋2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

. Apply 2SLS on {𝑋, 𝑌 , ˆ𝐺𝐶 } we get

ˆ𝛽𝐶 = (𝑋′𝐻 ˆ𝐺𝐶

𝑋)−1𝑋′𝐻 ˆ𝐺𝐶

𝑌

= 𝛽 + (𝑋′𝐻 ˆ𝐺𝐶

𝑋)−1𝑋′𝐻 ˆ𝐺𝐶

𝜀2

In the following, we will show that

𝑣𝑎𝑟 ( ˆ𝛽) ≤ 𝑣𝑎𝑟 ( ˆ𝛽𝐶)

109

where ˆ𝛽 denotes the estimate by MR-SPLIT. Since 𝑣𝑎𝑟 ( ˆ𝛽) = (𝑋′
1

𝐻 ˆ𝐺1
𝑋)−1𝜎2, to prove 𝑣𝑎𝑟 ( ˆ𝛽) ≤ 𝑣𝑎𝑟 ( ˆ𝛽𝐶), we need to show

𝑣𝑎𝑟 ( ˆ𝛽𝐶) = (𝑋′𝐻 ˆ𝐺𝐶

𝑋1 + 𝑋′
2

𝐻 ˆ𝐺2

𝑋2)−1𝜎2,

𝑋′
1

𝐻 ˆ𝐺1

𝑋1 + 𝑋′
2

𝐻 ˆ𝐺2

𝑋2 ≥ 𝑋′𝐻 ˆ𝐺𝐶

𝑋

𝐻 ˆ𝐺1

⇐⇒ 𝑋′ (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝐻 ˆ𝐺2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

𝑋 ≥ 𝑋′𝐻 ˆ𝐺𝐶

𝑋

𝐻 ˆ𝐺1

⇐⇒ 𝑋′ (cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

𝐻 ˆ𝐺2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

− 𝐻 ˆ𝐺𝐶

(cid:170)
(cid:174)
(cid:174)
(cid:172)

𝑋 ≥ 0

Hence, it is sufficient to show

𝐻 ˆ𝐺1

𝐻 ˆ𝐺2

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

− 𝐻 ˆ𝐺𝐶 ⪰ 0

(S5)

where for any matrix 𝑋, 𝑋 ⪰ 0 means it is positive semi-definite.

Recall that ˆ𝐺𝐶 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

ˆ𝐺1,𝐶

ˆ𝐺2,𝐶

,

(cid:170)
(cid:174)
(cid:174)
(cid:172)

𝐻 ˆ𝐺𝐶 = ˆ𝐺𝐶 ( ˆ𝐺′

𝐶 ˆ𝐺𝐶)−1 ˆ𝐺′

𝐶 =

1
ˆ𝐺1,𝐶 + ˆ𝐺′

2,𝐶

ˆ𝐺′

1,𝐶

ˆ𝐺2,𝐶

ˆ𝐺1,𝐶 ˆ𝐺′

1,𝐶

ˆ𝐺1,𝐶 ˆ𝐺′

2,𝐶

ˆ𝐺2,𝐶 ˆ𝐺′

1,𝐶

ˆ𝐺2,𝐶 ˆ𝐺′

2,𝐶

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

Let 𝑎 = ˆ𝐺′

1,𝐶

ˆ𝐺1,𝐶 + ˆ𝐺′

2,𝐶

ˆ𝐺2,𝐶 ∈ R, it remains to show

ˆ𝐺1,𝐶 ˆ𝐺′
𝑎

1,𝐶

−
ˆ𝐺2,𝐶 ˆ𝐺′
𝑎

1,𝐶

𝐻 ˆ𝐺1
−

(cid:169)
(cid:173)
(cid:173)
(cid:171)

2,𝐶

ˆ𝐺1,𝐶 ˆ𝐺′
𝑎
ˆ𝐺2,𝐶 ˆ𝐺′
𝑎

−

2,𝐶

⪰ 0

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(S6)

−

𝐻 ˆ𝐺2

110

From Eq. S2 - S4, we can get

(cid:16)

𝐻 ˆ𝐺2

=

𝐺2,𝑀 𝐺2,𝑊 ˆ𝛼1,𝑊

(cid:17) (cid:16) ˆ𝐺′

2

ˆ𝐺2

(cid:17) −1 (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝐺′

2,𝑀
1,𝑊 𝐺′
ˆ𝛼′

2,𝑊

(cid:170)
(cid:174)
(cid:174)
(cid:172)

𝐴2 𝐵2

=( 𝐺2,𝑀 𝐺2,𝑊 ˆ𝛼1,𝑊 ) (cid:169)
(cid:173)
(cid:173)
(cid:171)
2,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊𝐶2𝐺′

=𝐺2,𝑀 𝐴2𝐺′

𝐶2 𝐷2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

𝐺′

2,𝑀
1,𝑊 𝐺′
ˆ𝛼′

2,𝑊

(cid:170)
(cid:174)
(cid:174)
(cid:172)
1,𝑊 𝐺′

2,𝑀 + 𝐺2,𝑀 𝐵2 ˆ𝛼′

2,𝑊 + 𝐺2,𝑊 ˆ𝛼1,𝑊 𝐷2 ˆ𝛼′

1,𝑊 𝐺′

2,𝑊

(cid:16)

ˆ𝐺2,𝐶 ˆ𝐺′
𝑎

2,𝐶

=

=

1
𝑎

1
𝑎

𝐺2,𝑀 𝐺2,𝑊

(𝐺2,𝑀 ˆ𝛼1,𝑀 ˆ𝛼′

ˆ𝛼1,𝑀 ˆ𝛼′

1,𝑀 ˆ𝛼1,𝑀 ˆ𝛼′
1,𝑀 ˆ𝛼1,𝑊 ˆ𝛼′

(cid:17) (cid:169)
(cid:173)
(cid:173)
(cid:171)
2,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊 ˆ𝛼′
1,𝑀𝐺′

ˆ𝛼1,𝑊 ˆ𝛼′

1,𝑊

1,𝑊

1,𝑀𝐺′

2,𝑀

𝐺′

2,𝑀

𝐺′

2,𝑊

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

+ 𝐺2,𝑀 ˆ𝛼1,𝑀 ˆ𝛼′

1,𝑊 𝐺′

2,𝑊 + 𝐺2,𝑊 ˆ𝛼1,𝑊 ˆ𝛼′

1,𝑊 𝐺′

2,𝑊 )

Therefore,

𝐻 ˆ𝐺2

−

ˆ𝐺2,𝐶 ˆ𝐺′
𝑎

2,𝐶

(cid:32)

=𝐺2,𝑀

𝐴2 −

(cid:33)

ˆ𝛼1,𝑀 ˆ𝛼′
𝑎

1,𝑀

(cid:32)

𝐺′

2,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊

𝐶2 −

(cid:33)

ˆ𝛼′
1,𝑀
𝑎

𝐺′

2,𝑀

+ 𝐺2,𝑀

(cid:18)

𝐵2 −

(cid:19)

ˆ𝛼1,𝑀
𝑎

1,𝑊 𝐺′
ˆ𝛼′

2,𝑊 + 𝐺2,𝑊 ˆ𝛼1,𝑊 (𝐷2 −

1
𝑎

) ˆ𝛼′

1,𝑊 𝐺′

2,𝑊

(cid:16)

=

𝐺2,𝑀 𝐺2,𝑊 ˆ𝛼1,𝑊

= ˆ𝐺2𝑄4 ˆ𝐺′
2

Similarly,

𝐻 ˆ𝐺1

−

ˆ𝐺1,𝐶 ˆ𝐺′
𝑎

1,𝐶

(cid:16)

=

𝐺1,𝑀 𝐺1,𝑊 ˆ𝛼2,𝑊

= ˆ𝐺1𝑄1 ˆ𝐺′
1

(cid:17) (cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:17) (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝐴2 −

1,𝑀

ˆ𝛼1,𝑀 ˆ𝛼′
𝑎
ˆ𝛼′
1,𝑀
𝑎

𝐶2 −

𝐴1 −

2,𝑀

ˆ𝛼2,𝑀 ˆ𝛼′
𝑎
ˆ𝛼′
2,𝑀
𝑎

𝐶1 −

𝑎

𝐵2 − ˆ𝛼1,𝑀
𝐷2 − 1
𝑎

𝑎

𝐵1 − ˆ𝛼2,𝑀
𝐷1 − 1
𝑎

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

𝐺′

2,𝑀
1,𝑊 𝐺′
ˆ𝛼′

2,𝑊

𝐺′

1,𝑀
2,𝑊 𝐺′
ˆ𝛼′

1,𝑊

111

(S7)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(S8)

(S9)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(S10)

(S11)

(S12)

(S13)

(S14)

(S15)

Easily, we can also get

ˆ𝐺1,𝐶 ˆ𝐺′
𝑎

2,𝐶

−

= −

ˆ𝐺2,𝐶 ˆ𝐺′
𝑎

1,𝐶

−

= −

= −

= −

(cid:16)

𝐺1,𝑀 𝐺1,𝑊 ˆ𝛼2,𝑊

ˆ𝐺1𝑄2 ˆ𝐺′
2

(cid:16)

𝐺2,𝑀 𝐺2,𝑊 ˆ𝛼1,𝑊

(cid:17) (cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:17) (cid:169)
(cid:173)
(cid:173)
(cid:171)

ˆ𝐺2𝑄3 ˆ𝐺′
1

1
𝑎

1
𝑎

1
𝑎

1
𝑎

ˆ𝛼2,𝑀 ˆ𝛼1,𝑀 ˆ𝛼2,𝑀

ˆ𝛼′
1,𝑀

1

ˆ𝛼1,𝑀 ˆ𝛼2,𝑀 ˆ𝛼1,𝑀

ˆ𝛼′
2,𝑀

1

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

Apply Eq.S8 - S11 to Eq. S6, we have

(cid:16)

ˆ𝐺1

ˆ𝐺2

(cid:17) (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝑄1 𝑄2

𝑄3 𝑄4

ˆ𝐺′
1
ˆ𝐺′
2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:169)
(cid:173)
(cid:173)
(cid:171)

⪰ 0

𝐺′

2,𝑀
1,𝑊 𝐺′
ˆ𝛼′

2,𝑊

𝐺′

1,𝑀
2,𝑊 𝐺′
ˆ𝛼′

1,𝑊

We now only need to show that

We first show that

Recall that

𝑄4 = (cid:169)
(cid:173)
(cid:173)
(cid:171)

𝑄1 𝑄2

𝑄3 𝑄4

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

⪰ 0.

𝐴2 −

1,𝑀

ˆ𝛼1,𝑀 ˆ𝛼′
𝑎
ˆ𝛼′
1,𝑀
𝑎

𝐶2 −

𝑎

𝐵2 − ˆ𝛼1,𝑀
𝐷2 − 1
𝑎

≻ 0.

(cid:170)
(cid:174)
(cid:174)
(cid:172)

𝐴2 𝐵2

𝐶2 𝐷2

(cid:169)
(cid:173)
(cid:173)
(cid:171)

(cid:170)
(cid:174)
(cid:174)
(cid:172)

= ( ˆ𝐺′
2

ˆ𝐺2)−1

𝐺′

2,𝑀𝐺2,𝑀

𝐺′

2,𝑀𝐺2,𝑊 ˆ𝛼1,𝑊

1,𝑊 𝐺′
ˆ𝛼′

2,𝑊 𝐺2,𝑀 ˆ𝛼′

1,𝑊 𝐺′

2,𝑊 𝐺2,𝑊 ˆ𝛼1,𝑊

= (cid:169)
(cid:173)
(cid:173)
(cid:171)

−1

(cid:170)
(cid:174)
(cid:174)
(cid:172)

112

Thus,

𝐴2 = (𝐺′

2,𝑀𝐺2,𝑀)−1 +

(𝐺′

2,𝑀𝐺2,𝑊 ˆ𝛼1,𝑊 ˆ𝛼′
2,𝑀𝐺2,𝑀)−1(𝐺′
1,𝑊 𝐺′
ˆ𝛼′

1,𝑊 𝐺′
2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊

2,𝑊 𝐺2,𝑀) (𝐺′

2,𝑀𝐺2,𝑀)−1

,

𝐵2 = −

𝐶2 = −

(𝐺′
ˆ𝛼′
1,𝑊 𝐺′
1,𝑊 𝐺′
ˆ𝛼′
1,𝑊 𝐺′
ˆ𝛼′

2,𝑀𝐺2,𝑀)−1𝐺′

2,𝑀𝐺2,𝑊 ˆ𝛼1,𝑊
2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊
2,𝑀𝐺2,𝑀)−1
2,𝑊 𝐺2,𝑊 (𝐺′
2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊
.

1

1,𝑊 𝐺′
ˆ𝛼′

2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊

,

,

𝐷2 =

Then,

Since

𝐷2 −

1
𝑎

=

=

1

ˆ𝛼′
1,𝑊 𝐺′

2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊

1

ˆ𝛼′
1,𝑊 𝐺′

2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊

−

1
ˆ𝐺1,𝐶 + ˆ𝐺′

2,𝐶

ˆ𝐺′

1,𝐶

ˆ𝐺2,𝐶

−

ˆ𝐺′

1,𝐶

ˆ𝐺1,𝐶 + (𝐺2,𝑀 ˆ𝛼1,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊 )′(𝐺2,𝑀 ˆ𝛼1,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊 )

1

(𝐺2,𝑀 ˆ𝛼1,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊 )′(𝐺2,𝑀 ˆ𝛼1,𝑀 + 𝐺2,𝑊 ˆ𝛼1,𝑊 ) − ˆ𝛼′

1,𝑊 𝐺′

2,𝑊 (𝐼 − 𝐻𝐺2,𝑀 )𝐺2,𝑊 ˆ𝛼1,𝑊

=(𝐺2,𝑀 ˆ𝛼1,𝑀 + 𝐻𝐺2,𝑀 𝐺2,𝑊 ˆ𝛼1,𝑊 )′(𝐺2,𝑀 ˆ𝛼1,𝑀 + 𝐻𝐺2,𝑀 𝐺2,𝑊 ˆ𝛼1,𝑊 ) > 0,

(S16)

we get 𝐷2 − 1

𝑎 > 0.

To prove S14, it is sufficient to show (Zhang, 2006)

𝐴2 −

1,𝑀

ˆ𝛼1,𝑀 ˆ𝛼′
𝑎
ˆ𝛼1,𝑀 ˆ𝛼′
𝑎

1,𝑀

− (𝐵2 −

ˆ𝛼1,𝑀
𝑎

) (𝐷2 −

)−1(𝐶2 −

ˆ𝛼′
1,𝑀
𝑎

) ≻ 0

1
𝑄
ˆ𝛼1,𝑀
𝑎
−1 ˜𝐵2 ˜𝐶2 ˜𝐴2

1
𝑎
1,𝑀 + ˜𝐴2

) (𝐶2 −

ˆ𝛼′
1,𝑀
𝑎
−1 − ˆ𝛼1,𝑀 ˜𝐶2 ˜𝐴2

) ≻ 0

⇐⇒( 𝐴2 −

)(𝐷2 −

) − (𝐵2 −

⇐⇒(𝑎 −

1
𝐷2

) ˜𝐴2

−1 − ˆ𝛼1,𝑀 ˆ𝛼′

−1 − ˜𝐴2

−1 ˜𝐵2 ˆ𝛼′

1,𝑀 ≻ 0

(S17)

˜𝐴2

˜𝐵2

where (cid:169)
(cid:173)
(cid:173)
(cid:171)

ˆ𝛼1,𝑀)( ˜𝐴2

= ˆ𝐺′
2

ˆ𝐺2. The left side of Eq. S17 can be obtained as (𝑎 − 1
𝐷2

˜𝐶2
−1𝐵2 − ˆ𝛼1,𝑀)′, which is easy to verify to be a positive definite matrix.

˜𝐷2

(cid:170)
(cid:174)
(cid:174)
(cid:172)

) ˜𝐴2

−1 + ( ˜𝐴2

−1𝐵2 −

113

To prove Eq. S13, now we only need to prove

𝑄1 − 𝑄2𝑄−1
4

𝑄3 ⪰ 0.

(S18)

From Eq. S14, we have

4 = (cid:169)
𝑄−1
(cid:173)
(cid:173)
(cid:171)
(cid:18)

=

( ˆ𝐺′
2

ˆ𝐺2)−1 −

1
𝑎

( ˆ𝐺′
2

ˆ𝐺2)−1 −

(cid:170)
(cid:174)
(cid:174)
(cid:172)

ˆ𝛼1,𝑀

1

(cid:169)
(cid:173)
(cid:173)
(cid:171)
𝑏2𝑏′
2

(cid:19) −1

1
𝑎
ˆ𝐺2
ˆ𝐺2𝑏2𝑏′
ˆ𝐺′
2
2
ˆ𝐺′
ˆ𝐺2𝑏2 − 𝑎
2

ˆ𝐺′
2
𝑏′
2

= ˆ𝐺′
2

ˆ𝐺2 −

ˆ𝛼′
1,𝑀 1

−1

(cid:17)(cid:170)
(cid:174)
(cid:174)
(cid:172)

(cid:16)

,

(S19)

(cid:16)

where 𝑏2 =
(cid:16)

𝑏1 =

ˆ𝛼′
2,𝑀, 1

ˆ𝛼′
1,𝑀, 1
(cid:17)′
,

(cid:17)′

, and the third equation utilizes the Woodbury matrix identity. Similarly, let

ˆ𝐺1)−1 −

1
𝑎

𝑏1𝑏′
1

,

𝑄2 = −

𝑄1 = ( ˆ𝐺′
1
1
𝑎
1
𝑎

𝑄3 = −

𝑏1𝑏′
2

𝑏2𝑏′
1

,

.

Substituting Eq. S19 -S22 into Eq. S18, we get

( ˆ𝐺′
1

ˆ𝐺1)−1 −

1
𝑎

𝑏1𝑏′

1 −

1
𝑎2

𝑏1𝑏′

2( ˆ𝐺′

2

⇐⇒( ˆ𝐺′
1

ˆ𝐺1)−1 − 𝑏1(

⇐⇒( ˆ𝐺′
1

ˆ𝐺1)−1 − 𝑏1(

1
𝑎

1
𝑎

+

+

1
𝑎2

2( ˆ𝐺′
𝑏′

2

ˆ𝐺2 −

ˆ𝐺′

2,𝐶

ˆ𝐺2,𝐶
𝑎2

−

)𝑏2𝑏′

1 ⪰ 0

)𝑏2)𝑏′

1 ⪰ 0

ˆ𝐺2 −

ˆ𝐺′
ˆ𝐺2𝑏2𝑏′
ˆ𝐺′
ˆ𝐺2
2
2
2
ˆ𝐺′
ˆ𝐺2𝑏2 − 𝑎
𝑏′
2
2
ˆ𝐺2𝑏2𝑏′
ˆ𝐺2
ˆ𝐺′
2
2
ˆ𝐺2𝑏2 − 𝑎
ˆ𝐺′
2
ˆ𝐺2,𝐶)2
ˆ𝐺2,𝐶 − 𝑎3

)𝑏′

2,𝐶

ˆ𝐺′
2
𝑏′
2
( ˆ𝐺′

𝑎2 ˆ𝐺′

2,𝐶

1 ⪰ 0

⇐⇒( ˆ𝐺′
1

ˆ𝐺1)−1 −

1
ˆ𝐺1,𝐶

ˆ𝐺′

𝑏1𝑏′

1 ⪰ 0

1,𝐶
Eq. S23 has a very similar structure as Eq. S14. It can be written as

𝐴1 −

𝐶1 −

ˆ𝛼2,𝑀 ˆ𝛼′
ˆ𝐺′

2,𝑀
ˆ𝐺1,𝐶

1,𝐶
ˆ𝛼′
2,𝑀
ˆ𝐺1,𝐶

1,𝐶

ˆ𝐺′

𝐵1 − ˆ𝛼2,𝑀
ˆ𝐺1,𝐶

ˆ𝐺′

1,𝐶

𝐷1 −

1
ˆ𝐺1,𝐶

ˆ𝐺′

1,𝐶

(cid:169)
(cid:173)
(cid:173)
(cid:171)

⪰ 0,

(cid:170)
(cid:174)
(cid:174)
(cid:172)

which can be easily verified. This completes the proof of the theorem.

114

(S20)

(S21)

(S22)

(S23)

(S24)

□

A.2.2 Results of selecting major IVs under different partial 𝐹 values

Table A.1 shows the results of distinguishing major and weak IVs with different partial 𝐹

thresholds. We also showed the results with the criterion of 𝐹 > 100. As demonstrated, this is

exceedingly conservative and is generally best avoided. Furthermore, we recognize that a heritability

(ℎ2 = 0.5) is considerably high for many exposure traits in practical scenarios, representing

situations that are relatively uncommon in reality.

Figure A.1 Mean numbers of being identified as major IV using different thresholds in 1,000
simulations.

115

Table A.1 Mean numbers of being identified as major IV using different criteria in 1000
simulations. For the noise category, it was aggregated over the 295 null IVs.

ℎ2

𝑁

500

0.15

1000

2000

500

0.3

1000

2000

500

0.5

1000

2000

Criteria SNP1 SNP2 SNP3 SNP4 SNP5 Noises (×295)
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100
F>10
F>30
F>50
F>100

0.5
0.05
0
0
0.95
0.35
0
0
1
0.95
0.75
0
0.8
0.55
0.25
0
1
1
0.7
0.05
1
1
1
1
1
1
0.9
0.3
1
1
1
1
1
1
1
1

0.15
0
0
0
0.35
0
0
0
0.75
0.1
0
0
0.25
0
0
0
0.75
0.2
0.05
0
1
0.6
0.1
0
0.8
0.35
0
0
1
0.8
0.4
0.05
1
1
1
0.35

1.25
0
0
0
0.65
0
0
0
0.6
0
0
0
1.15
0
0
0
0.65
0
0
0
0.6
0
0
0
1.8
0
0
0
0.6
0
0
0
0.4
0
0
0

0.1
0
0
0
0
0
0
0
0.55
0
0
0
0.1
0
0
0
0.3
0
0
0
0.9
0.1
0.05
0
0.4
0.05
0
0
0.95
0.15
0
0
1
0.9
0.3
0

0.55
0.05
0
0
0.95
0.35
0.15
0
1
1
0.75
0
0.95
0.5
0.25
0
1
1
0.8
0
1
1
1
1
1
1
0.9
0.2
1
1
1
1
1
1
1
1

0
0
0
0
0.1
0
0
0
0.35
0
0
0
0.25
0
0
0
0.45
0
0
0
0.75
0.15
0
0
0.35
0
0
0
1
0.5
0.05
0
1
0.9
0.4
0

116

A.2.3 Boxplots of causal effect estimates for MR-SPLIT, 2SLS, and LIML out of 1000

simulation runs under different scenarios.

LIML and 2SLS both use half of the dataset to select IVs and the other half to get the estimation.

In contrast, LIML_w and 2SLS_w use the whole dataset for IV selection and causal effect estimation.

When using half data to select IVs and another half for causal estimation, both MR-SPLIT and LIML

produce unbiased estimates, though the variance for LIML is larger than MR-SPLIT. However, when

using the whole data for both IV selection and causal effect estimation, LIML_w and 2SLS_w

generate biased causal effect estimation. In either case, 2SLS yields biased effect estimates. This

simulation demonstrates the issue of IV selection bias if it is not properly addressed.

Figure A.2 Boxplots of causal effect estimates ( ˆ𝛽) under ℎ2 = 0.15 and confounding correlation
𝜌 = 0.1 (top) and 𝜌 = 0.2 (bottom).

117

Figure A.3 Boxplots of causal effect estimates ( ˆ𝛽) under ℎ2 = 0.3 and confounding correlation
𝜌 = 0.1 (top) and 𝜌 = 0.2 (bottom).

Figure A.4 Boxplots of causal effect estimates ( ˆ𝛽) under ℎ2 = 0.5 and confounding correlation
𝜌 = 0.1 (top) and 𝜌 = 0.2 (bottom).

118

A.2.4 Boxplots of causal effect estimates for MR-SPLIT and CFMR out of 1000 simulation

runs under different scenarios.

In nearly all scenarios, both CFMR and MR-SPLIT obtained approximately unbiased estimates.

However, it is evident that MR-SPLIT consistently exhibits a smaller variance compared to CFMR.

This can also be seen in the comparison of RMSE (see Figure A.11), where the RMSE of MR-SPLIT

is always noticeably smaller than that of CFMR.

Figure A.5 Boxplots of causal effect estimates ( ˆ𝛽) when ℎ2 = 0.15 (left), 0.2 (middle) , 0.3 (right)
and sample size 𝑁 = 1000 in scenario I (top) and scenario II (bottom).

119

Figure A.6 Boxplots of causal effect estimates ( ˆ𝛽) when ℎ2 = 0.15 (left), 0.2 (middle) , 0.3 (right)
and sample size 𝑁 = 3000 in scenario I (top) and scenario II (bottom).

Figure A.7 Boxplots of causal effect estimates ( ˆ𝛽) when ℎ2 = 0.15 (left), 0.2 (middle) , 0.3 (right)
and sample size 𝑁 = 5000 in scenario I (top) and scenario II (bottom).

120

Figure A.8 Boxplots of causal effect estimates ( ˆ𝛽) when ℎ2 = 0.15 (left), 0.2 (middle) , 0.3 (right)
and sample size 𝑁 = 1000 in scenario I (top) and scenario II (bottom).

Figure A.9 Boxplots of causal effect estimates ( ˆ𝛽) when ℎ2 = 0.15 (left), 0.2 (middle) , 0.3 (right)
and sample size 𝑁 = 3000 in scenario I (top) and scenario II (bottom).

121

Figure A.10 Boxplots of causal effect estimates ( ˆ𝛽) when ℎ2 = 0.15 (left), 0.2 (middle) , 0.3 (right)
and sample size 𝑁 = 5000 in scenario I (top) and scenario II (bottom).

122

A.2.5 RMSE comparison between MR-SPLIT and CFMR out of 1000 simulation runs under

different scenarios.

The RMSE of MR-SPLIT is always smaller than that of CFMR, especially under a small sample

size (e.g., 𝑁 = 1000), indicating the estimation efficiency and consistency of MR-SPLIT compared

to CFMR.

Figure A.11 RNSE comparison between MR-SPLIT and CFMR in Scenario I (top) and II (bottom).

123

A.2.6 Additional type I error and power simulation results for the evaluation of multiple

data splitting

Figure A.12 shows the type I error out of 50 sample splits under ℎ2 = 0.3 and different sample

sizes. When SNP heritability is significant and the sample size is relatively large (for instance,

greater than 1000), the type I error stabilizes, even with a minimal number of sample splits.

Figure A.12 Type I error when ℎ2 = 0.3 and 𝑁 = 500 (left), 1000 (middle), 2000 (right) out of 50
sample splits.

Figure A.13 displays the empirical power under different sample sizes when ℎ2 = 0.3. Compared

to scenarios where ℎ2 = 0.15 or 0.2, fewer splits are required to achieve optimal power. When the

sample size is relatively small, for instance, 𝑁 = 500, the power stabilizes after about 25 sample

splits. As the sample size increases to 1000, a few sample splits are good enough to achieve stable

power. The results suggest that in practice, one can lower the number of sample slits if the estimated

SNP heritability for the exposure is strong and the sample size is large, to save computational time.

Figure A.13 Power performance when ℎ2 = 0.3 and 𝑁 = 500 (left), 1000 (middle), 2000 (right)
out of 50 sample splits.

124

A.2.7 Comparison between the LASSO and LASSO-projection methods.

In this simulation, we used two different methods, LASSO and LASSO-projection, to do the

IV selection. We considered the case with the sample size as 𝑁 = 1, 000, and randomly generated

a set of 300 independent SNPs with their minor allele frequency fixed as 0.3 for all the SNPs. We

randomly chose 5 SNP IVs to generate the exposure variable. The correlation between the error

terms is set to 0.16. The variation in the exposure explained by the 5 SNPs is set to ℎ2 = 0.2.

The real causal effects 𝛽 were set to {−0.08, 0, 0.08}. Both methods produced very similar effect

estimates as revealed by the boxplots in Figure A.14. The LASSO-projection method yielded

smaller type I error (Figure A.15, slightly larger power (Figure A.16 and smaller RMSE (Figure

A.17), compared to the regular LASSO method. We also observed that the total number of IVs

selected by the regular LASSO method is much larger than the LASSO-projection method (Figure

A.18), and the number of major IVs selected by the LASSO-projection method is slightly higher

than the LASSO method (Figure A.19). We observed similar trends under other settings and hence

only reported the results of this scenario.

Figure A.14 Boxplots of LASSO and LASSO-projection estimators.

125

Figure A.15 Type I Er-
ror.

Figure A.16 Power.

Figure A.17 RMSE.

Figure A.18 Total IVs selected.

Figure A.19 Major IVs selected.

126

A.2.8 Additional results for the real data analysis

Figure A.20 Boxplot of eGFR in aTRH positive and negative groups.

Figure A.21 Boxplot of log(uACR) in aTRH positive and negative groups.

127

Figure A.22 Histogram of p-values and causal effect estimates from 50 sample splits when eGFR
is treated as the exposure (F>20 is used to distinguish the major IVs).

Figure A.23 Histogram of uACR (left figure) and log(uACR) (right figure).

128

A.3 Chapter 3

A.3.1 Supplemental figures with multi-sample splitting

Figure A.24 Violin plots of causal effect estimates under different sample split times from 0 to 30.

129

A.3.2 Simulation without noise

Table A.2 The breakdown of valid IV selection results without noise.

N

1000

3000

1000

3000

1000

3000

1000

3000

Case 1

Case 2

Case 3

Case 4

IVs MR-SPLIT+ WIT CIIV sisVIVE
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid

0
3.9
0
4
0
4
0
4
2.4
3
0.4
1.2
1
4.3
0
4

7.7
0.2
7.9
0.1
7.7
0.1
7.9
0
5
1.3
2.9
2.5
4.7
1.3
7.1
0.3

9
0.1
9
0
9
0.1
9
0
8.8
2.8
9
0.4
8.8
2.9
9
0.4

9
0.2
9
0
9
0.2
9
0
8
3.4
9
0.4
7.2
4.7
9
0.5

Note: The numbers in each row represent the average counts of
being identified as valid IVs, given their true identity as valid or
invalid IV, across 1,000 simulations. The true number of valid IVs
is 9.

130

Table A.3 The breakdown of invalid IV selection results without noise.

N

1000

3000

1000

3000

1000

3000

1000

3000

Case 1

Case 2

Case 3

Case 4

IVs MR-SPLIT+ WIT CIIV sisVIVE
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid
valid
invalid

1.3
11.8
1.1
11.9
1.3
11.9
1.1
12
4
10.7
6.1
9.5
4.3
10.7
1.9
11.7

9
8.1
9
8
9
8
9
8
6.6
9
8.6
10.8
8
7.7
9
8.1

0
11.9
0
12
0
11.9
0
12
0.2
9.2
0
11.6
0.2
9.1
0
11.6

0
11.8
0
12
0
11.8
0
12
1
8.6
0
11.6
1.8
7.3
0
11.5

Note: The numbers in each row represent the average counts of
being identified as invalid IVs, given their true identity as valid or
invalid IV, across 1,000 simulations. The true number of invalid
IVs is 12.

131

A.3.3 Simulation with noise IVs

Figure A.25 Violin plots of estimators in simulations with noise.

(a) Absolute bias of estimators.

(b) Coverage Probability.

Figure A.26 Comparison of absolute bias and coverage probability for different methods in simu-
lations with noise.

132

Figure A.27 Plots of False positive rate (FPR) and false negative rate (FNR) for selected IVs in
simulations with noise.

133

Table A.4 The breakdown of valid IV selection results with noise.

N

1000

3000

1000

3000

1000

3000

1000

3000

Case 1

Case 2

Case 3

Case 4

IVs MR-SPLIT+ WIT CIIV sisVIVE
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise

9
0.2
0.2
9
0
0.3
8.9
0.2
0.8
9
0
0.9
6.2
3.8
1.1
8.9
0.4
1
5.4
4.7
1.2
8.8
0.5
1.2

7.2
0.3
0.1
7.6
0.2
0
7.2
0.3
0.2
7.7
0.1
0.2
4.6
1.3
0.4
4.4
1.7
0.2
4
1.2
0.5
6.8
0.3
0.2

0
4.1
0.1
0
4.1
0.1
0
4
0.7
0
4.1
0.7
2.3
2.8
0.9
0.4
1.2
0.7
1.3
3.8
1.1
0
4.1
1.1

9
0.1
0.6
9
0
0.5
9
0.1
1.1
9
0
1.1
6.8
1.8
1.4
8.9
0.3
1.4
5.9
1.9
1.7
8.9
0.3
1.6

Note: The numbers in each row represent the average counts of
being identified as valid IVs, given their true identity as valid,
invalid, or noise IV, across 1,000 simulations. The true number of
valid IVs is 9. Noise refers to variants that have no effect on either
the exposure or the outcome, but are incorrectly classified as valid
IVs.

134

Table A.5 The breakdown of invalid IV selection results with noise.

N

1000

3000

1000

3000

1000

3000

1000

3000

Case 1

Case 2

Case 3

Case 4

IVs MR-SPLIT+ WIT CIIV sisVIVE
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise
valid
invalid
noise

1.8
11.7
0.2
1.4
11.8
0.3
1.8
11.7
0.7
1.3
11.9
0.7
3.2
8.8
0.9
4.6
10.3
1
3.3
8.6
0.9
2.2
11.7
1.1

0
11.8
0.1
0
12
0
0.1
11.8
0.1
0
12
0.1
1.6
6.3
0.2
0.1
11.6
0.3
1.9
5.2
0.1
0.2
11.5
0.2

0
11.9
0
0
12
0
0
11.9
0.1
0
12
0.1
0.1
6.9
0.2
0.1
11.7
0.1
0
6.2
0.2
0.1
11.7
0.1

9
7.9
0.2
9
7.9
0.2
9
8
0.2
9
7.9
0.2
5.4
7.3
0.3
8.6
10.8
0.5
6
6
0.2
9
7.9
0.3

Note: The numbers in each row represent the average counts of
being identified as invalid IVs, given their true identity as valid,
invalid, or noise IV, across 1,000 simulations. The true number of
invalid IVs is 12. Noise refers to variants that have no effect on
either the exposure or the outcome, but are incorrectly classified
as invalid IVs.

135

A.4 Chapter 4

A.4.1 Simulation results

Table A.6 Simulation results of scenario 1 when 𝛽𝑋𝑌 = 𝛽𝑌 𝑋 = 0.

Settings

N

1000

𝛽𝑋𝑌 = 0

2000

4000

1000

𝛽𝑌 𝑋 = 0

2000

4000

Method
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV

Bias
0.0019
0.0495
-0.2195
0.9118
0.0000
0.0191
-0.1965
0.7027
0.0002
0.0030
-0.1785
0.5569
0.0000
0.0498
-0.0304
0.6552
-0.0007
0.0174
-0.0680
0.4530
-0.0003
0.0028
-0.0916
0.4344

Est.sd RMSE CI Width CP
FPR FNR
0.0220
0.00
0.96
0.0219
0.00
0.0674
0.01
0.13
0.68
0.0459
0.3069
NA
0.98 NA
0.2146
0.9856
0.88
0.63
0.11
0.3744
0.0156
0.00
0.00
0.95
0.0156
0.0416
0.01
0.80
0.0370
0.06
0.2515
NA
0.99 NA
0.1570
1.0059
0.52
0.28
0.44
0.7205
0.0119
0.00
0.00
0.95
0.0119
0.0238
0.00
0.87
0.0236
0.01
0.2244
NA
1.00 NA
0.1361
0.9529
0.35
0.17
0.62
0.7739
0.00
0.00
0.95
0.0235
0.0234
0.02
0.65
0.0531 0.0728
0.13
NA
0.99 NA
0.2110
0.2090
0.72
0.58
0.23
0.7702
0.4053
0.00
0.00
0.96
0.0163
0.0163
0.01
0.79
0.0395
0.0355
0.06
NA
1.00 NA
0.1902
0.1778
0.63
0.39
0.33
0.5717
0.3492
0.00
0.00
0.95
0.0114
0.0114
0.00
0.01
0.89
0.0244
0.0242
NA
1.00 NA
0.1919
0.1688
0.70
0.36
0.28
0.5190
0.2843

0.0913
0.1440
1.2850
0.1960
0.0644
0.1037
1.0963
0.1797
0.0453
0.0760
0.9983
0.1293
0.0915
0.1448
1.2782
0.1750
0.0644
0.1036
1.2599
0.1326
0.0453
0.0760
1.2546
0.1009

136

Table A.7 Simulation results of scenario 2 when 𝛽𝑋𝑌 = 𝛽𝑌 𝑋 = 0.

Settings

N

1000

𝛽𝑋𝑌 = 0

2000

4000

1000

𝛽𝑌 𝑋 = 0

2000

4000

Method
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV
Oracle TSLS
BiMR-SPLIT+
MR-Egger
CIIV

Bias
0.0019
0.0328
-0.1046
0.8653
0.0000
0.0304
-0.0167
0.8815
0.0002
0.0256
0.0256
0.9272
0.0000
0.0221
-0.1086
0.8902
-0.0007
0.0239
-0.0164
0.8588
-0.0003
0.0328
0.0211
0.9330

FPR FNR
Est.sd RMSE CI Width CP
0.0220
0.00
0.00
0.96
0.0219
0.1364
0.03
0.89
0.1325
0.10
0.2440
NA
0.99 NA
0.2207
0.9387
0.86
0.69
0.12
0.3642
0.0156
0.00
0.00
0.95
0.0156
0.1559
0.04
0.91
0.1530
0.08
0.1441
NA
1.00 NA
0.1433
0.9403
0.89
0.60
0.11
0.3276
0.0119
0.00
0.00
0.95
0.0119
0.1361
0.04
0.03
0.91
0.1338
0.1142
NA
1.00 NA
0.1114
0.93
0.52
0.06
0.9646
0.2663
0.00
0.95
0.0235
0.00
0.0234
0.02
0.09
0.90
0.1019 0.1041
NA
0.99 NA
0.2569
0.2330
0.89
0.71
0.10
0.9505
0.3334
0.00
0.00
0.96
0.0163
0.0163
0.03
0.91
0.1373
0.1354
0.07
NA
1.00 NA
0.1420
0.1412
0.86
0.58
0.13
0.9298
0.3567
0.00
0.00
0.95
0.0114
0.0114
0.05
0.90
0.1602
0.1569
0.04
NA
1.00 NA
0.1110
0.1091
0.94
0.53
0.05
0.9656
0.2489

0.0913
0.1539
1.3036
0.1865
0.0644
0.1098
1.1288
0.1345
0.0453
0.0785
1.0354
0.1028
0.0915
0.1544
1.3324
0.1875
0.0644
0.1098
1.1223
0.1350
0.0453
0.0787
1.0386
0.1016

137