ENHANCING THE ROBUSTNESS AND TRUSTWORTHINESS OF MACHINE LEARNING MODELS IN DIVERSE DOMAINS
By Shuyang Yu
A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy
2025
ABSTRACT
The rapid advancement of machine learning, particularly over-parameterized deep neural networks (DNNs), has led to significant progress across diverse domains. While the over-parameterization of DNNs gives them the power to capture complex mappings between input data points and target labels, in real-world settings they are inevitably exposed to unseen out-of-distribution (OoD) examples that deviate from the training distribution. This raises critical concerns about the robustness, adaptiveness, and trustworthiness of such models when transferring knowledge from the training domains to unseen test domains. In this thesis, we propose three methods targeting the robustness and adaptiveness of machine learning models. First, to address agnostic data corruption in the source domain, we propose a simple and computationally efficient unsupervised domain adaptation (UDA) approach that enables parallel training of ensemble models. The proposed learning framework can be flexibly combined with existing UDA approaches that are orthogonal to our work to improve their robustness under corrupted data. Second, the rise of large language models (LLMs) pre-trained on vast, web-sourced datasets spanning multiple domains has led to a surge of interest in adapting these models to a wide range of downstream tasks. However, the real-world corpora used in the pre-training stage often exhibit a long-tail distribution, where knowledge from less frequent domains is underrepresented. As a result, LLMs often fail to give correct answers to queries sampled from the long tail of the distribution. To solve this problem, we propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval-augmented in-context learning (ICL) with a budget controller. The system adjusts the ranking of retrieved samples based on LLM feedback, promoting informative and stable examples while demoting misleading ones. Third, while domain adaptation aims to ensure model robustness by maintaining high performance on OoD samples from target domains with domain shifts, OoD detection focuses on model reliability by identifying samples that exhibit semantic shifts. To bridge a critical research gap between OoD detection and federated learning (FL), we propose a privacy-preserving federated OoD synthesizer that exploits data heterogeneity to enhance OoD detection across clients. This approach enables each client to benefit from external class knowledge shared among non-IID participants without compromising data privacy. The model adaptation process can also introduce a new challenge: the risk of unauthorized reproduction or intellectual property (IP) theft, especially for high-value models. To enhance the trustworthiness of models, we introduce two methods for model watermarking. The first is an OoD-based watermarking technique that eliminates the need for training data access, making it suitable for scenarios with strict data confidentiality. The method is both sample-efficient and time-efficient while preserving model utility. The second technique targets federated learning, enabling both ownership verification and leakage tracing, transitioning FL model use from anonymity to accountability.
Copyright by SHUYANG YU 2025 This thesis is dedicated to my parents, Li Liu and Yuecheng Yu, as well as my boyfriend Yu Mei. v ACKNOWLEDGEMENTS This dissertation represents both the end of an incredible chapter and the beginning of an exciting new path. Reaching this point would not have been possible without the unwavering support and encouragement of my advisor, colleagues, friends, and loved ones. First and foremost, I would like to express my profound gratitude to my advisor, Dr. Jiayu Zhou, whose insightful guidance, unwavering support, and encouragement—both aca- demically and personally—have been invaluable throughout my Ph.D. journey. I would also like to extend my appreciation to my committee members, Dr. Qiben Yan, Dr. Pang-Ning Tan, and Dr. Sijia Liu for generously sharing their time, expertise, and constructive feedback, which greatly enriched the quality and direction of this dissertation. I feel incredibly fortunate to have worked alongside such supportive and motivating col- leagues during the past five years. I would like to thank all my collaborators in the Intelligent Data Analytics (ILLIDAN) Lab: Dr. Junyuan Hong, Dr. Zhuangdi Zhu, Dr. Boyang Liu, Dr. Mengying Sun, Dr. Kaixaing Lin, Dr. Sumyeong Ahn, Yijiang Pang, Haobo Zhang, Siqi Liang, Jiankun Wang, Lingxiao Li, and Haohao Zhu. I would also like to extend my deepest appreciation to my other collaborators: Dr. Haotao Wang, Dr. Zhangyang Wang, Yi Zeng, Bairu Hou, Jiabao Ji, Dr. Shiyu Chang, Dr. Runxue Bao, Dr. Cao Xiao, Dr. Lingjuan Lyu, Dr. Ruoxi Jia, Dr. Anil K. Jain, Dr. Hiroko H. Dodge, and Dr. Fei Wang. Finally, I would like to thank my parents, Li Liu and Yuecheng Yu, as well as my boyfriend Yu Mei, for their support and unconditional love. vi TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Overview of Thesis Structure 1.3 Background and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Out-of-distribution (OoD) . . . . . . . . . . . . . . . . . . . . . . . . In-context Learning (ICL) for large language models (LLMs) . . . . . 1.3.2 1.3.3 Backdoor-based Watermarking for Model Protection . . . . . . . . . 1.3.4 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 ROBUST UNSUPERVISED DOMAIN ADAPTATION FROM A CORRUPTED SOURCE . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Preliminaries of Median of Means . . . . . . . . . . . . . . . . . . . . 2.4.2 Robust UDA via Ensemble Learning . . . . . . . . . . . . . . . . . . 2.4.3 Rationale for Ensemble Voting . . . . . . . . . . . . . . . . . . . . . . 2.4.4 Hypothesis Adaptation by Information Maximization . . . . . . . . . 2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Results and Discussions 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
CHAPTER 3 DYNAMIC UNCERTAINTY RANKING: ENHANCING RETRIEVAL-AUGMENTED IN-CONTEXT LEARNING FOR LONG-TAIL KNOWLEDGE IN LLMS
3.1 Introduction
3.2 Related Work
3.3 Problem Formulation
3.4 Motivation: Uncertainty of In-context Learning
3.5 In-context Learning with Dynamic Uncertainty Ranking
3.5.1 Retrieved Sample Selection
3.5.2 Retriever Training
3.6 Experiments
3.6.1 Experimental Setup
3.6.2 Main Results
3.6.3 Ablation Studies
3.6.4 Case Study
3.6.5 Efficiency Analysis
3.6.6 Transferability Analysis
3.7 Limitations
3.8 Summary
CHAPTER 4 TURNING THE CURSE OF HETEROGENEITY IN FEDERATED LEARNING INTO A BLESSING FOR OUT-OF-DISTRIBUTION DETECTION
4.1 Introduction
4.2 Related Work
4.3 Problem Formulation
4.4 Method
4.4.1 Natural OoD Data in Non-iid FL
4.4.2 Synthesizing External-Class Data from Global Classifier
4.4.3 Filtering Virtual External-Class Samples
4.5 Experiments
4.5.1 Visualization of generated external class samples
4.5.2 Benchmark Results
4.5.3 Qualitative Studies
4.6 Summary
CHAPTER 5 SAFE AND ROBUST WATERMARK INJECTION WITH A SINGLE OOD IMAGE
5.1 Introduction
5.2 Background
5.2.1 DNN Watermarking
5.2.2 Watermark Removal Attack
5.3 Method
5.3.1 Constructing Safe Surrogate Dataset
5.3.2 Robust Watermark Injection
5.4 Experiments
5.4.1 Watermark Injection
5.4.2 Defending Against Fine-tuning & Pruning
5.4.3 Defending Against Model Extraction
5.4.4 Qualitative Studies
5.4.5 Summary
CHAPTER 6 WHO LEAKED THE MODEL? TRACKING IP INFRINGERS IN ACCOUNTABLE FEDERATED LEARNING
6.1 Introduction
6.2 Related Work and Background
6.3 Method
6.3.1 Pitfalls for Watermark Collision
6.3.2 Decodable Unique Watermarking
6.3.3 Injection Optimization with Preserved Utility
6.3.4 Verification
6.4 Experiments
6.4.1 IP Tracking Benchmark
6.4.2 Comparison with traditional backdoor-based watermarks
6.4.3 Robustness
6.4.4 Qualitative Study
6.5 Discussions
6.5.1 Summary
CHAPTER 7 CONCLUSION
7.1 Overview
7.2 Future Work
BIBLIOGRAPHY
APPENDIX A: DYNAMIC UNCERTAINTY RANKING
APPENDIX B: FOSTER
APPENDIX C: SINGLE OOD WATERMARK
APPENDIX D: DUW

CHAPTER 1 INTRODUCTION
1.1 Motivation
The rapid advancement of machine learning models has led to significant improvements in various applications across diverse domains. The over-parameterized design of deep neural networks (DNNs) gives them ultra-high flexibility and the power to capture complex mappings between input data points and target labels. However, in real-world scenarios, DNNs are inevitably exposed to unseen examples that deviate from the training distribution, which are known as out-of-distribution (OoD) samples. In such cases, many challenges arise concerning the robustness, security, and adaptability of models. Domain adaptation (DA) has emerged as an effective solution, which transfers knowledge learned from a related but different domain (i.e. the source domain) to assist the learning of the target domain. In particular, a challenging and practical problem along this line is unsupervised domain adaptation (UDA) [141], in which the target domain has access to only a few unlabeled training samples.
While UDA has been extensively studied for typical machine learning settings, most existing UDA methods are built upon an implicit assumption that the source domain data is clean. Under this assumption, UDA methods are prone to performance degradation when the source domain samples are corrupted, either unintentionally during data collection or deliberately by vicious attackers. Consequently, models learned on the corrupted source data can be easily under attack even on the source domain, not to mention confronting the challenges of domain distribution shift when adapting to the target domain. Such model performance degradation can be exacerbated under attacks. Thus, improving the robustness of UDA when confronting corruption becomes the first challenge to solve in this thesis.

Recently, the rise of large language models (LLMs) (the GPT series [1], LLaMA series [157], Gemini series [154], etc.), pre-trained on vast, web-sourced datasets spanning multiple domains, has led to a surge of interest in adapting these models to a wide range of downstream tasks [191, 163, 62, 167, 92, 175]. However, directly applying these LLMs to specific downstream tasks can be challenging without task-specific adaptations, due to the computational challenges of fine-tuning their vast number of trainable parameters. To address this, we focus on in-context learning (ICL) [13], which prompts the LLMs with a set of examples relevant to the test query without parameter updating. Nevertheless, one critical limitation arises from the nature of the pre-training data: real-world corpora often exhibit a long-tail distribution [111, 118, 25, 147], where knowledge from less frequent domains is underrepresented. As a result, LLMs frequently fail to memorize or generalize to these infrequent patterns [67], leading to degraded performance on long-tail queries. Addressing this limitation by enabling LLMs to effectively capture and utilize long-tail knowledge for downstream tasks forms the second challenge of this thesis.

While DA aims to ensure model robustness by maintaining high performance on OoD samples from target domains with domain shifts, OoD detection focuses on model reliability by identifying samples that exhibit semantic shifts [182]. DNNs tend to make overconfident predictions about what they do not know, and may assign such samples to one of the training classes with high confidence, which is doomed to be wrong [52, 53, 51]. While recent advances show promising OoD detection performance for centralized training, they cannot be easily incorporated into federated learning, where multiple local clients cooperatively train a high-quality centralized model without sharing their raw data [74], even though many security-sensitive OoD detection tasks, such as autonomous driving and voice-recognition authorization, are commonly trained using FL due to data privacy concerns. Hence, OoD detection in FL is another open question to solve in this thesis.

To adapt a model to a new domain, some model users might also fine-tune a pre-trained model on their downstream tasks. This introduces a new challenge: the risk of illegal reproduction or duplication of such high-value models, which are trained with massive amounts of data from different sources, powerful computational resources, and substantial human effort. Therefore, it is essential to protect the intellectual property of the model and the rights of the model owners.
This thesis places emphasis on backdoor-based watermarking, which taints the training dataset by incorporating trigger patches into a set of images referred to as verification samples (the trigger set) and modifying their labels to a designated class, forcing the model to memorize the trigger pattern during fine-tuning. The owner of the model can then perform an intellectual property (IP) inspection by assessing the correspondence between the model's outputs on the verification samples carrying the trigger and the intended target labels. Typically, injection of backdoors requires full or partial access to the original training data. When protecting models, such access can be prohibitive, mostly due to data safety and confidentiality. Examples include someone trying to protect a model fine-tuned upon a foundation model, or a model publisher vending models uploaded by their users. Another example is an independent IP protection department or a third party that is in charge of model protection for redistribution. Yet another scenario is federated learning [74], where the server does not have access to any in-distribution (ID) data but is motivated to inject a watermark to protect the ownership of the global model. Despite the high practical demand, watermark injection without training data is barely explored.

Realizing the potential challenges of prior art, in this thesis, we aim to enhance the robustness, adaptability, and trustworthiness of models across diverse domains. Specifically, for robustness and adaptability, we investigate three problems: 1) the challenges of UDA from corrupted source domains, 2) how to handle the long-tail knowledge of pre-trained LLMs for downstream tasks, and 3) how to learn OoD awareness from non-iid federated collaborators while maintaining the data confidentiality requirements of FL. For trustworthiness, we propose 1) watermarking for model protection that exploits OoD data rather than the original training data, in both centralized and FL settings, and 2) tracking the IP infringer's identity in the FL system.

1.2 Overview of Thesis Structure
This section summarizes each of the chapters in this thesis. Section 1.3 introduces the preliminaries of this thesis, including the basic concepts in OoD, in-context learning for LLMs, backdoor-based watermarking, and federated learning. Chapter 2, Chapter 3, and Chapter 4 focus on enhancing the robustness and adaptiveness of DNNs. Specifically, Chapter 2 elaborates an effective framework to address the challenges of UDA from corrupted source domains in a principled manner. Chapter 3 proposes a reinforcement learning-based dynamic uncertainty ranking method for guiding LLM predictions toward correct answers on long-tail samples from downstream tasks. Chapter 4 discusses how to learn OoD awareness from non-iid federated collaborators while maintaining the data confidentiality requirements of federated learning. Chapter 5 and Chapter 6 focus on enhancing the trustworthiness of the model. Specifically, in Chapter 5, without access to ID samples, we propose a safe and robust backdoor-based watermark injection technique that leverages the diverse knowledge from a single out-of-distribution (OoD) image, which serves as a secret key for IP verification. In Chapter 6, we propose a watermarking method for FL that not only verifies model ownership but also identifies the infringing client upon leakage.
1.3 Background and Preliminaries
In this section, we introduce the basic concepts in domain adaptation, OoD detection, in-context learning for LLMs, backdoor-based watermarking, and federated learning, which compose the cornerstones of our research on the robustness and trustworthiness of models across different domains.

1.3.1 Out-of-distribution (OoD)
In this subsection, we introduce two main tasks for OoD, including domain adaptation, which improves the transferability of models across unknown OoD domains, and OoD detection, which distinguishes OoD samples from ID samples to prevent DNNs' overconfident predictions about what they do not know.

Domain adaptation (DA) has emerged as an effective solution that transfers knowledge learned from a related but different domain (i.e. the source domain) to assist the learning of the target domain. In particular, a challenging and practical problem along this line is unsupervised domain adaptation (UDA), where only unlabeled samples of the target domain are available to assist learning. In this thesis, we will focus on UDA. Denote $P_s^{xy} := P_s(X) \times P_s(Y)$ as the distribution of the source domain, and $P_t^{xy} := P_t(X) \times P_t(Y)$ as the distribution of the target domain, respectively. One can access labeled samples from the source domain, denoted as $D_s := \{x_s^i, y_s^i\}_{i=1}^{N_s} \subset P_s^{xy}$. Accordingly, let $D_t := \{x_t^j\}_{j=1}^{N_t} \subset P_t(X)$ be the set of unlabeled samples accessible in the target domain. Denote the loss function for the target domain as $L: \triangle^{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}_+$, where $\triangle^{\mathcal{Y}}$ is the simplex over the label space, with $|\mathcal{Y}| = C$ denoting the number of unique labels. Let $\Theta$ be the parameter space of the learning model, and $f(\cdot;\theta)$ be the post-activation prediction output of model $\theta \sim \Theta$. The objective for UDA is to optimize the learning model's performance on the target domain:
$$\theta^* = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{x,y \sim P_t^{xy}} \big[L(f(x;\theta), y)\big]. \quad (1.1)$$
In practice, the learning model is derived based on accessible samples from both domains, i.e. $\theta \leftarrow \Phi(D_s, D_t)$, where $\Phi$ is the learning procedure. Without loss of generality, in this work, we focus on single-domain adaptation, and our learning framework can be readily extended to address multi-domain adaptation problems.

OoD detection training. The OoD detection problem is rooted in general supervised learning, where we learn a classifier mapping from the instance space $\mathcal{X}$ to the label space $\mathcal{Y}$. Formally, we define a learning task by the composition of a data distribution $D \subset \mathcal{X}$ and a ground-truth labeling oracle $c^*: \mathcal{X} \to \mathcal{Y}$. Then any $x \sim D$ is denoted as in-distribution (ID) data, and otherwise, $x \sim Q \subset \mathcal{X} \setminus D$ as out-of-distribution data. Hence, an ideal OoD detection oracle can be formulated as a binary classifier $q^*(x) = \mathbb{I}(x \sim D)$, where $\mathbb{I}$ is an indicator function yielding 1 for ID samples and $-1$ for OoD samples. With these notations, we define the OoD learning task as $T := \langle D, Q, c^* \rangle$. To parameterize the labeling and OoD oracles, we use a neural network consisting of two stacked components: a feature extractor $f: \mathcal{X} \to \mathcal{Z}$ governed by $\theta_f$, and a classifier $h: \mathcal{Z} \to \mathcal{Y}$ governed by $\theta_h$, where $\mathcal{Z}$ is the latent feature space. For ease of notation, let $h_i(z)$ denote the predicted logit for class $i = 1, \ldots, c$ on an extracted feature $z \sim \mathcal{Z}$. We unify the parameters of the classifier as $\theta = (\theta_f, \theta_h)$.
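The same feature-extractor/classifier decomposition $\theta = (\theta_f, \theta_h)$ reappears in the federated learning setup of Section 1.3.4, so it is worth making concrete. Below is a minimal PyTorch-style sketch of this parameterization; the architecture, layer sizes, and class name are illustrative assumptions rather than the networks used in later chapters.

```python
# A minimal sketch (not the thesis implementation) of the two-component
# parameterization theta = (theta_f, theta_h): a feature extractor
# f: X -> Z and a classifier head h: Z -> Y. All sizes are assumptions.
import torch
import torch.nn as nn

class FeatureClassifier(nn.Module):
    def __init__(self, feature_dim: int = 128, num_classes: int = 10):
        super().__init__()
        # f(.; theta_f): maps an input image to a latent feature z in Z
        self.featurizer = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feature_dim), nn.ReLU(),
        )
        # h(.; theta_h): maps the latent feature to class logits h_1..h_c
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor):
        z = self.featurizer(x)       # latent feature z
        logits = self.classifier(z)  # h_i(z; theta_h) for each class i
        return z, logits
```

Any backbone that exposes the latent feature $z$ separately from the logits fits the formulations that follow.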
We then formulate the OoD training as minimizing the following loss on the task $T$:
$$J_T(\theta) := \mathbb{E}_{x \sim D}\big[\ell_{CE}\big(h(f(x;\theta_f); \theta_h), c^*(x)\big)\big] + \lambda\, \mathbb{E}_{x' \sim Q}\big[\ell_{OE}\big(f(x';\theta_f); \theta_h\big)\big],$$
where $\ell_{CE}$ is the cross-entropy loss for supervised learning and $\ell_{OE}$ is for OoD regularization. We use $\mathbb{E}[\cdot]$ to denote the expectation, estimated in practice by the empirical average over samples. The non-negative hyper-parameter $\lambda$ trades off the OoD sensitivity in training. We follow the classic OoD training method, Outlier Exposure [53], to define the OoD regularization for the classification problem as
$$\ell_{OE}(z'; \theta_h) := E(z'; \theta_h) - \sum_{i=1}^{c} h_i(z'; \theta_h), \quad (1.2)$$
where $E(z'; \theta_h) = -T \log \sum_{i=1}^{c} e^{h_i(z';\theta_h)/T}$ is the energy function, given the temperature parameter $T > 0$. At test time, we approximate the OoD oracle $q^*$ by the MSP score [52].

1.3.2 In-context Learning (ICL) for large language models (LLMs)
In-context learning (ICL) [13] is an effective few-shot learning method that can adapt LLMs to downstream tasks without updating the model's parameters. Specifically, ICL queries LLMs by concatenating relevant samples with the test query to provide augmented knowledge. We give a formal definition of ICL as follows.

Suppose we have a training set $T = \{(x_i, y_i)\}_{i=1}^{N}$ related to the query domain, where $x$ is the question and $y$ is the answer. Given a query problem $p_i$ from a test set $P$ and a $K$-shot inference budget, we retrieve $K$ related samples $E_i = \{e_i^k = (x_i, y_i) \mid e_i^k \in T\}_{k=1}^{K}$ and construct a prompt $P(E_i, p_i)$ as input to feed into the LLM:
$$P(E_i, p_i) = \pi(e_i^1) \oplus \cdots \oplus \pi(e_i^K) \oplus \pi(p_i, \cdot), \quad (1.3)$$
where $\pi$ is the template for each sample. The predicted answer from the LLM for question $p_i$ is given by:
$$\hat{a}_i = \mathrm{LLM}(P(E_i, p_i)). \quad (1.4)$$

1.3.3 Backdoor-based Watermarking for Model Protection
Backdoor attacks. Backdoor attacks are an emerging security threat to DL systems when untrusted data, models, or clients participate in the training process [99]. Such an attack implants a backdoor trigger into the model by fine-tuning the pre-trained model with a set of poisoned samples assigned to one or multiple secret target classes [192, 79, 41, 90]. Suppose $D_c$ is the clean dataset and we craft the poisoned set $D_P$ by poisoning another set of clean samples. The objective function of backdoor injection is:
$$\min_{\theta} \sum_{(x,y) \in D_c} \ell(f_\theta(x), y) + \sum_{(x',y') \in D_P} \ell(f_\theta(\Gamma(x')), t), \quad (1.5)$$
where $\Gamma(x)$ adds a trigger pattern to a normal sample, $t$ is the pre-assigned target label, $f_\theta$ is a classifier parameterized by $\theta$, and $\ell$ is the cross-entropy loss. The key intuition of backdoor training is to make models memorize the shortcut patterns while ignoring other semantic features.

Backdoor-based watermarking. In this thesis, we focus on backdoor-based watermarking for model protection, which is a widely adopted black-box verification method. The poisoned dataset $D_P$ in Eq. (1.5) is also denoted as the trigger set for watermarking, and the objective function for watermark injection is the same as Eq. (1.5). A watermarked model should satisfy the following desired properties:
• Persistent utility. Injecting backdoor-based watermarks into a model should retain its performance on the original tasks.
• Removal resilience. Watermarks should be stealthy and robust against agnostic watermark removal attacks [130, 15, 58].
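To make Eq. (1.5) concrete, the following is a minimal sketch of how a trigger set and the watermark-injection objective can be assembled. The square patch, its location, and the single target label are illustrative assumptions rather than the trigger designs studied in Chapters 5 and 6, and `model` is assumed to be any classifier that returns logits.

```python
# A hedged sketch of backdoor-style watermark injection following Eq. (1.5).
# Patch shape/position and the target label are assumptions for illustration.
import torch
import torch.nn.functional as F

def add_trigger(x: torch.Tensor, patch_size: int = 5) -> torch.Tensor:
    """Gamma(x): stamp a white square patch in the lower-right corner."""
    x = x.clone()
    x[..., -patch_size:, -patch_size:] = 1.0
    return x

def watermark_injection_loss(model, clean_x, clean_y, trigger_x, target_label):
    """Eq. (1.5): clean-task cross-entropy plus trigger-set cross-entropy."""
    clean_loss = F.cross_entropy(model(clean_x), clean_y)
    t = torch.full((trigger_x.size(0),), target_label,
                   dtype=torch.long, device=trigger_x.device)
    trigger_loss = F.cross_entropy(model(add_trigger(trigger_x)), t)
    return clean_loss + trigger_loss
```

The ownership check defined next (Definition 1.3.1) then reduces to measuring the suspect model's accuracy on pairs of triggered inputs and the target label.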
For watermark verification, the ownership of a suspect model $M_s$ can be verified according to the consistency between the target label $t$ and the output of the model in the presence of the triggers, measured as the watermark success rate (WSR). If the WSR is larger than a certain threshold $\sigma$, the suspect model $M_s$ will be considered a copy of our original model. We formally define the ownership verification of the backdoor-based model as follows:

Definition 1.3.1 (Ownership verification). We define the watermark success rate (WSR) as the accuracy on the trigger set $D_P$:
$$\mathrm{WSR} = \mathrm{Acc}(M_s, D_P). \quad (1.6)$$
If $\mathrm{WSR} > \sigma$, the ownership of the model is established.

1.3.4 Federated Learning
Federated learning is a distributed learning framework that enables a massive number of remote clients to collaboratively train a high-quality central model [74]. FedAvg [121] is one of the representative methods for FL, which averages local models during aggregation. This work is based on FedAvg. Suppose we have $K$ clients, and our FL model $M$ used for standard training consists of two components: a feature extractor $f: \mathcal{X} \to \mathcal{Z}$ governed by $\theta_f$, and a classifier $h: \mathcal{Z} \to \mathcal{Y}$ governed by $\theta_h$, where $\mathcal{Z}$ is the latent feature space. The collective model parameter is $\theta = (\theta_h, \theta_f)$. The objective for a client's local training is:
$$J_k(\theta) := \frac{1}{|D_k|} \sum_{(x,y) \in D_k} \ell(h(f(x;\theta_f); \theta_h), y), \quad (1.7)$$
where $D_k$ is the local dataset for client $k$, and $\ell$ is the cross-entropy loss. The overall objective function of FL is thus given by $\min_\theta \frac{1}{K}\sum_{k=1}^{K} J_k(\theta)$.

CHAPTER 2 ROBUST UNSUPERVISED DOMAIN ADAPTATION FROM A CORRUPTED SOURCE
This chapter is based on the following work: Robust Unsupervised Domain Adaptation from a Corrupted Source. Shuyang Yu, Zhuangdi Zhu, Boyang Liu, Anil Jain, Jiayu Zhou. 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022: 1299–1304.

2.1 Introduction
Deep learning techniques have been thriving over the last decade as a powerful tool for predictive modeling in a variety of domains, including computer vision [124], autonomous vehicles [50], and healthcare [33], to name just a few. The over-parameterized design of deep models gives them ultra-high flexibility and the power to capture complex mappings between input data points and target labels. The success of deep-learning-based predictive modeling, however, hinges on massive training data with accurate labels, which hinders its application to tasks with limited label supervision, where collecting accurate labels can be economically prohibitive or even dangerous. One such example is the domain of health informatics, where building predictive models for a specific disease requires the construction of a carefully designed cohort [129]. Longitudinal studies, strict enrollment conditions, data coding errors, and high costs associated with data collection often result in only very small datasets being available for supervised learning [196, 7].

Accordingly, domain adaptation (DA) has emerged as an effective solution, which transfers knowledge learned from a related but different domain (i.e. the source domain) to assist the learning of the target domain. In particular, a challenging and practical problem along this line is unsupervised domain adaptation (UDA), in which the target domain has access to only a few unlabeled training samples.
While UDA has been extensively studied for typical machine learning settings, most existing UDA methods are usually built upon an implicit 10 assumption that source domain data is clean. Under this assumption, UDA methods are prone to performance degradation when the source domain samples are corrupted, either unintentionally during data collection or deliberately by vicious attackers. Consequently, models learned on the corrupted source data can be easily under attack even on the source domain, not to mention confronting the challenges of domain distribution shift when adapt- ing to the target domain. Such model performance degradation can be exacerbated under adversarial attacks. For instance, as illustrated in Fig. 2.1, a minimal corruption in the source domain samples can shift the model’s hypothesis plane drastically when performing domain adaptation, especially due to the lack of labeled supervision in the target domain. Given the challenge of UDA under the corrupted source domain, in this work, we propose a simple yet effective solution for robust UDA that addresses various types of data corrup- tion. Specifically, inspired by the principle of Median of Means (MoM) estimators [127], we alleviate the impacts of corrupted training samples by ensemble learning on a group of lightweight models with domain-invariant features, which is provably effective to confront poisoned data. To further address the distribution shift inherent in domain adaptation, we refine the learned models by maximizing the mutual information between the latent feature representations and the posterior distributions. Eventually, the final ensemble model is able to attain the predictive knowledge of the target domain with high confidence. The merits of our proposed approach are multi-fold: i) It is a principled and effective solution, with theoretical support in defending contaminated training samples. ii) Our ap- proach is simple and computationally efficient, in that the training of individual models for an ensemble can be conducted in parallel to accelerate learning. iii) The proposed solution to UDA is generally robust against agnostic types of data corruption. In particular, our approach is able to successfully tackle notorious backdoor attacks, where both the training samples and corresponding labels can be maliciously modified by attackers. iv) The pro- posed learning framework can be flexibly combined with available UDA approaches that are orthogonal to our work to improve their robustness under corrupted data. 11 Figure 2.1: Source domain data corruption may lead to failure in many existing domain adaptation approaches. 2.2 Related Work Domain Adaptation (DA) has been applied to a number of practical applications [24], including semantic segmentation [165], objective detection [10], and event recognition [17], etc. In this work, we work on the problem setting of unsupervised domain adaptation (UDA), which is more challenging than semi-supervised domain adaptation [141] where a few labeled samples of the target domain are available to assist learning. Among various UDA approaches, domain invariant representations reside at their core. A plethora of work has been proposed to learn feature representations that are discriminative for prediction while being invariant among domains. Earlier work leveraged the idea of minimizing the Maximum Mean Discrepancy (MMD) to achieve feature invariance [161, 112]. 
Adversarial training approaches emerged to minimize the discrepancy of the latent feature distributions between different domains [38, 39, 160]. Moment matching was also widely utilized for learning latent representations [132, 146, 189], which can be combined with generative adversarial learning to improve such domain invariance [87]. Another direction toward solving UDA is based on data reconstruction [56, 71, 197]. Many prior approaches are built upon the prerequisite that both the source and target domain data are accessible simultaneously during learning. Contrarily, [103, 102] proposed source-data-free UDA. Most existing approaches did not tackle the issue of source domain corruption.

Learning with noisy data has been extensively studied in traditional, non-domain-adaptation settings. Numerous robust learning methods have been proposed for tackling feature corruption, label corruption, and data poisoning attacks [159, 29, 76, 85]. However, the problem of learning with noisy data for DA is not well studied. Most of the existing robust DA methods are limited to one or two particular types of noise in data. [172] addressed domain adaptation under missing classes by performing a unilateral alignment. [186, 184] solve DA in a scenario where only the labels are noisy, with input features untouched. [18] proposed marginalized stacked denoising autoencoders (mSDA) to address feature corruption for DA. [46] developed an offline curriculum learning approach to tackle the label noise of DA, and adopted a proxy-distribution-based margin discrepancy to alleviate feature noise. Existing robust DA methods are summarized in Table 2.1.

Method   Feature corruption   Label corruption   Data poisoning attacks
[172]    ×                    ✓                  ×
[186]    ×                    ✓                  ×
[184]    ×                    ✓                  ×
[18]     ✓                    ×                  ×
[46]     ✓                    ✓                  ×
Ours     ✓                    ✓                  ✓
Table 2.1: Existing robust DA methods.

Median of Means (MoM) Estimators [127] are robust estimators utilizing the median of the predictions. They have shown a theoretical advantage over classical ERM-based approaches given long-tailed data with outliers [116], which can be very effective for solving general noisy-data problems. Recently, [133, 80, 81] applied MoM for robust predictive learning. In this work, we leverage the MoM principle to solve UDA with data corruption.

2.3 Problem Setting
In this section, based on UDA as introduced in Section 1.3.1, we elaborate on our proposed problem setting, i.e. UDA with corrupted data in the source domain.

UDA with Source Domain Corruption tackles domain adaptation from a corrupted source domain. One can consider that there is a one-to-one mapping between the clean source domain $P_s^{xy}$ and the corrupted source domain $\tilde{P}_s^{xy}$. The input feature $x_s^i$ can be disrupted with probability $p_e$:
$$p_e := \mathbb{E}_{x_s^i, \tilde{x}_s^i \sim \langle P_s(X), \tilde{P}_s(X) \rangle}\big[\mathbb{I}(\tilde{x}_s^i \neq x_s^i)\big].$$
Accordingly, labels of noisy samples are transformed based on an unknown transition probability matrix $T \in \mathbb{R}^{C \times C}$, where $C$ is the cardinality of label types. Each entry $T(i,j)$ in $T$ denotes the probability that a label $i \in [C]$ is flipped to $j \in [C]$ after data corruption:
$$T(i,j) = \mathbb{E}_{y_s, \tilde{y}_s \sim \langle P_s(Y), \tilde{P}_s(Y) \rangle}\big[\mathbb{I}(\tilde{y}_s = j \mid y_s = i)\big].$$
Denote by $\tilde{D}_s = \{\tilde{x}_s^i, \tilde{y}_s^i\}_{i=1}^{N_s}$ the noisy samples from $\tilde{P}_s^{xy}$; the model learned under a corrupted source domain is hence derived from noisy source domain samples instead: $\theta \leftarrow \Phi(\tilde{D}_s, D_t)$. Such data corruption can be unconsciously introduced during data collection by human mistakes or sensor malfunction, or maliciously triggered via adversarial attacks [101].
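To make the corruption model above concrete, the sketch below simulates a corrupted source set from a clean one using the disruption probability $p_e$ and a transition matrix $T$. Gaussian feature noise and a uniform off-diagonal $T$ are illustrative stand-ins only; the experiments in Section 2.5 instead corrupt the source with backdoor triggers.

```python
# A hedged sketch of the Section 2.3 corruption model: each source sample is
# disrupted with probability p_e, and its label is re-drawn from row y of a
# transition matrix T. Noise type and T below are illustrative assumptions.
import numpy as np

def corrupt_source(X, y, p_e, T, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_tilde, y_tilde = X.copy(), y.copy()
    hit = rng.random(len(X)) < p_e                  # which samples are disrupted
    X_tilde[hit] += rng.normal(0.0, noise_std, size=X_tilde[hit].shape)
    for i in np.where(hit)[0]:                      # flip label i -> j w.p. T[y_i, j]
        y_tilde[i] = rng.choice(len(T), p=T[y[i]])
    return X_tilde, y_tilde

# Example transition matrix: keep the true label w.p. 0.9, uniform otherwise.
C = 10
T = np.full((C, C), 0.1 / (C - 1))
np.fill_diagonal(T, 0.9)
```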
It is a challenging yet practical problem setting, potentially undermining most existing UDA approaches that do not consider the risk of noisy source domains (as illustrated in Figure 2.1).

2.4 Methodology
In this section, we propose a robust UDA method based on ensemble learning, which splits the potentially contaminated data into blocks and performs model learning on each block. The derived models are then fine-tuned and ensembled for domain adaptation. Our method is conceptually inspired by the Median of Means estimator [127].

2.4.1 Preliminaries of Median of Means
Given a model $\theta$, there exists a gap between the empirical risk $\hat{E}(\theta)$ and the true risk $E(\theta)$, with $\hat{E}(\theta) := \frac{1}{|D|}\sum_{x \sim D} L(f(x;\theta), y)$ and $E(\theta) := \mathbb{E}_{x,y \sim P^{xy}}[L(f(x;\theta), y)]$, which can be exacerbated when the data are heavily tailed or contain contaminated samples. Therefore, models that are learned to solely minimize $\hat{E}(\theta)$ can be sensitive to outliers. Median of Means estimators alleviate this issue by finding a more proper approximation of the true risk, compared with an empirical risk minimizer (ERM). Formally, let $\{x_i\}_{i=1}^{N}$ be $N$ i.i.d. samples from an unknown distribution $\mathbb{P}$. Given a MoM estimator associated with a parameter $\delta \in [e^{1-N/2}, 1]$, one can evenly separate $\{x_i\}_{i=1}^{N}$ into $K$ blocks, where $K = \lfloor \ln(\delta^{-1}) \rfloor$. The MoM estimator $\mu_{MoM}(\delta)$ is then defined as the median of the $K$ arithmetic means of the blocks $X_k$:
$$\mu_{MoM}(\delta) = \mathrm{median}\Big(\frac{1}{|X_k|}\sum_{x_i \sim X_k} x_i\Big)_{k=1}^{K}. \quad (2.1)$$
The MoM estimator provably attains subgaussian properties under mild assumptions on the variance of the input features. Particularly, $\forall N \geq 4$, one can derive that [28]:
$$\mathbb{P}\Big(|\mu_{MoM}(\delta) - \mathbb{E}_{\mathbb{P}}[x]| > C\sqrt{\tfrac{1 + \ln(\delta^{-1})}{N}}\Big) \leq \delta. \quad (2.2)$$
Unlike the ERM estimator $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$, the MoM estimator is robust to data with outliers or heavy-tailed inputs. Inspired by MoM, we aim to approximate and minimize the centroid of the excessive risks by ensemble learning, which resembles the median of means when we treat $x_i$ as a sample-wise loss value.

2.4.2 Robust UDA via Ensemble Learning
We now elaborate on our learning paradigm. We first randomly split the source domain data $\tilde{D}_s := \{\tilde{x}_s^i, \tilde{y}_s^i\}_{i=1}^{N_s}$ into $K$ even blocks $\{\tilde{D}_s^k\}_{k=1}^{K}$, and apply the same random split to the unlabeled target domain data: $\{D_t^k\}_{k=1}^{K}$.

Figure 2.2: Process of robust domain adaptation learning.

Next, we learn $K$ separate models with parameters $\{\theta_k\}_{k=1}^{K}$, each optimizing a domain-adaptation objective on one pair of $\langle$source, target$\rangle$ data blocks, respectively, to minimize the empirical risk:
$$\min_{\{\theta_k \sim \Theta\}_{k=1}^{K}} \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{x_s, y_s \sim \tilde{D}_s^k,\; x_t \sim D_t^k} J_{DA}(x_s, y_s, x_t, \theta_k), \quad (2.3)$$
in which $J_{DA}(x_s, y_s, x_t; \theta)$ is the domain-adaptation risk function. One highlight of our work is that we do not constrain the specific form of $J_{DA}$; hence a variety of UDA approaches proposed by prior arts can be flexibly integrated into our learning framework by applying different forms of $J_{DA}$ as needed. In practice, $J_{DA}$ is usually derived by adversarial learning to attain a saddle-point solution that captures domain-invariant latent representations [39, 160]. Without loss of generality, we present one form of $J_{DA}$ below, although any other legitimate objective forms are also applicable:
$$J_{DA}(x_s, y_s, x_t; \theta) := \max_{D: \mathcal{X} \to [0,1)} \big[\underbrace{\log(1 - D(g(x_s;\theta))) + \log(D(g(x_t;\theta)))}_{(A)}\big] + \underbrace{L(f(x_s;\theta), y_s)}_{(B)}, \quad (2.4)$$
in which $D$ is a discriminator model inspired by adversarial generative training [42], and $g(\cdot;\theta)$ is the latent feature map of model $\theta$. The term (A) in Eq. (2.4) encourages learning a domain-invariant feature representation, while the term (B) in Eq. (2.4) reinforces the predictive power of the model using labeled supervision from the source domain.

Once the $K$ models have been learned, the centroid prediction for an arbitrary sample $x$ can be derived by their ensemble voting:
$$\bar{y} = \mathrm{ensemble}(x; \{\theta_k\}_{k=1}^{K}) = \arg\max_{y \sim \mathcal{Y}} \sum_{k=1}^{K} \mathbb{I}\big(\arg\max_{c \sim \mathcal{Y}} f(x;\theta_k)_c = y\big), \quad (2.5)$$
where $\mathbb{I}$ is an indicator function, $f(x;\theta_k)$ is the posterior distribution output of model $\theta_k$, and $f(x;\theta_k)_c$ indicates the predicted probability of the input belonging to class $c$. Therefore, $\bar{y}$ for an input $x$ is the label most voted by the $K$ models, which alleviates the influence of potentially contaminated models induced by data corruption.

2.4.3 Rationale for Ensemble Voting
We show that our ensemble strategy provides improved performance against data corruption. Concretely, given clean data $D$ and its potentially corrupted version $\tilde{D}$ with corruption ratio $p_e$, s.t. $\sum_{x, \tilde{x} \sim \langle D, \tilde{D}\rangle} \mathbb{I}(x \neq \tilde{x}) = p_e |D|$, one can randomly shuffle $\tilde{D}$ and evenly split it into $K$ blocks $\{\tilde{D}^k\}_{k=1}^{K}$, then perform robust ensemble learning with the following guarantee:

Theorem 2.4.1. Based on the definition of $\tilde{D}$ and $D$, with $\tilde{D} \sim \tilde{\mathbb{P}}(X)$ and $D \sim \mathbb{P}(X)$, respectively, given $K > 0$ and $B \cdot K = |D| = N$, let a random shuffling procedure evenly split $\tilde{D}$ into $\{\tilde{D}^k\}_{k=1}^{K}$ and, equivalently, split $D$ into $\{D^k\}_{k=1}^{K}$ given their bipartite mapping. Denote the empirical minimizer for the $k$-th block data as
$$\tilde{h}_k = \arg\min_{h} \mathbb{E}_{\tilde{D}^k}[L(h(\tilde{x}), \tilde{y})],$$
which is potentially corrupted. Accordingly, denote $\tilde{h} = \frac{1}{K}\sum_{k=1}^{K} \tilde{h}_k$ as the noisy ensemble model. Denote by $\hat{h}^*$ the oracle ensemble model without data corruption, i.e. $\hat{h}^* = \frac{1}{K}\sum_{k=1}^{K} \hat{h}_k$, where
$$\hat{h}_k = \arg\min_{h} \mathbb{E}_{D^k}[L(h(x), y)]$$
is the empirical minimizer on the clean data. Given $z_k$ as the ratio of corrupted samples for block $k$:
$$z_k := \mathbb{E}_{\tilde{D}^k, D^k}\big[\mathbb{I}(\tilde{x} \neq x)\big] = \mathbb{E}_{\tilde{D}^k, D^k}\big[\mathbb{I}(\tilde{y} \neq y)\big], \quad (2.6)$$
assume a threshold $\epsilon$ for effective contamination, such that $z_k > \epsilon \rightsquigarrow \tilde{h}_k \neq \hat{h}_k$, $\forall k \in [K]$. Let $v_k = \mathbb{I}(z_k > \epsilon)$, let $p_K = \mathbb{E}_{k \sim [K]}[\mathbb{I}(z_k > \epsilon)] = \mathbb{E}_{k \sim [K]}[v_k]$ be the expected ratio of contaminated blocks, and let $\delta^2 = \mathrm{Var}(z_1) < \infty$ be bounded depending on the choice of $K$. One can derive that:
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) \leq e^{-2K\left(\frac{1}{2} - \frac{K}{N}\frac{\delta^2}{\epsilon^2}\right)^2}. \quad (2.7)$$

Proof. Observe that $z_k < \epsilon \Leftrightarrow \tilde{h}_k \equiv \hat{h}_k$. Then $\hat{h}^* \neq \tilde{h}$ implies that at least $\lceil \frac{K}{2} \rceil$ of the blocks are contaminated, i.e. $\frac{1}{K}\sum_{k=1}^{K} v_k > \frac{1}{2}$. Then one can derive that:
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) \leq \mathbb{P}\Big(\sum_{k=1}^{K} v_k > \frac{K}{2}\Big) \quad (2.8)$$
$$= \mathbb{P}\Big(\sum_{k=1}^{K} (v_k - \mathbb{E}(v_k)) > \frac{K}{2} - K p_K\Big) \quad (2.9)$$
$$= \mathbb{P}\Big(\frac{1}{K}\sum_{k=1}^{K} (v_k - \mathbb{E}(v_k)) > \frac{1}{2} - p_K\Big). \quad (2.10)$$
By Hoeffding's inequality [55], $\forall t > 0$ we have:
$$\mathbb{P}\Big(\frac{1}{K}\sum_{k=1}^{K} (v_k - \mathbb{E}(v_k)) > t\Big) < e^{-2Kt^2}. \quad (2.11)$$
Therefore,
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) < e^{-2K(\frac{1}{2} - p_K)^2}. \quad (2.12)$$
Based on Chebyshev's inequality:
$$p_K = \mathbb{P}(z_k > \epsilon) \leq \frac{\delta^2}{B\epsilon^2} = \frac{K}{N}\frac{\delta^2}{\epsilon^2}.$$
As a result,
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) \leq e^{-2K\left(\frac{1}{2} - \frac{K}{N}\frac{\delta^2}{\epsilon^2}\right)^2}.$$

$\hat{h}^* \equiv \tilde{h}$ indicates that the ensemble model $\tilde{h}$ learned on noisy samples can defend against data corruption by delivering the same prediction as the ideal ensemble model $\hat{h}^*$, and vice versa. Theorem 2.4.1 demonstrates that the success of such robust ensemble learning hinges on two aspects: 1) the choice of the number of data blocks $K$, and 2) the difficulty $\epsilon$ of corrupting a single model. Moreover, we show below that given a proper choice of $K$ based on the ratio of corrupted data, we can ensure a robust learning scheme with high confidence:

Lemma 2.4.1. Given the above definition of $\tilde{D}$, $\{\tilde{D}^k\}_{k=1}^{K}$, and $z_k$, if $\sum_{\tilde{x}, x \in \langle \tilde{D}, D\rangle}\mathbb{I}(x \neq \tilde{x}) = p_e|D|$, then one can derive that $\forall K > 2 p_e|\tilde{D}|$:
$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{I}(z_k > 0) < \frac{1}{2}. \quad (2.13)$$

Proof. Denote $N = |D|$. Given that $p_e N$ is the number of contaminated samples in $\tilde{D}$, it is straightforward that:
$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{I}(z_k > 0) \leq \frac{1}{K}\sum_{k=1}^{K}\Big(\sum_{x, \tilde{x} \in \langle D^k, \tilde{D}^k\rangle} \mathbb{I}(x \neq \tilde{x})\Big) = \frac{1}{K}\, p_e N < \frac{1}{2 p_e N}\, p_e N = \frac{1}{2}.$$

Lemma 2.4.1 reveals that, given a choice of $K > 2 p_e|\tilde{D}|$, the majority of data blocks (more than 50%) are composed of clean samples. This property is inspiring, since we can train $K$ models separately using the $K$ blocks, whose ensemble ensures that the models learned on corrupted data will be voted out with high probability. In practical scenarios where the value of the poison ratio is unknown, one can play a tradeoff between the confidence of the ensemble and the data sufficiency in each block. The block number $K$ is an important hyper-parameter, which needs to be carefully selected. An over-small $K$ might undermine the robustness of the ensemble, in that the ratio of contaminated data blocks becomes relatively high when $K$ decreases. On the other hand, a
In our setting, optimizing toward this objective shows significant benefits in weakening the impacts of source data corruption, which can adaptively tune the potentially contaminated model to fit in the target domain hypothesis. Moreover, when refining a model θk using target domain samples, we can obtain the pseudo label ˆyt = arg max f (xt; θ)c for each sample xt, as well as the class-wise centroid c∼[C] 20 representation ¯gk: ∀k ∈ [C], ¯gk = Ext∼Dt,ˆyt=k [g(xt; θk)] . (2.15) where g(xt; θk) is the latent feature representation of xt, i.e. the penultimate layer output of model θ. We find it beneficial to correct the pseudo labels of xt by finding the nearest centroid: ¯yt := arg min cos(g(xt; θk), ¯gk), k∈[C] then use the corrected pseudo labels ¯yt to adjust the model. More concretly, this augmented objective JP L is derived as follows: JPL := Ext∼Dt,¯yt [− log(f (xt; θk)¯yt)] . min θk (2.16) Algorithm 1 Robust Unsupervised Domain Adaptation 1: Require: labeled source domain dataset Ds; unlabeled target domain dataset Dt; con- k=1 ∼ Θ. training stant K, DA risk function JDA : X × X × Y → R+; K models {θk}K steps E1, adaptation steps E2; constant α, β > 0. 2: Randomly split Ds, Dt into K blocks of pairs: {Dk , s.t. ∀ k, |Dk s | ≤ ⌈ |Ds| s , Dk t | ≤ t }K K ⌉, |Dk k=1 ⌈ |Dt| K ⌉. 3: for k ∼ [K] in parallel do for 1 ≤ i ≤ E1 do θk ← θk − η ∗ ∇θk Dk E 4: 5: end for 6: 7: end for 8: for k ∼ [K] in parallel do for 1 ≤ i ≤ E2 do 9: [JDA(xs, ys, xt)] . s ,Dk t 10: θk ← θk − η (α∇θtJIM(Dt; θk) + βJPL(Dt; θk)). end for 11: 12: end for 13: Return ensemble{θk}K k=1 . Based on the above building blocks, we now summarize our robust domain adaptation approach in Algorithm 1, in which K models are independently learned using separated training blocks, then refined to adapt their model hypothesis into the target domain by optimizing Eq. (2.14) and Eq. (2.16), where α and β are the constant w.r.t the gradient of 21 Eq. (2.14) and Eq. (2.16), respectively. Note that for each learning batch i, we iteratively adjust the centroid ¯gk using the updated model. Eventually, their ensemble voting is used as the final prediction for the target domain. Besides being robust to corrupted source domain data, our learning scheme is also learning-efficient when powered with parallel training. Moreover, it serves as a general robust UDA framework that can improve most existing DA learning approaches against corrupted data. 2.5 Evaluation In this section, we conduct extensive experiments on multiple benchmark datasets to in- vestigate the following question: whether our approach is effective for unsupervised domain adaptation, given a corrupted source domain data? 2.5.1 Experiment Setup Dataset: We conducted experiments using the following datasets: 1. Digit datasets: We conducted UDA tasks form the MNIST domain [83] to the USPS [8]), and from USPS to MNIST, respectively. 2. Image datasets: We conducted the UDA task from CIFAR10 [75] to STL [37] with the non-overlapping class of these two detests removed. Hence, these two domains are redefined as 9-class classification tasks. We also downscale the original image dimension of STL from 96 × 96 to 32 × 32, which is the image dimension of CIFAR-10. Compared Approaches: We compare our method against the following approaches: 1. DANN is a representative UDA method based on generative-adversarial learning [39]. 2. 
CDAN is short for conditional adversarial domain adaptation, which conditions the model posterial on the discriminative information from the classifier.[113] 22 Implementation: We choose backdoor attacks as our corruption method because it is a more more challenging attack, compared with feature noise or label noise attacks that existing robust DA methods managed to solve. We implement two kinds of backdoor attacks: 1. BadNet Attack : BadNet attack is one of the most common backdoor attacks [44]. According to a set poison ratio, we add a 5 × 5 trigger to the upper right corner of each poisoned sample from the source domain. These poisoned samples are also assigned with attacker-specified target labels . Then these poisoned source samples are fed into DNNs along with the remaining clean source samples and a few unlabeled target samples for training. The network is evaluated both on the clean target samples and poisoned target samples which are corrupted the same way as source samples. 2. Clean Label Backdoor Attack (CLBD): Compared with BadNet attacks, CLBD does not change the label of poison samples, but will add an learned adversarial pertur- bation to each base image [159]. In our experiment, we add a l∞-bounded perturbations constructed using projected gradient descent (PGD) [117]. This step can be formally defined as follows: p = xj ˆx(j) b + arg max ∥δ∥∞<ϵ L(x(j) b + δ, y(j), θ), (2.17) where L is the cross-entropy loss. Then a trigger is added to the set {ˆx(j) p } to generate the final poison samples {x(j) p }. In our experiments, we craft the poison samples on a pre-trained Resnet-18 model using CIFAR-10 dataset, then modify them with a 5 × 5 patch in the lower right-hand corner. The perturbations are bounded with ϵ = 16/255. We set the poison ratio to be 0.5 for the poisoned class. Note that the poison ratio for CLBD represents the fraction of examples poisoned from a single class, instead of the entire source training samples. Then similar to BadNet attacks, we fed models with training samples including the poisoned ones for training, and evaluate the performance on both the clean and the poisoned target samples. 23 For the digit datasets, we utilize the classical LeNet-5 [82] network for the task USPS ↔ MNIST. We adopt minibatch SGD with momentum = 0.9, weight decay = 5e−3 and learning rate = 1e−2. The maximum number of epochs we set for digits is 70. For image datasets, we adopt the Resnet-18 network. The maximum number of iterations we set for image tasks is 20000. Evaluations are performed w.r.t. the following criteria: 1. Target clean accuracy (denoted as Clean acc) refers to the accuracy evaluated on the clean target dataset. 2. Target poison accuracy (denoted as Poison acc) refers to the accuracy evaluated on the poisoned target data with clean labels. 3. Attack success rate (denoted as Success rate) refers to the accuracy evaluated on the poisoned target data with poisoned labels. This criterion can help us find out whether hidden backdoors are activated by attacker-specified trigger patterns. 2.5.2 Results and Discussions For the digits domain adaptation between USPS and MNIST, we apply BadNet attacks and vary the poison ratio from 0.01 to 0.03. For image adaptations between CIFAR10 and STL, we fix the poison ratio to be 0.02 for BadNet attacks and 0.5 for CLBD attacks. Effects of MoM on defending poison data attacks: For digit adaptation tasks, we evaluate the accuracy and attack success rates w.r.t. 
different poison ratios for two different base DA approaches: DANN and CDAN, respectively. As shown in Table 2.2, our proposed MoM method is consistently robust given different base DA algorithms. When the poison ratio is 0, there are no poisoning attacks on source data, hence the poison acc and success rate for poison ratio = 0 is evaluated on poisoned testing samples, with a model trained on clean samples. We use this result as a reference for the following experiments. The 24 DA model Task Poison ratio Block num Clean acc ↑ Poison acc ↑ Success rate ↓ DANN DANN MNIST → USPS USPS → MNIST MNIST → USPS USPS → MNIST 0 (clean) 0.01 0.02 0.03 0 (clean) 0.01 0.02 0.03 0 (clean) 0.01 0.02 0.03 0 (clean) 0.01 0.02 0.03 1 1 10 1 10 15 1 15 20 1 1 10 1 10 1 20 1 1 10 1 15 1 15 20 1 1 10 1 15 1 20 88.89 88.79 86.25 89.34 85.00 83.86 88.44 83.91 82.76 95.54 95.38 85.00 93.30 83.22 95.02 75.31 93.47 93.52 87.84 94.07 85.45 94.02 85.35 82.71 92.96 96.29 83.09 93.4 77.71 97.22 74.39 11.26 8.57 12.21 8.77 9.22 11.61 8.62 9.87 11.21 9.56 10.02 10.24 10.21 10.50 10.10 10.57 12.41 8.52 12.76 8.57 12.01 8.82 9.87 10.96 9.68 10.17 10.48 10.1 10.56 10.13 10.53 9.62 92.33 8.77 97.06 59.99 28.65 95.17 60.34 35.87 9.89 94.89 12.88 97.82 18.09 99.12 18.87 9.57 94.27 9.97 97.46 19.18 99.15 67.91 38.22 10.04 93.49 15.62 97.27 17.51 98.39 20.76 Table 2.2: Accuracy(%) and attack success rates(%) for MoM using under BadNet attacks. ↑ indicates that a larger value is desirable, and vice versa. By applying MoM, we can significantly bring down the attack success rate and also improve the target poison test accuracy, while maintaining the target clean sample accuracy. performance of MoM for image task (CIFAR-10 → STL) under BadNet attacks and CLBD attacks is shown in Figure 2.3. Block number = 1 refers to training without applying MoM, which we use as the baselines for our proposed algorithm. We found that by applying MoM, we can significantly bring down the attack success rate and improve the target poison test accuracy while maintaining the target clean sample accuracy. The results for both tasks can be further improved by adaptation with information maximization (IM) or adaptation with pseudo label (PL) which will be covered later. Effects of different block numbers for MoM: We also investigate how the number 25 of blocks would affect the performance of our approach. We observe that increasing the number of blocks within a certain range is beneficial for improving the performance. The best block number is related to the poison ratio and can be task dependent. For instance, adaptation tasks between CIFAR10 and STL need more blocks to achieve a low attack success rate, compared with digits adaptations. Moreover, for CLBD attacks using DANN, MoM achieves a better result with attack success rate = 25.96% and block number = 40, compared to the result achieves when block number is 50 with the attack success rate = 28.05%. This phenomenon is induced by insufficient training data left in each block when one immoderately increases the block numbers, leading to the potential under-fitting of the learned model. Meanwhile, we show that adaptation with Information Maximization (Section 2.4.4 ) is more beneficial for enhancing the robustness of our approach, instead of keeping increasing the block number. (a) BadNet attacks us- ing DANN (b) BadNet attacks us- ing CDAN (c) CLBD attacks using DANN (d) CLBD attacks using CDAN Figure 2.3: Clean test accuracy, poison test accuracy and attack success rate for MOM w.r.t. 
different block number. Increasing the number of blocks within a certain range is beneficial for improving the performance. Effects for different poison ratios for MoM: We investigate how different poison ratios would affect the performance of MoM for task MNIST → USPS using DANN as the base algorithm. As shown in Figure 2.4, the attack success rate increases along with the poison ratios, given the block number fixed to be 10, which means that when the poison ratio increases, 10-block MoM can no longer be as effective as when poison ratio is low. With the increase of poison ratio, more poison samples are divided into one single block, which will lead to poor prediction for almost all blocks, Consequently, the quality of the median 26 (a) Clean test accuracy. (b) Poison test accuracy. (c) Attack success rate. Figure 2.4: Clean test accuracy, poison test accuracy and attack success rate for MOM w.r.t. different poison ratio under BadNet attack. We need to increase the number of blocks with the increase of poison ratio. DA model Task Poison ratio Block num Clean acc ↑ Poison acc ↑ Success rate ↓ Adaptation DANN CDAN MNIST → USPS USPS → MNIST MNIST→USPS USPS → MNIST 0.02 0.03 0.02 0.03 0.02 0.03 0.02 0.03 15 20 10 20 15 20 15 20 83.86 83.12 84.16 82.76 80.67 81.22 83.22 85.51 85.98 75.31 78.95 79.46 85.45 85.10 85.45 82.71 81.42 81.81 77.71 82.50 83.00 74.39 78.90 80.67 11.61 12.11 12.26 11.21 11.71 11.96 10.50 10.17 10.27 10.57 10.62 10.76 12.01 12.66 12.41 10.96 12.51 12.16 10.56 10.49 10.51 10.53 10.50 10.56 28.65 10.76 12.41 35.87 19.03 16.49 18.09 11.57 11.48 18.87 10.11 9.89 19.18 8.12 11.01 38.22 7.08 10.66 17.51 11.74 10.89 20.76 11.58 10.60 / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL Table 2.3: Accuracy(%) and attack success rates(%) for MoM+refine using model CDAN. ↑ indicates larger value is better. ↓ indicates smaller value is better. Bold numbers are best performers. Our adaptation method is consistently robust given different DA algorithms. IM and PL are verified to be effective to not only further decrease attack success rate but also increase the target poison test accuracy. estimators over all the blocks can no longer be guaranteed. To achieve an effective defense, we need to increase the number of blocks with the increase of poison ratios. Effects of defending poison data attack using adaptation: To further improve the results, we refine our model (best block number for MoM is used) with adaptation method IM and PL which are introduced in section 2.4.4. IM and PL are verified to be effective to 27 not only further decrease the attack success rate but also increase the target poison accuracy. To the best of our knowledge, our proposed method is the most robust DA method given corrupted source samples compared with existing methods. For the digits tasks, we evaluate our proposed MoM + adaptation algorithm w.r.t. different poison ratios using two models: DANN and CDAN, respectively, as shown in Table 2.3. We didn’t apply IM or PL when the poison ratio is low (poison ratio = 0.01) for digits task, because soley adopting MoM will anneal the attack success rate under 10%. However, when the poison ratio increases, IM and PL can help further decrease the attack success rate. For image task, the accuracy and attack success rates for BadNet attacks and CLBD attacks are shown in Table 2.4 and 2.5, respectively. 
Generally, both IM and PL can be used to further improve the results for defending BadNet and CLBD attacks while IM shows the best results. IM increases the poison accuracy and decreases the attack success rate, while maintaining the clean accuracy. Specifically, for BadNet attacks, adaptation shows more improvements. However, under BadNet attacks, PL can no longer help further improve the performance after the number of blocks drastically increases. Block num Clean acc ↑ Poison acc ↑ Success rate ↓ Adaptation 1 10 30 40 63.97 64.42 61.97 65.07 62.72 62.00 62.63 62.00 62.00 60.85 10.67 35.68 47.81 33.74 53.92 54.64 54.12 55.76 56.76 50.79 87.32 55.65 30.93 53.15 23.57 22.08 23.55 19.28 16.64 18.38 / / IM IM+PL / IM IM+PL / IM IM+PL Table 2.4: Accuracy(%) and attack success rates(%) using base approach CDAN for the task CIFAR-10 → STL under BadNet attack. Our adaptation method is consistently robust given different kinds of tasks. 28 Block num Clean acc ↑ Poison acc ↑ Success rate ↓ Adaptation 1 10 30 40 68.20 67.83 65.91 65.75 64.17 63.37 63.92 63.98 61.99 62.41 12.85 41.02 42.83 40.34 47.40 48.80 49.56 48.97 50.38 50.71 84.01 42.74 28.27 40.67 29.12 28.39 28.05 25.96 23.45 24.37 / / IM IM+PL / IM IM+PL / IM IM+PL Table 2.5: Accuracy(%) and attack success rates(%) using base approach DANN for task CIFAR-10 → STL under CLBD attack. Our adaptation method is consistently robust under different kinds corruptions. 2.6 Summary In this work, we tackle a practical yet challenging problem, i.e. unsupervised domain adapta- tion under corrupted source domain samples. Inspired by the Median of Means estimators, we proposed a principled and robust ensemble learning algorithm powered by hypothesis transfer via information maximization, which can provably defend corrupted training sam- ples with high asymptotic performance on the target domain. Extensive empirical studies showed that our UDA approach is robust against different levels of data corruption, which can serve as a general framework to improve the robustness of orthogonal UDA approaches. We leave the extension of our work to more complex scenarios, such as corrupted multi-domain adaptation as an intriguing future work. 29 CHAPTER 3 DYNAMIC UNCERTAINTY RANKING: ENHANCING RETRIEVAL-AUGMENTED IN-CONTEXT LEARNING FOR LONG-TAIL KNOWLEDGE IN LLMS This chapter is based on the following work: Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs. Shuyang Yu, Runxue Bao, Parminder Bhatia, Taha Kass-Hout, Jiayu Zhou, Cao Xiao. The 2025 Annual Conference of the Nations of the Americas Chapter of the ACL (NAACL 2025). 3.1 Introduction Pretrained large language models [13, 157, 3] have achieved remarkable success across var- ious natural language processing (NLP) tasks, such as summarization [191, 163], question answering [62, 167], and code generation [92, 175]. These impressive results are largely due to their pre-training on vast, web-sourced datasets spanning multiple domains. However, these real-world datasets often follow a long-tail distribution [111, 118, 25, 147], where knowl- edge from less frequent domains is underrepresented. Consequently, certain domain-specific information may be rarely or even never included in the LLMs’ memorization [67]. As a result, LLMs struggle to provide accurate responses to queries drawn from these long-tail distributions, since the pre-training process fails to capture this sparse information. 
In-context learning (ICL) [13] is a few-shot learning method that queries LLMs by concatenating relevant samples with the test query, without updating the model's parameters. [67] found that ICL, when combined with retriever augmentation, can reduce LLMs' reliance on pre-training knowledge by retrieving relevant examples related to long-tail queries during inference. Common retrieval methods used to select augmentation examples for ICL include random selection [176, 174], off-the-shelf retrievers (e.g., BM25 [138]), and fine-tuned retrievers (e.g., PromptPG [114]). However, prior works [194, 107, 115, 16] have shown that ICL with different selections and orderings of the retrieved samples can lead to unstable predictions of LLMs. In our experiments, we observed a similar pattern: when utilizing existing methods to retrieve relevant samples for ICL, the model's predictions for long-tail questions (those not captured by zero-shot inference) exhibited particularly high uncertainty. In some cases, a subset of the retrieved samples led to correct predictions, while the full set misled the model, even with the same retrieval method.

Figure 3.1: Training framework of the proposed method. After pre-selection using BM25 for each validation sample p_i, we conduct from 0-shot to k_i-shot inference and update the retriever S_θ according to the dynamic impact of each sample on the LLM, based on the reward from the LLM. To reduce the query cost, we update the threshold σ when the LLM experiences a negative prediction change. The query time k_i is decided by the retriever score S_θ and the threshold σ.

In this chapter, to enhance retrieval augmentation for long-tail samples with respect to the LLM's uncertainty, we propose a reinforcement learning-based dynamic uncertainty ranking method, motivated by reinforcement learning's capacity to search for optimal retrieved samples based on the LLM's feedback [114]. Specifically, our approach trains a retriever to prioritize informative and stable samples while down-ranking misleading ones, enhancing performance on both head and tail distributions. We build on the BERT-based retriever architecture [27] with an appended linear layer. During the training of the retriever, only the linear layer is fine-tuned. Initially, BM25 [138] is used for pre-selection, and the retriever is trained using policy gradients [150], guided by feedback from the LLM for each retrieved sample. To improve efficiency, we introduce a learnable dynamic threshold as a budget controller for retrieval, selecting only samples with ranking scores above this threshold; the threshold adjusts whenever the LLM experiences a negative prediction change, i.e., the prediction changes from true to false. To evaluate the proposed approach, we compared our method with state-of-the-art methods across both multi-choice and open-ended question-answering (QA) datasets from different domains. The experimental results show that our method outperforms the best baseline by 2.76%. Long-tail questions that fail to be captured by zero-shot inference benefit particularly from our proposed method. The accuracy of our method on long-tail questions surpasses previous methods by a large margin of up to 5.96%.
We summarize our key contributions as follows:

• We investigate the limitations of existing retrieval-augmented ICL approaches for handling long-tail questions, highlighting how variations in retrieved samples contribute to prediction uncertainty.

• We propose a reinforcement learning-based dynamic uncertainty ranking method with a budget controller that considers the dynamic impact of each retrieved sample on the LLM's prediction, which selectively elevates informative retrieved samples and suppresses misleading ones with minimal query costs.

• Extensive experiments demonstrate that our method consistently outperforms state-of-the-art methods on multiple QA datasets from different domains, achieving nearly a 6% improvement in accuracy for long-tail questions.

3.2 Related Work

In-context learning (ICL). ICL [13] queries the LLMs with a concatenation of related samples and the test query without parameter updating. To improve the quality of ICL, retrievers have been proposed to select related samples, which can be categorized into sparse retrievers (e.g., [138]) and dense retrievers (e.g., [107]). To further improve the effectiveness of off-the-shelf retrievers, strategies for fine-tuning retrievers on specific target domains have been proposed, such as PromptPG [114], UDR [97], and LLM-R [171]. Some works also adopt GPT to help retrieve and rerank samples by providing special prompts and related samples, such as Rerank [148] and SuRe [70].

Long-tail knowledge learning for ICL. [67] is the first to explore the influence of the long-tail distribution in pre-training data on LLM memorization. They identify retrieval augmentation as a promising approach to significantly reduce the LLM's dependence on pre-training knowledge. Several subsequent works have built on this retrieval augmentation approach to address the long-tail problem in LLMs. For example, [25] propose a retrieve-then-rerank framework leveraging knowledge distillation (KD) from the LLM to tackle long-tail QA. However, their method involves tuning the language model, which is computationally expensive and impractical for black-box LLMs such as GPT-4 [1]. Another line of research focuses on augmenting the training set using GPT [140, 22, 91], followed by fine-tuning the retriever to enhance its performance. Nonetheless, determining which samples should be augmented remains challenging. Augmenting the training set based on seed sentences often introduces repetitive rather than diverse information, and incurs significant costs due to GPT queries. Therefore, in this chapter, rather than augmenting the training set for fine-tuning the retriever, we aim to train an effective retriever capable of selecting the most informative samples to augment the test query during inference.

3.3 Problem Formulation

In this chapter, we target in-context learning (ICL) for QA tasks including multiple-choice QA and open-ended QA from different domains. Suppose we have a training set T = {(x_i, y_i)}_{i=1}^N related to the query domain, where x is the question and y is the answer. Given a query problem p_i from a test set P and a K-shot inference budget, we retrieve K related samples E_i = {e_i^k = (x_i, y_i) | e_i^k ∈ T}_{k=1}^K and construct a prompt P(E_i, p_i) as input to the LLM:

P(E_i, p_i) = \pi(e_i^1) \oplus \cdots \oplus \pi(e_i^K) \oplus \pi(p_i, \cdot), \qquad (3.1)

where π is the template for each sample. The predicted answer from the LLM for question p_i is given by:

\hat{a}_i = \mathrm{LLM}(P(E_i, p_i)). \qquad (3.2)
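To make the prompt construction in Eq. (3.1) concrete, the following is a minimal Python sketch; the `format_example` template and the separator are illustrative assumptions rather than the exact template π used in our experiments.

```python
# Minimal sketch of the K-shot prompt P(E_i, p_i) from Eq. (3.1).
# `format_example` stands in for the per-sample template pi(.); its exact
# wording is an assumption, not the template used in this chapter.

def format_example(question, answer=None):
    """Render one (question, answer) pair; the query leaves the answer blank."""
    rendered = f"Question: {question}\nAnswer:"
    if answer is not None:
        rendered += f" {answer}"
    return rendered

def build_prompt(retrieved, query):
    """Concatenate the K retrieved examples followed by the unanswered query."""
    shots = [format_example(q, a) for q, a in retrieved]
    shots.append(format_example(query))   # pi(p_i, .): answer left for the LLM
    return "\n\n".join(shots)             # the "⊕" concatenation in Eq. (3.1)

# Toy usage:
retrieved_set = [("Where is the Eiffel Tower?", "Paris"),
                 ("Where is the Colosseum?", "Rome")]
print(build_prompt(retrieved_set, "Where is the Brandenburg Gate?"))
```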
3.4 Motivation: Uncertainty of In-context Learning

Due to the lack of knowledge of some specific domains during the pre-training stage, there exists long-tail knowledge that fails to be captured by the LLMs [67]. We define easy samples as queries that have been captured during the LLM's pre-training stage and are stored in its memorization. In contrast, hard samples refer to queries that the LLM failed to capture, which are more likely to represent long-tail data. We classify easy and hard samples using the zero-shot testing result \hat{a}_i = \mathrm{LLM}_{0\text{-shot}}(p_i):

P_{easy} = \{(p_i, a_i) \in P \mid \mathbb{1}(\hat{a}_i, a_i) = 1\}, \quad P_{hard} = \{(p_i, a_i) \in P \mid \mathbb{1}(\hat{a}_i, a_i) = -1\}, \qquad (3.3)

where the indicator function \mathbb{1}(\cdot) returns 1 if the predicted answer \hat{a}_i aligns with the ground-truth answer a_i, and otherwise returns −1.

According to [67], retrieval augmentation methods help alleviate the long-tail problem: when a retriever succeeds in finding the most relevant samples from the training set T, it reduces the LLM's need to hold a large amount of related knowledge in its memorization. However, our experiments revealed that LLMs exhibit higher uncertainty when presented with hard samples, regardless of the retrieval augmentation applied. Fig. 3.3 shows the ratios of uncertain samples that experienced a prediction change on five datasets. Given an inference budget K = 5, 21.84% of queries experience a prediction change when we increase from 0-shot to 5-shot. Among these uncertain queries, 87.18% are hard samples and 12.82% are easy samples when using BM25 retrieval [138].

Figure 3.2: Case study for uncertainty of ICL.
Figure 3.3: Uncertain sample ratios.

For hard samples, even a tiny variation in the retrieved set E can mislead the LLM's prediction. One case study for hard-sample queries from T-REx [32] is shown in Fig. 3.2. In this case, the LLM gives a correct answer with the first two informative samples in E, which effectively compensate for the LLM's missing long-tail knowledge. However, the answer becomes wrong when a third sample is added to the prompt, indicating that the newly added knowledge is misleading. Other cases showing the uncertain predictions of the LLM can be found in Fig. 3.7 in Section 3.6.4 and Table A.3 in the Appendix.

Given the uncertainty of in-context learning, our goal is to improve the prediction accuracy of hard samples while maintaining the prediction stability on easy samples. During testing, we lack prior knowledge to determine whether a query falls into the easy or hard category. The primary challenge, therefore, is to prevent the inclusion of misleading information in the retrieved set E, which could lead to incorrect predictions. Simultaneously, we must ensure that the retrieved samples are sufficiently informative to address long-tail knowledge gaps and guide the LLM toward the correct answer.

3.5 In-context Learning with Dynamic Uncertainty Ranking

In this section, we introduce a dynamic uncertainty ranking method built on a reinforcement learning-based retriever. This method adjusts the retriever by applying a dynamic threshold, lowering the rankings of misleading samples while elevating the rankings of informative and stable ones.

3.5.1 Retrieved Sample Selection

The original training set T is randomly divided into a validation set V and a candidate pool C, from which the retrieved sample set E is selected. Following [114], the retriever is built upon BERT [27] with a linear layer appended to the final pooling layer of the BERT model.
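As a rough illustration of this retriever architecture, the sketch below wires a frozen BERT encoder to a single trainable linear layer and exposes the softmax-normalized ranking score that will be defined in Eq. (3.4) below; the model name, output dimension, and helper names are our own assumptions, not the exact implementation.

```python
# Sketch of the retriever S_theta: a frozen BERT encoder with one trainable
# linear layer h(.) = W * BERT(.) + b on the pooled output. Model name and
# dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class Retriever(torch.nn.Module):
    def __init__(self, bert_name="bert-base-uncased", out_dim=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(bert_name)
        self.bert = AutoModel.from_pretrained(bert_name)
        for p in self.bert.parameters():   # keep the BERT backbone frozen
            p.requires_grad = False
        self.linear = torch.nn.Linear(self.bert.config.hidden_size, out_dim)

    def embed(self, texts):
        toks = self.tokenizer(texts, padding=True, truncation=True,
                              return_tensors="pt")
        with torch.no_grad():
            pooled = self.bert(**toks).pooler_output   # final pooled features
        return self.linear(pooled)                     # h(.) used in the score

    def scores(self, query, candidates):
        """Softmax-normalized ranking scores of the candidates for one query."""
        hq = self.embed([query])          # shape (1, out_dim)
        he = self.embed(candidates)       # shape (n, out_dim)
        return F.softmax(he @ hq.squeeze(0), dim=0)
```

Only `self.linear` receives gradients in this sketch, which matches the frozen-BERT training setup described next.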
During training, the BERT is frozen, and only the parameters θ = (W, b) of the linear layer are fine-tuned. Given a query p_i from the validation set V and a retrieved sample e_i from C, the ranking score of the retriever is obtained from the hidden logical similarity shared among samples:

S_\theta(e_i \mid p_i) = \frac{\exp[h(e_i) \cdot h(p_i)]}{\sum_{e_i' \in E} \exp[h(e_i') \cdot h(p_i)]}, \qquad (3.4)

where h(·) = W(BERT(·)) + b is the output of the linear layer. To ensure the diversity and similarity of retrieved samples, and to reduce the computational cost, we first adopt an off-the-shelf retriever, BM25 [138], to pre-select a small candidate set C′_i from the large candidate pool C, following [139, 148, 70]. Suppose the shot number is k. By selecting the samples with the top-k highest ranking scores under our retriever S_θ, we obtain the retrieved sample set E_i for p_i from the candidate pool C′_i as follows:

E_i = \{e_i^k \sim \text{Top-}k(S_\theta(e_i^k \mid p_i)) \mid e_i^k \in C_i'\}. \qquad (3.5)

The retrieval process at test time is the same as during training; the only difference is that the validation set V is replaced with the test set P.

3.5.2 Retriever Training

Motivated by the exploration in Section 3.4, to improve retrieval augmentation for both hard and easy samples, we introduce a dynamic ranking method that updates the retriever using feedback from the LLM, driven by its varying responses to each retrieved sample.

Decide maximum shot number. Before training, we first decide the maximum shot number for each validation sample p_i ∈ V. To achieve this, we define a maximum shot number budget K and a dynamic budget controller σ, initialized as 0, for the ranking scores S_θ. Only samples with ranking scores above the threshold σ will be selected to update the retriever. The maximum shot number k_i for p_i is:

k_i = \min(K, N_i^{\max}), \qquad (3.6)

where N_i^{\max} = |\{e_i^k \mid e_i^k \in C_i', \, S_\theta(e_i^k \mid p_i) > \sigma\}|.

Training process. Given the maximum shot number k_i, we then conduct inference for p_i from 0-shot to k_i-shot to capture the effect of each retrieved sample on the LLM. The 0-shot inference on p_i can be considered as a means of long-tail sample detection as defined in Eq. (3.3). If the model's answer is incorrect, the sample is classified as a hard sample (i.e., a long-tail sample), and the retrieved set should provide informative augmentation. Conversely, if the model produces the correct answer, the sample is classified as an easy sample, and the retrieved set should avoid introducing any misleading samples. We define the retrieved sample set for the j-shot inference as the top-j highest-scoring samples selected from the candidate pool C′_i:

E_i^j = \{e_i^k \sim \text{Top-}j(S_\theta(e_i^k \mid p_i)) \mid e_i^k \in C_i'\}, \quad j = 0, 1, \cdots, k_i. \qquad (3.7)

The prediction from the LLM based on E_i^j and p_i is generated according to Eq. (3.2) as \hat{a}_i^j = \mathrm{LLM}(P(E_i^j, p_i)). The retrieved sample's impact on the prediction is reflected by the reward function R(\hat{a}_i^j, a_i) = \mathbb{1}(\hat{a}_i^j, a_i), where a_i is the ground-truth answer for p_i and \mathbb{1}(\cdot) is the indicator function. Our training goal is to maximize the expected reward w.r.t. the parameters of the retriever using the Policy Gradient method [150]. Since the expected reward cannot be computed in closed form, following [114], we compute an unbiased estimation with Monte Carlo sampling:

\mathbb{E}_{e_i \sim S_\theta(e_i \mid p_i)}[R(\hat{a}_i, a_i)] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k_i} R(\hat{a}_i^j, a_i), \qquad (3.8)

where N is the batch number yielded from V.
Following the REINFORCE policy gradient [177], we update the retriever using:

\nabla \mathbb{E}_{e_i \sim S_\theta(e_i \mid p_i)}[R(\hat{a}_i, a_i)] = \mathbb{E}_{e_i \sim S_\theta(e_i \mid p_i)}\big[\nabla_\theta \log(S_\theta(e_i \mid p_i)) R(\hat{a}_i, a_i)\big] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k_i} \nabla_\theta \log(S_\theta(e_i^j \mid p_i)) R(\hat{a}_i^j, a_i), \qquad (3.9)

where e_i^j = E_i^j − E_i^{j−1} is the difference between the retrieved sets for the j-shot and (j − 1)-shot inferences. This approach incorporates the dynamic influence of each retrieved sample on the LLM, providing better handling of uncertainty in ICL. Specifically, retrieved samples that yield correct predictions (R(·) = 1) are treated as informative and contribute to augmenting long-tail knowledge, thus receiving a higher ranking. Conversely, retrieved samples that lead to incorrect predictions (R(·) = −1) are considered misleading and are ranked lower.

Update budget controller σ. To increase training efficiency and reduce the cost of querying the LLM, we also update the threshold σ, which serves as a budget controller, at the turning point of a prediction change, decreasing the number of inference calls while maintaining the effect of our training strategy. Specifically, we focus on a special case: when the LLM experiences a prediction change from true to false, i.e., R(\hat{a}_i^{j−1}, a_i) = 1 and R(\hat{a}_i^j, a_i) = −1. In this case, the first (j − 1) samples have a positive impact on the inference of the LLM, while the j-th sample has a negative impact. Thus, we update the threshold σ as the maximum value of the ranking score over the samples in E_i^{k_i} that were not selected in the (j − 1)-shot round, as follows:

\sigma = \max\big(S_\theta(e_i^k \mid p_i)\big), \quad e_i^k \in E_i^{k_i} − E_i^{j−1}. \qquad (3.10)

Since we only select samples with ranking scores larger than σ, as shown in Eq. (3.6), the retrieved samples that serve as good compensation for long-tail knowledge will be ranked higher and used for updating the retriever more frequently. Note that updating σ does not wipe out the updates for misleading samples, as the turning point of the prediction change differs for each validation sample. Without affecting our original training strategy, we thus improve efficiency and reduce the querying cost. Our algorithm is summarized in Algorithm 2.

Algorithm 2 ICL with dynamic uncertainty ranking
1: Input: Retriever S_θ, training set T, maximum shot number K.
2: Output: Trained retriever S_θ.
3: Randomly split T into V and C.
4: Initialize θ ← θ_0, threshold σ ← 0.
5: for V_batch ∈ V do
6:   Initialize batch loss L ← 0.
7:   for each validation sample p_i ∈ V_batch do
8:     Pre-select C′_i from C using BM25 for p_i.
9:     Calculate the maximum shot number k_i based on σ using Eq. (3.6).
10:    for j = 0, 1, · · · , k_i do
11:      Get the retrieved set E_i^j using Eq. (3.7).
12:      Get prediction \hat{a}_i^j = LLM(P(E_i^j, p_i)).
13:      Get reward R(\hat{a}_i^j, a_i) = \mathbb{1}(\hat{a}_i^j, a_i).
14:      L ← L − R(\hat{a}_i^j, a_i) · log(S_θ(e_i^j | p_i)).
15:      if R(\hat{a}_i^j, a_i) = −1 and R(\hat{a}_i^{j−1}, a_i) = 1 then
16:        Update σ using Eq. (3.10).
17:      end if
18:    end for
19:  end for
20:  Optimize L w.r.t. θ using Eq. (3.9).
21: end for

3.6 Experiments

In this section, we first introduce the experiment setup and then show the effectiveness of our method through various empirical results.

3.6.1 Experimental Setup

Datasets: We conduct the experiments on QA datasets from different domains, including three multi-choice datasets: the biomedical dataset Pubmedqa [63], the speech detection dataset ethos-national [126], and the climate change dataset eval-climate [11], and two open-ended QA datasets: T-REx [32] and NaturalQuestions (NatQA) [78]. More dataset details can be found in ?? 11.
Baselines: We compare our method with six baselines, including 0-shot inference and five few-shot retrieval augmentation methods. The retrieval augmentation methods are as follows: 1) Random sampling: selecting ICL samples from the candidate set, a widely adopted practice in many ICL studies [176, 174]; 2) BM25 [138]: an off-the-shelf sparse retriever; 3) SuRe [70]: first use GPT to summarize the retrieved passages from BM25 for multiple answer candidates, then determines the most plausible answer by evaluating and ranking the generated summaries; 4) Rerank [148]: use GPT to rerank samples retrieved by BM25; 5) PromptPG [114]: a BERT-based dense retriever trained using reinforcement learning based on the feedback from GPT. Evaluation: For multi-choice QA, we use accuracy for evaluation. For open-ended QA, we use Normalized Exact Match (NEM), which evaluates whether the normalized string output by the inference LLM is identical to the reference string. Implementation: The LLM used in our experiment is GPT-4 [1]. Due to the limited data size in tweet_eval-stance_climate, the training set is split into 50 candidate samples and 150 validation samples. For the other datasets, we use 1000 samples in the candidate pool and 200 samples in the validation set. All methods share the same train-test split. The number of pre-selected samples in C′ is set to 20 by default for both the training and testing stages. For the few-shot case, the shot number is set to 5, unless otherwise specified. During the training of our method, the maximum shot number budget K is also set to 5. The batch size is set to 20. Experiments for all test datasets are repeated 3 times with different seeds, and the average accuracy is reported in the results. 40 (a) Easy sample accuracy. (b) Hard sample accuracy. Figure 3.4: Accuracy on easy and hard samples for proposed method and baselines. 3.6.2 Main Results Table 3.1 presents the mean and standard deviation (std) of accuracy for our proposed method and the baselines across five QA datasets. Our approach outperformed all baselines across tasks, with an average improvement of 2.97% ranging from 1.67% to 3.25% over the best baseline. The trained retriever PromptPG gives the most uncertain prediction with a std of 2.05%. Although our method is based on PromptPG, by giving informative and stable samples higher ranks, we not only improve the overall accuracy but also decrease std to 1.09%, comparable to 0-shot inference. Retrieval Method 0-shot Random sampling BM25 SuRe Rerank PromptPG Ours Dataset Pubmedqa 56.32 ± 1.08 72.87 ± 0.31 64.72 ± 1.70 78.20 ± 0.53 73.22 ± 0.69 78.93 ± 0.31 62.97 ± 1.00 78.93 ± 0.42 73.43 ± 1.01 78.93 ± 0.42 68.10 ± 2.05 78.47 ± 0.90 80.60 ± 0.35 92.40 ± 0.20 85.37 ± 0.32 65.00 ± 2.69 57.60 ± 1.91 76.19 ± 1.09 ethos-national 75.61 ± 0.51 75.17 ± 1.01 87.47 ± 0.39 85.23 ± 0.33 89.15 ± 0.39 77.74 ± 2.16 T-REx 42.60 ± 2.36 57.13 ± 1.97 62.13 ± 1.33 39.80 ± 0.57 62.07 ± 2.01 60.73 ± 3.21 NatQA 44.20 ± 1.91 46.80 ± 1.44 55.00 ± 1.14 32.00 ± 3.40 53.80 ± 1.91 50.80 ± 2.00 eval-climate 46.30 ± 0.32 66.30 ± 3.53 82.57 ± 0.30 78.89 ± 0.30 83.22 ± 0.32 72.78 ± 2.00 Avg Table 3.1: Comparison results between proposed methods and baselines on QA tasks from different domains. We further investigate the accuracy of easy and hard samples in Fig. 3.4. As illustrated in Eq. (3.3), the easy/hard sample classification is decided by the 0-shot inference results, and the hard samples can be considered as long-tail questions of GPT-4. 
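To make this zero-shot easy/hard split concrete, a minimal sketch is given below; the `llm_answer` helper and the string normalization are hypothetical stand-ins for the actual GPT-4 call and answer matching used in our experiments.

```python
# Sketch of the zero-shot easy/hard split in Eq. (3.3).
# `llm_answer(question)` is a hypothetical helper returning the LLM's
# 0-shot answer string; the normalization step is an assumption.

def normalize(text):
    return " ".join(text.lower().strip().split())

def split_easy_hard(test_set, llm_answer):
    easy, hard = [], []
    for question, gold in test_set:
        pred = llm_answer(question)              # 0-shot prediction
        if normalize(pred) == normalize(gold):   # indicator equals 1
            easy.append((question, gold))
        else:                                    # indicator equals -1
            hard.append((question, gold))
    return easy, hard

# Toy usage with a stub in place of the LLM:
easy, hard = split_easy_hard([("2 + 2 = ?", "4")], lambda q: "4")
```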
First, we observe a similar pattern to [67] that retrieval augmentation greatly improves the accuracy of long- tail samples. This could come from various aspects of augmented samples—such as label 41 space, input text distribution, and sequence format—that collectively improve final predic- tions [123]. Compared with 0-shot inference, even random sampling improves accuracy on hard samples from 0% to 29.17%. However, retrieval augmentation is highly dependent on the quality of the retrieval set. By retrieving the most similar samples, BM25 achieves an accuracy of 46.12%. Rerank further improves the accuracy to 48.03%. Our method includes the most informative samples based on the sample-wise feedback from LLM, and improves the accuracy on hard samples to 53.99%, which surpasses the best baseline with a large average margin of 5.96% ranging from 2.69% to 8.11%, while maintaining the accuracy on easy samples. 3.6.3 Ablation Studies Effects of different components. We verify the effectiveness of two components of our proposed method: uncertainty rank and pre-selection in Table 3.2. We first compared the uncertainty rank (UR) strategy with another trained retriever PromptPG which shared the same retriever architecture as ours. We improve the accuracy by 3.47% and 1.63% for two different datasets. PromptPG adjusts the ranking of candidate samples based on the feedback on the entire retrieved set for the validation samples, while UR raises the ranks for informative and stable samples and lowers the ranks for misleading samples based on the sample-wise feedback from LLMs. UR avoids the condition when misleading samples are included and negatively changes the answer from true to false. In this way, UR greatly enhances the retrieved sample set for augmentation. The second component pre-selection (PS) improves the results of both PromptPG and UR by selecting more diverse and similar related samples in the candidate set C′. Then the second step retrieval can select samples from a smaller candidate pool of higher quality. By combining these two components together, we can achieve an overall improvement of 14.66% and 2.13% for two different datasets. The improvement on ethos-national is more significant than Pubmedqa because the predicted answer on ethos-national is more uncertain 42 Dataset ethos-national Pubmedqa PromptPG UR PromptPG+PS UR+PS (Ours) 77.74 78.47 81.21 80.10 86.91 79.10 92.40 80.60 Table 3.2: Effects of different components. PS denotes pre-selection. UR denotes uncertainty rank. given different combinations of retrieved samples. (a) T-REx. (b) NatQA. Figure 3.5: Effects of different number of shots. Figure 3.6: Effects of different pre-select numbers. Effects of different number of shots. We show the effects of different shot numbers for two datasets in Fig. 3.5 where our method consistently outperforms other baselines. For NatQA, the accuracy of random sampling and PromptPG retrieval does not monotonically increase with shot number due to low-quality, misleading samples, which can degrade perfor- mance. In contrast, our method prioritizes high-quality samples, and as the number of shots Figure 3.7: Case study for retrieved samples of hard samples. 43 increases, the advantages of our algorithm become more pronounced, resulting in improved accuracy. Effects of different number of pre-selection samples. In Fig. 3.6, we investigate how the number of pre-seletion samples impacts our algorithm. For both datasets, the accuracy first increases and then decreases. 
If too few samples are selected, the candidate pool C′ for our reinforcement learning -based ranking stage lacks diversity, limiting the policy gradient strategy’s action space. Consequently, the learned retriever struggles to find the most informative samples. If the number is too large, C′ includes many irrelevant samples, making it difficult for the policy gradient strategy to learn an optimal solution in the large search space [114]. This can lead the retriever to capture irrelevant or misleading information. 3.6.4 Case Study To intuitively show the effectiveness of our proposed method on hard samples, we show one case on Pubmedqa by comparing the retrieved samples of PromptPG retriever and our re- triever in Fig. 3.7. According to this case, the two retrieved sets even have three overlap samples (marked as the same color), but the prediction is completely different. PromptPG gives a wrong prediction answer, while our method delivers the right answer. This result verifies that GPT-4 gives uncertain predictions on long-tail samples. Since 0-shot inference gives a wrong prediction answer on this query question, the informative augmented infor- mation can be contained in the retrieved set of our method (see right column), while for PromptPG, misleading information can be contained in the two samples that do not inter- sect with our retriever set (see left column), which shifts the predicted answer from true to false. Compared with PromptPG, our retriever ranks the three overlapped samples higher and gives two more informative samples. With the combination effect of these two, our method gives the correct prediction. More cases on hard samples from other datasets can be found in Table A.4 in the appendix. 44 (a) Shot # w.r.t. the batch. (b) Shot # of one epoch. Figure 3.8: Efficiency analysis. Figure 3.9: Training loss w.r.t. batch. 3.6.5 Efficiency Analysis Query cost. We set threshold σ as the budget controller to reduce the cost of the querying GPT-4. Since the query cost depends on token length, we compare the query costs of our method and PromptPG (both trained based on GPT-4) in Fig. 3.8a. Specifically, we calculate the total number of shots included in each query during training for each batch within one epoch for both methods. The blue dash line shows the total shot number of PromptPG for all datasets, since the batch size is 20, and the shot number is fixed at 5, the total shot number is fixed at 100 for each batch. According to the results, only batch 0 of our method surpasses PromptPG with a total shot count of 300. For subsequent batches, as the threshold σ is adjusted based on changes in the LLM’s predictions, the query shot count drops significantly, resulting in the total shot count consistently being lower than that of PromptPG. Aggregating the shot numbers across 10 batches, our method achieves only 33.8%, 65.2%, and 35.3% of the shot count of PromptPG on Pubmedqa, ethos-national, and NatQA, respectively as shown in Fig. 3.8b. Thus, in conjunction with the accuracy comparison presented in Table 3.1, our approach not only enhances query accuracy but also reduces the overall query cost. Convergence speed. We empirically demonstrate the convergence speed by showing training loss curves in Fig. 3.9. According to the results, the training loss quickly converges to a small value close to 0 within 15 batchs, which verify the high computational efficiency of our method. 45 3.6.6 Transferability Analysis We investigate the transferability of our retriever in Table 3.3. 
We use our retriever trained on dataset ethos-national, and evaluate its cross-domain effectiveness across the rest of the four datasets. Although the cross-domain results are still slightly inferior to the in-domain results, the performance gap is minimal, averaging only 0.98%. Furthermore, the cross- domain results outperform the best baseline. These findings indicate that our trained ranking strategy is transferable to other datasets, providing a cost-effective alternative to retraining. Best baseline Ours: cross-domain Ours: in-domain Pubmedqa 78.93 79.60 80.60 eval-climate NatQA T-REx Avg 69.82 55.00 71.16 57.20 72.14 57.60 83.22 83.33 85.37 62.13 64.50 65.00 Table 3.3: Transferability of our method. 3.7 Limitations There are several limitations of this work. First, our method do not consider the effect of different orders within the retrieved set and rank the retrieved samples according to their ranking scores. Future works can be extended based on our work by considering different inner order within the retrieved set and their effect on the prediction results. Second, although our experimental results show that our method greatly improves the prediction accuracy on long-tail samples, our method cannot handle query cases with no related knowledge either in the pre-training set or candidate pool. Third, our method focused on QA tasks using LLM. For future work, our method can be extended to other tasks such as summarization, translation, and recommendation as follows. Since our method is to train a reranker based on the reward signal from LLM, to adapt to other tasks, we can modify the evaluation score that is used to determine the reward. If the accuracy of the LLM’s predicted answer is unavailable, alternative metrics such as BLEU 46 and ROUGE can be used to assess the consistency between the prediction and the ground truth. A threshold can then be set for these scores, where values exceeding the threshold yield a positive reward, while lower values result in a negative reward. 3.8 Summary In this chapter, to improve the uncertain prediction of LLMs on long-tail knowledge, we propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval- augmented ICL with a budget controller. Specifically, it considers the dynamic impact of each retrieved sample based on the LLM’s feedback. Our ranking system system raises the ranks of more informative and stable samples and lower the ranks of misleading samples efficiently. Evaluations of various QA datasets from different domains show that our proposed method outperformed all the baselines, and especially improve the LLM’s prediction on the long-tail questions. 47 CHAPTER 4 TURNING THE CURSE OF HETEROGENEITY IN FEDERATED LEARNING INTO A BLESSING FOR OUT-OF-DISTRIBUTION DETECTION This chapter is based on the following work: Turning the Curse of Heterogeneity in Federated Learning into a Blessing for Out-of- Dis- tribution Detection. Shuyang Yu, Junyuan Hong, Haotao Wang, Zhangyang Wang, Jiayu Zhou. 2023. 2023 International Conference on Learning Representations (ICLR). 4.1 Introduction Deep neural networks (DNNs) have demonstrated exciting predictive performance in many challenging machine learning tasks and have transformed various industries through their powerful prediction capability. However, it is well-known that DNNs tend to make overcon- fident predictions about what they do not know. 
Given an out-of-distribution (OoD) test sample that does not belong to any training classes, DNNs may predict it as one of the training classes with high confidence, which is doomed to be wrong [52, 53, 51]. To alleviate the overconfidence issue, various approaches are proposed to learn OoD awareness which facilitates the test-time detection of such OoD samples during training. Recent approaches are mostly achieved by regularizing the learning process via OoD sam- ples. Depending on the sources of such samples, the approaches can be classified into two categories: 1) the real-data approaches rely on a large volume of real outliers for model regu- larization [53, 125, 193]; 2) the synthetic approaches use ID data to synthesize OoD samples, in which a representative approach is the virtual outlier synthesis (VOS) [31]. While both approaches are shown effective in centralized training, they cannot be easily incorporated into federated learning, where multiple local clients cooperatively train a high- quality centralized model without sharing their raw data [74], as shown by our experimental results in Section 4.5.2. On the one hand, the real-data approaches require substantial real outliers, which can be costly or even infeasible to obtain, given the limited resources of local 48 clients. On the other hand, the limited amount of data available in local devices is usually far from being sufficient for synthetic approaches to generate effective virtual OoD samples. Practical federated learning approaches often suffer from the curse of heterogeneous data in clients, where non-iid [94] collaborators cause a huge pain in both the learning process and model performance in FL [95]. Our key intuition is to turn the curse of data heterogeneity into a blessing for OoD detection: The heterogeneous training data distribution in FL may provide a unique opportunity for the clients to communicate knowledge outside their training distributions and learn OoD awareness. A major obstacle to achieving this goal, however, is the stringent privacy requirement of FL. FL clients cannot directly share their data with collaborators. This motivates the key research question: How to learn OoD awareness from non-iid federated collaborators while maintaining the data confidentiality requirements in federated learning? In this chapter, we tackle this challenge and propose Federated Out-of-distribution Syn- ThesizER (Foster) to facilitate OoD learning in FL. The proposed approach leverages non-iid data from clients to synthesize virtual OoD samples in a privacy-preserving manner. Specifically, we consider the common learning setting of class non-iid [94], and each client extracts the external class knowledge from other non-iid clients. The server first learns a virtual OoD sample synthesizer utilizing the global classifier, which is then broadcast to local clients to generate their own virtual OoD samples. The proposed Foster promotes diversity of the generated OoD samples by incorporating Gaussian noise, and ensures their hardness by sampling from the low-likelihood region of the class-conditional distribution estimated. Extensive empirical results show that by extracting only external-class knowledge, Foster outperforms the state-of-out for OoD benchmark detection tasks. 
The main contributions of our work can be summarized as follows: • We propose a novel federated OoD synthesizer to take advantage of data heterogeneity to facilitate OoD detection in FL, allowing a client to learn external class knowledge from other non-iid federated collaborators in a privacy-aware manner. Our work bridges 49 a critical research gap since OoD detection for FL is currently not yet well-studied in literature. To our knowledge, the proposed Foster is the first OoD learning method for FL that does not require real OoD samples. • The proposed Foster achieves the state-of-art performance using only limited ID data stored in each local device, as compared to existing approaches that demand a large volume of OoD samples. • The design of Foster considers both the diversity and hardness of virtual OoD samples, making them closely resemble real OoD samples from other non-iid collaborators. • As a general OoD detection framework for FL, the proposed Foster remains effective in more challenging FL settings, where the entire parameter sharing process is prohibited due to privacy or communication concerns. This is because that Foster only used the classifier head for extracting external data knowledge. 4.2 Related Work OoD detection. Existing OoD detection methods are mainly from two complementary perspectives. The first perspective focused on post hoc. Specifically, [52] first introduced a baseline utilizing maximum softmax distribution probabilities (MSP). Based on this work, many improvements have been made by follow-up works in recent years, such as the cali- brated softmax score (ODIN) [105], Mahalanobis distance [84], energy score [108], Likelihood Regret [180], Confusion Log Probability (CLP) score [178], adjusted energy score [106], k-th nearest neighbor (KNN) [149], and Virtual-logit Matching (ViM) [168]. Compared with post hoc methods, Foster can dynamically shape the uncertainty surface between ID and OoD samples. Different post hoc methods are also applied in our experiment section as baselines. Another perspective tends to detect OoD samples by regularization during training, in which OoD samples are essential. The OoD samples used for regularization can be either real OoD samples or virtual synthetic OoD samples. Real OoD samples are usually natural 50 auxiliary datasets [53, 125, 193]. However, real OoD samples are usually costly to collect or infeasible to obtain, especially for terminals with limited sources. Regularization method utilizing virtual synthetic OoD samples do not rely on real outliers. [43] trained a generative model to obtain the synthetic OoD samples. [64] detect samples with different distributions by standardizing the max logits without utilizing any external datasets. [152, 142] proposed contrastive learning methods that also does not rely on real OoD samples. [31] proposed VOS to synthesize virtual OoD samples based on the low-likelihood region of the class-conditional Gaussian distribution. Current state-of-the-art virtual OoD methods are usually thirsty for ID data, which is not sufficient enough for local clients. Compared with these existing methods, the proposed Foster can detect OoD samples with limited ID data stored in each local device, without relying on any auxiliary OoD datasets. Federated Learning. Federated learning (FL) is an effective machine learning setting that enables multiple local clients to cooperatively train a high-quality centralized mode [74]. 
FedAvg [121], as a classical FL model, performs model averaging of the distributed local models of all clients, and is highly effective at reducing the communication cost. Based on FedAvg, many variants [170, 12] have been proposed to address problems arising in FedAvg, such as convergence [66, 134], heterogeneity [95, 59, 68, 198], and communication efficiency [136]. Among these problems, although data heterogeneity degrades performance on ID data, it offers a great opportunity to learn from the external data of other non-iid collaborators. Even though Foster uses the FedAvg framework, as a general OoD detection method for FL it can also be applied to other variants of FedAvg.

4.3 Problem Formulation

In this chapter, we consider classification tasks in heterogeneous FL settings, where non-iid clients have their own label sets for training and testing samples. Our goal is to achieve OoD-awareness on each client in this setting.

OoD training. The OoD detection problem is rooted in general supervised learning, where we learn a classifier mapping from the instance space X to the label space Y. Formally, we define a learning task by the composition of a data distribution D ⊂ X and a ground-truth labeling oracle c∗ : X → Y. Then any x ∼ D is denoted as in-distribution (ID) data, and otherwise, x ∼ Q ⊂ X\D as out-of-distribution data. Hence, an ideal OoD detection oracle can be formulated as a binary classifier q∗(x) = I(x ∼ D), where I is an indicator function yielding 1 for ID samples and −1 for OoD samples. With these notations, we define the OoD learning task as T := ⟨D, Q, c∗⟩. To parameterize the labeling and OoD oracles, we use a neural network consisting of two stacked components: a feature extractor f : X → Z governed by θ^f, and a classifier h : Z → Y governed by θ^h, where Z is the latent feature space. For ease of notation, let h_i(z) denote the predicted logit for class i = 1, . . . , c on an extracted feature z ∼ Z. We unify the parameters of the classifier as θ = (θ^f, θ^h). We then formulate the OoD training as minimizing the following loss on the task T:

J_T(\theta) := \mathbb{E}_{x\sim D}\big[\ell_{CE}\big(h(f(x; \theta^f); \theta^h), c^*(x)\big)\big] + \lambda\, \mathbb{E}_{x'\sim Q}\big[\ell_{OE}\big(f(x'; \theta^f); \theta^h\big)\big],

where ℓ_CE is the cross-entropy loss for supervised learning and ℓ_OE is for OoD regularization. We use E[·] to denote the expectation, estimated by the empirical average over samples in practice. The non-negative hyper-parameter λ trades off the OoD sensitivity in training. We follow the classic OoD training method, Outlier Exposure [53], to define the OoD regularization for the classification problem as

\ell_{OE}(z'; \theta^h) := E(z'; \theta^h) - \sum_{i=1}^{c} h_i(z'; \theta^h), \qquad (4.1)

where E(z'; \theta^h) = -T \log \sum_{i=1}^{c} e^{h_i(z'; \theta^h)/T} is the energy function, given the temperature parameter T > 0. At test time, we approximate the OoD oracle q∗ by the MSP score [52].

Heterogeneous federated learning (FL) is a distributed learning framework involving multiple clients with non-iid data. There are different non-iid settings [94, 98], and in this chapter, we follow a popular setting in which the non-iid property only concerns the classes [94]. Given K clients, we define the corresponding set of tasks {T_k}_{k=1}^K, where T_k = ⟨D_k, Q_k, c^*_k⟩, and the labeling oracles c^*_k : X → Y^k are non-identical for different k, resulting in non-identical D_k. Each Y^k is a subset of the global label set Y.
Since the heterogeneity is known to harm the convergence and performance of FL [185], we adopt a simple personalized FL solution to mitigate the negative impact, where each client uses a personalized classifier head h_k upon a global feature extractor f [4]. This gives the general objective of FL: \min_\theta \frac{1}{K}\sum_{k=1}^{K} J_{T_k}(\theta). The optimization problem can be solved alternately in two steps: 1) local minimization of the objective on local data and 2) aggregation by averaging client models. In this chapter, we assume that each client only learns the classes it sees locally during training, because updating classifier parameters for unseen classes has no data support, and doing so will almost certainly harm the performance of FL. To see this, Diao et al. showed that masking out unseen classes in the cross-entropy loss can benefit FL training [30].

Challenges. When we formulate OoD training in FL, the major challenge is defining the OoD dataset Q_k, which does not come for free. The centralized OoD detection of VOS assumes Q_k is at the tail of an estimated Gaussian distribution of D_k [31], which requires an enormous number of examples from D_k for an accurate estimation of the parameters. However, such a requirement is usually not feasible for a client per se, and the construction of Q_k remains a challenging question.

4.4 Method

In this section, we first introduce the intuition of our proposed Foster, then elaborate on how to synthesize virtual external-class data and how to avoid the hardness fading of the virtual OoD samples. The proposed framework is illustrated in Fig. 4.1.

Figure 4.1: The framework of Foster. In step 1, to extract external class knowledge from local clients, the server first trains a generator utilizing the global classifier based on a cross-entropy objective function J(w) (Eq. (4.2)). In step 2, each local client utilizes the received generator to generate its own external class data z. To preserve the hardness of the virtual OoD samples, we also sample virtual outliers v_k from the low-likelihood region of the class-conditional distribution estimated for the generated OoD samples. The virtual OoD samples v_k are used for regularization of the local client objective J(θ_k) (Eq. (4.5)).

4.4.1 Natural OoD Data in Non-iid FL

Recent advances show promising OoD detection performance by incorporating OoD samples during the training phase; however, OoD detection in FL is largely overlooked. In FL, each client does not have access to a large volume of real OoD samples because it can be costly or even infeasible to obtain such data for resource-constrained devices. As such, an OoD training method for FL that relies on few or even no real OoD examples is strongly desired. Novel to this work, we notice that data from classes outside the local class set, namely external-class data, are natural OoD samples w.r.t. the local data and can serve as OoD surrogate samples in OoD training. As shown in Fig. 4.2, training with external-class data achieves better OoD detection performance than normal training and VOS, since the scores of ID and real OoD data are well separated. Besides, compared to the real OoD datasets adopted in prior art, external-class samples are likely to be nearer to the ID data, since they are sampled from similar feature distributions (refer to (a) and (b) in Fig. 4.2).
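To illustrate how such external-class features can be plugged into the OoD regularizer ℓ_OE of Eq. (4.1), the following is a minimal sketch with toy dimensions; it is not the exact Foster training code.

```python
# Sketch of the energy-based OoD regularization l_OE in Eq. (4.1), applied to
# external-class (surrogate OoD) features. Shapes and the temperature are
# illustrative assumptions.
import torch

def energy(logits, T=1.0):
    """E(z'; theta_h) = -T * log sum_i exp(h_i(z') / T)."""
    return -T * torch.logsumexp(logits / T, dim=1)

def oe_loss(logits_ood, T=1.0):
    """l_OE = E(z') - sum_i h_i(z'), averaged over the surrogate OoD batch."""
    return (energy(logits_ood, T) - logits_ood.sum(dim=1)).mean()

# Toy usage: a classifier head h: Z -> Y applied to surrogate OoD features.
classifier_head = torch.nn.Linear(64, 10)   # toy feature and class dimensions
z_external = torch.randn(32, 64)            # stand-in external-class features
regularization = oe_loss(classifier_head(z_external))
```

In the OoD training objective J_T(θ) defined in Section 4.3, this term enters with weight λ alongside the cross-entropy loss on the client's ID data.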
4.4.2 Synthesizing External-Class Data from Global Classifier

Though using external-class data as an OoD surrogate is attractive and intuitive, it is not feasible in FL to directly collect such data from other non-iid clients, due to privacy concerns and the high communication costs of data sharing.

Figure 4.2: The density of the negative energy score for OoD detection evaluation using the dataset Textures, under (a) normal training, (b) training w/ VOS, and (c) training w/ external-class data. We use 5 ID classes and 5 external classes of CIFAR-10.

We thereby propose to generate samples from the desired classes leveraging the class information encoded in the global classifier head. Given the global classifier H : Z → Y parameterized by θ_g^h, we utilize a w-governed conditional generative network G_w : Y → Z to generate samples from specified classes on clients' demand. As such, we solve the following optimization problem:

\min_w J(w) := \mathbb{E}_{y\sim p(y)}\, \mathbb{E}_{z \sim G_w(z\mid y,\epsilon)}\big[\ell_{CE}(H(z; \theta_g^h), y)\big], \qquad (4.2)

where p(y) is the ground-truth prior, which is assumed to be a uniform distribution here. We follow the common practice of generative networks [198] and let y be a one-hot encoding vector, where the target class entry is 1 and the others are 0. To encourage the diversity of the generator outputs G(z|y), we use a Gaussian noise vector ϵ ∼ N(0, I) to reparameterize the one-hot encoding vector during the generating process, following prior practice [72]. Thus, G_w(z|y) ≡ G_w(y, ϵ | ϵ ∼ N(0, I)) given y ∼ Y, where Y is the global label set. The generator training process is illustrated in Fig. 4.1, Step 1.

Then, for local training (see Fig. 4.1, Step 2), by downloading the global generator as a substitute for Q_k, each local client indexed by k can generate virtual OoD samples given an arbitrary external class set Ȳ^k = Y\Y^k. In the feature space, we denote the virtual samples as z ∼ G_w(z|y, ϵ) given y ∼ Ȳ^k.

4.4.3 Filtering Virtual External-Class Samples

Although the synthesized features are intuitively conditioned on external classes, the quality of the generated OoD samples may vary across iterations, likely because of the lack of two properties: (1) Diversity. Like traditional generative models [144, 156], the trained conditional generator may suffer from mode collapse [119] within a class, where the generator can only produce a small subset of the distribution. As a result, the effective synthesized OoD samples will be mostly unchanged, and OoD training will suffer from the lack of diverse samples. (2) Hardness. For a client, its internal and external classes may co-exist on another client, which gradually enlarges the between-class margins. As the FL training proceeds, the class-conditioned synthetic OoD samples become increasingly easy for the model to memorize, namely, to overfit. In other words, the hardness of the OoD examples declines over time.

(1) Encourage OoD diversity by tail sampling. As mode collapse happens in the high-density area of a class, samples that approximate the class but have larger variance are preferred for higher diversity. For this purpose, we seek samples of low but non-zero probability from the distribution of the external classes. Specifically, for each client, we first assume that the set of virtual OoD representations {z_{ki} ∼ G(z|y_i, ϵ) | y_i ∼ Ȳ^k, ϵ ∼ N(0, I)}_{i=1}^{N_k} forms a class-conditional multivariate Gaussian distribution p(z_k | y_k = c) = N(μ_k^c, Σ_k), where μ_k^c is the Gaussian mean of samples from the external class set Ȳ^k for client k, and Σ_k is the tied covariance matrix.
The parameters of the class-conditional Gaussian can be estimated using the empirical mean and variance of the virtual external-class samples:

\hat{\mu}_k^c = \frac{1}{N_k^c} \sum_{i:y_i=c} z_{ki}, \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_{c} \sum_{i:y_i=c} (z_{ki} - \hat{\mu}_k^c)(z_{ki} - \hat{\mu}_k^c)^T, \qquad (4.3)

where N_k is the number of samples and N_k^c is the number of samples of class c in the virtual OoD set. Then, we select the virtual outliers falling into the ε-likelihood region as:

V_k^c = \Big\{ v_k^c \,\Big|\, \varepsilon_0 < \frac{1}{(2\pi)^{d/2} |\hat{\Sigma}_k|^{1/2}} \exp\Big(-\frac{1}{2}(v_k^c - \hat{\mu}_k^c)^T \hat{\Sigma}_k^{-1} (v_k^c - \hat{\mu}_k^c)\Big) < \varepsilon, \; v_k^c \sim G(\cdot \mid y = c, \epsilon) \Big\}, \qquad (4.4)

where ε_0 ensures the sample is not totally random, and a small ε pushes the generated v_k^c away from the mean of the external class in favor of sampling diversity.

Algorithm 3 Federated Out-of-Distribution Synthesizer (Foster)
1: Require: Tasks {T_k}_{k=1}^K; global parameters θ_g, local parameters {θ_k}_{k=1}^K; global generator parameter w; learning rates α, β, local steps T, ID batch size B, OE batch size B_OE.
2: repeat
3:   Server selects active clients A uniformly at random, then broadcasts θ, w to A.
4:   for all users k ∈ A in parallel do
5:     Initialize local parameters θ_k ← θ.
6:     for t = 1, . . . , T do
7:       Sample {(x_i, y_i)}_{i=1}^B ∼ D_k and Z_OE = {z_{ki} ∼ G(z|y_i, ϵ) | y_i ∼ Ȳ^k, ϵ ∼ N(0, I)}_{i=1}^{B_OE}.
8:       Estimate the multivariate Gaussian distributions based on Z_OE by Eq. (4.3).
9:       Filter virtual external-class samples according to Eq. (4.4).
10:      θ_k ← θ_k − β∇_{θ_k} J(θ_k).   ▷ Optimize Eq. (4.5)
11:    end for
12:    Client k sends θ_k back to the server.
13:  end for
14:  Server updates θ_g ← (1/|A|) Σ_{k∈A} θ_k.
15:  for t = 1, . . . , T do
16:    w ← w − α∇_w J(w).   ▷ Optimize Eq. (4.2)
17:  end for
18: until training stop

(2) Increase the hardness by soft labels. To counteract the enlarged margin between internal and external classes, we control the condition inputs to the generator such that the generated samples are closer to the internal classes. Given a one-hot encoding label vector y of class c, we assign 1 − δ to the c-th entry and a random value within (0, δ) to the rest of the positions, where δ ∈ (0, 0.5).

In summary, given an observable \hat{D}_k, we formulate the local optimization of Foster as:

\min_{\theta_k} J(\theta_k) := \frac{1}{|\hat{D}_k|} \sum_{x_i \in \hat{D}_k} \ell_{CE}\big(h_k(f(x_i; \theta_k^f); \theta_k^h), c^*(x_i)\big) + \lambda \frac{1}{|V_k|} \sum_{v_k \in V_k} \ell_{OE}(v_k), \qquad (4.5)

and the overall framework of our algorithm is summarized in Algorithm 3. The major difference from FedAvg is that we introduce a generator for OoD outlier synthesis. Since the generator is trained on the server, the computational overhead for the client is marginal, involving only the inference of low-dimensional vectors. As compared to VOS, the samples generated from the external classes are more likely to approximate the features of real images due to
4.5 Experiments
In this section, we first introduce the experiment setup and then present empirical results demonstrating the effectiveness of the proposed Foster.
ID Datasets for training. We use CIFAR-10, CIFAR-100 [75], STL10 [23], and DomainNet [132] as ID datasets. Both CIFAR-10 and CIFAR-100 are large datasets containing 50,000 training images and 10,000 test images. Compared with CIFAR, STL10 is a small dataset consisting of only 5,000 training images and 8,000 test images. DomainNet consists of images from 6 different domains. We use DomainNet to explore how Foster performs in the case of feature non-iid among different clients.
OoD Datasets for evaluation. We use Textures [21], Places365 [195], LSUN-C [183], LSUN-Resize [183] and iSUN [181] as the OoD datasets for evaluation. When the ID dataset is CIFAR-10, we also evaluate on CIFAR-100 to check near-OoD detection performance, since CIFAR-10 and CIFAR-100 have similarities, although their classes are disjoint.
Baselines. We compare the proposed Foster with both post hoc and virtual synthetic OoD detection methods mentioned in Section 6.2: a) post hoc OoD detection methods: Energy score [108], MSP [52], ODIN [105]; b) synthetic OoD detection method: VOS [31]. For a fair comparison, the FL training method for all of the above approaches, including the proposed Foster, is FedAvg [121] with a personalized classifier head, and we note that our framework can be extended to other FL variants. All the approaches only use ID data without any auxiliary OoD dataset for training.
Metrics for OoD detection and classification. To evaluate the classification performance on ID samples, we report the test accuracy (Acc) on each client's individual test set, whose classes match its training set. For OoD detection performance, we report the area under the receiver operating characteristic curve (AUROC) and the area under the PR curve (AUPR) for ID versus OoD classification. In the FL setting, all three metrics are averaged over all clients.
Heterogeneous federated learning. For CIFAR-10 and CIFAR-100, the total client number is 100; for STL10, the total client number is 50. For DomainNet, the total client number is 12 (2 for each domain). To model class non-iid training data, we follow a uniform partition mode and assign a subset of classes to each client. We distribute 3 classes per client for CIFAR-10 and STL10, 5 classes for DomainNet, and 10 classes for CIFAR-100, unless otherwise mentioned.

Figure 4.3: Visualization of generated external-class samples and ID samples for (a) CIFAR-10 and (b) CIFAR-100.

4.5.1 Visualization of generated external class samples
In Fig. 4.3, we visualize the generated external-class samples and the ID samples of a client using TSNE for both CIFAR-10 and CIFAR-100. Without accessing the raw external-class data from the other users, our generator, trained merely from the shared classifier head, yields samples that are strictly out of the local distribution without any overlap. We obtain a consistent conclusion from CIFAR-100, which has as many as 90 external classes. The large number of external classes diversifies the OoD set, and we therefore observe a larger gain in OoD detection accuracy (a 2.9% AUROC increase versus the best baseline) compared to other benchmarks in Table 4.1. This observation also motivates our design of tail sampling to encourage diversity.
4.5.2 Benchmark Results
Foster outperforms existing methods. We compare Foster with other competitive baselines in Table 4.1. The proposed Foster shows stronger OoD detection performance on all three training sets, while preserving a high test accuracy. VOS is another regularization method using virtual OoD samples, and it even shows worse results than the post hoc methods. The virtual OoD data synthesized by VOS is based on a large amount of ID samples. In the FL setting, where the data stored on each device is limited, these synthesized OoD samples based on ID data are no longer effective, which deteriorates the OoD detection performance. For Foster, the virtual OoD samples are based on the external-class knowledge extracted from other clients, which is close to real OoD samples.
Thus, they are effective in improving the OoD detection performance while preserving the test accuracy.

Table 4.1: Our Foster outperforms competitive baselines. ↑ indicates larger value is better. Bold numbers are best performers.

ID dataset   Method   Acc ↑    AUROC ↑   AUPR ↑
CIFAR-10     Energy   0.9431   0.7810    0.9262
             MSP      0.9431   0.8829    0.9691
             ODIN     0.9431   0.8842    0.9689
             VOS      0.9426   0.7970    0.9342
             Foster   0.9432   0.9091    0.9785
CIFAR-100    Energy   0.8129   0.8056    0.9575
             MSP      0.8129   0.8606    0.9782
             ODIN     0.8129   0.8657    0.9789
             VOS      0.8063   0.8372    0.9666
             Foster   0.8218   0.8945    0.9838
STL10        Energy   0.8236   0.7529    0.9228
             MSP      0.8236   0.7410    0.9309
             ODIN     0.8236   0.7418    0.9306
             VOS      0.8264   0.7370    0.9126
             Foster   0.8410   0.7671    0.9425

Near OoD detection. We evaluate the model trained on CIFAR-10 on both a near-OoD dataset (CIFAR-100) and far-OoD datasets. The results are shown in Table 4.2, and the best results are highlighted. The proposed Foster outperforms the baselines on all of the evaluation OoD datasets, especially the near-OoD dataset CIFAR-100. By synthesizing virtual external-class samples, Foster has access to virtual near-OoD samples during training, which is another advantage of Foster over the baselines.

Table 4.2: Near and far OoD detection for CIFAR-10. The proposed Foster outperforms baselines for all of the evaluation OoD datasets, especially the near-OoD dataset CIFAR-100.

Method   Textures         Places365        LSUN-C           LSUN-Resize      iSUN             CIFAR-100
         AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR
Energy   0.7080  0.8868   0.8221  0.9411   0.7009  0.9065   0.8376  0.9519   0.8289  0.9462   0.7883  0.9248
MSP      0.8107  0.9375   0.8964  0.9754   0.9043  0.9774   0.9154  0.9825   0.9103  0.9805   0.8604  0.9615
ODIN     0.8124  0.9367   0.8976  0.9752   0.9062  0.9773   0.9166  0.9825   0.9114  0.9805   0.8614  0.9613
VOS      0.7346  0.8993   0.8267  0.9447   0.7270  0.9196   0.8451  0.9541   0.8397  0.9499   0.8086  0.9379
Foster   0.8458  0.9544   0.9253  0.9842   0.9332  0.9863   0.9316  0.9870   0.9238  0.9849   0.8952  0.9742

Table 4.3: Our Foster outperforms competitive baselines under the feature non-iid setting (DomainNet).

Method   Acc ↑    AUROC ↑   AUPR ↑
Energy   0.7237   0.6745    0.8953
MSP      0.7237   0.6871    0.9048
ODIN     0.7237   0.6871    0.9047
VOS      0.7340   0.6796    0.8988
Foster   0.7348   0.6960    0.9075

OoD detection for feature non-iid clients. We explore whether Foster still works well when feature non-iid also exists among different clients, using DomainNet. Under this problem setting, different clients not only have different classes but may also come from different domains. According to the results in Table 4.3, although the gains are less pronounced than in the feature-iid setting, Foster still outperforms the baselines. In the feature non-iid setting, the external-class knowledge extracted from clients in different domains is less consistent than in the feature-iid case. However, our experimental results also show that, in this case, there is still some invariant external-class information across different domains that can be extracted by Foster to help improve the OoD detection performance.
4.5.3 Qualitative Studies
Effects of active client number. We investigate the effects of the active client number on CIFAR-10. The total number of clients is fixed at 100, while the number of active clients is set to 20, 50, and 100, respectively. According to the results in Table 4.4, Foster shows better OoD detection performance than the baselines in all cases. As the number of active clients increases, the OoD performance of Foster remains stable, which means our proposed Foster is not sensitive to the number of active users.

Table 4.4: Ablation study on the number of active clients: Foster is not sensitive to the number of active users.

Active num   Method   Acc ↑    AUROC ↑   AUPR ↑
20           Energy   0.9399   0.7760    0.9363
             MSP      0.9399   0.8560    0.9674
             ODIN     0.9399   0.8562    0.9674
             VOS      0.9410   0.7545    0.9173
             Foster   0.9401   0.9011    0.9776
50           Energy   0.9432   0.7592    0.9185
             MSP      0.9432   0.8869    0.9728
             ODIN     0.9432   0.8879    0.9727
             VOS      0.9430   0.7946    0.9311
             Foster   0.9429   0.8947    0.9750
100          Energy   0.9431   0.7810    0.9262
             MSP      0.9431   0.8829    0.9691
             ODIN     0.9431   0.8842    0.9689
             VOS      0.9426   0.7970    0.9342
             Foster   0.9432   0.9091    0.9785

Effects of ID class number. We investigate the effects of the ID class number on CIFAR-100. We set the number of classes distributed per client (classes / client) to 10, 5, and 3, respectively. According to the results in Table 4.5, the advantage of the proposed Foster over the other competitive baselines is not affected by the number of ID classes. When the number of ID classes decreases, the maximum changes in AUROC and AUPR for Foster are 2.16% and 0.81%, respectively. For VOS, another virtual synthetic OoD detection method, AUROC and AUPR drop by 7.36% and 6.76%, respectively, as the number of ID classes decreases, which is a much larger variation compared with our method. Thus, the ID class number has a large impact on VOS, while it has almost no effect on Foster.

Table 4.5: Ablation study on ID class number: the advantage of the proposed Foster over other baselines is not affected by the number of ID classes.

Classes / client   Method   Acc ↑    AUROC ↑   AUPR ↑
10                 Energy   0.8129   0.8056    0.9575
                   MSP      0.8129   0.8606    0.9782
                   ODIN     0.8129   0.8657    0.9789
                   VOS      0.8063   0.8372    0.9666
                   Foster   0.8218   0.8945    0.9838
5                  Energy   0.8976   0.7735    0.9157
                   MSP      0.8976   0.8776    0.9704
                   ODIN     0.8976   0.8831    0.9714
                   VOS      0.8974   0.7927    0.9289
                   Foster   0.8981   0.9081    0.9778
3                  Energy   0.9383   0.7215    0.8684
                   MSP      0.9383   0.8682    0.9586
                   ODIN     0.9383   0.8723    0.9592
                   VOS      0.9393   0.7636    0.8990
                   Foster   0.9397   0.8865    0.9697

Effects of the p.d.f. filter. We report the effects of the p.d.f. filter introduced in Section 4.4.3 on CIFAR-10 in Table 4.6. The generator without a p.d.f. filter is outperformed by the baselines. This phenomenon occurs because not all generated external-class samples are of high quality, and some of them may even deteriorate the OoD detection performance. Since we add Gaussian noise during the generation process, some randomly generated external-class samples might overlap with ID samples. Thus, we build a class-conditional Gaussian distribution for the external classes and adopt a p.d.f. filter to select diverse virtual OoD samples that do not overlap with the ID clusters. According to Table 4.6, filtering out low-quality OoD samples improves AUROC and AUPR by 4.44% and 1.18%, respectively.
Effects of the random soft label strategy. We study the effects of the random soft label strategy on STL10, and set δ = 0.2.
As shown in Table 4.7, after replacing the one-hot label with the random soft label as the input to the generator, we improve AUROC and AUPR by 2.01% and 0.75%, respectively, while preserving a similar ID classification test accuracy. This is because the soft label contains knowledge from the ID classes, which makes the generated external-class samples closer to the ID samples.

Table 4.6: Ablation study on the p.d.f. filter: the p.d.f. filter plays an effective role in selecting diverse, high-quality virtual OoD samples.

Method                  Acc ↑    AUROC ↑   AUPR ↑
Energy                  0.9431   0.7810    0.9262
MSP                     0.9431   0.8829    0.9691
ODIN                    0.9431   0.8842    0.9689
VOS                     0.9426   0.7970    0.9342
Foster w/o pdf filter   0.9425   0.8647    0.9667
Foster w/ pdf filter    0.9432   0.9091    0.9785

Table 4.7: Ablation study on random soft labels: the soft label strategy increases the hardness of the generated virtual OoD samples.

Method                 Acc ↑    AUROC ↑   AUPR ↑
Foster                 0.8410   0.7671    0.9425
Foster w/ soft label   0.8294   0.7872    0.9501

4.6 Summary
In this chapter, we study a largely overlooked problem: OoD detection in FL. To turn the curse of heterogeneity in FL into a blessing that facilitates OoD detection, we propose a novel OoD synthesizer that does not rely on any real external samples, allowing a client to learn external-class knowledge from other non-iid federated collaborators in a privacy-preserving manner. Empirical results show that the proposed approach achieves state-of-the-art performance in non-iid FL.

CHAPTER 5
SAFE AND ROBUST WATERMARK INJECTION WITH A SINGLE OOD IMAGE
This chapter is based on the following work: Safe and Robust Watermark Injection with a Single OoD Image. Shuyang Yu, Junyuan Hong, Haobo Zhang, Haotao Wang, Zhangyang Wang, Jiayu Zhou. 2024 International Conference on Learning Representations (ICLR), 2024.
5.1 Introduction
In the era of deep learning, training a high-performance large model requires curating a massive amount of training data from different sources, powerful computational resources, and often great effort from human experts. For example, large language models such as GPT-3 are trained on private datasets at significant cost [36]. The risk of illegal reproduction or duplication of such high-value DNN models is a growing concern. The recent leak of Facebook's LLaMA model provides a notable example of this risk [54]. Therefore, it is essential to protect the intellectual property of the model and the rights of the model owners.
Recently, watermarking [2, 26, 162, 192, 19, 100] has been introduced to protect the copyright of DNNs. Most existing watermarking methods fall into two mainstream categories: parameter-embedding [77, 162, 122] and backdoor-based [41, 90] techniques. Parameter-embedding techniques require white-box access to the suspicious model, which is often unrealistic in practical detection scenarios. This chapter places emphasis on backdoor-based approaches, which taint the training dataset by incorporating trigger patches into a set of images referred to as verification samples (the trigger set) and modifying their labels to a designated class, forcing the model to memorize the trigger pattern during fine-tuning. The owner of the model can then perform an intellectual property (IP) inspection by assessing the correspondence between the model's outputs on the verification samples with the trigger and the intended target labels.
Existing backdoor-based watermarking methods suffer from major challenges in safety, efficiency, and robustness.
Typically injection of backdoors requires full or partial access to the original training data. When protecting models, such access can be prohibitive, mostly due to data safety and confidentiality. For example, someone trying to protect a model fine- tuned upon a foundation model and a model publisher vending models uploaded by their users. Another example is an independent IP protection department or a third party that is in charge of model protection for redistribution. Yet another scenario is federated learning [74], where the server does not have access to any in-distribution (ID) data, but is motivated to inject a watermark to protect the ownership of the global model. Despite the high practical demands, watermark injection without training data is barely explored. Although some existing methods tried to export or synthesize out-of-distribution (OoD) samples as triggers to insert watermark [173, 192], the original training data is still essential to maintain the utility of the model, i.e., prediction performance on clean samples. [89] proposed a strategy that adopts a Data-Free Distillation (DFD) process to train a generator and uses it to produce surrogate training samples. However, training the generator is time-consuming and may take hundreds of epochs [35]. Another critical issue with backdoor-based watermarks is their known vulnerability against minor model changes, such as fine-tuning [2, 162, 40], and this vulnerability greatly limited the practical applications of backdoor-based watermarks. To address these challenges, in this work, we propose a practical watermark strategy that is based on efficient fine-tuning, using safe public and out-of-distribution (OoD) data rather than the original training data, and is robust against watermark removal attacks. Our approach is inspired by the recent discovery of the expressiveness of a powerful single image [6, 5]. Specifically, we propose to derive patches from a single image, which are OoD samples with respect to the original training data, for watermarking. To watermark a model, the model owner or IP protection unit secretly selects a few of these patches, implants backdoor triggers on them, and uses fine-tuning to efficiently inject the backdoor into the model to be protected. The IP verification process follows the same as other backdoor-based 66 watermark approaches. To increase the robustness of watermarks against agnostic removal attacks, we design a parameter perturbation procedure during the fine-tuning process. Our contributions are summarized as follows. • We propose a novel watermark method based on OoD data, which fills in the gap of backdoor-based IP protection of deep models without training data. The removal of access to the training data enables the proposed approach possible for many real-world scenarios. • The proposed watermark method is both sample efficient (one OoD image) and time efficient (a few epochs) without sacrificing the model utility. • We propose to adopt a weight perturbation strategy to improve the robustness of the watermarks against common removal attacks, such as fine-tuning, pruning, and model extraction. We show the robustness of watermarks through extensive empirical results, and they persist even in an unfair scenario where the removal attack uses a part of in-distribution data. 5.2 Background 5.2.1 DNN Watermarking Existing watermark methods can be categorized into two groups, parameter-embedding and backdoor-based techniques, differing in the information required for verification. 
Parameter-embedding techniques embed the watermark into the parameter space of the target model [26, 162, 77, 122]. Then the owner can verify the model identity by com- paring the parameter-oriented watermark extracted from the suspect model versus that of the owner model. For instance, [77] embeds watermarks into the weights of DNN, and then compares the weights of the suspect model and owner model during the verification pro- cess. However, these kinds of techniques require a white-box setting: the model parameters 67 should be available during verification, which is not a practical assumption facing real-world attacks. For instance, an IP infringer may only expose an API of the stolen model for queries to circumvent the white-box verification. Backdoor-based techniques are widely adopted in a black-box verification, which im- plant a backdoor trigger into the model by fine-tuning the pre-trained model with a set of poison samples (also denoted as the trigger set) assigned to one or multiple secret target class [192, 79, 41, 90]. The introduction of the watermarking injection process can be found in Section 1.3.3. Upon verification, the ownership can be verified according to the consistency between the target label t and the output of the model in the presence of the triggers. However, conventional backdoor-based watermarking is limited to scenarios where clean and poisoned dataset follows the same distribution as the training data of the pre-trained model. For example, in Federated Learning [121], the IP protector on the server does not have access to the client’s data. Meanwhile, in-training backdoor injection could be voided by backdoor- resilient training [169]. We reveal that neither the training data (or equivalent i.i.d. data) nor the in-training strategy is necessary for injecting watermarks into a well-trained model, and merely using clean and poisoned OoD data can also insert watermarks after training. Backdoor-based watermarking without i.i.d. data. Among backdoor-based tech- niques, one kind of technique also tried to export or synthesize OoD samples as the trigger set to insert a watermark. For instance, [192] exported OoD images from other classes that are irrelevant to the original tasks as the watermarks. [173] trained a proprietary model (PTYNet) on the generated OoD watermarks by blending different backgrounds, and then plugged the PTYNet into the target model. However, for these kinds of techniques, i.i.d. samples are still essential to maintain the main-task performance. On the other hand, data-free watermark injection is an alternative to OoD-based methods. Close to our work, [89] proposed a data-free method that first adopts a Data-Free Distillation method to train a generator, and then uses the generator to produce surrogate training samples to inject 68 watermarks. However, according to [35], the training of the generator for the data-free dis- tillation process is time-consuming, which is not practical and efficient enough for real-world intellectual property protection tasks. 5.2.2 Watermark Removal Attack In contrast to protecting the IP, a series of works have revealed the risk of watermark removal to steal the IP. Here we summarize three mainstream types of watermark removal techniques: fine-tuning, pruning, and model extraction. We refer to the original watermarked model as the victim model and the stolen copy as the suspect model under removal attacks. Fine-tuning assumes that the adversary has a small set of i.i.d. 
samples and has access to the victim model architecture and parameters [2, 162]. The adversary attempts to fine-tune the victim model using the i.i.d. data such that the watermark fades away, and an infringer can thus bypass IP verification. Pruning has the same assumptions as fine-tuning. To conduct the attack, the adversary first prunes the victim model using some pruning strategy, and then fine-tunes the model with a small i.i.d. dataset [110, 137]. Model Extraction assumes only the predictions of the victim model are available to the adversary. To steal the model through the API, given a set of auxiliary samples, the adversary first queries the victim model on the auxiliary samples to obtain an annotated dataset, and then trains a copy of the victim model on this annotated dataset [65, 158, 131, 130, 187].
5.3 Method
Problem Setup. Within the scope of this work, we assume that training data or equivalent i.i.d. data are not available for watermarking due to data privacy concerns. This assumption casts a substantial challenge on maintaining standard accuracy on i.i.d. samples while injecting backdoors.
Our main intuition is that a learned decision boundary can be manipulated not only by i.i.d. samples but also by OoD samples. Moreover, recent studies [6, 5] showed a surprising result that one single OoD image is enough for learning low-level visual representations, provided with strong data augmentations. Thus, we conjecture that it is plausible to efficiently inject backdoor-based watermarks into different parts of the pre-trained representation space by exploiting the diverse knowledge from one single OoD image. Previous work has shown that using OoD images for training a classifier yields reasonable performance on the main prediction task [6]. Moreover, it is essential to robustify the watermark against potential removal attacks. Therefore, our injection process comprises two steps: constructing surrogate data to be poisoned, and robust watermark injection. The framework of the proposed strategy is illustrated in Fig. 5.1.

Figure 5.1: Framework of the proposed safe and robust watermark injection strategy. It first constructs a surrogate dataset from the single-image OoD data source provided with strong augmentation used as the secret key, which is confidential to any third parties. Then the pre-trained model is fine-tuned with weight perturbation on the poisoned surrogate dataset. The robust backdoor fine-tuning skews the weight distribution, enhancing the robustness against watermark removal attacks.

5.3.1 Constructing Safe Surrogate Dataset
We first augment one OoD source image multiple times to generate an unlabeled surrogate dataset D̃ of a desired size according to [6, 5]. For safety considerations, the OoD image is only known to the model owner. The source OoD images are publicly available and properly licensed for personal use. To "patchify" a large single image, the augmentation composes multiple augmentation methods in sequence: cropping, rotation and shearing, and color jittering, using the hyperparameters from [5]. During training, we further randomly augment pre-fetched samples by cropping and flipping, and we use the predictions from the pre-trained model θ_0 as supervision.
Suppose θ is initialized as θ_0 of the pre-trained model. To inject watermarks, we split the unlabeled surrogate dataset D̃ = D̃_c ∪ D̃_p, where D̃_c is the clean dataset and D̃_p is the poisoned dataset. For the poisoned dataset D̃_p, by inserting a trigger pattern Γ(·) into each original sample in D̃_p, the sample should be misclassified to one pre-assigned target label t. Our goal is to solve the following optimization problem:

min_θ L_inj(θ) := Σ_{x∈D̃_c} ℓ(f_θ(x), f_{θ_0}(x)) + Σ_{x′∈D̃_p} ℓ(f_θ(Γ(x′)), t).

The first term is used to ensure the high performance of the original task [6], and the second term is for watermark injection. The major difference between our method and [6] is that we use the generated data for fine-tuning the same model instead of distilling a new model; we repurpose the benign generated dataset for injecting watermarks.
Considering a black-box setting, to verify whether a suspect model M_s is a copy of our protected model M, we can use the generated surrogate OoD dataset as safe verification samples. As the generation is kept secret, no one other than the owner can complete the verification. Since the verification data is agnostic to third parties, an attacker cannot directly use it to efficiently remove watermarks. Thus, we can guarantee the safety of the verification. Formally, we check the probability that watermarked verification samples successfully mislead the model M_s into predicting the pre-defined target label t, denoted as the watermark success rate (WSR). Since the ownership of a stolen model can be claimed by the model owner if the suspect model's behavior differs significantly from any non-watermarked model [61], if the WSR is larger than a random guess and also far exceeds the probability of a non-watermarked model classifying the verification samples as t, then M_s will be considered a copy of M with high probability. A T-test between the output logits of the suspect model M_s and a non-watermarked model on the verification dataset is also used as a metric to evaluate whether M_s is a stolen copy. Compared with traditional watermark injection techniques, i.i.d. data is also unnecessary in the verification process.
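As a minimal illustration of this injection objective, the sketch below fine-tunes a pre-trained model on the augmented single-image surrogate data, distilling the frozen pre-trained model on clean patches and mapping triggered patches to the target label. The trigger function, model interface, and data loader are hypothetical placeholders under stated assumptions, not the exact implementation used in this chapter.

```python
import copy
import torch
import torch.nn.functional as F

def add_trigger(x, patch_size=6):
    """Hypothetical trigger Gamma(.): paste a white square in the corner."""
    x = x.clone()
    x[:, :, -patch_size:, -patch_size:] = 1.0
    return x

def inject_watermark(model, surrogate_loader, target_label=0,
                     poison_ratio=0.1, epochs=20, lr=1e-3, device="cpu"):
    """Optimize L_inj: distill the frozen pre-trained model on clean OoD patches
    and force triggered patches to the target label."""
    teacher = copy.deepcopy(model).to(device).eval()      # f_{theta_0}, frozen
    model.train().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x in surrogate_loader:                        # unlabeled OoD patches
            x = x.to(device)
            n_poison = max(1, int(poison_ratio * x.size(0)))
            x_clean, x_poison = x[n_poison:], add_trigger(x[:n_poison])
            with torch.no_grad():
                soft = F.softmax(teacher(x_clean), dim=1)  # pseudo-labels
            # Soft-label cross-entropy against the teacher's predictions.
            loss_clean = -(soft * F.log_softmax(model(x_clean), dim=1)).sum(dim=1).mean()
            loss_wm = F.cross_entropy(
                model(x_poison),
                torch.full((n_poison,), target_label, device=device))
            (loss_clean + loss_wm).backward()
            opt.step(); opt.zero_grad()
    return model
```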
5.3.2 Robust Watermark Injection
According to [2, 162], the watermark may be removed by fine-tuning when adversaries have access to i.i.d. data. Watermark removal attacks such as fine-tuning and pruning shift the model parameters on a small scale to maintain standard accuracy while removing watermarks. If the protected model shares a similar parameter distribution with the pre-trained model, the injected watermark could be easily erased by fine-tuning with i.i.d. data or by adding random noise to the parameters [40]. To defend against removal attacks, we aim to make our watermark robust and persistent within a small scale of parameter perturbations.
Backdoor training with weight perturbation. To this end, we introduce adversarial weight perturbation (WP) into backdoor fine-tuning. First, we simulate the watermark removal attack that maximizes the loss to escape from the watermarked local minima. We let θ = (w, b) denote the model parameters, where θ is composed of weights w and biases b, and the weight perturbation is denoted as v. Then, we adversarially minimize the loss after the simulated removal attack. The adversarial minimization strategy echoes previous sharpness-aware optimization principles for robust model poisoning [49]. Thus, the adversarial training objective is formulated as min_{w,b} max_{v∈V} L_per(w + v, b), where

L_per(w + v, b) := L_inj(w + v, b) + β Σ_{x∈D̃_c, x′∈D̃_p} KL(f_{(w+v,b)}(x), f_{(w+v,b)}(Γ(x′))).    (5.1)

In Eq. (5.1), we constrain the weight perturbation v within a set V, KL(·, ·) is the Kullback–Leibler divergence, and β is a positive trade-off parameter. The first term is identical to standard watermark injection. Inspired by previous work [35], the second term preserves the main-task performance and maintains the representation similarity between poisoned and clean samples in the presence of weight perturbation. Eq. (5.1) facilitates injecting the worst-case perturbation of the constrained weights while maintaining the standard accuracy and the watermark success rate.
In the above adversarial optimization, the scale of the perturbation v is critical. If the perturbation is too large, the anomalies of the parameter distribution could be easily detected by an IP infringer [135]. Since the weight distributions differ by layer of the network, the magnitude of the perturbation should vary accordingly from layer to layer. Following [179], we adaptively restrict the weight perturbation v_l for the l-th layer weight w_l as

∥v_l∥ ≤ γ∥w_l∥,    (5.2)

where γ ∈ (0, 1). The set V in Eq. (5.1) is thus decomposed into balls with radius γ∥w_l∥ per layer.
Optimization. The optimization process alternates between two steps that update the perturbation v and the weight w. (1) v-step: To respect the constraint in Eq. (5.2), we need a projection. Since v is updated layer-wise, we need a projection function Π(·) that projects every perturbation v_l that violates the constraint in Eq. (5.2) back onto the surface of the perturbation ball with radius γ∥w_l∥. To achieve this, we define Π_γ following [179]:

Π_γ(v_l) = { γ (∥w_l∥ / ∥v_l∥) v_l,  if ∥v_l∥ > γ∥w_l∥;    v_l,  otherwise. }    (5.3)

With this projection, the perturbation v in Eq. (5.1) is computed as v ← Π_γ( v + η_1 (∇_v L_per(w + v, b) / ∥∇_v L_per(w + v, b)∥) ∥w∥ ), where η_1 is the learning rate. (2) w-step: With the updated perturbation v, the weight of the perturbed model θ is updated using w ← w − η_2 ∇_{w+v} L_per(w + v, b), where η_2 is the learning rate.
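For concreteness, here is a minimal sketch of one layer-wise v-step/w-step iteration under the constraint of Eq. (5.2). In this sketch, loss_fn stands for L_per evaluated on a batch, the perturbation is applied to all parameters rather than only the weights, and all hyperparameters are illustrative assumptions.

```python
import torch

# Usage assumption: perturb = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

def awp_step(model, perturb, loss_fn, gamma=0.1, eta1=0.01, eta2=0.1):
    """One adversarial weight-perturbation iteration (v-step then w-step)."""
    # --- v-step: ascend L_per w.r.t. v, then project onto ||v_l|| <= gamma*||w_l|| ---
    for name, p in model.named_parameters():
        p.data.add_(perturb[name])                        # evaluate at w + v
    loss = loss_fn(model)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    w_norm = torch.sqrt(sum(((p.detach() - perturb[n]) ** 2).sum()
                            for n, p in model.named_parameters()))
    g_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    for (name, p), g in zip(model.named_parameters(), grads):
        v_l = perturb[name] + eta1 * (g / g_norm) * w_norm  # normalized ascent step
        w_l = p.detach() - perturb[name]                    # current clean weight
        if v_l.norm() > gamma * w_l.norm():                 # projection Pi_gamma (Eq. 5.3)
            v_l = gamma * w_l.norm() / (v_l.norm() + 1e-12) * v_l
        p.data.add_(v_l - perturb[name])                    # move params to w + new v
        perturb[name] = v_l
    # --- w-step: descend L_per at (w + v), then remove the perturbation ---
    loss = loss_fn(model)
    loss.backward()
    for name, p in model.named_parameters():
        p.data.add_(-eta2 * p.grad)                         # w <- w - eta2 * grad
        p.grad = None
        p.data.add_(-perturb[name])                         # restore clean w; v kept for next step
    return loss.item()
```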
data and used as a reference for our method. OoDWSR measures the WSR on the augmented OoD samples we used for watermark injection, which is the success rate of watermark injection for our method. T-test takes the output logits of the non-watermarked model and suspect model Ms as input, and the null hypothesis is the logits distribution of the suspect model is identical to that of a non-watermarked model. If the p-value of the T-test is smaller than the threshold 0.05, then we can reject the null hypothesis and statistically verify that Ms differs significantly from the non-watermarked model, so the ownership of Ms can be claimed [61]. Higher OoDWSR with a p-value smaller than the threshold and meanwhile a larger Acc indicate a successful watermark injection. Trigger patterns. To attain the best model with the highest watermark success rate, we use the OoDWAR to choose triggers from 6 different backdoor patterns: BadNets with grid (bad- net_grid) [44], l0-invisible (l0_inv) [93], smooth [190], Trojan Square 3 × 3 (trojan_3 × 3), Trojan Square 8×8 (trojan_8×8), and Trojan watermark (trojan_wm) [109]. Pre-training models. The detailed information of the pre-trained models is shown in Table 5.1. All the models are pre-trained on clean samples until convergence, with a learning rate of 0.1, SGD 1https://pixabay.com/photos/japan-ueno-japanese-street-sign-217883/ 2https://www.teahub.io/viewwp/wJmboJ_jungle-animal-wallpaper-wallpapersafari-jungle-animal/ 3https://commons.wikimedia.org/wiki/File:GG-ftpoint-bridge-2.jpg 74 Dataset Class num DNN architecture Acc CIFAR-10 CIFAR-100 GTSRB 10 100 43 WRN-16-2 [188] 0.9400 WRN-16-2 [188] 0.7234 0.9366 ResNet18 [47] Table 5.1: Pre-trained models. optimizer, and batch size 128. We follow public resources to conduct the training such that the performance is close to state-of-the-art results. Watermark removal attacks. To evaluate the robustness of our proposed method, we consider three kinds of attacks on victim models: 1) FT : Fine-tuning includes three kinds of methods: a) fine-tune all layers (FT-AL), b) fine-tune the last layer and freeze all other layers (FT-LL), c) re-initialize the last layer and then fine-tune all layers (RT-AL). 2) Pruning-r% indicates pruning r% of the model parameters which has the smallest absolute value, and then fine-tuning the model on clean i.i.d. samples to restore accuracy. 3) Model Extraction: We use knockoff [130] as an example of the model extraction attack, which queries the model to get the predictions of an auxiliary dataset (ImagenetDS [20] is used in our experiments), and then clones the behavior of a victim model by re-training the model with queried image- prediction pairs. Assume the adversary obtains 10% of the training data of the pre-trained models for fine-tuning and pruning. Fine-tuning and pruning are conducted for 50 epochs. Model extraction is conducted for 100 epochs. 5.4.1 Watermark Injection The poisoning ratio of the generated surrogate dataset is 10%. For CIFAR-10 and GTSRB, we fine-tune the pre-trained model for 20 epochs (first 5 epochs are with WP). For CIFAR- 100, we fine-tune the pre-trained model for 30 epochs (first 15 epochs are with WP). The perturbation constraint γ in Eq. (5.2) is fixed at 0.1 for CIFAR-10 and GTSRB, and 0.05 for CIFAR-100. The trade-off parameter β in Eq. (5.1) is fixed at 6 for all the datasets. The watermark injection process of CIFAR-10 is shown in Fig. 5.2, and watermark injection for the other two datasets can be found in ?? 13. 
5.4.1 Watermark Injection
The poisoning ratio of the generated surrogate dataset is 10%. For CIFAR-10 and GTSRB, we fine-tune the pre-trained model for 20 epochs (the first 5 epochs with WP). For CIFAR-100, we fine-tune the pre-trained model for 30 epochs (the first 15 epochs with WP). The perturbation constraint γ in Eq. (5.2) is fixed at 0.1 for CIFAR-10 and GTSRB, and 0.05 for CIFAR-100. The trade-off parameter β in Eq. (5.1) is fixed at 6 for all the datasets. The watermark injection process for CIFAR-10 is shown in Fig. 5.2, and the watermark injection for the other two datasets can be found in ?? 13. We observe that the injection process is efficient: it takes only 10 epochs for CIFAR-10 to achieve stable, high standard accuracy and OoDWSR. The highest OoDWSR for CIFAR-10 is 95.66%, with a standard accuracy degradation of less than 3%. In the following experiments, we choose the triggers with the top-2 OoDWSR and a standard accuracy degradation of less than 3% as the recommended watermark patterns.

Figure 5.2: Acc, ID WSR, and OoD WSR during watermark injection on CIFAR-10: (a) Acc, (b) ID WSR, (c) OoD WSR.

Table 5.2: Evaluation of watermarking against fine-tuning and pruning on three datasets.

Victim models:
Dataset     Trigger      Non-watermarked model OoDWSR   Acc      IDWSR    OoDWSR
CIFAR-10    trojan_wm    0.0487                         0.9102   0.9768   0.9566
CIFAR-10    trojan_8x8   0.0481                         0.9178   0.9328   0.9423
CIFAR-100   trojan_8x8   0.0001                         0.6978   0.7024   0.8761
CIFAR-100   l0_inv       0.0002                         0.6948   0.7046   0.5834
GTSRB       smooth       0.0145                         0.9146   0.1329   0.9442
GTSRB       trojan_wm    0.0220                         0.9089   0.7435   0.7513

Suspect model results under watermark removal (Acc, IDWSR, OoDWSR, and p-value for FT-AL, FT-LL, RT-AL, Pruning-20%, Pruning-50%):
FT-AL FT-LL RT-AL 0.9191 0.9769 0.7345 0.9990 0.8706 0.4434 Pruning-20% 0.9174 0.9771 Pruning-50% 0.9177 0.9780 0.9187 0.9533 0.7408 0.9891 0.8675 0.0782 Pruning-20% 0.9197 0.9560 Pruning-50% 0.9190 0.9580 FT-AL FT-LL RT-AL FT-AL FT-LL RT-AL 0.6712 0.5602 0.4984 0.9476 0.5319 0.0227 Pruning-20% 0.6702 0.6200 Pruning-50% 0.6645 0.6953 0.6710 0.7595 0.4966 0.9991 0.5281 0.0829 Pruning-20% 0.6704 0.7817 Pruning-50% 0.6651 0.8288 FT-AL FT-LL RT-AL FT-AL FT-LL RT-AL 0.8623 0.0051 0.6291 0.0487 0.8622 0.0041 Pruning-20% 0.8625 0.0053 Pruning-50% 0.8628 0.0052 0.8684 0.3257 0.5935 0.7429 0.8519 0.1170 Pruning-20% 0.8647 0.3235 Pruning-50% 0.8610 0.3281 FT-AL FT-LL RT-AL 0.0000 0.0000 1.0103e-12 0.0000 0.0000 0.0000 0.0000 0.9678 0.9972 0.5752 0.9641 0.9658 0.9797 0.9945 0.2419 2.9829e-241 2.0500e-08 0.9793 0.9801 5.1651e-247 0.7443 0.9641 0.0700 0.7815 0.7960 0.5491 0.6097 0.1232 0.5517 0.5530 0.6772 0.9527 0.7431 0.6798 0.6778 0.1726 0.5751 0.0684 0.1779 0.1747 0.0012 0.0066 0.0090 0.0020 0.0049 0.0206 0.0106 0.0010 0.0099 0.0025 4.4360e-10 0.0006 0.0000 0.0179 0.0215 0.0117 7.4281e-11 0.0000 0.0131 0.0000

5.4.2 Defending Against Fine-tuning & Pruning
We evaluate the robustness of our proposed method against fine-tuning and pruning in Table 5.2, where the victim models are watermarked models, and the suspect models are stolen copies of the victim models obtained via watermark removal attacks. The OoDWSR of the pre-trained model in Table 5.1 is the probability that a non-watermarked model classifies the verification samples as the target label. If the OoDWSR of a suspect model far exceeds that of the non-watermarked model, the suspect model can be judged a copy of the victim model [61]. FT-AL and pruning maintain the performance of the main classification task with an accuracy degradation of less than 6%, but OoDWSR remains high for all the datasets. Compared with FT-AL, FT-LL significantly brings down the standard accuracy, by over 15% for all the datasets. Even with this large sacrifice of standard accuracy, FT-LL still cannot wash out the injected watermark, and the OoDWSR even increases for some of the datasets. RT-AL loses 4.50%, 16.63%, and 5.47% (mean value over two triggers) standard accuracy, respectively, for the three datasets.
Yet, the OoDWSR under RT-AL remains larger than that of a random guess and of non-watermarked models. To statistically verify the ownership, we conduct a T-test between the non-watermarked model and the watermarked model. The p-value is the probability that the two models behave similarly. The p-values for all the datasets are close to 0. The low p-values indicate that the suspect models behave significantly differently from non-watermarked models, with probability at least 95%. Thus, these suspect models cannot escape the suspicion of copying our model M with a high chance.
IDWSR is also reported here as a reference, although we do not use i.i.d. data for verifying the ownership of our model. We observe that even though watermarks can be successfully injected into both our generated OoD dataset and i.i.d. samples (refer to IDWSR and OoDWSR for the victim model), they differ in their robustness against these two watermark removal attacks. For instance, for smooth on GTSRB, after fine-tuning or pruning, IDWSR drops under 1%, which is below the random guess; however, OoDWSR remains over 67%. This phenomenon is also observed for other triggers and datasets. Watermarks injected in OoD samples are much harder to wash out than watermarks injected into i.i.d. samples. Due to the different distributions, fine-tuning or pruning has a smaller impact on OoD samples than on i.i.d. samples.
To further verify our intuition, we also compare our method (OoD) with traditional backdoor-based methods using i.i.d. data (ID) for data poisoning on CIFAR-10. We use RT-AL, the strongest attack in Table 5.2, as an example. The results are shown in Table 5.3. Note that ID poison and the proposed OoD poison adopt IDWSR and OoDWSR, respectively, as the success rate of the injected watermark. Clean refers to the pre-trained model without watermark injection. With only one single OoD image for watermark injection, we can achieve comparable results to ID poisoning, which utilizes the entire ID training set. After RT-AL, the watermark success rate drops to 4.13% and 3.42%, respectively, for ID poison, while it drops to 57.52% and 24.19% for OoD poison, which verifies that our proposed method is also much more robust against watermark removal attacks.

Table 5.3: Comparison of watermarking methods against fine-tuning watermark removal using different training data. OoD injection is much more robust compared with i.i.d. injection.

Trigger      Training data   Victim Acc   Victim IDWSR   Victim OoDWSR   Suspect Acc   Suspect IDWSR   Suspect OoDWSR
trojan_wm    clean           0.9400       0.0639         0.0487          0.8646        0.0864          0.0741
trojan_wm    ID              0.9378       1.0000         0.9997          0.8593        0.0413          0.0195
trojan_wm    OoD             0.9102       0.9768         0.9566          0.8706        0.4434          0.5752
trojan_8x8   clean           0.9400       0.0161         0.0481          0.8646        0.0323          0.0610
trojan_8x8   ID              0.9393       0.9963         0.9992          0.8598        0.0342          0.0625
trojan_8x8   OoD             0.9178       0.9328         0.9423          0.8675        0.0782          0.2419

Table 5.4: Evaluation of watermarking against model extraction watermark removal on three datasets.

Dataset     Trigger      Victim Acc   Victim IDWSR   Victim OoDWSR   Suspect Acc   Suspect IDWSR   Suspect OoDWSR   p-value
CIFAR-10    trojan_wm    0.9102       0.9768         0.9566          0.8485        0.9684          0.9547           0.0000
CIFAR-10    trojan_8x8   0.9178       0.9328         0.9423          0.8529        0.8882          0.9051           0.0000
CIFAR-100   trojan_8x8   0.6978       0.7024         0.8761          0.5309        0.5977          0.7040           0.0059
CIFAR-100   l0_inv       0.6948       0.7046         0.5834          0.5200        0.0162          0.0622           0.0019
GTSRB       smooth       0.9146       0.1329         0.9442          0.6575        0.1386          0.9419           7.5891e-11
GTSRB       trojan_wm    0.9089       0.7435         0.7513          0.6379        0.7298          0.7666           2.6070e-21

Figure 5.3: The distribution of OoD and ID samples (a) before and (b) after fine-tuning. Generation data denotes augmented OoD samples from a single OoD image.
Figure 5.4: Weight distribution for the model w/ and w/o WP: (a) without WP, (b) with WP. The x-axis is the parameter values, and the y-axis is the number of parameters.

5.4.3 Defending Against Model Extraction
We evaluate the robustness of our proposed method against model extraction in Table 5.4. Under model extraction, the standard accuracy drops by 6% for the model pre-trained on CIFAR-10, and by more than 10% for the other two datasets. Re-training from scratch makes it hard for the suspect model to recover the original model's utility using an OoD dataset and soft labels queried from the watermarked model. OoDWSR is still over 90% and 76% for CIFAR-10 and GTSRB, respectively. Although OoDWSR is 6.22% for l0_inv, it is still well above the 0.02% observed for the non-watermarked model. All the datasets also have a p-value close to 0. All the above observations indicate that the re-training-based extracted model has a high probability of being a copy of our model. One possible reason these re-trained models still carry the watermark is that, during re-training, the backdoor information hidden in the soft labels queried by the IP infringers can also embed the watermark in the extracted model. The extracted model behaves increasingly similarly to the victim model as its decision boundary gradually approaches that of the victim model.
5.4.4 Qualitative Studies
Distribution of generated OoD samples and ID samples. We first augment an unlabeled OoD dataset, and then assign predicted labels to the samples using the model pre-trained on clean CIFAR-10 data. According to the distribution of OoD and ID samples before and after our watermark fine-tuning, as shown in Fig. 5.3, we observe that the OoD data drawn from one image lies close to the ID data with a small gap. After a few epochs of fine-tuning, some of the OoD data is drawn closer to the ID data, but still maintains no overlap. This helps us successfully implant watermarks into the pre-trained model while maintaining the difference between ID and OoD data. In this way, when our model is fine-tuned with clean ID data by attackers, the WSR on the OoD data will not be easily erased.

Table 5.5: Watermark injection using different OoD images.

OoD Image   Trigger      Acc      IDWSR    OoDWSR
City        trojan_wm    0.9102   0.9768   0.9566
City        trojan_8x8   0.9178   0.9328   0.9423
Animals     trojan_wm    0.9072   0.9873   0.9880
Animals     trojan_8x8   0.9176   0.9251   0.9622
Bridge      trojan_wm    0.9207   0.8749   0.7148
Bridge      trojan_8x8   0.9172   0.7144   0.7147

Table 5.6: Weight perturbation increases the robustness of the watermarks against removal attacks.

Trigger      WP    Victim Acc   Victim IDWSR   Victim OoDWSR   Suspect Acc   Suspect IDWSR   Suspect OoDWSR
trojan_wm    w/o   0.9264       0.9401         0.9490          0.8673        0.1237          0.1994
trojan_wm    w/    0.9102       0.9768         0.9566          0.8706        0.4434          0.5752
trojan_8x8   w/o   0.9238       0.9263         0.9486          0.8690        0.0497          0.1281
trojan_8x8   w/    0.9178       0.9328         0.9423          0.8675        0.0782          0.2419

Effects of different OoD images for watermark injection. In Table 5.5, we use different source images to generate surrogate datasets and inject watermarks into a pre-trained model. The model is pre-trained on CIFAR-10. From these results, we observe that the choice of the OoD image for injection is also important. Dense images such as "City" and "Animals" can produce a higher OoDWSR than the sparse image "Bridge", since more knowledge is included in the visual representations of dense source images. Thus, dense images perform better for backdoor-based watermark injection.
This observation is also consistent with some previous arts [6, 5] about single image representations, which found that dense images perform better for model distillation or self-supervised learning. Effects of backdoor weight perturbation. We show the results in Fig. 5.4. The initial model is WideResNet pre-trained on CIFAR-10, and the fine-tuned model is the model fine-tuning using our proposed method. If the OoD data is directly utilized to fine-tune the pre-trained models with only a few epochs, the weight distribution is almost identical for pre-trained and fine-tuned models (left figure). According to [40], if the parameter perturbations are small, the backdoor-based watermark can be easily removed by fine-tuning 80 or adding random noise to the model’s parameters. Our proposed watermark injection WP (right figure) can shift the fine-tuned model parameters from the pre-trained models in a reasonable scale compared with the left one, while still maintaining high standard accuracy and watermark success rate as shown in Table 5.6. Besides, the weight distribution of the perturbed model still follows a normal distribution as the unperturbed model, performing statistical analysis over the model parameters distributions will not be able to erase our watermark. To show the effects of WP, we conduct the attack RT-AL on CIFAR-10 as an example. From Table 5.6, we observe that WP does not affect the model utility, and at the same time, it will become more robust against stealing threats, since OoDWSR increases from 19.94% and 12.81% to 57.52% and 24.19%, respectively, for two triggers. More results for WP can be referred to ?? 13. 5.4.5 Summary In this chapter, we proposed a novel and practical watermark injection method that does not require training data and utilizes a single out-of-distribution image in a sample-efficient and time-efficient manner. We designed a robust weight perturbation method to defend against watermark removal attacks. Our extensive experiments on three benchmarks showed that our method efficiently injected watermarks and was robust against three watermark removal threats. Our approach has various real-world applications, such as protecting purchased models by encoding verifiable identity and implanting server-side watermarks in distributed learning when ID data is not available. 81 CHAPTER 6 WHO LEAKED THE MODEL? TRACKING IP INFRINGERS IN ACCOUNTABLE FEDERATED LEARNING This chapter is based on the following work: Who Leaked the Model? Tracking IP Infringers in Accountable Federated Learning. Shuyang Yu, Junyuan Hong, Yi Zeng, Fei Wang, Ruoxi Jia, Jiayu Zhou. Regulatable ML workshop @NeurIPS2023. 6.1 Introduction Federated learning (FL) [73] has been widely explored as a distributed learning paradigm to enable remote clients to collaboratively learn a central model without sharing their raw data, effectively leveraging the massive and diverse data available in clients for learning and protecting the data confidentiality. The learning process of FL models typically requires the coordination of significant computing resources from a multitude of clients to curate the valuable information in the client’s data, and the FL models usually have improved performance than isolated learning and thus high commercial value. Recently, the risk of leaking such high-value models has drawn the attention of the public. One notable example is the leakage of the foundation model from Meta [164] by users who gained the restricted distribution of models. 
The leakage through restricted distribution could be even more severe in FL, which allows all participating clients to gain access to the valued model. For each iterative communication round, a central server consolidates models from various client devices, forming a global or central model. This model is then disseminated back to the clients for the next update, and therefore malicious clients have full access to the global models. As such, effectively protecting the global models in FL is a grand challenge.
Watermarking techniques [2, 19, 26, 34, 162, 192] have recently been introduced to verify the IP ownership of models. Among them, backdoor-based watermarking shows strong applicability because of its model-agnostic nature: it repurposes backdoor attacks on deep models and uses special-purposed data (a trigger set) to insert hidden patterns in the model that produce undesired outputs given inputs with triggers [192, 79, 41, 90]. A typical backdoor-based watermarking operates as follows. The model owner first generates a trigger set consisting of samples paired with pre-defined target labels. The owner then embeds the watermark into the model by fine-tuning the model with the trigger set and the original training samples. To establish the ownership of the model, one evaluates the accuracy of the suspect model on the trigger set. The mechanism rests on the assumption that only the watermarked model would perform exceptionally well on the unique trigger set. If the model's accuracy on the trigger set surpasses a significant threshold, the model likely belongs to the owner.

Figure 6.1: The proposed Decodable Unique Watermarking (DUW) for watermark injection and verification. During watermark injection, the server first uses client-unique keys and an OoD dataset as the input to the pre-trained encoder to generate trigger sets. When the server implants the watermark based on the objective function J′(θ_k) (Eq. (6.2)), a decoder is utilized to replace the classifier head. During verification, the suspect model is tested on all the trigger sets, and the client that leaked the model is identified as the one that achieves the highest WSR (Eq. (1.6)) on its trigger set.

Conventional backdoor-based watermarking, however, does not apply to FL settings because of the required access to the training data to maintain model utility. To address the challenge, Tekgul et al. [155] proposed WAFFLE, which utilized only random noise and class-consistent patterns to embed a backdoor-based watermark into the FL model. However, since WAFFLE injects a unified watermark for all the clients, it cannot answer another critical question: Who is the IP infringer among the FL clients? Based on WAFFLE, Shao et al. [143] introduced a two-step method, FedTracker, to verify the ownership of the model with the central watermark from WAFFLE, and to track the malicious clients in FL by embedding unique local fingerprints into local models.
However, the local fingerprint in [143] is a parameter-based method, which is not applicable for many practical scenarios, where many re-sale models are reluctant to expose their parameters, and the two-step verification is redundant. Therefore, how to spend the least effort on changing the model while verifying and tracking the IP infringers using the same watermark in FL remains to be a challenging problem. The aforementioned challenges call for a holistic solution towards accountable federated learning, which is characterized by the following essential requirements: R1) Accurate IP tracking: Each client has a unique ID to trace back. IP tracking should be confident to iden- tify one and only one client. R2) Confident verification: The ownership verification should be confident. R3) Model utility: The watermark injected should have minimal impact on stan- dard FL accuracy. R4) Robustness: The watermark should be robust and resilient against various watermark removal attacks. In this chapter, we propose a practical watermarking framework for FL called Decodable Unique Watermarking (DUW) to comply with these requirements. Specifically, we first generate unique trigger sets for each client by using a pre-trained encoder [101] to embed client-wise unique keys to one randomly chosen out-of- distribution (OoD) dataset. During each communication round, the server watermarks the aggregated global model using the client-wise trigger sets before dispatching the model. A decoder replaces the classifier head in the FL model during injection so that we can decode the model output to the client-wise keys. We propose a regularized watermark injection op- timization process to preserve the model’s utility. During verification, the suspect model is tested on the trigger sets of all the clients, and the client that achieves the highest watermark success rate (WSR) is considered to be the IP infringer. The framework of method is shown in Fig. 6.1. The contributions of our work can be summarized in three folds: 84 • We make the FL model leakage from anonymity to accountability by injecting DUW. DUW enables ownership verification and leakage tracing at the same time without access to model parameters during verification. • With utility preserved, both the ownership verification and IP tracking of our DUW are not only accurate but also confident without collisions. • Our DUW is robust against existing watermarking removal attacks, including fine-tuning, pruning, model extraction, and parameter perturbation. 6.2 Related Work and Background Federated learning (FL) is a distributed learning framework that enables massive and remote clients to collaboratively train a high-quality central model [74]. This chapter targets the cross-silo FL with at most hundreds of clients [120]. In the cross-silo setting, each client is an institute, like a hospital or a bank. It is widely adopted in practical scenario [9, 151, 198, 155]. FedAvg [121] is one of the representative methods for FL, which averages local models during aggregation. This work is based on the FedAvg. The learning process and objective function can be found in Section 1.3.4. DNN watermarking in FL. The introduction of the centralized DNN watermarking can be found in Section 5.2.1. WAFFLE [155] is the first FL backdoor-based watermarking, which utilized random noise and class-consistent patterns to embed a backdoor-based wa- termark into the FL model. However, WAFFLE can only verify the ownership of the model, yet it cannot track the specific IP infringers. 
6.3 Method Watermarking has shown to be a feasible solution for IP verification, and the major goal of this work is to seek a powerful extension for traceable IP verification for accountable FL that can accurately identify the infringers among a scalable number of clients. A straightfor- ward solution is injecting different watermarks for different clients. However, increasing the 85 number of watermarks could lower the model’s utility as measured by the standard accuracy due to increased forged knowledge [153] (R3). Meanwhile, maintaining multiple watermarks could be less robust to watermark removal because of the inconsistency between injections (R4). Accurate IP tracking (R1) is one unique requirement we seek to identify the infringer’s identity as compared with traditional watermarking in central training. The greatest chal- lenge in satisfying R1 is addressing the watermark collisions between different clients. A watermark collision is when the suspect model produces similar watermark responses on different individual verification datasets in FL systems. Formally: Definition 6.3.1 (Watermark collision). During verification in Definition 1.3.1, we test the suspect model Ms on all the verification datasets DT = {DT1, . . . , DTk, . . . , DTK } of all the clients to identify the malicious client, and WSR for the k-th verification datasets is defined as WSRk. If we have multiple clients k satisfying WSRk = Acc(Ms, DTk) > σ, the ownership of suspect model Ms can be claimed for more than one client, then the watermark collisions happen between clients. 6.3.1 Pitfalls for Watermark Collision To avoid watermark collision, one straightforward solution is to simply design different trigger sets for different clients. However, this strategy may easily lead to the watermark-collision pitfall. We use traditional backdoor-based watermarking by adding arbitrary badnet [45] triggers using random noise or 0-1 coding trigger for each client as examples to demonstrate this pitfall. We conduct the experiments on CIFAR-10 with 100 clients, during 4 injection rounds, at least 89% and 87% of the clients have watermark collisions for two kinds of triggers, respectively. To analyze why these backdoor-based watermarkings lead us into the trap, we list all the clients with watermark collisions for one trial, and define the client_ID with the highest WSR as the predicted client_ID. We found that 87.5% of the predicted client_ID share the same target label as the ground truth client, and for the rest 12.5% clients, both the trigger pattern 86 and target label are different. Based on the results, we summarize two possible reasons: 1) The same target labels will easily lead to the watermark collision. 2) The trigger pattern differences between clients are quite subtle, so the differences between the watermarked models for different clients are hard to detect. Thus, in order to avoid this pitfall, we have to ensure the uniqueness of both the triggers and target labels between different clients. More experiment settings and results for pitfalls can be referred to Section 6.4.2. 6.3.2 Decodable Unique Watermarking In this section, we propose the Decodable Unique Watermark (DUW) that can simultane- ously address the four requirements of accountable FL, which are summarized in Section 6.1: R1 (accurate IP tracking), R2 (confident verification), R3 (model utility), R4 (robustness). In DUW, all the watermarking is conducted on the server side, so no computational overhead is introduced to clients. 
Before broadcasting the global model to each local client, the server injects a unique watermark for each client. The watermark is unknown to the clients but known to the server (see Fig. 6.1, server watermark injection). Our DUW consists of the following two steps for encoding and decoding the client-unique keys.
Step 1: Client-unique trigger encoding. Due to the data confidentiality of FL, the server has no access to any data from any of the clients. Therefore, for watermark injection, the server needs to collect or synthesize some OoD data for trigger set generation. The performance of the watermark is not sensitive to the choice of the OoD dataset. To accurately track the malicious client, we have to distinguish between the watermarks of different clients: high similarity between the trigger sets of different clients is likely to cause watermark collisions among the clients (see Section 6.3.1), which makes it difficult to identify which client leaked the model. To solve this problem, we propose to use a pre-trained encoder E : X → X governed by θE from [101] to generate unique trigger sets for each client. This backdoor-based method injects watermarks with close to 100% WSR, which ensures confident verification (R2). We design a unique key for each client ID as a one-hot binary string to differentiate clients. For instance, for the k-th client, the k-th entry of the key string sk is 1, and all other entries are 0. We set the length of the key to d, where d ≥ K. For each client, the key is embedded into sample-wise triggers on the OoD samples by feeding the unique key and the OoD data to the pre-trained encoder; the outputs of the encoder make up the trigger set. The trigger set for the k-th client is defined as DTk = {(x′, tk) | x′ = E(x, sk; θE), x ∈ DOoD}, where DOoD is a randomly chosen OoD dataset and tk is the target label for client k. In this way, the trigger sets of different clients differ by their unique keys, and watermark collision can be alleviated (R1). Note that our trigger sets also serve as the verification datasets.
Step 2: Client-unique target label by decoding triggers to client keys. The main intuition is that sharing the same target label across trigger sets may still lead to watermark collisions even if the keys are different (see Section 6.3.1). Thus, we propose to project the output dimension of the original model M to a higher dimension, larger than the client number K, so that each client can have a unique target label. To achieve this goal, we first set the target label tk in the trigger set DTk to be the same as the input key sk for each client, and then use a decoder D : Z → Y parameterized by θD to replace the corresponding classifier h in the FL training model M. The decoder D has only one linear layer, whose input dimension equals the input dimension of h and whose output dimension equals the length of the key. To avoid watermark collisions between clients induced by the target label, we make the decoder weights orthogonal to each other at random initialization so that the watermark injection tasks of different clients are independent (R1). The decoder weights are frozen once initialized to preserve this independence across clients.
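To make these two steps concrete, the sketch below shows one way the client-unique keys, the encoded trigger sets, and the frozen orthogonal decoder could be set up in PyTorch. It is a minimal illustration under assumptions: pretrained_encoder stands in for the sample-specific trigger encoder of [101] (its call signature here is hypothetical), and the client count, key length d, and feature dimension are placeholder values.

```python
# Minimal sketch (illustrative, not the exact implementation): client-unique keys,
# trigger-set generation with a pre-trained encoder, and a frozen orthogonal decoder.
import torch
import torch.nn as nn

num_clients = 100   # K
key_dim = 128       # d >= K, length of the one-hot key s_k
feat_dim = 512      # output dimension of the feature extractor f (assumed)

def make_client_key(client_id: int, d: int = key_dim) -> torch.Tensor:
    """One-hot key s_k: the k-th entry is 1 and all other entries are 0."""
    key = torch.zeros(d)
    key[client_id] = 1.0
    return key

def build_trigger_set(ood_images: torch.Tensor, client_id: int, pretrained_encoder):
    """Embed client k's key into every OoD image to form its trigger/verification set.
    The target label of these samples is the key itself, decoded later by D."""
    keys = make_client_key(client_id).expand(ood_images.size(0), -1)
    with torch.no_grad():
        # x' = E(x, s_k; theta_E); the two-argument call is a hypothetical signature
        triggered = pretrained_encoder(ood_images, keys)
    return triggered, keys

# Single-layer decoder D: feature space -> key space. Rows are orthogonally
# initialized and then frozen, so per-client injection tasks stay independent.
decoder = nn.Linear(feat_dim, key_dim, bias=False)
nn.init.orthogonal_(decoder.weight)
decoder.weight.requires_grad_(False)
```

Only the feature extractor is updated later during injection; keeping the orthogonally initialized decoder frozen is the design choice that preserves the independence of the per-client injection tasks.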
Suppose θk = (θf_k, θh_k) are the parameters to be broadcast to client k, where θf_k parameterizes the feature extractor f and θh_k the classifier h. We formulate the injection optimization as

min_{θf_k} J(θf_k), where J(θf_k) := (1/|DTk|) Σ_{(x′, sk)∈DTk} ℓ(D(f(x′; θf_k); θD), sk).    (6.1)

The classifier h is plugged back into the model before the server broadcasts the watermarked models to the clients. Compared with traditional backdoor-based watermarking, our watermark injection requires no client training samples, which ensures the data confidentiality of FL.
Robustness. Our framework also brings robustness against fine-tuning-based watermark removal (R4). The main intuition is that replacing the classifier h with the decoder D separates the watermark injection task space from the original classification task space. Since the malicious clients have no access to the decoder and can only attack the model M, their attacks affect the classification task more than our watermark injection task, which makes our decodable watermark more resilient against watermark removal attacks.
Algorithm 4 Injection of Decodable Unique Watermarking (DUW)
1: Require: client datasets {Dk}_{k=1}^K, OoD dataset DOoD, secret keys {sk}_{k=1}^K, pre-trained encoder E, pre-defined decoder D, global parameters θg, local parameters {θk}_{k=1}^K, learning rates α, β, local training steps T, watermark injection steps Tw.
2: Step 1: Client-unique trigger encoding.
3: for k = 1, . . . , K do
4:   Generate the trigger set for client k: DTk = {(x′, sk) | x′ = E(x, sk; θE), x ∈ DOoD}
5: end for
6: Step 2: Decoding triggers to client keys.
7: repeat
8:   Server selects active clients A uniformly at random
9:   for all clients k ∈ A do
10:    Server initializes the watermarked model for client k as θk ← θg.
11:    for t = 1, . . . , Tw do
12:      Server replaces the model classifier h with the decoder D.
13:      Server injects the watermark using trigger set DTk and updates θf_k as: θf_k ← θf_k − β∇_{θf_k} J′(θf_k).   ▷ Optimize Eq. (6.2)
14:    end for
15:    Server broadcasts θk to the corresponding client k.
16:    for t = 1, . . . , T do
17:      Client local training on the local set Dk: θk ← θk − α∇_{θk} Jk(θk).   ▷ Optimize Eq. (1.7)
18:    end for
19:    Client k sends θk back to the server.
20:  end for
21:  Server updates θg ← (1/|A|) Σ_{k∈A} θk.
22: until training stops
6.3.3 Injection Optimization with Preserved Utility
As the number of clients increases, watermark injection in the OoD region may lead to a significant drop in standard FL accuracy (R3) because of the overload of irrelevant knowledge. An ideal solution would be to bundle the injection with training on in-distribution (ID) data, which, however, is impractical for a data-free server. Meanwhile, lacking ID data to maintain standard task accuracy, the distinct information between the growing watermark sets and the task sets can cause the task knowledge to fade out. We attribute such knowledge vanishing to the divergence in parameter space between the watermarked and the original models. Thus, we propose to augment the injection objective in Eq. (6.1) with an l2 regularization on the parameters:

J′(θf_k) := J(θf_k) + (β/2) ∥θf_k − θf_g∥²,    (6.2)

where θf_g denotes the feature-extractor parameters of the original (non-watermarked) global model. The regularization term in Eq. (6.2) restricts the distance between the watermarked model and the non-watermarked one so that the utility of the model is better preserved (R3). Our proposed DUW is summarized in Algorithm 4.
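As a rough illustration of the per-client injection step performed by the server in Algorithm 4, the following sketch optimizes only the feature extractor on the client's trigger set through the frozen decoder and adds the l2 proximal term of Eq. (6.2). It is a sketch under assumptions: feature_extractor, decoder, and trigger_loader are placeholders, cross-entropy against the key index is used as one natural choice for the loss ℓ, and the step count and learning rate are illustrative rather than the exact values used in our experiments.

```python
# Illustrative sketch of the regularized injection objective J'(theta_f) in Eq. (6.2).
import copy
import torch
import torch.nn.functional as F

def inject_watermark(feature_extractor, decoder, trigger_loader,
                     steps=10, lr=1e-3, beta=0.1):
    """Fine-tune only f so that D(f(x')) decodes to the client key, while an l2
    proximal term keeps f close to the non-watermarked global parameters."""
    global_state = copy.deepcopy(feature_extractor.state_dict())  # theta_f_g
    opt = torch.optim.SGD(feature_extractor.parameters(), lr=lr)
    data_iter = iter(trigger_loader)

    for _ in range(steps):
        try:
            x_trig, keys = next(data_iter)
        except StopIteration:
            data_iter = iter(trigger_loader)
            x_trig, keys = next(data_iter)

        opt.zero_grad()
        logits = decoder(feature_extractor(x_trig))          # D(f(x'; theta_f); theta_D)
        loss = F.cross_entropy(logits, keys.argmax(dim=1))   # l(D(f(x')), s_k)

        # (beta / 2) * || theta_f - theta_f_g ||^2
        prox = sum(((p - global_state[name]) ** 2).sum()
                   for name, p in feature_extractor.named_parameters())
        (loss + 0.5 * beta * prox).backward()
        opt.step()
    return feature_extractor
```

In the full algorithm, the server would run such a routine for each active client before broadcasting and plug the original classifier h back in afterwards.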
6.3.4 Verification
During verification, we not only verify whether the suspect model Ms = (fs, hs) is a copy of our model M, but also track who the leaker is among all the clients by examining whether the triggers can be decoded into the corresponding keys. To achieve this goal, we first use our decoder D to replace the classifier hs in the suspect model Ms, so that the suspect model is restructured as Ms = (fs, D). Following Definition 6.3.1, we test the suspect model Ms on the verification datasets DT = {DT1, . . . , DTk, . . . , DTK} of all the clients to track the malicious client, and report WSRk on the k-th verification dataset. The client whose verification dataset achieves the highest WSR is the one who leaked the model (see Fig. 6.1, server verification). The tracking mechanism is defined as Track(Ms, DT) = arg max_k WSRk. Suppose the ground-truth malicious client is km. If WSRkm > σ while WSRk for every other verification dataset is smaller than σ, then the ownership of the model can be verified and no watermark collision happens. If Track(Ms, DT) = km, the malicious client is identified correctly.
6.4 Experiments
In this section, we empirically show how our proposed DUW fulfills the requirements (R1-R4) for tracking infringers described in Section 6.1.
Datasets. To simulate the class non-iid FL setting, we use CIFAR-10 and CIFAR-100 [75], which contain 32 × 32 images with 10 and 100 classes, respectively. The CIFAR-10 data is uniformly split into 100 clients, with 3 random classes assigned to each client. The CIFAR-100 data is split into 100 clients following a Dirichlet distribution. For CIFAR-10 and CIFAR-100, the OoD dataset used for watermark injection is a subset of ImageNet-DS [20] with 500 randomly chosen samples downsampled to 32 × 32. To simulate the feature non-iid FL setting, we adopt a multi-domain FL benchmark, Digits [95, 57], which consists of 28 × 28 images of 10 digit classes and is widely used in the community [14, 121]. Digits includes five different domains: MNIST [82], SVHN [128], USPS [60], SynthDigits [38], and MNIST-M [38]. We leave out USPS as the OoD dataset for watermark injection (500 samples are chosen) and use the remaining four domains for standard FL training. Each domain of Digits is split into 10 clients, so 40 clients participate in the FL training.
Training setup. A preactivated ResNet (PreResNet18) [48] is used for CIFAR-10, a preactivated ResNet (PreResNet50) [48] for CIFAR-100, and the CNN defined in [98] for Digits. For all three datasets, we leave out 10% of the training set as a validation dataset to select the best FL model. The total number of training rounds is 300 for CIFAR-10 and CIFAR-100, and 150 for Digits.
Watermark injection. Since the early training stage of FL, where the standard accuracy is still low, is not worth protecting, we start watermark injection at round 20 for CIFAR-10 and Digits, and at round 40 for CIFAR-100. The standard accuracy before our watermark injection is 85.20%, 40.23%, and 29.41% for Digits, CIFAR-10, and CIFAR-100, respectively.
Evaluation metrics. For watermark verification, we use the watermark success rate (WSR), i.e., the accuracy on the trigger set. To measure whether we track the malicious client (leaker) correctly, we define the tracking accuracy (TAcc) as the rate of clients we track correctly.
To further evaluate the ability of our method to distinguish between the watermarks of different clients, we also report the difference between the highest and the second-highest WSR as WSR_Gap, which indicates the significance of verification and IP tracking; with a significant WSR_Gap, no watermark collision will happen. To evaluate the utility of the model, we report the standard FL accuracy (Acc) on each client's individual test set, whose classes match its training set, and the accuracy degradation (∆Acc) of the watermarked model compared with the non-watermarked one. (An illustrative code sketch of these quantities and the tracking rule is given at the end of Section 6.4.2.) Note that, to simulate the scenario where malicious clients leak their local models after local training, we report the average WSR, TAcc, and WSR_Gap of the local model of each client instead of the global model. Acc and ∆Acc are evaluated on the best FL model selected using the validation datasets.
6.4.1 IP Tracking Benchmark
We evaluate our method on the IP tracking benchmark with the metrics shown in Table 6.1. Our ownership verification is confident, with all WSRs over 99% (R2). The model utility is preserved, with accuracy degradations of 2.34%, 0.03%, and 0.63% for Digits, CIFAR-10, and CIFAR-100, respectively (R3). TAcc for all benchmark datasets is 100%, which indicates accurate IP tracking (R1). All WSR_Gaps are over 98%, which means the WSRs on all other benign clients' verification datasets are close to 0%. In this way, the malicious client can be tracked accurately and with high confidence, and no collisions occur within our tracking mechanism (R1).

Dataset     Acc     ∆Acc    WSR     WSR_Gap  TAcc
Digits      0.8855  0.0234  0.9909  0.9895   1.0000
CIFAR-10    0.5583  0.0003  1.0000  0.9998   1.0000
CIFAR-100   0.5745  0.0063  1.0000  0.9998   1.0000
Table 6.1: Benchmark results.

Figure 6.2: Validation accuracy (a), WSR (b), and TAcc (c) for the proposed DUW and the two baseline triggers on CIFAR-10 over 4 communication rounds.
6.4.2 Comparison with traditional backdoor-based watermarks
We compare our proposed DUW with two traditional backdoor-based watermarks in Fig. 6.2. Because watermark collision is guaranteed if all clients share the same trigger, we design different triggers for different clients. Specifically, we use traditional backdoor-based watermarking by adding arbitrary BadNets triggers based on random noise or a 0-1 coding pattern for each client. To distinguish between clients, for the 0-1 trigger, following [153], we set 5 pixel values of the pattern to 0 and the other 11 pixels to 1, and different combinations of the pattern are randomly chosen for different clients; for random noise triggers, we generate a different random noise trigger for each client. The trigger size is 4 × 4, and the injection is conducted for 4 rounds. The target label for each client is set to (client_ID % class_number). According to the results, traditional backdoor-based watermarks can only achieve a tracking accuracy lower than 13% (which drops even further as the number of communication rounds increases), far below the 100% tracking accuracy we achieve. Note that the rate of clients with watermark collisions can be calculated as 1-TAcc. Extended results on the failure of traditional backdoor-based watermarking can be found in ?? 14.
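For concreteness, the sketch below shows how the quantities above (the per-client WSR, the tracking rule Track(Ms, DT) of Section 6.3.4, WSR_Gap, and the collision condition of Definition 6.3.1) can be computed once the suspect model's classifier has been replaced by our decoder. It is an illustrative sketch: suspect_model is assumed to be a callable returning key logits, trigger_loaders is a hypothetical list of per-client verification loaders, and the threshold value for σ is a placeholder.

```python
# Illustrative sketch of WSR, WSR_Gap, the Track rule, and the collision check.
import torch

def wsr(suspect_model, trigger_loader, client_id):
    """Watermark success rate: fraction of trigger samples decoded to client k's key."""
    correct, total = 0, 0
    with torch.no_grad():
        for x_trig, _ in trigger_loader:
            pred = suspect_model(x_trig).argmax(dim=1)   # decoded key index
            correct += (pred == client_id).sum().item()
            total += x_trig.size(0)
    return correct / max(total, 1)

def track(suspect_model, trigger_loaders, sigma=0.5):
    """Return (predicted leaker, WSR_Gap, collision flag) over all clients' sets."""
    wsrs = [wsr(suspect_model, loader, k) for k, loader in enumerate(trigger_loaders)]
    ranked = sorted(range(len(wsrs)), key=lambda k: wsrs[k], reverse=True)
    wsr_gap = wsrs[ranked[0]] - wsrs[ranked[1]]
    collision = sum(w > sigma for w in wsrs) > 1         # Definition 6.3.1
    return ranked[0], wsr_gap, collision
```

TAcc is then simply the fraction of simulated leakage events for which the first value returned by track matches the ground-truth leaker.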
6.4.3 Robustness
Malicious clients can conduct watermark removal attacks before leaking the FL model to make it harder for us to verify the model copyright and accurately track the IP infringers. In this section, we show the robustness of the watermarks under various watermark removal attacks (R4). Specifically, we evaluate our method against 1) fine-tuning [2]: fine-tune the model using the attacker's own local data; 2) pruning [110]: prune the model parameters with the smallest absolute values according to a certain pruning rate, and then fine-tune the model on local data; 3) model extraction: first query the victim model for the labels of an auxiliary dataset, and then re-train the victim model on the annotated dataset; we take knockoff [130] as an example of the model extraction attack; 4) parameter perturbation: add random noise to the local model parameters [40]. Ten clients are selected as malicious clients, and the metrics in this section are averaged over these 10 malicious clients. All watermark removal attacks are conducted for 50 epochs with a learning rate of 10−5, and all attacks are applied to the local model of the last round.
Robustness against fine-tuning attack. We report the robustness of our proposed DUW against fine-tuning in Table 6.2. ∆Acc and ∆WSR in this table denote the accuracy and WSR drop compared with the values before the attack. According to the results, after 50 epochs of fine-tuning, the attacker can only decrease the WSR by less than 1%, and the TAcc is not affected at all. Fine-tuning with the attacker's limited local training samples can also cause standard accuracy degradation. Thus, fine-tuning can neither remove our watermark nor affect our IP tracking, even though it sacrifices standard accuracy.

Dataset     Acc     ∆Acc     WSR     ∆WSR    TAcc
Digits      0.9712  -0.0258  0.9924  0.0030  1.0000
CIFAR-10    0.7933  0.1521   1.0000  0.0000  1.0000
CIFAR-100   0.4580  0.0290   0.9930  0.0070  1.0000
Table 6.2: DUW is robust against fine-tuning.

Robustness against pruning attack. We investigate the effect of pruning in Fig. 6.3 by varying the pruning rate from 0 to 0.5. As the pruning ratio increases, neither TAcc nor WSR is affected, while for CIFAR-10 the standard accuracy drops by 5%. Therefore, pruning is not an effective attack on our watermark, and it even causes accuracy degradation on the classification task.
Figure 6.3: DUW is robust against pruning ((a) Digits, (b) CIFAR-10, (c) CIFAR-100).
Figure 6.4: DUW is robust against parameter perturbation ((a) Digits, (b) CIFAR-10, (c) CIFAR-100).
Robustness against model extraction attack. To verify the robustness of our proposed DUW against the model extraction attack, we take knockoff [130] as an example, and STL10 [23], cropped to the same size as the training data, is used as the auxiliary dataset for this attack (a short illustrative sketch of this attack is given after Table 6.3). According to the results for the three benchmark datasets in Table 6.3, after the knockoff attack, the WSR for all three datasets is still over 65%, and our tracking mechanism is unaffected, with TAcc remaining 100%. Therefore, our DUW is resilient to model extraction attacks.

Dataset     Acc     ∆Acc    WSR     ∆WSR    TAcc
Digits      0.8811  0.0643  0.9780  0.0174  1.0000
CIFAR-10    0.5176  0.4278  0.6638  0.3362  1.0000
CIFAR-100   0.4190  0.0680  0.8828  0.1172  1.0000
Table 6.3: DUW is robust against model extraction.
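For reference, the following is a minimal sketch of the label-only, knockoff-style extraction attack described above: the attacker queries the leaked model for hard labels on an auxiliary dataset (STL10 crops in our experiments) and re-trains on those pseudo-labels. The model and loader names are illustrative; the 50 epochs and 10−5 learning rate follow the attack setting stated earlier, and the surrogate may simply be a copy of the leaked model itself.

```python
# Illustrative sketch of a label-only (knockoff-style) model extraction attack.
import torch
import torch.nn.functional as F

def knockoff_extract(victim, surrogate, aux_loader, epochs=50, lr=1e-5):
    """Query `victim` for hard labels on auxiliary data and train `surrogate` on them."""
    opt = torch.optim.SGD(surrogate.parameters(), lr=lr)
    victim.eval()
    for _ in range(epochs):
        for x_aux, _ in aux_loader:                       # unlabeled auxiliary images
            with torch.no_grad():
                pseudo_y = victim(x_aux).argmax(dim=1)    # labels stolen from the victim
            loss = F.cross_entropy(surrogate(x_aux), pseudo_y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return surrogate
```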
Robustness against parameter perturbation attack. Malicious clients can also add random noise to the model parameters to remove watermarks, since [40] found that backdoor-based watermarks are usually not resilient to parameter perturbations. Adding random noise to the local model parameters can also increase the chance of blurring the difference between different watermarked models. We let each malicious client blend Gaussian noise into the parameters of its local model, setting the local model parameters to θi ← θi + αnoise · ϵ, where ϵ is Gaussian noise and αnoise ∈ {10−5, 10−4, 10−3, 10−2, 10−1} is the noise coefficient. We investigate the effect of parameter perturbation in Fig. 6.4. According to the results, when αnoise is smaller than 10−2, WSR, Acc, and TAcc are not affected. When αnoise = 10−2, Acc drops by more than 10%, while TAcc remains unchanged and WSR is still over 90%. When αnoise = 10−1, Acc drops to random guessing; thus, although the watermark has been removed, the model has no utility. Therefore, parameter perturbation is not an effective attack for removing our watermark or defeating our tracking mechanism.
6.4.4 Qualitative Study
Effects of decoder. To investigate the effect of the decoder on avoiding watermark collision, we compare the results with (w/) and without (w/o) the decoder. When the decoder is removed, the output dimension of the watermark injection task is the same as that of the FL classification task; thus, we also have to map the original target label (the same as the input key) into the FL classification label space. To achieve this, we set the target label in the w/o decoder case to (client_ID % class_number). We report the results of w/ and w/o decoder on CIFAR-10 after 1 round of watermark injection at round 20 in Table 6.4. According to the results, when we have 100 clients in total, w/o decoder can only achieve a TAcc of 6%, while w/ decoder increases TAcc to 100%. We also find that clients with the same target label are more likely to conflict with each other, which makes those clients difficult to identify even if their trigger sets are different. Utilizing a decoder to enlarge the target label space to a dimension larger than the client number allows every client to have its own target label; in this way, watermark collision can be avoided. Besides, the WSR of w/ decoder is also higher than that of w/o decoder after 1 round of injection. One possible reason is that the decoder makes the watermark injection task different from the original classification task, so the watermark is more easily injected than when it is injected directly into the original FL classification task.

Method       Acc     ∆Acc    WSR     TAcc
w/ decoder   0.3287  0.0736  0.8778  1.0000
w/o decoder  0.3235  0.0788  0.8099  0.0600
Table 6.4: Effects of decoder: the decoder can improve TAcc to avoid watermark collision. ∆Acc in this table is the accuracy degradation compared with the previous round.
Effects of l2 regularization. To show the effect of the l2 regularization in Eq. (6.2), we report the validation accuracy and WSR over 4 rounds of watermark injection on Digits for different values of the hyperparameter β in Fig. 6.5. Validation accuracy is the standard FL accuracy evaluated on a validation dataset at every round. We see that as β increases, higher validation accuracy is achieved, but correspondingly the WSR drops from over 90% to only 35.65%. A larger β increases the impact of the l2 norm, which decreases the difference between the watermarked model and the non-watermarked one, so the validation accuracy increases; at the same time, the updates during watermark injection are more strongly restricted by the l2 regularization, so the WSR drops to a low value. Accordingly, we select β = 0.1 for all our experiments, since β = 0.1 increases validation accuracy by 6.88% compared with β = 0 while maintaining a WSR over 90%.
Figure 6.5: Acc and WSR for different values of β ((a) validation accuracy, (b) WSR).
Figure 6.6: Acc and WSR for different OoD datasets ((a) validation accuracy, (b) WSR).
Effects of different OoD datasets for watermark injection. We investigate the effect of different OoD datasets, including USPS [60], GTSRB [145], random noise, and jigsaw images, for watermark injection when the standard training data is Digits. All OoD images are cropped to the same size as the training images. A jigsaw image is generated from a small 4 × 4 random image, which is then padded to the same size as the training images using the reflect padding mode from PyTorch. The effect of these different OoD datasets is shown in Table 6.5 and Fig. 6.6. All OoD datasets achieve 100% TAcc, suggesting that the choice of OoD dataset does not affect the tracking of the malicious client. There is a trade-off between Acc and WSR: a higher WSR always comes with a lower Acc. Random noise and jigsaw achieve high Acc, with accuracy degradation within 1%; these two noise-like OoD datasets also recover the standard accuracy faster after the accuracy drop at the watermark injection round, as shown in Fig. 6.6, but their WSRs are lower than 90%. For the two real OoD datasets, USPS and GTSRB, the WSR quickly reaches over 99% after 1 communication round, but their accuracy degradation is larger than 2%.

Dataset       Acc     ∆Acc    WSR     WSR_Gap  TAcc
USPS          0.8855  0.0234  0.9909  0.9895   1.0000
GTSRB         0.8716  0.0373  0.9972  0.9962   1.0000
Random noise  0.9007  0.0082  0.8422  0.8143   1.0000
Jigsaw        0.9013  0.0076  0.8789  0.8601   1.0000
Table 6.5: Effects of different OoD datasets: a trade-off exists between Acc and WSR, given different OoD datasets.

Scalability of DUW to more clients. We conduct an ablation study on Digits to show the effect of the number of clients in Table 6.6. According to the results, even with 600 clients, the WSR is still over 73% and the TAcc remains 100%. With more clients participating in FL, we can still track the malicious client correctly with high confidence.
Number of clients  Acc     ∆Acc     WSR     WSR_Gap  TAcc
40                 0.8855  0.0234   0.9909  0.9895   1.0000
400                0.8597  -0.0332  0.9521  0.9267   1.0000
600                0.8276  -0.0035  0.7337  0.6383   1.0000
Table 6.6: Effects of different numbers of clients.

Hybrid watermark. If only black-box access to a suspect model is available, our DUW can be combined with existing black-box unified watermarks: we first identify IP leakage using black-box detection with a unified watermark, and then identify the infringer using DUW's client-unique watermarks. We design a simple hybrid watermark in this section as an example. We pick one of the trigger sets generated for the clients as the trigger set for the unified watermark injection and assign target label 0, which belongs to the original label set of the training data. We use this trigger set to fine-tune the entire global model for 10 steps before injecting our proposed DUW. Note that no decoder is used for the unified watermark, and the unified watermark can also be replaced with other existing schemes. The results on Digits are shown in Table 6.7. From this table, we observe that the unified watermark is injected successfully in the presence of our DUW, with a 98.82% WSR. Moreover, the effectiveness of our DUW is not affected, since the WSR of DUW only decreases by 0.72% and TAcc remains 100%. The model utility is also not affected, since the standard accuracy remains high.

Method                 Acc     ∆Acc    WSR     WSR_Gap  TAcc    Unified WSR
w/o unified watermark  0.8855  0.0234  0.9909  0.9895   1.0000  /
w/ unified watermark   0.8886  0.0203  0.9837  0.9701   1.0000  0.9882
Table 6.7: Results for hybrid watermark.

6.5 Discussions
Client-side watermarking vs. server-side watermarking. Client-side watermarking methods such as FedCIP [104], FedIPR [86], and Merkle-Sign [88] are used to claim co-ownership of the model, yet we argue that client-side watermarking has limitations that make it unsuitable for IP tracking. If one of the clients is the infringer who illegally distributes the model, the infringer will not reveal its own identity during the model verification process in order to avoid legal responsibility. Even if the ownership of the model can be claimed by a co-author, the real infringer cannot be tracked, since they remain anonymous. With our server-side watermark there is no such concern: the server can easily track the malicious client among all the clients.
Complexity. Clients experience no additional computation, as our DUW is carried out on the server side. The additional computation for the server is determined by the number of watermark injection steps Tw. We found that the WSR can reach 99% within only Tw = 10 steps, and injecting one client-unique watermark takes around 1 second. The server can embed the watermarks for all clients in parallel: since the watermarked models of different clients are independent and have no sequential relationship with one another, there is no need to serialize the injections. Thus, the delay caused by the server is negligible.
Future work. This work moves FL model leakage from anonymity to accountability by injecting client-unique watermarks. We recognize that the most significant challenge for accountable FL is addressing watermark collision for accurate IP tracking (R1). We believe it is important to scale our method from the cross-silo setting to the cross-device setting with more clients in the future. One plausible solution is to increase the input dimension of the encoder to allow more one-hot target labels.
Another solution is to use a hash function to produce the target label for each client; in this way, a lower-dimensional encoder and decoder can accommodate more clients. However, adopting hash functions as the target labels can increase the chance of watermark collisions between clients, and more elegant strategies would have to be developed to address this problem. As we focus on collision in this work, we leave scalability for future work.
6.5.1 Summary
In this chapter, we target accountable FL and propose Decodable Unique Watermarking (DUW), which can verify the FL model's ownership and track the IP infringer in the FL system at the same time. Specifically, the server embeds a client-unique key into each client's local model before broadcasting, and the IP infringer can be tracked according to the key decoded from the suspect model. Extensive experimental results show the effectiveness of our method in accurate IP tracking, confident verification, model utility preservation, and robustness against various watermark removal attacks.
CHAPTER 7
CONCLUSION
7.1 Overview
In this section, we summarize our contributions to the robustness and trustworthiness of machine learning models in diverse domains.
Robust UDA from a corrupted source. We propose a simple and computationally efficient method in which the individual models of an ensemble can be trained in parallel to accelerate learning. The proposed solution to UDA is generally robust against agnostic types of data corruption. In particular, our approach successfully tackles notorious backdoor attacks, where both the training samples and the corresponding labels can be maliciously modified by attackers. The learning framework we propose can be flexibly combined with available UDA approaches that are orthogonal to our work to improve their robustness under corrupted data.
Enhancing in-context learning for long-tail knowledge in LLMs. To improve the uncertain predictions of LLMs on long-tail knowledge, we propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval-augmented ICL with a budget controller. Specifically, it considers the dynamic impact of each retrieved sample based on the LLM's feedback. Our ranking system efficiently raises the ranks of more informative and stable samples and lowers the ranks of misleading samples. Evaluations on various QA datasets from different domains show that our proposed method outperforms all the baselines and especially improves the LLM's predictions on long-tail questions.
OoD detection in FL. We propose a novel federated OoD synthesizer that takes advantage of data heterogeneity to facilitate OoD detection in FL, allowing a client to learn external class knowledge from other non-iid federated collaborators in a privacy-aware manner. Our work bridges a critical research gap, since OoD detection for FL is not yet well studied in the literature. To our knowledge, the proposed Foster is the first OoD learning method for FL that does not require real OoD samples. Foster achieves state-of-the-art performance using only the limited ID data stored on each local device, compared with existing approaches that demand a large volume of OoD samples. The design of Foster considers both the diversity and the hardness of virtual OoD samples, making them closely resemble real OoD samples from other non-iid collaborators.
As a general OoD detection framework for FL, the proposed Foster remains effective in more challenging FL settings where the entire parameter-sharing process is prohibited due to privacy or communication concerns. This is because Foster only uses the classifier head for extracting external data knowledge.
Safe and robust watermark injection with a single OoD image. We propose a novel watermarking method based on OoD data, which fills the gap of backdoor-based IP protection of deep models without training data. Removing the need for access to the training data makes the proposed approach feasible in many real-world scenarios. The proposed watermarking method is both sample-efficient (one OoD image) and time-efficient (a few epochs) without sacrificing model utility. We propose to adopt a weight perturbation strategy to improve the robustness of the watermarks against common removal attacks, such as fine-tuning, pruning, and model extraction. We show the robustness of the watermarks through extensive empirical results, and they persist even in an unfair scenario where the removal attack uses a part of the in-distribution data.
Tracking IP infringers in FL. In this work, we move FL model leakage from anonymity to accountability by injecting DUW. DUW enables ownership verification and leakage tracing at the same time. With utility preserved, both the ownership verification and the IP tracking of our DUW are not only accurate but also confident, without collisions. Our DUW is robust against existing watermark removal attacks, including fine-tuning, pruning, model extraction, and parameter perturbation.
7.2 Future Work
One limitation of the thesis is that all methods target a single-modality setting, either vision or language. An important future direction is to extend these approaches to multi-modal settings. Real-world applications often involve diverse data sources, such as combinations of images, text, video, audio, electronic health records (EHR), etc. Exploring how to enhance both adaptiveness and trustworthiness in these more complex scenarios is a natural extension of this work. For instance, future research could investigate multi-modal watermarking techniques, or leverage Retrieval-Augmented Generation (RAG) and improve reasoning capabilities within multi-modal large language models (MLLMs) to enhance their adaptiveness to various downstream tasks. Advancing these directions will be crucial for building robust and reliable systems that can generalize across tasks and data types while maintaining interpretability and resilience to distribution shifts.
BIBLIOGRAPHY
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pages 1615–1631, 2018.
[3] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
[4] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint
arXiv:1912.00818, 2019. [5] [6] Yuki M Asano, Christian Rupprecht, and Andrea Vedaldi. A critical analysis of self-supervision, or what we can learn from a single image. arXiv preprint arXiv:1904.13132, 2019. Yuki M. Asano and Aaqib Saeed. Extrapolating from a single image to a thousand classes using distillation. In ICLR, 2023. [7] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017. [8] [9] Arthur Asuncion and David Newman. Uci machine learning repository, 2007. Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In International conference on arti- ficial intelligence and statistics, pages 2938–2948. PMLR, 2020. [10] Tadas Baltrušaitis, Marwa Mahmoud, and Peter Robinson. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 6, pages 1–6. IEEE, 2015. [11] Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. arXiv preprint arXiv:2010.12421, 2020. [12] Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019. 105 [13] Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [14] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečn`y, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018. [15] Jialuo Chen, Jingyi Wang, Tinglan Peng, Youcheng Sun, Peng Cheng, Shouling Ji, Xingjun Ma, Bo Li, and Dawn Song. Copy, right? a testing framework for copyright protection of deep learning models. In 2022 IEEE Symposium on Security and Privacy (SP), pages 824–841. IEEE, 2022. [16] Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. How many demonstrations do you need for in-context learning? arXiv preprint arXiv:2303.08119, 2023. [17] Lin Chen, Lixin Duan, and Dong Xu. Event recognition in videos by learning from heterogeneous web sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2666–2673, 2013. [18] Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012. [19] Xuxi Chen, Tianlong Chen, Zhenyu Zhang, and Zhangyang Wang. You are caught stealing my winning lottery ticket! making a lottery ticket claim its ownership. Ad- vances in Neural Information Processing Systems, 34:1780–1791, 2021. [20] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017. [21] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. [22] Nicolas Antonio Cloutier and Nathalie Japkowicz. Fine-tuned generative llm oversam- pling can improve performance over traditional techniques on multiclass imbalanced text classification. 
In 2023 IEEE International Conference on Big Data (BigData), pages 5181–5186. IEEE, 2023. [23] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011. [24] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017. [25] Yi Dai, Hao Lang, Yinhe Zheng, Fei Huang, and Yongbin Li. Long-tailed question answering in an open world. arXiv preprint arXiv:2305.06557, 2023. 106 [26] Bita Darvish Rouhani, Huili Chen, and Farinaz Koushanfar. Deepsigns: An end-to- end watermarking framework for ownership protection of deep neural networks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 485–497, 2019. [27] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language un- derstanding. arXiv preprint arXiv:1810.04805, 2018. [28] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub-gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016. [29] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019. [30] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. arXiv preprint arXiv:2010.01264, 2020. [31] Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li. Vos: Learning what you don’t know by virtual outlier synthesis. arXiv preprint arXiv:2202.01197, 2022. [32] Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018. [33] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019. [34] Lixin Fan, Kam Woh Ng, and Chee Seng Chan. Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. Advances in neural information processing systems, 32, 2019. [35] Gongfan Fang, Jie Song, Chengchao Shen, Xinchao Wang, Da Chen, and Mingli Song. Data-free adversarial distillation. arXiv preprint arXiv:1912.11006, 2019. [36] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and conse- quences. Minds and Machines, 30:681–694, 2020. [37] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208, 2017. [38] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backprop- agation. In International conference on machine learning, pages 1180–1189. PMLR, 2015. 107 [39] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial train- ing of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016. 
[40] Siddhant Garg, Adarsh Kumar, Vibhor Goel, and Yingyu Liang. Can adversarial weight perturbations inject neural backdoors. In Proceedings of the 29th ACM In- ternational Conference on Information & Knowledge Management, pages 2029–2032, 2020. [41] Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Mądry, Bo Li, and Tom Goldstein. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1563–1580, 2022. [42] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Ad- vances in neural information processing systems, 27, 2014. [43] Matej Grcić, Petra Bevandić, and Siniša Šegvić. Dense open-set recognition with synthetic outliers generated by real nvp. arXiv preprint arXiv:2011.11094, 2020. [44] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Eval- uating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019. [45] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Eval- uating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019. [46] Zhongyi Han, Xian-Jin Gui, Chaoran Cui, and Yilong Yin. Towards accurate and robust domain adaptation under noisy environments. arXiv preprint arXiv:2004.12529, 2020. [47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learningfor image recognition. CoRR, abs/1512, 3385:2, 2015. [48] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [49] Pengfei He, Han Xu, Jie Ren, Yingqian Cui, Hui Liu, Charu C Aggarwal, and Jiliang Tang. Sharpness-aware data poisoning attack. arXiv preprint arXiv:2305.14851, 2023. [50] Markus Heimberger, Jonathan Horgan, Ciarán Hughes, John McDonald, and Senthil Yogamani. Computer vision in automated parking systems: Design, implementation and challenges. Image and Vision Computing, 68:88–101, 2017. 108 [51] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41–50, 2019. [52] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016. [53] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018. [54] Alex Hern. Techscape: Will meta’s massive leak democratise ai – and at what cost? The Guardian, 2023. [55] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pages 409–426. Springer, 1994. [56] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adap- In International conference on machine learning, pages 1989–1998. PMLR, tation. 2018. [57] Junyuan Hong, Haotao Wang, Zhangyang Wang, and Jiayu Zhou. Efficient split- arXiv preprint mix federated learning for on-demand and in-situ customization. arXiv:2203.09747, 2022. 
[58] Junyuan Hong, Yi Zeng, Shuyang Yu, Lingjuan Lyu, Ruoxi Jia, and Jiayu Zhou. Revisiting data-free knowledge distillation with poisoned teachers. The Fortieth Inter- national Conference on Machine Learning, 2023. [59] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of arXiv preprint non-identical data distribution for federated visual classification. arXiv:1909.06335, 2019. [60] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Trans- actions on pattern analysis and machine intelligence, 16(5):550–554, 1994. [61] Hengrui Jia, Christopher A Choquette-Choo, Varun Chandrasekaran, and Nicolas Pa- pernot. Entangled watermarks as a defense against model extraction. In Proceedings of the 30th USENIX Security Symposium, 2021. [62] Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021. [63] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019. 109 [64] Sanghun Jung, Jungsoo Lee, Daehoon Gwak, Sungha Choi, and Jaegul Choo. Stan- dardized max logits: A simple yet effective approach for identifying unexpected road obstacles in urban-scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15425–15434, 2021. [65] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N Asokan. Prada: protecting against dnn model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), pages 512–527. IEEE, 2019. [66] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021. [67] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pages 15696–15707. PMLR, 2023. [68] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132– 5143. PMLR, 2020. [69] Byungjoo Kim, Suyoung Lee, Seanie Lee, Sooel Son, and Sung Ju Hwang. Margin- based neural network watermarking. 2023. [70] Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, and Jinwoo Shin. Sure: Summarizing retrievals using answer candidates for open-domain qa of llms. arXiv preprint arXiv:2404.13081, 2024. [71] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Interna- tional conference on machine learning, pages 1857–1865. PMLR, 2017. [72] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [73] Jakub Konečn`y, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015. [74] Jakub Konečn`y, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 
Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. [75] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 110 [76] M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. Advances in neural information processing systems, 23, 2010. [77] Minoru Kuribayashi, Takuro Tanaka, Shunta Suzuki, Tatsuya Yasui, and Nobuo Fun- abiki. White-box watermarking scheme for fully-connected layers in fine-tuning model. In Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Se- curity, pages 165–170, 2021. [78] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. [79] Erwan Le Merrer, Patrick Perez, and Gilles Trédan. Adversarial frontier stitching for remote neural network watermarking. Neural Computing and Applications, 32:9233– 9244, 2020. [80] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906–931, 2020. [81] Guillaume Lecué, Matthieu Lerasle, and Timlothée Mathieu. Robust classification via mom minimization. Machine Learning, 109(8):1635–1665, 2020. [82] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learn- ing applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [83] Yann LeCun, Corinna Cortes, and Chris Burges. Mnist handwritten digit database, 2010. [84] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018. [85] Alexander Levine and Soheil Feizi. Deep partition aggregation: Provable defense against general poisoning attacks. arXiv preprint arXiv:2006.14768, 2020. [86] Bowen Li, Lixin Fan, Hanlin Gu, Jie Li, and Qiang Yang. Fedipr: Ownership verifica- tion for federated deep neural network models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4521–4536, 2022. [87] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. Advances in neural information processing systems, 30, 2017. [88] Fang-Qi Li, Shi-Lin Wang, and Alan Wee-Chung Liew. Towards practical watermark for deep neural networks in federated learning. arXiv preprint arXiv:2105.03167, 2021. [89] Fangqi Li and Shilin Wang. Knowledge-free black-box watermark and ownership proof for image classification neural networks. arXiv preprint arXiv:2204.04522, 2022. 111 [90] Fangqi Li, Lei Yang, Shilin Wang, and Alan Wee-Chung Liew. Leveraging multi-task learning for umambiguous and flexible deep neural network watermarking. In SafeAI@ AAAI, 2022. [91] Huihan Li, Yuting Ning, Zeyi Liao, Siyuan Wang, Xiang Lorraine Li, Ximing Lu, Faeze Brahman, Wenting Zhao, Yejin Choi, and Xiang Ren. In search of the long- tail: Systematic generation of long-tail knowledge via logical rule guided search. arXiv preprint arXiv:2311.07237, 2023. [92] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! 
arXiv preprint arXiv:2305.06161, 2023. [93] Shaofeng Li, Minhui Xue, Benjamin Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE TDSC, 2020. [94] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020. [95] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020. [96] Xiang Li, Haoran Tang, Siyu Chen, Ziwei Wang, Ryan Chen, and Marcin Abram. Why does in-context learning fail sometimes? evaluating in-context learning on open and closed questions. arXiv preprint arXiv:2407.02028, 2024. [97] Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xi- aoling Wang, and Xipeng Qiu. Unified demonstration retriever for in-context learning. arXiv preprint arXiv:2305.04320, 2023. [98] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Fed- erated learning on non-iid features via local batch normalization. arXiv preprint arXiv:2102.07623, 2021. [99] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. arXiv:2007.08745, 2020. [100] Yue Li, Hongxia Wang, and Mauro Barni. A survey of deep neural network water- marking techniques. Neurocomputing, 461:171–193, 2021. [101] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In ICCV, 2021. [102] Jian Liang, Ran He, Zhenan Sun, and Tieniu Tan. Exploring uncertainty in pseudo- label guided unsupervised domain adaptation. Pattern Recognition, 96:106996, 2019. 112 [103] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020. [104] Junchuan Liang and Rong Wang. Fedcip: Federated client intellectual property pro- tection with traitor tracking. arXiv preprint arXiv:2306.01356, 2023. [105] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out- of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. [106] Ziqian Lin, Sreya Dutta Roy, and Yixuan Li. Mood: Multi-level out-of-distribution In Proceedings of the IEEE/CVF Conference on Computer Vision and detection. Pattern Recognition, pages 15313–15323, 2021. [107] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and arXiv preprint Weizhu Chen. What makes good in-context examples for gpt-3? arXiv:2101.06804, 2021. [108] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of- distribution detection. Advances in Neural Information Processing Systems, 33:21464– 21475, 2020. [109] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS, 2018. [110] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018. [111] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2537–2546, 2019. 
[112] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transfer- able features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015. [113] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. Advances in neural information processing systems, 31, 2018. [114] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpuro- hit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610, 2022. [115] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fan- tastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021. 113 [116] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tourna- ments. Journal of the European Mathematical Society, 22(3):925–965, 2019. [117] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. [118] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Han- naneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022. [119] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1429– 1437, 2019. [120] Othmane Marfoq, Chuan Xu, Giovanni Neglia, and Richard Vidal. Throughput- optimal topology design for cross-silo federated learning. Advances in Neural Informa- tion Processing Systems, 33:19478–19487, 2020. [121] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017. [122] Dhwani Mehta, Nurun Mondol, Farimah Farahmandi, and Mark Tehranipoor. Aime: watermarking ai models by leveraging errors. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 304–309. IEEE, 2022. [123] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Ha- jishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022. [124] Thomas B Moeslund and Erik Granum. A survey of computer vision-based human motion capture. Computer vision and image understanding, 81(3):231–268, 2001. [125] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-supervised In Proceedings of the AAAI learning for generalizable out-of-distribution detection. Conference on Artificial Intelligence, volume 34, pages 5216–5223, 2020. [126] Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: a multi-label hate speech detection dataset. Complex & Intelligent Systems, 8(6):4663– 4678, 2022. [127] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983. [128] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 
[129] Kenney Ng, Jimeng Sun, Jianying Hu, and Fei Wang. Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summits on Translational Science Proceedings, 2015:132, 2015.
[130] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Knockoff nets: Stealing functionality of black-box models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4954–4963, 2019.
[131] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
[132] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1406–1415, 2019.
[133] Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. A robust univariate mean estimator is all you need. In International Conference on Artificial Intelligence and Statistics, pages 4034–4044. PMLR, 2020.
[134] Zhaonan Qu, Kaixiang Lin, Jayant Kalagnanam, Zhaojian Li, Jiayu Zhou, and Zhengyuan Zhou. Federated learning’s blessing: Fedavg has linear speedup. arXiv preprint arXiv:2007.05690, 2020.
[135] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. Tbt: Targeted neural network attack with bit trojan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13198–13207, 2020.
[136] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
[137] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
[138] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
[139] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633, 2021.
[140] Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Arafat Sultan, and Christopher Potts. Udapdr: Unsupervised domain adaptation via llm prompting and distillation of rerankers. arXiv preprint arXiv:2303.00807, 2023.
[141] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8050–8058, 2019.
[142] Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. arXiv preprint arXiv:2103.12051, 2021.
[143] Shuo Shao, Wenyuan Yang, Hanlin Gu, Jian Lou, Zhan Qin, Lixin Fan, Qiang Yang, and Kui Ren. Fedtracker: Furnishing ownership verification and traceability for federated learning model. arXiv preprint arXiv:2211.07160, 2022.
[144] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems, 30, 2017.
[145] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
[146] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[147] Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, and Xin Luna Dong. Head-to-tail: How knowledgeable are large language models (llm)? AKA will llms replace knowledge graphs, 2023.
[148] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is chatgpt good at search? Investigating large language models as re-ranking agents. arXiv preprint arXiv:2304.09542, 2023.
[149] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. arXiv preprint arXiv:2204.06507, 2022.
[150] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning. Journal of Cognitive Neuroscience, 11(1):126–134, 1999.
[151] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
[152] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. Csi: Novelty detection via contrastive learning on distributionally shifted instances. Advances in neural information processing systems, 33:11839–11852, 2020.
[153] Ruixiang Tang, Mengnan Du, Ninghao Liu, Fan Yang, and Xia Hu. An embarrassingly simple approach for trojan attack in deep neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 218–228, 2020.
[154] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[155] Buse GA Tekgul, Yuxi Xia, Samuel Marchal, and N Asokan. Waffle: Watermarking in federated learning. In 2021 40th International Symposium on Reliable Distributed Systems (SRDS), pages 310–320. IEEE, 2021.
[156] Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in gans. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.
[157] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[158] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction apis. In USENIX Security Symposium, volume 16, pages 601–618, 2016.
[159] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. 2018.
[160] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
[161] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[162] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin’ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 269–277, 2017.
[163] Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine, 30(4):1134–1142, 2024.
[164] James Vincent. Meta’s powerful ai language model has leaked online — what happens now? https://www.theverge.com/2023/3/8/23629362/meta-ai-language-model-llama-leak-online-misuse, 2023. Accessed: 2023-03-08.
[165] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2517–2526, 2019.
[166] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE S&P, pages 707–723. IEEE, 2019.
[167] Fali Wang, Runxue Bao, Suhang Wang, Wenchao Yu, Yanchi Liu, Wei Cheng, and Haifeng Chen. Infuserki: Enhancing large language models with knowledge graphs via infuser-guided knowledge integration. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3675–3688, 2024.
[168] Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4921–4930, 2022.
[169] Haotao Wang, Junyuan Hong, Aston Zhang, Jiayu Zhou, and Zhangyang Wang. Trap and replace: Defending backdoor attacks by trapping them into an easy-to-replace subnetwork. Advances in Neural Information Processing Systems, 35:36026–36039, 2022.
[170] Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
[171] Liang Wang, Nan Yang, and Furu Wei. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164, 2023.
[172] Qin Wang, Gabriel Michau, and Olga Fink. Missing-class-robust domain adaptation by unilateral alignment. IEEE Transactions on Industrial Electronics, 68(1):663–671, 2020.
[173] Run Wang, Jixing Ren, Boheng Li, Tianyi She, Chehao Lin, Liming Fang, Jing Chen, Chao Shen, and Lina Wang. Free fine-tuning: A plug-and-play watermarking scheme for deep neural networks. arXiv preprint arXiv:2210.07809, 2022.
[174] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[175] Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, and Yanfu Zhang. Unlocking memorization in large language models with dynamic soft prompting. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9782–9796, 2024.
[176] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[177] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[178] Jim Winkens, Rudy Bunel, Abhijit Guha Roy, Robert Stanforth, Vivek Natarajan, Joseph R Ledsam, Patricia MacWilliams, Pushmeet Kohli, Alan Karthikesalingam, Simon Kohl, et al. Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566, 2020.
[179] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969, 2020.
[180] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood regret: An out-of-distribution detection score for variational auto-encoder. Advances in Neural Information Processing Systems, 33:20685–20696, 2020.
[181] Pingmei Xu, Krista A Ehinger, Yinda Zhang, Adam Finkelstein, Sanjeev R Kulkarni, and Jianxiong Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.
[182] Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, et al. Openood: Benchmarking generalized out-of-distribution detection. Advances in Neural Information Processing Systems, 35:32598–32611, 2022.
[183] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[184] Qing Yu, Atsushi Hashimoto, and Yoshitaka Ushiku. Divergence optimization for noisy universal domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2515–2524, 2021.
[185] Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758, 2020.
[186] Xiyu Yu, Tongliang Liu, Mingming Gong, Kun Zhang, Kayhan Batmanghelich, and Dacheng Tao. Label-noise robust domain adaptation. In International Conference on Machine Learning, pages 10913–10924. PMLR, 2020.
[187] Xiaoyong Yuan, Leah Ding, Lan Zhang, Xiaolin Li, and Dapeng Oliver Wu. Es attack: Model stealing against deep neural networks without data hurdles. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(5):1258–1270, 2022.
[188] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[189] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811, 2017.
[190] Yi Zeng, Won Park, Z Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks’ triggers: A frequency perspective. In ICCV, 2021.
[191] Haoyu Zhang, Jianjun Xu, and Ji Wang. Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243, 2019.
[192] Jialong Zhang, Zhongshu Gu, Jiyong Jang, Hui Wu, Marc Ph Stoecklin, Heqing Huang, and Ian Molloy. Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 159–172, 2018.
[193] Jingyang Zhang, Nathan Inkawhich, Yiran Chen, and Hai Li. Fine-grained out-of-distribution detection with mixup outlier exposure. arXiv preprint arXiv:2106.03917, 2021.
[194] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR, 2021.
[195] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
[196] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 814–822, 2011.
[197] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[198] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning, pages 12878–12889. PMLR, 2021.

APPENDIX A: DYNAMIC UNCERTAINTY RANKING

Dataset          Type          Domain            Training   Test   Prompt format
Pubmedqa         Multi-choice  Healthcare        1000       500    SQO-A
ethos-national   Multi-choice  Speech detection  476        298    QO-A
eval-climate     Multi-choice  climate change    288        180    QO-A
T-REx            Open-ended    Wikipedia         20128      5032   Q-A
NatQA            Open-ended    Wikipedia         11476      2869   Q-A

Table A.1: The statistics of the datasets used in this work.

Notation   Retrieval sample format
Q-A        Question: Answer: The answer is
QO-A       Question: Options: (A)