ENHANCING THE ROBUSTNESS AND TRUSTWORTHINESS OF MACHINE LEARNING MODELS IN DIVERSE DOMAINS
By Shuyang Yu
A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy
2025
ABSTRACT
The rapid advancement of machine learning, particularly over-parameterized deep neural networks (DNNs), has led to significant progress across diverse domains. While the over-parameterization of DNNs gives them the power to capture complex mappings between input data points and target labels, in real-world settings they are inevitably exposed to unseen out-of-distribution (OoD) examples that deviate from the training distribution. This raises critical concerns about the robustness, adaptiveness, and trustworthiness of such models when transferring knowledge from the training domains to unseen test domains. In this thesis, we propose three methods targeting the robustness and adaptiveness of machine learning models. First, to address agnostic data corruption in the source domain, we propose a simple and computationally efficient unsupervised domain adaptation (UDA) approach that enables parallel training of ensemble models. The proposed learning framework can be flexibly combined with existing UDA approaches that are orthogonal to our work to improve their robustness under corrupted data. Second, the rise of large language models (LLMs) pre-trained on vast, web-sourced datasets spanning multiple domains has led to a surge of interest in adapting these models to a wide range of downstream tasks. However, the real-world corpora used in the pre-training stage often exhibit a long-tail distribution, where knowledge from less frequent domains is underrepresented. As a result, LLMs often fail to give correct answers to queries sampled from the long tail of the distribution. To solve this problem, we propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval-augmented in-context learning (ICL) with a budget controller. The system adjusts the ranking of retrieved samples based on LLM feedback, promoting informative and stable examples while demoting misleading ones. Third, while domain adaptation aims to ensure model robustness by maintaining high performance on OoD samples from target domains with domain shifts, OoD detection focuses on model reliability by identifying samples that exhibit semantic shifts. To bridge a critical research gap between OoD detection and federated learning (FL), we propose a privacy-preserving federated OoD synthesizer that exploits data heterogeneity to enhance OoD detection across clients. This approach enables each client to benefit from external class knowledge shared among non-IID participants without compromising data privacy. The model adaptation process can also introduce a new challenge: the risk of unauthorized reproduction or intellectual property (IP) theft, especially for high-value models. To enhance the trustworthiness of models, we introduce two methods for model watermarking. The first is an OoD-based watermarking technique that eliminates the need for training data access, making it suitable for scenarios with strict data confidentiality. The method is both sample-efficient and time-efficient while preserving model utility. The second technique targets federated learning, enabling both ownership verification and leakage tracing, transitioning FL model use from anonymity to accountability.
Copyright by SHUYANG YU 2025 This thesis is dedicated to my parents, Li Liu and Yuecheng Yu, as well as my boyfriend Yu Mei. v ACKNOWLEDGEMENTS This dissertation represents both the end of an incredible chapter and the beginning of an exciting new path. Reaching this point would not have been possible without the unwavering support and encouragement of my advisor, colleagues, friends, and loved ones. First and foremost, I would like to express my profound gratitude to my advisor, Dr. Jiayu Zhou, whose insightful guidance, unwavering support, and encouragement—both aca- demically and personally—have been invaluable throughout my Ph.D. journey. I would also like to extend my appreciation to my committee members, Dr. Qiben Yan, Dr. Pang-Ning Tan, and Dr. Sijia Liu for generously sharing their time, expertise, and constructive feedback, which greatly enriched the quality and direction of this dissertation. I feel incredibly fortunate to have worked alongside such supportive and motivating col- leagues during the past five years. I would like to thank all my collaborators in the Intelligent Data Analytics (ILLIDAN) Lab: Dr. Junyuan Hong, Dr. Zhuangdi Zhu, Dr. Boyang Liu, Dr. Mengying Sun, Dr. Kaixaing Lin, Dr. Sumyeong Ahn, Yijiang Pang, Haobo Zhang, Siqi Liang, Jiankun Wang, Lingxiao Li, and Haohao Zhu. I would also like to extend my deepest appreciation to my other collaborators: Dr. Haotao Wang, Dr. Zhangyang Wang, Yi Zeng, Bairu Hou, Jiabao Ji, Dr. Shiyu Chang, Dr. Runxue Bao, Dr. Cao Xiao, Dr. Lingjuan Lyu, Dr. Ruoxi Jia, Dr. Anil K. Jain, Dr. Hiroko H. Dodge, and Dr. Fei Wang. Finally, I would like to thank my parents, Li Liu and Yuecheng Yu, as well as my boyfriend Yu Mei, for their support and unconditional love. vi TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Overview of Thesis Structure 1.3 Background and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Out-of-distribution (OoD) . . . . . . . . . . . . . . . . . . . . . . . . In-context Learning (ICL) for large language models (LLMs) . . . . . 1.3.2 1.3.3 Backdoor-based Watermarking for Model Protection . . . . . . . . . 1.3.4 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 ROBUST UNSUPERVISED DOMAIN ADAPTATION FROM A CORRUPTED SOURCE . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Preliminaries of Median of Means . . . . . . . . . . . . . . . . . . . . 2.4.2 Robust UDA via Ensemble Learning . . . . . . . . . . . . . . . . . . 2.4.3 Rationale for Ensemble Voting . . . . . . . . . . . . . . . . . . . . . . 2.4.4 Hypothesis Adaptation by Information Maximization . . . . . . . . . 2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Results and Discussions 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
CHAPTER 3 DYNAMIC UNCERTAINTY RANKING: ENHANCING RETRIEVAL-AUGMENTED IN-CONTEXT LEARNING FOR LONG-TAIL KNOWLEDGE IN LLMS
3.1 Introduction
3.2 Related Work
3.3 Problem Formulation
3.4 Motivation: Uncertainty of In-context Learning
3.5 In-context Learning with Dynamic Uncertainty Ranking
3.5.1 Retrieved Sample Selection
3.5.2 Retriever Training
3.6 Experiments
3.6.1 Experimental Setup
3.6.2 Main Results
3.6.3 Ablation Studies
3.6.4 Case Study
3.6.5 Efficiency Analysis
3.6.6 Transferability Analysis
3.7 Limitations
3.8 Summary
CHAPTER 4 TURNING THE CURSE OF HETEROGENEITY IN FEDERATED LEARNING INTO A BLESSING FOR OUT-OF-DISTRIBUTION DETECTION
4.1 Introduction
4.2 Related Work
4.3 Problem Formulation
4.4 Method
4.4.1 Natural OoD Data in Non-iid FL
4.4.2 Synthesizing External-Class Data from Global Classifier
4.4.3 Filtering Virtual External-Class Samples
4.5 Experiments
4.5.1 Visualization of generated external class samples
4.5.2 Benchmark Results
4.5.3 Qualitative Studies
4.6 Summary
CHAPTER 5 SAFE AND ROBUST WATERMARK INJECTION WITH A SINGLE OOD IMAGE
5.1 Introduction
5.2 Background
5.2.1 DNN Watermarking
5.2.2 Watermark Removal Attack
5.3 Method
5.3.1 Constructing Safe Surrogate Dataset
5.3.2 Robust Watermark Injection
5.4 Experiments
5.4.1 Watermark Injection
5.4.2 Defending Against Fine-tuning & Pruning
5.4.3 Defending Against Model Extraction
5.4.4 Qualitative Studies
5.4.5 Summary
CHAPTER 6 WHO LEAKED THE MODEL? TRACKING IP INFRINGERS IN ACCOUNTABLE FEDERATED LEARNING
6.1 Introduction
6.2 Related Work and Background
6.3 Method
6.3.1 Pitfalls for Watermark Collision
6.3.2 Decodable Unique Watermarking
6.3.3 Injection Optimization with Preserved Utility
6.3.4 Verification
6.4 Experiments
6.4.1 IP Tracking Benchmark
6.4.2 Comparison with traditional backdoor-based watermarks
6.4.3 Robustness
6.4.4 Qualitative Study
6.5 Discussions
6.5.1 Summary
CHAPTER 7 CONCLUSION
7.1 Overview
7.2 Future Work
BIBLIOGRAPHY
APPENDIX A: DYNAMIC UNCERTAINTY RANKING
APPENDIX B: FOSTER
APPENDIX C: SINGLE OOD WATERMARK
APPENDIX D: DUW

CHAPTER 1 INTRODUCTION
1.1 Motivation
The rapid advancement of machine learning models has led to significant improvements in various applications across diverse domains. The over-parameterized design of deep neural networks (DNNs) gives them ultra-high flexibility and the power to capture complex mappings between input data points and target labels. However, in real-world scenarios, DNNs are inevitably exposed to unseen examples that deviate from the training distribution, which are known as out-of-distribution (OoD) samples. In such cases, many challenges arise concerning the robustness, security, and adaptability of models. Domain adaptation (DA) has emerged as an effective solution, which transfers knowledge learned from a related but different domain (i.e. the source domain) to assist the learning of the target domain. In particular, a challenging and practical problem along this line is unsupervised domain adaptation (UDA) [141], in which the target domain has access to only a few unlabeled training samples.
While UDA has been extensively studied for typical machine learning settings, most existing UDA methods are built upon an implicit assumption that the source domain data is clean. Under this assumption, UDA methods are prone to performance degradation when the source domain samples are corrupted, either unintentionally during data collection or deliberately by vicious attackers. Consequently, models learned on the corrupted source data can be easily under attack even on the source domain, not to mention confronting the challenges of domain distribution shift when adapting to the target domain. Such model performance degradation can be exacerbated under attacks. Thus, improving the robustness of UDA when confronting corruption becomes the first challenge to solve in this thesis.

Recently, the rise of large language models (LLMs) (the GPT series [1], LLaMA series [157], Gemini series [154], etc.), pre-trained on vast, web-sourced datasets spanning multiple domains, has led to a surge of interest in adapting these models to a wide range of downstream tasks [191, 163, 62, 167, 92, 175]. However, directly applying these LLMs to specific downstream tasks can be challenging without task-specific adaptations, due to the computational challenges of fine-tuning their vast number of trainable parameters. To address this, we focus on in-context learning (ICL) [13], which prompts the LLMs with a set of examples relevant to the test query without parameter updating. Nevertheless, one critical limitation arises from the nature of the pre-training data: real-world corpora often exhibit a long-tail distribution [111, 118, 25, 147], where knowledge from less frequent domains is underrepresented. As a result, LLMs frequently fail to memorize or generalize to these infrequent patterns [67], leading to degraded performance on long-tail queries. Addressing this limitation by enabling LLMs to effectively capture and utilize long-tail knowledge for downstream tasks forms the second challenge of this thesis.

While DA aims to ensure model robustness by maintaining high performance on OoD samples from target domains with domain shifts, OoD detection focuses on model reliability by identifying samples that exhibit semantic shifts [182]. DNNs tend to make overconfident predictions about what they do not know, and may assign such samples to one of the training classes with high confidence, which is doomed to be wrong [52, 53, 51]. While recent advances show promising OoD detection performance for centralized training, they cannot be easily incorporated into federated learning, where multiple local clients cooperatively train a high-quality centralized model without sharing their raw data [74], even though many security-sensitive OoD detection tasks, such as autonomous driving and voice-recognition authorization, are commonly trained using FL due to data privacy concerns. Hence, OoD detection in FL is another open question to solve in this thesis.

To adapt a model to a new domain, some model users might also fine-tune a pre-trained model on their downstream tasks. This introduces a new challenge: the risk of illegal reproduction or duplication of such high-value models, which are trained with massive amounts of data from different sources, powerful computational resources, and substantial human effort. Therefore, it is essential to protect the intellectual property of the model and the rights of the model owners.
This thesis places emphasis on backdoor-based watermarking, which taints the training dataset by incorporating trigger patches into a set of images referred to as verification samples (the trigger set) and modifying their labels to a designated class, forcing the model to memorize the trigger pattern during fine-tuning. The owner of the model can then perform an intellectual property (IP) inspection by assessing the correspondence between the model's outputs on the verification samples carrying the trigger and the intended target labels. Typically, injection of backdoors requires full or partial access to the original training data. When protecting models, such access can be prohibitive, mostly due to data safety and confidentiality. Examples include someone trying to protect a model fine-tuned upon a foundation model, or a model publisher vending models uploaded by their users. Another example is an independent IP protection department or a third party that is in charge of model protection for redistribution. Yet another scenario is federated learning [74], where the server does not have access to any in-distribution (ID) data but is motivated to inject a watermark to protect the ownership of the global model. Despite the high practical demand, watermark injection without training data is barely explored.

Realizing the potential challenges of prior art, in this thesis, we aim to enhance the robustness, adaptability, and trustworthiness of models across diverse domains. Specifically, for robustness and adaptability, we investigate three problems: 1) the challenges of UDA from corrupted source domains, 2) how to handle the long-tail knowledge of pre-trained LLMs for downstream tasks, and 3) how to learn OoD awareness from non-iid federated collaborators while maintaining the data confidentiality requirements of FL. For trustworthiness, we propose 1) watermarking for model protection that exploits OoD data rather than the original training data, in both centralized and FL settings, and 2) tracking the IP infringer's identity in the FL system.

1.2 Overview of Thesis Structure
This section summarizes each of the chapters in this thesis. Section 1.3 introduces the preliminaries of this thesis, including the basic concepts in OoD, in-context learning for LLMs, backdoor-based watermarking, and federated learning. Chapter 2, Chapter 3, and Chapter 4 focus on enhancing the robustness and adaptiveness of DNNs. Specifically, Chapter 2 elaborates an effective framework to address the challenges of UDA from corrupted source domains in a principled manner. Chapter 3 proposes a reinforcement learning-based dynamic uncertainty ranking method for guiding LLM predictions toward correct answers on long-tail samples from downstream tasks. Chapter 4 discusses how to learn OoD awareness from non-iid federated collaborators while maintaining the data confidentiality requirements of federated learning. Chapter 5 and Chapter 6 focus on enhancing the trustworthiness of the model. Specifically, in Chapter 5, without access to ID samples, we propose a safe and robust backdoor-based watermark injection technique that leverages the diverse knowledge from a single out-of-distribution (OoD) image, which serves as a secret key for IP verification. In Chapter 6, we propose a watermarking method for FL that not only verifies model ownership but also identifies the infringing client upon leakage.
1.3 Background and Preliminaries
In this section, we introduce the basic concepts in domain adaptation, OoD detection, in-context learning for LLMs, backdoor-based watermarking, and federated learning, which compose the cornerstones of our research on the robustness and trustworthiness of models across different domains.

1.3.1 Out-of-distribution (OoD)
In this subsection, we introduce two main tasks for OoD, including domain adaptation, which improves the transferability of models across unknown OoD domains, and OoD detection, which distinguishes OoD samples from ID samples to prevent DNNs' overconfident predictions about what they do not know.

Domain adaptation (DA) has emerged as an effective solution that transfers knowledge learned from a related but different domain (i.e. the source domain) to assist the learning of the target domain. In particular, a challenging and practical problem along this line is unsupervised domain adaptation (UDA), where only unlabeled samples of the target domain are available to assist learning. In this thesis, we will focus on UDA. Denote $P_s^{xy} := P_s(X) \times P_s(Y)$ as the distribution of the source domain, and $P_t^{xy} := P_t(X) \times P_t(Y)$ as the distribution of the target domain, respectively. One can access labeled samples from the source domain, denoted as $D_s := \{x_s^i, y_s^i\}_{i=1}^{N_s} \subset P_s^{xy}$. Accordingly, let $D_t := \{x_t^j\}_{j=1}^{N_t} \subset P_t(X)$ be the set of unlabeled samples accessible in the target domain. Denote the loss function for the target domain as $L: \triangle^{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}_+$, where $\triangle^{\mathcal{Y}}$ is the simplex over the label space, with $|\mathcal{Y}| = C$ denoting the number of unique labels. Let $\Theta$ be the parameter space of the learning model, and $f(\cdot;\theta)$ be the post-activation prediction output of model $\theta \sim \Theta$. The objective for UDA is to optimize the learning model's performance on the target domain:
$$\theta^* = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{x,y \sim P_t^{xy}} \big[L(f(x;\theta), y)\big]. \quad (1.1)$$
In practice, the learning model is derived based on accessible samples from both domains, i.e. $\theta \leftarrow \Phi(D_s, D_t)$, where $\Phi$ is the learning procedure. Without loss of generality, in this work, we focus on single-domain adaptation, and our learning framework can be readily extended to address multi-domain adaptation problems.

OoD detection training. The OoD detection problem is rooted in general supervised learning, where we learn a classifier mapping from the instance space $\mathcal{X}$ to the label space $\mathcal{Y}$. Formally, we define a learning task by the composition of a data distribution $D \subset \mathcal{X}$ and a ground-truth labeling oracle $c^*: \mathcal{X} \to \mathcal{Y}$. Then any $x \sim D$ is denoted as in-distribution (ID) data, and otherwise, $x \sim Q \subset \mathcal{X} \setminus D$ as out-of-distribution data. Hence, an ideal OoD detection oracle can be formulated as a binary classifier $q^*(x) = \mathbb{I}(x \sim D)$, where $\mathbb{I}$ is an indicator function yielding 1 for ID samples and $-1$ for OoD samples. With these notations, we define the OoD learning task as $T := \langle D, Q, c^* \rangle$. To parameterize the labeling and OoD oracles, we use a neural network consisting of two stacked components: a feature extractor $f: \mathcal{X} \to \mathcal{Z}$ governed by $\theta_f$, and a classifier $h: \mathcal{Z} \to \mathcal{Y}$ governed by $\theta_h$, where $\mathcal{Z}$ is the latent feature space. For ease of notation, let $h_i(z)$ denote the predicted logit for class $i = 1, \ldots, c$ on an extracted feature $z \sim \mathcal{Z}$. We unify the parameters of the classifier as $\theta = (\theta_f, \theta_h)$.
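The same feature-extractor/classifier decomposition $\theta = (\theta_f, \theta_h)$ reappears in the federated learning setup of Section 1.3.4, so it is worth making concrete. Below is a minimal PyTorch-style sketch of this parameterization; the architecture, layer sizes, and class name are illustrative assumptions rather than the networks used in later chapters.

```python
# A minimal sketch (not the thesis implementation) of the two-component
# parameterization theta = (theta_f, theta_h): a feature extractor
# f: X -> Z and a classifier head h: Z -> Y. All sizes are assumptions.
import torch
import torch.nn as nn

class FeatureClassifier(nn.Module):
    def __init__(self, feature_dim: int = 128, num_classes: int = 10):
        super().__init__()
        # f(.; theta_f): maps an input image to a latent feature z in Z
        self.featurizer = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feature_dim), nn.ReLU(),
        )
        # h(.; theta_h): maps the latent feature to class logits h_1..h_c
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor):
        z = self.featurizer(x)       # latent feature z
        logits = self.classifier(z)  # h_i(z; theta_h) for each class i
        return z, logits
```

Any backbone that exposes the latent feature $z$ separately from the logits fits the formulations that follow.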
We then formulate the OoD training as minimizing the following loss on the task $T$:
$$J_T(\theta) := \mathbb{E}_{x \sim D}\big[\ell_{CE}\big(h(f(x;\theta_f); \theta_h), c^*(x)\big)\big] + \lambda\, \mathbb{E}_{x' \sim Q}\big[\ell_{OE}\big(f(x';\theta_f); \theta_h\big)\big],$$
where $\ell_{CE}$ is the cross-entropy loss for supervised learning and $\ell_{OE}$ is for OoD regularization. We use $\mathbb{E}[\cdot]$ to denote the expectation, estimated in practice by the empirical average over samples. The non-negative hyper-parameter $\lambda$ trades off the OoD sensitivity in training. We follow the classic OoD training method, Outlier Exposure [53], to define the OoD regularization for the classification problem as
$$\ell_{OE}(z'; \theta_h) := E(z'; \theta_h) - \sum_{i=1}^{c} h_i(z'; \theta_h), \quad (1.2)$$
where $E(z'; \theta_h) = -T \log \sum_{i=1}^{c} e^{h_i(z';\theta_h)/T}$ is the energy function, given the temperature parameter $T > 0$. At test time, we approximate the OoD oracle $q^*$ by the MSP score [52].

1.3.2 In-context Learning (ICL) for large language models (LLMs)
In-context learning (ICL) [13] is an effective few-shot learning method that can adapt LLMs to downstream tasks without updating the model's parameters. Specifically, ICL queries LLMs by concatenating relevant samples with the test query to provide augmented knowledge. We give a formal definition of ICL as follows.

Suppose we have a training set $T = \{(x_i, y_i)\}_{i=1}^{N}$ related to the query domain, where $x$ is the question and $y$ is the answer. Given a query problem $p_i$ from a test set $P$ and a $K$-shot inference budget, we retrieve $K$ related samples $E_i = \{e_i^k = (x_i, y_i) \mid e_i^k \in T\}_{k=1}^{K}$ and construct a prompt $P(E_i, p_i)$ as input to feed into the LLM:
$$P(E_i, p_i) = \pi(e_i^1) \oplus \cdots \oplus \pi(e_i^K) \oplus \pi(p_i, \cdot), \quad (1.3)$$
where $\pi$ is the template for each sample. The predicted answer from the LLM for question $p_i$ is given by:
$$\hat{a}_i = \mathrm{LLM}(P(E_i, p_i)). \quad (1.4)$$

1.3.3 Backdoor-based Watermarking for Model Protection
Backdoor attacks. Backdoor attacks are an emerging security threat to DL systems when untrusted data, models, or clients participate in the training process [99]. Such an attack implants a backdoor trigger into the model by fine-tuning the pre-trained model with a set of poisoned samples assigned to one or multiple secret target classes [192, 79, 41, 90]. Suppose $D_c$ is the clean dataset and we craft the poisoned set $D_P$ by poisoning another set of clean samples. The objective function of backdoor injection is:
$$\min_{\theta} \sum_{(x,y) \in D_c} \ell(f_\theta(x), y) + \sum_{(x',y') \in D_P} \ell(f_\theta(\Gamma(x')), t), \quad (1.5)$$
where $\Gamma(x)$ adds a trigger pattern to a normal sample, $t$ is the pre-assigned target label, $f_\theta$ is a classifier parameterized by $\theta$, and $\ell$ is the cross-entropy loss. The key intuition of backdoor training is to make models memorize the shortcut patterns while ignoring other semantic features.

Backdoor-based watermarking. In this thesis, we focus on backdoor-based watermarking for model protection, which is a widely adopted black-box verification method. The poisoned dataset $D_P$ in Eq. (1.5) is also denoted as the trigger set for watermarking, and the objective function for watermark injection is the same as Eq. (1.5). A watermarked model should satisfy the following desired properties:
• Persistent utility. Injecting backdoor-based watermarks into a model should retain its performance on the original tasks.
• Removal resilience. Watermarks should be stealthy and robust against agnostic watermark removal attacks [130, 15, 58].
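To make Eq. (1.5) concrete, the following is a minimal sketch of how a trigger set and the watermark-injection objective can be assembled. The square patch, its location, and the single target label are illustrative assumptions rather than the trigger designs studied in Chapters 5 and 6, and `model` is assumed to be any classifier that returns logits.

```python
# A hedged sketch of backdoor-style watermark injection following Eq. (1.5).
# Patch shape/position and the target label are assumptions for illustration.
import torch
import torch.nn.functional as F

def add_trigger(x: torch.Tensor, patch_size: int = 5) -> torch.Tensor:
    """Gamma(x): stamp a white square patch in the lower-right corner."""
    x = x.clone()
    x[..., -patch_size:, -patch_size:] = 1.0
    return x

def watermark_injection_loss(model, clean_x, clean_y, trigger_x, target_label):
    """Eq. (1.5): clean-task cross-entropy plus trigger-set cross-entropy."""
    clean_loss = F.cross_entropy(model(clean_x), clean_y)
    t = torch.full((trigger_x.size(0),), target_label,
                   dtype=torch.long, device=trigger_x.device)
    trigger_loss = F.cross_entropy(model(add_trigger(trigger_x)), t)
    return clean_loss + trigger_loss
```

The ownership check defined next (Definition 1.3.1) then reduces to measuring the suspect model's accuracy on pairs of triggered inputs and the target label.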
For watermark verification, the ownership of a suspect model $M_s$ can be verified according to the consistency between the target label $t$ and the output of the model in the presence of the triggers, measured as the watermark success rate (WSR). If the WSR is larger than a certain threshold $\sigma$, the suspect model $M_s$ will be considered a copy of our original model. We formally define the ownership verification of the backdoor-based model as follows:

Definition 1.3.1 (Ownership verification). We define the watermark success rate (WSR) as the accuracy on the trigger set $D_P$:
$$\mathrm{WSR} = \mathrm{Acc}(M_s, D_P). \quad (1.6)$$
If $\mathrm{WSR} > \sigma$, the ownership of the model is established.

1.3.4 Federated Learning
Federated learning is a distributed learning framework that enables a massive number of remote clients to collaboratively train a high-quality central model [74]. FedAvg [121] is one of the representative methods for FL, which averages local models during aggregation. This work is based on FedAvg. Suppose we have $K$ clients, and our FL model $M$ used for standard training consists of two components: a feature extractor $f: \mathcal{X} \to \mathcal{Z}$ governed by $\theta_f$, and a classifier $h: \mathcal{Z} \to \mathcal{Y}$ governed by $\theta_h$, where $\mathcal{Z}$ is the latent feature space. The collective model parameter is $\theta = (\theta_h, \theta_f)$. The objective for a client's local training is:
$$J_k(\theta) := \frac{1}{|D_k|} \sum_{(x,y) \in D_k} \ell(h(f(x;\theta_f); \theta_h), y), \quad (1.7)$$
where $D_k$ is the local dataset for client $k$, and $\ell$ is the cross-entropy loss. The overall objective function of FL is thus given by $\min_\theta \frac{1}{K}\sum_{k=1}^{K} J_k(\theta)$.

CHAPTER 2 ROBUST UNSUPERVISED DOMAIN ADAPTATION FROM A CORRUPTED SOURCE
This chapter is based on the following work: Robust Unsupervised Domain Adaptation from a Corrupted Source. Shuyang Yu, Zhuangdi Zhu, Boyang Liu, Anil Jain, Jiayu Zhou. 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022: 1299–1304.

2.1 Introduction
Deep learning techniques have been thriving over the last decade as a powerful tool for predictive modeling in a variety of domains, including computer vision [124], autonomous vehicles [50], and healthcare [33], to name just a few. The over-parameterized design of deep models gives them ultra-high flexibility and the power to capture complex mappings between input data points and target labels. The success of deep-learning-based predictive modeling, however, hinges on massive training data with accurate labels, which hinders its application to tasks with limited label supervision, where collecting accurate labels can be economically prohibitive or even dangerous. One such example is the domain of health informatics, where building predictive models for a specific disease requires the construction of a carefully designed cohort [129]. Longitudinal studies, strict enrollment conditions, data coding errors, and high costs associated with data collection often result in only very small datasets being available for supervised learning [196, 7].

Accordingly, domain adaptation (DA) has emerged as an effective solution, which transfers knowledge learned from a related but different domain (i.e. the source domain) to assist the learning of the target domain. In particular, a challenging and practical problem along this line is unsupervised domain adaptation (UDA), in which the target domain has access to only a few unlabeled training samples.
While UDA has been extensively studied for typical machine learning settings, most existing UDA methods are usually built upon an implicit 10 assumption that source domain data is clean. Under this assumption, UDA methods are prone to performance degradation when the source domain samples are corrupted, either unintentionally during data collection or deliberately by vicious attackers. Consequently, models learned on the corrupted source data can be easily under attack even on the source domain, not to mention confronting the challenges of domain distribution shift when adapt- ing to the target domain. Such model performance degradation can be exacerbated under adversarial attacks. For instance, as illustrated in Fig. 2.1, a minimal corruption in the source domain samples can shift the model’s hypothesis plane drastically when performing domain adaptation, especially due to the lack of labeled supervision in the target domain. Given the challenge of UDA under the corrupted source domain, in this work, we propose a simple yet effective solution for robust UDA that addresses various types of data corrup- tion. Specifically, inspired by the principle of Median of Means (MoM) estimators [127], we alleviate the impacts of corrupted training samples by ensemble learning on a group of lightweight models with domain-invariant features, which is provably effective to confront poisoned data. To further address the distribution shift inherent in domain adaptation, we refine the learned models by maximizing the mutual information between the latent feature representations and the posterior distributions. Eventually, the final ensemble model is able to attain the predictive knowledge of the target domain with high confidence. The merits of our proposed approach are multi-fold: i) It is a principled and effective solution, with theoretical support in defending contaminated training samples. ii) Our ap- proach is simple and computationally efficient, in that the training of individual models for an ensemble can be conducted in parallel to accelerate learning. iii) The proposed solution to UDA is generally robust against agnostic types of data corruption. In particular, our approach is able to successfully tackle notorious backdoor attacks, where both the training samples and corresponding labels can be maliciously modified by attackers. iv) The pro- posed learning framework can be flexibly combined with available UDA approaches that are orthogonal to our work to improve their robustness under corrupted data. 11 Figure 2.1: Source domain data corruption may lead to failure in many existing domain adaptation approaches. 2.2 Related Work Domain Adaptation (DA) has been applied to a number of practical applications [24], including semantic segmentation [165], objective detection [10], and event recognition [17], etc. In this work, we work on the problem setting of unsupervised domain adaptation (UDA), which is more challenging than semi-supervised domain adaptation [141] where a few labeled samples of the target domain are available to assist learning. Among various UDA approaches, domain invariant representations reside at their core. A plethora of work has been proposed to learn feature representations that are discriminative for prediction while being invariant among domains. Earlier work leveraged the idea of minimizing the Maximum Mean Discrepancy (MMD) to achieve feature invariance [161, 112]. 
Adversarial training approaches emerged to minimize the discrepancy of the latent feature distributions between different domains [38, 39, 160]. Moment matching was also widely utilized for learning latent representations [132, 146, 189], which can be combined with generative adversarial learning to improve such domain invariance [87]. Another direction toward solving UDA is based on data reconstruction [56, 71, 197]. Many prior approaches are built upon the prerequisite that both the source and target domain data are accessible simultaneously during learning. Contrarily, [103, 102] proposed source-data-free UDA. Most existing approaches did not tackle the issue of source domain corruption.

Learning with noisy data has been extensively studied in traditional, non-domain-adaptation settings. Numerous robust learning methods have been proposed for tackling feature corruption, label corruption, and data poisoning attacks [159, 29, 76, 85]. However, the problem of learning with noisy data for DA is not well studied. Most of the existing robust DA methods are limited to one or two particular types of noise in data. [172] addressed domain adaptation under missing classes by performing a unilateral alignment. [186, 184] solve DA in a scenario where only the labels are noisy, with input features untouched. [18] proposed marginalized stacked denoising autoencoders (mSDA) to address feature corruption for DA. [46] developed an offline curriculum learning approach to tackle the label noise of DA, and adopted a proxy-distribution-based margin discrepancy to alleviate feature noise. Existing robust DA methods are summarized in Table 2.1.

Method   Feature corruption   Label corruption   Data poisoning attacks
[172]    ×                    ✓                  ×
[186]    ×                    ✓                  ×
[184]    ×                    ✓                  ×
[18]     ✓                    ×                  ×
[46]     ✓                    ✓                  ×
Ours     ✓                    ✓                  ✓
Table 2.1: Existing robust DA methods.

Median of Means (MoM) Estimators [127] are robust estimators utilizing the median of the predictions. They have shown a theoretical advantage over classical ERM-based approaches given long-tailed data with outliers [116], which can be very effective for solving general noisy-data problems. Recently, [133, 80, 81] applied MoM for robust predictive learning. In this work, we leverage the MoM principle to solve UDA with data corruption.

2.3 Problem Setting
In this section, based on UDA as introduced in Section 1.3.1, we elaborate on our proposed problem setting, i.e. UDA with corrupted data in the source domain.

UDA with Source Domain Corruption tackles domain adaptation from a corrupted source domain. One can consider that there is a one-to-one mapping between the clean source domain $P_s^{xy}$ and the corrupted source domain $\tilde{P}_s^{xy}$. The input feature $x_s^i$ can be disrupted with probability $p_e$:
$$p_e := \mathbb{E}_{x_s^i, \tilde{x}_s^i \sim \langle P_s(X), \tilde{P}_s(X) \rangle}\big[\mathbb{I}(\tilde{x}_s^i \neq x_s^i)\big].$$
Accordingly, labels of noisy samples are transformed based on an unknown transition probability matrix $T \in \mathbb{R}^{C \times C}$, where $C$ is the cardinality of label types. Each entry $T(i,j)$ in $T$ denotes the probability that a label $i \in [C]$ is flipped to $j \in [C]$ after data corruption:
$$T(i,j) = \mathbb{E}_{y_s, \tilde{y}_s \sim \langle P_s(Y), \tilde{P}_s(Y) \rangle}\big[\mathbb{I}(\tilde{y}_s = j \mid y_s = i)\big].$$
Denote by $\tilde{D}_s = \{\tilde{x}_s^i, \tilde{y}_s^i\}_{i=1}^{N_s}$ the noisy samples from $\tilde{P}_s^{xy}$; the model learned under a corrupted source domain is hence derived from noisy source domain samples instead: $\theta \leftarrow \Phi(\tilde{D}_s, D_t)$. Such data corruption can be unconsciously introduced during data collection by human mistakes or sensor malfunction, or maliciously triggered via adversarial attacks [101].
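To make the corruption model above concrete, the sketch below simulates a corrupted source set from a clean one using the disruption probability $p_e$ and a transition matrix $T$. Gaussian feature noise and a uniform off-diagonal $T$ are illustrative stand-ins only; the experiments in Section 2.5 instead corrupt the source with backdoor triggers.

```python
# A hedged sketch of the Section 2.3 corruption model: each source sample is
# disrupted with probability p_e, and its label is re-drawn from row y of a
# transition matrix T. Noise type and T below are illustrative assumptions.
import numpy as np

def corrupt_source(X, y, p_e, T, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_tilde, y_tilde = X.copy(), y.copy()
    hit = rng.random(len(X)) < p_e                  # which samples are disrupted
    X_tilde[hit] += rng.normal(0.0, noise_std, size=X_tilde[hit].shape)
    for i in np.where(hit)[0]:                      # flip label i -> j w.p. T[y_i, j]
        y_tilde[i] = rng.choice(len(T), p=T[y[i]])
    return X_tilde, y_tilde

# Example transition matrix: keep the true label w.p. 0.9, uniform otherwise.
C = 10
T = np.full((C, C), 0.1 / (C - 1))
np.fill_diagonal(T, 0.9)
```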
It is a challenging yet practical problem setting, potentially undermining most existing UDA approaches that do not consider the risk of noisy source domains (as illustrated in Figure 2.1).

2.4 Methodology
In this section, we propose a robust UDA method based on ensemble learning, which splits the potentially contaminated data into blocks and performs model learning on each block. The derived models are then fine-tuned and ensembled for domain adaptation. Our method is conceptually inspired by the Median of Means estimator [127].

2.4.1 Preliminaries of Median of Means
Given a model $\theta$, there exists a gap between the empirical risk $\hat{E}(\theta)$ and the true risk $E(\theta)$, with $\hat{E}(\theta) := \frac{1}{|D|}\sum_{x \sim D} L(f(x;\theta), y)$ and $E(\theta) := \mathbb{E}_{x,y \sim P^{xy}}[L(f(x;\theta), y)]$, which can be exacerbated when the data are heavily tailed or contain contaminated samples. Therefore, models that are learned to solely minimize $\hat{E}(\theta)$ can be sensitive to outliers. Median of Means estimators alleviate this issue by finding a more proper approximation of the true risk, compared with an empirical risk minimizer (ERM). Formally, let $\{x_i\}_{i=1}^{N}$ be $N$ i.i.d. samples from an unknown distribution $\mathbb{P}$. Given a MoM estimator associated with a parameter $\delta \in [e^{1-N/2}, 1]$, one can evenly separate $\{x_i\}_{i=1}^{N}$ into $K$ blocks, where $K = \lfloor \ln(\delta^{-1}) \rfloor$. The MoM estimator $\mu_{MoM}(\delta)$ is then defined as the median of the $K$ arithmetic means of the blocks $X_k$:
$$\mu_{MoM}(\delta) = \mathrm{median}\Big(\frac{1}{|X_k|}\sum_{x_i \sim X_k} x_i\Big)_{k=1}^{K}. \quad (2.1)$$
The MoM estimator provably attains subgaussian properties under mild assumptions on the variance of the input features. Particularly, $\forall N \geq 4$, one can derive that [28]:
$$\mathbb{P}\Big(|\mu_{MoM}(\delta) - \mathbb{E}_{\mathbb{P}}[x]| > C\sqrt{\tfrac{1 + \ln(\delta^{-1})}{N}}\Big) \leq \delta. \quad (2.2)$$
Unlike the ERM estimator $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$, the MoM estimator is robust to data with outliers or heavy-tailed inputs. Inspired by MoM, we aim to approximate and minimize the centroid of the excessive risks by ensemble learning, which resembles the median of means when we treat $x_i$ as a sample-wise loss value.

2.4.2 Robust UDA via Ensemble Learning
We now elaborate on our learning paradigm. We first randomly split the source domain data $\tilde{D}_s := \{\tilde{x}_s^i, \tilde{y}_s^i\}_{i=1}^{N_s}$ into $K$ even blocks $\{\tilde{D}_s^k\}_{k=1}^{K}$, and apply the same random split to the unlabeled target domain data: $\{D_t^k\}_{k=1}^{K}$.

Figure 2.2: Process of robust domain adaptation learning.

Next, we learn $K$ separate models with parameters $\{\theta_k\}_{k=1}^{K}$, each optimizing a domain-adaptation objective on one pair of $\langle$source, target$\rangle$ data blocks, respectively, to minimize the empirical risk:
$$\min_{\{\theta_k \sim \Theta\}_{k=1}^{K}} \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{x_s, y_s \sim \tilde{D}_s^k,\; x_t \sim D_t^k} J_{DA}(x_s, y_s, x_t, \theta_k), \quad (2.3)$$
in which $J_{DA}(x_s, y_s, x_t; \theta)$ is the domain-adaptation risk function. One highlight of our work is that we do not constrain the specific form of $J_{DA}$; hence a variety of UDA approaches proposed by prior arts can be flexibly integrated into our learning framework by applying different forms of $J_{DA}$ as needed. In practice, $J_{DA}$ is usually derived by adversarial learning to attain a saddle-point solution that captures domain-invariant latent representations [39, 160]. Without loss of generality, we present one form of $J_{DA}$ below, although any other legitimate objective forms are also applicable:
$$J_{DA}(x_s, y_s, x_t; \theta) := \max_{D: \mathcal{X} \to [0,1)} \big[\underbrace{\log(1 - D(g(x_s;\theta))) + \log(D(g(x_t;\theta)))}_{(A)}\big] + \underbrace{L(f(x_s;\theta), y_s)}_{(B)}, \quad (2.4)$$
in which $D$ is a discriminator model inspired by adversarial generative training [42], and $g(\cdot;\theta)$ is the latent feature map of model $\theta$. The term (A) in Eq. (2.4) encourages learning a domain-invariant feature representation, while the term (B) in Eq. (2.4) reinforces the predictive power of the model using labeled supervision from the source domain.

Once the $K$ models have been learned, the centroid prediction for an arbitrary sample $x$ can be derived by their ensemble voting:
$$\bar{y} = \mathrm{ensemble}(x; \{\theta_k\}_{k=1}^{K}) = \arg\max_{y \sim \mathcal{Y}} \sum_{k=1}^{K} \mathbb{I}\big(\arg\max_{c \sim \mathcal{Y}} f(x;\theta_k)_c = y\big), \quad (2.5)$$
where $\mathbb{I}$ is an indicator function, $f(x;\theta_k)$ is the posterior distribution output of model $\theta_k$, and $f(x;\theta_k)_c$ indicates the predicted probability of the input belonging to class $c$. Therefore, $\bar{y}$ for an input $x$ is the label most voted by the $K$ models, which alleviates the influence of potentially contaminated models induced by data corruption.

2.4.3 Rationale for Ensemble Voting
We show that our ensemble strategy provides improved performance against data corruption. Concretely, given clean data $D$ and its potentially corrupted version $\tilde{D}$ with corruption ratio $p_e$, s.t. $\sum_{x, \tilde{x} \sim \langle D, \tilde{D}\rangle} \mathbb{I}(x \neq \tilde{x}) = p_e |D|$, one can randomly shuffle $\tilde{D}$ and evenly split it into $K$ blocks $\{\tilde{D}^k\}_{k=1}^{K}$, then perform robust ensemble learning with the following guarantee:

Theorem 2.4.1. Based on the definition of $\tilde{D}$ and $D$, with $\tilde{D} \sim \tilde{\mathbb{P}}(X)$ and $D \sim \mathbb{P}(X)$, respectively, given $K > 0$ and $B \cdot K = |D| = N$, let a random shuffling procedure evenly split $\tilde{D}$ into $\{\tilde{D}^k\}_{k=1}^{K}$ and, equivalently, split $D$ into $\{D^k\}_{k=1}^{K}$ given their bipartite mapping. Denote the empirical minimizer for the $k$-th block data as
$$\tilde{h}_k = \arg\min_{h} \mathbb{E}_{\tilde{D}^k}[L(h(\tilde{x}), \tilde{y})],$$
which is potentially corrupted. Accordingly, denote $\tilde{h} = \frac{1}{K}\sum_{k=1}^{K} \tilde{h}_k$ as the noisy ensemble model. Denote by $\hat{h}^*$ the oracle ensemble model without data corruption, i.e. $\hat{h}^* = \frac{1}{K}\sum_{k=1}^{K} \hat{h}_k$, where
$$\hat{h}_k = \arg\min_{h} \mathbb{E}_{D^k}[L(h(x), y)]$$
is the empirical minimizer on the clean data. Given $z_k$ as the ratio of corrupted samples for block $k$:
$$z_k := \mathbb{E}_{\tilde{D}^k, D^k}\big[\mathbb{I}(\tilde{x} \neq x)\big] = \mathbb{E}_{\tilde{D}^k, D^k}\big[\mathbb{I}(\tilde{y} \neq y)\big], \quad (2.6)$$
assume a threshold $\epsilon$ for effective contamination, such that $z_k > \epsilon \rightsquigarrow \tilde{h}_k \neq \hat{h}_k$, $\forall k \in [K]$. Let $v_k = \mathbb{I}(z_k > \epsilon)$, let $p_K = \mathbb{E}_{k \sim [K]}[\mathbb{I}(z_k > \epsilon)] = \mathbb{E}_{k \sim [K]}[v_k]$ be the expected ratio of contaminated blocks, and let $\delta^2 = \mathrm{Var}(z_1) < \infty$ be bounded depending on the choice of $K$. One can derive that:
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) \leq e^{-2K\left(\frac{1}{2} - \frac{K}{N}\frac{\delta^2}{\epsilon^2}\right)^2}. \quad (2.7)$$

Proof. Observe that $z_k < \epsilon \Leftrightarrow \tilde{h}_k \equiv \hat{h}_k$. Then $\hat{h}^* \neq \tilde{h}$ implies that at least $\lceil \frac{K}{2} \rceil$ of the blocks are contaminated, i.e. $\frac{1}{K}\sum_{k=1}^{K} v_k > \frac{1}{2}$. Then one can derive that:
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) \leq \mathbb{P}\Big(\sum_{k=1}^{K} v_k > \frac{K}{2}\Big) \quad (2.8)$$
$$= \mathbb{P}\Big(\sum_{k=1}^{K} (v_k - \mathbb{E}(v_k)) > \frac{K}{2} - K p_K\Big) \quad (2.9)$$
$$= \mathbb{P}\Big(\frac{1}{K}\sum_{k=1}^{K} (v_k - \mathbb{E}(v_k)) > \frac{1}{2} - p_K\Big). \quad (2.10)$$
By Hoeffding's inequality [55], $\forall t > 0$ we have:
$$\mathbb{P}\Big(\frac{1}{K}\sum_{k=1}^{K} (v_k - \mathbb{E}(v_k)) > t\Big) < e^{-2Kt^2}. \quad (2.11)$$
Therefore,
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) < e^{-2K(\frac{1}{2} - p_K)^2}. \quad (2.12)$$
Based on Chebyshev's inequality:
$$p_K = \mathbb{P}(z_k > \epsilon) \leq \frac{\delta^2}{B\epsilon^2} = \frac{K}{N}\frac{\delta^2}{\epsilon^2}.$$
As a result,
$$\mathbb{P}(\hat{h}^* \neq \tilde{h}) \leq e^{-2K\left(\frac{1}{2} - \frac{K}{N}\frac{\delta^2}{\epsilon^2}\right)^2}.$$

$\hat{h}^* \equiv \tilde{h}$ indicates that the ensemble model $\tilde{h}$ learned on noisy samples can defend against data corruption by delivering the same prediction as the ideal ensemble model $\hat{h}^*$, and vice versa. Theorem 2.4.1 demonstrates that the success of such robust ensemble learning hinges on two aspects: 1) the choice of the number of data blocks $K$, and 2) the difficulty $\epsilon$ of corrupting a single model. Moreover, we show below that given a proper choice of $K$ based on the ratio of corrupted data, we can ensure a robust learning scheme with high confidence:

Lemma 2.4.1. Given the above definition of $\tilde{D}$, $\{\tilde{D}^k\}_{k=1}^{K}$, and $z_k$, if $\sum_{\tilde{x}, x \in \langle \tilde{D}, D\rangle}\mathbb{I}(x \neq \tilde{x}) = p_e|D|$, then one can derive that $\forall K > 2 p_e|\tilde{D}|$:
$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{I}(z_k > 0) < \frac{1}{2}. \quad (2.13)$$

Proof. Denote $N = |D|$. Given that $p_e N$ is the number of contaminated samples in $\tilde{D}$, it is straightforward that:
$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{I}(z_k > 0) \leq \frac{1}{K}\sum_{k=1}^{K}\Big(\sum_{x, \tilde{x} \in \langle D^k, \tilde{D}^k\rangle} \mathbb{I}(x \neq \tilde{x})\Big) = \frac{1}{K}\, p_e N < \frac{1}{2 p_e N}\, p_e N = \frac{1}{2}.$$

Lemma 2.4.1 reveals that, given a choice of $K > 2 p_e|\tilde{D}|$, the majority of data blocks (more than 50%) are composed of clean samples. This property is inspiring, since we can train $K$ models separately using the $K$ blocks, whose ensemble ensures that the models learned on corrupted data will be voted out with high probability. In practical scenarios where the value of the poison ratio is unknown, one can play a tradeoff between the confidence of the ensemble and the data sufficiency in each block. The block number $K$ is an important hyper-parameter, which needs to be carefully selected. An over-small $K$ might undermine the robustness of the ensemble, in that the ratio of contaminated data blocks becomes relatively high when $K$ decreases. On the other hand, a
In our setting, optimizing toward this objective shows significant benefits in weakening the impacts of source data corruption, which can adaptively tune the potentially contaminated model to fit in the target domain hypothesis. Moreover, when refining a model θk using target domain samples, we can obtain the pseudo label ˆyt = arg max f (xt; θ)c for each sample xt, as well as the class-wise centroid c∼[C] 20 representation ¯gk: ∀k ∈ [C], ¯gk = Ext∼Dt,ˆyt=k [g(xt; θk)] . (2.15) where g(xt; θk) is the latent feature representation of xt, i.e. the penultimate layer output of model θ. We find it beneficial to correct the pseudo labels of xt by finding the nearest centroid: ¯yt := arg min cos(g(xt; θk), ¯gk), k∈[C] then use the corrected pseudo labels ¯yt to adjust the model. More concretly, this augmented objective JP L is derived as follows: JPL := Ext∼Dt,¯yt [− log(f (xt; θk)¯yt)] . min θk (2.16) Algorithm 1 Robust Unsupervised Domain Adaptation 1: Require: labeled source domain dataset Ds; unlabeled target domain dataset Dt; con- k=1 ∼ Θ. training stant K, DA risk function JDA : X × X × Y → R+; K models {θk}K steps E1, adaptation steps E2; constant α, β > 0. 2: Randomly split Ds, Dt into K blocks of pairs: {Dk , s.t. ∀ k, |Dk s | ≤ ⌈ |Ds| s , Dk t | ≤ t }K K ⌉, |Dk k=1 ⌈ |Dt| K ⌉. 3: for k ∼ [K] in parallel do for 1 ≤ i ≤ E1 do θk ← θk − η ∗ ∇θk Dk E 4: 5: end for 6: 7: end for 8: for k ∼ [K] in parallel do for 1 ≤ i ≤ E2 do 9: [JDA(xs, ys, xt)] . s ,Dk t 10: θk ← θk − η (α∇θtJIM(Dt; θk) + βJPL(Dt; θk)). end for 11: 12: end for 13: Return ensemble{θk}K k=1 . Based on the above building blocks, we now summarize our robust domain adaptation approach in Algorithm 1, in which K models are independently learned using separated training blocks, then refined to adapt their model hypothesis into the target domain by optimizing Eq. (2.14) and Eq. (2.16), where α and β are the constant w.r.t the gradient of 21 Eq. (2.14) and Eq. (2.16), respectively. Note that for each learning batch i, we iteratively adjust the centroid ¯gk using the updated model. Eventually, their ensemble voting is used as the final prediction for the target domain. Besides being robust to corrupted source domain data, our learning scheme is also learning-efficient when powered with parallel training. Moreover, it serves as a general robust UDA framework that can improve most existing DA learning approaches against corrupted data. 2.5 Evaluation In this section, we conduct extensive experiments on multiple benchmark datasets to in- vestigate the following question: whether our approach is effective for unsupervised domain adaptation, given a corrupted source domain data? 2.5.1 Experiment Setup Dataset: We conducted experiments using the following datasets: 1. Digit datasets: We conducted UDA tasks form the MNIST domain [83] to the USPS [8]), and from USPS to MNIST, respectively. 2. Image datasets: We conducted the UDA task from CIFAR10 [75] to STL [37] with the non-overlapping class of these two detests removed. Hence, these two domains are redefined as 9-class classification tasks. We also downscale the original image dimension of STL from 96 × 96 to 32 × 32, which is the image dimension of CIFAR-10. Compared Approaches: We compare our method against the following approaches: 1. DANN is a representative UDA method based on generative-adversarial learning [39]. 2. 
CDAN is short for conditional adversarial domain adaptation, which conditions the model posterial on the discriminative information from the classifier.[113] 22 Implementation: We choose backdoor attacks as our corruption method because it is a more more challenging attack, compared with feature noise or label noise attacks that existing robust DA methods managed to solve. We implement two kinds of backdoor attacks: 1. BadNet Attack : BadNet attack is one of the most common backdoor attacks [44]. According to a set poison ratio, we add a 5 × 5 trigger to the upper right corner of each poisoned sample from the source domain. These poisoned samples are also assigned with attacker-specified target labels . Then these poisoned source samples are fed into DNNs along with the remaining clean source samples and a few unlabeled target samples for training. The network is evaluated both on the clean target samples and poisoned target samples which are corrupted the same way as source samples. 2. Clean Label Backdoor Attack (CLBD): Compared with BadNet attacks, CLBD does not change the label of poison samples, but will add an learned adversarial pertur- bation to each base image [159]. In our experiment, we add a l∞-bounded perturbations constructed using projected gradient descent (PGD) [117]. This step can be formally defined as follows: p = xj ˆx(j) b + arg max ∥δ∥∞<ϵ L(x(j) b + δ, y(j), θ), (2.17) where L is the cross-entropy loss. Then a trigger is added to the set {ˆx(j) p } to generate the final poison samples {x(j) p }. In our experiments, we craft the poison samples on a pre-trained Resnet-18 model using CIFAR-10 dataset, then modify them with a 5 × 5 patch in the lower right-hand corner. The perturbations are bounded with ϵ = 16/255. We set the poison ratio to be 0.5 for the poisoned class. Note that the poison ratio for CLBD represents the fraction of examples poisoned from a single class, instead of the entire source training samples. Then similar to BadNet attacks, we fed models with training samples including the poisoned ones for training, and evaluate the performance on both the clean and the poisoned target samples. 23 For the digit datasets, we utilize the classical LeNet-5 [82] network for the task USPS ↔ MNIST. We adopt minibatch SGD with momentum = 0.9, weight decay = 5e−3 and learning rate = 1e−2. The maximum number of epochs we set for digits is 70. For image datasets, we adopt the Resnet-18 network. The maximum number of iterations we set for image tasks is 20000. Evaluations are performed w.r.t. the following criteria: 1. Target clean accuracy (denoted as Clean acc) refers to the accuracy evaluated on the clean target dataset. 2. Target poison accuracy (denoted as Poison acc) refers to the accuracy evaluated on the poisoned target data with clean labels. 3. Attack success rate (denoted as Success rate) refers to the accuracy evaluated on the poisoned target data with poisoned labels. This criterion can help us find out whether hidden backdoors are activated by attacker-specified trigger patterns. 2.5.2 Results and Discussions For the digits domain adaptation between USPS and MNIST, we apply BadNet attacks and vary the poison ratio from 0.01 to 0.03. For image adaptations between CIFAR10 and STL, we fix the poison ratio to be 0.02 for BadNet attacks and 0.5 for CLBD attacks. Effects of MoM on defending poison data attacks: For digit adaptation tasks, we evaluate the accuracy and attack success rates w.r.t. 
different poison ratios for two different base DA approaches: DANN and CDAN, respectively. As shown in Table 2.2, our proposed MoM method is consistently robust given different base DA algorithms. When the poison ratio is 0, there are no poisoning attacks on source data, hence the poison acc and success rate for poison ratio = 0 is evaluated on poisoned testing samples, with a model trained on clean samples. We use this result as a reference for the following experiments. The 24 DA model Task Poison ratio Block num Clean acc ↑ Poison acc ↑ Success rate ↓ DANN DANN MNIST → USPS USPS → MNIST MNIST → USPS USPS → MNIST 0 (clean) 0.01 0.02 0.03 0 (clean) 0.01 0.02 0.03 0 (clean) 0.01 0.02 0.03 0 (clean) 0.01 0.02 0.03 1 1 10 1 10 15 1 15 20 1 1 10 1 10 1 20 1 1 10 1 15 1 15 20 1 1 10 1 15 1 20 88.89 88.79 86.25 89.34 85.00 83.86 88.44 83.91 82.76 95.54 95.38 85.00 93.30 83.22 95.02 75.31 93.47 93.52 87.84 94.07 85.45 94.02 85.35 82.71 92.96 96.29 83.09 93.4 77.71 97.22 74.39 11.26 8.57 12.21 8.77 9.22 11.61 8.62 9.87 11.21 9.56 10.02 10.24 10.21 10.50 10.10 10.57 12.41 8.52 12.76 8.57 12.01 8.82 9.87 10.96 9.68 10.17 10.48 10.1 10.56 10.13 10.53 9.62 92.33 8.77 97.06 59.99 28.65 95.17 60.34 35.87 9.89 94.89 12.88 97.82 18.09 99.12 18.87 9.57 94.27 9.97 97.46 19.18 99.15 67.91 38.22 10.04 93.49 15.62 97.27 17.51 98.39 20.76 Table 2.2: Accuracy(%) and attack success rates(%) for MoM using under BadNet attacks. ↑ indicates that a larger value is desirable, and vice versa. By applying MoM, we can significantly bring down the attack success rate and also improve the target poison test accuracy, while maintaining the target clean sample accuracy. performance of MoM for image task (CIFAR-10 → STL) under BadNet attacks and CLBD attacks is shown in Figure 2.3. Block number = 1 refers to training without applying MoM, which we use as the baselines for our proposed algorithm. We found that by applying MoM, we can significantly bring down the attack success rate and improve the target poison test accuracy while maintaining the target clean sample accuracy. The results for both tasks can be further improved by adaptation with information maximization (IM) or adaptation with pseudo label (PL) which will be covered later. Effects of different block numbers for MoM: We also investigate how the number 25 of blocks would affect the performance of our approach. We observe that increasing the number of blocks within a certain range is beneficial for improving the performance. The best block number is related to the poison ratio and can be task dependent. For instance, adaptation tasks between CIFAR10 and STL need more blocks to achieve a low attack success rate, compared with digits adaptations. Moreover, for CLBD attacks using DANN, MoM achieves a better result with attack success rate = 25.96% and block number = 40, compared to the result achieves when block number is 50 with the attack success rate = 28.05%. This phenomenon is induced by insufficient training data left in each block when one immoderately increases the block numbers, leading to the potential under-fitting of the learned model. Meanwhile, we show that adaptation with Information Maximization (Section 2.4.4 ) is more beneficial for enhancing the robustness of our approach, instead of keeping increasing the block number. (a) BadNet attacks us- ing DANN (b) BadNet attacks us- ing CDAN (c) CLBD attacks using DANN (d) CLBD attacks using CDAN Figure 2.3: Clean test accuracy, poison test accuracy and attack success rate for MOM w.r.t. 
different block number. Increasing the number of blocks within a certain range is beneficial for improving the performance. Effects for different poison ratios for MoM: We investigate how different poison ratios would affect the performance of MoM for task MNIST → USPS using DANN as the base algorithm. As shown in Figure 2.4, the attack success rate increases along with the poison ratios, given the block number fixed to be 10, which means that when the poison ratio increases, 10-block MoM can no longer be as effective as when poison ratio is low. With the increase of poison ratio, more poison samples are divided into one single block, which will lead to poor prediction for almost all blocks, Consequently, the quality of the median 26 (a) Clean test accuracy. (b) Poison test accuracy. (c) Attack success rate. Figure 2.4: Clean test accuracy, poison test accuracy and attack success rate for MOM w.r.t. different poison ratio under BadNet attack. We need to increase the number of blocks with the increase of poison ratio. DA model Task Poison ratio Block num Clean acc ↑ Poison acc ↑ Success rate ↓ Adaptation DANN CDAN MNIST → USPS USPS → MNIST MNIST→USPS USPS → MNIST 0.02 0.03 0.02 0.03 0.02 0.03 0.02 0.03 15 20 10 20 15 20 15 20 83.86 83.12 84.16 82.76 80.67 81.22 83.22 85.51 85.98 75.31 78.95 79.46 85.45 85.10 85.45 82.71 81.42 81.81 77.71 82.50 83.00 74.39 78.90 80.67 11.61 12.11 12.26 11.21 11.71 11.96 10.50 10.17 10.27 10.57 10.62 10.76 12.01 12.66 12.41 10.96 12.51 12.16 10.56 10.49 10.51 10.53 10.50 10.56 28.65 10.76 12.41 35.87 19.03 16.49 18.09 11.57 11.48 18.87 10.11 9.89 19.18 8.12 11.01 38.22 7.08 10.66 17.51 11.74 10.89 20.76 11.58 10.60 / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL / IM IM+PL Table 2.3: Accuracy(%) and attack success rates(%) for MoM+refine using model CDAN. ↑ indicates larger value is better. ↓ indicates smaller value is better. Bold numbers are best performers. Our adaptation method is consistently robust given different DA algorithms. IM and PL are verified to be effective to not only further decrease attack success rate but also increase the target poison test accuracy. estimators over all the blocks can no longer be guaranteed. To achieve an effective defense, we need to increase the number of blocks with the increase of poison ratios. Effects of defending poison data attack using adaptation: To further improve the results, we refine our model (best block number for MoM is used) with adaptation method IM and PL which are introduced in section 2.4.4. IM and PL are verified to be effective to 27 not only further decrease the attack success rate but also increase the target poison accuracy. To the best of our knowledge, our proposed method is the most robust DA method given corrupted source samples compared with existing methods. For the digits tasks, we evaluate our proposed MoM + adaptation algorithm w.r.t. different poison ratios using two models: DANN and CDAN, respectively, as shown in Table 2.3. We didn’t apply IM or PL when the poison ratio is low (poison ratio = 0.01) for digits task, because soley adopting MoM will anneal the attack success rate under 10%. However, when the poison ratio increases, IM and PL can help further decrease the attack success rate. For image task, the accuracy and attack success rates for BadNet attacks and CLBD attacks are shown in Table 2.4 and 2.5, respectively. 
Generally, both IM and PL can be used to further improve the results for defending BadNet and CLBD attacks while IM shows the best results. IM increases the poison accuracy and decreases the attack success rate, while maintaining the clean accuracy. Specifically, for BadNet attacks, adaptation shows more improvements. However, under BadNet attacks, PL can no longer help further improve the performance after the number of blocks drastically increases. Block num Clean acc ↑ Poison acc ↑ Success rate ↓ Adaptation 1 10 30 40 63.97 64.42 61.97 65.07 62.72 62.00 62.63 62.00 62.00 60.85 10.67 35.68 47.81 33.74 53.92 54.64 54.12 55.76 56.76 50.79 87.32 55.65 30.93 53.15 23.57 22.08 23.55 19.28 16.64 18.38 / / IM IM+PL / IM IM+PL / IM IM+PL Table 2.4: Accuracy(%) and attack success rates(%) using base approach CDAN for the task CIFAR-10 → STL under BadNet attack. Our adaptation method is consistently robust given different kinds of tasks. 28 Block num Clean acc ↑ Poison acc ↑ Success rate ↓ Adaptation 1 10 30 40 68.20 67.83 65.91 65.75 64.17 63.37 63.92 63.98 61.99 62.41 12.85 41.02 42.83 40.34 47.40 48.80 49.56 48.97 50.38 50.71 84.01 42.74 28.27 40.67 29.12 28.39 28.05 25.96 23.45 24.37 / / IM IM+PL / IM IM+PL / IM IM+PL Table 2.5: Accuracy(%) and attack success rates(%) using base approach DANN for task CIFAR-10 → STL under CLBD attack. Our adaptation method is consistently robust under different kinds corruptions. 2.6 Summary In this work, we tackle a practical yet challenging problem, i.e. unsupervised domain adapta- tion under corrupted source domain samples. Inspired by the Median of Means estimators, we proposed a principled and robust ensemble learning algorithm powered by hypothesis transfer via information maximization, which can provably defend corrupted training sam- ples with high asymptotic performance on the target domain. Extensive empirical studies showed that our UDA approach is robust against different levels of data corruption, which can serve as a general framework to improve the robustness of orthogonal UDA approaches. We leave the extension of our work to more complex scenarios, such as corrupted multi-domain adaptation as an intriguing future work. 29 CHAPTER 3 DYNAMIC UNCERTAINTY RANKING: ENHANCING RETRIEVAL-AUGMENTED IN-CONTEXT LEARNING FOR LONG-TAIL KNOWLEDGE IN LLMS This chapter is based on the following work: Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs. Shuyang Yu, Runxue Bao, Parminder Bhatia, Taha Kass-Hout, Jiayu Zhou, Cao Xiao. The 2025 Annual Conference of the Nations of the Americas Chapter of the ACL (NAACL 2025). 3.1 Introduction Pretrained large language models [13, 157, 3] have achieved remarkable success across var- ious natural language processing (NLP) tasks, such as summarization [191, 163], question answering [62, 167], and code generation [92, 175]. These impressive results are largely due to their pre-training on vast, web-sourced datasets spanning multiple domains. However, these real-world datasets often follow a long-tail distribution [111, 118, 25, 147], where knowl- edge from less frequent domains is underrepresented. Consequently, certain domain-specific information may be rarely or even never included in the LLMs’ memorization [67]. As a result, LLMs struggle to provide accurate responses to queries drawn from these long-tail distributions, since the pre-training process fails to capture this sparse information. 
In-context learning (ICL) [13] is a few-shot learning method that queries LLMs by concatenating relevant samples with the test query, without updating the model's parameters. [67] found that ICL, when combined with retriever augmentation, can reduce LLMs' reliance on pre-training knowledge by retrieving relevant examples related to long-tail queries during inference. Common retrieval methods used to select augmentation examples for ICL include random selection [176, 174], off-the-shelf retrievers (e.g., BM25 [138]), and fine-tuned retrievers (e.g., PromptPG [114]). However, prior works [194, 107, 115, 16] have shown that ICL with different selections and orderings of the retrieved samples can lead to unstable predictions of LLMs. In our experiments, we observed a similar pattern: when utilizing existing methods to retrieve relevant samples for ICL, the model's predictions for long-tail questions (those not captured by zero-shot inference) exhibited particularly high uncertainty. In some cases, a subset of the retrieved samples led to correct predictions, while the full set misled the model, even with the same retrieval method.

Figure 3.1: Training framework of the proposed method. After pre-selection using BM25 for each validation sample p_i, we conduct from 0-shot to k_i-shot inference and update the retriever S_θ according to the dynamic impact of each sample on the LLM, based on the reward from the LLM. To reduce the query cost, we update the threshold σ when the LLM experiences a negative prediction change. The query time k_i is decided by the retriever score S_θ and the threshold σ.

In this chapter, to enhance retrieval augmentation for long-tail samples with respect to the LLM's uncertainty, we propose a reinforcement learning-based dynamic uncertainty ranking method, motivated by reinforcement learning's capacity to search for optimal retrieved samples based on the LLM's feedback [114]. Specifically, our approach trains a retriever to prioritize informative and stable samples while down-ranking misleading ones, enhancing performance on both head and tail distributions. We build on the BERT-based retriever architecture [27] with an appended linear layer. During the training of the retriever, only the linear layer is fine-tuned. Initially, BM25 [138] is used for pre-selection, and the retriever is trained using policy gradients [150], guided by feedback from the LLM for each retrieved sample. To improve efficiency, we introduce a learnable dynamic threshold as a budget controller for retrieval, selecting only samples with ranking scores above this threshold; the threshold adjusts whenever the LLM experiences a negative prediction change, i.e., the prediction changes from true to false. To evaluate the proposed approach, we compared our method with state-of-the-art methods across both multi-choice and open-ended question-answering (QA) datasets from different domains. The experimental results show that our method outperforms the best baseline by 2.76%. Long-tail questions that fail to be captured by zero-shot inference benefit particularly from our proposed method. The accuracy of our method on long-tail questions surpasses previous methods by a large margin of up to 5.96%.
We summarize our key contributions as follows:

• We investigate the limitations of existing retrieval-augmented ICL approaches for handling long-tail questions, highlighting how variations in retrieved samples contribute to prediction uncertainty.

• We propose a reinforcement learning-based dynamic uncertainty ranking method with a budget controller that considers the dynamic impact of each retrieved sample on the LLM's prediction, which selectively elevates informative retrieved samples and suppresses misleading ones with minimal query costs.

• Extensive experiments demonstrate that our method consistently outperforms state-of-the-art methods on multiple QA datasets from different domains, achieving nearly a 6% improvement in accuracy for long-tail questions.

3.2 Related Work

In-context learning (ICL). ICL [13] queries the LLMs with a concatenation of related samples and the test query without parameter updating. To improve the quality of ICL, retrievers have been proposed to select related samples, which can be categorized into sparse retrievers (e.g., [138]) and dense retrievers (e.g., [107]). To further improve the effectiveness of off-the-shelf retrievers, strategies for fine-tuning retrievers on specific target domains have been proposed, such as PromptPG [114], UDR [97], and LLM-R [171]. Some works also adopt GPT to help retrieve and rerank samples by providing special prompts and related samples, such as Rerank [148] and SuRe [70].

Long-tail knowledge learning for ICL. [67] is the first to explore the influence of the long-tail distribution in pre-training data on LLM memorization. They identify retrieval augmentation as a promising approach to significantly reduce the LLM's dependence on pre-training knowledge. Several subsequent works have built on this retrieval augmentation approach to address the long-tail problem in LLMs. For example, [25] propose a retrieve-then-rerank framework leveraging knowledge distillation (KD) from the LLM to tackle long-tail QA. However, their method involves tuning the language model, which is computationally expensive and impractical for black-box LLMs such as GPT-4 [1]. Another line of research focuses on augmenting the training set using GPT [140, 22, 91], followed by fine-tuning the retriever to enhance its performance. Nonetheless, determining which samples should be augmented remains challenging. Augmenting the training set based on seed sentences often introduces repetitive rather than diverse information, and incurs significant costs due to GPT queries. Therefore, in this chapter, rather than augmenting the training set for fine-tuning the retriever, we aim to train an effective retriever capable of selecting the most informative samples to augment the test query during inference.

3.3 Problem Formulation

In this chapter, we target in-context learning (ICL) for QA tasks including multiple-choice QA and open-ended QA from different domains. Suppose we have a training set T = {(x_i, y_i)}_{i=1}^N related to the query domain, where x is the question and y is the answer. Given a query problem p_i from a test set P and a K-shot inference budget, we retrieve K related samples E_i = {e_i^k = (x_i, y_i) | e_i^k ∈ T}_{k=1}^K and construct a prompt P(E_i, p_i) as input to the LLM:

P(E_i, p_i) = \pi(e_i^1) \oplus \cdots \oplus \pi(e_i^K) \oplus \pi(p_i, \cdot), \qquad (3.1)

where π is the template for each sample. The predicted answer from the LLM for question p_i is given by:

\hat{a}_i = \mathrm{LLM}(P(E_i, p_i)). \qquad (3.2)
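To make the prompt construction in Eq. (3.1) concrete, the following is a minimal Python sketch; the `format_example` template and the separator are illustrative assumptions rather than the exact template π used in our experiments.

```python
# Minimal sketch of the K-shot prompt P(E_i, p_i) from Eq. (3.1).
# `format_example` stands in for the per-sample template pi(.); its exact
# wording is an assumption, not the template used in this chapter.

def format_example(question, answer=None):
    """Render one (question, answer) pair; the query leaves the answer blank."""
    rendered = f"Question: {question}\nAnswer:"
    if answer is not None:
        rendered += f" {answer}"
    return rendered

def build_prompt(retrieved, query):
    """Concatenate the K retrieved examples followed by the unanswered query."""
    shots = [format_example(q, a) for q, a in retrieved]
    shots.append(format_example(query))   # pi(p_i, .): answer left for the LLM
    return "\n\n".join(shots)             # the "⊕" concatenation in Eq. (3.1)

# Toy usage:
retrieved_set = [("Where is the Eiffel Tower?", "Paris"),
                 ("Where is the Colosseum?", "Rome")]
print(build_prompt(retrieved_set, "Where is the Brandenburg Gate?"))
```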
3.4 Motivation: Uncertainty of In-context Learning

Due to the lack of knowledge of some specific domains during the pre-training stage, there exists long-tail knowledge that fails to be captured by the LLMs [67]. We define easy samples as queries that have been captured during the LLM's pre-training stage and are stored in its memorization. In contrast, hard samples refer to queries that the LLM failed to capture, which are more likely to represent long-tail data. We classify easy and hard samples using the zero-shot testing result \hat{a}_i = \mathrm{LLM}_{0\text{-shot}}(p_i):

P_{easy} = \{(p_i, a_i) \in P \mid \mathbb{1}(\hat{a}_i, a_i) = 1\}, \quad P_{hard} = \{(p_i, a_i) \in P \mid \mathbb{1}(\hat{a}_i, a_i) = -1\}, \qquad (3.3)

where the indicator function \mathbb{1}(\cdot) returns 1 if the predicted answer \hat{a}_i aligns with the ground-truth answer a_i, and otherwise returns −1.

According to [67], retrieval augmentation methods help alleviate the long-tail problem: when a retriever succeeds in finding the most relevant samples from the training set T, it reduces the LLM's need to hold a large amount of related knowledge in its memorization. However, our experiments revealed that LLMs exhibit higher uncertainty when presented with hard samples, regardless of the retrieval augmentation applied. Fig. 3.3 shows the ratios of uncertain samples that experienced a prediction change on five datasets. Given an inference budget K = 5, 21.84% of queries experience a prediction change when we increase from 0-shot to 5-shot. Among these uncertain queries, 87.18% are hard samples and 12.82% are easy samples when using BM25 retrieval [138].

Figure 3.2: Case study for uncertainty of ICL.
Figure 3.3: Uncertain sample ratios.

For hard samples, even a tiny variation in the retrieved set E can mislead the LLM's prediction. One case study for hard-sample queries from T-REx [32] is shown in Fig. 3.2. In this case, the LLM gives a correct answer with the first two informative samples in E, which effectively compensate for the LLM's missing long-tail knowledge. However, the answer becomes wrong when a third sample is added to the prompt, indicating that the newly added knowledge is misleading. Other cases showing the uncertain predictions of the LLM can be found in Fig. 3.7 in Section 3.6.4 and Table A.3 in the Appendix.

Given the uncertainty of in-context learning, our goal is to improve the prediction accuracy of hard samples while maintaining the prediction stability on easy samples. During testing, we lack prior knowledge to determine whether a query falls into the easy or hard category. The primary challenge, therefore, is to prevent the inclusion of misleading information in the retrieved set E, which could lead to incorrect predictions. Simultaneously, we must ensure that the retrieved samples are sufficiently informative to address long-tail knowledge gaps and guide the LLM toward the correct answer.

3.5 In-context Learning with Dynamic Uncertainty Ranking

In this section, we introduce a dynamic uncertainty ranking method built on a reinforcement learning-based retriever. This method adjusts the retriever by applying a dynamic threshold, lowering the rankings of misleading samples while elevating the rankings of informative and stable ones.

3.5.1 Retrieved Sample Selection

The original training set T is randomly divided into a validation set V and a candidate pool C, from which the retrieved sample set E is selected. Following [114], the retriever is built upon BERT [27] with a linear layer appended to the final pooling layer of the BERT model.
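As a rough illustration of this retriever architecture, the sketch below wires a frozen BERT encoder to a single trainable linear layer and exposes the softmax-normalized ranking score that will be defined in Eq. (3.4) below; the model name, output dimension, and helper names are our own assumptions, not the exact implementation.

```python
# Sketch of the retriever S_theta: a frozen BERT encoder with one trainable
# linear layer h(.) = W * BERT(.) + b on the pooled output. Model name and
# dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class Retriever(torch.nn.Module):
    def __init__(self, bert_name="bert-base-uncased", out_dim=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(bert_name)
        self.bert = AutoModel.from_pretrained(bert_name)
        for p in self.bert.parameters():   # keep the BERT backbone frozen
            p.requires_grad = False
        self.linear = torch.nn.Linear(self.bert.config.hidden_size, out_dim)

    def embed(self, texts):
        toks = self.tokenizer(texts, padding=True, truncation=True,
                              return_tensors="pt")
        with torch.no_grad():
            pooled = self.bert(**toks).pooler_output   # final pooled features
        return self.linear(pooled)                     # h(.) used in the score

    def scores(self, query, candidates):
        """Softmax-normalized ranking scores of the candidates for one query."""
        hq = self.embed([query])          # shape (1, out_dim)
        he = self.embed(candidates)       # shape (n, out_dim)
        return F.softmax(he @ hq.squeeze(0), dim=0)
```

Only `self.linear` receives gradients in this sketch, which matches the frozen-BERT training setup described next.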
During training, the BERT is frozen, and only the parameters θ = (W, b) of the linear layer are fine-tuned. Given a query p_i from the validation set V and a retrieved sample e_i from C, the ranking score of the retriever is obtained from the hidden logical similarity shared among samples:

S_\theta(e_i \mid p_i) = \frac{\exp[h(e_i) \cdot h(p_i)]}{\sum_{e_i' \in E} \exp[h(e_i') \cdot h(p_i)]}, \qquad (3.4)

where h(·) = W(BERT(·)) + b is the output of the linear layer. To ensure the diversity and similarity of retrieved samples, and to reduce the computational cost, we first adopt an off-the-shelf retriever, BM25 [138], to pre-select a small candidate set C′_i from the large candidate pool C, following [139, 148, 70]. Suppose the shot number is k. By selecting the samples with the top-k highest ranking scores under our retriever S_θ, we obtain the retrieved sample set E_i for p_i from the candidate pool C′_i as follows:

E_i = \{e_i^k \sim \text{Top-}k(S_\theta(e_i^k \mid p_i)) \mid e_i^k \in C_i'\}. \qquad (3.5)

The retrieval process at test time is the same as during training; the only difference is that the validation set V is replaced with the test set P.

3.5.2 Retriever Training

Motivated by the exploration in Section 3.4, to improve retrieval augmentation for both hard and easy samples, we introduce a dynamic ranking method that updates the retriever using feedback from the LLM, driven by its varying responses to each retrieved sample.

Decide maximum shot number. Before training, we first decide the maximum shot number for each validation sample p_i ∈ V. To achieve this, we define a maximum shot number budget K and a dynamic budget controller σ, initialized as 0, for the ranking scores S_θ. Only samples with ranking scores above the threshold σ will be selected to update the retriever. The maximum shot number k_i for p_i is:

k_i = \min(K, N_i^{\max}), \qquad (3.6)

where N_i^{\max} = |\{e_i^k \mid e_i^k \in C_i', \, S_\theta(e_i^k \mid p_i) > \sigma\}|.

Training process. Given the maximum shot number k_i, we then conduct inference for p_i from 0-shot to k_i-shot to capture the effect of each retrieved sample on the LLM. The 0-shot inference on p_i can be considered as a means of long-tail sample detection as defined in Eq. (3.3). If the model's answer is incorrect, the sample is classified as a hard sample (i.e., a long-tail sample), and the retrieved set should provide informative augmentation. Conversely, if the model produces the correct answer, the sample is classified as an easy sample, and the retrieved set should avoid introducing any misleading samples. We define the retrieved sample set for the j-shot inference as the top-j highest-scoring samples selected from the candidate pool C′_i:

E_i^j = \{e_i^k \sim \text{Top-}j(S_\theta(e_i^k \mid p_i)) \mid e_i^k \in C_i'\}, \quad j = 0, 1, \cdots, k_i. \qquad (3.7)

The prediction from the LLM based on E_i^j and p_i is generated according to Eq. (3.2) as \hat{a}_i^j = \mathrm{LLM}(P(E_i^j, p_i)). The retrieved sample's impact on the prediction is reflected by the reward function R(\hat{a}_i^j, a_i) = \mathbb{1}(\hat{a}_i^j, a_i), where a_i is the ground-truth answer for p_i and \mathbb{1}(\cdot) is the indicator function. Our training goal is to maximize the expected reward w.r.t. the parameters of the retriever using the Policy Gradient method [150]. Since the expected reward cannot be computed in closed form, following [114], we compute an unbiased estimation with Monte Carlo sampling:

\mathbb{E}_{e_i \sim S_\theta(e_i \mid p_i)}[R(\hat{a}_i, a_i)] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k_i} R(\hat{a}_i^j, a_i), \qquad (3.8)

where N is the batch number yielded from V.
Following the REINFORCE policy gradient [177], we update the retriever using:

\nabla \mathbb{E}_{e_i \sim S_\theta(e_i \mid p_i)}[R(\hat{a}_i, a_i)] = \mathbb{E}_{e_i \sim S_\theta(e_i \mid p_i)}\big[\nabla_\theta \log(S_\theta(e_i \mid p_i)) R(\hat{a}_i, a_i)\big] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k_i} \nabla_\theta \log(S_\theta(e_i^j \mid p_i)) R(\hat{a}_i^j, a_i), \qquad (3.9)

where e_i^j = E_i^j − E_i^{j−1} is the difference between the retrieved sets for the j-shot and (j − 1)-shot inferences. This approach incorporates the dynamic influence of each retrieved sample on the LLM, providing better handling of uncertainty in ICL. Specifically, retrieved samples that yield correct predictions (R(·) = 1) are treated as informative and contribute to augmenting long-tail knowledge, thus receiving a higher ranking. Conversely, retrieved samples that lead to incorrect predictions (R(·) = −1) are considered misleading and are ranked lower.

Update budget controller σ. To increase training efficiency and reduce the cost of querying the LLM, we also update the threshold σ, which serves as a budget controller, at the turning point of a prediction change, decreasing the number of inference calls while maintaining the effect of our training strategy. Specifically, we focus on a special case: when the LLM experiences a prediction change from true to false, i.e., R(\hat{a}_i^{j−1}, a_i) = 1 and R(\hat{a}_i^j, a_i) = −1. In this case, the first (j − 1) samples have a positive impact on the inference of the LLM, while the j-th sample has a negative impact. Thus, we update the threshold σ as the maximum value of the ranking score over the samples in E_i^{k_i} that were not selected in the (j − 1)-shot round, as follows:

\sigma = \max\big(S_\theta(e_i^k \mid p_i)\big), \quad e_i^k \in E_i^{k_i} − E_i^{j−1}. \qquad (3.10)

Since we only select samples with ranking scores larger than σ, as shown in Eq. (3.6), the retrieved samples that serve as good compensation for long-tail knowledge will be ranked higher and used for updating the retriever more frequently. Note that updating σ does not wipe out the updates for misleading samples, as the turning point of the prediction change differs for each validation sample. Without affecting our original training strategy, we thus improve efficiency and reduce the querying cost. Our algorithm is summarized in Algorithm 2.

Algorithm 2 ICL with dynamic uncertainty ranking
1: Input: Retriever S_θ, training set T, maximum shot number K.
2: Output: Trained retriever S_θ.
3: Randomly split T into V and C.
4: Initialize θ ← θ_0, threshold σ ← 0.
5: for V_batch ∈ V do
6:   Initialize batch loss L ← 0.
7:   for each validation sample p_i ∈ V_batch do
8:     Pre-select C′_i from C using BM25 for p_i.
9:     Calculate the maximum shot number k_i based on σ using Eq. (3.6).
10:    for j = 0, 1, · · · , k_i do
11:      Get the retrieved set E_i^j using Eq. (3.7).
12:      Get prediction \hat{a}_i^j = LLM(P(E_i^j, p_i)).
13:      Get reward R(\hat{a}_i^j, a_i) = \mathbb{1}(\hat{a}_i^j, a_i).
14:      L ← L − R(\hat{a}_i^j, a_i) · log(S_θ(e_i^j | p_i)).
15:      if R(\hat{a}_i^j, a_i) = −1 and R(\hat{a}_i^{j−1}, a_i) = 1 then
16:        Update σ using Eq. (3.10).
17:      end if
18:    end for
19:  end for
20:  Optimize L w.r.t. θ using Eq. (3.9).
21: end for

3.6 Experiments

In this section, we first introduce the experiment setup and then show the effectiveness of our method through various empirical results.

3.6.1 Experimental Setup

Datasets: We conduct the experiments on QA datasets from different domains, including three multi-choice datasets: the biomedical dataset Pubmedqa [63], the speech detection dataset ethos-national [126], and the climate change dataset eval-climate [11], and two open-ended QA datasets: T-REx [32] and NaturalQuestions (NatQA) [78]. More dataset details can be found in ?? 11.
Baselines: We compare our method with six baselines, including 0-shot inference and five few-shot retrieval augmentation methods. The retrieval augmentation methods are as follows: 1) Random sampling: selecting ICL samples from the candidate set, a widely adopted practice in many ICL studies [176, 174]; 2) BM25 [138]: an off-the-shelf sparse retriever; 3) SuRe [70]: first use GPT to summarize the retrieved passages from BM25 for multiple answer candidates, then determines the most plausible answer by evaluating and ranking the generated summaries; 4) Rerank [148]: use GPT to rerank samples retrieved by BM25; 5) PromptPG [114]: a BERT-based dense retriever trained using reinforcement learning based on the feedback from GPT. Evaluation: For multi-choice QA, we use accuracy for evaluation. For open-ended QA, we use Normalized Exact Match (NEM), which evaluates whether the normalized string output by the inference LLM is identical to the reference string. Implementation: The LLM used in our experiment is GPT-4 [1]. Due to the limited data size in tweet_eval-stance_climate, the training set is split into 50 candidate samples and 150 validation samples. For the other datasets, we use 1000 samples in the candidate pool and 200 samples in the validation set. All methods share the same train-test split. The number of pre-selected samples in C′ is set to 20 by default for both the training and testing stages. For the few-shot case, the shot number is set to 5, unless otherwise specified. During the training of our method, the maximum shot number budget K is also set to 5. The batch size is set to 20. Experiments for all test datasets are repeated 3 times with different seeds, and the average accuracy is reported in the results. 40 (a) Easy sample accuracy. (b) Hard sample accuracy. Figure 3.4: Accuracy on easy and hard samples for proposed method and baselines. 3.6.2 Main Results Table 3.1 presents the mean and standard deviation (std) of accuracy for our proposed method and the baselines across five QA datasets. Our approach outperformed all baselines across tasks, with an average improvement of 2.97% ranging from 1.67% to 3.25% over the best baseline. The trained retriever PromptPG gives the most uncertain prediction with a std of 2.05%. Although our method is based on PromptPG, by giving informative and stable samples higher ranks, we not only improve the overall accuracy but also decrease std to 1.09%, comparable to 0-shot inference. Retrieval Method 0-shot Random sampling BM25 SuRe Rerank PromptPG Ours Dataset Pubmedqa 56.32 ± 1.08 72.87 ± 0.31 64.72 ± 1.70 78.20 ± 0.53 73.22 ± 0.69 78.93 ± 0.31 62.97 ± 1.00 78.93 ± 0.42 73.43 ± 1.01 78.93 ± 0.42 68.10 ± 2.05 78.47 ± 0.90 80.60 ± 0.35 92.40 ± 0.20 85.37 ± 0.32 65.00 ± 2.69 57.60 ± 1.91 76.19 ± 1.09 ethos-national 75.61 ± 0.51 75.17 ± 1.01 87.47 ± 0.39 85.23 ± 0.33 89.15 ± 0.39 77.74 ± 2.16 T-REx 42.60 ± 2.36 57.13 ± 1.97 62.13 ± 1.33 39.80 ± 0.57 62.07 ± 2.01 60.73 ± 3.21 NatQA 44.20 ± 1.91 46.80 ± 1.44 55.00 ± 1.14 32.00 ± 3.40 53.80 ± 1.91 50.80 ± 2.00 eval-climate 46.30 ± 0.32 66.30 ± 3.53 82.57 ± 0.30 78.89 ± 0.30 83.22 ± 0.32 72.78 ± 2.00 Avg Table 3.1: Comparison results between proposed methods and baselines on QA tasks from different domains. We further investigate the accuracy of easy and hard samples in Fig. 3.4. As illustrated in Eq. (3.3), the easy/hard sample classification is decided by the 0-shot inference results, and the hard samples can be considered as long-tail questions of GPT-4. 
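To make this zero-shot easy/hard split concrete, a minimal sketch is given below; the `llm_answer` helper and the string normalization are hypothetical stand-ins for the actual GPT-4 call and answer matching used in our experiments.

```python
# Sketch of the zero-shot easy/hard split in Eq. (3.3).
# `llm_answer(question)` is a hypothetical helper returning the LLM's
# 0-shot answer string; the normalization step is an assumption.

def normalize(text):
    return " ".join(text.lower().strip().split())

def split_easy_hard(test_set, llm_answer):
    easy, hard = [], []
    for question, gold in test_set:
        pred = llm_answer(question)              # 0-shot prediction
        if normalize(pred) == normalize(gold):   # indicator equals 1
            easy.append((question, gold))
        else:                                    # indicator equals -1
            hard.append((question, gold))
    return easy, hard

# Toy usage with a stub in place of the LLM:
easy, hard = split_easy_hard([("2 + 2 = ?", "4")], lambda q: "4")
```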
First, we observe a similar pattern to [67] that retrieval augmentation greatly improves the accuracy of long- tail samples. This could come from various aspects of augmented samples—such as label 41 space, input text distribution, and sequence format—that collectively improve final predic- tions [123]. Compared with 0-shot inference, even random sampling improves accuracy on hard samples from 0% to 29.17%. However, retrieval augmentation is highly dependent on the quality of the retrieval set. By retrieving the most similar samples, BM25 achieves an accuracy of 46.12%. Rerank further improves the accuracy to 48.03%. Our method includes the most informative samples based on the sample-wise feedback from LLM, and improves the accuracy on hard samples to 53.99%, which surpasses the best baseline with a large average margin of 5.96% ranging from 2.69% to 8.11%, while maintaining the accuracy on easy samples. 3.6.3 Ablation Studies Effects of different components. We verify the effectiveness of two components of our proposed method: uncertainty rank and pre-selection in Table 3.2. We first compared the uncertainty rank (UR) strategy with another trained retriever PromptPG which shared the same retriever architecture as ours. We improve the accuracy by 3.47% and 1.63% for two different datasets. PromptPG adjusts the ranking of candidate samples based on the feedback on the entire retrieved set for the validation samples, while UR raises the ranks for informative and stable samples and lowers the ranks for misleading samples based on the sample-wise feedback from LLMs. UR avoids the condition when misleading samples are included and negatively changes the answer from true to false. In this way, UR greatly enhances the retrieved sample set for augmentation. The second component pre-selection (PS) improves the results of both PromptPG and UR by selecting more diverse and similar related samples in the candidate set C′. Then the second step retrieval can select samples from a smaller candidate pool of higher quality. By combining these two components together, we can achieve an overall improvement of 14.66% and 2.13% for two different datasets. The improvement on ethos-national is more significant than Pubmedqa because the predicted answer on ethos-national is more uncertain 42 Dataset ethos-national Pubmedqa PromptPG UR PromptPG+PS UR+PS (Ours) 77.74 78.47 81.21 80.10 86.91 79.10 92.40 80.60 Table 3.2: Effects of different components. PS denotes pre-selection. UR denotes uncertainty rank. given different combinations of retrieved samples. (a) T-REx. (b) NatQA. Figure 3.5: Effects of different number of shots. Figure 3.6: Effects of different pre-select numbers. Effects of different number of shots. We show the effects of different shot numbers for two datasets in Fig. 3.5 where our method consistently outperforms other baselines. For NatQA, the accuracy of random sampling and PromptPG retrieval does not monotonically increase with shot number due to low-quality, misleading samples, which can degrade perfor- mance. In contrast, our method prioritizes high-quality samples, and as the number of shots Figure 3.7: Case study for retrieved samples of hard samples. 43 increases, the advantages of our algorithm become more pronounced, resulting in improved accuracy. Effects of different number of pre-selection samples. In Fig. 3.6, we investigate how the number of pre-seletion samples impacts our algorithm. For both datasets, the accuracy first increases and then decreases. 
If too few samples are selected, the candidate pool C′ for our reinforcement learning -based ranking stage lacks diversity, limiting the policy gradient strategy’s action space. Consequently, the learned retriever struggles to find the most informative samples. If the number is too large, C′ includes many irrelevant samples, making it difficult for the policy gradient strategy to learn an optimal solution in the large search space [114]. This can lead the retriever to capture irrelevant or misleading information. 3.6.4 Case Study To intuitively show the effectiveness of our proposed method on hard samples, we show one case on Pubmedqa by comparing the retrieved samples of PromptPG retriever and our re- triever in Fig. 3.7. According to this case, the two retrieved sets even have three overlap samples (marked as the same color), but the prediction is completely different. PromptPG gives a wrong prediction answer, while our method delivers the right answer. This result verifies that GPT-4 gives uncertain predictions on long-tail samples. Since 0-shot inference gives a wrong prediction answer on this query question, the informative augmented infor- mation can be contained in the retrieved set of our method (see right column), while for PromptPG, misleading information can be contained in the two samples that do not inter- sect with our retriever set (see left column), which shifts the predicted answer from true to false. Compared with PromptPG, our retriever ranks the three overlapped samples higher and gives two more informative samples. With the combination effect of these two, our method gives the correct prediction. More cases on hard samples from other datasets can be found in Table A.4 in the appendix. 44 (a) Shot # w.r.t. the batch. (b) Shot # of one epoch. Figure 3.8: Efficiency analysis. Figure 3.9: Training loss w.r.t. batch. 3.6.5 Efficiency Analysis Query cost. We set threshold σ as the budget controller to reduce the cost of the querying GPT-4. Since the query cost depends on token length, we compare the query costs of our method and PromptPG (both trained based on GPT-4) in Fig. 3.8a. Specifically, we calculate the total number of shots included in each query during training for each batch within one epoch for both methods. The blue dash line shows the total shot number of PromptPG for all datasets, since the batch size is 20, and the shot number is fixed at 5, the total shot number is fixed at 100 for each batch. According to the results, only batch 0 of our method surpasses PromptPG with a total shot count of 300. For subsequent batches, as the threshold σ is adjusted based on changes in the LLM’s predictions, the query shot count drops significantly, resulting in the total shot count consistently being lower than that of PromptPG. Aggregating the shot numbers across 10 batches, our method achieves only 33.8%, 65.2%, and 35.3% of the shot count of PromptPG on Pubmedqa, ethos-national, and NatQA, respectively as shown in Fig. 3.8b. Thus, in conjunction with the accuracy comparison presented in Table 3.1, our approach not only enhances query accuracy but also reduces the overall query cost. Convergence speed. We empirically demonstrate the convergence speed by showing training loss curves in Fig. 3.9. According to the results, the training loss quickly converges to a small value close to 0 within 15 batchs, which verify the high computational efficiency of our method. 45 3.6.6 Transferability Analysis We investigate the transferability of our retriever in Table 3.3. 
We use our retriever trained on dataset ethos-national, and evaluate its cross-domain effectiveness across the rest of the four datasets. Although the cross-domain results are still slightly inferior to the in-domain results, the performance gap is minimal, averaging only 0.98%. Furthermore, the cross- domain results outperform the best baseline. These findings indicate that our trained ranking strategy is transferable to other datasets, providing a cost-effective alternative to retraining. Best baseline Ours: cross-domain Ours: in-domain Pubmedqa 78.93 79.60 80.60 eval-climate NatQA T-REx Avg 69.82 55.00 71.16 57.20 72.14 57.60 83.22 83.33 85.37 62.13 64.50 65.00 Table 3.3: Transferability of our method. 3.7 Limitations There are several limitations of this work. First, our method do not consider the effect of different orders within the retrieved set and rank the retrieved samples according to their ranking scores. Future works can be extended based on our work by considering different inner order within the retrieved set and their effect on the prediction results. Second, although our experimental results show that our method greatly improves the prediction accuracy on long-tail samples, our method cannot handle query cases with no related knowledge either in the pre-training set or candidate pool. Third, our method focused on QA tasks using LLM. For future work, our method can be extended to other tasks such as summarization, translation, and recommendation as follows. Since our method is to train a reranker based on the reward signal from LLM, to adapt to other tasks, we can modify the evaluation score that is used to determine the reward. If the accuracy of the LLM’s predicted answer is unavailable, alternative metrics such as BLEU 46 and ROUGE can be used to assess the consistency between the prediction and the ground truth. A threshold can then be set for these scores, where values exceeding the threshold yield a positive reward, while lower values result in a negative reward. 3.8 Summary In this chapter, to improve the uncertain prediction of LLMs on long-tail knowledge, we propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval- augmented ICL with a budget controller. Specifically, it considers the dynamic impact of each retrieved sample based on the LLM’s feedback. Our ranking system system raises the ranks of more informative and stable samples and lower the ranks of misleading samples efficiently. Evaluations of various QA datasets from different domains show that our proposed method outperformed all the baselines, and especially improve the LLM’s prediction on the long-tail questions. 47 CHAPTER 4 TURNING THE CURSE OF HETEROGENEITY IN FEDERATED LEARNING INTO A BLESSING FOR OUT-OF-DISTRIBUTION DETECTION This chapter is based on the following work: Turning the Curse of Heterogeneity in Federated Learning into a Blessing for Out-of- Dis- tribution Detection. Shuyang Yu, Junyuan Hong, Haotao Wang, Zhangyang Wang, Jiayu Zhou. 2023. 2023 International Conference on Learning Representations (ICLR). 4.1 Introduction Deep neural networks (DNNs) have demonstrated exciting predictive performance in many challenging machine learning tasks and have transformed various industries through their powerful prediction capability. However, it is well-known that DNNs tend to make overcon- fident predictions about what they do not know. 
Given an out-of-distribution (OoD) test sample that does not belong to any training classes, DNNs may predict it as one of the training classes with high confidence, which is doomed to be wrong [52, 53, 51]. To alleviate the overconfidence issue, various approaches are proposed to learn OoD awareness which facilitates the test-time detection of such OoD samples during training. Recent approaches are mostly achieved by regularizing the learning process via OoD sam- ples. Depending on the sources of such samples, the approaches can be classified into two categories: 1) the real-data approaches rely on a large volume of real outliers for model regu- larization [53, 125, 193]; 2) the synthetic approaches use ID data to synthesize OoD samples, in which a representative approach is the virtual outlier synthesis (VOS) [31]. While both approaches are shown effective in centralized training, they cannot be easily incorporated into federated learning, where multiple local clients cooperatively train a high- quality centralized model without sharing their raw data [74], as shown by our experimental results in Section 4.5.2. On the one hand, the real-data approaches require substantial real outliers, which can be costly or even infeasible to obtain, given the limited resources of local 48 clients. On the other hand, the limited amount of data available in local devices is usually far from being sufficient for synthetic approaches to generate effective virtual OoD samples. Practical federated learning approaches often suffer from the curse of heterogeneous data in clients, where non-iid [94] collaborators cause a huge pain in both the learning process and model performance in FL [95]. Our key intuition is to turn the curse of data heterogeneity into a blessing for OoD detection: The heterogeneous training data distribution in FL may provide a unique opportunity for the clients to communicate knowledge outside their training distributions and learn OoD awareness. A major obstacle to achieving this goal, however, is the stringent privacy requirement of FL. FL clients cannot directly share their data with collaborators. This motivates the key research question: How to learn OoD awareness from non-iid federated collaborators while maintaining the data confidentiality requirements in federated learning? In this chapter, we tackle this challenge and propose Federated Out-of-distribution Syn- ThesizER (Foster) to facilitate OoD learning in FL. The proposed approach leverages non-iid data from clients to synthesize virtual OoD samples in a privacy-preserving manner. Specifically, we consider the common learning setting of class non-iid [94], and each client extracts the external class knowledge from other non-iid clients. The server first learns a virtual OoD sample synthesizer utilizing the global classifier, which is then broadcast to local clients to generate their own virtual OoD samples. The proposed Foster promotes diversity of the generated OoD samples by incorporating Gaussian noise, and ensures their hardness by sampling from the low-likelihood region of the class-conditional distribution estimated. Extensive empirical results show that by extracting only external-class knowledge, Foster outperforms the state-of-out for OoD benchmark detection tasks. 
The main contributions of our work can be summarized as follows: • We propose a novel federated OoD synthesizer to take advantage of data heterogeneity to facilitate OoD detection in FL, allowing a client to learn external class knowledge from other non-iid federated collaborators in a privacy-aware manner. Our work bridges 49 a critical research gap since OoD detection for FL is currently not yet well-studied in literature. To our knowledge, the proposed Foster is the first OoD learning method for FL that does not require real OoD samples. • The proposed Foster achieves the state-of-art performance using only limited ID data stored in each local device, as compared to existing approaches that demand a large volume of OoD samples. • The design of Foster considers both the diversity and hardness of virtual OoD samples, making them closely resemble real OoD samples from other non-iid collaborators. • As a general OoD detection framework for FL, the proposed Foster remains effective in more challenging FL settings, where the entire parameter sharing process is prohibited due to privacy or communication concerns. This is because that Foster only used the classifier head for extracting external data knowledge. 4.2 Related Work OoD detection. Existing OoD detection methods are mainly from two complementary perspectives. The first perspective focused on post hoc. Specifically, [52] first introduced a baseline utilizing maximum softmax distribution probabilities (MSP). Based on this work, many improvements have been made by follow-up works in recent years, such as the cali- brated softmax score (ODIN) [105], Mahalanobis distance [84], energy score [108], Likelihood Regret [180], Confusion Log Probability (CLP) score [178], adjusted energy score [106], k-th nearest neighbor (KNN) [149], and Virtual-logit Matching (ViM) [168]. Compared with post hoc methods, Foster can dynamically shape the uncertainty surface between ID and OoD samples. Different post hoc methods are also applied in our experiment section as baselines. Another perspective tends to detect OoD samples by regularization during training, in which OoD samples are essential. The OoD samples used for regularization can be either real OoD samples or virtual synthetic OoD samples. Real OoD samples are usually natural 50 auxiliary datasets [53, 125, 193]. However, real OoD samples are usually costly to collect or infeasible to obtain, especially for terminals with limited sources. Regularization method utilizing virtual synthetic OoD samples do not rely on real outliers. [43] trained a generative model to obtain the synthetic OoD samples. [64] detect samples with different distributions by standardizing the max logits without utilizing any external datasets. [152, 142] proposed contrastive learning methods that also does not rely on real OoD samples. [31] proposed VOS to synthesize virtual OoD samples based on the low-likelihood region of the class-conditional Gaussian distribution. Current state-of-the-art virtual OoD methods are usually thirsty for ID data, which is not sufficient enough for local clients. Compared with these existing methods, the proposed Foster can detect OoD samples with limited ID data stored in each local device, without relying on any auxiliary OoD datasets. Federated Learning. Federated learning (FL) is an effective machine learning setting that enables multiple local clients to cooperatively train a high-quality centralized mode [74]. 
FedAvg [121], as a classical FL model, performs model averaging of the distributed local models of all clients, and is highly effective at reducing the communication cost. Based on FedAvg, many variants [170, 12] have been proposed to address problems arising in FedAvg, such as convergence [66, 134], heterogeneity [95, 59, 68, 198], and communication efficiency [136]. Among these problems, although data heterogeneity degrades performance on ID data, it offers a great opportunity to learn from the external data of other non-iid collaborators. Even though Foster uses the FedAvg framework, as a general OoD detection method for FL it can also be applied to other variants of FedAvg.

4.3 Problem Formulation

In this chapter, we consider classification tasks in heterogeneous FL settings, where non-iid clients have their own label sets for training and testing samples. Our goal is to achieve OoD-awareness on each client in this setting.

OoD training. The OoD detection problem is rooted in general supervised learning, where we learn a classifier mapping from the instance space X to the label space Y. Formally, we define a learning task by the composition of a data distribution D ⊂ X and a ground-truth labeling oracle c∗ : X → Y. Then any x ∼ D is denoted as in-distribution (ID) data, and otherwise, x ∼ Q ⊂ X\D as out-of-distribution data. Hence, an ideal OoD detection oracle can be formulated as a binary classifier q∗(x) = I(x ∼ D), where I is an indicator function yielding 1 for ID samples and −1 for OoD samples. With these notations, we define the OoD learning task as T := ⟨D, Q, c∗⟩. To parameterize the labeling and OoD oracles, we use a neural network consisting of two stacked components: a feature extractor f : X → Z governed by θ^f, and a classifier h : Z → Y governed by θ^h, where Z is the latent feature space. For ease of notation, let h_i(z) denote the predicted logit for class i = 1, . . . , c on an extracted feature z ∼ Z. We unify the parameters of the classifier as θ = (θ^f, θ^h). We then formulate the OoD training as minimizing the following loss on the task T:

J_T(\theta) := \mathbb{E}_{x\sim D}\big[\ell_{CE}\big(h(f(x; \theta^f); \theta^h), c^*(x)\big)\big] + \lambda\, \mathbb{E}_{x'\sim Q}\big[\ell_{OE}\big(f(x'; \theta^f); \theta^h\big)\big],

where ℓ_CE is the cross-entropy loss for supervised learning and ℓ_OE is for OoD regularization. We use E[·] to denote the expectation, estimated by the empirical average over samples in practice. The non-negative hyper-parameter λ trades off the OoD sensitivity in training. We follow the classic OoD training method, Outlier Exposure [53], to define the OoD regularization for the classification problem as

\ell_{OE}(z'; \theta^h) := E(z'; \theta^h) - \sum_{i=1}^{c} h_i(z'; \theta^h), \qquad (4.1)

where E(z'; \theta^h) = -T \log \sum_{i=1}^{c} e^{h_i(z'; \theta^h)/T} is the energy function, given the temperature parameter T > 0. At test time, we approximate the OoD oracle q∗ by the MSP score [52].

Heterogeneous federated learning (FL) is a distributed learning framework involving multiple clients with non-iid data. There are different non-iid settings [94, 98], and in this chapter, we follow a popular setting in which the non-iid property only concerns the classes [94]. Given K clients, we define the corresponding set of tasks {T_k}_{k=1}^K, where T_k = ⟨D_k, Q_k, c^*_k⟩, and the labeling oracles c^*_k : X → Y^k are non-identical for different k, resulting in non-identical D_k. Each Y^k is a subset of the global label set Y.
Since the heterogeneity is known to harm the convergence and performance of FL [185], we adopt a simple personalized FL solution to mitigate the negative impact, where each client uses a personalized classifier head h_k upon a global feature extractor f [4]. This gives the general objective of FL: \min_\theta \frac{1}{K}\sum_{k=1}^{K} J_{T_k}(\theta). The optimization problem can be solved alternately in two steps: 1) local minimization of the objective on local data and 2) aggregation by averaging client models. In this chapter, we assume that each client only learns the classes it sees locally during training, because updating classifier parameters for unseen classes has no data support, and doing so will almost certainly harm the performance of FL. To see this, Diao et al. showed that masking out unseen classes in the cross-entropy loss can benefit FL training [30].

Challenges. When we formulate OoD training in FL, the major challenge is defining the OoD dataset Q_k, which does not come for free. The centralized OoD detection of VOS assumes Q_k is at the tail of an estimated Gaussian distribution of D_k [31], which requires an enormous number of examples from D_k for an accurate estimation of the parameters. However, such a requirement is usually not feasible for a client per se, and the construction of Q_k remains a challenging question.

4.4 Method

In this section, we first introduce the intuition of our proposed Foster, then elaborate on how to synthesize virtual external-class data and how to avoid the hardness fading of the virtual OoD samples. The proposed framework is illustrated in Fig. 4.1.

Figure 4.1: The framework of Foster. In step 1, to extract external class knowledge from local clients, the server first trains a generator utilizing the global classifier based on a cross-entropy objective function J(w) (Eq. (4.2)). In step 2, each local client utilizes the received generator to generate its own external class data z. To preserve the hardness of the virtual OoD samples, we also sample virtual outliers v_k from the low-likelihood region of the class-conditional distribution estimated for the generated OoD samples. The virtual OoD samples v_k are used for regularization of the local client objective J(θ_k) (Eq. (4.5)).

4.4.1 Natural OoD Data in Non-iid FL

Recent advances show promising OoD detection performance by incorporating OoD samples during the training phase; however, OoD detection in FL is largely overlooked. In FL, each client does not have access to a large volume of real OoD samples because it can be costly or even infeasible to obtain such data for resource-constrained devices. As such, an OoD training method for FL that relies on few or even no real OoD examples is strongly desired. Novel to this work, we notice that data from classes outside the local class set, namely external-class data, are natural OoD samples w.r.t. the local data and can serve as OoD surrogate samples in OoD training. As shown in Fig. 4.2, training with external-class data achieves better OoD detection performance than normal training and VOS, since the scores of ID and real OoD data are well separated. Besides, compared to the real OoD datasets adopted in prior art, external-class samples are likely to be nearer to the ID data, since they are sampled from similar feature distributions (refer to (a) and (b) in Fig. 4.2).
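To illustrate how such external-class features can be plugged into the OoD regularizer ℓ_OE of Eq. (4.1), the following is a minimal sketch with toy dimensions; it is not the exact Foster training code.

```python
# Sketch of the energy-based OoD regularization l_OE in Eq. (4.1), applied to
# external-class (surrogate OoD) features. Shapes and the temperature are
# illustrative assumptions.
import torch

def energy(logits, T=1.0):
    """E(z'; theta_h) = -T * log sum_i exp(h_i(z') / T)."""
    return -T * torch.logsumexp(logits / T, dim=1)

def oe_loss(logits_ood, T=1.0):
    """l_OE = E(z') - sum_i h_i(z'), averaged over the surrogate OoD batch."""
    return (energy(logits_ood, T) - logits_ood.sum(dim=1)).mean()

# Toy usage: a classifier head h: Z -> Y applied to surrogate OoD features.
classifier_head = torch.nn.Linear(64, 10)   # toy feature and class dimensions
z_external = torch.randn(32, 64)            # stand-in external-class features
regularization = oe_loss(classifier_head(z_external))
```

In the OoD training objective J_T(θ) defined in Section 4.3, this term enters with weight λ alongside the cross-entropy loss on the client's ID data.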
4.4.2 Synthesizing External-Class Data from Global Classifier

Though using external-class data as an OoD surrogate is attractive and intuitive, it is not feasible in FL to directly collect such data from other non-iid clients, due to privacy concerns and the high communication costs of data sharing.

Figure 4.2: The density of the negative energy score for OoD detection evaluation using the dataset Textures, under (a) normal training, (b) training w/ VOS, and (c) training w/ external-class data. We use 5 ID classes and 5 external classes of CIFAR-10.

We thereby propose to generate samples from the desired classes leveraging the class information encoded in the global classifier head. Given the global classifier H : Z → Y parameterized by θ_g^h, we utilize a w-governed conditional generative network G_w : Y → Z to generate samples from specified classes on clients' demand. As such, we solve the following optimization problem:

\min_w J(w) := \mathbb{E}_{y\sim p(y)}\, \mathbb{E}_{z \sim G_w(z\mid y,\epsilon)}\big[\ell_{CE}(H(z; \theta_g^h), y)\big], \qquad (4.2)

where p(y) is the ground-truth prior, which is assumed to be a uniform distribution here. We follow the common practice of generative networks [198] and let y be a one-hot encoding vector, where the target class entry is 1 and the others are 0. To encourage the diversity of the generator outputs G(z|y), we use a Gaussian noise vector ϵ ∼ N(0, I) to reparameterize the one-hot encoding vector during the generating process, following prior practice [72]. Thus, G_w(z|y) ≡ G_w(y, ϵ | ϵ ∼ N(0, I)) given y ∼ Y, where Y is the global label set. The generator training process is illustrated in Fig. 4.1, Step 1.

Then, for local training (see Fig. 4.1, Step 2), by downloading the global generator as a substitute for Q_k, each local client indexed by k can generate virtual OoD samples given an arbitrary external class set Ȳ^k = Y\Y^k. In the feature space, we denote the virtual samples as z ∼ G_w(z|y, ϵ) given y ∼ Ȳ^k.

4.4.3 Filtering Virtual External-Class Samples

Although the synthesized features are intuitively conditioned on external classes, the quality of the generated OoD samples may vary across iterations, likely because of the lack of two properties: (1) Diversity. Like traditional generative models [144, 156], the trained conditional generator may suffer from mode collapse [119] within a class, where the generator can only produce a small subset of the distribution. As a result, the effective synthesized OoD samples will be mostly unchanged, and OoD training will suffer from the lack of diverse samples. (2) Hardness. For a client, its internal and external classes may co-exist on another client, which gradually enlarges the between-class margins. As the FL training proceeds, the class-conditioned synthetic OoD samples become increasingly easy for the model to memorize, namely, to overfit. In other words, the hardness of the OoD examples declines over time.

(1) Encourage OoD diversity by tail sampling. As mode collapse happens in the high-density area of a class, samples that approximate the class but have larger variance are preferred for higher diversity. For this purpose, we seek samples of low but non-zero probability from the distribution of the external classes. Specifically, for each client, we first assume that the set of virtual OoD representations {z_{ki} ∼ G(z|y_i, ϵ) | y_i ∼ Ȳ^k, ϵ ∼ N(0, I)}_{i=1}^{N_k} forms a class-conditional multivariate Gaussian distribution p(z_k | y_k = c) = N(μ_k^c, Σ_k), where μ_k^c is the Gaussian mean of samples from the external class set Ȳ^k for client k, and Σ_k is the tied covariance matrix.
The parameters of the class-conditional Gaussian can be estimated using the empirical mean and variance of the virtual external-class samples:

\hat{\mu}_k^c = \frac{1}{N_k^c} \sum_{i:y_i=c} z_{ki}, \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_{c} \sum_{i:y_i=c} (z_{ki} - \hat{\mu}_k^c)(z_{ki} - \hat{\mu}_k^c)^T, \qquad (4.3)

where N_k is the number of samples and N_k^c is the number of samples of class c in the virtual OoD set. Then, we select the virtual outliers falling into the ε-likelihood region as:

V_k^c = \Big\{ v_k^c \,\Big|\, \varepsilon_0 < \frac{1}{(2\pi)^{d/2} |\hat{\Sigma}_k|^{1/2}} \exp\Big(-\frac{1}{2}(v_k^c - \hat{\mu}_k^c)^T \hat{\Sigma}_k^{-1} (v_k^c - \hat{\mu}_k^c)\Big) < \varepsilon, \; v_k^c \sim G(\cdot \mid y = c, \epsilon) \Big\}, \qquad (4.4)

where ε_0 ensures the sample is not totally random, and a small ε pushes the generated v_k^c away from the mean of the external class in favor of sampling diversity.

Algorithm 3 Federated Out-of-Distribution Synthesizer (Foster)
1: Require: Tasks {T_k}_{k=1}^K; global parameters θ_g, local parameters {θ_k}_{k=1}^K; global generator parameter w; learning rates α, β, local steps T, ID batch size B, OE batch size B_OE.
2: repeat
3:   Server selects active clients A uniformly at random, then broadcasts θ, w to A.
4:   for all users k ∈ A in parallel do
5:     Initialize local parameters θ_k ← θ.
6:     for t = 1, . . . , T do
7:       Sample {(x_i, y_i)}_{i=1}^B ∼ D_k and Z_OE = {z_{ki} ∼ G(z|y_i, ϵ) | y_i ∼ Ȳ^k, ϵ ∼ N(0, I)}_{i=1}^{B_OE}.
8:       Estimate the multivariate Gaussian distributions based on Z_OE by Eq. (4.3).
9:       Filter virtual external-class samples according to Eq. (4.4).
10:      θ_k ← θ_k − β∇_{θ_k} J(θ_k).   ▷ Optimize Eq. (4.5)
11:    end for
12:    Client k sends θ_k back to the server.
13:  end for
14:  Server updates θ_g ← (1/|A|) Σ_{k∈A} θ_k.
15:  for t = 1, . . . , T do
16:    w ← w − α∇_w J(w).   ▷ Optimize Eq. (4.2)
17:  end for
18: until training stop

(2) Increase the hardness by soft labels. To counteract the enlarged margin between internal and external classes, we control the condition inputs to the generator such that the generated samples are closer to the internal classes. Given a one-hot encoding label vector y of class c, we assign 1 − δ to the c-th entry and a random value within (0, δ) to the rest of the positions, where δ ∈ (0, 0.5).

In summary, given an observable \hat{D}_k, we formulate the local optimization of Foster as:

\min_{\theta_k} J(\theta_k) := \frac{1}{|\hat{D}_k|} \sum_{x_i \in \hat{D}_k} \ell_{CE}\big(h_k(f(x_i; \theta_k^f); \theta_k^h), c^*(x_i)\big) + \lambda \frac{1}{|V_k|} \sum_{v_k \in V_k} \ell_{OE}(v_k), \qquad (4.5)

and the overall framework of our algorithm is summarized in Algorithm 3. The major difference from FedAvg is that we introduce a generator for OoD outlier synthesis. Since the generator is trained on the server, the computational overhead for the client is marginal, involving only the inference of low-dimensional vectors. As compared to VOS, the samples generated from the external classes are more likely to approximate the features of real images due to
4.5 Experiments
In this section, we first introduce the experiment setup and then present empirical results demonstrating the effectiveness of the proposed Foster.
ID Datasets for training. We use CIFAR-10, CIFAR-100 [75], STL10 [23], and DomainNet [132] as ID datasets. Both CIFAR-10 and CIFAR-100 are large datasets containing 50,000 training images and 10,000 test images. Compared with CIFAR, STL10 is a small dataset consisting of only 5,000 training images and 8,000 test images. DomainNet consists of images from 6 different domains. We use DomainNet to explore how Foster performs in the case of feature non-iid among different clients.
OoD Datasets for evaluation. We use Textures [21], Places365 [195], LSUN-C [183], LSUN-Resize [183] and iSUN [181] as the OoD datasets for evaluation. When the ID dataset is CIFAR-10, we also evaluate on CIFAR-100 to check near-OoD detection performance, since CIFAR-10 and CIFAR-100 have similarities, although their classes are disjoint.
Baselines. We compare the proposed Foster with both post hoc and virtual synthetic OoD detection methods mentioned in Section 6.2: a) post hoc OoD detection methods: Energy score [108], MSP [52], ODIN [105]; b) synthetic OoD detection method: VOS [31]. For a fair comparison, the FL training method for all of the above approaches, including the proposed Foster, is FedAvg [121] with a personalized classifier head, and we note that our framework can be extended to other FL variants. All the approaches only use ID data without any auxiliary OoD dataset for training.
Metrics for OoD detection and classification. To evaluate the classification performance on ID samples, we report the test accuracy (Acc) on each client's individual test set, whose classes match its training set. For OoD detection performance, we report the area under the receiver operating characteristic curve (AUROC) and the area under the PR curve (AUPR) for ID versus OoD classification. In the FL setting, all three metrics are averaged over all clients.
Heterogeneous federated learning. For CIFAR-10 and CIFAR-100, the total client number is 100; for STL10, the total client number is 50. For DomainNet, the total client number is 12 (2 for each domain). To model class non-iid training data, we follow a uniform partition mode and assign a subset of classes to each client. We distribute 3 classes per client for CIFAR-10 and STL10, 5 classes for DomainNet, and 10 classes for CIFAR-100, unless otherwise mentioned.

Figure 4.3: Visualization of generated external-class samples and ID samples for (a) CIFAR-10 and (b) CIFAR-100.

4.5.1 Visualization of generated external class samples
In Fig. 4.3, we visualize the generated external-class samples and the ID samples of a client using TSNE for both CIFAR-10 and CIFAR-100. Without accessing the raw external-class data from the other users, our generator, trained merely from the shared classifier head, yields samples that are strictly out of the local distribution without any overlap. We obtain a consistent conclusion from CIFAR-100, which has as many as 90 external classes. The large number of external classes diversifies the OoD set, and we therefore observe a larger gain in OoD detection accuracy (a 2.9% AUROC increase versus the best baseline) compared to other benchmarks in Table 4.1. This observation also motivates our design of tail sampling to encourage diversity.
4.5.2 Benchmark Results
Foster outperforms existing methods. We compare Foster with other competitive baselines in Table 4.1. The proposed Foster shows stronger OoD detection performance on all three training sets, while preserving a high test accuracy. VOS is another regularization method using virtual OoD samples, and it even shows worse results than the post hoc methods. The virtual OoD data synthesized by VOS is based on a large amount of ID samples. In the FL setting, where the data stored on each device is limited, these synthesized OoD samples based on ID data are no longer effective, which deteriorates the OoD detection performance. For Foster, the virtual OoD samples are based on the external-class knowledge extracted from other clients, which is close to real OoD samples.
Thus, they are effective in improving the OoD detection performance while preserving the test accuracy.

Table 4.1: Our Foster outperforms competitive baselines. ↑ indicates larger value is better. Bold numbers are best performers.

ID dataset   Method   Acc ↑    AUROC ↑   AUPR ↑
CIFAR-10     Energy   0.9431   0.7810    0.9262
             MSP      0.9431   0.8829    0.9691
             ODIN     0.9431   0.8842    0.9689
             VOS      0.9426   0.7970    0.9342
             Foster   0.9432   0.9091    0.9785
CIFAR-100    Energy   0.8129   0.8056    0.9575
             MSP      0.8129   0.8606    0.9782
             ODIN     0.8129   0.8657    0.9789
             VOS      0.8063   0.8372    0.9666
             Foster   0.8218   0.8945    0.9838
STL10        Energy   0.8236   0.7529    0.9228
             MSP      0.8236   0.7410    0.9309
             ODIN     0.8236   0.7418    0.9306
             VOS      0.8264   0.7370    0.9126
             Foster   0.8410   0.7671    0.9425

Near OoD detection. We evaluate the model trained on CIFAR-10 on both a near-OoD dataset (CIFAR-100) and far-OoD datasets. The results are shown in Table 4.2, and the best results are highlighted. The proposed Foster outperforms the baselines on all of the evaluation OoD datasets, especially the near-OoD dataset CIFAR-100. By synthesizing virtual external-class samples, Foster has access to virtual near-OoD samples during training, which is another advantage of Foster over the baselines.

Table 4.2: Near and far OoD detection for CIFAR-10. The proposed Foster outperforms baselines for all of the evaluation OoD datasets, especially the near-OoD dataset CIFAR-100.

Method   Textures         Places365        LSUN-C           LSUN-Resize      iSUN             CIFAR-100
         AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR
Energy   0.7080  0.8868   0.8221  0.9411   0.7009  0.9065   0.8376  0.9519   0.8289  0.9462   0.7883  0.9248
MSP      0.8107  0.9375   0.8964  0.9754   0.9043  0.9774   0.9154  0.9825   0.9103  0.9805   0.8604  0.9615
ODIN     0.8124  0.9367   0.8976  0.9752   0.9062  0.9773   0.9166  0.9825   0.9114  0.9805   0.8614  0.9613
VOS      0.7346  0.8993   0.8267  0.9447   0.7270  0.9196   0.8451  0.9541   0.8397  0.9499   0.8086  0.9379
Foster   0.8458  0.9544   0.9253  0.9842   0.9332  0.9863   0.9316  0.9870   0.9238  0.9849   0.8952  0.9742

Table 4.3: Our Foster outperforms competitive baselines under the feature non-iid setting (DomainNet).

Method   Acc ↑    AUROC ↑   AUPR ↑
Energy   0.7237   0.6745    0.8953
MSP      0.7237   0.6871    0.9048
ODIN     0.7237   0.6871    0.9047
VOS      0.7340   0.6796    0.8988
Foster   0.7348   0.6960    0.9075

OoD detection for feature non-iid clients. We explore whether Foster still works well when feature non-iid also exists among different clients, using DomainNet. Under this problem setting, different clients not only have different classes but may also come from different domains. According to the results in Table 4.3, although the gains are less pronounced than in the feature-iid setting, Foster still outperforms the baselines. In the feature non-iid setting, the external-class knowledge extracted from clients in different domains is less consistent than in the feature-iid case. However, our experimental results also show that, in this case, there is still some invariant external-class information across different domains that can be extracted by Foster to help improve the OoD detection performance.
4.5.3 Qualitative Studies
Effects of active client number. We investigate the effects of the active client number on CIFAR-10. The total number of clients is fixed at 100, while the number of active clients is set to 20, 50, and 100, respectively. According to the results in Table 4.4, Foster shows better OoD detection performance than the baselines in all cases. As the number of active clients increases, the OoD performance of Foster remains stable, which means our proposed Foster is not sensitive to the number of active users.

Table 4.4: Ablation study on the number of active clients: Foster is not sensitive to the number of active users.

Active num   Method   Acc ↑    AUROC ↑   AUPR ↑
20           Energy   0.9399   0.7760    0.9363
             MSP      0.9399   0.8560    0.9674
             ODIN     0.9399   0.8562    0.9674
             VOS      0.9410   0.7545    0.9173
             Foster   0.9401   0.9011    0.9776
50           Energy   0.9432   0.7592    0.9185
             MSP      0.9432   0.8869    0.9728
             ODIN     0.9432   0.8879    0.9727
             VOS      0.9430   0.7946    0.9311
             Foster   0.9429   0.8947    0.9750
100          Energy   0.9431   0.7810    0.9262
             MSP      0.9431   0.8829    0.9691
             ODIN     0.9431   0.8842    0.9689
             VOS      0.9426   0.7970    0.9342
             Foster   0.9432   0.9091    0.9785

Effects of ID class number. We investigate the effects of the ID class number on CIFAR-100. We set the number of classes distributed per client (classes / client) to 10, 5, and 3, respectively. According to the results in Table 4.5, the advantage of the proposed Foster over the other competitive baselines is not affected by the number of ID classes. When the number of ID classes decreases, the maximum changes in AUROC and AUPR for Foster are 2.16% and 0.81%, respectively. For VOS, another virtual synthetic OoD detection method, AUROC and AUPR drop by 7.36% and 6.76%, respectively, as the number of ID classes decreases, which is a much larger variation compared with our method. Thus, the ID class number has a large impact on VOS, while it has almost no effect on Foster.

Table 4.5: Ablation study on ID class number: the advantage of the proposed Foster over other baselines is not affected by the number of ID classes.

Classes / client   Method   Acc ↑    AUROC ↑   AUPR ↑
10                 Energy   0.8129   0.8056    0.9575
                   MSP      0.8129   0.8606    0.9782
                   ODIN     0.8129   0.8657    0.9789
                   VOS      0.8063   0.8372    0.9666
                   Foster   0.8218   0.8945    0.9838
5                  Energy   0.8976   0.7735    0.9157
                   MSP      0.8976   0.8776    0.9704
                   ODIN     0.8976   0.8831    0.9714
                   VOS      0.8974   0.7927    0.9289
                   Foster   0.8981   0.9081    0.9778
3                  Energy   0.9383   0.7215    0.8684
                   MSP      0.9383   0.8682    0.9586
                   ODIN     0.9383   0.8723    0.9592
                   VOS      0.9393   0.7636    0.8990
                   Foster   0.9397   0.8865    0.9697

Effects of the p.d.f. filter. We report the effects of the p.d.f. filter introduced in Section 4.4.3 on CIFAR-10 in Table 4.6. The generator without a p.d.f. filter is outperformed by the baselines. This phenomenon occurs because not all generated external-class samples are of high quality, and some of them may even deteriorate the OoD detection performance. Since we add Gaussian noise during the generation process, some randomly generated external-class samples might overlap with ID samples. Thus, we build a class-conditional Gaussian distribution for the external classes and adopt a p.d.f. filter to select diverse virtual OoD samples that do not overlap with the ID clusters. According to Table 4.6, filtering out low-quality OoD samples improves AUROC and AUPR by 4.44% and 1.18%, respectively.
Effects of the random soft label strategy. We study the effects of the random soft label strategy on STL10, and set δ = 0.2.
As shown in Table 4.7, after replacing the one-hot label with the random soft label as the input to the generator, we improve AUROC and AUPR by 2.01% and 0.75%, respectively, while preserving a similar ID classification test accuracy. This is because the soft label contains knowledge from the ID classes, which makes the generated external-class samples closer to the ID samples.

Table 4.6: Ablation study on the p.d.f. filter: the p.d.f. filter plays an effective role in selecting diverse, high-quality virtual OoD samples.

Method                  Acc ↑    AUROC ↑   AUPR ↑
Energy                  0.9431   0.7810    0.9262
MSP                     0.9431   0.8829    0.9691
ODIN                    0.9431   0.8842    0.9689
VOS                     0.9426   0.7970    0.9342
Foster w/o pdf filter   0.9425   0.8647    0.9667
Foster w/ pdf filter    0.9432   0.9091    0.9785

Table 4.7: Ablation study on random soft labels: the soft label strategy increases the hardness of the generated virtual OoD samples.

Method                 Acc ↑    AUROC ↑   AUPR ↑
Foster                 0.8410   0.7671    0.9425
Foster w/ soft label   0.8294   0.7872    0.9501

4.6 Summary
In this chapter, we study a largely overlooked problem: OoD detection in FL. To turn the curse of heterogeneity in FL into a blessing that facilitates OoD detection, we propose a novel OoD synthesizer that does not rely on any real external samples, allowing a client to learn external-class knowledge from other non-iid federated collaborators in a privacy-preserving manner. Empirical results show that the proposed approach achieves state-of-the-art performance in non-iid FL.

CHAPTER 5
SAFE AND ROBUST WATERMARK INJECTION WITH A SINGLE OOD IMAGE
This chapter is based on the following work: Safe and Robust Watermark Injection with a Single OoD Image. Shuyang Yu, Junyuan Hong, Haobo Zhang, Haotao Wang, Zhangyang Wang, Jiayu Zhou. 2024 International Conference on Learning Representations (ICLR), 2024.
5.1 Introduction
In the era of deep learning, training a high-performance large model requires curating a massive amount of training data from different sources, powerful computational resources, and often great effort from human experts. For example, large language models such as GPT-3 are trained on private datasets at significant cost [36]. The risk of illegal reproduction or duplication of such high-value DNN models is a growing concern. The recent leak of Facebook's LLaMA model provides a notable example of this risk [54]. Therefore, it is essential to protect the intellectual property of the model and the rights of the model owners.
Recently, watermarking [2, 26, 162, 192, 19, 100] has been introduced to protect the copyright of DNNs. Most existing watermarking methods fall into two mainstream categories: parameter-embedding [77, 162, 122] and backdoor-based [41, 90] techniques. Parameter-embedding techniques require white-box access to the suspicious model, which is often unrealistic in practical detection scenarios. This chapter places emphasis on backdoor-based approaches, which taint the training dataset by incorporating trigger patches into a set of images referred to as verification samples (the trigger set) and modifying their labels to a designated class, forcing the model to memorize the trigger pattern during fine-tuning. The owner of the model can then perform an intellectual property (IP) inspection by assessing the correspondence between the model's outputs on the verification samples with the trigger and the intended target labels.
Existing backdoor-based watermarking methods suffer from major challenges in safety, efficiency, and robustness.
Typically injection of backdoors requires full or partial access to the original training data. When protecting models, such access can be prohibitive, mostly due to data safety and confidentiality. For example, someone trying to protect a model fine- tuned upon a foundation model and a model publisher vending models uploaded by their users. Another example is an independent IP protection department or a third party that is in charge of model protection for redistribution. Yet another scenario is federated learning [74], where the server does not have access to any in-distribution (ID) data, but is motivated to inject a watermark to protect the ownership of the global model. Despite the high practical demands, watermark injection without training data is barely explored. Although some existing methods tried to export or synthesize out-of-distribution (OoD) samples as triggers to insert watermark [173, 192], the original training data is still essential to maintain the utility of the model, i.e., prediction performance on clean samples. [89] proposed a strategy that adopts a Data-Free Distillation (DFD) process to train a generator and uses it to produce surrogate training samples. However, training the generator is time-consuming and may take hundreds of epochs [35]. Another critical issue with backdoor-based watermarks is their known vulnerability against minor model changes, such as fine-tuning [2, 162, 40], and this vulnerability greatly limited the practical applications of backdoor-based watermarks. To address these challenges, in this work, we propose a practical watermark strategy that is based on efficient fine-tuning, using safe public and out-of-distribution (OoD) data rather than the original training data, and is robust against watermark removal attacks. Our approach is inspired by the recent discovery of the expressiveness of a powerful single image [6, 5]. Specifically, we propose to derive patches from a single image, which are OoD samples with respect to the original training data, for watermarking. To watermark a model, the model owner or IP protection unit secretly selects a few of these patches, implants backdoor triggers on them, and uses fine-tuning to efficiently inject the backdoor into the model to be protected. The IP verification process follows the same as other backdoor-based 66 watermark approaches. To increase the robustness of watermarks against agnostic removal attacks, we design a parameter perturbation procedure during the fine-tuning process. Our contributions are summarized as follows. • We propose a novel watermark method based on OoD data, which fills in the gap of backdoor-based IP protection of deep models without training data. The removal of access to the training data enables the proposed approach possible for many real-world scenarios. • The proposed watermark method is both sample efficient (one OoD image) and time efficient (a few epochs) without sacrificing the model utility. • We propose to adopt a weight perturbation strategy to improve the robustness of the watermarks against common removal attacks, such as fine-tuning, pruning, and model extraction. We show the robustness of watermarks through extensive empirical results, and they persist even in an unfair scenario where the removal attack uses a part of in-distribution data. 5.2 Background 5.2.1 DNN Watermarking Existing watermark methods can be categorized into two groups, parameter-embedding and backdoor-based techniques, differing in the information required for verification. 
Parameter-embedding techniques embed the watermark into the parameter space of the target model [26, 162, 77, 122]. Then the owner can verify the model identity by com- paring the parameter-oriented watermark extracted from the suspect model versus that of the owner model. For instance, [77] embeds watermarks into the weights of DNN, and then compares the weights of the suspect model and owner model during the verification pro- cess. However, these kinds of techniques require a white-box setting: the model parameters 67 should be available during verification, which is not a practical assumption facing real-world attacks. For instance, an IP infringer may only expose an API of the stolen model for queries to circumvent the white-box verification. Backdoor-based techniques are widely adopted in a black-box verification, which im- plant a backdoor trigger into the model by fine-tuning the pre-trained model with a set of poison samples (also denoted as the trigger set) assigned to one or multiple secret target class [192, 79, 41, 90]. The introduction of the watermarking injection process can be found in Section 1.3.3. Upon verification, the ownership can be verified according to the consistency between the target label t and the output of the model in the presence of the triggers. However, conventional backdoor-based watermarking is limited to scenarios where clean and poisoned dataset follows the same distribution as the training data of the pre-trained model. For example, in Federated Learning [121], the IP protector on the server does not have access to the client’s data. Meanwhile, in-training backdoor injection could be voided by backdoor- resilient training [169]. We reveal that neither the training data (or equivalent i.i.d. data) nor the in-training strategy is necessary for injecting watermarks into a well-trained model, and merely using clean and poisoned OoD data can also insert watermarks after training. Backdoor-based watermarking without i.i.d. data. Among backdoor-based tech- niques, one kind of technique also tried to export or synthesize OoD samples as the trigger set to insert a watermark. For instance, [192] exported OoD images from other classes that are irrelevant to the original tasks as the watermarks. [173] trained a proprietary model (PTYNet) on the generated OoD watermarks by blending different backgrounds, and then plugged the PTYNet into the target model. However, for these kinds of techniques, i.i.d. samples are still essential to maintain the main-task performance. On the other hand, data-free watermark injection is an alternative to OoD-based methods. Close to our work, [89] proposed a data-free method that first adopts a Data-Free Distillation method to train a generator, and then uses the generator to produce surrogate training samples to inject 68 watermarks. However, according to [35], the training of the generator for the data-free dis- tillation process is time-consuming, which is not practical and efficient enough for real-world intellectual property protection tasks. 5.2.2 Watermark Removal Attack In contrast to protecting the IP, a series of works have revealed the risk of watermark removal to steal the IP. Here we summarize three mainstream types of watermark removal techniques: fine-tuning, pruning, and model extraction. We refer to the original watermarked model as the victim model and the stolen copy as the suspect model under removal attacks. Fine-tuning assumes that the adversary has a small set of i.i.d. 
samples and has access to the victim model architecture and parameters [2, 162]. The adversary attempts to fine-tune the victim model using the i.i.d. data such that the watermark fades away, and an infringer can thus bypass IP verification. Pruning has the same assumptions as fine-tuning. To conduct the attack, the adversary first prunes the victim model using some pruning strategy, and then fine-tunes the model with a small i.i.d. dataset [110, 137]. Model Extraction assumes only the predictions of the victim model are available to the adversary. To steal the model through the API, given a set of auxiliary samples, the adversary first queries the victim model on the auxiliary samples to obtain an annotated dataset, and then trains a copy of the victim model on this annotated dataset [65, 158, 131, 130, 187].
5.3 Method
Problem Setup. Within the scope of this work, we assume that training data or equivalent i.i.d. data are not available for watermarking due to data privacy concerns. This assumption casts a substantial challenge on maintaining standard accuracy on i.i.d. samples while injecting backdoors.
Our main intuition is that a learned decision boundary can be manipulated not only by i.i.d. samples but also by OoD samples. Moreover, recent studies [6, 5] showed a surprising result that one single OoD image is enough for learning low-level visual representations, provided with strong data augmentations. Thus, we conjecture that it is plausible to efficiently inject backdoor-based watermarks into different parts of the pre-trained representation space by exploiting the diverse knowledge from one single OoD image. Previous work has shown that using OoD images for training a classifier yields reasonable performance on the main prediction task [6]. Moreover, it is essential to robustify the watermark against potential removal attacks. Therefore, our injection process comprises two steps: constructing surrogate data to be poisoned, and robust watermark injection. The framework of the proposed strategy is illustrated in Fig. 5.1.

Figure 5.1: Framework of the proposed safe and robust watermark injection strategy. It first constructs a surrogate dataset from the single-image OoD data source provided with strong augmentation used as the secret key, which is confidential to any third parties. Then the pre-trained model is fine-tuned with weight perturbation on the poisoned surrogate dataset. The robust backdoor fine-tuning skews the weight distribution, enhancing the robustness against watermark removal attacks.

5.3.1 Constructing Safe Surrogate Dataset
We first augment one OoD source image multiple times to generate an unlabeled surrogate dataset D̃ of a desired size according to [6, 5]. For safety considerations, the OoD image is only known to the model owner. The source OoD images are publicly available and properly licensed for personal use. To "patchify" a large single image, the augmentation composes multiple augmentation methods in sequence: cropping, rotation and shearing, and color jittering, using the hyperparameters from [5]. During training, we further randomly augment pre-fetched samples by cropping and flipping, and we use the predictions from the pre-trained model θ_0 as supervision.
Suppose θ is initialized as θ_0 of the pre-trained model. To inject watermarks, we split the unlabeled surrogate dataset D̃ = D̃_c ∪ D̃_p, where D̃_c is the clean dataset and D̃_p is the poisoned dataset. For the poisoned dataset D̃_p, by inserting a trigger pattern Γ(·) into each original sample in D̃_p, the sample should be misclassified to one pre-assigned target label t. Our goal is to solve the following optimization problem:

min_θ L_inj(θ) := Σ_{x∈D̃_c} ℓ(f_θ(x), f_{θ_0}(x)) + Σ_{x′∈D̃_p} ℓ(f_θ(Γ(x′)), t).

The first term is used to ensure the high performance of the original task [6], and the second term is for watermark injection. The major difference between our method and [6] is that we use the generated data for fine-tuning the same model instead of distilling a new model; we repurpose the benign generated dataset for injecting watermarks.
Considering a black-box setting, to verify whether a suspect model M_s is a copy of our protected model M, we can use the generated surrogate OoD dataset as safe verification samples. As the generation is kept secret, no one other than the owner can complete the verification. Since the verification data is agnostic to third parties, an attacker cannot directly use it to efficiently remove watermarks. Thus, we can guarantee the safety of the verification. Formally, we check the probability that watermarked verification samples successfully mislead the model M_s into predicting the pre-defined target label t, denoted as the watermark success rate (WSR). Since the ownership of a stolen model can be claimed by the model owner if the suspect model's behavior differs significantly from any non-watermarked model [61], if the WSR is larger than a random guess and also far exceeds the probability of a non-watermarked model classifying the verification samples as t, then M_s will be considered a copy of M with high probability. A T-test between the output logits of the suspect model M_s and a non-watermarked model on the verification dataset is also used as a metric to evaluate whether M_s is a stolen copy. Compared with traditional watermark injection techniques, i.i.d. data is also unnecessary in the verification process.
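As a minimal illustration of this injection objective, the sketch below fine-tunes a pre-trained model on the augmented single-image surrogate data, distilling the frozen pre-trained model on clean patches and mapping triggered patches to the target label. The trigger function, model interface, and data loader are hypothetical placeholders under stated assumptions, not the exact implementation used in this chapter.

```python
import copy
import torch
import torch.nn.functional as F

def add_trigger(x, patch_size=6):
    """Hypothetical trigger Gamma(.): paste a white square in the corner."""
    x = x.clone()
    x[:, :, -patch_size:, -patch_size:] = 1.0
    return x

def inject_watermark(model, surrogate_loader, target_label=0,
                     poison_ratio=0.1, epochs=20, lr=1e-3, device="cpu"):
    """Optimize L_inj: distill the frozen pre-trained model on clean OoD patches
    and force triggered patches to the target label."""
    teacher = copy.deepcopy(model).to(device).eval()      # f_{theta_0}, frozen
    model.train().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x in surrogate_loader:                        # unlabeled OoD patches
            x = x.to(device)
            n_poison = max(1, int(poison_ratio * x.size(0)))
            x_clean, x_poison = x[n_poison:], add_trigger(x[:n_poison])
            with torch.no_grad():
                soft = F.softmax(teacher(x_clean), dim=1)  # pseudo-labels
            # Soft-label cross-entropy against the teacher's predictions.
            loss_clean = -(soft * F.log_softmax(model(x_clean), dim=1)).sum(dim=1).mean()
            loss_wm = F.cross_entropy(
                model(x_poison),
                torch.full((n_poison,), target_label, device=device))
            (loss_clean + loss_wm).backward()
            opt.step(); opt.zero_grad()
    return model
```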
5.3.2 Robust Watermark Injection
According to [2, 162], the watermark may be removed by fine-tuning when adversaries have access to i.i.d. data. Watermark removal attacks such as fine-tuning and pruning shift the model parameters on a small scale to maintain standard accuracy while removing watermarks. If the protected model shares a similar parameter distribution with the pre-trained model, the injected watermark could be easily erased by fine-tuning with i.i.d. data or by adding random noise to the parameters [40]. To defend against removal attacks, we aim to make our watermark robust and persistent within a small scale of parameter perturbations.
Backdoor training with weight perturbation. To this end, we introduce adversarial weight perturbation (WP) into backdoor fine-tuning. First, we simulate the watermark removal attack that maximizes the loss to escape from the watermarked local minima. We let θ = (w, b) denote the model parameters, where θ is composed of weights w and biases b, and the weight perturbation is denoted as v. Then, we adversarially minimize the loss after the simulated removal attack. The adversarial minimization strategy echoes previous sharpness-aware optimization principles for robust model poisoning [49]. Thus, the adversarial training objective is formulated as min_{w,b} max_{v∈V} L_per(w + v, b), where

L_per(w + v, b) := L_inj(w + v, b) + β Σ_{x∈D̃_c, x′∈D̃_p} KL(f_{(w+v,b)}(x), f_{(w+v,b)}(Γ(x′))).    (5.1)

In Eq. (5.1), we constrain the weight perturbation v within a set V, KL(·, ·) is the Kullback–Leibler divergence, and β is a positive trade-off parameter. The first term is identical to standard watermark injection. Inspired by previous work [35], the second term preserves the main-task performance and maintains the representation similarity between poisoned and clean samples in the presence of weight perturbation. Eq. (5.1) facilitates injecting the worst-case perturbation of the constrained weights while maintaining the standard accuracy and the watermark success rate.
In the above adversarial optimization, the scale of the perturbation v is critical. If the perturbation is too large, the anomalies of the parameter distribution could be easily detected by an IP infringer [135]. Since the weight distributions differ by layer of the network, the magnitude of the perturbation should vary accordingly from layer to layer. Following [179], we adaptively restrict the weight perturbation v_l for the l-th layer weight w_l as

∥v_l∥ ≤ γ∥w_l∥,    (5.2)

where γ ∈ (0, 1). The set V in Eq. (5.1) is thus decomposed into balls with radius γ∥w_l∥ per layer.
Optimization. The optimization process alternates between two steps that update the perturbation v and the weight w. (1) v-step: To respect the constraint in Eq. (5.2), we need a projection. Since v is updated layer-wise, we need a projection function Π(·) that projects every perturbation v_l that violates the constraint in Eq. (5.2) back onto the surface of the perturbation ball with radius γ∥w_l∥. To achieve this, we define Π_γ following [179]:

Π_γ(v_l) = { γ (∥w_l∥ / ∥v_l∥) v_l,  if ∥v_l∥ > γ∥w_l∥;    v_l,  otherwise. }    (5.3)

With this projection, the perturbation v in Eq. (5.1) is computed as v ← Π_γ( v + η_1 (∇_v L_per(w + v, b) / ∥∇_v L_per(w + v, b)∥) ∥w∥ ), where η_1 is the learning rate. (2) w-step: With the updated perturbation v, the weight of the perturbed model θ is updated using w ← w − η_2 ∇_{w+v} L_per(w + v, b), where η_2 is the learning rate.
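For concreteness, here is a minimal sketch of one layer-wise v-step/w-step iteration under the constraint of Eq. (5.2). In this sketch, loss_fn stands for L_per evaluated on a batch, the perturbation is applied to all parameters rather than only the weights, and all hyperparameters are illustrative assumptions.

```python
import torch

# Usage assumption: perturb = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

def awp_step(model, perturb, loss_fn, gamma=0.1, eta1=0.01, eta2=0.1):
    """One adversarial weight-perturbation iteration (v-step then w-step)."""
    # --- v-step: ascend L_per w.r.t. v, then project onto ||v_l|| <= gamma*||w_l|| ---
    for name, p in model.named_parameters():
        p.data.add_(perturb[name])                        # evaluate at w + v
    loss = loss_fn(model)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    w_norm = torch.sqrt(sum(((p.detach() - perturb[n]) ** 2).sum()
                            for n, p in model.named_parameters()))
    g_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    for (name, p), g in zip(model.named_parameters(), grads):
        v_l = perturb[name] + eta1 * (g / g_norm) * w_norm  # normalized ascent step
        w_l = p.detach() - perturb[name]                    # current clean weight
        if v_l.norm() > gamma * w_l.norm():                 # projection Pi_gamma (Eq. 5.3)
            v_l = gamma * w_l.norm() / (v_l.norm() + 1e-12) * v_l
        p.data.add_(v_l - perturb[name])                    # move params to w + new v
        perturb[name] = v_l
    # --- w-step: descend L_per at (w + v), then remove the perturbation ---
    loss = loss_fn(model)
    loss.backward()
    for name, p in model.named_parameters():
        p.data.add_(-eta2 * p.grad)                         # w <- w - eta2 * grad
        p.grad = None
        p.data.add_(-perturb[name])                         # restore clean w; v kept for next step
    return loss.item()
```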
data and used as a reference for our method. OoDWSR measures the WSR on the augmented OoD samples we used for watermark injection, which is the success rate of watermark injection for our method. T-test takes the output logits of the non-watermarked model and suspect model Ms as input, and the null hypothesis is the logits distribution of the suspect model is identical to that of a non-watermarked model. If the p-value of the T-test is smaller than the threshold 0.05, then we can reject the null hypothesis and statistically verify that Ms differs significantly from the non-watermarked model, so the ownership of Ms can be claimed [61]. Higher OoDWSR with a p-value smaller than the threshold and meanwhile a larger Acc indicate a successful watermark injection. Trigger patterns. To attain the best model with the highest watermark success rate, we use the OoDWAR to choose triggers from 6 different backdoor patterns: BadNets with grid (bad- net_grid) [44], l0-invisible (l0_inv) [93], smooth [190], Trojan Square 3 × 3 (trojan_3 × 3), Trojan Square 8×8 (trojan_8×8), and Trojan watermark (trojan_wm) [109]. Pre-training models. The detailed information of the pre-trained models is shown in Table 5.1. All the models are pre-trained on clean samples until convergence, with a learning rate of 0.1, SGD 1https://pixabay.com/photos/japan-ueno-japanese-street-sign-217883/ 2https://www.teahub.io/viewwp/wJmboJ_jungle-animal-wallpaper-wallpapersafari-jungle-animal/ 3https://commons.wikimedia.org/wiki/File:GG-ftpoint-bridge-2.jpg 74 Dataset Class num DNN architecture Acc CIFAR-10 CIFAR-100 GTSRB 10 100 43 WRN-16-2 [188] 0.9400 WRN-16-2 [188] 0.7234 0.9366 ResNet18 [47] Table 5.1: Pre-trained models. optimizer, and batch size 128. We follow public resources to conduct the training such that the performance is close to state-of-the-art results. Watermark removal attacks. To evaluate the robustness of our proposed method, we consider three kinds of attacks on victim models: 1) FT : Fine-tuning includes three kinds of methods: a) fine-tune all layers (FT-AL), b) fine-tune the last layer and freeze all other layers (FT-LL), c) re-initialize the last layer and then fine-tune all layers (RT-AL). 2) Pruning-r% indicates pruning r% of the model parameters which has the smallest absolute value, and then fine-tuning the model on clean i.i.d. samples to restore accuracy. 3) Model Extraction: We use knockoff [130] as an example of the model extraction attack, which queries the model to get the predictions of an auxiliary dataset (ImagenetDS [20] is used in our experiments), and then clones the behavior of a victim model by re-training the model with queried image- prediction pairs. Assume the adversary obtains 10% of the training data of the pre-trained models for fine-tuning and pruning. Fine-tuning and pruning are conducted for 50 epochs. Model extraction is conducted for 100 epochs. 5.4.1 Watermark Injection The poisoning ratio of the generated surrogate dataset is 10%. For CIFAR-10 and GTSRB, we fine-tune the pre-trained model for 20 epochs (first 5 epochs are with WP). For CIFAR- 100, we fine-tune the pre-trained model for 30 epochs (first 15 epochs are with WP). The perturbation constraint γ in Eq. (5.2) is fixed at 0.1 for CIFAR-10 and GTSRB, and 0.05 for CIFAR-100. The trade-off parameter β in Eq. (5.1) is fixed at 6 for all the datasets. The watermark injection process of CIFAR-10 is shown in Fig. 5.2, and watermark injection for the other two datasets can be found in ?? 13. 
5.4.1 Watermark Injection
The poisoning ratio of the generated surrogate dataset is 10%. For CIFAR-10 and GTSRB, we fine-tune the pre-trained model for 20 epochs (the first 5 epochs with WP). For CIFAR-100, we fine-tune the pre-trained model for 30 epochs (the first 15 epochs with WP). The perturbation constraint γ in Eq. (5.2) is fixed at 0.1 for CIFAR-10 and GTSRB, and 0.05 for CIFAR-100. The trade-off parameter β in Eq. (5.1) is fixed at 6 for all the datasets. The watermark injection process for CIFAR-10 is shown in Fig. 5.2, and the watermark injection for the other two datasets can be found in ?? 13. We observe that the injection process is efficient: it takes only 10 epochs for CIFAR-10 to achieve stable, high standard accuracy and OoDWSR. The highest OoDWSR for CIFAR-10 is 95.66%, with a standard accuracy degradation of less than 3%. In the following experiments, we choose the triggers with the top-2 OoDWSR and a standard accuracy degradation of less than 3% as the recommended watermark patterns.

Figure 5.2: Acc, ID WSR, and OoD WSR during watermark injection on CIFAR-10: (a) Acc, (b) ID WSR, (c) OoD WSR.

Table 5.2: Evaluation of watermarking against fine-tuning and pruning on three datasets.

Victim models:
Dataset     Trigger      Non-watermarked model OoDWSR   Acc      IDWSR    OoDWSR
CIFAR-10    trojan_wm    0.0487                         0.9102   0.9768   0.9566
CIFAR-10    trojan_8x8   0.0481                         0.9178   0.9328   0.9423
CIFAR-100   trojan_8x8   0.0001                         0.6978   0.7024   0.8761
CIFAR-100   l0_inv       0.0002                         0.6948   0.7046   0.5834
GTSRB       smooth       0.0145                         0.9146   0.1329   0.9442
GTSRB       trojan_wm    0.0220                         0.9089   0.7435   0.7513

Suspect model results under watermark removal (Acc, IDWSR, OoDWSR, and p-value for FT-AL, FT-LL, RT-AL, Pruning-20%, Pruning-50%):
FT-AL FT-LL RT-AL 0.9191 0.9769 0.7345 0.9990 0.8706 0.4434 Pruning-20% 0.9174 0.9771 Pruning-50% 0.9177 0.9780 0.9187 0.9533 0.7408 0.9891 0.8675 0.0782 Pruning-20% 0.9197 0.9560 Pruning-50% 0.9190 0.9580 FT-AL FT-LL RT-AL FT-AL FT-LL RT-AL 0.6712 0.5602 0.4984 0.9476 0.5319 0.0227 Pruning-20% 0.6702 0.6200 Pruning-50% 0.6645 0.6953 0.6710 0.7595 0.4966 0.9991 0.5281 0.0829 Pruning-20% 0.6704 0.7817 Pruning-50% 0.6651 0.8288 FT-AL FT-LL RT-AL FT-AL FT-LL RT-AL 0.8623 0.0051 0.6291 0.0487 0.8622 0.0041 Pruning-20% 0.8625 0.0053 Pruning-50% 0.8628 0.0052 0.8684 0.3257 0.5935 0.7429 0.8519 0.1170 Pruning-20% 0.8647 0.3235 Pruning-50% 0.8610 0.3281 FT-AL FT-LL RT-AL 0.0000 0.0000 1.0103e-12 0.0000 0.0000 0.0000 0.0000 0.9678 0.9972 0.5752 0.9641 0.9658 0.9797 0.9945 0.2419 2.9829e-241 2.0500e-08 0.9793 0.9801 5.1651e-247 0.7443 0.9641 0.0700 0.7815 0.7960 0.5491 0.6097 0.1232 0.5517 0.5530 0.6772 0.9527 0.7431 0.6798 0.6778 0.1726 0.5751 0.0684 0.1779 0.1747 0.0012 0.0066 0.0090 0.0020 0.0049 0.0206 0.0106 0.0010 0.0099 0.0025 4.4360e-10 0.0006 0.0000 0.0179 0.0215 0.0117 7.4281e-11 0.0000 0.0131 0.0000

5.4.2 Defending Against Fine-tuning & Pruning
We evaluate the robustness of our proposed method against fine-tuning and pruning in Table 5.2, where the victim models are watermarked models, and the suspect models are stolen copies of the victim models obtained via watermark removal attacks. The OoDWSR of the pre-trained model in Table 5.1 is the probability that a non-watermarked model classifies the verification samples as the target label. If the OoDWSR of a suspect model far exceeds that of the non-watermarked model, the suspect model can be judged a copy of the victim model [61]. FT-AL and pruning maintain the performance of the main classification task with an accuracy degradation of less than 6%, but OoDWSR remains high for all the datasets. Compared with FT-AL, FT-LL significantly brings down the standard accuracy, by over 15% for all the datasets. Even with this large sacrifice of standard accuracy, FT-LL still cannot wash out the injected watermark, and the OoDWSR even increases for some of the datasets. RT-AL loses 4.50%, 16.63%, and 5.47% (mean value over two triggers) standard accuracy, respectively, for the three datasets.
Yet, the OoDWSR under RT-AL remains larger than that of a random guess and of non-watermarked models. To statistically verify the ownership, we conduct a T-test between the non-watermarked model and the watermarked model. The p-value is the probability that the two models behave similarly. The p-values for all the datasets are close to 0. The low p-values indicate that the suspect models behave significantly differently from non-watermarked models, with probability at least 95%. Thus, these suspect models cannot escape the suspicion of copying our model M with a high chance.
IDWSR is also reported here as a reference, although we do not use i.i.d. data for verifying the ownership of our model. We observe that even though watermarks can be successfully injected into both our generated OoD dataset and i.i.d. samples (refer to IDWSR and OoDWSR for the victim model), they differ in their robustness against these two watermark removal attacks. For instance, for smooth on GTSRB, after fine-tuning or pruning, IDWSR drops under 1%, which is below the random guess; however, OoDWSR remains over 67%. This phenomenon is also observed for other triggers and datasets. Watermarks injected in OoD samples are much harder to wash out than watermarks injected into i.i.d. samples. Due to the different distributions, fine-tuning or pruning has a smaller impact on OoD samples than on i.i.d. samples.
To further verify our intuition, we also compare our method (OoD) with traditional backdoor-based methods using i.i.d. data (ID) for data poisoning on CIFAR-10. We use RT-AL, the strongest attack in Table 5.2, as an example. The results are shown in Table 5.3. Note that ID poison and the proposed OoD poison adopt IDWSR and OoDWSR, respectively, as the success rate of the injected watermark. Clean refers to the pre-trained model without watermark injection. With only one single OoD image for watermark injection, we can achieve comparable results to ID poisoning, which utilizes the entire ID training set. After RT-AL, the watermark success rate drops to 4.13% and 3.42%, respectively, for ID poison, while it drops to 57.52% and 24.19% for OoD poison, which verifies that our proposed method is also much more robust against watermark removal attacks.

Table 5.3: Comparison of watermarking methods against fine-tuning watermark removal using different training data. OoD injection is much more robust compared with i.i.d. injection.

Trigger      Training data   Victim Acc   Victim IDWSR   Victim OoDWSR   Suspect Acc   Suspect IDWSR   Suspect OoDWSR
trojan_wm    clean           0.9400       0.0639         0.0487          0.8646        0.0864          0.0741
trojan_wm    ID              0.9378       1.0000         0.9997          0.8593        0.0413          0.0195
trojan_wm    OoD             0.9102       0.9768         0.9566          0.8706        0.4434          0.5752
trojan_8x8   clean           0.9400       0.0161         0.0481          0.8646        0.0323          0.0610
trojan_8x8   ID              0.9393       0.9963         0.9992          0.8598        0.0342          0.0625
trojan_8x8   OoD             0.9178       0.9328         0.9423          0.8675        0.0782          0.2419

Table 5.4: Evaluation of watermarking against model extraction watermark removal on three datasets.

Dataset     Trigger      Victim Acc   Victim IDWSR   Victim OoDWSR   Suspect Acc   Suspect IDWSR   Suspect OoDWSR   p-value
CIFAR-10    trojan_wm    0.9102       0.9768         0.9566          0.8485        0.9684          0.9547           0.0000
CIFAR-10    trojan_8x8   0.9178       0.9328         0.9423          0.8529        0.8882          0.9051           0.0000
CIFAR-100   trojan_8x8   0.6978       0.7024         0.8761          0.5309        0.5977          0.7040           0.0059
CIFAR-100   l0_inv       0.6948       0.7046         0.5834          0.5200        0.0162          0.0622           0.0019
GTSRB       smooth       0.9146       0.1329         0.9442          0.6575        0.1386          0.9419           7.5891e-11
GTSRB       trojan_wm    0.9089       0.7435         0.7513          0.6379        0.7298          0.7666           2.6070e-21

Figure 5.3: The distribution of OoD and ID samples (a) before and (b) after fine-tuning. Generation data denotes augmented OoD samples from a single OoD image.
Figure 5.4: Weight distribution for the model w/ and w/o WP: (a) without WP, (b) with WP. The x-axis is the parameter values, and the y-axis is the number of parameters.

5.4.3 Defending Against Model Extraction
We evaluate the robustness of our proposed method against model extraction in Table 5.4. Under model extraction, the standard accuracy drops by 6% for the model pre-trained on CIFAR-10, and by more than 10% for the other two datasets. Re-training from scratch makes it hard for the suspect model to recover the original model's utility using an OoD dataset and soft labels queried from the watermarked model. OoDWSR is still over 90% and 76% for CIFAR-10 and GTSRB, respectively. Although OoDWSR is 6.22% for l0_inv, it is still well above the 0.02% observed for the non-watermarked model. All the datasets also have a p-value close to 0. All the above observations indicate that the re-training-based extracted model has a high probability of being a copy of our model. One possible reason these re-trained models still carry the watermark is that, during re-training, the backdoor information hidden in the soft labels queried by the IP infringers can also embed the watermark in the extracted model. The extracted model behaves increasingly similarly to the victim model as its decision boundary gradually approaches that of the victim model.
5.4.4 Qualitative Studies
Distribution of generated OoD samples and ID samples. We first augment an unlabeled OoD dataset, and then assign predicted labels to the samples using the model pre-trained on clean CIFAR-10 data. According to the distribution of OoD and ID samples before and after our watermark fine-tuning, as shown in Fig. 5.3, we observe that the OoD data drawn from one image lies close to the ID data with a small gap. After a few epochs of fine-tuning, some of the OoD data is drawn closer to the ID data, but still maintains no overlap. This helps us successfully implant watermarks into the pre-trained model while maintaining the difference between ID and OoD data. In this way, when our model is fine-tuned with clean ID data by attackers, the WSR on the OoD data will not be easily erased.

Table 5.5: Watermark injection using different OoD images.

OoD Image   Trigger      Acc      IDWSR    OoDWSR
City        trojan_wm    0.9102   0.9768   0.9566
City        trojan_8x8   0.9178   0.9328   0.9423
Animals     trojan_wm    0.9072   0.9873   0.9880
Animals     trojan_8x8   0.9176   0.9251   0.9622
Bridge      trojan_wm    0.9207   0.8749   0.7148
Bridge      trojan_8x8   0.9172   0.7144   0.7147

Table 5.6: Weight perturbation increases the robustness of the watermarks against removal attacks.

Trigger      WP    Victim Acc   Victim IDWSR   Victim OoDWSR   Suspect Acc   Suspect IDWSR   Suspect OoDWSR
trojan_wm    w/o   0.9264       0.9401         0.9490          0.8673        0.1237          0.1994
trojan_wm    w/    0.9102       0.9768         0.9566          0.8706        0.4434          0.5752
trojan_8x8   w/o   0.9238       0.9263         0.9486          0.8690        0.0497          0.1281
trojan_8x8   w/    0.9178       0.9328         0.9423          0.8675        0.0782          0.2419

Effects of different OoD images for watermark injection. In Table 5.5, we use different source images to generate surrogate datasets and inject watermarks into a pre-trained model. The model is pre-trained on CIFAR-10. From these results, we observe that the choice of the OoD image for injection is also important. Dense images such as "City" and "Animals" can produce a higher OoDWSR than the sparse image "Bridge", since more knowledge is included in the visual representations of dense source images. Thus, dense images perform better for backdoor-based watermark injection.
This observation is also consistent with some previous arts [6, 5] about single image representations, which found that dense images perform better for model distillation or self-supervised learning. Effects of backdoor weight perturbation. We show the results in Fig. 5.4. The initial model is WideResNet pre-trained on CIFAR-10, and the fine-tuned model is the model fine-tuning using our proposed method. If the OoD data is directly utilized to fine-tune the pre-trained models with only a few epochs, the weight distribution is almost identical for pre-trained and fine-tuned models (left figure). According to [40], if the parameter perturbations are small, the backdoor-based watermark can be easily removed by fine-tuning 80 or adding random noise to the model’s parameters. Our proposed watermark injection WP (right figure) can shift the fine-tuned model parameters from the pre-trained models in a reasonable scale compared with the left one, while still maintaining high standard accuracy and watermark success rate as shown in Table 5.6. Besides, the weight distribution of the perturbed model still follows a normal distribution as the unperturbed model, performing statistical analysis over the model parameters distributions will not be able to erase our watermark. To show the effects of WP, we conduct the attack RT-AL on CIFAR-10 as an example. From Table 5.6, we observe that WP does not affect the model utility, and at the same time, it will become more robust against stealing threats, since OoDWSR increases from 19.94% and 12.81% to 57.52% and 24.19%, respectively, for two triggers. More results for WP can be referred to ?? 13. 5.4.5 Summary In this chapter, we proposed a novel and practical watermark injection method that does not require training data and utilizes a single out-of-distribution image in a sample-efficient and time-efficient manner. We designed a robust weight perturbation method to defend against watermark removal attacks. Our extensive experiments on three benchmarks showed that our method efficiently injected watermarks and was robust against three watermark removal threats. Our approach has various real-world applications, such as protecting purchased models by encoding verifiable identity and implanting server-side watermarks in distributed learning when ID data is not available. 81 CHAPTER 6 WHO LEAKED THE MODEL? TRACKING IP INFRINGERS IN ACCOUNTABLE FEDERATED LEARNING This chapter is based on the following work: Who Leaked the Model? Tracking IP Infringers in Accountable Federated Learning. Shuyang Yu, Junyuan Hong, Yi Zeng, Fei Wang, Ruoxi Jia, Jiayu Zhou. Regulatable ML workshop @NeurIPS2023. 6.1 Introduction Federated learning (FL) [73] has been widely explored as a distributed learning paradigm to enable remote clients to collaboratively learn a central model without sharing their raw data, effectively leveraging the massive and diverse data available in clients for learning and protecting the data confidentiality. The learning process of FL models typically requires the coordination of significant computing resources from a multitude of clients to curate the valuable information in the client’s data, and the FL models usually have improved performance than isolated learning and thus high commercial value. Recently, the risk of leaking such high-value models has drawn the attention of the public. One notable example is the leakage of the foundation model from Meta [164] by users who gained the restricted distribution of models. 
The leakage through restricted distribution could be even more severe in FL, which allows all participating clients to gain access to the valued model. For each iterative communication round, a central server consolidates models from various client devices, forming a global or central model. This model is then disseminated back to the clients for the next update, and therefore malicious clients have full access to the global models. As such, effectively protecting the global models in FL is a grand challenge.
Watermarking techniques [2, 19, 26, 34, 162, 192] have recently been introduced to verify the IP ownership of models. Among them, backdoor-based watermarking shows strong applicability because of its model-agnostic nature: it repurposes backdoor attacks on deep models and uses special-purposed data (a trigger set) to insert hidden patterns in the model that produce undesired outputs given inputs with triggers [192, 79, 41, 90]. A typical backdoor-based watermarking operates as follows. The model owner first generates a trigger set consisting of samples paired with pre-defined target labels. The owner then embeds the watermark into the model by fine-tuning the model with the trigger set and the original training samples. To establish the ownership of the model, one evaluates the accuracy of the suspect model on the trigger set. The mechanism rests on the assumption that only the watermarked model would perform exceptionally well on the unique trigger set. If the model's accuracy on the trigger set surpasses a significant threshold, the model likely belongs to the owner.

Figure 6.1: The proposed Decodable Unique Watermarking (DUW) for watermark injection and verification. During watermark injection, the server first uses client-unique keys and an OoD dataset as the input to the pre-trained encoder to generate trigger sets. When the server implants the watermark based on the objective function J′(θ_k) (Eq. (6.2)), a decoder is utilized to replace the classifier head. During verification, the suspect model is tested on all the trigger sets, and the client that leaked the model is identified as the one that achieves the highest WSR (Eq. (1.6)) on its trigger set.

Conventional backdoor-based watermarking, however, does not apply to FL settings because of the required access to the training data to maintain model utility. To address the challenge, Tekgul et al. [155] proposed WAFFLE, which utilized only random noise and class-consistent patterns to embed a backdoor-based watermark into the FL model. However, since WAFFLE injects a unified watermark for all the clients, it cannot answer another critical question: Who is the IP infringer among the FL clients? Based on WAFFLE, Shao et al. [143] introduced a two-step method, FedTracker, to verify the ownership of the model with the central watermark from WAFFLE, and to track the malicious clients in FL by embedding unique local fingerprints into local models.
However, the local fingerprint in [143] is a parameter-based method, which is not applicable for many practical scenarios, where many re-sale models are reluctant to expose their parameters, and the two-step verification is redundant. Therefore, how to spend the least effort on changing the model while verifying and tracking the IP infringers using the same watermark in FL remains to be a challenging problem. The aforementioned challenges call for a holistic solution towards accountable federated learning, which is characterized by the following essential requirements: R1) Accurate IP tracking: Each client has a unique ID to trace back. IP tracking should be confident to iden- tify one and only one client. R2) Confident verification: The ownership verification should be confident. R3) Model utility: The watermark injected should have minimal impact on stan- dard FL accuracy. R4) Robustness: The watermark should be robust and resilient against various watermark removal attacks. In this chapter, we propose a practical watermarking framework for FL called Decodable Unique Watermarking (DUW) to comply with these requirements. Specifically, we first generate unique trigger sets for each client by using a pre-trained encoder [101] to embed client-wise unique keys to one randomly chosen out-of- distribution (OoD) dataset. During each communication round, the server watermarks the aggregated global model using the client-wise trigger sets before dispatching the model. A decoder replaces the classifier head in the FL model during injection so that we can decode the model output to the client-wise keys. We propose a regularized watermark injection op- timization process to preserve the model’s utility. During verification, the suspect model is tested on the trigger sets of all the clients, and the client that achieves the highest watermark success rate (WSR) is considered to be the IP infringer. The framework of method is shown in Fig. 6.1. The contributions of our work can be summarized in three folds: 84 • We make the FL model leakage from anonymity to accountability by injecting DUW. DUW enables ownership verification and leakage tracing at the same time without access to model parameters during verification. • With utility preserved, both the ownership verification and IP tracking of our DUW are not only accurate but also confident without collisions. • Our DUW is robust against existing watermarking removal attacks, including fine-tuning, pruning, model extraction, and parameter perturbation. 6.2 Related Work and Background Federated learning (FL) is a distributed learning framework that enables massive and remote clients to collaboratively train a high-quality central model [74]. This chapter targets the cross-silo FL with at most hundreds of clients [120]. In the cross-silo setting, each client is an institute, like a hospital or a bank. It is widely adopted in practical scenario [9, 151, 198, 155]. FedAvg [121] is one of the representative methods for FL, which averages local models during aggregation. This work is based on the FedAvg. The learning process and objective function can be found in Section 1.3.4. DNN watermarking in FL. The introduction of the centralized DNN watermarking can be found in Section 5.2.1. WAFFLE [155] is the first FL backdoor-based watermarking, which utilized random noise and class-consistent patterns to embed a backdoor-based wa- termark into the FL model. However, WAFFLE can only verify the ownership of the model, yet it cannot track the specific IP infringers. 
6.3 Method Watermarking has shown to be a feasible solution for IP verification, and the major goal of this work is to seek a powerful extension for traceable IP verification for accountable FL that can accurately identify the infringers among a scalable number of clients. A straightfor- ward solution is injecting different watermarks for different clients. However, increasing the 85 number of watermarks could lower the model’s utility as measured by the standard accuracy due to increased forged knowledge [153] (R3). Meanwhile, maintaining multiple watermarks could be less robust to watermark removal because of the inconsistency between injections (R4). Accurate IP tracking (R1) is one unique requirement we seek to identify the infringer’s identity as compared with traditional watermarking in central training. The greatest chal- lenge in satisfying R1 is addressing the watermark collisions between different clients. A watermark collision is when the suspect model produces similar watermark responses on different individual verification datasets in FL systems. Formally: Definition 6.3.1 (Watermark collision). During verification in Definition 1.3.1, we test the suspect model Ms on all the verification datasets DT = {DT1, . . . , DTk, . . . , DTK } of all the clients to identify the malicious client, and WSR for the k-th verification datasets is defined as WSRk. If we have multiple clients k satisfying WSRk = Acc(Ms, DTk) > σ, the ownership of suspect model Ms can be claimed for more than one client, then the watermark collisions happen between clients. 6.3.1 Pitfalls for Watermark Collision To avoid watermark collision, one straightforward solution is to simply design different trigger sets for different clients. However, this strategy may easily lead to the watermark-collision pitfall. We use traditional backdoor-based watermarking by adding arbitrary badnet [45] triggers using random noise or 0-1 coding trigger for each client as examples to demonstrate this pitfall. We conduct the experiments on CIFAR-10 with 100 clients, during 4 injection rounds, at least 89% and 87% of the clients have watermark collisions for two kinds of triggers, respectively. To analyze why these backdoor-based watermarkings lead us into the trap, we list all the clients with watermark collisions for one trial, and define the client_ID with the highest WSR as the predicted client_ID. We found that 87.5% of the predicted client_ID share the same target label as the ground truth client, and for the rest 12.5% clients, both the trigger pattern 86 and target label are different. Based on the results, we summarize two possible reasons: 1) The same target labels will easily lead to the watermark collision. 2) The trigger pattern differences between clients are quite subtle, so the differences between the watermarked models for different clients are hard to detect. Thus, in order to avoid this pitfall, we have to ensure the uniqueness of both the triggers and target labels between different clients. More experiment settings and results for pitfalls can be referred to Section 6.4.2. 6.3.2 Decodable Unique Watermarking In this section, we propose the Decodable Unique Watermark (DUW) that can simultane- ously address the four requirements of accountable FL, which are summarized in Section 6.1: R1 (accurate IP tracking), R2 (confident verification), R3 (model utility), R4 (robustness). In DUW, all the watermarking is conducted on the server side, so no computational overhead is introduced to clients. 
Before broadcasting the global model to each local client, the server injects a unique watermark for each client. The watermark is unknown to the clients but known to the server (see Fig. 6.1, server watermark injection). Our DUW consists of the following two steps for encoding and decoding the client-unique keys.
Step 1: Client-unique trigger encoding. Due to the data confidentiality of FL, the server has no access to any data from any of the clients. Therefore, for watermark injection, the server needs to collect or synthesize some OoD data for trigger set generation. The performance of the watermark is not sensitive to the choice of the OoD dataset. To accurately track the malicious client, we have to distinguish between the watermarks of different clients: high similarity between the trigger sets of different clients is likely to cause watermark collisions among the clients (see Section 6.3.1), which makes it difficult to identify which client leaked the model. To solve this problem, we propose to use a pre-trained encoder E : X → X governed by θE from [101] to generate unique trigger sets for each client. This backdoor-based method injects watermarks with close to 100% WSR, which ensures confident verification (R2). We design a unique key for each client ID as a one-hot binary string to differentiate clients. For instance, for the k-th client, the k-th entry of the key string sk is 1, and all other entries are 0. We set the length of the key to d, where d ≥ K. For each client, the key is embedded into sample-wise triggers on the OoD samples by feeding the unique key and the OoD data to the pre-trained encoder; the outputs of the encoder make up the trigger set. The trigger set for the k-th client is defined as DTk = {(x′, tk) | x′ = E(x, sk; θE), x ∈ DOoD}, where DOoD is a randomly chosen OoD dataset and tk is the target label for client k. In this way, the trigger sets of different clients differ by their unique keys, and watermark collision can be alleviated (R1). Note that our trigger sets also serve as the verification datasets.
Step 2: Client-unique target label by decoding triggers to client keys. The main intuition is that sharing the same target label across trigger sets may still lead to watermark collisions even if the keys are different (see Section 6.3.1). Thus, we propose to project the output dimension of the original model M to a higher dimension, larger than the client number K, so that each client can have a unique target label. To achieve this goal, we first set the target label tk in the trigger set DTk to be the same as the input key sk for each client, and then use a decoder D : Z → Y parameterized by θD to replace the corresponding classifier h in the FL training model M. The decoder D has only one linear layer, whose input dimension equals the input dimension of h and whose output dimension equals the length of the key. To avoid watermark collisions between clients induced by the target label, we make the decoder weights orthogonal to each other at random initialization so that the watermark injection tasks of different clients are independent (R1). The decoder weights are frozen once initialized to preserve this independence across clients.
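To make these two steps concrete, the sketch below shows one way the client-unique keys, the encoded trigger sets, and the frozen orthogonal decoder could be set up in PyTorch. It is a minimal illustration under assumptions: pretrained_encoder stands in for the sample-specific trigger encoder of [101] (its call signature here is hypothetical), and the client count, key length d, and feature dimension are placeholder values.

```python
# Minimal sketch (illustrative, not the exact implementation): client-unique keys,
# trigger-set generation with a pre-trained encoder, and a frozen orthogonal decoder.
import torch
import torch.nn as nn

num_clients = 100   # K
key_dim = 128       # d >= K, length of the one-hot key s_k
feat_dim = 512      # output dimension of the feature extractor f (assumed)

def make_client_key(client_id: int, d: int = key_dim) -> torch.Tensor:
    """One-hot key s_k: the k-th entry is 1 and all other entries are 0."""
    key = torch.zeros(d)
    key[client_id] = 1.0
    return key

def build_trigger_set(ood_images: torch.Tensor, client_id: int, pretrained_encoder):
    """Embed client k's key into every OoD image to form its trigger/verification set.
    The target label of these samples is the key itself, decoded later by D."""
    keys = make_client_key(client_id).expand(ood_images.size(0), -1)
    with torch.no_grad():
        # x' = E(x, s_k; theta_E); the two-argument call is a hypothetical signature
        triggered = pretrained_encoder(ood_images, keys)
    return triggered, keys

# Single-layer decoder D: feature space -> key space. Rows are orthogonally
# initialized and then frozen, so per-client injection tasks stay independent.
decoder = nn.Linear(feat_dim, key_dim, bias=False)
nn.init.orthogonal_(decoder.weight)
decoder.weight.requires_grad_(False)
```

Only the feature extractor is updated later during injection; keeping the orthogonally initialized decoder frozen is the design choice that preserves the independence of the per-client injection tasks.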
Suppose θk = (θf_k, θh_k) are the parameters to be broadcast to client k, where θf_k parameterizes the feature extractor f and θh_k the classifier h. We formulate the injection optimization as

min_{θf_k} J(θf_k), where J(θf_k) := (1/|DTk|) Σ_{(x′, sk)∈DTk} ℓ(D(f(x′; θf_k); θD), sk).    (6.1)

The classifier h is plugged back into the model before the server broadcasts the watermarked models to the clients. Compared with traditional backdoor-based watermarking, our watermark injection requires no client training samples, which ensures the data confidentiality of FL.
Robustness. Our framework also brings robustness against fine-tuning-based watermark removal (R4). The main intuition is that replacing the classifier h with the decoder D separates the watermark injection task space from the original classification task space. Since the malicious clients have no access to the decoder and can only attack the model M, their attacks affect the classification task more than our watermark injection task, which makes our decodable watermark more resilient against watermark removal attacks.
Algorithm 4 Injection of Decodable Unique Watermarking (DUW)
1: Require: client datasets {Dk}_{k=1}^K, OoD dataset DOoD, secret keys {sk}_{k=1}^K, pre-trained encoder E, pre-defined decoder D, global parameters θg, local parameters {θk}_{k=1}^K, learning rates α, β, local training steps T, watermark injection steps Tw.
2: Step 1: Client-unique trigger encoding.
3: for k = 1, . . . , K do
4:   Generate the trigger set for client k: DTk = {(x′, sk) | x′ = E(x, sk; θE), x ∈ DOoD}
5: end for
6: Step 2: Decoding triggers to client keys.
7: repeat
8:   Server selects active clients A uniformly at random
9:   for all clients k ∈ A do
10:    Server initializes the watermarked model for client k as θk ← θg.
11:    for t = 1, . . . , Tw do
12:      Server replaces the model classifier h with the decoder D.
13:      Server injects the watermark using trigger set DTk and updates θf_k as: θf_k ← θf_k − β∇_{θf_k} J′(θf_k).   ▷ Optimize Eq. (6.2)
14:    end for
15:    Server broadcasts θk to the corresponding client k.
16:    for t = 1, . . . , T do
17:      Client local training on the local set Dk: θk ← θk − α∇_{θk} Jk(θk).   ▷ Optimize Eq. (1.7)
18:    end for
19:    Client k sends θk back to the server.
20:  end for
21:  Server updates θg ← (1/|A|) Σ_{k∈A} θk.
22: until training stops
6.3.3 Injection Optimization with Preserved Utility
As the number of clients increases, watermark injection in the OoD region may lead to a significant drop in standard FL accuracy (R3) because of the overload of irrelevant knowledge. An ideal solution would be to bundle the injection with training on in-distribution (ID) data, which, however, is impractical for a data-free server. Meanwhile, lacking ID data to maintain standard task accuracy, the distinct information between the growing watermark sets and the task sets can cause the task knowledge to fade out. We attribute such knowledge vanishing to the divergence in parameter space between the watermarked and the original models. Thus, we propose to augment the injection objective in Eq. (6.1) with an l2 regularization on the parameters:

J′(θf_k) := J(θf_k) + (β/2) ∥θf_k − θf_g∥²,    (6.2)

where θf_g denotes the feature-extractor parameters of the original (non-watermarked) global model. The regularization term in Eq. (6.2) restricts the distance between the watermarked model and the non-watermarked one so that the utility of the model is better preserved (R3). Our proposed DUW is summarized in Algorithm 4.
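As a rough illustration of the per-client injection step performed by the server in Algorithm 4, the following sketch optimizes only the feature extractor on the client's trigger set through the frozen decoder and adds the l2 proximal term of Eq. (6.2). It is a sketch under assumptions: feature_extractor, decoder, and trigger_loader are placeholders, cross-entropy against the key index is used as one natural choice for the loss ℓ, and the step count and learning rate are illustrative rather than the exact values used in our experiments.

```python
# Illustrative sketch of the regularized injection objective J'(theta_f) in Eq. (6.2).
import copy
import torch
import torch.nn.functional as F

def inject_watermark(feature_extractor, decoder, trigger_loader,
                     steps=10, lr=1e-3, beta=0.1):
    """Fine-tune only f so that D(f(x')) decodes to the client key, while an l2
    proximal term keeps f close to the non-watermarked global parameters."""
    global_state = copy.deepcopy(feature_extractor.state_dict())  # theta_f_g
    opt = torch.optim.SGD(feature_extractor.parameters(), lr=lr)
    data_iter = iter(trigger_loader)

    for _ in range(steps):
        try:
            x_trig, keys = next(data_iter)
        except StopIteration:
            data_iter = iter(trigger_loader)
            x_trig, keys = next(data_iter)

        opt.zero_grad()
        logits = decoder(feature_extractor(x_trig))          # D(f(x'; theta_f); theta_D)
        loss = F.cross_entropy(logits, keys.argmax(dim=1))   # l(D(f(x')), s_k)

        # (beta / 2) * || theta_f - theta_f_g ||^2
        prox = sum(((p - global_state[name]) ** 2).sum()
                   for name, p in feature_extractor.named_parameters())
        (loss + 0.5 * beta * prox).backward()
        opt.step()
    return feature_extractor
```

In the full algorithm, the server would run such a routine for each active client before broadcasting and plug the original classifier h back in afterwards.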
6.3.4 Verification
During verification, we not only verify whether the suspect model Ms = (fs, hs) is a copy of our model M, but also track who the leaker is among all the clients by examining whether the triggers can be decoded into the corresponding keys. To achieve this goal, we first use our decoder D to replace the classifier hs in the suspect model Ms, so that the suspect model is restructured as Ms = (fs, D). Following Definition 6.3.1, we test the suspect model Ms on the verification datasets DT = {DT1, . . . , DTk, . . . , DTK} of all the clients to track the malicious client, and report WSRk on the k-th verification dataset. The client whose verification dataset achieves the highest WSR is the one who leaked the model (see Fig. 6.1, server verification). The tracking mechanism is defined as Track(Ms, DT) = arg max_k WSRk. Suppose the ground-truth malicious client is km. If WSRkm > σ while WSRk for every other verification dataset is smaller than σ, then the ownership of the model can be verified and no watermark collision happens. If Track(Ms, DT) = km, the malicious client is identified correctly.
6.4 Experiments
In this section, we empirically show how our proposed DUW fulfills the requirements (R1-R4) for tracking infringers described in Section 6.1.
Datasets. To simulate the class non-iid FL setting, we use CIFAR-10 and CIFAR-100 [75], which contain 32 × 32 images with 10 and 100 classes, respectively. The CIFAR-10 data is uniformly split into 100 clients, with 3 random classes assigned to each client. The CIFAR-100 data is split into 100 clients following a Dirichlet distribution. For CIFAR-10 and CIFAR-100, the OoD dataset used for watermark injection is a subset of ImageNet-DS [20] with 500 randomly chosen samples downsampled to 32 × 32. To simulate the feature non-iid FL setting, we adopt a multi-domain FL benchmark, Digits [95, 57], which consists of 28 × 28 images of 10 digit classes and is widely used in the community [14, 121]. Digits includes five different domains: MNIST [82], SVHN [128], USPS [60], SynthDigits [38], and MNIST-M [38]. We leave out USPS as the OoD dataset for watermark injection (500 samples are chosen) and use the remaining four domains for standard FL training. Each domain of Digits is split into 10 clients, so 40 clients participate in the FL training.
Training setup. A preactivated ResNet (PreResNet18) [48] is used for CIFAR-10, a preactivated ResNet (PreResNet50) [48] for CIFAR-100, and the CNN defined in [98] for Digits. For all three datasets, we leave out 10% of the training set as a validation dataset to select the best FL model. The total number of training rounds is 300 for CIFAR-10 and CIFAR-100, and 150 for Digits.
Watermark injection. Since the early training stage of FL, where the standard accuracy is still low, is not worth protecting, we start watermark injection at round 20 for CIFAR-10 and Digits, and at round 40 for CIFAR-100. The standard accuracy before our watermark injection is 85.20%, 40.23%, and 29.41% for Digits, CIFAR-10, and CIFAR-100, respectively.
Evaluation metrics. For watermark verification, we use the watermark success rate (WSR), i.e., the accuracy on the trigger set. To measure whether we track the malicious client (leaker) correctly, we define the tracking accuracy (TAcc) as the rate of clients we track correctly.
To further evaluate the ability of our method to distinguish between the watermarks of different clients, we also report the difference between the highest and the second-highest WSR as WSR_Gap, which indicates the significance of verification and IP tracking; with a significant WSR_Gap, no watermark collision will happen. To evaluate the utility of the model, we report the standard FL accuracy (Acc) on each client's individual test set, whose classes match its training set, and the accuracy degradation (∆Acc) of the watermarked model compared with the non-watermarked one. (An illustrative code sketch of these quantities and the tracking rule is given at the end of Section 6.4.2.) Note that, to simulate the scenario where malicious clients leak their local models after local training, we report the average WSR, TAcc, and WSR_Gap of the local model of each client instead of the global model. Acc and ∆Acc are evaluated on the best FL model selected using the validation datasets.
6.4.1 IP Tracking Benchmark
We evaluate our method on the IP tracking benchmark with the metrics shown in Table 6.1. Our ownership verification is confident, with all WSRs over 99% (R2). The model utility is preserved, with accuracy degradations of 2.34%, 0.03%, and 0.63% for Digits, CIFAR-10, and CIFAR-100, respectively (R3). TAcc for all benchmark datasets is 100%, which indicates accurate IP tracking (R1). All WSR_Gaps are over 98%, which means the WSRs on all other benign clients' verification datasets are close to 0%. In this way, the malicious client can be tracked accurately and with high confidence, and no collisions occur within our tracking mechanism (R1).

Dataset     Acc     ∆Acc    WSR     WSR_Gap  TAcc
Digits      0.8855  0.0234  0.9909  0.9895   1.0000
CIFAR-10    0.5583  0.0003  1.0000  0.9998   1.0000
CIFAR-100   0.5745  0.0063  1.0000  0.9998   1.0000
Table 6.1: Benchmark results.

Figure 6.2: Validation accuracy (a), WSR (b), and TAcc (c) for the proposed DUW and the two baseline triggers on CIFAR-10 over 4 communication rounds.
6.4.2 Comparison with traditional backdoor-based watermarks
We compare our proposed DUW with two traditional backdoor-based watermarks in Fig. 6.2. Because watermark collision is guaranteed if all clients share the same trigger, we design different triggers for different clients. Specifically, we use traditional backdoor-based watermarking by adding arbitrary BadNets triggers based on random noise or a 0-1 coding pattern for each client. To distinguish between clients, for the 0-1 trigger, following [153], we set 5 pixel values of the pattern to 0 and the other 11 pixels to 1, and different combinations of the pattern are randomly chosen for different clients; for random noise triggers, we generate a different random noise trigger for each client. The trigger size is 4 × 4, and the injection is conducted for 4 rounds. The target label for each client is set to (client_ID % class_number). According to the results, traditional backdoor-based watermarks can only achieve a tracking accuracy lower than 13% (which drops even further as the number of communication rounds increases), far below the 100% tracking accuracy we achieve. Note that the rate of clients with watermark collisions can be calculated as 1-TAcc. Extended results on the failure of traditional backdoor-based watermarking can be found in ?? 14.
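For concreteness, the sketch below shows how the quantities above (the per-client WSR, the tracking rule Track(Ms, DT) of Section 6.3.4, WSR_Gap, and the collision condition of Definition 6.3.1) can be computed once the suspect model's classifier has been replaced by our decoder. It is an illustrative sketch: suspect_model is assumed to be a callable returning key logits, trigger_loaders is a hypothetical list of per-client verification loaders, and the threshold value for σ is a placeholder.

```python
# Illustrative sketch of WSR, WSR_Gap, the Track rule, and the collision check.
import torch

def wsr(suspect_model, trigger_loader, client_id):
    """Watermark success rate: fraction of trigger samples decoded to client k's key."""
    correct, total = 0, 0
    with torch.no_grad():
        for x_trig, _ in trigger_loader:
            pred = suspect_model(x_trig).argmax(dim=1)   # decoded key index
            correct += (pred == client_id).sum().item()
            total += x_trig.size(0)
    return correct / max(total, 1)

def track(suspect_model, trigger_loaders, sigma=0.5):
    """Return (predicted leaker, WSR_Gap, collision flag) over all clients' sets."""
    wsrs = [wsr(suspect_model, loader, k) for k, loader in enumerate(trigger_loaders)]
    ranked = sorted(range(len(wsrs)), key=lambda k: wsrs[k], reverse=True)
    wsr_gap = wsrs[ranked[0]] - wsrs[ranked[1]]
    collision = sum(w > sigma for w in wsrs) > 1         # Definition 6.3.1
    return ranked[0], wsr_gap, collision
```

TAcc is then simply the fraction of simulated leakage events for which the first value returned by track matches the ground-truth leaker.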
6.4.3 Robustness
Malicious clients can conduct watermark removal attacks before leaking the FL model to make it harder for us to verify the model copyright and accurately track the IP infringers. In this section, we show the robustness of the watermarks under various watermark removal attacks (R4). Specifically, we evaluate our method against 1) fine-tuning [2]: fine-tune the model using the attacker's own local data; 2) pruning [110]: prune the model parameters with the smallest absolute values according to a certain pruning rate, and then fine-tune the model on local data; 3) model extraction: first query the victim model for the labels of an auxiliary dataset, and then re-train the victim model on the annotated dataset; we take knockoff [130] as an example of the model extraction attack; 4) parameter perturbation: add random noise to the local model parameters [40]. Ten clients are selected as malicious clients, and the metrics in this section are averaged over these 10 malicious clients. All watermark removal attacks are conducted for 50 epochs with a learning rate of 10−5, and all attacks are applied to the local model of the last round.
Robustness against fine-tuning attack. We report the robustness of our proposed DUW against fine-tuning in Table 6.2. ∆Acc and ∆WSR in this table denote the accuracy and WSR drop compared with the values before the attack. According to the results, after 50 epochs of fine-tuning, the attacker can only decrease the WSR by less than 1%, and the TAcc is not affected at all. Fine-tuning with the attacker's limited local training samples can also cause standard accuracy degradation. Thus, fine-tuning can neither remove our watermark nor affect our IP tracking, even though it sacrifices standard accuracy.

Dataset     Acc     ∆Acc     WSR     ∆WSR    TAcc
Digits      0.9712  -0.0258  0.9924  0.0030  1.0000
CIFAR-10    0.7933  0.1521   1.0000  0.0000  1.0000
CIFAR-100   0.4580  0.0290   0.9930  0.0070  1.0000
Table 6.2: DUW is robust against fine-tuning.

Robustness against pruning attack. We investigate the effect of pruning in Fig. 6.3 by varying the pruning rate from 0 to 0.5. As the pruning ratio increases, neither TAcc nor WSR is affected, while for CIFAR-10 the standard accuracy drops by 5%. Therefore, pruning is not an effective attack on our watermark, and it even causes accuracy degradation on the classification task.
Figure 6.3: DUW is robust against pruning ((a) Digits, (b) CIFAR-10, (c) CIFAR-100).
Figure 6.4: DUW is robust against parameter perturbation ((a) Digits, (b) CIFAR-10, (c) CIFAR-100).
Robustness against model extraction attack. To verify the robustness of our proposed DUW against the model extraction attack, we take knockoff [130] as an example, and STL10 [23], cropped to the same size as the training data, is used as the auxiliary dataset for this attack (a short illustrative sketch of this attack is given after Table 6.3). According to the results for the three benchmark datasets in Table 6.3, after the knockoff attack, the WSR for all three datasets is still over 65%, and our tracking mechanism is unaffected, with TAcc remaining 100%. Therefore, our DUW is resilient to model extraction attacks.

Dataset     Acc     ∆Acc    WSR     ∆WSR    TAcc
Digits      0.8811  0.0643  0.9780  0.0174  1.0000
CIFAR-10    0.5176  0.4278  0.6638  0.3362  1.0000
CIFAR-100   0.4190  0.0680  0.8828  0.1172  1.0000
Table 6.3: DUW is robust against model extraction.
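For reference, the following is a minimal sketch of the label-only, knockoff-style extraction attack described above: the attacker queries the leaked model for hard labels on an auxiliary dataset (STL10 crops in our experiments) and re-trains on those pseudo-labels. The model and loader names are illustrative; the 50 epochs and 10−5 learning rate follow the attack setting stated earlier, and the surrogate may simply be a copy of the leaked model itself.

```python
# Illustrative sketch of a label-only (knockoff-style) model extraction attack.
import torch
import torch.nn.functional as F

def knockoff_extract(victim, surrogate, aux_loader, epochs=50, lr=1e-5):
    """Query `victim` for hard labels on auxiliary data and train `surrogate` on them."""
    opt = torch.optim.SGD(surrogate.parameters(), lr=lr)
    victim.eval()
    for _ in range(epochs):
        for x_aux, _ in aux_loader:                       # unlabeled auxiliary images
            with torch.no_grad():
                pseudo_y = victim(x_aux).argmax(dim=1)    # labels stolen from the victim
            loss = F.cross_entropy(surrogate(x_aux), pseudo_y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return surrogate
```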
Robustness against parameter perturbation attack. Malicious clients can also add random noise to the model parameters to remove watermarks, since [40] found that backdoor-based watermarks are usually not resilient to parameter perturbations. Adding random noise to the local model parameters can also increase the chance of blurring the difference between different watermarked models. We let each malicious client blend Gaussian noise into the parameters of its local model, setting the local model parameters to θi ← θi + αnoise · ϵ, where ϵ is Gaussian noise and αnoise ∈ {10−5, 10−4, 10−3, 10−2, 10−1} is the noise coefficient. We investigate the effect of parameter perturbation in Fig. 6.4. According to the results, when αnoise is smaller than 10−2, WSR, Acc, and TAcc are not affected. When αnoise = 10−2, Acc drops by more than 10%, while TAcc remains unchanged and WSR is still over 90%. When αnoise = 10−1, Acc drops to random guessing; thus, although the watermark has been removed, the model has no utility. Therefore, parameter perturbation is not an effective attack for removing our watermark or defeating our tracking mechanism.
6.4.4 Qualitative Study
Effects of decoder. To investigate the effect of the decoder on avoiding watermark collision, we compare the results with (w/) and without (w/o) the decoder. When the decoder is removed, the output dimension of the watermark injection task is the same as that of the FL classification task; thus, we also have to map the original target label (the same as the input key) into the FL classification label space. To achieve this, we set the target label in the w/o decoder case to (client_ID % class_number). We report the results of w/ and w/o decoder on CIFAR-10 after 1 round of watermark injection at round 20 in Table 6.4. According to the results, when we have 100 clients in total, w/o decoder can only achieve a TAcc of 6%, while w/ decoder increases TAcc to 100%. We also find that clients with the same target label are more likely to conflict with each other, which makes those clients difficult to identify even if their trigger sets are different. Utilizing a decoder to enlarge the target label space to a dimension larger than the client number allows every client to have its own target label; in this way, watermark collision can be avoided. Besides, the WSR of w/ decoder is also higher than that of w/o decoder after 1 round of injection. One possible reason is that the decoder makes the watermark injection task different from the original classification task, so the watermark is more easily injected than when it is injected directly into the original FL classification task.

Method       Acc     ∆Acc    WSR     TAcc
w/ decoder   0.3287  0.0736  0.8778  1.0000
w/o decoder  0.3235  0.0788  0.8099  0.0600
Table 6.4: Effects of decoder: the decoder can improve TAcc to avoid watermark collision. ∆Acc in this table is the accuracy degradation compared with the previous round.
Effects of l2 regularization. To show the effect of the l2 regularization in Eq. (6.2), we report the validation accuracy and WSR over 4 rounds of watermark injection on Digits for different values of the hyperparameter β in Fig. 6.5. Validation accuracy is the standard FL accuracy evaluated on a validation dataset at every round. We see that as β increases, higher validation accuracy is achieved, but correspondingly the WSR drops from over 90% to only 35.65%. A larger β increases the impact of the l2 norm, which decreases the difference between the watermarked model and the non-watermarked one, so the validation accuracy increases; at the same time, the updates during watermark injection are more strongly restricted by the l2 regularization, so the WSR drops to a low value. Accordingly, we select β = 0.1 for all our experiments, since β = 0.1 increases validation accuracy by 6.88% compared with β = 0 while maintaining a WSR over 90%.
Figure 6.5: Acc and WSR for different values of β ((a) validation accuracy, (b) WSR).
Figure 6.6: Acc and WSR for different OoD datasets ((a) validation accuracy, (b) WSR).
Effects of different OoD datasets for watermark injection. We investigate the effect of different OoD datasets, including USPS [60], GTSRB [145], random noise, and jigsaw images, for watermark injection when the standard training data is Digits. All OoD images are cropped to the same size as the training images. A jigsaw image is generated from a small 4 × 4 random image, which is then padded to the same size as the training images using the reflect padding mode from PyTorch. The effect of these different OoD datasets is shown in Table 6.5 and Fig. 6.6. All OoD datasets achieve 100% TAcc, suggesting that the choice of OoD dataset does not affect the tracking of the malicious client. There is a trade-off between Acc and WSR: a higher WSR always comes with a lower Acc. Random noise and jigsaw achieve high Acc, with accuracy degradation within 1%; these two noise-like OoD datasets also recover the standard accuracy faster after the accuracy drop at the watermark injection round, as shown in Fig. 6.6, but their WSRs are lower than 90%. For the two real OoD datasets, USPS and GTSRB, the WSR quickly reaches over 99% after 1 communication round, but their accuracy degradation is larger than 2%.

Dataset       Acc     ∆Acc    WSR     WSR_Gap  TAcc
USPS          0.8855  0.0234  0.9909  0.9895   1.0000
GTSRB         0.8716  0.0373  0.9972  0.9962   1.0000
Random noise  0.9007  0.0082  0.8422  0.8143   1.0000
Jigsaw        0.9013  0.0076  0.8789  0.8601   1.0000
Table 6.5: Effects of different OoD datasets: a trade-off exists between Acc and WSR, given different OoD datasets.

Scalability of DUW to more clients. We conduct an ablation study on Digits to show the effect of the number of clients in Table 6.6. According to the results, even with 600 clients, the WSR is still over 73% and the TAcc remains 100%. With more clients participating in FL, we can still track the malicious client correctly with high confidence.
Number of clients  Acc     ∆Acc     WSR     WSR_Gap  TAcc
40                 0.8855  0.0234   0.9909  0.9895   1.0000
400                0.8597  -0.0332  0.9521  0.9267   1.0000
600                0.8276  -0.0035  0.7337  0.6383   1.0000
Table 6.6: Effects of different numbers of clients.

Hybrid watermark. If only black-box access to a suspect model is available, our DUW can be combined with existing black-box unified watermarks: we first identify IP leakage using black-box detection with a unified watermark, and then identify the infringer using DUW's client-unique watermarks. We design a simple hybrid watermark in this section as an example. We pick one of the trigger sets generated for the clients as the trigger set for the unified watermark injection and assign target label 0, which belongs to the original label set of the training data. We use this trigger set to fine-tune the entire global model for 10 steps before injecting our proposed DUW. Note that no decoder is used for the unified watermark, and the unified watermark can also be replaced with other existing schemes. The results on Digits are shown in Table 6.7. From this table, we observe that the unified watermark is injected successfully in the presence of our DUW, with a 98.82% WSR. Moreover, the effectiveness of our DUW is not affected, since the WSR of DUW only decreases by 0.72% and TAcc remains 100%. The model utility is also not affected, since the standard accuracy remains high.

Method                 Acc     ∆Acc    WSR     WSR_Gap  TAcc    Unified WSR
w/o unified watermark  0.8855  0.0234  0.9909  0.9895   1.0000  /
w/ unified watermark   0.8886  0.0203  0.9837  0.9701   1.0000  0.9882
Table 6.7: Results for hybrid watermark.

6.5 Discussions
Client-side watermarking vs. server-side watermarking. Client-side watermarking methods such as FedCIP [104], FedIPR [86], and Merkle-Sign [88] are used to claim co-ownership of the model, yet we argue that client-side watermarking has limitations that make it unsuitable for IP tracking. If one of the clients is the infringer who illegally distributes the model, the infringer will not reveal its own identity during the model verification process in order to avoid legal responsibility. Even if the ownership of the model can be claimed by a co-author, the real infringer cannot be tracked, since they remain anonymous. With our server-side watermark there is no such concern: the server can easily track the malicious client among all the clients.
Complexity. Clients experience no additional computation, as our DUW is carried out on the server side. The additional computation for the server is determined by the number of watermark injection steps Tw. We found that the WSR can reach 99% within only Tw = 10 steps, and injecting one client-unique watermark takes around 1 second. The server can embed the watermarks for all clients in parallel: since the watermarked models of different clients are independent and have no sequential relationship with one another, there is no need to serialize the injections. Thus, the delay caused by the server is negligible.
Future work. This work moves FL model leakage from anonymity to accountability by injecting client-unique watermarks. We recognize that the most significant challenge for accountable FL is addressing watermark collision for accurate IP tracking (R1). We believe it is important to scale our method from the cross-silo setting to the cross-device setting with more clients in the future. One plausible solution is to increase the input dimension of the encoder to allow more one-hot target labels.
Another solution is to use a hash function to produce the target label for each client; in this way, a lower-dimensional encoder and decoder can accommodate more clients. However, adopting hash functions as the target labels can increase the chance of watermark collisions between clients, and more elegant strategies would have to be developed to address this problem. As we focus on collision in this work, we leave scalability for future work.
6.5.1 Summary
In this chapter, we target accountable FL and propose Decodable Unique Watermarking (DUW), which can verify the FL model's ownership and track the IP infringer in the FL system at the same time. Specifically, the server embeds a client-unique key into each client's local model before broadcasting, and the IP infringer can be tracked according to the key decoded from the suspect model. Extensive experimental results show the effectiveness of our method in accurate IP tracking, confident verification, model utility preservation, and robustness against various watermark removal attacks.
CHAPTER 7
CONCLUSION
7.1 Overview
In this section, we summarize our contributions to the robustness and trustworthiness of machine learning models in diverse domains.
Robust UDA from a corrupted source. We propose a simple and computationally efficient method in which the individual models of an ensemble can be trained in parallel to accelerate learning. The proposed solution to UDA is generally robust against agnostic types of data corruption. In particular, our approach successfully tackles notorious backdoor attacks, where both the training samples and the corresponding labels can be maliciously modified by attackers. The learning framework we propose can be flexibly combined with available UDA approaches that are orthogonal to our work to improve their robustness under corrupted data.
Enhancing in-context learning for long-tail knowledge in LLMs. To improve the uncertain predictions of LLMs on long-tail knowledge, we propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval-augmented ICL with a budget controller. Specifically, it considers the dynamic impact of each retrieved sample based on the LLM's feedback. Our ranking system efficiently raises the ranks of more informative and stable samples and lowers the ranks of misleading samples. Evaluations on various QA datasets from different domains show that our proposed method outperforms all the baselines and especially improves the LLM's predictions on long-tail questions.
OoD detection in FL. We propose a novel federated OoD synthesizer that takes advantage of data heterogeneity to facilitate OoD detection in FL, allowing a client to learn external class knowledge from other non-iid federated collaborators in a privacy-aware manner. Our work bridges a critical research gap, since OoD detection for FL is not yet well studied in the literature. To our knowledge, the proposed Foster is the first OoD learning method for FL that does not require real OoD samples. Foster achieves state-of-the-art performance using only the limited ID data stored on each local device, compared with existing approaches that demand a large volume of OoD samples. The design of Foster considers both the diversity and the hardness of virtual OoD samples, making them closely resemble real OoD samples from other non-iid collaborators.
As a general OoD detection framework for FL, the proposed Foster remains effective in more challenging FL settings where the entire parameter-sharing process is prohibited due to privacy or communication concerns. This is because Foster only uses the classifier head for extracting external data knowledge.
Safe and robust watermark injection with a single OoD image. We propose a novel watermarking method based on OoD data, which fills the gap of backdoor-based IP protection of deep models without training data. Removing the need for access to the training data makes the proposed approach feasible in many real-world scenarios. The proposed watermarking method is both sample-efficient (one OoD image) and time-efficient (a few epochs) without sacrificing model utility. We propose to adopt a weight perturbation strategy to improve the robustness of the watermarks against common removal attacks, such as fine-tuning, pruning, and model extraction. We show the robustness of the watermarks through extensive empirical results, and they persist even in an unfair scenario where the removal attack uses a part of the in-distribution data.
Tracking IP infringers in FL. In this work, we move FL model leakage from anonymity to accountability by injecting DUW. DUW enables ownership verification and leakage tracing at the same time. With utility preserved, both the ownership verification and the IP tracking of our DUW are not only accurate but also confident, without collisions. Our DUW is robust against existing watermark removal attacks, including fine-tuning, pruning, model extraction, and parameter perturbation.
7.2 Future Work
One limitation of the thesis is that all methods target a single-modality setting, either vision or language. An important future direction is to extend these approaches to multi-modal settings. Real-world applications often involve diverse data sources, such as combinations of images, text, video, audio, electronic health records (EHR), etc. Exploring how to enhance both adaptiveness and trustworthiness in these more complex scenarios is a natural extension of this work. For instance, future research could investigate multi-modal watermarking techniques, or leverage Retrieval-Augmented Generation (RAG) and improve reasoning capabilities within multi-modal large language models (MLLMs) to enhance their adaptiveness to various downstream tasks. Advancing these directions will be crucial for building robust and reliable systems that can generalize across tasks and data types while maintaining interpretability and resilience to distribution shifts.
BIBLIOGRAPHY
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pages 1615–1631, 2018.
[3] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
[4] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint
arXiv:1912.00818, 2019. [5] [6] Yuki M Asano, Christian Rupprecht, and Andrea Vedaldi. A critical analysis of self-supervision, or what we can learn from a single image. arXiv preprint arXiv:1904.13132, 2019. Yuki M. Asano and Aaqib Saeed. Extrapolating from a single image to a thousand classes using distillation. In ICLR, 2023. [7] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017. [8] [9] Arthur Asuncion and David Newman. Uci machine learning repository, 2007. Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In International conference on arti- ficial intelligence and statistics, pages 2938–2948. PMLR, 2020. [10] Tadas Baltrušaitis, Marwa Mahmoud, and Peter Robinson. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 6, pages 1–6. IEEE, 2015. [11] Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. arXiv preprint arXiv:2010.12421, 2020. [12] Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019. 105 [13] Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [14] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečn`y, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018. [15] Jialuo Chen, Jingyi Wang, Tinglan Peng, Youcheng Sun, Peng Cheng, Shouling Ji, Xingjun Ma, Bo Li, and Dawn Song. Copy, right? a testing framework for copyright protection of deep learning models. In 2022 IEEE Symposium on Security and Privacy (SP), pages 824–841. IEEE, 2022. [16] Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. How many demonstrations do you need for in-context learning? arXiv preprint arXiv:2303.08119, 2023. [17] Lin Chen, Lixin Duan, and Dong Xu. Event recognition in videos by learning from heterogeneous web sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2666–2673, 2013. [18] Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012. [19] Xuxi Chen, Tianlong Chen, Zhenyu Zhang, and Zhangyang Wang. You are caught stealing my winning lottery ticket! making a lottery ticket claim its ownership. Ad- vances in Neural Information Processing Systems, 34:1780–1791, 2021. [20] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017. [21] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. [22] Nicolas Antonio Cloutier and Nathalie Japkowicz. Fine-tuned generative llm oversam- pling can improve performance over traditional techniques on multiclass imbalanced text classification. 
In 2023 IEEE International Conference on Big Data (BigData), pages 5181–5186. IEEE, 2023. [23] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011. [24] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017. [25] Yi Dai, Hao Lang, Yinhe Zheng, Fei Huang, and Yongbin Li. Long-tailed question answering in an open world. arXiv preprint arXiv:2305.06557, 2023. 106 [26] Bita Darvish Rouhani, Huili Chen, and Farinaz Koushanfar. Deepsigns: An end-to- end watermarking framework for ownership protection of deep neural networks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 485–497, 2019. [27] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language un- derstanding. arXiv preprint arXiv:1810.04805, 2018. [28] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub-gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016. [29] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019. [30] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. arXiv preprint arXiv:2010.01264, 2020. [31] Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li. Vos: Learning what you don’t know by virtual outlier synthesis. arXiv preprint arXiv:2202.01197, 2022. [32] Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018. [33] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019. [34] Lixin Fan, Kam Woh Ng, and Chee Seng Chan. Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. Advances in neural information processing systems, 32, 2019. [35] Gongfan Fang, Jie Song, Chengchao Shen, Xinchao Wang, Da Chen, and Mingli Song. Data-free adversarial distillation. arXiv preprint arXiv:1912.11006, 2019. [36] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and conse- quences. Minds and Machines, 30:681–694, 2020. [37] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208, 2017. [38] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backprop- agation. In International conference on machine learning, pages 1180–1189. PMLR, 2015. 107 [39] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial train- ing of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016. 
[40] Siddhant Garg, Adarsh Kumar, Vibhor Goel, and Yingyu Liang. Can adversarial weight perturbations inject neural backdoors. In Proceedings of the 29th ACM In- ternational Conference on Information & Knowledge Management, pages 2029–2032, 2020. [41] Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Mądry, Bo Li, and Tom Goldstein. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1563–1580, 2022. [42] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Ad- vances in neural information processing systems, 27, 2014. [43] Matej Grcić, Petra Bevandić, and Siniša Šegvić. Dense open-set recognition with synthetic outliers generated by real nvp. arXiv preprint arXiv:2011.11094, 2020. [44] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Eval- uating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019. [45] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Eval- uating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019. [46] Zhongyi Han, Xian-Jin Gui, Chaoran Cui, and Yilong Yin. Towards accurate and robust domain adaptation under noisy environments. arXiv preprint arXiv:2004.12529, 2020. [47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learningfor image recognition. CoRR, abs/1512, 3385:2, 2015. [48] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [49] Pengfei He, Han Xu, Jie Ren, Yingqian Cui, Hui Liu, Charu C Aggarwal, and Jiliang Tang. Sharpness-aware data poisoning attack. arXiv preprint arXiv:2305.14851, 2023. [50] Markus Heimberger, Jonathan Horgan, Ciarán Hughes, John McDonald, and Senthil Yogamani. Computer vision in automated parking systems: Design, implementation and challenges. Image and Vision Computing, 68:88–101, 2017. 108 [51] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41–50, 2019. [52] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016. [53] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018. [54] Alex Hern. Techscape: Will meta’s massive leak democratise ai – and at what cost? The Guardian, 2023. [55] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pages 409–426. Springer, 1994. [56] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adap- In International conference on machine learning, pages 1989–1998. PMLR, tation. 2018. [57] Junyuan Hong, Haotao Wang, Zhangyang Wang, and Jiayu Zhou. Efficient split- arXiv preprint mix federated learning for on-demand and in-situ customization. arXiv:2203.09747, 2022. 
[58] Junyuan Hong, Yi Zeng, Shuyang Yu, Lingjuan Lyu, Ruoxi Jia, and Jiayu Zhou. Revisiting data-free knowledge distillation with poisoned teachers. The Fortieth Inter- national Conference on Machine Learning, 2023. [59] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of arXiv preprint non-identical data distribution for federated visual classification. arXiv:1909.06335, 2019. [60] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Trans- actions on pattern analysis and machine intelligence, 16(5):550–554, 1994. [61] Hengrui Jia, Christopher A Choquette-Choo, Varun Chandrasekaran, and Nicolas Pa- pernot. Entangled watermarks as a defense against model extraction. In Proceedings of the 30th USENIX Security Symposium, 2021. [62] Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021. [63] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019. 109 [64] Sanghun Jung, Jungsoo Lee, Daehoon Gwak, Sungha Choi, and Jaegul Choo. Stan- dardized max logits: A simple yet effective approach for identifying unexpected road obstacles in urban-scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15425–15434, 2021. [65] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N Asokan. Prada: protecting against dnn model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), pages 512–527. IEEE, 2019. [66] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021. [67] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pages 15696–15707. PMLR, 2023. [68] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132– 5143. PMLR, 2020. [69] Byungjoo Kim, Suyoung Lee, Seanie Lee, Sooel Son, and Sung Ju Hwang. Margin- based neural network watermarking. 2023. [70] Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, and Jinwoo Shin. Sure: Summarizing retrievals using answer candidates for open-domain qa of llms. arXiv preprint arXiv:2404.13081, 2024. [71] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Interna- tional conference on machine learning, pages 1857–1865. PMLR, 2017. [72] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [73] Jakub Konečn`y, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015. [74] Jakub Konečn`y, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 
Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. [75] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 110 [76] M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. Advances in neural information processing systems, 23, 2010. [77] Minoru Kuribayashi, Takuro Tanaka, Shunta Suzuki, Tatsuya Yasui, and Nobuo Fun- abiki. White-box watermarking scheme for fully-connected layers in fine-tuning model. In Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Se- curity, pages 165–170, 2021. [78] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. [79] Erwan Le Merrer, Patrick Perez, and Gilles Trédan. Adversarial frontier stitching for remote neural network watermarking. Neural Computing and Applications, 32:9233– 9244, 2020. [80] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906–931, 2020. [81] Guillaume Lecué, Matthieu Lerasle, and Timlothée Mathieu. Robust classification via mom minimization. Machine Learning, 109(8):1635–1665, 2020. [82] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learn- ing applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [83] Yann LeCun, Corinna Cortes, and Chris Burges. Mnist handwritten digit database, 2010. [84] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018. [85] Alexander Levine and Soheil Feizi. Deep partition aggregation: Provable defense against general poisoning attacks. arXiv preprint arXiv:2006.14768, 2020. [86] Bowen Li, Lixin Fan, Hanlin Gu, Jie Li, and Qiang Yang. Fedipr: Ownership verifica- tion for federated deep neural network models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4521–4536, 2022. [87] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. Advances in neural information processing systems, 30, 2017. [88] Fang-Qi Li, Shi-Lin Wang, and Alan Wee-Chung Liew. Towards practical watermark for deep neural networks in federated learning. arXiv preprint arXiv:2105.03167, 2021. [89] Fangqi Li and Shilin Wang. Knowledge-free black-box watermark and ownership proof for image classification neural networks. arXiv preprint arXiv:2204.04522, 2022. 111 [90] Fangqi Li, Lei Yang, Shilin Wang, and Alan Wee-Chung Liew. Leveraging multi-task learning for umambiguous and flexible deep neural network watermarking. In SafeAI@ AAAI, 2022. [91] Huihan Li, Yuting Ning, Zeyi Liao, Siyuan Wang, Xiang Lorraine Li, Ximing Lu, Faeze Brahman, Wenting Zhao, Yejin Choi, and Xiang Ren. In search of the long- tail: Systematic generation of long-tail knowledge via logical rule guided search. arXiv preprint arXiv:2311.07237, 2023. [92] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! 
arXiv preprint arXiv:2305.06161, 2023. [93] Shaofeng Li, Minhui Xue, Benjamin Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE TDSC, 2020. [94] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020. [95] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020. [96] Xiang Li, Haoran Tang, Siyu Chen, Ziwei Wang, Ryan Chen, and Marcin Abram. Why does in-context learning fail sometimes? evaluating in-context learning on open and closed questions. arXiv preprint arXiv:2407.02028, 2024. [97] Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xi- aoling Wang, and Xipeng Qiu. Unified demonstration retriever for in-context learning. arXiv preprint arXiv:2305.04320, 2023. [98] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Fed- erated learning on non-iid features via local batch normalization. arXiv preprint arXiv:2102.07623, 2021. [99] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. arXiv:2007.08745, 2020. [100] Yue Li, Hongxia Wang, and Mauro Barni. A survey of deep neural network water- marking techniques. Neurocomputing, 461:171–193, 2021. [101] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In ICCV, 2021. [102] Jian Liang, Ran He, Zhenan Sun, and Tieniu Tan. Exploring uncertainty in pseudo- label guided unsupervised domain adaptation. Pattern Recognition, 96:106996, 2019. 112 [103] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020. [104] Junchuan Liang and Rong Wang. Fedcip: Federated client intellectual property pro- tection with traitor tracking. arXiv preprint arXiv:2306.01356, 2023. [105] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out- of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. [106] Ziqian Lin, Sreya Dutta Roy, and Yixuan Li. Mood: Multi-level out-of-distribution In Proceedings of the IEEE/CVF Conference on Computer Vision and detection. Pattern Recognition, pages 15313–15323, 2021. [107] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and arXiv preprint Weizhu Chen. What makes good in-context examples for gpt-3? arXiv:2101.06804, 2021. [108] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of- distribution detection. Advances in Neural Information Processing Systems, 33:21464– 21475, 2020. [109] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS, 2018. [110] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018. [111] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2537–2546, 2019. 
[112] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transfer- able features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015. [113] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. Advances in neural information processing systems, 31, 2018. [114] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpuro- hit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610, 2022. [115] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fan- tastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021. 113 [116] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tourna- ments. Journal of the European Mathematical Society, 22(3):925–965, 2019. [117] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. [118] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Han- naneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022. [119] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1429– 1437, 2019. [120] Othmane Marfoq, Chuan Xu, Giovanni Neglia, and Richard Vidal. Throughput- optimal topology design for cross-silo federated learning. Advances in Neural Informa- tion Processing Systems, 33:19478–19487, 2020. [121] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017. [122] Dhwani Mehta, Nurun Mondol, Farimah Farahmandi, and Mark Tehranipoor. Aime: watermarking ai models by leveraging errors. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 304–309. IEEE, 2022. [123] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Ha- jishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022. [124] Thomas B Moeslund and Erik Granum. A survey of computer vision-based human motion capture. Computer vision and image understanding, 81(3):231–268, 2001. [125] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-supervised In Proceedings of the AAAI learning for generalizable out-of-distribution detection. Conference on Artificial Intelligence, volume 34, pages 5216–5223, 2020. [126] Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: a multi-label hate speech detection dataset. Complex & Intelligent Systems, 8(6):4663– 4678, 2022. [127] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983. [128] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 
[129] Kenney Ng, Jimeng Sun, Jianying Hu, and Fei Wang. Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summits on Translational Science Proceedings, 2015:132, 2015.
[130] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Knockoff nets: Stealing functionality of black-box models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4954–4963, 2019.
[131] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
[132] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1406–1415, 2019.
[133] Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. A robust univariate mean estimator is all you need. In International Conference on Artificial Intelligence and Statistics, pages 4034–4044. PMLR, 2020.
[134] Zhaonan Qu, Kaixiang Lin, Jayant Kalagnanam, Zhaojian Li, Jiayu Zhou, and Zhengyuan Zhou. Federated learning’s blessing: Fedavg has linear speedup. arXiv preprint arXiv:2007.05690, 2020.
[135] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. Tbt: Targeted neural network attack with bit trojan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13198–13207, 2020.
[136] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
[137] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
[138] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
[139] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633, 2021.
[140] Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Arafat Sultan, and Christopher Potts. Udapdr: Unsupervised domain adaptation via llm prompting and distillation of rerankers. arXiv preprint arXiv:2303.00807, 2023.
[141] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8050–8058, 2019.
[142] Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. arXiv preprint arXiv:2103.12051, 2021.
[143] Shuo Shao, Wenyuan Yang, Hanlin Gu, Jian Lou, Zhan Qin, Lixin Fan, Qiang Yang, and Kui Ren. Fedtracker: Furnishing ownership verification and traceability for federated learning model. arXiv preprint arXiv:2211.07160, 2022.
[144] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems, 30, 2017.
[145] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
[146] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[147] Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, and Xin Luna Dong. Head-to-tail: How knowledgeable are large language models (llm)? AKA will llms replace knowledge graphs, 2023.
[148] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is chatgpt good at search? Investigating large language models as re-ranking agents. arXiv preprint arXiv:2304.09542, 2023.
[149] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. arXiv preprint arXiv:2204.06507, 2022.
[150] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning. Journal of Cognitive Neuroscience, 11(1):126–134, 1999.
[151] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
[152] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. Csi: Novelty detection via contrastive learning on distributionally shifted instances. Advances in neural information processing systems, 33:11839–11852, 2020.
[153] Ruixiang Tang, Mengnan Du, Ninghao Liu, Fan Yang, and Xia Hu. An embarrassingly simple approach for trojan attack in deep neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 218–228, 2020.
[154] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[155] Buse GA Tekgul, Yuxi Xia, Samuel Marchal, and N Asokan. Waffle: Watermarking in federated learning. In 2021 40th International Symposium on Reliable Distributed Systems (SRDS), pages 310–320. IEEE, 2021.
[156] Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in gans. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.
[157] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[158] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction apis. In USENIX Security Symposium, volume 16, pages 601–618, 2016.
[159] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. 2018.
[160] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
[161] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[162] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin’ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 269–277, 2017.
[163] Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine, 30(4):1134–1142, 2024.
[164] James Vincent. Meta’s powerful ai language model has leaked online — what happens now? https://www.theverge.com/2023/3/8/23629362/meta-ai-language-model-llama-leak-online-misuse, 2023. Accessed: 2023-03-08.
[165] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2517–2526, 2019.
[166] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE S&P, pages 707–723. IEEE, 2019.
[167] Fali Wang, Runxue Bao, Suhang Wang, Wenchao Yu, Yanchi Liu, Wei Cheng, and Haifeng Chen. Infuserki: Enhancing large language models with knowledge graphs via infuser-guided knowledge integration. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3675–3688, 2024.
[168] Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4921–4930, 2022.
[169] Haotao Wang, Junyuan Hong, Aston Zhang, Jiayu Zhou, and Zhangyang Wang. Trap and replace: Defending backdoor attacks by trapping them into an easy-to-replace subnetwork. Advances in Neural Information Processing Systems, 35:36026–36039, 2022.
[170] Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
[171] Liang Wang, Nan Yang, and Furu Wei. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164, 2023.
[172] Qin Wang, Gabriel Michau, and Olga Fink. Missing-class-robust domain adaptation by unilateral alignment. IEEE Transactions on Industrial Electronics, 68(1):663–671, 2020.
[173] Run Wang, Jixing Ren, Boheng Li, Tianyi She, Chehao Lin, Liming Fang, Jing Chen, Chao Shen, and Lina Wang. Free fine-tuning: A plug-and-play watermarking scheme for deep neural networks. arXiv preprint arXiv:2210.07809, 2022.
[174] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[175] Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, and Yanfu Zhang. Unlocking memorization in large language models with dynamic soft prompting. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9782–9796, 2024.
[176] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[177] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[178] Jim Winkens, Rudy Bunel, Abhijit Guha Roy, Robert Stanforth, Vivek Natarajan, Joseph R Ledsam, Patricia MacWilliams, Pushmeet Kohli, Alan Karthikesalingam, Simon Kohl, et al. Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566, 2020.
[179] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969, 2020.
[180] Zhisheng Xiao, Qing Yan, and Yali Amit. Likelihood regret: An out-of-distribution detection score for variational auto-encoder. Advances in Neural Information Processing Systems, 33:20685–20696, 2020.
[181] Pingmei Xu, Krista A Ehinger, Yinda Zhang, Adam Finkelstein, Sanjeev R Kulkarni, and Jianxiong Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.
[182] Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, et al. Openood: Benchmarking generalized out-of-distribution detection. Advances in Neural Information Processing Systems, 35:32598–32611, 2022.
[183] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[184] Qing Yu, Atsushi Hashimoto, and Yoshitaka Ushiku. Divergence optimization for noisy universal domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2515–2524, 2021.
[185] Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758, 2020.
[186] Xiyu Yu, Tongliang Liu, Mingming Gong, Kun Zhang, Kayhan Batmanghelich, and Dacheng Tao. Label-noise robust domain adaptation. In International Conference on Machine Learning, pages 10913–10924. PMLR, 2020.
[187] Xiaoyong Yuan, Leah Ding, Lan Zhang, Xiaolin Li, and Dapeng Oliver Wu. Es attack: Model stealing against deep neural networks without data hurdles. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(5):1258–1270, 2022.
[188] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[189] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811, 2017.
[190] Yi Zeng, Won Park, Z Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks’ triggers: A frequency perspective. In ICCV, 2021.
[191] Haoyu Zhang, Jianjun Xu, and Ji Wang. Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243, 2019.
[192] Jialong Zhang, Zhongshu Gu, Jiyong Jang, Hui Wu, Marc Ph Stoecklin, Heqing Huang, and Ian Molloy. Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 159–172, 2018.
[193] Jingyang Zhang, Nathan Inkawhich, Yiran Chen, and Hai Li. Fine-grained out-of-distribution detection with mixup outlier exposure. arXiv preprint arXiv:2106.03917, 2021.
[194] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR, 2021.
[195] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
[196] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 814–822, 2011.
[197] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[198] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning, pages 12878–12889. PMLR, 2021.

APPENDIX A: DYNAMIC UNCERTAINTY RANKING

Dataset          Type          Domain            Training   Test   Prompt format
Pubmedqa         Multi-choice  Healthcare        1000       500    SQO-A
ethos-national   Multi-choice  Speech detection  476        298    QO-A
eval-climate     Multi-choice  climate change    288        180    QO-A
T-REx            Open-ended    Wikipedia         20128      5032   Q-A
NatQA            Open-ended    Wikipedia         11476      2869   Q-A

Table A.1: The statistics of the datasets used in this work.

Notation   Retrieval sample format
Q-A        Question: Answer: The answer is
QO-A       Question: Options: (A)