MACHINE LEARNING ON DRUG DISCOVERY: ALGORITHMS AND APPLICATIONS

By

Mengying Sun

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2022

ABSTRACT

MACHINE LEARNING ON DRUG DISCOVERY: ALGORITHMS AND APPLICATIONS

By

Mengying Sun

Drug development is an expensive and time-consuming process in which thousands of chemical compounds are tested and experiments conducted in order to find drugs that are safe and effective. Modern drug development aims to speed up the intermediate steps and reduce cost by leveraging machine learning techniques, typically at the drug discovery and preclinical research stages. Better identification of promising candidates can significantly reduce the load of later processes, e.g., clinical trials, saving substantial resources as well as time.

In this dissertation, we explore and propose novel machine learning algorithms for drug discovery from the aspects of robustness, knowledge transfer, and molecular generation and optimization. First, labels from high-throughput experiments (e.g., biological profiling and chemical screening) often contain inevitable noise due to technical and biological variations. We propose a method (RCL) that leverages both disagreement and agreement among deep neural networks to mitigate the negative effect brought by noisy labels and better predict drug responses. Second, graph neural networks (GNNs) have become popular for modeling graph-structured data (e.g., molecules). Graph contrastive learning, which maximizes the mutual information between paired graph augmentations, has been shown to be an effective strategy for pretraining GNNs. However, existing graph contrastive learning methods have intrinsic limitations when adopted for molecular tasks. We therefore propose a method (MoCL) that utilizes domain knowledge at both the local and global level to assist representation learning. The local-level domain knowledge guides the augmentation process such that variation is introduced without changing graph semantics. The global-level knowledge encodes the similarity information between graphs in the entire dataset and helps to learn representations with richer semantics. Last but not least, we propose a search-based approach (MolSearch) for multi-objective molecular generation and optimization. We show that, given proper design and sufficient information, search-based methods can achieve performance comparable to or even better than deep learning methods while being computationally efficient. Specifically, the proposed method starts with existing molecules and uses a two-stage search strategy to gradually modify them into new ones, based on transformation rules derived from large compound libraries. We demonstrate all the proposed methods with extensive experiments.

To sum up, RCL enables accurate prediction of drug-induced gene expression changes, which lays the foundation of virtual drug screening based on reversing the signatures of a given disease. Any method that utilizes graph neural networks for molecular tasks can adopt MoCL to improve prediction accuracy when labeled data are insufficient. MolSearch enables the generation of molecules with desired properties as well as the optimization of targeted properties for better drug candidates.

Copyright by
MENGYING SUN
2022

This thesis is dedicated to my parents, Xiaojuan Dai and Jianjiang Sun, as well as my husband Dr. Deliang Yang.
ACKNOWLEDGEMENTS

I would not have had the opportunity to accomplish this thesis without the guidance of my dissertation committee, the help of my colleagues and friends, and the support of my family. I owe my sincere gratitude to all of them.

First and foremost, I would like to thank my advisor Dr. Jiayu Zhou, who gave me inspiration, technical advice, and emotional support throughout my Ph.D. study. He advised me on all aspects, including conducting research, writing papers, and presenting my work to others. These experiences are lifelong treasures to me. I also would like to thank my co-advisor Dr. Bin Chen, who provided deep expertise in systems biology and drug discovery to guide my research direction. His endless passion and persistent patience for doing good research always inspired me. I would like to thank Dr. Fei Wang, who introduced me to computational drug discovery during my research internship at Weill Cornell Medicine. I would like to thank Dr. Gustavo de los Campos, who brought me into the field of quantitative genetics, which laid a good foundation for my later research. I would like to thank Dr. Zhangyang Wang, Dr. Liang Zhan, and Dr. Jinfeng Yi, with whom I collaborated during my early Ph.D. study. I would like to thank Dr. Pang-Ning Tan and Dr. Kevin J. Liu for serving as my committee members and helping me accomplish this dissertation.

Moreover, I would like to thank my closest collaborator, Dr. Jing Xing, who greatly supported all my projects. Jing is a professional and passionate researcher in computational drug discovery and systems biology. We worked together through late nights and deadlines during the past four years. I learned from and was influenced so much by her, from professional knowledge to life philosophy. I am also thankful to my great colleagues Dr. Inci M. Baytas, Dr. Qi Wang, Fengyi (Andy) Tang, and Boyang Liu, with whom I have worked on multiple projects. I am thankful to my ILLIDAN lab-mates Dr. Kaixiang Lin, Liyang Xie, Junyuan Hong, Zhuangdi Zhu, Ikechukwu Uchendu, and Shuyang Yu for a productive, friendly, and supportive academic environment. I would like to thank all members of the Bin Chen Lab at MSU and of Dr. Fei Wang's lab at Weill Cornell Medicine for providing a great environment during my stay. I thank all members of the QuantGen Group at MSU during my master's study for building solid foundations in quantitative genetics.

I want to express my sincere thanks to Dr. Huiyun Wu, who has been a friend to me since I came to the United States. We have helped and supported each other closely during the past eight years, and I cannot express how grateful I am to have met her in my life. I would also like to thank Jing Shang, Siyi Hou, and Dr. Chongguang Bi, who supported me during my early time in the United States. To all my friends, thank you for your support, and I wish you the best of luck wherever you are.

My sincere gratitude also goes to my mother, Xiaojuan Dai, and my father, Jianjiang Sun. Their unconditional love, sacrifice, support, and encouragement accompany me in every step of my life. It is their hard work that enables me to pursue my own life freely and securely. I would not be me without them. Finally, I would like to thank my soulmate Dr. Deliang Yang, with whom I have spent the most amazing time of my life. We stood by each other and walked through all the good and bad days during the past five years. We enjoy every moment of our current life and share the same goals towards an exciting future.
My life became much more meaningful when I met him and our two cats, Taiji and Huita.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER 1  INTRODUCTION
  1.1  Robust Learning on Noisy Labels
    1.1.1  Our Contribution
  1.2  Graph Neural Networks and Pretraining Strategies
    1.2.1  Our Contribution
  1.3  Search-based Molecular Generation and Property Optimization
    1.3.1  Our Contribution

CHAPTER 2  RELATED WORK
  2.1  Robust Learning on Noisy Labels
  2.2  Graph Neural Networks and Pretraining
  2.3  Molecular Generation and Optimization

CHAPTER 3  ROBUST COLLABORATIVE LEARNING WITH NOISY LABELS
  3.1  A Revisit of Self-paced Learning (SPL)
    3.1.1  The power of Disagreement
    3.1.2  The power of Agreement
  3.2  The Proposed Method
    3.2.1  Self-Knowledge
    3.2.2  Knowledge Fusion
    3.2.3  Knowledge Exchange
  3.3  Experiment
    3.3.1  Synthetic Experiment on Image Benchmark
    3.3.2  Drug-induced Gene-Expression Change Prediction

CHAPTER 4  CONTRASTIVE LEARNING ON MOLECULAR GRAPHS WITH MULTI-LEVEL DOMAIN KNOWLEDGE
  4.1  Problem Definition
  4.2  Contrastive Learning Framework
    4.2.1  Local-level Domain Knowledge
    4.2.2  Global-level Domain Knowledge
      4.2.2.1  Similarity calculation
      4.2.2.2  Global-level Objective
    4.2.3  Connection to Metric Learning
  4.3  Experiment
    4.3.1  Evaluation Protocols
    4.3.2  Experimental Setup
    4.3.3  Local-level domain knowledge (Q1)
    4.3.4  Global-level domain knowledge (Q2)
    4.3.5  Sensitivity Analysis (Q3)
    4.3.6  Discussion

CHAPTER 5  SEARCH-BASED MULTI-OBJECTIVE MOLECULAR GENERATION AND PROPERTY OPTIMIZATION
  5.1  Problem Definition
    5.1.1  Monte Carlo Tree Search
    5.1.2  Multi-objective Monte Carlo Tree Search
  5.2  Design Moves
    5.2.1  Rationale of MolSearch
  5.3  Experiment
    5.3.1  Experiment Setup
    5.3.2  Benchmark Results
    5.3.3  Discussion

CHAPTER 6  CONCLUSION

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Description of datasets in synthetic experiment.

Table 3.2: CNN architecture. The negative slope for each LeakyReLU is set as 0.01.

Table 3.3: Performance of non-ensemble baselines and the proposed method RCL over different number of networks (K) on fixed noise rates. Each method is repeated for 5 random seeds with average performance presented. Significance t-tests (one-side) are conducted between RCL and the best baseline. A p-value less than 0.05 is considered as significant difference.

Table 3.4: Final performance of ensemble methods and RCL over various noise rates (K=9). Bold numbers indicates that the method is significantly better than the second best method.

Table 3.5: DP=Drug Profile, HQ=High Quality (qdp > 0.7), K=1000. Small size (≈ 10%) of high quality drug profiles are sampled as testing data, the rest are for training. For each cell line, a unique drug profile is associated with 162 predictable genes, therefore, sample size is # drug profile × 162.

Table 3.6: Generalization performance of comparison methods for six cell lines. K = number of networks, FR = forget rate. Each method is repeated for 5 random seeds. Significance t-tests (one-side) are conducted between RCL and the best baseline method. A p-value less than 0.05 is considered as significant difference.

Table 4.1: Source and target statistics for substitution rules. R/R'/R'' represent arbitrary carbon-containing groups.

Table 4.2: Basic statistics for all datasets.

Table 4.3: Averaged test AUC of comparison methods under linear and semi-supervised protocol (5 runs). Bold number denotes the best performance for local-level (augmentation) comparison. Bold* number denotes the best performance after incorporating global similarity information (MoCL-G). LS and CL represents least-square and contrastive global loss, respectively.

Table 4.4: Detailed experimental settings for each dataset.
Table 4.5: Implementation details for general augmentation. Edge refers all edges that reach out from the corresponding node. - denotes no change.

Table 4.6: Comparison between different global losses under MoCL-DK1 augmentation. LS: directly using global similarity and optimize by least-square loss; CL: contrastive loss using nearest neighbor derived from global similarity.

Table 5.1: Statistics of rules extracted from ChEMBL on environment radius r = 3. # denotes "number of".

Table 5.2: Overall performance of comparison methods on bio-activity objectives. Results of RationaleRL, MolDQN are obtained by running their open source code. Results of JT-VAE, GCPN, GA+D and MARS are taken from [52, 128]. For MolSearch, we repeat the experiments for 10 times and report the mean and standard deviation.

Table 5.3: Overall performance of comparison methods on bio-activity plus non-bioactivity objectives. Results of RationaleRL, MolDQN are obtained by running their open source code. Results of JT-VAE, GCPN, GA+D and MARS are taken from [128]. For MolSearch, we repeat the experiments for 10 times and report the mean and standard deviation.

Table 5.4: Running time per molecule for MolSearch.

Table 5.5: Average scores of generated molecules by MolDQN in GSK3β+JNK3+QED+SA setting.

Table 5.6: Performance of MolSearch under different hyper-parameters for two generation settings.

Table 5.7: Number of generated molecules by MolSearch under different hyper-parameters for two generation settings.

LIST OF FIGURES

Figure 3.1: Gradient norm (GN) of noisy data on CIFAR10. The gradient includes components I01 and I00 in Eq. (3.2). Co-teaching learning process has less impact from noisy data.

Figure 3.2: Robust Collaborative Learning (RCL) framework. Left: one network's view, A' denotes all the peer networks of A; Right: all networks in one mini batch. Knowledge fusion and update can be done in parallel for all the networks.

Figure 3.3: Knowledge fusion among peer networks.

Figure 3.4: Noise examples for a 5-class problem.

Figure 3.5: Episode curve (upper: test accuracy, lower: pure ratio) w.r.t. training epochs. The grey lines are SPL (net_1) and co-teaching (net_2), and blue lines the RCL. The shaded area represents the variation across 5 random seeds.

Figure 3.6: Test accuracy of ensemble methods and RCL (K=9) over various noise rates on CIFAR10. The shaded area represents variation across 3 random seeds.

Figure 3.7: Improvement of RCL over ensemble method by each disagreement level on CIFAR10 pairflip 45% scenario (K=9). Average over 3 random seeds.

Figure 3.8: Test accuracy of RCL over various βs, average over 3 random seeds (K=3).

Figure 3.9: Revise and restart RCL on CIFAR10 PF 45% using K=3 networks. Left: the confusion matrix after we revise the labels based on prediction. Numbers in cells denote percentages. Right: test accuracy w.r.t. epochs.
Figure 3.10: Illustration of drug profile and its quality.

Figure 3.11: Virtual drug screening use RCL and RGES pipeline.

Figure 4.1: Overall framework of MoCL. First, two augmented views are generated from local-level domain knowledge. Then, together with the original view, they are fed into the GNN encoder and projection head. The local-level contrast maximizes the MI between two augmented views, while the global-level contrast maximizes the MI between two similar graphs. The similarity information is derived from global-level domain knowledge (MI: mutual information).

Figure 4.2: Augmentation comparison. Upper: conventional augmentations that may alter the graph semantics. Lower: proposed augmentation in which valid substructures are replaced by bioisosteres that share similar properties.

Figure 4.3: Augmentation combination under linear evaluation protocol. Each cell represents the performance difference between a vanilla GNN trained from scratch (upper-bound) and learned representations (fixed) from a pretrained model plus a linear classifier under a given augmentation combination. Each number is averaged from 5 runs. Blue represents negative value and red positive. Higher value is better. MoCL-DK is the proposed augmentation with local-level domain knowledge.

Figure 4.4: Distribution of augmentations that can be generated by proposed augmentation rules (dataset: bace).

Figure 4.5: Average test AUC of MoCL-Local across different augmentation strengths (repeat times) for all datasets.

Figure 4.6: Average test AUC gain from global domain knowledge for different augmentations across all datasets.

Figure 4.7: Average test AUC of different neighbor size and λ for MoCL-DK1-G under linear protocol (dataset: bbbp).

Figure 5.1: Overall framework of MolSearch. For a given start molecule, it first goes through a HIT-MCTS stage which aims to improve the biological properties, e.g., GSK3β and JNK3, followed by a LEAD-MCTS stage where non-biological properties such as QED are optimized. n refers to number of generated molecules and y-axis reflects the normalized scores.

Figure 5.2: Multi-objective Monte Carlo tree search procedure. Each node represents an intermediate molecule which has a reward vector associated with it. A search iteration consists of selection, expansion, simulation, and backpropagation. For MolSearch, HIT-MCTS and LEAD-MCTS differ in the expansion and simulation policy (blue boxes).

Figure 5.3: Example of design moves. A transformation is only valid conditional on the existence of certain environments.

Figure 5.4: Property dynamics across MolSearch stages. (a)(b): average scores over 10 runs at each stage. (c): distribution of bioactivity scores during Start and HIT-MCTS stage. (d): QED distribution between HIT-MCTS and LEAD-MCTS stage.
Figure 5.5: Number of generated molecules across MolSearch stages and different generation settings (10 runs).

Figure 5.6: t-SNE visualization of generated molecules and positive molecules in the reference (training) dataset.

Figure 5.7: Sample molecules generated by MolSearch in the GSK3β+JNK3+QED+SA setting with associated scores.

Figure 5.8: MolSearch path for generation setting GSK3β + JNK3 + QED + SA.

Figure 5.9: Top 40 molecules generated by MolSearch based on average score for GSK3β + JNK3 + QED + SA.

LIST OF ALGORITHMS

Algorithm 3.1: Pseudo code for RCL Algorithm.
Algorithm 3.2: Knowledge Fusion Function.
Algorithm 4.1: Pseudocode of domain augmentation.
Algorithm 4.2: Pseudocode of proposed framework.
Algorithm 5.1: UCT algorithm for MO-MCTS.

CHAPTER 1
INTRODUCTION

1.1 Robust Learning on Noisy Labels

Recent years have witnessed huge successes of supervised learning using deep neural networks in various domains [39, 101, 29]. Moreover, investigations of network properties and behavior have further brought a better understanding of deep models, which in turn provides guidance on their utilization [4, 92]. One decisive factor behind such successes is the availability of a sufficiently large amount of training data [108]. In fact, improving generalization from limited labeled data has been an active research topic for an extended period of time. For example, semi-supervised learning aims to help learning by leveraging unlabeled data [131, 16, 15]. On the other hand, in cases where obtaining accurate labels is too expensive, practitioners could use affordable, alternative apparatuses to collect less reliable, noisy labels. One example is online crowd-sourcing tools such as Mechanical Turk (https://www.mturk.com/): although the labels come from humans, their quality varies among annotators, and even within the same annotator over time. Besides, labels from high-throughput experiments (e.g., biological profiling and chemical screening) often contain inevitable noise due to technical and biological variations.

Learning with noisy labels imposes additional challenges. Sometimes the data quality is known a priori [65, 97, 27], but a more common scenario is that the available data is a mixture of samples with both clean and noisy labels, and one does not know, or only has partial knowledge of, the underlying distribution of the noise [76, 66, 74, 118]. In this problem setting, a learning process that is aware of noise in the labels and actively mitigates the negative impacts from the noisy labels is the key to improving the generalization of learned models.

Among the early studies, [12] showed that a proper arrangement of tasks can improve convergence and generalization when training a deep neural network. They proposed the concept of curriculum, which defines the order of learning tasks, usually from easy to hard, to assist network training.
Recently, the memorization effect of neural networks has been identified and analyzed in [4], showing that neural networks tend to fit informative patterns first, such as simple structure, and then the non-informative part, such as noise. Therefore, an appropriate ordering of the data may also improve the robustness of deep models. On the other hand, instead of manually grouping data, [59] introduced a latent variable associated with each sample as the curriculum (sample weight), which is learned simultaneously with the model parameters. Several pre-defined curricula were later proposed in [111, 64, 48, 49, 138], and such a learning process is referred to as self-paced learning (SPL).

Though it deviates slightly from the initial motivation, learning with curricula reveals its power especially when dealing with noisy data. The reason is that noisy samples can be re-weighted or even filtered out via the curriculum mechanism under proper designs. It has been proved by [73] that the latent objective of self-paced learning is equivalent to a robust loss function, which also sheds light on the effectiveness of SPL on noisy data. The original optimization of SPL can be done by alternately optimizing the model parameters and the curriculum, known as the majorization-minimization (MM) algorithm [62]. However, such a procedure is intractable for training very large and deep neural networks via mini-batch stochastic gradients. Therefore, [50] modified the algorithm such that it optimizes both the model parameters and the curriculum stochastically over mini-batches in a more elegant way. The authors also proposed to learn the curriculum, rather than pre-define and optimize it, when auxiliary data is available.

A major drawback of determining the curriculum based on the learner's own ability, without any other supervision or feedback, is the sample selection bias from the learner itself. The error made in the early stage will be reinforced as training proceeds. Therefore, two or more networks have been introduced in recent works to mitigate the selection bias [40, 63]. Nevertheless, these studies either emphasize the disagreement or focus on the agreement between networks, without considering the other.

1.1.1 Our Contribution

In this paper, we first investigate the mechanism of how disagreement and agreement can help filter out noisy samples. Then, based on the insights, we propose a novel framework called Robust Collaborative Learning (RCL) to deal with noisy labels. The main contributions of this paper are:

• We show that disagreement between networks can diversify the gradients of model weights from noisy samples, which slows down the accumulation of noisy gradients.

• We show that under certain conditions, agreement from more than one network can improve the quality of data selection, i.e., the label purity increases.

• Combining the above two findings, we propose the RCL framework, which consists of multiple networks, where each network is an individual learner and exchanges knowledge with its Peer system. The knowledge of the Peer system is fused from multiple networks by adaptively encouraging disagreement in the early stage and agreement in the later stage, which fully boosts the selection of clean samples for training.

We demonstrate the effectiveness of RCL on both synthetic and real data experiments. For the synthetic experiments, we use benchmark image data [57] under different noise settings following the literature [40, 117].
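As a concrete illustration of those noise settings (described in detail in Section 3.3.1), the following minimal sketch injects synthetic label noise into a vector of class labels. It is an illustrative routine written for this dissertation text, not the released experimental code; function and variable names are ours.

```python
import numpy as np

def flip_labels(labels, noise_rate, num_classes, mode="symmetric", seed=0):
    """Inject synthetic label noise (illustrative sketch only).

    mode="symmetric": a corrupted label is flipped uniformly to one of the
                      other classes (symmetric flip, SYM).
    mode="pairflip":  a corrupted label is flipped to the adjacent class,
                      c -> (c + 1) mod num_classes (pair flip, PF).
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    corrupt_idx = rng.choice(len(labels), size=int(noise_rate * len(labels)), replace=False)
    for i in corrupt_idx:
        if mode == "symmetric":
            candidates = [c for c in range(num_classes) if c != labels[i]]
            noisy[i] = rng.choice(candidates)
        else:  # pairflip
            noisy[i] = (labels[i] + 1) % num_classes
    return noisy

# Example: corrupt 45% of CIFAR-10-style labels with pair flip.
clean = np.random.randint(0, 10, size=50000)
noisy = flip_labels(clean, noise_rate=0.45, num_classes=10, mode="pairflip")
print((noisy != clean).mean())  # ~0.45
```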
We further validated our framework on cancer drug development using large-scale genomic data from multiple sources [105, 124]. The proposed method achieves state-of-the-art performance and significantly outperforms baselines in large-noise settings, on both image and bioinformatics data.

1.2 Graph Neural Networks and Pretraining Strategies

Graph neural networks (GNNs) have been demonstrated to achieve state-of-the-art performance on graph-related tasks such as node classification [54, 119, 126], link prediction [142], and graph classification [119, 35, 129]. They have also been frequently used in the biomedical domain recently to tackle drug-related problems [104, 94, 75]. However, like most deep learning architectures, they require a large amount of labeled data to train, whereas task-specific labels in the real world are often of limited size (e.g., in the biomedical domain, obtaining labels such as drug responses from biological experiments is always expensive and time-consuming). Therefore, pretraining schemes for GNNs have been actively explored recently. One line of work focuses on designing pretext tasks to learn node or graph representations without labels. The predefined tasks include graph reconstruction [55, 44, 135] and context prediction [83, 43]. The other line follows the contrastive learning framework from the computer vision domain [21, 127], in which two augmentations are generated for each data point and then fed into an encoder and a projection head. By maximizing the mutual information between the two augmented views, the model is able to learn representations that are invariant to transformations. In particular, [134] proposed four types of augmentations for general graphs and demonstrated that contrastive learning on graphs can produce representations that are beneficial for downstream tasks.

However, unlike images, contrastive learning on graphs has its own unique challenges. First, the structural information and semantics of graphs vary significantly across domains (e.g., social networks vs. molecular graphs), so it is difficult to design a universal augmentation scheme that fits all scenarios. It has been shown that general augmentations can be harmful in a specific domain context [134]. Second, most current graph contrastive learning frameworks learn invariant representations while neglecting the global structure of the entire dataset [5], e.g., some graphs should be closer in the embedding space due to their structural similarity. Nevertheless, modeling similarity between graphs is itself still a difficult problem [8]. Third, the contrast schemes are not unique because graph tasks can happen at different levels, e.g., node-graph [42], node-node [141], and graph-graph [134] contrasts are all possible schemes.

Besides these graph-specific challenges, contrastive learning itself also has unsolved problems. For example, accurately estimating mutual information in high dimensions is difficult [84]. The connection between mutual information maximization and the success of contrastive learning is still not clear. In fact, [114] found that the connection is actually weak, and that metric learning instead shares some intrinsic connections with contrastive learning. These findings also motivate us to pay more attention to the role of augmentation schemes and the global semantics of the data in order to improve contrastive learning on graphs.
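To make the view-versus-view contrastive recipe above concrete, the following is a minimal PyTorch sketch of an NT-Xent (InfoNCE-style) loss between two batches of augmented-view embeddings, as used by SimCLR-style frameworks. It illustrates the general idea rather than the exact loss implemented in MoCL; the tensor names and the temperature value are our own choices.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss between two batches of augmented-view embeddings.

    z1, z2: (N, d) projections of the two augmented views of the same N graphs.
    For each embedding, its positive is the other view of the same graph; the
    remaining 2N - 2 embeddings in the batch act as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d), unit norm
    sim = torch.matmul(z, z.t()) / temperature               # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                        # exclude self-pairs
    # Index of the positive example for each row: i <-> i + n.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Example with random "graph embeddings" standing in for a GNN plus projection head.
z_aug1, z_aug2 = torch.randn(32, 128), torch.randn(32, 128)
print(nt_xent_loss(z_aug1, z_aug2).item())
```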
1.2.1 Our Contribution In this paper, we aim to tackle the aforementioned challenges in the context of biomedical domain, where molecular graphs are present. Our hypothesis is that better representations can be learned by infusing domain knowledge into the augmentation and contrast schemes. We propose to leverage both local-level and global-level domain knowledge to assist contrastive learning on molecular graphs. In particular, unlike general augmentations in which nodes and edges in a graph are randomly perturbed, we propose a new augmentation scheme called substructure substitution such that a valid substructure in a molecule is replaced by a bioisostere which introduces variation without altering the molecular properties too much. The substitution rules are derived from domain resources and we regard it as local-level domain knowledge. The global-level domain knowledge encodes the global similarities between graphs. We proposed to utilize such information to learn richer representations via a double-contrast objective. Leveraging domain knowledge to assist contrastive learning has rarely been explored in literature and our work is the first to make this attempt. In summary, our contributions are as follows: • We propose a new augmentation scheme for molecular graphs based on local-level domain knowledge such that the semantics of graphs do not change in the augmentation process. • We propose to encode the global structure of data into graph representations by adding a global contrast loss utilizing the similarity information between molecular graphs. • We provide theoretical justifications that the learning objective is connected with triplet loss in metric learning which shed light on the effectiveness of the entire framework. • We evaluate MoCL on various molecular datasets under both linear and semi-supervised settings and demonstrate its superiority over the state-of-the-art methods. 5 1.3 Search-based Molecular Generation and Property Optimization Searching new compounds with desired properties is a routine task in early-stage drug discovery [14]. Common examples include improving the binding activity against one or multiple therapeutic targets while keeping the drug-likeness property; increasing drug solubility while minimizing the change of ADME properties. However, a small change of chemical structures may lead to an unwanted challenge of one property that even seasoned chemists cannot foresee. Moreover, the virtually infinite chemical space and the diverse properties for consideration impose significant challenges in practice [96]. Advanced machine learning models built upon historical biological and medicinal chemistry data are poised to aid medicinal chemists in designing compounds with multiple objectives efficiently and effectively. Leveraging computational methods to facilitate and speed up the drug discovery process has always been an active research area [102, 136]. In particular, using deep learning (DL) and rein- forcement learning (RL) to generate and optimize molecules has recently received broad attentions [51, 133, 128], which we will summarize in detail later in section ??. Despite the advances, such methods either rely on the quality of latent space obtained by generative models [98], or suffer from high variation, making it hard to train [112]. In reality, DL/RL methods consume large com- putational resources while the generated molecules hardly synthesize. Methods combing multiple objectives often do not work well [31]. 
1.3 Search-based Molecular Generation and Property Optimization

Searching for new compounds with desired properties is a routine task in early-stage drug discovery [14]. Common examples include improving the binding activity against one or multiple therapeutic targets while keeping the drug-likeness property, or increasing drug solubility while minimizing the change of ADME properties. However, a small change in chemical structure may lead to an unwanted change in one property that even seasoned chemists cannot foresee. Moreover, the virtually infinite chemical space and the diverse properties under consideration impose significant challenges in practice [96]. Advanced machine learning models built upon historical biological and medicinal chemistry data are poised to aid medicinal chemists in designing compounds with multiple objectives efficiently and effectively.

Leveraging computational methods to facilitate and speed up the drug discovery process has always been an active research area [102, 136]. In particular, using deep learning (DL) and reinforcement learning (RL) to generate and optimize molecules has recently received broad attention [51, 133, 128], which we summarize in detail later in Section 2.3. Despite the advances, such methods either rely on the quality of the latent space obtained by generative models [98] or suffer from high variation, making them hard to train [112]. In reality, DL/RL methods consume large computational resources, while the generated molecules are hard to synthesize. Methods combining multiple objectives often do not work well [31].

In this paper, instead of leveraging DL, we propose a practical search-driven approach based on Monte Carlo tree search (MCTS) to generate molecules. We show that, under proper design, search methods can achieve results comparable to or even better than DL methods in terms of multi-objective molecular generation and optimization, while being computationally much more efficient. The efficiency and multi-objective nature allow it to be readily deployed in massive real-world applications such as early-stage drug discovery.

In order to design an efficient and effective search framework for practical multi-objective molecular generation and optimization, we need to answer the following questions. Q1: where to start; Q2: what to search; and Q3: how to search. For Q1, prior works that use MCTS to generate molecules mostly start with empty molecules [130, 46]. Since most drug-like molecules have 10-40 atoms, the search tree can grow very deep, and the search space grows exponentially with the depth, which makes the search process less efficient and effective. Some work thus uses a pre-trained RNN as a simulator to expand the tree; however, this requires additional pretraining [130]. Moreover, real optimization projects often have some candidates in place. For Q2, most prior works use atom-wise actions for editing molecules, which makes it hard to improve a target property while maintaining drug-likeness and synthesizability [140, 133]. Fragment-wise actions tend to work better, but the editing rules are mostly heuristic [52, 128]. For Q3, most existing methods combine all the objectives into one single score and optimize for that [78, 128]. However, the simple aggregation of scores neither fully considers the differences between objective classes nor reflects real optimization scenarios.

We seek solutions to Q1-Q3 and propose MolSearch, a simple and practicable search framework for multi-objective molecular generation and optimization. In MolSearch, we start with existing molecules and optimize them towards desired ones (Q1). The modification is based on design moves [7], i.e., transformation rules that are chemically reasonable and derived from large compound libraries (Q2). The property objectives are split into two groups, with the rationale explained in detail later. The first group contains all biological properties such as inhibition scores against proteins, and the second group includes non-biological properties such as drug-likeness (QED) and synthetic accessibility (SA). Correspondingly, the entire search process consists of two stages: a HIT-MCTS stage that aims to improve biological properties, followed by a LEAD-MCTS stage that focuses on non-biological properties while keeping biological ones above a certain threshold. Each stage contains a multi-objective Monte Carlo search tree where different property objectives are considered separately rather than combined (Q3). We evaluate MolSearch on benchmark tasks under different generation settings and compare it with various baselines. The results show that MolSearch is on par with or even better than the baselines based on evaluation metrics calculated from success rate, novelty, and diversity, with much less running time.

1.3.1 Our Contribution

• MolSearch is among the first to make search-based approaches comparable to DL-based methods in terms of multi-objective molecular generation and optimization.

• MolSearch combines mature components, e.g., tree search, design moves, and multi-objective optimization, in a novel way such that the generated molecules not only have desired properties but also achieve a wide range of diversity.

• MolSearch is computationally very efficient and can be easily adopted into any real drug discovery project without additional knowledge beyond property targets.

• In addition to molecular generation, MolSearch is tailored for hit-to-lead optimization given the nature of its design, which makes it very general and applicable.
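To ground the search machinery referenced above, here is a minimal, generic Monte Carlo tree search skeleton with UCT selection over molecule states. It is a sketch of the standard single-objective UCT recipe rather than MolSearch itself: the state type, the scalar `reward` function, and the `expand_moves` helper are placeholders, and MolSearch additionally maintains one reward per objective with Pareto-style comparisons (Section 5.1.2).

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # e.g., a SMILES string
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0              # accumulated reward

    def uct_score(self, c=1.4):
        if self.visits == 0:
            return float("inf")       # always try unvisited children first
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(root_state, expand_moves, reward, n_iters=100):
    """expand_moves(state) -> list of next states; reward(state) -> float."""
    root = Node(root_state)
    for _ in range(n_iters):
        # 1. Selection: descend by UCT until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct_score)
        # 2. Expansion: add children produced by the move generator.
        for s in expand_moves(node.state):
            node.children.append(Node(s, parent=node))
        # 3. Simulation: evaluate the leaf (or one of its new children).
        leaf = random.choice(node.children) if node.children else node
        r = reward(leaf.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    return root

# Toy usage: states are strings that grow one atom at a time; reward favors length.
tree = mcts("C", lambda s: [s + "C"] if len(s) < 5 else [], lambda s: len(s) / 5.0, n_iters=50)
print(max(c.value / max(c.visits, 1) for c in tree.children))
```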
CHAPTER 2
RELATED WORK

2.1 Robust Learning on Noisy Labels

Inspired by the fact that humans learn better when trained with a curriculum-like strategy, [12] first proposed curriculum learning, which mimics the learning procedure of humans. Results on both visual and language tasks have shown that training on easy tasks first and then hard tasks leads to faster convergence as well as better generalization. Instead of using a specified curriculum, [59] incorporated a latent variable associated with each sample, and jointly optimized the model parameters and the sample curriculum. Besides, a variety of approaches with different predefined curricula were proposed and validated [111, 64, 48, 49, 138]. Further, instead of using a predefined curriculum, [50] proposed to learn a data-driven curriculum when auxiliary data is available. The authors also designed an efficient algorithm for training very deep neural networks with a curriculum. Later, [40] proposed the co-teaching framework, a system of two networks that exchange selected samples to alleviate the sample selection bias brought by one network. Co-teaching [40] works well empirically, with several follow-up works and applications [137, 122, 103]. [103] later proposed to make use of the unselected samples by correcting their labels and combining them with the selected samples for training. Other very recent works also aggregate knowledge from multiple sources, e.g., multiple networks or multiple training epochs of a single network, to filter out noisy data [63, 77].

Learning with corrupted labels also relates to weakly supervised learning, and its recent advances can be summarized into the following groups. A common way to leverage weak labels when the quality of the data is known a priori is to use the pre-train and fine-tune scheme based on the amounts of clean and weak data [28, 97, 65]. Another line of methods designs surrogate loss functions for robust learning [70, 66, 76]. Some approaches model the noise pattern or estimate the error transition matrix [107, 74], and denoise by either adding an extra layer [36] or using generative models [88, 118]. Other methods utilize semi-supervised learning techniques [1, 24] to revise weak labels for further training [41], or regularize the learning procedure [115]. Recently, learning-to-learn methods have also been proposed to tackle such problems by manipulating gradient update rules [3, 27]. Since our work mainly follows curriculum learning, we do not expand on the works mentioned in this paragraph and refer interested readers to the original papers.

2.2 Graph Neural Networks and Pretraining

Self-supervised learning on graphs. A common strategy for learning node (graph) representations in an unsupervised manner is to design pretext tasks on unlabeled data. For node-level tasks, You et al. [135] proposed three types of self-supervised tasks: node clustering, graph partition, and graph completion to learn node representations. Peng et al.
[83] proposed to predict the contextual position of a node relative to another to encode the global topology into node representations. GPT-GNN [44] designed a generative task in which node attributes and edges are alternately generated such that the likelihood of a graph is maximized. After that, the pretrained GNN can be used for any downstream task. For graph-level tasks, Hu et al. [43] first designed two tasks, predicting neighborhood context and node attributes, to learn meaningful node representations, and then used graph-level multi-task pretraining to refine the graph representation. Other works [99, 132, 110] utilized similar strategies for either node- or graph-level pretraining in the context of a more specific task or domain.

Contrastive learning on graphs. Contrastive learning on graphs can be categorized into two groups. One group aims to encode structural information by contrasting local and global representations. For example, DGI [120] proposed to maximize the mutual information between node embeddings and a graph summary vector to learn node representations that capture the graph semantics. InfoGraph [109] extended DGI to learn graph-level representations and further proposed a variant for semi-supervised scenarios. The other group aims to learn representations that are invariant to transformations, following the idea of contrastive learning on visual representations [21, 127, 30], where two augmentations (views) of an image are generated and fed into an encoder and a projection head, after which their mutual information is maximized. Similarly, You et al. [134] explored four types of augmentations for general graphs and demonstrated that the learned representations can help downstream tasks. Instead of general corruption, [42] used graph diffusion to generate the second view and performed contrast between nodes and graphs across the two views. GCC [85] proposed to use random walks to generate subgraphs and to contrast between them. GCA [141] proposed adaptive augmentation such that only unimportant nodes and edges are perturbed. However, GCA is focused on network data and is not suitable for molecular graphs.

Evaluation protocols. Since our work focuses on graph-level representation learning, and there are various evaluation schemes for graph self-supervised learning, we summarize the evaluation protocols that related works use. Most prior works [109, 43, 134] adopt the linear evaluation protocol, where a linear classifier is trained on top of the representations. [109, 134] also adopt the semi-supervised protocol, where only a small fraction of labels are available for downstream tasks. [43, 134] explore the transfer learning setting, in which the pretrained model is applied to other datasets.
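The linear evaluation protocol mentioned above is straightforward to express in code: freeze the pretrained encoder, compute embeddings once, and fit a linear classifier on top. The sketch below uses scikit-learn and assumes the embeddings are already available as NumPy arrays; all variable names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_evaluation(train_emb, train_y, test_emb, test_y):
    """Train a linear classifier on frozen graph embeddings and report test AUC."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_y)
    scores = clf.predict_proba(test_emb)[:, 1]   # probability of the positive class
    return roc_auc_score(test_y, scores)

# Example with random embeddings standing in for a frozen, pretrained GNN encoder.
rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(800, 128)), rng.normal(size=(200, 128))
train_y, test_y = rng.integers(0, 2, 800), rng.integers(0, 2, 200)
print(linear_evaluation(train_emb, train_y, test_emb, test_y))  # ~0.5 on random data
```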
2.3 Molecular Generation and Optimization

In general, molecular property optimization comprises up to three components: a representation, a generative model, and an optimization model. The representation of molecules can be simplified molecular-input line-entry system (SMILES) strings, circular fingerprints, or raw graphs, which often corresponds to a certain type of generative model. Grouping by each component would be too detailed to capture the big picture, so we choose to categorize the related studies by their optimization models.

The first group optimizes molecules via Bayesian optimization [37, 60, 25, 51]. These methods first learn a latent space of molecules via generative models such as auto-encoders (AEs), then optimize the property by navigating in that latent space, and generate molecules through the decoding process. Most methods in this category only optimize for non-biological properties such as QED and penalized logP (the water-octanol partition coefficient penalized by synthetic accessibility and the number of cycles having more than 6 atoms, i.e., PlogP(m) = logP(m) − SA(m) − cycle(m)), and focus on metrics such as the validity of generated molecules. They rely heavily on the quality of the learned latent spaces, which imposes challenges for multi-objective optimization.

Instead of manipulating latent representations, the second category utilizes reinforcement learning (RL) to optimize molecular properties. One line of research applies policy gradients to finetune generative models, e.g., GAN-based [95, 26], GNN-based [133], and flow-based generators [100, 67], to generate molecules with better property scores. The other line of work directly learns the value function of molecule states and optimizes for a given property via double Q-learning [140].

Besides RL, the third category uses genetic algorithms (GAs) to generate molecules with desired properties [46, 2, 78]. The generation process of genetic algorithms usually follows mutation and cross-over rules that are predefined from a reference compound library or domain expertise, which are not easy to obtain in general. Some work [78] also combines deep learning components, e.g., a discriminator, with the GA generator to increase the diversity of molecules.

The last and least explored category aims to optimize molecular properties using search methods, e.g., Monte Carlo tree search (MCTS). The earliest work traces back to [130, 46], in which the authors use pre-trained RNNs or genetic mutation rules as the simulator for tree expansion and simulation. [86] proposes an atom-based MCTS method without a predefined simulator. Again, all these methods focus on single, non-biological properties and are not tailored for multi-objective optimization. Only recently did RationaleRL [52] enable multi-objective molecular generation, by first searching for property-related fragments using MCTS and then completing the molecular graph using reinforcement learning.

There are also pioneering works that do not fall into any of the categories above; e.g., MARS [128] proposes a Markov sampling process based on molecular fragments and graph neural networks (GNNs) and achieves state-of-the-art performance. In summary, we see a trend in recent works of utilizing fragment-based actions and directly navigating the chemical space (as opposed to generative models). Interested readers are referred to [31, 139] for a comprehensive understanding of advances in molecular generation and optimization.

CHAPTER 3
ROBUST COLLABORATIVE LEARNING WITH NOISY LABELS

3.1 A Revisit of Self-paced Learning (SPL)

Despite the promising evidence that curricula assist learning, constructing an effective curriculum is not easy for learning tasks. To tackle this challenge, SPL [59] introduces a latent variable associated with each training sample and solves for it during training.
Denoting the latent variables as a vector $v \in [0, 1]^n$, where $n$ is the sample size, the objective function of SPL can be written as:

$$\min_{w,\, v \in [0,1]^n} \; E(w, v; \lambda) = \sum_{i=1}^{n} \Big[ v_i\, L\big(y_i, f(x_i, w)\big) + g(v_i, \lambda) \Big], \qquad (3.1)$$

where $g(v, \lambda)$ serves as the curriculum function and regularizes the weight of a given sample, and $\lambda$ is a control parameter. $w$ and $v$ are optimized alternately, one while fixing the other [116]. A simple example of the curriculum function is $g(v, \lambda) = -\lambda v$, with the closed-form solution for $v$ at each step:

$$v^*(\lambda; l) = \begin{cases} 1, & \text{if } l < \lambda \\ 0, & \text{if } l \geq \lambda \end{cases}$$

where $l$ is the loss for one sample. The design of $g(v, \lambda)$ often satisfies conditions such as convexity and monotonicity to ensure convergence [47, 49, 73]. Moreover, such a design reveals the nature of SPL: the difficulty of a sample is determined by the learner's ability, i.e., if a sample has a large loss under the current model, it is likely to be more difficult to learn or even an outlier. It has been proved that the optimization of Eq. (3.1) is equivalent to minimizing a robust loss function, whose underlying objective is $F_\lambda(w) = \frac{1}{n} \sum_{i=1}^{n} \int_0^{l_i} v^*(\lambda, l)\, dl$ [73]. Therefore, SPL works well for problems involving data with corrupted labels. Furthermore, the alternating minimization algorithm for solving Eq. (3.1) can be modified to suit mini-batch training of very deep neural networks [50].

3.1.1 The power of Disagreement

A major drawback of SPL is the sample selection bias induced by the learner itself, since it picks samples based on its own knowledge. The error that takes place in the early stage will be reinforced as training continues. In order to mitigate this, a second network is introduced in co-teaching [40], where two networks first pick samples on their own and then exchange the selected samples for training. Such a strategy works better than SPL in practice. However, the underlying mechanism has not been well studied. Therefore, in this subsection, we show that exchanging data introduces disagreement between two networks, and that such disagreement can diversify noisy gradients and thus leads to higher gradient purity.

Recall that the algorithms of many machine learning models are based on gradient methods, in which the model is updated iteratively by adding gradients computed from the samples. Therefore, the final learned model is additive w.r.t. the gradients. In the stochastic gradient procedure, the gradient update for each mini-batch can be written as:

$$w_{t+1} = w_t - \eta_t \frac{1}{n_r} \sum_{j=1}^{n_r} \nabla l_j(w_t) = w_t - \eta_t \frac{1}{n_r} S_\nabla,$$

where $n_r$ is the number of samples in a mini-batch, $\eta_t$ is the step size, and $S_\nabla$ denotes the summation of sample gradients. Considering each of the two networks, with an oracle that provides the ground truth of whether a label is noisy or not, we can decompose the gradient $S_\nabla$ into four disjoint components based on data quality (clean or noisy) and network agreement (agree or disagree). Let $I$ denote a subset of indexes such that $I_{11}$ = {clean, agree}, $I_{10}$ = {clean, disagree}, $I_{01}$ = {noisy, agree}, $I_{00}$ = {noisy, disagree}; then the decomposition is given by:

$$S_\nabla = \underbrace{\sum_{j \in I_{11}} \nabla l_j(w_t)}_{\text{agreed clean}} + \underbrace{\sum_{j \in I_{10}} \nabla l_j(w_t)}_{\text{disagreed clean}} + \underbrace{\sum_{j \in I_{01}} \nabla l_j(w_t)}_{\text{agreed noisy}} + \underbrace{\sum_{j \in I_{00}} \nabla l_j(w_t)}_{\text{disagreed noisy}}, \qquad (3.2)$$

where an ideal algorithm should concentrate the learning on the clean data points $I_{11} \cup I_{10}$ and reduce the impact from the noisy data points $I_{01} \cup I_{00}$.
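The closed-form weights $v^*(\lambda; l)$ in Eq. (3.1) amount to a hard small-loss selection rule, which is also how the mini-batch variant of SPL is typically implemented: keep the samples whose current loss is smallest and update only on those. The following is a minimal PyTorch sketch of that selection step under these assumptions; it is illustrative, not the exact implementation used in this chapter.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(logits, targets, keep_ratio):
    """Return the indices of the `keep_ratio` fraction of smallest-loss samples
    in a mini-batch (the mini-batch analogue of v*(lambda; l) in Eq. (3.1))."""
    per_sample_loss = F.cross_entropy(logits, targets, reduction="none")
    num_keep = max(1, int(keep_ratio * len(targets)))
    return torch.argsort(per_sample_loss)[:num_keep]

# Usage inside a training step: compute logits, select, and backpropagate
# only through the selected (presumably clean) samples.
logits = torch.randn(8, 10)                 # stand-in for model(x_batch)
targets = torch.randint(0, 10, (8,))
keep = small_loss_selection(logits, targets, keep_ratio=0.5)
selected_loss = F.cross_entropy(logits[keep], targets[keep])
print(keep.tolist(), selected_loss.item())
```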
As SPL updates the network parameters based on small-loss samples selected by itself, the network moves towards the small-loss direction, and thus the gradients of the noisy samples become similar, concentrating and accumulating over time.

Figure 3.1: Gradient norm (GN) of noisy data on CIFAR10. (a) GN at each epoch; (b) accumulative GN. The gradient includes components $I_{01}$ and $I_{00}$ in Eq. (3.2). The co-teaching learning process has less impact from noisy data.

In co-teaching, on the other hand, the set $I_{00}$ (noisy, disagree) is diversified by exchanging data points between the two networks. Such disagreement is crucial and can cause several effects. First, the gradient norm of the noisy data may diminish; second, the diversification can take effect across time, since gradients are eventually summed and applied to the network parameters; third, introducing disagreement is equivalent to adding small perturbations to the network parameters, which could increase the robustness of the network. We note that the noise from the $I_{01}$ (noisy, agree) set of samples is inevitable, since these samples are agreed on by both networks, but such "bad" agreements may also suggest the usefulness of the samples and effectively prevent overfitting.

We evaluate these effects in a small synthetic experiment. Given an image-classification problem, e.g., CIFAR10, for each class we manually flip 45% of the labels into the adjacent class. Then we compare the gradients of the noisy samples between SPL and co-teaching, i.e., $\sum_{j \in I_{01}} \nabla l_j(w_t) + \sum_{j \in I_{00}} \nabla l_j(w_t)$ in Eq. (3.2). The gradients are calculated from the last linear layer of a CNN model. We use the gradient norm here since the gradients themselves cannot be directly compared. Fig. 3.1a shows the average gradient norm of the noisy data at each epoch, and Fig. 3.1b shows the norm of the accumulated gradients of the noisy data over time. We can see that the disagreement from exchanging data helps achieve smaller noisy gradients compared to SPL, and it also slows down the accumulation of noisy gradients.

3.1.2 The power of Agreement

The training procedure is a dynamic process in which the learner's ability grows as training proceeds. When the learners are mature, aggregating their knowledge can be more beneficial than only exchanging it.

Lemma 1. Given two networks, the samples selected by either network can be decomposed, with respect to the other network, into two subsets: agreement samples ($A = I_{11} \cup I_{01}$) and disagreement samples ($\bar{A} = I_{10} \cup I_{00}$). Define $p_I = \frac{n_c}{n_c + n_{\bar{c}}}$ as the purity of a subset $I$, where $n_c$ and $n_{\bar{c}}$ denote the numbers of clean and noisy samples in $I$. If $p_A > p_{\bar{A}}$, then $p_A > p_{A \cup \bar{A}}$.

$$
\begin{aligned}
p_A > p_{\bar{A}} &\Leftrightarrow \frac{n_{11}}{n_{11} + n_{01}} > \frac{n_{10}}{n_{10} + n_{00}} && \text{(by definition)} \\
&\Leftrightarrow n_{11}\, n_{00} > n_{10}\, n_{01} && \text{(simplify)} \\
&\Leftrightarrow n_{11}(n_{00} + n_{01}) > (n_{10} + n_{11})\, n_{01} && \text{(add } n_{11} n_{01}\text{)} \\
&\Leftrightarrow \frac{n_{11}}{n_{01}} > \frac{n_{10} + n_{11}}{n_{00} + n_{01}} = \frac{n_{1\cdot}}{n_{0\cdot}} && \text{(re-arrange)} \\
&\Leftrightarrow \frac{1}{1 + n_{01}/n_{11}} > \frac{1}{1 + n_{0\cdot}/n_{1\cdot}} && \text{(reciprocal twice)} \\
&\Leftrightarrow \frac{n_{11}}{n_{11} + n_{01}} > \frac{n_{1\cdot}}{n_{1\cdot} + n_{0\cdot}} && \text{(simplify)} \\
&\Leftrightarrow p_A > p_{A \cup \bar{A}} && \text{(by definition)}
\end{aligned}
$$

where $n_{1\cdot} = n_{11} + n_{10}$ and $n_{0\cdot} = n_{01} + n_{00}$.

Lemma 1 implies that, for two networks, when the purity of the agreed samples is larger than that of the disagreed samples, the common samples selected by the two networks are guaranteed to have higher purity than those selected by a single network. The proof is straightforward from the definitions, as shown above. In fact, recent works [63, 77] propose to aggregate knowledge from multiple networks to filter out noisy samples during training and show promising results. However, an ensemble does not guarantee higher purity, especially in the early training stage, since errors could also be magnified. Therefore, a common strategy is to train on the entire data until a certain epoch and then perform the ensemble. The performance of such ensemble methods depends heavily on the warm-up procedure, in which the entire data is used to train; therefore, if the noise rate is large, these methods may not be optimal.
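Lemma 1 is easy to sanity-check numerically: count clean and noisy samples in the agreement and disagreement sets and compare the resulting purities. The sketch below does exactly that for a few random configurations; the counts are made up for illustration.

```python
import random

def purity(n_clean, n_noisy):
    return n_clean / (n_clean + n_noisy)

random.seed(0)
for _ in range(5):
    # Counts of (clean, noisy) samples that the two networks agree / disagree on.
    n11, n01 = random.randint(50, 500), random.randint(10, 200)   # agreement set A
    n10, n00 = random.randint(50, 500), random.randint(10, 200)   # disagreement set A-bar
    p_agree = purity(n11, n01)
    p_disagree = purity(n10, n00)
    p_union = purity(n11 + n10, n01 + n00)   # purity of everything a single network selected
    if p_agree > p_disagree:
        # Lemma 1: the agreement set is then strictly purer than the union.
        assert p_agree > p_union
    print(f"p_A={p_agree:.3f}  p_Abar={p_disagree:.3f}  p_union={p_union:.3f}")
```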
Figure 3.2: Robust Collaborative Learning (RCL) framework. Left: one network's view, where A' denotes all the peer networks of A. Right: all networks in one mini-batch. Knowledge fusion and updates can be done in parallel for all the networks.

3.2 The Proposed Method

The analysis of disagreement and agreement in the previous section shows their advantages, but it also reveals that focusing on either one alone while ignoring the other may lead to suboptimal results. Therefore, in this paper, we propose Robust Collaborative Learning (RCL), a framework that combines the aforementioned ingredients in a coherent way. Fig. 3.2 shows the overall structure. In RCL, each network is an individual learner, while the remaining networks form a Peer system. From each network's perspective, it exchanges knowledge with its Peer system (Fig. 3.2, left). The knowledge it receives is fused from multiple networks in the Peer system, while the knowledge it offers waits for fusion when it serves as a peer network itself (Fig. 3.2, right). The pseudo code of the overall algorithm is given in Alg. 3.1. Next, we introduce each component of RCL in detail.

3.2.1 Self-Knowledge

During one mini-batch, each network first selects reliable samples on its own. The selection is based on the current loss, such that the top R × 100% ranked small-loss samples are selected. Due to the memorization effect in deep models [4], i.e., deep neural networks tend to learn easy patterns first and then memorize noise at later epochs, the reserve rate R(T) is designed to be monotonically decreasing with respect to the epoch T, from 100% until it reaches the clean rate (1 − ε) × 100%, where ε is the noise rate. T_cut is the switch epoch such that, after this epoch, only (1 − ε) × 100% of the data will be selected if we know the noise rate a priori; otherwise, ε itself becomes a hyper-parameter to tune. The selected samples are the self-knowledge of each individual network. After each network generates its own knowledge, that knowledge is used in the knowledge fusion step.

Algorithm 3.1: Pseudo code for the RCL algorithm.
Input: K networks {Θ_1, ..., Θ_K}, training data D, noise rate ε; (fixed) learning rate η, number of epochs T_max and iterations N_max; (hyper-parameters) epoch T_cut, fusion multiplier α, fusion exponent β.
Output: Updated network parameters {Θ'_1, ..., Θ'_K}.
 1: for T = 1 to T_max do
 2:   Shuffle training set D
 3:   Update R(T) = 1 − ε · min{T / T_cut, 1}                          // remember rate
 4:   Update r(T) = 1 − min{(T / (α T_cut))^β, 1}                      // fusion rate
 5:   for N = 1 to N_max do
 6:     Fetch mini-batch D̄ from D
 7:     for k = 1 to K do
 8:       Obtain D_k = argmin_{D ⊂ D̄ : |D| ≤ R(T)|D̄|} l(f_{Θ_k}, D)    // small-loss selection
 9:     for k = 1 to K do
10:       Obtain D'_k = Knowledge({D_1, ..., D_K} \ D_k, r(T))
11:       Update Θ'_k = Θ_k − η ∇ l(f_{Θ_k}, D'_k)
12: Return Θ' = {Θ'_1, ..., Θ'_K}
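The schedules in lines 3-4 of Algorithm 3.1 and the surrounding training loop translate almost line for line into code. Below is a compact PyTorch-flavored skeleton written under our own simplifying assumptions (a generic classification loss, and a `knowledge_fusion` helper mirroring Algorithm 3.2, a sketch of which appears after that algorithm below); it is meant to mirror the structure of the pseudo code, not to reproduce the released implementation.

```python
import torch
import torch.nn.functional as F

def reserve_rate(T, T_cut, eps):
    """R(T) in Algorithm 3.1: fraction of small-loss samples each network keeps."""
    return 1.0 - eps * min(T / T_cut, 1.0)

def fusion_rate(T, T_cut, alpha, beta):
    """r(T) in Algorithm 3.1 / Eq. (3.3): strength of disagreement to mix in."""
    return 1.0 - min((T / (alpha * T_cut)) ** beta, 1.0)

def rcl_epoch(nets, optimizers, loader, T, T_cut, eps, alpha, beta, knowledge_fusion):
    R, r = reserve_rate(T, T_cut, eps), fusion_rate(T, T_cut, alpha, beta)
    for x, y in loader:
        # Each network selects its own small-loss samples (self-knowledge).
        selections = []
        for net in nets:
            with torch.no_grad():
                losses = F.cross_entropy(net(x), y, reduction="none")
            keep = torch.argsort(losses)[: max(1, int(R * len(y)))]
            selections.append(set(keep.tolist()))
        # Each network is then updated on knowledge fused from its peers.
        for k, (net, opt) in enumerate(zip(nets, optimizers)):
            peers = [s for i, s in enumerate(selections) if i != k]
            idx = torch.tensor(sorted(knowledge_fusion(peers, r)))
            opt.zero_grad()
            F.cross_entropy(net(x[idx]), y[idx]).backward()
            opt.step()
```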
Ideally, when the networks are trained well, the knowledge of agreement, i.e., the data points picked by all the peer networks, can be used to update network k.

Figure 3.3: Knowledge fusion among peer networks. (a) A 2-network example. (b) Fusion rate w.r.t. β.

However, during the early stage the networks have not learned well and are prone to making mistakes. Therefore, disagreement is introduced to reduce the noise in the gradients. There are two parameters associated with it. The first one is α, which determines the switch epoch between disagreement and agreement. Instead of setting a particular epoch, we design α as a parameter that defines the lag of the switch epoch relative to Tcut. The second one is the fusion rate r, which controls the strength of disagreement, i.e., the proportion of disagreed samples that are included in addition to the common samples. As the epoch increases, fewer disagreed samples are included until only common ones are selected. Fig. 3.3a illustrates this procedure for the example of two peer networks. The decay of the strength of disagreement is controlled by a hyper-parameter β, and different decay patterns are shown in Fig. 3.3b. A small β corresponds to low disagreement strength and quickly transitions the state from disagreement to agreement, while a large β encourages disagreement and slows down the transition. In summary, the fusion rate for each epoch is calculated by the following function:

$$
r(T) =
\begin{cases}
1 - \left( T / (\alpha T_{cut}) \right)^{\beta}, & \text{if } T < \alpha T_{cut} \\
0, & \text{if } T \geq \alpha T_{cut}
\end{cases}
\tag{3.2}
$$

where T is the epoch, Tcut is the threshold at which the reserve rate R reaches the clean rate, and α is the lag parameter that controls the end point of the disagreement decay relative to the reserve rate. After calculating the fusion rate, disagreed samples are randomly picked and added to the agreed samples as the final knowledge of the Peer system. Such randomness also introduces a certain level of disagreement, since each network receives different candidates to train on even if the common samples are the same within each Peer system. The pseudo code of the knowledge fusion procedure for multiple networks is given in Alg. 3.2.

3.2.3 Knowledge Exchange

Network k receives the candidate samples from its Peer system and updates its parameters based on them. The same procedure can be done in parallel for all the networks. Although it may seem that network k only receives knowledge in this round without giving out its own, when the other networks are updating, each of them uses knowledge from network k. This procedure is referred to as knowledge exchange. After all the networks have been updated, they enter the next iteration and repeat the process.

Algorithm 3.2: Knowledge Fusion Function.
Input: Given the k-th network, the knowledge of all other peer networks {D1..DK} \ Dk; fusion rate r(T).
Output: Data for updating the k-th network, D′k.
1: Dagree = Intersect({D1..DK} \ Dk)
2: Dpotential = Union({D1..DK} \ Dk)
3: if |Dagree| == |Dpotential| then
4:   D′k = Dagree
5: else
6:   Duncertain = Dpotential − Dagree
7:   nin = r(T) · |Duncertain|
8:   Din = random_sample(Duncertain, nin)
9:   D′k = Dagree + Din
10: Return D′k;
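A minimal Python sketch of the fusion step in Alg. 3.2, operating on sets of sample indices; the function and variable names are ours, and this is not the released implementation:

```python
import random

def knowledge_fusion(peer_selections, fusion_rate):
    """Fuse the small-loss selections of the peer networks (sketch of Alg. 3.2).

    peer_selections: list of index sets, one per peer network (network k excluded).
    fusion_rate: r(T) from Eq. (3.2), the fraction of disagreed samples to mix in.
    """
    agree = set.intersection(*peer_selections)   # samples all peers picked
    potential = set.union(*peer_selections)      # samples any peer picked
    uncertain = potential - agree                # disagreed samples
    if not uncertain:
        return agree
    n_in = int(fusion_rate * len(uncertain))
    picked = set(random.sample(sorted(uncertain), n_in))
    return agree | picked

# Example with three peer networks' selections for one mini-batch.
peers = [{0, 1, 2, 5}, {0, 1, 3, 5}, {0, 1, 2, 3}]
print(knowledge_fusion(peers, fusion_rate=0.5))
```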
3.3 Experiment

In this section, we conduct both synthetic and real-data experiments on various datasets to validate the proposed method. The synthetic experiment follows the standard setting in the literature using benchmark data from vision recognition problems [36, 82, 89]. The real-data experiment focuses on cancer drug discovery using large-scale bioinformatics data from multiple sources [105, 124].

Table 3.1: Description of the datasets in the synthetic experiment.
Data | # training | # testing | # class | image size
CIFAR-10 | 50,000 | 10,000 | 10 | 32x32
CIFAR-100 | 50,000 | 10,000 | 100 | 32x32

Figure 3.4: Noise examples for a 5-class problem. (a) symmetric 50%; (b) pairflip 45%.

3.3.1 Synthetic Experiment on Image Benchmark

We demonstrate the effectiveness of RCL from the following aspects: the benefit brought by agreement and disagreement, respectively; sensitivity analysis of the vital hyper-parameters of RCL; and a variant of RCL that significantly reduces the computational burden and time complexity.

Datasets. We use the CIFAR-10 and CIFAR-100 datasets, described in Table 3.1. The original data is clean, and we manually create corrupted labels following the strategy in [40, 117]. Two common noise scenarios are considered, symmetric flip (SYM) and pair flip (PF), as shown in Fig. 3.4. For SYM, the label of each class is uniformly flipped to the remaining classes with equal probability; for PF, the label of each class flips only to one different but similar class. The noise rate ε quantifies the overall proportion of labels that are flipped for each class.

Network Architecture. We follow the same 9-layer CNN architecture as [40, 61], which is commonly used in weakly supervised learning (Table 3.2).

Table 3.2: CNN architecture. The negative slope for each LeakyReLU is set to 0.01.
32 × 32 RGB image
3 × 3 conv, 128, LReLU (×3)
2 × 2 max-pool, stride 2, dropout p = 0.25
3 × 3 conv, 256, LReLU (×3)
2 × 2 max-pool, stride 2, dropout p = 0.25
3 × 3 conv, 512, LReLU
3 × 3 conv, 256, LReLU
3 × 3 conv, 128, LReLU
avg-pool
dense 128 → 10

Figure 3.5: Episode curves (upper: test accuracy, lower: pure ratio) w.r.t. training epochs; panels (a)-(d) show test accuracy and (e)-(h) pure ratio for CIFAR10 SYM 50%, CIFAR10 PF 45%, CIFAR100 SYM 50%, and CIFAR100 PF 45%. The grey lines are SPL (net_1) and co-teaching (net_2), and the blue lines are RCL with different numbers of networks. The shaded area represents the variation across 5 random seeds.
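For reference, Table 3.2 translates into a few lines of PyTorch; the following is a sketch of our reading of the table (the padding, the final pooling, and the absence of batch normalization are assumptions), not the training code used in the experiments:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution followed by LeakyReLU(0.01), per Table 3.2
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.LeakyReLU(0.01))

class CNN9(nn.Module):
    """Sketch of the 9-layer CNN in Table 3.2 (n_classes = 10 for CIFAR-10, 100 for CIFAR-100)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 128), conv_block(128, 128), conv_block(128, 128),
            nn.MaxPool2d(2, 2), nn.Dropout(0.25),
            conv_block(128, 256), conv_block(256, 256), conv_block(256, 256),
            nn.MaxPool2d(2, 2), nn.Dropout(0.25),
            conv_block(256, 512), conv_block(512, 256), conv_block(256, 128),
            nn.AdaptiveAvgPool2d(1),          # global average pooling to 128-dim
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```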
We use the Adam optimizer with a momentum of 0.9 and an initial learning rate of 0.001. The batch size is set to 128 and the maximum number of epochs is 200. We implemented the models using PyTorch, and all the experiments are conducted on NVIDIA GPUs. We ensure that, for the same dataset and noise scenario, the different comparison methods run on the same machine.

Table 3.3: Performance of non-ensemble baselines and the proposed method RCL over different numbers of networks (K) at fixed noise rates. Each method is repeated for 5 random seeds with the average performance reported. Significance t-tests (one-sided) are conducted between RCL and the best baseline; a p-value less than 0.05 is considered a significant difference.

Test Accuracy
Data, Noise | Standard | SPL | De-CP | Co-T | K=3 | K=5 | K=7 | K=9 | K=11 | K=13 | +R | p-val | # nets
CIFAR10 SYM 50% | 48.52 | 70.92 | 45.53 | 73.45 | 75.72 | 77.44 | 78.01 | 78.47 | 78.91 | 78.91 | 7.43 | 2e-9 | 11
CIFAR10 PF 45% | 48.65 | 56.08 | 49.24 | 72.77 | 74.59 | 76.28 | 77.28 | 78.25 | 78.70 | 79.15 | 8.77 | <1e-9 | 13
CIFAR100 SYM 50% | 21 | 36.21 | 17.51 | 38.14 | 40.05 | 41.83 | 42.43 | 42.96 | 43.77 | 43.14 | 14.75 | 4e-07 | 11
CIFAR100 PF 45% | 29.57 | 28.63 | 26.17 | 30.44 | 32.51 | 34.90 | 36.86 | 37.85 | 39.02 | 39.15 | 28.58 | 4e-09 | 13

Pure Ratio
Data, Noise | Standard | SPL | De-CP | Co-T | K=3 | K=5 | K=7 | K=9 | K=11 | K=13 | +R | p-val | # nets
CIFAR10 SYM 50% | 50.31 | 84.22 | 40.48 | 83.95 | 86.96 | 88.82 | 89.47 | 89.86 | 90.11 | 90.33 | 7.33 | <1e-9 | 11
CIFAR10 PF 45% | 54.89 | 68.09 | 51.23 | 79.25 | 82.72 | 84.99 | 86.16 | 87.17 | 87.69 | 88.11 | 11.18 | <1e-9 | 13
CIFAR100 SYM 50% | 49.95 | 80.51 | 42.89 | 80.65 | 83.85 | 86.20 | 87.14 | 87.82 | 88.20 | 88.20 | 9.36 | <1e-9 | 11
CIFAR100 PF 45% | 54.84 | 58.66 | 53.42 | 59.03 | 61.07 | 63.94 | 66.97 | 68.65 | 69.36 | 69.99 | 18.57 | <1e-9 | 13

Experimental Setup. We keep the major hyper-parameter of co-teaching [40], i.e., Tcut in Alg. 3.1, and fix the ones that have a subtle effect on results in the original paper, since RCL also introduces extra hyper-parameters, which are our main focus. In our experiments, we first tune Tcut = {5, 10, 15} for one and two networks (i.e., SPL and co-teaching). Then we use the best value for further tuning of RCL. For the other hyper-parameters, we fix α = 2 for all scenarios and use a fixed β for a given noise scenario, based on the sensitivity analysis over β = {0.0, 0.1, 0.3, 0.5, 1.0, 2.0, 8.0}. We test over a range of numbers of networks K = {3, 5, 7, 9, 11, 13}, and noise rates ε = {0.25, 0.35, 0.45} for pair-flip noise and ε = {0.5, 0.6, 0.7, 0.8} for symmetric noise. For all the comparison methods, we run 5 random seeds to get an average performance unless stated otherwise. We found that the variance of RCL is relatively small across different random seeds.

Baselines. We consider the following baselines. (1) Standard, a single network trained on the entire dataset. (2) SPL [50], a single network that produces a curriculum based on its own knowledge. (3) Decoupling (De-CP) [68], a double-network system in which the networks only update parameters on data whose predicted labels disagree between the two networks. (4) Co-teaching [40], a double-network system in which the two networks exchange curricula at each iteration. (5) Ensemble consensus [63], specifically the LNEC variant, a multi-network system that explores agreement between multiple networks. (6) Self-ensemble (SELF) [77], which explores agreement between consecutive epochs within a network. Note that for the ensemble baseline SELF, the original implementation involves other hybrid components and the authors did not release the code, which makes a direct comparison difficult; we therefore adopt the core idea of the paper, i.e., the temporal ensemble, and implement the method ourselves.
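For concreteness, the SYM and PF corruptions of Fig. 3.4 used throughout this setup can be generated with a short routine like the following (a sketch under our own naming; `labels` is an integer array of clean class labels):

```python
import numpy as np

def corrupt_labels(labels, n_classes, noise_rate, scheme="sym", seed=0):
    """Sketch of the SYM / PF label corruption used in the synthetic experiments."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate          # which samples to corrupt
    for i in np.flatnonzero(flip):
        if scheme == "sym":                              # uniform over the other classes
            choices = [c for c in range(n_classes) if c != labels[i]]
            noisy[i] = rng.choice(choices)
        else:                                            # pair flip: the adjacent class
            noisy[i] = (labels[i] + 1) % n_classes
    return noisy

# e.g., 45% pair-flip noise on CIFAR-10-style labels
clean = np.random.randint(0, 10, size=50000)
noisy = corrupt_labels(clean, n_classes=10, noise_rate=0.45, scheme="pf")
```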
Test accuracy and pure ratio serve as the evaluation metrics. The test accuracy is evaluated on the clean test set. The pure ratio measures the average proportion of clean data selected by the algorithm across all mini-batches in one epoch during training. All metrics are evaluated on one network. For methods that contain more than one network, we only evaluate the performance of the first network for a fair comparison. We also find that the variation among networks is small in the end.

Benefit of agreement. The benefit of agreement comes from ensembling multiple networks in the selection of clean samples, which can be verified by adding networks while fixing the other components. Fig. 3.5 shows the episode curves for RCL over different numbers of networks for a given noise rate. We can see that the test accuracy gradually improves as the number of networks increases. For certain noise scenarios, e.g., CIFAR100 symmetric 50%, the test accuracy reaches a plateau and starts to decrease at 13 networks. For the other scenarios, it continues to increase as the number of networks grows. A higher pure ratio generally leads to higher test accuracy. We calculate the final performance as the average of the last ten epochs following [40] and then average over multiple runs. The results of RCL and the baselines are shown in Table 3.3. Importantly, RCL selects significantly more clean data than the baselines, which verifies the benefit of agreement during the learning process.

Benefit of disagreement. The agreement strategy is relatively intuitive since it ensembles predictions from multiple networks after each learner has achieved a certain accuracy. However, agreement may not work when the noise rate is large, since it also ensembles the errors, especially during the early training stage. In such situations, disagreement plays the role of reducing the noise in gradients and helps pick out more clean samples. The disagreement takes place at two levels: the first level is to exchange data in a learn-from-the-other manner such that each network receives different data from its Peer system; the second level is that the strength of disagreement varies along the training procedure and can be controlled by a hyper-parameter. Both levels can improve the selection of clean samples as well as the generalization performance. To verify this, we compare RCL with several ensemble methods [63, 77] which only explore agreement among multiple networks (or multiple training epochs), where all the networks use the same set of candidates to train. Fig. 3.6 shows the test accuracy of the competing methods for different noise rates on CIFAR10. The pure ratio reveals exactly the same pattern and is therefore not shown in the figure. Complete results for all the datasets and noise scenarios are presented in Table 3.4. First, we find that the temporal ensemble (SELF) does not work as well as the network ensembles in the provided scenarios. Second, we can see that when the noise rate is small, RCL reaches similar accuracy as the state-of-the-art ensemble methods; when the noise rate is large, RCL yields significantly better performance than the ensemble baselines, which demonstrates the power of introducing disagreement during the learning process.

Figure 3.6: Test accuracy of ensemble methods and RCL (K=9) over various noise rates on CIFAR10. (a) symmetric flip; (b) pair flip. The shaded area represents the variation across 3 random seeds.
Figure 3.7: Improvement of RCL over the ensemble method by each disagreement level on the CIFAR10 pairflip 45% scenario (K=9). (a) Test accuracy; (b) pure ratio. Averaged over 3 random seeds.

We also find that the variations of the baselines are much larger than those of RCL, which indicates the robustness of the proposed method. One thing to point out is that, although LNEC and RCL both use 9 networks, RCL only uses the knowledge of the other 8 networks; it therefore does not outperform LNEC on the easy tasks but still reaches equal goodness.

Next, in order to see the improvement brought by each level of disagreement, we add the data exchange step onto the LNEC baseline and compare it with pure LNEC and RCL (all using 9 networks). The result is shown in Fig. 3.7. We can see that the combination of agreement and data exchange, which makes each network receive different candidates for training, performs better than a pure ensemble based on agreement (LNEC vs. LNEC+Exchange). Moreover, encouraging disagreement in the early stage can further improve the selection of clean samples, leading to higher accuracy (LNEC+Exchange vs. RCL).

Sensitivity analysis. The hyper-parameter β controls the strength of disagreement and is important in RCL. Therefore, we would like to see (1) whether different values of this parameter affect the performance of RCL, and (2) how they affect it. We test over a range of values to study its behavior on different tasks. Fig. 3.8 shows the results on the CIFAR10 dataset. We can see that the test accuracy shows opposite patterns over the same range of β for the two different noise scenarios. When the task is relatively easy (e.g., CIFAR10 symmetric 50%), a small β yields better accuracy. If the noise rate is large, i.e., the task is more difficult, it favors a relatively large β, which encourages disagreement during the early stage. However, an extremely large value of β is not beneficial. The difference in accuracy can reach up to 2% for different values of β while fixing all other parameters.

Table 3.4: Final performance of ensemble methods and RCL over various noise rates (K=9). Bold numbers indicate that the method is significantly better than the second-best method.
Data | Noise | Test accuracy (SELF / LNEC / RCL) | Pure ratio (SELF / LNEC / RCL)
CF10 SYM | 50% | 61.65 / 79.00 / 78.51 | 81.38 / 90.07 / 89.85
CF10 SYM | 60% | 42.97 / 74.32 / 72.65 | 64.89 / 86.15 / 85.14
CF10 SYM | 70% | 28.99 / 48.05 / 59.60 | 46.20 / 61.31 / 74.40
CF10 SYM | 80% | 17.04 / 22.45 / 30.20 | 29.13 / 31.65 / 42.04
CF10 PF | 25% | 72.97 / 83.23 / 83.18 | 88.34 / 93.13 / 92.75
CF10 PF | 35% | 56.77 / 81.83 / 81.57 | 75.58 / 91.05 / 90.93
CF10 PF | 45% | 37.21 / 63.09 / 78.34 | 57.61 / 72.22 / 87.28
CF100 SYM | 50% | 25.02 / 43.89 / 43.37 | 66.69 / 86.92 / 87.99
CF100 SYM | 60% | 15.82 / 32.07 / 36.12 | 52.60 / 78.06 / 82.20
CF100 SYM | 70% | 7.69 / 23.41 / 26.71 | 37.23 / 64.44 / 72.34
CF100 SYM | 80% | 3.29 / 11.99 / 14.79 | 24.02 / 40.20 / 49.96
CF100 PF | 25% | 38.23 / 52.83 / 51.03 | 82.21 / 92.04 / 87.94
CF100 PF | 35% | 28.08 / 46.49 / 45.91 | 69.12 / 82.28 / 81.28
CF100 PF | 45% | 19.28 / 31.48 / 38.38 | 53.99 / 61.16 / 69.17

Figure 3.8: Test accuracy of RCL over various values of β, averaged over 3 random seeds (K=3). (a) symmetric 50%; (b) pairflip 45%.
Reduce time complexity. While effective, RCL (like other methods that involve multiple networks) requires more computational power and running time than methods using one or two networks.

Figure 3.9: Revise-and-restart RCL on CIFAR10 PF 45% using K=3 networks. (a) The confusion matrix after we revise the labels based on the predictions; numbers in cells denote percentages. (b) Test accuracy w.r.t. epochs.

One way to reduce the time complexity without deteriorating performance is to utilize the unselected samples during training. Therefore, we propose a revise-and-restart strategy based on the current framework. When the training reaches a certain epoch (usually a plateau), we first revise the labels of the unselected samples based on the current predictions of the network. We still use the first network in the system for consistency; other options, such as using the ensemble prediction of all networks, are also applicable. Then we restart the training procedure, i.e., first introduce disagreement and then agreement. The decay factor β is reduced by half every time we restart the procedure, because revising labels decreases the noise rate. Fig. 3.9 shows the corresponding result on CIFAR10 using only 3 networks. Originally, 45% of the ground-truth labels are flipped to the adjacent class. After revising the labels of the unselected samples at epoch 50, the label precision for each class is shown in Fig. 3.9a. We can see that the percentage of correct labels for each class increases significantly compared to 45%. Fig. 3.9b shows the episode curve of revise-and-restart compared to the original 3 networks: the test accuracies using 3 networks are 74.1% before and 78.4% after. The revise-and-restart strategy based on RCL with 3 networks reaches performance as high as that of using 9 networks, which is a considerable reduction in the computational burden. In summary, we demonstrated the effectiveness of RCL from various aspects on the synthetic data.

Table 3.5: Dataset statistics per cell line. DP = Drug Profile, HQ = High Quality (qdp > 0.7), K = 1,000. A small subset (≈ 10%) of the high-quality drug profiles is sampled as testing data; the rest are used for training. For each cell line, a unique drug profile is associated with 162 predictable genes, so the sample size is (# drug profiles) × 162.
Cell line | VCAP | MCF7 | PC3 | A549 | A375 | HT29
Unique DP | 4760 | 5245 | 5238 | 4195 | 1311 | 961
HQ DP | 463 | 881 | 347 | 1244 | 405 | 127
HQ DP Test | 50 | 100 | 50 | 100 | 50 | 30
Test size | 8100 | 16200 | 8100 | 16200 | 8100 | 4860
Train size | 763K | 833K | 844K | 663K | 206K | 151K

3.3.2 Drug-induced Gene-Expression Change Prediction

Previous studies on learning with noisy labels mostly focus on vision tasks in a synthetic setting. In this subsection, we apply RCL to a real-world problem in the bioinformatics domain, where noise naturally exists in experiments due to the data generation process, and demonstrate its effectiveness.
Cancer drug discovery is in high demand but is also a tough problem because of low response rates and severe side effects [125]. Emerging profiling technology enables measurements of drug-induced gene-expression change (GE-change), i.e., whether the expression of a particular gene increases or decreases given a certain drug treatment, making it possible to discover new drugs and elucidate the mechanism of action and toxicity of a drug candidate at the transcriptomic level. Recent years have also witnessed an increasing number of public repositories providing millions of transcriptome profiles, such as LINCS from the Broad Institute [105], GEO from NCBI [10], and TCGA from NIH [123]. However, large-scale profiling of drug-induced gene expression remains expensive. Therefore, there is a surging demand for computational methods that predict drug-induced gene-expression profiles. For example, [124] proposed a deep learning framework for such tasks using drug and gene descriptors; however, due to technical and biological variations, the quality of GE-change obtained from biological experiments varies among different experiments. Due to the poor quality, half of the LINCS profiles that cost millions of dollars are discarded in regular analyses. Therefore, to make use of the full dataset, the prediction model should also take the quality of the data into consideration.

Figure 3.10: Illustration of a drug profile and its quality. Each drug profile (DP) has multiple replicates of gene expression (GE) profiles.

Datasets. We use LINCS from the Broad Institute [105], which contains 1.3 million normalized profiles and 11,000 small-molecule perturbagens covering more than 70 cell lines. For each cell line, GE-change readings for 978 genes are available under different drug profiles.

Label Quality. Due to the common variation in biological experiments, in order to get reliable readings, each drug profile is repeatedly tested on the 978 genes a different number of times, which results in multiple GE-change readings for the same drug profile. Some drug profiles have consistent readings while others do not, which means the label quality varies across drug profiles. In order to conduct the experiments, we need (1) a unique GE-change reading for each drug profile, and (2) a rough quality measurement for generating the test set (high-quality data). We do not have exact information about which drug profile is correct; in fact, no drug profile has perfectly correct or wrong GE-change readings. We can only estimate their quality through statistical measurements. We follow [105] to estimate the quality of a drug profile and calculate the corresponding unique GE-change reading. The procedure is illustrated in Fig. 3.10. For a given drug profile (drug + cell line + time + dose), an average GE-change reading is obtained across all replicates. We then calculate the correlation between each reading and the average reading and take the mean; the resulting average correlation is regarded as the quality of the readings for this drug profile. The final GE-change reading is obtained by a weighted average of all replicates based on the correlations.
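The quality estimate and fused reading described above reduce to a few lines of NumPy; the sketch below assumes `replicates` is a (number of replicates × 978) array of GE-change readings for a single drug profile (the function name is ours):

```python
import numpy as np

def profile_quality_and_reading(replicates):
    """Sketch of the per-drug-profile quality score and fused GE-change reading.

    replicates: array of shape (n_replicates, 978), one row per replicate.
    Returns (quality, fused_reading), following the procedure around Fig. 3.10.
    """
    avg = replicates.mean(axis=0)                                  # average reading
    # correlation of each replicate with the average reading
    corrs = np.array([np.corrcoef(rep, avg)[0, 1] for rep in replicates])
    quality = corrs.mean()                                         # mean correlation = quality
    weights = corrs / corrs.sum()                                  # correlation-based weights
    fused = (weights[:, None] * replicates).sum(axis=0)            # weighted-average reading
    return quality, fused

# e.g., drug profiles with quality > 0.7 are treated as high quality (Table 3.5)
```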
Table 3.6: Generalization performance of the comparison methods for six cell lines (each cell reports ACC / macro-F1). K = number of networks, FR = forget rate. Each method is repeated for 5 random seeds. Significance t-tests (one-sided) are conducted between RCL and the best baseline method; a p-value less than 0.05 is considered a significant difference.
Cell line | Standard | SPL | Decoupling | Co-teaching | SELF | LNEC | RCL | p-val (ACC) | Best K | FR
VCAP | 47.64 / 0.462 | 49.36 / 0.473 | 47.70 / 0.473 | 50.26 / 0.482 | 49.23 / 0.474 | 49.12 / 0.472 | 52.22 / 0.495 | 1e-3 | 4 | 0.3
MCF7 | 51.41 / 0.446 | 55.14 / 0.466 | 54.57 / 0.456 | 56.44 / 0.476 | 56.45 / 0.475 | 54.04 / 0.464 | 57.98 / 0.482 | 5e-4 | 4 | 0.2
PC3 | 47.90 / 0.474 | 47.92 / 0.473 | 45.32 / 0.467 | 48.22 / 0.477 | 48.42 / 0.478 | 48.18 / 0.475 | 49.04 / 0.484 | 8e-3 | 4 | 0.1
A549 | 50.73 / 0.403 | 53.52 / 0.417 | 53.38 / 0.422 | 52.15 / 0.414 | 53.67 / 0.421 | 53.63 / 0.417 | 54.00 / 0.421 | 0.25 | 3 | 0.1
A375 | 46.42 / 0.396 | 47.14 / 0.395 | 49.37 / 0.407 | 48.87 / 0.399 | 48.43 / 0.402 | 46.95 / 0.393 | 48.99 / 0.395 | - | 3 | 0.4
HT29 | 46.39 / 0.441 | 46.51 / 0.444 | 47.18 / 0.449 | 47.86 / 0.456 | 47.12 / 0.453 | 46.66 / 0.452 | 48.13 / 0.458 | 0.05 | 3 | 0.3

Features and Label. A tuple of drug profile and gene, (drug, gene), corresponds to a continuous reading y(d,g) which measures how the gene expression changes relative to the reference control. Therefore, we have two sets of features: drug features and gene features. To obtain drug features, we download SMILES (simplified molecular-input line-entry system) strings for each drug from the PubChem platform (https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi) and use the Python package RDKit (https://www.rdkit.org/) to convert the string representations into molecular fingerprints. The molecular fingerprint is a 1 × 1024 binary-coded feature where each position represents the presence or absence of a molecular substructure. For gene features, we use the Gene Ontology (GO) features preprocessed by [124], which describe the biological domain knowledge of a gene, such as molecular functions and cellular components, with dimension 1107. We also discretize the target y(d,g) into 3 classes: y(d,g) < −1.5 (down-regulated), −1.5 ≤ y(d,g) ≤ 1.5 (no change), and y(d,g) > 1.5 (up-regulated). The overall task is to predict the regulation effect of drug profiles on different genes.

Network Architecture. We use fully-connected layers with (2131, 128, 32, 3) neurons at each layer as the overall structure. Leaky ReLU with a 0.01 slope and a 0.5 dropout rate are used. We use the Adam optimizer with a 0.001 learning rate. The real data converges fast; we run 25 epochs and use the average over the last 5 epochs as the final performance.

Experiment Set-up. We regard drug profiles that have rdp > 0.7 as high quality. For each cell line, to construct the training and testing datasets, we randomly sample 10% of the drug profiles from the high-quality data to serve as the testing drug profiles. The left-over high-quality profiles, mixed with the remaining drug profiles, are used as training drug profiles. Each drug profile is originally associated with 978 genes; however, not all genes are predictable. Therefore, we select 162 genes that are relatively predictable compared to the rest based on permutation tests [38]. In the end, each drug profile is associated with 162 genes, and the training and testing sizes are shown in Table 3.5. We ensure that drug profiles that appear in the training set are not included in the testing set. Moreover, since a drug having a significant regulation effect is rare, the data is highly imbalanced. Therefore, we down-sample the dominant class during training while keeping the test set unchanged.
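As an illustration of the featurization described above, the SMILES-to-fingerprint conversion and the target discretization look roughly as follows (a sketch; the exact RDKit fingerprint settings, e.g., the Morgan radius, are assumptions on our part):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def drug_fingerprint(smiles, n_bits=1024, radius=2):
    """Convert a SMILES string into a 1 x n_bits binary fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.int8)

def discretize_target(y, threshold=1.5):
    """Map a continuous GE-change reading to {0: down, 1: no change, 2: up}."""
    if y < -threshold:
        return 0
    if y > threshold:
        return 2
    return 1

x_drug = drug_fingerprint("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
label = discretize_target(-2.1)                        # -> 0 (down-regulated)
```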
We also add the macro-F1 score as an evaluation metric in addition to accuracy. In the real experiments, we do not know the noise rate ε; therefore, it becomes a hyper-parameter to tune. In order to ensure that all methods eventually include the same number of data points, we first search across all possible noise (forget) rates {0.1, 0.2, 0.3, .., 0.9} for SPL (one network), pick the best one, and use it for the other methods. We tune K = {3, 4} for all methods involving multiple networks, and α = {2, 3, 4} and β = {2, 3, 4} for RCL. Each method is run with 5 random seeds and the average is taken as the final performance.

Results. Table 3.6 shows the final performance on the real data for the different methods. We see that for most cell lines, RCL achieves the best accuracy compared to the competing baselines. The results further confirm the effectiveness of the proposed method in real-world settings, which largely broadens the potential applications of RCL beyond the computer vision domain. A real drug discovery scenario is to use the RCL predictor to screen millions of compounds in an existing large library and rank the compounds by the Reverse Gene Expression Score (RGES) proposed in [20] for a certain disease, as illustrated in Figure 3.11.

Figure 3.11: Virtual drug screening pipeline using RCL and RGES.

CHAPTER 4
CONTRASTIVE LEARNING ON MOLECULAR GRAPHS WITH MULTI-LEVEL DOMAIN KNOWLEDGE

4.1 Problem Definition

A (molecular) graph can be represented as G = (V, E), where V = {v1, v2, .., v|V|} and E ⊆ V × V denote the node and edge sets, respectively. Let X ∈ R^{|V|×d1} be the feature matrix for all nodes in a graph, A ∈ R^{|V|×|V|} the adjacency matrix, and E ∈ R^{|E|×d2} the edge features. Our goal is to learn a graph encoder h = f(X, A, E) ∈ R^{d'} which maps an input graph to a vector representation without the presence of any labels. The learned encoder and representations can be used for downstream tasks directly or via fine-tuning.

4.2 Contrastive Learning Framework

Figure 4.1: Overall framework of MoCL. First, two augmented views are generated from local-level domain knowledge. Then, together with the original view, they are fed into the GNN encoder and projection head. The local-level contrast maximizes the MI between the two augmented views, while the global-level contrast maximizes the MI between two similar graphs. The similarity information is derived from global-level domain knowledge (MI: mutual information).

In a conventional contrastive learning framework (Fig. 4.1 left), for each graph G_i, two augmentation operators t_1 and t_2 are sampled from the family of all operators T and applied to G_i to obtain two correlated views G_i^1 = t_1(G_i) and G_i^2 = t_2(G_i). We use numbers in the superscript to represent different views throughout the paper. The correlated views are fed into a graph encoder f, producing graph representations h_i^1 and h_i^2, which are then mapped into an embedding space by a projection head g, yielding z_i^1 and z_i^2. The goal is to maximize the mutual information between the two correlated views in the embedding space via Eq. (4.1),

$$
\mathcal{L}^{local} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_i^{local}, \tag{4.1}
$$

and the loss for each sample, L_i^{local}, can be written as:
$$
\mathcal{L}_i^{local} = \mathcal{L}_i^1 + \mathcal{L}_i^2
= \underbrace{-\log \frac{e^{s(z_i^1, z_i^2)/\tau}}{\sum_{j=1, j \neq i}^{n} e^{s(z_i^1, z_j^2)/\tau}}}_{\text{view 1 contrasts view 2}}
\;\underbrace{-\log \frac{e^{s(z_i^2, z_i^1)/\tau}}{\sum_{j=1, j \neq i}^{n} e^{s(z_i^2, z_j^1)/\tau}}}_{\text{view 2 contrasts view 1}}, \tag{4.2}
$$

where n is the batch size, s(·, ·) is a function that measures the similarity of two embeddings, and τ is a scale parameter. The two correlated views z_i^1 and z_i^2 are regarded as a positive pair, while the remaining pairs in the batch are regarded as negative pairs. The objective aims to increase the probability of occurrence of positive pairs as opposed to negative ones. Note that the negative pairs can be formed in two directions: if z_i^1 is the anchor, all z_j^2 in view 2 are contrasted; if z_i^2 is the anchor, all z_j^1 in view 1 are contrasted. Thus the loss for each sample consists of two parts, as shown in Eq. (4.2).

Figure 4.2: Augmentation comparison. Upper: conventional augmentations that may alter the graph semantics: (a) drop node, (b) perturb edge, (c) extract subgraph, (d) mask attributes. Lower: (e) the proposed substructure substitution (replace functional group, add/drop general carbon), in which valid substructures are replaced by bioisosteres that share similar properties.

4.2.1 Local-level Domain Knowledge

Most existing approaches adopt random corruption during augmentation. For example, [141] proposed four types of augmentations for general graphs (Fig. 4.2 upper). However, such random corruption may alter the semantics of molecular graphs. For node dropping and edge perturbation, the resulting molecule is rarely biologically proper (Fig. 4.2ab). For example, dropping a carbon atom in the phenyl ring of aspirin breaks the aromatic system and results in an alkene chain (Fig. 4.2a); perturbing the connections of aspirin might introduce a five-membered lactone (Fig. 4.2b), which may drastically change the molecular properties. For subgraph extraction, the resulting structure is arbitrary and not representative of molecular functionality (Fig. 4.2c). For example, methyl acetate is a subgroup of aspirin (Fig. 4.2c), but it also appears frequently in many other compounds, such as digitoxin and vitamin C, with diverse chemical structures and biological effects. Enforcing high mutual information between such augmentation pairs may produce sub-optimal representations for downstream tasks. This phenomenon has also been observed in [141], where edge perturbation deteriorates the performance on certain molecular tasks. Among the general augmentations, only attribute masking (Fig. 4.2d) does not violate the biological assumptions, since it does not change the molecule; it only masks part of the atom and edge attributes. Therefore, we aim to infuse domain knowledge to assist the augmentation process.

We propose a new augmentation operator called substructure substitution, in which a valid substructure in a molecule is replaced by a bioisostere [71], producing a new molecule with physical or chemical properties similar to the original one (Fig. 4.2e). We compile 218 such rules from a domain resource (https://www.schrodinger.com/drug-discovery). Each rule consists of a source substructure and a target substructure represented by SMARTS strings (https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html). A sample rule is as follows:

[#6:2][#6:1](=O)[O;-,H1] >> [*:2][c:1]1nn[nH]n1

indicating the transition from the left substructure (carboxylic acid) to the right one (nitrogen heterocycle). The substitution rules have 36 unique source substructures which can be categorized into 8 groups. We summarize the statistics of the rules in Table 4.1.
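Rules in this source >> target form can be applied with RDKit's reaction-SMARTS machinery; the snippet below is a minimal sketch (aspirin is used as an arbitrary example input; this is not the released MoCL code):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Carboxylic acid -> tetrazole bioisostere, written as a reaction SMARTS.
rule = AllChem.ReactionFromSmarts("[#6:2][#6:1](=O)[O;-,H1]>>[*:2][c:1]1nn[nH]n1")

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
products = rule.RunReactants((mol,))                 # one tuple per way the rule matches

variants = set()
for (prod,) in products:
    try:
        Chem.SanitizeMol(prod)                       # some products may fail sanitization
        variants.add(Chem.MolToSmiles(prod))
    except Exception:
        continue                                     # skip chemically invalid products

print(variants)   # augmented molecule(s) sharing similar properties with the input
```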
Note that the target substructures are all unique and different. The original 218 substitution rules mostly apply at molecular positions where heteroatoms (heavy atoms that are not C or H) and aromatic rings are present, so the variation for general carbon groups is limited. Under the common assumption that changing a few general carbon atoms will not alter the molecular property too much, we add 12 additional rules to subtract and add general carbon groups from and to a molecule. Some sample rules are:

[*:1][CH2][CH2][*:2] >> [*:1][*:2] (drop)
[*:1]-[*:2] >> [*:1]CC[*:2] (add)

Thus, MoCL consists of 230 rules in total to generate molecule variants that share similar properties. All the rules and code are available at https://github.com/illidanlab/MoCL-DK.

Table 4.1: Source and target statistics for the substitution rules. R/R'/R" represent arbitrary carbon-containing groups.
Group | # source | # target | Formula
CA | 1 | 68 | RCOO
Ester | 1 | 7 | RCOOR'
Ketone | 1 | 15 | ROR'
Phenyl | 22 | 36 | Aromatic Rings
Tbutyl | 1 | 10 | C4
dsAmide | 4 | 18 | RONR'R"
msAmide | 2 | 32 | RONR'
nsAmide | 4 | 32 | RON
Total | 36 | 218 | -

Moreover, since the source substructures in the rules are very common, a molecule may contain multiple source substructures, or multiple copies of the same substructure in a rule, so the proposed augmentation can be applied multiple times to generate variants with much more diversity. A notable difference between the proposed augmentation and the general augmentations is that the proposed rules are not guaranteed to be applicable to a molecule after it changes; therefore, when applying the proposed augmentation multiple times, we need to update the rule availability accordingly at each round. We summarize the proposed augmentation procedure in Alg. 4.1.

4.2.2 Global-level Domain Knowledge

Maximizing mutual information between correlated views learns transformation-invariant representations. However, it may neglect the global semantics of the data. For example, some graphs should be closer in the embedding space since they share similar graph structures or semantics according to domain knowledge. For molecular graphs, such information can be derived from multiple sources. For general graph structure, extended connectivity fingerprints (ECFPs) [91] encode the presence of substructures in molecules and are widely used to measure the structural similarity between molecular graphs. Drug-target networks [87] record drug-protein interaction information, which is one of the most informative biological activity measures. In this section, we first define graph similarity for general molecular graphs, and then propose two ways to incorporate the global semantics into our learning framework.

4.2.2.1 Similarity Calculation

Given the ECFPs of two molecules, e1, e2 ∈ {0, 1}^m, where m is the vector length and 1 indicates the presence of certain substructures, the similarity of e1 and e2 can be calculated as the Tanimoto coefficient [9]:

$$
s(e_1, e_2) = \frac{N_{12}}{N_1 + N_2 - N_{12}}, \tag{4.3}
$$

where N1 and N2 denote the number of 1s in e1 and e2, respectively, and N12 denotes the number of 1s in their intersection. The resulting coefficient s(e1, e2) ∈ [0, 1], and a larger value indicates higher structural similarity. Similarly, for a drug-target network, e1, e2 ∈ {0, 1}^m become the interaction profiles of a drug with all proteins, where m is the total number of proteins. The drug similarity can then be calculated in the same way as Eq. (4.3).
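Eq. (4.3) is the Tanimoto (Jaccard) coefficient on bit vectors; a short NumPy sketch (RDKit's DataStructs module offers an equivalent built-in):

```python
import numpy as np

def tanimoto(e1, e2):
    """Tanimoto coefficient of two binary vectors, as in Eq. (4.3)."""
    e1, e2 = np.asarray(e1, bool), np.asarray(e2, bool)
    n12 = np.logical_and(e1, e2).sum()   # bits set in both
    n1, n2 = e1.sum(), e2.sum()          # bits set in each
    return n12 / (n1 + n2 - n12)

# Example with two toy 8-bit "fingerprints".
print(tanimoto([1, 0, 1, 1, 0, 0, 1, 0],
               [1, 1, 1, 0, 0, 0, 1, 0]))   # 0.6
```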
Algorithm 4.1: Pseudocode of domain augmentation.
Input: Molecule graph G, repeat times R, rules T.
Output: Augmented graph G'.
1: for r = 1 to R do
2:   while T is not empty do
3:     sample t ∼ T   # one augmentation rule
4:     {G1, G2, .., Gk} = t(G)   # all possible products
5:     randomly choose one product and set G = Gi
6:     update the available rules T   # some rules may no longer be valid
7:     break
8: Return G' = G;

4.2.2.2 Global-level Objective

We propose two strategies for using the global similarity information. One strategy is to use it as direct supervision. Given the embeddings of two original graphs z_i and z_j, we measure the similarity between them as

$$
\theta(z_i, z_j) = \frac{z_i^{T} z_j}{\|z_i\| \, \|z_j\|}.
$$

We optimize the similarity using a least-squares loss as follows:

$$
\mathcal{L}_i^{global} = \sum_{j,\, j \neq i} \big\| \theta(z_i, z_j) - s_{i,j} \big\|_2^2,
$$

where s_{i,j} is the similarity from Eq. (4.3). The second strategy is to utilize a contrastive objective in which similar graph pairs have higher mutual information compared to the background. The objective is written as:

$$
\mathcal{L}_i^{global} = -\log \frac{\sum_{j=1,\, j \in \mathcal{N}_i}^{n} e^{s(z_i, z_j)/\tau}}{\sum_{j=1,\, j \neq i}^{n} e^{s(z_i, z_j)/\tau}},
$$
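A sketch of the first (direct-supervision) strategy in PyTorch, assuming `z` is a batch of graph embeddings and `S` is the pairwise similarity matrix precomputed from Eq. (4.3); tensor and function names are ours:

```python
import torch
import torch.nn.functional as F

def global_supervision_loss(z, S):
    """Least-squares global loss: match the cosine similarity of embeddings
    to the precomputed domain similarity S (e.g., Tanimoto from Eq. (4.3))."""
    z = F.normalize(z, dim=1)                    # unit-norm embeddings
    theta = z @ z.t()                            # cosine similarities, shape (n, n)
    mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)   # exclude i == j
    return ((theta - S) ** 2)[mask].sum() / len(z)

# Example: n = 4 graphs with 16-dim embeddings and a made-up similarity matrix.
z = torch.randn(4, 16)
S = torch.rand(4, 4); S = (S + S.t()) / 2
loss = global_supervision_loss(z, S)
```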