LEARNING TO DETECT LANGUAGE MARKERS

By

Fengyi Tang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2021

ABSTRACT

LEARNING TO DETECT LANGUAGE MARKERS

By

Fengyi Tang

In the world of medical informatics, biomarkers play a pivotal role in determining the physical state of human beings, distinguishing the pathologic from the clinically normal. In recent years, behavioral markers, due to their availability and low cost, have attracted a lot of attention as a potential supplement to biomarkers. "Language markers" such as spoken words and lexical preference have been shown to be both cost-effective and predictive of complex diseases such as mild cognitive impairment (MCI). However, language markers, although universal, do not possess many of the favorable properties that characterize traditional biomarkers. For example, different people may exhibit similar use of language under certain conversational contexts (non-unique), and a person's lexical preferences may change over time (non-stationary). As a result, it is unclear whether any set of language markers can be measured in a consistent manner. My thesis projects provide solutions to some of the limitations of language markers: (1) We formalize the problem of learning a dialog policy to measure language markers as an optimization problem which we call persona authentication. We provide a learning algorithm for finding such a dialog policy that can generalize to unseen personalities. (2) We apply our dialog policy framework to real-world data for MCI prediction and show that the proposed pipeline improves prediction over supervised learning baselines. (3) To address non-stationarity, we introduce an effective way to do temporally-dependent and non-i.i.d. feature selection through an adversarial learning framework which we call precision sensing. (4) Finally, on the prediction side, we propose a method for improving the sample efficiency of classifiers by retaining privileged information (auxiliary features available only at training time).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 BACKGROUND
  1.1 Biomarkers and Their Limitations
  1.2 Behavioral Markers as Alternatives
  1.3 Limitations of Language Markers
  1.4 Contributions
  1.5 Proposed Research
  1.6 Question Policy
    1.6.1 MCI Proof-of-Concept
    1.6.2 Persona Authentication
  1.7 Classifier
    1.7.1 Precision Sensing
    1.7.2 Multitask LUPI
  BIBLIOGRAPHY
CHAPTER 2 REVIEW OF CURRENT RESEARCH
  2.1 LANGUAGE MARKERS IN MILD COGNITIVE IMPAIRMENT
  2.2 DIALOG SYSTEMS
  2.3 TASK-ORIENTED SYSTEM
  2.4 OPEN-DOMAIN SYSTEMS
    2.4.1 Personalized Response Generation
    2.4.2 MCI Prediction via Utterance Data
    2.4.3 Healthcare Applications of Dialog Systems
  2.5 BEYOND LINGUISTIC FEATURES
    2.5.1 Acoustic Features
    2.5.2 Personality Modeling
    2.5.3 Automatic Evaluation of Persona Models
  2.6 FEATURE SELECTION IN MEDICAL INFORMATICS
    2.6.1 Active Sensing
    2.6.2 Temporal Feature Selection
  2.7 SAMPLE EFFICIENCY AND PRIVILEGED INFORMATION
    2.7.1 Transfer Learning Review
    2.7.2 Knowledge Transfer by Inductive Learning
    2.7.3 Knowledge Transfer by Transductive Learning
    2.7.4 Knowledge Transfer with Distributed Representations
    2.7.5 Knowledge Transfer by Knowledge Distillation
    2.7.6 Knowledge Transfer by LUPI
    2.7.7 LUPI Preliminaries
  BIBLIOGRAPHY
CHAPTER 3 SCALABLE DIAGNOSTIC SCREENING OF MCI USING DIALOG POLICY
  3.1 INTRODUCTION
  3.2 METHODOLOGY
    3.2.1 Dialog Systems
    3.2.2 Overview of Pipeline
    3.2.3 Construction of Turn-Based Dialogue
    3.2.4 Unsupervised Learning for User Simulator
    3.2.5 Reinforcement Learning Components
      3.2.5.1 Environment E
      3.2.5.2 Action a_t ∈ R^d
      3.2.5.3 State s_t ∈ R^C
      3.2.5.4 Reward r ∈ R
      3.2.5.5 State transitions
      3.2.5.6 Deep Q-Networks (DQN)
      3.2.5.7 Policy-masking
    3.2.6 Training the RL-Agent
  3.3 EXPERIMENTS
    3.3.1 Data
    3.3.2 Baselines vs. RL Performance
    3.3.3 Evaluation of User Simulators
    3.3.4 Top-Performing Policies
  3.4 DISCUSSION AND CONCLUSION
  BIBLIOGRAPHY
CHAPTER 4 SAMPLE EFFICIENT LEARNING WITH PRIVILEGED INFORMATION
  4.1 INTRODUCTION
  4.2 METHODOLOGY
    4.2.1 Building the PI vocabulary
    4.2.2 Learning the Joint Representation
    4.2.3 Coupling Decision Functions with Feature Matching
  4.3 ANALYSIS OF SAMPLING EFFICIENCY
    4.3.1 Results from Existing Agnostic Models
    4.3.2 Complexity of Proposed LUPI Method
  4.4 EXPERIMENTS
    4.4.1 Datasets and Setup
    4.4.2 Initial Baselines
    4.4.3 Comparison Against Other Transfer Learning Methods
      4.4.3.1 Multi-task learning
      4.4.3.2 Data Fusion
      4.4.3.3 Distillation
    4.4.4 Performance with Sparse Examples
  4.5 DISCUSSION AND CONCLUSION
  BIBLIOGRAPHY
CHAPTER 5 ADVERSARIAL PRECISION SENSING
  5.1 INTRODUCTION
  5.2 PROBLEM FORMULATION
    5.2.1 Precision Sensing for Time-Series Data
    5.2.2 Budget-Constrained Precision Sensing as Adversarial Optimization
  5.3 PROPOSED METHOD
    5.3.0.1 Feature Sensor
    5.3.0.2 Progressive Learner
    5.3.0.3 Budget-Manager
    5.3.0.4 Inference
  5.4 ANALYSIS OF PROPOSED METHOD
    5.4.1 Existence of Local Nash Equilibria
    5.4.2 Convergence of Algorithm 2
  5.5 EXPERIMENTS
    5.5.1 Synthetic Experiments
      5.5.1.1 Recovery of C using BCPS
      5.5.1.2 Comparison Against Other Sensing Methods
    5.5.2 MIMIC-III Experiments
      5.5.2.1 Baseline Classifiers
      5.5.2.2 Performance on MIMIC-III
      5.5.2.3 Performance on Masked Data
    5.5.3 Sparsity Trade-offs
    5.5.4 Comparison Against SOTA Methods
  5.6 CONCLUSION
  BIBLIOGRAPHY
CHAPTER 6 PERSONA AUTHENTICATION
  6.1 INTRODUCTION
  6.2 WHY MODEL PERSONA?
    6.2.0.1 Applications
    6.2.0.2 Authenticator Example
    6.2.0.3 Challenges
  6.3 PROBLEM SETTING
    6.3.1 Notation
    6.3.2 Persona Identification
    6.3.3 Persona Verification
    6.3.4 Analysis of Persona Authentication Objective
  6.4 PROOF OF CONVERGENCE OF THM. 1
  6.5 METHODOLOGY
    6.5.1 Action Space as Control Codes
    6.5.2 Explanation of Prefix Codes
    6.5.3 Conditional Decoding
      6.5.3.1 Active Learning
    6.5.4 Learning the Question Policy
    6.5.5 Model and Training Details
  6.6 DATA
    6.6.0.1 Pretraining
    6.6.0.2 Fine-tuning on PersonaChat
    6.6.0.3 Self-Play
  6.7 EXPERIMENTS
    6.7.1 Conditional Decoding Evaluation
    6.7.2 Persona Identifier Evaluation
    6.7.3 Evaluation of Authentication Policies
    6.7.4 Ablation Study
  6.8 GENERATED CONVERSATIONS
    6.8.0.1 Human-Agent Interactions
    6.8.0.2 PersonaGPT Self-Play
    6.8.0.3 Consent
  6.9 SOCIAL IMPACT
  6.10 CONCLUSION AND DISCUSSION OF LIMITATIONS
  BIBLIOGRAPHY
LIST OF TABLES

Table 2.1: Summary Table of Transfer Learning Approaches
Table 3.1: Examples of questions from conversations
Table 3.2: Performance of baseline vs. RL on MCI prediction on 10 stratified shuffle splits
Table 3.3: Prediction @5, 10, 20, 30 and 35 Turns
Table 3.4: Most frequent questions in π* @5, 10, 15 and 20
Table 3.5: Table of top 10 ranked questions in final 15 turns of conversations
Table 4.1: Summary of datasets used in this study
Table 4.2: Comparison of performance across baseline models
Table 4.3: Comparison of performance across transfer learning models
Table 4.4: Performance of various models for 30 ICD-9 codes appearing less than 100 times in the dataset
Table 5.1: BCPS recovery performance vs. sparsity of sensing budget on synthetic data
Table 5.2: Comparison of predictive power of various sensing methods on synthetic data, under 80% sparsity in C
Table 5.3: Baseline performance of inference models conditioned on the original data X
Table 5.4: Performance of inference models conditioned on the masked data X ⊙ A
Table 5.5: Comparison of precision sensing against SOTA active sensing method with sparsity constraint of 50–60%
Table 6.1: Persona model responses can differ greatly depending on input questions
Table 6.2: An example of interaction between the learned authenticator policy and PersonaGPT over an 8-turn conversation
Table 6.3: The action space of the verifier featuring 11 total actions (turn-level goals)
Table 6.4: Automatic evaluation of PersonaGPT against existing SOTA persona models
Table 6.5: Human Evaluation of PersonaGPT and DialoGPT
Table 6.6: Performance of various identifier models on observed dialog trajectories from PersonaChat
Table 6.7: Comparison of verification policies on various input persona models
Table 6.8: Authenticator performance against variations of the input persona model
Table 6.9: An interactive example of human-DialoGPT conversation over 8 turns
Table 6.10: An interactive example of human-PersonaGPT conversation over 8 turns (prior knowledge of persona given, conditional decoding evaluation)
Table 6.11: An interactive example of human-PersonaGPT conversation over 8 turns (persona information not known a priori, user asked to guess)
Table 6.12: An example of self-play between two randomized PersonaGPT models

LIST OF FIGURES

Figure 1.1: Proposed learning to interview pipeline.
Figure 1.2: Preliminary dialog agent trained by deep Q-learning (DQN) [18] to interview patients for MCI screening.
Figure 1.3: Persona Authentication pipeline.
Figure 1.4: Overview of co-training scheme between the feature sensor G and classifier F.
Figure 3.1: Overview of the proposed methodology.
Figure 3.2: Illustration of reinforcement learning components in our proposed approach.
Figure 3.3: RL-Agent vs. Baseline w/ Variation on Turns.
Figure 3.4: Distribution of mean squared error (MSE) across all user simulators.
Figure 5.1: Overview of problem setting.
Figure 5.2: Recovery of feature relevance (C) by BCPS on synthetic data.
Figure 5.3: Heatmap comparisons of sensed features under a fixed C. Feature number 0-9 (x-axis) vs. Timesteps 0-96 (y-axis).
Figure 5.4: Trade-off between Budget Constraint (x-axis) vs. predictive power (y-axis). Budget denotes the strength of budgetary constraint β.
Figure 5.5: Comparison of AUC-gain against SOTA models.
Figure 6.1: Overview of authentication pipeline.
Figure 6.2: Experiment 1 (conditional decoding evaluation) instructions.
Figure 6.3: Experiment 1 post-conversational survey.
Figure 6.4: Experiment 2 (human policy evaluation) instructions.
Figure 6.5: Experiment 2 post-conversational survey.

LIST OF ALGORITHMS

Algorithm 1 RL-Training Protocol
Algorithm 2 Co-training of FS and PL
Algorithm 3 Active Learning with PersonaGPT
Algorithm 4 Verifier Training

CHAPTER 1 BACKGROUND

1.1 Biomarkers and Their Limitations

In the world of medical informatics, biomarkers [1] play a pivotal role in determining the physical state of human beings, distinguishing the pathophysiologic from the clinically normal. In most chronic diseases such as congestive heart failure and diabetes mellitus, there exist biomarkers such as brain natriuretic peptide and HbA1c that sufficiently capture the presence and the stage of disease progression in almost all clinical populations [2, 3]. However, a key limitation of biomarkers is that they are very expensive to obtain. For example, the average cost of a basic metabolic panel in the inpatient setting ranges from $300 to $10,000 per patient per test [4]. Note that this is the cost of simply getting a baseline reading of metabolic enzymes; complete metabolic panels and blood tests can often multiply the cost. In general, the higher the granularity of the biomarker, the higher the cost. A computerized tomography (CT) scan can reveal numerous respiratory pathologies but can range between $1,450 and $7,800 per person depending on availability. As a result, many of the diagnostic decisions in medicine rest on the physician's judgement regarding cost-vs.-effectiveness: which biomarkers should be obtained for which patients?
From the data mining perspective, the decision of "which biomarker to select for which patient" corresponds to a feature selection problem in which each data sample (patient state information) consists of only a sparse set of observed features due to cost constraints. However, this feature selection problem is often treated in an ad-hoc manner in practice: a mixture of insurance policies, practitioner guidelines and human judgement dictates the manner in which we engage in feature selection in the clinic. Naturally, two questions deserve our attention:

(1) Are there cheaper (but effective) alternatives to biomarkers?

(2) Are there more effective ways to acquire clinical features?

1.2 Behavioral Markers as Alternatives

In recent years, behavioral markers have attracted a lot of attention as a potential supplement to biomarkers. For example, in dementia progression, there exists a disease state clinically defined as mild cognitive impairment (MCI) that precedes noticeable memory loss and severe cognitive decline [5]. In traditional Alzheimer's disease (AD), brain imaging is done to confirm diagnosis, but intervention is futile at this stage because irreversible damage would have already occurred [6]. As a result, much research effort has been channeled toward early detection of MCI, when irreversible changes have yet to occur. However, there is a tradeoff for using biomarkers in MCI detection: while MCI can be detected by classic biomarker approaches such as brain imaging and cerebrospinal fluid markers [7, 8, 9], the presence of such biomarkers corresponds to irreversible changes having already occurred in the patient. Thus, other measures of cognitive function may be needed for early detection of MCI. Recently, there have been several works showing that behavioral markers [10, 11, 12] can provide useful diagnostic signals for early-stage MCI. For example, gait patterns have been shown to change in early MCI [13], enabling programs for at-home monitoring of behavioral changes [14]. However, in terms of cost of data acquisition, "language markers" such as spoken words [15] and linguistic features [11] have been shown to be both low-cost and predictive of early MCI states. Due to the ubiquity of language, the notion of developing high-performing language markers presents a promising direction for alternative diagnostic markers.

1.3 Limitations of Language Markers

However, language markers, although universal, do not possess many of the favorable properties that characterize biomarkers. For example, there are several key properties that allow a biomarker to generalize across populations [16]:

• Universality: the feature of interest can be measured in everyone in the disease group.
• Uniqueness: the feature of interest should differ between disease group and normal.
• Measurability: the feature can be measured consistently across everyone.
• Permanence: the feature remains invariant to time and algorithms used to measure it.

In the case of language markers, only universality is satisfied: one can potentially obtain linguistic and speech features from text and audio recordings for everyone, barring disabilities. However, language markers are certainly not unique: different people may exhibit similar use of language under certain contexts. Language markers may also be transient. For example, a person's word choices change throughout their lifetime.
Depending on the environmental context (e.g., time of day, the presence of other people), one's choice of words and speech tone may differ greatly. As a result, it is unclear whether any set of linguistic features can be measured in a consistent manner.

In spite of these limitations, however, language markers do provide ample signal for each of these classic biometric categories. Even though speech is not unique, higher-level features composed of language preferences – e.g., personality traits, use of idioms, pausing patterns – are regularly used to distinguish one group of people from another. Beyond individual identification, such traits are regularly used for market segmentation and sentiment reports. Although language, and perhaps even the resulting higher-level features, change over time, they often follow statistical patterns of change. For example, it is well known that human infants follow very stable patterns of language acquisition: babbling starts at 6 months, first words start at 9 months to 1 year, vocabulary expands (250-900 words) by age 2, and children can recite numbers and addresses by age 5. It is perhaps reasonable to hypothesize that linguistic decline may follow predictable patterns in the elderly.

Finally, the issue of measurability poses a difficult challenge: how can we consistently measure a set of language-related features in a person when language use itself is highly context dependent? As a toy example, consider psychometric and cognitive tests, both of which rely heavily on questionnaires and surveys. These would not work well for surveying language markers, as certain features such as word choice, sentence structures and topics of interest are only observable in unstructured, open-domain conversations. However, a completely open-domain conversation setting presents another layer of problems. The performance of automatic extraction of speech markers depends on restricting the spoken words used to generate speech [17], as variations in lexicon and sentence structure can introduce noise in the extracted acoustic features. Thus, there seem to be numerous points of potential conflict among the vast number of potential features that comprise language markers. Additionally, it is unclear what questions would elicit which set of linguistic or acoustic features. In order to tackle the problem of measurability, we have to formally define the relationship between language markers of interest and the question contexts used to elicit them.

1.4 Contributions

The goal of this thesis project is to provide solutions to the limitations of language markers. The main contributions are summarized as follows:

• We formalize the problem of learning a question policy to measure linguistic features as persona authentication. We provide an algorithm for finding such a question policy that can generalize to unseen personalities.
• We apply the question policy to real-world data for MCI prediction and show that the proposed pipeline improves prediction over supervised learning baselines.
• To address the challenges of permanence (i.e., temporally varying disease markers), we introduce an effective way to do temporally-dependent and non-i.i.d. feature selection in the clinical setting.
Specifically, we propose "precision sensing", an adversarial learning framework that exploits the relationship between adversarial sample generation and classifier performance to produce feature selection strategies that consider the cross-sample and temporal dependencies which make training examples non-i.i.d.
• Finally, on the prediction side, we propose a method for improving the sample efficiency of classifiers. Specifically, we propose an algorithm that leverages privileged information (linguistic features available only at training time) to supplement the lack of training labels.

1.5 Proposed Research

Figure 1.1 provides an overview of the main components of our proposed pipeline. As noted before, the goal is to learn a question policy to interact with users to obtain language markers. We summarize the key components below and discuss them in separate sections:

• State Representation: the actual representation of linguistic markers as a feature vector for downstream tasks such as classification, feature selection, and question generation.
• Classifier: the prediction model trained by supervised learning. Labels of the prediction task are provided for each sample, e.g., MCI = 1 or normal = 0.
• Question Policy: a second model used to acquire new training samples from the user; in other words, its role is to do data acquisition. The output of the question policy can be discrete or continuous variables, depending on the decoder used for NLG.
• Decoder: the NLG unit used to decode policy actions into human-understandable language. The decoder is an open-domain conversational model. The goal is to do conditional decoding based on both conversational history and policy actions.

Figure 1.1: Proposed learning to interview pipeline.

1.6 Question Policy

1.6.1 MCI Proof-of-Concept

Our first work introduces a preliminary pipeline for learning a question policy to ascertain linguistic markers for the purposes of MCI diagnosis. We use this as a proof-of-concept and illustrate that our "learning to interview" approach can indeed detect predictive signals that can be potentially more informative than unstructured conversations with human interviewers.

Figure 1.2: Preliminary dialog agent trained by deep Q-learning (DQN) [18] to interview patients for MCI screening.

Figure 1.2 provides a summary figure for the first work. Here, the prediction task is binary classification of patient MCI status. The action space of the question policy is a discrete set of questions. We simplify the problem setting such that NLG issues such as dialog consistency, felicity of language generation, and multi-turn co-reference reasoning are not considered; the question policy simply outputs a deterministic question (out of a pool of 107 possible questions) and obtains a response generated by the skip-thought model [19] (the user simulator in Figure 1.2). We treat the skip-thought embedding as the state input and learn the question policy using straightforward DQN over 107 actions. The reward function is hand-crafted: the agent receives a small negative reward per turn to penalize lengthy conversations and a large positive reward at the end of a conversation if the skip-thought features of the dialog history result in a correct prediction by the MCI classifier. We compare this proposed framework against classifier performance on supervised learning data from the original corpus. In addition to prediction accuracy, we also compare the length of conversations needed (conversational efficiency) to obtain a threshold level of performance.
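To make the reward shaping and the learning step concrete, the sketch below shows a per-turn reward and a single Q-network update in the spirit of the description above. The step penalty and terminal bonus values, the 4800-dimensional skip-thought state size, and the network shapes are illustrative assumptions, not the exact Chapter 3 settings.

```python
# Sketch of the hand-crafted reward and one DQN update for the interview
# agent. Penalty/bonus values, state dimension, and network sizes are
# illustrative placeholders rather than the thesis configuration.
import copy
import torch
import torch.nn as nn

N_QUESTIONS = 107        # one discrete action per candidate question
STATE_DIM = 4800         # skip-thought embedding of the dialog history
STEP_PENALTY = -0.05     # small negative reward per turn
SUCCESS_BONUS = 1.0      # terminal reward if the MCI classifier is correct

q_net = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                      nn.Linear(256, N_QUESTIONS))
target_net = copy.deepcopy(q_net)          # periodically synced frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def reward(done: bool, classifier_correct: bool) -> float:
    """Per-turn penalty, plus a terminal bonus for a correct MCI prediction."""
    return STEP_PENALTY + (SUCCESS_BONUS if done and classifier_correct else 0.0)

def dqn_update(s, a, r, s_next, done, gamma=0.99):
    """One temporal-difference step on a batch of transitions (tensors)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```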
1.6.2 Persona Authentication

In our second work, we formalize the problem of learning a question policy as persona authentication. Specifically, we break persona authentication down into two parts: persona identification, which is inferring a set of persona features from a given dialog trajectory, and persona verification, the problem of finding a second conversational model – we call it a question policy – to elicit dialog trajectories for persona identification.

Problem 1 (Persona Verification). Given a space of persona information P, persona verification is the optimization objective:

$$\min_{\theta} \; \mathbb{E}_{P \sim \mathcal{P}} \big[ \mathcal{L}(\tau_\theta, P) \big] \tag{1.1}$$

where L(·, ·) is the authentication loss:

$$\mathcal{L}(\tau_\theta, P) = \max\{0,\; C + d(\tau_\theta, P^{+}) - d(\tau_\theta, P^{-})\} - \log p(\tau_\theta). \tag{1.2}$$

P+ denotes persona facts that co-occur with trajectory τ, P− the opposite. C specifies the desired margin of separation, and τ_θ is the dialog trajectory generated by the question policy (θ).

In the persona authentication chapter, we illustrate some promising theoretical guarantees of our proposed optimization problem. For example, we show that estimators trained under the authentication loss shown in Eqn. (1.2) maximize the mutual information between the conversational trajectory τ generated by the policy and the persona information of the input agent (i.e., the user). We then present a refined version of the original pipeline that is capable of handling open-domain conversations, all the while directing the conversation toward the end goal of classification. The overview is shown in Figure 1.3, which outlines the key components: the verifier corresponds to the question policy of interest, the identifier corresponds to the state representation, and PersonaGPT corresponds to the decoder. Details of conditional decoding, active learning of verifier action codes, and policy learning are given in the persona authentication chapter. Empirically, we illustrate that the learned policy outperforms human evaluators as well as unstructured conversation policies in identifying personality traits through conversation.

Figure 1.3: Persona Authentication pipeline.

Our question policy addresses the problem of measurability of language markers. We cannot elicit the same set of linguistic features with the same set of questions, but we can obtain a question policy that maximizes the mutual information between generated dialog responses and the user's salient characteristics.
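A minimal sketch of the authentication loss in Eqn. (1.2) is given below, assuming the trajectory and persona facts have already been embedded by some encoder. The Euclidean distance, the margin value, and the function names are illustrative choices, not the exact Chapter 6 implementation.

```python
# Sketch of Eqn. (1.2): a margin (triplet-style) term separating the policy's
# trajectory from positive vs. negative persona facts, plus a fluency term
# -log p(tau). Embeddings and the distance choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def authentication_loss(tau_emb, p_pos_emb, p_neg_emb, log_p_tau, margin=0.5):
    """tau_emb:   embedding of the dialog trajectory generated by the policy
       p_pos_emb: embedding of persona facts that co-occur with the trajectory
       p_neg_emb: embedding of persona facts drawn from other personas
       log_p_tau: log-likelihood of the trajectory under the decoder."""
    d_pos = F.pairwise_distance(tau_emb, p_pos_emb)   # d(tau, P+)
    d_neg = F.pairwise_distance(tau_emb, p_neg_emb)   # d(tau, P-)
    hinge = torch.clamp(margin + d_pos - d_neg, min=0.0)
    return (hinge - log_p_tau).mean()
```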
1.7 Classifier

1.7.1 Precision Sensing

On the classifier front, we present two works that address various technical issues in medical prediction tasks. First, language markers, like many other disease markers in medicine, suffer from the issue of permanence: some disease markers fluctuate through time and require repeated measurements to gain insight into temporal trends. While we cannot change the time-dependence of certain disease markers, we can deal with the permanence issue by learning to detect stable temporal patterns from data. Our third work introduces the precision sensing framework, which addresses this issue by leveraging the interaction between adversarial examples [20] and classifier decision boundaries.

Definition 1 (Precision Sensing). The problem of precision sensing seeks a sensor tensor A* that minimizes the empirical risk (ERM) [21] according to:

$$A^* \in \operatorname*{argmin}_{A_i \in \mathcal{A}} \; \mathbb{E}_{P(X,Y)} \big[ L(h(A_i \odot X_i), y_i) \big] \;\approx\; \operatorname*{argmin}_{A_i \in \mathcal{A}} \; \frac{1}{m} \sum_{(X_i, y_i) \in \mathcal{D}} L(y_i, h(A_i \odot X_i)),$$

where h is a hypothesis class that maps h : X → Y, and L(·) is a risk function that evaluates the hypothesis mappings against the actual label. Here, A denotes the space of sensing matrices, and each matrix A_i can be thought of as a per-sample sensing matrix.

From this perspective, A_i describes the subset of sensed features across time for each sample. Unlike regular feature selection, which chooses the same subset of features for every sample, precision sensing builds the sensing tensor A incrementally: at each time step t, the sensing matrix A^(t) ∈ {0, 1}^{m×d} is computed across m samples for all d features. To learn A, we propose the following loss to simultaneously learn the hypothesis function and the sensing tensor:

$$\min_{\theta_F} \max_{\theta_G} \; \underbrace{\big\{ F(X - X \odot G(X))_{\neg y} - F(X - X \odot G(X))_{y} \big\}_{+}}_{\text{Feature Sensing}} + \underbrace{CE_y\big(F(X \odot G(X))\big)}_{\text{Classifier Reconstruction}} + \underbrace{\beta \lVert X - X \odot G(X) \rVert_1}_{\text{Budget Management}}. \tag{1.3}$$

In Eqn. (1.3), F represents the classifier and G is a recurrent neural network (RNN) that outputs the elements of A^(t)_{i,j} ∈ {0, 1} for each sample i and each feature j at time t. We call this RNN the feature sensor; it takes into account not only the temporal dependence of features within a sample but also the performance of F across samples. On the other hand, the classifier F learns to adapt its decision boundary based on the cross-entropy loss CE computed on the decision function mapping from the sensed features X ⊙ G(X) rather than the full set of features X. Since the performance of G is coupled to F, we provide a co-training scheme for both models in the Classifier chapter of the thesis. Figures 1.4a-1.4b give an overview of the proposed co-training scheme. Fig. 1.4a describes incremental generation of the sensing tensor A by G. Fig. 1.4b describes the minimax game between F and G: F (top) tries to minimize the classification error with the sensed features X ⊙ A, while G (bottom) tries to allocate A such that the unsensed features lead to misclassification. We prove that iterative gradient descent on Eqn. (1.3) converges to local Nash equilibrium points, which are guaranteed to exist. In experiments, we show that precision sensing outperforms SOTA baselines from active sensing on a benchmark mortality prediction task.

Figure 1.4: Overview of co-training scheme between the feature sensor G and classifier F. (a) Incremental generation of the sensing tensor. (b) Minimax game between F and G.
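The following sketch illustrates one alternating round of the Eqn. (1.3) game with placeholder sensor and classifier networks; it is a simplified rendering of the co-training scheme (Algorithm 2), not the exact procedure. The sensor ascends on the misclassification margin of the unsensed features plus the budget term, while the classifier descends on the cross-entropy over the sensed features only.

```python
# Simplified co-training step for Eqn. (1.3). F_net and G_net are placeholder
# modules (G_net outputs a soft sensing mask shaped like X); the CE term is
# applied only in the classifier step for brevity.
import torch
import torch.nn.functional as TF

def cotrain_step(F_net, G_net, opt_F, opt_G, X, y, beta=0.1):
    # Sensor step: allocate the mask so that the UNSENSED features mislead F,
    # while the budget term rewards leaving features unsensed.
    A = G_net(X)                                   # soft sensing mask in [0, 1]
    unsensed = X - X * A
    logits = F_net(unsensed)
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    wrong = logits.scatter(1, y.unsqueeze(1), float("-inf")).max(dim=1).values
    sensing = torch.clamp(wrong - true, min=0.0).mean()
    budget = beta * unsensed.abs().mean()
    opt_G.zero_grad()
    (-(sensing + budget)).backward()               # gradient ASCENT for G
    opt_G.step()

    # Classifier step: fit the decision boundary on the SENSED features only.
    A = G_net(X).detach()
    f_loss = TF.cross_entropy(F_net(X * A), y)
    opt_F.zero_grad()
    f_loss.backward()
    opt_F.step()
    return sensing.item(), f_loss.item()
```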
1.7.2 Multitask LUPI

In addition to feature selection, the sample complexity of learning is especially important in medical prediction tasks. For example, in electronic health records (EHR), we often find that the distribution of labels is highly skewed: there are numerous diseases with very few labels but an abundance of text descriptions from the literature. We propose a method for improving the sample efficiency of these learning tasks by leveraging clinical linguistic markers as privileged information. Specifically, we consider physician notes in the form of discharge summaries, linking standard medical terminologies (i.e., UMLS codes [22]) with diagnostic findings in the EHR. UMLS codes are a set of standardized medical concepts used by clinicians to describe physical findings of diseases and are used widely in both the EHR and medical research [22]. [23] alluded to the idea that medical datasets also contain vast amounts of privileged information in the physician notes, which serves to explain the qualities of diseases and can greatly aid decision rules.

For experiments, we consider the following set of data for example features, PI, and labels:

• Example Features X: continuous time-series data (e.g., lab values, blood tests, imaging) and discrete static variables (e.g., demographic information) that describe a patient.
• Privileged Information X*: physician notes containing descriptions in natural language and medical terms (UMLS concepts [22]) that summarize a particular visit for a patient.
• Target Task Y: prediction tasks of interest, such as mortality (binary classification), disease prediction (multi-task and transfer learning), etc.

Formally, we consider a similar problem setting where we are given D_train = {(x_i, x_i^*, y_i)}_{i=1}^m. We assume that there exists a vocabulary of size d that defines the privileged information (PI) space: x^*_{ij} = 1 if the j-th term in the PI vocabulary is contained in sample x_i^*, and x^*_{ij} = 0 otherwise. We also denote x_i^* = {w_1, ..., w_k} as a decomposition of x_i^* into k individual components such that each w_j ∈ {0, 1}^d is a one-hot vector corresponding to a non-zero component of x_i^*, and x_i^* = w_1 + w_2 + ... + w_k. In the following sections, we denote {w_j}_{j=1}^d as the set of "words" that compose the PI vocabulary. Thus, x_i^* gives the co-occurrence label of each word w_j ∈ {w_j}_{j=1}^d with respect to the sample x_i. We make no assumptions on the example features (X) with regard to data type. In practice, the example features correspond to temporal features, each corresponding to a biomarker (e.g., lab test value) measured at various time intervals. Finally, we define the multi-task learning objective as learning y_i ∈ {0, 1}^C, a set of C binary classification tasks.

Our approach to retaining privileged information in a multitask setting is to align PI features (linguistic descriptions) with temporal features (biomarkers), as sketched below:

1. Build a dictionary of PI features and learn a distributed representation [24] over the PI vocabulary.
2. Find a joint representation space (Φ) between the PI and example features.
3. Jointly learn the decision functions h*: Φ → Y by feature-matching in the joint representation space.

The first process uses unsupervised learning to embed the PI vocabulary into a vector space. The second process allows some of the privileged information to be retained at inference time, despite not having direct access to the PI vectors. The third process allows PI information for one task to be transferred to the other C−1 tasks in the label space. In the Classifier chapter, we examine how to achieve (1)-(3) in detail. We also provide an analysis of how (3) can maintain the favorable LUPI sample efficiency. We validate our multitask LUPI framework on multilabel classification tasks using the MIMIC-III database [25] and compare with several prevalent transfer learning schemes.
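As a rough illustration of steps (1)-(3), the sketch below embeds a toy PI vocabulary, maps example features into the same joint space Φ, and adds a feature-matching penalty to the multi-task loss. The dimensions, encoders, and loss weighting are placeholders rather than the Chapter 4 configuration. At test time only the example-feature branch is used, which is how part of the privileged information is retained without access to the notes.

```python
# Sketch of the multitask LUPI alignment idea: (1) embed the PI vocabulary,
# (2) map example features into the joint space Phi, (3) train the decision
# functions with a feature-matching penalty. All sizes are toy placeholders.
import torch
import torch.nn as nn

d_vocab, d_emb, d_x, C = 500, 64, 128, 10   # PI vocab size, embed dim, feature dim, tasks

pi_embed = nn.Embedding(d_vocab, d_emb)      # step (1): distributed representation of PI words
x_encoder = nn.Sequential(nn.Linear(d_x, d_emb), nn.Tanh())  # step (2): joint space Phi
heads = nn.Linear(d_emb, C)                  # step (3): C binary decision functions
bce = nn.BCEWithLogitsLoss()

def lupi_loss(x, x_star, y, alpha=0.5):
    """x: [B, d_x] example features; x_star: [B, d_vocab] PI bag-of-words
       (available at training time only); y: [B, C] multi-task binary labels."""
    phi_x = x_encoder(x)                                  # example features in Phi
    counts = x_star.sum(dim=1, keepdim=True).clamp(min=1.0)
    phi_pi = (x_star @ pi_embed.weight) / counts          # mean embedding of present PI words
    match = torch.norm(phi_x - phi_pi, dim=1).mean()      # feature-matching in Phi
    task = bce(heads(phi_x), y.float())
    return task + alpha * match
```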
BIBLIOGRAPHY

[1] Kyle Strimbu and Jorge A Tavel. What are biomarkers? Current Opinion in HIV and AIDS, 5(6):463, 2010.
[2] VO Puntmann. How-to guide on biomarkers: biomarker definitions, validation and applications with examples from cardiovascular disease. Postgraduate Medical Journal, 85(1008):538–545, 2009.
[3] Jeffrey K Aronson and Robin E Ferner. Biomarkers—a general review. Current Protocols in Pharmacology, 76(1):9–23, 2017.
[4] Renee Y Hsia, Yaa Akosa Antwi, and Julia P Nath. Variation in charges for 10 common blood tests in California hospitals: a cross-sectional analysis. BMJ Open, 4(8), 2014.
[5] Serge Gauthier, Barry Reisberg, Michael Zaudig, Ronald C Petersen, Karen Ritchie, Karl Broich, Sylvie Belleville, Henry Brodaty, David Bennett, Howard Chertkow, et al. Mild cognitive impairment. The Lancet, 367(9518):1262–1270, 2006.
[6] Jeffrey L Cummings, Rachelle Doody, and Christopher Clark. Disease-modifying therapies for Alzheimer disease: challenges to early intervention. Neurology, 69(16):1622–1634, 2007.
[7] Keith A Johnson, Nick C Fox, Reisa A Sperling, and William E Klunk. Brain imaging in Alzheimer disease. Cold Spring Harbor Perspectives in Medicine, 2(4):a006213, 2012.
[8] D Heister, James B Brewer, Sebastian Magda, Kaj Blennow, Linda K McEvoy, Alzheimer's Disease Neuroimaging Initiative, et al. Predicting MCI outcome with clinically available MRI and CSF biomarkers. Neurology, 77(17):1619–1628, 2011.
[9] Liang Zhan, Yashu Liu, Yalin Wang, Jiayu Zhou, Neda Jahanshad, Jieping Ye, and Paul Matthew Thompson. Boosting brain connectome classification accuracy in Alzheimer's disease using higher-order singular value decomposition. Frontiers in Neuroscience, 9:257, 2015.
[10] Clifford R Jack Jr, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. Hypothetical model of dynamic biomarkers of the Alzheimer's pathological cascade. The Lancet Neurology, 9(1):119–128, 2010.
[11] Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422, 2016.
[12] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017.
[13] Juan Manuel Fernandez Montenegro and Vasileios Argyriou. Cognitive evaluation for the diagnosis of Alzheimer's disease based on Turing test and virtual environments. Physiology & Behavior, 173:42–51, 2017.
[14] Hiroko H Dodge, Jian Zhu, Nora C Mattek, Molly Bowman, Oscar Ybarra, Katherine V Wild, David A Loewenstein, and Jeffrey A Kaye. Web-enabled conversational interactions as a method to improve cognitive functions: results of a 6-week randomized controlled trial. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 1(1):1–12, 2015.
[15] Tuka Alhanai, Rhoda Au, and James Glass. Spoken language biomarkers for detecting cognitive impairment. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 409–416. IEEE, 2017.
[16] Anil K Jain, Ruud Bolle, and Sharath Pankanti. Biometrics: Personal Identification in Networked Society, volume 479. Springer Science & Business Media, 2006.
[17] Brian Roark, Margaret Mitchell, John-Paul Hosom, Kristy Hollingshead, and Jeffrey Kaye. Spoken language derived measures for detecting mild cognitive impairment. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2081–2090, 2011.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[19] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[20] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[22] Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270, 2004.
[23] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[25] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

CHAPTER 2 REVIEW OF CURRENT RESEARCH

2.1 Language Markers in Mild Cognitive Impairment

Fraser et al. [1] introduced "linguistic markers" such as word count, use frequencies of certain parts of speech, and sentiments as predictors of dementia in AD patients. Beyond summary statistics, Asgari et al. [2] introduced word-level language markers that utilize expert-provided linguistic features such as socioeconomic context, presence of biological processes, and psychological factors. These linguistic dimensions are standardized as Linguistic Inquiry and Word Count (LIWC) [3] features for over 5,000 English words. Beyond expert annotations, statistical properties of words have been examined by a wide range of machine learning approaches such as topic modeling [4], neural word2vec [5] and contextualized word embeddings [6]. More recently, sentence-level representations such as Skip-Thought vectors [7] have shown capacity in representing phrase- and utterance-level linguistic features such as semantic similarity, phrase relationships and user intentions. Currently, the state-of-the-art (SOTA) representation of linguistic features resides in the recently introduced transformer language models (LMs) [8]. For example, BERT encoders [9] have been shown to produce state-of-the-art results in language comprehension tasks – otherwise known as natural language understanding (NLU). On the natural language generation (NLG) end, the GPT-2 decoder [10] has been shown to generate text that is difficult to distinguish from human text.
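As a concrete (though simplified) illustration of such sentence-level representations, the snippet below mean-pools the hidden states of a pretrained transformer encoder into one vector per utterance. The specific checkpoint and pooling scheme are illustrative choices, not the representations used later in the thesis.

```python
# Sketch of turning raw utterances into fixed-length "language marker"
# vectors with a pretrained transformer encoder. Checkpoint and pooling
# are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def utterance_features(utterances):
    """Encode a list of utterances into one embedding per utterance."""
    batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state         # [B, T, 768]
    mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean-pooled features

feats = utterance_features(["i forgot where i parked again",
                            "we talked about the weather today"])
```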
2.2 Dialog Systems

Central to our characterization of question contexts and language markers is the idea of a dialog system: a conversational agent (a machine learning model) that generates text (prompts/questions) in response to user input based on feedback signals from the user (responses). There are two main types of dialog systems: task-oriented and open-domain. They can synonymously be thought of as structured (task-oriented) and unstructured (open-domain) conversation. However, our problem is somewhere in the middle: our desired question policy should be able to adapt to a variety of conversation types (open-domain), yet still serve the goal of MCI diagnosis (task-oriented). We introduce the crux of task-oriented vs. open-domain conversation before formalizing the semi-structured setting mathematically in subsequent chapters.

2.3 Task-Oriented System

Task-oriented dialogue systems are typically designed for retrieval tasks in which users provide queries and the chat-bot provides appropriate responses based on an external knowledge base [11, 12, 13], or identifies correct answers by searching through vast amounts of documents [14, 15]. Such dialogue systems are typically designed as a pipeline containing a set of components, including a language understanding unit that parses intention and semantics from human input, a dialogue manager that handles dialogue state tracking and policy learning, and a language generation unit that generates responses [11, 16, 17]. Similar to task-oriented dialog, our problem setting is goal-oriented: we want to find a dialog policy that can lead to the correct classification of diseases. However, our problem setting differs from task-oriented dialog in that (1) the question generation does not have fixed frames, and (2) the user responses are certainly not constrained to specific tasks (e.g., hotel booking, movie recommendations). In this regard, the actual NLG part of the task is an open-domain dialog problem.

2.4 Open-Domain Systems

Open-domain systems refer to dyadic conversations without any constraints on the dialog structure between the conversational agents. Open-domain dialogue generation has been formulated in two ways: (1) as a statistical machine translation (SMT) [18, 19] task in which the goal is to output a sequence of tokens y^(t)_{1:K} in response to a sequence of tokens x^(t)_{1:K}, and (2) as a ranking problem where a response Y_t is selected among a list of candidate responses that best "matches" the query X_t [11]. Both approaches optimize the maximum likelihood (MLE) objective:

$$\underset{Y_t}{\text{maximize}} \;\; \log p(Y_t \mid X_t).$$

The main difference between the two approaches is that SMT is generative in nature, because the response is generated token-by-token, while the ranking approach is discriminative, as it requires a pool of candidates. But neither takes into account the history of the conversation, and as a result both fail to capture dialogue transitions. More recently, generative models such as HRED, VHRED and CVAE [20, 21, 22] try to capture dialogue history to provide more contextualized decoding. These approaches apply MLE learning to the objective:

$$\underset{Y_t}{\text{maximize}} \;\; \log p(Y_t \mid X_t, \tau_{1:t-1}). \tag{2.1}$$

Although this new objective encourages more response diversity, the dialogue agent still does not have any mechanism for encouraging a consistent personality across multiple turns [23].

2.4.1 Personalized Response Generation

An emerging direction in open-domain dialogue systems is the rise of persona models, which focus on generating responses that preserve a consistent set of personalities over the course of a dialogue. For example, Li et al. [19] learn a speaker vector on which to condition the generation process. In the PersonaChat dataset [23], each dialogue speaker is equipped with text descriptions that characterize their persona. The standard objective of persona models can be expressed as:

$$\underset{Y_t}{\text{maximize}} \;\; \log p(Y_t \mid X_t, \tau_{1:t-1}, P_Y) \tag{2.2}$$

where P_Y is the set of persona descriptions for the dialogue agent. Zhang et al. [23] and the ConvAI2 challenges [24] provided numerous approaches for incorporating the persona information into the dialogue generation process.
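A minimal sketch of this kind of persona-conditioned decoding is shown below: persona facts and dialogue history are concatenated into the conditioning context of a causal language model, and the response is sampled from it. The GPT-2 checkpoint and plain-text separators are placeholder choices; PersonaGPT in Chapter 6 instead uses dedicated special tokens and fine-tuned weights.

```python
# Sketch of decoding in the spirit of Eqn. (2.2): the response is sampled
# from a causal LM conditioned on persona facts and dialogue history.
# Model checkpoint and text separators are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

persona = ["i enjoy gardening.", "i have two grandchildren."]
history = ["hello! what did you do today?"]

context = " ".join(persona) + " " + " ".join(history)
inputs = tokenizer(context, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                        top_p=0.92, pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(response)
```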
Our problem setting is similar to open-domain dialog in that the user (patient) is allowed to generate arbitrary responses. Thus, our dialog agent must be able to handle open-domain conversation and exhibit common-sense reasoning, multi-turn consistency and response diversity – challenges that are featured in open-domain dialog.

2.4.2 MCI Prediction via Utterance Data

Asgari et al. [2] used a classical supervised learning framework to formulate MCI prediction as a binary classification problem. For each interview, a corpus was constructed using only the participant responses to interviewer questions. For each participant, the response corpus over several interviews was preprocessed into feature vectors using the Linguistic Inquiry & Word Count (LIWC) dictionary [3]. The LIWC dictionary transforms each word in a given corpus to a 69-dimensional feature vector with latent dimensions representing grammatical and semantic properties of each word. A final 69-dimensional feature vector is then constructed at the end of the corpus by aggregation of all previous LIWC vectors. The resulting feature representation is an m × 69 matrix. The best-performing classifier in this benchmark study uses a linear support vector machine (SVM) with ℓ1-norm regularization [2]. The resulting performance is 72.5% AUC over 5-fold cross-validation.
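A compressed sketch of this benchmark pipeline is given below, with a toy category lexicon standing in for the proprietary 69-dimension LIWC dictionary, and with `participant_responses` / `mci_labels` assumed to be loaded elsewhere.

```python
# Sketch of the LIWC-style featurization + L1-regularized linear SVM
# benchmark. The tiny lexicon is a stand-in for LIWC, and the input data
# (participant_responses, mci_labels) are assumed to be provided.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

CATEGORIES = ["pronoun", "past_tense", "negative_emotion"]      # toy stand-in for LIWC
LEXICON = {"i": {"pronoun"}, "we": {"pronoun"},
           "forgot": {"past_tense", "negative_emotion"},
           "went": {"past_tense"}, "sad": {"negative_emotion"}}

def liwc_style_features(responses):
    """Average category counts over all words a participant produced."""
    vec, n_words = np.zeros(len(CATEGORIES)), 0
    for utterance in responses:
        for word in utterance.lower().split():
            n_words += 1
            for cat in LEXICON.get(word, ()):
                vec[CATEGORIES.index(cat)] += 1
    return vec / max(n_words, 1)

# participant_responses: list of per-participant utterance lists; mci_labels: 0/1
X = np.stack([liwc_style_features(r) for r in participant_responses])
y = np.array(mci_labels)
clf = LinearSVC(penalty="l1", dual=False, C=1.0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```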
2.4.3 Healthcare Applications of Dialog Systems

Dialogue systems have been widely adopted in the healthcare domain for various applications. For example, chat-bots are available to assist the patient intake process [25], retrieve restaurant accommodation information for young adults with food allergies [26], and perform dialogue analysis and conversation generation for mental health treatment [27]. In the context of Alzheimer's disease research, [28] designed a virtual-reality-based chat-bot to evaluate memory loss using predefined questions and answers. [29] discussed applications of chat-bots as caregivers for Alzheimer's patients, providing safety, personal assistance, entertainment, and stimulation. More recently, [30] introduced a computer avatar that asks a list of pre-defined questions from neuropsychological tests to diagnose dementia. This work is closely related to our system as it utilizes dialogue to glean disease-related information. However, one major issue in this approach is that the questions were obtained from the mini-mental state examination (MMSE) [31], which is a confirmatory measure used to define clinical dementia (such as MCI) rather than a diagnostic tool to predict it. It is more clinically meaningful to identify diagnostic markers associated with the pathological pathways, such as lexical distributions associated with cognitive changes, for the purpose of diagnostic screening.

2.5 Beyond Linguistic Features

2.5.1 Acoustic Features

Beyond linguistic features, acoustic features such as mel-frequency cepstral coefficients (MFCC) have been shown to effectively predict MCI status [32, 33], especially when combined with linguistic features. However, these studies are thus far limited to responses to structured conversations (e.g., responses to standardized cognitive tests and fixed questionnaires). It remains an open question whether acoustic markers can be used in semi-structured conversations and whether combining linguistic and acoustic markers can improve the detection of MCI. Most predictive studies using acoustic markers rely on either fixed prompts [33] or pronunciation tasks [32] to control for the differences in linguistic contexts used in conversations. This is because variations in utterance lengths, word choices and sentence structures can introduce variance in acoustic features independent of vocal differences. In fact, Roark et al. point out that "narrow and topic-focused use of language" is important for "more accurate" acoustic marker extraction [32]. However, while these studies suggest that acoustic markers can have very high predictive value, the highly structured conversational settings may restrict the effectiveness of linguistic markers. Traditionally, semi-structured conversational settings have been used in linguistic marker studies because they reflect participant linguistic preferences in open conversations (without fixed sentences). Therefore, we are interested in combining acoustic and linguistic features in a synergistic way, despite the fact that the semi-structured setting introduces interesting trade-offs between the stability of acoustic features and the expressiveness of linguistic ones.

2.5.2 Personality Modeling

Persona modeling refers to non-goal-oriented dialog, otherwise known as chit-chat dialogue systems. Vinyals et al. [18] introduced a statistical machine translation (SMT) framework for sequence-to-sequence modeling of dialogue turns, with neither dialogue-state tracking nor modeling of long-term dependence between turns. As a result, vanilla sequence-to-sequence models often fail to capture long-term dependencies of dialogue responses and do not conform to a coherent "personality" in conversations [23, 24]. Li et al. [19] introduced persona-based modeling, which incorporates persona information about a user as word embeddings. Recently, [23] extended this idea by building the PersonaChat dataset to permit the modeling of persona-dependent dialogue trajectories that capture both turn-based transitions as well as conditional response generation based on personalized text profiles. Such an approach generated great interest in the research community. The main focus of persona modeling has been how to incorporate persona information into the dialogue generation process. For example, Song et al. [34] introduced a method for incorporating persona information through variational autoencoders (VAE) with external memory, improving performance on automatic evaluations. The ConvAI2 competition [24] provided a diverse sample of state-of-the-art model architectures for persona modeling. However, as pointed out in the analysis of results from the ConvAI2 (NeurIPS '18) competition, a bottleneck of persona models remains the disparity between automatic and human evaluation.

2.5.3 Automatic Evaluation of Persona Models

The NeurIPS '18 challenge [24] highlighted several limitations of automatic evaluation of dialogue quality. For example, F1 and perplexity (PPL) scores consider neither the temporal dependence of text generation nor the underlying semantics of generated texts. This was reflected in the fact that performance of persona models on automatic evaluation was not predictive of performance under human evaluation – the eventual competition winner scored relatively poorly compared to other top models in the automatic evaluation category. Additionally, it was found that classical metrics such as F1 score and Hits@1 automatically favor ranking models over generative models. PPL, F1 and Hits@1 also do not provide any information about the style of responses, for instance tendencies to ask questions to human users or bias toward certain conversational topics.
[35] introduced a method of evaluating dialogue quality beyond classical metrics by using a classifier to predict meta-data about the underlying persona (e.g., extroversion and personality 21 traits). However, such an approach requires strong supervision for each persona, which is not readily available for large dialogue corpus. Hu et al. [36] introduced the idea of using a discriminator as an automatic evaluator of generated dialogues. Zhang et al. [37] extended the use of discriminators to include language models for constraining the possible response outputs and policy gradients to deal with discrete generator inputs. However, discriminator-based approaches focus mostly on response-level coherence rather than coherence of the dialogue trajectory over multiple turns. In this work, we introduce an automatic evaluation scheme that systematically addresses each of these issues. 2.6 Feature Selection in Medical Informatics One prominent problem in medical informatics is the problem of feature selection over disease markers. Aside from the curse of dimensionality, biomarkers are expensive and cannot be obtained for each patient at every timestep. A prime example of this phenomenon exists in the realm of medical informatics, specifically electronic health records (EHR) dominated by time-series data. Yet this hasn’t stopped EHR from sparking numerous research interests in recent years [38, 39, 40, 41, 42]. Time-series data in the EHR consists of features that are sampled at different levels of temporal granularity (e.g., lab tests are sampled at longer time-scales than vital signs). Most of the time, only a small subset of features are observed at any time-step, as it is prohibitively expensive for the physician to obtain the full set of features for every patient for every time interval. In practice, a form of active sensing [43] is implicitly done in an ad-hoc manner by physicians based on a combination of inpatient work-flow, resource constraints, and domain expertise. As a result, the key difficulty with EHR time-series data is that they almost always require re-sampling and imputation due to large amounts of missing values. For example, consider the EHR setting, where at each time-step, the physician has to decide how to allocate a constrained amount of tests/resources among patients for some set of clinical tasks. In other words, the precision sensing problem for time-series is in fact a resource distribution problem at each time-step, with a fixed observation budget across a given set of samples. Since the querying process is applied across time rather 22 than samples, any feature selection formulation must reflect both the temporal-dependency and the budgetary-constraint of the querying process. Therefore it is no surprise that several recent works have shown that temporal models trained on EHR data have fragile decision boundaries that are susceptible to small perturbations [44, 45]. In later chapters, we present a framework called precision sensing to overcome the temporal dependence and non-i.i.d. property of medical feature selection. We first introduce some related works in active sensing below. 2.6.1 Active Sensing Precision sensing typically falls under the suite of active data acquisition problems such as active learning [46] (optimal experiment design) and active sensing [43], both of which attracted significant research interest from the machine learning community. 
Techniques such as incremental feature acquisition [47] and active sensing [43, 48] focus on selecting a subset of features by querying the original data in an online manner, with the goal of reconstructing (sample, view) pairs that maximize the mutual information of predictive classifiers. Note that the original active sensing problem proposed by Yu et al. [43] operates on static feature sets rather than time-series data. More recently, active sensing has been extended to time-series data by [48] and [49], but in a problem setting very different from this paper. This is because both [48] and [49] manage not only a feature selection budget across features, i.e., “data-streams”, but also across time. However, a separate budget constraint is considered for each individual. Precision sensing presents an additional layer of complexity—the sensor model has to consider budget constraints across samples. That is, we do not make the i.i.d. assumption across data samples. In reality, a sensing strategy has to take into account an accumulating budget across all samples in a data batch. Using the toy example from [49], a physician may need to conserve the observation budget for one group of patients in order to decrease uncertainty of predictions for another batch at future time points. Thus, our goal is to dynamically adjust the allocation of a fixed observation budget across a time window and a set of samples. In experiments, we compare our method against state-of-the-art (SOTA) active sensing 23 methods. 2.6.2 Temporal Feature Selection Classic statistical methods [50, 51], perturbation models [44] and attention-based techniques [52] have also been applied to distill model-preferences over temporal features across time. Schulam [53] and Saria et al. [54] proposed modeling temporal relevance of features by abstracting hierarchical representations across time (e.g., population-level and subpopulation-level latent features). However, all these strategies lack an active component because their sensing strategies neither adapt to shifts in future data-streams nor do they incrementally build the future sensing strategy. While Clertant [55], like Chang et al. [49], formulated the sensing problem in a Markov Decision Process (MDP) framework, their proposed algorithm deals with the sequential selection of static features, rather than temporal data. Moreover, perturbation models and temporal feature selection [56] are specific to the hypothesis model that they are trained on. But ideally a sensing strategy should be generalizable to different hypothesis classes and decision boundaries. We experimentally show how our method maintains performance for different model types, including non-temporal models (e.g., feed-forward networks) as well as non-deep models (e.g., logistic regression). 2.7 Sample Efficiency and Privileged Information Finally, we explore the issue of improving sample-efficiency in using disease markers for clinical prediction. In classical supervised learning, the learner is presented with the training tuple {(𝑥𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 and performs an optimization task of finding the best model in a hypothesis space ℎ : 𝑋 ↦→ 𝑌 to approximate some true 𝑓 : 𝑋 ↦→ 𝑌 which explains the data. Given a new task, knowledge transfer [57] is often applied to accelerate the learning process by distilling and transferring relevant knowledge from previous tasks to the unseen one. Under classical formulations, the learner incorporates prior information in one of several ways: 1. 
Direct transfer of parameters from old hypothesis models to the new task and fine-tuning [57] the parameters. 24 2. Learning multiple tasks (online or batched) related to the current task [58, 42]. 3. Using the prior knowledge (i.e. a knowledge graph) to constrain the hypothesis space by regularization [59]. 4. Using representations (i.e. embeddings) of 𝑋 and / or 𝑌 from previous tasks for new tasks [5, 60]. 5. Accelerate learning rate and model compression by Distillation as typically seen in Teacher- Student models [61]. In each of these settings, knowledge transfer operates directly within the 𝑋, 𝑌 and H spaces to improve generalization of information from old models to the new task. Recently, Learning Using Privileged Information (LUPI) [62] has provided a new paradigm for knowledge transfer. Under LUPI, the learner now interacts with a Teacher that provides privileged information (PI) and is only available at training time. From the learner’s perspective, the training set is now extended to the tuple {(𝑥𝑖 , 𝑥𝑖∗ , 𝑦𝑖 )}𝑖=1 𝑛 , and the testing set stays the same. Some examples of PI include: 1) Future information that relates 𝑋 and 𝑌 . For example, using future stock prices beyond the prediction window during training. 2) Auxiliary information describing the label space that is available only to a subset of samples. For example, physician notes that accompany diagnostic predictions which is only available after the diagnosis is made. At a high-level, PI provides some similarity information between training samples from the original feature space, and the Teacher hypotehsis serves as additional “explanations” of the hypothesis space [63, 62]. As a result, [63] showed that the LUPI Teacher provides a principled way to improve the generalization error of Student learners using agnostic PAC models, providing some theoretical improvements in the number of samples required to achieve generalizability the new task (i.e. improves sampling efficiency) However, under the current state-of-art LUPI formulations such as [64, 65], PI is incorporated by means of support vectors and dropout schemes, both of which fail to explore the underlying similarity structure between samples in the PI space 𝑋 ∗ . For example the mode distribution and 25 pairwise similarity between points in the 𝑋 ∗ space is largely unused. The PI contributes as auxiliary training features and kernel information, but all LUPI information is lost at inference time and beyond. A significant question remains: can privileged information be retained for future tasks? Ideally, we want the LUPI Teacher to incorporate PI in a way that is specific enough to inform similarity between training samples yet general enough to be retained across future tasks. In the following subsections, we highlight some of these differences as well as improvements over related LUPI works. 2.7.1 Transfer Learning Review Consider the transfer learning setting: Assume we are given a source domain D𝑆 = {X𝑆 , 𝑃𝑆 (𝑋)}, a source task T𝑆 = {Y𝑆 , H𝑆 }, a target domain D𝑇 = {X𝑇 , 𝑃𝑇 (𝑋)}, and a target task T𝑇 = {Y𝑇 , H𝑇 }, where X defines a feature space, 𝑃(𝑋) defines its marginal distribution, Y defines a label space, and H defines a space of hypothesis which best approximates an underlying function 𝑓 : 𝑋 ↦→ 𝑌 which explains the data. Transfer learning leverage knowledge in D𝑆 and T𝑆 to improve the learning of ℎ𝑇∗ ∈ H𝑇 , where D𝑆 ≠ D𝑇 or T𝑆 ≠ T𝑇 . 
Improving the learning of ℎ∗𝑇 can come in two ways: (1) improving the sample complexity of H𝑇, or (2) improving the rate of convergence for finding ℎ∗𝑇 ∈ H𝑇. Transductive learning describes a situation where D𝑆 ≠ D𝑇 but T𝑆 = T𝑇, which can occur either due to a mismatch in feature space, i.e. X𝑆 ≠ X𝑇, or due to co-variate shift, i.e. 𝑃𝑆(𝑋) ≠ 𝑃𝑇(𝑋) [57, 66]. On the other hand, inductive learning describes a situation where D𝑆 = D𝑇 but T𝑆 ≠ T𝑇, i.e. the same domain but different tasks. Briefly, the various transfer learning approaches are summarized in Table 2.1.

Table 2.1: Summary Table of Transfer Learning Approaches

Setting / Method          | Domain                         | Targets           | Examples
Inductive learning        | D𝑆 = D𝑇                       | T𝑆 ≠ T𝑇          |
  Multi-task Learning     | Same Features                  | Y𝑆 ≠ Y𝑇          | [67]
  Parameter Transfer      | Same Features                  | T𝑆 ≠ T𝑇          |
  Relational Knowledge    | Relational Graph               | Y𝑆 ≠ Y𝑇          | [59]
Transductive Learning     | D𝑆 ≠ D𝑇                       | T𝑆 = T𝑇          |
  Multi-View Learning     | X𝑆 ≠ X𝑇                       | Same Tasks        | [60, 68]
  Sample Selection Bias   | 𝑃𝑆(𝑌|𝑋) ≠ 𝑃𝑇(𝑌|𝑋)             | Same Tasks        | [69]
  Co-Variate Shift        | 𝑃𝑆(𝑋) ≠ 𝑃𝑇(𝑋)                 | Same Task         |
  Domain-Adaptation       | 𝑃𝑆(𝑋,𝑌) ≠ 𝑃𝑇(𝑋,𝑌)             | Y𝑆 = Y𝑇          |
Unsupervised Learning     | D𝑆 ≠ D𝑇                       | T𝑆 ≠ T𝑇          |
  Feature Representation  | Transfer of 𝜙(𝑋𝑆) → 𝜙(𝑋𝑇)      | New or Old Tasks  | [5, 70]

2.7.2 Knowledge Transfer by Inductive Learning

A popular approach for solving the inductive learning problem is parameter sharing, whereby some or all of the parameters of ℎ∗𝑆 ∈ H𝑆 are recycled for the target model, and fine-tuning is applied to accelerate the search for ℎ∗𝑇 ∈ H𝑇 [57, 71]. For example, consider multi-task learning (MTL), where multiple hypotheses are learned jointly for a set of related tasks [57]. Under this setting, the tasks T1, ..., T𝑇 are presumed to come from the same underlying distribution. Using the same feature space 𝑋, a subset of the joint hypothesis space ∪𝑘 H𝑘 ⊂ {H1, ..., H𝑇−1} should presumably accelerate the learning of the optimal hypothesis for the target task. Thus, by transferring parameters from {H1, ..., H𝑇−1} to H𝑇, the convergence rate to the optimal ℎ∗𝑇 can be improved. A disadvantage of MTL is the constraint on the relatedness of tasks. For example, it is hard to know beforehand which subset of tasks will contribute positively to the target, and poor selection of parameter transfer can actually lead to negative transfer, resulting in poor performance on the target task [57]. By contrast, LUPI uses PI that is by definition specific to the current task. However, unlike MTL, LUPI does not use a shared feature space between the PI and the original data, i.e. 𝑋 ≠ 𝑋∗, and cannot directly incorporate the Teacher hypothesis via direct parameter sharing. In recent years, relational knowledge has provided an alternative approach to inductive learning by incorporating domain-specific knowledge in the form of regularized priors for target tasks [59, 72]. In contrast to parameter sharing, relational knowledge is agnostic to the collection of source tasks and should apply universally to all learning tasks given the same feature space 𝑋. This formulation overcomes the limitation of task similarity in the former case, but it is also very expensive to construct reliable relational knowledge, such as knowledge graphs. By constraining the hypothesis space of the target task with relational knowledge, the speed of convergence can be improved. However, inductive learning techniques generally focus on improving either prediction accuracy or convergence rate rather than sample complexity.
The main focus of LUPI, on the other hand, is to provide some guarantees on improving the sample efficiency of the Student learner. 2.7.3 Knowledge Transfer by Transductive Learning In contrast to inductive learning, transductive learning considers two sub-problems: X𝑆 ≠ X𝑇 and 𝑃𝑆 (𝑌 |𝑋) ≠ 𝑃𝑇 (𝑌 |𝑋). Multi-view Learning (MVL) [60, 73] and Multi-modal Learning (MML) [74, 75] are main methods which deal with the X𝑆 ≠ X𝑇 case, where the feature spaces differ between the source and target domains. MVL is often involved in processing different subsets of features describing the same set of samples, for example different channels of EEG signals for neurological diagnostics [76]. MML deals with different modalities of data, for example picture and text descriptions of a disease process [77]. Often times, both MVL and MML utilize some form of data fusion, whereby some shared representation of the multiple source domains is used together to predict the same target task [67]. One drawback of these approaches is that modalities such as PI is unavailable at test time, leading to poor generalization of the hypothesis model since it is conditioned on both 𝑋 and 𝑋 ∗ . For example, suppose we have some data fusion model 𝑔 : 𝑋 × 𝑋 ∗ ↦→ 𝑍, and a hypothesis function ℎ : 𝑍 ↦→ 𝑌 . At training time, both the PI and the original features are utilized to train ℎ(𝑔(𝑥, 𝑥 ∗ )) = 𝑦. At test time, however, since only 𝑋 is available, 𝑔(𝑥, 0) may actually map to a completely different set of features in 𝑍, leading to a biased ℎ(𝑧) = 𝑦. In other words, if 𝑋 is under-utilized during training, ℎ(𝑧) will likely lead to poor generalization at test time. Unfortunately, since PI is by definition a more task-specific descriptor of 𝑌 , this is the most likely case and presents a limitation for data fusion methods for incorporating PI. On the other hand, methods such as biased selection sampling [78] and Optimal Transport [79] are used to deal with the same set of features, but their marginal distributions disagree, i.e. 𝑃𝑆 (𝑋) ≠ 𝑃𝑇 (𝑋). This is otherwise known as co-variate shift [66], and recently generative models such as [80, 81] and [82] have dominated the state-of-art. For example, VRADA and 28 RadialGAN both attempt to learn a domain-invariant latent distribution s.t. 𝑃(𝑍 |𝑋𝑆 ) = 𝑃(𝑍 |𝑋𝑇 ), and 𝑃𝑆 (𝑌 |𝑍) = 𝑃𝑇 (𝑌 |𝑍)[82, 80]. Methods such as [83] and [81] try to directly learn a mapping between 𝑃(𝑋𝑇 |𝑋𝑆 ) so that samples from 𝑋𝑆 can be used to augment the training of ℎ(𝑥) = 𝑦. These generative models present a possible way of incorporating PI into the modeling process by learning a domain-invariant latent representation or transformation function between 𝑋 and 𝑋 ∗ . However, these models improve the hypothesis function by means of data augmentation (i.e. increasing the samples available) rather than decreasing the sample complexity required to train an accurate model. In this regard, LUPI is advantageous in that it provides a framework for decreasing the samples necessary by leveraging the Teacher’s hypothesis function 𝑓 ∗ . This subtle different becomes important when “big data” is not available for complex data problems, for example modeling rare diseases in healthcare records. 2.7.4 Knowledge Transfer with Distributed Representations Transfer of feature representations is another paradigm of transfer learning that extends beyond inductive transfer [57]. Under this setting, no label information is available for either the source or target domains. 
However, the underlying structure of the source domain 𝑋𝑆 can first be extracted using unsupervised learning before applying to downstream tasks. A prime example of this can be seen in the learning of distributed representations for words [5], which is widely used in NLP applications by converting the feature space of words (which have no intrinsic distance or similarity properties) into an embedding vector space where distance and similarity can be computed based on co-occurrence frequencies. Several related methods have been applied in the context of medicine by learning distributed representations for medical concepts obtained from large corpora of journal publications [70], EHR notes [41], and medical claims [72]. This framework provides a tool for understanding the structural properties of PI, which can then be transfered to improve the actual learning task of estimating the optimal Student hypothesis function. However, tactful incorporation of these embeddings may not be as simple, since the learning regime must be able to incorporate the embedded information in the hypothesis somehow. Our methodology introduces a way to achieve 29 this and incorporates both structural information about the PI as well as the Teacher hypothesis function into the Student learning procedure. 2.7.5 Knowledge Transfer by Knowledge Distillation A closely related concept to LUPI is knowledge distillation [61] in which a Teacher model outputs are used as “soft-labels” to accelerate Student learning of the target task. Similar to LUPI, the Teacher learns a more accurate model of the task, 𝑓 ∗ : 𝑋 ↦→ 𝑌 , and the Student tries to learn a “distilled” representation of the hypothesis space, where the VC-dimension of the Student hypothesis space is less than that of the teacher [84]. By decreasing its VC-dimension, the Student learner can improves its sampling efficiency at least by a constant factor. However, the Teacher model does not incorporate PI, which provides a better prediction of the label than the original feature space. Additionally, the original LUPI provides a more favorable generalization bound compared to model distillation without the help of PI [84, 63]. 2.7.6 Knowledge Transfer by LUPI Finally, LUPI provides some performance guarantees with regard to the sample efficiency of the Student learner, so long as the PI and the Teacher model satisfies some conditions [62, 63]. However, the main drawback of current formulations of LUPI is that the PI used is highly specific to the task at hand – no information is retained for related tasks. Our work applies elements of unsupervised learning and transductive learning to alleviate this limitation of LUPI. Although recent works such as [64] has generalized the LUPI framework to deep learning settings, our work extends LUPI to allow for multi-task and transfer learning, enabling the generalization of a PI source to accelerate the sample efficiency of many tasks. 30 2.7.7 LUPI Preliminaries Traditionally, LUPI is applied to training data of the form: (𝑥1 , 𝑥1∗ , 𝑦 1 ), (𝑥 2 , 𝑥2∗ , 𝑦 2 )...(𝑥 𝑚 , 𝑥 𝑚 ∗ , 𝑦 𝑚 ) ∈ Dtrain , where 𝑥𝑖 ∈ R𝑛 denotes example feature (EF) vectors from the original feature space, and 𝑥𝑖∗ ∈ R𝑛∗ denotes privileged information (PI) vectors from the privileged information space. D𝑡𝑟𝑎𝑖𝑛 indicates that the PI inputs are only available during training. 𝑦𝑖 ∈ {−1, +1} denotes the ground truth labels for inputs (𝑥𝑖 , 𝑥𝑖∗ ). LUPI then considers two pattern recognition problems: • Using {(𝑥𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 , find rule ℎ(𝑥 ) = 𝑦 . 
• Using {(𝑥∗𝑖 , 𝑦𝑖)}𝑚𝑖=1, find rule 𝑓∗(𝑥∗𝑖) = 𝑦𝑖.

Supposing that 𝑓∗(𝑥∗𝑖) = 𝑦𝑖 produces low generalization error, the LUPI task is to transfer knowledge of rules in the 𝑋∗ space to improve learning in the 𝑋 space. The original LUPI formulation in [62] considers only SVM models, whereby the privileged information is incorporated into the SVM objective as follows:

\begin{align*}
\underset{w,\, w^{*}}{\text{minimize}} \quad & \frac{1}{2}\left(w^{T} w + \lambda\, w^{*T} w^{*}\right) + C \sum_{i=1}^{m} \left(w^{*T} \psi(x_i^{*}) + b^{*}\right) \\
\text{s.t.} \quad & y_i\left(w^{T} \phi(x_i) + b\right) \ge 1 - \left(w^{*T} \psi(x_i^{*}) + b^{*}\right), \\
& w^{*T} \psi(x_i^{*}) + b^{*} \ge 0, \qquad i = 1, \dots, m.
\end{align*}

Here, 𝑤 and 𝑤∗ are the parameter vectors, and 𝑏 and 𝑏∗ are the biases of the decision functions. 𝜓(·) and 𝜙(·) are feature mapping kernels applied to the PI and EF vectors, respectively. 𝐶 > 0 and 𝜆 > 0 are hyperparameters which control the contribution of privileged information to the overall cost function: 𝐶 controls the contribution of the Teacher loss 𝑓∗ : 𝑋∗ ↦→ 𝑌, while 𝜆 controls the correction of the parameter space by 𝑤∗𝑇𝑤∗.

BIBLIOGRAPHY

[1] Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422, 2016.
[2] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017.
[3] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
[4] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[6] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[7] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
[11] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35, 2017.
[12] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.
[13] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young.
A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016. [14] Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073, 2017. [15] Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. Reinforced mnemonic reader for machine reading comprehension. arXiv preprint arXiv:1705.02798, 2017. [16] Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review, 21(2):97–126, 2006. [17] Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. Optimizing dialogue management with reinforcement learning: Experiments with the njfun system. Journal of Artificial Intelligence Research, 16:105–133, 2002. [18] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015. [19] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016. [20] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. [21] Yookoon Park, Jaemin Cho, and Gunhee Kim. A hierarchical latent structure for variational conversation modeling. arXiv preprint arXiv:1804.03424, 2018. [22] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017. [23] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018. [24] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack 34 Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conver- sational intelligence challenge (convai2). In The NeurIPS’18 Competition, pages 187–208. Springer, 2020. [25] Lin Ni, Chenhao Lu, Niu Liu, and Jiamou Liu. Mandy: Towards a smart primary care chatbot application. In International Symposium on Knowledge and Systems Sciences, pages 38–52. Springer, 2017. [26] Paris Hsu, Jingshu Zhao, Kehan Liao, Tianyi Liu, and Chen Wang. Allergybot: A chatbot technology intervention for young adults with food allergies dining out. In CHI, pages 74–79. ACM, 2017. [27] Kyo-Joong Oh, Dongkun Lee, Byungsoo Ko, and Ho-Jin Choi. A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation. In MDM, pages 371–375. IEEE, 2017. [28] Juan Manuel Fernandez Montenegro and Vasileios Argyriou. Cognitive evaluation for the diagnosis of alzheimer’s disease based on turing test and virtual environments. Physiology & behavior, 173:42–51, 2017. [29] Miguel A Salichs, Irene P Encinar, Esther Salichs, Álvaro Castro-González, and María Malfaz. Study of scenarios and technical requirements of a social assistive robot for alzheimer’s disease patients and their caregivers. International Journal of Social Robotics, 8(1):85–102, 2016. 
[30] Hiroki Tanaka, Hiroyoshi Adachi, Norimichi Ukita, Manabu Ikeda, Hiroaki Kazui, Takashi Kudo, and Satoshi Nakamura. Detecting dementia through interactive computer avatars. IEEE journal of translational engineering in health and medicine, 5:1–11, 2017. [31] Tom N Tombaugh and Nancy J McIntyre. The mini-mental state examination: a comprehensive review. J. of the Ame. Geriatrics Soc., 40(9):922–935, 1992. [32] Brian Roark, Margaret Mitchell, John-Paul Hosom, Kristy Hollingshead, and Jeffrey Kaye. Spoken language derived measures for detecting mild cognitive impairment. IEEE transactions on audio, speech, and language processing, 19(7):2081–2090, 2011. [33] Tuka Alhanai, Rhoda Au, and James Glass. Spoken language biomarkers for detecting cognitive impairment. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 409–416. IEEE, 2017. [34] Haoyu Song, Wei-Nan Zhang, Yiming Cui, Dong Wang, and Ting Liu. Exploiting persona in- formation for diverse generation of conversational responses. arXiv preprint arXiv:1905.12188, 2019. [35] Yujie Xing and Raquel Fernández. Automatic evaluation of neural personality-based chatbots. arXiv preprint arXiv:1810.00472, 2018. 35 [36] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1587–1596. JMLR. org, 2017. [37] Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Bidirectional generative adversarial networks for neural machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 190–199, 2018. [38] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 432–440. SIAM, 2016. [39] Fengyi Tang, Cao Xiao, Fei Wang, and Jiayu Zhou. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open, 2018. [40] Abhyuday N Jagannatha and Hong Yu. Bidirectional rnn for medical event detection in electronic health records. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2016, page 473. NIH Public Access, 2016. [41] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016. [42] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017. [43] Shipeng Yu, Balaji Krishnapuram, Romer Rosales, and R Bharat Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009. [44] Mengying Sun, Fengyi Tang, Jinfeng Yi, Fei Wang, and Jiayu Zhou. Identify susceptible locations in medical records via adversarial attacks on deep predictive models. arXiv preprint arXiv:1802.04822, 2018. [45] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Ead: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [46] Burr Settles. Active learning literature survey. 
Technical report, University of Wisconsin- Madison Department of Computer Sciences, 2009. [47] Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Raymond Mooney. Active feature-value acquisition for classifier induction. In Fourth IEEE International Conference on Data Mining (ICDM’04), pages 483–486. IEEE, 2004. 36 [48] Jinsung Yoon, William R Zame, and Mihaela van der Schaar. Deep sensing: Active sensing using multi-directional recurrent neural networks. 2018. [49] Chun-Hao Chang, Mingjie Mai, and Anna Goldenberg. Dynamic measurement scheduling for event forecasting using deep rl. arXiv preprint arXiv:1901.09699, 2019. [50] Michail Tsagris, Vincenzo Lagani, and Ioannis Tsamardinos. Feature selection for high- dimensional temporal data. BMC bioinformatics, 19(1):17, 2018. [51] Liying Fang, Han Zhao, Pu Wang, Mingwei Yu, Jianzhuo Yan, Wenshuai Cheng, and Peiyu Chen. Feature selection method based on mutual information and class separability for dimension reduction in multidimensional time series for clinical data. Biomedical Signal Processing and Control, 21:82–89, 2015. [52] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016. [53] Peter Schulam and Suchi Saria. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems, pages 748–756, 2015. [54] Suchi Saria, Andrew Duchi, and Daphne Koller. Discovering deformable motifs in continuous time series data. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011. [55] Matthieu Clertant, Nataliya Sokolovska, Yann Chevaleyre, and Blaise Hanczar. Interpretable cascade classifiers with abstention. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2312–2320, 2019. [56] Rohan A Baxter, Graham J Williams, and Hongxing He. Feature selection for temporal health records. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 198–209. Springer, 2001. [57] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010. [58] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997. [59] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017. 37 [60] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017. [61] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [62] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544–557, 2009. [63] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of machine learning research, 16(2023-2049):2, 2015. [64] John Lambert, Ozan Sener, and Silvio Savarese. Deep learning under privileged information using heteroscedastic dropout. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8886–8895, 2018. [65] Xue Li, Bo Du, Chang Xu, Yipeng Zhang, Lefei Zhang, and Dacheng Tao. R-svm+: Robust learning with privileged information. In IJCAI, pages 2411–2417, 2018. [66] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000. [67] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017. [68] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, and Jieping Ye. Deep multi-view spatial-temporal network for taxi demand prediction. arXiv preprint arXiv:1802.08714, 2018. [69] Yi Luo, Guojie Song, Pengyu Li, and Zhongang Qi. Multi-task medical concept normalization using multi-view convolutional neural network. 2018. [70] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016:41, 2016. [71] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014. [72] Samuel G Finlayson, Paea LePendu, and Nigam H Shah. Building the graph of medicine from millions of clinical narratives. Scientific data, 1:140032, 2014. [73] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013. 38 [74] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [75] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011. [76] Ye Yuan, Guangxu Xun, Kebin Jia, and Aidong Zhang. A multi-view deep learning method for epileptic seizure detection using short-time fourier transform. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 213–222. ACM, 2017. [77] C. Nagpal. Deep multimodal fusion of health records and notes for multitask clinical event prediction. In NIPS ML4H Workshop, 2017. [78] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, page 114. ACM, 2004. [79] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. [80] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. Radialgan: Leveraging multiple datasets to improve target-specific predictive models using generative adversarial networks. arXiv preprint arXiv:1802.06403, 2018. [81] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint, 1711, 2017. [82] Sanjay Purushotham, Wilka Carvalho, Tanachat Nilanon, and Yan Liu. Variational recurrent adversarial deep domain adaptation. 2016. [83] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017. 
[84] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015. 39 CHAPTER 3 SCALABLE DIAGNOSTIC SCREENING OF MCI USING DIALOG POLICY 3.1 Introduction The progression of Alzheimer Disease (AD) has consistently been a heavy area of research in clinical medicine because while the disease itself is incurable, early intervention at the prodromal phases of the disease has proven to delay the onset of AD-related mental degeneration and systemic issues for months to years [1, 2]. Consequently, much of the recent clinical research efforts have focused on detecting early stages of mild cognitive impairment (MCI), which is a prodromal phase in AD progression occurring months to years before visible mental decline begins [3]. If successfully detected at this stage, intervention methods may confer numerous benefits in the longevity of cognitive and physiological health of AD patients [1, 2]. Brain imaging, such as the structural magnetic resonance imaging (MRI), was shown to contain prime markers of AD, capturing the physiologic changes in the AD pathological process [4, 5]. However, the identification of MCI from normal aging (NL) is particularly challenging due to the fact that structural changes in the brain in this phase are minor and hard to detect through structural MRI [6], even though decline in mental status and cognitive have already begun in most cases. Recently, the structural connections among brain regions inferred from diffusion MRI have provided promising predictive performance of MCI detection [7, 8], yet sketching brain networks via imaging still remains rather prohibitively expensive and difficult to scale. Moreover, the high dimensionality of brain imaging combined with small sample size usually imposes significant challenges in learning algorithms and leads to unstable generalization performance. On the other hand, behavior and social markers could offer a much more cost- effective option for MCI detection [9, 10, 11, 12]. A recent clinical trial has studied differentiating early stage MCI from NL cohort groups using transcripts of extensive conversations between patients and trained interviewers [11]. In a recent preliminary study [12], the authors trained supervised learning 40 models from the lexical distribution of the conversation, and showed that conversational responses of MCI and NL patients take on different distribution under various conversational topics. The success of [12] in predicting MCI using human dialogue introduced an alternative natural language processing (NLP) approach to a classically clinically expensive problem. However, the use of human interviewers still requires substantial amounts of interaction between trained staff which incur significant expense in its current form. Thus, the bottleneck questions remain: (1) can we cut down on the amount of conversations needed to achieve accurate prediction, (2) can we improve upon baseline performance given limited cohort-specific data? To address the aforementioned questions above, in this chapter we propose a novel reinforcement learning (RL) framework, which learns a MCI diagnosis agent using only very limited amount of offline human dialogue transcripts. The learned diagnosis agent can conduct subject-specific conversation with humans, asking questions based on existing conversations to efficiently sketch the lexical distribution and give high-performance MCI prediction. 
In order to facilitate RL using offline transcripts, we introduce a dialogue simulator pipeline which generates new conversational episodes that are less noisy and outperform the original corpus for MCI prediction. Our dialogue pipeline provides a self-contained framework for directing dialogue generation for diagnostic screening, which can potentially replace the need for human-expert interviews. Our RL-agent learns optimal dialogue strategies that are adaptive to unseen users, enabling medically relevant NLP data to be generated on a large scale if deployed in a realistic setting. Furthermore, data generated from our dialogue simulations may be used for data augmentation or to guide the medical data collection process in the future. Ultimately, by greatly decreasing the cost of data collection and the amount of data needed for high-level performance, we introduce a clinical direction that is much more cost-effective and scalable to large-scale diagnostic screening and data collection. The combination of NLP features with our reinforcement learning framework may extend the process of diagnostic screening well beyond the confines of hospitals and primary care facilities.

3.2 Methodology

3.2.1 Dialog Systems

Our approach is inspired by recent advancements in dialog systems. Dialog systems provide a natural human-computer interface and have been an active research field for decades. Task-oriented dialogue systems are typically designed for retrieval tasks in which users provide queries and the chat-bot provides appropriate responses based on an external knowledge base [13, 14, 15], or identifies correct answers by looking into vast amounts of documents [16, 17]. Such dialogue systems are typically designed as a pipeline containing a set of components, including a language understanding unit that parses the intention and semantics of the human input, a dialogue manager that handles dialogue state tracking and policy learning, and a language generation unit that generates responses [15, 18, 19]. While each of the components can be handcrafted or trained individually, recent advances in deep learning allow end-to-end training [13, 14, 20] and significantly improve the performance and the capability to adapt to new domains [21]. The end-to-end systems can be trained using supervised learning [13, 22] or reinforcement learning (RL), by leveraging a user simulator [14, 20]. The main advantage of RL is that fewer training samples are needed to learn the high-degree-of-freedom deep models. In our work, we design a simulator to enable RL due to the limited amount of clinical data available for supervised training. We note that even though our dialogue system also tries to achieve a task (identifying MCI patients), the nature of our system is radically different from existing task-oriented dialogue systems: its goal is to efficiently sketch a disease-specific lexical distribution by asking subject-specific questions and to give classification results. The framework we propose in this chapter involves the use of reinforcement learning to learn the optimal set of questions 𝜋∗ to ask participants for the purposes of distinguishing MCI. On the test set, we generate new episodes from these questions for prediction rather than using the original corpus. To actualize the RL + dialogue simulation framework, we propose a multi-step approach for implementation which capitalizes on the vast existing knowledge of NLP research.

Figure 3.1: Overview of the proposed methodology.
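To make the interaction among the modules in Figure 3.1 concrete, the sketch below wires a per-user simulator, an MCI classifier, and a question-asking agent through a simple dialogue-manager loop. All class and function names (UserSimulator, MCIClassifier, DialogueAgent, run_episode) and the random stand-in models are illustrative placeholders for the trained components described in the following sections, not the study's actual implementation.

```python
import numpy as np

class UserSimulator:
    """Maps a question index to a skip-thought-like response embedding (random stand-in)."""
    def __init__(self, n_questions=107, embed_dim=4800, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_questions, embed_dim))
    def respond(self, question_id):
        return self.W[question_id]

class MCIClassifier:
    """Toy stand-in: logistic score over the running mean of response embeddings."""
    def __init__(self, embed_dim=4800, seed=1):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=embed_dim)
    def predict_proba(self, mean_embedding):
        p_mci = 1.0 / (1.0 + np.exp(-mean_embedding @ self.w))
        return np.array([1.0 - p_mci, p_mci])   # [P(NL), P(MCI)]

class DialogueAgent:
    """Chooses the next question; uniformly at random here as a placeholder policy."""
    def __init__(self, n_questions=107, seed=2):
        self.n_questions, self.rng = n_questions, np.random.default_rng(seed)
    def act(self, state):
        return self.rng.integers(self.n_questions)

def run_episode(agent, simulator, classifier, max_turns=35):
    """Dialogue-manager loop: ask a question, embed the response, update the running mean, classify."""
    embeddings = []
    for t in range(max_turns):
        q = agent.act(state=embeddings)
        embeddings.append(simulator.respond(q))
        probs = classifier.predict_proba(np.mean(embeddings, axis=0))
        if probs.max() >= 0.65:      # illustrative early stop at the confidence threshold
            break
    return probs, t + 1

if __name__ == "__main__":
    probs, turns = run_episode(DialogueAgent(), UserSimulator(), MCIClassifier())
    print(f"P(NL)={probs[0]:.2f}, P(MCI)={probs[1]:.2f} after {turns} turns")
```

The early stop at a 0.65 confidence only mirrors the confidence threshold introduced later in this chapter; in the actual system the RL-agent itself learns when to end the conversation.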
In the following section, we present the details of each component of the dialogue system. Figure 3.1 shows an overview of the components of our experimental pipeline. Complete conversations from participants are used to build user simulators. The simulators are then used to train an MCI diagnosis agent (chat-bot), which conducts minimal turns of conversation with participants to sketch the lexical distribution that is then used to perform MCI classification.

3.2.2 Overview of Pipeline

Our proposed framework contains three key learning modules: the user simulator, the MCI classifier and the RL-agent. The proposed pipeline is illustrated in Figure 3.2. First, the user simulator is trained by unsupervised learning, which simulates the distributed representation of user responses given feasible question inputs. Next, the MCI classifier predicts the patient label based on the averaged distributed representation of its corpus responses. These two components and the dialogue manager comprise the training environment for the RL-agent. The dialogue manager utilizes the user simulator and MCI classifier to handle the state transitions and also computes the reward based on the ground-truth labels from the training set and the MCI classifier prediction. After training in this environment, the RL-agent is able to deliver the optimal sequences of questions for training-set users at various stages of conversations. During testing, the RL-agent produces query inputs to the test-set user simulators, which represent the unseen users. Using these new queries, the user simulators generate the corresponding distributed representation of test-set user responses for MCI prediction.

Figure 3.2: Illustration of reinforcement learning components in our proposed approach (question encoding, user simulator, RL agent, dialogue manager and MCI classifier operating on skip-thought response embeddings).

In the following subsections, we will present each component of the pipeline in detail and demonstrate the effectiveness of the RL framework in improving prediction accuracy while reducing conversational turns.

3.2.3 Construction of Turn-Based Dialogue

Since utterance data was collected in the form of conversational transcripts for each participant, we must reconstruct turn-based dialogue from participant responses. The participant responses were unstructured, while interviewer questions ranged over preset question topics, as illustrated below.

Interviewer: so what did you do yesterday?
Participant: i had yesterday morning i yesterday was a busy day for me. i im forgetting i went to where did i go in the morning. well i went to albertsons yesterday...
Interviewer: what do you see in this picture?
Participant: we got a picture gosh. it looks like my uncle lou. but he never ...
Interviewer: when do you think this picture was taken?
Participant: this picture was probably eighteen seventy or something or nineteen twenty. so he looks too old for war he must have been ...

In total there were well over 150 possible queries from the interviewers. However, for the purposes of this study, we re-compiled the question list into 107 general questions which were ubiquitous across all conversations.
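A minimal sketch of this turn reconstruction is given below: interviewer prompts are paired with the participant responses that follow them, assigned to a question category, and delexicalised. The category keywords, slot token and helper names are hypothetical examples rather than the exact mapping used in the study; examples of the actual question categories follow in Table 3.1.

```python
import re

# Hypothetical category keywords; the study's full 107-question list is not reproduced here.
CATEGORY_KEYWORDS = {
    "activity": ["yesterday", "go outside"],
    "picture":  ["picture"],
    "tech":     ["computer"],
}

def categorize(question: str) -> str:
    q = question.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return category
    return "unspecified"

def delexicalise(text: str) -> str:
    # Replace a concrete store name with a slot token (illustrative rule only).
    return re.sub(r"\balbertsons\b", "<place>", text, flags=re.IGNORECASE)

def build_turns(transcript):
    """Convert alternating (speaker, utterance) pairs into (category, response) turns."""
    turns = []
    for (speaker, utt), (next_speaker, next_utt) in zip(transcript, transcript[1:]):
        if speaker == "Interviewer" and next_speaker == "Participant":
            turns.append((categorize(utt), delexicalise(next_utt)))
    return turns

transcript = [
    ("Interviewer", "so what did you do yesterday?"),
    ("Participant", "well i went to albertsons yesterday..."),
    ("Interviewer", "what do you see in this picture?"),
    ("Participant", "we got a picture gosh. it looks like my uncle lou..."),
]
print(build_turns(transcript))   # [('activity', 'well i went to <place> yesterday...'), ('picture', ...)]
```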
A snapshot of the questions is shown in Table 3.1.

Table 3.1: Examples of questions from conversations

Category     | Question
Activity     | Did you go outside lately? / So what did you do yesterday?
Social       | Did you run into any familiar faces lately? / Where did you have dinner?
Picture      | What do you see in this picture? / Where do you think this picture was taken?
Tech         | How are you with the computer? / Did you use your computer lately?
Unspecified  |

We created a total of 16 question categories, including: greetings, activity check, living situation, travel, entertainment, social, picture-related, tech, occupation, hobbies, family, pets, confirmation, clarification, goodbye and unspecified comments. For some of these comments, we delexicalised certain topic words such as "", "" in order to (1) control for domain expansion [23] and (2) reduce the model complexity of our user simulators. In the past, [23] and [22] have shown the effectiveness of delexicalisation in controlling for domain expansion in user simulators without sacrificing the contextual meaning of sentence queries. Additionally, we created an unspecified comments category, which included comments that deviated from general question prompts. These comments often result from interviewer follow-up on specific topics mentioned by the user. We consolidated these comments into a single category to distinguish the context-specific from general questions based on the corpus. However, we do demarcate the type of unspecified comment used by the interviewer; for example, a follow-up comment to an occupational story is tagged differently from a follow-up comment about a health concern. These comments serve to build rapport and improve the flow of conversation. In future studies we may look to generate user-specific grounding statements for these slots [24]. Implemented in this way, the corpus is tokenized into turn-based responses to questions for each user.

3.2.4 Unsupervised Learning for User Simulator

To effectively capture a contextual representation of user conversation style, we utilize vector embeddings of the user corpus at the sentence level [25, 26]. Given that we want to capture the flow of the conversation from one response to the next, we implement skip-thought embedding, which has shown effectiveness over large corpora by capturing contextual information of sentences given neighboring ones [25]. For encoding sentences, we use a model that was pretrained on the BookCorpus dataset, which contains turn-based conversations from various English novels [25]. For the decoder, we train skip-thought vectors to recover the original response of the user during the NLG portion of the pipeline. Since each user has individual response styles to questions, we train a personalized user simulator for each user. For each user, the conversation corpus is divided into question-response turns. In our dataset, for example, the number of turns per conversation ranged from 30 to 275. We used a multilayer perceptron (MLP) with 2 hidden layers of 512 output nodes each to train the user simulator. We also introduce regularization with an ℓ2-norm penalty to constrain model complexity. Because we utilize preset questions by the interviewer, we use one-hot encodings of questions, denoted q𝑖𝑡 ∈ R𝑑, as input for training. Given the original skip-thought vector v𝑖𝑡, the user simulator serves as a function which maps 𝑓 : q𝑖𝑡 ↦→ v𝑖𝑡. The output of the MLP is the skip-thought embedding representation of the utterance, denoted 𝑓(q𝑖𝑡; w𝑖) ∈ R𝑐. Here, 𝑑 denotes the size of our question dictionary, 𝑐 denotes the dimension of the skip-thought embeddings, w𝑖 parameterizes the MLP model for the given user, 𝑖 ∈ 𝑁 denotes the user index and 𝑡 ∈ 𝑇 denotes the turn number. The loss function of the MLP is given by the mean-squared error (MSE) between the MLP output and the original skip-thought vector v𝑖𝑡 ∈ R𝑐:

$$L(\mathbf{w}_i) = \frac{1}{2} \sum_{t=1}^{T} \left( f(\mathbf{q}_i^t; \mathbf{w}_i) - \mathbf{v}_i^t \right)^2 + \frac{\lambda}{2} \|\mathbf{w}_i\|_2^2, \qquad \forall\, i = 1, \dots, N.$$

In the case where questions are not preset, more state-of-the-art methods such as end-to-end recurrent neural network systems can be deployed to train the user simulator instead [13, 27]. To evaluate the performance of our user simulator, we computed the mean squared error between the outputs of the simulator and the original thought-vector representation of the user response for each turn.
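A minimal sketch of such a per-user simulator is given below, assuming precomputed skip-thought vectors as targets. The layer sizes follow the description above; the skip-thought dimension of 4800 is an assumption (consistent with the state dimension reported later), and the optimizer and epoch settings are illustrative choices rather than those of the original study.

```python
import torch
import torch.nn as nn

D_QUESTIONS, EMBED_DIM = 107, 4800   # question dictionary size d; skip-thought dimension c (assumed)

class UserSimulatorMLP(nn.Module):
    """Two hidden layers of 512 units mapping a one-hot question to a skip-thought vector."""
    def __init__(self, d=D_QUESTIONS, c=EMBED_DIM, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, c),
        )
    def forward(self, q_onehot):
        return self.net(q_onehot)

def train_user_simulator(questions, responses, lam=1e-4, lr=1e-3, epochs=50):
    """MSE between predicted and true skip-thought vectors; weight decay plays the role of the l2 penalty."""
    model = UserSimulatorMLP()
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=lam)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(questions), responses)
        loss.backward()
        opt.step()
    return model

# Toy usage: 40 (one-hot question, skip-thought response) turns for one participant.
q = torch.eye(D_QUESTIONS)[torch.randint(0, D_QUESTIONS, (40,))]
v = torch.randn(40, EMBED_DIM)
simulator = train_user_simulator(q, v)
```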
3.2.5 Reinforcement Learning Components

Again, let 𝑐 denote the size of the skip-thought embeddings and 𝑑 denote the size of the question dictionary. We formulate the dialogue and task manager portions of the dialogue system as a standard RL setting in which an agent interacts with an environment E over a finite number of steps. At time step 𝑡, the agent receives a state s𝑡 and samples an action (asks a question) a𝑡 based on its current policy 𝜋. The environment transitions to the next state s𝑡+1 and the agent receives a scalar reward 𝑟𝑡+1. In this setting, the RL-agent tries to learn an optimal policy 𝜋∗ over all possible states, including ones that are unseen by the agent during training. To do this, the agent has to learn an approximate action-value function, which maps state-action pairs to expected future rewards [28]. Formally, the action-value function is defined as follows:

$$Q^{\pi}(\mathbf{s}, a) = \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} \gamma^{t} r_t \,\middle|\, \mathbf{s}, a \right],$$

where 𝛾 ∈ [0, 1] is a discount factor and 𝑇 is the maximum number of turns.

3.2.5.1 Environment E

The environment in this case consists of the dialogue manager (DM), the user simulator and the MCI classifier. The DM is composed of the reward and state generating functions. In previous works, a task manager, composed of a database and a query manager [15, 13], is used by the DM to generate observations in retrieval tasks. In our case, however, the user simulator and MCI classifier together play the role of the task manager and are used by the DM to generate observations. Here, the DM uses the MCI classifier to (1) predict probabilities for both the MCI and the NL classes based on the current moving average of skip-thought vectors at each turn, and (2) predict the label of the current user at the end of the episode for reward calculation. The result of (1) is also used by the agent as part of its internal state representation. The result of (2) is used by the DM for credit assignment for the generated conversational episode. The MCI classifier is trained separately on the training-set corpus before the dialogue system phase.

3.2.5.2 Action a𝑡 ∈ R𝑑

The RL-agent chooses its actions from a set of discrete actions consisting of 107 predefined questions, where each question is represented by a one-hot vector in R𝑑. It is worth noting that we use a𝑡 and q𝑡 to differentiate the actions taken by our RL-agent from the questions asked during the actual interviews, respectively.

3.2.5.3 State s𝑡 ∈ R𝐶

The state representation by the RL-agent is used to approximate the action-value function. There are five main components of the state representation vector:

• Skip-thought vector of the utterance at the current turn: 𝑓(𝑎𝑡−1; w𝑖), which is the output vector from user simulator 𝑓 given action 𝑎𝑡−1 at turn 𝑡.
• Moving average of the skip-thought vectors across all utterances in the current episode: $\bar{f}_t = \frac{1}{t}\sum_{k=1}^{t-1} f(a_k; \mathbf{w}_i)$.

• First hidden layer weights of the user simulator: w𝑖[:, 1].

• Predicted probabilities of the current user for the MCI and NL classes, output by the classifier.

• Number of turns above the confidence threshold: 𝜏.

The total dimension of the state vector is 𝐶 = 2𝑐 + |w𝑖[:, 1]| + 3 = 10115. At each turn, the DM queries the MCI classifier to output a probability vector composed of 𝑃(𝑦𝑖 = 0 | 𝑓̄𝑡) and 𝑃(𝑦𝑖 = 1 | 𝑓̄𝑡), where 𝑦 = 0 denotes NL and 𝑦 = 1 denotes MCI. This 2-dimensional vector keeps track of the classifier's confidence level for MCI prediction based on the current moving average of skip-thought vectors generated from turns 1, 2, ..., 𝑡. Keeping track of classifier confidence incentivizes the RL-agent to terminate the conversation as soon as it reaches a threshold level of confidence for the prediction task.

3.2.5.4 Reward 𝑟 ∈ R

Since we want to minimize the number of dialogue turns, we designed the environment to output a negative reward (-10) at every time step unless it reaches a terminal state (e.g., when the agent says "goodbye"). At the terminal state, the reward depends on the classification using the averaged skip-thought vector collected from this episode. If the existing classifier is able to make the correct prediction, the agent receives a positive reward (1000); otherwise it receives a moderately negative reward (-500). We also set the maximum episode length to 𝑇 = 35. Additionally, we added a linearly increasing penalty for each passing turn in which the classifier predicts with ≥ 0.65 probability for either class (MCI/NL). We denote the number of such turns above the confidence threshold by 𝜏. Formally, the reward function is defined as:

$$r = \begin{cases} -10 - 10\tau, & \text{for a non-terminal state,} \\ -500, & \text{terminal state with misclassification,} \\ +1000, & \text{terminal state with correct prediction.} \end{cases} \qquad (3.1)$$

3.2.5.5 State transitions

The state transition function has two parts:

• Within User. The state transition rule between turns is characterized by:

$$P^{\pi}_{s,s'} = \sum_{a \in A} P(s_{t+1} = s' \mid s_t = s, a_t = a, \pi) = \sum_{a \in A} \pi(a \mid s)\, P^{a}_{s,s'}.$$

Given a policy 𝜋, the probability of the environment transitioning to state 𝑠′ at 𝑠𝑡+1 depends only on the current state 𝑠𝑡. Internally, the DM utilizes the user simulator to generate the skip-thought vector 𝑓(𝑎𝑡; w𝑖) from 𝑎𝑡.

• Between Users. In addition to state transitions within episodes, the state-generating function changes between users, leading to different transition probabilities between similar states among different users. To capture this, we apply two changes when training the RL-agent on multiple users: (1) the first hidden layer weights w𝑖[:, 1] of each user are incorporated in the state representation vector so that the RL-agent can distinguish between dissimilar users. When used this way, the user simulator provides a means for the RL-agent to learn similar policies for similar users and dissimilar policies for dissimilar users. (2) During training, both the user simulator and the classifier of the training environment are reset between users by re-initializing the user simulator weights w𝑖 to correspond to the new user.
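The per-turn state update and the reward of Eqn. (3.1) can be summarized in a short sketch. The constants follow the text above; the function and variable names are illustrative, and the running mean here simply averages all responses collected so far.

```python
import numpy as np

MAX_TURNS, CONF_THRESHOLD = 35, 0.65

def step_reward(is_terminal, correct_prediction, tau):
    """Reward of Eqn. (3.1): per-turn penalty grows with tau; terminal reward depends on the classifier."""
    if not is_terminal:
        return -10 - 10 * tau
    return 1000 if correct_prediction else -500

def update_state(embeddings, new_embedding, class_probs, tau, sim_first_layer):
    """Assemble the state vector: current embedding, running mean, simulator weights, class probs, tau."""
    embeddings.append(new_embedding)
    running_mean = np.mean(embeddings, axis=0)
    tau += int(class_probs.max() >= CONF_THRESHOLD)   # count turns above the confidence threshold
    state = np.concatenate([new_embedding, running_mean, sim_first_layer, class_probs, [tau]])
    return state, tau

# Toy usage with assumed dimensions (c = 4800 skip-thought, 512 first-layer summary):
c, probs, tau = 4800, np.array([0.4, 0.6]), 0
state, tau = update_state([], np.zeros(c), probs, tau, np.zeros(512))
print(state.shape)   # (2*4800 + 512 + 2 + 1,) = (10115,), matching C above
```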
3.2.5.6 Deep Q-Networks (DQN)
In this work, the action-value function needs to estimate the expected reward based on the high-dimensional state representations described in the previous section. In order to approximate the action value given different users and the complicated internal state changes during the conversation, we learn a deep 𝑄-network parameterized by 𝜃𝑣 to tackle this challenging problem. The learning procedure can be conducted by optimizing the following loss function:

L(\theta_v) = \mathbb{E}_{\mathbf{s}_t, \mathbf{a}_t, r_t, \mathbf{s}'_{t+1}} \left[ \left( y_t - Q(\mathbf{s}_t, \mathbf{a}_t; \theta_v) \right)^2 \right], \quad (3.2)

with

y_t = r_t + \gamma \max_{\mathbf{a}'_{t+1}} Q(\mathbf{s}'_{t+1}, \mathbf{a}'_{t+1}; \theta'_v), \quad (3.3)

where 𝜃′𝑣 denotes the parameters of the target 𝑄-network. In order to learn the estimator under complex situations, two key ingredients were proposed in [29]: experience replay and a fixed target 𝑄-network. During training, the 𝑄-network (𝜃𝑣) is updated in an online fashion by performing gradient descent on Eqn. (3.2), while the target 𝑄-network (𝜃′𝑣) is held fixed to compute the target values as in Eqn. (3.3) and is only updated after a certain number of iterations, which is essential to the convergence of the 𝑄-network in this work. We also observe that when experience replay samples minibatches from previous experiences to update the 𝑄-network, training stabilizes more consistently.

3.2.5.7 Policy-masking
One challenge in our problem is creating an environment that can train the agent to produce responses which best align with the flow of conversations. For example, an agent may learn that the question “can you elaborate on that?” is useful for generating a wide distribution of words from the user, but it would not make sense to include that in the first sentence of a conversation or before relevant topics are introduced. To achieve this, we created a policy-modifying function in which confirmation and clarification type questions are masked from the policy set 𝜋 at turn 𝑡 if the action history of the agent from turns 1, 2, ..., 𝑡 − 1 does not include any questions from the social, activity, tech, picture-related, hobbies, occupation, travel, entertainment and family categories. At each turn, we keep track of an action history vector 𝜋𝑡 ∈ R𝑑 and construct a policy-masking vector 𝜑𝑡 ∈ R𝑑 to be applied element-wise over the agent's Q-value output. Specifically:

\varphi_t^{j} = \begin{cases} 0, & \text{if action } j \text{ is masked}, \\ 1, & \text{otherwise}, \end{cases} \quad (3.4)

Q'(\mathbf{s}_t) = \varphi_t \odot Q(\mathbf{s}_t),

where 𝜑𝑡^𝑗 denotes the 𝑗-th element of the policy-masking vector 𝜑𝑡, and 𝑄(s𝑡) ∈ R𝑑 represents the action values of all 107 available actions given the current state s𝑡. 𝑄′(s𝑡) is then the vector of valid action values after policy masking. To achieve effective masking, we ensure that the elements of 𝑄(s𝑡) are positive by using ReLU [30] as the activation function for the output layer of the Q-network and by a pre-training step on the Q-network as described in the following section.

3.2.6 Training the RL-Agent
We outline below the training procedure for our RL-agent. To expedite the learning process, we first train the RL-agent over the original corpus from the training set. For each user, we perform an initial pass through the entire corpus using the existing action history q_i^1, q_i^2, ..., q_i^T to generate episodes s_1, a_1, r_1, ..., a_t, r_t. We use these corpus-generated episodes to train the Q-estimator network. This initialization procedure is motivated by previous studies which have cited the effectiveness of pre-training with successful episodes so that the RL-agent can discover large terminal reward signals in games with delayed rewards [31].

Algorithm 1 RL-Training Protocol
Initialize replay memory D
Initialize Task Manager with classifier
Pre-train action-value function Q
for i = 1, ..., N do
    Initialize Environment E with User Simulator f_i
    Initialize E with true label for user i
    for episode = 1, ..., M do
        Reset E
        Get the initial state s_1
        for t = 1, ..., T do
            Obtain policy mask 𝜑_t as in Eqn. (3.4)
            With probability 𝜖 select a random action a_t, otherwise select a_t = argmax_a 𝜑_t ⊙ Q(s_t, a; 𝜃_v)
            Execute action a_t in E; observe reward r_t and state s_{t+1}
            Store transition (s_t, a_t, r_t, s_{t+1}) in D
            Sample a random minibatch of (s_j, a_j, r_j, s_{j+1}) from D
            if s_{j+1} is terminal then
                y_j = r_j
            else
                y_j = r_j + 𝛾 max_{a'} Q(s_{j+1}, a'; 𝜃'_v)
            end if
            Perform a gradient descent step on (y_j − Q(s_j, a_j; 𝜃_v))^2
        end for
    end for
end for
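For illustration, below is a minimal PyTorch sketch of the masked epsilon-greedy action selection and the DQN update used in Algorithm 1. The layer sizes, hyperparameters, and the random tensors standing in for replay-memory samples are placeholders only; the ReLU output layer follows the policy-masking requirement described in Section 3.2.5.7.

import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, EPS = 64, 107, 0.99, 0.1   # toy sizes; the thesis uses C = 10115

class QNetwork(nn.Module):
    """MLP Q-estimator; ReLU on the output layer keeps Q-values non-negative,
    so a masked (zeroed) action cannot be confused with a valid action value."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions), nn.ReLU(),
        )
    def forward(self, s):
        return self.net(s)

def select_action(q_net, state, mask, eps=EPS):
    """Masked epsilon-greedy action selection (Algorithm 1)."""
    valid = mask.nonzero(as_tuple=True)[0]
    if random.random() < eps:
        return valid[torch.randint(len(valid), (1,))].item()
    with torch.no_grad():
        q = q_net(state.unsqueeze(0)).squeeze(0) * mask    # Q'(s_t) = phi_t ⊙ Q(s_t)
    return int(q.argmax())

def dqn_update(q_net, target_net, optimizer, batch):
    """One gradient step on (y_j - Q(s_j, a_j; theta_v))^2 with a fixed target network."""
    s, a, r, s_next, terminal = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - terminal)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with random tensors standing in for the environment and replay memory
q_net, target_net = QNetwork(STATE_DIM, N_ACTIONS), QNetwork(STATE_DIM, N_ACTIONS)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
a0 = select_action(q_net, torch.randn(STATE_DIM), torch.ones(N_ACTIONS))
batch = (torch.randn(32, STATE_DIM), torch.randint(N_ACTIONS, (32,)),
         torch.randn(32), torch.randn(32, STATE_DIM), torch.zeros(32))
dqn_update(q_net, target_net, optimizer, batch)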
During training, we stabilize the target Q-network 𝜃′𝑣 for minibatch generation and transfer weights from the learning Q-network 𝜃𝑣 every 50 conversational episodes. During testing, we use the RL-agent to generate new actions a_i^1, a_i^2, ..., a_i^t for each test set user. New episodes are then generated by each user simulator from each new action set 𝜋𝑖 for prediction. These simulated episodes often differ from the original corpus in both the questions asked by the agent as well as in the skip-thought responses by the user.

3.3 Experiments
Evaluation of dialogue systems differs widely depending on the task. Previous works typically use metrics such as perplexity and averaged reward per response to measure the quality of the natural language generation (NLG) phase of the dialogue system [15, 13, 18]. However, because the utility of our framework comes from the quality of questions that the chat-bot generates for the off-conversational task, we propose an evaluation framework which emphasizes the agent's off-conversation performance. We gauge the utility of the dialogue system by its ability to (1) improve prediction accuracy against baseline techniques and (2) reduce the number of turns needed to make an accurate prediction.

3.3.1 Data
Data used for this study was obtained from a randomized controlled behavioral clinical trial to ascertain the effect of unstructured conversation on cognitive functions. Details of the study protocol were explained in [32]. In this clinical study, conversational data was collected in Q&A format for each participant during web-cam interviews with trained interviewers. Each participant was interviewed multiple times over the course of 4-6 weeks, and dialogue responses were transcribed for each interview session [12]. On average, there are 2.81 conversational episodes per participant, and each conversation lasted between 30 and 45 minutes [12, 32]. MCI labels were generated using clinical assessment of each participant's cognitive status by medical professionals [12, 32].

3.3.2 Baselines vs. RL Performance
We first compare the performance of several baseline classifiers for the MCI prediction task. For our specific dataset, [12] had previously achieved a benchmark performance of 72.5% AUC on 5-fold validation using a linear SVM with ℓ1-norm penalty and feature engineering with the Linguistic Inquiry and Word Count (LIWC) dictionary [12]. LIWC embeds each word into a 69-dimensional word vector space with each dimension representing a latent feature of the English language [12]. Since 2013, various contextual representations of words and sentences have been proposed, many of which have outperformed classical rule-based contextual embedding techniques [26, 25]. Distributed representations such as Word2Vec allow for more flexible and corpus-dependent latent features to be created for individual words [26].
More recently, Skip-thought vectors [25] have risen to popularity due to their ability to embed entire sentences into "thought vectors" that capture contextual meaning and syntactic information from neighboring sentences. For this reason, we compare various word and phrase embedding techniques to establish new baseline performances for our classification task.

Table 3.2: Performance of baseline vs. RL on MCI prediction on 10 stratified shuffle splits
Model  Feature  AUC  Sensitivity  Specificity  F1-Score
LR + ℓ1  RD  0.529 ± 0.132  0.380 ± 0.260  0.678 ± 0.105  0.361 ± 0.207
RFC  RD  0.519 ± 0.057  0.080 ± 0.098  0.944 ± 0.075  0.120 ± 0.149
SVM + ℓ1  RD  0.551 ± 0.131  0.380 ± 0.227  0.722 ± 0.102  0.384 ± 0.214
SVM + ℓ2  RD  0.560 ± 0.050  0.320 ± 0.256  0.800 ± 0.185  0.322 ± 0.193
MLP  RD  0.640 ± 0.193  0.110 ± 0.243  0.860 ± 0.189  0.162 ± 0.146
LR + ℓ1  W2V  0.638 ± 0.091  0.520 ± 0.204  0.756 ± 0.147  0.517 ± 0.127
RFC  W2V  0.564 ± 0.110  0.340 ± 0.220  0.789 ± 0.144  0.374 ± 0.189
SVM + ℓ1  W2V  0.651 ± 0.103  0.560 ± 0.233  0.756 ± 0.130  0.541 ± 0.147
SVM + ℓ2  W2V  0.598 ± 0.116  0.440 ± 0.233  0.756 ± 0.171  0.449 ± 0.205
MLP  W2V  0.680 ± 0.151  0.500 ± 0.500  0.511 ± 0.490  0.266 ± 0.266
LR + ℓ1  LIWC  0.703 ± 0.099  0.540 ± 0.237  0.867 ± 0.130  0.584 ± 0.152
RFC  LIWC  0.641 ± 0.135  0.360 ± 0.250  0.922 ± 0.087  0.445 ± 0.273
SVM + ℓ1  LIWC  0.661 ± 0.125  0.600 ± 0.200  0.722 ± 0.200  0.572 ± 0.144
SVM + ℓ2  LIWC  0.712 ± 0.110  0.680 ± 0.204  0.744 ± 0.180  0.631 ± 0.135
MLP  LIWC  0.689 ± 0.129  0.300 ± 0.458  0.767 ± 0.396  0.182 ± 0.285
LR + ℓ1  SKP  0.790 ± 0.112  0.680 ± 0.256  0.900 ± 0.116  0.707 ± 0.183
RFC  SKP  0.608 ± 0.104  0.260 ± 0.220  0.956 ± 0.054  0.343 ± 0.259
SVM + ℓ1  SKP  0.783 ± 0.123  0.700 ± 0.241  0.867 ± 0.171  0.711 ± 0.190
SVM + ℓ2  SKP  0.797 ± 0.122  0.660 ± 0.269  0.933 ± 0.102  0.716 ± 0.189
MLP  SKP  0.638 ± 0.138  0.600 ± 0.490  0.400 ± 0.490  0.316 ± 0.256
RL(T=1)  SKP  0.607 ± 0.109  0.380 ± 0.166  0.833 ± 0.134  0.447 ± 0.172
RL(T=3)  SKP  0.706 ± 0.092  0.500 ± 0.205  0.911 ± 0.097  0.583 ± 0.154
RL(T=5)  SKP  0.707 ± 0.072  0.480 ± 0.133  0.933 ± 0.102  0.594 ± 0.129
RL(T=10)  SKP  0.772 ± 0.115  0.600 ± 0.237  0.944 ± 0.102  0.683 ± 0.186
RL(T=15)  SKP  0.798 ± 0.115  0.640 ± 0.265  0.956 ± 0.102  0.714 ± 0.190
RL(T=20)  SKP  0.798 ± 0.121  0.640 ± 0.250  0.956 ± 0.102  0.719 ± 0.190
RL(T=25)  SKP  0.808 ± 0.111  0.660 ± 0.254  0.956 ± 0.102  0.732 ± 0.184
RL(T=30)  SKP  0.808 ± 0.119  0.660 ± 0.269  0.956 ± 0.102  0.730 ± 0.190
RL(T=35)  SKP  0.818 ± 0.102  0.680 ± 0.204  0.956 ± 0.102  0.761 ± 0.140
Here, LR denotes the sparse logistic regression classifier, RFC the random forest classifier, SVM support vector machines, and MLP the multi-layer perceptron. For feature representation of the corpus, RD represents the raw distribution of word counts. W2V denotes averaged 300-dimension Word2Vec embeddings across all words appearing in the corpus for each user [26]. LIWC denotes the original rule-based embedding used by [12]. SKP denotes averaged 4800-dimension Skip-Thought vectors across all turn-based responses for each user [25].

The first four sections of Table 3.2 show the performance of these baseline classifiers. Using the original LIWC representation, we were able to recover close to the 72.5% AUC baseline from the original paper using the SVM and LR classifiers. When implementing the skip-thought embedding, we used pre-trained skip-thought encoders by [25] to embed each user response across all conversational turns. The encoder was pre-trained on the BookCorpus dataset, which is a large collection of novels pertaining to numerous literary genres.
The advantage of pre-training on this dataset is that BookCorpus contains an abundant number of turn-based dialogues between various character types. These conversations capture a wide range of conversational response styles, idiosyncrasies and temperaments. As seen in Table 3.2, the best performing baseline model was the SVM classifier with ℓ2 norm, using Skip-Thought embeddings as features. For this reason, we choose this classifier for the RL portion of our pipeline. As a baseline reference, we also included performance using raw word count distributions for all models.

We then evaluate the performance of our RL-agent across 10 stratified shuffle splits. Each split uses 65% of the data for training and 35% for testing. We compare the performance of the RL-agent when manually restricting the number of questions to 1, 3, 5, 7, 10, 15, 20, 25, 30 and 35. By restricting the number of turns, we can observe the number of questions needed to recover the original baseline performance using the SVM classifier.

Figure 3.3: RL-Agent vs. Baseline w/ Variation on Turns.

The last section of Table 3.2 illustrates the performance of the RL-agent under various turn constraints. Here, the notation RL(T=𝑡) denotes the number of questions the agent is allowed to ask before a prediction is made from the simulated user responses. It is important to note that turn 0 was set to greetings by default and was not counted toward the conversation. We see from the constraint conditions that the performance of our RL-agent began to surpass baseline performance at 25 questions and was able to achieve comparable performance using only 15 questions. At the full conversation length of 35 turns, we were able to achieve 0.818 AUC, an improvement upon current and previous baselines. In comparison, the mean number of conversational turns per user in the original corpus was 105.71. Additionally, since 2.81 conversations were conducted per user, we adjusted the number of turns allowed based on the mean number of turns per conversation, which was 37.36 per user. For this reason, we set the upper bound constraint to 35 questions, which is just slightly less than a full conversation with the user. Figure 3.3 visualizes this relationship between performance and the number of questions asked by the RL-agent. We see that performance improvements with additional questions saturate after 15 questions. This was expected, as the highest-yield questions discovered by the RL-agent were asked first during test conversations.

3.3.3 Evaluation of User Simulators
User simulators serve the pivotal role of simulating the user response in the RL training environment [18, 33]. In previous works, user simulators are evaluated based on the accuracy of the generated user query to unseen bot responses [33, 18]. Metrics such as BLEU and perplexity are used at the NLG phase of dialogue, as the generation of the user query is pivotal in retrieval-type training systems. In our case, however, the goal of the user simulator is quite different; the RL-agent is responsible for generating queries, while the output from the user simulator is an encoded thought-vector of the user response, which is then used for state representation and downstream prediction purposes. For this reason, we evaluate the performance of the user simulator not on the decoding portion of the dialogue system, but rather on its ability to generate an accurate thought-vector version of the responses.
We compute the mean-squared error (MSE) between the corpus Skip-Thought vector and the user simulator's prediction at each turn. The resulting MSE scores are averaged across all turns of the conversation. Given that each user has on average 2.81 conversations, we evaluate the performance of the user simulator in a leave-one-out fashion: for each user, the simulator is trained on all conversations except for the last one, which is used for evaluation.

Figure 3.4: Distribution of mean squared error (MSE) across all user simulators.

Figure 3.4 visualizes the performance of the user simulators. The mean MSE is 0.00495 ± 2.93E-06, averaged across all test set performances.

3.3.4 Top-Performing Policies
It is interesting to note that the simulated episodes by our RL-agent were able to provide a performance boost for the prediction task. In this section, we look qualitatively at the types of questions asked at 5, 10, 15, 20 and 35 turns by the RL-agent in comparison with the original corpus. We also compare the performance of 𝜋∗@5, @10, @20, @30 and @35 with the performance using the first 5, 10, 20, 30 and 35 responses of the original corpus. Again, we note that responses to greeting and parting queries such as “Hi” and “goodbye” are not counted toward prediction. As shown in Table 3.3, the optimal policy 𝜋∗ learned by our framework outperformed the original corpus for each turn constraint. For example, when our RL-agent asked only 5 questions to test set users, the classifier was able to achieve 0.707 AUC and 0.594 F1 using the simulated responses. In contrast, using the first 5 questions from the original corpus for each test set user produced 0.504 AUC and 0.175 F1.

Table 3.3: Prediction @5, 10, 20, 30 and 35 Turns
Model  AUC  Sen  Spec  F1-Score
Corpus@5  0.504 ± 0.070  0.120 ± 0.098  0.889 ± 0.099  0.175 ± 0.145
Corpus@10  0.513 ± 0.076  0.160 ± 0.174  0.867 ± 0.130  0.193 ± 0.200
Corpus@20  0.614 ± 0.077  0.340 ± 0.254  0.889 ± 0.131  0.382 ± 0.223
Corpus@30  0.658 ± 0.121  0.360 ± 0.233  0.956 ± 0.056  0.460 ± 0.266
Corpus@35  0.699 ± 0.125  0.420 ± 0.244  0.978 ± 0.044  0.539 ± 0.248
𝜋∗@5  0.707 ± 0.072  0.480 ± 0.133  0.933 ± 0.102  0.594 ± 0.129
𝜋∗@10  0.772 ± 0.115  0.600 ± 0.237  0.944 ± 0.102  0.683 ± 0.186
𝜋∗@20  0.798 ± 0.121  0.640 ± 0.250  0.956 ± 0.102  0.719 ± 0.190
𝜋∗@30  0.808 ± 0.119  0.660 ± 0.269  0.956 ± 0.102  0.730 ± 0.190
𝜋∗@35  0.818 ± 0.102  0.680 ± 0.204  0.956 ± 0.102  0.761 ± 0.140

When using the first full-length conversation with 35 turns, the original corpus recovers an AUC score of 0.699, which is far from the performance of 𝜋∗@35. In Table 3.4, we rank the most frequently appearing questions in 𝜋∗@5, 𝜋∗@10 and 𝜋∗@20.

Table 3.4: Most frequently asked questions in 𝜋∗@5, 10, 15 and 20 Turns
Turns  Question  Count
1-5  when did you start working?  40
1-5  so how long did you go out for?  37
1-5  when did you meet your SO?  28
1-5    24
1-5  what did you like about ?  24
6-10  what was  like for you?  30
6-10    28
6-10  when did  start?  22
6-10  what do you see in this picture?  19
6-10    19
10-15  what is your opinion on ?  42
10-15  did you see any shows lately?  38
10-15  how many people do you think can fit in this?  33
10-15  what you were doing during this time period?  30
10-15  what type of  do you do?  28
15-20    27
15-20  where did you meet your so?  25
15-20  did you enjoy school?  24
15-20  anyone visit you lately?  24
15-20  what was the show about?  20

𝜋∗@5. The most effective question in 𝜋∗@5 appears to be “when did you start working”. In the context of our problem, this question seems to generate the most polarizing responses from the cohort.
We also see that the RL-agent included a few elaboration questions such as “what did you like about ” and “why did you do that,” for some users to expand upon previous responses. From the clinical perspective, it is also interesting to note that the RL-agent picks questions such as “what did you do yesterday” and “how long did you go out for,” which are similar to questions used clinically to assess immediate recall in MCI patients [34].

𝜋∗@10. As seen in 𝜋∗@5, occupational questions were the most popular topic asked by the RL-agent. This is also the case with 𝜋∗@10, where the RL-agent follows up the previous query with an elaboration question regarding past occupational experiences. It is interesting to note that the RL-agent transitions to picture-related questions, which are often used by the clinical interviewers to facilitate creative responses by participants [12]. We also observe the RL-agent asking questions such as “” and “when did  start”. These were frequently asked questions during the course of the original dialogue, as technical difficulties were often encountered with connection and webcam issues during the interviews [12]. Unfortunately, the responses vary greatly and may at times generate verbose responses from participants. The RL-agent did not seem to be able to recognize this caveat during training.

𝜋∗@20. As we approach questions 11 through 20, we arrive at mid- to late-dialogue for most conversations. Overall, we observe more widespread topics during this portion of the conversation. The most polarizing question asked at this stage was “what is your opinion on ?” Here, we used delexicalised slots [22] to reduce model complexity, but the slots may be substituted with a wide range of social topics from political trends to recent news. Additionally, we observe that the RL-agent learns to say “goodbye” to terminate the conversation early in numerous cases. As mentioned previously, we designed the state function to include the predicted probability [0.0-1.0] of MCI by the classifier at each time step. The environment penalizes the agent for additional turns in which the prediction probability exceeds 0.65 for either class. By opting to terminate the episode, the RL-agent learns to avoid dragging on the dialogue unnecessarily in cases where it is confident in the prediction. One notable question in 𝜋∗@20 is “how many people do you think can fit in this?” This is actually a picture-specific question related to one of the more provocative pictures. In fact, we confirmed from the original corpus that it generated more follow-up responses from users when compared to other picture-related questions such as “when do you think this picture was taken?” and “interesting, what makes you say that?”. By ranking this question highly, the RL-agent indirectly prioritizes this picture over others in generating user responses. This exemplifies how the ranking of questions by 𝜋∗ may be used to direct the future data collection process.

𝜋∗@35. When approaching the end of conversations, we notice that the questions asked by the agent were more spread out among the remaining choices. For this reason, we rank only the top 10 questions during the final 15 turns of simulated conversations.

Table 3.5: Top 10 ranked questions in the final 15 turns of conversations
Rank  Question  Count
1  what is your opinion on using ?  112
2  did you do anything else?  106
3  so how long did you go out for?  98
4  what you were doing during this time period?  95
5  when do you think this picture was taken?  95
6    94
7  anything new with you lately?  91
8  what did you like about it?  85
9    76
10  how often do you ?  72

In this latter portion of 𝜋∗, we note that the RL-agent utilized more elaboration questions such as “what do you like about it” and “how often do you ”. We also see that technology-related questions such as “what is your opinion on using ” are included more often when compared to topics such as occupation or social items. This indicates that tech-related questions may not be as high-yield in distinguishing MCI responses, as these questions are prioritized later during the conversation by the RL-agent.

3.4 Discussion and Conclusion
In this chapter, we introduce an RL framework for approaching a classically supervised learning problem in clinical medicine, where the data is often noisy, scarce, and prohibitively expensive to obtain. We show that a properly trained RL framework can (1) greatly cut down on the amount of data needed to make accurate predictions, and (2) synthesize relevant new data to improve performance. To achieve this framework, we proposed a multi-step approach which capitalizes on the vast existing knowledge of the human language and NLP research. First, we used a state-of-the-art distributed representation to preprocess our data. We then set up a simulation environment for reinforcement learning, using supervised learning to create customized user simulators. Lastly, we utilize the trained RL-agent to generate new questions from 𝜋∗ to obtain more targeted responses for our prediction task. A careful examination of the optimal policies discovered by our agent demonstrates that the overall framework is self-contained for directing dialogue generation for diagnostic screening, which can potentially replace the need for trained interviewers. Our trained RL-agent is able to discover relevant questions to ask users with whom it has no prior interaction experience. We also show various clinical insights which could be deduced from observing the ranking of questions in 𝜋∗ at various turn constraints. In order for this framework to be effectively deployed in a realistic setting, a user simulator that could be trained online and in real time should be considered. In its current form, our user simulators are trained offline, which may not be scalable to larger corpora and user volumes. Additionally, a natural language generation phase may be needed to make the questions more adaptable to the natural flow of human conversation. These will be areas of research we will explore in future studies.

BIBLIOGRAPHY

[1] J Olazaran, Rubén Muñiz, B Reisberg, J Peña-Casanova, T Del Ser, AJ Cruz-Jentoft, P Serrano, E Navarro, ML García de la Rocha, A Frank, et al. Benefits of cognitive-motor intervention in mci and mild to moderate alzheimer disease. Neurology, 63(12):2348–2353, 2004.
[2] Jeffrey L Cummings, Rachelle Doody, and Christopher Clark. Disease-modifying therapies for alzheimer disease: challenges to early intervention. Neurology, 69(16):1622–1634, 2007.
[3] Serge Gauthier, Barry Reisberg, Michael Zaudig, Ronald C Petersen, Karen Ritchie, Karl Broich, Sylvie Belleville, Henry Brodaty, David Bennett, Howard Chertkow, et al. Mild cognitive impairment. The Lancet, 367(9518):1262–1270, 2006.
[4] Keith A Johnson, Nick C Fox, Reisa A Sperling, and William E Klunk. Brain imaging in alzheimer disease. Cold Spring Harbor Perspectives in Medicine, 2(4):a006213, 2012.
[5] D Heister, James B Brewer, Sebastian Magda, Kaj Blennow, Linda K McEvoy, Alzheimer's Disease Neuroimaging Initiative, et al. Predicting mci outcome with clinically available mri and csf biomarkers. Neurology, 77(17):1619–1628, 2011.
[6] Clifford R Jack Jr, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. Hypothetical model of dynamic biomarkers of the alzheimer's pathological cascade. The Lancet Neurology, 9(1):119–128, 2010.
[7] Liang Zhan, Yashu Liu, Yalin Wang, Jiayu Zhou, Neda Jahanshad, Jieping Ye, and Paul Matthew Thompson. Boosting brain connectome classification accuracy in alzheimer's disease using higher-order singular value decomposition. Frontiers in Neuroscience, 9:257, 2015.
[8] Qi Wang, Liang Zhan, Paul M Thompson, Hiroko H Dodge, and Jiayu Zhou. Discriminative fusion of multiple brain networks for early mild cognitive impairment detection. In ISBI, pages 568–572. IEEE, 2016.
[9] Carol Dillon, Cecilia M Serrano, Diego Castro, Patricio Perez Leguizamón, Silvina L Heisecke, and Fernando E Taragano. Behavioral symptoms related to cognitive impairment. Neuropsychiatric Disease and Treatment, 9:1443, 2013.
[10] Robert M Chapman, Mark Mapstone, John W McCrary, Margaret N Gardner, Anton Porsteinsson, Tiffany C Sandoval, Maria D Guillily, Elizabeth DeGrush, and Lindsey A Reilly. Predicting conversion from mild cognitive impairment to alzheimer's disease using neuropsychological tests and multivariate methods. Journal of Clinical and Experimental Neuropsychology, 33(2):187–199, 2011.
[11] Hiroko H Dodge, Nora Mattek, Mattie Gregor, Molly Bowman, Adriana Seelye, Oscar Ybarra, Meysam Asgari, and Jeffrey A Kaye. Social markers of mild cognitive impairment: Proportion of word counts in free conversational speech. Current Alzheimer Research, 12(6):513–519, 2015.
[12] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017.
[13] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.
[14] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.
[15] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35, 2017.
[16] Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. Reinforced mnemonic reader for machine reading comprehension. arXiv preprint arXiv:1705.02798, 2017.
[17] Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073, 2017.
[18] Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review, 21(2):97–126, 2006.
[19] Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker.
Optimizing dialogue management with reinforcement learning: Experiments with the njfun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.
[20] Xuijun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008, 2017.
[21] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683, 2016.
[22] Bing Liu and Ian Lane. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. arXiv preprint arXiv:1708.05956, 2017.
[23] Matthew Henderson, Blaise Thomson, and Steve Young. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In SLT, pages 360–365. IEEE, 2014.
[24] Joyce Y Chai, Rui Fang, Changsong Liu, and Lanbo She. Collaborative language grounding toward situated human-robot dialogue. AI Magazine, 37(4), 2016.
[25] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[27] Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688, 2016.
[28] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[31] Charles W Anderson, Minwoo Lee, and Daniel L Elliott. Faster reinforcement learning after pretraining deep networks to predict state dynamics. In IJCNN, pages 1–7. IEEE, 2015.
[32] Hiroko H Dodge, Jian Zhu, Nora C Mattek, Molly Bowman, Oscar Ybarra, Katherine V Wild, David A Loewenstein, and Jeffrey A Kaye. Web-enabled conversational interactions as a method to improve cognitive functions: Results of a 6-week randomized controlled trial. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 1(1):1–12, 2015.
[33] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016.
[34] Marshal F Folstein, Susan E Folstein, and Paul R McHugh. “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–198, 1975.

CHAPTER 4
SAMPLE EFFICIENT LEARNING WITH PRIVILEGED INFORMATION

4.1 Introduction
In classical supervised learning, the learner is presented with the training tuple \{(x_i, y_i)\}_{i=1}^{m} and performs an optimization task of finding the best model in a hypothesis space ℎ : 𝑋 → 𝑌 to approximate some true 𝑓 : 𝑋 → 𝑌 which explains the data. Given a new task, knowledge transfer [1] is often applied to accelerate the learning process by distilling and transferring relevant knowledge from previous tasks to the unseen one.
Under classical formulations, the learner incorporates prior information in one of several ways:
• Direct transfer of parameters from old hypothesis models to the new task and fine-tuning [1] the parameters.
• Learning multiple tasks (online or batched) related to the current task [2, 3].
• Using prior knowledge (i.e. a knowledge graph) to constrain the hypothesis space by regularization [4].
• Using representations (i.e. embeddings) of 𝑋 and/or 𝑌 from previous tasks for new tasks [5, 6].
• Accelerating learning and compressing models by distillation, as typically seen in Teacher-Student models [7].

In each of these settings, knowledge transfer operates directly within the 𝑋, 𝑌 and H spaces to improve generalization of information from old models to the new task. Recently, Learning Using Privileged Information (LUPI) [8] has provided a new paradigm for knowledge transfer. Under LUPI, the learner now interacts with a Teacher who provides privileged information (PI), which is available exclusively at training time. From the learner's perspective, the training set is now extended to the tuple \{(x_i, x_i^*, y_i)\}_{i=1}^{m}, and the test set stays the same. Some examples of PI include: (1) future information that relates 𝑋 and 𝑌, for example using future stock prices beyond the prediction window during training; and (2) auxiliary information describing the label space that is available only to a subset of samples, for example physician notes that accompany diagnostic predictions, which are only available after the diagnosis is made. At a high level, PI provides some similarity information between training samples from the original feature space, and the Teacher hypothesis serves as additional “explanations” of the hypothesis space [9, 8]. As a result, [9] showed that the LUPI Teacher provides a principled way to improve the generalization error of Student learners using agnostic PAC models, providing some theoretical improvements in the number of samples required to generalize to test set data (i.e. it improves sampling efficiency). However, under current state-of-the-art LUPI formulations such as [10] and [11], PI is incorporated by means of support vectors and dropout schemes, both of which fail to explore the underlying similarity structure between examples in the PI space 𝑋∗. For example, the mode distribution and pairwise similarity between points in the 𝑋∗ space are largely unused. The PI contributes as auxiliary training features and kernel information, but much of the LUPI information is lost at inference time and beyond. A significant question remains: can privileged information be retained for future tasks? Ideally, we want the LUPI Teacher to incorporate PI in a way that is specific enough to inform similarity between training samples yet general enough to be retained across future tasks. As a motivating example, consider the medical setting, where electronic health records (EHR) are often sparse, noisy, and full of missing data. Complex tasks such as multi-task learning of many diseases are often difficult because of the long-tail property of diseases – that is, diseases with very few training samples (i.e. < 100) are very difficult to learn using EHR features alone. On the other hand, medical research on rare diseases is often plentiful – large volumes of clinical journals focus on text descriptions of rare diseases in the medical setting.
Clinical texts such as discharge notes are unavailable at inference time, but when used retrospectively during training they can serve as a source of PI that allows rare diseases to be learned from few examples. In this work, we propose a LUPI formulation that achieves precisely this. First, we introduce the idea of building a vocabulary of PI features by unsupervised learning using external data sources. We then propose a mechanism for learning a joint representation between the PI information and the original set of example features by exploiting their co-occurrence statistics in the training data. We finally learn a shared decision function using a contrastive loss to distinguish between samples drawn from the joint latent space based on their labels for each task. In experiments, we demonstrate the effectiveness of our method in retaining PI obtained from external data sources to support multi-task prediction tasks in the EHR setting against other transfer learning methods. We demonstrate that such an approach both improves prediction accuracy and decreases the number of samples required to train an accurate model, especially for rare diseases.

4.2 Methodology
At a high level, the main intuition behind our proposed method is to decompose the LUPI process into three parts:
1. Build a dictionary of PI features and learn a distributed representation [5] over the PI vocabulary.
2. Find a joint representation space (Φ) between the PI and example features.
3. Jointly learn the decision functions ℎ∗ : Φ → 𝑌 by feature-matching in the joint representation space.

The first process uses unsupervised learning to embed the PI vocabulary into a vector space. The second process allows for some of the privileged information to be retained at inference time, despite not having direct access to the PI vectors. The third process allows PI information for one task to be transferred to the other 𝐶 − 1 tasks in the label space. In the following subsections, we will examine how to achieve (1) – (3) in detail. We also provide analysis of how (3) can maintain the favorable LUPI sample efficiency.

4.2.1 Building the PI vocabulary
First, we can define 𝑔∗(𝑤𝑗; 𝜃𝑔∗) as an embedding function that maps 𝑔∗ : 𝑋∗ → Φ. Note that 𝑥𝑖∗ consists of individual words, {𝑤1, ..., 𝑤𝑘}. So 𝑔∗(𝑤𝑗; 𝜃𝑔∗) embeds each individual word in the PI vocabulary rather than the PI samples (i.e., 𝑥𝑖∗). The rationale behind 𝑔∗(.) is to encode each word in the PI vocabulary into a vector space so that vector operations can be applied to the PI. We specifically consider embedding functions of the form,

g^*(w_j; \theta_{g^*}) = w_j^{T} \theta_{g^*}. \quad \text{(PI Embedding)}

Since each 𝑤𝑗 ∈ {0, 1}𝑑 has 𝑤𝑗𝑘 = 1 only when 𝑗 = 𝑘, the 𝑤𝑗 vector simply selects the 𝑗-th column in 𝜃𝑔∗. We restrict 𝜃𝑔∗ ∈ R𝑑×𝑘 so that 𝜃𝑔∗ produces a lower-dimensional representation of each word in the PI vocabulary. For this first step, we do not restrict the PI to come from the original dataset {(𝑥𝑖, 𝑥𝑖∗, 𝑦𝑖)}. In fact, we can learn the embedding 𝜃𝑔∗ for our PI using any data source by applying the following word model:

Score(w_1, w_2) = \begin{cases} 1 & \text{if } w_1, w_2 \in x_i^* \\ 0 & \text{otherwise} \end{cases} \quad \text{(Co-occurrence)}

g(w_1, w_2) = \sigma\{(\theta_{g^*} w_1)^{T} (\theta_{g^*} w_2)\} \quad \text{(Word Model)}

\mathcal{L}_{emb} = BCE(Score(w_1, w_2), g(w_1, w_2)). \quad (2)

Here, 𝜎(.) denotes the Sigmoid activation function, and 𝐵𝐶𝐸(.) denotes the binary cross-entropy loss: -\sum_i [a_i \log b_i + (1 - a_i)\log(1 - b_i)]. The cross-term (\theta_{g^*} w_1)^{T}(\theta_{g^*} w_2) gives the similarity between 𝑤1 and 𝑤2 in the embedding space, which is then scored against Score(𝑤1, 𝑤2) based on whether 𝑤1, 𝑤2 both appear in 𝑥𝑖∗. We note that the embedding loss \mathcal{L}_{emb} is trained separately from the rest of the LUPI model since it is not specific to the dataset.
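As an illustration of the word model in Eqn. (2), here is a minimal PyTorch sketch; the vocabulary size, embedding width, and the randomly generated word pairs and co-occurrence labels are placeholders rather than the settings used in the experiments of Section 4.4.

import torch
import torch.nn as nn

d, k = 5000, 128          # toy PI vocabulary size and embedding width

class PIWordModel(nn.Module):
    """Scores a word pair by the inner product of their embeddings (Word Model)."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.theta = nn.Embedding(vocab_size, emb_dim)   # rows play the role of theta_g*

    def forward(self, w1_idx, w2_idx):
        e1, e2 = self.theta(w1_idx), self.theta(w2_idx)
        return torch.sigmoid((e1 * e2).sum(dim=-1))      # sigma{(theta w1)^T (theta w2)}

model = PIWordModel(d, k)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCELoss()

# toy batch: word-index pairs and their co-occurrence labels Score(w1, w2)
w1 = torch.randint(d, (256,))
w2 = torch.randint(d, (256,))
score = torch.randint(0, 2, (256,)).float()

loss = bce(model(w1, w2), score)                         # L_emb in Eqn. (2)
optimizer.zero_grad(); loss.backward(); optimizer.step()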
4.2.2 Learning the Joint Representation
Next, let us define 𝑔(𝑥𝑖; 𝜃𝑔) to be the embedding function that maps 𝑔 : 𝑋 → Φ. In this work, we consider the following for 𝑔(.) and 𝑔∗(.):
• Distributed representation of the PI vocabulary, which captures its underlying manifold structure and is obtained by unsupervised learning [5].
• Encoding of 𝑋 into a fixed-length vector by deep embedding methods such as [12].

For time-series 𝑋, we take the embedding function to be a recurrent encoder neural network:

g(x_i; \theta_g) = RNN(x_i; \theta_g) \quad \text{(EF Embedding)}

The motivation behind using the embedding functions 𝑔(𝑥) and 𝑔∗(𝑤) is to extend the idea of Student and Teacher kernels, which allow the privileged information to provide information about similarity between training samples in the feature space [9]. Using neural encoding for 𝑔(𝑥; 𝜃𝑔) allows such feature spaces to be represented by a fixed-length vector without losing the underlying spatio-temporal information. To find commonality between example features and PI, we introduce a matching function (𝜇) that maps each (𝑔(𝑥𝑖), 𝑔∗(𝑤𝑗)) pair onto the interval [0, 1], i.e., 𝜇 : Φ × Φ → [0, 1]^d:

\mu(g(x_i), g^*(w_j); A) = \frac{\exp(\max\{0, [g(x_i); g^*(w_j)]^{T} A\})}{\sum_{p=1}^{d} \exp(\max\{0, [g(x_i); g^*(w_p)]^{T} A\})}. \quad (3)

Here, [𝑔(𝑥𝑖); 𝑔∗(𝑤𝑗)] ∈ R^{2k} denotes the concatenation of the 𝑔(𝑥𝑖) and 𝑔∗(𝑤𝑗) embeddings in the joint latent space. The parameter matrix 𝐴 projects the pairs (𝑔(𝑥𝑖), 𝑔∗(𝑤𝑗)) onto R, and the softmax activation normalizes the pairwise scores against other word-pairs in the PI vocabulary. Thus, for each sample mapped from the feature space 𝑋, the matching function 𝜇 produces a set of corresponding weights over all of the words in the PI vocabulary. We make the key observation that for each word 𝑤𝑗, the output weight of the matching function should correspond to the 𝑗-th component of the 𝑥𝑖∗ sample in the training data. That is, 𝜇(𝑔(𝑥𝑖), 𝑔∗(𝑤𝑗); 𝐴) ≈ 𝑥𝑖𝑗∗. Using this fact, we can learn the matching function by minimizing the following objective:

\mathcal{L}_{\phi}(\theta_g, \theta_{g^*}, A) = -\frac{1}{md} \sum_{i=1}^{m} \sum_{j=1}^{d} \left[ x_{ij}^{*} \log \mu_i(g(x_i), g^*(w_j); A) + (1 - x_{ij}^{*}) \log\left(1 - \mu_i(g(x_i), g^*(w_j); A)\right) \right]. \quad (4)

Each component of the PI vector 𝑥𝑖∗ can thus be interpreted as providing an indicator label for the likelihood of the (𝑥𝑖, 𝑤𝑗) pair occurring together. The similarity control mechanism highlighted in Eqn. 4 differs from the kernel-matching mechanism mentioned previously in [9]. The limitation of kernel-matching [9] is that two sets of kernel weights need to be learned simultaneously: 𝛼 for 𝐾(𝑥𝑖, 𝑥) and 𝛽 for 𝐾∗(𝑥𝑖∗, 𝑥∗). By contrast, our joint representation for 𝑥𝑖 and 𝑥𝑖∗ encourages a single hypothesis model to be used to map ℎ∗ : Φ → 𝑌. Since the matrix 𝐴 captures the 𝑔(𝑥𝑖) and 𝑔∗(𝑤𝑗) interactions, it preserves the PI in the space of Φ and allows relevant PI to be retrieved at test time. Finally, we obtain the augmented representation of 𝑥𝑖 as a weighted combination of G∗ = \{g^*(w_j)\}_{j=1}^{d} and 𝑔(𝑥𝑖):

\phi(x_i) = g(x_i) + \sum_{j=1}^{d} \mu_i(g(x_i), g^*(w_j); A) \cdot g^*(w_j). \quad (5)

One can think of G∗ as the set of basis vectors supporting the PI space (similar to frames for the PI Kernel [9]). The augmented representation 𝜙(𝑥𝑖) contains both information from the original 𝑥𝑖 as well as relevant information retrieved from G∗. Note that since the PI vectors 𝑥𝑗∗ ∈ 𝑋∗ are not directly used at testing time, each sample 𝑥𝑖 ∈ D_test is mapped into Φ using 𝑔(.), and the trained 𝜇(.) selects the corresponding bases in G∗ to construct 𝜙𝑖. This is quite different from representation fusion methods [6, 13], which only try to learn a shared representation space for input modalities 𝑋1, ..., 𝑋𝑘, without a matching function to control the contribution of each modality to the hypothesis. For example, we can take 𝑋1 to be the original feature space and 𝑋2 to be the privileged information. At test time, when 𝑋2 is unavailable, 𝑋1 inputs with masked 𝑋2 components may be projected into a completely different location in the shared representation space than if the 𝑋2 information were available. Furthermore, model fusion methods [13] may also under-utilize the original feature space during training, as the PI contains more information related to the target task.
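To make the retrieval mechanism concrete, the following is a small PyTorch sketch of the matching weights in Eqn. (3) and the augmented representation in Eqn. (5); the GRU encoder is a stand-in for the recurrent EF encoder, and all dimensions are illustrative.

import torch
import torch.nn as nn

d, k, T, n_feat = 928, 128, 48, 40     # PI vocab size, embedding width, time steps, EF channels

class AugmentedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(n_feat, k, batch_first=True)    # g(x; theta_g): EF Embedding
        self.pi_emb = nn.Embedding(d, k)                   # g*(w; theta_g*): PI Embedding
        self.A = nn.Linear(2 * k, 1, bias=False)           # projection matrix A in Eqn. (3)

    def forward(self, x):
        _, h = self.rnn(x)                                 # encode the time series
        gx = h.squeeze(0)                                  # (batch, k)
        gw = self.pi_emb.weight                            # (d, k), one row per PI word
        # score every (g(x_i), g*(w_j)) pair: concatenate, project with A, clamp at 0
        pairs = torch.cat([gx.unsqueeze(1).expand(-1, d, -1),
                           gw.unsqueeze(0).expand(gx.size(0), -1, -1)], dim=-1)
        scores = torch.relu(self.A(pairs)).squeeze(-1)     # max{0, [g(x); g*(w_j)]^T A}
        mu = torch.softmax(scores, dim=-1)                 # matching weights, Eqn. (3)
        phi = gx + mu @ gw                                 # augmented representation, Eqn. (5)
        return mu, phi

encoder = AugmentedEncoder()
mu, phi = encoder(torch.randn(4, T, n_feat))               # toy batch of 4 patients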
4.2.3 Coupling Decision Functions with Feature Matching
LUPI typically considers two hypothesis functions: the Student hypothesis ℎ : 𝑋 → 𝑌, and the Teacher hypothesis 𝑓∗ : 𝑋∗ → 𝑌. Since we have already addressed the problem of finding a common "frame of reference" between the original feature space and the PI space by the matching function 𝜇, the main focus of this portion of our method is finding an efficient 𝑓∗ that relates the privileged information to the labels. Fortunately, we can directly approximate 𝑓∗ : 𝑋∗ → 𝑌 by a function ℎ∗ : Φ → 𝑌 that maps samples from the joint representation space to the label space. This is because Φ is constructed by the embedding function 𝑔∗(.) on 𝑋∗ and is an approximation of the kernel space for the privileged information. In the case that the target task is classification, we can formulate ℎ∗ as a feature matching problem between samples from Φ and 𝑌. Specifically, we can use a contrastive loss [14] to find an invariant representation ℎ∗ : Φ → 𝑌 and vice versa, by minimizing the distance between similar samples drawn from the joint embedding space based on signals from the label space:

L(W, y_i, x_i^{+}, x_i^{-}) = (1 - y_i)\, \mathcal{M}_W(x_i^{+}, x_i^{-}) + y_i \left( \max\{0,\, C - \mathcal{M}_W(x_i^{+}, x_i^{-})\} \right) \quad \text{(Contrastive)}

\mathcal{L}(W) = \frac{1}{S} \sum_{i=1}^{S} L(W, (y_i, x_i^{+}, x_i^{-})_i), \quad (6)

where \mathcal{M}_W(x^{+}, x^{-}) = \lVert \phi(x^{+}; W) - \phi(x^{-}; W) \rVert_2 refers to a parameterized distance metric with respect to the projections 𝜙(𝑥+) and 𝜙(𝑥−), and 𝐶 is the slack variable which defines the margin of separation between them. 𝜙(.) is simply the projection function from Eqn. 5, which is parameterized by 𝑊 = [𝜃𝑔, 𝜃𝑔∗, 𝐴] from Eqn. 4. Intuitively, \mathcal{M}_W gives the distance between the augmented projections of 𝑥𝑖+ and 𝑥𝑖−, i.e., 𝜙(𝑥𝑖+) and 𝜙(𝑥𝑖−), which are compared by their labels 𝑦𝑖. Given a training pair (𝑥𝑖, 𝑥𝑖∗, 𝑦𝑖), a set of 𝑘 similarity samples 𝑆𝑖 = \{(x_i^{+}, x_i^{-}, y_i)_j\}_{j=1}^{k} is constructed around the (𝑥𝑖, 𝑦𝑖) pair, whereby 𝑥𝑖+ denotes samples with the same label as 𝑦𝑖 (𝑦+ = 𝑦𝑖), and 𝑥𝑖− denotes samples with a different label (𝑦− ≠ 𝑦𝑖). Thus, 𝑥+ and 𝑥− denote positive samples (similar) and negative samples (dissimilar), respectively. A variety of negative sampling techniques can be used to obtain the set 𝑆 [15, 14, 16]. In practice, we found picking 5–10 negative samples that are close to 𝜙(𝑥𝑖) and 5–10 samples that are far from 𝜙(𝑥𝑖) to be sufficient in creating 𝑆 for each training triplet (𝑥𝑖, 𝑥𝑖∗, 𝑦𝑖). We refer the reader to [14] and [15] for more information about the contrastive loss and the construction of 𝑆.
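A minimal sketch of the pairwise margin loss in Eqn. (6), written over augmented projections 𝜙(·); the margin value is a placeholder, and the similarity indicator follows the convention implied by the equation above (0 for a same-label pair, 1 for a different-label pair).

import torch

def contrastive_loss(phi_a, phi_b, y, margin=1.0):
    """Pairwise margin loss over augmented projections (Eqn. 6).

    y = 0 for a same-label pair (pull the projections together),
    y = 1 for a different-label pair (push them at least `margin` apart).
    """
    dist = torch.norm(phi_a - phi_b, dim=-1)               # M_W(x+, x-)
    return ((1 - y) * dist + y * torch.clamp(margin - dist, min=0)).mean()

# toy usage: 16 pairs of 128-dimensional projections phi(x)
phi_a, phi_b = torch.randn(16, 128), torch.randn(16, 128)
y = torch.randint(0, 2, (16,)).float()
loss = contrastive_loss(phi_a, phi_b, y)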
Finally, we combine the two portions of our learning task (i.e. representation learning and the joint hypothesis) into the optimization task:

\min_{\Theta}\; \mathcal{L}(W) + \lambda\, \mathcal{L}_{\phi}(\theta_g, \theta_{g^*}, A) + \Omega(W), \quad (7)

where 𝑊 = [𝜃𝑔, 𝜃𝑔∗, 𝐴] is the total set of parameters for the learning task, 𝜆 is the hyperparameter which controls the trade-off between the contrastive loss used to learn ℎ∗ and the representation loss in Eqn. 4, and Ω(.) is the regularization term used to constrain the hypothesis space of the joint model.

4.3 Analysis of Sampling Efficiency
4.3.1 Results from Existing Agnostic Models
For an agnostic hypothesis model, such as a non-linearly separable SVM, the generalization error bound holds with probability 1 − 𝛿:

R(h) \le R_{emp}(h) + O^{*}\!\left( \sqrt{\frac{\Delta_H \log(m/\Delta_H) - \log\delta}{m}} \right), \quad \text{(VC-bound)}

where |𝑅(ℎ) − 𝑅𝑒𝑚𝑝(ℎ)| = 𝜖 ∈ [0, 1] is the generalization error represented by the difference between the expected and empirical training risks, Δ𝐻 is the VC-dimension of the given SVM model class, 𝑚 is the sample size, and 𝛿 ∈ (0, 1). Under the SVM+ formulation in [9], the generalization error is instead given by:

R(h) \le R_{emp}(f^{*}) + O^{*}\!\left( \frac{(\Delta_H + \Delta_{F^*}) \log\frac{m}{\Delta_H + \Delta_{F^*}} - \log\delta}{m} \right), \quad \text{(SVM+)}

where 𝑅𝑒𝑚𝑝(𝑓∗) denotes the error rate of the Teacher's hypothesis 𝑓∗ : 𝑋∗ → 𝑌, and Δ𝐹∗ denotes the VC-dimension of the Teacher model. In the original SVM, the model needs to estimate 𝑚 slack variables, one for each training sample, in addition to the 𝑛 parameters in 𝑤. At a high level, the hypothesis function 𝑓∗ of the LUPI Teacher serves as a slack function which approximates these slack variables for each 𝑥𝑖, eliminating the need for the Student to estimate them during training [9]. The number of estimations in the latter case reduces to O(𝑚 + 𝑛), rather than O(𝑚𝑛). As a result, the sampling efficiency improves from m \le O\!\left(\frac{\Delta_H + \log(1/\delta)}{\epsilon^2}\right) to m \le O\!\left(\frac{\Delta_H \log\Delta_H + \log(1/\delta)}{\epsilon}\right) in the number of samples required to achieve the same generalization error 𝜖.

4.3.2 Complexity of Proposed LUPI Method
In this section, we examine the sample complexity of our proposed LUPI method. For simplicity, let us consider the classification setting where we are given a hypothesis class H of finite VC-dimension which defines a set of functions mapping X to a label set {0, 1}, and let the 0−1 loss function define the empirical risk. Let ΔH = 𝑑 < ∞. By the fundamental theorem of PAC learning (Thm. 6.7 in [17]), there exist 𝐶1, 𝐶2 ∈ R such that:

1. H is agnostic PAC learnable and has the uniform convergence property with sample complexity

C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}. \quad \text{(Agnostic)}

2. There exists a realizable subset H_r such that the sample complexity is defined by

C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m(\epsilon, \delta) \le C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}. \quad \text{(Realizable)}

The main difference between agnostic and realizable PAC models lies in whether the classifier can completely classify a training set S = \{(x_i, y_i)\}_{i=1}^{m}. That is, the training error R_{emp}(h) = \frac{1}{m}\sum_{i=1}^{m} L(h(x_i), y_i) = 0 for the particular hypothesis class under some empirical risk minimization (ERM) algorithm. On the other hand, when we have 𝑦𝑖 ≠ ℎ(𝑥𝑖) for some training data, there exist some examples that the current hypothesis class cannot successfully separate (i.e.
it cannot realize an accurate ERM hypothesis with zero empirical risk), suggesting that the optimal solution is either not contained in the span of the given hypothesis class, or the ERM algorithm cannot converge to the optimal solution in the hypothesis space. Fortunately, [18] introduces some conditions under which a sample complexity between O\!\left(\frac{\log(1/\delta)}{\epsilon}\right) and O\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right) is possible for some classes of models. Specifically, Tsybakov showed that m(\epsilon, \delta) \le C \frac{d\log(1/\delta) + \log(1/\delta)}{\epsilon^{n}}, with 1 < n < 2, exists under two general scenarios:
• When there is zero training error (ERM realizable).
• When the classification margin between the given hypothesis class and another realizable hypothesis class is bounded.

LUPI satisfies the second Tsybakov condition by leveraging a realizable Teacher that provides a mechanism to bound the margins of the Student ERM classifier. Specifically, [8] considers the case where the Student hypothesis class H is non-realizable and the Teacher hypothesis 𝑓∗ ∈ F∗ is a realizable classifier that approximates an Oracle classifier with zero training error. [8] showed that using kernel alignment between the Teacher and the Student, the latter can satisfy the Tsybakov conditions, leading to a sample complexity that is comparable to the realizable case. [9] further showed that, under some assumptions on H∗, one can achieve O\!\left(\frac{\log(1/\delta)}{\epsilon}\right) \le m(\epsilon, \delta) \le O\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right) so long as the Teacher classifier H∗ has lower VC-dimension and training error than H. In our model, we provide the margin bounding mechanism through the contrastive loss in Eqn. (Contrastive). Under the condition that the Teacher model 𝑓∗(𝑥∗) has a lower VC-dimension and 𝑦 · ℎ(𝑥) > 𝑐_0 𝐶 − \mathcal{M}(x^{+}, x^{-}), the 𝐶 − \mathcal{M}(x^{+}, x^{-}) term from the Contrastive equation serves as the approximate slack margin of the oracle classifier. To see this, let 𝑝 ∈ R and 𝑞 ∈ R be random values. Suppose that 𝑝 < 0; then either 𝑞 < 0 or 𝑝 − 𝑞 < 0 is true. We then also have 𝑃(𝑝 < 0) ≤ 𝑃(𝑞 < 0) + 𝑃(𝑝 − 𝑞 < 0). If we take 𝑝 = 𝑦 · ℎ(𝑥) and 𝑞 = 𝑦 · 𝑓∗(𝑥∗), then with probability 1 − 𝛿 we can express the error bounds of the Student and Teacher models as follows,

P(y \cdot h(x) < 0) \le P(y \cdot f^{*}(x^{*}) < 0) + P(y \cdot h(x) < y \cdot f^{*}(x^{*})).

Under the contrastive loss in Eqn. (Contrastive), we can re-formulate the above as:

P(y \cdot h(x) < 0) \le P(y \cdot f^{*}(x^{*}) < 0) + P(y \cdot h(x) < C - \mathcal{M}(x^{+}, x^{-}))
\le P(y \cdot f^{*}(x^{*}) < 0) + O\!\left( \frac{\Delta_H + \Delta_{f^*} - \ln\delta}{m} \right)
\implies m \le O\!\left( \frac{(\Delta_H + \Delta_{f^*})\log(1/\delta) + \log(1/\delta)}{\epsilon^{n}} \right),

where 1 < 𝑛 < 2 if 𝑃(𝑦 · 𝑓∗(𝑥∗) < 0) is satisfied, i.e. if the Teacher model is realizable. We note that although our proposed LUPI model is PAC learnable, i.e. has the uniform convergence property, we cannot bound the computational complexity of learning. Specifically, if we allow the embedding components 𝑔(.) and 𝑔∗(.) to be non-convex functions, then finding the global optimum for 𝜙(𝑥) becomes an NP-hard problem. In other words, the above analysis only examines the sample complexity bounds; it does not provide insight into the computational runtime of learning, or the actualization of the uniform convergence property.

4.4 Experiments
We empirically assess the effectiveness of our LUPI formulation for improving sample efficiency and generalization performance in a multi-task setting. First, we consider prediction accuracy on numerous diagnostic prediction tasks, which are individually binary classification problems.
This allows us to evaluate the ability of our LUPI formulation to actually transfer the privileged information to improve the learning efficiency in a multi-task setting. We benchmark the learning accuracy of our method against the performance of various transfer learning baselines. We also perform an ablation study on the privileged information components of our model to ascertain their necessity. Finally, we perform prediction tasks on classes with only sparse examples – defined as ≤ 100 training samples – and compare the sample efficiency of our model against select models from other transfer learning paradigms. For PI, we consider physician notes in the form of discharge summaries, linking standard medical terminologies (i.e., UMLS codes [19]) with diagnostic findings in the EHR. UMLS codes are a set of standardized medical concepts used by clinicians to describe physical findings of diseases and are used widely in both the EHR as well as medical research [19]. [9] alluded to the idea that medical datasets also contain vast amounts of privileged information in the physician notes, which serve to explain the qualities of diseases that can greatly aid decision rules. For experiments, we consider the following set of data for example features, PI, and labels:
• Example Features 𝑋: continuous time-series data (i.e. lab values, blood tests, imaging) and discrete static variables (i.e. demographics information) that describe a patient.
• Privileged Information 𝑋∗: physician notes containing descriptions in natural language and medical terms (UMLS concepts [19]) that summarize a particular visit for a patient.
• Target Task 𝑌: prediction tasks of interest, such as mortality (binary classification), disease prediction (multi-task and transfer learning), etc.

4.4.1 Datasets and Setup
Table 4.1 provides a brief summary of the data sources for our experiments. For each data source, we extract the unique data modalities available in the dataset. MIMIC-III (Medical Information Mart for Intensive Care) is a publicly available benchmark dataset for predictive modeling and clinical decision support in the intensive care unit (ICU) setting [20]. It should be noted that MIMIC-III and STRIDE are EHR datasets, although STRIDE is comprised of clinical notes (PI vocabulary) obtained from multiple EHR datasets over 19 years of data collection. Documents in Table 4.1 refer to literature sources, including medical claims [21] and research articles [22] that heavily utilize UMLS codes. We refer to the clinical notes from the EHR as the PI source, which we decomposed into lists of UMLS codes. For example, a clinician's note may contain a text description of pneumonia which may utilize several UMLS codes such as (Lower Lobe Consolidation, Staph Aureus, Productive Cough) as keywords.

Table 4.1: Summary of datasets used in this study
Database  No. Patients  UMLS  ICD-9s  Temporal
MIMIC-III  22,043  928  148  40
STRIDE  4M  14,256  None  None
Documents  1.2M  None  11,245  None

MIMIC-III provides a rich source of temporal data, ranging from laboratory tests, vital signs and respiratory parameters, all of which provide hourly resolution of descriptive features. For example features, we use 40 physiologic features, including vital signs (i.e. heart rate, blood pressure, oxygen saturation, temperature), blood tests (i.e. WBC count, platelets, INR) and respiratory parameters (i.e. PaO2/FiO2 ratio, PEEP). These temporal features are the source of example features (EF) for our experiments, i.e. X = \{x_i\}_{i=1}^{m}, where each x_i = \{x_i^{t}\}_{t=1}^{T}.
Preprocessing of these features includes binning the time-series by the hourly average of each feature and standardizing feature values across all adult patients. Physician notes (the source of PI) in MIMIC-III exist in the form of discharge notes, which are physician documentation of key findings relating to the patient's hospital visit. We can represent the PI as X^* = \{x_i^*\}_{i=1}^{m}, where each x_i^* \in \{0, 1\}^{d} represents a discharge note for the 𝑖-th patient, in the form of a 𝑑-dimensional one-hot vector. Here, 𝑑 is the total number of UMLS codes that are found in all of our data sources (MIMIC-III, STRIDE, and Documents). One can think of the UMLS codes as a set of basis features for the PI vectors. The rationale for using physician notes as PI is that they are only available at the end of the hospital stay and contain copious amounts of valuable information regarding a wide array of clinical decision support tasks, such as physical findings, periodic nurse observations, medical or surgical complications, and indicators for mortality risk. During training, we can incorporate these notes into the learning regime, but they become unavailable at inference time. For labels, MIMIC-III provides a wide range of potential tasks. We focus on the prediction of ICD-9 diagnostic codes, which are a set of diagnosis labels given to patients that identify their disease states. Each patient has a set of codes that can be described by a label vector y_i \in \{0, 1\}^{C}, where 𝐶 denotes the total number of disease classes considered. ICD-9 prediction is in fact a difficult multi-label classification problem among other clinical benchmark tasks due to the fact that the distribution of diseases often contains a long tail [23]. In the typical case, a few diseases dominate in high frequency while most diagnostic codes appear only a few times among all patients. As a result, training samples are sparse for most diseases, leading to poor prediction beyond the most frequent cases. Our experimental task is to leverage information from PI under a multi-task learning setting to improve the learning efficiency for a large set of ICD-9 codes, especially ones in the tail of the distribution (i.e. occurring with few samples in the dataset). We consider the diagnoses appearing in at least 1% of admissions, leaving 𝐶 = 148 ICD-9 group codes to formulate our multi-task prediction as 𝐶 classification tasks. We consider UMLS terms appearing at least 50 times in discharge notes, leaving 𝑑 = 928 UMLS terms to construct the PI vocabulary.
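To illustrate the data layout described above, the following sketch shows how a discharge note's UMLS codes and a patient's ICD-9 groups could be turned into the binary vectors x_i^* ∈ {0,1}^d and y_i ∈ {0,1}^C; the vocabularies and codes shown are illustrative placeholders, not entries from the actual datasets.

import numpy as np

# placeholder vocabularies standing in for the d = 928 UMLS terms and C = 148 ICD-9 groups
umls_vocab = {"lower_lobe_consolidation": 0, "staph_aureus": 1, "productive_cough": 2}
icd9_groups = {"486": 0, "038": 1, "518": 2}   # illustrative group codes only

def encode_note(note_codes, vocab):
    """Binary PI vector x_i*: 1 at every UMLS term mentioned in the discharge note."""
    x_star = np.zeros(len(vocab), dtype=np.float32)
    for code in note_codes:
        if code in vocab:                      # terms below the frequency cutoff are dropped
            x_star[vocab[code]] = 1.0
    return x_star

def encode_labels(patient_codes, groups):
    """Multi-label target y_i over the ICD-9 group codes."""
    y = np.zeros(len(groups), dtype=np.float32)
    for code in patient_codes:
        if code in groups:
            y[groups[code]] = 1.0
    return y

x_star = encode_note(["staph_aureus", "productive_cough"], umls_vocab)
y = encode_labels(["486"], icd9_groups)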
ensemble) rather than the LUPI mechanism. As a sanity check, we also included a comprehensive Teacher model using EF + PI features, which should provide the best performance. We note, however, that because the Teacher models use PI at test time, they are used to assess the quality of PI rather than benchmark Student performance. In practice, PI is unavailable at test time, so the Teacher models cannot be used for inference in a real-world setting. Table 4.2 summarizes the performance of these baselines on held-out test set data. Since ICD-9 predictions involve a large number of classes, we take both micro-averaged and macro-averaged AUC as evaluation metrics. Macro-averaged AUC takes per-class average of AUC scores, while micro-averaged AUC considers a single AUC score based on a roll-out of label classes for each test set sample. We also include micro-averaged F1-score and micro-averaged area under PRC to quantify the trade-off between precision and recall. 79 • RNN Student denotes the Student learner using the RNN model class conditioned exclusively on EF, using the LSTM architecture as mentioned in [24]. • MLP Student denotes a feed-forward network conditioned on the final time-step of EF. • MLP Teacher denotes the Teacher feed-forward network conditioned on PI only. Specifically, we use a weighted sum of the PI embeddings for each 𝑥𝑖∗ = {𝑤 1 , ..., 𝑤 𝑘 }: 1 Õ𝑘 𝜙(𝑥𝑖∗ ) = 𝑤𝑇 𝜃 𝑔∗ 𝑘 𝑗=1 𝑗 which maps each PI vector 𝑥𝑖∗ into a lower-dimensional representation space, and 𝜃 𝑔∗ denotes the look-up matrix of embeddings obtained in the first step of our LUPI algorithm. • Oracle Teacher denotes the Teacher model which uses both EF and PI for prediction. ℎ(𝑥𝑖 ) = 𝑅𝑁 𝑁 (𝑥𝑖 ) 𝑓𝑐∗ (𝑥𝑖 , 𝑥𝑖∗ ) = 𝜎(𝑊ℎ𝑐 ℎ(𝑥𝑖 ) + 𝑊𝑔𝑐 𝑔(𝑥𝑖∗ ) + 𝑏 𝑐 ) EF inputs are encoded into fixed-length vectors by a set of RNN layers and the PI features are embedded into lower-dimensional space by 𝑔(𝑥 ∗ ) described previously. Since there are 𝐶 tasks (i.e., 𝐶 outputs), a classifier layer is used to predict the 0 − 1 label for each ICD-9 code. Here, we note that the embedding matrix 𝜃 𝑔∗ for learning the lower dimensional representation of PI is obtained by the embedding mechanism highlighted in the PI Embedding equations. Taking the set of UMLS concept codes as the PI vocabulary, we leverage the corpus available in STRIDE and Documents datasets to learn the 𝜃 𝑔∗ , conditioned on the UMLS codes. For example, given a medical document consisting of a set of 𝑛 relevant UMLS codes 𝑣® = {𝑤 1 , 𝑤 2 , ...𝑤 𝑛 }, we can train the 𝜃 𝑔∗ for the UMLS codes by Eqn. 2, with the modified scoring function:   if 𝑤 1 , 𝑤 2 ⊂ 𝑣®  1   𝑆𝑐𝑜𝑟𝑒(𝑤 1 , 𝑤 2 ) = (Co-occurence)   0  otherwise  80 Table 4.2: Comparison of performance across baseline models Model Ma-AUC Mi-AUC Mi-F1 AUPRC RNN Student 0.735 0.783 0.299 0.260 MLP Student 0.715 0.756 0.235 0.211 MLP Teacher∗ 0.824 0.868 0.446 0.432 Oracle Teacher ∗ 0.845 0.882 0.497 0.510 ( ∗ ) denotes Teacher models using PI. Note that each UMLS concept is represented by 𝑤 𝑖 ∈ {0, 1} 𝑑 , where 𝑤 𝑖 𝑗 = 1 for the index corresponding to the UMLS code. Thus, 𝜃 𝑔∗ [:, 𝑖] gives the the distributed representation of 𝑖 𝑡ℎ UMLS concept. We train the embedding matrix 𝜃 𝑔∗ over STRIDE and Documents before applying 𝜃 𝑔∗ on the MIMIC-III dataset for LUPI. We see from table 4.2 that the PI provides strong signals for ICD-9 prediction. Micro-averaged AUC and Macro-averaged AUC are denoted as Mi-AUC and Ma-AUC, respectively. 
Micro-averaged F1-score and AUC of precision-recall curve are denoted as Mi-F1 and AUPRC, respectively. Large differences exist between the Student baselines and the Teacher models across all performance metrics, suggesting that the PI provides more information about the label space compared to the original time-series features. Again, we emphasize here that discharge notes (PI) are generated only after the diagnostic predictions have been made by clinicians, and thus the Teacher models are actually not available at inference time. Interestingly, we also see that the Oracle teacher with combined features provided additional performance boost compared to using PI exclusively as features. This suggests that the temporal features provide some complementary information not contained in the PI. 4.4.3 Comparison Against Other Transfer Learning Methods Next, we benchmark performance for several existing transfer learning paradigms for incorporating PI with the Student model: transductive learning, inductive learning, and model distillation. Under the transductive framework, we treat the PI as auxiliary targets, much like target replication in [24]. We train a joint hypothesis model ℎ : 𝑋 → 𝑋 ∗ × 𝑌 to map from the original EF space to the joint PI and label space. By contrast, we incorporate 𝑋 ∗ as an auxiliary input for the inductive framework. 81 We use data fusion to learn a joint representation 𝑔 : 𝑋 × 𝑋 ∗ → 𝑍 before learning a hypothesis function to predict ℎ : 𝑍 → 𝑌 . For model distillation, a Teacher network predicts a set of soft-labels over the PI information, which the Student model uses as auxiliary input for the final prediction model ℎ : 𝑋 × 𝑋 ∗ → 𝑌 . Details of the setup is explained below. 4.4.3.1 Multi-task learning MTL is the representative transductive learning technique. There is only one source domain D = {𝑋, 𝑃(𝑋)} and two target tasks: T𝑆 = {𝑋 ∗ , 𝐺} and T𝑇 = {𝑌 , H }. The MTL model learns a joint model: 𝑔 𝑘 (𝑥𝑖 ) = 𝑀 𝐿𝑃(𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑟𝑒 𝑝 ); 𝑊 𝑘 ) (Shared Rep.) ℎ𝑐 (𝑥𝑖 ) = 𝑀 𝐿𝑃(𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑟𝑒 𝑝 ); 𝑊𝑐 ) (Individual Hyp.) 1 Õ𝑚 Õ𝐶 L 𝑀𝑇 𝐿 = 𝐿𝑌 (ℎ𝑐 (𝑥𝑖 ), 𝑦𝑖𝑐 ) 𝑚𝐶 𝑖=1 𝑐=1 𝜆 Õ𝑚 Õ 𝑑 + 𝐿 𝑋 ∗ (𝑔 𝑗 (𝑥𝑖 ), 𝑥𝑖∗𝑗 ) (MTL Obj.) 𝑑𝑚 𝑖=1 𝑗=1 where 𝑊𝑟𝑒 𝑝 is the shared weights for the representation model 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑟𝑒 𝑝 ), 𝑊 𝑘 and 𝑊𝑐 are task-specific weights for target hypothesis models. The MTL loss is composed of two parts: (1) a loss component over the joint label space E[𝐿𝑌 (ℎ(𝑋), 𝑌 )], and (2) a loss term over the joint PI space E[𝐿 𝑋 ∗ (𝑔(𝑋), 𝑋 ∗ )]. 𝜆 is a hyperparameter which controls the trade-off between the multiple objectives during learning. 𝐿(.) denotes some evaluation criterion to approximate the 0 − 1 loss, for example the binary cross-entropy (BCE) or mean squared error (MSE). We used BCE in the proceeding experiments. 4.4.3.2 Data Fusion For inductive learning, we used a variant of the Siamese Network [16] to achieve data fusion between EF and PI. We use two parallel networks, 𝑔 : 𝑋 → 𝑍 and 𝑔 ∗ : 𝑋 ∗ → 𝑍 and minimize the distance 82 Table 4.3: Comparison of performance across transfer learning models Model Ma-AUC Mi-AUC Mi-F1 AUPRC MTL 0.783 0.836 0.384 0.336 Distillation 0.738 0.793 0.289 0.245 Data Fusion 0.779 0.811 0.374 0.328 Ours 0.838 0.845 0.397 0.344 between 𝑔(𝑥) and 𝑔 ∗ (𝑥 ∗ ) using the BCE loss. We then learn a hypothesis function ℎ : 𝑍 → 𝑌 . 𝑔(𝑥𝑖 ) = 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑥 ) (EF Embedding) 𝑔 ∗ (𝑥𝑖∗ ) = 𝑀 𝐿𝑃(𝑥𝑖∗ ; 𝑊𝑥 ∗ ) (PI Embedding) ℎ(𝑥𝑖 , 𝑥𝑖∗ ) = 𝜎(𝑊𝑔 𝑔(𝑥𝑖 ) + 𝑊𝑔∗ 𝑔 ∗ (𝑥𝑖∗ ) + 𝑏 𝑔 ) (Joint Hyp.) 
Õ𝑚 L 𝑍 (𝑊𝑥 , 𝑊𝑥 ∗ ) = 𝐵𝐶𝐸 (𝑔(𝑥𝑖 ), 𝑔 ∗ (𝑥𝑖∗ )) (Fusion Loss) 𝑖=1 Õ𝑚 L𝑌 (𝑊𝑔 , 𝑊𝑔∗ ) = 𝐵𝐶𝐸 (ℎ(𝑥𝑖 , 𝑥𝑖∗ ), 𝑦𝑖 ) (Task Loss) 𝑖=1 L 𝑍 and L𝑌 are trained iteratively using alternating stochastic gradient descent (SGD). At test time, a masking vector 𝑥𝑡𝑒𝑠𝑡 ∗ = {0} 𝑑 is used to represent PI, as it is unavailable for inference. Thus, ∗ ) = 𝑦 is used for evaluation. ℎ(𝑥𝑖 , 𝑥𝑡𝑒𝑠𝑡 𝑖 4.4.3.3 Distillation Our distillation also contains two parts: a Teacher network trained to generate soft-labels for PI, and Student network conditioned on the EF and PI soft-labels to predict the ICD-9 targets. 𝑔(𝑥𝑖 ) = 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑇 ) (Teacher) ℎ𝑐 (𝑥𝑖 ) = 𝑀 𝐿𝑃([𝑔(𝑥𝑖 ); 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊 𝑅𝑁 𝑁 )]; 𝑊𝑐 ) (Student) Õ𝑚 Õ𝐶 L𝑇 𝑆 = 𝐵𝐶𝐸 (ℎ𝑐 (𝑥𝑖 ), 𝑦𝑖 ) (Distillation Loss) 𝑖=1 𝑐=1 Here, [𝑔(𝑥𝑖 ); 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊 𝑅𝑁 𝑁 )] denotes concatenation of the PI soft-labels and the last hidden state of the 𝑅𝑁 𝑁 (𝑥𝑖 ). The joint T-S loss connects the Teacher and Student loss together, allowing the two networks to be trained end-to-end. 83 Table 4.3 summarizes the AUC and retrieval scores for our LUPI model against transfer learning baselines. Micro-averaged AUC and Macro-averaged AUC are denoted as Mi-AUC and Ma-AUC, respectively. Micro-averaged F1-score and AUC of precision-recall curve are denoted as Mi-F1 and AUPRC, respectively. We see that our LUPI formulation outperformed other transfer learning baselines in all major performance metrics. Vanilla Data Fusion and MTL networks produced comparable performances, and Distillation did not have significant improvement over the baseline Student models that did not use PI. For Data Fusion, it is likely that since the PI contains a lot more information than the original EF, the decision function of the hypothesis model ℎ(𝑥, 𝑥 ∗ ) relied heavily on access to PI. Since PI is masked during testing due to unavailability, ℎ(𝑥, {0} 𝑑 ) likely resulted in poor generalization. Comparable results can be seen in MTL. The drop in performance is most likely due to negative transfer [1] due to the wide range of tasks (ICD-9 codes) that contribute uniformly to the multi-objective learning process. Unlike our LUPI, the PI is not used to inform similarity between training samples from different tasks, which do not share the same support (i.e. different diagnosis may come from very different underlying distributions). Thus, in both Data Fusion and MTL cases, the PI is incorporated in a less efficient way than our proposed model. Interestingly, we see that the MLP Teacher and Oracle Teacher models in Table 4.2) still provided better AUC, F1 and AUPRC performances over the all transfer learning models, including our LUPI model. This result suggests that the PI is more informative for diagnostic tasks compared to the original EF, which is what enables the LUPI method to be effective. One possible explanation of the predictive power of the PI is that the embeddings of the UMLS terms, which comprise the PI vocabulary, are learned based on their co-occurrence with disease codes in public literature. 4.4.4 Performance with Sparse Examples In addition to broad coverage of tasks, we evaluate the sample efficiency of our proposed model against transfer learning baselines by considering the more rare diseases with very few training samples. 
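To make the long-tail task selection concrete, the following sketch shows how the sparse-task subset could be carved out of a binary label matrix. It is illustrative only — the function name, the NumPy layout, and the synthetic labels are assumptions rather than the preprocessing code used in this thesis; the 100-sample threshold matches the definition of sparse classes used in this section.

```python
import numpy as np

def rare_task_indices(y_train, max_count=100):
    """Return column indices of ICD-9 group codes with fewer than
    `max_count` positive training examples, i.e. the long-tail tasks."""
    counts = y_train.sum(axis=0)              # positives per diagnosis code
    return np.where(counts < max_count)[0]

# Toy usage with a synthetic multi-label matrix (m patients x C codes).
rng = np.random.default_rng(0)
y_train = (rng.random((5000, 148)) < rng.uniform(0.001, 0.2, size=148)).astype(int)
rare = rare_task_indices(y_train)
print(f"{len(rare)} of {y_train.shape[1]} codes fall in the long tail")
# Downstream metrics (e.g. macro-AUC) would then be computed over y[:, rare] only.
```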
This is actually quite typical in the EHR setting, where diagnosis labels often have very long tail distributions. We restrict our predictions to diagnostic codes appearing fewer than 100 times in the training and test sets and examine the generalization of the various modeling schemes.

Table 4.4: Performance of various models for 30 ICD-9 codes appearing fewer than 100 times in the dataset
Model            Ma-AUC   Mi-AUC   Mi-F1   AUPRC
RNN Student      0.628    0.639    0.104   0.096
MLP Teacher      0.821    0.833    0.212   0.196
Oracle Teacher   0.801    0.805    0.256   0.146
MTL              0.717    0.724    0.150   0.122
Distillation     0.729    0.738    0.158   0.149
Data Fusion      0.821    0.826    0.307   0.237
Ours             0.834    0.835    0.381   0.330

In Table 4.4, we see that performance decreased drastically for the non-transfer-learning models, namely the RNN Student and the Teacher models (both MLP and Oracle). Transfer learning schemes such as MTL and Distillation also dropped substantially in F1-score and AUPRC. Interestingly, the Data Fusion method outperformed the other transfer learning baselines in F1-score and AUPRC, suggesting that learning a domain-invariant representation between the original features and the PI provided a key improvement in sample efficiency. However, our LUPI model achieved the best performance among all models on these long-tail tasks. In fact, its performance stayed relatively consistent on this subset of tasks compared to its performance on the original set of common diagnoses.

4.5 Discussion and Conclusion

In this chapter, we presented a novel LUPI framework for retaining PI in the multi-task setting to improve sample complexity over a wide range of related tasks. The key idea was to learn a joint representation of the original feature space and the PI by leveraging their co-occurrence information in the data. Decomposing the PI into distributed representations of basis features was vital for realizing this mechanism. Experiments show that our proposed LUPI method can outperform baseline models and other transfer learning methods in multi-task learning scenarios, particularly in situations where training samples are very rare (< 100 samples per task). In addition to the improved performance, we also provided a sample complexity analysis that outlines the scenarios under which our LUPI method can provide similar benefits over traditional transfer learning approaches.

BIBLIOGRAPHY

[1] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[2] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[3] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
[4] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[6] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
[7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015. [8] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544–557, 2009. [9] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of machine learning research, 16(2023-2049):2, 2015. [10] Xue Li, Bo Du, Chang Xu, Yipeng Zhang, Lefei Zhang, and Dacheng Tao. R-svm+: Robust learning with privileged information. In IJCAI, pages 2411–2417, 2018. [11] John Lambert, Ozan Sener, and Silvio Savarese. Deep learning under privileged information using heteroscedastic dropout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8886–8895, 2018. [12] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016. 88 [13] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017. [14] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In null, pages 1735–1742. IEEE, 2006. [15] Miguel A Carreira-Perpinan and Geoffrey E Hinton. On contrastive divergence learning. In Aistats, volume 10, pages 33–40. Citeseer, 2005. [16] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015. [17] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [18] Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004. [19] Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004. [20] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3, 2016. [21] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016:41, 2016. [22] Samuel G Finlayson, Paea LePendu, and Nigam H Shah. Building the graph of medicine from millions of clinical narratives. Scientific data, 1:140032, 2014. [23] Fengyi Tang, Cao Xiao, Fei Wang, and Jiayu Zhou. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open, 1(1):87–98, 2018. [24] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015. 89 CHAPTER 5 ADVERSARIAL PRECISION SENSING 5.1 Introduction In many real-world applications, fully-observed datasets are difficult to obtain, especially for temporal data, where observing the full set of features for all samples across time is simply too expensive. A prime example of this phenomenon exists in the realm of medical informatics, specifically electronic health records (EHR) dominated by time-series data. 
Yet this hasn’t stopped EHR from sparking numerous research interests in recent years [1, 2, 3, 4, 5]. Time-series data in the EHR consists of features that are sampled at different levels of temporal granularity (e.g., lab tests are sampled at longer time-scales than vital signs). Most of the time, only a small subset of features are observed at any time-step, as it is prohibitively expensive for the physician to obtain the full set of features for every patient for every time interval. In practice, a form of active sensing [6] is implicitly done in an ad-hoc manner by physicians based on a combination of inpatient work-flow, resource constraints, and domain expertise. As a result, the key difficulty with EHR time-series data is that they almost always require re-sampling and imputation due to large amounts of missing values. Therefore it is no surprise that several recent works have shown that temporal models trained on EHR data have fragile decision boundaries that are susceptible to small perturbations [7, 8]. To address this issue, we formulate the precision sensing problem setting as one that takes in data-streams along the temporal dimension, rather than across samples. For example, consider the EHR setting, where at each time-step, the physician has to decide how to allocate a constrained amount of tests/resources among patients for some set of clinical tasks. In other words, the precision sensing problem for time-series is in fact a resource distribution problem at each time-step, with a fixed observation budget across a given set of samples. Therefore the querying process is applied across time rather than samples, and any formulation must reflect both the temporal-dependency and the budgetary-constraint of the querying process. 90 In this work, we formulate the active sensing problem as a minimax game between two players: • A Feature Sensor (FS) that generates a sensing tensor 𝐴 over a temporal data-stream in an online manner. • A Progressive Learner (PL) that adjusts the decision boundary at each time-step to handle sparser sets of features, allowing the FS model to improve its sensing strategy toward sparser selections. We then prove that our formulation reaches a local Nash equilibrium during co-training. Afterwards, we evaluate the quality of the dynamic sensing strategy by training a diverse set of classifiers on the resulting set of data with incomplete features and compare their performances against a corresponding set of hypothesis models conditioned on the original dataset. Our results show that classifiers conditioned on dynamically sensed data, i.e., incomplete data, are able to maintain prediction accuracy, despite adhering to a strict budgetary constraint for sensing. We show that unlike attention and perturbation models, the sensing strategy of the FS-Model can be used to train many types of models, including deep models, shallow models, and even non-temporal models, with minimal decrease in performance. In experiments, we demonstrate how our budget-adaptive framework can be used to produce an active data acquisition strategy that can be implemented in hospitals to deal with feature sensing under moving budgetary constraints. 5.2 Problem Formulation 5.2.1 Precision Sensing for Time-Series Data Classic time-series data comes in the form: 𝑋 = {𝑥𝑖 }𝑖=1 𝑚 , 𝑥 = {𝑥 (𝑡) }𝑇 , 𝑥 (𝑡) ∈ R𝑛 , where 𝑚 is the 𝑖 𝑖 𝑡=1 𝑖 number of samples, 𝑇 the number of time-steps, and 𝑛 the number of features. 
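To fix the notation, the sketch below shows one possible in-memory layout for such data (an assumption made for illustration, not code from this thesis): the batch is stored as a single (m, T, n) array, so indexing by sample recovers x_i while indexing by time recovers the per-time-step feature matrix used in the precision sensing formulation that follows, and a same-shaped binary tensor can mark which entries are actually observed.

```python
import numpy as np

m, T, n = 1000, 48, 19              # samples, time-steps, features (illustrative sizes)
X = np.random.randn(m, T, n)        # classic layout: X[i] is one sample's time series

x_i = X[0]          # x_i = {x_i^(t)}, shape (T, n): one patient across time
X_t = X[:, 0, :]    # all samples at a single time-step, shape (m, n)

# A binary tensor of the same shape marks which entries are actually observed.
A = np.ones((m, T, n), dtype=np.int8)   # A[i, t, j] = 1 means feature j is sensed
X_sensed = X * A                        # element-wise masking of the data by the mask
```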
Precision sensing n o𝑚 considers the case where 𝑋 = 𝑥𝑖(𝑡) (𝑡) , 𝑥𝑖(𝑡) ∈ R𝑛 , is a feature matrix across the entire batch of 𝑖=1 samples at time step 𝑡. At each time-step we are interested in finding a corresponding sensing matrix 𝐴 (𝑡) ∈ {0, 1} 𝑚×𝑛 over the data matrix. During the observation period, a cumulative sensing tensor 91 𝐴 = {𝐴 (𝑡) }𝑇𝑡=1 is incrementally constructed according to the dynamics: 𝐴 (𝑡+1) = 𝐺 (𝑋 (1) 𝐴 (1) , . . . , 𝑋 (𝑡) 𝐴 (𝑡) ), (5.1) for some update function 𝐺 (.). Alternatively we can view 𝐴𝑖 = {𝐴𝑖(𝑡) }𝑇𝑡=1 as the subset of sensed features across time for each sample. Under this view, we can formulate the precision sensing objective as follows: Definition 2. (Precision Sensing) The problem of precision sensing seeks a sensor tensor 𝐴∗ that minimizes the empirical risk (ERM) [9] according to: 𝐴∗ ∈ argmin E𝑃(𝑋,𝑌 ) [𝐿 (ℎ( 𝐴𝑖 𝑋𝑖 ), 𝑦𝑖 )] 𝐴𝑖 ∈A 1 Õ ≈ argmin 𝐿 (𝑦𝑖 , ℎ( 𝐴𝑖 𝑋𝑖 )), 𝐴𝑖 ∈A 𝑚 (𝑋𝑖 ,𝑦 𝑖 )∈D where ℎ is a hypothesis class that maps ℎ : 𝑋 → 𝑌 , and 𝐿(.) is a risk function that evaluates the hypothesis mappings against the actual label. Here, A denotes the space of sensing matrices, and each matrix 𝐴𝑖 can be thought of as a per-sample sensing matrix. From this perspective 𝐴𝑖 describes the subset of sensed features across time for each sample. Remark 1. Ideally the optimal sensing tensor 𝐴 generates a partially-observed dataset: {( 𝑋˜ 𝑖 , 𝑦𝑖 )}, 𝑋˜ 𝑖 = 𝐴𝑖∗ ˜ 𝑌 ) ≈ 𝑃(𝑋, 𝑌 ). That is, the joint distribution of the dynamically sensed 𝑋𝑖 such that 𝑃( 𝑋, data produced by 𝐴∗ should match the joint distribution of the fully-observed dataset. The key difference between existing methods and our approach is in the manner in which 𝐴 is generated. In feature selection for time-series, 𝐴 is obtained by some form of sparse sensing technique that statically assigns the same subset of features to be observed for each 𝑋𝑖 . By contrast active sensing allows for 𝐴 to be built in an individualized manner for each sample, yet still actively sub-samples (sample, view) pairs during each co-training iteration. But since it only sub-samples a small subset of examples, active sensing fails to consider an overarching budgetary constraint across all the samples at each time-step. 92 Figure 5.1: Overview of problem setting. (a) Incremental generation of the sensing tensor. (b) Minimax game between FS and PL. 5.2.2 Budget-Constrained Precision Sensing as Adversarial Optimization At a high-level, the proposed adversarial framework works as follows: • At each time-step, FS tries to “discard”, based on observations from previous time-steps, features that lead to mis-classifcation. • At the end of each observation window, the PL adjusts the decision boundary so that the sensed features result in correct classification. Fig. 5.1a describes incremental generation of the sensing tensor 𝐴 by the FS. 𝐴 is generated recurrently by applying the sensor dynamics 𝐺 (.), which is learned by the FS, where 𝐴 (𝑡) = 𝐺 ( 𝑋˜ (1) , . . . , 𝑋˜ (𝑡) ) outputs 𝐴 (𝑡+1) at each time-step. Fig. 5.1b describes the minimax game between the FS and PL. PL (top) tries to minimize the classification error with sensed features 𝑋 𝐴 while FS (bottom) tries to allocate 𝐴 such that unsensed features lead to misclassification. At the end of the observation window, the PL adjusts the parameters of 𝐹 to handle the updated input 𝑋 𝐴, which now has a different sparsity pattern than before, as shown in Fig. 5.1b. 
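To make the incremental generation of Fig. 5.1a concrete, the following is a minimal sketch of the roll-out loop. It is an illustration rather than the actual FS implementation: `sensor_step` is a hypothetical stand-in for the learned update function G(.), the fully-observed first time-step mirrors the initialization described later in this chapter, and the 0.5 threshold is an assumption.

```python
import numpy as np

def roll_out_sensing(X, sensor_step, threshold=0.5):
    """Incrementally build the sensing tensor A over an (m, T, n) batch X.

    `sensor_step(history)` plays the role of G(.): it maps the list of already
    sensed slices [X^(1)*A^(1), ..., X^(t)*A^(t)] to scores in [0, 1] that
    determine the next time-step's sensing matrix A^(t+1).
    """
    m, T, n = X.shape
    A = np.zeros((m, T, n), dtype=np.int8)
    A[:, 0, :] = 1                               # observe everything at t = 1
    history = [X[:, 0, :] * A[:, 0, :]]
    for t in range(1, T):
        scores = sensor_step(history)            # shape (m, n), values in [0, 1]
        A[:, t, :] = (scores > threshold).astype(np.int8)
        history.append(X[:, t, :] * A[:, t, :])
    return A

# Toy sensor: keep a feature only if its last sensed value was positive.
toy_sensor = lambda hist: (hist[-1] > 0).astype(float)
A = roll_out_sensing(np.random.randn(8, 5, 3), toy_sensor)
```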
At equilibrium the PL will no longer be able to improve its performance, provided that the FS has selected an optimal subset of features to sense such that accuracy improvement is no longer possible given the observation budget. Max-Margin Objective. More formally, the FS tries to solve the optimization problem:  max 𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 , 0 𝜃𝐺 Õ s.t. 𝐴 (𝑡) ∈ {0, 1} 𝑚×𝑛 , 𝐴𝑖(𝑡) 𝑗 ≤ 𝐵 (5.2) 𝑖, 𝑗,𝑡 93 where 𝐺 (𝑋) = [𝐺 ( 𝑋˜ (1) ), 𝐺 ( 𝑋˜ (2) ), . . . , 𝐺 ( 𝑋˜ (𝑇) )] = 𝐴. When 𝐴𝑖(𝑡)𝑗 = 0, the j-th feature is masked for the i-th sample at time 𝑡. We can think of 𝑋ˆ = 𝑋 − 𝑋 𝐴 as a subset of features that are discarded by FS. Thus, 𝑋˜ = 𝑋 𝐴 is the resulting set of sensed (unmasked) features at the end of the observation window. During training, the FS model ranks 𝐹 (𝑋 − 𝑋 𝐴)¬𝑦 higher than 𝐹 (𝑋 − 𝑋 𝐴) 𝑦 .We use ¬𝑦𝑖 to denote the case where the hypothesis produces the wrong class for the 𝑖−th sample. 𝐵 denotes the budget constraint on 𝐴, and 𝐹 (.)𝑐 denotes the 𝑐-th logit of the hypothesis model output. 5.3 Proposed Method Derivation of Eqn. 5.2 First, let us consider the popular adversarial setting of perturbation models described in [10]: minimize 𝐷 (𝑋, 𝑋 + 𝛿) s.t. 𝐶 (𝑋 + 𝛿) ≠ 𝑦 𝑋 + 𝛿 ∈ [0, 1] 𝑛 (5.3) where 𝑋 is the original set of features, 𝛿 is the perturbation to be applied to the feature space, 𝐷 (.) is a distance metric, and 𝐶 (.) is a classifier. Under the perturbation setting, the 𝐶 (𝑋 + 𝛿) ≠ 𝑦 term finds an adversarial set of samples, 𝑋 + 𝛿, leading to misclassification, while 𝐷 (𝑋, 𝑋 + 𝛿) is a reconstruction term that limits the size of 𝛿 such that it is not trivially large. Recent work (e.g., [7, 8]) leverages variants of this approach for ranking feature importance in complex feature spaces. We develop a different set of reconstruction and modification objectives under a similar setting: minimize L (𝐶 (𝑋 𝐴), 𝑦) s.t. 𝐶 (𝑋 − 𝑋 𝐴) ≠ 𝑦, Õ ∀𝑖, 𝑗, 𝑡 : 𝐴𝑖(𝑡) 𝑗 ∈ {0, 1}, & 𝐴𝑖 𝑗 𝑘 ≤ 𝐵 (5.4) 𝑖𝑗𝑘 where 𝐶 is the classifier and L a loss function on (𝑋, 𝑦) pairs. 𝐴 denotes a “sensing tensor” that gives a component-wise indicator for whether feature 𝑗 is sensed at time 𝑡 for sample 𝑖. Additionally, 94 𝐵 gives a “budget constraint” for the sensing tensor 𝐴. Rather than reconstructing 𝑋 ≈ 𝑋 + 𝛿, we learn the construction of 𝐴 based on the recovery of 𝑃(𝑦|𝑋) ≈ 𝑃(𝑦|𝑋 𝐴) through 𝐶. We also update 𝐶 and 𝐴 at different rates—𝐶 is updated at the end of observation windows while 𝐴 is updated every time-step. Thus, we formulate the active sensing problem of interest under a single mini-max objective: min max {𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 }+ 𝜃 𝐹 𝜃𝐺 | {z } Feature Sensing + 𝐶𝐸 𝑦 (𝐹 (𝑋 𝐺 (𝑋)) ) + 𝛽||𝑋 − 𝑋 𝐺 (𝑋)|| 1 . (5.5) | {z } | {z } Classifier Reconstruction Budget Management Here 𝐹 corresponds to 𝐶 and 𝐺 corresponds to a function to be learned from 𝐴. Meanwhile, 𝐶𝐸 𝑦 is the cross-entropy loss w.r.t. 𝑦: 𝑚 1 Õ 𝐶𝐸 𝑦 (𝑥) = − [𝑦𝑖 log 𝑥𝑖 + (1 − 𝑦𝑖 ) log (1 − 𝑥𝑖 )] . 
(5.6) 𝑚 𝑖=1 5.3.0.1 Feature Sensor We can decompose the update dynamics of the sensor generation process (𝐺) into the following components: ℎ𝑖(0) = max{0, 𝑊𝑥ℎ [𝑋𝑖(0) ; 0]} (Initialization) 𝑧𝑖(𝑡) = max{0, 𝑊𝑥𝑧 [ 𝑋˜ 𝑖(𝑡) ; 𝐶 𝑡 ] + 𝑊𝑧ℎ ℎ𝑖(𝑡−1) } (Update Gate) 𝑟𝑖(𝑡) = max{0, 𝑊𝑥𝑟 [ 𝑋˜ 𝑖(𝑡) ; 𝐶 𝑡 ] + 𝑊𝑟 ℎ ℎ𝑖(𝑡−1) } (Reset Gate) n  o ˜ℎ (𝑡) = tanh 𝑊𝑥ℎ [ 𝑋˜ (𝑡) ; 𝐶 𝑡 ] + 𝑊ℎℎ 𝑟 (𝑡) ℎ (𝑡−1) (Transition) 𝑖 𝑖 𝑖 𝑖   ℎ𝑖(𝑡) = 1 − 𝑧𝑖(𝑡) ℎ𝑖(𝑡−1) + 𝑧𝑖(𝑡) ℎ˜ 𝑖(𝑡) (State Update)   𝐴𝑖(𝑡+1) = 𝜎 𝑊ℎ ℎ𝑖(𝑡) + 𝑏 𝑎 (Sensor Output) Õ Õ 𝐶 𝑡 = 𝐶 𝑡−1 + 𝑐 𝑗 𝐴𝑖(𝑡) 𝑗 (Budget Update) 𝑖 𝑗 95 where 𝐶 𝑡 denotes the total observation cost at time 𝑡, corresponding to the total number of observations in [ 𝐴 (1) , . . . , 𝐴 (𝑡) ] across all samples. 𝑐 𝑗 denotes the cost of observing feature 𝑗 (e.g., certain medical tests may be more costly to observe than others). [𝑎; 𝑏] ∈ R𝑛+1 denotes the concatenation of vector 𝑎 ∈ R𝑛 and scalar 𝑏 ∈ R. 𝜎(.) denotes a thresholding function (e.g., sigmoid activation) that maps the output of each 𝐴𝑖(𝑡) 𝑗 ∈ [0, 1] feature component to {0, 1}, which is done to satisfy the integer constraint in Eqn. 5.2 that 𝐴𝑖(𝑡) 𝑗 ∈ {0, 1}, ∀𝑖, 𝑗, 𝑡. ℎ𝑖(𝑡) serves as an internal state representation that considers every sensed features vector 𝑋˜ 𝑖(1) , . . . , 𝑋˜ 𝑖(𝑡) up to time 𝑡 for sample 𝑖, as well as the observation cost across samples at each time-step, i.e., 𝐶 1 , . . . , 𝐶 𝑡 . The gates 𝑧 and 𝑟 control the update rule of the memory state to prevent gradient vanishing [11]. The weights 𝜃 𝐺 = [𝑊𝑥ℎ , 𝑊𝑥ℎ , . . . , 𝑊ℎ𝑛 ] are learned by optimizing over the feature sensing portion of the objective function in Eqn. 5.5: max {𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 }+ (5.7) 𝜃𝐺 5.3.0.2 Progressive Learner The PL serves to stabilize the reference hypothesis model 𝐹 in response to the shift in sparsity patterns observed in 𝐺 (𝑋). We note the time-scale difference between 𝐹 and 𝐺 updates: whereas FS updates 𝐺 at every time-step 𝑡, the PL updates 𝐹 only at the end of an observation window 1, . . . , 𝑇. Before 𝐺 (.) is learned by the FS model, the decision function 𝐹 is first initialized on a small batch of fully-observed data {𝑋𝑖 , 𝑦𝑖 }𝑖=1 𝑘 ⊂ D train , 𝑘  𝑚 to obtain an initial approximation of the underlying 𝑃(𝑌 |𝑋). Over time, the PL is gradually conditioned on 𝐴 𝑋, with a progressively sparser 𝐴. To achieve this conditioning effect, we utilize a schedule sampling technique similar to [12], where we pick the input to 𝐹 based on a coin toss (𝐻):    𝐹 (𝑋) 𝑃(𝐻 = 1) = 𝜀,    𝐹 (.) = (Scheduled Sampling)   𝐹 (𝐺 (𝑋) 𝑃(𝐻 = 0) = 1 − 𝜀.   𝑋)  96 Here, 𝜀 denotes the probability of using the fully-observed 𝑋 for training, and 1 − 𝜀 is the probability of using the masked version. Initialize 𝜀 = 1 for the first 𝑘 iterations steps to obtain an unbiased estimation. After 𝑘 iterations, decay 𝜖 according to: 𝜀 = 𝑡 − 𝑘/(𝑡 − 𝑘 + exp(𝑘 − 𝑡)). (𝜀-Decay) By decaying 𝜀, the training distribution of the PL model shifts gradually from the fully-observed to the sensed features. We pick a baseline 𝜀 = min{𝑡 − 𝑘/(𝑡 − 𝑘 + exp(𝑘 − 𝑡)), 0.25} to ensure that PL gets exposed to at least a small fraction of the fully-observed 𝑋 during each training iteration. This allows the PL to obtain a good initialization on the feature importance. The PL adjusts the decision boundary by minimizing over the classifier reconstruction portion of Eqn. 5.5: min 𝐶𝐸 𝑦 (𝐹 (𝐺 (𝑋) 𝑋)) (5.8) 𝜃𝐹 where each 𝐹 (𝑋𝑖 ) is sampled from Eqn. Scheduled Sampling. 
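The scheduled-sampling mechanism can be summarized in a few lines. Because the exact ε-Decay expression is difficult to recover from the typeset formula above, the schedule below uses one plausible inverse-sigmoid decay in the spirit of [12] rather than the precise rule; the 0.25 floor reflects the requirement that the PL keep seeing at least a small fraction of fully-observed batches.

```python
import numpy as np

def fully_observed_prob(t, k, floor=0.25):
    """Probability eps of feeding the PL the fully-observed X at iteration t."""
    if t <= k:
        return 1.0                      # warm-up: always use fully-observed X
    # Inverse-sigmoid decay (an assumption, not the thesis's exact formula).
    eps = k / (k + np.exp((t - k) / k))
    return max(floor, eps)              # keep at least a 25% chance of full X

def pl_forward(F, X, X_sensed, t, k, rng):
    """Scheduled sampling: a coin flip picks the PL's input for this batch."""
    if rng.random() < fully_observed_prob(t, k):
        return F(X)                     # fully-observed features
    return F(X_sensed)                  # sensed (masked) features G(X) * X

rng = np.random.default_rng(0)
print([round(fully_observed_prob(t, k=100), 3) for t in (50, 100, 150, 300, 1000)])
# e.g. [1.0, 1.0, 0.984, 0.931, 0.25]: the PL gradually shifts toward sensed inputs.
```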
5.3.0.3 Budget-Manager The strength of budget constraint is updated at each iteration by dual gradient ascent [13] for 𝛽 according to Eqn. 5.9: 𝛽𝑡+1 = 𝛽𝑡 + 𝛼(𝑋 − 𝑋 𝐺 (𝑋)), (5.9) where 𝛼 denotes the learn rate for the dual variable 𝛽. We note that in real-world applications, the budget 𝛽 is usually set beforehand (e.g., by resource constraints). Thus, dual ascent on 𝛽 gives an option to continually shrink the observation budget until a minimum set of sensed features is realized. However, this may not be needed in a realistic setting where the budget constraint is constant. The intuition behind the 𝛽 term follows by considering its extreme values. When 𝛽 → 0, the max-margin objective dominates. Consequently, the FS is incentivized to sense every feature, i.e. 𝐴𝑖,(𝑡)𝑗 = 1, ∀𝑖, 𝑗, 𝑡, leading to the trivial solution where 𝑋 𝐴 → 𝑋. In contrast, as 𝛽 → ∞ the budget constraint dominates, and the FS is incentivized to choose 𝐴𝑖,(𝑡)𝑗 = 0, ∀𝑖, 𝑗, 𝑡, leading to an empty feature set. Thus, the 𝛽 term serves as a trade-off parameter between 𝐴’s sparsity and sensing performance. The training loop for FS, PL, and budget-manager is summarized in Algorithm 2. 97 Algorithm 2 Co-training of FS and PL. 1: for number of total training iterations do 2: for 𝑘 training steps do 3: Sample mini-batch of 𝑚 samples {(𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 from training data. 4: Generate 𝑚 samples using 𝐺: 𝑚 {𝑋𝑖 − 𝑋𝑖 𝐺 (𝑋𝑖 ), 𝑦𝑖 }𝑖=1 . 5: Using Scheduled Sampling with 𝜀-Decay, update 𝜃 𝐹 by performing SGD on Eqn. 5.8 6: end for 7: for 𝑘 training steps do 8: Update 𝐺 parameters by projected gradient ascent on ∇𝜃 𝐺 max {𝑍 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝑍 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 , −𝜅}. 9: (Optional) Dual ascent on budget constraint 𝛽 according to 𝛽 ← 𝛽 + 𝛼(𝑋 − 𝑋 𝐺 (𝑋)). 10: end for 11: end for 5.3.0.4 Inference To perform inference, we train two sets of models for comparison. (1) A set of hypothesis models ℎ 𝑆 : 𝑋 ↦→ 𝑌 trained on the sensed dataset D𝑆 = {(𝐺 (𝑋𝑖 ) 𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 , generated using the FS. (2) A set of hypothesis models ℎ trained on the original dataset {(𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 . The difference in generalization error between ℎ 𝑆 and ℎ quantifies the feature sensing strategy learned by the FS. 5.4 Analysis of Proposed Method Optimizing FS, PL and Budget-Manager involves finding saddle points in a dynamic loss landscape. Here, we prove the existence of fixed points in the proposed minimax game and their correspondence with local Nash Equilibria (NE) [14]. At local NEs, FS and PL cannot improve allocation or accuracy by local changes in 𝐹 and 𝐺, respectively. 98 5.4.1 Existence of Local Nash Equilibria We can re-express Eqn. 5.5 into individual pay-off functions: • 𝐽𝐹 (𝜃 𝐹 ) = 𝐶𝐸 𝑦 [𝐹 (𝑋 𝐺 (𝑋))] gives the cost for the PL player. • 𝐽𝐺 (𝜃 𝐺 ) = {𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 }+ gives the cost for the FS, from the minimizer’s point of view. • 𝐽𝐵 (𝜃 𝐵 ) = Í 𝑖𝑗𝑘 𝐴𝑖 𝑗 𝑘 − 𝐵 defines the cost for the budget manager. Let us define 𝜉 := (𝜃 𝐹 , 𝜃 𝐺 , 𝛽)𝑇 as a set of actions for the PL, FS and budget manager. Here, we denote 𝜉1 = 𝜃 𝐹 ∈ Γ1 , 𝜉2 = 𝜃 𝐺 ∈ Γ2 , 𝜉3 = 𝛽 ∈ Γ3 as the subset of parameters in 𝜉 that individual players can modify at each iteration, and Γ𝑖 denotes the 𝑖−th player’s action space. We also define a potential function as a composition of the payoffs among the players: 𝜙𝑖𝑗 (𝜉) := max{0, 𝐽𝑖 (𝜉) − 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 )}, 𝑗 ∈ Γ𝑖 where 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ) defines the cost of another policy 𝑗 ≠ 𝜉𝑖 for the 𝑖−th player currently using 𝜉𝑖 parameters. 𝜉¬𝑖 indicating that other players retain the same actions. 
So the potential function defines the cost-of-switching from the current policy to another, defined over each player’s action space. Finally, let us define the best response function [14] for each player: 𝜉𝑖 + 𝜙𝑖𝑗 (𝜉) 𝑓𝑖 (𝜉𝑖 , 𝑗) := Í 1 + 𝑘∈Γ𝑖 (𝜙𝑖𝑘 (𝜉)) where 𝑗 ∈ Γ𝑖 corresponds to the alternative actions in the 𝑖-th action space. We now introduce the definition of Nash Equilibrium (NE) according to [14] for our problem: Definition 3. Nash Equilibrium. 𝜉 ∗ is a Nash Equilibrium point of 𝑓 (.) if it satisfies 𝐽𝑖 (𝜉𝑖 ) ≤ 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ), ∀𝑖, 𝑗. Lemma 1. Assuming that the domain of 𝜉 is continuous and convex, there exists a fixed point of 𝑓 , i.e., 𝑓𝑖 (𝜉𝑖∗ , 𝑗) = 𝜉𝑖∗ , ∀𝑖, 𝑗. 99 Proof. The proposed best response function maps from 𝑓 : Γ → Γ and is a continuous mapping. Furthermore, each sub-domain Γ𝑖 is convex and compact. Compactness can be achieved through regularization applied to 𝐹 and 𝐺, in the form of budget-constraint, which serves to bound the size of 𝜃 𝐺 according to a 𝛽. So each Γ𝑖 is closed and bounded, and by the Heine-Borel Theorem, each Γ𝑖 is compact in 𝑅 𝑛𝑖 . Thus, we can apply Brouwer’s fix point theorem from [15], and therefore ∃𝜉 ∗ s.t. 𝑓 (𝜉) at 𝑓 (𝜉 ∗ ) = 𝜉 ∗ . In other words, a fixed point exists. Lemma 2. A point 𝜉 ∗ is a fixed point of 𝑓 if and only if it is a Nash Equilibrium point. Proof. First, suppose 𝜉 ∗ is a NE point. Then we have 𝐽𝑖 (𝜉𝑖∗ ) ≤ 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ∗ ), ∀𝑖, 𝑗. So it follows that 𝜉𝑖∗ +0 ∀𝑖, 𝜙𝑖𝑗 (𝜉𝑖∗ ) = 0. Thus, we have 𝑓𝑖 (𝜉𝑖∗ , 𝑗) = 1+0 = 𝜉𝑖∗ , ∀𝑖, 𝑗, which defines a fixed point. Conversely, suppose 𝜉 ∗ is a fixed point, i.e., 𝑓 (𝜉 ∗ , 𝑗) = 𝜉 ∗ . Then for every 𝑖, there is at least one 𝜉𝑖∗ s.t. 𝐽𝑖 (𝜉𝑖∗ ) ≤ 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ∗ ) (by definition of fixed point). For such 𝜉 ∗ ’s, we have 𝜙𝑖𝑗 (𝜉 ∗ ) = 0 𝜉𝑖∗ + 0 =⇒ 𝑓𝑖 (𝜉𝑖∗ ) = 𝜉𝑖∗ = Í 𝑖 ∗ 1 + 𝑘 𝜙 𝑘 (𝜉 ) Õ =⇒ 𝜙𝑖𝑘 (𝜉 ∗ ) = 0 =⇒ 𝜙𝑖𝑘 (𝜉 ∗ ) ≡ 0, ∀𝑘 ≠ 𝑗 ∈ Γ𝑖 . 𝑘 We have shown that ∀𝜉 ∗ = 𝑓 (𝜉 ∗ ), such a point must also correspond to a NE point.  Theorem 1. There exists at least one Nash Equilibrium point 𝜉 ∗ such that each player’s best response does not deviate from 𝜉 ∗ with respect to the minimax objective in Eqn. 5.5. Furthermore, the fixed points of the minimax objective correspond to local Nash Equilibrium points. Proof. The result directly follows from combining Lemma 1 and 2 with Definition 3.  We note that although local NE’s exist under our formulation, we do not show admissibility [16] or interchangeability of the payoff values for such NE’s since the proposed game is non-zero sum. However, the NE property implies that small, local changes to the budget allocation and classifier decision boundary do not improve the performance under either objective. 100 5.4.2 Convergence of Algorithm 2 One interesting challenge of our problem is that 𝐺 and 𝐹 are updated at different time-scales. We now show that Algorithm 2 can converge to local NE’s, at least under some good initialization schemes around the neighborhood of NE’s. We summarize the main result below. Definition 4. The gradient operator for the FS and PL players is defined as follows: ©−∇𝜃 𝐹 𝐽𝐹 (𝜃 𝐹 ) ª ­ ® 𝜉 (𝑡+1) = 𝜉 (𝑡) + 𝛼𝜂(𝜉 (𝑡) ), 𝜂(𝜉) = ­𝜀∇𝜃 𝐺 𝐽𝐺 (𝜃 𝐺 ) ® ­ ® ­ ® ­ ® 0 « ¬ where 𝛼 is the learning rate hyperparameter, and 𝜀 > 0 denotes an off-set parameter that scales the update rate of the ∇𝜃 𝐺 component according to a faster time-scale. We can verify that when 𝜂¤ = 0 at some time 𝑇, we have 𝜂 ≡ 0 =⇒ 𝜉 (𝑡+1) = 𝜉 (𝑡), ∀𝑡 > 𝑇, which describes a fixed point 𝜉 ∗ . Theorem 2. 
Assuming the loss functions for each player is locally Lipschitz in 𝜉 and piecewise linear in 𝑡 near a Nash Equilibrium point, the error dynamics 𝜂¤ follows uniform asymptotic stability and converges toward the origin at a rate of: ||𝜂(𝑡)|| ≤ K (𝜆||𝜂(𝑡 0 )||, 𝑡 − 𝑡0 ), for some KL−class function (§4.4 in [17]) (K) and constant 0 < 𝜆 < ∞. Proof. Here, we make the following assumptions: • At a fixed budget constraint 𝛽, 𝜕𝛽 𝜕𝑡 = 0. 101 • 𝐽𝐹 and 𝐽𝐺 satisfy the following: 𝐽𝐹 (𝜃 𝐹 ) = 𝐿 𝐹 (𝐹 (𝑋 𝐺 (𝑋))) 𝐽𝐺 (𝜃 𝐺 ) = 𝐿 𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) ∇𝜃 𝐹 𝐽𝐹 (𝜃 𝐹 ) = 𝐿 0𝐹 (𝐹 (𝑋 𝐺 (𝑋))) · ∇𝜃 𝐹 𝐹 (𝑋 𝐺 (𝑋)) ∇𝜃 𝐺 𝐽𝐺 (𝜃 𝐺 ) = −𝐿 0𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) · ∇ 𝑋 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) · ∇𝜃 𝐺 𝐺 (𝑋) where 𝐿 𝐹 and 𝐿 𝐺 are loss functions according to the labeled sample pairs {(𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 . • 𝐿 𝐹 and 𝐿 𝐺 are bounded, locally Lipschitz in 𝜉 and piecewise continuous in 𝑡 near a local fixed-point equilibrium. We can thus express the error dynamics 𝜂¤ (i.e., the update rule of 𝜂) as follows © −∇2𝜃𝐹 𝐽𝐹 (𝜃 𝐹 ) −𝜀∇ 𝜃𝐹 , 𝜃𝐺 𝐽𝐹 (𝜃 𝐹 ) ª 𝜂¤ = ­­ ®𝜂 ® 𝜀∇ 𝜃𝐺 , 𝜃𝐹 𝐽𝐺 (𝜃 𝐺 ) 2 2 𝜀 ∇ 𝜃𝐺 𝐽𝐺 (𝜃 𝐺 ) « ¬ −∇2𝜃𝐹 𝐽𝐹 (𝜉 𝐹 ) = −𝐿 𝐹00 (𝐹 (𝑋 𝐺 (𝑋))) · ∇ 𝜃𝐹 𝐹 (𝑋 𝐺 (𝑋)) · ∇ 𝜃𝐹 𝐹 (𝑋 𝐺 (𝑋))𝑇 −∇ 𝜃𝐹 , 𝜃𝐺 𝐽𝐹 (𝜉 𝐹 ) = −𝐿 𝐹0 (𝐹 (𝑋 𝐺 (𝑋))) · ∇ 𝜃𝐹 𝐹 (𝑋 𝐺 (𝑋)) · (∇𝑋 𝐹 (𝑋 𝐺 (𝑋))) · ∇ 𝜃𝐺 𝐺 (𝑋))𝑇 0 ∇ 𝜃𝐺 , 𝜃𝐹 𝐽𝐺 (𝜉𝐺 ) = −𝐿 𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) · ∇𝑋 𝐹 (𝑋 𝐺 (𝑋)) · ∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐹 𝐹 (𝑋 − 𝑋 𝐺 (𝑋))𝑇 00 ∇2𝜃𝐺 𝐽𝐺 (𝜉𝐺 ) = −𝐿 𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) (∇𝑋 𝐹 (𝑋 𝐺 (𝑋)) · ∇𝑋 𝐹 (𝑋 𝐺 (𝑋)))𝑇 ∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐺 𝐺 (𝑋)𝑇 Next, around the equilibrium point at the origin, let us assume the following properties about ∇𝑋 𝐹 (𝑋), ∇ 𝜃𝐹 𝐹 (𝑋), ∇ 𝜃𝐺 𝐺 (𝑋) , i.e., the back-propagation terms: =⇒ ∃0 < 𝑐 1 < ∞ : ∇ 𝜃𝐹 𝐹 (𝑋) ≤ 𝑐 1 ||∇ 𝜃𝐹 𝐽𝐹 (𝜃 𝐹 )|| = 𝑐 1 ||𝜂1 || =⇒ ∃0 < 𝑐 2 < ∞ : ∇ 𝜃𝐺 𝐺 (𝑋) ≤ 𝑐 2 ||∇ 𝜃𝐺 𝐽𝐺 (𝜃 𝐺 )|| = 𝑐 2 ||𝜂2 || =⇒ ∃0 < 𝛾 < ∞ : ||∇𝑋 𝐹 (𝑋)|| ≤ 𝛾. 102 Let us also assume that ∃𝑐 3 > 0 : ||𝐿 𝑖0 (𝜉)|| < 𝑐 3 , and ∃𝑐 4 > 0 : ||𝐿 𝑖00 (𝜉)|| < 𝑐 4 , that is the first and second order changes are bounded around the origin. We can thus re-express 𝜂¤ as follows: −∇ 𝜃𝐹 , 𝜃𝐺 𝐽𝐹 (𝜉 𝐹 ) = ∇ 𝜃𝐹 𝐽𝐹 (𝜉 𝐹 ) · ∇𝑋 𝐹 ( 𝑋) ˜ · ∇ 𝜃𝐺 𝐺 (𝑋)𝑇 ∇ 𝜃𝐺 , 𝜃𝐹 𝐽𝐺 (𝜉𝐺 ) = −∇ 𝜃𝐺 𝐽𝐺 (𝜉𝐺 )∇ 𝜃𝐹 𝐹 ( 𝑋) ˆ 𝑇 00 ∇2𝜃𝐺 𝐽𝐺 (𝜉𝐺 ) = −𝐿 𝐺 (𝐹 ( 𝑋)) ˆ · (∇𝑋 𝐹 ( 𝑋)∇ ˆ 𝑋 𝐹 ( 𝑋))ˆ 𝑇 ·∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐺 𝐺 (𝑋)𝑇 . © −𝐿 𝐹00 (𝐹 ( 𝑋)) ˜ · ∇ 𝜃𝐹 𝐹 ( 𝑋)∇ ˜ 𝜃𝐹 𝐹 ( 𝑋) ˜ 𝑇 𝜂1 + 𝜀∇𝑋 𝐹 ( 𝑋)˜ · ∇ 𝜃𝐺 𝐺 (𝑋)𝑇 𝜂1 𝜂2 ª ­ ® =⇒ 𝜂¤ = ­ ­ ® ® ­ ® ­ ® −𝜀∇ 𝐹 ( ˆ 𝑋) 𝑇 𝜂 𝜂 − 𝜀 2 𝐿 00 (𝐹 ( ˆ 𝑋)) · (∇ 𝐹 ( ˆ 𝑋)∇ 𝐹 ( ˆ 𝑋)) 𝑇 ∇ 𝐺 (𝑋)∇ 𝐺 (𝑋) 𝑇 𝜂 « 𝜃 𝐹 1 2 𝐺 𝑋 𝑋 𝜃𝐺 𝜃𝐺 2 ¬ Finally, we let us consider the Lyapunov function candidate: 1 1 𝑉 (𝜂) = 𝜂12 + 𝜂22 2 2 which satisfies the property 𝑉 (𝜂) > 0, ∀𝜂 ≠ 0. We can also see that 𝑉¤ (𝜂) < 0: 𝜕𝑉 (𝜂) 𝑇 𝜕𝜂 𝑉¤ (𝜂) = 𝜕𝜂 𝜕𝑡 𝑇 ©𝜂1 ª ©𝜂¤1 ª = ­­ ®® ­ ® ­ ® 𝜂 𝜂¤ « 2¬ « 2¬ = −𝐿 𝐹00 (𝐹 ( 𝑋)) ˜ · ∇ 𝜃𝐹 𝐹 ( 𝑋)∇ ˜ 𝜃𝐹 𝐹 ( 𝑋) ˜ 𝑇 𝜂2 + 𝜀∇𝑋 𝐹 ( 𝑋) 1 ˜ · ∇ 𝜃𝐺 𝐺 (𝑋)𝑇 𝜂2 𝜂2 1 − [𝜀∇ 𝜃𝐹 𝐹 ( 𝑋) ˆ 𝑇 𝜂1 𝜂2 − 𝜀 2 𝐿 00 (𝐹 ( 𝑋)) ˆ · (∇𝑋 𝐹 ( 𝑋)∇ ˆ 𝑇 ∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐺 𝐺 (𝑋)𝑇 𝜂2 ] ˆ 𝑋 𝐹 ( 𝑋)) 2 𝐺 2 ≤ −𝑐 3 𝑐21 𝜂14 − (𝑐 1 − 𝛾𝑐 2 )𝜂12 𝜂22 − 𝜀 2 𝑐 4 𝑐22 𝛾 2 𝜂24 . Letting 𝜆 = min{𝑐 3 𝑐21 , 𝜀 2 𝑐 4 𝑐22 𝛾 2 }, we have 𝑉¤ (𝜂) ≤ −𝜆||𝜂|| 4 − (𝑐 1 − 𝛾𝑐 2 )𝜀𝜂12 𝜂22 𝛾𝑐 2 𝑉¤ (𝜂) ≤ −𝜆||𝜂|| 4 < 0, ∀ < 1. 𝑐1 By Theorem 4.2 of [17], we have uniform asymptotic stability for the error dynamics around the NE, implying that the error term 𝜂(𝑡) → 0 at the rate of ||𝜂(𝑡)|| ≤ K (𝜆||𝜂(𝑡0 )||, 𝑡 − 𝑡0 ) 103 for some KL−class function K and constant 𝜆 = min{𝑐 3 𝑐21 , 𝜀 2 𝑐 4 𝑐22 𝛾 2 }. Theorem 4 assumes an initialization for 𝜉 that is close to a local equilibrium point 𝜉 ∗ , which motivates the use of scheduled sampling on the PL model inputs as a pre-training step. 
We also allow a fully-observed 𝐴𝑖(1)𝑗 = 1, ∀𝑖, 𝑗 at 𝑡 = 1 so that the FS observes a full set of features, as a good initialization of the generation process depends heavily on the initially observed features. 5.5 Experiments To gain intuition into the precision sensing mechanism, we first compare with several classic feature selection schemes on synthetic data. We then compare performance of precision sensing with SOTA active sensing methods on real-world medical data. 5.5.1 Synthetic Experiments We evaluate the ability of the budget-constrained precision sensing (BCPS) to retrieve the underlying temporal patterns on a set of synthetic time-series data. We generate a synthetic dataset consisting 1000 , 𝑋 ∈ R100×10 (100 time-steps, 10 features), 𝑦 ∈ {0, 1}. of time-series data with labels: {𝑋𝑖 , 𝑦𝑖 }𝑖=1 𝑖 𝑖 The temporal signals were adapted from the UCI ML dataset [18], specifically the Pseudo Periodic Synthetic Time Series Data Set. This dataset features 100, 000 samples of univariate time-series data, generated from the stochastic periodic function: 7 Õ 1 𝑓 (𝑥) = sin{2𝜋(22+𝑖 + rand(2𝑖 ))𝑐}. (0 ≤ 𝑐 ≤ 1) (5.10) 𝑖=3 2 We expand this dataset by applying the generator equation across multiple channels (10), each with a different initialization of 𝑐. We sub-sampled 1000 of the generated time-series and aligned the data such that each sample contained 100 time-steps and 10 features. We generate labels 𝑦𝑖 according to    1   ||𝐶 (𝑋𝑖 )|| > 0 𝑦𝑖 =  ||𝐶 (𝑋𝑖 )|| ≤ 0,  0   where 𝐶 (.) masks all features of 𝑋 across time except for certain features at certain times. For example, one configuration of 𝐶 may emphasize the 0 − 5𝑡ℎ features from time-steps 20 − 40 and the 9 − 10𝑡ℎ features from time-steps 60 − 80. All other features at all other time-steps are masked with 104 Figure 5.2: Recovery of feature relevance (𝐶) by BCPS on synthetic data. (a) Ground Truth (b) BCPS (c) Ground Truth (d) BCPS high negative noise sampled from a Gaussian distribution. Our goal is to recover the true underlying 𝐶 (unmasked features) using the BCPS framework. 5.5.1.1 Recovery of 𝐶 using BCPS Table 5.1 gives an overview of the BCPS performance in recovering the relevant features posed by 𝐶 on the synthetic dataset. Here, the sparsity-level refers to the percentage of masked features in 𝐶 (ground-truth). We see that the specificity of the recovered features by BCPS remains high, despite the increase in budgetary constraints. In general, the performance drops are not significant between different levels of sparsity in 𝐶. Figure 5.2 visualizes the performance of BCPS on several test-set samples with 80% sparsity in 𝐶. We see that BCPS chooses to allocate sensing to early time-steps, and then accurately locates the largest block of key observations in 𝐶 related to predicting the target. From there, BCPS decides whether to continue to observe future time-steps, as shown by Fig. 5.2d, or stop observations when it is confident in the prediction, as shown by Fig. 5.2b. Interestingly, BCPS is capable of locating the key observation blocks in 𝐶 despite the discontinuity of the key block from the original time-steps, as demonstrated by Figs. 5.2b and 5.2d. Table 5.1: BCPS recovery performance vs. 
sparsity of sensing budget on synthetic data Sparsity AUC F1 Sensitivity Specificity 20% 99.0 ± 0.0 98.1 ± 1 97.3 ± 1.1 99.7 ± 0.0 40% 98.2 ± 2.1 94.5 ± 3 90.2 ± 4.2 98.6 ± 1.0 60% 96.4 ± 1.1 91.7 ± 1 87.6 ± 3.1 97.9 ± 1.2 80% 94.4 ± 3.0 90.9 ± 1 84.4 ± 3.8 97.0 ± 1.1 105 Table 5.2: Comparison of predictive power of various sensing methods on synthetic data, under 80% sparsity in 𝐶 Model AUC F1 Sensitivity Specificity Temporal Feature Selection 89.5 ± 3.9 81.2 ± 4.0 77.6 ± 4.8 85.7 ± 2.8 Self-Attention 93.3 ± 3.4 88.3 ± 3.2 87.0 ± 3.5 94.4 ± 2.0 BCPS 94.0 ± 0.9 90.4 ± 0.8 84.5 ± 3.6 96.9 ± 1.1 5.5.1.2 Comparison Against Other Sensing Methods We compare BCPS against other sensing strategies on the synthetic dataset. Specifically, we investigate (1) classification difference between sensing mechanisms, and (2) how they differ in recovering feature contributions from 𝐶. We consider the following alternative models: • Temporal Feature Selection: GRU model with ℓ1 −regularization applied to input weights at each time-step. • Attention-Based Sensing: Self-attention model with a slight modification:   𝐴𝑖, 𝑗,𝑘 ≠ 0  1   𝐴𝑖, 𝑗,𝑘 =   0  Otherwise.  Because self-attention produces activations in the continuous [0, 1] range, we make this adjustment to use the non-zero activations as a proxy for feature utilization in self-attention. Table 5.2 compares the performance of BCPS against other sensing methods on the synthetic data. We fix the sparsity in 𝐶 to be 80% for all models and compare prediction performance and sensing patterns. As expected, BCPS and self-attention sensing strategies performed notably better than temporal feature selection since the relevant features of 𝐶 shifts across time. The difference in the allocation of observations between BCPS and attention-based sensing is relevant, so we further investigated it. In particular we compared the sensing patterns in BCPS and attention activations against the ground truth 𝐶 under a fixed sparsity pattern, see Table 5.2. Scores are averaged over 5 evaluation runs and reported in the format Avg. ± STD. We considered a 𝐶 with some interesting properties: 106 Figure 5.3: Heatmap comparisons of sensed features under a fixed 𝐶. Feature number 0-9 (x-axis) vs. Timesteps 0-96 (y-axis). (a) Ground Truth (b) BCPS (c) Attention Activations • 𝐶 has multiple, discontinuous blocks of key features whose observations are critical to the final prediction of 𝑦. • Different groups of features are prioritized in each key observation block. • Multiple groups of features can be prioritized at the same time-step. We see from Fig. 5.3 that only BCPS is capable of recovering the true underlying sets of key features in 𝐶. The attention activations, on the other hand, cannot locate the discontinuous blocks of contributory features, and as a result always uses a high observation budget, i.e., most features are observed frequently across time, regardless of the true underlying sparsity in 𝐶. We also see from Fig. 5.3b that although there exists a very large divide between the first block and future time-steps, the FS model sparingly allocates observations across the large gap of non-contributory features before allocating large amounts of observation resources toward the last few time-blocks. 5.5.2 MIMIC-III Experiments We also evaluate our framework on MIMIC-III [19], a publicly available electronic health records (EHR) dataset using temporal data collected from 22, 830 adult patients. For the prediction task, we choose In-Hospital Mortality as the target. 
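As a rough sketch of how a single admission could be turned into a training example for this task, assuming the temporal features have already been resampled to an hourly grid (the data-frame schema and function are hypothetical and stand in for the actual MIMIC-III extraction pipeline):

```python
import pandas as pd

OBS_WINDOW = 48  # first T = 48 hours of each admission are used as input

def build_mortality_example(hourly: pd.DataFrame, died_in_hospital: bool, feature_cols):
    """Assemble one (X_i, y_i) pair for the in-hospital mortality task.

    `hourly` is assumed to hold one admission's temporal features, resampled to
    an hourly grid and indexed 0, 1, 2, ... hours from admission time.
    """
    X_i = hourly.loc[:OBS_WINDOW - 1, feature_cols].to_numpy()  # hours 0..47 -> (48, n)
    y_i = int(died_in_hospital)        # label: death at any point after the window
    return X_i, y_i
```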
Given the first 𝑇 = 48 hours of observations (𝑥𝑖(1), . . . , 𝑥𝑖(𝑇)), mortality risk assessment predicts mortality risk from 𝑇 + 1 until the end of the hospital stay. Because patients have variable lengths of stay (LOS), we aligned the temporal sequences of patients by the start of their hospital admit time and excluded patients whose LOS was less than 24 hours. We considered a set of 19 temporal features, split between metabolic laboratory panels (7), specialized hematologic tests (6), and vital signs (6).

5.5.2.1 Baseline Classifiers

We consider a wide variety of hypothesis model classes for inference:
• Logistic Regression (LR): non-deep baseline.
• Multi-layer Perceptron (MLP): 3 fully-connected layers with 128 hidden units each.
• Gated Recurrent Units (GRU): a bi-directional GRU with 2 hidden layers and 128 nodes per layer.
• Self-Attention Model (Attn): a modified self-attention [20] model, where the inputs at each time-step are modified by an attention mechanism [21] that combines pertinent information from previous time-steps to attend to the most relevant part of the current feature space.

5.5.2.2 Performance on MIMIC-III

Table 5.3 summarizes the performance of the inference models conditioned on the original dataset. Scores are averaged over 5 evaluation runs and reported in the format Avg. ± STD. We see that the temporal information is highly relevant to the mortality task. This is indicated by the superior performance of the recurrent models (GRU and self-attention GRU), which achieve higher performance across all metrics compared to the non-temporal models (LR and MLP). The self-attention model achieves the highest baseline performance, though the vanilla GRU results are comparable.

Table 5.3: Baseline performance of inference models conditioned on the original data 𝑋
Model   AUC          F1           Sens.        Spec.
MLP     74.0 ± 5.5   37.4 ± 4.2   65.0 ± 7.8   72.8 ± 3.3
LR      58.4 ± 0.7   28.3 ± 1.9   18.8 ± 1.7   97.9 ± 0.3
GRU     84.7 ± 1.2   45.2 ± 2.6   78.4 ± 3.3   74.8 ± 4.1
Attn    85.1 ± 1.8   45.5 ± 1.6   76.0 ± 2.2   79.8 ± 2.6

Table 5.4: Performance of inference models conditioned on the masked data 𝑋 ⊙ 𝐴
Model   AUC          F1           Sens.        Spec.
MLP     74.2 ± 5.2   36.2 ± 5.1   69.6 ± 2.8   67.3 ± 6.2
LR      57.3 ± 0.7   25.6 ± 1.9   16.5 ± 1.5   98.2 ± 0.2
GRU     81.3 ± 2.2   42.4 ± 2.5   73.0 ± 5.4   74.4 ± 4.3
Attn    81.7 ± 1.2   41.1 ± 3.0   75.5 ± 3.2   72.6 ± 1.4
PL      82.9 ± 1.3   44.5 ± 2.4   74.3 ± 3.5   74.1 ± 1.2

5.5.2.3 Performance on Masked Data

Table 5.4 illustrates the performance of the same set of inference models trained and tested on the incrementally sensed data {𝑋𝑖 ⊙ 𝐺 (𝑋𝑖 ), 𝑦𝑖 }, 𝑖 = 1, . . . , 𝑚. The 𝐺 (𝑋) used for Table 5.4 is trained under a budgetary constraint of 58.6% of the available features across time. From Table 5.4, we see that both recurrent models (GRU and Attn) show a notable decrease in performance, with their AUCs dropping by 2–4%. Conversely, the non-temporal models maintain their performance. Again, scores are averaged across 5 evaluation runs. This contrast between temporal and non-temporal models reveals that 𝐺 (𝑋) specifically preserves the temporal patterns in 𝑋 related to the prediction task at hand. We also see that the PL achieves the highest performance on the sensed data, despite having the same architecture as the GRU models. This is likely because the PL is trained on a variety of sparsity patterns in 𝐺 (𝑋) during co-training.
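The protocol behind Tables 5.3 and 5.4 — train the same hypothesis class once on the original data and once on the sensed data, then compare held-out performance — can be sketched as follows. This is an illustrative re-implementation under stated assumptions (scikit-learn models and flattened inputs, mimicking the non-temporal baselines), not the experiment code used for the tables above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def evaluate_on_sensed(X, A, y, make_model, test_frac=0.2, seed=0):
    """Train/evaluate the same model class on the original X and on sensed X*A.

    X, A: arrays of shape (m, T, n); y: binary labels of shape (m,).  Temporal
    arrays are flattened, so this mimics the LR/MLP-style baselines rather
    than the recurrent ones.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(len(y) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]

    scores = {}
    for name, data in [("original X", X), ("sensed X*A", X * A)]:
        flat = data.reshape(len(y), -1)
        model = make_model().fit(flat[tr], y[tr])
        scores[name] = roc_auc_score(y[te], model.predict_proba(flat[te])[:, 1])
    return scores

# Usage with toy data; the AUC gap quantifies what the sensing strategy gives up.
X = np.random.randn(500, 48, 19)
A = (np.random.rand(500, 48, 19) > 0.4).astype(float)   # roughly 60% of entries sensed
y = (X[:, :, 0].mean(axis=1) > 0).astype(int)
print(evaluate_on_sensed(X, A, y, lambda: LogisticRegression(max_iter=1000)))
```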
5.5.3 Sparsity Trade-offs

We further investigate the relationship between the severity of the budget constraint, i.e., the sparsity of the generated sensing tensor, and the performance obtained on the resulting dynamically sensed data. Here, we define the budget cost as the percentage of features utilized during the observation period (𝑇 = 48 hours):

BudgetCost = Σ𝑖 Σ𝑗 Σ𝑘 𝐴𝑖,𝑗,𝑘 / (𝑚 × 𝑡 × 𝑛),

where we assume 𝑋 ∈ R𝑚×𝑡×𝑛. Fig. 5.4 summarizes the trade-off between performance and budget constraints. Notably, the budget cost decreases from 89.3% to 58.3% while the AUC and F1-scores decrease by less than 0.02. The greatest drop in performance occurs from 12.1% → 6.5%. This is presumably because at 6.5%, the FS allocates the observation budget to the first time-step only, sacrificing all of the temporal information within the data. Thus, we can see from Fig. 5.4 that 𝛽 ∈ [1e-6, 1e-4] represents an optimal trade-off range. In practice, the 𝛽-constraint is usually predetermined based on real-world constraints. For example, a well-funded hospital may have access to a large set of observations, say 𝛽 = 5e-6 (less constrained), whereas an under-funded clinic may only be able to observe at the level of 𝛽 = 5e-5 (highly constrained).

Figure 5.4: Trade-off between Budget Constraint (x-axis) vs. predictive power (y-axis). Budget denotes the strength of budgetary constraint 𝛽.

5.5.4 Comparison Against SOTA Methods

Finally, we compare BCPS against other SOTA sensing strategies for temporal data:
• Deep Sensing [22]: Active sensing is done through an adaptive sampling scheme. Inference is done using interpolation and imputation layers for the missing values.
• Dynamic Measurement Scheduling [23]: Active sensing is done by a dueling DQN network, and inference is done using a pre-trained recurrent network.

We implement Deep Sensing (both single and multiple model settings) with the hyperparameter settings described in the Appendix Section of [22]. For Dynamic Measurement Scheduling, we run Algorithm 3 without meta-data, i.e., without age, demographics, or disease history, to obtain the simulated dataset. In both cases, we use the same set of features and the same 48-hour observation window. For each active sensing model we perform the following:
1. Pre-train a GRU model.
2. Synthesize new datasets using the active sensing strategy.
3. Test inference models (M-RNN for Deep Sensing) on the synthesized datasets.

In Table 5.5, we report the best performance for each method with a 50–60% observation budget, which is the reported low end for the Deep Sensing work [22]. We use the 50–60% budget range because each work uses its own hyperparameter (e.g., threshold 𝜏 in [22] and 𝜆 in [23]), so it is difficult to translate the threshold parameters to exact percentages of sparsity. Thus, we report the best performance within the 50–60% observation range of each model for fairness. From Table 5.5, we can see that the combination of FS (for data generation) and PL (for adaptive classification) maintains the best level of performance at higher budget constraints. Dynamic Scheduling produces results comparable to Deep Sensing. We also plot the AUC-gain vs. sparsity trade-off for each approach in Figure 5.5. For that experiment, we fix a pre-trained GRU to deliver the predictions at each sparsity level for the data generated by each method. Our FS model produces measurements that maintain a higher level of performance at each sparsity level.
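For reference, the budget cost used throughout these comparisons is simply the fraction of sensed (sample, time-step, feature) entries, and the running cost 𝐶𝑡 from the Budget Update rule generalizes it to unequal per-feature costs 𝑐𝑗. A minimal sketch, assuming the sensing tensor is stored as an (m, T, n) array and with illustrative function names:

```python
import numpy as np

def budget_cost(A: np.ndarray) -> float:
    """Fraction of (sample, time-step, feature) entries sensed: sum(A) / (m*T*n)."""
    return float(A.mean())

def cumulative_cost(A: np.ndarray, feature_costs: np.ndarray) -> np.ndarray:
    """Running observation cost C_t when features have unequal costs c_j
    (e.g. some lab tests are more expensive); returns one value per time-step."""
    per_step = (A * feature_costs).sum(axis=(0, 2))   # cost incurred at each t
    return np.cumsum(per_step)

A = (np.random.rand(500, 48, 19) > 0.4).astype(float)
print(f"budget cost: {budget_cost(A):.1%}")
print(cumulative_cost(A, feature_costs=np.ones(19))[-1])  # equals A.sum() with unit costs
```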
It is also notable that without variance reduction (e.g., using multiple models), the imputation-prediction (M-RNN) framework of Deep Sensing has a large drop off, especially with decreasing observation budget. Dynamic Scheduling performs best at around 25%, which is consistent with the reported performance in [23]. However, we note that its performance plateaus at increasing levels of observation. Table 5.5: Comparison of precision sensing against SOTA active sensing method with sparsity constraint of 50 − 60% Sensing Model Inference AUC F1 Deep Sensing [22] M-RNN 76.0 38.5 Dynamic Scheduling [23] GRU 78.7 41.3 Ours (FS) Ours (PL) 82.9 44.5 111 Figure 5.5: Comparison of AUC-gain against SOTA models. 5.6 Conclusion In this chapter we formalize the precision sensing problem as an extension of active sensing and illustrated its application in mortality risk prediction. Our formulation consists of a minimax game between a dynamic feature sensor and an online classifier that reaches a local Nash equilibrium during co-training. We empirical show that the sensing strategy is capable of maintaining predictive performance while satisfying budget constraints. However, one limitation is that although we show local Nash equilibriums exist, we cannot make statements about their admissibility and interchangeability in payoff values [16]. Future works might consider other scheduled sampling techniques for the PL model, which can greatly affect training stability and sample complexity under the proposed setting. 112 BIBLIOGRAPHY 113 BIBLIOGRAPHY [1] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 432–440. SIAM, 2016. [2] Fengyi Tang, Cao Xiao, Fei Wang, and Jiayu Zhou. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open, 2018. [3] Abhyuday N Jagannatha and Hong Yu. Bidirectional rnn for medical event detection in electronic health records. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2016, page 473. NIH Public Access, 2016. [4] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016. [5] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017. [6] Shipeng Yu, Balaji Krishnapuram, Romer Rosales, and R Bharat Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009. [7] Mengying Sun, Fengyi Tang, Jinfeng Yi, Fei Wang, and Jiayu Zhou. Identify susceptible locations in medical records via adversarial attacks on deep predictive models. arXiv preprint arXiv:1802.04822, 2018. [8] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Ead: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [9] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [10] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. 
In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017. [11] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998. 114 [12] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015. [13] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011. [14] John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951. [15] Ky Fan. Fixed-point and minimax theorems in locally convex topological linear spaces. Proceedings of the National Academy of Sciences of the United States of America, 38(2):121, 1952. [16] João P Hespanha. Noncooperative game theory: An introduction for engineers and computer scientists. Princeton University Press, 2017. [17] Hassan K Khalil. Nonlinear control. Pearson New York, 2015. [18] Arthur Asuncion and David Newman. Uci machine learning repository, 2007. [19] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3, 2016. [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. [21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015. [22] Jinsung Yoon, William R Zame, and Mihaela van der Schaar. Deep sensing: Active sensing using multi-directional recurrent neural networks. 2018. [23] Chun-Hao Chang, Mingjie Mai, and Anna Goldenberg. Dynamic measurement scheduling for event forecasting using deep rl. arXiv preprint arXiv:1901.09699, 2019. 115 CHAPTER 6 PERSONA AUTHENTICATION 6.1 Introduction In recent years, one promising approach to diverse and personalized dialog generation has been persona models [1, 2, 3] which embed the so-called “persona” information (e.g., name, gender, and self-descriptions) into neural conversational agents. While the goal of persona modeling is to achieve human-level response diversity and character consistency, a critical yet often overlooked factor is the sequence of prompts used to induce diversity in generated responses. Consider the toy example in Table 6.1. Certain sequences of question prompts (from the verifier) create a trail of generated responses that reveal more persona information, compared to non-specific conversation (the random policy). Yet it is unclear a priori which sequence(s) of questions most effectively reveal the dialog agent’s underlying persona. Moreover, a set of questions may be effective for one persona but fail for others. Currently, long interactions with humans are necessary to gain insight into persona model characteristics such as authenticity [3, 4], diversity [2], and engagement [3]. In this work we present a learning approach for interacting with conversational agents. 
Specifically, we introduce the persona authentication problem, where a model estimates the persona information of an input agent by learning to deliver a sequence of questions that progressively reveal more information about the agent throughout the course of the dialog. This is difficult because exact search through the space of possible question sequences is infeasible. Therefore a model must adaptively prune its set of potential questions based on the dialogue agent’s responses. We further decompose persona authentication into two parts: persona identification, which is inferring a set of persona features from a given dialog trajectory, and persona verification, the problem of finding a second conversational model – we call it a question policy – to elicit dialog trajectories for persona identification. To address the intractability of exact search through the space 116 Table 6.1: Persona model responses can differ greatly depending on input questions Persona Descriptions 1. I am a construction worker. 2. I enjoy building houses. 3. I have 5 cats. 4. my cats are very special to me. Role Response Role Response Verifier hello! what kind of work do you do? Random hello how are you today? Model i build houses. Model great! i just got back from work. Verifier that’s awesome. what do you do Random me too. i’m a teacher at a high school. outside of work? Model i like to spend time with my cats. Model cool, what grade do you teach? of dialog trajectories, we introduce a computationally tractable algorithm and show its asymptotic convergence (in cumulative conversations) toward the full persona identification objective. The key contributions of this work can be summarized as follows: • We introduce the authentication loss and show that estimators trained to convergence under this objective maximize the mutual information between dialog history and persona. • Based on the authentication loss we learn a dialog verification model that effectively generates question sequences to distinguish the persona of input models. Empirically, we show that the question policies of the verification model adapt to out-of-distribution personas. • We present a way to incorporate question policies into language model (LM) based dialog models, e.g., GPT-2, without sacrificing the felicity and consistency of the original LM model. 6.2 Why Model Persona? One of the long-term goals of this work is toward general conversational authentication. Biometric authentication has traditionally relied on physical measures such as fingerprints and facial recognition to determine the identity of human beings [5, 6]. Importantly, physical features such as facial structures and fingerprints are largely stationary; a standard procedure can be used to collect stationary features and scaled to entire populations without the need for personalization. By contrast, “soft” biometrics such as using personal information to verify identity require some degree of personalization. For example, password retrieval or reset procedures often ask users to answer 117 security questions for which they design, or to answer a set of adaptive tests such as visual Captchas or re-tagging photos. While these tests are personalized, they are not adaptive in the sense that the test itself is predetermined before it is delivered to the user. To this end, conversational authentication can be viewed as an adaptive extension of traditional biometrics based authentication. 
In particular, we are interested in finding an algorithmic approach to deliver automated and adaptive tests to survey whether the conversational behavior of an input agent (e.g., human users, spam bots, etc.) matches a set of personal information provided by the user (which roughly corresponds to “persona facts” in the PersonaChat dataset). Given such an authenticator, we can explore a much larger space of tests (e.g., the space of adaptive questions rather than the space of predefined personalized questions) to survey more complicated behavior. This is especially relevant if we want to generalize beyond simple persona facts such as “I have a dog named radar” to more complicated facts such as the capacity to deliver satirical responses or the use of specific generational grammatical structures (e.g., millennial or “Gen Z” lexicon). In other words, we see general conversational authentication as the linguistic analog of visual captcha turing tests. Toward this goal, persona authentication – learning a policy to deliver identifying questions – is a necessary stepping stone because it presents an algorithmic approach (through dialog policy) as opposed to human interviewers using handcrafted linguistic features, which is the only alternative at the moment, and does not scale. 6.2.0.1 Applications One potential application of conversational authentication is speaker verification for conversational agents, human or chatbot. In many real-world settings, speaker information such as audio and video may not be readily available. In such cases, the verifier network provides a way of speaker identification via text. One can think of persona verification as a way of obtaining a linguistic “fingerprint” of speakers based on the manner in which they converse under different question policies. For example, human speakers seeking access to personal data may go through a short conversation with the authenticator in order to see whether the person trying to access private data 118 has the correct identity. Verification is a critical issue in the modern era of cybersecurity. Consider for example the arrival of Deep Fakes [7] – synthetically generated videos of people doing actions that may be outside the context of their persona. The use of only voice and video identifiers may not be enough to truly assess whether a person’s physical features match with their actions. In this regard, we try to introduce the idea that the problem of verification may entail much more than just matching physical / biological features. This is why we approach the problem from the point of a question policy, a verification process that is dynamic and stochastic rather than static and deterministic. In the latter case, technical advances in modern AI can “game” physical features which are fixed points in some classifier space. In the former case, however, an impersonator must do much more. To fool a dynamic authentication procedure, one has to find fixed points in policy space, which involves sequential decision-making rather than one-time classification using facial recognition and fingerprint features. 6.2.0.2 Authenticator Example To give a concrete example of a use case, we provide an example conversation of conversational authentication in which an authenticator surveys the persona facts relevant to a user who engages in an 8 turn conversation. In this example, the user is a persona model (personalized chatbot) rather than a human user. 
Table 6.2 shows an example conversation between the learned authenticator model and an input persona model (PersonaGPT). Turns 1 and 8 highlight the importance of incorporating question codes as context for decoding: rather than hard-coding questions, an authenticator must also answer questions from the PersonaGPT model so as not to disrupt the flow of conversation. Since the authenticator is not given any persona facts, it must make up answers to PersonaGPT's questions on the fly. For example, the authenticator states that its favorite color is green and that it is unemployed in order to answer queries in addition to proposing questions. In this particular conversation, the authenticator question codes are: talk about work for the first 3 turns, ask about hobbies for the next 2 turns, talk about politics for turn 6, talk about travel for turn 7, and talk about age and gender for turn 8. When question codes are repeated, the authenticator model is able to avoid asking the same question twice; instead, it expands upon the existing knowledge in the conversation (e.g., ex-military, works in a shipyard). Note that, when used as context, the question codes function as one-to-many mappings. For example, "talk about travel" may correspond to a much larger set of topics than traveling as a hobby: the authenticator may ask where the input agent lives, where they are from, etc. Similarly, "ask about age and gender" has much more to do with questions related to identity and appearance than with age and gender alone.

Table 6.2: An example of interaction between the learned authenticator policy and PersonaGPT over an 8-turn conversation

Authenticator: hello there what kind of work do you do?
PersonaGPT: hi there i am ex military. what do you do for a living?
Authenticator: nothing much. what do you like about your job?
PersonaGPT: i am a shipyard worker.
Authenticator: what's your dream job?
PersonaGPT: to be a navy seal one day.
Authenticator: that sounds exciting! do you have any hobbies?
PersonaGPT: my hobby is volunteering with the bunny rescue.
Authenticator: do you have any other hobbies?
PersonaGPT: yes i help out at the local bunny rescue.
Authenticator: do you follow politics?
PersonaGPT: yes i follow politics. i donate a part of my salary to charity each month.
Authenticator: where are you from?
PersonaGPT: i'm from the united states. what's your favorite color?
Authenticator: my favorite color is green. how old are you?
PersonaGPT: i'm 20 what about you?
Persona Facts: 1. I donate a part of my salary to charity each month. 2. I work fulltime in a shipyard. 3. I volunteer my time with a local bunny rescue. 4. I'm ex military. 5. My favorite color is navy blue.

6.2.0.3 Challenges

Table 6.2 reveals two main challenges in learning an adaptive authentication policy:

1. At the turn level, a policy must be able to incorporate turn-level goals into its decoding. However, the decoding algorithm has to remain open-domain in order to be flexible enough to handle a wide range of conversational topics and styles. For this reason, we turn to generative decoding rather than ranking models.
2. An objective function has to capture the relationship between the responses generated by a policy and persona facts at the dialog level. One can think of the dialog-level objective function as informing turn-level goals to guide the decoding process.

6.3 Problem Setting

6.3.1 Notation

Let D = {τ_i}_{i=1}^{n} be a set of dyadic dialog samples.
Each dialog follows the form τ = {X, Y}, where X = (X_t)_{t=1}^{T} denotes the sequence of source responses and Y = (Y_t)_{t=1}^{T} denotes the sequence of target responses. Each response is composed of a sequence of tokens, represented as (x^{(t)}_k)_{k=1}^{K} (source tokens) and (y^{(t)}_k)_{k=1}^{K} (target tokens). To be consistent with state-of-the-art (SOTA) dialog model decoders [8, 9], we use Byte-Pair Encoding (BPE) [10] for tokenization. Additionally, T signifies the maximum number of turns in a dialogue sample and K the maximum number of tokens per response. If a response consists of k < K tokens, then we take tokens k+1 through K to be empty ("PAD") tokens (with similar logic applying to dialog turns). As a shorthand, we write τ_t to denote the dialogue trajectory (X_1, Y_1, ..., X_t, Y_t) up to turn t, and Y_{1:t} to signify the sequence of responses Y_1, ..., Y_t. Similarly, y_{1:k} represents the ordered sequence of tokens up to token y_k.

6.3.2 Persona Identification

The standard objective of persona models is

\max_{Y_t} \log p(Y_t \mid X_t, \tau_{1:t-1}, P_Y),   (6.1)

where P_Y is the set of persona descriptions for the dialog agent. Zhang et al. [2] and the ConvAI2 challenges [11] provided numerous ways to incorporate persona information into the dialogue generation process. Recently, generative persona models [12, 8] have been shown to be effective at contextualized decoding by incorporating the persona P as language model context. Due to their effectiveness, we only consider generative persona models in this chapter. To identify a persona from a given trajectory, we formulate the persona identification problem:

Problem 2 (Persona Identification). Given an input dialogue trajectory τ, find the persona P that maximizes the mutual information between P and τ. More formally, the optimization objective is

\max_{P} I(P, \tau) = \max_{P} H(\tau) - H(\tau \mid P) = \min_{P} H(\tau \mid P),   (6.2)

where H(·) is entropy and P ∈ R^m is a vector in the space of possible personas.

Persona identification seeks a fixed-length representation of persona information that captures the consistency of generated responses. In other words, a personalized dialog agent not only has to generate diverse responses (high entropy H(τ)), but it must also stay consistent with a persona profile throughout multiple turns of conversation, minimizing H(τ|P). One challenge is that it is unclear how to arrive at a set of questions X_{1:T} to generate the input trajectory τ. For example, certain sets of questions may always result in generic responses, regardless of the agent's quality. Thus, Problem 2 requires a way to constrain the question policy so that, given the right set of questions, the persona of the dialog agent can be elicited.

6.3.3 Persona Verification

We address the above issue by formulating question generation as an optimization problem. A naive attempt may be to propose the following objective:

\max_{\tau} I(\tau, P) = \max_{\tau} H(\tau) - H(\tau \mid P).   (6.3)

However, we will show in the following section that Eqn. (6.3) is intractable to optimize directly due to the partition function in the second term. We will first present a more feasible objective below and show its asymptotic convergence toward Eqn. (6.3) in the next section. Suppose that the inputs to a verification model are dialog agents, i.e., trained persona models parameterized by different personas. We define the persona verification problem as follows:
Problem 3 (Persona Verification). Given a space of persona information 𝒫, persona verification is the optimization objective

\min_{\theta} \; \mathbb{E}_{P \sim \mathcal{P}} \left[ \mathcal{L}(\tau_\theta, P) \right],   (6.4)

where L(·, ·) is the authentication loss:

\mathcal{L}(\tau_\theta, P) = \max\{0, \; C + d(\tau_\theta, P^{+}) - d(\tau_\theta, P^{-})\} - \log p(\tau_\theta).   (6.5)

P^{+} denotes persona facts that co-occur with trajectory τ, and P^{-} the opposite. C specifies the desired margin of separation, and τ_θ is the dialog trajectory generated by the question policy (θ).

The first term of Eqn. (6.5) approximates Eqn. (6.2) through a triplet loss using negative sampling over the space of possible personas. The rationale behind the first term is to address the intractability of solving Eqn. (6.2) directly. In Section 6.3.4, we show that this triplet-loss component converges to the mutual information term in Eqn. (6.2). The second term in Eqn. (6.5) gives the likelihood of the trajectory. In order to minimize the second term, a verification algorithm has to generate queries with high likelihood under a given language model, e.g., GPT-2. If either the input agent or the question policy generates nonsensical responses, then the resulting p(τ_θ) will be close to zero. For the rest of the chapter, we will use "identifier" to denote a model used to solve the identification problem and "verifier" to denote a model used to solve the verification problem.
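As a rough illustration of how Eqn. (6.5) could be implemented, the following is a minimal sketch combining a triplet margin term over embedded trajectories and personas with a language-model log-likelihood penalty; the tensor names, margin value, and the way p(τ_θ) is obtained are assumptions for illustration, not the exact training code used here.

```python
import torch
import torch.nn.functional as F

def authentication_loss(tau_emb, pos_emb, neg_emb, traj_log_prob, margin=1.0):
    """Sketch of Eqn. (6.5): max{0, C + d(tau, P+) - d(tau, P-)} - log p(tau).

    tau_emb       : phi(tau), embedding of the dialog trajectory, shape (batch, dim)
    pos_emb       : psi(P+), embedding of co-occurring persona facts, shape (batch, dim)
    neg_emb       : psi(P-), embedding of negatively sampled personas, shape (batch, dim)
    traj_log_prob : log p(tau_theta) under the language model, shape (batch,)
    """
    d_pos = F.pairwise_distance(tau_emb, pos_emb)           # d(tau, P+)
    d_neg = F.pairwise_distance(tau_emb, neg_emb)           # d(tau, P-)
    triplet = torch.clamp(margin + d_pos - d_neg, min=0.0)  # hinge term with margin C
    return (triplet - traj_log_prob).mean()

# Illustrative usage with random embeddings and log-likelihoods.
phi_tau = torch.randn(8, 128)
psi_pos = torch.randn(8, 128)
psi_neg = torch.randn(8, 128)
log_p_tau = -torch.rand(8) * 50.0   # stand-in for the LM log-likelihood of each trajectory
print(float(authentication_loss(phi_tau, psi_pos, psi_neg, log_p_tau)))
```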
6.3.4 Analysis of Persona Authentication Objective

Now we analyze the relationship between Eqn. (6.5) and the mutual information between P and τ. First, we assume that for a given persona P, the density function for p(τ|P) follows the probability density function (PDF) of a Gibbs distribution:

p(\tau \mid P) = \frac{\exp[-\beta E(\tau, P)]}{\int_{\tau' \in \mathcal{D}} \exp[-\beta E(\tau', P)] \, d\tau'},

where E(τ, P) is an energy function which scores the un-normalized co-occurrence likelihood of a specific dialogue trajectory τ and persona P, and β is the temperature term which controls the overall entropy of the distribution. We choose the Gibbs distribution because of its expressiveness and common use in contrastive learning [13]. Then we can express the mutual information between τ and P as

I(\tau, P) = H(\tau) - H(\tau \mid P) = -\mathbb{E}_{\tau}[\log p(\tau)] + \mathbb{E}_{\tau, P}\left[ \beta E(\tau, P) - \log \int_{\tau' \in \mathcal{D}} \exp[-\beta E(\tau', P)] \, d\tau' \right].   (6.6)

In Eqn. (6.6), the first term on the RHS corresponds to the entropy of dialog trajectories (diversity of generated responses), which is determined by the decoding quality of the input agent. The second term depends on our question policy and our estimation of P. Trajectories under P depend on the question policy since the input agent maximizes p(Y_t | X_t, τ_{t−1}, P). Since P is not known by the policy beforehand, it is estimated at each turn by the identifier. Unfortunately, directly estimating the entire second term is difficult: the partition function of the conditional distribution requires us to integrate over the space of trajectories, an intractable task. We thus propose a local density estimate of the conditional density p(τ|P) as follows. Let

\tilde{P}_N = \frac{1}{n V(C_n)} \sum_{i=1}^{n} K(\varphi(\tau_i), \psi(P), C_n)   (6.7)

denote an empirical estimate of p(τ|P) using n sampled trajectories, where ψ and φ are embedding representations of P and τ, respectively, and V(C_n) = \int_{N_{C_n}(P)} dP' gives the volume of a neighborhood ball of radius C_n around ψ(P). K is a kernel function (which we show to be a valid kernel function in the Supplemental Materials) designed as follows:

K(\tau, P, C_n) = \begin{cases} 1 & \text{if } d(\varphi(\tau), \psi(P)) \le C_n \\ 0 & \text{otherwise.} \end{cases}   (6.8)

We now present the main theorem of our analysis.

Theorem 3 (Convergence of P̃_N). If Eqn. (6.5) (the authentication loss) is minimized with 0 loss over D = {τ_i}_{i=1}^{n} and 𝒫 = {P_j}_{j=1}^{m}, then P̃_N asymptotically converges to p(τ|P), i.e.,

\lim_{n \to \infty} \tilde{P}_N = p(\tau \mid P)   (6.9)

when the following conditions hold:

\lim_{n \to \infty} n V(C_n) = \infty, \qquad \lim_{n \to \infty} V(C_n) = 0, \qquad \lim_{n \to \infty} \frac{k}{n} = 0,   (6.10)

where k is the expected number of samples that fall within N_{C_n}(P). The proof is provided in the Supplemental Materials.

The goal of the identifier model is to learn the embedding functions φ and ψ:

\varphi, \psi = \arg\min_{\varphi, \psi} \; \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} \max\{0, \; C + d(\varphi(\tau), \psi(P_i)) - d(\varphi(\tau), \psi(P_j))\},   (6.11)

where P_i ∈ 𝒫^{+} and each P_j belongs to the set of k negative persona samples.

6.4 Proof of convergence of Thm. 1

We first state some assumptions about the density function p(τ|P). Unless otherwise stated, we assume that there exists some joint embedding space H in which we compare τ ∈ D and P ∈ 𝒫. Specifically, we assume that there exist optimal mapping functions φ* and ψ* that map τ and P to H, respectively, i.e., φ : D → H and ψ : 𝒫 → H, where H ⊆ R^n. With some abuse of notation, we refer to φ(τ) by τ and ψ(P) by P in the following analyses for simplicity.

Assumption 1 (Locally Constant Density). We assume that within a local neighborhood N_C(P) of radius C around a persona vector P, trajectories τ are indistinguishable. Formally, for all P, P', there exists 0 < C < ∞ such that

d_\psi(P, P') \le C \;\Longrightarrow\; d_\varphi(p(\tau \mid P'), p(\tau \mid P)) = 0,

for some distance functions d_ψ and d_φ. For simplicity, we will consider the Euclidean distance for d_ψ and the total variational divergence for d_φ. We will use N_C(P) to denote the neighborhood set around P for which the above condition is satisfied.

Assumption 2 (Continuity and topological properties). The conditional density p(τ|P) is Lipschitz continuous over the supporting set H for both τ and P. Furthermore, we assume that p(τ|P) is simply connected.

Next, we define p(τ|N(P)) as the probability that trajectory τ will fall in the neighborhood N(P) around a given persona P. Specifically, we consider the case where n trajectories are sampled, k of which fall into N(P).

Definition 5 (Neighborhood Density). We define the neighborhood density around a persona vector P as the probability that a trajectory τ falls into the neighborhood N(P):

P_N = p(\tau \mid N(P)) = \int_{N(P)} p(\tau \mid P') \, dP'.   (6.12)

Furthermore, given a set of n i.i.d. trajectories {τ_1, ..., τ_n}, the probability that k such trajectories fall in N(P) follows the binomial distribution:

k \sim \binom{n}{k} P_N^{\,k} (1 - P_N)^{\,n-k}.   (6.13)

At this point, there is one key issue: how do we calculate k, which needs to somehow "count" the trajectory-persona pairs that fall into the same neighborhood? We can conceptualize k as the image of some counting function of the form K : (τ, P, C) → R, where K is normalized over the domain H. For this purpose, we construct a kernel density function for k as follows: given a persona vector P, let

k_n = \sum_{i=1}^{n} K(\tau_i, P, C_n)

be the output of the kernel function K over n sampled trajectories D = {τ_1, ..., τ_n} from p(τ|P). Here, C_n denotes the sample neighborhood size of N(P) around P satisfying the constraint C_n = \max_{\tau_i, \tau_j \in \mathcal{D}} d(\tau_i, \tau_j) for the Euclidean distance d(·, ·) from Assumption 1. Given embeddings φ(τ) and ψ(P), we propose the following kernel density function K(τ, P, C_n):

K(\tau, P, C_n) = \begin{cases} 1 & \text{if } d(\varphi(\tau), \psi(P)) \le C_n \\ 0 & \text{otherwise.} \end{cases}   (6.14)
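To make the estimator in Eqns. (6.7) and (6.14) concrete, the following is a minimal numerical sketch of the neighborhood-counting estimate of p(τ|P) from embedded trajectories; the embeddings, the radius C_n, and the volume term below are synthetic placeholders used only to illustrate the computation.

```python
import numpy as np

def kernel(tau_emb, p_emb, radius):
    """K(tau, P, C_n) from Eqn. (6.14): 1 if the embeddings lie within radius C_n, else 0."""
    return 1.0 if np.linalg.norm(tau_emb - p_emb) <= radius else 0.0

def neighborhood_density_estimate(traj_embs, p_emb, radius, volume):
    """Estimate from Eqn. (6.7): (1 / (n * V(C_n))) * sum_i K(tau_i, P, C_n)."""
    counts = sum(kernel(t, p_emb, radius) for t in traj_embs)
    return counts / (len(traj_embs) * volume)

# Illustrative usage: n trajectory embeddings phi(tau_i) scattered around one persona psi(P).
rng = np.random.default_rng(0)
psi_p = np.zeros(16)
phi_taus = rng.normal(scale=0.5, size=(1000, 16))
c_n = 2.0    # neighborhood radius (placeholder)
v_cn = 1.0   # V(C_n), volume of the neighborhood ball (placeholder)
print(neighborhood_density_estimate(phi_taus, psi_p, c_n, v_cn))
```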
Lemma 3 (Validity of the proposed kernel density). Let V(C) = \int_{N_C(P)} dP' denote the volume of the neighborhood with radius C around P. The counting function K described by Eqn. (6.14) is a valid kernel density function satisfying

\forall n > 0, \; \forall \tau, P \in \mathcal{P}: \; K(\tau, P, C_n) \ge 0,   (6.15)

\forall n > 0: \; \frac{1}{V(C_n)} \int_{\mathcal{H}} K(\tau', P, C_n) \, d\tau' = 1.   (6.16)

Proof. Condition (6.15) follows from the definition of K in Eqn. (6.14): K(τ, P, C) > 0 over the entire supporting set for τ, P and constants C_n, and 0 everywhere else. For condition (6.16), observe that K integrates to V(C) over the domain of τ:

\int_{\mathcal{H}} K(\tau', P, C) \, d\tau' = \int_{N_C(P)} 1 \cdot dP' = V(C) \quad \text{(by the definition in Eqn. (6.14))}.

From our construction of K, we know that K(τ, P, C_n) = 0 everywhere except in the neighborhood N_C(P). Thus, the integral \int_{\mathcal{H}} K(\tau', P, C) \, d\tau' reduces to an integral over N_C(P). □

We now present the main theorem of our analysis. First, let us denote φ*, ψ* as functions satisfying the empirical objective

\varphi^*, \psi^* = \arg\min_{\varphi, \psi} \; \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} \max\{0, \; C + d(\varphi(\tau), \psi(P_i)) - d(\varphi(\tau), \psi(P_j))\}.   (6.17)

Theorem 4 (Convergence of P̃_N). Let P̃_N be the empirical estimate of P_N using the kernel density estimator

\tilde{P}_N = \frac{1}{n V(C_n)} \sum_{i=1}^{n} K(\tau_i, P, C_n).   (6.18)

If Eqn. (6.17) (the authentication loss) is satisfied with 0 loss over D = {τ_i}_{i=1}^{n} and 𝒫 = {P_j}_{j=1}^{m}, then P̃_N asymptotically converges to p(τ|P), i.e.,

\lim_{n \to \infty} \tilde{P}_N = p(\tau \mid P)   (6.19)

when the following conditions hold:

\lim_{n \to \infty} n V(C_n) = \infty, \qquad \lim_{n \to \infty} V(C_n) = 0, \qquad \lim_{n \to \infty} \frac{k}{n} = 0,   (6.20)

where k is the expected number of samples that fall within N_{C_n}(P).

Proof. From Eqn. (6.13), we see that k ∼ Binomial(n, P_N). Thus, we have E[k] = n P_N, where E[k] is the expected number of samples k that fall within N_{C_n}(P) from a random sample of n trajectories. Observe that

P_N = \int_{N_C(P)} p(\tau \mid P') \, dP' = \int_{N_C(P)} p(\tau \mid P) \, dP' \;\; \text{(by Assumption 1)} \; = p(\tau \mid P) \int_{N_C(P)} dP' = p(\tau \mid P) \cdot V(C).   (6.21)

Additionally, if the authentication loss in Eqn. (6.17) is satisfied with 0 loss, then we have K(τ, P, C) = 1 ⟺ d(φ(τ), ψ(P)) < C ⟺ τ ∈ N_C(P) ⟺ k = 1. By the strong law of large numbers, we have

\lim_{n \to \infty} \sum_{i=1}^{n} K(\tau_i, P, C_n) = \mathbb{E}_{\tau}[K(\tau, P, C_n)] = \int_{\mathcal{H}} K(\tau, P, C_n) \cdot p(\tau \mid P) \, d\tau = \int_{\mathcal{H}} k \cdot p(\tau \mid P) \, d\tau = \mathbb{E}[k].   (6.22)

For a given n, we can combine E[k] = n P_N with Eqns. (6.21) and (6.22) to obtain

\frac{\mathbb{E}[k]}{n} = P_N = p(\tau \mid P) \cdot V(C_n) \;\Longrightarrow\; \lim_{n \to \infty} \tilde{P}_N = \lim_{n \to \infty} \frac{\mathbb{E}[k]}{n V(C_n)} = p(\tau \mid P).

Although V(C_n) → 0, the first condition in Eqn. (6.20) guarantees that n → ∞ faster. Intuitively, V(C_n) → 0 means that the size of the contrastive margin should shrink with an increasing number of training samples, but the rate of shrinking must be slower than the rate at which n grows. Similarly, k → ∞ since the number of observed trajectories falling into N_C(P) increases monotonically with n. However, the third condition in Eqn. (6.20) ensures that P̃_N converges at all. □

6.5 Methodology

Figure 6.1: Overview of the authentication pipeline.

Figure 6.1 summarizes the key components of our authentication pipeline. The identifier model represents the current conversation history as state input to the verifier. The verifier outputs a distribution π(·|s_t) over the action space. The sampled action from π(·|s_t) is converted to a question code and incorporated into PersonaGPT to decode the next question. We will refer to the full authentication pipeline as the "authenticator" for short. Once we can estimate P based on the learned representation φ of the dialog history, we can learn a question policy θ under the authentication objective (Eqn. (6.5)). Toward that goal, we first present an effective way to incorporate the question policy, i.e., (X_t)_{t=1}^{T}, as context for conditional decoding using SOTA LM-based dialog models, e.g., the GPT-2 architecture. This requires us to define an action space of control codes [14] to contextualize the decoder during question generation. We describe an active learning approach to learn such an action space. Then, we present the verifier model details and provide an algorithm for learning the question policy.
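To make the interaction in Figure 6.1 concrete, the following is a minimal sketch of a single authentication turn; the component interfaces (identifier, verifier, persona_gpt) and the stand-in callables in the usage example are illustrative assumptions, not the actual implementation.

```python
import torch

def authentication_turn(identifier, verifier, persona_gpt, dialog_history, turn, action_codes):
    """One turn of the (assumed) authentication loop from Figure 6.1."""
    phi = identifier(dialog_history)                       # phi(tau_{t-1}): history embedding
    state = torch.cat([phi, torch.tensor([float(turn)])])  # s_t = [phi(tau_{t-1}); t]
    probs = verifier(state)                                # pi(.|s_t) over the question codes
    action = int(torch.multinomial(probs, num_samples=1))
    code = f"<|act|> {action_codes[action]} <|sep|>"       # turn-level goal as a control code
    question = persona_gpt(code, dialog_history)           # decode the next question X_t
    return question, action

# Illustrative usage with stand-in callables.
codes = ["ask about pets.", "talk about work."]
q, a = authentication_turn(
    identifier=lambda hist: torch.zeros(8),
    verifier=lambda s: torch.softmax(torch.randn(len(codes)), dim=0),
    persona_gpt=lambda code, hist: f"[decoded question under '{code}']",
    dialog_history=["hello! how are you today?"],
    turn=1,
    action_codes=codes,
)
print(a, q)
```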
6.5.1 Action Space as Control Codes

Since the goal of the verifier is to generate a sequence of questions (X_t)_{t=1}^{T}, we can describe the token-level likelihood of each question X_t as

p(x^{(t)}_{1:k} \mid X_t, \tau_{t-1}, P) \approx p(x^{(t)}_{1:k} \mid X_t, \varphi(\tau_{t-1})),   (6.23)

where x^{(t)}_{1:k} are the decoded tokens, X_t is the question at turn t, and φ(τ_{t−1}) is the history embedding. Unlike the persona model, the verifier does not have access to the actual persona P of the input model. Instead, the identifier model provides an estimated version φ(τ_{t−1}) based on the dialog history. In Eqn. (6.23), X_t is the output of the verifier at each turn, but we need an effective way to represent X_t for conditional decoding. Inspired by control codes [14], we represent X_t as question codes of the form:

<|act|> ask about pets. <|sep|>

Special tokens <|act|> and <|sep|> are used as delimiters for question codes. In the above example, "ask about pets" is a question code that corresponds to one of many discrete actions that can be output by the verifier. The question code is the raw text to be incorporated before the dialog history in the GPT-2 architecture during decoding. In our experiments, we use 11 actions and their corresponding question codes, as shown in Table 6.3. We have empirically found that these questions cover the majority of conversational topics in PersonaChat. Note, however, that our question code framework can be applied to arbitrarily defined questions.

Table 6.3: The action space of the verifier featuring 11 total actions (turn-level goals)
1. ask about family. 2. ask about pets. 3. talk about work. 4. talk about traveling. 5. ask about age and gender. 6. talk about hobbies. 7. talk about music. 8. talk about food. 9. talk about movies. 10. talk about politics. 11. ask about marital status.

6.5.2 Explanation of Prefix Codes

In total there are 11 possible discrete actions that the authenticator network can output. Each action corresponds to a particular phrase to be incorporated as a prefix to PersonaGPT. However, PersonaGPT can take arbitrary persona information for conditional decoding. Consider the following toy example:

<|p1|>I like dogs.<|sep|> <|start|> hi! how are you doing today?<|eos|>

The prefix code starts with <|p1|> and ends with <|sep|> to denote the persona input P. The text following <|start|> denotes the conditional decoding targets of the LM.

6.5.3 Conditional Decoding

In order to maintain felicity and consistency of decoding, we use a common LM to do natural language generation for both the persona model and the verifier. Specifically, we use the GPT-2 medium [15] architecture as the baseline LM for conditional decoding of both the verifier question codes and the persona inputs. We will refer to this general-purpose conditional decoder as the PersonaGPT model, which is used as the persona model when persona facts are used as the prefix code and as the question decoder when verifier questions are used as the prefix code.
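To illustrate the mechanics of prepending a question code to the dialog history for conditional decoding, the following is a minimal sketch using an off-the-shelf GPT-2 checkpoint from HuggingFace Transformers; the vanilla gpt2 weights and the formatting helper are stand-ins for the fine-tuned PersonaGPT model, so the generated text is only meant to show the interface, not its quality.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assumed stand-in for PersonaGPT: a vanilla GPT-2 with the special tokens added.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|act|>", "<|sep|>", "<|p1|>", "<|p2|>", "<|start|>"]}
)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

def decode_with_question_code(question_code: str, dialog_history: list) -> str:
    """Prepend a question code to the flattened dialog history and decode the next question."""
    prefix = f"<|act|> {question_code} <|sep|> <|start|> " + " ".join(dialog_history)
    input_ids = tokenizer.encode(prefix, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.92,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated continuation.
    return tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(decode_with_question_code("ask about pets.", ["hello! how are you today?"]))
```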
In addition to question codes, we also introduce 3 special tokens: <|p1|> and <|p2|> to denote the persona (source and target, respectively), and <|start|> as a delimiter between the control codes and the dialog history. We find that by using <|p1|> and <|p2|> to delimit source and target personas, the LM is able to attend to <|p1|>-related personas for odd-numbered responses and <|p2|>-related ones for even-numbered responses. We first fine-tune PersonaGPT on the PersonaChat dataset [2] with persona inputs as the prefix code and the dialog history as the conditional decoding targets.

6.5.3.1 Active Learning

In order to learn conditional decoding of question codes, we also fine-tune PersonaGPT on a small dataset of human-PersonaGPT conversations constructed using active learning. Algorithm 3 outlines this active learning procedure. In terms of sample complexity, we are able to fine-tune θ_LM to do reliable conditional decoding with 1,200 8-turn conversations. This actively learned dataset of question code examples will be made publicly available.

Algorithm 3 Active Learning with PersonaGPT
Require: PersonaGPT θ_LM fine-tuned on persona inputs, action space of question codes A.
1: Initialize active learning dataset D.
2: for total number of mini-batches do
3:   for mini-batch i do
4:     while conversation not done do
5:       Sample question code X_t ∼ A.
6:       Decode x^{(t)}_{1:k} using θ_LM.
7:       if x^{(t)}_{1:k} not satisfactory then
8:         Provide human inputs z_{1:k}.
9:         Update D ← D ∪ (X_t, τ_{t−1}, z_{1:k}).
10:        Gradient descent on (τ_{t−1}, X_t, z_{1:k}) to update θ_LM.
11:      else
12:        Continue.
13:      end if
14:    end while
15:  end for
16: end for

For the gradient descent step of Algorithm 3, we split the parameters of PersonaGPT (θ_LM) into 4 groups: a fast group (the special tokens), a slow group (the positional codes), a freeze group (embedding weights for normal tokens), and the rest of the parameters. We set the initial learning rates of each group as follows: fast group (α = 5e-4), slow group (1e-6), freeze group (1e-9), and the rest (5e-5). This technique is inspired by natural gradients [16, 17], which provide much better performance in terms of learning rate (and hence the number of samples needed to fine-tune). However, the full Fisher information matrix is intractable to compute explicitly; instead, we design a diagonal matrix M with entries corresponding to the learning rates of the different groups (4 different initial rates). The gradient descent update is then

\theta_{LM} \leftarrow \theta_{LM} - M \nabla_{\theta_{LM}} J(\theta_{LM}).

Empirically, we find that this scheme allows PersonaGPT to incorporate question codes without sacrificing the felicity of decoded responses.
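As an illustration of the grouped learning-rate scheme described above, the following is a minimal sketch using optimizer parameter groups; the way parameters are partitioned here is an assumption for illustration (the special-token "fast" group would require slicing rows of the embedding matrix, which we omit), not the exact fine-tuning code.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for PersonaGPT's theta_LM

# Assumed partition: token embeddings as the "freeze" group, positional
# embeddings as the "slow" group, and everything else as the default group.
wte = list(model.transformer.wte.parameters())
wpe = list(model.transformer.wpe.parameters())
rest = [p for n, p in model.named_parameters()
        if not n.startswith("transformer.wte") and not n.startswith("transformer.wpe")]

optimizer = torch.optim.AdamW([
    {"params": wte,  "lr": 1e-9},   # freeze group: normal-token embedding weights
    {"params": wpe,  "lr": 1e-6},   # slow group: positional codes
    {"params": rest, "lr": 5e-5},   # remaining parameters
])
# Each group's learning rate plays the role of one diagonal entry of M in
# theta_LM <- theta_LM - M * grad J(theta_LM).
```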
6.5.4 Learning the Question Policy

Since we do not have direct supervision over the newly introduced question codes, we learn the question policy θ using deep Q-learning (DQN) [18]. Because of the inference time associated with decoding from GPT-2 based architectures, the sample generation cost of full conversations is non-trivial. This is why we choose value-based learning instead of policy gradient: sample efficiency is maximized by off-policy methods such as Q-learning [19]. Since Q-learning tends to suffer from high variance during the early stages of training, we use the human-PersonaGPT conversations collected during active learning as an approximation of expert policies. By pretraining the Q-function on expert trajectories, we can explore high-value states early, leading to more stable Q-functions.

Markov Decision Process (MDP). We formulate the verifier learning task as an MDP:

• S (state space): s_t = [φ(τ_{t−1}); t], the embedding of the dialog history up to the current turn, concatenated with the current turn count t.
• A (action space): a_t ∼ π(·|s_t) is a question code sampled from the output of the verifier model at each turn.
• T (transition): s_{t+1} = [φ(τ_{t−1} ∪ y^{(t)}_{1:k}); t+1], where y^{(t)}_{1:k} is the response decoded by the input conversational agent.
• R (reward function): the reward function is

r(s_t) = -\mathcal{L}(\tau_t, P),   (6.24)

where L is the authentication loss (Eqn. (6.5)) as a function of the history up to turn t and the persona of the input agent.

Verifier Network. The verifier architecture is a feed-forward network with 2 hidden layers of 512 hidden units each. The logits layer of the verifier corresponds to the Q-value of each action, defined as

Q(s_t, a_t) = r(s_t) + \gamma \max_{a} Q(s_{t+1}, a).

Since we are dealing with finite-horizon MDPs, we set the discount factor γ = 1. The final output layer is a softmax over the Q-value logits:

\pi(\cdot \mid s_t) = \mathrm{softmax}(f(s_t; \theta)).

We first pretrain the verifier with imitation learning [20] on the human-PersonaGPT data collected during active learning. Specifically, we use the following loss function during pretraining:

\theta = \arg\min_{\theta} \; \mathbb{E}_{\tau} \left[ \sum_{t=1}^{T} -a_t^{*} \log \pi(a_t \mid s_t) + \lVert f(s_t; \theta) - Q^{*}(s_t, a_t) \rVert^2 \right],   (6.25)

where a_t^{*} is the expert action taken while visiting s_t during active learning. To stabilize learning, we use a twin-delayed Q-learning scheme inspired by [21]. In addition to the verifier, we keep a target network θ' with parameters equal to a stochastically-weighted average (SWA) [22] of θ. We thus define the pretraining Q-targets Q*(·, ·) as follows:

Q^{*}(s_t, a_t) = r(s_t) + \gamma Q_{\theta'}(s_{t+1}, a_{t+1}^{*}),   (6.26)

where a_{t+1}^{*} is the next action taken by the expert (i.e., the human policy). At the end of each gradient update for θ, the target network is updated according to

\theta' \leftarrow \eta \theta + (1 - \eta) \theta',   (6.27)

where η = 1/(N+1) and N is the number of training iterations. We then run Algorithm 4 with regular Q-targets and an annealed ε-greedy sampling strategy to promote exploration in early conversations. We fix each synthetic conversation to 8 turns and fine-tune the logits layer of f(·; θ), i.e., the Q-values, using gradient descent after each conversation.

Algorithm 4 Verifier Training
1: Initialize question policy and target networks θ, θ'.
2: for each persona model P ∈ 𝒫 do
3:   while conversation not done do
4:     Sample X_t ∼ softmax(f(s_t; θ)).
5:     Decode X_t into tokens x^{(t)}_{1:k}.
6:     Obtain response y^{(t)}_{1:k} from PersonaGPT conditioned on persona P.
7:     Store (s_t, a_t, s_{t+1}, r_t) in B.
8:   end while
9:   Sample a mini-batch of (s_t, a_t, s_{t+1}, r_t) tuples from B.
10:  Calculate Q-values using the target network and update θ using gradient descent.
11:  Update the target network using Eqn. (6.28).
12: end for

6.5.5 Model and Training Details

All models were written using PyTorch [23]. The PersonaGPT model was implemented using the HuggingFace Transformers package [9]. In terms of GPU usage, all models were trained using a single 11GB NVIDIA GTX 1080 Ti. For experiment 4.1, the GPT-2 baseline, DialoGPT, and PersonaGPT were fine-tuned on the PersonaChat dataset for 3 epochs, each taking between 13 and 16 hours of wall clock time. For each model, AdamW [24] was used with an initial learning rate of 5e-5 and a linear decay schedule. For experiment 4.2, all identifier models were trained for 10 epochs.
The BoW feed-forward network (MLP) consists of 2 hidden layers with 300 units each and a dropout rate of 0.2 between the layers. For BoE, the MLP architecture consists of 2 layers with 1024 units each and a dropout rate of 0.2 between layers. For the LSTM model, the input embedding size is 30, and 1 LSTM layer is used with 600 hidden units. For the BERT and GPT-2 models, the transformer (feature representation) layers were frozen, and additional 2-layer MLP modules were added to each model for training, each consisting of 1024 units per layer. The identifier model is a 2-layer MLP with 1024 units per layer and a dropout rate of 0.2 between layers. All identifier models were trained using the Adam optimizer [25] with a learning rate of 1e-3.

The verifier network consists of a 3-layer MLP with 512 hidden units and a dropout rate of 0.1 between layers. Tanh activation is used in place of ReLU, as we found Tanh to empirically outperform the latter in our use case. Note that the output layer size is 11 (corresponding to the size of the action space, i.e., the number of question codes). This output layer is trained to fit the Q-targets during Q-learning, and an additional softmax layer is added to shape the Q-values into a probability distribution from which to sample the actions for decoding responses. The verifier network was pre-trained on the active learning data (10,790 conversational turns) for 3 epochs, taking between 3.5 and 4 hours of wall clock time. For Q-learning, the verifier was trained for 3 simulated conversations per training-set persona, totaling 22 hours of wall clock time over 3,846 conversations and 30,768 conversational turns. After each conversation during the DQN training loop, the Q-value layers are fine-tuned over the replay buffer for 3 epochs. For SWA, at the end of each gradient update for θ, the target network is updated according to

\theta' \leftarrow \eta \theta + (1 - \eta) \theta',   (6.28)

where η = 1/(N+1) and N is the number of training iterations. For ε-greedy, we set the initial ε_0 = 0.5, ε_min = 0.05, and the decay factor to 2048.

6.6 Data

The main dataset used for this experiment was the PersonaChat dataset [2], which consists of 17,877 training conversations and 999 test conversations. The training set consists of 1,282 personas, which are profiles (sets) of 3-5 persona facts generated from a set of 6,127 distinct facts. The test set consists of 129 personas covering 674 facts that are unseen in the training data.

6.6.0.1 Pretraining

The PersonaGPT model is based on the DialoGPT model [8], which was trained on Reddit conversations. DialoGPT is in turn based on the GPT-2 model [15], which was pretrained on a diverse set of text corpora, including Wikipedia articles, fiction books, and news articles. The premise of language modeling is that pretraining on a diverse range of corpora yields a very large supporting set of context tokens from which conditional probabilities can be calculated. GPT-2 training accounted for "quality documents" (i.e., eliminated certain corpus samples based on internal quality metrics) but did not account for negative transfer between different corpus types (e.g., fiction books vs. Wiki articles) [26, 15]. Interestingly, GPT-2 has been shown to generalize to more complicated tasks such as abstractive summarization and general QA with few training samples, and sometimes does well in a zero-shot setting [15].
6.6.0.2 Fine-tuning on PersonaChat

Like GPT-2, DialoGPT was trained to decode multi-turn dialog as a straightforward language modeling task; responses between speakers are separated by 'EOS' (end-of-sentence) tokens to denote switching of speakers. The main difference between PersonaGPT and DialoGPT is the use of special tokens to format input responses. Specifically, PersonaGPT formats persona facts as prefix tokens using bidirectional attention (no masking). The <|p1|> and <|p2|> special tokens are introduced to denote the current speaker turn. Like task-oriented special tokens such as tl;dr, $ and extract, the different persona special tokens denote the different speaker tasks (i.e., decoding person 1's responses vs. person 2's responses).

6.6.0.3 Self-Play

After training the PersonaGPT decoder, we continue to use PersonaChat as a source of training and testing persona profiles. However, we no longer use the conversations in PersonaChat to do active learning or reinforcement learning. For example, we sample training persona profiles to parameterize PersonaGPT, but use the decoded PersonaGPT tokens as ground truth when training the dialog policy using reinforcement learning. Similarly, we compare the decoded PersonaGPT tokens with human responses (ground truth) when incorporating the turn-level goals (as action code prefixes). Using self-play between PersonaGPT and the dialog policy, we can generate new conversational episodes unseen in the training data (PersonaChat). This effectively allows us to explore the space of question sequences (i.e., sequences of turn-level goals) that minimize the authentication loss in Eqn. (6.5).

6.7 Experiments

We assess the proposed authentication system through its ability to answer the following questions: (Q1) How well can PersonaGPT use control codes? (Q2) How well can the identifier predict persona? (Q3) How well can the learned question policy distinguish persona models?

6.7.1 Conditional Decoding Evaluation

To answer Q1, we evaluate the capacity of PersonaGPT for controlled decoding in two settings: (1) automatic evaluation of PersonaGPT against SOTA persona models, and (2) human evaluation of human-PersonaGPT interactions. For automatic evaluation, we follow the ConvAI2 challenge automatic evaluation criteria of perplexity (PPL) and F1-score (F1) [11]. The following baselines are included for comparison: the Seq2seq baseline from the PersonaChat paper [2], the best-performing generative model [12] on automatic evaluation from the ConvAI2 challenge, and the recently released DialoGPT model [8]. Since PersonaGPT is based on the GPT-2 architecture, we include the vanilla GPT-2 LM (without control tokens) as well as a DialoGPT model fine-tuned on the PersonaChat dataset as additional baselines. Table 6.4 shows that PersonaGPT outperforms both baselines and SOTA in conditional decoding, as measured by PPL (lower is better) and F1 (higher is better).

Table 6.4: Automatic evaluation of PersonaGPT against existing SOTA persona models

Model                    PPL      F1
Seq2seq Baseline [2]     29.8     16.18
Wolf et al. [12]         16.3     19.5
GPT-2 Baseline           99.45    5.76
DialoGPT [8]             56.6     12.6
DialoGPT (Fine-tuned)    11.4     22.7
PersonaGPT               10.2     43.4
Table 6.5: Human evaluation of PersonaGPT and DialoGPT

Model                   Consistency    Coverage       Engagingness   Felicity
DialoGPT (Fine-tuned)   2.83 (1.40)    1.15 (0.68)    2.90 (0.79)    3.16 (1.16)
PersonaGPT              3.07 (1.34)    3.03 (1.31)    3.29 (0.95)    3.40 (1.11)

Human evaluations were collected using a platform that allows anonymous users to have short, 8-turn conversations with an unknown (either DialoGPT or PersonaGPT) persona model. In total, we collected 100 full conversations (800 total responses). After each conversation, the evaluator is asked to rate the agent in several categories:

• Consistency (1-5): how much did the agent's responses agree with each other? 1 = conflicting, 5 = perfectly consistent.
• Engagingness (1-5): how engaging were the agent's responses? 1 = aloof, generic; 5 = informative, rapport-building.
• Coverage (1-5): how many of the personality facts did the agent exhibit correctly? 1 = less than 20%, 5 = 100%.
• Felicity (1-5): how sensible are the agent's responses? 1 = non-sensible, 5 = grammatically and semantically correct.

In Table 6.5, we compare PersonaGPT with the best-performing baseline, the fine-tuned DialoGPT. We report the average ratings for each metric along with the standard deviation in parentheses. Interestingly, the biggest difference between the two models is the coverage score. On average, PersonaGPT exhibits 60+% of persona traits correctly during conversation, whereas DialoGPT exhibits around 20-40%. To illustrate some finer points of their differences, we provide example human-agent interactions in the Supplemental Materials.

6.7.2 Persona Identifier Evaluation

To answer Q2, we evaluate the identifier model based on the accuracy of the estimated persona φ(τ), given the input trajectory. We train φ and ψ on conversations collected with 1,283 unique training personas from the PersonaChat dataset. Each persona consists of 3-5 persona facts, which are drawn from a pool of 6,735 unique persona facts. At test time we use a nearest neighbor model to retrieve the top-k relevant persona facts from the pool of 6,735 facts. There are 129 test-set personas (i.e., collections of 3-5 persona facts) that are not present in the training set. Since there is no overlap between the training and testing personas, we are evaluating the identifier network's capability to represent out-of-distribution persona information. We compare the identification model against several baselines:

• Bag-of-Words (BoW): sum of one-hot vectors of the tokens in the dialogue trajectory.
• Bag-of-Embeddings (BoE): sum of GloVe embeddings [27] of dialog tokens.
• LSTM: long short-term memory (LSTM) network [28, 29] over dialog tokens.
• MLP-BERT: feed-forward network trained on averaged sentence-level embeddings obtained from BERT's [30] representation of dialog history.
• MLP-GPT: feed-forward network trained on the last GPT-2 hidden state.

The baseline models (BoW, BoE, LSTM, MLP-BERT, MLP-GPT) are all trained using a binary cross-entropy loss over each of the 6,735 possible persona facts (0 = not present in persona, 1 = present in persona). At test time, the top-k logits of the outputs are used to obtain the relevant personas. We use the following information-retrieval metrics to evaluate each model:

\mathrm{prec@k} = |\hat{P} \cap P| / k, \qquad \mathrm{rec@k} = |\hat{P} \cap P| / |P|,

where |·| denotes the cardinality of a set, P̂ is the set of retrieved persona facts (either based on nearest neighbors or top-k logits), and P is the ground-truth set of persona facts.
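The following is a minimal sketch of how the retrieval metrics above could be computed, pairing a nearest-neighbor lookup over persona-fact embeddings with prec@k and rec@k; the variable names and the use of scikit-learn's NearestNeighbors are illustrative assumptions rather than the evaluation code used in this work.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precision_recall_at_k(retrieved: list, relevant: set, k: int) -> tuple:
    """prec@k = |retrieved_k ∩ relevant| / k, rec@k = |retrieved_k ∩ relevant| / |relevant|."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k, hits / len(relevant)

# Illustrative setup: psi(P) embeddings of the 6,735 candidate persona facts
# and one phi(tau) embedding of a test dialog trajectory.
rng = np.random.default_rng(0)
fact_embeddings = rng.normal(size=(6735, 128))     # stand-in for psi of each fact
trajectory_embedding = rng.normal(size=(1, 128))   # stand-in for phi(tau)

nn = NearestNeighbors(n_neighbors=10).fit(fact_embeddings)
_, idx = nn.kneighbors(trajectory_embedding)       # indices of the 10 nearest facts
retrieved_facts = idx[0].tolist()

ground_truth = {12, 405, 3310, 5021}               # hypothetical persona-fact indices
p5, r5 = precision_recall_at_k(retrieved_facts, ground_truth, k=5)
print(f"prec@5 = {p5:.2f}, rec@5 = {r5:.2f}")
```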
Table 6.6: Performance of various identifier models on observed dialog trajectories from PersonaChat

Model        Prec@1   Prec@5   Rec@5   Rec@10
BoW          33.8     25.3     28.3    49.4
BoE          37.7     26.9     30.1    51.0
LSTM         42.7     29.2     32.7    53.2
BERT         37.6     26.6     29.9    51.1
GPT-2        30.8     24.5     27.3    48.3
Identifier   86.2     58.3     65.3    82.8

Table 6.6 summarizes the results of the various identifier models. Our identifier model clearly outperforms the baselines. Although a wide variety of embedding methods were used to represent the dialog history, their results are quite similar. The key difference appears to be the authentication loss used to train our identifier (Eqn. (6.17)).

6.7.3 Evaluation of Authentication Policies

We answer Q3 by evaluating the performance of the full authentication pipeline based on generated dialog between the authenticator and various input persona models. We fix the PersonaGPT model parameters θ_LM for conditional decoding. We generate synthetic conversations between the authenticator and each of the 129 unseen test-set persona profiles. For each test-set conversation, prec@k and rec@k scores are reported based on the estimated persona (using the learned identifier). We compare with the following baseline policies:

• LM: fine-tuned DialoGPT model without any input persona traits during decoding.
• Persona Model: another persona model with randomly sampled persona profiles.
• Random Policy: uniformly sample a question from the action space at each turn.
• Human Policy: using the aforementioned platform, we collect a second set of 100 human-PersonaGPT conversations where the user is not given the persona traits beforehand. At the end of each conversation, the user selects a ranked list of guesses from a list of 20 candidate persona traits to match the input agent's profile.

Table 6.7 compares the various authentication policies. Interestingly, even the random policy of uniformly sampling actions can be more revealing than non-goal-oriented dialog such as the LM and persona-model policies. In many of the generated conversations between LM-PersonaGPT and PersonaGPT-PersonaGPT pairs, the two models expand upon 1 or 2 topics without ever discussing other topics relevant to their personas. In contrast, by often forcing the input agent to switch topics, the random policy ignores signals of relevant persona information. Meanwhile, we find that our authentication policy strikes a balance between both worlds: it covers more persona traits, as measured by rec@5 and rec@10, while covering at least 1 relevant persona trait in the majority of conversations.

Table 6.7: Comparison of verification policies on various input persona models

Policy    Prec@1   Prec@5   Rec@5   Rec@10
LM        57.4     40.0     45.1    67.4
Persona   69.8     39.1     44.0    63.5
Random    72.9     42.3     48.3    70.2
Human     68.6     56.0     63.0    -
Ours      83.7     53.0     59.9    80.9

For the human policy, we are unable to obtain an accurate rec@10 since a non-trivial number of participants selected fewer than 10 choices out of the 20 candidates. Since human evaluators were instructed to guess the persona beforehand, it appears that some level of goal-orientation can improve the diversity (in terms of persona coverage) of generated conversations. However, our verifier policy is able to discover more effective ways of interaction compared to the non-goal-oriented and human policy baselines. In the next section, we provide snapshots of generated conversations between PersonaGPT and various authentication policies.
In the next section, we provide snapshots of generated conversations between PersonGPT and various authentication policies. 6.7.4 Ablation Study In addition to the PersonaGPT model, we are also interested in the performance of the authenticator policy against other input models. For example, how well does our policy fair against models with less capacity to incorporate persona information? What about against models with lower decoding quality? We generate several synthetic conversations between our authenticator and several variations of persona models: 142 • Full Persona: full persona model. • Weak Persona: persona model with higher nucleus sampling size (𝑝 ∈ [0.30 − 0.8]) [31] to capture less sensible models. • Transition Model: model with either randomly initialized or no persona inputs (defaults to non-personalized decoding). We use the transition model to serve as a baseline in which persona information is not incorporated in the input dialogue agent. Additionally, we include a “weak persona” model baseline, which in corporates persona information but suffers from decreased overall felicity. We randomly sample persona inputs from the full set of 1, 412 personas and report the mean prec@k and rec@k performance across generated conversations. Table 6.8 compares authenticator performance against these persona model variants. As expected, the non-personalized transition model did not conform to given persona profiles, and the authenticator was most affected by the drop in personalization. By contrast, the authenticator was still able to maintain some performance against a much less felicitous persona model. Table 6.8: Authenticator performance against variations of the input persona model Input Model Prec@1 Prec@5 Rec@5 Rec@10 Transition 17.1 17.1 19.4 40.6 Weak Persona 79.8 49.5 55.9 74.1 Full Persona 86.0 53.2 60.0 77.8 6.8 Generated Conversations In this section, we provide snapshot of conversations between human-PersonaGPT with full knowledge of persona (conditional decoding evaluation), human-PersonaGPT conversation without persona knowledge (authentication setting), human-DialoGPT interaction, PersonaGPT self-play, and authenticator-PersonaGPT interaction. Furthermore, we include snapshots from the API interface for the different experimental settings. 143 6.8.0.1 Human-Agent Interactions Figure 6.2 illustrates the set of instructions given to the user when they first start an experiment with the persona model. Note that the actual persona model is randomly selected to be either the baseline model (DialoGPT) or PersonaGPT. In this evaluation setting, the user is given the full persona of its partner beforehand. After 8 turns of conversation, the user is asked to rate the conversation based on Figure 6.2: Experiment 1 (conditional decoding evaluation) instructions. metrics described in the Experiment Section. This interface is shown in Figure 6.3. Table 6.9 provides a point of comparison with human-DialoGPT over the course of an 8-turn conversation. Although DialoGPT is capable of also incorporating persona information, it does not 144 Figure 6.3: Experiment 1 post conversational survey. make the transition between persona facts as frequently. In this case, appearance details (short hair) and pets (living with two dogs) get ignored completely in favor of gym and running. Empirically, we find that both models seem to struggle with negatives (e.g., I hate apples vs. I like apples), particularly when the adjective used are somewhat neutral. 
Table 6.9: An interactive example of human-DialoGPT conversation over 8 turns Role Response Human hello, how are you? DialoGPT i’m good thank you. how about you? Human doing well. where do you go to the gym? DialoGPT the gym i usually go to is the weight room. Human got it. what weights do you lift? DialoGPT i like to lift weights at the weight room. i am in the process of training. Human training for what? DialoGPT i am training for running. i am trying to get better at it. Human got it. i run as well. DialoGPT i’m trying to get better at the weight room, but i’m not sure yet Human not sure about what? DialoGPT i mean i am pretty sure i am not sure. what about you Human i do not lift weights, but i probably should DialoGPT well if you think you are good, then why not try Human i do not own weightlifting equipment DialoGPT do you have a job you could look into? Persona Facts 1. I go to the gym regularly. 2. I’ve short hair. 3. I like doing the treadmill and rowing machine. 4. Two dogs live with me. 145 Table 6.10 shows an example of an 8-turn dialog between a human evaluator and PersonaGPT. We note that PersonaGPT, along with other persona models we have evaluated, seem to struggle with exact numbers. For example, although the persona says 100 years of age, PersonaGPT may give 95 as shown here. Turn 6 shows a situation in which PersonaGPT changes the topic in conversation to a topic relevant to its persona. Its persona enjoys “wine and dancing in the moonlight”, and it seems to steer the conversation toward this topic from the current discussion about age and lifespans. Figure 6.4 shows the instructions for the human policy evaluation. In this setting, the user is asked to Table 6.10: An interactive example of human-PersonaGPT conversation over 8 turns (prior knowledge of persona given, conditional decoding evaluation) Role Response Human hey how are you PersonaGPT i’m great, thanks. and you? Human doing good have to finish a paper, what do you do for a living? PersonaGPT i’m in between jobs right now, but i’m strong for my age. Human and what age would that be PersonaGPT i’m 95 but i’ve always been strong for my age Human 24 but 95 shouldnt you be considering retirement PersonaGPT yes. but i am really into the idea of live forever huh what is your estimated lifespan or Human do you not have biological constraints like that PersonaGPT i’m 95. and you? Human uhh i already told you that i’m 24 PersonaGPT that’s old enough. do you drink wine? Human from time to time only red tho PersonaGPT red wine? you must be quite the drinker. Human uhh idk how you made that leap PersonaGPT because i drink a lot Persona Facts 1. I am very strong for my age. 2. I feel like I might live forever. 3. I am 100 years old. 4. I like to drink wine and dancein the moonlight. guess the persona of their partner without prior knowledge. The question policy is fully determined by the user, and they are asked to rank the persona facts (out of 20 candidates) after 8 turns. The ranking interface is shown in Figure 6.5. Table 6.11 gives an example interaction between a participant and PersonaGPT during human 146 Figure 6.4: Experiment 2 (human policy evaluation) instructions. policy evaluation. In this setting, the user is asked to guess the persona of their partner without prior knowledge. The question policy is fully determined by the user, and they are asked to rank the persona facts (out of 20 candidates) after 8 turns. We find that human policies tend to be closer to non goal-oriented dialog than learned policies. 
Users often ask follow-up questions for clarification, as seen in turns 5 and 8. When a question ventures outside of the persona facts (turn 6), we find that PersonaGPT is able to find interesting answers despite missing the corresponding information in its context. It is interesting to note that PersonaGPT is capable of adapting to misspellings (e.g., “her” instead of “here”) and inconsistent punctuation.

Table 6.11: An interactive example of human-PersonaGPT conversation over 8 turns (persona information not known a priori, user asked to guess)

Role        Response
Human       hey whats poppin
PersonaGPT  hello, i’m just hanging out here, you?
Human       same her got some work to do today
PersonaGPT  what do you do? i’m a finance major
Human       im a phd student in computation biology
PersonaGPT  nice do you like it?
Human       its okay has its ups and downs. what do you think of finance
PersonaGPT  i do not have a whole lot of money right now, but i plan on going back in a few years.
Human       no i meant the major
PersonaGPT  i’m not sure what i want to do after i graduate, but i want to go to business school
Human       ohh how many kids do you want?
PersonaGPT  i’d love to have 10, but it is a lot of work. do you have kids?
Human       nope too young maybe someday
PersonaGPT  i’d like to have kids one day, maybe i’ll get some
Human       like buy them?
PersonaGPT  maybe one day, yeah. what do you do for fun?

Persona Facts:
1. I drive a BMW.
2. I graduated from Yale in 2011 with a degree in finance.
3. I plan to go to business school next year.
4. My parents are very wealthy bankers.

6.8.0.2 PersonaGPT Self-Play

Table 6.12 illustrates an example conversation between two PersonaGPT models contextualized by different persona facts. Although used as a baseline question policy, we notice that simply using another persona model to conduct the conversation does not always lead to revealing conversations about the underlying persona of either agent. For Bot 1, 2 out of the 5 persona facts are discussed throughout the course of the conversation, whereas only 1 persona fact is revealed about Bot 2 (the input persona model to be authenticated). Notice that since Bot 1 is the lead, it drives the conversation toward topics within its own set of persona facts. We notice that this is common in the general case: when another persona model is used as a question policy, it tends to drive the conversation toward its own set of persona facts. Thus, the question policy performs well only insofar as the two sets of persona facts are similar.

Table 6.12: An example of self-play between two randomized PersonaGPT models

Role   Response
Bot 1  hi there, do you like broadway?
Bot 2  i love it! you ever played on broadway?
Bot 1  no i haven’t. but it sounds like fun. where do you work?
Bot 2  i work in retail. how about you?
Bot 1  i am a psychology major. that’s awesome!
Bot 2  oh that sounds like a tough major.
Bot 1  do you have any pets?
Bot 2  yes, i’ve a pet turtle. do you have any pets?
Bot 1  i have one dog. he’s a collie mix.
Bot 2  oh i love dogs, they are so cute.
Bot 1  do you have any other hobbies?
Bot 2  i am an expert on chewing gum.
Bot 1  oh that’s interesting! do you enjoy doing it?
Bot 2  i do, but it does not make me gassy. i could live with that.
Bot 1  it helps with gas. makes sense.
Bot 2  do you go for runs?

Persona 1                                   Persona 2
1. I am a psychology major.                 1. I don’t pick up my toys.
2. I enjoy broadway shows.                  2. I have a pet turtle.
3. I’m a Steelers fan.                      3. I like to play with my dolls.
4. My favorite band is the Avett Brother.   4. My best friend lives next door.

6.8.0.3 Consent

We obtained IRB approval for the data collection. When users register for the experiment, the following agreement and potential-risk statements must be read and signed off on before registration.

You will be asked to engage in an 8-10 turn, short dialog with a chatbot. The chatbot has 4-5 personality traits, which will be revealed to you at the start of the experiment. At the end of the conversation, you will be asked to rate the chatbot on a series of performance metrics.

Purpose of This Experiment. The purpose of this research study is to evaluate the performance of several state-of-the-art open-domain conversational agents developed through various learning algorithms. Automatic evaluation of conversational agents has proven to be difficult, and thus human evaluation is necessary to assess the performance of these agents.

Your rights to participate, say no, or withdraw.
Participation is voluntary: you may choose not to participate at all, you may refuse to participate in certain procedures or answer certain questions, or you may discontinue your participation at any time without consequence.

6.9 Social Impact

Beyond the positive impacts, there are numerous potential avenues for misuse of the proposed technology. We list some notable ones below:

• Mistakes in persona identification can result in mistakes in granting or denying services to persons or groups of persons. For example, persona facts (or sets of facts) on which the persona identifier has higher error rates can lead to poorer access for the affected users.

• Similarly, verifier errors (e.g., poorly chosen questions) on certain actions (e.g., talk about hobbies, talk about travel) may have disproportionately smaller consequences than errors on more sensitive topics (e.g., talk about gender, talk about politics).

• Although the verifier is meant to perform authentication, it can potentially be abused to conduct conversations for the purpose of mining persona information. For example, an application using the verifier could exploit rapport with human users to mine personal information. We did not explore ways to prevent this type of misuse, but future work must focus on counter-measures or methods of prevention against such cases.

• Algorithmic authentication and persona modeling can potentially accelerate the development of human-like dialog generation. Deployments of conversational authentication must therefore carefully consider the extent to which persona modeling increases the capacity of general chatbots to conduct deceptive or exploitative interactions (e.g., impersonation, personalized advertising, political manipulation), as well as their potentially detrimental impact on human labor conditions.

Additionally, note that since the persona models used in our experiments are built from a language model pretrained on large-scale datasets, they have been shown to contain various cultural biases [32, 33]. Finetuning on PersonaChat certainly does not alleviate these issues, as the personas themselves were not curated against such biases. For example, the term “gender” used in this study is defined as gender perceived by the annotators of the PersonaChat dataset. Its interpretation may not generalize to other real-world settings.
6.10 Conclusion and Discussion of Limitations

In this chapter, we proposed an authentication pipeline whose questions increase the mutual information between the dialogue trajectory and an input agent’s underlying persona features. Nonetheless, there are several limitations to our current approach. For example, our approach assumes “good faith”: it cannot handle persona models that intentionally hide their persona characteristics. Additionally, more sophisticated verification should distinguish between direct and indirect expressions of persona. For example, a bot with the persona “I like to tell jokes” may embody the persona through sarcasm rather than through self-description.

Finally, there are some intrinsic limitations of the language modeling approach to generative dialog. In Table 6.10, we illustrated a case in which PersonaGPT fails to portray precise persona facts such as “100 years old” vs. “95 years old”. In general, GPT-based LMs suffer from a trade-off between the sensibility of responses (i.e., lack of repetitive or degenerate responses) and the precision of the decoded responses. This is due to the top-k and nucleus sampling approaches used for decoding, which inject randomness to prevent degenerate responses. The injection of noise into the decoding process intrinsically creates a barrier to precision. Alternative approaches to repetition penalties could potentially eliminate this trade-off.

Additionally, GPT-based models suffer from high time costs at inference time, because auto-regressive decoding with self-attention costs O(K²), where K is the total number of tokens in the dialog history. Empirically, we found that quantization and model compression helped during deployment (<1 second per response on CPU), but the benefit cannot compare to bringing the inference cost down to linear time. In order to scale to long dialogs (e.g., conversations that last 20+ turns), some form of compression of dialog histories (e.g., a dialog-level latent code) may be needed to decrease the dependence on previous tokens.
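To make the source of this precision barrier concrete, the sketch below implements nucleus sampling over a toy next-token distribution. The vocabulary, logits, and threshold are illustrative assumptions, not values from our experiments: even though the exact token (“100”) is the most probable continuation, the renormalized sampling step still emits near-misses such as “95” some of the time.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float) -> int:
    """Sample a token index from the smallest set of tokens whose cumulative
    probability exceeds p (nucleus / top-p sampling)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token up to (and including) the one that pushes the mass past p.
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, 1).item()
    return int(sorted_ids[choice].item())

# Toy next-token distribution (illustrative values, not from our experiments):
# "100" is the single most likely continuation, yet sampling noise can emit "95".
vocab = ["100", "95", "90", "old", "young"]
logits = torch.tensor([2.0, 1.6, 0.4, 0.1, 0.1])

torch.manual_seed(0)
draws = [vocab[nucleus_sample(logits, p=0.8)] for _ in range(10)]
print(draws)  # mostly "100", but "95" (and occasionally "90") still appear
```

Shrinking p recovers precision but reintroduces exactly the repetitive, degenerate behavior that top-k and nucleus sampling are designed to avoid, which is why we regard this as an intrinsic trade-off rather than a tuning problem.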
BIBLIOGRAPHY

[1] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016.

[2] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018.

[3] Haoyu Song, Wei-Nan Zhang, Yiming Cui, Dong Wang, and Ting Liu. Exploiting persona information for diverse generation of conversational responses. arXiv preprint arXiv:1905.12188, 2019.

[4] Zhou Yu, Ziyu Xu, Alan W Black, and Alexander Rudnicky. Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 404–412, 2016.

[5] Anil K Jain, Arun Ross, and Salil Prabhakar. An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):4–20, 2004.

[6] Anil K Jain, Ruud Bolle, and Sharath Pankanti. Biometrics: Personal Identification in Networked Society, volume 479. Springer Science & Business Media, 2006.

[7] David Güera and Edward J Delp. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6. IEEE, 2018.

[8] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.

[9] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.

[10] Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical report, Citeseer, 1999.

[11] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (ConvAI2). In The NeurIPS'18 Competition, pages 187–208. Springer, 2020.

[12] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149, 2019.

[13] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

[14] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.

[15] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[16] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[17] Magnus Rattray, David Saad, and Shun-ichi Amari. Natural gradient descent for on-line learning. Physical Review Letters, 81(24):5461, 1998.

[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[19] Sham Machandranath Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, UCL (University College London), 2003.

[20] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.

[21] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[22] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

[26] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.

[27] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[29] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

[30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[31] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[32] Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.

[33] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.