LEARNING TO DETECT LANGUAGE MARKERS

By

Fengyi Tang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2021

ABSTRACT

LEARNING TO DETECT LANGUAGE MARKERS

By

Fengyi Tang

In the world of medical informatics, biomarkers play a pivotal role in determining the physical state of human beings, distinguishing the pathologic from the clinically normal. In recent years, behavioral markers, due to their availability and low cost, have attracted a lot of attention as a potential supplement to biomarkers. "Language markers" such as spoken words and lexical preference have been shown to be both cost-effective and predictive of complex diseases such as mild cognitive impairment (MCI). However, language markers, although universal, do not possess many of the favorable properties that characterize traditional biomarkers. For example, different people may exhibit similar use of language under certain conversational contexts (non-unique), and a person's lexical preferences may change over time (non-stationary). As a result, it is unclear whether any set of language markers can be measured in a consistent manner. My thesis projects provide solutions to some of the limitations of language markers: (1) We formalize the problem of learning a dialog policy to measure language markers as an optimization problem which we call persona authentication. We provide a learning algorithm for finding such a dialog policy that can generalize to unseen personalities. (2) We apply our dialog policy framework to real-world data for MCI prediction and show that the proposed pipeline improves prediction over supervised learning baselines. (3) To address non-stationarity, we introduce an effective way to do temporally-dependent and non-i.i.d. feature selection through an adversarial learning framework which we call precision sensing. (4) Finally, on the prediction side, we propose a method for improving the sample efficiency of classifiers by retaining privileged information (auxiliary features available only at training time).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 BACKGROUND
  1.1 Biomarkers and Their Limitations
  1.2 Behavioral Markers as Alternatives
  1.3 Limitations of Language Markers
  1.4 Contributions
  1.5 Proposed Research
  1.6 Question Policy
    1.6.1 MCI Proof-of-Concept
    1.6.2 Persona Authentication
  1.7 Classifier
    1.7.1 Precision Sensing
    1.7.2 Multitask LUPI
  BIBLIOGRAPHY
CHAPTER 2 REVIEW OF CURRENT RESEARCH
  2.1 LANGUAGE MARKERS IN MILD COGNITIVE IMPAIRMENT
  2.2 DIALOG SYSTEMS
  2.3 TASK-ORIENTED SYSTEM
  2.4 OPEN-DOMAIN SYSTEMS
    2.4.1 Personalized Response Generation
    2.4.2 MCI Prediction via Utterance Data
    2.4.3 Healthcare Applications of Dialog Systems
  2.5 BEYOND LINGUISTIC FEATURES
    2.5.1 Acoustic Features
    2.5.2 Personality Modeling
    2.5.3 Automatic Evaluation of Persona Models
  2.6 FEATURE SELECTION IN MEDICAL INFORMATICS
    2.6.1 Active Sensing
    2.6.2 Temporal Feature Selection
  2.7 SAMPLE EFFICIENCY AND PRIVILEGED INFORMATION
    2.7.1 Transfer Learning Review
    2.7.2 Knowledge Transfer by Inductive Learning
    2.7.3 Knowledge Transfer by Transductive Learning
    2.7.4 Knowledge Transfer with Distributed Representations
    2.7.5 Knowledge Transfer by Knowledge Distillation
    2.7.6 Knowledge Transfer by LUPI
    2.7.7 LUPI Preliminaries
  BIBLIOGRAPHY
CHAPTER 3 SCALABLE DIAGNOSTIC SCREENING OF MCI USING DIALOG POLICY
  3.1 INTRODUCTION
  3.2 METHODOLOGY
    3.2.1 Dialog Systems
    3.2.2 Overview of Pipeline
    3.2.3 Construction of Turn-Based Dialogue
    3.2.4 Unsupervised Learning for User Simulator
    3.2.5 Reinforcement Learning Components
      3.2.5.1 Environment E
      3.2.5.2 Action a_t ∈ R^d
      3.2.5.3 State s_t ∈ R^C
      3.2.5.4 Reward r ∈ R
      3.2.5.5 State transitions
      3.2.5.6 Deep Q-Networks (DQN)
      3.2.5.7 Policy-masking
    3.2.6 Training the RL-Agent
  3.3 EXPERIMENTS
    3.3.1 Data
    3.3.2 Baselines vs. RL Performance
    3.3.3 Evaluation of User Simulators
    3.3.4 Top-Performing Policies
  3.4 DISCUSSION AND CONCLUSION
  BIBLIOGRAPHY
CHAPTER 4 SAMPLE EFFICIENT LEARNING WITH PRIVILEGED INFORMATION
  4.1 INTRODUCTION
  4.2 METHODOLOGY
    4.2.1 Building the PI vocabulary
    4.2.2 Learning the Joint Representation
    4.2.3 Coupling Decision Functions with Feature Matching
  4.3 ANALYSIS OF SAMPLING EFFICIENCY
    4.3.1 Results from Existing Agnostic Models
    4.3.2 Complexity of Proposed LUPI Method
  4.4 EXPERIMENTS
    4.4.1 Datasets and Setup
    4.4.2 Initial Baselines
    4.4.3 Comparison Against Other Transfer Learning Methods
      4.4.3.1 Multi-task learning
      4.4.3.2 Data Fusion
      4.4.3.3 Distillation
    4.4.4 Performance with Sparse Examples
  4.5 DISCUSSION AND CONCLUSION
  BIBLIOGRAPHY
CHAPTER 5 ADVERSARIAL PRECISION SENSING
  5.1 INTRODUCTION
  5.2 PROBLEM FORMULATION
    5.2.1 Precision Sensing for Time-Series Data
    5.2.2 Budget-Constrained Precision Sensing as Adversarial Optimization
  5.3 PROPOSED METHOD
    5.3.0.1 Feature Sensor
    5.3.0.2 Progressive Learner
    5.3.0.3 Budget-Manager
    5.3.0.4 Inference
  5.4 ANALYSIS OF PROPOSED METHOD
    5.4.1 Existence of Local Nash Equilibria
    5.4.2 Convergence of Algorithm 2
  5.5 EXPERIMENTS
    5.5.1 Synthetic Experiments
      5.5.1.1 Recovery of C using BCPS
      5.5.1.2 Comparison Against Other Sensing Methods
    5.5.2 MIMIC-III Experiments
      5.5.2.1 Baseline Classifiers
      5.5.2.2 Performance on MIMIC-III
      5.5.2.3 Performance on Masked Data
    5.5.3 Sparsity Trade-offs
    5.5.4 Comparison Against SOTA Methods
  5.6 CONCLUSION
  BIBLIOGRAPHY
CHAPTER 6 PERSONA AUTHENTICATION
  6.1 INTRODUCTION
  6.2 WHY MODEL PERSONA?
    6.2.0.1 Applications
    6.2.0.2 Authenticator Example
    6.2.0.3 Challenges
  6.3 PROBLEM SETTING
    6.3.1 Notation
    6.3.2 Persona Identification
    6.3.3 Persona Verification
    6.3.4 Analysis of Persona Authentication Objective
  6.4 PROOF OF CONVERGENCE OF THM. 1
  6.5 METHODOLOGY
    6.5.1 Action Space as Control Codes
    6.5.2 Explanation of Prefix Codes
    6.5.3 Conditional Decoding
      6.5.3.1 Active Learning
    6.5.4 Learning the Question Policy
    6.5.5 Model and Training Details
  6.6 DATA
    6.6.0.1 Pretraining
    6.6.0.2 Fine-tuning on PersonaChat
    6.6.0.3 Self-Play
  6.7 EXPERIMENTS
    6.7.1 Conditional Decoding Evaluation
    6.7.2 Persona Identifier Evaluation
    6.7.3 Evaluation of Authentication Policies
    6.7.4 Ablation Study
  6.8 GENERATED CONVERSATIONS
    6.8.0.1 Human-Agent Interactions
    6.8.0.2 PersonaGPT Self-Play
    6.8.0.3 Consent
  6.9 SOCIAL IMPACT
  6.10 CONCLUSION AND DISCUSSION OF LIMITATIONS
  BIBLIOGRAPHY
LIST OF TABLES

Table 2.1: Summary Table of Transfer Learning Approaches
Table 3.1: Examples of questions from conversations
Table 3.2: Performance of baseline vs. RL on MCI prediction on 10 stratified shuffle splits
Table 3.3: Prediction @5, 10, 20, 30 and 35 Turns
Table 3.4: Most frequent questions in π* @5, 10, 15 and 20
Table 3.5: Table of top 10 ranked questions in final 15 turns of conversations
Table 4.1: Summary of datasets used in this study
Table 4.2: Comparison of performance across baseline models
Table 4.3: Comparison of performance across transfer learning models
Table 4.4: Performance of various models for 30 ICD-9 codes appearing less than 100 times in the dataset
Table 5.1: BCPS recovery performance vs. sparsity of sensing budget on synthetic data
Table 5.2: Comparison of predictive power of various sensing methods on synthetic data, under 80% sparsity in C
Table 5.3: Baseline performance of inference models conditioned on the original data X
Table 5.4: Performance of inference models conditioned on the masked data X ⊙ A
Table 5.5: Comparison of precision sensing against SOTA active sensing method with sparsity constraint of 50–60%
Table 6.1: Persona model responses can differ greatly depending on input questions
Table 6.2: An example of interaction between the learned authenticator policy and PersonaGPT over an 8-turn conversation
Table 6.3: The action space of the verifier featuring 11 total actions (turn-level goals)
Table 6.4: Automatic evaluation of PersonaGPT against existing SOTA persona models
Table 6.5: Human Evaluation of PersonaGPT and DialoGPT
Table 6.6: Performance of various identifier models on observed dialog trajectories from PersonaChat
Table 6.7: Comparison of verification policies on various input persona models
Table 6.8: Authenticator performance against variations of the input persona model
Table 6.9: An interactive example of human-DialoGPT conversation over 8 turns
Table 6.10: An interactive example of human-PersonaGPT conversation over 8 turns (prior knowledge of persona given, conditional decoding evaluation)
Table 6.11: An interactive example of human-PersonaGPT conversation over 8 turns (persona information not known a priori, user asked to guess)
Table 6.12: An example of self-play between two randomized PersonaGPT models

LIST OF FIGURES

Figure 1.1: Proposed learning to interview pipeline.
Figure 1.2: Preliminary dialog agent trained by deep Q-learning (DQN) [18] to interview patients for MCI screening.
Figure 1.3: Persona Authentication pipeline.
Figure 1.4: Overview of co-training scheme between the feature sensor G and classifier F.
Figure 3.1: Overview of the proposed methodology.
Figure 3.2: Illustration of reinforcement learning components in our proposed approach.
Figure 3.3: RL-Agent vs. Baseline w/ Variation on Turns.
Figure 3.4: Distribution of mean squared error (MSE) across all user simulators.
Figure 5.1: Overview of problem setting.
Figure 5.2: Recovery of feature relevance (C) by BCPS on synthetic data.
Figure 5.3: Heatmap comparisons of sensed features under a fixed C. Feature number 0-9 (x-axis) vs. Timesteps 0-96 (y-axis).
Figure 5.4: Trade-off between Budget Constraint (x-axis) vs. predictive power (y-axis). Budget denotes the strength of budgetary constraint β.
Figure 5.5: Comparison of AUC-gain against SOTA models.
Figure 6.1: Overview of authentication pipeline.
Figure 6.2: Experiment 1 (conditional decoding evaluation) instructions.
Figure 6.3: Experiment 1 post-conversational survey.
Figure 6.4: Experiment 2 (human policy evaluation) instructions.
Figure 6.5: Experiment 2 post-conversational survey.

LIST OF ALGORITHMS

Algorithm 1 RL-Training Protocol
Algorithm 2 Co-training of FS and PL
Algorithm 3 Active Learning with PersonaGPT
Algorithm 4 Verifier Training

CHAPTER 1 BACKGROUND

1.1 Biomarkers and Their Limitations

In the world of medical informatics, biomarkers [1] play a pivotal role in determining the physical state of human beings, distinguishing the pathophysiologic from the clinically normal. In most chronic diseases such as congestive heart failure and diabetes mellitus, there exist biomarkers such as brain natriuretic peptide and HbA1c that sufficiently capture the presence and the stage of disease progression in almost all clinical populations [2, 3]. However, a key limitation of biomarkers is that they are very expensive to obtain. For example, the average cost of a basic metabolic panel in the inpatient setting ranges from $300 to $10,000 per patient per test [4]. Note that this is the cost of simply getting a baseline reading of metabolic enzymes; complete metabolic panels and blood tests can often multiply the cost. In general, the higher the granularity of the biomarker, the higher the cost. A computerized tomography (CT) scan can reveal numerous respiratory pathologies but can range between $1,450 and $7,800 per person depending on availability. As a result, many of the diagnostic decisions in medicine rest on the physician's judgement regarding cost-vs.-effectiveness: which biomarkers should be obtained for which patients?
From the data mining perspective, the decision of "which biomarker to select for which patient" corresponds to a feature selection problem in which each data sample (patient state information) consists of only a sparse set of observed features due to cost constraints. However, this feature selection problem is often treated in an ad-hoc manner in practice: a mixture of insurance policies, practitioner guidelines and human judgement dictates the manner in which we engage in feature selection in the clinic. Naturally, two questions deserve our attention:

(1) Are there cheaper (but effective) alternatives to biomarkers?

(2) Are there more effective ways to acquire clinical features?

1.2 Behavioral Markers as Alternatives

In recent years, behavioral markers have attracted a lot of attention as a potential supplement to biomarkers. For example, in dementia progression, there exists a disease state clinically defined as mild cognitive impairment (MCI) that precedes noticeable memory loss and severe cognitive decline [5]. In traditional Alzheimer's disease (AD), brain imaging is done to confirm diagnosis, but intervention is futile at this stage because irreversible damage would have already occurred [6]. As a result, much research effort has been channeled toward early detection of MCI, when irreversible changes have yet to occur. However, there is a tradeoff for using biomarkers in MCI detection: while MCI can be detected by classic biomarker approaches such as brain imaging and cerebrospinal fluid markers [7, 8, 9], the presence of such biomarkers corresponds to irreversible changes having already occurred in the patient. Thus, other measures of cognitive function may be needed for early detection of MCI. Recently, there have been several works showing that behavioral markers [10, 11, 12] can provide useful diagnostic signals for early-stage MCI. For example, gait patterns have been shown to change in early MCI [13], enabling programs for at-home monitoring of behavioral changes [14]. However, in terms of cost of data acquisition, "language markers" such as spoken words [15] and linguistic features [11] have been shown to be both low-cost and predictive of early MCI states. Due to the ubiquity of language, the notion of developing high-performing language markers presents a promising direction for alternative diagnostic markers.

1.3 Limitations of Language Markers

However, language markers, although universal, do not possess many of the favorable properties that characterize biomarkers. For example, there are several key properties that allow a biomarker to generalize across populations [16]:

• Universality: the feature of interest can be measured in everyone in the disease group.
• Uniqueness: the feature of interest should differ between disease group and normal.
• Measurability: the feature can be measured consistently across everyone.
• Permanence: the feature remains invariant to time and algorithms used to measure it.

In the case of language markers, only universality is satisfied: one can potentially obtain linguistic and speech features from text and audio recordings for everyone, barring disabilities. However, language markers are certainly not unique: different people may exhibit similar use of language under certain contexts. Language markers may also be transient. For example, a person's word choices change throughout their lifetime.
Depending on the environmental context (e.g., time of day, the presence of other people), one's choice of words and speech tone may differ greatly. As a result, it is unclear whether any set of linguistic features can be measured in a consistent manner.

In spite of these limitations, however, language markers do provide ample signal for each of these classic biometric categories. Even though speech is not unique, higher-level features composed of language preferences – e.g., personality traits, use of idioms, pausing patterns – are regularly used to distinguish one group of people from another. Beyond individual identification, such traits are regularly used for market segmentation and sentiment reports. Although language, and perhaps even the resulting higher-level features, change over time, they often follow statistical patterns of change. For example, it is well known that human infants follow very stable patterns of language acquisition: babbling starts at 6 months, first words start at 9 months to 1 year, vocabulary expands (250-900 words) by age 2, and children can recite numbers and addresses by age 5. It is perhaps reasonable to hypothesize that linguistic decline may follow predictable patterns in the elderly.

Finally, the issue of measurability poses a difficult challenge: how can we consistently measure a set of language-related features in a person when language use itself is highly context dependent? As a toy example, consider psychometric and cognitive tests, both of which rely heavily on questionnaires and surveys. These would not work well for surveying language markers, as certain features such as word choice, sentence structures and topics of interest are only observable in unstructured, open-domain conversations. However, a completely open-domain conversation setting presents another layer of problems. The performance of automatic extraction of speech markers depends on restricting the spoken words used to generate speech [17], as variations in lexicon and sentence structure can introduce noise in the extracted acoustic features. Thus, there seem to be numerous points of potential conflict among the vast number of potential features that comprise language markers. Additionally, it is unclear what questions would elicit which set of linguistic or acoustic features. In order to tackle the problem of measurability, we have to formally define the relationship between language markers of interest and the question contexts used to elicit them.

1.4 Contributions

The goal of this thesis project is to provide solutions to the limitations of language markers. The main contributions are summarized as follows:

• We formalize the problem of learning a question policy to measure linguistic features as persona authentication. We provide an algorithm for finding such a question policy that can generalize to unseen personalities.
• We apply the question policy to real-world data for MCI prediction and show that the proposed pipeline improves prediction over supervised learning baselines.
• To address the challenges of permanence (i.e., temporally varying disease markers), we introduce an effective way to do temporally-dependent and non-i.i.d. feature selection in the clinical setting.
Specifically, we propose "precision sensing", an adversarial learning framework that exploits the relationship between adversarial sample generation and classifier performance to produce feature selection strategies that consider the cross-sample and temporal dependencies which make training examples non-i.i.d.
• Finally, on the prediction side, we propose a method for improving the sample efficiency of classifiers. Specifically, we propose an algorithm that leverages privileged information (linguistic features available only at training time) to supplement the lack of training labels.

1.5 Proposed Research

Figure 1.1 provides an overview of the main components of our proposed pipeline. As noted before, the goal is to learn a question policy to interact with users to obtain language markers. We summarize the key components below and discuss them in separate sections:

• State Representation: the actual representation of linguistic markers as a feature vector for downstream tasks such as classification, feature selection, and question generation.
• Classifier: the prediction model trained by supervised learning. Labels of the prediction task are provided for each sample, e.g., MCI = 1 or normal = 0.
• Question Policy: a second model used to acquire new training samples from the user; in other words, its role is to do data acquisition. The output of the question policy can be discrete or continuous variables, depending on the decoder used for NLG.
• Decoder: the NLG unit used to decode policy actions into human-understandable language. The decoder is an open-domain conversational model. The goal is to do conditional decoding based on both conversational history and policy actions.

Figure 1.1: Proposed learning to interview pipeline.

1.6 Question Policy

1.6.1 MCI Proof-of-Concept

Our first work introduces a preliminary pipeline for learning a question policy to ascertain linguistic markers for the purposes of MCI diagnosis. We use this as a proof-of-concept and illustrate that our "learning to interview" approach can indeed detect predictive signals that can be potentially more informative than unstructured conversations with human interviewers.

Figure 1.2: Preliminary dialog agent trained by deep Q-learning (DQN) [18] to interview patients for MCI screening.

Figure 1.2 provides a summary figure for the first work. Here, the prediction task is binary classification of patient MCI status. The action space of the question policy is a discrete set of questions. We simplify the problem setting such that NLG issues such as dialog consistency, felicity of language generation, and multi-turn co-reference reasoning are not considered; the question policy simply outputs a deterministic question (out of a pool of 107 possible questions) and obtains a response generated by the skip-thought model [19] (the user simulator in Figure 1.2). We treat the skip-thought embedding as the state input and learn the question policy using straightforward DQN over 107 actions. The reward function is hand-crafted: the agent receives a small negative reward per turn to penalize lengthy conversations and a large positive reward at the end of a conversation if the skip-thought features of the dialog history result in a correct prediction by the MCI classifier. We compare this proposed framework against classifier performance on supervised learning data from the original corpus. In addition to prediction accuracy, we also compare the length of conversations needed (conversational efficiency) to obtain a threshold level of performance.
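To make the reward shaping and the learning step concrete, the sketch below shows a per-turn reward and a single Q-network update in the spirit of the description above. The step penalty and terminal bonus values, the 4800-dimensional skip-thought state size, and the network shapes are illustrative assumptions, not the exact Chapter 3 settings.

```python
# Sketch of the hand-crafted reward and one DQN update for the interview
# agent. Penalty/bonus values, state dimension, and network sizes are
# illustrative placeholders rather than the thesis configuration.
import copy
import torch
import torch.nn as nn

N_QUESTIONS = 107        # one discrete action per candidate question
STATE_DIM = 4800         # skip-thought embedding of the dialog history
STEP_PENALTY = -0.05     # small negative reward per turn
SUCCESS_BONUS = 1.0      # terminal reward if the MCI classifier is correct

q_net = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                      nn.Linear(256, N_QUESTIONS))
target_net = copy.deepcopy(q_net)          # periodically synced frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def reward(done: bool, classifier_correct: bool) -> float:
    """Per-turn penalty, plus a terminal bonus for a correct MCI prediction."""
    return STEP_PENALTY + (SUCCESS_BONUS if done and classifier_correct else 0.0)

def dqn_update(s, a, r, s_next, done, gamma=0.99):
    """One temporal-difference step on a batch of transitions (tensors)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```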
1.6.2 Persona Authentication

In our second work, we formalize the problem of learning a question policy as persona authentication. Specifically, we break persona authentication down into two parts: persona identification, which is inferring a set of persona features from a given dialog trajectory, and persona verification, the problem of finding a second conversational model – we call it a question policy – to elicit dialog trajectories for persona identification.

Problem 1 (Persona Verification). Given a space of persona information P, persona verification is the optimization objective:

$$\min_{\theta} \; \mathbb{E}_{P \sim \mathcal{P}} \big[ \mathcal{L}(\tau_\theta, P) \big] \tag{1.1}$$

where L(·, ·) is the authentication loss:

$$\mathcal{L}(\tau_\theta, P) = \max\{0,\; C + d(\tau_\theta, P^{+}) - d(\tau_\theta, P^{-})\} - \log p(\tau_\theta). \tag{1.2}$$

P+ denotes persona facts that co-occur with trajectory τ, P− the opposite. C specifies the desired margin of separation, and τ_θ is the dialog trajectory generated by the question policy (θ).

In the persona authentication chapter, we illustrate some promising theoretical guarantees of our proposed optimization problem. For example, we show that estimators trained under the authentication loss shown in Eqn. (1.2) maximize the mutual information between the conversational trajectory τ generated by the policy and the persona information of the input agent (i.e., the user). We then present a refined version of the original pipeline that is capable of handling open-domain conversations, all the while directing the conversation toward the end goal of classification. The overview is shown in Figure 1.3, which outlines the key components: the verifier corresponds to the question policy of interest, the identifier corresponds to the state representation, and PersonaGPT corresponds to the decoder. Details of conditional decoding, active learning of verifier action codes, and policy learning are given in the persona authentication chapter. Empirically, we illustrate that the learned policy outperforms human evaluators as well as unstructured conversation policies in identifying personality traits through conversation.

Figure 1.3: Persona Authentication pipeline.

Our question policy addresses the problem of measurability of language markers. We cannot elicit the same set of linguistic features with the same set of questions, but we can obtain a question policy that maximizes the mutual information between generated dialog responses and the user's salient characteristics.
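A minimal sketch of the authentication loss in Eqn. (1.2) is given below, assuming the trajectory and persona facts have already been embedded by some encoder. The Euclidean distance, the margin value, and the function names are illustrative choices, not the exact Chapter 6 implementation.

```python
# Sketch of Eqn. (1.2): a margin (triplet-style) term separating the policy's
# trajectory from positive vs. negative persona facts, plus a fluency term
# -log p(tau). Embeddings and the distance choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def authentication_loss(tau_emb, p_pos_emb, p_neg_emb, log_p_tau, margin=0.5):
    """tau_emb:   embedding of the dialog trajectory generated by the policy
       p_pos_emb: embedding of persona facts that co-occur with the trajectory
       p_neg_emb: embedding of persona facts drawn from other personas
       log_p_tau: log-likelihood of the trajectory under the decoder."""
    d_pos = F.pairwise_distance(tau_emb, p_pos_emb)   # d(tau, P+)
    d_neg = F.pairwise_distance(tau_emb, p_neg_emb)   # d(tau, P-)
    hinge = torch.clamp(margin + d_pos - d_neg, min=0.0)
    return (hinge - log_p_tau).mean()
```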
1.7 Classifier

1.7.1 Precision Sensing

On the classifier front, we present two works that address various technical issues in medical prediction tasks. First, language markers, like many other disease markers in medicine, suffer from the issue of permanence: some disease markers fluctuate through time and require repeated measurements to gain insight into temporal trends. While we cannot change the time-dependence of certain disease markers, we can deal with the permanence issue by learning to detect stable temporal patterns from data. Our third work introduces the precision sensing framework, which addresses this issue by leveraging the interaction between adversarial examples [20] and classifier decision boundaries.

Definition 1 (Precision Sensing). The problem of precision sensing seeks a sensor tensor A* that minimizes the empirical risk (ERM) [21] according to:

$$A^* \in \operatorname*{argmin}_{A_i \in \mathcal{A}} \; \mathbb{E}_{P(X,Y)} \big[ L(h(A_i \odot X_i), y_i) \big] \;\approx\; \operatorname*{argmin}_{A_i \in \mathcal{A}} \; \frac{1}{m} \sum_{(X_i, y_i) \in \mathcal{D}} L(y_i, h(A_i \odot X_i)),$$

where h is a hypothesis class that maps h : X → Y, and L(·) is a risk function that evaluates the hypothesis mappings against the actual label. Here, A denotes the space of sensing matrices, and each matrix A_i can be thought of as a per-sample sensing matrix.

From this perspective, A_i describes the subset of sensed features across time for each sample. Unlike regular feature selection, which chooses the same subset of features for every sample, precision sensing builds the sensing tensor A incrementally: at each time step t, the sensing matrix A^(t) ∈ {0, 1}^{m×d} is computed across m samples for all d features. To learn A, we propose the following loss to simultaneously learn the hypothesis function and the sensing tensor:

$$\min_{\theta_F} \max_{\theta_G} \; \underbrace{\big\{ F(X - X \odot G(X))_{\neg y} - F(X - X \odot G(X))_{y} \big\}_{+}}_{\text{Feature Sensing}} + \underbrace{CE_y\big(F(X \odot G(X))\big)}_{\text{Classifier Reconstruction}} + \underbrace{\beta \lVert X - X \odot G(X) \rVert_1}_{\text{Budget Management}}. \tag{1.3}$$

In Eqn. (1.3), F represents the classifier and G is a recurrent neural network (RNN) that outputs the elements of A^(t)_{i,j} ∈ {0, 1} for each sample i and each feature j at time t. We call this RNN the feature sensor; it takes into account not only the temporal dependence of features within a sample but also the performance of F across samples. On the other hand, the classifier F learns to adapt its decision boundary based on the cross-entropy loss CE computed on the decision function mapping from the sensed features X ⊙ G(X) rather than the full set of features X. Since the performance of G is coupled to F, we provide a co-training scheme for both models in the Classifier chapter of the thesis. Figures 1.4a-1.4b give an overview of the proposed co-training scheme. Fig. 1.4a describes incremental generation of the sensing tensor A by G. Fig. 1.4b describes the minimax game between F and G: F (top) tries to minimize the classification error with the sensed features X ⊙ A, while G (bottom) tries to allocate A such that the unsensed features lead to misclassification. We prove that iterative gradient descent on Eqn. (1.3) converges to local Nash equilibrium points, which are guaranteed to exist. In experiments, we show that precision sensing outperforms SOTA baselines from active sensing on a benchmark mortality prediction task.

Figure 1.4: Overview of co-training scheme between the feature sensor G and classifier F. (a) Incremental generation of the sensing tensor. (b) Minimax game between F and G.
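The following sketch illustrates one alternating round of the Eqn. (1.3) game with placeholder sensor and classifier networks; it is a simplified rendering of the co-training scheme (Algorithm 2), not the exact procedure. The sensor ascends on the misclassification margin of the unsensed features plus the budget term, while the classifier descends on the cross-entropy over the sensed features only.

```python
# Simplified co-training step for Eqn. (1.3). F_net and G_net are placeholder
# modules (G_net outputs a soft sensing mask shaped like X); the CE term is
# applied only in the classifier step for brevity.
import torch
import torch.nn.functional as TF

def cotrain_step(F_net, G_net, opt_F, opt_G, X, y, beta=0.1):
    # Sensor step: allocate the mask so that the UNSENSED features mislead F,
    # while the budget term rewards leaving features unsensed.
    A = G_net(X)                                   # soft sensing mask in [0, 1]
    unsensed = X - X * A
    logits = F_net(unsensed)
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    wrong = logits.scatter(1, y.unsqueeze(1), float("-inf")).max(dim=1).values
    sensing = torch.clamp(wrong - true, min=0.0).mean()
    budget = beta * unsensed.abs().mean()
    opt_G.zero_grad()
    (-(sensing + budget)).backward()               # gradient ASCENT for G
    opt_G.step()

    # Classifier step: fit the decision boundary on the SENSED features only.
    A = G_net(X).detach()
    f_loss = TF.cross_entropy(F_net(X * A), y)
    opt_F.zero_grad()
    f_loss.backward()
    opt_F.step()
    return sensing.item(), f_loss.item()
```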
1.7.2 Multitask LUPI

In addition to feature selection, the sample complexity of learning is especially important in medical prediction tasks. For example, in electronic health records (EHR), we often find that the distribution of labels is highly skewed: there are numerous diseases with very few labels but an abundance of text descriptions from the literature. We propose a method for improving the sample efficiency of these learning tasks by leveraging clinical linguistic markers as privileged information. Specifically, we consider physician notes in the form of discharge summaries, linking standard medical terminologies (i.e., UMLS codes [22]) with diagnostic findings in the EHR. UMLS codes are a set of standardized medical concepts used by clinicians to describe physical findings of diseases and are used widely in both the EHR and medical research [22]. [23] alluded to the idea that medical datasets also contain vast amounts of privileged information in the physician notes, which serves to explain the qualities of diseases and can greatly aid decision rules.

For experiments, we consider the following set of data for example features, PI, and labels:

• Example Features X: continuous time-series data (e.g., lab values, blood tests, imaging) and discrete static variables (e.g., demographic information) that describe a patient.
• Privileged Information X*: physician notes containing descriptions in natural language and medical terms (UMLS concepts [22]) that summarize a particular visit for a patient.
• Target Task Y: prediction tasks of interest, such as mortality (binary classification), disease prediction (multi-task and transfer learning), etc.

Formally, we consider a similar problem setting where we are given D_train = {(x_i, x_i^*, y_i)}_{i=1}^m. We assume that there exists a vocabulary of size d that defines the privileged information (PI) space: x^*_{ij} = 1 if the j-th term in the PI vocabulary is contained in sample x_i^*, and x^*_{ij} = 0 otherwise. We also denote x_i^* = {w_1, ..., w_k} as a decomposition of x_i^* into k individual components such that each w_j ∈ {0, 1}^d is a one-hot vector corresponding to a non-zero component of x_i^*, and x_i^* = w_1 + w_2 + ... + w_k. In the following sections, we denote {w_j}_{j=1}^d as the set of "words" that compose the PI vocabulary. Thus, x_i^* gives the co-occurrence label of each word w_j ∈ {w_j}_{j=1}^d with respect to the sample x_i. We make no assumptions on the example features (X) with regard to data type. In practice, the example features correspond to temporal features, each corresponding to a biomarker (e.g., lab test value) measured at various time intervals. Finally, we define the multi-task learning objective as learning y_i ∈ {0, 1}^C, a set of C binary classification tasks.

Our approach to retaining privileged information in a multitask setting is to align PI features (linguistic descriptions) with temporal features (biomarkers), as sketched below:

1. Build a dictionary of PI features and learn a distributed representation [24] over the PI vocabulary.
2. Find a joint representation space (Φ) between the PI and example features.
3. Jointly learn the decision functions h*: Φ → Y by feature-matching in the joint representation space.

The first process uses unsupervised learning to embed the PI vocabulary into a vector space. The second process allows some of the privileged information to be retained at inference time, despite not having direct access to the PI vectors. The third process allows PI information for one task to be transferred to the other C−1 tasks in the label space. In the Classifier chapter, we examine how to achieve (1)-(3) in detail. We also provide an analysis of how (3) can maintain the favorable LUPI sample efficiency. We validate our multitask LUPI framework on multilabel classification tasks using the MIMIC-III database [25] and compare with several prevalent transfer learning schemes.
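As a rough illustration of steps (1)-(3), the sketch below embeds a toy PI vocabulary, maps example features into the same joint space Φ, and adds a feature-matching penalty to the multi-task loss. The dimensions, encoders, and loss weighting are placeholders rather than the Chapter 4 configuration. At test time only the example-feature branch is used, which is how part of the privileged information is retained without access to the notes.

```python
# Sketch of the multitask LUPI alignment idea: (1) embed the PI vocabulary,
# (2) map example features into the joint space Phi, (3) train the decision
# functions with a feature-matching penalty. All sizes are toy placeholders.
import torch
import torch.nn as nn

d_vocab, d_emb, d_x, C = 500, 64, 128, 10   # PI vocab size, embed dim, feature dim, tasks

pi_embed = nn.Embedding(d_vocab, d_emb)      # step (1): distributed representation of PI words
x_encoder = nn.Sequential(nn.Linear(d_x, d_emb), nn.Tanh())  # step (2): joint space Phi
heads = nn.Linear(d_emb, C)                  # step (3): C binary decision functions
bce = nn.BCEWithLogitsLoss()

def lupi_loss(x, x_star, y, alpha=0.5):
    """x: [B, d_x] example features; x_star: [B, d_vocab] PI bag-of-words
       (available at training time only); y: [B, C] multi-task binary labels."""
    phi_x = x_encoder(x)                                  # example features in Phi
    counts = x_star.sum(dim=1, keepdim=True).clamp(min=1.0)
    phi_pi = (x_star @ pi_embed.weight) / counts          # mean embedding of present PI words
    match = torch.norm(phi_x - phi_pi, dim=1).mean()      # feature-matching in Phi
    task = bce(heads(phi_x), y.float())
    return task + alpha * match
```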
BIBLIOGRAPHY

[1] Kyle Strimbu and Jorge A Tavel. What are biomarkers? Current Opinion in HIV and AIDS, 5(6):463, 2010.
[2] VO Puntmann. How-to guide on biomarkers: biomarker definitions, validation and applications with examples from cardiovascular disease. Postgraduate Medical Journal, 85(1008):538–545, 2009.
[3] Jeffrey K Aronson and Robin E Ferner. Biomarkers—a general review. Current Protocols in Pharmacology, 76(1):9–23, 2017.
[4] Renee Y Hsia, Yaa Akosa Antwi, and Julia P Nath. Variation in charges for 10 common blood tests in California hospitals: a cross-sectional analysis. BMJ Open, 4(8), 2014.
[5] Serge Gauthier, Barry Reisberg, Michael Zaudig, Ronald C Petersen, Karen Ritchie, Karl Broich, Sylvie Belleville, Henry Brodaty, David Bennett, Howard Chertkow, et al. Mild cognitive impairment. The Lancet, 367(9518):1262–1270, 2006.
[6] Jeffrey L Cummings, Rachelle Doody, and Christopher Clark. Disease-modifying therapies for Alzheimer disease: challenges to early intervention. Neurology, 69(16):1622–1634, 2007.
[7] Keith A Johnson, Nick C Fox, Reisa A Sperling, and William E Klunk. Brain imaging in Alzheimer disease. Cold Spring Harbor Perspectives in Medicine, 2(4):a006213, 2012.
[8] D Heister, James B Brewer, Sebastian Magda, Kaj Blennow, Linda K McEvoy, Alzheimer's Disease Neuroimaging Initiative, et al. Predicting MCI outcome with clinically available MRI and CSF biomarkers. Neurology, 77(17):1619–1628, 2011.
[9] Liang Zhan, Yashu Liu, Yalin Wang, Jiayu Zhou, Neda Jahanshad, Jieping Ye, and Paul Matthew Thompson. Boosting brain connectome classification accuracy in Alzheimer's disease using higher-order singular value decomposition. Frontiers in Neuroscience, 9:257, 2015.
[10] Clifford R Jack Jr, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. Hypothetical model of dynamic biomarkers of the Alzheimer's pathological cascade. The Lancet Neurology, 9(1):119–128, 2010.
[11] Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422, 2016.
[12] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017.
[13] Juan Manuel Fernandez Montenegro and Vasileios Argyriou. Cognitive evaluation for the diagnosis of Alzheimer's disease based on Turing test and virtual environments. Physiology & Behavior, 173:42–51, 2017.
[14] Hiroko H Dodge, Jian Zhu, Nora C Mattek, Molly Bowman, Oscar Ybarra, Katherine V Wild, David A Loewenstein, and Jeffrey A Kaye. Web-enabled conversational interactions as a method to improve cognitive functions: results of a 6-week randomized controlled trial. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 1(1):1–12, 2015.
[15] Tuka Alhanai, Rhoda Au, and James Glass. Spoken language biomarkers for detecting cognitive impairment. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 409–416. IEEE, 2017.
[16] Anil K Jain, Ruud Bolle, and Sharath Pankanti. Biometrics: Personal Identification in Networked Society, volume 479. Springer Science & Business Media, 2006.
[17] Brian Roark, Margaret Mitchell, John-Paul Hosom, Kristy Hollingshead, and Jeffrey Kaye. Spoken language derived measures for detecting mild cognitive impairment. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2081–2090, 2011.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[19] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[20] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[22] Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270, 2004.
[23] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[25] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

CHAPTER 2 REVIEW OF CURRENT RESEARCH

2.1 Language Markers in Mild Cognitive Impairment

Fraser et al. [1] introduced "linguistic markers" such as word count, use frequencies of certain parts of speech, and sentiments as predictors of dementia in AD patients. Beyond summary statistics, Asgari et al. [2] introduced word-level language markers that utilize expert-provided linguistic features such as socioeconomic context, presence of biological processes, and psychological factors. These linguistic dimensions are standardized as Linguistic Inquiry and Word Count (LIWC) [3] features for over 5,000 English words. Beyond expert annotations, statistical properties of words have been examined by a wide range of machine learning approaches such as topic modeling [4], neural word2vec [5] and contextualized word embeddings [6]. More recently, sentence-level representations such as Skip-Thought vectors [7] have shown capacity in representing phrase- and utterance-level linguistic features such as semantic similarity, phrase relationships and user intentions. Currently, the state-of-the-art (SOTA) representation of linguistic features resides in the recently introduced transformer language models (LMs) [8]. For example, BERT encoders [9] have been shown to produce state-of-the-art results in language comprehension tasks – otherwise known as natural language understanding (NLU). On the natural language generation (NLG) end, the GPT-2 decoder [10] has been shown to generate text that is difficult to distinguish from human text.
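As a concrete (though simplified) illustration of such sentence-level representations, the snippet below mean-pools the hidden states of a pretrained transformer encoder into one vector per utterance. The specific checkpoint and pooling scheme are illustrative choices, not the representations used later in the thesis.

```python
# Sketch of turning raw utterances into fixed-length "language marker"
# vectors with a pretrained transformer encoder. Checkpoint and pooling
# are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def utterance_features(utterances):
    """Encode a list of utterances into one embedding per utterance."""
    batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state         # [B, T, 768]
    mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean-pooled features

feats = utterance_features(["i forgot where i parked again",
                            "we talked about the weather today"])
```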
2.2 Dialog Systems

Central to our characterization of question contexts and language markers is the idea of a dialog system: a conversational agent (a machine learning model) that generates text (prompts/questions) in response to user input based on feedback signals from the user (responses). There are two main types of dialog systems: task-oriented and open-domain. They can synonymously be thought of as structured (task-oriented) and unstructured (open-domain) conversation. However, our problem is somewhere in the middle: our desired question policy should be able to adapt to a variety of conversation types (open-domain), yet still serve the goal of MCI diagnosis (task-oriented). We introduce the crux of task-oriented vs. open-domain conversation before formalizing the semi-structured setting mathematically in subsequent chapters.

2.3 Task-Oriented System

Task-oriented dialogue systems are typically designed for retrieval tasks in which users provide queries and the chat-bot provides appropriate responses based on an external knowledge base [11, 12, 13], or identifies correct answers by searching through vast amounts of documents [14, 15]. Such dialogue systems are typically designed as a pipeline containing a set of components, including a language understanding unit that parses intention and semantics from human input, a dialogue manager that handles dialogue state tracking and policy learning, and a language generation unit that generates responses [11, 16, 17]. Similar to task-oriented dialog, our problem setting is goal-oriented: we want to find a dialog policy that can lead to the correct classification of diseases. However, our problem setting differs from task-oriented dialog in that (1) the question generation does not have fixed frames, and (2) the user responses are certainly not constrained to specific tasks (e.g., hotel booking, movie recommendations). In this regard, the actual NLG part of the task is an open-domain dialog problem.

2.4 Open-Domain Systems

Open-domain systems refer to dyadic conversations without any constraints on the dialog structure between the conversational agents. Open-domain dialogue generation has been formulated in two ways: (1) as a statistical machine translation (SMT) [18, 19] task in which the goal is to output a sequence of tokens y^(t)_{1:K} in response to a sequence of tokens x^(t)_{1:K}, and (2) as a ranking problem where a response Y_t is selected among a list of candidate responses that best "matches" the query X_t [11]. Both approaches optimize the maximum likelihood (MLE) objective:

$$\underset{Y_t}{\text{maximize}} \;\; \log p(Y_t \mid X_t).$$

The main difference between the two approaches is that SMT is generative in nature, because the response is generated token-by-token, while the ranking approach is discriminative, as it requires a pool of candidates. But neither takes into account the history of the conversation, and as a result both fail to capture dialogue transitions. More recently, generative models such as HRED, VHRED and CVAE [20, 21, 22] try to capture dialogue history to provide more contextualized decoding. These approaches apply MLE learning to the objective:

$$\underset{Y_t}{\text{maximize}} \;\; \log p(Y_t \mid X_t, \tau_{1:t-1}). \tag{2.1}$$

Although this new objective encourages more response diversity, the dialogue agent still does not have any mechanism for encouraging a consistent personality across multiple turns [23].

2.4.1 Personalized Response Generation

An emerging direction in open-domain dialogue systems is the rise of persona models, which focus on generating responses that preserve a consistent set of personalities over the course of a dialogue. For example, Li et al. [19] learn a speaker vector on which to condition the generation process. In the PersonaChat dataset [23], each dialogue speaker is equipped with text descriptions that characterize their persona. The standard objective of persona models can be expressed as:

$$\underset{Y_t}{\text{maximize}} \;\; \log p(Y_t \mid X_t, \tau_{1:t-1}, P_Y) \tag{2.2}$$

where P_Y is the set of persona descriptions for the dialogue agent. Zhang et al. [23] and the ConvAI2 challenges [24] provided numerous approaches for incorporating the persona information into the dialogue generation process.
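A minimal sketch of this kind of persona-conditioned decoding is shown below: persona facts and dialogue history are concatenated into the conditioning context of a causal language model, and the response is sampled from it. The GPT-2 checkpoint and plain-text separators are placeholder choices; PersonaGPT in Chapter 6 instead uses dedicated special tokens and fine-tuned weights.

```python
# Sketch of decoding in the spirit of Eqn. (2.2): the response is sampled
# from a causal LM conditioned on persona facts and dialogue history.
# Model checkpoint and text separators are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

persona = ["i enjoy gardening.", "i have two grandchildren."]
history = ["hello! what did you do today?"]

context = " ".join(persona) + " " + " ".join(history)
inputs = tokenizer(context, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                        top_p=0.92, pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(response)
```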
Our problem setting is similar to open-domain dialog in that the user (patient) is allowed to generate arbitrary responses. Thus, our dialog agent must be able to handle open-domain conversation and exhibit common-sense reasoning, multi-turn consistency and response diversity – challenges that are featured in open-domain dialog.

2.4.2 MCI Prediction via Utterance Data

Asgari et al. [2] used a classical supervised learning framework to formulate MCI prediction as a binary classification problem. For each interview, a corpus was constructed using only the participant responses to interviewer questions. For each participant, the response corpus over several interviews was preprocessed into feature vectors using the Linguistic Inquiry & Word Count (LIWC) dictionary [3]. The LIWC dictionary transforms each word in a given corpus to a 69-dimensional feature vector with latent dimensions representing grammatical and semantic properties of each word. A final 69-dimensional feature vector is then constructed at the end of the corpus by aggregation of all previous LIWC vectors. The resulting feature representation is an m × 69 matrix. The best-performing classifier in this benchmark study uses a linear support vector machine (SVM) with ℓ1-norm regularization [2]. The resulting performance is 72.5% AUC over 5-fold cross-validation.
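A compressed sketch of this benchmark pipeline is given below, with a toy category lexicon standing in for the proprietary 69-dimension LIWC dictionary, and with `participant_responses` / `mci_labels` assumed to be loaded elsewhere.

```python
# Sketch of the LIWC-style featurization + L1-regularized linear SVM
# benchmark. The tiny lexicon is a stand-in for LIWC, and the input data
# (participant_responses, mci_labels) are assumed to be provided.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

CATEGORIES = ["pronoun", "past_tense", "negative_emotion"]      # toy stand-in for LIWC
LEXICON = {"i": {"pronoun"}, "we": {"pronoun"},
           "forgot": {"past_tense", "negative_emotion"},
           "went": {"past_tense"}, "sad": {"negative_emotion"}}

def liwc_style_features(responses):
    """Average category counts over all words a participant produced."""
    vec, n_words = np.zeros(len(CATEGORIES)), 0
    for utterance in responses:
        for word in utterance.lower().split():
            n_words += 1
            for cat in LEXICON.get(word, ()):
                vec[CATEGORIES.index(cat)] += 1
    return vec / max(n_words, 1)

# participant_responses: list of per-participant utterance lists; mci_labels: 0/1
X = np.stack([liwc_style_features(r) for r in participant_responses])
y = np.array(mci_labels)
clf = LinearSVC(penalty="l1", dual=False, C=1.0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```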
2.4.3 Healthcare Applications of Dialog Systems

Dialogue systems have been widely adopted in the healthcare domain for various applications. For example, chat-bots are available to assist the patient intake process [25], retrieve restaurant accommodation information for young adults with food allergies [26], and perform dialogue analysis and conversation generation for mental health treatment [27]. In the context of Alzheimer's disease research, [28] designed a virtual-reality-based chat-bot to evaluate memory loss using predefined questions and answers. [29] discussed applications of chat-bots as caregivers for Alzheimer's patients, providing safety, personal assistance, entertainment, and stimulation. More recently, [30] introduced a computer avatar that asks a list of pre-defined questions from neuropsychological tests to diagnose dementia. This work is closely related to our system as it utilizes dialogue to glean disease-related information. However, one major issue in this approach is that the questions were obtained from the mini-mental state examination (MMSE) [31], which is a confirmatory measure used to define clinical dementia (such as MCI) rather than a diagnostic tool to predict it. It is more clinically meaningful to identify diagnostic markers associated with the pathological pathways, such as lexical distributions associated with cognitive changes, for the purpose of diagnostic screening.

2.5 Beyond Linguistic Features

2.5.1 Acoustic Features

Beyond linguistic features, acoustic features such as mel-frequency cepstral coefficients (MFCC) have been shown to effectively predict MCI status [32, 33], especially when combined with linguistic features. However, these studies are thus far limited to responses to structured conversations (e.g., responses to standardized cognitive tests and fixed questionnaires). It remains an open question whether acoustic markers can be used in semi-structured conversations and whether combining linguistic and acoustic markers can improve the detection of MCI. Most predictive studies using acoustic markers rely on either fixed prompts [33] or pronunciation tasks [32] to control for the differences in linguistic contexts used in conversations. This is because variations in utterance lengths, word choices and sentence structures can introduce variance in acoustic features independent of vocal differences. In fact, Roark et al. point out that "narrow and topic-focused use of language" is important for "more accurate" acoustic marker extraction [32]. However, while these studies suggest that acoustic markers can have very high predictive value, the highly structured conversational settings may restrict the effectiveness of linguistic markers. Traditionally, semi-structured conversational settings have been used in linguistic marker studies because they reflect participant linguistic preferences in open conversations (without fixed sentences). Therefore, we are interested in combining acoustic and linguistic features in a synergistic way, despite the fact that the semi-structured setting introduces interesting trade-offs between the stability of acoustic features and the expressiveness of linguistic ones.

2.5.2 Personality Modeling

Persona modeling refers to non-goal-oriented dialog, otherwise known as chit-chat dialogue systems. Vinyals et al. [18] introduced a statistical machine translation (SMT) framework for sequence-to-sequence modeling of dialogue turns, with neither dialogue-state tracking nor modeling of long-term dependence between turns. As a result, vanilla sequence-to-sequence models often fail to capture long-term dependencies of dialogue responses and do not conform to a coherent "personality" in conversations [23, 24]. Li et al. [19] introduced persona-based modeling, which incorporates persona information about a user as word embeddings. Recently, [23] extended this idea by building the PersonaChat dataset to permit the modeling of persona-dependent dialogue trajectories that capture both turn-based transitions as well as conditional response generation based on personalized text profiles. Such an approach generated great interest in the research community. The main focus of persona modeling has been how to incorporate persona information into the dialogue generation process. For example, Song et al. [34] introduced a method for incorporating persona information through variational autoencoders (VAE) with external memory, improving performance on automatic evaluations. The ConvAI2 competition [24] provided a diverse sample of state-of-the-art model architectures for persona modeling. However, as pointed out in the analysis of results from the ConvAI2 (NeurIPS '18) competition, a bottleneck of persona models remains the disparity between automatic and human evaluation.

2.5.3 Automatic Evaluation of Persona Models

The NeurIPS '18 challenge [24] highlighted several limitations of automatic evaluation of dialogue quality. For example, F1 and perplexity (PPL) scores consider neither the temporal dependence of text generation nor the underlying semantics of generated texts. This was reflected in the fact that performance of persona models on automatic evaluation was not predictive of performance under human evaluation – the eventual competition winner scored relatively poorly compared to other top models in the automatic evaluation category. Additionally, it was found that classical metrics such as F1 score and Hits@1 automatically favor ranking models over generative models. PPL, F1 and Hits@1 also do not provide any information about the style of responses, for instance tendencies to ask questions to human users or bias toward certain conversational topics.
[35] introduced a method of evaluating dialogue quality beyond classical metrics by using a classifier to predict meta-data about the underlying persona (e.g., extroversion and personality 21 traits). However, such an approach requires strong supervision for each persona, which is not readily available for large dialogue corpus. Hu et al. [36] introduced the idea of using a discriminator as an automatic evaluator of generated dialogues. Zhang et al. [37] extended the use of discriminators to include language models for constraining the possible response outputs and policy gradients to deal with discrete generator inputs. However, discriminator-based approaches focus mostly on response-level coherence rather than coherence of the dialogue trajectory over multiple turns. In this work, we introduce an automatic evaluation scheme that systematically addresses each of these issues. 2.6 Feature Selection in Medical Informatics One prominent problem in medical informatics is the problem of feature selection over disease markers. Aside from the curse of dimensionality, biomarkers are expensive and cannot be obtained for each patient at every timestep. A prime example of this phenomenon exists in the realm of medical informatics, specifically electronic health records (EHR) dominated by time-series data. Yet this hasn’t stopped EHR from sparking numerous research interests in recent years [38, 39, 40, 41, 42]. Time-series data in the EHR consists of features that are sampled at different levels of temporal granularity (e.g., lab tests are sampled at longer time-scales than vital signs). Most of the time, only a small subset of features are observed at any time-step, as it is prohibitively expensive for the physician to obtain the full set of features for every patient for every time interval. In practice, a form of active sensing [43] is implicitly done in an ad-hoc manner by physicians based on a combination of inpatient work-flow, resource constraints, and domain expertise. As a result, the key difficulty with EHR time-series data is that they almost always require re-sampling and imputation due to large amounts of missing values. For example, consider the EHR setting, where at each time-step, the physician has to decide how to allocate a constrained amount of tests/resources among patients for some set of clinical tasks. In other words, the precision sensing problem for time-series is in fact a resource distribution problem at each time-step, with a fixed observation budget across a given set of samples. Since the querying process is applied across time rather 22 than samples, any feature selection formulation must reflect both the temporal-dependency and the budgetary-constraint of the querying process. Therefore it is no surprise that several recent works have shown that temporal models trained on EHR data have fragile decision boundaries that are susceptible to small perturbations [44, 45]. In later chapters, we present a framework called precision sensing to overcome the temporal dependence and non-i.i.d. property of medical feature selection. We first introduce some related works in active sensing below. 2.6.1 Active Sensing Precision sensing typically falls under the suite of active data acquisition problems such as active learning [46] (optimal experiment design) and active sensing [43], both of which attracted significant research interest from the machine learning community. 
Techniques such as incremental feature acquisition [47] and active sensing [43, 48] focus on selecting a subset of features by querying the original data in an online manner, with the goal of reconstructing (sample, view) pairs that maximize the mutual information of predictive classifiers. Note that the original active sensing problem proposed by Yu et al. [43] operates on static feature sets rather than time-series data. More recently, active sensing has been extended to time-series data by [48] and [49], but in a problem setting very different from this paper. This is because both [48] and [49] manage not only a feature selection budget across features, i.e., “data-streams”, but also across time. However, a separate budget constraint is considered for each individual. Precision sensing presents an additional layer of complexity—the sensor model has to consider budget constraints across samples. That is, we do not make the i.i.d. assumption across data samples. In reality, a sensing strategy has to take into account an accumulating budget across all samples in a data batch. Using the toy example from [49], a physician may need to conserve the observation budget for one group of patients in order to decrease uncertainty of predictions for another batch at future time points. Thus, our goal is to dynamically adjust the allocation of a fixed observation budget across a time window and a set of samples. In experiments, we compare our method against state-of-the-art (SOTA) active sensing 23 methods. 2.6.2 Temporal Feature Selection Classic statistical methods [50, 51], perturbation models [44] and attention-based techniques [52] have also been applied to distill model-preferences over temporal features across time. Schulam [53] and Saria et al. [54] proposed modeling temporal relevance of features by abstracting hierarchical representations across time (e.g., population-level and subpopulation-level latent features). However, all these strategies lack an active component because their sensing strategies neither adapt to shifts in future data-streams nor do they incrementally build the future sensing strategy. While Clertant [55], like Chang et al. [49], formulated the sensing problem in a Markov Decision Process (MDP) framework, their proposed algorithm deals with the sequential selection of static features, rather than temporal data. Moreover, perturbation models and temporal feature selection [56] are specific to the hypothesis model that they are trained on. But ideally a sensing strategy should be generalizable to different hypothesis classes and decision boundaries. We experimentally show how our method maintains performance for different model types, including non-temporal models (e.g., feed-forward networks) as well as non-deep models (e.g., logistic regression). 2.7 Sample Efficiency and Privileged Information Finally, we explore the issue of improving sample-efficiency in using disease markers for clinical prediction. In classical supervised learning, the learner is presented with the training tuple {(𝑥𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 and performs an optimization task of finding the best model in a hypothesis space ℎ : 𝑋 ↦→ 𝑌 to approximate some true 𝑓 : 𝑋 ↦→ 𝑌 which explains the data. Given a new task, knowledge transfer [57] is often applied to accelerate the learning process by distilling and transferring relevant knowledge from previous tasks to the unseen one. Under classical formulations, the learner incorporates prior information in one of several ways: 1. 
Direct transfer of parameters from old hypothesis models to the new task and fine-tuning [57] the parameters. 24 2. Learning multiple tasks (online or batched) related to the current task [58, 42]. 3. Using the prior knowledge (i.e. a knowledge graph) to constrain the hypothesis space by regularization [59]. 4. Using representations (i.e. embeddings) of 𝑋 and / or 𝑌 from previous tasks for new tasks [5, 60]. 5. Accelerate learning rate and model compression by Distillation as typically seen in Teacher- Student models [61]. In each of these settings, knowledge transfer operates directly within the 𝑋, 𝑌 and H spaces to improve generalization of information from old models to the new task. Recently, Learning Using Privileged Information (LUPI) [62] has provided a new paradigm for knowledge transfer. Under LUPI, the learner now interacts with a Teacher that provides privileged information (PI) and is only available at training time. From the learner’s perspective, the training set is now extended to the tuple {(𝑥𝑖 , 𝑥𝑖∗ , 𝑦𝑖 )}𝑖=1 𝑛 , and the testing set stays the same. Some examples of PI include: 1) Future information that relates 𝑋 and 𝑌 . For example, using future stock prices beyond the prediction window during training. 2) Auxiliary information describing the label space that is available only to a subset of samples. For example, physician notes that accompany diagnostic predictions which is only available after the diagnosis is made. At a high-level, PI provides some similarity information between training samples from the original feature space, and the Teacher hypotehsis serves as additional “explanations” of the hypothesis space [63, 62]. As a result, [63] showed that the LUPI Teacher provides a principled way to improve the generalization error of Student learners using agnostic PAC models, providing some theoretical improvements in the number of samples required to achieve generalizability the new task (i.e. improves sampling efficiency) However, under the current state-of-art LUPI formulations such as [64, 65], PI is incorporated by means of support vectors and dropout schemes, both of which fail to explore the underlying similarity structure between samples in the PI space 𝑋 ∗ . For example the mode distribution and 25 pairwise similarity between points in the 𝑋 ∗ space is largely unused. The PI contributes as auxiliary training features and kernel information, but all LUPI information is lost at inference time and beyond. A significant question remains: can privileged information be retained for future tasks? Ideally, we want the LUPI Teacher to incorporate PI in a way that is specific enough to inform similarity between training samples yet general enough to be retained across future tasks. In the following subsections, we highlight some of these differences as well as improvements over related LUPI works. 2.7.1 Transfer Learning Review Consider the transfer learning setting: Assume we are given a source domain D𝑆 = {X𝑆 , 𝑃𝑆 (𝑋)}, a source task T𝑆 = {Y𝑆 , H𝑆 }, a target domain D𝑇 = {X𝑇 , 𝑃𝑇 (𝑋)}, and a target task T𝑇 = {Y𝑇 , H𝑇 }, where X defines a feature space, 𝑃(𝑋) defines its marginal distribution, Y defines a label space, and H defines a space of hypothesis which best approximates an underlying function 𝑓 : 𝑋 ↦→ 𝑌 which explains the data. Transfer learning leverage knowledge in D𝑆 and T𝑆 to improve the learning of ℎ𝑇∗ ∈ H𝑇 , where D𝑆 ≠ D𝑇 or T𝑆 ≠ T𝑇 . 
Improving the learning of ℎ∗𝑇 can come in two ways: (1) improving the sample complexity of H𝑇, or (2) improving the rate of convergence for finding ℎ∗𝑇 ∈ H𝑇. Transductive learning describes a situation where D𝑆 ≠ D𝑇 but T𝑆 = T𝑇, which can occur either due to a mismatch in feature space, i.e. X𝑆 ≠ X𝑇, or due to co-variate shift, i.e. 𝑃𝑆(𝑋) ≠ 𝑃𝑇(𝑋) [57, 66]. On the other hand, inductive learning describes a situation where D𝑆 = D𝑇 but T𝑆 ≠ T𝑇, i.e. the same domain but different tasks. Briefly, the various transfer learning approaches are summarized in Table 2.1.

Table 2.1: Summary Table of Transfer Learning Approaches

Setting / Method          | Domain                         | Targets           | Examples
Inductive learning        | D𝑆 = D𝑇                       | T𝑆 ≠ T𝑇          |
  Multi-task Learning     | Same Features                  | Y𝑆 ≠ Y𝑇          | [67]
  Parameter Transfer      | Same Features                  | T𝑆 ≠ T𝑇          |
  Relational Knowledge    | Relational Graph               | Y𝑆 ≠ Y𝑇          | [59]
Transductive Learning     | D𝑆 ≠ D𝑇                       | T𝑆 = T𝑇          |
  Multi-View Learning     | X𝑆 ≠ X𝑇                       | Same Tasks        | [60, 68]
  Sample Selection Bias   | 𝑃𝑆(𝑌|𝑋) ≠ 𝑃𝑇(𝑌|𝑋)             | Same Tasks        | [69]
  Co-Variate Shift        | 𝑃𝑆(𝑋) ≠ 𝑃𝑇(𝑋)                 | Same Task         |
  Domain-Adaptation       | 𝑃𝑆(𝑋,𝑌) ≠ 𝑃𝑇(𝑋,𝑌)             | Y𝑆 = Y𝑇          |
Unsupervised Learning     | D𝑆 ≠ D𝑇                       | T𝑆 ≠ T𝑇          |
  Feature Representation  | Transfer of 𝜙(𝑋𝑆) → 𝜙(𝑋𝑇)      | New or Old Tasks  | [5, 70]

2.7.2 Knowledge Transfer by Inductive Learning

A popular approach for solving the inductive learning problem is parameter sharing, whereby some or all of the parameters of ℎ∗𝑆 ∈ H𝑆 are recycled for the target model, and fine-tuning is applied to accelerate the search for ℎ∗𝑇 ∈ H𝑇 [57, 71]. For example, consider multi-task learning (MTL), where multiple hypotheses are learned jointly for a set of related tasks [57]. Under this setting, the tasks T1, ..., T𝑇 are presumed to come from the same underlying distribution. Using the same feature space 𝑋, a subset of the joint hypothesis space ∪𝑘 H𝑘 ⊂ {H1, ..., H𝑇−1} should presumably accelerate the learning of the optimal hypothesis for the target task. Thus, by transferring parameters from {H1, ..., H𝑇−1} to H𝑇, the convergence rate to the optimal ℎ∗𝑇 can be improved. A disadvantage of MTL is the constraint on the relatedness of tasks. For example, it is hard to know beforehand which subset of tasks will contribute positively to the target, and poor selection of parameter transfer can actually lead to negative transfer, resulting in poor performance on the target task [57]. By contrast, LUPI uses PI that is by definition specific to the current task. However, unlike MTL, LUPI does not use a shared feature space between the PI and the original data, i.e. 𝑋 ≠ 𝑋∗, and cannot directly incorporate the Teacher hypothesis via direct parameter sharing. In recent years, relational knowledge has provided an alternative approach to inductive learning by incorporating domain-specific knowledge in the form of regularized priors for target tasks [59, 72]. In contrast to parameter sharing, relational knowledge is agnostic to the collection of source tasks and should apply universally to all learning tasks given the same feature space 𝑋. This formulation overcomes the limitation of task similarity in the former case, but it is also very expensive to construct reliable relational knowledge, such as knowledge graphs. By constraining the hypothesis space of the target task with relational knowledge, the speed of convergence can be improved. However, inductive learning techniques generally focus on improving either prediction accuracy or convergence rate rather than sample complexity.
The main focus of LUPI, on the other hand, is to provide some guarantees on improving the sample efficiency of the Student learner. 2.7.3 Knowledge Transfer by Transductive Learning In contrast to inductive learning, transductive learning considers two sub-problems: X𝑆 ≠ X𝑇 and 𝑃𝑆 (𝑌 |𝑋) ≠ 𝑃𝑇 (𝑌 |𝑋). Multi-view Learning (MVL) [60, 73] and Multi-modal Learning (MML) [74, 75] are main methods which deal with the X𝑆 ≠ X𝑇 case, where the feature spaces differ between the source and target domains. MVL is often involved in processing different subsets of features describing the same set of samples, for example different channels of EEG signals for neurological diagnostics [76]. MML deals with different modalities of data, for example picture and text descriptions of a disease process [77]. Often times, both MVL and MML utilize some form of data fusion, whereby some shared representation of the multiple source domains is used together to predict the same target task [67]. One drawback of these approaches is that modalities such as PI is unavailable at test time, leading to poor generalization of the hypothesis model since it is conditioned on both 𝑋 and 𝑋 ∗ . For example, suppose we have some data fusion model 𝑔 : 𝑋 × 𝑋 ∗ ↦→ 𝑍, and a hypothesis function ℎ : 𝑍 ↦→ 𝑌 . At training time, both the PI and the original features are utilized to train ℎ(𝑔(𝑥, 𝑥 ∗ )) = 𝑦. At test time, however, since only 𝑋 is available, 𝑔(𝑥, 0) may actually map to a completely different set of features in 𝑍, leading to a biased ℎ(𝑧) = 𝑦. In other words, if 𝑋 is under-utilized during training, ℎ(𝑧) will likely lead to poor generalization at test time. Unfortunately, since PI is by definition a more task-specific descriptor of 𝑌 , this is the most likely case and presents a limitation for data fusion methods for incorporating PI. On the other hand, methods such as biased selection sampling [78] and Optimal Transport [79] are used to deal with the same set of features, but their marginal distributions disagree, i.e. 𝑃𝑆 (𝑋) ≠ 𝑃𝑇 (𝑋). This is otherwise known as co-variate shift [66], and recently generative models such as [80, 81] and [82] have dominated the state-of-art. For example, VRADA and 28 RadialGAN both attempt to learn a domain-invariant latent distribution s.t. 𝑃(𝑍 |𝑋𝑆 ) = 𝑃(𝑍 |𝑋𝑇 ), and 𝑃𝑆 (𝑌 |𝑍) = 𝑃𝑇 (𝑌 |𝑍)[82, 80]. Methods such as [83] and [81] try to directly learn a mapping between 𝑃(𝑋𝑇 |𝑋𝑆 ) so that samples from 𝑋𝑆 can be used to augment the training of ℎ(𝑥) = 𝑦. These generative models present a possible way of incorporating PI into the modeling process by learning a domain-invariant latent representation or transformation function between 𝑋 and 𝑋 ∗ . However, these models improve the hypothesis function by means of data augmentation (i.e. increasing the samples available) rather than decreasing the sample complexity required to train an accurate model. In this regard, LUPI is advantageous in that it provides a framework for decreasing the samples necessary by leveraging the Teacher’s hypothesis function 𝑓 ∗ . This subtle different becomes important when “big data” is not available for complex data problems, for example modeling rare diseases in healthcare records. 2.7.4 Knowledge Transfer with Distributed Representations Transfer of feature representations is another paradigm of transfer learning that extends beyond inductive transfer [57]. Under this setting, no label information is available for either the source or target domains. 
However, the underlying structure of the source domain 𝑋𝑆 can first be extracted using unsupervised learning before applying to downstream tasks. A prime example of this can be seen in the learning of distributed representations for words [5], which is widely used in NLP applications by converting the feature space of words (which have no intrinsic distance or similarity properties) into an embedding vector space where distance and similarity can be computed based on co-occurrence frequencies. Several related methods have been applied in the context of medicine by learning distributed representations for medical concepts obtained from large corpora of journal publications [70], EHR notes [41], and medical claims [72]. This framework provides a tool for understanding the structural properties of PI, which can then be transfered to improve the actual learning task of estimating the optimal Student hypothesis function. However, tactful incorporation of these embeddings may not be as simple, since the learning regime must be able to incorporate the embedded information in the hypothesis somehow. Our methodology introduces a way to achieve 29 this and incorporates both structural information about the PI as well as the Teacher hypothesis function into the Student learning procedure. 2.7.5 Knowledge Transfer by Knowledge Distillation A closely related concept to LUPI is knowledge distillation [61] in which a Teacher model outputs are used as “soft-labels” to accelerate Student learning of the target task. Similar to LUPI, the Teacher learns a more accurate model of the task, 𝑓 ∗ : 𝑋 ↦→ 𝑌 , and the Student tries to learn a “distilled” representation of the hypothesis space, where the VC-dimension of the Student hypothesis space is less than that of the teacher [84]. By decreasing its VC-dimension, the Student learner can improves its sampling efficiency at least by a constant factor. However, the Teacher model does not incorporate PI, which provides a better prediction of the label than the original feature space. Additionally, the original LUPI provides a more favorable generalization bound compared to model distillation without the help of PI [84, 63]. 2.7.6 Knowledge Transfer by LUPI Finally, LUPI provides some performance guarantees with regard to the sample efficiency of the Student learner, so long as the PI and the Teacher model satisfies some conditions [62, 63]. However, the main drawback of current formulations of LUPI is that the PI used is highly specific to the task at hand – no information is retained for related tasks. Our work applies elements of unsupervised learning and transductive learning to alleviate this limitation of LUPI. Although recent works such as [64] has generalized the LUPI framework to deep learning settings, our work extends LUPI to allow for multi-task and transfer learning, enabling the generalization of a PI source to accelerate the sample efficiency of many tasks. 30 2.7.7 LUPI Preliminaries Traditionally, LUPI is applied to training data of the form: (𝑥1 , 𝑥1∗ , 𝑦 1 ), (𝑥 2 , 𝑥2∗ , 𝑦 2 )...(𝑥 𝑚 , 𝑥 𝑚 ∗ , 𝑦 𝑚 ) ∈ Dtrain , where 𝑥𝑖 ∈ R𝑛 denotes example feature (EF) vectors from the original feature space, and 𝑥𝑖∗ ∈ R𝑛∗ denotes privileged information (PI) vectors from the privileged information space. D𝑡𝑟𝑎𝑖𝑛 indicates that the PI inputs are only available during training. 𝑦𝑖 ∈ {−1, +1} denotes the ground truth labels for inputs (𝑥𝑖 , 𝑥𝑖∗ ). LUPI then considers two pattern recognition problems: • Using {(𝑥𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 , find rule ℎ(𝑥 ) = 𝑦 . 
• Using {(𝑥∗𝑖 , 𝑦𝑖)}𝑚𝑖=1, find rule 𝑓∗(𝑥∗𝑖) = 𝑦𝑖.

Supposing that 𝑓∗(𝑥∗𝑖) = 𝑦𝑖 produces low generalization error, the LUPI task is to transfer knowledge of rules in the 𝑋∗ space to improve learning in the 𝑋 space. The original LUPI formulation in [62] considers only SVM models, whereby the privileged information is incorporated into the SVM objective as follows:

\begin{align*}
\underset{w,\, w^{*}}{\text{minimize}} \quad & \frac{1}{2}\left(w^{T} w + \lambda\, w^{*T} w^{*}\right) + C \sum_{i=1}^{m} \left(w^{*T} \psi(x_i^{*}) + b^{*}\right) \\
\text{s.t.} \quad & y_i\left(w^{T} \phi(x_i) + b\right) \ge 1 - \left(w^{*T} \psi(x_i^{*}) + b^{*}\right), \\
& w^{*T} \psi(x_i^{*}) + b^{*} \ge 0, \qquad i = 1, \dots, m.
\end{align*}

Here, 𝑤 and 𝑤∗ are the parameter vectors, and 𝑏 and 𝑏∗ are the biases of the decision functions. 𝜓(·) and 𝜙(·) are feature mapping kernels applied to the PI and EF vectors, respectively. 𝐶 > 0 and 𝜆 > 0 are hyperparameters which control the contribution of privileged information to the overall cost function: 𝐶 controls the contribution of the Teacher loss 𝑓∗ : 𝑋∗ ↦→ 𝑌, while 𝜆 controls the correction of the parameter space by 𝑤∗𝑇𝑤∗.

BIBLIOGRAPHY

[1] Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422, 2016.
[2] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017.
[3] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
[4] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[6] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[7] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
[11] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35, 2017.
[12] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.
[13] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young.
A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016. [14] Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073, 2017. [15] Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. Reinforced mnemonic reader for machine reading comprehension. arXiv preprint arXiv:1705.02798, 2017. [16] Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review, 21(2):97–126, 2006. [17] Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. Optimizing dialogue management with reinforcement learning: Experiments with the njfun system. Journal of Artificial Intelligence Research, 16:105–133, 2002. [18] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015. [19] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016. [20] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. [21] Yookoon Park, Jaemin Cho, and Gunhee Kim. A hierarchical latent structure for variational conversation modeling. arXiv preprint arXiv:1804.03424, 2018. [22] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017. [23] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018. [24] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack 34 Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conver- sational intelligence challenge (convai2). In The NeurIPS’18 Competition, pages 187–208. Springer, 2020. [25] Lin Ni, Chenhao Lu, Niu Liu, and Jiamou Liu. Mandy: Towards a smart primary care chatbot application. In International Symposium on Knowledge and Systems Sciences, pages 38–52. Springer, 2017. [26] Paris Hsu, Jingshu Zhao, Kehan Liao, Tianyi Liu, and Chen Wang. Allergybot: A chatbot technology intervention for young adults with food allergies dining out. In CHI, pages 74–79. ACM, 2017. [27] Kyo-Joong Oh, Dongkun Lee, Byungsoo Ko, and Ho-Jin Choi. A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation. In MDM, pages 371–375. IEEE, 2017. [28] Juan Manuel Fernandez Montenegro and Vasileios Argyriou. Cognitive evaluation for the diagnosis of alzheimer’s disease based on turing test and virtual environments. Physiology & behavior, 173:42–51, 2017. [29] Miguel A Salichs, Irene P Encinar, Esther Salichs, Álvaro Castro-González, and María Malfaz. Study of scenarios and technical requirements of a social assistive robot for alzheimer’s disease patients and their caregivers. International Journal of Social Robotics, 8(1):85–102, 2016. 
[30] Hiroki Tanaka, Hiroyoshi Adachi, Norimichi Ukita, Manabu Ikeda, Hiroaki Kazui, Takashi Kudo, and Satoshi Nakamura. Detecting dementia through interactive computer avatars. IEEE journal of translational engineering in health and medicine, 5:1–11, 2017. [31] Tom N Tombaugh and Nancy J McIntyre. The mini-mental state examination: a comprehensive review. J. of the Ame. Geriatrics Soc., 40(9):922–935, 1992. [32] Brian Roark, Margaret Mitchell, John-Paul Hosom, Kristy Hollingshead, and Jeffrey Kaye. Spoken language derived measures for detecting mild cognitive impairment. IEEE transactions on audio, speech, and language processing, 19(7):2081–2090, 2011. [33] Tuka Alhanai, Rhoda Au, and James Glass. Spoken language biomarkers for detecting cognitive impairment. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 409–416. IEEE, 2017. [34] Haoyu Song, Wei-Nan Zhang, Yiming Cui, Dong Wang, and Ting Liu. Exploiting persona in- formation for diverse generation of conversational responses. arXiv preprint arXiv:1905.12188, 2019. [35] Yujie Xing and Raquel Fernández. Automatic evaluation of neural personality-based chatbots. arXiv preprint arXiv:1810.00472, 2018. 35 [36] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1587–1596. JMLR. org, 2017. [37] Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Bidirectional generative adversarial networks for neural machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 190–199, 2018. [38] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 432–440. SIAM, 2016. [39] Fengyi Tang, Cao Xiao, Fei Wang, and Jiayu Zhou. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open, 2018. [40] Abhyuday N Jagannatha and Hong Yu. Bidirectional rnn for medical event detection in electronic health records. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2016, page 473. NIH Public Access, 2016. [41] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016. [42] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017. [43] Shipeng Yu, Balaji Krishnapuram, Romer Rosales, and R Bharat Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009. [44] Mengying Sun, Fengyi Tang, Jinfeng Yi, Fei Wang, and Jiayu Zhou. Identify susceptible locations in medical records via adversarial attacks on deep predictive models. arXiv preprint arXiv:1802.04822, 2018. [45] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Ead: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [46] Burr Settles. Active learning literature survey. 
Technical report, University of Wisconsin- Madison Department of Computer Sciences, 2009. [47] Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Raymond Mooney. Active feature-value acquisition for classifier induction. In Fourth IEEE International Conference on Data Mining (ICDM’04), pages 483–486. IEEE, 2004. 36 [48] Jinsung Yoon, William R Zame, and Mihaela van der Schaar. Deep sensing: Active sensing using multi-directional recurrent neural networks. 2018. [49] Chun-Hao Chang, Mingjie Mai, and Anna Goldenberg. Dynamic measurement scheduling for event forecasting using deep rl. arXiv preprint arXiv:1901.09699, 2019. [50] Michail Tsagris, Vincenzo Lagani, and Ioannis Tsamardinos. Feature selection for high- dimensional temporal data. BMC bioinformatics, 19(1):17, 2018. [51] Liying Fang, Han Zhao, Pu Wang, Mingwei Yu, Jianzhuo Yan, Wenshuai Cheng, and Peiyu Chen. Feature selection method based on mutual information and class separability for dimension reduction in multidimensional time series for clinical data. Biomedical Signal Processing and Control, 21:82–89, 2015. [52] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016. [53] Peter Schulam and Suchi Saria. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems, pages 748–756, 2015. [54] Suchi Saria, Andrew Duchi, and Daphne Koller. Discovering deformable motifs in continuous time series data. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011. [55] Matthieu Clertant, Nataliya Sokolovska, Yann Chevaleyre, and Blaise Hanczar. Interpretable cascade classifiers with abstention. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2312–2320, 2019. [56] Rohan A Baxter, Graham J Williams, and Hongxing He. Feature selection for temporal health records. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 198–209. Springer, 2001. [57] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010. [58] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997. [59] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017. 37 [60] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017. [61] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [62] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544–557, 2009. [63] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of machine learning research, 16(2023-2049):2, 2015. [64] John Lambert, Ozan Sener, and Silvio Savarese. Deep learning under privileged information using heteroscedastic dropout. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8886–8895, 2018. [65] Xue Li, Bo Du, Chang Xu, Yipeng Zhang, Lefei Zhang, and Dacheng Tao. R-svm+: Robust learning with privileged information. In IJCAI, pages 2411–2417, 2018. [66] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000. [67] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017. [68] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, and Jieping Ye. Deep multi-view spatial-temporal network for taxi demand prediction. arXiv preprint arXiv:1802.08714, 2018. [69] Yi Luo, Guojie Song, Pengyu Li, and Zhongang Qi. Multi-task medical concept normalization using multi-view convolutional neural network. 2018. [70] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016:41, 2016. [71] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014. [72] Samuel G Finlayson, Paea LePendu, and Nigam H Shah. Building the graph of medicine from millions of clinical narratives. Scientific data, 1:140032, 2014. [73] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013. 38 [74] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [75] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011. [76] Ye Yuan, Guangxu Xun, Kebin Jia, and Aidong Zhang. A multi-view deep learning method for epileptic seizure detection using short-time fourier transform. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 213–222. ACM, 2017. [77] C. Nagpal. Deep multimodal fusion of health records and notes for multitask clinical event prediction. In NIPS ML4H Workshop, 2017. [78] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, page 114. ACM, 2004. [79] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. [80] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. Radialgan: Leveraging multiple datasets to improve target-specific predictive models using generative adversarial networks. arXiv preprint arXiv:1802.06403, 2018. [81] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint, 1711, 2017. [82] Sanjay Purushotham, Wilka Carvalho, Tanachat Nilanon, and Yan Liu. Variational recurrent adversarial deep domain adaptation. 2016. [83] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017. 
[84] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015. 39 CHAPTER 3 SCALABLE DIAGNOSTIC SCREENING OF MCI USING DIALOG POLICY 3.1 Introduction The progression of Alzheimer Disease (AD) has consistently been a heavy area of research in clinical medicine because while the disease itself is incurable, early intervention at the prodromal phases of the disease has proven to delay the onset of AD-related mental degeneration and systemic issues for months to years [1, 2]. Consequently, much of the recent clinical research efforts have focused on detecting early stages of mild cognitive impairment (MCI), which is a prodromal phase in AD progression occurring months to years before visible mental decline begins [3]. If successfully detected at this stage, intervention methods may confer numerous benefits in the longevity of cognitive and physiological health of AD patients [1, 2]. Brain imaging, such as the structural magnetic resonance imaging (MRI), was shown to contain prime markers of AD, capturing the physiologic changes in the AD pathological process [4, 5]. However, the identification of MCI from normal aging (NL) is particularly challenging due to the fact that structural changes in the brain in this phase are minor and hard to detect through structural MRI [6], even though decline in mental status and cognitive have already begun in most cases. Recently, the structural connections among brain regions inferred from diffusion MRI have provided promising predictive performance of MCI detection [7, 8], yet sketching brain networks via imaging still remains rather prohibitively expensive and difficult to scale. Moreover, the high dimensionality of brain imaging combined with small sample size usually imposes significant challenges in learning algorithms and leads to unstable generalization performance. On the other hand, behavior and social markers could offer a much more cost- effective option for MCI detection [9, 10, 11, 12]. A recent clinical trial has studied differentiating early stage MCI from NL cohort groups using transcripts of extensive conversations between patients and trained interviewers [11]. In a recent preliminary study [12], the authors trained supervised learning 40 models from the lexical distribution of the conversation, and showed that conversational responses of MCI and NL patients take on different distribution under various conversational topics. The success of [12] in predicting MCI using human dialogue introduced an alternative natural language processing (NLP) approach to a classically clinically expensive problem. However, the use of human interviewers still requires substantial amounts of interaction between trained staff which incur significant expense in its current form. Thus, the bottleneck questions remain: (1) can we cut down on the amount of conversations needed to achieve accurate prediction, (2) can we improve upon baseline performance given limited cohort-specific data? To address the aforementioned questions above, in this chapter we propose a novel reinforcement learning (RL) framework, which learns a MCI diagnosis agent using only very limited amount of offline human dialogue transcripts. The learned diagnosis agent can conduct subject-specific conversation with humans, asking questions based on existing conversations to efficiently sketch the lexical distribution and give high-performance MCI prediction. 
In order to facilitate RL using offline transcripts, we introduce a dialogue simulator pipeline which generates new conversational episodes that are less noisy and outperform the original corpus for MCI prediction. Our dialogue pipeline provides a self-contained framework for directing dialogue generation for diagnostic screening, which can potentially replace the need for human-expert interviews. Our RL-agent learns optimal dialogue strategies that are adaptive to unseen users, enabling medically relevant NLP data to be generated on a large scale if deployed in a realistic setting. Furthermore, data generated from our dialogue simulations may be used for data augmentation or to guide the medical data collection process in the future. Ultimately, by greatly decreasing the cost of data collection and the amount of data needed for high-level performance, we introduce a clinical direction that is much more cost-effective and scalable to large-scale diagnostic screening and data collection. The combination of NLP features with our reinforcement learning framework may extend the process of diagnostic screening well beyond the confines of hospitals and primary care facilities.

3.2 Methodology

3.2.1 Dialog Systems

Our approach is inspired by recent advancements in dialog systems. Dialog systems provide a natural human-computer interface and have been an active research field for decades. Task-oriented dialogue systems are typically designed for retrieval tasks in which users provide queries and the chat-bot provides appropriate responses based on an external knowledge base [13, 14, 15], or identifies correct answers by looking into vast amounts of documents [16, 17]. Such dialogue systems are typically designed as a pipeline containing a set of components, including a language understanding unit that parses the intention and semantics of the human input, a dialogue manager that handles dialogue state tracking and policy learning, and a language generation unit that generates responses [15, 18, 19]. While each of the components can be handcrafted or trained individually, recent advances in deep learning allow end-to-end training [13, 14, 20] and significantly improve the performance and the capability to adapt to new domains [21]. The end-to-end systems can be trained using supervised learning [13, 22] or reinforcement learning (RL), by leveraging a user simulator [14, 20]. The main advantage of RL is that fewer training samples are needed to learn the high-degree-of-freedom deep models. In our work, we design a simulator to enable RL due to the limited amount of clinical data available for supervised training. We note that even though our dialogue system also tries to achieve a task (identifying MCI patients), the nature of our system is radically different from existing task-oriented dialogue systems: its goal is to efficiently sketch a disease-specific lexical distribution by asking subject-specific questions and to give classification results. The framework we propose in this chapter involves the use of reinforcement learning to learn the optimal set of questions 𝜋∗ to ask participants for the purposes of distinguishing MCI. On the test set, we generate new episodes from these questions for prediction rather than using the original corpus. To actualize the RL + dialogue simulation framework, we propose a multi-step approach for implementation which capitalizes on the vast existing knowledge of NLP research.

Figure 3.1: Overview of the proposed methodology.
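To make the interaction among the modules in Figure 3.1 concrete, the sketch below wires a per-user simulator, an MCI classifier, and a question-asking agent through a simple dialogue-manager loop. All class and function names (UserSimulator, MCIClassifier, DialogueAgent, run_episode) and the random stand-in models are illustrative placeholders for the trained components described in the following sections, not the study's actual implementation.

```python
import numpy as np

class UserSimulator:
    """Maps a question index to a skip-thought-like response embedding (random stand-in)."""
    def __init__(self, n_questions=107, embed_dim=4800, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_questions, embed_dim))
    def respond(self, question_id):
        return self.W[question_id]

class MCIClassifier:
    """Toy stand-in: logistic score over the running mean of response embeddings."""
    def __init__(self, embed_dim=4800, seed=1):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=embed_dim)
    def predict_proba(self, mean_embedding):
        p_mci = 1.0 / (1.0 + np.exp(-mean_embedding @ self.w))
        return np.array([1.0 - p_mci, p_mci])   # [P(NL), P(MCI)]

class DialogueAgent:
    """Chooses the next question; uniformly at random here as a placeholder policy."""
    def __init__(self, n_questions=107, seed=2):
        self.n_questions, self.rng = n_questions, np.random.default_rng(seed)
    def act(self, state):
        return self.rng.integers(self.n_questions)

def run_episode(agent, simulator, classifier, max_turns=35):
    """Dialogue-manager loop: ask a question, embed the response, update the running mean, classify."""
    embeddings = []
    for t in range(max_turns):
        q = agent.act(state=embeddings)
        embeddings.append(simulator.respond(q))
        probs = classifier.predict_proba(np.mean(embeddings, axis=0))
        if probs.max() >= 0.65:      # illustrative early stop at the confidence threshold
            break
    return probs, t + 1

if __name__ == "__main__":
    probs, turns = run_episode(DialogueAgent(), UserSimulator(), MCIClassifier())
    print(f"P(NL)={probs[0]:.2f}, P(MCI)={probs[1]:.2f} after {turns} turns")
```

The early stop at a 0.65 confidence only mirrors the confidence threshold introduced later in this chapter; in the actual system the RL-agent itself learns when to end the conversation.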
In the following section, we present the details of each component of the dialogue system. Figure 3.1 shows an overview of the components of our experimental pipeline. Complete conversations from participants are used to build user simulators. The simulators are then used to train an MCI diagnosis agent (chat-bot), which conducts minimal turns of conversation with participants to sketch the lexical distribution that is then used to perform MCI classification.

3.2.2 Overview of Pipeline

Our proposed framework contains three key learning modules: the user simulator, the MCI classifier and the RL-agent. The proposed pipeline is illustrated in Figure 3.2. First, the user simulator is trained by unsupervised learning, which simulates the distributed representation of user responses given feasible question inputs. Next, the MCI classifier predicts the patient label based on the averaged distributed representation of its corpus responses. These two components and the dialogue manager comprise the training environment for the RL-agent. The dialogue manager utilizes the user simulator and MCI classifier to handle the state transitions and also computes the reward based on the ground-truth labels from the training set and the MCI classifier prediction. After training in this environment, the RL-agent is able to deliver the optimal sequences of questions for training-set users at various stages of conversations. During testing, the RL-agent produces query inputs to the test-set user simulators, which represent the unseen users. Using these new queries, the user simulators generate the corresponding distributed representation of test-set user responses for MCI prediction.

Figure 3.2: Illustration of reinforcement learning components in our proposed approach (question encoding, user simulator, RL agent, dialogue manager and MCI classifier operating on skip-thought response embeddings).

In the following subsections, we will present each component of the pipeline in detail and demonstrate the effectiveness of the RL framework in improving prediction accuracy while reducing conversational turns.

3.2.3 Construction of Turn-Based Dialogue

Since utterance data was collected in the form of conversational transcripts for each participant, we must reconstruct turn-based dialogue from participant responses. The participant responses were unstructured, while interviewer questions ranged over preset question topics, as illustrated below.

Interviewer: so what did you do yesterday?
Participant: i had yesterday morning i yesterday was a busy day for me. i im forgetting i went to where did i go in the morning. well i went to albertsons yesterday...
Interviewer: what do you see in this picture?
Participant: we got a picture gosh. it looks like my uncle lou. but he never ...
Interviewer: when do you think this picture was taken?
Participant: this picture was probably eighteen seventy or something or nineteen twenty. so he looks too old for war he must have been ...

In total there were well over 150 possible queries from the interviewers. However, for the purposes of this study, we re-compiled the question list into 107 general questions which were ubiquitous across all conversations.
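A minimal sketch of this turn reconstruction is given below: interviewer prompts are paired with the participant responses that follow them, assigned to a question category, and delexicalised. The category keywords, slot token and helper names are hypothetical examples rather than the exact mapping used in the study; examples of the actual question categories follow in Table 3.1.

```python
import re

# Hypothetical category keywords; the study's full 107-question list is not reproduced here.
CATEGORY_KEYWORDS = {
    "activity": ["yesterday", "go outside"],
    "picture":  ["picture"],
    "tech":     ["computer"],
}

def categorize(question: str) -> str:
    q = question.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return category
    return "unspecified"

def delexicalise(text: str) -> str:
    # Replace a concrete store name with a slot token (illustrative rule only).
    return re.sub(r"\balbertsons\b", "<place>", text, flags=re.IGNORECASE)

def build_turns(transcript):
    """Convert alternating (speaker, utterance) pairs into (category, response) turns."""
    turns = []
    for (speaker, utt), (next_speaker, next_utt) in zip(transcript, transcript[1:]):
        if speaker == "Interviewer" and next_speaker == "Participant":
            turns.append((categorize(utt), delexicalise(next_utt)))
    return turns

transcript = [
    ("Interviewer", "so what did you do yesterday?"),
    ("Participant", "well i went to albertsons yesterday..."),
    ("Interviewer", "what do you see in this picture?"),
    ("Participant", "we got a picture gosh. it looks like my uncle lou..."),
]
print(build_turns(transcript))   # [('activity', 'well i went to <place> yesterday...'), ('picture', ...)]
```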
A snapshot of the questions is shown in Table 3.1.

Table 3.1: Examples of questions from conversations

Category     | Question
Activity     | Did you go outside lately? / So what did you do yesterday?
Social       | Did you run into any familiar faces lately? / Where did you have dinner?
Picture      | What do you see in this picture? / Where do you think this picture was taken?
Tech         | How are you with the computer? / Did you use your computer lately?
Unspecified  |

We created a total of 16 question categories, including: greetings, activity check, living situation, travel, entertainment, social, picture-related, tech, occupation, hobbies, family, pets, confirmation, clarification, goodbye and unspecified comments. For some of these comments, we delexicalised certain topic words such as "", "" in order to (1) control for domain expansion [23] and (2) reduce the model complexity of our user simulators. In the past, [23] and [22] have shown the effectiveness of delexicalisation in controlling for domain expansion in user simulators without sacrificing the contextual meaning of sentence queries. Additionally, we created an unspecified comments category, which included comments that deviated from general question prompts. These comments often result from interviewer follow-up on specific topics mentioned by the user. We consolidated these comments into a single category to distinguish the context-specific from general questions based on the corpus. However, we do demarcate the type of unspecified comment used by the interviewer; for example, a follow-up comment to an occupational story is tagged differently from a follow-up comment about a health concern. These comments serve to build rapport and improve the flow of conversation. In future studies we may look to generate user-specific grounding statements for these slots [24]. Implemented in this way, the corpus is tokenized into turn-based responses to questions for each user.

3.2.4 Unsupervised Learning for User Simulator

To effectively capture a contextual representation of user conversation style, we utilize vector embeddings of the user corpus at the sentence level [25, 26]. Given that we want to capture the flow of the conversation from one response to the next, we implement skip-thought embedding, which has shown effectiveness over large corpora by capturing contextual information of sentences given neighboring ones [25]. For encoding sentences, we use a model that was pretrained on the BookCorpus dataset, which contains turn-based conversations from various English novels [25]. For the decoder, we train skip-thought vectors to recover the original response of the user during the NLG portion of the pipeline. Since each user has individual response styles to questions, we train a personalized user simulator for each user. For each user, the conversation corpus is divided into question-response turns. In our dataset, for example, the number of turns per conversation ranged from 30 to 275. We used a multilayer perceptron (MLP) with 2 hidden layers of 512 output nodes each to train the user simulator. We also introduce regularization with an ℓ2-norm penalty to constrain model complexity. Because we utilize preset questions by the interviewer, we use one-hot encodings of questions, denoted q𝑖𝑡 ∈ R𝑑, as input for training. Given the original skip-thought vector v𝑖𝑡, the user simulator serves as a function which maps 𝑓 : q𝑖𝑡 ↦→ v𝑖𝑡. The output of the MLP is the skip-thought embedding representation of the utterance, denoted 𝑓(q𝑖𝑡; w𝑖) ∈ R𝑐. Here, 𝑑 denotes the size of our question dictionary, 𝑐 denotes the dimension of the skip-thought embeddings, w𝑖 parameterizes the MLP model for the given user, 𝑖 ∈ 𝑁 denotes the user index and 𝑡 ∈ 𝑇 denotes the turn number. The loss function of the MLP is given by the mean-squared error (MSE) between the MLP output and the original skip-thought vector v𝑖𝑡 ∈ R𝑐:

$$L(\mathbf{w}_i) = \frac{1}{2} \sum_{t=1}^{T} \left( f(\mathbf{q}_i^t; \mathbf{w}_i) - \mathbf{v}_i^t \right)^2 + \frac{\lambda}{2} \|\mathbf{w}_i\|_2^2, \qquad \forall\, i = 1, \dots, N.$$

In the case where questions are not preset, more state-of-the-art methods such as end-to-end recurrent neural network systems can be deployed to train the user simulator instead [13, 27]. To evaluate the performance of our user simulator, we computed the mean squared error between the outputs of the simulator and the original thought-vector representation of the user response for each turn.
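A minimal sketch of such a per-user simulator is given below, assuming precomputed skip-thought vectors as targets. The layer sizes follow the description above; the skip-thought dimension of 4800 is an assumption (consistent with the state dimension reported later), and the optimizer and epoch settings are illustrative choices rather than those of the original study.

```python
import torch
import torch.nn as nn

D_QUESTIONS, EMBED_DIM = 107, 4800   # question dictionary size d; skip-thought dimension c (assumed)

class UserSimulatorMLP(nn.Module):
    """Two hidden layers of 512 units mapping a one-hot question to a skip-thought vector."""
    def __init__(self, d=D_QUESTIONS, c=EMBED_DIM, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, c),
        )
    def forward(self, q_onehot):
        return self.net(q_onehot)

def train_user_simulator(questions, responses, lam=1e-4, lr=1e-3, epochs=50):
    """MSE between predicted and true skip-thought vectors; weight decay plays the role of the l2 penalty."""
    model = UserSimulatorMLP()
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=lam)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(questions), responses)
        loss.backward()
        opt.step()
    return model

# Toy usage: 40 (one-hot question, skip-thought response) turns for one participant.
q = torch.eye(D_QUESTIONS)[torch.randint(0, D_QUESTIONS, (40,))]
v = torch.randn(40, EMBED_DIM)
simulator = train_user_simulator(q, v)
```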
3.2.5 Reinforcement Learning Components

Again, let 𝑐 denote the size of the skip-thought embeddings and 𝑑 denote the size of the question dictionary. We formulate the dialogue and task manager portions of the dialogue system as a standard RL setting in which an agent interacts with an environment E over a finite number of steps. At time step 𝑡, the agent receives a state s𝑡 and samples an action (asks a question) a𝑡 based on its current policy 𝜋. The environment transitions to the next state s𝑡+1 and the agent receives a scalar reward 𝑟𝑡+1. In this setting, the RL-agent tries to learn an optimal policy 𝜋∗ over all possible states, including ones that are unseen by the agent during training. To do this, the agent has to learn an approximate action-value function, which maps state-action pairs to expected future rewards [28]. Formally, the action-value function is defined as follows:

$$Q^{\pi}(\mathbf{s}, a) = \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} \gamma^{t} r_t \,\middle|\, \mathbf{s}, a \right],$$

where 𝛾 ∈ [0, 1] is a discount factor and 𝑇 is the maximum number of turns.

3.2.5.1 Environment E

The environment in this case consists of the dialogue manager (DM), the user simulator and the MCI classifier. The DM is composed of the reward and state generating functions. In previous works, a task manager, composed of a database and a query manager [15, 13], is used by the DM to generate observations in retrieval tasks. In our case, however, the user simulator and MCI classifier together play the role of the task manager and are used by the DM to generate observations. Here, the DM uses the MCI classifier to (1) predict probabilities for both the MCI and the NL classes based on the current moving average of skip-thought vectors at each turn, and (2) predict the label of the current user at the end of the episode for reward calculation. The result of (1) is also used by the agent as part of its internal state representation. The result of (2) is used by the DM for credit assignment for the generated conversational episode. The MCI classifier is trained separately on the training-set corpus before the dialogue system phase.

3.2.5.2 Action a𝑡 ∈ R𝑑

The RL-agent chooses its actions from a set of discrete actions consisting of 107 predefined questions, where each question is represented by a one-hot vector in R𝑑. It is worth noting that we use a𝑡 and q𝑡 to differentiate the actions taken by our RL-agent from the questions asked during the actual interviews, respectively.

3.2.5.3 State s𝑡 ∈ R𝐶

The state representation by the RL-agent is used to approximate the action-value function. There are five main components of the state representation vector:

• Skip-thought vector of the utterance at the current turn: 𝑓(𝑎𝑡−1; w𝑖), which is the output vector from user simulator 𝑓 given action 𝑎𝑡−1 at turn 𝑡.
• Moving average of the skip-thought vectors across all utterances in the current episode: $\bar{f}_t = \frac{1}{t}\sum_{k=1}^{t-1} f(a_k; \mathbf{w}_i)$.

• First hidden layer weights of the user simulator: w𝑖[:, 1].

• Predicted probabilities of the current user for the MCI and NL classes, output by the classifier.

• Number of turns above the confidence threshold: 𝜏.

The total dimension of the state vector is 𝐶 = 2𝑐 + |w𝑖[:, 1]| + 3 = 10115. At each turn, the DM queries the MCI classifier to output a probability vector composed of 𝑃(𝑦𝑖 = 0 | 𝑓̄𝑡) and 𝑃(𝑦𝑖 = 1 | 𝑓̄𝑡), where 𝑦 = 0 denotes NL and 𝑦 = 1 denotes MCI. This 2-dimensional vector keeps track of the classifier's confidence level for MCI prediction based on the current moving average of skip-thought vectors generated from turns 1, 2, ..., 𝑡. Keeping track of classifier confidence incentivizes the RL-agent to terminate the conversation as soon as it reaches a threshold level of confidence for the prediction task.

3.2.5.4 Reward 𝑟 ∈ R

Since we want to minimize the number of dialogue turns, we designed the environment to output a negative reward (-10) at every time step unless it reaches a terminal state (e.g., when the agent says "goodbye"). At the terminal state, the reward depends on the classification using the averaged skip-thought vector collected from this episode. If the existing classifier is able to make the correct prediction, the agent receives a positive reward (1000); otherwise it receives a moderately negative reward (-500). We also set the maximum episode length to 𝑇 = 35. Additionally, we added a linearly increasing penalty for each passing turn in which the classifier predicts with ≥ 0.65 probability for either class (MCI/NL). We denote the number of such turns above the confidence threshold by 𝜏. Formally, the reward function is defined as:

$$r = \begin{cases} -10 - 10\tau, & \text{for a non-terminal state,} \\ -500, & \text{terminal state with misclassification,} \\ +1000, & \text{terminal state with correct prediction.} \end{cases} \qquad (3.1)$$

3.2.5.5 State transitions

The state transition function has two parts:

• Within User. The state transition rule between turns is characterized by:

$$P^{\pi}_{s,s'} = \sum_{a \in A} P(s_{t+1} = s' \mid s_t = s, a_t = a, \pi) = \sum_{a \in A} \pi(a \mid s)\, P^{a}_{s,s'}.$$

Given a policy 𝜋, the probability of the environment transitioning to state 𝑠′ at 𝑠𝑡+1 depends only on the current state 𝑠𝑡. Internally, the DM utilizes the user simulator to generate the skip-thought vector 𝑓(𝑎𝑡; w𝑖) from 𝑎𝑡.

• Between Users. In addition to state transitions within episodes, the state-generating function changes between users, leading to different transition probabilities between similar states among different users. To capture this, we apply two changes when training the RL-agent on multiple users: (1) the first hidden layer weights w𝑖[:, 1] of each user are incorporated in the state representation vector so that the RL-agent can distinguish between dissimilar users. When used this way, the user simulator provides a means for the RL-agent to learn similar policies for similar users and dissimilar policies for dissimilar users. (2) During training, both the user simulator and the classifier of the training environment are reset between users by re-initializing the user simulator weights w𝑖 to correspond to the new user.
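The per-turn state update and the reward of Eqn. (3.1) can be summarized in a short sketch. The constants follow the text above; the function and variable names are illustrative, and the running mean here simply averages all responses collected so far.

```python
import numpy as np

MAX_TURNS, CONF_THRESHOLD = 35, 0.65

def step_reward(is_terminal, correct_prediction, tau):
    """Reward of Eqn. (3.1): per-turn penalty grows with tau; terminal reward depends on the classifier."""
    if not is_terminal:
        return -10 - 10 * tau
    return 1000 if correct_prediction else -500

def update_state(embeddings, new_embedding, class_probs, tau, sim_first_layer):
    """Assemble the state vector: current embedding, running mean, simulator weights, class probs, tau."""
    embeddings.append(new_embedding)
    running_mean = np.mean(embeddings, axis=0)
    tau += int(class_probs.max() >= CONF_THRESHOLD)   # count turns above the confidence threshold
    state = np.concatenate([new_embedding, running_mean, sim_first_layer, class_probs, [tau]])
    return state, tau

# Toy usage with assumed dimensions (c = 4800 skip-thought, 512 first-layer summary):
c, probs, tau = 4800, np.array([0.4, 0.6]), 0
state, tau = update_state([], np.zeros(c), probs, tau, np.zeros(512))
print(state.shape)   # (2*4800 + 512 + 2 + 1,) = (10115,), matching C above
```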
3.2.5.6 Deep Q-Networks (DQN)
In this work, the action-value function needs to estimate the expected reward based on the high-dimensional state representations described in the previous section. In order to approximate the action value given different users and the complicated internal state changes during the conversation, we learn a deep 𝑄-network parameterized by 𝜃𝑣 to tackle this challenging problem. The learning procedure can be conducted by optimizing the following loss function:

L(\theta_v) = \mathbb{E}_{\mathbf{s}_t, \mathbf{a}_t, r_t, \mathbf{s}'_{t+1}} \left[ \left( y_t - Q(\mathbf{s}_t, \mathbf{a}_t; \theta_v) \right)^2 \right], \quad (3.2)

with

y_t = r_t + \gamma \max_{\mathbf{a}'_{t+1}} Q(\mathbf{s}'_{t+1}, \mathbf{a}'_{t+1}; \theta'_v), \quad (3.3)

where 𝜃′𝑣 denotes the parameters of the target 𝑄-network. In order to learn the estimator under complex situations, two key ingredients were proposed in [29]: experience replay and a fixed target 𝑄-network. During training, the 𝑄-network (𝜃𝑣) is updated in an online fashion by performing gradient descent on Eqn. (3.2), while the target 𝑄-network (𝜃′𝑣) is held fixed to compute the target values as in Eqn. (3.3) and is only updated after a certain number of iterations, which is essential to the convergence of the 𝑄-network in this work. We also observe that when experience replay samples minibatches from previous experiences to update the 𝑄-network, training stabilizes more consistently.

3.2.5.7 Policy-masking
One challenge in our problem is creating an environment that can train the agent to produce responses which best align with the flow of conversations. For example, an agent may learn that the question “can you elaborate on that?” is useful for generating a wide distribution of words from the user, but it would not make sense to include that in the first sentence of a conversation or before relevant topics are introduced. To achieve this, we created a policy-modifying function in which confirmation and clarification type questions are masked from the policy set 𝜋 at turn 𝑡 if the action history of the agent from turns 1, 2, ..., 𝑡 − 1 does not include any questions from the social, activity, tech, picture-related, hobbies, occupation, travel, entertainment and family categories. At each turn, we keep track of an action history vector 𝜋𝑡 ∈ R𝑑 and construct a policy-masking vector 𝜑𝑡 ∈ R𝑑 to be applied element-wise over the agent's Q-value output. Specifically:

\varphi_t^{j} = \begin{cases} 0, & \text{if action } j \text{ is masked}, \\ 1, & \text{otherwise}, \end{cases} \quad (3.4)

Q'(\mathbf{s}_t) = \varphi_t \odot Q(\mathbf{s}_t),

where 𝜑𝑡^𝑗 denotes the 𝑗-th element of the policy-masking vector 𝜑𝑡, and 𝑄(s𝑡) ∈ R𝑑 represents the action values of all 107 available actions given the current state s𝑡. 𝑄′(s𝑡) is then the vector of valid action values after policy masking. To achieve effective masking, we ensure that the elements of 𝑄(s𝑡) are positive by using ReLU [30] as the activation function for the output layer of the Q-network and by a pre-training step on the Q-network as described in the following section.

3.2.6 Training the RL-Agent
We outline below the training procedure for our RL-agent. To expedite the learning process, we first train the RL-agent over the original corpus from the training set. For each user, we perform an initial pass through the entire corpus using the existing action history q_i^1, q_i^2, ..., q_i^T to generate episodes s_1, a_1, r_1, ..., a_t, r_t. We use these corpus-generated episodes to train the Q-estimator network. This initialization procedure is motivated by previous studies which have cited the effectiveness of pre-training with successful episodes so that the RL-agent can discover large terminal reward signals in games with delayed rewards [31].

Algorithm 1 RL-Training Protocol
Initialize replay memory D
Initialize Task Manager with classifier
Pre-train action-value function Q
for i = 1, ..., N do
    Initialize Environment E with User Simulator f_i
    Initialize E with true label for user i
    for episode = 1, ..., M do
        Reset E
        Get the initial state s_1
        for t = 1, ..., T do
            Obtain policy mask 𝜑_t as in Eqn. (3.4)
            With probability 𝜖 select a random action a_t, otherwise select a_t = argmax_a 𝜑_t ⊙ Q(s_t, a; 𝜃_v)
            Execute action a_t in E; observe reward r_t and state s_{t+1}
            Store transition (s_t, a_t, r_t, s_{t+1}) in D
            Sample a random minibatch of (s_j, a_j, r_j, s_{j+1}) from D
            if s_{j+1} is terminal then
                y_j = r_j
            else
                y_j = r_j + 𝛾 max_{a'} Q(s_{j+1}, a'; 𝜃'_v)
            end if
            Perform a gradient descent step on (y_j − Q(s_j, a_j; 𝜃_v))^2
        end for
    end for
end for
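For illustration, below is a minimal PyTorch sketch of the masked epsilon-greedy action selection and the DQN update used in Algorithm 1. The layer sizes, hyperparameters, and the random tensors standing in for replay-memory samples are placeholders only; the ReLU output layer follows the policy-masking requirement described in Section 3.2.5.7.

import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, EPS = 64, 107, 0.99, 0.1   # toy sizes; the thesis uses C = 10115

class QNetwork(nn.Module):
    """MLP Q-estimator; ReLU on the output layer keeps Q-values non-negative,
    so a masked (zeroed) action cannot be confused with a valid action value."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions), nn.ReLU(),
        )
    def forward(self, s):
        return self.net(s)

def select_action(q_net, state, mask, eps=EPS):
    """Masked epsilon-greedy action selection (Algorithm 1)."""
    valid = mask.nonzero(as_tuple=True)[0]
    if random.random() < eps:
        return valid[torch.randint(len(valid), (1,))].item()
    with torch.no_grad():
        q = q_net(state.unsqueeze(0)).squeeze(0) * mask    # Q'(s_t) = phi_t ⊙ Q(s_t)
    return int(q.argmax())

def dqn_update(q_net, target_net, optimizer, batch):
    """One gradient step on (y_j - Q(s_j, a_j; theta_v))^2 with a fixed target network."""
    s, a, r, s_next, terminal = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - terminal)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with random tensors standing in for the environment and replay memory
q_net, target_net = QNetwork(STATE_DIM, N_ACTIONS), QNetwork(STATE_DIM, N_ACTIONS)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
a0 = select_action(q_net, torch.randn(STATE_DIM), torch.ones(N_ACTIONS))
batch = (torch.randn(32, STATE_DIM), torch.randint(N_ACTIONS, (32,)),
         torch.randn(32), torch.randn(32, STATE_DIM), torch.zeros(32))
dqn_update(q_net, target_net, optimizer, batch)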
During training, we stabilize the target Q-network 𝜃′𝑣 for minibatch generation and transfer weights from the learning Q-network 𝜃𝑣 every 50 conversational episodes. During testing, we use the RL-agent to generate new actions a_i^1, a_i^2, ..., a_i^t for each test set user. New episodes are then generated by each user simulator from each new action set 𝜋𝑖 for prediction. These simulated episodes often differ from the original corpus in both the questions asked by the agent as well as in the skip-thought responses by the user.

3.3 Experiments
Evaluation of dialogue systems differs widely depending on the task. Previous works typically use metrics such as perplexity and averaged reward per response to measure the quality of the natural language generation (NLG) phase of the dialogue system [15, 13, 18]. However, because the utility of our framework comes from the quality of questions that the chat-bot generates for the off-conversational task, we propose an evaluation framework which emphasizes the agent's off-conversation performance. We gauge the utility of the dialogue system by its ability to (1) improve prediction accuracy against baseline techniques and (2) reduce the number of turns needed to make an accurate prediction.

3.3.1 Data
Data used for this study was obtained from a randomized controlled behavioral clinical trial to ascertain the effect of unstructured conversation on cognitive functions. Details of the study protocol were explained in [32]. In this clinical study, conversational data was collected in Q&A format for each participant during web-cam interviews with trained interviewers. Each participant was interviewed multiple times over the course of 4-6 weeks, and dialogue responses were transcribed for each interview session [12]. On average, there are 2.81 conversational episodes per participant, and each conversation lasted between 30 and 45 minutes [12, 32]. MCI labels were generated using clinical assessment of each participant's cognitive status by medical professionals [12, 32].

3.3.2 Baselines vs. RL Performance
We first compare the performance of several baseline classifiers for the MCI prediction task. For our specific dataset, [12] had previously achieved a benchmark performance of 72.5% AUC on 5-fold validation using a linear SVM with ℓ1-norm penalty and feature engineering with the Linguistic Inquiry and Word Count (LIWC) dictionary [12]. LIWC embeds each word into a 69-dimensional word vector space with each dimension representing a latent feature of the English language [12]. Since 2013, various contextual representations of words and sentences have been proposed, many of which have outperformed classical rule-based contextual embedding techniques [26, 25]. Distributed representations such as Word2Vec allow for more flexible and corpus-dependent latent features to be created for individual words [26].
More recently, Skip-thought vectors [25] have risen to popularity due to their ability to embed entire sentences into "thought vectors" that capture contextual meaning and syntactic information from neighboring sentences. For this reason, we compare various word and phrase embedding techniques to establish new baseline performances for our classification task.

Table 3.2: Performance of baseline vs. RL on MCI prediction on 10 stratified shuffle splits
Model  Feature  AUC  Sensitivity  Specificity  F1-Score
LR + ℓ1  RD  0.529 ± 0.132  0.380 ± 0.260  0.678 ± 0.105  0.361 ± 0.207
RFC  RD  0.519 ± 0.057  0.080 ± 0.098  0.944 ± 0.075  0.120 ± 0.149
SVM + ℓ1  RD  0.551 ± 0.131  0.380 ± 0.227  0.722 ± 0.102  0.384 ± 0.214
SVM + ℓ2  RD  0.560 ± 0.050  0.320 ± 0.256  0.800 ± 0.185  0.322 ± 0.193
MLP  RD  0.640 ± 0.193  0.110 ± 0.243  0.860 ± 0.189  0.162 ± 0.146
LR + ℓ1  W2V  0.638 ± 0.091  0.520 ± 0.204  0.756 ± 0.147  0.517 ± 0.127
RFC  W2V  0.564 ± 0.110  0.340 ± 0.220  0.789 ± 0.144  0.374 ± 0.189
SVM + ℓ1  W2V  0.651 ± 0.103  0.560 ± 0.233  0.756 ± 0.130  0.541 ± 0.147
SVM + ℓ2  W2V  0.598 ± 0.116  0.440 ± 0.233  0.756 ± 0.171  0.449 ± 0.205
MLP  W2V  0.680 ± 0.151  0.500 ± 0.500  0.511 ± 0.490  0.266 ± 0.266
LR + ℓ1  LIWC  0.703 ± 0.099  0.540 ± 0.237  0.867 ± 0.130  0.584 ± 0.152
RFC  LIWC  0.641 ± 0.135  0.360 ± 0.250  0.922 ± 0.087  0.445 ± 0.273
SVM + ℓ1  LIWC  0.661 ± 0.125  0.600 ± 0.200  0.722 ± 0.200  0.572 ± 0.144
SVM + ℓ2  LIWC  0.712 ± 0.110  0.680 ± 0.204  0.744 ± 0.180  0.631 ± 0.135
MLP  LIWC  0.689 ± 0.129  0.300 ± 0.458  0.767 ± 0.396  0.182 ± 0.285
LR + ℓ1  SKP  0.790 ± 0.112  0.680 ± 0.256  0.900 ± 0.116  0.707 ± 0.183
RFC  SKP  0.608 ± 0.104  0.260 ± 0.220  0.956 ± 0.054  0.343 ± 0.259
SVM + ℓ1  SKP  0.783 ± 0.123  0.700 ± 0.241  0.867 ± 0.171  0.711 ± 0.190
SVM + ℓ2  SKP  0.797 ± 0.122  0.660 ± 0.269  0.933 ± 0.102  0.716 ± 0.189
MLP  SKP  0.638 ± 0.138  0.600 ± 0.490  0.400 ± 0.490  0.316 ± 0.256
RL(T=1)  SKP  0.607 ± 0.109  0.380 ± 0.166  0.833 ± 0.134  0.447 ± 0.172
RL(T=3)  SKP  0.706 ± 0.092  0.500 ± 0.205  0.911 ± 0.097  0.583 ± 0.154
RL(T=5)  SKP  0.707 ± 0.072  0.480 ± 0.133  0.933 ± 0.102  0.594 ± 0.129
RL(T=10)  SKP  0.772 ± 0.115  0.600 ± 0.237  0.944 ± 0.102  0.683 ± 0.186
RL(T=15)  SKP  0.798 ± 0.115  0.640 ± 0.265  0.956 ± 0.102  0.714 ± 0.190
RL(T=20)  SKP  0.798 ± 0.121  0.640 ± 0.250  0.956 ± 0.102  0.719 ± 0.190
RL(T=25)  SKP  0.808 ± 0.111  0.660 ± 0.254  0.956 ± 0.102  0.732 ± 0.184
RL(T=30)  SKP  0.808 ± 0.119  0.660 ± 0.269  0.956 ± 0.102  0.730 ± 0.190
RL(T=35)  SKP  0.818 ± 0.102  0.680 ± 0.204  0.956 ± 0.102  0.761 ± 0.140
Here, LR denotes the sparse logistic regression classifier, RFC the random forest classifier, SVM support vector machines, and MLP the multi-layer perceptron. For feature representation of the corpus, RD represents the raw distribution of word counts. W2V denotes averaged 300-dimension Word2Vec embeddings across all words appearing in the corpus for each user [26]. LIWC denotes the original rule-based embedding used by [12]. SKP denotes averaged 4800-dimension Skip-Thought vectors across all turn-based responses for each user [25].

The first four sections of Table 3.2 show the performance of these baseline classifiers. Using the original LIWC representation, we were able to recover close to the 72.5% AUC baseline from the original paper using the SVM and LR classifiers. When implementing the skip-thought embedding, we used pre-trained skip-thought encoders by [25] to embed each user response across all conversational turns. The encoder was pre-trained on the BookCorpus dataset, which is a large collection of novels pertaining to numerous literary genres.
The advantage of pre-training on this dataset is that BookCorpus contains an abundant number of turn-based dialogues between various character types. These conversations capture a wide range of conversational response styles, idiosyncrasies and temperaments. As seen in Table 3.2, the best performing baseline model was the SVM classifier with ℓ2 norm, using Skip-Thought embeddings as features. For this reason, we choose this classifier for the RL portion of our pipeline. As a baseline reference, we also included performance using raw word count distributions for all models.

We then evaluate the performance of our RL-agent across 10 stratified shuffle splits. Each split uses 65% of the data for training and 35% for testing. We compare the performance of the RL-agent when manually restricting the number of questions to 1, 3, 5, 7, 10, 15, 20, 25, 30 and 35. By restricting the number of turns, we can observe the number of questions needed to recover the original baseline performance using the SVM classifier.

Figure 3.3: RL-Agent vs. Baseline w/ Variation on Turns.

The last section of Table 3.2 illustrates the performance of the RL-agent under various turn constraints. Here, the notation RL(T=𝑡) denotes the number of questions the agent is allowed to ask before a prediction is made from the simulated user responses. It is important to note that turn 0 was set to greetings by default and was not counted toward the conversation. We see from the constraint conditions that the performance of our RL-agent began to surpass baseline performance at 25 questions and was able to achieve comparable performance using only 15 questions. At the full conversation length of 35 turns, we were able to achieve 0.818 AUC, an improvement upon current and previous baselines. In comparison, the mean number of conversational turns per user in the original corpus was 105.71. Additionally, since 2.81 conversations were conducted per user, we adjusted the number of turns allowed based on the mean number of turns per conversation, which was 37.36 per user. For this reason, we set the upper bound constraint to 35 questions, which is just slightly less than a full conversation with the user. Figure 3.3 visualizes this relationship between performance and the number of questions asked by the RL-agent. We see that performance improvements with additional questions saturate after 15 questions. This was expected, as the highest-yield questions discovered by the RL-agent were asked first during test conversations.

3.3.3 Evaluation of User Simulators
User simulators serve the pivotal role of simulating the user response in the RL training environment [18, 33]. In previous works, user simulators are evaluated based on the accuracy of the generated user query to unseen bot responses [33, 18]. Metrics such as BLEU and perplexity are used at the NLG phase of dialogue, as the generation of the user query is pivotal in retrieval-type training systems. In our case, however, the goal of the user simulator is quite different; the RL-agent is responsible for generating queries, while the output from the user simulator is an encoded thought-vector of the user response, which is then used for state representation and downstream prediction purposes. For this reason, we evaluate the performance of the user simulator not on the decoding portion of the dialogue system, but rather on its ability to generate an accurate thought-vector version of the responses.
We compute the mean-squared error (MSE) between the corpus Skip-Thought vector and the user simulator's prediction at each turn. The resulting MSE scores are averaged across all turns of the conversation. Given that each user has on average 2.81 conversations, we evaluate the performance of the user simulator in a leave-one-out fashion: for each user, the simulator is trained on all conversations except for the last one, which is used for evaluation.

Figure 3.4: Distribution of mean squared error (MSE) across all user simulators.

Figure 3.4 visualizes the performance of the user simulators. The mean MSE is 0.00495 ± 2.93E-06, averaged across all test set performances.

3.3.4 Top-Performing Policies
It is interesting to note that the simulated episodes by our RL-agent were able to provide a performance boost for the prediction task. In this section, we look qualitatively at the types of questions asked at 5, 10, 15, 20 and 35 turns by the RL-agent in comparison with the original corpus. We also compare the performance of 𝜋∗@5, @10, @20, @30 and @35 with the performance using the first 5, 10, 20, 30 and 35 responses of the original corpus. Again, we note that responses to greeting and parting queries such as “Hi” and “goodbye” are not counted toward prediction. As shown in Table 3.3, the optimal policy 𝜋∗ learned by our framework outperformed the original corpus for each turn constraint. For example, when our RL-agent asked only 5 questions to test set users, the classifier was able to achieve 0.707 AUC and 0.594 F1 using the simulated responses. In contrast, using the first 5 questions from the original corpus for each test set user produced 0.504 AUC and 0.175 F1.

Table 3.3: Prediction @5, 10, 20, 30 and 35 Turns
Model  AUC  Sen  Spec  F1-Score
Corpus@5  0.504 ± 0.070  0.120 ± 0.098  0.889 ± 0.099  0.175 ± 0.145
Corpus@10  0.513 ± 0.076  0.160 ± 0.174  0.867 ± 0.130  0.193 ± 0.200
Corpus@20  0.614 ± 0.077  0.340 ± 0.254  0.889 ± 0.131  0.382 ± 0.223
Corpus@30  0.658 ± 0.121  0.360 ± 0.233  0.956 ± 0.056  0.460 ± 0.266
Corpus@35  0.699 ± 0.125  0.420 ± 0.244  0.978 ± 0.044  0.539 ± 0.248
𝜋∗@5  0.707 ± 0.072  0.480 ± 0.133  0.933 ± 0.102  0.594 ± 0.129
𝜋∗@10  0.772 ± 0.115  0.600 ± 0.237  0.944 ± 0.102  0.683 ± 0.186
𝜋∗@20  0.798 ± 0.121  0.640 ± 0.250  0.956 ± 0.102  0.719 ± 0.190
𝜋∗@30  0.808 ± 0.119  0.660 ± 0.269  0.956 ± 0.102  0.730 ± 0.190
𝜋∗@35  0.818 ± 0.102  0.680 ± 0.204  0.956 ± 0.102  0.761 ± 0.140

When using the first full-length conversation with 35 turns, the original corpus recovers an AUC score of 0.699, which is far from the performance of 𝜋∗@35. In Table 3.4, we rank the most frequently appearing questions in 𝜋∗@5, 𝜋∗@10 and 𝜋∗@20.

Table 3.4: Most frequently asked questions in 𝜋∗@5, 10, 15 and 20 Turns
Turns  Question  Count
1-5  when did you start working?  40
1-5  so how long did you go out for?  37
1-5  when did you meet your SO?  28
1-5    24
1-5  what did you like about ?  24
6-10  what was  like for you?  30
6-10    28
6-10  when did  start?  22
6-10  what do you see in this picture?  19
6-10    19
10-15  what is your opinion on ?  42
10-15  did you see any shows lately?  38
10-15  how many people do you think can fit in this?  33
10-15  what you were doing during this time period?  30
10-15  what type of  do you do?  28
15-20    27
15-20  where did you meet your so?  25
15-20  did you enjoy school?  24
15-20  anyone visit you lately?  24
15-20  what was the show about?  20

𝜋∗@5. The most effective question in 𝜋∗@5 appears to be “when did you start working”. In the context of our problem, this question seems to generate the most polarizing responses from the cohort.
We also see that the RL-agent included a few elaboration questions such as “what did you like about ” and “why did you do that,” for some users to expand upon previous responses. From the clinical perspective, it is also interesting to note that the RL-agent picks questions such as “what did you do yesterday” and “how long did you go out for,” which are similar to questions used clinically to assess immediate recall in MCI patients [34].

𝜋∗@10. As seen in 𝜋∗@5, occupational questions were the most popular topic asked by the RL-agent. This is also the case with 𝜋∗@10, where the RL-agent follows up the previous query with an elaboration question regarding past occupational experiences. It is interesting to note that the RL-agent transitions to picture-related questions, which are often used by the clinical interviewers to facilitate creative responses by participants [12]. We also observe the RL-agent asking questions such as “” and “when did  start”. These were frequently asked questions during the course of the original dialogue, as technical difficulties were often encountered with connection and webcam issues during the interviews [12]. Unfortunately, the responses vary greatly and may at times generate verbose responses from participants. The RL-agent did not seem to be able to recognize this caveat during training.

𝜋∗@20. As we approach questions 11 through 20, we arrive at mid- to late-dialogue for most conversations. Overall, we observe more widespread topics during this portion of the conversation. The most polarizing question asked at this stage was “what is your opinion on ?” Here, we used delexicalised slots [22] to reduce model complexity, but the slots may be substituted with a wide range of social topics from political trends to recent news. Additionally, we observe that the RL-agent learns to say “goodbye” to terminate the conversation early in numerous cases. As mentioned previously, we designed the state function to include the predicted probability [0.0-1.0] of MCI by the classifier at each time step. The environment penalizes the agent for additional turns in which the prediction probability exceeds 0.65 for either class. By opting to terminate the episode, the RL-agent learns to avoid dragging on the dialogue unnecessarily in cases where it is confident in the prediction. One notable question in 𝜋∗@20 is “how many people do you think can fit in this?” This is actually a picture-specific question related to one of the more provocative pictures. In fact, we confirmed from the original corpus that it generated more follow-up responses from users when compared to other picture-related questions such as “when do you think this picture was taken?” and “interesting, what makes you say that?”. By ranking this question highly, the RL-agent indirectly prioritizes this picture over others in generating user responses. This exemplifies how the ranking of questions by 𝜋∗ may be used to direct the future data collection process.

𝜋∗@35. When approaching the end of conversations, we notice that the questions asked by the agent were more spread out among the remaining choices. For this reason, we rank only the top 10 questions during the final 15 turns of simulated conversations.

Table 3.5: Top 10 ranked questions in the final 15 turns of conversations
Rank  Question  Count
1  what is your opinion on using ?  112
2  did you do anything else?  106
3  so how long did you go out for?  98
4  what you were doing during this time period?  95
5  when do you think this picture was taken?  95
6    94
7  anything new with you lately?  91
8  what did you like about it?  85
9    76
10  how often do you ?  72

In this latter portion of 𝜋∗, we note that the RL-agent utilized more elaboration questions such as “what do you like about it” and “how often do you ”. We also see that technology-related questions such as “what is your opinion on using ” are included more often when compared to topics such as occupation or social items. This indicates that tech-related questions may not be as high-yield in distinguishing MCI responses, as these questions are prioritized later during the conversation by the RL-agent.

3.4 Discussion and Conclusion
In this chapter, we introduce an RL framework for approaching a classically supervised learning problem in clinical medicine, where the data is often noisy, scarce, and prohibitively expensive to obtain. We show that a properly trained RL framework can (1) greatly cut down on the amount of data needed to make accurate predictions, and (2) synthesize relevant new data to improve performance. To achieve this framework, we proposed a multi-step approach which capitalizes on the vast existing knowledge of the human language and NLP research. First, we used a state-of-the-art distributed representation to preprocess our data. We then set up a simulation environment for reinforcement learning, using supervised learning to create customized user simulators. Lastly, we utilize the trained RL-agent to generate new questions from 𝜋∗ to obtain more targeted responses for our prediction task. A careful examination of the optimal policies discovered by our agent demonstrates that the overall framework is self-contained for directing dialogue generation for diagnostic screening, which can potentially replace the need for trained interviewers. Our trained RL-agent is able to discover relevant questions to ask users with whom it has no prior interaction experience. We also show various clinical insights which could be deduced from observing the ranking of questions in 𝜋∗ at various turn constraints. In order for this framework to be effectively deployed in a realistic setting, a user simulator that could be trained online and in real time should be considered. In its current form, our user simulators are trained offline, which may not be scalable to larger corpora and user volumes. Additionally, a natural language generation phase may be needed to make the questions more adaptable to the natural flow of human conversation. These will be areas of research we will explore in future studies.

BIBLIOGRAPHY

[1] J Olazaran, Rubén Muñiz, B Reisberg, J Peña-Casanova, T Del Ser, AJ Cruz-Jentoft, P Serrano, E Navarro, ML García de la Rocha, A Frank, et al. Benefits of cognitive-motor intervention in mci and mild to moderate alzheimer disease. Neurology, 63(12):2348–2353, 2004.
[2] Jeffrey L Cummings, Rachelle Doody, and Christopher Clark. Disease-modifying therapies for alzheimer disease: challenges to early intervention. Neurology, 69(16):1622–1634, 2007.
[3] Serge Gauthier, Barry Reisberg, Michael Zaudig, Ronald C Petersen, Karen Ritchie, Karl Broich, Sylvie Belleville, Henry Brodaty, David Bennett, Howard Chertkow, et al. Mild cognitive impairment. The Lancet, 367(9518):1262–1270, 2006.
[4] Keith A Johnson, Nick C Fox, Reisa A Sperling, and William E Klunk. Brain imaging in alzheimer disease. Cold Spring Harbor Perspectives in Medicine, 2(4):a006213, 2012.
[5] D Heister, James B Brewer, Sebastian Magda, Kaj Blennow, Linda K McEvoy, Alzheimer's Disease Neuroimaging Initiative, et al. Predicting mci outcome with clinically available mri and csf biomarkers. Neurology, 77(17):1619–1628, 2011.
[6] Clifford R Jack Jr, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. Hypothetical model of dynamic biomarkers of the alzheimer's pathological cascade. The Lancet Neurology, 9(1):119–128, 2010.
[7] Liang Zhan, Yashu Liu, Yalin Wang, Jiayu Zhou, Neda Jahanshad, Jieping Ye, and Paul Matthew Thompson. Boosting brain connectome classification accuracy in alzheimer's disease using higher-order singular value decomposition. Frontiers in Neuroscience, 9:257, 2015.
[8] Qi Wang, Liang Zhan, Paul M Thompson, Hiroko H Dodge, and Jiayu Zhou. Discriminative fusion of multiple brain networks for early mild cognitive impairment detection. In ISBI, pages 568–572. IEEE, 2016.
[9] Carol Dillon, Cecilia M Serrano, Diego Castro, Patricio Perez Leguizamón, Silvina L Heisecke, and Fernando E Taragano. Behavioral symptoms related to cognitive impairment. Neuropsychiatric Disease and Treatment, 9:1443, 2013.
[10] Robert M Chapman, Mark Mapstone, John W McCrary, Margaret N Gardner, Anton Porsteinsson, Tiffany C Sandoval, Maria D Guillily, Elizabeth DeGrush, and Lindsey A Reilly. Predicting conversion from mild cognitive impairment to alzheimer's disease using neuropsychological tests and multivariate methods. Journal of Clinical and Experimental Neuropsychology, 33(2):187–199, 2011.
[11] Hiroko H Dodge, Nora Mattek, Mattie Gregor, Molly Bowman, Adriana Seelye, Oscar Ybarra, Meysam Asgari, and Jeffrey A Kaye. Social markers of mild cognitive impairment: Proportion of word counts in free conversational speech. Current Alzheimer Research, 12(6):513–519, 2015.
[12] Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(2):219–228, 2017.
[13] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.
[14] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.
[15] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35, 2017.
[16] Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. Reinforced mnemonic reader for machine reading comprehension. arXiv preprint arXiv:1705.02798, 2017.
[17] Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073, 2017.
[18] Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review, 21(2):97–126, 2006.
[19] Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker.
Optimizing dialogue management with reinforcement learning: Experiments with the njfun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.
[20] Xuijun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008, 2017.
[21] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683, 2016.
[22] Bing Liu and Ian Lane. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. arXiv preprint arXiv:1708.05956, 2017.
[23] Matthew Henderson, Blaise Thomson, and Steve Young. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In SLT, pages 360–365. IEEE, 2014.
[24] Joyce Y Chai, Rui Fang, Changsong Liu, and Lanbo She. Collaborative language grounding toward situated human-robot dialogue. AI Magazine, 37(4), 2016.
[25] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[27] Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688, 2016.
[28] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[31] Charles W Anderson, Minwoo Lee, and Daniel L Elliott. Faster reinforcement learning after pretraining deep networks to predict state dynamics. In IJCNN, pages 1–7. IEEE, 2015.
[32] Hiroko H Dodge, Jian Zhu, Nora C Mattek, Molly Bowman, Oscar Ybarra, Katherine V Wild, David A Loewenstein, and Jeffrey A Kaye. Web-enabled conversational interactions as a method to improve cognitive functions: Results of a 6-week randomized controlled trial. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 1(1):1–12, 2015.
[33] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016.
[34] Marshal F Folstein, Susan E Folstein, and Paul R McHugh. “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–198, 1975.

CHAPTER 4
SAMPLE EFFICIENT LEARNING WITH PRIVILEGED INFORMATION

4.1 Introduction
In classical supervised learning, the learner is presented with the training tuple \{(x_i, y_i)\}_{i=1}^{m} and performs an optimization task of finding the best model in a hypothesis space ℎ : 𝑋 → 𝑌 to approximate some true 𝑓 : 𝑋 → 𝑌 which explains the data. Given a new task, knowledge transfer [1] is often applied to accelerate the learning process by distilling and transferring relevant knowledge from previous tasks to the unseen one.
Under classical formulations, the learner incorporates prior information in one of several ways:
• Direct transfer of parameters from old hypothesis models to the new task and fine-tuning [1] the parameters.
• Learning multiple tasks (online or batched) related to the current task [2, 3].
• Using prior knowledge (i.e. a knowledge graph) to constrain the hypothesis space by regularization [4].
• Using representations (i.e. embeddings) of 𝑋 and/or 𝑌 from previous tasks for new tasks [5, 6].
• Accelerating learning and compressing models by distillation, as typically seen in Teacher-Student models [7].

In each of these settings, knowledge transfer operates directly within the 𝑋, 𝑌 and H spaces to improve generalization of information from old models to the new task. Recently, Learning Using Privileged Information (LUPI) [8] has provided a new paradigm for knowledge transfer. Under LUPI, the learner now interacts with a Teacher who provides privileged information (PI), which is available exclusively at training time. From the learner's perspective, the training set is now extended to the tuple \{(x_i, x_i^*, y_i)\}_{i=1}^{m}, and the test set stays the same. Some examples of PI include: (1) future information that relates 𝑋 and 𝑌, for example using future stock prices beyond the prediction window during training; and (2) auxiliary information describing the label space that is available only to a subset of samples, for example physician notes that accompany diagnostic predictions, which are only available after the diagnosis is made. At a high level, PI provides some similarity information between training samples from the original feature space, and the Teacher hypothesis serves as additional “explanations” of the hypothesis space [9, 8]. As a result, [9] showed that the LUPI Teacher provides a principled way to improve the generalization error of Student learners using agnostic PAC models, providing some theoretical improvements in the number of samples required to generalize to test set data (i.e. it improves sampling efficiency). However, under current state-of-the-art LUPI formulations such as [10] and [11], PI is incorporated by means of support vectors and dropout schemes, both of which fail to explore the underlying similarity structure between examples in the PI space 𝑋∗. For example, the mode distribution and pairwise similarity between points in the 𝑋∗ space are largely unused. The PI contributes as auxiliary training features and kernel information, but much of the LUPI information is lost at inference time and beyond. A significant question remains: can privileged information be retained for future tasks? Ideally, we want the LUPI Teacher to incorporate PI in a way that is specific enough to inform similarity between training samples yet general enough to be retained across future tasks. As a motivating example, consider the medical setting, where electronic health records (EHR) are often sparse, noisy, and full of missing data. Complex tasks such as multi-task learning of many diseases are often difficult because of the long-tail property of diseases – that is, diseases with very few training samples (i.e. < 100) are very difficult to learn using EHR features alone. On the other hand, medical research on rare diseases is often plentiful – large volumes of clinical journals focus on text descriptions of rare diseases in the medical setting.
Clinical texts such as discharge notes are unavailable at inference time, but when used retrospectively during training they can serve as a source of PI that allows rare diseases to be learned from few examples. In this work, we propose a LUPI formulation that achieves precisely this. First, we introduce the idea of building a vocabulary of PI features by unsupervised learning using external data sources. We then propose a mechanism for learning a joint representation between the PI information and the original set of example features by exploiting their co-occurrence statistics in the training data. We finally learn a shared decision function using a contrastive loss to distinguish between samples drawn from the joint latent space based on their labels for each task. In experiments, we demonstrate the effectiveness of our method in retaining PI obtained from external data sources to support multi-task prediction tasks in the EHR setting against other transfer learning methods. We demonstrate that such an approach both improves prediction accuracy and decreases the number of samples required to train an accurate model, especially for rare diseases.

4.2 Methodology
At a high level, the main intuition behind our proposed method is to decompose the LUPI process into three parts:
1. Build a dictionary of PI features and learn a distributed representation [5] over the PI vocabulary.
2. Find a joint representation space (Φ) between the PI and example features.
3. Jointly learn the decision functions ℎ∗ : Φ → 𝑌 by feature-matching in the joint representation space.

The first process uses unsupervised learning to embed the PI vocabulary into a vector space. The second process allows for some of the privileged information to be retained at inference time, despite not having direct access to the PI vectors. The third process allows PI information for one task to be transferred to the other 𝐶 − 1 tasks in the label space. In the following subsections, we will examine how to achieve (1) – (3) in detail. We also provide analysis of how (3) can maintain the favorable LUPI sample efficiency.

4.2.1 Building the PI vocabulary
First, we can define 𝑔∗(𝑤𝑗; 𝜃𝑔∗) as an embedding function that maps 𝑔∗ : 𝑋∗ → Φ. Note that 𝑥𝑖∗ consists of individual words, {𝑤1, ..., 𝑤𝑘}. So 𝑔∗(𝑤𝑗; 𝜃𝑔∗) embeds each individual word in the PI vocabulary rather than the PI samples (i.e., 𝑥𝑖∗). The rationale behind 𝑔∗(.) is to encode each word in the PI vocabulary into a vector space so that vector operations can be applied to the PI. We specifically consider embedding functions of the form,

g^*(w_j; \theta_{g^*}) = w_j^{T} \theta_{g^*}. \quad \text{(PI Embedding)}

Since each 𝑤𝑗 ∈ {0, 1}𝑑 has 𝑤𝑗𝑘 = 1 only when 𝑗 = 𝑘, the 𝑤𝑗 vector simply selects the 𝑗-th column in 𝜃𝑔∗. We restrict 𝜃𝑔∗ ∈ R𝑑×𝑘 so that 𝜃𝑔∗ produces a lower-dimensional representation of each word in the PI vocabulary. For this first step, we do not restrict the PI to come from the original dataset {(𝑥𝑖, 𝑥𝑖∗, 𝑦𝑖)}. In fact, we can learn the embedding 𝜃𝑔∗ for our PI using any data source by applying the following word model:

Score(w_1, w_2) = \begin{cases} 1 & \text{if } w_1, w_2 \in x_i^* \\ 0 & \text{otherwise} \end{cases} \quad \text{(Co-occurrence)}

g(w_1, w_2) = \sigma\{(\theta_{g^*} w_1)^{T} (\theta_{g^*} w_2)\} \quad \text{(Word Model)}

\mathcal{L}_{emb} = BCE(Score(w_1, w_2), g(w_1, w_2)). \quad (2)

Here, 𝜎(.) denotes the Sigmoid activation function, and 𝐵𝐶𝐸(.) denotes the binary cross-entropy loss: -\sum_i [a_i \log b_i + (1 - a_i)\log(1 - b_i)]. The cross-term (\theta_{g^*} w_1)^{T}(\theta_{g^*} w_2) gives the similarity between 𝑤1 and 𝑤2 in the embedding space, which is then scored against Score(𝑤1, 𝑤2) based on whether 𝑤1, 𝑤2 both appear in 𝑥𝑖∗. We note that the embedding loss \mathcal{L}_{emb} is trained separately from the rest of the LUPI model since it is not specific to the dataset.
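As an illustration of the word model in Eqn. (2), here is a minimal PyTorch sketch; the vocabulary size, embedding width, and the randomly generated word pairs and co-occurrence labels are placeholders rather than the settings used in the experiments of Section 4.4.

import torch
import torch.nn as nn

d, k = 5000, 128          # toy PI vocabulary size and embedding width

class PIWordModel(nn.Module):
    """Scores a word pair by the inner product of their embeddings (Word Model)."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.theta = nn.Embedding(vocab_size, emb_dim)   # rows play the role of theta_g*

    def forward(self, w1_idx, w2_idx):
        e1, e2 = self.theta(w1_idx), self.theta(w2_idx)
        return torch.sigmoid((e1 * e2).sum(dim=-1))      # sigma{(theta w1)^T (theta w2)}

model = PIWordModel(d, k)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCELoss()

# toy batch: word-index pairs and their co-occurrence labels Score(w1, w2)
w1 = torch.randint(d, (256,))
w2 = torch.randint(d, (256,))
score = torch.randint(0, 2, (256,)).float()

loss = bce(model(w1, w2), score)                         # L_emb in Eqn. (2)
optimizer.zero_grad(); loss.backward(); optimizer.step()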
4.2.2 Learning the Joint Representation
Next, let us define 𝑔(𝑥𝑖; 𝜃𝑔) to be the embedding function that maps 𝑔 : 𝑋 → Φ. In this work, we consider the following for 𝑔(.) and 𝑔∗(.):
• Distributed representation of the PI vocabulary, which captures its underlying manifold structure and is obtained by unsupervised learning [5].
• Encoding of 𝑋 into a fixed-length vector by deep embedding methods such as [12].

For time-series 𝑋, we take the embedding function to be a recurrent encoder neural network:

g(x_i; \theta_g) = RNN(x_i; \theta_g) \quad \text{(EF Embedding)}

The motivation behind using the embedding functions 𝑔(𝑥) and 𝑔∗(𝑤) is to extend the idea of Student and Teacher kernels, which allow the privileged information to provide information about similarity between training samples in the feature space [9]. Using neural encoding for 𝑔(𝑥; 𝜃𝑔) allows such feature spaces to be represented by a fixed-length vector without losing the underlying spatio-temporal information. To find commonality between example features and PI, we introduce a matching function (𝜇) that maps each (𝑔(𝑥𝑖), 𝑔∗(𝑤𝑗)) pair onto the interval [0, 1], i.e., 𝜇 : Φ × Φ → [0, 1]^d:

\mu(g(x_i), g^*(w_j); A) = \frac{\exp(\max\{0, [g(x_i); g^*(w_j)]^{T} A\})}{\sum_{p=1}^{d} \exp(\max\{0, [g(x_i); g^*(w_p)]^{T} A\})}. \quad (3)

Here, [𝑔(𝑥𝑖); 𝑔∗(𝑤𝑗)] ∈ R^{2k} denotes the concatenation of the 𝑔(𝑥𝑖) and 𝑔∗(𝑤𝑗) embeddings in the joint latent space. The parameter matrix 𝐴 projects the pairs (𝑔(𝑥𝑖), 𝑔∗(𝑤𝑗)) onto R, and the softmax activation normalizes the pairwise scores against other word-pairs in the PI vocabulary. Thus, for each sample mapped from the feature space 𝑋, the matching function 𝜇 produces a set of corresponding weights over all of the words in the PI vocabulary. We make the key observation that for each word 𝑤𝑗, the output weight of the matching function should correspond to the 𝑗-th component of the 𝑥𝑖∗ sample in the training data. That is, 𝜇(𝑔(𝑥𝑖), 𝑔∗(𝑤𝑗); 𝐴) ≈ 𝑥𝑖𝑗∗. Using this fact, we can learn the matching function by minimizing the following objective:

\mathcal{L}_{\phi}(\theta_g, \theta_{g^*}, A) = -\frac{1}{md} \sum_{i=1}^{m} \sum_{j=1}^{d} \left[ x_{ij}^{*} \log \mu_i(g(x_i), g^*(w_j); A) + (1 - x_{ij}^{*}) \log\left(1 - \mu_i(g(x_i), g^*(w_j); A)\right) \right]. \quad (4)

Each component of the PI vector 𝑥𝑖∗ can thus be interpreted as providing an indicator label for the likelihood of the (𝑥𝑖, 𝑤𝑗) pair occurring together. The similarity control mechanism highlighted in Eqn. 4 differs from the kernel-matching mechanism mentioned previously in [9]. The limitation of kernel-matching [9] is that two sets of kernel weights need to be learned simultaneously: 𝛼 for 𝐾(𝑥𝑖, 𝑥) and 𝛽 for 𝐾∗(𝑥𝑖∗, 𝑥∗). By contrast, our joint representation for 𝑥𝑖 and 𝑥𝑖∗ encourages a single hypothesis model to be used to map ℎ∗ : Φ → 𝑌. Since the matrix 𝐴 captures the 𝑔(𝑥𝑖) and 𝑔∗(𝑤𝑗) interactions, it preserves the PI in the space of Φ and allows relevant PI to be retrieved at test time. Finally, we obtain the augmented representation of 𝑥𝑖 as a weighted combination of G∗ = \{g^*(w_j)\}_{j=1}^{d} and 𝑔(𝑥𝑖):

\phi(x_i) = g(x_i) + \sum_{j=1}^{d} \mu_i(g(x_i), g^*(w_j); A) \cdot g^*(w_j). \quad (5)

One can think of G∗ as the set of basis vectors supporting the PI space (similar to frames for the PI Kernel [9]). The augmented representation 𝜙(𝑥𝑖) contains both information from the original 𝑥𝑖 as well as relevant information retrieved from G∗. Note that since the PI vectors 𝑥𝑗∗ ∈ 𝑋∗ are not directly used at testing time, each sample 𝑥𝑖 ∈ D_test is mapped into Φ using 𝑔(.), and the trained 𝜇(.) selects the corresponding bases in G∗ to construct 𝜙𝑖. This is quite different from representation fusion methods [6, 13], which only try to learn a shared representation space for input modalities 𝑋1, ..., 𝑋𝑘, without a matching function to control the contribution of each modality to the hypothesis. For example, we can take 𝑋1 to be the original feature space and 𝑋2 to be the privileged information. At test time, when 𝑋2 is unavailable, 𝑋1 inputs with masked 𝑋2 components may be projected into a completely different location in the shared representation space than if the 𝑋2 information were available. Furthermore, model fusion methods [13] may also under-utilize the original feature space during training, as the PI contains more information related to the target task.
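To make the retrieval mechanism concrete, the following is a small PyTorch sketch of the matching weights in Eqn. (3) and the augmented representation in Eqn. (5); the GRU encoder is a stand-in for the recurrent EF encoder, and all dimensions are illustrative.

import torch
import torch.nn as nn

d, k, T, n_feat = 928, 128, 48, 40     # PI vocab size, embedding width, time steps, EF channels

class AugmentedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(n_feat, k, batch_first=True)    # g(x; theta_g): EF Embedding
        self.pi_emb = nn.Embedding(d, k)                   # g*(w; theta_g*): PI Embedding
        self.A = nn.Linear(2 * k, 1, bias=False)           # projection matrix A in Eqn. (3)

    def forward(self, x):
        _, h = self.rnn(x)                                 # encode the time series
        gx = h.squeeze(0)                                  # (batch, k)
        gw = self.pi_emb.weight                            # (d, k), one row per PI word
        # score every (g(x_i), g*(w_j)) pair: concatenate, project with A, clamp at 0
        pairs = torch.cat([gx.unsqueeze(1).expand(-1, d, -1),
                           gw.unsqueeze(0).expand(gx.size(0), -1, -1)], dim=-1)
        scores = torch.relu(self.A(pairs)).squeeze(-1)     # max{0, [g(x); g*(w_j)]^T A}
        mu = torch.softmax(scores, dim=-1)                 # matching weights, Eqn. (3)
        phi = gx + mu @ gw                                 # augmented representation, Eqn. (5)
        return mu, phi

encoder = AugmentedEncoder()
mu, phi = encoder(torch.randn(4, T, n_feat))               # toy batch of 4 patients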
4.2.3 Coupling Decision Functions with Feature Matching
LUPI typically considers two hypothesis functions: the Student hypothesis ℎ : 𝑋 → 𝑌, and the Teacher hypothesis 𝑓∗ : 𝑋∗ → 𝑌. Since we have already addressed the problem of finding a common "frame of reference" between the original feature space and the PI space by the matching function 𝜇, the main focus of this portion of our method is finding an efficient 𝑓∗ that relates the privileged information to the labels. Fortunately, we can directly approximate 𝑓∗ : 𝑋∗ → 𝑌 by a function ℎ∗ : Φ → 𝑌 that maps samples from the joint representation space to the label space. This is because Φ is constructed by the embedding function 𝑔∗(.) on 𝑋∗ and is an approximation of the kernel space for the privileged information. In the case that the target task is classification, we can formulate ℎ∗ as a feature matching problem between samples from Φ and 𝑌. Specifically, we can use a contrastive loss [14] to find an invariant representation ℎ∗ : Φ → 𝑌 and vice versa, by minimizing the distance between similar samples drawn from the joint embedding space based on signals from the label space:

L(W, y_i, x_i^{+}, x_i^{-}) = (1 - y_i)\, \mathcal{M}_W(x_i^{+}, x_i^{-}) + y_i \left( \max\{0,\, C - \mathcal{M}_W(x_i^{+}, x_i^{-})\} \right) \quad \text{(Contrastive)}

\mathcal{L}(W) = \frac{1}{S} \sum_{i=1}^{S} L(W, (y_i, x_i^{+}, x_i^{-})_i), \quad (6)

where \mathcal{M}_W(x^{+}, x^{-}) = \lVert \phi(x^{+}; W) - \phi(x^{-}; W) \rVert_2 refers to a parameterized distance metric with respect to the projections 𝜙(𝑥+) and 𝜙(𝑥−), and 𝐶 is the slack variable which defines the margin of separation between them. 𝜙(.) is simply the projection function from Eqn. 5, which is parameterized by 𝑊 = [𝜃𝑔, 𝜃𝑔∗, 𝐴] from Eqn. 4. Intuitively, \mathcal{M}_W gives the distance between the augmented projections of 𝑥𝑖+ and 𝑥𝑖−, i.e., 𝜙(𝑥𝑖+) and 𝜙(𝑥𝑖−), which are compared by their labels 𝑦𝑖. Given a training pair (𝑥𝑖, 𝑥𝑖∗, 𝑦𝑖), a set of 𝑘 similarity samples 𝑆𝑖 = \{(x_i^{+}, x_i^{-}, y_i)_j\}_{j=1}^{k} is constructed around the (𝑥𝑖, 𝑦𝑖) pair, whereby 𝑥𝑖+ denotes samples with the same label as 𝑦𝑖 (𝑦+ = 𝑦𝑖), and 𝑥𝑖− denotes samples with a different label (𝑦− ≠ 𝑦𝑖). Thus, 𝑥+ and 𝑥− denote positive samples (similar) and negative samples (dissimilar), respectively. A variety of negative sampling techniques can be used to obtain the set 𝑆 [15, 14, 16]. In practice, we found picking 5–10 negative samples that are close to 𝜙(𝑥𝑖) and 5–10 samples that are far from 𝜙(𝑥𝑖) to be sufficient in creating 𝑆 for each training triplet (𝑥𝑖, 𝑥𝑖∗, 𝑦𝑖). We refer the reader to [14] and [15] for more information about the contrastive loss and the construction of 𝑆.
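A minimal sketch of the pairwise margin loss in Eqn. (6), written over augmented projections 𝜙(·); the margin value is a placeholder, and the similarity indicator follows the convention implied by the equation above (0 for a same-label pair, 1 for a different-label pair).

import torch

def contrastive_loss(phi_a, phi_b, y, margin=1.0):
    """Pairwise margin loss over augmented projections (Eqn. 6).

    y = 0 for a same-label pair (pull the projections together),
    y = 1 for a different-label pair (push them at least `margin` apart).
    """
    dist = torch.norm(phi_a - phi_b, dim=-1)               # M_W(x+, x-)
    return ((1 - y) * dist + y * torch.clamp(margin - dist, min=0)).mean()

# toy usage: 16 pairs of 128-dimensional projections phi(x)
phi_a, phi_b = torch.randn(16, 128), torch.randn(16, 128)
y = torch.randint(0, 2, (16,)).float()
loss = contrastive_loss(phi_a, phi_b, y)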
Finally, we combine the two portions of our learning task (i.e. representation learning and the joint hypothesis) into the optimization task:

\min_{\Theta}\; \mathcal{L}(W) + \lambda\, \mathcal{L}_{\phi}(\theta_g, \theta_{g^*}, A) + \Omega(W), \quad (7)

where 𝑊 = [𝜃𝑔, 𝜃𝑔∗, 𝐴] is the total set of parameters for the learning task, 𝜆 is the hyperparameter which controls the trade-off between the contrastive loss used to learn ℎ∗ and the representation loss in Eqn. 4, and Ω(.) is the regularization term used to constrain the hypothesis space of the joint model.

4.3 Analysis of Sampling Efficiency
4.3.1 Results from Existing Agnostic Models
For an agnostic hypothesis model, such as a non-linearly separable SVM, the generalization error bound holds with probability 1 − 𝛿:

R(h) \le R_{emp}(h) + O^{*}\!\left( \sqrt{\frac{\Delta_H \log(m/\Delta_H) - \log\delta}{m}} \right), \quad \text{(VC-bound)}

where |𝑅(ℎ) − 𝑅𝑒𝑚𝑝(ℎ)| = 𝜖 ∈ [0, 1] is the generalization error represented by the difference between the expected and empirical training risks, Δ𝐻 is the VC-dimension of the given SVM model class, 𝑚 is the sample size, and 𝛿 ∈ (0, 1). Under the SVM+ formulation in [9], the generalization error is instead given by:

R(h) \le R_{emp}(f^{*}) + O^{*}\!\left( \frac{(\Delta_H + \Delta_{F^*}) \log\frac{m}{\Delta_H + \Delta_{F^*}} - \log\delta}{m} \right), \quad \text{(SVM+)}

where 𝑅𝑒𝑚𝑝(𝑓∗) denotes the error rate of the Teacher's hypothesis 𝑓∗ : 𝑋∗ → 𝑌, and Δ𝐹∗ denotes the VC-dimension of the Teacher model. In the original SVM, the model needs to estimate 𝑚 slack variables, one for each training sample, in addition to the 𝑛 parameters in 𝑤. At a high level, the hypothesis function 𝑓∗ of the LUPI Teacher serves as a slack function which approximates these slack variables for each 𝑥𝑖, eliminating the need for the Student to estimate them during training [9]. The number of estimations in the latter case reduces to O(𝑚 + 𝑛), rather than O(𝑚𝑛). As a result, the sampling efficiency improves from m \le O\!\left(\frac{\Delta_H + \log(1/\delta)}{\epsilon^2}\right) to m \le O\!\left(\frac{\Delta_H \log\Delta_H + \log(1/\delta)}{\epsilon}\right) in the number of samples required to achieve the same generalization error 𝜖.

4.3.2 Complexity of Proposed LUPI Method
In this section, we examine the sample complexity of our proposed LUPI method. For simplicity, let us consider the classification setting where we are given a hypothesis class H of finite VC-dimension which defines a set of functions mapping X to a label set {0, 1}, and let the 0−1 loss function define the empirical risk. Let ΔH = 𝑑 < ∞. By the fundamental theorem of PAC learning (Thm. 6.7 in [17]), there exist 𝐶1, 𝐶2 ∈ R such that:

1. H is agnostic PAC learnable and has the uniform convergence property with sample complexity

C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}. \quad \text{(Agnostic)}

2. There exists a realizable subset H_r such that the sample complexity is defined by

C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m(\epsilon, \delta) \le C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}. \quad \text{(Realizable)}

The main difference between agnostic and realizable PAC models lies in whether the classifier can completely classify a training set S = \{(x_i, y_i)\}_{i=1}^{m}. That is, the training error R_{emp}(h) = \frac{1}{m}\sum_{i=1}^{m} L(h(x_i), y_i) = 0 for the particular hypothesis class under some empirical risk minimization (ERM) algorithm. On the other hand, when we have 𝑦𝑖 ≠ ℎ(𝑥𝑖) for some training data, there exist some examples that the current hypothesis class cannot successfully separate (i.e.
it cannot realize an accurate ERM hypothesis with zero empirical risk), suggesting that the optimal solution is either not contained in the span of the given hypothesis class, or the ERM algorithm cannot converge to the optimal solution in the hypothesis space. Fortunately, [18] introduces some conditions under which a sample complexity between O\!\left(\frac{\log(1/\delta)}{\epsilon}\right) and O\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right) is possible for some classes of models. Specifically, Tsybakov showed that m(\epsilon, \delta) \le C \frac{d\log(1/\delta) + \log(1/\delta)}{\epsilon^{n}}, with 1 < n < 2, exists under two general scenarios:
• When there is zero training error (ERM realizable).
• When the classification margin between the given hypothesis class and another realizable hypothesis class is bounded.

LUPI satisfies the second Tsybakov condition by leveraging a realizable Teacher that provides a mechanism to bound the margins of the Student ERM classifier. Specifically, [8] considers the case where the Student hypothesis class H is non-realizable and the Teacher hypothesis 𝑓∗ ∈ F∗ is a realizable classifier that approximates an Oracle classifier with zero training error. [8] showed that using kernel alignment between the Teacher and the Student, the latter can satisfy the Tsybakov conditions, leading to a sample complexity that is comparable to the realizable case. [9] further showed that, under some assumptions on H∗, one can achieve O\!\left(\frac{\log(1/\delta)}{\epsilon}\right) \le m(\epsilon, \delta) \le O\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right) so long as the Teacher classifier H∗ has lower VC-dimension and training error than H. In our model, we provide the margin bounding mechanism through the contrastive loss in Eqn. (Contrastive). Under the condition that the Teacher model 𝑓∗(𝑥∗) has a lower VC-dimension and 𝑦 · ℎ(𝑥) > 𝑐_0 𝐶 − \mathcal{M}(x^{+}, x^{-}), the 𝐶 − \mathcal{M}(x^{+}, x^{-}) term from the Contrastive equation serves as the approximate slack margin of the oracle classifier. To see this, let 𝑝 ∈ R and 𝑞 ∈ R be random values. Suppose that 𝑝 < 0; then either 𝑞 < 0 or 𝑝 − 𝑞 < 0 is true. We then also have 𝑃(𝑝 < 0) ≤ 𝑃(𝑞 < 0) + 𝑃(𝑝 − 𝑞 < 0). If we take 𝑝 = 𝑦 · ℎ(𝑥) and 𝑞 = 𝑦 · 𝑓∗(𝑥∗), then with probability 1 − 𝛿 we can express the error bounds of the Student and Teacher models as follows,

P(y \cdot h(x) < 0) \le P(y \cdot f^{*}(x^{*}) < 0) + P(y \cdot h(x) < y \cdot f^{*}(x^{*})).

Under the contrastive loss in Eqn. (Contrastive), we can re-formulate the above as:

P(y \cdot h(x) < 0) \le P(y \cdot f^{*}(x^{*}) < 0) + P(y \cdot h(x) < C - \mathcal{M}(x^{+}, x^{-}))
\le P(y \cdot f^{*}(x^{*}) < 0) + O\!\left( \frac{\Delta_H + \Delta_{f^*} - \ln\delta}{m} \right)
\implies m \le O\!\left( \frac{(\Delta_H + \Delta_{f^*})\log(1/\delta) + \log(1/\delta)}{\epsilon^{n}} \right),

where 1 < 𝑛 < 2 if 𝑃(𝑦 · 𝑓∗(𝑥∗) < 0) is satisfied, i.e. if the Teacher model is realizable. We note that although our proposed LUPI model is PAC learnable, i.e. has the uniform convergence property, we cannot bound the computational complexity of learning. Specifically, if we allow the embedding components 𝑔(.) and 𝑔∗(.) to be non-convex functions, then finding the global optimum for 𝜙(𝑥) becomes an NP-hard problem. In other words, the above analysis only examines the sample complexity bounds; it does not provide insight into the computational runtime of learning, or the actualization of the uniform convergence property.

4.4 Experiments
We empirically assess the effectiveness of our LUPI formulation for improving sample efficiency and generalization performance in a multi-task setting. First, we consider prediction accuracy on numerous diagnostic prediction tasks, which are individually binary classification problems.
This allows us to evaluate the ability of our LUPI formulation to actually transfer the privileged information to improve the learning efficiency in a multi-task setting. We benchmark the learning accuracy of our method against the performance of various transfer learning baselines. We also perform an ablation study on the privileged information components of our model to ascertain their necessity. Finally, we perform prediction tasks on classes with only sparse examples – defined as ≤ 100 training samples – and compare the sample efficiency of our model against select models from other transfer learning paradigms. For PI, we consider physician notes in the form of discharge summaries, linking standard medical terminologies (i.e., UMLS codes [19]) with diagnostic findings in the EHR. UMLS codes are a set of standardized medical concepts used by clinicians to describe physical findings of diseases and are used widely in both the EHR as well as medical research [19]. [9] alluded to the idea that medical datasets also contain vast amounts of privileged information in the physician notes, which serve to explain the qualities of diseases that can greatly aid decision rules. For experiments, we consider the following set of data for example features, PI, and labels:
• Example Features 𝑋: continuous time-series data (i.e. lab values, blood tests, imaging) and discrete static variables (i.e. demographics information) that describe a patient.
• Privileged Information 𝑋∗: physician notes containing descriptions in natural language and medical terms (UMLS concepts [19]) that summarize a particular visit for a patient.
• Target Task 𝑌: prediction tasks of interest, such as mortality (binary classification), disease prediction (multi-task and transfer learning), etc.

4.4.1 Datasets and Setup
Table 4.1 provides a brief summary of the data sources for our experiments. For each data source, we extract the unique data modalities available in the dataset. MIMIC-III (Medical Information Mart for Intensive Care) is a publicly available benchmark dataset for predictive modeling and clinical decision support in the intensive care unit (ICU) setting [20]. It should be noted that MIMIC-III and STRIDE are EHR datasets, although STRIDE is comprised of clinical notes (PI vocabulary) obtained from multiple EHR datasets over 19 years of data collection. Documents in Table 4.1 refer to literature sources, including medical claims [21] and research articles [22] that heavily utilize UMLS codes. We refer to the clinical notes from the EHR as the PI source, which we decomposed into lists of UMLS codes. For example, a clinician's note may contain a text description of pneumonia which may utilize several UMLS codes such as (Lower Lobe Consolidation, Staph Aureus, Productive Cough) as keywords.

Table 4.1: Summary of datasets used in this study
Database  No. Patients  UMLS  ICD-9s  Temporal
MIMIC-III  22,043  928  148  40
STRIDE  4M  14,256  None  None
Documents  1.2M  None  11,245  None

MIMIC-III provides a rich source of temporal data, ranging from laboratory tests, vital signs and respiratory parameters, all of which provide hourly resolution of descriptive features. For example features, we use 40 physiologic features, including vital signs (i.e. heart rate, blood pressure, oxygen saturation, temperature), blood tests (i.e. WBC count, platelets, INR) and respiratory parameters (i.e. PaO2/FiO2 ratio, PEEP). These temporal features are the source of example features (EF) for our experiments, i.e. X = \{x_i\}_{i=1}^{m}, where each x_i = \{x_i^{t}\}_{t=1}^{T}.
Preprocessing of these features includes binning the time-series by the hourly average of each feature and standardizing feature values across all adult patients. Physician notes (the source of PI) in MIMIC-III exist in the form of discharge notes, which are physician documentation of key findings relating to the patient's hospital visit. We can represent the PI as X^* = \{x_i^*\}_{i=1}^{m}, where each x_i^* \in \{0, 1\}^{d} represents a discharge note for the 𝑖-th patient, in the form of a 𝑑-dimensional one-hot vector. Here, 𝑑 is the total number of UMLS codes that are found in all of our data sources (MIMIC-III, STRIDE, and Documents). One can think of the UMLS codes as a set of basis features for the PI vectors. The rationale for using physician notes as PI is that they are only available at the end of the hospital stay and contain copious amounts of valuable information regarding a wide array of clinical decision support tasks, such as physical findings, periodic nurse observations, medical or surgical complications, and indicators for mortality risk. During training, we can incorporate these notes into the learning regime, but they become unavailable at inference time. For labels, MIMIC-III provides a wide range of potential tasks. We focus on the prediction of ICD-9 diagnostic codes, which are a set of diagnosis labels given to patients that identify their disease states. Each patient has a set of codes that can be described by a label vector y_i \in \{0, 1\}^{C}, where 𝐶 denotes the total number of disease classes considered. ICD-9 prediction is in fact a difficult multi-label classification problem among other clinical benchmark tasks due to the fact that the distribution of diseases often contains a long tail [23]. In the typical case, a few diseases dominate in high frequency while most diagnostic codes appear only a few times among all patients. As a result, training samples are sparse for most diseases, leading to poor prediction beyond the most frequent cases. Our experimental task is to leverage information from PI under a multi-task learning setting to improve the learning efficiency for a large set of ICD-9 codes, especially ones in the tail of the distribution (i.e. occurring with few samples in the dataset). We consider the diagnoses appearing in at least 1% of admissions, leaving 𝐶 = 148 ICD-9 group codes to formulate our multi-task prediction as 𝐶 classification tasks. We consider UMLS terms appearing at least 50 times in discharge notes, leaving 𝑑 = 928 UMLS terms to construct the PI vocabulary.
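To illustrate the data layout described above, the following sketch shows how a discharge note's UMLS codes and a patient's ICD-9 groups could be turned into the binary vectors x_i^* ∈ {0,1}^d and y_i ∈ {0,1}^C; the vocabularies and codes shown are illustrative placeholders, not entries from the actual datasets.

import numpy as np

# placeholder vocabularies standing in for the d = 928 UMLS terms and C = 148 ICD-9 groups
umls_vocab = {"lower_lobe_consolidation": 0, "staph_aureus": 1, "productive_cough": 2}
icd9_groups = {"486": 0, "038": 1, "518": 2}   # illustrative group codes only

def encode_note(note_codes, vocab):
    """Binary PI vector x_i*: 1 at every UMLS term mentioned in the discharge note."""
    x_star = np.zeros(len(vocab), dtype=np.float32)
    for code in note_codes:
        if code in vocab:                      # terms below the frequency cutoff are dropped
            x_star[vocab[code]] = 1.0
    return x_star

def encode_labels(patient_codes, groups):
    """Multi-label target y_i over the ICD-9 group codes."""
    y = np.zeros(len(groups), dtype=np.float32)
    for code in patient_codes:
        if code in groups:
            y[groups[code]] = 1.0
    return y

x_star = encode_note(["staph_aureus", "productive_cough"], umls_vocab)
y = encode_labels(["486"], icd9_groups)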
ensemble) rather than the LUPI mechanism. As a sanity check, we also included a comprehensive Teacher model using EF + PI features, which should provide the best performance. We note, however, that because the Teacher models use PI at test time, they are used to assess the quality of PI rather than benchmark Student performance. In practice, PI is unavailable at test time, so the Teacher models cannot be used for inference in a real-world setting. Table 4.2 summarizes the performance of these baselines on held-out test set data. Since ICD-9 predictions involve a large number of classes, we take both micro-averaged and macro-averaged AUC as evaluation metrics. Macro-averaged AUC takes per-class average of AUC scores, while micro-averaged AUC considers a single AUC score based on a roll-out of label classes for each test set sample. We also include micro-averaged F1-score and micro-averaged area under PRC to quantify the trade-off between precision and recall. 79 • RNN Student denotes the Student learner using the RNN model class conditioned exclusively on EF, using the LSTM architecture as mentioned in [24]. • MLP Student denotes a feed-forward network conditioned on the final time-step of EF. • MLP Teacher denotes the Teacher feed-forward network conditioned on PI only. Specifically, we use a weighted sum of the PI embeddings for each 𝑥𝑖∗ = {𝑤 1 , ..., 𝑤 𝑘 }: 1 Õ𝑘 𝜙(𝑥𝑖∗ ) = 𝑤𝑇 𝜃 𝑔∗ 𝑘 𝑗=1 𝑗 which maps each PI vector 𝑥𝑖∗ into a lower-dimensional representation space, and 𝜃 𝑔∗ denotes the look-up matrix of embeddings obtained in the first step of our LUPI algorithm. • Oracle Teacher denotes the Teacher model which uses both EF and PI for prediction. ℎ(𝑥𝑖 ) = 𝑅𝑁 𝑁 (𝑥𝑖 ) 𝑓𝑐∗ (𝑥𝑖 , 𝑥𝑖∗ ) = 𝜎(𝑊ℎ𝑐 ℎ(𝑥𝑖 ) + 𝑊𝑔𝑐 𝑔(𝑥𝑖∗ ) + 𝑏 𝑐 ) EF inputs are encoded into fixed-length vectors by a set of RNN layers and the PI features are embedded into lower-dimensional space by 𝑔(𝑥 ∗ ) described previously. Since there are 𝐶 tasks (i.e., 𝐶 outputs), a classifier layer is used to predict the 0 − 1 label for each ICD-9 code. Here, we note that the embedding matrix 𝜃 𝑔∗ for learning the lower dimensional representation of PI is obtained by the embedding mechanism highlighted in the PI Embedding equations. Taking the set of UMLS concept codes as the PI vocabulary, we leverage the corpus available in STRIDE and Documents datasets to learn the 𝜃 𝑔∗ , conditioned on the UMLS codes. For example, given a medical document consisting of a set of 𝑛 relevant UMLS codes 𝑣® = {𝑤 1 , 𝑤 2 , ...𝑤 𝑛 }, we can train the 𝜃 𝑔∗ for the UMLS codes by Eqn. 2, with the modified scoring function:   if 𝑤 1 , 𝑤 2 ⊂ 𝑣®  1   𝑆𝑐𝑜𝑟𝑒(𝑤 1 , 𝑤 2 ) = (Co-occurence)   0  otherwise  80 Table 4.2: Comparison of performance across baseline models Model Ma-AUC Mi-AUC Mi-F1 AUPRC RNN Student 0.735 0.783 0.299 0.260 MLP Student 0.715 0.756 0.235 0.211 MLP Teacher∗ 0.824 0.868 0.446 0.432 Oracle Teacher ∗ 0.845 0.882 0.497 0.510 ( ∗ ) denotes Teacher models using PI. Note that each UMLS concept is represented by 𝑤 𝑖 ∈ {0, 1} 𝑑 , where 𝑤 𝑖 𝑗 = 1 for the index corresponding to the UMLS code. Thus, 𝜃 𝑔∗ [:, 𝑖] gives the the distributed representation of 𝑖 𝑡ℎ UMLS concept. We train the embedding matrix 𝜃 𝑔∗ over STRIDE and Documents before applying 𝜃 𝑔∗ on the MIMIC-III dataset for LUPI. We see from table 4.2 that the PI provides strong signals for ICD-9 prediction. Micro-averaged AUC and Macro-averaged AUC are denoted as Mi-AUC and Ma-AUC, respectively. 
Micro-averaged F1-score and AUC of precision-recall curve are denoted as Mi-F1 and AUPRC, respectively. Large differences exist between the Student baselines and the Teacher models across all performance metrics, suggesting that the PI provides more information about the label space compared to the original time-series features. Again, we emphasize here that discharge notes (PI) are generated only after the diagnostic predictions have been made by clinicians, and thus the Teacher models are actually not available at inference time. Interestingly, we also see that the Oracle teacher with combined features provided additional performance boost compared to using PI exclusively as features. This suggests that the temporal features provide some complementary information not contained in the PI. 4.4.3 Comparison Against Other Transfer Learning Methods Next, we benchmark performance for several existing transfer learning paradigms for incorporating PI with the Student model: transductive learning, inductive learning, and model distillation. Under the transductive framework, we treat the PI as auxiliary targets, much like target replication in [24]. We train a joint hypothesis model ℎ : 𝑋 → 𝑋 ∗ × 𝑌 to map from the original EF space to the joint PI and label space. By contrast, we incorporate 𝑋 ∗ as an auxiliary input for the inductive framework. 81 We use data fusion to learn a joint representation 𝑔 : 𝑋 × 𝑋 ∗ → 𝑍 before learning a hypothesis function to predict ℎ : 𝑍 → 𝑌 . For model distillation, a Teacher network predicts a set of soft-labels over the PI information, which the Student model uses as auxiliary input for the final prediction model ℎ : 𝑋 × 𝑋 ∗ → 𝑌 . Details of the setup is explained below. 4.4.3.1 Multi-task learning MTL is the representative transductive learning technique. There is only one source domain D = {𝑋, 𝑃(𝑋)} and two target tasks: T𝑆 = {𝑋 ∗ , 𝐺} and T𝑇 = {𝑌 , H }. The MTL model learns a joint model: 𝑔 𝑘 (𝑥𝑖 ) = 𝑀 𝐿𝑃(𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑟𝑒 𝑝 ); 𝑊 𝑘 ) (Shared Rep.) ℎ𝑐 (𝑥𝑖 ) = 𝑀 𝐿𝑃(𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑟𝑒 𝑝 ); 𝑊𝑐 ) (Individual Hyp.) 1 Õ𝑚 Õ𝐶 L 𝑀𝑇 𝐿 = 𝐿𝑌 (ℎ𝑐 (𝑥𝑖 ), 𝑦𝑖𝑐 ) 𝑚𝐶 𝑖=1 𝑐=1 𝜆 Õ𝑚 Õ 𝑑 + 𝐿 𝑋 ∗ (𝑔 𝑗 (𝑥𝑖 ), 𝑥𝑖∗𝑗 ) (MTL Obj.) 𝑑𝑚 𝑖=1 𝑗=1 where 𝑊𝑟𝑒 𝑝 is the shared weights for the representation model 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑟𝑒 𝑝 ), 𝑊 𝑘 and 𝑊𝑐 are task-specific weights for target hypothesis models. The MTL loss is composed of two parts: (1) a loss component over the joint label space E[𝐿𝑌 (ℎ(𝑋), 𝑌 )], and (2) a loss term over the joint PI space E[𝐿 𝑋 ∗ (𝑔(𝑋), 𝑋 ∗ )]. 𝜆 is a hyperparameter which controls the trade-off between the multiple objectives during learning. 𝐿(.) denotes some evaluation criterion to approximate the 0 − 1 loss, for example the binary cross-entropy (BCE) or mean squared error (MSE). We used BCE in the proceeding experiments. 4.4.3.2 Data Fusion For inductive learning, we used a variant of the Siamese Network [16] to achieve data fusion between EF and PI. We use two parallel networks, 𝑔 : 𝑋 → 𝑍 and 𝑔 ∗ : 𝑋 ∗ → 𝑍 and minimize the distance 82 Table 4.3: Comparison of performance across transfer learning models Model Ma-AUC Mi-AUC Mi-F1 AUPRC MTL 0.783 0.836 0.384 0.336 Distillation 0.738 0.793 0.289 0.245 Data Fusion 0.779 0.811 0.374 0.328 Ours 0.838 0.845 0.397 0.344 between 𝑔(𝑥) and 𝑔 ∗ (𝑥 ∗ ) using the BCE loss. We then learn a hypothesis function ℎ : 𝑍 → 𝑌 . 𝑔(𝑥𝑖 ) = 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑥 ) (EF Embedding) 𝑔 ∗ (𝑥𝑖∗ ) = 𝑀 𝐿𝑃(𝑥𝑖∗ ; 𝑊𝑥 ∗ ) (PI Embedding) ℎ(𝑥𝑖 , 𝑥𝑖∗ ) = 𝜎(𝑊𝑔 𝑔(𝑥𝑖 ) + 𝑊𝑔∗ 𝑔 ∗ (𝑥𝑖∗ ) + 𝑏 𝑔 ) (Joint Hyp.) 
Õ𝑚 L 𝑍 (𝑊𝑥 , 𝑊𝑥 ∗ ) = 𝐵𝐶𝐸 (𝑔(𝑥𝑖 ), 𝑔 ∗ (𝑥𝑖∗ )) (Fusion Loss) 𝑖=1 Õ𝑚 L𝑌 (𝑊𝑔 , 𝑊𝑔∗ ) = 𝐵𝐶𝐸 (ℎ(𝑥𝑖 , 𝑥𝑖∗ ), 𝑦𝑖 ) (Task Loss) 𝑖=1 L 𝑍 and L𝑌 are trained iteratively using alternating stochastic gradient descent (SGD). At test time, a masking vector 𝑥𝑡𝑒𝑠𝑡 ∗ = {0} 𝑑 is used to represent PI, as it is unavailable for inference. Thus, ∗ ) = 𝑦 is used for evaluation. ℎ(𝑥𝑖 , 𝑥𝑡𝑒𝑠𝑡 𝑖 4.4.3.3 Distillation Our distillation also contains two parts: a Teacher network trained to generate soft-labels for PI, and Student network conditioned on the EF and PI soft-labels to predict the ICD-9 targets. 𝑔(𝑥𝑖 ) = 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊𝑇 ) (Teacher) ℎ𝑐 (𝑥𝑖 ) = 𝑀 𝐿𝑃([𝑔(𝑥𝑖 ); 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊 𝑅𝑁 𝑁 )]; 𝑊𝑐 ) (Student) Õ𝑚 Õ𝐶 L𝑇 𝑆 = 𝐵𝐶𝐸 (ℎ𝑐 (𝑥𝑖 ), 𝑦𝑖 ) (Distillation Loss) 𝑖=1 𝑐=1 Here, [𝑔(𝑥𝑖 ); 𝑅𝑁 𝑁 (𝑥𝑖 ; 𝑊 𝑅𝑁 𝑁 )] denotes concatenation of the PI soft-labels and the last hidden state of the 𝑅𝑁 𝑁 (𝑥𝑖 ). The joint T-S loss connects the Teacher and Student loss together, allowing the two networks to be trained end-to-end. 83 Table 4.3 summarizes the AUC and retrieval scores for our LUPI model against transfer learning baselines. Micro-averaged AUC and Macro-averaged AUC are denoted as Mi-AUC and Ma-AUC, respectively. Micro-averaged F1-score and AUC of precision-recall curve are denoted as Mi-F1 and AUPRC, respectively. We see that our LUPI formulation outperformed other transfer learning baselines in all major performance metrics. Vanilla Data Fusion and MTL networks produced comparable performances, and Distillation did not have significant improvement over the baseline Student models that did not use PI. For Data Fusion, it is likely that since the PI contains a lot more information than the original EF, the decision function of the hypothesis model ℎ(𝑥, 𝑥 ∗ ) relied heavily on access to PI. Since PI is masked during testing due to unavailability, ℎ(𝑥, {0} 𝑑 ) likely resulted in poor generalization. Comparable results can be seen in MTL. The drop in performance is most likely due to negative transfer [1] due to the wide range of tasks (ICD-9 codes) that contribute uniformly to the multi-objective learning process. Unlike our LUPI, the PI is not used to inform similarity between training samples from different tasks, which do not share the same support (i.e. different diagnosis may come from very different underlying distributions). Thus, in both Data Fusion and MTL cases, the PI is incorporated in a less efficient way than our proposed model. Interestingly, we see that the MLP Teacher and Oracle Teacher models in Table 4.2) still provided better AUC, F1 and AUPRC performances over the all transfer learning models, including our LUPI model. This result suggests that the PI is more informative for diagnostic tasks compared to the original EF, which is what enables the LUPI method to be effective. One possible explanation of the predictive power of the PI is that the embeddings of the UMLS terms, which comprise the PI vocabulary, are learned based on their co-occurrence with disease codes in public literature. 4.4.4 Performance with Sparse Examples In addition to broad coverage of tasks, we evaluate the sample efficiency of our proposed model against transfer learning baselines by considering the more rare diseases with very few training samples. 
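To make the long-tail task selection concrete, the following sketch shows how the sparse-task subset could be carved out of a binary label matrix. It is illustrative only — the function name, the NumPy layout, and the synthetic labels are assumptions rather than the preprocessing code used in this thesis; the 100-sample threshold matches the definition of sparse classes used in this section.

```python
import numpy as np

def rare_task_indices(y_train, max_count=100):
    """Return column indices of ICD-9 group codes with fewer than
    `max_count` positive training examples, i.e. the long-tail tasks."""
    counts = y_train.sum(axis=0)              # positives per diagnosis code
    return np.where(counts < max_count)[0]

# Toy usage with a synthetic multi-label matrix (m patients x C codes).
rng = np.random.default_rng(0)
y_train = (rng.random((5000, 148)) < rng.uniform(0.001, 0.2, size=148)).astype(int)
rare = rare_task_indices(y_train)
print(f"{len(rare)} of {y_train.shape[1]} codes fall in the long tail")
# Downstream metrics (e.g. macro-AUC) would then be computed over y[:, rare] only.
```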
This is actually quite typical in the EHR setting, where diagnosis labels often have very long tail distributions. We restrict our predictions to diagnostic codes appearing fewer than 100 times in the training and test sets and examine the generalization of the various modeling schemes.

Table 4.4: Performance of various models for 30 ICD-9 codes appearing fewer than 100 times in the dataset
Model            Ma-AUC   Mi-AUC   Mi-F1   AUPRC
RNN Student      0.628    0.639    0.104   0.096
MLP Teacher      0.821    0.833    0.212   0.196
Oracle Teacher   0.801    0.805    0.256   0.146
MTL              0.717    0.724    0.150   0.122
Distillation     0.729    0.738    0.158   0.149
Data Fusion      0.821    0.826    0.307   0.237
Ours             0.834    0.835    0.381   0.330

In Table 4.4, we see that performance decreased drastically for the non-transfer-learning models, namely the RNN Student and the Teacher models (both MLP and Oracle). Transfer learning schemes such as MTL and Distillation also dropped substantially in F1-score and AUPRC. Interestingly, the Data Fusion method outperformed the other transfer learning baselines in F1-score and AUPRC, suggesting that learning a domain-invariant representation between the original features and the PI provided a key improvement in sample efficiency. However, our LUPI model achieved the best performance among all models on these long-tail tasks. In fact, its performance stayed relatively consistent on this subset of tasks compared to its performance on the original set of common diagnoses.

4.5 Discussion and Conclusion

In this chapter, we presented a novel LUPI framework for retaining PI in the multi-task setting to improve sample complexity over a wide range of related tasks. The key idea was to learn a joint representation of the original feature space and the PI by leveraging their co-occurrence information in the data. Decomposing the PI into distributed representations of basis features was vital for realizing this mechanism. Experiments show that our proposed LUPI method can outperform baseline models and other transfer learning methods in multi-task learning scenarios, particularly in situations where training samples are very rare (< 100 samples per task). In addition to the improved performance, we also provided a sample complexity analysis that outlines the scenarios under which our LUPI method can provide similar benefits over traditional transfer learning approaches.

BIBLIOGRAPHY

[1] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[2] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[3] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
[4] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
[6] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
[7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015. [8] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544–557, 2009. [9] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of machine learning research, 16(2023-2049):2, 2015. [10] Xue Li, Bo Du, Chang Xu, Yipeng Zhang, Lefei Zhang, and Dacheng Tao. R-svm+: Robust learning with privileged information. In IJCAI, pages 2411–2417, 2018. [11] John Lambert, Ozan Sener, and Silvio Savarese. Deep learning under privileged information using heteroscedastic dropout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8886–8895, 2018. [12] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016. 88 [13] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017. [14] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In null, pages 1735–1742. IEEE, 2006. [15] Miguel A Carreira-Perpinan and Geoffrey E Hinton. On contrastive divergence learning. In Aistats, volume 10, pages 33–40. Citeseer, 2005. [16] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015. [17] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [18] Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004. [19] Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004. [20] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3, 2016. [21] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016:41, 2016. [22] Samuel G Finlayson, Paea LePendu, and Nigam H Shah. Building the graph of medicine from millions of clinical narratives. Scientific data, 1:140032, 2014. [23] Fengyi Tang, Cao Xiao, Fei Wang, and Jiayu Zhou. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open, 1(1):87–98, 2018. [24] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015. 89 CHAPTER 5 ADVERSARIAL PRECISION SENSING 5.1 Introduction In many real-world applications, fully-observed datasets are difficult to obtain, especially for temporal data, where observing the full set of features for all samples across time is simply too expensive. A prime example of this phenomenon exists in the realm of medical informatics, specifically electronic health records (EHR) dominated by time-series data. 
Yet this hasn’t stopped EHR from sparking numerous research interests in recent years [1, 2, 3, 4, 5]. Time-series data in the EHR consists of features that are sampled at different levels of temporal granularity (e.g., lab tests are sampled at longer time-scales than vital signs). Most of the time, only a small subset of features are observed at any time-step, as it is prohibitively expensive for the physician to obtain the full set of features for every patient for every time interval. In practice, a form of active sensing [6] is implicitly done in an ad-hoc manner by physicians based on a combination of inpatient work-flow, resource constraints, and domain expertise. As a result, the key difficulty with EHR time-series data is that they almost always require re-sampling and imputation due to large amounts of missing values. Therefore it is no surprise that several recent works have shown that temporal models trained on EHR data have fragile decision boundaries that are susceptible to small perturbations [7, 8]. To address this issue, we formulate the precision sensing problem setting as one that takes in data-streams along the temporal dimension, rather than across samples. For example, consider the EHR setting, where at each time-step, the physician has to decide how to allocate a constrained amount of tests/resources among patients for some set of clinical tasks. In other words, the precision sensing problem for time-series is in fact a resource distribution problem at each time-step, with a fixed observation budget across a given set of samples. Therefore the querying process is applied across time rather than samples, and any formulation must reflect both the temporal-dependency and the budgetary-constraint of the querying process. 90 In this work, we formulate the active sensing problem as a minimax game between two players: • A Feature Sensor (FS) that generates a sensing tensor 𝐴 over a temporal data-stream in an online manner. • A Progressive Learner (PL) that adjusts the decision boundary at each time-step to handle sparser sets of features, allowing the FS model to improve its sensing strategy toward sparser selections. We then prove that our formulation reaches a local Nash equilibrium during co-training. Afterwards, we evaluate the quality of the dynamic sensing strategy by training a diverse set of classifiers on the resulting set of data with incomplete features and compare their performances against a corresponding set of hypothesis models conditioned on the original dataset. Our results show that classifiers conditioned on dynamically sensed data, i.e., incomplete data, are able to maintain prediction accuracy, despite adhering to a strict budgetary constraint for sensing. We show that unlike attention and perturbation models, the sensing strategy of the FS-Model can be used to train many types of models, including deep models, shallow models, and even non-temporal models, with minimal decrease in performance. In experiments, we demonstrate how our budget-adaptive framework can be used to produce an active data acquisition strategy that can be implemented in hospitals to deal with feature sensing under moving budgetary constraints. 5.2 Problem Formulation 5.2.1 Precision Sensing for Time-Series Data Classic time-series data comes in the form: 𝑋 = {𝑥𝑖 }𝑖=1 𝑚 , 𝑥 = {𝑥 (𝑡) }𝑇 , 𝑥 (𝑡) ∈ R𝑛 , where 𝑚 is the 𝑖 𝑖 𝑡=1 𝑖 number of samples, 𝑇 the number of time-steps, and 𝑛 the number of features. 
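To fix the notation, the sketch below shows one possible in-memory layout for such data (an assumption made for illustration, not code from this thesis): the batch is stored as a single (m, T, n) array, so indexing by sample recovers x_i while indexing by time recovers the per-time-step feature matrix used in the precision sensing formulation that follows, and a same-shaped binary tensor can mark which entries are actually observed.

```python
import numpy as np

m, T, n = 1000, 48, 19              # samples, time-steps, features (illustrative sizes)
X = np.random.randn(m, T, n)        # classic layout: X[i] is one sample's time series

x_i = X[0]          # x_i = {x_i^(t)}, shape (T, n): one patient across time
X_t = X[:, 0, :]    # all samples at a single time-step, shape (m, n)

# A binary tensor of the same shape marks which entries are actually observed.
A = np.ones((m, T, n), dtype=np.int8)   # A[i, t, j] = 1 means feature j is sensed
X_sensed = X * A                        # element-wise masking of the data by the mask
```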
Precision sensing n o𝑚 considers the case where 𝑋 = 𝑥𝑖(𝑡) (𝑡) , 𝑥𝑖(𝑡) ∈ R𝑛 , is a feature matrix across the entire batch of 𝑖=1 samples at time step 𝑡. At each time-step we are interested in finding a corresponding sensing matrix 𝐴 (𝑡) ∈ {0, 1} 𝑚×𝑛 over the data matrix. During the observation period, a cumulative sensing tensor 91 𝐴 = {𝐴 (𝑡) }𝑇𝑡=1 is incrementally constructed according to the dynamics: 𝐴 (𝑡+1) = 𝐺 (𝑋 (1) 𝐴 (1) , . . . , 𝑋 (𝑡) 𝐴 (𝑡) ), (5.1) for some update function 𝐺 (.). Alternatively we can view 𝐴𝑖 = {𝐴𝑖(𝑡) }𝑇𝑡=1 as the subset of sensed features across time for each sample. Under this view, we can formulate the precision sensing objective as follows: Definition 2. (Precision Sensing) The problem of precision sensing seeks a sensor tensor 𝐴∗ that minimizes the empirical risk (ERM) [9] according to: 𝐴∗ ∈ argmin E𝑃(𝑋,𝑌 ) [𝐿 (ℎ( 𝐴𝑖 𝑋𝑖 ), 𝑦𝑖 )] 𝐴𝑖 ∈A 1 Õ ≈ argmin 𝐿 (𝑦𝑖 , ℎ( 𝐴𝑖 𝑋𝑖 )), 𝐴𝑖 ∈A 𝑚 (𝑋𝑖 ,𝑦 𝑖 )∈D where ℎ is a hypothesis class that maps ℎ : 𝑋 → 𝑌 , and 𝐿(.) is a risk function that evaluates the hypothesis mappings against the actual label. Here, A denotes the space of sensing matrices, and each matrix 𝐴𝑖 can be thought of as a per-sample sensing matrix. From this perspective 𝐴𝑖 describes the subset of sensed features across time for each sample. Remark 1. Ideally the optimal sensing tensor 𝐴 generates a partially-observed dataset: {( 𝑋˜ 𝑖 , 𝑦𝑖 )}, 𝑋˜ 𝑖 = 𝐴𝑖∗ ˜ 𝑌 ) ≈ 𝑃(𝑋, 𝑌 ). That is, the joint distribution of the dynamically sensed 𝑋𝑖 such that 𝑃( 𝑋, data produced by 𝐴∗ should match the joint distribution of the fully-observed dataset. The key difference between existing methods and our approach is in the manner in which 𝐴 is generated. In feature selection for time-series, 𝐴 is obtained by some form of sparse sensing technique that statically assigns the same subset of features to be observed for each 𝑋𝑖 . By contrast active sensing allows for 𝐴 to be built in an individualized manner for each sample, yet still actively sub-samples (sample, view) pairs during each co-training iteration. But since it only sub-samples a small subset of examples, active sensing fails to consider an overarching budgetary constraint across all the samples at each time-step. 92 Figure 5.1: Overview of problem setting. (a) Incremental generation of the sensing tensor. (b) Minimax game between FS and PL. 5.2.2 Budget-Constrained Precision Sensing as Adversarial Optimization At a high-level, the proposed adversarial framework works as follows: • At each time-step, FS tries to “discard”, based on observations from previous time-steps, features that lead to mis-classifcation. • At the end of each observation window, the PL adjusts the decision boundary so that the sensed features result in correct classification. Fig. 5.1a describes incremental generation of the sensing tensor 𝐴 by the FS. 𝐴 is generated recurrently by applying the sensor dynamics 𝐺 (.), which is learned by the FS, where 𝐴 (𝑡) = 𝐺 ( 𝑋˜ (1) , . . . , 𝑋˜ (𝑡) ) outputs 𝐴 (𝑡+1) at each time-step. Fig. 5.1b describes the minimax game between the FS and PL. PL (top) tries to minimize the classification error with sensed features 𝑋 𝐴 while FS (bottom) tries to allocate 𝐴 such that unsensed features lead to misclassification. At the end of the observation window, the PL adjusts the parameters of 𝐹 to handle the updated input 𝑋 𝐴, which now has a different sparsity pattern than before, as shown in Fig. 5.1b. 
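To make the incremental generation of Fig. 5.1a concrete, the following is a minimal sketch of the roll-out loop. It is an illustration rather than the actual FS implementation: `sensor_step` is a hypothetical stand-in for the learned update function G(.), the fully-observed first time-step mirrors the initialization described later in this chapter, and the 0.5 threshold is an assumption.

```python
import numpy as np

def roll_out_sensing(X, sensor_step, threshold=0.5):
    """Incrementally build the sensing tensor A over an (m, T, n) batch X.

    `sensor_step(history)` plays the role of G(.): it maps the list of already
    sensed slices [X^(1)*A^(1), ..., X^(t)*A^(t)] to scores in [0, 1] that
    determine the next time-step's sensing matrix A^(t+1).
    """
    m, T, n = X.shape
    A = np.zeros((m, T, n), dtype=np.int8)
    A[:, 0, :] = 1                               # observe everything at t = 1
    history = [X[:, 0, :] * A[:, 0, :]]
    for t in range(1, T):
        scores = sensor_step(history)            # shape (m, n), values in [0, 1]
        A[:, t, :] = (scores > threshold).astype(np.int8)
        history.append(X[:, t, :] * A[:, t, :])
    return A

# Toy sensor: keep a feature only if its last sensed value was positive.
toy_sensor = lambda hist: (hist[-1] > 0).astype(float)
A = roll_out_sensing(np.random.randn(8, 5, 3), toy_sensor)
```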
At equilibrium the PL will no longer be able to improve its performance, provided that the FS has selected an optimal subset of features to sense such that accuracy improvement is no longer possible given the observation budget. Max-Margin Objective. More formally, the FS tries to solve the optimization problem:  max 𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 , 0 𝜃𝐺 Õ s.t. 𝐴 (𝑡) ∈ {0, 1} 𝑚×𝑛 , 𝐴𝑖(𝑡) 𝑗 ≤ 𝐵 (5.2) 𝑖, 𝑗,𝑡 93 where 𝐺 (𝑋) = [𝐺 ( 𝑋˜ (1) ), 𝐺 ( 𝑋˜ (2) ), . . . , 𝐺 ( 𝑋˜ (𝑇) )] = 𝐴. When 𝐴𝑖(𝑡)𝑗 = 0, the j-th feature is masked for the i-th sample at time 𝑡. We can think of 𝑋ˆ = 𝑋 − 𝑋 𝐴 as a subset of features that are discarded by FS. Thus, 𝑋˜ = 𝑋 𝐴 is the resulting set of sensed (unmasked) features at the end of the observation window. During training, the FS model ranks 𝐹 (𝑋 − 𝑋 𝐴)¬𝑦 higher than 𝐹 (𝑋 − 𝑋 𝐴) 𝑦 .We use ¬𝑦𝑖 to denote the case where the hypothesis produces the wrong class for the 𝑖−th sample. 𝐵 denotes the budget constraint on 𝐴, and 𝐹 (.)𝑐 denotes the 𝑐-th logit of the hypothesis model output. 5.3 Proposed Method Derivation of Eqn. 5.2 First, let us consider the popular adversarial setting of perturbation models described in [10]: minimize 𝐷 (𝑋, 𝑋 + 𝛿) s.t. 𝐶 (𝑋 + 𝛿) ≠ 𝑦 𝑋 + 𝛿 ∈ [0, 1] 𝑛 (5.3) where 𝑋 is the original set of features, 𝛿 is the perturbation to be applied to the feature space, 𝐷 (.) is a distance metric, and 𝐶 (.) is a classifier. Under the perturbation setting, the 𝐶 (𝑋 + 𝛿) ≠ 𝑦 term finds an adversarial set of samples, 𝑋 + 𝛿, leading to misclassification, while 𝐷 (𝑋, 𝑋 + 𝛿) is a reconstruction term that limits the size of 𝛿 such that it is not trivially large. Recent work (e.g., [7, 8]) leverages variants of this approach for ranking feature importance in complex feature spaces. We develop a different set of reconstruction and modification objectives under a similar setting: minimize L (𝐶 (𝑋 𝐴), 𝑦) s.t. 𝐶 (𝑋 − 𝑋 𝐴) ≠ 𝑦, Õ ∀𝑖, 𝑗, 𝑡 : 𝐴𝑖(𝑡) 𝑗 ∈ {0, 1}, & 𝐴𝑖 𝑗 𝑘 ≤ 𝐵 (5.4) 𝑖𝑗𝑘 where 𝐶 is the classifier and L a loss function on (𝑋, 𝑦) pairs. 𝐴 denotes a “sensing tensor” that gives a component-wise indicator for whether feature 𝑗 is sensed at time 𝑡 for sample 𝑖. Additionally, 94 𝐵 gives a “budget constraint” for the sensing tensor 𝐴. Rather than reconstructing 𝑋 ≈ 𝑋 + 𝛿, we learn the construction of 𝐴 based on the recovery of 𝑃(𝑦|𝑋) ≈ 𝑃(𝑦|𝑋 𝐴) through 𝐶. We also update 𝐶 and 𝐴 at different rates—𝐶 is updated at the end of observation windows while 𝐴 is updated every time-step. Thus, we formulate the active sensing problem of interest under a single mini-max objective: min max {𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 }+ 𝜃 𝐹 𝜃𝐺 | {z } Feature Sensing + 𝐶𝐸 𝑦 (𝐹 (𝑋 𝐺 (𝑋)) ) + 𝛽||𝑋 − 𝑋 𝐺 (𝑋)|| 1 . (5.5) | {z } | {z } Classifier Reconstruction Budget Management Here 𝐹 corresponds to 𝐶 and 𝐺 corresponds to a function to be learned from 𝐴. Meanwhile, 𝐶𝐸 𝑦 is the cross-entropy loss w.r.t. 𝑦: 𝑚 1 Õ 𝐶𝐸 𝑦 (𝑥) = − [𝑦𝑖 log 𝑥𝑖 + (1 − 𝑦𝑖 ) log (1 − 𝑥𝑖 )] . 
(5.6) 𝑚 𝑖=1 5.3.0.1 Feature Sensor We can decompose the update dynamics of the sensor generation process (𝐺) into the following components: ℎ𝑖(0) = max{0, 𝑊𝑥ℎ [𝑋𝑖(0) ; 0]} (Initialization) 𝑧𝑖(𝑡) = max{0, 𝑊𝑥𝑧 [ 𝑋˜ 𝑖(𝑡) ; 𝐶 𝑡 ] + 𝑊𝑧ℎ ℎ𝑖(𝑡−1) } (Update Gate) 𝑟𝑖(𝑡) = max{0, 𝑊𝑥𝑟 [ 𝑋˜ 𝑖(𝑡) ; 𝐶 𝑡 ] + 𝑊𝑟 ℎ ℎ𝑖(𝑡−1) } (Reset Gate) n  o ˜ℎ (𝑡) = tanh 𝑊𝑥ℎ [ 𝑋˜ (𝑡) ; 𝐶 𝑡 ] + 𝑊ℎℎ 𝑟 (𝑡) ℎ (𝑡−1) (Transition) 𝑖 𝑖 𝑖 𝑖   ℎ𝑖(𝑡) = 1 − 𝑧𝑖(𝑡) ℎ𝑖(𝑡−1) + 𝑧𝑖(𝑡) ℎ˜ 𝑖(𝑡) (State Update)   𝐴𝑖(𝑡+1) = 𝜎 𝑊ℎ ℎ𝑖(𝑡) + 𝑏 𝑎 (Sensor Output) Õ Õ 𝐶 𝑡 = 𝐶 𝑡−1 + 𝑐 𝑗 𝐴𝑖(𝑡) 𝑗 (Budget Update) 𝑖 𝑗 95 where 𝐶 𝑡 denotes the total observation cost at time 𝑡, corresponding to the total number of observations in [ 𝐴 (1) , . . . , 𝐴 (𝑡) ] across all samples. 𝑐 𝑗 denotes the cost of observing feature 𝑗 (e.g., certain medical tests may be more costly to observe than others). [𝑎; 𝑏] ∈ R𝑛+1 denotes the concatenation of vector 𝑎 ∈ R𝑛 and scalar 𝑏 ∈ R. 𝜎(.) denotes a thresholding function (e.g., sigmoid activation) that maps the output of each 𝐴𝑖(𝑡) 𝑗 ∈ [0, 1] feature component to {0, 1}, which is done to satisfy the integer constraint in Eqn. 5.2 that 𝐴𝑖(𝑡) 𝑗 ∈ {0, 1}, ∀𝑖, 𝑗, 𝑡. ℎ𝑖(𝑡) serves as an internal state representation that considers every sensed features vector 𝑋˜ 𝑖(1) , . . . , 𝑋˜ 𝑖(𝑡) up to time 𝑡 for sample 𝑖, as well as the observation cost across samples at each time-step, i.e., 𝐶 1 , . . . , 𝐶 𝑡 . The gates 𝑧 and 𝑟 control the update rule of the memory state to prevent gradient vanishing [11]. The weights 𝜃 𝐺 = [𝑊𝑥ℎ , 𝑊𝑥ℎ , . . . , 𝑊ℎ𝑛 ] are learned by optimizing over the feature sensing portion of the objective function in Eqn. 5.5: max {𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 }+ (5.7) 𝜃𝐺 5.3.0.2 Progressive Learner The PL serves to stabilize the reference hypothesis model 𝐹 in response to the shift in sparsity patterns observed in 𝐺 (𝑋). We note the time-scale difference between 𝐹 and 𝐺 updates: whereas FS updates 𝐺 at every time-step 𝑡, the PL updates 𝐹 only at the end of an observation window 1, . . . , 𝑇. Before 𝐺 (.) is learned by the FS model, the decision function 𝐹 is first initialized on a small batch of fully-observed data {𝑋𝑖 , 𝑦𝑖 }𝑖=1 𝑘 ⊂ D train , 𝑘  𝑚 to obtain an initial approximation of the underlying 𝑃(𝑌 |𝑋). Over time, the PL is gradually conditioned on 𝐴 𝑋, with a progressively sparser 𝐴. To achieve this conditioning effect, we utilize a schedule sampling technique similar to [12], where we pick the input to 𝐹 based on a coin toss (𝐻):    𝐹 (𝑋) 𝑃(𝐻 = 1) = 𝜀,    𝐹 (.) = (Scheduled Sampling)   𝐹 (𝐺 (𝑋) 𝑃(𝐻 = 0) = 1 − 𝜀.   𝑋)  96 Here, 𝜀 denotes the probability of using the fully-observed 𝑋 for training, and 1 − 𝜀 is the probability of using the masked version. Initialize 𝜀 = 1 for the first 𝑘 iterations steps to obtain an unbiased estimation. After 𝑘 iterations, decay 𝜖 according to: 𝜀 = 𝑡 − 𝑘/(𝑡 − 𝑘 + exp(𝑘 − 𝑡)). (𝜀-Decay) By decaying 𝜀, the training distribution of the PL model shifts gradually from the fully-observed to the sensed features. We pick a baseline 𝜀 = min{𝑡 − 𝑘/(𝑡 − 𝑘 + exp(𝑘 − 𝑡)), 0.25} to ensure that PL gets exposed to at least a small fraction of the fully-observed 𝑋 during each training iteration. This allows the PL to obtain a good initialization on the feature importance. The PL adjusts the decision boundary by minimizing over the classifier reconstruction portion of Eqn. 5.5: min 𝐶𝐸 𝑦 (𝐹 (𝐺 (𝑋) 𝑋)) (5.8) 𝜃𝐹 where each 𝐹 (𝑋𝑖 ) is sampled from Eqn. Scheduled Sampling. 
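The scheduled-sampling mechanism can be summarized in a few lines. Because the exact ε-Decay expression is difficult to recover from the typeset formula above, the schedule below uses one plausible inverse-sigmoid decay in the spirit of [12] rather than the precise rule; the 0.25 floor reflects the requirement that the PL keep seeing at least a small fraction of fully-observed batches.

```python
import numpy as np

def fully_observed_prob(t, k, floor=0.25):
    """Probability eps of feeding the PL the fully-observed X at iteration t."""
    if t <= k:
        return 1.0                      # warm-up: always use fully-observed X
    # Inverse-sigmoid decay (an assumption, not the thesis's exact formula).
    eps = k / (k + np.exp((t - k) / k))
    return max(floor, eps)              # keep at least a 25% chance of full X

def pl_forward(F, X, X_sensed, t, k, rng):
    """Scheduled sampling: a coin flip picks the PL's input for this batch."""
    if rng.random() < fully_observed_prob(t, k):
        return F(X)                     # fully-observed features
    return F(X_sensed)                  # sensed (masked) features G(X) * X

rng = np.random.default_rng(0)
print([round(fully_observed_prob(t, k=100), 3) for t in (50, 100, 150, 300, 1000)])
# e.g. [1.0, 1.0, 0.984, 0.931, 0.25]: the PL gradually shifts toward sensed inputs.
```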
5.3.0.3 Budget-Manager The strength of budget constraint is updated at each iteration by dual gradient ascent [13] for 𝛽 according to Eqn. 5.9: 𝛽𝑡+1 = 𝛽𝑡 + 𝛼(𝑋 − 𝑋 𝐺 (𝑋)), (5.9) where 𝛼 denotes the learn rate for the dual variable 𝛽. We note that in real-world applications, the budget 𝛽 is usually set beforehand (e.g., by resource constraints). Thus, dual ascent on 𝛽 gives an option to continually shrink the observation budget until a minimum set of sensed features is realized. However, this may not be needed in a realistic setting where the budget constraint is constant. The intuition behind the 𝛽 term follows by considering its extreme values. When 𝛽 → 0, the max-margin objective dominates. Consequently, the FS is incentivized to sense every feature, i.e. 𝐴𝑖,(𝑡)𝑗 = 1, ∀𝑖, 𝑗, 𝑡, leading to the trivial solution where 𝑋 𝐴 → 𝑋. In contrast, as 𝛽 → ∞ the budget constraint dominates, and the FS is incentivized to choose 𝐴𝑖,(𝑡)𝑗 = 0, ∀𝑖, 𝑗, 𝑡, leading to an empty feature set. Thus, the 𝛽 term serves as a trade-off parameter between 𝐴’s sparsity and sensing performance. The training loop for FS, PL, and budget-manager is summarized in Algorithm 2. 97 Algorithm 2 Co-training of FS and PL. 1: for number of total training iterations do 2: for 𝑘 training steps do 3: Sample mini-batch of 𝑚 samples {(𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 from training data. 4: Generate 𝑚 samples using 𝐺: 𝑚 {𝑋𝑖 − 𝑋𝑖 𝐺 (𝑋𝑖 ), 𝑦𝑖 }𝑖=1 . 5: Using Scheduled Sampling with 𝜀-Decay, update 𝜃 𝐹 by performing SGD on Eqn. 5.8 6: end for 7: for 𝑘 training steps do 8: Update 𝐺 parameters by projected gradient ascent on ∇𝜃 𝐺 max {𝑍 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 − 𝑍 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 , −𝜅}. 9: (Optional) Dual ascent on budget constraint 𝛽 according to 𝛽 ← 𝛽 + 𝛼(𝑋 − 𝑋 𝐺 (𝑋)). 10: end for 11: end for 5.3.0.4 Inference To perform inference, we train two sets of models for comparison. (1) A set of hypothesis models ℎ 𝑆 : 𝑋 ↦→ 𝑌 trained on the sensed dataset D𝑆 = {(𝐺 (𝑋𝑖 ) 𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 , generated using the FS. (2) A set of hypothesis models ℎ trained on the original dataset {(𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 . The difference in generalization error between ℎ 𝑆 and ℎ quantifies the feature sensing strategy learned by the FS. 5.4 Analysis of Proposed Method Optimizing FS, PL and Budget-Manager involves finding saddle points in a dynamic loss landscape. Here, we prove the existence of fixed points in the proposed minimax game and their correspondence with local Nash Equilibria (NE) [14]. At local NEs, FS and PL cannot improve allocation or accuracy by local changes in 𝐹 and 𝐺, respectively. 98 5.4.1 Existence of Local Nash Equilibria We can re-express Eqn. 5.5 into individual pay-off functions: • 𝐽𝐹 (𝜃 𝐹 ) = 𝐶𝐸 𝑦 [𝐹 (𝑋 𝐺 (𝑋))] gives the cost for the PL player. • 𝐽𝐺 (𝜃 𝐺 ) = {𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) 𝑦 − 𝐹 (𝑋 − 𝑋 𝐺 (𝑋))¬𝑦 }+ gives the cost for the FS, from the minimizer’s point of view. • 𝐽𝐵 (𝜃 𝐵 ) = Í 𝑖𝑗𝑘 𝐴𝑖 𝑗 𝑘 − 𝐵 defines the cost for the budget manager. Let us define 𝜉 := (𝜃 𝐹 , 𝜃 𝐺 , 𝛽)𝑇 as a set of actions for the PL, FS and budget manager. Here, we denote 𝜉1 = 𝜃 𝐹 ∈ Γ1 , 𝜉2 = 𝜃 𝐺 ∈ Γ2 , 𝜉3 = 𝛽 ∈ Γ3 as the subset of parameters in 𝜉 that individual players can modify at each iteration, and Γ𝑖 denotes the 𝑖−th player’s action space. We also define a potential function as a composition of the payoffs among the players: 𝜙𝑖𝑗 (𝜉) := max{0, 𝐽𝑖 (𝜉) − 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 )}, 𝑗 ∈ Γ𝑖 where 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ) defines the cost of another policy 𝑗 ≠ 𝜉𝑖 for the 𝑖−th player currently using 𝜉𝑖 parameters. 𝜉¬𝑖 indicating that other players retain the same actions. 
So the potential function defines the cost-of-switching from the current policy to another, defined over each player’s action space. Finally, let us define the best response function [14] for each player: 𝜉𝑖 + 𝜙𝑖𝑗 (𝜉) 𝑓𝑖 (𝜉𝑖 , 𝑗) := Í 1 + 𝑘∈Γ𝑖 (𝜙𝑖𝑘 (𝜉)) where 𝑗 ∈ Γ𝑖 corresponds to the alternative actions in the 𝑖-th action space. We now introduce the definition of Nash Equilibrium (NE) according to [14] for our problem: Definition 3. Nash Equilibrium. 𝜉 ∗ is a Nash Equilibrium point of 𝑓 (.) if it satisfies 𝐽𝑖 (𝜉𝑖 ) ≤ 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ), ∀𝑖, 𝑗. Lemma 1. Assuming that the domain of 𝜉 is continuous and convex, there exists a fixed point of 𝑓 , i.e., 𝑓𝑖 (𝜉𝑖∗ , 𝑗) = 𝜉𝑖∗ , ∀𝑖, 𝑗. 99 Proof. The proposed best response function maps from 𝑓 : Γ → Γ and is a continuous mapping. Furthermore, each sub-domain Γ𝑖 is convex and compact. Compactness can be achieved through regularization applied to 𝐹 and 𝐺, in the form of budget-constraint, which serves to bound the size of 𝜃 𝐺 according to a 𝛽. So each Γ𝑖 is closed and bounded, and by the Heine-Borel Theorem, each Γ𝑖 is compact in 𝑅 𝑛𝑖 . Thus, we can apply Brouwer’s fix point theorem from [15], and therefore ∃𝜉 ∗ s.t. 𝑓 (𝜉) at 𝑓 (𝜉 ∗ ) = 𝜉 ∗ . In other words, a fixed point exists. Lemma 2. A point 𝜉 ∗ is a fixed point of 𝑓 if and only if it is a Nash Equilibrium point. Proof. First, suppose 𝜉 ∗ is a NE point. Then we have 𝐽𝑖 (𝜉𝑖∗ ) ≤ 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ∗ ), ∀𝑖, 𝑗. So it follows that 𝜉𝑖∗ +0 ∀𝑖, 𝜙𝑖𝑗 (𝜉𝑖∗ ) = 0. Thus, we have 𝑓𝑖 (𝜉𝑖∗ , 𝑗) = 1+0 = 𝜉𝑖∗ , ∀𝑖, 𝑗, which defines a fixed point. Conversely, suppose 𝜉 ∗ is a fixed point, i.e., 𝑓 (𝜉 ∗ , 𝑗) = 𝜉 ∗ . Then for every 𝑖, there is at least one 𝜉𝑖∗ s.t. 𝐽𝑖 (𝜉𝑖∗ ) ≤ 𝐽𝑖 ( 𝑗, 𝜉¬𝑖 ∗ ) (by definition of fixed point). For such 𝜉 ∗ ’s, we have 𝜙𝑖𝑗 (𝜉 ∗ ) = 0 𝜉𝑖∗ + 0 =⇒ 𝑓𝑖 (𝜉𝑖∗ ) = 𝜉𝑖∗ = Í 𝑖 ∗ 1 + 𝑘 𝜙 𝑘 (𝜉 ) Õ =⇒ 𝜙𝑖𝑘 (𝜉 ∗ ) = 0 =⇒ 𝜙𝑖𝑘 (𝜉 ∗ ) ≡ 0, ∀𝑘 ≠ 𝑗 ∈ Γ𝑖 . 𝑘 We have shown that ∀𝜉 ∗ = 𝑓 (𝜉 ∗ ), such a point must also correspond to a NE point.  Theorem 1. There exists at least one Nash Equilibrium point 𝜉 ∗ such that each player’s best response does not deviate from 𝜉 ∗ with respect to the minimax objective in Eqn. 5.5. Furthermore, the fixed points of the minimax objective correspond to local Nash Equilibrium points. Proof. The result directly follows from combining Lemma 1 and 2 with Definition 3.  We note that although local NE’s exist under our formulation, we do not show admissibility [16] or interchangeability of the payoff values for such NE’s since the proposed game is non-zero sum. However, the NE property implies that small, local changes to the budget allocation and classifier decision boundary do not improve the performance under either objective. 100 5.4.2 Convergence of Algorithm 2 One interesting challenge of our problem is that 𝐺 and 𝐹 are updated at different time-scales. We now show that Algorithm 2 can converge to local NE’s, at least under some good initialization schemes around the neighborhood of NE’s. We summarize the main result below. Definition 4. The gradient operator for the FS and PL players is defined as follows: ©−∇𝜃 𝐹 𝐽𝐹 (𝜃 𝐹 ) ª ­ ® 𝜉 (𝑡+1) = 𝜉 (𝑡) + 𝛼𝜂(𝜉 (𝑡) ), 𝜂(𝜉) = ­𝜀∇𝜃 𝐺 𝐽𝐺 (𝜃 𝐺 ) ® ­ ® ­ ® ­ ® 0 « ¬ where 𝛼 is the learning rate hyperparameter, and 𝜀 > 0 denotes an off-set parameter that scales the update rate of the ∇𝜃 𝐺 component according to a faster time-scale. We can verify that when 𝜂¤ = 0 at some time 𝑇, we have 𝜂 ≡ 0 =⇒ 𝜉 (𝑡+1) = 𝜉 (𝑡), ∀𝑡 > 𝑇, which describes a fixed point 𝜉 ∗ . Theorem 2. 
Assuming the loss functions for each player is locally Lipschitz in 𝜉 and piecewise linear in 𝑡 near a Nash Equilibrium point, the error dynamics 𝜂¤ follows uniform asymptotic stability and converges toward the origin at a rate of: ||𝜂(𝑡)|| ≤ K (𝜆||𝜂(𝑡 0 )||, 𝑡 − 𝑡0 ), for some KL−class function (§4.4 in [17]) (K) and constant 0 < 𝜆 < ∞. Proof. Here, we make the following assumptions: • At a fixed budget constraint 𝛽, 𝜕𝛽 𝜕𝑡 = 0. 101 • 𝐽𝐹 and 𝐽𝐺 satisfy the following: 𝐽𝐹 (𝜃 𝐹 ) = 𝐿 𝐹 (𝐹 (𝑋 𝐺 (𝑋))) 𝐽𝐺 (𝜃 𝐺 ) = 𝐿 𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) ∇𝜃 𝐹 𝐽𝐹 (𝜃 𝐹 ) = 𝐿 0𝐹 (𝐹 (𝑋 𝐺 (𝑋))) · ∇𝜃 𝐹 𝐹 (𝑋 𝐺 (𝑋)) ∇𝜃 𝐺 𝐽𝐺 (𝜃 𝐺 ) = −𝐿 0𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) · ∇ 𝑋 𝐹 (𝑋 − 𝑋 𝐺 (𝑋)) · ∇𝜃 𝐺 𝐺 (𝑋) where 𝐿 𝐹 and 𝐿 𝐺 are loss functions according to the labeled sample pairs {(𝑋𝑖 , 𝑦𝑖 )}𝑖=1 𝑚 . • 𝐿 𝐹 and 𝐿 𝐺 are bounded, locally Lipschitz in 𝜉 and piecewise continuous in 𝑡 near a local fixed-point equilibrium. We can thus express the error dynamics 𝜂¤ (i.e., the update rule of 𝜂) as follows © −∇2𝜃𝐹 𝐽𝐹 (𝜃 𝐹 ) −𝜀∇ 𝜃𝐹 , 𝜃𝐺 𝐽𝐹 (𝜃 𝐹 ) ª 𝜂¤ = ­­ ®𝜂 ® 𝜀∇ 𝜃𝐺 , 𝜃𝐹 𝐽𝐺 (𝜃 𝐺 ) 2 2 𝜀 ∇ 𝜃𝐺 𝐽𝐺 (𝜃 𝐺 ) « ¬ −∇2𝜃𝐹 𝐽𝐹 (𝜉 𝐹 ) = −𝐿 𝐹00 (𝐹 (𝑋 𝐺 (𝑋))) · ∇ 𝜃𝐹 𝐹 (𝑋 𝐺 (𝑋)) · ∇ 𝜃𝐹 𝐹 (𝑋 𝐺 (𝑋))𝑇 −∇ 𝜃𝐹 , 𝜃𝐺 𝐽𝐹 (𝜉 𝐹 ) = −𝐿 𝐹0 (𝐹 (𝑋 𝐺 (𝑋))) · ∇ 𝜃𝐹 𝐹 (𝑋 𝐺 (𝑋)) · (∇𝑋 𝐹 (𝑋 𝐺 (𝑋))) · ∇ 𝜃𝐺 𝐺 (𝑋))𝑇 0 ∇ 𝜃𝐺 , 𝜃𝐹 𝐽𝐺 (𝜉𝐺 ) = −𝐿 𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) · ∇𝑋 𝐹 (𝑋 𝐺 (𝑋)) · ∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐹 𝐹 (𝑋 − 𝑋 𝐺 (𝑋))𝑇 00 ∇2𝜃𝐺 𝐽𝐺 (𝜉𝐺 ) = −𝐿 𝐺 (𝐹 (𝑋 − 𝑋 𝐺 (𝑋))) (∇𝑋 𝐹 (𝑋 𝐺 (𝑋)) · ∇𝑋 𝐹 (𝑋 𝐺 (𝑋)))𝑇 ∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐺 𝐺 (𝑋)𝑇 Next, around the equilibrium point at the origin, let us assume the following properties about ∇𝑋 𝐹 (𝑋), ∇ 𝜃𝐹 𝐹 (𝑋), ∇ 𝜃𝐺 𝐺 (𝑋) , i.e., the back-propagation terms: =⇒ ∃0 < 𝑐 1 < ∞ : ∇ 𝜃𝐹 𝐹 (𝑋) ≤ 𝑐 1 ||∇ 𝜃𝐹 𝐽𝐹 (𝜃 𝐹 )|| = 𝑐 1 ||𝜂1 || =⇒ ∃0 < 𝑐 2 < ∞ : ∇ 𝜃𝐺 𝐺 (𝑋) ≤ 𝑐 2 ||∇ 𝜃𝐺 𝐽𝐺 (𝜃 𝐺 )|| = 𝑐 2 ||𝜂2 || =⇒ ∃0 < 𝛾 < ∞ : ||∇𝑋 𝐹 (𝑋)|| ≤ 𝛾. 102 Let us also assume that ∃𝑐 3 > 0 : ||𝐿 𝑖0 (𝜉)|| < 𝑐 3 , and ∃𝑐 4 > 0 : ||𝐿 𝑖00 (𝜉)|| < 𝑐 4 , that is the first and second order changes are bounded around the origin. We can thus re-express 𝜂¤ as follows: −∇ 𝜃𝐹 , 𝜃𝐺 𝐽𝐹 (𝜉 𝐹 ) = ∇ 𝜃𝐹 𝐽𝐹 (𝜉 𝐹 ) · ∇𝑋 𝐹 ( 𝑋) ˜ · ∇ 𝜃𝐺 𝐺 (𝑋)𝑇 ∇ 𝜃𝐺 , 𝜃𝐹 𝐽𝐺 (𝜉𝐺 ) = −∇ 𝜃𝐺 𝐽𝐺 (𝜉𝐺 )∇ 𝜃𝐹 𝐹 ( 𝑋) ˆ 𝑇 00 ∇2𝜃𝐺 𝐽𝐺 (𝜉𝐺 ) = −𝐿 𝐺 (𝐹 ( 𝑋)) ˆ · (∇𝑋 𝐹 ( 𝑋)∇ ˆ 𝑋 𝐹 ( 𝑋))ˆ 𝑇 ·∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐺 𝐺 (𝑋)𝑇 . © −𝐿 𝐹00 (𝐹 ( 𝑋)) ˜ · ∇ 𝜃𝐹 𝐹 ( 𝑋)∇ ˜ 𝜃𝐹 𝐹 ( 𝑋) ˜ 𝑇 𝜂1 + 𝜀∇𝑋 𝐹 ( 𝑋)˜ · ∇ 𝜃𝐺 𝐺 (𝑋)𝑇 𝜂1 𝜂2 ª ­ ® =⇒ 𝜂¤ = ­ ­ ® ® ­ ® ­ ® −𝜀∇ 𝐹 ( ˆ 𝑋) 𝑇 𝜂 𝜂 − 𝜀 2 𝐿 00 (𝐹 ( ˆ 𝑋)) · (∇ 𝐹 ( ˆ 𝑋)∇ 𝐹 ( ˆ 𝑋)) 𝑇 ∇ 𝐺 (𝑋)∇ 𝐺 (𝑋) 𝑇 𝜂 « 𝜃 𝐹 1 2 𝐺 𝑋 𝑋 𝜃𝐺 𝜃𝐺 2 ¬ Finally, we let us consider the Lyapunov function candidate: 1 1 𝑉 (𝜂) = 𝜂12 + 𝜂22 2 2 which satisfies the property 𝑉 (𝜂) > 0, ∀𝜂 ≠ 0. We can also see that 𝑉¤ (𝜂) < 0: 𝜕𝑉 (𝜂) 𝑇 𝜕𝜂 𝑉¤ (𝜂) = 𝜕𝜂 𝜕𝑡 𝑇 ©𝜂1 ª ©𝜂¤1 ª = ­­ ®® ­ ® ­ ® 𝜂 𝜂¤ « 2¬ « 2¬ = −𝐿 𝐹00 (𝐹 ( 𝑋)) ˜ · ∇ 𝜃𝐹 𝐹 ( 𝑋)∇ ˜ 𝜃𝐹 𝐹 ( 𝑋) ˜ 𝑇 𝜂2 + 𝜀∇𝑋 𝐹 ( 𝑋) 1 ˜ · ∇ 𝜃𝐺 𝐺 (𝑋)𝑇 𝜂2 𝜂2 1 − [𝜀∇ 𝜃𝐹 𝐹 ( 𝑋) ˆ 𝑇 𝜂1 𝜂2 − 𝜀 2 𝐿 00 (𝐹 ( 𝑋)) ˆ · (∇𝑋 𝐹 ( 𝑋)∇ ˆ 𝑇 ∇ 𝜃𝐺 𝐺 (𝑋)∇ 𝜃𝐺 𝐺 (𝑋)𝑇 𝜂2 ] ˆ 𝑋 𝐹 ( 𝑋)) 2 𝐺 2 ≤ −𝑐 3 𝑐21 𝜂14 − (𝑐 1 − 𝛾𝑐 2 )𝜂12 𝜂22 − 𝜀 2 𝑐 4 𝑐22 𝛾 2 𝜂24 . Letting 𝜆 = min{𝑐 3 𝑐21 , 𝜀 2 𝑐 4 𝑐22 𝛾 2 }, we have 𝑉¤ (𝜂) ≤ −𝜆||𝜂|| 4 − (𝑐 1 − 𝛾𝑐 2 )𝜀𝜂12 𝜂22 𝛾𝑐 2 𝑉¤ (𝜂) ≤ −𝜆||𝜂|| 4 < 0, ∀ < 1. 𝑐1 By Theorem 4.2 of [17], we have uniform asymptotic stability for the error dynamics around the NE, implying that the error term 𝜂(𝑡) → 0 at the rate of ||𝜂(𝑡)|| ≤ K (𝜆||𝜂(𝑡0 )||, 𝑡 − 𝑡0 ) 103 for some KL−class function K and constant 𝜆 = min{𝑐 3 𝑐21 , 𝜀 2 𝑐 4 𝑐22 𝛾 2 }. Theorem 4 assumes an initialization for 𝜉 that is close to a local equilibrium point 𝜉 ∗ , which motivates the use of scheduled sampling on the PL model inputs as a pre-training step. 
We also allow a fully-observed 𝐴𝑖(1)𝑗 = 1, ∀𝑖, 𝑗 at 𝑡 = 1 so that the FS observes a full set of features, as a good initialization of the generation process depends heavily on the initially observed features. 5.5 Experiments To gain intuition into the precision sensing mechanism, we first compare with several classic feature selection schemes on synthetic data. We then compare performance of precision sensing with SOTA active sensing methods on real-world medical data. 5.5.1 Synthetic Experiments We evaluate the ability of the budget-constrained precision sensing (BCPS) to retrieve the underlying temporal patterns on a set of synthetic time-series data. We generate a synthetic dataset consisting 1000 , 𝑋 ∈ R100×10 (100 time-steps, 10 features), 𝑦 ∈ {0, 1}. of time-series data with labels: {𝑋𝑖 , 𝑦𝑖 }𝑖=1 𝑖 𝑖 The temporal signals were adapted from the UCI ML dataset [18], specifically the Pseudo Periodic Synthetic Time Series Data Set. This dataset features 100, 000 samples of univariate time-series data, generated from the stochastic periodic function: 7 Õ 1 𝑓 (𝑥) = sin{2𝜋(22+𝑖 + rand(2𝑖 ))𝑐}. (0 ≤ 𝑐 ≤ 1) (5.10) 𝑖=3 2 We expand this dataset by applying the generator equation across multiple channels (10), each with a different initialization of 𝑐. We sub-sampled 1000 of the generated time-series and aligned the data such that each sample contained 100 time-steps and 10 features. We generate labels 𝑦𝑖 according to    1   ||𝐶 (𝑋𝑖 )|| > 0 𝑦𝑖 =  ||𝐶 (𝑋𝑖 )|| ≤ 0,  0   where 𝐶 (.) masks all features of 𝑋 across time except for certain features at certain times. For example, one configuration of 𝐶 may emphasize the 0 − 5𝑡ℎ features from time-steps 20 − 40 and the 9 − 10𝑡ℎ features from time-steps 60 − 80. All other features at all other time-steps are masked with 104 Figure 5.2: Recovery of feature relevance (𝐶) by BCPS on synthetic data. (a) Ground Truth (b) BCPS (c) Ground Truth (d) BCPS high negative noise sampled from a Gaussian distribution. Our goal is to recover the true underlying 𝐶 (unmasked features) using the BCPS framework. 5.5.1.1 Recovery of 𝐶 using BCPS Table 5.1 gives an overview of the BCPS performance in recovering the relevant features posed by 𝐶 on the synthetic dataset. Here, the sparsity-level refers to the percentage of masked features in 𝐶 (ground-truth). We see that the specificity of the recovered features by BCPS remains high, despite the increase in budgetary constraints. In general, the performance drops are not significant between different levels of sparsity in 𝐶. Figure 5.2 visualizes the performance of BCPS on several test-set samples with 80% sparsity in 𝐶. We see that BCPS chooses to allocate sensing to early time-steps, and then accurately locates the largest block of key observations in 𝐶 related to predicting the target. From there, BCPS decides whether to continue to observe future time-steps, as shown by Fig. 5.2d, or stop observations when it is confident in the prediction, as shown by Fig. 5.2b. Interestingly, BCPS is capable of locating the key observation blocks in 𝐶 despite the discontinuity of the key block from the original time-steps, as demonstrated by Figs. 5.2b and 5.2d. Table 5.1: BCPS recovery performance vs. 
sparsity of sensing budget on synthetic data Sparsity AUC F1 Sensitivity Specificity 20% 99.0 ± 0.0 98.1 ± 1 97.3 ± 1.1 99.7 ± 0.0 40% 98.2 ± 2.1 94.5 ± 3 90.2 ± 4.2 98.6 ± 1.0 60% 96.4 ± 1.1 91.7 ± 1 87.6 ± 3.1 97.9 ± 1.2 80% 94.4 ± 3.0 90.9 ± 1 84.4 ± 3.8 97.0 ± 1.1 105 Table 5.2: Comparison of predictive power of various sensing methods on synthetic data, under 80% sparsity in 𝐶 Model AUC F1 Sensitivity Specificity Temporal Feature Selection 89.5 ± 3.9 81.2 ± 4.0 77.6 ± 4.8 85.7 ± 2.8 Self-Attention 93.3 ± 3.4 88.3 ± 3.2 87.0 ± 3.5 94.4 ± 2.0 BCPS 94.0 ± 0.9 90.4 ± 0.8 84.5 ± 3.6 96.9 ± 1.1 5.5.1.2 Comparison Against Other Sensing Methods We compare BCPS against other sensing strategies on the synthetic dataset. Specifically, we investigate (1) classification difference between sensing mechanisms, and (2) how they differ in recovering feature contributions from 𝐶. We consider the following alternative models: • Temporal Feature Selection: GRU model with ℓ1 −regularization applied to input weights at each time-step. • Attention-Based Sensing: Self-attention model with a slight modification:   𝐴𝑖, 𝑗,𝑘 ≠ 0  1   𝐴𝑖, 𝑗,𝑘 =   0  Otherwise.  Because self-attention produces activations in the continuous [0, 1] range, we make this adjustment to use the non-zero activations as a proxy for feature utilization in self-attention. Table 5.2 compares the performance of BCPS against other sensing methods on the synthetic data. We fix the sparsity in 𝐶 to be 80% for all models and compare prediction performance and sensing patterns. As expected, BCPS and self-attention sensing strategies performed notably better than temporal feature selection since the relevant features of 𝐶 shifts across time. The difference in the allocation of observations between BCPS and attention-based sensing is relevant, so we further investigated it. In particular we compared the sensing patterns in BCPS and attention activations against the ground truth 𝐶 under a fixed sparsity pattern, see Table 5.2. Scores are averaged over 5 evaluation runs and reported in the format Avg. ± STD. We considered a 𝐶 with some interesting properties: 106 Figure 5.3: Heatmap comparisons of sensed features under a fixed 𝐶. Feature number 0-9 (x-axis) vs. Timesteps 0-96 (y-axis). (a) Ground Truth (b) BCPS (c) Attention Activations • 𝐶 has multiple, discontinuous blocks of key features whose observations are critical to the final prediction of 𝑦. • Different groups of features are prioritized in each key observation block. • Multiple groups of features can be prioritized at the same time-step. We see from Fig. 5.3 that only BCPS is capable of recovering the true underlying sets of key features in 𝐶. The attention activations, on the other hand, cannot locate the discontinuous blocks of contributory features, and as a result always uses a high observation budget, i.e., most features are observed frequently across time, regardless of the true underlying sparsity in 𝐶. We also see from Fig. 5.3b that although there exists a very large divide between the first block and future time-steps, the FS model sparingly allocates observations across the large gap of non-contributory features before allocating large amounts of observation resources toward the last few time-blocks. 5.5.2 MIMIC-III Experiments We also evaluate our framework on MIMIC-III [19], a publicly available electronic health records (EHR) dataset using temporal data collected from 22, 830 adult patients. For the prediction task, we choose In-Hospital Mortality as the target. 
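As a rough sketch of how a single admission could be turned into a training example for this task, assuming the temporal features have already been resampled to an hourly grid (the data-frame schema and function are hypothetical and stand in for the actual MIMIC-III extraction pipeline):

```python
import pandas as pd

OBS_WINDOW = 48  # first T = 48 hours of each admission are used as input

def build_mortality_example(hourly: pd.DataFrame, died_in_hospital: bool, feature_cols):
    """Assemble one (X_i, y_i) pair for the in-hospital mortality task.

    `hourly` is assumed to hold one admission's temporal features, resampled to
    an hourly grid and indexed 0, 1, 2, ... hours from admission time.
    """
    X_i = hourly.loc[:OBS_WINDOW - 1, feature_cols].to_numpy()  # hours 0..47 -> (48, n)
    y_i = int(died_in_hospital)        # label: death at any point after the window
    return X_i, y_i
```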
Given the first 𝑇 = 48 hours of observations (𝑥𝑖(1), . . . , 𝑥𝑖(𝑇)), mortality risk assessment predicts mortality risk from 𝑇 + 1 until the end of the hospital stay. Because patients have variable lengths of stay (LOS), we aligned the temporal sequences of patients by the start of their hospital admit time and excluded patients whose LOS was less than 24 hours. We considered a set of 19 temporal features, split between metabolic laboratory panels (7), specialized hematologic tests (6), and vital signs (6).

5.5.2.1 Baseline Classifiers

We consider a wide variety of hypothesis model classes for inference:
• Logistic Regression (LR): non-deep baseline.
• Multi-layer Perceptron (MLP): 3 fully-connected layers with 128 hidden units each.
• Gated Recurrent Units (GRU): a bi-directional GRU with 2 hidden layers and 128 nodes per layer.
• Self-Attention Model (Attn): a modified self-attention [20] model, where the inputs at each time-step are modified by an attention mechanism [21] that combines pertinent information from previous time-steps to attend to the most relevant part of the current feature space.

5.5.2.2 Performance on MIMIC-III

Table 5.3 summarizes the performance of the inference models conditioned on the original dataset. Scores are averaged over 5 evaluation runs and reported in the format Avg. ± STD. We see that the temporal information is highly relevant to the mortality task. This is indicated by the superior performance of the recurrent models (GRU and self-attention GRU), which achieve higher performance across all metrics compared to the non-temporal models (LR and MLP). The self-attention model achieves the highest baseline performance, though the vanilla GRU results are comparable.

Table 5.3: Baseline performance of inference models conditioned on the original data 𝑋
Model   AUC          F1           Sens.        Spec.
MLP     74.0 ± 5.5   37.4 ± 4.2   65.0 ± 7.8   72.8 ± 3.3
LR      58.4 ± 0.7   28.3 ± 1.9   18.8 ± 1.7   97.9 ± 0.3
GRU     84.7 ± 1.2   45.2 ± 2.6   78.4 ± 3.3   74.8 ± 4.1
Attn    85.1 ± 1.8   45.5 ± 1.6   76.0 ± 2.2   79.8 ± 2.6

Table 5.4: Performance of inference models conditioned on the masked data 𝑋 ⊙ 𝐴
Model   AUC          F1           Sens.        Spec.
MLP     74.2 ± 5.2   36.2 ± 5.1   69.6 ± 2.8   67.3 ± 6.2
LR      57.3 ± 0.7   25.6 ± 1.9   16.5 ± 1.5   98.2 ± 0.2
GRU     81.3 ± 2.2   42.4 ± 2.5   73.0 ± 5.4   74.4 ± 4.3
Attn    81.7 ± 1.2   41.1 ± 3.0   75.5 ± 3.2   72.6 ± 1.4
PL      82.9 ± 1.3   44.5 ± 2.4   74.3 ± 3.5   74.1 ± 1.2

5.5.2.3 Performance on Masked Data

Table 5.4 illustrates the performance of the same set of inference models trained and tested on the incrementally sensed data {𝑋𝑖 ⊙ 𝐺 (𝑋𝑖 ), 𝑦𝑖 }, 𝑖 = 1, . . . , 𝑚. The 𝐺 (𝑋) used for Table 5.4 is trained under a budgetary constraint of 58.6% of the available features across time. From Table 5.4, we see that both recurrent models (GRU and Attn) show a notable decrease in performance, with their AUCs dropping by 2–4%. Conversely, the non-temporal models maintain their performance. Again, scores are averaged across 5 evaluation runs. This contrast between temporal and non-temporal models reveals that 𝐺 (𝑋) specifically preserves the temporal patterns in 𝑋 related to the prediction task at hand. We also see that the PL achieves the highest performance on the sensed data, despite having the same architecture as the GRU models. This is likely because the PL is trained on a variety of sparsity patterns in 𝐺 (𝑋) during co-training.
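The protocol behind Tables 5.3 and 5.4 — train the same hypothesis class once on the original data and once on the sensed data, then compare held-out performance — can be sketched as follows. This is an illustrative re-implementation under stated assumptions (scikit-learn models and flattened inputs, mimicking the non-temporal baselines), not the experiment code used for the tables above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def evaluate_on_sensed(X, A, y, make_model, test_frac=0.2, seed=0):
    """Train/evaluate the same model class on the original X and on sensed X*A.

    X, A: arrays of shape (m, T, n); y: binary labels of shape (m,).  Temporal
    arrays are flattened, so this mimics the LR/MLP-style baselines rather
    than the recurrent ones.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(len(y) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]

    scores = {}
    for name, data in [("original X", X), ("sensed X*A", X * A)]:
        flat = data.reshape(len(y), -1)
        model = make_model().fit(flat[tr], y[tr])
        scores[name] = roc_auc_score(y[te], model.predict_proba(flat[te])[:, 1])
    return scores

# Usage with toy data; the AUC gap quantifies what the sensing strategy gives up.
X = np.random.randn(500, 48, 19)
A = (np.random.rand(500, 48, 19) > 0.4).astype(float)   # roughly 60% of entries sensed
y = (X[:, :, 0].mean(axis=1) > 0).astype(int)
print(evaluate_on_sensed(X, A, y, lambda: LogisticRegression(max_iter=1000)))
```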
5.5.3 Sparsity Trade-offs

We further investigate the relationship between the severity of the budget constraint, i.e., the sparsity of the generated sensing tensor, and the performance obtained on the resulting dynamically sensed data. Here, we define the budget cost as the percentage of features utilized during the observation period (𝑇 = 48 hours):

BudgetCost = Σ𝑖 Σ𝑗 Σ𝑘 𝐴𝑖,𝑗,𝑘 / (𝑚 × 𝑡 × 𝑛),

where we assume 𝑋 ∈ R𝑚×𝑡×𝑛. Fig. 5.4 summarizes the trade-off between performance and budget constraints. Notably, the budget cost decreases from 89.3% to 58.3% while the AUC and F1-scores decrease by less than 0.02. The greatest drop in performance occurs from 12.1% → 6.5%. This is presumably because at 6.5%, the FS allocates the observation budget to the first time-step only, sacrificing all of the temporal information within the data. Thus, we can see from Fig. 5.4 that 𝛽 ∈ [1e-6, 1e-4] represents an optimal trade-off range. In practice, the 𝛽-constraint is usually predetermined based on real-world constraints. For example, a well-funded hospital may have access to a large set of observations, say 𝛽 = 5e-6 (less constrained), whereas an under-funded clinic may only be able to observe at the level of 𝛽 = 5e-5 (highly constrained).

Figure 5.4: Trade-off between Budget Constraint (x-axis) vs. predictive power (y-axis). Budget denotes the strength of budgetary constraint 𝛽.

5.5.4 Comparison Against SOTA Methods

Finally, we compare BCPS against other SOTA sensing strategies for temporal data:
• Deep Sensing [22]: Active sensing is done through an adaptive sampling scheme. Inference is done using interpolation and imputation layers for the missing values.
• Dynamic Measurement Scheduling [23]: Active sensing is done by a dueling DQN network, and inference is done using a pre-trained recurrent network.

We implement Deep Sensing (both single and multiple model settings) with the hyperparameter settings described in the Appendix Section of [22]. For Dynamic Measurement Scheduling, we run Algorithm 3 without meta-data, i.e., without age, demographics, or disease history, to obtain the simulated dataset. In both cases, we use the same set of features and the same 48-hour observation window. For each active sensing model we perform the following:
1. Pre-train a GRU model.
2. Synthesize new datasets using the active sensing strategy.
3. Test inference models (M-RNN for Deep Sensing) on the synthesized datasets.

In Table 5.5, we report the best performance for each method with a 50–60% observation budget, which is the reported low end for the Deep Sensing work [22]. We use the 50–60% budget range because each work uses its own hyperparameter (e.g., threshold 𝜏 in [22] and 𝜆 in [23]), so it is difficult to translate the threshold parameters to exact percentages of sparsity. Thus, we report the best performance within the 50–60% observation range of each model for fairness. From Table 5.5, we can see that the combination of FS (for data generation) and PL (for adaptive classification) maintains the best level of performance at higher budget constraints. Dynamic Scheduling produces results comparable to Deep Sensing. We also plot the AUC-gain vs. sparsity trade-off for each approach in Figure 5.5. For that experiment, we fix a pre-trained GRU to deliver the predictions at each sparsity level for the data generated by each method. Our FS model produces measurements that maintain a higher level of performance at each sparsity level.
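For reference, the budget cost used throughout these comparisons is simply the fraction of sensed (sample, time-step, feature) entries, and the running cost 𝐶𝑡 from the Budget Update rule generalizes it to unequal per-feature costs 𝑐𝑗. A minimal sketch, assuming the sensing tensor is stored as an (m, T, n) array and with illustrative function names:

```python
import numpy as np

def budget_cost(A: np.ndarray) -> float:
    """Fraction of (sample, time-step, feature) entries sensed: sum(A) / (m*T*n)."""
    return float(A.mean())

def cumulative_cost(A: np.ndarray, feature_costs: np.ndarray) -> np.ndarray:
    """Running observation cost C_t when features have unequal costs c_j
    (e.g. some lab tests are more expensive); returns one value per time-step."""
    per_step = (A * feature_costs).sum(axis=(0, 2))   # cost incurred at each t
    return np.cumsum(per_step)

A = (np.random.rand(500, 48, 19) > 0.4).astype(float)
print(f"budget cost: {budget_cost(A):.1%}")
print(cumulative_cost(A, feature_costs=np.ones(19))[-1])  # equals A.sum() with unit costs
```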
It is also notable that without variance reduction (e.g., using multiple models), the imputation-prediction (M-RNN) framework of Deep Sensing has a large drop off, especially with decreasing observation budget. Dynamic Scheduling performs best at around 25%, which is consistent with the reported performance in [23]. However, we note that its performance plateaus at increasing levels of observation. Table 5.5: Comparison of precision sensing against SOTA active sensing method with sparsity constraint of 50 − 60% Sensing Model Inference AUC F1 Deep Sensing [22] M-RNN 76.0 38.5 Dynamic Scheduling [23] GRU 78.7 41.3 Ours (FS) Ours (PL) 82.9 44.5 111 Figure 5.5: Comparison of AUC-gain against SOTA models. 5.6 Conclusion In this chapter we formalize the precision sensing problem as an extension of active sensing and illustrated its application in mortality risk prediction. Our formulation consists of a minimax game between a dynamic feature sensor and an online classifier that reaches a local Nash equilibrium during co-training. We empirical show that the sensing strategy is capable of maintaining predictive performance while satisfying budget constraints. However, one limitation is that although we show local Nash equilibriums exist, we cannot make statements about their admissibility and interchangeability in payoff values [16]. Future works might consider other scheduled sampling techniques for the PL model, which can greatly affect training stability and sample complexity under the proposed setting. 112 BIBLIOGRAPHY 113 BIBLIOGRAPHY [1] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 432–440. SIAM, 2016. [2] Fengyi Tang, Cao Xiao, Fei Wang, and Jiayu Zhou. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open, 2018. [3] Abhyuday N Jagannatha and Hong Yu. Bidirectional rnn for medical event detection in electronic health records. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2016, page 473. NIH Public Access, 2016. [4] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016. [5] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017. [6] Shipeng Yu, Balaji Krishnapuram, Romer Rosales, and R Bharat Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009. [7] Mengying Sun, Fengyi Tang, Jinfeng Yi, Fei Wang, and Jiayu Zhou. Identify susceptible locations in medical records via adversarial attacks on deep predictive models. arXiv preprint arXiv:1802.04822, 2018. [8] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Ead: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [9] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [10] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. 
In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017. [11] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998. 114 [12] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015. [13] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011. [14] John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951. [15] Ky Fan. Fixed-point and minimax theorems in locally convex topological linear spaces. Proceedings of the National Academy of Sciences of the United States of America, 38(2):121, 1952. [16] João P Hespanha. Noncooperative game theory: An introduction for engineers and computer scientists. Princeton University Press, 2017. [17] Hassan K Khalil. Nonlinear control. Pearson New York, 2015. [18] Arthur Asuncion and David Newman. Uci machine learning repository, 2007. [19] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3, 2016. [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. [21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015. [22] Jinsung Yoon, William R Zame, and Mihaela van der Schaar. Deep sensing: Active sensing using multi-directional recurrent neural networks. 2018. [23] Chun-Hao Chang, Mingjie Mai, and Anna Goldenberg. Dynamic measurement scheduling for event forecasting using deep rl. arXiv preprint arXiv:1901.09699, 2019. 115 CHAPTER 6 PERSONA AUTHENTICATION 6.1 Introduction In recent years, one promising approach to diverse and personalized dialog generation has been persona models [1, 2, 3] which embed the so-called “persona” information (e.g., name, gender, and self-descriptions) into neural conversational agents. While the goal of persona modeling is to achieve human-level response diversity and character consistency, a critical yet often overlooked factor is the sequence of prompts used to induce diversity in generated responses. Consider the toy example in Table 6.1. Certain sequences of question prompts (from the verifier) create a trail of generated responses that reveal more persona information, compared to non-specific conversation (the random policy). Yet it is unclear a priori which sequence(s) of questions most effectively reveal the dialog agent’s underlying persona. Moreover, a set of questions may be effective for one persona but fail for others. Currently, long interactions with humans are necessary to gain insight into persona model characteristics such as authenticity [3, 4], diversity [2], and engagement [3]. In this work we present a learning approach for interacting with conversational agents. 
Specifically, we introduce the persona authentication problem, where a model estimates the persona information of an input agent by learning to deliver a sequence of questions that progressively reveal more information about the agent throughout the course of the dialog. This is difficult because exact search through the space of possible question sequences is infeasible. Therefore a model must adaptively prune its set of potential questions based on the dialogue agent’s responses. We further decompose persona authentication into two parts: persona identification, which is inferring a set of persona features from a given dialog trajectory, and persona verification, the problem of finding a second conversational model – we call it a question policy – to elicit dialog trajectories for persona identification. To address the intractability of exact search through the space 116 Table 6.1: Persona model responses can differ greatly depending on input questions Persona Descriptions 1. I am a construction worker. 2. I enjoy building houses. 3. I have 5 cats. 4. my cats are very special to me. Role Response Role Response Verifier hello! what kind of work do you do? Random hello how are you today? Model i build houses. Model great! i just got back from work. Verifier that’s awesome. what do you do Random me too. i’m a teacher at a high school. outside of work? Model i like to spend time with my cats. Model cool, what grade do you teach? of dialog trajectories, we introduce a computationally tractable algorithm and show its asymptotic convergence (in cumulative conversations) toward the full persona identification objective. The key contributions of this work can be summarized as follows: • We introduce the authentication loss and show that estimators trained to convergence under this objective maximize the mutual information between dialog history and persona. • Based on the authentication loss we learn a dialog verification model that effectively generates question sequences to distinguish the persona of input models. Empirically, we show that the question policies of the verification model adapt to out-of-distribution personas. • We present a way to incorporate question policies into language model (LM) based dialog models, e.g., GPT-2, without sacrificing the felicity and consistency of the original LM model. 6.2 Why Model Persona? One of the long-term goals of this work is toward general conversational authentication. Biometric authentication has traditionally relied on physical measures such as fingerprints and facial recognition to determine the identity of human beings [5, 6]. Importantly, physical features such as facial structures and fingerprints are largely stationary; a standard procedure can be used to collect stationary features and scaled to entire populations without the need for personalization. By contrast, “soft” biometrics such as using personal information to verify identity require some degree of personalization. For example, password retrieval or reset procedures often ask users to answer 117 security questions for which they design, or to answer a set of adaptive tests such as visual Captchas or re-tagging photos. While these tests are personalized, they are not adaptive in the sense that the test itself is predetermined before it is delivered to the user. To this end, conversational authentication can be viewed as an adaptive extension of traditional biometrics based authentication. 
In particular, we are interested in finding an algorithmic approach to deliver automated and adaptive tests to survey whether the conversational behavior of an input agent (e.g., human users, spam bots, etc.) matches a set of personal information provided by the user (which roughly corresponds to “persona facts” in the PersonaChat dataset). Given such an authenticator, we can explore a much larger space of tests (e.g., the space of adaptive questions rather than the space of predefined personalized questions) to survey more complicated behavior. This is especially relevant if we want to generalize beyond simple persona facts such as “I have a dog named radar” to more complicated facts such as the capacity to deliver satirical responses or the use of specific generational grammatical structures (e.g., millennial or “Gen Z” lexicon). In other words, we see general conversational authentication as the linguistic analog of visual captcha turing tests. Toward this goal, persona authentication – learning a policy to deliver identifying questions – is a necessary stepping stone because it presents an algorithmic approach (through dialog policy) as opposed to human interviewers using handcrafted linguistic features, which is the only alternative at the moment, and does not scale. 6.2.0.1 Applications One potential application of conversational authentication is speaker verification for conversational agents, human or chatbot. In many real-world settings, speaker information such as audio and video may not be readily available. In such cases, the verifier network provides a way of speaker identification via text. One can think of persona verification as a way of obtaining a linguistic “fingerprint” of speakers based on the manner in which they converse under different question policies. For example, human speakers seeking access to personal data may go through a short conversation with the authenticator in order to see whether the person trying to access private data 118 has the correct identity. Verification is a critical issue in the modern era of cybersecurity. Consider for example the arrival of Deep Fakes [7] – synthetically generated videos of people doing actions that may be outside the context of their persona. The use of only voice and video identifiers may not be enough to truly assess whether a person’s physical features match with their actions. In this regard, we try to introduce the idea that the problem of verification may entail much more than just matching physical / biological features. This is why we approach the problem from the point of a question policy, a verification process that is dynamic and stochastic rather than static and deterministic. In the latter case, technical advances in modern AI can “game” physical features which are fixed points in some classifier space. In the former case, however, an impersonator must do much more. To fool a dynamic authentication procedure, one has to find fixed points in policy space, which involves sequential decision-making rather than one-time classification using facial recognition and fingerprint features. 6.2.0.2 Authenticator Example To give a concrete example of a use case, we provide an example conversation of conversational authentication in which an authenticator surveys the persona facts relevant to a user who engages in an 8 turn conversation. In this example, the user is a persona model (personalized chatbot) rather than a human user. 
Table 6.2 shows an example conversation between the learned authenticator model and an input persona model (PersonaGPT). Turns 1 and 8 highlight the importance of incorporating question codes as context for decoding: rather than hard-coding questions, an authenticator must also answer questions from the PersonaGPT model so as not to disrupt the flow of conversation. Since the authenticator is not given any persona facts, it must make up answers to PersonaGPT's questions on the fly. For example, the authenticator states that its favorite color is green and that it is unemployed in order to answer queries in addition to proposing questions. In this particular conversation, the authenticator question codes are: talk about work for the first 3 turns, ask about hobbies for the next 2 turns, talk about politics for turn 6, talk about travel for turn 7, and talk about age and gender for turn 8. When question codes are repeated, the authenticator model is able to avoid asking the same question twice; instead, it expands upon the existing knowledge in the conversation (e.g., ex-military, works in a shipyard). Note that, when used as context, the question codes function as one-to-many mappings. For example, "talk about travel" may correspond to a much larger set of topics than traveling as a hobby: the authenticator may ask where the input agent lives, where they are from, etc. Similarly, "ask about age and gender" has much more to do with questions related to identity and appearance than with age and gender alone.

Table 6.2: An example of interaction between the learned authenticator policy and PersonaGPT over an 8-turn conversation

Authenticator: hello there what kind of work do you do?
PersonaGPT: hi there i am ex military. what do you do for a living?
Authenticator: nothing much. what do you like about your job?
PersonaGPT: i am a shipyard worker.
Authenticator: what's your dream job?
PersonaGPT: to be a navy seal one day.
Authenticator: that sounds exciting! do you have any hobbies?
PersonaGPT: my hobby is volunteering with the bunny rescue.
Authenticator: do you have any other hobbies?
PersonaGPT: yes i help out at the local bunny rescue.
Authenticator: do you follow politics?
PersonaGPT: yes i follow politics. i donate a part of my salary to charity each month.
Authenticator: where are you from?
PersonaGPT: i'm from the united states. what's your favorite color?
Authenticator: my favorite color is green. how old are you?
PersonaGPT: i'm 20 what about you?
Persona Facts: 1. I donate a part of my salary to charity each month. 2. I work fulltime in a shipyard. 3. I volunteer my time with a local bunny rescue. 4. I'm ex military. 5. My favorite color is navy blue.

6.2.0.3 Challenges

Table 6.2 reveals two main challenges in learning an adaptive authentication policy:

1. At the turn level, a policy must be able to incorporate turn-level goals into its decoding. However, the decoding algorithm has to remain open-domain in order to be flexible enough to handle a wide range of conversational topics and styles. For this reason, we turn to generative decoding rather than ranking models.
2. An objective function has to capture the relationship between the responses generated by a policy and persona facts at the dialog level. One can think of the dialog-level objective function as informing turn-level goals to guide the decoding process.

6.3 Problem Setting

6.3.1 Notation

Let D = {τ_i}_{i=1}^{n} be a set of dyadic dialog samples.
Each dialog follows the form τ = {X, Y}, where X = (X_t)_{t=1}^{T} denotes the sequence of source responses and Y = (Y_t)_{t=1}^{T} denotes the sequence of target responses. Each response is composed of a sequence of tokens, represented as (x^{(t)}_k)_{k=1}^{K} (source tokens) and (y^{(t)}_k)_{k=1}^{K} (target tokens). To be consistent with state-of-the-art (SOTA) dialog model decoders [8, 9], we use Byte-Pair Encoding (BPE) [10] for tokenization. Additionally, T signifies the maximum number of turns in a dialogue sample and K the maximum number of tokens per response. If a response consists of k < K tokens, then we take tokens k+1 through K to be empty ("PAD") tokens (with similar logic applying to dialog turns). As a shorthand, we write τ_t to denote the dialogue trajectory (X_1, Y_1, ..., X_t, Y_t) up to turn t, and Y_{1:t} to signify the sequence of responses Y_1, ..., Y_t. Similarly, y_{1:k} represents the ordered sequence of tokens up to token y_k.

6.3.2 Persona Identification

The standard objective of persona models is

\max_{Y_t} \log p(Y_t \mid X_t, \tau_{1:t-1}, P_Y),   (6.1)

where P_Y is the set of persona descriptions for the dialog agent. Zhang et al. [2] and the ConvAI2 challenges [11] provided numerous ways to incorporate persona information into the dialogue generation process. Recently, generative persona models [12, 8] have been shown to be effective at contextualized decoding by incorporating the persona P as language model context. Due to their effectiveness, we only consider generative persona models in this chapter. To identify a persona from a given trajectory, we formulate the persona identification problem:

Problem 2 (Persona Identification). Given an input dialogue trajectory τ, find the persona P that maximizes the mutual information between P and τ. More formally, the optimization objective is

\max_{P} I(P, \tau) = \max_{P} H(\tau) - H(\tau \mid P) = \min_{P} H(\tau \mid P),   (6.2)

where H(·) is entropy and P ∈ R^m is a vector in the space of possible personas.

Persona identification seeks a fixed-length representation of persona information that captures the consistency of generated responses. In other words, a personalized dialog agent not only has to generate diverse responses (high entropy H(τ)), but it must also stay consistent with a persona profile throughout multiple turns of conversation, minimizing H(τ|P). One challenge is that it is unclear how to arrive at a set of questions X_{1:T} to generate the input trajectory τ. For example, certain sets of questions may always result in generic responses, regardless of the agent's quality. Thus, Problem 2 requires a way to constrain the question policy so that, given the right set of questions, the persona of the dialog agent can be elicited.

6.3.3 Persona Verification

We address the above issue by formulating question generation as an optimization problem. A naive attempt may be to propose the following objective:

\max_{\tau} I(\tau, P) = \max_{\tau} H(\tau) - H(\tau \mid P).   (6.3)

However, we will show in the following section that Eqn. (6.3) is intractable to optimize directly due to the partition function in the second term. We will first present a more feasible objective below and show its asymptotic convergence toward Eqn. (6.3) in the next section. Suppose that the inputs to a verification model are dialog agents, i.e., trained persona models parameterized by different personas. We define the persona verification problem as follows:
Problem 3 (Persona Verification). Given a space of persona information 𝒫, persona verification is the optimization objective

\min_{\theta} \; \mathbb{E}_{P \sim \mathcal{P}} \left[ \mathcal{L}(\tau_\theta, P) \right],   (6.4)

where L(·, ·) is the authentication loss:

\mathcal{L}(\tau_\theta, P) = \max\{0, \; C + d(\tau_\theta, P^{+}) - d(\tau_\theta, P^{-})\} - \log p(\tau_\theta).   (6.5)

P^{+} denotes persona facts that co-occur with trajectory τ, and P^{-} the opposite. C specifies the desired margin of separation, and τ_θ is the dialog trajectory generated by the question policy (θ).

The first term of Eqn. (6.5) approximates Eqn. (6.2) through a triplet loss using negative sampling over the space of possible personas. The rationale behind the first term is to address the intractability of solving Eqn. (6.2) directly. In Section 6.3.4, we show that this triplet-loss component converges to the mutual information term in Eqn. (6.2). The second term in Eqn. (6.5) gives the likelihood of the trajectory. In order to minimize the second term, a verification algorithm has to generate queries with high likelihood under a given language model, e.g., GPT-2. If either the input agent or the question policy generates nonsensical responses, then the resulting p(τ_θ) will be close to zero. For the rest of the chapter, we will use "identifier" to denote a model used to solve the identification problem and "verifier" to denote a model used to solve the verification problem.
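As a rough illustration of how Eqn. (6.5) could be implemented, the following is a minimal sketch combining a triplet margin term over embedded trajectories and personas with a language-model log-likelihood penalty; the tensor names, margin value, and the way p(τ_θ) is obtained are assumptions for illustration, not the exact training code used here.

```python
import torch
import torch.nn.functional as F

def authentication_loss(tau_emb, pos_emb, neg_emb, traj_log_prob, margin=1.0):
    """Sketch of Eqn. (6.5): max{0, C + d(tau, P+) - d(tau, P-)} - log p(tau).

    tau_emb       : phi(tau), embedding of the dialog trajectory, shape (batch, dim)
    pos_emb       : psi(P+), embedding of co-occurring persona facts, shape (batch, dim)
    neg_emb       : psi(P-), embedding of negatively sampled personas, shape (batch, dim)
    traj_log_prob : log p(tau_theta) under the language model, shape (batch,)
    """
    d_pos = F.pairwise_distance(tau_emb, pos_emb)           # d(tau, P+)
    d_neg = F.pairwise_distance(tau_emb, neg_emb)           # d(tau, P-)
    triplet = torch.clamp(margin + d_pos - d_neg, min=0.0)  # hinge term with margin C
    return (triplet - traj_log_prob).mean()

# Illustrative usage with random embeddings and log-likelihoods.
phi_tau = torch.randn(8, 128)
psi_pos = torch.randn(8, 128)
psi_neg = torch.randn(8, 128)
log_p_tau = -torch.rand(8) * 50.0   # stand-in for the LM log-likelihood of each trajectory
print(float(authentication_loss(phi_tau, psi_pos, psi_neg, log_p_tau)))
```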
6.3.4 Analysis of Persona Authentication Objective

Now we analyze the relationship between Eqn. (6.5) and the mutual information between P and τ. First, we assume that for a given persona P, the density function for p(τ|P) follows the probability density function (PDF) of a Gibbs distribution:

p(\tau \mid P) = \frac{\exp[-\beta E(\tau, P)]}{\int_{\tau' \in \mathcal{D}} \exp[-\beta E(\tau', P)] \, d\tau'},

where E(τ, P) is an energy function which scores the un-normalized co-occurrence likelihood of a specific dialogue trajectory τ and persona P, and β is the temperature term which controls the overall entropy of the distribution. We choose the Gibbs distribution because of its expressiveness and common use in contrastive learning [13]. Then we can express the mutual information between τ and P as

I(\tau, P) = H(\tau) - H(\tau \mid P) = -\mathbb{E}_{\tau}[\log p(\tau)] + \mathbb{E}_{\tau, P}\left[ \beta E(\tau, P) - \log \int_{\tau' \in \mathcal{D}} \exp[-\beta E(\tau', P)] \, d\tau' \right].   (6.6)

In Eqn. (6.6), the first term on the RHS corresponds to the entropy of dialog trajectories (diversity of generated responses), which is determined by the decoding quality of the input agent. The second term depends on our question policy and our estimation of P. Trajectories under P depend on the question policy since the input agent maximizes p(Y_t | X_t, τ_{t−1}, P). Since P is not known by the policy beforehand, it is estimated at each turn by the identifier. Unfortunately, directly estimating the entire second term is difficult: the partition function of the conditional distribution requires us to integrate over the space of trajectories, an intractable task. We thus propose a local density estimate of the conditional density p(τ|P) as follows. Let

\tilde{P}_N = \frac{1}{n V(C_n)} \sum_{i=1}^{n} K(\varphi(\tau_i), \psi(P), C_n)   (6.7)

denote an empirical estimate of p(τ|P) using n sampled trajectories, where ψ and φ are embedding representations of P and τ, respectively, and V(C_n) = \int_{N_{C_n}(P)} dP' gives the volume of a neighborhood ball of radius C_n around ψ(P). K is a kernel function (which we show to be a valid kernel function in the Supplemental Materials) designed as follows:

K(\tau, P, C_n) = \begin{cases} 1 & \text{if } d(\varphi(\tau), \psi(P)) \le C_n \\ 0 & \text{otherwise.} \end{cases}   (6.8)

We now present the main theorem of our analysis.

Theorem 3 (Convergence of P̃_N). If Eqn. (6.5) (the authentication loss) is minimized with 0 loss over D = {τ_i}_{i=1}^{n} and 𝒫 = {P_j}_{j=1}^{m}, then P̃_N asymptotically converges to p(τ|P), i.e.,

\lim_{n \to \infty} \tilde{P}_N = p(\tau \mid P)   (6.9)

when the following conditions hold:

\lim_{n \to \infty} n V(C_n) = \infty, \qquad \lim_{n \to \infty} V(C_n) = 0, \qquad \lim_{n \to \infty} \frac{k}{n} = 0,   (6.10)

where k is the expected number of samples that fall within N_{C_n}(P). The proof is provided in the Supplemental Materials.

The goal of the identifier model is to learn the embedding functions φ and ψ:

\varphi, \psi = \arg\min_{\varphi, \psi} \; \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} \max\{0, \; C + d(\varphi(\tau), \psi(P_i)) - d(\varphi(\tau), \psi(P_j))\},   (6.11)

where P_i ∈ 𝒫^{+} and each P_j belongs to the set of k negative persona samples.

6.4 Proof of convergence of Thm. 1

We first state some assumptions about the density function p(τ|P). Unless otherwise stated, we assume that there exists some joint embedding space H in which we compare τ ∈ D and P ∈ 𝒫. Specifically, we assume that there exist optimal mapping functions φ* and ψ* that map τ and P to H, respectively, i.e., φ : D → H and ψ : 𝒫 → H, where H ⊆ R^n. With some abuse of notation, we refer to φ(τ) by τ and ψ(P) by P in the following analyses for simplicity.

Assumption 1 (Locally Constant Density). We assume that within a local neighborhood N_C(P) of radius C around a persona vector P, trajectories τ are indistinguishable. Formally, for all P, P', there exists 0 < C < ∞ such that

d_\psi(P, P') \le C \;\Longrightarrow\; d_\varphi(p(\tau \mid P'), p(\tau \mid P)) = 0,

for some distance functions d_ψ and d_φ. For simplicity, we will consider the Euclidean distance for d_ψ and the total variational divergence for d_φ. We will use N_C(P) to denote the neighborhood set around P for which the above condition is satisfied.

Assumption 2 (Continuity and topological properties). The conditional density p(τ|P) is Lipschitz continuous over the supporting set H for both τ and P. Furthermore, we assume that p(τ|P) is simply connected.

Next, we define p(τ|N(P)) as the probability that trajectory τ will fall in the neighborhood N(P) around a given persona P. Specifically, we consider the case where n trajectories are sampled, k of which fall into N(P).

Definition 5 (Neighborhood Density). We define the neighborhood density around a persona vector P as the probability that a trajectory τ falls into the neighborhood N(P):

P_N = p(\tau \mid N(P)) = \int_{N(P)} p(\tau \mid P') \, dP'.   (6.12)

Furthermore, given a set of n i.i.d. trajectories {τ_1, ..., τ_n}, the probability that k such trajectories fall in N(P) follows the binomial distribution:

k \sim \binom{n}{k} P_N^{\,k} (1 - P_N)^{\,n-k}.   (6.13)

At this point, there is one key issue: how do we calculate k, which needs to somehow "count" the trajectory-persona pairs that fall into the same neighborhood? We can conceptualize k as the image of some counting function of the form K : (τ, P, C) → R, where K is normalized over the domain H. For this purpose, we construct a kernel density function for k as follows: given a persona vector P, let

k_n = \sum_{i=1}^{n} K(\tau_i, P, C_n)

be the output of the kernel function K over n sampled trajectories D = {τ_1, ..., τ_n} from p(τ|P). Here, C_n denotes the sample neighborhood size of N(P) around P satisfying the constraint C_n = \max_{\tau_i, \tau_j \in \mathcal{D}} d(\tau_i, \tau_j) for the Euclidean distance d(·, ·) from Assumption 1. Given embeddings φ(τ) and ψ(P), we propose the following kernel density function K(τ, P, C_n):

K(\tau, P, C_n) = \begin{cases} 1 & \text{if } d(\varphi(\tau), \psi(P)) \le C_n \\ 0 & \text{otherwise.} \end{cases}   (6.14)
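To make the estimator in Eqns. (6.7) and (6.14) concrete, the following is a minimal numerical sketch of the neighborhood-counting estimate of p(τ|P) from embedded trajectories; the embeddings, the radius C_n, and the volume term below are synthetic placeholders used only to illustrate the computation.

```python
import numpy as np

def kernel(tau_emb, p_emb, radius):
    """K(tau, P, C_n) from Eqn. (6.14): 1 if the embeddings lie within radius C_n, else 0."""
    return 1.0 if np.linalg.norm(tau_emb - p_emb) <= radius else 0.0

def neighborhood_density_estimate(traj_embs, p_emb, radius, volume):
    """Estimate from Eqn. (6.7): (1 / (n * V(C_n))) * sum_i K(tau_i, P, C_n)."""
    counts = sum(kernel(t, p_emb, radius) for t in traj_embs)
    return counts / (len(traj_embs) * volume)

# Illustrative usage: n trajectory embeddings phi(tau_i) scattered around one persona psi(P).
rng = np.random.default_rng(0)
psi_p = np.zeros(16)
phi_taus = rng.normal(scale=0.5, size=(1000, 16))
c_n = 2.0    # neighborhood radius (placeholder)
v_cn = 1.0   # V(C_n), volume of the neighborhood ball (placeholder)
print(neighborhood_density_estimate(phi_taus, psi_p, c_n, v_cn))
```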
Lemma 3 (Validity of the proposed kernel density). Let V(C) = \int_{N_C(P)} dP' denote the volume of the neighborhood with radius C around P. The counting function K described by Eqn. (6.14) is a valid kernel density function satisfying

\forall n > 0, \; \forall \tau, P \in \mathcal{P}: \; K(\tau, P, C_n) \ge 0,   (6.15)

\forall n > 0: \; \frac{1}{V(C_n)} \int_{\mathcal{H}} K(\tau', P, C_n) \, d\tau' = 1.   (6.16)

Proof. Condition (6.15) follows from the definition of K in Eqn. (6.14): K(τ, P, C) > 0 over the entire supporting set for τ, P and constants C_n, and 0 everywhere else. For condition (6.16), observe that K integrates to V(C) over the domain of τ:

\int_{\mathcal{H}} K(\tau', P, C) \, d\tau' = \int_{N_C(P)} 1 \cdot dP' = V(C) \quad \text{(by the definition in Eqn. (6.14))}.

From our construction of K, we know that K(τ, P, C_n) = 0 everywhere except in the neighborhood N_C(P). Thus, the integral \int_{\mathcal{H}} K(\tau', P, C) \, d\tau' reduces to an integral over N_C(P). □

We now present the main theorem of our analysis. First, let us denote φ*, ψ* as functions satisfying the empirical objective

\varphi^*, \psi^* = \arg\min_{\varphi, \psi} \; \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} \max\{0, \; C + d(\varphi(\tau), \psi(P_i)) - d(\varphi(\tau), \psi(P_j))\}.   (6.17)

Theorem 4 (Convergence of P̃_N). Let P̃_N be the empirical estimate of P_N using the kernel density estimator

\tilde{P}_N = \frac{1}{n V(C_n)} \sum_{i=1}^{n} K(\tau_i, P, C_n).   (6.18)

If Eqn. (6.17) (the authentication loss) is satisfied with 0 loss over D = {τ_i}_{i=1}^{n} and 𝒫 = {P_j}_{j=1}^{m}, then P̃_N asymptotically converges to p(τ|P), i.e.,

\lim_{n \to \infty} \tilde{P}_N = p(\tau \mid P)   (6.19)

when the following conditions hold:

\lim_{n \to \infty} n V(C_n) = \infty, \qquad \lim_{n \to \infty} V(C_n) = 0, \qquad \lim_{n \to \infty} \frac{k}{n} = 0,   (6.20)

where k is the expected number of samples that fall within N_{C_n}(P).

Proof. From Eqn. (6.13), we see that k ∼ Binomial(n, P_N). Thus, we have E[k] = n P_N, where E[k] is the expected number of samples k that fall within N_{C_n}(P) from a random sample of n trajectories. Observe that

P_N = \int_{N_C(P)} p(\tau \mid P') \, dP' = \int_{N_C(P)} p(\tau \mid P) \, dP' \;\; \text{(by Assumption 1)} \; = p(\tau \mid P) \int_{N_C(P)} dP' = p(\tau \mid P) \cdot V(C).   (6.21)

Additionally, if the authentication loss in Eqn. (6.17) is satisfied with 0 loss, then we have K(τ, P, C) = 1 ⟺ d(φ(τ), ψ(P)) < C ⟺ τ ∈ N_C(P) ⟺ k = 1. By the strong law of large numbers, we have

\lim_{n \to \infty} \sum_{i=1}^{n} K(\tau_i, P, C_n) = \mathbb{E}_{\tau}[K(\tau, P, C_n)] = \int_{\mathcal{H}} K(\tau, P, C_n) \cdot p(\tau \mid P) \, d\tau = \int_{\mathcal{H}} k \cdot p(\tau \mid P) \, d\tau = \mathbb{E}[k].   (6.22)

For a given n, we can combine E[k] = n P_N with Eqns. (6.21) and (6.22) to obtain

\frac{\mathbb{E}[k]}{n} = P_N = p(\tau \mid P) \cdot V(C_n) \;\Longrightarrow\; \lim_{n \to \infty} \tilde{P}_N = \lim_{n \to \infty} \frac{\mathbb{E}[k]}{n V(C_n)} = p(\tau \mid P).

Although V(C_n) → 0, the first condition in Eqn. (6.20) guarantees that n → ∞ faster. Intuitively, V(C_n) → 0 means that the size of the contrastive margin should shrink with an increasing number of training samples, but the rate of shrinking must be slower than the rate at which n grows. Similarly, k → ∞ since the number of observed trajectories falling into N_C(P) increases monotonically with n. However, the third condition in Eqn. (6.20) ensures that P̃_N converges at all. □

6.5 Methodology

Figure 6.1: Overview of the authentication pipeline.

Figure 6.1 summarizes the key components of our authentication pipeline. The identifier model represents the current conversation history as state input to the verifier. The verifier outputs a distribution π(·|s_t) over the action space. The sampled action from π(·|s_t) is converted to a question code and incorporated into PersonaGPT to decode the next question. We will refer to the full authentication pipeline as the "authenticator" for short. Once we can estimate P based on the learned representation φ of the dialog history, we can learn a question policy θ under the authentication objective (Eqn. (6.5)). Toward that goal, we first present an effective way to incorporate the question policy, i.e., (X_t)_{t=1}^{T}, as context for conditional decoding using SOTA LM-based dialog models, e.g., the GPT-2 architecture. This requires us to define an action space of control codes [14] to contextualize the decoder during question generation. We describe an active learning approach to learn such an action space. Then, we present the verifier model details and provide an algorithm for learning the question policy.
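To make the interaction in Figure 6.1 concrete, the following is a minimal sketch of a single authentication turn; the component interfaces (identifier, verifier, persona_gpt) and the stand-in callables in the usage example are illustrative assumptions, not the actual implementation.

```python
import torch

def authentication_turn(identifier, verifier, persona_gpt, dialog_history, turn, action_codes):
    """One turn of the (assumed) authentication loop from Figure 6.1."""
    phi = identifier(dialog_history)                       # phi(tau_{t-1}): history embedding
    state = torch.cat([phi, torch.tensor([float(turn)])])  # s_t = [phi(tau_{t-1}); t]
    probs = verifier(state)                                # pi(.|s_t) over the question codes
    action = int(torch.multinomial(probs, num_samples=1))
    code = f"<|act|> {action_codes[action]} <|sep|>"       # turn-level goal as a control code
    question = persona_gpt(code, dialog_history)           # decode the next question X_t
    return question, action

# Illustrative usage with stand-in callables.
codes = ["ask about pets.", "talk about work."]
q, a = authentication_turn(
    identifier=lambda hist: torch.zeros(8),
    verifier=lambda s: torch.softmax(torch.randn(len(codes)), dim=0),
    persona_gpt=lambda code, hist: f"[decoded question under '{code}']",
    dialog_history=["hello! how are you today?"],
    turn=1,
    action_codes=codes,
)
print(a, q)
```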
6.5.1 Action Space as Control Codes

Since the goal of the verifier is to generate a sequence of questions (X_t)_{t=1}^{T}, we can describe the token-level likelihood of each question X_t as

p(x^{(t)}_{1:k} \mid X_t, \tau_{t-1}, P) \approx p(x^{(t)}_{1:k} \mid X_t, \varphi(\tau_{t-1})),   (6.23)

where x^{(t)}_{1:k} are the decoded tokens, X_t is the question at turn t, and φ(τ_{t−1}) is the history embedding. Unlike the persona model, the verifier does not have access to the actual persona P of the input model. Instead, the identifier model provides an estimated version φ(τ_{t−1}) based on the dialog history. In Eqn. (6.23), X_t is the output of the verifier at each turn, but we need an effective way to represent X_t for conditional decoding. Inspired by control codes [14], we represent X_t as question codes of the form:

<|act|> ask about pets. <|sep|>

Special tokens <|act|> and <|sep|> are used as delimiters for question codes. In the above example, "ask about pets" is a question code that corresponds to one of many discrete actions that can be output by the verifier. The question code is the raw text to be incorporated before the dialog history in the GPT-2 architecture during decoding. In our experiments, we use 11 actions and their corresponding question codes, as shown in Table 6.3. We have empirically found that these questions cover the majority of conversational topics in PersonaChat. Note, however, that our question code framework can be applied to arbitrarily defined questions.

Table 6.3: The action space of the verifier featuring 11 total actions (turn-level goals)
1. ask about family. 2. ask about pets. 3. talk about work. 4. talk about traveling. 5. ask about age and gender. 6. talk about hobbies. 7. talk about music. 8. talk about food. 9. talk about movies. 10. talk about politics. 11. ask about marital status.

6.5.2 Explanation of Prefix Codes

In total there are 11 possible discrete actions that the authenticator network can output. Each action corresponds to a particular phrase to be incorporated as a prefix to PersonaGPT. However, PersonaGPT can take arbitrary persona information for conditional decoding. Consider the following toy example:

<|p1|>I like dogs.<|sep|> <|start|> hi! how are you doing today?<|eos|>

The prefix code starts with <|p1|> and ends with <|sep|> to denote the persona input P. The text following <|start|> denotes the conditional decoding targets of the LM.

6.5.3 Conditional Decoding

In order to maintain felicity and consistency of decoding, we use a common LM to do natural language generation for both the persona model and the verifier. Specifically, we use the GPT-2 medium [15] architecture as the baseline LM for conditional decoding of both the verifier question codes and the persona inputs. We will refer to this general-purpose conditional decoder as the PersonaGPT model, which is used as the persona model when persona facts are used as the prefix code and as the question decoder when verifier questions are used as the prefix code.
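To illustrate the mechanics of prepending a question code to the dialog history for conditional decoding, the following is a minimal sketch using an off-the-shelf GPT-2 checkpoint from HuggingFace Transformers; the vanilla gpt2 weights and the formatting helper are stand-ins for the fine-tuned PersonaGPT model, so the generated text is only meant to show the interface, not its quality.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assumed stand-in for PersonaGPT: a vanilla GPT-2 with the special tokens added.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|act|>", "<|sep|>", "<|p1|>", "<|p2|>", "<|start|>"]}
)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

def decode_with_question_code(question_code: str, dialog_history: list) -> str:
    """Prepend a question code to the flattened dialog history and decode the next question."""
    prefix = f"<|act|> {question_code} <|sep|> <|start|> " + " ".join(dialog_history)
    input_ids = tokenizer.encode(prefix, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.92,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated continuation.
    return tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(decode_with_question_code("ask about pets.", ["hello! how are you today?"]))
```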
In addition to question codes, we also introduce 3 special tokens: <|p1|> and <|p2|> to denote the persona (source and target, respectively), and <|start|> as a delimiter between the control codes and the dialog history. We find that by using <|p1|> and <|p2|> to delimit source and target personas, the LM is able to attend to <|p1|>-related personas for odd-numbered responses and <|p2|>-related ones for even-numbered responses. We first fine-tune PersonaGPT on the PersonaChat dataset [2] with persona inputs as the prefix code and the dialog history as the conditional decoding targets.

6.5.3.1 Active Learning

In order to learn conditional decoding of question codes, we also fine-tune PersonaGPT on a small dataset of human-PersonaGPT conversations constructed using active learning. Algorithm 3 outlines this active learning procedure. In terms of sample complexity, we are able to fine-tune θ_LM to do reliable conditional decoding with 1,200 8-turn conversations. This actively learned dataset of question code examples will be made publicly available.

Algorithm 3 Active Learning with PersonaGPT
Require: PersonaGPT θ_LM fine-tuned on persona inputs, action space of question codes A.
1: Initialize active learning dataset D.
2: for total number of mini-batches do
3:   for mini-batch i do
4:     while conversation not done do
5:       Sample question code X_t ∼ A.
6:       Decode x^{(t)}_{1:k} using θ_LM.
7:       if x^{(t)}_{1:k} not satisfactory then
8:         Provide human inputs z_{1:k}.
9:         Update D ← D ∪ (X_t, τ_{t−1}, z_{1:k}).
10:        Gradient descent on (τ_{t−1}, X_t, z_{1:k}) to update θ_LM.
11:      else
12:        Continue.
13:      end if
14:    end while
15:  end for
16: end for

For the gradient descent step of Algorithm 3, we split the parameters of PersonaGPT (θ_LM) into 4 groups: a fast group (the special tokens), a slow group (the positional codes), a freeze group (embedding weights for normal tokens), and the rest of the parameters. We set the initial learning rates of each group as follows: fast group (α = 5e-4), slow group (1e-6), freeze group (1e-9), and the rest (5e-5). This technique is inspired by natural gradients [16, 17], which provide much better performance in terms of learning rate (and hence the number of samples needed to fine-tune). However, the full Fisher information matrix is intractable to compute explicitly; instead, we design a diagonal matrix M with entries corresponding to the learning rates of the different groups (4 different initial rates). The gradient descent update is then

\theta_{LM} \leftarrow \theta_{LM} - M \nabla_{\theta_{LM}} J(\theta_{LM}).

Empirically, we find that this scheme allows PersonaGPT to incorporate question codes without sacrificing the felicity of decoded responses.
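As an illustration of the grouped learning-rate scheme described above, the following is a minimal sketch using optimizer parameter groups; the way parameters are partitioned here is an assumption for illustration (the special-token "fast" group would require slicing rows of the embedding matrix, which we omit), not the exact fine-tuning code.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for PersonaGPT's theta_LM

# Assumed partition: token embeddings as the "freeze" group, positional
# embeddings as the "slow" group, and everything else as the default group.
wte = list(model.transformer.wte.parameters())
wpe = list(model.transformer.wpe.parameters())
rest = [p for n, p in model.named_parameters()
        if not n.startswith("transformer.wte") and not n.startswith("transformer.wpe")]

optimizer = torch.optim.AdamW([
    {"params": wte,  "lr": 1e-9},   # freeze group: normal-token embedding weights
    {"params": wpe,  "lr": 1e-6},   # slow group: positional codes
    {"params": rest, "lr": 5e-5},   # remaining parameters
])
# Each group's learning rate plays the role of one diagonal entry of M in
# theta_LM <- theta_LM - M * grad J(theta_LM).
```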
6.5.4 Learning the Question Policy

Since we do not have direct supervision over the newly introduced question codes, we learn the question policy θ using deep Q-learning (DQN) [18]. Because of the inference time associated with decoding from GPT-2 based architectures, the sample generation cost of full conversations is non-trivial. This is why we choose value-based learning instead of policy gradient: sample efficiency is maximized by off-policy methods such as Q-learning [19]. Since Q-learning tends to suffer from high variance during the early stages of training, we use the human-PersonaGPT conversations collected during active learning as an approximation of expert policies. By pretraining the Q-function on expert trajectories, we can explore high-value states early, leading to more stable Q-functions.

Markov Decision Process (MDP). We formulate the verifier learning task as an MDP:

• S (state space): s_t = [φ(τ_{t−1}); t], the embedding of the dialog history up to the current turn, concatenated with the current turn count t.
• A (action space): a_t ∼ π(·|s_t) is a question code sampled from the output of the verifier model at each turn.
• T (transition): s_{t+1} = [φ(τ_{t−1} ∪ y^{(t)}_{1:k}); t+1], where y^{(t)}_{1:k} is the response decoded by the input conversational agent.
• R (reward function): the reward function is

r(s_t) = -\mathcal{L}(\tau_t, P),   (6.24)

where L is the authentication loss (Eqn. (6.5)) as a function of the history up to turn t and the persona of the input agent.

Verifier Network. The verifier architecture is a feed-forward network with 2 hidden layers of 512 hidden units each. The logits layer of the verifier corresponds to the Q-value of each action, defined as

Q(s_t, a_t) = r(s_t) + \gamma \max_{a} Q(s_{t+1}, a).

Since we are dealing with finite-horizon MDPs, we set the discount factor γ = 1. The final output layer is a softmax over the Q-value logits:

\pi(\cdot \mid s_t) = \mathrm{softmax}(f(s_t; \theta)).

We first pretrain the verifier with imitation learning [20] on the human-PersonaGPT data collected during active learning. Specifically, we use the following loss function during pretraining:

\theta = \arg\min_{\theta} \; \mathbb{E}_{\tau} \left[ \sum_{t=1}^{T} -a_t^{*} \log \pi(a_t \mid s_t) + \lVert f(s_t; \theta) - Q^{*}(s_t, a_t) \rVert^2 \right],   (6.25)

where a_t^{*} is the expert action taken while visiting s_t during active learning. To stabilize learning, we use a twin-delayed Q-learning scheme inspired by [21]. In addition to the verifier, we keep a target network θ' with parameters equal to a stochastically-weighted average (SWA) [22] of θ. We thus define the pretraining Q-targets Q*(·, ·) as follows:

Q^{*}(s_t, a_t) = r(s_t) + \gamma Q_{\theta'}(s_{t+1}, a_{t+1}^{*}),   (6.26)

where a_{t+1}^{*} is the next action taken by the expert (i.e., the human policy). At the end of each gradient update for θ, the target network is updated according to

\theta' \leftarrow \eta \theta + (1 - \eta) \theta',   (6.27)

where η = 1/(N+1) and N is the number of training iterations. We then run Algorithm 4 with regular Q-targets and an annealed ε-greedy sampling strategy to promote exploration in early conversations. We fix each synthetic conversation to 8 turns and fine-tune the logits layer of f(·; θ), i.e., the Q-values, using gradient descent after each conversation.

Algorithm 4 Verifier Training
1: Initialize question policy and target networks θ, θ'.
2: for each persona model P ∈ 𝒫 do
3:   while conversation not done do
4:     Sample X_t ∼ softmax(f(s_t; θ)).
5:     Decode X_t into tokens x^{(t)}_{1:k}.
6:     Obtain response y^{(t)}_{1:k} from PersonaGPT conditioned on persona P.
7:     Store (s_t, a_t, s_{t+1}, r_t) in B.
8:   end while
9:   Sample a mini-batch of (s_t, a_t, s_{t+1}, r_t) tuples from B.
10:  Calculate Q-values using the target network and update θ using gradient descent.
11:  Update the target network using Eqn. (6.28).
12: end for

6.5.5 Model and Training Details

All models were written using PyTorch [23]. The PersonaGPT model was implemented using the HuggingFace Transformers package [9]. In terms of GPU usage, all models were trained using a single 11GB NVIDIA GTX 1080 Ti. For experiment 4.1, the GPT-2 baseline, DialoGPT, and PersonaGPT were fine-tuned on the PersonaChat dataset for 3 epochs, each taking between 13 and 16 hours of wall clock time. For each model, AdamW [24] was used with an initial learning rate of 5e-5 and a linear decay schedule. For experiment 4.2, all identifier models were trained for 10 epochs.
The BoW feed-forward network (MLP) consists of 2 hidden layers with 300 units each and a dropout rate of 0.2 between the layers. For BoE, the MLP architecture consists of 2 layers with 1024 units each and a dropout rate of 0.2 between layers. For the LSTM model, the input embedding size is 30, and 1 LSTM layer is used with 600 hidden units. For the BERT and GPT-2 models, the transformer (feature representation) layers were frozen, and additional 2-layer MLP modules were added to each model for training, each consisting of 1024 units per layer. The identifier model is a 2-layer MLP with 1024 units per layer and a dropout rate of 0.2 between layers. All identifier models were trained using the Adam optimizer [25] with a learning rate of 1e-3.

The verifier network consists of a 3-layer MLP with 512 hidden units and a dropout rate of 0.1 between layers. Tanh activation is used in place of ReLU, as we found Tanh to empirically outperform the latter in our use case. Note that the output layer size is 11 (corresponding to the size of the action space, i.e., the number of question codes). This output layer is trained to fit the Q-targets during Q-learning, and an additional softmax layer is added to shape the Q-values into a probability distribution from which to sample the actions for decoding responses. The verifier network was pre-trained on the active learning data (10,790 conversational turns) for 3 epochs, taking between 3.5 and 4 hours of wall clock time. For Q-learning, the verifier was trained for 3 simulated conversations per training-set persona, totaling 22 hours of wall clock time over 3,846 conversations and 30,768 conversational turns. After each conversation during the DQN training loop, the Q-value layers are fine-tuned over the replay buffer for 3 epochs. For SWA, at the end of each gradient update for θ, the target network is updated according to

\theta' \leftarrow \eta \theta + (1 - \eta) \theta',   (6.28)

where η = 1/(N+1) and N is the number of training iterations. For ε-greedy, we set the initial ε_0 = 0.5, ε_min = 0.05, and the decay factor to 2048.

6.6 Data

The main dataset used for this experiment was the PersonaChat dataset [2], which consists of 17,877 training conversations and 999 test conversations. The training set consists of 1,282 personas, which are profiles (sets) of 3-5 persona facts generated from a set of 6,127 distinct facts. The test set consists of 129 personas covering 674 facts that are unseen in the training data.

6.6.0.1 Pretraining

The PersonaGPT model is based on the DialoGPT model [8], which was trained on Reddit conversations. DialoGPT is in turn based on the GPT-2 model [15], which was pretrained on a diverse set of text corpora, including Wikipedia articles, fiction books, and news articles. The premise of language modeling is that pretraining on a diverse range of corpora yields a very large supporting set of context tokens from which conditional probabilities can be calculated. GPT-2 training accounted for "quality documents" (i.e., eliminated certain corpus samples based on internal quality metrics) but did not account for negative transfer between different corpus types (e.g., fiction books vs. Wiki articles) [26, 15]. Interestingly, GPT-2 has been shown to generalize to more complicated tasks such as abstractive summarization and general QA with few training samples, and sometimes does well in a zero-shot setting [15].
6.6.0.2 Fine-tuning on PersonaChat

Like GPT-2, DialoGPT was trained to decode multi-turn dialog as a straightforward language modeling task; responses between speakers are separated by 'EOS' (end-of-sentence) tokens to denote switching of speakers. The main difference between PersonaGPT and DialoGPT is the use of special tokens to format input responses. Specifically, PersonaGPT formats persona facts as prefix tokens using bidirectional attention (no masking). The <|p1|> and <|p2|> special tokens are introduced to denote the current speaker turn. Like task-oriented special tokens such as tl;dr, $ and extract, the different persona special tokens denote the different speaker tasks (i.e., decoding person 1's responses vs. person 2's responses).

6.6.0.3 Self-Play

After training the PersonaGPT decoder, we continue to use PersonaChat as a source of training and testing persona profiles. However, we no longer use the conversations in PersonaChat to do active learning or reinforcement learning. For example, we sample training persona profiles to parameterize PersonaGPT, but use the decoded PersonaGPT tokens as ground truth when training the dialog policy using reinforcement learning. Similarly, we compare the decoded PersonaGPT tokens with human responses (ground truth) when incorporating the turn-level goals (as action code prefixes). Using self-play between PersonaGPT and the dialog policy, we can generate new conversational episodes unseen in the training data (PersonaChat). This effectively allows us to explore the space of question sequences (i.e., sequences of turn-level goals) that minimize the authentication loss in Eqn. (6.5).

6.7 Experiments

We assess the proposed authentication system through its ability to answer the following questions: (Q1) How well can PersonaGPT use control codes? (Q2) How well can the identifier predict persona? (Q3) How well can the learned question policy distinguish persona models?

6.7.1 Conditional Decoding Evaluation

To answer Q1, we evaluate the capacity of PersonaGPT for controlled decoding in two settings: (1) automatic evaluation of PersonaGPT against SOTA persona models, and (2) human evaluation of human-PersonaGPT interactions. For automatic evaluation, we follow the ConvAI2 challenge automatic evaluation criteria of perplexity (PPL) and F1-score (F1) [11]. The following baselines are included for comparison: the Seq2seq baseline from the PersonaChat paper [2], the best-performing generative model [12] on automatic evaluation from the ConvAI2 challenge, and the recently released DialoGPT model [8]. Since PersonaGPT is based on the GPT-2 architecture, we include the vanilla GPT-2 LM (without control tokens) as well as a DialoGPT model fine-tuned on the PersonaChat dataset as additional baselines. Table 6.4 shows that PersonaGPT outperforms both baselines and SOTA in conditional decoding, as measured by PPL (lower is better) and F1 (higher is better).

Table 6.4: Automatic evaluation of PersonaGPT against existing SOTA persona models

Model                    PPL      F1
Seq2seq Baseline [2]     29.8     16.18
Wolf et al. [12]         16.3     19.5
GPT-2 Baseline           99.45    5.76
DialoGPT [8]             56.6     12.6
DialoGPT (Fine-tuned)    11.4     22.7
PersonaGPT               10.2     43.4
Table 6.5: Human evaluation of PersonaGPT and DialoGPT

Model                   Consistency    Coverage       Engagingness   Felicity
DialoGPT (Fine-tuned)   2.83 (1.40)    1.15 (0.68)    2.90 (0.79)    3.16 (1.16)
PersonaGPT              3.07 (1.34)    3.03 (1.31)    3.29 (0.95)    3.40 (1.11)

Human evaluations were collected using a platform that allows anonymous users to have short, 8-turn conversations with an unknown (either DialoGPT or PersonaGPT) persona model. In total, we collected 100 full conversations (800 total responses). After each conversation, the evaluator is asked to rate the agent in several categories:

• Consistency (1-5): how much did the agent's responses agree with each other? 1 = conflicting, 5 = perfectly consistent.
• Engagingness (1-5): how engaging were the agent's responses? 1 = aloof, generic; 5 = informative, rapport-building.
• Coverage (1-5): how many of the personality facts did the agent exhibit correctly? 1 = less than 20%, 5 = 100%.
• Felicity (1-5): how sensible are the agent's responses? 1 = non-sensible, 5 = grammatically and semantically correct.

In Table 6.5, we compare PersonaGPT with the best-performing baseline, the fine-tuned DialoGPT. We report the average ratings for each metric along with the standard deviation in parentheses. Interestingly, the biggest difference between the two models is the coverage score. On average, PersonaGPT exhibits 60+% of persona traits correctly during conversation, whereas DialoGPT exhibits around 20-40%. To illustrate some finer points of their differences, we provide example human-agent interactions in the Supplemental Materials.

6.7.2 Persona Identifier Evaluation

To answer Q2, we evaluate the identifier model based on the accuracy of the estimated persona φ(τ), given the input trajectory. We train φ and ψ on conversations collected with 1,283 unique training personas from the PersonaChat dataset. Each persona consists of 3-5 persona facts, which are drawn from a pool of 6,735 unique persona facts. At test time we use a nearest neighbor model to retrieve the top-k relevant persona facts from the pool of 6,735 facts. There are 129 test-set personas (i.e., collections of 3-5 persona facts) that are not present in the training set. Since there is no overlap between the training and testing personas, we are evaluating the identifier network's capability to represent out-of-distribution persona information. We compare the identification model against several baselines:

• Bag-of-Words (BoW): sum of one-hot vectors of the tokens in the dialogue trajectory.
• Bag-of-Embeddings (BoE): sum of GloVe embeddings [27] of dialog tokens.
• LSTM: long short-term memory (LSTM) network [28, 29] over dialog tokens.
• MLP-BERT: feed-forward network trained on averaged sentence-level embeddings obtained from BERT's [30] representation of dialog history.
• MLP-GPT: feed-forward network trained on the last GPT-2 hidden state.

The baseline models (BoW, BoE, LSTM, MLP-BERT, MLP-GPT) are all trained using a binary cross-entropy loss over each of the 6,735 possible persona facts (0 = not present in persona, 1 = present in persona). At test time, the top-k logits of the outputs are used to obtain the relevant personas. We use the following information-retrieval metrics to evaluate each model:

\mathrm{prec@k} = |\hat{P} \cap P| / k, \qquad \mathrm{rec@k} = |\hat{P} \cap P| / |P|,

where |·| denotes the cardinality of a set, P̂ is the set of retrieved persona facts (either based on nearest neighbors or top-k logits), and P is the ground-truth set of persona facts.
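The following is a minimal sketch of how the retrieval metrics above could be computed, pairing a nearest-neighbor lookup over persona-fact embeddings with prec@k and rec@k; the variable names and the use of scikit-learn's NearestNeighbors are illustrative assumptions rather than the evaluation code used in this work.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precision_recall_at_k(retrieved: list, relevant: set, k: int) -> tuple:
    """prec@k = |retrieved_k ∩ relevant| / k, rec@k = |retrieved_k ∩ relevant| / |relevant|."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k, hits / len(relevant)

# Illustrative setup: psi(P) embeddings of the 6,735 candidate persona facts
# and one phi(tau) embedding of a test dialog trajectory.
rng = np.random.default_rng(0)
fact_embeddings = rng.normal(size=(6735, 128))     # stand-in for psi of each fact
trajectory_embedding = rng.normal(size=(1, 128))   # stand-in for phi(tau)

nn = NearestNeighbors(n_neighbors=10).fit(fact_embeddings)
_, idx = nn.kneighbors(trajectory_embedding)       # indices of the 10 nearest facts
retrieved_facts = idx[0].tolist()

ground_truth = {12, 405, 3310, 5021}               # hypothetical persona-fact indices
p5, r5 = precision_recall_at_k(retrieved_facts, ground_truth, k=5)
print(f"prec@5 = {p5:.2f}, rec@5 = {r5:.2f}")
```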
Table 6.6: Performance of various identifier models on observed dialog trajectories from PersonaChat

Model        Prec@1   Prec@5   Rec@5   Rec@10
BoW          33.8     25.3     28.3    49.4
BoE          37.7     26.9     30.1    51.0
LSTM         42.7     29.2     32.7    53.2
BERT         37.6     26.6     29.9    51.1
GPT-2        30.8     24.5     27.3    48.3
Identifier   86.2     58.3     65.3    82.8

Table 6.6 summarizes the results of the various identifier models. Our identifier model clearly outperforms the baselines. Although a wide variety of embedding methods were used to represent the dialog history, their results are quite similar. The key difference appears to be the authentication loss used to train our identifier (Eqn. (6.17)).

6.7.3 Evaluation of Authentication Policies

We answer Q3 by evaluating the performance of the full authentication pipeline based on generated dialog between the authenticator and various input persona models. We fix the PersonaGPT model parameters θ_LM for conditional decoding. We generate synthetic conversations between the authenticator and each of the 129 unseen test-set persona profiles. For each test-set conversation, prec@k and rec@k scores are reported based on the estimated persona (using the learned identifier). We compare with the following baseline policies:

• LM: fine-tuned DialoGPT model without any input persona traits during decoding.
• Persona Model: another persona model with randomly sampled persona profiles.
• Random Policy: uniformly sample a question from the action space at each turn.
• Human Policy: using the aforementioned platform, we collect a second set of 100 human-PersonaGPT conversations where the user is not given the persona traits beforehand. At the end of each conversation, the user selects a ranked list of guesses from a list of 20 candidate persona traits to match the input agent's profile.

Table 6.7 compares the various authentication policies. Interestingly, even the random policy of uniformly sampling actions can be more revealing than non-goal-oriented dialog such as the LM and persona-model policies. In many of the generated conversations between LM-PersonaGPT and PersonaGPT-PersonaGPT pairs, the two models expand upon 1 or 2 topics without ever discussing other topics relevant to their personas. In contrast, by often forcing the input agent to switch topics, the random policy ignores signals of relevant persona information. Meanwhile, we find that our authentication policy strikes a balance between both worlds: it covers more persona traits, as measured by rec@5 and rec@10, while covering at least 1 relevant persona trait in the majority of conversations.

Table 6.7: Comparison of verification policies on various input persona models

Policy    Prec@1   Prec@5   Rec@5   Rec@10
LM        57.4     40.0     45.1    67.4
Persona   69.8     39.1     44.0    63.5
Random    72.9     42.3     48.3    70.2
Human     68.6     56.0     63.0    -
Ours      83.7     53.0     59.9    80.9

For the human policy, we are unable to obtain an accurate rec@10 since a non-trivial number of participants selected fewer than 10 choices out of the 20 candidates. Since human evaluators were instructed to guess the persona beforehand, it appears that some level of goal-orientation can improve the diversity (in terms of persona coverage) of generated conversations. However, our verifier policy is able to discover more effective ways of interaction compared to the non-goal-oriented and human policy baselines. In the next section, we provide snapshots of generated conversations between PersonaGPT and various authentication policies.
In the next section, we provide snapshots of generated conversations between PersonGPT and various authentication policies. 6.7.4 Ablation Study In addition to the PersonaGPT model, we are also interested in the performance of the authenticator policy against other input models. For example, how well does our policy fair against models with less capacity to incorporate persona information? What about against models with lower decoding quality? We generate several synthetic conversations between our authenticator and several variations of persona models: 142 • Full Persona: full persona model. • Weak Persona: persona model with higher nucleus sampling size (𝑝 ∈ [0.30 − 0.8]) [31] to capture less sensible models. • Transition Model: model with either randomly initialized or no persona inputs (defaults to non-personalized decoding). We use the transition model to serve as a baseline in which persona information is not incorporated in the input dialogue agent. Additionally, we include a “weak persona” model baseline, which in corporates persona information but suffers from decreased overall felicity. We randomly sample persona inputs from the full set of 1, 412 personas and report the mean prec@k and rec@k performance across generated conversations. Table 6.8 compares authenticator performance against these persona model variants. As expected, the non-personalized transition model did not conform to given persona profiles, and the authenticator was most affected by the drop in personalization. By contrast, the authenticator was still able to maintain some performance against a much less felicitous persona model. Table 6.8: Authenticator performance against variations of the input persona model Input Model Prec@1 Prec@5 Rec@5 Rec@10 Transition 17.1 17.1 19.4 40.6 Weak Persona 79.8 49.5 55.9 74.1 Full Persona 86.0 53.2 60.0 77.8 6.8 Generated Conversations In this section, we provide snapshot of conversations between human-PersonaGPT with full knowledge of persona (conditional decoding evaluation), human-PersonaGPT conversation without persona knowledge (authentication setting), human-DialoGPT interaction, PersonaGPT self-play, and authenticator-PersonaGPT interaction. Furthermore, we include snapshots from the API interface for the different experimental settings. 143 6.8.0.1 Human-Agent Interactions Figure 6.2 illustrates the set of instructions given to the user when they first start an experiment with the persona model. Note that the actual persona model is randomly selected to be either the baseline model (DialoGPT) or PersonaGPT. In this evaluation setting, the user is given the full persona of its partner beforehand. After 8 turns of conversation, the user is asked to rate the conversation based on Figure 6.2: Experiment 1 (conditional decoding evaluation) instructions. metrics described in the Experiment Section. This interface is shown in Figure 6.3. Table 6.9 provides a point of comparison with human-DialoGPT over the course of an 8-turn conversation. Although DialoGPT is capable of also incorporating persona information, it does not 144 Figure 6.3: Experiment 1 post conversational survey. make the transition between persona facts as frequently. In this case, appearance details (short hair) and pets (living with two dogs) get ignored completely in favor of gym and running. Empirically, we find that both models seem to struggle with negatives (e.g., I hate apples vs. I like apples), particularly when the adjective used are somewhat neutral. 
Table 6.9: An interactive example of human-DialoGPT conversation over 8 turns Role Response Human hello, how are you? DialoGPT i’m good thank you. how about you? Human doing well. where do you go to the gym? DialoGPT the gym i usually go to is the weight room. Human got it. what weights do you lift? DialoGPT i like to lift weights at the weight room. i am in the process of training. Human training for what? DialoGPT i am training for running. i am trying to get better at it. Human got it. i run as well. DialoGPT i’m trying to get better at the weight room, but i’m not sure yet Human not sure about what? DialoGPT i mean i am pretty sure i am not sure. what about you Human i do not lift weights, but i probably should DialoGPT well if you think you are good, then why not try Human i do not own weightlifting equipment DialoGPT do you have a job you could look into? Persona Facts 1. I go to the gym regularly. 2. I’ve short hair. 3. I like doing the treadmill and rowing machine. 4. Two dogs live with me. 145 Table 6.10 shows an example of an 8-turn dialog between a human evaluator and PersonaGPT. We note that PersonaGPT, along with other persona models we have evaluated, seem to struggle with exact numbers. For example, although the persona says 100 years of age, PersonaGPT may give 95 as shown here. Turn 6 shows a situation in which PersonaGPT changes the topic in conversation to a topic relevant to its persona. Its persona enjoys “wine and dancing in the moonlight”, and it seems to steer the conversation toward this topic from the current discussion about age and lifespans. Figure 6.4 shows the instructions for the human policy evaluation. In this setting, the user is asked to Table 6.10: An interactive example of human-PersonaGPT conversation over 8 turns (prior knowledge of persona given, conditional decoding evaluation) Role Response Human hey how are you PersonaGPT i’m great, thanks. and you? Human doing good have to finish a paper, what do you do for a living? PersonaGPT i’m in between jobs right now, but i’m strong for my age. Human and what age would that be PersonaGPT i’m 95 but i’ve always been strong for my age Human 24 but 95 shouldnt you be considering retirement PersonaGPT yes. but i am really into the idea of live forever huh what is your estimated lifespan or Human do you not have biological constraints like that PersonaGPT i’m 95. and you? Human uhh i already told you that i’m 24 PersonaGPT that’s old enough. do you drink wine? Human from time to time only red tho PersonaGPT red wine? you must be quite the drinker. Human uhh idk how you made that leap PersonaGPT because i drink a lot Persona Facts 1. I am very strong for my age. 2. I feel like I might live forever. 3. I am 100 years old. 4. I like to drink wine and dancein the moonlight. guess the persona of their partner without prior knowledge. The question policy is fully determined by the user, and they are asked to rank the persona facts (out of 20 candidates) after 8 turns. The ranking interface is shown in Figure 6.5. Table 6.11 gives an example interaction between a participant and PersonaGPT during human 146 Figure 6.4: Experiment 2 (human policy evaluation) instructions. policy evaluation. In this setting, the user is asked to guess the persona of their partner without prior knowledge. The question policy is fully determined by the user, and they are asked to rank the persona facts (out of 20 candidates) after 8 turns. We find that human policies tend to be closer to non goal-oriented dialog than learned policies. 
Users often ask follow-up questions for clarification, as seen in turns 5 and 8. When a question ventures outside of the persona facts (turn 6), we find that PersonaGPT is able to find interesting answers despite missing the corresponding information in its context. It is interesting to note that PersonaGPT is capable of adapting to misspellings (e.g., “her” instead of “here”) and inconsistent punctuation.

Table 6.11: An interactive example of human-PersonaGPT conversation over 8 turns (persona information not known a priori, user asked to guess)

Role        Response
Human       hey whats poppin
PersonaGPT  hello, i’m just hanging out here, you?
Human       same her got some work to do today
PersonaGPT  what do you do? i’m a finance major
Human       im a phd student in computation biology
PersonaGPT  nice do you like it?
Human       its okay has its ups and downs. what do you think of finance
PersonaGPT  i do not have a whole lot of money right now, but i plan on going back in a few years.
Human       no i meant the major
PersonaGPT  i’m not sure what i want to do after i graduate, but i want to go to business school
Human       ohh how many kids do you want?
PersonaGPT  i’d love to have 10, but it is a lot of work. do you have kids?
Human       nope too young maybe someday
PersonaGPT  i’d like to have kids one day, maybe i’ll get some
Human       like buy them?
PersonaGPT  maybe one day, yeah. what do you do for fun?

Persona Facts:
1. I drive a BMW.
2. I graduated from Yale in 2011 with a degree in finance.
3. I plan to go to business school next year.
4. My parents are very wealthy bankers.

6.8.0.2 PersonaGPT Self-Play

Table 6.12 illustrates an example conversation between two PersonaGPT models contextualized by different persona facts. Although used as a baseline question policy, we notice that simply using another persona model to conduct the conversation does not always lead to revealing conversations about the underlying persona of either agent. For Bot 1, 2 out of the 5 persona facts are discussed throughout the course of the conversation, whereas only 1 persona fact is revealed about Bot 2 (the input persona model to be authenticated). Notice that since Bot 1 is the lead, it drives the conversation toward topics within its own set of persona facts. We notice that this is common in the general case: when another persona model is used as a question policy, it tends to drive the conversation toward its own set of persona facts. Thus, the question policy performs well only insofar as the two sets of persona facts are similar.

Table 6.12: An example of self-play between two randomized PersonaGPT models

Role   Response
Bot 1  hi there, do you like broadway?
Bot 2  i love it! you ever played on broadway?
Bot 1  no i haven’t. but it sounds like fun. where do you work?
Bot 2  i work in retail. how about you?
Bot 1  i am a psychology major. that’s awesome!
Bot 2  oh that sounds like a tough major.
Bot 1  do you have any pets?
Bot 2  yes, i’ve a pet turtle. do you have any pets?
Bot 1  i have one dog. he’s a collie mix.
Bot 2  oh i love dogs, they are so cute.
Bot 1  do you have any other hobbies?
Bot 2  i am an expert on chewing gum.
Bot 1  oh that’s interesting! do you enjoy doing it?
Bot 2  i do, but it does not make me gassy. i could live with that.
Bot 1  it helps with gas. makes sense.
Bot 2  do you go for runs?

Persona 1                                   Persona 2
1. I am a psychology major.                 1. I don’t pick up my toys.
2. I enjoy broadway shows.                  2. I have a pet turtle.
3. I’m a Steelers fan.                      3. I like to play with my dolls.
4. My favorite band is the Avett Brother.   4. My best friend lives next door.

6.8.0.3 Consent

We obtained IRB approval for the data collection. When users register for the experiment, the following agreement and potential-risk statements must be read and signed off on before registration.

You will be asked to engage in an 8-10 turn, short dialog with a chatbot. The chatbot has 4-5 personality traits, which will be revealed to you at the start of the experiment. At the end of the conversation, you will be asked to rate the chatbot on a series of performance metrics.

Purpose of This Experiment. The purpose of this research study is to evaluate the performance of several state-of-the-art open-domain conversational agents developed through various learning algorithms. Automatic evaluation of conversational agents has proven to be difficult, and thus human evaluation is necessary to assess the performance of these agents.

Your rights to participate, say no, or withdraw.
Participation is voluntary: you may choose not to participate at all, you may refuse to participate in certain procedures or answer certain questions, or you may discontinue your participation at any time without consequence.

6.9 Social Impact

Beyond the positive impacts, there are numerous potential avenues for misuse of the proposed technology. We list some notable ones below:

• Mistakes in persona identification can result in mistakes in granting or denying services to persons or groups of persons. For example, persona facts (or sets of facts) on which the persona identifier has higher error rates can lead to poorer access for the affected users.

• Similarly, verifier errors (e.g., poorly chosen questions) on certain actions (e.g., talk about hobbies, talk about travel) may have disproportionately smaller consequences than errors on more sensitive topics (e.g., talk about gender, talk about politics).

• Although the verifier is meant to perform authentication, it can potentially be abused to conduct conversations for the purpose of mining persona information. For example, an application using the verifier could exploit rapport with human users to mine personal information. We did not explore ways to prevent this type of misuse, but future work must focus on counter-measures or methods of prevention against such cases.

• Algorithmic authentication and persona modeling can potentially accelerate the development of human-like dialog generation. Deployments of conversational authentication must therefore carefully consider the extent to which persona modeling increases the capacity of general chatbots to conduct deceptive or exploitative interactions (e.g., impersonation, personalized advertising, political manipulation), as well as their potentially detrimental impact on human labor conditions.

Additionally, note that since the persona models used in our experiments are built from a language model pretrained on large-scale datasets, they have been shown to contain various cultural biases [32, 33]. Finetuning on PersonaChat certainly does not alleviate these issues, as the personas themselves were not curated against such biases. For example, the term “gender” used in this study is defined as gender perceived by the annotators of the PersonaChat dataset. Its interpretation may not generalize to other real-world settings.
6.10 Conclusion and Discussion of Limitations

In this chapter, we proposed an authentication pipeline whose questions increase the mutual information between the dialogue trajectory and an input agent’s underlying persona features. Nonetheless, there are several limitations to our current approach. For example, our approach assumes “good faith”: it cannot handle persona models that intentionally hide their persona characteristics. Additionally, more sophisticated verification should distinguish between direct and indirect expressions of persona. For example, a bot with the persona “I like to tell jokes” may embody the persona through sarcasm rather than through self-description.

Finally, there are some intrinsic limitations of the language modeling approach to generative dialog. In Table 6.10, we illustrated a case in which PersonaGPT fails to portray precise persona facts such as “100 years old” vs. “95 years old”. In general, GPT-based LMs suffer from a trade-off between the sensibility of responses (i.e., lack of repetitive or degenerate responses) and the precision of the decoded responses. This is due to the top-k and nucleus sampling approaches used for decoding, which inject randomness to prevent degenerate responses. The injection of noise into the decoding process intrinsically creates a barrier to precision. Alternative approaches to repetition penalties could potentially eliminate this trade-off.

Additionally, GPT-based models suffer from high time costs at inference time, because auto-regressive decoding with self-attention costs O(K²), where K is the total number of tokens in the dialog history. Empirically, we found that quantization and model compression helped during deployment (<1 second per response on CPU), but the benefit cannot compare to bringing the inference cost down to linear time. In order to scale to long dialogs (e.g., conversations that last 20+ turns), some form of compression of dialog histories (e.g., a dialog-level latent code) may be needed to decrease the dependence on previous tokens.
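To make the source of this precision barrier concrete, the sketch below implements nucleus sampling over a toy next-token distribution. The vocabulary, logits, and threshold are illustrative assumptions, not values from our experiments: even though the exact token (“100”) is the most probable continuation, the renormalized sampling step still emits near-misses such as “95” some of the time.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float) -> int:
    """Sample a token index from the smallest set of tokens whose cumulative
    probability exceeds p (nucleus / top-p sampling)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token up to (and including) the one that pushes the mass past p.
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, 1).item()
    return int(sorted_ids[choice].item())

# Toy next-token distribution (illustrative values, not from our experiments):
# "100" is the single most likely continuation, yet sampling noise can emit "95".
vocab = ["100", "95", "90", "old", "young"]
logits = torch.tensor([2.0, 1.6, 0.4, 0.1, 0.1])

torch.manual_seed(0)
draws = [vocab[nucleus_sample(logits, p=0.8)] for _ in range(10)]
print(draws)  # mostly "100", but "95" (and occasionally "90") still appear
```

Shrinking p recovers precision but reintroduces exactly the repetitive, degenerate behavior that top-k and nucleus sampling are designed to avoid, which is why we regard this as an intrinsic trade-off rather than a tuning problem.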
BIBLIOGRAPHY

[1] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016.

[2] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018.

[3] Haoyu Song, Wei-Nan Zhang, Yiming Cui, Dong Wang, and Ting Liu. Exploiting persona information for diverse generation of conversational responses. arXiv preprint arXiv:1905.12188, 2019.

[4] Zhou Yu, Ziyu Xu, Alan W Black, and Alexander Rudnicky. Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 404–412, 2016.

[5] Anil K Jain, Arun Ross, and Salil Prabhakar. An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):4–20, 2004.

[6] Anil K Jain, Ruud Bolle, and Sharath Pankanti. Biometrics: Personal Identification in Networked Society, volume 479. Springer Science & Business Media, 2006.

[7] David Güera and Edward J Delp. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6. IEEE, 2018.

[8] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.

[9] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.

[10] Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical report, Citeseer, 1999.

[11] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (ConvAI2). In The NeurIPS'18 Competition, pages 187–208. Springer, 2020.

[12] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149, 2019.

[13] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

[14] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.

[15] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[16] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[17] Magnus Rattray, David Saad, and Shun-ichi Amari. Natural gradient descent for on-line learning. Physical Review Letters, 81(24):5461, 1998.

[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[19] Sham Machandranath Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, UCL (University College London), 2003.

[20] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.

[21] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[22] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

[26] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.

[27] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[29] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

[30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[31] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[32] Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.

[33] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.