ADAPTIVE AND AUTOMATED DEEP RECOMMENDER SYSTEMS By Xiangyu Zhao A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science — Doctor of Philosophy 2021 ABSTRACT ADAPTIVE AND AUTOMATED DEEP RECOMMENDER SYSTEMS By Xiangyu Zhao Recommender systems are intelligent information retrieval applications, and have been leveraged in numerous domains such as e-commerce, movies, music, books, and point-of- interests. They play a crucial role in the users’ information-seeking process, and overcome the information overload issue by recommending personalized items (products, services, or information) that best match users’ needs and preferences. Driven by the recent advances in machine learning theories and the prevalence of deep learning techniques, there have been tremendous interests in developing deep learning based recommender systems. They have unprecedentedly advanced effectiveness of mining the non-linear user-item relationships and learning the feature representations from massive datasets, which produce great vitality and improvements in recommendations from both academic and industry communities. Despite above prominence of existing deep recommender systems, their adaptiveness and automation still remain under-explored. Thus, in this dissertation, we study the problem of adaptive and automated deep recommender systems. Specifically, we present our efforts devoted to building adaptive deep recommender systems to continuously update recom- mendation strategies according to the dynamic nature of user preference, which maximizes the cumulative reward from users in the practical streaming recommendation scenarios. In addition, we propose a group of automated and systematic approaches that design deep recommender system frameworks effectively and efficiently from a data-driven manner. More importantly, we apply our proposed models to a variety of real-world recommendation platforms and have achieved promising enhancements of social and economic benefits. Copyright by XIANGYU ZHAO 2021 To my parents for their love and support. iv ACKNOWLEDGMENTS This dissertation would have never been completed without the help, support, and guidance from many great people through my Ph.D. study at Michigan State University in the past four and a half years. First and foremost, I would like to express my sincere gratitude to my advisor Prof. Jiliang Tang for his best mentorship, encouragement, and support during my Ph.D. study. As an advisor, he has worked tirelessly to provide help, advice, guidance, and inspiration to me, which led me to grow into an independent scholar. Besides research, he is also my role model for life, and has taught me the value of kindness, optimism, ambition and responsibility through his words and actions. I could not have imagined having a better advisor and mentor for my Ph.D. study. I would also like to thank my committee members Prof. Pang-Ning Tan, Dr. Jiayu Zhou, Dr. Mi Zhang, and Dr. Dawei Yin, for their continuous advice and insightful comments. In addition, I am thankful to all of my awesome lab-mates from the Data Science and Engineering (DSE) lab at Michigan State University: Tyler Derr, Zhiwei Wang, Yao Ma, Hamid Karimi, Wenqi Fan, Xiaorui Liu, Haochen Liu, Han Xu, Xiaoyang Wang, Jamell Dacon, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Juanhui Li, Harry Shomer, Jie Ren, Jiayuan Ding, Haoyu Han, Hongzhi Wen, Yuxuan Wan, Hua Liu, and Norah Alfadhli. 
I enjoyed all stimulating discussions, exchange of ideas, and staying awake until the last moment of the paper deadlines. During my time in the DSE Lab I have learned so much from you all. I am also thankful for the collaboration and help from outside the DSE Lab: Prof. Jiquan Chen, Prof. Yi Chang, Dr. Charu Aggarwal, Dr. Taiquan Peng, Dr. Weinan Zhang, Dr. Ruiming Tang, Prof. Heng Huang, Dr. Hui Liu, Dr. Li Zhao, Dr. Grace Hui Yang, Dr. Alex v Beutel, Dr. Rui Chen, Dr. Jason Gauci, Dr. Minmin Chen, Dr. Shauki Jain, Dr. Yongfeng Zhang, Dr. Fei Sun, Prof. Jimmy Xiangji Huang, and Yingqiang Ge. Moreover, I appreciate the experience of working with the extraordinary collaborators I met during internships. They are Dr. Dawei Yin, Dr. Long Xia, Dr. Lixin Zou, Dr. Liang Zhang, Dr. Zhaochun Ren, and Zhuoye Ding at JD.com; Dr. Xiwang Yang, Xiaobing Liu, Dr. Chong Wang, Changsheng Gu, Ming Chen, Xudong Zheng, Haoshenglun Zhang, Dr. Taiqing Wang, and Dr. Aonan Zhang at Bytedance; Dr. Bo Long, Dr. Bee-chung Chen, Dr. Huiji Gao, Dr. Weiwei Guo, Dr. Jun Shi and Sida Wang at LinkedIn. Without their precious support it would not be possible to conduct this research. Lastly, I would again like to thank my parents, entire family and girlfriend for their love and support. This dissertation is dedicated to them. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Chapter 2 Page-wise Recommendations . . . . . . . . . . . . . . . . . . . . . 7 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.1 Real-time Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.2 Page-wise Recommendations . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 The Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.1 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.2 Architecture of Actor Framework . . . . . . . . . . . . . . . . . . . . 14 2.2.2.1 Encoder for Initial State Generation Process . . . . . . . . . 15 2.2.2.2 Encoder for Real-time State Generation Process . . . . . . . 16 2.2.2.3 Decoder for Action Generation Process . . . . . . . . . . . . 20 2.2.3 The Architecture of Critic Framework . . . . . . . . . . . . . . . . . 20 2.3 Training and Test Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3.1 The Training Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1.1 Online Training Procedure . . . . . . . . . . . . . . . . . . . 23 2.3.1.2 Offline Training Procedure . . . . . . . . . . . . . . . . . . . 24 2.3.1.3 Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 The Test Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.2.1 Online Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.2.2 Offline Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 
28 2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.2 Performance Comparison for Offline Test . . . . . . . . . . . . . . . . 31 2.4.3 Performance Comparison for Online Test . . . . . . . . . . . . . . . . 33 2.4.4 Effectiveness of Components . . . . . . . . . . . . . . . . . . . . . . . 35 2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Chapter 3 Jointly Recommend and Advertise . . . . . . . . . . . . . . . . . 39 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 vii 3.3 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1 Deep Q-network for Recommendations . . . . . . . . . . . . . . . . . 44 3.3.1.1 The Processing of State and Action Features for RS . . . . . 45 3.3.1.2 The Cascading DQN for RS . . . . . . . . . . . . . . . . . . 46 3.3.1.3 The estimation of Cascading Q-functions . . . . . . . . . . . 47 3.3.2 Deep Q-network for Online Advertising . . . . . . . . . . . . . . . . . 48 3.3.2.1 The Processing of State and Action Features for AS . . . . . 48 3.3.2.2 The Proposed DQN Architecture . . . . . . . . . . . . . . . 49 3.3.2.3 The Action Selection in RTB setting . . . . . . . . . . . . . 51 3.3.3 The Optimization Task . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.4 Off-policy Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.5 Online Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.3 Architecture Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4.4 Overall Performance Comparison . . . . . . . . . . . . . . . . . . . . 59 3.4.5 Component Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.4.6 Parameter Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . 63 3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Chapter 4 User Simulation for Recommendations . . . . . . . . . . . . . . . 66 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2 The Proposed Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.2 The Generator Architecture . . . . . . . . . . . . . . . . . . . . . . . 70 4.2.2.1 The Encoder Component . . . . . . . . . . . . . . . . . . . 71 4.2.2.2 The Decoder Component . . . . . . . . . . . . . . . . . . . 72 4.2.3 The Discriminator Architecture . . . . . . . . . . . . . . . . . . . . . 72 4.2.4 The Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.3.2 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.3.3 RL-based Recommender Training . . . . . . . . . . . . . . . . . . . . 83 4.3.4 Effectiveness of Generator . . . . . . . . . . . . . . . . . . . . . . . . 
84 4.3.5 Component Anslysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.3.6 Parametric Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . 88 4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Chapter 5 Automated Embedding Size Search . . . . . . . . . . . . . . . . . 91 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.2.1 Dimensionality Search . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.2.1.1 Embedding Lookup . . . . . . . . . . . . . . . . . . . . . . . 95 5.2.1.2 Unifying Various Dimensions . . . . . . . . . . . . . . . . . 96 viii 5.2.1.3 Dimension Selection . . . . . . . . . . . . . . . . . . . . . . 97 5.2.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.2.3 Parameter Re-Training . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.2.3.1 Deriving Discrete Dimensions . . . . . . . . . . . . . . . . . 102 5.2.3.2 Model Re-training . . . . . . . . . . . . . . . . . . . . . . . 103 5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3.2 Implement Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.3.4 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.3.5 Efficiency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.3.6 Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.3.7 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Chapter 6 Automated Loss Function Search . . . . . . . . . . . . . . . . . . 114 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.2 The Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 6.2.1 An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6.2.2 Deep Recommender System Network . . . . . . . . . . . . . . . . . . 119 6.2.2.1 Embedding Layer . . . . . . . . . . . . . . . . . . . . . . . . 119 6.2.2.2 Interaction Layer . . . . . . . . . . . . . . . . . . . . . . . . 120 6.2.2.3 MLP Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.2.2.4 Output Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 122 6.2.3 Loss Function Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 6.2.4 Controller Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.2.5 An Optimization Method . . . . . . . . . . . . . . . . . . . . . . . . 125 6.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.3.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.3.4 Overall Performance Comparison . . . . . . . . . . . . . . . . . . . . 130 6.3.5 Transferability Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 6.3.6 Impact of Model Components . . . . . . . . . . . . . . . . . . . . . . 133 6.3.7 Efficiency Study . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 135 6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Chapter 7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 7.1 Dissertation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 7.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 ix LIST OF TABLES Table 2.1: Performance comparison of different components. . . . . . . . . . . . . . . 36 Table 3.1: Statistics of the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Table 3.2: Performance comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Table 4.1: Statistics of the datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Table 4.2: Generator effectiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Table 5.1: Performance comparison of different embedding search methods. . . . . . 107 Table 5.2: Embedding dimensions for Movielens-1m. . . . . . . . . . . . . . . . . . . 111 Table 6.1: Statistics of the datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Table 6.2: Performance comparison of different loss function search methods. . . . . 131 Table 6.3: Impact of model components. . . . . . . . . . . . . . . . . . . . . . . . . . 134 x LIST OF FIGURES Figure 2.1: An example of interactions between recommender systems and users. . . 8 Figure 2.2: Framework architecture selection. . . . . . . . . . . . . . . . . . . . . . . 12 Figure 2.3: Encoder to generate initial state sini . . . . . . . . . . . . . . . . . . . . . 15 Figure 2.4: Encoder to generate current state scur . . . . . . . . . . . . . . . . . . . . 17 Figure 2.5: An illustration of the proposed framework. . . . . . . . . . . . . . . . . . 22 Figure 2.6: Overall performance comparison in offline test. . . . . . . . . . . . . . . . 32 Figure 2.7: Overall performance comparison in online test. . . . . . . . . . . . . . . . 34 Figure 3.1: An example of rec-ads mixed display for one user request. . . . . . . . . 41 Figure 3.2: The agent-user interactions in MDP. . . . . . . . . . . . . . . . . . . . . 44 Figure 3.3: The architecture of cascading DQN for RS. . . . . . . . . . . . . . . . . 47 Figure 3.4: (a)(b) Two conventional DQNs. (c) Overview of the proposed DQN. . . 49 Figure 3.5: The architecture of the proposed DQN for AS. . . . . . . . . . . . . . . . 51 Figure 3.6: Performance comparison of different variants. . . . . . . . . . . . . . . . 62 Figure 3.7: Parameter sensitivity analysis. . . . . . . . . . . . . . . . . . . . . . . . . 63 Figure 4.1: An example of system-user interactions. . . . . . . . . . . . . . . . . . . 67 Figure 4.2: The generator with Encoder-Decoder architecture. . . . . . . . . . . . . . 71 Figure 4.3: The discriminator architecture. . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 4.4: The results of overall performance comparison. . . . . . . . . . . . . . . . 81 Figure 4.5: The training process of RL-based recommenders. . . . . . . . . . . . . . 83 Figure 4.6: The results of component analysis. . . . . . . . . . . . . . . . . . . . . . 87 Figure 4.7: The results of parametric analysis. . . . . . . . . . . . . . . . . . . . . . 88 xi Figure 5.1: The typically DLRS architecture. . . . . . . . . . . . . . . . . . . . . . . 92 Figure 5.2: Overview of the proposed AutoDim framework. . . . . . . . . . . . . . . 
95 Figure 5.3: Embedding lookup method. . . . . . . . . . . . . . . . . . . . . . . . . . 96 Figure 5.4: Linear transformation method to unify various dimensions. . . . . . . . . 97 Figure 5.5: Efficiency analysis of DeepFM on Criteo dataset. . . . . . . . . . . . . . 108 Figure 5.6: Parameter analysis on Movielens-1m dataset. . . . . . . . . . . . . . . . 110 Figure 6.1: Overview of the AutoLoss framework. . . . . . . . . . . . . . . . . . . . 118 Figure 6.2: Architectures of DeepFM and IPNN. . . . . . . . . . . . . . . . . . . . . 119 Figure 6.3: Transferability study results. . . . . . . . . . . . . . . . . . . . . . . . . . 132 Figure 6.4: Efficiency study results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 xii LIST OF ALGORITHMS Algorithm 2.1 Mapping Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Algorithm 2.2 Parameters Online Training for DeepPage with DDPG. . . . . . . 27 Algorithm 2.3 Online Test for DeepPage. . . . . . . . . . . . . . . . . . . . . . . . 28 Algorithm 2.4 Offline Test of DeepPage Framework. . . . . . . . . . . . . . . . . . 29 Algorithm 3.1 Off-policy Training of the RAM Framework. . . . . . . . . . . . . . 56 Algorithm 3.2 Online Test of the RAM Framework. . . . . . . . . . . . . . . . . . 57 Algorithm 4.1 Training Algorithm for the Simulator. . . . . . . . . . . . . . . . . 77 Algorithm 5.1 DARTS based Optimization for AutoDim. . . . . . . . . . . . . . . 101 Algorithm 5.2 The Optimization of DLRS Re-training Process. . . . . . . . . . . 103 Algorithm 6.1 An Optimization Algorithm for AutoLoss via DARTS. . . . . . . . 127 xiii Chapter 1 Introduction 1.1 Motivation and Challenges Recommender systems are intelligent information retrieval applications. They assist users in their information-seeking tasks by suggesting items (products, services, or information) that best fit their needs and preferences. Recommender systems have become increasingly popular in recent years, and have been utilized in a variety of domains, including movies, music, books, search queries, and social tags [107, 108]. Typically, a recommendation procedure can be modeled as interactions between users and recommender agent (RA). It consists of two phases: 1) user model construction and 2) recommendation generation [90]. During the interaction, the recommender agent builds a user model to learn users’ preferences based on users’ personal information or historical behaviors. Then, the recommender agent generates a list of items that best match users’ preferences. Driven by the recent advances in deep learning, there have been increasing interests in developing deep learning based recommender systems (DLRSs) [154, 97, 142]. Architectures of DLRS often mainly consist of three key components: (i) embedding layers that map raw user/items features in a high dimensional space to dense vectors in a low dimensional embedding space, (ii) hidden layers that perform nonlinear transformations to transform the input features, and (iii) output layers that make predictions for specific recommendation 1 tasks (e.g. regression and classification) based on the representations from hidden layers. DLRSs have boosted the recommendation performance because of their capacity of effectively catching the non-linear user-item relationships, and learning the complex abstractions as data representations [154]. Most existing recommender systems are static and hand-crafted in (i) learning recom- mendation policy and (ii) designing DLRS framework. 
First, recommendation policy is the strategy that a recommender system suggests items to users, such as collaborative filtering methods recommend similar items to users of similar tastes [77]. To learn recommendation policy, there are two overarching issues: • Most recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed greedy strategy, which may fail given the dynamic nature of the users’ preferences; • Existing recommendation policies are designed to maximize the immediate reward of recommendations, i.e., to make users purchase the recommended items, while completely overlooking whether these items will lead to more profitable rewards in the long run. Second, DLRS framework consists of all the key components of building a deep recom- mender system, such as data processing (e.g., data cleaning and feature engineering), model selection (e.g., W&D [24] v.s. DeepFM [51]), neural architecture (e.g., embedding, hidden and output layers), optimization (e.g., loss function, optimizer and learning rate) and model evaluation. To design DLRS framework, we face three inherent challenges: • The majority of existing DLRSs are developed based on hand-crafted components, which requires ample expert knowledge of machine learning and recommender systems; 2 • Human error and bias can lead to suboptimal components, which reduces the recommen- dation effectiveness; • Non-trivial time and engineering efforts are usually required to design the task-specific components in different recommendation scenarios. In this dissertation, we propose to design adaptive and automated deep recommender systems to address challenges in the two aforementioned perspectives, i.e., recommendation policy and DLRS framework. To learn adaptive recommendation policy, we will consider the recommendation procedure as sequential interactions between users and recommender agents; and leverage Reinforcement Learning (RL) to automatically learn an optimal recommendation strategy (policy) that maximizes cumulative rewards from users without any specific instructions. Recommender systems based on reinforcement learning have two advantages. First, they are able to continuously update their trial-and-error strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting their users’ dynamic preferences. Second, the models in the system are trained via estimating the present value with delayed rewards under current states and actions. The optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system could identify items with small immediate rewards but making considerable contributions to the rewards for future recommendations. To design automated DLRS framework, the recent development of automated machine learning (AutoML), easing the usage of machine learning tools and designing task-dependent learning models, has become an important and popular area with both practical needs and research values [81]. It has changed the convention of model design from manual to automatic, 3 provides unprecedented opportunities to design deep model components. Inspired by AutoML techniques, we will propose a set of unified and systematic approaches, which design the DLRS framework in an automated and data-driven manner [157, 158, 81, 100]. 
AutoML allows non-experts to build DLRS without requiring them to become experts in machine learning and recommender systems, excludes the possible human error and bias, and significantly reduces the computational and temporal cost. 1.2 Dissertation Contributions The major contribution of this proposal are summarized as follows: • We conduct pioneering research to develop adaptive and automated deep recommender systems from two perspectives: (i) recommendation policy and (ii) DLRS framework, which could dynamically and adaptively enhance the recommendation performance in the long run, and reduce the requirement of expert knowledge and human efforts to design the sophisticated DLRS frameworks; • As users are typically recommended a page of items in real-world recommender systems, we propose a novel page-wise recommendation framework, which leverages deep reinforcement learning to automatically learn the optimal recommendation strategies and optimize a page of items simultaneously; • Most platforms optimize recommending and advertising strategies by different teams separately via different techniques. We propose a two-level deep reinforcement learning framework with novel deep Q-network architectures to jointly optimize the user experience and advertising revenue in online recommender systems; 4 • Directly training and evaluating a new RL-based recommendation algorithm needs to collect users’ feedback in real systems, which is time/effort consuming and could downgrade users’ experiences. We propose to build a user simulator for RL-based recommendations, which can model real users’ behaviors and can be used to pre-train and evaluate new recommendation algorithms before launching them online. • Most recommender systems allocate a unified embedding dimension to all feature fields, which is memory inefficient. We propose a novel AutoML-based framework to automatically assign different embedding dimensions to different feature fields, which can achieve better recommendation performance with much fewer embedding parameters. • Existing recommender systems often leverage a predefined and fixed loss function that could lead to suboptimal recommendation quality and training efficiency. We propose an AutoML-based framework to automatically select the appropriate loss function for different data examples according to their varied convergence behaviors, improving recommendation performance and training efficiency with excellent transferability. 1.3 Dissertation Overview The remainder of this dissertation is organized as follows. In Chapter 2, we introduce an RL-based framework DeepPage for capturing users’ dynamic preferences and learning the item display strategy within a page. Chapter 3 presents a two-level RL framework RAM, where the first level acts as the recommender system to select a subset of items from the large item space, and the second level serves as the advertising system to assign the right advertisement to the right user on the right place. In Chapter 4, we develop a user simulator UserSim based on a Generative Adversarial Network, where the generator captures the underlying 5 distribution of users’ historical logs and generates realistic logs, while the discriminator not only distinguishes real and fake logs but also predicts users’ behaviors. Chapter 5 presents our work on automatically allocating different embedding dimensions to different feature fields according to their contributions in recommendation. 
In Chapter 6, we introduce an AutoLoss framework to select proper loss function for each data example automatically, where we develop a novel controller network to dynamically adjust the loss probabilities in a differentiable manner. Finally, Chapter 7 concludes the dissertation and presents promising future research directions. 6 Chapter 2 Page-wise Recommendations Abstract Recommender systems can mitigate the information overload problem by suggesting users’ personalized items. In real-world recommendations such as e-commerce, a typical interaction between the system and its users is – users are recommended a page of items and provide feedback; and then the system recommends a new page of items. To effectively capture such interaction for recommendations, we need to solve two key problems – (1) how to update recommending strategy according to user’s real-time feedback, and (2) how to generate a page of items with proper display, which pose tremendous challenges to traditional recommender systems. In this chapter, we study the problem of page-wise recommendations aiming to address aforementioned two challenges simultaneously. In particular, we propose a principled approach to jointly generate a set of complementary items and the corresponding strategy to display them in a 2-D page; and propose a novel page-wise recommendation framework based on deep reinforcement learning, DeepPage, which can optimize a page of items with proper display based on real-time feedback from users. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework. 7 Figure 2.1: An example of interactions between recommender systems and users. 2.1 Introduction Recommender systems play an essential role in the information-seeking process and over- come information overload by facilitating the interactions between business platforms and users. Figure 2.1 illustrates a typical example of the interactions between an e-commerce recommender system and a user – each time the system recommends a page of items to the user; next, the user browses these items and provides real-time feedback and then the system recommends a new page of items. This example suggests two key challenges to effectively take advantage of these interactions for e-commerce recommender systems – 1) how to efficiently capture user’s dynamic preference and update recommending strategy according to user’s real-time feedback; and 2) how to generate a page of items with proper display based on user’s preferences. 2.1.1 Real-time Feedback Most existing recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed greedy strategy. However, these 8 approaches may fail to capture the dynamic nature of the users’ preferences, and they become infeasible to efficiently and continuously update their recommending strategies according to user’s real-time feedback. Thus, in this work, we consider the recommendation procedure as sequential interactions between users and the recommender agent; and leverage Reinforcement Learning (RL) to automatically learn the optimal recommendation strategies. Recommender systems based on reinforcement learning have two major advantages. First, they are able to continuously update their strategies based on user’s real-time feedback during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. 
Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users; while the majority of traditional recommender systems are designed to maximize the immediate (short-term) reward of recommendations [116]. Therefore, the system can identify items with small immediate rewards but making significant contributions to the rewards for future recommendations. 2.1.2 Page-wise Recommendations As mentioned in the example, users are typically recommended a page of items. To achieve this goal, we introduce a page-wise recommender system, which is able to jointly (1) generate a set of diverse and complementary items and (2) form an item display strategy to arrange the items in a 2-D page that can lead to maximal reward. Conventional RL methods could recommend a set of items each time. For instance, DQN can recommend a set of items with the highest Q-values according to the current state[93]. However, these approaches recommend items based on the same state, which leads to the recommended items being similar. In practice, a bundling of complementary items may receive higher rewards than recommending all similar items. For instance, in real-time news feed recommendations, a user 9 may want to read diverse topics of interest[153]. In addition, page-wise recommendations need to properly display a set of generated items in a 2-D page. Traditional approaches treat it as a ranking problem, i.e., ranking items into a 1-D list according to the importance of items. In other words, user’s most preferred item is posited at the top of list. However, in e-commerce recommender systems, a recommendation page is a 2-D grid rather than a 1-D list. Also, eye-tracking studies [121] show that rather than linearly scanning a page, users do page chunking, i.e., they partition the 2-D page into chunks, and browse the chunk they prefer more. In addition, the set of items and the display strategy are generated separately; hence they may not be optimal to each other. Therefore, page-wise recommendations need principled approaches to simultaneously generate a set of complementary items and the corresponding display strategy in a 2-D page. 2.1.3 Contributions In this chapter, we tackle the two aforementioned challenges simultaneously by introducing a novel page-wise recommender system based on deep reinforcement learning. We summarize our major contributions as follows: • We introduce a principled approach to generate a set of complementary items and properly display them in one 2-D recommendation page simultaneously; • We propose a page-wise recommendation framework DeepPage, which can jointly optimize a page of items by incorporating real-time feedback from users; • We demonstrate the effectiveness of the proposed framework in a real-world e-commerce dataset and validate the effectiveness of the components in DeepPage for accurate recom- mendations. 10 2.2 The Proposed Framework In this section, we first give an overview of the proposed Actor-Critic based reinforcement learning recommendation framework with notations. Then we present the technical details of components in Actor and Critic, respectively. 
2.2.1 Framework Overview

As mentioned in Section 2.1.1, we model the recommendation task as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal recommendation strategy, which is continuously updated during the interactions and maximizes the expected long-term cumulative reward from users. In practice, conventional RL methods such as POMDP [116] and Q-learning [125] become infeasible as the number of items for recommendation grows. Thus, we leverage Deep Reinforcement Learning [75] with (adapted) artificial neural networks as non-linear approximators to estimate the action-value function in RL. This model-free reinforcement learning method neither estimates the transition probability nor stores a Q-value table, and hence can support a huge number of items in recommender systems.

There are two major challenges when we apply deep reinforcement learning to the studied problem: (a) the large (or even continuous) and dynamic action space (item space), and (b) the computational cost of selecting an optimal action (a set of items). In practice, using only discrete indices to denote items is insufficient, since indices alone cannot capture the relations between different items. One common way is to use extra information to represent items with continuous embeddings [69]. Besides, the action space of a recommender system is dynamic, as items keep arriving and leaving. Moreover, computing the Q-value for all state-action pairs is time-consuming because of the enormous state and action spaces.

Figure 2.2: Framework architecture selection. (a) and (b): two conventional deep Q-learning architectures; (c): the Actor-Critic architecture.

To tackle these challenges, our recommending policy in this chapter builds upon the Actor-Critic framework [124], shown in Figure 2.2 (c). The Actor-Critic architecture is preferred for the studied problem since it is suitable for large and dynamic action spaces, and simultaneously reduces redundant computation compared to the alternative architectures shown in Figures 2.2 (a) and (b). The conventional deep Q-learning architecture shown in Figure 2.2 (a) takes only the state as input and outputs the Q-values of all actions. This architecture is suitable for scenarios with a high-dimensional state space and a small, fixed action space, such as Atari [93], but it cannot handle the large and dynamic action spaces of recommender systems. We also cannot leverage the second conventional deep Q-learning architecture, shown in Figure 2.2 (b), because of its time complexity. This architecture takes a state-action pair as input and outputs the corresponding Q-value, making use of the optimal action-value function Q^*(s, a), i.e., the maximum expected return achievable by the optimal policy, which should follow the Bellman equation [6]:

Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]    (2.1)

In practice, selecting an optimal a' requires |A| evaluations for the inner operation \max_{a'}. In other words, this architecture computes the Q-value for every a' ∈ A separately and then selects the maximal one, which prevents Eq. (2.1) from being adopted in practical recommender systems.
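To make the computational contrast concrete, the sketch below compares the two designs on a toy problem: architecture (b) must score every candidate action, whereas an actor produces an action in a single forward pass that a critic then scores once. The networks, dimensions, and catalog size are illustrative assumptions, not the DeepPage architecture itself.

```python
import torch
import torch.nn as nn

# Toy illustration (not the DeepPage model): why Figure 2.2(b) does not scale.
state_dim, item_dim, n_items = 32, 50, 100_000
state = torch.randn(1, state_dim)
catalog = torch.randn(n_items, item_dim)          # hypothetical item embeddings

# Architecture (b): Q(s, a) must be evaluated once per candidate action.
q_net = nn.Sequential(nn.Linear(state_dim + item_dim, 64), nn.ReLU(), nn.Linear(64, 1))
scores = q_net(torch.cat([state.expand(n_items, -1), catalog], dim=1))  # |A| evaluations
best_item = scores.argmax()

# Actor-Critic (c): the actor maps the state to a proto-action directly,
# and the critic scores only that single state-action pair.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, item_dim))
critic = nn.Sequential(nn.Linear(state_dim + item_dim, 64), nn.ReLU(), nn.Linear(64, 1))
proto_action = actor(state)                        # one forward pass, independent of |A|
q_value = critic(torch.cat([state, proto_action], dim=1))
```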
In the Actor-Critic framework, the Actor takes the current state s as input and aims to output a deterministic action, i.e., to recommend a deterministic page of M items: s → a = {a_1, · · · , a_M}. The Critic then takes only this state-action pair as input, rather than all potential state-action pairs, which avoids the aforementioned computational cost:

Q(s, a) = \mathbb{E}_{s'}\left[ r + \gamma Q(s', a') \mid s, a \right]    (2.2)

where the Q-value function Q(s, a) is a judgment of whether the selected action matches the current state, i.e., whether the recommendations match the user's preference. Finally, according to the judgment from the Critic, the Actor updates its parameters in a direction that boosts recommendation performance, so as to output better actions in the next iteration.

With the above design intuitions, we formally define the tuple of five elements (S, A, P, R, γ) of the MDP as follows:

• State space S: A state s ∈ S is defined as the user's current preference, which is generated based on the user's browsing history, i.e., the items that the user browsed and her corresponding feedback.
• Action space A: An action a = {a_1, · · · , a_M} ∈ A is to recommend a page of M items to a user based on the current state s.
• Reward R: After the RA takes an action a at state s, i.e., recommends a page of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives an immediate reward r(s, a) according to the user's feedback.
• Transition P: The transition p(s'|s, a) defines the state transition from s to s' when the RA takes action a.
• Discount factor γ: γ ∈ [0, 1] defines the discount factor when we measure the present value of future rewards. In particular, when γ = 0, the RA considers only the immediate reward; when γ = 1, all future rewards are counted fully into that of the current action.

Specifically, we model the recommendation task as an MDP in which a recommender agent (RA) interacts with the environment E (i.e., users) over a sequence of time steps. At each time step, the RA takes an action a ∈ A according to E's state s ∈ S and receives a reward r(s, a) (i.e., the RA recommends a page of items according to the user's current preference and receives the user's feedback). As a consequence of action a, the environment E updates its state to s' with transition p(s'|s, a). The goal of reinforcement learning is to find a recommendation policy π : S → A that maximizes the cumulative reward for the recommender system. Next, we elaborate on the Actor and Critic architectures for the proposed framework.

2.2.2 Architecture of Actor Framework

The Actor is designed to generate a page of recommendations according to the user's preference, which needs to tackle three challenges: 1) setting an initial preference at the beginning of a new recommendation session, 2) learning the real-time preference in the current session, which should capture both the dynamic nature of the user's preference in the current session and the spatial item display patterns in a page, and 3) jointly generating a set of recommendations and displaying them in a 2-D page. To address these challenges, we propose an Actor framework with an Encoder-Decoder architecture.

2.2.2.1 Encoder for Initial State Generation Process

Figure 2.3: Encoder to generate the initial state s^ini.

Figure 2.3 illustrates the model for generating the initial preference. We introduce an RNN with Gated Recurrent Units (GRU) to capture users' sequential behaviors as the user's initial preference. The inputs of the GRU are the user's last clicked/ordered items {e_1, · · · , e_N} (sorted in chronological order) before the current session, while the output is a vector representation of the user's initial preference. The inputs {e_1, · · · , e_N} are dense, low-dimensional vector representations of items.¹ We add an item-embedding layer that transforms e_i into a low-dimensional dense vector via E_i = tanh(W_E e_i + b_E) ∈ R^{|E|}, where we use the "tanh" activation function since e_i ∈ (−1, +1). We leverage GRU rather than Long Short-Term Memory (LSTM) because GRU outperforms LSTM in capturing users' sequential preferences in recommendation tasks [55].

¹ These item representations are pre-trained by the company using users' browsing history, i.e., each item is treated as a word and the clicked items in one recommendation session as a sentence, and item representations are trained via word embedding [69]. The effectiveness of these item representations is validated by their business applications such as searching, ranking, bidding, and recommendations.

Unlike LSTM, which uses an input gate i_t and a forget gate f_t to generate a new state, GRU utilizes an update gate z_t:

z_t = \sigma(W^z E_t + U^z h_{t-1})    (2.3)

GRU leverages a reset gate r_t to control the input of the former state h_{t-1}:

r_t = \sigma(W^r E_t + U^r h_{t-1})    (2.4)

Then the activation of GRU is a linear interpolation between the previous activation h_{t-1} and the candidate activation \hat{h}_t:

h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t    (2.5)

where the candidate activation \hat{h}_t is computed as:

\hat{h}_t = \tanh\left[ W E_t + U (r_t \cdot h_{t-1}) \right]    (2.6)

We use the final hidden state h_t as the representation of the user's initial state s^{ini} at the beginning of the current recommendation session:

s^{ini} = h_t.    (2.7)
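As a concrete reference for Section 2.2.2.1, the following minimal sketch encodes the last N clicked/ordered items with a shared embedding layer and a GRU, returning the final hidden state as s^ini. It follows |E| = 50 and N = 10 from Section 2.4.1, but the hidden size and class name are illustrative assumptions, not the exact DeepPage implementation.

```python
import torch
import torch.nn as nn

class InitialStateEncoder(nn.Module):
    """Sketch of Section 2.2.2.1: embed the last N clicked/ordered items and
    run a GRU over them; the final hidden state serves as s_ini (Eq. 2.7)."""
    def __init__(self, item_dim=50, emb_dim=50, hidden_dim=64):
        super().__init__()
        self.item_embedding = nn.Linear(item_dim, emb_dim)   # E_i = tanh(W_E e_i + b_E)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, items):                 # items: (batch, N, item_dim)
        E = torch.tanh(self.item_embedding(items))
        _, h_n = self.gru(E)                  # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                 # s_ini: (batch, hidden_dim)

s_ini = InitialStateEncoder()(torch.randn(2, 10, 50))   # N = 10 as in Section 2.4.1
```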
The inputs of GRU are user’s last clicked/ordered items {e1 , · · · , eN } (sorted in chronological order) before the current session, while the output is the representation of users’ initial preference by a vector. The input {e1 , · · · , eN } is dense and low-dimensional vector representations of items 1 . We add an item-embedding layer to transform ei into a low-dimensional dense vector via E i = tanh(WE ei + bE ) ∈ R|E| where we use “tanh” activate function since ei ∈ (−1, +1). We leverage GRU rather than Long Short-Term Memory (LSTM) because that GRU outperforms LSTM for capturing users’ sequential preference in recommendation task [55]. 1 These item representations are pre-trained using users’ browsing history by a company, i.e. each item is treated as a word and the clicked items in one recommendation session as a sentence, and item representations are trained via word embedding[69]. The effectiveness of these item representations is validated by their business such as searching, ranking, bidding and recommendations. 15 Unlike LSTM using input gate it and forget gate ft to generate a new state, GRU utilizes an update gate zt : zt = σ(W z E t + U z ht−1 ) (2.3) GRU leverages a reset gate rt to control the input of the former state ht−1 : rt = σ(W r E t + U r ht−1 ) (2.4) Then the activation of GRU is a linear interpolation between the previous activation ht−1 and the candidate activation ĥt : ht = (1 − zt )ht−1 + zt ĥt (2.5) where candidate activation function ĥt is computed as: ĥt = tanh[W E t + U (rt · ht−1 )] (2.6) We use the final hidden state ht as the representation of the user’s initial state sini at the beginning of current recommendation session as: sini = ht . (2.7) 2.2.2.2 Encoder for Real-time State Generation Process Figure 2.4 illustrates the model to generate real-time preference in current session. In the page-wise recommender system, the inputs {x1 , · · · , xM } for each recommendation page are 16 scur Output Layer Attention Layer α1 α2 αT sini GRU Layer h1 h2 ··· ··· hT p1 CNN Layer P1 X1 X2 ··· Page Layer ··· ··· ··· ··· ··· ··· XM −1 XM Xi Embedding Layer ··· ··· Input Layer e i ci f i Figure 2.4: Encoder to generate current state scur . the representations of the items in the page and user’s corresponding feedback, where M is the size of a recommendation page and xi is a tuple as: xi = (ei , ci , f i ), (2.8) where ei is the aforementioned item representation. To assist the RA in capturing user’s pref- erence among different categories of items and generating complementary recommendations, we incorporate item’s category ci . The item’s category ci is an one-hot indicator vector where ci (i) = 1 if this item belongs to the ith category and other entities are zero. The one-hot indicator vector is extremely sparse and high-dimensional; hence we add a category-embedding layer transforming ci into a low-dimensional dense vector C i = tanh(WC ci + bC ) ∈ R|C| . In addition to information from items, ei and ci , we also want to capture user’s interests or feedback in current recommendation page. Thus, we introduce user’s feedback vector f i , which is a one-hot vector to indicate user’s feedback for item i, i.e., skip/click/order. Similarly, 17 we transform f i into a dense vector F i = tanh(WF f i + bF ) ∈ R|F | via the embedding layer. 
Then, we reshape the transformed item representations {X_1, · · · , X_M} into their original arrangement in the page. In other words, we arrange the item representations of one page P^t as a 2-D grid, similar to an image. For instance, if one recommendation page has h rows and w columns (M = h × w), we obtain an h × w|X| matrix P^t. To learn an item spatial display strategy within a page that leads to maximal reward, we introduce a Convolutional Neural Network (CNN). CNN is a successful architecture in computer vision applications because of its capability to apply various learnable kernel filters on images to discover complex spatial correlations [67]. Hence, we utilize a 2D-CNN followed by fully connected layers to learn the optimal item display strategy:

p_t = conv2d(P^t),    (2.10)

where p_t is a low-dimensional dense vector representing the information from the items and the user's feedback in page P^t, as well as the spatial patterns of the item display strategy of page P^t.

Next, we feed {p_1, · · · , p_T} into another RNN with Gated Recurrent Units (GRU) to capture the user's real-time preference in the current session. The architecture of this GRU is similar to the one in Section 2.2.2.1, but we utilize the final hidden state s^ini from Section 2.2.2.1 as the initial state of the current GRU. Furthermore, to capture the user's real-time preference in the current session, we employ an attention mechanism [3], which allows the RA to adaptively focus on and linearly combine different parts of the input sequence:

s^{cur} = \sum_{t=1}^{T} \alpha_t h_t    (2.11)

where the weighting factors α_t determine which parts of the input sequence should be emphasized or ignored when making predictions. Here we leverage a location-based attention mechanism [89], where the weighting factors are computed from the target hidden state h_t as follows:

\alpha_t = \frac{\exp(W_\alpha h_t + b_\alpha)}{\sum_j \exp(W_\alpha h_j + b_\alpha)}    (2.12)

This GRU with attention mechanism is able to dynamically select the more important inputs, which is helpful for capturing the user's real-time preference in the current session. Note that: 1) the length of this GRU is flexible and follows that of the current recommendation session; after each user-agent interaction, i.e., after the user browses one page of generated recommendations and gives feedback to the RA, we can add one more GRU unit and use this page of items, the corresponding categories, and the feedback as the input of the new GRU unit; 2) in fact, the two RNNs in Section 2.2.2.1 and Section 2.2.2.2 can be integrated into one RNN; we describe them separately to clearly illustrate their architectures, and validate their effectiveness in Section 2.4.4.
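The following sketch strings together Eqs. (2.10)-(2.12): a 2D-CNN summarizes each page into p_t, a GRU initialized with s^ini consumes the page sequence, and location-based attention pools the hidden states into s^cur. Layer widths, the kernel size, and the single-channel page layout are assumptions, not the exact DeepPage configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RealTimeStateEncoder(nn.Module):
    """Sketch of Section 2.2.2.2: conv2d per page (Eq. 2.10), GRU over pages,
    and location-based attention pooling (Eqs. 2.11-2.12) into s_cur."""
    def __init__(self, h=5, w=2, x_dim=100, page_dim=64, hidden_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # learnable spatial filters
        self.fc = nn.Linear(8 * h * w * x_dim, page_dim)        # p_t = conv2d(P_t)
        self.gru = nn.GRU(page_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)                    # W_alpha, b_alpha

    def forward(self, pages, s_ini):
        # pages: (batch, T, h, w * x_dim); s_ini: (batch, hidden_dim)
        b, T = pages.shape[0], pages.shape[1]
        z = self.conv(pages.reshape(b * T, 1, pages.shape[2], pages.shape[3]))
        p = self.fc(z.reshape(b * T, -1)).reshape(b, T, -1)     # page vectors p_1..p_T
        h_seq, _ = self.gru(p, s_ini.unsqueeze(0))              # initialize with s_ini
        alpha = F.softmax(self.attn(h_seq), dim=1)              # location-based attention
        return (alpha * h_seq).sum(dim=1)                       # s_cur (Eq. 2.11)

s_cur = RealTimeStateEncoder()(torch.randn(2, 3, 5, 200), torch.randn(2, 64))
```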
2.2.2.3 Decoder for Action Generation Process

In this subsection, we propose the action a^cur generation process, which generates (recommends) a new page of items to users. In other words, given the user's current preference s^cur, we aim to recommend a page of items and display them properly so as to maximize the reward. This is the inverse of what the convolutional layer does. Hence, we use a Deconvolutional Neural Network (DeCNN) [147] to restore one page from the low-dimensional representation s^cur. It provides a sophisticated and automatic way to generate a page of recommendations with the corresponding display:

a^{cur} = deconv2d(s^{cur}).    (2.13)

Note that: 1) the sizes of a^cur and P are different, since a^cur only contains the item embeddings E_i, while P also contains the category embeddings C_i and feedback embeddings F_i. For instance, if one recommendation page has h rows and w columns (M = h × w), P is an h × w|X| matrix, while a^cur is an h × w|E| matrix; 2) the generated item embeddings in a^cur may not be in the real item-embedding set, so we need to map them to valid item embeddings, as described in later sections.

2.2.3 The Architecture of Critic Framework

The Critic is designed to leverage an approximator to learn an action-value function Q(s, a), which is a judgment of whether the action (a recommendation page) a generated by the Actor matches the current state s. Note that we write "s" for s^cur from the last subsection for simplicity. Then, according to Q(s, a), the Actor updates its parameters in the direction of improving performance, so as to generate proper actions (recommendations) in the following iterations.

Thus we need to feed the user's current state s and the action a (a recommendation page) into the Critic. To generate the user's current state s, the RA follows the same strategy as in Eq. (2.3) to Eq. (2.11), which uses embedding layers, a 2D-CNN, and a GRU with attention mechanism to capture the user's current preference. For the action a, because a^cur generated in Eq. (2.13) is a 2-D matrix similar to an image, we utilize the same strategy as in Eq. (2.10), a 2D-CNN, to compress a^cur into a low-dimensional dense vector a:

a = conv2d(a^{cur}).    (2.14)

Then the RA concatenates the current state s and action a, and feeds them into a Q-value function Q(s, a). In real recommender systems, the state and action spaces are enormous, so estimating the action-value function Q(s, a) for each state-action pair is infeasible. In addition, many state-action pairs may never appear in the real trace, which makes it hard to update their values. Therefore, it is more flexible and practical to use an approximator function to estimate the action-value function. In practice, the action-value function is usually highly non-linear, so we choose deep neural networks as approximators. In this work, we refer to the neural network approximator as a deep Q-value function (DQN).
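To make the decoder and Critic concrete, here is a hedged sketch of Eq. (2.13) (a transposed convolution restoring an h × w|E| page from s^cur) and of Section 2.2.3 (a 2D-CNN compressing the page as in Eq. (2.14), followed by an MLP Q-head). All layer sizes are illustrative assumptions rather than the exact DeepPage configuration.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of Eq. (2.13): a_cur = deconv2d(s_cur). A seed feature map is
    upsampled by a transposed convolution into an h x (w*|E|) page."""
    def __init__(self, hidden_dim=64, h=5, w=2, emb_dim=50):
        super().__init__()
        self.h, self.seed_w = h, (w * emb_dim) // 4
        self.fc = nn.Linear(hidden_dim, 16 * h * self.seed_w)
        self.deconv = nn.ConvTranspose2d(16, 1, kernel_size=(1, 4), stride=(1, 4))

    def forward(self, s_cur):                              # (batch, hidden_dim)
        seed = self.fc(s_cur).reshape(-1, 16, self.h, self.seed_w)
        return self.deconv(seed)                           # (batch, 1, h, w*emb_dim)

class Critic(nn.Module):
    """Sketch of Section 2.2.3: compress the page a_cur with a 2D-CNN (Eq. 2.14),
    concatenate with the state, and output Q(s, a)."""
    def __init__(self, hidden_dim=64, h=5, w=2, emb_dim=50, act_dim=32):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3, padding=1)
        self.compress = nn.Linear(4 * h * w * emb_dim, act_dim)   # a = conv2d(a_cur)
        self.q = nn.Sequential(nn.Linear(hidden_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s_cur, a_cur):
        a = self.compress(self.conv(a_cur).flatten(start_dim=1))
        return self.q(torch.cat([s_cur, a], dim=1))

s = torch.randn(2, 64)
page = Decoder()(s)          # proto-action a_cur
q = Critic()(s, page)        # Q(s, a)
```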
2.3 Training and Test Procedure

In this section, we discuss the training and test procedures. We propose two settings, i.e., online-policy and off-policy, to train and test the proposed framework based on the online environment and offline historical data, respectively. Off-policy training is necessary because the proposed framework should be pre-trained and evaluated offline before being launched online, to ensure the quality of the recommendations and to mitigate possible negative impacts on user experience. After the offline stage, we can apply the framework online, and the framework can then continuously improve its strategies during the interactions with users.

2.3.1 The Training Procedure

As aforementioned in Section 2.2.2, we map the user's preference s^cur to a new page of recommendations (a^cur). In a page of M items, a^cur contains the item embeddings of M items, i.e., {e_1, · · · , e_M}. However, a^cur is a proto-action, because a generated item embedding e_i ∈ a^cur may not be in the existing item-embedding space I. Therefore, we need to map the proto-action a^cur_pro to a valid action a^cur_val such that e_i ∈ I for all e_i ∈ a^cur_val. With this modification, an illustration of the proposed Actor-Critic recommending framework is shown in Figure 2.5, where we omit the encoders of the state generation part.

Figure 2.5: An illustration of the proposed framework.

2.3.1.1 Online Training Procedure

When we train the proposed framework in the online environment, the RA can interact with users by sequentially choosing recommended items over a sequence of time steps, and is thus able to receive real-time feedback for the recommended items from users. In this setting, for each e_i in a^cur_pro, we select the most similar e ∈ I as the valid item embedding in a^cur_val. In this work, we use cosine similarity:

e_i = \arg\max_{e \in I} \frac{e_i^\top e}{\|e_i\|\,\|e\|} = \arg\max_{e \in I} \; e_i^\top \cdot \frac{e}{\|e\|}    (2.15)

To decrease the computational cost of Eq. (2.15), we pre-compute e/‖e‖ for all e ∈ I and adopt an item-recalling mechanism to reduce the number of relevant items.²

² In general, the user's preference in the current session should be related to the user's last clicked/ordered items before the current session (say L). Thus, for each item in L, we collect a number of the most similar items in terms of cosine similarity from the whole item space, and combine all collected items as the initial item-embedding space I of the current recommendation session. During the current session, when a user clicks or orders an item, we also add a number of its most similar items into the item-embedding space I.

Algorithm 2.1: Mapping Algorithm.
Input: User's browsing history s^h, item-embedding space I, the size of recommendation page M.
Output: Valid recommendation page a^cur_val.
1: Generate proto-action a^cur_pro according to Eq. (2.3) to Eq. (2.11)
2: for m = 1, M do
3:   Select the most similar item e_m according to Eq. (2.15)
4:   Add item e_m into a^cur_val (at the same location as e_m in a^cur_pro)
5:   Remove item e_m from I
6: end for
7: return a^cur_val

We present the mapping algorithm in Algorithm 2.1. The Actor first generates the proto-action a^cur_pro (line 1). For each e_m in a^cur_pro, the RA selects the most similar item in terms of cosine similarity (line 3), and then adds this item into a^cur_val at the same position as e_m in a^cur_pro (line 4). Finally, the RA removes this item from the item-embedding space (line 5), which prevents recommending the same item repeatedly within one recommendation page. Then the RA recommends the new recommendation page a^cur_val to the user and receives the immediate feedback (reward) from the user. The reward r is the summation of the rewards of all items in this page:

r = \sum_{m=1}^{M} reward(e_m)    (2.16)
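A minimal sketch of the mapping step in Algorithm 2.1 and Eq. (2.15) follows: each proto item embedding is matched to its most cosine-similar valid item, pre-computing e/‖e‖ once and removing chosen items so that no item repeats within a page. The catalog and function names are hypothetical.

```python
import numpy as np

def map_to_valid_page(proto_page, item_space):
    """Sketch of Algorithm 2.1 / Eq. (2.15).
    proto_page: (M, |E|) proto-action; item_space: dict {item_id: embedding}."""
    # Pre-compute e / ||e|| once for all candidates, as suggested for Eq. (2.15).
    ids = list(item_space.keys())
    normed = np.stack([item_space[i] / np.linalg.norm(item_space[i]) for i in ids])
    valid_page, available = [], np.ones(len(ids), dtype=bool)
    for e_proto in proto_page:
        scores = normed @ e_proto                 # proportional to cosine similarity
        scores[~available] = -np.inf              # already placed on this page
        best = int(np.argmax(scores))
        valid_page.append(ids[best])
        available[best] = False
    return valid_page

catalog = {i: np.random.randn(50) for i in range(1000)}   # toy item-embedding space I
page = map_to_valid_page(np.random.randn(10, 50), catalog)
```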
2.3.1.2 Offline Training Procedure

When we use the user's historical browsing data to train the proposed Actor-Critic framework, the user's browsing history, the new recommendation page a^cur_val, and the user's corresponding feedback (reward) r are given in the data. Thus, there is a gap between a^cur_pro and a^cur_val, i.e., no matter what proto-action a^cur_pro the Actor outputs, the valid action a^cur_val is fixed. This disconnects the Actor and the Critic. From existing work [75, 32] and Section 2.3.1.1, we learn that a^cur_pro and a^cur_val should be similar, which is the prerequisite for connecting the Actor and the Critic during training. Thus, we choose to minimize the difference between a^cur_pro and a^cur_val:

\min_{\theta^\pi} \sum_{b=1}^{B} \left\| a^{cur}_{pro} - a^{cur}_{val} \right\|_F^2    (2.17)

where B is the batch size of samples in each iteration of SGD. Eq. (2.17) updates the Actor's parameters in the direction of pushing a^cur_pro and a^cur_val to be similar. In each iteration, given the user's browsing history and the new recommendation page a^cur_val, the RA generates the proto-action a^cur_pro and then minimizes the difference between a^cur_pro and a^cur_val, which connects the Actor and the Critic. Next, we can follow conventional methods to update the parameters of the Actor and the Critic. The reward r is the summation of the rewards of all items in page a^cur_val.

2.3.1.3 Training Algorithm

In this work, we utilize the DDPG [75] algorithm to train the parameters of the proposed Actor-Critic framework. The Critic can be trained by minimizing a sequence of loss functions L(θ^μ):

L(\theta^\mu) = \mathbb{E}_{s,a,r,s'}\left[ \left( r + \gamma Q_{\theta^{\mu'}}(s', f_{\theta^{\pi'}}(s')) - Q_{\theta^\mu}(s, a) \right)^2 \right]    (2.18)

where θ^μ represents all parameters of the Critic. The Critic is trained from samples stored in a replay buffer [94]. Actions stored in the replay buffer are generated from the valid action a^cur_val, i.e., a = conv2d(a^cur_val). This allows the learning algorithm to leverage the information of which action was actually executed to train the Critic [32]. The first term y = r + γ Q_{θ^{μ'}}(s', f_{θ^{π'}}(s')) in Eq. (2.18) is the target for the current iteration. The parameters from the previous iteration, θ^{μ'}, are fixed when optimizing the loss function L(θ^μ). In practice, it is often computationally efficient to optimize the loss function by stochastic gradient descent, rather than computing the full expectations in the gradient. The derivatives of the loss function L(θ^μ) with respect to the parameters θ^μ are:

\nabla_{\theta^\mu} L(\theta^\mu) = \mathbb{E}_{s,a,r,s'}\left[ \left( r + \gamma Q_{\theta^{\mu'}}(s', f_{\theta^{\pi'}}(s')) - Q_{\theta^\mu}(s, a) \right) \nabla_{\theta^\mu} Q_{\theta^\mu}(s, a) \right]    (2.19)

We update the Actor using the policy gradient:

\nabla_{\theta^\pi} f_{\theta^\pi} \approx \mathbb{E}_{s}\left[ \nabla_{\hat{a}} Q_{\theta^\mu}(s, \hat{a}) \, \nabla_{\theta^\pi} f_{\theta^\pi}(s) \right]    (2.20)

where â = f_{θ^π}(s), i.e., â is generated from the proto-action a^cur_pro (â = conv2d(a^cur_pro)). Note that the proto-action a^cur_pro is the actual action output by the Actor, which guarantees that the policy gradient is taken at the actual output of the policy f_{θ^π} [32].

The online training algorithm for the proposed framework DeepPage is presented in Algorithm 2.2. Each iteration consists of two stages, i.e., 1) the transition generating stage (lines 7-10) and 2) the parameter updating stage (lines 11-16). For the transition generating stage (line 7): given the current state s_t, the RA first recommends a page of items a_t according to Algorithm 2.1 (line 8); then the RA observes the reward r_t and updates the state to s_{t+1} (line 9); and finally the RA stores the transition (s_t, a_t, r_t, s_{t+1}) into the replay buffer D (line 10). For the parameter updating stage (line 11): the RA samples a mini-batch of transitions (s, a, r, s') from D (line 12), and then updates the parameters of the Actor and the Critic (lines 13-16) following a standard DDPG procedure [75].

The offline training procedure is similar to Algorithm 2.2, with two differences: 1) in line 8, offline training follows the off-policy b(s_t), and 2) before line 13, offline training first minimizes the difference between a^cur_pro and a^cur_val according to Eq. (2.17). In the algorithm, we introduce widely used techniques to train our framework. For example, we utilize experience replay [76] (line 12) and introduce separate evaluation and target networks [93] (lines 2, 16), which help smooth the learning and avoid divergence of the parameters. For the soft updates of the target networks (line 16), we use τ = 0.01. Moreover, we leverage a prioritized sampling strategy [96] to help the framework learn from the most important historical transitions.
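Before turning to the full procedure in Algorithm 2.2, the following sketch shows one DDPG update corresponding to Eqs. (2.18)-(2.20) plus the soft target update, with tiny stand-in actor/critic networks; dimensions, learning rates, and the optimizer choice are assumptions rather than the settings used in the experiments.

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch of one DDPG update (Eqs. 2.18-2.20) with soft target updates.
state_dim, action_dim, gamma, tau = 16, 8, 0.95, 0.01
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next):
    # Critic update: regress Q(s,a) toward y = r + gamma * Q'(s', f'(s'))  (Eqs. 2.18-2.19)
    with torch.no_grad():
        y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: deterministic policy gradient through the critic (Eq. 2.20)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'
    for net, target in [(actor, actor_target), (critic, critic_target)]:
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

batch = 32
ddpg_step(torch.randn(batch, state_dim), torch.randn(batch, action_dim),
          torch.randn(batch, 1), torch.randn(batch, state_dim))
```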
For example, we utilize a technique known as experience replay [76] (lines 12), and introduce separated evaluation and target networks [93] (lines 2,16), which can help smooth the learning and avoid the divergence of parameters. For the soft updates of target networks (lines 16), we used τ = 0.01. Moreover, we leverage prioritized sampling strategy [96] to assist the framework learning from the most important historical transitions. 26 Algorithm 2.2: Parameters Online Training for DeepPage with DDPG. 1: Initialize actor network fθπ and critic network Qθµ with random weights 0 0 2: Initialize target network f π 0 and Q µ0 with weights θ π ← θ π , θ µ ← θ µ θ θ 3: Initialize the capacity of replay buffer D 4: for session = 1, G do 5: Receive initial observation state s1 6: for t = 1, T do 7: Stage 1: Transition Generating Stage 8: Select an action at according to Alg. 2.1 (policy fθπ ) 9: Execute action at and observe the reward rt according to Eq. (2.16) and new state st+1 according to Section 2.2.2.2 10: Store transition (st , at , rt , st+1 ) in D 11: Stage 2: Parameter Updating Stage 12: Sample minibatch of N transitions (s, a, r, s0 ) from D 13: Set y = r + γQ µ0 (s0 , f π0 (s0 )) θ θ 2 Update Critic by minimizing N1 P 14: n y − Qθµ (s, a) according to Eq. (2.19) 15: Update Actor using the sampled policy gradient according to Eq. (2.20) 16: Update the target networks: 0 0 θµ ← τ θµ + (1 − τ )θµ 0 0 θπ ← τ θπ + (1 − τ )θπ 17: end for 18: end for 27 Algorithm 2.3: Online Test for DeepPage. 1: Initialize Actor with the trained parameters Θπ 2: Receive initial observation state s1 3: for t = 1, T do 4: Execute an action at according to Alg. 2.1 (policy fΘπ ) 5: Observe the reward rt from user according to Eq. (2.16) 6: Observe new state st+1 according to Section 2.2.2.2 7: end for 2.3.2 The Test Procedure After the training stage, the proposed framework learns parameters Θπ and Θµ . Now we formally present the test procedure of the proposed framework DeepPage. We design two test methods, i.e., 1) Online test: to test DeepPage in online environment where RA interacts with users and receives real-time feedback for the recommended items from users, and 2) Offline test: to test DeepPage based on user’s historical browsing data. 2.3.2.1 Online Test The online test algorithm in one recommendation session is presented in Algorithm 2.3. The online test procedure is similar to the transition generating stage in Algorithm 2.2. In each iteration of the recommendation session, given the current state st , the RA recommends a page of recommendations at to user following policy fΘπ (line 4). Then the RA observes the reward rt from user (line 5) and updates the state to st+1 (line 6). 2.3.2.2 Offline Test The intuition of our offline test method is that, for a given recommendation session (offline data), the RA reranks the items in this session. If the proposed framework works well, this session’s clicked/ordered items will be ranked at the top of the new list. The reason why RA 28 Algorithm 2.4: Offline Test of DeepPage Framework. Input: Item set I = {e1 , · · · , eN } and corresponding reward set R = {r1 , · · · , rN } of a session. Output:Recommendation list L with new order 1: Initialize Actor with well-trained parameters Θπ 2: Receive initial observation state s1 3: while |I| > 0 do 4: Select an action at according to Alg. 
2.1 (policy fΘπ ) 5: for ei ∈ at do 6: Add ei into the end of L 7: Record reward ri from user’s historical browsing data 8: end for 9: Compute the overall reward rt of at according to Eq. (2.16) 10: Execute action at and observe new state st+1 according to Section 2.2.2.2 11: Remove all ei ∈ at from I 12: end while only reranks items in this session rather than items in the whole item space is that for the offline dataset, we only have the ground truth rewards of the existing items in this session. The offline test algorithm in one recommendation session is presented in Algorithm 2.4. In each iteration of an offline test recommendation session, given the state st (line 2), the RA recommends a page of recommendations at following policy fΘπ (lines 4). For each item ei in at , we add it into new recommendation list L (line 6), and record ei ’s reward ri from user’s historical browsing data (line 7). Then we can compute the overall reward rt of at (line 9) and update the state to st+1 (line 10). Finally, we remove all items ei in at from the item set I of the current session (line 11). 2.4 Experiments In this section, we conduct extensive experiments with a dataset from a real e-commerce company to evaluate the effectiveness of the proposed framework. We mainly focus on two questions: (1) how the proposed framework performs compared to representative baselines; 29 and (2) how the components in Actor and Critic contribute to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of important parameters on the performance of the proposed framework. 2.4.1 Experimental Settings We evaluate our method on a dataset of September, 2017 from a real e-commerce company. We randomly collect 1,000,000 recommendation sessions (9,136,976 items) in temporal order, and use the first 70% sessions as the training/validation set and the later 30% sessions as the test set. For a new session, the initial state is collected from the previous sessions of the user. In this work, we leverage N = 10 previously clicked/ordered items to generate the initial state. Each time the RA recommends a page of M = 10 items (5 rows and 2 columns) to users 3 . The reward r of one skipped/clicked/ordered item is empirically set as 0, 1, and 5, respectively. The dimensions of item-embedding/ category-embedding/ feedback-embedding are |E| = 50, |C| = 35, and |F | = 15. We set the discounted factor γ = 0.95, and the rate for soft updates of target networks τ = 0.01. For the parameters of the proposed framework such as γ, we select them via cross-validation. Correspondingly, we also do parameter-tuning for baselines for a fair comparison. We will discuss more details about parameter selection for the proposed framework in the following subsections. For online test, we leverage the summation of all rewards in one recommendation session as the metric. For offline test, we select Precision@20, Recall@20, F1-score@20 [48], NDCG@20 [59] and MAP [131], as the metrics to measure the performance. Our difference from traditional Learn-to-Rank methods is that we rank both clicked and ordered items 3 This is based on offline historical data collected from mobile Apps, i.e., to fit the screen size of mobile phones, one page has only 5 rows and 2 columns. 30 together and set them by different rewards, rather than only rank clicked items as that in the Learn-to-Rank setting. 
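For reference, the offline metrics over one reranked session can be computed as in the sketch below, assuming the logged rewards (0 for skipped, 1 for clicked, 5 for ordered items) are carried along with the new order produced by Algorithm 2.4. Treating an item as relevant whenever its reward is positive, and using the raw reward as the NDCG gain, are simplifying assumptions of this sketch rather than choices spelled out in the text.

```python
import numpy as np

def dcg(gains):
    gains = np.asarray(gains, dtype=float)
    return float((gains / np.log2(np.arange(2, len(gains) + 2))).sum())

def offline_metrics(reranked_rewards, k=20):
    """Per-session ranking metrics on the list produced by Algorithm 2.4.

    reranked_rewards: logged reward of each item in the new order
    (0 = skipped, 1 = clicked, 5 = ordered, as in Section 2.4.1).
    """
    rewards = np.asarray(reranked_rewards, dtype=float)
    relevant = rewards > 0                       # assumption: positive reward = relevant
    hits = float(relevant[:k].sum())
    precision = hits / k
    recall = hits / max(float(relevant.sum()), 1.0)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    ideal = dcg(np.sort(rewards)[::-1][:k])      # assumption: reward used as graded gain
    ndcg = dcg(rewards[:k]) / ideal if ideal > 0 else 0.0
    return {"Precision@k": precision, "Recall@k": recall, "F1@k": f1, "NDCG@k": ndcg}

# Toy usage: a reranked session where one ordered and two clicked items
# appear near the top of the new list.
print(offline_metrics([5, 1, 0, 1, 0, 0, 0, 0], k=5))
```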
2.4.2 Performance Comparison for Offline Test To answer the first question, we compare the proposed framework with the following repre- sentative baseline methods: • CF [10]: Collaborative filtering is a method of making automatic predictions about the interests of a user by collecting preference information from many users, which is based on the hypothesis that people often get the best recommendations from someone with similar tastes to themselves. • FM [106]: Factorization Machines combine the advantages of support vector machines with factorization models. Compared with matrix factorization, higher order interactions can be modeled using the dimensionality parameter. • GRU [55] : This baseline utilizes the Gated Recurrent Units (GRU) to predict what user will click/order next based on the browsing histories. To make a fair comparison, it also keeps previous N = 10 clicked/ordered items as initial states. • DQN [93]: We use a Deep Q-network with five fully-connected layers in this baseline. The input is the concatenation of embeddings of users’ historical clicked/ordered items (state) and a page of recommendations (action), and train this baseline by Eq. (2.1). • DDPG [32]: In this baseline, we use conventional Deep Deterministic Policy Gradient with five fully connected layers in both Actor and Critic. The input for Actor is the concatenation of embeddings of users’ historical clicked/ordered items (state). The input 31 3.5 0.06 0.5 Average Q-value Precision@20 Recall@20 0.0 0.00 0.0 Deep Deep 0 200000 400000 600000 CF FM GRU DQN DDPG CF FM GRU DQN DDPG Page Page 0.1 0.2 0.15 NDCG@20 F1@20 MAP 0.0 0.0 0.00 Deep Deep Deep CF FM GRU DQN DDPG CF FM GRU DQN DDPG CF FM GRU DQN DDPG Page Page Page Figure 2.6: Overall performance comparison in offline test. for Critic is the concatenation of embeddings of state and a page of recommendations (action). We leverage offline training strategy to train DDPG and DeepPage as mentioned in Section 2.3.1.1. The results are shown in Figure 2.6. We make following observations: • Figure 2.6 (a) illustrates the training process of DeepPage. We can observe that the framework approaches convergence when the model is trained by 500,000 offline sessions. • CF and FM perform worse than other baselines. These two baselines ignore the temporal sequence of the users’ browsing history, while GRU can capture the temporal sequence, and DQN, DDPG and DeepPage are able to continuously update their strategies during the interactions. • DQN and DDPG outperform GRU. We design GRU to maximize the immediate reward for recommendations, while DQN and DDPG are designed to achieve the trade-off between 32 short-term and long-term rewards. This result suggests that introducing reinforcement learning can improve the performance of recommendations. • DeepPage performs better than conventional DDPG. Compared to DDPG, DeepPage jointly optimizes a page of items and uses GRU to learn user’s real-time preferences. More details about the impact of mode components on DeepPage will be discussed in the following subsection. 2.4.3 Performance Comparison for Online Test Following [32], we build a simulated online environment (adapted to our case) for online test. We compare DeepPage with GRU, DQN and DDPG. Note that we do not include CF and FM baselines since CF and FM are not applicable to the online environment. 
To answer the second question, we systematically eliminate the corresponding components of the simulator by defining following variants of RecSimu: Here we utilize online training strategy to train DDPG and DeepPage (both Actor-Critic framework) as mentioned in Section 2.3.1.2. Baselines are also applicable to be trained via the rewards generated by simulated online environment. Note that we use data different from the training set to build the simulated online environment. As the test stage is based on the simulator, we can artificially control the length of recommendation sessions to study the performance in short and long sessions. We define short sessions with 10 recommendation pages, while long sessions with 50 recommendation pages. The results are shown in Figure 2.7. It can be observed: • DDPG performs similar to DQN, but the training speed of DDPG is much faster than DQN, as shown in Figure 2.7 (a). DQN computes Q-value for all potential actions, while 33 3.5 15 30 DDPG Average Q-value DQN Reward Reward 0.0 0 0 Training Time Deep Deep GRU DQN DDPG GRU DQN DDPG Page Page (a)Training Speed (b)Short Session (c)Long Session Figure 2.7: Overall performance comparison in online test. DDPG can reduce this redundant computation. This result indicates that Actor-Critic framework is suitable for practical recommender systems with enormous action space. • In short recommendation sessions, GRU, DQN and DDPG achieve comparable performance. In other words, GRU model and reinforcement learning models like DQN and DDPG can recommend proper items matching users’ short-term interests. • In long recommendation sessions, DQN and DDPG outperform GRU significantly. GRU is designed to maximize the immediate reward for recommendations, while reinforcement learning models like DQN and DDPG are designed to achieve the trade-off between short- term and long-term rewards. This result suggests that introducing reinforcement learning can improve the performance of recommendations. • DeepPage performs better than conventional DQN and DDPG. DeepPage can learn user’s real-time preferences and optimizes a page of items. We detail the effect of model components of DeepPage in the following subsection. 34 2.4.4 Effectiveness of Components To validate the effectiveness of each component, we systematically eliminate the corresponding model components by defining following variants of DeepPage: • DeepPage-1: This variant is to evaluate the performance of the embedding layer. We remove the embedding layer for items, categories and feedback. • DeepPage-2: In this variant, we evaluate the contribution of category and feedback information, hence, this variant does not contain one-hot indicator vectors of category and feedback. • DeepPage-3: This variant is to evaluate the effectiveness of GRU to generate initial state, so we eliminate the GRU in Figure 2.3. • DeepPage-4: In this variant, we evaluate the contribution of CNNs as shown in Figure 2.4, thus we remove CNNs and directly feed the outputs of embedding layers (concatenate embeddings of all items as one vector) into GRU units. • DeepPage-5: This variant is to evaluate the effectiveness of attention mechanism in Figure 2.4. Therefore, we eliminate attention layer and use the hidden state of last GRU unit as the input of DeCNN. • DeepPage-6: In this variant, we evaluate the GRU to generate local state; thereby, we remove this GRU in Figure 2.4 and concatenate outputs of all CNNs as a vector, and feed it into DeCNN. 
• DeepPage-7: This variant is to evaluate the performance of DeCNN to generate a new page of items; hence, we replace it with fully-connected layers, which output a concatenated 35 Table 2.1: Performance comparison of different components. Methods Precision Recall F1score NDCG MAP @20 @20 @20 @20 DeepPage-1 0.0479 0.3351 0.0779 0.1753 0.1276 DeepPage-2 0.0475 0.3308 0.0772 0.1737 0.1265 DeepPage-3 0.0351 0.2627 0.0578 0.1393 0.1071 DeepPage-4 0.0452 0.3136 0.0729 0.1679 0.1216 DeepPage-5 0.0476 0.3342 0.0775 0.1716 0.1243 DeepPage-6 0.0318 0.2433 0.0528 0.1316 0.1039 DeepPage-7 0.0459 0.3179 0.0736 0.1698 0.1233 DeepPage 0.0491 0.3576 0.0805 0.1872 0.1378 vector of M item-embeddings. The offline results are shown in Table 2.1 (we omit similar online observations because of the space limitation). We make following observations: • DeepPage-1 and DeepPage-2 validate that incorporating category/feedback information and the embedding layer can boost recommendation performance. • DeepPage-3 and DeepPage-6 perform worse, which suggests that setting user’s initial preference at the beginning of a new recommendation session, and capturing user’s real- time preference in current session is helpful for accurate recommendations. • DeepPage-5 proves that incorporating attention mechanism can better capture user’s real-time preference than only GRU. • DeepPage outperforms DeepPage-4 and DeepPage-7, which indicates that item display strategy can influence the decision-making process of users. In a nutshell, DeepPage outperforms all its variants, demonstrating each component’s effec- tiveness for recommendations. 36 2.5 Related Work In this section, we briefly review research related to our study. In general, the related work can be mainly grouped into the following categories. The first category related to this chapter is traditional recommendation techniques. Rec- ommender systems assist users by supplying a page of items that might interest users. Efforts have been made to offer meaningful recommendations to users. Collaborative filtering[77] is the most successful and the most widely used technique, which is based on the hypothesis that people often get the best recommendations from someone with similar tastes to themselves[10]. Another common approach is content-based filtering [95], which tries to recommend items with similar properties to those that a user ordered in the past. Knowledge-based systems[1] recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and how the item is useful for the user. Hybrid recommender sys- tems are based on the combination of the above-mentioned two or more types of techniques[12]. The other topic closely related to this category is deep learning based recommender system, which is able to effectively capture the non-linear and non-trivial user-item relationships, and enables the codification of more complex abstractions as data representations in the higher layers[154]. For instance, Nguyen et al.[97] proposed a personalized tag recommender system based on CNN. It utilizes a constitutional and max-pooling layer to get visual features from patches of images. Wu et al.[142] designed a session-based recommendation model for real-world e-commerce website. It utilizes the basic RNN to predict what users will buy next based on the click histories. The second category is about reinforcement learning for recommendations and person- alization, which is different from the traditional item recommendations. 
In this work, we 37 consider the recommending procedure as sequential interactions between users and the recom- mender agent; and leverage deep reinforcement learning to automatically learn the optimal recommendation strategies. Indeed, reinforcement learning has been widely examined in the recommendation field. The MDP-Based CF model in Shani et al.[116] can be viewed as approximating a partial observable MDP (POMDP) by using a finite rather than unbounded window of past history to define the current state. Furthermore, Mahmood et al.[91] adopted the reinforcement learning technique to observe the responses of users in a conversational recommender, intending to maximize a numerical cumulative reward function modeling the benefit that users get from each recommendation session. Taghipour et al.[126, 125] modeled web page recommendation as a Q-Learning problem and learned to make recommendations from web usage data as the actions rather than discovering explicit patterns from the data. The system inherits the intrinsic characteristic of reinforcement learning, which is in a constant learning process. Sunehag et al.[122] introduced agents that successfully address sequential decision problems with high-dimensional combinatorial slate-action spaces. Zheng et al.[167] proposed a reinforcement learning framework to make online personalized news recommendation, which can effectively model the dynamic news features and user preferences, and plan for future explicitly, in order to achieve higher reward (e.g., CTR) in the long run. Feng et al.[38] presented a multi-agent reinforcement learning model, MA-RDPG, which can optimize ranking strategies collaboratively for multi-scenario ranking problems. Cai et al.[14] employed a reinforcement mechanism design framework for solving the impression allocation problem of large e-commerce websites, while taking the rationality of sellers into account. 38 Chapter 3 Jointly Recommend and Advertise Abstract Online recommendation and advertising are two major income channels for online recommen- dation platforms (e.g., e-commerce and news feed site). However, most platforms optimize recommending and advertising strategies by different teams separately via different tech- niques, leading to suboptimal overall performances. To this end, in this chapter, we propose a novel two-level reinforcement learning framework to jointly optimize the recommending and advertising strategies, where the first level generates a list of recommendations to optimize user experience in the long run; then the second level inserts ads into the recommendation list that can balance the immediate advertising revenue from advertisers and the negative influence of ads on long-term user experience. To be specific, the first level tackles high combinatorial action space problem that selects a subset of items from the large item space; while the second level determines three internally related tasks, i.e., (i) whether to insert an ad, and if yes, (ii) the optimal ad and (iii) the optimal location to insert. The experimental results based on real-world data demonstrate the effectiveness of the proposed framework. 39 3.1 Introduction Practical e-commerce or news-feed platforms generally expose a hybrid list of recommended and advertised items (e.g., products, services, or information) to users, where recommending and advertising algorithms are typically optimized by different metrics [38]. 
The recommender systems (RS) capture users’ implicit preferences from historical behaviors (e.g., clicks, rating and review) and generate a set of items that best match users’ preferences. Thus, RS aims at optimizing the user experience or engagement. While advertising systems (AS) assign the right ad to the right user on the right ad slots to maximize the revenue, click-through rate (CTR) or return on investment (ROI) from advertisers. Thus, optimizing recommending and advertising algorithms independently may lead to suboptimal overall performance since exposing more ads to increase advertising revenue negatively influences user experience, and vice versa. Therefore, there is an increasing demand for developing a uniform framework that jointly optimizes recommending and advertising, so as to optimize the overall performance [156]. Efforts have been made to display recommended and advertised items together. They consider ads as recommendations, and rank all items in a hybrid list to optimize the overall ranking score [134]. However, this approach has two major drawbacks. First, solely maximizing the overall ranking score may result in suboptimal advertising revenue. Second, in the real- time bidding (RTB) environment, the vickrey-clarke-groves (VCG) mechanism is necessary to calculate the bid of each ad in the list, which suffers from many practical severe problems [112]. Therefore, it calls for methods where we can optimize not only the metrics for RS and AS separately, but also the overall performance. Moreover, more practical mechanisms such as generalized-second-price (GSP) should be considered to compute the bid of each ad. To achieve the goal, we propose to study a two-level framework for rec-ads mixed display. 40 rec1 rec1 rec2 rec2 rec3 rec3 Recommender Advertising User System System Feedback Figure 3.1: An example of rec-ads mixed display for one user request. Figure 3.1 illustrates the high-level idea about how the framework works. Upon a user’s request, the first level (i.e., RS) generates a list of recommendations (rec-list) according to the user’s historical behaviors, aiming to optimize the long-term user experience or engagement. The main challenge to build the first level is the high computational complexity of the combinatorial action space, i.e., selecting a subset of items from the large item space. Then the second level (i.e., AS) inserts ads into the rec-list generated from the first level, and it needs to make three decisions, i.e., whether to insert an ad into the given rec-list; and if yes, the AS also needs to decide which ad and where to insert. For example, in Figure 3.1, the AS decides to insert an belt ad ad9 between T-shirt rec2 and pants rec3 of the rec-list. The optimal ad should jointly maximize the immediate advertising revenue from advertisers in the RTB environment and minimize the negative influence of ads on user experience in the long run. Finally, the target user browses the mixed rec-ads list and provides her/his feedback. According to the feedback, the RS and AS update their policies and generate the mixed rec-ads list for the next iteration. Most existing supervised learning based recommending and advertising methods are designed to maximize the immediate (short-term) reward and suggest items following fixed 41 greedy strategies. They overlook the long-term user experience and revenue. 
Thus, we build a two-level reinforcement learning framework for Rec/Ads Mixed display (RAM), which can continuously update their recommending and advertising strategies during the interactions with users, and the optimal strategy is designed to maximize the expected long-term cumulative reward from users [159, 155]. Meanwhile, to effectively leverage users’ historical behaviors from other policies, we design an off-policy training approach, which can pre-train the framework before launching it online, so as to reduce the bad user experience in the initial online stage when new algorithms have not been well learned [161, 162, 175]. We summarize our major contributions as follows: • We provide a two-stage generation process for the mixed display of recommended and advertised items in a hybrid list; • We propose a two-level deep reinforcement learning based framework RAM, where the first level can generate a list of recommendations, while the second level can determine whether to insert an ad, and the corresponding optimal ad and location to insert; • We conduct experiments with real-world data to demonstrate the effectiveness of the proposed framework. 3.2 Problem Statement As aforementioned in Section 3.1, we consider the rec/ads mixed display task as a two-level reinforcement learning problem, and model it as a Markov Decision Process (MDP) where the RS and AS sequentially interact with users (environment E) by generating a sequence of rec-ads hybrid-list over time, so as to maximize the cumulative reward from E. Next, we 42 define the five elements (S, A, P, R, γ) of the MDP. • State space S: A state st ∈ S includes a user’s recommendation and advertisement browsing history before time t and the contextual information of current request at time t. The generated rec-list from RS is also considered as a part of the state for the AS. • Action space A: at = (ars as rs t , at ) ∈ A is the action of RS and AS, where at of RS is to generate a rec-list, and aas t of AS is to determine three internally related decisions, i.e., whether to insert an ad in current rec-list; and if yes, the AS needs to choose a specific ad and insert it into the optimal location of the rec-list. We denote Ars as t and At as the rec and ad candidate sets for time t, respectively. Without the loss of generality, we assume that the length of any rec-list is fixed and the AS can insert at most one ad into a rec-list. • Reward R: After an action at is executed at the state st , a user browses the mixed rec-ads list and provides her feedback. The RS and AS will receive the immediate reward rt (st , ars as t ) and rt (st , at ) based on user’s feedback. We will discuss more details about the reward in following sections. • Transition probability P: P (st+1 |st , at ) is the state transition probability from st to st+1 after executing action at . The MDP is assumed to satisfy P (st+1 |st , at , ..., s1 , a1 ) = P (st+1 |st , at ). • Discount factor γ: Discount factor γ ∈ [0, 1] balances between current and future rewards – (1) γ = 1: all future rewards are fully considered into current action; and (2) γ = 0: only the immediate reward is counted. Figure 3.2 illustrates the user-agent interactions in MDP. With the above definitions, the problem can be formally defined as follows: Given the historical MDP, i.e., (S, A, P, R, γ), 43 RS AS st , ars t aas t rtrs rtas state st reward rt action at (ars as t , at ) rt+1 st+1 Users Figure 3.2: The agent-user interactions in MDP. 
the goal is to find a two-level rec/ads policy π = {πrs , πas } : S → A, which can maximize the cumulative reward from users, i.e., simultaneously optimizing the user experience and the advertising revenue. 3.3 Framework In this section, we will discuss the two-level deep reinforcement learning framework for rec/ads mixed display. We will first introduce the first-level deep Q-network (i.e., RS) to generate a list of recommendations (rec-list) according to user’s historical behaviors, then we propose a novel DQN architecture as the second-level (i.e., AS) to insert ads into the rec-list generated from RS. Finally, we discuss how to train the framework via offline data. 3.3.1 Deep Q-network for Recommendations Given a user request, RS will return a list of items according to user’s historical behaviors, which have two major challenges: (i) the high computational complexity of the combinatorial |Ars t | , i.e., selecting k items from the large item space Ars , and (ii) how  action space k t to approximate the action-value function (Q-function) for a list of items in the two-level 44 reinforcement learning framework. In this subsection, we introduce and enhance a cascading version of Deep Q-network to tackle the above challenges. Next, we first introduce the processing of state and action features, and then we illustrate the cascading Deep Q-network with an optimization algorithm. 3.3.1.1 The Processing of State and Action Features for RS As mentioned in Section 3.2, a state st includes a user’s rec/ads browsing history, and the contextual information of the current request. The browsing history contains a sequence of recommended items and a sequence of advertised items the user has browsed. Two RNNs with GRU (gated recurrent unit) are utilized to capture users’ preferences of recommendations and advertisements, separately. The final hidden state of RNN is used to represent user’s preference of recommended items prec ad t (or ads pt ). The contextual information ct of current user request includes app version, operation system (e.g., ios and android) and feed type (swiping up/down the screen), etc. The state st is the concatenation prec ad t , pt and ct as: st = concat(prec ad t , pt , ct ) (3.1) For the transition from st to st+1 , the browsed recommended and advertised items at time t will be inserted into the bottom of prs as rs as t and pt and we have pt+1 and pt+1 , respectively. For the action ars rs rs t = {at (1), · · · , at (k)} is the embedding of the list of k items that will be displayed in current request. Next, we will detail the cascading Deep Q-network. 45 3.3.1.2 The Cascading DQN for RS Recommending a list of k items from the large item space Ars t is challenging because (i) the |Ars t | has high computational complexity, and (ii) the order of  combinatorial action space k items in the list also matters [160]. For example, a user may have different feedback to the same item if it is placed in different positions of the list. To resolve the above challenges, we leverage a cascading version of DQN which generates a list by sequentially selecting items in a cascading manner [23]. 
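Before describing that cascading selection, the state it conditions on, i.e., the concatenation of Eq. (3.1), can be sketched as follows. Module names are illustrative, and the dimensions are chosen to match the sizes reported later in the experimental settings; this is a sketch of the stated design, not the thesis code.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Builds s_t = concat(p_rec_t, p_ad_t, c_t) as in Eq. (3.1)."""
    def __init__(self, item_dim=60, ad_dim=60, ctx_dim=13, hidden_dim=64):
        super().__init__()
        self.rec_rnn = nn.GRU(item_dim, hidden_dim, batch_first=True)
        self.ad_rnn = nn.GRU(ad_dim, hidden_dim, batch_first=True)

    def forward(self, rec_history, ad_history, context):
        # rec_history: (B, L1, item_dim)  browsed recommended items
        # ad_history:  (B, L2, ad_dim)    browsed advertised items
        # context:     (B, ctx_dim)       request context (app version, OS, feed type, ...)
        _, p_rec = self.rec_rnn(rec_history)   # final hidden state = rec preference p_rec_t
        _, p_ad = self.ad_rnn(ad_history)      # final hidden state = ad preference p_ad_t
        return torch.cat([p_rec[-1], p_ad[-1], context], dim=-1)   # s_t

# Example: a batch of 2 users with 10 browsed items, 5 browsed ads, 13-dim context.
enc = StateEncoder()
s_t = enc(torch.randn(2, 10, 60), torch.randn(2, 5, 60), torch.randn(2, 13))
print(s_t.shape)   # torch.Size([2, 141])
```

With this state encoder in place, the cascading selection over the recommendation candidates proceeds as described next.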
Given state st , the optimal action is denoted as: ars∗ rs∗ rs∗ t = {at (1), · · · , at (k)} = arg max Q∗ (st , ars t ) (3.2) atrs The key idea of the cascading DQN is inspired by the fact that:   max Q∗ (s rs t , at (1:k)) = max max Q∗ (s rs t , at (1:k)) (3.3) ars t (1:k) ars t (1) ars t (2:k) which implies a cascade of mutually consistent as:   ars∗ 1∗ rs = arg max Q (st , at (1)) := max Q (st , at (1:k)) ∗ rs t (1) ars t (1) ars t (2:k)   rs∗ 2∗ rs∗ rs at (2) = arg max Q (st ,at (1),at (2)) := max Q (st ,at (1:k)) ∗ rs ars t (2) ars t (3:k) (3.4) ···   ars∗ Qk∗ (s rs∗ rs ∗ rs t (k) = arg max t ,at (1:k−1),at (k)) :=Q (st ,at (1:k)) ars t (k) By applying above functions in a cascading fashion, we can reduce the computational |Ars | complexity of obtaining the optimal action from O kt to O(k|Ars t |). Then the RS can sequentially select items following above equations. Note that the items already recommended 46 RNN prec t i1 ··· i N RNN pad t j1 ··· j N ct Q(st , ars t ) RNN rs at (1 : j) at (1)· · · at (j) rs rs j ∈ [1, k] Figure 3.3: The architecture of cascading DQN for RS. in the recommendation session will be removed from being recommended again. Next, we will detail how to estimate {Qj∗ |j ∈ [1, k]}. 3.3.1.3 The estimation of Cascading Q-functions Figure 3.3 illustrates the architecture of the cascading DQN, where i1 , · · · ,iN and j1 , · · · ,jN are uses’ rec and ads browsing histories. The original model in [23] uses k layers to process k items separately without efficient weights sharing, which is crucial in handling large action size [118]. To address this challenge, we replace the k separate layers by RNN with GRU, where the input of j th RNN unit is the feature of the j th item in the list, and the final hidden state of RNN is considered as the representation of the list. Since all RNN units share the same parameters, the framework is flexible to any action size k. To ensure that the cascading DQN selects the optimal action, i.e., a sequence of k optimal sub-actions, {Qj∗ |j ∈ [1, k]} functions should satisfy a set of constraints as follows: Qj∗ (st , ars∗ ∗ rs∗ ∀j ∈ [1, k] t (1 : j)) = Q (st , at (1 : k)) , (3.5) 47 i.e., the optimal value of Qj∗ should be equivalent to Q∗ for all j. The cascading DQN enforces the above constraints in a soft and approximate way, where the loss functions are defined as follows:  2 ytrs − Qj (st , ars t (1 : j)) , where (3.6) ytrs rt (st , ars k)) + γQ∗ st+1 , ars∗  = t (1 : t+1 (1 : k) , ∀j ∈ [1, k] i.e., all Qj functions fit against the same target ytrs . Then we update the parameters of the cascading DQN by performing gradient steps over the above loss. We will detail the reward function rt st , ars  t (1 : k) in the following subsections. Next, we will introduce the second-level DQN for advertising. 3.3.2 Deep Q-network for Online Advertising As mentioned in Section 3.1, the advertising system (AS) is challenging. First, AS needs to make three decisions, i.e., whether, which and where to insert. Second, these three decisions are dependent and traditional DQN architectures cannot be directly applied. For example, only when the AS decides to insert an ad, AS also needs to determine the candidate ads and locations. Third, the AS needs to not only maximize the income of ads but also minimize the negative influence on user experience. To tackle these challenges, next we detail a novel Deep Q-network architecture. 
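Before detailing that architecture, it is worth making the first-level cascading selection of Eq. (3.4) concrete. The sketch below greedily picks the j-th item by scoring every remaining candidate appended to the already-selected prefix, which matches the stated O(k|A_t^rs|) evaluation cost; the `q_net(state, prefix)` interface and the toy scorer are placeholders under stated assumptions, not the thesis implementation.

```python
import torch

def cascade_select(q_net, state, candidates, k):
    """Greedy cascading selection of Eq. (3.4): choose the j-th item by maximizing
    Q^j(s_t, a(1:j-1), a(j)) over the remaining candidates.

    state:      (d_s,) state tensor
    candidates: (N, d_a) candidate item embeddings
    Returns the indices of the k selected items in display order.
    """
    chosen, prefix = [], []
    remaining = list(range(candidates.size(0)))
    for _ in range(k):
        scores = []
        for idx in remaining:
            trial = torch.stack(prefix + [candidates[idx]])   # a(1:j-1) plus a(j)
            scores.append(q_net(state, trial))
        best = remaining[int(torch.stack(scores).argmax())]
        chosen.append(best)
        prefix.append(candidates[best])
        remaining.remove(best)                                # no repeats within a page
    return chosen

# Toy usage: state and items share dimension 8; the stand-in scorer simply
# favors items aligned with the state, mimicking a learned Q^j.
torch.manual_seed(0)
state, items = torch.randn(8), torch.randn(20, 8)
q_net = lambda s, prefix: torch.dot(prefix[-1], s)
print(cascade_select(q_net, state, items, k=6))
```

The proposed DQN architecture for the second-level AS is detailed next.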
3.3.2.1 The Processing of State and Action Features for AS We leverage the same architecture as that in Section 3.3.1.1 to obtain the state st . Furthermore, since the task of AS is to insert ad into a given rec-list, the output of the first-level DQN, i.e., 48 Q( Q( Q( Q(st , a)1 · · · Q(st , a)k+1 st , s st , Q(st , at ) )0 ) ··· at ad t , at ad 1 at ad ) k +1 ··· state st state st action at state st action aad t (a) (b) (c) Figure 3.4: (a)(b) Two conventional DQNs. (c) Overview of the proposed DQN. the current rec-list ars rs rs t = {at (1), · · · , at (k)}, is also considered as a part of the state for AS. For the action aas ad loc ad t = (at , at ) of AS, at is the embedding of a candidate ad. Given the rec-list of k items, there exist k + 1 possible locations. Thus, we use a one hot vector aloc t ∈R k+1 to indicate the location to insert the selected ad. 3.3.2.2 The Proposed DQN Architecture Given state st and rec-list ars as t , the action of AS at contains three sub-actions, i.e., (i) whether to insert an ad into rec-list ars t ; if yes, (ii) which is the best ad and (iii) where is the optimal location. Note that in this work we suppose that the AS can insert an ad into a given rec-list at most. Next, we discuss the limitations if we directly apply two classic Deep Q-network (DQN) architectures as shown in Figures 3.4(a) and (b) to the task. The architecture in Figure 3.4(a) inputs only the state (st and ars t ) and outputs Q-values of all k + 1 possible locations. This DQN can select the optimal location, while it cannot choose the optimal ad to insert. The architecture in Figure 3.4(b) takes a pair of state-action and outputs the Q-value for a specific ad. This DQN can determine the optimal ad but cannot 49 determine where to insert the ad. One solution is to input the location information (e.g., one-hot vector). However, it needs O(k|Aas t |) evaluations (forward propagation) to find the optimal Q-value, which is not practical in real-world advertising systems. Note that neither two classic DQNs can decide whether to insert an ad into a given rec-list. To address above challenges, we propose a novel DQN architecture for online advertising in a given rec-list ars t , as shown in Figure 3.4(c). It is built based on the two classic DQN architectures. The proposed DQN takes state st (including ars ad t ) and a candidate ad at as input, and outputs the Q-values for all k + 1 locations. This DQN architecture inherits the advantages of both two classic DQNs. It can evaluate the Q-values of the combination of two internally related types of sub-actions at the same time. In this chapter, we evaluate the Q-values of all possible locations for an ad simultaneously. To determine whether to insert an ad (the first sub-action), we extend the output layer from k + 1 to k + 2 units, where the Q(st , aad 0 rs t ) unit corresponds to the Q-value of not inserting an ad into rec-list at . Therefore, our proposed DQN can simultaneously determine three aforementioned sub-actions according to the Q-value of ad-location combinations (aad loc t , at ), and the evaluation times are reduced from O(k|Aas as 0 t |) to O(|At |); when Q(st , 0) leads to the maximal Q-value, the AS will insert no ad into rec-list ars t , where we use a zero-vector 0 to represent inserting no ad. More details of the proposed DQN architecture are illustrated in Figure 3.5. First, whether to insert an ad into a given rec-list is affected by st and ars t (especially the quality of the rec-list). 
For example, if a user is satisfied with a rec-list, the AS may prefer to insert an ad into the rec-list; conversely, if a user is unsatisfied with a rec-list and is likely to leave, then the AS won’t insert an ad. Second, the reward for an ad-location pair is related to all information. Thus, we divide the Q-function into a value function V (st ), related to st and ars ad rs ad t , and an advantage function A(st , at ), decided by st , at and at [138]. 50 RNN prec t i1 ··· i N V (st ) RNN pad t j1 ··· j N Q(st , aad t ) 0 Q(st , aad t ) 1 ct RNN ··· ··· Q(st , aad t ) k+1 ars t t (1)· · · t ars ars (j) ··· aad t A(st , aad t ) Figure 3.5: The architecture of the proposed DQN for AS. 3.3.2.3 The Action Selection in RTB setting In the real-time bidding environment, each ad slot is bid by advertisers in real-time when an impression is just generated from a consumer visit [13]. In other words, given an ad slot, the specific ad to display is determined by the bids from advertisers, i.e., the bidding system (BS), rather than the platform, which aims to maximize the immediate advertising revenue of each ad slot from advertisers. In this chapter, as mentioned in Section 3.1, the optimal ad selection policy should not only maximize the immediate advertising revenue (controlled by the BS), but also minimize the negative influence of ads on user experience in the long run (controlled by the AS). To achieve this goal, the AS will first calculate the Q-values for all candidate ads and all possible locations, referred as to Q(st , Aas t ), which captures the long-term influence of ads on user experience; and then the BS will select the ad that achieves the trade-off between the immediate advertising revenue and the long-term Q-values: aas as t = BS (Q(st , At )) (3.7) 51 where the operation Q(st , Aas ad t ) goes through all candidate ads {at } (input layer) and all locations {aloc t } (output layer), including the location that represents not inserting an ad. To be more specific, we design two AS+BS approaches as follows: • RAM-l: the optimal ad-location pair aas ad loc t = (at , at ) directly optimizes the linear summation of immediate advertising revenue and long-term user experience: aas Q(st , aas as  t = arg max t ) + α · revt (at ) (3.8) aas t ∈At as where α controls the second term, and revt (aas t ) is the immediate advertising revenue if inserting an ad, otherwise 0; • RAM-n: this is a nonlinear approach that the AS first selects a subset of ad-location pairs {aas t } (the size is N ) that corresponds to optimal long-term user experience Q(st , at ), as then the BS chooses one aas t that has the maximal immediate advertising revenue revt (at ) as from the subset. It is noteworthy that we maximize immediate advertising revenue rather than long-term advertising revenue because: (i) as aforementioned, advertisers determine the ad to insert rather than the platform (the agent does not generate action); and (ii) in the generalized- second-price (GSP) setting, the highest bidder pays the price (immediate advertising revenue) bid by the second-highest bidder, if we use immediate advertising revenue as rt (st , aas t ), then we cannot select an ad according to its Q(st , aast ) that represents the long-term advertising revenue. 52 3.3.3 The Optimization Task Given a user request and state st , the RS and AS sequentially generate actions ars as t and at , i.e., a rec-ads hybrid list, and then the target user will browse the list and provide her/his feedback. 
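Before specifying the reward functions of the two levels, the second-level network of Figure 3.5 and the RAM-l selection rule of Eq. (3.8) can be sketched as follows. All module names and hidden sizes are illustrative, and treating output index 0 of every candidate ad as the shared "insert no ad" choice is a simplification of the zero-vector convention described above rather than the exact thesis implementation.

```python
import torch
import torch.nn as nn

class AdQNetwork(nn.Module):
    """Second-level DQN sketch: input (s_t, candidate ad), output Q-values for
    k + 2 choices, i.e. k + 1 insertion locations plus "insert no ad".  The
    Q-value is split into a value head V(s_t) and an advantage head A(s_t, a_ad)."""
    def __init__(self, state_dim, ad_dim, k, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(state_dim + ad_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, k + 2))

    def forward(self, state, ad):
        return self.value(state) + self.advantage(torch.cat([state, ad], dim=-1))

def ram_l_select(q_net, state, candidate_ads, revenues, alpha=1.0):
    """RAM-l rule of Eq. (3.8): pick the ad-location pair (or "no ad") maximizing
    Q(s_t, a_as) + alpha * rev_t(a_as); revenue is 0 when no ad is inserted."""
    best, best_score = None, float("-inf")
    for ad, rev in zip(candidate_ads, revenues):
        q = q_net(state, ad)                              # (k + 2,) Q-values for this ad
        for loc in range(q.numel()):
            score = q[loc].item() + alpha * (rev if loc > 0 else 0.0)
            if score > best_score:
                best, best_score = (ad, loc), score
    return best    # loc == 0 means "do not insert an ad"

# Toy usage: k = 6 rec slots, 5 candidate ads with per-ad immediate revenues.
net = AdQNetwork(state_dim=141, ad_dim=60, k=6)
choice = ram_l_select(net, torch.randn(141), list(torch.randn(5, 60)),
                      [0.2, 0.5, 0.1, 0.9, 0.3])
```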
The two-level framework aims to (i) optimize the long-term user experience or engagement of recommended items (RS), (ii) maximize the immediate advertising revenue from advertisers in RTB environment (BS), and (iii) minimize the negative influence of ads on user long-term experience (AS), where the second goal is automatically achieved by the bidding system, i.e., the advertiser with highest bid price will win the ad slot auction. Therefore, we next design proper reward functions to assist the RL components in the framework to achieve the first and third goals. The framework is quite general for the rec-ads mixed display applications in e-commerce, news feed and video platforms. Thus, for different types of platforms, we design different reward functions. For the first level DQN (RS), to evaluate the user experience, we have   income  e−commerce rt (st , ars t )= (3.9)  dwell time news/videos  where user experience is measured by the income of the recommended items in the hybrid list in e-commerce platforms, and the dwelling time of the recommendations in news/video platforms. Based on the reward function rt (st , ars t ), we can update the parameters of the cascading DQN (RS) by performing gradient steps over the loss in Eq. (3.6). We introduce separated evaluation and target networks [93] to help smooth the learning and avoid the divergence of parameters, where θrs represents all parameters of the evaluation network, and the parameters of the target network θTrs are fixed when optimizing the loss function. The 53 derivatives of loss function L(θrs ) with respective to parameters θrs are presented as follows:   ∇θrs L(θrs ) = ytrs −Qj (st ,ars rs t (1:j);θ ) ∇θrs Qj (st ,ars rs t (1:j);θ ) (3.10) where the target ytrs = rt st ,ars ∗ rs∗ rs   t (1:k) + γQ st+1 ,at+1 (1:k);θT , ∀j ∈ [1, k]. For the second level DQN (AS), since leaving the platforms is the major risk of inserting ads improperly or too frequently, we evaluate user experience by whether user will leave the platform after browsing current rec-ads hybrid list, and we have:   1 continue  rt (st , aas t )= (3.11)  0 leave  in other words, the AS will receive a positive reward (e.g. 1) if the user continues to browse the next list, otherwise 0 reward. Then the optimal Q∗ (st , aas t ), i.e., the maximum expected return achieved by the optimal policy, follows the Bellman equation [6] as: Q∗ (st , aas as ∗ ∗ as  t ) = rt (st , at ) + γQ st+1 , BS Q (st+1 , At+1 ) (3.12) then the second level DQN can be optimized by minimizing the loss function as: 2 ytas − Q(st , aas t ) , where (3.13) ytas rt (st , aas ∗ Q∗ (s as  = t ) + γQ st+1 , BS t+1 , At+1 ) where ytas is the target of the current iteration. We also introduce separated evaluation and target networks [93] with parameters θas and θTas for the second level DQN (AS), and θTas is fixed when optimizing the loss function in Eq. (3.13) (i.e. L(θas )). The derivatives of loss 54 function L(θas ) w.r.t. parameters θas can be presented as: ∇θas L(θas ) = ytas − Q(st , aas as  as as t ; θ ) ∇θas Q(st , at ; θ ) (3.14) where ytas = rt (st , aas ∗ ∗ as as  as  as t ) + γQ st+1 , BS Q (st+1 , At+1 ; θT ) ; θT . The operation Q(st , At ) looks through the candidate ad set {aad loc t+1 } and all locations {at+1 } (including the location of inserting no ad). 
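A minimal sketch of the AS learning signal defined by Eqs. (3.11)-(3.14) is given below, assuming a Q-network with the interface sketched earlier and a `select_fn` that plays the role of the bidding system BS (for instance, the RAM-l rule); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def as_td_target(reward, next_state, next_ads, next_revenues,
                 target_q_net, select_fn, gamma=0.95):
    """Target of Eqs. (3.12)/(3.13): y_as = r + gamma * Q'(s_{t+1}, BS(Q'(s_{t+1}, A_{t+1}))).
    The reward follows Eq. (3.11): 1 if the user keeps browsing, 0 if she leaves."""
    with torch.no_grad():
        ad, loc = select_fn(target_q_net, next_state, next_ads, next_revenues)
        q_next = target_q_net(next_state, ad)[loc]
    return reward + gamma * q_next

def as_loss(q_net, state, ad, loc, target):
    """Squared TD error of Eqs. (3.13)/(3.14) for the executed ad-location pair."""
    return F.mse_loss(q_net(state, ad)[loc], target)
```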
3.3.4 Off-policy Training Training the two-level reinforcement learning framework requires a large amount of user- system interaction data, which may result in bad user experience in the initial online stage when new algorithms have not been well trained. To address this challenge, we propose an off-policy training approach that effectively utilizes users’ historical behavior log from other policies. The users’ offline log records the interaction history between behavior policy b(st ) (the current recommendation and advertising strategies) and users’ feedback. Our RS and AS take the actions based on the off-policy b(st ) and obtain feedback from the offline log. We present our off-policy training algorithm in detail shown in Algorithm 3.1. There are two phases in each iteration of a training session. For the transition generation phase: for the state st (line 6), the RS and AS sequentially act ars as t and at based on the be- havior policy b(st ) (line 7) according to a standard off-policy way [30]; then RS and AS receive the reward rt (st , ars as t ) and rt (st , at ) from the offline log (line 8) and update the state to st+1 (line 9); and finally the RS and AS store transition (st , ars as rs as t , at , rt (st , at ), rt (st , at ), st+1 ) into the replay buffer D (line 10). For the model training phase: the proposed framework 0 first samples minibatch of transitions from D (line 11), then generates actions ars and 55 Algorithm 3.1: Off-policy Training of the RAM Framework. Input: historical offline logs, replay buffer D Output: well-trained recommending policy πrs ∗ and advertising policy π ∗ as 1: Initialize the capacity of replay buffer D 2: Randomly initialize action-value functions Qrs and Qas 3: for session = 1, M do 4: Initialize state s0 5: for t = 1, T do 6: Observe state st = concat(prec asd t , pt , ct ) 7: Execute actions ars as t and at according to off-policy b(st ) 8: Get rewards rt (st , at ) and rt (st , aas rs t ) from offline log 9: Update the state from st to st+1 10: Store (st , ars as rs as t , at , rt (st , at ), rt (st , at ), st+1 ) transition into the D 11: Sample minibatch of (s, ars , aas , r(s, ars ), r(s, aas ), s0 ) transitions from the D 0 12: Generate RS’s next action ars according to Eq. (3.4) as0 13: ( AS’s next action a according to Eq. (3.7) Generate r(s, ars ) terminal s0 14: y rs = 0 r(s, ars ) + γQrs (s0 , ars ) non − terminal s0 j 2 15: Update(θrs of Qrs by minimizing y rs − Qrs (s, ars (1 : j)) via Eq. (3.10) r(s, aas ) terminal s0 16: y as = 0 r(s, aas ) + γQas (s0 , aas ) non − terminal s0 2 17: Update θas of Qas by minimizing y as − Qas (s, aas ) according to Eq. (3.14) 18: end for 19: end for 0 aas of next iteration according to Eqs.(3.4) and (3.7) (lines 12-13), and finally updates parameters of Qrs and Qas by minimizing Eqs.(3.6) and .(3.13) (lines 14-17). To help avoid the divergence of parameters and smooth the training, we introduce separated evaluation and target Q-networks [93] . Note that when b(st ) decides not to insert an ad (line 7), we denote aadt as an all-zero vector. 3.3.5 Online Test We present the online test procedure in Algorithm 3.2. The process is similar to the transition generation stage of Algorithm 3.1. Next, we detail each iteration of test session as shown 56 Algorithm 3.2: Online Test of the RAM Framework. 
1: Initialize action-value functions Qrs and Qas with well-trained weights 2: for session = 1, M do 3: Initialize state s0 4: for t = 1, T do 5: Observe state st = concat(prec asd t , pt , ct ) 6: Generate action arst according to Eq. (3.4) 7: Generate action aast according to Eq. (3.7) 8: Execute actions arst and at as 9: Observe rewards rt (st , ars as t ) and rt (st , at ) from user 10: Update the state from st to st+1 11: end for 12: end for in Algorithm 3.2. First, the well-trained RS generates a rec-list by πrs ∗ (line 6) according to the current state st (line 5). Second, the well-trained AS, collaboratively working with BS, decides to insert an ad into the rec-list (or not) by πas ∗ (line 7). Third, the reward is observed from the target user to the hybrid list of recommended and advertised items (lines 8 and 9). Finally we transit the state to st+1 (line 10). 3.4 Experiment In this section, we will conduct extensive experiments using data from a short video site to assess the effectiveness of the proposed RAM framework. We first introduce the experimental settings, then compare the RAM framework with state-of-the-art baselines, and finally conduct component and parameter analysis on RAM. 3.4.1 Experimental Settings Since there are no public datasets consist of both recommended and advertised items, we collected a dataset from a short video site, TikTok, in March 2019. In total, we collect 57 Table 3.1: Statistics of the dataset. session user normal video ad video 1,000,000 188,409 17,820,066 10,806,778 session dwell time session length session ad revenue rec-list with ad 17.980 min 55.032 videos 0.667 55.23% 1,000,000 sessions in chronological order, where the first 70% is used as training/validation set and the later 30% is test set. For more statistics of the dataset, please see Table 3.1. There are two types of videos in the dataset: regular videos (recommended items) as well as ad videos (advertised items). The features for a normal video contain: id, like score, finish score, comment score, follow score and group score, where the scores are predicted by the platform. The features for an ad video consist of: id, image size, bid-price, hidden-cost, predicted-ctr and predicted-recall, where the platform predicts the last four. It is worth noting that (i) the effectiveness of the calculated features have been validated in the businesses of the short video site, (ii) we discretize each numeric feature into a one-hot vector, and (iii) baselines are based on the same features for a fair comparison. 3.4.2 Evaluation Metrics The reward rt (st , ars t ) to evaluate user experience of a list of regular videos is the dwell time (min), and the reward rt (st , aas t ) of ad videos is 0 if users leave the site and 1 if users continue to browse. We use the session dwell time Rrs = T1 rt (st , ars P t ), session length PT PT Ras = 1 rt (st , aas t ), and session ad revenue R rev = 1 revt (aas t ) as metrics to measure the performance of a test session. 58 3.4.3 Architecture Details Next, we detail the architecture of RAM to ease reproductivity. The number of candidate regular/ad videos (selected by external recall systems) for a request is 15 and 5 respectively, and the size of regular/ad video representation is 60. There are k = 6 regular videos in a rec-list. The initial state s0 of a session is collected from its first three requests, and the dimensions of prec ad rs ad t , pt , ct , at , at are 64, 64, 13, 360, 60, respectively. 
For the second level DQN (AS), two separate 2-layer neural networks are respectively used to generate V (st ) and A(st , aad t ), where the output layer has k + 2 = 8 units, i.e., 8 possible ad locations including not to insert an ad. We empirically set the size of replay buffer 10,000, and the discounted factor of MDP γ = 0.95. We select important hyper-parameters such as α and N via cross-validation, and we do parameter-tuning for baselines for a fair comparison. In the following subsections, we will present more details of parameter sensitivity for the RAM framework. 3.4.4 Overall Performance Comparison The experiment is based on a simulated online environment, which can provide the rt (st , ars t ), rt (st , aas as t ) and revt (at ) according to a mixed rec-ads list. The simulator shares similar architecture to Figure 3.5, while the output layer predicts the dwell time, whether user will leave and the ad revenue of current mixed rec-ads list. We compare the proposed framework with the following representative baseline methods: • W&D [24]: This baseline jointly trains a wide linear model with feature transformations and a deep feedforward neural network with embeddings for general recommender systems with sparse inputs. One W&D estimates the recommending scores of regular videos and 59 Table 3.2: Performance comparison. Algorithms Metrics Values W&D DFM GRU DRQN RAM-l RAM-n value 17.61 17.95 18.56 18.99 19.61 19.49 Rrs improv.(%) 11.35 9.25 5.66 3.26 - 0.61 p-value 0.000 0.000 0.000 0.000 - 0.006 value 8.79 8.90 9.29 9.37 9.76 9.68 Ras improv.(%) 11.03 9.66 5.06 4.16 - 0.83 p-value 0.000 0.000 0.000 0.000 - 0.009 value 1.07 1.13 1.23 1.34 1.49 1.56 Rrev improv.(%) 45.81 38.05 26.83 16.42 4.70 - p-value 0.000 0.000 0.000 0.000 0.001 - each time we recommend k videos with highest scores, while another W&D predicts whether to insert an ad and estimates the CTR of ads. • DFM [51]: DeepFM is a deep model that incorporates W&D model with factorization- machine (FM). It models high-order feature interactions like W&D and low-order interac- tions like FM. • GRU [55]: GRU4Rec is an RNN with GRU to predict what user will click next according to her/his behavior histories. • DRQN [53]: Deep Recurrent Q-Networks addresses the partial observation problem by considering the previous context with a recurrent structure. DRQN uses an RNN architecture to encode previous observations before the current time. Note that we also develop two separate DFMs, GRUs, DRQNs for RS and AS, respectively. The results are shown in Table 3.2. We make the following observations: • GRU performs better than W&D and DFM, since W&D and DFM neglect users’ sequential behaviors of one session, while GRU can capture the sequential patterns. 60 • DRQN outperforms GRU, since DRQN aims to maximize the long-term rewards of a session, while GRU targets maximizing the immediate reward of each request. This result demonstrates the advantage of introducing RL for online recommendation and advertising. • RAM-l and RAM-n achieve better performance than DRQN, which validates the effec- tiveness of the proposed two-level DQN framework, where the RS generates a rec-list of recommendations and the AS decides how to insert ads. • RAM-n outperforms RAM-l in session ad revenue, since the second step of RAM-n will select the ad-location pair with maximal immediate advertising revenue, which has a higher probability of inserting ads. 
To sum up, RAM outperforms representative baselines, which demonstrates its effectiveness in online recommendation and advertising. 3.4.5 Component Study To understand the impact of model components of RAM, we systematically eliminate the corresponding components of RAM by defining the following variants: • RAM-1: This variant has the same neural network architectures with the RAM framework, while we train it in the supervised learning way; • RAM-2: In this variant, we evaluate the contribution of recurrent neural networks, so we replace RNNs with fully-connected layers. Specifically, we concatenate recommendations or ads into one vector and then feed it into fully-connected layers; • RAM-3: In this variant, we use the original cascading DQN architecture in [23] as RS; 61 (a ) R rs (b ) R a s (c ) R re v 2 0 1 0 .0 2 .0 1 9 9 .5 1 .5 1 8 9 .0 1 .0 1 2 3 4 5 -l n 1 2 3 4 5 -l n 1 2 3 4 5 -l n M - M - M - M - M - A M A M - M - M - M - M - M - A M A M - M - M - M - M - M - A M A M - R A R A R A R A R A R R R A R A R A R A R A R R R A R A R A R A R A R R Figure 3.6: Performance comparison of different variants. • RAM-4: For this variant, we do not divide the Q-function of AS into the value function V (s) and the advantage function A(s, a); • RAM-5: This variant leverages an additional input to represents the location, and uses the DQN in Figure 3.4(b) for AS. The results are shown in Figure 3.6. By comparing RAM and its variants, we make the following observations: • RAM-1 demonstrates the advantage of reinforcement learning over supervised learning for jointly optimizing recommendation and online advertising; • RAM-2 validates that capturing user’s sequential behaviors can enhance the performance; • RAM-3 proves the effectiveness of RNN over k separate layers for larger action space; • RAM-4 suggests that dividing Q(st , at ) into V (st ) and A(st , at ) can boost the performance; • RAM-5 validates the advantage of the proposed AS architecture (over classic DQN archi- tectures) that inputs a candidate ad aad t and outputs the Q-value for all possible locations {aloc t }. 62 6 (a ) 1 0 4 R a s 9 2 R re v 8 0 6 α (b ) 1 0 4 R a s 9 2 R re v 8 0 2 4 6 8 1 0 Ν Figure 3.7: Parameter sensitivity analysis. In summary, leveraging suitable RL policy and proper neural network components can improve the overall performance. 3.4.6 Parameter Sensitivity Analysis Our method has two key hyper-parameters, i.e., (i) the parameter α of RAM-l, and (ii) the parameter N of RAM-n. To study their sensitivities, we fix other parameters, and investigate how the RAM framework performs with the changes of α or N . • Figure 3.7(a) illustrates the sensitivity of α. We observe that when α increases, the metric Rrev improves, while the metric Ras decreases. This observation is reasonable because when we decrease the importance of the second term of Eq. (3.8), the AS will insert more ads or choose the ads likely to have more revenue, while ignoring their negative impact on regular recommendations. • Figure 3.7(b) shows the sensitivity of N . With the increase of N , we can observe that 63 the metric Rrev improves and the metric Ras decreases. With smaller N , the first step of RAM-n prefers to selecting most ad-location pairs that do not insert an ad, which results in lower Rrev and larger Ras ; on the contrary, with larger N , the first step returns more pairs with non-zero ad revenue, then the second step leads to higher Rrev . 
In a nutshell, both of the above results demonstrate that recommended and advertised items mutually influence each other: inserting more ads can lead to more ad revenue but a worse user experience, and vice versa. Therefore, online platforms should carefully select these hyper-parameters according to their business demands.

3.5 Related Work

In this section, we briefly summarize the work related to our study, which can be mainly grouped into the following categories.

The first category related to this chapter is reinforcement learning-based recommender systems. A DDPG algorithm is used to mitigate the large action space problem in real-world RL-based RS [32]. A tree-structured policy gradient is presented in [18] to avoid the inconsistency of DDPG-based RS. Biclustering is also used to model RS as grid-world games so as to reduce the action space [26]. A Double DQN-based approximate regretted reward technique is presented to address the issue of unstable reward distributions in the dynamic RS environment [21]. A pairwise RL-based RS framework is proposed to capture users' positive and negative feedback to improve recommendation performance [164]. A page-wise RS is proposed to simultaneously recommend a set of items and display them in a 2-dimensional page [160, 165]. A DQN-based framework is proposed to address the issues in the news feed scenario, such as only optimizing the current reward, not considering labels, and the diversity issue [167]. An RL-based explainable RS is presented to explain recommendations and can flexibly control the explanation quality according to the scenario [136]. A policy gradient-based RS for YouTube is proposed to address the biases in logged data by introducing a simulated historical policy and a novel top-K off-policy correction [20].

The second category related to this chapter is RL-based online advertising techniques, which fall into two groups. The first group is guaranteed delivery (GD), where ads are charged according to a pay-per-campaign pre-specified number of deliveries [113]. A multi-agent RL method is presented to control cooperative policies for the publishers to optimize their targets in a dynamic environment [140]. The second group is real-time bidding (RTB), which allows an advertiser to bid on each ad impression within a very short time slot. The ad selection task is typically modeled as a multi-armed bandit (MAB) problem under the assumptions that the arms are i.i.d., the feedback is immediate, and the environments are stationary [98, 41, 129, 148, 152, 114]. The problem of online advertising with budget constraints and variable costs is studied in the MAB setting [31], where pulling an arm of the bandit yields a random reward and incurs a random cost. However, the MAB setting considers the bid decision as a static optimization problem, and the bidding for a given ad campaign would repeatedly happen until the budget runs out. To address these challenges, the MDP setting has also been studied for RTB [13, 134, 160, 109, 141, 61]. A model-based RL framework is proposed to learn bid strategies in the RTB setting [13], where the state value is approximated by a neural network to better handle the large-scale auction volume and the limited budget. A model-free RL method is also designed to solve the constrained-budget bidding problem, where a RewardNet is presented to generate rewards and address the reward design trap [141]. A multi-agent RL framework is presented that considers other advertisers' bidding as part of the state, and a clustering method is leveraged to handle the issue of a large number of advertisers [61].
Chapter 4

User Simulation for Recommendations

Abstract

With the recent advances in Reinforcement Learning (RL), there has been tremendous interest in employing RL for recommender systems. However, directly training and evaluating a new RL-based recommendation algorithm requires collecting users' real-time feedback in the real system, which is time- and effort-consuming and could negatively impact users' experience. Thus, it calls for a user simulator that can mimic real users' behaviors to pre-train and evaluate new recommendation algorithms. Simulating users' behaviors in a dynamic system faces immense challenges: (i) the underlying item distribution is complex, and (ii) the historical logs for each user are limited. In this chapter, we develop a user simulator based on a Generative Adversarial Network (GAN). To be specific, the generator captures the underlying distribution of users' historical logs and generates realistic logs that can be considered as augmentations of the real logs, while the discriminator not only distinguishes real and fake logs but also predicts users' behaviors. The experimental results based on benchmark datasets demonstrate the effectiveness of the proposed simulator.

Figure 4.1: An example of system-user interactions.

4.1 Introduction

With the recent tremendous development of Reinforcement Learning (RL), there has been increasing interest in adapting RL for recommendations [20]. RL-based recommender systems treat the recommendation procedure as sequential interactions between users and a recommender agent (RA), as shown in Figure 4.1. In each iteration, the recommender system suggests a set of items to the user; then, the user browses the recommended items and provides her/his real-time feedback; next, the system updates its recommendation strategy according to the user's feedback. RL-based recommender systems aim to automatically learn an optimal recommendation strategy (policy) that maximizes the cumulative rewards from users without any specific instructions. They have two key advantages: (i) the recommender agent can continuously update its recommendation strategies based on users' real-time feedback during the user-agent interactions; and (ii) the optimal strategy targets maximizing the long-term reward from users (e.g., the overall revenue of a recommendation session). Given these advantages, there has recently been tremendous interest in developing RL-based recommender systems [32, 160, 167].

RL-based recommendation algorithms are desired to be trained and evaluated based on users' real-time feedback (the reward function). The most direct and precise way is the online A/B test [151, 66], where a new recommendation algorithm is trained based on the feedback from real users and its performance is compared against that of the previous algorithm via randomized experiments. However, online A/B tests are inefficient and expensive: (i) online A/B tests usually take several weeks to collect sufficient data for the sake of statistical significance, and (ii) numerous engineering efforts are typically required to deploy the new algorithm in the real system [150, 43, 73]. Furthermore, online A/B tests often lead to a bad user experience in the initial stage, when the new recommendation algorithms have not been well trained [72]. These reasons prevent us from quickly training and testing new RL-based recommendation algorithms.
The common practice to handle these challenges in the RL community is to build a simulator to approximate the environment (e.g., OpenAI Gym for video games), and then use it to train and evaluate RL algorithms [39]. Thus, following this practice, in this work we aim to build a user simulator based on users' historical logs, which can be utilized to pre-train and evaluate new recommendation algorithms before launching them online. However, simulating users' behaviors in a dynamic recommendation environment is very challenging. First, the underlying distribution of recommended item sequences in historical logs is extremely complex, since there are millions of items in practical recommender systems. Second, learning a robust simulator typically requires large-scale historical logs as training data for each user. Though massive historical logs are often available, the data available for each individual user is rather limited. Recent efforts have demonstrated that the Generative Adversarial Network (GAN) and its variants are able to generate fake but realistic images [45, 46], which implies their potential in modeling complex distributions. Furthermore, the generated images can be considered as augmentations of real images to enlarge the data space. Driven by these advantages, we propose to build a GAN-based user simulator (UserSim) for RL-based recommenders, which can capture the complex distribution of users' browsing logs and generate realistic logs to enrich the training dataset. We summarize our major contributions as follows:

• We introduce a principled approach to capture the underlying distribution of recommended item sequences in historical logs, and to generate realistic item sequences;

• We propose a user behavior simulator, UserSim, which can be utilized to simulate environments with limited training data so as to pre-train and evaluate RL-based recommender systems; and

• We conduct experiments based on real-world data to demonstrate the effectiveness of the proposed simulator and validate the contributions of its components.

4.2 The Proposed Simulator

This section proposes a simulator framework that imitates a user's feedback (behavior) on a recommended item according to the user's current preference learned from her browsing history.

4.2.1 Problem Statement

RL-based recommender systems treat the recommendation task as sequential interactions between a recommender system (agent) and users (environment E), and use a Markov Decision Process (MDP) to model them, which consists of a sequence of states, actions, and rewards:

• We define the state s = {i_1, ..., i_N} as the sequence of the N items that a user browsed, together with the user's corresponding feedback on each item. The items in s are chronologically sorted;

• An action a, from the recommender system's perspective, is defined as recommending a set of items to the user. Without loss of generality, we suppose that each time the recommender system suggests one item to the user, but it is straightforward to extend this setting to recommending more items;

• When the system takes an action a based on the state s, the user browses the recommended item and provides her feedback on the item, such as skipping, clicking, or purchasing it. The recommender system then receives a reward r(s, a) solely according to the type of feedback (a minimal interface sketch based on these definitions follows this list).
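The sketch below illustrates, under our own assumptions, how such an MDP-style user simulator could be exposed to an RL recommender as a gym-like environment. The class and method names (SimulatorState, predict_feedback) and the feedback-to-reward mapping are hypothetical, not part of the proposed framework.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical mapping from feedback type to reward; real values are platform-specific.
FEEDBACK_REWARD = {"skip": 0.0, "click": 1.0, "purchase": 5.0}

@dataclass
class SimulatorState:
    """State s: the N most recently browsed items and the user's feedback on each."""
    items: List[int] = field(default_factory=list)      # item ids, chronologically sorted
    feedback: List[str] = field(default_factory=list)   # "skip" / "click" / "purchase"

class UserSimulatorEnv:
    """Gym-style wrapper around a trained user simulator (assumed interface)."""
    def __init__(self, simulator, history_length: int = 10):
        self.simulator = simulator          # e.g., the trained behavior predictor
        self.N = history_length

    def step(self, state: SimulatorState, action: int) -> Tuple[SimulatorState, float]:
        # The simulator imitates the user's feedback on the recommended item (action).
        fb = self.simulator.predict_feedback(state, action)   # assumed method name
        reward = FEEDBACK_REWARD[fb]
        # Append the new interaction and keep only the most recent N items.
        next_state = SimulatorState(
            items=(state.items + [action])[-self.N:],
            feedback=(state.feedback + [fb])[-self.N:])
        return next_state, reward
```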
With the aforementioned definitions and notations, the goal of the simulator can be formally defined as follows: given a state-action pair (s, a), imitate the user's feedback (behavior) on the recommended item according to the user's preference learned from her browsing history.

4.2.2 The Generator Architecture

The goal of the generator is to learn the data distribution and then generate indistinguishable logs (actions) based on users' browsing histories (states), i.e., to imitate the recommendation policy of the recommender system that generated the historical logs. Figure 4.2 illustrates the generator with the Encoder-Decoder architecture.

Figure 4.2: The generator with the Encoder-Decoder architecture.

4.2.2.1 The Encoder Component

The Encoder component aims to learn the user's preference according to the items browsed by the user and the user's feedback. The input is the state s = {i_1, ..., i_N} observed in the historical logs, i.e., the sequence of N items that the user browsed and the user's corresponding feedback on each item. The output is a low-dimensional representation of the user's current preference, referred to as p^E. Each item i_n ∈ s involves two types of information: i_n = (e_n, f_n), where e_n is a low-dimensional and dense item embedding of the recommended item, and f_n is an embedding that denotes the user's feedback on the recommended item (these embeddings are jointly trained with the neural networks in an end-to-end manner). The intuition behind selecting these two types of information is that we not only want to learn the information of each item in the sequence, but also want to capture the user's interest (feedback) in each item. Then, we concatenate e_n and f_n to obtain a low-dimensional and dense vector: I_n = concat(e_n, f_n). We introduce a Recurrent Neural Network (RNN) with Gated Recurrent Units (GRUs) to capture the sequential patterns of items in the logs. We consider the RNN's final hidden state h_N as the output of the Encoder component, i.e., the low-dimensional representation of the user's current preference: p^E = h_N.

4.2.2.2 The Decoder Component

The goal of the Decoder component is to predict the item that will be recommended according to the user's current preference. Therefore, the input is the user's preference representation p^E, while the output is the item embedding of the item that is predicted to appear at the next position in the log, referred to as G_θ(s). We leverage an MLP with several fully-connected layers as the Decoder to directly transform p^E into G_θ(s).

So far, we have delineated the architecture of the generator, which aims to imitate the recommendation policy of the existing recommender system and generate realistic logs to augment the historical data. In addition, we add a supervised component to encourage the generator to yield items that are close to the ground-truth items, which will be discussed in Section 4.2.4.
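A minimal PyTorch-style sketch of this Encoder-Decoder generator is given below. The embedding sizes, hidden widths, and layer counts are assumptions, and the supervised component of the training objective is omitted here.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Illustrative generator: a GRU encodes the browsing history
    (item + feedback embeddings) into p^E, and an MLP decodes p^E into a
    predicted next-item embedding G_theta(s)."""
    def __init__(self, num_items, num_feedback, item_dim=32, fb_dim=8, hidden=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, item_dim)    # e_n
        self.fb_emb = nn.Embedding(num_feedback, fb_dim)     # f_n
        self.encoder = nn.GRU(item_dim + fb_dim, hidden, batch_first=True)
        self.decoder = nn.Sequential(                        # MLP decoder
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, item_dim))                     # G_theta(s)

    def forward(self, item_ids, feedback_ids):
        # I_n = concat(e_n, f_n) for every position of the browsing history.
        x = torch.cat([self.item_emb(item_ids), self.fb_emb(feedback_ids)], dim=-1)
        _, h_n = self.encoder(x)              # final hidden state h_N, i.e., p^E
        return self.decoder(h_n.squeeze(0))   # predicted item embedding G_theta(s)
```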
Next, we discuss the architecture of the discriminator.

4.2.3 The Discriminator Architecture

The discriminator aims to not only distinguish real historical logs from generated logs, but also predict the class of the user's feedback on a recommended item according to her browsing history. Thus, we consider the problem as a classification problem with 2 × K classes, i.e., K classes of real feedback for the recommended items observed in the historical logs, and K classes of fake feedback for the recommended items yielded by the generator. Figure 4.3 illustrates the architecture of the discriminator. Similar to the generator, we introduce an RNN with GRUs to capture the user's dynamic preference. Note that this RNN has the same architecture as the RNN in the generator, but the two have separate parameters.

Figure 4.3: The discriminator architecture.

The input of the RNN is the state s = {i_1, ..., i_N} observed in the historical logs, where i_n = (e_n, f_n), and the output is a dense representation of the user's current preference, referred to as p^D. Meanwhile, we feed the item embedding of the recommended item (the real a or the fake G_θ(s)) into fully-connected layers, which encode the recommended item into a low-dimensional representation, referred to as e^D. Then we concatenate p^D and e^D, and feed the concatenation (p^D, e^D) into fully-connected layers, whose goals are (1) to judge whether the recommended item is real or fake, and (2) to predict the user's feedback on the item. Therefore, the final fully-connected layer outputs a 2 × K dimensional vector of logits, which represents the K classes of real feedback and the K classes of fake feedback, respectively:

output = [l_R1, · · · , l_RK, l_F1, · · · , l_FK]    (4.1)

where we include K classes of fake feedback in the output layer, rather than only one fake class, since a fine-grained distinction among fake samples can increase the power of the discriminator (more details in the following subsections). These logits can be transformed into class probabilities through a softmax layer, and the probability corresponding to the j-th class is:

p_model(l_j | s, a) = exp(l_j) / Σ_{k=1}^{2×K} exp(l_k)    (4.2)

The objective function is based on these class probabilities. In addition, a supervised component is introduced to enhance the prediction of the user's feedback; more details about this component will be discussed in Section 4.2.4.
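A minimal PyTorch-style sketch of this discriminator is shown below. The embedding and hidden sizes are assumptions, and K = 3 feedback classes (e.g., skip/click/purchase) is only a placeholder.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Illustrative discriminator: a GRU encodes the browsing history into p^D,
    FC layers encode the (real or generated) item embedding into e^D, and the
    concatenation is mapped to 2*K logits (K real-feedback + K fake-feedback classes)."""
    def __init__(self, num_items, num_feedback, item_dim=32, fb_dim=8,
                 hidden=64, K=3):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, item_dim)
        self.fb_emb = nn.Embedding(num_feedback, fb_dim)
        self.history_rnn = nn.GRU(item_dim + fb_dim, hidden, batch_first=True)
        self.item_fc = nn.Sequential(nn.Linear(item_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * K))      # [l_R1, ..., l_RK, l_F1, ..., l_FK]

    def forward(self, item_ids, feedback_ids, candidate_item_emb):
        # candidate_item_emb: embedding of the real action a or the fake G_theta(s).
        x = torch.cat([self.item_emb(item_ids), self.fb_emb(feedback_ids)], dim=-1)
        _, h_n = self.history_rnn(x)              # p^D
        e_d = self.item_fc(candidate_item_emb)    # e^D
        logits = self.head(torch.cat([h_n.squeeze(0), e_d], dim=-1))
        return logits                             # a softmax over these gives Eq. (4.2)
```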
4.2.4 The Objective Function

In this subsection, we introduce the objective functions of the proposed simulator. The discriminator has two goals: (1) distinguishing real-world historical logs from generated logs, and (2) predicting the class of the user's feedback on a recommended item according to the browsing history. The first goal corresponds to an unsupervised problem, just like a standard GAN that distinguishes real and fake images, while the second goal is a supervised problem that minimizes the class difference between the user's ground-truth feedback and the predicted feedback. Therefore, the loss function L_D of the discriminator consists of two components.

For the unsupervised component that distinguishes real-world historical logs from generated logs, we need to calculate the probability that a state-action pair is real or fake. From Eq. (4.2), we know that the probability that a state-action pair observed in the historical logs is classified as real, referred to as D_φ(s, a), is the sum of the probabilities of the K real-feedback classes:

D_φ(s, a) = Σ_{k=1}^{K} p_model(l_k | s, a)    (4.3)

while the probability that a fake state-action pair, whose action G_θ(s) is produced by the generator, is classified as fake, denoted D_φ(s, G_θ(s)), is the sum of the probabilities of the K fake-feedback classes:

D_φ(s, G_θ(s)) = Σ_{k=K+1}^{2×K} p_model(l_k | s, G_θ(s))    (4.4)

Then, the unsupervised component of the loss function L_D is defined as follows:

L_D^unsup = −{ E_{s,a∼p_data} [log D_φ(s, a)] + E_{s∼p_data} [log D_φ(s, G_θ(s))] }    (4.5)

where both s and a are sampled from the historical-log distribution p_data in the first term; in the second term, only s is sampled from p_data, while the action G_θ(s) is yielded by the generator policy G_θ.

The supervised component aims to predict the class of the user's feedback, which is formulated as a supervised problem that minimizes the class difference (i.e., the cross-entropy loss) between the user's ground-truth feedback and the predicted feedback. Thus it also has two terms: the first term is the cross-entropy loss between the ground-truth class l_k and the predicted class for a real state-action pair sampled from the real historical data distribution p_data, while the second term is the cross-entropy loss between the ground-truth class l_k and the predicted class for a fake state-action pair, where the action G_θ(s) is yielded by the generator. Thus, the supervised component of the loss function L_D is defined as follows:

L_D^sup = −{ E_{s,a,r∼p_data} [log p_model(l_k | s, a, k ≤ K)] + λ · E_{s,r∼p_data} [log p_model(l_k | s, G_θ(s), K < k ≤ 2×K)] }    (4.6)
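For reference, a minimal sketch of how the loss in Eqs. (4.3)-(4.6) could be computed from the discriminator's 2×K logits is given below. The small numerical stabilizer, the equal-weight combination of the two components, and reading Eq. (4.6) as a standard 2K-way cross-entropy are our assumptions rather than details taken from the text.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_real, logits_fake, fb_real, fb_fake, K, lam=1.0):
    """Illustrative discriminator loss L_D = L_D^unsup + L_D^sup.

    logits_real: (batch, 2K) logits for real (s, a) pairs
    logits_fake: (batch, 2K) logits for fake (s, G_theta(s)) pairs
    fb_real, fb_fake: ground-truth feedback class indices in [0, K)
    """
    p_real = F.softmax(logits_real, dim=-1)
    p_fake = F.softmax(logits_fake, dim=-1)

    # Eqs. (4.3)/(4.4): D_phi sums the K real classes for real pairs
    # and the K fake classes for generated pairs.
    d_real = p_real[:, :K].sum(dim=-1)
    d_fake = p_fake[:, K:].sum(dim=-1)

    # Eq. (4.5): unsupervised component (1e-8 added for numerical stability).
    loss_unsup = -(torch.log(d_real + 1e-8).mean() + torch.log(d_fake + 1e-8).mean())

    # Eq. (4.6): supervised component, read as a 2K-way cross-entropy with the
    # ground-truth feedback class offset by K for generated pairs.
    loss_sup = F.cross_entropy(logits_real, fb_real) \
               + lam * F.cross_entropy(logits_fake, fb_fake + K)

    return loss_unsup + loss_sup
```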