USING THE DEEP GRAPH REPRESENTATION LEARNING METHOD TO STUDY THE INDIAN ELECTION SOCIAL NETWORK

By

Yimo Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Business Administration - Business Information Systems - Doctor of Philosophy
Computational Mathematics, Science and Engineering - Dual Major

2022

ABSTRACT

USING THE DEEP GRAPH REPRESENTATION LEARNING METHOD TO STUDY THE INDIAN ELECTION SOCIAL NETWORK

By

Yimo Liu

The aim of this dissertation is to use deep graph representation learning to study two problems in social network data. The dissertation consists of two essays. The first essay uses deep graph representation learning to address the counterfactual inference problem on a social network. The second essay uses deep graph representation learning to improve the fairness of recommendation results on a social network. Graph representation learning is a useful method for extracting graph features from a large-scale network. Mapping the data from the graph space into an embedding space has two main advantages. First, there is no need to impose a predefined structure on the graph, such as a bipartite or tripartite graph; instead, the information in the network topology is preserved as well as possible. Second, when we study social network platforms such as Facebook and Twitter, we need to consider additional attributes attached to the nodes and edges, so the information in the social network becomes more complicated. Considering these two advantages, we choose graph representation learning to embed these features into a representation space. The first essay combines this method with transfer learning in order to fit Rubin's counterfactual framework. This yields treatment effect estimates that are more robust and less biased than those from propensity score matching methods. The second essay uses graph representation learning to learn a representation space for the different types of users and connections on Twitter. We combine this method with an attention mechanism to construct a fairness-based loss function, which increases the fairness of the recommendations while maintaining their predictive accuracy in the social network.

ACKNOWLEDGEMENTS

First, I want to thank the advisor of my dissertation committee, Professor Anjana Susarla. Under her guidance, I chose graph neural network related methods as my dissertation topic. Both the causal inference problem and the recommendation fairness problem are very active topics in academic research, and these two questions have a broad audience in many disciplines. I also sincerely thank the other committee members, Professor Quan Zhang, Professor Yuying Xie, and Professor Matthew Hirn. Their suggestions covered both the model analysis and the empirical analysis, and they helped me work through many tricky and difficult parts of this research project. Because my dissertation has to satisfy the requirements of both the Information Systems and the Computational Mathematics disciplines, their suggestions were always helpful and thoughtful. I also want to thank Professor Yingda Lu and Professor Nishtha Langer for providing this rich dataset and for their valuable suggestions. Second, it was my luck and honor to have the chance to join both the business school and the engineering school over the last six years at Michigan State University. I learned a great deal of advanced knowledge and many technical skills in courses, lectures, and workshops.
I saw how world class scholars discussed, critiqued, and created the academic researches. It is really helpful for me to continue my research job in the future career. Third, thank my family, my friends in Lansing, and my girlfriend, who always stand behind me and support me with their love and time. We played basketball together, had meal together, and shared happiness and sadness together. I cannot get any achievement without your help. In the end, thank God to let me still alive. iv TABLE OF CONTENTS LIST OF TABLES ........................................................................................................................ vii LIST OF FIGURES ........................................................................................................................ x LIST OF ALGORITHMS ............................................................................................................. xii ESSAY 1. A GRAPH REPRESENTATION LEARNING METHOD FOR MATCHING THE ONLINE POLITICAL SUPPORTERS ON THE SOCIAL NETWORK ...................................... 1 0. ABSTRACT OF ESSAY 1 ..................................................................................................... 1 1. INTRODUCTION ................................................................................................................... 2 2. LITERATURE REVIEW........................................................................................................ 7 2.1. Propensity Score Matching method under network data .................................................. 7 2.1.1. Solve hidden confounder problem ............................................................................. 7 2.1.2. Solve Interference Problem........................................................................................ 8 2.2. Representation Learning for Counterfactual Inference .................................................. 10 2.2.1. Kernel Space Matching Method for Counterfactual Inference ................................ 10 2.2.2. Deep Representation Learning for Counterfactual Inference .................................. 11 2.2.3. Compare propensity score method with representation learning method ................ 14 3. MODEL BUILDING ............................................................................................................ 18 3.1. Summary ......................................................................................................................... 18 3.2. Counterfactual Inferences and Propensity Score Matching ........................................... 21 3.3. Representation Learning for counterfactual framework ................................................. 24 3.4. Deep Graph Neural Network. ......................................................................................... 29 4. DATA AND RESULT .......................................................................................................... 34 4.1. Data Description: Indian General Election Twitter 2014 ............................................... 34 4.2. Semi-Synthetic Data ....................................................................................................... 35 4.3. Model Performance ........................................................................................................ 38 4.3.1. Covariates Balanced Improvements from GNN ...................................................... 38 4.3.2. Treatment Effect Improvements from Representation Learning ............................. 41 4.4. 
Empirical Data Results ................................................................................................... 46 5. DISCUSSION AND CONCLUSION ................................................................................... 48 ESSAY 2. GRAPH ATTENTION BASED METHOD FOR IMPROVING THE RECOMMENDATION FAIRNESS IN ELECTION SOCIAL NETWORK .............................. 49 0. ABSTRACT OF ESSAY 2 ................................................................................................... 49 1. INTRODUCTION ................................................................................................................. 50 2. LITERATURE REVIEW...................................................................................................... 55 2.1. Political Brand Image on semantic network ................................................................... 56 2.2. Heterogeneous Information Network Embedding .......................................................... 59 2.3. Fairness recommendation algorithms ............................................................................. 62 3. PRELIMINARIES ................................................................................................................ 65 4. MODEL SETTING AND ANALYSIS................................................................................. 72 4.1. Model setting .................................................................................................................. 72 v 4.2. Intra-Path Attention ........................................................................................................ 73 4.3. Inter-Path Attention ........................................................................................................ 75 4.4. Attention Fairness ........................................................................................................... 76 4.5. Training .......................................................................................................................... 78 4.6. Model Analysis ............................................................................................................... 78 5. DATA RESULT .................................................................................................................... 80 5.1. Dataset: Indian 2014 Election Twitter ............................................................................ 80 5.2. Experiment Setting ......................................................................................................... 80 5.2.1. Baseline .................................................................................................................... 80 5.2.2. Parameter Setting ..................................................................................................... 82 5.3. Performance Comparison ............................................................................................... 82 5.4. Fairness of the recommendation ..................................................................................... 85 5.4.1 Fairness Measures: .................................................................................................... 86 6. CONCLUSION ..................................................................................................................... 88 APPENDICES .............................................................................................................................. 89 APPENDIX A. 
Notation in essay one ....................................................................................... 90 APPENDIX B. Propensity Score, Inverse Probability Treatment Weighting estimation, and Doubly Robust method.............................................................................................................. 91 APPENDIX C. Ignore the hidden confounders ........................................................................ 93 APPENDIX D. Diagram of the Domain Adaptation for counterfactual framework ................ 95 APPENDIX E. Maximum mean discrepancy (MMD).............................................................. 96 APPENDIX F. Wasserstein Distance........................................................................................ 98 APPENDIX G. Node2Vec and :Latent Features .................................................................... 100 APPENDIX H. Balanced Covariates Measures and Imbalanced Covariates Measures ......... 106 APPENDIX I. Notation in essay two ...................................................................................... 129 APPENDIX J. Data category for the election in social network............................................. 130 APPENDIX K. Word Embedding Model ............................................................................... 133 APPENDIX L. Meta-path based Network Embedding ........................................................... 146 APPENDIX M. Performance Measure in Recommendation .................................................. 148 REFERENCES ........................................................................................................................... 151 vi LIST OF TABLES Table 1. Literature review of Propensity Score Matching method under network data ................. 9 Table 2. Literature review of Propensity Score Matching method under network data ............... 13 Table 3. Compare Propensity Score Method and Representation Learning Method ................... 16 Table 4. Compare our method with other methods....................................................................... 17 Table 5. An example of the factual sample set and the counterfactual sample set ....................... 25 Table 6. Data description .............................................................................................................. 35 Table 7. Propensity Score Balanced Improvement (100k users) .................................................. 41 Table 8. Compare GNN methods with Propensity score-based method on treatment effect with small dataset and Balanced Covariates Measures (without GNN embeddings) on synthetic data43 Table 9. Compare GNN methods with Propensity score-based method on treatment effect with small dataset and Imbalanced Covariates Measures (without GNN embeddings) on synthetic data ....................................................................................................................................................... 44 Table 10. Compare GNN methods with Propensity score-based method on treatment effect with big dataset and Balanced Covariates Measures (without GNN embeddings) on synthetic data .. 45 Table 11. Compare GNN methods with Propensity score-based method on treatment effect with big dataset and Imbalanced Covariates Measures (without GNN embeddings) on synthetic data ....................................................................................................................................................... 45 Table 12. 
Compare GNN methods with Propensity score based-method on treatment effect with small dataset and Balanced Covariates Measures (with graph features) ..................................... 46 Table 13. Compare GNN methods with Propensity score based-method on treatment effect with large dataset and Balanced Covariates Measures (with graph features) ...................................... 47 Table 14. GNN based result with different treatment (hashtag) ................................................... 47 Table 15. Literature review of Political Brand Image on semantic network ................................ 58 Table 16. Literature review of Heterogeneous Information Network Embedding ....................... 62 Table 17. Literature review of Fairness recommendation algorithms .......................................... 64 Table 18. Data Description ........................................................................................................... 80 Table 19. Recall@20 Results on three datasets. The best method is bolded ................................ 83 vii Table 20. HR@20 Results on three datasets. The best method is bolded..................................... 84 Table 21. NDCG@20 Results on three datasets. The best method is bolded ............................... 84 Table 22. AUC Results on three datasets. The best method is bolded ......................................... 85 Table 23. Demographic Parity Results on three datasets. The best method is bolded ................. 86 Table 24. Equal Opportunity Results on three datasets. The best method is bolded .................... 87 Table A.1. Equal Opportunity Results on three datasets. The best method is bolded .................. 90 Table A.2. Pros and Cons of the Similarity function Choices .................................................... 101 Table A.3. Balanced Covariates Measures on two assigned group ............................................ 106 Table A.4. Imbalanced Covariates Measures on two assigned group ........................................ 106 Table A.5. Balanced Covariates Measures on two assigned group with latent features ............ 107 Table A.6. Imbalanced Covariates Measures on two assigned group with latent features ......... 108 Table A.7. Propensity Score Balanced Improvement (10k) ....................................................... 109 Table A.8. Propensity Score Balanced Improvement (20k) ....................................................... 109 Table A.9. Propensity Score Balanced Improvement (40k) ....................................................... 109 Table A.10. Propensity Score Balanced Improvement (80k) ..................................................... 110 Table A.11. Propensity Score Balanced Improvement (100k) ................................................... 110 Table A.12. Notation in essay two .............................................................................................. 129 Table A.13. User level node type................................................................................................ 130 Table A.14. Function level node type ......................................................................................... 130 Table A.15. Content level node type .......................................................................................... 131 Table A.16. Data Coding Scheme............................................................................................... 132 Table A.17. 
Corpus of the Tweet (first 20 rows) ........................................................................ 133 Table A.18. The most frequent Bi-gram features of different Parties (Top 20) ......................... 134 Table A.19. The most frequent Tri-gram features of different Parties (Top 20) ........................ 135 Table A.20. Top 10 keywords for 5 Topics of different Parties ................................................. 136 viii Table A.21. Topic Perplexity and Coherence Score ................................................................... 139 Table A.22. Finding the dominant topics in each sentence ........................................................ 140 Table A.23. Most Representative Topics for Each Document ................................................... 142 Table A.24. Topic distribution across documents ...................................................................... 145 ix LIST OF FIGURES Figure 1. Concept Model .............................................................................................................. 19 Figure 2. Diagram of the Model ................................................................................................... 20 Figure 3. Diagram of the Counterfactual Inference-Step 1........................................................... 28 Figure 4. Diagram of the Counterfactual Inference-Step 2........................................................... 29 Figure 5. Diagram of the Graph Neural Network ......................................................................... 31 Figure 6. Diagram of opinion diffusion in twitter......................................................................... 67 Figure 7. Meta-path Instances ....................................................................................................... 68 Figure 8. Meta-path based Neighbors ........................................................................................... 69 Figure 9. Diagram of the Intra-Path Attention .............................................................................. 74 Figure 10. Diagram of the Inter-Path Attention ............................................................................ 76 Figure A.1. Causal Graph Model to identify the causal inference problem ................................. 93 Figure A.2. Domain Adaptation for counterfactual framework ................................................... 95 Figure A.3. Domain Adaptation for counterfactual framework with Maximum mean discrepancy ....................................................................................................................................................... 97 Figure A.4. Wasserstein Distance ................................................................................................. 99 Figure A.5. Node2Vec and Latent Features................................................................................ 101 Figure A.6. Node2Vec Visualization .......................................................................................... 103 Figure A.7. Node2Vec for Node Classification .......................................................................... 104 Figure A.8. Covariate Balanced Table for theory based features ............................................... 111 Figure A.9. Distribution Balance for each theory based feature................................................. 112 Figure A.10. Covariate Balanced Table for both theory based feature and latent features: ....... 116 Figure A.11. 
Distribution Balance for each theory based feature ............................................... 117
Figure A.12. Distribution Balance for each latent feature .......................................................... 120
Figure A.13. Different Topic Numbers vs Coherence Scores on each topic model ................... 139
Figure A.14. Adjacency transformation ...................................................................................... 146

LIST OF ALGORITHMS

Algorithm 1. Attention Fair Algorithm ......................................................................................... 79

ESSAY 1. A GRAPH REPRESENTATION LEARNING METHOD FOR MATCHING THE ONLINE POLITICAL SUPPORTERS ON THE SOCIAL NETWORK

0. ABSTRACT OF ESSAY 1

Twitter is one of the most important platforms for politicians to communicate with their supporters. However, the causal mechanism behind how politicians use Twitter as an electronic campaign tool to influence their supporters has received little research attention. This paper studies how the tweets politicians post affect their supporters' sentiments on Twitter. In particular, we focus on how political hashtags causally affect people's reactions on Twitter. We use data from the 2014 Indian general election to construct the retweet network and estimate causal effects on this network. The most difficult part of the causal inference problem is avoiding confounding factors, and social network data directly violate the assumptions underlying propensity score based methods, which leads to a biased estimator. In this paper, we use deep graph representation learning to construct a better matching method under the social network setting. We find that, compared with the propensity score based method, our method consistently gives the lowest-error estimates in four settings: balanced covariates with a small-scale network, imbalanced covariates with a small-scale network, balanced covariates with a large-scale network, and imbalanced covariates with a large-scale network.

1. INTRODUCTION

Social networks play an important role in our lives. Many people communicate and discuss political events on social network platforms such as Facebook, Reddit, and Twitter.1 This in turn attracts many politicians and parties to use these platforms as tools to present themselves and to campaign. For example, President Obama used Twitter to interact with his voters. Even though his budget was lower than Hillary Clinton's, Obama still won the election in 2008, and some researchers attribute his win to his successful use of Twitter: he was very good at using Twitter to interact with his voters during the presidential campaign. Besides posting more interesting and attractive content in order to encourage their followers to engage with politics, politicians also analyze the sentiments of their followers and modify their policies based on those reactions. They need to understand the reasons behind successful tweets and adjust their strategies for target voters (Ahmed et al 2016, Khan et al 2020, Panda et al 2020). To explore the reasons behind successful tweets, we want to identify the causal effects of an opinion leader's tweet. For example, suppose President Trump posts a tweet such as "With what I am doing in the fight with the Drug Companies, drug prices will be coming down 50, 60, and even 70 percent."
This post may influence the sentiments from the followers for the following reasons: 1) The follower gives a positive or negative comments because he or she likes or dislikes the drug prices got decreased. 2) The follower gives a positive or negative comment because he or she supports or not supports Trump no matter what he tweets. 1 https://www.pewresearch.org/fact-tank/2019/08/02/10-facts-about-americans-and-twitter/ 2 3) The follower retweets because he or she gets affected from the neighbors’ sentiments on twitter. The last two reasons are most common confounding factors when we detect the causal effect under social network circumstance (Fang et al 2013, Ma et al 2015, Zhang et al 2018, Rishika and Ramaprasad 2019, Gelper et al 2020). The second reason is called homophily, which means users tend to give positive sentiments to the opinion leader because they have the similar demographic information such as age, gender, etc (McPherson et al 2001, Krivitsky et al 2009). The third reason is called peer influence. It means that, users are easy to get influenced from other connected users on the social network (Lewis 2012, Christakis and Fowler 2013). If most of other users give positive or negative sentiments on the Trump’s tweet, one tends to give sentiments similar to his or her neighbors. If the opinion leaders plan to please their base voters, attract swing voters and new voters based on their reactions on the tweets, they need to distinguish the first reason from the second reason and the third reason. To detect the homophily and peer influence are very important to estimate the causal effect behind the successful tweets. (Shalizi and Thomas 2011, Ma et al 2015 , Zhang et al 2018, Gelper et al 2020) Many studies have focused on separating homophily and peer influence from the causal inference under non-social-network circumstance based on the Rubin’s (2004) framework, the propensity score matching method. (Hirano and Imbens 2004, Aral et al 2009, Imai and Ratkovic 2014, Forastiere et al 2020, Sant'Anna et al 2020) This method requires the treated group and the control group to follow the independent identical distribution condition (also called SUVTA assumption) and both the treated group and the control group outcomes should be independent with the treatment (also called unconfounded assumption). If we can construct the balanced propensity score based on the covariate samples, which means there is no significant difference 3 between treated group and the control group, we can match the similar units on both treated group and control group based on their propensity scores. (Rubin 2004, Hirano and Imbens 2004) The difference between their outcomes is the treatment effect. This method is very success under the circumstance that the covariate sample sets is small and balanced and the relationship between covariates and outcomes are linear (Johansson et al 2016, Veitch 2019). However, the propensity score matching method is not very efficient under the social network circumstance (Tchetgen and VanderWeele 2012, Huffman and Gameren 2018). The network data directly break these two assumptions because social tie and homophily exist. For example, some users have a political preference on republican and some others have a political preference on democrats. If we choose the users have a political preference on republican as the treated group and the users have a political preference on democrats as the control group and try to test how one tweet affects their opinions. 
Both treated group users and control group users can receive the same message from the opinion leader, we cannot separate them very clearly. This situation breaks the SUVTA assumption and leads to overestimates the causal effect on the network (Forastiere et al 2020, Jackson et al 2020). Statisticians solve these problems by modifying the propensity score with information from the confounders, or the network structure. Hirano and Imbens (2004) build a propensity score by generalized the un-confoundedness to solve the homopily problem. Jackson et al (2020) solve the peer influence problem based on generalized propensity score including the local graph structure. These methods can somewhat help to avoid overestimate the causal effect on the social network platform. However, it is still difficult to construct the balanced propensity scores on the social network (King and Nielsen 2019). 4 We address the casual inference problem under social network circumstance by using the graph- based representation learning. It can transform the covariate sample sets into a latent space, also called representation space, and match the treatment group and control group into a balanced representation space by a kernel statistical test of independence. In the meantime, this method also calculate the relationship between covariates and outcome so that it can handle more nonlinear relationship between covariates and outcome. The generalized error of the causal effect based on the representation learning method are bonded by the math proofs. (Johansson et al 2016, Li and Fu 2017, Kallus 2017, Kallus 2020) The advantage of our model is capturing more unobserved covariates, which implicit includes the information about network structure, latent homophily, etc. We use the node embedding method to represent the node, edge and attributes information (Grover and Leskovec 2016). These information can represent the homophily and peer influence. We learn the representation space based on these network related information, it helps to calculate the balanced covariate metric and capture the nonlinear relationship between covariate variable and the outcome. The deep neural network architecture can help to calculate the large and nonlinear representation space easily. We use the data with Indian election 2014 and compare our model with the propensity score matching method under four types of dataset. The small and covariate balanced case, the large and covariate balanced case, the small and covariate imbalanced case, the large and covariate imbalanced case. The results show that our graph representation learning method can always give lowest bias estimation on each case. If we want to test the casual inference under the social network data, the graph representation learning method is more robust and flexible than the propensity score based method. 5 The organization of our papers is: First, a literature review on causal inference under social network circumstance based on the propensity score matching method and based on the representation learning method. Second, a describing and building of graph neural network model. Third, how this graph representation based method causally explain the social influence problem. Fourth, the data result, and Fifth the discussion. 6 2. LITERATURE REVIEW 2.1. Propensity Score Matching method under network data The most general way to test the causal inference problem is using the Propensity Score Matching Method. 
In order to construct the propensity scores well, Rubin (1987) requires the covariate variables between the treated group and the control group should be balanced. The Propensity Score Matching Method can solve the selection bias problem under Rubin’s two assumptions. However, the network data break the SUTVA assumption and no hidden confounder assumption. It leads to imbalanced covariates between the treated group and the control group. The treatment effect is overestimated under this situation. Many statistical researchers tried to solve these two problems by modifying the construction ways. 2.1.1. Solve hidden confounder problem Hirano and Imbens (2004) are the pioneers to solve the hidden confounder problem. They build a propensity score by generalized the un-confoundedness: each treatment is independent with each outcome yi conditional by different covariates but not for all outcome Y. This method can reduce the bias during continuous treatment situations. Imai and Ratkovic (2014) proposed a method that the treatment is independent with treatment conditional on inverse propensity score weighting moment. This method can help to justify the over-identify problem of the propensity score method. Huffman and Gameren (2018) applied this method to the continuous interventions case. They replace the weight by generalized method of moment. Sant'Anna et al (2020) follow the covariate balanced propensity score idea but just flexible the inverse weighting moment to integrated conditional moment. This method is quite close to our integral probability metric approach. However, this method cannot compute efficiently during high dimension covariates situation. Therefore, they did not apply this method on the network data but on employment data. 7 2.1.2. Solve Interference Problem SUTVA assumption means the outcome of user A has no influence on the outcome of user B. It is not held under network data. For example, both user B and user C see a tweet given by user A on twitter. If user C investigates that the user B gives a negative comment on this tweet, user C may also give a negative comment because he or she gets affected by user B’s comment. This situation always happened on social network and directly break the SUTVA assumption. Some prior studies solve this problem by assuming these nodes are interference on the unit level but independent on the group level. They made some clusters on nodes at first step, and constructed the propensity score based on group level node. (Hudgens and Halloran 2008, Tchetgen and VanderWeele 2012, Liu L and Hudgens 2014, Arpino et al 2017) Some part of the total treatment effect is coming from the peer effect but not the treatment. Therefore, they extract the peer effect or named indirect causal effect from the total treatment effect to solve this interference problem. Some recent studies are based on the general propensity score method to solve the node level interference problem. (Aronow and Samii 2017, Ogburn and VanderWeele 2017, Forastiere et al 2020, Jackson et al 2020). Aronow and Samii (2017), Ogburn and VanderWeele (2017) discussed the theory of the estimator consider whether a node gets exposed to neighbor nodes or not. After we use the total treatment effect minus the treatment effect, which including the neighbor nodes exposure, we can get the pure causal effect. Forastiere et al (2020) gave both theory and empirical analysis of this approach, and prove that we can get a better result. 
Jackson et al (2020) solve this problem not based on generalized propensity score but based on constructing the propensity score based on the peer influence. They assume the graph is the fully 8 connected directed graph. Their propensity score is still biased because their graph topology is not very flexible. 1 Table 1. Literature review of Propensity Score Matching method under network data Author Model assumption Confounder Propensity score or type Estimator Construction Hirano and yi (0), yi (1) ⊥ t i |𝑒(𝑥𝑖 , 𝑡𝑖 ) Hidden Construct a propensity score Imbenes (2004) Confounder by generalized the un- confoundness Imai and yi (0), yi (1) ⊥ t i |𝑤(𝑥𝑖 ) Hidden Construct a propensity score Ratkovic (2014) where w(. ) is the inverse Confounder by inverse weighting weighting moment moment Huffman and Same with above Hidden Construct a propensity score Gameren Confounder by inverse weighting (2018). moment Sant'Anna et al yi (0), yi (1) ⊥ t i |𝐼(𝑥𝑖 ) Hidden Construct a propensity score (2020) where w(. ) is the integrated Confounder by integrated conditional conditional moment moment Hudgens. and ∑𝑛𝑗=1 𝑌 ∙ 𝐼𝑗 [𝑥𝑖 , 𝑡] Group-level Extract peer effect from total yi (𝑥, 𝑡) = Halloran (2008) ∑𝑛𝑗=1 𝐼𝑗 [𝑥𝑖 , 𝑡] interference causal effect by group where 𝐼[𝑥, 𝑡] is the group received treatment Tchetgen and Same with above Group-level Extract peer effect from total VanderWeele interference causal effect by group (2012) Liu and Same with above Group-level Extract peer effect from total Hudgens (2014) interference causal effect by group 9 Table 1. (cont’d) Aronow and ∑𝑛𝑗=1 𝑌 ∙ 𝐺𝑗 [𝑥𝑖 , 𝑡] Node-level Extract peer effect from total yi (𝑥, 𝑡) = Samii (2017) ∑𝑛𝑗=1 𝐺𝑗 [𝑥𝑖 , 𝑡] interference causal effect by neighbor where 𝐺𝑗 [𝑥𝑖 , 𝑡] is the graph neighbor graph received treatment Ogburn and Same with above Node -level Extract peer effect from total VanderWeele interference causal effect by neighbor (2017). graph Forastiere et al Same with above Node -level Extract peer effect from total (2020) interference causal effect by neighbor graph Jackson et al yi (0), yi (1) ⊥ t i |𝑒𝑖 (𝑥, 𝑔) Node-level Construct a propensity scores (2020) Where g is directed graph interference by including graph 2.2. Representation Learning for Counterfactual Inference Even there are many statisticians tried to construct a better propensity score by including the graph information, it is still not easy to represent the information of the user on the social network very well. In other way, we can use the representation learning method to represent the information of the user. It is an efficient way to represent the complex relationship between different users on the social network. Then, we put the representation learning method into the counterfactual inference framework. This can help to solve the confounding problem. 2.2.1. Kernel Space Matching Method for Counterfactual Inference The propensity score matching method is for matching the treated group and the control group on the one dimension subspace. It uses the logistic model to transform the covariate variables into the propensity scores. If we can find the balanced covariate variables, we can construct the propensity score very well. However, in order to capture the counterfactual inference on the 10 social network, we need to construct a large, nonlinear subspace to balance the treated group and control group. In order to transform the nonlinear, imbalanced covariate variables into the balanced representation space, we need to use the kernel space matching method. 
The advantage for using kernel space matching method is we can use the kernel trick to represent the distance between any two different units from the sample space. (Li and Fu 2017, Zhang et al 2019, Chu et al 2020) The basic logic is using the outcome Y and input X to learn the encoding transformation 𝜙. This can help to transform the covariate X into the representation 𝜙(𝑋). Then, use the decoding transformation 𝜓 to transform the representation 𝜙(𝑋) into the output Y. We can observe the treated group Xtreat and the factual outcome Y1 , the treated group with treatment. If the transformation 𝜙 and 𝜓 can successfully map the X treat into the outcome Y1 , the similar units in the control group Xcontrol should also be mapped into the counterfactual outcome Ŷ1 . Their difference is the treatment effect from the treated group. We can use the same idea to construct the treatment effect from the control group.(Caliendo and Kopeinig 2008, Li and Fu 2017, Kallus 2017) In order to find the similar units from the treated group and the control group, we require the sample of treated group and the control group are identical under the kernel space. We use the distance functions called mean maximum discrepancy and the Wasserstein 1 distance to measure the distance between the distribution of the treated group and the distribution of the control group in the kernel space. We expect their distances are as small as possible. 2.2.2. Deep Representation Learning for Counterfactual Inference If we replace the kernel transformation with the neural network, it becomes easy to compute the appropriate encoding function. The counterfactual Inference problem can be exchange to a 11 domain adaptation problem. If the treated group and the control group are assigned treatment randomly, their distributions in the representation space should be identical. We can use the discrepancy distance to measure it. Therefore, there are two main streams on this study: Some of them focus on constructing a good discrepancy distance estimation (Johansson et al 2018, Yao et al 2018, Yao et al 2019, Sharma 2020). A common way to construct the discrepancy distance are maximum mean discrepancy and Wasserstein 1 distance (Johansson et al 2018, Kallus 2020, Flato et al 2019). Both of them can measure the distance between two distributions in a Non-Euclidean distance topology. Our model follows this method to calculate the distance between the factual domain and the counterfactual domain. Yao et al (2018) and Yao et al (2019) constructed the distance between two sample domains by finding specific connections on different nodes, which gives us higher accuracy but loses some generality. Their studies is based on the network data. Sharma (2020) used the general propensity score to calculate the distance, which can give a similar performance with maximum mean discrepancy. This method is not only for network data but also for other data structures. Some others focus on constructing a good encoder function (Kallus 2020, Du et al 2019, Flato et al 2019, Alaa 2017, Schwab et al 2019). It includes basic graph neural network, convolution network, attention network, and generative network. The last three methods are just advanced methods to compute encoder function quickly. Our model follows the basic graph neural network and compares its performance with convolution network and attention network method. 
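Since maximum mean discrepancy (MMD) recurs throughout this stream of work and in our own loss function later, a small illustration may help. The following sketch (Python with NumPy; the RBF kernel and its bandwidth gamma are assumptions made for illustration, not the exact choice of any cited paper) computes an empirical squared MMD between treated-group and control-group representation vectors.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2(phi_treated, phi_control, gamma=1.0):
    """Biased empirical estimate of the squared maximum mean discrepancy
    between two samples of representation vectors."""
    k_tt = rbf_kernel(phi_treated, phi_treated, gamma)
    k_cc = rbf_kernel(phi_control, phi_control, gamma)
    k_tc = rbf_kernel(phi_treated, phi_control, gamma)
    return k_tt.mean() + k_cc.mean() - 2.0 * k_tc.mean()

# Toy usage: two small sets of 8-dimensional representations.
rng = np.random.default_rng(0)
phi_t = rng.normal(0.0, 1.0, size=(100, 8))   # treated-group embeddings
phi_c = rng.normal(0.3, 1.0, size=(120, 8))   # control-group embeddings
print(mmd2(phi_t, phi_c))  # larger values indicate more distributional imbalance
```

In the representation learning approaches above, a term of this form is added to the training loss so that the encoder is pushed to make the treated and control representations indistinguishable.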
Alaa (2017) and Schwab et al (2019) used a different framework because they directly used propensity score as the dropout network and combine it together with the encoder function. 12 After we build the good encoder function and discrepancy distance function, we can focus on social network data. As we discussed before, the social network has two types of confound problem lead to bias estimation: homophily and peer influence. Guo et al a (2020) and Guo et al b (2020) studied the homophily problem on social network data. They embed the latent homophily variable into the representation function and get a lower bias result. Ma et al (2020) studied the peer influence problem. They used the equation of the whole graph treatment effect minus the equation of peer influence effect to avoid the peer influence bias. 2 Table 2. Literature review of Propensity Score Matching method under network data Author Loss Function Encoder Function Discrepancy Distance Components Measurement Johansson et al Treatment Prediction Deep Neural Kernel Maximum Mean (2018) Loss + Discrepancy Network Discrepancy and Loss Wasserstein 1 distance Yao et al (2018) Deep Neural Position Dependent Deep Network Metric + Middle Point Distance Minimization Yao et al (2019) Deep Neural Similarity Preserve Network Sharma (2020) Graph neural General propensity score network Kallus (2020) Generative Kernel Maximum Mean Network Discrepancy and Wasserstein 1 distance 13 Table 2. (cont’d) Du et al (2019) Treatment Prediction Generative Mutual Information Loss +Discrepancy Network Estimator Loss+ Mutual Information Loss Flato et al (2019) Generative Kernel Maximum Mean Network Discrepancy Alaa (2017) Deep Neural Propensity Dropout Network Network Schwab et al (2019) Deep Neural Propensity Dropout Network Network Guo et al a (2020) Graph Wasserstein 1 Distance Convolution Network Guo et al b (2020) Graph Attention Inverse Propensity Mechanism Scoring Weights Ma et al (2020) Graph Kernel Maximum Mean Convolution Discrepancy and Network Wasserstein 1 Distance 2.2.3. Compare propensity score method with representation learning method Comparing the propensity score method and representation learning method, we can find that if we don’t have rich covariate data, we need to choose the representation learning method. If we have rich covariate data, we should choose the propensity score-based method. The propensity score-based method requires constructing the propensity score to calculates the difference between the treatment group and the control group at first and calculate the treatment effect from 14 the first step. The representation learning method calculate the treatment effect and the difference between treatment group and control group simultaneously. Because there are two types of bias in the network data: hidden confounder and peer influence effect, the propensity score method needs to observe covariate variables as many as we can. We can construct the propensity score from the traditional network theory such as tie level, node level, and graph structure level. It give us more explanation on causal mechanism. But we cannot always expect there is no big difference between the treatment group and the control group. They may have significantly different kinds of Centrality, Density, and Closeness or have significantly different kinds of tie strength or tie directions between nodes. There are a lot of network features we cannot control, which leads to a bias propensity score or a propensity score without enough network information. 
Therefore, using the representation learning method has a big advantage on solving network data because graph topology includes the information of hidden confounders and peer influence. It estimates the treatment effect simultaneously. As long as we can find an appropriate representation function, we can always calculate the treatment effect with the smallest bias. How to make sure we can find an appropriate representation function? An encoder-decoder framework can help to find the right function. This encoder-decoder framework is a general way to classify the data in machine learning research field. It can perfectly extend to the neural network. 15 3 Table 3. Compare Propensity Score Method and Representation Learning Method Pros Cons Propensity score Propensity Score Method has the X should be balanced (no method advantage on learning linear, rich significant differences) between observed covariates data the treatment group and the control group. Propensity Score can give global Propensity Score is difficult to explanation above covariates data compute if covariate sample space is very large Representation X doesn’t need balanced on Representation Learning method learning method treatment group and control group cannot explain covariates but on node-level. Representation Learning is easy to compute if covariate sample space is very large Our model is combing the propensity score method and graph representation method to get both of their benefits. We construct the treatment group and control group from twitter and compare these two methods. Because the Propensity score method requires the covariate variables as rich as possible, we construct the covariate variables includes user-level features, node-level features, and network structure features. We build the propensity dropout network to replace the discrepancy distance function and still maintain the graph neural network to embed the whole graph structure. 16 4 Table 4. Compare our method with other methods Linear relation Nonlinear relation Graphical Nonlinear between X and Y between X and Y relation between X and Y Rich Covariate Propensity Score Covariate Balanced Our model balance based method Propensity Score Graph Representation Function+Discrepancy Distance Measurement+Distance between text information Poor Covariate Representation NN+ Discrepancy GNN+Discrepancy balance Function + Distance Distance Discrepancy Measurement Measurement+Distance Distance between text Measurement information 17 3. MODEL BUILDING The organization of the model building section is as follows: We first demonstrate the research context and summarize the general framework in this study. Second, we review the Propensity Score Matching Method (PSM) based on the Rubin’s Framework, and discuss the challenges of PSM in our Twitter context. Third, we discuss how the representation learning method can help address the challenges PSM faces. Fourth, we discuss how to use Graph Neural Network to address the challenge of large-scale social network. embedding method into a deep neural network framework. We also want to note that while there is a significant body of literature on causal inference using causal graph (Pearl 2009), our framework focus on counterfactual inferences first proposed by Rosenbaum & Rubin (1983). 3.1. Summary In this section, we first briefly introduce the research context. We then briefly summarize the deep learning framework employed in this research. 
Our research application is as follows. We observe treated-group users 𝑢 and control-group users 𝑣 on Twitter. Both can receive tweets, and these tweets include some popular hashtags, which we use as the treatment 𝑡. The users post comments on these tweets, and each comment is marked with a sentiment score. We use these sentiment scores as the outcome 𝑦. We want to test how the popular hashtag influences the sentiment scores.

Figure 1. Concept Model (a treatment group user and a control group user with similar feature data, on the same graph structure, each post a sentiment score on a tweet after exposure to a popular hashtag)

We use the graph representation learning method to calculate the treatment effect. Note that any user 𝑖 ∈ {1, 2, …, 𝑛} is a user on a social network platform. If user 𝑖 retweets a prior tweet posted by user 𝑗, this is defined as a social tie between the two users. We define each user as a node and the social tie between two users as an edge. We can then construct the retweet graph 𝐺 from what we observe. For each user, we observe his or her demographic data (user location, followers count, friends count, etc.), which we denote as 𝑥𝑖, the covariate variables. He or she leaves comments carrying a sentiment score (a number from -1 to 1), and we denote the sentiment of tweet 𝑚 by user 𝑖 as 𝑦𝑖𝑚, the outcome. If we use 𝑞 to represent the topic2 of this tweet, we assume that the sentiment of this tweet is influenced by other tweets with the same topic 𝑞 that user 𝑖 retweeted before she posts tweet 𝑚. Thus we use 𝑡𝑖𝑚 to represent whether user 𝑖 is exposed to other tweets belonging to the same topic 𝑞 before she wrote tweet 𝑚.

2 We operationalize topics of tweets using hashtags (such as #NarendraModi, #BJP, etc.). We could also use NLP to identify tweet topics.

This 𝑡𝑖𝑚 is a binary variable and the treatment we consider in this context. The retweet graph 𝐺 can be transformed into an adjacency matrix 𝐴 that represents the edges and nodes. For each observation, we obtain a tuple ({𝑥𝑖, 𝑦𝑖𝑚, 𝑡𝑖𝑚}, 𝐴). Note that the adjacency matrix 𝐴 is huge for the social network embedded on Twitter. We therefore use a Graph Neural Network (GNN) to generate a representation space (termed the hidden embedding space in the GNN literature) of the original social graph and thus reduce its dimension. We use the encoder function 𝜙 in the figure below to represent this process. We then use the decoder function 𝜓 to reconstruct each node, 𝜓(𝜙(𝑥, 𝐴), 𝑇), back into the original sample space in order to construct labels for the counterfactual sample set. We further find the match between the factual and counterfactual sample sets by minimizing the distance between these two sets in the representation space. We can then calculate the Individual Treatment Effect (ITE) as:

𝐼𝑇𝐸𝑖𝑚 = (𝑦𝑖𝑚(1) − 𝑦𝑖𝑚(0)) | 𝑋, 𝐴

where 𝑦𝑖𝑚(1) is the outcome when the unit receives the hashtag and 𝑦𝑖𝑚(0) is the outcome when the unit does not. We will discuss these steps in more detail in the sections below.

Figure 2. Diagram of the Model (the encoder function 𝜙 maps the covariates and text information into the hidden representation space; the decoder function 𝜓 with a ReLU operator maps the representation and treatment t to the inferred outcome; the GNN loss compares 𝜓(𝜙, t) with y and penalizes covariate imbalance between 𝜙treatment and 𝜙control)
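To make the notation concrete, the following sketch (Python with networkx and numpy; the retweet records and field names are hypothetical) builds the retweet graph 𝐺, its adjacency matrix 𝐴, and one observation tuple ({𝑥𝑖, 𝑦𝑖𝑚, 𝑡𝑖𝑚}, 𝐴).

```python
import networkx as nx
import numpy as np

# Hypothetical retweet records: (retweeting_user, original_user, hashtag, sentiment)
retweets = [
    ("u1", "u2", "#BJP", 0.3),
    ("u3", "u2", "#BJP", -0.2),
    ("u1", "u4", "#AAP", 0.1),
]

# Build the retweet graph G: users are nodes, a retweet defines an edge (social tie).
G = nx.DiGraph()
for user_i, user_j, hashtag, sentiment in retweets:
    G.add_edge(user_i, user_j)

# Adjacency matrix A of the retweet graph.
A = nx.to_numpy_array(G)

# Covariates x_i for one user (e.g., followers count, friends count, location id).
x_i = np.array([1200, 340, 7])

# Outcome y_im: sentiment score of user i's m-th tweet, in [-1, 1].
y_im = 0.3

# Treatment t_im: 1 if user i was exposed to another tweet with the same
# hashtag/topic before posting tweet m, else 0.
t_im = 1

observation = ({"x": x_i, "y": y_im, "t": t_im}, A)
```

In the actual analysis, these tuples are constructed from the 2014 Indian general election tweets described in the data section.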
3.2. Counterfactual Inferences and Propensity Score Matching

In this section, we first review the counterfactual inference framework developed by Rosenbaum & Rubin (1983) and discuss the challenges we face when adapting this framework to our online social media context. We then discuss the popular propensity score matching (PSM) method under Rosenbaum & Rubin's framework, and how PSM is a special case of the representation learning method.

There are n observations denoted {𝑋, 𝑌, 𝑇}, where X denotes the covariates (features, i.e., demographic information), Y denotes the outcome (i.e., tweet sentiment), and T denotes the treatment (i.e., whether user 𝑖 is exposed to others' tweets on the same topic before she posts her 𝑚th tweet). For the 𝑚th tweet by user 𝑖, we obtain the data tuple {𝑥𝑖, 𝑦𝑖𝑚, 𝑡𝑖𝑚}. The treatment can only take the value 0 or 1.3 We therefore denote 𝑦𝑖𝑚(0) as the control group outcome and 𝑦𝑖𝑚(1) as the treated group outcome. We want to highlight that, following Rosenbaum & Rubin's framework, we also need to construct "counterfactual outcomes" for both the control group and the treated group. For example, if we actually observe that user 𝑖 is exposed to a tweet with the same topic before she posts the 𝑚th tweet, this is the "factual outcome"; the counterfactual outcome is the one that does not happen in reality, i.e., the outcome had she not been exposed to any tweet with the same topic before posting the 𝑚th tweet. We define 𝑦𝑖𝑚(1)|𝑡𝑖𝑚 = 1 and 𝑦𝑖𝑚(0)|𝑡𝑖𝑚 = 0 as factual outcomes, and 𝑦𝑖𝑚(1)|𝑡𝑖𝑚 = 0 (i.e., what would happen to a user in the treated group if she received no treatment) and 𝑦𝑖𝑚(0)|𝑡𝑖𝑚 = 1 (i.e., what would happen to a user in the control group if she received the treatment) as counterfactual outcomes. We discuss the construction of counterfactual outcomes in more detail in the following sections.

3 We can easily extend our model to a multiple-treatment scenario in which the treatment represents the number of tweets a user is exposed to.

We note that one of the important assumptions under the Rosenbaum & Rubin (1983) framework is unconfoundedness. Specifically, there is no hidden confounding affecting X, Y, and T:

𝑦𝑖𝑚(0), 𝑦𝑖𝑚(1) ⊥ 𝑡𝑖𝑚 | 𝑥𝑖

We discuss how our method can relax this assumption in the following sections. The Individual Treatment Effect (ITE) can be defined as:

𝐼𝑇𝐸𝑖𝑚 = 𝑦𝑖𝑚(1) − 𝑦𝑖𝑚(0)   (1)

The Average Treatment Effect (ATE) can be defined as:

𝐴𝑇𝐸 = 𝐸[𝑌(1) − 𝑌(0) | 𝑋]   (2)

The gold standard of causal inference is a randomized experiment in which the treated and control groups are assigned randomly. In the real world, to mimic the random assignment of treated and control groups, many scholars use the propensity score matching method to calculate the causal effect (Imai et al 2014, King and Nielsen 2019). However, traditional PSM suffers from two problems. First, PSM requires us to transform the covariate variables 𝑋 into a propensity score via a logistic function: 𝜋𝑖 = Prob(𝑡𝑖 = 1 | 𝑋) = 1 / (1 + 𝑒^(𝑋𝑖𝛽)). PSM works well if the covariate sample space is small and the distributions of the covariates show no significant differences between the treated and control groups. In that case, when PSM generates propensity scores and matches treated and control units based on the distance between them, units with similar propensity scores are also likely to be similar pairs (i.e., close) in the covariate sample space: 𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = |𝜋𝑡𝑟𝑒𝑎𝑡𝑒𝑑 − 𝜋𝑐𝑜𝑛𝑡𝑟𝑜𝑙|. A minimal sketch of this matching step is given below.
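To make this baseline concrete, the following is a minimal sketch (Python with scikit-learn; the simulated covariates, treatment, and outcome are hypothetical, and 1-nearest-neighbor matching is only one of several possible matching rules) of propensity score estimation and matching as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def psm_att(X, t, y):
    """Estimate the average treatment effect on the treated (ATT) by
    1-nearest-neighbor matching on the logistic propensity score."""
    # Step 1: propensity score pi_i = Prob(t_i = 1 | X) from a logistic model.
    pi = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

    treated = np.where(t == 1)[0]
    control = np.where(t == 0)[0]

    effects = []
    for i in treated:
        # Step 2: match on Distance = |pi_treated - pi_control|.
        j = control[np.argmin(np.abs(pi[i] - pi[control]))]
        # Step 3: the outcome difference of the matched pair.
        effects.append(y[i] - y[j])
    return float(np.mean(effects))

# Toy usage with simulated data (for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = 0.5 * t + X[:, 0] + rng.normal(scale=0.1, size=500)
print(psm_att(X, t, y))  # should be close to the simulated effect of 0.5
```

This is the type of baseline our graph representation learning method is compared against in the results section.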
In other words, PSM can be considered as a one-dimensional representation space of the covariate variables 22 sample space, and PSM works well only if there is relatively little information loss compared with the covariate variable sample space. However, social networks on social media platforms like Twitter often have very complex network structures. If we incorporate these complex network structures as the covariate variables in PSM, it leads to a very large covariate variables sample space with significantly different distributions of the covariate variables between treated and control group. As such, the one- dimensional representation space constructed by PSM is unlikely to represent all these covariate variables very well (Li and Fu 2017). Two users may have big difference on covariate variables but small difference on the propensity scores. This leads to include the wrong information problem (Imai et al 2014, King and Nielsen 2019). Second, the unconfoundedness assumption is critical for the performance of PSM. In other words, PSM works well if there is no other unobserved confounding factors that change the relationship between X and Y. Yet this assumption is easily violated in our Twitter context such as homophily—a user posts a tweet favoring Democrat following another positive tweet about Democrat from her friend not because her friend’s tweet has influence on her, but rather because both of them share similar political opinions. Therefore, such confounding factors will lead the propensity score matching method to calculate the bias result. In other words, we cannot construct the balanced propensity score when there exists the confounder (Imai and Ratkovic 2014, King and Nielsen 2019, Sant'Anna et al 2020). Please also refer to appendix 3 for more detailed discussions. Also note that, the matching process in PSM can be considered as the process of constructing the counterfactual outcomes described in Rosenbaum & Rubin (1983), because the matched 23 observations in control groups can be regarded as the counterfactual of corresponding observations in treated group. Overall, the logistic regression transforms the covariates variables into the propensity score space. Then we use these propensity score to represent the observations. The treatment effect is only calculated by these covariates variables. This calculation is efficient under the balanced covariates variables situation. Therefore, we can treat the propensity score matching method as a special case of the representation learning method. It works well under the small balanced covariates variables space and linear relationship between covariates and the outcome. In the next subsection, we will describe the representation learning framework we adapt in this research, and discuss how it can address the two problems PSM faces in our social media context. 3.3. Representation Learning for counterfactual framework If we can observe both factual outcomes and counterfactual outcomes, it becomes a supervised learning problem. However, we cannot observe the counterfactual outcomes, i.e. 𝑦𝑖𝑚 (1)|𝑡𝑖𝑚 = 0 (treated group if there were no treatment) and 𝑦𝑖𝑚 (0)|𝑡𝑖𝑚 = 1 (control group if there were treatment) in previous section. In this study, we use the representation learning method described below to construct these counterfactual outcomes. (Please check the appendix 4) Let‘s take the example as shown in table 1. In the factual sample set, let’s assume the observed covariates are tweets language and location. 
We also observe whether the focal user receives the treatment (whether she is exposed to others' tweets on the same topic, such as #BJP), as well as the corresponding outcome (tweet sentiment). In the counterfactual sample set, we reverse the value of the treatment variable, so that in the first row of Table 5 the treatment value is 1 in the factual sample set and 0 in the counterfactual sample set. If we compare the factual sample set and the counterfactual sample set, their covariates X are the same, yet we only observe the outcome Y for the factual sample set. To construct the outcome Y for the counterfactual sample set, we employ Domain Adaptation (Cortes and Mohri 2014, Long et al 2014): we label the outcomes of the counterfactual sample set using training results based on the factual sample set, in which we do observe the outcome values. To some extent, this process is similar to finding the best-matched units in the factual sample set for the units in the counterfactual sample set.

Table 5. An example of the factual sample set and the counterfactual sample set
Covariates: X = (Location, Language); Treatment: t ∈ {0, 1}

Factual sample set                                   | Counterfactual sample set
X: (Location, Language, Treatment) | Y: Sentiment    | X: (Location, Language, Treatment) | Y: Sentiment
(Delhi, English, 1)       | Y1 = 0.1                 | (Delhi, English, 0)       | Y1 = ?
(Bangalore, English, 1)   | Y1 = 0.3                 | (Bangalore, English, 0)   | Y1 = ?
(Bangalore, Hindi, 0)     | Y0 = 0.4                 | (Bangalore, Hindi, 1)     | Y0 = ?
(Bangalore, Hindi, 0)     | Y0 = -0.2                | (Bangalore, Hindi, 1)     | Y0 = ?
(Bangalore, Bengali, 0)   | Y0 = -0.7                | (Bangalore, Bengali, 1)   | Y0 = ?

Step 1. Construct the representation space

The first step in constructing the counterfactual sample set is to transform the observed covariates into the representation space. The observed covariate sample space in social media studies is often high-dimensional because it involves complex social network structure. To reduce the dimension of the covariate sample space, PSM generates propensity scores, which can be considered a one-dimensional representation space of the covariates. Recall from the previous section that the first problem of PSM is that it loses a lot of information in the process of calculating propensity scores. To overcome this challenge, we use an encoder function φ to map the covariates into the representation space φ(·). For the covariates X of an observation, we denote the corresponding vector φ(X) as its representation.

Notice that the use of a representation space allows us to lose much less information, for two reasons. First, the constructed representation space is allowed to be multi-dimensional, in contrast to the one-dimensional representation space based on the propensity score. Second, we can also minimize the information loss by using a deep learning structure with multiple layers of non-linear functions. Compared with the logistic model that PSM uses to calculate the propensity score, the representation space generated by a deep learning structure keeps more information from the original sample space (Li and Fu 2017, Johansson et al 2020). The inclusion of more information is helpful for finding matching units between the factual and counterfactual sample sets and, consequently, for constructing labels for the counterfactual sample set.
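As one way to operationalize Step 1, the sketch below (a minimal PyTorch illustration; the class name and layer sizes are assumptions, not the exact architecture used in this study) maps the covariates into a multi-dimensional representation φ(X) through stacked non-linear layers, in contrast to the one-dimensional propensity score.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encoder phi: maps covariates X to a multi-dimensional representation phi(X)."""
    def __init__(self, in_dim, rep_dim=30, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, rep_dim),  # rep_dim-dimensional representation space
        )

    def forward(self, x):
        return self.net(x)

# Example usage: phi = Encoder(in_dim=10); rep = phi(torch.randn(128, 10))
```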
Step 2. Distance in the representation space

In Step 2, we minimize the distance between the representations of the factual sample set and the representations of the counterfactual sample set in the representation space φ(X). Because we now use a multi-dimensional representation space, we can use the Maximum Mean Discrepancy or the Wasserstein distance to measure the distance between the two sets in the representation space, denoted Distance(X_Factual, X_CounterFactual). These two types of distances allow us to retain more information than the distance used in PSM; thus, matching based on these two distances helps us find better matches in the original covariate sample space. We provide more details of these two distance measures in Appendices 5 and 6.

Step 3. Use the factual set to learn the labels of the counterfactual set

Keep in mind that, to estimate the causal effect, we still need to know the label (i.e., the outcome Y) for the counterfactual sample set. We use the decoder function ψ(·) as the learned causal relationship among the representation space φ(X), the treatment T, and the outcome Y. We use ReLU as the activation function in ψ(·) in this study. Then, together with the distance function in Step 2, we define the loss function as:

L = L_MSE(ψ(φ(X), T), Y) + L_Distance(φ(X_Factual), φ(X_CounterFactual)) + λ·R(h)    (3)

By minimizing this loss function with our deep learning framework, we learn both the encoder function φ(·) and the decoder function ψ(·). Note that the first term, L_MSE(ψ(φ(X), T), Y), is the error between ψ(φ(X), T) and Y. The second term, L_Distance(φ(X_Factual), φ(X_CounterFactual)), is the representation distance from Step 2. The third term, λ·R(h), is a regularization term that can capture other similarities between factual and counterfactual units. After we learn φ(·) and ψ(·), we can calculate the labels of the counterfactual sample set.

Besides calculating the distance between the factual group and the counterfactual group in terms of demographic information, we also include text information in the loss function. By adding the distance between the factual group and the counterfactual group in terms of text information, we obtain:

L = L_MSE(ψ(φ(X), T), Y) + L_Distance(φ(X_Factual), φ(X_CounterFactual)) + L_Distance(topics(X_Factual), topics(X_Counterfactual)) + λ·R(h)    (4)

In addition to calculating the labels of the counterfactual sample set, the decoder function also helps us learn the nonlinear relationship among covariates, treatment, and outcome, especially in the presence of social network data. While an important assumption in the Rosenbaum & Rubin (1983) framework is unconfoundedness (i.e., there is no confounder C affecting the relationship between X, Y, and T), this assumption can be difficult to satisfy in our social network context because of confounding factors such as homophily. In this study, by using a deep learning method to infer the nonlinear causal relationship ψ(·) among covariates, treatment, and outcome, we can significantly reduce the bias of the estimated treatment effect.

Formally, the model of the representation learning is as follows: we define a hidden space ℋ and two functions φ and ψ. The first function φ maps X into ℋ, and the second function ψ maps ℋ × T into Y. Then, by minimizing the loss function above, we learn the appropriate representation functions φ and ψ to obtain the labels of the counterfactual sample set: Ŷ = ψ(φ(X), T).
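The sketch below (a minimal PyTorch illustration, assuming the hypothetical Encoder above; the decoder architecture and the use of the difference of representation means as a simple stand-in for the MMD/Wasserstein term are my assumptions) shows the structure of the objective in equation (3): a factual prediction loss, a distributional distance between the treated and control representations, and a regularization term.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder psi: predicts the outcome from (phi(X), T)."""
    def __init__(self, rep_dim=30, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rep_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, rep, t):
        return self.net(torch.cat([rep, t.unsqueeze(1)], dim=1)).squeeze(1)

def representation_distance(rep_a, rep_b):
    # Simple distributional distance: norm of the difference of the two group means
    # (a linear-kernel MMD); a Wasserstein distance could be used instead.
    return torch.norm(rep_a.mean(dim=0) - rep_b.mean(dim=0))

def total_loss(phi, psi, x, t, y, lam=1e-3):
    rep = phi(x)
    l_mse = nn.functional.mse_loss(psi(rep, t), y)               # L_MSE term
    l_dist = representation_distance(rep[t == 1], rep[t == 0])   # L_Distance term (treated vs. control)
    l_reg = lam * sum(p.pow(2).sum() for p in psi.parameters())  # lambda * R(h) term
    return l_mse + l_dist + l_reg
```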
Figure 3. Diagram of the counterfactual inference, Step 1: the sample space of features X (location, language, etc.) is mapped by the encoder function φ into the representation space φ(X); the decoder (classification) function ψ then produces ψ(φ(x_i), t_i), and the loss is computed between ψ(φ(x_i), t_i) and y_i.

In order to operationalize the three steps above efficiently, we use the method suggested by Johansson et al (2016). Note that when we minimize the loss function in equation (3), this also produces a match between factual and counterfactual units. [Footnote 4: The treated-group users with treatment, y_i(1) | t_i = 1 (in the factual sample set), are matched with the treated-group users without treatment (in the counterfactual sample set), y_i(1) | t_i = 0. In the same way, the control-group users without treatment, y_i(0) | t_i = 0, are matched with the control-group users with treatment, y_i(0) | t_i = 1.] Meanwhile, after we learn the appropriate representation functions ψ(·) and φ(·), we can use ψ(φ(X_treated), T = 0) to calculate the label of each unit in the counterfactual sample set simply by flipping the value of the treatment T, denoted ŷ_im(1) | t_im = 0. Similarly, we can use ψ(φ(X_control), T = 1) to calculate the label of each unit in the counterfactual sample set, denoted ŷ_im(0) | t_im = 1.

Figure 4. Diagram of the counterfactual inference, Step 2: the features X of the treated group (location, language, etc.) are mapped by the encoder φ from the sample space into the representation space φ(X_treated); the decoder ψ then yields the factual outcome of the treated group, ψ(φ(x_treated), t_i = 1) with y_i, and the potential outcome of the treated group, ψ(φ(x_treated), t_i = 0) with ŷ_i.

Thus, we can write the individual treatment effect as:

ITE^_im = { (y_im(1) | t_im = 1) − (ŷ_im(1) | t_im = 0),  if t_im = 1
          { (ŷ_im(0) | t_im = 1) − (y_im(0) | t_im = 0),  if t_im = 0

        = { (y_im(1) | t_im = 1) − ψ(φ(x_im), t_im = 0),  if t_im = 1    (5)
          { ψ(φ(x_im), t_im = 1) − (y_im(0) | t_im = 0),  if t_im = 0

3.4. Deep Graph Neural Network

We described the causal inference method in Section 3.3, where we minimize the loss function in equation (3). In this section, we discuss in more detail how to obtain the encoder function φ(·) in equation (3).

An individual's position within the social network on Twitter plays an important role in predicting her tweet sentiment. Yet one of the challenges in leveraging social network information on Twitter is the network's large scale. Previous literature using Twitter network information mainly relies on structured network characteristics such as degree centrality, tie strength, and structural holes (Bonacich 1987, Borgatti 2005, Krivitsky et al 2009, Song et al 2019, Rishika and Ramaprasad 2019, Mousavi and Gu 2019). While these network characteristics can help address the large-scale problem in Twitter-related studies, deciding which network characteristics to use often requires context-specific feature engineering. In our study, by using a Graph Neural Network (GNN) structure, we can generate a representation (also called an embedding) of the Twitter network that reduces the dimension of the network structure while keeping most of the network information. This allows us to address the challenge of large-scale networks. As such, context-specific feature engineering is no longer needed, and the information loss due to feature engineering can also be avoided: it is difficult to include all available network characteristics in a traditional feature-engineering machine learning algorithm, and some network characteristics potentially used for feature engineering, such as betweenness centrality, are very computationally expensive to calculate for large-scale networks.
Note that GNN is a deep node embedding technique, an improvement over shallow node embedding techniques. Please refer to Appendix G for the details of one shallow node embedding technique.

Figure 5. Diagram of the Graph Neural Network: the input features on each node are aggregated from neighbors across Layer-0, Layer-1, and Layer-2 toward the target node.

In our GNN framework, the network can be described by the tuple {V, E, X}, where V represents the vertices, E represents the edges, and X are the node features, such as location and tweet language. For a network with one million nodes, the corresponding adjacency matrix is one million by one million. To reduce the dimension of this huge matrix, we use a GNN to generate a representation (also called an embedding) for each node with a much smaller dimension (for example, a vector of size 30 per node), while keeping most of the network information associated with the focal node.

The intuition of the GNN is that we aggregate the neighbor information of each focal node using a neural network. Note that the neighbors here include not only the immediate neighbors of the focal node but also neighbors connected through multiple edges. Consider node 1 in the figure above. Node 1's immediate neighbors are nodes 2, 3, 7, and 8. Node 2 itself also has neighbors 1, 3, 4, and 9. Consider the simple case in which we only include neighbors that are at most two edges away to calculate the embedding of node 1. To calculate the embedding of node 1, we need to aggregate the information of nodes 2, 3, 7, and 8. This requires us to calculate the embeddings of nodes 2, 3, 7, and 8 (Layer-1 in the figure above) before we calculate the embedding of node 1. As such, we first calculate the node 2 embedding using the embeddings of nodes 1, 3, 4, and 9 (Layer-0 in the figure above), the node 3 embedding using the embeddings of nodes 1, 2, and 6, and so on. Then we can calculate the embedding of node 1 using the embeddings of nodes 2, 3, 7, and 8.

More formally, we use h to denote the hidden embeddings of the neighbors. The number of layers of hidden embeddings h is denoted K, and K can be any value (in this example, K = 2). Note that the embeddings in Layer-0 are one-hot encodings of the node features. The final layer of hidden embeddings (Layer K) is the resulting embedding of the focal node that we want to calculate. Formally, we use h_v^k to represent the hidden embedding of a node v in Layer k ∈ {1, 2, ..., K}. [Footnote 5: For example, to calculate the embedding of node 1 in the previous example, we need to aggregate the information of nodes 2, 3, 7, and 8 in Layer-1; in other words, we need to calculate h_2^1, h_3^1, h_7^1, h_8^1.] We iteratively update this hidden embedding h_v^k by incorporating the aggregated information from its immediate neighbors, as well as the information of the node itself from the previous iteration. We use m_v^k to represent the aggregated information from the node's immediate neighbors. In other words, we can write the l-th iteration of this updating process as:

(h_v^k)^l = update((h_v^k)^(l−1), (m_v^k)^(l−1)) = update(W_self^T (h_v^k)^(l−1) + W_Neighbor^T (m_v^k)^(l−1))    (5)

Here we use the sigmoid function as the update function. After training the model for L iterations, we obtain the final state z as the final representation: z_v = h_v^L. This final state representation z_v is exactly the φ(v) of the encoder function in Section 3.3.
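A minimal PyTorch sketch of one such layer follows (hypothetical; it uses a dense adjacency matrix for readability and a ReLU non-linearity, and it takes the neighbor message m_v to be the simple mean of neighbor embeddings that equation (6) below formalizes).

```python
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    """One message-passing layer: h_v <- sigma(W_self h_v + W_neigh * mean_{u in N(v)} h_u)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        # adj: (n, n) 0/1 adjacency matrix; divide by the degree to take the neighbor mean.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = (adj @ h) / deg          # m_v: aggregated neighbor message
        return torch.relu(self.w_self(h) + self.w_neigh(neighbor_mean))

# Stacking K such layers yields the final embedding z_v = h_v^K, used as phi(v) in Section 3.3.
```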
To calculate m_v^k in equation (5), we aggregate information from node v's immediate neighbors following the aggregation equation (6) below. N(v) denotes the set of immediate neighbors of node v, and μ ∈ N(v) is an immediate neighbor of node v. W_k and B_k are, respectively, the weight and bias of the neural network at layer k that we need to estimate. We put them together inside a ReLU function σ(·). Note that the hidden embedding of node v in layer k depends on the hidden embeddings of v's immediate neighbors in the previous layer, h_μ^(k−1):

h_v^k = σ( W_k Σ_{μ∈N(v)} h_μ^(k−1) / |N(v)| + B_k h_v^(k−1) )    (6)

We also want to note that the three popular GNN methods, GCN (Kipf and Welling 2017), GAT (Velickovic et al 2018), and GraphSAGE (Hamilton et al 2017), are variations of GNN with different aggregate and update functions.

4. DATA AND RESULT

4.1. Data Description: Indian General Election Twitter 2014

Our dataset is the Indian General Election data from Twitter between January 1st and May 16th, 2014. It contains all tweets related to the Indian General Election: 3.2 million tweets from 2,187,334 users. After data cleaning, we obtain the tweet sentiment Y, the treatment T (i.e., whether user i is exposed to others' tweets on the same topic before she posts her m-th tweet), the covariates X such as user location and user followers count, and the adjacency matrix A.

The tweet sentiment Y is a numerical value from -1 to 1; negative/positive values reflect users' negative/positive attitudes within the tweets. The treatment T is a binary indicator of whether user i receives the treatment, i.e., whether she was exposed to other tweets on the same topic. In this research, we use hashtags, such as #NarendraModi, #BJP, and #AAP, to categorize the topics of tweets. Note that we only include hashtags that convey political-preference information in this study.

The covariates X include two types of data: network-theory-based variables and demographic variables. The network-theory-based variables are features we can observe from the social network; they reflect how users interact with their neighbors, how their neighbors affect them, and so on. These variables include: 1) user followers count, the number of followers of the focal user; 2) friends count, the number of users with whom the focal user mutually follows; and 3) listed count, the number of users who have added the focal user to a list.

The demographic variables include: 1) user location, which reflects where the user posts tweets; 2) status language, the language in which the user posts on Twitter; and 3) status source, the channel through which the user posts tweets (e.g., web, iPhone).

We also construct an adjacency matrix A_t̃, a large sparse matrix representing the retweet network between users on Twitter before period t̃. The element A_ijt̃ of this matrix indicates whether user i retweeted a tweet from user j before period t̃. While previous literature uses network-based characteristics as covariates in the estimation model, in this study we use a GNN to extract graph embedding features. We argue that these graph embedding features can significantly improve the predictive performance, as they allow us to capture more complicated network features that would otherwise be omitted by network-theory-based characteristics.
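A minimal sketch (hypothetical function and field names) of how such a sparse retweet adjacency matrix could be assembled with SciPy from a list of (retweeting user, original author, time) records:

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_retweet_adjacency(retweets, n_users, cutoff):
    """A[i, j] = 1 if user i retweeted user j before period `cutoff`.

    `retweets` is an iterable of (user_i, user_j, time) tuples, with user ids
    assumed to be integer indices in [0, n_users).
    """
    rows, cols = [], []
    for user_i, user_j, time in retweets:
        if time < cutoff:
            rows.append(user_i)
            cols.append(user_j)
    data = np.ones(len(rows), dtype=np.int8)
    adj = csr_matrix((data, (rows, cols)), shape=(n_users, n_users))
    adj.data = np.minimum(adj.data, 1)  # collapse repeated retweets into a 0/1 indicator
    return adj
```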
Table 6. Data description
Variable             | Description
User Location        | Delhi, Bangalore, etc.; encoded as a categorical value
User Followers Count | Number of followers of the user
User Friends Count   | Number of friends of the user
User Listed Count    | Number of lists the user has been added to
Status Language      | English, Hindi, Bengali, etc.; encoded as a categorical value
Status Source        | Web, iPhone, RoundTeam, etc.; encoded as a categorical value
Status Content       | Text
Status Time          | Day and time
Hashtag              | #NarendraModiForPM, #BJP, #ANI, etc.
Sentiment Score      | Number from -1 to 1

4.2. Semi-Synthetic Data

In order to evaluate the performance of the model, we need to compare our model with other traditional methods based on the differences between the estimated results and the ground-truth results. Following previous literature (Johansson 2016, Louizos 2017, Veitch et al 2019), we construct semi-synthetic data based on what we can observe from the real world. We need to construct a factual set, which includes the treated group with the treatment and the control group without the treatment, and a counterfactual set, which includes the control group with the treatment and the treated group without the treatment. We denote y_im(0) as the control-group outcome and y_im(1) as the treated-group outcome. Previous literature also suggests that the sentiment of the focal user's tweet is influenced both by neighbors' demographic features and by the content of their tweets (Veitch et al 2019, Ma et al 2020, Guo et al 2020).

Specifically, we follow the steps below to construct the semi-synthetic dataset.

In Step 1, we calculate the distribution of sentiments for user i, denoted sentiment_{i,k}, given that the focal user was exposed to tweets on the same topic k before. Then we calculate the sentiment distributions in the treated group and the control group: sentiment_k^treated denotes the sentiment distribution in the treated group given exposure to tweets on the same topic k, and sentiment_k^control denotes the sentiment distribution in the control group given exposure to tweets on the same topic k.

In Step 2, we calculate the sentiment distribution of focal user i's neighbors j. We use Σ_{j∈N(i)} sentiment_{j,k}^treated and Σ_{j∈N(i)} sentiment_{j,k}^control to represent the sentiment distribution of the focal user's neighbors in the treated group and the control group, respectively. Then we denote the influence of the user's and her neighbors' opinions as a_im^0 and a_im^1. Intuitively, a_im^0 and a_im^1 are two vectors representing the probability that the m-th tweet by user i belongs to the control group and the treated group, respectively; their components a_im,k^0 and a_im,k^1 are the probability that the corresponding tweet belongs to topic k:

a_im,k^1 = sentiment_{i,k} · (sentiment_k^treated)^c + Σ_{j∈N(i)} sentiment_{j,k}^treated · (sentiment_k^treated)^c

a_im,k^0 = sentiment_{i,k} · (sentiment_k^control)^c + Σ_{j∈N(i)} sentiment_{j,k}^control · (sentiment_k^control)^c

where N(i) is the set of focal user i's neighbors, (sentiment_k^treated)^c is the centroid of the sentiment distribution in the treated group, and (sentiment_k^control)^c is the centroid of the sentiment distribution in the control group.

In Step 3, we calculate the outcomes of the treated group and the control group as follows:

y_i(t_i) = (1 − t_i)·a_i^0 + t_i·a_i^1 + ε,  where ε ~ N(0, 1)

and the treatment t_i follows a Bernoulli distribution. If t_i = 1, the outcome of the treated group is y_i(1) = a_i^1 + ε; if t_i = 0, the outcome of the control group is y_i(0) = a_i^0 + ε.
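A minimal NumPy sketch of Step 3 (a hypothetical helper; it assumes the neighbor-based scores a^0 and a^1 from Steps 1–2 have already been computed and that the treatment is a Bernoulli draw): outcomes follow y_i(t_i) = (1 − t_i)·a_i^0 + t_i·a_i^1 + ε, and both potential outcomes are kept as ground truth.

```python
import numpy as np

def simulate_outcomes(a0, a1, p_treat=0.5, seed=0):
    """Generate semi-synthetic treatments, factual outcomes, and ground-truth potential outcomes."""
    rng = np.random.default_rng(seed)
    t = rng.binomial(1, p_treat, size=len(a0))     # treatment assignment t_i
    eps = rng.normal(0.0, 1.0, size=len(a0))       # epsilon ~ N(0, 1)
    y = (1 - t) * a0 + t * a1 + eps                # observed (factual) outcome
    y0, y1 = a0 + eps, a1 + eps                    # ground-truth potential outcomes
    return t, y, y0, y1
```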
Combining the treated-group outcomes and the control-group outcomes, we can create the semi-synthetic dataset for estimation.

Recall that the Average Treatment Effect is:

τ_ATE = E[Y(1) − Y(0) | X, A] = (1/n) Σ_{i=1}^{n} [Y_i(1) − Y_i(0) | X, A]

The outcomes Y(1) and Y(0) are what we observe from the data. The estimated Average Treatment Effect is:

τ̂_ATE = E[Ŷ(1) − Ŷ(0) | X, A] = (1/n) Σ_{i=1}^{n} [Ŷ_i(1) − Ŷ_i(0) | X, A]    (11)

The estimated outcomes Ŷ(1) and Ŷ(0) are calculated by the graph neural network. We choose the mean absolute error (MAE) as one evaluation metric:

ε_MAE = |τ_ATE − τ̂_ATE|    (12)

Another common evaluation metric is the root mean squared error (RMSE):

ε_RMSE = sqrt( (1/n) Σ_{i=1}^{n} (τ_ATE − τ̂_ATE)^2 )    (13)

4.3. Model Performance

As described in more detail in Section 3, compared with traditional methods our model makes two improvements in constructing a balanced dataset. First, we improve the balance of the dataset because we now incorporate complicated network features generated by the GNN that are otherwise omitted by traditional network-theory-based covariates. Second, we use the representation learning method to match observations in the treated and control groups in a high-dimensional space, in contrast to traditional methods such as PSM, in which only a one-dimensional statistic (the propensity score) is used for matching. We use the analysis in Section 4.3.1 below to demonstrate how the GNN improves the balance of the dataset, and Section 4.3.2 to demonstrate how representation learning improves the estimation of the treatment effect.

4.3.1. Covariate Balance Improvements from GNN

As discussed in Section 3, the balance between treated- and control-group observations critically depends on the covariate variables. Following prior studies (Ho et al 2007, Austin and Peter 2009, Heller et al 2010, Ali et al 2014, Imai et al 2014), we first conduct the balance test and match observations in the treated and control groups using network-theory-based features only (the balance test is based on nearest neighbor matching). We then conduct the balance test and match observations in the treated and control groups after further incorporating the network embeddings generated by the GNN. We demonstrate the covariate balance improvement by randomly sampling 10,000, 20,000, 40,000, 80,000, and 100,000 users. We provide the details of these results in Appendix 8. We observe that, by incorporating the network embeddings, we can generally improve the balance of the dataset. In particular, we present the balance test results based on propensity scores from the 100,000-user sample in Table 7 below, together with the balance test result for the original imbalanced dataset.

We focus on three balance measures. First, the Standardized Mean Difference (Std. Mean Diff.) measures the difference between the mean values of the two groups. Second, the Variance Ratio (Var. Ratio) measures the difference between the variances of the two groups. Third, the Mean Empirical Cumulative Density Function Difference (eCDF Mean Diff.) measures the difference between the cumulative distributions of the two groups. Criteria: the treated and control groups are considered balanced if the standardized mean difference is close to zero, the variance ratio is close to one, and the eCDF mean difference is close to zero.
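A minimal NumPy sketch of these three diagnostics for a single covariate (a hypothetical helper, not the routine actually used to produce Table 7):

```python
import numpy as np

def balance_diagnostics(x_treated, x_control):
    """Standardized mean difference, variance ratio, and mean eCDF difference."""
    # Standardized mean difference: values near 0 indicate balance.
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    smd = (x_treated.mean() - x_control.mean()) / pooled_sd

    # Variance ratio: values near 1 indicate balance.
    var_ratio = x_treated.var(ddof=1) / x_control.var(ddof=1)

    # Mean empirical-CDF difference: values near 0 indicate balance.
    grid = np.sort(np.concatenate([x_treated, x_control]))
    ecdf_t = np.searchsorted(np.sort(x_treated), grid, side="right") / len(x_treated)
    ecdf_c = np.searchsorted(np.sort(x_control), grid, side="right") / len(x_control)
    ecdf_mean_diff = float(np.abs(ecdf_t - ecdf_c).mean())

    return smd, var_ratio, ecdf_mean_diff
```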
First, both the dataset matched on network-theory covariates and the dataset matched using GNN network embeddings significantly improve the balance of the original imbalanced dataset. More importantly, by incorporating the GNN-generated network embeddings, we can further improve the balance of the dataset along all three dimensions (Standardized Mean Difference, Variance Ratio, and eCDF Mean Difference).

The details of the balance test for each network-theory-based feature and each GNN embedding feature, including datasets with different user sample sizes and balance results for each covariate, are available in Appendix 8, Tables A.3 to A.6. As shown in Table A.3, balanced covariate measures mean that the treated group and the control group have no significant difference on each covariate feature. We list a balanced covariate comparison (without graph embedding features) in Table A.3; there is no large difference in mean value or standard error. We also list an imbalanced covariate comparison (without graph embedding features) in Table A.4; there is a large difference in mean value and standard error. We will show that deep graph representation learning is robust under both the balanced and the imbalanced covariate situations. For Tables A.5 and A.6, we extract graph embedding features using the GNN method, so that more graph embedding features are available when we compare the covariates of the treated and control groups. Table A.5 shows no large difference in mean value and standard error, while Table A.6 shows a large difference in mean value and standard error.

The details of the propensity score balance improvement at different network scales are shown in Appendix Tables A.7 to A.11. In most cases, the balance test on the dataset with GNN embedding features yields better performance, which demonstrates that our method can improve the balance of the dataset. Tables A.7 and A.8 show how the network-theory-based features are distributed across the treated and control groups. Tables A.9, A.10, and A.11 show how the network-theory-based features and the graph embedding features are distributed across the treated and control groups. We summarize the results from Tables A.7 to A.11 in Table 7. As discussed in Section 3.2, the propensity score is a representation of the covariate variables; therefore, we compare four different datasets using their propensity scores and check how the balance improves. Table 7 shows that the dataset including the GNN features, after nearest neighbor matching, gives the best balance compared with the other three cases.

Table 7. Propensity score balance improvement (100k users)
                                     | Std. Mean Diff. | Var. Ratio | eCDF Mean Diff.
Original imbalanced dataset          | 0.3845          | 0.3994     | 0.0658
After nearest neighbor matching      | 0.0004          | 1.0022     | 0.0001
Imbalanced dataset with GNN          | 0.5915          | 0.6630     | 0.1451
Nearest neighbor matching with GNN   | 0.0002          | 1.0009     | 0.0000

4.3.2. Treatment Effect Improvements from Representation Learning

To demonstrate the performance of our proposed model, the analyses in this section are based on the semi-synthetic data, as it provides a ground truth against which we can compare.
The results suggest that the GNN embeddings can significantly improve the data balance, and that representation learning helps identify the treatment effect better than the benchmark methods. To demonstrate the performance improvement from representation learning in our method, we replace the representation learning component of our model with three benchmark models. This allows us to inspect how much representation learning improves the resulting estimates. The three traditional methods we use in place of representation learning are:

Propensity Score Matching: this method calculates the propensity scores of observation i in the treatment group and observation j in the control group. After finding matched pairs in terms of propensity score, we calculate the treatment effect using the formula below:

τ̂_ATE^PS = [ Σ_{i: t_i=1} (y_i − y_j) − Σ_{i: t_i=0} (y_j − y_i) ] / n

Inverse Probability Treatment Weighting: after calculating the propensity scores, we can further calculate the probability of receiving treatment for each observation and use this probability as a weight to mimic a randomized controlled trial. The weight is:

w_i = t_i / P(t_i | x_i) + (1 − t_i) / (1 − P(t_i | x_i))

where t_i = 1 if individual i receives the treatment and P(t_i | x_i) is the probability of receiving treatment conditional on the covariates x_i. We can then use this weight to calculate the treatment effect:

τ̂_ATE^IPW = (1/n_{t_i=1}) Σ_{i: t_i=1} w_i y_i − (1/n_{t_i=0}) Σ_{i: t_i=0} w_i y_i

where n_{t_i=1} and n_{t_i=0} are the numbers of observations in the treatment and control groups, respectively.

Doubly Robust Estimation: intuitively, Doubly Robust Estimation improves Inverse Probability Treatment Weighting by incorporating the linear relationships between the covariates and the outcome. As discussed in Sections 3.2 and 3.3, even if the propensity score is constructed correctly, we may still obtain a biased estimator; therefore, we also incorporate the linear regression into the estimator:

τ̂_ATE^DRE = (1/n) Σ_{i=1}^{n} { [ y_i t_i / P̂(t_i | x_i) − ỹ_i^1 (t_i − P̂(t_i | x_i)) / P̂(t_i | x_i) ] − [ y_i (1 − t_i) / (1 − P̂(t_i | x_i)) − ỹ_i^0 (t_i − P̂(t_i | x_i)) / (1 − P̂(t_i | x_i)) ] }

where ỹ_i^1 and ỹ_i^0 are the estimated potential outcomes of observation i.

In addition, we also want to examine how network size and covariate balance affect the performance of identifying the treatment effect with representation learning. We separate the dataset into a two-by-two matrix: small (10,000 users) vs. large (100,000 users) networks, and balanced vs. imbalanced data. We also note that because the models are estimated on the semi-synthetic data described in Section 4.2, this data provides the ground truth against which the treatment effects estimated by the different models can be compared.
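A minimal scikit-learn/NumPy sketch of the IPW and doubly robust benchmark estimators described above (a hypothetical helper; it uses a logistic model for the propensity score, linear outcome regressions, and the standard 1/n normalization, so it may differ slightly from the exact formulas and implementations used for the benchmarks):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def ipw_and_dr_ate(X, t, y):
    """Inverse-probability-weighted and doubly robust ATE estimates."""
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]  # P(t=1 | x)

    # IPW: weight each observed outcome by the inverse probability of its own treatment arm.
    ate_ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))

    # Outcome regressions fitted separately on treated and control observations.
    mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
    mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)

    # Doubly robust (AIPW): outcome models plus an inverse-probability correction.
    ate_dr = np.mean(mu1 - mu0
                     + t * (y - mu1) / e
                     - (1 - t) * (y - mu0) / (1 - e))
    return ate_ipw, ate_dr
```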
We list the results in Tables 8 and 9. Our deep representation learning (DRL) method outperforms all three benchmark models on the small-scale network (10,000 users). In addition, even for the imbalanced dataset, our DRL method still outperforms the benchmark models. This is particularly useful in contexts where many unobserved covariates could affect the balance of the matched dataset. Table 8 shows that, except for the in-sample MAE, the deep representation learning method reduces the bias on the synthetic small, balanced-covariates dataset in terms of both MAE and RMSE.

Table 8. Comparison of GNN methods with propensity-score-based methods on treatment effect, small dataset with balanced covariate measures (without GNN embeddings), on synthetic data
Small dataset, balanced covariates | MAE (in sample) | RMSE (in sample) | MAE (out of sample) | RMSE (out of sample)
Propensity Score Estimator | 0.024 | 0.023 | 0.026 | 0.029
IPW Estimator              | 0.027 | 0.022 | 0.029 | 0.026
Doubly Robust Estimator    | 0.041 | 0.037 | 0.046 | 0.040
DRL                        | 0.026 | 0.021 | 0.024 | 0.026

Table 9 shows that the deep representation learning method reduces the bias on the synthetic small, imbalanced-covariates dataset in terms of both MAE and RMSE.

Table 9. Comparison of GNN methods with propensity-score-based methods on treatment effect, small dataset with imbalanced covariate measures (without GNN embeddings), on synthetic data
Small dataset, imbalanced covariates | MAE (in sample) | RMSE (in sample) | MAE (out of sample) | RMSE (out of sample)
Propensity Score Estimator | 0.018 | 0.019 | 0.023 | 0.021
IPW Estimator              | 0.012 | 0.018 | 0.018 | 0.021
Doubly Robust Estimator    | 0.011 | 0.014 | 0.019 | 0.018
DRL                        | 0.009 | 0.007 | 0.014 | 0.016

We list the results for the large network in Tables 10 and 11. Our deep representation learning (DRL) method outperforms all three benchmark models on the large-scale network (100,000 users). In addition, even for the imbalanced dataset, our DRL method still outperforms the benchmark models. This is particularly useful in contexts where many unobserved covariates could affect the balance of the matched dataset. Table 10 shows that the deep representation learning method reduces the bias on the big, balanced-covariates dataset in terms of both MAE and RMSE.

Table 10. Comparison of GNN methods with propensity-score-based methods on treatment effect, big dataset with balanced covariate measures (without GNN embeddings), on synthetic data
Big dataset, balanced covariates | MAE (in sample) | RMSE (in sample) | MAE (out of sample) | RMSE (out of sample)
Propensity Score Estimator | 0.035 | 0.032 | 0.041 | 0.037
IPW Estimator              | 0.028 | 0.032 | 0.031 | 0.035
Doubly Robust Estimator    | 0.029 | 0.026 | 0.033 | 0.029
DRL                        | 0.024 | 0.024 | 0.028 | 0.025

Table 11 shows that the deep representation learning method reduces the bias on the big, imbalanced-covariates dataset in terms of both MAE and RMSE.

Table 11. Comparison of GNN methods with propensity-score-based methods on treatment effect, big dataset with imbalanced covariate measures (without GNN embeddings), on synthetic data
Big dataset, imbalanced covariates | MAE (in sample) | RMSE (in sample) | MAE (out of sample) | RMSE (out of sample)
Propensity Score Estimator | 0.048 | 0.045 | 0.053 | 0.046
IPW Estimator              | 0.041 | 0.037 | 0.044 | 0.039
Doubly Robust Estimator    | 0.038 | 0.034 | 0.039 | 0.037
DRL                        | 0.035 | 0.032 | 0.038 | 0.036

4.4. Empirical Data Results

In this section, instead of using the semi-synthetic data, we estimate the actual dataset with our proposed model and report the results in the tables below. Please note that we no longer have ground-truth data. What we can do is use the dataset with balanced covariates and 100,000 users, and compare our deep representation learning method with the three other methods on this dataset. Recall that the Average Treatment Effect is:

τ_ATE = E[Y(1) − Y(0) | X, A] = (1/n) Σ_{i=1}^{n} [Y_i(1) − Y_i(0) | X, A]
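One way the model-based ATE reported in Tables 12 and 13 could be computed is by pushing every user's covariates through the fitted encoder and decoder with the treatment flipped on and off; a minimal PyTorch sketch (assuming the hypothetical Encoder and Decoder modules sketched in Section 3.3) is:

```python
import torch

@torch.no_grad()
def model_based_ate(phi, psi, x):
    """ATE = mean over users of psi(phi(x), t = 1) - psi(phi(x), t = 0)."""
    rep = phi(x)
    t1 = torch.ones(len(x))    # everyone treated
    t0 = torch.zeros(len(x))   # everyone untreated
    return (psi(rep, t1) - psi(rep, t0)).mean().item()
```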
Tables 12 and 13 show that the deep graph representation learning method always gives the smallest treatment effect estimate compared with the three propensity-score-based methods. Deep graph representation learning (DGRL) uses a graph neural network to construct the representation space within deep representation learning.

Table 12. Comparison of GNN methods with propensity-score-based methods on treatment effect, small dataset with balanced covariate measures (with graph features)
Small dataset, balanced covariates | ATE (in sample) | ATE (out of sample)
Propensity Score Estimator | 0.322 | 0.328
IPW Estimator              | 0.382 | 0.397
Doubly Robust Estimator    | 0.290 | 0.302
DGRL (DRL with graph features) | 0.254 | 0.289

Table 13. Comparison of GNN methods with propensity-score-based methods on treatment effect, large dataset with balanced covariate measures (with graph features)
Large dataset, balanced covariates | ATE (in sample) | ATE (out of sample)
Propensity Score Estimator | 0.435 | 0.452
IPW Estimator              | 0.414 | 0.436
Doubly Robust Estimator    | 0.315 | 0.362
DGRL (DRL with graph features) | 0.328 | 0.347

In Table 14, we treat the top five ranked hashtags as the treatments: 'ANI', 'Kirti Saxena', 'INC India', 'Kapil', and 'With Congress'. The table shows that users are easily causally influenced by the hashtag 'INC India' (0.524) but hardly causally influenced by the hashtag 'Kapil' (0.197). This result is useful for people who want to increase the influence of their tweets in the social network.

Table 14. GNN-based result with different treatments (hashtags)
Large dataset, balanced covariates | ATE (in sample) | ATE (out of sample)
Treatment #1 | 0.328 | 0.347
Treatment #2 | 0.342 | 0.376
Treatment #3 | 0.524 | 0.592
Treatment #4 | 0.197 | 0.211
Treatment #5 | 0.264 | 0.279

5. DISCUSSION AND CONCLUSION

Identifying causal effects on a social network is an important and difficult problem. This paper builds a graph-based representation learning method to identify causal effects. We find that the graph representation learning method partially solves the confoundedness problem that the propensity score method cannot easily solve. The graph representation learning method extracts more graph embedding features that capture social network signals, which helps identify causal effects in a social network setting. We construct different types of datasets and compare our method with propensity-score-based methods. The results show that our model gives robust results on small balanced, large balanced, small imbalanced, and large imbalanced covariate datasets, and it remains robust across different sample sets. The advantage of this graph representation learning method is that it can solve the causal inference problem for large and imbalanced-covariate social networks. The disadvantage is that it is computationally expensive; it is not necessary if the propensity score can be constructed successfully.

ESSAY 2. GRAPH ATTENTION BASED METHOD FOR IMPROVING THE RECOMMENDATION FAIRNESS IN ELECTION SOCIAL NETWORK

0. ABSTRACT OF ESSAY 2

Graph-based recommendation systems play an important role in social network platforms. Most studies focus on building an appropriate representation space based on the interactions between users and items. However, this type of method has difficulty representing the information of minority users. This circumstance is called sampling bias in the training process. Sampling bias in the social network leads minority users to receive unfair recommendations, and graph-structure-based recommendation models even amplify the magnitude of this unfairness.
The existing graph-embedding-based recommendation algorithms focus on improving recommendation quality but pay little attention to the fairness of the recommendation results. The few studies that do pay attention to the fairness problem find it difficult to achieve high recommendation accuracy under fairness constraints. In this paper, we design an algorithm that satisfies both recommendation accuracy and fairness requirements. We present a graph-attention-based model to capture the heterogeneous information between different users, tweets, and hashtags on Twitter, and use the attention mechanism to increase the exposure of the sensitive attributes of minority users on Twitter. The experiments show that our model can make fair recommendations without losing too much recommendation accuracy. We compare our method with several benchmark methods and achieve state-of-the-art performance.

1. INTRODUCTION

In 2014, the Bharatiya Janata Party (BJP) achieved a great success in the general election, and Narendra Modi became the Prime Minister of India. This was the first time in twenty years that a party won more than half of the seats in the Lok Sabha [Footnote 6: https://www.indiavotes.com/lok-sabha/2014/all-states/16/0] (Priya and Peter 2014, Sardesai 2015, Ahmed et al 2016, Diwakar 2017, Khan et al 2019). Some political researchers have pointed out that the reason behind the BJP's success in the election was winning the "Hindi Belt" (Diwakar 2017, Jaffrelot 2015, Khan et al 2019). Other political researchers studied Narendra Modi's Twitter account and argued that it was not the BJP but Modi himself who won the support (Sardesai 2015, Ahmed et al 2016, Pal et al 2019). If we analyze the content of "narendramodi", Modi's official Twitter account, there are many images of zazen, meditation, and worship [Footnote 7: https://indianexpress.com/elections/pm-modi-takes-vacation-from-politics-meditates-in-kedarnath-5734996/]. These help Modi build the image of a religious leader in his followers' minds (Khare 2015). Besides building a religious-leader image, Modi also posts text, videos, and hashtags in his tweets. All of these help Modi interact with his followers very efficiently, because his followers on Twitter learn Modi's political declarations, political campaign, and persona (Ahmed and Skoric 2015, Khare 2015, Khan et al 2019). After these interactions between Modi and his followers, the recommendation algorithm recognizes this kind of interaction as the followers' preference and recommends Modi's tweet content to the targeted followers in the future.

Furthermore, politicians need to know the reactions and comments of their followers. Social network platforms such as Twitter, Facebook, and Reddit complement traditional survey methods (Berman et al 2019, Petrova et al 2021). It becomes more and more important for politicians to know their followers' impressions of their tweets. Campaign teams prefer to recommend tweets to followers based on their interests in order to attract more followers (Pennacchiotti et al 2011). If we can extract information from comments on Twitter, it is a good way to recommend political content about politicians and parties to users and to shape candidates' images in followers' minds. Most recommendation algorithms on social network platforms are based on these interactions between different users, hashtags, and tweets.
The more interactions happen in the social network, the better the patterns and preferences that can be recognized by the recommendation algorithms. This leads to an important question for recommendation systems on social networks: not everyone's voice is heard equally in the social network (Dwork et al 2012, Feldman et al 2015, Hardt et al 2016). For example, many people in India believe in Hinduism; a Muslim user has very little chance of receiving information other than what Hindu users typically like to see on Twitter [Footnote 8: https://www.pewforum.org/2021/06/29/religion-in-india-tolerance-and-segregation/]. This is because the recommendation algorithms are based on users' demographic data and the interactions between users and items. The majority of users share certain tweet preferences, which misleads the algorithm into building stereotypes of all users without considering the minority of users (Rahman et al 2019). Therefore, minority users are represented by the majority of users and receive unfair recommendation results. This is a sampling bias problem in the social network and cannot be avoided. Recommendation algorithms based on such biased samples will always give biased, or unfair, recommendation results. In order to solve this problem, we need to build a model that considers both recommendation accuracy and recommendation fairness in the social network.

In this paper, we want to answer the following questions:
1. Which kind of information do politicians want to diffuse to their followers?
2. How can we make sure minority users have a chance to receive recommendations based on their own preferences on Twitter?
3. How can we build a recommendation algorithm that satisfies both recommendation accuracy and fairness on Twitter?

In order to answer these three questions, we combine three types of machine learning methods. First, in order to know which kind of political information matches users' preferences on Twitter, we use text mining methods, including bag-of-words and topic modeling (Karami et al 2018). These help describe the semantic information of politicians' tweets, which includes political declarations, political campaigns, and persona: the information that politicians want to impress on their followers' minds. Second, we need to detect how this semantic information diffuses between politicians' and users' minds, and the meta-path-based method is a very efficient way to capture this. Each meta-path describes how semantic information diffuses through different types of users, tweets, and hashtags, so we can build semantic meaning across different types of edges. When we consider the preferences of minority users, this meta-path-based method includes much information from minority users and can somewhat decrease the sampling bias problem in the social network. Third, the reason we have sampling bias is that some demographic features of users are sensitive, which makes it hard for minority users to receive recommendations based on their own preferences. To solve this problem, we put an attention mechanism on the sensitive attributes of minority users and minimize the discrepancy between the representation space of these attributes and the representation space of all users.
In this paper, we combine these three methods to capture the heterogeneous information in the social network and to balance recommendation accuracy against recommendation fairness.

The traditional recommendation method is based on collaborative filtering models. These models treat users and items (tweets) as pairs and give recommendations based on similar users' preference histories. However, this loses a lot of information about users' interactions. For example, two users on Twitter may have no similar preference history but belong to a similar social network group (Bobadilla et al 2020). The interactions between users also reflect the content they find interesting on Twitter. Therefore, traditional collaborative filtering models do not work well for social-network-based recommendation; we need a recommendation model based on graph theory.

Graph Representation Learning is a method to extract information from a large-scale network (Kipf et al 2016, Hamilton et al 2017, He et al 2020). We can use the embedding method to encode both the nodes and the edges of the graph. The tweets posted by each user can be viewed as nodes, and the retweet relationship between two tweets can be viewed as an edge. We can use this method to draw a big semantic network based on the retweet network. Please note that this is neither a pure social network nor a pure text network; we call it a semantic network because it includes tweets, users, and hashtags. This information is heterogeneous across nodes and edges, and we can use the meta-path-based method to capture it (Chang et al 2015, Dong et al 2017, Fan et al 2019).

The meta-path-based model has many advantages for prediction accuracy in social networks but is limited in its consideration of fairness (Bose and Hamilton 2019, Buyl and Bie 2021). To solve the fairness problem, we need to build an unbiased representation space based on biased observations (Fu et al 2020, Buyl and Bie 2021). We add a hierarchical attention mechanism to the classical meta-path network embedding model. This approach has several advantages. First, the meta-path only gives global weights for each node and edge, while attention can give us local weights for each node and edge (Zhou et al 2019, Xue et al 2020, Wang et al 2021); therefore, we can give the sensitive attributes of minority users more weight when we construct the representation space. Second, the attention weights can help explain how different types of nodes and edges contribute to the final prediction accuracy and fairness results (Dwork et al 2012, Wu et al 2019).

The contributions of this paper are as follows. First, we propose a model to embed the heterogeneous information network and amplify the representation of minority users on Twitter. Second, we apply this method to the social recommendation problem, compare both accuracy and fairness performance with benchmark methods, and achieve state-of-the-art results. Third, we obtain evidence from the attention coefficients, which helps interpret the contributions of the focal nodes and edges.

The rest of the paper is organized as follows: we review the literature in Section 2, state the preliminaries and notation in Section 3, build the model in Section 4, analyze the experiments and data results in Section 5, and discuss and conclude in Section 6.
The details of the variable definitions, model building, and additional results are in the appendix.

2. LITERATURE REVIEW

There are three streams of literature that we follow. The first stream treats the political brand or political advertisement as a tool to help potential candidates win elections (Gordon and Hartmann 2013, Wang et al 2018, Zhang and Chung 2020, Fossen et al 2021). Our study focuses on using Twitter as an election campaign tool to enlarge the effect of the political brand among voters; Twitter is an important campaign tool today (Ahmed et al 2016, Diwakar 2017, Buccoliero et al 2020).

The second stream is heterogeneous information network embedding based recommendation. This stream includes heterogeneous information network embedding based recommendation and graph neural network based recommendation. Heterogeneous information network embedding based recommendation captures the different types of users, tweets, and hashtags and makes recommendations based on this information (Chang et al 2015, Fu et al 2017, Shi et al 2018). The attention mechanism from graph neural network based recommendation captures the weights of each specific intra-path or inter-path and makes recommendations based on the connected nodes or edges (Fu et al 2020, Hong et al 2020). We combine these two methods to take advantage of both.

The third stream is fairness in recommendation. This stream includes fairness-aware sampling methods and fairness-aware modeling methods (Dwork et al 2012, Zeng et al 2021). The fairness-aware sampling method makes the sampling independent of the sensitive attributes. The fairness-aware modeling method incorporates a fairness loss function into the training process in order to take care of the preferences of minority users (Bose and Hamilton 2019, Bobadilla et al 2020, Buyl and Bie 2020, Spinelli et al 2021). Both of these approaches can help improve the fairness of the recommendation.

We fill the gap between heterogeneous information network embedding and fairness recommendation with an attention mechanism. To the best of our knowledge, this is the first study to use the attention mechanism to decrease bias in recommendation systems. Therefore, we contribute to both the theoretical part and the methodological part. On the theoretical side, we show that Twitter is an effective tool for politicians and their parties to run campaigns; it helps improve the politician's impression in users' minds. On the methodological side, we contribute an advanced model that computes representations while considering minority users' preferences and their interactions on the social network platform. This method can give fair recommendations to minority users without losing too much recommendation accuracy.

2.1. Political Brand Image on the Semantic Network

Twitter is a fast-growing platform for sharing different opinions and public views related to politics. It can enlarge the positive and negative images of political candidates in followers' minds. Twitter is a very important tool for winning campaigns and predicting voters' opinions (Bermingham and Smeaton 2011, Skoric et al 2012, Sang and Bos 2012, Berman et al 2019, Khan et al 2019, Petrova et al 2021).
Many researchers have studied social media platforms such as Twitter and Facebook to extract the semantic meaning behind text information and to find the connection between this semantic meaning and political campaign success (Ahmed and Skoric 2015).

India's 2014 general election result changed Indian voting behavior (Diwakar 2017, Khan et al 2019). Many people began to follow the trend and interact with potential candidates on Twitter. The BJP and its leader Modi successfully transformed followers into voters. They built a concept called "Modi lahar" (the Modi wave) to attract many followers (Diwakar 2017, Khan et al 2019). Modi and his campaign team successfully built a political image of a fighter and a lord who leads the Indian people.

Ahmed et al (2017) used political brand image as a mediator and found a positive relationship between the political brand and voting preference. Lin and Himelboim (2019) also used the political brand as a mediator and found that the weaker the social ties of the political brand community, the more voters are attracted to the political candidates. The political brand is also used as a moderator in political brand research: it strengthens the relationship between tweet content and voters' engagement on Twitter (Panda et al 2020, Mallipeddi et al 2021).

Besides brand image, other researchers have studied political advertisement (Graham et al 2013, Gordon and Hartmann 2013, Zhang and Chung 2020, Fossen et al 2021). They found that political advertisement has a significant positive relationship with election results: the more effective the candidates' advertisements, the more voters support the candidates. Therefore, political advertisement is very useful for winning votes. At present, Twitter has become more and more important for interacting with voters, and more and more politicians treat Twitter as a platform to advertise themselves.

Whether building the political brand image or increasing the effect of political brand advertisement, the campaign team of a political candidate should use Twitter carefully. Twitter is an important channel for diffusing political candidates' opinions (Berman et al 2019, Panda et al 2020, Petrova et al 2021). If the recommendation algorithm recognizes the interactions between voters and political candidates well, the information diffuses very efficiently on Twitter.

Table 15. Literature review of political brand image on the semantic network
Author | Political marketing | Conclusion
Ahmed et al (2017) | Political brand image as the mediator and predictor | Political brand equity has a positive relationship with voting preference
Lin and Himelboim (2019) | Political brand community as the mediator | The weaker the social ties of the political brand community, the more successful the political candidates
Khan et al (2019); Mallipeddi et al (2021) | Political brand as the moderator | Political brand image moderates the correlation between the content and voters' engagement
Gordon and Hartmann (2013) | Political advertisement as the predictor | Political advertisement is positively correlated with presidential election results
Wang et al (2018) | Political advertisement as the mediator | The effectiveness of political advertisement sponsored by political action committees is less than that sponsored by the political candidates
Zhang and Chung (2020) | Political advertisement as the predictor | The effectiveness of political advertisement from political candidates is greater than that of political advertisement from non-partisans
Fossen et al (2021) | Political advertisement as the predictor | Political advertisement has a spillover effect on subsequent advertisement
Berman et al (2019) | Twitter as an online advertisement channel | Tweets become more emotional and elaborate around the debates
Petrova et al (2021) | Facebook as an online advertisement and donation channel | The higher the Twitter penetration of an area, the higher the donations to the political candidates
2.2. Heterogeneous Information Network Embedding

In recent years, the graph neural network (GNN) has become a popular representation method for extracting node, edge, and graph features from social networks (Xie et al 2016, Hamilton et al 2017, Zhang et al 2019). It can help us analyze users' behaviors and semantic information effectively. There are two types of embedding methods for learning graph features: the random-walk-based method and the graph neural network based method. The advantage of the random-walk-based model, meta-path embedding, is that it captures all the heterogeneous information easily (Dong et al 2017, Fan et al 2019, Yin et al 2019). The advantage of the graph neural network, especially the attention mechanism, is that it aggregates the intra-path and inter-path information easily (Sankar et al 2018, Xue et al 2020). Therefore, in this paper, we combine the random-walk-based method and the graph neural network based method to train the model.

This section includes three parts. First, heterogeneous information network embedding: many researchers have studied heterogeneous networks, especially using the meta-path-based method to embed the features on the same path (Ji et al 2018, Shi et al 2018). Second, the graph neural network: it includes the graph convolutional aggregation approach and the graph attention aggregation approach (Nguyen et al 2018, Pareja et al 2020, Sankar et al 2020); both are advantageous for training the model, but the attention mechanism connects more easily with the meta-path-based method. Third, graph neural network based recommendation: after we combine these two methods, we apply them to the recommendation problem, which gives better performance than traditional recommendation methods (Wu et al 2018, Fan et al 2019). We review the meta-path-based papers and the attention-based papers, then discuss how to combine them and apply them to the social recommendation problem.

There are many users, topics, tweets, and hashtags on Twitter. If we treat these different types of nodes as a homogeneous graph, we lose information when we transform them into the embedding space. This leads to failure when we use the embedding results for classification, prediction, or recommendation. Therefore, we need a representation method that captures all these heterogeneous nodes together. The meta-path method, which is based on the skip-gram framework, can successfully include different types of nodes and their semantic meanings.
We can construct the information diffusion path based on these semantic meanings and calculate the highest-probability path instances. Dong et al (2017) first proposed the meta-path method to build the diffusion path and capture the features of different types of nodes. Fu et al (2017) relaxed the restrictions and constructed meta-paths more flexibly. Shi et al (2018) followed the same idea and proposed the HERec method, which defines a filter that captures the node sequence and transforms the heterogeneous topology into a homogeneous one.

Besides meta-path based methods, many researchers extract the different types of features with graph neural network methods (Kipf and Welling 2016, Zhang et al 2019, He et al 2020, Fu et al 2020, Wang et al 2021). Kipf and Welling (2016) and He et al (2020) used graph convolutional networks to construct the embedding space. However, the graph convolutional network aggregates information from the neighbors of the target node, so we cannot embed a predefined diffusion path into the model. In this paper we also want to explain the model results, and the graph convolutional network does not give a clear explanation of them. We therefore choose another approach, the graph attention network (Zhou et al 2019, Fu et al 2020, Wang et al 2021), to aggregate the information. Zhou et al (2019) combined the attention mechanism with the meta-path method: at the node level, each node's features are aggregated by the attention mechanism, so the model preserves the meta-path structure while aggregating the information with a neural network. Fu et al (2020) extended this method by aggregating information at the edge level, so that features of different node types and different edge types can be captured simultaneously. This type of method requires the meta-paths to be predefined before training. Wang et al (2021) use only the attention mechanism, without meta-paths, to extract the different types of nodes, edges, and side information.

Besides the pure attention mechanism, some other researchers modify the attention mechanism with other embedding methods (Wu et al 2019, Fan et al 2019, Zhang et al 2019, Hong et al 2020, Xue et al 2020). For example, Wu et al (2019) studied the social recommendation problem on a dual graph structure, where the attentions are calculated from the different nodes and aggregated on different graphs. Zhang et al (2019) replaced the attention mechanism with an LSTM embedding. These papers show that using the attention mechanism to aggregate different types of information is effective. We propose a model based on the meta-path method and the attention mechanism to capture the heterogeneous information on Twitter. By using the attention mechanism to aggregate the graph information on each meta-path, we can represent the information of the different users, hashtags, tweets, and topics on Twitter. We review how the attention mechanism can be used to decrease the bias of recommendation in the next part.
Table 16. Literature review of Heterogeneous Information Network Embedding Methods

Category | Author | Model Name | Node level Aggregation | Edge level Aggregation
Meta-path based Methods | Dong et al (2017) | Metapath2vec | Meta path | NA
Meta-path based Methods | Fu et al (2017) | HIN2Vec | Meta path | NA
Meta-path based Methods | Shi et al (2018) | HERec | Meta path | NA
Hierarchical Attention Mechanism | Wang et al (2021) | HAN | Node Attention | Edge Attention
Hierarchical Attention Mechanism | Zhou et al (2019) | HAHE | Meta path Attention | NA
Hierarchical Attention Mechanism | Fu et al (2020) | MAGNN | Meta Path Attention | Meta Path Attention
Hierarchical Attention Mechanism | Wu et al (2019) | DANSER | Dual Attention | NA
Hierarchical Attention Mechanism | Fan et al (2019) | MEIRec | Meta Path Attention | NA
Hierarchical Attention Mechanism | Zhang et al (2019) | HetGNN | LSTM | Edge Attention
Hierarchical Attention Mechanism | Hong et al (2020) | HetSANN | Hierarchical Attention | NA
Hierarchical Attention Mechanism | Our Model | Attention Fair | Intra-Path Attention | Inter-Path Attention

2.3. Fairness recommendation algorithms

Despite the success of graph neural networks in training on such data, they also introduce bias on sensitive attributes such as age, gender, and religion. The representation learned by the graph neural network amplifies this bias, especially for minority users (Rahman et al 2019, Bose and Hamilton 2019, Bobadilla et al 2020, Fu et al 2020, Wang et al 2021).

There are two ways to decrease the bias during training. First, we can build a sampling method that accounts for minority users when generating the embedding space. Most papers in this stream follow the ideas of Rahman et al (2019) and Buyl and Bie (2020) and calculate a fairness-based random walk. This type of method improves the fairness of the recommendation but loses prediction accuracy. Zeng et al (2021) modified the traditional meta-path generation method by giving higher weight to the minority group and lower weight to the majority group, which reduces the sampling bias during the embedding process. Wu et al (2021) detect the sensitive attributes in a first step and build a filter space that decreases the bias of each sensitive attribute; the filtered embedding space then gives fairer recommendation results than the unfiltered embedding space.

Second, we can build fairness into the model itself and put it into the loss function (Fu et al 2020, Wang et al 2021). This helps to trade off the prediction accuracy against the fairness of the recommendation. The sampling bias can be reduced with a generative adversarial network: Dai and Wang (2021) used an adversarial neural network to debias the nodes' sensitive features. This method helps to decrease the unfairness when constructing the graph neural network, but it also loses some of the generalization ability of the model (Kang and Tong 2021). To solve this problem, many studies construct the loss function based on ranking (Fu et al 2020, Spinelli et al 2021, Dong et al 2021). Fu et al (2020) put the relative ranking of minority users into the Gini index formula to measure unfairness. Spinelli et al (2021) built a graph neural network that drops biased edges after each training epoch, where the biased edges are identified with relative distance metrics. Dong et al (2021) promote individual fairness by considering the distance matrix of each user and item pair and build a ranking based loss function: if the relative ranking distance of the minority users differs from that of the majority users in the ranking space, the ranking is adjusted and fed into the next round of training. Our model takes the second approach to address the fairness problem.
We use the attention mechanism to emphasize the attribute preferences of the minority users rather than those of the majority users. The idea of minimizing the difference between the minority users' preferences and the majority users' preferences can be found in studies that use the KL divergence (Buyl and Bie 2021, Current et al 2022). Our method shares this idea and builds it into the graph neural network architecture, which helps to account for the fairness of the recommendation without losing much prediction accuracy.

Table 17. Literature review of Fairness recommendation algorithms

Author | Model Name | Sampling or Modeling | User level or Group level | Neural Network Architecture
Rahman et al (2019) | Fairwalk | Sampling | User level | NA
Fu et al (2020) | FairKG4Rec | Modeling | Both | NA
Dai and Wang (2021) | Fair GNN | Modeling | User level | GAN
Spinelli et al (2021) | Fair Drop | Modeling | User level | GNN
Zeng et al (2021) | Fair HIN | Sampling | Group level | GAN
Wu et al (2021) | Fair Go | Sampling | Group level | GAN
Dong et al (2021) | EDITS | Modeling | Group level | GNN
Dong et al (2021) | REDRESS | Modeling | User level | GNN
Our model | Attention Fair | Modeling | Both | GNN

3. PRELIMINARIES

After reviewing the prior studies on heterogeneous information network embedding and on the fairness of recommendation, we give the preliminary concepts in this section.

Definition 1 Heterogeneous Information Network. A heterogeneous network is defined as G = (V, E), where V denotes the set of nodes and E denotes the set of edges. There are two mapping functions: a node type mapping function f_v : V → R and an edge type mapping function f_e : E → S, where R is the predefined set of node types, S is the predefined set of edge types, and |R| + |S| > 2. This definition of the heterogeneous information network is given by Dong et al (2017).

Definition 2 Heterogeneous Information Network Embedding. Given a heterogeneous graph G = (V, E) with node types R and node attribute matrices X_T ∈ ℝ^(|V_T| × d_T), where T is the number of node types, heterogeneous information network embedding is the task of learning d-dimensional node representations h_v ∈ ℝ^d for all v ∈ V that capture as much of the graph information as possible. If T = 1, the graph reduces to a homogeneous information network. This definition is also given by Dong et al (2017).

Definition 3 Meta-path. A meta-path φ is defined as

φ = R_1 →(S_1) R_2 →(S_2) … →(S_m) R_(m+1)

where each arrow is labeled by its edge type S_i. A path instance I is a path through the nodes V_1, V_2, …, V_(m+1) such that, for i = 1, …, m, R_i = f_v(V_i) and S_i = f_e(V_i, V_(i+1)).

Defining a specific meta-path requires prior knowledge. Although some studies compute meta-paths in social networks in an unsupervised way (Wang et al 2018, Wei et al 2018), the fairness problem requires us to specify the sensitive attributes. Therefore, we predefine the meta-paths based on the semantic information of each tweet, user, and hashtag. The details of this predefinition can be found in Appendix K.

Definition 4 Meta-path based Neighbors. Given a node V_i and a meta-path φ in a heterogeneous graph, the meta-path based neighbors N_φ are the nodes connected to node i through the meta-path φ.
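To make Definitions 3 and 4 concrete before the diagrams below, the following short Python sketch enumerates meta-path instances and meta-path based neighbors on a toy heterogeneous Twitter-style graph. This is only an illustrative sketch, not the dissertation's implementation; the node identifiers (u1, u2, t1, h1) and the helper function metapath_neighbors are hypothetical.

```python
# Illustrative sketch: meta-path based neighbors (Definitions 3-4) on a toy
# heterogeneous Twitter-style graph. All identifiers are hypothetical.
from collections import defaultdict

# node -> node type (the set R); typed edges are implied by endpoint types
node_type = {"u1": "User", "u2": "User", "t1": "Tweet", "h1": "Hashtag"}
edges = [("u1", "t1"), ("u2", "t1"), ("t1", "h1")]   # retweet / contains links

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def metapath_neighbors(start, metapath):
    """Return nodes reachable from `start` along the node-type sequence
    `metapath`, e.g. ("User", "Tweet", "User")."""
    if node_type[start] != metapath[0]:
        return set()
    frontier = {start}
    for wanted in metapath[1:]:
        frontier = {n for v in frontier for n in adj[v] if node_type[n] == wanted}
    return frontier - {start}

print(metapath_neighbors("u1", ("User", "Tweet", "User")))      # {'u2'}
print(metapath_neighbors("u1", ("User", "Tweet", "Hashtag")))   # {'h1'}
```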
Having made these definitions, we draw some diagrams to illustrate the meta-path with examples. Figure 6 shows how an opinion diffuses on Twitter: user u1 retweets the content "crowd shows how professionally BJP manages such a huge gathering" with the hashtag #BJP included. This diffusion connects two different users, u1 and u2, the same tweet, and two different hashtags, #BJP and #AAP. It can be seen as a local heterogeneous information network with different types of users, tweets, and hashtags, and for the different types of users we can define different types of edges to connect these nodes.

Figure 6. Diagram of opinion diffusion on Twitter (users u1 and u2, their tweets, and the hashtags #BJP and #AAP).

In figure 7, we build meta-path instances from the heterogeneous information network of figure 6. Two types of meta-paths are illustrated. The first meta-path is the diffusion from u1 to tweet 1 to hashtag 1, written as User-Tweet-Hashtag and denoted Φ_1. The second meta-path is the diffusion from u1 to tweet 1 to u2, written as User-Tweet-User and denoted Φ_2. This is obviously a simple example of how we define meta-paths; we can define more types of users, tweets, and hashtags to reveal more information. The details of these definitions can be found in Appendix L.

Figure 7. Meta-path instances (instance 1: User-Tweet-Hashtag; instance 2: User-Tweet-User).

In figure 8, we build the meta-path based neighbors of the focal node u1. We can find meta-path based neighbors such as u1, u2, and u3 based on the meta-path that connects these three users, and we denote this set N_φ. We can also define meta-path based neighbors for tweets and hashtags, but in this paper we focus on the fairness problem for minority users on Twitter, so the meta-path based neighbors are defined only for users.

Figure 8. Meta-path based neighbors N_φ of the focal user u1 (here the users u1, u2, and u3).

Besides the preliminary concepts of the heterogeneous information network, we also need to define a measurement of fairness. In economics and sociology, the Gini coefficient is commonly used to measure the fairness of people's incomes (Gini 1921). In this setting, we use the Gini coefficient to measure the fairness of the recommendation quality delivered to users in the social network; it is very helpful for measuring the individual fairness of each user.

Definition 5 Gini Coefficient. Suppose we have m users u_1, u_2, …, u_m and n recommended tweets tw_1, tw_2, …, tw_n. We denote by Q_ij ∈ {0, 1} whether tweet tw_j is recommended to user u_i. Top-K recommendation means that user i receives the top K recommended tweets tw_j, that is, Σ_{j=1}^{n} Q_ij = K. The Gini coefficient in this social network setting is written as

Gini(Q) = ( Σ_{Q_μ, Q_ν} |Q_μ - Q_ν| ) / ( 2m Σ_{j=1}^{n} Q_μ )

where μ and ν are two users selected at random from the whole set of users.
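The following Python sketch computes the Gini coefficient of Definition 5 from the recommendation indicator matrix Q. It is only an illustration of the formula above (the function name recommendation_gini and the toy matrix are hypothetical, not part of the dissertation's code); note that under strict top-K recommendation, where every user receives exactly K tweets, the value is zero by construction.

```python
import numpy as np

def recommendation_gini(Q):
    """Gini coefficient over per-user recommendation exposure (Definition 5).

    Q is an m x n 0/1 matrix; Q[i, j] = 1 if tweet j is recommended to user i.
    Values near 0 mean exposure is spread evenly across users; values near 1
    mean a few users receive most of the recommended content.
    """
    per_user = Q.sum(axis=1).astype(float)            # recommendations per user
    m = per_user.size
    mean_abs_diff = np.abs(per_user[:, None] - per_user[None, :]).sum()
    return mean_abs_diff / (2 * m * per_user.sum() + 1e-12)

# toy example: 4 users, 6 candidate tweets
Q = np.array([[1, 1, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1]])
print(round(recommendation_gini(Q), 3))   # about 0.386 for this toy matrix
```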
There are two further ways to measure the group fairness of users who share sensitive attributes such as age, gender, and religion in a social network.

Definition 6 Demographic Parity. Demographic parity is one of the important fairness criteria given by Dwork et al (2012). It means that the probability of being assigned the predicted result ŷ (for example, a recommended engineering book on Amazon) should be independent of the sensitive attribute x_sensitive. Suppose we select one user u_μ belonging to the minority group (for example, the sensitive attribute is female) and another user u_ν belonging to the majority group (for example, the sensitive attribute is male). We expect a user to receive the engineering book recommendation on Amazon regardless of gender. The formula is written as

P(ŷ = 1 | x_sensitive = 0) = P(ŷ = 1 | x_sensitive = 1)

Definition 7 Equal Opportunity. Another important fairness criterion, defined by Hardt et al (2016), is equal opportunity. It means that the probability of a group of people being assigned the predicted result ŷ should be independent of the sensitive attribute x_sensitive given the corresponding ground truth result y. We expect a user to receive the engineering book recommendation not because of gender and, at the same time, to truly want the engineering book recommendation. The formula is written as

P(ŷ = 1 | x_sensitive = 0, y = 1) = P(ŷ = 1 | x_sensitive = 1, y = 1)

The advantage of equal opportunity is that it accounts for the situation in which a sensitive group of users may receive a recommendation for other reasons (Hardt 2016). For example, if a female user has many friends in the engineering major, the algorithm may predict that she has a strong interest in engineering even when that is not true. Therefore, we need to condition on the ground truth outcome when predicting the recommendation result. Compared with demographic parity, this measurement puts more weight on the true positive rate. It helps to avoid the unfair situation in which users who truly belong to the minority group are falsely assigned to the majority.
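As an illustration of Definitions 6 and 7, the sketch below computes the demographic parity difference and the equal opportunity difference from predicted recommendations, ground truth labels, and a binary sensitive attribute. The function names and toy arrays are hypothetical; the same quantities are used again as fairness measures in Section 5.4.

```python
import numpy as np

def dp_difference(y_hat, sensitive):
    """Demographic parity difference (Definition 6):
    P(y_hat = 1 | s = 0) - P(y_hat = 1 | s = 1)."""
    return y_hat[sensitive == 0].mean() - y_hat[sensitive == 1].mean()

def eo_difference(y_hat, y_true, sensitive):
    """Equal opportunity difference (Definition 7): the gap in true positive
    rates between the two groups."""
    pos = y_true == 1
    return (y_hat[(sensitive == 0) & pos].mean()
            - y_hat[(sensitive == 1) & pos].mean())

# toy data: 1 = item recommended; sensitive encodes minority (0) vs majority (1)
y_hat     = np.array([1, 0, 1, 1, 0, 1, 1, 0])
y_true    = np.array([1, 0, 1, 1, 1, 1, 0, 0])
sensitive = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(dp_difference(y_hat, sensitive))              # 0.25
print(eo_difference(y_hat, y_true, sensitive))      # 0.5
```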
After giving these preliminary statements, we discuss the model building and the data analysis in the next two sections.

4. MODEL SETTING AND ANALYSIS

4.1. Model setting

We set the model as follows. A user i belongs to a graph G(V, E), where V is the vertex set, E is the edge set, and A is the corresponding adjacency matrix. Each user i has demographic information, denoted X, called the node attributes. Each user i posts tweets or retweets on Twitter with sentiment scores y, denoted Y, the label data for each node.

Some papers apply the attention mechanism on nodes and edges to aggregate the information (Wu et al 2019, Xue et al 2020, Wang et al 2021). This is not a good way to solve the fairness problem, because it puts more weight on the majority users' demographic attributes and makes the recommendation result even more unfair. Therefore, we apply the attention mechanism on the meta-path to aggregate the information. This helps to address the problem that the minority users are represented by the majority users in the social network. After the aggregation within each meta-path, we aggregate across the different meta-paths, which gives a better representation of the heterogeneous information in the social network (Fan et al 2019, Sankar et al 2019, Yin et al 2019). In the end, we still need a loss function to control the fairness for the minority users. We do not follow the previous methods, because they put more weight on the fairness bias or on equal opportunity (Dai and Wang 2021, Wu et al 2021, Zeng et al 2021), while we also want to preserve the recommendation accuracy. Therefore, we build the same attention architecture on the minority users as the one trained on the whole dataset, and we build a loss function that minimizes the distance between the minority users' representation and the whole users' representation. If this distance is as small as possible, the minority users are not simply represented by the majority users during training, so the training algorithm can satisfy both high recommendation quality and fairness for the minority.

The process of building our model is as follows. First, we transform each node into the representation space. Second, we aggregate the information of the different node types (user, tweet, and hashtag) that belong to the same semantic meta-path. Third, we aggregate the information across meta-paths and output the overall semantic representation space. Fourth, we compute the cross-entropy between the final embedding space and the corresponding labels. In the meantime, we run the same training process on the minority users and compute their final embedding space. If we can minimize the distance between the final embedding of all users and the final embedding of the minority users, we obtain a relatively fair training result from the model.

4.2. Intra-Path Attention

The node-level attention computes the importance weight of each node's neighbors and of the node itself. The projection from a node into a feature vector is

h_i = W^r · x_i^r

where x_i^r is the initial feature of node i with type r and W^r is the transformation matrix for node type r. The meta-path embedding is written as

h_φ(i,j) = f_θ(h_i, h_j), ∀ i, j ∈ N_φ

If we calculate the attention between two nodes h_i and h_j, we get

e_ij = att_node(h_i, h_j)

We modify it to include the meta-path φ as

e_ij^φ = LeakyReLU(a^T · [h_i ∥ h_φ(i,j)])

The weight of two nodes i and j for edge type r is calculated as

α_(i,j)^r = softmax_j(e_ij) = exp(σ(a_r^T · [W^r x_i ∥ W^r x_j])) / Σ_{k ∈ N_i^r} exp(σ(a_r^T · [W^r x_i ∥ W^r x_k]))

where N_i^r is the set of neighbors of node i with edge type r, a_r is the parameter weight of the attention, σ is the activation function, and ∥ is the concatenation operation.

Figure 9. Diagram of the Intra-Path Attention (user and tweet node embeddings are aggregated along each meta-path φ_0, φ_1, …, φ_p into the meta-path embeddings Z_φ).

We obtain the final representation as

h_i^r = σ( Σ_{j ∈ N_i^r} α_(i,j)^r · W^r x_j )

and extend it to multi-head attention:

h_i^r = ∥_{k=1}^{K} σ( Σ_{j ∈ N_i^r} α_(i,j)^r · W^r x_j )

4.3. Inter-Path Attention

After calculating the node-level information, we aggregate the information of each meta-path, which we call the semantic-level information. The importance of each semantic (meta-path) r is calculated as

w_r = (1 / |V|) Σ_{i ∈ V} q^T · tanh(W · h_i^r + b)

β^r = softmax(w_r) = exp(q^T · σ(W · h_i^r + b)) / Σ_{r ∈ R} exp(q^T · σ(W · h_i^r + b))

We obtain the final representation as

Z_i = h_i = Σ_{r=1}^{R} β^r · h_i^r

The objective function is to minimize the cross-entropy over all labeled nodes:

L_accuracy = - Σ_{c=1}^{C} Σ_{i ∈ V} y_i[c] · log Z_i[c]

where V is the set of nodes that have labels, C is the number of classes, and y_i is the ground truth.

Figure 10. Diagram of the Inter-Path Attention (the meta-path embeddings Z_φ1, …, Z_φp are combined by inter-path attention into the node embedding Z_i and the output prediction ŷ_i).
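The following PyTorch-style sketch illustrates how the intra-path (Section 4.2) and inter-path (Section 4.3) attention layers could be organized for a single node. It is a minimal illustration under simplifying assumptions (one edge type, one attention head, neighbor scoring against the projected neighbor embedding, random toy inputs), not the dissertation's implementation; the class names IntraPathAttention and InterPathAttention are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraPathAttention(nn.Module):
    """Node-level attention: project features, score each meta-path neighbor
    with a LeakyReLU attention over concatenated embeddings, and aggregate
    neighbors with the resulting softmax weights."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # W^r
        self.a = nn.Parameter(torch.randn(2 * out_dim))   # a_r

    def forward(self, x_i, x_neighbors):
        # x_i: (d,), x_neighbors: (num_neighbors, d) along one meta-path
        h_i, h_j = self.W(x_i), self.W(x_neighbors)
        e = F.leaky_relu(torch.cat([h_i.expand_as(h_j), h_j], dim=-1) @ self.a)
        alpha = torch.softmax(e, dim=0)                    # alpha_{i,j}
        return torch.sigmoid((alpha.unsqueeze(-1) * h_j).sum(dim=0))

class InterPathAttention(nn.Module):
    """Semantic-level attention: score each meta-path embedding with
    q^T tanh(W h + b) and combine the meta-paths with softmax weights."""
    def __init__(self, dim, hidden=16):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.q = nn.Parameter(torch.randn(hidden))

    def forward(self, per_path_embeddings):
        # per_path_embeddings: (num_meta_paths, dim) for one node
        w = self.proj(per_path_embeddings) @ self.q        # importance per path
        beta = torch.softmax(w, dim=0)                     # beta^r
        return (beta.unsqueeze(-1) * per_path_embeddings).sum(dim=0)  # Z_i

# toy usage: one user node with 3 neighbors on each of 2 meta-paths
intra, inter = IntraPathAttention(8, 16), InterPathAttention(16)
x_i = torch.randn(8)
z_paths = torch.stack([intra(x_i, torch.randn(3, 8)) for _ in range(2)])
z_i = inter(z_paths)   # final node representation
```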
4.4. Attention Fairness

Intra-Path Attention for the minority of users. The node-level attention computes the importance weight of each node's neighbors and of the node itself. The projection from a node into a feature vector is

h'_i = W'^r · x'_i^r

where x'_i^r is the initial feature of node i with type r and W'^r is the transformation matrix for node type r. The meta-path embedding is written as

h'_φ(i,j) = f_θ(h'_i, h'_j), ∀ i, j ∈ N_φ

If we calculate the attention between two nodes h'_i and h'_j, we get

e'_ij = att_node(h'_i, h'_j)

We modify it to include the meta-path φ as

e'_ij^φ = LeakyReLU(a'^T · [h'_i ∥ h'_φ(i,j)])

The weight of two nodes i and j for edge type r is calculated as

α'_(i,j)^r = softmax_j(e'_ij) = exp(σ(a'_r^T · [W'^r x'_i ∥ W'^r x'_j])) / Σ_{k ∈ N'_i^r} exp(σ(a'_r^T · [W'^r x'_i ∥ W'^r x'_k]))

where N'_i^r is the set of neighbors of node i with edge type r, a'_r is the parameter weight of the attention, σ is the activation function, and ∥ is the concatenation operation. We obtain the final representation as

h'_i^r = σ( Σ_{j ∈ N'_i^r} α'_(i,j)^r · W'^r x'_j )

and extend it to multi-head attention:

h'_i^r = ∥_{k=1}^{K} σ( Σ_{j ∈ N'_i^r} α'_(i,j)^r · W'^r x'_j )

Inter-Path Attention. After calculating the node-level information, we aggregate the information of each meta-path, the semantic-level information. The importance of each semantic (meta-path) r is calculated as

w'_r = (1 / |V|) Σ_{i ∈ V} q^T · tanh(W' · h'_i^r + b)

β'^r = softmax(w'_r) = exp(q^T · σ(W · h'_i^r + b)) / Σ_{r ∈ R} exp(q^T · σ(W · h'_i^r + b))

We obtain the final representation as

Z' = h'_i = Σ_{r=1}^{R} β'^r · h'_i^r

The fairness loss function is defined as the distance between the representation of all users and the representation of the minority users:

L_fairness = dis(Z, Z')

4.5. Training

To train the model with both accuracy and fairness in mind, we combine the two loss functions into the final loss

L = L_accuracy + λ · L_fairness

The objective is to minimize the cross-entropy between the ground-truth label y and the predicted representation Z, and to minimize the distance between the predicted representation Z of all users and the predicted representation Z' of the minority users. We optimize the model with mini-batch stochastic gradient descent and back propagation. The overall description of the model is given in Algorithm 1.

4.6. Model Analysis

Our model has several advantages. First, the proposed model can handle many types of nodes, edges, and features. Second, the model is easy to compute and to parallelize: the overall complexity is linear in the number of meta-path pairs and can be parallelized across different node types and meta-path pairs, with a time complexity of O(V_φ K L^2) for each intra-path loop and O(E_φ K L) for each inter-path loop. Third, the model is good at explaining the contributions of the focal nodes and edges: the attention weight coefficients show how each part contributes to the final task, so it is easy to investigate which parts of the whole graph are important.
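Before stating Algorithm 1, the sketch below illustrates the combined objective L = L_accuracy + λ · L_fairness of Section 4.5 in PyTorch. The distance dis(Z, Z') is instantiated here as the L2 distance between mean embeddings purely for illustration; the function name and toy tensors are hypothetical and not taken from the dissertation's code.

```python
import torch
import torch.nn.functional as F

def attention_fair_loss(z_all, logits_all, labels, z_minority, lam=0.5):
    """Combined objective L = L_accuracy + lambda * L_fairness:
    cross-entropy over labeled nodes plus a distance between the embedding
    of all users and the embedding of the minority subgroup (one possible
    choice of dis(Z, Z'))."""
    l_accuracy = F.cross_entropy(logits_all, labels)
    l_fairness = torch.norm(z_all.mean(dim=0) - z_minority.mean(dim=0), p=2)
    return l_accuracy + lam * l_fairness

# toy usage with random embeddings/logits for 100 users (20 in the minority group)
z_all  = torch.randn(100, 16, requires_grad=True)
logits = torch.randn(100, 4, requires_grad=True)    # 4 sentiment classes
labels = torch.randint(0, 4, (100,))
z_min  = z_all[:20]                                 # minority users' rows
loss = attention_fair_loss(z_all, logits, labels, z_min)
loss.backward()                                     # gradients for SGD / Adam
```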
Algorithm 1. Attention Fair Algorithm

Input: graph G^t = {V^t, E^t}, meta-path set {φ_1, φ_2, …, φ_p}, feature set {x_i, ∀ i ∈ V}^t, number of attention heads K, number of layers L, time sequence {t_1, t_2, …, t_m}
Output: final embedding Z, intra-path attention coefficients α, inter-path attention coefficients β

1  For φ_i ∈ {φ_1, φ_2, …, φ_p} do
2    For k = 1, 2, …, K do
3      Intra-path embedding
4      For i ∈ V do
5        Find the meta-path neighbors N_φ
6        For j ∈ N_φ do
7          Calculate the intra-path weight coefficient
8        End
9      End
10     Calculate the semantic embeddings
11   End
12   Calculate the inter-path weight coefficient
13 End
14 Calculate the semantic embeddings
15 End
16 Calculate the final embeddings
17 Calculate the cross-entropy
18 Back propagation and update the parameters
19 Return Z, α, β

5. DATA RESULT

5.1. Dataset: Indian 2014 Election Twitter

Our dataset is the Indian General Election data from Twitter between January 1st and May 16th, 2014. The dataset contains all tweets related to the Indian General Election: 3.2 million tweets from 218,7334 users. After data cleaning, we obtain the tweet sentiment Y, the treatment T (i.e., whether user i is exposed to others' tweets of the same topic before she posts her m-th tweet), the covariate variables X such as user location and follower count, and the adjacency matrix A.

Table 18. Data Description

Variable | Description | Example
Users Types | Politician | Modi, Gandhi, Jayalalithaa, etc
Users Types | Parties | BJP, AAP, INC, etc
Users Types | Media | The Times of India, The Hindu, etc
Tweets Types | Tweet | #SadacharYatra The #BJP government had promised employment to 13 lakh in 2007 but only 80,000 got jobs. #SadacharYatra
Tweets Types | Topic | job, government, lakh, get, bjp, political, provide, lose, claim, pass
Hashtags Types | Hashtags | #Narendra Modi For PM, #BJP, #ANI, etc
Sentiment Scores | Sentiment Scores | Number (from -1 to 1)
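As an illustration of how the node and edge types of Table 18 can be assembled into a heterogeneous information network (Definition 1) before embedding, the sketch below builds a tiny typed graph with NetworkX. The specific node identifiers, attribute names (ntype, etype), and the example edges are hypothetical, not the dissertation's preprocessing code.

```python
import networkx as nx
from collections import Counter

# Illustrative sketch: a tiny typed election graph with the node types of Table 18.
G = nx.Graph()
G.add_node("user:narendramodi", ntype="User", role="Politician")
G.add_node("user:voter_42", ntype="User", role="Ordinary user")
G.add_node("tweet:1001", ntype="Tweet",
           text="#SadacharYatra The #BJP government had promised employment ...",
           sentiment=-0.4)
G.add_node("hashtag:BJP", ntype="Hashtag")
G.add_node("topic:jobs", ntype="Topic")

G.add_edge("user:voter_42", "tweet:1001", etype="retweets")
G.add_edge("tweet:1001", "hashtag:BJP", etype="contains")
G.add_edge("tweet:1001", "topic:jobs", etype="about")
G.add_edge("user:narendramodi", "tweet:1001", etype="mentioned_in")

# node-type counts, i.e. the composition of the set R in Definition 1
print(Counter(nx.get_node_attributes(G, "ntype").values()))
```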
5.2. Experiment Setting

5.2.1. Baseline

We compare our model with the following methods, which include random walk based models and graph neural network based models, considering both fairness and prediction performance:

DeepWalk: Perozzi et al (2014) built DeepWalk to connect the word2vec method with graph embedding. The first step uses truncated random walks to transform the topology into sequential data; the second step uses word2vec to calculate the embeddings that represent users and items.

Node2vec: Grover and Leskovec (2016) built the Node2vec model to embed nodes into a vector space. The first step uses biased random walks to explore each node's local neighborhood; the second step optimizes the likelihood function of the node embeddings.

Metapath2vec: Dong et al (2017) built the Metapath2vec model to capture different types of information from a given network. The first step builds a meta-path connecting the different node types; the second step calculates the transition probability along each meta-path.

GCN: Kipf and Welling (2016) combined convolutional neural networks with graph data. The key idea is to scan the graph topology with spectral filters, which lets the convolution operate on graph data.

GAT: Veličković et al (2018) applied the attention mechanism from the computer vision field to graph-structured data. The method uses shared attention weights to calculate each node's embedding, allowing different importance for each node.

FairHIN: Zeng et al (2021) built a fair meta-path sampling method to capture the features around heterogeneous nodes and edges. They used demographic parity and equal opportunity as measurements for the fairness loss.

FairGNN: Dai and Wang (2021) built an adversarial network to minimize the bias generated by unfair sampling. They also minimize a covariance constraint to stabilize the training process.

5.2.2. Parameter Setting

We implement the proposed model of Section 4.2. The optimization is based on the Adam algorithm implemented in PyTorch. The learning rate is chosen from {0.001, 0.005, 0.01, 0.05}, the random walk length from {10, 50}, and the window size is set to 5. Our model parameters are as follows: the batch size is 246, the embedding size of the entity type is 32, and the embedding size of the edge is 64.

5.3. Performance Comparison

In our research setting, the sentiment classes are defined as [-1, -0.5) for class 1, [-0.5, 0) for class 2, [0, 0.5) for class 3, and [0.5, 1] for class 4. We choose the following four measurements to evaluate our model; the detailed formulas are described in Appendix M.

Recall Ratio (Recall@K) measures the percentage of sentiment scores that we correctly predict for each user, given the ground truth sentiment-scored samples, based on the recommendation algorithm. We set K = 20. The higher the recall ratio, the better the predicted sentiment scores given the recommendation algorithm. The details of the recall formula are described in Appendix M.

Hit Ratio (HR@K) measures the percentage of correctly predicted sentiment scores of the focal user based on the recommendation algorithm, given the users' corresponding tweets in the testing set:

HR@K = Number of Hits@K / |Testing Sets of Tweets|

We set K = 20. The details of the Hit Ratio formula are described in Appendix M.

Normalized Discounted Cumulative Gain (NDCG@K) measures, for each user, how the ranking of the predicted sentiment scores produced by the recommendation algorithm affects the result in the testing set. We set K = 20. The details of the NDCG formula are described in Appendix M.

AUC measures the probability that a truly recommended score for the focal user is ranked above a score that is not truly recommended. The higher the AUC, the better the model's recommendations. The details of the AUC are described in Appendix M.
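To make the ranking measurements concrete, the sketch below shows one common binary-relevance formulation of HR@K and NDCG@K for a single user. The exact formulas used in this dissertation are given in Appendix M; the function names and the toy ranking are hypothetical.

```python
import numpy as np

def hit_ratio_at_k(ranked_items, relevant_items, k=20):
    """HR@K: fraction of held-out test items that appear in the top-K list."""
    topk = set(ranked_items[:k])
    return sum(item in topk for item in relevant_items) / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k=20):
    """NDCG@K with binary relevance: discounted gain of hits in the top-K,
    normalised by the gain of an ideal ranking."""
    relevant = set(relevant_items)
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(rank + 2)
               for rank in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0

# toy usage: the model ranks 8 tweets for one user; 3 of them are held-out positives
ranked = ["t7", "t2", "t9", "t1", "t5", "t3", "t8", "t4"]
held_out = ["t2", "t3", "t6"]
print(hit_ratio_at_k(ranked, held_out, k=5))            # 0.333...
print(round(ndcg_at_k(ranked, held_out, k=5), 3))       # about 0.296
```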
From table 19 to table 22, we show the recommendation results for networks with 10,000, 20,000, and 50,000 nodes. Our model gives the best recommendation results on all measurements.

In table 19, we compare our model with the benchmark models at different node sizes. The result shows that our model has the highest recall ratio, meaning that our recommendation model gives the best truly predicted sentiment scores given the ground truth sentiment score samples.

Table 19. Recall@20 Results on three datasets. The best method is bolded

Method | 10K node | 20K node | 50K node
DeepWalk | 0.1142 | 0.1139 | 0.1104
Node2vec | 0.1180 | 0.1164 | 0.1158
Metapath2vec | 0.1195 | 0.1184 | 0.1143
GCN | 0.1358 | 0.1341 | 0.1310
GAT | 0.1389 | 0.1376 | 0.1354
DySAT | 0.1794 | 0.1789 | 0.1745
DHNE | 0.1784 | 0.1766 | 0.1727
Our model | 0.2371 | 0.2295 | 0.2289

In table 20, we compare our model with the benchmark models at different node sizes. The result shows that our model has the highest hit ratio, meaning that our recommendation model gives the best truly predicted sentiment scores of the focal user given these users' corresponding tweets.

Table 20. HR@20 Results on three datasets. The best method is bolded

Method | 10K node | 20K node | 50K node
DeepWalk | 0.1632 | 0.1604 | 0.1582
Node2vec | 0.1604 | 0.1693 | 0.1683
Metapath2vec | 0.1659 | 0.1641 | 0.1633
GCN | 0.1885 | 0.1889 | 0.1858
GAT | 0.1897 | 0.1884 | 0.1852
DySAT | 0.2302 | 0.2293 | 0.2286
DHNE | 0.2286 | 0.2271 | 0.2257
Our model | 0.2877 | 0.2794 | 0.2742

In table 21, we compare our model with the benchmark models at different node sizes. The result shows that our model has the highest NDCG value, meaning that our recommendation model gives the best truly predicted sentiment scores of the focal user when the ranking of the corresponding tweets is taken into account.

Table 21. NDCG@20 Results on three datasets. The best method is bolded

Method | 10K node | 20K node | 50K node
DeepWalk | 0.0627 | 0.0623 | 0.0696
Node2vec | 0.0612 | 0.0606 | 0.0689
Metapath2vec | 0.0636 | 0.0627 | 0.0692
GCN | 0.0882 | 0.0874 | 0.0848
GAT | 0.0893 | 0.0886 | 0.0844
DySAT | 0.1247 | 0.1236 | 0.1215
DHNE | 0.1276 | 0.1253 | 0.1232
Our model | 0.1842 | 0.1831 | 0.1844

In table 22, we compare our model with the benchmark models at different node sizes. The result shows that our model has the highest AUC value, meaning that our recommendation model best separates truly recommended scores for the focal user from scores that are not truly recommended.

Table 22. AUC Results on three datasets. The best method is bolded

Method | 10K node | 20K node | 50K node
DeepWalk | 0.7694 | 0.7628 | 0.7469
Node2vec | 0.7921 | 0.7942 | 0.7836
Metapath2vec | 0.8015 | 0.8153 | 0.8036
GCN | 0.8153 | 0.8168 | 0.8045
GAT | 0.8166 | 0.8173 | 0.8034
DySAT | 0.8626 | 0.8621 | 0.8538
DHNE | 0.8693 | 0.8658 | 0.8594
Our model | 0.8842 | 0.8856 | 0.8811

5.4. Fairness of the recommendation

After comparing our model with the benchmark models on accuracy, we also need to compare the models on fairness. Since there is no single standard way to measure fairness, we choose the demographic parity difference and the equal opportunity difference as the measurements to evaluate the fairness of the model.

5.4.1. Fairness Measures

Demographic Parity (DP): demographic parity (statistical parity) means that whether a focal user receives a recommendation should be independent of his or her sensitive attributes (features). We use the demographic parity difference as the measurement:

Demographic Parity Difference = P(ŷ = 1 | x_sensitive = 0) - P(ŷ = 1 | x_sensitive = 1)

Equal Opportunity (EO): equal opportunity means that the probability of the focal user being assigned a positive outcome should be equal across all subgroups of users. We use the equal opportunity difference as the measurement:

Equal Opportunity Difference = P(ŷ = 1 | x_sensitive = 0, y = 1) - P(ŷ = 1 | x_sensitive = 1, y = 1)

In table 23, we compare our model with the random walk based models and the graph neural network based models on demographic parity. Our model gives the best predicted score results given the sensitive attribute (location).

Table 23. Demographic Parity Results on three datasets.
The best method is bolded Method 10K node 20K node 50K node DeepWalk 0.1274 0.1249 0.1263 Node2vec 0.1294 0.1258 0.1303 Metapath2vec 0.1352 0.1395 0.1358 GCN 0.1742 0.1763 0.1798 GAT 0.1803 0.1865 0.1877 FairHIN 0.2048 0.2085 0.2094 FairGNN 0.2103 0.2162 0.2189 Our model 0.2251 0.2263 0.2285 In table 24, we compare our model with the random walk based models, the graph neural network based model with the equal opportunity. Our model have the best predicted scores 86 results given the corresponding ground truth sentiment scores and the sentiment attributes as the location. 24 Table 24. Equal Opportunity Results on three datasets. The best method is bolded Method 10K node 20K node 50K node DeepWalk 0.1301 0.1297 0.1307 Node2vec 0.1316 0.1302 0.1341 Metapath2vec 0.1348 0.1363 0.1396 GCN 0.1854 0.1821 0.1863 GAT 0.1865 0.1879 0.1896 FairHIN 0.2143 0.2174 0.2196 FairGNN 0.2269 0.2286 0.2293 Our model 0.2301 0.2317 0.2321 87 6. CONCLUSION The recommendation algorithm in social network is very important for the politician to diffuse their political opinions. Compared with traditional survey method, electronic election is more efficient and popular for their followers. Therefore, using the twitter as campaign tool to declare the political polices and attract more voters is very important to the political candidates. In order to hear the voice of the minority of users, platform need to design the fairness recommendation algorithms to recognize the preference of the users. However, all of the recommendation models are would give the bias results because there is always sampling bias happened during the training process. To solve this bias in the post-processing part is what our proposed model work. In this paper, we proposed a novel meta-path based recommendation method on the election social network. This model learns the information from the different types of nodes, edges, and time periods. Our model consider both the prediction performance and the fairness of the model. In order to calculate the prediction of the recommendation, we use the advanced meta-path based model to capture the heterogeneous information of the network. In the meantime, we combine both meta-path based embedding method with the attention based method. This can somewhat help to decrease the difference between the users belong to the majority group and the users belong to the minority group. The experimental results shows that our model performs better than baselines models on the real world dataset as the performance of Recall ratio, AUC, Hit Ratio, and NDCG. Our model also performs better than baselines models on different kind of dataset as the performance of demographic parity and the equal opportunity. 88 APPENDICES 89 APPENDIX A. Notation in essay one 25 Table A.1. Equal Opportunity Results on three datasets. 
The best method is bolded Notation Description xi Demographic features for user i yi Potential outcome for user i after get treatment t yi (0) Counterfactual outcome for user i after get treatment t = 0 yi (1) Factual outcome for user i after get treatment t = 1 ti User i get treatment t A Adjacency matrix of the observed network G G{E, V} We have edge E and vertex V to represent a graph G 𝜇 and 𝜈 Any two node 𝜇 and 𝜈 select from graph G z The state layer in Graph Neural Network 𝑊𝑘 Weight matrix in Graph Neural Network on k th iteration 𝐵𝑘 Bias matrix in Graph Neural Network on k th iteration N The number of neighbor nodes X Demographic features Space Y Outcome Space H Hidden Space 𝜙 The embedding, also called encoder function to map the tuple ({xi , yi , t i }, A) into the hidden space h 𝜓 The decoder function to map tuple ({xi , yi , t i }, A) into the outcome space h The hidden space, also called embedding space h 90 APPENDIX B. Propensity Score, Inverse Probability Treatment Weighting estimation, and Doubly Robust method Un-confound means: yi (0), yi (1) ⊥ t i |𝑥𝑖 , 𝑐𝑖 We can calculate the Propensity Score as : e(xi ) = P(Ti = 1|Xi = xi ) ∀x ∈ X If the propensity score is a good balancing score, we will get: yi (0), yi (1) ⊥ t i |𝑒(𝑥𝑖 ) This way is very efficient to remove the biases coming from the non random assignment. Under confound environment (also called balanced property) e(xi ) = E[Ti = 1|𝑒(𝑥𝑖 )] Inverse Probability Treatment Weighting estimation is the way to balance the difference between two groups: 𝑛 1 𝑊𝑖 𝑌𝑖 (1 − 𝑊𝑖 )𝑌𝑖 τIPTW = ∑( ) −( ) 𝑛 𝑒̂ (𝑥𝑖 ) 1 − 𝑒̂ (𝑥𝑖 ) 𝑖=1 The quality of this estimator depends on the estimation of 𝑒̂ (𝑥𝑖 ) Because there exists confound, we cannot get good propensity scores. Covariate Balanced Propensity Score is the most advanced tool to handle the network data. It is also the double robust method. Doubly Robust method is a good estimator when one of the two assumptions are right: 1 . Outcome function is linear but propensity is not logistic 2 . Outcome function is not linear but propensity is logistic Because the outcome function Y = ϕ(x) is nonlinear, and propensity score cannot always achieve by logistic regression, DR method is still not good enough. 91 Our aim is to minimize the balanced score and variance, which is the equal to minimize the IPM + regulator form. The math proof can be found in Kallus (2020) 1 1 𝐸[(𝜏𝑤 − 𝜏)2 |𝑋, 𝑇] = 2 ∑𝑤𝑖 𝑓(𝑥𝑖 )2 + 2 ∑𝑤𝑖2 𝜎 2 𝑥𝑖 𝑛 𝑛 92 APPENDIX C. Ignore the hidden confounders Go back to review the content of Assumption 1 : There is no hidden confounder C affect on both X, Y and T yi (0), yi (1) ⊥ t i |𝑥𝑖 , 𝑐𝑖 Pearl (2000) proposed the causal graph model to identify the causal inference problem. The basic idea is finding an observed proxy variable to approximate the unobserved confounder C. In this setting, we need to approximate the C by the tuple (X,Y.T) 11 Figure A.1. Causal Graph Model to identify the causal inference problem C T X Y Because we cannot measure the C part, directly calculate the conditional probability P(Y|X, T) may generate the estimation bias. Pearl gave a method called do-calculus to solve this problem. Do-Calculus can be nominated by P(Y|X , do(T)) and transformed to the conditional probability as follows 93 P(Y = y|X, do(T = t)) = ∑ 𝑃𝑌|𝑇,𝐶 (𝑦|𝑡, 𝑐)𝑃𝐶 (𝑐) 𝐶∈[0,1] In our average treatment effect setting, we can calculate the ATE as ATE = P(Y = 1|X, do(T = 1)) − P(Y = 1|X, do(T = 0)) = ∑ (𝑃𝑌|𝑇,𝐶 (1|1, 𝑐) − 𝑃𝑌|𝑇,𝐶 (1|0, 𝑐)) 𝑃𝐶 (𝑐) 𝐶∈[0,1] 94 APPENDIX D. 
Diagram of the Domain Adaptation for counterfactual framework Let us have a look at the following diagram, circles mean the treated group and dots mean the control group. We need to use Maximum Mean Distance or Wasserstein 1 Distance to calculate the distance between two data domains. If these two data domains can get closer, the covariates variables become more balanced on the representation space. The dashed line is the mean square error to measure the error between predict y and real y 12 Figure A.2. Domain Adaptation for counterfactual framework Classification Classification between control between control group and group and treatment group treatment group Source Source Domain Domain Minimize the difference between two domains Target Target Domain Domain Sample Space Feature Space 95 APPENDIX E. Maximum mean discrepancy (MMD) The most common way to measure the distance between the target data and source data is the Maximum mean discrepancy (MMD). Let define the source data x is coming from a distribution p and the target data y is coming from a distribution q First step, define a function f mapping the sample into a hidden space. 𝑓(𝑥) 𝑎𝑛𝑑 𝑓(𝑦). The mean value between these two function is 𝐸(𝑓(𝑥)) and 𝐸(𝑓(𝑥)) Second step, calculate their difference and find the maximum function f MMD[ℱ, p, q] = 𝑠𝑢𝑝 (𝐸 (𝑓𝑥~𝑝 (𝑥)) − 𝐸 (𝑓𝑦~𝑞 (𝑥))) 𝑓∈ℱ Third step, replace the distribution with the sample 𝑚 𝑛 1 1 MMD[ℱ, x, y] = 𝑠𝑢𝑝 ( ∑ 𝑓(𝑥𝑖 ) − ∑ 𝑓(𝑦𝑖 )) 𝑓∈ℱ 𝑚 𝑛 𝑖=1 𝑖=1 If the distribution of x is identical to the distribution of y, their expectation after the mapping should be the same. Therefore, if we want to transfer the source domain to the target domain successfully, we just need to minimize the maximum mean discrepancy value. Sometimes, the function 𝑓 is not easy to calculate, we need to construct a reproduction kernel Hilbert space (RKHS) to make this adaptation domain process easy to compute. MMD[ℱ, p, q] = 𝑠𝑢𝑝 (𝐸 (𝑓𝑥~𝑝 (𝑥)) − 𝐸 (𝑓𝑦~𝑞 (𝑥))) 𝑓∈ℱ𝐻 ≤1 = 𝑠𝑢𝑝 (𝐸𝑝 [〈𝜙(𝑥), 𝑓〉𝐻 ] − 𝐸𝑞 [〈𝜙(𝑥), 𝑓〉𝐻 ]) 𝑓∈ℱ𝐻 ≤1 = 𝑠𝑢𝑝 (〈𝜇𝑝 − 𝜇𝑞 , 𝑓〉) 𝑓∈ℱ𝐻 ≤1 = ‖𝜇𝑝 − 𝜇𝑞 ‖𝐻 2 MMD2 [ℱ, p, q] = ‖𝜇𝑝 − 𝜇𝑞 ‖𝐻 = 𝐸𝑝 〈𝜙(𝑥), 𝜙′(𝑥)〉𝐻 + 𝐸𝑝 〈𝜙(𝑦), 𝜙′(𝑦)〉𝐻 − 2𝐸𝑝,𝑞 〈𝜙(𝑥), 𝜙(𝑦)〉𝐻 96 If we choose the Gaussian kernel 𝑚 𝑛 𝑚,𝑛 1 1 2 MMD2 [ℱ, p, q] = ∑ 𝑘(𝑥𝑖 , 𝑥𝑗 ) + ∑ 𝑘(𝑦, 𝑦𝑗 ) − ∑ 𝑘(𝑥𝑖 , 𝑦𝑗 ) 𝑚(𝑚 − 1) 𝑛(𝑛 − 1) 𝑚𝑛 𝑖≠𝑗 𝑖≠𝑗 𝑖≠𝑗 2 ‖𝑥−𝑥 ′ ‖ Where 𝑘(𝑥, 𝑥 ′ ) = 𝑒𝑥𝑝 (− ) 2𝜎 2 13 Figure A.3. Domain Adaptation for counterfactual framework with Maximum mean discrepancy Classification between Classification between control group and control group and . treatment group treatment group Source Domain 𝑓𝑥~𝑝 (𝑥) Maximize the confusion part of domains Target 𝑓𝑦~𝑞 (𝑥) Domain 𝐸 (𝑓𝑥~𝑝 (𝑥)) 𝐸 (𝑓𝑦~𝑞 (𝑥)) Sample Space Max Mean Feature Space Distance 97 APPENDIX F. Wasserstein Distance Another way to calculate the distance between two distributions is Wasserstein Distance. Compare with the MMD and some other distance measure such as KL divergence, JS divergence, Wasserstein Distance has some advantages : 1 . It can measure the distance between discrete distribution and continuous distribution 2 . Even two distributions has very long distance, the Wasserstein Distance can still measure the length but not give constant or zero result 3 . Wasserstein barycenter can describe the geographic character compare to the Euclidean average However, Wasserstein Distance is very hard to compute, only Wasserstein 1 Distance can be calculated easily. The basic idea is treating the distribution as a set of stones. 
If we take each stone from one distribution p to another distribution q, we want to find a minimum path to achieve the task. After we find the minimum path for each stone and sum all these path up, we can equally get the distance between two distributions. 1 𝑝 𝑊𝑝 (𝜇, 𝜈) = ( 𝑖𝑛𝑓 ∫‖𝑥 − 𝑦‖𝑝 𝑑𝛾(𝑥, 𝑦)) 𝛾∈𝛤(𝜇,𝜈) 𝑤ℎ𝑒𝑟𝑒 𝑝 𝑖𝑠 𝑡ℎ𝑒 𝑙𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝑐𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑡, 𝜇, 𝜈 𝑎𝑟𝑒 𝑡𝑤𝑜 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑠, 𝑎𝑛𝑑 𝛾(𝑥, 𝑦) 𝑖𝑠 𝑡ℎ𝑒 𝑗𝑜𝑖𝑛𝑡 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛 If we choose dimension equals to 1, it becomes the Wasserstein 1 Distance: 1 1 1 𝑛 𝑝 𝑝 𝑊𝑝 (𝜇, 𝜈) = (∫ ‖𝑥 − 𝑦‖𝑝 𝑑𝛾(𝑥, 𝑦)) = (∑‖𝑥𝑖 − 𝑦𝑖 ‖𝑝 ) 0 𝑖=1 98 14 Figure A.4. Wasserstein Distance 𝑓𝑥~𝑝 (𝑥) 𝑓y~𝑞 (𝑦) 99 APPENDIX G. Node2Vec and :Latent Features Graph embedding extends the representation from structure data to graph data. Given a graph G{E, V}, we have edge E and vertex (node) V to represent a graph. We need to find a mapping function 𝜙 to encode each vertex V into a hidden embedding space ℋ. This embedding space is the same with the representation space. We expect that embedding space can still contain the information in sample space. In another word, the similarity between node 𝜇 and 𝜈 should be as equal to the similarity between embedding node 𝜙(𝜇) and 𝜙(𝜈) as possible 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 (𝜇, 𝜈) ≈ 𝜙(𝜇)𝑇 𝜙(𝜈) (5) For each node, we use its feature, con-current probability, and etc, to represent the vector. For example, node μ can be vectorized to 𝜙(𝜇) = [0.1,0.2,0.4, … . ,0.0] and node ν can be vectorized to 𝜙(𝜈) = [0.3,0.1,0.2, … . ,0.7]. Each node is encoded as 1*D dimension vector and all nodes are encoded as V*D dimensions matrix. After we input the graph G, label Y and calculate the object function 𝐿 = ∑𝜇,𝜈∈𝑉∗𝑉‖𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 (𝜇, 𝜈) − 𝜙(𝜇)𝑇 𝜙(𝜈)‖2 (6) , we can get the embedding space. The encoder is just a simple lookup style embedding and the similarity function can be chosen from adjacency based, multi-hop based, and Random-walk based methods. 100 15 Figure A.5. Node2Vec and Latent Features Treatment Treatment group group 𝜇 Representation mapping 𝜙 𝜙(𝜇) Control 𝜈 group 𝜙(𝜈) Control group 𝜙(𝜇) = [0.1, 0.2 ,0.4 ,…., 0.0] 𝜙(𝜈) = [0.3, 0.1, 0.2, …, 0.7] Embedding Network Space Sample Space (Representation Space) Edge relationship is always represented by an adjacency matrix A, we use it to replace the similarity function 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 (𝜇, 𝜈). The similarity function can be adjacency based, multi-hop based, and Random-walk based. The simple encoder function 𝜙 is just a look-up function, we will replace it by the graph neural network later. 26 Table A.2. Pros and Cons of the Similarity function Choices Embedding Method Loss Function Pros and Cons Adjacency based 2 Only consider the 𝐿= ∑ ‖𝐴𝜇,𝜈 − 𝜙(𝜇)𝑇 𝜙(𝜈)‖ 𝜇,𝜈∈𝑉∗𝑉 complete graph Multi-hop based 𝑘 2 Need to define the 𝐿= ∑ ‖𝐴𝜇,𝜈 − 𝜙(𝜇)𝑇 𝜙(𝜈)‖ 𝜇,𝜈∈𝑉∗𝑉 pairwise node similarity by researcher Random-walk based 𝐿 = ∑ ∑ − 𝑙𝑜𝑔(𝑃(𝜈|𝜙(𝜇))) It can be any random 𝜇∈𝑉 ∈𝑁𝑅 (𝜈) walk chain to generate the pairs nodes. 101 We choose the Random-walk based method because: 1) It can define the relationship very flexible, only consider the pairs node that con-current on the random-walk chain 2) Compare with two other methods, it is efficient to compute. The Random-walk embedding method is using the skip-gram idea from the word to vector method. The basic idea is as follows: Step1: Choosing any node μ and generating the random walk start from the node μ with some strategy S. Step2: Calculate the probability of visiting a node ν from the node μ, 𝑃(𝜈|𝜙(𝜇)) , by the random-walk chain Step3: Optimize the loss function 𝐿 = ∑𝜇∈𝑉 ∑𝜈∈𝑁𝑆 (𝜈) − 𝑙𝑜𝑔(𝑃(𝜈|𝜙(𝜇))). 
Plug in the softmax function, we can get the following function exp(𝜙𝜇𝑇𝜙 ) 𝜈 𝐿 = ∑𝜇∈𝑉 ∑𝜈∈𝑁𝑅(𝜈) − 𝑙𝑜𝑔 (∑ 𝑇 ) (7) 𝑛∈𝑉 exp(𝜙𝜇 𝜙𝜈 ) Step4: Negative Sampling for compute efficient exp(𝜙𝜇 𝑇𝜙 ) 𝜈 𝑙𝑜𝑔 (∑ 𝑇 ) = log (𝜎(𝜙𝜇𝑇 𝜙𝜈 )) − ∑𝑛∈𝑉 log σ(𝜙𝜇𝑇 𝜙𝜈 ) (8) 𝑛∈𝑉 exp(𝜙𝜇 𝜙𝜈 ) There is an example to illustrate this process, let’s assume we have a small network and try to embed it into the vector space by skip-gram9 9 log((𝑤,𝑐)𝐷) The skip-gram method can be used to solve the word2vec problem The basic form is , where w is the 𝑏 (𝑤)(𝑐) word and c is the context, and D is the total number of word-context pairs 102 What kind of latent features we can extract from nodes? Let assume we have a network as follows: 16 Figure A.6. Node2Vec Visualization We have 34 nodes, and 78 edges. Red and Blue are just label for each nodes, let’s say democratic and republican, we want to extract some features to classify each node By using random walk method, we can generate walk path on each node, choosing walk length as 10, it generates 34*10 walks: [0, 11, 0, 11, 0, 12, 3, 12, 3, 1] [0, 21, 0, 2, 7, 2, 0, 8, 2, 8] [0, 2, 3, 7, 0, 21, 0, 6, 0, 31] [0, 4, 0, 5, 10, 0, 13, 2, 7, 2] [0, 31, 0, 31, 0, 17, 0, 8, 33, 9] [0, 13, 0, 2, 27, 2, 3, 7, 3, 1] 103 [0, 1, 17, 1, 0, 3, 1, 3, 2, 1] [0, 13, 2, 1, 2, 0, 2, 1, 2, 1] [0, 12, 0, 31, 0, 10, 5, 16, 5, 6] [0, 31, 28, 31, 28, 33, 28, 33, 9, 33] [1, 13, 1, 2, 1, 17, 1, 0, 13, 0] Put these walk paths into the skip-gram model, we can get the features on each node, we choose hidden space dimension as 12 : Node 1 has features with 1*12: array([ 0.03700204, -0.0196846 , 0.038861 , 0.00561627, 0.02323475, -0.04270516, 0.01880202, 0.00162641, -0.02773627, -0.02885933, -0.02692292, -0.02374379], dtype=float32) The element means the probability of each node happened on each walk path. By using these features, we can reconstruct the node location and graph structure in the R dimension space. The node embeddings is 34*12 matrix After this encoding step, we need to reconstruct these features on a 2D space figure: 17 Figure A.7. Node2Vec for Node Classification 104 We can find that our 34*12 features space can nicely help to classify each node. 105 APPENDIX H. Balanced Covariates Measures and Imbalanced Covariates Measures From Table A.3 to Table A.4 show the balanced testing results of covariates such as retweet_count, scource, followers_count, location, statuses_count, friends_count, and favourites_count. The standard mean value and the empirical cumulative density function are expected to close to zero and the difference Variance Ratio is expected to close to one. 27 Table A.3. Balanced Covariates Measures on two assigned group Summary of Means Means Std. Diff. Var. eCDF eCDF Balance for Treated Control Mean Ratio Mean Max Matched Data: distance 0.5528 0.5525 0.0022 1.0210 0.0024 0.0222 retweet_count 62.0180 41.3845 0.1830 0.4142 0.0664 0.1531 source 2.7947 2.6499 0.0406 0.6929 0.0256 0.1489 followers_count 989.6921 1208.5635 -0.0480 0.7915 0.0480 0.1187 location 80.6824 89.9013 -0.0912 0.9353 0.0279 0.0901 statuses_count 10693.3301 13188.3687 -0.1118 0.7131 0.0420 0.1003 friends_count 423.0513 522.9355 -0.1642 0.6954 0.0472 0.0884 favourites_count 1154.4508 1525.9780 -0.1007 0.5520 0.0245 0.0578 28 Table A.4. Imbalanced Covariates Measures on two assigned group Summary of Means Means Std. Diff. Var. 
eCDF eCDF Balance for Treated Control Mean Ratio Mean Max Matched Data: distance 0.5528 0.4749 0.6384 0.7763 0.1537 0.2460 retweet_count 62.0180 149.0133 -0.7714 0.0691 0.0476 0.1003 source 2.7947 3.7570 -0.2696 0.5307 0.0370 0.1301 followers_count 989.6921 648.8513 0.0747 2.7630 0.0311 0.0743 location 80.6824 61.4963 0.1898 1.4801 0.0569 0.1295 statuses_count 10693.3301 7948.9352 0.1229 2.3092 0.0364 0.0668 friends_count 423.0513 425.7408 -0.0044 0.8257 0.0168 0.0546 favourites_count 1154.4508 1196.1753 -0.0113 0.7098 0.0261 0.0865 106 The Table A.5 and A.6 show the balanced testing results of observed covariates incorporating with the latent graph features from fea0 to fea19 generated from the GNN. 29 Table A.5. Balanced Covariates Measures on two assigned group with latent features Summary of Means Means Std. Mean Diff. Var. eCDF eCDF Balance for All Treated Control Ratio Mean Max Data: distance 0.5816 0.4443 0.8264 0.9102 0.2093 0.3116 retweet_count 22.9307 75.0648 -0.9012 0.0629 0.0546 0.1308 source 62.0180 149.0133 -0.7714 0.0691 0.0476 0.1003 followers_count 2.7947 3.7570 -0.2696 0.5307 0.0370 0.1301 location 989.6921 648.8513 0.0747 2.7630 0.0311 0.0743 statuses_count 80.6824 61.4963 0.1898 1.4801 0.0569 0.1295 friends_count 10693.3301 7948.9352 0.1229 2.3092 0.0364 0.0668 favourites_count 423.0513 425.7408 -0.0044 0.8257 0.0168 0.0546 fea0 0.1446 0.1437 0.0007 1.2589 0.0110 0.0386 fea1 -0.0729 -0.0264 -0.0411 0.9343 0.0113 0.0345 fea2 -0.1827 -0.2136 0.0246 0.9488 0.0177 0.0551 fea3 -0.1932 -0.1733 -0.0165 1.0388 0.0130 0.0361 fea4 -0.1254 -0.0065 -0.1092 0.8355 0.0367 0.0742 fea5 0.2121 0.1933 0.0151 1.0720 0.0081 0.0268 fea6 0.1856 0.1046 0.0728 0.9165 0.0110 0.0368 fea7 -0.2394 -0.2510 0.0092 1.0560 0.0124 0.0380 fea8 -0.2488 -0.1823 -0.0602 0.8491 0.0121 0.0433 fea9 0.0744 0.0662 0.0066 1.0174 0.0091 0.0302 fea10 0.3296 0.2738 0.0417 0.8545 0.0131 0.0357 fea11 -0.0661 0.0476 -0.0948 1.0436 0.0134 0.0374 fea12 0.0686 0.0456 0.0177 1.1773 0.0102 0.0278 fea13 0.0410 0.0742 -0.0299 0.9019 0.0253 0.0597 fea14 0.0189 0.0353 -0.0129 1.3130 0.0080 0.0282 fea15 0.0575 0.1055 -0.0422 0.8429 0.0181 0.0614 107 Table A.5. (cont’d) fea16 0.1466 0.0688 0.0710 0.9113 0.0210 0.0544 fea17 0.0190 0.0857 -0.0612 0.6547 0.0213 0.0395 fea18 0.0129 0.0615 -0.0402 1.2819 0.0208 0.0483 fea19 0.0423 0.1656 -0.0867 0.8996 0.0093 0.0250 30 Table A.6. Imbalanced Covariates Measures on two assigned group with latent features Summary of Means Means Std. Diff. Var. 
eCDF eCDF Balance for Treated Control Mean Ratio Mean Max Matched Data: distance 0.5816 0.5808 0.0045 1.0166 0.0018 0.0194 favorite_count 22.9307 13.2903 0.1667 0.4832 0.0760 0.1993 retweet_count 62.0180 34.4026 0.2449 0.5106 0.0759 0.1548 source 2.7947 2.5350 0.0728 0.9014 0.0241 0.1370 followers_count 989.6921 1723.3125 -0.1608 0.6938 0.0416 0.1151 location 80.6824 89.2738 -0.0850 0.8979 0.0268 0.0717 statuses_count 10693.3301 14157.3344 -0.1552 0.6561 0.0368 0.0934 friends_count 423.0513 440.6401 -0.0289 0.8804 0.0229 0.0926 favourites_count 1154.4508 1396.1416 -0.0655 0.5590 0.0170 0.0456 fea0 0.1446 0.2248 -0.0594 1.9734 0.0521 0.1081 fea1 -0.0729 -0.0382 -0.0307 1.7857 0.0169 0.0463 fea2 -0.1827 -0.1964 0.0109 1.2429 0.0199 0.0535 fea3 -0.1932 -0.1918 -0.0012 1.5741 0.0178 0.0608 fea4 -0.1254 -0.1292 0.0035 1.4373 0.0215 0.0686 fea5 0.2121 0.2515 -0.0317 1.1118 0.0194 0.0618 fea6 0.1856 0.2147 -0.0261 1.0228 0.0369 0.0940 fea7 -0.2394 -0.2753 0.0286 1.4045 0.0242 0.0614 fea8 -0.2488 -0.2463 -0.0023 1.4477 0.0243 0.0605 fea9 0.0744 -0.0087 0.0670 1.1829 0.0278 0.0956 fea10 0.3296 0.2961 0.0250 1.1525 0.0137 0.0493 fea11 -0.0661 -0.0713 0.0043 1.5688 0.0232 0.0654 108 Table A.6. (cont’d) fea12 0.0686 0.1819 -0.0874 1.3711 0.0312 0.0809 fea13 0.0410 0.1026 -0.0554 0.9797 0.0356 0.0910 fea14 0.0189 -0.0081 0.0214 1.8832 0.0241 0.0616 fea15 0.0575 0.1408 -0.0732 1.0851 0.0411 0.1004 fea16 0.1466 0.2004 -0.0490 1.3431 0.0503 0.1145 fea17 0.0190 -0.0175 0.0335 0.7556 0.0283 0.0855 fea18 0.0129 0.0718 -0.0487 1.5696 0.0243 0.0698 fea19 0.0423 0.0040 0.0269 1.0470 0.0291 0.0723 Table from A.7 to A.11 show the balanced testing result specifically on the propensity score with different network scale. (10,000 users, 20,000 users, 40,000 users, 80,000 users, and 100,000 users.) 31 Table A.7. Propensity Score Balanced Improvement (10k) Std. Mean Diff. Var. Ratio eCDF Mean Original Imbalanced Dataset 0.6384 0.7763 0.1537 After Nearest Neighbor Matching 0.0022 1.0210 0.0024 Imbalanced Dataset including 0.8264 0.9102 0.2093 GNN embedding After Nearest Neighbor Matching 0.0045 1.0166 0.0018 (including GNN embedding) 32 Table A.8. Propensity Score Balanced Improvement (20k) Std. Mean Diff. Var. Ratio eCDF Mean Original Imbalanced Dataset 0.2220 0.3798 0.0272 After Nearest Neighbor Matching 0.0058 0.9895 0.0062 Imbalanced Dataset including 0.2945 0.7250 0.0696 GNN embedding After Nearest Neighbor Matching 0.0000 0.9982 0.0009 (including GNN embedding) 33 Table A.9. Propensity Score Balanced Improvement (40k) Std. Mean Diff. Var. Ratio eCDF Mean Original Imbalanced Dataset 0.2822 0.3654 0.0392 After Nearest Neighbor Matching 0.0131 1.0597 0.0008 Imbalanced Dataset including 0.3006 0.5173 0.0575 GNN embedding 109 Table A.9. (cont’d) After Nearest Neighbor Matching 0.0094 0.9608 0.0002 (including GNN embedding) 34 Table A.10. Propensity Score Balanced Improvement (80k) Std. Mean Diff. Var. Ratio eCDF Mean Original Imbalanced Dataset 0.2727 0.3801 0.0369 After Nearest Neighbor Matching 0.0010 0.9794 0.0003 Imbalanced Dataset including 0.2877 0.5101 0.0560 GNN embedding After Nearest Neighbor Matching 0.0046 0.9621 0.0001 (including GNN embedding) 35 Table A.11. Propensity Score Balanced Improvement (100k) Std. Mean Diff. Var. 
Ratio eCDF Mean Original Imbalanced Dataset 0.3845 0.3994 0.0658 After Nearest Neighbor Matching 0.0004 1.0022 0.0001 Imbalanced Dataset including 0.5915 0.6630 0.1451 GNN embedding After Nearest Neighbor Matching 0.0002 1.0009 0.0000 (including GNN embedding) The table A.11 gives the visualization of the covariates differences between treated group and the control group. Two figures show that after the balanced, (based on the nearest neighbor matching) the covariates differences between treated group and the control group becomes smaller than before the balanced. Table A.12 shows the visualization of the covariates differences including the latent graph features between the treated group and the control group. The differences become smaller after the balanced. 110 18 Figure A.8. Covariate Balanced Table for theory based features 111 The Table A-8.7 shows the visualization of the covariates distribution differences between treated group and the control group. Normally, after the balanced, their differences become smaller than before. The Table A-8.9 and the Table A-8.10 show the visualization of the covariates distribution differences including the latent graph features between the treated group and the control group. The differences become smaller after the balanced. 19 Figure A.9. Distribution Balance for each theory based feature 112 Figure A.9. (cont’d) 113 Figure A.9. (cont’d) 114 Figure A.9. (cont’d) 115 20 Figure A.10. Covariate Balanced Table for both theory based feature and latent features: 116 21 Figure A.11. Distribution Balance for each theory based feature 117 Figure A.11. (cont’d) 118 Figure A.11. (cont’d) 119 Figure A.11. (cont’d) 22 Figure A.12. Distribution Balance for each latent feature 120 Figure A.12. (cont’d) 121 Figure A.12. (cont’d) 122 Figure A.12. (cont’d) 123 Figure A.12. (cont’d) 124 Figure A.12. (cont’d) 125 Figure A.12. (cont’d) 126 Figure A.12. (cont’d) 127 Figure A.12. (cont’d) 128 APPENDIX I. Notation in essay two 36 Table A.12. Notation in essay two Notation Explanation 𝑢𝜇 , 𝑢𝜈 Randomly choose user 𝜇 and 𝜈 y The ground truth of the sentiment scores 𝑦̂ The predicted truth of the sentiment scores 𝑥𝑖𝑇 Original feature for node 𝑖 with type 𝑡 𝑥𝑠𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑒 The sensitive attributes affects the fairness 𝛷 Meta-path h Node feature 𝑊𝜙 Type-specific transformation 𝜙 Importance of meta-path based on pair (𝑖, 𝑗) 𝑒𝑖𝑗 𝑎𝜙 Node level attention 𝜙 Weight of meta path 𝜙 constructed on node pair (𝑖, 𝑗) 𝛼𝑖𝑗 𝒩𝜙 Neighbors based on meta path 𝜙 𝑆𝜙 Semantic level node embedding 𝑏 Semantic level attention 𝑤𝜙 Importance of meta-path 𝜙 𝛽𝜙 Weight of meta path 𝜙 Z The final embedding 129 APPENDIX J. Data category for the election in social network We follow the decoding strategy suggested from Bulsara and Singh (2017) User level: politician, media, parties, religions (Yogi etc), Judge 37 Table A.13. User level node type User Type Example Description Politician Modi, Gandhi, Jayalalithaa, etc Politician account Parties BJP, AAP, INC etc Parties account Media The Times of India, The Hindu, etc Third Party Media account Religions Yogi, Swami, etc Religion account Function Level: Party, Activity 38 Table A.14. Function level node type Function Type Strategy focus Description Party Past Success Past achievement of the party Ideology Ideology of the party Activity Slogan The slogan party declare Issues The issues party meets in current time Policies The policy of party declare to make Promise Promise to satisfy the voters in exchange of support 130 Table A.14. 
(cont’d) Fund Funds to support the party Candidate Highlight Candidate Highlight the advantage of Candidate Attack the competitor Public Relations Internet Marketing Rallies and debates Content level: tweet, topic, hashtag, policy. Please check the details in word embedding model part 39 Table A.15. Content level node type Content Type Example Description Tweet #SadacharYatra The #BJP The content of each tweet government had promised employment to 13 lakh in 2007 but only 80,000 got jobs. #SadacharYatra Topic job, government, lakh, get, bjp, The topic extract from each political, provide, lose, claim, pass tweet Hashtag #SadacharYatra, #BJP The hashtag included in each tweet Policy Employee Policy The policy which the tweet declare 131 40 Table A.16. Data Coding Scheme Tweet Type Example Description Mention other user @BJP A tweet which is post or commented for mention some other users Hashtag #BJP A hashtag which is used to connect to the popular trend Retweet A tweet which is used to repost from the prior tweet Post tweet The original tweet which is not used to mention, reply or retweet 132 APPENDIX K. Word Embedding Model 41 Table A.17. Corpus of the Tweet (first 20 rows) ['delhi', 'chief', 'minister', 'arvind', 'kejriwal', 'leave', 'residence', 'meet', 'breakingnow'] ['nigerian', 'wild', 'animal', 'cancer', 'goa', 'bjp', 'minister', 'co', 'oxwlwf'] ['congress', 'bjp', 'face', 'first', 'time', 'democracy', 'politic', 'aadmi_reminiscence', 'european_nationalism'] ['angry', 'rahuls_statement', 'bjp', 'bald', 'bjp', 'man_lead', 'protest', 'bald', 'people', 'http', 'co', 'vqfgwqjwcc'] ['narendra', 'modi', 'lead', 'bjp', 'government', 'moodys', 'co', 'sdxjiem', 'congress', 'call', 'moody', 'agent', 'sangh'] ['interesting_point', 'sevanti_raise', 'aap', 'dharna', 'anarchy', 'demolition', 'babri_masjid', 'bjp', 'constitutional'] ['bad', 'news', 'bjps', 'lok_sabha', 'ticket_hideously', 'corrupt', 'amp', 'allege', 'murder', 'fighting', 'ganga'] ['aap', 'dharna', 'anarchy', 'demolition', 'babri_masjid', 'bjp', 'constitutional'] ['feel_compelle', 'explain', 'position', 'aap', 'never', 'congress', 'bjp', 'party'] ['think', 'tweet', 'antiaap', 'even', 'pro', 'bjp', 'call', 'tension', 'lol', 'traitor'] ['bad', 'news', 'bjps', 'lok_sabha', 'ticket_hideously', 'corrupt', 'amp', 'allege', 'murder', 'fighting', 'ganga'] ['donate', 'today', 'co', 'ssfqetb', 'modi', 'pm', 'co', 'eofpnub'] ['angry', 'rahuls_statement', 'bjp', 'bald', 'bjp', 'man_lead', 'protest', 'bald', 'people', 'http', 'co', 'vqfgwqjwcc'] ['membership_drive', 'bjp', 'shud', 'limited', 'election', 'purpose', 'shud', 'employ', 'right', 'process', 'change', 'system'] ['delhi', 'lucky', 'arvind', 'kejriwal', 'cm_hitte', 'road', 'accountable_police'] ['interesting_point', 'sevanti_raise', 'aap', 'dharna', 'anarchy', 'demolition', 'babri_masjid', 'bjp', 'constitutional'] ['sir', 'importantly', 'ppl', 'want', 'know', 'bjp', 'quiet', 'allegation', 'shinde', 'support', 'dawood', 'friend', 'hafta', 'collection'] ['bjp', 'pm', 'candidate', 'narendra', 'modi', 'set', 'address', 'mega', 'rally', 'gorakhpur', 'today', 'live', 'coverage', 'air', 'country'] ['share', 'pic', 'independent', 'mp', 'bihar', 'join', 'bjp', 'support', 'bjp', 'bihar', 'increase', 'consistently', 'co', 'hjvveilz'] ['dear_chhidu', 'do', 'know', 'muslim', 'rehman', 'amp', 'bjp', 'muslim', 'bjp'] 133 42 Table A.18. 
The most frequent Bi-gram features of different Parties (Top 20) AAP BJP INC ('us', 'several') ('support', 'BJP') ('several', 'scams') ('Babri', 'Masjid') ('BJP', 'led') ('the', 'state') ('the', 'AAP') ('about', 'stabilitdevelopment') ('vote', 'share') ('Amethi', 'today') ('GDP', 'growth') ('Rajiv', 'Gandhi') ('AAP', 'Dharna') ('Narendra', 'Modis') ('the', 'demolition') ('Interesting', 'point') ('for', 'BJP') ('techie', 'behind') ('NDA', 'brought') ('BJP', 'government') ('ppl', 'manage') ('Nation', 'opinion') ('Uttar', 'Pradesh') ('MOTNPoll', 'Vote') ('women', 'members') ('Chandra', 'Bose') ('an', 'Annarchist') ('speaking', 'to') ('India', 'Today') ('Congress', 'VP') ('Perhaps', 'biggest') ('NDA', 'GDP') ('Gorakhpur', 'rally') ('2012', 'reelection') ('Sabha', 'projection') ('governments', 'decision') ('but', 'improving') ('BJP', 'surge') ('own', 'corruption') ('JoinAAP', 'SaveNation') ('Opinion', 'Poll') ('Doctor', 'said') ('small-scale', 'industries') ('share', 'in') ('waste', 'vote') ('Congress', 'does') ('to', 'Gujrat') ('formed', 'alliance') ('elitist', 'swipe') ('.Congress', 'debacle') ('handedly', 'revived') ('BJPs', 'anarchy') ('Elections', 'carries') ('News-Nielsen', 'Opinion') ('Today', 'poll') ('economic', 'growth') ('against', 'poverty') ('against', 'Vadra') ('BJP-MDMK', 'alliance') ('challenged', 'youths') 134 43 Table A.19. The most frequent Tri-gram features of different Parties (Top 20) AAP BJP INC ('Dharna', 'is', 'anarchy') ('I', 'support', 'BJP') ('When', 'Manmohan', 'Singh') ('Interesting', 'point', 'Sevanti') ('of', 'my', 'country') ('Govt', 'of', 'India') ('Self', 'Help', 'Groups') ('GDP', 'growth', 'rate') ('Mahila', 'Vikas', 'Pariyojana') ('LIVE', 'will', 'be') ('BJP', 'in', 'UP') ('Funds', 'for', 'Indira') ('Gorakhpur', '#', ('The', 'techie', 'behind') ('Vijay', 'Sankalp', 'Rally') 'NaMoInGKP') ('Now', 'we', 'know') ('NDA', 'GDP', 'growth') ('Congress', 'Vice', 'President') ('News-Nielsen', 'Opinion', ('brought', 'more', ('District', 'Level', 'Meet') 'Poll') 'development') ('Nation', 'opinion', 'poll') ('for', 'Narendra', 'Modis') ('Vice', 'President', ' Rahul') ('MOTNPoll', 'Lok', 'Sabha') ('Narendra', 'Modi', 'pays') ('Vote', 'share', 'of') ('the', 'total', 'promised') ('party', 'can', 'increase') ('India', 'Today', 'Mood') ('for', 'Ahmedabad', 'Civil') ('sweep', 'the', 'state') ('praises', 'the', 'lea') ('UP', 'poll', 'tracker') ('Modi', 'pays', 'tribute') ('answer', 'on', 'misuse') ('Ravi', 'Shankar', 'Prasad') ('total', 'promised', 'jobs') ('crore', 'central', 'fund') ('Ahmedabad', 'Civil', ('twitter', 'accusing', 'us') ('DusDusKiDaud', 'Srinagar', 'Hospita') 'NDA') ('empowerment', 'of', ('electiontracker', 'shows', ('When', 'Arvind', 'Kejriwal') 'women') 'BJP') ('Chandra', 'Bose', 'on') ('on', 'the', 'upswing') ('today', 'at', 'Dandi') ('origins', 'have', ('SadacharYatra', 'against', ('Gandhi', 'addressing', 'embarrassed') 'Guj') 'women') ('Vijay', 'Shankhnad', 'Rally') ('to', 'sell', 'land') ('jobs', 'created', 'in') ('be', 'an', 'Annarchist') ('people', 'during', 'his') ('Priyanka', 'Gandhi', 'speaking') ('in', 'Tamil', 'Nadu') ('groups', 'and', ('bad', 'in', 'law') 'representations') 135 44 Table A.20. Top 10 keywords for 5 Topics of different Parties Topic AAP BJP INC No. 
1 0.626 "do" 0.584 "bjp" 0.554 "medium" 0.079 "share" 0.119 "india" 0.077 "cong" 0.038 "know" 0.113 "job" 0.073 "show" 0.031 "road" 0.065 "talk" 0.071 "pay" 0.000 "create" 0.054 "time" 0.065 "work" 0.000 "nda" 0.032 "leader" 0.026 "ppl" 0.000 "rule" 0.005 "new" 0.005 "interest" 0.000 "yashwant_sinha" 0.004 "alliance" 0.000 "trust" 0.000 "price_rise" 0.000 "nda" 0.000 "house" 0.000 "come" 0.000 "growth" 0.000 "doubt" 2 0.110 "candidate" 0.451 "kejriwal" 0.233 "sonia" 0.105 "live" 0.263 "arvind" 0.209 "call" 0.096 "set" 0.000 "high" 0.186 "explain" 0.091 "rally" 0.000 "reason" ' 0.077 "think" 0.041 "hope" 0.000 "ka" 0.072" even" 0.038 "dharna" 0.000 "suroor" 0.046 "get" 0.022 "gorakhpur" 0.000 "start" 0.026 "voter" 0.021 "anarchy" 0.000 "career" 0.016 "tweet" 0.019 "address" 0.000"aam_aadmi" 0.012" namoingkp" 0.014 "babri_masjid" 0.000"communal 0.011 "pro 136 Table A.20. (cont’d) 3 0.236 "choice" 0.485 "singh" 0.382 "party" 0.006 "sack" 0.349 "manmohan" 0.262 "lakh" 0.006 "later" 0.022 "mp" 0.122 "day" 0.005 "evidence_mount" 0.000 "prime" 0.063 "join" 0.005"miscalculation_pressure" 0.000 "dr" ' 0.000 "create" 0.000 "nda" 0.000 "replug" 0.000 "nda" 0.000 "rule" 0.000 "prove" 0.000 "yashwant_sinha" 0.000 "make" 0.000 "indeed" 0.000 "rule" 0.000 "next" 0.000 "tev_evrx" ' 0.000 "price_rise" 0.000 "vote" 0.000 "hire" 0.000 "year 4 0.386"aap" 0.472"upa" 0.721"amp" 0.198"want" 0.082"put" 0.066"need" 0.179"money" 0.077"pm" 0.064"much" 0.031"sir" 0.074"govt" 0.028"thing" 0.000"card" ' 0.046"ask" 0.001"proud" 0.000"bill" 0.044"india" 0.000"economy" 0.000"communal_violence" 0.035"lead" 0.000"ground" 0.000"expose" 0.034"support" 0.000"approach" 0.000"appeasement" 0.033"country" 0.000"run" 0.000"tell 0.025"today" 0.000"wrong" 137 Table A.20. (cont’d) 5 0.321"congress" 0.211 "bjps" 0.394 "meet" 0.235"say" 0.151 "first" 0.295 "delhi" 0.144"gujarat" 0.141 "politic" 0.118 "minister" 0.099"riot" 0.086 "face" 0.035 "chief" 0.092"muslim" 0.076 "news" 0.000 "university" 0.037"s" 0.074 "corrupt" 0.000 "tirupati" 0.001"maharashtra" 0.064 "bad" 0.000"international_islamic" 0.000"vote" 0.037 "lok_sabha" 0.000 "announce" 0.000"ydwtppfyo" 0.007 "allege" 0.000 "seriously_oppose" 0.000"unite 0.001 "fighting" 0.000 "setup" The table shows the top 10 keywords that contribute to each topic. The weights reflect how important a keyword is to that topic. 138 45 Table A.21. Topic Perplexity and Coherence Score Measurement Descriptions Result Perplexity A measure of how good the -18.065618014400595 model is. The lower the better Coherence Score Measures score a single topic 0.3486032341447605 by measuring the degree of semantic similarity between high scoring words in the topic 23 Figure A.13. Different Topic Numbers vs Coherence Scores on each topic model The figure shows that when topic number is 26, we get the highest coherence scores 139 46 Table A.22. 
Finding the dominant topics in each sentence Document_No Dominant_Topic Topic_Perc_ Keywords Text Contrib 0 13.0 0.458999991 kejriwal, arvind, Delhi Chief Minister Arvind 41693115 first, mr, take, co, Kejriwal leaves his residence to minister, law, meet Lieutenant Governor Najeeb delhi, word Jung #BreakingNow #AAPInstantJustice 1 14.0 0.387199997 bjp, economy, Nigerians are like wild animals and 9019165 alliance, come, co, cancer - Goa BJP Minister cute, break, policy, http://t.co/oxWLWF11V6 keep, performance 2 20.0 0.401399999 modi, co, narendra, Congress,BJP are facing first time 85694885 congress, deliver, the Democracy,Politics of the http, face, politic, Extraordinary Aam friend, never Aadmi.Reminiscence of European Nationalism!!! 3 2.0 0.583500027 https, corruption, RT Angry over Rahuls statement 6565552 co, other, bjp, amp, that BJP can sell combs to bald, ever, proof, BJP man leads protest of bald address, freedom people. http://t.co/vQfgwqjWcc 4 14.0 0.597400009 bjp, economy, Narendra #Modi-led BJP 6321106 alliance, come, co, government can lift mood: Moodys cute, break, policy, http://t.co/9SdXjIeM2X Now keep, performance #Congress will call Moody an agent of the Sangh ! 5 11.0 0.903100013 bjp, dharna, Interesting point Sevanti raises. If 7329102 revolution, aap, the AAP Dharna is anarchy, then anarchy, what was the demolition of the babri_masjid, shri, Babri Masjid by the BJP? demolition, Constitutional? constitutional, sevanti_raise 6 27.0 0.763800024 create, co, modi, Bad news if BJPs Nishank getting 9862671 air, narendra, Lok Sabha ticket.Hideously gorakhpur, bad, set, corrupt & alleged to be pm, state behind the murder of Swami Nityanand fighting for Ganga 9 11.0 0.878899991 bjp, dharna, If the AAP Dharna is anarchy, then 5122986 revolution, aap, what was the demolition of the anarchy, Babri Masjid by the BJP? babri_masjid, shri, Constitutional? demolition, #JoinAAPSaveNation constitutional, sevanti_raise 10 12.0 0.599900007 explain, instead, Why does feel compelled to 2479248 mp, speech, explain its position on #AAP? expose, stop, thing, Since it never did for #Congress, congress, issue, bjp #BJP or any other party? 11 19.0 0.334899991 aap, delhi, bjp, co, I think 90% of her tweets are 75071716 congress, thank, antiAAP . Not even 1% pro BJP . true, act, non, rti This is called Tension Lol . Traitor 140 Table A.22. (cont’d) 12 27.0 0.892400026 create, co, modi, Donate Today for Better 3214111 air, narendra, Tomorrow! Donate online at gorakhpur, bad, set, http://t.co/4SsfQetb5b #Modi4PM pm, state http://t.co/y44EoFpnub 13 4.0 0.844699978 seat, bjp, put, trust, These membership drives by BJP 8284302 share, win, tv, vote, shud not be limited 2 just for election, pay election purposes. This shud be employed right thru the process. Change system 14 13.0 0.878799974 kejriwal, arvind, Delhi is Lucky to Have Arvind 9183655 first, mr, take, co, Kejriwal ; The CM hitting the minister, law, roads for accountable delhi, word Police .#JoinAAPSaveNation 15 2.0 0.599799990 https, corruption, AAP पार्टी लोकसभा चुनाव जितने के 6539917 co, other, bjp, amp, जलए नह ीं BJP को हराने और UPA को ever, proof, जिताने के जलए लड़ रह है address, freedom #QuitAAPsaveNation 16 20.0 0.189999997 modi, co, narendra, Sir more importantly ppl want to 6158142 congress, deliver, know Y BJP is quiet on allegations http, face, politic, that Shinde supports Dawood friend, never friends? 
Hafta collection 17 27.0 0.844099998 create, co, modi, BJP’s PM candidate Narendra 4741211 air, narendra, Modi is all set to address a mega gorakhpur, bad, set, rally in UP’s Gorakhpur today, live pm, state coverage to be aired in 37 countries 18 15.0 0.603799998 india, co, bjp, Sharing pic of two independent 7602234 namo, fund, wrong, MPs from Bihar joining BJP. listen, avg, today, Support for BJP in Bihar is source increasing consistently. http://t.co/hJVVeIlZ8h 19 31.0 0.687300026 bear, gujarat, bjp, s, Dear #Chhidu dont you know 4167786 muslim, riot, that ??? The only muslims who narendramodi, won in Rajasthan- Habibur congress, report, Rehman & Yunus Khan, both modi from BJP. #Muslims are with BJP 20 31.0 0.448700010 bear, gujarat, bjp, s, BJP and Cong are not realizing 7765198 muslim, riot, that more they throw mud at more narendramodi, will fall on them.Ppl will realize congress, report, they hv no substance. modi 141 47 Table A.23. Most Representative Topics for Each Document Topic_Num Topic_Perc_Contrib Keywords Text 0.0 0.8788999915122986 gandhi, rahul, co, #Rahul #Gandhi meeting Ex-servicemen meet, sonia, amethi, personnel in Amethi today. congress, http, look, http://t.co/CopS6rb4uL visit http://t.co/fZatfpwJJE 1.0 0.8924000263214111 job, government, #SadacharYatra The #BJP government had lakh, get, bjp, promised employment to 13 lakh in 2007 but political, provide, only 80,000 got jobs. #SadacharYatra lose, claim, pass 2.0 0.7767000198364258 https, corruption, co, The Salesman & The Showman. And other, bjp, amp, ever, why BJP must address proof of its own proof, address, corruption, before accusing others! freedom https://t.co/WjbaK97I92 3.0 0.8062000274658203 power, congress, bjp, เจอกันหาดใหญ่😚 Don Mueang work, dmk, divide, International Airport (DMK) anti, fight, donation, ท่าอากาศยานดอนเมือง rise http://t.co/HXvAGUM1bC 4.0 0.8446999788284302 seat, bjp, put, trust, These membership drives by BJP shud not be share, win, tv, vote, limited 2 just for election purposes. This election, pay shud be employed right thru the process. Change system 5.0 0.8942999839782715 co, bjp, election, http, Jan 23 : HJS salutes Shiv Sena chief inc, wave, lok_sabha, Balasaheb Thackeray on his Birth modi, major, india Anniversary. Read on http://t.co/FklxFvnpq7 http://t.co/C0Xpubdhb7 6.0 0.7050999999046326 time, next, hope, bjp, "***** Hinduism Logic "" GOD BRAHMA dalit, mayawati, MARRIED DAUGHTER SARASWATI & resignation, together, ELEPHANT GOD WAS BORN"" #Atheism follower, medium #HDL #AAP #BJP http://t.co/MH2uj6g8xk" 7.0 0.9354000091552734 upa, nda, growth, #feku thinks #NDA govt was better than rate, rule, agree, govt, UPA. In 9yrs of #UPA economic growth rate think, woman, avgd 7.9% which was only 6% in #NDA rule. economic #SadacharYatra 8.0 0.9193000197410583 crore, indian, opinion, #SadacharYatra The BJP government of bjp, tracker, mean, Gujarat has managed to give only nine per rajasthan, hold, fall, cent of the total promised jobs. big #SadacharYatra 9.0 0.9031000137329102 bjp, pay, show, #NaMoInGKP - BJP senior leaders will announce, address Vijay Sankalp Rally. LIVE will be namoingkp, last, available at http://t.co/AWhprQnLB2 oppose, beat, co, life #NETAJI 10.0 0.7408999800682068 year, poll, bjp, "Subhash Chandra Bose said ""tum mujhe congress, parliament, khoon do main aajadi doonga"". Narendra try, need, spend, Modi to Indians ""tum mera saath do main problem, candidate tumhe vikaas doonga""" 142 Table A.23. 
(cont’d) 11.0 0.9031000137329102 bjp, dharna, Interesting point Sevanti raises. If the AAP revolution, aap, Dharna is anarchy, then what was the anarchy, demolition of the Babri Masjid by the BJP? babri_masjid, shri, Constitutional? demolition, constitutional, sevanti_raise 12.0 0.7577999830245972 explain, instead, mp, AAP confuses majorityism with democracy. speech, expose, stop, BJP presents majorityism as democracy. thing, congress, issue, bjp 13.0 0.8924000263214111 kejriwal, arvind, first, Union Home Minister Sushil Kumar Shinde mr, take, co, minister, calls Arvind Kejriwal a mad chief minister. law, delhi, word My #cartoon #YEDA http://t.co/I9y2u15MUJ 14.0 0.8062000274658203 bjp, economy, "MDMK - BJP alliance confirmed. Middle alliance, come, co, finger for these ""sorinaai""s cute, break, policy, http://t.co/I9FCqvhhg0" keep, performance 15.0 0.9031000137329102 india, co, bjp, namo, #Modiका जवज़न मेरे सपोर्टट का र ज़न fund, wrong, listen, साबरमजतको साफ़ जकया अब गींगा यमुनाक avg, today, source बार #NaMoInUP #NaMo4India #BJP #NamoVision http://t.co/Z6Dgb3wVnt 16.0 0.911899983882904 money, want, upa, 14) British Govt told Commission that they help, back, refuse, would not declassify some papers on #Netaji touch, fail, bjp, until 2021.UPA did nt help to access these reason papers 17.0 0.8127999901771545 bjp, talk, become, Congress Party as crafted by Mahatma survey, india, Gandhi was the grandest political formation electiontracker, anywhere in world. Nehru progeny made it sweep, party, poor, private fiefdom. congress 18.0 0.9031000137329102 amp, bjp, news, gt, 11) UPA was hostile towards th party, today, speak, Commission.Mukherjee was humiliated for ground, co, expect his insistence 2 probe Taiwanese &Russian angles to #Netaji mystery. 19.0 0.8062000274658203 aap, delhi, bjp, co, 240 for BJP alone. It has already maximized congress, thank, true, where it could. I am confident with it, it can act, non, rti make govt. in Delhi. 20.0 0.8616999983787537 modi, co, narendra, Mani Shankar Aiyyar ko gussa kyun aata hai? congress, deliver, Because Narendra Modi chaiwallah is not his http, face, politic, cup of tea. friend, never 21.0 0.6876999735832214 even, do, call, bjp, Sad that AAP supporters are now the most supporter, great, blind votebank than any other party. BJP is many, party, far, dear now only party which has supporters who criticise it 22.0 0.8615999817848206 chidambaram, house, Ram Jethmalanis letter to P. Chidambaram in pm, leave, shame, NDTV money laundering history, day, post, co, matter....http://t.co/60COtbob3d pc 143 Table A.23. (cont’d) 23.0 0.9254999756813049 medium, modi, unite, 50 ppl manage modi social Media: know all- social, know, B. G. Mahesh: The techie behind Narendra narendra, like, co, Modi’s campaign - http://t.co/alyo5abKQS surge, campaign #SadacharYatra 24.0 0.8062000274658203 cong, bjp, paidmedia, 2/2 Cong., BJP युवाओीं को जवदे श सेक्स और good, aap, attack, ड्रग्स रै केर्ट के हवाले करना चाहत हैं , और amp, modi, show, म जड्या/ पुजलस इस जमल भगत में शाजमल है - दे श stand का दु भाट ग्य 25.0 0.7911999821662903 vote, singh, go, tell, Reveal truth behind Netajis death, Rajnath national, rajnath, Singh tells Centre http://t.co/43DUitVjy4 sonia, choice, pm, president 26.0 0.9354000091552734 give, support, I support BJP led by because Iam worried country, upa, lead, about stability+development of my bjp, govt, india, right, country!UPA gave us several scams as Govt development of India 27.0 0.8924000263214111 create, co, modi, air, Donate Today for Better Tomorrow! 
Donate narendra, gorakhpur, online at http://t.co/4SsfQetb5b #Modi4PM bad, set, pm, state http://t.co/y44EoFpnub 28.0 0.9031000137329102 manmohan, singh, RT Massive #SadacharYatra against Guj BJP bsp, sp, bjp, co, govt by ends today at Dandi, Bardoli. congress, happy, end, Completed 4936 kms of tour. massive 29.0 0.8062000274658203 say, ask, modi, bjp, RT Chidambaram is being economical with narendra, seem, truth: BJP http://t.co/8Be0GVoh0L chidambaram, co, sc, truth 30.0 0.7386000156402588 people, bjps, leader, The point is very simple. The answer to AAPs see, make, week, bjp, anarchy is not BJPs anarchy, topped with care, eye, top rabid regressive communalism.

Table A.24. Topic distribution across documents

Dominant_Topic | Topic_Keywords | Num_Documents | Perc_Documents
13.0 | kejriwal, arvind, first, mr, take, co, minister, law, delhi, word | 2312.0 | 0.0824
14.0 | bjp, economy, alliance, come, co, cute, break, policy, keep, performance | 389.0 | 0.0139
20.0 | modi, co, narendra, congress, deliver, http, face, politic, friend, never | 233.0 | 0.0083
2.0 | https, corruption, co, other, bjp, amp, ever, proof, address, freedom | 274.0 | 0.0098
14.0 | bjp, economy, alliance, come, co, cute, break, policy, keep, performance | 1666.0 | 0.0594
11.0 | bjp, dharna, revolution, aap, anarchy, babri_masjid, shri, demolition, constitutional, sevanti_raise | 1369.0 | 0.0488
27.0 | create, co, modi, air, narendra, gorakhpur, bad, set, pm, state | 109.0 | 0.0039
11.0 | bjp, dharna, revolution, aap, anarchy, babri_masjid, shri, demolition, constitutional, sevanti_raise | 1931.0 | 0.0688
20.0 | modi, co, narendra, congress, deliver, http, face, politic, friend, never | 714.0 | 0.0254
11.0 | bjp, dharna, revolution, aap, anarchy, babri_masjid, shri, demolition, constitutional, sevanti_raise | 1289.0 | 0.0459
12.0 | explain, instead, mp, speech, expose, stop, thing, congress, issue, bjp | 604.0 | 0.0215
19.0 | aap, delhi, bjp, co, congress, thank, true, act, non, rti | 520.0 | 0.0185
27.0 | create, co, modi, air, narendra, gorakhpur, bad, set, pm, state | 179.0 | 0.0064
27.0 | create, co, modi, air, narendra, gorakhpur, bad, set, pm, state | 1430.0 | 0.051
11.0 | bjp, dharna, revolution, aap, anarchy, babri_masjid, shri, demolition, constitutional, sevanti_raise | 781.0 | 0.0278

APPENDIX L. Meta-path based Network Embedding

We set up the model as follows. A user i belongs to a graph G = (V, E), where V is the vertex (node) set, E is the edge set, and A is the corresponding adjacency matrix. Each user i has demographic information, denoted X and called the node attributes; network centrality and community cluster measures can also be included in X. Each user i posts tweets or retweets on Twitter, denoted Y, the label for each node. With this basic setting, we can transform a graph into a neural network architecture as follows:

Figure A.14. Adjacency transformation. The figure shows a toy graph with vertex set {A, B, C, D, E}, its adjacency matrix (rows: 0 1 1 1 1 / 1 0 0 0 0 / 1 0 0 1 0 / 1 0 1 0 1 / 1 0 0 1 0), vertex attributes X, and label set {0, 1} (e.g., tweet, retweet, etc.).

The heterogeneous network is defined as G = (V, E, T), where T denotes the node types, together with two mapping functions ψ: V → T_V and φ: E → T_E. metapath2vec is a skip-gram style method for representing the network. We want to maximize the probability of the context nodes given a node v:

\[
\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta)
\]

where N_t(v) is the set of type-t neighbors of node v. The probability is given by a softmax function:

\[
p(c_t \mid v; \theta) = \frac{e^{Z_{c_t} \cdot Z_v}}{\sum_{u_t \in V_t} e^{Z_{u_t} \cdot Z_v}}
\]

where Z_v denotes the v-th column of the low-dimensional embedding matrix Z. Using negative sampling to reduce the computational burden of this equation, we obtain the following objective function:

\[
\log \sigma(Z_{c_t} \cdot Z_v) + \sum_{m=1}^{M} \mathbb{E}_{u^m \sim P(u)}\left[\log \sigma(-Z_{u^m} \cdot Z_v)\right]
\]

where σ(·) is the sigmoid function, M is the number of negative samples, and P(u) is the sampling distribution over nodes.
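To make the mechanics concrete, the following is a minimal sketch, not the implementation used in this dissertation, of meta-path guided random walks paired with a skip-gram model trained by negative sampling (the objective above). The toy graph, the user-tweet meta-path, and all gensim Word2Vec settings are illustrative assumptions.

```python
# Minimal sketch of metapath2vec-style embedding: meta-path guided walks +
# skip-gram with negative sampling. Toy data and parameters are assumptions.
import random
from gensim.models import Word2Vec  # skip-gram (sg=1) with negative sampling

# Toy heterogeneous graph: node -> neighbors, and node -> type.
neighbors = {
    "u1": ["t1", "t2"], "u2": ["t1"], "u3": ["t2", "t3"],
    "t1": ["u1", "u2"], "t2": ["u1", "u3"], "t3": ["u3"],
}
node_type = {"u1": "user", "u2": "user", "u3": "user",
             "t1": "tweet", "t2": "tweet", "t3": "tweet"}

def metapath_walk(start, metapath, walk_length):
    """Random walk that only steps to neighbors whose type matches the meta-path."""
    walk, step = [start], 0
    while len(walk) < walk_length:
        step += 1
        wanted = metapath[step % len(metapath)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

# Meta-path user -> tweet -> user, applied cyclically along each walk.
metapath = ["user", "tweet"]
walks = [metapath_walk(u, metapath, walk_length=10)
         for u in node_type if node_type[u] == "user" for _ in range(20)]

# Skip-gram with negative sampling approximates the softmax objective above.
model = Word2Vec(walks, vector_size=16, window=3, min_count=1,
                 sg=1, negative=5, epochs=50, seed=0, workers=1)
print(model.wv["u1"][:5])  # low-dimensional embedding Z_v for user u1
```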
APPENDIX M. Performance Measures in Recommendation

When we have a classification problem, we want to know how well our model performs. The four basic measurements are True Positive, True Negative, False Positive, and False Negative. In our research setting, the sentiment scores are divided into four classes: [-1, -0.5) as class 1, [-0.5, 0) as class 2, [0, 0.5) as class 3, and [0.5, 1] as class 4.

True Positive (TP): the people who truly have the given sentiment score on the given tweet and are predicted to have it.
True Negative (TN): the people who truly do not have the given sentiment score on the given tweet and are predicted not to have it.
False Positive (FP): the people who truly do not have the given sentiment score on the given tweet, but are predicted to have it by the recommendation.
False Negative (FN): the people who truly have the given sentiment score on the given tweet, but are predicted not to have it by the recommendation.

Based on these four measurements, we can calculate the Accuracy, Precision, Recall, and F1.

Accuracy is the percentage of all predicted sentiment scores that are correct:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Precision is the percentage of predicted sentiment scores that are truly correct, among the samples predicted to have the given score:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall is the percentage of ground-truth sentiment scores that are correctly predicted:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

F1 is the harmonic mean of precision and recall, balancing the two:

\[ F1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \]

Based on these four measurements, we can further construct the True Positive Rate, the False Positive Rate, and the Area Under the Curve.

The True Positive Rate (TPR) is the percentage of samples with the given sentiment score that are correctly predicted; it is the same as the recall:

\[ \text{TPR} = \frac{TP}{TP + FN} \]

The False Positive Rate (FPR) is the percentage of samples without the given sentiment score that are incorrectly predicted to have it:

\[ \text{FPR} = \frac{FP}{FP + TN} \]

Using these two rates, we can draw a Receiver Operating Characteristic (ROC) curve, and the AUC is the Area Under the ROC Curve. For a focal user, the AUC can be interpreted as the probability that a correctly scored recommendation is ranked above an incorrectly scored one.

Beyond these classification measurements, we also care about the ranking quality of the recommendation. We choose the Hit Ratio (HR) and the Normalized Discounted Cumulative Gain (NDCG) in our setting.

The Hit Ratio measures the percentage of test cases for which the focal user's true sentiment score appears in the top-K recommended tweets:

\[ \text{HR@K} = \frac{\text{Number of Hits@K Users}}{|\text{Testing sets of their corresponding tweets}|} \]

Normalized Discounted Cumulative Gain measures how close the Discounted Cumulative Gain of the recommended ranking is to the Ideal Discounted Cumulative Gain. For example, if we have a list of n users and a list of m tweets, there is a correlation score for each user-tweet pair, given by the TF-IDF method in Appendix 11. The Cumulative Gain at rank position K is calculated as

\[ CG_K = \sum_{i=1}^{K} \text{correlation}_i \]

where correlation_i is the correlation score of the user-tweet pair at rank i. The Discounted Cumulative Gain at rank position K is calculated as

\[ DCG_K = \sum_{i=1}^{K} \frac{2^{\text{correlation}_i} - 1}{\log_2(i + 1)} \]

This formula measures the cumulative gain while accounting for rank: the higher an item is ranked, the more it contributes to the recommendation quality. The Ideal Discounted Cumulative Gain uses the same formula, but with the correlation values in their ideal (descending) order:

\[ IDCG_K = \sum_{i=1}^{|\text{Ideal Correlation}|} \frac{2^{\text{correlation}_i} - 1}{\log_2(i + 1)} \]

Therefore, the Normalized Discounted Cumulative Gain (NDCG) is

\[ NDCG_K = \frac{DCG_K}{IDCG_K} \]
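The short sketch below illustrates the metrics defined above. It is a toy example under assumptions, not the dissertation's evaluation code: the confusion-matrix counts, the relevance scores, and the top-K list are made up, and the Hit Ratio is shown per user here whereas the text aggregates it over all test users. NDCG follows the 2^rel - 1 gain with a log2(i + 1) discount, as in the formulas above.

```python
# Toy illustration (assumed data) of precision/recall/F1, HR@K, and NDCG@K.
import math

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def hit_ratio_at_k(recommended, relevant, k):
    """Per-user variant: fraction of held-out relevant tweets found in the top-K list."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, correlation, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    def dcg(scores):
        return sum((2 ** s - 1) / math.log2(i + 2) for i, s in enumerate(scores))
    predicted = [correlation.get(item, 0.0) for item in recommended[:k]]
    ideal = sorted(correlation.values(), reverse=True)[:k]
    return dcg(predicted) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Toy example: 5 candidate tweets with ground-truth correlation scores for one user.
correlation = {"t1": 0.9, "t2": 0.7, "t3": 0.4, "t4": 0.1, "t5": 0.0}
relevant = {"t1", "t2"}                       # held-out tweets the user actually engaged with
recommended = ["t2", "t4", "t1", "t5", "t3"]  # ranking produced by a recommender

print(precision_recall_f1(tp=40, fp=10, fn=20))   # example confusion-matrix counts
print(hit_ratio_at_k(recommended, relevant, k=3)) # HR@3 = 1.0: both relevant tweets appear
print(ndcg_at_k(recommended, correlation, k=3))   # NDCG@3 < 1: ranking is not ideal
```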
REFERENCES

Ahmed, S., & Skoric, M. (2015). Twitter and 2013 Pakistan general election: The case of David 2.0 against goliaths. In Case Studies in e-Government 2.0 (pp. 139-161). Springer, Cham.

Ahmed, S., Jaidka, K., & Cho, J. (2016). The 2014 Indian elections on Twitter: A comparison of campaign strategies of political parties. Telematics and Informatics, 33(4), 1071-1087.

Ahmed, M. A., Lodhi, S. A., & Ahmad, Z. (2017). Political brand equity model: The integration of political brands in voter choice. Journal of Political Marketing, 16(2), 147-179.

Ali, M. Sanni, Rolf H. H. Groenwold, Wiebe R. Pestman, Svetlana V. Belitser, Kit C. B. Roes, Arno W. Hoes, Anthonius de Boer, and Olaf H. Klungel. 2014. "Propensity Score Balance Measures in Pharmacoepidemiology: A Simulation Study." Pharmacoepidemiology and Drug Safety 23 (8): 802-11. https://doi.org/10.1002/pds.3574.

Alaa, A. M., Weisz, M., & Van Der Schaar, M. (2017). Deep counterfactual networks with propensity-dropout. arXiv preprint arXiv:1706.05966.

Aral, S., Muchnik, L., & Sundararajan, A. (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), 21544-21549.

Aronow, P. M., & Samii, C. (2017). Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4), 1912-1947.

Arpino, B., De Benedictis, L., & Mattei, A. (2015). Implementing propensity score matching with network data: The effect of GATT on bilateral trade.

Bang, H., & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61, 962-973.

Austin, Peter C. 2009. "Balance Diagnostics for Comparing the Distribution of Baseline Covariates Between Treatment Groups in Propensity-Score Matched Samples." Statistics in Medicine 28 (25): 3083-3107. https://doi.org/10.1002/sim.3697.

Berman, R., Melumad, S., Humphrey, C., & Meyer, R. (2019). A tale of two Twitterspheres: Political microblogging during and after the 2016 primary and presidential debates. Journal of Marketing Research, 56(6), 895-917.

Bermingham, A., & Smeaton, A. (2011). On using Twitter to monitor political sentiment and predict election results. In Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology.

Bennett, A., & Kallus, N. (2019). Policy evaluation with latent confounders via optimal balance. arXiv preprint arXiv:1908.01920.

Bhattacharya, P., Phan, T. Q., Bai, X., & Airoldi, E. M. (2019). A coevolution model of network structure and user behavior: The case of content generation in online social networks.
Information Systems Research, 30(1), 117-132. Bobadilla, J., Lara-Cabrera, R., González-Prieto, Á., & Ortega, F. (2020). Deepfair: deep learning for improving fairness in recommender systems. arXiv preprint arXiv:2006.05255. Bonacich, P. (1987). Power and centrality: A family of measures. American journal of sociology, 92(5), 1170-1182. Borgatti, S. P. (2005). Centrality and network flow. Social networks, 27(1), 55-71. Bose, A., & Hamilton, W. (2019). Compositional fairness constraints for graph embeddings. In International Conference on Machine Learning (pp. 715-724). PMLR. Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, 18–42 Buccoliero, L., Bellio, E., Crestini, G., & Arkoudas, A. (2020). Twitter and politics: Evidence from the US presidential elections 2016. Journal of Marketing Communications, 26(1), 88-114. Bulsara, H. P., & Singh, R. A. (2017) Pre-Election Branding Strategies of Bhartiya Janta Party in Indian General Election 2014: A Content Analysis. Buyl, M., & Bie, T. D. (2020). Debayes: a bayesian method for debiasing network embeddings. In International Conference on Machine Learning (pp. 1220-1229). PMLR. Buyl, M., & Bie, T. D. (2021). The kl-divergence between a graph model and its fair i-projection as a fairness regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 351-366). Springer, Cham. Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. Journal of economic surveys, 22(1), 31-72. Cedric V (2003). Topics in optimal transportation. No.58. American Mathematical Soc. Chang, S., Han, W., Tang, J., Qi, G.-J., Aggarwal, C. C., and Huang, T. S. (2015). Heterogeneous network embedding via deep architectures. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM), 119–128 Christakis, N. A., & Fowler, J. H. (2013). Social contagion theory: examining dynamic social networks and human behavior. Statistics in medicine, 32(4), 556-577. Chu, Z., Rathbun, S. L., & Li, S. (2020). Matching in Selective and Balanced Representation Space for Treatment Effects Estimation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 205-214). 153 Cortes, C., & Mohri, M. (2014). Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 103-126 Current, S., He, Y., Gurukar, S., & Parthasarathy, S. (2022). FairMod: Fair Link Prediction and Recommendation via Graph Modification. arXiv preprint arXiv:2201.11596. Dai, E., & Wang, S. (2021). Say no to the discrimination: Learning fair graph neural networks with limited sensitive attribute information. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (pp. 680-688). Dehejia, R. H., & Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and statistics, 84(1), 151-161. Diwakar, R. (2017). Change and continuity in Indian politics and the Indian party system: Revisiting the results of the 2014 Indian general election. Asian Journal of Comparative Politics, 2(4), 327-346. Dong, Y., Kang, J., Tong, H., & Li, J. (2021). Individual fairness for graph neural networks: A ranking based approach. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (pp. 300-310). Dong, Y., Chawla, N. 
V., and Swami, A. (2017). metapath2vec: Scalable representation learning for heterogeneous networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM), 135–144 Dong, Y., Liu, N., Jalaian, B., & Li, J. (2021). EDITS: Modeling and Mitigating Data Bias for Graph Neural Networks. arXiv preprint arXiv:2108.05233. Du, X., Sun, L., Duivesteijn, W., Nikolaev, A., & Pechenizkiy, M. (2019). Adversarial balancing-based representation learning for causal effect inference with observational data. arXiv preprint arXiv:1904.13335. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference (pp. 214-226). Fan, S., Zhu, J., Han, X., Shi, C., Hu, L., Ma, B., & Li, Y. (2019). Metapath-guided heterogeneous graph neural network for intent recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2478-2486). Fang, X., Hu, P. J. H., Li, Z., & Tsai, W. (2013). Predicting adoption probabilities in social networks. Information Systems Research, 24(1), 128-145. Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., & Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 259-268). 154 Forastiere, L., Airoldi, E. M., & Mealli, F. (2020). Identification and estimation of treatment and interference effects in observational studies on networks. Journal of the American Statistical Association, 1-18. Fossen, B. L., Mallapragada, G., & De, A. (2021). Impact of political television advertisements on viewers’ response to subsequent advertisements. Marketing Science, 40(2), 305-324. Fu, T. Y., Lee, W. C., & Lei, Z. (2017). Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (pp. 1797-1806). Fu, X., Zhang, J., Meng, Z., & King, I. (2020). Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020 (pp. 2331-2341). Fu, Z., Xian, Y., Gao, R., Zhao, J., Huang, Q., Ge, Y & De Melo, G. (2020). Fairness-aware explainable recommendation over knowledge graphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 69-78). Gelper, S., van der Lans, R., & van Bruggen, G. (2021). Competition for attention in online social networks: Implications for seeding strategies. Management Science, 67(2), 1026-1047. Gini, C. (1921). Measurement of inequality of incomes. The economic journal, 31(121), 124- 126. Glynn, A. N., & Quinn, K. M. (2010). An introduction to the augmented inverse propensity weighted estimator. Political analysis, 18(1), 36-56. Gordon, B. R., & Hartmann, W. R. (2013). Advertising effects in presidential elections. Marketing Science, 32(1), 19-35. Graham, T., Broersma, M., Hazelhoff, K., & Van'T Haar, G. (2013). Between broadcasting political messages and interacting with voters: The use of Twitter during the 2010 UK general election campaign. Information, communication & society, 16(5), 692-716. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two- sample test. The Journal of Machine Learning Research, 13(1), 723-773. Grover, A., & Leskovec, J. (2016). 
node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864). Guo, R., Li, J., & Liu, H a (2020). Learning individual causal effects from networked observational data. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 232–240. 155 Guo, R., Li, J., & Liu, Hb (2020). Counterfactual evaluation of treatment assignment functions with networked observational data. In Proceedings of the 2020 SIAM International Conference on Data Mining (pp. 271-279). Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 1025-1035). Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in neural information processing systems, 29. He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., & Wang, M. (2020). Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval (pp. 639-648). Heller, R., Rosenbaum, P. R., & Small, D. S. (2010). Using the cross-match test to appraise covariate balance in matched pairs. The American Statistician, 64(4), 299-309. Hirano, K., & Imbens, G. W. (2004). The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives, 226164, 73-84. Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217-240. Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis 15 (3): 199–236. https://doi.org/10.1093/pan/mpl013. Hong, H., Guo, H., Lin, Y., Yang, X., Li, Z., & Ye, J. (2020). An attention-based graph neural network for heterogeneous structural learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 4132-4139). Hudgens, M. G., & Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482), 832-842. Huffman, C., & van Gameren, E. (2018). Covariate balancing inverse probability weights for time-varying continuous interventions. Journal of Causal Inference, 6(2). Imai, K., King, G., & Stuart, E. A. (2014). 8. Misunderstandings Between Experimentalists and Observationalists About Causal Inference. In Field Experiments and Their Critics (pp. 196-227). Yale University Press. Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 243-263. Jackson, M. O., Lin, Z., & Yu, N. N. (2020). Adjusting for peer-influence in propensity scoring when estimating treatment effects. Available at SSRN 3522256. 156 Jaffrelot, C. (2015). The class element in the 2014 Indian election and the BJP’s success with special reference to the Hindi belt. Studies in Indian Politics, 3(1), 19-38. Johansson, F., Shalit, U., & Sontag, D. (2016). Learning representations for counterfactual inference. In International conference on machine learning (pp. 3020-3029). PMLR. Johansson, F. D., Shalit, U., Kallus, N., & Sontag, D. (2020). 
Generalization bounds and representation learning for estimation of potential outcomes and causal effects. arXiv preprint arXiv:2001.07426. Ji, H., Shi, C., & Wang, B. (2018). Attention based meta path fusion for heterogeneous information network embedding. In Pacific Rim International Conference on Artificial Intelligence (pp. 348-360). Springer, Cham. Kallus, N. (2017). A framework for optimal matching for causal inference. In Artificial Intelligence and Statistics (pp. 372-381). PMLR. Kallus N. (2020) DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training; Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5067-5077. Kang, J., & Tong, H. (2021). Fair Graph Mining. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (pp. 4849-4852). Kang, K., Park, J., Kim, W., Choe, H., & Choo, J. (2019). Recommender system using sequential and global preference via attention mechanism and topic modeling. In Proceedings of the 28th ACM international conference on information and knowledge management (pp. 1543-1552). Karami, A., Bennett, L. S., & He, X. (2018). Mining public opinion about economic issues: Twitter and the us presidential election. International Journal of Strategic Decision Sciences, 9(1), 18-28. Khare, H. (2015). How Modi Won It: notes from the 2014 election. Hachette India. Khan, A., Zhang, H., Shang, J., Boudjellal, N., Ahmad, A., Ali, A., & Dai, L. (2020). Predicting Politician’s Supporters’ Network on Twitter Using Social Network Analysis and Semantic Analysis. Scientific Programming, 2020. Khan, F. B. (2019). The game of votes: Visual media politics and elections in the digital era. SAGE Publishing India. Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. King, G., & Nielsen, R. (2019). Why propensity scores should not be used for matching. Political Analysis, 27(4), 435-454. 157 Krivitsky, P. N., Handcock, M. S., Raftery, A. E., & Hoff, P. D. (2009). Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Social networks, 31(3), 204-213. Kumar, S., Zhang, X., & Leskovec, J. (2019). Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1269-1278). Lewis, K., Gonzalez, M., & Kaufman, J. (2012). Social selection and peer influence in an online social network. Proceedings of the National Academy of Sciences, 109(1), 68-72. Li, J., Liu, Y., & Zou, L. (2020). DynGCN: A Dynamic Graph Convolutional Network Based on Spatial-Temporal Modeling. In International Conference on Web Information Systems Engineering (pp. 83-95). Springer, Cham. Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. (2017). Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Li, S., & Fu, Y. (2017). Matching on balanced nonlinear representations for treatment effects estimation. In NIPS. Lin, J. S., & Himelboim, I. (2019). Political brand communities as social network clusters: winning and trailing candidates in the GOP 2016 primary elections. Journal of Political Marketing, 18(1-2), 119-147. Liu, L., & Hudgens, M. G. (2014). Large sample randomization inference of causal effects in the presence of interference. Journal of the american statistical association, 109(505), 288-301. 
Long, M., Wang, J., Ding, G., Sun, J., & Yu, P. S. (2014). Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1410-1417). Louizos C, Swersky K , Li,YJ, Welling M, and Zemel R. (2016). The variational fair autoencoder. International Conference on Learning Representations Louizos C, Shalit U, Mooij JM, Sontag D, Zemel R, and Welling M, (2017) Causal effect inference with deep latent-variable models, in Advances in Neural Information Processing Systems, pp. 6446-6456. Lu M, Sadiq S, Feaster DJ, Ishwaran H (2018) Estimating individual treatment effect in observational data using random forest methods. Journal of Computational and Graphical Statistics 27(1):209–219 Ma L, Krishnan R, Montgomery AL.(2015) Latent Homophily or Social Influence? An Empirical Analysis of Purchase Within a Social Network. Management Science 61(2):454-473 Ma, Y., Wang, Y., & Tresp, V. (2020). Causal Inference under Networked Interference. arXiv preprint arXiv:2002.08506. 158 Mallipeddi, R. R., Janakiraman, R., Kumar, S., & Gupta, S. (2021). The Effects of Social Media Content Created by Human Brands on Engagement: Evidence from Indian General Election 2014. Information Systems Research. Manessi, F., Rozza, A., & Manzo, M. (2020). Dynamic graph convolutional networks. Pattern Recognition, 97, 107000. McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1), 415-444. Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781. Mousavi R, Gu B (2019) The Impact of Twitter Adoption on Lawmakers’ Voting Orientations. Information Systems Research 30(1):133-153 Muja, M., & Lowe, D. G. (2014). Scalable nearest neighbor algorithms for high dimensional data. IEEE transactions on pattern analysis and machine intelligence, 36(11), 2227-2240. Nguyen, G. H., Lee, J. B., Rossi, R. A., Ahmed, N. K., Koh, E., & Kim, S. (2018). Continuous- time dynamic network embeddings. In Companion Proceedings of the The Web Conference (pp. 969-976). Ogburn EL, VanderWeele TJ (2017). Vaccines, contagion, and social networks. The Annals of Applied Statistics, 11(2):919–948 Ou, M., Cui, P., Pei, J., Zhang, Z., and Zhu, W. (2016). Asymmetric transitivity preserving graph embedding. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM), 1105–1114 Ozery-Flato, M., Thodoroff, P., Ninio, M., Rosen-Zvi, M., & El-Hay, T. (2018). Adversarial balancing for causal inference. arXiv preprint arXiv:1810.07406. Panda, A., Kommiya Mothilal, R., Choudhury, M., Bali, K., & Pal, J. (2020). Topical Focus of Political Campaigns and its Impact: Findings from Politicians' Hashtag Use during the 2019 Indian Elections. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1-14. Pal, J., Chandra, P., & Vydiswaran, V. V. (2016). Twitter and the rebranding of Narendra Modi. Economic & Political Weekly, 51(8), 52-60. Panaretos, V. M., & Zemel, Y. (2019). Statistical aspects of Wasserstein distances. Annual review of statistics and its application, 6, 405-431. Panda, A., Kommiya Mothilal, R., Choudhury, M., Bali, K., & Pal, J. (2020). Topical Focus of Political Campaigns and its Impact: Findings from Politicians' Hashtag Use during the 2019 Indian Elections. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1-14. 
159 Pareja, A., Domeniconi, G., Chen, J., Ma, T., Suzumura, T., Kanezashi, H., ... & Leiserson, C. (2020). Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 5363-5370). Pătruţ, B., & Pătruţ, M. (Eds.). (2014). Social media in politics: case studies on the political power of social media (Vol. 13). Springer. Pearl, J. (2009). Causality. Cambridge university press. Pennacchiotti, M., & Popescu, A. M. (2011). Democrats, republicans and starbucks afficionados: user classification in twitter. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 430-438). Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). Petrova, M., Sen, A., & Yildirim, P. (2021). Social media and political contributions: the impact of new technology on political competition. Management Science, 67(5), 2997-3021. Pich, C., & Armannsdottir, G. (2018). Political brand image: an investigation into the operationalisation of the external orientation of David Cameron’s Conservative brand. Journal of Marketing Communications, 24(1), 35-52. Priya Chacko & Peter Mayer (2014) The Modi lahar (wave) in the 2014 Indian national election: A critical realignment?, Australian Journal of Political Science, 49:3, 518-528 Rahman, T., Surma, B., Backes, M., & Zhang, Y. (2019). Fairwalk: Towards fair graph embedding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 3289–3295 Rishika R, Ramaprasad J (2019) The Effects of Asymmetric Social Ties, Structural Embeddedness, and Tie Strength on Online Content Contribution Behavior. Management Science 65(7):3398-3422. Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100(469):322–331 Rutter, R. N., Hanretty, C., & Lettice, F. (2018). Political brands: can parties be distinguished by their online brand personality? Journal of Political Marketing, 17(3), 193-212. Sang, E. T. K., & Bos, J. (2012). Predicting the 2011 dutch senate election results with twitter. In Proceedings of the workshop on semantic analysis in social media (pp. 53-60). Sankar, A., Wu, Y., Gou, L., Zhang, W., & Yang, H. (2020). Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 519-527). 160 Sardesai, R. (2015). 2014: The election that changed India. Penguin UK. Sant'Anna P H.C., Song XJ, Xu X, (2020). "Covariate Distribution Balance via Propensity Scores," arXiv preprint, arXiv: 1810.01370 Schneiker, A. (2019). Telling the story of the superhero and the anti-politician as president: Donald Trump’s branding on Twitter. Political studies review, 17(3), 210-223. Schwab, P., Linhardt, L., & Karlen, W. (2019). Perfect match: A simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656. Seo, Y., Defferrard, M., Vandergheynst, P., & Bresson, X. (2018). Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing (pp. 362-373). Springer, Cham. 
Shalit, U., Johansson, F. D., & Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning (pp. 3076-3085). PMLR. Shalizi, C. R., & Thomas, A. C. (2011). Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods & Research.;40(2):211-239 Sharma, A., Gupta, G., Prasad, R., Chatterjee, A., Vig, L., & Shroff, G. (2020). MultiMBNN: Matched and Balanced Causal Inference with Neural Networks. arXiv preprint arXiv:2004.13446. Skoric, M., Poor, N., Achananuparp, P., Lim, E. P., & Jiang, J. (2012). Tweets and votes: A study of the 2011 singapore general election. In 2012 45th hawaii international conference on system sciences (pp. 2583-2591). IEEE Shi, C., Blei, D. M., & Veitch, V. (2019). Adapting neural networks for the estimation of treatment effects. arXiv preprint arXiv:1906.02120. Shi, C., Hu, B., Zhao, W. X., & Philip, S. Y. (2018). Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31(2), 357-370. Song, L., Langfelder, P., & Horvath, S. (2012). Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC bioinformatics, 13(1), 1-21. Song, T., Tang, Q., & Huang, J. (2019). Triadic Closure, Homophily, and Reciprocation: An Empirical Investigation of Social Ties Between Content Providers. Information Systems Research 30(3):912-926 Spinelli, I., Scardapane, S., Hussain, A., & Uncini, A. (2021). FairDrop: Biased Edge Dropout for Enhancing Fairness in Graph Representation Learning. IEEE Transactions on Artificial Intelligence. 161 Susarla, A., Oh, J. H., & Tan, Y. (2016). Influentials, imitables, or susceptibles? Virality and word-of-mouth conversations in online social networks. Journal of Management Information Systems, 33(1), 139-170. Sweet, T., Adhikari, S. (2020) A Latent Space Network Model for Social Influence. Psychometrika 85, 251–274 Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. (2015). Line: Large-scale information network embedding. In International Conference on World Wide Web. 1067–1077 Tchetgen, E. J. T., & VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical methods in medical research, 21(1), 55-75. Velickovic, P., Cucurull, G., Casanov a, A., Romero, A., Lio, P., and Bengio, Y. (2018). Graph attention networks International Conference on Learning Representations (ICLR) Veitch, V., Wang, Y., & Blei, D. M. (2019). Using embeddings to correct for unobserved confounding in networks. arXiv preprint arXiv:1902.04114. Wang, Y., Lewis, M., & Schweidel, D. A. (2018). A border strategy analysis of ad source and message tone in senatorial campaigns. Marketing Science, 37(3), 333-355. Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., & Yu, P. S. (2019). Heterogeneous graph attention network. In The World Wide Web Conference (pp. 2022-2032). Wang, N., Lin, L., Li, J., & Wang, H. (2021). Unbiased Graph Embedding with Biased Graph Observations. arXiv preprint arXiv:2110.13957. Wang, C., Song, Y., Li, H., Zhang, M., & Han, J. (2018). Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks. Data Mining and Knowledge Discovery, 32(6), 1735-1767. Wei, X., Liu, Z., Sun, L., & Yu, P. S. (2018). Unsupervised meta-path reduction on heterogeneous information networks. arXiv preprint arXiv:1810.12503. Wu, L., Sun, P., Hong, R., Fu, Y., Wang, X., & Wang, M. 
(2018). Socialgcn: An efficient graph convolutional network based model for social recommendation. arXiv preprint arXiv:1811.02815. Wu, Q., Zhang, H., Gao, X., He, P., Weng, P., Gao, H., & Chen, G. (2019). Dual graph attention networks for deep latent representation of multifaceted social effects in recommender systems. In The World Wide Web Conference (pp. 2091-2102). Wu, L., Chen, L., Shao, P., Hong, R., Wang, X., & Wang, M. (2021). Learning fair representations for recommendation: A graph-based perspective. In Proceedings of the Web Conference 2021 (pp. 2198-2208). 162 Xie, R., Liu, Z., Jia, J., Luan, H., & Sun, M. (2016). Representation learning of knowledge graphs with entity descriptions. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1). Xue, H., Yang, L., Jiang, W., Wei, Y., Hu, Y., & Lin, Y. (2020). Modeling dynamic heterogeneous network for link prediction using hierarchical attention with temporal rnn. arXiv preprint arXiv:2004.01024. Yao, L., Li, S., Li, Y., Huai, M., Gao, J., & Zhang, A. (2018). Representation learning for treatment effect estimation from observational data. In: Advances in Neural Information Processing Systems, pp 2634–2644 Yao, L., Li, S., Li, Y., Xue, H., Gao, J., & Zhang, A. (2019, August). On the estimation of treatment effect with text covariates. In International Joint Conference on Artificial Intelligence. Yun, S., Jeong, M., Kim, R., Kang, J., & Kim, H. J. (2019). Graph transformer networks. Advances in Neural Information Processing Systems, 32, 11983-11993. Zeng, Z., Islam, R., Keya, K. N., Foulds, J., Song, Y., & Pan, S. (2021). Fair representation learning for heterogeneous information networks. arXiv preprint arXiv:2104.08769. Zhang, B., Pavlou, P. A., & Krishnan, R. (2018). On Direct vs. Indirect Peer Influence in Large Social Networks. Information Systems Research 29(2):292-314 Zhang, L., Fu, J., Wang, S., Zhang, D., Dong, Z., & Chen, C. P. (2019). Guide subspace learning for unsupervised domain adaptation. IEEE transactions on neural networks and learning systems, 31(9), 3374-3388. Zhang, C., Song, D., Huang, C., Swami, A., & Chawla, N. V. (2019). Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 793-803). Zhang, L., & Chung, D. J. (2020). The Air War vs. the Ground Game: An Analysis of Multichannel Marketing in US Presidential Elections. Marketing Science, 39(5), 872-892. Zheng, J., Qi, Z., Dou, Y., & Tan, Y. (2019). How Mega Is the Mega? Exploring the Spillover Effects of WeChat Using Graphical Model. Information Systems Research 30(4):1343-1362. 163