DEPRESSION DETECTION IN SOCIAL MEDIA VIA DIFFERENTIAL TEXT EMBEDDING

By Norah Alfadhli

A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Master of Science

2022

ABSTRACT

Deep learning models have shown promising results for depression detection using social media data (i.e., Twitter), but maintaining explainability and adapting models to new problems from few examples remain open challenges. The standard solution for depression detection modeling is to first represent the natural language text of the tweet as a numerical vector via an embedding and then to train a classification model that uses the vectors to predict depression status. In this study, we propose a few-shot learning technique to improve the performance of depression detection classification models. More specifically, we represent tweets as differential embeddings: a set of embedding vectors that measure the tweet's (Sentence-BERT) embedding location with respect to a set of depression tweet templates (anchor points) derived from clinically backed depression symptoms described in the literature. Intuitively, the differential embeddings describe the similarities between different tweets and the set of depression templates. We assessed the capability of our approach on random samples drawn from a source of tweets to create multiple datasets as follows: (1) 20 random balanced datasets and (2) 20 random unbalanced datasets. We trained a supervised model using different approaches derived from Sentence-BERT and the anchor points. The results show that the proposed solution improved on SBERT in a supervised task by relative improvements of 0.035 and 0.023 in terms of partial AUROC at FPR = 0.10 on the balanced and unbalanced datasets, respectively.

ACKNOWLEDGMENTS

First and foremost, I must acknowledge my limitless thanks to Allah, the Ever-Magnificent, the Ever-Thankful, for His help and blessing in giving me the opportunity, courage, and energy to carry out and complete this entire thesis. I would like to express my deepest appreciation to my advisor Dr. Mohammad Ghassemi for his support, encouragement, and guidance during this journey. He not only taught me how to be a better researcher, but also how to talk about science. Without his assistance and the many thought-provoking discussions, I would never have been able to complete this thesis. Dr. Ghassemi was always available and willing to give, even at the cost of his valuable time. I am thankful to Dr. Ross and Dr. Johnson for agreeing to serve on my master's thesis committee. I am also grateful to all my colleagues for their continuous support and help, including Sari Sadiya, Niloufar Eghbali and Najla'a Alsaedi. I must express my very profound gratitude to my parents for their unfailing support, continuous encouragement, and prayers throughout my years of study. I would like to express my special thanks to my beloved mother who, through her thorough discussions and patience throughout my life, made me the person who I am today. I am deeply thankful to my husband for his confidence, patience, and positive energy when I was stressed out. I appreciate my brothers, sisters, and friends for their great motivation.

TABLE OF CONTENTS Chapter 1: Introduction and Literature Review . . . . . . . . . . . . . . . . 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 1.3 Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Chapter 2: Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 Data Collection and Processing . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Features Exploration and Visualization . . . . . . . . . . . . . . . . . . . . . 26 2.3 Supervised Classification Using SBERT Embedding . . . . . . . . . . . . . . 28 Chapter 3: Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 30 3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Chapter 4: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 APPENDIX A: DYSFUNCTIONAL THOUGHT REPRESENTATIVE SEN- TENCES REFERENCES . . . . . . . . . . . . . . . . . . . . . . . 49 APPENDIX B: DEPRESSION SYMPTOMS REPRESENTATIVE SEN- TENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 APPENDIX C: DEPRESSION SYMPTOMS VALUES DISTRIBUTION . 54 iv Chapter 1: Introduction and Literature Review 1.1 Introduction 1.1.1 Background to the study Globally, more than 264 million people of all ages suffer from depression [1]. The scale of the problem is so immense that it is both logistically and financially impossible to monitor the early signs of depression using human agents alone. Additionally, studies have shown that those suffering from depression (especially young people) find it more challenging to discuss their feelings in face-to-face settings [2] but are less hesitant to share their thoughts on social media platforms. This provides an incentive for developing automated systems that can detect the presence of depression symptoms in social media for a possible downstream intervention. 1.1.2 Research Objective The present research aims to apply techniques in natural language processing (NLP) and ML to build a system that can identify social media posts (i.e., tweets) that contain the elements of the symptoms of depression. More specifically, the current research aims to de- sign a few-shot approach that synthesizes NLP algorithms (i.e., Sentence-Bert) with clinical intuitions about depression (i.e., clinical depression symptoms). We demonstrate that this process can help detect depression more effectively than algorithms or insights alone. Previous works have shown that data mining of social media activity can aid in detecting cases of depression automatically [3], [4], [5], [6]. Especially in the case of using text data from social media, ML approaches have unique advantages in the detection of depression; With people’s participation in online platforms and their public sharing via the internet, text data from social media records are a treasure trove of psychological data that may assist in 1 screening for depressive symptoms among the users of social media. 
In fact, the number of daily social media users is increasing, and a significant percentage of these users have reported being depressed. ML techniques also offer opportunities for identifying hidden patterns in online communication and interaction on social media that may reveal users' mental states. The automatic detection of depressive symptoms through ML algorithms applied to social media data has the potential to identify those at risk of depression through large-scale monitoring of online social networks, potentially complementing traditional screening procedures.

1.1.3 Research Contributions
As described in the literature review section below (see section 1.4), the existing depression detection systems have two primary limitations:
1. Limited clinical relevance: Many state-of-the-art (SOTA) techniques (i.e., deep learning) achieve excellent performance but do not explicitly account for factors that are known to be clinically relevant for depression diagnosis [5, 7].
2. High false positive rates: Many existing approaches are ineffective at discriminating depressed tweets from tweets that carry a negative sentiment but are not necessarily depressed. Indeed, many depression detection systems deliberately exclude negative sentiments from the nondepressed class during training and evaluation [8].
The present research addresses these challenges through a simple but effective few-shot solution that utilizes a small amount of labeled data to enhance model generalization to unseen data. More specifically, our work has two advantages over existing alternatives:
1. It augments the power of contemporary representation learning techniques with a clinically grounded indication of depression symptoms in texts.
2. It is trained on a dataset that includes a portion of negative tweets in the non-depressed class, making the learning task more challenging but also more relevant for real-world deployment.

1.2 Outline
The remainder of the current thesis is organized as follows: Chapter 1 focuses on a review of the key concepts of NLP and ML that are relevant to the thesis; we also present a literature review, in which the research related to depression detection in social media is discussed. Chapter 2 describes the proposed methodological approach and baseline of the research. It also includes a discussion of the data, pre-processing approach, and feature extraction pipeline. Chapter 3 provides a detailed description of the experiments performed to assess the research method and their results; it also provides a discussion of the research results, the context of the literature, and the propensity for downstream clinical use of the approach in this research. Finally, Chapter 4 is a discussion of the key takeaways, limitations, and future directions of work.

1.3 Theoretical Background
The present thesis utilizes NLP and ML to build an automated depression detection system. In this chapter, we review the key concepts and theories that our work builds on. NLP is the analysis of human (i.e., "natural") language using computational and statistical techniques. The objective of NLP is to allow computers to represent, understand, analyze, and derive meaning from human language [9]. In recent years, there have been significant advances in NLP techniques. Recent advances have utilized deep neural architectures with multi-headed attention (i.e., BERT approaches) for language modeling. These models are trained using the wealth of natural language human text available online (and offline).
The SOTA language models provide contextual representations of natural language that are useful for a variety of downstream tasks, including text summarization, question answering, and topic classification with little-to-no costly human annotation required. Generally speaking, NLP systems are composed of two types: rule-based approaches and statistical NLP. The rule-based approaches are based on a set of rules that guide the system. One example of this is the sentence parsing systems that use a nominally complete set of rules that define allowable words, parts of speech, and allowable sequences of the parts of speech. Despite its strengths, its shortcomings includes the difficulty in modeling natural language using a set of rules based on a predefined vocabulary. On the other hand, statistical NLP consists of all the quantitative approaches (often probabilistic) to automated language processing and modeling language implicitly instead of using explicit rules. One criticism of the statistical approach is that the statistical assumptions may not match the intuition of current research on how languages work. To this effect, text corpora–based approaches could be subject to criticism because they have insufficient data. In the current study, 4 instead of hand-coding rules, we used statistical NLP to learn these rules by analyzing a set of examples and making statistical inferences. This approach relies on methods based on ML algorithms [10]. Arthur Samuel described ML as the computer’s ability to learn without being explicitly programmed. Hence, ML can make decisions about new data without being instructed on how to do so. Machines can learn different tasks and are able to do so in one of two forms: supervised learning and unsupervised learning. In supervised learning, an algorithm is given training data and the desired solution (labels). The two main types of supervised learning problems include classification, which involves predicting a class label, and regression, which involves a numerical value. The supervised learning uses historical data (data from the past) to learn patterns and uncover relationships between features and the target [11]. In unsupervised learning, an algorithm models the underlying data structure or distribution to learn the pattern within the structure. It is used to solve ML problems when there are no ground truths (known as the target for training or validating the model with a labeled dataset). Clustering is an example of unsupervised learning [11]. ML applications typically involve the following steps: data collection and preprocessing, features engineering, training a predictive model, and testing the performance. 1.3.1 Feature Engineering This is referred to as the process of using knowledge of the domain to select, manipulate, and transform raw data into features that can be used in supervised and unsupervised learn- ing. A feature is a form of information that is useful for prediction. In computer vision, an image is an observation, but a feature could be a line in the image. In natural language pro- cessing, a document or tweet could be an observation, and a phrase or word count could be a feature. In speech recognition, an utterance could be an observation, but a feature might 5 be a single word or phoneme [12]. Feature engineering in deep learning is direct and can be performed by an algorithm. 
More specifically, deep learning attempts to mimic the activity in the layers of neurons in the human neocortex to enable it to transform the input raw data into features, a process that occurs in an early stage of the training process.On the contrary, in conventional ML (shallow learning), features engineering is carried out outside the algo- rithmic stage. Experts and nonmachines are in charge of analyzing raw data to transform it into valuable features [13]. A dataset that is given to an ML algorithm contains dependent and independent variables. The outcome of a prediction is the dependent variable. The expert can provide features as part of the data set or may be derived from the data in the case of textual data. As a machine learns, it finds patterns in data (associated or not with given classes). In all ML applications, the data are first converted into a representation (a set of features) that can be interpreted. To apply ML algorithms to NLP applications, the text has to be transformed into a numeric (or discrete) representation. Features Representation in NLP: Finding useful features is an integral part of conven- tional Machine Learning research. In the case of NLP tasks, these features are extracted from text. Some kinds of features rely on word frequencies such as Bags of Words and n-grams. Other features are more problem-specific, such as the sentiment value of a document, its readability level, tone, etc. Generating these features involves extracting information from the text and converting it into a form that machine learning algorithms can understand. As an example, NLP uses deep learning to represent information from the text in the form of embedding representations. Deep Learning methods out-compete other statistical and linguistic models for NLP tasks [14]. Deep learning can learn the features from the nat- ural language required by the model, rather than requiring that the features be specified and extracted by an expert. This learned representation is called embedding. The way 6 words and documents are represented is a key breakthrough in deep learning when it is ap- plied to challenging NLP problems. A word or document embedding is similar for words or documents that have the same meaning. A SOTA approach to learning document represen- tation embedding is by utilizing Bidirectional Encoder Representations from Transformers (BERT) networks. BERT is SOTA language model for NLP. The key technical innovation of BERT is its application of Transformer, a popular attention model, in bidirectional training to language modeling. Prior research looked at text sequences left to right or combined left-to-right and right-to-left training, but BERT examines a text sequence bidirectionally. BERT shows that bi-directionally trained language models provide a deeper understanding of language context and flow than their single-direction counterparts. To learn the context of words (or sub-words) within a text, BERT uses a Transformer. A transformer consists of two mechanisms such as an encoder that reads text input and a decoder that produces a prediction. Given that BERT aims to produce a language model, only the encoder mech- anism is necessary. The Transformer encoder reads the entire sequence of words at once, in contrast to directional models, which read the text input sequentially (from right to left or left to right). As a result, it is regarded as bidirectional. 
This characteristic enables the model to learn a word's context from all of its surroundings (left and right of the word). A series of tokens is used as the input and is first embedded into vectors before being processed by a neural network. The result is a series of vectors of size H, each of which corresponds to a token from the input with the same index [15]. A modification of BERT was introduced to better handle sentence embedding. Sentence-BERT (SBERT) uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This modification enables BERT to be used for tasks such as large-scale semantic similarity comparison, clustering, and information retrieval through semantic search. BERT itself uses a cross-encoder, which requires two sentences to be fed into the network so that the target value is predicted. Due to the large number of possible combinations, this setup is not suitable for various pair regression tasks. An inference computation via cross-encoding would need to be completed 100K times if we wanted to search for similarity in a small 100K sentence dataset. To cluster sentences, we would have to compare all 100K sentences, resulting in just under 500M comparisons. Therefore, to address the issue of the expensive computation, we need to pre-compute sentence vectors that can be stored and then used whenever required. SBERT was developed to handle this limitation of BERT by processing one sentence at a time. The siamese network architecture in SBERT ensures that fixed-sized vectors for input sentences can be derived. Using a similarity measure such as cosine similarity or Manhattan/Euclidean distance, semantically similar sentences can then be found. Clustering and semantic search are commonly accomplished by mapping each sentence into a vector space where semantically similar sentences are grouped together. BERT has been used to derive fixed-size embeddings of individual sentences; the common approaches include averaging the BERT output layer (known as BERT embedding) or using the first token ([CLS]). This common practice yields poor sentence embeddings [16]. This research uses textual data from Twitter. Though a fine-tuned version of BERT was proposed to handle Twitter data (BERTweet), that model uses the same architecture as BERT, which results in the same shortcomings of BERT mentioned above.
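To make this concrete, the snippet below is a minimal sketch of deriving fixed-size SBERT vectors and comparing them with cosine distance. It assumes the sentence-transformers Python package and uses the bert-base-nli-mean-tokens checkpoint referenced later in this thesis; the example sentences are borrowed from the anchor tables in Chapter 2.

```python
# Minimal sketch: fixed-size SBERT sentence vectors compared with cosine distance.
# Assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer("bert-base-nli-mean-tokens")  # 768-dimension vectors

sentences = [
    "I always fail",                      # a tweet-like input
    "I am going to fail at everything",   # overgeneralizing anchor sentence
    "I am over the moon",                 # feeling-happy anchor sentence
]
embeddings = model.encode(sentences)      # shape: (3, 768)

# Cosine distance = 1 - cosine similarity; smaller means more similar.
print(cosine(embeddings[0], embeddings[1]))  # small: semantically close
print(cosine(embeddings[0], embeddings[2]))  # larger: semantically distant
```

Because each sentence vector is computed independently, the vectors can be cached and reused for clustering or semantic search without re-running the network for every sentence pair.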
1.3.2 Feature Selection
Feature selection is one of the core principles in ML and strongly affects the overall performance of the predictive model. The process of automatically or manually selecting the features that contribute most to the desired prediction variable or output is known as feature selection. Having irrelevant or partially relevant features in the data can decrease the accuracy of the models and make the model learn based on irrelevant features. The model can benefit from feature selection in three ways: 1) improved accuracy, 2) reduced overfitting through a smaller number of features, and 3) reduced training time. One of the feature selection methods is sequential feature selection (SFS). Forward SFS is a greedy process that iteratively finds the best new feature to add to the set of selected features. Concretely, we start with zero features and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. We repeat the process by adding a new feature to the set of selected features after selecting the first one. When the desired number of selected features has been reached or there is no further improvement, the procedure ends [17].

1.3.3 Classifier Selection
Some of the most widely used algorithms in NLP include logistic regression (LR), support vector machine (SVM), naive Bayes (NB), K-nearest neighbors (KNN), and ensemble methods. Below we provide a brief overview of the method we used in this research.
Logistic Regression: LR is a statistical model that is often used for classification and predictive analysis. LR estimates the probability of an event occurring given a set of indicators (i.e., features). Because the outcome is a probability, it lies between 0 and 1. This model uses a logistic function (sigmoid function) to map the predicted values to probabilities [12]. Logistic regression is commonly used to predict binary target variables, but it can be expanded and further divided into three different types: binomial, where the target variable can only have two types, for example, predicting whether an email is spam; multinomial, where the target variable has more than two types that may not have quantitative meaning, for example, predicting illness; and ordinal, where the categories of the target variable are ordered, for example, a web series rating from 1 to 5.
Cost function: this is a mathematical formula used to quantify the error between the predicted values and the expected values. More specifically, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between x and y. The value returned by the cost function is referred to as the cost, the loss, or simply the error [18].

1.3.4 Performance Measures
To evaluate ML models, performance evaluation metrics are used. Based on their outputs, the performance measurements differ for supervised and unsupervised algorithms. In supervised algorithms, performance measurements rely on the correctly classified labels. Examples of evaluation metrics that can be used with supervised algorithms are accuracy, F1 score, precision, recall, and the receiver operating characteristic (ROC) curve.
• Precision: Precision is calculated as the ratio of the number of positive samples correctly classified to the total number of samples classified as positive (either correctly or incorrectly). Precision measures the model's accuracy in classifying a sample as positive.

P = TP / (TP + FP)    (1.1)

where TP: true positives, FP: false positives.
A high precision indicates that 1) the model makes many correct positive classifications (maximizes true positives) and 2) the model makes fewer incorrect positive classifications (minimizes false positives) [19].
• Recall: Recall is determined as the proportion of positive samples that were correctly identified as positive out of all positive samples. Recall measures how well the model can identify positive samples. The more positive samples that are identified, the higher the recall will be.

R = TP / (TP + FN)    (1.2)

where TP: true positives, FN: false negatives.
Recall cares only about how the positive samples are classified, independently of how the negative samples are classified (unlike precision). When the model classifies all the positive samples as positive, the recall will be 100%, even if all the negative samples were incorrectly classified as positive [19].
• Precision-Recall Curve: The precision-recall curve shows the trade-off between precision and recall for different thresholds. High precision indicates a low false positive rate, while high recall indicates a low false negative rate. A high area under the curve reflects both high recall and high precision. High scores for both indicate that the classifier is yielding results that are accurate (high precision) and that cover most of the positive samples (high recall). The precision-recall curve is recommended when the classes are imbalanced [19].
• ROC Curve: A ROC (receiver operating characteristic) curve is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the true positive rate (TPR) and the false positive rate (FPR). The true positive rate (TPR) is a synonym for recall. The false positive rate (FPR) is defined as follows:

FPR = FP / (FP + TN)    (1.3)

where FP: false positives, TN: true negatives.
• AUC: Area under the ROC Curve: AUC stands for "area under the ROC curve." That is, the AUC measures the entire two-dimensional area underneath the entire ROC curve. The AUC provides an aggregate measure of performance across all possible classification thresholds. The values range from 0 to 1. A model that predicts 100% incorrectly has an AUC of 0.0, while a model whose predictions are 100% correct has an AUC of 1.0. The AUC is recommended when we are looking for a metric that 1) measures the ranking of predictions, not absolute values, and 2) measures the quality of the model's predictions regardless of which classification threshold is chosen [20].
Another factor that affects the choice of the evaluation metric is the nature of the dataset. It would be misleading to evaluate an imbalanced dataset using an accuracy score; in a test set that contains majority and minority examples, a model that predicts the majority class for all examples will have a classification accuracy as high as 99%, reflecting the distribution of the majority and minority examples expected on average in the test set [21]. For this reason, we use the ROC and precision-recall curves in the current research.

1.4 Literature Review
1.4.1 Depression Detection in Social Media
Worldwide, depression affects more than 264 million individuals of all ages [1]. Because of the problem's enormous scope, it is both logistically and monetarily unfeasible to detect early indications of depression solely with human agents. Additionally, research has indicated that people with depression, especially young people, find it more difficult to talk about their feelings in person [2]; however, they are less afraid to share their challenges on social media. This has also sparked the creation of automated algorithms that look for signs of depression in social media to potentially offer an intervention. To build systems that can detect depression, a features-based approach could be used. This approach requires some knowledge of the problem's domain to extract meaningful features from texts. Another approach that is often used involves SOTA deep-learning techniques. Research has shown that deep learning has promising results in a wide range of problems, but it has shortcomings when it comes to clinical contexts, as it compromises clinical relevance. Few-shot learning is the most recent approach for detecting depression. The purpose of few-shot learning models is to improve model generalizability for cases where training data is scarce.
1.4.2 Approaches for Depression Detection in Social Media 1.4.2.1 Features-Based Approaches Classical depression-detection systems rely heavily on expertly crafted features, includ- ing linguistic [22], psycholinguistic [23], textual [24], [3], [25], semantic [25], and sentiment 13 features [26], [27]. These features are often used in conjunction with shallow modeling frame- works (e.g., SVM) for the task of depression detection. However, not all features are equally valuable. For example, Cacheda et al. reported that textual features tend to outperform their semantic counterparts when used for depression detection [25], and Alsagri and Yakhlef found that combining synonyms with LIWC (linguistic inquiry and word count), sentiment analysis, and social activity increases the accuracy of detection models [3]. De Choudhry et al. performed crowd-sourcing to identify Twitter users who were reported as being depressed based on psychometric measures; to identify the symptoms of major depressive disorder, the authors used behavioral characteristics identified under engagement, egocentric social graph, depressive language, emotion, and linguistic style; they reported an accuracy of 70% and a higher precision of 74% for the depression class [28]. Taking a similar route, Tsugawa et al. considered a user’s activity history to collect ground truth data for predicting depression among Twitter users [29]. In this work, the authors used bag of words and word frequencies, in addition to the features used in [28], to identify the ratio of tweet topics. Although subtle differences in behavioral features were found between Tsugawa et al.’s research and that of De Choudry et al., the analysis performed by Tsugawa et al. found similar patterns for the use of negative words, frequency of posting, retweet rate, and URLs contained in tweets. They found that support vector machine (SVM) classifiers using features generated from Twitter user activity resulted in an accuracy of 69%, with a precision of 0.64 and recall of 0.43 [29]. 1.4.2.2 Deep Learning Contemporary systems increasingly leverage representation-learning approaches for de- pression detection. The recent work by Hussein et al. explored the performance of several word-level embedding techniques (random trainable, skip-gram, and CBOW) with convolu- 14 tional neural network (CNN) and recurrent neural network (RNN) models. The researchers demonstrated the superiority of CNNs and RNNs over simple feature-based approaches (SVM classifier using TF-IDF) [7]. More recent work by Zogan et al. used a concatenated CNN and bidirectional gated recurrent unit (BiGRU) to identify social media users at risk of de- pression using temporal, semantic, and behavioral data [30]. Indeed, the general direction of research in this domain focuses on the use of deep-learning techniques. Although this has (in many cases) improved model performance compared with feature-based approaches, it has also (in some cases) compromised the clinical relevance and interpretability of the models. However, clinical relevance is important in the automated depression-detection task to inform an appropriate potential intervention [31]. SOTA deep-learning approaches do not, for instance, provide a way to understand the underlying distorted thinking patterns that may be responsible for depression classification. 
Being able to associate these patterns with the depression classification task can help in managing depression at early stages, with treatment being as simple as cognitive reconstruction therapy [32] [31]. 1.4.2.3 Few-Shots Learning An attempt to detect and confirm symptoms of depression has been created [5]. The researchers proposed a zero-shot learning model that predicts a possible relationship of a sample to an unseen label, that is, to a label that the model did not see during the training. They have proposed a set of depression symptoms and descriptors representing the symptoms (words). For depression detection in a post, they assigned a membership score for each of the symptoms that appears in the post. As a result, they used a set of membership scores as the representation of a post sent to an SVM classifier. The authors reported a significant capability of few-shot learning models compared with the baselines. Additionally, one study constrained the behavior of depression detection methods by the presence of symptoms 15 known to be related to depression (i.e., clinically backed symptoms) while producing a model that is easy to inspect [8]. These researchers have proposed a questionnaire model and depression model; the questionnaire model used a pattern-based method that matched every post against symptom patterns; this model comprises nine symptom classifiers, such as anhedonia, concentration, eating, fatigue, mood, psychomotor, self-esteem, self-harm, and sleep. The model takes BERT embeddings and weakly labeled symptoms data as the input and generates the final question scores (i.e., symptom scores) or the hidden layers (i.e., symptom vectors) of the nine submodels. The depression model then uses the outputs of the questionnaire model to predict depression in posts. The researchers in [8] have shown that their approach performs well compared with strong baselines (unconstrained BERT) while generalizing better results. Constraining the behavior of depression detection models by the presence of depression symptoms has different advantages. This type of model has the advantage of being inherently more reliable than a black-box model because it determines classifications based on the presence of specific symptoms in specific posts so that it can be inspected to assess the quality of the evidence for a diagnosis. Apart from this argument, the model can generalize more effectively by limiting the use of spurious shortcuts. [8]. 1.5 Summary This chapter has introduced the key concepts of NLP and ML that are relevant to the depression detection system designed in the current research. This chapter focused on the commonly used techniques, features, classifiers, and performance measures used for an NLP task. It also presented the existing social media systems that have been developed to identify depression symptoms while providing a literature review that has uncovered the need for a constrained, automated mental illness detection system. 16 Chapter 2: Methodology In this chapter, we first describe the approach we used to detect depression in social media and provided an overview of our final proposed system. Figure 2.1 illustrates the methodological pipeline. Figure 2.1: The methodological pipeline. More details about (1) can be found in 3.1.1, a detailed description of (2) is available in 2.1.1. The differential embedding generation explained in 2.1.4.1 The problem of detecting depression symptoms from posts on social media has been formulated as a binary classification problem. 
The target classes are depressed and non-depressed, where the class indicates the presence of depression symptoms in a post. That is, our depression detection system is required to simply predict whether a text belongs to the depressed class or not. The research questions we are trying to answer in this study are: (1) can we leverage the strength of SOTA deep learning techniques, without any training or fine-tuning, while maintaining clinical relevance in a semi-supervised depression detection system? (2) can dysfunctional thought patterns be effectively used in a depression detection system?

2.1 Data Collection and Processing
For this study, we used two datasets, which we describe in greater detail below.
2.1.1 Depression Symptoms Dataset
In this research, we considered two sets of depression symptoms. More specifically, we combined (1) a set of cognitive depression symptoms that have been used in several studies and clinically confirmed to be depression symptoms and (2) a set of cognitive distortions (i.e., dysfunctional thought patterns) that we propose and investigate as depression symptoms.
2.1.1.1 Dysfunctional Thought Pattern Definitions:
According to cognitive therapy, even a seemingly insignificant event such as forgetting an appointment can cause individuals to feel anxious or depressed if unwarranted negative interpretations are made, such as, "That's just like me; I forget everything," or "I blew it; they'll never want to talk to me again." A negative interpretation usually makes an event appear to be worse than it truly is [33]. Feeling hopeless, guilty, angry, or discouraged can be triggered by thoughts; a core principle of cognitive therapy is that negatively based thoughts contribute to mood (and other) disorders. These thoughts are referred to as "dysfunctional thoughts" or "cognitive distortions." Dysfunctional thoughts are negative perceptions of oneself, others, and the world [34]. Although everyone generates an occasional inaccurate interpretation, depressed individuals have an overall, systematic bias towards dysfunctional thoughts [35]. A significant relationship has been found between the frequency of dysfunctional thoughts and the severity of clinical symptoms [36]. In an attempt to look closely at how users talk about their depression on social media, Lachmar et al. performed a study on the popular hashtag #Mydepressionlookslike [2]. They found that one of the most common themes is using language that shows dysfunctional thought patterns such as fortune-telling, emotional reasoning, labeling, mind-reading, overgeneralizing, personalizing, and "should" thinking. A feasibility test of classifying dysfunctional thoughts automatically was performed by Cromer et al. [37]; their system reliably detects the seven types of dysfunctional thought categories that were mentioned earlier. This automatic identification is based on language markers that make it feasible for the system to distinguish between different types of dysfunctional thoughts. However, no prior work has been done to show the impact of dysfunctional thought analysis on the depression-detection task in social media.
2.1.1.2 Dysfunctional Thought Patterns Dataset
We collected a set of sentences that represent clinically grounded dysfunctional thought categories. The dataset consists of 63 exemplary statements that reflect the presence of the dysfunctional thought categories.
These statements were collected from a review of several mental health journal articles (see Table 1 in the Appendix for complete details). These sentences serve as anchor points for feature matrix generation. Table 2.1 shows examples of the representative sentences of the seven dysfunctional thought categories (a full set is available in Appendix Table 2).

Table 2.1: Dysfunctional thought categories and representative sentences
Category | Example
Mind reading | He thinks I am a loser.
Labeling | I must be a worthless person.
Fortune telling | I will get rejected.
Overgeneralizing | I am going to fail at everything.
Emotional reasoning | My boyfriend is upset; therefore, I must have done something wrong.
Personalizing | The world has got it in for me.
Should & Must | I should always give everything I do 100%

This dataset was used to construct the feature matrix in subsequent steps.
2.1.1.3 Introducing Other Depression Symptoms
Depression symptoms could be self-reported or observable symptoms [38]. Dysfunctional thought patterns are an example of signs that are detected by either an observer or by self-observation. On the other hand, [38] provided a set of self-reported and clinically observed signs. In this work, we used a subset of the self-reported symptoms, such as loss of interest, pleasure loss, inability to feel, etc. We included eight symptoms that we suspected could be found in writing rather than diagnosed clinically. Using the same approach we followed to represent dysfunctional thought categories, we collected a set of exemplary statements as anchor points that reflect eight self-reported symptoms. Unlike dysfunctional thoughts, which appear in one's writing, this set of symptoms is self-reported. Finding a set of representative statements was not feasible due to the nature of these symptoms. To cope with this, we used keywords that could describe each of the symptoms and used them in short sentences. For these symptoms, we used four to five sentences per symptom because, unlike dysfunctional thought categories, we were not looking for varying language markers. Table 2.2 shows the self-reported symptoms included in this work and their representative statements.

Table 2.2: Self-reported symptoms and representative sentences
Category | Example
Loss of insight | Lack of understanding
Pleasure loss | I feel miserable
Interest loss | I am finished with it
Feeling bothered | I am not happy with this
Energy loss | I feel mentally drained
Inability to feel | I feel unmoved
Feeling needed | Be valued at something
Feeling happy | I am over the moon

2.1.2 Depression Detection Dataset
In all experiments, a depression detection dataset included instances from two sources of tweets, representing the two classes:
1. Depressive tweets: contained the #MyDepressionLooksLike hashtag; we made the reasonable assumption that tweets with this hashtag were (1) generated by individuals who self-identify as depressed and (2) would contain self-reported symptoms of the individual's depression. To address why we particularly used the tweets from this hashtag: from a review of multiple publicly available depression datasets, we found that (in most cases) these datasets contain (1) information about the depressed users (with no tweets available) [39], [?], or (2) sentiment tweets, where researchers who base their work on this kind of dataset consider the negative class as the depressed class [41].
However, this research argues that negative tweets should be part of the non-depressed class, as not all "sad" or "negative" users are depressed.
2. Non-Depressive tweets: Tweets were selected from [42], a dataset of 1,600,000 tweets annotated with neutral, positive, and negative scores.
The non-depressive source contains over a million tweets. If we used all of them against the depressive tweets (≈ 2000), severe class imbalance might cause the predictive performance of the machine learning algorithm to be biased toward the majority class during model training. For this reason, we randomly drew a number of samples from the non-depressive tweets to test against the depressive tweets. We evaluated two scenarios: (1) datasets with balanced classes and (2) datasets with imbalanced classes. To get the random samples, we made sure they contain equal portions of positive, neutral, and negative tweets. To assess the statistical integrity of our results, we evaluated our model using multiple random samples from the source data and created multiple datasets. Figures 2.2 and 2.3 illustrate the dataset generation procedure.
Figure 2.2: Random Balanced Datasets Construction
Figure 2.3: Random Imbalanced Datasets Construction
2.1.3 Pre-Processing
We expect linguistic markers to identify dysfunctional thought categories [37]. However, tweets include links, mentions, emojis, and hashtags. Hence, we removed all links, mentions, emojis, and hashtags using regex. Tweets may also contain bad Unicode text such as mojibake (encoding mix-ups). We used ftfy to correct Unicode errors, keeping only the main thought [43].
2.1.4 Feature Matrix Construction
The feature matrix consists of n rows and m columns, where n is the number of tweets and m is the number of features. Each row is a feature vector that is presented as an input to the classifiers. The details of each dataset, n, and m are described in 3.2 and 2.1.2. For this research, we constructed two different feature sets as follows:
1. Tweets were embedded using SBERT.
2. Differential embeddings that denote depression symptoms. A differential embedding represents the distance between two vectors in the embedding space: the embedding of a depression symptom's exemplary sentence and that of a tweet.
2.1.4.1 Depression Symptoms Features Extraction
In [2], ten dysfunctional thought categories are included to study whether these categories can be identified by linguistic markers. These categories are All-or-nothing thinking, Negative predictions, Disqualifying the positive, Emotional reasoning, Labeling, Magnification, Mind reading, Overgeneralization, "Should" thinking, and Personalization. Based on the feasibility test that was performed by Lachmar et al., seven of the dysfunctional thought categories have the highest matches with the linguistic markers that were developed as part of their system. In this study, we included the seven dysfunctional thought types that were shown to be identified correctly. The dysfunctional thought concept was incorporated into the depression detection task by means of differential embedding. Examples of the selected dysfunctional thought categories are listed in Table 2.1.
The other depression symptoms that we considered were provided in [38]. In this article, a set of depression symptoms was discussed. As we are looking for symptoms that could possibly appear in one's writing, only a subset of these symptoms was included in this research.
More specifically, we included eight depression symptoms: Loss of insight, Pleasure loss, Interest loss, Feeling bothered, Energy loss, Inability to feel, Feeling needed, and Feeling happy. Some of these symptoms show a positive correlation with depression, while the presence of others would indicate a non-depressed individual. For instance, pleasure loss is a symptom that would possibly indicate depression. On the contrary, feeling needed and feeling happy are not signs of a depressed individual. This difference may impact further expectations and analysis. Examples of the selected depression symptoms are listed in Table 2.2.
The differential embedding, i.e., the difference between a tweet and the dysfunctional thought anchors, was generated as follows:
1. A set of representative sentences of dysfunctional thought categories was collected from psychological journals (a list of citations is available in Appendix Table 1). We limited the sentences to such journals to ensure that they had been identified by experts. For each category, nine sentences were collected. In addition to dysfunctional thought patterns, a set of representative sentences was generated using keywords to represent the other depression symptoms. For each of these depression symptoms, four to five sentences were generated.
2. The set of depression symptom representative sentences was represented in an embedding space using SentenceBERT. SentenceBERT derives semantically meaningful sentence embeddings that can be averaged or compared using cosine similarity for further uses. The pre-trained model we used in this research is bert-base-nli-mean-tokens, which encodes sentences/texts as 768-dimension vectors.
Figures 2.4 and 2.5 show plots of the first and second dimensions of multi-class linear discriminant analysis (MDA) on the SBERT embedding space. They show the separation between dysfunctional thought categories and between other depression symptoms, respectively.
Figure 2.4: First and second dimensions of MDA on SBERT embedding space that represent dysfunctional thought anchors
Figure 2.5: First and second dimensions of MDA on SBERT embedding space that represent other depression symptoms anchors
Depression symptoms were incorporated into the depression detection task by means of differential embedding. To get the differential embedding, an encoded tweet in a dataset is considered to be one vector, and the other vector is the embedding of an anchor point in the depression symptoms dataset. Therefore, for each of the depression symptoms in the dataset, we encoded and selected the depression symptom's exemplary sentence that is closest to a tweet to serve as a feature in the feature vector. We tested four approaches to incorporating the dysfunctional thought categories, as described below (see the sketch following this list):
• We used the cosine distance between two embedding vectors (tweets and selected anchor points) to generate the feature vectors.
• Instead of cosine distance scores, we concatenated the differential embeddings of the depression symptoms that we selected earlier, based on the distance, and used them as one feature vector.
• We calculated the mean of the differential embeddings of the depression symptoms.
• We used a feature selection algorithm to identify the depression symptoms that contribute most.
Further details and explanation can be found in 3.1 and 3.2.
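As a concrete illustration of how these feature sets can be built, the sketch below combines the pre-processing of Section 2.1.3 with the differential-embedding construction described above. The anchor dictionary, the clean_tweet helper, and the symptom names are illustrative placeholders rather than the exact resources used in this thesis.

```python
# Illustrative sketch of the differential-embedding feature constructions.
# Assumes the sentence-transformers, ftfy, and NumPy packages; anchors below are
# placeholder examples, not the full 15-symptom anchor set used in this work.
import re
import numpy as np
from ftfy import fix_text
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

def clean_tweet(text):
    """Fix bad Unicode and strip links, mentions, and hashtags (Section 2.1.3)."""
    text = fix_text(text)
    return re.sub(r"(https?://\S+|@\w+|#\w+)", "", text).strip()

# Hypothetical anchors: symptom name -> list of representative sentences.
anchors = {
    "overgeneralizing": ["I am going to fail at everything."],
    "feeling_happy": ["I am over the moon."],
}
anchor_emb = {s: model.encode(sents) for s, sents in anchors.items()}

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def differential_features(tweet):
    t = model.encode(clean_tweet(tweet))            # 768-dim tweet vector
    dists, diffs = [], []
    for symptom, embs in anchor_emb.items():
        d = [cosine_dist(t, e) for e in embs]       # distance to every anchor
        i = int(np.argmin(d))                       # closest anchor for this symptom
        dists.append(d[i])                          # approach 1: distance score
        diffs.append(t - embs[i])                   # differential embedding
    return {
        "distance_scores": np.array(dists),         # one value per symptom
        "concatenation": np.concatenate(diffs),     # 768 x #symptoms values
        "average": np.mean(diffs, axis=0),          # 768 values
    }

features = differential_features("I always fail")
```

Each row of the feature matrix is then one of these vectors, depending on which of the four approaches is being evaluated.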
2.2 Features Exploration and Visualization
To better understand the similarity between depression symptom anchors and tweets in our dataset, we measured the cosine distance between a tweet and the closest anchor point of every depression symptom in a balanced dataset. The result is a feature vector that contains the distances to 15 anchor points, each representing one depression symptom. In Figure 2.6, we show the distribution of the cosine distance scores between the tweets and their closest anchor points for the "Emotional Reasoning" symptom. The skewed histogram suggests that emotional reasoning language tends to appear more in tweets that show depression symptoms. This can be inferred by looking at the distances contained in each bin and the number of depression-indicative tweets that each bin represents. This provides some evidence that this category could be discriminative and be used for the prediction task.
Figure 2.6: Cosine distance distribution of the Emotional Reasoning category - 0: "No depression symptoms", 1: "Shows depression symptoms"
On the other hand, Figure 2.7 shows the distribution of the cosine distance scores between the tweets and their closest anchor points for the "Feeling happy" symptom. Looking at the histogram, unlike the histogram in Figure 2.6, we see a higher number of non-depressed instances when the cosine distance score gets closest to 0. This observation means we rarely have tweets that show happy feelings within the depressed class. Though the observation is not surprising, it suggests that this depression symptom can also be used for the depression detection task because of its negative correlation. Figure 4.1 in the appendix shows the distributions for all 15 depression symptoms.
Figure 2.7: Cosine distance distribution of the "Feeling happy" depression symptom - 0: "No depression symptoms", 1: "Shows depression symptoms"
2.3 Supervised Classification Using SBERT Embedding
This machine learning task is a standard binary classification problem, i.e., to predict whether a tweet contains any depression symptoms or not.
2.3.1 Logistic Regression
We tested a simple model (logistic regression). We evaluated how different independent variables affect the outcomes by training on the different feature sets described in 2.1.4.1. One of the main disadvantages of logistic regression is the overfitting that may occur as the number of observations approaches the number of features; to avoid overfitting, we utilized L1/L2 regularization in addition to dimensionality reduction via PCA.
2.3.2 Performance Measurements
Any dataset that exhibits an unequal distribution between its target classes can be considered imbalanced. Commonly, imbalanced data refers to datasets that exhibit significantly or even extremely unequal class distributions. In such cases, we require a classifier that will provide high accuracy for the minority class without severely jeopardizing the accuracy in the majority class. This also suggests that conventional evaluation practice, such as the overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. Therefore, more informative assessment metrics, such as receiver operating characteristic (ROC) curves, precision-recall curves, and cost curves, are necessary for conclusive evaluations of performance in the presence of imbalanced data. In this research, we used AUROC and AUC-PR to evaluate the performance of the supervised learning models.
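For illustration, the snippet below shows how these metrics, together with the partial AUROC at FPR = 0.10 reported in the abstract, can be computed with scikit-learn; the labels and scores are toy values, not results from this work.

```python
# Toy illustration of the evaluation metrics used in this work (scikit-learn assumed);
# y_true and y_score are made-up values, not thesis results.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # 1 = depressed class
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])   # predicted probabilities

auroc = roc_auc_score(y_true, y_score)                         # area under ROC curve
auc_pr = average_precision_score(y_true, y_score)              # area under PR curve
partial_auroc = roc_auc_score(y_true, y_score, max_fpr=0.10)   # partial AUROC @ FPR 0.10

print(f"AUROC={auroc:.3f}  AUC-PR={auc_pr:.3f}  pAUROC@0.10={partial_auroc:.3f}")
```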
Chapter 3: Experiments and Results
For this research, all development was done in Python. Classifiers were trained and evaluated using 10*10-fold cross-validation to help assess overfitting. The results presented in this chapter are for the test set only.
3.1 Experimental Settings
To answer (1) whether we can leverage the strength of SOTA deep learning techniques without any training or fine-tuning while maintaining clinical relevance in a semi-supervised depression detection system and (2) whether dysfunctional thought patterns can be effectively used in a depression detection system, we used SBERT embedding vectors and dysfunctional thought patterns (as depression symptoms). More specifically, we used SBERT embedding vectors to measure the similarities between tweets and clinically backed anchor points as distances in the vector space and fed them directly into the machine learning model.
3.1.1 Datasets
To assess the statistical integrity of our results, we re-performed our experiments on 40 random samples of the data. We then compared the distributions of the results obtained using our approach with those obtained using the baseline method. In this way, we were able to test for the statistical significance of any differences in performance that were observed. The random samples were drawn from the two datasets described in Section 2.1.2 as follows: (1) datasets with balanced classes: each random sample contained a total of ≈4,216 tweets; 2,108 tweets were taken from the #MyDepressionLooksLike hashtag, and the remaining were drawn randomly from the random tweets dataset (30% negative tweets). (2) datasets with imbalanced classes: each random sample contained a total of ≈10,108 tweets; 2,108 tweets were taken from the #MyDepressionLooksLike hashtag, and the remaining were drawn randomly from the random tweets dataset (30% negative tweets).
3.1.1.1 Generalization
One of the useful properties of machine learning is the ability to create a model that can generate accurate predictions for a certain task. An effective machine learning model has the ability to make predictions not only on the data that it has seen but also on data that it has not seen. In a binary classification problem, we can assume there is a perfect model or function that can discriminate between two classes. In the context of a given problem, the perfect discriminant function is likely to have profound relevance to the domain experts. When we build a predictive model, we want to understand that relevance and try to best approximate this perfect discriminant function. To approximate the perfect discriminant function, we use a sample, or a subset, of all possible data collected from the domain. This data contains the structure that is appropriate for the ideal discriminant function. When we prepare the data, we do so in a way that best exposes this structure to the predictive model. However, the data also includes information that is not related to the discriminant function, such as biases caused by the selection of the data and random fluctuations that disguise the underlying structure. For this reason, we aim to create a predictive model that does not model all the noise in the sample but generalizes beyond the seen data. To evaluate a model's ability to generalize from the sample of data, we use data that the model has not seen before or during training. The problem with evaluating using a sample of data that the model was trained on is that doing so prohibits awareness of how well the model will perform on new, unseen data.
If a model is selected for its perfect accuracy on the training dataset rather than on unseen test data, it is very likely that the model will perform poorly on unseen data. This phenomenon, called overfitting, occurs because the model was trained to recognize a specific structure in the training dataset [44].
3.1.1.2 The Overfitting Problem
To deal with overfitting, the dataset is divided into training and test datasets. The predictive model is created using a portion of the training dataset, while the model's performance is tested using the unseen test dataset. Another way to deal with overfitting is through cross-validation. A common example of the use of cross-validation is 10-fold cross-validation. In 10-fold cross-validation, the dataset is split into 10 portions and the algorithm is run 10 times. In each run, the model is trained on 90% of the data and tested on the remaining 10%. The 10% testing portions are different in each run. In this study, all experiments were evaluated using 10*10-fold cross-validation, and the results are reported on the test set.
3.1.2 The Baseline
The baseline for our experiments was the SBERT embedding vectors of tweets fed directly into a machine learning model. This is because the main contribution of this research is the leveraging of SOTA representation learning techniques without model training where 1) the interpretability of the model is important, i.e., in clinical contexts, and 2) there is not enough labelled data. We used the SBERT embedding vectors of the representative sentences in the depression symptoms dataset and computed new vectors using subtract, average, and concatenate vector operations.
3.1.3 Distance Scores
Cosine similarity is often used in text analysis to measure how similar two documents are, regardless of their size. This score determines whether two vectors are pointing in roughly the same direction. In this experiment, we first measured the similarity (i.e., cosine distance) between a tweet's SBERT embedding and the SBERT embeddings of all representative sentences expressing depression symptoms. Table 3.1 presents an example measuring the cosine distance between the encoded "I always fail" sentence and the embeddings of 15 depression symptoms (dysfunctional thought patterns included). In the context of dysfunctional thought patterns, the main category of this type of sentence is identified by an expert as the 'overgeneralizing' thinking pattern. The table shows each depression symptom and the cosine distance scores from the given sentence to the representative sentences of that symptom (only the top four distances are represented in the table, if available).

Table 3.1: Sorted cosine distances to depression symptoms' representative sentences, generated by applying the method to the sentence "I always fail"
Depression symptom | Cosine distances
Overgeneralizing | [0.302, 0.307, 0.363, 0.376]
Labeling | [0.308, 0.406, 0.536, 0.538]
Fortune telling | [0.4, 0.454, 0.531, 0.623]
Personalizing | [0.467, 0.533, 0.565, 0.57]
Inability to feel | [0.472, 0.491, 0.524, 0.605]
Emotional reasoning | [0.481, 0.493, 0.54, 0.689]
Pleasure loss | [0.488, 0.627, 0.673]
Loss of insight | [0.506, 0.508, 0.543]
Interest loss | [0.528, 0.544, 0.799, 0.82]
Mind reading | [0.554, 0.605, 0.624, 0.641]
Feeling bothered | [0.582, 0.641, 0.657, 0.68]
Energy loss | [0.584, 0.734]
Loss of insight | [0.607]
Shoulds and musts | [0.648, 0.686, 0.745, 0.782]
Feeling needed | [0.761, 0.873, 0.951]
Feeling happy | [0.879, 0.903, 0.907]
3.1.4 Mean and Concatenation of Differential Embedding

We also carried out experiments in which the models were fed embedding vectors rather than distance scores. This approach was inspired by the mathematical operations that can be applied to word embedding vectors: various operations, such as summation, averaging, and concatenation, can be used to obtain new embedding vectors. Following the techniques used on word embeddings to obtain document-level embeddings, we used the averaging and concatenation approaches in this research to obtain the final feature vector for each tweet. The concatenation approach is expected to be more informative than the mean because it maintains the original representation of all the depression symptoms. Figure 3.1 illustrates these two approaches, and a sketch of the corresponding operations is given below.

Figure 3.1: Overview of the averaging and concatenating approaches to obtaining new embedding vectors (feature vectors). (1) Measure the distance scores between a tweet and all representative sentences of depression symptoms. (2) Choose the sentence that has the smallest score. (3) Apply the subtract operation to obtain the differential embedding between the tweet's embedding vector and the embedding vector of the chosen sentence. (4) Generate the feature vector using averaging or concatenation.
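Continuing the sketch above (and reusing its assumed model and anchor_emb names), the following shows one way to form the differential embeddings and the averaged and concatenated feature vectors described in this section.

import numpy as np

def differential_embeddings(tweet_emb, anchor_emb):
    """For each symptom, subtract the embedding of the closest representative
    sentence from the tweet embedding, giving one difference vector per symptom."""
    t_unit = tweet_emb / np.linalg.norm(tweet_emb)
    diffs = []
    for sym, embs in anchor_emb.items():
        e_unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        closest = embs[np.argmax(e_unit @ t_unit)]  # largest cosine similarity = smallest distance
        diffs.append(tweet_emb - closest)           # differential embedding for this symptom
    return np.stack(diffs)                          # shape: (num_symptoms, embedding_dim)

tweet_emb = model.encode("I always fail")           # model and anchor_emb come from the earlier sketch
diffs = differential_embeddings(tweet_emb, anchor_emb)
mean_features = diffs.mean(axis=0)                  # "Average" approach: 768 features
concat_features = diffs.reshape(-1)                 # "Concatenation" approach: 768 x num_symptoms features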
3.1.5 Concatenated Differential Embedding of Selected Features

We implemented a feature selection method (i.e., the Sequential Feature Selection (SFS) algorithm) to 1) determine which features should be excluded or included to improve performance and 2) get a general idea of which depression symptoms can be used effectively in a depression detection system. A traditional SFS method treats every column in the dataset as a feature. However, in our dataset, a feature is a depression symptom, and every feature/depression symptom is a set of 768 dimensions/columns in the dataset. Therefore, we modified the original algorithm to account for this special case. We applied SFS to 50 different random datasets that we created as described in Section 3.1.1. In all experiments, SFS consistently selected two depression symptoms: "mind reading" and "feeling happy". Figure 3.2 shows the set of features and the respective performance in each iteration.

Figure 3.2: Feature selection using SFS. Starting from the bottom: depression symptoms are in white, gray squares represent the AUROC of the included depression symptoms starting from an empty feature set, and the AUROC of the best feature in each iteration is identified by the color purple.

Table 3.2 provides a summary of the approaches and their respective features.

Table 3.2: Summary of the approaches. Encoder refers to the transformer used to generate the original embedding, and #Features refers to the original number of features before applying dimension reduction methods. Diff. emb: differential embedding, Dep. symptoms: depression symptoms

Approach | Type of features | #Features
SBERT - baseline | BERT-generated emb | 768
Distance scores | Cosine distance scores | #Dep. symptoms = 15
Average | Mean of diff. emb | 768
Concatenation | Concatenation of diff. emb | 768 x #Dep. symptoms = 11,520
Feature selection | Concatenation of diff. emb of selected features | 768 x #Selected features = 1,536

3.2 Supervised Learning

The classification task began by applying dimension reduction (i.e., principal component analysis) before using both concatenation approaches. Concatenating differential embeddings enlarges the feature space while the number of training samples stays the same, which may lead to overfitting; this is why we applied dimension reduction. In approaches 3 and 4 (the two concatenation approaches), we used 350 and 280 principal components, respectively. We implemented a logistic regression (LR) model on each of the 50 random datasets that we created (described in Section 3.1.1), using L2 regularization and 10*10-fold cross-validation in all LR models. A sketch of this pipeline, together with the model comparison described next, is shown after Section 3.2.0.1.

3.2.0.1 Comparing two models

Comparing machine learning approaches and choosing a final model are frequent operations in applied machine learning. Common resampling techniques for model evaluation include k-fold cross-validation, from which the mean of a model's performance can be derived and directly compared with the means of other models. Although straightforward, this method may be deceptive because it is difficult to determine whether a difference in mean performance is real or a statistical fluke. A difference in model performance is considered statistically significant if the null hypothesis (the assumption that there is no real difference) is rejected. A statistical hypothesis test measures the probability of observing two data samples under the assumption that they have the same distribution. The null hypothesis is the presumption that underlies a statistical test, and we can compute and analyze statistical measures to decide whether to accept or reject it. In the case of selecting models based on their estimated performance, we are interested in knowing whether there is a real, statistically significant difference between the two models. There are two possible outcomes in the comparison of models: 1) if the test's outcome indicates that there is not enough evidence to rule out the null hypothesis, then any observed difference in model performance is probably the consequence of statistical chance, and 2) if the test's outcome indicates that there is enough evidence to reject the null hypothesis, then any observed difference in model performance is probably caused by a difference in the models [11]. The products of a statistical test are a test statistic and a p-value, which can both be interpreted and used to quantify the degree of confidence or significance in the difference between models. In addition to reporting performance metrics, we also report the p-value that indicates whether the difference between two models is statistically significant. In this research, the threshold of the p-value was set to 0.05.
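A minimal sketch of this evaluation pipeline is shown below, assuming the feature matrices for one random dataset have already been built. The thesis does not name the specific statistical test, so the paired t-test over the per-fold scores is only one common choice; likewise, scikit-learn's roc_auc_score with max_fpr=0.1 is one way to obtain a partial AUROC at FPR 0.10.

from scipy.stats import ttest_rel
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed inputs for one random dataset:
#   X_baseline : raw SBERT tweet embeddings, shape (n_tweets, 768)
#   X_selected : concatenated diff. emb of the selected features, shape (n_tweets, 1536)
#   y          : binary depression labels, shape (n_tweets,)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # 10*10-fold CV
scoring = "roc_auc"  # a partial AUROC scorer can be built from roc_auc_score(..., max_fpr=0.1)

baseline = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l2", max_iter=1000))
proposed = make_pipeline(StandardScaler(), PCA(n_components=280),  # 280 components assumed for this approach
                         LogisticRegression(penalty="l2", max_iter=1000))

base_scores = cross_val_score(baseline, X_baseline, y, cv=cv, scoring=scoring)
prop_scores = cross_val_score(proposed, X_selected, y, cv=cv, scoring=scoring)

# Paired test over the 100 per-fold scores; p < 0.05 is treated as significant.
t_stat, p_value = ttest_rel(prop_scores, base_scores)
print(base_scores.mean(), prop_scores.mean(), p_value)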
Figure 3.3 summarizes the LR results. We measured model performance using the AUROC score for each approach. Figure 3.3 illustrates that, across datasets, the baseline and the concatenation of the differential embedding of all depression symptoms show relatively similar performance, whereas the concatenated differential embedding of the selected features performs best.

Figure 3.3: LR results using different approaches on 10 random samples

Using the baseline and the best-performing model from the previous test, we report a precision-recall (PR) curve and a partial ROC curve at a low false positive rate (FPR: 0.10) for the LR model trained on 20 different random datasets (Figure 3.4). The differences in the means of the PR and ROC scores were statistically significant (p < 0.05) for all datasets.

Figure 3.4: a) Partial AUROC scores @FPR: 0.10 and b) AUC-PR scores across different random balanced datasets

Figure 3.5: a) Partial AUROC scores @FPR: 0.10 and b) AUC-PR scores across different random imbalanced datasets

3.3 Discussion

In this study, we proposed a constrained few-shot learning model that makes use of SOTA representation learning techniques and clinically relevant depression symptoms. We used this model in supervised settings on 50 random samples.

3.3.1 Supervised Classification Using SBERT Embedding and Differential Embedding

For standard binary classification, Figure 3.3 shows that the proposed model performs best at a low false positive rate when using the concatenation of the selected-feature differential embedding. Although the improvement is relatively small, this performance is achieved by a constrained depression detection model. More specifically, the depression detection task is constrained to the presence of depression symptoms rather than being a black-box detection task (i.e., as in deep learning models). This distinction makes the proposed model more reliable, with a performance comparable to that of SOTA models (i.e., SBERT). To assess the integrity of the results, we re-performed the experiments using 20 random balanced datasets and 20 random imbalanced datasets. Figures 3.4 and 3.5 show that the improvement is consistent and statistically significant across multiple random balanced/imbalanced datasets (p < 0.05) for all reported results.

3.3.2 Interpretation of the Results

The best-performing model is the one that uses the two depression symptoms selected by forward SFS (i.e., "mind reading" and "feeling happy"). One can intuitively recognize that "mind reading" is positively correlated with depression, while the absence of "feeling happy" is a depression indicator. Figure 3.2 presents the AUROC scores for individual depression symptoms. We observed that "feeling happy" does not perform well by itself, which is likely due to the presence of negative tweets in the non-depressed class; the "feeling happy" symptom is therefore not informative when used alone. This observation confirms the necessity of including negative tweets in the non-depressed class, as doing so changes the way we extract features and makes it more challenging to obtain informative features. Although "feeling happy" is not an informative depression symptom when considered alone, it can significantly improve the model's performance when combined with "mind reading." Combining two correlated features that measure different characteristics provides complementary information to the predictive model. Self-reported depression symptoms that can be identified by asking questions, which implies that the user is aware of their mental health distress, have been the focus of recent works on constrained depression detection systems [5], [8].
However, we based this study on depression symptoms that can be inferred from one's writing (i.e., dysfunctional thought patterns) rather than self-reported symptoms, as we hope this approach will help detect depression symptoms at early stages, even if a user is not aware of their symptoms. Although one of the best features comes from the dysfunctional thought patterns and the other comes from the set of self-reported depression symptoms, the distribution of the distance scores of these symptoms shows less discrimination between the depressed and non-depressed classes (Figure 4.1). In addition to the difficulty of identifying self-reported symptoms in one's writing, this research was limited by the use of manually written representative sentences to express these symptoms, unlike dysfunctional thought patterns, for which confirmed representative sentences exist in the literature.

3.3.3 Comparison with the Literature

Although we propose a few-shot learning method, as other recent studies do, we included a different set of depression symptoms that do not rely on the presence or absence of specific words and that do not require fine-tuning or training [5]. A similar study was undertaken to constrain the behavior of depression detection methods by the presence of symptoms known to be related to depression (i.e., clinically backed symptoms) while producing a model that is easy to inspect. However, we observed a shortcoming in this work: the authors manually selected only neutral and positive examples for the non-depressed class [8]. To address this issue, we proposed a constrained depression detection system and carried out various experiments that included negative tweets. We showed that including negative tweets posed a challenge and changed how well some features performed.

Chapter 4: Conclusion

4.1 Summary and Conclusion

In this research, we attempted to strike a balance between expert-defined features and machine-learned features: we represented 15 depression symptoms as Sentence-BERT embeddings, which are the result of encoding a set of representative sentences for each symptom. To train a depression detection system, we then used the cosine distance to measure the similarity between the tweets and the depression symptoms. Additionally, we used the differential embedding, i.e., the difference between the tweets' embedding vectors and the embedding vectors of the depression symptoms' representative sentences. The results support the theory that depressed individuals on social media use dysfunctional thought patterns more than individuals with no depression symptoms. This research shows that we can perform a classification task based on clinically relevant depression symptoms. Additionally, this research presents a methodology with which we express and incorporate depression symptoms in a depression detection system by means of differential embedding. We showed that the proposed methodology outperformed SOTA embedding generation techniques (i.e., SBERT).

4.2 Future Work

The current research bases its experiments on a set of depression symptoms and representative sentences that were collected and generated manually. Although the current set yields good results, a method to automatically generate representative sentences would provide a greater variety of sentences and would not be limited to what can be found online. Contrastive learning is a technique that has recently been used to train a model to learn sentence representations such that similar samples are closer in the vector space.
Investigating this approach to obtain the representations of the anchor points would be an important next step and is expected to achieve better performance. This technique is also expected to perform well in the unsupervised learning aspect of this research.

BIBLIOGRAPHY

[1] “Depression,” Jan 2020.

[2] E. M. Lachmar, A. K. Wittenborn, K. W. Bogen, and H. L. McCauley, “#mydepressionlookslike: Examining public discourse about depression on twitter,” JMIR Ment Health, vol. 4, Oct 2017.

[3] H. S. AlSagri and M. Ykhlef, “Machine learning-based approach for depression detection in twitter using content and activity features,” IEICE Transactions on Information and Systems, vol. 103, no. 8, pp. 1825–1832, 2020.

[4] R. Chiong, G. S. Budhi, S. Dhakal, and F. Chiong, “A textual-based featuring approach for depression detection using machine learning classifiers and social media texts,” Computers in Biology and Medicine, vol. 135, p. 104499, 2021.

[5] N. Farruque, R. Goebel, O. R. Zaiane, and S. Sivapalan, “Explainable zero-shot modelling of clinical depression symptoms from text,” 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021.

[6] A. Husseini Orabi, P. Buddhitha, M. Husseini Orabi, and D. Inkpen, “Deep learning for depression detection of Twitter users,” in Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, (New Orleans, LA), pp. 88–97, Association for Computational Linguistics, June 2018.

[7] A. H. Orabi, P. Buddhitha, M. H. Orabi, and D. Inkpen, “Deep learning for depression detection of twitter users,” in Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pp. 88–97, 2018.

[8] T. Nguyen, A. Yates, A. Zirikly, B. Desmet, and A. Cohan, “Improving the generalizability of depression detection by leveraging clinical questionnaires,” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.

[9] B. I. C. Education, “What is natural language processing?.”

[10] “Supervised learning vs deep learning: Learn top 5 amazing differences,” Mar 2021.

[11] J. Brownlee, “14 different types of learning in machine learning,” Nov 2019.

[12] J. Brownlee, “Discover feature engineering, how to engineer features and how to get good at it,” Aug 2020.

[13] “What is the difference between deep learning and machine learning? quantdare,” Dec 2019.

[14] I. C. B. Owner, L. B. Developer, N. J. I. Architect, J.-L. M. Professor, and J. H. S. D. Scientist, “Deep learning for natural language processing,” Oct 2021.

[15] R. Horev, “Bert explained: State of the art language model for nlp,” Nov 2018.

[16] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” 2019.

[17] “1.13. feature selection.”

[18] “Perfect recipe for classification using logistic regression.”

[19] “Precision-recall.”

[20] “Classification: ROC curve and AUC | Machine Learning | Google Developers.”

[21] J. Brownlee, “Failure of classification accuracy for imbalanced class distributions,” Jan 2021.

[22] G. Coppersmith, M. Dredze, and C. Harman, “Quantifying mental health signals in twitter,” in Proceedings of the workshop on computational linguistics and clinical psychology: From linguistic signal to clinical reality, pp. 51–60, 2014.

[23] M. R. Islam, M. A. Kabir, A. Ahmed, A. R. M. Kamal, H. Wang, and A. Ulhaq, “Depression detection from social network data using machine learning techniques,” Health information science and systems, vol. 6, no. 1, pp. 1–12, 2018.
[24] T. Shen, J. Jia, G. Shen, F. Feng, X. He, H. Luan, J. Tang, T. Tiropanis, T.-S. Chua, and W. Hall, “Cross-domain depression detection via harvesting social media,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 1611–1617, International Joint Conferences on Artificial Intelligence Organization, 7 2018.

[25] F. Cacheda, D. Fernandez, F. J. Novoa, and V. Carneiro, “Early detection of depression: Social network analysis and random forest techniques,” J Med Internet Res, vol. 21, p. e12554, Jun 2019.

[26] M. De Choudhury, M. Gamon, S. Counts, and E. Horvitz, “Predicting depression via social media,” in Proceedings of the International AAAI Conference on Web and Social Media, vol. 7, 2013.

[27] X. Tao, X. Zhou, J. Zhang, and J. Yong, “Sentiment analysis for depression detection on social networks,” in International Conference on Advanced Data Mining and Applications, pp. 807–810, Springer, 2016.

[28] M. De Choudhury, M. Gamon, S. Counts, and E. Horvitz, “Predicting depression via social media,” Proceedings of the International AAAI Conference on Web and Social Media, vol. 7, pp. 128–137, Aug. 2021.

[29] S. Tsugawa, Y. Kikuchi, F. Kishino, K. Nakajima, Y. Itoh, and H. Ohsaki, “Recognizing depression from twitter activity,” pp. 3187–3196, 04 2015.

[30] H. Zogan, X. Wang, S. Jameel, and G. Xu, “Depression detection with multi-modalities using a hybrid deep learning model on social media,” arXiv preprint arXiv:2007.02847, 2020.

[31] Melinda, “Depression treatment,” Oct 2021.

[32] N. Schimelpfening, “How to positively conquer common cognitive distortions,” Mar 2020.

[33] W. Irwin and G. Bassham, “Depression, informal fallacies, and cognitive therapy,” Inquiry: Critical Thinking Across the Disciplines, vol. 21, no. 3, p. 15–21, 2003.

[34] I. M. Blackburn and K. M. Eunson, “A content analysis of thoughts and emotions elicited from depressed patients during cognitive therapy,” British Journal of Medical Psychology, vol. 62, no. 1, p. 23–33, 1989.

[35] A. T. Beck, Cognitive therapy and the emotional disorders. Penguin Books, 1991.

[36] M. J. Fennell and E. A. Campbell, “The cognitions questionnaire: Specific thinking errors in depression,” British Journal of Clinical Psychology, vol. 23, no. 2, p. 81–92, 1984.

[37] K. Wiemer-Hastings, A. S. Janit, P. M. Wiemer-Hastings, S. Cromer, and J. Kinser, “Automatic classification of dysfunctional thoughts: a feasibility test,” Behavior Research Methods, Instruments, & Computers, vol. 36, no. 2, pp. 203–212, 2004.

[38] E. I. Fried, J. K. Flake, and D. J. Robinaugh, “Revisiting the theoretical and methodological foundations of depression measurement,” Nature Reviews Psychology, vol. 1, no. 6, p. 358–368, 2022.

[39] “There are 6 depression datasets available on data.world.”

[40] Möbius, “The depression dataset,” Feb 2021.

[41] Isanbel, “Depression on twitter,” Jan 2020.

[42] V. Romero, “Detecting-Depression-in-Tweets,” 2 2019.

[43] R. Speer, “ftfy.” Zenodo, 2019. Version 5.5.

[44] J. Brownlee, “A simple intuition for overfitting, or why testing on training data is a bad idea,” Aug 2016.

[45] K. Cherry, “Cognitive psychology is the science of how we think,” Feb 2022.

[46] N. Dinovitz, “Can you read minds? if not, stop trying.... - dinovitz counseling llc: Philadelphia and bala cynwyd therapist,” Sep 2019.

[47] D.
Bryan, Panic Attacks Think Yourself Free: The Self-Help Book to Overcome Panic Attacks. Xlibris, 2011. [48] “Cognitive distortions and thinking errors: Mindreading.” [49] “Thinking traps: 12 cognitive distortions that are hijacking your brain.” [50] B. Elizabeth Hartney, “10 cognitive distortions you’ll learn about in therapy.” [51] A. Bonfil, “Cognitive distortions: Labeling,” May 2017. [52] K. Cherry, “How cognitive behavior therapy works.” [53] “Thinking traps,” Sep 2019. [54] S. S. Casabianca, “15 cognitive distortions to blame for your negative thinking,” Jan 2022. [55] E. McAdam, “Skill 18 cognitive distortions part 1 cognitive behavioral therapy tech- niques,” Aug 2021. [56] “Midtown fellowship.” [57] M. Cortese, S. J. Hemmeter, and N. Schrandt. Practising Law Institute, 2008. [58] D. D. Burns, The feeling good handbook. Penguin, 1999. [59] R. J. Stanborough, “Cognitive distortions: 10 examples of distorted thinking,” Dec 2019. [60] “Cognitive distortions and thinking errors - how can cbt help?,” Apr 2022. [61] “Succeed socially.com: Free social skills guide for adults.” [62] K. Kost, “What are cognitive distortions and why should i care?,” Aug 2021. 48 APPENDIX A: DYSFUNCTIONAL THOUGHT REPRESENTATIVE SENTENCES REFERENCES Table 4.1: Dysfunctional thoughts representative sentences references Representative Sentence Reference 1) I just know that my therapist thinks I am a waste of his time 2) I am a total loser 3) I must have failed that test because I feel so bad about my performance 4) I feel anxious, so I know something dangerous is going to happen [45] 5) he thinks I am a loser [46] 6) John’s in a terrible mood It must have been something I did 7) I could tell he thought I was stupid in the interview 8) I can tell they hate my shirt 9) It is obvious she does not like me, otherwise she would have said hello 10) This relationship is sure to fail 11) I feel hopeless, therefore my situation must be hopeless [47] 12) He is ignoring me so he must not like me anymore [48] 13) I knew they hated me 14) They are all making fun of me behind my back 15) She is bored of hanging out with me 16) I am an awkward person 17) I am a failure 18) I should not eat any junk food [49] 19) I must be a worthless person [50] 20) He is a jerk 21) She is irresponsible 22) He is an idiot 23) I am useless [51] 24) I will never find love or have a committed and happy relationship 25) I will get rejected 26) I will make a fool of myself 27) I have got nothing done 28) I am going to fail everything [52] 29) If I do not get out of here, I am going to faint 30) I am going to make a fool of myself and people will laugh at me 31) I always screw up 32) I must not fail 33) I must get over this fear 34) I should not have made so many mistakes [53] 35) What if I have not turned the iron off and the house burns down 36) If I do not perform well, I will get the sack 37) My neighbor did not speak to me this morning, therefore I must have done something to upset them To be continued 49 Table 4.1 (cont’d) Sentence Reference 38) The world has got it in for me [54] 37) I have always been like this; I will never be able to change 38) He did not want to go out with me, so I will always be lonely 39) I will never be asked on a second date 40) I have the worst luck in the entire world 41) My daughter failed her exam because I have not helped her [55] 42) I never can speak publicly without messing up We were late to the dinner party 43) [56] and caused everyone to have a terrible time 44) I am a terrible speaker and always screw up 
45) I feel guilty, therefore I must have done something bad 46) I feel so depressed, this must be the worst place to work in 47) This shows what a bad mother I am 48) If only I were better in bed, he would not beat me 49) He should not be so stubborn and argumentative [57] 50) I must be a complete loser and failure 51) I am not in the mood to do anything, therefore I might as well just lie in bed 52) I am furious with you, this proves that you have been acting badly and trying to take advantage of me [58] 53) I am a horrible student and should quit school [59] 55) My boss is irritable today so I must have annoyed her 56) It is my fault that my son is not studying 57) My husband hit me because I am a bad wife 58) It is all my fault that the meeting ran on so long [60] 60) I should always give everything I do 100% 61) I must not be rude so others should not be either [61] 62) I really should exercise 63) I should not be so lazy 64) I should pick up after myself more [62] 50 APPENDIX B: DEPRESSION SYMPTOMS REPRESENTATIVE SENTENCES Table 4.2: Depression Symptoms representative sentences Depression Symptom Representative Sentence Mind reading I just know that my therapist thinks I am a waste of his time Mind reading he thinks I am a loser. Mind reading John’s in a terrible mood It must have been something I did Mind reading when you like that I know you were not telling the truth Mind reading he is ignoring me so he must not like me anymore Mind reading I knew they hated me Mind reading they are all making fun of me behind my back Mind reading she is bored of hanging out with me It i obvious she does not like me, Mind reading otherwise she would have said hello Labelling I am an awkward person Labelling I am a worthless person Labelling he is a jerk Labelling she is irresponsible Labelling I am a born loser Labelling I am a phony Labelling I am a failure Labelling He is an idiot Labelling I am useless i will never find love Fortune telling or have a committed and happy relationship Fortune telling I will get rejected Fortune telling I will make a fool of myself Fortune telling If I don not get out of here, I am going to faint I am going to make a fool of myself Fortune telling and people will laugh at me what if I haven not turned the iron off Fortune telling and the house burns down Fortune telling If I do not perform well, I will get the sack Fortune telling I have always been like this; I will never be able to change Fortune telling This relationship is sure to fail Overgeneralising I never can speak publicly without messing up Overgeneralising I have got nothing done Overgeneralising People are all mean and superficial Overgeneralising shopping will always be a stressful experience Overgeneralising I am going to fail everything Overgeneralising all sales clerks are rude Overgeneralising I always screw up To be continued 51 Table 4.2 (cont’d) Sentence Reference Overgeneralising I must be a complete loser and failure Overgeneralising I am a horrible student and should quit school I must have failed that test Emotional Reasoning because I feel so bad about my performance. 
Emotional Reasoning I feel hopeless, therefore my situation must be hopeless Emotional Reasoning I feel guilty, therefore I must have done something bad I am not in the mood to do anything, Emotional Reasoning therefore I might as well just lie in bed I am furious with you, this proves that you have Emotional Reasoning been acting badly and trying to take advantage of me I feel anxious, Emotional Reasoning so I know something dangerous is going to happen Emotional Reasoning I feel so depressed, this must be the worst place to work in my neighbour did not speak to me this morning, Emotional Reasoning therefore I must have done something to upset them Emotional Reasoning my boss is irritable today so I must have annoyed her Personalising It is my fault that my son is not studying we were late to the dinner party Personalising and caused everyone to have a terrible time Personalising this shows what a bad mother I am Personalising I have the worst luck in the entire world Personalising My daughter failed her exam because I have not helped her Personalising My husband hit me because I am a bad wife Personalising if only I were better in bed he would not beat me Personalising It is all my fault that the meeting ran on so long Personalising the world has got it in for me Shoulds and Musts I should always give everything I do 100% Shoulds and Musts I must not fail Shoulds and Musts I must not be rude so other should not be either Shoulds and Musts I should not be so lazy Shoulds and Musts I should pick up after myself more Shoulds and Musts I should not eat any junk food Shoulds and Musts He should not be so stubborn and argumentative Shoulds and Musts I must get over this fear Shoulds and Musts I should not have made so many mistakes Loss of insight lack of understanding Loss of insight insufficient understanding Loss of insight lack of awareness Loss of insight false interpretation Pleasure loss I feel miserable Pleasure loss I feel unhappy To be continued 52 Table 4.2 (cont’d) Sentence Reference Pleasure loss I feel sorrow Pleasure loss life is joyless Pleasure loss I feel distressed Interest loss I am finished with it Interest loss I am sick of it Interest loss everything is boring Interest loss I am done with it Feeling bothered it is disturbing Feeling bothered I feel irritated Feeling bothered I am pissed off Feeling bothered feeling upset Energy loss mentally drained Energy loss I can not leave my bed Energy loss I stay in bed all day Energy loss power draining Energy loss I feel energyless Inability to feel I am unemotional Inability to feel not being able to feel Inability to feel I feel heartless Inability to feel I feel unmoved Inability to feel I feel apathetic towards everything Feeling needed be valued at something Feeling needed feeling needed Feeling needed be wanted Feeling needed My family needs me Feeling needed I help my friends Feeling happy life is joyful Feeling happy I am happy Feeling happy I am in high spirits on the last day of school Feeling happy I am over the moon about being accepted to the university 53 APPENDIX C: DEPRESSION SYMPTOMS VALUES DISTRIBUTION Figure 4.1: Cosine distances distribution of dysfunctional thought categories on a balanced dataset 54