THE EFFECTS OF PARTICIPATION AND FEEDBACK RECEIVED ON THE LENGTH OF TIME MEMBERS IN ONLINE COMMUNITIES REMAIN ACTIVE

By

Chandan Sarkar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Media and Information Studies - Doctor of Philosophy

2013

ABSTRACT

Online communities support extensive interactions among their members. Membership in most of these communities is voluntary, content supplied by other members is typically a primary attractant to new members, and barriers to admission and exit are minimal (Lampe, 2009; Lampe, 2010). For a community to thrive, it is necessary that members remain active in the community and continue to interact with others. Given that sustaining a solid base of active long-term members is critical to the sustainability of an online community, it is important that factors that contribute to the length of active membership are identified. Addressing certain limitations of prior studies, this dissertation examines key factors such as rate of participation, rate of feedback received, early participation and early feedback received that may influence the length of time members stay active in a community. A mixed method approach that included server log analyses for two online communities, Everything2 and Sploder, and qualitative interviews with members of Everything2, was used to study how these factors are related to how long members remain active in a community. A Cox proportional hazard rate model and a Granger causality test were employed to analyze the server log data.

The results suggest that certain types of early participation (first post submitted in Sploder, and first post and first message submitted in Everything2) and certain types of early feedback received (deletion of post in Sploder, and first positive and negative votes and deletion of first post in Everything2) are significant predictors of how long a member remains active in Sploder and Everything2. A member’s average rate of participation (writeups, votes given, and messages sent) is positively correlated with length of active membership in Everything2, but not in Sploder. The rate of feedback received is not significantly correlated with length of active membership in either community.

It is well known that correlational evidence is not dispositive proof of a causal link. Therefore, the relationships between the dependent variable and the independent variables identified by the Cox proportional hazard rate model are further examined using a Granger causality test, with which time series data can be employed for a more rigorous test of causality. The results showed no causality between rate of participation and the length of time a member remains active in a community. Findings from the quantitative studies are then expanded on using interviews with long-term members of Everything2.

These results show that the factors contributing to length of active membership may vary among online communities. While some results may generalize to other communities if the communities are similar enough, not all results generalize. The findings also suggest that early negative feedback has a strong negative impact on how long a member will remain active in an online community, as deletion of a member's first post was significantly and negatively correlated with length of active membership in both Everything2 and Sploder. The implications of these results for the design of online communities are discussed.

For my wife, Jessica Donald Sarkar, and my parents, Sankar & Ajita Sarkar

ACKNOWLEDGEMENTS

This dissertation is the culmination of four and a half years of work, which would not have been possible without the help and support of my family, advisors, committee members, colleagues, friends, and classmates. I am indebted to my advisor, Dr. Steve Wildman. Without his brilliance, experience, support and guidance I would not have been able to reach this final destination. Thank you once again for all the support, the time you invested in me, and your willingness to always guide me.

I would also like to thank my committee members, Dr. Mark Levy, Dr. Steve Lacy and Dr. Susan Wyche, for their support, guidance and encouragement in this process. I would also like to thank Dr. Cliff Lampe from the University of Michigan and Dr. Kurt DeMaagd for advising me during my PhD life. Their expressions helped me to think through this. I would like to thank Dr. Rick Wash and my colleague, Yvette Wohn, for their help in developing this idea.

I would like to thank the Everything2 and Sploder administrators and users for allowing me to use their data for this dissertation. Finally, I would like to thank the College of Communication Arts and Sciences of Michigan State University for providing resources and support for this work.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 An Introduction to Online Communities
  1.1 Research Goals
  1.2 Everything2 and Sploder as Sites for this Study
    1.2.1 Everything2
    1.2.2 Sploder
  1.3 Why Did I Study Two Communities Instead of One?
  1.4 Approaches
  1.5 Dissertation Outline
  1.6 Contributions
  1.7 Chapter Summary

CHAPTER 2 Literature Review on Factors that may Affect Active Membership
  2.1 Membership in Online Communities
  2.2 Participation in Online Communities
  2.3 Early Participation in Online Communities
  2.4 Feedback in Online Communities
  2.5 Early Feedback in Online Communities
  2.6 Use of Social Science Theory in this Dissertation
  2.7 Chapter Summary

CHAPTER 3 Examining Length of Active Membership for Everything2: A Quantitative Study
  3.1 Overview of Everything2
  3.2 Data and Operationalization of Variables in Everything2
  3.3 The Everything2 Data-Set
  3.4 Data and Measures
  3.5 Examining Length of Active Membership Using a Hazard Rate Model
    3.5.1 How is Length of Active Membership Related to the Rate of Participation? (RQ1)
    3.5.2 How is Length of Active Membership Related to Early Participation? (RQ2)
    3.5.3 How is Length of Active Membership Related to Rate of Feedback Received? (RQ3)
    3.5.4 How is Length of Active Membership Related to Early Feedback Received? (RQ4)
  3.6 Examining Causal Links for the Everything2 Community (Granger Causality Tests)
  3.7 How is Length of Active Membership Affected by the Rate of Participation? (RQ1)
    3.7.1 Members’ Participation Affecting Length of Membership
    3.7.2 Concluding Remarks
  3.8 How is Length of Active Membership Affected by Rate of Feedback? (RQ3)
    3.8.1 Feedback Received from Others Affecting Length of Membership
    3.8.2 Concluding Remarks
  3.9 Granger Causality Tests in Regards to Early Participation and Early Feedback Received from Others on the Content
  3.10 Chapter Summary

CHAPTER 4 Examining Length of Active Membership for Sploder: A Quantitative Study
  4.1 Overview of Sploder
  4.2 Data and Operationalization of Variables in Sploder
  4.3 The Sploder Data-Set
  4.4 Data and Measures
  4.5 Examining Length of Active Membership Using a Hazard Rate Model
    4.5.1 How is Length of Active Membership Related to the Rate of Participation? (RQ1)
    4.5.2 How is Length of Active Membership Related to Early Participation? (RQ2)
    4.5.3 How is Length of Active Membership Related to Rate of Feedback Received? (RQ3)
    4.5.4 How is Length of Active Membership Related to Early Feedback Received? (RQ4)
  4.6 Examining Causal Links in the Sploder Community (Granger Causality Tests)
  4.7 How is Length of Active Membership Affected by the Rate of Participation? (RQ1)
    4.7.1 Members' Participation Affecting Length of Membership
    4.7.2 Concluding Remarks
  4.8 How is Length of Active Membership Affected by Rate of Feedback Received? (RQ3)
    4.8.1 Feedback Members Received from Others Affecting Length of Membership
    4.8.2 Concluding Remarks
  4.9 Granger Causality Tests in Regards to Early Participation and Early Feedback Received from Others on the Content
  4.10 Chapter Summary

CHAPTER 5 A Qualitative Study of Everything2
  5.1 Recruitment of Everything2 Participants
  5.2 Interview Protocols
  5.3 Coding
  5.4 Members’ Participation
  5.5 Factors that Reduced Members’ Participation
    1. Deletion of Writeups
    2. Downvotes
    3. The Evolution of the “Wiki-era”
    4. Life-Changing Events
  5.6 Leaving the Community
  5.7 Chapter Summary
  5.8 Concluding Remarks

CHAPTER 6 Discussion of Findings
  6.1 Overview of Results from All Three Studies
  6.2 Implications for Practice
  6.3 Limitations
  6.4 Future Research
  6.5 Conclusions

APPENDICES
  Appendix A: Results from a Cox Proportional Hazard Rate Model with a Cutoff Period of Two Months (Sixty Days) for Everything2
  Appendix B: Variance Inflation Factor Analysis for Everything2
  Appendix C: Survival Function at Mean of Covariates in Everything2
  Appendix D: Model Fit In Terms of a Chi-square Difference for the Hazard Rate in Everything2
  Appendix E: Statistical Tests to Determine a Lag Order
  Appendix F: Results from a Lag Order Test for Granger Causality in Everything2 Pertaining to Length of Active Membership and Participation
  Appendix G: Results from a Lag Order Test for Granger Causality in Everything2 Pertaining to Length of Active Membership and Feedback Received
  Appendix H: Results from a Cox Proportional Hazard Rate Model with a Cutoff Period of Two Months (Sixty Days) for Sploder
  Appendix I: Variance Inflation Factor Analysis for Sploder
  Appendix J: Survival Function at Mean of Covariates
  Appendix K: Model Fit In terms of a Chi-square Difference for the Hazard Rate in Sploder
  Appendix L: Results from a Lag Order Test for Granger Causality in Sploder Pertaining to Length of Active Membership and Participation
  Appendix M: Results from a Lag Order Test for Granger Causality in Sploder Pertaining to Length of Active Membership and Feedback Received
  Appendix N: Interview Questions

REFERENCES

LIST OF TABLES

Table 1-1: Similarities and Differences Between Everything2 and Sploder at a Glance
Table 3-1: definition of participation and feedback variables in Everything2
Table 3-2: descriptive statistics for participation and feedback factors in Everything2
Table 3-3: hazard rate model results on participation and feedback factors and length of active membership (*p < .001)
Table 3-4: Granger causality results whether members’ rate of participation causes their length of active membership (*p < .001)
Table 3-5: Granger causality results whether rate of feedback received causes length of active membership (*p < .001)
Table 4-1: description of participation and feedback variables in Sploder
Table 4-2: descriptive statistics for participation and feedback received in Sploder
Table 4-3: hazard rate model results for participation and feedback factors and length of active membership (*p < .001)
Table 4-4: Granger causality Wald results whether members’ rate of participation causes their length of active membership (*p < .001)
Table 4-5: Granger causality results whether rate of feedback causes length of active membership (*p < .001)
Table 5-1: Everything2 long term interview participants
Table 5-2: key themes identified based on coding
Table A-1: descriptive statistics for the variables with a two month cut off time in Everything2
Table A-2: Omnibus Tests of Model Coefficients for Everything2
Table A-3: Hazard rate model results for participation and feedback factors and length of active membership in Everything2 (*p < .001)
Table A-4: results from a Variance Inflation Factor (VIF) analysis for Everything2
Table A-5: Omnibus Tests of Model Coefficients
Table A-6: lag order results for Granger causality test on length of active membership and participation in Everything2
Table A-7: results from a lag order test for Granger causality on length of active membership and feedback in Everything2
Table A-8: descriptive statistics for the variables with a two month cut off time in Sploder
Table A-9: Omnibus Tests of Model Coefficients for Sploder
Table A-10: Hazard rate model results on participation and feedback factors and length of active membership in Sploder (*p < .001)
Table A-11: results from a Variance Inflation Factor (VIF) analysis for Sploder
Table A-12: Omnibus Tests of Model Coefficients
Table A-13: lag order diagnosis for a Granger Causality Test
Table A-14: lag order diagnosis for a Granger Causality Test

LIST OF FIGURES

Figure 1.1: a sample Everything2 page as it appears to members
Figure 1.2a: the Sploder website as it appears to members
Figure 1.2b: the Sploder game creation interface
Figure 1.2c: the Sploder public games page
Figure 1.2d: the Sploder community discussion forum
Figure A-1: Survival graph for members of Everything2
Figure A-2: Survival graph for members of Sploder

CHAPTER 1

An Introduction to Online Communities

Scholars have been studying communities for a long time. What attributes make a community sustainable over time is still an open question.
A community can be defined as a group of people who share some common attributes such as values, beliefs, expectations, and locality (Zhang, 2012). Every community has its own set of values, beliefs and expectations; some of these attributes may overlap with those of other communities. People may join and participate in one or more of these communities when their expectations, values and beliefs match those of the community. A community is often viewed as a self-contained unit, consisting of a loose collection of individuals, which is continuously shifting due to changes in individual and group behaviors (Wenger, 2001; Delanty, 2010; Durkheim, 1960).

Communities can be classified into two broad categories: offline communities and online communities. In offline communities, members communicate primarily face to face with other members, sharing their experiences, interests, and convictions and interacting with each other [1] (Bender, 1978; Etzioni and Etzioni, 1999). An Amazon Indian tribe is an example of an offline community. An online community can be defined as a network in which members communicate with each other using interactive online tools such as email, discussion boards, online chat, and video. An online community is a collection of voluntary members whose primary purpose is to reinforce the collective welfare of its members; members share their interests, experiences, and convictions and interact with other members primarily using an online medium. Even though membership is voluntary in nature, most online communities require users to register in order to become members of the community.

[1] In general, a member can be any person who joins and participates in a community. This joining can take many forms, depending on the individual community. Some offline communities require registration (e.g., a gym), some only require that people show up (e.g., some churches), while others are based solely on locality (members are part of the neighborhood that they live in simply by living there, regardless of all other factors). To be a part of an offline community, one must live in the vicinity of that community, whereas an online community comprises members from many different localities.

In online communities, members participate in various activities such as political deliberation, maintenance of relationships, information seeking, and social collaboration (Horrigan, 2007). Though the topics, activities, and audiences of these communities vary, many of these communities support extensive, voluntary online interactions among members, and barriers to admission and exit are minimal (Lampe, 2009; Lampe, 2010; Kraut, 2012). Membership in online communities is based on social connections, common interests, and members’ beliefs and expectations about the community.

An online community allows its members to share content through its interface, where content could be an article, a comment, a picture or a video. It is members' contributions of content that are referred to as participation in the literature [2]. Online communities that depend solely on the participation of members to generate content are referred to as content-based communities in the literature (Bowes, 2002; Ren, 2007). Most of today's online communities are content-based communities. Content generated by individual members provides value to other members and to the community in general; this is necessary for a community to thrive over time [3] (Koh, 2007). According to a study conducted by Toral (2009) on Linux port development communities, the success of online communities depends on participation by long-term members. Though the motivation for users to become members in a community varies, the basic fact remains that participation is important for the success of online communities (Lampe, 2010; Kraut, 2012; Burke, 2009; Ren, 2011).

While many online communities are successful, the success of individual communities varies widely (Kraut & Resnick, 2012). Online communities find it challenging to attract a steady stream of contributing members (Richardson, 2010). Communities that fail to retain members turn into “ghost towns” (communities with few or no members) (Phang, 2009). Even popular online communities find it challenging to retain members for a long period of time. An example is the community associated with the massively multiplayer online game World of Warcraft, where a study by Williams (2006) showed that forty-six percent of members leave the community within a month after they join. For MovieLens, an online community about movie recommendations, half of new members cease being active within 18 days of the day they join (Ren, 2010).

[2] It is crucial to note that when a member submits posts, submits votes on a post, or sends messages to others, all of these contributions are considered participation in the research literature. Some studies include feedback as part of participation; however, this study treats members' participation and feedback as two different entities. Throughout this dissertation, whenever feedback is mentioned, it means feedback received from other members. Whenever early participation or rate of participation is mentioned, it refers to all participation variables as a group (discussion articles submitted, votes given, comments sent). Likewise, whenever early feedback received or rate of feedback received is mentioned, it refers to all feedback variables as a group (messages received, votes received, cools received, and deletion of discussion articles).

[3] This value is both social and economic. It provides social value because the more members provide content, and the more content is provided in general, the more the members are able to interact with each other and the larger the community grows. This could also lead to a long-lasting, thriving community. This in turn provides economic value for some communities because larger communities can gain more attention from advertisers (e.g., Facebook, Sploder).

Although a long tail distribution of continued activity is common in these online communities, for online communities to survive over the long haul, members must be retained over time. An indicator of retention is membership length, which is the length of time a user has been a member of a particular online community (Preece, 2001).

There are costs associated with participation in an online community. Some of these costs come when a member first joins, and others during activities such as learning the community’s software, getting acquainted with community norms, making an effort to be noticed by others or integrating socially. Additionally, there is a switching cost if members choose to switch from one community to another. When members switch, it is unlikely that they will be able to share content or communicate with members from the old community. These costs are what make it important to the individual member to remain in a community, as the time and effort put into the previous community no longer generates value if they leave (Kraut, 2012). Research has shown that potential members often explore an online community before they join, a practice called lurking, to learn about the benefits of the community before making a commitment (Nonnecke and Preece, 2000; Nonnecke and Preece, 2001; Nonnecke and Preece, 2003). Members put their time and effort into becoming familiar with the community, both before and immediately after joining.
These costs, when incurred after joining, affect whether members decide to continue to use the community (Arkes, 1985; Arkes, 2000). If we can determine what makes a member actively participate and continue to use a site, we may be able to develop metrics to predict the sustainability of an online community.

Online communities are multifaceted in nature; studies of online communities often focus on certain attributes, such as members’ participation and the feedback they receive from other members on their content. This dissertation focuses primarily on how the length of active membership [4] is related to members’ participation and to the feedback they receive on content they contribute. Online communities are networks composed of diverse sets of members, including members who have met each other face to face or expect to meet each other at some point in the future, and members who do not expect to meet each other face to face at all. Because there are diverse sets of members with different expectations and levels of participation, sustaining long-term membership is a challenge. In this dissertation, I examine how different types of participation and feedback received are related to how long members remain active in two different online communities.

[4] Active membership is the length of time that community members remain active (length of active membership). Active members are members who have registered in a community and still log in to the community to use the services it offers. A member is considered active as long as they log in, whether or not they contribute by submitting posts, votes, or messages.

1.1 Research Goals

This dissertation examines how the participation of members in two different online communities is associated with how long they remain active in their communities and how feedback received from other members on content is related to their length of active membership.
It examines the causal nature of links between members’ participation and their length of active membership, and between feedback received from others and length of active membership. The goal is to understand the various factors that are related to length of active membership in online communities. A better understanding of these factors may inform future research. Findings of this study may also have meaningful implications for those designing online communities.

1.2 Everything2 and Sploder as Sites for this Study

The following section describes the two communities, Everything2 and Sploder, that are examined to answer questions that address the research goals of this dissertation. It also describes some of the tools that members in these two communities use for collaboration and to generate content.

1.2.1 Everything2

Everything2 (http://www.everything2.com) is an online peer-production community (a community that supports collaboration among members to create better-quality products) similar to Wikipedia, but with an emphasis on creative writing. It started in 1998 as a spinoff of the popular news and discussion site Slashdot. Everything2 is a compelling community for this research because it has existed for more than 10 years. It contains a heterogeneous set of members and has records for a very large collection of posts, known as writeups.

Everything2 is structured around writeups, which can be about any topic and in any form of writing (e.g., fiction, non-fiction, personal experience and poetry). Each writeup is written by a single author, who is given credit for the content. Users must register and log in to post writeups; Everything2 refers to these registered users as members. Membership and registration in Everything2 are free. Once registered, members are never deleted from Everything2 for lack of activity, including failure to log back in.

Members of the Everything2 community can rate other members’ writeups by submitting votes, which can be positive (upvote) or negative (downvote), with values of +1 or -1. Members can vote only once on any piece of content (writeup). Positive votes add to the reputation score (experience points, referred to as XP) of the author, and negative votes subtract from it. Members with sufficient reputation in the community can also rate writeups with a cool, which is a type of super-vote (+20). Cools make the content more visible in the community. In Everything2, only members with 2300 or more XP are allowed to submit cools on a writeup.

Writeups with negative ratings are sometimes deleted from the site at the discretion of content editors, a group of volunteers who serve as administrators for the community, if they think the content does not meet the community’s standards. Other reasons for deletion of writeups include copyright violation, overly short contributions, and failure to submit posts in English. Although deletions are made solely by the editors, prior research on deleted content in Everything2 found that deletions are strongly correlated with negative votes (Sarkar, 2012).

While anyone can view the content of Everything2, only users who have become members by creating an account can log in to the community, contribute new content, provide feedback, or communicate with other members. Because of these restrictions, spam is rare on Everything2. Communication within the community can take place through the site’s messaging feature, which is very similar to email in that it allows members to send a private message to others. Everything2 allows the messages to be tracked for each individual member.
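
As an illustration only, the voting and cool rules described earlier in this section can be written as a short Python sketch. The class and method names (Member, Writeup, vote, cool) are hypothetical and are not taken from the Everything2 software. The sketch also assumes that a cool adds 20 points to both the writeup's rating and the author's XP; the text states only that a cool is a super-vote worth +20. Any limits on how often a member may cool a writeup are not modeled.

```python
from dataclasses import dataclass, field

UPVOTE = 1          # value of a positive vote, as stated in the text
DOWNVOTE = -1       # value of a negative vote, as stated in the text
COOL_VALUE = 20     # value of a cool ("super-vote"), as stated in the text
COOL_MIN_XP = 2300  # minimum XP required to submit a cool


@dataclass
class Member:
    """A registered member (hypothetical class, for illustration)."""
    name: str
    xp: int = 0


@dataclass
class Writeup:
    """A writeup with a single credited author (hypothetical class)."""
    author: Member
    rating: int = 0
    voters: set = field(default_factory=set)  # names of members who voted

    def vote(self, voter: Member, value: int) -> None:
        """Record one upvote or downvote; each member may vote only once."""
        if value not in (UPVOTE, DOWNVOTE):
            raise ValueError("a vote must be +1 or -1")
        if voter.name in self.voters:
            raise ValueError("a member can vote only once per writeup")
        self.voters.add(voter.name)
        self.rating += value
        self.author.xp += value  # votes add to or subtract from author XP

    def cool(self, voter: Member) -> None:
        """Record a cool; only members with enough XP may submit one."""
        if voter.xp < COOL_MIN_XP:
            raise PermissionError("not enough XP to submit a cool")
        # Assumption: the +20 applies to both the rating and the author's XP.
        self.rating += COOL_VALUE
        self.author.xp += COOL_VALUE
```

In this sketch, a second vote by the same member raises an error, which mirrors the one-vote-per-writeup rule, and a cool from a member below the 2300 XP threshold is rejected.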

Everything2 is a suitable case to study for this research for the following reasons: 1) Everything2 is a decade-old site, with a diverse set of members of varying membership length and a large volume of articles; 2) the long history of the site and its features has created a rich set of content for analysis; 3) Everything2 is a valid representation of a content-based online community because content is generated solely by members; and 4) Everything2 site administrators have granted me access to nine years and six months of server log data, which provides an opportunity to examine factors that might influence the length of time members stay active over a substantial period of time.

Figure 1.1: A sample Everything2 page as it appears to members. This figure illustrates a sample page of Everything2. The top of the figure has the Everything2 logo with a search feature to search for a writeup on a specific topic. The left of the figure shows the title (removed) of the specific writeup contained on the page, the screen name of the member (removed) who created the writeup, the date the writeup was created, and the text of the writeup. The right of the figure is the individual administration section, where members can log out, change settings, or receive help. Below this box is a box for messages. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

1.2.2 Sploder

Sploder (http://www.sploder.info/) is an informal, game-making, peer-production community for adolescents and young adults, where membership and registration are free. Sploder allows its members to build their own games by using a set of visual tools built in Adobe Flash. They can then submit these games through a web browser interface for other members to play. Sploder contains a heterogeneous set of members and a very large collection of games.
Figures 1.2a, 1.2b, 1.2c and 1.2d show how the Sploder interface appears to members. Members can share their games with other members in Sploder (Figure 1.2a); they can customize their design using the Sploder interface (Figure 1.2b); and once members are satisfied with their creations (games), they can make their games public in Sploder (Figure 1.2c). Sploder also allows members to discuss game-related topics on a discussion forum page (Figure 1.2d).

Figure 1.2a: The Sploder website as it appears to members. This figure shows how the home page of Sploder appears. The top of the figure contains the Sploder logo as well as a log in/sign up link and links to the different pages in the Sploder website. The most popular games on the site are listed below the heading, with reasons to join the community (sharing your games, playing games, and voting) given on the right.

Figure 1.2b: The Sploder game creation interface. This figure shows the page that allows members to create their own games. The top gives options to start creating a new game, load a game, save a game, test a game, and publish a game. The center shows the screen that members use to create the game, with buttons to the left that allow them to add different features to the game.

Figure 1.2c: The Sploder public games page. This figure shows the page that lists the different games in the community. The top of the page shows the member who created the games on this page (name removed). There are options at the top to go to a comprehensive list of games or to a saved list of favorite games. The bottom of the figure is a list of games that have been created by this member. For each game, the list shows the date created, the name of the game (removed), the name of the game creator (removed), the rating of the game (out of five stars), and how many votes, views, and comments the game received.
Figure 1.2d: The Sploder community discussion forum. This figure shows how the discussion forum appears to members. This is the section of the community studied in this dissertation. It lists each discussion article that has been submitted (name removed), the member that submitted it (name removed), how many comments it has received, which member posted the last comment (name removed) and when the last comment was posted.

In the forum, members are allowed to submit posts on a topic. A post is an article-like entry called a discussion article in Sploder. Discussion articles can be of any length. Only registered members in Sploder are allowed to submit games and post in the forum. [5]

[5] It is important to note that the Sploder discussion forum is a community on its own. Membership in the discussion forum requires separate registration and login. Membership can overlap between the original game site and the discussion forum, but being a member of one does not imply membership in the other. In this study, I analyzed the data from the Sploder discussion forum, as I gained access to the forum data only.

Once a member submits a discussion article, other members can provide feedback on the discussion article through whispers and comments. The feedback on a discussion article could be from either a Sploder editor or a regular member. Whispers are one-on-one private communications between an author and a member or an author and an editor. Members can also submit votes (which can only be positive) on a discussion article. The community is managed and administered by a group of volunteer editors. These volunteer editors are appointed by the community manager (Sploder's owner). Editors can delete unwanted comments or discussion articles from the forum at their discretion if they think that a comment or article does not meet community standards.
Sploder is a reasonable online community to study for the following reasons: 1) Sploder is about five years old and has a heterogeneous set of members and a large collection of discussions related to games, 2) the discussion forum is fairly large; nearly 4000 members participate on a regular basis, 3) Sploder administrators have granted access to two years, seven months of server log data.

1.3 Why Did I Study Two Communities Instead of One?

It is often argued that online community studies lack generalizability; findings for one online community often do not resemble findings for another community (Gupta, 2004). There are various reasons why scholars in the past have studied a single community. It is challenging to study multiple online communities at the same time because the interfaces and designs of these communities are complex in nature and may vary substantially. Different communities are also often constructed to serve different purposes.

Despite differences among online communities, most online communities support a peer production process and an environment for collaboration, where members can provide support and feedback to other members. Most of these communities (Wikipedia, Everything2, Sploder) are managed by members who volunteer their time; these members can be active members of the community and they are expected to know the community's norms and regulations. Although these communities are designed to serve different purposes, for many of these communities there is still a large overlap of features, such as voting on other members' posts and sending messages. The various features within these communities' interfaces also often support members' interactions in a similar fashion, and make possible similar types of interactions.

Table 1.1: Similarities and Differences Between Everything2 and Sploder at a Glance

Similarities
Peer production communities.
Members can provide feedback/comments and vote on others' posts.
Administered by volunteers (editors).
Provide a messaging facility for members.

Differences
Everything2: anyone can participate. Sploder: community for young adolescents.
Everything2: centered on creative writing. Sploder: Flash-based game design community.
Everything2: in existence for more than 10 years. Sploder: in existence for 5 years.
Everything2: facilitates all types of creative writing. Sploder: facilitates all types of game-related discussions.

Two different communities, Everything2 and Sploder, are selected for this study to facilitate comparisons that will give a better sense of the extent to which findings for one community might generalize to others.

1.4 Approaches

This dissertation employs a mix of qualitative and quantitative methods to study how members' participation and feedback received from other members influence the length of time they stay active in these communities.

(1) I identify a set of attributes that may be related to length of active membership based on literature reviews from various fields such as human-computer interaction (HCI) and social psychology.
(2) Two different quantitative approaches (empirical methods) are used to examine how two factors, participation and feedback received, are related to length of active membership for two different communities.
(3) A qualitative study is also conducted by analyzing interviews with long-term members of one of the communities to further understand factors that may be associated with length of active membership.
(4) The findings from the analyses are then used to identify practical implications for designing and maintaining online communities.

1.5 Dissertation Outline

In Chapter 1, I identify important attributes of communities that may contribute to the sustainability of online communities. Based on these attributes, I elaborate on the importance of long-term active membership. I also state the research goals of this dissertation.
I describe the two communities I am studying, Everything2.com and Sploder.com, to provide context for later chapters.

Chapter 2 reviews the existing literature on membership in online communities. Based on the literature reviewed, factors that may influence length of active membership are identified. I use the results and the gaps in the current literature to generate high-level constructs (factors such as members' participation and feedback received from others) that could affect length of active membership.

Chapter 3 describes the quantitative analyses conducted to gain insights on how length of active membership is related to different aspects of participation and feedback received from others in the Everything2 online community. Two types of analyses are used to test the research hypotheses: a Cox proportional hazard rate model and a Granger causality test. The hazard rate model is used to test whether there is a statistically significant correlation between members' participation and their length of active membership, or between the feedback they received from others and their length of active membership. Previous studies have pointed to correlations between factors such as participation and feedback received and length of active membership as evidence of causal links between these factors and length of active membership. Because these relationships have plausible non-causal explanations, findings of correlation are not satisfactory evidence of causal connections. In this dissertation, in addition to correlational evidence, I employ a Granger causality test as a more rigorous check on whether these correlations are consequences of causation.

Chapter 4 describes the quantitative analyses conducted to gain insights on how length of active membership is related to different aspects of participation and feedback received from others in the Sploder online community.
Again, a Cox proportional hazard rate model (a form of survival analysis) is used to test whether there is a statistically significant correlation between members' participation and their length of active membership, or between the feedback they received from others and their length of active membership. A Granger causality test is further used as a more rigorous test for causality.

Chapter 5 presents the findings from interviews with long-term members from Everything2. Data from the interviews is analyzed to gain further insights about factors that may affect length of active membership.

Chapter 6 summarizes the research and its contributions and provides recommendations for the design of future online communities. Limitations of the study are also discussed.

1.6 Contributions

This dissertation makes the following contributions:

1. The dissertation's findings should be useful to individuals and organizations who design and maintain online communities. Understanding how participation and feedback factors may affect length of membership in online communities could inform the design of tools that may reduce the burden of peer production editors in managing the community.

2. To my knowledge, previous studies have examined only correlational evidence. Correlation is not dispositive proof of causality, and the observed relationships also have plausible non-causal explanations. Therefore, this dissertation applies a more rigorous statistical test for determining whether causal links exist between members' participation and length of active membership and between feedback members receive and length of active membership.

3. It addresses questions about the generality of findings in the literature on membership in online communities by studying two communities to examine whether the results from the two communities are similar to each other.

4. It also studies members' activities for a period of time longer than two years (and nearly 10 years for one of the communities studied) to examine whether the variables constructed have an effect on length of active membership. Because earlier studies examined online community members for shorter periods of time, the findings of this study shed light on the extent to which relationships found by previous studies persist over longer periods of time.

5. It identifies possible key factors, proposed in the prior research literature, that may influence length of active membership in online communities. It examines how these key factors are related to length of active membership with more rigorous statistical tests. It also considers rate of participation and rate of feedback received, rather than the measures of total participation and feedback received used in prior studies, because members who are active in the community for a longer period of time will most likely have higher measures of total participation and feedback received simply by virtue of having been members of the community longer.

1.7 Chapter Summary

In this chapter, I have used the term 'online communities' to refer to technology-mediated social interactions. The term online community is often used in a broad sense to describe social network sites, social media, and social computing sites. It was not my intention to provide a canonical definition for online communities. Rather, I have considered some of the salient characteristics of communities, such as 1) use of socio-technical features built into the design, 2) interactions of members in the community, and 3) the value of and dependence on user-generated content. I used these characteristics to define an online community for this study. In this chapter, I discussed why length of active membership is important for an online community; based on this, I also provided the overall research goals for this dissertation.
CHAPTER 2

Literature Review on Factors that may Affect Active Membership [6]

Online communities have been observed to suffer very different fates. Some communities, such as Facebook, grow, thrive and seem to endure (relative to the short amount of time it has been possible to have an online community). Other online communities never gain any traction and just disappear, and still others seem to thrive briefly and then decline (e.g., MySpace). A study conducted by Deloitte Development LLC reported that even though 60% of businesses invest time and money into building their own online communities to understand their customers' needs, 35% have fewer than one hundred members (Moran, 2008). Researchers have sought to identify factors that contribute to the sustainability of online communities to determine why some online communities thrive while others do not.

The research reviewed in this chapter suggests that three key factors contribute to the sustainability of an online community: the length of time members stay active in the community [7], the ways in which members engage with the community and the frequency with which they do so [8], and how other members respond to them when they participate.

[6] In this dissertation, an online community is defined as a network in which members communicate with each other using interactive online tools such as email, discussion boards, online chat, and video. The characteristics of an online community include: 1) communication between members of a site, 2) reliance on user-generated content where content is solely generated by the registered members and 3) the use of online tools for the purpose of communication and information sharing.

[7] A member is any person that has registered with a community. Members have to log in after registering to participate. Active members are members who have registered in a community and still log in to the community to use services offered by the community. A member is considered active as long as they log in, whether or not they contribute by submitting posts, votes, messages, or other content.

[8] It is crucial to note that if a member submits posts, submits votes on a post, or sends messages to others, all these contributions are termed participation in the literature. If a member receives votes on their post or receives messages from others, all these interactions are referred to as feedback. Some studies include feedback as part of participation; however, this study considers members' participation and feedback (received) as two different things. It is also important to note that whenever early participation or rate of participation is mentioned, this refers to all participation variables as a group (discussion articles submitted, votes given, comments sent). Likewise, whenever early feedback received or rate of feedback received is mentioned, this refers to all feedback variables as a group (messages received, votes received, cools received, and deletion of discussion articles submitted).

For an online community to thrive, it is necessary that members remain active in the community and continue to interact with each other for a non-negligible amount of time. A community cannot be built solely with people who use it once. Further, if all new members cease being active shortly after joining, the community will eventually exhaust the supply of potential new members and the community will cease to exist. To the extent an online community attracts members by offering the associational benefits of an online community, it must provide an experience sufficiently compelling for them to continue using the service and participating with other members for some time beyond the first time they use the service.

Research to date identifies members' own participation in communities and the feedback they receive from other members as factors that may affect length of active membership. This chapter reviews the relevant literature and identifies gaps in the literature and limitations of prior research. The subsequent chapters present my research, which addresses those gaps and limitations.

2.1 Membership in Online Communities

Studies have found that keeping members actively engaged in an online community over the long term is a challenge, even for the most popular online communities. For example, 60% of members who create a Wikipedia account never log back in after their initial (first) login to submit or edit an article (Panciera, 2009). More than half of the developers who registered to participate in the Perl open-source development community never returned after posting their first message (Ducheneaut, 2005). Given that sustaining a solid base of active long-term members is critical to the sustainability of an online community, and assuming that we want to help these communities thrive, it is important that we identify factors that contribute to the length of active membership [9] in online communities. The factors identified below are members' participation and the feedback they receive from others.

2.2 Participation in Online Communities

Previous studies reported that unequal participation is a common phenomenon in online communities. For example, in a study of Usenet communities, Whittaker et al. (1998) reported that 2.9% of members within the community contributed nearly 25% of the messages, and around 27% of the messages were contributed by members who posted only once. Studies of the P2P file-sharing community, Gnutella, reported similar findings. Eytan's (2000) study found that 25% of Gnutella community members contributed 98% of files, and that 66% of members contributed almost nothing.
A later study reported that nearly 85% of Gnutella members contributed nothing to the community (Hughes, 2005). Based on these two studies, it appears that the longer the Gnutella community exists, the smaller the percentage of members that contribute to the community.

[9] The length of active membership refers to the amount of time a registered member is active in a community. Active refers to any member that has continued to log in to a community.

Another study that examined unequal participation was conducted by researchers from IBM. They reported that within an internal community maintained by IBM, one percent of the members were super contributors; most of the posts came from them. 66% of members were moderate contributors, and 33% were peripheral contributors who contributed almost nothing (Stewart, 2010). Similarly, two other studies reported that 4% of developers contributed 88% of the new code in the open source Apache community (Mockus, 2002), and only 58% of new members to a Usenet group posted a second time (Arguello, 2006). Panciera (2009) further reported that the modal number of edits on Wikipedia is one per person.

All of these studies have reported that participation inequality is prevalent in online communities (Mockus, 2002; Arguello, 2006; Brothers, 1992; Nielsen, 2006; Stewart, 2010; Eytan, 2000). This also suggests that the rate (frequency of participation) at which each individual member participates in a community varies among members. Scholars have studied whether members in a community will continue to participate based on their total participation in the first few weeks (Panciera, 2009; Burke, 2009). However, they did not account for members' rate of participation, which may change over time. [10] Are members who have a higher rate of participation more or less likely to have longer active memberships than those who have a lower rate of participation?
And related to that, how is the rate of members' participation related to their length of active membership? It is not only new members but also long-term members who may stop participating, and this could affect the sustainability of an online community. It is likely that users who are members of a community for a longer period will have a higher total participation (cumulative total of each type of participation, such as posts and votes) than members who are in the community for a short period of time. For example, someone who is a member of a community for five years will have a higher total participation than a member who is in a community for five months if their rates of participation are the same. Thus, rate of participation is an appropriate construct to use in order to predict length of active membership.

[10] Rate of participation was computed by dividing members' total amount of participation (i.e., total posts submitted, messages sent, or votes submitted) by their total length of active membership.

It is important to note that none of the above studies explicitly examined the length of active membership in a community (Panciera, 2009; Burke, 2009; Lampe, 2005). However, there is one study that reported some correlational evidence between length of active membership and members' total participation (Wang, 2012). The study found that messages submitted by members that contain emotional content (e.g., expressing understanding, encouragement, affirmation, sympathy, or caring to others) were positively correlated with how long members remained in a community, and messages submitted by members that contain informational content (e.g., giving advice, referrals or knowledge) were negatively correlated with how long members remained in the community (Wang, 2012). The study did not consider how members' participation changed over time.
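The rate computation described in footnote 10 can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the class name, the field names, and the choice of days as the time unit are hypothetical and are not taken from the Everything2 or Sploder server logs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MemberActivity:
    """Participation totals for one member (hypothetical field names)."""
    posts: int
    messages: int
    votes: int
    active_days: float  # length of active membership; days are an assumed unit


def participation_rates(member: MemberActivity) -> Dict[str, float]:
    """Divide each participation total by the length of active membership."""
    if member.active_days <= 0:
        raise ValueError("length of active membership must be positive")
    return {
        "posts": member.posts / member.active_days,
        "messages": member.messages / member.active_days,
        "votes": member.votes / member.active_days,
    }


# Illustrative numbers only: a five-year member and a five-month member
# whose totals differ widely but whose rates are identical.
five_year_member = MemberActivity(posts=365, messages=730, votes=1825, active_days=1825)
five_month_member = MemberActivity(posts=30, messages=60, votes=150, active_days=150)

long_rates = participation_rates(five_year_member)
short_rates = participation_rates(five_month_member)
```

With these made-up totals, the five-year member has far more posts in total, yet both members post at the same rate. This is the distinction between total participation and rate of participation drawn in the text.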
Further, neither Wang's study nor any other previous study explicitly tests for causal links between members' length of membership and rate of participation. To address this gap in the literature, I examine the relationship between members' length of membership and rate of participation beyond simple correlational tests.

Finally, most prior research on members' activities in online communities examined participation over a short time period, ranging from a minimum of three months (Burke, 2009) to a maximum of sixteen months (Churchill, 2004). An exception to this is Wang's (2012) study, which contained data for nine years, three months. Considering the fact that, except for Wang's study, all the studies I am aware of only looked at data for a maximum of sixteen months, I choose to examine the relationship between members' length of active membership and rate of participation for two online communities (Everything2 and Sploder) over more than two years. I examine length of active membership in Everything2 for a period of nine years, six months and length of active membership in Sploder for a period of two years, seven months. Looking at members' activities over a longer period of time may reveal interesting patterns that were not identified in the literature to this point. The advantage of examining a longer period of time is that it could eliminate the limitations inherent in the shorter observation periods of other studies. This could assist in identifying potential unmeasured causes among variables and may help establish the causality between them.

Examining the effect of participation, more precisely the rate of participation, on members' length of active membership may provide useful insights regarding the sustainability of online communities. Based on this information, my first research question is:

RQ1: How is length of active membership related to the rate of participation of individual members in a community?
For this RQ, I am examining both a correlational and a causal relationship between length of active membership and members' rate of participation in a community.

2.3 Early Participation in Online Communities

Members' early participation might be a significant predictor of their length of active membership. Studies have reported that early participation is a strong predictor of whether a new member will continue using a site (community) (Burke, 2009; Joyce, 2006; Lampe, 2005; Panciera, 2010). These studies examined how members' early participation was related to (positively correlated with) their use of the site for the first few months (maximum 3 months). In this dissertation, I examine how members' early participation is related to their length of active membership for two different communities over more than two years.

I define early participation as whether a member posts a first article in a community, whether a member posts a first message in a community, and whether a member submits a first vote on the content. Each of these is considered early participation. Early participation can also include members' second post, members' second vote on a post, members' second message submitted and so on. However, due to the complexity of the server log data, I have only used first post, first message, and first vote on a post as measures of early participation. My second research question is:

RQ2: How is length of active membership related to early participation of individual members in a community?

2.4 Feedback in Online Communities

Another factor that may affect members' retention is feedback they receive from other members on content they contribute (Burke, 2009; Lampe, 2005). In an online community, feedback gives authors an idea of how their posts are received by the community, and whether the submitted posts need further editing. Feedback can be a comment or vote on a post, or it can be a direct message from other members to an author.
Studies have reported that feedback may influence members' continued use of a community. For example, Zhang and Zhu (2006) reported that feedback in the form of edits from other contributors (editors) reduced the author's incentive to contribute in Wikipedia; Halfaker, Kittur, and Riedl (2011) found that new members' continued use of Wikipedia was negatively associated with feedback in the form of editorial reverts (corrective edits) on their posts. A study by Choi et al. (2010) reported that a positive correlation exists between new members' continued use of Wikipedia and negative feedback in the form of editorial reverts (corrective edits) from the editors. The results from Choi's study differ from those of Zhang's study and Halfaker's study.

Feedback that members receive can be either positive or negative, and it can be positively or negatively associated with members' future use of a community. For example, Choi et al. (2010) found a positive correlation between new members' number of edits for Wikipedia articles and negative feedback from editors. [11] Lampe (2005) reported that feedback in terms of receiving a rating on the first post (binary) is negatively associated with posting a second time. However, whether the rating was positive or negative feedback on the first post was not explicitly reported in this study. Sarkar (2012) found that positive feedback on the content is positively correlated with members' length of active membership.

The results from these studies may indicate that the sustainability of an online community is associated with the feedback members received in some form from other members. To the best of my knowledge, no study has yet tested for causal links between length of active membership and character of feedback. Therefore, my third research question is:

RQ3: How is length of active membership related to rate of feedback individual members receive on the content in a community?
For this RQ, I am examining both a correlational and a causal relationship between length of active membership and rate of feedback received from others on the content.

2.5 Early Feedback in Online Communities

Previous research has found that initial feedback received on an initial post is a predictor of whether or not new members will continue to use an online community (Lampe, 2005; Ren, 2007). For example, Joyce (2006) reported that when new members received responses from others on their first post they were more likely to post a second time, compared with new members who did not receive a response.

[11] Choi studied whether the number of edits was correlated with negative feedback in terms of editorial reverts.

The initial feedback received from others on a member's first post could determine whether they are likely to post in the future (Joyce, 2006; Lampe, 2005; Burke, 2009). Results from these studies suggest that initial feedback received on the content is a strong predictor of whether a new member will continue using a site (community). These studies examined how members' initial feedback received was related to their use of the site for the first few months (maximum 3 months). In this dissertation, I examine how members' early feedback received is related to their length of active membership for two different communities over more than two years.

Early feedback refers to a member's first vote received on a post and first comment received on a post. In this dissertation, I am considering multiple instances of feedback received (e.g., first vote received on a post and first comment received on a post), whereas previous studies have used only one instance of feedback received (e.g., first comment received on a post). Because of this, I have used the term early feedback received rather than initial feedback received. It is important to note that early feedback can also include members' second vote on a post, second comment on a post and so on.
Due to the complexity of the server log data, I have only used first vote on a post and first comment on a post as early feedback received. Thus, my last research question is:

RQ4: How is the length of active membership related to the early feedback individual members receive on their content in a community?

Length of active membership in online communities may also be examined through the lens of social science theories. The next section discusses theories that could explain members' length of active membership in a community.

2.6 Use of Social Science Theory in this Dissertation

Many different types of social science theories which were originally developed for offline communities have been used to explain members' behavior in online communities. However, most of the literature reviewed in this chapter does not describe itself as based on any specific theory (Wang, 2012; Zhang, 2006; Choi, 2010; Priedhorsky, 2007; Panciera, 2009; Hughes, 2005; Arguello, 2006; Mockus, 2002). A reason for this could be that most of these theories were developed for offline communities, an exception being the hyperpersonal model of Walther (2007). Compared to offline communities, online communities are constrained in various ways. A fundamental constraint is how the rules and norms governing participation and membership are developed in online communities compared to offline communities. In online communities, rules are designed by entrepreneurs and designers of different communities, whereas in offline communities, rules evolve endogenously over time as members interact with each other. It is important to note that rules may evolve in an online community; however, any change in the rules still needs the approval of the entrepreneurs and designers before it is implemented.

Even though most studies have not mentioned using a specific theory (or theoretical framework) when they examined members' behavior in online communities, a few studies did use a theory.
For example, Joyce (2006) used the theory of commitment and socialization to groups. The theory explains how individuals and groups change over time; based on the theory, Joyce examined the communication exchange between members in an online community. Farzan (2011) used bond-based identity theory to explain the factors that may contribute to interpersonal relationships among members in a game community. Neither the theory of commitment and socialization to groups nor bond-based identity theory explains length of active membership in online communities. These theories rely on the fact that bonds are an important part of members' interaction, yet these bonds are weak in an online community because members participate from different locations, membership is purely voluntary in nature, and the true identity of the members can be anonymous (Moreland, 2001; Poblocki, 2001).

Another theory used to explain new members' behavior in an online community is social learning theory, which was applied by Lampe (2005) and Burke (2009). The theory states that in a social situation people observe others and learn how to act based on the observation. The theory was initially applied to children observing and learning from adults. In both online and offline communities, people do not always observe others and learn from that observation how to act. This is especially true of adults or people who are already established in a community, because they already have an established behavioral pattern and are less likely to change. The theory can explain new members' behavior in an online community with more success than it can explain long-term active members' behavior (Burke, 2009).
Though none of these theories fully explains members' behavior in online communities, Burke (2009) and Lampe (2005) used social learning theory, and Joyce (2006) used the theory of commitment and socialization to groups, to examine how initial feedback received on a post (first feedback received on a post) could predict members' continued use of a site (that is, the chance of posting a second time). This dissertation builds on the theoretical constructs borrowed from these studies. Hence, I used social learning theory and the theory of commitment and socialization to groups (to a certain extent) to generate my participation variables (such as rate of posts, rate of votes given, rate of messages sent, first post submitted, first votes given) and feedback received variables (such as first vote received on a post, first comment received on a post, rate of votes received and rate of comments received).

It is important to note that the variables extracted for the quantitative studies in Chapter 3 and Chapter 4 are derived from these theoretical constructs (theories) in a broad sense. I have not used these theoretical constructs to identify the specific variables incorporated in the study (first vote received, first post submitted, etc.); rather, I used them to generate high-level constructs such as members' participation and feedback received from others.

2.7 Chapter Summary

The long-term viability of online communities, which is associated with members’ length of active membership, depends on several factors. The important question is what these factors are and how they can be generalized across communities. In this chapter, I have reported on the existing literature in terms of two key factors that could affect members’ length of active membership: members' participation and the feedback they receive from others on their content. I have illustrated some of the gaps in the existing literature and proposed four research questions to address them.
CHAPTER 3

Examining Length of Active Membership for Everything2: A Quantitative Study

Online communities support extensive interactions among members with minimal barriers to admission and exit (Lampe, 2009; Lampe, 2010; Kraut, 2010). Membership in these communities is voluntary. Most communities are content-based communities, where content is generated by individual members. Because their content is generated by members, the success of online communities is predominantly dependent on long-term active membership (Toral, 2009; Lampe, 2010; Kraut, 2010; Burke, 2009; Ren, 2011). For a community to thrive, it is necessary that members remain active in the community and continue to interact with each other. Given that sustaining a solid base of active long-term members is critical to the sustainability of an online community, it is important that factors that contribute to the length of active membership are identified.

Research to date has identified two key factors that may contribute to the length of time members remain active in a community (length of active membership): the ways and frequency with which members engage with the community (members’ participation) (Burke, 2009; Arguello, 2006; Neilson, 2006) and how other members respond to them when they participate (feedback received from others) (Zhang, 2006; Halfaker, 2011; Choi et al., 2010; Lampe, 2005; Kraut, 2007; Joyce, 2006). Studies of participation found that members' own participation may be predictive of whether they remain active in a community (Burke, 2009; Arguello, 2006; Neilson, 2006). Other studies have reported that feedback received from other members may influence members' continued use of a community (Joyce, 2006; Lampe, 2005).

While these studies have added to the knowledge of what may contribute to the sustainability of online communities, there are a few key points that need to be addressed to solidify our understanding of this process.
In general, studies have not looked into these points, and more than one study should look into each to confirm the findings of other studies. Also, addressing all of these key points in a single study may add to the understanding of what factors increase the amount of time a member remains active in a community.[12]

1. Studies to date have not explicitly sought to identify factors that influence length of active membership in a community.

2. Studies to date also have not addressed the implications of unequal participation and feedback received over time. They have only looked into total participation and feedback received over time, but such participation and feedback received could vary. Rate of participation may be a better indicator of members’ activity level, as members who have been active for a longer period of time will most likely have a larger total participation.

3. Prior research has reported findings for analyses of data from online communities covering members' activities over short periods of time, ranging from three months (Burke, 2009) to sixteen months (Churchill, 2004). Because members may remain active participants in online communities for many years, studies using data collected over much shorter participation periods may not accurately generalize to time spans that extend much beyond the lengths of the periods studied.

4. All prior research has used correlational evidence when reporting how participation and feedback received are related to members’ continued activity in a community. While the correlational findings have been interpreted as evidence of causation, there are also plausible non-causal explanations for these correlational relations, and it is generally understood that correlational evidence is not dispositive proof of a causal link. No previous study I am aware of explicitly tests for causal links between length of active membership and members’ participation or between length of active membership and feedback received from others.

5. Prior studies have each examined only one community at a time and have not tested to what extent their findings can be generalized to different communities. This is the reason why Chapter 3 and Chapter 4 of this dissertation examine two separate communities.

[12] In general, studies do not take these five points into account; this does not mean that no study has considered them, though no single study I am aware of has taken all five into consideration. For example, Wang’s (2012) study reported correlational evidence on the relationship between length of active membership and members’ participation using nine years and three months' worth of data, though the study did not examine causal links or take into account changes in the rate of participation. Cheshire’s (2008) study used rate of participation, but the study did not examine length of active membership or causal links and only studied data covering a period of seven months.

This chapter addresses these gaps in the research literature by using data preserved on server logs for Everything2, an online community that allows its registered members to interact with other members, to statistically examine the effects of characteristics of their own participation and feedback received from other members on the length of time that community members remain active (length of active membership).[13]

[13] In this dissertation, a member is any person who has registered with a community. Members have to log in after registering to participate. Active members are members who have registered in a community and still log in to use services offered by the community. For this study, a member is considered active as long as they log in, whether or not they contribute by submitting posts, votes, or messages. An online community is defined as a network in which members communicate with each other using interactive online tools such as email, discussion boards, online chat, and video. The characteristics of an online community include: 1) communication between members of a site, 2) reliance on user-generated content, where content is solely generated by the registered members, and 3) the use of online tools for the purpose of communication and information sharing.

The chapter first describes the server log data and the operationalization of variables for this study. It examines the length of active membership through the lens of a Cox proportional hazard rate model, a statistical technique (a type of survival analysis) that is used to examine the influence of certain explanatory variables (independent variables) on the length of time from registering that a member remains active in a community. Hazard rate models are used to examine factors that may influence the amount of time that elapses before a discrete event, such as an individual catching a disease or adopting a new product, occurs. In this dissertation, the event is cessation of active membership. Previous studies have used hazard rate models to examine correlations between members' participation and their continued use of a community (Wang, 2012; Yang, 2010; Farzan, 2011). A hazard model identifies correlations, and it is well known that correlational evidence is not dispositive proof of causation; in the case of the relationships examined here, there are plausible non-causal explanations for the identified correlations. This dissertation therefore also uses a Granger causality test to more rigorously test for causal links between members’ participation and length of active membership and between feedback received and length of active membership.
3.1 Overview of Everything2

Everything2 (http://www.everything2.com) is an online peer-production community (a community where members work in collaboration to create better quality products) similar to Wikipedia, but with an emphasis on creative writing. Content in the community is solely generated by members. Everything2 contains a heterogeneous set of members and a very large collection of posts, known as writeups.[14] Writeups can be about any topic and in any form of writing, such as fiction, non-fiction, personal experience, or poetry. Each writeup is written by one author, who retains complete ownership. However, the authors can, and often do, take advice from other community members to improve the quality of their writeups.

[14] Everything2 is a global community. A recent visitors’ profile, found on thatweb.com, showed that visitors to Everything2 come from many different countries. Though the majority of visits (40.8%) come from the United States, visitors also come from India, United Kingdom, Philippines, Canada, Australia, etc.

Members of the Everything2 community can rate other members’ writeups by submitting votes, which can be positive (upvote, +1) or negative (downvote, -1). Members can only vote once on a piece of content. Positive votes add to the experience points (commonly referred to as XP) of the content author, and negative votes subtract from them. Members can gain XP when they submit a writeup or submit a cool. Members with sufficient points in the community can also rate writeups with a cool, which is a type of super-vote (+20). Cools make the content more visible in the community than regular votes. In Everything2, only members with 2300 or more XP are allowed to submit cools on a writeup. Writeups with negative ratings are sometimes deleted at the discretion of the content editors, a group of volunteers who serve as administrators for the community, if they think the content does not meet the community’s standards.

Everything2 supports a messaging feature for its members, which is very similar to email in that it allows Everything2 members to send private messages to other members. Everything2 allows the messages to be tracked for each individual member in order to provide better services to the community member. Messages are also tracked so that administrators can view and monitor them if needed. Members have to register with the community, which is free, and log in to be able to submit writeups or interact with other members. Once registered, members are never evicted from Everything2’s membership, even though they may have been inactive for a long time.

3.2 Data and Operationalization of Variables in Everything2

I examine the length of active membership for Everything2 members, where length of active membership is operationalized as the number of days from the date a member’s account is created to the member’s last login date, both of which are recorded on the server log. Prior literature suggests that members’ participation and feedback received from other members should contribute to length of active membership (Burke, 2009; Joyce, 2007; Lampe, 2005). To study factors that may affect members’ length of active membership, I have constructed the following measures of participation: rate of participation, which is the total count of each type of participation (writeups submitted, votes given, and messages sent) divided by the total length of active membership, and early participation, which includes a member’s first writeup, first upvote submitted, first downvote submitted and first message sent.
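The operationalization above can be illustrated with a minimal sketch. It assumes a member record that holds an account creation date, a last login date, and total counts per participation type; the function names, dictionary keys, and example values are hypothetical and are not taken from the Everything2 server logs.

```python
from datetime import date


def active_days(created: date, last_login: date) -> int:
    """Length of active membership: days from account creation to last login."""
    return (last_login - created).days


def participation_rates(counts: dict, days: int) -> dict:
    """Divide each total count by the length of active membership in days."""
    if days <= 0:
        raise ValueError("length of active membership must be positive")
    return {f"rate_of_{name}": total / days for name, total in counts.items()}


# Hypothetical member (illustrative values only).
days = active_days(date(2005, 1, 1), date(2005, 4, 11))
rates = participation_rates(
    {"writeups": 10, "votes_given": 50, "messages_sent": 5}, days
)
```

For this made-up member, 100 days of active membership and 10 writeups give a rate of 0.1 writeups per day. The guard against a non-positive length is a choice made for the sketch, so that a division by zero is reported explicitly.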
Similarly, to examine the effects of different types of feedback received on length of active membership, I have constructed the following measures of feedback received: rate of feedback received, which is the total count of each type of feedback received (votes received, messages received, cools received, and fraction of deleted writeups) divided by the total length of active membership, and early feedback received, which includes a cool received on a member’s first writeup, upvote received on a first writeup, downvote received on a first writeup, and deletion of a first writeup. These measures have been constructed from data preserved on Everything2 server logs.[15]

[15] Throughout this dissertation, when feedback is mentioned, it always refers to feedback a member received from other members. Participation always refers to the participation by a member whose length of membership is being examined. Some forms of feedback (especially votes) can be participation for others. If a member submits writeups, votes, or messages, it is termed participation, and if the same member (based on user_id) receives votes or messages from others, it is termed feedback. Participation and feedback variables are measure id(s) (or proxy ids) with records of activities and contributions preserved on server logs. It is also important to note that whenever early participation or rate of participation is mentioned, this refers to all participation variables as a group (discussion articles submitted, votes given, comments sent). Likewise, whenever early feedback received or rate of feedback received is mentioned, this refers to all feedback variables as a group (messages received, votes received, cools received, and deletion of discussion articles).

The goal is to examine whether members' participation (rate of participation and early participation) and feedback received from others (rate of feedback received and early feedback received) on their content are related to their length of active membership. I first use a hazard rate model to estimate the effects of participation and feedback received on length of active membership. I then employ a Granger causality test to examine the extent to which the correlational evidence generated by the hazard rate model does in fact reflect causality between the dependent variable and the independent variables in the model.

3.3 The Everything2 Data-Set

I collected information for all Everything2 members (100,682) who created an account from November 11, 1999 to May 25, 2009. Table 3.1 gives the definitions for the participation and feedback variables employed for this study of the Everything2 community. The data set contains timestamps for members' last login and members' account creation dates and times.[16] It also contains total counts since the first login for each type of participation variable and total counts for each type of feedback variable, which I have used to calculate the rate of participation and rate of feedback received. For each member, the data set contains timestamps for participation activities (such as posts submitted, votes given and messages sent) and timestamps for feedback received (such as votes received, messages received, or deletion of a writeup).

[16] The data I gained access to did not contain other login information, only the last login. While other logins may have been captured, I did not receive that information.
Table 3.1: Definition of participation and feedback variables in Everything2

Dependent variable
  Length of Active Membership: Amount of time a member has been active (last login date - account creation date) in the community

Rate of Participation
  Rate of Writeups: The total count of writeups a member submitted divided by the amount of time the member was active in the community
  Rate of Messages sent: The total count of messages a member sent divided by the amount of time the member was active in the community
  Rate of Votes given: The total count of votes a member submitted divided by the amount of time the member was active in the community

Rate of Feedback Received
  Rate of Messages received: The total count of messages a member received divided by the amount of time the member was active in the community
  Rate of Votes received: The total count of votes a member received divided by the amount of time the member was active in the community
  Rate of Cools received: The total count of cools a member received divided by the amount of time the member was active in the community
  Fraction of deleted Writeups: The total deleted writeups divided by the total amount submitted

Early Participation
  First Writeup submitted or not: Whether or not a member submitted a first post (writeup)
  First Message submitted or not: Whether or not a member sent a first message to another member
  First Downvote submitted or not: Whether or not a member submitted a first downvote
  First Upvote submitted or not: Whether or not a member submitted a first upvote

Early Feedback Received
  First Upvote received on First Writeup: Whether or not a member received a first upvote on first writeup
  First Downvote received on First Writeup: Whether or not a member received a first downvote on first writeup
  First Cool received on First Writeup: Whether or not a member received a first cool on first writeup
  Deletion of First Writeup: Whether or not a member’s first writeup was deleted

Control
  Familiarization Time: The amount of time between a member creating an account and posting a first writeup

Participation variables include rate of writeups submitted, rate of messages sent, and rate of votes given. With everything else equal, users who are members of a community for a longer period will have a higher total participation (the sum of each type of participation, such as total posts or total votes) than members who are in the community for a shorter period of time (Wang, 2012). I used rate as a measure to account for the effect of unequal lengths of participation. For each of these variables, rate was calculated by dividing a member’s total count for the variable[17] by the member’s length of active membership. For example, a member’s rate of writeups is calculated by dividing the member's total writeups submitted by the member’s length of active membership.

[17] It is important to note that the server log data I have access to does not contain total upvotes given, total downvotes given, total upvotes received, or total downvotes received for individual members. Rather, it contains total votes given and total votes received. It also contains early feedback (such as first upvote submitted on first writeup, first downvote submitted on first writeup) and members’ early participation activities (such as first writeup submitted or not, first upvote submitted or not, first downvote submitted or not). It does not contain timestamps for every individual participation activity or every individual feedback received. The rate of votes given was measured for individual members by the sum of upvotes and downvotes given divided by total length of active membership. Also, Everything2 did not automatically log a member out if a member closed the browser without signing out. In this instance, the server logs did not capture the next login. This is not accounted for in the analysis, as this information is not recorded in the server log.

Participation activities for members vary from community to community. In Everything2, these participation activities (features) are broken down into a relatively small number of discrete types of activities that are explicitly named and supported by the community, such as submitting a writeup, submitting a vote, and submitting a message. If members use these features, they may become more engaged in the community and stay active longer.

Four early participation variables were used to examine the effects of early participation on length of active membership. The variables include: 1) whether or not a member submitted a first writeup, 2) whether or not a member sent a first message, 3) whether or not a member submitted a first upvote, and 4) whether or not a member submitted a first downvote. Early participation variables are binary.

Both general feedback variables and early feedback variables were constructed from the server logs. General feedback variables include rate of messages received, rate of votes received, and rate of cools received. In the analysis, I also included the fraction of writeups deleted as a general feedback variable. The fraction of writeups deleted was computed by dividing total deleted writeups by total writeups.[18] I used the first cool received on first writeup, the first upvote received on first writeup, the deletion of first writeup, and the first downvote received on first writeup as early feedback variables.

Early feedback received from others can be either positive or negative. To assess the effect of early positive feedback received from others on the length of active membership, I used a) whether or not a first cool was received on first writeup and b) whether or not a first upvote was received on first writeup.
Similarly, to assess the effect of early negative feedback received from others on the length of active membership, I used a) whether or not a first downvote was received on first writeup and b) whether or not a member’s first writeup was deleted (deletion of first writeup). Early feedback variables are binary.

[18] Fraction of deleted writeups is also included as a control variable. While not a rate, it can be multiplied by rate of writeups submitted to derive a rate for deleted writeups. For convenience of exposition, I will list fraction of deleted writeups as one of the rate of feedback received variables for the remainder of this chapter.

In addition to these variables, I have also included familiarization time (the time between creation of the account and the first writeup) as a control variable for this analysis. When members join a community, it takes a certain amount of time to become familiar with the community before they submit their first post (writeup). Familiarization time provides a measure (a proxy) of how long a member takes to become familiar enough with both the use of features in a community and the norms of participation in a community to submit their first post.

3.4 Data and Measures

First, as a diagnostic test, I ran a missing value analysis test (MVA test) on the data. The MVA test detected both missing values and outliers in the data. The test used a grid search approach to detect missing values. It also used the range (Mean - 2*SD, Mean + 2*SD), where SD is the standard deviation, to detect outliers. The analysis found 200 rows with missing values or outliers in the data. These 200 rows were removed from the sample.[19]

Next, members (registered users) whose length of active membership was less than a day were removed from the dataset. These registered members created an account in the community but did not log in after a day. It is important to note that these registered members who did not log in after a day also did not post anything.
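The two screening steps just described (dropping rows with missing values or values outside Mean ± 2*SD, then dropping members active for less than a day) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: the source does not say which variables the outlier rule was applied to or whether a sample or population SD was used, so the sketch screens a single hypothetical field with the sample SD. All names and data values are made up.

```python
from statistics import mean, stdev


def two_sd_bounds(values):
    """Return (mean - 2*SD, mean + 2*SD) for the observed values."""
    m, sd = mean(values), stdev(values)  # sample SD: an assumption of this sketch
    return m - 2 * sd, m + 2 * sd


def screen_members(rows, field):
    """Drop rows with a missing or outlying `field`, then rows active < 1 day."""
    observed = [r[field] for r in rows if r[field] is not None]
    low, high = two_sd_bounds(observed)
    kept = [r for r in rows if r[field] is not None and low <= r[field] <= high]
    return [r for r in kept if r["active_days"] >= 1]


# Hypothetical rows: eight ordinary members plus three problem cases.
normal = [0.04, 0.05, 0.03, 0.04, 0.05, 0.04, 0.03, 0.05]
rows = [
    {"id": i + 1, "active_days": 120, "rate_of_writeups": r}
    for i, r in enumerate(normal)
]
rows.append({"id": 9, "active_days": 300, "rate_of_writeups": 5.0})   # extreme value
rows.append({"id": 10, "active_days": 45, "rate_of_writeups": None})  # missing value
rows.append({"id": 11, "active_days": 0, "rate_of_writeups": 0.04})   # active < 1 day

cleaned = screen_members(rows, "rate_of_writeups")
kept_ids = [r["id"] for r in cleaned]
```

In this toy data the extreme rate, the missing rate, and the member with zero active days are all removed, leaving members 1 through 8.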
These excluded members had zero active usage. This reduced the sample size to 39,904 unique members.[20]

[19] Additionally, I produced a P-P plot (normal probability plot) as a check on the MVA test. A probability plot reports values against a straight line and shows the deviation of points from that line. The P-P plot also reported 200 missing values.

[20] In this chapter, I have explained the results for members whose length of active membership is over a day. The cut-off point in this case is a day. Please note that I also conducted a separate analysis excluding members who did not log in after 60 days from their account creation; these members have a length of active membership of less than 60 days. I reported these results in Appendix A. The results for members whose length of membership is over a period of 60 days (N=21,909) are similar to those for members whose length of membership is over a day. The cut-off period of 60 days is used as a proof of concept. This technique in data mining is known as sensitivity analysis (Yang, 2010).

I checked for multicollinearity associated with length of active membership, members’ participation, and feedback received from others using a Variance Inflation Factor (VIF) analysis. A VIF value of 5 or above usually indicates multicollinearity among the variables. I found VIF values between 1.007 and 2.609 for all variables. The VIF test for multicollinearity confirms that all of the individual participation variables and all individual feedback variables that have been used in the analysis are not collinear with each other. This means that no participation variable is collinear with any other participation variable, nor is any participation variable collinear with any feedback variable. No feedback variable is collinear with any other feedback variable. The VIF values also showed that length of active membership is not collinear with any participation or any feedback received variable.
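As a rough illustration of the VIF check, the sketch below computes variance inflation factors as the diagonal of the inverse of the predictors' correlation matrix, which is equivalent to 1 / (1 - R²) from regressing each predictor on the others. It uses only the Python standard library; the function names and the two toy variables are hypothetical, and the software actually used for the analysis is not stated in this section.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF for each predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[j][j] for j in range(len(columns))]


# Two hypothetical predictors with correlation 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vifs = vif([x1, x2])
```

With two predictors correlated at 0.8, each VIF is 1 / (1 - 0.64), about 2.78, which is below the threshold of 5 mentioned above.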
The VIF results are reported in Appendix B.[21]

[21] This also suggests members’ participation and feedback received from others on the content they contribute represent two different constructs.

The average length of active membership for members in the Everything2 community is 442 days. Table 3.2 presents descriptive statistics for the variables.

Table 3.2: Descriptive statistics for participation and feedback factors in Everything2

Variable                                    N       Min    Max       Mean     S.D.
Length of Membership                        39904   1      3451.28   442.44   726.19
Rate of Writeups                            39904   0      7.0       .040     .235
Rate of Messages sent                       39904   0      11.0      .022     .211
Rate of Votes given                         39904   0      20.5      .083     .632
Rate of Messages received                   39904   0      10.4      .002     .055
Rate of Votes received                      39904   0      10.0      .004     .113
Rate of Cools received                      39904   0      3.3       .270     .307
Fraction of deleted Writeups                2662    .30    1.0       .598     .274
First Writeup submitted or not              39904   0      1.0       .370     .483
First Message submitted or not              39904   0      1.0       .340     .472
First Downvote submitted or not             39904   0      1.0       .330     .469
First Upvote submitted or not               39904   0      1.0       .890     .317
First Upvote received on First Writeup      39904   0      1.0       .105     .306
First Downvote received on First Writeup    39904   0      1.0       .835     .370
Deletion of First Writeup                   39904   0      1.0       .200     .399
First Cool received on First Writeup        39904   0      1.0       .220     .414
Familiarization Time                        39904   0      5.0       .340     .786

3.5 Examining Length of Active Membership Using a Hazard Rate Model

I used a Cox proportional hazard rate model to examine how participation and feedback factors are associated with length of active membership in the Everything2 community. A Cox proportional hazard rate model is a statistical technique (commonly referred to as survival analysis) for analyzing time to an event (Cox, 1972; Cox, 1984). The hazard represents a specific event and is often interpreted in terms of survival (Wang, 2012). In this study, the event occurs when an active member ceases being active in the community.
The hazard rate is the probability of the event occurring in a specific period of time. The coefficients of the model are estimated in terms of a member's hazard ratio, which is the probability of an event occurring in a specific period of time compared to the probability for the control (Smith, 2003; Therneau, 2000). The control is a group of constructed (hypothetical) members whose values are assigned (Smith, 2003; Therneau, 2000). The values assigned to the control group differ for binary and continuous variables. For example, for all rate of participation and all rate of feedback received variables, the assigned values for the control group are the average rate of participation and the average rate of feedback received per day. For early participation and feedback, the control is the group of members who did not participate (submit a first writeup, submit a first vote) and who did not receive early feedback (first upvote received, first downvote received) from others.

Since this is a ratio, the comparison between the two groups is performed by dividing the first group by the control. The first group is actually multiple groups, each group being compared to the control. For example, one group is a group of members with a one-unit increase in the variable value compared with the control; another group is a group of members with a two-unit increase in the variable value compared with the control. The software accounts for all of these comparisons, including their variances, and produces a single hazard ratio.

A hazard ratio is interpreted based on whether the ratio is greater than or less than 1.000. If it is greater than 1.000, then the probability of the event occurring increases compared to the control. If it is less than 1.000, then the probability of the event occurring decreases compared to the control.
If it is 1.000, then the probability of the event occurring is the same for both groups (no difference in survival between the control and the first group).

I used a Cox proportional hazard rate model rather than other statistical models because identifying the actual point in time at which a member becomes inactive is a challenge. Long-term inactivity does not preclude members from coming back; such interrupted inactivity could add bias to the results. It is possible that members who did not log in for the last six months may log back in. Standard statistical regression models (such as Ordinary Least Squares regression and logistic regression) do not accurately estimate time to an event (Wang, 2012; Smith, 2003). A Cox proportional hazard rate model is used in this case to predict future events or the failure of an event, such as when an active member becomes inactive and vice versa.

A Cox proportional hazard rate model can estimate the hazard ratio of an event as a function of multiple explanatory variables (independent variables, commonly referred to as covariates in the model). This type of model is often used in disease contagion studies, with the state measured being health status (individual has or has not caught the disease by time t) (Smith, 2003). A Cox proportional hazard rate model can be represented as:

$$h(t) = h_0(t)\,\exp\Big(\sum_i b_i z_i\Big)$$

where h(t) denotes the hazard at time t (here, the risk that a member's active membership ends at time t) given the explanatory variables z_i (such as rate of writeups, rate of messages given, rate of votes received, first writeup submitted, first upvote received, etc., for each individual member), and b_i is the coefficient for explanatory variable z_i. The term h_0(t) is called the baseline hazard for the model. A baseline hazard is the hazard when all independent variable values are equal to zero. In this case, I used a Gompertz distribution, which is a commonly used statistical distribution for proportional hazard rate models.
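To make the link between a coefficient and its hazard ratio concrete, here is a minimal sketch of the proportional hazards form with a Gompertz baseline, written as h0(t) = lam * exp(gamma * t). The parameter values, the coefficient, and the function names are hypothetical; nothing here is estimated from the Everything2 data.

```python
import math


def gompertz_baseline(t, lam, gamma):
    """Gompertz baseline hazard h0(t) = lam * exp(gamma * t)."""
    return lam * math.exp(gamma * t)


def hazard(t, coefs, covariates, lam=0.001, gamma=0.0005):
    """Proportional hazards: h(t) = h0(t) * exp(sum of b_i * z_i)."""
    linear = sum(b * z for b, z in zip(coefs, covariates))
    return gompertz_baseline(t, lam, gamma) * math.exp(linear)


b = 0.55  # hypothetical coefficient for a single covariate

# Hazard ratio for a one-unit increase in the covariate, at two time points.
ratio = hazard(100, [b], [1.0]) / hazard(100, [b], [0.0])
ratio_later = hazard(1000, [b], [1.0]) / hazard(1000, [b], [0.0])
```

The baseline cancels in the ratio, so the ratio equals exp(b) at every t. That is the quantity an Exp(B) column reports: a value above 1.000 means a higher hazard of the event than the control, and a value below 1.000 means a lower one.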
A Gompertz distribution is a density function [22] that can take many different shapes, as it is a flexible distribution. The dependent variable is length of active membership, which is measured in days. Rates of participation (rate of writeups submitted, rate of votes submitted, and rate of messages sent) and rates of feedback received from others (rate of votes received, rate of messages received, rate of cools received and fraction of deleted writeups), as well as early participation (first writeup submitted or not, first message submitted or not, first upvote submitted or not and first downvote submitted or not) and early feedback received from others (first upvote received or not, first downvote received or not, first cool received or not and deletion of first writeup), are used as explanatory variables. [23] The Exp(B) column in Table 3.3 gives the estimated coefficient values for the explanatory variables in terms of hazard ratios, which tell us whether an explanatory variable in the model is related to the probability of members remaining active in the community.

[22] A probability density function is a function that represents the likelihood that a continuous random variable takes a specific value. The probability that the variable falls within a given range of values is given by the integral of the density over that range.

[23] This is just a brief summary of the Cox proportional hazard rate model. For more information please refer to http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf and https://mywebspace.wisc.edu/jmullahy/web/basu%20manning%20mullahy.pdf

Factor                      Variable                                    Exp(B)    SE      Sig.
Rate of Participation       Rate of Writeups                            1.733*    .060    .000
Rate of Participation       Rate of Messages sent                       2.715*    .083    .000
Rate of Participation       Rate of Votes given                         1.850*    .051    .000
Rate of Feedback Received   Rate of Messages received                   .001      2.789   .010
Rate of Feedback Received   Rate of Votes received                      .993      .646    .992
Rate of Feedback Received   Rate of Cools received                      1.030     .112    .790
Rate of Feedback Received   Fraction of deleted Writeups                .912      .092    .313
Early Participation         First Writeup submitted or not              1.331*    .072    .000
Early Participation         First Message submitted or not              .377*     .084    .000
Early Participation         First Downvote submitted or not             1.745     .579    .336
Early Participation         First Upvote submitted or not               1.174     .081    .048
Early Feedback Received     First Upvote received on First Writeup      1.678*    .133    .000
Early Feedback Received     First Downvote received on First Writeup    1.266*    .073    .001
Early Feedback Received     First Cool received on First Writeup        4.814     .917    .087
Early Feedback Received     Deletion of First Writeup                   .420*     .082    .000
Control                     Familiarization Time                        .916      .034    .009

Table 3.3: hazard rate model results on participation and feedback factors and length of active membership (*p <.001)

Due to the large sample size, statistical significance is considered at the p=.001 significance level only (Jensen 2007). Statistical significance should be examined along with the effect size as indicated by the hazard ratio. This is the same for all of the variables.

3.5.1 How is Length of Active Membership Related to the Rate of Participation? (RQ1)

The coefficient estimates for the explanatory variables and their associated significance levels from the Cox proportional hazard rate test, where the dependent variable is length of active membership, are reported in Table 3.3. The study examines how three different explanatory variables that address RQ1 (rate of writeups, rate of messages sent, and rate of votes given) are related to length of active membership. A binary status variable, censored, was also included in the model. The censored variable was constructed using the last login date and the last posting date. Censoring reflects the possibility that an inactive member becomes active again in the future. If the difference between last login and last posting date is less than sixty days, the member is considered as non-censored. [24]

[24] In Everything2, if the time between members' last login and last post exceeds 60 consecutive days, they are not likely to post again and thus they are considered censored. 70% of members did not submit a post if the censoring time is over sixty days. This implies 70% of members were considered active and 30% as censored at the beginning of the analysis. The model considered 30% as censored, which means they are still included and may come back in the future. The model estimates the probability of them coming back (internally), reclassifies them (if necessary) and reports a hazard ratio. All of this is done internally in the software using algorithms. Previous studies have used 70% as a cutoff point for censoring using survival analysis (Yang, 2010). In online community research, a 70% cutoff is an accepted cutoff. I use a survival analysis, a Cox proportional hazard rate model, to account for this censoring of the variables.

In this study, the coefficients from the Cox proportional hazard model are interpreted in terms of the probability of members remaining active (survival) rather than becoming inactive. The hazard ratio for the rate of writeups is 1.733, which suggests that a unit increase in writeups per day (rate of writeups) will increase the probability of members remaining active in the community by 73.3% {(1.733-1)*100%}; this value is significant at the p=.001 level. [25] Similarly, the hazard ratio for the rate of messages sent is 2.715, which suggests that a unit increase in messages sent per day (rate of messages sent) will increase the probability of members remaining active in the community by 171.5% {(2.715-1)*100%}; this value is significant at the p=.001 level. The hazard ratio for the rate of votes given is 1.850, which suggests that a unit increase in votes given per day (rate of votes given) will increase the probability of members remaining active in the community by 85.0% {(1.850-1)*100%}; the value is significant at the p=.001 level.
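The percentage figures quoted in this section all follow the same arithmetic, (hazard ratio - 1) * 100%. A minimal Python sketch of that conversion is given here, using hazard ratios reported in Table 3.3; the function name is a hypothetical helper introduced only for illustration.

```python
def percent_change_from_hazard_ratio(hazard_ratio):
    """Convert a hazard ratio into a percentage change relative to the control.

    Positive values correspond to ratios above 1.000, negative values to
    ratios below 1.000, following the interpretation used in this chapter.
    """
    return (hazard_ratio - 1.0) * 100.0

# Hazard ratios as reported in Table 3.3.
reported = {
    "Rate of Writeups": 1.733,
    "Rate of Messages sent": 2.715,
    "Rate of Votes given": 1.850,
    "First Message submitted or not": 0.377,
}

changes = {name: percent_change_from_hazard_ratio(hr) for name, hr in reported.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

A ratio below 1.000, such as .377, yields a negative change (a 62.3% reduction), which matches the sign convention {-(.377-1)*100%} used in the text for reductions.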
These results show that the rate of participation variables and length of active membership in the community are positively correlated. One possible explanation could be that as members' participation increases, their involvement in the community increases, and since they invest more time, they have more interest in continuing their active membership. Another reason could be that some members, who by their nature derive more pleasure from online communities, are naturally motivated to participate more and to stay active longer, so that both members' rate of participation and their length of active membership are determined by their personality type.

[25] This is a comparison between a group of members whose rate of participation, i.e. writeups, votes, comments, increases by one unit and the control group of members whose rate of participation, i.e. writeups, votes, comments, is the average. The unit increase implies that an explanatory variable (continuous in nature, such as rate of writeups) increases by one standard deviation per day.

3.5.2 How is Length of Active Membership Related to Early Participation? (RQ2)

Four different explanatory variables that respond to RQ2 (first writeup submitted or not, first message submitted or not, first downvote submitted or not, first upvote submitted or not) were included in the model. Table 3.3 reports the regression coefficients for early participation factors. The hazard ratio for first writeup submitted or not is 1.331, which suggests that submitting a first writeup will increase the probability of members remaining active in the community by 33.1% {(1.331-1)*100%}; the value is significant at the p=.001 level. [26] Perhaps submitting their first writeup indicates that they are interested in becoming involved in the community, since it requires more effort on the part of the member to submit a writeup than other forms of participation.
Another reason could be that members who derive more pleasure from online communities are more likely to submit a first writeup and also more likely to remain active because of the pleasure derived from submitting writeups. The hazard ratio for first message submitted or not is .377, which suggests that submitting a first message will reduce the probability of members remaining active in the community by 62.3% {-(.377-1)*100%}; the value is significant at the p=.001 level. One possible explanation could be that members' first message was a reaction to negative comments or downvotes they received on their initial writeups, which might discourage further participation in the community. [27] Future research should look into this more deeply.

[26] This is a comparison between a group of members who engaged in early participation, i.e. first writeup, first message, first vote, and the control group of members who did not engage in early participation, i.e. first writeup, first message, first vote.

[27] I have speculated a possible reason based on the results from the hazard rate model.

The hazard ratio for first downvote submitted or not is 1.745, which suggests that submitting a first downvote will increase the probability of members remaining active in the community by 74.5% {(1.745-1)*100%}; the value is NOT significant at the p=.001 level. Similarly, the hazard ratio for first upvote submitted or not is 1.174, which suggests that submitting a first upvote will increase the probability of members remaining active in the community by 17.4% {(1.174-1)*100%}; the value is NOT statistically significant at the p=.001 level. These findings suggest that members' early participation has some impact on their length of active membership, as their first upvote and first downvote did not show any statistical significance but their first writeup and first message did.
Maybe because submitting their first writeup and first message requires more effort on the part of the member than submitting upvotes and downvotes, members are more concerned about these forms of early participation.

3.5.3 How is Length of Active Membership Related to Rate of Feedback Received? (RQ3)

Four different explanatory variables that address RQ3 (rate of messages received, rate of votes received, rate of cools received, and fraction of deleted writeups) were included in the model. Table 3.3 reports the regression coefficients for rate of feedback received. The hazard ratio for the rate of messages received is .001, which suggests that a unit increase in messages received per day (rate of messages received) will reduce the probability of members remaining active in the community by 99.9% {-(.001-1)*100%}; the value is NOT significant at the p=.001 level. [28] The hazard ratio for the rate of votes received is .993, which suggests that a unit increase in votes received per day (rate of votes received) will reduce the probability of members remaining active in the community by .7% {-(.993-1)*100%}; the value is NOT significant at the p=.001 level. The negative correlation between rate of votes received and length of active membership is perhaps due to the fact that a downvote has more of an impact than an upvote on members' desire to participate, and for reasons explained in footnote 14, rate of votes combines downvotes and upvotes. The hazard ratio for the rate of cools received is 1.030, which suggests that a unit increase in cools received per day will increase the probability of members remaining active in the community by 3% {(1.030-1)*100%}; the value is NOT significant at the p=.001 level. The hazard ratio for the fraction of deleted writeups is .912, which suggests that a unit increase in the fraction of deleted writeups will reduce the probability of members remaining active in the community by 8.8% {-(.912-1)*100%}; the value is NOT significant at the p=.001 level.

[28] This is a comparison between a group of members whose rate of feedback received from others, i.e. votes, messages, cools, increases by one unit and the control group of members whose rate of feedback received from others, i.e. votes, messages, cools, is the average.

The results show that none of the feedback variables are significantly correlated with length of active membership. Perhaps it is members' rate of participation, and not the rate of feedback received from others on the content, that was more important to individual members in choosing to continue using a community. Rate of participation appears to be more important when members decide whether to remain active in this online community than rate of feedback received on the content submitted.

3.5.4 How is Length of Active Membership Related to Early Feedback Received? (RQ4)

Four different explanatory variables that address RQ4 (first cool received on first writeup, first upvote received on first writeup, deletion of first writeup, and first downvote received on first writeup) were included in the model. Table 3.3 reports how the regression coefficients for early feedback received are related to length of active membership. The hazard ratio for first cool received on first writeup is 4.814, which suggests that receiving a first cool on first writeup will increase the probability of members remaining active in the community by 381.4% {(4.814-1)*100%}; the value is NOT significant at the p=.001 level. [29] The hazard ratio for first upvote received on first writeup is 1.678, which suggests that receiving a first upvote on first writeup will increase the probability of members remaining active in the community by 67.8% {(1.678-1)*100%}; the value is significant at the p=.001 level. It could be that members placed a high value on receiving an upvote. Maybe receiving a cool is a rare event, so not enough members receive a cool to make it significant. Upvotes could occur more frequently because members do not have to reach the same status to submit an upvote as to submit a cool. Alternatively, if members are introduced to the community by other members, they are more likely to receive an upvote on the first writeup from their friends and likely to remain longer because they value the connections to their friends.

[29] This is a comparison between a group of members who received early feedback, i.e. first upvote on a writeup, first downvote on a writeup, first message, and first cool on a writeup, and the control group of members who did not receive early feedback. Even though receiving a first cool was not significant, the higher coefficient for receiving a first cool could suggest that members' contributions are of high quality and are valued by members with higher XP's. It can be expected that members who are producing higher quality posts are likely to stay active in the community. Hence, first cool received on a first writeup can be an indicator to administrators of whether a member is likely to remain active.

The hazard ratio for first downvote received on first writeup is 1.266, which suggests that receiving a first downvote on first writeup will increase the probability of members remaining active in the community by 26.6% {(1.266-1)*100%}; the value is significant at the p=.001 level. Perhaps getting recognition on a first writeup is enough to encourage members to remain active, even if it is negative. Previous studies have reported that some forms of feedback encourage members to post a second time (Lampe, 2005; Joyce, 2006). Future research should look into this in more depth. The hazard ratio for deletion of first writeup is .420, which suggests that receiving a first deletion on a writeup will reduce the probability of members remaining active in the community by 58% {-(.420-1)*100%}; the value is significant at the p=.001 level. One possible explanation could be that only editors can delete a writeup, so this is stronger negative feedback than a downvote because members may place more value on feedback from editors than feedback from other members. These results suggest that early feedback received from others has significant impact on length of active membership, as three of the variables show significance. Based on these results, members appear to place more emphasis on the feedback they receive from their first writeup than on feedback received from subsequent writeups.

Familiarization time was used as a control in the model to account for how long a member takes to start participating in a community. The hazard ratio for familiarization time is .916, which suggests that a unit increase in familiarization time will reduce the probability of members remaining active in the community by 8.4% {-(.916-1)*100%}; the value is NOT significant at the p=.001 level. This suggests that the longer a member's familiarization time, the shorter their length of active membership in a community. One possible explanation could be that members who take longer to post have a harder time understanding the community and find it more difficult to use when they start using the different features. This could make them less likely to remain active. On the other hand, it could be that members who take longer to start participating just have less interest in the community to begin with, and for this reason they are likely to cease being active earlier. Please refer to Appendix C for a length of active membership survival graph and to Appendix D for more information about the model's fit.
3.6 Examining Causal Links for the Everything2 Community (Granger Causality Tests)

Because there are plausible non-causal explanations for the statistically significant relationships revealed by the Cox hazard regression, I also employ a Granger causality test to more rigorously test for causality. It is important to note that if the results from a Granger causality test are not statistically significant, we can rule out any possible causal links among variables. However, if the results from a Granger causality test are statistically significant, the evidence for a causal relationship is only stronger.

The intuition behind a Granger causality test is that if changes in one variable cause changes in a second variable, then the value of the first variable in any given period should be correlated with the value of the second variable in a subsequent period or periods (Granger, 1969). [30] A Granger causality test provides stronger evidence for or against causality than the statistical significance of simple regression coefficients. The temporal evidence for causality is derived from time series data. [31] In this study, the model relies on the prediction that length of active membership in period N will be correlated with participation in the community during periods N-1, N-2, etc. A Granger causality test simultaneously tests for each direction in which a causal relationship might run between two variables. In this study, I examine whether members' participation is causally linked to their length of active membership and whether the length of active membership is causally linked to members' participation.

3.7 How is Length of Active Membership Affected by the Rate of Participation? (RQ1)

3.7.1 Members' Participation Affecting Length of Membership

A Granger causality test examines the relationship between two variables using lag orders, where a lag order is the number of measurement periods for explanatory variables that is included in the model.
A period could be of any length, such as a day, a month, or a year. For this study, a single lag order is a period of six months. I used the statistical software package STATA to determine the period of lag order and conduct the Granger causality test. STATA selected six months as the appropriate unit of time for measuring lags based on its assessment of the data. [32]

[30] A Granger causality test establishes whether a causal link among variables exists based on simultaneous correlations among variables (current instance and previous instances among variables) from time series data.

[31] I am unable to ascertain true causality, as I do not have information on why the members actually left. Using a time series model, I am attempting to ascertain possible causes (by moving backward in time).

Using the server timestamps from the logs in Everything2, I derived lag orders for explanatory variables (rate of writeups, rate of votes given, rate of messages) and the dependent variable (length of active membership). I examined the possible causality between explanatory variables and the dependent variable. For example, the server log contains timestamps for writeups, timestamps for votes, and timestamps for messages. From these timestamps, I constructed lagged values for variables and examined the possible causal links among variables. For a Granger test, time series data, a collection of observations made sequentially in time, is decomposed into a stationary trend and residuals (often known as random shocks). A time series is called stationary if certain statistical properties (mean, standard deviation, and autocorrelation) of the time series are constant. In a dynamic world, no trend will be stationary forever. However, if a time series contains stable trends during the observation period, the time series is termed trend stationary.
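To make the construction of lagged values from server timestamps more concrete, the following Python sketch groups a member's activity timestamps into six-month periods and shifts the resulting series by a given lag. The function names, the example timestamps, and the month-based approximation of a six-month period are all assumptions made for illustration; the actual log processing used in this study may differ.

```python
from collections import Counter
from datetime import datetime

def half_year_index(timestamp, start):
    """Index of the six-month period containing timestamp, counted from start.

    Periods are approximated by calendar months (an assumption).
    """
    months = (timestamp.year - start.year) * 12 + (timestamp.month - start.month)
    return months // 6

def per_period_counts(timestamps, start, n_periods):
    """Number of events (e.g. writeups) in each six-month period."""
    counts = Counter(half_year_index(ts, start) for ts in timestamps)
    return [counts.get(k, 0) for k in range(n_periods)]

def lagged(series, lag):
    """Value of the series `lag` periods earlier; None where no earlier period exists."""
    return [None] * lag + series[:len(series) - lag]

# Hypothetical member: joined in January 2010, four writeups over two years.
joined = datetime(2010, 1, 15)
writeup_times = [
    datetime(2010, 2, 1),
    datetime(2010, 5, 20),
    datetime(2010, 9, 3),
    datetime(2011, 8, 30),
]

writeups = per_period_counts(writeup_times, joined, n_periods=4)
writeups_lag1 = lagged(writeups, 1)
writeups_lag2 = lagged(writeups, 2)
```

Here `writeups` holds the per-period counts, and `writeups_lag1` and `writeups_lag2` align each period with the counts from one and two periods earlier, which is the form in which lagged explanatory variables enter a time series regression.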
The Granger test, which utilizes a time series model, can be represented (in terms of participation and length of active membership) with the following equation:

\[
\text{MembershipLength}(t) = \sum_{j=1}^{p} C_{1j}\,\text{Participation}(t-j) + \sum_{j=1}^{p} C_{2j}\,\text{MembershipLength}(t-j) + U_{1t}
\]

where $t$ is the current time, $p$ is the maximum number of lagged observations included in the model (the model order), $j$ is the lag order and can take any value from 1 through the maximum lag order $p$, MembershipLength is the length of active membership, Participation is the rate of participation, $C_{2j}$ is the coefficient of MembershipLength for lag order $j$, $C_{1j}$ is the coefficient of Participation for lag order $j$, and $U_{1t}$ is the model's residuals (shocks) at time $t$.

[32] A six month period for a lag order was selected based on the assessment of the data when all dependent and explanatory variables are taken into account.

To select the number of lags (lag order) to consider for a time series model, I employed five lag order selection statistics (tests) reported by STATA. I used the dependent variable and explanatory variables to find the correct lag order for the model. Five tests, the Likelihood Ratio test (LR), Akaike Information Criteria (AIC), Hannan-Quinn Information Criteria (HQIC), Schwarz' Bayesian Information Criterion (SBIC), and Akaike's Final Prediction Error (FPE), were used to test for the appropriate lag order. (Please refer to Appendix E for more information about the five tests.) A maximum lag order of 2 was suggested by the LR, FPE, and AIC, whereas a lag order of 0 was suggested by the HQIC and SBIC criteria (refer to Appendix F). A lag order of 2, based on what the majority of the tests suggested (three of the five information criteria), was selected for the Granger causality test. [33] AIC, SBIC, HQIC, LR, and FPE were run for all variables, including the dependent variable, which is the length of active membership.
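The majority-based selection of a lag order described above can be sketched in a few lines of Python. The suggested lag orders are the ones reported in the text (LR, FPE, and AIC suggest 2; HQIC and SBIC suggest 0). The function name and the tie-breaking rule (prefer the smaller lag) are assumptions introduced for illustration and are not part of the original analysis.

```python
from collections import Counter

def select_lag_order(suggested):
    """Pick the lag order suggested by the largest number of criteria.

    Ties are broken in favour of the smaller lag order (an assumption).
    """
    votes = Counter(suggested.values())
    most_votes = max(votes.values())
    return min(lag for lag, n in votes.items() if n == most_votes)

# Lag orders suggested by the five selection statistics, as reported in the text.
suggested = {"LR": 2, "FPE": 2, "AIC": 2, "HQIC": 0, "SBIC": 0}

selected = select_lag_order(suggested)
```

With three of the five statistics pointing to a lag order of 2, `selected` is 2, which corresponds to the 12-month window used for the Granger causality test.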
The algorithms applied complex statistical tests/models (in an FPE test, the expected variance of the error is measured when an autoregressive time series is fitted against another time series of similar covariance structure) and suggested a lag order. It is important to note that, for this study, a lag order of 2 corresponds to a period of 12 months. Members who had been members for less than a year were still included in the Granger causality test, which used data on a member only for the period the member was active.

[33] The approach I used to select a lag order is often referred to as a majority vote approach in the field of information science, and more precisely, in data mining and in machine learning.

The Granger causality test was conducted using a Vector Auto Regression (VAR). [34] Table 3.4 reports the results for a Granger causality test between rate of participation factors and length of active membership. The results showed that higher rates of participation do NOT cause members to stay active longer, because the chi-square values were not statistically significant at the p=.001 level.

Dependent Variable      Explanatory Variables     Chi-square   Probability
Length of Membership    Rate of Writeups          .219         0.896
Length of Membership    Rate of Messages sent     .230         0.891
Length of Membership    Rate of Votes given       .191         0.909
Length of Membership    ALL                       .690         0.995

Table 3.4: Granger causality results whether members' rate of participation causes their length of active membership (*p < .001)

[34] A Granger causality test can be derived using a VAR model. A VAR model can be represented as a time series consisting of two variables (x and y), where $y_p$ (the value of variable y for time p) can be represented in terms of its past values and past values for the variable x. If x Granger causes y, some or all lagged x values will have non-zero coefficients. A Vector Auto Regression (VAR) is a statistical regression model. In a VAR model, multiple time series are used to estimate linear dependencies among variables. Each variable can be considered as evolving from its own lags and from the lags of other variables. In a VAR equation, a set of variables is used. Each variable is represented as a linear function of v lags of itself and of all of the remaining variables in the equation. An error term is also included. A first order VAR(1) for n variables collected in the n x 1 vector $y_t$ can be represented as $y_t = b_0 + b_1 y_{t-1} + q_t$, where $q_t$ is the error term, which can be represented as iid normal (a diagonal matrix), and $b_0$ is an n x 1 vector which represents a constant term in the equation. A VAR(1) model should satisfy the matrix equations $E(q_t q_t') = W$ and $E(q_t q_{t-j}') = 0$ for $j \neq 0$, where W is a positive semi-definite n x n matrix containing the error terms, and $E(q_t q_{t-j}') = 0$ indicates that the error terms are uncorrelated across time periods. The dependencies among variables are represented by the matrix $b_1$ and the contemporaneous dependence is determined by the term $q_t$. The results from a Granger causality test (null hypothesis is supported or not) can be determined (based on the chi-square values and the associated probability values) from a Wald test. This is just a brief summary of a Granger causality test and a VAR model. For more information please refer to http://academic.reed.edu/economics/parker/s13/312/tschapters/S13_Ch_5.pdf

Due to the large sample size, statistical significance is considered at the p=.001 significance level only and should be examined along with the effect size (Jensen 2007).

3.7.2 Concluding Remarks

The Granger causality tests showed that a higher rate of participation does NOT cause community members to remain active longer.
A hazard rate model showed that correlation does exist between all participation variables (rate of writeups, rate of messages sent, and rate of votes given) and length of active membership. Even though length of active membership and members' rates of participation are correlated, the Granger causality tests showed no causality.

3.8 How is Length of Active Membership Affected by Rate of Feedback? (RQ3)

3.8.1 Feedback Received from Others Affecting Length of Membership

Using the server timestamps from the logs, I constructed lagged values for explanatory variables (rate of votes received, rate of messages received, rate of cools received, fraction of deleted writeups) and the dependent variable (length of active membership), and tested for causality between the explanatory variables and the dependent variable. A Granger causality test between feedback received from others on member supplied content and length of active membership can be conducted with the following equation:

\[
\text{MembershipLength}(t) = \sum_{j=1}^{p} C_{5j}\,\text{Feedback}(t-j) + \sum_{j=1}^{p} C_{6j}\,\text{MembershipLength}(t-j) + U_{3t}
\]

where $t$ is the current time, $p$ is the maximum number of lagged observations included in the model (the model order), $j$ is the lag order and can take any value from 1 through the maximum lag order $p$, MembershipLength is the length of active membership, Feedback is the rate of feedback received from others, $C_{6j}$ is the coefficient of MembershipLength for lag order $j$, $C_{5j}$ is the coefficient of Feedback for lag order $j$, and $U_{3t}$ is the model's residuals (shocks) at time $t$.

To select the number of lags (lag order) for the time series model, I employed five lag order selection statistics (tests) reported by STATA. Five tests, the Likelihood Ratio test (LR), Akaike Information Criteria (AIC), Hannan-Quinn Information Criteria (HQIC), Schwarz' Bayesian Information Criterion (SBIC), and Akaike's Final Prediction Error (FPE), were used.
LR, FPE, and AIC suggested a maximum lag order of 1, whereas the HQIC and SBIC information criteria suggested a lag order of 0. A lag order of 1, as suggested by the majority of the tests, was selected for the Granger causality test (refer to Appendix G). The tests were run for all variables. A Granger causality test was conducted with a Vector Auto Regression (VAR). Please refer to Table 3.5 to review the results from the Granger causality test. The Granger causality test results showed that feedback received from others does NOT cause length of active membership, as the chi-square values were not significant at the p=.001 level for any of the explanatory variables.

Dependent Variable      Explanatory Variable            Chi-square   Probability
Length of Membership    Rate of Messages received       1.342        0.511
Length of Membership    Rate of Votes received          1.509        0.219
Length of Membership    Rate of Cools received          1.882        0.628
Length of Membership    Fraction of deleted Writeups    1.314        0.390
Length of Membership    ALL                             4.704        0.582

Table 3.5: Granger causality results whether rate of feedback received causes length of active membership (*p < .001)

It is important to note that the Cox proportional hazard rate results did not show significant correlation between rate of feedback received and length of active membership. Based on the Cox proportional hazard rate results, causality between rate of feedback received and length of active membership can be ruled out. However, a Granger causality test was still conducted. The results from a Granger causality test can be viewed as additional validation of the results of the Cox proportional hazard rate model.

3.8.2 Concluding Remarks

The Granger causality tests showed that more feedback received from others does NOT cause community members to remain active longer.
A hazard rate model showed that no significant correlation (at the p=.001 level; the probability values for the corresponding chi-squares are greater than 0.1%) exists between feedback members received from others and their length of active membership. The Granger causality test supports the no causality interpretation of the hazard model results.

3.9 Granger Causality Tests in Regards to Early Participation and Early Feedback Received from Others on the Content

It is important to note that I could not conduct a Granger causality test of whether members' early participation influences their lengths of active membership (RQ2) or whether early feedback received from others influences length of active membership (RQ4). A Granger causality test examines whether changes in one variable may cause changes in a second variable using past values of both variables. In this dissertation, I used members' first participation and first feedback received from others as measures of early participation and early feedback received from others. Because there is at most only one event recorded for each early participation variable and each early feedback received variable, there are no prior observations.

3.10 Chapter Summary

Several possible factors that may contribute to length of active membership, which may in turn contribute to the viability of online communities, were identified and tested in this chapter. These factors were tested for the online community Everything2 to determine whether or not they are related to length of active membership using two rigorous statistical tests, a Cox proportional hazard rate model and a Granger causality test. First, a Cox proportional hazard rate model was introduced, and then the results from the test were presented. A hazard rate model tests for any correlational evidence between two variables.
The results showed that all three rate of participation variables (rate of writeups submitted, rate of votes given and rate of messages sent), one early participation variable (first message submitted or not) and one early feedback variable (first cool received on a first writeup) were correlated with length of active membership. A Granger causality test was then conducted and the results from the test were presented. Even though the hazard rate model found correlational evidence between rate of participation variables and length of active membership, the stronger Granger causality test found no evidence for causal relationships, suggesting that in this case correlation was not evidence of causation. It also found no evidence for causal relationships between rate of feedback received variables and length of active membership.

CHAPTER 4

Examining Length of Active Membership for Sploder: A Quantitative Study

Chapter 3 reviewed the research literature on online communities that is most relevant to a study of factors that influence the amounts of time members in these communities remain active.
From the perspective of determining what factors influence length of active membership, it identified five limitations of the earlier work: (1) that earlier research was not focused explicitly on factors influencing length of active membership; (2) that earlier work looked at the effects of total levels of participation and amounts of feedback received over time but did not consider the implications of variation among users in rates of participation and feedback received, or variation in the length of time over which different individuals' total participation and feedback received measures were calculated; (3) that the lengths of time covered by almost all prior studies were short relative to the length of time members might remain active in their communities; (4) that prior research treated measures of correlation between dependent and independent variables as evidence of causation even though there were plausible non-causal explanations for the observed relationships among variables; and (5) that the focus of previous studies on single online communities limited the claims that might be made for the generality of their findings. The empirical study of the online community Everything2 presented in Chapter 3 addressed the first four of these limitations.

This chapter replicates the study of Everything2 for a second online community, Sploder. In addition to providing an additional study that addresses the first four limitations of the earlier literature, replicating the Everything2 study design for a second online community makes possible a more direct and meaningful cross-study comparison of findings than was possible before and allows us to draw stronger conclusions about the extent to which empirical findings for one online community generalize to other online communities. The chapter first describes the Sploder server log data made available for this research and the operationalization of variables for this study.
It then examines length of active membership through the lens of a Cox proportional hazard rate model, a statistical technique (a type of survival analysis) that I use to examine the influence of explanatory variables (independent variables) on the length of time, from registration, that a member remains active in a community. Hazard rate models are used to examine factors that may influence the amount of time that elapses before a discrete event, such as an individual catching a disease or adopting a new product, occurs. For this chapter's study, as for that of Chapter 3, the event is cessation of active membership. Previous studies have used hazard rate models to examine correlations between members' participation and their continued use of a community (Wang, 2012; Yang, 2010; Farzan, 2011). A hazard model identifies correlational evidence, and it is well known that correlational evidence is not dispositive proof of causation. Because there are plausible non-causal explanations for the correlations identified here, this dissertation also uses a Granger causality test to more rigorously test for causal links between members' participation and length of active membership and between feedback received and length of active membership.

4.1 Overview of Sploder

Sploder (http://www.sploder.com/ ) is an informal, peer production community (a community where members work together to create a better quality product) for adolescents and young adults that allows its members to build their own games, and then share them with other members. For users to build games and interact with others, they first have to register as members. Membership is free and once registered, members are never evicted, even though they may have been inactive for a long time.
Sploder allows its members to join the Sploder discussion forum to discuss game-related topics, where they submit posts on a topic, which is an article-like entry called a discussion article. 35 This study focuses only on members of the Sploder discussion forum. Other members can provide feedback on a discussion article through whispers and comments, or simply vote on the discussion article. Whispers are private communications between an author and a member or an author and an editor; comments are public communications made on a discussion article. Votes in Sploder are only positive; they do not allow a negative vote. The community is managed and administered by a group of volunteer editors who are appointed by the community manager (owner). Editors can delete comments or discussion articles from the forum if they feel they do not meet community standards.

35 It is important to note that the Sploder discussion forum is a community on its own. Membership in the discussion forum requires separate registration and logging in. Membership can overlap between the original game site and the discussion forum. Being a member of one does not imply membership in the other. In this study, I analyzed the data from the Sploder discussion forum only.

4.2 Data and Operationalization of Variables in Sploder

I examine length of active membership in Sploder where length of active membership is operationalized as the number of days from the date a member's account is created to the member's last login date, both of which are recorded on the server log. Prior literature suggests that members' participation and feedback received from other members should contribute to length of active membership. To study factors that may affect members' length of active membership, I have constructed the following measures of participation. Rate of participation measures are: discussion articles submitted, votes given, messages sent, and comments given per unit of time.
Early participation measures are: first discussion article submitted, first vote submitted, first message submitted, and first comment submitted. Similarly, to examine the effects of different types of feedback received on length of active membership, I have constructed rate and early feedback measures of feedback received. Rate of feedback received measures are: votes received, messages received, comments received, and fraction of discussion articles deleted. Measures of early feedback received are: first vote received on first discussion article, first comment received on a discussion article, and deletion of first discussion article. These measures have been constructed from data preserved on Sploder server logs. 36

The goal is to examine whether members' participation (rate of participation and early participation) and feedback received from others (rate of feedback received and early feedback received) on their content are related to their length of membership. I first use a Cox proportional hazard rate model to estimate the effects of participation and feedback received on length of active membership. Then I use a Granger causality test to examine the extent to which the correlational evidence generated by the hazard rate model does in fact reflect causality between the dependent variable and the independent variables in the model.

36 Throughout this study, when feedback is mentioned it always refers to feedback a member receives from other members. Participation always refers to the participation of the member whose length of active membership is being examined. Some forms of feedback (especially votes) can be participation for others. If a member submits discussion articles, votes, comments or whispers, it is termed participation, and if the same member (based on userid) receives votes or comments from others, it is termed feedback.
It is also important to note that whenever early participation or rate of participation is mentioned, this refers to all participation variables as a group (discussion articles submitted, votes given, comments sent). Likewise, whenever early feedback received or rate of feedback received is mentioned, this refers to all feedback variables as a group (messages received, votes received, cools received, and deletion of discussion articles).

4.3 The Sploder Data-Set

The server log contains information on all (5,373) members who created an account from February 15, 2008 to September 23, 2010 in the Sploder community. Table 4.1 gives the definitions for the participation and feedback variables employed for this study. The data set contains time stamps for members' last login and members' account creation dates and times. 37 It also contains total counts from the first login for each type of participation variable and total counts for each type of feedback variable, which I have used to calculate rates of participation and rates of feedback received. The data set contains timestamps for participation activities recorded for each member (such as discussion article submitted, vote sent, whisper given, or comment given) and, for each member, timestamps for feedback received (such as a vote received, a comment received, a whisper received, or deletion of a post). 38

37 The data I had access to did not contain other login information, only the last login. While other logins may have been captured, I did not receive that information.

38 The server logs I gained access to only contain early feedback (such as first vote submitted on first discussion article, first comment submitted on first discussion article) and early participation of members (such as first discussion article submitted or not, first vote submitted or not, first comment submitted or not) and total counts of feedback and participation.
It does not contain timestamps for every individual participation activity or every individual feedback received. Also, Sploder did not automatically log a member out if the member closed the browser without signing out. In this instance, the server logs did not capture the next login. This is not accounted for in the analysis as this information is not recorded in the server log.

Table 4.1: Description of participation and feedback variables in Sploder

Length of Active Membership
    Amount of time a member has been active (last login date - account creation date) in the community

Rate of Participation
    Rate of Discussion Articles Submitted: The total count of discussion articles a member submitted divided by the amount of time the member was active in the community
    Rate of Comments sent: The total count of comments a member sent divided by the amount of time the member was active in the community
    Rate of Votes given: The total count of votes a member submitted divided by the amount of time the member was active in the community
    Rate of Whispers given: The total count of whispers a member submitted divided by the amount of time the member was active in the community

Rate of Feedback Received
    Rate of Comments received: The total count of comments a member received divided by the amount of time the member was active in the community
    Rate of Votes received: The total count of votes a member received divided by the amount of time the member was active in the community
    Rate of Whispers received: The total count of whispers a member received divided by the amount of time the member was active in the community
    Fraction of deleted Discussion articles: The total deleted discussion articles divided by the total amount submitted

Early Participation
    First Discussion Article submitted or not: Whether or not a member has submitted a first post (discussion article)
    First Comment sent or not: Whether or not a member sent a first comment
    First Whisper submitted or not: Whether or not a member submitted a first whisper
    First Vote submitted or not: Whether or not a member submitted a first vote

Early Feedback Received
    First Vote received on First Discussion article: Whether or not a member received a first vote on first discussion article
    First Comment received on First Discussion article: Whether or not a member received a first comment on first discussion article
    First Deletion of a Discussion article: Whether or not a member received a first deletion on a discussion article

Control
    Familiarization Time: The amount of time between a member creating an account and posting a first discussion article

I used rate of discussion articles submitted, rate of comments sent, rate of votes given and rate of whispers given as participation variables. With everything else equal, users who are members of a community for a longer period will have higher total participation counts (sum of each type of participation measure, such as sum of posts, sum of votes and sum of comments) than members who are in the community for a shorter period of time (Wang, 2012). I used rate as a measure because of the analytical questions raised by unequal lengths of participation as discussed above. Rates were computed by dividing a member's total count for a variable by the member's total length of active membership. For example, rate of discussion articles is calculated by dividing members' total discussion articles submitted by their length of active membership.

Participation activities for members vary from community to community. In Sploder, these participation activities (features) are broken down into a relatively small number of discrete types of activities that are explicitly named and supported by the community, such as submitting a discussion article, submitting a vote, submitting a comment and submitting a whisper.
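The rate measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names, the use of Unix timestamps in seconds, and the choice to return 0.0 for the deleted fraction when a member has submitted nothing are all assumptions made for the example.

```python
SECONDS_PER_DAY = 86_400


def active_days(created_ts: int, last_login_ts: int) -> float:
    """Length of active membership in days (last login minus account creation).

    Both arguments are assumed to be Unix timestamps in seconds.
    """
    return (last_login_ts - created_ts) / SECONDS_PER_DAY


def rate(total_count: int, days_active: float) -> float:
    """Total count for one variable divided by length of active membership."""
    if days_active <= 0:
        raise ValueError("length of active membership must be positive")
    return total_count / days_active


def fraction_deleted(deleted: int, submitted: int) -> float:
    """Deleted discussion articles divided by discussion articles submitted.

    Returning 0.0 when nothing was submitted is an assumption of this sketch.
    """
    return deleted / submitted if submitted else 0.0


# Hypothetical member: active for 100 days, 25 discussion articles, 5 deleted.
days = active_days(0, 100 * SECONDS_PER_DAY)
article_rate = rate(25, days)          # discussion articles per day
deleted_share = fraction_deleted(5, 25)
```

Dividing by length of active membership is what makes members with unequal membership lengths comparable, which is the reason given above for preferring rates to total counts.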
Prior research suggests that if members use these features, they may become more engaged in the community and stay active longer.

I used four early participation variables to examine the effects of early participation on length of active membership. They were whether or not a member submitted a first discussion article, whether or not a member submitted a first comment, whether or not a member submitted a first vote, and whether or not a member submitted a first whisper. Early participation variables are binary.

Two types of feedback variables were constructed from the server logs: general feedback variables and early feedback variables. General feedback variables include rate of comments received, rate of votes received, rate of whispers received and fraction of deleted discussion articles. The fraction of deleted discussion articles was computed by dividing total deleted discussion articles by total discussion articles. 39 Three different variables were used to examine how length of active membership is related to early feedback received from others on the content. The early positive feedback variable was whether or not a member received a first vote on a first discussion article. Sploder does not support negative votes on members' posts, but I included deletion of first discussion article as an early negative feedback variable. I have also included whether or not a member received a first comment on a first discussion article. 40 These variables are all binary.

Familiarization time (the time between creation of an account and posting the first discussion article) was included as a control variable in the analyses. When members join a community, it takes a certain amount of time to become familiar with the community before they submit their first post (discussion article).
Familiarization time provides a measure of how long a member takes to become familiar enough with both the use of the site's features and norms of participation in a community to become an active participant in the community.

39 Fraction of deleted discussion articles is also included as a control variable. While not a rate, it can be multiplied by rate of discussion articles submitted to derive a rate for discussion articles. For exposition convenience, I will list fraction of deleted discussion articles as one of the rate of feedback received variables for the remainder of this chapter.

40 First comment received on first discussion article can be positive or negative. Since the content of a comment is not analyzed in this study, it is unknown whether comments are positive or negative. Nevertheless, receiving a first comment on the discussion article from others may encourage members to continue using the community as they received some form of early feedback. Moreover, from an administrator's point of view, knowing whether comments will increase or decrease active membership can be an indicator of where they should invest time to grow or nurture the community.

4.4 Data and Measures

First, I ran a diagnostic test on the data. I ran a missing value analysis test (MVA test). The MVA test detected both missing values and outliers in the data. The test used a grid search approach to detect missing values. It also used the range from Mean - 2*SD to Mean + 2*SD (where SD is the standard deviation) to detect outliers. The analysis found 23 rows with missing values or outliers in the data. These 23 rows were removed from the sample. 41

Members whose length of active membership was less than a day (based on the Unix timestamp) in the community were removed from the dataset. This reduced the sample size to 1,982 unique members. 42 The removed members created an account in the community but did not log in after a day.
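The two screening steps described above (flagging values outside Mean ± 2*SD and dropping members active for less than a day) can be illustrated with the Python standard library. This is a hedged sketch rather than the MVA procedure itself: the function names are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev


def outlier_bounds(values, k=2.0):
    """Return (lower, upper) = (mean - k*SD, mean + k*SD).

    Uses the sample standard deviation; that choice is an assumption.
    """
    m = mean(values)
    s = stdev(values)
    return m - k * s, m + k * s


def flag_outliers(values, k=2.0):
    """Return one boolean per value: True if it lies outside the bounds."""
    lower, upper = outlier_bounds(values, k)
    return [v < lower or v > upper for v in values]


def drop_short_memberships(lengths_in_days, minimum_days=1.0):
    """Keep only members whose length of active membership is at least a day."""
    return [d for d in lengths_in_days if d >= minimum_days]


# Hypothetical values: nine ordinary observations and one extreme one.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 200]
flags = flag_outliers(sample)
kept = drop_short_memberships([0.2, 1.0, 35.5, 0.9, 400.0])
```

One caveat of a mean-and-SD rule is that a single extreme value inflates both the mean and the standard deviation, so the bounds widen as outliers become more extreme.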
It is important to note that registered members who did not log in after a day also did not post anything.

41 Additionally, I also conducted a P-P (Normal probability plot) to validate the MVA test. A probability plot reports values against a straight line and shows deviation of points from the straight line. The P-P plot also reported 23 missing values.

42 I also conducted a separate analysis excluding members who did not log in after 60 days from their account creation; these members have a length of active membership of less than 60 days. I report these results in Appendix-H. The results for members whose length of membership is over a period of 60 days (N=971) are similar to members whose length of membership is over a day. The cut-off period of 60 days is used as a proof of concept. This technique in data-mining is known as sensitivity analysis (Yang, 2010).

I checked for multicollinearity associated with length of active membership, members' participation, and feedback received from others using a Variance Inflation Factor (VIF) analysis. A VIF value of 5 or above usually indicates multicollinearity among the variables. The VIF test results showed that first vote received on first discussion article and first comment submitted or not have values higher than 5, indicating that they are collinear. These two variables were removed from the analysis. 43 After removing these two variables, I reran the VIF test. I found VIF values between 1.004 and 1.504 for the remaining variables. The VIF test for multicollinearity confirms that all of the individual participation variables and all individual feedback variables that have been used in the analysis are NOT collinear with each other. This means that no participation variable is collinear with any other participation variable, nor is any participation variable collinear with any feedback variable. No feedback variable is collinear with any other feedback variable.
Also, the VIF values showed that length of active membership is not collinear with any participation or any feedback received variables. 44 (Please see Appendix I for these results.) Table 4.2 presents descriptive statistics for the variables.

43 Both variables were dropped because it was difficult to determine whether first vote received on first discussion article or first comment submitted was the more important factor to consider.

44 This also suggests members' participation and feedback received from others on the content they contribute represent two different constructs.

Variables                                            N      Min   Max      Mean      S.D.
Length of active membership                          1982   1     951.52   152.571   207.174
Rate of Discussion Article                           1982   0     9.51     .322      .685
Rate of Comments sent                                1982   0     23.44    1.715     3.682
Rate of Votes given                                  1982   0     20.85    .908      1.607
Rate of Whisper given                                1982   0     7.00     .257      .685
Rate of Comments Received                            1982   0     5.00     2.461     1.584
Rate of Votes Received                               1982   0     11.35    6.187     3.088
Rate of Whisper Received                             1982   0     1.50     .749      .475
Fraction of Deleted Discussion Article               1982   0     35.0     .143      1.207
First Comment received on First Discussion Article   1982   0     1.00     .228      .419
Deletion of First Discussion Article                 1982   0     1.00     .228      .419
First Discussion Article submitted or not            1982   0     1.00     .712      .452
First Vote submitted or not                          1982   0     1.00     .997      .050
First Whisper submitted or not                       1982   0     1.00     .960      .194
Familiarization Time                                 1982   0     950.15   238.074   245.275

Table 4.2: descriptive statistics for participation and feedback received in Sploder

The average length of active membership in the Sploder community is 152 days.

4.5 Examining Length of Active Membership Using a Hazard Rate Model

I used a Cox proportional hazard rate model to examine how participation and feedback factors are associated with length of active membership in the Sploder community. A Cox proportional hazard rate model is a statistical technique (commonly referred to as survival analysis) for analyzing time to an event (Cox, 1972; Cox, 1984).
The hazard represents a specific event and is often interpreted in terms of survival (Wang, 2012). In this study, the event occurs when an active member becomes inactive in the community. The hazard rate is the probability of the event occurring within a specific amount of time from the moment observations on whether the event occurs or not begin (in this case from the time a member joins an online community). The coefficients of the model are estimated in terms of a member's hazard ratio, which is the probability of an event occurring in a specific period of time compared to the probability of the control (Smith, 2003; Therneau, 2000). The control is a constructed (hypothetical) group of members whose values are assigned (Smith, 2003; Therneau, 2000). The values assigned to the control group differ for binary and continuous variables. For example, for all rate of participation variables and all rate of feedback received variables, the assigned values for the control group are the average rate of participation and the average rate of feedback received per day. For early participation and feedback, the control is the group of members who did not participate (submit a first discussion article, submit a first vote) and who did not receive early feedback (first vote received, first comment received) from others. The comparison between the two groups is performed by dividing the first group by the control. The first group is actually multiple groups, each group being compared to the control. For example, one group is a group of members with a one-unit increase in the variable value compared with the control; another group is a group of members with a two-unit increase in the variable value compared with the control. The software accounts for all of these and comes up with a single hazard ratio; all the variances are captured and handled by the software.
A hazard ratio is interpreted based on whether the ratio is greater than or less than 1.000. If it is greater than 1.000, then the probability of the event occurring increases compared to the control. If it is less than 1.000, then the probability of the event occurring reduces compared to the control. If it is 1.000, then the probability of the event occurring is the same for both groups (no difference in survival between the control and the first group).

I used a Cox proportional hazard rate model instead of other statistical models because identifying the actual point in time that a member becomes inactive is challenging. Long-term inactivity does not preclude members from coming back and such interrupted inactivity could add bias to the results. For example, it is possible that a member who did not log in for the last six months may still log back in. Standard statistical regression models (such as Ordinary Least Squares regression and Logistic regression) do not accurately estimate this time to an event (Wang, 2012; Smith, 2003). A Cox proportional hazard rate model is used to predict future events or the failure of an event to occur, which in this study is when an active member becomes inactive and stops logging in. A Cox proportional hazard rate model can estimate the hazard ratio of an event as a function of multiple explanatory variables (independent variables, commonly referred to as covariates in the model). This type of model is often used in disease contagion studies, with the state measured being health status (individual has or has not caught the disease by time t) (Smith, 2003). A Cox proportional hazard rate model can be represented as:

h(t) = h0(t)*exp(bi*zi),

where the value h(t) denotes the length of active membership given the explanatory variables (zi) (such as rate of writeups, rate of messages given, rate of votes received, first writeup submitted, first upvote received, etc., for each individual member).
bi is the coefficient for explanatory variable zi. The term h0(t) is called the baseline hazard for the model. A baseline hazard is the hazard when all independent variable values are equal to zero. In this case, I used a Gompertz distribution, which is a commonly used statistical distribution for proportional hazard rate models. A Gompertz distribution is a density function 45 that can take many different shapes, as it is a flexible distribution.

45 A probability density function is a function that represents the likelihood that a continuous random variable takes a specific value. The probability that the variable falls within a given range of values is given by the integral of the variable's density over that range.

The dependent variable is length of active membership, which is measured in days. Members' rate of participation (rate of discussion articles submitted, rate of votes given, rate of whispers given and rate of comments sent) and rate of feedback received from others (rate of votes received, rate of comments received, rate of whispers received and fraction of deleted discussion articles), as well as early participation (first vote submitted or not, first discussion article submitted or not, first comment submitted or not, and first whisper submitted or not) and early feedback received from others (first vote received or not, first comment received or not and first deletion of a discussion article), are used as explanatory variables. 46 The EXP(B) column in Table 4.3 gives the estimated coefficient values for the explanatory variable in terms of a hazard ratio, which tells us whether an explanatory variable in the model is related to the probability of members remaining active in the community. Due to the large sample size, statistical significance is considered at the p=.001 significance level only for all variables (Jensen, 2007). Statistical significance should be examined along with the effect size as indicated by the hazard ratio.
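Throughout this chapter, a hazard ratio Exp(B) is translated into a percentage change using the expression (Exp(B) - 1) * 100%. The short Python sketch below reproduces that arithmetic only; the function name is hypothetical, rounding to one decimal place is an assumption made to match how the percentages are reported, and the sign is read following this chapter's stated convention (a positive value is described as an increase and a negative value as a reduction in the probability of remaining active).

```python
def percent_change(hazard_ratio: float) -> float:
    """Percentage change implied by a hazard ratio: (Exp(B) - 1) * 100.

    The result is rounded to one decimal place, which is an assumption
    chosen to match the precision of the reported percentages.
    """
    return round((hazard_ratio - 1.0) * 100.0, 1)


# Hazard ratios reported in the text for Sploder.
examples = {
    "rate of discussion articles": 1.017,
    "first discussion article submitted or not": 0.713,
    "first whisper submitted or not": 3.084,
}
changes = {name: percent_change(hr) for name, hr in examples.items()}
```

A ratio of exactly 1.000 yields a change of zero, which corresponds to no difference between a group and the control.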
Factors                     Variables                                            SE     Sig.   Exp(B)
Rate of Participation       Rate of Discussion Article                           .039   .670   1.017
                            Rate of Comments sent                                .008   .127   1.012
                            Rate of Votes given                                  .019   .134   .972
                            Rate of Whisper given                                .045   .537   .972
Rate of Feedback Received   Rate of Comments received                            .016   .971   .999
                            Rate of Votes received                               .008   .548   1.005
                            Rate of Whisper received                             .054   .231   1.067
                            Fraction of deleted Discussion Articles              .043   .963   1.000
Early Feedback Received     First Comment received on First Discussion Article   .071   .164   .906
                            Deletion of First Discussion Article                 .196   .000   .071*
Early Participation         First Discussion Article submitted or not            .071   .000   .713*
                            First Whisper submitted or not                       .590   .057   3.084
                            First Vote submitted or not                          .503   .679   1.231
                            Familiarization Time                                 .000   .000   .999*

Table 4.3: hazard rate model results for participation and feedback factors and length of active membership (*p <.001)

46 This is just a brief summary of the Cox proportional hazard rate model. For more information please refer to http://core.ecu.edu/ofe/StatisticsResearch/Survival%20Analysis%20Using%20SPSS.pdf and http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf and https://mywebspace.wisc.edu/jmullahy/web/basu%20manning%20mullahy.pdf

4.5.1 How is Length of Active Membership Related to the Rate of Participation? (RQ1)

The coefficient estimates for the explanatory variables and their associated significance levels from the Cox proportional hazard rate test, where the dependent variable is length of active membership, are reported in Table 4.3. The study examines how four different explanatory variables that address RQ1, rate of discussion articles, rate of whispers sent, rate of comments sent and rate of votes given, are related to length of active membership. In the model, a binary status variable, "censored," was also included. Censored was constructed using last login date and last posting date. Censored is the probability of an inactive member becoming active in the future.
If the difference between last login and last posting date is less than sixty days, the member is considered non-censored. 47

47 In Sploder, if the time between members' last login and last post exceeds 90 consecutive days, they are not likely to post again and thus they are considered censored. 85% of members did not submit another post if their censoring time was over ninety days. This implies 85% of members were considered active and 15% as censored at the beginning of the analysis. The model considered 15% as censored, which means they are still included and may come back in the future. The model estimates the probability of them coming back (internally), reclassifies them (if necessary) and reports a hazard ratio. All of this is done internally in the software using algorithms. I use a survival analysis, a Cox proportional hazard rate model, to account for this censoring of the variables. It is important to note that I used a higher cutoff in Sploder (compared to Everything2) due to the type of community it is. Members are usually part of two different communities in Sploder, so they are probably still involved in the games-making component of Sploder while they are not on the discussion forum. It makes sense to give them more time since they are still active in the Sploder gaming community. I also performed a sensitivity analysis with an alternative cutoff of 70%. The results obtained showed no major differences from those with the 85% cutoff reported in this chapter.

In this study, the coefficients from the Cox proportional hazard model are interpreted in terms of probability of members remaining active (survival) rather than becoming inactive. The hazard ratio for the rate of discussion articles is 1.017, which suggests that a unit increase in discussion article posts per day (rate of discussion articles) will increase the probability of members remaining active in the community by 1.7% {(1.017-1)*100%}; the value is NOT significant at the p = .001 level.
Similarly, the hazard ratio for the rate of comments sent is 1.012, which suggests that a unit increase in comments sent per day (rate of comments sent) will increase the probability of members remaining active in the community by 1.2% {(1.012-1)*100%}; the value is NOT significant at the p = .001 level. 48 The hazard ratio for the rate of votes given is .972, which suggests that a unit increase in votes given per day (rate of votes given) will reduce the probability of members remaining active in the community by 2.8% {-(.972-1)*100%}; the value is NOT significant at the p = .001 level. The hazard ratio for the rate of whispers given is .972, which suggests that a unit increase in whispers given per day (rate of whispers given) will reduce the probability of members remaining active in the community by 2.8% {-(.972-1)*100%}; this value is NOT significant at the p =.001 level.

This suggests that the rate of participation is NOT statistically significantly related to length of active membership. A possible explanation for this could be that Sploder is a gaming community for young adolescents. It is possible that young adolescents are more invested in designing and playing games than in forum discussions. It could also be possible that members visit the discussion forum to get answers or ideas for creating games or features. Thus they may log in to read, but may not actively post.

48 This is a comparison between a group of members whose rate of participation, i.e. discussion articles, votes, whispers, and comments increases by one unit compared to members of the control group whose rate of participation remains at the average value for the community. For an explanatory variable, the interpretation of a unit increase is that the explanatory variable (continuous in nature such as rate of discussion articles) increases by one standard deviation per day.

4.5.2 How is Length of Active Membership Related to Early Participation? (RQ2)

Four different explanatory variables that respond to RQ2, first discussion article submitted or not, first comment submitted or not, first whisper submitted or not, and first vote submitted or not, were included in the model. Table 4.3 reports the regression coefficients for early participation factors. The hazard ratio for first discussion article submitted or not is .713, which suggests that submitting a first discussion article (first discussion article submitted or not) will reduce the probability of members remaining active in the community by 28.7% {-(.713-1)*100%}; the value is significant at the p =.001 level. 49 The negative relationship between a first discussion article submitted and the length of active membership could be due to the type of members that are associated with the Sploder community. Sploder is a gaming community; members are invested in submitting games, and it is possible that if members' games are not well received, they may submit a discussion article indicating their displeasure before they stop using the site. It could also be that they post a question looking for help in creating their game. Once this question is answered sufficiently they may become inactive in the forum.

49 This is a comparison between a group of members who submitted early participation, i.e. first discussion article, first comment, first vote, and first whisper compared with the control group of members who, because they did not submit anything counted as participation, have early participation measures of zero.

The hazard ratio for first whisper submitted or not is 3.084, which suggests that submitting a first whisper (first whisper submitted or not) will increase the probability of members remaining active in the community by 208.4% {(3.084-1)*100%}; the value is NOT significant at the p =.001 level. One possible explanation could be that, since whispers are one-on-one communications, members use whispers to make social connections in the community.
Submitting their first whisper could be an indication that members are becoming more involved in the community, and more involved members may stay active longer. The hazard ratio for first vote submitted or not is 1.231, which suggests that submitting a first vote (first vote submitted or not) will increase the probability of a member remaining active in the community by 23.1% {(1.231-1)*100%}; the value is NOT significant at the p = .001 level.

4.5.3 How is Length of Active Membership Related to Rate of Feedback Received? (RQ3)

Four different explanatory variables that respond to RQ3 (rate of comments received, rate of votes received, rate of whispers received, and fraction of deleted discussion articles) were included in the model. Table 4.3 reports the regression coefficients for rate of feedback received. The hazard ratio for the rate of comments received is .999, which suggests that a unit increase in comments received per day (rate of comments received) will reduce the probability of members remaining active in the community by .1% {-(.999-1)*100%}; the value is NOT significant at the p = .001 level. 50 The hazard ratio for the rate of votes received is 1.005, which suggests that a unit increase in votes received per day (rate of votes received) will increase the probability of members remaining active in the community by .5% {(1.005-1)*100%}; the value is NOT significant at the p = .001 level. The hazard ratio for the rate of whispers received is 1.067, which suggests that a unit increase in whispers received per day (rate of whispers received) will increase the probability of members remaining active in the community by 6.7% {(1.067-1)*100%}; the value is NOT significant at the p = .001 level. The hazard ratio for the fraction of deleted discussion articles is 1.000, which suggests that a unit increase in the fraction of deleted discussion articles has no effect on how long members remain active in the community; the value is NOT significant at the p = .001 level.

This suggests that the rate of feedback received from others is NOT significantly correlated with length of active membership. One possible explanation could be that members joined the community primarily to read posts and gain ideas from others' submissions, so they may not be concerned about feedback received from other members on their own discussion articles.

50 This is a comparison between a group of members whose rate of feedback received from others, i.e. votes, comments, whispers, and deletion of articles, increases by one unit compared to the control group of members whose rate of feedback received from others remains the same.

4.5.4 How is Length of Active Membership Related to Early Feedback Received? (RQ4)

Three different explanatory variables that respond to RQ4 (first votes received on first discussion article, first whispers received on first discussion article, and deletion of first discussion article) were included in the model. Table 4.3 reports the regression coefficients for early feedback received. The hazard ratio for deletion of first discussion article is .071, which suggests that receiving a deletion of first discussion article will reduce the probability of members remaining active in the community by 92.9% {-(.071-1)*100%}; the value is significant at the p = .001 level. 51 There are at least two potential explanations for this. One could be that because only editors can delete discussion articles, members may be discouraged from remaining in the community as they may put more emphasis on feedback from authority figures. Another explanation could be that new members are more sensitive to harsh criticism than members who are more established.
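The percentage figures quoted alongside each hazard ratio all follow the same arithmetic, (HR - 1) * 100. The short sketch below reproduces that conversion for a few of the hazard ratios reported above; the function name `percent_change` is illustrative only, and the sketch performs just the arithmetic, leaving the interpretation of the result to the text.

```python
def percent_change(hazard_ratio: float) -> float:
    """Percent change implied by a hazard ratio, computed as (HR - 1) * 100."""
    return round((hazard_ratio - 1.0) * 100.0, 1)


# Hazard ratios quoted in the text; a negative result corresponds to a reduction.
reported = {
    "rate of discussion articles": 1.017,
    "first discussion article submitted": 0.713,
    "first whisper submitted": 3.084,
    "deletion of first discussion article": 0.071,
}

for name, hr in reported.items():
    print(f"{name}: {percent_change(hr):+.1f}%")
```

A ratio below 1 yields a negative value, which is why the reductions in the text are written with a leading minus sign inside the braces.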
The hazard ratio for first comment received on a first discussion article is .906, which suggests that receiving a first comment on a first discussion article will reduce the probability of a member remaining active in the community by 9.4% {-(.906-1)*100%}; the value is NOT significant at the p = .001 level. This suggests that the majority of the early feedback received from others on the content has no significant impact on length of active membership. The absence of a significant impact for most early feedback may be due to the fact that members joined the community to understand specific game-related topics. They were more interested in gaining this information than in what other members thought of their posts.

51 This is a comparison between a group of members who received early feedback, i.e. first vote on a discussion article, first comment, first whisper, and first deletion of a discussion article, and the control group of members who did not receive early feedback.

Familiarization time was measured as the time between creation of the account and the first discussion article. This variable was included as a control in the model. Members need to settle down in a community and become familiar with the socio-technical system and norms associated with a community before they begin actively contributing to the site; this time is referred to as familiarization time. The hazard ratio for familiarization time is .999, which suggests that a unit increase in familiarization time will reduce the probability of members remaining active in the community by .1% {-(.999-1)*100%} at the p = .001 significance level. This suggests that the longer the familiarization time a member has, the shorter their length of active membership in a community. One possible explanation could be that members who take longer to post have a harder time understanding the community and find it more difficult to use when they start using the different features.
This may make them less likely to remain active. On the other hand, it could be that members who take longer to start participating just have less interest in the community to begin with, and for this reason they are likely to cease being active earlier. Please also refer to Appendix J for a length of active membership survival graph and to Appendix K for more information about the model fit.

4.6 Examining Causal Links in the Sploder Community (Granger Causality Tests)

Because there are plausible non-causal explanations for the statistically significant relationships revealed by the Cox hazard regression, I also employ a Granger causality test as a more rigorous test for causality. It is important to note that if the results from a Granger causality test are not statistically significant, we can rule out any possible causal links among variables. However, if the results from a Granger causality test are statistically significant, the evidence for a causal relationship is merely stronger. The intuition behind a Granger causality test is that if changes in one variable cause changes in a second variable, then the value of the first variable in any given period should be correlated with the value of the second variable in a subsequent period or periods (Granger, 1969). 52 A Granger causality test provides stronger evidence for or against causality than the statistical significance of simple regression coefficients. The temporal evidence for causality is derived from time series data. 53 In this study, the model relies on the prediction that length of active membership in period N will be correlated with participation in the community during periods N-1, N-2, etc. A Granger causality test simultaneously tests for each direction in which a causal relationship might run between two variables.
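The intuition above can be illustrated with a small, self-contained sketch. It is not the procedure used in this study, which relied on STATA's VAR-based Wald test; it is an F-test variant of the same idea, run on synthetic data, and every name in it (`solve`, `rss`, `granger_f`, the series `x` and `y`) is hypothetical. The restricted regression predicts a series from its own lags, the unrestricted one adds lags of the candidate cause, and a large F statistic means the added lags improve the prediction.

```python
import random


def solve(a, b):
    """Solve the linear system a.z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(rows, target):
    """Residual sum of squares of an ordinary least squares fit (normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, target))


def granger_f(cause, effect, p=2):
    """F statistic for H0: the p lags of `cause` add nothing to predicting `effect`."""
    n = len(effect)
    target = effect[p:]
    # Restricted model: intercept plus p lags of the effect series itself.
    own = [[1.0] + [effect[t - j] for j in range(1, p + 1)] for t in range(p, n)]
    # Unrestricted model: the same regressors plus p lags of the candidate cause.
    full = [row + [cause[t - j] for j in range(1, p + 1)]
            for row, t in zip(own, range(p, n))]
    rss_r, rss_u = rss(own, target), rss(full, target)
    dof = len(target) - len(full[0])
    return ((rss_r - rss_u) / p) / (rss_u / dof)


# Synthetic example: y follows x with a one-period delay, x ignores y.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]

f_xy = granger_f(x, y)  # does x help predict y?
f_yx = granger_f(y, x)  # does y help predict x?
print(f_xy, f_yx)
```

Running the test in both directions, as in the last lines, mirrors the point that a Granger causality test checks each direction in which a causal relationship might run.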
In this study, I examine whether members' participation is causally linked to their length of active membership and whether the length of active membership is causally linked to members' participation.

52 A Granger causality test establishes whether a causal link among variables exists based on simultaneous correlations among variables (current instance and previous instances among variables) from time series data.

53 I am unable to ascertain true causality, as I do not have information on why the members actually left. Using a time series model, I am attempting to ascertain possible causes (by moving backward in time) by establishing causality in terms of statistics.

It is important to note that the Cox proportional hazard rate results did not show significant correlation between rate of participation and length of active membership or between rate of feedback received and length of active membership. Based on the Cox proportional hazard rate results, causality between rate of participation and length of active membership, and between rate of feedback received and length of active membership, can be ruled out. However, a Granger causality test was still conducted. The results of a Granger causality test can be viewed as an additional validation of the results of the Cox proportional hazard rate model.

4.7 How is Length of Active Membership Affected by the Rate of Participation? (RQ1)

4.7.1 Members' Participation Affecting Length of Membership

A Granger causality test examines the relationship between two variables using lag orders, where a lag order is the number of measurement periods for explanatory variables that is included in the model. A period could be of any length, such as a day, a month, or a year. For this study, a single lag order is a period of six months. I used the statistical software package STATA to determine the period of lag order and conduct the Granger causality test.
STATA selected six months as the appropriate unit of time for measuring lags based on its assessment of the data. 54 Using the server timestamps from the logs in Everything2, I derived lag orders for explanatory variables (rate of writeups, rate of votes given, rate of messages) and the dependent variable (length of active membership). I examined the possible causality between explanatory variables and the dependent variable. For example, the server log contains timestamps for writeups, timestamps for votes, and timestamps for messages. From these timestamps, I constructed lagged values for variables and examined the possible causal links among variables.

54 A six month period for a lag order was selected based on the assessment of the data when all dependent and explanatory variables are taken into account.

For a Granger test, time series data, a collection of observations made sequentially in time, is decomposed into stationary trends and residuals (often known as random shocks). A time series is called stationary if the statistical properties (mean, standard deviation, and autocorrelation) of the time series are constant. In a dynamic world, no trend will be stationary forever. However, if a time series contains stable trends during the observation period, the time series is termed trend stationary.
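As a rough illustration of how lagged values can be built from event timestamps, the sketch below groups dated events into six-month periods and pairs each period's count with the count from an earlier period. The helper names (`period_index`, `period_counts`, `lagged`) and the sample dates are invented for the example and do not come from the server logs; bucketing by calendar month is a simplifying assumption.

```python
from collections import Counter
from datetime import date


def period_index(day: date, start: date, months_per_period: int = 6) -> int:
    """Index of the period containing `day`, counted in whole months from `start`."""
    months = (day.year - start.year) * 12 + (day.month - start.month)
    return months // months_per_period


def period_counts(events, start, n_periods, months_per_period=6):
    """Number of events that fall in each of the first `n_periods` periods."""
    counts = Counter(period_index(d, start, months_per_period) for d in events)
    return [counts[i] for i in range(n_periods)]


def lagged(series, lag):
    """Pair each period's value with the value observed `lag` periods earlier."""
    return [(series[t], series[t - lag]) for t in range(lag, len(series))]


joined = date(2010, 1, 1)
posts = [date(2010, 1, 5), date(2010, 3, 9), date(2010, 8, 20),
         date(2011, 2, 1), date(2011, 2, 3), date(2011, 2, 14)]

series = period_counts(posts, joined, n_periods=3)
print(series)             # [2, 1, 3]
print(lagged(series, 1))  # [(1, 2), (3, 1)]
```

Each tuple holds the current-period count and the count one period earlier, which is the form of input a lagged regression needs.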
A Granger test, which utilizes a time series model, can be represented (in terms of participation and length of active membership) with the following equation:

$$\mathrm{MembershipLength}(t) = \sum_{j=1}^{p} C_{1j}\,\mathrm{Participation}(t-j) + \sum_{j=1}^{p} C_{2j}\,\mathrm{MembershipLength}(t-j) + U_{1t},$$

where $t$ is the current time, $p$ is the maximum number of lagged observations included in the model (the model order), $j$ is the lag order and can take any value from 1 through the maximum lag order $p$, MembershipLength is the length of active membership, Participation is the rate of participation, $C_{2j}$ is the coefficient of MembershipLength for lag order $j$, $C_{1j}$ is the coefficient of Participation for lag order $j$, and $U_{1t}$ is the model's residual (shock) at time $t$.

To select the number of lags (lag order) for a time series model, I employed five lag order selection statistics (tests) reported by STATA. I used the dependent variable and explanatory variables to find the correct lag order for the model. Five tests, the Likelihood Ratio test (LR), Akaike Information Criteria (AIC), Hannan-Quinn Information Criteria (HQIC), Schwarz' Bayesian Information Criterion (SBIC), and Akaike's Final Prediction Error (FPE), were used to test for the appropriate lag order. (Please refer to Appendix E for more information about the five tests.) A maximum lag order of 2 was suggested by the FPE, AIC, HQIC and LR information criteria, whereas the SBIC information criterion suggested a maximum lag order of 0 (refer to Appendix L). A lag order of 2, based on what the majority of the tests suggested (three of the five information criteria), was selected for the Granger causality test. 55 LR, AIC, SBIC, FPE, and HQIC were run for all rate of participation variables, including the dependent variable, which is the length of active membership.
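The majority-vote selection of the lag order can be expressed compactly. The following is a minimal sketch, assuming the lag order suggested by each statistic has already been read off the STATA output; the function name `majority_lag` is hypothetical, and the dictionary simply restates the suggestions reported above.

```python
from collections import Counter


def majority_lag(suggestions: dict) -> int:
    """Return the lag order suggested by the largest number of selection statistics.

    If two lag orders tie, the one encountered first is returned.
    """
    lag, _votes = Counter(suggestions.values()).most_common(1)[0]
    return lag


# Lag orders suggested for the rate of participation model, as reported in the text.
participation = {"LR": 2, "FPE": 2, "AIC": 2, "HQIC": 2, "SBIC": 0}
print(majority_lag(participation))  # 2
```

Ties are resolved here by insertion order, a design choice made only for the sketch; a real analysis would need an explicit tie-breaking rule.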
The algorithms applied complex statistical tests (in an FPE test, the expected variance of the error is measured when an autoregressive time series is fitted against another time series of similar covariance structure) and suggested a lag order. It is important to note that, for this study, 2 lag orders is a period of 12 months. Members who had been members for less than a year were still included in the Granger causality test, which used data on a member only for the period the member was active.

The Granger causality test was conducted using a Vector Auto Regression (VAR). 56 Table 4.4 reports the results for a Granger causality test between rate of participation factors and length of active membership. The results showed that a higher rate of participation does NOT cause members to stay active because the chi-square values were not statistically significant at the p=.001 level. Due to the large sample size, statistical significance should be examined along with the effect size, and significance is reported at the p=.001 level only (Jensen 2007).

55 The approach I used to select a lag order is often referred to as the majority vote approach in the field of information science and, more precisely, in data mining and machine learning.

56 A Granger causality test can be derived using a VAR framework. A VAR model can be represented as a time series consisting of two variables ($x$ and $y$), where $y_p$ (the value of variable $y$ at time $p$) can be represented in terms of its past values and past values of the variable $x$. If $x$ Granger causes $y$, some or all lagged $x$ values will have non-zero coefficients. A Vector Auto Regression (VAR) is a statistical regression model. In a VAR model, multiple time series are used to estimate linear dependencies among variables. Each variable can be considered as evolving from its own lags and the lags of other variables. In a VAR equation, a set of variables is used. Each variable is represented as a linear function of $v$ lags of itself and of all of the remaining variables in the equation. An error term is also included. A first order VAR(1) for $n$ variables collected in an $n \times 1$ vector $y_t$ can be represented as $y_t = b_0 + b_1 y_{t-1} + q_t$, where $q_t$ is the error term, which can be represented as iid normal, and $b_0$ is an $n \times 1$ vector which represents a constant term in the equation. A VAR(1) model should satisfy the matrix equations $E(q_t q_t') = W$ and $E(q_t q_{t-j}') = 0$ for $j \neq 0$, where $W$ is the $n \times n$ positive semi-definite covariance matrix of the error terms and the second condition indicates that the error terms are uncorrelated across time periods. The dependencies among variables are represented by the matrix $b_1$, and the contemporaneous dependence is determined by the covariance of the error term $q_t$. The results from a Granger causality test (whether the null hypothesis is supported or not) can be determined from a Wald test, based on the chi-square values and the associated probability values. This is just a brief summary of a Granger causality test and a VAR model. For more information please refer to http://academic.reed.edu/economics/parker/s13/312/tschapters/S13_Ch_5.pdf

Dependent Variable      Explanatory Variables          Chi-square   Probability
Length of Membership    Rate of Discussion Articles    1.680        0.432
Length of Membership    Rate of Comments sent          1.638        0.441
Length of Membership    Rate of Votes given            3.183        0.204
Length of Membership    Rate of Whisper given          0.831        0.660
Length of Membership    ALL                            6.713        0.568

Table 4.4: Granger causality Wald test results for whether members' rate of participation causes their length of active membership (*p < .001)

4.7.2 Concluding Remarks

The Granger causality tests showed that more participation does NOT cause community members to remain active longer. A hazard rate model showed that NO correlation (at the p=.001 level; probability values for the corresponding chi-square are greater than 1%) exists between members' participation and their length of active membership. The Granger causality test supports the no causality interpretation of the hazard model results.

4.8 How is Length of Active Membership Affected by Rate of Feedback Received? (RQ3)

4.8.1 Feedback Members Received from Others Affecting Length of Membership

Using the server timestamps from the logs, I constructed lagged values for explanatory variables (rate of votes received, rate of comments received, rate of whispers received, fraction of deleted discussion articles) and the dependent variable (length of active membership), and tested for causality between explanatory variables and the dependent variable.

A Granger causality test between feedback received from others on member supplied content and length of active membership can be conducted with the following equation:

$$\mathrm{MembershipLength}(t) = \sum_{j=1}^{p} C_{5j}\,\mathrm{Feedback}(t-j) + \sum_{j=1}^{p} C_{6j}\,\mathrm{MembershipLength}(t-j) + U_{3t},$$

where $t$ is the current time, $p$ is the maximum number of lagged observations included in the model (the model order), $j$ is the lag order and can take any value from 1 through the maximum lag order $p$, MembershipLength is the length of active membership, Feedback is the rate of feedback received from others, $C_{6j}$ is the coefficient of MembershipLength for lag order $j$, $C_{5j}$ is the coefficient of Feedback for lag order $j$, and $U_{3t}$ is the model's residual (shock) at time $t$.

To select the number of lags (lag order) for the time series model, I employed five lag order selection statistics reported by STATA. Five tests, the Likelihood Ratio test (LR), Akaike Information Criteria (AIC), Hannan-Quinn Information Criteria (HQIC), Schwarz' Bayesian Information Criterion (SBIC), and Akaike's Final Prediction Error (FPE), were used. LR, FPE, and AIC suggested a maximum lag order of 2, whereas the HQIC and SBIC information criteria suggested a lag order of 1. A lag order of 2, as suggested by the majority of the tests, was selected for the Granger causality test.
(Please refer to Appendix M.) The tests were run for all variables, including the dependent variable, which is the length of active membership. The algorithms applied complex statistical tests/models (in an FPE test, the expected variance of the error is measured when an autoregressive time series is fitted against another time series of similar covariance structure) and, based on the majority of the test results, a lag order of 2 was selected.

A Granger causality test was conducted through a Vector Auto Regression (VAR). The Granger causality test results showed that feedback received from others does NOT cause length of active membership, as the chi-square values were not significant at the p=.001 level (probability values for the corresponding chi-square are greater than 1%). Please refer to Table 4.5 to review the results from the Granger causality test.

Dependent Variable      Explanatory Variables                     Chi-square   Probability
Length of Membership    Rate of Comments received                 .816         0.665
Length of Membership    Rate of Votes received                    .512         0.774
Length of Membership    Rate of Whisper received                  .261         0.877
Length of Membership    Fraction of deleted Discussion Article    1.951        0.377
Length of Membership    ALL                                       3.436        0.904

Table 4.5: Granger causality results for whether rate of feedback causes length of active membership (*p < .001)

4.8.2 Concluding Remarks

The Granger causality tests showed that more feedback received from others does NOT cause community members to remain active longer. A hazard rate model showed that no significant correlation (at the p=.001 level; probability values for the corresponding chi-square are greater than 1%) exists between feedback members received from others and their length of active membership. The Granger causality test supports the no causality interpretation of the hazard model results.
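The probabilities in Tables 4.4 and 4.5 can be cross-checked against the chi-square statistics. The sketch below assumes, as an assumption of this illustration rather than something stated in the text, that each single-variable test has 2 degrees of freedom (one per lag) and that the joint "ALL" test has 8 (four variables times two lags). For an even number of degrees of freedom the upper-tail probability has a closed form, so no statistics library is needed; the function name `chi2_sf_even_df` is illustrative.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic for an even number of df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("the closed form used here requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# (chi-square, assumed df, probability reported in Tables 4.4 and 4.5)
rows = [
    (1.680, 2, 0.432), (1.638, 2, 0.441), (3.183, 2, 0.204), (0.831, 2, 0.660),
    (6.713, 8, 0.568),
    (0.816, 2, 0.665), (0.512, 2, 0.774), (0.261, 2, 0.877), (1.951, 2, 0.377),
    (3.436, 8, 0.904),
]

for chi2, df, reported in rows:
    print(f"chi2={chi2:<6} df={df}  computed={chi2_sf_even_df(chi2, df):.3f}  reported={reported}")
```

Under these assumed degrees of freedom the computed values agree closely with the reported probabilities, all of which lie far above the .001 threshold used in this chapter.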
4.9 Granger Causality Tests in Regards to Early Participation and Early Feedback Received from Others on the Content

It is important to note that I could not conduct a Granger causality test of whether members' early participation influences their lengths of active membership (RQ2), or of whether early feedback received from others influences length of active membership (RQ4). A Granger causality test examines whether changes in one variable may cause changes in a second variable using past values of both variables. In this dissertation, I used members' first participation and first feedback received from others as measures of early participation and early feedback received from others. Since there is at most only one event recorded for each participation variable and each early feedback received variable, there are no prior observations.

4.10 Chapter summary

Several possible factors that may contribute to length of active membership, which may in turn contribute to the viability of online communities, were identified and tested in this chapter. These factors were tested for the online community, Sploder, to determine whether or not they are related to length of active membership using two rigorous statistical tests, a Cox proportional hazard rate model and a Granger causality test. First, a Cox proportional hazard rate model was introduced, and then the results from the test were presented. A hazard rate model tests for any correlation evidence between two variables. The results found that only two early participation variables (first discussion article submitted or not and first whisper submitted or not) were correlated with length of active membership. A Granger causality test was then introduced, and the results from the test were presented. A Granger causality test is used to determine if there is any possible causal link between two different variables.
It found no evidence for a causal relationship between rate of participation and length of active membership or between rate of feedback received and length of active membership.

CHAPTER 5
A Qualitative Study of Everything2

This chapter reports on interviews with participants from the Everything2 online community. This chapter does not include any results from Sploder. Perhaps because Sploder is a community of young adolescents and it is challenging to gain approval to interview young adolescents, only one long-term Sploder member agreed to be interviewed. Qualitative analyses will allow us to capture some of the insights behind members' participation over time, such as what contributes to their interest in remaining active in the community. The interviews were conducted as a part of a previous study. 57, 58 The interviewees were long-term members (members who have been active for over a year) from the community. The findings for this study were obtained by analyzing data collected for an earlier study of Everything2.

The chapter further explores the reasons why the length of active membership in online communities varies among members. It elaborates on why some members leave while other members remain active and some members decrease their activity in a community. Gaining insights from interviews may help to explain some of the findings from the earlier chapter (Chapter 3) or add new insights to the results from the earlier chapter that are associated with members' length of active membership. It is important to note that a qualitative study is not meant to capture statistical power or statistical significance. In a qualitative study, the variation among individual cases is what matters. For example, this study highlights quotes from participants regarding the reasons that they use Everything2. The quotes highlight different underlying reasons for why individual members remain active in the community or why some members leave the community.

57 In this chapter, whenever the term 'member' is used, this refers to members in general of the Everything2 community who are registered users of the community. When the term 'participant' is used, this refers to the specific Everything2 members that took part in the interviews. Members (registered users) log in to the site to access different services of the community.

58 The data was collected in 2010 for a separate project. I was a part of the research team.

5.1 Recruitment of Everything2 Participants

To acquire a deeper understanding of how length of active membership in online communities is associated with members' participation and feedback received on their content, interview data from thirty-one long-term Everything2 members was analyzed. 59 Everything2 members selected for the interviews were active participants for over a year. Using this criterion, potential participants were identified for the interview. A snowball sample recruitment technique was used to recruit participants. Initial participants selected for the interview were asked to recommend other long-term members of the community who might be willing to participate. Please refer to Table 5.1 for a summary of participants' total years in the community and the gender mix for the participants. A snowball sample is a technique often used in qualitative research to gain access to a population otherwise difficult to identify and contact (Heckathorn, 1997; Heckathorn, 2002; Goodman, 1961). Initial participants were selected by using email addresses gained from the server logs and through personal contacts of research team members. In the recruiting email, potential interview participants were asked to provide a phone number and date/time in order to be contacted for the interview. If no response was received, one reminder email was sent to potential participants.

59 It is crucial to note that if a user submits posts, submits votes on a post, or sends messages to others, all these content contributions in a community are termed participation. If a user receives votes on their post or receives messages from others, all these interactions are referred to as feedback.

Participants  Sex     Total Participation Years    Participants  Sex     Total Participation Years
P 1           Female  10                           P 17          Male    10
P 2           Male    6                            P 18          Male    9
P 3           Female  7                            P 19          Male    10
P 4           Male    10                           P 20          Male    10
P 5           Male    9                            P 21          Female  10
P 6           Male    10                           P 22          Male    10
P 7           Female  10                           P 23          Male    10
P 8           Female  2                            P 24          Female  7
P 9           Male    10                           P 25          Female  4
P 10          Female  10                           P 26          Male    3
P 11          Female  10                           P 27          Female  6
P 12          Female  8                            P 28          Male    10
P 13          Male    10                           P 29          Male    10
P 14          Male    6                            P 30          Female  10
P 15          Female  10                           P 31          Male    8
P 16          Female  10

Table 5.1: Everything2 long term interview participants

Once the recruitment reached a point where there was not enough variation in responses from the interview participants, several additional criteria were used to recruit interview participants likely to provide more diversity in their responses. Additional participants were therefore recruited based on different types of participation characteristics, which included: members who sent private messages to other members but to a large extent did not post in the forum, members who messaged or voted for other members rather than posted articles in the forum, and members who did not log in for months after posting their articles. Recruiting a variety of different types of members helped address a possible homophily bias in the Everything2 sample (Heckathorn, 1997; Kuzel, 1999). A homophily bias occurs when a sample's participants are too similar to each other; this is common with snowball samples because people tend to become friends with others who are like themselves (Heckathorn, 1997).
It is important to note that the sampling method was designed to recruit a diverse set of participants in order to capture a varied range of responses from the interview sample. Thirty-one participants from Everything2 were recruited for the purpose of the interview.

5.2 Interview Protocols

In-depth interviews with the participants were conducted using an interview protocol (see Appendix N). The protocol was created using King and Horrocks' (2010) interview guide. The interview guidelines helped with the design of the questionnaire by providing the research team with the tools to frame questions and discern how broad a question should be. The guide further emphasized the importance of avoiding presuppositions (assuming the answer before asking the question) in the questions and of considering how a question could change during the course of the study. After the protocol was created, a pilot test was given to two participants to explore whether the questionnaire needed to be modified. If modifications were required, the questionnaire was refined based on the responses received from the pilot interviews, and semi-structured interviews were conducted to collect the data. In a semi-structured interview, a set of possible themes is explored while allowing a free-flowing natural progression of conversation with the interviewees. 60

The interview questionnaire focused on the participation lifecycle of members in an online community. The study concentrates on 1) participants in the community who were still participating (active) at the time of the interview, and 2) participants who were inactive (members who no longer post or log in to the community) at the time of the interview.

60 Because the interviews were semi-structured, some of the interview responses deviated from the original questions.

The data was collected through telephone interviews, with each interview lasting sixty to ninety minutes.
Using semi-structured interviews, active participants were asked whether their amount of contributions (participation) in the community changed over time, the reasons for any change in participation, and how their contributions (posts) were received in the community. In addition to these questions, participants who had ceased to participate (inactive) were also asked why they decided to leave the community (stop their participation). Interviews focused on exploring members' lifespans in the Everything2 community, members' rates of participation (how often they participate), and any changes to their rates of participation. 61 The interviews were audio recorded and transcribed using the software Atlas.ti. After transcribing the interviews, the interview data was coded.

61 It is important to note that the underlying assumptions (interpretations) for rate of participation and changes to rate of participation are consistent with the interpretation of the quantitative data.

5.3 Coding

An exploratory analysis was conducted to answer generic questions about members' changes in participation, particularly whether they remained active or became inactive. Participants' rationales behind their participation and their perceptions that influenced a change in usage within these communities, such as life constraints and close ties in their network, were coded. Participants' types of usage, such as reading, personal relations, and posting behaviors, were also coded. An iterative approach was taken for the coding. 62

62 An iterative coding process is an approach where qualitative data is categorized by reviewing the responses multiple times. During the coding process, the qualitative data is categorized.

An interpretive data analysis approach was used, in which data was systematically analyzed to identify major categories. Participants' responses pertaining to these categories from the data were summarized (Strauss, 1998).
The categories were reviewed against the original transcripts, and relevant quotes were pulled for each category. Themes were identified and summarized based on the overarching patterns. The themes were then listed in a spreadsheet. The iterative coding process supported the identification of various patterns in the data by making it possible to examine how themes were similar and different across participants. In order to construct a story line (that explores length of active membership), codes were removed during the coding process, which allowed me to concentrate on specific themes. Using this coding technique, a data matrix was created and participants' quotes were entered into it. [Footnote 63: A comparative method, reading and rereading themes, guided the identification of final themes. After the interviews were read through once to identify themes, the related literature was reread and themes were formulated. These themes are presented in this study.] Please refer to Table 5.2 for the major themes identified from the coding.

Themes                           Descriptions
Posting Rationale                Reasons why posts were made. This includes social
                                 reasons and wanting to post higher quality material.
Reduction in Use                 Different precipitating factors that led to a change
                                 in members' usage. The factors include:
                                 a) life constraints
                                 b) deletion of posts
                                 c) downvotes
                                 d) the evolution of the 'wiki-era'
Members' use of the community    How the community is used, such as personal
                                 relations and postings.
Leaving the community            Different reasons members gave for leaving a
                                 community.

Table 5.2: Key themes identified based on coding

It is important to note that the codes covered commonalities and unique cases among participants. Participants' responses were grouped into categories and emerging themes were identified from the data and reported in this chapter.
95 5.4 Members’ Participation Members have used the Everything2 online community for a variety of reasons. Members viewed the online community as a place to develop skills for themselves and others. For example, one participant mentioned using others’ feedback to improve their writing skills: I used Everything2 in terms of my own writing basically to hone my technical writing skills... (P 12) The level of participation among members varied over time. Most participants mentioned that they posted fewer posts (at the time of interview) compared to what they had previously posted. Various reasons were given for reductions in participation over time, such as becoming busy in their lives or improving the quality of posts. A couple of participants stated: I decreased the amount of posting because I wanted to post higher quality content. Started reading more, wasn’t posting, when I became busy with life. I decreased my posting around 2002 when I started grad school. (P 3) I got wrapped up in work and the thing is like as that happened I started going on the site less and less ... (P 12) Participants mentioned that even though they had reduced the frequency of their posts over the years, they remained active in the community to keep in touch with others through their posts. They posted so that their friends could read and enjoy their writeups. This desire to keep in touch with friends in the community encouraged them to continue participating for a longer period of time, as shown by these participants’ comments: Some people I considered as friends I can only contact through the site... That's why I posted the writeup about my father's death and it’s probably the only way I would have to connect with them. (P 1) Participants expressed appreciation for the messaging features in Everything2. The messaging feature in Everything2 allows members to send and receive private messages. 
96 Participants indicated that they would check-in (login to the community) even after they had reduced their activity, sometimes even if they were no longer posting, just to keep in touch with friends through the messaging center. So if I wasn't messaging people about their writing as an editor, I was messaging them to catch up. (P 20) I still log in today and the only purpose is literally to check my inbox. Because I definitely do keep in touch with some people that way to this day. (P 7) One participant mentioned that giving and receiving feedback in the form of messages is an important part in being active in the community. I think that message feedback is crucial because it is the glue that holds all this together. It helps us to become better. (P 8) 5.5 Factors that Reduced Members' Participation There are four main factors that participants consistently identified as accounting for their reduced participation in the community: deletion of writeups, downvotes, life changing events, and the 'wiki-era'. 1. Deletion of Writeups Participants identified deletions of their writeups by editors as one of the reasons for their reduced participation. In Everything2, no clear editorial guidelines for deletion of writeups exist; the community did not promote a clear editorial policy in terms of deletion of writeups as content editors deleted writeups based on their discretion (Sarkar et al., 2012). In some cases, content editors expressed preferences for higher quality posts as reasons for deleting writeups. Content editors also deleted members' posts without clear standards, guidance or reasons, which discouraged members from participating. Editors also did not inform members when their posts were deleted. 
97 What precipitated the trickling off of my nodding [posting] was not only the massive purging of various things I had put into the database… And at some point someone decided these all had to go…and I was never informed until I went looking for something and realized it missing but weeks and weeks of other work was also missing and there was no one I could appeal to I thought…this is a good sign that this site is no longer a great place to spend my spare time because someone here doesn’t put much value on how I spend it. (P11) I became editor for a short period of time in the site. I felt it was a mistake, I was not ready. I deleted posts without explanation; I was not mature enough to do that job. (P 16) I was still active but I was, you know what, you guys you know what, delete as many of my writeups as you want, I’m not going to participate in this f*** absurd town hall show fest… (P 4) 2. Downvotes Everything2 supports a voting system that allows members to submit a negative vote, downvote, on other members’ writeups. Participants mentioned that receiving downvotes discouraged them from participating further. For example, P17 reported that downvotes by others on his posts discouraged him from participating in the community. I thought that was the absolute worst thing that could happen was to get downvoted. (P17) 3. The Evolution of the “Wiki-era” The “Wiki-era” is the time when sites like Wikipedia, an online encyclopedia, were designed (starting in 2001). Some members attempted to make Everything2 more like Wikipedia. As Everything2 became more like Wikipedia in the way it was run, members found it less rewarding to participate. They felt that Everything 2 is about writings of any kind, not about generating fact-oriented entries. A couple of participants expressed these views in their interviews. Participants felt discouraged by the way Everything2 started imitating other competing “Wiki-era” sites. 
Back then we had a pretty gross divergence between growing us into something that would eventually have competed with Wikipedia ...what Wikipedia grew into which is fine, whatever the community wanted eventually it's from the ground up, the entire site is more run by the writers and the coders than anybody really upstairs. But the social framework of the site is gonna be provided by the noders [authors]. (P 21)

I felt kind of pushed off the site at that point (referring to the wiki-era) and I only came back a few times with something that I apparently thought was important. One Writeup worthy...there were editors who thought that what I had to say was just not worthy of being on the site I thought well, I guess I'M NOT going to contribute to this site. (P 1)

4. Life-Changing Events

Participants mentioned that life-changing events changed their rate of participation in the Everything2 community. Life-changing events include starting a job, getting married, having kids, graduating from high school, and getting into a college program. Life-changing events led members to spend less time in the community than they had spent at the beginning of their membership. A couple of participants stated:

I didn't quite have the time to take classes and after classes to keep up with the site as much as when I was younger...I was also hired by the university run student newspaper to design pages for it. (P 19)

I still check e2 periodically and I don't visit the site very often mainly because I have almost no personal time anymore. I'm teaching full time, I'm taking grad school classes, I'm married, my husband has two children that we have every other weekend and every summer and I've found that we have almost no time to sit back and reflect back on anything which is the main motivation for me to. I've found ironically that I wrote more when I had less of a "life" and now that I have a life I have no time to sit and reflect and write about it.
(P 16) 5.6 Leaving the Community Despite reductions in their activity levels, many members still remained active in the community. However, some members left the community, becoming inactive. Participants mentioned various reasons for leaving the community. Participants stated that they left because their friends also left. 99 We lost a lot of good people and a lot of the people that I was virtual friends with decided to go. And as much as I enjoyed the writing, the longer I was there the more it was about the community to me, the more it was about the people. And when the people that I knew left, I didn’t really see any need to continue...when the people left, that was kind of the last thing for me, well there’s really no point in sticking around now...When those people started to leave, that’s when I chose to make an exit. (P 14) Some users might have left because of all the political arguments on the site. I believe some users were disenfranchised because they were not interested in the goofier aspects of the site, and tried to remain writing only factual or ended up leaving. I’m not sure if it's received that way because people like what I'm doing, or if it's positive because everybody knows my name. (P 30) Other reasons for leaving the community are similar to reasons participants mentioned for reducing their participation level. This shows that members do leave the Everything2 community. Everything2 is a content-based community that relies solely on its members to generate the content. This means it is crucial that members remain active in the community for the community to survive. This supports the claim at the beginning of this dissertation that understanding why some members remain active in the community while others leave is important to the viability of online communities. 5.7 Chapter Summary This chapter has examined interview data from members of Everything2. 
It reported reasons offered by different members for why they chose to continue participating in the community, why they chose to decrease their participation, and why they became inactive and effectively left the community. Factors that discouraged Everything2 members from participating were deletions of writeups, downvotes, imitation of other sites, and life-changing events. The findings from this study provide insights about factors that lead to change in members' participation and their reasons for leaving the community. 100 5.8 Concluding Remarks The insights gained from this interview-based study could not all be accounted for in the quantitative study (Chapters 3) due to limitations in the data sets. Server logs cannot capture certain factors, such as life changing events, that affect participation. Some findings did help to clarify some of the results in the quantitative study. For example, in the quantitative study for Everything2, it was found that the rate of writeups posted and the rate of messages sent is significantly correlated with members’ length of active membership. The interviews for the qualitative study found that members used writeups and the messaging center to stay in touch with friends in the community. This could explain why the rate of writeups and rate of messages is significantly correlated with length of active membership. Some of the results from the qualitative study were not shown to be significantly correlated with the length of active membership in the quantitative study. For example, some participants mentioned deleted writeups as a reason for reducing their participation. Although deletion of first writeup was statistically significant, fraction of writeups deleted was found not to be significantly correlated with length of active membership. Perhaps some members are less sensitive to negative feedback and the interview sample may have included such members from the community. 
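The contrast drawn above between deletion of the first writeup and the fraction of writeups deleted can be made concrete with a small sketch. The function name and the input format (a list of booleans in posting order) are hypothetical and are not taken from the dissertation's data sets; the example only illustrates that the two measures can point in opposite directions for the same member.

```python
def deletion_measures(deleted_flags):
    """Return (first_writeup_deleted, fraction_deleted) for one member.

    deleted_flags: list of booleans in posting order, True if that writeup
    was deleted (hypothetical input format, used for illustration only).
    """
    if not deleted_flags:
        # A member with no writeups has neither measure defined.
        return None, None
    first_deleted = 1 if deleted_flags[0] else 0
    fraction_deleted = sum(deleted_flags) / len(deleted_flags)
    return first_deleted, fraction_deleted


# Member A: first writeup deleted, later writeups kept.
member_a = deletion_measures([True, False, False, False])
# Member B: first writeup kept, later writeups deleted.
member_b = deletion_measures([False, True, True, True])
```

Under these assumptions, member A has the early negative feedback event but a low overall deletion fraction, while member B has the reverse, which is one way the two variables could behave differently in a model.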
I conclude this chapter by discussing differences between the findings reported here and those of an interview study conducted by Velasquez et al. (2013). This discussion is necessary because Velasquez et al. analyzed responses from the same set of interviews that generated the responses reported and discussed in this chapter. Readers familiar with both studies might reasonably ask why they report different findings. Velasquez et al. (2013) examined patterns of participation for latent members of the Everything2 community (members of a community familiar with its norms who are not actively participating at the time of the interview) and their motivations to participate. They found that most members move in stages through a range of membership roles, such as reader, contributor, collaborator, and leader. Only someone in one of the latter three stages of participation (contributors, collaborators, and leaders) can be classified as latent if they become inactive. They also reported that members' motivations to participate remain constant over time. The focus of the study of interview responses reported in this chapter was, by contrast, on factors that contributed to cessation of active participation by community members and how these affected lengths of active membership. For this reason, the responses of interest and interview quotes reported in this study are different from those that were the primary foci of the Velasquez et al. (2013) study. It is important to note, however, that while this qualitative study was conducted to examine factors affecting length of active membership, many of the themes (e.g., reasons for reduced participation) that emerged as important in this study are similar to those reported in Velasquez et al. (2013).

CHAPTER 6: Discussion of Findings

Without a sustaining base of long-term active members, communities are unlikely to survive for the long haul.
If online communities can identify factors that contribute to increased length of active membership, they could modify their designs and strategies to take these factors into account. This dissertation first reviewed the research literature to find candidates for these factors. Two key factors that may affect length of active membership were identified: attributes of members' own participation in their online communities and feedback they receive from other members. The dissertation pointed out some of the gaps in the existing literature and addressed those gaps in the current study. It also examined two different online communities, each with its own types of members and goals (Everything2 is about writing; Sploder, a discussion community, is about games), to see whether findings for one community also apply to the other, given that both communities support environments for collaboration and interaction among members with similar affordances. This comparison addresses the problem of determining the extent to which results of prior studies of online communities, each of which focused on a single community, might hold for online communities in general.

A mixed methods approach was used to examine how participation and feedback received are related to length of active membership in each community. The approaches included both quantitative and qualitative studies. The quantitative analysis of each community employed two rigorous statistical tests. A survival analysis using the Cox proportional hazard rate model was employed first to examine how participation and feedback factors are associated with length of active membership. Prior literature has reported correlational evidence that measures of members' participation in online communities are linked to the length of time they remain active in these communities.
Because correlational evidence is not dispositive proof of causality and there are plausible non-causative explanations for the relationships found, this dissertation also used a Granger causality test as a more rigorous and stringent test of the possibility that length of active membership is causally linked to several types of participation and feedback received measures. The qualitative component is an analysis of interviews with 31 long-term members of Everything2 who responded to questions related to changes in their levels of participation over time. Analysis of the interview data provided additional insights and helped further contextualize and explain some of the relationships identified in the previous two chapters. 6.1 Overview of Results from All Three Studies When new members join an online community, they face the challenge of deciding whether to remain in the community. For an online community, it is important that they remain active because their engagement with the community provides social and economic value to the community as a whole. The empirical findings from the Everything2 and Sploder studies reported in Chapters 3 and 4 show that the factors contributing to longer-term membership may vary among communities. The Cox proportional hazard rate model showed that all rate of participation variables were strongly correlated with length of active membership in Everything2. In the Everything2 community, rate of write-ups submitted, rate of messages sent, and rate of votes given are positively correlated at statistically significant levels with length of active membership. However, for the Sploder community, no rate of participation variable is strongly correlated with length of active membership. Perhaps members of Everything2 considered their own participation more valuable than members of Sploder. 104 The findings therefore show that rate of participation is not a good predictor of length of active membership for all online communities. 
For some communities like Everything2, members’ rate of participation could be a good predictor of length of active membership, while for other communities like Sploder, this is not the case. The hazard model also showed that the rates for individual feedback variables, such as rate of messages received, rate of votes received, rate of cools received, rate of comments received, and rate of whispers received are NOT strongly correlated with length of active membership for either community. The results from a Cox proportional hazard model showed that members' early participation in terms of first message submitted or not, first writeup submitted or not, (Everything2) and first discussion submitted or not (Sploder) are all correlated with length of active membership at statistically significant levels. All other types of early participation variables (first comment submitted or not, first upvote submitted or not, etc.,) were NOT strongly correlated with length of active membership. This suggests that members in online communities prefer different types of early participation, possibly based on their motivations for participation. Perhaps members’ first message (Everything2) was a response to negative messages from other members early on, which might explain a negative correlation between sending a message and length of active membership for this community. Perhaps submitting their first writeup indicates that they are interested in becoming more involved in the community since it requires more effort on the part of the member to submit a writeup than other forms of participation. Future research should examine this possibility in more depth. Submitting a first discussion article (Sploder) reduced the likelihood of members remaining active in the community. Sploder is a game making community, meaning members are probably more invested in making and playing games than in posting discussion articles in the forum. 
Perhaps when they sign up on the discussion forum they are looking for answers to a specific question in designing their game. Once this question is answered, they may stop logging into the community.

This dissertation generated mixed results regarding the relationship between early feedback received and length of active membership. For Everything2, the first upvote received on a first writeup, first downvote received on a first writeup, and deletion of first writeup were significantly correlated with length of active membership. Receiving an upvote or a downvote on the first writeup increased the likelihood of a longer length of membership. This suggests that an upvote encouraged members to remain active in the community longer. This could also mean that positive feedback encouraged members to continue using the site early in their membership. A downvote may increase the length of membership because it shows that other members have noticed the new member, and perhaps just getting noticed by another member is enough to encourage the new member to remain active, even if the feedback is negative. Deletion of a first writeup was negatively correlated with length of active membership. This could be because deleting a first writeup is viewed as negative feedback from the editors, whose feedback new members may take more seriously than feedback received from ordinary members. Several participants in the qualitative study expressed displeasure over their writeups being deleted, which is consistent with the statistical significance of deletion of first writeup. The analysis for Sploder also showed that deletion of the first discussion article is significantly negatively correlated with length of active membership.
New members may put more emphasis on the deletion of a discussion article than on the rest of the early feedback variables because only editors can delete a discussion article. For both Everything2 and Sploder, fraction of deleted posts (a negative feedback cue) was NOT significantly correlated with the length of membership. One early participation variable (first discussion article submitted) and one early feedback variable (deletion of first discussion article) were significantly correlated with length of active membership in Sploder. No rate of participation or rate of feedback received variables showed statistical significance for Sploder.

This dissertation used a Granger causality test (which uses a multivariate time series model) to examine whether the dissertation's participation measures and feedback received measures are simply correlated with length of membership or whether there is causality attributable to these factors. The results from this test showed that there is no solid evidence for causality between members' participation and length of active membership in either direction in Everything2. Members' participation in Everything2 is NOT causally linked to their length of active membership. The results from the Granger causality test also showed that receiving feedback from other members does NOT influence members' length of active membership, which supports the results from the hazard rate model. For the Sploder community, results from the Granger causality test showed that neither participation nor feedback received is causally linked with length of active membership, further validating the findings from the hazard rate analysis.

Previous research on online communities found that feedback received from others was positively correlated with continued participation in a community (Joyce, 2006; Lampe, 2005; Burke, 2009; Panciera, 2009).
However, these studies examined the relationship between feedback and continued participation over relatively short periods of time (the first three to sixteen months after a member joined). In contrast, this dissertation found that rate of feedback received from others was not statistically significantly correlated with a longer length of active membership when behavior was tracked for over two years for the two communities studied. This suggests that the short-term effects of feedback on continued use identified by previous studies may decay rapidly.

6.2 Implications for Practice

Online communities such as Everything2 and Sploder enable large numbers of members to participate. However, many of these members become inactive. The high dropout rates for online members suggest that new research on factors that contribute to longer active membership would be beneficial. Prior research showed that members' participation and the feedback received on their content might be important for inducing members to remain active in their online communities. Understanding the effects of participation and feedback factors on length of membership in online communities could provide valuable insights for designing online services for which interactions among members are important. Prior research has suggested certain design decisions that community administrators can make to increase the time members stay active in online communities, such as publishing clear guidelines regarding what constitutes acceptable and valuable content in a community (Sarkar, 2012). If these guidelines were posted on the website where new members could read them, perhaps members would have a better idea of what is acceptable, so their first post would be less likely to be deleted. This could encourage them to remain active longer. The interviewees from the qualitative study reported in Chapter 5 said that they were not even informed that their posts were deleted, or why, and this discouraged them from participating further.
If there are clear guidelines and posts are deleted, the administrators could inform the author why the post was deleted. The same goal can be further advanced by hiring professionals to generate content at the early stages of the community so that this content can set an example for new and mature members in the community (Kraut, 2012). Sustained existence for peer-production communities is dependent on cordial relationships between members and online editors. If members are NOT encouraged by editors, they may feel frustrated (Kraut, 2012). This could lead to a reduction in their participation and, eventually, members may become inactive in a community and stop using services associated with it. Understanding how participation and feedback factors affect length of membership in online communities could inform the design of tools that may reduce the burden on peer production editors in managing the community.

If members do not remain active in a community long-term, it could have a negative impact on the longevity of the site. Thus, online administrators should encourage members to remain active in the community. As the interview participants mentioned, a primary reason for them to remain active was to maintain the friendships they had built. Administrators should encourage these friendships to promote cooperation and a longer period of active membership. The insight gained from this research about the impact of initial participation and initial feedback on length of active membership may help administrators determine where they should spend their time and effort to incentivize members to participate further in a community, which could improve the sustainability of the community. The findings further suggest that early negative feedback has a strong negative impact on how long a member will remain active in an online community. Administrators should use their discretion carefully when providing negative feedback on content, especially with new members.
6.3 Limitations

While this study can provide insights into factors that influence length of active membership in online communities, it has limitations that future studies can address to expand this line of research. This study does NOT generalize to every online community. Findings from this study may apply to some user generated content-based communities, such as Wikipedia or World of Warcraft, that are similar to Everything2 and Sploder. However, even for the two communities examined the findings were somewhat different. Further, even though the sample sizes were large, the data analyzed in this dissertation is a convenience sample based on two online communities, NOT a true random sample drawn from the universe of online communities.

Quality of posts and quality of feedback, which this study did not take into account, may also contribute to length of active membership. Evaluating the quality of posts for a data set of this size would be a challenge. It may require automated natural language processing, which future research could use to extend the approach taken here. The dissertation could not account for psychological variables, such as members' motivations to participate or members' personalities, because server logs capture only actions and the results of actions. Psychological factors may help further explain what contributes to length of active membership in a community.

Another limitation is that the two communities did not automatically log a member out if the member closed the browser without signing out. In this instance, the server logs did not capture the next login. This may add bias to the results. Future research should address this. In addition, data sets were derived from a snapshot of members' behaviors. Due to computational challenges and restrictions in the server logs, analyses of early participation and early feedback were restricted to members' first posts only.
Future research should expand the measures of early participation.

Also, the qualitative and the quantitative studies of Everything2 were not jointly designed. The interview responses used for the qualitative study were from a previous study, which therefore was not designed to address many of the questions addressed by the quantitative studies. Nevertheless, the answers from the interviewees did still help to provide additional context for some of the findings, and they helped identify factors, such as life-changing events, that might influence how long members remained active in an online community but that could not be analyzed in the two quantitative studies reported in this dissertation.

Due to computational challenges, I could not separate feedback received from higher status members from feedback received from other members. This may have skewed the results in terms of how feedback factors are related to length of membership in the community. However, only 2.29% of the members in the Everything2 sample have achieved a higher status rank that allowed them to submit cools. This is a hard problem that future studies should find a way to address when they examine length of membership in the community.

6.4 Future Research

This study was limited to the first instance of each type of early participation and early feedback received variable (captured through server logs). Since there is at most one event recorded for each early participation variable and each early feedback received variable, there are no prior observations. Because of this, Granger causality tests could not be performed to examine whether early participation causally contributes to greater length of active membership and whether early feedback received is causally linked to length of active membership. Future studies should examine these relationships further by using the first few months of activity for early participation and early feedback received measures.
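One way to read this suggestion is that counting events in each of a member's first few months, rather than recording a single first event, yields a short series with the prior observations that a lagged test needs. The following is a minimal sketch under stated assumptions: event times are given as days since registration, a month is approximated as thirty days, and the function names are hypothetical. It shows only the data preparation step, not a Granger causality test, and it is not the procedure used in this dissertation.

```python
from collections import Counter


def monthly_counts(event_days, months=3, days_per_month=30):
    """Count events in each of the first `months` periods after registration.

    event_days: days since registration at which a member acted
    (hypothetical input format; 30-day periods are an assumption).
    """
    counts = Counter()
    for day in event_days:
        period = int(day // days_per_month)
        # Ignore events before registration or after the early window.
        if 0 <= period < months:
            counts[period] += 1
    return [counts[p] for p in range(months)]


def lagged_pairs(series, lag=1):
    """Pair each observation with the value `lag` periods earlier."""
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]


# Illustrative writeup timestamps for one member (made-up values).
writeup_days = [0.5, 3, 12, 31, 45, 70, 200]
series = monthly_counts(writeup_days)
pairs = lagged_pairs(series)
```

With a series like this for each early participation and early feedback measure, every period after the first has at least one earlier observation, which is the property the single first-event variables lack.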
6.5 Conclusions

This dissertation focuses on length of active membership in online communities and the ways it might be related to members' participation and the feedback they receive on content they contribute. Keeping members active is important for the success of online communities. For many of these communities, members are the main sources for sharing information and generating content. It is also possible that, over time, members who currently only receive information (e.g., only read) may start providing information (e.g., posts) to other members. However, it is not guaranteed that they will remain active.

This research was driven by the goal of better understanding the extent to which different activities may affect length of active membership, and of seeing how far the findings generalize across two communities that are similar in important ways. In the past, scholars have examined different communities (such as the online encyclopedia Wikipedia, the health community Breastcancer.org, the game community World of Warcraft, the news and discussion community Slashdot, the question and answer forum Yahoo! Answers, etc.) as standalone objects of study. These studies thus reported findings that were specific to their individual communities. Studying two communities provides evidence on whether findings from one community generalize to the other. The findings reported in this dissertation show that while some results may generalize across communities, not all results do. The finding that early negative feedback is negatively correlated with how long a member remains active in an online community held for the measures of early negative feedback in both Everything2 and Sploder.
APPENDICES

Appendix A

Results from a Cox Proportional Hazard Rate Model with a Cutoff Period of Two Months (Sixty Days) for Everything2

This appendix provides the results from the Cox proportional hazard rate model for Everything2 with a cutoff point of two months, so only members who remained active in the community for at least two months after they registered are included in this analysis.

Variable                                N       Min      Max        Mean      S.D.
Length of Membership                    21909   60.003   3451.289   793.151   829.200
Rate of Writeups                        21909   .0       5.7        .014      .090
Rate of Messages sent                   21909   .0       11.0       .038      .284
Rate of Votes given                     21909   .0       20.5       .109      .757
Rate of Messages received               21909   .0       10.4       .003      .075
Rate of Votes received                  21909   .0       .9         .000      .014
Rate of Cools received                  21909   .0       1.5        .384      .297
Fraction of deleted Writeups            1714    .2       .9         .603      .274
First Writeup submitted or not          21909   0        1          .31       .461
First Message submitted or not          21909   0        1          .48       .499
First Downvote submitted or not         21909   0        1          .31       .461
First Upvote submitted or not           21906   0        1          .98       .135
First Upvote received on a Writeup      21909   .0       1          .018      .131
First Downvote received on a Writeup    21909   .0       1          .311      .462
First Deletion of a Writeup             21909   0        1          .36       .480
First Cool received on a Writeup        21909   .0       1          .181      .384
Familiarization Time                    21909   0        5          .50       .899

Table A-1: descriptive statistics for the variables with a two month cut off time in Everything2

Table A-2 shows the chi-square difference between the null model and the full model for the Cox proportional hazard rate model at two months.

-2 Log Likelihood   Overall (score)               Change From Previous Step
                    Chi-square   df   Sig.        Chi-square   df   Sig.
11528.284           517.599      16   .000        381.617      16   .000

Table A-2: Omnibus Tests of Model Coefficients for Everything2

Table A-3 shows the results for the Cox proportional hazard rate model for Everything2 for members who remained active in the community at least two months.
The column labeled Exp(B) gives the estimated coefficient for each explanatory variable in terms of a hazard ratio, and Sig. gives the significance level for each variable.

Factors                      Variables                               Exp(B)   SE      Sig.
Rate of Participation        Rate of Writeups                        3.921    .474    .004
                             Rate of Messages sent                   1.572*   .080    .000
                             Rate of Votes given                     1.117    .049    .023
Rate of Feedback Received    Rate of Messages received               .012     2.388   .065
                             Rate of Votes received                  2.995    1.584   .489
                             Rate of Cools received                  4.609    .134    .000
                             Fraction of deleted Writeups            1.044    .122    .722
Early Participation          First Writeup submitted or not          2.628    1.289   .453
                             First Message submitted or not          .201*    .112    .000
                             First Downvote submitted or not         .403     1.284   .480
                             First Upvote submitted or not           .924     .267    .768
Early Feedback Received      First Upvote received on a Writeup      .705     .222    .116
                             First Downvote received on a Writeup    1.595*   .079    .000
                             First Cool received on a Writeup        5.639    .837    .039
                             First Deletion received on a Writeup    .963     .068    .583
Control                      Familiarization Time                    1.146*   .039    .001

Table A-3: Hazard rate model results for participation and feedback factors and length of active membership in Everything2 (*p <.001)

The results showed a significant difference in the effect of the rate of cools received on length of active membership for members who stayed in the community for at least 60 days compared to those who stayed in the community over a day but less than 60 days. Rate of cools received is positively correlated with length of active membership (at the p=.001 level). First cool received on a writeup is not statistically significant at the p=.001 level.

Appendix B

Variance Inflation Factor Analysis for Everything2

Table A-4 gives the results of a Variance Inflation Factor (VIF) analysis. A VIF value of 5 or above usually indicates multi-collinearity among the variables. I found VIF values between 1.007 and 2.609 for all variables.
The VIF test for multi-collinearity confirms that all of the individual participation variables and all individual feedback variables that have been used in the analysis are not collinear with each other. This means that no participation variable is collinear with any other participation variable, nor is any participation variable collinear with any feedback variable. No feedback variable is collinear with any other feedback variable. Also, the VIF values showed that length of active membership is not collinear with any participation or any feedback received variables.

Variables                                   Collinearity Statistics
                                            Tolerance   VIF
Rate of Writeups                            .786        1.272
Rate of Messages sent                       .989        1.011
Rate of Votes given                         .964        1.037
Rate of Messages received                   .986        1.015
Rate of Votes received                      .888        1.126
Rate of Cools received                      .540        1.853
Fraction of deleted Writeups                .993        1.007
First Writeup submitted or not              .517        1.936
First Message submitted or not              .383        2.609
First Downvote submitted or not             .711        1.406
First Upvote submitted or not               .945        1.058
First Upvote received on First Writeup      .886        1.129
First Downvote received on First Writeup    .845        1.183
First Cool received on First Writeup        .711        1.407
Deletion of First Writeup                   .962        1.039
Familiarization Time                        .967        1.034

Table A-4: results from a Variance Inflation Factor (VIF) analysis for Everything2

Appendix C

Survival Function at Mean of Covariates in Everything2

The x axis is time to cessation of active membership in days. The y axis indicates probability of survival. Any point on the survival curve shows the probability that an active member will remain active for at least that amount of time.
[Figure: Survival Function at Mean of Covariates. x axis: Length of Membership in Days (0 to 4000); y axis: Cum Survival (0.0 to 1.0).]

Figure A-1: Survival graph for members of Everything2

Appendix D

Model Fit in Terms of a Chi-square Difference for the Hazard Rate in Everything2

The chi-square difference between the null model (only the intercept has a non-zero value; all explanatory (independent) variables in the model have zero regression coefficients in the equation) and the full model (all explanatory variables and the intercept are included; the intercept and at least one explanatory variable have a value in the equation) is significant at the p=.001 level (refer to Table A-5), suggesting that the effect size of the explanatory variables in terms of hazard ratio is significant for explaining the length of active membership. A null model provides a baseline measure of a model's fit. The fit of a model refers to how well the data points fit the equation of the model and how close the predicted values are to the observed values. Model fit is often assessed in terms of a chi-square difference. A larger chi-square difference between the full model and the null model indicates that the full model fits the data better than the null model. A better model fit suggests that the margin of unexplained variance (in the model) is low and provides evidence for a stronger correlation between the explanatory variables and the dependent variable.

-2 Log Likelihood   Overall (score)          Change From Previous Step
                    Chi-square   Sig.        Chi-square   Sig.
21706.418           3393.323     .000        838.576      .000

Table A-5: Omnibus Tests of Model Coefficients

Appendix E

Statistical Tests to Determine a Lag Order

Five statistical tests were performed to determine the lag order for a Granger causality test. These tests are the Likelihood Ratio test (LR), the Akaike Information Criterion (AIC), the Hannan-Quinn Information Criterion (HQIC), Schwarz' Bayesian Information Criterion (SBIC), and Akaike's Final Prediction Error (FPE).
These tests are designed to balance the trade-off between selecting too few lags and selecting too many lags. They are often used to select a lag order for forecasting. Granger causality uses forecasting techniques to establish a relationship between variables using past values of two separate variables. In this study, I used the software package STATA for these tests, which provided a corresponding lag order for the Granger causality test.

Likelihood Ratio test (LR)

A likelihood ratio test is a statistical test that compares the fit of two models, a null model and a full model. The null model contains the intercept and all variables with zero coefficients, whereas the alternative (full) model contains all explanatory variables and the intercept (the intercept and at least one explanatory variable have a value in the equation). The test uses the likelihood ratio to compare how likely the observed data are under one model relative to the other. The likelihood ratio is used to generate a p-value to decide whether the null model should be rejected in favor of the alternative model.

Akaike Information Criterion (AIC)

The Akaike Information Criterion (AIC) is a statistical criterion that uses information entropy to provide a measure of the relative quality of a model. If k is the number of parameters in a model and L is the maximum value of the likelihood function, the AIC value for the model is $\mathrm{AIC} = 2k - 2\ln(L)$.

Hannan-Quinn Information Criterion (HQIC)

The Hannan-Quinn Information Criterion (HQIC) is a well-known criterion for lag order detection in time series models. In this study, this information criterion is used to determine the lag order for the Granger causality test. The HQIC is measured using the expression $\mathrm{HQIC} = u\log(\mathrm{RSS}/u) + 2m\log\log u$, where m is the number of parameters, u is the number of observations, and RSS is the residual sum of squares obtained from a linear regression.
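The two expressions above can be sketched in a few lines of code. This is an illustrative sketch only, not the STATA routine used in this study; the function names (aic, hqic, select_lag) are hypothetical, and aic takes the log-likelihood ln(L) as input rather than L itself to avoid numerical underflow. The lag with the smallest criterion value is preferred, which the last helper applies to the AIC and SBIC columns reported in Table A-6.

```python
import math


def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L), with ln(L) passed in directly."""
    return 2 * n_params - 2 * log_likelihood


def hqic(rss, n_obs, n_params):
    """HQIC = u log(RSS/u) + 2 m log(log(u)) for u observations."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params * math.log(math.log(n_obs))


def select_lag(criterion_by_lag):
    """Return the lag order whose criterion value is smallest."""
    return min(criterion_by_lag, key=criterion_by_lag.get)


# Criterion values copied from Table A-6 (Everything2, participation).
aic_table_a6 = {0: 8.68492, 1: 8.6855, 2: 8.66136}
sbic_table_a6 = {0: 8.79456, 1: 9.23372, 2: 9.64815}

print(select_lag(aic_table_a6))   # lag preferred by AIC
print(select_lag(sbic_table_a6))  # lag preferred by SBIC
```

The criteria need not agree: for the Table A-6 values, AIC prefers a longer lag than SBIC, which penalizes additional parameters more heavily.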
Schwarz' Bayesian Information Criterion (SBIC)

Schwarz' Bayesian Information Criterion (SBIC) is a statistical criterion often used for model selection. It uses the expression $\mathrm{SBIC} = -2L_p + p\ln q$, where q is the sample size, $L_p$ is the maximized log-likelihood of the model, and p is the number of parameters in the model. It uses a Bayesian prior in terms of the p parameters.

Akaike's Final Prediction Error (FPE)

Akaike's Final Prediction Error (FPE) is often used to estimate a lag order. In econometrics, it is defined as the expected variance of the prediction error. It estimates the expected variance of the error when an autoregressive time series is fitted against another time series of similar covariance structure.

Appendix F

Results from a Lag Order Test for Granger Causality in Everything2 Pertaining to Length of Active Membership and Participation

Results from a lag order selection test (members' length of active membership and participation variables such as rate of writeups, rate of messages sent, and rate of votes) are reported based on the LR, FPE, AIC, HQIC, and SBIC information criteria.

Lag Order   P-Value   LR        FPE        AIC        HQIC       SBIC
0           -         -         .069489    8.68492    8.72917*   8.79456*
1           .010      31.946    .069559    8.6855     8.90677    9.23372
2           .005      34.221*   .068042*   8.66136*   9.05964    9.64815

Table A-6: lag order results for Granger causality test on length of active membership and participation in Everything2

Appendix G

Results from a Lag Order Test for Granger Causality in Everything2 Pertaining to Length of Active Membership and Feedback Received

Results from a lag order selection test based on the LR, FPE, AIC, HQIC, and SBIC information criteria.
Lag Order   P-Value   LR        FPE        AIC        HQIC       SBIC
0           -         -         4.5e+09    33.5715    33.6313*   33.7584*
1           0.010     32.101*   4.5e+09*   33.5682*   33.867     34.5023

Table A-7: results from a lag order test for Granger causality on length of active membership and feedback in Everything2

Appendix H

Results from a Cox Proportional Hazard Rate Model with a Cutoff Period of Two Months (Sixty Days) for Sploder

This appendix provides the results from the Cox proportional hazard rate model for Sploder with a lower bound membership cutoff of two months, such that only members who remained active in the community for at least two months after they registered were included in this analysis.

Variables                                             N     Min     Max      Mean      S.D.
Length of active membership                           971   60.02   951.52   293.087   220.533
Rate of Discussion Article                            971   0       9.51     .281      .663
Rate of Comments sent                                 971   0       23.44    1.670     3.543
Rate of Votes given                                   971   0       19.34    .876      1.558
Rate of Whisper given                                 971   0       7.00     .240      .667
Rate of Comments Received                             971   0       5.00     2.463     1.597
Rate of Votes Received                                971   0       11.35    6.128     3.043
Rate of Whisper Received                              971   0       1.50     .738      .476
Fraction of Deleted Discussion Article                971   0       20.1     .113      .715
First Comment received on First Discussion Article    971   0       1.00     .213      .409
Deletion of First Discussion Article                  971   0       1.00     .233      .423
First Discussion Article submitted or not             971   0       1.00     .808      .393
First Vote submitted or not                           971   0       1.00     .996      .055
First Whisper submitted or not                        971   0       1.00     .955      .205
Familiarization Time                                  971   0       950.15   279.683   254.756

Table A-8: descriptive statistics for the variables with a two month cut off time in Sploder

Table A-9 shows the chi-square difference between the null model and the full model for the Cox proportional hazard rate model at two months.

-2 Log Likelihood   Overall (score)               Change From Previous Step
                    Chi-square   df   Sig.        Chi-square   df   Sig.
19827.233           521.038      14   .000        721.626      14   .000

Table A-9: Omnibus Tests of Model Coefficients for Sploder

Table A-10 shows the results for the Cox proportional hazard rate model for Sploder for members who remained active in the community at least two months. The column labeled Exp(B) gives the estimated coefficient for each explanatory variable in terms of a hazard ratio, and Sig. gives the significance level for each variable.

Factors                      Variables                                             Exp(B)   SE     Sig.
Rate of Participation        Rate of Discussion Article                            1.000    .056   .994
                             Rate of Comments sent                                 1.001    .012   .957
                             Rate of Votes given                                   1.000    .028   .993
                             Rate of Whisper given                                 .958     .070   .541
Rate of Feedback Received    Rate of Comments received                             .999     .023   .952
                             Rate of Votes received                                .999     .012   .952
                             Rate of Whisper received                              1.002    .078   .974
                             Fraction of deleted Discussion Articles               .998     .050   .962
Early Feedback Received      First Comment received on First Discussion Article    1.022    .096   .820
                             Deletion of First Discussion Article                  .048*    .327   .000
Early Participation          First Discussion Article submitted or not             1.000    .109   .997
                             First Whisper submitted or not                        1.655    .593   .396
                             First Vote submitted or not                           1.001    .582   .998
                             Familiarization Time                                  1.000    .000   .971

Table A-10: Hazard rate model results on participation and feedback factors and length of active membership in Sploder (*p <.001)

Appendix I

Variance Inflation Factor Analysis for Sploder

Table A-11 gives the results of a Variance Inflation Factor (VIF) analysis. A VIF value of 5 or above usually indicates multi-collinearity among the variables. After removing two variables that showed a VIF value above 5, I found VIF values between 1.004 and 1.504 for all variables used in the analysis. The VIF test for multi-collinearity confirms that all of the individual participation variables and all individual feedback variables that have been used in the analysis are not collinear with each other.
This means that no participation variable is collinear with any other participation variable, nor is any participation variable collinear with any feedback variable. No feedback variable is collinear with any other feedback variable. Also, the VIF values showed that length of active membership is not collinear with any participation or any feedback received variables.

Variables                                             Collinearity Statistics
                                                      Tolerance   VIF
Rate of Discussion Article                            .949        1.054
Rate of Comments sent                                 .679        1.473
Rate of Votes given                                   .683        1.463
Rate of Whisper given                                 .924        1.082
Rate of Comments received                             .996        1.004
Rate of Votes received                                .994        1.006
Rate of Whisper received                              .993        1.008
Fraction of deleted Discussion Articles               .982        1.018
First Comment received on First Discussion Article    .784        1.276
Deletion of First Discussion Article                  .773        1.293
First Discussion Article submitted or not             .665        1.504
First Whisper submitted or not                        .841        1.189
First Vote submitted or not                           .992        1.008
Familiarization Time                                  .757        1.320

Table A-11: results from a Variance Inflation Factor (VIF) analysis for Sploder

Appendix J

Survival Function at Mean of Covariates

The x axis is time to cessation of active membership in days. The y axis indicates probability of survival. Any point on the survival curve shows the probability that an active member will remain active for at least that amount of time.
[Figure: Survival Function at Mean of Covariates. x axis: Length of Membership in Days (0 to 1000); y axis: Cum Survival (0.0 to 1.0).]

Figure A-2: Survival graph for members of Sploder

Appendix K

Model Fit in Terms of a Chi-square Difference for the Hazard Rate in Sploder

The chi-square difference between the null model (only the intercept has a non-zero value; all explanatory (independent) variables in the model have zero regression coefficients in the equation) and the full model (all explanatory variables and the intercept are included; the intercept and at least one explanatory variable have a value in the equation) is significant at the p=.001 level, suggesting that the effect size of the explanatory variables in terms of hazard ratio is significant for explaining the length of active membership (refer to Table A-12). A null model provides a baseline measure of model fit. The fit of a model refers to how well the data points fit the equation of the model and how close the predicted values are to the observed values. Model fit is often measured in terms of a chi-square difference. A larger chi-square difference between the full model and the null model indicates that the full model fits the data better than the null model. A better model fit suggests that the margin of unexplained variance (in the model) is low and provides evidence for a stronger correlation.

-2 Log Likelihood   Overall (score)          Change From Previous Step
                    Chi-square   Sig.        Chi-square   Sig.
19770.023           570.831      .000        778.837      .000

Table A-12: Omnibus Tests of Model Coefficients

Appendix L

Results from a Lag Order Test for Granger Causality in Sploder Pertaining to Length of Active Membership and Participation

Results from lag order selection tests based on the LR, FPE, AIC, HQIC, and SBIC information criteria.
Lag   P-Value   LR        FPE        AIC        HQIC       SBIC
0     -         -         5.4e+06    29.6966    29.7693    29.9345*
1     0.522     23.96     1.4e+07    30.6266    31.0629    32.0539
2     0.001     53.914*   1.5e+07*   30.4868*   31.2868*   33.1036

Table A-13: lag order diagnosis for a Granger Causality Test

Appendix M

Results from a Lag Order Test for Granger Causality in Sploder Pertaining to Length of Active Membership and Feedback Received

Results from lag order selection tests based on the LR, FPE, AIC, HQIC, and SBIC information criteria.

Lag   P-Value   LR        FPE        AIC        HQIC       SBIC
0     -         -         395.27     20.1688    20.2415    20.4067
1     0.225     29.968    835.967    20.8842    21.3206*   22.3116*
2     0.081     35.402*   1701.69*   21.4056*   22.2056    24.0224

Table A-14: lag order diagnosis for a Granger Causality Test

Appendix N

Interview Questions

1. Think back to when you first heard about Everything2/Sploder. Where did you hear about the site? What prompted you to join Everything2/Sploder [Initial motivation]?
2. Think back, what was the first thing you contributed to E2/Sploder?
• Do you remember it? (If needed: Our logs indicated that it was node )
• Do you think it was a good contribution? Why or why not? What did you like about it?
• Was it well received by the community?
• How / why did you decide to contribute content to the site?
• Did you contribute privately first with members and gradually start posting publicly? [Private vs. Public]
• What was your sense of the other users of the site?
3. How do you think your use of E2/Sploder changed over time?
a) Were you ever contributing “regularly”?
b) Did you contribute to the site in other ways than just nodes?
c) How often would you say you contributed? Was it always about the same amount, or did that amount of contribution change at different times? (If so, why did it change?)
d) On average, how much time was spent in E2/Sploder every week? Did it change over time?
e) Did your role change over the years while contributing to E2/Sploder as a member? (Trying to probe general contributor vs. moderator)
4. Did you participate in giving feedback to other users on the site?
a) How did you decide what deserved a C!?
b) Did you use these feedback features much? Why or why not?
c) Did your use of these community feedback features change the way you contributed nodes?
5. Think back, what was the last thing you contributed to the site?
a) Why did you decide to contribute that?
b) How well was it received by the community?
c) What happened that caused you to stop after that?
d) (If accurate) That was contributed around . What else was happening in your life around that time?
e) Did you receive private messages from other members of the node/community during your contribution? [Validating the spikes in private messages]
f) After you stopped contributing, did you follow the other members’ contribution in E2/Sploder? [Indirectly validating the lurking behavior as precursors of Exiting]
g) While contributing to E2/Sploder did you also contribute to any other similar social networking sites? If yes, did you continue to contribute to the other site even after withdrawing your contributions from E2/Sploder? [Competing Sites]
h) After withdrawing from E2/Sploder did you join any other similar websites? [Validity Checking]
6. According to our logs, the last node you contributed was . Do you remember contributing that?
i) Why did you decide to contribute that?
j) How well was it received by the community?
k) What happened that caused you to stop after that?
l) That was contributed around . What else was happening in your life around that time?
7. What are your current thoughts about Everything2/Sploder?
8. If you could start over, would you contribute again? What would you do differently?

Contributors who are no longer active

1. Before contributing to E2/Sploder did you observe the other users in the site?
2. What was the content you contributed to E2/Sploder?
a) Do you remember it? (If needed: Our logs indicated that it was node )
b) Do you think it was a good contribution?
Why or why not? What did you like about it?
c) Was it well received by the community?
d) How / why did you decide to contribute content to the site?
e) Did you contribute privately first and gradually start contributing publicly? [Private vs. Public]
f) What was your sense of the other users of the site?
g) What caused you to stop after that?
h) (If accurate) That was contributed around . What else was happening in your life around that time?
i) Did you receive private messages from other members of the node/community during your contribution? [Validating the spikes in private messages]
j) After you stopped contributing, did you follow the other members’ contribution in E2/Sploder? [Indirectly validating the lurking behavior as precursors of Exiting]
k) While contributing to E2/Sploder, did you also contribute to any other similar social networking sites? If yes, did you continue to contribute to the other site even after withdrawing your contribution from E2/Sploder? [Competing Sites]
l) After withdrawing from E2/Sploder did you join any other similar websites? [Validity Checking]

Long term contributors who are still active

1. Think back to when you first heard about Everything2/Sploder. Where did you hear about the site? What prompted you to join Everything2 [Initial motivation]?
2. Think back, what was the first thing you contributed to E2/Sploder?
• Do you remember it? (If needed: Our logs indicated that it was node )
• Do you think it was a good contribution? Why or why not? What did you like about it?
• Was it well received by the community?
• How / why did you decide to contribute content to the site?
• Did you contribute privately first with members and gradually start posting publicly? [Private vs. Public]
• What was your sense of the other users of the site?
3. How do you think your use of E2/Sploder changed over time?
f) Were you ever contributing “regularly”?
g) Did you contribute to the site in other ways than just nodes?
h) How often would you say you contributed? Was it always about the same amount, or did that amount of contribution change at different times? (If so, why did it change?)
i) On average, how much time do you spend on E2/Sploder every week? Did it change over time?
j) Did your role change over the years while contributing to E2/Sploder as a member? (Trying to probe general contributor vs. moderator)
4. Did you participate in giving feedback to other users on the site?
d) How did you decide what deserved a C!/Whisper?
e) Did you use these feedback features much? Why or why not?
f) Did your use of these community feedback features change the way you contributed nodes?
5.
a) When and what is the most recent contribution you have made?
b) What comments did you receive about your contribution?
c) What do you feel about the comments?
d) Are you also a member of other social network sites like E2/Sploder?
e) What changes and improvements do you want to see in E2/Sploder?
[N.B: Based on the answers users provided in the earlier questions, if needed, or if they were wrong about their answer to the previous question]

REFERENCES

Arguello, J, Butler, B. S, Joyce, L, Kraut, R. E, Ling, K. S, Rosé, C. P. and Wang, X. (2006). Talk to me: Foundations for Successful Individual-Group Interactions in Online Communities. In Proceedings of the 2006 ACM Conference on Human Factors in Computing Systems, New York: ACM Press, 959-968.
Arkes, H. and Blumer, C. (1985). The Psychology of Sunk Cost. Organizational Behavior and Human Decision Processes, 35(1), 124-140.
Arkes, H. and Hutzel, L. (2000). The Role of Probability of Success Estimates in the Sunk Cost Effect. Journal of Behavioral Decision Making, 13(3), 295-306.
Bender, T. (1978). Community and Social Change in America. Princeton, NJ: Rutgers University Press.
Bendor, J. and Swistak, P. (2001). The Evolution of Norms. American Journal of Sociology 106 (6), 1,493–1,545.
Brothers, L, Hollan, J, Nielsen, J, Stornetta, S, Abney, S, Furnas, G. and Littman, M. (1992, November 1-4). Supporting Informal Communication via Ephemeral Interest Groups. Proceedings of CSCW 1992. Toronto, Ontario: ACM.
Bowes, J. (2002, March). Building Online Communities for Professional Networks. Proceedings of the Global Summit of Online Knowledge Networks. Adelaide, Australia. checked 25 July 2005.
Burke, M, Marlow, C, and Lento, T. (2009). Feed Me: Motivating Newcomer Contribution in Social Network Sites. In Proceedings of CHI 2009, ACM Press, 945-954.
Cheshire, C. (2008). The Social Psychological Effects of Feedback on the Production of Internet Information Pools. Journal of Computer-Mediated Communication, 13, 705–727.
Choi, B, Alexander, K, Kraut, R.E. and Levine, J.M. (2010, February). Socialization Tactics in Wikipedia and Their Effects. Presented at CSCW 2010.
Churchill, E, Girgensohn, A, Nelson, L, and Lee, A. (2004). Information Cities: Blending Digital and Physical Spaces for Ubiquitous Community Participation. Communications of the ACM, 47 (2), 38-44.
Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data. New York: Chapman & Hall.
Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society, Series B 34 (2): 187–220.
Cox Reference 1. http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-coxregression.pdf
Cox Reference 2. https://mywebspace.wisc.edu/jmullahy/web/basu%20manning%20mullahy.pdf
Delanty, G. (2010). Community. (Second edition). New York: Routledge.
Ducheneaut, N. (2005). Socialization in an Open Source Software Community: A SocioTechnical Analysis. Computer Supported Cooperative Work, 14 (4), 323-368.
Durkheim, E. (1960). The Division of Labour in Society. New York: Free Press.
Etzioni, A. and Etzioni, O. (1999). Face-to-Face and Computer-Mediated Communities, a Comparative Analysis. The Information Society 15: 241–8.
Eytan, A. and Huberman, B.A. (2000, October). Free Riding on Gnutella.
First Monday, 5(10).
Everything2: http://www.everything2.com/
Farzan, R. and Brusilovsky, P. (2011, January). Encouraging User Participation in a Course Recommender System: An Impact on User Behavior. Computers in Human Behavior: Volume 27, Issue 1, 276-284.
Goodman, L.A. (1961). Snowball Sampling. Annals of Mathematical Statistics, 32 (1): 148–170. doi:10.1214/aoms/1177705148.
Granger, C. (1969). Investigating Causal Relations by Econometric Models and Cross-Spectral Methods. Econometrica 37 (3).
Granger Reference 1. http://academic.reed.edu/economics/parker/s13/312/tschapters/S13_Ch_5.pdf
Gupta, S. and Kim, H.W. (2004). Virtual Community: Concepts, Implications, and Future Research Directions. In Proceedings of the Tenth Americas Conference on Information Systems (New York, NY, August), C. Bullen, and E. Stohr, Eds. AIS, Atlanta, GA.
Halfaker, A, Kittur, A. and Riedl, J. (2011). Don't Bite the Newbies: How Reverts Affect the Quantity and Quality of Wikipedia Work. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (WikiSym '11). New York, NY: ACM, 163-172.
Heckathorn, D. (1997). Respondent-Driven Sampling: a New Approach to the Study of Hidden Populations. Social Problems, 44 (2), 174-199.
Heckathorn, D.D. (2002). Respondent-Driven Sampling II: Deriving Valid Estimates from Chain-Referral Samples of Hidden Populations. Social Problems, 49: 11-34.
Horrigan, J. (2007). A Typology of Information and Communication Users. Pew Internet & American Life Project. Available from http://www.pewInternet.org/pdfs/PIP_ICT_Typology.pdf [Accessed on 15 February 2012].
Hughes, D, Coulson, G. and Walkerdine, J. (2005). Free Riding on Gnutella Revisited: The Bell Tolls. IEEE Distributed Systems Online, 6(6): 1–18.
Jensen, C, Sarkar, C, Jensen, C. and Potts, C. (2007, July). Tracking Website Data-Collection and Privacy Practices with the iWatch Web Crawler. In Proceedings of the Symposium of Usable Privacy and Security. Pittsburgh, PA.
Jones B.D.
(2001). Politics and the Architecture of Choice: Bounded Rationality and Governance. Chicago: University of Chicago Press.
Joyce, E. and Kraut, R. (2006). Predicting Continued Participation in Newsgroups. Journal of Computer-Mediated Communication, 11(3): 723-747.
Keisler, S. (2007). http://www.ee.oulu.fi/~vassilis/courses/socialweb10F/reading_material/5/Kiesler07.pdf
King, N, and Horrocks, C. (2010). Interviews in Qualitative Research. London: Sage.
Kollock, P. (1999). The Economies of Online Cooperation: Gifts and Public Goods in Cyberspace. M. Smith and P. Kollock (editors). Communities in Cyberspace. London: Routledge, 219–239.
Koh, J, Kim, Y, Butler, B. and Bock, G. (2007). Encouraging Participation in Virtual Communities. Communications of the ACM 50 (2), 68-73.
Kraut, R. E. and Resnick, P. (2012). Evidence-Based Social Design: Mining the Social Sciences to Build Online Communities. Cambridge, MA: MIT Press.
Kuzel, A.J. (1999). Sampling in Qualitative Inquiry. In B.F. Crabtree and W.L. Miller (eds): Doing Qualitative Research. Thousand Oaks, CA: SAGE, pp. 33–46.
Lakhani, K. and Wolf, R. (2005). Why Hackers do What They do. Perspectives in Free and Open Source Software. J. Feller, B. Fitzgerald, S. Hissam, and K. Lakhani (Eds.). MIT Press, Cambridge, MA.
Lampe, C. and E. Johnston. (2005). Follow the (Slash) Dot: Effects of Feedback on New Members in an Online Community. Proceedings of the 2005 international ACM SIGGroup conference on supporting group work. New York, NY: ACM Press.
Lampe, C. (2009). Participation Lifecycle in Online Communities. Submitted to National Science Foundation.
Lampe, C, Wash, R, Velasquez, A. and Ozkaya, E. (2010). Motivations to Participate in Online Communities. In the Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI) Atlanta, GA.
Ling, K., Beenen, G., Ludford, P., Wang, X, Chang, K, Li, X, Cosley, D., Frankowski, D, Terveen, L, Rashid, A, Resnick, P. and Kraut, R. (2005).
Using Social Psychology to Motivate Contributions to Online Communities. Journal of Computer-Mediated Communication. 10 (4). Mockus, A. Fielding, R. T. and Andersen, H. (2002). Two Case Studies of Open Source Software Development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3): p. 309-346. Moran, E. (2008). http://www.sitepoint.com/study-why-most-online-communities-fail/ Moreland, R. L. and Levine, J. M. (2001). Socialization in Organizations and Work Groups. M.E. Turner (Ed.). Groups at Work: Theory and Research, 69–112. Mahwah, NJ: Lawrence Erlbaum Nielsen, J. (2006). Participation Inequality: Lurkers vs. Contributors in Internet Communities. http://www.useit.com/alertbox/participation inequality.html Nonnecke, B. and Preece, J. (2000). Lurker Demographics: Counting the Silent. Paper presented at the ACM CHI, The Hague. Nonnecke, B. and Preece, J. (2001). Why Lurkers Lurk. Paper presented at the Americas Conference on Information Systems, Boston. Nonnecke, B. and Preece, J. (2003). Silent Participants: Getting to Know Lurkers Better. C. Lueg & D.Fisher (Eds.). From Usenet to CoWebs: Interacting with Social Information Spaces, Springer. Nov, O, Naaman, M. and Ye, C. (2009). Motivational, Structural and Tenure Factors that Impact Online Community Photo Sharing. In Proceedings of ICWSM 2009: ACM Press, 555-566. O'Mahony, S. and Ferraro, F. (2007). The Emergence of Governance in an Open Source Community. Academy of Management Journal, 50 (5), 1079-1106. Panciera, K, Priedhorsky, R, Erickson, T. and Terveen, L. (2010). Lurking? Cyclopaths? A Quantitative Lifecycle Analysis of User Behavior in a Geowiki. In Proceedings of CHI. Panciera, K, Halfaker, A. and Terveen, L. (2009). Wikipedians are Born, not Made. In ACM Special Interest Group on Computer-Human Interaction. Phang, C.W, Kankanhalli, A. and Sabherwal, R. (2009). Usability and Sociability in Online Communities: A Comparative Study of Knowledge Seeking and Contribution. 
Journal of the Association for Information Systems, 10 (10), 721–747. 141 Poblocki, K. (2001, November). The Napster Network Community. First Monday 6(11): at http://firstmonday.org/issues/issue6_11/poblocki/index.html. Preece, J. (2001). Sociability and Usability: Twenty Years of Chatting Online. Behavior and Information Technology Journal, 20 (5), 347–56. Priedhorsky, R, Chen, J, Lam, S.K, Panciera, K, Terveen, L. and Riedl, J. (2007). Creating, Destroying, and Restoring Value in Wikipedia. Sanibel Island, Florida: Proceedings of the 2007 international ACM conference on Supporting group work, ACM. Ren,Y, Kraut, R. and Kiesler, S. (2007) Applying Common Identity and Bond Theory to Design of Online Communities. Organization Studies, 28(3), 377–408. Ren, Y, Harper, F. M, Drenner, S, Terveen, L, Kiesler, S, Riedl, J. and Kraut, R. (2010) Increasing Attachment to Online Communities: Evidence-Based Design. MIS Quarterly (Under review). Ren, Y., J. Chen, J. Riedl. (2011). The Impact and Evolution of Group Diversity on Online Communities. Richardson. C.R, Buis, L.R, Janney, A.W, Goodrich. D.E, Sen, A, Hess, M.L, Mehari1, K.S, Fortlage, L. A. and Resnick, P.J. (2010). An Online Community Improves Adherence in an Internet-Mediated Walking Program. Part 1: Results of a Randomized Controlled Trial. Journal of Medical Internet Research. Sarkar, C., Wohn, D. Y., Lampe, C., and DeMaagd, K. (2012). A Quantitative Explanation of Governance in an Online Peer-Production Community. Proceedings on CHI2012, ACM Press, 2939-2942. Skinner, B. F. (1953). Science and Human Behavior. New York: MacMillan. Skinner, B. F. (1957). Verbal Behavior. New York: Appleton-Century-Crofts. Smith, T, Smith, B. and Ryan, M.AK. (2003). Survival Analysis Using Cox Proportional Hazards Modeling For Single And Multiple Event Time Data. San Diego, CA. Strauss, A. and Corbin, J. (1998). Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. 
Thousand Oaks, CA: Sage Publications Inc. Stewart., O, Heights, Y. and Lubensky, D. (2010). Crowdsourcing Participation Inequality: a SCOUT Model for the Enterprise Domain. Proceedings of the ACM SIGKDD Workshop on Human Computation.HCOMP '10. New York: ACM. Takahashi,N. (2000). The Emergence of Generalized Exchange. American Journal of Sociology, 105 (4), 1,105–1,134. 142 Therneau, T. M. and P. M. Grambsch. (2000). Modeling Survival Data: Extending the Cox Model. New York:Springer. Toral, S, Martinez-Torres, M. R, Barrero, F. and Cortes, F. (2009). An Empirical Study of the Driving Forces Behind Online Communities. Internet Research, 19(4), 378-392. Velasquez, A, Lampe, C, Wash, R. and Bjornrud, T. (2013, March 16). Latent Users in an Online User-Generated Content Community. Computer Supported Cooperative Work. DOI 10.1007/s10606-013-9188-4. Walther, J. B. (1996). Computer-Mediated Communication: Impersonal, Interpersonal, and Hyperpersonal Interaction. Communication Research, 23, 3-43. Wang, Y, Kraut, R. and Levine, J.M. (2012). To Stay or Leave? The Relationship of Emotional and Informational Support to Commitment in Online Health Support Groups. Presented at CSCW 2012. Whittaker, S, Terveen, L, Hill, W. and Cherny, L. (1998). The Dynamics of Mass Interaction. Proceedings of CSCW 1998, ACM Press, 257-264. Williams, D. (2006). Groups and Goblins: The Social and Civic Impact of an Online Game. Journal of Broadcasting & Electronic Media, 50(4), 651-670. Wenger, E. (2001). Supporting Communities of Practice: a survey of community-oriented technologies. Retrieved on January 2, 2012 from http://www.ewenger.com/tech Yang, J, Wei, X, Ackerman, M.S. and Adamic, L.A. (2010). Activity Lifespan: An Analysis of User Survival Patterns in Online Knowledge Sharing Communities. Presented at the Fourth International AAAI Conference on Weblogs and Social Media. Zhang, G. (2012, April). Community: Issues, Definitions, and Operationalization on the Web. Presented at WWW 2012. 
Zhang, X. and Zhu, F. (2006). Intrinsic Motivation of Open Content Contributors: the Case of Wikipedia. 143