SENTIMENT MAPPING: POINT PATTERN ANALYSIS OF SENTIMENT CLASSIFIED TWITTER DATA By Kenneth Camacho A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Geography – Master of Science 2020 ABSTRACT SENTIMENT MAPPING: POINT PATTERN ANALYSIS OF SENTIMENT CLASSIFIED TWITTER DATA By Kenneth Camacho Varieties of sentiment analysis and point pattern analysis are being applied to social media data to address a broad range of questions, but they are rarely used in tandem. This study outlines a methodology that combines these two approaches to analyze the spatial distribution of sentiment classified opinions from social media data. Twitter postings on natural gas were downloaded and classified using a variety of sentiment analysis methods into positive, negative, and neutral categories. The classifications were then converted into spatial points using the location data associated with the tweets, whereby point pattern analysis techniques were applied to the points to examine the patterns of positive and negative tweet locations with respect to a background rate of neutral tweets across the contiguous Unites States. Basic temporal visualizations were also constructed to explore the variations in sentiment over time. Considerations are discussed on the accuracy limitations of sentiment analysis and the potential for a variety of applications using these techniques. With careful implementation, this methodology can open the door to a range of spatiotemporal analyses of social media sentiment. ACKNOWLEDGEMENTS I am indebted to all who supported me in the writing of this Thesis. This includes my advisor, Dr. Raechel Portelli, and my other committee members, Dr. Ashton Shortridge and Dr. Bruno Takahashi, whose guidance and curiosity were essential in shaping this document. I also extend my gratitude to Dr. Nathan Moore and Dr. Shiyuan Zhong for their mentoring and friendly optimism throughout my stay in the department. I thank Pietro Sciusco, who gave his time to serve as the secondary coder of the manually classified tweets. I would also like to express thanks to him and the other Lunch Buddies for their camaraderie as we navigated our way through graduate school together. iii TABLE OF CONTENTS LIST OF TABLES .......................................................................................................................... v LIST OF FIGURES ....................................................................................................................... vi 1. INTRODUCTION ...................................................................................................................... 1 From Volunteered Geographic Information to Geospatial Twitter Mining................................ 1 Twitter Mining Human Sentiment .............................................................................................. 2 Geospatial Twitter Mining .......................................................................................................... 5 Public Health ........................................................................................................................... 5 Environmental Issues .............................................................................................................. 6 Sentiment-Related Topics ....................................................................................................... 
7 Geospatial Sentiment Analysis ................................................................................................... 9 Developing a Methodology for Sentiment Mapping ................................................................ 11 2. METHODS ............................................................................................................................... 12 Data Collection ......................................................................................................................... 14 Location Analysis ..................................................................................................................... 16 Sentiment Analysis ................................................................................................................... 20 Spatial Analysis ........................................................................................................................ 30 3. RESULTS ................................................................................................................................. 35 Tweets in Space ........................................................................................................................ 35 Classification Comparisons ...................................................................................................... 37 Sentiment Cluster Maps ............................................................................................................ 40 Raw Sentiment Maps ................................................................................................................ 45 Exaggerated Sentiment Maps ................................................................................................... 47 Tweets in Time ......................................................................................................................... 50 4. CONCLUSION ......................................................................................................................... 56 REFERENCES ............................................................................................................................. 62 iv LIST OF TABLES Table 2.1. Accuracy level of the tweets following the completion of location analysis. An accuracy score of 1 indicates that the geocoder had the highest possible confidence in the coordinates returned. ......................................................................................................................19 Table 2.2. Example tweets with their manually coded sentiment scores. ......................................22 Table 2.3. The seven machine learning algorithms applied to the tweet text for classification into positive, negative, and neutral categories. .....................................................................................24 Table 2.4. Mean performance results for all classifiers from 50 random samples of training and testing data using a 90/10 train/test split. The ± symbol denotes the size of two standard deviations around each mean value. .............................................................................................26 Table 3.1. Pearson’s correlation coefficients between classifier pairs. All correlations were significant at the p < .05 level with a Bonferroni correction. ........................................................40 v LIST OF FIGURES Figure 2.1. Flowchart of sentiment mapping methodology. ........................................................13 Figure 2.2. 
A kernel density map of California tweets (a) before and (b) after localizing state- level tweets. An apparent clustering of tweets appears at the centroid of the state in (a) as an artifact of state-level geocode returns being assigned to state centroids. ......................................20 Figure 2.3. Classification results for the manually coded subset of 5,000 tweets. ........................21 Figure 2.4. The voting process for creating the ensemble classifier. Each arrow represents a vote, where the majority decides the next “vote” classifier. ...................................................................29 Figure 2.5. An example of reassigning coordinates from state-level accuracy to more local coordinates based on the background rate of neutral tweets. Here, 10 state-level points returned from Geocodio located at (a) the centroid of Texas are moved to (b) locations placed probabilistically on a kernel density surface derived from neutral tweets. ....................................31 Figure 2.6. Spatial scan statistics comparing the clustering of errors from different classifiers to the background rate of neutral tweets. The errors pictured are from three machine learning classifiers, (a) MaxEnt, (b) MNB, and (c) N-SVM. Errors were obtained from 50 cross- validation runs of the classifiers using different sets of training and testing data. Plots were generated from 99 simulations of the spatial scan statistic with a radius of 150 km, p < .01. .....34 Figure 3.1. Kernel density plots for latitude and longitude alongside a kernel density map of tweet locations across the contiguous United States (CONUS) with an accuracy score equal to 1 (n = 121,026). A 50km bandwidth was used for the kernel density map. .....................................36 Figure 3.2. Total classification results for all machine learning algorithms by category. Each square represents 2000 tweets. .......................................................................................................38 Figure 3.3. “Sentiment cluster maps.” Spatial scan statistic results for positive (left) and negative (right) tweets with respect to a background rate of neutral tweets for all classifiers. The spatial scan was completed with a 150 km radius and 99 simulations, p < .01. ......................................41 Figure 3.4. Khat difference between positive and negative tweets classified by the ensemble classifier. Envelopes represent p = .05 confidence intervals from 19 simulations. Khat difference values above 0 indicate greater clustering of negative tweets at the distances shown. .................44 Figure 3.5. “Raw sentiment maps.” Kernel density difference maps from four different classifiers computed by subtracting the kernel density map of negative sentiment (green) from vi that of positive sentiment (orange). Values above zero indicate a higher density of positive natural gas tweets. ..........................................................................................................................45 Figure 3.6. “Exaggerated sentiment maps.” The difference between the square roots of the density of positively and negatively classified tweets. ..................................................................49 Figure 3.7. Density plot showing the frequency of positive, negative, and neutral tweets over time using four different classifiers. The three most prominent peaks occur on May 30th, September 8th, and December 8th. The bandwidth is set to 7.5 days. 
.........................................52 Figure 3.8. Monthly sentiment cluster maps. Created using a spatial scan statistic with data grouped by monthly time slices, using the ensemble classifier. The scan shows elevated levels of positive sentiment (left) and negative sentiment (right) relative to the background rate of neutral sentiment. Each scan was completed with a 150 km radius and 99 simulations, p < .01. .............54 Figure 3.9. Monthly raw sentiment maps. Kernel density difference plots for each month in the data set, using the ensemble classifier. Orange areas have a higher density of positive tweets while green areas have a higher density of negative tweets. ........................................................55 vii 1. INTRODUCTION From Volunteered Geographic Information to Geospatial Twitter Mining Ever-expanding in their response to new technologies, the contributions made by volunteered geographic information (VGI) have transformed geography. Data collection has become informalized to the point where citizens are now the primary data collectors for many studies, in fields as disparate as ornithology, epidemiology, and forest pathology (Connors, Lei, & Kelly, 2012). There is a new appreciation for the collective capacity of ordinary people to provide data for researchers to take in, analyze, and share as results. In a review of volunteered geographic information, Goodchild (2007) describes the people contributing this information as sensors. This mindset is useful when considering volunteered information as it implies that researchers should observe the normal considerations that are used for other geographic sensors such as seismographs and weather stations; these considerations include how the sensors are distributed in space, their precision and accuracy, how often they report readings, how to calibrate them, and so on. At the time, Goodchild was referring primarily to the uses of Wikimapia, Flikr, Google Earth, and OpenStreetMaps, but today the literature has expanded to include the immense depths of data available from social media platforms such as Twitter, a microblogging platform with over 300 million active users (The Statistics Portal, 2018). Few papers take the “citizens as sensors” view more literally than #Earthquake, an article which demonstrated that posts about an earthquake on Twitter could make a beneficial supplement to other spatial VGI sources such as the USGS ‘Did You Feel It?’ website. The paper analyzed the reaction time of Twitter users to an earthquake in order to understand where 1 and when the shaking was felt. The paper also stands testament to the scale of Twitter. Despite having access to only 1% of tweets, the authors were able to identify over 100 geolocated earthquake tweets posted within two minutes of the event and nearly 1,000 within five minutes (Crooks, Croitoru, Stefanidis, & Radzikowski, 2013). If Twitter users are data collecting sensors, what are they capable of sensing? Evidently, their capacities go well beyond those of a seismograph. Users are sensing the experiences of other people and sharing their own lived experiences which, rather than being limited to the recording of events, emphasize sentiments that reflect their engagement with the world. 
The real value of Twitter users as sensors is not likely to be found in their ability to detect readily observable environmental phenomena, which other technologies and platforms are better designed for, but instead in their ability to sense and report on the human social environment. Twitter Mining Human Sentiment There are a growing number of disparate researchers exploring new ways to analyze social media posts and interpret them in ways that reveal something meaningful about the nondigital, human world. Broadly speaking, their papers examine different expressions of human sentiment; that is, the opinions or feelings expressed by people in general or with regard to a specific topic. The study of sentiment can be limited by means of geography, time period, demographic group, and topic to produce answers to different research questions. Opinion polling is a popular example of measuring broad human sentiment that has been hugely impactful on American politics and other areas of life (Erikson & Tedin, 2001; Herbst, 1995). In a less direct way, Twitter sentiment can address similar topics to opinion polling as well as a host of other issues. A small sample of this work will be explored in this section. 2 Twitter provides an imperfect window into human psychology that, with enough data mining, has the potential to offer insights on human social dynamics that other sources of data cannot easily provide. Early in its lifespan, Twitter was established as a platform that could be used to monitor the emotional states of its user base, with researchers tracking the changing moods of Twitter users over time (Roberts, Roach, Johnson, Guthrie, & Harabagiu, 2012). Emotions on the website have been found to track real-world social, cultural, and political events (Bollen, Mao, & Pepe, 2011). Others have used this mood data to pursue a direction focused on trends in mental health (Larsen et al., 2015). That the sentiment on Twitter has been demonstrated to respond to time-sensitive real-world phenomena suggests an abundance of other research opportunities as well. In addition to tracking mood, Twitter mining was also quickly adapted for use as a means of studying opinions such as political sentiment. Following the spread of Occupy Wall Street protests, Twitter data were employed to examine the spread of anticapitalistic ideas and learn more about the messaging backing them, as well as how members of the movement organized themselves (Conover et al., 2013). A number of studies also came out in response to the Brexit referendum. Some attempted to use Twitter to predict the outcome of the voting while others sought to explore links between political statements on twitter and general opinion on political issues (Freitas et al., 2016). While the latter exploration of opinions seems to be an appropriate use of Twitter data, election predictions may fall outside of what is currently feasible. Gayo- Avello (2012) provided a cautionary tale concerning the use of Twitter posts to make predictions about political outcomes. At the start of the decade, it was common to see publications and conference speakers claiming an ability to predict election results using Twitter data. These results were short-lived, however, with few researchers able to make reliable predictions about 3 future elections. 
Gayo-Avello explains how the promise shown by these early studies was a consequence of faulty assumptions, biased data, and the file-drawer effect, which is the tendency of researchers to abstain from reporting negative results (Fanelli, 2010). An important lesson can be learned from this and other cautionary tales found in the literature. When generalizing data on public sentiment, compare apples to apples only – Twitter does not necessarily represent the true average of the public’s feelings on a given topic. There is a selection bias inherent in using Twitter data in that Twitter posts are made by people who find it prudent to share their opinion on potentially contentious topics publicly. Thus, the demographics of Twitter users do not align neatly with US demographics. By margins of 10% or greater, users are higher income, younger, and more highly educated than the general population. Politically, 60% of users lean Democratic (compared to 52% in the US population) and 35% lean Republican (compared to 43% in the US population). Further, 10% of US tweeters are responsible for 80% of US tweets, skewing the results further. These top 10% of users are 65% female and nearly four times more likely to post political content than other users (Wojcik & Hughes, 2019). It can make sense to compare sentiments of Twitter users in one geographic area to those in another area, but to make assertions about the general public from these demographics without due diligence is careless. That said, there are still researchers who try. For example, by analyzing public mood through tweets, Bollen & Mao (2011) attempted to predict changes in the stock market. Like many others that tried to use Twitter as a true mirror of the nondigital world, this research did not provide long-term success. While many of these studies show promise in analyzing the opinions and moods expressed on Twitter, they do not take advantage of all the dimensions of the data. Twitter users 4 are sentiment sensors of the population that are located somewhere in space. The distribution of these users is also open to analysis and may be just as significant as the sentiment they are expressing. Geospatial Twitter Mining A multitude of studies demonstrate that new insights can be obtained by applying a geographic lens to Twitter data. These studies range in topic but are generally congregated in the social sciences. Lessons learned from the areas of public health, environmental analysis, and sentiment-related approaches will be covered in this section. Public Health From its beginnings, one of the prominent uses of geospatial Twitter data has been the analysis of the spread of infectious diseases across time and space. Lessons can be drawn from this research that are worth repeating here. Early work using Twitter to identify influenza outbreaks began around 2010, initially focusing on the methods of classifying tweets and constructing accurate models that could match with CDC reports of influenza outbreaks (Achrekar, Gandhe, Lazarus, Yu, & Liu, 2011; Aramaki, Maskawa, & Morita, 2011; Culotta, 2010). The flu was an exciting spatiotemporal issue to explore at the time using social media trends, in part fueled by the early success of Google’s Flu Trends service, a web-based tool that used Google search queries as a method of early disease detection (Carneiro & Mylonakis, 2009). It is prudent to highlight here that this success did not last. 
Google Flu consistently overpredicted flu outbreaks, resulting in a service that did little to improve on existing tools for tracking and predicting the spread of the flu. The story has become something of a parable for 5 the limitations of big data, primarily when used in fields where there are already well-established methods for fulfilling similar tasks. Big data should prove to be most useful in areas where it can touch on novel topics rather than areas that have already been thoroughly developed (Lazer, Kennedy, King, & Vespignani, 2014). Since the shutdown of Google Flu trends in 2015, the social media big data space surrounding disease appears noticeably different. Some have changed course to focus on related but perhaps more appropriate topics for social media investigation, such as the public’s awareness of influenza outbreaks (Smith, Broniatowski, Paul, & Dredze, 2015). Other researchers have chosen to focus on alternative diseases, such as heart disease (Eichstaedt et al., 2015) or Zika (McGough, Brownstein, Hawkins, & Santillana, 2017), while others have given their attention to the social side of disease outbreak and public health more broadly (Daughton, Paul, & Chunara, 2018; Karami, Dahl, Turner-McGrievy, Kharrazi, & Shaw, 2018; Lee et al., 2016). Environmental Issues Twitter data is often used during and after environmental calamities to foster situational awareness and understand the human response to these events. Already mentioned is the use of #Earthquake (Crooks et al., 2013), which used humans as sensors of a natural disaster. Other researchers have used Twitter to examine human responses to earthquake events (Vo & Collier, 2013) as well as wildfires (Z. Wang & Ye, 2016), focusing on the emotional reaction and content of the discourse surrounding these events, similar in spirit to some of the sentiment-related topics discussed above. Other natural disasters have been analyzed as well. Using a combination of tweet content and analysis of the movement patterns before, during, and after hurricane events 6 (as gleaned from the same tweets), researchers can learn information about behavioral responses to natural disasters (Stowe, Anderson, Palmer, Palen, & Anderson, 2018). This information can be used for a variety of purposes, from the behavioral sciences to the disaster response planning of emergency services. Some work has been done in analyzing Twitter data to learn about environment-related perspectives across space, but this research area is largely unexplored. One example is a national study on fracking perspectives in the US. The authors analyzed tweets from a variety of stakeholders to determine how perspectives were distributed spatially and which users were more likely to have their ideas diffused across the platform. The authors of this study also made an essential point regarding environmental perspectives: a user will probably be less likely to share their viewpoint if they feel that their perspective is the dominant one in their area, or if the issue no longer pertains to their geographic location. For example, the authors found far less discussion on fracking in New York, where the practice has been banned, than in California, where active policy debate on the subject occurred during the study period (Sharag-eldin, Ye, & Spitzberg, 2018). Sentiment-Related Topics Less studied geospatially are the subjects that are unique to humans and are expressed primarily through language: emotions, personal preferences, and opinions. 
Being a platform that reflects the collective sentiment of individuals, Twitter has the potential to provide a plethora of insights on these topics so long as appropriate analyses are run. In terms of the effort and financial inputs needed to run the research, it is an inexpensive means of comparing geographic trends over broad areas. The added value can be made clear by briefly returning to research on 7 Brexit sentiment. A recent study used tweets to examine the global distribution of “stay” and “leave” sentiments during the time of the voting period, allowing for comparisons across countries inside and outside the European Union (Agarwal, Singh, & Toshniwal, 2018), and demonstrating the value that can be added when geospatial components are incorporated into data analysis. An opinion poll covering the same countries would have been an expensive exercise. A range of studies exist that can shed light on the breadth of possibility in geospatial analysis on more human topics, but they also highlight the disconnectedness of the research. Many of these studies stand alone, with no clear predecessor or follow-up. One study examined the connectivity of users online with comparison to their physical location in the world. Perhaps unsurprisingly, distinct geographic and cultural groups have developed on Twitter, a platform that is aspatial (Kulshrestha, Kooti, Nikravesh, & Gummadi, 2012). Some authors have examined the spatial distribution of music tastes, developing clusters of similar artists and analyzing local trends based on network interactions and the spatial distribution of listening trends (Hauger & Schedl, 2014). Some focus on extracting twitter data over time from a single location. The instrument EmoTwitter detects and visualizes Twitter discussions taking place near a given location over time (Kobayashi, Mozgovoy, & Munezero, 2016). A more prominent example tracked the spatial spread of ideas across nearly six million tweets, finding that comparatively small clusters of tweets played a vital role in the diffusion of discussion (Ardon et al., 2011). However, even this more prominent study did not have an obvious follow-up. It is not entirely clear why using geospatial twitter data for these purposes tends to have minimal impact on the literature. Perhaps the studies are too novel and disconnected, or there is a lack of interest or trust in sourcing human data from social media platforms. 8 Geospatial Sentiment Analysis The aforementioned studies rely primarily on simple, straightforward analyses such as the presence of certain hashtags to determine the sentiment of tweets. More complex methods for determining the sentiment of a piece of text are often accomplished using techniques from sentiment analysis, a field of natural language processing specializing in extracting sentiment from a body of text. The most common methods in sentiment analysis either compare the words in the text to a lexicon of words with known sentiment or use machine learning with a body of manually classified training data to determine the sentiment (Liu, 2012). Using sentiment analysis to analyze tweets rather than more simplistic methods introduces additional uncertainty into the analysis – the highest accuracy of these methods tends to lie in the mid 80% range (Saif, He, Fernandez, & Alani, 2016). However, utilizing sentiment analysis to classify tweets provides many new opportunities for deeper analysis, particularly when it is used to inform geographic analyses. 
The literature combining sentiment analysis of tweets with geographical analysis is extremely sparse, but some examples do exist. A 2013 study using geotagged tweets analyzed the spatial distribution of Twitter sentiment in New York City to determine areas with positive and negative sentiment (Bertrand, Bialik, Virdee, Gros, & Bar-Yam, 2013). The tweets were classified with a machine learning approach and the resulting maps offer an exploratory means of visualizing mood at the census block level. The study excelled in producing high resolution maps of the mood across New York City over the data collection period and tying these sentiments to real-world features, though the accuracy of the sentiment analysis is questionable, and the geographic analysis was very limited. Another example of visualizing emotions through sentiment analysis is the We Feel project by Larsen et al. (2015). The sentiment analysis in this 9 case was performed with a lexicon-based method on tweets globally, aggregating the data at the national scale to compare the results to mental health indices of different nations. Temporal analyses were also conducted to examine the rates of different emotions over time. A tool entitled GeoSentiment was developed to track the sentiment response to events on Twitter interactively across different locations (Pino, Kavasidis, & Spampinato, 2016). Although the concept is promising, the methodology is unclear on the sentiment analysis techniques and only geotagged tweets were used, which excludes the vast majority of Twitter posts. One interesting component is the use of kernel density maps to display the density of tweets across the area of study. A final paper worth mentioning used twitter data to analyze the political sentiment in different areas of East Java, Indonesia using a machine learning approach to the sentiment analysis (Fahrur Rozi et al., 2018). Again, the concept is exciting, but the data set was quite small, there are significant issues with the sentiment analysis, and the geographic component of the data was only used for visualization purposes. Common threads tie these studies together. They often succeed in grounding their findings with reference to real-world phenomena, which is key for building credibility in Twitter data. Although they use primarily geotagged tweets, which are accurate but limit the sample size, the geographic component of the data is used only for aggregation and/or visualization rather than any substantive analysis. In addition, the sentiment analysis is consistently underdeveloped. Where the methods are clear enough to interpret, only one method of classification was applied to the data, and normally toward the classification of moods. Although geographic mood data is useful, there is potential for far more expansive applications of sentiment analysis. These studies point in the direction of a powerful new application of Twitter data without fully realizing its extent. With a more thorough exploration of sentiment analysis, truer and more 10 meaningful sentiment can be extracted from the data. With a more developed geographic analysis, patterns and trends can be explored in a more substantive and quantifiable manner. For the latter, there is potential to draw from point pattern analysis, a geographic toolkit for analyzing the distribution of points that is often applied in the fields of epidemiology and ecology (Gatrell, Bailey, Diggle, & Rowlingson, 1996; Velázquez, Martínez, Getzin, Moloney, & Wiegand, 2016). 
Like diseases or living things, the distribution of tweets of varying sentiment can be quantified and explored spatially to learn meaningful things about their spread and location in space. What remains is to implement these techniques. Developing a Methodology for Sentiment Mapping At this time, the research surrounding Twitter mining is unconnected and without any standardized methodology. Despite its already broad applications, there is considerable room for the use of Twitter data to be expanded into new fields and analyzed spatially with point pattern analysis to answer new questions or revisit old inquiries, especially toward the study of human systems such as healthcare, economics, politics, psychology, and human-environment issues. Like any area of big data, the results of Twitter mining should be interpreted with caution; Twitter users are not ambassadors for humanity as a whole. Many exciting projects have started well, only to fail later (Gayo-Avello, 2012; Lazer, Kennedy, King, & Vespignani, 2014), highlighting the need for prudence. However, Twitter mining shows promise as a useful tool for understanding mass human perceptions, behaviors, and experiences – a side of research that has been underexplored. With methods to analyze this sentiment used in tandem with point pattern analysis techniques, a new space can be created for the spatiotemporal interpretation of social media sentiment. In this paper, potential methodology is outlined for these exploratory techniques, the result of which is dubbed sentiment mapping. 11 2. METHODS A summary of the methods outlined in this Methods section is visualized in Figure 2.1. Like the bolded headings in the figure, the section is divided into four main parts: data collection, location analysis, sentiment analysis, and spatial analysis. The methodology is presented as a linear sequence of steps but experimentation across all stages of the methods should happen concurrently. For example, experimentation with sentiment analysis should begin as soon as the first sets of tweets are downloaded. To do otherwise is to risk months of data collection on a subject that may be difficult or impossible to classify with the sentiment analysis techniques discussed below. In the case of this research, the topic that was chosen as a demonstration of the spatial analysis of Twitter sentiment is natural gas. This topic has been explored previously, but only at the state level, and without the use of point pattern analysis or algorithm-assisted sentiment analysis (Sharag-eldin et al., 2018). A theme of these methods is the cycle of automated work with human validation and modification. The majority of the process is handled through scripting, but every step requires careful investigation of the results to avoid errors, find patterns, rework assumptions, choose the best algorithms, and so on. Sentiment mapping is therefore a subjective process, making it paramount that methods be as clear as possible about intermediate results and the decisions that led to the final maps. Further, given that the process involves interdisciplinary methods and that the results of the research will be more applicable to social scientists than computer sciences, it seems sensible to work with approaches, as much as they are available, that do not require a deep understanding of machine learning algorithms or spatial statistics. 12 Figure 2.1. Flowchart of sentiment mapping methodology. 
Data Collection

Tweets were downloaded and stored over an eight-month period from June 2019 to January 2020. The data were retrieved via requests to Twitter's Standard API, available to all Twitter developer accounts, which provides access to a sample of 1% of tweets from the previous 6-9 days (Twitter, 2020). Over the data collection period, tweets were downloaded in batches once per week with the programming language R (R Core Team, 2020) and with the assistance of the "rtweet" package (Kearney, 2019). The package conveniently formats the returned data as a data table, which is easy to manipulate and export from R. Data were stored locally and backed up with OneDrive's cloud storage service. Requests to Twitter's Standard API require the name of the account, a consumer key, and an access token. With these, a set of key terms can be passed through as a request along with a number of other parameters, such as a limit on the number of tweets retrieved, the language of the text, and whether or not to include retweets or media items. For this analysis, tweets were limited to the English language and retweets were not retrieved. Retweets were ignored to avoid exceeding the maximum number of allowable requests and because retweets are not necessarily signals of agreement – with some additional text accompanying a retweet, the retweet can either support or oppose the sentiment of the original tweet, creating problems for the sentiment analysis phase of the methodology. In short, a retweet is not equivalent to "I agree." The key terms passed to the API were prominent nonrenewable electricity production methods including "petroleum", "natural gas", "coal", and "nuclear energy". These terms were intentionally broad in order to capture the widest range of the conversation on the national scale. Studies on a more local scale or of a more specific phenomenon would benefit from more specific search terms. By examining the returned tweets across these different topics, it was decided that the focus of this study would be tweets containing the key words "natural gas". Out of the range of topics examined, these tweets were the most focused, with discussions revolving almost exclusively around natural gas energy production. They were also the most balanced, with a similar amount of positive and negative sentiment on display toward natural gas. In total, 366,353 tweets containing the phrase "natural gas" were collected over the eight-month period. For each request, the API returns 88 fields for every tweet available that matches the key terms given. Although all fields were saved, the analyses performed were completed using only seven dimensions:
1) created_at, a datetime field that contains the timestamp of each tweet, necessary for temporal analysis,
2) screen_name, the field that contains the username of the poster, useful for filtering out bots and finding other data quality issues,
3) text, containing the text content of each tweet,
4) location, a user-entered field from the profile wherein users describe their location,
5) geo_coords, the latitude and longitude coordinates from where the tweet was posted, available only if the tweet was geo-tagged,
6) place_full_name, which has the full name of the Twitter Place of the tweet, which similarly is only available if provided with the tweet, and
7) lang, a language field used to filter for tweets in the English language.
Fields 4-6 were used to ascribe a location to each tweet, as will be described in more detail in the next section.
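For readers who want a concrete picture of the request involved, the sketch below shows an equivalent call in Python against the version 1.1 standard search endpoint that rtweet wraps. This is an illustration only: the data collection in this study was done in R with rtweet, and the endpoint, authentication style, and parameter names shown here are assumptions based on the standard API of that period rather than details taken from this work.

import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
BEARER_TOKEN = "..."  # hypothetical credential from a Twitter developer account

params = {
    "q": '"natural gas" -filter:retweets',  # exact phrase, exclude retweets
    "lang": "en",                           # English-language tweets only
    "count": 100,                           # maximum page size for this endpoint
    "tweet_mode": "extended",               # return untruncated tweet text
}

resp = requests.get(SEARCH_URL, params=params,
                    headers={"Authorization": f"Bearer {BEARER_TOKEN}"})
resp.raise_for_status()

for tweet in resp.json()["statuses"]:
    # Fields analogous to those retained in this study: timestamp, username,
    # profile location, and Twitter Place.
    print(tweet["created_at"], tweet["user"]["screen_name"],
          tweet["user"].get("location"), tweet.get("place"))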
Following a filter for tweets in the English language only, a total of 283,297 tweets remained for further analysis, totaling approximately 111 MB when stored in a single comma-delimited file. Of these, 217,746 had information in one or more of the location, geo_coords, or place_full_name fields, allowing geographic coordinates to be assigned to these tweets. From there, a random subset of 5,000 of these tweets was set aside for preliminary location analysis and the training and parameterization of the different sentiment analysis techniques. This 5,000-tweet sample will hereafter be referred to simply as the sample. In addition to its foundational necessity for location and sentiment analysis, the sample also serves a key role in evaluating the final spatial analysis and map of sentiment. Although the sample should not be expected to align neatly with the final analysis, it can be used as a sanity check. The relative ratio of positive to negative to neutral sentiment, clustering of the data, and geographic trends can be expected to reflect some of the qualities of the sample.

Location Analysis

The majority of tweets are not geotagged, and therefore do not provide direct coordinate information. One study found that only 1.5% of tweets have a geotagged location associated with them (Zandbergen & Barbeau, 2011). More analytic methods must then be applied to the data to infer the location from which a user is posting, using both the location and place_full_name fields. However, it cannot be assumed that these fields are always accurate. A 2011 study estimated that only 64% of users input accurate location information in their profile, though accuracy level (city versus state level, for example) can vary (Hecht, Hong, Suh, & Chi, 2011). See Ikawa, Vukovic, Rogstadius, & Murakami (2013) for an example of an evaluation of these different location accuracy levels and their effects on the spatial precision of tweets. The takeaway message is that assigning coordinates to tweets can be an inaccurate process if care is not taken to evaluate geocoding returns carefully and remove erroneous results. The process of assigning coordinates to tweets was carried out in R software with assistance from the "rgeocodio" package, designed to assist with API connections to the forward geocoding service Geocodio (Rudis & Thompson, 2018). The initial geocoding runs were performed on the 5,000-tweet sample in order to evaluate the returns from Geocodio. In response to text cleaned and submitted from the location and place_full_name fields, the service returned not only the estimated latitude and longitude for each location but also an approximated accuracy score ranging from zero to one, with one being the most accurate; the accuracy level of a location, be it rooftop level, street level, a place (such as a city or zip code), or state; and the source of the data used to find the coordinates. The most common data source is the TIGER/Line dataset from the US Census Bureau, indicating the majority of returned coordinates are centroids of US Census tracts, an accuracy level that is more than sufficient when analyzing spatial trends at the national scale. Several lessons were learned from the initial geocoding runs performed on the sample. First, only 3,508 out of the 5,000 tweets were successfully geocoded – and these were tweets that had already been filtered to ensure there was text present in one of the location fields.
This indicates that much of the information users input into the location fields in their user profiles is inscrutable as location data. Second, out of the records with coordinates returned, 2,668 had an accuracy score of one, indicating the highest possible accuracy return from Geocodio. After going through the returns with an accuracy below one, it was established that only tweets with the highest accuracy score would be utilized in the final geographic analyses. This was decided upon after seeing the high frequency of errors present in lower accuracy levels. With the filtering out of unsuccessfully geocoded locations and accuracies below one, approximately half of the tweets with information in a location field were retained for geographic analysis. Third, after evaluating each record by comparing the user's location to the coordinates returned by Geocodio, it was discovered that even some of the returns with accuracy scores of one were spatially inaccurate. For example, all instances of the location "Earth" were assigned to the small town of Earth, Texas with high confidence despite it being unlikely that any given user was actually from that location. These falsely accurate errors were commonly associated with a relatively small list of locations, many of which are located outside of the United States. For the final geocoding of the full dataset, places containing the following names were removed: China, Ontario, England, Mexico, Ireland, Laguna, Russia, Turkey, Ottawa, Alberta, Columbia, Earth, Norway, Delhi, Arab, Edmonton, and Nottingham. Locations referencing Silicon Valley also created some inaccurate coordinates, so for the full dataset, instances of "Silicon Valley" were replaced with "San Jose" before geocoding. With this information gleaned from the subset, the process for assigning coordinates to the full dataset was developed. First, the location field was replaced by the place_full_name field where it was available, because unlike location it is selected from an established list of Twitter locations and is thus easier to geocode. Second, the listed names above were altered or removed from the location field because of the accuracy issues they presented. Third, the location text of the tweets was uploaded to the Geocodio servers as a CSV file, with the coordinates and accompanying information returned after approximately one hour. Fourth, the results were examined to find and correct any common errors in accuracy scores. Fifth, the coordinates were overwritten for tweets with coordinates in the geo_coords field, their accuracy score was updated to one, and their accuracy level was reassigned to "coordinates". Sixth, tweets without any coordinate information were removed from the dataset. This process resulted in 160,768 tweets with assigned coordinates. According to the Geocodio return, dozens of sources were used to geocode these coordinates, with the most common being the TIGER/Line dataset from the US Census Bureau, which was used to geolocate 135,766 of the tweets. The remaining tweets were mainly geocoded using state- or city-level datasets. As can be seen in Table 2.1, the overwhelming majority of the points are at the accuracy level of place, which is not precise enough for any local analysis but will suffice for analyzing national trends. Aside from tweets at the state accuracy level, the remaining categories are more than precise enough for analysis at a broad scale.
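A minimal sketch of the cleaning and filtering rules just described is shown below, written in Python with pandas purely for illustration (the thesis carried these steps out in R with the rgeocodio package). The column names and the exact matching logic are assumptions; only the list of problem place names, the Silicon Valley substitution, the geotag override, and the accuracy-equals-one filter come from the text.

import pandas as pd

BAD_NAMES = ["China", "Ontario", "England", "Mexico", "Ireland", "Laguna",
             "Russia", "Turkey", "Ottawa", "Alberta", "Columbia", "Earth",
             "Norway", "Delhi", "Arab", "Edmonton", "Nottingham"]

def prepare_locations(tweets: pd.DataFrame) -> pd.DataFrame:
    """Clean the free-text location field before sending it to the geocoder."""
    df = tweets.copy()
    # Prefer the structured Twitter Place name over the free-text profile location.
    df["location"] = df["place_full_name"].fillna(df["location"])
    # Drop locations containing names that produced falsely confident geocodes.
    df = df[~df["location"].str.contains("|".join(BAD_NAMES), case=False, na=False)]
    # Rewrite "Silicon Valley" to a name the geocoder places correctly.
    df["location"] = df["location"].str.replace("Silicon Valley", "San Jose", regex=False)
    return df

def filter_geocoded(geocoded: pd.DataFrame) -> pd.DataFrame:
    """Apply the geotag override and keep only fully confident geocoder returns."""
    df = geocoded.copy()
    has_geotag = df["geo_lat"].notna()  # tweets that carried native coordinates
    df.loc[has_geotag, "lat"] = df.loc[has_geotag, "geo_lat"]
    df.loc[has_geotag, "lng"] = df.loc[has_geotag, "geo_lng"]
    df.loc[has_geotag, "accuracy"] = 1.0
    df.loc[has_geotag, "accuracy_type"] = "coordinates"
    return df[df["accuracy"] == 1.0]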
The state-level data are too coarse even for the national scale and must either be removed or processed further before they can be utilized in the geographic analysis.

Table 2.1. Accuracy level of the tweets following the completion of location analysis. An accuracy score of 1 indicates that the geocoder had the highest possible confidence in the coordinates returned.

Accuracy level                     Number of tweets    Tweets with accuracy score of 1
Place (cities, zip codes, etc.)    133,233             94,029
State                              21,329              21,329
Coordinates                        3,710               3,710
Street center                      1,224               887
Rooftop                            1,067               1,060
Intersection                       193                 0
Range interpolation                12                  11

In order to retain the state-level data, they must be localized somehow, which is possible only through an imprecise process. If left as state centroids, these tweets can create artifacts in the spatial analysis phase that do not correspond with any real phenomena (see Figure 2.2). The process for localizing the state-level tweets involves knowing the sentiment of each tweet, so it will be explored further in the spatial analysis section.

Figure 2.2. A kernel density map of California tweets (a) before and (b) after localizing state-level tweets. An apparent clustering of tweets appears at the centroid of the state in (a) as an artifact of state-level geocode returns being assigned to state centroids.

Sentiment Analysis

There are a variety of methods to choose from when classifying the sentiment of text. For this analysis, one lexicon-based technique and three machine learning techniques were applied to the data. Analyses were carried out using the Python programming language and two essential libraries: 1) the "Natural Language Toolkit" (NLTK) library (Loper & Bird, 2002), which provides an array of corpuses, preprocessing tools, sentiment classifiers, and other modules for simplifying the sentiment analysis process, and 2) the "scikit-learn" library (Pedregosa, Weiss, & Brucher, 2011), which is a broad machine learning library used for this research to build and evaluate the different machine learning classifiers used for the sentiment analysis. Before performing any classifications using computer algorithms, a set of manually classified tweets is necessary as a benchmark for assessing the accuracy of the classifiers and training the machine learning algorithms. The 5,000-tweet subset was manually classified into three sentiment categories: positive, negative, and neutral, based on each tweet's position toward natural gas. In order to determine the sentiment of each tweet, the question asked was, roughly, "does this text show a preference for the continued use or expansion of natural gas as an energy source?" and the classification was recorded. See Table 2.2 for exemplar classifications. This was a subjective process, prone to misinterpretation and personal bias, so a second reviewer evaluated 200 tweets from the subset. Between the original reviewer and the second reviewer, there was an 88% agreement overall in the classification of tweets. A classification challenge for "natural gas" tweets is their strong bias toward neutral sentiment (see Figure 2.3). In the manually classified subset, only 6.4% of tweets were scored with a positive sentiment and 7.9% of tweets were given a negative sentiment, with the remaining 85.7% of tweets having a neutral sentiment. Non-neutral tweets will have less representation in the training data and will thus be more challenging to classify.
This creates an issue because it is these same non-neutral tweets that are key to visualizing patterns in "natural gas" sentiment spatially. With these biased proportions in the sentiment categories, the assessment of text classification scores will need to pay special attention to the performance of different classifiers with respect to the positive and negative categories, not merely the overall accuracy scores.

Figure 2.3. Classification results for the manually coded subset of 5,000 tweets.

Table 2.2. Example tweets with their manually coded sentiment scores.

Positive: "In the US, renewable natural gas used as a vehicle fuel has displaced over seven million tons of carbon dioxide equivalent (CO2e) over the past five years. Learn more about the benefits of #RNG:"
Positive: "Go green with natural gas and keep some green in your wallet! Triple the efficiency, savings and benefit to the environment."
Positive: "Fracked natural gas has resulted in the US decreasing its CO2 emissions levels drastically since 2007. Any sane energy policy starts with not outlawing the practice."
Negative: "A cursory review of the chilling documentary 'Gas Land' tends to prove that Natural Gas extraction in the United States on a large scale is simply UNTENABLE."
Negative: "Tell USA Governors: Stop Touting Gas as a "Bridge Fuel" and Reject All New Natural Gas Infrastructure"
Negative: "Flaring of natural gas increased by 11% to peak globally this year. A wasteful and harmful practice. Did we consider this increase in our #ClimateChange Models #Environment #ClimateChangeIsNow #ClimateAction #Economy"
Neutral: "This Northstar Village residence features a king bed, kitchenette, washer/dryer and a natural stone gas fireplace."
Neutral: "Asian LNG spot prices remain steady"
Neutral: "Energy officials within the Trump administration referred to natural gas exported by U.S. energy companies as "freedom gas" and "molecules of U.S. freedom" in official statements"

With a set of tweets manually classified, the text in the tweets could then be run through an automated sentiment classification process. Prior to any classification, the text in tweets requires several preprocessing steps that improve sentiment analysis performance. For all analyses, Twitter handles and URLs were removed due to their irrelevance in text classification. Words shorter than three letters were also removed, as they generally do not provide useful sentiment-related information. For the machine learning analyses, all non-letter characters including numbers and punctuation were removed, and the remaining words were converted to lower case and split into individual strings. By the end of the preprocessing phase, each tweet was reduced to a list of its individual words and hashtags, the presence or absence of which were used to train the machine learning classifiers along with the sentiment scores assigned from the manual sentiment analysis. The first sentiment analysis method applied to the data was VADER, an unsupervised lexicon-based technique designed to perform well on social media data such as tweets (Hutto & Gilbert, 2014); its output was used to divide the tweets into positive, negative, and neutral categories. Lexicon-based methods, such as VADER, are the most common approach to sentiment analysis (Liu, 2012).
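Before turning to the details of VADER, a minimal sketch of the preprocessing steps just described is given below. The exact regular expressions are assumptions; the thesis specifies the operations (removing handles, URLs, and non-letter characters, dropping words shorter than three letters, then lowercasing and splitting) but not their implementation.

import re

def preprocess(tweet_text):
    """Reduce a tweet to the list of lowercase words and hashtags used as features."""
    text = re.sub(r"@\w+", " ", tweet_text)        # remove Twitter handles
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = re.sub(r"[^A-Za-z#\s]", " ", text)      # drop numbers, punctuation, other non-letters
    tokens = text.lower().split()                  # lowercase and split into individual strings
    return [t for t in tokens if len(t) >= 3]      # drop words shorter than three letters

print(preprocess("Natural gas prices fell again today @someuser https://t.co/xyz #energy"))
# -> ['natural', 'gas', 'prices', 'fell', 'again', 'today', '#energy']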
The VADER algorithm utilizes a dictionary of words and phrases (with their positivity or negativity attached) along with a set of rules that evaluate the words in each tweet, together with emoticons and contextual clues such as negating words and punctuation, to return scores for each tweet. Individual scores are given for the positive, negative, and neutral categories for each tweet as well as a compound score ranging from -1 to 1, with lower values relating to negative sentiment and higher values relating to positive sentiment. With different cutoff values in the compound score, the tweets can be classified into one of the three categories. The VADER sentiment analysis was performed using the "vader" module included in the NLTK python library (Loper & Bird, 2002). Unfortunately, despite VADER seeming well suited to the "natural gas" tweets, given that they are social media data, its accuracy scores were very low across the positive, negative, and neutral categories. Depending on the threshold set for the cutoff between the categories, accuracy scores would range from approximately 10% to 40% for the positive and negative categories and 30% to 80% for the neutral category, with improvement in the former coming at the expense of the latter. With a specific interest in classifying positive and negative tweets, these accuracy levels were deemed unacceptable, and the VADER results were left unused. Lexicon-based methods designed for general use, even when targeted explicitly at the type of short-form text under analysis, are likely better at detecting the mood of a piece of text than they are at more nuanced tasks such as assessing the sentiment of political views. For a lexicon-based method to successfully classify any single political topic, a new dictionary of lexical features would most likely need to be created that incorporates the language, slogans, irony, and other parts of speech relevant to the topic. With an unsuccessful attempt at using a lexicon-based sentiment classifier, the focus shifted to supervised machine learning approaches. Three different machine learning techniques were applied to the tweets: naïve Bayes, Support Vector Machine (SVM), and logistic regression (also called maximum entropy or MaxEnt), as shown in Table 2.3. Naïve Bayes and SVM are two of the most commonly used classification algorithms in sentiment analysis, with different researchers finding one or the other to be more accurate when applied to Twitter data (Kolchyna, Souza, Treleaven, & Aste, 2015; Pak & Paroubek, 2010). Logistic regression is used less frequently for the classification of Twitter data, though it is not uncommon (Gautam & Yadav, 2014; González-Ibáñez, Muresan, & Wacholder, 2011).

Table 2.3. The seven machine learning algorithms applied to the tweet text for classification into positive, negative, and neutral categories.

Approach                   Algorithm
Naive Bayes                Gaussian (GNB); Multinomial (MNB); Complement (CNB)
Support Vector Machine     C-Support (C-SVM); Linear (L-SVM); Nu-Support (N-SVM)
Logistic Regression        Maximum Entropy (MaxEnt)

The variants of naïve Bayes classifiers used took the forms of Gaussian (GNB), Multinomial (MNB), and Complement (CNB) algorithms, of which the latter two are best suited for text classification. Three Support Vector Machine algorithms were also used, with these being C-Support (C-SVM), Linear (L-SVM), and Nu-Support (N-SVM) classifiers. All machine learning classifiers were implemented using the scikit-learn library (Pedregosa et al., 2011).
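As a rough sketch, the seven algorithms in Table 2.3 can be mapped onto standard scikit-learn classes as shown below. The mapping and the placeholder input variables are assumptions made for illustration; default settings are used here, since the tuned hyperparameters are reported later in this section.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.linear_model import LogisticRegression

# token_lists and labels are assumed to come from the preprocessing and manual
# coding steps described earlier; presence/absence of each word is the feature.
vectorizer = CountVectorizer(binary=True, analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(token_lists)
y = labels  # "positive", "negative", or "neutral"

classifiers = {
    "GNB": GaussianNB(),             # Gaussian naive Bayes
    "MNB": MultinomialNB(),          # multinomial naive Bayes
    "CNB": ComplementNB(),           # complement naive Bayes
    "C-SVM": SVC(),                  # C-support vector classification
    "L-SVM": LinearSVC(),            # linear support vector classification
    "N-SVM": NuSVC(),                # nu-support vector classification
    "MaxEnt": LogisticRegression(),  # logistic regression (maximum entropy)
}

for name, clf in classifiers.items():
    # GaussianNB does not accept sparse input, so densify for that one case.
    clf.fit(X.toarray() if name == "GNB" else X, y)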
Hyperparameters for each classifier were optimized on the data subset using a grid search method. Every combination of a range of values for each parameter was entered into the classifiers to determine which combination returned the highest accuracy scores across a set of train/test data splits. The highest performing parameters were retained. See Table 2.4 for an overview of the resulting performance of each classifier, which includes the overall accuracy as well as the recall and precision for each sentiment category. It is crucial when analyzing the performance of classifiers to look beyond overall accuracy because accuracy alone obfuscates critical information about classifier performance. Recall indicates the percentage of tweets in a particular category that were correctly classified, ignoring false positives. For example, if a dataset has 500 true negative tweets, and a classifier correctly assigns a negative score to 250 of these, its recall score would be 50%. Precision indicates the percentage of classified tweets of a specific category that are correct. To continue the previous example, if 300 total tweets were assigned a negative score by the classifier, and the same 250 were correct, the precision score would be 83%. With the primary goal of the sentiment analysis being the confident categorization of positive and negative tweets, close attention was paid to the recall and precision of non-neutral tweets when optimizing the hyperparameters. This section will now run through the performance and some of the hyperparameters of the classifiers evaluated to find the highest performing option. The numbers reported are the cross-validation results of 50 random sets of train/test splits with 90% of tweets used as training. The GNB classifier performed the poorest of any classifier used. It classified far too many tweets as positive or negative, resulting in low accuracy scores overall and particularly low precision for the positive and negative categories. Adjusting the parameters did little to redeem the overall performance of this classifier, which had an accuracy score of 23% on the subset of tweets. The results of this classifier were retained because they may provide an interesting counterfactual, showing the spatial effects of over-assigning polarized sentiments to tweets. The CNB classifier was expected to perform highest among the naïve Bayes classifiers because it was written specifically to deal with skewed data classes such as those present in the "natural gas" tweets. Additionally, this classifier was designed to excel in classifying text data (Rennie, Shih, Teevan, & Karger, 2003). With an alpha of 0.5 and using two normalizations, the CNB classifier obtained an overall accuracy of 83%, lower than most other classifiers.

Table 2.4. Mean performance results for all classifiers from 50 random samples of training and testing data using a 90/10 train/test split. The ± symbol denotes the size of two standard deviations around each mean value.
26 GNB23.09±0.4187.79±2.728.27±0.1489.78±2.9618.34±1.0115.62±0.4498.19±0.23CNB83.13±0.3950.24±1.2626.53±1.7762.88±2.3841.75±1.8792.61±0.5688.91±0.22MNB87.08±0.1241.56±1.5969.23±3.8550.68±1.6969.77±3.2698.85±0.1587.73±0.08C-SVM87.45±0.1041.61±1.2878.23±3.9752.30±2.1679.32±3.0699.25±0.1187.87±0.11L-SVM87.41±0.1240.99±1.3970.99±4.1153.12±2.1273.70±2.6698.97±0.1387.97±0.09N-SVM87.47±0.1141.23±1.3582.63±3.7251.55±2.2280.37±2.4899.36±0.1087.80±0.11Maxent87.48±0.0731.41±1.2590.04±2.8440.93±1.6086.54±1.5999.74±0.0487.43±0.07Ensemble87.47±0.1140.10±1.3778.65±3.7151.72±1.8383.12±2.8199.20±0.1087.84±0.09PositiveNegativeNeutralRecallAccuracyPrecisionRecallPrecisionRecallPrecision Interestingly, the recall for positive and negative tweets was relatively good, but at the cost of lower precision. Best performing among the naïve Bayes classifiers was the MNB classifier. Like CNB, it is commonly used for text classification tasks (Frank & Bouckaert, 2006). With an alpha of 0.6, its overall accuracy was quite good (87%) and its precision for positive and negative tweets was about 70%, far better than the other naïve Bayes algorithms used. At this point, it appears the MNB classifier may be best suited to represent the naïve Bayes family of machine learning algorithms in the spatial analysis portion of the research. The range of scores among the SVM classifiers used is far smaller than what was observed between the different naïve Bayes classifiers. The overall accuracy, for example, varied by a fraction of a percent, with all returning a mean accuracy score of about 87.5%. The higher performance was unexpected as previous research indicates that naïve Bayes algorithms may be superior at classifying short texts (S. Wang & Manning, 2012). The implementation of these classifiers is based on the LIBSVM implementation of SVM classification tasks (Chang & Lin, 2011). The C-SVM classifier was implemented with C set to 10,000, an RBF kernel, and balanced class weights. L-SVM was implemented with C set to 1, an L2 penalization norm, and a squared hinge loss function. The N-SVM classifier was applied with Nu set to 0.1, a linear kernel, and equal class weights. The N-SVM classifier performed best, showing very similar accuracy and recall when compared to the C-SVM, but with slightly improved precision. With precision around 80% for positive and negative tweets, relatively high confidence can be placed in the N-SVM classification of non-neutral tweets, and it will likely be used to represent the SVM genre of classifiers going forward. 27 The logistic regression classifier (MaxEnt) returned some of the most favorable scores in terms of its overall accuracy (87.5%) and precision (90% for positive tweets and 86.5% for negative tweets). MaxEnt is a discriminative classifier using a logistic function that can be used to classify text based on its word features. It is commonly used for text classification, including sentiment analysis (Jurafsky & Martin, 2019). The classifier was run with C set to 1, an L-BFGS solver, and an L2 penalization norm. Overall, the MaxEnt and SVM classifiers provided the best classification performance for the data subset. To avoid pitfalls that may be present in any of the classifiers individually, one additional classifier was developed as an ensemble of the three different types of machine learning algorithms applied. 
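To make the preceding configuration concrete, the sketch below instantiates the seven classifiers in scikit-learn with the hyperparameters reported above and evaluates one 90/10 split with per-category precision and recall. It is a minimal sketch rather than the thesis code: the TF-IDF features and the texts/labels variables (the manually coded subset) are assumptions made here for illustration, and the thesis repeated this procedure over 50 random splits.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

classifiers = {
    "GNB": GaussianNB(),
    "MNB": MultinomialNB(alpha=0.6),
    "CNB": ComplementNB(alpha=0.5, norm=True),
    "C-SVM": SVC(C=10000, kernel="rbf", class_weight="balanced"),
    "L-SVM": LinearSVC(C=1, penalty="l2", loss="squared_hinge"),
    "N-SVM": NuSVC(nu=0.1, kernel="linear"),
    "MaxEnt": LogisticRegression(C=1, solver="lbfgs", penalty="l2", max_iter=1000),
}

# texts, labels: the manually coded training tweets and their sentiment categories (loaded elsewhere)
X = TfidfVectorizer(min_df=2).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, stratify=labels)

for name, clf in classifiers.items():
    dense = name == "GNB"                      # GaussianNB requires a dense feature matrix
    clf.fit(X_train.toarray() if dense else X_train, y_train)
    pred = clf.predict(X_test.toarray() if dense else X_test)
    print(name)
    print(classification_report(y_test, pred))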
Many methods exist for combining the output of different classifiers into an ensemble classification, with voting being the easiest to understand and most straightforward to implement (Kotsiantis, Zaharakis, & Pintelas, 2006). For this research, the three learning types, naïve Bayes, SVM, and logistic regression, were given one vote each, with the ensemble classifier being assigned whatever the majority vote was, and with three-way ties being assigned a neutral score. The score for each learning type was simply a majority vote of the classifiers within each type. Figure 2.4 visualizes this voting process.

Figure 2.4. The voting process for creating the ensemble classifier. Each arrow represents a vote, where the majority decides the next "vote".

For the data subset, the ensemble model performed well, roughly similar to the N-SVM model. Its real use, however, will come into play when the ensemble classifier is applied to the full dataset of tweets rather than the subset, where there is a higher likelihood of unforeseen flaws emerging as the classifiers extrapolate their learning into new territory. The associations between each classifier were quantified by calculating a Pearson's correlation coefficient on the classification results, ranked -1 for negative, 0 for neutral, and 1 for positive, between all pairs of classifiers with a Bonferroni correction. By examining the correlations between each pair of classifiers, it is possible to approximate how representative each classifier is of the other classifiers. Unsurprisingly, the Ensemble classifier showed the highest average correlations, indicating that it is successfully representing the full range of classifiers. These findings will be discussed further in the results section.

Although these sentiment analysis methods focused on machine learning approaches, it is important to remember that other approaches exist. A lexicon-based approach could potentially perform well if a new dictionary of lexical features was created to match the sentiment of words used in the discourse surrounding natural gas. While this would be a laborious process, manually classifying the training subset of 5,000 tweets was also a time-consuming process. Regardless of the method applied, sentiment analysis is a hands-on task that is likely to consume a large share of the working hours involved in preparing data for sentiment mapping.

Spatial Analysis

With sentiment analysis and location analysis complete, the tweets are now prepared for the final steps of processing. The spatial analysis presented here relies on point pattern analysis techniques and was carried out in R with the use of several spatial packages: "rgdal" (Roger Bivand et al., 2015), "spatstat" (Baddeley & Turner, 2005), "splancs" (R. Bivand et al., 2017), "sp" (Pebesma & Bivand, 2013), and "raster" (Hijmans et al., 2015). All spatial processing was performed in the USA Contiguous Albers Equal Area Conic projection, using only the tweets with an accuracy score returned from the geocoding equal to 1. In most of the analyses, the neutral tweets were used to represent a background rate of "natural gas" tweets. As was alluded to at the end of the location analysis section, this section will begin with a discussion of updating the coordinates of state-level tweets. Localizing state-level tweets is a multi-step process, with the idea being to use tweets with more local accuracy levels to create a probabilistic raster to reassign the coordinates of state-level points.
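The localization procedure detailed in the next paragraph reduces to weighted sampling of raster cells followed by a uniform jitter within the sampled cell. The thesis carried this step out in R; the following is only a minimal Python/numpy sketch of the same idea, in which the clipped state raster, its origin, and its cell size are assumed inputs.

import numpy as np

def localize_state_points(density, n_points, x0, y0, cell, rng=np.random.default_rng(0)):
    # density: 2-D kernel density raster of neutral tweets clipped to one state
    # x0, y0: coordinates of the raster's lower-left corner; cell: cell size (square cells assumed)
    weights = density.ravel() / density.sum()          # cell values become sampling probabilities
    idx = rng.choice(density.size, size=n_points, p=weights)
    rows, cols = np.unravel_index(idx, density.shape)
    # move each point to the sampled cell centre, then jitter by up to half a cell in x and y
    x = x0 + (cols + 0.5) * cell + rng.uniform(-cell / 2, cell / 2, n_points)
    y = y0 + (rows + 0.5) * cell + rng.uniform(-cell / 2, cell / 2, n_points)
    return x, y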
First, a kernel density surface was created for all tweets assigned with a neutral sentiment in order to create a surface of the background rate of “natural gas” tweets. Then, using a US shapefile, this surface was clipped into 49 rasters corresponding with the 48 contiguous states plus Washington, DC. Finally, using the name of the state returned with each state-level tweet, the coordinates were randomly reassigned to the center of a grid cell of the corresponding state raster probabilistically based on the value in each cell, plus or minus half the cell width and height in the x and y directions, respectively (see Figure 2.5). An assumption behind this process is that negative and positive tweets cluster in the same locations as neutral tweets. If the data 30 show strong in-state variation in sentiment, or if this assumption is otherwise invalid, the state- level tweets should be removed from the point-level spatial analysis. Figure 2.5. An example of reassigning coordinates from state-level accuracy to more local coordinates based on the background rate of neutral tweets. Here, 10 state-level points returned from Geocodio located at (a) the centroid of Texas are moved to (b) locations placed probabilistically on a kernel density surface derived from neutral tweets. This process has the potential to introduce considerable error into the process at a sub- state regional level that affects the overall interpretation in the final results. With these “natural gas” data, the state-level tweets comprise approximately 1/6th of the total tweets and removing them entirely from the dataset did not cause significant changes to the results. It was determined that, ultimately, removing the state-level tweets may be the safest choice, especially considering that some of the spatial statistics described in this section are sensitive to the precise locations of points in a point pattern. That said, when working with a dataset with less observations, or if working in a different spatial context (such as a global study or a region of the US with smaller states), it is worthwhile to consider this or similar processes for localizing state-level tweets. With the decision to remove the state-level data resolved, the data were prepared for basic point pattern analysis. First, a variety of global cluster detection methods were applied to 31 the subset and full dataset to characterize the overall spatial structure of the data. Ripley’s K and L statistics (Ripley, 1979) were applied to the positive, negative, and neutral point patterns to evaluate their second-order structure individually. Although the point patterns were clustered, which was not surprising, confidence intervals would be necessary to ascribe significance to any differences between the patterns. Simulated confidence envelopes were utilized to determine if any single sentiment category was more spatially clustered than another (19 simulations). The tests were also completed on the full dataset with and without state tweets to determine if there was any effect by the relocated state tweets on the second-order spatial properties of the point pattern. The next test, which is the first of the “sentiment mapping” visualizations, is a binomial spatial scan statistic. This was used to identify clustering of positive and negative tweets relative to the background rate of neutral tweets. 
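For reference, the likelihood ratio underlying this test can be stated explicitly. Using Kulldorff's (1997) Bernoulli (case/control) formulation, quoted here from the general scan statistic literature rather than from the thesis itself, treat positive (or negative) tweets as cases and neutral tweets as controls. For a circular zone $z$ containing $n$ points, of which $c$ are cases, with $C$ cases among $N$ points in the whole study area, the ratio is

$$\Lambda(z) = \frac{\left(\tfrac{c}{n}\right)^{c}\left(\tfrac{n-c}{n}\right)^{n-c}\left(\tfrac{C-c}{N-n}\right)^{C-c}\left(\tfrac{(N-n)-(C-c)}{N-n}\right)^{(N-n)-(C-c)}}{\left(\tfrac{C}{N}\right)^{C}\left(\tfrac{N-C}{N}\right)^{N-C}}, \qquad \text{when } \frac{c}{n} > \frac{C-c}{N-n},$$

and 1 otherwise; the zone maximizing $\Lambda(z)$ is the most likely cluster, with its significance assessed by Monte Carlo simulation.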
The spatial scan computes the likelihood ratio of observing the clustering of one point pattern in a given radius to the clustering of the background point pattern, returning statistically significant locations where more clustering is present than what is expected (Kulldorff, 1997). A radius of 150 km was selected for these data as a representation of the regional scale. With smaller radii, too many local clusters are generated to allow for meaningful interpretation of the resulting map, whereas larger radii simply mirrored the results at the 150 km scale. A total of 99 simulations were run to obtain a confidence level of p < .01. The spatial scan statistic was repeated with the sentiment results of different classifiers, and over varying time slices, in order to glean a well-rounded picture of the true clustering underlying the point patterns. Another check performed to develop confidence in the final results was to plot the spatial distribution of misclassifications resulting from the different sentiment analysis techniques. 32 Errors were saved from the cross-validation runs performed in the sentiment analysis of the data subset as points. These points were then run as a point pattern into the spatial scan statistic described previously. The concept behind this step is that the word choice and issues discussed around the terms “natural gas” may vary regionally, causing some areas to be more prone to sentiment misclassification. Areas that the scan statistic highlights as containing more clustering of error than expected should be interpreted with lowered confidence. As seen in Figure 2.6, the primary area of concern is in Central and Eastern Pennsylvania, the region where three different classes of machine learning classifiers had the highest proportion of error. Also registering on each plot were regions of California and New England, though less so than Pennsylvania. Results in these areas – particularly non-definitive results – will need to be interpreted with more caution. Finally, in order to visualize the positive and negative sentiment spatially, kernel density difference maps were created. This was done by first creating kernel density maps for the positive, negative, and neutral point patterns using a radius of 50 km. The negative kernel density raster was then subtracted from the positive kernel density raster to create the finished kernel density difference maps. Like the spatial scan statistic, these steps were repeated over monthly time slices to create a visualization of the changes in the polarized sentiment over time. In a similar vein, one final temporal analysis performed was the aspatial density of positive, negative, and neutral tweets over time. 33 Figure 2.6. Spatial scan statistics comparing the clustering of errors from different classifiers to the background rate of neutral tweets. The errors pictured are from three machine learning classifiers, (a) MaxEnt, (b) MNB, and (c) N-SVM. Errors were obtained from 50 cross- validation runs of the classifiers using different sets of training and testing data. Plots were generated from 99 simulations of the spatial scan statistic with a radius of 150 km, p < .01. 34 3. RESULTS Tweets in Space Following the data processing, a total of 121,026 “natural gas” tweets across the contiguous United States were prepared for sentiment mapping. 
The text in each tweet had been used to assign each to a positive, negative, or neutral sentiment category, and only the tweets with the most accurate locations geocoded from their location information were retained. The results presented here attempt to visualize these data using a variety of different techniques, a necessity both for providing a complete picture of the results and for establishing their reliability. Decisions made throughout the methodology introduced many opportunities for subjectivity, making a clear and straightforward presentation of the full range of results an essential priority.

Which areas of the US are responsible for the "natural gas" conversation on Twitter? As might be expected, the tweets cluster in urban population centers and follow the general east-to-west population gradient present across the States. Figure 3.1 below depicts the locations of tweets as a kernel density map, with accompanying kernel density plots of the latitude and longitude of each tweet.

Figure 3.1. Kernel density plots for latitude and longitude alongside a kernel density map of tweet locations across the contiguous United States (CONUS) with an accuracy score equal to 1 (n = 121,026). A 50 km bandwidth was used for the kernel density map.

In the Western US, Seattle, Portland, the San Francisco Bay Area, Los Angeles, and Phoenix are visible. Across the Central US, Denver, Milwaukee, Detroit, and Indianapolis stand out, as do several cities in Texas, including Dallas and Fort Worth, Houston, and Austin. In the Eastern US, the conversation is concentrated primarily in the northeast. Boston, Pittsburgh, a corridor from New Jersey through New York City, Washington, D.C., and Atlanta stand out.

Two of these locations are disproportionately dense. Although its presence is not as unexpected, the density of "natural gas" tweets in Pittsburgh, Pennsylvania seems overabundant given the size of the city. Further, the highest density of "natural gas" tweets is located in Washington, D.C., and although the area is highly populated, this density still overrepresents the population. There is a good explanation for both of these high densities of tweets. In the first case, Pennsylvania is the second-largest producer of natural gas in the US, just behind Texas, the largest producer. Pittsburgh is located on the western side of the state on the Marcellus Shale, the largest natural gas field in the country (U.S. Energy Information Administration, 2019b). It seems reasonable that such a city would be responsible for an abundance of natural gas posts on Twitter. Washington, D.C., on the other hand, makes sense as a source of postings because of its function as the political center of the US. Natural gas and its extraction are political topics of interest to lobbying groups, policy makers, and the politically minded citizenry of the capital. Beyond natural gas, it seems likely that any topic of national political interest would see a high density of tweets in D.C.

Classification Comparisons

Before analyzing the sentiment of these tweets toward natural gas spatially, some time will be devoted to a comparison of the results of the different sentiment classifiers. As there is no way to know the actual accuracy of the classifiers once they have been extrapolated to the full dataset, comparison is the only means of assessing their performance.
Based on evaluating the results of Table 2.4 in the Methods, the SVM classifiers performed best overall, and the ensemble classifier appeared to be a good middle ground in the performance of all the classifiers. The best naïve Bayes classifier working on the subset appeared to be MNB, while the best SVM classifier was the N-SVM. However, these classifiers were not guaranteed to be the highest performers on the full dataset.

In examining Figure 3.2, it appears that MNB and N-SVM had some issues with overfitting when compared to other classifiers of their algorithm type. While this may have granted some advantage with performance scores in the data subset, positive and negative tweets were under-classified in the full dataset (compared to the 6.4% positive and 7.9% negative tweets classified in the subset). When working with 121,026 tweets, there is enough data that overfitting is unlikely to result in missing any large spatial clustering of positive or negative sentiment. However, even if some smaller clusters are lost, overfitting is preferable to the alternative, where classifiers attribute too much sentiment to neutral tweets and generate false sentiment clusters.

Figure 3.2. Total classification results for all machine learning algorithms by category. Each square represents 2,000 tweets.

Overall, the C-SVM classifier attributed non-neutral scores to the highest number of tweets of its class (4.9% positive and 5.6% negative) and MNB the lowest (0.6% positive and 2.2% negative). Again, the ensemble classifier's results appear to land in the middle of the other classifiers, which is encouraging and indicates that the different families of classifiers agree on the sentiment of a large proportion of the tweets.

Aside from overfitting, another issue present with most classifiers is the exaggeration of the ratio of positive to negative tweets. Whereas in the subset there were about 1.2x as many negative tweets as positive tweets, the naïve Bayes results have 3.7x, 2.5x, and 0.5x as many negative tweets as positive. There are also 1.7x as many negatives for MaxEnt and 1.5x as many for the ensemble classifier. If the data subset is reflective of the full dataset, the spatial results from these classifiers exaggerate the magnitude of negative sentiment. Again, these results support the idea that the SVM classifiers are the best performers on these data, as their ratios were very near 1.2 negative tweets for every positive one. The GNB results continue to be wildly inaccurate (also the only results with more positive tweets than negative tweets), and only continue to be retained as a study of the downstream consequences of overly polarized sentiment classification.

The takeaway from Figure 3.2 is increased confidence in the overall results. Although individual classifiers such as MNB appear to have missed a significant portion of non-neutral tweets, the general profiles of the classifiers look similar to one another, aside from the expected exception of GNB. Overall, the SVM and MaxEnt classifiers appear to have classified a reasonable number of tweets into each category given the expectations set by the recall and precision of Table 2.4. In contrast, the naïve Bayes classifiers may have been overfit to the subset and thus underperformed. The ensemble classifier continues to represent a middle ground. The associations between the results of the different classifiers are summarized in Table 3.1 below using Pearson's correlation coefficient between each pair of classifiers.
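How the values in Table 3.1 can be produced is straightforward to sketch. Assuming the full-dataset labels from each classifier are stored as columns of a pandas DataFrame (the name predictions below is hypothetical), the categories are mapped to the ranks noted in the Methods (-1, 0, 1) and correlated pairwise; the Bonferroni-corrected significance tests are omitted for brevity.

import pandas as pd

ranks = {"negative": -1, "neutral": 0, "positive": 1}
ranked = predictions.replace(ranks)          # one column of labels per classifier
corr_matrix = ranked.corr(method="pearson")  # analogue of Table 3.1
print(corr_matrix.round(3))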
Here it is demonstrated that the Ensemble classifier is indeed the most representative of all classifiers, with an average correlation of .648. It matched very closely with the MaxEnt classifier in particular, mirroring results seen in previous figures. Among the SVM family of classifiers, C-SVM correlates most strongly with the others, indicating that it would be a good choice for representing its family. As expected, the naïve Bayes classifiers have the weakest associations with one another and with all other classifiers, suggesting that they are categorizing the tweets substantially differently. Considering that this family also had the lowest accuracy scores, the categorizations of the naïve Bayes classifiers are undoubtedly the least reliable, though they will be retained throughout the results as a demonstration of the range of outcomes possible with the use of multiple classifiers.

Table 3.1. Pearson's correlation coefficients between classifier pairs. All correlations were significant at the p < .05 level with a Bonferroni correction.

          GNB     CNB     MNB     C-SVM   N-SVM   L-SVM   MaxEnt  Ensemble
GNB       -       0.253   0.199   0.269   0.246   0.262   0.223   0.242
CNB       0.253   -       0.615   0.411   0.413   0.432   0.431   0.539
MNB       0.199   0.615   -       0.395   0.428   0.397   0.428   0.504
C-SVM     0.269   0.411   0.395   -       0.802   0.835   0.699   0.771
N-SVM     0.246   0.413   0.428   0.802   -       0.781   0.772   0.807
L-SVM     0.262   0.432   0.397   0.835   0.781   -       0.733   0.783
MaxEnt    0.223   0.431   0.428   0.699   0.772   0.733   -       0.890
Ensemble  0.242   0.539   0.504   0.771   0.807   0.783   0.890   -

Sentiment Cluster Maps

Further evaluation of the classifiers can be performed spatially by comparing the spatial scan statistics for the positive and negative sentiment categories. Figure 3.3, displaying "sentiment cluster maps," groups the machine learning classifiers into the same three groupings as the previous figure. The clusters of positive (left) and negative (right) sentiment are plotted with respect to the background rate of neutral tweets.

Figure 3.3. "Sentiment cluster maps." Spatial scan statistic results for positive (left) and negative (right) tweets with respect to a background rate of neutral tweets for all classifiers. The spatial scan was completed with a 150 km radius and 99 simulations, p < .01.

The MaxEnt and ensemble classifier results are consistent with the previous figure in that they represent a middle ground between the other classifiers. Here, they identify many of the same clusters as the other classifiers but without any strong emphasis on individual clusters in the case of positive sentiment, and with more distinct clustering in the case of negative sentiment. Because it was created through a voting process, focusing on the spatial scan results for the ensemble classifier is a way of summarizing the trends across all the classifiers. There are clusters of positive sentiment in Reno, Phoenix, Pittsburgh, and Atlanta, as well as other nonspecific regions of eastern Montana, North Dakota, Louisiana, Alabama, and Oklahoma. Positive sentiment appears to cluster in a broad range of locations across the interior of the US. Examining the maps above the ensemble classifier reveals that these positive cluster locations can vary considerably depending on the classifier used, indicating that the individual locations are less meaningful than this overall trend of broad clustering. Conversely, negative sentiment is consistent and highly clustered in three distinct coastal areas: Seattle, the San Francisco Bay Area, and the New York City area.
Note that although the negative sentiment is clustered in highly populated areas that Figure 3.1 revealed are responsible for a large portion of the "natural gas" tweets, these results account for the underlying population because the background rate of neutral tweets is utilized with the spatial scan statistic. This means that the negative tweets are clustered strongly enough to be significant even in light of the very high rates of neutral tweeting in these areas.

When comparing the classifiers to one another and to previous figures, many of the same trends emerge that were seen in prior comparisons. The SVM family maps similarly for both positive and negative sentiment. For the positive sentiment, the most contested cluster is centered on Phoenix, where C-SVM finds it to be one of the most robust clusters but N-SVM finds it to be one of the weakest. The other differences are mostly minute. The negative sentiment maps are nearly identical, both internally within the SVM family and when compared to the MaxEnt and ensemble classifiers. Overall, the results of the SVM classifications continue to follow the same pattern of consistency with each additional visualization. Thus far, there have been no indications that the SVM family is producing unreliable results.

The naïve Bayes classifiers disagree with one another most strongly, with GNB once again setting itself apart as the outlier. This is most apparent in the spatial scan of negative sentiment, where the Seattle cluster is nearly absent from the GNB map though it appears in every other negative scan. It also underemphasizes the southern portion of the Bay Area cluster while finding a Louisiana cluster that no other classifier identified. The MNB classifier, which Figure 3.2 revealed to be stringent in assigning non-neutral scores, also performed very poorly. Consistent with its underrepresentation of polarized sentiment, the Bay Area cluster is nearly missing entirely from its negative sentiment plot. The positive sentiment scan fares even worse for MNB, bearing little resemblance to the other maps of positive sentiment and also being the only one to find a strong positive sentiment cluster around Portland. Of the naïve Bayes classifiers, CNB is the only one to compare with the non-NB classifiers relatively well. However, it found weak sentiment clusters on both plots in regions where the other classifiers found nothing. This is somewhat unsurprising, as Table 2.4 reveals that CNB correctly identified many of the non-neutral tweets (recall) but was also quick to assign non-neutral labels to tweets that did not warrant them (precision). Overall, the flaws in the naïve Bayes family of classifiers appear to have had seriously detrimental effects when the classifications are plotted spatially. The results disagree internally and also externally compared to the other classifier families. The spatial scan statistics resulted in significantly decreased confidence in the naïve Bayes results.

The Khat statistic is another means of characterizing the spatial distribution of tweets. Taken individually, the Khat values for a given point pattern reveal the degree to which the points are clustered at different spatial scales, which only reveals that both the positive and negative sentiment tweets are clustered. It is in taking the difference between the Khat values for positive and negative tweets that more revealing information is extracted.
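The thesis computed these second-order statistics with spatstat in R, including edge corrections and simulation envelopes. Purely to illustrate the idea behind the difference plot discussed next, the following Python sketch estimates a naive, uncorrected Ripley's K for two point patterns over the same study area and differences them; pos_xy, neg_xy, and study_area are assumed inputs.

import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, radii, area):
    # Naive Ripley's K: no edge correction; points is an (n, 2) array in projected coordinates
    n = len(points)
    d = pdist(points)                # each unordered pair of points once
    return np.array([area * 2 * np.sum(d <= r) / (n * (n - 1)) for r in radii])

radii = np.linspace(0, 400_000, 81)  # 0-400 km in metres under the Albers projection
k_diff = ripley_k(neg_xy, radii, study_area) - ripley_k(pos_xy, radii, study_area)
# k_diff > 0 at a given radius indicates stronger clustering of negative tweets at that scale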
Figure 3.4 depicts the results of subtracting the Khat values of positive tweets from the Khat values of negative tweets.

Figure 3.4. Khat difference between positive and negative tweets classified by the ensemble classifier. Envelopes represent p = .05 confidence intervals from 19 simulations. Khat difference values above 0 indicate greater clustering of negative tweets at the distances shown.

Up to a radius of approximately 375 km, the negative tweets are significantly more clustered than the positive tweets, rising well above the p = .05 confidence envelopes. This matches the expectation set by the nature of the clustering in Figure 3.3, where negative tweets were highly clustered in a few major population centers while positive sentiment clusters were more spatially diffuse. The Khat difference plot confirmed the visual interpretation that the negative tweets were the more clustered pattern using a statistical test rather than merely a visual cue. Further, the test indicates that this more intense clustering holds true out to a radius of about 375 km, beyond which neither category of tweets is significantly more clustered than the other.

Raw Sentiment Maps

With an understanding of the performance of different classifiers and the overall spatial patterns of the positive and negative sentiment categories established, the results will now move on to the final sentiment maps. Depicted in Figure 3.5, the "raw sentiment map" is the difference between the density of positive and negative sentiment across the contiguous United States. Areas with a higher raw density of positive tweets appear in orange, while locations of higher negative density appear in green. The eight classifiers were reduced to four in this plot to highlight only the classifiers that showed the highest performance of their family throughout the different visualizations.

Figure 3.5. "Raw sentiment maps." Kernel density difference maps from four different classifiers computed by subtracting the kernel density map of negative sentiment (green) from that of positive sentiment (orange). Values above zero indicate a higher density of positive natural gas tweets.

The four classifiers included in the plot tell similar but somewhat varying stories, indicating that the kernel density difference map is less sensitive to different classifiers than the spatial scan was. The map that is most dissimilar compared to the others is the CNB plot. This is unsurprising given that the naïve Bayes classifiers consistently performed the most poorly, and with results most different from the other classifiers. Once again, all classifiers agree on the hubs of negative sentiment that were identified with the spatial scan statistic. Seattle, the Bay Area, and the northeast from New Jersey to Boston have a high density of tweets classified as negative toward natural gas. Added to this list are Washington, D.C., the Twin Cities, Portland, Detroit, and Los Angeles, which appear on the density difference maps for most classifiers. While it may seem unusual that new locations are highlighted with this visualization, recall that these visualizations are independent of the background rate of neutral tweets. In an area such as Washington, D.C. with a high density of neutral tweets, there is a very high threshold to cross before the density of negative tweets becomes statistically significant.
However, despite the clustering of negative tweets being insignificant when compared to the background rate, most classifiers agreed that there was more negative sentiment toward natural gas in the D.C. area than positive sentiment. Areas that register with more considerable positive sentiment toward natural gas are again more dispersed and more variable than the negative sentiment hotspots, similar to what was observed in Figure 3.3. Houston, Austin, Oklahoma, Louisiana, Pittsburgh, and Tampa stand out most prominently and consistently as areas with a higher density of positive natural gas tweets than negative tweets. Looking at the ranges of values, the high-density centers for positive tweets are not as pronounced as the negative density clusters, again harking back to the more 46 intense clustering of negative tweets. Also, given that more negative tweets exist overall in the dataset, and that the clustering of negative tweets was found to be more significant, it is unsurprising that the most prominent positive sentiment areas do not size up to the magnitude of negative sentiment areas. Many of the hotspots of positive natural gas sentiment align with natural gas producing regions of the United States. As mentioned, Texas and Pennsylvania are the largest producers of natural gas in the country, and both contain high-density areas of positive sentiment. Louisiana and Oklahoma, which also show favorable natural gas sentiment on the raw sentiment map, are the third and fourth-largest producers (U.S. Energy Information Adminstration, 2019a). The major hotspots of negative sentiment seem concordant with expectations when the sentiment is viewed through a political lens. Seattle, the Bay Area, the Northeast, and Washington, D.C. skew strongly toward the Democratic Party. Given that the Democratic Party favors “green” policies, which generally call for the reduction or complete halt of natural gas production, it is unsurprising to see clusters of negative sentiment in these highly Democratic regions. A similar story applies to negative sentiment observed in Twin Cities, Portland, Detroit, and Los Angeles, which are areas that strongly favor the Democratic Party. Taken together, the current results build an internally consistent representation of “natural gas” sentiment across the contiguous United States that aligns with real-world phenomena, such as political sentiment and regions of natural gas production. Exaggerated Sentiment Maps An issue with both the raw sentiment maps and the sentiment cluster maps is that population centers are dominating the results. In one way, this reflects reality; less populated areas contribute far less to the Twitter discourse in terms of the volume of tweets. However, this 47 leaves a blank in the evaluation of the “natural gas” sentiment, both literally and figuratively. The large swaths of white space on the maps provide no understanding of the sentiment in less populated regions of the country. Although these areas may not have the same level of contribution as more populated areas, there are still many reasons to be interested in the sentiment of these areas. One way to highlight the sentiment of places with a lower density of tweets is to dampen the magnitude of higher density areas. This can be accomplished by taking the square root of the positive and negative kernel density maps before calculating the difference between them. 
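As a concrete illustration of the raw and square-root-transformed difference surfaces just described, the sketch below evaluates Gaussian kernel densities for the positive and negative point sets on a shared grid and differences them. The thesis produced these rasters in R with a fixed 50 km kernel; this Python version using scipy is only illustrative, its bandwidth is chosen automatically rather than fixed, and pos_xy, neg_xy, and bounds are assumed inputs.

import numpy as np
from scipy.stats import gaussian_kde

xmin, xmax, ymin, ymax = bounds                    # study-area bounding box in projected units
gx, gy = np.mgrid[xmin:xmax:500j, ymin:ymax:500j]  # 500 x 500 evaluation grid
grid = np.vstack([gx.ravel(), gy.ravel()])

pos_density = gaussian_kde(pos_xy.T)(grid).reshape(gx.shape)
neg_density = gaussian_kde(neg_xy.T)(grid).reshape(gx.shape)

raw_sentiment = pos_density - neg_density                   # "raw sentiment map" surface
exaggerated = np.sqrt(pos_density) - np.sqrt(neg_density)   # "exaggerated sentiment map" surface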
In doing so, the prevailing sentiment of the classified tweets becomes clear in almost all regions of the contiguous United States (see Figure 3.6). With these "exaggerated sentiment maps," within-state variation in sentiment is now clearly visible. State-by-state, the Ensemble and MaxEnt classifiers show the most agreement in the distribution of sentiment. As expected, the CNB classifier is once again the most different, showing a far greater dispersion of negative sentiment than the other classifiers. C-SVM, which is the most representative of the SVM classifiers used, and may represent the most accurate tweet classifications overall, showed the largest distribution of positive sentiment. Although these results will not delve into the mapped outcomes state-by-state, it is worthwhile to note that a majority of states show consistent patterns across a majority of the classifiers (excluding CNB). Additionally, many of these patterns make sense with comparison to the locations of natural gas production and the distribution of political viewpoints across the contiguous US.

Figure 3.6. "Exaggerated sentiment maps." The difference between the square roots of the density of positively and negatively classified tweets.

The effect of the relatively poor classification performance by the naïve Bayes classifiers is apparent with this figure – although the negative sentiment results are similar in many locations, the areas of positive sentiment are significantly altered, to the extent that state-by-state interpretation of the CNB map would often lead to very different interpretations of the distribution of sentiment. It is a notable example of what could go wrong with the results of sentiment mapping, even with careful hyperparameter optimization and a large set of training data.

One cost of the additional information provided by the exaggerated sentiment maps is that the overall noisiness of the maps increases, as does the number of differences between the results of the four classifiers. Further, any given hotspot of positive or negative sentiment is less reliable, and a sense of the proportion of differences in sentiment intensity is lost. Whatever concerns have developed thus far around the accuracy of the overall sentiment mapping results should be heightened significantly for these plots. Recall also that misclassifications from the classifiers are not distributed evenly across space. Now would be a good time to reexamine the error maps of Figure 2.6 to recall where the classifiers most consistently made classification mistakes. At this sub-state level of analysis, it is likely best to avoid including the state-level tweets; at best they are adding some useful information along with a lot of noise, and at worst they are biasing the within-state sentiment.

Tweets in Time

The sentiment of the "natural gas" tweets can also be evaluated temporally, both to learn more about the nature of the different sentiments and to evaluate the differences between the classifiers further. Figure 3.7 depicts the density of positive, negative, and neutral tweets over time. Of note is that all classifiers follow a roughly uniform distribution overall, indicating that there was no significant increase or decrease in "natural gas" tweeting over the time period. The two valleys occurring in October are due to two weeks of missed data collection near the beginning and end of the month. Except for CNB, all classifiers agree on three peaks in non-neutral sentiment occurring on May 30th, September 8th, and December 8th.
The first and last peaks occur due to high densities of negative sentiment during the periods while the middle peak indicates a high density of both positive and negative sentiment, with positive sentiment rising above negative. According to the results of these classifiers, it seems that the negative sentiment is not only more highly clustered spatially but also more clustered temporally than the positive sentiment. The C-SVM classifier, which found the lowest ratio of positive to negative tweets among the classifiers plotted, finds these sentiment peaks to be noticeably lower in magnitude. In examining these plots, it is essential to remember that their data are entirely dependent upon the 50 classifier used, its parameters, and the training data. The spikes observed could be nothing more than artifacts from the classification process. Many previous studies have linked temporal peaks and valleys in different sentiment categories to news events. For many users, one of Twitter’s primary uses is reading and sharing articles and current events. If a topic is trending that brings out significant positive or negative sentiment, this may be reflected in plots such as Figure 3.7. Another contributor to the spikes could be changes in the language used during different periods. If a specific word or phrase is used frequently during a single time window, and this word or phrase is distinctly connected to a particular sentiment category, it is possible that the classifiers were trained to assign non-neutral sentiment to tweets with that text more easily. For example, the day before the May 30th spike, the US Department of Energy described natural gas as “freedom gas” and “molecules of US freedom” in official statements (Bowden, 2019). This spurred a number of tweets on the subject, the majority of which were of negative or neutral sentiment. What was key in this event’s ability to trigger a spike in sentiment is the rarity of words like “molecules” and “freedom” in the natural gas dialogue. If the classifiers learned to associate these words with a significantly higher likelihood of negative sentiment, it would be simple to understand why so many negative tweets were classified because of this event. 51 Figure 3.7. Density plot showing the frequency of positive, negative, and neutral tweets over time using four different classifiers. The three most prominent peaks occur on May 30th, September 8th, and December 8th. The bandwidth is set to 7.5 days. 52 Other temporal analyses are more complex and involve mapping the change in sentiment spatiotemporally. As seen in Figures 3.8 and 3.9 below, which display monthly sentiment cluster maps and monthly raw sentiment maps, this quickly becomes cumbersome as each time slice requires a different map. These data are visualized and interpreted much more easily with interactive maps or gifs with multiple frames. That said, the main takeaway from these plots is still evident from the still plots. There is very high spatial variability in positive and negative sentiment. All visualizations before this section of the results smoothed over this surprisingly high variation. Again, the negative sentiment shows less variability, particularly in Figure 3.9, where the same few coastal hotspots of negative sentiment continue to stand out. Even so, the intensity and locations of negative sentiment hotspots change significantly from month to month. These figures highlight the need for long-term studies on twitter data. 
This study used an 8- month period and the variability present in these maps and Figure 3.7 suggests that some of the overall spatial patterns in sentiment may be more temporary than previously assumed. 53 54 Figure 3.8 (cont’d). Monthly sentiment cluster maps. Created using a spatial scan statistic with data grouped by monthly time slices, using the ensemble classifier. The scan shows elevated levels of positive sentiment (left) and negative sentiment (right) relative to the background rate of neutral sentiment. Each scan was completed with a 150 km radius and 99 simulations, p < .01. Figure 3.9. Monthly raw sentiment maps. Kernel density difference plots for each month in the data set, using the ensemble classifier. Orange areas have a higher density of positive tweets while green areas have a higher density of negative tweets. 55 4. CONCLUSION This paper has provided a demonstration of sentiment mapping, an approach to visualizing the spatial distribution of opinions or sentiment shared on social media. Here, data are pulled from Twitter to analyze sentiment on the platform toward natural gas. The process of sentiment mapping draws from work in two disparate areas: 1) sentiment analysis, a branch of natural language processing focused on identifying the mood or sentiment of the text, and 2) density and distance-based methods of point pattern analysis, which have been applied to a broad range of spatial questions. Although there are many examples to be found of studying sentiment on Twitter or studying the spatiotemporal distribution of tweets, the unique combination of sentiment analysis and point pattern analysis allows for spatiotemporal assessment of Twitter sentiment in ways that have not previously been explored. In addition to describing the methodology of sentiment mapping, a case study of the natural gas sentiment expressed on Twitter over eight months was examined to evaluate the legitimacy of the results. Although over 300,000 tweets containing the keywords “natural gas” were downloaded, only 121,026 could be geocoded with enough accuracy for use in further spatial analysis. Seven machine learning classifiers were trained on a subset of manually classified tweets to learn the patterns of classifying natural gas tweets into positive, negative, and neutral categories. They were evaluated on their accuracy for this subset using standard training and testing methods, both overall and by individual category. From there, the full dataset was classified, and the sentiment mapping process began. Three methods of sentiment mapping were applied to the data to examine the spatial patterns of positive and negative sentiment in varying ways. The first, “sentiment cluster maps,” 56 used a spatial scan statistic to identify spatial clusters of positive and negative sentiment with respect to the background rate of neutral tweets. The second, “raw sentiment maps,” depicted the difference between the kernel density maps of positive and negative sentiment. The third, “exaggerated sentiment maps,” took the square root of the raw sentiment maps to highlight the sentiment in areas with less tweeting. The maps were created using the classifications output from different machine learning classifiers and were compared to evaluate the range of results resulting from different classifiers. Overall, the results of the spatial analysis were reasonably consistent between the different classifiers and sentiment mapping techniques applied. 
The exaggerated sentiment maps, which highlighted local variation, showed the most variability but shared the same big picture trend. Negative sentiment was most densely clustered in three coastal areas: Seattle, Washington, and the northeast near New York City, which are areas that are politically outspoken against natural gas production. Positive sentiment was less densely clustered overall but showed high concentrations in major natural gas producing regions of Texas, Pennsylvania, Louisiana, Oklahoma, and other states. When considering the sentiment classification results, which were promising and consistent overall, of note is the poor performance of the naïve Bayes classifiers. These results not only run contrary to other research (S. Wang & Manning, 2012) but also highlight the need for more care with the application of sentiment analysis to social media data. Many studies covered in the Introduction relied on only one classifier to perform the sentiment analysis. However, the results here demonstrate that a single classifier, or even a family of classifiers, has the potential to go awry when applied to a large dataset, even with a large sample of training data. It is strongly recommended that multiple classification algorithms are run and compared to 57 avoid the potential for poor classification and to understand the possible ranges in the spatiotemporal distribution of the data. Although the sentiment maps told a consistent and cogent story, the spatiotemporal analysis revealed more complexity behind the broad results. There was high temporal variation of both positive and negative sentiment, with two temporal peaks of negative sentiment being responsible for a large portion of the overall negative sentiment. Spatially, the clusters of positive and negative sentiment varied considerably on a month-to-month basis, suggesting that the time over which tweets were collected played an important role in shaping the final results. Interpretation of the results must also consider that the data passes through many filters before the sentiment mapping begins – rather than representing the sentiment of all people in different regions toward a chosen topic, the data can only ever approximate the sentiment of a subset of Twitter users. That said, the potential applications of sentiment mapping are quite broad. The extremely large sample sizes available through social media can provide information on large populations of people. As Twitter and other social media platforms, some new, expand globally, the potential reach of this research will only continue to grow. Even if the data can only represent people using specific platforms, the capacity to approximate the regional moods or opinions of people, both spatially and temporally, has a broad range of research potential. Many areas of human research, from psychology and sociology to geography and public policy, have lines of research that could be supplemented with the use of sentiment mapping. While it is true that the results can only offer an approximation, this is also true of most broad-scale human research on mood or opinions. 58 There are many promising avenues for further research in the area of sentiment mapping. Although three types of sentiment maps were presented in this paper, other possibilities exist for using point patterns to analyze and visualize sentiment spatially, such as geographic analysis machine (GAM) (Fotheringham & Zhan, 1996; Openshaw, Charlton, Wymer, & Craft, 1987). 
Other methods for the temporal analyses may better visualize and/or quantify changes in sentiment temporally and spatiotemporally. A more statistical approach, such as space-time autoregressive modeling, might better quantify the stationarity or non-stationarity of the sentiment in different regions. Also of temporal interest is how the performance of the classifiers varies with time as the lexicon of the conversation around a given topic changes. Analyzing this “concept drift” (Forman, 2006) of the Twitter conversation may be a key part of understanding the temporal variation in sentiment observed in these data. It may also play a role in improving the performance of the classifiers. Several other areas call for deeper understanding as well. Further evaluation and quantification of the spatial variability across the different machine learning classifiers is important for interpreting these results. Experiments with simulated data and/or more advanced statistical analysis of the spatial deviations of the sentiment maps could provide benchmarks for evaluating the results beyond mere visual interpretation. The results of this analysis tell a fairly consistent story about the broad sentiment distribution toward the topic of natural gas. Still, it is not clear what the likelihood of these results are. It could be that the techniques used simply lend themselves to consistent output so long as the machine learning classifiers are using the same training data and are reasonably well parameterized. Developing a method for hypothesis testing the sentiment map results to compare them to anticipated values or to other topics would be a significant step forward. 59 There is also further research needed to understand if new results are produced by taking the influence of tweets (such as likes and retweets) into account when measuring their distribution geographically. While this paper utilized only the raw presence of positive and negative tweets in its spatial analyses, creating the sentiment maps with the tweets weighted by their influence may tell a different story, or at the very least, might be used to address different questions. Both within the area of sentiment mapping and in sentiment analysis of social media data more broadly, there is still more work needed to explore the biases of Twitter users compared to the general population, both in the US and abroad. For example, do the more general biases such as political leaning do well to predict the distribution of sentiment on specific topics, or do more general biases break down when selecting for keywords that isolate a particular subject? With some of these questions addressed, similar sentiment mapping research could contribute to the growing use of social media for disaster management (Houston et al., 2015), public health understanding (Daughton et al., 2018), or even to improve the targeted advertising efforts of businesses (Nair, Shetty, & Shetty, 2017), among many other possibilities. What this paper offers is a first pass at combining sentiment analysis and point pattern analysis to create sentiment maps, providing some terminology and methods for visualizing and interpreting social media sentiment spatially. The methodology presented in this paper can be refined much further but perhaps it is enough to demonstrate the power infused in these techniques. 
There is room for improvement in every step of the process, including cleaner location analysis, more accurate machine learning classification, more substantial point pattern analysis, and a more in-depth temporal analysis. The work presented here offers only the 60 skeleton of sentiment mapping, a novel amalgamation of techniques with great potential for mapping the human social environment. 61 REFERENCES 62 REFERENCES Achrekar, H., Gandhe, A., Lazarus, R., Yu, S. H., & Liu, B. (2011). Predicting flu trends using twitter data. In 2011 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2011 (pp. 702–707). https://doi.org/10.1109/INFCOMW.2011.5928903 Agarwal, A., Singh, R., & Toshniwal, D. (2018). Geospatial sentiment analysis using twitter data for UK-EU referendum. Journal of Information and Optimization Sciences, 39(1), 303–317. https://doi.org/10.1080/02522667.2017.1374735 Aramaki, E., Maskawa, S., & Morita, M. (2011). Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter.pdf. In Proceedings ofthe 2011 Conference on Empirical Methods in Natural Language Processing (pp. 1568–1576). Ardon, S., Bagchi, A., Mahanti, A., Ruhela, A., Seth, A., Tripathy, R. M., & Triukose, S. (2011). Spatio-Temporal Analysis of Topic Popularity in Twitter. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management (pp. 219–228). Retrieved from http://arxiv.org/abs/1111.2904 Baddeley, A., & Turner, R. (2005). spatstat: An R Package for Analyzing Spatial Point Patterns. Journal of Statistical Software, 12(6). Bertrand, K. Z., Bialik, M., Virdee, K., Gros, A., & Bar-Yam, Y. (2013). Sentiment in New York City: A High Resolution Spatial and Temporal View, 1–12. Retrieved from http://arxiv.org/abs/1308.5010 Bivand, R., Rowlingson, B., Diggle, P., Petris, G., Eglen, S., & Bivand, M. R. (2017). Package ‘ splancs .’ R Package. Bivand, Roger, Keitt, T., Rowlingson, B., Pebesma, E., Sumner, M., Hijmans, R., … Bivand, M. R. (2015). Package ‘ rgdal .’ R Package. Bollen, J., & Mao, H. (2011). Twitter Mood as a Stock Market. Computer, (October), 91–95. Bollen, J., Mao, H., & Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media Modeling (pp. 450–453). Retrieved from http://ovidsp.ovid.com/ovidweb.cgi?T=JS&CSC=Y&NEWS=N&PAGE=fulltext&D=emed 5&AN=2001043531 Bowden, J. (2019). Trump energy officials label natural gas “freedom gas.” The Hill. Retrieved from https://thehill.com/policy/energy-environment/446004-trump-energy-officials-label- natural-gas-freedom-gas Carneiro, H. A., & Mylonakis, E. (2009). Google Trends: A Web‐Based Tool for Real‐Time 63 Surveillance of Disease Outbreaks. Clinical Infectious Diseases, 49(10), 1557–1564. https://doi.org/10.1086/630200 Chang, C., & Lin, C. (2011). LIBSVM : A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27. Connors, J. P., Lei, S., & Kelly, M. (2012). Citizen Science in the Age of Neogeography: Utilizing Volunteered Geographic Information for Environmental Monitoring. Annals of the Association of American Geographers, 102(6), 1267–1289. https://doi.org/10.1080/00045608.2011.627058 Conover, M. D., Davis, C., Ferrara, E., Mckelvey, K., Menczer, F., & Flammini, A. (2013). The Geospatial Characteristics of a Social Movement Communication Network. PLoS ONE, 8(3). 