ESSAYS IN ASSET PRICING AND INVESTOR BEHAVIOR By Qian Yang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Business Administration—Finance—Doctor of Philosophy 2023 ABSTRACT In Chapter One, we examine the following question. Have retail investors become the ants that move the log? Social media has proved instrumental for effective coordination that might lead to extreme returns. To study this effect, I construct a novel crash risk measure by estimating ex-ante crash probabilities via logit and machine learning techniques. Stocks with high ex-ante crash risk tend to have lower returns, especially when lagged sentiment is high. Robinhood traders tend to over-buy high crash-risk stocks, consistent with the optimal expectations theory. By exploiting the staggered first appearances of ticker names on “Wallstreetbets”, I document a causal effect of social transmission on crash risk. This effect is significantly more substantial for smaller stocks. To further bolster the finding, I exploit the entire history of Reddit to construct a novel instrument and show that social transmission is likely to cause elevated crash risk on a daily basis. In Chapter Two, we examine the following issue. Cyber risk is an important but latent source of risk in the economy. To estimate its impact on the asset market, we use machine learning techniques to develop a firm-level measure of cyber risk. The measure aggregates information from a rich set of firm characteristics and shows superior ability to forecast future cyberattacks on individual firms. We find that firms with higher cyber risk earn higher average stock returns. When these firms underperform, cybersecurity experts tend to have higher concerns about cyber risk, and cybersecurity exchange-traded funds outperform. Further tests strengthen the identification of the cyber risk premium. Copyright by QIAN YANG 2023 ACKNOWLEDGEMENTS I want to acknowledge the generous support I received while studying here at the Broad School of Business finance department. I thank my committee members, Naveen Khanna, Hao Jiang, Andrei Simonov, and Ryan Israelsen, for their help and support during all my years here. I thank Naveen Khanna and Hao Jiang for their limitless patience and guidance. I thank William Grieser and Morad Zekhnini for their helpful comments and suggestions. Finally, I thank my wife, Ting Ting, for her firm support and understanding. Without her, none of this would be possible. iv TABLE OF CONTENTS CHAPTER 1 ANTS THAT MOVE THE LOG: CRASHES, DISTORTED BELIEFS, AND SOCIAL TRANSMISSION . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 THE CYBER RISK PREMIUM . . . . . . . . . . . . . . . . . . . . . . 44 CHAPTER 3 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 v CHAPTER 1 ANTS THAT MOVE THE LOG: CRASHES, DISTORTED BELIEFS, AND SOCIAL TRANSMISSION 1.1 Introduction A long-standing point of inquiry in asset pricing and market micro-structure research concerns the role of retail traders. On the one hand, retail traders may generate noise that provides liquidity and incentivizes informed trading, both necessary elements for financial markets to function efficiently (Grossman and Stiglitz, 1980; Kyle, 1985; Black, 1986; Barber and Odean, 2000). 
On the other hand, correlated sentiment among retail traders can induce modest transitory price impacts that generate limits to arbitrage (Shleifer and Vishny, 1997; Barber et al., 2008, 2009). A few key features of financial markets have likely driven the historical modesty of retail traders’ price impact. Specifically, transaction costs have restricted retail trading to a small portion of market volume. Moreover, correlated sentiment among retail traders was mainly confined to herd behavior or everyday exposure to salient events along with inefficient Bayesian updating (e.g., Banerjee, 1992; Bikhchandani et al., 1998; Barber et al., 2021) rather than deliberate coordination. While these features previously characterized financial markets, recent innovations have dramat- ically changed the environment for retail traders. For example, Robinhood’s advent of commission- free trading in 2015, followed by major online trading platforms such as Charles Schwab, TD Ameritrade, and E-trade in 2019, relaxed retail trading costs considerably. These events partially explain the exponential growth in retail trading, now responsible for as much as 25% of stock mar- ket volume (McCrank, 2021). In addition, social media platforms such as Reddit facilitate direct coordination among retail traders. These evolving characteristics are all well represented in the "GameStop" event in 2021, whereby retail traders joined forces to drive up GameStop’s stock price by 3,000% to engineer a short squeeze. The GameStop event raises two critical questions. First, has improved coordination introduced the possibility that entertainment motivates many retail traders rather than profit? Second, is the GameStop event an anomaly, or have these features allowed retail traders to become "the ants that move the log," thus potentially altering their role in financial 1 markets? Gaining an adequate understanding of these questions will likely require considerable theoretical and empirical analysis and, therefore, is well beyond the scope of a single study. Thus, this paper aims to provide an initial systematic exploration of this topic by employing various novel empirical techniques in various settings with granular data on Robinhood trading activity and interactions among retail traders on Reddit. First, I use the standard logit regression to estimate ex-ante crash probabilities, where a "crash" is defined as the log monthly return lower than -20%.1 Estimating crash risk by a return threshold is informative. According to Beason and Schreindorfer (2022), 80% of the average equity premium is attributable to monthly returns below -10%. However, crashes defined as over -20% monthly return drop constitutes only 5% of all stock returns in the CRSP universe from 1996-2021. Thus, predicting ex-ante crash risk is challenging because of the relatively low frequency of crashes, making it hard to construct valid counterfactuals. I employ a novel machine-learning technique that substantially improves the predictive power of low-probability binary outcomes. Consistently with prior literature (e.g., Jang and Kang, 2019; Atilgan et al., 2020), ex-ante crash risk is negatively correlated with future stock returns. Specifically, a one-standard-deviation increase in crash risk is associated with an approximately 50 bps drop in monthly risk-adjusted returns. The return predictability remains strong conditioning on other tail risk measures (e.g. 𝑉 𝑎𝑅 in Atilgan et al. (2020)). 
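For intuition on the -20% cutoff used to define a crash, note that a log monthly return of -20% corresponds to a simple (arithmetic) return of roughly -18%:

\[
r^{\log} = \ln(1+R) < -0.20
\quad\Longleftrightarrow\quad
R < e^{-0.20} - 1 \approx -18.1\%.
\]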
Moreover, when lagged sentiment is high, the overpricing of high crash-risk stocks is more severe. These results are consistent with the predictions in Brunnermeier et al. (2007), where investors underestimate left-tail probabilities when sentiment is high and thus buy more than the rational amount. Furthermore, consistent with the theory, I document that Robinhood traders disproportionately buy stocks with high ex-ante crash risk. In contrast, institutional investors tend to sell high crash-risk stocks. It is hard to determine the direction of causality, which perhaps even cuts both ways. That is, are retail traders merely attracted to high-tail-risk stocks? Or are they part of what creates the tail risk? [Footnote 1: The -20% cutoff is motivated by prior literature (e.g., Jang and Kang, 2019), and I explore alternative return thresholds in the Appendix.]
To partially unpack the potential for the latter channel, building on the recent advancement in social transmission theory (Han et al., 2022), I exploit the history of the social media platform Reddit and the first-time appearances of stock tickers on “Wallstreetbets” as a quasi-natural experiment. Specifically, I use a stacked “difference-in-differences” approach (Gormley and Matsa, 2011; Cengiz et al., 2019) to document a causal effect of investors’ online conversations on the ex-ante crash risk of stocks. I partially alleviate possible endogeneity concerns by carefully constructing a matched sample and conditioning on a set of characteristics that draw retail attention. The results show that, on average, the crash risk of stocks increases by approximately 10% within the first three months of appearance on “Wallstreetbets”. Recent work on social transmission (Hu et al., 2021) shows that the online conversations of retail investors on “Wallstreetbets” contain information that possibly drives future stock prices on a daily basis. To bolster the previous results, I build on this work and construct a novel and plausible instrument for investment-related conversations by utilizing the entire history of Reddit posts. Through an instrumental variable estimation approach, I show that a one-standard-deviation increase in online discussions on “Wallstreetbets” is associated with an approximately 2.3% increase in ex-ante crash risk at a daily frequency, where I follow prior literature (e.g., Bollen and Whaley, 2004; Van Buskirk, 2011; Kim and Zhang, 2014; Kim et al., 2016) and use the option-implied volatility measure SKEW as the proxy for crash risk. These results corroborate the previous “difference-in-differences” framework and suggest that retail investors could cause extreme stock returns via efficient herding. Have retail traders become the ants that move the log? This paper presents a preliminary analysis to address whether we have reached a paradigm shift in the role of retail traders. There are several unique contributions. First, to the best of my knowledge, this is the first study that conducts causal inference on retail influence on crash risk or left-tail risk. Moreover, this paper proposes a new ex-ante crash risk measure via novel methodologies. The rest of the paper is organized as follows. Section 1.2 briefly reviews the existing literature. Section 1.3 explains the construction of ex-ante crash risk and the corresponding results for estimating monthly crash probabilities. Section 1.4 conducts asset pricing tests for crash risk in the cross-section of stock returns.
Section 1.5 discusses the distorted belief mechanism for the negative price of crash risk. Section 1.6 documents the causal effect of retail conversations on firm crash risk. Section 1.7 constructs a novel instrument to provide further evidence on the causal effect of social transmission on crash risk. Section 1.8 concludes. 1.2 Literature Review This study is related to an extensive list of areas in literature. First and foremost, it concerns the firm-level crash risk. The corporate finance literature studies the determinants of firm crash risk. These determinants are often motivated by managers hoarding bad news (Jin and Myers, 2006). The idea is that the hoarding delays the information transmission such that when it is ultimately released, there is a sudden drop in the price corresponding to the size of the cumulative bad news. Motivated by this theory, the literature has proposed a list of determinants that could endogenously influence crash risk, such as earnings management (Hutton et al., 2009), tax avoidance (Kim et al., 2011), annual report readability (Li, 2008), CSR (Kim et al., 2014), liquidity (Chang et al., 2016), short interest (Callen and Fang, 2015), and governance (Andreou et al., 2016; An and Zhang, 2013). This paper differs from this literature in that it estimates crash risk at a monthly frequency, by utilizing a rich set of conditional information (Chen and Zimmermann, 2021). In asset pricing, a rich body of literature extracts information from option prices to determine the size of tail risk. For example, Pan (2002) provides theoretical support for the jump-risk premia implied by near-the-money short-dated options that help explain volatility smirk. Xing et al. (2010) studies the relationship between implied volatility smirks and the cross-section of stock returns. They show that the difference between the implied volatility of out-of-money put options and at-the- money call options shows strong predicting power for future stock returns. Yan (2011) show that jump size proxied by the slope of volatility smile predicts the cross-section of stock returns. The present study uses option information as one set of variables in predicting crashes, thus exploiting a far richer information set. The third strand of literature on crash risk directly predicts the probability of crashes. Chen 4 et al. (2001) employs cross-sectional regressions to forecast the skewness of daily stock returns. Campbell et al. (2008) use a dynamic logit model to predict distress probabilities for the cross- section of firms. Conrad et al. (2014) show that high distress risk stocks are also likely to become jackpots. They use a logit model to predict the probability of deaths and jackpots. Jang and Kang (2019) exploits a multinomial logit model to jointly predict probabilities of crashes and jackpots at an annual horizon. This study is also related to the literature on the relationship between investor trading and market efficiency and bubble formation. De Long et al. (1990a), De Long et al. (1990b), and Abreu and Brunnermeier (2003) provide the theoretical support to and empirical evidence of positive feedback traders and their potential impact on market. Retail investors are believed to be “noise traders” that trade too much (Barber and Odean, 2000). Speculative retail traders tend to chase lottery-like stocks, experiencing subsequent negative trading alpha, and affect stock prices accordingly (Han and Kumar, 2013). 
Recent evidence from “Robinhood Traders” shows that they tend to herd more on extreme past-return stocks, which are more attention-grabbing (Barber et al., 2021), while there is also evidence that mimicking portfolios based on the characteristics of “Robinhood Traders” do not seem to underperform the market, but instead could be a market stabilizing force (Welch, 2020). On the pricing impact of retail trading, Foucault et al. (2011) was one of the first papers that use a quasi-natural experiment to identify the causal effect of retail trading on stock volatility. Finally, this study is related to the emerging literature that studies the implications and ap- plications of machine learning methodologies in asset pricing. They are mostly concerned with resolving the “factor zoo” problem (Kozak et al., 2020; Feng et al., 2020; Bianchi et al., 2021; Gu et al., 2020). 1.3 Data and Estimation of Crash Risk I use two sets of measures for ex-ante crash risk, one monthly measure, and one daily measure. The monthly measure is the ex-ante probability of stock crashing in a certain month, while the daily measure 𝑆𝐾 𝐸𝑊 is motivated by Xing et al. (2010), and defined as the difference between the 5 implied volatility of out-of-money put option and that of the at-the-money call option.2 I will start by describing the monthly measure and defer the discussion of the daily measure to Section 1.7. 1.3.1 Estimation of Monthly Ex-Ante Crash Risk I define firm-level crashes as stock monthly log returns lower than -20%. The choice is reasonable in the following sense. Prior literature uses log annual returns of -70% as the cutoff points (Conrad et al., 2014; Jang and Kang, 2019). The unconditional probabilities of crashes defined this way at the annual frequency are roughly 5%. At a monthly frequency, a cutoff point at -20% agrees with this distribution. Thus the universe of stock returns falls into two categories – crashes and otherwise. Then the monthly ex-ante crash risk is defined as follows: 𝐶𝑟𝑎𝑠ℎ𝑅𝑖𝑠𝑘 𝑖,𝑡 = 𝐸 [𝑃(𝑟𝑖,𝑡 < −20%)|𝑋𝑖,𝑡− 𝑗 ] (1.1) Where 𝑟 is the monthly log return. 𝑗 ∈ [1, 2, 3, 4, 5, 6] is the months in each training window, or in other words, the period we draw conditional information. 𝑋 is a set of firm-level predictors. Estimating the ex-ante probabilities of future crashes naturally calls for a logistic regression, where the dependent variable is a binary response 𝐷 𝑐𝑟𝑎𝑠ℎ , where it equals one if the log monthly return is lower than -20%, and zero otherwise. A critical issue arises, however, in forecasting rare events such as crashes. The usual logistic estimator could produce suboptimal results due to the poor finite sample properties (King and Zeng, 2001). I provide a simple intuition for this argument in Appendix ??. Though the difficulties and the associated statistical issues in forecasting rare events are rarely studied in economics, the remedy is readily available in machine learning literature. I follow Jiang et al. (2020) and introduce an Ensemble method, “Easy Ensemble” (EEC), that combines random undersampling and bootstrapping (Liu et al., 2008) to supplement the logistic regression approach. A detailed discussion of this technique can be found in Appendix ??. To estimate the ex-ante probabilities of a crash, it is essential to conduct out-of-sample proce- dures. Thus I use a rolling window of 6 months to estimate parameters and fit the following month to produce an OOS estimate of crash risk. With respect to the independent variables, in a slight 2 The 𝑆𝐾 𝐸𝑊 measure by Xing et al. 
(2010) is widely used in the corporate finance literature as a proxy for firm crash risk. See for example... 6 departure from prior literature, I choose a large set of characteristics that have been shown as return predictors as the independent variables in the estimation process. Specifically, I use variables ob- tained from Chen and Zimmermann (2021). These are monthly firm-level characteristics that have been shown in the literature as important drivers of future returns, and these variables encompass all variables that were considered as predictors of crashes (Campbell et al., 2008; Conrad et al., 2014; Jang and Kang, 2019). I limit the data scope to between 1996 and 2020, both to reduce the computation load and to ensure maximum data usage, as some variables are only available from 1996 (for example, option variables). Therefore, with 6-month rolling windows for training, our out-of-sample prediction starts from July 1996 to December 2020, comprised of 294 months. I use CRSP for monthly stock returns. I require common stocks with a share code of 10 or 11 and with prior month-end stock prices greater than $5 to avoid extreme outliers. Next, I compare the usual logistic estimator with the EasyEnsemble method in forecasting performances. To illustrate the performance difference, I conduct the following experiment. For the whole sample, I plot the percentages of real crashes predicted by either model against a decision threshold from zero to one, meaning that at each threshold, all stocks with a predicted probability higher than that would be labeled “crash”. The results are shown in Figure 1.1. Note that EasyEnsemble outperforms logistic regression in the low threshold region. This result is desirable because we know that crashes are low-probability events (the unconditional probability of a crash is around 5-6%), and we want the classifier to do well in this region. For example, at the 7% threshold, meaning that we predict all stocks with a probability estimate greater than 7% to crash in the next month, logistic regression is able to capture 72% of all real crashes, while EasyEnsemble is able to capture 85%. 1.3.2 Summary Statistics Given the refined estimate of monthly ex-ante crash risk, we can examine its relationship with firm characteristics. In particular, we are interested in the relationship between the risk and the underlying regressors. We summarize the relationship between the machine learning- 7 Figure 1.1 Out-of-Sample Predicted Crashes by Thresholds The figure depicts the total percentage of out-of-sample predicted crashes for logistic regression and EasyEnsemble against decision thresholds. The X-axis is the decision threshold from zero to one. The Y-axis gives the percentage of real crashes successfully predicted by either model based on the decision threshold. For example, at the 7% threshold, meaning that we predict all stocks with a probability greater than 7% to crash in the next month, logistic regression is able to catch 72% of all real crashes, while EasyEnsemble is able to catch 85%. 8 generated crash risk and the top regressors in Appendix. The summary statistics of both logit- generated ex-ante crash risk and machine learning-generated crash risk, along with all relevant stock characteristics and other data used in later analyses, are presented in Table 2.1. 
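To make the rolling six-month, out-of-sample estimation concrete, the sketch below illustrates one way the two classifiers could be combined; it is a minimal illustration rather than the dissertation's actual code. The DataFrame `panel`, the column names ('permno', 'month', 'crash'), and the predictor list `features` are assumed placeholders. The EasyEnsembleClassifier from the imbalanced-learn package is one available implementation of the Liu et al. (2008) random-undersampling-plus-boosting idea; its default base learner is AdaBoost, matching the "EEC-AdaBoost" label used here.

```python
# Minimal sketch of the rolling out-of-sample crash-probability estimation.
# Assumptions: `panel` has columns 'permno', 'month', 'crash' (1 if log monthly
# return < -20%, else 0) plus the predictor columns listed in `features`.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from imblearn.ensemble import EasyEnsembleClassifier

def estimate_crash_risk(panel, features, train_window=6):
    """Fit logit and EasyEnsemble on a rolling window; predict next-month crash probabilities."""
    months = sorted(panel['month'].unique())
    out = []
    for i in range(train_window, len(months)):
        train = panel[panel['month'].isin(months[i - train_window:i])].dropna(subset=features + ['crash'])
        test = panel[panel['month'] == months[i]].dropna(subset=features)

        logit = LogisticRegression(max_iter=1000).fit(train[features], train['crash'])
        # EasyEnsemble: repeated random undersampling of the majority (non-crash) class,
        # an AdaBoost learner on each balanced subsample, predictions averaged.
        eec = EasyEnsembleClassifier(n_estimators=10, random_state=0).fit(train[features], train['crash'])

        pred = test[['permno', 'month']].copy()
        pred['crash_logit'] = logit.predict_proba(test[features])[:, 1]
        pred['crash_eec'] = eec.predict_proba(test[features])[:, 1]
        out.append(pred)
    return pd.concat(out, ignore_index=True)
```

Under these assumptions, the two probability columns returned by this function would play the roles of the logit-generated and machine learning-generated crash risk measures summarized in Table 1.1.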
9 Table 1.1 Summary Statistics Crash1 Crash2 VaR1% VaR5% Size Beta Log(B/M) ATG GP MOM count 1,383,264 1,383,264 1,393,933 1,393,933 1,439,823 1,284,824 1,273,512 1,235,657 1,068,246 1,346,720 mean 0.09 0.10 -0.08 -0.05 5.73 1.08 -0.77 0.20 0.32 0.14 std 0.13 0.09 0.05 0.03 2.17 0.85 1.04 3.45 0.39 0.84 1% 0.00 0.00 -0.26 -0.16 1.33 -0.21 -3.75 -0.54 -0.78 -0.86 25% 0.02 0.03 -0.11 -0.07 4.14 0.50 -1.32 -0.03 0.15 -0.23 50% 0.04 0.07 -0.07 -0.04 5.62 0.93 -0.69 0.06 0.30 0.04 75% 0.11 0.14 -0.05 -0.03 7.18 1.48 -0.14 0.19 0.48 0.32 99% 0.64 0.42 0.00 0.00 11.08 3.72 1.72 2.64 1.25 2.89 ST-Rev Vol Skew TailBeta Coskew IdioRisk Illiq MaxRet IO UserNum count 1,469,593 1,466,228 1,437,111 951,654 1,356,663 1,435,097 1,345,881 1,440,263 496,204 87,456 mean 0.01 0.03 0.24 0.72 0.22 0.03 4.72 0.08 0.41 3418.46 std 0.20 0.03 1.00 0.58 0.29 0.03 46.24 0.11 0.34 21496.13 1% -0.43 0.00 -2.64 -0.49 -0.41 0.00 0.00 0.01 0.00 6.00 25% -0.07 0.02 -0.28 0.37 0.04 0.01 0.00 0.03 0.09 95.00 50% 0.00 0.03 0.20 0.63 0.20 0.02 0.03 0.06 0.36 319.00 75% 0.07 0.04 0.72 0.99 0.37 0.04 0.55 0.10 0.71 1161.00 99% 0.62 0.15 3.25 2.52 1.09 0.15 86.39 0.45 1.09 64691.10 This table reports the summary statistics of our main variable ex-ante monthly crash risk and other firm characteristics used later in our analyses. There are two sets of crash risk estimates. 𝐶𝑟𝑎𝑠ℎ1 is estimated by logit regression, and 𝐶𝑟𝑎𝑠ℎ2 by machine learning (EEC-Adaboost). To differentiate our measure from the left-tail measure 𝑉 𝑎𝑅 in Atilgan et al. (2020), we also include their measure. 𝑉 𝑎𝑅1% is defined as the 1 percentile daily return of the stock in the past year, while 𝑉 𝑎𝑅5% is the 5 percentile daily return of the stock in the past year. Other variables include the natural log of market capitalizations (𝑆𝑖𝑧𝑒), the natural log of book-to-market ratio, asset growth (𝐴𝑇𝐺), gross profitability (𝐺𝑃), momentum (prior 11-to-1 month returns, 𝑀𝑂 𝑀), and short-term reversal (prior 1-month returns, 𝑆𝑇 − 𝑅𝑒𝑣), idiosyncratic volatility, illiquidity (Amihud, 2002), market beta, tail Beta (Kelly and Jiang, 2014), coskewness(Harvey and Siddique, 2000), MAX (Bali et al., 2011). 𝐼𝑂 is the institutional ownership for each stock, measured at the quarterly frequency. 𝑈𝑠𝑒𝑟 𝑁𝑢𝑚 is the total number of users for each stock on Robinhood by the end of each month. The sample starts from July 1996 to December 2021, except for Robinhood user numbers where it is limited to between May 2018 and August 2020 due to the data availability of Robintrack (https://robintrack.net/). 10 On top of firm-level crashes, the aggregate probability of a market crash is of great interest to researchers and practitioners alike. Although one can argue that the aggregate stock market crash is systematic, while firm-level crashes are more idiosyncratic in nature, aggregating firm-level crash probabilities might still contain information about the aggregate crash risk. One possible reason for this logic is that we use a fixed threshold (-20% log return) to define crashes, and thus aggregating these firm-level probabilities contains a systematic component. Therefore, I aggregate monthly firm-level crash risk to the market level by their lagged market capitalizations and plot the series in Figure 1.2. On top of the aggregate crash risk series, I also plot NBER recession periods (NBER, 2021) in the gray shaded areas. 
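As a concrete illustration of this value-weighted aggregation, a minimal sketch follows; it assumes a pandas DataFrame `df` with illustrative column names 'month', 'crash_risk', and 'mktcap_lag' (lagged market capitalization), which are not the dissertation's actual variable names.

```python
# Minimal sketch: aggregate firm-level crash risk to the market level each month,
# weighting each firm by its lagged market capitalization.
import pandas as pd

def aggregate_crash_risk(df):
    """Return a monthly series of lagged-market-cap-weighted average crash risk."""
    def value_weighted(group):
        weights = group['mktcap_lag'] / group['mktcap_lag'].sum()
        return (weights * group['crash_risk']).sum()
    return df.groupby('month').apply(value_weighted).rename('agg_crash_risk')
```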
Figure 1.2 Aggregate Crash Risk
The figure plots market-wide aggregate ex-ante crash probabilities from 1996 to 2020. The aggregation is done by weighting the monthly ex-ante crash risk of each firm by its lagged market capitalization as follows:
\[
\mathit{AggCrashRisk}_t = \frac{\sum_i \mathit{MarketCap}_{i,t-1} \times \mathit{CrashRisk}_{i,t}}{\sum_i \mathit{MarketCap}_{i,t-1}}
\]
The red solid line indicates the aggregate crash probabilities computed from the machine learning-generated crash probabilities, while the blue dashed line uses the logit-generated crash probabilities. The gray shaded areas indicate NBER recession periods (NBER, 2021). The time series runs from July 1996 to December 2020.

Though not immediately clear, the series does contain some information about future possibilities of market crashes, as there are signs of spikes ahead of or during recession periods. Next, we move on to examine the pricing implications of firm-level monthly crash risk.

1.4 Monthly Crash Risk and Stock Returns
In this section, I examine whether the ex-ante monthly crash risk is priced in the market. I conduct both time-series portfolio analysis and cross-sectional analysis. Prior literature (Conrad et al., 2014; Jang and Kang, 2019; Atilgan et al., 2020) has indicated that crash risk, or left-tail risk, is negatively priced in the market. Though my measure differs in its time frequency and construction, we should expect similar behavior.

1.4.1 Portfolio Analysis
At the end of each month, I sort stocks into ten decile portfolios based on their estimated ex-ante crash probabilities. Then I compute both value-weighted and equal-weighted excess returns of each portfolio and of the hedge portfolio that is long the high-crash-risk decile portfolio and short the low-crash-risk decile portfolio. I regress the time series of returns on various asset pricing factors and compute the alpha estimates and their associated T-statistics. The asset pricing models include: the CAPM, the Fama-French three-factor model (FF3) (Fama and French, 1993), then augmented with a momentum factor (FF4) (Carhart, 1997), the Fama-French five-factor model (FF5) (Fama and French, 2015), and then augmented with a momentum factor (FF6). To show the consistency of the results and the superiority of the EasyEnsemble method, I show alpha estimates using both logistic regression and EasyEnsemble in Table 1.2.

Table 1.2 Decile High-Minus-Low Portfolio Alphas

                      Logit                 EEC-AdaBoost
Pricing model    Alpha     T-stat       Alpha     T-stat
Value-weighted
  CAPM          -1.852     -3.730      -1.967     -4.393
  FF3           -1.842     -4.440      -1.963     -5.456
  FF4           -1.533     -3.531      -1.775     -4.636
  FF5           -0.874     -2.834      -1.120     -3.947
  FF6           -0.696     -2.263      -1.023     -3.442
Equal-weighted
  CAPM          -2.470     -5.571      -2.458     -5.325
  FF3           -2.461     -7.941      -2.452     -7.573
  FF4           -2.106     -7.161      -2.173     -7.005
  FF5           -1.656     -5.637      -1.783     -6.093
  FF6           -1.438     -5.788      -1.614     -5.947

Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01
This table presents the analysis of portfolios sorted on the ex-ante crash risk measures estimated by both logit and machine learning (EEC-AdaBoost). At the end of each month, stocks are ranked by their ex-ante crash probabilities produced by either logit or machine learning into ten decile portfolios. Then we compute both equal-weighted portfolio returns and value-weighted returns weighted by lagged market capitalization. The hedge portfolio is long the top-decile ex-ante crash risk portfolio and short the bottom-decile crash risk portfolio. Then the hedge portfolio return series are regressed on risk factor returns from various empirical asset pricing models.
The asset pricing models include: CAPM, Fama-French three-factor model (FF3) (Fama and French, 1993), then augmented with a momentum factor (FF4) (Carhart, 1997), Fama-French five-factor model (FF5) (Fama and French, 2015), and then augmented with momentum factor (FF6). Then we report the resulting intercepts (alphas) and their associated 𝑇-statistics. The upper panel presents results from using value-weighted portfolio returns, while the lower panel presents equal-weighted results. The left half shows results from using ex-ante crash risk estimated from logistic regressions, and the right from machine learning (EEC-AdaBoost). Standard errors are adjusted using the Newey-West procedure (Newey and West, 1986) with 6 lags. As shown in Table 1.2, when we long top crash risk decile portfolio and short bottom decile portfolio, we produce consistent and significant negative alphas across different asset pricing models, equal-weighted or value-weighted, with 𝑇-statistics of magnitude well over 3. Note also that when we compare the results from using logit-generated crash risk and machine learning- 13 generated crash risk, the latter shows superiority in both the magnitude of alpha and the 𝑇-statistics. This is a strong piece of evidence that machine learning not only produces consistent results with conventional methods but also demonstrates better forecasting efficacy, as it classifies correctly more actual crashes that contribute to lower returns in the subsequent month. 1.4.2 Cross-Sectional Regressions Next, I run Fama-MacBeth cross-sectional regressions (Fama and MacBeth, 1973a) following the procedure in Fama and French (2020). Each month, I regress raw stock returns on cross- sectionally standardized lagged firm characteristics. Then I average the coefficients to arrive at the final estimates. The coefficients on characteristics can be directly interpreted as average priced return spread for one standard deviation increase of the corresponding firm risk. I include common risk characteristics such as the natural log of market capitalizations, natural log of book-to-market ratio, asset growth, gross profitability, momentum (prior 11-to-1 month returns), short-term reversal (prior 1-month returns), and my estimated crash probabilities from the Ensemble method. On top of these variables, I control for a set of anomaly characteristics that are shown to be significantly correlated with future stock returns: idiosyncratic volatility, illiquidity (Amihud, 2002), market beta, tail Beta (Kelly and Jiang, 2014), coskewness(Harvey and Siddique, 2000), and net operating assets 𝑁𝑂 𝐴 (Hirshleifer et al., 2004). Bali et al. (2011) proposes a measure 𝑀 𝐴𝑋 that represents investors’ preference for lottery-like payoffs. 𝑀 𝐴𝑋 stands for the maximum daily return achieved by each stock in the prior month. To see if the estimated crash risk carries additional information that distinguishes it from 𝑀 𝐴𝑋, I add the 𝑀 𝐴𝑋 measure as a control variable in the Fama-MacBeth regressions. Atilgan et al. (2020) also studies the left-tail risk, although their measure is constructed differ- ently. Their “value-at-risk” (𝑉 𝑎𝑅) is entirely based on historical returns and is defined as the return conditioning on probability distribution, which differs from our measure that takes return cutoff as given and estimates ex-ante probabilities. To see whether our crash risk contains incremental information about future stock returns than the 𝑉 𝑎𝑅 measure, I include 𝑉 𝑎𝑅 as a control variable. 
The 𝑉 𝑎𝑅 measure is the negative of 1 percentile daily return of the stock in the past year. I report 14 the regression results in Table 1.3. Table 1.3 Fama-MacBeth Cross-Sectional Regressions (1) (2) (3) (4) (5) Dependent Variable: Returns in % Crash Risk (Logit) -0.491*** -0.453*** (0.080) (0.077) Crash Risk (EEC) -0.507*** -0.459*** (0.097) (0.086) VaR1% -0.123 -0.097 -0.246*** (0.082) (0.074) (0.083) Controls YES YES YES YES YES Observations 545,367 545,290 545,367 545,290 564,466 R-squared 0.083 0.086 0.083 0.085 0.084 Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table reports Fama-MacBeth cross-sectional regressions of raw returns on ex-ante crash risk and lagged firm characteristics in the spirit of Fama and French (2020). First, we regress monthly stock returns of each month on lagged firm characteristics. Then we average the coefficients and report the associated standard errors. Our main variables of interest are the two ex-ante crash risk measures. One is estimated by logit regression, and the other by machine learning (EEC-Adaboost). Columns (1) and (2) use the logit-generated crash risk as the main variable, while Columns (3) and (4) use the machine learning-generated crash risk. To differentiate our measure from the left-tail measure 𝑉 𝑎𝑅 in Atilgan et al. (2020), I include their measure in Columns (2) and (4) as a control variable, where 𝑉 𝑎𝑅1% is defined as the negative of 1 percentile daily return of the stock in the past year. In Column (5), I only include 𝑉 𝑎𝑅1% as the sole variable to proxy for left-tail risk to ensure that our results are consistent with Atilgan et al. (2020). Other control variables include the natural log of market capitalizations, the natural log of book-to-market ratio, asset growth, gross profitability, momentum (prior 11-to-1 month returns), and short-term reversal (prior 1-month returns). In Column (3), I add idiosyncratic volatility and illiquidity (Amihud, 2002). In Column (4), I add market beta, tail Beta (Kelly and Jiang, 2014), coskewness(Harvey and Siddique, 2000), net operating assets 𝑁𝑂 𝐴 (Hirshleifer et al., 2004), and MAX (Bali et al., 2011). All independent variables are standardized cross-sectionally each month to be mean zero and standard deviation of unity, such that the coefficients on all the independent variables can be directly read as the percentage increase in average stock returns if the underlying independent variable increase by one standard deviation. Standard errors are adjusted according to Newey-West procedures (Newey and West, 1986) with 6 lags. Table 1.3 suggests several points. First, both logit-generated ex-ante crash risk and machine learning-generated crash risk are significantly and negatively correlated with future stock returns, 15 and their magnitudes are very similar to each other. Second, the loadings on crash risk are robust even after controlling for common risk characteristics and go beyond a plethora of tail risk-related variables, including the lottery-payoff proxy 𝑀 𝐴𝑋 (Bali et al., 2011). Third, when our crash risk is not included in the regression, the 𝑉 𝑎𝑅 measure is significantly and negatively correlated with future stock returns, consistent with the results in Atilgan et al. (2020). However, when our crash risk is included in the regression, the loading on 𝑉 𝑎𝑅 becomes insignificant, while our crash risk measure loads negative and significant consistently. 
This suggests that both our logit- generated and machine learning-generated crash risk measures contain more information than 𝑉 𝑎𝑅 and consequently subsume its effect. Depending on the control variables and the measure we use, a one-standard-deviation increase in ex-ante monthly crash risk is associated with approximately a 45-51 bps drop in subsequent risk-adjust returns, which translates into -5.47% to -6.12% in annual risk-adjusted returns. These results corroborate the prior literature that ex-ante crash risk is negatively priced, and also provide strong evidence that our crash risk measure contains richer and incremental information than existing crash risk measures. 1.5 A Possible Economic Mechanism: Distorted Belief The negative price of crash risk does not agree with rational expectations, as a rational investor would naturally demand a positive risk premium for holding such risk. Prior literature attempts to explain the phenomenon via several arguments. One argument is the limits to arbitrage (Shleifer and Vishny, 1997; Conrad et al., 2014; Jang and Kang, 2019). They show evidence that institutional investors tend to “ride the bubble” as rational speculators, instead of trading against crash risk as rational arbitragers, since high crash risk stocks tend to be small, illiquid, and hence costly to short. The second argument is that investors underestimate the momentum in the left tail (Atilgan et al., 2020), meaning that stocks that crashed the last month may well be highly possible to continue crashing in the subsequent month. Investors somehow fail to understand this dynamic and “bought the dip”, which renders the stocks with high crash probabilities overpriced. However, it is unclear why this momentum exists. Moreover, since the 𝑉 𝑎𝑅 measured used Atilgan et al. (2020) is an ex-post measure, it does not answer the question from an investor behavior perspective. A third 16 argument pertains to the observation that stocks with extreme past returns are attention-grabbing, and retail investors have a preference for such stocks (Barber and Odean, 2008; Barber et al., 2021). However, it is reasonable to assume that investors are drawn to extreme past winners, as they might be over-extrapolating past returns. It is nonetheless puzzling why investors should prefer extreme past losers. Moreover, it is difficult to understand why investors should prefer high left-tail stocks. Even if they underestimate the momentum in the left tail, these are undesirable stocks from a risk-return tradeoff standpoint. In addition, over time, investors should be able to learn from past observations that high crash risk stocks are overpriced, as many of them indeed crashed in the subsequent month. The literature in behavioral theories provides valuable guidance in terms of investor beliefs and preferences towards crash risk or left tail risk. Two theories, in particular, have clear predictions about investors’ attitudes towards the left tail. One is cumulative prospect theory (CPT) by Barberis and Huang (2008). They show that investors with a CPT preference would overweight small probability events. One example is that people would gamble on slim chances of big payoffs, but buy insurance for plane crashes. The implication is that investors with CPT preference should shun high crash risk stocks since they effectively deem those crashes more likely to happen than the true distribution. 
If all investors have such a preference, high crash risk stocks should be underpriced, and thus produce a positive risk-adjusted return. This prediction does not seem to conform to the empirical observation. The second theory is the optimal expectations theory (OET) by Brunnermeier et al. (2007). They show that investors may derive anticipatory utility when holding an optimistic subjective belief about stock returns, even though such beliefs prove to be wrong afterward. If investors hold such a belief, they would effectively shift their subjective return distribution to the right when their sentiment is high. The implication is that when sentiment is high, investors with such beliefs tend to think that crashes are less likely than reality, and thus overbuy high crash risk stocks. The pricing implication is that crash risk or left tail risk is overpriced and thus predicts a negative risk-adjusted return. 17 The evidence presented in this paper and the prior literature for the negative price of crash risk agrees with the optimal expectations theory. To further establish evidence as to whether investors overbuy high crash risk stocks when their sentiment is high, I conducted several tests to provide additional evidence. 1.5.1 Crash Risk Portfolio Returns and Sentiment First, I examine the relationship between the crash risk hedge portfolio returns and sentiment. If investors hold optimal expectations, then the loss on the crash risk high-minus-low hedge portfolio would be higher when lagged sentiment is high, since investors’ belief distortion would be more severe during such periods. I follow Baker and Wurgler (2006) and use their sentiment index as a proxy for the market- wide sentiment. In particular, I use the sentiment measure that is orthogonal to macroeconomic indicators to alleviate the impact of market risks. Since their index is available up to the year 2018, my sample is hence limited between July 1996 and December 2018. Then I divide the sample period into two subperiods, where one is the high sentiment period when sentiment is higher than the median value of the whole sample, and another is the low sentiment period. Then I compute the excess returns of the top decile portfolio, the bottom decile portfolio, and the long-short hedge portfolio that long high crash risk stocks and short crash risk stocks, in each of the subperiods. I then compute the differences in these returns between high and low sentiment periods. The results are summarized in Panel A of Table 1.4. It is immediately clear from the table that the high-crash-risk stocks experience the lowest returns after a high sentiment period when mispricing is most severe, while they do not show negative returns on average after low sentiment months. On the other hand, there is no statistically significant difference between high and low sentiment periods for low-crash-risk stocks. On the whole, a long-short strategy that is long high-crash-risk stocks and short low-crash-risk stocks produces more negative and significant excess returns after high sentiment months. These results are consistent with our hypothesis that when investors are bullish, they are more likely to overbuy high-crash-risk stocks, and thereby the expected returns of these stocks would be lower. 
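A minimal sketch of this sub-period comparison follows. It assumes a monthly DataFrame `ts` holding the hedge-portfolio excess return ('ls_ret') and the lagged Baker-Wurgler index ('sent_lag'); the column names are illustrative, and whereas the paper reports Newey-West adjusted statistics, this simplified sketch uses a plain two-sample t-test.

```python
# Minimal sketch: compare hedge-portfolio returns after high vs. low lagged-sentiment months.
import pandas as pd
from scipy import stats

def sentiment_split(ts):
    """Split months by the median lagged sentiment and compare mean long-short returns."""
    median_sent = ts['sent_lag'].median()
    high = ts.loc[ts['sent_lag'] > median_sent, 'ls_ret']
    low = ts.loc[ts['sent_lag'] <= median_sent, 'ls_ret']
    tstat, pval = stats.ttest_ind(high, low, equal_var=False)  # simplified inference
    return {'mean_high': high.mean(), 'mean_low': low.mean(),
            'diff': high.mean() - low.mean(), 't_stat': tstat, 'p_value': pval}
```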
18 Table 1.4 Sentiment and Crash Risk Returns Panel A: Portfolio Excess Returns and Sentiment High Sent Low Sent High-Low Low crash risk 0.597* 0.812** -0.215 (0.314) (0.309) 0.436 High crash risk -1.849* 0.879 -2.728** (1.008) (0.880) (1.280) Long-short -2.446** 0.067 -2.513** (0.943) (0.709) (1.144) Panel B: Price of Crash Risk and Sentiment (1) (2) (3) (4) FMB Panel VARIABLES Low Sent High Sent Return Return Crash Risk -0.405*** -0.619*** -0.335*** -0.135** (0.108) (0.141) (0.050) (0.062) SentmentD×Crash Risk -0.374*** (0.063) Controls YES YES YES YES Observations 240,805 269,577 545,227 510,260 R-squared 0.078 0.085 0.168 0.159 Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table presents the relationship between the price of crash risk and sentiment. Panel A reports the value-weighted portfolio excess returns for high-crash-risk, low-crash-risk, and long-short hedge portfolios in both high sentiment and low sentiment periods and their differences. Market- wide sentiment is defined by Baker and Wurgler (2006). Our sample is limited between July 1996 and December 2018 because of the availability of the index. High and low sentiment periods are defined as either above or below median sentiment over the sample period. Panel B reports regression results. In Columns (1) and (2), we run Fama-MacBeth cross-sectional regressions of stock returns on crash risk and lagged firm characteristics for high- and low-lag-sentiment periods separately. Control variables include all the firm characteristics used in Table 1.3. Standard errors are estimated according to Newey-West procedure (Newey and West, 1986) with 6 lags. In Columns (3) and (4), we run panel regressions of stock returns on the same independent variables as the previous specification, with firm and time fixed effects. In Column (4), we add a dummy variable 𝑆𝑒𝑛𝑡𝐷, where it equals one if the lagged sentiment is higher than the sample median, and zero otherwise. We interact 𝑆𝑒𝑛𝑡𝐷 with crash risk, and hence the coefficient on the interaction term can be interpreted as the incremental price of crash risk when lagged sentiment is high. All independent variables are standardized cross-sectionally each month to be mean zero and standard deviation of unity. Standard errors are clustered at the firm level. 19 To further examine the relationship between crash risk and sentiment, I run Fama-MacBeth regressions and panel regressions of stock returns on firm characteristics for high- and low-lag- sentiment months separately. The hypothesis is that the price of crash risk should be more negative immediately after high sentiment months. As before, high sentiment months are defined as those months with lag sentiment higher than the sample median, and low sentiment months are defined as those months with lag sentiment lower than the sample median. The results are reported in the first two columns in Panel B of Table 1.4. We can see from the table that when lagged sentiment is high, the coefficient on crash risk is -0.619%, compared to -0.405% when lagged sentiment is low. In other words, the price of crash risk associated with a one-standard-deviation increase in the risk is 21 bps lower when lagged sentiment is high. Though the difference between the two coefficients is not statistically significant (𝑇-statistic of -1.2), the annualized return difference is large at -2.52%. This is another piece of evidence that high crash risk stocks are more overpriced when lagged sentiment is high. To further assess this phenomenon, I also conduct the following analysis. 
I define a dummy variable 𝑆𝑒𝑛𝑡𝐷 that equals one if the lagged sentiment is higher than the sample median, and zero otherwise. I first run a panel regression of stock returns on crash risk and other firm characteristics, with firm and time fixed effects. Then I include the 𝑆𝑒𝑛𝑡𝐷 variable and interact it with crash risk. The hypothesis is that the interaction term should be significantly negative, since when lagged sentiment is high, the overpricing of high crash risk stocks should be more severe. I report the results in Columns (3) and (4) in Panel B of Table 1.4. As shown in the table, even after including firm and time fixed effects, the ex-ante crash risk is consistently priced negatively, albeit with a smaller magnitude. In Column (4), when we interact the sentiment dummy with crash risk, the loading on crash risk is much smaller in magnitude and statistically significant at the 5% level, while the coefficient on the interaction term is negative, larger in magnitude, and statistically significant at the 1% level. These results are consistent with our hypothesis that when lagged sentiment is high, investors buy more high crash risk stocks, which pushes the overpricing of these stocks even higher, and therefore the subsequent returns turn out to be much lower than in low lagged sentiment periods.

1.5.2 Trades on Crash Risk
Next, we examine whether some investors are likely to buy high-crash-risk stocks. This hypothesis is the underlying assumption of the previous literature that high-crash-risk stocks are overpriced and is an implication of Brunnermeier et al. (2007). To explore this hypothesis, I first use Robintrack data to construct a retail trading measure and examine whether retail investors tend to buy high-crash-risk stocks. [Footnote 3: Robintrack: https://www.robintrack.net/.] As has been extensively discussed in Barber et al. (2021) and Welch (2020), Robintrack data contains hourly stock popularity numbers that measure how many users on Robinhood hold a particular stock at a certain hour. Since we cannot observe the number of shares they hold for each stock, and there is no data on the total number of users in each time period, the next best solution is to measure the change in the number of users for each stock. As crash risk is estimated at a monthly frequency, I use month-end numbers of Robinhood users to merge the data. I first construct a log measure for Robinhood trading:
\[
\mathit{Change\ in\ Log}(\#\mathit{User}_{i,t}) = \log(\#\mathit{User}_{i,t}) - \log(\#\mathit{User}_{i,t-1}) \tag{1.2}
\]
Then I follow Barber et al. (2021) and construct a percentage change measure for Robinhood trading:
\[
\%\mathit{Change}\,\#\mathit{User}_{i,t} = \#\mathit{User}_{i,t} / \#\mathit{User}_{i,t-1} - 1 \tag{1.3}
\]
where 𝑡 is at the monthly frequency to match the frequency of our ex-ante crash risk measures. The specification is as follows:
\[
\mathit{Robinhood\ Trade}_{i,t} = \alpha_0 + \beta \times \mathit{CrashRisk}_{i,t} + \sum_p \beta_p \mathit{Control}_{p,i,t-1} + \alpha_i + \lambda_t + \epsilon_{i,t} \tag{1.4}
\]
where firm and time fixed effects account for unobserved heterogeneity that might be correlated with the error term. The Robinhood sample runs from May 2018 to August 2020. I regress the Robinhood trading measures on both measures of ex-ante crash risk, controlling for the lagged log of the user number and a set of firm characteristics. The results are reported in Columns (1) to (4) of Table 1.5.
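To illustrate specification (1.4), a minimal sketch using the linearmodels package is shown below. The DataFrame `df` and its column names ('permno', 'month', 'rh_log_change', 'crash_risk', and the control names) are illustrative assumptions, not the dissertation's actual code.

```python
# Minimal sketch of equation (1.4): regress a Robinhood trading measure on crash risk
# and lagged controls with firm and time fixed effects.
import pandas as pd
from linearmodels.panel import PanelOLS

def robinhood_fe_regression(df, controls):
    """Two-way fixed-effects regression of the Robinhood trading measure on crash risk."""
    data = df.set_index(['permno', 'month'])  # entity-time MultiIndex required by PanelOLS
    formula = ('rh_log_change ~ 1 + crash_risk + ' + ' + '.join(controls)
               + ' + EntityEffects + TimeEffects')  # firm and time fixed effects
    model = PanelOLS.from_formula(formula, data=data)
    return model.fit(cov_type='robust')  # heteroskedasticity-robust standard errors
```

The same function applied to the percentage-change measure, or to quarterly changes in institutional ownership, would correspond to the other columns of Table 1.5.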
Table 1.5 Investor Trading and Crash Risk (1) (2) (3) (4) (5) (6) VARIABLES Change in Log(User) User%Change IO Change Crash Risk (Logit) 0.093*** 0.154*** -0.026*** (0.010) (0.020) (0.002) Crash Risk (EEC) 0.104*** 0.156*** -0.013*** (0.016) (0.028) (0.003) Controls YES YES YES YES YES YES Observations 63,692 63,692 63,692 63,692 375,339 375,339 R-squared 0.241 0.240 0.191 0.190 0.500 0.500 Firm & Time FE YES YES YES YES YES YES Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table presents results from regressing Robinhood user trading measures and institutional trading measures on crash risk, controlling for other firm characteristics. The first Robinhood user trading measure is the monthly change of the natural log of user numbers holding a particular stock, where the user numbers are from the online brokerage Robinhood (Robintrack). The second Robinhood user trading measure is the percentage change in the number of users over the previous month. The institutional trading measure is the quarterly change in the ratio of institutional holding for each stock. We regress all of these trading measures on the contemporaneous crash risk measures constructed from both logit regressions and the machine learning method (EEC- AdaBoost). Columns (1) to (4) add lagged log of the number of users as a control variable. For all specifications, the control variables include the natural log of market capitalization, the natural log of book-to-market ratio, asset growth, gross profitability, momentum, short-term reversal, MAX and MIN (Bali et al., 2011), defined as the highest and lowest daily returns of the previous month, total skewness of daily returns in the previous month, illiquidity (Amihud, 2002), and Fama-French three-factor betas estimated from the previous month. Firm and Time fixed effects are included, and robust standard errors are included in parentheses. The table shows that over the sample period when Robinhood data is available, retail investors on average tend to buy high-crash-risk stocks, consistent with our hypothesis. Importantly, in all specifications, we control for such commonly used lottery characteristics as MAX and MIN (Bali et al., 2011), which are defined as maximum and minimum daily returns of the previous month, and total skewness of the previous month. The coefficient on crash risk is consistently and significantly 22 positive in Robinhood trading tests, meaning that retail preference for high-crash-risk stocks goes beyond the conventional proxies for lottery characteristics defined in the literature (Barberis and Huang, 2008; Bali et al., 2011). A related question arises as to whether institutional investors would be liquidity providers and act as counterparties since literature has shown that they are reluctant to short the left tail, and would rather ride the bubble. I examine this issue by regressing the change of institutional holdings on the same set of characteristics. The institutional holdings data comes from Thomson Reuters 13F filings data and is defined as the percentage of shares held by institutional investors. The change in the holdings is the difference between the current quarter’s holdings and the previous quarter’s. The results are shown in Columns (5) and (6) in Table 1.5. The results show that there is strong evidence that institutions might be the counterparty of retail investors for crash risk. 
In sharp contrast to Robinhood trading results, the coefficients on both crash risk measures are negative and statistically significant for institutional trading tests. Taken together, these results support the hypothesis that retail traders derive anticipatory utilities from distorted subjective beliefs. Consistent with the predictions in Brunnermeier et al. (2007), when lagged sentiment is high, investors underestimate left-tail risks and tend to overbuy stocks with high crash risk, which in turn drives up their prices, leading to lower expected returns subsequently. Both the pricing results and retail trading results conform to this theory. 1.6 Retail Influence on Monthly Crash Risk Evidence from the previous section shows that retail investors tend to buy high ex-ante crash risk stocks, and this effect is over and beyond the effect of the usual proxies for lottery characteristics. These buying activities could be inconsequential if retail investors are pure “noise traders” (De Long et al., 1990a), as their trades are idiosyncratic and would be canceled out on average. However, when their trades are correlated because of attention or herding, they could forecast subsequent returns (Barber and Odean, 2008; Barber et al., 2021). Social media is instrumental in facilitating herding behavior, as it transmits trading strategies more efficiently. As implied in Han et al. (2022), there is an inherent feedback loop in correlated trading and asset prices. When investors (receivers) 23 take note of other investors’ (senders) recent trading success, as demonstrated by their bragging on social media about the high recent returns of their stock picks, they continue to trade in the same direction, thus pushing the stock price even higher. The implication is that regardless of whether investors display a preference for skewness, their trading actions would produce such results and influence stock prices. There is causal evidence that suggests higher participation by retail investors does induce higher stock volatility (Foucault et al., 2011). They may be marginal price setters for small stocks (Graham and Kumar, 2006). Retail short sellers predict negative future returns, and they seem to have superior knowledge of small firm fundamentals (Kelley and Tetlock, 2017). Much of the literature focuses on predictive tests, as it is extremely difficult to find ideal settings for the proper identification of causality. I explore a particular shock to the retail attention and herding channel that might have influenced retail investors’ trading behavior, which in turn could drive the change in the crash risk of the underlying stocks. 1.6.1 The Advent of Wallstreetbets “Wallstreetbets” is a “Subreddit” on the social media platform “Reddit”, and has garnered considerable attention from the investment community largely because of the “GameStop” saga. The Subreddit started in April 2012, and today it has over 12 million subscribers. These subscribers call themselves “degenerates”, and frequently exchange trading ideas and post their gains and losses. In a recent study, Hu et al. (2021) shows that conversations on “Wallstreetbets” have information content that predicts next-day returns. A study from a different discipline, Li and Wu (2018) shows that retailers displaying past sales numbers can induce consumers to herd and buy more of the products. These studies suggest that social media as a platform for idea sharing can facilitate more efficient herding. 
Therefore, it is conceivable that the advent of a highly efficient platform for sharing ideas might affect asset prices, including the crash risk of the underlying stocks, following the results that retail investors exacerbate the overpricing of high-crash-risk stocks. I examine this issue by tracing back to the origin of “Wallstreetbets” when it was founded in 24 April 2012. I obtain and process all posts from April 2012 when the Subreddit started till December 2020, and find out all stock tickers that were mentioned in these posts.4 I drop all ticker names that are also common English words, slang, and abbreviations. To illustrate the growing community on “Wallstreetbets”, I plot the number of posts each month that mention ticker names, and also the number of unique ticker names mentioned each month in Figure 1.3. Panel A of Figure 1.3 plots the number of posts that mention ticker names on the Subreddit “Wallstreetbets”, and Panel B plots the number of unique tickers/firms each month. The time series spans from April 2012 when “Wallstreetbets” was started to December 2020. It shows that the activities on “Wallstreetbets” exploded after the pandemic began in 2020. It also shows the growing breadth of retail investor interests in the number of stocks. 1.6.2 The Staggered First Appearances of Stock Tickers Members started to mention stocks in their posts on “Wallstreetbets” on the first day of the Subreddit. According to Han et al. (2022), people are more likely to mention certain stocks if these stocks happen to have high past returns. If other people see these posts, they are more likely to follow suit and trade in the same direction. This could in turn affect the stock returns. To test this hypothesis, I focus on the seven-month window around each “event”, where “event” means a stock ticker appeared for the first time on “Wallstreetbets”. Thus there are three months pre-event, and three months post-event. Since conversations about stocks are not exogenous per se, we need a matching strategy and control variables that can offset the endogenous portion of the test. Therefore, I use propensity score matching by running logistic regression. The response is a dummy variable 𝐷 𝑖,𝑡 = 1 if a stock 𝑖 appears on “Wallstreetbets” for the first time at time 𝑡, or zero otherwise. The independent variables include lagged market capitalization, prior-month return, asset growth, book-to-market ratio, gross profitability, idiosyncratic risk, illiquidity, MAX, and prior 12-month return, to proxy for the common stock characteristics. The estimated parameters are then fit to the whole sample to generate fitted values as the propensity score for each stock at each point in time. To match each event, I use the score generated 4 The complete history of Reddit comments data comes from https://files.pushshift.io/reddit/comments/. 25 Figure 1.3 Monthly Number of Posts and Unique Tickers on “Wallstreetbets” The figure plots the total number of posts each month on the Subreddit “Wallstreetbets” that mention stock ticker names, and also the number of unique ticker names mentioned each month in Panel A and Panel B respectively. For a ticker to be counted, it must not be common English words, slang, or abbreviations. The time series spans from April 2012 when “Wallstreetbets” was established to December 2020. 
26 for each “never treated” stock three months prior to the event and find five stocks that have the closest propensity scores to each treated stock.5 After the matching process, I follow Gormley and Matsa (2011); Cengiz et al. (2019) and stack each event cohort, where each cohort contains the treated stock and the matched sample. Then I run the following specification: ∑︁ 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 = 𝛾0 + 𝛽𝐷 𝑖,𝑐,𝑡 + 𝛿𝑐,𝑡 + 𝛼𝑖,𝑐 + 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙 𝑝,𝑖,𝑡−1 + 𝜖𝑖,𝑡 (1.5) 𝑝 Where 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 is the estimated crash risk of stock 𝑖 in cohort 𝑐 at time 𝑡. 𝐷 𝑖,𝑐,𝑡 is a dummy variable that indicates whether a stock 𝑖 in cohort 𝑐 is treated at time 𝑡. 𝛿𝑐,𝑡 is 𝐶𝑜ℎ𝑜𝑟𝑡 × 𝑇𝑖𝑚𝑒 fixed effects. 𝛼𝑖,𝑐 is 𝑈𝑛𝑖𝑡 × 𝐶𝑜ℎ𝑜𝑟𝑡 fixed effects. Then 𝛽 is the coefficient of interest that estimates the average treatment effect on the treated stocks. The results are reported in Column (1) and Column (3) of Table 1.6, where Column (3) adds control variables. The control variables include the natural log of market capitalization, prior-month return, asset growth, gross profitability, illiquidity (Amihud, 2002), MAX (Bali et al., 2011), prior 12-month return, and idiosyncratic risk. Standard errors are clustered at the unit level. When control variables are not included, there is a 1.03 percentage point estimated increase in logit-generated crash risk when a stock is first mentioned on “Wallstreetbets”, and the coefficient is highly statistically significant. When control variables are included, the magnitude reduces to approximately 56 bps, and the coefficient remains statistically significant at the 1% level. This corroborates our hypothesis that when a stock was mentioned on social media and subsequently draws more attention that possibly induces more correlated retail trading, which could increase stock crash risk. A critical assumption for the difference-in-differences analysis is the “parallel trend” assump- tion, where the treated group and the control group should not have significant differences before the event happens. To examine this “parallel-trend” assumption, I conduct a dynamic approach, where 5 “Never treated” means the stock never appears on “Wallstreetbets”. This is to ensure the cleanest matching. There are in total 2,276 unique stocks that are never mentioned on “Wallstreetbets”. 27 Table 1.6 First Appearances of Stock Tickers on “Wallstreebets” and Crash Risk (1) (2) (3) (4) (5) (6) VARIABLES Crash Risk (Logit) Crash Risk (EEC) Treated 1.032*** 0.560*** 0.674*** 0.303*** (0.103) (0.129) (0.054) (0.064) Month -3 0.009 0.001 (0.160) (0.082) Month -2 -0.041 0.041 (0.140) (0.074) Month 0 0.464*** 0.152** (0.136) (0.076) Month +1 0.326* 0.152 (0.185) (0.097) Month +2 0.689*** 0.478*** (0.199) (0.095) Month +3 0.735*** 0.508*** (0.218) (0.105) Observations 208,502 125,734 125,734 208,502 125,734 125,734 R-squared 0.874 0.909 0.909 0.921 0.946 0.946 Cohort×Units FE YES YES YES YES YES YES Cohort×Month FE YES YES YES YES YES YES Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table reports results from a “stacked difference-in-differences” approach (Gormley and Matsa, 2011) that examines the effect of first appearances of stocks tickers on “Wallstreetbets” on their ex-ante crash risk. Columns (1) to (3) use logit-generated crash risk as the dependent variable, while Columns (4) to (6) use machine learning-generated crash risk. “Wallstreetbets” was started in April 2012. 
From the beginning of “Wallstreetbets” to the end of 2020, we find all the stock tickers that are ever mentioned in the Subreddit and the first month in which they were mentioned. We then define each of these instances as one event and each of these stocks as a treated stock. We match each treated stock with five control stocks from the pool of “never treated” stocks via propensity score matching based on lagged characteristics three months prior to each event. The “cohorts” containing treated and control observations are then stacked together and the following specification is run: Crash\,Risk_{i,c,t} = \gamma_0 + \beta D_{i,c,t} + \delta_{c,t} + \alpha_{i,c} + \sum_p \beta_p\, Control_{p,i,t-1} + \epsilon_{i,t}, where Crash\,Risk_{i,c,t} is the estimated crash risk of stock i in cohort c at time t, D_{i,c,t} is a dummy variable that indicates whether stock i in cohort c is treated at time t, \delta_{c,t} is Cohort × Time fixed effects, and \alpha_{i,c} is Unit × Cohort fixed effects. \beta is the coefficient of interest that estimates the average treatment effect on the treated stocks. The results are reported in Columns (1), (2), (4), and (5), where Columns (2) and (5) add control variables. The control variables include the natural log of market capitalization, prior-month return, asset growth, gross profitability, illiquidity, MAX (Bali et al., 2011), prior 12-month return, and idiosyncratic risk. Columns (3) and (6) examine the dynamic treatment effects around the events. Standard errors are clustered at the unit level.

instead of examining the coefficient on the treatment dummy, I run the following specification:

Crash\,Risk_{i,c,t} = \gamma_0 + \sum_{j=-3}^{+3} \beta_j D_{i,j,c,t} + \delta_{c,t} + \alpha_{i,c} + \sum_p \beta_p\, Control_{p,i,t-1} + \epsilon_{i,t}    (1.6)

where the dummy variables D_{i,j,c,t} indicate whether stock i in cohort c is treated at time t at distance j ∈ [−3, +3] from the treatment month. Month −1 is chosen as the base month and is omitted from the regression. The results are included in Columns (3) and (6) of Table 1.6. As shown in the table, the coefficients for the two months before the event are economically and statistically insignificant, while the coefficients on the treatment month and the months after the treatment are economically and statistically significant. These results support the assumption that there are no significant differences between the treatment and control groups before the treatment. To provide further evidence on the “parallel trend” assumption, I also plot the coefficients on the dummy variables D_{i,j,c,t} with their 95% confidence intervals in Figure 1.4. The figure provides visual support for the “parallel trend” assumption underlying our difference-in-differences analysis. The dynamic results, together with the static results, provide strong evidence of a possible causal effect of increased retail attention on stock crash risk.

1.6.3 Size and Institutional Ownership

Foucault et al. (2011) show that retail investors have an outsized impact on stock volatility, especially for smaller stocks, where the standard limits-to-arbitrage argument applies (Shleifer and Vishny, 1997). Smaller stocks are thinly traded and thus less liquid. Because of their price tag, they are usually the preferred habitat of retail investors, and their institutional holdings are usually lower. As a result, their prices can stay distant from fundamentals for extended periods, since rational investors are reluctant to step in when arbitrage is costly.
The same argument should apply to crash risk. Prior literature has shown that high crash risk stocks tend to be smaller and more costly to arbitrage (Jang and Kang, 2019). We have also shown in Section 1.5 that retail investors seem to display a preference for high crash risk stocks possibly 29 Figure 1.4 Dynamic Treatment Effects of the First Appearances of Tickers on “Wallstreetbets” This figure plots the dynamic treatment effects between three months prior to the treat- ment and three months after the treatment to examine whether the “parallel trend” assumption holds for the “difference-in-differences” analysis on whether the first ap- pearances of stock tickers on “Wallstreetbets” can have a positive and significant ef- fect on stock crash risk. The “difference-in-differences” specification is as follows: 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 = 𝛾0 + +3 Í Í 𝑗=−3 𝛽 𝑗 𝐷 𝑖, 𝑗,𝑐,𝑡 + 𝛿 𝑐,𝑡 + 𝛼𝑖,𝑐 + 𝑝 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙 𝑝,𝑖,𝑡−1 + 𝜖𝑖,𝑡 Where the dummy variables 𝐷 𝑖, 𝑗,𝑐,𝑡 indicate whether a stock 𝑖 is treated in cohort 𝑐 at time 𝑡, and the distance 𝑗 ∈ [−3, 3] from the current month to the treatment month. Month −1 is chosen to be the base month that will be omitted from the regression. Month 0 is the treatment month, and a green dotted line is plotted for better illustration. The coefficients on the rest of the dummies 𝐷 𝑖, 𝑗,𝑐,𝑡 together with their 95% confidence interval bands are then plotted against their respective time periods. The blue markers display results using logit-generated crash risk as the dependent variable, while the golden markers use machine learning-generated crash risk. 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 is the estimated crash risk of stock 𝑖 in cohort 𝑐 at time 𝑡. 𝛿𝑐,𝑡 is 𝐶𝑜ℎ𝑜𝑟𝑡 × 𝑇𝑖𝑚𝑒 fixed effects. 𝛼𝑖,𝑐 is 𝑈𝑛𝑖𝑡 × 𝐶𝑜ℎ𝑜𝑟𝑡 fixed effects. Standard errors are clustered at the unit level. The regression results are reported in Column (3) and (6) in Table 1.6. 30 because of their distorted beliefs (Brunnermeier et al., 2007). The combination of these factors should lead to a natural hypothesis that retail attention should have an outsized impact on the crash risk of smaller stocks and stocks with lower institutional ownership. To examine this hypothesis, I divide the universe of stocks into two subgroups based on either lagged size or institutional ownership. Then I define a dummy variable 𝐷 𝑠𝑖𝑧𝑒/𝑖𝑜 = 1 if the stock is larger than the median or zero otherwise, based on the lagged value of each stock three months prior to each event. In the case of institutional ownership, 𝐷 𝑠𝑖𝑧𝑒/𝑖𝑜 = 1 if the ratio of institutional ownership for the stock is greater than the median or zero otherwise, based on the lagged value of institutional ownership three months prior to each event. Then I interact 𝐷 𝑠𝑖𝑧𝑒/𝑖𝑜 with the 𝑇𝑟𝑒𝑎𝑡𝑒𝑑 dummy variable in the same “stacked difference-in-differences” specification: ∑︁ 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 = 𝛾0 + 𝛽1 𝐷 𝑖,𝑐,𝑡 + 𝛽2 𝐷 𝑖,𝑐,𝑡 × 𝐷 𝑠𝑖𝑧𝑒/𝑖𝑜 + 𝛿𝑐,𝑡 + 𝛼𝑖,𝑐 + 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙 𝑝,𝑖,𝑡−1 + 𝜖𝑖,𝑡 (1.7) 𝑝 I report the results of this specification in Table 1.7. Columns (1) to (4) report the results of using logit-generated crash risk as the dependent variable, while Columns (5) to (8) use machine learning-generated crash risk as the dependent variable. Columns (1), (3), (5), and (7) only include the treated dummy and the interaction between the treated and the size dummy or IO dummy. Columns of even numbers add control variables. Standard errors are clustered at the unit level. 
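Before turning to the estimates in Table 1.7, the following is a minimal sketch of how the stacked specifications in equations (1.5)–(1.7) could be implemented, assuming the matched cohorts have already been stacked into a long-format DataFrame. All column names (crash_risk, is_treated_stock, event_time, unit_id, cohort_id, month, size_pre, and the controls) are hypothetical placeholders rather than the variable names actually used in this chapter, the within-cohort median split is one reasonable reading of the text, and the fixed effects are absorbed with explicit dummies purely for transparency.

import pandas as pd
import statsmodels.formula.api as smf

CONTROLS = ["log_mktcap", "ret_1m", "asset_growth", "gross_profit",
            "illiq", "max_ret", "ret_12m", "ivol"]
FE = "C(cohort_id):C(month) + C(unit_id):C(cohort_id)"   # cohort-by-month and unit-by-cohort effects

def stacked_did(panel: pd.DataFrame, split_col: str = "size_pre"):
    df = panel.dropna(subset=["crash_risk", "is_treated_stock", "event_time",
                              "unit_id", "cohort_id", "month", split_col] + CONTROLS).copy()

    # Static treatment dummy (eq. 1.5): treated stock in or after its first-mention month.
    df["treated"] = ((df["is_treated_stock"] == 1) & (df["event_time"] >= 0)).astype(int)

    # Dynamic event-time dummies (eq. 1.6), omitting month -1 as the base period.
    dyn = []
    for j in (-3, -2, 0, 1, 2, 3):
        name = f"d_m{-j}" if j < 0 else f"d_p{j}"
        df[name] = ((df["is_treated_stock"] == 1) & (df["event_time"] == j)).astype(int)
        dyn.append(name)

    # Median-split dummy and its interaction (eq. 1.7), shown here for size; IO is analogous.
    df["d_split"] = (df[split_col] >
                     df.groupby("cohort_id")[split_col].transform("median")).astype(int)
    df["treated_x_split"] = df["treated"] * df["d_split"]

    cluster = {"groups": df["unit_id"]}   # standard errors clustered at the unit level
    def run(rhs):
        return smf.ols("crash_risk ~ " + rhs + " + " + " + ".join(CONTROLS) + " + " + FE,
                       data=df).fit(cov_type="cluster", cov_kwds=cluster)

    static = run("treated")                            # beta: ATT in eq. (1.5)
    dynamic = run(" + ".join(dyn))                     # event-time coefficients in eq. (1.6)
    interacted = run("treated + treated_x_split")      # beta_1 and beta_2 in eq. (1.7)
    return static, dynamic, interacted

A routine that absorbs high-dimensional fixed effects would be faster on the full stacked sample, but the point estimates would be the same.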
31 Table 1.7 First Appearances of Stock Tickers on “Wallstreebets” and Crash Risk: Size & IO (1) (2) (3) (4) (5) (6) (7) (8) VARIABLES Crash Risk (Logit) Crash Risk (EEC) Treated 1.501*** 1.038*** 1.539*** 0.988*** 1.060*** 0.434*** 1.019*** 0.422*** (0.182) (0.326) (0.177) (0.310) (0.090) (0.142) (0.090) (0.145) Treated×𝐷 𝑠𝑖𝑧𝑒 -0.930*** -0.743** -0.766*** -0.202 (0.205) (0.343) (0.104) (0.153) Treated×𝐷 𝑖𝑜 -1.082*** -0.689** -0.735*** -0.191 (0.202) (0.330) (0.102) (0.155) Controls NO YES NO YES NO YES NO YES Observations 208,502 125,734 208,502 125,734 208,502 125,734 208,502 125,734 R-squared 0.874 0.909 0.874 0.909 0.921 0.946 0.921 0.946 Cohort×Units FE YES YES YES YES YES YES YES YES Cohort×Month FE YES YES YES YES YES YES YES YES Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table reports results from a “stacked difference-in-differences” approach that examines whether the effect of first appearances of stocks tickers on “Wallstreetbets” on their ex-ante crash risk differs because of size or level of institutional ownership. “Wallstreetbets” was started in April 2012. From the beginning of “Wallstreetbets” to the end of 2020, we find all the stock tickers that are ever mentioned in the Subreddit and the first month they were mentioned. We then define each of these instances as one event and each of the stocks as a treated stock. We match each treated stock with five control stocks from the pool of “never treated” stocks via propensity score matching based on lagged characteristics three months prior to each event. Then the “cohorts” containing treated and control observations are stacked together and the following specification is run: Í 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 = 𝛾0 + 𝛽1 𝐷 𝑖,𝑐,𝑡 + 𝛽2 𝐷 𝑖,𝑐,𝑡 × 𝐷 𝑠𝑖𝑧𝑒 + 𝛿 𝑐,𝑡 + 𝛼𝑖,𝑐 + 𝑝 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙 𝑝,𝑖,𝑡 −1 + 𝜖 𝑖,𝑡 Where 𝐶𝑟𝑎𝑠ℎ 𝑅𝑖𝑠𝑘 𝑖,𝑐,𝑡 is the estimated crash risk of stock 𝑖 in cohort 𝑐 at time 𝑡. 𝐷 𝑖,𝑐,𝑡 is a dummy variable that indicates whether a stock 𝑖 in cohort 𝑐 is treated at time 𝑡. 𝐷 𝑠𝑖𝑧𝑒/𝑖𝑜 is a dummy variable that equals one if the stock is larger than the median or zero otherwise, based on the lagged value of each stock three months prior to each event. In the case of institutional ownership, 𝐷 𝑠𝑖𝑧𝑒/𝑖𝑜 = 1 if the ratio of institutional ownership for the stock is greater than the median or zero otherwise. 𝛿 𝑐,𝑡 is 𝐶𝑜ℎ𝑜𝑟𝑡 × 𝑇𝑖𝑚𝑒 fixed effects. 𝛼𝑖,𝑐 is 𝑈𝑛𝑖𝑡 × 𝐶𝑜ℎ𝑜𝑟𝑡 fixed effects. Then 𝛽2 is the coefficient of interest that estimates the difference in average treatment effect on the treated stocks if the stocks belong to the large stock subgroup. Column (2) adds control variables that include the natural log of market capitalization, prior-month return, asset growth, gross profitability, illiquidity (Amihud, 2002), MAX (Bali et al., 2011), prior 12-month return, and idiosyncratic risk. Standard errors are clustered at the unit level to account for possible duplicate observations. 32 Consistent with our hypothesis, the coefficient on the interaction term between the treated and size dummy or the IO dummy is negative and economically, and statistically significant. For example, as shown in Column (1) when controls are not included, if the stock is below median size, the first appearance on “Wallstreetbets” increases stock crash risk by 1.5 percentage points, much higher than our baseline estimate of 1.03%. If the stock is above the median size, the effect is much smaller at approximately 57 bps. 
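To see where the 57 bps figure comes from, note that for above-median-size stocks the total effect in Column (1) is the sum of the two reported coefficients:

1.501 - 0.930 = 0.571 \text{ percentage points} \approx 57 \text{ bps},

whereas below-median stocks receive the full 1.5-percentage-point effect.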
In column (2) when control variables are included, being a small stock that first appears on ‘Wallstreetbets” leads to a 1.04 percentage points increase in crash risk. The interaction term between 𝑇𝑟𝑒𝑎𝑡𝑒𝑑 and the size dummy remains significantly negative. The results are consistent when using institutional ownership as the main variable of interest. These results are consistent with prior literature that retail investors have a higher impact on smaller stocks or stocks with a lower level of institutional ownership. 1.6.4 Supporting Evidence from Trading Volume and Volatility One necessary assumption for our analysis is that retail investors pile in the stocks that are mentioned on social media. While we do not have individual trading data, there should be a surge in trading volume and volatility (Foucault et al., 2011) around the events. To examine whether this is the case, we re-run the “difference-in-differences” analysis but substitute the dependent variable with trading volume and return volatility, where trading volume is defined as the monthly total volume of shares traded scaled by total shares outstanding, and volatility is defined as daily return volatility of the current month. The results are reported in Table 1.8. The table shows clearly that there is a significant surge in both trading volume and return volatility in the treated stocks that first appeared on “Wallstreetbets”. Moreover, the dynamic tests confirm that there is no evidence that the “parallel trend” assumption is violated. In fact, before the event happens, there is a downward trend for the treated stocks in terms of trading volume and return volatility. This can be more readily shown in Figure 1.5. Taken together, these results support our main analysis that heightened retail attention as a result of social transmission leads to higher ex-ante crash risk. Moreover, there is evidence that retail activities are behind the surge of trading interests in these stocks. 33 Table 1.8 First Appearances of Stock Tickers on “Wallstreebets”: Trading Vol & Volatility (1) (2) (3) (4) (5) (6) VARIABLES Trading Volume Volatility Treated 0.227*** 0.146*** 0.260*** 0.162*** (0.026) (0.032) (0.021) (0.026) Month -3 -0.067* -0.037 (0.034) (0.041) Month -2 -0.059* -0.068* (0.031) (0.038) Month 0 0.336*** 0.497*** (0.042) (0.042) Month +1 0.066* 0.024 (0.038) (0.039) Month +2 0.015 0.013 (0.041) (0.042) Month +3 -0.017 -0.055 (0.044) (0.042) Observations 209,478 125,748 125,748 212,961 125,748 125,748 R-squared 0.931 0.954 0.954 0.790 0.842 0.843 Cohort×Units FE YES YES YES YES YES YES Cohort×Month FE YES YES YES YES YES YES Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table reports results from a “stacked difference-in-differences” approach (Gormley and Matsa, 2011) that examines the effect of first appearances of stocks tickers on “Wallstreetbets” on their trading volume and volatility. Columns (1) to (3) use trading volume as the dependent variable, while Columns (4) to (6) use return volatility. “Wallstreetbets” was started in April 2012. From the beginning of “Wallstreetbets” to the end of 2020, we find all the stock tickers that are ever mentioned in the Subreddit and the first month they were mentioned. We then define each of these instances as one event and each of the stocks as a treated stock. We match each treated stock with five control stocks from the pool of “never treated” stocks via propensity score matching based on lagged characteristics three months prior to each event. 
Then the “cohorts” contain- ing treated and control observations are stacked together and the following specification is run: Í 𝑇𝑟𝑎𝑑𝑖𝑛𝑔𝑉 𝑜𝑙𝑖,𝑐,𝑡 = 𝛾0 + 𝛽𝐷 𝑖,𝑐,𝑡 + 𝛿𝑐,𝑡 + 𝛼𝑖,𝑐 + 𝑝 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙 𝑝,𝑖,𝑡−1 + 𝜖𝑖,𝑡 Where 𝑇𝑟𝑎𝑑𝑖𝑛𝑔𝑉 𝑜𝑙𝑖,𝑐,𝑡 is the trading volume of stock 𝑖 in cohort 𝑐 at time 𝑡. 𝐷 𝑖,𝑐,𝑡 is a dummy variable that indicates whether a stock 𝑖 in cohort 𝑐 is treated at time 𝑡. 𝛿𝑐,𝑡 is 𝐶𝑜ℎ𝑜𝑟𝑡 × 𝑇𝑖𝑚𝑒 fixed effects. 𝛼𝑖,𝑐 is 𝑈𝑛𝑖𝑡 × 𝐶𝑜ℎ𝑜𝑟𝑡 fixed effects. Then 𝛽 is the coefficient of interest that estimates the average treatment effect on the treated stocks. The results are reported in Columns (1), (2), (4), and (5), where Columns (2) and (5) add control variables. The control variables include the natural log of market capitalization, prior-month return, asset growth, gross profitability, illiquidity, MAX (Bali et al., 2011), prior 12-month return, and idiosyncratic risk. Columns (3) and (6) examine the dynamic treatment effects around the events. Standard errors are clustered at the unit level. 34 Figure 1.5 Dynamic Treatment Effects: Trading Vol & Volatility This figure plots the dynamic treatment effects between three months prior to the treatment and three months after the treatment to examine whether the “parallel trend” assumption holds for the “difference-in-differences” analysis on whether there is a surge in trading volume and volatility after the first appearances of stock tickers on “Wallstreetbets”. The “difference-in-differences” specification is the same specification as in our main test except that we replace the dependent variable with either trading volume or return volatility. Panel A display results on trading volume, while Panel B shows results for return volatility. The regression results are reported in Column (3) and (6) in Table 1.8. 35 1.7 Retail Traders and Crash Risk: Daily Evidence In this section, we approach the main questions using the daily data by exploring the 𝑆𝐾 𝐸𝑊 measure by Xing et al. (2010), which is widely used as a proxy for firm-level crash risk (Bollen and Whaley, 2004; Van Buskirk, 2011; Kim and Zhang, 2014; Kim et al., 2016). It is motivated by the notion that a volatility smirk indicates investors’ expectation of a steep decline in the underlying asset value (Bates, 2000). Using 𝑆𝐾 𝐸𝑊 as a proxy has the following advantages. First, it is available at a daily frequency for stocks that have options traded. Second, it is easy to compute as it only relies on implied volatility. Third, it is ex-ante in nature and thus conforms to our purpose. Formally, 𝑆𝐾 𝐸𝑊 is defined as follows: 𝑂𝑇 𝑀−𝑃𝑢𝑡 𝐴𝑇 𝑀−𝐶𝑎𝑙𝑙 𝑆𝐾 𝐸𝑊𝑖,𝑡 = 𝐼𝑚 𝑝𝑙𝑖𝑒𝑑𝑉 𝑜𝑙𝑖,𝑡 − 𝐼𝑚 𝑝𝑙𝑖𝑒𝑑𝑉 𝑜𝑙𝑖,𝑡 (1.8) Following Xing et al. (2010), I screen the options based on the following criteria. Days to expiration are between 10 and 60 days. Implied volatilities are between 0.03 and 2. Open interest must be greater than zero. Option price must be greater than $0.125. Volume is non-missing. For out-of-money put options, the moneyness is between 0.8 and 0.95. For at-the-money call options, the moneyness is between 0.95 and 1.05. We choose the implied volatility of the put option with moneyness closest to 0.95, and the implied volatility of the call option with moneyness closest to 1 to compute the 𝑆𝐾 𝐸𝑊 measure for the day. 1.7.1 SKEW and Daily Returns Xing et al. (2010) show that 𝑆𝐾 𝐸𝑊 is significantly negatively correlated with future weekly returns. 
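Before turning to the daily return tests, the following is a minimal sketch of how the SKEW measure in equation (1.8) could be computed from daily option data under the screens just listed. The DataFrame opts and its column names (permno, date, cp_flag, days_to_exp, impl_vol, open_interest, volume, opt_price, moneyness) are hypothetical stand-ins for the OptionMetrics fields, and moneyness is taken to be the strike price divided by the underlying stock price.

import pandas as pd

def daily_skew(opts: pd.DataFrame) -> pd.DataFrame:
    # Screens following Xing et al. (2010).
    f = opts[opts["days_to_exp"].between(10, 60)
             & opts["impl_vol"].between(0.03, 2.0)
             & (opts["open_interest"] > 0)
             & (opts["opt_price"] > 0.125)
             & opts["volume"].notna()].copy()

    # OTM puts (moneyness 0.80-0.95) and ATM calls (moneyness 0.95-1.05).
    puts = f[(f["cp_flag"] == "P") & f["moneyness"].between(0.80, 0.95)].copy()
    calls = f[(f["cp_flag"] == "C") & f["moneyness"].between(0.95, 1.05)].copy()

    # Per stock-day, keep the put closest to moneyness 0.95 and the call closest to 1.
    puts["dist"] = (puts["moneyness"] - 0.95).abs()
    calls["dist"] = (calls["moneyness"] - 1.00).abs()
    put_iv = (puts.sort_values("dist").groupby(["permno", "date"])["impl_vol"]
              .first().rename("iv_otm_put"))
    call_iv = (calls.sort_values("dist").groupby(["permno", "date"])["impl_vol"]
               .first().rename("iv_atm_call"))

    skew = pd.concat([put_iv, call_iv], axis=1, join="inner")
    skew["skew"] = skew["iv_otm_put"] - skew["iv_atm_call"]   # SKEW_{i,t} in eq. (1.8)
    return skew.reset_index()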
To test whether this is the case in the daily frequency and to check whether daily 𝑆𝐾 𝐸𝑊 can be used as a suitable proxy for ex-ante crash risk, we need to examine whether 𝑆𝐾 𝐸𝑊 is significantly negatively correlated with future daily returns. Therefore I follow Hu et al. (2021) and use the following specification: ∑︁ 𝑅𝑖,𝑡 = 𝛼 + 𝛽𝑆𝐾 𝐸𝑊𝑖,𝑡−1 + 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑖,𝑝,𝑡−1 + 𝜆𝑡 + 𝜖𝑖,𝑡 (1.9) 𝑝 36 Where 𝑡 is at a daily frequency. The control variables include prior day return, prior month- end log of market capitalization, book-to-market ratio, cumulative 19-day returns lagged for 2 days (reversal), cumulative 100-day returns lagged for 21 days (momentum), prior month average trading volume scaled by total shares outstanding (liquidity), and prior month volatility of daily returns. For robustness, I run both Fama-MacBeth regressions and panel regressions and report the results in Panel A of Table 1.9. Panel A of Table 1.9 shows that throughout all specifications, the 𝑆𝐾 𝐸𝑊 measure is negatively correlated with future daily stock returns, which is statistically significant at the 1% level. These results corroborate the findings in the prior literature and provide support for using 𝑆𝐾 𝐸𝑊 as a valid proxy for ex-ante crash risk at the daily frequency. 1.7.2 Retail Trading of SKEW In section 1.5, we show that retail investors have a tendency to buy high ex-ante crash risk stocks. To see whether this is also the case in the daily frequency, we again use the trading measure derived from Robintrack to regress on the contemporaneous 𝑆𝐾 𝐸𝑊 measure and the same set of control variables that we used in the previous test. We regress retail trading measures on the contemporaneous 𝑆𝐾 𝐸𝑊 measure instead of the lagged measure because we want to examine retail trading behavior on the “ex-ante” measure of crash risk. The results are reported in Panel B of Table 1.9. To control for common market-wide shocks, we follow the prior specifications and include day fixed effects and cluster standard errors at the stock level. From Panel B of Table 1.9, we see that both regressions using different measures for retail trading load positively and significantly on the contemporaneous 𝑆𝐾 𝐸𝑊, the proxy for ex-ante crash risk measure. These results are consistent with our prior monthly results that retail investors tend to overbuy high crash-risk stocks. Apparently, these results only report the positive correlation between crash risk and retail trading, while the causality can go both directions, just like in the monthly case. To see whether retail behaviors have a real influence on ex-ante crash risk, we turn to online conversations in “Wallstreetbets” again but follow a different path. We want to examine whether the intensity of 37 Table 1.9 Daily Returns, Retail Trading, and Crash Risk (𝑆𝐾 𝐸𝑊) (1) (2) (3) (4) Panel A: Daily Stock Returns and Crash Risk (𝑆𝐾 𝐸𝑊) VARIABLES FMB Panel Lag Option SKEW -0.001*** -0.002*** -0.001*** -0.001*** (0.000) (0.000) (0.000) (0.000) Controls NO YES NO YES Observations 2,071,209 2,010,815 2,071,209 2,010,815 R-squared 0.003 0.072 0.199 0.201 Panel B: Robinhood User Trading and Crash Risk (𝑆𝐾 𝐸𝑊) VARIABLES Change in % Change in Log(Robinhood Users) Robinhood Users Option SKEW 0.001** 0.001** (0.000) (0.001) Controls YES YES Observations 703,614 862,423 R-squared 0.011 0.003 Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table examines the relationship between daily returns and lagged 𝑆𝐾 𝐸𝑊 measure, and the relationship between Robinhood user trading and the contemporaneous 𝑆𝐾 𝐸𝑊 measure. 
Panel A reports regressions of daily stock returns on lagged 𝑆𝐾 𝐸𝑊 measure as a proxy for crash risk in the daily frequency. The 𝑆𝐾 𝐸𝑊 measure follows Xing et al. (2010): 𝑆𝐾 𝐸𝑊𝑖,𝑡 = 𝐼𝑚 𝑝𝑙𝑖𝑒𝑑𝑉 𝑜𝑙𝑖,𝑡𝑂𝑇 𝑀−𝑃𝑢𝑡 − 𝐼𝑚 𝑝𝑙𝑖𝑒𝑑𝑉 𝑜𝑙 𝐴𝑇 𝑀−𝐶𝑎𝑙𝑙 𝑖,𝑡 The option data is from Option Metrics. We screen the option data based on the following condi- tions. Days to expiration are between 10 and 60 days. Implied volatilities are between 0.03 and 2. Open interest must be greater than zero. Option price must be greater than $0.125. Volume is non- missing. For out-of-money put options, the moneyness is between 0.8 and 0.95. For at-the-money call options, the moneyness is between 0.95 and 1.05. We choose the implied volatility of the put option with moneyness closest to 0.95, and the implied volatility of the call option with moneyness closest to 1 to compute the 𝑆𝐾 𝐸𝑊 measure for the day. Columns (1) and (2) report Fama-MacBeth cross-sectional regressions, while Columns (3) and (4) report panel regressions. The control vari- ables include prior day return, prior month-end log of market capitalization, book-to-market ratio, cumulative 19-day returns lagged for 2 days (reversal), cumulative 100-day returns lagged for 21 days (momentum), prior month average trading volume scaled by total shares outstanding (liquid- ity), and prior month volatility of daily returns. For panel regressions, we include day fixed effects, and standard errors are clustered at the stock level. Panel B reports panel regressions of Robinhood user trading measures on contemporaneous 𝑆𝐾 𝐸𝑊 measure as a proxy for ex-ante crash risk and control variables. The trading measures include the change in the log of user numbers and the percentage change of user numbers from the previous day. 38 daily conversations about certain stocks can have a significantly positive impact on the ex-ante crash risk of these stocks. 1.7.3 Online Conversations and SKEW: Endogeneity Apparently, online conversations about stocks are endogenous. As shown in Han et al. (2022), agents receive prominent presentations of other agents’ trading strategies, typically represented by high past returns, and thus follow the same strategy, which leads to feedback on the stock returns. Because of this feedback loop, it’s impossible to separate the two legs of the circle via the usual regression specifications. Specifically, consider the following specification, where we regress the 𝑆𝐾 𝐸𝑊 measure on the number of times each stock is mentioned on social media, controlling for a set of stock characteristics. ∑︁ 𝑆𝐾 𝐸𝑊𝑖,𝑡 = 𝛼0 + 𝛽𝑆𝑜𝑐𝑖𝑎𝑙𝑇𝑟𝑎𝑛𝑠𝑚𝑖𝑠𝑠𝑖𝑜𝑛𝑖,𝑡−1 + 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑖,𝑝,𝑡−1 + 𝜆𝑡 + 𝜎𝑖 + 𝜖𝑖,𝑡 (1.10) 𝑝 In a slight abuse of notation, the 𝑡 − 1 in the subscript of “Social Transmission” means the pre-trading hours from 16:30 PM on the previous day to 09:00 AM on the current day, while the 𝑡 − 1 in the controls ranges from the previous day to previous month, depending on the variable referred to. In this specification, even when we use the two-way fixed effects estimator, “Social Transmission” is still correlated with the idiosyncratic error term, and thus the estimate of the coefficient 𝛽 is inconsistent. 1.7.4 A Plausible Instrument Let’s consider the following scenario. Person A zones away during his long and boring working hours by wandering aimlessly on social media. His/her favorite venue for wandering is Reddit, a popular platform for talking about anything. Each sub-venue specializing in a different topic is called a “Subreddit”, a symbol of rich social life in a society. 
Apart from working, person A spends a tremendous amount of time on hobbies such as football, fishing, and political debates, where he/she posts and comments on the corresponding Subreddits. In addition, person A has developed a keen interest in stock trading and thus becomes a subscriber of “Wallstreetbets”, as he/she can always find interesting trading ideas there. For person A, Reddit satisfies almost all of his/her needs for socializing, the cost of migrating to another platform is high, and there is no comparable alternative (Chang et al., 2014). Therefore, person A’s activities on “Wallstreetbets” are correlated with his/her activities on other Subreddits. In other words, person A is more likely to post on “Wallstreetbets” if he/she is also posting on other Subreddits. However, it is logical that person A’s activities on other Subreddits have no direct bearing on stock market returns. Such an influence can only be exerted via his/her activities on “Wallstreetbets”. Formally, consider the following specification:

WSB\_Posts_{i,t-1} = \alpha_0 + \beta_Z\, Non\_Finance\_Posts_{i,t-1} + \epsilon_{i,t-1}    (1.11)

SKEW_{i,t} = \alpha_1 + \beta_X\, WSB\_Posts_{i,t-1} + \sum_p \beta_p\, Control_{i,p,t-1} + \lambda_t + u_{i,t}    (1.12)

Where the first equation represents the first-stage regression, and the second equation represents the instrumental variable estimation. The subscript i represents stock i and simultaneously all the agents that mention stock i. To operationalize this procedure, we must ensure that the non-finance conversations are truly unrelated to finance. Therefore, the name of the Subreddit matters. To ensure that we extract non-finance posts from non-finance “Subreddits”, I follow the strategy used in Li et al. (2021). I choose a set of “seed words” and identify the 50 words/phrases closest in meaning to each seed word. I then retain those “Subreddits” whose titles contain none of these keywords. The seed words I choose are: “finance”, “stock-market”, “stocks”, “wall-street”, “trading”, “forex”, “options”, “investment”, “bond-market”, and “bonds”.

How do we find words/phrases that are similar in meaning to the seed words? Recent advances in computational linguistics offer powerful tools for this task. First, we vectorize the words/phrases into fixed-length vectors. Then we compute the cosine similarity between each pair of vectors as a proxy for closeness in meaning. To do this, I use the pre-trained word embedding system “Global Vectors for Word Representation” (GloVe) developed by Pennington et al. (2014). These vectors are trained on the co-occurrences of words and phrases in the entire corpus of Wikipedia up to 2014 and in Gigaword 5 (Parker et al., 2011). I use the 300D version of GloVe, which means that each word/phrase is represented by a 300-dimensional vector V = [x_1, x_2, ..., x_{300}]. The cosine similarity between two word vectors V_1 and V_2 is:

CosineSim_{1,2} = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}    (1.13)

The closer the cosine similarity between two word vectors is to one, the closer the two words are in meaning.6 I collect the top 50 most similar words/phrases for each seed word and group them together. After removing duplicates, we end up with a set of 351 keywords related to the topic of finance. I then use these keywords to screen all the Subreddits.

1.7.5 Instrumental Variable Results

With all the data processed, we are ready to construct the instruments.
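As an illustration of the seed-word expansion just described, the sketch below loads the 300-dimensional GloVe vectors (trained on Wikipedia 2014 and Gigaword 5) through gensim and collects the 50 nearest neighbors of each seed word by cosine similarity. The packaged model name, the splitting of hyphenated seeds, and the substring test on Subreddit names are assumptions of this sketch rather than the exact procedure used in the chapter.

import gensim.downloader as api

SEEDS = ["finance", "stock-market", "stocks", "wall-street", "trading",
         "forex", "options", "investment", "bond-market", "bonds"]

glove = api.load("glove-wiki-gigaword-300")   # KeyedVectors; most_similar uses cosine similarity

keywords = set()
for seed in SEEDS:
    # The packaged GloVe vocabulary is lower-case and unhyphenated, so split multi-word seeds.
    for token in seed.replace("-", " ").split():
        if token in glove:
            keywords.update(w for w, _ in glove.most_similar(token, topn=50))
keywords.update(SEEDS)

def is_non_finance(subreddit_name: str) -> bool:
    # A Subreddit is flagged as non-finance if its name contains none of the keywords.
    name = subreddit_name.lower()
    return not any(k in name for k in keywords)

The resulting keyword set plays the role of the 351 finance-related keywords used above to separate finance from non-finance Subreddits.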
First, we denote a user 𝑗’s number of posts on “Wallstreetbets” about stock 𝑖 on day 𝑡 as 𝑛𝑊 𝑆𝐵 𝑖, 𝑗,𝑡 , and his/her number of posts on non-finance “Subreddits” as 𝑛𝑖,𝑛𝑜𝑛𝐹𝑖𝑛 𝑗,𝑡 . Then stock 𝑖’s total number of posts on “Wallstreetbets” on day 𝑊 𝑆𝐵 = Í 𝑛𝑊 𝑆𝐵 . The instrument we construct for this variable would be 𝑁 𝑛𝑜𝑛𝐹𝑖𝑛 = Í 𝑛𝑛𝑜𝑛𝐹𝑖𝑛 , 𝑡 is 𝑁𝑖,𝑡 𝑗 𝑖, 𝑗,𝑡 𝑖,𝑡 𝑗 𝑖, 𝑗,𝑡 where the term is summing over all 𝑗 that have posted on “Wallstreetbets” about stock 𝑖 on day 𝑡. We proceed to run the regressions of the daily 𝑆𝐾 𝐸𝑊 measure on our main variable of interest – the number of “Wallstreetbets” posts 𝑁𝑖,𝑡 𝑊 𝑆𝐵 , instrumented by the total number of non-finance posts by the same users 𝑁𝑖,𝑡 𝑛𝑜𝑛𝐹𝑖𝑛 , controlling for the same set of independent variables we use in prior settings. First, we run panel regressions without using the instrument. Then, to test whether there is evidence that the instrument violates the exclusion restriction, I add the instrument into the regression to see whether the instrument is inappropriately excluded. Finally, I run the regression with instrumental variable estimation. The first stage regression is untabulated, but the coefficient on the instrument is 0.049 and statistically significant at the 1% level, and the 𝑅 2 is 3.4%. I report the main results in Table 1.10. The insignificant coefficient on the number of non-finance posts in Column (2) supports the 6 Because all word vectors contain nonnegative numbers, the cosine similarity between any pair of word vectors is nonnegative. 41 Table 1.10 Instrumental Variable Estimation: “WSB” Posts and Crash Risk (𝑆𝐾 𝐸𝑊) (1) (2) (3) VARIABLES Panel Panel IV Number of “Wallstreetbets” Posts 0.070*** 0.067*** 0.193*** (0.019) (0.018) (0.035) Number of Non-Finance Posts 0.005 (0.004) Controls YES YES YES Observations 2,655,209 2,655,209 2,655,209 R-squared 0.089 0.089 0.042 Day FE YES YES YES Firm Cluster YES YES YES Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01 This table reports the results of regressing the daily 𝑆𝐾 𝐸𝑊 measure on the number of “Wall- streetbets” posts, controlling for other stock characteristics. Column (1) reports a panel regression of 𝑆𝐾 𝐸𝑊 on the number of “Wallstreetbets” posts. Column (2) adds the proposed instrument “number of non-finance posts” to test the exclusion restriction. Column (3) reports the result of instrumental variable estimation. Denote a user 𝑗’s number of posts on “Wallstreetbets” about stock 𝑖 on day 𝑡 as 𝑛𝑊 𝑆𝐵 𝑖, 𝑗,𝑡 , and his/her number of posts on non-finance “Subreddits” as 𝑛𝑖, 𝑗,𝑡 𝑛𝑜𝑛𝐹𝑖𝑛 . Then 𝑊 𝑆𝐵 = Í 𝑊 𝑆𝐵 stock 𝑖’s total number of posts on “Wallstreetbets” on day 𝑡 is 𝑁𝑖,𝑡 𝑗 𝑛𝑖, 𝑗,𝑡 . The instrument we construct for this variable would be 𝑁𝑖,𝑡 𝑛𝑜𝑛𝐹𝑖𝑛 = Í 𝑛𝑛𝑜𝑛𝐹𝑖𝑛 , where the term is summing over all 𝑗 𝑖, 𝑗,𝑡 𝑗 that have posted on “Wallstreetbets” about stock 𝑖 on day 𝑡. The IV specification is as follows: 𝑊 𝑆𝐵 = 𝛼 + 𝛽 𝑁 𝑛𝑜𝑛𝐹𝑖𝑛 + 𝜖 𝑁𝑖,𝑡−1 0 𝑍 𝑖,𝑡−1 Í 𝑖,𝑡−1 𝑆𝐾 𝐸𝑊𝑖,𝑡 = 𝛼1 + 𝛽 𝑋 𝑁𝑖,𝑡−1 + 𝑝 𝛽 𝑝 𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑖,𝑝,𝑡−1 + 𝜆𝑡 + 𝑢𝑖,𝑡 𝑊 𝑆𝐵 The 𝑡 − 1 subscripts on the number of posts refer to the time period of 16:30 PM the previous day to 9:00 AM on day 𝑡. The first stage regression of 𝑁𝑖,𝑡 𝑊 𝑆𝐵 on 𝑁 𝑛𝑜𝑛𝐹𝑖𝑛 produces a coefficient of 𝑖,𝑡 0.049, statistically significant at the 1% level, and a 𝑅 2 of 3.4%, which dispels the weak instrument concern. 
In all specifications, the control variables include prior day return, prior month-end log of market capitalization, book-to-market ratio, cumulative 19-day returns lagged for 2 days (reversal), cumulative 100-day returns lagged for 21 days (momentum), prior month average trading volume scaled by total shares outstanding (liquidity), and prior month volatility of daily returns. We include day fixed effects, and standard errors are clustered at the stock level.

exclusion restriction assumption. The significantly positive coefficient on the number of “Wallstreetbets” posts in the instrumental variable estimation in Column (3) is consistent with our prior results from the “Difference-in-Differences” specification that online conversations among retail investors positively influence the ex-ante crash risk of stocks. A one-standard-deviation increase in the number of “Wallstreetbets” posts is associated with a 15 bps increase in the SKEW measure on average. Since the mean SKEW is 0.065, a 15 bps increase translates into an approximately 2.3% increase in ex-ante crash risk on a daily basis. These results, combined with our prior results on monthly crash risk, support our hypothesis that social media conversations are instrumental in facilitating more efficient herding of individual investors, which in turn drives the increase in the ex-ante crash risk of the underlying stocks.

1.8 Conclusion

Recent developments in financial technology (FinTech), such as “Robinhood”, have dramatically reduced the hurdles to retail trading. In addition, popular online forums like “Reddit” facilitate more efficient sharing of trading ideas. These innovations can amplify the effect of correlated retail trading behavior. Because of distorted beliefs, retail investors tend to over-buy high crash-risk stocks, contributing to the negative price of crash risk. The buying activities and subsequent price reactions form a possible feedback loop. The resulting elevation in crash risk contributes to exacerbated market volatility, potentially damaging investor welfare.

Future research could further explore social media’s role in forming investor beliefs and shaping their subsequent trading behavior. As reflected in the meme stock frenzy, the mass psychology of the online investing community can be influenced without apparent fundamental information, often to the harm of such investors. Studying this interaction between social media conversations and asset prices could help us understand the intricacies of price formation and aid policymakers in their pursuit of protecting potentially novice and vulnerable investor groups.

CHAPTER 2
THE CYBER RISK PREMIUM

2.1 Introduction

The digital transformation of the economy and increased interconnectivity have created unprecedented opportunities and under-explored risks. On the one hand, the ever-growing pool of data, along with advances in artificial intelligence, helps to promote efficiency and productivity in the economy; on the other hand, it exposes households, businesses, and governments to a new and potentially systemic source of risk: cyber risk.1 In the past decade, various forms of cyberattacks have attracted widespread attention. For instance, the Equifax data breach in 2017 exposed approximately 147 million names and dates of birth, 145.5 million Social Security numbers, and 209,000 payment card numbers along with expiration dates, which led to a settlement of nearly $700 million with federal and state investigators.
The recent SolarWinds cyberattack, which came to light in December 2020, illustrates both the vulnerability of the most secure networks and the potential for immense collateral damage to corporate and governmental entities connected to these networks. Considering the importance of cyber risk, it is natural to study how it might influence stock market prices and returns. The primary challenge for such a study is that cyber risk is latent, not directly observable. To address this challenge, we use machine learning algorithms to develop a real-time estimate of the likelihood for each individual firm to experience a cyberattack in the subsequent period. The underlying logic is that hackers do not choose their targets at random, but focus on firms with certain attributes (see, e.g., Kamiya et al. (2021)). Moreover, firms tend to communicate their self-perceived exposures to cyber risk through their disclosures, e.g., through 10-K filings. This information is also likely to be important in predicting the level of the firm’s cyber risk. Machine learning techniques are particularly suited to this task due to their ability to extract useful information from a large set of features and to deliver superior out-of-sample forecasts. 1 See Kashyap and Wetherilt (2019) for discussions on the distinguishing features of cyber risk. 44 In our empirical analyses, we introduce a novel data set of cyberattacks and apply a variety of machine learning techniques including the logistic ridge regression, the K-Nearest Neighbor, and Naive Bayes combined with an “EasyEnsemble" sampling technique (EEC-KNN and EEC-NB).2 We find that our cyber risk measure based on these algorithms has a superior ability to forecast the occurrence of future cyberattacks. For instance, compared with the simple logistic forecasting model, the predictive performance of the logistic ridge regression improves from 0.4% to 6.3% based on the harmonic mean of precision and recall rates (F-score) and from around 1.4% to 59.9% based on the geometric mean of sensitivity and specificity (G-mean). Because the different machine learning techniques yield similar results, we present our main results using the logistic ridge regression, which is commonly used in asset pricing literature. Armed with the cyber risk measure, we study how cyber risk is related to stock returns. Our primary tests use the Fama and MacBeth (1973b) cross-sectional regressions to examine the incremental predictive power of the cyber risk measure for stock returns against well-known firm characteristics in the period from July 2008 to June 2019. In the first set of regressions, we include the natural log of market capitalization, book-to-market ratio, gross profitability, investment ratio, and past 11-month return skipping the most recent month as control variables. In the second set, we add past one-month return, idiosyncratic volatility (Ang et al., 2006), and the Amihud illiquidity ratio (Amihud, 2002). Finally, we also control for the organizational capital (Eisfeldt and Papanikolaou, 2013), CAPM market beta, tail risk beta (Kelly and Jiang, 2014), co-skewness (Harvey and Siddique, 2000), and net operating assets (Hirshleifer et al., 2004). In a contemporary study, Florackis et al. (2022) propose a cyber cosine measure based on the linguistic similarity between hacked and non-hacked firms in the risk factor section of their annual reports, which is also included in some of our regression specifications. 
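A minimal sketch of the Fama and MacBeth (1973b) procedure described above: each month, a cross-sectional regression of stock returns on the lagged cyber risk measure and controls is estimated, and the time series of monthly slopes is then averaged. The DataFrame df and all column names below are hypothetical placeholders, and the plain time-series t-statistics ignore any autocorrelation adjustment one might add.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def fama_macbeth(df: pd.DataFrame, y: str = "ret_fwd",
                 xvars=("cyber_risk", "log_mktcap", "log_bm",
                        "gross_profit", "investment", "mom_2_12")):
    xvars = list(xvars)
    slopes = []
    for month, cs in df.groupby("month"):
        cs = cs.dropna(subset=[y] + xvars)
        if len(cs) > len(xvars) + 1:
            # One cross-sectional OLS per month of next-period returns on lagged characteristics.
            res = sm.OLS(cs[y], sm.add_constant(cs[xvars])).fit()
            slopes.append(res.params)
    slopes = pd.DataFrame(slopes)
    mean = slopes.mean()
    tstat = mean / (slopes.std(ddof=1) / np.sqrt(len(slopes)))   # Fama-MacBeth t-statistics
    return pd.DataFrame({"coef": mean, "t": tstat})

# fama_macbeth(df).loc["cyber_risk"] gives the average slope on the cyber risk measure.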
In all these regressions, the relation between cyber risk and stock returns is economically meaningful and statistically significant. In terms of magnitudes, a one standard deviation increase in cyber risk is associated with higher average 2 The EasyEnsemble technique combines undersampling and bootstrapping, which is effective in dealing with class-imbalance problems. This technique is particularly suited to address issues like cyberattacks, which represent rare events in the data. See Section ?? for relevant technical details on the EasyEnsemble technique. 45 stock returns of between 0.16% and 0.21% per month. These results show that cyber risk is an independent and important driver of cross-sectional variation in stock returns during our sample period. We also perform a standard portfolio analysis. At the end of each June from 2008 to 2018, we sort stocks into five quintiles based on the estimated level of cyber risk using information available in real-time and form value-weighted portfolios that are rebalanced at the end of the following June. Consistent with the Fama-MacBeth regressions, we find that stocks in the top quintile with the highest level of cyber risk have higher returns than those in the bottom quintile. The spread in returns is 0.84% per month after we adjust for their market exposure and 0.58% per month after adjusting for Fama-French five factors and momentum. The alpha remains large and statistically significant when we form tercile and quartile portfolios as well as under alternative asset pricing models. In addition to the cross-sectional variation in average returns, we examine the time variation in the relative returns between stocks with high and low cyber risk. We conduct two tests. First, we study the relation between the variation in the return difference between high and low cyber risk stocks and the New York University’s Index of Cyber Security (ICS), which is based on monthly surveys of industry executives about their perceived level of cyber risk. Because ICS is a monthly aggregate measure of cyber security, we form a factor-mimicking portfolio and compute its performance. We estimate each stock’s return sensitivity to changes in ICS in the past year. A lower ICS beta is associated with higher firm-level cyber risk. Then we form a spread portfolio that buys stocks with high ICS beta and sells those with low ICS beta. The return on the spread portfolio based on our cyber risk measure and that based on the ICS beta have a strong negative comovement, with a time-series correlation coefficient of −32%. Second, we study the relation between the performance of the spread portfolio based on our cyber risk measure and that of two cybersecurity ETFs that invest in firms providing cybersecurity services. We again find strong negative comovements between the two, with time-series correlation coefficients below −40%. That is, when stocks with higher cyber risk underperform, cybersecurity 46 firms tend to have higher returns. These results suggest that the time variation in the return spread between high and low cyber-risk firms is likely driven by the market’s perception of cyber risk in the economy. The strong relation between our cyber risk measure and future stock returns is consistent with the view that cyber risk is priced in the stock market. We conduct a number of tests to strengthen the identification of the cyber risk premium. First, we exploit the variation along the dimension of industry competition. Kamiya et al. 
(2021) shows that the negative stock market reaction to cyberattacks on victim firms spills over to their peers around the announcement of cyberattacks. We find that this spill-over effect is larger when the victim firm is a stronger competitor of its peer firms in the product market. This is because weaker peer firms are less likely to be able to increase market share at the expense of a stronger competitor—the repricing effect of their shares due to market recognition of their cyber risk exposure is less confounded by the effects of product market competition. Second, firms that provide products and services similar to those of hacked firms are likely to hold similarly valuable data, which makes them vulnerable to cyberattacks. We use the product similarity measure proposed by Hoberg and Philips (2016) as a proxy for data similarity. We find that peer firms with higher data similarity to victim firms tend to experience a more negative stock market reaction around the announcement of cyberattacks, and that data similarity is a positive contributor to our cyber risk measure. These tests provide further evidence for cyber risk being an important determinant of stock returns. We evaluate the robustness of our results by performing additional tests. First, we split our sample into two subperiods. We find that in both periods, the cyber risk measure is positively related to future stock returns, with the effect slightly stronger in the more recent subperiod. Second, we construct a cyber risk measure based on industry-adjusted risk disclosure variables and compute industry-adjusted returns for each stock in our sample. In both cases, we find that the positive relation between cyber risk and future stock returns remains strong. Third, to address the concern that our cyber risk measure may be driven by other dimensions of risks such as corporate fraud, we use our measure to predict the incidence of corporate misconduct and financial misconduct. Our 47 tests empirically reject this hypothesis. Fourth, we consider different machine learning algorithms and use different dictionaries in linguistic analyses to estimate the cyber risk measure. The results are robust. Finally, we consider a placebo test to pinpoint the importance of cyber risk. In particular, we randomly select firms to construct a fake sample of cyber-attacked firms. The number of random draws to create the fake sample equals the actual number of cyberattacks for each industry in each year. We then use the same machine learning algorithms to estimate each firm’s pseudo-cyber risk and estimate its relation to stock returns. Our results show that this relation is essentially flat. This result highlights the importance of cyber risk in driving stock returns. Our study contributes to the fast-growing literature that studies the implications of cyber risk for firms, the financial markets, and the economy. Many studies in this literature focus on the impact of realized cyberattacks. For instance, Kamiya et al. (2021) find that the announcement of successful cyberattacks is, on average, associated with a 1.09% wealth loss for shareholders within a three-day window around the incidents. They argue that successful cyberattacks lead to value loss for victims for both actual and reputational reasons.3 Notable exceptions include Jamilov et al. (2020), and Florackis et al. (2022), who use textual information to identify firms with high cybersecurity risk. Jamilov et al. 
(2020) use firms’ earnings conference call transcripts to develop a measure for firm-level cyber risk exposure and sentiment and report a number of interesting findings, such as an increase in corporate discussions of cyber risk in earnings calls, increasingly negative sentiment regarding cyber risk, and the spread of cyber risk discussions across regions.4 However, they do not distinguish between cyber risk exposure and cyber risk awareness. For instance, corporate managers’ extensive cyber risk discussions in conference calls can be driven by their keen awareness of cyber risk; a lack of such discussions can be driven by 3 In another study, Michel et al. (2020) study the timing of cyberattack announcements. They find negative abnormal returns on cyberattack victims prior to the announcement of the attacks, which is consistent with some information leakage. Echoing this message, Lin et al. (2020) reports evidence of insider trading ahead of public announcements of cyberattacks. Binfarè (2019) studies the effect of data breaches on the cost of debt financing. He finds that lenders tend to charge borrowing firms larger spreads after they experience data breaches. 4 Kopp et al. (2017) and Warren et al. (2018) examine the impact of cyberattacks on financial institutions and financial market infrastructure, and argue such attacks potentially can be quite damaging. Duffie and Younger (2019) study the linkage between the cyber security of large banks and financial stability. They argue that large banks tend to have sufficient liquidity to weather relatively extreme cyber runs; however, severe cyberattacks may deter nonbanks from sending funds through these institutions, which could create systemic risk. 48 their inadequate attention to or even ignorance of cybersecurity risk. Our results are robust to excluding textual variables from the cyber risk measure. Closer to us, Florackis et al. (2022) study the relation between cyber risk and stock returns. They use firms’ descriptions of cyber-related risk in 10-K reports to identify firms with high cyber risk. Their conclusion that cybersecurity risk is priced in the cross-section of stock returns provides independent support for our results. The main difference between their study and ours is methodological. Florackis et al. (2022) compare the word distribution of “Item 1A. Risk Factors" in the 10-K reports of training firms with that of the hacked sample, identifying the most similar firms as high cybersecurity risk firms. Their method is simple and powerful, relying only on the textual data in firms’ annual reports. Our machine learning algorithms provide a mapping from any feature set to the occurrence of cyberattacks. The special strength of our approach is its flexibility: the feature set can be expanded to encompass any form of data, including traditional and alternative data. Our paper also joins the burgeoning literature that applies machine learning techniques to asset pricing. This literature has focused on using machine learning algorithms to efficiently select and combine firm characteristics to predict stock returns (see, e.g., Freyberger et al. (2020); Gu et al. (2020)) and to construct a robust stochastic discount factor (see, e.g., Kelly et al. (2019), Kozak et al. (2020), and Chen et al. (2020)). Instead of directly targeting asset returns, our paper uses machine learning techniques to estimate an important yet latent risk as economies become more digital. 
Our approach exploits the correlations between firm characteristics and the likelihood of cyberattacks and shows that the resulting estimate of cyber risk has predictive power for stock returns beyond the usual firm characteristics. Finally, our paper is related to the emerging literature that applies computational linguistics to finance. This literature has made substantial progress in building subject matter lexicons. For instance, Tetlock (2007) used lexicons that are outside the finance field to detect tones in newspapers and their implications for returns and trading volumes. Loughran and McDonald (2011) developed a lexicon that classifies finance-related words into different sentiment classes, and shows that applying this lexicon to 10K filings can predict subsequent returns. More recently, researchers have 49 explored deeper structures of texts that link to capital market activities. For example, Cohen et al. (2020) use changes in 10K texts to predict return, earnings, and bankruptcies. Ke et al. (2019) use supervised learning to extract sentiments from newspapers to predict returns. Bybee et al. (2020) apply topic modeling (an unsupervised word cluster approach) to gauge the relationship between news article topics and macroeconomic activities. Our paper contributes to the literature by introducing dictionaries developed in the cyber risk community to estimate individual firms’ self-assessment of cyber risk in their 10-K filings and using it as an input to our machine learning algorithms to better predict cyberattacks. The rest of the paper is organized as follows. Section 2.2 describes the data. Section 2.3 shows the methodology to construct the cyber risk measure. Section 2.4 presents the key results on the relationship between cyber risk and stock returns. Section 2.5 provides further identification of the cyber risk premium. Section 2.6 concludes. 2.2 Data 2.2.1 Cyberattacks In this paper, we use a novel data set obtained from the Identity Theft Resource Center (ITRC).5 It has a number of advantages over the data set from the Privacy Rights Clearinghouse which has been used previously in the literature.6 It provides more up-to-date data, reports many more cyberattacks on public firms, and importantly, contains the source of the reports, which allows for cross-validation. ITRC provides annual data breach reports from the year 2005, detailing all the reported and confirmed cyberattack incidents for US-based organizations. These reports are stored on their website in the form of PDF files. We obtained the available annual reports from 2005 to 2019 and extracted all items connected to cyber incidents. We first matched them to Compustat firms using a fuzzy matching algorithm, and then manually checked each one of the matched pairs 5 ITRC, https://www.idtheftcenter.org/. On their website: “The ITRC is a non-profit organization established to support victims of identity theft in resolving their cases, and to broaden public education and awareness in the under- standing of identity theft, data breaches, cyber security, scams/fraud, and privacy issues.” See this SEC report that cites the data from ITRC: https://www.sec.gov/files/speech-jackson-cybersecurity-2018-03-15-data-appendix-updated.pdf. 6 Privacy Rights Clearinghouse, 2019, https://www.privacyrights.org/data-breaches. 50 for confirmation. After this process, we are left with 1,010 cyberattack incidents.7 Since we use the cyberattack as a binary variable, we count multiple attacks of a firm in a given year as one instance. 
This leaves us with 552 unique firm-year pairs, with 368 unique firms that experienced at least one cyberattack incident. Because we use accounting variables as predictors, we follow the convention in Fama and French (1996) and define each year as the 12 months from July to the following June. The sample then spans from 2006 to 2018.8 Figure 2.1 plots the number of unique incidents and the unique firm-year pairs for firms that experienced cyberattacks during the sample period. It shows an increasing time trend in cyberattacks. An important question is whether the distribution across industries is uniform, as some industries are more digitized than others, and for some, significantly more data exist. We plot the number of incidents for each of the Fama-French 17 industries per Fama and French (1988) in our sample in Figure 2.2. From Figure 2.2, we find that the financial sector is the hardest-hit industry, with a total of 247 incidents, dwarfing all other sectors. This is not surprising because financial firms possess large amounts of client identification and financial data. With the exception of the “Mining and Minerals” industry, no sector appears immune to cyberattacks as firms increasingly rely on online platforms to conduct their business and on cloud services to store their data. Before leaving this subsection, we shall note that in an important paper, Cong et al. (2023) point out that a large fraction of cybercrimes and cyberattacks are unreported and kept hidden by victim firms. The under-reporting and under-recording would lead researchers to miss instances of cyberattacks, the inclusion of which may increase the statistical power of empirical tests to identify the risk premium. One useful observation from our analyses is that when we restrict our sample of cyberattacks to significant incidents with strong stock market reactions, the algorithm shows stronger power in identifying the risk premium associated with cyber risk. If the incentive to underreport cyberattacks is stronger for more significant incidents, it would lead us to underestimate 7 This increases our data set from the 311 observations from PRC, which is nearly a 3-fold increase. 8 We drop the first 6 months of the calendar year 2006 and the last 6 months of the year 2019, as they do not constitute a whole year per our definition. 51 Figure 2.1 Disclosed Cyberattack Incidents for US Public Firms The figure depicts the number of cyberattack incidents that are disclosed for US public firms from 2006 to 2018. The year convention follows Fama and French (1996), which starts from July to next June. The source of data is the Identity Theft Resource Center (ITRC). The black line indicates the number of unique cyberattack incidents matched to Compustat firms; the blue dashed line shows the number of unique public firms that experienced cyberattack incidents each year. the importance of cyber risk in the stock market. Future research will benefit from a more comprehensive data coverage of cyberattacks. 2.2.2 Other Data The rest of the data sources are as follows. Stock price and return data are from the CRSP. Accounting data are from Compustat. Asset pricing factors are from Kenneth French’s website.9 The 10-K filings are from the University of Notre Dame Repository website.10 9 French, Kenneth, 2019, http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. 10 McDonald, Bill, University of Notre Dame, https://sraf.nd.edu/data/stage-one-10-x-parse-data/. 
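Returning to the construction of the cyberattack sample in Section 2.2.1, the sketch below illustrates the name-matching and firm-year collapse steps. The rapidfuzz scorer, the 90-point cutoff, and the column names (org_name, attack_date, gvkey, conm) are illustrative assumptions rather than the exact algorithm used here, and every candidate pair would still be checked by hand as described above.

import re
import pandas as pd
from rapidfuzz import fuzz, process

def clean(name: str) -> str:
    # Light normalization before scoring: lower-case and strip common corporate suffixes.
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    name = re.sub(r"\b(incorporated|inc|corporation|corp|company|co|ltd|llc)\b", " ", name)
    return " ".join(name.split())

def match_to_compustat(itrc: pd.DataFrame, compustat: pd.DataFrame, cutoff: int = 90):
    choices = {g: clean(n) for g, n in zip(compustat["gvkey"], compustat["conm"])}
    rows = []
    for _, r in itrc.iterrows():
        best = process.extractOne(clean(r["org_name"]), choices,
                                  scorer=fuzz.token_sort_ratio, score_cutoff=cutoff)
        if best is not None:   # candidate pair kept for manual confirmation
            rows.append({"gvkey": best[2], "attack_date": r["attack_date"],
                         "score": best[1], "itrc_name": r["org_name"]})
    return pd.DataFrame(rows)

def ff_year(dates: pd.Series) -> pd.Series:
    # July-June convention: July 2006 through June 2007 is labeled year 2006, and so on.
    return dates.dt.year.where(dates.dt.month >= 7, dates.dt.year - 1)

# Usage (with `itrc` and `compustat` loaded as described above):
# matched = match_to_compustat(itrc, compustat)
# matched["year"] = ff_year(pd.to_datetime(matched["attack_date"]))
# firm_year = (matched.query("year >= 2006 and year <= 2018")   # drop the two partial years
#              .drop_duplicates(subset=["gvkey", "year"])       # one instance per firm-year
#              .assign(victim=1)[["gvkey", "year", "victim"]])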
52 Figure 2.2 Industry Distribution of Cyberattacks The figure depicts the industry distribution of attacked firms based on Fama-French 17 industry classification per Fama and French (1988). The bars show the number of incidents in the sample period in each industry. The hardest hit is the financial industry, with a total of 247 incidents. The least hit is the “Mining and Minerals” industry, which has zero incidents in our sample. The sample runs from 2006 to 2018. 53 2.2.3 Summary Statistics Table 2.1 summarizes the characteristics of firms that experience cyberattacks (victims) and those that do not (non-victims) in our sample. All the variables are measured prior to the year when the incidents happen. The accounting variables are lagged by six months to ensure data availability. We follow Kamiya et al. (2021) in the definitions of the firm characteristics. In addition, we use two linguistic variables: the “Risk Factor Length,” which is based on the length of “Item 1A: Risk Factors” in the firm’s 10-K disclosure, and the “NIST Counts,” which is the number of times a firm mentions a cybersecurity-related term in Item 1A, based on the NIST dictionary. We discuss this dictionary in more detail in Section 2.3. Table 2.1 shows that victim firms tend to be larger, which suggests that hackers target firms with larger customer bases and bigger data sets, as such targets are likely to offer them potentially higher gains. They also tend to have more intangible assets, have higher cash flow, spend less on research and development, have a higher return on assets, and experience higher past year excess returns. These characteristic differences are in line with the results in Kamiya et al. (2021). In addition, victims tend to spend more time discussing cybersecurity-related risks in their 10-K disclosure. In the next section, we describe how we use these characteristics to construct the firm-level measure of cyber risk. 2.3 Measuring Firm-Level Cyber Risk A well-known insurer, Northbridge Insurance, provides a broad definition of cyber risk: “Cyber risk commonly refers to any risk of financial loss, disruption or damage to the reputation of an organization resulting from the failure of its information technology systems."11 To operationalize this notion, we follow the literature (e.g., Kamiya et al. (2021) and Michel et al. (2020)) and treat reported cyberattacks as realized cyber risk. Based on these cyberattacks, we use machine learning algorithms to develop a real-time estimate of the likelihood for each individual firm to experience a cyberattack in the subsequent period. 11 Northbridge Insurance, 2019, https://www.nbins.com/blog/cyber-risk/what-is-cyber-risk-2/. 
Table 2.1 Summary Statistics

Variable                Victim    Mean      STD       P25       P75
Intangibility              0       0.79      0.23      0.70      0.97
                           1       0.82      0.21      0.74      0.97
CAPX/AT                    0       0.04      0.06      0.01      0.05
                           1       0.03      0.04      0.01      0.05
CF/AT                      0       0.02      0.22      0.01      0.11
                           1       0.07      0.09      0.02      0.12
Firm Size                  0       6.68      2.03      5.26      8.02
                           1       9.22      2.30      7.56     10.99
Net Worth                  0       0.45      0.25      0.25      0.66
                           1       0.37      0.21      0.18      0.51
R&D                        0       0.05      0.12      0.00      0.04
                           1       0.02      0.06      0.00      0.01
ROA                        0      -0.02      0.23     -0.01      0.07
                           1       0.04      0.09      0.01      0.07
Tobin's Q                  0       1.86      1.55      1.05      2.06
                           1       1.86      1.19      1.09      2.10
Sales Growth               0       1.56     22.94      0.97      1.18
                           1       1.10      0.23      1.00      1.13
Financial Constraint       0      -0.27      0.81     -0.33     -0.18
                           1      -0.38      0.12     -0.47     -0.29
Annual Exret               0       0.01      0.52     -0.25      0.18
                           1       0.03      0.31     -0.16      0.19
Log(age)                   0       2.11      0.62      1.79      2.56
                           1       2.26      0.60      1.95      2.71
Risk Factor Length         0      7,191     5,288     3,668     9,299
                           1      8,710     6,422     4,304    12,019
NIST Counts                0         95        71        48       122
                           1        135        96        66       178

The table presents the summary statistics for key characteristics of both cyberattack victims and non-victims. There are in total 37,481 firm-year pairs, with 552 cyberattack victim cases, during the period 2006-2018. We show the mean, standard deviation, and 25th and 75th percentiles for each group and present them side by side for comparison. Victims are denoted by Victim = 1, while zero indicates non-victims. The variables are defined in the Appendix. Risk Factor Length is the number of tokens in a firm's 10-K Item 1A. NIST Counts is the number of times a firm mentions cybersecurity terms, per the NIST dictionary, in Item 1A of its 10-K.

2.3.1 Constructing Predictors

Kamiya et al. (2021) document that a set of firm-specific variables is correlated with the probability that a firm experiences a cyberattack. We use these variables as the starting point in building our model. The variables include asset intangibility, the ratio of capital expenditures to total assets (CAPX/AT), the ratio of cash flows to total assets (CF/AT), firm size, net worth, R&D, ROA, Tobin's q, sales growth, financial constraint, the prior-year excess return over the CRSP value-weighted market return, and the natural log of firm age.

In addition, we explore soft information about a firm's cyber vulnerability. One source is a firm's self-assessment of its cyber risk disclosed in the "Item 1A: Risk Factors" section of its annual 10-K filing. The SEC requires that firms disclose the most significant risk factors.12 Per this requirement, firms started to include this section prominently and regularly from 2006. Firms are in a unique position to assess their own risk level given their private information and familiarity with their operations. In addition, this risk disclosure is forward-looking, providing useful information about the perceived exposures. The nature and intensity of the discussion of cyber-related risk in the 10-K filings should therefore be a useful indicator of a firm's vulnerability to cyberattacks. To measure the intensity of perceived cyber risk, we count the number of times a firm mentions cyber risk-related terms in the risk factors section of its 10-K filing as well as the length of the entire section. In this way, we capture information about a firm's self-assessment of both its overall risk exposure and its exposure specific to cyber risk.
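To make these two text variables concrete, the following is a minimal sketch of how a risk-factor length and a dictionary term count could be computed for one Item 1A text block. The tokenizer is simplified, the glossary entries shown are illustrative stand-ins for the NIST dictionary, and the boilerplate screening described in the next subsection is omitted.

```python
import re

def tokenize(text):
    """Simplified tokenizer: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """All sequences of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def text_variables(item_1a_text, glossary):
    """Return (risk factor length, dictionary term count) for one Item 1A block."""
    tokens = tokenize(item_1a_text)
    terms_by_n = {}
    for term in glossary:                       # dictionary terms may span one to seven tokens
        t = tuple(tokenize(term))
        terms_by_n.setdefault(len(t), set()).add(t)
    hits = sum(
        1
        for n, term_set in terms_by_n.items()
        for gram in ngrams(tokens, n)
        if gram in term_set
    )
    return len(tokens), hits

# Illustrative glossary entries; the study uses the NIST "Glossary of Key Information Security Terms".
print(text_variables("A data breach could disrupt our information security systems.",
                     ["data breach", "information security", "malware"]))
```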
To identify cyber risk terms, we exploit a specialized dictionary compiled by the National Institute of Standards and Technology (NIST), named the "Glossary of Key Information Security Terms."13 Two features of this dictionary are worth noting. First, it was compiled in 2006, suggesting that cyber risk was already a concern for governmental bodies at that time. Second, because our sample starts in 2006, using it avoids a potential look-ahead bias.14

Next, we turn to firms' annual 10-K filings. Since the SEC mandate for disclosing risk factors came into effect in 2006, we start our sample in 2006. We obtain the filings from 2006 to 2018 through the University of Notre Dame Repository and extract their risk factor sections.15 We use the central index key (CIK) to match the filing firms to Compustat firms. Then we count the number of tokens per risk factor text block, keeping only the blocks with at least 100 tokens.16

The final step in constructing the cyber risk disclosure variable is to measure the intensity with which cybersecurity-related terms are mentioned. We apply the following procedure. First, we measure the maximum number of tokens in each term in each dictionary; we find that the NIST dictionary contains terms that can be as long as seven tokens. Second, to control for the fact that there might be common boilerplate language that most firms use, we eliminate the common terms used by firms in each year, as in Hoberg and Philips (2016). Third, we transform each risk factor text block into the collection of its N-grams, where N ranges from 1 to 7; an N-gram is a sequence of N adjacent tokens in a document.17 Finally, we count how many times each cybersecurity term occurs in each N-gram-transformed text block.

We use the length of the risk factor section and the NIST term frequency as two separate variables for two reasons. First, the length of the risk factor section measures the number of tokens, or uni-grams, while the NIST term frequency is the number of times each term is mentioned; because the terms range from uni-grams to seven-grams, it is inappropriate to scale the frequency by the length of the section. Second, we keep both variables in the feature set to maximize the power of the machine-learning techniques.

12 See SEC, 2005, https://www.sec.gov/rules/final/33-8591.pdf.
13 NIST, April 25, 2006, https://www.nist.gov/publications/glossary-key-information-security-terms.
14 A contemporaneous study by Jamilov et al. (2020) uses a dictionary with a different set of words and phrases. We show in Section ?? that our results are robust to using alternative dictionaries.
15 We use an algorithm to match the section between the beginning of Section 1A and either Section 1B or Section 2.
16 A token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing (Schütze et al. (2008)). Here we refer to each term as a token after preprocessing the text, such as dropping stop words.
17 See, e.g., Damashek (1995) for a description of N-grams.

2.3.2 Constructing the Predictive Model

2.3.2.1 In-sample Fit

Before building the predictive models, we examine the in-sample explanatory power of the regressors we selected. In particular, we perform in-sample logit regressions that include the accounting and text-based variables, all cross-sectionally standardized to have means of zero and standard deviations of one.
Following Petersen (2009), we cluster standard errors by both firm and year. The results in Table ?? of the Appendix show that many firm characteristics have strong associations with the probability of future cyberattacks. In terms of magnitudes, firm size is particularly important, with larger firms more likely to be attacked. We also find that the text-based cyber risk count variables have positive predictive power for future cyberattacks. Interestingly, as shown in Column (1), the total number of tokens in the Risk Factors section of a firm's 10-K filing has a negative association with the probability of future cyberattacks. This result suggests that firms with more thorough discussions of their risk factors may also have better risk management practices.

2.3.2.2 Building Machine Learning Prediction Models

To build an effective cyberattack forecasting model, there are two main considerations. First, an in-sample fit could induce look-ahead bias and overfitting. Second, cyberattacks are rare events, which result in a highly imbalanced sample, thereby introducing biases into the maximum likelihood estimator (King and Zeng, 2001).

To address the possible look-ahead bias, we follow a recursive and expanding prediction procedure. Namely, we start from the year 2006, run a predictive model, save the coefficients, and then use them to generate predictions for 2007. We do this recursively, so the final step uses the data from 2006 to 2017 to train the model and predict for 2018. In this way, we avoid the look-ahead bias and generate true out-of-sample (OOS) predictions.

Next, to address overfitting and maximize OOS performance, we introduce machine learning classification models to improve the predictive power. In particular, we select logistic ridge regression as our main model.18 The logistic ridge regression combines the logistic regression with ridge regularization, which uses an L2 penalty.19 The objective function for logistic ridge regression is:

\min_{\beta_0,\,\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \; -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \left(\beta_0 + \boldsymbol{x}_i^{T}\boldsymbol{\beta}\right) - \log\left(1 + e^{\beta_0 + \boldsymbol{x}_i^{T}\boldsymbol{\beta}}\right) \right] + \frac{\lambda}{2} \lVert \boldsymbol{\beta} \rVert_2^2        (2.1)

where p is the number of predictors and λ controls the regularization strength.

18 We show in Section ?? additional results based on ensemble methods and find our results to be robust to the model choice.
19 Another popular algorithm in asset pricing is LASSO, which uses an L1 penalty to achieve model sparsity. Since we are primarily interested in the forecasting performance of the model rather than variable selection, we focus on logistic ridge regression. With LASSO, we obtain weaker but consistent results.

To fully exploit the conditioning information and the logistic ridge regression's regularization capability, we perform a polynomial transformation of our regressors.20 A polynomial transformation of degree d converts the variables to all possible interactions and powers up to degree d. In our implementation, we choose a polynomial transformation of degree 4.

20 See Hastie et al. (2017) for an introduction to polynomial transformations. This transformation is implicitly used in kernel methods such as support vector machines.

In each training window, we tune hyperparameters for optimal performance via three-fold cross-validation, using the training data in each recursive sample period. Cross-validation proceeds as follows: for the training data in each window, we randomly split the training data into three folds, i.e., three subsets of equal size; we use two of the three folds to fit the model and the remaining fold as the "validation set"; we iterate this procedure three times and select the best estimator in terms of the performance metric on the validation sets; then we use this estimator for prediction on the test data.
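To make this procedure concrete, the following is a minimal sketch of the recursive, expanding-window estimation using scikit-learn. The panel layout, column names, and hyperparameter grid are illustrative assumptions rather than our exact implementation; the scaling, stratified folds, and class-weight tuning discussed below enter through the pipeline and the grid.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def recursive_forecasts(panel: pd.DataFrame, features: list, target: str = "attacked"):
    """Expanding-window out-of-sample forecasts of cyberattack probabilities.

    `panel` is assumed to have one row per firm-year, a `year` column, the
    predictor columns listed in `features`, and a 0/1 `attacked` column.
    """
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=4, include_bias=False)),   # interactions and powers
        ("scale", StandardScaler()),                                  # one reasonable scaling choice
        ("clf", LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000)),
    ])
    grid = {
        "clf__C": [0.001, 0.01, 0.1, 1.0],                            # inverse of the ridge penalty
        "clf__class_weight": [None, "balanced", {0: 1, 1: 10}],       # illustrative weight settings
    }
    preds = []
    years = sorted(panel["year"].unique())
    for test_year in years[1:]:                                       # expanding training window
        train = panel[panel["year"] < test_year]
        test = panel[panel["year"] == test_year]
        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
        search = GridSearchCV(pipe, grid, cv=cv, scoring="f1")
        search.fit(train[features], train[target])
        out = test[["year"]].copy()
        out["cyber_risk"] = search.predict_proba(test[features])[:, 1]
        preds.append(out)
    return pd.concat(preds)
```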
Since regularization is sensitive to the units of the variables, we scale the variables before the training process (Hastie et al., 2017). Since our sample is highly imbalanced, we use the stratified K-fold strategy of Zeng and Martinez (2000) to ensure that the class distribution is maintained across all folds. In addition to tuning the penalty factor λ, we also tune the class weight parameter in the loss function, using a heuristic proposed by King and Zeng (2001).21

21 King and Zeng (2001) show that in highly imbalanced learning problems, the coefficients can be biased towards the majority class, effectively rendering the model useless. They propose several measures to mitigate the bias, including the prior correction procedure and weighted logistic regression. We use the second option as it is less costly and shows similar performance compared with the prior correction procedure. The weighted log-likelihood is ln L_w(β | y) = w_1 Σ_{Y_i = 1} ln π_i + w_0 Σ_{Y_i = 0} ln(1 − π_i). The intuition for the weighted logistic regression is that the objective function is weighted by coefficients w that are inversely related to the sample class distribution. Consequently, the minority class is weighted more heavily, thus mitigating the bias. We tune w via cross-validation to add extra flexibility.

2.3.2.3 Filtering the Sample

In our study, we wish to capture significant cyberattacks with meaningful impacts on the victim firms. In a contemporaneous study, Florackis et al. (2022) select events that are prominently featured in media reports as a filter that identifies important attacks. We take a similar but more direct approach: we use a filter based on the 7-day cumulative abnormal returns (CAR) around the events. Specifically, we first calculate the [-3,+3] window CAR for each event using a four-factor model, the Fama-French three factors and the momentum factor, as the benchmark. Then we choose the events with an absolute value of CAR greater than 1%; we consider these events important to investors, given the stronger market reaction.22 After applying this filter, our sample consists of 394 cyberattacks from 2007 to 2018.

2.3.2.4 Evaluating the Forecasting Performance

To evaluate the performance of the logistic ridge regression against the simple logistic regression, we follow the machine learning literature and choose three commonly used metrics of forecasting performance: the F1-score, G-mean, and Balanced Accuracy.23 The building blocks for these metrics are recall (also called sensitivity), precision, and specificity:

Recall = Sensitivity = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
Specificity = True Negatives / (True Negatives + False Positives)

The F1-score is the harmonic average of recall and precision. G-mean is the geometric average of recall and specificity. Balanced Accuracy is the arithmetic average of sensitivity (recall) and specificity.

22 We choose 1% because Kamiya et al. (2021) document that attacked firms on average experienced a -1% CAR around the events. Our results are robust to different specifications. We show in Section ?? that using 0.5% absolute CAR as the filter or using no filter produces similar results.
23 See Brodersen et al. (2010); Yue et al. (2007); He and Garcia (2009) for discussions of these metrics.
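These three metrics follow directly from a confusion matrix. The sketch below, a minimal illustration using scikit-learn's confusion_matrix with hypothetical label vectors, simply restates the definitions above in code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    """F1-score, G-mean, and Balanced Accuracy for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    g_mean = np.sqrt(recall * specificity)
    balanced_accuracy = (recall + specificity) / 2
    return f1, g_mean, balanced_accuracy

# Hypothetical out-of-sample labels and predictions for one test year.
print(imbalance_metrics([0, 0, 0, 1, 1, 0], [0, 1, 0, 1, 0, 0]))
```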
Table 2.2 compares the average forecasting performance of the logistic ridge regression with that of the commonly used simple logistic regression over our sample period. Figure 2.3 presents the time series of the performance metrics.

Table 2.2 Performance Metrics

Metrics                      F1-score           G-mean          Balanced
                         (Harmonic Mean)   (Geometric Mean)     Accuracy
Logit                         0.004              0.014            0.501
Logistic Ridge                0.063              0.599            0.650
Logistic Ridge - Logit        0.059***           0.584***         0.149***
                             (0.010)            (0.043)          (0.022)

Note: * p<0.1; ** p<0.05; *** p<0.01

This table reports the time-series mean out-of-sample performance metrics across the recursive windows for logit and logistic ridge. F1-score is the harmonic mean of precision and recall. G-mean is the geometric mean of sensitivity and specificity. Balanced Accuracy is the arithmetic mean of sensitivity and specificity. Each recursive window contains n + 1 years of data, with n ∈ [1, 10] years as the training data and the subsequent year as the test data. There are in total 11 windows, starting with the year 2007 as the training data and 2008 as the test data. The final window consists of the years 2007-2017 as the training data and 2018 as the test data. The first two rows are the means of each corresponding metric across the 11 windows. The third row reports the mean difference between the logistic ridge and logit, and the fourth row reports the standard error of the difference in means.

The results show that the logistic ridge regression strongly outperforms the baseline logistic regression across all three performance metrics. For instance, the average F1-score of the logistic ridge regression is approximately 15 times as large as that of the logistic regression, and the average G-mean of the logistic ridge regression is approximately 40 times as large as that of the logistic regression. The Balanced Accuracy of the logistic ridge regression exceeds that of the logistic regression by approximately 30%. The strong outperformance of the logistic ridge regression is persistent through time.

Figure 2.3 Out-of-Sample Performance Metrics
The figure depicts the time series of three out-of-sample performance metrics for both logit and logistic ridge regression across the test years. The three metrics are the F1-score, G-mean, and Balanced Accuracy. F1-score is the harmonic mean of precision and recall. G-mean is the geometric mean of sensitivity and specificity. Balanced Accuracy is the arithmetic mean of sensitivity and specificity. The black solid line indicates logit, while the blue dashed line represents logistic ridge. The metrics are all measured against test samples, and thus the figure runs from 2008 to 2018.

To understand the drivers of the superior performance of the logistic ridge regression, we plot the aggregate (or, equivalently, mean) confusion matrices for the logistic ridge and logistic regressions in Figure 2.4. The rows of a confusion matrix are the true classes, while the columns are the predicted classes. Each of the four quadrants reports the total number of observations across the sample period classified accordingly by the prediction model. In a confusion matrix, a superior forecasting technique registers higher numbers in the diagonal elements, meaning that it allocates more observations to their correct classes.
The results reveal the source of the weakness of the simple logistic regression: it overwhelmingly predicts firms as negative cases and under-predicts the positive cases. Among the firms that experience a cyberattack the following year, the simple logit falsely predicts more than 99% of them as negative cases. In comparison, the logistic ridge regression falsely predicts approximately 50% of them as negative cases. In other words, the logistic ridge regression gains a large jump in recall in exchange for a small decrease in precision and thus achieves a better balance in performance. In the rest of the paper, we use the cyber risk measure based on the logistic ridge regression as the primary measure. We present the results based on alternative machine learning techniques in Section ?? as robustness tests.

Figure 2.4 Confusion Matrices
The figure shows the aggregate (mean) confusion matrices for logit and logistic ridge regression. That is, we aggregate the numbers in the four quadrants across the 11 test windows for each model. The rows are true classes, while the columns are predicted classes. Each quadrant presents the number of observations that the classification model allocates to it. The diagonal numbers represent the total number of observations that are correctly allocated to their respective classes. For example, the logit in the left panel correctly predicts 1 cyberattack case, while misallocating 376 cases as non-victims. In contrast, the logistic ridge regression in the right panel correctly predicts 184 cyberattack cases, while misallocating 193 cases to the non-victim class.

2.3.2.5 Characterizing the Firms with High Cyber Risk

Figure 2.5 shows the correlation coefficients between our cyber risk measure and firm characteristics. The left column presents the time-series averages of the cross-sectional Spearman correlations, and the heat map is based on the absolute value of the average correlation coefficient. The average correlation coefficients indicate that firms with higher cyber risk tend to be larger, less financially constrained, and more profitable. The heat map also shows that these three characteristics tend to have high and stable correlations with cyber risk across time. In addition, the text-based variables have high correlations with cyber risk.

Figure 2.5 Absolute Rank Correlations between Predictors and Cyber Risk
The figure depicts the absolute value of the rank correlations between each of the 14 predictor variables and the estimated cyber risk from the logistic ridge regression for each test year, as well as the time-series mean of the rank correlations for each variable. Specifically, for each test year, we compute the rank correlation between each variable and the cyber risk measure, and then take the time-series mean for each variable and report it in the left column of the figure. We then compute the absolute values of the rank correlations of all variables across the test years. In so doing, we have 11 sets of coefficients, since the test years run from 2008 to 2018. We plot them in a heat map, as shown in the figure. The color depth represents how large the coefficient is relative to the other variables. For example, firm size is consistently important across years, while CAPX/AT has been consistently less important.

Before leaving this section, we examine the relation between our cyber risk measure and the Cyber Cosine measure developed by Florackis et al. (2022).
Although both measures intend to capture firm-level cyber risk, the methodologies are distinct. It is therefore of interest to examine the relation quantitatively. We find that the correlation between the two measures is approximately 13.8%, which is positive but moderate. Because the Cyber Cosine measure is purely text-based, it also makes sense to compare the correlation between their measure and the textual variables we use in our cyber risk forecasting model. We find that the correlation between Cyber Cosine and the "NIST Counts" is 0.40 and that between Cyber Cosine and "Risk Factor Length" is 0.27, suggesting that our textual variables have a stronger comovement with Cyber Cosine. Overall, these results show that our cyber risk measure and the Cyber Cosine measure share a moderate amount of common information. In other words, both measures should be of value for studies on cyber risk.

2.4 Cyber Risk and Stock Returns

In this section, we study the relationship between cyber risk and stock returns. We start with cross-sectional analyses using both Fama and MacBeth (1973b) regressions and portfolio analyses. Then we examine the time variation in the return spread between firms with high and low cyber risk, focusing on its comovement with other measures of aggregate cyber risk concern.

2.4.1 Cross-sectional Regressions

To examine whether firms with higher cyber risk compensate investors with higher average returns, we use Fama-MacBeth cross-sectional regressions to assess the incremental predictive power of the cyber risk measure for stock returns, controlling for well-known stock characteristics proposed in the previous literature. Specifically, at the end of each month from July of year t to June of year t + 1, we regress individual stock returns on our cyber risk measure estimated in year t and a set of firm characteristics. The cyber risk estimate is based on firms' accounting information for the fiscal year ending anytime in the calendar year t - 1, the cyberattack forecasting model parameters estimated using the accounting information available up to June of year t - 2, and the cyberattack incidents available up to the end of June of year t - 1. The year t ranges from 2008 to 2018. All regressors are cross-sectionally standardized to have a mean of zero and a standard deviation of one. We then test whether the time-series averages of the regression coefficients are statistically significant. The standard errors are based on the Newey and West (1987) adjustment with 6 lags. For a stock to be included in the analysis, it is required to have a CRSP share code of 10 or 11 and a closing price above $5 at the end of June in year t.

Table 2.3 shows the results. In Column (1), we include our cyber risk measure, together with control variables including the natural log of market cap, the natural log of the book-to-market ratio, gross profitability, asset growth, and momentum (measured as the prior twelve- to one-month return). Column (2) replaces our cyber risk measure with the Cyber Cosine measure of Florackis et al. (2022). Column (3) jointly includes our cyber risk measure and the Cyber Cosine measure. Column (4) further includes the past one-month return to control for short-term return reversal, idiosyncratic volatility (Ang et al., 2006), and illiquidity (Amihud, 2002). Column (5) adds organizational capital (Eisfeldt and Papanikolaou, 2013), CAPM beta, tail risk beta (Kelly and Jiang, 2014), coskewness (Harvey and Siddique, 2000), and net operating assets (Hirshleifer et al., 2004).
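A minimal sketch of the Fama-MacBeth procedure described above follows, assuming a monthly panel with already standardized regressors. The Newey-West adjustment with 6 lags is applied to the time series of monthly coefficient estimates, and the column names are illustrative.

```python
import pandas as pd
import statsmodels.api as sm

def fama_macbeth(panel: pd.DataFrame, y: str, xvars: list):
    """Monthly cross-sectional regressions, then Newey-West t-stats on the mean coefficients.

    `panel` is assumed to have a `month` column, the return column `y`, and the
    (already standardized) regressors listed in `xvars`.
    """
    monthly = []
    for month, cross_section in panel.groupby("month"):
        X = sm.add_constant(cross_section[xvars])
        monthly.append(sm.OLS(cross_section[y], X).fit().params)
    coefs = pd.DataFrame(monthly)

    # Mean coefficient and Newey-West (6 lags) standard error for each regressor.
    out = {}
    for var in coefs.columns:
        ts = sm.OLS(coefs[var], pd.Series(1.0, index=coefs.index)).fit(
            cov_type="HAC", cov_kwds={"maxlags": 6})
        out[var] = (ts.params.iloc[0], ts.bse.iloc[0])
    return pd.DataFrame(out, index=["mean_coef", "nw_se"]).T

# Hypothetical usage:
# fama_macbeth(panel, y="ret", xvars=["cyber_risk", "log_mktcap", "log_bm", "gp", "atg", "mom"])
```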
The results in Table 2.3 show a strong relation between our cyber risk measure and future stock returns. For instance, Column (1) indicates that a one-standard-deviation increase in cyber risk is associated with an approximately 20 basis-point increase in returns per month, or 2.4% per year. The effect is statistically significant at the 1% level. Consistent with Florackis et al. (2022), we find in Column (2) that the Cyber Cosine measure has strong predictive power for future stock returns. When we include both variables in the same regression in Column (3), both our cyber risk measure and the Cyber Cosine measure have strong relations with future stock returns. In terms of magnitudes, the coefficient for our cyber risk measure is 0.186 and that for the Cyber Cosine measure is 0.093. This result indicates that our cyber risk measure captures independent information about firms' cyber risk exposure relative to the Cyber Cosine measure of Florackis et al. (2022). Columns (4) and (5) further establish the robustness of the results when we include more control variables.

Table 2.3 Fama-MacBeth Cross-sectional Regressions

Dependent Variable: Returns
                  (1)         (2)         (3)         (4)         (5)
Cyber Risk       0.197***               0.186***    0.148**     0.212***
                (0.067)                (0.064)     (0.063)     (0.065)
Cyber Cosine                 0.132***   0.093**     0.100**     0.049
                            (0.049)    (0.043)     (0.043)     (0.044)
Log(MktCap)     -0.127      -0.006     -0.133      -0.159      -0.219*
                (0.158)     (0.139)    (0.152)     (0.128)     (0.120)
B/M              0.163       0.204      0.160       0.087       0.085
                (0.148)     (0.143)    (0.149)     (0.150)     (0.144)
GP               0.098       0.155      0.103       0.081       0.393**
                (0.108)     (0.112)    (0.107)     (0.100)     (0.152)
ATG             -0.247***   -0.225***  -0.249***   -0.244***   -0.112
                (0.055)     (0.058)    (0.055)     (0.056)     (0.076)
MOM             -0.089      -0.081     -0.091      -0.119      -0.202
                (0.181)     (0.183)    (0.182)     (0.188)     (0.179)
ST_Rev                                             -0.345***   -0.381***
                                                   (0.107)     (0.126)
IdioRisk                                           -0.226      -0.101
                                                   (0.146)     (0.191)
Illiquidity                                         0.134       0.258*
                                                   (0.111)     (0.139)
Other Controls    NO          NO         NO          NO          YES
Observations    289,696     310,243    275,301     275,301     132,924
R-squared        0.031       0.030      0.032       0.046       0.078

Note: * p<0.1; ** p<0.05; *** p<0.01

This table reports Fama-MacBeth cross-sectional regressions of returns on firm characteristics and anomaly variables. For each month, we run a cross-sectional regression of stock returns on firm characteristics, and then we average the coefficients over the time series. The sample runs from July 2008 to June 2019. Column (1) includes the natural log of market cap, the natural log of the book-to-market ratio, gross profitability, asset growth, and momentum. Column (2) replaces our cyber risk measure with Cyber Cosine (Florackis et al., 2022). Column (3) adds back our cyber risk measure. Column (4) further includes the past one-month return to control for the short-term return reversal, idiosyncratic volatility (Ang et al., 2006), and illiquidity (Amihud, 2002). Column (5) adds organizational capital (Eisfeldt and Papanikolaou, 2013), past market beta, tail beta (Kelly and Jiang, 2014), coskewness (Harvey and Siddique, 2000), and net operating assets (Hirshleifer et al., 2004). Standard errors are Newey-West standard errors with 6 lags and are reported in parentheses. Variable definitions can be found in Appendix A. Returns are measured as percentages, and all independent variables are cross-sectionally winsorized at the [1%, 99%] level and standardized to have zero mean and a standard deviation of unity.

2.4.2 Portfolio Sorts

We also use the portfolio approach to examine the relationship between cyber risk and stock returns.
At the end of each June from 2008 to 2018, we sort stocks into three tercile portfolios, four quartile portfolios, and five quintile portfolios according to the cyber risk measure. These portfolios are held until the end of June in year t + 1 and then rebalanced based on the updated cyber risk estimate when new information becomes available. We create zero-cost hedge portfolios that buy high cyber-risk stocks and short low cyber-risk stocks. We compute the monthly equal- and value-weighted alphas on the hedge portfolios using a number of asset pricing models: the CAPM, the Fama-French three-factor model (FF3), FF3 augmented by a momentum factor (FF4), the Fama-French five-factor model (FF5), and FF5 augmented with a momentum factor (FF6).

Table 2.4 shows that the hedge portfolios generate large alphas across the different portfolio constructions and against the various benchmark models. The monthly alpha ranges from 40 to 80 basis points per month and is always statistically significant. These results show that investors in high cyber-risk stocks earn high average returns, which cannot be explained by these asset pricing models.

2.4.3 Comovement with the Index of Cybersecurity

In addition to the cross-sectional variation in average returns, we examine the time variation in the relative returns between stocks with high and low cyber risk. In this subsection, we relate the return spread to an index based on an independent survey conducted by New York University on the perception of cyber risk among industry experts: the Index of Cybersecurity (ICS).24 Each month, the survey queries experts such as chief risk officers, chief information security officers, selected academicians engaged in fieldwork, and selected security product vendors' chief scientists about their perceived level of cyber risk, from which an aggregate measure of cyber risk is built.

24 ICS, NYU Engineering, https://wp.nyu.edu/awm1/. According to the website, "The Index of Cyber Security is a measure of perceived risk. A higher index value indicates a perception of increasing risk, while a lower index value indicates the opposite."

Table 2.4 Portfolio Analyses

                        Value-Weighted                      Equal-Weighted
Model           Tercile   Quartile   Quintile       Tercile   Quartile   Quintile
CAPM            0.574***  0.765***   0.843***       0.508***  0.645***   0.713***
               (0.176)   (0.207)    (0.227)        (0.154)   (0.171)    (0.180)
FF3             0.458***  0.658***   0.739***       0.437***  0.581***   0.657***
               (0.122)   (0.190)    (0.219)        (0.131)   (0.154)    (0.157)
FF4             0.461***  0.660***   0.742***       0.433***  0.577***   0.653***
               (0.127)   (0.193)    (0.226)        (0.121)   (0.141)    (0.150)
FF5             0.431***  0.536***   0.577***       0.425***  0.541***   0.616***
               (0.116)   (0.157)    (0.179)        (0.138)   (0.160)    (0.165)
FF6             0.438***  0.540***   0.585***       0.419***  0.533***   0.611***
               (0.124)   (0.161)    (0.190)        (0.131)   (0.151)    (0.163)

Note: * p<0.1; ** p<0.05; *** p<0.01

This table provides various portfolio analyses for the universe of stocks sorted on predicted ex-ante cyberattack probabilities. The prediction model used here is the logistic ridge regression. At the end of each June from 2008 to 2018, we sort stocks into three tercile portfolios, four quartile portfolios, and five quintile portfolios according to the estimated cyber risk measure. These portfolios are held until the end of June in year t + 1 and then rebalanced based on the updated cyber risk estimate when new information is available. We create zero-cost hedge portfolios that buy high cyber-risk stocks and short low cyber-risk stocks.
We compute the monthly equal- and value-weighted alpha on the hedge portfolios using a number of asset pricing models: the CAPM, Fama-French three-factor model (FF3), FF3 augmented by a momentum factor (FF4), Fama-French five-factor model (FF5), and FF5 augmented with a momentum factor (FF6). The left half panel reports value-weighted results, while the right half panel reports equal-weighted results. Standard errors are Newey-West standard errors with 6 lags and are included in parentheses. To examine the relationship between our annual firm-level cyber risk measure and the monthly aggregate ICS, we project both variables onto the return space. For each stock in our sample, we estimate its return sensitivity to the monthly percentage change in the ICS. That is, we perform bivariate regressions of monthly excess returns of individual stocks on the monthly changes in the ICS and the excess returns on the market portfolio. Stocks with a high (low) ICS beta tend to have lower (higher) cyber risk exposures based on the ICS. Therefore, a long-short portfolio that buys stocks with a high ICS beta and sells those with a low ICS beta should deliver higher performance when aggregate cyber risk concern is high. If our firm-level cyber risk measure and the ICS capture 69 some common component of cyber risk in the economy, we would expect the portfolio that buys high cyber-risk stocks and sells low cyber-risk stocks to have a negative correlation with the ICS factor-mimicking portfolio. We apply a 12-month rolling window to estimate the ICS beta for individual stocks, requiring at least six observations for the statistical estimation. We form five quintile portfolios at the end of each month from December 2015. Then we repeat this process for each subsequent month. Therefore, we have monthly ICS beta portfolios from December 2015 to June 2019, with concurrent observations for our cyber risk-based quintile portfolios. Panel A of Figure 2.6 shows the performance of the portfolio that buys stocks with high cyber risk and sells those with low cyber risk (the solid black line) against the return on the portfolio that buys high ICS beta stocks and sells low ICS beta stocks (the dotted red line). Consistent with our conjecture, the two series show a strong negative comovement. The time-series correlation coefficient is -41.2% and statistically significant. 2.4.4 Comovement with the Cybersecurity ETFs In this subsection, we use the return on cybersecurity index ETFs as another proxy for aggregate concern about cyber risk. The conjecture is that when investors have stronger concerns about cyber risk, cybersecurity ETFs tend to outperform. Our analyses include two such ETFs: the First Trust Nasdaq Cybersecurity ETF (CIBR)25 and ETFMG Prime Cyber Security ETF (HACK).26 Both ETFs track a set of firms that provide cybersecurity services. Table 2.5 shows their top ten holdings. Panel B of Figure 2.6 plots the monthly returns on the spread portfolio that buys high cyber risk stocks and sells low cyber risk stocks (the solid dark line) against the performance of the cyberse- curity ETFs (the dotted blue and dashed green lines). It shows a strong negative comovement. The time-series correlations are below -40% for both ETFs. If we perform a time series regression of our cyber risk portfolio return against the two ETF returns individually, the coefficients are -0.405 and -0.312, respectively, for CIBR and HACK, and are statistically significant at the 1% level. 
This result provides further validation for our cyber risk measure. 25 CIBR, https://www.ftportfolios.com/retail/etf/etfsummary.aspx?Ticker=CIBR. 26 HACK, https://etfmg.com/funds/hack/. 70 Figure 2.6 Comovement with ICS and Cybersecurity ETFs & Major Cyberattacks The figure shows the comovement of the return series of our high-minus-low cyber risk port- folio and that of the ICS beta portfolio, and that of two cybersecurity ETFs. In Panel A, we plot the return series of our high-minus-low cyber risk portfolio and ICS beta portfolio. The ICS beta portfolio is constructed at a monthly frequency by sorting stocks into quintile portfolios based on stocks’ return betas with respect to the monthly change of the ICS index, control- ling for market excess returns during the past 12 months. The two series have a correlation of -41.2%. In Panel B, we plot the monthly returns of the First Trust Nasdaq Cybersecu- rity ETF (CIBR) and ETFMG Prime Cyber Security ETF (HACK) from August 2015 to June 2019. The time-series correlation between the returns on the long-short cyber risk portfolio and the ETFs are -47.6% and -41.5% for CIBR and HACK, respectively. In both panels, we also mark the months that major cyberattacks happened, where the major attacks are accord- ing to a Bloomberg report (URL: https://www.bloomberg.com/graphics/corporate-hacks-cyber- attacks/?sref=GlJWBQ7Q). We screen the major attacks to only include public firm attacks during our sample period, which include Yahoo!, Equifax, Under Armor, Facebook, and Quest Diagnostics. Returns and events are plotted at month ends. 71 Table 2.5 Cybersecurity ETFs: Top 10 Holdings CIBR HACK Holding Weight Holding Weight CrowdStrike Holdings, Inc. (Class A) 7.49% Blackberry Ltd. 3.57% Zscaler, Inc. 7.08% Cisco Sys Inc. 3.13% Cisco Systems, Inc. 5.60% Palo Alto Networks Inc. 2.81% Accenture Plc 5.25% Cyberark Software Ltd. 2.80% Splunk Inc. 4.41% Ping Identity Hldg Corp 2.77% FireEye, Inc. 3.41% FireEye Inc. 2.76% Palo Alto Networks, Inc. 3.38% Sumo Logic Inc. 2.71% Proofpoint, Inc. 3.28% Commvault Systems Inc. 2.69% SailPoint Technologies Holdings, Inc. 3.27% Cloudflare Inc. 2.64% Fortinet, Inc. 3.22% Qualys Inc. 2.64% This table presents the top ten holdings of two cybersecurity ETFs, respectively: First Trust Nasdaq Cybersecurity ETF (CIBR) and ETFMG Prime Cyber Security ETF (HACK). The holdings are as of February 2021. As a still further validation, we hypothesize that when major cyberattacks happen, high cyber- risk firms would underperform low cyber-risk firms. To examine this conjecture, we first identify major cyberattacks via a Bloomberg report.27 Because the list consists of private and public firms, we further screen the list to include only public firms to match our sample. We then manually check the dates to ensure they reflect when the attacks became public. The final list includes the following firms: Yahoo! (September 2016), Equifax (September 2017), Under Armor (March 2018), Facebook (September 2018), and Quest Diagnostics (June 2019). We mark these events in both panels of Figure 2.6. The figure shows clearly that our cyber risk high-minus-low portfolio underperformed when major cyberattacks happened, while the ICS beta portfolio and cybersecurity ETFs overperformed, consistent with our conjecture. These results provide further evidence for the validity of our cyber risk measure. 27 Bloomberg report: https://www.bloomberg.com/graphics/corporate-hacks-cyber-attacks/?sref=GlJWBQ7Q. 72 2.5 Is Cyber Risk Priced? 
Further Identification The preceding results show a strong relation between our cyber risk measure and future stock returns, which is consistent with the view that cyber risk is priced in the stock market. In this section, we provide further tests that strengthen the identification of the cyber risk premium. 2.5.1 Industry Competition We start by exploiting the variation along the dimension of industry competition. Kamiya et al. (2021) show that when a firm is hacked, its peer firms in the same industry tend to experience stock price drops around the announcement. This result is consistent with the view that increased awareness of cyber risk exposure for peer firms leads the stock market to reprice their stocks. Building on this observation, we hypothesize that the average market reaction to peer firms is confounded by the nature of product market competition in the industry. For instance, if the hacked firm is a weaker player in the product market, the incidence of a cyber attack is likely to provide its stronger competitor an opportunity to increase market share. Thus, the stock market reaction to the cyber risk exposure of the peer firm can be muted by the stock market’s perception of its improved product market opportunities. However, this confounding effect is likely to be weaker when the hacked firm is a stronger player in the industry: It is harder for a weaker competitor to exploit the potential opportunity to the same extent. To empirically test this hypothesis, for each hacked firm, we identify its peers in the same industry based on the Fama-French 48-industry classification (Fama and French, 1997). We construct a variable “Strong Victim” to capture the relative competitive position of the hacked firm (victim) to a peer firm: It equals one if the victim has a higher market share (firm sales over industry sales) than that of the peer firm, and zero otherwise. The idea is that if the victim is a stronger (weaker) competitor of the peer firm, the peer firm is less (more) likely to increase its market share in response to a cyberattack on the victim firm. As a result, the negative market reaction to the peer firm around the time period when the data breach of the victim becomes public would be stronger. This is because it is a cleaner test of the market price reaction of a peer’s stock as investors re-calibrate their estimate of its cyber risk in response to the victim getting hacked. That is, in a 73 regression of the [-3,+3] window cumulative abnormal return (CAR) of the peer firm on the [-3,+3] window CAR of the victim firm around the date when the data breach is announced, the coefficient for the interaction of the “Strong Victim” dummy variable and the victim’s CAR is expected to be positive. Table 2.6 presents the results based on the cyberattack events when the victim firm experiences a CAR of lower than -1% during the [-3,+3] window. In Column (1), we confirm the analysis in Kamiya et al. (2021) that when focal firms experience a cyberattack, peer firms in the same industry also suffer, demonstrated by the statistically significant and positive coefficient for “Victim CAR”. Column (2) shows that the “Strong Victim” variable itself has no effect on the stock market reaction of peer firms. However, when we interact “Strong Victim” with “Victim CAR”, the coefficient for the interaction is statistically significant and positive. 
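A minimal sketch of this interaction specification is given below, assuming an event-level panel of peer-victim pairs with pre-computed CARs; for simplicity the sketch clusters standard errors by industry only, whereas the reported results cluster by both industry and event, and all column names are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf

def peer_reaction_regression(pairs: pd.DataFrame):
    """Regress peer CARs on victim CARs interacted with the Strong Victim dummy.

    `pairs` is assumed to hold one row per peer-victim event pair with columns:
    peer_car, victim_car, strong_victim (0/1), size, bm, industry, year.
    """
    model = smf.ols(
        "peer_car ~ victim_car * strong_victim + size + bm + C(industry) + C(year)",
        data=pairs,
    )
    # One-way clustering by industry for simplicity; the paper clusters by industry and event.
    return model.fit(cov_type="cluster", cov_kwds={"groups": pairs["industry"]})

# res = peer_reaction_regression(pairs)
# print(res.params["victim_car:strong_victim"])   # expected positive under the hypothesis
```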
The spillover of the negative stock market reaction from the victim to its peer firms is stronger when the confounding effect from the product market is weaker. This reinforces our claim that cyber risk is an important driver of stock prices.

Table 2.6 Stock Price Reaction of Peer Firms to Cyberattacks and Industry Competition

Dependent Variable: Peer CAR [-3,+3]
                               (1)         (2)         (3)
Victim CAR                   0.046***    0.046***    0.051***
                            (0.015)     (0.015)     (0.013)
Strong Victim                           -0.000       0.001
                                        (0.002)     (0.002)
Victim CAR × Strong Victim                           0.025***
                                                    (0.008)
Size                        -0.001      -0.001      -0.001
                            (0.001)     (0.001)     (0.001)
B/M                         -0.001      -0.001      -0.001
                            (0.001)     (0.001)     (0.001)
Constant                     0.002       0.002       0.002
                            (0.005)     (0.006)     (0.006)
Observations                57,194      57,194      57,194
R-squared                    0.008       0.008       0.008
Industry & Year FE           YES         YES         YES
Industry & Event Cluster     YES         YES         YES

Note: * p<0.1; ** p<0.05; *** p<0.01

This table examines how industry competition influences the stock market reaction of peer firms to cyberattacks on victims. Peer firms are defined as those in the same Fama-French 48 industry (Fama and French, 1997) as the victim firms that experienced a cyberattack. We select only those cyberattack incidents for which victims experienced a CAR lower than -1% over the [-3,+3] window, consistent with the screening criteria used in our main results. The CAR is defined as the cumulative daily abnormal return based on the Fama-French three-factor model augmented by the momentum factor, with an estimation window of 100 days. We require a minimum of 70 days of valid returns in the estimation window and a gap of 50 days between the end of the estimation window and the beginning of the event window. Victim CAR is the [-3,+3] window cumulative abnormal return of the focal firm around cyberattack events. Peer CAR is the [-3,+3] window cumulative abnormal return of the peer firm around cyberattack events. "Strong Victim" is a dummy variable equal to one if the peer firm's sales are lower than the victim firm's sales in the previous year and zero otherwise. Industry and year fixed effects are included. Standard errors are clustered at the industry and event levels.

2.5.2 Product Similarity as a Proxy for Data Similarity

We note that firms with valuable data holdings tend to be particularly vulnerable to cyberattacks. Based on this observation, we seek to identify firms with data holdings similar to those of the victim firm. Hoberg and Philips (2016) propose an interesting measure that captures the similarity of products and services between two firms. If two firms provide similar products and services to their customers, it is likely that the data generated and stored by the two firms would be similar. Based on the Hoberg and Philips (2016) product similarity score, we identify firms providing products and services similar to those of the victim firm. Since these firms should have a higher probability of experiencing a cyberattack due to their data similarity to the hacked firm, a public release of information about a hack would lead investors to update their beliefs about the cyber risk exposure of these firms, resulting in a larger stock price drop for them. To test this hypothesis, we use an event study framework similar to that of the preceding tests, focusing on the interaction between "Data Similarity" and "Victim CAR". In Table 2.7, we find that firms providing products and services similar to those of the victim firm experience a more negative market reaction when the data breach of the victim becomes public, which supports our hypothesis.
In terms of magnitudes, because the data similarity measure has a mean of 0.022 and a standard deviation of 0.045, a one-standard-deviation increase in data similarity is associated with an increase in the response of the peer firm's stock price to the CAR of the victim firm of 0.037 (0.045 × 0.82).

Table 2.7 Stock Price Reaction of Firms with Data Similarity to Cyberattacks

Dependent Variable: Peer CAR [-3,+3]
                                 (1)         (2)         (3)
Victim CAR                     0.046***    0.049***    0.038**
                              (0.015)     (0.017)     (0.016)
Data Similarity                           -0.055***   -0.011
                                          (0.010)     (0.009)
Victim CAR × Data Similarity                           0.820***
                                                      (0.237)
Size                          -0.001      -0.001      -0.001
                              (0.001)     (0.001)     (0.001)
B/M                           -0.001      -0.000      -0.001
                              (0.001)     (0.001)     (0.001)
Constant                       0.002       0.003       0.003
                              (0.005)     (0.005)     (0.005)
Observations                  57,191      57,191      57,191
R-squared                      0.008       0.008       0.009
Industry & Year FE             YES         YES         YES
Industry & Event Cluster       YES         YES         YES

Note: * p<0.1; ** p<0.05; *** p<0.01

This table uses product similarity as a proxy for data similarity to study the impact of cyberattacks on firms with similar data holdings. First, we identify firms in the same Fama-French 48 industry (Fama and French, 1997) as the victim firms that experienced a cyberattack. Then we use the product similarity score from Hoberg and Philips (2016) as our proxy for data similarity between two firms. Other variables are as defined in Table 2.6. Industry and year fixed effects are included. Standard errors are clustered at the industry and event levels.

Building on the success of the Hoberg and Philips (2016) product similarity measure in capturing data similarity, we construct a firm-level Composite Data Similarity measure that averages a firm's data similarity with all the victim firms experiencing a cyberattack in the previous year:

CompositeDataSim_{i,t} = \frac{1}{m} \sum_{j=1}^{m} DataSim_{i,j,t},        (2.2)

where i represents individual firms in the Compustat universe, j indexes firms that have been attacked in year t, and DataSim is the product similarity score measure in Hoberg and Philips (2016). Thus CompositeDataSim is the average data similarity of each firm to the group of attacked firms in the previous year. Since a firm holding valuable data similar to that of previously attacked firms is a more likely target, we expect its composite data similarity score to be a positive contributor to a firm's cyber risk.

To examine this conjecture, we regress both our cyber risk measure and the Cyber Cosine measure proposed by Florackis et al. (2022) on the composite data similarity. To make the coefficients comparable, we standardize each of the continuous variables to have a mean of zero and a standard deviation of unity. The results in Table 2.8 are consistent with our hypothesis. The coefficients for "Composite Data Sim" are positive and statistically significant across the different specifications. A one-standard-deviation increase in "Composite Data Sim" is associated with a 0.012 standard deviation increase in our "Cyber Risk" measure and a 0.038 standard deviation increase in the "Cyber Cosine" measure, when we control for the effects of firm size and the book-to-market ratio. These results support the idea that valuable data holdings tend to attract the attention of hackers, which increases the cyber risk of a firm.

Table 2.8 Cyber Risk and Data Similarity

                            Cyber Risk               Cyber Cosine
VARIABLES               (1)         (2)           (3)         (4)
Composite Data Sim    0.028***    0.012***      0.061***    0.038***
                     (0.007)     (0.004)       (0.016)     (0.010)
Log(MktCap)                       0.137***                  0.200***
                                 (0.003)                   (0.013)
B/M                               0.021***                 -0.013
                                 (0.002)                   (0.011)
Constant              0.000***   -0.013***      0.042***    0.024***
                     (0.000)     (0.000)       (0.000)     (0.001)
Observations          31,106      31,084        29,797      29,776
R-squared              0.872       0.887         0.470       0.506
Industry FE            YES         YES           YES         YES
Year FE                YES         YES           YES         YES

Note: * p<0.1; ** p<0.05; *** p<0.01

This table presents the results of regressing our cyber risk measure and the Cyber Cosine measure from Florackis et al. (2022) on the composite data similarity measure in the previous year. We construct the "Composite Data Sim" variable as defined in equation (2.2), where i represents firms in the Compustat universe, j indexes firms that have been attacked in year t, and DataSim is the product similarity score measure in Hoberg and Philips (2016). In Columns (2) and (4), we control for size and the book-to-market ratio. Every continuous variable is standardized to have a mean of zero and a standard deviation of one. We include industry (Fama-French 48-industry classification) fixed effects and year fixed effects. Standard errors are clustered at the industry level.

2.6 Conclusion

In this paper, we use machine learning algorithms to develop an ex-ante cyber risk measure for individual firms, which has a superior ability to forecast the occurrence of future cyberattacks.
We find that firms with higher cyber risk, according to this measure, earn higher average stock returns, which cannot be explained by standard asset pricing models. When these firms underperform, cybersecurity experts tend to have higher concerns about cyber risk, and cybersecurity exchange-traded funds outperform. Additional evidence based on product market competition and data similarity between firms lends further support to the notion that cyber risk is an important determinant of expected returns in increasingly digitized economies.

Our study suggests interesting avenues for future research. First, we have followed a bottom-up approach to estimate the cyber risk premium, starting with firm-level estimates of cyber risk. Another approach, which is top-down, can examine the systemic impact of a large attack on major players in the economy. For instance, Eisenbach et al. (2020) model how cyberattacks on large banks can influence the financial system and even the economy through network effects. It would be of interest to explore the connections between these two broad approaches. Second, a limitation of our study is that we focus on the likelihood of cyberattacks as a proxy for cyber risk but do not study the scale of the loss resulting from cyberattacks. Although the probability of cyberattacks is likely to be correlated with the severity of cyberattacks, these two dimensions of cyber risk can contain different information. Our research makes a useful first step toward understanding the implications of cyber risk for asset prices. Future research can benefit from investigating these two dimensions in a unified framework.

CHAPTER 3
CONCLUSIONS

We examined a broad theme in asset pricing: how investors perceive different sorts of risks and how those perceptions affect asset prices. In Chapter One, we studied how social transmission, that is, investors' conversations within their social circles, influences their portfolio and trading choices, which in turn influence asset prices. Individual investors, proxied by Reddit users, tend to follow biased presentations of investment results and thus over-buy high-crash-risk stocks.
In Chapter Two, we studied how cyberattacks on firms could update people’s expectations about future probabilities for all the firms to get attacked in a certain year. That update of expectations has asset pricing consequences, as the most cyber-risky firms would get the most discount in their prices when the update happens. These results also speak to the ongoing debate of whether strictly following the rational expec- tations paradigm is appropriate. One caveat of the studies here is that we do not have account-level trading data, so we do not perfectly observe how investors make portfolio choices in real-time. In future research, we hope to experiment with more settings and back out individual investors’ realistic preference parameters and utility functions. 80 BIBLIOGRAPHY Abreu, D. and Brunnermeier, M. K. (2003). Bubbles and crashes. Econometrica, 71(1):173–204. Amihud, Y. (2002). Illiquidity and stock returns: cross-section and time-series effects. Journal of financial markets, 5(1):31–56. An, H. and Zhang, T. (2013). Stock price synchronicity, crash risk, and institutional investors. Journal of Corporate Finance, 21:1–15. Andreou, P. C., Antoniou, C., Horton, J., and Louca, C. (2016). Corporate governance and firm-specific stock price crashes. European Financial Management, 22(5):916–956. Ang, A., Hodrick, R. J., Xing, Y., and Zhang, X. (2006). The Cross-Section of Volatility and Expected Returns. Journal of Finance, 61(1):259–299. Atilgan, Y., Bali, T. G., Demirtas, K. O., and Gunaydin, A. D. (2020). Left-tail momentum: Underreaction to bad news, costly arbitrage and equity returns. Journal of Financial Economics, 135(3):725–753. Baker, M. and Wurgler, J. (2006). Investor sentiment and the cross-section of stock returns. The journal of Finance, 61(4):1645–1680. Bali, T. G., Cakici, N., and Whitelaw, R. F. (2011). Maxing out: Stocks as lotteries and the cross-section of expected returns. Journal of financial economics, 99(2):427–446. Banerjee, A. V. (1992). A simple model of herd behavior. Quarterly Journal of Economics, 107(3):797–817. Barber, B. M., Huang, X., Odean, T., and Schwarz, C. (2021). Attention induced trading and returns: Evidence from robinhood users. Journal of Finance, forthcoming. Barber, B. M. and Odean, T. (2000). Trading is hazardous to your wealth: The common stock investment performance of individual investors. The journal of Finance, 55(2):773–806. Barber, B. M. and Odean, T. (2008). All that glitters: The effect of attention and news on the buying behavior of individual and institutional investors. The review of financial studies, 21(2):785–818. Barber, B. M., Odean, T., and Zhu, N. (2008). Do retail trades move markets? The Review of Financial Studies, 22(1):151–186. Barber, B. M., Odean, T., and Zhu, N. (2009). Systematic noise. Journal of Financial Markets, 12(4):547–569. Barberis, N. and Huang, M. (2008). Stocks as lotteries: The implications of probability weighting 81 for security prices. American Economic Review, 98(5):2066–2100. Bates, D. S. (2000). Post-’87 crash fears in the s&p 500 futures option market. Journal of econometrics, 94(1-2):181–238. Beason, T. and Schreindorfer, D. (2022). Dissecting the equity premium. Journal of Political Economy, 130(8):2203–2222. Bianchi, D., Büchner, M., and Tamoni, A. (2021). Bond risk premiums with machine learning. The Review of Financial Studies, 34(2):1046–1089. Bikhchandani, S., Hirshleifer, D., and Welch, I. (1998). Learning from the behavior of others: Conformity, fads, and informational cascades. 
Binfarè, M. (2019). The real effects of risk management vulnerabilities: Evidence from data breaches. Available at SSRN 3411553.
Black, F. (1986). Noise. Journal of Finance, 41(3):528–543.
Bollen, N. P. and Whaley, R. E. (2004). Does net buying pressure affect the shape of implied volatility functions? Journal of Finance, 59(2):711–753.
Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. (2010). The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE.
Brunnermeier, M. K., Gollier, C., and Parker, J. A. (2007). Optimal beliefs, asset prices, and the preference for skewed returns. American Economic Review, 97(2):159–165.
Bybee, L., Kelly, B. T., Manela, A., and Xiu, D. (2020). The structure of economic news. Technical report, National Bureau of Economic Research.
Callen, J. L. and Fang, X. (2015). Short interest and stock price crash risk. Journal of Banking & Finance, 60:181–194.
Campbell, J. Y., Hilscher, J., and Szilagyi, J. (2008). In search of distress risk. Journal of Finance, 63(6):2899–2939.
Carhart, M. M. (1997). On persistence in mutual fund performance. Journal of Finance, 52(1):57–82.
Cengiz, D., Dube, A., Lindner, A., and Zipperer, B. (2019). The effect of minimum wages on low-wage jobs. Quarterly Journal of Economics, 134(3):1405–1454.
Chang, I.-C., Liu, C.-C., and Chen, K. (2014). The push, pull and mooring effects in virtual migration for social networking sites. Information Systems Journal, 24(4):323–346.
Chang, X. S., Chen, Y., and Zolotoy, L. (2016). Stock liquidity and stock price crash risk. Journal of Financial and Quantitative Analysis, forthcoming.
Chen, A. Y. and Zimmermann, T. (2021). Open source cross-sectional asset pricing. Critical Finance Review, forthcoming.
Chen, J., Hong, H., and Stein, J. C. (2001). Forecasting crashes: Trading volume, past returns, and conditional skewness in stock prices. Journal of Financial Economics, 61(3):345–381.
Chen, L., Pelger, M., and Zhu, J. (2020). Deep learning in asset pricing. Available at SSRN 3350138.
Cohen, L., Malloy, C., and Nguyen, Q. (2020). Lazy prices. Journal of Finance, 75(3):1371–1415.
Cong, L. W., Harvey, C. R., Rabetti, D., and Wu, Z.-Y. (2023). An anatomy of crypto-enabled cybercrimes. Technical report, National Bureau of Economic Research.
Conrad, J., Kapadia, N., and Xing, Y. (2014). Death and jackpot: Why do individual investors hold overpriced stocks? Journal of Financial Economics, 113(3):455–475.
Damashek, M. (1995). Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848.
De Long, J. B., Shleifer, A., Summers, L. H., and Waldmann, R. J. (1990a). Noise trader risk in financial markets. Journal of Political Economy, 98(4):703–738.
De Long, J. B., Shleifer, A., Summers, L. H., and Waldmann, R. J. (1990b). Positive feedback investment strategies and destabilizing rational speculation. Journal of Finance, 45(2):379–395.
Duffie, D. and Younger, J. (2019). Cyber runs. Hutchins Center Working Paper.
Eisenbach, T. M., Kovner, A., and Lee, M. J. (2020). Cyber risk and the US financial system: A pre-mortem analysis. FRB of New York Staff Report, (909).
Eisfeldt, A. L. and Papanikolaou, D. (2013). Organization capital and the cross-section of expected returns. Journal of Finance, 68(4):1365–1406.
Fama, E. F. and French, K. R. (1988). Permanent and temporary components of stock prices. Journal of Political Economy, 96(2):246–273.
Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56.
Fama, E. F. and French, K. R. (1996). Multifactor explanations of asset pricing anomalies. Journal of Finance, 51(1):55–84.
Fama, E. F. and French, K. R. (1997). Industry costs of equity. Journal of Financial Economics, 43(2):153–193.
Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1):1–22.
Fama, E. F. and French, K. R. (2020). Comparing cross-section and time-series factor models. Review of Financial Studies, 33(5):1891–1926.
Fama, E. F. and MacBeth, J. D. (1973a). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy, 81(3):607–636.
Fama, E. F. and MacBeth, J. D. (1973b). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy, 81(3):607–636.
Feng, G., Giglio, S., and Xiu, D. (2020). Taming the factor zoo: A test of new factors. Journal of Finance, 75(3):1327–1370.
Florackis, C., Louca, C., Michaely, R., and Weber, M. (2022). Cybersecurity risk. Review of Financial Studies, forthcoming.
Foucault, T., Sraer, D., and Thesmar, D. J. (2011). Individual investors and volatility. Journal of Finance, 66(4):1369–1406.
Freyberger, J., Neuhierl, A., and Weber, M. (2020). Dissecting characteristics nonparametrically. Review of Financial Studies, 33(5):2326–2377.
Gormley, T. A. and Matsa, D. A. (2011). Growing out of trouble? Corporate responses to liability risk. Review of Financial Studies, 24(8):2781–2821.
Graham, J. R. and Kumar, A. (2006). Do dividend clienteles exist? Evidence on dividend preferences of retail investors. Journal of Finance, 61(3):1305–1336.
Grossman, S. J. and Stiglitz, J. E. (1980). On the impossibility of informationally efficient markets. American Economic Review, 70(3):393–408.
Gu, S., Kelly, B., and Xiu, D. (2020). Empirical asset pricing via machine learning. Review of Financial Studies, 33(5):2223–2273.
Han, B., Hirshleifer, D., and Walden, J. (2022). Social transmission bias and investor behavior. Journal of Financial and Quantitative Analysis, 57(1):390–412.
Han, B. and Kumar, A. (2013). Speculative retail trading and asset prices. Journal of Financial and Quantitative Analysis, 48(2):377–404.
Harvey, C. R. and Siddique, A. (2000). Conditional skewness in asset pricing tests. Journal of Finance, 55(3):1263–1295.
Hastie, T., Tibshirani, R., and Friedman, J. (2017). The Elements of Statistical Learning. Springer Science+Business Media, New York.
He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.
Hirshleifer, D., Hou, K., Teoh, S. H., and Zhang, Y. (2004). Do investors overvalue firms with bloated balance sheets? Journal of Accounting and Economics, 38:297–331.
Hoberg, G. and Phillips, G. M. (2016). Text-based network industries and endogenous product differentiation. Journal of Political Economy, 124(5):1423–1465.
Hu, D., Jones, C. M., Zhang, V., and Zhang, X. (2021). The rise of Reddit: How social media affects retail investors and short-sellers' roles in price discovery. Available at SSRN 3807655.
Hutton, A. P., Marcus, A. J., and Tehranian, H. (2009). Opaque financial reports, R², and crash risk. Journal of Financial Economics, 94(1):67–86.
Jamilov, R., Rey, H., and Tahoun, A. (2020). The anatomy of cyber risk. London Business School Working Paper.
Jang, J. and Kang, J. (2019). Probability of price crashes, rational speculative bubbles, and the cross-section of stock returns. Journal of Financial Economics, 132(1):222–247.
Jiang, H., Khanna, N., Yang, Q., and Zhou, J. (2020). The cyber risk premium. Available at SSRN: https://ssrn.com/abstract=3637142 or http://dx.doi.org/10.2139/ssrn.3637142.
Jin, L. and Myers, S. C. (2006). R² around the world: New theory and new tests. Journal of Financial Economics, 79(2):257–292.
Kamiya, S., Kang, J.-K., Kim, J., Milidonis, A., and Stulz, R. M. (2021). Risk management, firm reputation, and the impact of successful cyberattacks on target firms. Journal of Financial Economics, 139(3):719–749.
Kashyap, A. K. and Wetherilt, A. (2019). Some principles for regulating cyber risk. AEA Papers and Proceedings, 109:482–487.
Ke, Z. T., Kelly, B. T., and Xiu, D. (2019). Predicting returns with text data. Technical report, National Bureau of Economic Research.
Kelley, E. K. and Tetlock, P. C. (2017). Retail short selling and stock prices. Review of Financial Studies, 30(3):801–834.
Kelly, B. and Jiang, H. (2014). Tail risk and asset prices. Review of Financial Studies, 27(10):2841–2871.
Kelly, B. T., Pruitt, S., and Su, Y. (2019). Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics, 134(3):501–524.
Kim, J.-B., Li, L., Lu, L. Y., and Yu, Y. (2016). Financial statement comparability and expected crash risk. Journal of Accounting and Economics, 61(2-3):294–312.
Kim, J.-B., Li, Y., and Zhang, L. (2011). Corporate tax avoidance and stock price crash risk: Firm-level analysis. Journal of Financial Economics, 100(3):639–662.
Kim, J.-B. and Zhang, L. (2014). Financial reporting opacity and expected crash risk: Evidence from implied volatility smirks. Contemporary Accounting Research, 31(3):851–875.
Kim, Y., Li, H., and Li, S. (2014). Corporate social responsibility and stock price crash risk. Journal of Banking & Finance, 43:1–13.
King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2):137–163.
Kopp, E., Kaffenberger, L., and Jenkinson, N. (2017). Cyber risk, market failures, and financial stability. International Monetary Fund.
Kozak, S., Nagel, S., and Santosh, S. (2020). Shrinking the cross-section. Journal of Financial Economics, 135(2):271–292.
Kyle, A. S. (1985). Continuous auctions and insider trading. Econometrica, pages 1315–1335.
Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2-3):221–247.
Li, K., Mai, F., Shen, R., and Yan, X. (2021). Measuring corporate culture using machine learning. Review of Financial Studies, 34(7):3265–3315.
Li, X. and Wu, L. (2018). Herding and social media word-of-mouth: Evidence from Groupon. MIS Quarterly, forthcoming.
Lin, Z., Sapp, T. R., Ulmer, J. R., and Parsa, R. (2020). Insider trading ahead of cyber breach announcements. Journal of Financial Markets, 50:100527.
Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550.
Loughran, T. and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1):35–65.
McCrank, J. (2021). Factbox: The U.S. retail trading frenzy in numbers. Thomson Reuters. URL: https://www.reuters.com/article/us-retail-trading-numbers-idUSKBN29Y2PW.
Michel, A., Oded, J., and Shaked, I. (2020). Do security breaches matter? The shareholder puzzle. European Financial Management, 26(2):288–315.
NBER (2021). US business cycle expansions and contractions.
Newey, W. K. and West, K. D. (1986). A simple, positive semi-definite heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–708.
Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–708.
Pan, J. (2002). The jump-risk premia implicit in options: Evidence from an integrated time-series study. Journal of Financial Economics, 63(1):3–50.
Parker, R., Graff, D., Kong, J., Chen, K., and Maeda, K. (2011). English Gigaword fifth edition. Linguistic Data Consortium, Philadelphia, PA, USA.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Petersen, M. A. (2009). Estimating standard errors in finance panel data sets: Comparing approaches. Review of Financial Studies, 22(1):435–480.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). Introduction to Information Retrieval, volume 39. Cambridge University Press, Cambridge.
Shleifer, A. and Vishny, R. W. (1997). The limits of arbitrage. Journal of Finance, 52(1):35–55.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. Journal of Finance, 62(3):1139–1168.
Van Buskirk, A. (2011). Volatility skew, earnings announcements, and the predictability of crashes. Working paper (April 28, 2011).
Warren, P., Kaivanto, K., Prince, D., et al. (2018). Could a cyber attack cause a systemic impact in the financial sector? Bank of England Quarterly Bulletin, 58(4):21–30.
Welch, I. (2020). Retail raw: Wisdom of the Robinhood crowd and the COVID crisis. Technical report, National Bureau of Economic Research.
Xing, Y., Zhang, X., and Zhao, R. (2010). What does the individual option volatility smirk tell us about future equity returns? Journal of Financial and Quantitative Analysis, pages 641–662.
Yan, S. (2011). Jump risk, stock returns, and slope of implied volatility smile. Journal of Financial Economics, 99(1):216–233.
Yue, Y., Finley, T., Radlinski, F., and Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–278.
Zeng, X. and Martinez, T. R. (2000). Distribution-balanced stratified cross-validation for accuracy estimation. Journal of Experimental & Theoretical Artificial Intelligence, 12(1):1–12.