RECREATIONAL CANNABIS LEGALIZATION: PREDICTING LOCAL POLICY ADOPTION AND
     ESTIMATING THE ASSOCIATED EFFECTS ON POPULATION CANNABIS USE
                                          By
                          Barrett Wallace Montgomery
                                  A DISSERTATION
                                      Submitted to
                             Michigan State University
                     in partial fulfillment of the requirements
                                   for the degree of
                      Epidemiology – Doctor of Philosophy
                                         2022


                                             ABSTRACT
 RECREATIONAL CANNABIS LEGALIZATION: PREDICTING LOCAL POLICY ADOPTION AND
         ESTIMATING THE ASSOCIATED EFFECTS ON POPULATION CANNABIS USE
                                                  By
                                    Barrett Wallace Montgomery
        Cannabis is undergoing a remarkable transformation from a regulated drug to a
recreationally legal one in the United States (U.S.). Yet, in states that have legalized recreational
cannabis, there is substantial geographic variability in actual cannabis policies and the effects of
cannabis legalization are still being debated. This dissertation addresses these modern scientific
issues of the recreational cannabis landscape.
        The population under study primarily includes non-institutionalized U.S. civilian residents,
sampled and assessed in successive waves of the National Survey on Drug Use and Health
(NSDUH) starting in 2008 through 2019. Estimates on drug use and mental illness prevalences
are aggregated to the county level for the first aim, and to the state level for the second and third
aims. In the first aim, the county-level data are linked to several other publicly available sources
of information on all 3,142 U.S. counties including the 2010 Census, 2012 presidential election,
and recreational cannabis sales policies. I then used these data to train a machine learning
algorithm to predict which counties allowed for the recreational sale of cannabis in 2014. In the
second aim, I used state-level estimates of cannabis incidence in an event study model to
estimate the effects of legalizing recreational cannabis on cannabis use onsets for persons under
and over the legal minimum age of 21. The final aim focuses specifically on 21 year-olds to better
understand the implications for setting a legal minimum age drug policy on age-specific patterns
of incidence and proposes a theoretical framework that may help understand these findings.


         For the first aim, the model-averaging predictions classified almost 94% of the U.S.
counties correctly. The main factors associated with county-level recreational cannabis laws
were the prevalences of past-month cannabis use and past-year cocaine use. In the second aim,
I found that for those who were legally able to purchase cannabis (21 and older), cannabis
legalization did not appear to affect incidence in the first year following legalization. Even so,
between two and four years after legalization, the difference in differences modeling disclosed
statistically robust increases of 0.6% for this sub-population of adults. After four years, the
estimated increase is 1.3%. The corresponding estimates for underage persons who were
ineligible to legally purchase cannabis show no appreciable differences in the occurrence in past-
year cannabis use incidence. Finally, the age-specific incidence estimates for 21-year-olds show
a rise after the passage of recreational cannabis laws (RCL) and are suggestive of the arrival of
a new pattern of age-specific incidence.
         Taken together, the work and results of this dissertation point toward four potential
conclusions. First, cannabis legalization might depend on a predictable process driven in part
by prior drug use in each jurisdiction. Second, once implemented, recreational cannabis
legalization might not have effects on adolescent onset newly incident cannabis use. Third, for
adults permitted to buy cannabis without penalty, the occurrence of newly incident cannabis use
might increase. Fourth, a tentative conclusion is that legalization of retail sales to adults removes
a barrier for adults who had been interested in trying cannabis, but did not do so, perhaps due
to concerns about legal or social consequences faced before legalization.


Copyright by
BARRETT WALLACE MONTGOMERY
2022


This dissertation is dedicated to Kyle, Jared, Matt, CT, and Gabe.
                May the next generation suffer less.
                                 v


                                     ACKNOWLEDGEMENTS
This is work supported by a National Institute on Drug Abuse R25 Science Education research
training program grant award (R25DA051249) and by Michigan State University. The content is
the sole responsibility of the author and does not necessarily represent the official views of
Michigan State University, the National Institute on Drug Abuse, the National Institutes of
Health, or the United States Substance Abuse and Mental Health Services Administration. In
addition, we would like to thank the United States Substance Abuse and Mental Health
Services Administration Center for Behavioral Health Statistics and Quality for sponsoring the
National Surveys on Drug Use and Health and for making the datasets available for public use
to allow research of this nature.
This work is the culmination of the efforts and sacrifices of many. Special acknowledgement
should first be given to Olga Vsevolozhskaya, Xiaoran Tong, Meaghan Roberts, and Claire
Margerison for their contributions to this work. Many thanks to Debra Furr-Holden and Jim
Anthony for believing in me and guiding me, and to the professors of the department who were
generous with their time and talents during my education. Many thanks are also of course due
to my family and friends across the country for supporting this major decision in my life. Finally,
to my wife, Evelyn, for being by my side through every struggle and success.
                                                 vi


                                                  TABLE OF CONTENTS
LIST OF TABLES ....................................................................................................................... ix
LIST OF FIGURES....................................................................................................................... x
KEY TO ABBREVIATIONS ........................................................................................................ xii
1. Introduction and Specific Aims ...............................................................................................1
2. History, Background, and Significance ...................................................................................5
   2.1 Overview of this chapter ...................................................................................................5
   2.2 History ..............................................................................................................................5
      2.2.1 The Opium Wars ....................................................................................................... 7
      2.2.2 The Origins of Domestic Drug Policy ......................................................................... 9
      2.2.3 Roots in Racism and Class Warfare .........................................................................12
      2.2.4 Arriving at the Controlled Substances Act ................................................................15
      2.2.5 The Modern Era .......................................................................................................17
   2.3 An Initial Look at Drug Policy and its Intentions ..............................................................19
   2.4 Related Developments in Social Statistics and Study Design .........................................21
      2.4.1 Earliest work.............................................................................................................21
      2.4.2 Early Sociology and Psychiatric Epidemiology .........................................................23
      2.4.3 Classical to Modern Statistics and Causal Inference................................................24
      2.4.4 Controlling for Confounding and Policy Analysis......................................................25
   2.5 The Current Understanding of the Effects of Cannabis Legalization ...............................28
      2.5.1 Cannabis Use in Individuals Under 21 After Legalization .........................................28
      2.5.2 Cannabis Use in Individuals Over 21 After Legalization............................................30
   2.6 Significance ....................................................................................................................30
   2.7 Potential Impact on the Field ..........................................................................................32
3. Materials and Methods .........................................................................................................33
   3.1 Overview of this Chapter .................................................................................................33
      3.1.1 Details on IRB Approval, Recruitment, and Participation Levels ..............................34
   3.2 Aim 1...............................................................................................................................35
      3.2.1 Study population and sample...................................................................................35
        3.2.1.1 National Surveys on Drug Use and Health Small Area Estimates .......................35
        3.2.1.2 Census ...............................................................................................................36
        3.2.1.3 County Presidential Data ....................................................................................36
        3.2.1.4 Cannabis Legalization Status .............................................................................36
      3.2.2 Data Management ....................................................................................................37
      3.2.3 Study Design ............................................................................................................38
        3.2.3.1 Pre-iteration Modelling and Validation ................................................................39
        3.2.3.2 Building Ensemble Prediction .............................................................................40
        3.2.3.3 Sensitivity Analyses ............................................................................................40
   3.3 Aim 2...............................................................................................................................42
                                                                  vii


      3.3.1 Study population and sample...................................................................................42
      3.3.2 Outcome ..................................................................................................................43
      3.3.3 Study Design and Statistical Analysis ......................................................................44
        3.3.3.1 Dates of Legalization vs. Dates of Implementation .............................................46
        3.3.3.2 Alternative Specifications and Robustness Checks............................................46
   3.4 Aim 3...............................................................................................................................47
      3.4.1 Study population and sample...................................................................................47
      3.4.2 Outcome ..................................................................................................................47
      3.4.3 Study design ............................................................................................................49
4. Results ..................................................................................................................................51
   4.1 Aim 1...............................................................................................................................51
      4.1.1 Descriptive statistics ................................................................................................51
      4.1.2 Predictive model ......................................................................................................53
      4.1.3 County-level predictions ..........................................................................................55
      4.1.4 Sensitivity analyses ..................................................................................................57
   4.2 Aim 2...............................................................................................................................61
      4.2.1 Descriptive statistics ................................................................................................61
      4.2.2 Event Study Findings ...............................................................................................63
      4.2.3 DiD Findings .............................................................................................................65
      4.2.4 Alternative Specifications and Robustness Checks .................................................66
   4.3 Aim 3...............................................................................................................................66
      4.3.1 Panel study approach ..............................................................................................66
      4.3.2 Stratification at age 21 .............................................................................................67
5. Discussion ............................................................................................................................70
   5.1 Aim 1...............................................................................................................................70
   5.2 Aim 2...............................................................................................................................73
   5.3 Aim 3...............................................................................................................................78
6. Summary ..............................................................................................................................80
APPENDICES ...........................................................................................................................82
   Appendix A: Supplemental Figures and Tables.....................................................................83
   Appendix B: Program Code Used to Derive the Constructed Study .....................................95
BIBLIOGRAPHY...................................................................................................................... 538
                                                                    viii


                                                       LIST OF TABLES
Table 1. Sample sizes and participation levels of successive years of the National Surveys on
Drug Use and Health.……...……………......……...……………......……...……………......……... 34
Table 2. Sociodemographic and political compositions and prevalences of mental illness and
drug use in counties that allowed for the sale of recreational cannabis and those that did not.
……...……………......……...……………......……...……………......……...……………......…….... 52
Table 3. Predictors for legal cannabis sales in 2014 as represented by median z score over
1000 model iterations.……...………...……………......……...……………......……...…….………. 55
Table 4. Sensitivity and Specificity of Models Using Various Weighting Techniques and Hard
Cut-off Values.……...……………......……...……………......……...………………...……….…….. 60
Table 5. Characteristics of the U.S. Population Under Study. Data from the U.S. National
Surveys on Drug Use and Health.……...……………......……...……………...……...……………. 62
Table A.1. Predictors for legal cannabis sales in 2014 as represented by median z score over
1000 model iterations when proportions of voters are replaced by a binary indicating the party
of the majority............................................................................................................................. 94
                                                                   ix


                                                          LIST OF FIGURES
Figure 1. Chemist J.J. Pemberton writes on the use of the coca plant in his invention, the
recipe for Macalister's Cough Mixture contained cannabis, alcohol, and chloroform, and an
original Bayer product containing both aspirin and heroin........................................................ 10
Figure 2. Controlled Substances Act schedules and criteria..................................................... 16
Figure 3. A map of states that give local authorities the right to depart from the state provisions
regarding the recreational use of cannabis as of January 1, 2020............................................ 19
Figure 4. How the causal effect is estimated in the differences-in-differences model.............. 27
Figure 5. ROC curves of 1000 predictions of county-level legal cannabis sales in 2014 and the
ensemble average...................................................................................................................... 54
Figure 6. Ensemble produced county-level probability of allowing recreational cannabis sales in
2014............................................................................................................................................56
Figure 7. Actual cannabis policies by county, 2014 compared to predicted policy outcomes. 57
Figure 8. ROC curves of 1000 models and average profiles for each possibility of coding the
outcome in sensitivity analyses.................................................................................................. 58
Figure 9. Distinguishing power of ensemble predictions (weighted or naïve average).............. 59
Figure 10. Estimated effect of time since cannabis legalization on cannabis incidence in the 21
and older age group with 95% confidence intervals..................................................................64
Figure 11. Estimated effect of time since legalization on incidence in those aged 12 to 20 with
95% confidence intervals........................................................................................................... 65
Figure 12. Trends in past-year cannabis incidence by age in Colorado and Washington vs. all
other states in the US, 2010-2017............................................................................................. 67
Figure 13. Trends in past-year cannabis incidence for 21 year-olds in Colorado and
Washington vs. all other states in the US, 2010-2017............................................................... 68
Figure 14. Estimated effect of time since cannabis legalization on cannabis incidence at age 21
with 95% confidence intervals................................................................................................... 69
Figure A.1. Percent of variance captured from over 1000 census variables in each principal
component................................................................................................................................. 84
                                                                        x


Figure A.2. Cannabis incidence in the 21 and older age group, first wave legalizing states vs
untreated states......................................................................................................................... 85
Figure A.3. Cannabis incidence in the 21 and older age group, second wave legalizing states
vs. untreated states.................................................................................................................... 86
Figure A.4. Cannabis incidence in the 21 and older age group, third wave legalizing states vs.
untreated states......................................................................................................................... 87
Figure A.5. Cannabis incidence in the 21 and older age group, first wave legalizing states vs.
third wave legalizing states........................................................................................................ 88
Figure A.6. Cannabis incidence in the 21 and older age group, second wave legalizing states
vs. third wave legalizing states................................................................................................... 89
Figure A.7. Estimated effect of time since cannabis legalization on past-month cannabis
prevalence in the 21 and older age group..................................................................................90
Figure A.8. Estimated effect of time since legalization on past-month cannabis prevalence in
the 12 to 20 age group............................................................................................................... 91
Figure A.9. Estimated placebo effect of time since cannabis legalization on past-year cannabis
incidence in the 21 and older age group....................................................................................92
Figure A.10. Estimated placebo effect of time since cannabis legalization on past-year
cannabis incidence in the 12 to 20 age group........................................................................... 93
                                                                 xi


                            KEY TO ABBREVIATIONS
ATT    Average Treatment Effect on the Treated
AUC    Area Under the Curve
B.C.E. Before Common Era
C.E.   Common Era
CI     Confidence Intervals
CSA    Controlled Substances Act
CUD    Cannabis Use Disorder
DiD    Differences-in-Differences
FDA    Food and Drug Administration
FMTA   The Federal Marihuana Tax Act
IRD    Internationally Regulated Drugs
IRS    Internal Revenue Service
LMA    Legal Minimum Age
NSDUH  National Survey on Drug Use and Health
PFDA   Pure Food and Drug Act
RCL    Recreational Cannabis Legalization
R-DAS  Restricted Data Access System
                                       xii


ROC    Receiver Operator Characteristic
SAMHSA Substance Abuse and Mental Health Services Administration
TNR    True Negative Rate
TPR    True Positive Rate
U.S.   United States
                                      xiii


                                1. Introduction and Specific Aims
         Imagine for a moment that you are the mayor of a small city. The state legislature has just
announced that a legal referendum on recreational cannabis has just passed by popular vote
with 56% of the state population voting for the measure. You know that the legislation allows for
some local control over what will happen in your city. What will you do?
    Over the course of the past decade, policy and decision-makers across the country have
been experiencing this hesitation in droves. It is not a simple decision. The votes passed with
popular support, but almost half of the population is likely to disagree with the decision. And
what about the constituents of their city? Opinions on the legalization of recreational cannabis
can vary dramatically within a state.
    Over the past ten years, eighteen states and three United States (U.S.) territories have
legalized cannabis for people aged 21 and older. Policy and decision-makers in every
municipality within these states, districts, and territories have had to wrestle with how to move
forward when there are so many unanswered questions: Do my constituents want this? How will
this affect them? How could this affect their children?
    To date, researchers and social scientists have tried to answer these questions with varying
degrees of success and consensus. Questions about the motivations and feelings of voters are
based almost exclusively on polling, a practice that has come under fire in recent years for
incorrectly predicting the results of presidential campaigns. Meanwhile, questions on how the
changing laws may affect adults and children have relied on a variety of state and nationally
representative surveys with results pointing in all directions depending on the outcome measured
and the type of analysis. However, not one of the many studies has produced estimates on how
                                                  1


the changing law affects the decision making of potential first-time users. The rationale for a
concentration of policy research on ‘prevalence’ of active cannabis use might be based on issues
of statistical precision and power because recently active cannabis users are more numerous
than newly incident cannabis users. Nonetheless, the recently active cannabis users often are
dominated by long-time cannabis users who started their cannabis use many years before the
policy change. The focus on ‘prevalence’ of use ignores the distinction between ‘being a
cannabis user’ versus ‘becoming a cannabis user.’ A policy analysis failure can occur when
investigations do not discriminate the epidemiological processes of prevalence (determined by
both duration of use and incidence of use) from the epidemiological processes of becoming a
newly incident user (determined solely by incidence of use). Ignoring age-specific estimates of
incidence will inevitably leave important relationships between society and cannabis use hiding
in the dark. This dissertation research project seeks to shine light to uncover the changing
dynamics of these relationships.
    In this dissertation, I aim to achieve the following goals:
        1. Develop a predictive model of sub-state cannabis legalization using publicly
             available datasets that are readily available to other investigators, and that can be
             used in future investigations.
        2. Provide evidence on the degree to which the incidence of cannabis use might have
             increased or decreased after cannabis legalization for two important subgroups of
             the population: (1) the adults who are permitted to make a retail purchase of a
             cannabis product in each jurisdiction, and (2) the underage adolescents (<21 years
                                                    2


            old) for whom retail purchase of cannabis products remains prohibited in each
            jurisdiction.
        3. Estimate the degree to which the legalization of cannabis might have affected the
            age of first cannabis use with special attention to the legal minimum age (LMA).
        An important point of departure for this dissertation research project is the distinction
between prevalence proportions and incidence rates. My dissertation research project is focused
on the occurrence of newly incident cannabis use, year by year. Prior studies have focused upon
prevalence proportions. Any epidemiologist knows that prevalence varies as a function of both
incidence and duration, and the estimated size of the prevalence proportion can be dominated
by long-sustained cannabis users who would likely have continued to consume cannabis with,
or without, a change in policy. In contrast, the incidence parameter reflects occurrence of newly
incident use, given no prior use before the interval under study (e.g., before and after the interval
of cannabis policy change).
        While the prevalence of cannabis use is an important measure for public health, it is only
one piece of the larger puzzle. Criticism of relying so heavily on prevalence and the need to better
measure incidence in chronic and mental health disorders has a long tradition (Kramer, 1957;
Lapouse, 1967; Wu et al., 2003). The criticism was best espoused by Reema Lapouse:
          “Prevalence rates measure the size of the disease problem and as such are useful in
  planning services. They are, however, a fallible indicator of the risk of acquiring any chronic
     disease including psychiatric disorder. Since prevalence is a function of incidence and
  duration, any factors affecting duration of disease will similarly influence its prevalence rate.
   Thus, long-term, nonfatal, noncurable diseases which limit migration produce a pile-up of
                                                   3


  cases and a rise in the prevalence rates. Survivorship, mobility, and duration may, in turn, be
 associated with demographic factors. Consequently an association between these factors and
 prevalence may occur even though demographic factors bear no relationship to the genesis of
  disease. The only suitable measure applicable to the search for possible causes of disease is
                                    the incidence rate.” (1967).
        In this dissertation research project, my work focuses upon the occurrence of newly
incident cannabis use, consistent with the principles set forth by Kramer and Lapouse more
than a half-century ago.
                                                 4


                              2. History, Background, and Significance
2.1 Overview of this chapter
         This chapter will provide an overview of the prior theory, concepts, principles, and
research approaches used in prior studies of cannabis policy, as well as a substantial body of
literature that is relevant to understanding the development and interpretation of the results of
the current line of research. The first part of this review of the literature will describe the history
of drug policy in the United States (U.S.) and its international origins to inform the current views
on cannabis policy in the U.S. The second section will briefly outline the history of the research
conducted and methods used, intentions, and/or probable consequences on the various drug
policy changes in our national history. The third section will cover the related and coinciding
developments in the literature of social statistics which has allowed for the modern methods of
analysis and the increasing number of publications on policy effects. The fourth and final section
will cover the new wave of state-level cannabis liberalization and the research on possible
changes in epidemiological parameters of cannabis use, key discoveries, and the current state
of the science of drug policy research. This is not meant to be an exhaustive review of the
literature. My review is intended to familiarize the reader with the basic assumptions about
epidemiological research and issues that must be considered in studies intended to study non-
random policy events that stimulate population-level changes in the occurrence of cannabis or
other drug use. This chapter will conclude with a discussion of the significance and potential
impact of the results on drug dependence epidemiology in response to policy events.
2.2 History
         The legal regulation of cannabis is the most recent example in a long history of how
societies have chosen to treat psychoactive drugs. Although the earliest regulation of any
                                                   5


psychoactive drug is believed to date back to Hammurabi's punishment for overcharging tavern
patrons for alcohol, this review will focus on drugs that are currently internationally regulated
(King, 1915). The current research addresses recent state and sub-state level changes in
cannabis legalization in the U.S. The primary focus of this review will be cannabis policy and how
the U.S. arrived at the federal law currently in effect, the Controlled Substances Act (CSA). Yet,
the history of cannabis policy cannot be fully understood without some understanding of the
earlier regulations to other drugs, most importantly, opium. This review will begin with the
international socio-political climate which shaped so much of our modern drug policies. The
chapter then covers how the U.S. arrived at the current and most conservative iteration of drug
policy, the CSA.
          In this brief review of the history of drug policy, I will demonstrate that the modern origins
of drug policy are intrinsically tied to the economic interests of powerful countries throughout
the 19th century as well as intra-national racism and class warfare. This history was expertly
summarized in the book Drug policy and the public good:
 “In its initial stages, the effort to control drugs at an international level was aimed at limiting the
  reach and effects of colonial empires (Carstairs, 2005). Psychoactive substances were a glue
   of empires in the period of European colonial expansion from about 1500 until the late 19th
  century (Courtwright, 2001; Jankowiak and Bradburd, 2003). From the point of view of those
     seeking to create markets and dependence on trade, psychoactive substances were an
obvious choice; once the demand for them has been created, it becomes self-sustaining. Thus,
   psychoactive substances became a favourite commodity from which to extract revenues for
 the state, either with excise taxes or through a state-run or farmed-out monopoly. In particular,
     opium monopolies were an important source of revenue for colonial powers in Asia (e.g.,
                                                      6


 Munn, 2000). In the interests of financing their empires, European states had no compunctions
         about forcing open markets for their psychoactive wares.” (Babor et al., 2010).
        Modern readers may be surprised to see no discussion of civil liberties, civil rights, the
costs and benefits of incarceration, or potential effects on health that characterize the
contemporary drug policy conversation. Drug policy originated from the greed of nations warring
with each other for the psychoactive drug market. Nowhere in history is this more evident than
in The Opium Wars fought between Britain and China in the 1840s and 1850s.
2.2.1 The Opium Wars
        Opium was introduced to the Chinese sometime between the 5th and 7th centuries by
Arab traders. The drug was praised and widely used to cure diarrhea, induce sleep, and reduce
pain (Feige & Miron, 2008). The English, however, did not arrive in China until the 17th century
and opened the first legal trading station in China in the 18th. Later, between the late 18th century
and the early 19th, the famous East India Trading Company of Britain obtained a monopoly on
opium from the two major trading ports of India and began exchanging opium for tea from China
(Beeching, 1975).
        By 1729, opium use had risen to levels deemed unacceptable by the emperor and he
delivered an imperial edict, forbidding the sale of opium for smoking. Throughout the century,
the Chinese government became wary of the increasing British influence on their country through
the opium trade, and another imperial edict was issued, banning the import of opium. Despite
the edicts, opium use was increasing in the country and stricter laws against the sale and import
of opium were decreed in both 1814 and 1831. After again seeing no reductions in the use of
opium, a series of internal debates were held by the emperor between those favoring legalization
                                                  7


and those who wanted to suppress the trade even further. The tone of these debates would be
familiar to modern readers, with one side claiming that legalization and taxation would be
beneficial and that black markets and wasted resources were the real crime, while the other side
feared that legalizing the drug would set a poor moral standard and would result in even more
widespread opium use (Feige & Miron, 2008).
        The emperor ended up siding with the moral purists and opium addiction itself became
a capital offense while eliminating the trade became a goal. After a few more years of still not
seeing the desired reductions in opium use, the emperor appointed a special commissioner, Lin
Tseh-Sen, to do anything necessary to stop the opium trade in China. Lin ordered the seizure
and destruction of British opium, halted food shipments to British sailors, and poisoned their
water supply. In retaliation, British sailors attacked and murdered a Chinese villager, inciting a
full-on war between Lin and the British Navy (Feige & Miron, 2008).
        The British defeated the Chinese and forced them to sign the treaty of Nanjing in 1842
(Chang, 1970). This is the treaty that gave Hong Kong to the British and established more ports
of trade to be open to the British. The emperor, however, succeeded in keeping opium illegal
and largely out of the treaty. The Second Opium War began in 1856 when Chinese officials ripped
down a British flag. Within two years, the British again won the war and forced the signing of the
treaty of Tientsin which gave special trading privileges to the British. It was not until after the
treaty when the emperor finally succumbed to the British argument that legalizing opium was the
only way to control the epidemic. Opium was legalized in 1858 and taxed at a rate of about 8%
(Feige & Miron, 2008).
                                                   8


        In this example, we see that opium was the tool of revenue extraction and international
political control for the British Empire and a subject of much frustration for the Ching dynasty.
The conflicts and the seeming insolubility of the problem made the emperor feel that he appeared
as a weak subject of the British. Many believe that the Chinese fought, not for the health of their
people, but for the economic prosperity that would come with selling opium from India (Babor et
al., 2010).
2.2.2 The Origins of Domestic Drug Policy
        Meanwhile, in the United States of the 19th century, products derived from cannabis,
opiates, and cocaine were popular and largely unrecognized everyday items in American life
(Musto, 1999). The lack of regulation allowed for the proliferation of 'patent medicines' – over-
the-counter concoctions often brewed with psychoactive drugs in a proprietary formula.
        As medicinal chemistry and pharmaceutical industry advances were made, the most
active components of the cannabis, opium, and coca leaf plants were extracted and later were
synthesized ‘de novo’ (e.g., heroin, cocaine, morphine, oxycodone, hydrocodone, methedrine).
Vocabulary shifted from opiates (derived strictly from opium) toward ‘opioids’ as the new
laboratory synthesized products were introduced to the market and regulatory responses were
required (Offermanns, 2008).
        In the 19th century and the early 20th century, there was no U.S. federal policy designed
to regulate whether any ‘patent medicine’ or other product included cannabis, opiates, or
cocaine. No special labeling was required. The inclusion of cannabis, opiates, or cocaine in the
product required no special label. In the most famous example, the original formulation of Coca-
                                                  9


Cola contained cocaine. At the time, numerous products used psychoactive drugs marketed
with a familiar blurriness between medicinal and recreational use (see Figure 1).
Figure 1. Chemist J.J. Pemberton writes on the use of the coca plant in his invention, the
recipe for Macalister's Cough Mixture contained cannabis, alcohol, and chloroform, and an
original Bayer product containing both aspirin and heroin.
        The first federal law governing psychoactive drugs in the U.S. did not occur until 1906
with the Pure Food and Drug Act (PFDA). Despite major opposition from the American
Pharmacist Association, this legislation created the Food and Drug Administration (FDA) with its
staff of trained physicians, chemists, and pharmacists (Musto, 1999). The FDA did not outlaw
any psychoactive drugs or their use in patent medicines. Rather, the PFDA only required
truthfulness about ingredients and prohibited false and misleading labels. Later amendments to
the PFDA would require that the quantity of each drug be stated on the label and that the drugs
meet official standards of purity. Thus, for a time the act served to safeguard consumers of the
patent medicines.
                                                 10


         Around the same period, the recognition of morphine addiction in the U.S. led to a quick
spread of anti-morphine laws in the 1890s (Musto, 1999). However, it was not until the U.S. saw
an opportunity in global politics that a national policy on opiates would be adopted. U.S. officials
saw this opportunity in the long struggle between the British and Chinese governments over the
British marketing of opium in China. The U.S. wanted to take an active role to meet with Chinese
officials to create a system of international drug control. To their embarrassment, the Americans
realized they had no national law against opium themselves. The result was the sudden efforts
that lead to outlawing opium in 1909 (Musto, 1999).
         Then U.S. President Theodore Roosevelt began this effort by appointing Dr. Hamilton
Wright as the United States Opium Commissioner in 1908. Dr. Wright was an American physician
who built his reputation on the discovery of a pathogen which supposedly caused beri-beri
(Jonnes, 1999). Of course, this claim did not survive the test of time as beri-beri was determined
to be a severe and chronic form of Thiamine deficiency, a discovery which Christian Eijkman and
Frederick Hopkins were awarded the 1929 Nobel Prize for Physiology and Medicine (Nobel Prize
Outreach AB, 2022). In studying pathogens in the tropics, Wright became interested in the
countries’ social and economic problems and authored many articles on the topic (Jonnes,
1999). These writings clearly caught the attention of the U.S. policymakers as he was chosen to
represent the country at a series of international conventions to facilitate the regulation of opium.
         The Americans met with the Chinese at the Shanghai Convention of 1909, and later, when
more countries got involved, the Hague Convention of 1912. The legislation that resulted from
these conventions created the framework for international drug policy still in use today. The
conventions were also the first in setting the precedent of the U.S. being a driving force for these
                                                  11


efforts, and that one American in particular (in this case, Dr. Wright) to be incredibly influential
(Babor et al., 2010; Helmer & Vietorisz, 1974).
         International drug policy is still largely influenced by the internal conditions in the U.S.
(Musto, 1999). The U.S. policymakers of the time were motivated by a mixture of moral
leadership, protection of U.S. prosperity, and a desire to assuage the resistance of the Chinese
regarding American financial investments. In the creation of international and domestic laws,
they were greatly aided by American citizens’ prejudice against the Chinese and their association
with smoking opium (Helmer & Vietorisz, 1974).
2.2.3 Roots in Racism and Class Warfare
         In the late 19th century, opium smoking was largely believed to be a cultural norm of
Chinese-Americans which came along with the Chinese labor force which largely built the
railroads and other early industries in the American West. As in other examples of race-specific
drug use stereotypes, this is likely a false over-simplification. The truth is likely more complex,
involving the socio-economic drivers of race and class hierarchies and the economic prospects
of White Americans (Helmer & Vietorisz, 1974).
         The competition for jobs between working-class whites and the immigrant Chinese led
to a campaign of excluding Chinese immigrants from the labor force between 1875 and 1880
(Helmer & Vietorisz, 1974). There is no record or official notice of Chinese-operated opium dens
until this large-scale labor exclusion period (Helmer & Vietorisz, 1974). Opium use became a part
of this hostile stereotype against the Chinese immigrants and was used to fuel the campaign,
leading to the earliest opium legislation in the country enacted in San Francisco in 1875. “It was
                                                    12


its character as a Chinese habit, not as a narcotic, which warranted the earliest legislation against
opium in the country” (Helmer & Vietorisz, 1974).
        A similar story can be told about cocaine and Black Americans a few decades later. In
the early 20th century, racially motivated horror stories regarding the actions of Black men using
cocaine were widespread. Newspapers, including the New York Times, reported that “negro
cocaine fiends” were raping white women (New York Times, 1914). Such reports emphasized
that the local police did not have the means to prevent these violent acts. These reports were
supported by important policymakers like Dr. Wright, who stated in 1910: “The use of cocaine
by the negroes of the South is one of the most elusive and troublesome questions which confront
the enforcement of the law in most of the Southern states…. [Cocaine] is often the direct
incentive to the crime of rape by the negroes…” (Wright, 1910). Today, most believe that Wright
was reporting unsubstantiated gossip as all data from the time point to extremely low prevalence
rates for cocaine or other drug use among the Blacks of the South (Brecher et al., 1972; Musto,
1999; Helmer & Vietorisz, 1974). The fear spread by these news outlets coincided with the “peak
of lynchings, legal segregation, and voting laws all designed to remove political and social power
from him” (Musto, 1999).
        Inpatient psychiatric treatment data, policing data, and import records from the time paint
a very different picture. Cocaine use is believed to have peaked in 1907 followed by a sharp
decrease and stabilization at low rates during World War I, a period during which cocaine use
did not seem to vary appreciably across the U.S. subgroups characterized as ‘White’ versus
‘Negro’ (Brecher et al., 1972). Many believe this was due to the PFDA when liquor prohibition
produced liquor scarcity. As a result, poor southerners, especially minority group members, were
purported to turn to cola drinks or cocaine itself. In response, more state-level laws against
                                                 13


cocaine use followed, using the blueprint of stoking white fear through news outlets. Later, when
some minority group members were becoming over-represented among drug-using groups, the
public health community tended to ignore the evidence (Musto, 1999).
        These racial and economic tensions helped contribute to the U.S. Congress enactment
of the Harrison Narcotic Act of 1914, largely authored by Dr. Wright. The legislation defined
narcotics as any opiate or cocaine (which Congress had erroneously labeled as a narcotic)
derivative product (Schaffer Library, n.d.; Brecher et al., 1972). The act required that standard
order receipts be issued and kept by any purchaser of narcotics and kept for two years for review
by federal revenue agents (IRS). Copies were kept in a permanent file by the IRS. Registered
physicians were required only to keep records of drugs dispensed or prescribed, thus protecting
physicians prescribing the drugs “in the course of his professional practice only”. Maximum
amounts were set for patent medicines containing heroin, opium, cocaine, or morphine, but
products could still be sold in general stores or by mail order. Essentially, everyone dealing in
narcotics except the consumer would have to be registered. Cannabis was notably omitted from
the final version of the law (Schaefer Library of Drug Policy, n.d.).
        An important development in the story of the racist and classist roots of drug policy
involves cannabis and Mexican-Americans in the 1930s. In this case, Mexican immigrants
certainly were using cannabis, a common custom that resembled the use of alcohol among
Americans. However, there was almost no awareness of or concern for cannabis use by law
enforcement or the community before the 1930s. As in the case of Chinese immigrants and
opium, the conflicts began when white working-class jobs and the sustainability of industry were
threatened. This situation, however, had the added ingredients of repealing alcohol prohibition
laws and one incredibly amoral and powerful figure- Harry Anslinger.
                                                 14


         Similar to Dr. Wright, Ansligner was a government appointee who fundamentally shaped
drug policy in the U.S. Named as the founding commissioner of the Federal Narcotics Bureau
(the precursor to the Drug Enforcement Agency) by Andrew Mellon, Anslinger led a harsh legal
campaign that often conflated race with drug use and inferiority (McWilliams, 1990; Smith, 2018).
Criticizing Anslinger and his methods has become very popular among scholars and activists
and there is no shortage of evidence to put Anslinger’s hypocrisy and racism on full display
(Brecher et al., 1972; Bonnie & Whitebread, 1970; Smith, 2018). With the help of William Hearst,
the prolific media mogul of the early 20th century, Anslinger successfully lobbied congress
against the scientific consensus that cannabis was not harmful (McWilliams, 1990). The resulting
framework of laws would come to be the foundation of the later war on drugs.
         The Federal Marihuana Tax Act (FMTA) was passed in 1937 in response to political
pressure from states bordering Mexico. Many members of the House of Representatives
famously did not even know what cannabis was, nor what the act was introducing (Bonnie &
Whitebread, 1970). A brief explanation for the changed policy, and for the origins of the pressure
to change it, is that cannabis was commonly used by Mexican immigrants to the U.S. Hence,
the legislation was intended to increase the cost of cannabis with the hypothesis that some
Mexican immigrants would return to Mexico if they could no longer afford the drug. The FMTA
was directly modeled after the earlier Harrison act authored by Wright, with a few differences.
Most importantly, possession of cannabis without a written order would be punishable with a
fine of up to $2,000 and no more than five years in prison.
2.2.4 Arriving at the Controlled Substances Act
         The FMTA in its various forms regulated cannabis for over 30 years until Timothy Leary,
a Harvard professor of psychology and outspoken psychedelics advocate, was arrested for
                                                 15


possession of cannabis in 1969. Leary contended that the FMTA violated his 5th amendment
rights against self-incrimination and the case was elevated up to the Supreme Court of the United
States (SCOTUS). In its decision, SCOTUS unanimously agreed with Leary’s case and declared
the FMTA to be unconstitutional. Leaving cannabis unregulated, however, was an unacceptable
idea to most of the U.S. government. The removal of the outdated legislation cleared the way for
a new, more conservative approach. The Controlled Substances Act (CSA) of 1970 replaced
most of the federal regulations regarding psychoactive drugs that came before it, including the
FMTA and the Harrison Narcotic Act, reinforcing the illegality of all previously regulated drugs.
The CSA clearly stated what the authority of the federal government would be and provided a
framework within which all existing and new drugs could be regulated based on the three criteria
of abuse potential, safety, and medical utility (see figure 2).
Figure 2. Controlled Substances Act schedules and criteria.
        An important change in the legal framework between this legislation and prior legislation
was the ability to regulate drugs as they were developed using a common structure, without the
                                                  16


need for congressional legislation. This need was made apparent in the period after World War
II in which many new synthetic narcotics were developed (Spillane, 2004). Most importantly for
our purposes, however, is that cannabis was now classified alongside heroin and hallucinogens
as drugs with a high potential for abuse, unsafe to use (even under medical supervision), and
with no currently accepted medical use (U.S. Department of Justice, 1970).
        The use of psychoactive drugs is a timeless societal practice. Whether for recreational,
medicinal, or spiritual use, societies from before the common era until the past century largely
accepted and did not think to regulate the use of cannabis. When cannabis did become
regulated in the US, the intentions were undoubtedly immoral and facetious. Yet, the intention
and consequences of the actions are two very different constructs. Overall, it is not clear whether
this misguided process has been beneficial for citizens of the U.S. or detrimental. Judgments of
this type cannot be prescribed as the benefits and harms of cannabis regulation are largely value-
based and vary widely from person to person. In the history of drug policy in the U.S., there has
been some movement from its origins in racism and personal crusades to a more evidence-
based approach, yet the legacy of the CSA has not achieved anything close to equality. The
evidence regarding cannabis shows a widening gap in how the drug is treated by the law and
how it is perceived by society (Pew, 2021). The FMTA, with all its flaws, lasted 40 years. It has
now been 50 years since the CSA was passed, and society has demonstrated its impatience
with this outdated law through state-level legalization, most often through popular vote on a
ballot measure.
2.2.5 The Modern Era
        The era of modern cannabis legalization can be traced back to the medicalization
movement in 1990’s California but did not truly begin until the first two legalization ballots passed
                                                17


in 2012. Colorado and Washington were the first two states to legalize the recreational use and
sale of cannabis in 2012. Alaska and Oregon followed suit in 2014, legalizing recreational
cannabis through ballot measures as well. In 2016, California, Nevada, Maine, and
Massachusetts all approved ballot measures to legalize recreational cannabis as well. In a slight
deviation, Vermont became the first state to legalize recreational cannabis through the state
legislature in 2018, although not allowing for the commercial sale of cannabis. In that same year,
voters in Michigan approved a ballot measure to legalize recreational cannabis as well. In 2019,
Illinois followed the example set by Vermont and legalized recreational cannabis through the
state legislature, and in 2020, Vermont legalized cannabis sales, again through the state
legislature, and voters in Arizona, Montana, New Jersey, and South Dakota all approved ballot
measures to legalize recreational cannabis. In the most recent movement, voters in South
Dakota voted to legalize both recreational and medical cannabis.
         Critical to this dissertation, many states have granted sub-state jurisdictions (cities and
counties) the authority to make their own decisions regarding the legalization of cannabis. This
results in a legal patchwork of many county or sub-state areas with differing policies regarding
cannabis. Figure 3 shows a map of the states that allow local authorities to depart from the state
provisions for recreational cannabis legalization as of January 2020.
                                                  18


Figure 3. A map of states that give local authorities the right to depart from the state provisions
regarding the recreational use of cannabis as of January 1, 2020.
2.3 An Initial Look at Drug Policy and its Intentions
         Research into the epidemiological parameters of drug use did not occur until the turn of
the 20th century around the time that Hamilton Wright participated in the Shanghai convention.
In what is perhaps the first analysis of drug dependence epidemiology and the laws which
influence it, Lawrence Kolb and A.G. Du Mez published a review of drug use under different
policies. The evidence showed disparate estimates of “the number of addicts” using a number
of different methods and data sources in areas with differing policies (1924).
         In this section I provide a rough overview of historical drug policy instruments that have
been used (sometimes) in attempts to shift epidemiological parameters in directions of public
health improvements. The table lists a selection of the drug policy instruments I have found in
my review of the literature, along with my description of what I have discerned as the intended
                                                  19


shifts in epidemiological parameters and any epidemiological evidence that does exist. To some
extent, there is a vagueness in the epidemiological parameters because the historical records
do not always clearly state the intended purpose of implementing each drug policy instrument.
In some instances, I have had to infer the intended purpose. In other instances, the intended
purpose might have been outside the boundaries of what we think of as epidemiological and
public health parameters, as illustrated by apparently racist social control effects of early policies
on opium smoking and cannabis use described elsewhere.
    Estimates of the number of narcotic (again, defined at the time as both opiates and cocaine)
dependents in the U.S. during the time of the first national laws range from 182,215 (1884) to
782,118 (1913), or an estimated 1-2% of the population. Survey research approaches of the time
were not nearly as rigorous then as it is today and scholars agree that no one survey can be
trusted (Bonnie & Whitebread, 1970). In a 1924 review of the most rigorous studies completed
between 1915 and 1922, the breadth of the estimate was found to be even wider. However, after
a careful evaluation of the biases in each survey method and sample, the authors arrive at a likely
figure of around 215,000 in 1915 and 110,000 in 1922 (Kolb & Du Mez, 1924).
    In a masterful work of the time, the authors of The Opium Problem come to the similar
conclusions on the state of the estimates of opium users. Terry and Pellens cite much of the
same work with estimates between “a few thousand individuals to several millions” and conclude
that “under present conditions it is impossible to obtain [an accurate estimate of the total number
of opiate users]” (1928). Nonetheless, the book presented some of the earliest evidence on the
basic epidemiology of opium users including the demographic differences, etiology, pathology,
treatments, and symptomology.
                                                 20


     The decrease in the estimates reported by Kolb and Du Mez reflects the cultural shift of the
time as opposition grew to the use and promotion of psychoactive drugs, partly influenced by
religiosity and the temperance movement in Britain and the U.S., but also by indigenous
movements among colonized peoples (Babor et al., 2010). In the first international law regarding
psychoactive drugs, the Brussels General Act of 1890 regulated distilled spirits for large parts of
Africa. Liquor would be prohibited “for the native population” or its sale would be taxed (Babor
et al., 2010). The proposed intention of the law was to protect the native people of Africa. Seen
through the lens of international power dynamics, the prohibition aspects of the law appear more
as a mechanism of control. In this regulation, we also see the earliest framework of the bifurcated
system of law governing psychoactive drugs- one mechanism controlling supply while the other
controls demand. The taxation aspect of the law is a penalty to the supplier, while prohibition
seeks to control demand, an important dynamic in drug policy.
     In this line of research, the development of new statistical techniques has been critical.
Controlling for changing demographics and other differences in populations before and after a
policy change, or between populations with different policy experiences, continues to be the
most important issue.
2.4 Related Developments in Social Statistics and Study Design
2.4.1 Earliest work
         The earliest instance of probabilistic thinking is commonly believed to have been
recorded by Cicero, who, around the year 85 B.C.E., referred to events likely to happen as
probabile (Gigerenzer, et al., 1990). It was not until the 14th century C.E. that the first known
attempt at a systematic calculus for enumerating all possibilities of dice rolls was written, the
ancestor of today’s concept of permutation (Kendall, 1956). The origin of statistical probability is
                                                21


commonly believed to have started with the Italian maritime insurance industry of the 15th and
16th centuries. Despite this persistent belief, these insurers kept no data on shipwrecks or other
mishaps and the premiums they established were somewhat arbitrary (Gigerenzer et al., 1990).
The reality is in the opposite direction; it was the early mathematics of Blaise Pascal and Pierre
de Fermat which influenced the insurance industry (Maistrov, 2014). Pascal and Fermat are
primarily credited with creating the first formulations of probability in 1654. Their contributions
included the fundamental concept of expectation defined as the product of the probability of an
event 𝑒 and its outcome value 𝑉 :
                                              𝑃(𝑒) 𝑉 = 𝐸
         Once the work of Pascal and Fermat was completed, applications of probability theory
spread. One early use of data to draw statistical inferences includes an important milestone for
epidemiology, as manifest in John Graunt’s Natural and Political Observations Made upon the
Bills of Mortality, first published in 1662 (only 8 years after Pascal and Fermat’s seminal work).
Graunt sought to predict mortality and survivorship of the citizens of London in ten-year intervals
(Graunt, 1939). Important developments in probability theory continued alongside statistics,
most notably by Huygens, Bernoulli, DeMoivre, Bayes, and LaPlace, until the two were merged
to create the important science of inference we use today (Gigerenzer, et al., 1990; Maistrov,
2014). Nevertheless, here I will branch out to the early epidemiological work focused on disease
outcomes in populations
         It was during the age of enlightenment that these classical probability theorists made it
clear that their mathematics should be applied to “civil life”, a “social mathematics” as the early
social scientist Nicolas de Condorcet put it (Baker, 1975). Condorcet even went as far as arguing
to restructure the French judicial system to be based on statistical probability (Boland, 1989).
                                                  22


This movement, of course, was never embraced by society at large as later statisticians found
the logic to be too simplistic (Gigerenzer, et al., 1990). The application of statistics to the social
sciences for other problems, however, proved to be enormously useful and influential. Especially
influential contributions were made by William Farr and Emile Durkheim, including attempts to
understand the epidemiological patterns of suicide mortality rates (Durkheim, 1897; Farr, 2000).
2.4.2 Early Sociology and Psychiatric Epidemiology
        Farr worked primarily with census data and mortality records, being the first to combine
the two to test whether suicide risk varied with other life experiences. In his own words, Farr
explained and concluded that “The Importance of this determination will become apparent by
enumerating some of the relations the mortality bears to other orders of facts… the difference of
external circumstances and sanitary condition exercise a very real influence on life, disease, and
death…“ (Farr, 2000).
     Emile Durkheim's Rules of the Sociological Method built on the types of observations Farr
had made. Although they were contemporaries, I could find no evidence that Farr influenced
Durkheim’s work or vice-versa. Durkheim applied his newly developed sociological theory in his
seminal work On Suicide. He argued that higher rates of suicide were partially due to the absence
of shared social values and norms (anomie) in the general population and lower rates of suicide
with the opposite - more social integration and shared values (Durkheim, 1897).
     While great strides were being made on the pioneering topic of psychiatric epidemiology,
similar work on drug dependence syndromes was not nearly as advanced as the analytical
epidemiology already being conducted on suicide. The earliest descriptive epidemiological
estimates on drug dependence often took the form of attempts at estimating the “number of
addicts” in the U.S. (Kolb & Du Mez, 1924) (see section 2.2 for estimates). Much work had yet to
                                                  23


be done in the domain of survey methodology to produce reliable population-level estimates,
but more important to the current work was the development of modern statistics and the
concept of controlling for confounding to create causal estimates.
2.4.3 Classical to Modern Statistics and Causal Inference
     The concept of correlation (and the need to control for it in certain cases with a tool he named
“regression”) was first introduced by Francis Galton in his seminal paper Typical Laws of
Heredity, published in Nature in 1877 (Galton, 1877). Some of Galton’s later work sparked the
interest of Karl Pearson, who is largely credited with the development of modern mathematical
statistics (Varberg, 1963). Among Pearson’s many contributions are the precursors to
conventional probability distributions and the P-value. However, most important to the current
work are the contributions of Ronald Fischer and his later debates on how to constitute causality
with Austin Bradford Hill.
     Fisher is credited with developing the foundations of modern statistical science, initiating the
original principles of study design, and developing the first randomized trials in his agricultural
work as a way of correctly adjusting for random error terms (Hald, 1998). British statistician and
epidemiologist Austin Bradford Hill is credited as the first scientist to apply Fisher’s concept of
randomization in studies of humans (Hill, 1951; Hill, 1952; Hill, 1953). The randomized controlled
trial is perhaps the greatest advance towards the framework of causality as it is still considered
the gold standard for estimating the counterfactual (what would have happened to these people
if an event did not occur?). Yet, it was Hill’s work with British epidemiologist Richard Doll using
non-random case-control studies, that would convince the world that smoking tobacco was the
primary cause of lung cancer and other illnesses (Hill, 1965; Hill and Doll, 1956).
                                                 24


     Despite the large, estimated associations linking tobacco with morbidity and mortality, Fisher
was skeptical. He asked whether causal inference was justified when the data did not have the
rigor of a randomized trial (Andersen, 2007). The arguments and debates of this age culminated
in the now-famous first Surgeon General’s report on tobacco and health of the 1960s. This report
laid out a new framework for determining causality and sided with Hill and Doll that smoking
indeed was a cause of lung cancer, heart problems, and many other detrimental health
outcomes, without the need for experimental evidence (United States Surgeon General, 1964).
This new framework opened the door for the “web of causation”. Allowing for the consideration
of different forms of evidence (strength of association, consistency, specificity, temporality, etc.)
in analyzing many possible contributing causes of disease and health behaviors is now widely
used, yet still controversial (MacMahon, Pugh, and Ipsen, 1960). The framework was later
revisited by Hill who authored what many consider to be the best list of criteria for determining
causality. Hill’s criteria are still in use today in the study the causes of chronic diseases which
have no definitive microorganism or other specific causal agent to which we can point our
collective finger (Hill, 1965; Shimonovich et al., 2020). The conceptualization of the “web of
causation” was necessary to facilitate our more modern concept of studying proportions of
causal attribution in a probabilistic world.
2.4.4 Controlling for Confounding and Policy Analysis
         To date, governments have rarely adopted Fisher’s random experiment design when its
decision-makers seek to change the state of affairs by making new policy decisions. There are
some exceptions to this general rule, as in the U.S. federal government’s Moving To Opportunity
housing voucher experiment (Katz, Kling, & Liebman, 2001; Leventhal & Brooks-Gunn, 2003;
Chetty, Hendren, & Katz, 2016). Instead, the implementation of policy is almost always inherently
                                                    25


nonrandom, particularly in a democratic society like the United States. The passage of laws and
policies in the U.S. can occur by several mechanisms but are primarily voted on and approved
by a majority ballot or precedent can be set by court rulings. In both cases, the beliefs and shared
values of the citizens affected by the law drive the change of the law itself - whether directly, by
voting on a ballot measure, or indirectly, by voting on political appointments and local judges.
        This policy-making process yields inherently confounded relationships when the goal is
to estimate the effects of various policies. This topic of study is of the utmost importance to the
current work which seeks to estimate the causal effect of recreational cannabis legalization
(RCL). The roots of this dissertation project’s approach can be seen in early evolution of time
series experiments and quasi-experiments in education and psychology (e.g., see Campbell &
Stanley , 1963).
        The field began with the development of interrupted time series analysis by Campbell
and Cook (1979). In their textbook for quasi-experimentation in field settings, Campbell and Cook
outline the method of interrupted time series as an evaluation of the change in the level of an
outcome over time if the level of the outcome would not have changed if the intervention under
study did not occur. Many critics noted that this is a strong assumption to be made, is especially
prone to selection bias, and the outcome may vary with many other factors, both known and
unknown (Baicker & Svoronos, 2019).
        To deal with these criticisms, the differences-in-differences (DiD) approach was devised.
DiD is one of the best tools econometrics offers to deal with unobserved confounders with a
loose assumption that the trends in both the treatment and control groups are parallel (Angrist
and Pischke, 2008). DiD models require panel data or repeated cross-sections and are ideal
when the event to be modeled occurs at an aggregate level, such as a state. DiD takes advantage
                                                  26


of the trends, before and after the event, in a group that experienced the event (treatment group)
and a group that did not experience the event (control group) (figure 4).
Figure 4. How the causal effect is estimated in the differences-in-differences model.
Recent explorations and analyses by economists have revealed that this estimate of the average
treatment effect is a bit of an over-simplification, especially when more than two time periods
can be defined for each group (Goodman-Bacon, 2018; Callaway and Sant’Anna, 2020;
Cunningham, 2020). The average treatment effect on the treated (ATT) is a weighted average of
all the possible two-period estimators, which is problematic as it averages out the treatment
effect heterogeneity that can take place over time. When treatment effects change over time, the
ATT estimate is biased (Goodman-Bacon, 2021). In the drug policy literature, there is good
evidence that effects will change over time due to the so-called policy lag effect (Cheng et al.,
2019; Hall & Weier, 2015).
         In this dissertation research project, my goal has been to estimate the causal effect of
RCL passage on the occurrence of newly incident cannabis use in the United States using an
extension of the DiD model that uses treatment leads and lags to dynamically model the changes
in cannabis use incidence before and after the law was changed. This model allows for the effect
                                                  27


of time since or before the RCL passage to be estimated while controlling for the fixed effects of
the states and time. Using this event study model with leads and lags in treatment timing, I will
show that all states were comparable on cannabis incidence dynamics. I then estimate the
degree to which the treatment effect changes over time and estimate the effects of RCL passage
on cannabis incidence separately for underage persons and for those aged 21 and older.
2.5 The Current Understanding of the Effects of Cannabis Legalization
         Research into U.S. drug policy has gained pace in recent years alongside the growing
number of states which have legalized cannabis. The legalization movement has been fueled by
a growing belief among Americans that cannabis should be legal (Pew, 2019), partially fueled by
recognition of the adverse consequences of mass incarceration (Gallup, 2016). At the heart of
this research lies the goal of quantifying the changes in cannabis use epidemiology among the
constituents of states that have legalized cannabis. Since the beginning of the legalization era in
the U.S., much has been written about the potential impacts of the law changes and a few
studies have gone as far as to ascribe causal changes to the differences the authors saw. This
literature has focused almost exclusively on measures of prevalence and by and large focuses
on two distinct populations- individuals under the age of 21, and individuals 21 and over.
2.5.1 Cannabis Use in Individuals Under 21 After Legalization
         The evidence on the effects of cannabis legalization on cannabis use among youth is
mixed. In one of the earliest studies of its kind, no change in the past 30-day prevalence of
cannabis use among Colorado high schoolers was found between 2013 and 2014 in the Healthy
Kids Survey (Gruber et al., 2016). According to a more recent analysis of the Healthy Kids
Service, this flat trend has remained unchanged through 2019 (Reed, 2021). In Washington,
                                                28


researchers found an increase in past 30-day prevalence among tenth graders according to
Monitoring the Future data (Cerdá et al., 2017). However, another group analyzed a different
dataset collected from the same population of high schoolers in Washington State and found
decreases among eighth and tenth graders in the Healthy Youth Survey data (Dilley et al., 2019).
        More recent analyses of the National Survey on Drug Use and Health (NSDUH) found that
12–17-year-old participants in the states with legalized recreational cannabis (RCL) had an
increased prevalence of cannabis use disorder (CUD) (Cerdá et al., 2020). Yet, another analysis
found no changes in cannabis use prevalence for any racial or ethnic group among individuals
aged 12 to 20 (Martins et al., 2021). In an analysis of the national sample survey data collected
for the Youth Risk Behavior Survey, Coley and colleagues found no evidence that RCLs were
associated with increased likelihood or level of cannabis use among adolescents and even found
a 16% lower prevalence of use among prior cannabis users (2021). Meanwhile, however,
Paschall, García-Ramírez, & Grube reported an increase in lifetime cumulative incidence
proportion of cannabis use in the California Healthy Kids Survey (2021).
        Midgette and Rueter have argued that some of the heterogeneity in results might be
attributable to differences in sampling between nationally representative surveys and state
representative surveys when investigating state-specific results (2020). However, differences
between nationally representative and state representative surveys are not always in the same
direction. A 2019 meta-analysis on the topic showed null findings when analyzing the results of
studies with the lowest risk of bias, while a small increase was found in prevalence when
including all studies (Melchior et al., 2019). In a systematic review published in the same year,
the authors concluded that findings among youth were mixed primarily by state, with increased
                                                  29


use prevalence among youth in Washington and Oregon, but not in Colorado (Smart and Pacula,
2019).
2.5.2 Cannabis Use in Individuals Over 21 After Legalization
        The evidence on the prevalence of cannabis use after legalization among adults has been
more consistent. Excepting some early null findings from the Colorado Behavioral Risk Factor
Surveillance System and the National Alcohol Surveys (Reed, 2016; Kerr, Lui, & Ye, 2018),
evidence from studies published after 2016 consistently show increases in cannabis use
prevalence. Despite the earlier null finding in Colorado, recent cannabis use prevalence
increased to 17.5% in 2017 and continued to 19% in 2019 (Reed, 2021). Increases in some adult
age groups in the prevalence of frequent users, CUD, and past-year cannabis use prevalence
have been consistently reported among adults in the NSDUH (Cerdá et al., 2020). These findings
are especially robust among the White and Hispanic sub-groups (Martins et al., 2021). Although
the early studies and systematic review of the literature reported no effects for adults (Reed,
2016; Kerr, Lui, & Ye, 2018; Smart and Pacula, 2019), the more recent analyses show consistent
evidence of increased use among adults.
2.6 Significance
        As proposed, my dissertation research project was designed to address three critical
barriers in the current state of policy-guiding evidence that should be produced when cannabis
use epidemiology seeks to shed light on recent changes in state-level recreational cannabis
laws. First, many major works to date have failed to adequately address the within-state variation
that exists in each of these states with regards to municipality and county cannabis laws.
Second, none of these analyses have been based on estimates of the occurrence of newly
                                                 30


incident cannabis use. All prior research has focused on prevalence proportions. Third, there has
not been a uniform framework to make it possible to control for differences between areas that
legalized cannabis and those that have not, or within the same state before and after cannabis
retail sales have been permitted.
         Ignoring the variation of the law at a sub-state level is an error that could result in a biased
view of recreational cannabis legalization by averaging out important subgroup heterogeneity.
Many local municipalities and counties have chosen to keep cannabis as a regulated schedule I
drug, or to ban the commercial sale of cannabis, after the state legalized the drug. Residents in
these areas are included in state-level and nationally representative samples. When analyzing
differences between states, these individuals are included in a state where recreational cannabis
is “legal” (i.e., differential misclassification).
         As stated in Chapter 1, prevalence estimates do not capture the rate at which
adolescents and adults are trying the drug for the first time. Age of first use is of particular interest
in cannabis epidemiology given that one of the major pillars of cannabis policy is to prevent new
users, especially in their adolescent years. Incidence is a critical component in understanding
the public health consequences of legalization given the plethora of evidence that associates
younger age of first use with a vast array of negative outcomes (Volkow et al., 2014; Fontes et
al., 2011; Horwood et al., 2010; Wagner, 2002). Incidence is also an essential component in
testing the hypothesis that a sub-group of the population does not use cannabis for the sole
reason that it is illegal and for understanding the shifting distributions in the age of first use.
         One path toward attributing a causal relationship between cannabis policy change and
an epidemiological parameter of interest is to create a framework that can be used to control for
                                                   31


differences in states with and without cannabis legalization. Developing the prediction algorithm
is a novel method to determine which facets of different areas vary with the policy changes and
which need to be controlled for in non-experimental designs. This framework allows for
repeatable experiments and can be used by other researchers in their work on other changes
that may have occurred as a result of recreational cannabis legalization.
         Addressing these limitations is significant to public health. Targeted prevention
campaigns for alcohol and tobacco use have been a major public health success story, partly
due to early age-specific targeting (Dobbins et al., 2008) and appropriate messaging (Pierce,
White, & Messer, 2009). The results of this study might reveal if the age of first use pattern
changes after cannabis policy liberalization and will give us a detailed understanding of the
demographics who are experiencing these changes.
2.7 Potential Impact on the Field
         If the aims of this project are achieved, a more specific understanding of cannabis
initiation and use after recreational cannabis legalization will help guide future policy decisions
and initiatives. Technical capacity to project changes in epidemiology will be improved by
incorporating the effects of these often-unmeasured local factors and the confounding variables
which facilitated legalization in the first place. Successful completion of these aims will increase
the accuracy of current models of cannabis use and improve our understanding of the effects of
drug policy changes.
                                                   32


                                   3. Materials and Methods
3.1 Overview of this Chapter
    To understand how cannabis legalization affects the epidemiology of cannabis use we, of
course, must have valid estimates of the main epidemiological parameters at the population
level. In addition, we also must understand the differences between states and counties which
prefer to legalize cannabis from those that prefer to keep cannabis as a schedule I drug to make
valid inferences regarding causation. This chapter will outline the methods, by aim, that I used
to:
         1. Develop a predictive model of sub-state cannabis legalization using publicly
            available datasets that are readily available to other investigators, and that can be
            used in future investigations.
         2. Provide evidence on the degree to which the incidence of cannabis use might have
            increased or decreased after cannabis legalization for two important subgroups of
            the population: (1) the adults who are permitted to make a retail purchase of a
            cannabis product in each jurisdiction, and (2) the underage adolescents (<21 years
            old) for whom retail purchase of cannabis products remains prohibited in each
            jurisdiction.
         3. Estimate the degree to which the legalization of cannabis might have affected the
            age of first cannabis use with special attention to the legal minimum age (LMA).
                                                 33


3.1.1 Details on IRB Approval, Recruitment, and Participation Levels
    The current study was determined by the MSU IRB as not human research on 8/27/2021.
Proof: STUDY00006620.
    Overall interview participation levels in the NSDUH are between 67%-75%, which is slightly
lower than corresponding levels for the 12-22 year-olds in this study’s sub-samples. See table
1 for the sample size, response rates, and overall participation rates in the NSDUH for each year
under study.
Table 1. Sample sizes and participation levels of successive years of the National Surveys on
Drug Use and Health.
                     Total                                                          Overall
                                 Weighted Screening       Weighted Interview
  Survey Year       Sample                                                        Participation
                                    Response Rate          Response Rate
                      Size                                                           Level*
       2008          68736                89%                    74%                  66%
       2009          68700                89%                    76%                  67%
       2010          68487                89%                    75%                  66%
       2011          70109                87%                    74%                  65%
       2012          68309                86%                    73%                  63%
       2013          67838                84%                    72%                  60%
       2014          67901                82%                    71%                  58%
       2015          68073                80%                    70%                  56%
       2016          67942                78%                    68%                  53%
       2017          68032                75%                    67%                  50%
       2018          67791                73%                    67%                  49%
       2019          67625                71%                    65%                  46%
                                                  34


3.2 Aim 1
3.2.1 Study population and sample
    For this predictive study of recreational cannabis policy change, I used data from a variety of
publicly available sources to study 3094 counties (including county equivalents) of the United
States (U.S.). The data on the counties consist of data collected from individuals aggregated to
the county level as well as data that is inherent to the counties themselves. The publicly available
data sources I used include the 2010 - 2012 Small Area Estimates from the National Surveys on
Drug Use and Health (NSDUH), the 2010 Census, and the 2012 County Presidential Data from
the MIT Elections Lab. Each data set used in the analysis is described separately.
3.2.1.1 National Surveys on Drug Use and Health Small Area Estimates
    The population in these surveys was specified to include non-institutionalized U.S. civilian
residents, sampled and assessed for successive NSDUH surveys in the years 2010, 2011, and
2012. The data used in this project was made available at the substate region level (n=369) and
downloaded from SAMHSA’s Public Data Access System. These NSDUH cross‐sectional
surveys were conducted with multistage area probability sampling to draw state-level
representative samples and to over-sample 12-to-17‐year‐olds. In the NSDUH surveys
administered in 2010, 2011, and 2012, data were collected from 206,222 individuals with an
average interview participation level of 74% (Montgomery, Thompson, & Anthony, 2022). Data
from 170,978 of these participants were made available in the public use file and used in this
analysis (Montgomery, Thompson, & Anthony, 2022). Home addresses of all participants were
collected and used in a statistical model which links the survey outcome variables to local area
predictors so that the survey outcome of interest in an area that may have not been chosen in
the probability sampling stage can be predicted. The variables I used from this data source
                                                  35


include the prevalences of alcohol use disorder in the past year, alcohol use in the past month,
cigarette use in the past month, cocaine use in the past year, serious thoughts of suicide in the
past year, illicit drug dependence in the past year, marijuana use in the past month, and serious
mental illness in the past year.
3.2.1.2 Census
    Data from the 2010 Census was downloaded from the census.gov website (Census Bureau,
2010). The decennial census seeks to count every member of the U.S. population and records
basic demographic information such as age, sex, marriage status, race and ethnicity, information
about the households and living arrangements, and county-level information including total
population, land area, water area, population density, and the number of occupied and vacant
housing units. The data are reported at the county level.
3.2.1.3 County Presidential Data
        The Massachusetts Institution of Technology Election Data and Science Lab collects and
makes available data on U.S. presidential elections, as well as data on U.S. house and senate
elections, and state and local elections. The County Presidential Elections Returns 2000-2020
were used in this analysis (MIT Election Data and Science Lab, 2018). The file contains the total
number of votes in every U.S. county for each major party presidential candidate in the general
election (democrat, republican) as well as the total votes cast for third parties. I used the percent
of votes for the republican and democratic candidates as the variables in this analysis.
3.2.1.4 Cannabis Legalization Status
        In 2014, four states had legalized recreational cannabis at the state level. Colorado and
Washington legalized recreational cannabis in 2012 and Oregon and Alaska legalized in 2014. A
responsible government department from each state collected and published lists of cities and
                                                36


counties that opted out of different aspects of the recreational cannabis laws. Colorado counties
were coded according to data published by the State Governments or Municipal League (CML,
2019). Alaskan boroughs were coded according to data from Alaska’s Department of
Commerce, Community, and Economic Development (ADCCED, 2017). Counties in Oregon were
coded according to data collected and published by the Oregon Liquor and Cannabis
Commission (OLCC, 2021).
        Unlike the other three states, the legislation approved in Washington did not allow
substate municipalities local authority over the issue (Colorado Constitution, 2012; Washington
State Liquor Control Board, 2012; Oregon legislature, 2014; Alaska State Legislature, 2014).
While local authority was not granted to cities and counties explicitly through the legislation, land
use laws were used to effectively ban the sale of cannabis in some areas with varying degrees
of success (Darnell, 2015; Dilley et al., 2017). I discuss the methods I used to deal with this
complication and others in the Sensitivity Analyses section.
        Counties that included a city or town where selling recreational cannabis was legal in
2014 were coded as having legal recreational cannabis sales. I used Statsamerica.org’s City and
county Finder to trace each municipality to its home county (Statsamerica, 2021). In some cases,
a city could exist in more than one county, in these cases, all counties were coded as having
legal recreational cannabis.
3.2.2 Data Management
        To appropriately merge all data at the county level, I first created a proprietary crosswalk
dataset to assign every U.S. county to the NSDUH small area estimate regions as defined by the
Substance Abuse and Mental Health Service Administration’s (SAMHSA) documentation (United
States, 2014). This crosswalk allowed me to use the small area estimate of each NSDUH
                                                 37


outcome for each county that existed within its boundaries. Estimates for areas smaller than a
county were not included (District of Columbia wards, LA Statistical Areas, Detroit, and
Wilmington City) as these areas were all nested within broader county-defined regions. Because
of mismatches in documentation, some small area estimates from North Carolina could not be
used. Mental health and drug use estimates in Massachusetts and Connecticut are not county-
specific. I attempted to mitigate the effect of this missing data by imputing state-level averages
of these variables for all counties in their respective states. Similarly, discrepancies in the
presidential election data from Alaska and a small village in Hawaii made this data unusable at
the county level, the state level averages of presidential voting in Alaska were imputed for all
counties.
        Including every variable from the census data caused separation and numerical instability
in the regression models. Therefore, I used principal components analysis to reduce the number
of census variables from over 1000 to 10 principal components (figure A.1). I then used the first
two principal components which account for ~80% of variance as predictors.
        The NSDUH small area estimates are often suppressed when a county is not large
enough or the estimate is below a certain cut-off. Because of this correlation between county
size, estimate sizes, and missing data, I concluded that the data missing from the NSDUH small
area estimates are not missing at random and would bias the models if used. However, this never
occurs in the estimate which uses the entire age range of the surveys (12 and over, or in some
cases, 18 and over). Therefore, all NSDUH variables used in the prediction are for the whole
survey population for whom the data is available. Estimates by age subgroups were not used.
3.2.3 Study Design
        Because I must classify a relatively small number of counties from a larger set (92 RCL
                                                 38


counties, 3011 non-RCL counties), the overabundance of counties where recreational cannabis
remains illegal would bias my attempts at classification (Nekooeimehr & Lai-Yuen 2016). This
would, in turn, result in the inapplicability of the algorithm to the real world (Zolbanin et al., 2020).
As such, I accounted for the imbalance between the policy conditions being examined using a
relatively simple minority oversampling technique (Kubat, Holte, & Matwin, 1998). To create
probability estimates and 95% confidence intervals (95%CIs) for each county, I used 1000
resamples in the ensemble of logistic regressions (Kittler, 2001). In each iteration, 80% of the
RCL counties (n~74 out of 92) and twice as many non-RCL counties (n ~ 148=74x2) were
randomly allowed into the predictive model.
         I introduced variance to the NSDUH estimates so that the same estimates of drug use
and mental health prevalences were not used for the same county in every iteration. Instead, I
transformed the estimate logarithmically with a normal distribution and chose a number
probabilistically from the distribution about the log transformed estimate. This number was then
back-transformed to be interpretable as a prevalence measure. In other words, I am not using
the observed data as is, but as a probabilistic realization surrounding it. In this way, I
acknowledge the fact that the NSDUH estimates are not absolute truth, but close estimates. The
added variation in each iteration also prevents an overfitted regression model completely
separating RCL and non-RCL counties just by a few descriptive statistics, which harms the
overall specificity of the stacked ensemble predictor by making it overly sensitive to the said
statistics.
3.2.3.1 Pre-iteration Modelling and Validation
         Modelling and validation were performed using standard techniques in supervised
machine learning. In every sample iteration, after ~74 RCL and ~147 non-RCL counties were
                                                    39


selected, the data were split with a random subset of 70% used to train the logistic regression
model, after which I saved the regression coefficients and standard errors. The remaining 30%
served as testing data. To evaluate the predictions using the test cases, I used a four-step
process. First, I used the trained model to evaluate the expected probability of legalization from
the assembled county data. Second, I hard-coded the prediction using a cut-point based on the
overall proportion of legalized counties in the model, which was 33.3% in this case due to the
1:2 sampling scheme. A county was predicted to have RCL if its expected probability exceeded
33.3%, otherwise it was labeled as non-RCL (for logic on choosing a prevalence based cut-off
value, see Gelman & Hill, 2006). Third, I compared the predicted labels to the truth and evaluated
the logistic model’s performance in terms of true positive rate (TPR, or sensitivity), true negative
rate (TNR, or specificity), overall classification accuracy, and the area under the receiver operator
characteristic (ROC) curve (Hanley & McNeil, 1982). Finally, I stored the expected probabilities,
the predicted labels, and the performance values of the model for future ensemble building.
3.2.3.2 Building Ensemble Prediction
        I calculated the ensemble’s probability of legalization for each county by averaging and
weighting the expected probabilities from the subset of models that made predictions on that
county. Both the weighting mechanism and the cut-off value for the prediction label can be
changed to suit the needs of the application. In this study, I weighted the probabilities by the
overall classification accuracy of the model from which the probability was derived, emphasizing
the importance of models that performed better in the 1000 interactions. This was followed by a
final call for the binary prediction label with the same 33.3% cut-point.
3.2.3.3 Sensitivity Analyses
        Because of differences between state administrative structures and cannabis policies, I
                                                    40


planned to conduct several sensitivity analyses to understand which policy coding schemes
would produce the most accurate model. The two states that warranted some re-coding and
analysis were Washington state and Alaska. As mentioned previously, the cannabis reform
legislation of Washington state is unique in this sample as it is the only state which did not
explicitly allow for local bans. However, this is a matter of some controversy, as zoning laws
were used to effectively ban the sale and cultivation of cannabis in 11 counties (Darnell, 2015).
Though the mechanisms are not the same, the sociodemographic factors which lead to them
may be similar. I sought to understand whether including the 39 Washington counties in this
analysis would improve the prediction of local cannabis laws. Washington counties in this
analysis were coded according to the Washington State Institute for Public Policy’s preliminary
implementation report (Darnell, 2015).
         As explained previously, Alaska is divided using a system of boroughs and not counties.
Although they function much the same, unlike county-equivalents in the other 49 states, the
boroughs do not cover the entire land area of the state. Along with the inability to code voter
information to Alaskan boroughs, no local action was taken to ban cannabis at the county level.
Because there is no variance among Alaskan boroughs in this outcome variable, there is reason
to believe that Alaska is not representative of the other 49 states. Thus, I also modeled the data
with and without Alaska to understand whether including information from this state improves
the predictive model.
         For illustrative purposes, I also present sensitivities and specificities of the models using
several different weighting mechanisms and varying the binary cut-off at every 0.1 interval
between 0.1 and 0.9, besides the default 0.333. As I have noted, the weighting mechanisms and
cut-offs can be varied to alter the utility of the model to prioritize sensitivity or specificity. I present
                                                    41


this information in the hope that future researchers may use it to inform the application of such
models.
3.3 Aim 2
3.3.1 Study population and sample
         For this epidemiological study, the population was specified to include non-
institutionalized U.S. civilian residents, sampled and assessed for successive NSDUH survey
waves, 2008 through 2019. These NSDUH cross‐sectional surveys were conducted with
multistage area probability sampling to draw state-level representative samples and to over-
sample 12–17-year‐olds. The total sample size for surveys conducted in this period includes
819,543 respondents with an average overall interview participation level of 58% (Substance
Abuse and Mental Health Services Administration, 2021; Montgomery, Thompson, & Anthony,
2022).
         In Aim 1, it was possible to focus on the county-level cannabis laws because the NSDUH
variables I used from the small area estimate public use files were more common than cannabis
incidence and were also aggregated across three years (2010-2012). First time cannabis use
becomes a relatively rare event after the teen years. Hence, estimates of incidence for the
respondents aged 21 or older are not made available at the county level to protect respondents
from possible re-identification. Aims 2 and 3 focus on the state-level incidence for this reason.
         Standardized audio computer-assisted self-interview modules assessed the month and
year of first cannabis use, from which age-specific incidence rates can be estimated from the
NSDUH Restricted Data Access portal (R-DAS). The R-DAS portal provides analysis weights and
variance estimate capabilities for state-specific and national estimates and 95% confidence
                                                  42


intervals (CI). The R-DAS portal also allows for state-specific analysis of data but can only be
downloaded in year-pairs and not individual years (e.g., 2018 – 2019 vs. 2018, 2019); therefore,
I use data from six year-pairs in the analysis, not 12 individual years. I categorized states into
different analysis groups according to each state’s year of recreational cannabis legalization
(RCL) through 2018. Because the 2018-2019 year-pair is the most recent available data in R-
DAS at the time of analysis, states that legalized cannabis in 2019 or later were categorized into
the illegal group. Washington and Colorado were included in the 2012 group; Oregon, Alaska,
and Washington D.C. were in the 2014 group; California, Maine, Massachusetts, and Nevada
were included in the 2016 group; and Vermont and Michigan were included in the 2018 group.
All other states were categorized into the same illegal cannabis group for this analysis.
3.3.2 Outcome
         To test the hypotheses, the primary estimate is past-year cannabis use incidence,
calculated as 𝜓 = 𝑋𝑟 /𝑁𝑟 , where 𝑋𝑟 is the number of individuals starting to use cannabis within
the one to twelve month interval before assessment, and 𝑁𝑟 is all persons who had not started
using cannabis before that interval. Estimates described in this report are not readily available in
R-DAS. The estimated prevalence rates (𝑝1 = 𝑋𝑟 /𝑁, where 𝑁 is the total projected population
size) and the estimated proportion of the population at risk (𝑝2 = 𝑁𝑟 /𝑁), with the corresponding
standard errors can be obtained. Incidence can then be calculated in terms of 𝑝1 and 𝑝2 as:
                                              𝑝1 𝑋𝑟 Τ𝑁
                                        𝜓=        =        .
                                              𝑝2 𝑁𝑟 Τ𝑁
                                                 43


3.3.3 Study Design and Statistical Analysis
        My study design observed changes in annual cannabis incidence in the RCL states
relative to non-RCL states before and after the legalization of cannabis at the state level. I
estimate this using an event-study model that allowed me to estimate incidence (or other
outcomes) in each period relative to legalization while controlling for fixed differences across
states and national trends over time. All analyses were performed in SAS version 9.04 and use
NSDUH survey weights.
        The models can be expressed as:
                                        4
                        𝑌𝑠𝑡 = 𝑅𝐶𝐿𝑠 × ෍ 𝛽𝑦 𝐼(𝑡 − 𝑡∗𝑠 = 𝑦) + 𝛽𝑡 + 𝛽𝑠 + 𝜖𝑠𝑡
                                      𝑦=−5
                                      𝑦≠−1
        As described earlier, the data is constructed at the state category (s) by year (t) level. In
the primary analyses, Yst measures past-year cannabis incidence for each state grouping and
pair of years. In this equation, 𝛽𝑠 denotes state fixed effects and 𝛽𝑡 denotes fixed effects of time
in calendar years. These account for general trends in cannabis incidence for each group of
states over time. The variable    𝑅𝐶𝐿𝑠is set equal to one if the observation is from a state that
legalized cannabis and was measured after the date of legalization and is set equal to zero
otherwise. The time-event dummy variables 𝐼(𝑡 − 𝑡𝑠∗ = 𝑦) indicate the legality of cannabis in
                                                                                                    ∗
each state group by the first year of the R-DAS year pair relative to the year of legalization (𝑡𝑠 )
and are set equal to zero for all observations from states that did not legalize recreational
cannabis during the study period. These variables are referred to in this analysis as leads
(indicators of time-event before legalization) and lags (indicators of time-event after legalization).
The omitted category is     𝑦 = −1, the year pair before legalization. Therefore, each estimate of
                                                 44


𝛽𝑦 is an estimate of the difference between past-year cannabis incidence in the RCL states
relative to the illegal states during year 𝑦, as measured from the year pair that immediately
preceded legalization. After multiplying the coefficient by 100, these coefficients can be
interpreted as the percentage point change in the past-year cannabis incidence in RCL states
relative to non-RCL states. Where only one or two categories of states would be included at a
specific time point because of the variation in legalization timing across states (≤6 years before
legalization and ≥ 4 years after legalization), the indicators are combined to balance the leads
and lags and prevent modelling the outcome for only a small subset of the data. This is commonly
referred to as balancing the leads and lags of the model.
        If past-year cannabis incidence was trending similarly in all the state groups before
legalization, I expect that the estimated coefficients for the lead indicators will be too small to
represent a true difference. This is a test of the parallel trends assumption built into the regression
models. Similarly, if the estimated coefficients for the lag indicators are positive, this indicates
an increase in the incidence of past-year cannabis use in the RCL states whereas negative
coefficients would indicate a decreasing incidence.
        In addition to the event study estimates of change at each time point, I also present a
simple 2x2 DiD estimate of the ATT as a summary of the estimated effect across all post-
legalization years through 2019. This is estimated using the same equation except that the event
study dummy variables are replaced with a single indicator denoting an RCL state post-
legalization.
                                                    45


3.3.3.1 Dates of Legalization vs. Dates of Implementation
        The best practice in the field has been to analyze the data using the date that cannabis
sales began as the divider between pre and post-periods. However, because of the nature of the
data as reported in the R-DAS system, using the date of legalization made for a cleaner analysis.
I note that the average number of days between the date of legalization and of sales in the states
in the sample (except for Washington D.C. where sales have never been legal) is 497 days.
Therefore, the T0 period in this analysis is a close approximation of the time between legalization
and implementation of the RCLs. The expectation of increased incidence would begin to show
in the surveys after this roughly 500 day period when recreational cannabis sales began.
3.3.3.2 Alternative Specifications and Robustness Checks
        To ensure the robustness of the analyses, I used two different alternate specifications.
The first alternate specification uses the same method to estimate the effect of RCL on cannabis
prevalence. The estimate for prevalence has been studied extensively in the literature and I
compare the results to prior estimates as a check of face validity for the model. The second
robustness check uses a time placebo as a check of robustness. In this model, a random year
within the data was selected as the year that states legalized cannabis. The model is then run
with the same specifications. If any of the model’s coefficients are appreciably different, then
this indicates that there may be a problem in the model or that it is over-sensitive to spurious
associations.
                                                  46


3.4 Aim 3
3.4.1 Study population and sample
        For this study, the population was specified to include non-institutionalized U.S. civilian
residents, sampled and assessed for successive National Surveys on Drug Use and Health
(NSDUH), 2010 through 2019. These NSDUH cross‐sectional surveys were conducted with
multistage area probability sampling to draw state-level representative samples and to over-
sample 12–17-year‐olds, with overall interview participation levels of 67%-75%, slightly lower
than corresponding levels for the 12–22-year-olds in this study’s sub-samples. Standardized
audio computer-assisted self-interview modules assessed the month and year of first cannabis
use, from which age-specific incidence rates can be estimated from the NSDUH Restricted Data
Access portal (R-DAS). The R-DAS portal provides analysis weights and variance estimate
capabilities for state-specific and national estimates and 95% confidence intervals (CI).
3.4.2 Outcome
        For this research, the primary estimate is again the first-time cannabis use (incidence),
calculated as 𝜓 = 𝑋𝑟 /𝑁𝑟 , where 𝑋𝑟 is the number of individuals starting to use cannabis within
the one to twelve month interval before assessment at age 21. 𝑁𝑟 is all persons who had not
started using cannabis before that interval, stratified by cannabis policy. Estimates described in
this report are not readily available in R-DAS. The estimated prevalence rates (𝑝1 = 𝑋𝑟 /𝑁, where
𝑁 is the total projected population size) and the estimated proportion of the population at risk
(𝑝2 = 𝑁𝑟 /𝑁), with the corresponding standard errors can be obtained. I note that the incidence
can be calculated in term of 𝑝1 and 𝑝2 as:
                                                 47


                                              𝑝1 𝑋𝑟 Τ𝑁
                                        𝜓=       =         .
                                              𝑝2 𝑁𝑟 Τ𝑁
         The corresponding variance can be calculated using the standard statistical procedures
as:
                                          1              𝑝12
                              𝑉𝑎𝑟(𝜓) = 2 𝑉𝑎𝑟(𝑝2 ) + 4 𝑉𝑎𝑟(𝑝1 ).
                                         𝑝2              𝑝2
         Furthermore, I discovered that R-DAS estimates can often be produced for the entire
population of interest (e.g., age-specific cannabis incidence over all 50 states), and for a sub-
population that includes a relatively large, unweighted numerator and denominator (e.g., first-
time cannabis use in every state except Colorado and Washington). Nevertheless, estimates for
the other subpopulation (e.g., age-specific cannabis incidence in Colorado or Washington) may
often be suppressed due to privacy concerns. In the instance when two sub-populations can be
considered mutually exclusive, a method for estimating the suppressed output “by hand” was
used (Vsevolozhskaya & Anthony, 2014). Specifically, if I let 𝜓 be the incidence of cannabis use
in all 50 states, and 𝜓𝑁𝐶𝑊 be the incidence of cannabis use in every state except Colorado and
Washington, I can estimate the suppressed output as:
                                𝜓𝐶𝑊 = 𝑁−𝑁𝑁
                                                      𝑁𝑁𝐶𝑊
                                                𝜓 − 𝑁−𝑁      𝜓𝑁𝐶𝑊 ,
                                            𝑁𝐶𝑊          𝑁𝐶𝑊
         Where   𝑁  is the projected population size in all 50 states and 𝑁𝑁𝐶𝑊 is the projected
population size in every state except Colorado and Washington, then the corresponding
variances can be calculated as:
                                           2
                                    𝑁                        𝑁𝑁𝐶𝑊 2
               𝑉𝑎𝑟(𝜓𝐶𝑊 ) = ቆ              ቇ 𝑉𝑎𝑟(𝜓) − ቆ               ቇ 𝑉𝑎𝑟(𝜓𝑁𝐶𝑊 )
                               𝑁 − 𝑁𝑁𝐶𝑊                   𝑁 − 𝑁𝑁𝐶𝑊
                                                48


3.4.3 Study design
         The RCL policies that were implemented by Colorado and Washington State in 2014 were
largely modelled after the state’s own policies regarding alcohol sales. The age of onset
distribution for alcohol has always been different from that of illegal drugs, characterized by the
same peak of incidence in adolescence as the other drugs, but with an additional peak in
incidence at age 21. A hypothesis for this difference in patterns was offered by Cheng and
colleagues that involves distinct sub-group variation in the population with one group willing to
try a drug even though it is illegal for them, and another that waits until it is legal (at age 21) to
try (2018). I hypothesized that setting the legal minimum age for cannabis purchase at 21 in
Colorado and Washington State would cause a distinct shift in the age of onset distribution from
one that had only one peak in the adolescent period to one with a second peak at age 21. This
shift would be consistent with the theory that a distinct sub-group of the population was
interested in using cannabis but would wait until their 21st birthday to do so. The implication
being that a subset of the population only tries to use cannabis if it is legal to do so.
         To detect this change in the shape of the age of onset distribution, I first looked at the
raw incidence rates using two approaches. The first approach is a panel study with sample
restriction to participants in the birth cohort born in either 1995 or 1996, successively re-sampled
to secure a new sample each year. Using a cohort born in the years 1995 or 1996 allows for
tracking the population experience of persons born in these years whose adolescent period
occurred prior to the RCL implementation but turned 21 after cannabis had been legalized.
Although the panel approach has constrained statistical power, given its focus on that one birth
cohort, the strength of this design is that it provides an intuitive look at how incidence patterns
changed for this cohort depending on if they lived in Washington or Colorado or any other state.
                                                   49


The second approach is more tightly focused on what happens at age 21. The expectation is
that cannabis incidence at age 21 years in Colorado and Washington will show an increase
versus the relatively stable cannabis incidence at age 21 years in the other 48 states.
         In addition to observing the raw incidence rates, I applied the same event study model
used in aim 2 to detect changes specifically at age 21 between the populations of states that
legalized cannabis and those that did not. This third method goes beyond simply looking at the
Colorado and Washington test cases and uses data in the same way that aim 2 was approached.
Again, using data from six year-pairs in the analysis, not 12 individual years, I categorized states
into different analysis groups according to each state’s year of RCL through 2018. Because the
2018-2019 year-pair is the most recent available data in R-DAS at the time of analysis, states
that legalized cannabis in 2019 or later were categorized into the illegal group. Washington and
Colorado were included in the 2012 group; Oregon, Alaska, and Washington D.C. were in the
2014 group; California, Maine, Massachusetts, and Nevada were included in the 2016 group;
and Vermont and Michigan were included in the 2018 group. All other states were categorized
into the same illegal cannabis group for this analysis.
         The study design for aim 3 observes changes in annual cannabis incidence specifically
at age 21 in the RCL states relative to non-RCL states before and after the legalization of
cannabis at the state level. Again, I used an event-study model that allowed me to estimate
incidence (or other outcomes) in each period relative to legalization while controlling for fixed
differences across states and national trends over time. All analyses were performed in SAS
version 9.04 and use NSDUH survey weights.
                                                50


                                             4. Results
4.1 Aim 1
4.1.1 Descriptive statistics
         Although there are 3142 counties in total in the U.S. at this time, this analysis included
3094 counties in 366 sub-state regions defined by the NSDUH 2010-2012 small area estimates.
The three sub-state regions and 39 counties in Washington state were not used in this analysis,
Washington residents had voted to legalize cannabis in 2012 but were not allowed local authority
over the issue. Because of this, these counties could not be considered an adequate exposure
or control group and were not included in this analysis. Additionally, nine counties in North
Carolina could not be used due to discrepancies in documentation. RCL counties included 42
of the 64 counties in Colorado, 21 of the 36 counties in Oregon, and all 29 counties in Alaska. In
Colorado and Oregon, a similar proportion of the counties (34% and 42%, respectively) opted
out of legal cannabis sales using local government mechanisms. Nine municipalities banned
retail cannabis sales in Alaska, but there were no county-level bans as in Colorado and Oregon.
         Sociodemographics, political affiliations, and mental health and drug use prevalences of
the counties by policy exposure are presented in table 2. In counties where the sale of cannabis
was legal in 2014 there is a slightly higher proportion of males. These countries tend to have
more people falling into the 18–64-year-old age range and less under 18 and over 65. These
counties are less racially diverse with a higher proportion of white citizens and less black citizens,
although counties where recreational cannabis was sold legally have a higher proportion of
American Indians and Native Alaskans. Surprisingly, the RCL counties have a higher proportion
of republican than democrat voters, at least according to the 2012 presidential race. The use of
alcohol, cannabis, and cocaine was more prevalent in RCL counties, as were alcohol use and
                                                  51


other substance use disorders, and serious mental illnesses.
Table 2. Sociodemographic and political compositions and prevalences of mental illness and
drug use in counties that allowed for the sale of recreational cannabis and those that did not.
                                                   Sale of cannabis not   Sale of cannabis is
                                                      legal (n=3011)         legal (n=92)
    Variable
    Gender
                  Male                                    49.1%                 49.9%
                  Female                                  50.9%                 50.1%
    Age
                  Under 18 years                          24.0%                 23.8%
                  18 to 34 years                          23.2%                 24.1%
                  35 to 64 years                          39.6%                 40.6%
                  65 and over                             13.1%                 11.5%
    Race
                  White                                   63.3%                 72.6%
                  Black or African American               12.7%                  3.1%
                  Hispanic or Latino                      16.5%                 15.8%
                  American Indian and Alaska
                  Native                                   0.7%                  1.9%
                  Asian                                    4.7%                  3.5%
                  Native Hawaiian and Other
                  Pacific Islander                         0.1%                  0.3%
                  Some Other Race                          0.2%                  0.2%
                  Two or More Races                        1.9%                  2.7%
    Political party
                  Voted for the republican
                  nominee in 2012 presidential            38.1%                 50.0%
                  race
                  Voted for the democratic
                  nominee in 2012 presidential            59.9%                 46.5%
                  race
    Mental Health
                  Past-year prevalence of
                                                           4.3%                  4.5%
                  serious mental illness a
                  Past-year prevalence of
                                                           3.9%                  4.0%
                  suicidal thoughts a
    Substance use
                  Past-month alcohol use
                                                          48.8%                 54.5%
                  prevalence b
                                                52


                                         Table 2 (cont’d)
                 Past-month cigarette use
                                                          25.2%               24.8%
                 prevalence b
                 Past-month cannabis use
                                                          6.0%                10.9%
                 prevalence b
                 Past-year alcohol use
                                                          6.6%                 8.0%
                 disorder prevalence b
                 Past-year cocaine use
                                                          1.4%                 2.0%
                 prevalence b
                 Past-year substance use
                                                          1.7%                 1.9%
                 disorder prevalence b
    Footnotes
    a
      Prevalences of mental illnesses for individuals aged 18+ as sampled by the NSDUH
    b
      Prevalences of substance use for individuals aged 12+ as sampled by the NSDUH
4.1.2 Predictive model
        Figure 5 shows the ROC curves of 1000 iterations of logistic modeling in grey, with their
average profile in red. In general, our model demonstrates a high degree of discrimination with
an average area under the ROC curve (AUC) of 0.94.
                                                 53


Figure 5. ROC curves of 1000 predictions of county-level legal cannabis sales in 2014 and the
ensemble average.
         The drivers of the model’s predictive power are represented in table 3 by their median Z
scores over all iterations. By this metric, the most powerful predictor is past-month cannabis
use, followed by past-year cocaine use, and serious mental illness. Although not shown
explicitly, the negative predictive power of voting for the Democratic or Republican candidate is
relative to the percentage of votes for a third party, therefore a higher prevalence of voting for a
different political party was positively predictive of legalizing cannabis sales.
                                                  54


Table 3. Predictors for legal cannabis sales in 2014 as represented by median z score over
1000 model iterations.
  Variables                                                              Median z score
  Past month cannabis use prevalence b                                         2.871
  Past year cocaine use prevalence b                                           2.239
  Past year prevalence of serious mental illness a                             1.351
  Land area                                                                    1.342
  Proportion of votes for a 3rd party candidate in
  2012 presidential race                                                       0.912
  Proportion of votes for the republican candidate in
  2012 presidential race                                                       -1.203
  Past month cigarette use prevalence b                                        -0.829
  Census principal component 2                                                 -0.802
  Census principal component 1                                                 -0.715
  Past month alcohol use prevalence b                                          0.618
  Past year alcohol use disorder prevalence b                                  0.536
  Past year substance use disorder prevalence b                                -0.481
  Area water                                                                   0.467
  Past year prevalence of suicidal thoughts a                                  -0.326
  Footnotes
  a
    Prevalences of mental illnesses for individuals aged 18+ as sampled by the NSDUH
  b
    Prevalences of substance use for individuals aged 12+ as sampled by the NSDUH
4.1.3 County-level predictions
         By averaging the weighted expected probabilities derived from 1000 logistic regression
models, our ensemble model produced probabilities of legalizing cannabis sales in 2014 for
every county. Each probability was weighted by the classification accuracy of the model from
which they were derived and averaged across each iteration in which the county appeared.
Figure 6 shows every county in the U.S. (except those in Washington state and North Carolina
that could not be included) categorized by its percentile in the ensemble predicted probabilities
of RCL.
                                                      55


Figure 6. Ensemble produced county-level probability of allowing recreational cannabis sales in
2014.
        When using accuracy weighted probabilities and a binary cut-off of 0.333, the ensemble
correctly categorized all 92 of the counties with legal cannabis and 2721 of the 3002 counties
that did not legalize cannabis. For incorrect classifications, 281 counties were predicted to
legalize cannabis that did not by 2014 while no counties with legal cannabis sales were
incorrectly classified.
        Figure 7 compares the actual cannabis policy landscape by county in the U.S. in 2014 to
our predicted policy landscape. The figure demonstrates how and where the model performs
best and where it lacks specificity as demonstrated by the false positive clusters. Some false
                                               56


positives appear in the states that legalized cannabis (Oregon and Colorado), but also in Arizona,
California, Washington D.C., Florida, Hawaii, Idaho, Maine, Massachusetts, Michigan,
Minnesota, Montana, Nevada, New Hampshire, New Mexico, New York, Rhode Island, Texas,
Vermont, and Wyoming.
Figure 7. Actual cannabis policies by county, 2014 compared to predicted policy outcomes.
4.1.4 Sensitivity analyses
        As previously explained, the sensitivity analyses included different combinations of
coding schemes for Washington and Alaska and testing different cut-offs for county-level
predictions. Under these different coding schemes, the area under the curve for the ensemble
predicted probabilities varies little (0.89 - 0.95 compared to 0.94) (Figure 8). Including the
Washington counties performed best while excluding Alaska and Washington counties
performed worst (Figure 8). Additionally, including the Washington counties also produced the
greatest specificity and sensitivity at the same cut-offs (Table 4).
                                                 57


Figure 8. ROC curves of 1000 models and average profiles for each possibility of coding the
outcome in sensitivity analyses.
        To this point, we have demonstrated the ensemble results of 1000 expected probabilities
weighted by each logistic model’s classification accuracy. The sensitivity analysis also considers
naively averaging the expected probabilities unweighted or averaging the 1000 hard predictions
made at a threshold of 0.333 (i.e., the prevalence of RCL in each iteration) instead. We also
explored alternative weights derived from quality metrics other than classification accuracy, such
as true positive fraction (i.e., sensitivity), true negative fraction (i.e., specificity), and the false
negative fraction. Figure 4 shows the frequency of RCL and non-RCL counties (y-axis) against
                                                   58


the probabilities predicted by ensembles of various sorts (x-axis).
Figure 9. Distinguishing power of ensemble predictions (weighted or naïve average).
        An effective ensemble prediction should assign distinct probabilities to RCL versus non-
RCL counties and separate the predicted probabilities that fall below and above a cut-off that
was determined pre-hoc. In this case, we chose 0.333 as a liberal cut-off to capture all RCL
counties (in favor of sensitivity) and 1 – 0.333 as a conservative cut-off to avoid excessive false
positives (in favor of specificity). The default prediction accuracy weighted ensemble (Figure 9,
top-left) worked as intended since the vast majority of non-RCL counties (red) were given
probabilities lower than the threshold at 0.333 (blue dash). In contrast, most RCL counties (teal)
were given probabilities higher than the same threshold, thus providing a reasonable balance
between sensitivity and specificity.
        If we took the conservative threshold of 0.667=1-0.333 instead (red dash), nearly all non-
RCL counites (red) were assigned a probability lower than 0.667 along with a large portion of
RCL counties (teal), thus achieving a near 100% specificity at the expense of sensitivity. By the
                                                  59


same standards, the true negative fraction weighted ensemble (TNF, Figure 9 bottom-left), and
the ensemble formed by naïvely averaging the 1000 expected probabilities (PHT, Figure 9 top-
right) also worked at the threshold of 0.333 (blue dash), with the latter seemingly achieved
optimal balance between sensitivity and specificity.
         In contrast, the false negative fraction (FNF, Figure 9 top-middle) weighted ensemble was
unacceptable due to largely overlapping probabilities assigned to both types of counties and the
inability to clearly separate the two groups by the pre-defined threshold at 0.333 (blue dash); as
for the conservative threshold of 0.667 (red dash), the 100% specificity could only be achieved
by labeling all counties as non-RCL, essentially sacrificing sensitivity entirely. The true positive
fraction (TPF, Figure 9 bottom-middle) weighted ensemble and the unweighted average of hard-
predictions (YHT, Figure 9 bottom-right) were mediocre because they did not separate the two
types of counties at 0.333 (blue dash) as clearly as accuracy weighted ensemble (ACC, Figure 9
top-left).
         Finally, table 4 shows the sensitivity and specificity at all cut-offs between .1 and .9 for
each type of ensemble where Washington counties were included, and where they were
excluded.
Table 4. Sensitivity and Specificity of Models Using Various Weighting Techniques and Hard
Cut-off Values.
                                   Excluding Washington Counties
                                           Weighting Schema
                        PHT              YHT            ACC             TPF           TNF
       Cut-offs     Sens Spec       Sens Spec Sens Spec Sens Spec Sens Spec
       0.1          1.00     0.78   1.00 0.86 1.00 0.80 1.00 0.82 1.00 0.79
       0.2          1.00     0.86   1.00 0.89 1.00 0.87 1.00 0.88 1.00 0.87
       0.3          1.00     0.89   1.00 0.91 1.00 0.90 1.00 0.91 1.00 0.90
       0.4          1.00     0.91   0.95 0.93 0.98 0.92 0.93 0.93 0.99 0.92
       0.5          0.96     0.93   0.87 0.95 0.88 0.94 0.84 0.95 0.90 0.94
                                                    60


                                        Table 4 (cont’d)
      0.6         0.86     0.95 0.79 0.96 0.78 0.96 0.66 0.97 0.83 0.95
      0.7         0.76     0.96 0.72 0.97 0.61 0.98 0.34 0.99 0.71 0.97
      0.8         0.60     0.98 0.54 0.98 0.29 0.99 0.00 1.00 0.46 0.99
      0.9         0.30     0.99 0.33 0.99 0.00 0.97 0.00 0.97 0.20 1.00
                      PHT           YHT             ACC           TPF            TNF
      Cut-offs    Sens Spec Sens Spec Sens Spec Sens Spec Sens Spec
      0.1         1.00     0.76 0.99 0.87 1.00 0.78 1.00 0.80 1.00 0.77
      0.2         1.00     0.85 0.98 0.90 1.00 0.87 0.99 0.88 1.00 0.86
      0.3         0.99     0.89 0.98 0.91 0.98 0.89 0.98 0.90 0.98 0.89
      0.4         0.98     0.91 0.93 0.93 0.97 0.92 0.94 0.93 0.98 0.91
      0.5         0.96     0.92 0.88 0.95 0.93 0.94 0.82 0.96 0.93 0.93
      0.6         0.88     0.95 0.75 0.96 0.75 0.96 0.58 0.98 0.83 0.96
      0.7         0.74     0.96 0.64 0.97 0.52 0.98 0.26 0.99 0.63 0.97
      0.8         0.51     0.98 0.48 0.98 0.29 0.99 0.00 0.96 0.40 0.99
      0.9         0.33     0.99 0.34 0.99 0.00 0.96 0.00 0.96 0.13 1.00
      PHT – Unweighted average of expected probability
      YHT – Unweighted average of hard predictions with a cut-off value of 0.333
      ACC – Weighted by accuracy (used in main results)
      TPF – Weighted by true positive fraction (sensitivity)
      TNF – Weighted by true negative fraction (specificity)
4.2 Aim 2
4.2.1 Descriptive statistics
        This study included 819,543 respondents from the NSDUH surveys between the years
2008 and 2019. The unweighted sample is 48% female, 60% White, 13% Black, 18% Hispanic,
2% Native American, 4% Asian, and 4% of more than one race or another race or ethnicity (Table
4). 11% used cannabis in the past month, and 3% qualified for past-year cannabis abuse or
dependence. Table 5 provides the total unweighted sample characteristics as derived from the
NSDUH Public Data Analysis System.
                                               61


Table 5. Characteristics of the U.S. Population Under Study. Data from the U.S. National
Surveys on Drug Use and Health.
                 Gender                                        %         n
                 Female                                      47.8%    322636
                 Male                                        52.2%    351885
                 Race
                 White                                       59.9%    404314
                 Black                                       12.8%    86272
                 Native American                              1.5%    10095
                                                              0.5%     3380
                 Native Hawaiian / Other Pacific Islander
                 Asian                                        4.1%    27907
                 More than one race                           3.6%    24301
                 Hispanic                                    17.5%    118252
                 Age
                 12-17 Years Old                             28.1%    189789
                 18-25 Years Old                             29.0%    195650
                 26-34 Years Old                             12.7%    86000
                 35 or Older                                 30.1%    203082
                 Past-month cannabis use prevalence
                 Did not use in the past month               88.7%    597984
                 Used within the past month                  11.3%    76537
                 Past-year cannabis abuse or
                 dependence
                         No/Unknown                          97.1%    654930
                         Yes                                  2.9%    19591
                 Unweighted Sample Total                    100.0% 674521
       Figures A.2 – A.6 show various combinations of the past-year cannabis use incidence for
those aged 21 and older by state legal category. Upon visual inspection, the parallel lines
assumption and assumption of no anticipation look to be met in every comparison of groups.
The past-year cannabis use incidence ranges from as low as .25% for the illegal states in years
                                               62


2008 and 2009 to over 2.5% in the states that legalized cannabis in 2014 (Oregon, Alaska, and
the District of Columbia) in years 2018 and 2019.
4.2.2 Event Study Findings
        Figures 10 and 11 show the primary findings for individuals aged 21 and older (Figure 10)
and those between the ages of 12 and 20 (Figure 11). For those who were legally able to
purchase cannabis (21 and older), the legalization of cannabis is estimated to have had no effect
on newly incident cannabis use in the year that the legislation passed. However, between two
and four years after legalization, RCLs were estimated to have increased incidence by 0.6%
[95% Confidence Interval (CI) = 0.1, 1.0]. The corresponding estimate for the interval four to
seven years after passage of the RCL is 1.3% [0.8, 1.8] (Figure 10). For the 12-to-20-year-olds,
the estimated cannabis incidence does not vary appreciably in any period (Figure 11).
                                                63


Figure 10. Estimated effect of time since cannabis legalization on cannabis incidence in the 21
and older age group with 95% confidence intervals.
                                              64


Figure 11. Estimated effect of time since legalization on incidence in those aged 12 to 20 with
95% confidence intervals.
4.2.3 DiD Findings
        When including the total time post-legalization, the simple ATT estimate derived from the
2x2 DiD indicates no substantial differences in cannabis incidence before and after the laws were
passed (p=0.12). However, since we expected no effect before cannabis sales became effective,
we estimated a separate ATT for two years of legalization and later in the 21+ age group as 0.7%
(p=0.003, [0.3, 1.1]). The estimated average treatment effects for those aged 12 to 20 years
indicated no differences after the legalization date (p=0.27) or the effective date (p=.53).
                                                 65


4.2.4 Alternative Specifications and Robustness Checks
        In our first alternate specification, we estimate that the effect of cannabis legalization
increased the prevalence of cannabis use in the past month among those aged 21 and older by
3.2% between two and four years after legalization (p=0.0005, [1.6, 4.7]) (Figure S 7). The
corresponding estimate for the interval four to seven years after legalization is 4.3% (p=0.0002,
[2.3, 6.2]) (Figure S 7). In the 12-to-20-year-old age group, no appreciable variation in estimated
cannabis use prevalence is seen across these study intervals (P=0.39 and 0.33, respectively)
(Figure S 8).
        In the time placebo analysis based upon a randomized legalization date, the date of
placebo legalization was set to the year 2011 for all the states that legalized cannabis through
2018. Figure S 9 shows an estimated coefficient that does increase slightly over time, yet the
estimated effect of this 'placebo' policy change is null. Note especially that for the adolescents
(<21 years), the coefficients are distributed more or less at random in relation to the zero value,
with no appreciable differences or patterns (Figure S 10).
4.3 Aim 3
4.3.1 Panel study approach
        Figure 12 shows cannabis incidence estimates based on the panel study approach
restricted to the 1995-96 birth cohorts, with state contrasts based on RCL policies. The red lines
show a relatively stable incidence by age in the 48 states where cannabis remains illegal for all
ages with the highest incidence period occurring as expected between ages 16 and 18. As
expected, incidence decreases in each successive year after this peak. In Colorado and
Washington, incidence is higher at each age and also shows the expected peak of incidence in
                                                  66


adolescence. At age 21, we have the hypothesized increase in incidence at that age, however
this sample does not have sufficient power and the confidence intervals overlap (Figure 12).
Figure 12. Trends in past-year cannabis incidence by age in Colorado and Washington vs. all
other states in the US, 2010-2017.
4.3.2 Stratification at age 21
        Figure 13 shows RCL-stratified year-pair-specific estimated cannabis incidence,
focusing on the NSDUH participants assessed at age 21. The mean cannabis incidence for non-
RCL states (red) is relatively stable at about 5% becoming new users. For Colorado and
Washington State, the corresponding estimate is close to the estimate for the other states until
                                              67


after 2014-2015; the estimate for 2016-17 is just above 20%. Here, again, the statistical precision
of estimates based on the R-DAS datasets is constrained, however the point estimate has
increased fourfold (Figure 13).
Figure 13. Trends in past-year cannabis incidence for 21 year-olds in Colorado and
Washington vs. all other states in the US, 2010-2017.
        The striking increase in the raw incidence estimate for 21 year-olds in Colorado and
Washington justified further and more sophisticated analysis. Figure 14 shows the estimated
effects when adding more states to the analysis, observing the outcome relative to each state’s
year of RCL, and doing so in the event study framework. The coefficients do not deviate from
                                                68


zero prior to legalization which supports the assumption of parallel trends that we also see in
figure 13. Two years after the passage of the RCLs, incidence increased 4.3% relative to the
change in the non-RCL differences at baseline (figure 14). After four years, incidence increased
by almost 9% in the RCL states relative to non-RCL state differences at baseline (figure 14). At
age 21, the average incidence in non-RCL states was 5.4% and 6.7% in RCL states. Six years
after the passage of the RCLs, incidence at age 21 in RCL states increased to 18.1% while it
increased to only 6.9% in non-RCL states.
        Figure 14. Estimated effect of time since cannabis legalization on cannabis incidence at
age 21 with 95% confidence intervals.
                                                69


                                            5. Discussion
5.1 Aim 1
        To my knowledge, this is the first predictive model of any policy change in the United
States that uses publicly available data. With the use of similar methods and different machine
learning algorithms, the results should be reproducible. While there are no accepted thresholds
for policy prediction models, an AUC, sensitivity, specificity, and classification accuracy above
90% would be considered a very successful model in any prediction domain. Further, I find that
the model is not sensitive to different outcome classifications as all the coding schemes
produced similar results.
        This research confirms the importance of prior cannabis use as the single most important
predictor of cannabis policy liberalization. It extends this finding from the individual-level, where
past cannabis use predicts support for cannabis liberalization, to the county-level, where a higher
prevalence of cannabis use is predictive of future policy change. The findings that serious mental
illness and cocaine use prevalences were largely predictive have not been reported before.
Because I do not have a measure of cannabis use frequency in the model, I speculate that these
associations may be indicative of heavier cannabis use in those areas. The significance of the
county land area is likely an artifact of early cannabis reform occurring in the west where counties
tend to be larger than the east, so is unlikely to have predictive utility in analyses of later time
periods. If the definition of the outcome variable were expanded to include the more recent areas
that have legalized cannabis in the Northeast, I expect this variable’s importance would
decrease. However, if it did not, it is not impossible to imagine that it could fit into the causal
puzzle. Perhaps having more land facilitates growing cannabis, and this distal factor manifests
in our legalization outcome.
                                                   70


        Although sociodemographics such as age and race are included in the census principal
components used as predictors, I cannot conclude what the individual effects of different age
and race distributions may be on RCL in this sample. However, I can assume with some
confidence that age and race differences by county are controlled for by including the census
principal components. Lastly, I found that after controlling for the other covariates in the model,
political affiliation was not as important as past drug use in predicting RCL. Furthermore,
democratic affiliation was not even positively associated with RCL; rather, it was voting for a
third-party candidate that was positively correlated with RCL.
        While the prediction algorithm produced a fair number of false positives, savvy readers
will note that many of the states in which the false positives appear have legalized cannabis since
2014. The model may be improved with an expanded definition of legalization which would
include more exposed counties and produce a more balanced dataset for analysis.
        This study has several limitations: 1) Reliance on self-reported mental health and drug
use are subject to social desirability bias; however, there is no other nationally representative
and publicly available dataset on drug use and mental health from which I could derive sub-state
estimates. 2) I was not able to include all variables from the NSDUH that would be theoretically
predictive of cannabis liberalization (e.g., perceived harmfulness of cannabis, past
criminalization, etc.) as this data were unavailable in the NSDUH small area estimates. 3) Election
data in Alaska and mental health and drug use estimates in Massachusetts and Connecticut are
not county-specific. I attempted to mitigate the effect of this missing data by imputing state-level
averages of these variables for all counties in their respective states. 4) I was unable to match
some counties in the North Carolina NSDUH small area estimates with the other sources of data.
Because this was a small number of counties, I decided it was best to leave these counties out
                                                  71


rather than impute state-level averages. 5) One small county in Hawaii (population <100) has no
voter data. Because this was such a small county, I left it out of this analysis as well. 6) Lastly,
county-level policies regarding recreational cannabis can change over time. Little documentation
exists on changes in local cannabis policies, let alone on how they may change over time and
why. A database of county-level cannabis policies that document changes over time would be
of substantial value to this field.
        Notwithstanding these limitations, this study shows that the legalization of recreational
cannabis sales is a relatively predictable phenomenon using publicly available data that precede
the policy change by two to four years. This conclusion was achieved as a result of the several
strengths of this work which include: 1) The novelty of the design which could be used to predict
other state or county-level policy changes; 2) using only publicly available data which allows the
results to be replicated, and 3) having used nationally representative data, I were not limited to
comparing counties within states, rather, I were able to predict county-level changes throughout
the U.S.
        Future research can build on these findings in several ways. Primarily, given the
increasing interest in modelling the effects of policy, extensions of this method can be used to
establish covariates for use in quasi-experimental methods. The causal interpretation of these
quasi-experimental designs is only allowed when the covariate structures between areas can be
considered balanced. Specific methods of interest include creating synthetic controls,
propensity score matching, and difference-in-differences (Ben-Michael, Feller, & Rothstein,
2021; Imbens & Rubin, 2015; Roth, 2020; Wooldridge, 2021). Second, updating the model with
the 2020 census data and more recent voter and NSDUH data may be used to predict the next
wave of states and counties which legalize recreational cannabis. In this sample, the prediction
                                                72


was 90% accurate for events that occurred between two and four years out. A larger sample
may be more accurate with longer time horizons. Finally, decomposing the demographic
information in the census may prove to be beneficial for the prediction model. It would allow for
the replicability of earlier findings that race and age can be highly predictive of cannabis
legalization.
5.2 Aim 2
         These results show consistent evidence of an increase in past-year cannabis use
incidence among those for whom cannabis became legal, but not for those for whom cannabis
remained illegal because of their age. In the simple 2x2 DiD model, I estimate an average increase
in past-year cannabis use incidence of 0.7 percentage points after recreational cannabis began
being legally sold through the year 2019.
         To understand the magnitude of these changes, I find it best to compare these changes
in annual incidence to the raw incidence rates estimated by the NSDUH (figures S2-S6). Between
2008 and 2019, the overall estimate of past-year cannabis incidence in the 21 and older age
group, independent of the state, was estimated to be just 0.5%. Thus, an increase in the
incidence of 0.6% is more than double the proportion of new cannabis users in this age group.
         This study advances our understanding of the effects of RCLs in several important ways.
First, this is the first study of which I am aware that examines the effects of RCLs on cannabis
use incidence. Prior studies on the associations between liberalizing cannabis policy and
cannabis use epidemiology focused on past-month cannabis use prevalence (Gruber et al.,
2016; Reed, 2016; Cerdá et al., 2017; Dilley et al., 2019; Kerr, Lui, & Ye, 2018; Everson et al.,
2019; Martins et al., 2021; Reed, 2021; Paschall, García-Ramírez, & Grube, 2021), the prevalence
                                                 73


of daily or frequent users (Everson et al., 2019; Coley et al., 2021; Martins et al., 2021), and
prevalence of CUD (Cerdá et al., 2020; Martins et al., 2021). The importance of understanding
changes in cannabis use incidence in response to RCLs cannot be overstated. Prevalence of
use and dependence syndromes and frequency of use are of great public health importance.
Yet, they tell us nothing about whether new users are entering into the population of cannabis
users. This study produces robust evidence that the legalization of cannabis increases the
number of cannabis users entering the cannabis-using population where it becomes legal to sell
cannabis.
        Second, this is the first study of which I am aware that has examined the heterogeneity
of treatment effects in the years following RCL. This event study DiD design allows for the
estimation of effects by years relative to the passage of the RCL and the effective dates of
implementation. This method has yielded three important pieces of evidence: 1.) That the effect
of cannabis legalization estimated here is dynamic; 2.) that the trends of the estimates are
different by age group, and 3.) the estimated effects of legalization have been an increase in the
incidence and prevalence among those for whom cannabis has become legal.
        Third, the use of a quasi-experimental DiD design allows for the causal interpretation of
the estimated effects of RCL. Except for a few studies (Cerdá et al., 2017; Coley et al., 2021),
the evidence produced in this literature has relied mostly on controlling for observed variables
between the populations. The differences between states and populations that have legalized
cannabis and those that have not are so vast that I question whether controlling for any number
of observed variables is likely to produce valid estimates. The design I used allows for the control
of unobserved variables if a few assumptions are met. Additionally, I have produced evidence
and judged that the assumptions of no anticipation and parallel trends hold in this case.
                                                 74


         Although the estimates of incidence cannot be compared to the findings of other studies
as these are the first of their kind, I used the same method to estimate the effect of RCL on
prevalence in the two age groups as has been done many times. The two studies that used
quasi-experimental methods to estimate changes in cannabis use prevalence post-RCLs also
reported null findings among adolescents (Cerdá et al., 2017; Coley et al., 2021). Although no
quasi-experimental methods have been used to estimate changes in cannabis use prevalence
in the adult population, the findings on prevalence are in line with those reported by Cerdá et al.,
Martins et al., and Reed’s more recent findings (2020; 2021; 2021). Perhaps more importantly,
the findings also support the seemingly conflicting earlier null findings in this age group (Reed,
2016; Kerr, Lui, & Ye, 2018; Smart and Pacula, 2019), and support a narrative that the increases
in the use of cannabis (among both new users and existing) in the adult age group only began
increasing after a few years when recreational cannabis shops began sales.
         The strengths of this work are the robustness of the estimates, the novelty of the design
in this space, and the interpretations that the design allow. The estimates of the effects of RCL
on cannabis use incidence by age group were robust to both the check of face validity using the
same method to estimate past-month prevalence, as well as the alternate specification using a
time-placebo analysis. The use of a DiD event study design moves this field forward by allowing
for a dynamic estimate of the causal effect of RCL on the outcome of choice. As I have
demonstrated, it is not reasonable to assume that the effect of cannabis legalization is
homogenous over time, especially not if the period includes the time before cannabis sales
began. Therefore, future research on the effects of RCL should allow for effect heterogeneity.
Although this is only one study, from which conclusions should not be drawn, this design allows
for a visualization of the policy lag effect, about which much has been written (Cheng et al., 2019;
                                                   75


Hall & Weier, 2015). We see that the effect estimates are not linear and are beginning to take a
sigmoid shape with the increase in cannabis use incidence and prevalence beginning to plateau,
although more data is needed to confirm the trend.
        Some limitations of this work include the self-report nature of the data, differing legal
statuses of the drug under study, the sensitivity of the findings to different definitions of the study
period, and an inability to control for sub-state level RCL. First-time cannabis use between one
and twelve months ago was self-reported, leading to potential recall bias. However, the data
collection method has been validated in previous methodological studies. The legality of
cannabis use is, of course, different between the states in this sample which makes the outcome
subject to some differential response bias; however, the assessment was conducted using
confidential standardized audio computer-assisted self-interview modules which have been
shown to reduce this bias.
        The limitation regarding the definition of the study period is important, specifically to the
estimate of the ATT. When including the two years immediately after legalization (before sales
began) in the treatment period, the estimated effect is small. However, using a study design that
allows for dynamic treatment effects and having estimates that are robust to alternate
specifications may make this more of a feature of the study than a limitation. Given that the time-
specific estimate of the causal effect of RCL in this two-year window is not appreciably different,
these two data points combine to form a strong argument that the effect of cannabis legalization
is driven by the opening of outlets where recreational cannabis is sold.
        Another limitation of this work at the state level is that many counties and municipalities
within states that have legalized recreational cannabis have chosen to ban the sale or cultivation
                                                 76


of cannabis within that sub-state area. For example, in Washington State, 15% of counties and
55% of municipalities have prohibited the sale of cannabis (MRSC, 2019) while in California, 69%
of counties and 70% of cities prohibit the sale (Staggs et al., 2019). Similar to the null finding
between the date of legalization and effective dates of cannabis sales, it is likely that the estimate
of the effect of RCL at the state level is diminished by incorporating incidence for many
individuals who reside in areas where recreational cannabis is effectively in this pre-
implementation state. This sub-group heterogeneity is averaged out in the state-level estimates.
While a county-specific analysis is beyond the scope of this study, future research should seek
to replicate this analysis at the county level.
        This study adds to the significant literature on the public health puzzle that is the effects
of legalizing cannabis in the U.S. by introducing the first estimates of first-time cannabis use and
using a quasi-experimental approach that allows for estimation of time-varying treatment effects.
The strengths of the study include the large, nationally representative samples spanning eleven
years, all U.S. states and Washington D.C., and both age groups of interest to the policy under
study, a survey design that allows for state-specific estimates, and state of the art statistical
methods from the recent econometrics literature on causal inference with staggered policy
implementation. The legalization of cannabis has proven to be a dividing and contentious issue
in the national political landscape with different risks and benefits to all paths forward. Thus,
voters need the best available information on what the potential effects of legalizing cannabis
might be to weigh the empirical evidence with their personal values to make the best decision
for themselves and their neighbors. Likewise, policymakers and public health officials can use
this evidence to plan for the changes that are likely to occur if cannabis is to be legalized in their
districts.
                                                  77


5.3 Aim 3
        In this aim, I built from Cheng and colleagues’ recent alcohol LMA hypothesis and present
evidence that a corresponding pattern can be seen for cannabis when it became legal at the
same age (2018). Given the constraints on the statistical precision of the cannabis incidence
estimates presented in Figures 12 and 13, the event study estimates that used all available data
is a much more convincing portrayal of the hypothesis (Figure 14). Taken together, these
analyses provide some early indication that legalizing a drug with an accompanying legal
minimum age creates a predictable spike in incidence at the specified age. My hope is that, in
turn, these epidemiological estimates of age-specific patterns can be used to guide the
organization and deployment of public health tactics of early outreach and intervention, as well
as prevention initiatives intended to reduce hazards of drug use onsets during adolescence and
the transition to early adulthood.
        Limitations of the research include reliance upon self-reports about age and timing of
cannabis onsets as well as uncontrolled confounding between states in the unadjusted analyses.
Though it does not explicitly control for any potentially confounding factors, the event study
model does provide a more robust point of evidence. Given the large increase at age 21 between
the two types of states, the question becomes, what else could possibly explain this increase?
Indeed, given the specificity of the outcome and our hypothesis grounded in prior theory, a viable
alternate explanation is hard to imagine.
        One of my committee members suggested that the evidence from both aims 2 and 3
might be strengthened by additional statistical controls, such as a propensity scoring approach
or adding known confounders to the event study models. This work would add even more
                                                  78


robustness to the evidence by taking into account known confounders between the policy
conditions. Although the exploration of these potential alternative models was not completed as
part of my dissertation, these elaborations constitute lines of potential research that can be
undertaken in the future.
         Notwithstanding these limitations, these analyses demonstrate the potential for a large
shift in long-standing patterns in the age of first use for cannabis in the U.S. Additionally, the
study findings are of interest because the hypothesis that LMA may be shaping age-specific
drug use incidence has never been tested. The lag time for seeing such policy effects has been
estimated to take five to ten years if cannabis follows the experience with alcohol legal minimum
age in the U.S. (Cheng et al., 2019). Our event study results suggest that four may be sufficient.
         If this pattern continues to develop, there are new public health considerations for this
age group as well as the design and implementation of cannabis prevention campaigns.
Targeted prevention campaigns for alcohol and tobacco use have been one of the larger
successes of public health and prevention, partly due to age-specific and appropriately timed
targeting (Haggerty & Mrazek, 1994; Nation et al., 2003). In a deviation from the traditional
perspective that early adolescence is the optimal window for prevention, if incidence at age 21
remains high in the RCL states, public health campaigns that seek to reduce cannabis use may
be optimized in separate approaches for the law-ignoring teens who first use cannabis illegally
vs. the 21-year-olds who wait until cannabis use is legal for them. Although more research is
needed to investigate the theorized policy-induced curve, if a sufficient number of states legalize
recreational cannabis, or if a federal legalization occurs, I believe we will likely see age 21 become
a common age for first trying cannabis.
                                                   79


                                               6. Summary
        This dissertation adds to the literature on the laws that govern recreational cannabis use
and its epidemiology in several ways. First, by leveraging publicly available data and machine
learning techniques, I successfully predicted which counties would legalize recreational cannabis
sales in 2014 with high levels of discrimination and revealed the major driving forces of that local
policy change. Second, I produced the first estimates of how recreational cannabis legalization
affects first-time cannabis use among both the legal and under-aged sub-groups. Third, I have
introduced a quasi-experimental approach that allows for estimation of time-varying treatment
effects. Fourth, I established a need for it by showing that the effect coefficients varied over time.
Finally, I proposed and tested the hypothesis that, after legalization, drug use incidence
increases at the legal minimum age.
        Taken together, the work and results of this dissertation suggest that:
        1. Cannabis legalization is a predictable process driven mainly by prior cannabis use.
        2. When implemented, recreational cannabis legalization does not affect adolescents’
        choice to use cannabis for the first time.
        3. However, among those of legal age, the occurrence of newly incident cannabis use
        may increase by two to four times.
                 3.a. I interpret this finding in my hypothesized framework; by legalizing cannabis,
                 a barrier for individuals interested in trying it, but did not for fear of legal
                 consequences, is removed. In epidemiologic terms, this is comparable to disease
                 spreading among a novel population (i.e., the virgin soil hypothesis).
                                                    80


4. The removal of this barrier may lead to an epidemic of first-time cannabis users among
those aged 21 and older in areas where recreational cannabis is legally sold.
5. As more states legalize cannabis, the age-specific cannabis use incidence curve might
begin to resemble that of alcohol, with a peak in the adolescent years, but a larger peak
at age 21.
                                         81


APPENDICES
    82


Appendix A: Supplemental Figures and Tables
                   83


Figure A.1. Percent of variance captured from over 1000 census variables in each principal
component.
                                              84


Figure A.2. Cannabis incidence in the 21 and older age group, first wave legalizing states vs
untreated states.
                                              85


Figure A.3. Cannabis incidence in the 21 and older age group, second wave legalizing states vs
untreated states.
                                              86


Figure A.4. Cannabis incidence in the 21 and older age group, third wave legalizing states vs
untreated states.
                                              87


Figure A.5. Cannabis incidence in the 21 and older age group, first wave legalizing states vs
third wave legalizing states.
                                              88


Figure A.6. Cannabis incidence in the 21 and older age group, second wave legalizing states vs
third wave legalizing states.
                                              89


Figure A.7. Estimated effect of time since cannabis legalization on past-month cannabis
prevalence in the 21 and older age group.
                                               90


Figure A.8. Estimated effect of time since legalization on past-month cannabis prevalence in
the 12 to 20 age group.
                                                91


Figure A.9. Estimated placebo effect of time since cannabis legalization on past-year cannabis
incidence in the 21 and older age group.
                                              92


Figure A.10. Estimated placebo effect of time since cannabis legalization on past-year
cannabis incidence in the 12 to 20 age group.
                                               93


Table A.1. Predictors for legal cannabis sales in 2014 as represented by median z score over
1000 model iterations when proportions of voters are replaced by a binary indicating the party
of the majority.
        Variables                                                            Median z score
        Past month cannabis use prevalence b                                        3.83
        Past year cocaine use prevalence b                                          1.34
        Past year prevalence of serious mental illness a                           -0.08
        Land area                                                                   1.36
        County voted majority democratic in 2012 presidential race                  1.53
        Past month cigarette use prevalence b                                       0.06
        Census principal component 2                                               -0.25
        Census principal component 1                                               -1.04
        Past month alcohol use prevalence b                                         0.25
        Past year alcohol use disorder prevalence b                                 0.80
        Past year substance use disorder prevalence b                              -0.50
        Area water                                                                  0.11
        Past year prevalence of suicidal thoughts a                                -0.25
        Footnotes
        a
          Prevalences of mental illnesses for individuals aged 18+ as sampled by the NSDUH
        b
          Prevalences of substance use for individuals aged 12+ as sampled by the NSDUH
                                                     94


Appendix B: Program Code Used to Derive the Constructed Study
                            95


                                                SAS
/*******************************************************************************************************
* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE validVarName=any;
title1'Dissertation';
title2'Aim 1: Legalization Prediction';
/****************************************************************************************
* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored                 *
* &ProgName             – the name of the .egp file, without the extension                           *
* &ProgNode            - the name of the code node                                          *
                                                  96


* &ProgDir          – the path to the folder in which the .egp file is stored.             *
*****************************************************************************************/
/*****************************************************************************
* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                         *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax                   *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
** PROGRAMMER'S NAME ;
%LET PROGRAMMER                   = Barrett Montgomery;
** LIST ALL SUBDIRECTORIES CALLED IN THE LIBNAME STATEMENT ;
** THESE CAN BE LEFT BLANK IF NOT NEEDED OR USED ;
%LET USEDIR1             = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\raw;
                                                  97


** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\processed;
** THE NUMBERING SCHEME IS SAVEFILx_y WHERE X IS THE NUMBER OF THE
SAVEDIR AND ;
** Y IS THE NUMBER OF THE FILE WITHIN IT ;
** THIS SHOULD GENERALLY BE SET TO EITHER &PROGNAME OR &PROGNODE ;
%LET SAVEFIL1_1 = ;
** NAME FORMAT LIBRARY DIRECTORY ;
%LET FMTDIR       =;
** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
                                    98


%LET PURPOSE1 =;
/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
LIBNAME USE "&USEDIR1";
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
/*******************************
  ** Required Steps Outline **
1. Import data
2. Check data - note: nsduh county and nsduh tract files must be used in combination
3. Clean and merge data on sub-state area
4. Code sub-state region LRC implementation as of 2014
                                                  99


5. Model legalization at county level
*********************************/
title3"Step 1: Import Data";
proc import datafile= "C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Raw/countypres_2000_2016.csv" dbms=csv out=USE.election replace ;
         GUESSINGROWS=50000;
run;
DATA USE.NSDUH_County;
   set 'C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Raw/substate_county141516.sas7bdat';
run;
DATA USE.NSDUH_Tract;
   set 'C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Raw/substate_tract141516.sas7bdat';
run;
                                      100


proc import datafile= "C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Raw\County Merge/counties_geo_merge_2010-2012.xlsx" dbms=xlsx
out=USE.NSDUH_Crosswalk replace ;
run;
/*this macro imports all data in a folder with 2 options, the folder directory, and the type
of file
folder is saved in a macro variable called root and file type is csv.*/
/*%LET Root        = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Raw\NSDUH;*/
/**/
/*%macro drive(dir,ext); */
/* %local cnt filrf rc did memcnt name; */
/* %let cnt=0;         */
/**/
/* %let filrf=mydir;    */
/* %let rc=%sysfunc(filename(filrf,&dir)); */
/* %let did=%sysfunc(dopen(&filrf));*/
                                         101


/*   %if &did ne 0 %then %do; */
/* %let memcnt=%sysfunc(dnum(&did));        */
/**/
/*   %do i=1 %to &memcnt;            */
/*                */
/*    %let name=%qscan(%qsysfunc(dread(&did,&i)),-1,.);               */
/*             */
/*    %if %qupcase(%qsysfunc(dread(&did,&i))) ne %qupcase(&name) %then %do;*/
/*    %if %superq(ext) = %superq(name) %then %do;                     */
/*       %let cnt=%eval(&cnt+1);     */
/*       %put %qsysfunc(dread(&did,&i)); */
/*       proc import datafile="&dir\%qsysfunc(dread(&did,&i))" out=dsn&cnt */
/*        dbms=csv replace;*/
/*                      GUESSINGROWS=5000; */
/*       run;        */
/*    %end; */
/*    %end; */
/**/
                                       102


/*    %end;*/
/*     %end;*/
/* %else %put &dir cannot be open.;*/
/* %let rc=%sysfunc(dclose(&did));    */
/*           */
/* %mend drive;*/
/* */
/*%drive(&Root,csv) */
/**/
/*%macro combine;*/
/*data USE.NSDUH_SAE;*/
/*       length outcome $110 geography $250;*/
/*       set*/
/*              %do i = 1 %to 15;*/
/*              DSN&i*/
/*              %end;*/
/*       ;*/
/*run;*/
                                     103


/*%mend combine;*/
/*%combine;
This data has been processed with standard errors addded, import that version*/
proc import datafile= "C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Raw\NSDUH/Appended_20102012_SAEs_SE.csv" dbms=csv out=USE.NSDUH_SAE
replace ;
        GUESSINGROWS=30000;
run;
title3"Step 2: Check Data";
proc contents data=USE.Census2010;
        title4'census data from 2010';
run;
proc print data=USE.Census2010 (obs=5);
run;
                                       104


proc contents data=USE.election;
       title4'election data from 2000-2016';
run;
proc print data=USE.election (obs=5) ;
       where fips='5075';
run;
proc freq data=USE.election;
       table fips/list missing;
run;
proc contents data=USE.NSDUH_County;
       title4'County definitions from NSDUH 2014-2016';
run;
proc print data=USE.NSDUH_County (obs=5);
run;
proc contents data=USE.NSDUH_SAE;
                                       105


       title4'NSDUH SAE 2010-2012';
run;
proc print data=USE.NSDUH_SAE (obs=5);
run;
proc freq data=USE.NSDUH_County;
       title4"Check to see if nsduh county names are same in both ds";
       table state*sbst16n / list missing;
       format state state.;
run;
proc freq data=USE.NSDUH_SAE;
       title4"Check to see if nsduh county names are same in both ds";
       table geography / list missing;
run;
/*merge steps
       make fips/county uniform for all datasets
                                        106


                Census data has Char var - leading zeroes for 4 digit fips
                election data has num var with no leading zeroes for 4 digit fips
                nsduh county file has two number vars for state and county
                        Need to concatenate in the form of SSCCC'
                then need to concatenate state and sbst16n to merge with geography
from NSDUH SAE
                nsduh SAEs have only the region name as character variables
       1. election fips needs to become char var with leading 0
       2. Create working Crosswalk of NSDUH areas and counties FIPS Codes
(FIPS_PCH)
       5. Merge census, election, and nsduh crosswalk on FIPS_PCH
       5. Merge in NSDUH SAEs on geography*/
proc print data=USE.Census2010 (obs=1);
       title4'census county variable';
       var fips;
run;
                                        107


proc print data=USE.election (obs=1);
        title4'election county variable';
        var fips;
run;
proc freq data=USE.Census2010;
        title4'check values for state and make sure all align';
        table state/list missing;
run;
proc freq data=USE.election;
        title4'check values for state and make sure all align';
        table state/list missing;
run;
/* transform election fips from num var with no leading zeroes
        to a 5 character var with leading zeroes*/
title3"Step 3: Clean and Merge Data";
                                          108


data election1;
        length fips_pch $ 5 temp_fips 5;
  set USE.election;
        where year in (2008, 2012);
        if fips = "NA" then fips = ".";
  /* read fips as number, then format with leading zeroes */
  /* (pch = suffix for padded character) */
        temp_fips= input(fips, 5.);
  fips_pch = put(temp_fips, z5.);
        candidatevotesx=input(candidatevotes, 8.);
        label fips_pch = "county fips code, padded (leading zeroes) character variable";
        drop candidatevotes;
        rename candidatevotesx=candidatevotes;
run;
                                        109


proc freq data=election1;
       title4'check new fips var in election data';
       table state*fips*temp_fips*fips_pch/list missing;
run;
proc contents data=election1;
       title4'Check new fips var in election data has correct format and length';
run;
data election2;
       set election1;
       drop fips temp_fips;
       /*recode kansas city, MO to be in Jackson County*/
       if fips_pch = "36000" then fips_pch = "29095";
run;
proc print data=election2;
                                        110


         title4'what to do with Alaska';
         where year = 2008 and state="Alaska" and party='democrat';
run;
/*alaska population in 2010 is 710,231
something is wrong with the way alaska fips are coded
Official fips from Census.gov
AK,02,013,Aleutians East Borough,H1
AK,02,016,Aleutians West Census Area,H5
AK,02,020,Anchorage Municipality,H6
AK,02,050,Bethel Census Area,H5
AK,02,060,Bristol Bay Borough,H1
AK,02,068,Denali Borough,H1
AK,02,070,Dillingham Census Area,H5
AK,02,090,Fairbanks North Star Borough,H1
AK,02,100,Haines Borough,H1
                                         111


AK,02,105,Hoonah-Angoon Census Area,H5
AK,02,110,Juneau City and Borough,H6
AK,02,122,Kenai Peninsula Borough,H1
AK,02,130,Ketchikan Gateway Borough,H1
AK,02,150,Kodiak Island Borough,H1
AK,02,164,Lake and Peninsula Borough,H1
AK,02,170,Matanuska-Susitna Borough,H1
AK,02,180,Nome Census Area,H5
AK,02,185,North Slope Borough,H1
AK,02,188,Northwest Arctic Borough,H1
AK,02,195,Petersburg Census Area,H5
AK,02,198,Prince of Wales-Hyder Census Area,H5
AK,02,220,Sitka City and Borough,H6
AK,02,230,Skagway Municipality,H1
AK,02,240,Southeast Fairbanks Census Area,H5
AK,02,261,Valdez-Cordova Census Area,H5
AK,02,270,Wade Hampton Census Area,H5
AK,02,275,Wrangell City and Borough,H1
                                    112


AK,02,282,Yakutat City and Borough,H1
AK,02,290,Yukon-Koyukuk Census Area,H5
impute means for alaskan voter data*/
proc print data=election2 ;
       where state='Alaska';
run;
proc sql;
       title4'calculate average alaskan voter data by year and party';
       create table voter_impute as
       select *, mean(candidatevotes) as vote_imp
       from election2
       where state = "Alaska"
       group by year, party;
run;
                                       113


proc means data=election2 mean;
        var candidatevotes;
        class state year party;
        where state = "Alaska";
run;
data election3;
        set election2 voter_impute;
        if state = "Alaska" and vote_imp=. then delete;
        if state = "Alaska" then candidatevotes = vote_imp;
        drop vote_imp;
run;
proc print data=election3 noobs;
        where state = "Alaska";
run;
/*transform party votes data from long to wide*/
                                        114


proc sql;
      title4'Create year-party-specific vote vars';
      create table partyvote1 as
      select fips_pch, sum(candidatevotes) as dem_2008_votes
      from election3
      where party="democrat" and year=2008
      group by fips_pch;
quit;
proc sql;
      create table partyvote2 as
      select fips_pch, sum(candidatevotes) as rep_2008_votes
      from election3
      where party="republican" and year=2008
      group by fips_pch;
quit;
                                       115


proc sql;
      create table partyvote3 as
      select fips_pch, sum(candidatevotes) as other_2008_votes
      from election3
      where party="NA" and year=2008
      group by fips_pch;
quit;
proc sql;
      title4'Create year-party-specific vote vars';
      create table partyvote4 as
      select fips_pch, sum(candidatevotes) as dem_2012_votes
      from election3
      where party="democrat" and year=2012
      group by fips_pch;
quit;
proc sql;
                                       116


       create table partyvote5 as
       select fips_pch, sum(candidatevotes) as rep_2012_votes
       from election3
       where party="republican" and year=2012
       group by fips_pch;
quit;
proc sql;
       create table partyvote6 as
       select fips_pch, sum(candidatevotes) as other_2012_votes
       from election3
       where party="NA" and year=2012
       group by fips_pch;
quit;
data partyyearvotes;
       merge partyvote1 partyvote2 partyvote3 partyvote4 partyvote5 partyvote6;
run;
                                     117


proc surveyselect data=partyyearvotes n=5 out=SampleRep seed=2;;
       title4'Check summarization worked as intended';
run;
proc print data=partyyearvotes noobs;
       var fips_pch other: dem: rep:;
  where fips_pch in ("06087", "13173","26147", "37013", "45079");
run;
proc means data=election3 sum;
  class fips_pch party year;
  var candidatevotes;
  where fips_pch in ("06087", "13173","26147", "37013", "45079");
run;
data election4;
       set partyyearvotes;
                                      118


       total_2008_votes = sum(other_2008_votes, dem_2008_votes, rep_2008_votes);
       total_2012_votes = sum(other_2012_votes, dem_2012_votes, rep_2012_votes) ;
       percentdem_2008_votes = dem_2008_votes/total_2008_votes;
       percentrep_2008_votes = rep_2008_votes/total_2008_votes;
       percentother_2008_votes = other_2008_votes/total_2008_votes;
       percentdem_2012_votes = dem_2012_votes/total_2012_votes;
       percentrep_2012_votes = rep_2012_votes/total_2012_votes;
       percentother_2012_votes = other_2012_votes/total_2012_votes;
       label total_2008_votes = 'total votes in district in 2008'
                total_2012_votes = 'total votes in district in 2012'
                percent_2008_votes ='percent of total votes in district for each party in
2008'
                percent_2012_votes ='percent of total votes in district for each party in
2012';
run;
proc print data=election4;
       title4'check percent voter and total votes variables';
                                        119


   where fips_pch in ("06087", "13173","26147", "37013", "45079");
run;
/*successfully made all election data wide format, use election4 and merge by
fips_pch*/
data census1;
       length fips_pch $ 5;
       set      USE.Census2010;
       fips_pch = fips;
       rename state=state_fips;
run;
proc freq data=census1 (obs=50);
       title4'these need to be made numeric';
       table arealand * areawatr /list missing;
run;
                                        120


proc freq data=census1;
       title4'drop these vars?';
       table sumlev division geocomp /list missing;
run;
/*keep division to use as a hierarchical variable*/
data census2;
       length state $20;
       set census1;
       if state_fips = '01' then state = "Alabama";
       if state_fips = '02' then state = "Alaska";
       if state_fips = '04' then state = "Arizona";
       if state_fips = '05' then state = "Arkansas";
       if state_fips = '06' then state = "California";
       if state_fips = '08' then state = "Colorado";
       if state_fips = '09' then state = "Connecticut";
       if state_fips = '10' then state = "Delaware";
                                         121


if state_fips = '11' then state = "DC";
if state_fips = '12' then state = "Florida";
if state_fips = '13' then state = "Georgia";
if state_fips = '15' then state = "Hawaii";
if state_fips = '16' then state = "Idaho";
if state_fips = '17' then state = "Illinois";
if state_fips = '18' then state = "Indiana";
if state_fips = '19' then state = "Iowa";
if state_fips = '20' then state = "Kansas";
if state_fips = '21' then state = "Kentucky";
if state_fips = '22' then state = "Louisiana";
if state_fips = '23' then state = "Maine";
if state_fips = '24' then state = "Maryland";
if state_fips = '25' then state = "Massachusetts";
if state_fips = '26' then state = "Michigan";
if state_fips = '27' then state = "Minnesota";
if state_fips = '28' then state = "Mississippi";
if state_fips = '29' then state = "Missouri";
                                  122


if state_fips = '30' then state = "Montana";
if state_fips = '31' then state = "Nebraska";
if state_fips = '32' then state = "Nevada";
if state_fips = '33' then state = "New Hampshire";
if state_fips = '34' then state = "New Jersey";
if state_fips = '35' then state = "New Mexico";
if state_fips = '36' then state = "New York";
if state_fips = '37' then state = "North Carolina";
if state_fips = '38' then state = "North Dakota";
if state_fips = '39' then state = "Ohio";
if state_fips = '40' then state = "Oklahoma";
if state_fips = '41' then state = "Oregon";
if state_fips = '42' then state = "Pennsylvania";
if state_fips = '44' then state = "Rhode Island";
if state_fips = '45' then state = "South Carolina";
if state_fips = '46' then state = "South Dakota";
if state_fips = '47' then state = "Tennessee";
if state_fips = '48' then state = "Texas";
                                  123


if state_fips = '49' then state = "Utah";
if state_fips = '50' then state = "Vermont";
if state_fips = '51' then state = "Virginia";
if state_fips = '53' then state = "Washington";
if state_fips = '54' then state = "West Virginia";
if state_fips = '55' then state = "Wisconsin";
if state_fips = '56' then state = "Wyoming";
if state_fips = '60' then state = "American Samoa";
if state_fips = '66' then state = "Guam";
if state_fips = '69' then state = "Northern Mariana Islands";
if state_fips = '72' then state = "Puerto Rico";
if state_fips = '78' then state = "Virgin Islands";
area_water=input(areawatr, 15.);
area_land=input(arealand, 15.);
division_n =input(division, 8.);
                                  124


         label area_land="the size, in square meters, of the land portions of geographic
entities for which the Census Bureau tabulates and disseminates data"
                  area_water = "size, in square meters of inland, coastal, Great Lakes, and
territorial sea water";
         drop fips sumlev geocomp;
run;
proc freq data=census2;
         title4'check state fips and state variable creation';
         table state_fips * state/list missing;
run;
proc freq data=census2;
         title4'check state fips and state variable creation';
         table state_fips * county * fips_pch/list missing;
run;
proc print data=census2 (obs=50);
                                           125


       title4'check area variable creation';
       var areawatr area_water arealand area_land division division_n;
run;
proc means data=census2;
       var area_water area_land;
run;
/*check duplication of county and state var for which one to keep*/
proc freq data=census2;
       title4'check duplication of county and state var for which one to keep';
       table state county/list missing;
run;
proc sort data=census2;
       by fips_pch;
run;
                                        126


proc sort data=election4;
       by fips_pch;
run;
data test1;
       merge election4 (in=inelectionx) census2 (in=incensusx drop=arealand areawatr)
;
       by fips_pch;
       inelection=inelectionx;
       incensus=incensusx;
       state_fips=put(fips_pch,2.);
run;
proc contents data=test1;
       title4'Check merge of census and election data';
run;
                                      127


proc freq data= test1;
       table incensus*inelection/list missing;
run;
proc freq data=test1;
       title4'Check fips and state fips';
       table fips_pch * state_fips/list missing;
run;
proc print data= test1 noobs n;
       title4'check discrepancies after merge: not in census data';
       where incensus=0;
run;
/*these are the connecticut/maine/rhode island non-counties that have voter data, can
be deleted
need to investigate why Alaska and Kansas City are specifcially not merging
                                         128


kansas city exists in several counties, recoded to jackson county, where most of
popultion lies, in election 2 datastep*/
proc print data= test1 noobs n;
         title4'check discrepancies after merge: not in election data';
         var fips_pch county state:;
         where inelection=0;
run;
/*       alaska here probably the data not merging correctly
         look into county in hawaii
         Puerto rico can be deleted*/
proc print data= test1 noobs n;
         title4'Hawaii county not in election data';
         where inelection=0 and state="Hawaii";
run;
/*tiny village with less than 100 people, leave as is*/
                                          129


data test2;
       set test1;
/*              delete the weird connecticut/maine/rhode island non-counties, delete
Puerto Rico Census data*/
       if compress(fips_pch)= "." then delete;
       if state ="Puerto Rico" then delete;
run;
proc print data= test2;
       title4'Check test 2 fixes worked (should not print)';
       where state="Puerto Rico" or compress(fips_pch)= "." ;
run;
proc freq data= test2;
       title4'Check mismatches after test 2 fixes';
       table incensus*inelection/list missing;
run;
                                       130


proc print data= test2;
       title4'Check mismatches after test 2 fixes';
       where incensus=0 or inelection=0;
run;
proc print data= test2;
       title4'Check Alsakan data uses averages';
       where state="Alaska";
run;
data test3;
       /* remove bad alaskan data but impute average voter data
               set in election for these to be 1*/
       set test2;
       where incensus=1;
       if state="Alaska" then do;
               dem_2008_votes                       = 3089.85;
               rep_2008_votes                       = 4846.03;
                                         131


               other_2008_votes            = 219.05;
               dem_2012_votes                     = 3066;
               rep_2012_votes                     = 4116.9;
               other_2012_votes            = 329.475;
               total_2008_votes            = 8154.93;
               total_2012_votes            = 7512.38;
               percentdem_2008_votes       = 0.37889;
               percentrep_2008_votes       = 0.59425;
               percentother_2008_votes     = 0.026861;
               percentdem_2012_votes       = 0.40813;
               percentrep_2012_votes       = 0.54802;
               percentother_2012_votes     = 0.043858;
               inelection=1;
               end;
run;
proc print data= test3;
       title4'Check Alsakan data uses averages and correct fips';
                                      132


         where state="Alaska";
run;
proc print data= test3;
         title4'last remianing issue';
         where inelection=0 or incensus=0;
run;
/*limitations of voter data: alsaka uses state averages, missing 1 hawaii county*/
/* compared crosswalk and my small area estimates with
https://www.samhsa.gov/data/report/2010-2012-nsduh-substate-region-definitions
changes to crosswalk made to match sae file based on above comparisons and
documentation*/
data nsduh_crosswalk2;
         set use.nsduh_crosswalk;
         where fips_pch ~='';
                                         133


              if geography = "Arizona Maricopa" then geography= "Arizona Central";
              if geography = "Arizona Pima" then geography= "Arizona South A";
              if geography = "Florida Circuit 17 (Broward)" then geography= "Florida
Broward (Circuit 17)";
              if geography = "Florida Region F - Southern (Circuits 11 and 16)" then
geography= "Florida South (Circuits 11 and 16)";
              if geography = "Illinois Region I (Cook)" then geography= "Illinois Region
1 (Cook)";
              if geography = "Illinois Region II" then geography= "Illinois Region 2";
              if geography = "Illinois Region III" then geography= "Illinois Region 3";
              if geography = "Illinois Region IV" then geography= "Illinois Region 4";
              if geography = "Illinois Region V" then geography= "Illinois Region 5";
              if geography = "Kentucky Seven Counties" then geography= "Kentucky
Centerstone";
              if geography = "Michigan Macomb" then geography= "Michigan Region
1";
              if geography = "Michigan Oakland" then geography= "Michigan Region
8";
              if geography = "Michigan Pathways and Western" then geography=
"Michigan Region 9";
                                        134


run;
/* merge NSDUH crosswalk*/
proc sort data= nsduh_crosswalk2;
       by fips_pch;
run;
proc sort data= test3;
       by fips_pch;
run;
data test4;
       merge test3 nsduh_crosswalk2 (in=incrossx);
       by fips_pch;
       where fips_pch ~='';
       incross = incrossx;
                                   135


       label    fips_pch = 'concatenated state_fips and county_fips in the form of
SSCCC, mergable with election and census data';
run;
proc freq data=test4;
       title4'nsduh merge check';
       table incross*incensus*inelection/list missing;
run;
proc freq data= test4;
       title4'Check mismatches after test 3 merge: not in election data';
       table qname/list missing;
       where inelection=0;
run;
/*same limitations of census data relating to alaska and hawaii*/
/*begin with nsduh SAEs, these also need to be transformed to one level...*/
                                        136


proc sql;
       title4'prepare NSDUH SAEs for transpose and merge';
       select distinct(outcome)
       from use.nsduh_sae;
quit;
data NSDUH_SAEs;
       set use.nsduh_sae ;
       length short_name $32 age_group2 $32;
/*remove aggregate areas and sub-county areas*/
       where geography not in ("District of Columbia Ward 1","District of Columbia
Ward 2",
                "District of Columbia Ward 3","District of Columbia Ward 4","District of
Columbia Ward 5",
                "District of Columbia Ward 6","District of Columbia Ward 7","District of
Columbia Ward 8",
                "California LA SPA 1 and 5", "California LA SPA 2", "California LA SPA
3",
                                       137


                "California LA SPA 4", "California LA SPA 6", "California LA SPA 7",
"California LA SPA 8",
                "California Regions 13 and 19R",
                "Florida Region A - Northwest", "Florida Region B - Northeast",
                "Florida Region C - Central", "Florida Region D - Southeast"
                "Florida Region E - Sun Coast", "Hawaii Kauai and Maui",
                "Louisiana Regions 1 and 10", "Maine Aroostook/Downeast", "Missouri
Eastern"
                "Missouri Northwest", "Nebraska Regions 1 and 2", "Nevada Region 3",
                "New Hampshire Central", "New Hampshire Southern", "Texas Region
11",
                "Texas Region 3", "Texas Region 6", "Texas Region 7", "Washington
Region 1",
                "Washington Region 2", "Washington Region 3");
        if outcome = "Alcohol Dependence in the Past Year" then short_name =
"PYAlcDepPrev";
        if outcome = "Alcohol Use Disorder in the Past Year" then short_name =
"PYAUDPrev";
        if outcome = "Alcohol Use in the Past Month" then short_name = "PMAlcPrev";
                                        138


      if outcome = "Any Mental Illness in the Past Year" then short_name =
"PYAMIPrev";
      if outcome = "Average Annual Rate of First Use of Marijuana" then short_name =
"PYMJInc";
      if outcome = "Cigarette Use in the Past Month" then short_name =
"PMCigPrev";
      if outcome = "Cocaine Use in the Past Year" then short_name = "PYCocPrev";
      if outcome = "Past Year Suicidal Thoughts" then short_name = "PYSTPrev";
      if outcome = "Illicit Drug Dependence in the Past Year" then short_name =
"PYSUDPrev";
      if outcome = "Major Depressive Episode in the Past Year" then short_name =
"PYMDEPrev";
      if outcome = "Marijuana Use in the Past Month" then short_name =
"PMMJPrev";
      if outcome = "Marijuana Use in the Past Year" then short_name = "PYMJPrev";
      if outcome = "Serious Mental Illness in the Past Year" then short_name =
"PYSMIPrev";
      if outcome = "Tobacco Product Use in the Past Month" then short_name =
"PMTobPrev";
                                      139


       if outcome = "Underage Alcohol Use in the Past Month" then short_name =
"PMUAlcPrev";
       if age_group = "12 or Older" then age_group2      = "Twelve_plus";
       if age_group = "12 to 17" then age_group2         = "Twelve_to_17" ;
       if age_group = "12 to 20" then age_group2         = "Twelve_to_20" ;
       if age_group = "18 or Older" then age_group2      = "Eighteen_plus" ;
       if age_group = "18 to 25" then age_group2         = "Eighteen_to_25" ;
       if age_group = "26 or Older" then age_group2      = "Twenty_six_plus" ;
run;
proc print data=NSDUH_SAEs;
       title3'interval check (should not print)';
       var geography estimate ci_lower ci_upper;
       where estimate > ci_upper or estimate < ci_lower;
run;
proc freq data=NSDUH_SAEs;
                                        140


        title4'Chek to see if new vars worked';
        table outcome*short_name age_group*age_group2/list missing;
run;
proc sort data=NSDUH_SAEs;
        by geography;
run;
/*macro to transpose all variable values to make wide dataset*/
%macro multi_transp (var=, out=);
proc transpose data= NSDUH_SAEs delimiter=_ out=&out(drop=_NAME_) suffix=&var ;
        id age_group2 short_name ;
        by geography ;
        var &var ;
run;
proc sort data= &out;
        by geography;
                                        141


run;
%mend;
%multi_transp(var=estimate, out=NSDUH_SAEs1);
%multi_transp(var=ci_lower, out=NSDUH_SAEs2);
%multi_transp(var=ci_upper, out=NSDUH_SAEs3);
%multi_transp(var=L,              out=NSDUH_SAEs4);
%multi_transp(var=L_lower,        out=NSDUH_SAEs5);
%multi_transp(var=sel,            out=NSDUH_SAEs6);
%multi_transp(var=L_upper,        out=NSDUH_SAEs7);
%multi_transp(var=se,             out=NSDUH_SAEs8);
data sae_merge;
       merge NSDUH_SAEs1 NSDUH_SAEs2 NSDUH_SAEs3 NSDUH_SAEs4
NSDUH_SAEs5 NSDUH_SAEs6 NSDUH_SAEs7 NSDUH_SAEs8;
       by geography;
       label fips_pch = 'concatenated state_fips and county_fips in the form of
SSCCC, mergable with election and census data';
                                    142


run;
/*ensure this merge worked*/
data merge_check;
       set sae_merge;
       a1=Twelve_plus_PYAlcDepPrevestimate<Twelve_plus_PYAlcDepPrevci_lower;
       a2=Twelve_plus_PYAlcDepPrevestimate>Twelve_plus_PYAlcDepPrevci_upper;
       b1=Twelve_plus_PMAlcPrevestimate<Twelve_plus_PMAlcPrevci_lower;
       b2=Twelve_plus_PMAlcPrevestimate>Twelve_plus_PMAlcPrevci_upper;
       c1=Twelve_plus_PMCigPrevestimate<Twelve_plus_PMCigPrevci_lower;
       c2=Twelve_plus_PMCigPrevestimate>Twelve_plus_PMCigPrevci_upper;
       d1=Twelve_plus_PMMJPrevestimate<Twelve_plus_PMMJPrevci_lower;
       d2=Twelve_plus_PMMJPrevestimate>Twelve_plus_PMMJPrevci_upper;
       e1=Twelve_plus_PMTobPrevestimate<Twelve_plus_PMTobPrevci_lower;
       e2=Twelve_plus_PMTobPrevestimate>Twelve_plus_PMTobPrevci_upper;
       f1=Twelve_plus_PYAUDPrevestimate<Twelve_plus_PYAUDPrevci_lower;
       f2=Twelve_plus_PYAUDPrevestimate>Twelve_plus_PYAUDPrevci_upper;
                                 143


       g1=Twelve_plus_PYAlcDepPrevestimate<Twelve_plus_PYAlcDepPrevci_lower;
       g2=Twelve_plus_PYAlcDepPrevestimate>Twelve_plus_PYAlcDepPrevci_upper;
       h1=Twelve_plus_PYCocPrevestimate<Twelve_plus_PYCocPrevci_lower;
       h2=Twelve_plus_PYCocPrevestimate>Twelve_plus_PYCocPrevci_upper;
       i1= Twelve_plus_PYMJIncestimate<Twelve_plus_PYMJIncci_lower;
       i2= Twelve_plus_PYMJIncestimate>Twelve_plus_PYMJIncci_upper;
       j1=Twelve_plus_PYMJPrevestimate<Twelve_plus_PYMJPrevci_lower;
       j2=Twelve_plus_PYMJPrevestimate>Twelve_plus_PYMJPrevci_upper;
       k1=Twelve_plus_PYSUDPrevestimate<Twelve_plus_PYSUDPrevci_lower;
       k2=Twelve_plus_PYSUDPrevestimate>Twelve_plus_PYSUDPrevci_upper;
run;
proc freq data=merge_check;
       title4'1 indicates where estimate is out of bounds';
       table a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 f1 f2 g1 g2 h1 h2 i1 i2 j1 j2 k1 k2/list
missing;
run;
                                       144


/*all estimates are within lower and upper confidence intervals
chack a random subset of variables for comparisons*/
proc surveyselect noprint data=nsduh_saes method=srs n=10 seed=2
out=merge_check1;
run;
proc print data=merge_check1 noobs;
         title4'random subset of long form observations';
         var geography outcome short_name age_group2 estimate ci_lower ci_upper;
run;
proc print data= sae_merge noobs;
         title4'compare to wide form after merge';
         var Eighteen_plus_PYMJPrevestimate Eighteen_plus_PYMJPrevci_lower
Eighteen_plus_PYMJPrevci_upper;
         where geography = "California Region 12R";
run;
                                         145


proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Twelve_plus_PYAUDPrevestimate Twelve_plus_PYAUDPrevci_lower
Twelve_plus_PYAUDPrevci_upper;
       where geography = "Delaware Sussex";
run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Eighteen_plus_PYAlcDepPrevestima Eighteen_plus_PYAlcDepPrevci_low
Eighteen_plus_PYAlcDepPrevci_upp;
       where geography = "Florida Circuit 12";
run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Eighteen_to_25_PYCocPrevestimate Eighteen_to_25_PYCocPrevci_lower
Eighteen_to_25_PYCocPrevci_upper;
       where geography = "Florida Circuit 12";
                                      146


run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Twelve_plus_PYSUDPrevestimate Twelve_plus_PYSUDPrevci_lower
Twelve_plus_PYSUDPrevci_upper;
       where geography = "Massachusetts Central";
run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Eighteen_plus_PYCocPrevestimate Eighteen_plus_PYCocPrevci_lower
Eighteen_plus_PYCocPrevci_upper;
       where geography = "Michigan Kent";
run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
                                      147


       var Twenty_six_plus_PMMJPrevestimate Twenty_six_plus_PMMJPrevci_lower
Twenty_six_plus_PMMJPrevci_upper;
       where geography = "New Hampshire Central 1";
run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Eighteen_to_25_PMTobPrevestimate Eighteen_to_25_PMTobPrevci_lower
Eighteen_to_25_PMTobPrevci_upper;
       where geography = "New York Region 2C: New York";
run;
proc print data= sae_merge noobs;
       title4'compare to wide form after merge';
       var Twelve_to_17_PMAlcPrevestimate Twelve_to_17_PMAlcPrevci_lower
Twelve_to_17_PMAlcPrevci_upper;
       where geography = "Oregon Region 4";
run;
                                      148


proc print data= sae_merge noobs;
         title4'compare to wide form after merge';
         var Twelve_plus_PYAlcDepPrevestimate Twelve_plus_PYAlcDepPrevci_lower
Twelve_plus_PYAlcDepPrevci_upper;
         where geography = "West";
run;
/*all are exact matches*/
data sae_merge2;
         set sae_merge ;
/*create a state binary*/
         if geography
in("Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut",
         "Delaware","Florida","Georgia","Hawaii","Idaho","Illinois","Indiana","Iowa","Kan
sas","Kentucky",
         "Louisiana","Maine","Maryland","Massachusetts","Michigan","Minnesota","Miss
issippi","Missouri",
                                        149


        "Montana","Nebraska","Nevada","New Hampshire","New Jersey","New
Mexico","New York","North Carolina",
        "North Dakota","Ohio",       "Oklahoma","Oregon","Pennsylvania","Rhode
Island","South Carolina",
        "South Dakota",        "Tennessee","Texas","Utah",
        "Vermont","Virginia","Washington","West Virginia",
        "Wisconsin","Wyoming") then do;
                 agg_state=1;
                 agg=1;
                 agg_region=0;
        end;
        else if geography in ("Midwest", "Northeast", "South", "West", "United States")
then do
                 agg_region=1;
                 agg_state=0;
                 agg=1;
         end;
         else do;
                 agg_region=0;
                                       150


            agg_state=0;
            agg=0;
      end;
            if geography = "New York Region 10" then geography = "New York
Region C";
            if geography = "New York Region 11" then geography = "New York
Region C";
            if geography = "New York Region 12" then geography = "New York
Region C";
            if geography = "New York Region 13" then geography = "New York
Region D";
            if geography = "New York Region 14" then geography = "New York
Region D";
            if geography = "New York Region 15" then geography = "New York
Region D";
            if geography = "New York Region 1: Long Island" then geography =
"New York Region B";
            if geography = "New York Region 2" then geography = "New York
Region A";
                                  151


              if geography = "New York Region 2: New York City" then geography =
"New York Region A";
              if geography = "New York Region 2A: Bronx" then geography = "New
York Region A";
              if geography = "New York Region 2C: New York" then geography =
"New York Region A";
              if geography = "New York Region 2D: Queens" then geography = "New
York Region A";
              if geography = "New York Region 6" then geography = "New York
Region B";
              if geography = "New York Region 7" then geography = "New York
Region B";
              if geography = "New York Region 8" then geography = "New York
Region C";
              if geography = "New York Region 9" then geography = "New York
Region C";
              if geography = "North Carolina Cardinal Innovations Healthcare Solutions
3" then geography = "North Carolina Cardinal Innovations";
              if geography = "North Carolina CenterPoint Human Services" then
geography = "North Carolina CenterPoint";
                                     152


                if geography = "North Carolina Partners Behavioral Health Management"
then geography = "North Carolina Partners";
                if geography = "North Carolina Sandhills Center 1" then geography =
"North Carolina Sandhills";
                if geography = "North Carolina Sandhills Center 2" then geography =
"North Carolina Sandhills";
                if geography = "North Carolina Smoky Mountain Center 1" then
geography = "North Carolina Smoky Mountain";
                if geography = "North Carolina Smoky Mountain Center 2" then
geography = "North Carolina Smoky Mountain";
                if geography = "North Carolina Trillium Health Resources 1" then
geography = "North Carolina ECBH";
                if geography = "North Carolina Trillium Health Resources 2" then
geography = "North Carolina CoastalCare";
                if geography = "Michigan Detroit City" then fips_pch = "26163";
run;
proc freq data=sae_merge2;
       title4'check agg bins';
                                       153


       table agg*agg_region*agg_state/list missing;
run;
proc sort data= sae_merge2;
       by geography ;
run;
proc sort data= nsduh_crosswalk2;
       by geography ;
run;
data nsduh_merge;
       merge sae_merge2 (in=insaex) nsduh_crosswalk2 (in=incrossx);
       by geography ;
       incross = incrossx;
       insae = insaex;
run;
                                    154


proc freq data=nsduh_merge;
       title4'nsduh merge check';
       table incross*insae/list missing;
run;
proc freq data=nsduh_merge;
       title4'nsduh merge check where not in crosswalk data';
       table geography / list missing;
       where incross=0 and agg=0;
run;
proc print data=nsduh_merge noobs n;
       title4'why are so many county fips codes missing?';
       var name fips_pch geography;
       where fips_pch = "" and agg=0;
run;
                                        155


proc freq data=nsduh_merge noprint;
       title4'find remaining fips code duplicates';
       table fips_pch / list missing out=fips_counts ;
run;
proc print data=fips_counts;
       where count>1;
run;
proc print data=nsduh_merge;
       var name fips_pch geography;
       where fips_pch in ("10003", "26163");
run;
proc print data=nsduh_merge;
       title4'Check empty geography';
       where geography="";
run;
                                        156


data nsduh_merge2;
       length name $90 ;
       set nsduh_merge;
       /*remove wilmington city, detroit, and kalawao*/
       if geography in ("Delaware Wilmington City") then delete;
       if geography in ("Michigan Detroit City") then delete;
       if name in ("Kalawao County") then delete;
run;
proc freq data=nsduh_merge2 noprint;
       title4'Check to see if fips duplicate removal worked (should not print)';
       table fips_pch / list missing out=fips_counts ;
run;
proc print data=fips_counts n;
       where count>1 and fips_pch ~="";
run;
                                         157


proc freq data=test4 noprint;
       title4'check census/election data for duplicate fips';
       table fips_pch/list missing out=countfips;
run;
proc print data=countfips;
       where count>1;
run;
proc print data=test4;
       var name fips_pch geography;
       where fips_pch in ("10003", "26163");
run;
proc sort data=nsduh_merge2;
       by fips_pch;
run;
                                       158


proc sort data=test4;
       by fips_pch;
run;
data full_set;
       merge nsduh_merge2 test4;
       by fips_pch;
/*make sure these are not in final set */
       if geography ="Delaware Wilmington City" then delete;
       if name ="Kalawao County" then delete;
       label agg = "Observation level is above county - aggregate"
                agg_region = "Observation level is US regional"
                agg_state = "Observation level is US state";
run;
proc freq data=full_set;
                                          159


       title4'check final merge';
       table incensus*inelection*insae*incross/list missing;
run;
proc print data=full_set ;
       title4'check missing census data if not aggregate info';
       where incensus= . and agg ~=1;
run;
proc print data=full_set ;
       title4'check missing name values';
       var geography;
       where name= "" and agg ~=1;
run;
/*allmass/conn counties or aggregate areas*/
proc freq data=full_set;
       title4'check aggregate flag data';
                                       160


       table agg*incensus*inelection*insae*incross/list missing;
run;
proc print data=full_set;
       var agg name geography incensus inelection insae incross;
       where agg=.;
run;
proc print data=full_set;
       title4'Check obs missing census/election data';
       var geography name county fips_pch incensus incross inelection insae agg;
       where (inelection=. or incensus=.) and agg ~=1;
run;
proc print data=full_set noobs;
       title4'why are there missing nsduh estimates?';
       var state geography fips_pch name incensus incross inelection insae;
       where insae= 0 and agg ~=1 ;
                                       161


run;
/*begin imputing SAE state averages for massachusetts and conn counties/
/* Mean imputation: Use PROC STDIZE to replace missing values with mean */
proc stdize data=full_set out=mass_imputes
       reponly          /* only replace; do not standardize */
       method=MEAN;           /* or MEDIAN, MINIMUM, MIDRANGE, etc. */
       var Eighteen: Twelve: Twenty:;
       where geography= 'Massachusetts' or state='Massachusetts';
run;
proc stdize data=full_set out=conn_imputes
       reponly          /* only replace; do not standardize */
       method=MEAN;           /* or MEDIAN, MINIMUM, MIDRANGE, etc. */
       var Eighteen: Twelve: Twenty:;
       where geography= 'Connecticut' or state='Connecticut';
run;
                                       162


proc print data=full_set;
       title4'Imputed mass and conn SAEs';
       var state name geography fips_pch Eighteen: Twelve: Twenty:;
       where geography in ('Massachusetts', 'Connecticut');
run;
proc print data=mass_imputes;
       var state name geography fips_pch Twelve:;
run;
proc print data=conn_imputes;
       var state name geography fips_pch Twelve:;
run;
proc print data=full_set;
       title4'drop these duplicates';
       var state name geography fips_pch Twelve:;
                                      163


       where state="Massachusetts" or state='Connecticut';
run;
data full_set2;
       set full_set (where=(state~= "Massachusetts" and state~= 'Connecticut'))
                mass_imputes conn_imputes;
       if agg=. then agg = 0;
       if inelection=. then inelection= 0;
       if incensus=. then incensus = 0;
/*placeholders for variables built in next data step*/
       RCL_2012 = 0;
       LAG_2014 = 0;
       sens_2014 = 0;
       sens2_2014 = 0;
       sens3_2014 = 0;
                                         164


       if state in ("Massachusetts", 'Connecticut') then geography = catx(' ', state,
name);
run;
proc print data= full_set2;
       title4'check mass imputations';
       var state geography fips_pch Twelve:;
       where (state = 'Massachusetts' or geography= 'Massachusetts') and agg ~=1;
run;
proc print data= full_set2;
       title4'check Conn imputations';
       var state geography fips_pch Twelve:;
       where (state = 'Connecticut' or geography= 'Connecticut') and agg ~=1;
run;
proc freq data=full_set2;
       title4'check for any unexpected missingness or duplicates';
                                       165


       table incensus*incross*insae*inelection / list missing;
run;
proc freq data=full_set2;
       title4'check for any unexpected missingness or duplicates';
       table rcl_2012 LAG_2014 / list missing;
run;
proc freq data=full_set2;
       title4'missing names, fips, census, election, and cross walk should all be
aggregate obs';
       table agg / list missing;
       where fips_pch='' or name='' or incensus=0 or inelection=0 or incross=0;
run;
proc print data=full_set2 noobs n;
       title4'what are these non-agg obs?';
       var name geography;
                                       166


       where (fips_pch='' or name='' or incensus=0 or inelection=0 or incross=0) and
agg=0;
run;
proc print data=full_set2 noobs n;
       title4'check to see that we have all the data for massachusetts and connecticut
we need before dropping these observations';
       where state in ('Massachusetts','Connecticut');
run;
proc print data=full_set2 noobs n;
       title4'check Colorado counties for RCL coding';
       var state name county geography;
       where state in ("Colorado");
run;
proc print data=full_set2 noobs n;
       title4'check Washington counties for RCL coding';
       var state name county geography;
                                       167


        where state in ("Washington");
run;
title3"Step 4: Code county level cannabis sales";
data full_set3;
        set full_set2;
        if geography in ("Connecticut Eastern", "Connecticut North Central",
                                     "Connecticut Northwestern", "Connecticut South
Central",
                                     "Connecticut Southwest", "Massachusetts
Boston",
                                     "Massachusetts Central", "Massachusetts
Metrowest",
                                     "Massachusetts Northeast", "Massachusetts
Southeast",
                                     "Massachusetts Western") then delete;
                                       168


/*Here we create five variables: RCL_2012, LAG_2014, sens_2014, sens2_2014,
sens3_2014
                RCL_2012: Counties where rec cannabis sales became legal in 2012 (not
implemented until 2014, but voted on in 2012).
                                This incluides all Washington counties and 2/3 of
Colorado counties
                LAG_2014: counties that have local authority over cannabis sales and
voted to allow in 2014.
                                This includes counties in Colorado, all of Alaska, and
Oregon.
                                Washington must be coded as missing because cannabis
is legal but no local authority was granted to counties
                                should not be counted in either exposure condition
                sens_2014: Same as LAG_2014 but codes washington counties
according to zoning laws outlined in the
                                        I-502 Evaluation Plan and Preliminary Report on
Implementation, exhibit 6
                sens2_2014: same as sens_2014, but Alaska defined as missing
because it has no variance
                                          169


                 sens3_2014: same as LAG_2014, but Alaska defined as missing because
it has no variance
        /*64 counties in Colorado*/
        if state = "Colorado" and name in ("Park County", "Conejos County", "Pitkin
County", "Arapahoe County", "Adams County", "Douglas County",
                 "Eagle County", "Larimer County", "Weld County", "Gilpin County",
"Boulder County", "Summit County", "Chaffee County", "Garfield County",
                 "Delta County", "Montezuma County", "Moffat County", "Gunnison
County", "Saguache County", "Mesa County", "Denver County",
                 "Montrose County", "Dolores County", "La Plata County", "El Paso
County", "Jefferson County", "Clear Creek County", "San Miguel County",
                 "Crowley County", "Grand County", "Routt County", "Huerfano County",
"Las Animas County", "Lake County", "Morgan County", "Archuleta County",
                 "Pueblo County", "Ouray County", "Otero County", "Costilla County",
"Sedgwick County", "San Juan County")
                 then do;
                        RCL_2012 = 1;
                        LAG_2014 = 1;
                        sens_2014 = 1;
                                        170


                         sens2_2014 = 1;
                         sens3_2014 = 1;
                 end;
        else if state = "Colorado" and name not in ("Park County", "Conejos County",
"Pitkin County", "Arapahoe County", "Adams County", "Douglas County",
                 "Eagle County", "Larimer County", "Weld County", "Gilpin County",
"Boulder County", "Summit County", "Chaffee County", "Garfield County",
                 "Delta County", "Montezuma County", "Moffat County", "Gunnison
County", "Saguache County", "Mesa County", "Denver County",
                 "Montrose County", "Dolores County", "La Plata County", "El Paso
County", "Jefferson County", "Clear Creek County", "San Miguel County",
                 "Crowley County", "Grand County", "Routt County", "Huerfano County",
"Las Animas County", "Lake County", "Morgan County", "Archuleta County",
                 "Pueblo County", "Ouray County", "Otero County", "Costilla County",
"Sedgwick County", "San Juan County")
        then do;
                         RCL_2012 = 0;
                         LAG_2014 = 0;
                         sens_2014 = 0;
                         sens2_2014 = 0;
                                        171


                         sens3_2014 = 0;
        end;
        /*29 counties in Alaska, some towns where cannabis is illegal but not many*/
        else if state = "Alaska" and name in ("Aleutians West Census Area", "Kodiak
Island Borough", "Bethel Census Area", "Aleutians East Borough",
                 "Wade Hampton Census Area", "Dillingham Census Area", "Yukon-
Koyukuk Census Area", "Northwest Arctic Borough", "North Slope Borough",
                 "Anchorage Municipality", "Denali Borough", "Hoonah-Angoon Census
Area", "Nome Census Area", "Lake and Peninsula Borough",
                 "Valdez-Cordova Census Area", "Prince of Wales-Hyder Census Area",
"Southeast Fairbanks Census Area", "Fairbanks North Star Borough",
                 "Kenai Peninsula Borough", "Matanuska-Susitna Borough", "Juneau City
and Borough", "Petersburg Census Area", "Ketchikan Gateway Borough",
                 "Sitka City and Borough", "Skagway Municipality", "Wrangell City and
Borough", "Yakutat City and Borough", "Bristol Bay Borough", "Haines Borough" )
                 then do;
                         RCL_2012 = 0;
                         LAG_2014 = 1;
                         sens_2014 = 1;
                                        172


                        sens2_2014 = .;
                        sens3_2014 = .;
                end;
                /*36 counties in Oregon, 15 counties with banned cannabis, 21 w/o*/
       else if state = "Oregon" and name in ("Benton County", "Clackamas County",
"Clatsop County", "Columbia County", "Coos County",
                "Curry County", "Deschutes County", "Gilliam County", "Grant County",
"Hood River County", "Jackson County",
                "Josephine County", "Lane County", "Lincoln County", "Linn County",
"Multnomah County", "Polk County",
                "Tillamook County", "Wasco County", "Washington County", "Yamhill
County")
                then do;
                        RCL_2012 = 0;
                        LAG_2014 = 1;
                        sens_2014 = 1;
                        sens2_2014 = 1;
                        sens3_2014 = 1;
                                        173


                end;
       else if state = "Oregon" and name not in ("Benton County", "Clackamas
County", "Clatsop County", "Columbia County", "Coos County",
                "Curry County", "Deschutes County", "Gilliam County", "Grant County",
"Hood River County", "Jackson County",
                "Josephine County", "Lane County", "Lincoln County", "Linn County",
"Multnomah County", "Polk County",
                "Tillamook County", "Wasco County", "Washington County", "Yamhill
County")
                then do;
                        RCL_2012 = 0;
                        LAG_2014 = 0;
                        sens_2014 = 0;
                        sens2_2014 = 0;
                        sens3_2014 = 0;
                end;
       else if state = "Washington" and name in ("Clark County","Columbia
County","Franklin County","Garfield County",
                                       174


                 "Kittitas County","Lewis County","Pierce County","Wahkiakum
County","Walla Walla County","Yakima County")
                 then do;
                          RCL_2012 = 1;
                          LAG_2014 = .;
                          sens_2014 = 0;
                          sens2_2014 = 0;
                          sens3_2014 = .;
                 end;
        else if state = "Washington" and name not in
("Clark","Columbia","Franklin","Garfield",
                 "Kittias","Lewis","Pierce","Wahkiakum","Walla Walla","Yakima")
                 then do;
                          RCL_2012 = 1;
                          LAG_2014 = .;
                          sens_2014 = 1;
                          sens2_2014 = 1;
                          sens3_2014 = .;
                                          175


         end;
         else if state = "Alaska"
         then do;
                  RCL_2012 = 1;
                  LAG_2014 = .;
                  sens_2014 = 1;
                  sens2_2014 = .;
                  sens3_2014 = .;
         end;
else if agg = 1 then do;
         RCL_2012 = .;
         LAG_2014 = .;
         sens_2014 = .;
         sens2_2014 = .;
         sens3_2014 = .;
end;
                                  176


        label RCL_2012 = "County has at least one municipality that will allow
recreational cannabis sales (year of vote)"
                  LAG_2014 = "County has at least one municipality where voters
decided cannabis can be legally sold AND Local authority was granted to those voters
(year of vote), washington coded missing"
                  sens_2014 = "County has at least one municipality where voters
decided cannabis can be legally sold AND Local authority was granted to those voters
(year of vote), washington coded using zoning bans"
                  sens2_2014 = "County has at least one municipality where voters
decided cannabis can be legally sold AND Local authority was granted to those voters
(year of vote), washington coded using zoning bans, alaska missing"
                  sens3_2014 ="County has at least one municipality where voters
decided cannabis can be legally sold AND Local authority was granted to those voters
(year of vote), washington and alaska missing";
run;
proc freq data=full_set3;
        title4'check for any unexpected missingness or duplicates';
        table incensus*incross*insae*inelection / list missing;
                                       177


run;
proc freq data=full_set3;
       title4'check RCL 2014 coding';
       table state*rcl_2012*LAG_2014*sens_2014*sens2_2014*sens3_2014 / list
missing;
run;
proc freq data=full_set3;
       title4'missing names, fips, census, election, and cross walk should all be
aggregate obs';
       table agg*qname/ list missing;
       where incensus=0 or inelection=0 or incross=0 or insae=0;
run;
/*OK: aggregate areas, mass and conn counties*/
/*range check*/
data merge_check2;
                                       178


set full_set3;
a1=Twelve_plus_PYAlcDepPrevestimate<Twelve_plus_PYAlcDepPrevci_lower;
a2=Twelve_plus_PYAlcDepPrevestimate>Twelve_plus_PYAlcDepPrevci_upper;
b1=Twelve_plus_PMAlcPrevestimate<Twelve_plus_PMAlcPrevci_lower;
b2=Twelve_plus_PMAlcPrevestimate>Twelve_plus_PMAlcPrevci_upper;
c1=Twelve_plus_PMCigPrevestimate<Twelve_plus_PMCigPrevci_lower;
c2=Twelve_plus_PMCigPrevestimate>Twelve_plus_PMCigPrevci_upper;
d1=Twelve_plus_PMMJPrevestimate<Twelve_plus_PMMJPrevci_lower;
d2=Twelve_plus_PMMJPrevestimate>Twelve_plus_PMMJPrevci_upper;
e1=Twelve_plus_PMTobPrevestimate<Twelve_plus_PMTobPrevci_lower;
e2=Twelve_plus_PMTobPrevestimate>Twelve_plus_PMTobPrevci_upper;
f1=Twelve_plus_PYAUDPrevestimate<Twelve_plus_PYAUDPrevci_lower;
f2=Twelve_plus_PYAUDPrevestimate>Twelve_plus_PYAUDPrevci_upper;
g1=Twelve_plus_PYAlcDepPrevestimate<Twelve_plus_PYAlcDepPrevci_lower;
g2=Twelve_plus_PYAlcDepPrevestimate>Twelve_plus_PYAlcDepPrevci_upper;
h1=Twelve_plus_PYCocPrevestimate<Twelve_plus_PYCocPrevci_lower;
h2=Twelve_plus_PYCocPrevestimate>Twelve_plus_PYCocPrevci_upper;
i1= Twelve_plus_PYMJIncestimate<Twelve_plus_PYMJIncci_lower;
                            179


       i2= Twelve_plus_PYMJIncestimate>Twelve_plus_PYMJIncci_upper;
       j1=Twelve_plus_PYMJPrevestimate<Twelve_plus_PYMJPrevci_lower;
       j2=Twelve_plus_PYMJPrevestimate>Twelve_plus_PYMJPrevci_upper;
       k1=Twelve_plus_PYSUDPrevestimate<Twelve_plus_PYSUDPrevci_lower;
       k2=Twelve_plus_PYSUDPrevestimate>Twelve_plus_PYSUDPrevci_upper;
run;
proc freq data=merge_check2;
       title4'1 indicates where estimate is out of bounds';
       table a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 f1 f2 g1 g2 h1 h2 i1 i2 j1 j2 k1 k2/list
missing;
run;
proc means data=full_set3 nmiss;
       title4'Check Missingness in NSDUH data';
       var twelve: eighteen: twenty: ;
run;
/*smaller age groups are often suppressed, and overlapping age categories make no
sense to include in prediction
                                       180


use only the twelve plus variables + 18+ mental health variables (MDE, SMI, AMI,
suicidal thoughts) because these variable aren't available for indiviudals under 18*/
/*save this version of dataset with local authority coding for easy access */
data sav.full_set_LAG;
        set full_set3;
        where agg ~=1;
        drop agg: incensus inelection incross insae;
run;
proc contents data=sav.full_set_LAG ;
run;
proc sort data=sav.full_set_LAG;
        by state;
run;
proc print data= sav.full_set_lag;
        var name rcl_2012 lag_2014 sens_2014 sens2_2014 sens3_2014;
                                        181


       where state in ("Alaska", "Colorado", "Oregon", "Washington");
       by state;
run;
/*Need a table 1*/
proc means data=sav.full_set_LAG;
       title3'table 1';
       class lag_2014;
       var T001_001 t003_002 t003_003 t011_002 t011_003 t011_004 t011_005
                T055_003 T055_004 T055_005 T055_006 T055_007 T055_008 T055_009
T055_010
                eighteen_plus_pysmiprevestimate eighteen_plus_pystprevestimate
percentdem_2012_votes percentrep_2012_votes
                twelve_plus_pmalcprevestimate twelve_plus_pmcigprevestimate
twelve_plus_pmmjprevestimate twelve_plus_pyaudprevestimate
                twelve_plus_pycocprevestimate twelve_plus_pysudprevestimate;
run;
                                       182


/**/
/*proc sql noprint ;*/
/*       select name into :droplist separated by ' '*/
/*       from contents*/
/*       where name like '%upp%' or name like '%upper' */
/*               or name like '%se' or name like '%lower' */
/*               or name like '%estimate' escape '^';*/
/*quit;*/
/**/
/*%put=&droplist;*/
/**/
/*data trim_set ;*/
/* set sav.full_set (drop=&droplist);*/
/* where agg ~=1;*/
/* drop agg: incensus inelection incross insae twelve_to_: twenty: eighteen_to:*/
/*               Eighteen_plus_PMAlcPrevL Eighteen_plus_PMAlcPrevsel*/
/*               Eighteen_plus_PMCigPrevL Eighteen_plus_PMCigPrevsel*/
/*               Eighteen_plus_PMMJPrevL Eighteen_plus_PMMJPrevsel*/
                                         183


/*                Eighteen_plus_PMTobPrevL Eighteen_plus_PMTobPrevsel*/
/*                Eighteen_plus_PYAUDPrevL Eighteen_plus_PYAUDPrevsel*/
/*                Eighteen_plus_PYAlcDepPrevL Eighteen_plus_PYAlcDepPrevL_lowe*/
/*                Eighteen_plus_PYAlcDepPrevci_low Eighteen_plus_PYAlcDepPrevestima
Eighteen_plus_PYAlcDepPrevsel*/
/*                Eighteen_plus_PYCocPrevL Eighteen_plus_PYCocPrevsel*/
/*                Eighteen_plus_PYMJIncL Eighteen_plus_PYMJIncsel*/
/*                Eighteen_plus_PYMJPrevL Eighteen_plus_PYMJPrevsel */
/*                Eighteen_plus_PYSUDPrevL Eighteen_plus_PYSUDPrevsel ;*/
/*run;*/
/*save a trimmer set without excess nsduh variables (keep only L and SEL nsduh vars
for bootstrapping in R), aggregate areas, qc variables*/
/*data sav.trim_set;*/
/*       set trim_set;*/
/*run;*/
/**/
/*proc contents data=sav.full_set;*/
/*       title4'check final contents';*/
                                         184


/*run;*/
/**/
/*proc contents data=sav.trim_set;*/
/*       title4'check final contents';*/
/*run;*/
/*make a version for appendix where voter variables are binaries*/
proc print data=sav.full_set_LAG;
         title3 'Do 3rd party voters outnumber dems or reps in any county?';
         var county;
         where (percentother_2012_votes > percentrep_2012_votes)
         or (percentother_2012_votes > percentdem_2012_votes);
run;
data voter_bins;
         set sav.full_set_LAG;
         dembin = 0;
                                         185


       repbin = 0;
       if percentdem_2012_votes > percentrep_2012_votes then dembin=1;
       else if percentrep_2012_votes > percentdem_2012_votes then repbin = 1;
run;
proc freq data=voter_bins;
       table dembin*repbin/list missing;
run;
/*export for r*/
/*proc export data=sav.full_set_LAG*/
/* outfile="C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Processed\County_RCL_LAG.csv"*/
/* dbms=csv*/
/* replace;*/
                                      186


/*run;*/
/**/
/*proc export data=voter_bins*/
/* outfile="C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Processed\voter_bins.csv"*/
/* dbms=csv*/
/* replace;*/
/*run;*/
proc sql;
         title4'Counts in full set';
         select count(distinct fips_pch) as uniq_counties, count (distinct geography) as
uniq_nsduh_sa
         from sav.full_set;
quit;
proc sql;
         title4'Counts in LAG';
                                          187


       select count(distinct fips_pch) as uniq_counties, count (distinct geography) as
uniq_nsduh_sa
       from sav.full_set_LAG;
quit;
proc sql;
       title4'rcl counties per state: rcl_2012';
       select state, count(distinct(name))
       from sav.full_set_LAG
       where rcl_2012 in (1,.)
       group by state;
quit;
proc sql;
       title4'rcl counties per state: lag_2014';
       select state, count(distinct(name)), count(lag_2014)
       from sav.full_set_LAG
       where LAG_2014 in (1,.)
                                          188


      group by state;
quit;
proc sql;
      title4'rcl counties per state: sens_2014';
      select state, count(distinct(name)), count(sens_2014)
      from sav.full_set_LAG
      where sens_2014 in (1,.)
      group by state;
quit;
proc sql;
      title4'rcl counties per state: sens2_2014';
      select state, count(distinct(name)), count(sens2_2014)
      from sav.full_set_LAG
      where sens2_2014 in (1,.)
      group by state;
quit;
                                        189


proc sql;
        title4'rcl counties per state: sens3_2014';
        select state, count(distinct(name)), count(sens3_2014)
        from sav.full_set_LAG
        where sens3_2014 in (1,.)
        group by state;
quit;
proc sql;
        title4'check colorado rcl counties for mismatches in data step or missing
counties';
        select distinct name
        from sav.full_set
        where RCL_2014=1 and state="Colorado";
quit;
/*no text mismatches*/
                                          190


proc print data=sav.full_set noobs n;
        title4'What does NC end up looking like?';
        var name geography incensus inelection incross insae Twelve:;
        where geography contains "North Carolina";
run;
title3"Step 5: Model legalization at county level";
proc contents data = sav.full_set out=varnames (keep=name) ;
run;
proc freq data=sav.full_set;
        title4'hierarchical variables, along with states';
        table division division_n region / list missing;
run;
data predictorvars;
        set varnames;
                                          191


        if name in ('COUNTY', 'QName', 'agg', 'agg_region', 'agg_state', 'fips_pch',
'geography',
                               'incensus', 'incross', 'inelection', 'insae', 'name',
'state_fips') then delete;
run;
proc sql noprint;
        select trim(name) into : predictor_list separated by " "
        from predictorvars;
quit;
%put &predictor_list;
/*Too much missing data in these predictors for model convergence
        Missing data analysis
break down variables by source data to determine missing data patterns*/
proc sql noprint;
        select trim(name) into : censusvars separated by " "
                                         192


       from predictorvars
       where index(name,"T0") > 0 or name in ('area_land', 'area_water');
       select trim(name) into : nsduhvars separated by " "
       from predictorvars
       where index(name,"Eighteen") > 0 or index(name,"Twelve")>0 or
index(name,"Twenty")>0 ;
       select trim(name) into : votervars separated by " "
       from predictorvars
       where index(name,"dem") > 0 or index(name,"rep")>0 or index(name,"other")>0
or index(name,"total")>0 ;
quit;
%put &censusvars;
%put &nsduhvars;
%put &votervars;
proc means data = sav.full_set n nmiss;
                                       193


       title4'Check all numeric variables from census for missingness';
       var &censusvars;
       where agg=0;
run;
/*no census data missing, beautiful*/
proc means data = sav.full_set n nmiss;
       title4'Check all numeric variables from election data for missingness';
       var &votervars;
       where agg=0;
run;
/*no voter data missing with the Alaskan imputations, beautiful*/
proc means data = sav.full_set n nmiss;
       title4'Check all numeric variables from nsduh data for missingness';
       var &nsduhvars;
       where agg=0;
run;
                                       194


/*NSDUH SAEs unsurprisingly the limiting factor due to data suppression issues.*/
data full_set2;
       set sav.full_set;
       if missing(Twenty_Six_or_O_PYMJInc) then miss=1;
       else miss=0;
run;
proc freq data=full_set2;
       table miss / list missing;
run;
proc means data=full_set2 n nmiss;
       title4'Check all numeric variables from nsduh data for missingness';
       var &nsduhvars;
       where agg=0 and miss=0;
run;
                                       195


proc corr data=full_set2 plots=scatter;
       title4'Check to see if data is missing at random';
       var miss T001_001;
run;
/*Cannot argue this is missing at random
as expected, more missingness where there is less population*/
proc sql noprint;
       select trim(name) into : predictor_list separated by " "
       from predictorvars;
quit;
ods graphics on;
proc means data=sav.full_set;
       var Twelve_or_Older:;
run;
                                        196


/*ALL MODELLING COMPLETED IN R*/
/*******************************************************************************************************
* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE validVarName=any;
title1'Dissertation';
title2'Aim 1: Legalization Prediction';
/****************************************************************************************
* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored                 *
* &ProgName             – the name of the .egp file, without the extension                           *
* &ProgNode            - the name of the code node                                          *
                                                 197


* &ProgDir          – the path to the folder in which the .egp file is stored.             *
*****************************************************************************************/
/*****************************************************************************
* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                         *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax                   *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
** PROGRAMMER'S NAME ;
%LET PROGRAMMER                   = Barrett Montgomery;
** LIST ALL SUBDIRECTORIES CALLED IN THE LIBNAME STATEMENT ;
** THESE CAN BE LEFT BLANK IF NOT NEEDED OR USED ;
%LET USEDIR1             = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\raw;
                                                 198


** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\processed;
** THE NUMBERING SCHEME IS SAVEFILx_y WHERE X IS THE NUMBER OF THE
SAVEDIR AND ;
** Y IS THE NUMBER OF THE FILE WITHIN IT ;
** THIS SHOULD GENERALLY BE SET TO EITHER &PROGNAME OR &PROGNODE ;
%LET SAVEFIL1_1 = ;
** NAME FORMAT LIBRARY DIRECTORY ;
%LET FMTDIR       =;
** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
                                   199


%LET PURPOSE1 =;
/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
LIBNAME USE "&USEDIR1";
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
title3"Step 6: Data visualizations";
proc import datafile= "C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/Processed/ens_lag.csv" dbms=csv out=fig replace ;
         GUESSINGROWS=3000;
run;
                                                 200


proc import datafile= "C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/Processed/ens_sens1.csv" dbms=csv out=sens_fig replace ;
         GUESSINGROWS=3000;
run;
/*Prior analysis shows these two methods of coding outcomes makes prediction
harder*/
/*proc import datafile= "C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/Processed/ens_sens2.csv" dbms=csv out=sens2_fig replace ;*/
/*       GUESSINGROWS=3000; */
/*run;*/
/**/
/*proc import datafile= "C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/Processed/ens_sens3.csv" dbms=csv out=sens3_fig replace ;*/
/*       GUESSINGROWS=3000; */
/*run;*/
proc freq data=fig;
                                      201


        title4'check missing';
        table county / list;
        where PHT = .;
run;
proc contents data=fig;
        title4'Check var names need changed';
run;
data fig1;
        set fig;
        length name $500;
        array thresh[9] thresh1-thresh9;
        thresh1=.1;
        thresh2=.2;
        thresh3=.3;
        thresh4=.4;
        thresh5=.5;
                                       202


thresh6=.6;
thresh7=.7;
thresh8=.8;
thresh9=.9;
array PHT_Y [9] PHT_Y1-PHT_y9;
array YHT_Y [9] YHT_Y1-YHT_y9;
array TNF_Y [9] TNF_Y1-TNF_y9;
array TPF_Y [9] TPF_Y1-TPF_y9;
array FNF_Y [9] FNF_Y1-FNF_y9;
array acc_Y [9] acc_Y1-acc_y9;
do i=1 to 9;
        PHT_Y[i]=PHT>thresh[i];
        YHT_Y[i]=YHT>thresh[i];
        TNF_Y[i]=TNF>thresh[i];
        TPF_Y[i]=TPF>thresh[i];
        FNF_Y[i]=FNF>thresh[i];
        acc_Y[i]=acc>thresh[i];
                                203


       end;
       name = county;
       prediction = (acc>.3333);
       rename var1=county_num;
       label Y                 ="Actual county policy, legalize cannabis with local
control"
                 N             = 'number of times county appeared in sims'
                 county        = "County name"
                 prediction= 'Binary prediction for the default model, accuracy weighted
probabilities with the prevalence based cutoff'
                 YHT           = "Average predicted probability of county having legal
cannabis w/ local control weighted by hard-coded Y based on proportion of RCL
counties"
                 PHT           = "Average predicted probability of county having legal
cannabis w/ local control"
                 TNF           = "Average predicted probability of county having legal
cannabis w/ local control weighted by specificity of each simulation"
                                        204


                TPF           = "Average predicted probability of county having legal
cannabis w/ local control weighted by sensitivity of each simulation"
                FNF           = "Average predicted probability of county having legal
cannabis w/ local control weighted by 1-specificity of each simulation"
                acc           = "Average predicted probability of county having legal
cannabis w/ local control weighted by overall accuracy of each simulation";
               drop i thresh: county;
run;
data sens_fig1;
       length name $500;
       set sens_fig;
       array thresh[9] thresh1-thresh9;
       thresh1=.1;
       thresh2=.2;
       thresh3=.3;
       thresh4=.4;
       thresh5=.5;
                                      205


thresh6=.6;
thresh7=.7;
thresh8=.8;
thresh9=.9;
array sens_PHT_Y [9] sens_PHT_Y1-sens_PHT_y9;
array sens_YHT_Y [9] sens_YHT_Y1-sens_YHT_y9;
array sens_TNF_Y [9] sens_TNF_Y1-sens_TNF_y9;
array sens_TPF_Y [9] sens_TPF_Y1-sens_TPF_y9;
array sens_FNF_Y [9] sens_FNF_Y1-sens_FNF_y9;
array sens_acc_Y [9] sens_acc_Y1-sens_acc_y9;
do i=1 to 9;
        sens_PHT_Y[i]=PHT>thresh[i];
        sens_YHT_Y[i]=YHT>thresh[i];
        sens_TNF_Y[i]=TNF>thresh[i];
        sens_TPF_Y[i]=TPF>thresh[i];
        sens_FNF_Y[i]=FNF>thresh[i];
        sens_acc_Y[i]=acc>thresh[i];
                              206


      end;
      name = county;
      sens_prediction = (acc>.3333);
      rename var1=sens_county_num
                    Y = sens_Y
                    acc = sens_acc
                    N = sens_N
                    YHT= sens_YHT
                    PHT=sens_PHT
                    TNF=sens_TNF
                    TPF=sens_TPF
                    FNF=sens_FNF;
label  Y            ="Actual county policy, legalize cannabis with local control"
              N            = 'number of times county appeared in sims'
              county       = "County name"
                                     207


                sens_prediction= 'Binary prediction for the default model, accuracy
weighted probabilities with the prevalence based cutoff'
                YHT           = "Average predicted probability of county having legal
cannabis w/ local control weighted by hard-coded Y based on proportion of RCL
counties"
                PHT           = "Average predicted probability of county having legal
cannabis w/ local control"
                TNF           = "Average predicted probability of county having legal
cannabis w/ local control weighted by specificity of each simulation"
                TPF           = "Average predicted probability of county having legal
cannabis w/ local control weighted by sensitivity of each simulation"
                FNF           = "Average predicted probability of county having legal
cannabis w/ local control weighted by 1-specificity of each simulation"
                acc           = "Average predicted probability of county having legal
cannabis w/ local control weighted by overall accuracy of each simulation";
               drop county i thresh:;
run;
proc freq data=fig1;
       title4'Check binary variable creation works';
                                       208


       table pht * PHT_Y1 * PHT_Y2 * PHT_Y3 * PHT_Y4 * PHT_Y5 * PHT_Y6 * PHT_Y7
* PHT_Y8 * PHT_Y9 / list missing;
       table acc*prediction/list missing;
       table prediction*Y/list missing senspec;
run;
proc freq data=sens_fig1;
       title4'Check binary variable creation works for sensitivty data';
       table sens_pht * sens_PHT_Y1 * sens_PHT_Y2 * sens_PHT_Y3 * sens_PHT_Y4 *
sens_PHT_Y5 * sens_PHT_Y6 * sens_PHT_Y7 * sens_PHT_Y8 * sens_PHT_Y9 / list
missing;
run;
proc contents data=fig1;
       title4"Contents of predictions data";
run;
proc freq data=fig1;
       title4'Prevalence of Actual County level RCLs';
                                        209


       table Y /list missing;
run;
proc freq data=sens_fig1;
       title4'Prevalence of Actual County level RCLs in sens analysis';
       table sens_Y /list missing;
run;
/*check dona ana county new mexico in all sets*/
proc print data=fig1;
       var name county_num;
       where name contains "New Mexico";
run;
/*county_num = 785*/
proc print data=sens_fig1;
       var name sens_county_num;
                                      210


        where name contains "New Mexico";
run;
/*sens_county_num = 793*/
data fig2;
        set fig1;
        if county_num = 785 then name="Dona Ana County, New Mexico";
run;
data sens_fig2;
        set sens_fig1;
        if sens_county_num = 793 then name="Dona Ana County, New Mexico";
run;
/*MAPPING*/
/*need to add fips codes to the county names
                add from my own data file
                                      211


                 check mismatches between names
rename dona ana to match in both files*/
data fips;
        length name $500;
        set sav.full_set_LAG (keep=qname fips_pch state rcl_2012 LAG_2014
sens_2014 sens2_2014 sens3_2014);
        fips_pch = strip(fips_pch);
        if fips_pch = "35013" then qname="Dona Ana County, New Mexico";
        name = qname;
        drop qname ;
run;
proc print data=fips;
        title3"Check Dona Ana in both sets for match";
                                      212


       where state = "New Mexico";
       var _char_;
run;
proc print data=fig2;
       var name;
       where name contains ("New Mexico");
run;
proc print data=sens_fig2;
       var name;
       where name contains ("New Mexico");
run;
proc sort data=fig2;
       by name;
run;
                                   213


proc sort data=sens_fig2;
        by name;
run;
proc sort data=fips;
        by name;
run;
data fig_wide3;
        merge fig2 (in=in1x) sens_fig2 (in=in2x) fips (in=in3x) ;
        by name;
        in1=in1x;
        in2=in2x;
        in3=in3x;
run;
                                        214


proc freq data=fig_wide3;
       title4'Look at merge observations';
       table in1*in2*in3 / list missing;
run;
proc freq data=fig_wide3 ;
       title4'Look at merge observations that do not appear in all';
       table in1*in2*in3*state /list missing ;
       where in1=0 or in2=0 ;
run;
proc sql;
       title3'States where False Positives Appear';
       select distinct state
       from fig_wide3
       where y=0 and prediction=1;
quit;
                                         215


proc freq data=fig_wide3;
        title3'confusion matrix';
        table y *prediction /list missing;
        where y~=.;
run;
/*                      y
                        0        1
pred    0 2721 0
                 1 281 92
total 3094
*/
proc freq data=fig_wide3;
        title3'confusion matrix including WA';
        table sens_y *sens_prediction /list missing;
        where sens_y~=.;
run;
                                          216


/*                      y
                        0        1
pred    0 2718 2
                1 294 119
total 3133
*/
/*/*Create table for sensitivity analyses*/*/
/*ods excel file="C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Results\weights and cutpoints raw tables.xlsx";*/
/*proc freq data=fig_wide3;*/
/*      title3'Table of sensitivity and speciifcity for each weighted probability estimator
at each hard cut-point';*/
/*      table Y*pht_Y1 Y*pht_Y2 Y*pht_Y3 Y*pht_Y4 Y*pht_Y5 Y*pht_Y6 Y*pht_Y7
Y*pht_Y8 Y*pht_Y9/list missing ;*/
/*      table Y*yht_Y1 Y*yht_Y2 Y*yht_Y3 Y*yht_Y4 Y*yht_Y5 Y*yht_Y6 Y*yht_Y7
Y*yht_Y8 Y*yht_Y9/list missing ;*/
                                          217


/*       table Y*acc_Y1 Y*acc_Y2 Y*acc_Y3 Y*acc_Y4 Y*acc_Y5 Y*acc_Y6 Y*acc_Y7
Y*acc_Y8 Y*acc_Y9/list missing ;*/
/*       table Y*tpf_Y1 Y*tpf_Y2 Y*tpf_Y3 Y*tpf_Y4 Y*tpf_Y5 Y*tpf_Y6 Y*tpf_Y7 Y*tpf_Y8
Y*tpf_Y9/list missing ;*/
/*       table Y*tnf_Y1 Y*tnf_Y2 Y*tnf_Y3 Y*tnf_Y4 Y*tnf_Y5 Y*tnf_Y6 Y*tnf_Y7 Y*tnf_Y8
Y*tnf_Y9/list missing ;*/
/*       table Y*fnf_Y1 Y*fnf_Y2 Y*fnf_Y3 Y*fnf_Y4 Y*fnf_Y5 Y*fnf_Y6 Y*fnf_Y7 Y*fnf_Y8
Y*fnf_Y9/list missing ;*/
/*       where Y ~=.;*/
/*run;*/
/*ods excel close;*/
/**/
/*ods excel file="C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
1\Results\weights and cutpoints raw tables sens.xlsx";*/
/*proc freq data=fig_wide3;*/
/*       title3'Table of sensitivity and speciifcity for each weighted probability estimator
at each hard cut-point';*/
/*       table sens_Y*sens_pht_Y1 sens_Y*sens_pht_Y2 sens_Y*sens_pht_Y3
sens_Y*sens_pht_Y4 sens_Y*sens_pht_Y5 sens_Y*sens_pht_Y6 sens_Y*sens_pht_Y7
sens_Y*sens_pht_Y8 sens_Y*sens_pht_Y9/list missing ;*/
                                           218


/*       table sens_Y*sens_yht_Y1 sens_Y*sens_yht_Y2 sens_Y*sens_yht_Y3
sens_Y*sens_yht_Y4 sens_Y*sens_yht_Y5 sens_Y*sens_yht_Y6 sens_Y*sens_yht_Y7
sens_Y*sens_yht_Y8 sens_Y*sens_yht_Y9/list missing ;*/
/*       table sens_Y*sens_acc_Y1 sens_Y* sens_acc_Y2 sens_Y* sens_acc_Y3 sens_Y*
sens_acc_Y4 sens_Y* sens_acc_Y5 sens_Y* sens_acc_Y6 sens_Y*sens_acc_Y7
sens_Y* sens_acc_Y8 sens_Y*sens_acc_Y9/list missing ;*/
/*       table sens_Y*sens_tpf_Y1 sens_Y* sens_tpf_Y2 sens_Y* sens_tpf_Y3 sens_Y*
sens_tpf_Y4 sens_Y* sens_tpf_Y5 sens_Y* sens_tpf_Y6 sens_Y*sens_tpf_Y7
sens_Y*sens_tpf_Y8 sens_Y*sens_tpf_Y9/list missing ;*/
/*       table sens_Y*sens_tnf_Y1 sens_Y* sens_tnf_Y2 sens_Y* sens_tnf_Y3 sens_Y*
sens_tnf_Y4 sens_Y* sens_tnf_Y5 sens_Y* sens_tnf_Y6 sens_Y*sens_tnf_Y7
sens_Y*sens_tnf_Y8 sens_Y*sens_tnf_Y9/list missing ;*/
/*       table sens_Y*sens_fnf_Y1 sens_Y* sens_fnf_Y2 sens_Y* sens_fnf_Y3 sens_Y*
sens_fnf_Y4 sens_Y* sens_fnf_Y5 sens_Y* sens_fnf_Y6 sens_Y*sens_fnf_Y7
sens_Y*sens_fnf_Y8 sens_Y*sens_fnf_Y9/list missing ;*/
/*       where sens_Y ~=.;*/
/*run;*/
/*ods excel close;*/
/*maps can be made from this internal counties dataset*/;
                                      219


data counties;
       set maps.uscounty;
       length fips_pch $ 5 stfips_pch $ 2 cnfips_pch $3;
       where state <72; /*remove puerto rico*/
  stfips_pch = put(state, z2.);
       cnfips_pch = put(county, z3.);
       fips_pch = strip(stfips_pch)||strip(cnfips_pch); /* (pch = suffix for padded
character) */
       label stfips_pch = "State Fips, padded character"
                 cnfips_pch = "County Fips, padded character"
                 fips_pch = "concatednated state and county fips, padded character";
run;
proc freq data=counties;
                                         220


        title3'Check fips pch creation';
        table state*stfips_pch county*cnfips_pch stfips_pch*cnfips_pch*fips_pch/list
missing;
run;
proc print data=counties;
        title3'check dona ana county in new mexico';
        where fips_pch="35013";
run;
proc contents data=counties;
        title3'check to see if coordinates are projected and how';
run;
/*already projected*/
/*actions to make map work before merge
in counties
                                         221


        remove 15005 - Kalawao Hawaii previously removed
        remove 30113 - Yellowstone National Park
  Halifax county, Virginia (FIPS code 51083) includes the population of the former
independent city South Boston city, Virginia (FIPS code 51780)
               recode 51780 as 51083
in fig_wide
        02105 and 02230 need to use shapefile data from 02232
               replace those fips codes with 02232
        02280 was split to create part of 02275 and all of 02195
        02275 was created from part of 02280 and part of 02201
        02198 was created from the remainder of the former 02201
        02280 > 02275          02280 > 02195
        02201 02198
                                        222


                 duplicate 02275 and give 02280 02201
        51560, formerly an independent city, merged with Alleghany county (FIPS code
51005)
                 recode 51560 as 51005
   Broomfield County, Colorado (FIPS code 08014), formed on 11-15-2001. Population
data is not available for Broomfield county, Colorado.
                 drop 08014
THESE CHANGES ARE ONLY TO MAKE MAPS FUNCTIONAL, OTHER ANALYSIS
SHOULD BE COMPELTED USING PRIOR DATASETS
*/
data counties2;
        set counties ;
        where fips_pch not in('15005', '30113');
        if fips_pch = '51780' then fips_pch = '51083';
        if fips_pch = '51560' then fips_pch = '51005';
                                        223


run;
data dupe;
        set fig_wide3 (where=(fips_pch='02275'));
        fips_pch='02280';
run;
data fig_wide4;
        set fig_wide3 dupe;
        where fips_pch not in('08014', '02195', '02198');
        if fips_pch in ('02105', '02230') then fips_pch = '02232';
        if fips_pch = '02275' then fips_pch = '02201';
  /*default outcomes for lag 2014*/
        map_lag = lag_2014;
                                          224


if lag_2014=. then do;
         map_lag=2;
         prediction=2;
end;
if lag_2014 = 1 and prediction=1 then outcome="TP";
else if lag_2014 = 0 and prediction=0 then outcome="TN";
else if lag_2014 = 1 and prediction=0 then outcome="FN";
else if lag_2014 = 0 and prediction=1 then outcome="FP";
if prediction = . then prediction=2;
/* outcomes for sens 2014*/
map_sens = sens_2014;
if sens_2014=. then do;
                                 225


                map_sens=2;
                sens_prediction=2;
       end;
       if sens_2014 = 1 and sens_prediction=1 then sens_outcome="TP";
       else if sens_2014 = 0 and sens_prediction=0 then sens_outcome="TN";
       else if sens_2014 = 1 and sens_prediction=0 then sens_outcome="FN";
       else if sens_2014 = 0 and sens_prediction=1 then sens_outcome="FP";
       if sens_prediction = . then sens_prediction=2;
       label lag_2014 = "County has at least one municipality where voters decided to
keep cannabis sales legal"
                sens_2014 = "County has at least one municipality where voters decided
to keep cannabis sales legal, including Washington"
                outcome = "Confusion matrix classification, unweighted probabilities
with cut-off=.4"
                sens_outcome = "Confusion matrix classification, unweighted
probabilities cut-off=.4, including Washington";
                                       226


run;
proc freq data=fig_wide4;
       title3"check missing";
       table state;
       where sens_prediction = 2 or prediction =2;
run;
proc freq data=fig_wide4;
       title3"check missing";
       table y lag_2014/list missing;
run;
proc sort data=fig_wide3;
       by state fips_pch;
run;
proc print data= fig_wide3 noobs n;
                                      227


       by state;
       var fips_pch;
       where state in ('Alaska', 'Colorado');
run;
proc sort data=fig_wide4;
       by state fips_pch;
run;
proc print data= fig_wide4 noobs n;
       by state;
       var fips_pch;
       where state in ('Alaska', 'Colorado');
run;
proc print data= fig_wide4;
       title3'confirm duping worked';
       where fips_pch in ('02275', '02201', '02280');
                                        228


run;
proc freq data=fig_wide4;
       title3'Check new variable creation';
       table lag_2014 * map_lag outcome*lag_2014*prediction /list missing;
       table sens_2014 * map_sens sens_outcome*sens_2014*sens_prediction/list
missing;
run;
proc sort data=counties2;
       by fips_pch;
run;
proc sort data=fig_wide4;
       by fips_pch;
run;
data county_pred_map;
                                       229


       merge counties2 (in=in1x ) fig_wide4(in=in2x drop=state);
       by fips_pch;
       in_shp = in1x;
       in_dat = in2x;
run;
/*check data for any problems*/
proc contents data=county_pred_map;
       title3'check merged data contents for any problems';
run;
proc freq data=county_pred_map;
       title3'check that all obs merged';
       table in_dat*in_shp/list missing;
run;
proc print data=county_pred_map;
       title3'check that all obs merged (should not print)';
                                       230


       where in_shp = 0;
run;
proc print data=county_pred_map;
       title3'print data for the recoded fips records';
       var name x y pht ;
       where fips_pch in ('51780','51083','51005','02201', '02280', '02232', "35013");
run;
proc print data=county_pred_map;
       title3'fips codes missing from maps.uscounty (should not print)';
       var name fips_pch;
       where state=.;
run;
proc freq data=county_pred_map;
       title3'where are missaing values from confusion matrix';
       table state*lag_2014*pht/ list missing;
                                        231


       where lag_2014 = . or pht=.;
       format state state.;
run;
proc contents data=county_pred_map;
run;
data sav.full_map_set;
       set county_pred_map;
       drop in_dat in_shp in1 in2 in3;
run;
/*make map layer for thicker state borders
Remove obs that do not apply to polygon borders*/
proc gremove data=maps.uscounty out=anno_outline;
       by state notsorted;
       id county;
                                       232


run;
proc gmap map=anno_outline data=anno_outline;
        id state;
        choro segment / levels=1 stat=first nolegend coutline=grayaa;
run;
/* Create annotate dataset for diff state borders*/
data state_outline;
        set anno_outline;
        by state segment notsorted ;
        length function $8 color $8;
        color='gray33'; style='mempty'; when='a'; xsys='2';
        ysys='2';
        if first.segment then function='poly';
        else function='polycont';
run;
                                         233


* graphics options to be used for all maps;
goptions reset=all ftext='calibri' htext=2;
/* Specify policy formats for mapping */
proc format ;
        value lag 0 = "Recreational cannabis illegal"
                               1 = "Locals allow recreational cannabis sales"
                               2 = "Cannabis legal, no local control";
run;
goptions colors=(cream darkgreen olive);
/*create map of actual policies*/
proc gmap map=maps.uscounty data=sav.full_map_set anno=state_outline;
        id state county;
        choro map_lag/ coutline=gray discrete stat=first;
        format map_lag lag.;
                                         234


        label map_lag = "Cannabis policies by county, 2014";
run;
quit;
/*create map of predicted policies*/
proc format ;
        value con
                              0 = "Predicted illegal"
                              1 = "Predicted legal"
                              2 = "Data not used";
run;
goptions colors=(cream darkgreen CX142233);
proc gmap map=maps.uscounty data=sav.full_map_set anno=state_outline;
        id state county;
        choro prediction/ coutline=gray discrete stat=first;
        format prediction con.;
                                       235


        label prediction = "Predicted recreational cannabis sale policy in 2014";
        legend1 label=(f=swissb j=c 'Cases') across=1 down=4 frame;
        title h=4 color=black 'Predicted recreational cannabis sales in 2014';
run;
quit;
/*create map of outcome classifiers
        just in Colorado (08)
        Oregon (41)
        Alaska (02)*/
proc format ;
        value $class
                               "TP" = "True Positive"
                               "TN" = "True Negative"
                               "FP" = "False Positive"
                               "FN" = "False Negative"
                               " " = "Data not used";
                                        236


run;
goptions colors=(red orange cyan darkgreen white);
proc gmap map=maps.uscounty (where=(state in (2,8,41))) data=sav.full_map_set
anno=state_outline;
       id state county;
       choro outcome/ coutline=gray discrete stat=first;
       format outcome $class.;
       title ;
       label outcome = "Diagnostic classification for each county, .4 cut-point";
run;
quit;
proc univariate data=fig_wide4;
       var acc;
       histogram acc;
run;
                                     237


proc format;
       value p_hat
       0.435877 - 1 = 'Top 10%'
       0.0811583 - 0.43587699 = '75 - 90th percentile'
       0.0156888 - .081158299 = '50 - 75th percentile'
       0.00420628 - 0.015688799 = '25 - 50th percentile'
       0.0016763 - 0.0042062799 = '10 - 25th percentile'
       0 - 0.001676299 = 'Bottom 10%';
run;
goptions reset=all border colors= (VLIYG LIYG MOYG DAYG VDEYG VDAYG);
proc gmap map=maps.uscounty data=sav.full_map_set anno=state_outline all;
       id state county;
       choro acc/ coutline=gray cdefault=degb statistic=first discrete;
       label acc = "Accuracy weighted model probabilities";
       format acc p_hat.;
run;
quit;
                                       238


proc freq data=sav.map_set_lag_2014;
       title4'states with FPs';
       table state;
       where fg = 0 and weighted_pred3 = 1;
       format state state.;
run;
proc sql;
       title3'Why do i have missing county averages?';
       select distinct(fips_pch) as miss_avg
       from sav.map_set
       where county_average = . and state ~=53;
run;
proc sql;
       title3'Why do i have missing county averages?';
       select distinct(fips_pch) as miss_consensus
                                       239


       from sav.map_set
       where consensus = . and state ~=53;
run;
/*North carolina recodes from waaaay back, same counties i have no SAEs for...*/
proc print data=sav.full_set noobs;
       title3'NC data that I could not get SAEs for';
       var state fips_pch geography incensus incross inelection insae;
       where state = "North Carolina" and insae=0;
run;
/*Maps for Sens analysis*/
goptions colors=(cream darkgreen olive);
proc gmap map=maps.uscounty data=sav.full_map_set anno=state_outline;
       id state county;
       choro map_sens/ coutline=gray discrete stat=first;
       format map_sens lag.;
                                        240


       label map_sens = "Cannabis policies by county, 2014";
run;
quit;
goptions colors=(cream darkgreen CX142233);
proc gmap map=maps.uscounty data=sav.full_map_set anno=state_outline;
       id state county;
       choro sens_prediction/ coutline=gray discrete stat=first;
       format sens_prediction con.;
       label sens_prediction = "Predicted recreational cannabis sale policy in 2014";
       legend1 label=(f=swissb j=c 'Cases') across=1 down=4 frame;
       title h=4 color=black 'Predicted recreational cannabis sales in 2014';
run;
quit;
goptions colors=(red orange cyan darkgreen white);
                                       241


proc gmap map=maps.uscounty (where=(state in (2,8,41, 53))) data=sav.full_map_set
anno=state_outline;
        id state county;
        choro sens_outcome_class3/ coutline=gray discrete stat=first;
        format sens_outcome_class3 $class.;
        title ;
        label sens_outcome_class3 = "Diagnostic classification for each county, .3 cut-
point";
run;
quit;
proc univariate data=fig_wide4;
        var sens_pht;
        histogram sens_pht;
run;
proc format;
        value sens_pht
        0.539441055 - 1 = 'Top 10%'
                                     242


         0.113663725 - .53944105499 = '75 - 90th percentile'
         0.027264315 - 0.11366372499 = '50 - 75th percentile'
         0.009572451 - 0.02726431499 = '25 - 50th percentile'
         0.004093802 - 0.00957245099 = '10 - 25th percentile'
         0 - 0.00409380199 = 'Bottom 10%';
run;
goptions reset=all border colors= (VLIYG LIYG MOYG DAYG VDEYG VDAYG);
proc gmap map=maps.uscounty data=sav.full_map_set anno=state_outline all;
         id state county;
         choro sens_pht/ coutline=gray cdefault=degb statistic=first discrete;
         label sens_pht = "Accuracy weighted probabilities";
         format sens_pht p_hat.;
run;
quit;
/*******************************************************************************************************
* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
                                                 243


* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE ;
title1'Dissertation';
title2'Aim 2: Incidence after leglaization DiD';
/****************************************************************************************
* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored                 *
* &ProgName             – the name of the .egp file, without the extension                           *
* &ProgNode            - the name of the code node                                          *
* &ProgDir          – the path to the folder in which the .egp file is stored.                   *
*****************************************************************************************/
/*****************************************************************************
                                                 244


* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                        *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax         *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
** PROGRAMMER'S NAME ;
%LET PROGRAMMER                   = Barrett Montgomery;
** DEFINE ALL NON-SAS FILES CALLED IN YOUR PROGRAM AS MACRO VARIABLES
by Drug;
** THESE CAN BE LEFT BLANK IF NOT NEEDED OR USED ;
%LET Elg          = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Legal Timeline Categories\ELG;
%LET Rec           = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Legal Timeline Categories\REC;
%LET ElgE          = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Effective Date Categories\ELG;
                                                 245


%LET RecE      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Effective Date Categories\REC;
%LET PMMJ        = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Alt Spec\PMMJ;
** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1        = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Processed;
** NAME FORMAT LIBRARY DIRECTORY ;
%LET FMTDIR         = ;
** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
%LET PURPOSE1 = Difference in difference event study design to estimate the effect
of cannabis legalization on cannabis incidence;
                                       246


/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
/*this macro imports all data in a file with 2 options, the folder directory, and the type of
file
here, folder is saved in a macro variable named above and file type is csv.*/
title1"Download all data, append, and organize";
%macro drive(dir,ext);
   %local cnt filrf rc did memcnt name;
   %let cnt=0;
                                                 247


%let filrf=mydir;
%let rc=%sysfunc(filename(filrf,&dir));
%let did=%sysfunc(dopen(&filrf));
%if &did ne 0 %then %do;
%let memcnt=%sysfunc(dnum(&did));
%do i=1 %to &memcnt;
 %let name=%qscan(%qsysfunc(dread(&did,&i)),-1,.);
 %if %qupcase(%qsysfunc(dread(&did,&i))) ne %qupcase(&name) %then %do;
  %if %superq(ext) = %superq(name) %then %do;
    %let cnt=%eval(&cnt+1);
    %put %qsysfunc(dread(&did,&i));
    proc import datafile="&dir\%qsysfunc(dread(&did,&i))" out=dsn&cnt
     dbms=csv replace;
    run;
                                     248


     %end;
    %end;
   %end;
    %end;
  %else %put &dir cannot be open.;
  %let rc=%sysfunc(dclose(&did));
 %mend drive;
/*read in eligibility data for leglaization dates*/
 %drive(&Elg,csv)
 data dsn1;
        set dsn1;
        years="2008 to 2009";
run;
                                            249


data dsn2;
       set dsn2;
       years="2010 to 2011";
run;
data dsn3;
       set dsn3;
       years="2012 to 2013";
run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
       set dsn5;
       years="2016 to 2017";
                             250


run;
data dsn6;
        set dsn6;
        years="2018 to 2019";
run;
data Elig;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
 eligible = catx('', 'eligible for past year initiatio'n, 'rc-eligible for past year initia'n);
run;
proc freq data=Elig;
        title3'check to see that new var works';
        table eligible*'eligible for past year initiatio'n* 'rc-eligible for past year initia'n/list
missing;
run;
                                           251


/*read in for recmj data*/
%drive(&Rec,csv)
 data dsn1;
        set dsn1;
        years="2008 to 2009";
run;
data dsn2;
        set dsn2;
        years="2010 to 2011";
run;
data dsn3;
        set dsn3;
        years="2012 to 2013";
run;
                              252


data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
       set dsn5;
       years="2016 to 2017";
run;
data dsn6;
       set dsn6;
       years="2018 to 2019";
run;
data Rec;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
                                    253


          recent_initiate = catx('','rc-past year initiate of marijua'n, 'rc-recent initiate of
marijuana'n, 'recent initiate of marijuana use'n);
run;
/*cleaning*/
/*prepare susbets of data where eligibility and rec = yes
to merge and create incidence estimates
 first for legal date then for effective dates*/
data elig1;
          set elig;
          where eligible = "1 - Yes";
          drop row: total: 'eligible for past year initiatio'n 'rc-eligible for past year initia'n
eligible 'Column % CI (lower)'n 'Column % CI (upper)'n;
          rename 'STATE NAME'n = legal_cat;
          rename 'final edited age'n = age;
          rename 'Column %'n = elig_phat;
          rename 'Column % SE'n = elig_phat_SE;
          rename 'Weighted Count'n = elig_count;
                                            254


        rename 'Count SE'n = elig_count_se;
run;
data rec1;
        set rec;
        where recent_initiate = "1 - Yes";
        drop row: total: 'rc-past year initiate of marijua'n 'rc-recent initiate of marijuana'n
'recent initiate of marijuana use'n recent_initiate 'Column % CI (lower)'n 'Column % CI
(upper)'n;
        rename 'STATE NAME'n = legal_cat;
        rename 'final edited age'n = age;
        rename 'Column %'n = rec_phat;
        rename 'Column % SE'n = rec_phat_SE;
        rename 'Weighted Count'n = rec_count;
        rename 'Count SE'n = rec_count_se;
run;
proc sort data=elig1;
                                         255


        by years legal_cat age ;
run;
proc sort data=rec1;
        by years legal_cat age ;
run;
/*estimate incidence by age group, state group, and years*/
data all_data;
        merge elig1 rec1;
        by years legal_cat age ;
        length time year_n 8.;
        if legal_cat="Not_Legalized_" then legal_cat="Illegal";
        incidence = rec_count/elig_count;
/*fixed effects for years as categorical may function differently
        if years was considered a continuous variable, year_n is to check this*/
                                         256


         if years = '2008 to 2009' then year_n = 2008;
         else if years = '2010 to 2011' then year_n = 2010;
         else if years = '2012 to 2013' then year_n = 2012;
         else if years = '2014 to 2015' then year_n = 2014;
         else if years = '2016 to 2017' then year_n = 2016;
         else if years = '2018 to 2019' then year_n = 2018;
         /*need to calculate incidence SE or 95%CIs*/
         label incidence = "Percentage of past year initiates among persons at risk for
initiation"
                         legal_cat = "Legal status of cannabis through 2018"
                         year_n = "Numeric date of data, first year in year-pair";
run;
proc print data=all_data (obs=6);
         title3'check incidence calcs';
                                          257


run;
proc freq data=all_data;
       title3'Check Time recode';
       table year_n*years/list missing;
run;
proc freq data=all_data;
       title3'combinations of legal cat and years to make relative time variable';
       table legal_cat* years/list;
       where legal_cat ~="Overall";
run;
proc means data=all_data;
       title3'average incidence in legal states';
       var incidence;
       where legal_cat not in("Overall", "Illegal");
run;
                                        258


proc means data=all_data;
        title3'average incidence overall';
        var incidence;
        where legal_cat ="Overall" and age="Overall";
run;
proc means data=all_data;
        title3'average incidence 21+';
        var incidence;
        where legal_cat ="Overall" and age="21_Plus";
run;
data all_data2;
        set all_data;
        where legal_cat ~="Overall";
        /*create a time variable relative to year of legalization and year of data
                                          259


                time = how many years away froma states year of legalization is this data
point?*/
       if legal_cat = "Illegal" then time=0;
       else if legal_cat = "Legalized_2012" and years = "2008 to 2009" then time = -4;
       else if legal_cat = "Legalized_2014" and years = "2008 to 2009" then time = -6;
       else if legal_cat = "Legalized_2016" and years = "2008 to 2009" then time = -8;
       else if legal_cat = "Legalized_2018" and years = "2008 to 2009" then time = -10;
       else if legal_cat = "Legalized_2012" and years = "2010 to 2011" then time = -2;
       else if legal_cat = "Legalized_2014" and years = "2010 to 2011" then time = -4;
       else if legal_cat = "Legalized_2016" and years = "2010 to 2011" then time = -6;
       else if legal_cat = "Legalized_2018" and years = "2010 to 2011" then time = -8;
       else if legal_cat = "Legalized_2012" and years = "2012 to 2013" then time = 0;
       else if legal_cat = "Legalized_2014" and years = "2012 to 2013" then time = -2;
       else if legal_cat = "Legalized_2016" and years = "2012 to 2013" then time = -4;
       else if legal_cat = "Legalized_2018" and years = "2012 to 2013" then time = -6;
                                         260


        else if legal_cat = "Legalized_2012" and years = "2014 to 2015" then time = 2;
        else if legal_cat = "Legalized_2014" and years = "2014 to 2015" then time = 0;
        else if legal_cat = "Legalized_2016" and years = "2014 to 2015" then time = -2;
        else if legal_cat = "Legalized_2018" and years = "2014 to 2015" then time = -4;
        else if legal_cat = "Legalized_2012" and years = "2016 to 2017" then time = 4;
        else if legal_cat = "Legalized_2014" and years = "2016 to 2017" then time = 2;
        else if legal_cat = "Legalized_2016" and years = "2016 to 2017" then time = 0;
        else if legal_cat = "Legalized_2018" and years = "2016 to 2017" then time = -2;
        else if legal_cat = "Legalized_2012" and years = "2018 to 2019" then time = 6;
        else if legal_cat = "Legalized_2014" and years = "2018 to 2019" then time = 4;
        else if legal_cat = "Legalized_2016" and years = "2018 to 2019" then time = 2;
        else if legal_cat = "Legalized_2018" and years = "2018 to 2019" then time = 0;
/*create dummy variable for each relative time point*/
        if time=-10 then tminus10 =1; else tminus10=0;
                                        261


       if time=-8 then tminus8 =1; else tminus8=0;
       if time=-6 then tminus6 =1; else tminus6=0;
       if time=-4 then tminus4 =1; else tminus4=0;
       if time=-2 then tminus2 =1; else tminus2=0;
       if time=0 then t0 =1; else t0=0;
       if time=2 then t2 =1; else t2=0;
       if time=4 then t4 =1; else t4=0;
       if time=6 then t6 =1; else t6=0;
       label    time = "Time relative to legalization based on beginning of year pair";
run;
proc contents data=all_data2;
run;
proc freq data=all_data2;
       title3'check new variable creation';
       table time*legal_cat*years
                                         262


                time*tminus10*tminus8*tminus6*tminus4*tminus2*t0*t2*t4*t6/list missing;
        where legal_cat ~="Overall";
run;
proc freq data=all_data2;
        title3'determine where tails of legal date data distribution should be grouped
together';
        table time/list;
run;
/*<=-4, >=4*/
proc freq data=all_data2;
        title3'Check new time event dummies';
        table legal_cat/list missing;
run;
/* data step 3 creates new dummies that
can categorize all data before a certain time point
                                         263


so that analysis on a single category is not done
 econs call it balancing leads and lags for short*/
data all_data3;
        set all_data2;
        if time<=-4 then tlt4=1; else tlt4=0;
        if time>=4 then tgt4=1; else tgt4=0;
        if time<=-6 then tlt6=1; else tlt6=0;
        if time>=6 then tgt6=1; else tgt6=0;
        if legal_cat = "Illegal" then legal = 0;
        else if time>=0 then legal = 1;
        else if time <0 then legal =0;
        if legal_cat = "Illegal" then effective = 0;
        else if time>=2 then effective = 1;
        else if time <2 then effective =0;
                                           264


        if legal_cat = "Illegal" then legal_wave = 0;
        else if legal_cat = "Legalized_2012" then legal_wave = 1;
        else if legal_cat = "Legalized_2014" then legal_wave = 2;
        else if legal_cat = "Legalized_2016" then legal_wave = 3;
        else if legal_cat = "Legalized_2018" then legal_wave = 4;
        label legal = "Simple binary for RCL, 1 if year>=legalize date, 0 otherwise"
                         effective = "Simple binary for RCL effective, 1 if year>=effective
date, 0 otherwise";
run;
proc freq data=all_data3;
        title3'Check new time event dummies';
        table time*tlt4*tminus10*tminus8*tminus6*tminus4*tminus2*t0*t2*t4*t6*tgt4/list
missing;
run;
proc freq data=all_data3;
                                           265


         title3'Check new legality dummy';
         table legal_cat*years*legal/list missing;
run;
proc freq data=all_data3;
         title3'Check new effective dummy';
         table legal_cat*years*effective/list missing;
run;
/*Save dataset*/
/*data sav.Legal_date;*/
/*       set all_data3;*/
/*run;*/
/**/
/*/*export to csv if needed*/*/
/*proc export data= sav.Legal_date outfile='C:\Users\montg\Dropbox\Ph.D
Work\Dissertation\Data\Aim 2\Processed\Legal_date.csv'*/
/*       dbms=csv replace;*/
                                          266


/*run;*/
/*/*/*/*/*/*/*/*/**/*/*/*/*/*/*/*/*/
/* REPEAT FOR EFFECTIVE DATE DATA */
/*/*/*/*/*/*/*/*/**/*/*/*/*/*/*/*/*/;
/*read in eligibility data for effective dates*/
%drive(&ElgE,csv)
data dsn1;
          set dsn1;
          years="2008 to 2009";
run;
data dsn2;
          set dsn2;
          years="2010 to 2011";
run;
                                           267


data dsn3;
       set dsn3;
       years="2012 to 2013";
run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
       set dsn5;
       years="2016 to 2017";
run;
data dsn6;
       set dsn6;
                             268


        years="2018 to 2019";
run;
data EligE;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
 eligible = catx('', 'eligible for past year initiatio'n, 'rc-eligible for past year initia'n);
run;
proc freq data=EligE;
        title3'check to see that new var works';
        table eligible*'eligible for past year initiatio'n* 'rc-eligible for past year initia'n/list
missing;
run;
%drive(&RecE,csv)
data dsn1;
        set dsn1;
                                           269


       years="2008 to 2009";
run;
data dsn2;
       set dsn2;
       years="2010 to 2011";
run;
data dsn3;
       set dsn3;
       years="2012 to 2013";
run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
                             270


data dsn5;
        set dsn5;
        years="2016 to 2017";
run;
data dsn6;
        set dsn6;
        years="2018 to 2019";
run;
data RecE;
  set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
        recent_initiate = catx('','rc-past year initiate of marijua'n, 'rc-recent initiate of
marijuana'n, 'recent initiate of marijuana use'n);
run;
/*cleaning*/
                                          271


proc freq data=rece;
       title3'check to see that new var works';
       table recent_initiate*'rc-past year initiate of marijua'n *'rc-recent initiate of
marijuana'n *'recent initiate of marijuana use'n /list missing;
run;
proc print data=elige;
       title3'May not even need to use subtraction method, check for suppression
(should not print)';
       where 'Weighted Count'n = . or 'Count SE'n =.;
run;
proc print data=rece;
       title3'May not even need to use subtraction method, check for suppression
(should not print)';
       where 'Weighted Count'n = . or 'Count SE'n =.;
run;
/*no data from 2008-2009 period in ALaska, subtraction method doesnt even work*/
                                         272


data elige1;
         set elige;
         where eligible = "1 - Yes";
         drop row: total: 'eligible for past year initiatio'n 'rc-eligible for past year initia'n
eligible 'Column % CI (lower)'n 'Column % CI (upper)'n;
         rename 'STATE NAME'n = legal_cat;
         rename 'final edited age'n = age;
         rename 'Column %'n = elig_phat;
         rename 'Column % SE'n = elig_phat_SE;
         rename 'Weighted Count'n = elig_count;
         rename 'Count SE'n = elig_count_se;
run;
data rece1;
         set rece;
         where recent_initiate = "1 - Yes";
                                           273


        drop row: total: 'rc-past year initiate of marijua'n 'rc-recent initiate of marijuana'n
'recent initiate of marijuana use'n recent_initiate 'Column % CI (lower)'n 'Column % CI
(upper)'n;
        rename 'STATE NAME'n = legal_cat;
        rename 'final edited age'n = age;
        rename 'Column %'n = rec_phat;
        rename 'Column % SE'n = rec_phat_SE;
        rename 'Weighted Count'n = rec_count;
        rename 'Count SE'n = rec_count_se;
run;
proc sort data=elige1;
        by years legal_cat age ;
run;
proc sort data=rece1;
        by years legal_cat age ;
run;
                                         274


/*estimate incidence by age group, state group, and years*/
data all_datae;
         merge elige1 rece1;
         by years legal_cat age ;
         length time 8.;
         incidence = rec_count/elig_count;
         /*need to calculate incidence SE or 95%CIs*/
         label incidence = "Percentage of past year initiates among persons at risk for
initiation"
                         legal_cat = "Legal status of cannabis through 2018";
run;
proc freq data=all_datae;
         title3'combinations of legal cat and years to make relative time variable';
         table years*legal_cat /list;
                                          275


        where legal_cat ~="Overall";
run;
data all_datae2;
        set all_datae;
        where legal_cat ~="Overall";
        /*create a time variable relative to year of legalization and year of data
                 time = how many years away froma states year of legalization is this data
point?*/
        if legal_cat = "Illegal" then time=0;
        else if legal_cat = "Effective_2014" and years = "2008 to 2009" then time = -6;
        else if legal_cat = "Effective_2015" and years = "2008 to 2009" then time = -7;
        else if legal_cat = "Effective_2016" and years = "2008 to 2009" then time = -8;
        else if legal_cat = "Effective_2017" and years = "2008 to 2009" then time = -9;
        else if legal_cat = "Effective_2018" and years = "2008 to 2009" then time = -10;
        else if legal_cat = "Effective_2014" and years = "2010 to 2011" then time = -4;
                                          276


else if legal_cat = "Effective_2015" and years = "2010 to 2011" then time = -5;
else if legal_cat = "Effective_2016" and years = "2010 to 2011" then time = -6;
else if legal_cat = "Effective_2017" and years = "2010 to 2011" then time = -7;
else if legal_cat = "Effective_2018" and years = "2010 to 2011" then time = -8;
else if legal_cat = "Effective_2014" and years = "2012 to 2013" then time = -2;
else if legal_cat = "Effective_2015" and years = "2012 to 2013" then time = -3;
else if legal_cat = "Effective_2016" and years = "2012 to 2013" then time = -4;
else if legal_cat = "Effective_2017" and years = "2012 to 2013" then time = -5;
else if legal_cat = "Effective_2018" and years = "2012 to 2013" then time = -6;
else if legal_cat = "Effective_2014" and years = "2014 to 2015" then time = 0;
else if legal_cat = "Effective_2015" and years = "2014 to 2015" then time = -1;
else if legal_cat = "Effective_2016" and years = "2014 to 2015" then time = -2;
else if legal_cat = "Effective_2017" and years = "2014 to 2015" then time = -3;
else if legal_cat = "Effective_2018" and years = "2014 to 2015" then time = -4;
else if legal_cat = "Effective_2014" and years = "2016 to 2017" then time = 2;
                                 277


        else if legal_cat = "Effective_2015" and years = "2016 to 2017" then time = 1;
        else if legal_cat = "Effective_2016" and years = "2016 to 2017" then time = 0;
        else if legal_cat = "Effective_2017" and years = "2016 to 2017" then time = -1;
        else if legal_cat = "Effective_2018" and years = "2016 to 2017" then time = -2;
        else if legal_cat = "Effective_2014" and years = "2018 to 2019" then time = 4;
        else if legal_cat = "Effective_2015" and years = "2018 to 2019" then time = 3;
        else if legal_cat = "Effective_2016" and years = "2018 to 2019" then time = 2;
        else if legal_cat = "Effective_2017" and years = "2018 to 2019" then time = 1;
        else if legal_cat = "Effective_2018" and years = "2018 to 2019" then time = 0;
/*create dummy variable for each relative time point*/
        if time=-10 then tminus10 =1; else tminus10=0;
        if time=-9 then tminus9 =1; else tminus9=0;
        if time=-8 then tminus8 =1; else tminus8=0;
        if time=-7 then tminus7 =1; else tminus7=0;
        if time=-6 then tminus6 =1; else tminus6=0;
        if time=-5 then tminus5 =1; else tminus5=0;
                                         278


       if time=-4 then tminus4 =1; else tminus4=0;
       if time=-3 then tminus3 =1; else tminus3=0;
       if time=-2 then tminus2 =1; else tminus2=0;
       if time=-1 then tminus1 =1; else tminus1=0;
       if time=0 then t0 =1; else t0=0;
       if time=1 then t1 =1; else t1=0;
       if time=2 then t2 =1; else t2=0;
       if time=3 then t3 =1; else t3=0;
       if time=4 then t4 =1; else t4=0;
       label    time = "Time relative to legalization based on beginning of year pair";
run;
proc freq data=all_datae2;
       title3'check new variable creation';
       table time time*legal_cat*years
       time*tminus10*tminus9*tminus8*tminus7*tminus6*tminus5*tminus4*tminus3*tmi
nus2*tminus1*t0*t1*t2*t3*t4/list missing;
                                         279


run;
proc sort data=all_datae2;
        by time;
run;
proc print data=all_datae2;
        title3'Look at data sorted by time to see how much info we are missing in this
coding scheme';
        var time legal_cat incidence;
        where legal_cat ~="Overall";
run;
/*Just the Alaskan data from 2008-2009 I already knew about*/
proc freq data=all_datae2;
        title3'determine where tails of effective date data distribution should be grouped
together';
        table time/list;
run;
                                         280


/*<=-6, >=2*/
/* data step 3 creates new dummies that
can categorize all data before a certain time point
so that analysis on a single category is not done
 econs call it balancing leads and lags for short*/
/*the cut points are different for legalization dates
than for effective dates because of the way the data
is structured and the categorization by 1 year intervals
(more effective date 1 year interval categories)*/
data all_datae3;
        set all_datae2;
        if time<=-6 then tlt6=1; else tlt6=0;
        if time>=2 then tgt2=1; else tgt2=0;
run;
                                          281


proc freq data=all_datae3;
         title3'Check new time event dummies';
         table
time*tlt6*tminus10*tminus9*tminus8*tminus7*tminus6*tminus5*tminus4*tminus3*tminus
2*tminus1*t0*t1*t2*t3*t4*tgt2/list missing;
run;
/*Save dataset*/
/*data sav.Effective_date;*/
/*       set all_datae3;*/
/*run;*/
/*export to csv if needed*/
/*proc export data= sav.Effective_date outfile='C:\Users\montg\Dropbox\Ph.D
Work\Dissertation\Data\Aim 2\Processed\Effective_date.csv'*/
/*       dbms=csv replace;*/
/*run;*/
                                         282


/*/*/*/*/*/*/*/*/*/*/**/*/*/*/*/*/*/*/*/*/*/
/* REPEAT FOR PREVALENCE LEGAL DATE DATA */
/*/*/*/*/*/*/*/*/*/*/**/*/*/*/*/*/*/*/*/*/*/;
/*read in prevalence data for leglaization dates*/
 %drive(&PMMJ,csv)
 data dsn1;
          set dsn1;
          years="2008 to 2009";
run;
data dsn2;
          set dsn2;
          years="2010 to 2011";
run;
                                              283


data dsn3;
       set dsn3;
       years="2012 to 2013";
run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
       set dsn5;
       years="2016 to 2017";
run;
data dsn6;
       set dsn6;
       years="2018 to 2019";
                             284


run;
data prev;
  set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
run;
/*read in for recmj data*/
%drive(&Rec,csv)
 data dsn1;
        set dsn1;
        years="2008 to 2009";
run;
data dsn2;
        set dsn2;
        years="2010 to 2011";
run;
                                     285


data dsn3;
       set dsn3;
       years="2012 to 2013";
run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
       set dsn5;
       years="2016 to 2017";
run;
data dsn6;
       set dsn6;
                             286


        years="2018 to 2019";
run;
data prev;
  set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
        past_month_use = catx('','RC-MARIJUANA - PAST MONTH USE'n,
'MARIJUANA - PAST MONTH USE'n);
run;
proc freq data=prev;
        title3'check that prevalence indicator concatenated correctly';
        table past_month_use*'RC-MARIJUANA - PAST MONTH USE'n*'MARIJUANA -
PAST MONTH USE'n / list missing;
run;
/*cleaning*/
/*prepare susbets of data for past month users*/
data prev1;
        set prev;
                                        287


     where past_month_use = "1 - Used within the past month";
     length time 8.;
     drop row: total: 'RC-MARIJUANA - PAST MONTH USE'n 'MARIJUANA - PAST
MONTH USE'n 'Column % CI (lower)'n 'Column % CI (upper)'n;
     /*need to calculate prevalence SE or 95%CIs?*/
     rename 'STATE NAME'n = legal_cat;
     rename 'final edited age'n = age;
     rename 'Column %'n = pmmj_phat;
     rename 'Column % SE'n = pmmj_SE;
     rename 'Weighted Count'n = pmmj_count;
     rename 'Count SE'n = pmmj_count_se;
     label pmmj_phat = "Percentage of population that used marijuana past month"
                     legal_cat = "Legal status of cannabis through 2018";
run;
                                     288


proc means data=prev1 n mean stddev;
       title3'face validity check for prevalence estimates';
       var pmmj_phat;
       class legal_cat age;
run;
data prev2;
       set prev1;
       /*remove overall state category*/
       where legal_cat ~="Overall";
       /*create a time variable relative to year of legalization and year of data
                time = how many years away froma states year of legalization is this data
point?*/
       if legal_cat = "Illegal" then time=0;
       else if legal_cat = "Legal_2012" and years = "2008 to 2009" then time = -4;
       else if legal_cat = "Legal_2014" and years = "2008 to 2009" then time = -6;
                                         289


else if legal_cat = "Legal_2016" and years = "2008 to 2009" then time = -8;
else if legal_cat = "Legal_2018" and years = "2008 to 2009" then time = -10;
else if legal_cat = "Legal_2012" and years = "2010 to 2011" then time = -2;
else if legal_cat = "Legal_2014" and years = "2010 to 2011" then time = -4;
else if legal_cat = "Legal_2016" and years = "2010 to 2011" then time = -6;
else if legal_cat = "Legal_2018" and years = "2010 to 2011" then time = -8;
else if legal_cat = "Legal_2012" and years = "2012 to 2013" then time = 0;
else if legal_cat = "Legal_2014" and years = "2012 to 2013" then time = -2;
else if legal_cat = "Legal_2016" and years = "2012 to 2013" then time = -4;
else if legal_cat = "Legal_2018" and years = "2012 to 2013" then time = -6;
else if legal_cat = "Legal_2012" and years = "2014 to 2015" then time = 2;
else if legal_cat = "Legal_2014" and years = "2014 to 2015" then time = 0;
else if legal_cat = "Legal_2016" and years = "2014 to 2015" then time = -2;
else if legal_cat = "Legal_2018" and years = "2014 to 2015" then time = -4;
                                290


        else if legal_cat = "Legal_2012" and years = "2016 to 2017" then time = 4;
        else if legal_cat = "Legal_2014" and years = "2016 to 2017" then time = 2;
        else if legal_cat = "Legal_2016" and years = "2016 to 2017" then time = 0;
        else if legal_cat = "Legal_2018" and years = "2016 to 2017" then time = -2;
        else if legal_cat = "Legal_2012" and years = "2018 to 2019" then time = 6;
        else if legal_cat = "Legal_2014" and years = "2018 to 2019" then time = 4;
        else if legal_cat = "Legal_2016" and years = "2018 to 2019" then time = 2;
        else if legal_cat = "Legal_2018" and years = "2018 to 2019" then time = 0;
/*create dummy variable for each relative time point*/
        if time=-10 then tminus10 =1; else tminus10=0;
        if time=-8 then tminus8 =1; else tminus8=0;
        if time=-6 then tminus6 =1; else tminus6=0;
        if time=-4 then tminus4 =1; else tminus4=0;
        if time=-2 then tminus2 =1; else tminus2=0;
        if time=0 then t0 =1; else t0=0;
        if time=2 then t2 =1; else t2=0;
                                         291


       if time=4 then t4 =1; else t4=0;
       if time=6 then t6 =1; else t6=0;
       label    time = "Time relative to legalization based on beginning of year pair";
run;
proc freq data=prev2;
       title3'combinations of legal cat and years to make relative time variable';
       table legal_cat* years/list;
run;
proc freq data=prev2;
       title3'check new variable creation';
       table time*legal_cat*years
                time*tminus10*tminus8*tminus6*tminus4*tminus2*t0*t2*t4*t6/list missing;
run;
proc freq data=prev2;
                                         292


        title3'determine where tails of legal date data distribution should be grouped
together';
        table time/list;
run;
/*<=-4, >=4*/
/* data step 3 creates new dummies that
can categorize all data before a certain time point
so that analysis on a single category is not done
 econs call it balancing leads and lags for short*/
data prev3;
        set prev2;
        if time<=-4 then tlt4=1; else tlt4=0;
        if time>=4 then tgt4=1; else tgt4=0;
run;
proc freq data=prev3;
                                          293


          title3'Check new time event dummies';
          table time*tlt4*tminus10*tminus8*tminus6*tminus4*tminus2*t0*t2*t4*t6*tgt4/list
missing;
run;
/*Save dataset*/
/*data sav.Prevalence;*/
/*        set prev3;*/
/*run;*/
/*export to csv if needed*/
/*proc export data= sav.Prevalence outfile='C:\Users\montg\Dropbox\Ph.D
Work\Dissertation\Data\Aim 2\Processed\Prevalence.csv'*/
/*        dbms=csv replace;*/
/*run;*/
/*/*/*/*/*/*/*/**/*/*/*/*/*/*/*/
/*Placebo Analysis Dataset*/*/*/
/*/*/*/*/*/*/*/**/*/*/*/*/*/*/*/;
                                         294


data time_placebo;
       set sav.Legal_date;
       if legal_cat = "Illegal" then legal=0;
       else if legal_cat = "Overall" then legal=.;
       else if time=0 and legal_cat ~= "Illegal" then legal=1;
       else if time<0 then legal=0;
       else if time>0 then legal=1;
       label    legal = "cannabis leagality binary";
run;
proc freq data= time_placebo;
       title3'check legal binary creation';
       table year_n*legal_cat*legal/list missing;
run;
                                          295


/*randomly select a year to be the placebo time*/
/* pick a random number between 2010 and 2016 1000 times*/
data year_range;
        do i = 1 to 7;
           placebo_year = 2009 + i;
                id=1;
        output;
        end;
        keep id placebo_year;
run;
proc surveyselect data=year_range sampsize=1
        method=srs reps=1 out=placebos seed = 1124;
run;
data time_placebo2;
                                       296


       set time_placebo;
       id=1;
run;
data time_placebo3;
       merge time_placebo2 placebos;
       by id;
       drop replicate id;
run;
proc print;
       var legal_cat year_n time ;
run;
data time_placebo4;
       merge time_placebo3;
       if legal_cat = "Illegal" then placebo_time=0;
                                         297


        else if legal_cat ~= "Illegal" and years = "2008 to 2009" then placebo_time = -3;
        else if legal_cat ~= "Illegal" and years = "2010 to 2011" then placebo_time = -1;
        else if legal_cat ~= "Illegal" and years = "2012 to 2013" then placebo_time = 1;
        else if legal_cat ~= "Illegal" and years = "2014 to 2015" then placebo_time = 3;
        else if legal_cat ~= "Illegal" and years = "2016 to 2017" then placebo_time = 5;
        else if legal_cat ~= "Illegal" and years = "2018 to 2019" then placebo_time = 7;
/*create dummy variable for each relative time point*/
        if placebo_time=-3 then ptminus3 =1; else ptminus3=0;
        if placebo_time=-1 then ptminus1 =1; else ptminus1=0;
        if placebo_time=1 then pt1 =1; else pt1=0;
        if placebo_time=3 then pt3 =1; else pt3=0;
        if placebo_time=5 then pt5 =1; else pt5=0;
        if placebo_time=7 then pt7 =1; else pt7=0;
        if placebo_time=0 then pt0 =1; else pt0=0;
        if legal_cat= "Illegal" then placebo_legal = "Illegal";
                                          298


         else placebo_legal = "Legal";
         label placebo_time = "Time between observation and placebo year of cannabis
legalization"
                            placebo_legal = "2 groups for the plcebo trial, legalized in 2011
and illegal";
run;
proc freq data=time_placebo4;
         title3'check placebo time variable and placebo time binaries';
         table legal_cat*years*placebo_time
         placebo_time legal_cat*placebo_legal
         placebo_time*ptminus3*ptminus1*pt0*pt1*pt3*pt5*pt7/list missing;
run;
data sav.placebo;
         set time_placebo4;
run;
/*******************************************************************************************************
                                                 299


* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE ;
title1'Dissertation';
title2'Aim 2: Table 1';
/****************************************************************************************
* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored                 *
* &ProgName             – the name of the .egp file, without the extension                           *
* &ProgNode            - the name of the code node                                          *
* &ProgDir          – the path to the folder in which the .egp file is stored.                   *
*****************************************************************************************/
                                                 300


/*****************************************************************************
* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                        *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax         *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
** PROGRAMMER'S NAME ;
%LET PROGRAMMER                   = Barrett Montgomery;
** DEFINE ALL NON-SAS FILES CALLED IN YOUR PROGRAM AS MACRO VARIABLES
by Drug;
** THESE CAN BE LEFT BLANK IF NOT NEEDED OR USED ;
%LET dat         = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Table 1;
%LET ABOD          = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Table 1\PDAS\ABODMRJ;
                                                 301


%LET AGE      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Table 1\PDAS\CATAGE;
%LET SEX      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Table 1\PDAS\IRSEX;
%LET MRJ      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Table 1\PDAS\MRJMON;
%LET RACE     = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Table 1\PDAS\NEWRACE2;
** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Processed;
** NAME FORMAT LIBRARY DIRECTORY ;
%LET FMTDIR       = ;
** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
                                   302


** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
%LET PURPOSE1 = Aim 2 sample demographics and whatnot;
/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
/*this macro imports all data in a file with 2 options, the folder directory, and the type of
file
here, folder is saved in a macro variable named above and file type is csv.*/
title1"Download all data, append, and organize";
%macro drive(dir,ext);
                                                 303


%local cnt filrf rc did memcnt name;
%let cnt=0;
%let filrf=mydir;
%let rc=%sysfunc(filename(filrf,&dir));
%let did=%sysfunc(dopen(&filrf));
%if &did ne 0 %then %do;
%let memcnt=%sysfunc(dnum(&did));
%do i=1 %to &memcnt;
 %let name=%qscan(%qsysfunc(dread(&did,&i)),-1,.);
 %if %qupcase(%qsysfunc(dread(&did,&i))) ne %qupcase(&name) %then %do;
  %if %superq(ext) = %superq(name) %then %do;
    %let cnt=%eval(&cnt+1);
    %put %qsysfunc(dread(&did,&i));
    proc import datafile="&dir\%qsysfunc(dread(&did,&i))" out=dsn&cnt
                                     304


        dbms=csv replace;
       run;
     %end;
    %end;
   %end;
    %end;
  %else %put &dir cannot be open.;
  %let rc=%sysfunc(dclose(&did));
 %mend drive;
/*read in data*/
 %drive(&dat,csv)
data dsn1;
        set dsn1 ;
        years="2008 to 2009";
                                   305


run;
data dsn2;
       set dsn2;
       years="2008 to 2009";
run;
data dsn3;
       set dsn3;
       years="2008 to 2009";
run;
data dsn4;
       set dsn4;
       years="2008 to 2009";
run;
data dsn5;
                             306


       set dsn5;
       years="2008 to 2009";
run;
data dsn6;
       set dsn6;
       years="2010 to 2011";
run;
data dsn7;
       set dsn7;
       years="2010 to 2011";
run;
data dsn8;
       set dsn8;
       years="2010 to 2011";
run;
                             307


data dsn9;
       set dsn9;
       years="2010 to 2011";
run;
data dsn10;
       set dsn10;
       years="2010 to 2011";
run;
data dsn11;
       set dsn11;
       years="2012 to 2013";
run;
data dsn12;
       set dsn12;
                             308


       years="2012 to 2013";
run;
data dsn13;
       set dsn13;
       years="2012 to 2013";
run;
data dsn14;
       set dsn14;
       years="2012 to 2013";
run;
data dsn15;
       set dsn15;
       years="2012 to 2013";
run;
                             309


data dsn16;
       set dsn16;
       years="2014 to 2015";
run;
data dsn17;
       set dsn17;
       years="2014 to 2015";
run;
data dsn18;
       set dsn18;
       years="2014 to 2015";
run;
data dsn19;
       set dsn19;
       years="2014 to 2015";
                             310


run;
data dsn20;
       set dsn20;
       years="2014 to 2015";
run;
data dsn21;
       set dsn21;
       years="2016 to 2017";
run;
data dsn22;
       set dsn22;
       years="2016 to 2017";
run;
data dsn23;
                             311


       set dsn23;
       years="2016 to 2017";
run;
data dsn24;
       set dsn24;
       years="2016 to 2017";
run;
data dsn25;
       set dsn25;
       years="2016 to 2017";
run;
data dsn26;
       set dsn26;
       years="2018 to 2019";
run;
                             312


data dsn27;
       set dsn27;
       years="2018 to 2019";
run;
data dsn28;
       set dsn28;
       years="2018 to 2019";
run;
data dsn29;
       set dsn29;
       years="2018 to 2019";
run;
data dsn30;
       set dsn30;
                             313


        years="2018 to 2019";
run;
data dat2008;
  set dsn1 dsn2 dsn3 dsn4 dsn5;
  keep years 'state name'n 'final edited age'n 'marijuana abuse or dependence -'n
                               'imputation revised gender'n 'marijuana - past month
use'n
                               'race recode (4 levels)'n 'Weighted Count'n 'Count SE'n
                               'Column %'n 'Column % SE'n;
;
  rename 'final edited age'n = 'FINAL EDITED AGE'n
                       'marijuana abuse or dependence -'n = 'RC-MARIJUANA
DEPENDENCE OR ABUSE'n
                       'imputation revised gender'n = 'GENDER - IMPUTATION
REVISED'n
                        'marijuana - past month use'n = 'RC-MARIJUANA - PAST
MONTH USE'n
                                        314


                       'race recode (4 levels)'n = 'RC-RACE RECODE (4 LEVELS)'n ;
run;
data dat2010;
 set dsn6 dsn7 dsn8 dsn9 dsn10;
 keep years 'state name'n 'final edited age'n 'marijuana abuse or dependence -'n
       'imputation revised gender'n 'marijuana - past month use'n
       'race recode (4 levels)'n 'Weighted Count'n 'Count SE'n
       'Column %'n 'Column % SE'n;
 rename 'final edited age'n = 'FINAL EDITED AGE'n
                      'marijuana abuse or dependence -'n = 'RC-MARIJUANA
DEPENDENCE OR ABUSE'n
                      'imputation revised gender'n = 'GENDER - IMPUTATION
REVISED'n
                       'marijuana - past month use'n = 'RC-MARIJUANA - PAST
MONTH USE'n
                       'race recode (4 levels)'n = 'RC-RACE RECODE (4 LEVELS)'n ;
run;
                                       315


data dat2012;
  set dsn11 dsn12 dsn13 dsn14 dsn15;
  keep years 'state name'n 'final edited age'n 'MARIJUANA ABUSE OR DEPENDENCE -
'n'imputation revised gender'n
        'marijuana - past month use'n 'race recode (4 levels)'n 'Weighted Count'n
'Count SE'n
        'Column %'n 'Column % SE'n;
 rename 'final edited age'n = 'FINAL EDITED AGE'n
                       'marijuana abuse or dependence -'n = 'RC-MARIJUANA
DEPENDENCE OR ABUSE'n
                       'imputation revised gender'n = 'GENDER - IMPUTATION
REVISED'n
                        'marijuana - past month use'n = 'RC-MARIJUANA - PAST
MONTH USE'n
                        'race recode (4 levels)'n = 'RC-RACE RECODE (4 LEVELS)'n ;
run;
data dat2014;
  set dsn16 dsn17 dsn18 dsn19 dsn20;
                                        316


 keep years 'state name'n 'FINAL EDITED AGE'n 'RC-MARIJUANA DEPENDENCE OR
ABUSE'n 'GENDER - IMPUTATION REVISED'n
       'RC-MARIJUANA - PAST MONTH USE'n 'RC-RACE RECODE (4 LEVELS)'n
'Weighted Count'n 'Count SE'n 'Column %'n       'Column % SE'n;
run;
data dat2016;
 set dsn21 dsn22 dsn23 dsn24 dsn25;
 keep years 'state name'n 'FINAL EDITED AGE'n 'RC-MARIJUANA DEPENDENCE OR
ABUSE'n 'GENDER - IMPUTATION REVISED'n
       'RC-MARIJUANA - PAST MONTH USE'n 'RC-RACE RECODE (4 LEVELS)'n
'Weighted Count'n 'Count SE'n 'Column %'n       'Column % SE'n;
run;
data dat2018;
 set dsn26 dsn27 dsn28 dsn29 dsn30;
 keep years 'state name'n 'FINAL EDITED AGE'n 'RC-MARIJUANA DEPENDENCE OR
ABUSE'n 'GENDER - IMPUTATION REVISED'n
       'RC-MARIJUANA - PAST MONTH USE'n 'RC-RACE RECODE (4 LEVELS)'n
'Weighted Count'n 'Count SE'n 'Column %'n       'Column % SE'n;
                                    317


run;
data all;
        set dat2008 dat2010 dat2012 dat2014 dat2016 dat2018;
run;
proc print data=all;
        where 'FINAL EDITED AGE'n ~='';
        var 'state name'n 'FINAL EDITED AGE'n years 'Weighted Count'n 'Count SE'n
'Column %'n 'Column % SE'n ;
run;
proc print data=all;
        where 'GENDER - IMPUTATION REVISED'n ~='';
        var 'state name'n years 'Weighted Count'n 'GENDER - IMPUTATION REVISED'n
'Count SE'n 'Column %'n       'Column % SE'n;
run;
proc print data=all;
                                      318


         where 'RC-RACE RECODE (4 LEVELS)'n ~='';
         var 'state name'n years 'Weighted Count'n 'RC-RACE RECODE (4 LEVELS)'n
'Count SE'n 'Column %'n        'Column % SE'n;
run;
proc print data=all;
         where 'RC-MARIJUANA - PAST MONTH USE'n ~='';
         var 'state name'n years 'Weighted Count'n 'RC-MARIJUANA - PAST MONTH
USE'n 'Count SE'n 'Column %'n         'Column % SE'n;
run;
proc print data=all;
         where 'RC-MARIJUANA DEPENDENCE OR ABUSE'n ~='';
         var 'state name'n years 'Weighted Count'n 'RC-MARIJUANA DEPENDENCE OR
ABUSE'n 'Count SE'n 'Column %'n 'Column % SE'n;
run;
/*/*/*/* PDAS */*/*/*/
                                        319


/*read in data*/;
%drive(&ABOD,csv)
data dsn1;
        set dsn1 ;
        year="2008";
run;
data dsn2;
        set dsn2 ;
        year="2009";
run;
data dsn3;
        set dsn3;
        year="2010";
run;
                     320


data dsn4;
       set dsn4;
       year="2011";
run;
data dsn5;
       set dsn5;
       year="2012";
run;
data dsn6;
       set dsn6;
       year="2013";
run;
data dsn7;
       set dsn7;
       year="2014";
                    321


run;
data dsn8;
       set dsn8;
       year="2015";
run;
data dsn9;
       set dsn9;
       year="2016";
run;
data dsn10;
       set dsn10;
       year="2017";
run;
data dsn11;
                    322


       set dsn11;
       year="2018";
run;
data dsn12;
       set dsn12;
       year="2019";
run;
data ABOD;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6 dsn7 dsn8 dsn9 dsn10 dsn11 dsn12;
 MJ_DEP = catx('','marijuana abuse or dependence -'n,'RC-MARIJUANA
DEPENDENCE OR ABUSE'n);
 keep year mj_dep 'Unweighted Count'n;
run;
%drive(&AGE,csv)
                                    323


data dsn1;
       set dsn1 ;
       year="2008";
run;
data dsn2;
       set dsn2 ;
       year="2009";
run;
data dsn3;
       set dsn3;
       year="2010";
run;
data dsn4;
       set dsn4;
       year="2011";
                    324


run;
data dsn5;
       set dsn5;
       year="2012";
run;
data dsn6;
       set dsn6;
       year="2013";
run;
data dsn7;
       set dsn7;
       year="2014";
run;
data dsn8;
                    325


       set dsn8;
       year="2015";
run;
data dsn9;
       set dsn9;
       year="2016";
run;
data dsn10;
       set dsn10;
       year="2017";
run;
data dsn11;
       set dsn11;
       year="2018";
run;
                    326


data dsn12;
       set dsn12;
       year="2019";
run;
data age;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6 dsn7 dsn8 dsn9 dsn10 dsn11 dsn12;
 AGE = catx('','AGE CATEGORY'n,'RC-AGE CATEGORY'n);
 keep year AGE 'Unweighted Count'n;
run;
%drive(&SEX,csv)
data dsn1;
       set dsn1 ;
       year="2008";
run;
                                   327


data dsn2;
       set dsn2 ;
       year="2009";
run;
data dsn3;
       set dsn3;
       year="2010";
run;
data dsn4;
       set dsn4;
       year="2011";
run;
data dsn5;
       set dsn5;
                    328


       year="2012";
run;
data dsn6;
       set dsn6;
       year="2013";
run;
data dsn7;
       set dsn7;
       year="2014";
run;
data dsn8;
       set dsn8;
       year="2015";
run;
                    329


data dsn9;
       set dsn9;
       year="2016";
run;
data dsn10;
       set dsn10;
       year="2017";
run;
data dsn11;
       set dsn11;
       year="2018";
run;
data dsn12;
       set dsn12;
       year="2019";
                    330


run;
data gender;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6 dsn7 dsn8 dsn9 dsn10 dsn11 dsn12;
 keep year 'IMPUTATION REVISED GENDER'n 'Unweighted Count'n;
run;
%drive(&MRJ,csv)
data dsn1;
       set dsn1 ;
       year="2008";
run;
data dsn2;
       set dsn2 ;
       year="2009";
run;
                                   331


data dsn3;
       set dsn3;
       year="2010";
run;
data dsn4;
       set dsn4;
       year="2011";
run;
data dsn5;
       set dsn5;
       year="2012";
run;
data dsn6;
       set dsn6;
                    332


       year="2013";
run;
data dsn7;
       set dsn7;
       year="2014";
run;
data dsn8;
       set dsn8;
       year="2015";
run;
data dsn9;
       set dsn9;
       year="2016";
run;
                    333


data dsn10;
       set dsn10;
       year="2017";
run;
data dsn11;
       set dsn11;
       year="2018";
run;
data dsn12;
       set dsn12;
       year="2019";
run;
data mrj;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6 dsn7 dsn8 dsn9 dsn10 dsn11 dsn12;
 mrj = catx('','marijuana - past month use'n,'RC-MARIJUANA - PAST MONTH USE'n);
                                       334


 keep year mrj'Unweighted Count'n;
run;
%drive(&RACE,csv)
data dsn1;
       set dsn1 ;
       year="2008";
run;
data dsn2;
       set dsn2 ;
       year="2009";
run;
data dsn3;
       set dsn3;
       year="2010";
                                   335


run;
data dsn4;
       set dsn4;
       year="2011";
run;
data dsn5;
       set dsn5;
       year="2012";
run;
data dsn6;
       set dsn6;
       year="2013";
run;
data dsn7;
                    336


       set dsn7;
       year="2014";
run;
data dsn8;
       set dsn8;
       year="2015";
run;
data dsn9;
       set dsn9;
       year="2016";
run;
data dsn10;
       set dsn10;
       year="2017";
run;
                    337


data dsn11;
       set dsn11;
       year="2018";
run;
data dsn12;
       set dsn12;
       year="2019";
run;
data RACE;
 set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6 dsn7 dsn8 dsn9 dsn10 dsn11 dsn12;
 RACE = catx('','RC-RACE/HISPANICITY RECODE (7 LE'n,'RACE/HISPANICITY
RECODE (7 LEVEL'n);
 keep year RACE 'Unweighted Count'n;
run;
                                   338


proc means data=abod sum;
      var 'Unweighted Count'n;
      class MJ_DEP;
run;
proc means data=age sum;
      var 'Unweighted Count'n;
      class AGE;
run;
proc means data=gender sum;
      var 'Unweighted Count'n;
      class 'IMPUTATION REVISED GENDER'n;
run;
proc means data=RACE sum;
      var 'Unweighted Count'n;
      class RACE;
                                339


run;
proc means data=mrj sum;
         var 'Unweighted Count'n;
         class mrj;
run;
/*******************************************************************************************************
* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE ;
title1'Dissertation';
title2'Aim 2: Incidence after leglaization DiD';
/****************************************************************************************
                                                 340


* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored             *
* &ProgName             – the name of the .egp file, without the extension                       *
* &ProgNode            - the name of the code node                                         *
* &ProgDir          – the path to the folder in which the .egp file is stored.               *
*****************************************************************************************/
/*****************************************************************************
* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                            *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax                     *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
** PROGRAMMER'S NAME ;
%LET PROGRAMMER                   = Barrett Montgomery;
                                                 341


** DEFINE ALL NON-SAS FILES CALLED IN YOUR PROGRAM AS MACRO VARIABLES
by Drug;
** THESE CAN BE LEFT BLANK IF NOT NEEDED OR USED ;
%LET Elg      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Legal Timeline Categories\ELG;
%LET Rec       = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Legal Timeline Categories\REC;
%LET ElgE      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Effective Date Categories\ELG;
%LET RecE       = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Raw\Effective Date Categories\REC;
** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1        = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
2\Processed;
** NAME FORMAT LIBRARY DIRECTORY ;
                                     342


%LET FMTDIR              = ;
** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
%LET PURPOSE1 = Differnce in differnce and event study design to estimate the effect
of cannabis legalization on cannabis incidence
/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
/*This section recreates the various 2x2 plots that can be created with these data
                                                 343


meant to replicate figure 2 in Goodman-Bacon's 2018 seminal working paper
       also easy visual check for parrallel trends assumption*/
proc freq data = sav.Legal_date;
       table legal_cat age / list missing;
run;
proc format;
       value $legal
        'Legalized_2012'='Legalized in 2012'
        'Legalized_2014'='Legalized in 2014'
        'Legalized_2016'='Legalized in 2016'
        'Legalized_2018'='Legalized in 2018';
        value $bin
         'Legalized_2012'='Legal'
        'Legalized_2014'='Legal'
        'Legalized_2016'='Legal'
        'Legalized_2018'='Legal';
                                        344


run;
proc sgplot data=sav.Legal_date ;
         title3'Cannabis incidence in 21+ age group, first wave legalizing states vs
untreated states';
         series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legalized_2012','Illegal') and age = "21_Plus";
   refline "2012 to 2013" / axis=x label="First wave of cannabis legalization"
lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date ;
         title3'Cannabis incidence in 21+ age group, second wave legalizing states vs
untreated states';
         series x=years y=incidence / lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legalized_2014','Illegal') and age = "21_Plus";
                                         345


   refline "2014 to 2015" / axis=x label="Second wave of cannabis legalization"
lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date ;
         title3'Cannabis incidence in 21+ age group, third wave legalizing states vs
untreated states';
         series x=years y=incidence / lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legalized_2016','Illegal') and age = "21_Plus";
   refline "2016 to 2017" / axis=x label="Third wave of cannabis legalization"
lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
                                         346


proc sgplot data=sav.Legal_date ;
        title3'Cannabis incidence in 21+ age group, first wave legalizing states vs
second wave legalizing states';
        series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
        where legal_cat in ('Legalized_2012','Legalized_2014') and age = "21_Plus";
  refline "2012 to 2013" "2014 to 2015"/ axis=x lineattrs=(color=green);
        xaxis grid label='NSDUH year-pairs';
  yaxis grid label='Newly incident cannabis use' discreteorder=data;
        format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date ;
        title3'Cannabis incidence in 21+ age group, first wave legalizing states vs third
wave legalizing states';
        series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
        where legal_cat in ('Legalized_2012','Legalized_2016') and age = "21_Plus";
  refline "2012 to 2013" "2016 to 2017"/ axis=x lineattrs=(color=green);
        xaxis grid label='NSDUH year-pairs';
  yaxis grid label='Newly incident cannabis use' discreteorder=data;
                                        347


         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date ;
         title3'Cannabis incidence in 21+ age group, second wave legalizing states vs
third wave legalizing states';
         series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legalized_2014','Legalized_2016') and age = "21_Plus";
   refline "2014 to 2015" "2016 to 2017"/ axis=x;
          xaxis grid label='Time';
   yaxis grid label='Past year cannabis use incidence' discreteorder=data;
run;
/***********************************************/
/*                          legal date analysis Plots         */
/***********************************************/
ods graphics on;
                                                 348


title1 "Simple Incidence Plots";
title3 "All ages";
proc sgplot data=sav.Legal_date;
         where age="Overall";
    series x=years y=incidence / group=legal_cat;
run;
title3 "12-20";
proc sgplot data=sav.Legal_date;
         where age="12_20";
    series x=years y=incidence / group=legal_cat;
run;
title3 "21+";
proc sgplot data=sav.Legal_date;
         where age="21_Plus";
    series x=years y=incidence / group=legal_cat;
run;
                                       349


title3 "21+";
proc sgplot data=sav.Legal_date;
         where age="21_Plus";
    series x=time y=incidence / group=legal_cat;
run;
proc freq data=sav.Legal_date;
         table legal_cat*years*incidence / list missing;
         where age = "21_Plus";
run;
/*Establish baseline*/
proc freq data=sav.Legal_date;
         table legal_cat* / list missing;
         where age = "21_Plus";
run;
                                          350


proc means data=sav.Legal_date mean;
         title3'Average incidence in 2 year period prior to legalization';
         var incidence;
         class age legal_cat ;
         where time=-2;
         format legal_cat $bin.;
run;
proc means data=sav.Legal_date mean;
         title3'Average incidence where illegal';
         var incidence;
         class age legal_cat ;
         where legal_cat="Illegal";
         format legal_cat $bin.;
run;
title1 "Regression modelling";
                                         351


title4 ;
proc sort data=sav.Legal_date;
         by years;
run;
ods graphics on;
proc glm data=sav.Legal_date ;
         title3 "Regression for panel event study, all ages";
         absorb years;
         class legal_cat (ref="Illegal") tminus10(ref="0") tminus8(ref="0") tminus6(ref="0")
tminus4(ref="0") t0(ref="0") t2(ref="0") t4(ref="0") t6(ref="0");
         where age="Overall";
         model incidence = legal_cat tminus10 tminus8 tminus6 tminus4 t0 t2 t4 t6
/solution CLPARM ;
         ods output ParameterEstimates = ParamEsttotal;
run;
proc glm data=sav.Legal_date ;
                                           352


        title3 "Regression for panel event study, 12 to 20 year olds";
        absorb years;
        class legal_cat (ref="Illegal") tminus10(ref="0") tminus8(ref="0") tminus6(ref="0")
tminus4(ref="0") t0(ref="0") t2(ref="0") t4(ref="0") t6(ref="0");
        where age="12_20";
        model incidence = legal_cat tminus10 tminus8 tminus6 tminus4 t0 t2 t4 t6
/solution CLPARM ;
        ods output ParameterEstimates = ParamEstunderage;
run;
proc glm data=sav.Legal_date ;
        title3 "Regression for event study, ages 21 and up";
        absorb years;
        class legal_cat (ref="Illegal") tminus10(ref="0") tminus8(ref="0") tminus6(ref="0")
tminus4(ref="0") t0(ref="0") t2(ref="0") t4(ref="0") t6(ref="0");
        where age="21_Plus";
        model incidence = legal_cat tminus10 tminus8 tminus6 tminus4 t0 t2 t4 t6
/solution CLPARM ;
        ods output ParameterEstimates = ParamEst21;
                                          353


run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
        where StdErr >. and Parameter in ("tminus10 1",
"tminus8 1",
"tminus6 1",
"tminus4 1",
"t0      1",
"t2      1",
"t4      1",
"t6      1");
run;
data ParamEstunderage1;
        set ParamEstunderage;
                                     354


       where StdErr >. and Parameter in ("tminus10 1",
"tminus8 1",
"tminus6 1",
"tminus4 1",
"t0     1",
"t2     1",
"t4     1",
"t6     1");
run;
data ParamEst211;
       set ParamEst21;
       where StdErr >. and Parameter in ("tminus10 1",
"tminus8 1",
"tminus6 1",
"tminus4 1",
"t0     1",
"t2     1",
                                    355


"t4        1",
"t6        1");
run;
title1 "Plot coefficients";
proc sgplot data=ParamEsttotal1 noautolegend;
          title3'Effect of time since legalization on incidence';
          title4'all ages, fixed effects for time and state categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid;
    yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEstunderage1 noautolegend;
          title3'Effect of time since legalization on incidence';
                                              356


        title4'aged 12 to 20, fixed effects for time and state categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid;
  yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEst211 noautolegend;
        title3'Effect of time since cannabis legalization on cannabis incidence';
        title4'ages 21 and up, fixed effects for time and state categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid;
  yaxis grid display=(nolabel) discreteorder=data;
run;
                                         357


/***********************************************/
/*                          Simple DD estimate ATT for 21+ */
/***********************************************/
proc sort data=sav.Legal_date ;
         by years;
run;
proc glm data=sav.Legal_date ;
         title3 "Regression for event study, ages 21 and up";
         absorb years;
         class legal_cat (ref="Illegal") legal(ref="0");
         where age="21_Plus";
         model incidence = legal_cat legal /solution CLPARM ;
run;
/*small because includes data from before effective date...*/
proc glm data=sav.Legal_date ;
         title3 "Regression for event study, ages 21 and up";
                                                 358


       absorb years;
       class legal_cat (ref="Illegal") effective(ref="0");
       where age="21_Plus";
       model incidence = legal_cat effective /solution CLPARM ;
run;
proc glm data=sav.Legal_date ;
       title3 "Regression for event study, 20 and younger";
       absorb years;
       class legal_cat (ref="Illegal") legal(ref="0");
       where age="12_20";
       model incidence = legal_cat legal /solution CLPARM ;
run;
proc glm data=sav.Legal_date ;
       title3 "Regression for event study, 20 and younger";
       absorb years;
       class legal_cat (ref="Illegal") effective(ref="0");
                                          359


         where age="12_20";
         model incidence = legal_cat effective /solution CLPARM ;
run;
/*************************/
/*effective date analysis*/
/*************************/
title1 "Simple Incidence Plots to check for parallel trends";
title3 "All ages";
proc sgplot data=sav.Effective_date;
         where age="Overall";
    series x=years y=incidence / group=legal_cat;
run;
title3 "12-20";
proc sgplot data=sav.Effective_date;
         where age="12_20";
                                        360


    series x=years y=incidence / group=legal_cat;
run;
title3 "21+";
proc sgplot data=sav.Effective_date;
         where age="21_Plus";
    series x=years y=incidence / group=legal_cat;
run;
title1 "Regression modelling by effective date";
title4 ;
proc sort data=sav.Effective_date;
         by years;
run;
ods graphics on;
proc glm data=sav.Effective_date ;
                                        361


       title3 "Regression for panel event study, all ages";
       absorb years;
       class legal_cat (ref="Illegal") tminus10(ref="0") tminus9(ref="0") tminus8(ref="0")
tminus7(ref="0") tminus6(ref="0") tminus5(ref="0") tminus4(ref="0") tminus3(ref="0")
tminus2(ref="0") t0(ref="0") t1(ref="0") t2(ref="0") t3(ref="0") t4(ref="0") ;
       where age="Overall";
       model incidence = legal_cat tminus10 tminus9 tminus8 tminus7 tminus6
tminus5 tminus4 tminus3 tminus2 t0 t1 t2 t3 t4 /solution CLPARM ;
       ods output ParameterEstimates = ParamEsttotal;
run;
proc glm data=sav.Effective_date ;
       title3 "Regression for panel event study, 12 to 20 year olds";
       absorb years;
       class legal_cat (ref="Illegal") tminus10(ref="0") tminus9(ref="0") tminus8(ref="0")
tminus7(ref="0") tminus6(ref="0") tminus5(ref="0") tminus4(ref="0") tminus3(ref="0")
tminus2(ref="0") t0(ref="0") t1(ref="0") t2(ref="0") t3(ref="0") t4(ref="0") ;
       where age="12_20";
       model incidence = legal_cat tminus10 tminus9 tminus8 tminus7 tminus6
tminus5 tminus4 tminus3 tminus2 t0 t1 t2 t3 t4 /solution CLPARM ;
                                         362


        ods output ParameterEstimates = ParamEstunderage;
run;
proc glm data=sav.Effective_date ;
        title3 "Regression for event study, ages 21 and up";
        absorb years;
        class legal_cat (ref="Illegal") tminus10(ref="0") tminus9(ref="0") tminus8(ref="0")
tminus7(ref="0") tminus6(ref="0") tminus5(ref="0") tminus4(ref="0") tminus3(ref="0")
tminus2(ref="0") t0(ref="0") t1(ref="0") t2(ref="0") t3(ref="0") t4(ref="0") ;
        where age="21_Plus";
        model incidence = legal_cat tminus10 tminus9 tminus8 tminus7 tminus6
tminus5 tminus4 tminus3 tminus2 t0 t1 t2 t3 t4 /solution CLPARM ;
        ods output ParameterEstimates = ParamEst21;
run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
                                          363


      where StdErr >. and Parameter in ("tminus10 1",
      "tminus9 1",
      "tminus8 1",
      "tminus7 1",
      "tminus6 1",
      "tminus5 1",
      "tminus4 1",
      "tminus3 1",
      "tminus2 1",
      "tminus1 1",
      "t0    1",
      "t1    1",
      "t2    1",
      "t3    1",
      "t4    1");
run;
data ParamEstunderage1;
                                   364


     set ParamEstunderage;
     where StdErr >. and Parameter in ("tminus10 1",
     "tminus9 1",
     "tminus8 1",
     "tminus7 1",
     "tminus6 1",
     "tminus5 1",
     "tminus4 1",
     "tminus3 1",
     "tminus2 1",
     "tminus1 1",
     "t0     1",
     "t1     1",
     "t2     1",
     "t3     1",
     "t4     1");
run;
                                  365


data ParamEst211;
      set ParamEst21;
      where StdErr >. and Parameter in ("tminus10 1",
      "tminus9 1",
      "tminus8 1",
      "tminus7 1",
      "tminus6 1",
      "tminus5 1",
      "tminus4 1",
      "tminus3 1",
      "tminus2 1",
      "tminus1 1",
      "t0     1",
      "t1     1",
      "t2     1",
      "t3     1",
      "t4     1");
run;
                                   366


title1 "Plot coefficients";
proc sgplot data=ParamEsttotal1 noautolegend;
          title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence';
          title4'all ages, fixed effects for time and state categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid;
    yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEstunderage1 noautolegend;
          title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence';
          title4'ages 12 to 20, fixed effects for time and state categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
                                              367


         markerattrs=(symbol=diamondfilled);
   refline 0 / axis=y;
   xaxis grid;
   yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEst211 noautolegend;
         title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence';
         title4'ages 21 and up, fixed effects for time and state categories';
         scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
         markerattrs=(symbol=diamondfilled);
   refline 0 / axis=y;
   xaxis grid;
   yaxis grid display=(nolabel) discreteorder=data;
run;
/*****************************************/
                                            368


/*analyze legal date and effective
  date with more balanced lags and leads */
/*****************************************/
title1 "Regression modelling with balanced leads and lags";
title4 'Legal dates';
proc sort data=sav.Legal_date;
         by years;
run;
proc print data=sav.Legal_date; run;
ods graphics on;
proc glm data=sav.Legal_date ;
         title3 "Regression for panel event study, all ages";
         absorb years;
                                            369


        class legal_cat (ref="Illegal") tlt6(ref="0") tminus4(ref="0") t0(ref="0") t2(ref="0")
tgt4(ref="0");
        where age="Overall";
        model incidence = legal_cat tlt6 tminus4 t0 t2 tgt4 /solution CLPARM ;
        ods output ParameterEstimates = ParamEsttotal;
run;
proc glm data=sav.Legal_date ;
        title3 "Regression for panel event study, 12 to 20 year olds";
        absorb years;
        class legal_cat (ref="Illegal") tlt6(ref="0") tminus4(ref="0") t0(ref="0") t2(ref="0")
tgt4(ref="0");
        where age="12_20";
        model incidence = legal_cat tlt6 tminus4 t0 t2 tgt4 /solution CLPARM ;
        ods output ParameterEstimates = ParamEstunderage;
run;
proc glm data=sav.Legal_date ;
        title3 "Regression for event study, ages 21 and up";
                                           370


        absorb years;
        class legal_cat (ref="Illegal") tlt6(ref="0") tminus4(ref="0") t0(ref="0") t2(ref="0")
tgt4(ref="0");
        where age="21_Plus";
        model incidence = legal_cat tlt6 tminus4 t0 t2 tgt4 /solution CLPARM ;
        ods output ParameterEstimates = ParamEst21;
run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
        where StdErr >. and Parameter in ("tlt6         1",
        "tminus4 1",
"t0      1",
"t2      1",
"tgt4     1");
run;
                                           371


data ParamEstunderage1;
      set ParamEstunderage end=eof;
      where StdErr >. and Parameter in ("tlt6 1",
      "tminus4 1",
      "t0     1",
      "t2     1",
      "tgt4    1");
      if parameter = "tlt6   1" then do;
              parameter = '6+ years prior' ;
              order=1;
      end;
      if parameter = "tminus4 1" then do;
              parameter = '4 years prior';
              order=2;
      end;
      if parameter = "t0     1" then do;
              parameter = 'Legalized';
                                      372


         order=4;
end;
if parameter = "t2      1" then do;
         parameter = '2 years after';
         order=5;
end;
if parameter = "tgt4     1" then do;
         parameter = '4+ years after';
         order=6;
end;
if eof then do;
output;
         parameter = '2 years prior';
         estimate=0;
         stderr=0;
         tvalue=0;
         probt=0;
         lowercl=0;
                                 373


              uppercl=0;
              order=3;
      end;
      output;
run;
proc sort data=ParamEstunderage1;
      by order;
run;
data ParamEst211;
      set ParamEst21 end=eof;
      where StdErr >. and Parameter in ("tlt6 1",
      "tminus4 1",
      "t0     1",
      "t2     1",
      "tgt4    1");
                                   374


if parameter = "tlt6   1" then do;
        parameter = '6+ years prior' ;
        order=1;
end;
if parameter = "tminus4 1" then do;
        parameter = '4 years prior';
        order=2;
end;
if parameter = "t0     1" then do;
        parameter = 'Legalized';
        order=4;
end;
if parameter = "t2     1" then do;
        parameter = '2 years after';
        order=5;
end;
if parameter = "tgt4    1" then do;
        parameter = '4+ years after';
                                375


               order=6;
      end;
      if eof then do;
      output;
               parameter = '2 years prior';
               estimate=0;
               stderr=0;
               tvalue=0;
               probt=0;
               lowercl=0;
               uppercl=0;
               order=3;
      end;
      output;
run;
proc sort data=ParamEst211;
      by order;
                                       376


run;
title1 "Plot coefficients";
proc sgplot data=ParamEsttotal1 noautolegend;
          title3'Effect of time since legalization on incidence';
          title4'all ages, balanced leads and lags, fixed effects for time and state
categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid;
    yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEstunderage1 noautolegend;
          title3'Effect of time since legalization on incidence';
          title4'aged 12 to 20, balanced leads and lags, fixed effects for time and state
categories';
                                             377


        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid label='Time relative to legalization';
  yaxis grid label='Newly incident cannabis use' discreteorder=data;
run;
proc sgplot data=ParamEst211 noautolegend;
        title3'Effect of time since cannabis legalization on cannabis incidence';
        title4'ages 21 and up, balanced leads and lags, fixed effects for time and state
categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid label='Time relative to legalization';
  yaxis grid label='Newly incident cannabis use' discreteorder=data;
run;
                                         378


title1 "Regression modelling with balanced leads and lags";
title4 'Effective dates';
proc sort data=sav.Effective_date;
         by years;
run;
ods graphics on;
proc glm data=sav.Effective_date ;
         title3 "Regression for panel event study, all ages";
         absorb years;
         class legal_cat (ref="Illegal") tlt6(ref="0") tminus5(ref="0") tminus4(ref="0")
tminus3(ref="0") tminus2(ref="0") t0(ref="0") t1(ref="0") tgt2(ref="0") ;
         where age="Overall";
         model incidence = legal_cat tlt6 tminus5 tminus4 tminus3 tminus2 t0 t1 tgt2
/solution CLPARM ;
         ods output ParameterEstimates = ParamEsttotal;
run;
                                            379


proc glm data=sav.Effective_date ;
        title3 "Regression for panel event study, 12 to 20 year olds";
        absorb years;
        class legal_cat (ref="Illegal") tlt6(ref="0") tminus5(ref="0") tminus4(ref="0")
tminus3(ref="0") tminus2(ref="0") t0(ref="0") t1(ref="0") tgt2(ref="0") ;
        where age="12_20";
        model incidence = legal_cat tlt6 tminus5 tminus4 tminus3 tminus2 t0 t1 tgt2
/solution CLPARM ;
        ods output ParameterEstimates = ParamEstunderage;
run;
proc glm data=sav.Effective_date ;
        title3 "Regression for event study, ages 21 and up";
        absorb years;
        class legal_cat (ref="Illegal") tlt6(ref="0") tminus5(ref="0") tminus4(ref="0")
tminus3(ref="0") tminus2(ref="0") t0(ref="0") t1(ref="0") tgt2(ref="0") ;
        where age="21_Plus";
        model incidence = legal_cat tlt6 tminus5 tminus4 tminus3 tminus2 t0 t1 tgt2
/solution CLPARM ;
                                           380


        ods output ParameterEstimates = ParamEst21;
run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
        where StdErr >. and Parameter in ("tlt6 1",
        "tminus5 1",
        "tminus4 1",
        "tminus3 1",
        "tminus2 1",
        "tminus1 1",
        "t0     1",
        "t1     1",
        "tgt2    1");
run;
                                     381


data ParamEstunderage1;
      set ParamEstunderage;
      where StdErr >. and Parameter in ("tlt6 1",
      "tminus5 1",
      "tminus4 1",
      "tminus3 1",
      "tminus2 1",
      "tminus1 1",
      "t0     1",
      "t1     1",
      "tgt2    1");
run;
data ParamEst211;
      set ParamEst21;
      where StdErr >. and Parameter in ("tlt6 1",
      "tminus5 1",
      "tminus4 1",
                                   382


          "tminus3 1",
          "tminus2 1",
          "tminus1 1",
          "t0       1",
          "t1       1",
          "tgt2      1");
run;
title1 "Plot coefficients";
proc sgplot data=ParamEsttotal1 noautolegend;
          title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence';
          title4'all ages, balanced leads and lags, fixed effects for time and state
categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid;
                                            383


  yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEstunderage1 noautolegend;
        title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence';
        title4'ages 12 to 20, balanced leads and lags, fixed effects for time and state
categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid;
  yaxis grid display=(nolabel) discreteorder=data;
run;
proc sgplot data=ParamEst211 noautolegend;
        title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence';
                                          384


          title4'ages 21 and up, balanced leads and lags, fixed effects for time and state
categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid;
    yaxis grid display=(nolabel) discreteorder=data;
run;
/*****************************************/
/*/*/*/*Repeat procedure for prevalence to */*/*/
/*/*/*compare to prior results from Cerda */*/*/
/*/*/*/*/*/*/*/    and others */*/*/*/*/*/*/*/*/
/*****************************************/;
title1 "Regression modelling with balanced leads and lags";
title4 'Prevalence by Legal dates';
                                             385


proc sort data=sav.Prevalence;
       by years;
run;
ods graphics on;
proc glm data=sav.Prevalence ;
       title3 "Regression for panel event study, all ages";
       absorb years;
       class legal_cat (ref="Illegal") tlt4(ref="0") t0(ref="0") t2(ref="0") tgt4(ref="0");
       where age="Overall";
       model pmmj_phat = legal_cat tlt4 t0 t2 tgt4 /solution CLPARM ;
       ods output ParameterEstimates = ParamEsttotal;
run;
proc glm data=sav.Prevalence ;
       title3 "Regression for panel event study, 12 to 20 year olds";
       absorb years;
       class legal_cat (ref="Illegal") tlt4(ref="0") t0(ref="0") t2(ref="0") tgt4(ref="0");
                                          386


        where age="12_20";
        model pmmj_phat = legal_cat tlt4 t0 t2 tgt4 /solution CLPARM ;
        ods output ParameterEstimates = ParamEstunderage;
run;
proc glm data=sav.Prevalence ;
        title3 "Regression for event study, ages 21 and up";
        absorb years;
        class legal_cat (ref="Illegal") tlt4(ref="0") t0(ref="0") t2(ref="0") tgt4(ref="0");
        where age="21_Plus";
        model pmmj_phat = legal_cat tlt4 t0 t2 tgt4 /solution CLPARM ;
        ods output ParameterEstimates = ParamEst21;
run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
                                           387


      where StdErr >. and Parameter in ("tlt4 1",
"t0    1",
"t2    1",
"tgt4   1");
run;
data ParamEstunderage1;
      set ParamEstunderage end=eof;
      where StdErr >. and Parameter in ("tlt4 1",
"t0    1",
"t2    1",
"tgt4   1");
      if parameter = "tlt4  1" then do;
              parameter = '4+ years prior';
              order=1;
      end;
      if parameter = "t0    1" then do;
                                     388


         parameter = 'Legalized';
         order=3;
end;
if parameter = "t2      1" then do;
         parameter = '2 years after';
         order=4;
end;
if parameter = "tgt4     1" then do;
         parameter = '4+ years after';
         order=5;
end;
if eof then do;
output;
         parameter = '2 years prior';
         estimate=0;
         stderr=0;
         tvalue=0;
         probt=0;
                                 389


              lowercl=0;
              uppercl=0;
              order=2;
      end;
      output;
run;
proc sort data= ParamEstunderage1;
      by order;
run;
data ParamEst211;
      set ParamEst21 end=eof;
      where StdErr >. and Parameter in ("tlt4 1",
"t0    1",
"t2    1",
"tgt4   1");
                                   390


         if parameter = "tlt4   1" then do;
         parameter = '4+ years prior';
         order=1;
end;
if parameter = "t0       1" then do;
         parameter = 'Legalized';
         order=3;
end;
if parameter = "t2       1" then do;
         parameter = '2 years after';
         order=4;
end;
if parameter = "tgt4      1" then do;
         parameter = '4+ years after';
         order=5;
end;
if eof then do;
output;
                                  391


                 parameter = '2 years prior';
                 estimate=0;
                 stderr=0;
                 tvalue=0;
                 probt=0;
                 lowercl=0;
                 uppercl=0;
                 order=2;
         end;
         output;
run;
proc sort data= ParamEst211;
         by order;
run;
title1 "Plot coefficients";
                                         392


proc sgplot data=ParamEsttotal1 noautolegend;
        title3'Effect of time since legalization on past month cannabis prevalence';
        title4'all ages, balanced leads and lags, fixed effects for time and state
categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid label='Time relative to legalization';
  yaxis grid label='Past-month cannabis use prevalence' discreteorder=data;
run;
proc sgplot data=ParamEstunderage1 noautolegend;
        title3'Effect of time since legalization on past month cannabis prevalence';
        title4'aged 12 to 20, balanced leads and lags, fixed effects for time and state
categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid label='Time relative to legalization';
                                           393


   yaxis grid label='Past-month cannabis use prevalence' discreteorder=data;
run;
proc sgplot data=ParamEst211 noautolegend;
         title3'Effect of time since cannabis legalization on past month cannabis
prevalence';
         title4'ages 21 and up, balanced leads and lags, fixed effects for time and state
categories';
         scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
         markerattrs=(symbol=diamondfilled);
   refline 0 / axis=y;
   xaxis grid label='Time relative to legalization';
   yaxis grid label='Past-month cannabis use prevalence' discreteorder=data;
run;
/*Additional checks*/
proc means data=sav.Legal_date;
         title3'What is going on at t minus 6 in the legalization data?';
                                          394


        var incidence ;
        class age tminus6;
run;
data r;
        set sav.Legal_date;
        /*alternate specification for replication in r*/
        if legal_cat = "Illegal" then legal=0;
        else if legal_cat = "Overall" then legal=.;
        else if time=0 and legal_cat ~= "Illegal" then legal=1;
        else if time<0 then legal=0;
        else if time>0 then legal=1;
        label    legal = "cannabis leagality binary";
run;
                                           395


proc freq data=r;
       title3'check legal binary creation';
       table time*years*legal_cat*legal/list missing;
run;
proc glm data=r ;
       title3 "Regression for event study, altrernate specification, ages 21 and up";
       class legal (ref="0") time (ref="-2");
       where age="21_Plus";
       model incidence = legal*time /solution CLPARM ;
       ods output ParameterEstimates = ParamEst21;
run;
proc sort data=ParamEst21;
       by Parameter ;
run;
proc sgplot data=ParamEst21 noautolegend;
                                         396


         title3'Effect of time since recreational cannabis dispensaries become operational
on cannabis incidence, 21+ alternate specification';
         scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
         markerattrs=(symbol=diamondfilled);
   refline 0 / axis=y;
   xaxis grid;
   yaxis grid display=(nolabel) discreteorder=data;
run;
/*****************************************/
/* analyze legal date with placebo RCL */
/*****************************************/
proc sort data=sav.placebo;
         by years;
run;
ods graphics on;
                                            397


proc glm data=sav.placebo ;
        title3 "PLACEBO Regression for panel event study, all ages";
        absorb years;
        class placebo_legal (ref="Illegal") ptminus3(ref="0") pt0(ref="0") pt1(ref="0")
pt3(ref="0") pt5(ref="0") pt7(ref="0");
        where age="Overall";
        model incidence = placebo_legal ptminus3 pt0 pt1 pt3 pt5 pt7 /solution
CLPARM ;
        ods output ParameterEstimates = ParamEsttotal;
run;
proc glm data=sav.placebo ;
        title3 "PLACEBO Regression for panel event study, 12 to 20 year olds";
        absorb years;
        class placebo_legal (ref="Illegal") ptminus3(ref="0") pt0(ref="0") pt1(ref="0")
pt3(ref="0") pt5(ref="0") pt7(ref="0");
        where age="12_20";
        model incidence = placebo_legal ptminus3 pt0 pt1 pt3 pt5 pt7 /solution
CLPARM ;
                                        398


        ods output ParameterEstimates = ParamEstunderage;
run;
proc glm data=sav.placebo ;
        title3 "PLACEBO Regression for event study, ages 21 and up";
        absorb years;
        class placebo_legal (ref="Illegal") ptminus3(ref="0") pt0(ref="0") pt1(ref="0")
pt3(ref="0") pt5(ref="0") pt7(ref="0");
        where age="21_Plus";
        model incidence = placebo_legal ptminus3 pt0 pt1 pt3 pt5 pt7 /solution
CLPARM ;
        ods output ParameterEstimates = ParamEst21;
run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
        where StdErr >. and Parameter in ('ptminus3       1',
                                        399


'ptminus1      1',
'pt0       1',
'pt1       1',
'pt3       1',
'pt5       1');
run;
data ParamEstunderage1;
       set ParamEstunderage end=eof;
       where StdErr >. and Parameter in ('ptminus3 1',
       'pt0        1',
       'pt1        1',
       'pt3        1',
       'pt5        1',
       'pt7        1');
       if parameter = 'ptminus3     1' then do;
                parameter = '3 years prior' ;
                                        400


        order=1;
end;
if parameter = 'pt1      1' then do;
        parameter = '1 year after';
        order=3;
end;
if parameter = 'pt3      1' then do;
        parameter = '3 years after';
        order=4;
end;
if parameter = 'pt5      1' then do;
        parameter = '5 years after';
        order=5;
end;
if parameter = 'pt7      1' then do;
        parameter = '7 years after';
        order=6;
end;
                                401


      if eof then do;
      output;
               parameter = '1 year prior';
               estimate=0;
               stderr=0;
               tvalue=0;
               probt=0;
               lowercl=0;
               uppercl=0;
               order=2;
      end;
      output;
run;
proc sort data=ParamEstunderage1;
      by order;
run;
                                       402


data ParamEst211;
      set ParamEst21 end=eof;
      where StdErr >. and Parameter in ('ptminus3 1',
      'pt0       1',
      'pt1       1',
      'pt3       1',
      'pt5       1',
      'pt7       1');
      if parameter = 'ptminus3     1' then do;
              parameter = '3 years prior' ;
              order=1;
      end;
      if parameter = 'pt1      1' then do;
              parameter = '1 year after';
              order=3;
      end;
      if parameter = 'pt3      1' then do;
                                      403


         parameter = '3 years after';
         order=4;
end;
if parameter = 'pt5       1' then do;
         parameter = '5 years after';
         order=5;
end;
if parameter = 'pt7       1' then do;
         parameter = '7 years after';
         order=6;
end;
if eof then do;
output;
         parameter = '1 year prior';
         estimate=0;
         stderr=0;
         tvalue=0;
         probt=0;
                                 404


                  lowercl=0;
                  uppercl=0;
                  order=2;
         end;
         output;
run;
proc sort data=ParamEst211;
         by order;
run;
title1 "Plot coefficients";
proc sgplot data=ParamEsttotal1 noautolegend;
         title3'PLACEBO Effect of time since legalization on incidence';
         title4'all ages, fixed effects for time and state categories';
         scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
                                             405


        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid label='Time relative to placebo legalization';
  yaxis grid label='Past year cannabis use incidence' discreteorder=data;
run;
proc sgplot data=ParamEstunderage1 noautolegend;
        title3'PLACEBO Effect of time since legalization on incidence';
        title4'aged 12 to 20, fixed effects for time and state categories';
        scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
        markerattrs=(symbol=diamondfilled);
  refline 0 / axis=y;
  xaxis grid label='Time relative to placebo legalization';
  yaxis grid label='Newly incident cannabis use' discreteorder=data;
run;
proc sgplot data=ParamEst211 noautolegend;
                                         406


          title3'PLACEBO Effect of time since cannabis legalization on cannabis
incidence';
          title4'ages 21 and up, fixed effects for time and state categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid label='Time relative to placebo legalization';
    yaxis grid label='Newly incident cannabis use' discreteorder=data;
run;
/*******************************************************************************************************
* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE ;
title1'Dissertation';
                                                 407


title2'Aim 3: The LMA Effect';
/****************************************************************************************
* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored             *
* &ProgName             – the name of the .egp file, without the extension                       *
* &ProgNode            - the name of the code node                                         *
* &ProgDir          – the path to the folder in which the .egp file is stored.               *
*****************************************************************************************/
/*****************************************************************************
* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                             *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax                     *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
                                                 408


** PROGRAMMER'S NAME ;
%LET PROGRAMMER          = Barrett Montgomery;
** DEFINE ALL NON-SAS FILES CALLED IN YOUR PROGRAM AS MACRO VARIABLES
by Drug;
** THESE CAN BE LEFT BLANK IF NOT NEEDED OR USED ;
%LET Elg     = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
3\Raw\ELG;
%LET Rec      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
3\Raw\REC;
** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1      = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
3\Processed;
** NAME FORMAT LIBRARY DIRECTORY ;
                                    409


%LET FMTDIR              = ;
** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
%LET PURPOSE1 = Difference in difference event study design to estimate the effect
of RCL on incidence for 21 year olds (LMA);
/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
/*theroetical power calc*/
                                                 410


ods graphics on;
proc power;
        twosamplefreq alpha=.05 sides=2 test=pchi
        relativerisk= 1.5 to 3.0 by .05
        refproportion=.08
        groupweights=(1 2)
        ntotal= .
        power=.8;
        plot x=effect xopts=(crossref=yes ref=.70 .75 .80 .85 .90) ;
run;
ods graphics off;
/*this macro imports all data in a file with 2 options, the folder directory, and the type of
file
here, folder is saved in a macro variable named above and file type is csv.*/
title1"Download all data, append, and organize";
%macro drive(dir,ext);
                                          411


%local cnt filrf rc did memcnt name;
%let cnt=0;
%let filrf=mydir;
%let rc=%sysfunc(filename(filrf,&dir));
%let did=%sysfunc(dopen(&filrf));
%if &did ne 0 %then %do;
%let memcnt=%sysfunc(dnum(&did));
%do i=1 %to &memcnt;
 %let name=%qscan(%qsysfunc(dread(&did,&i)),-1,.);
 %if %qupcase(%qsysfunc(dread(&did,&i))) ne %qupcase(&name) %then %do;
  %if %superq(ext) = %superq(name) %then %do;
    %let cnt=%eval(&cnt+1);
    %put %qsysfunc(dread(&did,&i));
    proc import datafile="&dir\%qsysfunc(dread(&did,&i))" out=dsn&cnt
                                     412


        dbms=csv replace;
       run;
     %end;
    %end;
   %end;
    %end;
  %else %put &dir cannot be open.;
  %let rc=%sysfunc(dclose(&did));
 %mend drive;
/*read in eligibility data for leglaization dates*/
 %drive(&Elg,csv)
data dsn1;
        set dsn1;
        years="2008 to 2009";
                                            413


run;
data dsn2;
       set dsn2;
       years="2010 to 2011";
run;
data dsn3;
       set dsn3;
       years="2012 to 2013";
run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
                             414


       set dsn5;
       years="2016 to 2017";
run;
data dsn6;
       set dsn6;
       years="2018 to 2019";
run;
data Elig;
       set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
       eligible = catx('', 'eligible for past year initiatio'n, 'rc-eligible for past year initia'n);
run;
proc freq data=Elig;
       title3'check to see that new var works';
       table eligible*'eligible for past year initiatio'n* 'rc-eligible for past year initia'n/list
missing;
                                           415


run;
/*read in for recmj data*/
%drive(&Rec,csv)
data dsn1;
        set dsn1;
        years="2008 to 2009";
run;
data dsn2;
        set dsn2;
        years="2010 to 2011";
run;
data dsn3;
        set dsn3;
        years="2012 to 2013";
                              416


run;
data dsn4;
       set dsn4;
       years="2014 to 2015";
run;
data dsn5;
       set dsn5;
       years="2016 to 2017";
run;
data dsn6;
       set dsn6;
       years="2018 to 2019";
run;
data Rec;
                             417


          set dsn1 dsn2 dsn3 dsn4 dsn5 dsn6;
          recent_initiate = catx('','rc-past year initiate of marijua'n, 'rc-recent initiate of
marijuana'n, 'recent initiate of marijuana use'n);
run;
proc freq data=rec;
          title3'check to see that new var works';
          table recent_initiate*'rc-past year initiate of marijua'n*'rc-recent initiate of
marijuana'n*'recent initiate of marijuana use'n/list missing;
run;
/*cleaning*/
/*prepare susbets of data where eligibility and rec = yes
to merge and create incidence estimates
 first for legal date then for effective dates*/
data elig1;
          set elig;
          where eligible = "1 - Yes" and 'final edited age'n = '21';
                                            418


         drop row: total: 'eligible for past year initiatio'n 'rc-eligible for past year initia'n
eligible 'Column % CI (lower)'n 'Column % CI (upper)'n;
         rename 'STATE NAME'n = legal_cat;
         rename 'final edited age'n = age;
         rename 'Column %'n = elig_phat;
         rename 'Column % SE'n = elig_phat_SE;
         rename 'Weighted Count'n = elig_count;
         rename 'Count SE'n = elig_count_se;
run;
data rec1;
         set rec;
         where recent_initiate = "1 - Yes" and 'final edited age'n = '21';
         drop row: total: 'rc-past year initiate of marijua'n 'rc-recent initiate of marijuana'n
'recent initiate of marijuana use'n recent_initiate 'Column % CI (lower)'n 'Column % CI
(upper)'n;
                                           419


       rename 'STATE NAME'n = legal_cat;
       rename 'final edited age'n = age;
       rename 'Column %'n = rec_phat;
       rename 'Column % SE'n = rec_phat_SE;
       rename 'Weighted Count'n = rec_count;
       rename 'Count SE'n = rec_count_se;
run;
proc sort data=elig1;
       by years legal_cat age ;
run;
proc sort data=rec1;
       by years legal_cat age ;
run;
/*estimate incidence by age group, state group, and years*/
                                      420


data all_data;
        merge elig1 rec1;
        by years legal_cat age ;
        length time year_n 8.;
        if legal_cat="Not_Legalized_" then legal_cat="Illegal";
        incidence = rec_count/elig_count;
/*fixed effects for years as categorical may function differently
        if years was considered a continuous variable, year_n is to check this*/
        if years = '2008 to 2009' then year_n = 2008;
        else if years = '2010 to 2011' then year_n = 2010;
        else if years = '2012 to 2013' then year_n = 2012;
        else if years = '2014 to 2015' then year_n = 2014;
        else if years = '2016 to 2017' then year_n = 2016;
        else if years = '2018 to 2019' then year_n = 2018;
                                         421


         /*need to calculate incidence SE or 95%CIs*/
         label incidence = "Percentage of past year initiates among persons at risk for
initiation"
                         legal_cat = "Legal status of cannabis through 2018"
                         year_n = "Numeric date of data, first year in year-pair";
run;
proc print data=all_data (obs=6);
         title3'check incidence calcs';
run;
proc freq data=all_data;
         title3'Check Time recode';
         table year_n*years/list missing;
run;
proc freq data=all_data;
                                          422


        title3'combinations of legal cat and years to make relative time variable';
        table legal_cat* years/list;
        where legal_cat ~="Overall";
run;
proc means data=all_data;
        title3'average incidence at age 21 by category';
        var incidence;
        class legal_cat;
run;
proc freq data=all_data;
        title3'Check legal category and year combinations';
        table legal_cat*years/list missing;
run;
data all_data2;
        set all_data;
                                         423


       where legal_cat ~="Overall";
       /*create a time variable relative to year of legalization and year of data
                time = how many years away froma states year of legalization is this data
point?*/
       if legal_cat = "Illegal" then time=0;
       else if legal_cat = "Legal_2012" and years = "2008 to 2009" then time = -4;
       else if legal_cat = "Legal_2014" and years = "2008 to 2009" then time = -6;
       else if legal_cat = "Legal_2016" and years = "2008 to 2009" then time = -8;
       else if legal_cat = "Legal_2018" and years = "2008 to 2009" then time = -10;
       else if legal_cat = "Legal_2012" and years = "2010 to 2011" then time = -2;
       else if legal_cat = "Legal_2014" and years = "2010 to 2011" then time = -4;
       else if legal_cat = "Legal_2016" and years = "2010 to 2011" then time = -6;
       else if legal_cat = "Legal_2018" and years = "2010 to 2011" then time = -8;
       else if legal_cat = "Legal_2012" and years = "2012 to 2013" then time = 0;
                                         424


else if legal_cat = "Legal_2014" and years = "2012 to 2013" then time = -2;
else if legal_cat = "Legal_2016" and years = "2012 to 2013" then time = -4;
else if legal_cat = "Legal_2018" and years = "2012 to 2013" then time = -6;
else if legal_cat = "Legal_2012" and years = "2014 to 2015" then time = 2;
else if legal_cat = "Legal_2014" and years = "2014 to 2015" then time = 0;
else if legal_cat = "Legal_2016" and years = "2014 to 2015" then time = -2;
else if legal_cat = "Legal_2018" and years = "2014 to 2015" then time = -4;
else if legal_cat = "Legal_2012" and years = "2016 to 2017" then time = 4;
else if legal_cat = "Legal_2014" and years = "2016 to 2017" then time = 2;
else if legal_cat = "Legal_2016" and years = "2016 to 2017" then time = 0;
else if legal_cat = "Legal_2018" and years = "2016 to 2017" then time = -2;
else if legal_cat = "Legal_2012" and years = "2018 to 2019" then time = 6;
else if legal_cat = "Legal_2014" and years = "2018 to 2019" then time = 4;
else if legal_cat = "Legal_2016" and years = "2018 to 2019" then time = 2;
else if legal_cat = "Legal_2018" and years = "2018 to 2019" then time = 0;
                                425


/*create dummy variable for each relative time point*/
        if time=-10 then tminus10 =1; else tminus10=0;
        if time=-8 then tminus8 =1; else tminus8=0;
        if time=-6 then tminus6 =1; else tminus6=0;
        if time=-4 then tminus4 =1; else tminus4=0;
        if time=-2 then tminus2 =1; else tminus2=0;
        if time=0 then t0 =1; else t0=0;
        if time=2 then t2 =1; else t2=0;
        if time=4 then t4 =1; else t4=0;
        if time=6 then t6 =1; else t6=0;
        label    time = "Time relative to legalization based on beginning of year pair";
run;
proc freq data=all_data2;
        title3'check new variable creation';
        table time*legal_cat*years
                                          426


                time*tminus10*tminus8*tminus6*tminus4*tminus2*t0*t2*t4*t6/list missing;
run;
proc freq data=all_data2;
        title3'determine where tails of legal date data distribution should be grouped
together';
        table time/list;
run;
/*<=-4, >=4*/
/* data step 3 creates new dummies that
can categorize all data before a certain time point
so that analysis on a single category is not done
econs call it balancing leads and lags for short*/
data all_data3;
        set all_data2;
                                         427


if time<=-4 then tlt4=1; else tlt4=0;
if time>=4 then tgt4=1; else tgt4=0;
if time<=-6 then tlt6=1; else tlt6=0;
if time>=6 then tgt6=1; else tgt6=0;
if legal_cat = "Illegal" then legal = 0;
else if time>=0 then legal = 1;
else if time <0 then legal =0;
if legal_cat = "Illegal" then effective = 0;
else if time>=2 then effective = 1;
else if time <2 then effective =0;
if legal_cat = "Illegal" then legal_wave = 0;
else if legal_cat = "Legal_2012" then legal_wave = 1;
else if legal_cat = "Legal_2014" then legal_wave = 2;
                                   428


        else if legal_cat = "Legal_2016" then legal_wave = 3;
        else if legal_cat = "Legal_2018" then legal_wave = 4;
        label legal = "Simple binary for RCL, 1 if year>=legalize date, 0 otherwise"
                        effective = "Simple binary for RCL effective, 1 if year>=effective
date, 0 otherwise";
run;
proc freq data=all_data3;
        title3'Check new time event dummies';
        table time*tlt4*tminus10*tminus8*tminus6*tminus4*tminus2*t0*t2*t4*t6*tgt4/list
missing;
run;
proc freq data=all_data3;
        title3'Check new legality dummy';
        table legal_cat*years*legal/list missing;
run;
                                         429


proc freq data=all_data3;
         title3'Check new effective dummy';
         table legal_cat*years*effective/list missing;
run;
/*Save dataset*/
data sav.Legal_date_21;
         set all_data3;
run;
/*export to csv if needed*/
proc export data= sav.Legal_date_21 outfile='C:\Users\montg\Dropbox\Ph.D
Work\Dissertation\Data\Aim 2\Processed\Legal_date_21.csv'
         dbms=csv replace;
run;
/*******************************************************************************************************
* In Enterprise Guide, "Specify the page size for log and text output" under 'Results
General' must be *
                                                 430


* de-selected in order to be able to specify pagesize and linesize using an options
statement.           *
********************************************************************************************************/
OPTIONS PS=56 LS=160 NOCENTER NOFMTERR MPRINT ORIENTATION =
LANDSCAPE ;
title1'Dissertation';
title2'Aim 3: Incidence at age 21 after leglaization event study';
/****************************************************************************************
* The following macro variables are available to all users: *
*                                                                     *
* &Project       – the name of the project folder in which the .EGP file is stored                 *
* &ProgName             – the name of the .egp file, without the extension                           *
* &ProgNode            - the name of the code node                                          *
* &ProgDir          – the path to the folder in which the .egp file is stored.                   *
*****************************************************************************************/
/*****************************************************************************
                                                 431


* The following macro variables are used in conjunction with the DOC_BLOCK *
* macro to document programs and output.                                        *
*                                                            *
* Do not use quotation marks when defining macro variables. If SAS syntax         *
* requires quotes, use double quotes when you reference the macro variable. *
******************************************************************************/
** PROGRAMMER'S NAME ;
%LET PROGRAMMER                   = Barrett Montgomery;
** DEFINE DIRECTORY AND FILE NAME OF ANY PERMANENT SAS DATASETS
SAVED IN THIS PROGRAM AS MACRO VARIABLES ;
** USE &PROGNAME FOR SAVEFILE NAME ;
** LEAVE BLANK IF NO DATASET SAVED ;
%LET SAVEDIR1             = C:\Users\montg\Dropbox\Ph.D Work\Dissertation\Data\Aim
3\Processed;
** NAME FORMAT LIBRARY DIRECTORY ;
%LET FMTDIR              = ;
                                                 432


** GIVE A BRIEF DESCRIPTION OF OVERALL PURPOSE OF THIS PROGRAM ;
** YOU CAN USE SINGLE QUOTES OR NO QUOTES-- DOUBLE QUOTES WILL NOT
WORK ;
%LET PURPOSE1 = Difference in difference and event study design to estimate the
effect of cannabis legalization on cannabis incidence for 21 year olds
/*************************************************************
** CREATE LIBRARY REFERENCES TO DIRECTORIES SPECIFIED ABOVE **
**************************************************************/
** INPUT FILES ;
** OUTPUT FILE DESTINATION ;
LIBNAME SAV "&SAVEDIR1" ;
/*This section recreates the various 2x2 plots that can be created with these data
meant to replicate figure 2 in Goodman-Bacon's 2018 seminal working paper
         also easy visual check for parrallel trends assumption*/
                                                 433


proc freq data = sav.Legal_date_21;
       table legal_cat age / list missing;
run;
proc format;
       value $legal
        'Legalized_2012'='Legalized in 2012'
        'Legalized_2014'='Legalized in 2014'
        'Legalized_2016'='Legalized in 2016'
        'Legalized_2018'='Legalized in 2018';
        value $bin
         'Legalized_2012'='Legal'
        'Legalized_2014'='Legal'
        'Legalized_2016'='Legal'
        'Legalized_2018'='Legal';
run;
                                        434


proc means data = sav.Legal_date_21;
         var incidence;
         class legal_cat ;
         where time=0;
         format legal_cat $bin.;
run;
proc sgplot data=sav.Legal_date_21 ;
         title3'Cannabis incidence in 21 year olds, first wave legalizing states vs
untreated states';
         series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legal_2012','Illegal');
   refline "2012 to 2013" / axis=x label="First wave of cannabis legalization"
lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
                                         435


run;
proc sgplot data=sav.Legal_date_21 ;
         title3'Cannabis incidence in 21 year olds, second wave legalizing states vs
untreated states';
         series x=years y=incidence / lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legal_2014','Illegal');
   refline "2014 to 2015" / axis=x label="Second wave of cannabis legalization"
lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date_21 ;
         title3'Cannabis incidence in 21 year olds, third wave legalizing states vs
untreated states';
         series x=years y=incidence / lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legal_2016','Illegal');
                                         436


   refline "2016 to 2017" / axis=x label="Third wave of cannabis legalization"
lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date_21 ;
         title3'Cannabis incidence in 21 year olds, first wave legalizing states vs second
wave legalizing states';
         series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legal_2012','Legal_2014');
   refline "2012 to 2013" "2014 to 2015"/ axis=x lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date_21 ;
                                         437


         title3'Cannabis incidence in 21+ age group, first wave legalizing states vs third
wave legalizing states';
         series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legal_2012','Legal_2016');
   refline "2012 to 2013" "2016 to 2017"/ axis=x lineattrs=(color=green);
         xaxis grid label='NSDUH year-pairs';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
         format legal_cat $legal.;
run;
proc sgplot data=sav.Legal_date_21 ;
         title3'Cannabis incidence in 21+ age group, second wave legalizing states vs
third wave legalizing states';
         series x=years y=incidence /lineattrs=(pattern=2) group=legal_cat;
         where legal_cat in ('Legal_2014','Legal_2016');
   refline "2014 to 2015" "2016 to 2017"/ axis=x;
          xaxis grid label='Time';
   yaxis grid label='Past year cannabis use incidence' discreteorder=data;
run;
                                        438


/***********************************************/
/*                          legal date analysis Plots            */
/***********************************************/
ods graphics on;
title1 "Simple Incidence Plots";
proc sgplot data=sav.Legal_date_21;
    series x=years y=incidence / group=legal_cat;
run;
proc means data=sav.Legal_date_21 mean;
         title3'Average incidence in 2 year period prior to legalization';
         var incidence;
         class age legal_cat ;
         where time=-2;
         format legal_cat $bin.;
                                                 439


run;
proc means data=sav.Legal_date_21 mean;
         title3'Average incidence where illegal';
         var incidence;
         class age legal_cat ;
         where legal_cat="Illegal";
         format legal_cat $bin.;
run;
title1 "Regression modelling";
title4 ;
proc sort data=sav.Legal_date_21;
         by years;
run;
ods graphics on;
                                       440


proc glm data=sav.Legal_date_21 ;
        title3 "Regression for panel event study";
        absorb years;
        class legal_cat (ref="Illegal") tminus10(ref="0") tminus8(ref="0") tminus6(ref="0")
tminus4(ref="0") t0(ref="0") t2(ref="0") t4(ref="0") t6(ref="0");
        model incidence = legal_cat tminus10 tminus8 tminus6 tminus4 t0 t2 t4 t6
/solution CLPARM ;
        ods output ParameterEstimates = ParamEsttotal;
run;
title1"Manage Regression Output";
data ParamEsttotal1;
        set ParamEsttotal;
        where StdErr >. and Parameter in ("tminus10 1",
"tminus8 1",
"tminus6 1",
"tminus4 1",
"t0      1",
                                          441


"t2        1",
"t4        1",
"t6        1");
run;
title1 "Plot coefficients";
proc sgplot data=ParamEsttotal1 noautolegend;
          title3'Effect of time since legalization on incidence';
          title4'all ages, fixed effects for time and state categories';
          scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
          markerattrs=(symbol=diamondfilled);
    refline 0 / axis=y;
    xaxis grid;
    yaxis grid display=(nolabel) discreteorder=data;
run;
proc sort data=sav.Legal_date_21 ;
                                              442


        by years;
run;
proc glm data=sav.Legal_date_21 ;
        title3 "Regression for event study";
        absorb years;
        class legal_cat (ref="Illegal") legal(ref="0");
        model incidence = legal_cat legal /solution CLPARM ;
run;
/*small because includes data from before effective date...*/
proc glm data=sav.Legal_date_21 ;
        title3 "Regression for event study, 20 and younger";
        absorb years;
        class legal_cat (ref="Illegal") effective(ref="0");
        where age="12_20";
        model incidence = legal_cat effective /solution CLPARM ;
run;
                                           443


/***************************************************/
/* analyze legal date with balanced lags and leads */
/***************************************************/
title1 "Regression modelling with balanced leads and lags";
proc sort data=sav.Legal_date_21;
         by years;
run;
ods graphics on;
proc glm data=sav.Legal_date_21 ;
         title3 "Regression for panel event study, all ages";
         absorb years;
         class legal_cat (ref="Illegal") tlt6(ref="0") tminus4(ref="0") t0(ref="0") t2(ref="0")
tgt4(ref="0");
         model incidence = legal_cat tlt6 tminus4 t0 t2 tgt4 /solution CLPARM ;
         ods output ParameterEstimates = ParamEst21;
                                                 444


run;
title1"Manage Regression Output";
data ParamEst211;
        set ParamEst21 end=eof;
        where StdErr >. and Parameter in ("tlt6 1",
        "tminus4 1",
        "t0     1",
        "t2     1",
        "tgt4    1");
        if parameter = "tlt6   1" then do;
                parameter = '6+ years prior' ;
                order=1;
        end;
        if parameter = "tminus4 1" then do;
                parameter = '4 years prior';
                                        445


         order=2;
end;
if parameter = "t0      1" then do;
         parameter = 'Legalized';
         order=4;
end;
if parameter = "t2      1" then do;
         parameter = '2 years after';
         order=5;
end;
if parameter = "tgt4     1" then do;
         parameter = '4+ years after';
         order=6;
end;
if eof then do;
output;
         parameter = '2 years prior';
         estimate=0;
                                 446


                  stderr=0;
                  tvalue=0;
                  probt=0;
                  lowercl=0;
                  uppercl=0;
                  order=3;
         end;
         output;
run;
proc sort data=ParamEst211;
         by order;
run;
title1 "Plot coefficients";
proc sgplot data=ParamEst211 noautolegend;
         title3'Effect of time since cannabis legalization on cannabis incidence';
                                          447


         title4'ages 21 and up, balanced leads and lags, fixed effects for time and state
categories';
         scatter x=parameter y=Estimate / yerrorlower=LowerCL yerrorupper=upperCL
         markerattrs=(symbol=diamondfilled);
   refline 0 / axis=y;
   xaxis grid label='Time relative to legalization';
   yaxis grid label='Newly incident cannabis use' discreteorder=data;
run;
/*Look at average incidence by group to determine % change*/
proc contents; run;
data dat;
         set sav.Legal_date_21;
         if legal_cat = "Illegal" then legal=0;
                                            448


      else if legal_cat = "Overall" then legal=.;
      else if time=0 and legal_cat ~= "Illegal" then legal=1;
      else if time<0 then legal=0;
      else if time>0 then legal=1;
      label    legal = "cannabis leagality binary";
run;
proc means data=dat;
      var incidence;
      class legal time;
run;
proc means data=dat;
      var incidence;
      class legal years;
run;
                                       449


R
library(ROCR)
library(reshape2)
library(dplyr)
library(plyr)
library(tidyr)
library(FactoMineR)
library(factoextra)
library(caret)
library(InformationValue)
library(ISLR)
library(verification)
library(Epi)
library(pROC)
## ## ---------------------------------------------------------
                                    450


## ## read-in dataset
## ## ---------------------------------------------------------
remove(list=ls()); gc()
dat <-read.csv("C:/Users/montg/Dropbox/Ph.D
Work/Dissertation/Data/Aim 1/Processed/County_RCL_LAG.csv", header
= T,
          na.strings=c("",".","NA", "NULL", "N/A"))
 dim(dat)
# [1] 3142 1598
## make all names lower cases
names(dat) <- tolower(names(dat))
## --------------------------------------------
## get rid of census data and replace with pca
## --------------------------------------------
                                     451


kp <- c(grep("t0", names(dat), value = TRUE))
 ## ,("area", names(dat), value = TRUE)) ## adds land area to PCA grep
pop.dat <- dat[, kp]
dat.pca <- prcomp(pop.dat, scale = TRUE)
## percent of variance explained
## first 2 PCS explain 79.3% of variabiility
fviz_eig(dat.pca, addlabels = TRUE)
## individual-level information
ind <- get_pca_ind(dat.pca)
## add PC information
dat$PCA1 <- ind$coord[,1]
dat$PCA2 <- ind$coord[,2]
                                 452


dat$PCA3 <- ind$coord[,3]
dat$PCA4 <- ind$coord[,4]
dat$PCA5 <- ind$coord[,5]
dat$PCA6 <- ind$coord[,6]
dat$PCA7 <- ind$coord[,7]
dat$PCA8 <- ind$coord[,8]
dat$PCA9 <- ind$coord[,9]
dat$PCA10 <- ind$coord[,10]
dat <- dat %>% dplyr::select(-kp)
rm(pop.dat); rm(dat.pca); rm(ind); rm(kp)
#check PCs added correctly
glimpse(dat)
## Save
                                 453


saveRDS(dat, file = "C:/Users/montg/Dropbox/Ph.D
Work/Dissertation/Data/Aim 1/Processed/clean_data.rds")
library(ROCR)
library(reshape2)
library(dplyr)
library(plyr)
library(tidyr)
library(FactoMineR)
library(factoextra)
library(caret)
library(InformationValue)
library(ISLR)
library(verification)
library(Epi)
library(pROC)
                               454


## ## ---------------------------------------------------------
## ## read-in dataset
## ## ---------------------------------------------------------
remove(list=ls()); gc()
dat <- readRDS(file = "C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed\\clean_data.rds")
 dim(dat)
## make all names lower cases
names(dat) <- tolower(names(dat))
## -------------------------------------------------------
## remove duplicated variables
## -------------------------------------------------------
rmv <- c("name",
                                     455


"geography",
"fips_pch",
"dem_2008_votes",
"rep_2008_votes",
"other_2008_votes",
"dem_2012_votes",
"rep_2012_votes",
"other_2012_votes",
"total_2008_votes",
"total_2012_votes",
"percentdem_2008_votes",
"percentrep_2008_votes",
"percentother_2008_votes",
"state",
"county",
"region",
                         456


      "division",
      "state_fips",
      "division_n"
      )
dat <- dat %>% dplyr::select(-rmv)
rm(rmv)
dim(dat)
## -------------------------------------------------------
## remove duplicate NSDUH variables
## -------------------------------------------------------
## keep only 12+ and 18+ estimates
rmv <- c(grep("twelve_to", names(dat), value = TRUE),
      grep("eighteen_to", names(dat), value = TRUE),
      grep("twenty_six", names(dat), value = TRUE))
                                     457


dat <- dat %>% dplyr::select(-rmv)
rm(rmv)
# now, remove eighteen plus variables for which we have twelve plus
estimates
rmv <- c(grep("eighteen_plus_pymjinc", names(dat), value = TRUE),
     grep("eighteen_plus_pmalc", names(dat), value = TRUE),
     grep("eighteen_plus_pmcig", names(dat), value = TRUE),
     grep("eighteen_plus_pmmj", names(dat), value = TRUE),
     grep("eighteen_plus_pmtob", names(dat), value = TRUE),
     grep("eighteen_plus_pyalc", names(dat), value = TRUE),
     grep("eighteen_plus_pyaud", names(dat), value = TRUE),
     grep("eighteen_plus_pycoc", names(dat), value = TRUE),
     grep("eighteen_plus_pysud", names(dat), value = TRUE),
     grep("eighteen_plus_pycoc", names(dat), value = TRUE),
     grep("eighteen_plus_pymj", names(dat), value = TRUE)
                                458


       )
dat <- dat %>% dplyr::select(-rmv)
rm(rmv)
# all duplicate age groups removed
## ---------------------
## Variable Selection
## ---------------------
# # Excluded Eighteen_plus_PYAMIPrevestimate and
Eighteen_plus_PYMDEPrevestimate
# because mental health variables are about 70-80% correlated with each
other,
# SMI and ST are least correlated with each other (69%) and somewhat
correlated with policy.
# # Excluded Twelve_plus_PMTobPrevestimate since tobacco use is 96%
correlated with
                                459


# cigarettes and cigarettes are the better predictor for policy.
# # Excluded Twelve_plus_PYAlcDepPrevestimate because Alcohol
Dependence is
# the older (DSM-IV) way of coding AUD, it does not appear in all years
and is
# 86% correlated with AUD.
# # Excluded Twelve_plus_PYMJPrevestimate and
Twelve_plus_PYMJIncestimate.
# Among the marijuana variables, all are 80-90% correlated with each
other
# and past month prevalence is the best predictor of policy of the three.
rmv <- c(grep("eighteen_plus_pyamiprev", names(dat), value = TRUE),
     grep("eighteen_plus_pymdeprev", names(dat), value = TRUE),
      grep("twelve_plus_pmtobprev", names(dat), value = TRUE),
      grep("twelve_plus_pyalcdepprev", names(dat), value = TRUE),
      grep("twelve_plus_pymjprev", names(dat), value = TRUE),
                                  460


       grep("twelve_plus_pymjinc", names(dat), value = TRUE)
       )
dat <- dat %>% dplyr::select(-rmv)
rm(rmv)
 dim(dat)
## Save
saveRDS(dat, file = "C:/Users/montg/Dropbox/Ph.D
Work/Dissertation/Data/Aim 1/Processed//final_data.rds")
library(ggplot2)
library(pROC)
set.seed(1000)
## ---------------------------------------------------------
## read-in dataset
                                     461


## ---------------------------------------------------------
remove(list=ls()); gc()
dat <- readRDS("C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/final_data.rds")
## remove "Other" votes and alternate specifications for the outcomes
dat <- subset(dat, se=-c(percentdem_2012_votes, rcl_2012, sens_2014,
sens2_2014, sens3_2014))
## make all names lower cases
names(dat) <- tolower(names(dat))
## drop PCs if desired
dat <- subset(dat, se=-c(pca3, pca4, pca5, pca6, pca7, pca8, pca9,
pca10))
                                     462


## drop missing outcomes from analysis
dat <- subset(dat, !is.na(lag_2014))
## dat <- within(dat,               # xt: standardize large numbers
## {
##    area_water <- as.vector(scale(area_water))
##    area_land <- as.vector(scale(area_land))
## })
dat <- na.omit(dat)               # xt: 3103 to 3094
## -------------------------------------------------------
## Create macro variables for L and seL
## -------------------------------------------------------
seL <- grep("sel$", names(dat), value = TRUE)                # xt: sd of log
muL <- setdiff(grep("l$", names(dat), value = TRUE), seL) # xt: mu of log
vnL <- sub('l$', '', muL)                         # xt: var names
P <- length(vnL)
                                     463


## -------------------------------------------------------
## define train/test sample and case control ratio
## -------------------------------------------------------
## ratio of train-to-test data split
trn <- 0.7; tst <- 0.3
## ratio of legalized-to-nonlegalized to sample
zrs <- 1.6; ons <- .8
## -------------------------------------------------------
## define objects to store results
## -------------------------------------------------------
var.lst <- list() # data to store variable importance
prd.lst <- list() # data to store county-level predictions
roc.lst <- list() # data to store AUC stuff
acc.lst <- list() # data to store accurary assessment
                                     464


## ---------------------------------------------------------
## start simulations
## ---------------------------------------------------------
B <- 1e3
for(i in 1:B)
{
   ## bootstrap, training and testing DO NOT share counties.
   tmp <- with(dat,
   {
       i0 <- which(lag_2014 == 0); n0 <- length(i0) # 0s
       i1 <- which(lag_2014 == 1); n1 <- length(i1) # 1s
       ## divide (a) training and (b) testing, then permute
       a0 <- sample(i0, n0 * trn); b0 <- sample(setdiff(i0, a0))
       a1 <- sample(i1, n1 * trn); b1 <- sample(setdiff(i1, a1))
       ## bootstrap, number of 1s as the basic number
                                     465


      a <- c(rep(a0, len=n1 * zrs * trn), rep(a1, len=n1 * ons * trn))
      b <- c(rep(b0, len=n1 * zrs * tst), rep(b1, len=n1 * ons * tst))
      list(a=sort(a), b=sort(b))
   })
   tmp <- rbind(cbind(dat='a', dat[tmp$a, ]), cbind(dat='b', dat[tmp$b, ]))
   N <- nrow(tmp)
   ## check
   with(tmp, intersect(qname[dat=='a'], qname[dat=='b'])) # must be
empty
   with(tmp, table(dat, lag_2014)) # row should be trn:tst, col should be
zro:ons
   ## ------------------------------------------------
   ## introduce variability to NSDUH RDAS estimates
   ## ------------------------------------------------
   ## xt: expand standard deviation to 3 times
                                     466


   .x <- rnorm(N * P, unlist(tmp[, muL]), unlist(tmp[, seL]))
   .x <- matrix(1 / (1 + exp(-.x)), N, P, dimnames=list(NULL, vnL))
   tmp <- cbind(tmp[, !grepl("^(twelve|eighteen)", names(tmp))], .x)
   ## ---------------------------
   ## Perform training testing split
   ## ----------------------------
   A <- subset(tmp, dat == 'a', -dat)
   B <- subset(tmp, dat == 'b', -dat)
   ## ----------------------------
   ## fit a GLM model, predict testing data
   ## ----------------------------
   mld <- glm(lag_2014 ~ ., data       = subset(A, se=-qname), family =
'binomial')
                                     467


   pht <- predict(mld,      newdata = subset(B, se=-qname), type =
'response')
   ## ---------------------------------------
   ## store county-level predictions
   ## ---------------------------------------
   ## xt: fraction of 0s as threshold
   res <- cbind(sim=i, B, pht=pht, yht=0 + (pht > mean(B$lag_2014 < 1)))
   prd.lst[[i]] <- res
   ## -------------------------------------
   ## store predictor statistics
   ## -------------------------------------
   var.lst[[i]] <- data.frame(sim=i, variables=names(coef(mld))[-1],
Z=summary(mld)$coef[-1, 3], row.names=NULL)
                                     468


  ## -------------------------------------
  ## store prediction accuracy results
  ## -------------------------------------
  tbl <- with(res, table(lag_2014, yht))
  acc.lst[[i]] <- data.frame(
     sim=i,
     TP = tbl[2, 2], TN = tbl[1, 1], FP = tbl[1, 2], FN = tbl[2, 1],
     TPF = tbl[2, 2] / sum(tbl[2, ]), TNF = tbl[1, 1] / sum(tbl[1, ]),
     FPF = tbl[1, 2] / sum(tbl[1, ]), FNF = tbl[2, 1] / sum(tbl[2, ]),
     acc = sum(diag(tbl)) / sum(tbl))
  ## --------------------------
  ## create a ROC curve
  ## --------------------------
  ROC <- with(res, roc(lag_2014, pht, quiet = TRUE))
  roc.lst[[i]] <- with(ROC, data.frame(sim=i, TPR=rev(sensitivities),
FPR=rev(1 - specificities)))
                                    469


}
var2 <- do.call(rbind, var.lst)   # predictor z-scores
acc2 <- do.call(rbind, acc.lst)     # accurary performance
roc2 <- do.call(rbind, roc.lst)    # roc
prd2 <- do.call(rbind, prd.lst)    # truth and prediction, per county, per sim
## ROC for all simulations combined (overall)
rocA <- with(prd2, roc(lag_2014, pht, quiet = TRUE))
aucA <- round(auc(rocA), 3)
rocA <- with(rocA, data.frame(TPR=rev(sensitivities), FPR=rev(1-
specificities)))
## ---------------------------------------
## plot all ROC curves with the average
## ---------------------------------------
                                     470


setwd("C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/results/")
g <- ggplot() + theme_bw()
g <- g + geom_abline(intercept = 0, slope = 1, size = 0.5, linetype =
"dashed") # ref
g <- g + geom_line(aes(x=FPR, y=TPR, group=sim), roc2, alpha=0.3) # per
sim
g <- g + geom_line(aes(x=FPR, y=TPR), rocA, color="red", size=2)      #
overall
g <- g + geom_text(aes(x = 0.75, y=0.4, label=paste0("AUC: ", aucA)),
color="red", size=10)
g <- g + scale_x_continuous(limits = c(0, 1)) + scale_y_continuous(limits =
c(0, 1))
ggsave("roc_glm.png", g)
## -------------------------------------------
## summarize predictive power of variables
                                     471


## -------------------------------------------
round(do.call(rbind, with(var2, by(Z, variables, summary)))[, 2:5], 3)
#                  1st Qu. Median        Mean 3rd Qu.
# area_land              0.733 1.346 320390.08 1.932
# area_water              0.111 0.467 228688.61 0.829
# eighteen_plus_pysmiprev 0.527 1.351 441991.16 1.986
# eighteen_plus_pystprev -1.107 -0.326 -194993.85 0.332
# pca1                -1.649 -0.715 -195687.63 0.555
# pca2                -1.588 -0.802 -216793.23 -0.007
# percentdem_2012_votes          -1.619 -0.912 -598545.37 -0.202
# percentrep_2012_votes         -1.782 -1.100 -612526.57 -0.392
# twelve_plus_pmalcprev        -0.004 0.618 268324.83 1.314
# twelve_plus_pmcigprev        -1.567 -0.829 -362955.85 -0.010
# twelve_plus_pmmjprev           2.203 2.871 1202101.03 3.326
# twelve_plus_pyaudprev        -0.085 0.536 392370.13 1.279
                                     472


# twelve_plus_pycocprev         1.636 2.239 333135.16 2.685
# twelve_plus_pysudprev        -1.320 -0.481 35547.05 0.260
## ------------------------------------------
## Summarize Classification Accuracies
## ------------------------------------------
summary(acc2)
#     sim            TP           TN          FP      FN    TPF
# Min. : 1.0 Min. : 7.00 Min. :33.00 Min. : 0.000 Min. : 0.000
Min. :0.3182
# 1st Qu.: 250.8 1st Qu.:16.00 1st Qu.:40.00 1st Qu.: 2.000 1st Qu.:
3.000 1st Qu.:0.7273
# Median : 500.5 Median :17.00 Median :41.00 Median : 3.000
Median : 5.000 Median :0.7727
# Mean : 500.5 Mean :17.05 Mean :40.97 Mean : 3.026 Mean :
4.945 Mean :0.7752
                                     473


# 3rd Qu.: 750.2 3rd Qu.:19.00 3rd Qu.:42.00 3rd Qu.: 4.000 3rd Qu.:
6.000 3rd Qu.:0.8636
# Max. :1000.0 Max. :22.00 Max. :44.00 Max. :11.000 Max.
:15.000 Max. :1.0000
#     TNF            FPF            FNF            acc
# Min. :0.7500 Min. :0.00000 Min. :0.0000 Min. :0.7576
# 1st Qu.:0.9091 1st Qu.:0.04545 1st Qu.:0.1364 1st Qu.:0.8485
# Median :0.9318 Median :0.06818 Median :0.2273 Median :0.8788
# Mean :0.9312 Mean :0.06877 Mean :0.2248 Mean :0.8792
# 3rd Qu.:0.9545 3rd Qu.:0.09091 3rd Qu.:0.2727 3rd Qu.:0.9091
# Max. :1.0000 Max. :0.25000 Max. :0.6818 Max. :0.9848
#slightly lower accuracy but much better balance of sens and spec
## ------------------------------------------------
## ensemble predictions per county using different weights
                                     474


## ------------------------------------------------
WTS <- c('TPF', 'TNF', 'FNF', 'acc')
mrg <- merge(prd2, acc2, by="sim")[, c(names(prd2), WTS)]
ens <- by(mrg, mrg$qname, function(x)
{
   wts <- t(colMeans(x[, 'pht'] * x[, WTS])) # esemble voted probability
   data.frame(
     county = x[1, 'qname'], N = nrow(x), Y = x[1, 'lag_2014'],
     YHT = mean(x[, 'yht']), # mean of per-simulation hard-call
     PHT = mean(x[, 'pht']), # mean of per-simulation preditive probability
     wts)
})
ens <- data.frame(do.call(rbind, ens), row.names=NULL)
## check performance
CRI <- c("YHT", "PHT", WTS) # criteria
round(cor(ens[, 'Y'], ens[, CRI]), 3)
                                     475


## make hard-calls
THD <- zrs / (zrs + ons) # xt: fraction of 0s as threshold
hdc <- cbind(ens[, !names(ens) %in% CRI], (ens[, CRI] > THD) + 0)
with(hdc, table(Y, TNF))
##   TNF
## Y    0   1
## 0 2751 251
## 1    3 89
with(hdc, table(Y, FNF))
##   FNF
## Y    0
## 0 3002
## 1 92
                                   476


with(hdc, table(Y, acc))
##   acc
## Y    0   1
## 0 2746 256
## 1    3 89
with(hdc, table(Y, PHT))
##   PHT
## Y    0   1
## 0 2732 270
## 1    1 91
with(hdc, table(Y, YHT))
##   YHT
## Y    0   1
                         477


## 0 2734 268
## 1     1 91
## Counties with false positive predictions
subset(ens, Y == 0 & TNF>THD)
## Output for SAS mapping
write.csv(ens, file='C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/ens_lag.csv')
library(ggplot2)
library(pROC)
set.seed(1000)
## ---------------------------------------------------------
## read-in dataset
## ---------------------------------------------------------
                                     478


remove(list=ls()); gc()
dat <- readRDS("C:/Users/montg/Dropbox/Ph.D
Work/Dissertation/Data/Aim 1/Processed/final_data.rds")
## remove "Other" votes and alternate specifications for the outcomes
dat <- subset(dat, se=-c(percentother_2012_votes, rcl_2012, lag_2014,
sens2_2014, sens3_2014))
## make all names lower cases
names(dat) <- tolower(names(dat))
## drop PCs if desired
dat <- subset(dat, se=-c(pca3, pca4, pca5, pca6, pca7, pca8, pca9,
pca10))
## drop missing outcomes from analysis
                                479


dat <- subset(dat, !is.na(sens_2014))
## dat <- within(dat,               # xt: standardize large numbers
## {
##    area_water <- as.vector(scale(area_water))
##    area_land <- as.vector(scale(area_land))
## })
dat <- na.omit(dat)               # 3142 to 3133
## -------------------------------------------------------
## Create macro variables for L and seL
## -------------------------------------------------------
seL <- grep("sel$", names(dat), value = TRUE)                # xt: sd of log
muL <- setdiff(grep("l$", names(dat), value = TRUE), seL) # xt: mu of log
vnL <- sub('l$', '', muL)                         # xt: var names
P <- length(vnL)
                                     480


## -------------------------------------------------------
## define data slit sample samples
## -------------------------------------------------------
## ratio of train-to-test data split
trn <- 0.7; tst <- 0.3
## ratio of legalized-to-nonlegalized to sample
zrs <- 1.6; ons <- .8
## -------------------------------------------------------
## define objects to store results
## -------------------------------------------------------
var.lst <- list() # data to store variable importance
prd.lst <- list() # data to store county-level predictions
roc.lst <- list() # data to store AUC stuff
acc.lst <- list() # data to store accurary assessment
                                     481


## ---------------------------------------------------------
## start simulations
## ---------------------------------------------------------
B <- 1e3
for(i in 1:B)
{
   ## bootstrap, training and testing DO NOT share counties.
   tmp <- with(dat,
   {
       i0 <- which(sens_2014 == 0); n0 <- length(i0) # 0s
       i1 <- which(sens_2014 == 1); n1 <- length(i1) # 1s
       ## divide (a) training and (b) testing, then permute
       a0 <- sample(i0, n0 * trn); b0 <- sample(setdiff(i0, a0))
       a1 <- sample(i1, n1 * trn); b1 <- sample(setdiff(i1, a1))
       ## bootstrap, number of 1s as the basic number
       a <- c(rep(a0, len=n1 * zrs * trn), rep(a1, len=n1 * ons * trn))
                                     482


      b <- c(rep(b0, len=n1 * zrs * tst), rep(b1, len=n1 * ons * tst))
      list(a=sort(a), b=sort(b))
   })
   tmp <- rbind(cbind(dat='a', dat[tmp$a, ]), cbind(dat='b', dat[tmp$b, ]))
   N <- nrow(tmp)
   ## check
   with(tmp, intersect(qname[dat=='a'], qname[dat=='b'])) # must be
empty
   with(tmp, table(dat, sens_2014)) # row should be trn:tst, col should be
zro:ons
   ## ------------------------------------------------
   ## introduce variability to NSDUH RDAS estimates
   ## ------------------------------------------------
   ## xt: expand standard deviation to 3 times
   .x <- rnorm(N * P, unlist(tmp[, muL]), unlist(tmp[, seL]))
                                     483


   .x <- matrix(1 / (1 + exp(-.x)), N, P, dimnames=list(NULL, vnL))
   tmp <- cbind(tmp[, !grepl("^(twelve|eighteen)", names(tmp))], .x)
   ## ---------------------------
   ## Perform training testing split
   ## ----------------------------
   A <- subset(tmp, dat == 'a', -dat)
   B <- subset(tmp, dat == 'b', -dat)
   ## ----------------------------
   ## fit a GLM model, predict testing data
   ## ----------------------------
   mld <- glm(sens_2014 ~ ., data        = subset(A, se=-qname), family =
'binomial')
   pht <- predict(mld,      newdata = subset(B, se=-qname), type =
'response')
                                     484


  ## ---------------------------------------
  ## store county-level predictions
  ## ---------------------------------------
  ## xt: fraction of 0s as threshold
  res <- cbind(sim=i, B, pht=pht, yht=0 + (pht > mean(B$sens_2014 < 1)))
  prd.lst[[i]] <- res
  ## -------------------------------------
  ## store predictor statistics
  ## -------------------------------------
  var.lst[[i]] <- data.frame(sim=i, variables=names(coef(mld))[-1],
Z=summary(mld)$coef[-1, 3], row.names=NULL)
  ## -------------------------------------
  ## store prediction accuracy results
                                    485


  ## -------------------------------------
  tbl <- with(res, table(sens_2014, yht))
  acc.lst[[i]] <- data.frame(
     sim=i,
     TP = tbl[2, 2], TN = tbl[1, 1], FP = tbl[1, 2], FN = tbl[2, 1],
     TPF = tbl[2, 2] / sum(tbl[2, ]), TNF = tbl[1, 1] / sum(tbl[1, ]),
     FPF = tbl[1, 2] / sum(tbl[1, ]), FNF = tbl[2, 1] / sum(tbl[2, ]),
     acc = sum(diag(tbl)) / sum(tbl))
  ## --------------------------
  ## create a ROC curve
  ## --------------------------
  ROC <- with(res, roc(sens_2014, pht, quiet = TRUE))
  roc.lst[[i]] <- with(ROC, data.frame(sim=i, TPR=rev(sensitivities),
FPR=rev(1 - specificities)))
}
var2 <- do.call(rbind, var.lst)   # predictor z-scores
                                    486


acc2 <- do.call(rbind, acc.lst)     # accurary performance
roc2 <- do.call(rbind, roc.lst)    # roc
prd2 <- do.call(rbind, prd.lst)    # truth and prediction, per county, per sim
## ROC for all simulations combined (overall)
rocA <- with(prd2, roc(sens_2014, pht, quiet = TRUE))
aucA <- round(auc(rocA), 3)
rocA <- with(rocA, data.frame(TPR=rev(sensitivities), FPR=rev(1-
specificities)))
## ---------------------------------------
## plot all ROC curves with the average
## ---------------------------------------
setwd("C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/results/")
g <- ggplot() + theme_bw()
                                     487


g <- g + geom_abline(intercept = 0, slope = 1, size = 0.5, linetype =
"dashed") # ref
g <- g + geom_line(aes(x=FPR, y=TPR, group=sim), roc2, alpha=0.3) # per
sim
g <- g + geom_line(aes(x=FPR, y=TPR), rocA, color="red", size=2)       #
overall
g <- g + geom_text(aes(x = 0.75, y=0.4, label=paste0("AUC: ", aucA)),
color="red", size=10)
g <- g + scale_x_continuous(limits = c(0, 1)) + scale_y_continuous(limits =
c(0, 1))
ggsave("roc_glm_sens.png", g)
## -------------------------------------------
## summarize predictive power of variables
## -------------------------------------------
round(do.call(rbind, with(var2, by(Z, variables, summary)))[, 2:5], 3)
                                     488


# area_land              0.963 1.559 1.583 2.176
# area_water              0.327 0.656 0.645 0.967
# eighteen_plus_pysmiprev 2.386 2.884 2.815 3.343
# eighteen_plus_pystprev -0.377 0.367 0.338 1.084
# pca1                -1.692 -0.907 -0.614 0.298
# pca2                -1.669 -0.898 -0.831 -0.135
# percentdem_2012_votes           -1.805 -1.048 -1.108 -0.418
# percentrep_2012_votes         -1.945 -1.215 -1.282 -0.665
# twelve_plus_pmalcprev        -0.153 0.521 0.504 1.251
# twelve_plus_pmcigprev        -2.107 -1.465 -1.413 -0.798
# twelve_plus_pmmjprev           2.976 3.552 3.406 4.028
# twelve_plus_pyaudprev          0.195 0.879 0.855 1.578
# twelve_plus_pycocprev         1.681 2.288 2.175 2.814
# twelve_plus_pysudprev        -1.240 -0.570 -0.521 0.167
## ------------------------------------------
                                      489


## Summarize Classification Accuracies
## ------------------------------------------
summary(acc2)
#     sim            TP           TN          FP     FN     TPF
# Min. : 1.0 Min. :10.00 Min. :47.00 Min. : 0.000 Min. : 0.000
Min. :0.3448
# 1st Qu.: 250.8 1st Qu.:20.00 1st Qu.:53.00 1st Qu.: 2.000 1st Qu.:
5.000 1st Qu.:0.6897
# Median : 500.5 Median :22.00 Median :54.00 Median : 4.000
Median : 7.000 Median :0.7586
# Mean : 500.5 Mean :21.89 Mean :54.12 Mean : 3.882 Mean :
7.114 Mean :0.7547
# 3rd Qu.: 750.2 3rd Qu.:24.00 3rd Qu.:56.00 3rd Qu.: 5.000 3rd Qu.:
9.000 3rd Qu.:0.8276
# Max. :1000.0 Max. :29.00 Max. :58.00 Max. :11.000 Max.
:19.000 Max. :1.0000
#     TNF            FPF            FNF          acc
                                     490


# Min. :0.8103 Min. :0.00000 Min. :0.0000 Min. :0.7241
# 1st Qu.:0.9138 1st Qu.:0.03448 1st Qu.:0.1724 1st Qu.:0.8506
# Median :0.9310 Median :0.06897 Median :0.2414 Median :0.8736
# Mean :0.9331 Mean :0.06693 Mean :0.2453 Mean :0.8736
# 3rd Qu.:0.9655 3rd Qu.:0.08621 3rd Qu.:0.3103 3rd Qu.:0.8966
# Max. :1.0000 Max. :0.18966 Max. :0.6552 Max. :0.9655
#slightly lower accuracy but much better balance of sens and spec
## ------------------------------------------------
## ensemble predictions per county using different weights
## ------------------------------------------------
WTS <- c('TPF', 'TNF', 'FNF', 'acc')
mrg <- merge(prd2, acc2, by="sim")[, c(names(prd2), WTS)]
ens <- by(mrg, mrg$qname, function(x)
{
                                     491


   wts <- t(colMeans(x[, 'pht'] * x[, WTS])) # esemble voted probability
   data.frame(
     county = x[1, 'qname'], N = nrow(x), Y = x[1, 'sens_2014'],
     YHT = mean(x[, 'yht']), # mean of per-simulation hard-call
     PHT = mean(x[, 'pht']), # mean of per-simulation preditive probability
     wts)
})
ens <- data.frame(do.call(rbind, ens), row.names=NULL)
## check performance
CRI <- c("YHT", "PHT", WTS) # criteria
round(cor(ens[, 'Y'], ens[, CRI]), 3)
## make hard-calls
THD <- zrs / (zrs + ons) # xt: fraction of 0s as threshold
hdc <- cbind(ens[, !names(ens) %in% CRI], (ens[, CRI] > THD) + 0)
                                    492


with(hdc, table(Y, TNF))
with(hdc, table(Y, FNF))
with(hdc, table(Y, acc))
with(hdc, table(Y, PHT))
with(hdc, table(Y, YHT))
## Output for SAS mapping
write.csv(ens, file='C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/ens_sens1.csv')
library(ggplot2)
library(pROC)
                                 493


set.seed(1000)
## ---------------------------------------------------------
## read-in dataset
## ---------------------------------------------------------
remove(list=ls()); gc()
dat <- readRDS("C:/Users/montg/Dropbox/Ph.D
Work/Dissertation/Data/Aim 1/Processed/final_data.rds")
## remove "Other" votes and alternate specifications for the outcomes
dat <- subset(dat, se=-c(percentother_2012_votes, rcl_2012, lag_2014,
sens_2014, sens3_2014))
## make all names lower cases
names(dat) <- tolower(names(dat))
                                     494


## drop PCs if desired
dat <- subset(dat, se=-c(pca3, pca4, pca5, pca6, pca7, pca8, pca9,
pca10))
## drop missing outcomes from analysis
dat <- subset(dat, !is.na(sens2_2014))
## dat <- within(dat,               # xt: standardize large numbers
## {
##    area_water <- as.vector(scale(area_water))
##    area_land <- as.vector(scale(area_land))
## })
dat <- na.omit(dat)               # 3142 to 3133
## -------------------------------------------------------
## Create macro variables for L and seL
## -------------------------------------------------------
                                     495


seL <- grep("sel$", names(dat), value = TRUE)                # xt: sd of log
muL <- setdiff(grep("l$", names(dat), value = TRUE), seL) # xt: mu of log
vnL <- sub('l$', '', muL)                         # xt: var names
P <- length(vnL)
## -------------------------------------------------------
## define data slit sample samples
## -------------------------------------------------------
## ratio of train-to-test data split
trn <- 0.7; tst <- 0.3
## ratio of legalized-to-nonlegalized to sample
zrs <- 1.6; ons <- .8
## -------------------------------------------------------
## define objects to store results
## -------------------------------------------------------
                                     496


var.lst <- list() # data to store variable importance
prd.lst <- list() # data to store county-level predictions
roc.lst <- list() # data to store AUC stuff
acc.lst <- list() # data to store accurary assessment
## ---------------------------------------------------------
## start simulations
## ---------------------------------------------------------
B <- 1e3
for(i in 1:B)
{
   ## bootstrap, training and testing DO NOT share counties.
   tmp <- with(dat,
   {
       i0 <- which(sens2_2014 == 0); n0 <- length(i0) # 0s
       i1 <- which(sens2_2014 == 1); n1 <- length(i1) # 1s
                                     497


      ## divide (a) training and (b) testing, then permute
      a0 <- sample(i0, n0 * trn); b0 <- sample(setdiff(i0, a0))
      a1 <- sample(i1, n1 * trn); b1 <- sample(setdiff(i1, a1))
      ## bootstrap, number of 1s as the basic number
      a <- c(rep(a0, len=n1 * zrs * trn), rep(a1, len=n1 * ons * trn))
      b <- c(rep(b0, len=n1 * zrs * tst), rep(b1, len=n1 * ons * tst))
      list(a=sort(a), b=sort(b))
   })
   tmp <- rbind(cbind(dat='a', dat[tmp$a, ]), cbind(dat='b', dat[tmp$b, ]))
   N <- nrow(tmp)
   ## check
   with(tmp, intersect(qname[dat=='a'], qname[dat=='b'])) # must be
empty
   with(tmp, table(dat, sens2_2014)) # row should be trn:tst, col should be
zro:ons
                                    498


## ------------------------------------------------
## introduce variability to NSDUH RDAS estimates
## ------------------------------------------------
## xt: expand standard deviation to 3 times
.x <- rnorm(N * P, unlist(tmp[, muL]), unlist(tmp[, seL]))
.x <- matrix(1 / (1 + exp(-.x)), N, P, dimnames=list(NULL, vnL))
tmp <- cbind(tmp[, !grepl("^(twelve|eighteen)", names(tmp))], .x)
## ---------------------------
## Perform training testing split
## ----------------------------
A <- subset(tmp, dat == 'a', -dat)
B <- subset(tmp, dat == 'b', -dat)
## ----------------------------
## fit a GLM model, predict testing data
                                  499


   ## ----------------------------
   mld <- glm(sens2_2014 ~ ., data        = subset(A, se=-qname), family =
'binomial')
   pht <- predict(mld,      newdata = subset(B, se=-qname), type =
'response')
   ## ---------------------------------------
   ## store county-level predictions
   ## ---------------------------------------
   ## xt: fraction of 0s as threshold
   res <- cbind(sim=i, B, pht=pht, yht=0 + (pht > mean(B$sens2_2014 <
1)))
   prd.lst[[i]] <- res
   ## -------------------------------------
   ## store predictor statistics
   ## -------------------------------------
                                     500


  var.lst[[i]] <- data.frame(sim=i, variables=names(coef(mld))[-1],
Z=summary(mld)$coef[-1, 3], row.names=NULL)
  ## -------------------------------------
  ## store prediction accuracy results
  ## -------------------------------------
  tbl <- with(res, table(sens2_2014, yht))
  acc.lst[[i]] <- data.frame(
     sim=i,
     TP = tbl[2, 2], TN = tbl[1, 1], FP = tbl[1, 2], FN = tbl[2, 1],
     TPF = tbl[2, 2] / sum(tbl[2, ]), TNF = tbl[1, 1] / sum(tbl[1, ]),
     FPF = tbl[1, 2] / sum(tbl[1, ]), FNF = tbl[2, 1] / sum(tbl[2, ]),
     acc = sum(diag(tbl)) / sum(tbl))
  ## --------------------------
  ## create a ROC curve
  ## --------------------------
                                    501


  ROC <- with(res, roc(sens2_2014, pht, quiet = TRUE))
  roc.lst[[i]] <- with(ROC, data.frame(sim=i, TPR=rev(sensitivities),
FPR=rev(1 - specificities)))
}
var2 <- do.call(rbind, var.lst)   # predictor z-scores
acc2 <- do.call(rbind, acc.lst)     # accurary performance
roc2 <- do.call(rbind, roc.lst)    # roc
prd2 <- do.call(rbind, prd.lst)    # truth and prediction, per county, per sim
## ROC for all simulations combined (overall)
rocA <- with(prd2, roc(sens2_2014, pht, quiet = TRUE))
aucA <- round(auc(rocA), 3)
rocA <- with(rocA, data.frame(TPR=rev(sensitivities), FPR=rev(1-
specificities)))
## ---------------------------------------
                                     502


## plot all ROC curves with the average
## ---------------------------------------
setwd("C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/results/")
g <- ggplot() + theme_bw()
g <- g + geom_abline(intercept = 0, slope = 1, size = 0.5, linetype =
"dashed") # ref
g <- g + geom_line(aes(x=FPR, y=TPR, group=sim), roc2, alpha=0.3) # per
sim
g <- g + geom_line(aes(x=FPR, y=TPR), rocA, color="red", size=2)      #
overall
g <- g + geom_text(aes(x = 0.75, y=0.4, label=paste0("AUC: ", aucA)),
color="red", size=10)
g <- g + scale_x_continuous(limits = c(0, 1)) + scale_y_continuous(limits =
c(0, 1))
ggsave("roc_glm_sens2.png", g)
                                     503


## -------------------------------------------
## summarize predictive power of variables
## -------------------------------------------
round(do.call(rbind, with(var2, by(Z, variables, summary)))[, 2:5], 3)
## ------------------------------------------
## Summarize Classification Accuracies
## ------------------------------------------
summary(acc2)
## ------------------------------------------------
## ensemble predictions per county using different weights
## ------------------------------------------------
                                     504


WTS <- c('TPF', 'TNF', 'FNF', 'acc')
mrg <- merge(prd2, acc2, by="sim")[, c(names(prd2), WTS)]
ens <- by(mrg, mrg$qname, function(x)
{
   wts <- t(colMeans(x[, 'pht'] * x[, WTS])) # esemble voted probability
   data.frame(
     county = x[1, 'qname'], N = nrow(x), Y = x[1, 'sens2_2014'],
     YHT = mean(x[, 'yht']), # mean of per-simulation hard-call
     PHT = mean(x[, 'pht']), # mean of per-simulation preditive probability
     wts)
})
ens <- data.frame(do.call(rbind, ens), row.names=NULL)
## check performance
CRI <- c("YHT", "PHT", WTS) # criteria
round(cor(ens[, 'Y'], ens[, CRI]), 3)
                                    505


## make hard-calls
THD <- zrs / (zrs + ons) # xt: fraction of 0s as threshold
hdc <- cbind(ens[, !names(ens) %in% CRI], (ens[, CRI] > THD) + 0)
with(hdc, table(Y, TNF))
with(hdc, table(Y, FNF))
with(hdc, table(Y, acc))
with(hdc, table(Y, PHT))
with(hdc, table(Y, YHT))
## Output for SAS mapping
                                   506


write.csv(ens, file='C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/ens_sens2.csv')
library(ggplot2)
library(pROC)
set.seed(1000)
## ---------------------------------------------------------
## read-in dataset
## ---------------------------------------------------------
remove(list=ls()); gc()
dat <- readRDS("C:/Users/montg/Dropbox/Ph.D
Work/Dissertation/Data/Aim 1/Processed/final_data.rds")
## remove "Other" votes and alternate specifications for the outcomes
dat <- subset(dat, se=-c(percentother_2012_votes, rcl_2012, lag_2014,
sens_2014, sens2_2014))
                                     507


## make all names lower cases
names(dat) <- tolower(names(dat))
## drop PCs if desired
dat <- subset(dat, se=-c(pca3, pca4, pca5, pca6, pca7, pca8, pca9,
pca10))
## drop missing outcomes from analysis
dat <- subset(dat, !is.na(sens3_2014))
## dat <- within(dat,            # xt: standardize large numbers
## {
##    area_water <- as.vector(scale(area_water))
##    area_land <- as.vector(scale(area_land))
## })
dat <- na.omit(dat)            # 3142 to 3133
                                  508


## -------------------------------------------------------
## Create macro variables for L and seL
## -------------------------------------------------------
seL <- grep("sel$", names(dat), value = TRUE)                # xt: sd of log
muL <- setdiff(grep("l$", names(dat), value = TRUE), seL) # xt: mu of log
vnL <- sub('l$', '', muL)                         # xt: var names
P <- length(vnL)
## -------------------------------------------------------
## define data slit sample samples
## -------------------------------------------------------
## ratio of train-to-test data split
trn <- 0.7; tst <- 0.3
## ratio of legalized-to-nonlegalized to sample
zrs <- 1.6; ons <- .8
                                     509


## -------------------------------------------------------
## define objects to store results
## -------------------------------------------------------
var.lst <- list() # data to store variable importance
prd.lst <- list() # data to store county-level predictions
roc.lst <- list() # data to store AUC stuff
acc.lst <- list() # data to store accurary assessment
## ---------------------------------------------------------
## start simulations
## ---------------------------------------------------------
B <- 1e3
for(i in 1:B)
{
   ## bootstrap, training and testing DO NOT share counties.
                                     510


tmp <- with(dat,
{
   i0 <- which(sens3_2014 == 0); n0 <- length(i0) # 0s
   i1 <- which(sens3_2014 == 1); n1 <- length(i1) # 1s
   ## divide (a) training and (b) testing, then permute
   a0 <- sample(i0, n0 * trn); b0 <- sample(setdiff(i0, a0))
   a1 <- sample(i1, n1 * trn); b1 <- sample(setdiff(i1, a1))
   ## bootstrap, number of 1s as the basic number
   a <- c(rep(a0, len=n1 * zrs * trn), rep(a1, len=n1 * ons * trn))
   b <- c(rep(b0, len=n1 * zrs * tst), rep(b1, len=n1 * ons * tst))
   list(a=sort(a), b=sort(b))
})
tmp <- rbind(cbind(dat='a', dat[tmp$a, ]), cbind(dat='b', dat[tmp$b, ]))
N <- nrow(tmp)
## check
                                 511


   with(tmp, intersect(qname[dat=='a'], qname[dat=='b'])) # must be
empty
   with(tmp, table(dat, sens3_2014)) # row should be trn:tst, col should be
zro:ons
   ## ------------------------------------------------
   ## introduce variability to NSDUH RDAS estimates
   ## ------------------------------------------------
   ## xt: expand standard deviation to 3 times
   .x <- rnorm(N * P, unlist(tmp[, muL]), unlist(tmp[, seL]))
   .x <- matrix(1 / (1 + exp(-.x)), N, P, dimnames=list(NULL, vnL))
   tmp <- cbind(tmp[, !grepl("^(twelve|eighteen)", names(tmp))], .x)
   ## ---------------------------
   ## Perform training testing split
   ## ----------------------------
                                     512


   A <- subset(tmp, dat == 'a', -dat)
   B <- subset(tmp, dat == 'b', -dat)
   ## ----------------------------
   ## fit a GLM model, predict testing data
   ## ----------------------------
   mld <- glm(sens3_2014 ~ ., data        = subset(A, se=-qname), family =
'binomial')
   pht <- predict(mld,      newdata = subset(B, se=-qname), type =
'response')
   ## ---------------------------------------
   ## store county-level predictions
   ## ---------------------------------------
   ## xt: fraction of 0s as threshold
   res <- cbind(sim=i, B, pht=pht, yht=0 + (pht > mean(B$sens3_2014 <
1)))
                                     513


  prd.lst[[i]] <- res
  ## -------------------------------------
  ## store predictor statistics
  ## -------------------------------------
  var.lst[[i]] <- data.frame(sim=i, variables=names(coef(mld))[-1],
Z=summary(mld)$coef[-1, 3], row.names=NULL)
  ## -------------------------------------
  ## store prediction accuracy results
  ## -------------------------------------
  tbl <- with(res, table(sens3_2014, yht))
  acc.lst[[i]] <- data.frame(
     sim=i,
     TP = tbl[2, 2], TN = tbl[1, 1], FP = tbl[1, 2], FN = tbl[2, 1],
     TPF = tbl[2, 2] / sum(tbl[2, ]), TNF = tbl[1, 1] / sum(tbl[1, ]),
                                    514


     FPF = tbl[1, 2] / sum(tbl[1, ]), FNF = tbl[2, 1] / sum(tbl[2, ]),
     acc = sum(diag(tbl)) / sum(tbl))
  ## --------------------------
  ## create a ROC curve
  ## --------------------------
  ROC <- with(res, roc(sens3_2014, pht, quiet = TRUE))
  roc.lst[[i]] <- with(ROC, data.frame(sim=i, TPR=rev(sensitivities),
FPR=rev(1 - specificities)))
}
var2 <- do.call(rbind, var.lst) # predictor z-scores
acc2 <- do.call(rbind, acc.lst)   # accurary performance
roc2 <- do.call(rbind, roc.lst)  # roc
prd2 <- do.call(rbind, prd.lst)  # truth and prediction, per county, per sim
## ROC for all simulations combined (overall)
rocA <- with(prd2, roc(sens3_2014, pht, quiet = TRUE))
                                    515


aucA <- round(auc(rocA), 3)
rocA <- with(rocA, data.frame(TPR=rev(sensitivities), FPR=rev(1-
specificities)))
## ---------------------------------------
## plot all ROC curves with the average
## ---------------------------------------
setwd("C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/results/")
g <- ggplot() + theme_bw()
g <- g + geom_abline(intercept = 0, slope = 1, size = 0.5, linetype =
"dashed") # ref
g <- g + geom_line(aes(x=FPR, y=TPR, group=sim), roc2, alpha=0.3) # per
sim
g <- g + geom_line(aes(x=FPR, y=TPR), rocA, color="red", size=2)      #
overall
                                     516


g <- g + geom_text(aes(x = 0.75, y=0.4, label=paste0("AUC: ", aucA)),
color="red", size=10)
g <- g + scale_x_continuous(limits = c(0, 1)) + scale_y_continuous(limits =
c(0, 1))
ggsave("roc_glm_sens3.png", g)
## -------------------------------------------
## summarize predictive power of variables
## -------------------------------------------
round(do.call(rbind, with(var2, by(Z, variables, summary)))[, 2:5], 3)
## ------------------------------------------
## Summarize Classification Accuracies
## ------------------------------------------
                                     517


summary(acc2)
## ------------------------------------------------
## ensemble predictions per county using different weights
## ------------------------------------------------
WTS <- c('TPF', 'TNF', 'FNF', 'acc')
mrg <- merge(prd2, acc2, by="sim")[, c(names(prd2), WTS)]
ens <- by(mrg, mrg$qname, function(x)
{
  wts <- t(colMeans(x[, 'pht'] * x[, WTS])) # esemble voted probability
  data.frame(
     county = x[1, 'qname'], N = nrow(x), Y = x[1, 'sens3_2014'],
     YHT = mean(x[, 'yht']), # mean of per-simulation hard-call
     PHT = mean(x[, 'pht']), # mean of per-simulation preditive probability
     wts)
                                     518


})
ens <- data.frame(do.call(rbind, ens), row.names=NULL)
## check performance
CRI <- c("YHT", "PHT", WTS) # criteria
round(cor(ens[, 'Y'], ens[, CRI]), 3)
## make hard-calls
THD <- zrs / (zrs + ons) # xt: fraction of 0s as threshold
hdc <- cbind(ens[, !names(ens) %in% CRI], (ens[, CRI] > THD) + 0)
with(hdc, table(Y, TNF))
with(hdc, table(Y, FNF))
with(hdc, table(Y, acc))
                                    519


with(hdc, table(Y, PHT))
with(hdc, table(Y, YHT))
## Output for SAS mapping
write.csv(ens, file='C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/ens_sens3.csv')
library(ggplot2)
library(pROC)
set.seed(1000)
## ---------------------------------------------------------
## read-in dataset
## ---------------------------------------------------------
remove(list=ls()); gc()
                                     520


dat <- readRDS("C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/final_data_voterbins.rds")
## remove "Other" votes and alternate specifications for the outcomes
dat <- subset(dat, se=-c(percentdem_2012_votes, rcl_2012, sens_2014,
sens2_2014, sens3_2014))
## make all names lower cases
names(dat) <- tolower(names(dat))
## drop PCs if desired
dat <- subset(dat, se=-c(pca3, pca4, pca5, pca6, pca7, pca8, pca9,
pca10))
## drop missing outcomes from analysis
dat <- subset(dat, !is.na(lag_2014))
## dat <- within(dat,            # xt: standardize large numbers
                                 521


## {
##    area_water <- as.vector(scale(area_water))
##    area_land <- as.vector(scale(area_land))
## })
dat <- na.omit(dat)               # xt: 3103 to 3094
## -------------------------------------------------------
## Create macro variables for L and seL
## -------------------------------------------------------
seL <- grep("sel$", names(dat), value = TRUE)                # xt: sd of log
muL <- setdiff(grep("l$", names(dat), value = TRUE), seL) # xt: mu of log
vnL <- sub('l$', '', muL)                         # xt: var names
P <- length(vnL)
## -------------------------------------------------------
## define train/test sample and case control ratio
                                     522


## -------------------------------------------------------
## ratio of train-to-test data split
trn <- 0.7; tst <- 0.3
## ratio of legalized-to-nonlegalized to sample
zrs <- 1.6; ons <- .8
## -------------------------------------------------------
## define objects to store results
## -------------------------------------------------------
var.lst <- list() # data to store variable importance
prd.lst <- list() # data to store county-level predictions
roc.lst <- list() # data to store AUC stuff
acc.lst <- list() # data to store accurary assessment
## ---------------------------------------------------------
## start simulations
                                     523


## ---------------------------------------------------------
B <- 1e3
for(i in 1:B)
{
   ## bootstrap, training and testing DO NOT share counties.
   tmp <- with(dat,
   {
       i0 <- which(lag_2014 == 0); n0 <- length(i0) # 0s
       i1 <- which(lag_2014 == 1); n1 <- length(i1) # 1s
       ## divide (a) training and (b) testing, then permute
       a0 <- sample(i0, n0 * trn); b0 <- sample(setdiff(i0, a0))
       a1 <- sample(i1, n1 * trn); b1 <- sample(setdiff(i1, a1))
       ## bootstrap, number of 1s as the basic number
       a <- c(rep(a0, len=n1 * zrs * trn), rep(a1, len=n1 * ons * trn))
       b <- c(rep(b0, len=n1 * zrs * tst), rep(b1, len=n1 * ons * tst))
       list(a=sort(a), b=sort(b))
                                     524


   })
   tmp <- rbind(cbind(dat='a', dat[tmp$a, ]), cbind(dat='b', dat[tmp$b, ]))
   N <- nrow(tmp)
   ## check
   with(tmp, intersect(qname[dat=='a'], qname[dat=='b'])) # must be
empty
   with(tmp, table(dat, lag_2014)) # row should be trn:tst, col should be
zro:ons
   ## ------------------------------------------------
   ## introduce variability to NSDUH RDAS estimates
   ## ------------------------------------------------
   ## xt: expand standard deviation to 3 times
   .x <- rnorm(N * P, unlist(tmp[, muL]), unlist(tmp[, seL]))
   .x <- matrix(1 / (1 + exp(-.x)), N, P, dimnames=list(NULL, vnL))
   tmp <- cbind(tmp[, !grepl("^(twelve|eighteen)", names(tmp))], .x)
                                     525


   ## ---------------------------
   ## Perform training testing split
   ## ----------------------------
   A <- subset(tmp, dat == 'a', -dat)
   B <- subset(tmp, dat == 'b', -dat)
   ## ----------------------------
   ## fit a GLM model, predict testing data
   ## ----------------------------
   mld <- glm(lag_2014 ~ ., data       = subset(A, se=-qname), family =
'binomial')
   pht <- predict(mld,      newdata = subset(B, se=-qname), type =
'response')
   ## ---------------------------------------
                                     526


  ## store county-level predictions
  ## ---------------------------------------
  ## xt: fraction of 0s as threshold
  res <- cbind(sim=i, B, pht=pht, yht=0 + (pht > mean(B$lag_2014 < 1)))
  prd.lst[[i]] <- res
  ## -------------------------------------
  ## store predictor statistics
  ## -------------------------------------
  var.lst[[i]] <- data.frame(sim=i, variables=names(coef(mld))[-1],
Z=summary(mld)$coef[-1, 3], row.names=NULL)
  ## -------------------------------------
  ## store prediction accuracy results
  ## -------------------------------------
  tbl <- with(res, table(lag_2014, yht))
                                    527


  acc.lst[[i]] <- data.frame(
     sim=i,
     TP = tbl[2, 2], TN = tbl[1, 1], FP = tbl[1, 2], FN = tbl[2, 1],
     TPF = tbl[2, 2] / sum(tbl[2, ]), TNF = tbl[1, 1] / sum(tbl[1, ]),
     FPF = tbl[1, 2] / sum(tbl[1, ]), FNF = tbl[2, 1] / sum(tbl[2, ]),
     acc = sum(diag(tbl)) / sum(tbl))
  ## --------------------------
  ## create a ROC curve
  ## --------------------------
  ROC <- with(res, roc(lag_2014, pht, quiet = TRUE))
  roc.lst[[i]] <- with(ROC, data.frame(sim=i, TPR=rev(sensitivities),
FPR=rev(1 - specificities)))
}
var2 <- do.call(rbind, var.lst)  # predictor z-scores
acc2 <- do.call(rbind, acc.lst)   # accurary performance
roc2 <- do.call(rbind, roc.lst)  # roc
                                    528


prd2 <- do.call(rbind, prd.lst)    # truth and prediction, per county, per sim
## ROC for all simulations combined (overall)
rocA <- with(prd2, roc(lag_2014, pht, quiet = TRUE))
aucA <- round(auc(rocA), 3)
rocA <- with(rocA, data.frame(TPR=rev(sensitivities), FPR=rev(1-
specificities)))
## ---------------------------------------
## plot all ROC curves with the average
## ---------------------------------------
setwd("C:/Users/montg/Dropbox/Ph.D Work/Dissertation/Data/Aim
1/results/")
g <- ggplot() + theme_bw()
g <- g + geom_abline(intercept = 0, slope = 1, size = 0.5, linetype =
"dashed") # ref
                                     529


g <- g + geom_line(aes(x=FPR, y=TPR, group=sim), roc2, alpha=0.3) # per
sim
g <- g + geom_line(aes(x=FPR, y=TPR), rocA, color="red", size=2)       #
overall
g <- g + geom_text(aes(x = 0.75, y=0.4, label=paste0("AUC: ", aucA)),
color="red", size=10)
g <- g + scale_x_continuous(limits = c(0, 1)) + scale_y_continuous(limits =
c(0, 1))
ggsave("roc_glm.png", g)
## -------------------------------------------
## summarize predictive power of variables
## -------------------------------------------
round(do.call(rbind, with(var2, by(Z, variables, summary)))[, 2:5], 3)
#                  1st Qu. Median        Mean 3rd Qu.
# area_land              0.733 1.346 320390.08 1.932
                                     530


# area_water              0.111 0.467 228688.61 0.829
# eighteen_plus_pysmiprev 0.527 1.351 441991.16 1.986
# eighteen_plus_pystprev -1.107 -0.326 -194993.85 0.332
# pca1                -1.649 -0.715 -195687.63 0.555
# pca2                -1.588 -0.802 -216793.23 -0.007
# percentdem_2012_votes          -1.619 -0.912 -598545.37 -0.202
# percentrep_2012_votes         -1.782 -1.100 -612526.57 -0.392
# twelve_plus_pmalcprev        -0.004 0.618 268324.83 1.314
# twelve_plus_pmcigprev        -1.567 -0.829 -362955.85 -0.010
# twelve_plus_pmmjprev           2.203 2.871 1202101.03 3.326
# twelve_plus_pyaudprev        -0.085 0.536 392370.13 1.279
# twelve_plus_pycocprev         1.636 2.239 333135.16 2.685
# twelve_plus_pysudprev        -1.320 -0.481 35547.05 0.260
## ------------------------------------------
## Summarize Classification Accuracies
                                     531


## ------------------------------------------
summary(acc2)
#     sim            TP           TN          FP     FN     TPF
# Min. : 1.0 Min. : 7.00 Min. :33.00 Min. : 0.000 Min. : 0.000
Min. :0.3182
# 1st Qu.: 250.8 1st Qu.:16.00 1st Qu.:40.00 1st Qu.: 2.000 1st Qu.:
3.000 1st Qu.:0.7273
# Median : 500.5 Median :17.00 Median :41.00 Median : 3.000
Median : 5.000 Median :0.7727
# Mean : 500.5 Mean :17.05 Mean :40.97 Mean : 3.026 Mean :
4.945 Mean :0.7752
# 3rd Qu.: 750.2 3rd Qu.:19.00 3rd Qu.:42.00 3rd Qu.: 4.000 3rd Qu.:
6.000 3rd Qu.:0.8636
# Max. :1000.0 Max. :22.00 Max. :44.00 Max. :11.000 Max.
:15.000 Max. :1.0000
#     TNF            FPF            FNF          acc
# Min. :0.7500 Min. :0.00000 Min. :0.0000 Min. :0.7576
                                     532


# 1st Qu.:0.9091 1st Qu.:0.04545 1st Qu.:0.1364 1st Qu.:0.8485
# Median :0.9318 Median :0.06818 Median :0.2273 Median :0.8788
# Mean :0.9312 Mean :0.06877 Mean :0.2248 Mean :0.8792
# 3rd Qu.:0.9545 3rd Qu.:0.09091 3rd Qu.:0.2727 3rd Qu.:0.9091
# Max. :1.0000 Max. :0.25000 Max. :0.6818 Max. :0.9848
#slightly lower accuracy but much better balance of sens and spec
## ------------------------------------------------
## ensemble predictions per county using different weights
## ------------------------------------------------
WTS <- c('TPF', 'TNF', 'FNF', 'acc')
mrg <- merge(prd2, acc2, by="sim")[, c(names(prd2), WTS)]
ens <- by(mrg, mrg$qname, function(x)
{
  wts <- t(colMeans(x[, 'pht'] * x[, WTS])) # esemble voted probability
                                     533


   data.frame(
     county = x[1, 'qname'], N = nrow(x), Y = x[1, 'lag_2014'],
     YHT = mean(x[, 'yht']), # mean of per-simulation hard-call
     PHT = mean(x[, 'pht']), # mean of per-simulation preditive probability
     wts)
})
ens <- data.frame(do.call(rbind, ens), row.names=NULL)
## check performance
CRI <- c("YHT", "PHT", WTS) # criteria
round(cor(ens[, 'Y'], ens[, CRI]), 3)
## make hard-calls
THD <- zrs / (zrs + ons) # xt: fraction of 0s as threshold
hdc <- cbind(ens[, !names(ens) %in% CRI], (ens[, CRI] > THD) + 0)
with(hdc, table(Y, TNF))
                                    534


##   TNF
## Y    0   1
## 0 2751 251
## 1    3 89
with(hdc, table(Y, FNF))
##   FNF
## Y    0
## 0 3002
## 1 92
with(hdc, table(Y, acc))
##   acc
## Y    0   1
## 0 2746 256
## 1    3 89
                         535


with(hdc, table(Y, PHT))
##   PHT
## Y    0   1
## 0 2732 270
## 1    1 91
with(hdc, table(Y, YHT))
##   YHT
## Y    0   1
## 0 2734 268
## 1    1 91
## Counties with false positive predictions
subset(ens, Y == 0 & TNF>THD)
                                 536


## Output for SAS mapping
write.csv(ens, file='C:\\Users\\montg\\Dropbox\\Ph.D
Work\\Dissertation\\Data\\Aim 1\\Processed/ens_lag.csv')
                                 537


BIBLIOGRAPHY
      538


                                       BIBLIOGRAPHY
1.  Alaska State Legislature. (2014). House joint resolution no. 14. Retrieved from
    https://www.akleg.gov/basis/Bill/Text/32?Hsid=HJR014A
2.  Andersen, H. (2007). History and philosophy of modern epidemiology.
3.  Angrist, J. D. and Pischke, J.-S. (2008). Mostly harmless econometrics: An empiricist's
    companion. Princeton university press.
4.  Angrist, J. D., & Pischke, J. S. (2008). Mostly harmless econometrics. Princeton
    university press.
5.  Athey, S., & Imbens, G. W. (2021). Design-based analysis in difference-in-differences
    settings with staggered adoption. Journal of Econometrics.
6.  Baicker, K., & Svoronos, T. (2019). Testing the validity of the single interrupted time
    series design (No. w26080). National Bureau of Economic Research.
7.  Baker, K. M. (1975). Condorcet, from natural philosophy to social mathematics.
8.  Barber, Elizabeth Wayland. (1992). Prehistoric Textiles: The Development of Cloth in the
    Neolithic and Bronze Ages with Special Reference to the Aegean. Princeton University
    Press.
9.  Beeching, J. (1977). The Chinese Opium Wars. United Kingdom: Harcourt Brace
    Jovanovich.
10. Beltz, L., Mosher, C., & Schwartz, J. (2020). County-Level Differences in Support for
    Recreational Cannabis on the Ballot. Contemporary Drug Problems, 47(2), 149-164.
11. Ben-Michael, E., Feller, A., & Rothstein, J. (2021). Synthetic Controls with Staggered
    Adoption (No. w28886). National Bureau of Economic Research.
12. Ben-Michael, E., Feller, A., & Stuart, E. A. (2021). A trial emulation approach for policy
    evaluations with group-level longitudinal data. Epidemiology (Cambridge, Mass.), 32(4),
    533.
13. Bernal, J. L., Cummins, S. and Gasparrini, A. (2017). Interrupted time series regression
    for the evaluation of public health interventions: a tutorial. International journal of
    epidemiology 46 348{355.
14. Blocker, J. (2006). Did Prohibition Really Work? Am J Pub Health, 96(2).
                                             539


15. Boland, P. J. (1989). Majority systems and the Condorcet jury theorem. Journal of the
    Royal Statistical Society: Series D (The Statistician), 38(3), 181-189.
16. Bonnie, R. J., & Whitebread, C. H. (1970). The forbidden fruit and the tree of knowledge:
    an inquiry into the legal history of American marijuana prohibition. Virginia Law Review,
    971-1203.
17. Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., Scott, S. L. et al. (2015). Inferring
    causal impact using bayesian structural time-series models. The Annals of Applied
    Statistics 9 247{274.
18. C. C. Bakels. (2003). The contents of ceramic vessels in the Bactria-Margiana
    Archaeological Complex, Turkmenistan. Electron. J. Vedic Stud. 9.
19. C.J. van Boxtel, B. Santoso and I.R. Edwards (2008). Drug Benefits and Risks:
    International Textbook of Clinical Pharmacology, revised 2nd edition. IOS Press and
    Uppsala Monitoring Centre.
20. Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-Differences with multiple time
    periods. Journal of Econometrics, 225(2), 200–230.
    https://doi.org/10.1016/j.jeconom.2020.12.001
21. Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-Differences with multiple time
    periods. Journal of Econometrics, 225(2), 200–230.
    https://doi.org/10.1016/j.jeconom.2020.12.001
22. Campbell, D. T. and Cook, T. D. (1979). Quasi-experimentation: Design & analysis
    issues for field settings. Rand McNally College Publishing Company Chicago.
23. Caulkins, J. P., Goyeneche, L. A., Guo, L., Lenart, K., & Rath, M. (2021). Outcomes
    associated with scheduling or up-scheduling controlled substances. International
    Journal of Drug Policy, 91, 103110.
24. Cerdá, M., Mauro, C., Hamilton, A., Levy, N. S., Santaella-Tenorio, J., Hasin, D., ... &
    Martins, S. S. (2020). Association between recreational marijuana legalization in the
    United States and changes in marijuana use and cannabis use disorder from 2008 to
    2016. JAMA Psychiatry, 77(2), 165-171.
25. Cerdá, M., Mauro, C., Hamilton, A., Levy, N. S., Santaella-Tenorio, J., Hasin, D., ... &
    Martins, S. S. (2020). Association between recreational marijuana legalization in the
    United States and changes in marijuana use and cannabis use disorder from 2008 to
    2016. JAMA Psychiatry, 77(2), 165-171.
26. Cerdá, M., Wall, M., Feng, T., Keyes, K. M., Sarvet, A., Schulenberg, J., ... & Hasin, D.
    S. (2017). Association of state recreational marijuana laws with adolescent marijuana
    use. JAMA pediatrics, 171(2), 142-149.
                                            540


27. Cerdá, M., Wall, M., Feng, T., Keyes, K. M., Sarvet, A., Schulenberg, J., ... & Hasin, D.
    S. (2017). Association of state recreational marijuana laws with adolescent marijuana
    use. JAMA pediatrics, 171(2), 142-149.
28. Chan, J. T., & Zhong, W. (2019). Reading China: Predicting policy change with machine
    learning.
29. Chang, H. (1970). Commissioner Lin and the Opium War. United States: Norton.
30. Chen, C. Y., Dormitzer, C. M., Gutiérrez, U., Vittetoe, K., González, G. B., & Anthony, J.
    C. (2004). The adolescent behavioral repertoire as a context for drug exposure:
    behavioral autarcesis at play. Addiction, 99(7), 897-906.
31. Cheng, H. G., Augustin, D., Glass, E. H., & Anthony, J. C. (2019). Nation-scale primary
    prevention to reduce newly incident adolescent drug use: the issue of lag time. PeerJ,
    7, e6356. https://doi.org/10.7717/peerj.6356
32. Cheng, H. G., Augustin, D., Glass, E. H., & Anthony, J. C. (2019). Nation-scale primary
    prevention to reduce newly incident adolescent drug use: the issue of lag time. PeerJ,
    7, e6356. https://doi.org/10.7717/peerj.6356
33. Cheng, H. G., Cantave, M. D., & Anthony, J. C. (2016). Alcohol experiences viewed
    mutoscopically: newly incident drinking of twelve-to twenty-five-year-olds in the United
    States, 2002–2013. Journal of studies on alcohol and drugs, 77(3), 405-412.
34. Cheng, H. G., Lopez-Quintero, C., & Anthony, J. C. (2018). Age of onset or age at
    assessment-that is the question: Estimating newly incident alcohol drinking and rapid
    transition to heavy drinking in the United States, 2002-2014. International Journal of
    Methods in Psychiatric Research, 27(1), e1587.
35. Cheng, H. G., Lopez‐Quintero, C., & Anthony, J. C. (2018). Age of onset or age at
    assessment—that is the question: Estimating newly incident alcohol drinking and rapid
    transition to heavy drinking in the United States, 2002–2014. International journal of
    methods in psychiatric research, 27(1), e1587.
36. Coley, R. L., Kruzik, C., Ghiani, M., Carey, N., Hawkins, S. S., & Baum, C. F. (2021).
    Recreational Marijuana Legalization and Adolescent Use of Marijuana, Tobacco, and
    Alcohol. Journal of Adolescent Health, 69(1), 41–49.
    https://doi.org/10.1016/j.jadohealth.2020.10.019
37. Coley, R. L., Kruzik, C., Ghiani, M., Carey, N., Hawkins, S. S., & Baum, C. F. (2021).
    Recreational Marijuana Legalization and Adolescent Use of Marijuana, Tobacco, and
    Alcohol. Journal of Adolescent Health, 69(1), 41–49.
    https://doi.org/10.1016/j.jadohealth.2020.10.019
                                             541


38. Colorado Constitution, Amendment 64. (2012).
39. Colorado Municipal League (2019) Municipal Retail Marijuana Status. Retrieved from
    https://www.cml.org/home/topics-key-issues/municipal-retail-marijuana-laws
40. Cunningham, S. (2020). Causal Inference. The Mixtape, 1.
41. Daniller, A. (2019). Two-thirds of Americans support marijuana legalization. Pew
    Research Center, 14.
42. Darnell, A. (2015). I-502 Evaluation plan and preliminary report on implementation.
    Washington State Institute for Public Policy.
43. Dawson, D. A., Goldstein, R. B., Patricia Chou, S., June Ruan, W., & Grant, B. F. (2008).
    Age at First Drink and the First Incidence of Adult-Onset DSM-IV Alcohol Use
    Disorders. Alcoholism: Clinical and Experimental Research, 32(12), 2149–2160.
    https://doi.org/10.1111/j.1530-0277.2008.00806.x
44. Degenhardt, Louisa et al. (2008). Toward a global view of alcohol, tobacco, cannabis,
    and cocaine use: findings from the WHO World Mental Health Surveys. PLoS
    medicine vol. 5,7: e141. doi:10.1371/journal.pmed.0050141
45. Department of Justice. Title 21 United States Code (USC) Controlled Substances Act
    (1970).
46. Dilley, J. A., Hitchcock, L., McGroder, N., Greto, L. A., & Richardson, S. M. (2017).
    Community-level policy responses to state marijuana legalization in Washington
    State. International Journal of Drug Policy, 42, 102-108.
47. Dilley, J. A., Richardson, S. M., Kilmer, B., Pacula, R. L., Segawa, M. B., & Cerdá, M.
    (2019). Prevalence of cannabis use in youths after legalization in Washington
    state. JAMA pediatrics, 173(2), 192-193.
48. Durkheim, E. (2005). Suicide: A study in sociology. Routledge.
49. Ebrey, P. B. (1999). The Cambridge Illustrated History of China. Kiribati: Cambridge
    University Press.
50. Edward M. Brecher, et al. (1972). The Consumers Union Report on Licit and Illicit Drugs.
51. Everson EM, Dilley JA, Maher JE et al. Post-legalization opening of retail cannabis
    stores and adult cannabis use in Washington State, 2009-2016. Am J Public Health
    2019;109:1294-301.
                                             542


52. Everson EM, Dilley JA, Maher JE et al. Post-legalization opening of retail cannabis
    stores and adult cannabis use in Washington State, 2009-2016. (2019) Am J Public
    Health; 109:1294-301.
53. Farr, W. (2000). Vital statistics: memorial volume of selections from the reports and
    writings. 1885. Bulletin of the World Health Organization, 78(1), 88.
54. Feige, C., & Miron, J. A. (2008). The opium wars, opium legalization and opium
    consumption in China. Applied Economics Letters, 15(12), 911-913.
55. Fontes, M. A., Bolla, K. I., Cunha, P. J., Almeida, P. P., Jungerman, F., Laranjeira, R. R.,
    ... & Lacerda, A. L. (2011). Cannabis use before age 15 and subsequent executive
    functioning. The British Journal of Psychiatry, 198(6), 442-447.
56. Gallup Social and Policy Issues (2016). Gallup Polls [Gallup Poll Social Series].
    Retrieved from http:// https://news.gallup.com/poll/196568/americans-views-shift-
    toughness-justice-system.aspx
57. Gallup Social and Policy Issues (2016). Gallup Polls [Gallup Poll Social
    Series]. Retrieved July 10, 2020, available at
    https://news.gallup.com/poll/196568/americans-views-shift-toughness-justice-
    system.aspx
58. Gallup. (2020). Support for Legal Marijuana Inches Up to New High of 68%.
    Gallup.Com. Retrieved December 29, 2021, from
    https://news.gallup.com/poll/323582/support-legal-marijuana-inches-new-high.aspx
59. Galston, W. A., & Dionne Jr, E. J. (2013). The new politics of marijuana legalization: Why
    opinion is changing. Governance Studies at Brookings, 1-17.
60. Galton, Francis (1877). Typical Laws of Heredity . Nature 15, 492–495
    https://doi.org/10.1038/015492a0
61. Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical
    models. Cambridge university press.
62. Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., & Kruger, L. (1990). The empire of
    chance: How probability changed science and everyday life (No. 12). Cambridge
    University Press.
63. Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing.
    Journal of Econometrics.
64. Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment
    timing. Journal of Econometrics.
                                              543


65. Graunt, J. (1939). Natural and political observations made upon the bills of mortality
    (No. 2). Johns Hopkins Press.
66. Gruber, K., Anderson, A., Calanan, R., VanDyke, M., Barker, L., Burris, D., & Tolliver, R.
    (2016). Marijuana Use Among Adolescents in Colorado: Results from the 2013 Healthy
    Kids Colorado Survey. 10.
67. Gruber, K., Anderson, A., Calanan, R., VanDyke, M., Barker, L., Burris, D., & Tolliver, R.
    (2016). Marijuana Use Among Adolescents in Colorado: Results from the 2013 Healthy
    Kids Colorado Survey. 10.
68. Haggerty, R. J., & Mrazek, P. J. (1994). Reducing risks for mental disorders: Frontiers
    for preventive intervention research. National Academies Press.
69. Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930 (Vol. 314).
    Wiley-Interscience.
70. Hall, W., & Weier, M. (2015). Assessing the public health impacts of legalizing
    recreational cannabis use in the USA. Clinical pharmacology & therapeutics, 97(6), 607-
    615.
71. Helmer, J., & Vietorisz, T. (1974). Drug use, the labor market and class conflict (Vol. 43,
    No. 8). Drug Abuse Council.
72. Herodotus, A. D. Godley. (1920). Trans. The Histories. Harvard Univ. Press.
73. Hill, A. B. (1951) The Clinical Trial. BMJ , 278-282.
74. Hill, A. B. (1952) The Clinical Trial. New England Journal of Medicine 247, 113-119.
75. Hill, A. B. (1953) Observations and Experiment. New England Journal of Medicine 248,
    995-1001.
76. Hill, A. B. (1965) The Environment and Disease: Association or Causation? Proceedings
    of the Royal Society of Medicine 58, 295-300.
77. Hill, A. B. and Doll, R. (1956). Lung Cancer and Tobacco. The BMJ's Questions
    Answered by Professor A. Bradford Hill and R. Doll. BMJ 1(4976), 1160-1163.
78. Horwood, L. J., Fergusson, D. M., Hayatbakhsh, M. R., Najman, J. M., Coffey, C.,
    Patton, G. C., ... & Hutchinson, D. M. (2010). Cannabis use and educational
    achievement: findings from three Australasian cohort studies. Drug and alcohol
    dependence, 110(3), 247-253.
    https://news.gallup.com/poll/323582/support-legal-marijuana-inches-new-high.aspx
                                              544


79. Humphreys, K., Edwards, G., Caulkins, J. P., Babor, T., Foxcroft, D. R., Rehm, J.,
    Fischer, B., Obot, I. S., Babor, T. F., Reuter, P. (2010). Drug Policy and the Public Good.
    United Kingdom: OUP Oxford.
80. Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and
    biomedical sciences. Cambridge University Press.
81. Jonnes, J. (1999). Hep-cats, narcs, and pipe dreams: A history of America's romance
    with illegal drugs. JHU Press.
82. Kandel, D. B., & Logan, J. A. (1984). Patterns of Drug Use from Adolescence to Young
    Adulthood: 1. Periods of Risk for Initiation, Continued Use, and Discontinuation.
83. Kendall, M. G. (1956). Studies in the history of probability and statistics: II. The
    beginnings of a probability calculus. Biometrika, 43(1/2), 1-14.
84. Kerr WC, Lui C, Ye Y. Trends and age, period and cohort effects for marijuana use
    prevalence in the 1984-2015 US national alcohol surveys. Addiction. 2018;113:473–81.
    doi:10.1111/add.14031.
85. Kerr WC, Lui C, Ye Y. Trends and age, period and cohort effects for marijuana use
    prevalence in the 1984-2015 US national alcohol surveys. Addiction. 2018;113:473–81.
    doi:10.1111/add.14031.
86. Kolb, L., & Du Mez, A. G. (1924). The prevalence and trend of drug addiction in the
    United States and factors influencing it. Public Health Reports (1896-1970), 1179-1204.
87. Kramer, M. (1957). A discussion of the concepts of incidence and prevalence as related
    to epidemiologic studies of mental disorders. American Journal of Public Health and the
    Nations Health, 47(7), 826-840.
88. Kramer, M. (1957). A discussion of the concepts of incidence and prevalence as related
    to epidemiologic studies of mental disorders. American Journal of Public Health and the
    Nations Health, 47(7), 826-840.
89. L.W. King. (1915). The Code of Hammurabi. Trans. Paulo J. S. Pereira.
90. Labouvie, E., Bates, M. E., & Pandina, R. J. (1997). Age of first use: its reliability and
    predictive utility. Journal of Studies on Alcohol, 58(6), 638–643.
    https://doi.org/10.15288/jsa.1997.58.638
91. Lapouse, R. (1967). Problems in studying the prevalence of psychiatric disorder.
    American Journal of Public Health and the Nations Health, 57(6), 947-954.
92. Lapouse, R. (1967). Problems in studying the prevalence of psychiatric
    disorder. American Journal of Public Health and the Nations Health, 57(6), 947-954.
                                             545


93.  Legal Recreational Marijuana States and DC - Recreational Marijuana—ProCon.org
     (2021). Recreational Marijuana. Retrieved November 29, 2021, available at
     https://marijuana.procon.org/legal-recreational-marijuana-states-and-dc/
94.  MacMahon, B., T. F. Pugh, and J. Ipsen. (1960). Epidemiologic Methods. Boston: Little,
     Brown & Co.
95.  Maistrov, L. E. (2014). Probability theory: A historical sketch. Academic Press.
96.  Martins, S. S., Segura, L. E., Levy, N. S., Mauro, P. M., Mauro, C. M., Philbin, M. M., &
     Hasin, D. S. (2021). Racial and Ethnic Differences in Cannabis Use Following
     Legalization in US States With Medical Cannabis Laws. JAMA Network Open, 4(9),
     e2127002. https://doi.org/10.1001/jamanetworkopen.2021.27002
97.  Martins, S. S., Segura, L. E., Levy, N. S., Mauro, P. M., Mauro, C. M., Philbin, M. M., &
     Hasin, D. S. (2021). Racial and Ethnic Differences in Cannabis Use Following
     Legalization in US States With Medical Cannabis Laws. JAMA Network Open, 4(9),
     e2127002. https://doi.org/10.1001/jamanetworkopen.2021.27002
98.  McWilliams, J. C. (1990). The Protectors: Harry J. Anslinger and the Federal Bureau of
     Narcotics, 1930-1962. United Kingdom: University of Delaware Press.
99.  Melchior, M., Nakamura, A., Bolze, C., Hausfater, F., El Khoury, F., Mary-Krause, M., &
     Da Silva, M. A. (2019). Does liberalisation of cannabis policy influence levels of use in
     adolescents and young adults? A systematic review and meta-analysis. BMJ Open,
     9(7), e025880.
100. Melchior, M., Nakamura, A., Bolze, C., Hausfater, F., El Khoury, F., Mary-Krause, M., &
     Da Silva, M. A. (2019). Does liberalisation of cannabis policy influence levels of use in
     adolescents and young adults? A systematic review and meta-analysis. BMJ
     Open, 9(7), e025880.
101. Merlin, M. D. (2003). Archaeological evidence for the tradition of psychoactive plant use
     in the old world. Economic Botany, 57(3), 295-323.
102. Midgette, G., & Reuter, P. (2020). Has Cannabis Use Among Youth Increased After
     Changes in Its Legal Status? A Commentary on Use of Monitoring the Future for
     Analyses of Changes in State Cannabis Laws. Prevention Science, 21(1), 137–145.
     https://doi.org/10.1007/s11121-019-01068-4
103. MIT Election Data and Science Lab, 2018. County Presidential Election Returns 2000-
     2020, Retrieved from https://doi.org/10.7910/DVN/VOQCHQ, Harvard Dataverse, V9,
     UNF:6:qSwUYo7FKxI6vd/3Xev2Ng== [fileUNF]’
                                              546


104. Montgomery, B. W., Anthony, J. C., & Vsevolozhskaya, O. (2021). An Epidemiological
     Hypothesis of Policy-Shaped Drug Use Onset Curves. Biomedical Journal of Scientific
     & Technical Research, 38(1), 29994-29998.
105. Mosher, C. J., & Akins, S. (2019). In the weeds: Demonization, legalization, and the
     evolution of us marijuana policy. Temple University Press.
106. MRSC. (2019). Marijuana Regulation in Washington State. Retrieved from
     http://mrsc.org/Home/Explore-Topics/Legal/Regulation/Marijuana-Regulation-in-
     Washington-State.aspx
107. MRSC. (2019). Marijuana Regulation in Washington State. Retrieved December 4, 2019,
     from http://mrsc.org/Home/Explore-Topics/Legal/Regulation/Marijuana-Regulation-in-
     Washington-State.aspx
108. Musto, D. F. (1999). The American disease: Origins of narcotic control. Oxford
     University Press.
109. Nation, M., Crusto, C., Wandersman, A., Kumpfer, K. L., Seybolt, D., Morrissey-Kane,
     E., & Davino, K. (2003). What works in prevention: Principles of effective prevention
     programs. American psychologist, 58(6-7), 449.
110. National Conference of State Legislatures [NCSL]. (2019). State medical marijuana
     Laws. Retrieved June 25, 2019. From http://www.ncsl.org/research/health/state-
     medical-marijuana-laws.aspx. Accessed 26 Aug 2019.
111. National Surveys on Drug Use and Health, 2014. 2010-2012 NSDUH Substate Region
     Definitions. Retrieved from https://www.samhsa.gov/data/report/2010-2012-nsduh-
     substate-region-definitions
112. New York Times. (1914) via Schaeffer’s Drug Library.
     http://www.druglibrary.org/schaffer/History/Negro_cocaine_fiends.htm accessed
     November 11th, 2020.
113. Nobel Prize Outreach AB. (2022). The Nobel Prize in Physiology or Medicine 1929.
     NobelPrize.org. Retrieved April 5, 2022 from
     https://www.nobelprize.org/prizes/medicine/1929/summary/
114. Oregon Legislature. (2014). Chapter 475B — Cannabis Regulation. Retrieved December
     30, 2021, from https://www.oregonlegislature.gov/bills_laws/ors/ors475B.html
115. Pacula RL, Kilmer B, Wagenaar AC et al. Developing public health regulations for
     marijuana: lessons from alcohol and tobacco. Am J Public Health 2014;104:1021-8.
116. Pacula RL, Kilmer B, Wagenaar AC et al. Developing public health regulations for
     marijuana: lessons from alcohol and tobacco. Am J Public Health 2014;104:1021-8.
                                             547


117. Parker MA, Anthony JC. (2015). Epidemiological evidence on extra-medical use of
     prescription pain relievers: transitions from newly incident use to dependence among
     12–21 year olds in the United States using meta-analysis, 2002-
     13. PeerJ 3:e1340 https://doi.org/10.7717/peerj.1340
118. Paschall, M. J., García-Ramírez, G., & Grube, J. W. (2021). Recreational marijuana
     legalization and use among California adolescents: findings from a statewide survey.
     Journal of studies on alcohol and drugs, 82(1), 103-111.
119. Paschall, M. J., García-Ramírez, G., & Grube, J. W. (2021). Recreational marijuana
     legalization and use among California adolescents: findings from a statewide
     survey. Journal of studies on alcohol and drugs, 82(1), 103-111.
120. Payán, D. D., Brown, P., & Song, A. V. (2021). County‐Level Recreational Marijuana
     Policies and Local Policy Changes in Colorado and Washington State (2012‐2019). The
     Milbank Quarterly.
121. Pew. (2013). Majority now supports legalizing Marijuana. Pew Research Center.
     Retrieved December 30, 2021, from http://www.people-press.org/2013/04/04/majority-
     now-supports-legalizing-marijuana
122. Pew. (2015). 63% of Republican Millennials favor marijuana legalization. Pew Research
     Center. Retrieved December 30, 2021, from https://www.pewresearch.org/fact-
     tank/2015/02/27/63-of-republican-millennials-favor-marijuana-legalization/
123. Pew. (2016). Support for marijuana legalization continues to rise. Pew Research Center.
     Retrieved December 30, 2021, from https://www.pewresearch.org/fact-
     tank/2016/10/12/support-for-marijuana-legalization-continues-to-rise/
124. Pew. (2019). Two-thirds of Americans support marijuana legalization. Pew Research
     Center. Retrieved December 30, 2021, from https://www.pewresearch.org/fact-
     tank/2019/11/14/americans-support-marijuana-legalization/
125. Pew. (2021). Americans overwhelmingly say marijuana should be legal for recreational
     or medical use. Pew Research Center. Retrieved December 29, 2021, from
     https://www.pewresearch.org/fact-tank/2021/04/16/americans-overwhelmingly-say-
     marijuana-should-be-legal-for-recreational-or-medical-use/
126. Pillard, R. C. (1970). Marihuana. New England Journal of Medicine, 283(6), 294-303.
127. Prescott, C. A., & Kendler, K. S. (1999). Age at First Drink and Risk for Alcoholism: A
     Noncausal Association. Alcoholism: Clinical and Experimental Research, 23(1), 101–
     107. https://doi.org/10.1111/j.1530-0277.1999.tb04029.x
128. Public Law No. 223, 63rd Cong., approved December 17, 1914.
                                               548


129. R. C. Clarke, M. D. Merlin. (2013) Cannabis: Evolution and Ethnobotany. University of
     California Press.
130. Reed, J. (2016). Marijuana Legalization in Colorado: Early Findings: A Report Pursuant
     to Senate Bill 13-283 (March 2016). 147.
131. Reed, J. (2016). Marijuana Legalization in Colorado: Early Findings: A Report Pursuant
     to Senate Bill 13-283 (March 2016). 147.
132. Reed, J. (2021). Impacts of marijuana legalization in Colorado: A Report Pursuant to
     C.R.S. 24-33.4-516. Retrieved December 10, 2021, available at
     https://cdpsdocs.state.co.us/ors/docs/reports/2021-SB13-283_Rpt.pdf
133. Ren, M., Tang, Z., Wu, X., Spengler, R., Jiang, H., Yang, Y., & Boivin, N. (2019). The
     origins of cannabis smoking: Chemical residue evidence from the first millennium BCE
     in the Pamirs. Science advances, 5(6), eaaw1391.
134. Roth, J. (2020), “Pre-test with Caution: Event-study Estimates After Testing for Parallel
     Trends,” working paper, Brown University Department of Economics.
135. Shimonovich, M., Pearce, A., Thomson, H., Keyes, K., & Katikireddi, S. V. (2020).
     Assessing causality in epidemiology: revisiting Bradford Hill to incorporate
     developments in causal thinking. European Journal of Epidemiology, 1-15.
136. Smart, R., & Pacula, R. L. (2019). Early evidence of the impact of cannabis legalization
     on cannabis use, CUD, and the use of other substances: findings from state policy
     evaluations. The American journal of drug and alcohol abuse, 45(6), 644-663.
137. Smith L. (2018) How a racist hate-monger masterminded America’s War on Drugs.
     Medium. Accessed April 5, 2022, available at https://timeline.com/harry-anslinger-
     racist-war-on-drugs-prison-industrial-complex-fb5cbc281189
138. Socia, K. M., & Brown, E. K. (2017). Up in smoke: The passage of medical Marijuana
     legislation and enactment of dispensary moratoriums in Massachusetts. Crime &
     Delinquency, 63(5), 569-591.
139. Spillane, J. F. (2004). Debating the controlled substances act. Drug and Alcohol
     Dependence, 76(1), 17-29.
140. Staggs, B., Wheeler, I., Aitken, D., & Lawrence, P. (2019). What are the marijuana laws
     in your California city? Explore our database of local cannabis policies. Retrieved from
     https://www.ocregister.com/2018/01/03/what-are-the-marijuana-laws-in-your-
     california-city-explore-our-database-of-local-cannabis-policies-2/.
                                              549


141. Staggs, B., Wheeler, I., Aitken, D., & Lawrence, P. (2019). What are the marijuana laws
     in your California city? Explore our database of local cannabis policies. Retrieved
     December 3, 2019, from https://www.ocregister.com/2018/01/03/what-are-the-
     marijuana-laws-in-your-california-city-explore-our-database-of-local-cannabis-policies-
     2/.
142. STATSAMERCIA (2021). Indiana Business Research Center. Indiana University Kelley
     School of Business. Retrieved from
     http://www.statsamerica.org/CityCountyFinder/Default.aspx
143. Substance Abuse and Mental Health Services Administration (2014). National survey on
     drug use and health.
144. Substance Abuse and Mental Health Services Administration (2021). CBHSQ Data.
     Retrieved September 11, 2021, from https://www.samhsa.gov/data/
145. Substance Abuse and Mental Health Services Administration. (2019). Key substance
     use and mental health indicators in the United States: Results from the 2018 National
     Survey on Drug Use and Health (HHS Publication No. PEP19-5068, NSDUH Series H-
     54). Rockville, MD: Center for Behavioral Health Statistics and Quality, Substance
     Abuse and Mental Health Services Administration. Retrieved from
     https://www.samhsa.gov/data/
146. Thornton, M. Alcohol Prohibition Was a Failure (1991). Pol. Analysis No. 157.
     Washington: Cato.
147. United Nations Office of Drugs and Crime. (2020).
     https://dataunodc.un.org/data/drugs/Prevalence-general accessed December 15th,
     2020
148. United States Census Bureau. (2012). 2010 Census. Retrieved from
     https://data.census.gov/cedsci/
149. United States. Surgeon General's Advisory Committee on Smoking. (1964). Smoking
     and Health: Report of the Advisory Committee to the Surgeon General of the Public
     Health Service (No. 1103). US Department of Health, Education, and Welfare, Public
     Health Service.
150. Varberg, D. (1963). The development of modern statistics. The Mathematics Teacher.
     56 (4), 252-257. Retrieved September 7, 2021, from
     http://www.jstor.org/stable/27956805
151. Volkow, N. D., Baler, R. D., Compton, W. M., & Weiss, S. R. (2014). Adverse health
     effects of marijuana use. New England Journal of Medicine, 370(23), 2219-2227.
                                              550


152. Volkow, N. D., Baler, R. D., Compton, W. M., & Weiss, S. R. (2014). Adverse health
     effects of marijuana use. New England Journal of Medicine, 370(23), 2219-2227.
153. Wagner, F. (2002). From First Drug Use to Drug Dependence Developmental Periods of
     Risk for Dependence upon Marijuana, Cocaine, and Alcohol.
     Neuropsychopharmacology, 26(4), 479–488. https://doi.org/10.1016/S0893-
     133X(01)00367-0
154. Wagner, F. (2002). From First Drug Use to Drug Dependence Developmental Periods of
     Risk for Dependence upon Marijuana, Cocaine, and Alcohol.
     Neuropsychopharmacology, 26(4), 479–488. https://doi.org/10.1016/S0893-
     133X(01)00367-0
155. Warner, Jessica; Her, Minghao; Gmel, Gerhard; Rehm, Jürgen (2001). Can Legislation
     Prevent Debauchery? Mother Gin and Public Health in 18th-Century England. American
     Journal of Public Health. 91: 375–84. doi:10.2105/ajph.91.3.375. PMC 1446560. PMID
     11236401.
156. Washington State Liquor control Board, Initiative 502. (2012).
157. Wilkins, C., Tremewan, J., Rychert, M., Atkinson, Q., Fischer, K., & Forsyth, G. L. (2022).
     Predictors of voter support for the legalization of recreational cannabis use and supply
     via a national referendum. International Journal of Drug Policy, 99, 103442.
158. Wooldridge, Jeffrey M., Two-Way Fixed Effects, the Two-Way Mundlak Regression, and
     Difference-in-Differences Estimators (August 17, 2021). Available at
     SSRN: https://ssrn.com/abstract=3906345 or http://dx.doi.org/10.2139/ssrn.3906345
159. Wright, Hamilton. (1910). Report on the International Opium Commission and on the
     Opium Problem as Seen Within the United States and Its Possessions. U.S. Senate,
     61st Congress, 2nd Session, Document #377.
160. Wu, L.-T., Korper, S. P., Marsden, M. E., Lewis, C., & Bray, R. M. (2003). Use of
     Incidence and Prevalence in the Substance Use Literature: A Review. Rockville, MD:
     Substance Abuse and Mental Health Services Administration, Office of Applied Studies.
161. Wu, L.-T., Korper, S. P., Marsden, M. E., Lewis, C., & Bray, R. M. (2003). Use of
     Incidence and Prevalence in the Substance Use Literature: A Review. Rockville, MD:
     Substance Abuse and Mental Health Services Administration, Office of Applied Studies.
162. Zhang, A. X., & Counts, S. (2015, April). Modeling ideology and predicting policy change
     with social media: Case of same-sex marriage. In Proceedings of the 33rd Annual ACM
     Conference on Human Factors in Computing Systems (pp. 2603-2612).
                                             551