THREE ESSAYS ON DEMAND ESTIMATION

By

Hee Kwon Kyung

A DISSERTATION

Michigan State University

in partial fulﬁllment of the requirements

Submitted to

for the degree of

Economics – Doctor of Philosophy

2019

ABSTRACT

THREE ESSAYS ON DEMAND ESTIMATION

By

Hee Kwon Kyung

Chapter 1: The Role of Reputation/Feedback Contents in NYC Airbnb Market: Evidence
from Hedonic Price Regressions

Economists have found that reducing information asymmetry is crucial for online marketplaces to
overcome market failure due to adverse selection. Reputation/feedback systems and multi-media
web contents from sellers are known to be popular disclosure devices for this purpose. This
paper employs hedonic price regressions to provide empirical evidence that the recent success
of a sharing economy platform, Airbnb, also relies on such publicly available information on
product quality. Machine learning selectors were employed to reduce high-dimensionality in the
attribute space. To process consumer review texts and sellers’ advertisement texts, word/phrase
extraction and sentiment analysis were introduced. I propose a GMM estimation to produce more
accurate implicit price estimates, that was designed to control for time-varying unobservables.
’Superhost’ designation by the platform and consumer reviews showed greater impacts than seller
side advertisement texts.

Chapter 2: Demand Estimation for NYC Airbnb Market: Value of Reputation/Feedback
Contents and Voluntary Disclosures

The success of online marketplaces has often been attributed to reputation/feedback systems,
in that they reduce adverse selection due to information asymmetry by disclosing enforced or
veriﬁable ex-post information on product quality. This paper tries to quantify the value of such
information content in NYC Airbnb market with a newly constructed dataset containing the actual
708,308 vacation rental reservations from Airbnb tourists. A three level nested logit model was
employed to capture consumers’ choice set formation behaviors during web search on the platform

using Google Maps API. High-dimensional attribute space due to extreme product heterogeneity
necessitates variable selection using machine learning methods based on sparsity assumption.
Though model selection procedures by LASSO and exact inference for post selection parameter
estimates were proposed, structural modeling and endogeneity control turn out to be essential
for successful identiﬁcation. Text processing techniques were introduced to extract variables
from sellers’ advertisement texts and consumer reviews. The results conﬁrm a key insight from
information economics: enforced quality certiﬁcations and ex-post veriﬁed consumer reviews
generate greater welfare impacts than non-veriﬁed seller side voluntary disclosures.

Chapter 3: Estimation for the Distribution of Random Coeﬃcients with Heterogeneous Agent
Types: Monte-Carlo Simulation

This paper is a simple Monte-Carlo extension for Fox, Kim, Ryan, and Bajari (2011), which gives a
direct estimator for the distribution of random coeﬃcients in diverse settings including logit demand
models. The estimator is a simple inequality constrained least squares, and this study examines its
behaviors given there are hundreds of consumer types, which could be an interesting case for various
marketplaces. High-dimensional metrics are then introduced to reduce the dimensionality of design
matrices the rank of which is the number of consumer types. The approximation performances to
the cumulative distribution of random coeﬃcients of such post lasso estimators are compared to
those of baseline estimator.

ACKNOWLEDGEMENTS

I am deeply grateful for the guidance from my chair Professor Kyoo Il Kim and committee members,
Professor Joseph A. Herriges, Distinguished Professor Peter Schmidt, and Professor Seunghyun
Kim. Despite of my countless shortcomings, they have endowed me with their precious time and
seasoned wisdom. Up to this date, I am not sure if I deserve such an honor. I also thank all the
faculty members and staﬀs of the Economics Department at Michigan State University.

Born and raised in a perimeter of civilization at least in terms of knowledge, I enjoyed not
only the PhD program but also my exposures to American culture, “the shower of enlightenment”
I would say. The years spent here will remain as one of the best times I have ever had in my life.
Lastly, I thank my parents, sister, brother-in-law, and the little nephew for their love and trust.

iv

TABLE OF CONTENTS

.
.

.
.

.
.

.
.

.
.

.
.

.
.

.
.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF TABLES .
LIST OF FIGURES .
ix
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
CHAPTER 1 THE ROLE OF REPUTATION/FEEDBACK CONTENTS IN NYC AIRBNB
MARKET: EVIDENCE FROM HEDONIC PRICE REGRESSIONS . . . .
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
. . . . . . . . . . . . . . . .
1.2.1.1 NYC Airbnb Market
. . . . . . . . . . . . . . . . . . . . . . . .
1.2.1.2 Reputation/Feedback Repository . . . . . . . . . . . . . . . . . .
1.2.1.3 Basic Contract Enforceability and Voluntary Disclosure from

1.2.1 Trade among Anonymous Sellers and Buyers

1.1
. .
1.2 Airbnb and Data .

Introduction .

1
1
5
5
5
6

.
.

.
.

.
.

.
.

.
.

1.3 Model and Identifying Assumption . . . . . . . . . . . . . . . . . . . . . . . . .

.
.

. .

1.2.2 Data .
1.2.2.1
1.2.2.2

7
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
Sellers
8
. .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. .
Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . .
8
Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
. 14
1.3.1 High-Dimensional Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Sparsity and Variable Selection . . . . . . . . . . . . . . . . . . 14
1.3.1.1
1.3.1.2
Preliminary Analysis: OLS post Lasso and Post Selection Inference 15
1.3.1.3 Cautions with Endogeneity . . . . . . . . . . . . . . . . . . . . . 21
1.3.2 Time-Varying Correlated Unobservables . . . . . . . . . . . . . . . . . . . 22
1.3.2.1
Possible Sources . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.2.2 Markov Process and Consumer Rationality . . . . . . . . . . . . 22
Previous Empirical Research on Airbnb . . . . . . . . . . . . . . 24
1.3.2.3
.
.
.
. 25
1.4.1
Fixed Eﬀects vs. GMM Based on Consumer Rationality . . . . . . . . . . 25
1.4.2 Over Price Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
. 32

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. .

. .

.

.

.

.

.

.

.

1.4 Results .

1.5 Conclusion .

.

.

.
.

.
.

.
.

.
.

2.1

CHAPTER 2 DEMAND ESTIMATION FOR NYC AIRBNB MARKET: VALUE OF
REPUTATION/FEEDBACK CONTENTS AND VOLUNTARY DISCLO-
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
SURES .
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
.
Introduction .
. .
2.1.1
P2P Online Marketplaces and Asymmetric Information . . . . . . . . . . . 33
2.1.2 Estimation Challenges with Airbnb Platform Data . . . . . . . . . . . . . . 36
2.1.3 Literature Review and My Contribution . . . . . . . . . . . . . . . . . . . 39
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
. . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.1.1 Market Deﬁnition and Size . . . . . . . . . . . . . . . . . . . . . 42
2.2.1.2
. . . . . . . . . . 43
Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Purchase Units and Market Share of a Product

2.2.1 NYC Accommodation Market

2.2 Data .

2.2.2

. .

. .

. .

.

.

.

.

v

.

.

.

. .

. 44
2.2.2.1 Rating Score Inﬂation . . . . . . . . . . . . . . . . . . . . . .
. 47
Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.2.2
2.3 Model
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
.
.
.
2.3.1
Potential Guests’ Rental Searching Behavior . . . . . . . . . . . . . . . . . 49
2.3.2 Nested Multinomial Logit (NMNL) Model . . . . . . . . . . . . . . . . .
. 51
2.3.3 High-Dimensional Attributes and Machine Learning . . . . . . . . . . . . 53
Lasso Selector and Oracle Property . . . . . . . . . . . . . . . . 53
Post Selection Inference . . . . . . . . . . . . . . . . . . . . . . 55
. 56
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
. 58
. . . . . . . . . . . . . . . . . . . . . . 62
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.3.3.1
2.3.3.2
2.3.3.3 Cautions on Endogeneity and Post Selection Estimator . . . . .

2.4.1
2.4.2 Elasticities and Welfare Measures

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.5 Conclusion . .

Parameter Estimates

2.4 Results .

. .

. .

.

.

.

.

.

.

.

.

.

.

.
.
.

. .

. .

. .

.
. .

Introduction .

3.1
3.2 Model

CHAPTER 3 ESTIMATION FOR THE DISTRIBUTION OF RANDOM COEFFI-
CIENTS WITH HETEROGENEOUS AGENT TYPES: MONTE-CARLO
SIMULATION .
.
.
.
.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.1 Multinomial Random Coeﬃcients Logit Demand Model
. . . . . . . . . . 70
3.2.2 High-Dimensional Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 71
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Parameter and Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4.1
3.4.2 Marginal Distributions of Coeﬃcients . . . . . . . . . . . . . . . . . . . . 75
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.4 Results and Discussion .

3.3 Monte-Carlo .

3.5 Conclusion .

3.3.1

. .

. .

.

.

.

.

.

.

.

.

.

.

.

.

.

APPENDICES .
APPENDIX A
APPENDIX B
APPENDIX C

BIBLIOGRAPHY .

.

.

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
.
MACHINE LEARNING AND POST SELECTION INFERENCE . 88
OMITTED DETAILS FOR CHAPTER 1 . . . . . . . . . . . . . . . 94
OMITTED DETAILS FOR CHAPTER 2 . . . . . . . . . . . . . . . 101
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
.

.

.

.

.

.

vi

LIST OF TABLES

Table 1.1: Deﬁnitions for Review Score Categories . . . . . . . . . . . . . . . . . . . . . .

9

Table 1.2: Summary Statistics .

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Table 1.3: Selected Words/Phrases from Airbnb Hosts’ Advertisement Texts

. . . . . . . . 11

Table 1.4: Selected Words/Phrases from Airbnb Guests’ Review Texts . . . . . . . . . . . . 12

Table 1.5: Example Guest Reviews and Sentiment Labels

. . . . . . . . . . . . . . . . . . 13

Table 1.6: Bag of Words Matrix for Example Guest Reviews . . . . . . . . . . . . . . . . . 13

Table 1.7: OLS post Lasso on the Pooled Sample (1) . . . . . . . . . . . . . . . . . . . . . 19

Table 1.8: OLS post Lasso on the Pooled Sample (2): Amenity Feature Selection . . . . . . 20

Table 1.9: Fixed Eﬀects vs. GMM (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

Table 1.10: Fixed Eﬀects vs. GMM (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Table 1.11: GMM: Results for Reputation/Feedback Contents over Price Levels

. . . . . . . 28

Table 1.12: GMM: Results for Amenity and Service Features . . . . . . . . . . . . . . . . . 29

Table 2.1: NYC Visitors and Potential Reservations for Travel Accommodations

. . . . . . 43

Table 2.2: Actual Booking Data Summary for NYC Market for Hotels and Airbnb . . . . . 43

Table 2.3: Deﬁnitions for Review Score Categories . . . . . . . . . . . . . . . . . . . . . . 45

Table 2.4: Summary Statistics .

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Table 2.5: Selected Words/Phrases from Airbnb Hosts’ Advertisement Texts

. . . . . . . . 47

Table 2.6: Example Guest Reviews and Sentiment Labels

. . . . . . . . . . . . . . . . . . 48

Table 2.7: Bag of Words Matrix for Example Guest Reviews . . . . . . . . . . . . . . . . . 48

Table 2.8: Demand Parameter Estimates (1): Price, Nesting, and Information . . . . . . . . 60

Table 2.9: Demand Parameter Estimates (2): Amenity and Service Features . . . . . . . . . 61

vii

Table 2.10: WTP and Factor Elasticities for Information Variables

. . . . . . . . . . . . . . 62

Table 2.11: Price Elasticities

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

Table 2.12: Compensating Variations over Counterfactual Scenarios

. . . . . . . . . . . . . 64

Table 3.1: Monte-Carlo Results (1) (Number of Mixtures: 2) . . . . . . . . . . . . . . . . . 77

Table 3.2: Monte-Carlo Results (2) (Number of Mixtures: 4) . . . . . . . . . . . . . . . . . 78

Table 3.3: Monte-Carlo Results (3) (Number of Mixtures: 6) . . . . . . . . . . . . . . . . . 79

Table B.1: Conﬁdence Intervals for OLS post Lasso . . . . . . . . . . . . . . . . . . . .

. 95

Table B.2: GMM: Manhattan and Other Neighborhoods

. . . . . . . . . . . . . . . . . .

. 96

Table B.3: GMM: Manhattan and Other Neighborhoods (Continued from Table B.2)

. . . . 97

Table B.4: Summary Statistics for Variables and Annual Variations

. . . . . . . . . . . . . 98

Table B.5: Relevance Tests for Lagged Instruments . . . . . . . . . . . . . . . . . . . . . . 99

Table C.1: C.Is for OLS Logit, Classical vs. Post Selection Inference

. . . . . . . . . . .

. 102

Table C.2: C.Is for IV Logit, Classical vs. Post Selection Inference . . . . . . . . . . . . . . 103

Table C.3: C.Is for Two Level Nested Logit, Classical vs. Post Selection Inference

. . . . . 104

Table C.4: First Stage Regression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

Table C.5: NYC Airbnb Service Neighborhoods . . . . . . . . . . . . . . . . . . . . . . .

. 107

Table C.6: NYC Airbnb Service Neighborhoods (Continued from Table C.5)

. . . . . . . . 108

viii

LIST OF FIGURES

Figure 2.1: Example Airbnb Listing, a ’Superhost’ Rental

. . . . . . . . . . . . . . . . . . 34

Figure 2.2: Example Review Ratings, Texts, and Google Maps API

. . . . . . . . . . . . . 35

Figure 2.3: Search Filters for NYC Airbnb Rentals . . . . . . . . . . . . . . . . . . . . . . 38

Figure 2.4: Neighborhood Designation Example: ’Midtown’ in Manhattan . . . . . . . .
. 50
Figure 3.1: ˆF(β1): Base vs. Post-cv Lasso (N=5,000, Mix 6, R = 16, ..., 49) . . . . . . . . . 80
Figure 3.2: ˆF(β1): Base vs. Post-hdm Lasso (N=5,000, Mix 6, R = 16, ..., 49)
. 81
Figure 3.3: ˆF(β1): Base vs. Post-cv Lasso (N=5,000, Mix 6, R = 81, ..., 144) . . . . . . . . 82
Figure 3.4: ˆF(β1): Base vs. Post-hdm Lasso (N=5,000, Mix 6, R = 81, ..., 144) . . . . . .
. 83
Figure 3.5: ˆF(β1): Post-cv vs. Post-hdm Lasso (N=5,000, Mix 6, R = 169)
Figure 3.6: ˆF(β1): Post-cv vs. Post-hdm Lasso (N=5,000, Mix 6, R = 529)
Figure 3.7: True Joint Distributions of β1 and β2 (1) N=5,000, Mixture of Two Normals . . 86
Figure 3.8: True Joint Distributions of β1 and β2 (2) N=5,000, Mixture of Four Normals . . 86
Figure 3.9: True Joint Distributions of β1 and β2 (3) N=5,000, Mixture of Six Normals . . . 86
Figure C.1: Correlation Across Review Score Categories . . . . . . . . . . . . . . . . . . . 109

. . . . . . . . . 84

. . . . . .

. . . . . . . . . 85

ix

CHAPTER 1

THE ROLE OF REPUTATION/FEEDBACK CONTENTS IN NYC AIRBNB MARKET:

EVIDENCE FROM HEDONIC PRICE REGRESSIONS

1.1 Introduction

The explosive growth of Airbnb and other sharing economy platforms in the last decade begs a
question; how could they build trust among total strangers over one-time transactions despite the
theoretically expected market failure due to information asymmetry? (Akerlof (1970)) The P2P
(Peer-to-Peer) platform accommodated more than 100 million parties of tourists and the market
value topped at $31 billion as of 2017. One common insight on the success of Airbnb over various
disciplines including tourism, marketing, and economics is that the reputation/feedback systems and
information contents such as texts and photographs of rental units and hosts reduced information
asymmetry, thus facilitating trust among market participants.
(Guttentag (2015), Horton and
Zeckhauser (2016), Ert, Fleischer, Magen (2016), Fradkin, Grewal, and Holtz (2018), and Liang,
Schuckert, Law, and Chen (2017))

Indeed, it is one of the foundational ideas of classical information economics that a market
provider or seller could partially contract on product quality by disclosing ex-post veriﬁable or
enforced information such as warranties and insurances. (Grossman and Hart (1980), Grossman
(1981), and Milgrom (1981)) Besides, repeated transactions and disclosure of reputation/feedback
repository (history) to all potential buyers discipline sellers to act honestly in a future transaction
with a total stranger, which is a particularly enlightening lesson for sharing economies and online
retail outlets. (Kreps (1982, 1990), Tadelis (2016) and Milgrom, North, and Weingast (1990))

This paper argues that for NYC Airbnb market, disclosure of platform enforced and ex-post
veriﬁed information contents on product quality also neutralizes the initial information asymmetry
and prevents market failure. To test this hypothesis, I employ hedonic price regression and
present empirical evidence that the quality certiﬁcation ’Superhost’ badge, host identity veriﬁcation
measures, and consumer review ratings and texts are more inﬂuential to the transaction prices than

1

non-veriﬁed seller side advertisement texts. (’Cheap Talk’)

The drive behind pursuing a question veriﬁed in various online marketplaces is twofold. First,
unlike already successful online P2P markets for material goods, Airbnb is an intermediary for
service products which involves a diﬀerent type of risks over monetary losses.
In fact, there
have been many unfortunate incidents for Airbnb customers: infringements of privacy by hidden
cameras, physical attacks by hosts, and loss of time and pleasure due to deceptive web listings. It is
worth investigating how Airbnb could be successful with such a risk of adverse selection. Second,
this paper oﬀers several methodological tools to deal with a set of identiﬁcation challenges modern
online platform data presents.

One distinctive feature of P2P online platform data is extreme product heterogeneity. It does not
refer to the fact that consumers now have access to many products once bought and sold in traditional
oﬄine shops, thanks to online retail giants like Amazon. The extreme product heterogeneity of
particular interest in this paper is uniqueness in each of the numerous products that have never been
on markets. For example, in Etsy.com buyers can shop custom made apparel, crafts, toys, and 3d
printer blueprints without any big brand names printed on. Taskers in Taskrabbit.com have no idea
what kind of problems they are going to solve until the customers specify them. Likewise, potential
Airbnb guests are staying in a house of someone they never met before.

The ﬁrst identiﬁcation challenge due to such extreme product heterogeneity is the high-
dimensional characteristic (attribute) space. The platforms need to deﬁne and diﬀerentiate each
product so that they can match and sell it to a consumer with speciﬁc preferences. Airbnb rental
units consist of various types of personal properties that have never been publicly oﬀered as travel
accommodations, including cabins, castles, farm barns, boats, and tree houses. Customers have a
choice over a new set of amenity and service features such as baby beds, children’s dinnerware,
EV chargers, and video game consoles. As a result, Airbnb data attaches more than 150 binary
indicators to each rental unit for consumers to satisfy their heterogeneous preferences.

A high-dimensional dataset with numerous binary indicators poses two serious problems.
Some attributes are common and some are scarce, causing multicollinearity. There are irrelevant

2

attributes that do not aﬀect consumers’ purchase decisions signiﬁcantly either in an economic or
statistical sense. The hedonic price regression could thus suﬀer biases in implicit price estimates
or misspeciﬁcations. A variable selection procedure for eﬃciently reducing dimensionality is
necessary.

This paper proposes to adopt sparsity assumption and use variable (model) selection based on
machine learning methods for choosing a subset of attributes that explain the variation in trans-
action prices the best. In fact, economists have been resorting to machine learning techniques to
cope with high-dimensional data plagued by numerous collinear regressors, nuisance variables,
and instruments over various research areas: demand estimation, program/policy evaluation, treat-
ment eﬀects, and general linear models.
(Bajari, Nekipelov, Ryan, and Yang (2015), Belloni,
Chernozhukov, Fernandez-Val, and Hansen (2017), Chernozhukov, Chetverikov, Demirer, Duﬂo,
Hansen, Newey, and Robins (2018), Chernozhukov, Hansen, and Spindler (2015))

The second identiﬁcation challenge from product heterogeneity is how to process product
information contained in unstructured formats. Airbnb hosts voluntarily disclose information on
product quality in texts and images. Through texts, a host extols the merits of a rental unit and
neighborhood: nearby tourist attractions, transportation logistics, restaurant recommendations,
and house rules guests should abide by. They characterize the identity of each unique product that
cannot be transmitted via numerical variables or search ﬁlters (binary indicators), letting sellers and
platforms better diﬀerentiate each good/service from another. Buyers can also voluntarily disclose
information by review texts. Reviews often reveal product information from the perspective of past
customers, giving more individuality to products. Airbnb lets only the actual guests write reviews
on the rental units they visited, so review texts are considered to be ex-post veriﬁed information.
Photographs of rental units and textual advertisements from sellers can be deemed as non-veriﬁed.
Such information contents in unstructured formats should be incorporated into hedonic price
regression model because they aﬀect consumers’ valuation and choices. This paper proposes to
generate numerical variables from text data using widely accepted processing techniques: extracting
keywords/phrases and sentiment analysis on consumer reviews. To be more speciﬁc, the frequency

3

of appearance of certain words/phrases and the number of reviews that were classiﬁed as negative
by a supervised machine learning will be used as additional product attributes. Image processing is
beyond the scope of this paper, but represents another source of product information for consumers.
Tourism and hospitality researchers have been adopting textual analysis on guest reviews to
capture consumer sentiments. The two most dominant approaches are tokenization of a review
text into words/phrases and machine learning classiﬁcation or prediction for emotional polarity
of a review.1 Economists also rely on textual analysis in hedonic price studies on various online
marketplaces such as eBay Motors and real estate. (Lewis (2011) and Nowak and Smith (2017))

The third and ﬁnal identiﬁcation issue is endogeneity due to unobserved/omitted variables,
which is a pervasive problem for hedonic price regressions. For Airbnb data, it is often hard
to explain why a consumer chose a speciﬁc rental property among thousands of others even
with high-dimensional attributes and unstructured format descriptions. There could be time-ﬁxed
unobservables such as seasonality in travel accommodation demand/supply and neighborhood traits
like crime rates and education levels. Endogeneity can easily be dealt with ﬁxed eﬀects estimation
if there are only time-ﬁxed unobservables.

However, heterogeneity in consumer tastes and diverse individual schedules/itineraries are also
candidates for unobservables that are time varying and correlated with observed attributes. This
paper ﬁnds a strong evidence for omitted variable bias in preliminary OLS hedonic regressions and
employs a ﬁxed eﬀects estimation to control for time-ﬁxed unobservables. Then with an additional
identifying assumption for time-varying correlated unobservables, I employ a GMM estimation.
For all error structures and estimation methods, estimation results conﬁrm our hypothesis that
enforced and ex-post veriﬁed reputation/feedback signals are more inﬂuential than non-veriﬁed
seller side information contents.

This paper adds value to the existing simple OLS hedonic price research for Airbnb by dealing
with the identiﬁcation challenges listed above. Wang and Nicolau (2017), Chen and Xie (2017),
Teubner, Hawlitschek, and Dann (2017), Gibbs, Guttentag, Gretzel, Morton, and Goodwill (2018)

1See Alaei, Becken, and Stantic (2017) for a comprehensive review of methods and literature up to date.

4

investigated 33 cities across U.S, Austin in Texas, 86 cities in Germany, and ﬁve metropolitan areas
of Canada, respectively. For hotel booking websites, Ye, Law, Gu (2009), Ye, Law, Gu, and Chen
(2011), Ogut and Tas (2012), and Xie, Zhang, and Zhang (2014) regress the number of room nights
sold on review ratings.

My research extends previous literature in economics and marketing/management on the role of
reputation/feedback mechanisms and voluntary disclosures in reducing information asymmetry to
NYC Airbnb case. Economists found that eBay’s ’Superseller’, seller rating scores, and text/photo
descriptions have causal relationships with transaction prices.
(Resnick, Zeckhauser, Swanson,
and Lockwood (2006), Houser and Wooders (2006), Jin and Kato (2006), and Lewis (2011))
Marketing/management researchers investigated the role of online consumer reviews in business
outcomes in various markets for books, movies, music, retail, video games, and electronics.
(Chavalier and Mayzlin (2006), Liu (2006), Dellarocas, Zhang, and Awad (2007), Duan, Gu,
Whinston (2008), Chintagunta, Gopinath, and Venkataraman (2010), Dhar and Chang (2009),
Floyd et al. (2014), Cui, Lui, and Guo (2012), Ghose and Ipeirotis (2011))

The rest of this paper is organized as follows. Section 1.2 revisits Airbnb in terms of information
asymmetry, presents data and explains processing details. Section 1.3 introduces identifying as-
sumptions, preliminary OLS analyses, and estimation methods to control for unobserved variables.
Section 1.4 reports and discusses estimation results. Section 1.5 concludes.

1.2 Airbnb and Data

1.2.1 Trade among Anonymous Sellers and Buyers

1.2.1.1 NYC Airbnb Market

Since 2016, more than 60 million tourists visit NYC annually including 20% of international
arrivals. Average length of stays is about four days and the number of potential accommodation
reservations totals 15 million. Without any possession of commercial real estates, Airbnb received

5

more than a million reservations in 2017.2 A survey on 4,000 people asking previous experiences
with and intentions to use Airbnb rental services found that as of November, 2015 the market share
of Airbnb is occupying 12% of leisure and business travelers. This share was expected to rise to
16-18% in 2016.3 Another report states that the room nights share amounts to 8% compared to
hotels as of August, 2015.4

Such successful entry and robust trading volumes seem quite surprising given that potential
Airbnb guests choose to stay at a total stranger’s housing unit relying solely on the information
showing on computer screens. The web page content of each rental property then can be considered
as the contract between buyers and sellers. Potential guests take most of the product information
on web listings at face value. This paper focuses on how the platform generates such consumer
trust and which information categories would look trustworthy and how much can be trusted in the
eyes of consumers.

1.2.1.2 Reputation/Feedback Repository

According to the disclosure model of Grossman and Hart (1980), Grossman (1981), and Milgrom
(1981), buyers should update their expectations for product quality based only on enforced or
veriﬁable ex-post information contents. For Airbnb case, the quality certiﬁcation ’Superhost’
badge and actual guests’ review ratings and texts apply to the categories. Both could be called
as credible reputation/feedback repositories in that only actual guests can leave review ratings and
texts and Airbnb enforces strict service quality criteria based on past performances in designating
’Superhost’. A host must have accommodated more than ten parties of guests, maintained a 90%
response rate to booking requests or higher, received a ﬁve star review - review scores higher than
80 out of 100 - at least 80% of the time, and completed each of conﬁrmed reservations without
canceling.

2The visitor poll is from NYC & Company. Average length of stays was calculated from the actual booking records
for hotels and Airbnb obtained from Expedia.com and Airdna, a data consulting branch ﬁrm of Airbnb, respectively.

3Who Will Airbnb Hurt More - Hotels or OTAs (Online Travel Agency)?, JP-Morgan’s Global Insight
4Airbnb and Impacts on the New York City Lodging Market and Economy, Hospitality Valuation Services

6

The existence of these well-functioning public reputation/feedback repositories forces Airbnb
hosts to provide present and future guests with accommodation services as speciﬁed by the web
listings, even if the hosts will never meet them again. It is because positive feedback from customers
of the past would reward the hosts in the future businesses with anonymous tourists to come and
negative feedbacks would do the opposite.
(Kreps (1982, 1990) and Tadelis (2016)) One clear
empirical implication of this powerful insight would be that ’Superhost’ badge, higher rating
scores, and textual variables with positive sentiments from Airbnb tourists will attract more future
guests and enable hosts to obtain price premiums.

1.2.1.3 Basic Contract Enforceability and Voluntary Disclosure from Sellers

A set of ground rules and coordination schemes for safe transactions are fundamental prerequisites
for oﬄine/online marketplaces since at least medieval trade fairs in Europe.
(Greif (2006) and
Milgrom, North, and Weingast (1990)) Reputation/feedback systems of Airbnb also rely on basic
contract enforceability and consumer protection measures, including government issued identiﬁca-
tions for both sellers and buyers, payment holdings by escrow during the ﬁrst 24 hours after check-in,
full/partial refunds, and dispute resolutions including providing alternative accommodations.

Together with such trust-enhancing apparatus, the idea that sellers have incentives to diﬀerentiate
themselves from others now makes it possible for consumers to believe voluntary disclosure such
as rental unit descriptions in text and image formats, not whole-heartedly but certainly at some
level. For example, positive adjectives every hosts could say such as “nice”, “comfy”, and “best
in New York City” would not appeal that much to consumers, whereas texts explaining locational
merits like “5 min walk from Central Park” would appeal to consumers given that Google Maps
API on each listing webpage allows potential guests to check the validity of such statements almost
immediately.

Accommodation capacities such as the number of default guests, accompanying guests, bed-
rooms, bathrooms, and beds and binary ﬁlters indicating various amenitiy and service features are
presented in standardized visualized items on each listing webpages or, search ﬁlter menus. They

7

are comparable to hotel booking portals making it natural to include as product attributes in the
hedonic price regression models.

1.2.2 Data

1.2.2.1 Summary Statistics

The data source is InsideAirbnb.com, a public repository of rental unit prices, attributes, and review
texts run by Airbnb. The panel dataset of this paper consists of two time periods. For each time
period, three cross sections of NYC Airbnb rental unit data were stacked together: June, August,
and December recorded each in 2016 and 2017, respectively. Summers and Decembers are the
peak time periods for both demand/supply for NYC vacation rental businesses. A year gap is
used to control for seasonality in ﬁxed eﬀects and GMM estimation with time-varying correlated
unobservables. OLS estimation is conducted on pooled samples stacking cross sections together.
Rental units with extreme prices at both 0.1% outer margins and units without any reviews were
truncated.

Table 1.2 reports summary statistics for the cross section recorded in December 2017. Prices are
pre-tax transaction rental prices which mean listing prices per night plus cleaning fees. All variables
in Table 1.2 except for some review score categories (due to the rating inﬂation and collinear
relationships, as will be explained shortly) were in fact selected by a lasso variant (Belloni and
Chernozhukov (2013)) designed to achieve a successful asymptotic approximation to the objective
(prices) with only a subset of all 180 variables.
It is to eﬃciently reduce high-dimensionality
in attribute space due to extreme product heterogeneity.5 There exist clear diﬀerences between
’Superhost’ units and others. The mean price is higher for ’Superhost’ units by about $10 to $12.
’Veriﬁcation Accounts’ mean the number of contact methods a host maintains, for example phone,
email, facebook, google, and other social media accounts. The fact that a host has multiple accounts
5Section 1.3 explains the methodology for the lasso variable (model) selection, performance of post selection OLS
estimates, and exact inference for them. The lasso selection was conducted on the OLS dataset, containing all the cross
sections recorded at June, August, and December for both 2016 and 2017.

8

implies that Airbnb has veriﬁed GPS coordinates for the location, government issued identiﬁcation,
and photographs of the host on the listing webpage. ’Veriﬁcation Accounts’ can thus be considered
as a proxy for the degree of contract enforceability a host represents.

For all review score categories, ’Superhost’ units have higher averages. However, review
scores are extremely skewed toward left. Review score inﬂation in Airbnb market in fact has
been extensively investigated by Zervas, Proserpio, and Byers (2015), Proserpio, Xu, and Zervas
(2016), and Fradkin, Grewal, and Holtz (2018). Compared to other vacation rental portals, rating
scores of Airbnb tend to be higher. For example, the average 5 star rating (’Overall Rating’) for
TripAdvisor.com vacation rentals is 3.8/5 and for Airbnb, 4.75/5. Reciprocity due to bilateral review
policy and sellers’ strategic manipulation were proposed to be the main causes for rating inﬂation.
The empirical implication is that price premiums due to a unit increase in each rating scores
would be small. Also, given that every rating scores are near perfection, it is important to catch
which categories would appeal to consumers the most. The lasso procedure chose ’Cleanliness’,
’Location’, and ’Value’.

Table 1.1: Deﬁnitions for Review Score Categories
Questions Asked in Reviewing Process

Overall experience

How accurately did the photos and description represent the actual space?

Did the cleanliness match your expectations of the space?

How smooth was the check-in process, within control of the host?

Category
Rating
Accuracy
Cleanliness
Check-in

Communication How responsive and accessible was the host before and during your stay?
How appealing is the neighborhood (safety, convenience, desirability)?

How would you rate the value of the listing?

Location
Value

The number of negative reviews is about twofold for ’Superhost’ units, compared to normal
hosts’ units but the diﬀerence is due to the fact that, on average, ’Superhost’ units received two
times as many consumer reviews. Reviews on ’Superhost’ units also contain more positive phrases
expressing a strong satisfaction enough to write recommendations to future guests or intentions to
come back.6

6More details on text processing is provided in the next sub-subsection 1.2.2.2

9

Table 1.2: Summary Statistics

S.D.

Superhost Normalhost Min Max

2,532

119.439

195.2812

10,832
183.1565

34

1000

1

1.3273

4.8175

0

4.5410

0
1

1
9

7.2012
0.7368
1.0106
0.6018
0.5659
0.7637
0.7874

96.4680
9.8776
9.7149
9.9645
9.9664
9.5170
9.6584

92.4105
9.4928
9.0937
9.6944
9.7268
9.3963
9.2525

4.9400
3.4892

6.4968
4.9246

9.1078
6.5687

3.9658
2.7693

6.5965
10.0342

4.6933
7.0098

7.4214
11.3468

6.4036
9.7274

20

100

2

10

0
0

0
0

1
0
0
1
1

76
48

39
62

16
4.5
9
16
14

0

1

Mean
13,364
185.4537

0.1895
4.5934

93.1792
9.5657
9.2114
9.7456
9.7722
9.4192
9.3294

(Cross Section: 201712)

Number of Obs

Price ($)

Quality Certification

Superhost Indicator
Veriﬁcation Accounts

Review Scores
Overall Rating

Accuracy
Check-in/out
Cleanliness

Communication

Location
Value

Review Text

Negative Reviews
Positive Phrases

Seller Text

Positive Adjectives
Location Phrases

Accommodation Capacites

Default Guests

Bedrooms
Bathrooms

Beds

Guests Included

Amenity and Services

Air Conditioning

Buzzer Wireless Intercom

Cable TV
Free Parking

Indoor Fire Place

Lock on Bedroom Door

Cats Allowed

Internet
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

2.9461
1.1675
1.1143
1.6051
1.6192

0.8836
0.5186
0.3631
0.1076
0.0387
0.1408
0.0648
0.7710
0.6207

0.5622
0.0137

3.0865
1.2014
1.1145
1.6904
1.7749

0.9356
0.4897
0.4313
0.1445
0.0474
0.1829
0.0746
0.7670
0.7753

0.5391
0.0138

1.8063
0.6963
0.3626
1.0757
1.1364

10

2.9133
1.1595
1.1142
1.5852
1.5828

0.8714
0.5254
0.3471
0.0990
0.0367
0.1310
0.0625
0.7720
0.5846

0.5676
0.0137

Seller texts include rental unit titles, sub-titles, descriptions on various aspects such as neigh-
borhoods, transportation, pros and cons, and etc. ’Superhost’ rentals in fact contain fewer ’Positive
Adjectives’ and ’Location Phrases’ than normal host units. The selected accommodation capacities
conform to the empirical studies on price determinants of hotels and Airbnb.7 Amenity and service
features were cross-selected by additional data-driven machine learning methods other than Belloni
and Chernozhukov (2013).

1.2.2.2 Text Processing

This paper employs n-gram word/phrase extraction (bag of words) and sentiment analysis (classiﬁ-
cation) using a supervised machine learning method to process seller and buyer texts. N-gram bag
of words means extracting words/phrases purely according to the frequency of occurrences and use
them as regressors. As shown in tables, selected features are often reduced and categorized at a
researcher’s discretion.

Table 1.3: Selected Words/Phrases from Airbnb Hosts’ Advertisement Texts

Positive Adjectives
Amazing, Beautiful,

Cozy, Friendly, Spacious, ...

Category
Unigram

Bigram

Trigram

Quadrigram

Location Words

Broadway, Manhattan, Soho

Brooklyn, Chelsea, ...

Central Park, Columbia University,
Hell’s Kitchen, Brooklyn Bridge,

Times Square, Union Square,

Walking Distance, Rockefeller Center, ...

Empire State Building, The G train,

Major subway lines,
Grand Central Station,
The Hudson River, ...

Metropolitan Museum of Art
Museum of Natural History, ...

This paper extracts 36 words/phrases out of the 3,000 most frequently appearing ones among
more than 120,000 seller advertisement texts. The counts for each rental unit were summed over
the regarding two categories: ’Positive Adjectives’ and ’Location Words’ (Table 1.3). Similarly,

7See Wang and Nicolau (2017) for a comprehensive review up to date.

11

38 ’Positive Phrases’ expressing recommendations for future guests and intentions to revisit were
selected out of 7,000 most frequently appearing words/phrases among more than 850,000 Airbnb
guest review texts (Table 1.4).

Table 1.4: Selected Words/Phrases from Airbnb Guests’ Review Texts

Recommendations

can highly recommend

can recommend

i deﬁnitely recommend this
i really recommend this place

i recommend
id recommend
we recommend
will recommend

would absolutely recommend
would deﬁnitely recommend
would highly recommend

Intentions to Revisit
cant wait to come back
cant wait to go back
hope to be back soon
hope to come back
hope to see you again
hope to stay here again
hope to stay there again

id deﬁnitely stay here again

...

would deﬁnitely come back

would deﬁnitely consider staying here again

would not hesitate to recommend

would stay there again

would recommend

wouldnt hesitate to stay here again

Supervised machine learning means ﬁtting a function that maps an input to an output based on
an example input-output pair dataset. For sentiment classiﬁcation of review texts it includes the
following procedures; a researcher conducts a pre-processing such as removing non-alphabetical
components, (arabic numbers, commas, punctuations, and etc) trimming white spaces, and con-
verting to lower case letters. A classiﬁcation machine is then trained on sample reviews, with
emotional polarity as outputs and words/phrases as inputs. The choice on word/phrase regressors
could either rely on pre-established dictionaries (lexicons) or n-gram words/phrases from sample
reviews whichever yields the best in-sample prediction performances. The trained machine is then
scaled up on the whole review text corpus.

This paper trains a classiﬁcation machine on a set of 1,000 sample reviews collected from
four major U.S cities other than NYC: Ashevill (NC), Austin (TX), Denver (CO), and Washington
D.C. Using 3,500 n-gram bag of words/phrases, multiple supervised machine learning models
were constructed and the highest in-sample prediction rate (87%) was achieved with classiﬁcation
tree in ’Caret’ R package over Naive Bayes and Support Vector Machine. The classiﬁcation was

12

then applied to the whole 850,000 NYC Airbnb review texts. Table 1.5 and Table 1.6 provide a
conceptual example.

Table 1.5: Example Guest Reviews and Sentiment Labels

Reviews

Ex.1

(Negative)

Ex. 2

(Nonnegative:

Neutral)

Ex. 3

(Nonnegative:

Positive)

Raw Texts

This is a dirty frat house.

No locks other than main building door.

Dirty toilets. No host present. Rotting food in the fridge.

My room at the BPS Hostel was clean and cool.

The staﬀ and fellow guests were friendly and helpful.

The location is very convenient for local eateries, coﬀee shops, pubs and deli’s.

However, I do not feel it was good value for money at $72 per day.

There was no room service,

I shared a bathroom with upto 8 others and the breakfast was weak.

Great location just outside of downtown Asheville.
I stayed here with three other people. Plenty of space.

Mike was very easy to work with, and made sure we had everything we needed.

Table 1.6: Bag of Words Matrix for Example Guest Reviews
plenty

breakfast

however

helpful

dirty

great

Label
Ex.1
Ex.2
Ex.3

...

clean

0
1
0

0
1
0

cool
0
1
0

2
0
0

0
0
1

0
1
0

0
1
0

0
0
1

rot
1
0
0

...

N-gram bag of words and sentiment analysis by supervised machine learning have been the
dominant processing techniques in hospitality research on the impacts of consumer reviews on
prices and business performances, over many platforms such as Ctrip.com, TripAdvisor.com,
Booking.com, Expedia.com, Travel.yahoo.com, and Yelp.com. Also, supervised machine learning
showed better prediction results than lexicon based methods. Existing lexicons do not share many
of the words/phrases on a speciﬁc web portal of interest.8

Following the conventional approach of using counts of words/phrases and negative reviews
as covariates instead of proportionate variables with scaling purposes was done for two reasons.
First, review texts are time-cumulative and superimposed so it is unlikely for consumers to read all
8See Alei, Becken, and Stantic (2017) for a comprehensive review on sentiment analysis methods listed above and

performances over online hospitality research in the last decade.

13

the reviews. If one uses the proportion of negative reviews (from total number of reviews) instead
of the counts, it suﬀers the risk of downward (upward) bias of the coeﬃcient for rental units that
received a great (small) number of reviews already. Second, one can imagine the proportion of
’Positive Adjectives’ or ’Location Words’ among total number of words sellers’ advertisement texts
include. But high ratio does not necessarily mean more value. Individual speciﬁc perceptions and
expectations on product quality induced from such texts could involve further considerations on
various unobserved factors.

1.3 Model and Identifying Assumption

1.3.1 High-Dimensional Metrics

1.3.1.1 Sparsity and Variable Selection

Following Rosen (1974), equation (1.1) presents the baseline hedonic regression model, which
expresses prices as a function of observed attributes and errors (unobserved variables).

log(pit) = α + βXit + it

(1.1)

pit is per night rental prices of unit i = 1 , ..., n at time period t, Xit ∈ Rp represents rental unit
attributes, and it is the error term. The ﬁrst identiﬁcation challenge with NYC Airbnb data is
high-dimensional attribute space plagued by multicollinearity and irrelevant variables. This paper
hence adopts sparsity that is frequently assumed in high-dimensional metrics i.e., that there exist
s = o(n) (cid:28) p attributes that asymptotically capture most of the impacts of all p regressors.

A practical implication of sparsity for general linear regression models with Gaussian or
heteroskedastic errors is that an econometrician ﬁrst chooses a set of s variables that aﬀects
prices the most by lasso and then conducts an OLS only with the s variables. Such OLS post
lasso, with theoretically suggested conditioning parameters for the ﬁrst step lasso selector achieves
a successful asymptotic approximation to the ’true’ log(pit) objective function.
(Belloni and

14

Chernozhukov (2013), Chernozhukov, Hansen, and Spindler (2015), Belloni, Chernozhukov, and
Wang (2014))

More speciﬁcally, the typical risk minimization problem of balancing bias and variance for

hedonic price estimation with sparsity assumption can be stated as the following.

min c2

s + σ2 s
n
c2
s = min

dim(β)≤s

E[(log(pit) − βXit)2]

(1.2)

is the upper bound of the risk for the best log price estimator using only s (cid:28) p covariates.
s + σ2 s
c2
n
(the ’oracle risk’), which is achieved if the ﬁrst stage lasso selector chose the correct s variables
which by sparsity assumption that captures the most of the impacts of all p regressors. Then the

resulting ’oracle rate’ of error convergence rate is given by(cid:112)s/n.
covariates, post selection OLS estimator still achieves the ’near oracle rate’ of(cid:112)s ∗ log(p)/n. In

One important appeal of OLS post lasso is that even if lasso selector gives only a subset of s

E[(log(pit) − β ˆM X ˆM)2] = Op(cs + σ(cid:112)s ∗ log(p)/n), where X ˆM and β

other words,
ˆM represent
the vector of attributes chosen by lasso (the observed selected model ˆM) and the corresponding
post selection OLS coeﬃcients.

(cid:113)

1.3.1.2 Preliminary Analysis: OLS post Lasso and Post Selection Inference

To achieve the ’near oracle property’ of OLS post lasso, lasso procedures need to use theoretically
imposed conditioning parameters. Borrowing notations from Belloni and Chernozhukov (2013),
the lasso selector based on sparsity assumption chooses variables with non-zero coeﬃcients in
solving the following penalized regression problem;

n

ˆβ = argminβ∈Rp ˆQ(β) +

λ
n
(log(pit) − βXit)2

ˆQ(β) =

1
n

i=1

|| ˆΨβ||1

(1.3)

| βj| and ˆΨ = diag( ˆψ1 , ..., ˆψp). The theoretically suggested penalty loadings

where || β||1 =p

j=1

ˆΨ and penalty level λ for heteroskedastic errors are;

15


(cid:115)

ˆψj =

λ = 2c

1
n
√

(x2
i )
i j ˆ 2

n
nΦ−1(1 − γ/(2p))

(1.4)

where Φ denotes the cumulative standard normal distribution and ˆ is an empirical estimate of errors
(residuals). The suggested preset values for c and γ are 1.1 and 0.1. ˆΨ and λ for homoskedastic
errors result in similar variable selection results.

If one proceeds to OLS with the selected (observed) model ˆM with variables of non-zero
coeﬃcients from equation (1.3) however, then classical inferences (conﬁdence intervals and p-
ˆM are no longer valid. It is because of the non-selected (omitted) variables, making the
values) on ˆβ
post selection OLS only with the attributes in X ˆM biased. Though the asymptotic distribution of
lasso coeﬃcients for our case of n (cid:29) p is available, (Knight and Fu (2000)) an exact post selection
inference for OLS post lasso is the primary target of interest.

Such ’post selection inference’ after variable selection with machine learning is a relatively new
and still developing area. This paper follows Lee, L. Sun, Sun, and Taylor (2016) which provides an
exact distribution of post selection OLS estimates and hence, exact conﬁdence intervals, p-values,
and tail areas. The idea is that given a response y ∼ N(µ, σ2In), the model selection event { ˆM = M}
by lasso can be expressed as a form of polyhedron {Ay ≤ b}. Then {Ay ≤ b} once again can be
transformed into an interval with low and upper endpoints being functions ν−(zj) and ν+(zj) of
residuals zj of y in the direction of xj, {ν−(z) ≤ y ≤ ν+(z)}. Due to the independence between y
and zj, the distribution of an individual coeﬃcient ˆβ
(a simple linear transformation of y) from
ˆM
j
OLS conditional on the model selection results ˆM is a truncated normal.

One advantageous fact about Lee et al. (2016) is that a practitioner can produce exact conﬁdence
intervals and p-values with a ﬁxed penalty parameter λ(cid:48). To be more speciﬁc, the lasso formulation
for the exact post selection inference is the original lasso by Tibshirani (1996).

ˆβ = argminβ∈Rp ˆQ(β) + λ

(cid:48)|| β||1

(1.5)

Therefore, with a range of values of λ(cid:48) that produces the same model ˆM including variables of

16

non-zero coeﬃcients from the penalized regression problem in equation (1.3), a practitioner can
produce exact inference for OLS post lasso, preserving the ’oracle’ property.

Table 1.7 and Table 1.8 (the ﬁrst column) report the OLS post lasso estimation results on the
pooled sample stacking cross sections recorded at 2016 and 2017. The lasso selection procedure
for the ’oracle’ rate (equation (1.3)) considers all variables in the dataset, to preliminary check
this paper’s idea: platform enforced quality certiﬁcations and consumer review contents are more
inﬂuential to prices than sellers’ disclosures. The variables pertaining to the information contents
from the platform, buyers and sellers were indeed all selected.

The selected model is stable over a range of the penalty control parameter c, from 0.9 to 1.3 with
0.05 increments. Also, the data-driven lasso (equation (1.5)) selects the model with a range of λ(cid:48)
values and the exact inferences for post OLS estimates were produced. The conﬁdence intervals for
parameter estimates essentially reproduces those of OLS, but are slightly wider for most variables.
This is due to the re-normalization of density resulting from the truncation. The margins are small
given the large number of samples in the dataset.9

The ﬁrst column of Table 1.7 reports OLS post lasso coeﬃcients on information variables.
’Superhost’ badge has 5.25% price impacts and host veriﬁcation accounts have 1.73%, both of
which are statistically signiﬁcant at a 1% level. Among seven rating categories, ’Cleanliness’,
’Location’, and ’Value’ were selected. ’Location’ score has the greatest price impact of 13.68%.
The negative coeﬃcients for ’Value’ scores come natural since they represent per dollar satisfaction.
The following estimation (Subsection 1.3.2) controlling for unobservables proceeds with these three
review scores.

Textual variables indeed turn out to show expected impacts on prices but the magnitudes are
much smaller than those of quality certiﬁcations and review scores. One thing to note is that the
magnitudes and statistical signiﬁcance of coeﬃcients are greater for review text variables than seller
text variables; ’Positive Adjectives’ from seller texts are insigniﬁcant, and ’Location Phrases’ show
9See Appendix A.3 and B.1 for a detailed explanation for the exact post selection inference and the resulting

conﬁdence intervals and tail areas for p-values.

17

signiﬁcant but much smaller implicit price (0.12%) estimate compared to ’Negative Reviews’ and
’Positive Phrases’ from reviews (-0.73% and 0.70%). This diﬀerential impacts are maintained in
price elasticity measures (βk xk) on average values of attributes xk’s as expected.

The ﬁrst column of Table 1.8 reports OLS post lasso coeﬃcients for the chosen 11 amenity
and service features out of the total 150. For robustness to the selection procedures, the successive
four columns of Table 1.7 and Table 1.8 report OLS coeﬃcients with model selection using other
ML methods: data-driven lasso, ridge regression, elastic net, and gradient boosting with n-folds
cross-validation and RMSE criterion for price approximation or prediction performances. The
diﬀerence between the ﬁrst column (OLS post lasso) and the second (data-driven lasso) is about
the purpose of penalized regression problem. The former is for achieving ’oracle’ rate of post OLS
regression, and the latter is for minimizing the RMSE.

Data-driven lasso is as given in equation (1.5). Ridge regression (Hoerl and Kennard (1970))
uses the penalty with L2 norm, || β||2. Elastic net (Zou and Hastie (2005)) deﬁnes the penalty
with a linear combination of L1 and L2 norms. Gradient boosting (Friedman (2001)) is a variation
of regression tree methods which means recursive partitioning of data space for classiﬁcation or
prediction purposes.10

Key information variables and accommodation capacities were selected by all ﬁve ML methods.
Eight additional binary amenity and service features were cross selected by the four data-driven
ML methods. This paper proceeds to estimation methods for controlling unobservables with
variables in Table 1.7, and 11 unanimously selected amenity and service features in Table 1.8:
’Air Conditioning’, ’Buzzer Wireless Intercom’, ’Cable TV’, ’Free Parking on Street’, ’Indoor Fire
Place’, ’Lock on Bedroom Door’, ’Cats Allowed’, ’Internet’, ’Shampoo’, and room types of ’Entire
Home/Apt’ and ’Shared Room’. One reassuring fact is that whether the model includes binary
features selected by ML methods does not systematically alter the main hypothesis of this paper on
the superiority of enforced and ex-post veriﬁed information contents over non-veriﬁed seller side
disclosures.

10See Appendix A.1 and A.2 for detailed explanation on the methodologies.

18

Table 1.7: OLS post Lasso on the Pooled Sample (1)

obj : log(pit)
obs: 75,236

Quality Certification

Superhost Indicator

Veriﬁcation Accounts

Review Scores

Cleanliness

Location

Value

Review Text

Negative Reviews

Positive Phrases

Seller Text

Positive Adjectives

Location Phrases

Accommomdation Capacities

Default Guests

Bedrooms

Bathrooms

Beds

Guests Included

OLS post LASSO

0.0525a
(0.0041)
0.0173a
(0.0013)

0.0490a
(0.0016)
0.1368a
(0.0018)
-0.0919a
(0.0022)

-0.0073a
(0.0004)
0.0070a
(0.0005)

0.0002
(0.0003)
0.0012a
(0.0002)

0.0563a
(0.0016)
0.1037a
(0.0042)
0.1105a
(0.0028)
-0.0192a
(0.0024)
0.0242a
(0.0016)

LASSO

0.0532a
(0.0041)
0.0176a
(0.0013)

Data Driven ML
RIDGE
GBM

0.0509a
(0.0041)
0.0171a
(0.0013)

0.0514a
(0.0041)
0.0171a
(0.0013)

ENET

0.0528a
(0.0041)
0.0176a
(0.0013)

0.0492a
(0.0016)
0.1372a
(0.0018)
-0.0919a
(0.0022)

0.0487a
(0.0016)
0.1372a
(0.0018)
-0.0919a
(0.0022)

0.0489a
(0.0016)
0.1370a
(0.0018)
-0.0921a
(0.0022)

0.0491a
(0.0016)
0.1373a
(0.0018)
-0.0920a
(0.0022)

-0.0072a
(0.0004)
0.0071a
(0.0005)

-0.0074a
(0.0004)
0.0070a
(0.0005)

-0.0073a
(0.0004)
0.0070a
(0.0005)

-0.0072a
(0.0004)
0.0071a
(0.0005)

0.0002
(0.0003)
0.0012a
(0.0002)

0.0002
(0.0003)
0.0012a
(0.0002)

0.0565a
(0.0016)
0.1035a
(0.0042)
0.1106a
(0.0028)
-0.0196a
(0.0024)
0.0244a
(0.0016)

0.0555a
(0.0016)
0.1035a
(0.0042)
0.1104a
(0.0028)
-0.0199a
(0.0024)
0.0238a
(0.0016)

0.0555a
(0.0016)
0.1037a
(0.0042)
0.1104a
(0.0028)
-0.0198a
(0.0024)
0.0239a
(0.0016)

0.0565a
(0.0016)
0.1031a
(0.0042)
0.1104a
(0.0028)
-0.0197a
(0.0024)
0.0243a
(0.0016)

Constant

3.0824a
(0.0206)
a : 1% signiﬁcant,b: 5%, c : 10%, standard errors in parentheses

3.0792a
(0.0206)

3.0874a
(0.0207)

3.0878a
(0.0207)

3.0836a
(0.0206)

19

Table 1.8: OLS post Lasso on the Pooled Sample (2): Amenity Feature Selection

obj : log(pit)
obs: 75,236

Unanimous Choice

Air Conditioner

Buzzer Wireless Intercom

Cable TV

Free Parking on Street

Indoor Fire Place

Lock on Bedroom Door

Cats Allowed

Internet

Shampoo

Room Type

Entire Home/Apt

Shared Room

Cross Selected
Fire Extinguisher

Other Pets Allowed

Family/Kid Friendly

Laptop Friendly Workspace

Safety Card

Smoke Detector

Carbon Monoxide Detector

Hot Tub

OLS post LASSO

LASSO

0.1271a
(0.0042)
0.0888a
(0.0027)
0.0995a
(0.0028)
-0.1206a
(0.0043)
0.1219a
(0.0068)
-0.0677a
(0.0046)
-0.0844a
(0.0053)
0.0267a
(0.0035)
0.0400a
(0.0028)

0.5901a
(0.0034)
-0.1655a
(0.0112)

0.1263a
(0.0042)
0.0878a
(0.0027)
0.1000a
(0.0028)
-0.1209a
(0.0043)
0.1223a
(0.0068)
-0.0677a
(0.0046)
-0.0846a
(0.0053)
0.0259a
(0.0035)
0.0392a
(0.0028)

0.5901a
(0.0034)
-0.1640a
(0.0112)

20

ENET

0.1270a
(0.0042)
0.0890a
(0.0027)
0.0992a
(0.0028)
-0.1210a
(0.0043)
0.1212a
(0.0068)
-0.0684a
(0.0046)
-0.0839a
(0.0053)
0.0266a
(0.0035)
0.0395a
(0.0028)

0.5901a
(0.0034)
-0.1653a
(0.0112)

0.0052
(0.0029)
-0.0479b
(0.0213)

Data Driven ML
RIDGE
GBM

0.1263a
(0.0042)
0.0879a
(0.0027)
0.0992a
(0.0028)
-0.1224a
(0.0043)
0.1208a
(0.0068)
-0.0704a
(0.0047)
-0.0841a
(0.0053)
0.0251a
(0.0035)
0.0371a
(0.0029)

0.5880a
(0.0034)
-0.1638a
(0.0112)

0.0035
(0.0031)
-0.0498b
(0.0213)
0.0110a
(0.0030)
0.0062b
(0.0029)
0.0133a
(0.0042)
-0.0099a
(0.0036)

0.1262a
(0.0042)
0.0880a
(0.0027)
0.0988a
(0.0028)
-0.1227a
(0.0043)
0.1204a
(0.0068)
-0.0704a
(0.0047)
-0.0838a
(0.0053)
0.0248a
(0.0035)
0.0366a
(0.0029)

0.5881a
(0.0034)
-0.1637a
(0.0112)

0.0025
(0.0031)
-0.0500b
(0.0213)
0.0107a
(0.0030)
0.0056b
(0.0029)
0.0119a
(0.0042)
-0.0178a
(0.0042)
0.0123a
(0.0035)
0.0044
(0.0060)

1.3.1.3 Cautions with Endogeneity

Hedonic price regressions often suﬀer from endogeneity due to omitted or unobserved variables.
Even though OLS post lasso produced seemingly appropriately signed parameter estimates, it is
susceptible to endogeneity. It is because both the lasso selector and post OLS use Gaussian or at
best, heteroskedastic errors implicitly assuming that there is no endogeneity due to unobservables.
Indeed, Ramsey test F-values for omitted variables in post selection OLS estimates are extremely
high for all speciﬁcations in Table 1.7 and 1.8.

High-dimensional econometricians in fact, have provided post selection IV regressions with
variable selection on both many controls and instruments with a small number of key endogenous
variables such as treatment/policy indicators or prices.
(Chernozhukov, Hansen, and Spindler
(2015), Chernozhukov, Chetverikov, Demirer, Duﬂo, Hansen, Newey, and Robins (2018)) But they
need a distributional separability assumption between the ’key’ variables and lasso selection on
controls and instruments, which still uses Gaussian errors. It is understandable in that unobservables
are not in the dataset, and ML methods resort to the magnitudes of in-sample prediction errors such
as RMSE to choose the ’right’ subset of all covariates.

This paper hence proposes to use OLS post lasso only as a guide for eﬃcient dimension
reduction in attribute space and employ econometric methodologies to control for endogeneity due
to unobservables: ﬁxed eﬀects and a GMM approach with a panel data. They do not pre-specify a
small set of endogenous variables but accepts the possibility that unobservables could be correlated
with any observed characteristics included in the model.

In fact, the same concerns on endogeneity arise with conventional alternatives including prin-
cipal component analysis (PCA), Akaike information criterion (AIC), or Bayesian information
criterion (BIC), plus being practically infeasible with hundreds of variables to consider. PCA
coeﬃcients are linear combinations of covariates which makes it impossible to isolate and identify
coeﬃcients for individual variables, and the stepwise nature of AIC and BIC dictates too high
calculation costs for estimation and comparison to incur given that there are at most 150 binary
indicators.

21

1.3.2 Time-Varying Correlated Unobservables

1.3.2.1 Possible Sources

Unobservables correlated with observable attributes and prices could be classiﬁed into two cate-
gories; one is time-invariant and another is time-varying. Time-invariant unobservables include
seasonality and neighborhood speciﬁc attributes such as education levels, crime rates, and prox-
imity to famous tourist attractions. If there is no time-varying unobservables, biases in parameter
estimates could be controlled for with a ﬁxed eﬀects estimation.

log(pit) − log(pit−1) = β(Xit − Xit−1) + it

(1.6)

However, there are strong candidates for time-varying unobservables such as curb appeal,
direct/indirect advertising, and geographical dynamics. For example, if a consumer ﬁnds the curb
appeal of a rental unit perceived via web images quite attractive, then he/she could ignore some
shortcomings in certain observed attributes. Rental unit images containing curb appeal change
overtime in quantity, quality, and contents.

Also, if a potential guest were looking for information about travel accommodation choices
using social network services (SNS), chances are the search log and cookies would pop up Airbnb
advertisement on the screen. The schedule and coverage of such advertising campaigns vary over
time. Concerts, plays, and other events being held in speciﬁc areas of NYC could also aﬀect
consumers’ rental unit choices. Anticipating a traﬃc jam, a tourist can sacriﬁce some personal
standards on other attributes for the sake of locational merits.

1.3.2.2 Markov Process and Consumer Rationality

Following the ideas of Bajari, Fruewirth, Kim and Timmins (2012), this paper imposes a Markov
(AR(1)) process on the errors to describe time-varying unobservables.

log(pit) = α + βXit + τit
τit = ρτit−1 + ηit

22

(1.7)

With a few algebraic manipulations, equation (1.7) implies

log(pit) = α + βXit + (ρτit−1 + ηit)

= (1 − ρ)α + β(Xit − ρXit−1) + ρlog(pit−1) + ηit

(1.8)

The rationale behind such error structure is twofold; time-invariant and a certain portion of time-
varying unobservables could be controlled for with a relatively high value of the persistency
parameter ρ. Given that Airbnb rentals are originally individual real estate properties, a rather
stable time-series modelling is proposed to describe consumers’ expectation of implicit prices
for unobservables. Also, accommodation capacities like the number of bedrooms show enough
variations to identify our dynamic models, but are stable over time. Idiosyncratic shock ηit captures
unexpected changes in time-varying unobservables.

The third and ﬁnal identifying assumption is consumer rationality (equation (1.9)) that time-
varying unobservables ηit do not aﬀect Airbnb guests’ price expectations based on observable
characteristics. This orthogonality moment condition implies in other words, that consumers do
not make systematic errors in implicit pricing of attributes with the current information set It due
to unexpected changes in ηit.

E[log(pit) − ρlog(pit−1) − (1 − ρ)α − β(Xit − ρXit−1)|It] = 0

(1.9)

Since time-varying unobservables are correlated with observable attributes Xit, consumers’ current
information set It includes a set of instruments Zit−1 along with (log(pit−1), Xit, Xit−1). This paper
uses further lags of observables, Xit−2 as instruments. Appendix B.4 provides evidence for strong
relevance of each xit−2 to xit after controlling for log(pit−1) and xit−1. The estimation is easily
implemented with a standard GMM command in STATA.

Together with ﬁxed-eﬀects estimation, the GMM (Generalized Methods of Moments) approach
based on consumer rationality assumption regresses dynamic price adjustments on changes in
observed attributes. Given small annual variations in observed attributes and prices (Appendix
B.3), there could be a risk of relatively less precise estimates for the implicit prices from the dynamic

23

models. This paper presents estimation results for both ﬁxed-eﬀects and GMM to investigate the
research question under various error structures.

1.3.2.3 Previous Empirical Research on Airbnb

Existing hedonic studies on price determinants of Airbnb rental units in tourism and hospitality
research employed simple OLS. (Wang and Nicolau (2017), Chen and Xie (2017), Teubner, Hawl-
itschek, and Dann (2017), Gibbs, Guttentag, Gretzel, Morton, and Goodwill (2018)) Controlling
for an obvious existence of unobservables seems necessary to identify implicit prices of attributes
more accurately. Also, they do not handle the issue of model selection in the presence of many
binary indicators for amenity and service features.

Others employed quasi-experimental identiﬁcation strategies; Ert, Fleischer, and Magen (2016)
hired 900 Amazon Mechanical Turks to study the impacts of the ’Beauty Scores’ from host pho-
tographs. Edelman, Luca, and Svirsky (2017) created Airbnb guest accounts with names strongly
suggestive of African-American ethnicity, and found strong evidence for racial discriminations
in booking processes. However, quasi-experimental methods cannot identify multiple attributes
like regression models because of the narrow windows focused only on one target variable. Also,
artiﬁcial environments to generate data are susceptible to biases, not reﬂecting natural choices of
real customers. Finally, it is hard to ﬁnd a common geographical or regulational break that would
aﬀect entire NYC Airbnb market.

Table 1.9 and Table 1.10 report estimation results using ﬁxed eﬀects and GMM for time-
varying unobservables on the panel dataset. Columns (1) and (3) report estimation results with
amenity and service features selected by ML methods and (2) and (4) without. It is to check if
this paper’s hypothesis is robust to the variable selection procedures. Platform enforced quality
certiﬁcations including ’Superhost’ badge and ’Veriﬁcation Accounts’ show strongly positive signs
for all methods and speciﬁcations. The coeﬃcients and signiﬁcance are higher for GMM than ﬁxed
eﬀects, comparing (1) and (3). ’Superhost’ units have 0.93 to 1.01% price premiums over normal
host units.

24

1.4 Results

1.4.1 Fixed Eﬀects vs. GMM Based on Consumer Rationality

Table 1.9: Fixed Eﬀects vs. GMM (1)

obj : log(pit)
obs: 37,618

Quality Certification

Superhost

Veriﬁcation Accounts

Fixed Eﬀects

(1)

(2)

GMM

(3)

(4)

0.0099***
(0.0020)
0.0022***
(0.0007)

0.0094***
(0.0020)
0.0022***
(0.0007)

0.0101***
(0.0020)
0.0033***
(0.0008)

0.0093***
(0.0021)
0.0019**
(0.0008)

-0.0007
(0.0018)
0.0050**
(0.0021)
-0.0038*
(0.0020)

-0.0012
(0.0018)
0.0052**
(0.0021)
-0.0035*
(0.0020)

0.0007
(0.0022)
0.0087***
(0.0026)
-0.0070***
(0.0023)

0.0014
(0.0022)
0.0064**
(0.0026)
-0.0046**
(0.0023)

-0.0001
(0.0004)
0.0036***
(0.0006)

0

(0.0004)
0.0035***
(0.0006)

-0.0023***
(0.0004)
0.0034***
(0.0006)

-0.0027***
(0.0004)
0.0036***
(0.0006)

-0.0006
(0.0005)
0.0021***
(0.0003)

-0.0006
(0.0005)
0.0021***
(0.0003)

-0.0004
(0.0007)
0.0013***
(0.0004)

Review Scores

Cleanliness

Location

Value

Review Texts
Negative Reviews

Positive Phrases

Seller Texts

Positive Adjectives

Location Phrases

Constant

ρ

***: 1% signiﬁcant, **: 5%, *: 10%, standard errors in parentheses

-0.0007
(0.0007)
0.0012***
(0.0004)

0.1793***
(0.0074)
0.9643***
(0.0015)

0.1720***
(0.0079)
0.9654***
(0.0017)

25

Table 1.10: Fixed Eﬀects vs. GMM (2)

Fixed Eﬀects

(1)

(2)

GMM

(3)

(4)

0.0269***
(0.0018)
0.0151**
(0.0076)
0.0420***
(0.0042)
0.0125***
(0.0026)
0.0167***
(0.0018)

0.0365***
(0.0018)
0.0139*
(0.0077)
0.0583***
(0.0041)
0.0133***
(0.0026)
0.0173***
(0.0018)

obj : log(pit)
obs: 37,618

Accommodation Capacities†

Default Guests

Bathrooms

Bedrooms

Beds

Included Guests

Amenity and Service

Air Conditioner

Cable TV

Free Parking on Street

0.0005
(0.0048)
Buzzer Wireless Intercomm 0.0120***
(0.0046)
-0.0005
(0.0042)
-0.0059
(0.0057)
0.0141
(0.0128)
-0.0033
(0.0047)
-0.0330***
(0.0076)
-0.0024
(0.0043)
-0.0029
(0.0035)

Lock on Bedroom Door

Internet

Shampoo

Indoor Fire Place

Cats Allowed

0.0409***
(0.0038)
0.0236*
(0.0130)
0.0607***
(0.0088)
0.0188***
(0.0045)
0.0172***
(0.0030)

0.0304***
(0.0034)
0.0260**
(0.0122)
0.0356***
(0.0079)
0.0136***
(0.0040)
0.0153***
(0.0029)

-0.0054
(0.0071)
0.0111*
(0.0062)
0.0064
(0.0051)
-0.0033
(0.0077)
0.0295*
(0.0165)
-0.0170***
(0.0060)
-0.0174
(0.0127)
0.0007
(0.0051)
0.0004
(0.0044)

Room Type

Entire Home/Apt

Shared Room

0.1499***
(0.0060)
-0.0478***
(0.0157)

0.1700***
(0.0114)
0.0079
(0.0389)

†: One concern with ﬁxed eﬀects and GMM could be that within a year (from 2016 to 2017), there may not
be enough variations in accommodation capacities for identiﬁcation. It turns out that unlike hotels, Airbnb
hosts make non-negligible changes to accommodation capacities. (see Appendix B.3)

26

For review scores, the implicit price estimates and their statistical signiﬁcance for ’Location’
and ’Value’ improve in GMM compared to ﬁxed eﬀects speciﬁcations: from 0.5% and -0.38%
impacts with ﬁxed eﬀects to 0.87% and -0.7% with GMM, respectively. ’Negative Reviews’ have
expected negative sign and are highly signiﬁcant with GMM, but not with ﬁxed eﬀects estimation.
’Positive Phrases’ extracted from consumer review texts have positive impact on rental prices for
both methods. Leaving phrases like ’would deﬁnitely come back’ implies a strong satisfaction of
customers from the past, and hence it is expected that consumers would ﬁnd them trustworthy.
However, the coeﬃcients and statistical signiﬁcance of sellers’ advertisement texts both are less
than consumer review text variables; non-veriﬁed ’Positive Adjectives’ are insigniﬁcant whereas
’Location Phrases’ are signiﬁcant, which is likely because it is hard to lie about locational merits
given that potential guests can check the location with Google Maps API on web listings instantly.
The coeﬃcients on variables for accommodation capacities show expected signs, with ’Default
Guests’, ’Bathrooms’, and ’Beds’ showing noticeable increments in coeﬃcients with GMM method.
Also, the price premiums for room type ’Entire Home/Apt’ become greater and more signiﬁcant
with GMM. Among amenity and service features, ’Indoor Fire Place’ and ’Lock on Bedroom Door’
become signiﬁcant in GMM estimation, though ’Cats Allowed’ and the room type ’Shared Room’
become insigniﬁcant. ’Buzzer Wireless Intercomm’, ’Indoor Fire Place’, and ’Entire Home/Apt’
usually come with private house or apartment which make them as indicators of price premiums;
’Lock on Bedroom Door’ is often associated with rental units of shared spaces such as youth hostels,
thus a good indicator of cheap prices.

Appendix B.2 reports ﬁxed eﬀects and GMM estimation results over Manhattan and other
neighborhoods. More than 47% of NYC Airbnb rental units are concentrated in Manhattan area
with a higher average price by more than $50.
In a less than 3 mile distance circle, central
Manhattan area contains most of the tourist attractions and famous places; if a rental unit belongs
to Manhattan could be an indicator for many unobservables that could be correlated with prices.
The results conﬁrm the dominance of enforced and ex-post veriﬁed reputation/feedback contents
over non-veriﬁed seller side disclosures in both regions.

27

1.4.2 Over Price Levels

Table 1.11: GMM: Results for Reputation/Feedback Contents over Price Levels
obj : log(pit)
obs: 37,618

GMM
2/3Q

Fixed Eﬀects

1Q

1Q

2/3Q

4Q

4Q

Quality Certification

Superhost

Host Veriﬁcation

0.0187***
(0.0047)
0.0061***
(0.0016)

0.0061**
(0.0028)
0.0047***
(0.0010)

0.0071**
(0.0036)
-0.0061***
(0.0013)

0.0190***
(0.0045)
0.0030**
(0.0014)

0.0080***
(0.0030)
0.0043***
(0.0011)

0.0081**
(0.0033)
-0.0004
(0.0015)

-0.0089**
(0.0039)
0.0127***
(0.0042)
0.0029
(0.0045)

0.0067**
(0.0025)
0.0038
(0.0029)
-0.0089***
(0.0028)

-0.0080**
(0.0034)
-0.0029
(0.0044)
-0.0030
(0.0036)

-0.0010
(0.0047)
0.0105**
(0.0050)
0.0011
(0.0050)

0.0089**
(0.0031)
0.0083**
(0.0034)
-0.0107***
(0.0033)

-0.0044
(0.0034)
0.0075*
(0.0045)
-0.0031
(0.0037)

0.0007
(0.0009)
0.0073***
(0.0014)

-0.0006
(0.0006)
0.0038***
(0.0008)

-0.0030***
(0.0009)
0.0032***
(0.0010)

-0.0022***
(0.0008)
0.0051***
(0.0012)

-0.0030***
(0.0006)
0.0028***
(0.0008)

-0.0014*
(0.0008)
0.0025***
(0.0009)

0.0002
(0.0010)
0.0045***
(0.0007)

-0.0013*
(0.0007)
0.0021***
(0.0004)

-0.0023**
(0.0010)
-0.0002
(0.0006)

0.0009
(0.0011)
0.0025***
(0.0008)

-0.0014
(0.0010)
0.0014***
(0.0005)

-0.0011
(0.0014)
-0.0001
(0.0006)

Review Scores

Cleanliness

Location

Value

Review Texts
Negative Reviews

Positive Phrases

Seller Texts

Positive Adjectives

Location Phrases

Constant

ρ : rho

1Q: rental units of lower 25% price range, 2/3Q: middle 50%, and 4Q: upper 25%
***: 1% signiﬁcant, **: 5% ,*: 10%, standard errors in parentheses

0.3237***
(0.0280)
0.9234***
(0.0070)

0.3844***
(0.0229)
0.9193***
(0.0049)

0.3472***
(0.0356)
0.9338***
(0.0066)

28

obj : log(pit)
obs: 37,618

Accommodation Capacities

Default Guests

Bathrooms

Bedrooms

Beds

Included Guests

Amenity and Service

Air Conditioner

0.0236***
(0.0052)
-0.0020
(0.0144)
0.0239**
(0.0119)
0.0019
(0.0079)
0.0299***
(0.0059)

Cable TV

Free Parking on Street

-0.0074
(0.0089)
Buzzer Wireless Intercomm 0.0262***
(0.0098)
-0.0152
(0.0097)
0.0288**
(0.0115)
0.1049***
(0.0324)
0.0139*
(0.0077)
-0.0273**
(0.0129)
0.0134
(0.0093)
-0.0024
(0.0081)

Lock on Bedroom Door

Internet

Shampoo

Indoor Fire Place

Cats Allowed

Table 1.12: GMM: Results for Amenity and Service Features

Fixed Eﬀects

1Q

2/3Q

4Q

1Q

0.0271***
(0.0024)
0.0337***
(0.0106)
0.0516***
(0.0053)
0.0186***
(0.0035)
0.0281***
(0.0029)

0.0033
(0.0064)
0.0212***
(0.0063)
-0.0020
(0.0058)
-0.0202***
(0.0078)
-0.0217
(0.0201)
-0.0201***
(0.0067)
-0.0410***
(0.0112)
-0.0078
(0.0059)
-0.0034
(0.0048)

0.0243***
(0.0029)
0.0188
(0.0181)
0.0364***
(0.0076)
0.0115***
(0.0039)
0.0062***
(0.0024)

-0.0188
(0.0134)
-0.0299***
(0.0086)
0.0146**
(0.0074)
-0.0159
(0.0114)
0.0197
(0.0181)
-0.0852***
(0.0168)
-0.0267
(0.0186)
-0.0116
(0.0084)
0.0025
(0.0061)

0.0319***
(0.0088)
-0.0070
(0.0182)
0.0136
(0.0193)
-0.0006
(0.0113)
0.0105
(0.0103)

0.0136
(0.0124)
0.0248**
(0.0127)
-0.0111
(0.0113)
0.0269*
(0.0153)
0.1123***
(0.0434)
0.0018
(0.0090)
-0.0306**
(0.0149)
0.0045
(0.0103)
0.0047
(0.0100)

GMM
2/3Q

0.0360***
(0.0046)
0.0130
(0.0162)
0.0345***
(0.0091)
0.0184***
(0.0062)
0.0314***
(0.0047)

-0.0009
(0.0093)
0.0169**
(0.0079)
0.0062
(0.0073)
-0.0184*
(0.0105)
0.0296
(0.0289)
-0.0301***
(0.0078)
-0.0581***
(0.0191)
-0.0029
(0.0070)
-0.0001
(0.0061)

4Q

0.0236***
(0.0048)
0.1240***
(0.0279)
0.0466***
(0.0123)
0.0029
(0.0050)
0.0055*
(0.0032)

-0.0183
(0.0156)
-0.0126
(0.0096)
0.0209***
(0.0075)
-0.0135
(0.0147)
0.0196
(0.0178)
-0.0710***
(0.0260)
0.0097
(0.0298)
-0.0141
(0.0090)
0.0015
(0.0066)

Room Type

Entire Home/Apt

Shared Room

0.2514***
(0.0169)
-0.1023***
(0.0251)

0.1225***
(0.0077)
0.0601**
(0.0258)

0.1300***
(0.0110)
-0.1076
(0.0349)

0.3284***
(0.0327)
-0.1137**
(0.0552)

0.2228***
(0.0133)
0.1239**
(0.0490)

0.1182***
(0.0202)
0.0367
(0.0596)

29

Price level itself acts as a quality indicator for consumers and there is a risk of reverse causality in
hedonic price regressions with information variables on product quality as covariates. Also it is
worth investigating the diﬀerential eﬀects of information disclosures over price ranges. Table 1.11
and Table 1.12 report estimation results on the three panel datasets over three price ranges: lower
25%, middle 50%, and upper 25% with average nightly prices of $75, $159, and $337 respectively.
Consumers turn out to value ’Superhost’ badge much more (2-3 folds) in rental units of lower
prices. It has price eﬀects of 1.9% compared to 0.6-0.8% of rental units with higher price ranges.
The coeﬃcients become greater and more signiﬁcant with GMM estimation, namely controlling for
endogeneity due to time-varying correlated unobservables. Host ’Veriﬁcation Accounts’ also show
strongly positive and signiﬁcant coeﬃcients for rental units at both lower 25% and middle 50%
price ranges. It might reﬂect the fact that consumers might demand more rigorous credibility and
professionalism standard for relatively cheap rental units. Since cheap rentals could lure low quality
sellers in the market segment inducing high risk of information asymmetry, credible measures of
product quality would be appreciated by potential guests much more.

For review scores, Airbnb guests for rental units of the middle 50% price range turn out to
care about all three categories of review scores: ’Cleanliness’, ’Location’, and ’Value’. Customers
of relatively cheap and expensive rental units only consider ’Location’ scores, with implicit price
estimates of 1.05% and 0.075%, respectively. It could be stated that locational merits are indeed
an important source of price premiums over all price levels. Comparing estimation methods,
’Cleanliness’ score had a strongly negative coeﬃcient with ﬁxed eﬀects but become insigniﬁcant
with GMM for rental units of lower and upper 25% price ranges . Another sign change and
gain in signiﬁcance occur for ’Location’ score in the top 25% rentals with GMM. There are
overall improvements in statistical signiﬁcance and increments in coeﬃcients over all review score
categories for the units of middle 50% price range.

Coeﬃcients for ’Negative Reviews’ become highly signiﬁcant and show expected negative signs
over all price levels with GMM. Coeﬃcients for ’Positive Phrases’ from review texts are also highly
signiﬁcant and the magnitude is particularly higher for cheap rentals, conﬁrming the ﬁnding that

30

consumers require ex-post veriﬁed trustworthiness more for cheap Airbnb units, similar to the
diﬀerential impacts of ’Superhost’. It also conforms to the ﬁnding in online commerce research
that prices of cheaper products are more sensitive to positive e-WOM. (electronic word of mouth,
Shin, Hanssesns, and Kim (2016))

’Positive Adjectives’ extracted from sellers’ advertisement texts are insigniﬁcant with GMM.
’Location Phrases’ are highly signiﬁcant for rental units of lower 25% and middle 50% price ranges,
but the magnitudes are much smaller than those of quality certiﬁcations, review scores, and review
text variables.

Regarding accommodation capacities, rental guests who book low priced rentals seem to only
care about the number of ’Default Guests’; higher priced rental customers seem to consider the
number of bathrooms, bedrooms, beds, and included guests altogether. A likely explanation for this
could be that a tourist or tourists who visit a rental unit of an average price of $75 in NYC mostly
are ﬁnding a place to spend the night.

A few interesting sign changes occur for amenity and service features. ’Free Parking on Street’
adds a price premium for lower priced rentals, but is a minus factor for rentals of higher prices;
customers who are visiting rentals with prices coming close to three star hotels (more than $158)
usually anticipate a designated parking space or a private garage. ’Lock on Bedroom Door’ could
be welcomed by tourists who visit a cheap rental unit expecting the space being shared with other
guests but it is deﬁnitely a minus factor indicating low level of privacy for higher priced rentals.

Also, the room type ’Entire Home/Apt’ show diﬀerential impacts over price levels. The level
of privacy it presents is much more appreciated in cheaper rentals. The fact that ’Shared Room’
has negative price impacts for cheaper rentals and positive for middle 50% price range could reﬂect
the diﬀerence between the dominant property types each price level implies; rentals of $75 nightly
price in NYC usually mean a youth hostel or a smoking guy’s couch. However, rentals with prices
of more than $158 typically means private housing units where a customer shares the house with
the host or family. The so advertised ’home out of home’ experiences and interactions with locals
seem to come only above a certain price level.

31

1.5 Conclusion

Sharing economy such as Airbnb could suﬀer more severe information asymmetry in that
products and services oﬀered have never been tested as marketable, and involve further risks than
monetary losses. This paper shows that quality certiﬁcations and consumer reviews resolve adverse
selection as veriﬁed in other online marketplaces over various error structures and speciﬁcations.
Non-veriﬁed voluntary disclosures from sellers turn out to be less inﬂuential than enforced and
ex-post veriﬁable information on product quality. Machine learning methods, text processing tech-
niques, and ﬂexible identifying assumptions were proposed to deal with identiﬁcation challenges
modern that online platform data present.

32

CHAPTER 2

DEMAND ESTIMATION FOR NYC AIRBNB MARKET: VALUE OF

REPUTATION/FEEDBACK CONTENTS AND VOLUNTARY DISCLOSURES

2.1 Introduction

2.1.1 P2P Online Marketplaces and Asymmetric Information

P2P (Peer-to-Peer) online marketplaces can be thought of as a matching platform between sellers of
underutilized idle assets and buyers who are willing to pay for temporary occupation of the assets.
For example, Uber or Lyft makes an ordinary car owner a taxi driver and Turo lets people lend and
borrow cars from each other. The same business model applies to leftover parking spaces (Parking
Panda), bikes, surfboards, ski equipments (Spinlister), and many others. Airbnb also falls into this
category, where home owners can be travel accommodation hosts and travelers can enjoy ’home
away from home’ experiences. Airbnb’s ’sharing economy’ platform recorded a market value of
$31 billion in 2017, and is being considered as a serious threat to the existing accommodation
businesses.

This paper empirically tests if the insight of classical information economics (i.e., disclosure
models) contributes to the remarkable success of Airbnb in NYC vacation rental market (Akerlof
(1970), Grossman and Hart (1980), Grossman (1981), and Milgrom (1981)). The insight from dis-
closure models, put simply, means that if the platform provides trustworthy (veriﬁable) information
on product quality, it can prevent market failure caused by adverse selection due to information
asymmetry. Conversely, providing non-trustworthy (non-veriﬁable) information on product quality
would not prevent the market failure due to the asymmetric information.

This paper tests how various information contents on Airbnb websites aﬀect consumer choices
and quantiﬁes how much value (in $) each content created for consumers. Speciﬁcally, it quantiﬁes
the value of non-veriﬁed information provided by the sellers and veriﬁable information provided
by prior consumers. Figure 2.1 shows an example Airbnb listing, with product information stored

33

Figure 2.1: Example Airbnb Listing, a ’Superhost’ Rental

in photographs and texts. Some of them are standard and formatting is provided by the platform,
such as the room type (’Entire Apt’), accommodation capacities (the number of guests, beds, and
bathrooms), ’Home Highlights’ and ’Superhost’ badge right next to the host’s photo. While Airbnb
provides standardized formatting, sellers have discretion in terms of the pictures, texts, and speciﬁc
information to be provided.

Figure 2.2 shows the review ratings and texts from the past travelers. Airbnb lets only the
actual guests who visited the rental unit leave reviews, and hence they are more trustworthy or
’ex-post veriﬁed’. Potential future guests can check the validity of sellers’ contents indirectly by
the reputation they accumulated over time, and it functions as a feedback mechanism for sellers to
enhance their product quality. In addition, the platform provides quality certiﬁcations (’Superhost’
badge and ’Home Highlights’) based on past buyer review ratings. Textual descriptions such
as ’Prime location’ or ’Quietest apartment’ provided by sellers can be deemed as non-veriﬁed
information on product quality or, ’Cheap Talk’.

34

Figure 2.2: Example Review Ratings, Texts, and Google Maps API

This paper’s primary focus is to estimate the values of veriﬁable user information and the non-
veriﬁable seller information on product quality. It is from the key insight from Kreps (1982, 1990)
and Tadelis (2016) that a well functioning public repositories of reputation/feedback mechanisms
discipline sellers to act honestly even with a total stranger in every future transaction, which sustains
a market of trades among anonymous individuals like Airbnb. Using utility parameter estimates
from logit demand models, I ﬁnd that the compensating variations for review ratings and seller
texts are about $38.54 and $3.65 million respectively, with a total of 708,308 reservations during
2016 and 2017 in NYC for a counterfactual scenario of complete absence. It shows that ex-post
veriﬁed information on product quality aﬀects consumer choices more than non-veriﬁed seller side
voluntary disclosures.

There are three estimation challenges. They originate from extreme product heterogeneity that
is inevitable due to the business model of P2P platforms, gathering as many unique individual
assets as possible. First is ’too many products’ for consumers to choose from. If an econometrician
fails to consider consumers’ realistic choice set formation, demand parameters will suﬀer biases.
Second is high dimensional attribute space. A new set of 150 diverse amenity and service features

35

such as ’video game consoles’ and ’EV chargers’ is now on search ﬁlters and product descriptions.
In the dataset, they are all binary indicators causing the risk of multicollinearity and irrelevant
variables. Third is product information stored in unstructured or non-numerical text format i.e.,
consumer reviews and sellers’ advertisement texts as can be seen in Figure 2.1 and 2.2. Given
that they aﬀect consumer choices, appropriate data science techniques to transform such texts into
numerical variables are called for.

This paper contributes to the literature by addressing the estimation challenges listed above
using a set of tools recently developed in econometrics, i.e., machine learning (high dimensional
metrics) and text processing. They can be ﬂexibly adjusted to most of general linear models used
in applied microeconomics research and possess a great potential for extracting policy insights or
business intelligence from massive size datasets (’Big Data’).

2.1.2 Estimation Challenges with Airbnb Platform Data

The ﬁrst and primary concern of demand estimation with many products is to control for consumer
choice sets. There are on average 40,000 unique individual properties operating as Airbnb rentals
in NYC alone. If a researcher falsely assumes and models a consumer to choose a rental unit from
all of the tens of thousands products, utility parameter estimates will deﬁnitely suﬀer biases. The
dataset used in this paper is aggregate (market) level, which means it includes individual rental
units’ market shares but does not contain individual demographics that could directly identify a
consumer’s choice set formation processes.

Too many choice alternatives indeed have been one of the most challenging identiﬁcation
problems in empirical industrial organization. (Berry, Linton, and Pakes (2004) and Berry and
Pakes (2007)) Marketing researchers also showed that consumers pay attention to only a small
subset of all products. (Draganska and Klapper (2011), and Kim, Albuquerque, and Bronnenberg
(2010)) If additional information on consumers’ choice set formation is available such as scanner
data in retail demand, discrete choice models and GMM estimation can be ﬂexibly adjusted to
utilize such information (Kim and Kim (2017)).

36

To describe a realistic choice set formation with utility models, this paper borrows the ﬁndings
from careful observations on Airbnb guests’ web search behaviors. As Fradkin (2017) notes,
Airbnb rental searchers heavily rely on Google Maps API on the search screen. Among searchers
who sent a reservation request, more than 64% of them changed the default map locations and
more than 50% used the zoom-in function to further reduce choice sets. Another key search ﬁlter
was ’Room Type’: Entire Home/Apt, Private, and Shared Room. Nearly 70% of potential guests
applied this ﬁlter to ﬁnd a rental unit.

Hence this paper employs a three level nested multinomial logit (NMNL) model to reﬂect
consumers’ preferences for geographic locations in choice set formation during web search process.
The nesting structure is based on the mutually exclusive service neighborhood designations by
Airbnb. It ﬁrst divides NYC into ﬁve precincts: Bronx, Brooklyn, Manhattan, Queens, and Staten
Island. Each precinct contains 32 to 53 neighborhoods, such as ’SoHo’ in Manhattan. The number
of rentals in a neighborhood varies from 2 to more than 5,500.

The ﬁrst level nesting parameter for the big ﬁve regions (precincts) is intended to capture
the ’changing default map location’ behaviors, which occurs at a wider scale on the map. The
second level nesting parameter for neighborhoods is to capture the zoom-in/out behaviors; tourists
often have intentions to visit or preferences for famous neighborhoods such as ’Hell’s Kitchen’,
’Midtown’, or ’Financial Districts’.
I then include the binary indicators for ’Entire Home/Apt’
and ’Shared Room’. Each counts for 51% and 2.79% of entire rental units in the sample. The
estimation results show highly signiﬁcant and economically meaningful estimates for the nesting
parameters and ’Room Type’ ﬁlters, suggesting that the modeling choice was able to capture the
targeted aspects of consumer preferences.

The second identiﬁcation challenge due to extreme product heterogeneity is high dimensional
attribute space. To diﬀerentiate each rental from another and match it to heterogeneous consumer
preferences, a new set of amenity and service features that traditional hotel chains cannot provide is
added to search ﬁlters, e.g., baby beds, children’s dinnerware, EV charger, and video game consoles.
Property types also show surprising variety, including cabins, castles, farm barns, camping cars,

37

and tree houses. The dataset at our hands as a result attaches 150 binary indicators, giving a high
dimensional attribute space (Figure 2.3).

Such high dimensionality with many binary indicators poses two serious threats; some charac-
teristics are common and some are scarce, causing multicollinearity. There are irrelevant variables
that do not aﬀect a consumer’s purchase decision signiﬁcantly. The demand parameter estimates
could suﬀer biases or misspeciﬁcations.

An eﬃcient way of dimension reduction is called for. This paper hence proposes to adopt
sparsity assumption and use variable (model) selection by a lasso variant (Belloni and Chernozhukov
(2013)), choosing a subset of attributes that explains variations in sales (market shares) the best. The
demand model hence includes only the selected attributes, and exact inferences for the parameter
estimates adjusted to reﬂect additional uncertainty due to the model selection procedures were
proposed directly following Lee, L. Sun, Sun, and Taylor (2016).

Figure 2.3: Search Filters for NYC Airbnb Rentals

The third identiﬁcation challenge is to incorporate textual data as shown in Figure 2.1 and
2.2 into the econometric models. The platform allows each Airbnb hosts to post text descriptions

38

and photographs. Through advertisement texts, a host extols the merits of his/her rental unit
and neighborhood: nearby tourist attractions, transportation logistics, restaurant/shopping mall
recommendations, and house rules guests should abide by. They characterize the identity of
each unique product that cannot be transmitted via numerical variables or search ﬁlters (binary
indicators), letting sellers better diﬀerentiate from another.

This paper proposes to generate numerical variables from text data using processing techniques
that are widely accepted in online tourism/hospitality research: extracting keywords/phrases and
sentiment analysis on consumer reviews.1 In other words, the frequency of appearance of certain
words/phrases and the number of reviews that were classiﬁed as negative by a supervised machine
learning will be used as additional product attributes. Image processing is beyond the scope of this
paper, but represents another source of product information for consumers. Economists also used
text analysis in studying online marketplaces such as eBay Motors and real estate (Lewis (2011)
and Nowak and Smith (2017)).

2.1.3 Literature Review and My Contribution

This paper will be the ﬁrst to employ standard logit demand models to quantify the value of
information contents in NYC Airbnb market, with a newly constructed dataset of the actual NYC
Airbnb tourists’ rental unit choices between 2016 and 2017. It proposes a new set of empirical
toolsets for the pervasive estimation challenges for P2P online platform data:
too many choice
alternatives, high dimensional attribute space, and unstructured texts. Nested logit models for
consumer choice sets, dimension reduction using machine learning, and text processing for 130,000
seller texts and 850,000 review texts diﬀerentiate this paper from the previous empirical research for
Airbnb market using hedonic price regressions (Wang and Nicolau (2017), Chen and Xie (2017),
Teubner, Hawlitscheck, and Dann (2017), and Gibbs, Guttentag, Gretzel, Morton, and Goodwill
(2018)) and quasi-experimental approaches.
(Ert, Fleischer, and Margen (2016) and Edelman,
Luca, and Svirsky (2017))

1See Alaei, Becken, and Stantic (2017) for a comprehensive review of methods and literature up to date.

39

My methodological contributions also extend to previous literature in economics and market-
ing/management on the role of reputation/feedback systems and voluntary disclosures in reducing
information asymmetry. Economists found that eBay’s ’Superseller’, seller rating scores, and
text/photo descriptions have causal relationships with transaction prices (Resnick, Zeckhauser,
Swanson, and Lockwood (2006), Houser and Wooders (2006), Jin and Kato (2006), and Lewis
(2011)). Marketing/management researchers investigated the role of online reviews in business
outcomes in various markets for books, movies, music, retail, and etc (Chavalier and Mayzlin
(2006), Liu (2006), Dellarocas, Zhang, and Awad (2007), Duan, Gu, Whinston (2008), Chinta-
gunta, Gopinath, and Venkataraman (2010), Dhar and Chang (2009), Floyd et al. (2014), Cui, Lui,
and Guo (2012), Ghose and Ipeirotis (2011)).

This paper follows closely Lewis and Zervas (2016), wherein the authors study the welfare
impacts of consumer review ratings for U.S. hotel industry. The authors estimated a series of
logit demand models using a proprietary 10 year monthly panel data from Smith Travel Research
containing 5,944 hotels in Arizona, California, Nevada, Oregon, and Washington (45% of all hotels).
They augmented the dataset with a panel of consumer reviews from three online travel review
platforms: TripAdvisor, Expedia, and Hotels.com each of which containing 807,140, 1,410,488,
and 1,544,883 review ratings. The welfare implications (compensating variations) of the ratings
in a counterfactual scenario of complete absence of review ratings vary over how counterfactual
prices were calculated. The aggregate consumer surplus falls about $123 million without price
adjustments, $107 million with the conventional nash equilibrium prices, and $546 million in the
case of reduced form price changes.

The methodological contributions of this paper compared to Lewis and Zervas (2016) are clear.
First, I focus on a regional market (NYC) and reﬂect the actual consumers’ choice set formation
principles into the modeling approach. Though they included market-year-monthly ﬁxed eﬀects
in the utility speciﬁcations, the fact that their dataset covers a wide range of locational segments
make the analysis focused more on the hotel market as a whole, and less on correctly describing
consumers’ decision making processes leaving concerns of bias.

40

Secondly, I employ machine learning and high dimensional econometrics to handle high di-
mensional attribute space. Though machine learning could provide an eﬃcient way of dimension
reduction, econometric modeling and identiﬁcation techniques for endogeneity control turn out to
be essential for successful identiﬁcation. Plus, I show that product information stored in text format
is worth being incorporated into empirical analyses.

A couple of clear shortcomings of this paper compared to Lewis and Zervas (2016) and key
demand literature in industrial organization (BLP (1995), Nevo (2001), and Petrin (2002)) include
ﬁrst, that I do not estimate supply side moments. One excuse for this would be that each of
Airbnb vacation rentals is a unique individual housing unit, which makes it hard to justify a simple
Nash equilibrium marginal costs modeling. Another shortcoming is not to incorporate random
coeﬃcients into the nested logit models, mainly due to the extremely small market shares of a
single rental property and the resulting numerical instability sensitive to pre-set initial values for
the iterative BLP estimation routine.

Instead, I used hedonic price adjustments in producing counterfactual prices following Haus-
mann and Leonard (2002). The resulting compensating variations for information contents from
the demand estimates support this paper’s hypothesis that the platform’s quality certiﬁcations and
consumer reviews (ex-post veriﬁed) show greater welfare impacts than non-veriﬁed seller side
advertisement texts.

The rest of this paper is organized as follows. Section 2.2 introduces NYC accommodation
market, the dataset, and processing details. Section 2.3 discusses actual NYC Airbnb guests’
web searching behaviors in more detail, presents the demand model, and estimation methods.
Section 2.4 reports the demand parameter estimates, price elasticities, and welfare measures for
key reputation/feedback and disclosure devices. Section 2.5 concludes.

41

2.2 Data

2.2.1 NYC Accommodation Market

2.2.1.1 Market Deﬁnition and Size

This paper deﬁnes the market size of NYC accommodation market as the total potential number of
accommodation reservations. Table 2.1 lists the annual number of visitors to NYC and since 2016,
more than 60 million tourists came to NYC with about 20% of international arrivals.2 Assuming
each of them reserves an accommodation facility, the total potential annual number of reservations
can be obtained by dividing the number of annual NYC visitors by the average length of stays.
Based on the actual booking records obtained from both Expedia.com and Airbnb rentals (Table
2.2), I roughly assume that the average length of stays is four days.3

The rationale behind such a comprehensive market deﬁnition comes from the fact that Airbnb is
indeed taking market shares in various accommodation segments, not just competing with hotels.
A survey on 4,000 potential Airbnb tourists found that4: (1) As of November, 2015 the market share
of Airbnb.com is occupying 12% of leisure and business travelers and projected to reach 16-18% in
2016; (2) 42% of Airbnb demand is coming at the expense of hotels. Moreover, it is also replacing
other non-traditional accommodations such as bed and breakfast inns, vacation rentals and stays
with friends and family. The last segment makes up around 60% of overnight accommodations,
and is thus larger than hotels. Another report by an accommodation market research company says
that the room nights share of Airbnb amounts to 8% compared only to hotels as of August, 2015.5

Deﬁning the market size with the maximum purchasing capacity is a fairly conventional ap-
proach taken by key demand literature in empirical industrial organization; Berry, Levinsohn, and

2The visitor poll is from NYC & Company.
3Expedia.com launched a prediction contest in 2016. The task was to predict top ﬁve hotel recommendations
based on the distributed data. The locations of hotels were anonymized but a participant decoded the regional codes
by the distances between users and destination hotels. The contest operator conﬁrmed the leak. The sales records for
NYC Airbnb rentals were purchased from Airdna.

4Who Will Airbnb Hurt More - Hotels or OTAs (Online Travel Agency)? - JP Morgan Global Insight
5Airbnb and Impacts on the New York City Lodging Market and Economy - Hospitality Valuation Services

42

Pakes (1995) deﬁned the total annual market size for automobile as the number of households in
each year. Nevo (2001) used the total potential number of servings in a city per quarter as the
quarterly market size for ready to eat cereals.

Table 2.1: NYC Visitors and Potential Reservations for Travel Accommodations

Year
2015

(Jan - Jun)
(Jul - Dec)

2016

(Jan - Jun)
(Jul - Dec)

2017

Total

58,500,000
23,400,000
35,100,000
60,300,000
24,120,000
36,180,000
61,800,000

Domestic
46,200,000
18,480,000
27,720,000
47,600,000
19,040,000
28,560,000
48,700,000

International
12,300,000
4,920,000
7,380,000
12,650,000
5,060,000
7,590,000
13,100,000

Potential Reservations

Record

14,625,000
5,850,000
8,775,000
15,075,000
6,030,000
9,045,000
15,450,000

Actual

Projected

Table 2.2: Actual Booking Data Summary for NYC Market for Hotels and Airbnb

Source
Sample
Coverage

Average Length
S.D. of Length
Reservations
Sale Periods
Length of Stay

1
2
3
4
5
6
. . .
28

Expedia.com

Random Sample
All Segments

3

2.106
96,262

201301 - 201412

Airbnb.com
All Sales

All Segments

4.984
5.017

1,987,362

201408 - 201704

Frequency

26,199
21,255
18,386
12,751
7,212
3,988

...
11

%

27.22
22.08
19.10
13.25
7.49
4.14
...
0.01

Cum. % Frequency
270,076
27.22
49.30
363,177
331,128
68.40
260,940
81.64
179,106
89.13
93.28
126,477

...
100

...

6,025

%

13.59
18.27
16.66
13.13
9.01
6.36
...
0.30

Cum. %
13.59
31.86
48.53
61.66
70.67
77.03

...

98.56

2.2.1.2 Purchase Units and Market Share of a Product

Following recreational demand literature, this paper takes the number of short-term vacation rental
reservations (in other words, the number of trips a recreation site received from visitors) as the
number of sales an Airbnb rental unit recorded. By ’short-term’, I mean reservations with lengths
of stays up to four days, keeping in line with the market deﬁnition. Using the number of reservations

43

instead of nights sold will reduce the risk of inﬂating the market shares of cheap rentals operating
many rooms or hostel type Airbnb listings and long-term extended stays for speciﬁc purposes. The
unit price is deﬁned to be the rental price per night plus cleaning fees, which is the actual transaction
price consumers pay.

Frequency of leisure and business travels a person takes in a year is also an important factor
for both decision modeling and how many cross sections (time periods) the demand model should
include. Without individual level choice data, I rely on a previous market research. According
to AARP (American Association of Retired Persons) 2016 and 2017 travel research, Americans
across all generations take on average 3.5 domestic leisure trips. I assume a person on average
consider visiting NYC Airbnb rentals three times a year.

2.2.2 Summary Statistics

2.2.2.1 Rating Score Inﬂation

Table 2.4 presents the summary statistics from the dataset used for demand models. Prices and
rental unit attributes are from InsideAirbnb.com, a public repository of Airbnb data. Sales records
were purchased from Airdna, a data consulting branch ﬁrm of Airbnb. The dataset consists of four
cross sections: April, 2016, August, 2016, December, 2016, and April, 2017. Rental units without
any reviews or sales records were dropped from the analysis. Also, rental units with nightly prices
greater than $5,000 were excluded.

One important issue with the data is review rating score inﬂation. There are seven categories of
review ratings for Airbnb rentals, and Table 2.3 presents the questions Airbnb asks for consumers
during the rating process. As can be seen in Table 2.4, rating scores over all categories are near
perfection. Rating inﬂation seems exacerbated in Airbnb particularly compared to other vacation
rental portals. Zervas, Proserpio, and Byers (2015) compares review scores of two rental platforms:
Airbnb and TripAdvisor. Not only are the average ratings higher for Airbnb listings than TripAdvisor
(4.7/5 stars versus 3.8/5 stars), but it is also the case for cross-listed rentals. (0.1/5 stars diﬀerences

44

in mean) The authors conjecture that this phenomenon is from reciprocity or fears of retaliation
due to the bilateral review policy and strategic manipulation by sellers.

Also, there seems to exist strong collinear relationships among review score categories (see
Appendix C.4). To avoid multicollinearity, I use an average of the six rating score categories (2 -
10 scale, ’Ratings Average’) rather than overall rating (20 - 100 scale), which was more severely
inﬂated. Due to the time-cumulative nature of review scores, there is a risk of sellers’ strategic
manipulations; rating scores of a rental unit with a small number of extremely positive reviews
cannot be trusted. Hence I drop rental unit observations that both have a ’Ratings Average’ greater
than 9.99999 and a total number of reservations received during the time periods of our empirical
analysis (201604 - 201704) less than 5.

’Superhost’ designation implies a rigorous quality certiﬁcation that cumulates past reputa-
tion/feedback performances. A host must have accommodated more than ten parties of guests,
maintained a 90% response rate to booking requests or higher, received a ﬁve star review, i.e.,
review scores higher than 80 out of 100 - at least 80% of the time, and completed each conﬁrmed
reservations without canceling. ’Veriﬁcation Accounts’ means the number of contact methods a
host maintains including emails, phones, and social media network accounts. It implies that hosts
with multiple ’Veriﬁcation Accounts’ have gone through the government issued ID check and up-
loaded self-introduction with photographs. They represent that the basic contract enforceability by
Airbnb is active, which is the key prerequisite for reputation/feedback systems to work. (Milgrom,
North, and Weingast (1990))

Table 2.3: Deﬁnitions for Review Score Categories
Questions Asked in Reviewing Process

Overall experience

Category

Overall Rating

Did the cleanliness match your expectations of the space?

How smooth was the check-in process, within control of the host?

How accurately did the photos and description represent the actual space?

Accuracy
Cleanliness
Check-in/out
Communication How responsive and accessible was the host before and during your stay?
How appealing is the neighborhood (safety, convenience, desirability)?

Location
Value

How would you rate the value of the listing?

45

Table 2.4: Summary Statistics

All Sample

Superhost Normal Host

Mean

171.8356
27.4482
11.2973

S.D.

118.7063
24.0788
10.2259

Mean

187.3653
36.6391
14.7591

Mean

169.8143
26.2519
10.8468

Min Max
30
4770
168
1
1
88

0.1152
4.3243

1

0.9975

4.4341

0

4.3100

92.0813
9.4650
9.1121
9.6555
9.7074
9.3123
9.2172

7.4158
0.7756
1.0109
0.6690
0.6290
0.8182
0.7966

96.8795
9.8828
9.7431
9.9498
9.9701
9.5590
9.6784

91.4568
9.4107
9.0300
9.6172
9.6732
9.2802
9.1571

26.8576
3.5813

33.2056
4.9652

41.4124
5.9722

24.9632
3.2701

2.8890
3.5993

1.6858
2.3112

3.2350
4.0855

2.7090
1.0826
1.4769
0.1793

0.2921
0.5286
0.9570
0.6499

1.2622
0.3159
0.8881
0.3836

0.4547
0.4992
0.2028
0.4770

2.7544
1.0786
1.5799
0.1607

0.3935
0.6730
0.9805
0.7929

2.8440
3.5360

2.7031
1.0831
1.4635
0.1817

0.2789
0.5098
0.9540
0.6313

0
1

20
2
2
2
2
2
2

1
0

0
0

1
0
0
0

0
0
0
0

1
12

100
10
10
10
10
10
10

380
67

10
15

6
5
14
1

1
1
1
1

0.5099
0.0279

0.4999
0.1648

0.5169
0.0197

0.5089
0.0290

0
0

1
1

obs: 62,673

Price per Night ($)

Nights Sold
Reservations

Quality Certifications

Superhost Indicator
Veriﬁcation Accounts

Review Rating
Overall Rating

Accuracy
Cleanliness
Check-in/out
Communication

Location
Value

Review Texts

Number of Reviews
Negative Reviews

Seller Texts

Positive Adjectives
Location Words

Accommodation Capacities

Default Guests

Bathrooms

Additional Guests
Instant Bookable

Amenity and Service

24 Hour Check-in

Hangers
Heating
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

* Variables were selected by a lasso variant (Belloni and Chernozhukov (2013)) designed for a successful
asymptotic approximation to the objective ln(sjt/sot) with only a subset of all 180 variables. Subsection
2.3.3 for high dimensional metrics introduces the selection principles and inference for the post selection
parameter estimates.

46

Seller texts include rental unit titles, sub-titles, descriptions of various aspects of the rentals,
such as neighborhoods, transportation, pros and cons, and etc. The selected accommodation
capacities conform to the empirical studies on price determinants of hotels and Airbnb.6 Amenity
and service features in Table 2.4 were in fact, cross-selected by other popular machine learning
methods other than Belloni and Chernozhukov (2013)’s lasso.7

2.2.2.2 Text Processing

This paper employs n-gram word/phrase extraction (bag of words) and sentiment analysis (clas-
siﬁcation) using a supervised machine learning to process seller and buyer texts. N-gram bag of
words means extracting words/phrases purely according to the frequency of occurrences and use
them as regressors. Selected features are often reduced and categorized at a researcher’s discretion.
For seller texts, 36 words/phrases including up to four words (’Quadrigram’) out of the 3,000 most
frequently appearing ones were selected among 120,000 advertisement texts. The counts for each
rental were summed over the regarding two categories: ’Positive Adjectives’ and ’Location Words’.

Table 2.5: Selected Words/Phrases from Airbnb Hosts’ Advertisement Texts

Positive Adjectives
Amazing, Beautiful,

Cozy, Friendly, Spacious, ...

Category
Unigram

Bigram

Trigram

Quadrigram

Location Words

Broadway, Manhattan, SoHo

Brooklyn, Chelsea, ...

Central Park, Columbia University,
Hell’s Kitchen, Brooklyn Bridge, ...
Empire State Building, The G train,

Major subway lines, ...

Metropolitan Museum of Art
Museum of Natural History, ...

Supervised machine learning means ﬁtting a function that maps an input to an output based on an
example input-output pair dataset. Sentiment classiﬁcation for review texts includes the following
steps. A researcher conducts a pre-processing such as removing non-alphabetical components,
e.g., arabic numbers, commas, and etc, trimming white spaces, and converting to lower case letters.

6See Wang and Nicolau (2017) for a comprehensive review up to date.
7For Belloni and Chernozhukov (2013), see Subsection 2.3.3 and for machine learning, Appendix A.1 and A.2.

47

A classiﬁcation machine is then trained on sample reviews, with emotional polarity as outputs
and words/phrases as inputs. The choice on words/phrases could either rely on pre-established
dictionaries (lexicons) or n-gram words/phrases from sample reviews whichever yields the best
in-sample prediction rates. The trained machine is then scaled up on the whole review corpus.

This paper trains a classiﬁcation machine on a set of 1,000 sample reviews collected from
four major U.S. cities other than NYC: Ashevill (NC), Austin (TX), Denver (CO), and Washington
D.C. Using 3,500 n-gram bag of words/phrases, multiple supervised machine learning models
were constructed and the highest in-sample prediction rate (87%) was achieved with classiﬁcation
tree in ’Caret’ R package over Naive Bayes and Support Vector Machine. The classiﬁcation was
then applied to the whole 850,000 NYC Airbnb review texts. Table 2.6 and Table 2.7 provide
conceptual examples. One caution with n-gram dictionaries is that they could contain indicators
for expressions of no particular meaning (e.g., were not, was indeed) or opposite meaning (’dirty’
for positive reviews) for pure prediction performances.

Table 2.6: Example Guest Reviews and Sentiment Labels

Reviews

Ex.1

(Negative)

Ex. 2

(Nonnegative:

Neutral)

Ex. 3

(Nonnegative:

Positive)

Raw Texts

This is a dirty frat house.

No locks other than main building door.

Dirty toilets. No host present. Rotting food in the fridge.

My room at the BPS Hostel was clean and cool.

The staﬀ and fellow guests were friendly and helpful.

The location is very convenient for local eateries, coﬀee shops, pubs and deli’s.

However, I do not feel it was good value for money at $72 per day.

There was no room service,

I shared a bathroom with upto 8 others and the breakfast was weak.

Great location just outside of downtown Asheville.
I stayed here with three other people. Plenty of space.

Mike was very easy to work with, and made sure we had everything we needed.

Table 2.7: Bag of Words Matrix for Example Guest Reviews
plenty

breakfast

however

clean

dirty

great

helpful

Label
Ex.1
Ex.2
Ex.3

...

0
1
0

0
1
0

cool
0
1
0

rot
1
0
0

...

2
0
0

0
0
1

0
1
0

0
1
0

0
0
1

48

2.3 Model

2.3.1 Potential Guests’ Rental Searching Behavior

The dataset used in this paper does not include any individual demographics. The market share
of an individual product is extremely small, because for each time periods, there are about 12,000
- 15,000 rental units in operation even after multiple truncation processes. Given such product
heterogeneity, it is elusive to ﬁnd a homogeneous product groups among which the demand analysis
could ﬁnd a targeted insight on substitution patterns, like automobile markets (BLP (1995) and
Petrin (2002)) and retail applications (Nevo (2001)).

However, setting up a utility function with a proper description of consumers’ choice set
formation could suﬃce to answer the research question with aggregate level data: evaluating welfare
implications of information contents from realized purchase decisions. I propose to employ a three
level nesting structure based on Airbnb’s hierarchial service neighborhood designations and, use
’Room Type’ ﬁlters as another set of observed attributes, which were found to be the most popular
tools for reducing choice sets during web search processes of actual NYC Airbnb guests.

This idea is from Fradkin (2017), who investigates the impacts of search and matching perfor-
mance of Airbnb platform designs with a detailed consumer web search log data for a major U.S.
city between September 2013 and September 2014.8 A consumer’s search is fairly limited in that
he/she only sees about 4 to 5% out of over a thousand rental units popping up after the initial search
command. During initial search steps, a consumer typically sets the number of guests, which is
included in the utility model.

Though limited, a consumer puts a signiﬁcant amount of eﬀort and time, checking out 88 rental
units during 58 minutes on average. Among web searchers who sent a reservation request, more
than 64% of them changed the default map location and 50% used the zoom-in/out function to
further reduce the choice sets. Figure 2.4 contains an actual Google Maps API example from NYC
8The name of the city was anonymized but he says it is the ﬁrst city Airbnb made a success, which is highly likely
to be NYC. The platform is known to be taking oﬀ in NYC after the ﬁrst signiﬁcant capital investment from Sequoia
capital and changing the company’s name from Air Bed & Breakfast to Airbnb.

49

rental unit search on Airbnb.com. It seems that consumers ﬁrst choose a relatively greater region
on the default scale map, and then zoom-in to further narrow down on a neighborhood, wherein
he/she picks a rental unit (a three step choice).

Figure 2.4: Neighborhood Designation Example: ’Midtown’ in Manhattan

Airbnb divides NYC area into ﬁve big regions: Bronx, Brooklyn, Manhattan, Queens, and
Staten Island. Each region is again divided into neighborhoods. For example, ’Midtown’ or
’Harlem’ in Manhattan. Each region contains from 32 to 53 neighborhoods, and the number of
listings contained in a neighborhood varies from 2 to more than 5,500. Neighborhood designations
are mutually exclusive for all samples and observations, which suggests a hierarchial three level
nesting structure (for a comprehensive list, see Appendix C.3).

The ﬁrst level nesting is based on the ﬁve big region indicators, and it is for capturing ’changing
default map location’ behaviors. The second level nesting is based on the neighborhood dummies,
and it is for capturing zoom-in/out behaviors of consumer search. For the third or rental unit level,
I included indicators for ’Entire Home/Apt’ and ’Shared Room’ as attributes based on the ﬁnding
that 70% of web searchers applied the ’Room Type’ ﬁlter.

50

2.3.2 Nested Multinomial Logit (NMNL) Model

Berry (1994) provides a transformation to estimate a (two level) nested logit model with aggregate
level data. Three level nesting structure is an extension by Verboven (1996) and has been adopted
in various applications on markets for drugs, automobiles, and agricultural products (Bjornerstedt
and Verboven (2016), Brenkers and Verboven (2010), and Ciliberto, Moschini, and Perry (2017)).
I estimate four models. A simple OLS, IV, and two and three level nested logit models. This
is done to show stepwise improvements in utility parameter estimates over the model changes to
resolve identiﬁcation issues of price endogeneity and choice set formation. The nesting structures
are expected to provide a more accurate modeling for consumers’ choice set formation, reducing
possible sources of biases due to unobserved variables or decision making principles. The utility
function for a Berry (1994) style IV logit model consists of a mean utility term δjt and an
idiosyncratic Type I extreme value error i jt;

ui jt = δjt + i jt

= xjt β − αpjt + ξjt + i jt

(2.1)

where xjt is the attributes vector, pjt is the per night rental price, and ξjt captures the unobservables
of rental unit j at time period t. Nested logit models impose additional structures on i jt for each
consumer i;

ui jt = xjt β − αpjt + ξjt + (ζigt + (1 − σ)i jt)
ui jt = xjt β − αpjt + ξjt + (ζigt + (1 − σ2)ihgt + (1 − σ1)i jt)

(2.2)

(2.3)

where equation (2.2) and (2.3) represent the utility functions for two and three level nested logit
models, respectively. ζigt captures the impact of nesting ’groups’ or in our case the big ﬁve regions:
Bronx, Brookyln, Manhattan, Queens, and Staten Island (g = 1, ..., G).

The nesting parameter 0 ≤ σ < 1 (for three level nested logit, σ2) captures how strong the
substitution within each group is. For example, if an estimate of σ(σ2) is positive and signiﬁcant,
then a tourist is likely to choose rental units in the same region like Bronx, but not in a diﬀerent

51

region, like Brooklyn. On top of the big regions, the three level nested logit model captures a
stronger correlated preferences for units in a neighborhood (h = 1, ..., Hg) of a group g with

parameter σ1. The total number of products J is thenG

Hg

h=1 1(j∈h).

g=1

ζigt is common to all products in group g for consumer i and follows a distribution that depends
on σ for two level, or σ1 and σ2 for three level nested logit model. Cardell (1997) shows that
then ζigt follows a distribution with (ζigt + (1 − σ)i jt) or (ζigt + (1 − σ2)ihgt + (1 − σ1)i jt) also
following extreme value distribution. As the values of nesting parameters approach to zero, i.e.,
σ(σ1, σ2) → 0, the within group correlation goes to zero and hence the model becomes a simple
logit model with a Type I extreme value error. As σ(σ1, σ2) → 1, the within group correlation
goes to one.

sOLS(IV)

jt

=

sN L2
jt

=

sN L3
jt

=

J
exp(δjt)
k=0 exp(δkt)
k∈g exp[δkt/(1 − σ)] ∗
exp[δjt/(1 − σ)]
exp[Ihg/(1 − σ1)] ∗ exp[Ihg/(1 − σ2)]
exp[δjt/(1 − σ1)]

(cid:16)k∈g exp[δk/(1 − σ)](cid:17)1−σ
(cid:16)k∈g exp[δk/(1 − σ)](cid:17)1−σ
1 +G
exp[Ig/(1 − σ2)] ∗ exp(Ig)
exp(I)

g=1

jt

, sN L2

, and sN L3

sOLS(IV)
are the resulting stepwise choice probabilities or market shares of a
jt
rental unit j for a simple logit, two level nested logit, and three level nested logit models. The
inclusive values Ihg, Ig, and I for three level nested logit models are deﬁned by:

jt

(2.4)

(2.5)

(2.6)

(2.7)

Ihg = (1 − σ1) ∗ ln

exp[δkt/(1 − σ1)]

Ig = (1 − σ2) ∗ ln

exp[Ihg/(1 − σ2)]

I = ln(cid:169)(cid:173)(cid:171)1 +

G

g=1

k=1

Jhg
Hg
exp(Ig)(cid:170)(cid:174)(cid:172)

h=1

52

McFadden (1978) gives the condition for nesting parameters to be consistent with the utility
theory: 0 ≤ σ2 ≤ σ1 < 1 which comes natural in that the correlation of preferences is stronger

for rental property choices on a neighborhood level (σ1) than neighborhood choices out of a big
locational segment (σ2). The inverted aggregate level estimating equations based on (2.4), (2.5),
and (2.6) were provided by Berry (1994) and Verboven (1996):

ln(sjt/sot) = xjt β − αpjt + ξjt
ln(sjt/s0t) = xjt β − αpjt + σln(sj|gt) + ξjt
ln(sjt/s0t) = xjt β − αpjt + σ1ln(sj|hgt) + σ2ln(sh|gt) + ξjt

(2.8)

(2.9)

(2.10)

where s0t is the outside market share at time period t, sj|gt is the market share of rental unit j in
region g = 1, ..., 5, sj|hgt is j’s share in neighborhood h in region g, and ﬁnally, sh|gt is the share
of all units in neighborhood h in region g.

The idea of aggregate level estimating equations (2.8), (2.9), and (2.10) for identifying utility
parameters is similar to regressing ASC (Alternative Speciﬁc Constants) on observable attributes
in recreational demand literature (Murdock (2006)). Also, though nested logit models partially
alleviate the pervasive IIA problem, with individual level data a practitioner can estimate more
comprehensive substitution patterns across recreation sites with mixed logit models and consider
nested logit as a special case (Herriges and Phaneuf (2002)).

2.3.3 High-Dimensional Attributes and Machine Learning

2.3.3.1 Lasso Selector and Oracle Property

Candidates for attributes in xjt include information contents, accommodation capacities, and 150
binary indicators for amenity and service features. Such a high dimensional characteristic space
with many binary indicators originating from extreme product heterogeneity poses a threat of
multicollinearity and irrelevant variables. This paper hence assumes sparsity, which is frequently
introduced in high dimensional metrics. Sparsity assumption is that given a p-dimensional vector
[xjt, pjt] ∈ Rp, there exist s = o(n) (cid:28) p variables that asymptotically capture most of the impacts
of all p regressors onto the objective ln(sjt/sot).

53

A practical implication of sparsity for general linear regression models with Gaussian or
heteroskedastic errors ( ∼ N(0, σ)) is that an econometrician ﬁrst chooses a set of s variables
(the observed model ˆM) that aﬀects ln(sjt/sot) the most by lasso and then do OLS only with the
selected variables. Such OLS post lasso, with theoretically suggested conditioning parameters for
the ﬁrst step lasso selector achieves a ’successful’ asymptotic approximation to the ’true’ ln(sjt/s0t)
objective function (Belloni and Chernozhukov (2013), Chernozhukov, Hansen, and Spindler (2015),
Belloni, Chernozhukov, and Wang (2014)).

Speciﬁcally, the risk minimization problem of balancing bias and variance for demand estima-

tion with sparsity can be stated as the following (Belloni and Chernozhukov (2013)).

min c2

s + σ2 s
n
c2
s = min

dim(β,α)≤s

E[(ln(sjt/s0t) − xjt β + αpjt)2]

(2.11)

is the upper bound of the risk for the best market share estimator using only s (cid:28) p
s + σ2 s
c2
n
covariates. This ’oracle risk’ is achieved if the ﬁrst stage lasso chose the correct s variables which
by sparsity assumption captures the most of the impacts of all p regressors. Then the resulting

’oracle rate’ of error convergence rate is given by(cid:112)s/n.
s covariates, post selection OLS estimator still achieves the ’near oracle rate’ of(cid:112)s ∗ log(p)/n.
jt β ˆM + αpjt)2] = Op(cs + σ(cid:112)s ∗ log(p)/n), where x ˆM and

In other words,
ˆM represent the vector of attributes chosen by lasso (the observed selected model ˆM) and the
β
corresponding post selection OLS coeﬃcients.

One important appeal of OLS post lasso is that even if lasso selector gives only a subset of

E[(log(sjt/s0t) − x ˆM

(cid:114)

A lasso selection to get ˆM (including price pjt) means choosing variables of non-zero coeﬃ-

cients in solving the following penalized regression problem. Letting β(cid:48) = [β, α],

ˆβ(cid:48) = argminβ(cid:48)∈Rp ˆQ(β
(cid:48)) =

(cid:48)||1
(ln(sjt/sot) − xjt β + αpjt)2

|| ˆΨ β

(cid:48)) +

λ
n

ˆQ(β


n

1
n

(2.12)

54

where || β(cid:48)||1 =p−1

l=1

| βl| + |α| and ˆΨ = diag( ˆψ1 , ..., ˆψp). The penalty loadings ˆΨ and penalty

level λ for post OLS oracle rate in the heteroskedastic case are;

(cid:115)


1
n
√

n
nΦ−1(1 − γ/(2p))

λ = 2c

ˆψk =

(x2
i )
ik ˆ 2

(2.13)

where Φ denotes the cumulative standard normal distribution and ˆ is an empirical estimate of errors
(residuals). The suggested preset values for c and γ is 1.1 and 0.1. ˆΨ and λ for homoskedastic
errors result in similar variable selection results.

The attributes in the summary statistics (Table 2.4) were in fact chosen by this process using
R package hdm. The observed model ˆM is stable over a range of c from 0.9 and 1.3 with
0.05 increments. To check the validity of selection results, four other data-driven machine learning
models were estimated: lasso, ridge, elastic net, and gradient boosting. They focus on the prediction
accuracy (reducing RMSE) rather than ’oracle rate’. Again, all attributes in Table 2.4 were
unanimously chosen by all and hence included in xjt.9

2.3.3.2 Post Selection Inference

However, if one proceeds to OLS with the selected (observed) model ˆM by lasso, there are two
possible pitfalls. First is that classical inferences (conﬁdence intervals and p-values) on ˆβ
ˆM are no
longer valid. It is because of the non-selected (omitted) variables, making the post selection OLS
only with the attributes in X ˆM biased. Though the asymptotic distribution of lasso coeﬃcients for
our case of n (cid:29) p is available (Fu and Knight (2000)), an exact post selection inference for OLS
post lasso is the primary target of interest.

Such ’post selection inference’ after variable selection with machine learning is a relatively new
and still developing area. This paper follows Lee, L. Sun, Sun, and Taylor (2016) which provides
an exact distribution of post selection OLS estimates and hence, exact conﬁdence intervals (C.Is),
p-values, and tail areas. The idea is that given a response y ∼ N(µ, σ2In), the model selection

9See Appendix A.1 and A.2 for details on the methodologies for data-driven machine learning.

55

event { ˆM = M} by lasso can be expressed as a form of polyhedron {Ay ≤ b} which once again
can be transformed into an interval with low and upper endpoints being functions of residuals zj
of y in the direction of xj, {ν−(z) ≤ y ≤ ν+(z)}. Due to the independence between y and zj, the
(a linear transformation of y) from OLS conditional on
distribution of an individual coeﬃcient ˆβ
ˆM
j
the lasso selection is a truncated normal.

One advantageous fact about Lee et al. (2016) is that a practitioner can produce exact C.Is and
p-values with a ﬁxed penalty parameter λ(cid:48). Speciﬁcally, the lasso formulation for the exact post
selection inference is the original data-driven lasso by Tibshirani (1996).

ˆβ = argminβ∈Rp ˆQ(β) + λ

(cid:48)|| β||1

(2.14)

Therefore, with a range of values of λ(cid:48) that produces the same model ˆM including variables of
non-zero coeﬃcients from the penalized regression problem in equation (2.12), a practitioner can
produce exact inference for OLS post lasso, still achieving cs asymptotically.

2.3.3.3 Cautions on Endogeneity and Post Selection Estimator

The second concern with OLS post lasso is endogeneity. In fact, endogeneity lurks under both lasso
selection and the subsequent demand estimation with the chosen model. Lasso selector (equation
(2.12) and (2.14)) uses Gaussian or at best, heteroskedastic errors implicitly assuming there is no
endogeneity due to omitted/unobserved variables. The post selection estimating equations (2.8),
(2.9), and (2.10) under OLS structure, could suﬀer endogeneity in prices and group shares for
nesting structures (sj|gt, sj|hgt, and sh|gt).

High dimensional econometricians have provided post selection IV regressions after a variable
selection on both many controls and instruments with a small number of key endogenous variables
such as treatment/policy indicators or prices (Chernozhukov, Hansen, and Spindler (2015) and
Chernozhukov et al. (2018)). But still, to the best of my knowledge, a variable selection approach
under the presence of unobserved variables coupled with post selection estimation and inference
has not been established well. It is understandable in that unobservables are not in the dataset,

56

and we resort to the magnitudes of in-sample prediction errors such as RMSE to choose the ’right’
subset of all covariates.

The same concern arises with the conventional alternatives such as principal component analy-
sis (PCA), Akaike information criterion (AIC), or Bayesian information criterion (BIC), in addition
being practically infeasible with hundreds of variables to consider. PCA coeﬃcients are linear
combinations of covariates which makes it impossible to isolate and identify coeﬃcients for indi-
vidual variables, and the stepwise nature of AIC and BIC dictates too high calculation costs for
estimation and comparison to incur given 150 binary indicators.

This paper does not attempt to provide an analytic methodology for the ﬁrst step model selection
by lasso under the presence of unobservables, but shows that the variable selection results vary
when the dataset include variables of possible sources of endogeneity. Speciﬁcally, lasso selection
was conducted on four diﬀerent datasets for each estimation methods: simple OLS logit, IV logit
(with instrumented price ˆpjt, equation (2.15)), two level, and three level nested logit models (also
with ˆpjt, equation (2.16) and (2.17)).

ln(sjt/sot) = xjt β − α ˆpjt + ξjt
ln(sjt/s0t) = xjt β − α ˆpjt + σ
ln(sjt/s0t) = xjt β − α ˆpjt + σ1

ˆ

ln(sj|gt) + ξjt
ln(sj|hgt) + σ2

ˆ

(2.15)

(2.16)

(2.17)

ln(sh|gt) + ξjt

ˆ

ˆ

ˆ

ˆ

ln(sj|hgt) and

For OLS logit, the dataset for lasso selection contains all attributes except variables for nesting
ln(sj|gt) was added
structures. For IV logit, pjt was replaced with the instrumented price ˆpjt.
ln(sh|gt) for three level nested logit. Group shares
for two level nested logit, and
were instrumented due to endogeneity concerns proposed by Berry (1994). The selected model
ˆM diﬀers over datasets, gauging a suspicion on the instability of variable selection results due to
unobserved variables. ’Location Words’ was not selected in OLS logit case, and for three level
nested logit case, a few key parameters including σ2 for the precinct level correlated preferences
were not chosen.10
10Hence for three level nested logit, I proceed with xjt’s selected in IV and two level nested logit case.

57

Actual estimation with equation (2.15), (2.16), and (2.17) relies on the moment condition
E[ξjt|zjt], using the separable unobservables term ξjt and zjt including the selected observables
xjt and instruments, following Berry (1994) and BLP (1995). It is a simple two step least squares
with the ﬁrst stage regressions to produce ˆpjt and group shares. For instruments, I used variables
related to supply decisions: lagged base per night rental price (without ’Cleaning Fee’, recorded
one year before), starting date of an Airbnb host’s rental business, long term availabilities (30, 60,
90, and 365 days), and cancellation policy.11

Table 2.8 and 2.9 report the post selection estimation results. Exact C.Is and tail areas reﬂecting

additional uncertainty due to lasso were produced using Lee et al. (2016).12

2.4 Results

2.4.1 Parameter Estimates

The main interest of this paper lies on evaluating the value of information contents produced
by the platform, consumers, and sellers. Also, the own and cross price elasticities of the new
accommodation products could provide an insight on consumers’ substitution patterns. For such
purposes, it is important to check if the econometric (structural) modeling approach controls
endogeneity properly. Together with instrumenting prices and group shares for the key parameters
in calculating compensating variations and elasticities, the nesting structures were introduced to
target consumers’ realistic choice set formation which was expected to reduce biases in utility
parameter estimates.

Strong evidence of endogeneity with the simple OLS logit model can be found in the coeﬃcients
of ’Room Type’ indicators for ’Entire Home/Apt’ and ’Shared Room’ (Table 2.9). More than 50%
of the total NYC Airbnb rentals are ’Entire Home/Apt’ and they occupy more than 48% share
of total reservations and enjoy a signiﬁcant amount of price premium. Hence a highly negative

11See Appendix C.2 for a detailed discussion on instruments and ﬁrst stage regressions.
12See Appendix A.3 and C.1 for the methodology and comparison between OLS inference and Lee et al. (2016)

58

coeﬃcient for ’Entire Home/Apt’ indicator gauges a suspicion of endogeneity due to unobservables
or misspeciﬁcation in consumers’ decision making principles. Also, a positive and signiﬁcant
coeﬃcient for ’Shared Room’ looks strange given that it represents the lowest grade ’Room Type’
occupying only 2.8% of total listings in the dataset.

For IV logit model, even after instrumenting prices parameter estimates seem to be inﬂated
in overall scale compared to both OLS and nested logit models. This is not speciﬁc to the set of
instruments reported in the Appendix C.2 (ﬁrst stage regressions), but fairly stable over various
sets of instruments tried. It seems to indicate that there are unobserved variables that signiﬁcantly
aﬀect consumer choices.

The nesting parameters show relatively high coeﬃcients and statistical signiﬁcance; the Z-
scores for σ, σ1, and σ2 are 99.67, 83.15, and 4.33, respectively. It would be safe to say that the
imposed nesting structures were able to capture consumers’ preference for location, as observed
in web search log data. The condition 0 ≤ σ2 ≤ σ1 < 1 is satisﬁed, showing that the results are
consistent with the random utility theory (McFadden (1978)).

The nesting parameters capturing either precinct or neighborhood preferences alleviate the IIA
problem of the simple logit as will be shown in Subsection 2.4.2. The cross price elasticities using
σ, σ1, and σ2 for nested logit models show that a consumer’s substitution across rental units is
conﬁned within his/her geographical choice set in mind. To get a more comprehensive picture on
substitution patterns, mixed logit models with individual level choice data or BLP type random
coeﬃcients could be useful for future research.

Interpretation of individual coeﬃcients are quite straightforward with the standard formulas for
either (maximum) willingness to pay (WTP) or attribute elasticities. WTP for a unit increase in
attribute k is βk/α or, the coeﬃcient of a factor divided by the price coeﬃcient. For example, the
willingness to pay for one point increase in ’Ratings Average’ in the three level nested logit case is
= $20.7692. Attribute elasticities can be obtained by βk xjk(1 − sjt). Given that the market
share of a single rental unit j is extremely small, one could approximately use βk xjk. If a rental
unit j has a ’Ratings Average’ of 9, then the demand elasticity with respect to ’Ratings Average’

0.0810
0.0039

59

is about 0.0810 ∗ 9 = 0.7290. Table 2.10 presents WTP and demand elasticities (average) with
respect to each attributes listed in Table 2.8, namely the information variables of main interest.

Table 2.8: Demand Parameter Estimates (1): Price, Nesting, and Information

IV Logit

2 Level NL 3 Level NL††

Obs: 62,673
Objective

Price

OLS Logit

-0.0008***
(0.0000)

ln(sjt/s0t)

-0.0156***
(0.0002)

-0.0059***
(0.0002)

-0.0039***
(0.0003)

0.8125***
(0.0098)
0.2293***
(0.0529)

0.7505***
(0.0075)

Nesting Parameters

σ(σ1)

σ2

Quality Certifications

Superhost Indicator

Veriﬁcation Accounts

Consumer Review
Ratings Average

Number of Reviews

Negative Reviews

Seller Texts

0.2133***
(0.0119)
0.0479***
(0.0036)

0.3378***
(0.0115)
0.0842***
(0.0035)

0.0649***
(0.0110)
0.0663***
(0.0033)

0.0873***
(0.0112)
0.0535***
(0.0035)

0.1938***
(0.0067)
0.0171***
(0.0003)
-0.0381***
(0.0018)

0.3992***
(0.0069)
0.0161***
(0.0003)
-0.0386***
(0.0017)

0.1220***
(0.0070)
0.0045***
(0.0003)
-0.0118***
(0.0016)

0.0810***
(0.0081)
0.0036***
(0.0003)
-0.0091***
(0.0016)

-0.0125***
(0.0022)

Positive Adjectives
Location Words†

-0.0591***
(0.0023)
0.0100***
(0.0016)
***: 1% signiﬁcant, **: 5%, *: 10%, standard errors in parentheses
†: ’Location Words’ was not selected by lasso procedure on the dataset for OLS, of all attributes and pjt except for
group shares for nesting structures
†† : ’Location Words’ and ln(sh|gt) for σ2 were not selected by lasso on the dataset for three level nested logit, of all
ln(sh|gt). Hence I estimated three level nested logit model with variables
observable attributes, ˆpjt,
selected by lasso in IV and two level nested logit case.

-0.0353***
(0.0022)
0.0168***
(0.0015)

-0.0206***
(0.0026)
0.0064***
(0.0018)

ln(sj|hgt), and

ˆ

ˆ

60

Table 2.9: Demand Parameter Estimates (2): Amenity and Service Features

Obs: 62,673
Objective

OLS Logit

Accommodation Capacities
Default Guests

Bathrooms

Additional Guests

Instant Bookable

Amenity and Service
24 Hour Check-In

Hangers

Heating

Shampoo

(Room Type)

Entire Home/Apt

Shared Room

IV Logit

2 Level NL

ln(sjt/s0t)

0.3768***
(0.0052)
0.7227***
(0.0141)
0.1725***
(0.0051)
0.3327***
(0.0094)

0.1768***
(0.0086)
0.1415***
(0.0080)
0.2053***
(0.0174)
0.2292***
(0.0078)

0.1026***
(0.0055)
0.3114***
(0.0137)
0.0706***
(0.0048)
0.0123
(0.0093)

0.0566***
(0.0081)
0.0346***
(0.0075)
0.1046***
(0.0162)
0.0752***
(0.0074)

0.0971***
(0.0039)
0.0604***
(0.0118)
0.0041
(0.0048)
0.4807***
(0.0096)

0.1318***
(0.0090)
0.1437***
(0.0084)
0.0689***
(0.0181)
0.1424***
(0.0081)

-0.0837***
(0.0235)
0.2015***
(0.0223)

1.4138***
(0.0212)
-0.1068***
(0.0217)

0.7256***
(0.0209)
-0.0761***
(0.0201)

3 Level NL

0.0781***
(0.0060)
0.1886***
(0.0184)
0.0454***
(0.0055)
0.0398***
(0.0097)

0.0291***
(0.0086)
0.0278***
(0.0075)
0.0725***
(0.0165)
0.0413***
(0.0082)

0.4182***
(0.0373)
-0.0413**
(0.0204)

Constant

-13.4340***

-15.5229***

(0.0694)

(0.0688)

-5.4277***
(0.1198)

-5.9737***
(0.1317)

But WTP and attribute elasticities do not take into consideration the supply side responses
due to the unit changes in attributes. For instance, if ’Ratings Average’ decreases by one unit,
not only does a consumer’s WTP decreases, but also a seller’s price premium does due to the
reduction in reputation scores. Also, a practitioner should use nesting parameters σ, σ1, and σ2 in
calculating consumer surpluses (utility before and after a unit change in attributes) and the resulting
compensating variations to get more realistic welfare measures for variables of interest.

61

Table 2.10: WTP and Factor Elasticities for Information Variables

Quality Certifications

Superhost Indicator
Veriﬁcation Accounts

Consumer Review
Ratings Average

Number of Reviews
Negative Reviews

2 Level NL

3 Level NL

WTP ($) Elasticity WTP

Elasticity

10.9470
11.1819

0.0075
0.2866

22.5587
13.8191

0.0101
0.2312

20.5807
0.7633
-1.9845

1.1479
0.1215
-0.0421

20.9310
0.9209
-2.3552

0.7621
0.0957
-0.0327

Seller Texts
Positive Adjectives
Location Words†

-5.9596
2.8321

-0.1020
0.0604

-5.3379
1.6474

-0.0600
0.0229

Willingness to pay (WTP) was calculated using the formula βk/α. Factor elasticities (βk xjk(1 − sjt)) are the mean
values of all observations. The small magnitude of demand elasticity with respect to ’Superhost’ indicator is due to
the fact that only about 11% of total rental units were designated as ’Superhost’.

Overall, the nesting structures reduce the magnitudes of coeﬃcients and imply a realistic
impacts of each for consumers’ purchase decision making processes. Amenity and service features
chosen by multiple ML methods seem to show signiﬁcant impacts on purchase decisions both in
statistical and economic senses. However, demand parameters cannot, by themselves tell much
about substitution patterns and welfare implications ($ metric). To investigate this paper’s research
question of evaluating and comparing information contents on product quality, appropriate formulas
should be applied.

2.4.2 Elasticities and Welfare Measures
The own price elasticities for the simple IV logit is α(1 − sjt)pjt, and the formulas for nested logit
models are presented in equations (2.18).

N L2

N L3

∂qjt
∂pjt
∂qjt
∂pjt

∗ pjt
qjt
∗ pjt
qjt

= α(sjt − 1
1 − σ
= α(sjt −
1

1 − σ1

+

σ
1 − σ
+ (
1

sj|gt)pjt

(2.18)

1 − σ1

−

1

1 − σ2

)sj|hgt +

σ2
1 − σ2

sj|gt)pjt

62

It turns out that the demand for Airbnb rentals in NYC is quite elastic (3.5365 to 4.0817), which
is not a surprise given the product heterogeneity and severe competition in a densely populated
urban area. Cross price elasticities involve multiple cases due to the nesting structures. First
possibility is that product j and k are in the same big region (for two level nested logit) or in the
same neighborhood (for three level nested logit). The formulas for this case (Case 1 in Table 2.11)
are as follows.

N L2

N L3

∂qjt
∂pkt
∂qjt
∂pkt

∗ pkt
qjt
∗ pkt
qjt

= α(sjt +
= α(sjt + (

σ
1 − σ
1

sj|gt)pjt
−

1

1 − σ1

1 − σ2

(2.19)

)sj|hgt +

σ2
1 − σ2

sj|gt)pjt

The cross price elasticities between substitutes j and k are negligible for products in the same big
region for two level nested logit model. It is because of the fact that the two big regions ’Brooklyn’
and ’Manhattan’ occupies nearly 40% and 50% of total rental units in the data respectively, making
the precinct eﬀect σ
sj|gt minuscule. It is hard to expect that cross price elasticities
would be as large as that of Coke and Pepsi given there are about 25,000 to 30,000 alternatives.
Other neighborhoods also contain many alternatives. ’Queens’, ’Bronx’, and ’Staten Island’ contain
5,480, 980, and 378 rental units inside, respectively.

1−σ sj|gt or σ2
1−σ2

Table 2.11: Price Elasticities

2 Level NL

3 Level NL

Own Price Elasticities

Mean
S.D.

Cross Price Elasticities

Mean
S.D.
Min
Max

4.0817
2.8200

Case 1
0.0007
0.0020
0.0000
0.0763

3.5365
2.4439

Case 1 Case 2
0.0286
0.0001
0.0001
0.1618
0.0000
0.0000
6.3240
0.0050

On the other hand, cross price elasticities for three level nested logit models show more
reasonable values and greater variations though the mean is still small (0.0268). The cross price
elasticities range from almost zero to 6.3240, reﬂecting the fact that there are neighborhoods

63

such as ’Midtown’ in Manhattan with more than 4,000 substitutes and pretty small ones with
only a few competitors. The ’Room Type’ ﬁlters, travel dates and host availabilities, number of
rooms and guests, and maximum price ﬁlters still leave at least couple of hundred alternatives to
consider in a popular neighborhood. To identify more reﬁned choice set formation of consumers,
an econometrician may need an individual level data.

The second possibility is that product j and k are in diﬀerent neighborhoods but in the same
big region (for three level nested logit only). The formula for this case (Case 2 in Table 2.11) is the
same as equation (2.19) (Case 1 for two level nesting) with σ replaced with σ2.

The estimates seem to suggest that a substitution between products in diﬀerent neighborhoods
is not a realistic option for NYC Airbnb tourists. Such small cross price elasticities is because of
still a large number of alternatives in a precinct that contains the neighborhoods rental unit j and k
reside, similar to the two level nested logit case.

The last possibility is when product j and k are in diﬀerent big regions. Then the cross
elasticities reduce to the simple logit case αsjt pjt, which are close to zero meaning a negligible
substitution among rental units far away from the locational preference of a consumer.

Table 2.12: Compensating Variations over Counterfactual Scenarios

Categories / Scenarios
Quality Certifications

Superhost Indicator
Veriﬁcation Accounts

Consumer Reviews

Ratings Average
Negative Reviews

Seller Texts

Positive Adjectives
Location Words

2 Level NL

−1

without

3 Level NL

Average

without

−1

Total (million)
−1

without

-0.5868

-1.7270

-1.2232

-7.0991

-25.9670

-9.5921

-40.6014

-6.7941

-28.7583

-5.7275
1.2637

-51.9953
3.5131

-6.1465
1.6230

-54.4118
3.6848

-4.3536
1.1496

-38.5403
2.6100

2.2300
-1.7784

6.3000
-6.3085

1.8513
-0.8378

5.1584
-3.0022

1.3113
-0.5934

3.6537
-2.1265

64

The welfare measure is compensating variation which takes the following general form.

CVi =

(CSa f ter

i

1
α

− CSbe f ore

i

)

(2.20)

where α is the price coeﬃcient or the marginal utility of income. CSa f ter
represent
consumer surplus after and before the counterfactual experiments, respectively. The expressions
for consumer surplus for nested logit models are

and CSbe f ore

i

i

exp[δjt/(1 − σ)])1−σ]

(2.21)

CSN L2

i

= log[1 +

CSN L3

i

= log[1 +

(

G
G

g=1

g=1

k∈g
exp(Ig)]

Jhg
Hg

k=1

where the inclusive values for three level nested logit models are presented below for the purpose
of easier stepwise understanding and actual computations.

Ihg = (1 − σ1) ∗ log

Ig = (1 − σ2) ∗ log

exp[δkt/(1 − σ1)]

exp[Ihg/(1 − σ2)]

(2.22)

h=1

There are clear limitations in the counterfactual experiments of this paper. Due to the diﬃculty in
supply side modeling, I cannot produce a complete description of market equilibrium before and
after the counterfactual scenarios including changes in prices, quantities, and product oﬀerings. I
leave this task to future research with more data on the heterogeneous Airbnb rental unit owners.
Instead, I generate counterfactual prices by an OLS hedonic regression results, following Hausmann
and Leonard (2002). The price responses for a unit change in ’Superhost Indicator’, ’Veriﬁcation
Accounts’, ’Ratings Average’, ’Negative Reviews’, ’Positive Adjectives’, and ’Location Words’ are
$8.3759, $2.4799, $13.8010, -$0.4440, -$3.1582, and $0.6599, respectively.

Table 2.12 reports compensating variations from two counterfactual scenarios. First is a unit
reduction (−1) in each information variables. Second is comparing situations with and without one
of the information contents. The latter approach is for controlling the diﬀerent measurement scales
of each information variables, and following Lewis and Zervas (2016)’s study on the impacts of

65

reviews in hotel markets. The induced changes in price using the estimates from the hedonic price
regressions are reﬂected in calculating CSa f ter

together.

i

The ’dollar metric’ from counterfactual experiments conﬁrms the hypothesis of this paper that
trustworthy information on product quality is important in sustaining a market with a high degree
of information asymmetry. Enforced quality certiﬁcations and veriﬁable ex-post review contents
turn out to be more inﬂuential than non-veriﬁed seller side ’Cheap Talk’. Speciﬁcally, the host
identity veriﬁcation measures (’Veriﬁcation Accounts’) show a greater dollar impact on purchase
decisions in either case of unit reduction or complete absence ($9.5921/$40.6014). The lower
value for ’Superhost’ ($1.7270) is originating from the fact that only about 11% of rental units get
aﬀected by the counterfactual scenario.

CV for consumer review ratings also show greater impacts on consumer choices than those
of seller side textual voluntary disclosures. Compensating variations for ’Ratings Average’
are $6.1465/$54.4118, both higher than those of ’Positive Adjectives’($1.8513/$5.1584) and
’Location Words’ ($0.8378/$3.0022) from advertisement texts. CVs for ’Negative Reviews’
($1.6230/$3.6848) from review texts do not show particularly more dominant impacts.

Given that there were 708,308 reservations in the sample during the time periods of our empirical
analysis, the aggregate dollar values of consumer welfare from each information contents would be
quite huge. For the case of a unit reduction, the welfare impacts are $1.2232, $6.7941, $4.3536,
and $1.1496 million for ’Superhost’, ’Veriﬁcation Accounts, ’Ratings Average’, and ’Negative
Reviews’, respectively. On the seller side, $1.3113 and $0.5934 million are for ’Positive Adjectives’
and ’Location Words’. In the case of total absence, they are $1.2232, $28.7583, $38.5403, $2.6100,
$3.6537, and $2.1265 million in the same order.13

One caution for the interpretation of positive CV signs for ’Negative Reviews’ and ’Positive
Adjectives’ is that they represent the increase in consumers’ demand for rentals with a unit lower
’Negative Reviews’ and ’Positive Adjectives’. In fact, ’Positive Adjectives’ seems to be a strong
13Lewis and Zervas (2016) estimated the welfare impacts of online review ratings for hotel markets over ﬁve
U.S. states over 10 years of time periods as about $546 million with hedonic price regression adjustments for the
counterfactual case of total absence.

66

indicator for cheap and low quality Airbnb rental units, showing negative signs on coeﬃcients both
for demand and hedonic models. On the other hand, ’Location Words’ such as ’5 min walk to
Central Park’ and ’A Walking Distance from Grand Central’ are usually veriﬁable instantly on the
Google Maps API on each Airbnb listing webpages, which is more credible and hence attracts more
consumers.

2.5 Conclusion

This paper investigates how the sharing economy platform Airbnb could overcome adverse selection
due to information asymmetry. The risk of adverse selection for P2P markets is expected to be
higher than online retail outlets for material goods because the accommodation service transactions
among anonymous non-professional individuals imply a higher degree of information asymmetry
and more than just monetary losses. To test the insight from information economics that enforced
and ex-post veriﬁable information on product quality is more inﬂuential for a consumer’s decision
making process, demand models were estimated.

Predominant identiﬁcation challenges due to high dimensionality in attribute space were partly
resolved using the variable selection by a lasso variant and exact post selection inferences. How-
ever, the model selection was unstable once endogeneity is involved and the results show that an
appropriate econometric (structural) modeling approach designed to capture actual consumers’ de-
cision making principles is essential to produce more accurate utility parameters. Unstructured text
information on product quality was incorporated in the model and showed nonnegligible impacts.
The results conﬁrm our hypothesis, with quality certiﬁcations and consumer review ratings
showing greater impacts on rental choices than non-veriﬁed seller side voluntary disclosures via
textual advertisements.

67

CHAPTER 3

ESTIMATION FOR THE DISTRIBUTION OF RANDOM COEFFICIENTS WITH

HETEROGENEOUS AGENT TYPES: MONTE-CARLO SIMULATION

3.1 Introduction

Since Berry, Levinsohn, and Pakes (1995), Nevo (2001), and Petrin (2002), random coeﬃcients
logit models to capture heterogeneous consumer preferences have been one of the most popular
frameworks for demand research. But the estimation routine is highly nonlinear, computationally
burdensome, and in some cases the convergence is not guaranteed. Even if individual choice data
is available, (simulated) maximum likelihood estimation for random coeﬃcients usually incurs too
much calculation costs, which is not an attractive option to applied researchers working with more
than millions of transaction records.

Fox, Kim, Ryan, and Bajari (2011, henceforth FKRB) proposes an alternative, that is nonpara-
metric, computationally simple, easy to program, and easy to combine auxiliary methods due to its
least squares format. To give a concrete idea, consider a simple logit choice probability given the
binary outcome yi j, attributes vector xi j, and the random coeﬃcients βi, where i and j are indicies
for individual consumers and products, respectively.

Pr(yi j = j|x) = (cid:119)

1 +J

(cid:48)
i j βi)
exp(x

exp(x
j(cid:48)=1

(cid:48)

i j(cid:48) βi) dF(βi)

(3.1)

(3.2)

Assuming there are r = 1, ... , R types of consumers, i.e., R ﬁxed preference parameters β1, ... , βR,
the choice probability of choosing product j can be expressed as an weighted average with the
probability tuple θ = (θ1, ... , θR).

Pr(yi j = 1|x) =

θr

R

r=1

1 +J

exp(x
j(cid:48)=1

(cid:48)
i j βr)
(cid:48)
exp(x
i j(cid:48) βr)

Then the parameters enter the estimating moments linearly, and the main interest is to estimate
the tuple θ. From the estimated tuples ˆθ, a practitioner can also estimate the empirical joint and

68

marginal distributions of random coeﬃcients β. The inequality constraints for θ is also simple,

required as a natural condition for a probability vector:R

r=1 θr = 1 and θr ≥ 0 for all r.

FKRB (2011) demonstrates that estimation for θ and F(β) using the reparametrization speciﬁed
as in equation (3.2) is consistent. The estimator can be applicable to a wide range of nonlinear
models, but this paper focuses on the multinomial logit demand case. For more rigorous theoretical
discussion, see Fox, Kim, Ryan, and Bajari (2012) and Fox, Kim and Yang (2016). FKRB(2011) is
closely related to latent class models in discrete choice literature (Green (1976) and Train (2003)).
One possible weaknesses of FKRB (2011) is that, as demonstrated in their Monte-Carlo simu-
lation results, the approximating performances of ˆF(β) can deteriorate as the number of consumer
types R grows. Also, there is a possibility that there are some ’nuisance’ consumer types that
can cause poor estimation results for ˆF(β). It is a similar environment where there are too many
irrelevant regressors in linear regressions. One can expect that appropriate dimensionality re-
duction techniques can improve the approximation performances, along with signiﬁcant gains in
computation speed.

To examine such a possibility, this paper tries to reduce the dimensionality in consumer hetero-
geneity by introducing high-dimensional metrics (Belloni and Chernozhukov (2013)). The baseline
estimator based on equation (3.2) can be expressed as a linear regression with a design matrix of
size N J ∗ R, where N is the number of observations (consumer choices) and J is the number of
I apply the lasso variant ﬁrst to reduce R, and with R∗(≤ R), construct a
choice alternatives.
new design matrix of size N J ∗ R∗ and compare the performance metrics to measure the distances
between the true CDF F0(β) and ˆF(β). I also try the original lasso formulation by Tibshirani (1996)
with 10 folds cross validation.

The lasso variant developed by Belloni and Chernozhukov (2013) guarantees the asymptotic
It is one of the ﬁrst high-
approximation performances of post-lasso least squares estimators.
dimensional metrics or machine learning application that started to be accepted in economics,
with application areas including demand estimation, treatment/policy impacts, and general linear
models (Belloni, Chernozhukov, and Wang (2014), Chernozhukov, Hansen, and Spindler (2015),

69

(2018)).

In statistics, post-selection estimators and inference using
and Chernozhukov et al.
popular machine learning methods other than lasso such as ridge regression, elastic net, and tree
based models (boosting) have been actively investigated.

In Monte-Carlo experiments (Section 3.4), post-lasso estimators show better approximation
It is stable once the number of consumer
performances compared to the baseline estimator.
types R exceeds 36. The estimated CDFs of β produced by post-lasso estimators also track the
(simulated) ’true’ distributions better, and this can be attributed to the variable selection process
’killing’ nuisance variables so that ˆF(β) does not take extreme values. Hence the combination of the
baseline inequality constrained least squares and high-dimensional metrics can be a good alternative
to estimate random coeﬃcients logit demand models with ’Big Data’ in various marketplaces.

The rest of this paper is organized as follows: Section 3.2 brieﬂy introduces the multinomial
random coeﬃcients logit model, the baseline estimator from FKRB (2011), and the lasso variant
by Belloni and Chernozhukov (2013). Section 3.3 outlines the Monte-Carlo designs to compare
baseline estimator and post-lasso estimator. Section 3.4 reports estimation results and ﬁgures for
the marginal empirical distributions of β1 to compare the approximation performances obtained
using baseline and post-lasso estimators.

3.2 Model

3.2.1 Multinomial Random Coeﬃcients Logit Demand Model

This section lays out the multinomial random coeﬃcients logit demand model, which is one of the
key motivations for FKRB (2011).

ui j = x(cid:48)

gj(xi, βr) =

Pr(yi j = 1|xi) =

i j βr + i j
exp(x(cid:48)
j(cid:48)=1

1 +J
R

θr gj(xi, βr)

i j βr)
exp(x(cid:48)

i j(cid:48) βr)

r=1

70

(3.3)

(3.4)

(3.5)

where i j is Type I extreme value error, and xi j is the K observed characteristics for each pair of
agents i = 1, ... , N and products j = 1, ... , J. θ = (θ1, ... , θR) represents the probability or share
of consumer types r = 1, ... , R in the population. The primary interest is to estimate the tuple θ,
and hence the distribution of random coeﬃcients (CDFs) of β.

The actual estimation is simple OLS, with i = 1, ... , N observations on (xi, yi) and the following

moment condition.

E[yi j − Pr(yi j = 1|xi)|xi] = 0

(3.6)
Letting the R × 1 vector zi j = (zi j1, ... , zi jR)(cid:48) with individual elements zi jr = gj(xi, βr), if one
ﬁxes or simulates the observation pairs (xi, yi), zi jr is a ﬁxed regressor. Equation (3.6) gives a
consistent OLS estimator for θ.

ˆθ = arg min

θ

1
N J

(yi j − z(cid:48)

i j θ)2

(3.7)

N

J

i=1

j=1

R

Deﬁning Y as the N J × 1 vector stacking yi j’s and Z as the N J × R matrix stacking zi j, the estimator
is ˆθ = (Z(cid:48)Z)−1Z(cid:48)Y. Solving equation (3.7) can be easily done as a constrained minimization using
r=1 θr = 1

linlsq in Matlab. The two constraints for θ naturally required as a probability vector areR

and θr ≥ 0 for all r = 1, ... , R.

Once θ is estimated, one can construct the estimated CDFs for the random coeﬃcients.

ˆFN(β) =

ˆθr 1[βr ≤ β]

(3.8)

where 1[βr ≤ β] = 1 when βr ≤ β.

r=1

3.2.2 High-Dimensional Metrics

As an extension of FKRB (2011), the main interest is to examine the performance of the baseline
estimator when there are too many consumer types r = 1, ..., R to consider. In other words, this
paper shows the approximating performance of the estimator ˆF(β) to F(β) when the dimensionality
of grid of points R is reduced by two lasso variants, namely the original plain lasso and Belloni and

71

Chernozhukov (2013)’s lasso with sparsity assumption (henceforth cv and hdm lasso, respectively),
to the OLS minimization problem as speciﬁed in equation (3.7).

1
N J

N
J
J
N

j=1

i=1

i=1

j=1

arg min

θ

arg min

θ

1
N J

(yi j − z(cid:48)

i j θ)2 + λ||θ||1

(yi j − z(cid:48)

i j θ)2 +

λ∗
N J

|| ˆΨθ

(cid:48)||1

(3.9)

(3.10)

(3.11)

(3.12)

Minimization problem (3.9) is the formulation for cv lasso with the shrinkage parameter λ and
the absolute value norm || · ||1. Minimization (3.10) is for hdm lasso with sparsity assumption,
and the data driven penalty loadings ˆΨ and λ∗ deﬁned to guarantee the asymptotic approximation
performance of post-selection OLS estimators.

(cid:118)(cid:117)(cid:117)(cid:116) 1

N J
√

N

J

(z2
i j)
i jr ˆ 2

ˆψr =

∗ = 2c

λ

N JΦ

i=1

j=1
−1(1 − γ/(2R))

where ˆi j is the residuals, Φ is the CDFs for standard normal distribution, and c and γ are
conditioning parameters preset at 1.1 and 0.1 for heteroskedastic error structure.

Hence, given the agent type probabilities θ = (θ1, ... , θR), a researcher ﬁrst reduces the dimen-
sionality of θ, for example, θ∗ = (θ1∗, ... , θR∗) with R∗ ≤ R. In the Monte-Carlo experiment, this
paper picks a ﬁxed grid of points for θ of dimension R using Halton draws, and then use lasso
variants to select a grid of points for θ∗ with a smaller dimensionality R∗. Corresponding xi j’s and
the coeﬃcients βr’s are generated from ﬁxed distributions. With this reduced dimensionality of
the new tuple θ∗, the baseline least squares (equation (3.7)) is estimated to construct the estimated
CDFs for β.

This dimension reduction is in fact conducted on the regressor vector Z with rank R, with
individual elements zi jr, or the individual choice probabilities gj(xi, βr). The selected choice
probabilities gj(xi, βr) form a new rank-reduced regressor vector Z∗ ∈ RN J×R∗.

72

3.3 Monte-Carlo

3.3.1 Parameter and Settings

For the Monte-Carlo experiment, this paper tries six combinations. For each N = 2, 000 and 5, 000
observation pair set, three gaussian mixtures distributions for generating β = (β1, β2) were used.
There are J = 10 choice alternatives, K = 2 observed attributes, and R = t2, t = 3, 4, ... , 22 (9, 16,
... , 484) consumer types. The two dimensional grid of points for θ with dimensionality R are drawn
from [−10, 10] × [−10, 10] using Halton draws. For estimated CDFs, the S = 10, 201(101 × 101)
grid of points on which both the actual (simulated) and estimated CDFs will be evaluated are
uniformly drawn from also [−10, 10]×[−10, 10]. The number of Monte-Carlo repetition M is 100.
Two kinds of approximating performance metric were used, RMISE (Root Mean Integrated

Squared Error) and IAE (Integrated Absolute Error).

(cid:118)(cid:117)(cid:116) 1
S

M

1
S

s=1

S

M

[ 1
S

s=1

m=1
| ˆFm(βs) − F0(βs)|

RMISE =

I AE =

( ˆFm(βs) − F0(βs))2]

(3.13)

(3.14)

βs represents the two dimensional (K = 2) coeﬃcients for observed attributes xi j’s at one of the
grid points s = 1, ... , S. ˆFm is the estimated CDFs at the m-th repetition and F0 is the ’true’ CDFs
for random coeﬃcients β generated using N = 10, 000.

Each xi j ∈ R2 is drawn from N(0, 1.52), and the true F0(β) are drawn from three diﬀerent

 0.2 −0.1

−0.1

0.4

 and Σ2 =

0.3 0.1
. In other words, there

0.1 0.3

are three designs for each N, namely gaussian mixtures distributions of two, four, and six normal
distributions (Equation (3.15), (3.16), and (3.17)).

mixtures normal distributions with Σ1 =

73

0.4 ∗ N([3,−1], Σ1) + 0.6 ∗ N([−1, 1], Σ2)
0.2 ∗ N([3, 0], Σ1) + 0.4 ∗ N([0, 3], Σ1)
+ 0.3 ∗ N([1, −1], Σ2) + 0.1 ∗ N([−1, 1], Σ2)
0.1 ∗ N([3, 0], Σ1) + 0.2 ∗ N([0, 3], Σ1) + 0.2 ∗ N([1, −1], Σ1)
+ 0.1 ∗ N([−1, 1], Σ2) + 0.3 ∗ N([2, 1], Σ2) + 0.1 ∗ N([1, 2], Σ2)

(3.15)

(3.16)

(3.17)

For lasso methods, the selection results (R∗) could diﬀer over the random draws of x’s and β’s.
Also, to compare the approximating performances between the baseline OLS and post-selection
OLS, I ﬁxed a grid of points R, and conducted lasso selection over 10 diﬀerent random draws for
x’s and β’s producing 10 diﬀerent reduced grid of points of dimensionality R∗. For example, if a
Halton draws of two dimensional grid of points R = 256 is at hand, hdm lasso selection method
was applied 10 times to each set of x’s and β’s to produce 10 reduced grid of points R∗ with the
dimensionality varying from 20 to 22 for the design of N = 5, 000 with the number of mixtures at
six (Table 3.3).

The post-cv lasso uses 10 folds cross validation, and gj(xi, βr)’s were selected using the λ
values achieving the minimum RMSE (Root Mean Squared Error) from the penalized regressions.
The post-hdm lasso was applied with the default setting as speciﬁed by the R package ’hdm’ for
the heteroskedastic error case.

3.4 Results and Discussion

3.4.1 Performance Metrics

Table 3.1, 3.2, and 3.3 report the Monte-Carlo simulation results. Each table contains the perfor-
mance metrics (RMISE and IAE) for the baseline and post-lasso estimators using 10 folds cross
validation and Belloni and Chernozhukov (2013), for each combination of N (2,000 and 5,000),
mixtures distributions (two, four, and six) and R from 16 to 484. For each R, the reduced dimen-
sionalities R∗ produced by both post-cv and post-hdm lasso are reported, along with the number of

74

positive weights estimated. RMISE and IAE results for post-cv and post-hdm lasso estimators are
average values computed over the 10 diﬀerent reduced grid points.

The results shows the following: (1) For both the baseline and post-lasso inequality constrained
OLS, RMISE and IAE decrease in N and R but only until R reaches a certain level (144 or 169). (2)
Even R is relatively high, the number of non-zero basis functions (non-zero θr’s) stays low, about
up to 11 for the most complex case (N = 5, 000 and R = 484 with mixtures of six normals). (3)
RMISE and IAE are lower than the baseline for R values above certain level (≥ 49) with the reduced
grid R∗ using either post-cv or post-hdm lasso. (5) It is hard to compare post-cv and post-hdm
lasso in terms of RMISE and IAE across all the combinations of N, R, and distribution mixtures.
(6) Post-hdm lasso selects fewer variables than post-cv lasso with 10 folds cross validation. The
mean and maximum number of dimensionality in the post-selection grid R∗ are higher for post-cv
lasso. So are the number of positive weights (θr’s).

3.4.2 Marginal Distributions of Coeﬃcients
Figure 3.1 through Figure 3.6 depict F0(β1) and ˆF(β1), namely the (simulated) true marginal
distribution of β1 and the estimated marginal distributions using the baseline, post-cv lasso, and
post-hdm lasso estimators from the Monte-Carlo designs of N = 5,000 over various R’s. The
marginal distributions were calculated from the estimated joint CDFs ˆF(β1, β2).

Figure 3.1 and Figure 3.2 compare ˆF(β1)’s produced by the baseline and post-lasso estimators
with relatively low levels of R ranging from 16 to 49. For R values of 16 and 25, there seems
to be no clear visual conﬁrmation that the approximation performances of post-lasso estimators
are better than the baseline. As R exceeds 36, post-lasso estimators start to show better ﬁts, with
post-cv lasso performing better at tail areas than post-hdm lasso.

Figure 3.3 and Figure 3.4 depict the analogous comparisons with an increase in R values of
81 to 144. The ﬁt for ˆF(β) of post-lasso estimators improves more clearly and stays consistent,
as demonstrated by the RMISE and IAE values in Table 3.3. Post-cv lasso hits the best ﬁt at
R = 121, and over R values of 81 and 144, post-cv lasso tracks the (simulated) true F0(β1) better

75

than post-hdm lasso though the diﬀerences are small.

Figure 3.5 and Figure 3.6 show F0(β1) and ˆF(β1) for relatively high R values of 169 and 529.
Both post-cv and post-hdm lasso track the true CDFs of β1 very well, while post-cv lasso still
performs slightly better than post-hdm lasso. But the computation speed is much faster when using
post-hdm lasso, because the reduced dimensionality R∗ for cv lasso is much greater than hdm lasso.
Figure 3.7 shows the (simulated) true joint distribution of β1 and β2. One thing to note is that
by coincidence, the mixtures distributions become smoother as the number of mixtures increase.
Though the mixtures of two normals contain more inﬂection points, the ﬁt of post-lasso estimators
are still excellent. The results in Figure 3.1 through 3.7 are similar for β2.

3.5 Conclusion

This paper explores the potential gains of high-dimensional metrics to the nonparametric least
squares estimator for the distribution of random coeﬃcients in multinomial logit demand case,
developed by FKRB (2011). It is easy to program and highly ﬂexible enough to be combined with
auxiliary techniques, such as lasso and other machine learning methods for dimension reduction.
Post-lasso regression results shows better approximating performances to the joint mixtures dis-
tributions and faster computation speed. Without resorting to the existing nonlinear estimation
methods, our post-lasso estimator successfully captures heterogeneity in consumer preferences.

76

Table 3.1: Monte-Carlo Results (1) (Number of Mixtures: 2)

RMISE

cv
mean

hdm

base

0.0688
0.0714
0.0681
0.0666
0.0669
0.0673
0.0674
0.0653
0.0692
0.0658
0.0697
0.0672
0.0706
0.0698
0.0683
0.0706
0.0677
0.0688
0.0696

0.0347
0.0609
0.0602
0.0508
0.0493
0.0392
0.0366
0.0325
0.0330
0.0324
0.0340
0.0353
0.0326
0.0314
0.0326
0.0321
0.0292
0.0309
0.0332

0.0705
0.0786
0.0767
0.0684
0.0655
0.0633
0.0656
0.0641
0.0649
0.0655
0.0670
0.0672
0.0668
0.0664
0.0663
0.0677
0.0666
0.0670
0.0660

0.0299
0.0607
0.0603
0.0520
0.0507
0.0371
0.0401
0.0352
0.0358
0.0321
0.0345
0.0373
0.0330
0.0283
0.0331
0.0326
0.0355
0.0348
0.0368

0.0283
0.0290
0.0302
0.0297
0.0306
0.0306
0.0311
0.0312
0.0322
0.0329
0.0331
0.0337
0.0341
0.0339
0.0343
0.0341
0.0348
0.0349
0.0350

0.0288
0.0284
0.0271
0.0262
0.0320
0.0284
0.0284
0.0284
0.0290
0.0294
0.0295
0.0291
0.0296
0.0297
0.0297
0.0297
0.0298
0.0307
0.0298

base

0.0676
0.0695
0.0702
0.0695
0.0713
0.0711
0.0720
0.0722
0.0745
0.0755
0.0755
0.0771
0.0780
0.0778
0.0791
0.0788
0.0797
0.0799
0.0801

0.0693
0.0695
0.0652
0.0641
0.0759
0.0668
0.0671
0.0675
0.0688
0.0695
0.0699
0.0697
0.0713
0.0715
0.0714
0.0714
0.0717
0.0733
0.0716

IAE
cv
mean

0.0289
0.0300
0.0283
0.0277
0.0275
0.0283
0.0288
0.0272
0.0283
0.0273
0.0293
0.0278
0.0297
0.0292
0.0287
0.0293
0.0282
0.0290
0.0292

0.0379
0.0254
0.0259
0.0198
0.0191
0.0155
0.0151
0.0132
0.0129
0.0133
0.0139
0.0142
0.0140
0.0121
0.0131
0.0142
0.0117
0.0124
0.0137

R

16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484

16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484

dim(R*)

cv
hdm
mean (min, max)

# of Pos. Weights
base
hdm

cv
mean

7.9 (7, 9)
11.5 (9, 15)
19.2 (9, 25)
19.0 (11, 29)
23.5 (10, 49)
24.1 (13, 36)
18.5 (9, 33)
23.9 (12, 63)
19.6 (14, 36)
22.3 (13, 42)
18.3 (14, 22)
29.7 (18, 66)
30.7 (16, 74)
34.8 (18, 78)
27.0 (24, 30)
30.1 (18, 43)
26.9 (19, 34)
32.5 (22, 52)
27.9 (15, 37)

11.2 (8, 13)
23.4 (22, 24)
25.8 (22, 33)
27.4 (14, 35)
32.2 (26, 36)
26.0 (18, 32)
30.6 (15, 71)
25.4 (20, 43)
20.8 (20, 21)
27.6 (23, 32)
25.0 (22, 36)
36.2 (27, 40)
37.6 (20, 47)
26.4 (19, 36)
31.6 (25, 42)
35.8 (28, 42)
24.2 (20, 31)
49.8 (31, 104)
46.6 (37, 65)

5.9 (5, 7)
5.1 (4, 6)
7.2 (5, 9)
8.0 (7, 9)
8.0 (7, 10)
9.7 (6, 14)
8.9 (6, 12)
9.3 (8, 11)
9.6 (7, 12)
10.1 (8, 14)
10.1 (7, 12)
11.9 (10, 15)
10.4 (7, 13)
11.9 (7, 15)
11.5 (8, 14)
11.9 (10, 15)
13.3 (10, 17)
12.9 (10, 16)
13.1 (9, 21)

5.2 (5, 6)
7.6 (6, 9)
8.8 (8, 11)
10.0 (9, 11)
11.6 (10, 13)
11.6 (9, 16)
10.2 (9, 13)
12.0 (9, 14)
11.0 (11, 11)
11.4 (11, 13)
12.2 (12, 13)
13.0 (13, 13)
15.2 (14, 17)
14.2 (12, 17)
16.2 (12, 18)
16.2 (14, 18)
17.0 (14, 21)
16.6 (14, 21)
18.8 (16, 21)

6.36
6.67
7.51
7.43
8.29
8.36
8.00
8.21
8.84
8.29
8.35
8.35
8.74
9.09
8.42
8.48
8.84
8.70
8.64

6.24
6.72
7.40
7.62
8.17
8.61
8.74
8.72
8.94
9.05
9.25
9.06
9.13
9.06
9.10
9.24
9.17
9.54
9.10

6.12
6.42
6.48
6.95
6.56
7.38
7.54
6.90
6.89
7.19
7.37
7.10
7.94
7.14
7.54
7.52
7.16
7.32
7.55

7.36
9.44
10.59
10.80
11.59
12.61
12.33
11.96
12.23
12.11
12.93
14.89
14.90
12.60
13.60
13.82
11.96
14.54
15.34

5.35
4.94
5.13
5.14
5.57
5.57
6.01
5.65
5.80
5.74
5.57
5.58
6.29
5.38
5.30
5.64
5.49
5.61
5.40

5.00
7.03
7.63
8.32
8.93
9.06
8.22
8.17
7.51
9.10
9.10
7.61
10.26
10.21
9.18
9.56
9.63
9.40
10.56

hdm

N = 2,000
0.0289
0.0328
0.0329
0.0278
0.0269
0.0260
0.0273
0.0264
0.0265
0.0269
0.0273
0.0275
0.0274
0.0271
0.0268
0.0277
0.0273
0.0275
0.0269
N = 5,000
0.0398
0.0248
0.0251
0.0200
0.0193
0.0145
0.0156
0.0140
0.0140
0.0121
0.0128
0.0159
0.0128
0.0106
0.0127
0.0125
0.0137
0.0130
0.0142

77

Table 3.2: Monte-Carlo Results (2) (Number of Mixtures: 4)

RMISE

cv
mean

hdm

base

0.0627
0.0682
0.0655
0.0640
0.0642
0.0631
0.0660
0.0619
0.0646
0.0618
0.0662
0.0620
0.0678
0.0648
0.0647
0.0660
0.0644
0.0641
0.0648

0.0549
0.0514
0.0516
0.0406
0.0392
0.0323
0.0355
0.0341
0.0356
0.0404
0.0368
0.0345
0.0365
0.0340
0.0354
0.0347
0.0361
0.0352
0.0373

0.0593
0.0731
0.0753
0.0662
0.0651
0.0608
0.0657
0.0617
0.0617
0.0624
0.0642
0.0636
0.0630
0.0623
0.0638
0.0628
0.0615
0.0631
0.0617

0.0569
0.0550
0.0617
0.0460
0.0447
0.0401
0.0439
0.0416
0.0414
0.0424
0.0379
0.0405
0.0407
0.0409
0.0423
0.0423
0.0432
0.0418
0.0433

0.0268
0.0279
0.0309
0.0303
0.0308
0.0303
0.0308
0.0311
0.0321
0.0325
0.0322
0.0331
0.0334
0.0336
0.0338
0.0332
0.0345
0.0345
0.0347

0.0242
0.0254
0.0260
0.0248
0.0266
0.0262
0.0271
0.0271
0.0280
0.0283
0.0295
0.0292
0.0295
0.0295
0.0294
0.0295
0.0298
0.0295
0.0303

base

0.0634
0.0664
0.0713
0.0704
0.0708
0.0695
0.0706
0.0712
0.0733
0.0740
0.0730
0.0756
0.0761
0.0765
0.0770
0.0764
0.0781
0.0783
0.0786

0.0575
0.0610
0.0605
0.0580
0.0618
0.0610
0.0621
0.0622
0.0643
0.0650
0.0679
0.0672
0.0682
0.0685
0.0683
0.0685
0.0689
0.0683
0.0700

IAE
cv
mean

0.0264
0.0293
0.0276
0.0269
0.0267
0.0270
0.0294
0.0261
0.0269
0.0261
0.0283
0.0262
0.0293
0.0273
0.0277
0.0277
0.0271
0.0273
0.0277

0.0248
0.0228
0.0228
0.0168
0.0165
0.0144
0.0162
0.0149
0.0158
0.0173
0.0159
0.0147
0.0158
0.0150
0.0155
0.0152
0.0159
0.0148
0.0163

R

16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484

16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484

dim(R*)

cv
hdm
mean (min, max)

# of Pos. Weights
base
hdm

cv
mean

11.8 (10, 16)
14.0 (10, 19)
16.6 (11, 30)
18.8 (13, 29)
17.4 (12, 30)
23.8 (15, 38)
30.2 (14, 74)
21.0 (15, 32)
23.6 (21, 29)
23.4 (20, 27)
23.0 (18, 30)
28.6 (26, 34)
31.8 (27, 40)
31.6 (23, 43)
34.2 (21, 65)
29.6 (23, 36)
38.0 (32, 45)
41.4 (22, 76)
39.2 (27, 48)

14.4 (12, 16)
16.6 (15, 22)
20.0 (13, 27)
21.8 (15, 34)
22.2 (17, 31)
22.2 (18, 26)
27.0 (22, 34)
25.6 (23, 31)
36.4 (27, 62)
29.8 (24, 42)
33.4 (29, 43)
34.4 (29, 43)
41.0 (25, 74)
39.4 (34, 47)
50.8 (29, 77)
45.0 (14, 80)
45.6 (35, 79)
43.8 (37, 61)
74.8 (39, 111)

7.6 (7, 8)
8.8 (7, 10)
10.0 (8, 12)
9.8 (7, 14)
11.2 (9, 14)
12.2 (11, 13)
12.6 (10, 15)
13.8 (11, 17)
13.2 (10, 17)
12.8 (10, 18)
12.8 (9, 16)
14.4 (12, 16)
16.0 (13, 19)
13.4 (7, 17)
13.4 (10, 16)
16.2 (11, 19)
16.4 (10, 20)
15.0 (11, 19)
15.6 (12, 19)

8.0 (7, 9)

12.0 (11, 13)
11.4 (10, 13)
13.2 (12, 14)
15.0 (13, 17)
15.2 (14, 16)
14.0 (12, 16)
17.2 (15, 19)
16.4 (15, 18)
16.6 (14, 19)
16.2 (14, 18)
17.2 (13, 25)
20.4 (14, 29)
20.2 (16, 25)
17.6 (17, 18)
23.0 (17, 30)
19.2 (14, 22)
19.6 (14, 26)
20.8 (17, 24)

6.61
7.17
8.01
8.18
8.54
8.61
8.75
8.68
8.96
8.66
8.97
8.97
9.11
9.21
9.29
8.82
8.90
8.94
9.00

7.00
7.52
8.52
8.83
9.40
9.48
9.69
9.56
9.87
9.74
9.98
10.10
9.83
9.97
9.90
10.14
10.05
10.07
10.22

6.28
6.59
6.87
7.01
6.84
7.66
7.59
7.29
7.35
7.32
7.58
7.45
7.96
7.46
7.78
7.62
7.58
7.75
7.64

8.98
10.70
11.01
12.50
14.29
15.16
16.19
16.07
17.50
16.04
16.38
16.75
17.32
17.56
17.69
17.86
17.85
16.41
17.61

5.52
5.13
5.53
5.19
5.85
5.85
6.07
6.03
6.26
5.87
5.78
5.94
6.30
5.70
5.54
5.73
5.92
5.84
5.49

7.55
9.46
8.30
9.93
9.98
10.92
10.44
11.16
11.42
11.39
11.48
10.86
11.65
10.52
10.24
10.91
10.74
10.09
10.56

hdm

N = 2,000
0.0245
0.0311
0.0328
0.0268
0.0270
0.0252
0.0277
0.0255
0.0252
0.0257
0.0263
0.0265
0.0260
0.0255
0.0257
0.0254
0.0254
0.0261
0.0252
N = 5,000
0.0259
0.0255
0.0282
0.0210
0.0191
0.0172
0.0194
0.0181
0.0182
0.0179
0.0159
0.0175
0.0178
0.0177
0.0187
0.0188
0.0191
0.0183
0.0197

78

Table 3.3: Monte-Carlo Results (3) (Number of Mixtures: 6)

RMISE

cv
mean

hdm

base

0.0684
0.0675
0.0640
0.0627
0.0632
0.0651
0.0622
0.0624
0.0670
0.0635
0.0665
0.0659
0.0666
0.0680
0.0652
0.0683
0.0644
0.0668
0.0676

0.0568
0.0569
0.0387
0.0330
0.0371
0.0297
0.0306
0.0286
0.0347
0.0287
0.0309
0.0338
0.0319
0.0346
0.0348
0.0330
0.0338
0.0333
0.0346

0.0746
0.0763
0.0705
0.0637
0.0593
0.0594
0.0590
0.0601
0.0616
0.0621
0.0632
0.0641
0.0639
0.0639
0.0622
0.0659
0.0651
0.0642
0.0638

0.0687
0.0684
0.0424
0.0354
0.0341
0.0329
0.0357
0.0344
0.0335
0.0333
0.0345
0.0383
0.0386
0.0377
0.0373
0.0381
0.0378
0.0344
0.0364

0.0267
0.0269
0.0262
0.0258
0.0270
0.0275
0.0279
0.0278
0.0287
0.0297
0.0303
0.0305
0.0310
0.0304
0.0311
0.0313
0.0313
0.0314
0.0314

0.0261
0.0258
0.0232
0.0234
0.0242
0.0253
0.0252
0.0250
0.0259
0.0262
0.0278
0.0271
0.0284
0.0279
0.0287
0.0287
0.0287
0.0289
0.0290

base

0.0651
0.0657
0.0620
0.0616
0.0646
0.0655
0.0662
0.0660
0.0683
0.0694
0.0705
0.0709
0.0721
0.0713
0.0732
0.0734
0.0734
0.0736
0.0735

0.0647
0.0644
0.0561
0.0568
0.0581
0.0600
0.0607
0.0602
0.0621
0.0626
0.0655
0.0646
0.0669
0.0665
0.0681
0.0681
0.0686
0.0684
0.0686

IAE
cv
mean

0.0288
0.0281
0.0264
0.0261
0.0259
0.0270
0.0258
0.0259
0.0273
0.0261
0.0278
0.0270
0.0275
0.0285
0.0271
0.0283
0.0268
0.0278
0.0281

0.0238
0.0241
0.0176
0.0141
0.0159
0.0122
0.0129
0.0117
0.0150
0.0117
0.0132
0.0148
0.0137
0.0148
0.0146
0.0141
0.0142
0.0140
0.0148

R

16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484

16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484

dim(R*)

cv
hdm
mean (min, max)

# of Pos. Weights
base
hdm

cv
mean

11.0 (9, 13)
12.8 (9, 16)
15.0 (11, 17)
19.6 (13, 37)
18.0 (16, 20)
27.8 (20, 41)
27.2 (23, 31)
27.6 (19, 47)
28.8 (19, 45)
32.2 (21, 48)
35.0 (28, 49)
28.8 (23, 41)
35.2 (28, 46)
38.2 (22, 58)
33.0 (25, 45)
39.2 (30, 55)
34.6 (28, 41)
41.8 (25, 82)
36.6 (31, 39)

13.0 (11, 16)
18.2 (12, 22)
26.2 (26, 27)
24.8 (18, 33)
30.0 (17, 39)
25.2 (18, 29)
29.8 (23, 34)
28.8 (23, 34)
37.0 (27, 52)
32.4 (26, 37)
32.8 (31, 34)
63.8 (35, 95)
38.8 (35, 45)
39.4 (37, 47)
40.8 (33, 62)
46.2 (42, 52)
33.8 (31, 40)
57.8 (41, 97)
86.4 (42, 160)

7.4 (7, 8)
8.0 (6, 10)
9.8 (8, 11)
12.2 (10, 14)
12.0 (8, 14)
13.0 (11, 15)
12.8 (11, 14)
12.8 (10, 15)
13.4 (12, 16)
15.2 (11, 19)
14.0 (12, 15)
16.6 (11, 23)
18.4 (16, 22)
15.0 (11, 19)
17.2 (16, 19)
17.6 (13, 21)
16.0 (10, 20)
18.0 (13, 22)
18.6 (12, 22)

7.4 (7, 8)
9.2 (9, 10)
12.8 (12, 13)
13.2 (12, 16)
17.6 (14, 19)
14.8 (10, 19)
14.6 (14, 15)
15.4 (15, 17)
16.6 (15, 20)
17.0 (15, 18)
17.4 (16, 21)
17.8 (16, 19)
20.8 (20, 22)
20.2 (18, 25)
20.8 (17, 24)
25.8 (21, 28)
20.2 (17, 21)
20.8 (19, 24)
23.8 (20, 26)

6.21
6.58
7.57
7.92
7.88
7.99
8.01
8.10
8.2
8.59
8.66
8.65
8.74
8.77
8.44
8.60
8.54
8.62
8.86

6.14
6.51
8.66
8.66
9.14
9.13
9.45
9.49
9.61
9.60
9.47
9.69
9.74
9.82
9.58
9.58
9.94
9.64
9.72

5.6
6.12
6.71
7.01
7.06
7.68
7.37
7.16
7.18
7.43
7.62
7.25
7.46
7.66
7.55
7.47
7.31
7.71
7.46

9.27
10.51
12.47
13.44
16.01
15.86
17.07
16.18
18.10
17.01
16.70
20.65
17.60
16.64
17.32
16.64
15.52
17.91
19.10

4.16
4.33
5.45
5.88
5.96
6.10
6.12
5.63
5.86
5.97
5.90
6.08
6.32
5.89
6.03
6.07
5.78
6.04
5.93

6.72
7.66
10.09
10.48
12.67
10.39
10.25
9.95
11.56
11.47
11.50
10.26
11.45
11.50
11.9
11.57
11.37
12.17
11.37

hdm

N = 2,000
0.0299
0.0305
0.0290
0.0254
0.0235
0.0237
0.0236
0.0240
0.0246
0.0248
0.0250
0.0252
0.0252
0.0252
0.0247
0.0266
0.0259
0.0255
0.0254
N = 5,000
0.0295
0.0298
0.0205
0.0152
0.0148
0.0141
0.0160
0.0141
0.0137
0.0138
0.0151
0.0160
0.0163
0.0163
0.0156
0.0154
0.0162
0.0148
0.0147

79

Figure 3.1: ˆF(β1): Base vs. Post-cv Lasso (N=5,000, Mix 6, R = 16, ..., 49)

80

-10-5051000.10.20.30.40.50.60.70.80.91R = 16true CDFestCDF cvestCDF base-10-5051000.10.20.30.40.50.60.70.80.91R = 25-10-5051000.10.20.30.40.50.60.70.80.91R = 36-10-5051000.10.20.30.40.50.60.70.80.91R = 49Figure 3.2: ˆF(β1): Base vs. Post-hdm Lasso (N=5,000, Mix 6, R = 16, ..., 49)

81

-10-5051000.10.20.30.40.50.60.70.80.91R = 16true CDFestCDF hdmestCDF base-10-5051000.10.20.30.40.50.60.70.80.91R = 25-10-5051000.10.20.30.40.50.60.70.80.91R = 36-10-5051000.10.20.30.40.50.60.70.80.91R = 49Figure 3.3: ˆF(β1): Base vs. Post-cv Lasso (N=5,000, Mix 6, R = 81, ..., 144)

82

-10-5051000.10.20.30.40.50.60.70.80.91R = 81true CDFestCDF cvestCDF base-10-5051000.10.20.30.40.50.60.70.80.91R = 100-10-5051000.10.20.30.40.50.60.70.80.91R = 121-10-5051000.10.20.30.40.50.60.70.80.91R = 144Figure 3.4: ˆF(β1): Base vs. Post-hdm Lasso (N=5,000, Mix 6, R = 81, ..., 144)

83

-10-5051000.10.20.30.40.50.60.70.80.91R = 81true CDFestCDF hdmestCDF base-10-5051000.10.20.30.40.50.60.70.80.91R = 100-10-5051000.10.20.30.40.50.60.70.80.91R = 121-10-5051000.10.20.30.40.50.60.70.80.91R = 144Figure 3.5: ˆF(β1): Post-cv vs. Post-hdm Lasso (N=5,000, Mix 6, R = 169)

84

-10-8-6-4-2024681000.10.20.30.40.50.60.70.80.91base vs. cvtrue CDFestCDF cvestCDF base-10-8-6-4-2024681000.10.20.30.40.50.60.70.80.91base vs. hdmtrue CDFestCDF hdmestCDF baseFigure 3.6: ˆF(β1): Post-cv vs. Post-hdm Lasso (N=5,000, Mix 6, R = 529)

85

-10-8-6-4-2024681000.10.20.30.40.50.60.70.80.91base vs. cvtrue CDFestCDF cvestCDF base-10-8-6-4-2024681000.10.20.30.40.50.60.70.80.91base vs. hdmtrue CDFestCDF hdmestCDF baseFigure 3.7: True Joint Distributions of β1 and β2 (1) N=5,000, Mixture of Two Normals

Figure 3.8: True Joint Distributions of β1 and β2 (2) N=5,000, Mixture of Four Normals

Figure 3.9: True Joint Distributions of β1 and β2 (3) N=5,000, Mixture of Six Normals

86

APPENDICES

87

APPENDIX A

MACHINE LEARNING AND POST SELECTION INFERENCE

A.1 Penalized Regression: Lasso, Ridge, and Elastic Net

Shrinkage methods or regularized regressions set an additional constraint on magnitudes pa-
rameter estimates can take. If there is a situation where regression coeﬃcients can ’explode’ due to
multicollinearity one could employ one of shrinkage methods. Also, if there are irrelevant variables
it could ﬁlter out such variables by increasing the shrinkage parameter.

A basic LASSO (Least Absolute Shrinkage and Selection Operator, Tibshirani(1996)) formu-
lation could be stated as the following, where PRSS(βl1) represents penalized residual sum of
squares, where the shrinkage penalty on coeﬃcient values (β) is given by L1 metric;

minβPRSS(βl1) =

(yi − x(cid:48)

i β)2 + λ

| βj|

(A.1)

Ridge regression (Hoerl and Kennard (1970)) is a similar minimization but with L2 metric;

n

i=1

n

i=1

p

j=1

p

j=1

= (Y − X β)(cid:48)(Y − X β) + λ|| β||1

minβPRSS(βl2) =

(yi − x(cid:48)

i β)2 + λ

β2
j

= (Y − X β)(cid:48)(Y − X β) + λ|| β||2

∂PRSS(βl2)

∂ β

= −2X(cid:48)(Y − X β) + 2λβ

ˆβRidge = (X(cid:48)X + λIp)−1X(cid:48)Y

(A.2)

(A.3)

λ is the tuning parameter that determines the degree of shrinkage for both LASSO and ridge
regression problems. As λ approaches zero, the estimation gets closer to OLS (Ordinary Least
Squares), and as λ approaches to inﬁnity the model becomes an intercept only speciﬁcation.
Compared to ridge regression, LASSO tends to eliminate too many coeﬃcients and ridge tends to
leave too many variables.

88

Elastic net (Zou and Hastie (2005)) is a convex combination of LASSO and ridge that tries to

harmonize the two methods;

minβPRSS(βElasticNet) = (Y − X β)(cid:48)(Y − X β) + λ1|| β||1 + λ2|| β||2

(A.4)

Letting α =
regression is the following.

λ1+λ2

λ2

and t as an arbitrary positive real number, the solution to the elastic net

ˆβElasticNet = argminβ(Y − X β)(cid:48)(Y − X β)
s.t. (1 − α)|| β||1 + α|| β||2 ≤ t

(A.5)

(A.6)

glmnet package in R implements LASSO, ridge regression, and elastic net with n-folds cross
validation and RMSE (Root Mean Squared Error) criterion.

A.2 Gradient Boosting: Regression Tree Based Prediction

Given a response variable Y and predictors X = (x1, x2,..., xp) the decision tree picks a variable,
pinpoints a splitting value on the selected variable and splits the predictor space X recursively. Each
node contains a subset of observations for predictors and the response variable. Average value of
the response in each ﬁnal nodes is a tree model’s prediction on Y. Splitting process stops when
a loss function reaches a preset threshold. To improve prediction accuracy, a pruning process is
commonly applied after ﬁtting the tree model, F(X).

Loss function choice for categorical response with J classes is Gini impurity measure IG(p),
where pj is the probability of predicting class j correctly and 1− pj is the probability of predicting
class j with a wrong class at each node.
In the categorical case, the decision tree is called
classiﬁcation tree I used for the review sentiment classiﬁcation.

J

pj(1 − pj) = 1 − J

j=1

j=1

89

IG(p) =

p2
j

(A.7)

If the response is a continuous numeric variable, now it is a regression tree. One common choice
of loss function for a regression tree is RMSE (Root Mean Square Error).

RMSE(Y, F(X)) =

(yi − F(xi))2

(A.8)

(cid:118)(cid:117)(cid:116) 1

n

n

i

At each node, the split variable and value are determined to minimize the resulting RMSE.

However, regression tree has its own weaknesses. Though a high level of prediction accuracy
could be achieved, the resulting tree structure could be too complicated (’Bushy’). Then inter-
pretation of the ﬁtted model becomes nearly impossible and the model yields poor out of sample
prediction performances. Also, if there is one variable that has a particularly strong correlation
with the response, the splitting process is concentrated on the variable leading to biased estimates.
To deal with the weaknesses of plain regression tree, practitioners use model averaging tech-
niques. First averaging method is ’Bagging’ (Breiman (1996)). Bagging ﬁts many trees on
bootstrapped subsets of training data, and predicts outcome by majority vote from the estimated
tree models. Second is ’Random Forests’, a reﬁned bagging approach (Breiman (2001)). Random
forests method uses the same bootstrapped samples, but for each tree, a random sample of m(< p)
predictors is drawn and only those m features are used in the ﬁtting processes. It tries to improve
on bagging by de-correlating each trees.

This paper uses the third averaging technique, namely ’Gradient Boosting’ (Friedman (2001)).

L

l=1

ρhl(X)

(A.9)

Basic formulation can be stated as; (cid:98)Y = F(X) +
tree trained on the residuals from F(X) +l−1

F(X) can be an initial ﬁtted tree with predictors X. hl(X) is called a ’Weak Learner’, another
l=1 hl(X). Speciﬁcally, for the initial model F(X), the
residuals is Y − F(X). Then h1(X) sets the residuals as a new response variable and trains another
tree. The shrinkage parameter ρ is set low enough so that there would not be an overﬁtting problem.
The reason why this approach is called as gradient boosting is from the fact that it uses residuals.
Speciﬁcally, if we set an RMSE loss function, our optimization problem will be

90

minF(xi)J =
Treating F(xi) as parameters, the derivative is,

∂i(yi − F(xi))2

∂F(xi)

∂J
∂F(xi) =

n

i=1

(yi − F(xi))2

∂(yi − F(xi))2

∂F(xi)

=

= 2(F(xi) − yi)

(A.10)

(A.11)

Gradient descent optimization minimizes a function by moving the function in the opposite direction
of the gradient, in this case − ∂J

∂F(xi) = yi − F(xi).

A practitioner can implement gradient boosting ﬁtting procedure using the gbm package in R,

also using n-folds cross validation and RMSE criterion function.

A.3 Exact Inference for OLS Estimates after Lasso Selection

The post selection inference approach used in this paper is directly from Lee, L. Sun, Sun, and
Taylor (2016), implemented by R package ’selectiveInference’. Exact inference for regression models
after statistical/machine learning is an actively developing area and interested readers could beneﬁt
a lot from the recent literature written by the pioneers of machine learning in statistics (Lockhart,
Taylor, J. Tibshirani, and Tibshirani (2014), J. Tibshirani, Taylor, Lockhart, and Tibshirani (2016),
and Taylor and Tibshirani (2017)).

This appendix introduces a brief outline and compare two conﬁdence intervals (C.Is) obtained
from the classical OLS method and post selection inference of Lee et al. (2016). For a typical OLS
regression, the objective y follows a multivariate normal distribution.

y ∼ N(µ, σ2In)

(A.12)

where µ is the mean vector modeled as a linear combination of p predictors x1, ..., xp ∈ Rn and σ
is the standard error. The primary goal is to get an exact distribution of coeﬃcients obtained from
OLS conducted only with the selected variables by LASSO or model M.

βM = argminbM E||y − XM bM||2 = X +

M µ = (XT

M XM)−1XT
M µ

(A.13)

91

The LASSO selection event in fact implies one get the variables with non-zero coeﬃcients and
corresponding signs. The event of selecting the observed model ˆM and signs ˆs i.e., { ˆM = M, ˆs = s}
can be described by a polyhedron in the form of {Ay ≤ b}.

(A.14)

(cid:170)(cid:174)(cid:174)(cid:172)

−diag(s)(XT

M XM)−1XT

M

b0(M, s)
−λdiag(s)(XT

M XM)−1s

(cid:170)(cid:174)(cid:174)(cid:172)

{ ˆM = M, ˆs = s} = {A(M, s)y ≤ b(M, s)}
A0(M, s)

A1(M, s)

(cid:170)(cid:174)(cid:174)(cid:172) =(cid:169)(cid:173)(cid:173)(cid:171)
A(M, s) =(cid:169)(cid:173)(cid:173)(cid:171) A0(M, s)
(cid:170)(cid:174)(cid:174)(cid:172) =(cid:169)(cid:173)(cid:173)(cid:171)
b(M, s) =(cid:169)(cid:173)(cid:173)(cid:171) b0(M, s)
(cid:169)(cid:173)(cid:173)(cid:171) XT−M(I − PM)
b0(M, s) =(cid:169)(cid:173)(cid:173)(cid:171) 1 − XT−M(XT

−XT−M(I − PM)
M)+s
M)+s

1 + XT−M(XT

A0(M, s) =

b1(M, s)

1
λ

(cid:170)(cid:174)(cid:174)(cid:172)

(cid:170)(cid:174)(cid:174)(cid:172)

where the subscript −M represents variables of zero-coeﬃcients in the LASSO selector, λ is the
penalty parameter, PM is the projection matrix toward the vector space of the selected variables,
and diag(s) is a diagonal matrix with the elements of s.

The next step is to get an exact distribution of individual coeﬃcients βM
j

, conditional on the
model selection event {Ay ≤ b}. First the authors establish the conditional distribution of a generic
linear transformation of the objective y: ηT y|{Ay ≤ b}. With the choice of η = (X +
M)T ej, one gets
M µ = ηT µ. The selection event {Ay ≤ b} can once again
the conditional distribution of βM
j
be transformed into an interval of residuals from projecting y onto the direction of η.

j X +

= eT

+(z), ν0(z) ≥ 0}

(A.15)

{Ay ≤ b} = {ν
−(z) = max
j:(Ac)j <0

+(z) = min

j:(Ac)j >0

ν

ν

ν0(z) = min
j:(Acj)=0

−(z) ≤ ηT y ≤ ν
bj − (Az)j
(Ac)j
bj − (Az)j
(Ac)j
bj − (Az)j

92

where A and b are as deﬁned in the previous page, z = (In − Pη)y is the residual with the projection
matrix Pn onto the direction of η, and c = η(ηT η)−1. Notice that z being residual, is independent
of ηT y and hence the LASSO selection event does not incur any complication to produce the
conditional distribution but just imposes upper and lower limits on ηT y.

Hence, the distribution of ηT y conditional on the model selection is a truncated normal.

[ηT y|Ay ≤ b, z = z0] ∼ T N(ηT µ, σ2||η||2, ν

−(z0), ν

+(z0))

(A.16)

where z0 is a realization of residual, and equation (A.16) is true for any z0 because of the in-
dependence. The cumulative density F
is monotone decreasing in ηT µ or in our
which gives the conﬁdence interval [L, U] with L and U are deﬁned as
speciﬁc interst, in βM
j
[ν−
s (z), ν
to achieve a signiﬁcance level α. Thus,
F
L, σ2||η||

(βM
j ) = 1 − α

[ν−
s (z), ν
+
ηT µ, ηT η

and F

s (z)]

2

+

s (z)]

+

s (z)]

(βM
j ) = α

[ν−
s (z), ν
U, σ2||η||
j ∈ [L, U]| ˆM = M, ˆs = s] = 1 − α

2

P[βM

(A.17)

93

APPENDIX B

OMITTED DETAILS FOR CHAPTER 1

B.1 Exact Inference for Post Lasso Estimates

Table B.1 reports the C.Is obtained by both OLS and OLS post LASSO. For almost all variables,
the C.Is from truncated normal essentially reproduce those of OLS but they are slightly wider
reﬂecting the changes in density due to truncation. There are two exceptions to this.

First is when the signal of a variable is weak. Then parameter estimates could be near to one
of the truncation endpoints, giving much wider intervals than OLS. It is the case of seller text
variables, and the ratio [L,U]postL ASSO
17.87 and 4.57 for ’Positive Adjectives’ and ’Location
Words’, while the average for others (except for ’Entire Home/Apt’) is 1.09.

[L,U]OLS

Second is when the signal is ’too strong’. For ’Entire Home/Apt’, the Z-score is 173. Then the
lower end of truncation is very high and almost every value above it satisﬁes the signiﬁcance level.
The R package in this case produces the output ’inf’.

94

Table B.1: Conﬁdence Intervals for OLS post Lasso

5% C.Is

obs: 75,236

OLS
C.Is

Exact Truncated Normal

C.Is

L

U

L

U

Tail Areas

L

U

Quality Certification

Superhost

Veriﬁcation Accounts

Review Scores

Cleanliness
Location
Value

Review Texts
Negative Reviews
Positive Phrases

Seller Texts

Positive Adjectives
Location Phrases

Accommodation Capacities

0.044510
0.014836

0.060512
0.019844

0.044440
0.014821

0.060600
0.020556

0.024014
0.024218

0.023888
0.024826

0.045786
0.133259
-0.096141

0.052211
0.140423
-0.087599

0.045766
0.133232
-0.096158

0.053591
0.140430
-0.087576

0.024051
0.024261
0.024626

0.024403
0.024782
0.024383

-0.008115
0.005995

-0.006459
0.008099

-0.008118
0.005986

-0.006454
0.008527

0.024613
0.023892

0.024319
0.024603

-0.000550
0.000841

0.000587
0.001604

-0.020134
0.000927

0.000175
0.004413

0.024976
0.024842

0.024988
0.024967

Default Guests

Bathrooms
Bedrooms

Beds

Included Guests

0.053228
0.095556
0.104912
-0.023948
0.021074

0.059414
0.111882
0.116032
-0.014518
0.027366

0.053214
0.095064
0.104866
-0.024029
0.021046

0.059429
0.111892
0.116618
-0.014506
0.027368

0.024489
0.024449
0.024066
0.024033
0.024016

0.024443
0.024862
0.024889
0.024716
0.024923

Amenity and Service

Air Conditioner

Buzzer Wireless Intercomm

Cable TV

Free Parking on Street

Indoor Fire Place

Lock on Bedroom Door

Cats Allowed

Internet
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

0.118190
0.082411
0.094420
-0.129221
0.109043
-0.076620
-0.094924
0.018999
0.033656

0.134485
0.093107
0.105542
-0.112493
0.135648
-0.058717
-0.074257
0.032768
0.044654

0.118118
0.082432
0.090869
-0.129269
0.108097
-0.076746
-0.094945
0.018958
0.033637

0.134490
0.096908
0.105517
-0.111331
0.135658
-0.058666
-0.072997
0.036304
0.045603

0.024008
0.024692
0.024627
0.024326
0.024230
0.024509
0.024770
0.023992
0.024591

0.024931
0.024774
0.024924
0.024521
0.024907
0.024355
0.024382
0.024511
0.024570

0.583494
-0.185973

0.596795
-0.142035

0.339306
-0.186067

inf

0

0

-0.141924

0.024510

0.024422

95

B.2 Manhattan vs. Other Neighborhoods

Table B.2: GMM: Manhattan and Other Neighborhoods

obj : log(pit)

Quality Certification

Superhost

Veriﬁcation Accounts

Manhattan

Fixed Eﬀects

GMM

Other Neighborhoods
GMM

Fixed Eﬀects

0.0116***
(0.0030)
0.0005
(0.0010)

0.0124***
(0.0030)
-0.0002
(0.0011)

0.0086***
(0.0027)
0.0034***
(0.0010)

0.0084***
(0.0027)
0.0061***
(0.0011)

Review Scores

Cleanliness

Location

Value

Review Texts
Negative Reviews

Positive Phrases

Seller Texts

Positive Adjectives

Location Phrases

Constant

ρ : rho

-0.0025
(0.0025)
0.0059*
(0.0034)
-0.0013
(0.0028)

-0.0003
(0.0030)
0.0061
(0.0042)
-0.0059*
(0.0033)

0.0007
(0.0026)
0.0044*
(0.0027)
-0.0061**
(0.0028)

0.0022
(0.0030)
0.0087***
(0.0031)
-0.0065**
(0.0031)

-0.0007
(0.0006)
0.0045***
(0.0009)

-0.0025***
(0.0006)
0.0042***
(0.0009)

0.0006
(0.0006)
0.0029***
(0.0008)

-0.0022***
(0.0006)
0.0031***
(0.0007)

-0.0019***
(0.0007)
0.0023***
(0.0004)

0.0005
(0.0007)
0.0020***
(0.0004)

-0.0014
(0.0011)
0.0017***
(0.0006)

0.1875***
(0.0118)
0.9631***
(0.0024)

0.0002
(0.0009)
0.0010*
(0.0005)

0.1616***
(0.0115)
0.9667***
(0.0025)

***: 1% signiﬁcant, **: 5% ,*: 10%, standard errors in parentheses

Number of Obs

17,687

19,931

96

Table B.3: GMM: Manhattan and Other Neighborhoods (Continued from Table B.2)

obj : log(pit)

Manhattan

Fixed Eﬀects

GMM

Other Neighborhoods
GMM

Fixed Eﬀects

Accommodation Capacities

Default Guests

Bathrooms

Bedrooms

Beds

Included Guests

Amenity and Service

Air Conditioner

Buzzer Wireless Intercomm

Cable TV

Free Parking

Indoor Fire Place

Lock on Bedroom Door

Cats Allowed

Internet

Shampoo

Room Type

Entire Home/Apt

Shared Room

0.0275***
(0.0026)
0.0401***
(0.0136)
0.0383***
(0.0063)
0.0245***
(0.0039)
0.0163***
(0.0029)

0.0060
(0.0081)
0.0104*
(0.0062)
0.0011
(0.0061)
0.0014
(0.0108)
-0.0464**
(0.0183)
0.0119
(0.0075)
-0.0496***
(0.0138)
-0.0011
(0.0066)
0.0019
(0.0051)

0.0254***
(0.0048)
0.0602***
(0.0189)
0.0279***
(0.0106)
0.0260***
(0.0062)
0.0152***
(0.0046)

-0.0043
(0.0116)
0.0121
(0.0077)
0.0090
(0.0071)
-0.0020
(0.0138)
-0.0234
(0.0216)
-0.0047
(0.0101)
-0.0397*
(0.0218)
-0.0043
(0.0082)
0.0077
(0.0066)

0.0268***
(0.0024)
0.0051
(0.0094)
0.0460***
(0.0056)
0.0051
(0.0034)
0.0150***
(0.0024)

-0.0025
(0.0061)
0.0145**
(0.0067)
-0.0025
(0.0058)
-0.0088
(0.0067)
0.0745***
(0.0180)
-0.0124**
(0.0060)
-0.0233**
(0.0092)
-0.0038
(0.0057)
-0.0072
(0.0047)

0.0329***
(0.0043)
0.0135
(0.0148)
0.0471***
(0.0105)
0.0099**
(0.0049)
0.0149***
(0.0036)

-0.0095
(0.0086)
0.0118
(0.0093)
0.0068
(0.0071)
-0.0067
(0.0089)
0.0749***
(0.0217)
-0.0229***
(0.0073)
-0.0115
(0.0148)
0.0027
(0.0063)
-0.0031
(0.0057)

0.1151***
(0.0085)
-0.0929***
(0.0215)

0.1305***
(0.0152)
-0.0571
(0.0476)

0.1828***
(0.0085)
-0.0060
(0.0228)

0.1991***
(0.0164)
0.0549
(0.0602)

97

B.3 Annual Variations in Attributes

Table B.4: Summary Statistics for Variables and Annual Variations

201712
Mean

201612
Mean

184.6176

181.8856

Diﬀerence (201712-201612)
Mean
2.7319

Min Max
-550
700

S.D.

31.0575

-1
-3

-4
-4
-4

-24
-38

-24
-29

-14
-2.5
-6
-10
-11

-1
-1
-1
-1
-1
-1
-1
-1
-1

-1
-1

1
8

6
7
7

19
14

20
47

14
3
4
11
13

1
1
1
1
1
1
1
1
1

1
1

obs: 37,618

Price ($)

Quality Certification

Superhost Indicator
Veriﬁcation Accounts

Review Scores

Cleanliness
Location
Value

Review Text

Negative Reviews
Positive Phrases

Seller Text

Positive Adjectives
Location Phrases

0.1761
4.4649

0.1003
4.2075

0.0758
0.2574

0.3842
1.0407

9.2026
9.3968
9.3077

9.1913
9.3492
9.2816

0.0113
0.0476
0.0262

0.4523
0.3782
0.4186

4.8731
3.5182

3.4239
2.5320

1.4492
0.9862

2.1540
1.6366

6.6013
10.6401

6.3382
9.6204

0.2631
1.0196

1.5587
2.6019

Accommodation Capacites

Default Guests

Bedrooms
Bathrooms

Beds

Guests Included

Amenity and Service

Air Conditioning

Buzzer Wireless Intercom

Cable TV
Free Parking

Indoor Fire Place

Lock on Bedroom Door

Cats Allowed

Internet
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

2.9284
1.1092
1.1567
1.5891
1.6164

0.8842
0.5495
0.3688
0.1103
0.0396
0.1161
0.0675
0.8133
0.6231

0.5582
0.0141

-0.0002
0.0001
0.0091
0.0182
0.0462

0.0150
0.0015
0.0016
-0.0026
-0.0004
0.0215
-0.0009
-0.0060
0.0238

0.5111
0.1007
0.2060
0.3436
0.4377

0.1589
0.1753
0.1824
0.1350
0.0597
0.1633
0.1004
0.1860
0.2213

0.0003
-0.0004

0.1369
0.0497

2.9287
1.1092
1.1476
1.5709
1.5702

0.8692
0.5480
0.3672
0.1128
0.0400
0.0945
0.0684
0.8193
0.5992

0.5579
0.0145

98

B.4 Evidence for Relevance of Lagged Instruments

Table B.5: Relevance Tests for Lagged Instruments
R Squared Wald F (24, 37,618)

obs: 37,618

Quality Certification

Superhost Indicator
Veriﬁcation Accounts

Review Scores

Cleanliness
Location
Value

Review Text

Negative Reviews
Positive Phrases

Seller Text

Positive Adjectives
Location Phrases

0.2623
0.2737

0.8487
0.8180
0.8023

0.9404
0.9363

0.9083
0.8877

13.81
9.83

19.55
22.28
23.29

44.99
13.49

10.71
28.15

GMM method in Chapter 1 assumes that all of the attributes in Xit could be endogenous with
the error ηit. Further lagged observations for attributes Xit−2 are proposed as instruments. Xit−2
contains rental unit attributes recorded three to four months earlier than those in Xit−1 recorded in
2016. For example, for a rental i appearing in June, 2017 and 2016, xikt−2 is the k-th attribute
recorded in Febrary, 2016. Though GMM implementation does not require an explicit ﬁrst stage
regression, it is important that Xit−2 satisﬁes the relevance condition.

Table B.5 provides evidence of relevance for each attribute xikt ∈ Xit, from a joint signiﬁcance
test for hypothesis (B.2) with Zit = (xi1t−2, ..., xikt−2, ..., xiKt−2) and the corresponding coeﬃcients
vector Γ = (γ1, ... , γK). The regressor vector X−k
includes attributes xikt’s except for xikt, and
Xikt−1 includes xikt−1’s for all k.

it

xikt = c + ΓZit + θplog(pit−1) + Θ1X−k
H0 : γ1 = γ2 = · · · = γ25 = 0

it + Θ2Xit−1

(B.1)

(B.2)

99

The degrees of freedom for F-statistic is 25(=K) and 37,618, and the corresponding p-values
for each variable are less than 0.0001. Table B.5 shows the relevance conditions for information
variables, and the results are similar for amenity and service features.

100

APPENDIX C

OMITTED DETAILS FOR CHAPTER 2

C.1 Exact Inference for Post Lasso Estimates

Table C.1, C.2, and C.3 report the C.Is produced by classical OLS and exact post selection
inference for OLS, IV and two level nested logit models, respectively. As a reminder, the ﬁrst step
LASSO selection was conducted separately on datasets for each estimation methods to see if the
model selection varies due to endogeneity. For OLS logit, the LASSO selector chose variables
from the dataset of all attributes and price, pjt. For IV logit pjt was replaced with the instrumented
ln(sj|gt) was included also with ˆpjt. The LASSO selector,
price ˆpjt. For two level nested logit,
ln(sh|gt) omitted a few key variables, including σ2
for three level nested logit with
ˆ
for the regional level correlated preferences. It shows the selection results are susceptible and vary
over endogeneity controls or econometric modeling choices.

ln(sj|hgt) and

ˆ

ˆ

C.Is from the exact inference reﬂecting the re-normalization of density due to truncation or,
additional uncertainty due to LASSO selection are slightly wider than C.Is from OLS inference,
but they essentially reproduce them for most of the variables, except for two cases.

The ﬁrst case is when the signal of a variable is ’weak’, or the correlation between the objective
and covariate is small. Then either the variable is not chosen, or the parameter estimate could
be near to one of the truncation endpoints, giving much wider intervals than OLS. For example,
’Location Words’ was not selected by LASSO for OLS logit case. But it was selected in all the
other cases, conﬁrming the suspicion of instability of selection results due to endogeneity. ’Shared
Room’ showed slightly weak signals in both OLS and IV logit datasets, with [L,U]E xact
[L,U]OLS
and 1.1557 where the averages for others were about 1.0050.

= 1.3535

’Instant Bookable’ turn out to be insigniﬁcant or a weak signal in the two level nested logit case,
ln(sj|gt) and ˆpjt. [L,U]E xact
for ’Instant Bookable’ is 1.8376 where the average for others is
[L,U]OLS

ˆ

with
also about 1.0050.

101

The second case is when the signal is ’too strong’, which is the case for the two level nesting
parameter σ with a Z-score of 99.67. Then the lower endpoint of truncation is set too high
(0.73), and any value above it satisﬁes the signiﬁcance level α = 0.05. In this case, R package
selectiveInference produces ’inf’ for the upper bound of the C.Is.

Table C.1: C.Is for OLS Logit, Classical vs. Post Selection Inference

Obs: 62,673

5% C.Is
Price

OLS

L

U

Exact Post LASSO

L

U

Tail Areas

L

U

-0.000852

-0.000697

-0.000853

-0.000697

0.024498

0.024434

Quality Certifications

Superhost Indicator
Veriﬁcation Accounts

Consumer Review
Ratings Average

Number of Reviews
Negative Reviews

Seller Texts

Positive Adjectives
Location Phrases

Accommodation Capacities

0.190012
0.040741

0.236667
0.055015

0.189803
0.040685

0.236677
0.055027

0.023989
0.024116

0.024951
0.024819

0.180669
0.016606
-0.041592

0.206865
0.017638
-0.034697

0.180619
0.016606
-0.041624

0.206938
0.017643
-0.034695

0.024568
0.024950
0.023977

0.024364
0.023989
0.024963

-0.016798

-0.008212

-0.016828

-0.008202

0.024202

0.024739

(Not chosen by the ﬁrst step LASSO selector)

Default Guests

Bathrooms

Additional Guests
Instant Bookable

0.089439
0.037290
-0.005437
0.461827

0.104727
0.083423
0.013559
0.499565

0.089414
0.037103
-0.023288
0.461756

0.107401
0.083452
0.013201
0.499671

0.024505
0.024192
0.024951
0.024571

0.024389
0.024852
0.024513
0.024361

Amenity and Service

24 Hour Check-In

Hangers
Heating
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

0.114049
0.127225
0.033329
0.126503

0.149457
0.160103
0.104381
0.158248

0.114005
0.127082
0.030885
0.126501

0.149579
0.160115
0.104490
0.158395

0.024715
0.024023
0.024740
0.024984

0.024218
0.024915
0.024649
0.023957

-0.102712
0.157776

-0.064644
0.245300

-0.102753
0.157417

-0.064505
0.245352

0.024756
0.024074

0.024179
0.024863

102

Table C.2: C.Is for IV Logit, Classical vs. Post Selection Inference

Obs: 62,673

5% C.Is

2SLS

L

U

Exact Post Selection

L

U

Tail Areas

L

U

Price (Instrumented)

Quality Certifications

Superhost Indicator
Veriﬁcation Accounts

Consumer Review
Ratings Average

Number of Reviews
Negative Reviews

Seller Texts

Positive Adjectives
Location Phrases

Accommodation Capacities

-0.015932

-0.015179

-0.015936

-0.015175

0.023905

0.023868

0.315313
0.077299

0.360276
0.091051

0.315160
0.077279

0.360333
0.091096

0.024226
0.024674

0.024708
0.024259

0.385672
0.015625
-0.041871

0.412724
0.016612
-0.035289

0.385632
0.015622
-0.041886

0.412811
0.016613
-0.035274

0.024660
0.024174
0.024456

0.024273
0.024760
0.024476

-0.063586
0.006822

-0.054530
0.013240

-0.063620
0.006799

-0.054521
0.013247

0.024171
0.024193

0.024764
0.024744

Default Guests

Bathrooms

Additional Guests
Instant Bookable

0.366664
0.695131
0.162532
0.314323

0.386868
0.750209
0.182525
0.351105

0.366592
0.694847
0.162514
0.314310

0.386891
0.750469
0.182601
0.351265

0.024193
0.023839
0.024797
0.024917

0.024741
0.023933
0.024139
0.024021

Amenity and Service

24 Hour Check-In

Hangers
Heating
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

0.159904
0.125852
0.171242
0.213911

0.193794
0.157232
0.239436
0.244531

0.159750
0.125703
0.170930
0.213832

0.193799
0.157394
0.239444
0.244595

0.023976
0.023932
0.023969
0.024407

0.024965
0.023840
0.024972
0.024525

1.372216
-0.149253

1.455404
-0.064283

1.372104
-0.149284

1.455681
-0.034280

0.024689
0.024190

0.024244
0.024747

103

Table C.3: C.Is for Two Level Nested Logit, Classical vs. Post Selection Inference
Tail Areas

Exact Post Selection

Obs: 62,673

2SLS

5% C.Is

L

U

L

U

L

U

Price (Instrumented)
Nesting Parameters

σ(σ1)

Quality Certifications

Superhost Indicator
Veriﬁcation Accounts

Consumer Review
Ratings Average

Number of Reviews
Negative Reviews

Seller Texts

Positive Adjectives
Location Phrases

Accommodation Capacities

-0.006324

-0.005529

-0.006327

-0.005527

0.024296

0.024636

0.735739

0.765255

0.735592

inf

0.023852

0

0.043312
0.059870

0.086441
0.072666

0.043220
0.059849

0.086551
0.072705

0.024708
0.024623

0.024418
0.024309

0.108272
0.004012
-0.014863

0.135667
0.005035
-0.008658

0.108169
0.004010
-0.014875

0.135693
0.005038
-0.008641

0.024167
0.024532
0.024561

0.024784
0.024399
0.024370

-0.039552
0.013800

-0.031086
0.019769

-0.039595
0.013779

-0.031045
0.019776

0.023859
0.024211

0.023913
0.024723

Default Guests

Bathrooms

Additional Guests
Instant Bookable

0.091769
0.284585
0.061140
-0.005903

0.113417
0.338250
0.080142
0.030520

0.091673
0.284530
0.061081
-0.036878

0.113422
0.338447
0.080172
0.030052

0.023998
0.024765
0.024298
0.024883

0.024941
0.024170
0.024635
0.023929

Amenity and Service

24 Hour Check-In

Hangers
Heating
Shampoo

(Room Type)

Entire Home/Apt

Shared Room

0.040703
0.019869
0.072892
0.060668

0.072543
0.049325
0.136373
0.089754

0.040690
0.019704
0.072865
0.060639

0.072679
0.049452
0.136643
0.089861

0.024902
0.023948
0.024932
0.024771

0.024036
0.024029
0.024039
0.024164

0.684688
-0.115604

0.766577
-0.036653

0.684392
-0.115694

0.766664
-0.024454

0.024180
0.024643

0.024754
0.024571

104

C.2 First Stage Regression for Prices and Group Shares

Table C.4: First Stage Regression Results

Obs: 62,673 / Obj:
Lagged Base Price

(pjt−1 − CleaningFeejt−1)
Business Starting Date

Dates Last Scraped

Availabilities

30 Days

60 Days

90 Days

365 Days

Reviews per Month

Cancellation Policy

Moderate

R Squared

pjt

0.016891***
(0.000805)
-0.004411***
(0.000445)
-0.011947***
(0.002884)

0.657721***
(0.085911)
0.015415
(0.087163)
0.083191*
(0.046007)
0.024166***
(0.003571)
-8.706140***
(0.332901)

ln(sj|gt)
0.000001
(0.000009)
0.000053***
(0.000005)
-0.000615***
(0.000032)

0.002714***
(0.000947)
-0.006876***
(0.000961)
0.006025***
(0.000507)
0.000299***
(0.000039)
0.401024***
(0.003670)

ln(sj|hgt)
-0.000007
(0.000014)
0.000041***
(0.000007)
-0.000632***
(0.000048)

0.004588***
(0.001439)
-0.007233***
(0.001460)
0.004392***
(0.000771)
-0.000059
(0.000060)
0.400606***
(0.005575)

ln(sh|gt)
0.000008
(0.000013)
0.000013*
(0.000007)
0.000018
(0.000046)

-0.001874
(0.001380)
0.000357
(0.001400)
0.001632**
(0.000739)
0.000358***
(0.000057)
0.000418
(0.005347)

-8.146719***
(0.801306)

0.043266***
(0.008833)

0.035425***
(0.013420)

0.007842
(0.012870)

0.18
680.71

0.01
21.99

Wald F Statistic: F(9, 62673)
**: 5% signiﬁcant, ***: 1%, and standard errors in parentheses

1672.05

0.43
279.57

0.36

To instrument price pjt and group shares, variables reﬂecting supply side decisions were chosen
as the ﬁrst stage regressors. Table C.4 reports the coeﬃcients from the ﬁrst stage regressions (only
for instruments). To test the relevance condition of IVs, I report the F statistics for the following
hypothesis for each objective variable. For price, it would be

pjt = constant + zjt γp + xjt θp + jt
H0 : γp1 = γp2 = · · · = γp9 = 0

(C.1)

’Lagged Base Price’ means per night rental price minus ’Cleaning Fee’ recorded one year before
for all rental units in the dataset. One important fact is that it indirectly reﬂects hosts’ decisions

105

to impose or remove the ’Cleaning Fee’. Also, there is a big diﬀerence in the means (pjt −
(pjt−1 − CleaningFeejt−1) = −$159.1783) with a standard deviation of $455.2833 implying
heavy adjustments in prices and ’Cleaning Fee’ imposing decisions. Such big changes reﬂect the
tendency of new Airbnb hosts setting high prices at the start of business and constantly decreasing
prices as they face low demand due to heavy competition.

’Business Starting Date’ is the date a host started Airbnb hosting, and ’Dates Last Scraped’ is
the date InsideAirbnb.com last recorded the data on the host. Both are in the format of cumulative
days from a starting point. For example, if a host started hosting on Jan 4, 2016, then ’Business
Starting Date’ is the cumulative days since Jan 1, 1900.

’Availability’ variables represent how many consecutive days a host or a rental unit can provide.
For example, the variable ’60 Days’ ranges from 0 to 60. If it is 40, then a potential guest can
make a reservation request with a maximum length of 40 days. Deciding such lengths of maximum
nights is mostly in the hands of rental hosts, unlike hotels.

’Moderate’ cancellation policy contains a host’s policy on refunds, reservation modiﬁcations
or cancellation. There are six more categories including ’Flexible’, ’Strict’, and ’Long Term’.
’Reviews per Month’ is not the cumulative number of reviews divided by months of operation up
to date, but it is the average number of reviews received in a month when data scraping occurred.
It was included to indirectly capture an exogenous variation in supply decisions, given that Airbnb
hosts can set their business days as ﬂexibly as imaginable.

One reason for not using BLP instruments (the isolation measurej(cid:44)r xjk) or, product charac-

teristic IVs is that when the market is large or ’thick’ meaning there are too many close substitutes,
the identifying power of them could be in doubt. In this case, cost shifters or supply side instruments
could provide a better identifying power (Armstrong (2016)).

106

C.3 NYC Service Neighborhoods

Table C.5: NYC Airbnb Service Neighborhoods

Brooklyn
Bath Beach
Bay Ridge

Bedford-Stuyvesant

Bensonhurst
Bergen Beach
Boerum Hill
Borough Park
Brighton Beach
Brooklyn Heights

Brownsville
Bushwick
Canarsie

Carroll Gardens

Clinton Hill
Cobble Hill
Columbia St
Coney Island
Crown Heights
Cypress Hills

Manhattan

Battery Park City

Chelsea
Chinatown
Civic Center
East Harlem
East Village

Financial District
Flatiron District

Gramercy

Queens
Arverne
Astoria

Bay Terrace

Bayside
Bayswater
Belle Harbor

Bellerose

Breezy Point
Briarwood

Greenwich Village

Cambria Heights

Harlem

Hell’s Kitchen

Inwood
Kips Bay
Little Italy

Lower East Side

Marble Hill
Midtown

Morningside Heights

Staten Island
Arden Heights

Arrochar
Bay Terrace
Bull’s Head

Castleton Corners

Charleston
Chelsea
Clifton
Concord

Dongan Hills
Eltingville
Emerson Hill
Fort Wadsworth

Graniteville
Great Kills
Grymes Hill
Howland Hook

Huguenot

Lighthouse Hill
Mariners Harbor
Midland Beach
New Brighton

New Dorp

New Dorp Beach
New Springville

Oakwood

Port Richmond
Prince’s Bay
Randall Manor
Richmondtown

Rosebank
Rossville
Shore Acres
Silver Lake
South Beach
St. George
Stapleton
Todt Hill

Tompkinsville

College Point

Corona

Ditmars Steinway

Douglaston
East Elmhurst

Edgemere
Elmhurst

Far Rockaway

Flushing
Forest Hills

Fresh Meadows

Glen Oaks
Glendale
Hollis

Hollis Hills
Holliswood
Howard Beach
Jackson Heights

Jamaica

Jamaica Estates
Jamaica Hills
Kew Gardens

Kew Gardens Hills

Laurelton
Little Neck

Long Island City

Maspeth

Middle Village

Neponsit

Bronx
Allerton
Baychester
Belmont
Bronxdale
Castle Hill
City Island

Claremont Village

Clason Point
Concourse

Concourse Village

Co-op City
Country Club
East Morrisania

Eastchester
Edenwald
Fieldston
Fordham
Highbridge
Hunts Point
Kingsbridge
Longwood
Melrose

Morris Heights
Morris Park
Morrisania
Mott Haven
Mount Eden
Mount Hope

North Riverdale

Norwood
Olinville
Parkchester
Pelham Bay

Downtown Brooklyn

Murray Hill

DUMBO

Dyker Heights
East Flatbush
East New York

Flatbush
Flatlands
Fort Greene
Fort Hamilton
Gerritsen Beach

Gowanus
Gravesend
Greenpoint
Kensington

NoHo
Nolita

Roosevelt Island

SoHo

Stuyvesant Town
Theater District

Tribeca

Two Bridges

Upper East Side
Upper West Side

Washington Heights

West Village

Pelham Gardens

Manhattan Beach

Port Morris
Riverdale

Schuylerville
Soundview

Midwood
Mill Basin
Navy Yard
Park Slope

Spuyten Duyvil

Prospect Heights

107

Table C.6: NYC Airbnb Service Neighborhoods (Continued from Table C.5)

Staten Island
Tottenville

West Brighton
Westerleigh
Willowbrook

Woodrow

Bronx

Brooklyn

Manhattan

Queens

Throgs Neck

Prospect-Leﬀerts Gardens

Tremont
Unionport

University Heights

Van Nest
Wakeﬁeld
West Farms

Westchester Square

Williamsbridge

Woodlawn

Red Hook
Sea Gate

Sheepshead Bay

South Slope
Sunset Park
Vinegar Hill
Williamsburg
Windsor Terrace

Ozone Park

Queens Village

Rego Park

Richmond Hill

Ridgewood

Rockaway Beach

Rosedale

South Ozone Park
Springﬁeld Gardens

St. Albans
Sunnyside
Whitestone
Woodhaven
Woodside

Subtotals

49

48

32

53

44

108

C.4 Evidence for Review Rating Inﬂation

Figure C.1: Correlation Across Review Score Categories

109

BIBLIOGRAPHY

110

BIBLIOGRAPHY

[1] Akerlof, G. A., Aug. 1970. The market for "lemons": Quality uncertainty and the market

mechanism. The Quarterly Journal of Economics 83 (3), 488–500.

[2] Alaei, A. R., Becken, S., Stantic, B., Dec. 2017. Sentiment analysis in tourism: Capitalizing

on big data. Journal of Travel Research, 1–17.

[3] Armstrong, T., Aug. 2016. Large market asymptotics for diﬀerentiated product demand esti-

mators with economic models of supply. Econometrica 84 (5), 1961–1980.

[4] Bajari, P., Fruewirth, J. C., Kim, K. I., Timmins, C., Aug. 2012. A rational expectations
approach to hedonic price regressions with time-varying unobserved product attributes: The
price of pollution. American Economic Review 102 (5), 1898–1926.

[5] Bajari, P., Nekipelov, D., Ryan, S. P., Yang, M., May 2015. Machine learning methods for

demand estimation. American Economic Review 105 (5), 481–85.

[6] Belloni, A., Chernozhukov, V., 2013. Least squares after model selection in high-dimensional

sparse models. Bernoulli 19 (2), 521–547.

[7] Belloni, A., Chernozhukov, V., Fernandez-Val, I., Hansen, C., Jan. 2017. Program evaluation

and causal inference with high-dimensional data. Econometrica 85 (1), 233–298.

[8] Belloni, A., Chernozhukov, V., Wang, L., 2014. Pivotal estimation via square root lasso in

nonparametric regression. The Annals of Statistics 42 (2), 757–788.

[9] Berry, S., Levinsohn, J., Pakes, A., Jul. 1995. Automobile prices in market equilibrium.

Econometrica 63 (4), 841–890.

[10] Berry, S., Linton, O. B., Pakes, A., 2004. Limit theorems for estimating the parameters of

diﬀerentiated product demand systems. Review of Economic Studies 71, 613–654.

[11] Berry, S., Pakes, A., Nov. 2007. The pure characteristics demand model. International Eco-

nomic Review 48 (4).

[12] Berry, S. T., 1994. Estimating discrete-choice models of product diﬀerentiation. The RAND

Journal of Economics 25 (2), 242–262.

[13] Bjornerstedt, J., Verboven, F., Jul. 2016. Does merger simulation work? evidence from the
swedish analgesics market. American Economic Journal: Applied Economics 8 (3), 125–164.

[14] Breiman, L., Aug. 1996. Bagging predictors. Machine Learning 24 (2), 123–140.

[15] Breiman, L., Oct. 2001. Random forests. Machine Learning 45 (1), 5–32.

[16] Brenkers, R., Verboven, F., Dec. 2010. Liberalizing a distribution system: The european car

market. Journal of the European Economic Association 4 (1).

111

[17] Cardell, N. S., Apr. 1997. Variance components structures for the extreme-value and logistic
distributions with application to models of heterogeneity. Econometric Theory 13 (2), 185–
213.

[18] Chen, Y., Xie, K., 2017. Consumer valuation of airbnb listings: A hedonic pricing approach.

International Journal of Contemporary Hospitality Management 29 (9), 2405–2424.

[19] Chernozhukov, V., Chetverikov, D., Demirer, M., Duﬂo, E., Hansen, C., Newey, W., Robins,
J., Feb. 2018. Double, debiased machine learning for treatment and structural parameters. The
Econometrics Journal 21 (1), C1–C68.

[20] Chernozhukov, V., Hansen, C., Spindler, M., May 2015. Post-selection and post-regularization
inference in linear models with many controls and instruments. American Economic Review
105 (5), 486–490.

[21] Chevalier, J. A., Mayzlin, D., Aug. 2006. The eﬀect of word of mouth on sales: Online book

reviews. Journal of Marketing Research 43 (3), 345–354.

[22] Chintagunta, P. K., Gopinath, S., Venkataraman, S., Sep. 2010. The eﬀects of online user
reviews on movie box oﬃce performance: Accounting for sequential rollout and aggregation
across local markets. Marketing Science 29 (5), 944–957.

[23] Ciliberto, F., Moschini, G., Perry, E. D., Dec. 2017. Valuing product innovation: Genetically
engineered varieties in u.s. corn and soybeans. Center for Agricultural and Rural Develop-
mentWorking Paper 17-WP 576, Available at SSRN: https://ssrn.com/abstract=3088632 or
https://dx.doi.org/10.2139/ssrn.3088632.

[24] Cui, G., Lui, H.-K., Guo, X., 2012. The eﬀects of online consumer reviews on new product

sales. International Journal of Electronic Commerce 17 (1), 39–58.

[25] Dellarocas, C., Zhang, X. M., Awad, N. F., 2007. Exploring the value of online product
reviews in forecasting sales: The case of motion pictures. Journal of Interactive Marketing
21 (4).

[26] Dhar, V., Chang, E. A., Nov. 2009. Does chatter matter? the impact of user-generated contents

on music sales. Journal of Interactive Marketing 23 (3), 300–307.

[27] Draganska, M., Klapper, D., Aug. 2011. Choice set heterogeneity and the role of advertising:

An analysis with micro and macro data. Journal of Marketing Research 48 (4), 653–669.

[28] Duan, W., Gu, B., Whinston, A. B., Nov. 2008. Do online reviews matter? an empirical

investigation of panel data. Decision Support Systems 45 (4), 1007–1016.

[29] Edelman, B., Luca, M., Svirsky, D., Apr. 2017. Racial discrimination in the sharing economy:
Evidence from a ﬁeld experiment. American Economic Journal: Applied Economics 9 (2),
1–22.

[30] Ert, E., Fleischer, A., Magen, N., Aug. 2016. Trust and reputation in the sharing economy:

The role of personal photos in airbnb. Tourism Management 55, 62–73.

112

[31] Floyd, K., Freling, R., Alhoqail, S., Cho, H. Y., Freling, T., Jun. 2014. How online product

reviews aﬀect retail sales: A meta-analysis. Journal of Retailing 90 (2), 217–232.

[32] Fox, J., Kim, K. I., Ryan, S. P., Bajari, P., Feb. 2012. The random coeﬃcients logit model is

identiﬁed. Journal of Econometrics 166 (2), 204–212.

[33] Fox, J., Kim, K. I., Yang, C., Dec. 2016. A simple nonparametric approach to estimating
the distribution of random coeﬃcients in structural models. Journal of Econometrics 195 (2),
236–254.

[34] Fox, J. T., Kim, K. I., Ryan, S. P., Bajari, P., 2011. A simple estimator for the distribution of

random coeﬃcients. Quantitative Economics 2, 381–418.

[35] Fradkin, A., 2017. Search, matching, and the role of digital marketplace design in enabling

trade: Evidence from airbnb.

[36] Fradkin, A., Grewal, E., Holtz, D., 2018. The determinants of online review informativeness:

Evidence from ﬁeld experiments on airbnb.

[37] Friedman, J. H., Oct. 2001. Greedy function approximation: A gradient boosting machine.

The Annals of Statistics 29 (5), 1189–1232.

[38] Fu, W., Knight, K., 2000. Asymptotics for lasso-type estimators. The Annals of Statistics

28 (5), 1356–1378.

[39] Ghose, A., Ipeirotis, P. G., Oct. 2011. Estimating the helpfulness and economic impact of
product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge
and Data Engineering 23 (10), 1498–1512.

[40] Gibbs, C., Guttentag, D., Gretzel, U., Morton, J., Goodwill, A., Apr. 2018. Pricing in the
sharing economy: a hedonic pricing model applied to airbnb listings. Journal of Travel and
Tourism Marketing 35 (1), 46–56.

[41] Green, P. E., Carmone, F. J., Wachspress, D. P., Dec. 1976. Consumer segmentation via latent

class analysis. Journal of Consumer Research 3 (3), 170–174.

[42] Greif, A., 2006. Institutions and the Path to the Modern Economy. Cambridge University

Press, Cambridge, UK.

[43] Grossman, S., 1981. The informational role of warranties and private disclosure about product

quality. Journal of Law and Economics 24 (3), 461–483.

[44] Grossman, S. J., Hart, O. D., 1980. Takeover bids, the free-rider problem, and the theory of

the corporation. The Bell Journal of Economics 11 (1), 42–64.

[45] Guttentag, D., 2015. Airbnb: Disruptive innovation and the rise of an informal tourism

accommodation sector. Current Issues in Tourism 18 (12), 1192–1217.

[46] Hausmann, J. A., Leonard, G. K., Sep. 2002. The competitive eﬀects of a new product

introduction: A case study. The Journal of Industrial Economics 50 (3), 237–263.

113

[47] Herriges, J. A., Phaneuf, D. J., Nov. 2002. Inducing patterns of correlation and substitution
in repeated logit models of recreation demand. American Journal of Agricultural Economics
84 (4), 1076–7090.

[48] Hoerl, A. E., Kennard, R. W., Feb. 1970. Ridge regression: Biased estimation for nonorthog-

onal problems. Technometrics 12 (1), 55–67.

[49] Horton, J. J., Zeckhauser, R. J., 2016. Owning, using and renting: Some simple economics

of the "sharing economy"NBER Working Paper No 22029.

[50] Houser, D., Wooders, J., 2006. Reputation in auctions: Theory, and evidence from ebay.

Journal of Economics & Management Strategy 15 (2), 353–369.

[51] Jin, G. Z., Kato, A., 2006. Price, quality, and reputation: Evidence from an online ﬁeld

experiment. The RAND Journal of Economics 37 (4), 983–1004.

[52] Kim, H., Kim, K. I., Sep. 2017. Estimating store choices with endogenous shopping bundles

and price uncertainty. International Journal of Industrial Organization 54, 1–36.

[53] Kim, J. B., Albuquerque, P., Bronnenberg, B. J., Nov. 2010. Online demand under limited

consumer search. Marketing Science 29 (6), 1001–1023.

[54] Kreps, D. M., 1990. Corporate Culture and Economic Theory. Cambridge University Press.
[55] Kreps, D. M., Milgrom, P., Roberts, J., Wilson, R., 1982. Rational cooperation in the ﬁnitely

repeated prisoners’ dilemma. Journal of Economic Theory 27 (2), 245–252.

[56] Lee, J. D., Sun, D. L., Sun, Y., Taylor, J. E., 2016. Exact post-selection inference, with

application to the lasso. The Annals of Statistics 44 (3), 907–927.

[57] Lewis, G., Jun. 2011. Asymmetric information, adverse selection and online disclosure: The

case of ebay motors. American Economic Review 101 (4), 1535–1546.

[58] Lewis, G., Zervas, G., 2016. The welfare impact of consumer reviews: A case study of the

hotel industry. Industrial Organization Workshop (Pennsylvania State University, 2016).

[59] Liang, S., Schuckert, M., Law, R., Chen, C., Jun. 2017. Be a "superhost": The importance of
badge systems for peer-to-peer rental accommodations. Tourism Management 60, 454–465.
[60] Liu, Y., Jul. 2006. Word of mouth for movies: Its dynamics and impact on box oﬃce revenue.

Journal of Marketing 70 (3), 74–89.

[61] Lockhart, R., Taylor, J., Tibshirani, R. J., Tibshirani, R., 2014. A signiﬁcance test for the

lasso. The Annals of Statistics 42 (2), 413–468.

[62] McFadden, D., 1978. Modeling the choice of residential location. In Spatial Interaction Theory
and Planning Models, ed. by A. Karlgvist, et al. Amsterdam: North-HollanCowles Foundation
Discussion Papers 477, Yale University.

[63] Milgrom, P. R., 1981. Good news and bad news: Representation theorems and applications.

The Bell Journal of Economics 12 (2), 380–391.

114

[64] Milgrom, P. R., North, D. C., Weingast, B. R., Mar. 1990. The role of institutions in the revival
of trade: The law merchant, private judges, and the champagne fairs. Economics and Politics
2 (1), 1–23.

[65] Murdock, J., Jan. 2006. Handling unobserved site characteristics in random utility models of

recreation demand. Journal of Environmental Economics and Management 51 (1), 1–25.

[66] Nevo, A., Mar. 2001. Measuring market power in the ready-to-eat cereal industry. Economet-

rica 69 (2), 307–342.

[67] Nowak, A., Smith, P., 2017. Textual analysis in real estate. Journal of Applied Econometrics

32, 896–918.

[68] Ogut, H., Tas, B. K. O., Aug. 2012. The inﬂuence of internet customer reviews on the online

sales and prices in hotel indsutry. The Service Industries Journal 32 (2), 197–213.

[69] Petrin, A., Aug. 2002. Quantifying the beneﬁts of new products: The case of the minivan.

Journal of Political Economy 110 (4), 705–729, nBER Working Paper No. 8227.

[70] Proserpio, D., Xu, W., Zervas, G., 2016. You get what you give: Theory and evidence of

reciprocity in the sharing economy.

[71] Resnick, P., Zeckhauser, R., Swanson, J., Lockwood, K., Jun. 2006. The value of reputation

on ebay: A controlled experiment. Experimental Economics 9 (2), 79–101.

[72] Rosen, S., Jan. 1974. Hedonic prices and implicit markets: Product diﬀerentiation in pure

competition. Journal of Political Economy 82 (1), 34–55.

[73] Shin, H. S., Hanssens, D. M., Kim, K. I., Dec. 2016. The role of online buzz for leader versus
challenger brands: The case of the mp3 player market. Electronic Commerce Research 16 (4),
503–528.

[74] Tadelis, S., Oct. 2016. Reputation and feedback systems in online platform markets. Annual

Review of Economics 8, 321–340.

[75] Taylor, J., Tibshirani, R., Mar. 2017. Post-selection inference for l1 penalized likelihood
models. The Canadian Journal of Statistics 46 (1), 41–61, special Issue on Big Data and the
Statistical Sciences.

[76] Teubner, T., Hawlitschek, F., Dann, D., 2017. Price determinants on airbnb: How reputation
pays oﬀ in the sharing economy. Journal of Self-Governance and Management Science 5 (4),
53–80.

[77] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society. Series B Methodological 58 (1), 267–288.

[78] Tibshirani, R. J., Taylor, J., Lockhart, R., Tibshirani, R., Aug. 2016. Exact post-selection
inference for sequential regression procedures. Journal of the American Statistical Association
111 (514), 600–620.

115

[79] Train, K. E., 2003. Discrete Choice Methods with Simulation. Cambridge University Press.

[80] Verboven, F., 1996. International price discrimination in the european car market. The RAND

Journal of Economics 27 (2), 240–268.

[81] Wang, D., Nicolau, J. L., Apr. 2017. Price determinants of sharing economy based accom-
modation rental: A study of listings from 33 cities on airbnb.com. International Journal of
Hospitality Management 62, 120–131.

[82] Xie, K. L., Zhang, Z., Zhang, Z., Oct. 2014. The business value of online consumer reviews and
management response to hotel performance. International Journal of Hospitality Management
43, 1–12.

[83] Ye, Q., Law, R., Gu, B., Mar. 2009. The impact of online user reviews and hotel room sales.

International Journal of Hospitality Management 28 (1), 180–182.

[84] Ye, Q., Law, R., Gu, B., Chen, W., Mar. 2011. The inﬂuence of user-generated content on
traveler behavior: An empirical investigation on the eﬀects of e-word-of-mouth to hotel online
bookings. Computers in Human Behavior 27 (2), 634–639.

[85] Zervas, G., Proserpio, D., Byers, J. W., Jan. 2015. A ﬁrst look at online reputation on airbnb,

where every stay is above average. Available at SSRN: https://ssrn.com/abstract=2554500.

[86] Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of

the Royal Statistical Society. Series B Methodological 67 (2), 301–320.

116