This is to certify that the dissertation entitled

ESTIMATION FROM CENSORED MEDICAL COST DATA

presented by

ONUR BASER

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Economics.

MSU is an Affirmative Action/Equal Opportunity Institution

ESTIMATION FROM CENSORED MEDICAL COST DATA

By

ONUR BASER

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Economics

2002

ABSTRACT

ESTIMATION FROM CENSORED MEDICAL COST DATA

By

ONUR BASER

Health care inflation is a concern in many industrialized countries. One response is the search for cost-effective therapies, which requires proper analysis of treatment cost data.
A common problem with medical cost data is censoring, and the statistical properties of estimators of medical cost from censored data are not well developed. In this thesis, I propose two methods, one with an extension to the panel data setting, to estimate medical cost from censored data.

The first chapter applies the inverse probability weighted least-squares method to predict censored total medical cost. Since survival time and medical costs may be subject to right censoring and therefore are not always observable, the ordinary least-squares approach cannot be used to assess the effects of certain explanatory variables. Inverse probability weighted least-squares estimation provides consistent, asymptotically normal coefficient estimates with easily computable standard errors. A test is derived to compare the coefficients estimated by the ordinary least-squares approach with those estimated by inverse probability weighted least squares. A study on the medical cost of lung cancer is used as an application of the method.

The second chapter applies the inverse probability weighted (IPW) least-squares method to predict total medical cost from panel data subject to censoring. Specifically, IPW pooled ordinary least squares (POLS) and IPW random effects (RE) models are used. Because total medical cost is not independent of the survival time under administrative censoring, unweighted POLS and RE applied to the uncensored data cannot be used to assess the effects of certain explanatory variables. IPW estimation gives consistent, asymptotically normal coefficient estimates with easily computable standard errors. Traditional and robust forms of the Hausman test can be used to compare the coefficients estimated by weighted and unweighted estimation methods. The methods developed in this paper are applied to lung cancer cost data.

In the third chapter, a method for testing and correcting for sample selection bias for cross-sectional data is proposed.
Specifically, this paper provides a systematic treatment of the correction for nonrandom sample selection of medical cost data where the selection rule is described by a censored regression model. We show that the population parameters are identified, and provide straightforward \(\sqrt{N}\)-consistent and asymptotically normal estimation methods under the assumption that the selection rule is governed by a censored Tobit model. A study on the medical cost of lung cancer is used as an application of the method.

Copyright by ONUR BASER 2002

For my wife, Deniz and my brother, Erdem

Table of Contents

LIST OF TABLES
LIST OF FIGURES
1 Estimation of Censored Medical Cost Data
  1.1 IPW least squares
  1.2 Comparison Of The IPW And Unweighted Estimators
  1.3 Application to the Lung Cancer Study
    1.3.1 Data
    1.3.2 Variables
    1.3.3 Descriptive Analysis
    1.3.4 Survival Curves
    1.3.5 Regression Analysis
  1.4 Conclusions
2 The Longitudinal Analysis of Censored Medical Cost Data
  2.1 General Framework
    2.1.1 Pooled Ordinary Least Squares (POLS) Estimation
    2.1.2 Random Effect Model
  2.2 Weighted or Unweighted Estimator?
  2.3 The Lung Cancer Study
    2.3.1 The Data
    2.3.2 Regression Analysis
  2.4 Conclusion
3 Full Parametric Estimation of Censored Medical Cost
  3.1 General Framework
  3.2 Statistical Methods
  3.3 Lung Cancer Study
  3.4 Conclusion
APPENDICES
  .1 Appendix for Chapter 1
  .2 Appendix for Chapter 3
LIST OF REFERENCES

List of Tables

1.1 Descriptive Statistics from the Lung Cancer Study
1.2 Estimates of the Log(tcost) Equation by OLS and IPW
2.1 Estimation of Log of Total Medical Cost from Longitudinal Data
3.1 Summary Statistics from the Lung Cancer Study
3.2 Estimates of the Log(tcost) Equation by OLS, IPW and Procedure 3

List of Figures

1.1 Survival functions according to disease stage level
1.2 Survival functions according to hospitalization for reasons other than lung cancer surgery
1.3 Total medical cost distribution among the censored cases
2.1 Distribution of average monthly cost
2.2 Distribution of absolute value of monthly dummy coefficients under POLS and IPW POLS estimation
2.3 Distribution of absolute value of monthly dummy coefficients under RE and IPW RE estimation
3.1 Administrative censoring when each individual has different starting time
3.2 Starting time backed up to 0 for the individuals faced administrative censoring

Chapter 1: Estimation of Censored Medical Cost Data

Introduction

The rising cost of health care is a concern in many industrialized countries. One response is the search for cost-effective therapies, which requires proper analysis of treatment cost data. Cost-effectiveness analysis (CEA) involves estimating the net, or incremental, costs and effects of an intervention. Treatment costs and health outcomes are compared with some alternative, which might be the care that would be given if the interventions were not used at all (Gold et al., 1996). The first step in this important process is the estimation of costs.
Costs can be estimated from a variety of sources, including Medicare claims files. However, censoring is a common problem with these administrative data, and statistical methods applicable to censored cost are not well developed. Nevertheless, the average total cost for a group of patients has been estimated in one of three ways: (1) by estimating the sample mean of observed costs from all cases, (2) by estimating the sample mean of the uncensored subjects only, and (3) by using modifications of standard survival analysis techniques (Lin, 1997). All of these methods yield biased estimators. The sample mean from all cases is biased downward because it does not account for the costs incurred after censoring. The sample mean from uncensored subjects is biased toward the costs of patients with shorter survival times, since longer survival times are more likely to be censored. Standard survival analysis on costs is not valid if patients accumulate costs with different rate functions over time. This technique assumes independence between the cost at the survival time and the cost at the censoring time, whereas the two are generally positively correlated.

To adjust for this dependency, Lin et al. (1997) proposed a "partitioned estimator" method to assess average costs. This method partitions the entire time period of interest into a number of smaller intervals and calculates the average cost and the product-limit estimate for each interval. The sum of the products of these two components becomes the "product-limit sampling average estimator" of total cost for the sample. An application by Sloan et al. (1999) to health care costs of patients in an oncology clinical trial found Lin's variance estimation to be an arduous and untenable numerical programming exercise. Instead, they used a straightforward application of the bootstrap method to obtain variance estimates.
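The direction of the bias in the first two approaches can be illustrated with a small simulation. The sketch below is only an illustration under assumed, hypothetical settings (exponential survival with mean 2, uniform censoring on [0, 4], a horizon of 3, and a constant cost rate of 10 per unit of time); none of these values come from the lung cancer data, and the function names are invented for this example.

```python
import random

def simulate_costs(n, horizon=3.0, seed=0):
    """Simulate true and observed costs for n hypothetical patients."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.expovariate(0.5)               # survival time, mean 2 (assumed)
        c = rng.uniform(0.0, 4.0)              # censoring time (assumed)
        t_star = min(t, horizon)               # costs stop at death or at the horizon
        true_cost = 10.0 * t_star              # assumed constant cost rate of 10
        observed_cost = 10.0 * min(t_star, c)  # cost accrued before censoring
        complete = c >= t_star                 # True if total cost is fully observed
        rows.append((true_cost, observed_cost, complete))
    return rows

def naive_means(rows):
    """Return the true mean, the all-cases mean and the complete-cases mean."""
    true_mean = sum(r[0] for r in rows) / len(rows)
    all_cases_mean = sum(r[1] for r in rows) / len(rows)
    complete = [r[1] for r in rows if r[2]]
    complete_only_mean = sum(complete) / len(complete)
    return true_mean, all_cases_mean, complete_only_mean

rows = simulate_costs(20000)
true_mean, all_cases_mean, complete_only_mean = naive_means(rows)
print(round(true_mean, 2), round(all_cases_mean, 2), round(complete_only_mean, 2))
```

In this setup both naive means fall below the true mean: the all-cases mean omits costs incurred after censoring, and the complete-cases mean over-represents patients with short survival.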
If we are interested in conditional average costs, all three estimation methods incorrectly assume some homogeneity in the medical cost data, in the sense that costs are independent of patient characteristics or the type of treatment. Since cost can depend on patients' age, disease stage, comorbid conditions, symptoms, type of treatments received, etc., estimation should account for these control variables. Multivariable regression analysis is required, but no such method is available in standard software programs. The two approaches developed by Lin (2000a, 2000b) require a high level of computer programming and have not been empirically tested.

In this chapter, the inverse probability weighted (IPW) least squares method is used to assess the effects of covariates (e.g., patient and clinical characteristics) on medical cost with censored data. IPW has a long history in statistics (Horvitz and Thompson (1952), Robins and Rotnitzky (1992, 1995), Robins, Rotnitzky and Zhao (1995), Horowitz and Manski (1998), Rosenbaum (1987) and Hirano, Imbens, and Ridder (2000)). In a more general framework, Wooldridge (1999, 2001) examined the asymptotic properties of the IPW M-estimator for variable probability samples and standard stratified samples, and Wooldridge (2002b) provides an overview of IPW M-estimation for cross-section applications. IPW estimation produces consistent estimators with a covariance matrix that can be calculated by most commercial statistics software programs. We also develop a test to compare the coefficients estimated by IPW least squares and ordinary least squares (OLS).

This chapter is organized as follows. The first section outlines IPW least squares as applied to censored medical cost data, including the statistical properties of the estimation and step-by-step procedures for implementation. The next section introduces a Hausman-type test to compare the estimators calculated by using IPW least squares and OLS over uncensored data.
The third section describes an application of our methods to a study on the medical cost of lung cancer patients. The last section presents our conclusions.

1.1 IPW least squares

Suppose that we are interested in the total medical cost over the period \([0, L]\). Since there is no further medical expense after death, the total cost over \([0, L]\) is the same as the cumulative cost at \(T^* = \min(T, L)\), where \(T\) is the survival time. The distribution of \(T\) is assumed to be continuous from 0 to \(L\). Assume that in the population of interest

\[ y = x\beta + u, \tag{1.1} \]

where \(y\), \(x\) and \(\beta\) are respectively the cumulative cost (or transformed cost) at \(T^*\), a \(1 \times K\) vector of explanatory variables, and a \(K \times 1\) vector of unknown regression parameters, and \(u\) is the unobservable random disturbance or error, whose distribution is unspecified. The first component of \(x\) is set to 1 so that the first component of \(\beta\) represents the intercept. Assume that

\[ E(x'u) = 0. \tag{1.2} \]

Under random sampling from the population, equation (1.2) is the crucial assumption in obtaining consistency of the OLS estimator of \(\beta\) in (1.1). With (1.2) and the rank assumption \(\operatorname{rank} E(x'x) = K\), the OLS estimator using a random sample will be consistent for \(\beta\). The assumption that \(u\) is a zero-mean error term does not guarantee consistency.

Survival time and medical costs may be subject to right censoring and therefore are not always fully observable. Cost censoring occurs when a patient's follow-up time is less than \(L\) and the patient is alive at the time of censoring. Since no further expense is assumed after death, whether death occurs before \(L\) or after \(L\) is immaterial for cost estimation. Let \(C\) be the time of censoring. Let \(Z = \min(T, C)\), \(s = I(C \ge T)\), and \(s^* = I(C \ge T^*)\), where \(I(\cdot)\) is the indicator function. There are two types of censoring: time censoring if \(s = 0\), that is, \(T > C\), and cost censoring if \(s^* = 0\), that is, \(\min(T, L) > C\). A generic element from the population can be denoted \((y, x, s^*)\).
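To make the distinction between the two indicators concrete, the following minimal sketch computes \(T^*\), \(Z\), \(s\) and \(s^*\) for a single patient. The function name and the follow-up times in the usage note are hypothetical and serve only as an illustration of the definitions above.

```python
def censoring_indicators(t, c, horizon):
    """Compute T* = min(T, L), Z = min(T, C), s = I(C >= T), s* = I(C >= T*)."""
    t_star = min(t, horizon)            # cost accrues only up to death or the horizon L
    z = min(t, c)                       # observed follow-up time
    s = 1 if c >= t else 0              # 0 means the survival time is censored
    s_star = 1 if c >= t_star else 0    # 0 means the total cost is censored
    return {"T_star": t_star, "Z": z, "s": s, "s_star": s_star}

# Hypothetical patient: dies at month 30, censored at month 26, horizon of 24 months.
print(censoring_indicators(30, 26, 24))
```

For this hypothetical patient the survival time is censored (\(s = 0\)) but the total cost over the horizon is fully observed (\(s^* = 1\)), which is why only cost censoring matters for the weights used below.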
Suppose that \(T\) and \(C\) are independent given \(x\).

Assumption 1
(i) \(x\) and \(C\) are always observed, and \(T^*\) and \(y\) are observed when \(s^* = 1\).
(ii) \(y\) can be ignored in the selection equation, conditional on \(x\):

\[ P(s^* = 1 \mid x, y) = P(s^* = 1 \mid x) = P(C \ge T^* \mid x) = P(C \ge T^*). \]

Assumption 1 indicates that \(x\) is always observed and that, conditional on \(x\), the response variable does not affect the selection probability. Part (ii) of Assumption 1 is crucial because of the requirement that the selection probability is observable when \(s^* = 1\). Since we can ignore \(y\) in the selection equation, having a censored \(y\) value does not create a problem for estimating selection probabilities.

Suppose we have a random sample of size \(N\) from the population to estimate \(\beta\). Thus \(\{(x_i, y_i): i = 1, 2, \ldots, N\}\) are treated as independent, identically distributed random variables, where \(x_i\) is \(1 \times K\), \(y_i\) is a scalar, and \(s_i^*\) is the corresponding sample selection indicator. The underlying model is then

\[ y_i = x_i\beta + u_i. \tag{1.3} \]

The IPW least squares estimator, \(\hat\beta_w\), solves

\[ \min_{\beta} \sum_{i=1}^{N} w_i (y_i - x_i\beta)^2, \tag{1.4} \]

where \(w_i = s_i^* / P(C_i \ge T_i^*)\). Under Assumption 1 and equation (1.2), \(\hat\beta_w\) is consistent and asymptotically normally distributed, with estimated asymptotic variance \(\hat V(\hat\beta_w) = \hat A_w^{-1} \hat B_w \hat A_w^{-1} / N\), where

\[ \hat A_w = N^{-1} \sum_{i=1}^{N} w_i x_i' x_i, \tag{1.5} \]

\[ \hat B_w = N^{-1} \sum_{i=1}^{N} w_i^2 \hat u_i^2 x_i' x_i, \tag{1.6} \]

and \(\hat u_i = y_i - x_i\hat\beta_w\) are the residuals after IPW least squares estimation (Wooldridge, 1999).

The objective function in (1.4) simply weights each observation \((y_i, x_i)\) by the inverse probability of appearing in the sample; observations for which \(s_i^* = 0\) do not appear in the optimization problem. Part (ii) of Assumption 1 requires \(P(C_i \ge T_i^*)\) to be known whenever \(s_i^* = 1\), so \(\hat\beta_w\) is computable from observed data assuming we know \(P(C_i \ge T_i^*)\). Note that neither \(\hat\beta_w\) nor its covariance matrix estimator involves the incomplete observations.
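As an illustration of (1.4) through (1.6), the sketch below computes the weighted least squares coefficients and the sandwich variance in plain Python, assuming the weights \(w_i\) are already known. It is not the STATA code referred to in the Appendix; the helper names (`mat_inverse`, `mat_mul`, `ipw_least_squares`) and the tiny data set in the usage lines are hypothetical.

```python
def mat_inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def ipw_least_squares(x, y, w):
    """Return (beta, variance) for weights w_i = s*_i / P(C_i >= T*_i).

    x is a list of rows of length K (first entry 1 for the intercept).
    Observations with w_i = 0 (censored cost) drop out automatically.
    """
    n, k = len(y), len(x[0])
    # A_w = N^{-1} sum_i w_i x_i' x_i, as in (1.5)
    a_w = [[sum(w[i] * x[i][r] * x[i][c] for i in range(n)) / n
            for c in range(k)] for r in range(k)]
    xy = [sum(w[i] * x[i][r] * y[i] for i in range(n)) / n for r in range(k)]
    a_inv = mat_inverse(a_w)
    beta = [sum(a_inv[r][c] * xy[c] for c in range(k)) for r in range(k)]
    resid = [y[i] - sum(x[i][c] * beta[c] for c in range(k)) for i in range(n)]
    # B_w = N^{-1} sum_i w_i^2 u_i^2 x_i' x_i, as in (1.6)
    b_w = [[sum((w[i] * resid[i]) ** 2 * x[i][r] * x[i][c] for i in range(n)) / n
            for c in range(k)] for r in range(k)]
    sandwich = mat_mul(mat_mul(a_inv, b_w), a_inv)
    variance = [[v / n for v in row] for row in sandwich]
    return beta, variance

# Hypothetical data: intercept plus one covariate, third case cost-censored (w = 0).
x_demo = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y_demo = [1.2, 2.9, 0.0, 7.1]
w_demo = [1.0, 1.25, 0.0, 2.0]
beta_demo, var_demo = ipw_least_squares(x_demo, y_demo, w_demo)
print(beta_demo)
```

Setting all weights to 1 reproduces ordinary least squares with White-type standard errors, which is the sense in which standard software can be reused once the weights are available.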
In addition, the estimated covariance matrix is the White (1980) heteroskedasticity-consistent covariance matrix estimator applied to all variables for observation \(i\) weighted by \(\sqrt{w_i}\). Note that even if there is no heteroskedasticity in the population model (1.3), we treat the model as heteroskedastic due to censoring. Heteroskedasticity-robust standard errors after the weighted regression provide the estimated asymptotic standard errors. Censoring, then, can be handled easily because most standard statistics software programs compute a heteroskedasticity-consistent covariance matrix.

Another advantage of weighting the observations, other than solving the censoring problem, is that we obtain consistency under the weaker assumption (1.2) rather than \(E(u \mid x) = 0\). Since \(w\) is independent of \((y, u)\),

\[ E(w x'u) = E\big(E(w \mid u, x)\, x'u\big) = E\big(E(w \mid y, x)\, x'u\big) = E\big(E(w \mid x)\, x'u\big) = 0. \tag{1.7} \]

The last equality follows because \(E(w \mid x) = 1\).

So far, it has been assumed that the sampling probability function is known. Usually, that function is unknown and needs to be estimated. We propose to estimate the sampling probability function by the product-limit estimator (Kaplan and Meier 1958), with the roles of \(C_i\) and \(T_i\) reversed (i.e., \(T_i\) censors \(C_i\)). Assuming censoring is not covariate dependent, define \(p(t) = P(C \ge t)\), and let \(\hat p(t)\) be the product-limit estimator of \(p(t)\) based on the data \((Z_i, \bar s_i)\) \((i = 1, \ldots, N)\), where \(\bar s_i = 1 - s_i\). Then,

\[ \hat w_i = \frac{s_i^*}{\hat p(T_i^*)}, \qquad i = 1, \ldots, N. \tag{1.8} \]

Under standard regularity conditions,¹ the two-step IPW least squares estimator that uses \(\hat w_i\) instead of \(w_i\) in equation (1.4) consistently estimates \(\beta\) (Newey and McFadden 1994). Exogenous censoring implies that

\[ E(y \mid x, T^*, C) = E(y \mid x). \tag{1.9} \]

It can be shown that under exogenous sampling the use of an estimate of the probabilities for the second step yields a variance estimator that is asymptotically equivalent to that estimated with known probability values.
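The product-limit step with the roles of censoring and death reversed might be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes that \(p(t) = P(C \ge t)\) is evaluated from censoring times strictly before \(t\), that \(s_i^* = 1\) whenever the death is observed or follow-up reaches the horizon \(L\), and the function names and the five-patient example are hypothetical.

```python
def censoring_survivor(z, s):
    """Product-limit estimate of p(t) = P(C >= t) with roles reversed.

    z: observed follow-up times Z_i = min(T_i, C_i)
    s: indicators s_i = I(C_i >= T_i); censoring (s_i == 0) is treated as the event.
    Returns a function p_hat(t) that uses censoring times strictly before t.
    """
    event_times = sorted({zi for zi, si in zip(z, s) if si == 0})
    steps = []
    surv = 1.0
    for tj in event_times:
        at_risk = sum(1 for zi in z if zi >= tj)
        events = sum(1 for zi, si in zip(z, s) if zi == tj and si == 0)
        surv *= 1.0 - events / at_risk
        steps.append((tj, surv))

    def p_hat(t):
        value = 1.0
        for tj, sj in steps:
            if tj < t:
                value = sj
            else:
                break
        return value

    return p_hat

def ipw_weights(z, s, horizon):
    """Estimated weights: s*_i divided by p_hat evaluated at T*_i = min(Z_i, L)."""
    p_hat = censoring_survivor(z, s)
    weights = []
    for zi, si in zip(z, s):
        s_star = 1 if (si == 1 or zi >= horizon) else 0
        weights.append(s_star / p_hat(min(zi, horizon)) if s_star else 0.0)
    return weights

# Hypothetical follow-up data for five patients and a horizon of 5.5.
z_demo = [2.0, 3.0, 4.0, 5.0, 6.0]
s_demo = [1, 0, 1, 0, 1]
w_hat = ipw_weights(z_demo, s_demo, horizon=5.5)
print(w_hat)
```

Cases with censored cost receive weight zero, while complete cases observed later in follow-up receive larger weights because fewer patients remain uncensored at that point.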
Therefore, define \(\tilde\beta_w\), the two-step IPW least squares estimator, as the solution to

\[ \min_{\beta} \sum_{i=1}^{N} \hat w_i (y_i - x_i\beta)^2. \tag{1.10} \]

Then \(\tilde\beta_w\) is asymptotically normally distributed with estimated variance

\[ \tilde V_w \equiv \hat V(\tilde\beta_w) = \tilde A_w^{-1} \tilde B_w \tilde A_w^{-1} / N, \tag{1.11} \]

where

\[ \tilde A_w = N^{-1} \sum_{i=1}^{N} \hat w_i x_i' x_i, \tag{1.12} \]

\[ \tilde B_w = N^{-1} \sum_{i=1}^{N} \hat w_i^2 \hat u_i^2 x_i' x_i, \tag{1.13} \]

and \(\hat u_i = y_i - x_i\tilde\beta_w\) are the residuals.

¹ The conditions are those under which the uniform weak law of large numbers can be applied. For details, see Theorem 12.1 in Wooldridge (2000a). Lemma 4.3 in Newey and McFadden (1994) shows that if \(w_i\) is replaced with a consistent estimator, the convergence is still valid.

Under administrative censoring, for example, all censoring is caused by study termination, and \(C\) is independent of \(y\). However, unless we have cost values over short intervals, such as monthly or weekly, we may expect \(T^*\) and \(y\) to be correlated. In this case, \(\tilde V_w\) in (1.11) has to be adjusted for the estimation of \(w_i\) (for the adjusted covariance matrix, see Wooldridge (2002b)).

The estimated covariance matrix of the two-step IPW least squares estimator is the White (1980) heteroskedasticity-consistent covariance matrix estimator applied to all variables for observation \(i\) weighted by \(\sqrt{\hat w_i}\). A robust covariance matrix is built into most statistical programs, whereas the adjustment to \(\tilde V_w\) in (1.11) requires programming. In practice, it has been found that adjusting for the first-step estimators usually has little effect on the asymptotic standard errors. Moreover, Wooldridge (2002b) shows that using the estimated selection probability produces smaller standard errors than those obtained by using a known selection probability. In other words, if we compute the asymptotic variance as if we had not estimated the probabilities, inference is conservative. Adjustment for the estimation of \(\hat w_i\) requires programming in the application. The main point of this paper is to suggest an easily applicable method.
By ignoring the adjustment for simplicity, we obtain larger standard errors; estimates that are significant using the unadjusted standard errors would therefore have even larger t statistics after correction. This is somewhat unusual for two-step estimation problems, where the prevailing wisdom is that adjusting for a first-stage estimation produces larger standard errors.

The steps for deriving consistent two-step IPW least squares estimators and their unadjusted asymptotic variance estimators can be summarized as follows.²

(i) Calculate the product-limit estimator, \(m_i\), based on the data \((Z_i, 1 - s_i)\) \((i = 1, \ldots, N)\).
(ii) Generate \(p_i = m_i\) for the cases with \(Z_i \le L\), and \(p_i = l\) otherwise, where \(l\) is the value of \(m_i\) at \(Z_i = L\).
(iii) Generate the weight \(w_i = s_i^* / p_i\) \((i = 1, \ldots, N)\).
(iv) Generate the weighted response and explanatory variables: \(y_i^* = \sqrt{w_i}\, y_i\), \(x_i^* = \sqrt{w_i}\, x_i\) \((i = 1, \ldots, N)\).
(v) Run the OLS regression of \(y_i^*\) on \(x_i^*\) with the heteroskedasticity-robust option.

Total medical cost data are typically characterized by a skewed empirical distribution of the nonzero realizations (Manning and Mullahy, 2001). The most common method for analyzing such data is logarithmic transformation of the response variable. In our estimation procedure, \(y_i\) can be chosen as the transformed dependent variable. Retransformation can then be done using the smearing estimator (Duan, 1983). The smearing estimator is the exponential of the expected response on the log scale multiplied by the average of the exponentiated residuals. Anderson et al. (2000) developed the heteroskedastic smearing estimator for use when the variances of the residuals are not constant.³

² STATA commands are in the Appendix.

³ In the method described above, a heteroskedasticity-robust variance matrix is used. The robust variance matrix is needed because of stratification whether or not there is heteroskedasticity.
Therefore, the homoskedastic smearing transformation can still be chosen after robust estimation if the variance of the residuals is constant.

1.2 Comparison Of The IPW And Unweighted Estimators

The OLS estimator for cases with complete data, called the unweighted estimator, \(\tilde\beta_u\), solves

\[ \min_{\beta} \sum_{i=1}^{N} s_i^* (y_i - x_i\beta)^2. \tag{1.14} \]

It is well known that selection under exogenous sampling does not cause problems if we impose the stronger assumption \(E(u \mid x) = 0\). Then \(\tilde\beta_u\) is consistent and asymptotically normally distributed, and the usual variance matrix estimator \(\hat V(\tilde\beta_u) = \tilde A_u^{-1} \tilde B_u \tilde A_u^{-1} / N\) is consistent, where

\[ \tilde A_u = N^{-1} \sum_{i=1}^{N} s_i^* x_i' x_i, \tag{1.15} \]

\[ \tilde B_u = N^{-1} \sum_{i=1}^{N} s_i^* \tilde u_i^2 x_i' x_i, \tag{1.16} \]

and \(\tilde u_i = y_i - x_i\tilde\beta_u\) are the residuals after OLS estimation on the uncensored sample.

If equation (1.9) is satisfied, then the unweighted and weighted estimators are both consistent. In such a case, theory suggests that the unweighted estimator is more efficient under conditional homoskedasticity and the weighted estimator is more efficient under unknown heteroskedasticity (Wooldridge, 1999).

Because the unweighted estimator is inconsistent when equation (1.9) is violated and the weighted estimator is consistent with or without exogenous sampling, we can apply a Hausman (1978) test to determine exogeneity of sampling. The traditional form of the Hausman statistic can be used under a homoskedasticity assumption. We can state this assumption as follows. For the selected sample, \(i = 1, 2, \ldots, N_0\):

\[ E(u_i^2 x_i' x_i) = \sigma_0^2 E(x_i' x_i). \tag{1.17} \]

When equation (1.17) holds, the unweighted least squares variance estimator is

\[ \tilde V_u \equiv \hat V(\tilde\beta_u) = \tilde\sigma^2 \left( N^{-1} \sum_{i=1}^{N} s_i^* x_i' x_i \right)^{-1} \Big/ N, \tag{1.18} \]

provided \(\tilde\sigma^2\) is a consistent estimator of \(\sigma_0^2\). In general form, the Hausman test can be stated as

\[ H = (\tilde\beta_w - \tilde\beta_u)'\, \tilde V^{-1} (\tilde\beta_w - \tilde\beta_u), \tag{1.19} \]

where \(\tilde V \equiv \tilde V_w - \tilde V_u\); \(\tilde V_w\) is defined in equation (1.11) and \(\tilde V_u\) is defined in equation (1.18).

In many cases we may want to use a Hausman test when the homoskedasticity assumption is violated.
This requires a robust form that replaces \(\tilde V\) by

\[ \left( \tilde A_w^{-1} \;\big|\; -\tilde A_u^{-1} \right) \left( N^{-1} \sum_{i=1}^{N} \hat e_i \hat e_i' \right) \left( \tilde A_w^{-1} \;\big|\; -\tilde A_u^{-1} \right)' \Big/ N, \tag{1.20} \]

where \(\hat e_i = (\hat w_i \hat u_i x_i,\; s_i^* \tilde u_i x_i)'\). Here \(\hat u_i\) and \(\tilde u_i\) are the residuals after IPW least squares and OLS estimation on the selected sample, respectively, and \(\hat w_i\), \(\tilde A_w\) and \(\tilde A_u\) are defined in equations (1.8), (1.12) and (1.15), respectively.

Under the null hypothesis that the sampling scheme is exogenous, \(H \overset{a}{\sim} \chi^2_K\). If we reject the hypothesis, the IPW least squares method should be used: since we have endogenous sampling, OLS estimation using complete cases is not consistent. If we fail to reject the hypothesis, the typical response is to conclude that the exogeneity assumption holds and that we should use the OLS estimates. Unfortunately, we may be committing a Type II error by failing to reject the exogeneity assumption when it is false. Therefore, we should report results from both estimation procedures.

1.3 Application to the Lung Cancer Study

1.3.1 Data

From 1994 through 1997, 202 patients with incident cases of lung cancer were recruited from 24 Michigan community hospitals and their affiliated oncology units. Each patient provided written consent for researchers to acquire his or her Medicare claim files; 189 patients had lung cancer treatment. We obtained Medicare claim files for the two years following lung cancer diagnosis. The files included any reimbursement claims for inpatient or outpatient care, physician provider services (including laboratory tests and diagnostics, mammography, radiation, and intravenous chemotherapy), home health care, and/or skilled nursing facilities.

Several cases had missing data. One case was missing age, nine cases were missing stage, nineteen cases did not report comorbid conditions, three cases were missing symptoms, and six cases had no data on physical function. We first assumed a completely random distribution of missing data, that is, cases with complete data are not distinguishable from cases with incomplete data.
We used mean substitution, median substitution, pairwise deletion and regression methods to complete the data set. We then assumed that cases with complete data are different from cases with incomplete data but that the missing pattern is tractable, and used a multiple imputation method to complete the data set. None of the imputation methods yielded a significant estimate in the regression model for the variables with missing values; therefore we used median substitution without loss of generality.

1.3.2 Variables

Total Cost. Total cost is the sum of inpatient, outpatient, and provider costs. Medicare payments, rather than billed charges, were used as a proxy for direct medical care costs. Medicare reimbursement formulas are designed to reflect an underlying pattern of resource use, whereas charges inflate actual cost. Charges were adjusted for inflation to 1997 prices by using the National Medical Care Price Index, 1994-1997. The costs of prescription drugs, unpaid caregiver services, and services paid by other insurers or out of pocket were not included.

Age. Age is defined as a continuous variable: the patient's age within two weeks of initiating either radiation or chemotherapy.

Treatments. Surgical procedures were identified by the two Medicare coding systems, International Classification of Diseases version 9 (ICD-9) and Current Procedural Terminology (CPT) codes. We used all ICD-9 and CPT codes available in the inpatient, outpatient, and physician supplier files to identify chemotherapy and radiation. These data were coded as dichotomous variables with yes/no categories for comparison purposes.

Hospitalization. The number of inpatient Medicare claims was used to derive the number of hospitalizations for reasons other than lung cancer surgery.

Physical Function. Physical function three months prior to diagnosis was assessed using the subscale from the SF-36 (Ware et al. 2000).
The 10-item subscale asks questions about such activities as lifting heavy objects, participating in strenuous sports, climbing stairs, walking various distances, and ability to bathe and dress. Response categories are: limited a lot, limited a little, and not limited at all. Scores are standardized and range from 100 (no limitation) to 0 (severe limitation).

Symptoms. Patients were asked if during the past two weeks they had experienced any of 33 symptoms. A count of all symptoms was summed for each patient.

Death. The Death Certificate Registry of the Office of Vital Statistics, Michigan Department of Community Health, was used to identify the date of death.

Comorbid Conditions. Comorbid conditions were assessed with an instrument from the Aging and Health in America Study, a national survey that asks patients to indicate whether a health professional has ever told them they have one of 15 problems. The total number of positive responses was summed for each patient and sorted into one of two categories: zero to two, and three or more. A comparison of patient reports of comorbid conditions with medical record audits indicates that patients are able to recall other diagnosed illnesses (Katz et al., 1996). Restricting the categories for comorbid conditions does not result in lost predictive power (Newschaffer, 1998).

Stage. Disease stage was determined using the American Joint Committee on Cancer (AJCC) Tumor Nodes & Metastasis (TNM) staging system, which was applied to pathological data obtained from an audit of patients' medical records. Stage of cancer at diagnosis was collapsed into early (in situ and local) and late (regional and distant).

Gender. The value is 1 for males; 0 for females.

Race. The value is 1 for whites; 0 for blacks.
Table 1.1: Descriptive Statistics from the Lung Cancer Study

| Variables | Uncensored (n=135) Mean | Uncensored (n=135) Std | Censored (n=48) Mean | Censored (n=48) Std |
|---|---|---|---|---|
| total cost | 63939 | 41680 | 62877 | 40114 |
| lstage | .67 | | .54 | |
| lcomorbi | .64 | | .63 | |
| hospitalize | .62 | | .54 | |
| chemo only | .08 | | .06 | |
| radiation only | .26 | | .25 | |
| chemo and radiation | .36 | | .54 | |
| symptoms | 11.13 | 5.14 | 10.12 | 4.72 |
| physical functions | 73.55 | 26.76 | 71.45 | 28.71 |
| age | 71.96 | 4.85 | 72.68 | 5.201 |
| gender | .57 | | .62 | |
| white | .93 | | .91 | |
| death | .53 | | 0 | |

1.3.3 Descriptive Analysis

All analyses were done using STATA version 7. Table 1.1 shows the summary statistics. Nineteen cases had no treatment related to lung cancer, so we dropped these cases from the sample. Out of the remaining 183 patients, we had complete data for 135 cases and incomplete data for 48 cases. So, approximately 26 percent of the sample had censored data.

As shown in Table 1.1, the patient sample can be described as predominantly white and in their early seventies for both censored and uncensored cases. Two thirds of the patients with complete data were diagnosed with late stage disease, whereas half of the patients with incomplete data were diagnosed with late stage disease. Eighty-three patients who have complete data were hospitalized for reasons other than lung cancer surgery, while 26 censored patients were hospitalized. Most patients had three or more comorbidities and experienced some level of symptoms related to cancer treatment. The patient sample is relatively high functioning in terms of physical health. Fifty-seven percent of the complete cases were male, relative to 62% of the cases in the censored data.

The last four rows of Table 1.1 show the categorical variables related to treatment types. For censored and uncensored cases, we have similar percentages for the patients who had radiation only or chemotherapy only, approximately 25% and 7% respectively. Twenty-four percent of the patients had surgery or surgery plus adjuvant therapy in the complete cases, whereas 35% had them in the incomplete cases.
For chemotherapy and radiation, the ratios are 36% and 54% for uncensored and censored cases, respectively.

The dependent variable, total Medicare payments in the two years following diagnosis, is shown in the first row of Table 1.1. Considering the mean alone, we find that the total cost of all care for the two years following a lung cancer diagnosis is $63,939 for complete cases and $62,877 for incomplete cases.

1.3.4 Survival Curves

Figure 1.1 and Figure 1.2 show the separately estimated baseline survival curves for the variable of interest after we conditioned on the explanatory variables. For each graph, we estimated a separate Cox (1972) proportional hazards model on the explanatory variables of interest so that we can compare the effects of certain variables on survival time and total medical cost conditioning on the others.

As shown in Figure 1.1, the patients with less aggressive disease have better survival probabilities after we control for physical health three months prior to diagnosis, age, gender, race, comorbid conditions, hospitalization and the treatments. The chances are approximately 90% for early stage and 30% for late stage.

[Figure 1.1: Survival functions according to disease stage level. Plot title: "Survivor functions, by lstage, adjusted for mospf3m73 age72 sex lcomorbi hospitalize chemon..."; curves: lstage 0 and lstage 1; vertical axis: survival probability, 0.00 to 1.00; horizontal axis: analysis time, 0 to 800.]

Interestingly, the patients who have hospitalizations for reasons other than lung cancer surgery have a 30% chance of survival, compared to 95% for those who do not, conditioning on the other explanatory variables (Figure 1.2). For the other categorical variables (comorbid conditions, race, and treatment types) we did not observe differences in the baseline curves after we controlled for the explanatory variables.
1.3.5 Regression Analysis

Our aim is to determine how the variables age, gender, comorbid conditions, stage of cancer, symptoms, death status, physical functions, hospitalization, and treatment account for the total medical cost of lung cancer in the two years following diagnosis.

We found that costs are skewed to the right, so we transformed the cost equation to a log-linear scale. We started with the log-scale residuals from a generalized linear model with a logarithmic link function and found that the log-scale residuals are dense at the tails. Following Manning and Mullahy (2001), we therefore considered an OLS-based model with a log-transformed dependent variable.

Table 1.2 shows the results of the regression analysis predicting total cost of care for the two years following a lung cancer diagnosis. The first column of Table 1.2 shows the unweighted regression coefficients, while the second column shows the weighted regression results. The reference groups for treatment modalities are surgery only and surgery plus adjuvant therapies.

Variables that reach statistical significance (p < .05) include hospitalization for reasons other than lung cancer surgery, chemotherapy only, radiation only, and chemotherapy and radiation. Hospitalization for reasons other than lung cancer surgery increases total medical cost during the period of interest by 107% according to the unweighted estimation and by 114% according to the IPW least squares estimation.

Figure 1.2: Survival functions according to hospitalization for reasons other than lung cancer surgery. [Survivor functions by hospitalize (hospitalize 0, hospitalize 1), adjusted for mospf3m73 age72 sex lstage lcomorbi chemonly ra...; x-axis: analysis time, 0 to 800; y-axis: 0.00 to 1.00.]

Table 1.2: Estimates of the Log(tcost) Equation by OLS and IPW

Explanatory                  OLS          IPW
constant                     10.74        10.70
                             (1.06)**     (1.04)**
hospitalize                  .72          .75
                             (.18)**      (.18)**
chemotherapy only            -.92         -.83
                             (.29)**      (.31)**
radiation only               -.79         -.73
                             (.23)**      (.22)**
chemotherapy and radiation   -.49         -.43
                             (.22)*       (.21)*
death                        -.02         -.03
                             (.12)        (.12)
symptoms                     .004         .009
                             (.014)       (.014)
physical functions           .001         .002
                             (.003)       (.002)
age                          -.003        -.005
                             (.014)       (.013)
gender                       .081         .12
                             (.12)        (.12)
white                        .12          .23
                             (.20)        (.19)
Observations                 135          135
R-squared                    0.13         0.15

Robust standard errors are in parentheses. * significant at 5% level; ** significant at 1% level.

Receiving radiation or chemotherapy, separately or in combination, significantly decreased the total medical cost relative to the mean costs for persons receiving surgery only or surgery plus adjuvant therapies. The estimates with respect to the unweighted and weighted least squares are: for radiation only, 120% and 105%; for chemotherapy only, 148% and 127%; for chemotherapy and radiation, 65% and 55%.

As Table 1.2 shows, age, gender, physical functions, stage, comorbid conditions, race, and death status do not have a statistically significant effect. Our models explain 13% of the variability in total costs in the two years following diagnosis according to the unweighted estimation and 15% according to IPW least squares.

A comparison of the weighted and unweighted estimations does not reveal significant differences, although the former statistically corrects for potential bias. The test developed in section 1.2 can be used to support this argument. We failed to reject the hypothesis that the sampling scheme is exogenous. In this case, the unweighted estimators may also be consistent. Both estimators are reported in Table 1.2 and are statistically and practically the same.

Adjusted means can be calculated using the smearing estimation. These are shown below.
Method                     Mean       Standard Deviation
Uncensored-Unadjusted      $63,939    $41,680
Unweighted Estimation      $64,043    $14,177
Weighted Estimation        $64,563    $15,850

When sample selection depends only on the conditioning variables, or is independent of them, both the weighted and unweighted estimators are consistent. Since the robust form of the Hausman test gives evidence of exogenous sampling, we reached this conclusion. In this case, the theory suggests that the unweighted estimator is more efficient under conditional homoskedasticity. In our model we do not have heteroskedasticity, so the standard errors from the predicted means are in the expected direction.

1.4 Conclusions

Prior to Lin (2000a, 2000b), methods for estimating censored costs incorrectly assumed some homogeneity in the medical cost data, in the sense that costs are independent of covariates such as patient and clinical characteristics. In 2000, Lin developed a technique for estimating censored costs. However, his approach, while correct, is extraordinarily complex and cannot be applied with commercially available statistical software programs. Therefore no empirical tests of this model have been completed.

This paper examines the IPW least squares method, which solves for inconsistencies due to censoring and is easily applicable using most statistical software programs. Under the key assumption that selection is ignorable, the inverse probability weighting scheme identifies the population parameters. The regression method introduced can handle large numbers of continuous and discrete explanatory variables.

The application of the method is a two-step estimation process. In the first step, we estimate selection probabilities by using the product limit estimation where the roles of censoring and survival time are reversed.
At the second step, we estimate heteroskedasticity-robust OLS on the uncensored data set, where each variable is weighted with the inverse of the square root of the estimated selection probabilities from the first stage.

We also developed a test to compare the coefficients estimated by IPW least squares and by OLS. This test can be used to assess the efficiency improvement between the two models. Specifically, if we reject the null hypothesis that the sampling scheme is exogenous, the IPW least squares method should be used because the other method yields inconsistent estimates. Failing to reject the null hypothesis could be used to support unweighted estimation under conditional homoskedasticity.

We also applied the proposed method to an inception cohort of patients newly diagnosed with lung cancer. The findings from the lung cancer study can be summarized as follows. Although lung cancer stage does not affect the total medical cost, it decreases survival time. Comorbid conditions are not significant for the estimation of total medical cost and do not affect survival time. Hospitalization for reasons other than lung cancer surgery decreases survival time and also doubles the total medical cost during the period of interest.

Several limitations should be discussed. The lung cancer study does not demonstrate the full power of the IPW least squares method. First, the sample size is small, and all of the results demonstrated in the first two sections are valid only asymptotically. Second, the censored observations in the data set are relatively homogeneous. Applying OLS to the cases with complete data yields an estimator biased toward the cost of the patients with shorter survival time, because a longer survival time is more likely to be censored. Since a longer survival time tends to be associated with higher medical cost, the cost values of the censored cases should be well above the mean value for cases with complete data.
Figure 1.3 shows that this is not the case for the data in this study. All the censored cases cluster around the mean of the uncensored cases. With a data set in which the number of observations is large and the deviation between censored and uncensored observations is significant, we would see the full power of the IPW least squares method over OLS.

Figure 1.3: Total medical cost distribution among the censored cases. [Scatter plot; x-axis: Survival Time, 200 to 800; y-axis: cost, $0 to $250.]

Third, for the exact asymptotic variances, an adjustment for the first stage estimation should be made. Marginally insignificant variables should therefore be interpreted with caution, since with the adjustment they may turn out to be significant.

In conclusion, our study improves upon previous studies by proposing a multivariate regression analysis that solves for inconsistencies due to censoring, together with a statistical test to assess the efficiency improvement between the old methods and the more easily replicable proposed method. Furthermore, the application to the lung cancer study, including step-by-step procedures, shows how the method can be applied using most statistical software programs.

Chapter 2: The Longitudinal Analysis of Censored Medical Cost Data

Introduction

Proper analysis of treatment cost data is more challenging than is generally recognized. A common problem is censoring, and statistical methods applicable to the estimation of medical cost from censored data are not well developed.

Until recently the methods (Lin et al. 1997, Bang and Tsiatis 2000) for analyzing censored medical cost assumed homogeneity in the data, which in practice is rare. Proper analysis requires multivariate regression analysis. The two approaches developed by Lin (2000a, 2000b) require a high level of computer programming and have not been fully empirically tested.
Analysis of censored data under exogenous sampling can be done easily by applying the ordinary least-squares (OLS) method to the uncensored data. It produces consistent estimators, which we refer to as unweighted estimators throughout the paper. Exogenous sampling in the context of estimation from censored medical cost means that once explanatory variables are selected, such as patient characteristics or the type of treatments, total medical cost does not depend on the censoring time and survival time. Under administrative censoring, that is, when all censoring is caused by study termination, it is reasonable to assume that total cost is independent of censoring time. The exogenous sampling assumption is violated if total cost and survival time are correlated after we condition on explanatory variables. Since longer survival time may be associated with higher medical cost, the unweighted method yields an estimator biased toward the cost of patients with shorter survival times.

In the first chapter, I applied the inverse probability weighted (IPW) least-squares method to predict total medical cost in patients with lung cancer two years after diagnosis. In a more general framework, Wooldridge (1999, 2001) examined the asymptotic properties of the IPW M-estimator for variable probability samples and standard stratified samples, and Wooldridge (2002b) provides an overview of IPW M-estimation for cross-section applications. IPW produces consistent, asymptotically normal coefficients with easily computable standard errors when the exogenous sampling assumption is violated.

With small data sets the resulting estimator may be unstable if the censoring is heavy (Bang 2000). It is necessary to ensure that sufficient follow-up is available during the period for which we wish to compute medical costs.

In this chapter, we first extend the method described in the first chapter to handle data with extensive censoring.
The method covers the partitioned estimation suggested by Lin (2000a); that estimator can be used only for time independent regressors, whereas our method can be applied for both time dependent and time independent explanatory variables. We use weighted estimation, specifically IPW pooled ordinary least squares (POLS) and IPW random effects (RE) models. The choice between the two depends on whether unobserved heterogeneity is present. If it is present, IPW RE should be used; otherwise the simpler IPW POLS will produce consistent, asymptotically normal coefficients.

Second, since an unweighted estimator is inconsistent when exogenous sampling is violated and the weighted estimator is consistent with or without exogenous sampling, the traditional and robust forms of the Hausman (1978) test will be applied to determine systematic differences in the models in a panel data setting.

The third section of the chapter describes a study on the medical cost for lung cancer that is used to demonstrate the methods. The last section presents conclusions.

2.1 General Framework

Suppose that we are interested in the total medical cost over the period $[0, L]$. If data on cost and explanatory variables are available in multiple intervals, such as every month or every year, we can set up the data in a panel format by dividing the entire time period of interest into $G$ intervals: $0 = t_0 < t_1 < \cdots < t_G = L$. Since there is no further medical expense after death, the total cost over $(t_{g-1}, t_g]$ is the same as the cumulative cost at $T_g^* = \min(T, t_g)$, where $T$ is the survival time. The distribution of $T$ is assumed to be continuous on $[0, L]$.

Survival time and medical costs may be subject to right censoring and therefore are not always fully observable. Censoring of cost occurs when a patient's follow-up time is less than $t_g$ and the patient is alive at the time of censoring. Because no further expense is incurred after death, the total costs are known for all observed deaths.
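The interval construction above can be sketched in a few lines. This is a minimal illustration, not the author's procedure: the function name `interval_indicators`, the argument names `z` (follow-up time), `died` (whether death was observed), and the six-month grid are all hypothetical. It relies only on the rule stated above, namely that cost for an interval ending at $t_g$ is censored when follow-up ends before $t_g$ with the patient still alive.

```python
def interval_indicators(z, died, grid):
    """Per-interval observation indicators for one patient.

    z    -- observed follow-up time, min(T, C) (hypothetical units: months)
    died -- True if death was observed, i.e. C >= T
    grid -- interval end points t_1 < ... < t_G = L
    Returns (t_star, s): s[g] is 1 when cost up to min(T, t_g) is fully
    observed, and t_star[g] is min(T, t_g) for those intervals (None otherwise).
    """
    # Cost is censored only if follow-up stops before t_g while the patient is alive.
    s = [1 if (died or z >= t_g) else 0 for t_g in grid]
    # When s[g] = 1, min(z, t_g) coincides with min(T, t_g).
    t_star = [min(z, t_g) if s_g else None for t_g, s_g in zip(grid, s)]
    return t_star, s


grid = [6, 12, 18, 24]                       # assumed six-month intervals over two years
died_at_9 = interval_indicators(9, True, grid)
censored_at_14 = interval_indicators(14, False, grid)
```

In this sketch a patient censored at month 14 still contributes the first two intervals as uncensored observations, while a patient who dies at month 9 is uncensored in every interval because no cost accrues after death.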
One advantage of dividing the total period into intervals is that we can consider the $i$th individual as uncensored in the $g$th interval whenever the censoring time $C_i$ is at least the minimum of $T_i$ and $t_g$. Therefore, some individuals counted as censored in our previous work can be considered uncensored in some interval during the period of interest. The increase in the sample size allows more precise estimators and test statistics with more power.

For the $i$th individual, let $Z_i = \min(T_i, C_i)$, $s_i^t = I(C_i \ge T_i)$, and $s_{ig} = I(C_i \ge T_{ig}^*)$, where $I(\cdot)$ is the indicator function. There are two types of censoring: time censoring if $s_i^t = 0$, that is, $T_i > C_i$, and cost censoring if $s_{ig} = 0$, that is, $\min(T_i, t_g) > C_i$. Let $y_{ig}$ be the total medical cost (or its log transformation) for the $i$th individual for the interval $(t_{g-1}, t_g]$. If there is an initial cost at $t = 0$, we include that cost in the first time interval.

2.1.1 Pooled Ordinary Least Squares (POLS) Estimation

The properties of POLS for the linear model can be summarized as follows. Assume that the model is the usual linear model for i.i.d. cross-sections: for any $i$,

$$y_i = X_i\beta + u_i, \qquad i = 1, 2, \ldots, N, \tag{2.1}$$

where $X_i = (x_{i1}', x_{i2}', \ldots, x_{iG}')'$ is a $G \times K$ matrix of explanatory variables, $\beta$ is the $K \times 1$ vector of unknown regression parameters, and $u_i$ is a $G \times 1$ vector of unobservables with unspecified distribution. Let $S_i$ be a $G \times G$ diagonal matrix whose $g$th diagonal element is $s_{ig} = 1$ if $(x_{ig}, y_{ig})$ is observed and zero otherwise. Generally we have an unbalanced panel. We can then define the explanatory variables and the response variable for the selected sample as $\tilde X_i = S_i X_i$ and $\tilde y_i = S_i y_i$.

Assumption 1:
(i) $E(u_i \mid S_i, X_i) = E(u_i \mid X_i) = 0$.
(ii) $E(\tilde X_i'\tilde X_i)$ has rank $K$.
It is well known that under Assumption 1 the unweighted POLS estimator $\hat\beta_{UP}$,

$$\hat\beta_{UP} = A_{UP}^{-1}\left(N^{-1}\sum_{i=1}^{N}\tilde X_i'\tilde y_i\right), \tag{2.2}$$

where

$$A_{UP} = \left(N^{-1}\sum_{i=1}^{N}\tilde X_i'\tilde X_i\right), \tag{2.3}$$

is consistent on the unbalanced panel, and its asymptotic robust variance matrix is $\hat V(\hat\beta_{UP}) = A_{UP}^{-1} B_{UP} A_{UP}^{-1}/N$, where

$$B_{UP} = N^{-1}\sum_{i=1}^{N}\tilde X_i'(S_i\tilde u_i)(S_i\tilde u_i)'\tilde X_i \tag{2.4}$$

and $\tilde u_i = y_i - X_i\hat\beta_{UP}$ (Wooldridge, 2002).

The key exogenous sampling assumption underlying the validity of the unweighted POLS estimator on the selected sample is given in Assumption 1(i). Exogenous sampling in this setup implies that

$$E(y_{ig} \mid x_{ig}, T_{ig}^*, C_i) = E(y_{ig} \mid x_{ig}), \qquad g = 1, 2, \ldots, G; \quad i = 1, 2, \ldots, N. \tag{2.5}$$

Under administrative censoring $C_i$ is independent of $y_{ig}$, but we would expect that $T_{ig}^*$ and $y_{ig}$ may be correlated. The correlation increases with the length of the interval. Violation of equation (2.5) would yield an inconsistent POLS estimator.

IPW estimation produces consistent and $\sqrt{N}$-asymptotically normal estimators even under the violation of equation (2.5), with the following assumption. Suppose that $T$ and $C$ are independent given $x$.

Assumption 1':
(i) $E(X_i'u_i) = 0$.
(ii) Same as Assumption 1 part (ii).
(iii) $x_{ig}$, $y_{ig}$, $T_{ig}^*$ are observed when $s_{ig} = 1$; $C_i$ is always observed.
(iv) $x_{ig}$ and $y_{ig}$ are ignorable in the selection equation: $P(s_{ig} = 1 \mid x_{ig}, y_{ig}, C_i, T_{ig}^*) = P(s_{ig} = 1 \mid C_i, T_{ig}^*) = P(C_i \ge T_{ig}^*)$.

Another advantage of weighting the observations, other than solving the censoring problem, is that we derive consistency under the weaker Assumption 1'(i) rather than Assumption 1(i). Assumption 1'(iii) simply defines when the data are observable. Part (iv) requires that the selection probability is observable when $s_{ig} = 1$.

Under Assumption 1', the IPW POLS estimator is

$$\hat\beta_{WP} = \hat A_{WP}^{-1}\left(N^{-1}\sum_{i=1}^{N}\hat X_i'\hat y_i\right), \tag{2.6}$$

where

$$\hat A_{WP} = \left(N^{-1}\sum_{i=1}^{N}\hat X_i'\hat X_i\right), \tag{2.7}$$

$\hat X_i = W_i X_i$, $\hat y_i = W_i y_i$, and $W_i$ is a $G \times G$ diagonal matrix in which the $g$th diagonal element is $\sqrt{w_{ig}}$, where

$$w_{ig} = s_{ig}/P(C_i \ge T_{ig}^*). \tag{2.8}$$

$\hat\beta_{WP}$ is consistent and asymptotically normal, and its asymptotic robust variance matrix is

$$\hat V(\hat\beta_{WP}) = \hat A_{WP}^{-1}\hat B_{WP}\hat A_{WP}^{-1}/N, \tag{2.9}$$

where

$$\hat B_{WP} = N^{-1}\sum_{i=1}^{N}\hat X_i'(W_i\hat u_i)(W_i\hat u_i)'\hat X_i \tag{2.10}$$

and $\hat u_i = y_i - X_i\hat\beta_{WP}$ (Wooldridge, 1999).

Each observation $(y_{ig}, x_{ig})$ is weighted by the inverse probability of appearing in the sample. Assumption 1' part (iv) requires $P(C_i \ge T_{ig}^*)$ to be known whenever $s_{ig} = 1$, so $\hat\beta_{WP}$ is computable from observed data assuming we know $P(C_i \ge T_{ig}^*)$. Usually the sampling probability function, $w_{ig}$, is unknown and needs to be estimated. We propose to estimate the unknown survivor function by the Kaplan-Meier (1958) estimator, with the roles of $C$ and $T$ reversed. Define $p(t) = P(C \ge t)$, and let $\hat p(t)$ be the product limit estimator of $p(t)$ based on the data $(Z_i, s_i^c)$ $(i = 1, \ldots, N)$, where $s_i^c = 1 - s_i^t$. Then,

$$\hat w_{ig} = \frac{s_{ig}}{\hat p(T_{ig}^*)}, \qquad i = 1, \ldots, N; \quad g = 1, \ldots, G. \tag{2.11}$$

Lemma 4.3 in Newey and McFadden (1994) shows that if $w_{ig}$ in (2.8) is replaced with the consistent estimator $\hat w_{ig}$, under the conditions in which the uniform weak law of large numbers can be applied, then $\hat\beta_{WP}$ consistently estimates $\beta$ in equation (2.1).

The estimated covariance matrix in (2.9) is the White (1980) heteroskedasticity-consistent covariance matrix applied to all variables for observation $i$ at the $g$th interval, weighted by $\sqrt{w_{ig}}$. Censoring therefore can be handled fairly easily because most standard statistics software programs compute a heteroskedasticity-consistent covariance matrix. This simplicity does not carry over to the variance of IPW POLS when $w_{ig}$ is replaced with $\hat w_{ig}$, because the variance should be adjusted for the estimation of the selection probabilities. Fortunately, Wooldridge (2002b) shows that estimating the selection probabilities leads to a more efficient estimator than using known probabilities.
In other words, if we compute the asymptotic covariance matrix as if we had no estimated probabilities, and if we get significant estimators by using the easily computable matrix in (2.9), we know that they would have smaller standard errors under the corrected covariance matrix calculation. This is somewhat unusual for two-step estimation problems. Estimating (2.9) by using $\hat w_{ig}$ instead of $w_{ig}$ results in conservative inference.

The steps for deriving consistent two-step IPW least-squares estimators and their unadjusted asymptotic variance estimators can be summarized as follows.
(i) Calculate the product-limit estimator, $m_i$, based on the data $(Z_i, 1 - s_i^t)$ $(i = 1, \ldots, N)$.
(ii) Generate $\hat p_{ig} = m_{ig}$, where $m_{ig}$ is the value of $m_i$ at $T_{ig}^*$, and $s_{ig} = 1$ if $(y_{ig}, x_{ig})$ is observed at $(t_{g-1}, t_g]$.
(iii) Generate the weight $\hat w_{ig} = s_{ig}/\hat p_{ig}$ $(i = 1, \ldots, N)$.
(iv) Generate the weighted response and explanatory variables: $y_{ig}^* = \sqrt{\hat w_{ig}}\, y_{ig}$, $x_{ig}^* = \sqrt{\hat w_{ig}}\, x_{ig}$ $(i = 1, \ldots, N)$.
(v) Compute the OLS regression of $y_{ig}^*$ on $x_{ig}^*$ with the robust option.

2.1.2 Random Effect Model

Panel data usually provides the researcher with a large number of data points, which increases the degrees of freedom and reduces collinearity among explanatory variables. Panel data also provides a way to resolve or reduce the magnitude of an econometric problem that often arises in empirical studies, namely, omitted variables that are correlated with explanatory variables. By using information on both the intertemporal dynamics and the individuality of the entities being investigated, one is better able to control for the effects of unobserved variables (Hsiao, 1999).

Let us first investigate assumptions under which the random effects estimator is consistent under exogenous selection. The model is the unobserved effects model: for any $i$,

$$y_{ig} = x_{ig}\beta + a_i + u_{ig}, \qquad g = 1, 2, \ldots, G, \tag{2.12}$$

where $a_i$ is the unobserved effect, $x_{ig}$ is $1 \times K$, and $\beta$ is the $K \times 1$ vector of interest.
We can write the model as

$$y_i = X_i\beta + v_i \tag{2.13}$$

by defining $y_i = (y_{i1}, y_{i2}, \ldots, y_{iG})'$ and $v_i = a_i j_G + u_i$, where $j_G$ is the $G \times 1$ vector of ones, $u_i = (u_{i1}, u_{i2}, \ldots, u_{iG})'$ and $X_i = (x_{i1}', x_{i2}', \ldots, x_{iG}')'$. Define the variance matrix of $v_i$ over uncensored cases as $\tilde\Omega_i = S_i E(v_i v_i') S_i$, a $G \times G$ matrix that we assume to be positive definite.

Assumption 2:
(i) $E(v_i \mid X_i, S_i) = E(v_i \mid X_i) = 0$.
(ii) rank $E(\tilde X_i'\tilde\Omega_i^{-1}\tilde X_i) = K$.

Under Assumption 2, generalized least squares (GLS) over uncensored data is consistent, and it can be shown that a consistent estimator of the RE model is

$$\hat\beta_{RE} = A_{RE}^{-1}\left(N^{-1}\sum_{i=1}^{N}\tilde X_i'\tilde\Omega_i^{-1}\tilde y_i\right), \tag{2.14}$$

where

$$A_{RE} = N^{-1}\sum_{i=1}^{N}\tilde X_i'\tilde\Omega_i^{-1}\tilde X_i, \tag{2.15}$$

with variance matrix

$$\hat V(\hat\beta_{RE}) = A_{RE}^{-1} B_{RE} A_{RE}^{-1}/N, \tag{2.16}$$

where [...]

$$[\ldots] = \hat\sigma^2 A_{UP}^{-1}/N, \tag{2.23}$$

provided we have a consistent estimator $\hat\sigma^2$ of $\sigma^2$.

The homoskedasticity assumption under RE is

$$E(\tilde X_i'\tilde\Omega_i^{-1} v_i v_i'\tilde\Omega_i^{-1}\tilde X_i) = E(\tilde X_i'\tilde\Omega_i^{-1}\tilde X_i). \tag{2.24}$$

Then the unweighted RE variance estimator becomes

$$\hat V(\hat\beta_{RE}) = A_{RE}^{-1}/N. \tag{2.25}$$

In general form, the Hausman test can be stated as

$$H = (\hat\theta_w - \hat\theta_u)'\,\hat V^{-1}(\hat\theta_w - \hat\theta_u). \tag{2.26}$$

For weighted and unweighted POLS, choose $\hat\theta_w$, $\hat\theta_u$ as $\hat\beta_{WP}$, $\hat\beta_{UP}$, respectively, and $\hat V \equiv \hat V_w - \hat V_u$, where $\hat V_w$ is defined in equation (2.9) and $\hat V_u$ is defined in equation (2.23) under the homoskedasticity assumption. For the RE model, $\hat\theta_w$, $\hat\theta_u$ are $\hat\beta_{WRE}$, $\hat\beta_{RE}$; $\hat V_w$ is as in equation (2.20), and $\hat V_u$ is as in equation (2.25).

In many cases we may want to use a Hausman test when the homoskedasticity assumption is violated. This requires a robust form that replaces $\hat V$. For the POLS estimation:

$$\left(\hat A_{WP}^{-1} \mid -A_{UP}^{-1}\right)\left(N^{-1}\sum_{i=1}^{N}\sum_{g=1}^{G}\hat e_{ig}\hat e_{ig}'\right)\left(\hat A_{WP}^{-1} \mid -A_{UP}^{-1}\right)'/N, \tag{2.27}$$

where $\hat e_{ig} = (\hat w_{ig}\hat u_{ig}x_{ig},\ s_{ig}\tilde u_{ig}x_{ig})'$, and $\hat u_{ig}$ and $\tilde u_{ig}$ are the residuals after weighted and unweighted POLS estimation. For RE estimation:

$$\left(\hat A_{WRE}^{-1} \mid -A_{RE}^{-1}\right)\left(N^{-1}\sum_{i=1}^{N}\hat e_i\hat e_i'\right)\left(\hat A_{WRE}^{-1} \mid -A_{RE}^{-1}\right)'/N, \tag{2.28}$$

where $\hat e_i$ stacks the corresponding weighted and unweighted RE terms [the exact expression is illegible in the source], and $\hat v_i$, $\tilde v_i$ are the residuals after weighted and unweighted RE estimation (Wooldridge, 1995).

The methods described are easily applicable using standard commercial statistical software programs. The traditional Hausman test is built in to most statistical programs, but the robust form of the Hausman test requires programming. We can use an alternative approach, the regression-based Hausman test, for easy computation of the robust form. Ruud (1984) and Wooldridge (1990) examine this issue. Since the Hausman test compares systematic differences in the coefficients, if we regress the dependent variable on both the weighted and unweighted explanatory variables and the coefficients are not different, then the F test for the coefficients on the weighted explanatory variables should result in an insignificant value. It can be shown that the statistics obtained from this procedure are asymptotically equivalent to Hausman statistics that compare the difference of weighted and unweighted estimators.

To obtain the robust form of the Hausman test:
(i) Compute the POLS (RE) regression of $y_{ig}$ on $x_{ig}$ and $x_{ig}^*$ with the heteroskedasticity robust option.
(ii) Compute the F test for the coefficients of $x_{ig}^*$.

We can also obtain the traditional Hausman test statistics if we repeat (i) without the heteroskedasticity robust option. If the Hausman test indicates rejection, then the exogenous sampling assumption is violated and the unweighted estimators are inconsistent. A failure to reject means the coefficients from the unweighted and weighted estimators are not systematically different and can be used as evidence of exogenous sampling.

2.3 The Lung Cancer Study

2.3.1 The Data

The data are from a project entitled "Family Home Care for Cancer: A Community-Based Model" from the National Institute of Nursing Research and National Cancer Institute (grant No. NR1915-06), which studied 202 Medicare beneficiaries over age 65 who were diagnosed with lung cancer from 1994 through 1997. Among them, 183 subjects had some kind of treatment, whether surgery, radiation, or chemotherapy.

Medicare claim files for each patient for two years following diagnosis were obtained. These files revealed monthly cost values, treatment types, hospitalization, and death status during the 24 months. Payments by Medicare were used as a proxy for direct Medicare costs, as opposed to bill charges. Costs are adjusted for inflation to 1997 prices by using the National Medicare Price Index, 1994-1997.

Patient information (such as age, sex, race) was obtained through interviews. In addition, we collected data on patients' physical function three months prior to diagnosis as measured by the Short Form 36 (SF-36). Comorbid conditions were assessed by questions from the Aging and Health in America Survey (1996), which documents 15 diseases and health problems other than lung cancer. Disease stage was determined by the American Joint Committee on Cancer (AJCC) Tumor Nodes & Metastasis (TNM) staging system, which was applied to pathological data obtained from an audit of patients' medical records.

The medical costs are censored for patients alive at the end of 1997 and when patient follow-up is less than two years. Because censoring is solely caused by the limited study duration, it is reasonable to assume that censoring is independent of all other random variables.

The distribution of average monthly cost values for uncensored cases is given in Figure 2.1. It shows that medical care expenditures for lung cancer patients spike in the first month after diagnosis, during the surgical period. Interventions such as surgery and radiation incur large costs within the first couple of months, whereas chemotherapy may be administered over a much longer time.
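Because censoring in these data is administrative, the censoring survivor function $p(t) = P(C \ge t)$ used for the IPW weights in Section 2.1.1 can be estimated from follow-up times and death indicators alone. The following is a minimal sketch of a product-limit estimator with the roles of death and censoring reversed. It is not the Stata routine used in the study: the function name `censoring_survivor`, the variable names `z` and `died`, and the toy data are hypothetical, and the handling of ties (everyone with follow-up at least $u$ counts as at risk at $u$) is a simplifying assumption.

```python
def censoring_survivor(z, died):
    """Product-limit estimate of p(t) = P(C >= t), treating censoring as the event.

    z    -- follow-up times, min(T_i, C_i)
    died -- True where death was observed (C_i >= T_i), False where censored
    Returns a function p_hat(t).
    """
    censor_times = sorted({zi for zi, di in zip(z, died) if not di})
    factors = []
    for u in censor_times:
        at_risk = sum(1 for zi in z if zi >= u)                          # still followed at u
        censored = sum(1 for zi, di in zip(z, died) if zi == u and not di)
        factors.append((u, 1.0 - censored / at_risk))

    def p_hat(t):
        p = 1.0
        for u, f in factors:
            if u < t:        # P(C >= t) only involves censoring strictly before t
                p *= f
        return p

    return p_hat


# Toy follow-up data (assumed units: months); two of five patients are censored.
z = [3, 5, 5, 8, 10]
died = [True, False, True, False, True]
p_hat = censoring_survivor(z, died)
weight_for_death_at_10 = 1.0 / p_hat(10)     # inverse probability weight, s = 1
```

A patient who dies late is observed only if censoring has not yet occurred, so the estimated probability is small and the weight is large. This is how the weighting compensates for the under-representation of long survival times among complete cases.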
2.3.2 Regression Analysis

Two analyses were performed to examine how patient- and treatment-related variables explain total medical cost for older persons newly diagnosed with lung cancer. Total medical cost is the expenditure incurred from initiation of treatment until death or during two years, whichever comes first. Following Manning and Mullahy (2001), our cost values satisfied the conditions in which an OLS-based model with a log-transformed dependent variable is suitable.

Table 2.1 shows the results of the regression analysis predicting the total cost of care. The first two columns present the regression results estimated by POLS and IPW POLS. As emphasized throughout this paper, POLS is likely to suffer from omitted variable problems. Therefore, RE and IPW RE models were estimated as an alternative way to use panel data to view the unobserved factors affecting the dependent variable. Almost all models explained 60% of the variation in total cost.

Figure 2.1: Distribution of average monthly cost. [x-axis: Months, 2 to 24; y-axis: Average Monthly Cost, $0 to $20,000.]

Table 2.1: Estimation of Log of Total Medical Cost from Longitudinal Data

Variables             POLS         IPWPOLS      RE           IPWRE
constant              6.67         6.47         6.80         6.23
                      (1.52)**     (0.68)**     (1.57)**     (1.54)**
late stage            -1.10        -1.17        -1.02        -1.12
                      (0.20)**     (0.09)**     (0.20)**     (0.20)**
late comorbidity      0.48         0.51         0.49         0.53
                      (0.20)*      (0.09)**     (0.21)*      (0.19)**
hospitalize           3.61         3.71         3.50         3.45
                      (0.23)**     (0.17)**     (0.21)**     (0.20)**
radiation             4.05         4.08         3.88         3.83
                      (0.16)**     (0.17)**     (0.16)**     (0.16)**
radiation_1           1.02         1.06         0.90         0.87
                      (0.18)**     (0.22)**     (0.18)**     (0.18)**
radiation_2           0.90         0.90         0.71         0.64
                      (0.22)**     (0.21)**     (0.21)**     (0.21)**
chemotherapy          2.95         2.99         2.78         2.73
                      (0.22)**     (0.18)**     (0.21)**     (0.21)**
chemotherapy_1        1.11         1.16         1.04         0.99
                      (0.18)**     (0.21)**     (0.18)**     (0.17)**
chemotherapy_2        0.96         0.94         0.88         0.77
                      (0.19)**     (0.19)**     (0.19)**     (0.18)**
other                 2.97         2.07         2.93         2.90
                      (0.21)**     (0.21)**     (0.21)**     (0.20)**
other_1               1.35         1.40         1.30         1.26
                      (0.22)**     (0.24)**     (0.23)**     (0.22)**
other_2               0.65         0.64         0.57         0.49
                      (0.24)**     (0.24)**     (0.24)**     (0.23)**
death                 0.02         0.06         0.20         0.18
                      (0.19)       (0.09)       (0.19)       (0.18)
physical functions    0.008        0.008        0.007        0.007
                      (0.003)*     (0.002)**    (0.003)*     (0.03)*
age                   -0.02        -0.02        -0.02        -0.01
                      (0.02)       (0.01)*      (0.02)       (0.02)
sex                   -0.33        -0.32        -0.30        -0.30
                      (0.18)       (0.18)       (0.19)       (0.19)
white                 -0.29        -0.34        -0.21        -0.27
                      (0.36)       (0.17)*      (0.19)       (0.34)
observations          4000         4000         4000         4000
r-squared             0.61         0.59         0.60         0.61

Robust standard errors are in parentheses. * significant at 5% level; ** significant at 1% level.

The population may have a different distribution in different periods, so we allow the intercept to differ across months. We chose the first month after diagnosis as the base month and included dummy variables for all but the first month after diagnosis. The coefficients were all negative and statistically significant (p < .05). Figure 2.2 shows the pattern of the absolute value of the coefficients under POLS and IPW POLS estimation. For example, after we control for patient- and treatment-related variables, a patient's total medical cost is 4.3 times less in month 24 after diagnosis than in the first month after diagnosis.

As shown in Table 2.1, the control variables are age, gender, race, comorbid conditions, stage of cancer, death status, physical function, and treatment-related variables. We divided treatment into four categories: no treatment, radiation only, chemotherapy only, and others, which includes chemotherapy and radiation, surgery only, and surgery plus other therapies. No treatment was chosen as the reference group. Our time independent variables are gender, race, comorbid conditions, stage of cancer, and physical function.

The only variable that did not reach statistical significance under POLS and IPW POLS estimation is death. The coefficients for physical functioning and age, while statistically significant, are small in magnitude.
On average, expenses for male patients were almost 31% less than for female patients. Race is also significant: costs for white patients are 34% less than for black patients.

Disease severity measures, such as comorbid condition and stage, have different and statistically significant effects. As shown in columns 1 and 2 of Table 2.1, having three or more comorbid conditions increases cost by almost 48% and 51%, respectively. Disease stage has a large negative effect on costs. Regional stage decreased total cost of care almost 1.1 times compared to in situ or local stage cancer according to POLS and IPW POLS.

Figure 2.2: Distribution of absolute value of monthly dummy coefficients under POLS and IPW POLS estimation. [Series: POLS, IPWPOLS; x-axis: Months, 2 to 24; y-axis: Coefficients.]

Hospitalization for reasons other than lung cancer surgery increases total medical cost 3.6 times (column 1) and 3.7 times (column 2) during the period of interest.

A two-period lag effect is found for the treatment-related variables. If a person receives radiation in a particular month, on average cost increases almost 4 times relative to those who have no treatment in that month, according to both weighted and unweighted POLS estimation. If the same person had radiation one month prior, the effect becomes almost 5 times. To see the effect of radiation alone relative to no treatment, we need to add the three coefficients. Overall, if a person receives radiation, total cost is 6 times more than for someone who had no treatment. The effects for the chemotherapy only and the others categories are almost 5 times.

Note that the coefficients estimated by POLS and IPW POLS are not practically different. Since the exogenous sampling assumption is violated, however, POLS estimators are inconsistent.
The traditional and robust Hausman tests described in Section 2.2 reject the null hypothesis that the sampling scheme is exogenous (the p-value is zero to five decimal places).

Figure 2.3 shows the distribution of the absolute value of the monthly dummy variable coefficients when the first month after diagnosis is the base month. The variation of the coefficients between weighted and unweighted RE estimation is relatively smaller than that of POLS. As revealed in columns 3 and 4 of Table 2.1, death status, race, and age are not statistically significant, and physical functioning is practically insignificant. Gender has a significant effect. The cost for male patients is 29% and 27% less than for female patients according to RE and IPW RE, respectively. We can see differences in the disease severity measures under the weighted and unweighted RE models. Regional stage decreases total cost of care by 1.3 times according to IPW RE and by 1 time according to RE estimation. A patient with late comorbidity conditions paid 50% to 60% more on average, depending on the estimation method. Hospitalization other than for lung cancer surgery increases total medical costs almost 3.5 times, which is very similar to the POLS estimation.

Figure 2.3: Distribution of absolute value of monthly dummy coefficients under RE and IPW RE estimation. [Line plot; vertical axis: coefficients, 0.0 to 4.5; horizontal axis: months 2 to 24; series: RE, IPW RE.]

The two-period lag effect persists under the RE estimation methods. The permanent effects of radiation only, chemotherapy only, and others are 5.4, 4.5, and 4.7 times greater relative to no treatment under unweighted RE estimation, and 5.7, 4.7, and 4.8 times greater under IPW RE estimation. Since survival times are correlated with total cost, the values in column 4 of Table 2.1 are the consistent estimates.
Unweighted RE estimators, given in column 3, are consistent under exogenous sampling. However, as in the case of POLS, both the traditional and the robust form of the Hausman test reject the null hypothesis (the p-value is zero to five decimal places). So the consistent estimates are the ones in column 4.

2.4 Conclusion

The IPW least-squares method was applied to longitudinal data to illustrate how the censoring problem can be solved. The main motivation for developing regression methods is to handle a large number of continuous and discrete covariates. POLS and RE models were analyzed and their statistical properties examined under censoring. The usual POLS and RE estimators are inconsistent without exogenous sampling. Since survival times are correlated with total medical cost, the exogenous sampling assumption is violated. IPW estimation produces consistent and $\sqrt{N}$-asymptotically normal estimators. The method is easy to apply and can be carried out with most statistical software programs on the market. Step-by-step procedures were provided.

Since unweighted POLS and RE estimators are consistent under exogenous sampling and more efficient under the homoskedasticity assumption, the Hausman test can be used to compare the systematic differences in coefficients between weighted and unweighted estimators. Traditional and robust forms of the Hausman test were described for choosing between the models.

The lung cancer study, although it does not demonstrate the full power of the IPW least-squares method, served as an example of how to use the proposed regression methods and test statistics. We created artificial panel data by dividing the two years after diagnosis into months. Better estimates could be obtained if the data set had originally been collected in panel format. That would be an interesting research topic to explore.
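The weighting idea summarized in this conclusion can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used in this study: the function name `ipw_least_squares`, the single-covariate setup, and the toy numbers are hypothetical, and the probabilities `p` of being fully observed are taken as given rather than estimated from the censoring distribution.

```python
def ipw_least_squares(y, x, s, p):
    """Weighted least squares of y on (1, x) with weights w_i = s_i / p_i.

    y, x : outcomes and a single covariate (hypothetical toy setup)
    s    : 1 if the cost is fully observed, 0 if it is censored
    p    : assumed probability of being fully observed (taken as given)
    Returns (intercept, slope).
    """
    w = [si / pi for si, pi in zip(s, p)]  # censored cases get weight 0
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Toy data: the third case is censored, so its outcome never enters the fit.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 0.0, 14.0]  # 0.0 is only a placeholder for the unobserved cost
s = [1, 1, 0, 1]
p = [0.9, 0.8, 0.7, 0.5]
intercept, slope = ipw_least_squares(y, x, s, p)
```

Multiplying the outcome and every regressor, including the constant, by the square root of the weight and then running ordinary least squares gives the same coefficients as this weighted fit.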
Chapter 3: Full Parametric Estimation of Censored Medical Cost

Introduction

Due to the escalating cost of medical care, it is important that the costs of health care interventions and treatments are carefully assessed. A common problem with medical cost data is censoring, since not all patients are followed until the endpoint of interest. Therefore, their medical costs are not fully observed.

The estimation of medical costs might be addressed through multivariate regression analysis. Multivariate analysis can control for patient and clinical characteristics to estimate medical cost. Using regression methodology to estimate medical cost is a relatively new technique. The regression methodology would be particularly valuable in identifying cost-effective intervention programs. These intervention programs require that treatment costs be compared with alternatives, which in turn requires proper analysis of conditional means.

Although several non-parametric approaches (Lin, 2000a) and a semi-parametric approach (Lin, 2000b) have been suggested for handling censored data, there is currently no valid full parametric regression method for assessing the effects of covariates (e.g., patient and clinical characteristics) on censored medical costs. Non-parametric methods are often not as efficient as parametric methods. Parametric methods are the "best practice" for estimation. They provide speed, accuracy, and flexibility to the estimation process.

This chapter suggests a full parametric method for estimating the parameters in linear structural equations when the selection rule is governed by the Tobit model. The resulting estimators are shown to be consistent and asymptotically normal. The procedure introduced in this paper involves a two-stage estimation. In the first stage, the selection equation is estimated by Tobit, and in the second stage, an additional variable constructed from the estimated Tobit parameters is included in the structural equation to correct for possible sample selection bias.
The resulting model in this paper could be estimated by full maximum-likelihood estimation (MLE). However, the two-stage approach has the advantage of being easier to compute and is usually more robust than full MLE. The drawback of our approach is that the asymptotic variance matrix is cumbersome to estimate, although it can be done by using the general methods in Newey and McFadden (1994).

This chapter is organized as follows. Section 3.1 outlines the general framework to show the conditions under which the ordinary least squares (OLS) estimator using the selected sample is consistent. Section 3.2 demonstrates how to apply this framework to the cases for which the selection rule is determined by the Tobit model, with the specific example of estimating censored medical cost data, including statistical properties of the estimation and step-by-step procedures. Section 3.3 describes an application of our methods to a study on medical costs, with a comparison to the results of the non-parametric method in the first chapter. Concluding remarks are given in Section 3.4. The appendix contains proofs of the propositions in the main text.

3.1 General Framework

Assume that there is a population represented by the random vector $(x, y)$, where $x$ is a $1 \times K$ vector of explanatory variables and $y$ is the scalar response variable. Suppose that the population model of interest is

\[ y = \beta_1 + \beta_2 x_2 + \dots + \beta_K x_K + u = x\beta + u, \tag{3.1} \]

where we define $x_1 \equiv 1$ and $u$ is the error term. Let $s$ be a binary indicator such that $s = 1$ if $(x, y)$ is observed and $s = 0$ otherwise. Assume that $s = h(x)$ for some non-random function $h(\cdot)$. Let $(x_i, y_i)$, $i = 1, 2, \dots, N$, be a random sample from the population and let $s_i = h(x_i)$. Then the OLS estimator using the selected subsample can be written as

\[ \hat{\beta} = \Big( N^{-1} \sum_{i=1}^{N} s_i x_i' x_i \Big)^{-1} \Big( N^{-1} \sum_{i=1}^{N} s_i x_i' y_i \Big). \tag{3.2} \]

Theorem 1: In model (3.1), assume that $E(u^2) < \infty$ and $E(x_j^2) < \infty$, $j = 1, 2, \dots, K$. Let $s = h(x)$ be a binary indicator for a non-random function $h(\cdot)$.
Assume that $E(u \mid x) = 0$ and $\operatorname{rank} E(s x' x) = K$. Then the OLS estimator using the selected sample, given by equation (3.2), is consistent for $\beta$.

All proofs are given in the Appendix. In the next section, we show how to apply this framework to cases where selection is determined by a censored selection variable.

3.2 Statistical Methods

Suppose that we are interested in the total medical cost over the period $[0, L]$. Since there is no further medical expense after death, the total cost over $[0, L]$ is the same as the cumulative cost at $T^* = \min(T, L)$, where $T$ is the survival time. Assume that the population model of interest is defined as in (3.1), where $y$, $x$, and $\beta$ are, respectively, a scalar representing cumulative cost (or transformed cost) at $L$ or $T$, a $1 \times K$ vector of explanatory variables (patient characteristics, treatment types, and others), and a $K \times 1$ vector of unknown regression parameters.

Medical costs may be subject to right censoring and therefore are not always fully observable. Let $C$ be the time of censoring. Suppose individuals enter the study at different times and the terminal point of the study is predetermined by the researcher, so that censoring times are known when an individual enters the study. This form of censoring is called administrative censoring. From Figure 3.1, we can see that subject 2 and subject 4 are subject to administrative censoring. A convenient representation of such data is to rescale each individual's starting time to 0, as shown in Figure 3.2.

Since we are interested in the total medical cost over the period $[0, L]$, we impose a second type of censoring, which we will call artificial censoring, for some cases. From Figure 3.2, cases such as subject 1 or subject 3, who are not censored due to administrative censoring, will be artificially censored because their duration on study is greater than $L$. Subject 2 is both artificially and administratively censored. Note that in such cases, artificial censoring precedes administrative censoring.
Subject 5 is neither artificially nor administratively censored.

Figure 3.1: Administrative censoring when each individual has a different starting time. [Timeline plot of subjects 1 to 5; horizontal axis: months.]

Figure 3.2: Starting time backed up to 0 for the individuals facing administrative censoring. [Timeline plot of subjects 1 to 5; horizontal axis: months.]

Let $s = 1(C > T^*)$, where $1(\cdot)$ is the indicator function. So $y$ is observed when $s = 1$ (all cases except subject 4 in Figure 3.2). We calculate the expected value of $y$ conditional on $x$, $T^*$, $C$; that is,

\[ E(y \mid x, T^*, C) = x\beta + E(u \mid x, T^*, C) = x\beta + E(u \mid x, T^*). \tag{3.3} \]

The second equality holds because $C$ is caused by study termination and $L$ is determined by the research question (e.g., a week, a month, a year), so both are independent of $y$ and $x$. However, survival time depends on patient characteristics, treatment types, and other factors. Let $\log(T) = x\alpha + v$. Therefore, OLS estimation of $y$ on $x$ yields inconsistent estimators because, in the selected sample, the conditional mean of $u$ given $x$ is not equal to 0.

The problem described so far can be transformed into the following statistical model:

\[ y = x\beta + u \tag{3.4} \]
\[ \log(T^*) = \min(\log L,\; x\alpha + v) \tag{3.5} \]

where $x$ is always observed in the population but $y$ is observed only when $\log(T^*) < \log C$. Equations (3.4) and (3.5) are known as the censored Tobit model (after Tobin, 1958). We refer to equation (3.4) as the "structural regression equation" and to equation (3.5) as the "selection equation." With the following assumption we show how to estimate $\beta$ and its asymptotic covariance matrix consistently.

Assumption 1
(i) $x$ is always observed in the population, but $y$ is observed only when $T^* < C$.
(ii) $(u, v)$ is independent of $x$ with zero mean.
(iii) $v \sim \mathrm{Normal}(0, \tau^2)$.
(iv) $E(u \mid v) = \rho v$.

Assumption 1, part (i), defines the particular sample selection problem: $y$ is not observable unless $T^* < C$. Part (ii) is a standard form of exogeneity of $x$.
Part (iii) is the most restrictive assumption, but it is needed to derive a conditional expectation given the selected sample. Part (iv) requires linearity in the population regression of $u$ on $v$. Under bivariate normality it always holds. However, the normality of $u$ is not necessary.

Under Assumption 1,

\[ E(u \mid x, T^*) = \rho\, E(v \mid x, T^*). \tag{3.6} \]

Equation (3.3) can be written as

\[ E(y \mid x, T^*, C) = x\beta + \rho\, E(v \mid x, T^*). \tag{3.7} \]

If we calculate $E(v \mid x, T^*)$, then from Theorem 1 we could consistently estimate $\beta$ and $\rho$ from the regression of $y$ on $x$ and $E(v \mid x, T^*)$. From (3.5), for the uncensored cases, $1[\log T < \log L]$,

\[ E(v \mid x, T^*) = \log T - x\alpha = v, \tag{3.8} \]

and for the censored cases, $1[\log T \ge \log L]$,

\[
\begin{aligned}
E(v \mid x, T^*) &= E(v \mid x,\; x\alpha + v \ge \log L) \\
&= \tau\, E\!\left( \frac{v}{\tau} \,\Big|\, x,\; \frac{v}{\tau} \ge \frac{\log L - x\alpha}{\tau} \right) \\
&= \tau\, \frac{\phi\!\left( \frac{\log L - x\alpha}{\tau} \right)}{1 - \Phi\!\left( \frac{\log L - x\alpha}{\tau} \right)} \\
&= \tau\, \lambda\!\left( \frac{\log L - x\alpha}{\tau} \right),
\end{aligned}
\tag{3.9}
\]

where $\lambda$ is the inverse Mills ratio. Equations (3.8) and (3.9) can be written succinctly as

\[ E(v \mid x, T^*) = 1[\log T < \log L]\, v + 1[\log T \ge \log L]\, \tau\, \lambda\!\left( \frac{\log L - x\alpha}{\tau} \right). \tag{3.10} \]

Estimation of equation (3.5) by Tobit replaces the unknown quantities in equation (3.10) with their consistent estimators. In other words, replacing $v$ with the residuals $\hat{v}$, $\tau$ with the estimated standard error $\hat{\tau}$, and $\alpha$ with the estimated coefficients $\hat{\alpha}$ from the Tobit estimation of equation (3.5) does not affect the consistency of the estimators of the parameters of equation (3.7), $\beta$ and $\rho$. This result follows from Newey and McFadden (1994). Therefore, for each $i$ in the selected sample, if we define

\[ \hat{\eta}_i = 1[\log T_i < \log L]\, \hat{v}_i + 1[\log T_i \ge \log L]\, \hat{\tau}\, \lambda\!\left( \frac{\log L - x_i \hat{\alpha}}{\hat{\tau}} \right) \tag{3.11} \]

and $\hat{g}_i = (x_i, \hat{\eta}_i)$, then the OLS estimator of $\theta = (\beta', \rho)'$ using the selected sample can be written as

\[ \hat{\theta} = \Big( N^{-1} \sum_{i=1}^{N} s_i \hat{g}_i' \hat{g}_i \Big)^{-1} \Big( N^{-1} \sum_{i=1}^{N} s_i \hat{g}_i' y_i \Big). \tag{3.12} \]

Theorem 2: Under Assumption 1, the OLS estimator given by equation (3.12) is consistent for $\theta$, and a consistent estimator of $\mathrm{Avar}(\hat{\theta})$ is
\[ \Big( \sum_{i=1}^{N} s_i \hat{g}_i' \hat{g}_i \Big)^{-1} \Big( \sum_{i=1}^{N} \big( s_i \hat{g}_i' \hat{e}_i + \hat{\rho}\, \hat{C} \hat{r}_i \big) \big( s_i \hat{g}_i' \hat{e}_i + \hat{\rho}\, \hat{C} \hat{r}_i \big)' \Big) \Big( \sum_{i=1}^{N} s_i \hat{g}_i' \hat{g}_i \Big)^{-1} \tag{3.13} \]

where $\hat{C} = \frac{1}{N} \sum_{i=1}^{N} s_i \hat{g}_i'$, $\hat{e}_i = y_i - x_i \hat{\beta} - \hat{\rho} \hat{\eta}_i$ (for $s_i = 1$), and

\[ \hat{r}_i = r_i(\hat{\gamma}) = -\nabla_\gamma \hat{\eta}_i\, \hat{H}^{-1} \sum_{i=1}^{N} \hat{a}_i, \]

where $\hat{\gamma} = (\hat{\alpha}, \hat{\tau})$, $\nabla_\gamma \hat{\eta}_i = \big( \tfrac{\partial \eta_i}{\partial \gamma_1}(\hat{\gamma}), \tfrac{\partial \eta_i}{\partial \gamma_2}(\hat{\gamma}), \dots, \tfrac{\partial \eta_i}{\partial \gamma_P}(\hat{\gamma}) \big)$, $\hat{H}$ is the $P \times P$ Tobit Hessian, and $\hat{a}_i$ is the score of the censored Tobit log likelihood for observation $i$, evaluated at the estimated parameters.

Since the term arising from the generated regressor $\hat{\eta}_i$ does not appear in the variance matrix (3.13) when $\rho = 0$, the usual variance matrix for $\hat{\theta}$ is valid under homoskedasticity and the robust version is valid under heteroskedasticity. Therefore, testing $\rho = 0$ requires only the usual t-statistic or its heteroskedasticity-robust version.

The steps for deriving consistent estimators in the structural equation can be summarized as follows.

(i) Estimate equation (3.5) by Tobit using all $N$ observations. For $\log T < \log L$, define $\hat{v}_{1i} = \log T_i - x_i \hat{\alpha}$. For $\log T \ge \log L$, obtain $\hat{v}_{2i} = \hat{\tau}\, \lambda\!\left( \frac{\log L - x_i \hat{\alpha}}{\hat{\tau}} \right)$.

(ii) Using the observations for which $\log T^* < \log C$, estimate $\beta$ and $\rho$ by the OLS regression

\[ y_i \ \text{on} \ x_i,\ \hat{\eta}_i \tag{3.14} \]

where $\hat{\eta}_i = 1[\log T < \log L]\, \hat{v}_{1i} + 1[\log T \ge \log L]\, \hat{v}_{2i}$.

Regression (3.14) produces consistent, $\sqrt{N}$-asymptotically normal estimators of $\beta$ and $\rho$ under Assumption 1. The statistic to test for censoring bias is just the usual t-statistic on $\hat{\eta}_i$ in regression (3.14). If it is statistically insignificant, the usual variance matrix in regression (3.14) can be used under homoskedasticity, and the robust version is valid under heteroskedasticity. Otherwise, standard errors should be adjusted as described in Theorem 2. In practice, adjusting for the first-step estimators has usually been found to have little effect on the asymptotic standard errors. It is not surprising to find little effect of the adjustment for modest amounts of sample selection, since no correction is needed when $\rho = 0$.

3.3 Lung Cancer Study

We apply our procedure to lung cancer treatment cost data.
The data set was used in our study of inverse probability weighted estimation of censored medical cost data. This data set consists of an inception cohort of 183 lung cancer patients, 48 of whom are subject to administrative censoring and 65 of whom are subject to artificial censoring. Seventy of the cases are neither administratively nor artificially censored.

The dependent variable was the log of total Medicare payments for the two years following diagnosis. The exogenous variables include late disease stage (lstage), late comorbid conditions (lcomorbi), hospitalization for reasons other than lung cancer surgery (hospitalize), treatment types (chemotherapy only, radiation only, chemotherapy and radiation, with the reference group being the combination of surgery only and surgery plus adjuvant therapy), death, symptoms, physical functions, age, sex (=1 if male), and race (=1 if white). Table 3.1 shows the summary statistics for the administratively censored, artificially censored, and uncensored groups.

Table 3.1: Summary Statistics from the Lung Cancer Study

                       Administratively          Artificially              Uncensored
                       Censored (n=48)           Censored (n=65)           (n=70)
Variable               Mean         Std. Dev.    Mean         Std. Dev.    Mean         Std. Dev.
total cost             62878        40115        63344        44646        64490        39075
lstage                 .54 (n=26)                .52 (n=34)                .78 (n=55)
lcomorbi               .62 (n=30)                .65 (n=42)                .64 (n=45)
hospitalize            .54 (n=26)                .46 (n=30)                .76 (n=53)
chemo only             .06 (n=3)                 .06 (n=4)                 .1 (n=7)
radiation only         .25 (n=12)                .20 (n=13)                .31 (n=22)
chemo and radiation    .33 (n=16)                .31 (n=20)                .31 (n=28)
symptoms               10.12        28.71        10.68        5.19         11.56        5.11
physical functions     71.46        28.71        75.70        27.08        71.57        26.51
age                    72.68        5.21         71.91        4.72         72.01        4.99
sex                    .62 (n=30)                .54 (n=35)                .61 (n=43)
race                   .92 (n=44)                .92 (n=60)                .94 (n=66)

Table 3.2 shows the results of the regression analysis predicting total cost of care for the two years following a lung cancer diagnosis. For comparison, we also obtain the estimates using OLS and using the IPW procedure of Chapter 1. The results are obtained using the statistical package Stata.
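Outside Stata, the second-step regressor $\hat{\eta}_i$ of Section 3.2 could be computed along the following lines. This is a minimal Python sketch and not the code used for the results reported here: the function names and the inputs `xa_hat` and `tau_hat` are hypothetical, and the sketch assumes that a first-step Tobit fit is already available from some other routine.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def inverse_mills(c):
    """lambda(c) = phi(c) / (1 - Phi(c)), i.e. E(z | z >= c) for z ~ N(0, 1)."""
    return _STD_NORMAL.pdf(c) / (1.0 - _STD_NORMAL.cdf(c))


def generated_regressor(log_t, log_l, xa_hat, tau_hat):
    """Second-step regressor for one patient.

    log_t   : log survival time (only used when log_t < log_l)
    log_l   : log of the horizon L
    xa_hat  : fitted index x_i * alpha_hat from a first-step Tobit (assumed given)
    tau_hat : first-step estimate of the standard deviation of v (assumed given)
    """
    if log_t < log_l:
        # Uncensored in the selection equation: use the Tobit residual.
        return log_t - xa_hat
    # Censored at L: use tau_hat times the inverse Mills ratio.
    return tau_hat * inverse_mills((log_l - xa_hat) / tau_hat)


# Hypothetical patient who survives beyond a 24-month horizon.
eta_censored = generated_regressor(
    log_t=math.log(30.0), log_l=math.log(24.0), xa_hat=math.log(24.0), tau_hat=2.0
)
# Hypothetical patient who dies before the horizon.
eta_uncensored = generated_regressor(log_t=2.0, log_l=3.0, xa_hat=1.5, tau_hat=1.0)
```

For very large arguments the denominator `1 - Phi(c)` underflows in floating point, so a production implementation would need a numerically safer form of the ratio.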
The robust standard errors are given in parentheses; no adjustment has been made to account for the generated regressor, $\hat{\eta}$, since there is little evidence of sample selection bias. $H_0: \rho = 0$ cannot be rejected even at the 40 percent significance level for any specification. As we showed in Section 3.2, robust standard errors on all explanatory variables are valid when $\rho = 0$.

We found no sample selection bias with the data set in Chapter 1. The results from Procedure 3 support that outcome. The same variables (hospitalization for reasons other than lung cancer surgery, chemotherapy only, radiation only, and chemotherapy and radiation) reach statistical significance (p < .05). The coefficients are closer to the OLS estimates than to the IPW estimates. Hospitalization for reasons other than surgery increases the total medical cost during the period of interest by 109% according to our procedure, 114% according to IPW least-squares estimation, and 107% according to OLS. The total medical cost, relative to the mean costs for persons receiving surgery only or surgery plus adjuvant therapies, decreased for the patients who received radiation or chemotherapy separately or in combination. The estimates with respect to OLS, IPW least squares, and our procedure are: for radiation only, 120%, 105%, and 118%; for chemotherapy only, 151%, 129%, and 146%; for chemotherapy and radiation, 63%, 54%, and 63%. Age, gender, physical function, stage, comorbid conditions, and race do not have a statistically significant effect under any of the three estimation methods.

Our procedure explained 14% of the variability in total costs for the two years following diagnosis. This value lies between the values calculated by OLS (13%) and by IPW least squares (15%).
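The percentages quoted above appear to be consistent with applying the transformation $100 \cdot (\exp(|b|) - 1)$ to the coefficients reported in Table 3.2. This is an inference from the reported numbers rather than a formula stated in the text, and the helper name `pct_effect` in the short Python sketch below is hypothetical.

```python
import math


def pct_effect(coef):
    """Percentage difference implied by a log-cost coefficient: 100 * (exp(|b|) - 1)."""
    return 100.0 * (math.exp(abs(coef)) - 1.0)


# Procedure 3 coefficients as reported in Table 3.2.
procedure3 = {
    "hospitalize": 0.74,
    "chemotherapy only": -0.90,
    "radiation only": -0.78,
    "chemotherapy and radiation": -0.49,
}
effects = {name: pct_effect(b) for name, b in procedure3.items()}
```

Under this reading, the values for Procedure 3 come out close to the 109%, 146%, 118%, and 63% quoted in the text, up to rounding.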
Table 3.2: Estimates of the Log(tcost) Equation by OLS, IPW and Procedure 3

Explanatory Variable           OLS        IPW        Procedure 3
constant                       10.74      10.70      10.73
                               (1.06)     (1.04)     (1.07)
late stage                     .02        -.06       .05
                               (.16)      (.16)      (.17)
late comorbidity               .004       -.046      -.001
                               (.132)     (.136)     (.131)
hospitalize                    .72        .75        .74
                               (.18)      (.18)      (.17)
chemotherapy only              -.92       -.83       -.90
                               (.29)      (.31)      (.29)
radiation only                 -.79       -.73       -.78
                               (.23)      (.22)      (.23)
chemotherapy and radiation     -.49       -.43       -.49
                               (.22)      (.21)      (.22)
symptoms                       .004       .009       .005
                               (.014)     (.014)     (.014)
physical functions             .001       .003       .003
                               (.002)     (.003)     (.003)
age                            -.003      -.005      -.003
                               (.013)     (.012)     (.013)
sex                            .08        .12        .09
                               (.12)      (.12)      (.12)
race                           .12        .23        .13
                               (.20)      (.19)      (.21)
$\hat{\eta}$                                         .04
                                                     (.05)
Observations                   135        135        135
R-squared                      0.13       0.15       0.14
Robust standard errors are in parentheses. * significant at 5% level; ** significant at 1% level.

3.4 Conclusion

This paper presents a new method for testing and correcting for sample selection bias for cross-section data under the assumption that the selection rule is governed by a censored regression model. The method is easily applicable using standard software programs. Application of the method to censored lung cancer medical cost data illustrates its simplicity.

Several limitations should be discussed. The first and most obvious one is the strong distributional assumption on the selection equation needed to derive a conditional expectation given a selected sample. It is possible to derive a semi-parametric extension for the selection equation. If we write equation (3.3) as

\[ E(y \mid x, T^*, C) = x\beta + h(\cdot), \tag{3.15} \]

where $h$ is an unknown function of the sample selection variables, then $h(\cdot)$ can be estimated without specifying a distributional assumption. Powell (1987), Robinson (1988), Newey (1988), and Cosslett (1991) offer different ways of dealing with the presence of the function $h(\cdot)$.

The second limitation is that we assume $E(u \mid v)$ is linear in $v$.
We can also relax this assumption by adding quadratic terms in Assumption 1, part (iv), and the formulas can be adjusted accordingly.

It will be useful to extend the methods to a longitudinal data setting. The results in Section 3.2 are easily modified for a panel data setting.

APPENDICES

.1 Appendix for Chapter 1

Stata Commands

(i)   stset z, failure(1 - s)
      sts gen m = s
(ii)  gen p = m if z < L
      stbase, at(L)
      sts gen l = s
      replace p = l if z >= L
(iii) gen w = s / p
(iv)  gen ys = sqrt(w) * y
      gen xs = sqrt(w) * x
(v)   reg ys xs, robust

.2 Appendix for Chapter 3

Proof of Theorem 1: Substituting $y_i = x_i \beta + u_i$ into equation (3.2) gives

\[ \hat{\beta} = \beta + \Big( N^{-1} \sum_{i=1}^{N} s_i x_i' x_i \Big)^{-1} \Big( N^{-1} \sum_{i=1}^{N} s_i x_i' u_i \Big). \]

Under $E(u \mid x) = 0$, since $s_i = h(x_i)$ is just some function of $x_i$,

\[ E(s_i x_i' u_i) = E\big( E(s_i x_i' u_i \mid x_i) \big) = E\big( s_i x_i' E(u_i \mid x_i) \big) = 0. \]

With the rank condition and the second moment conditions, which are necessary to apply the law of large numbers, consistency follows.

Proof of Theorem 2: Substituting $y_i = x_i \beta + \rho \eta_i + e_i \equiv g_i \theta + e_i = \hat{g}_i \theta + \rho(\eta_i - \hat{\eta}_i) + e_i$ into equation (3.12) gives

\[ \sqrt{N}(\hat{\theta} - \theta) = \Big( N^{-1} \sum_{i=1}^{N} s_i \hat{g}_i' \hat{g}_i \Big)^{-1} \Big\{ N^{-1/2} \sum_{i=1}^{N} \rho\, s_i \hat{g}_i' (\eta_i - \hat{\eta}_i) + N^{-1/2} \sum_{i=1}^{N} s_i \hat{g}_i' e_i \Big\}. \tag{1} \]

Since each estimator is $\sqrt{N}$-consistent for its plim, it can be shown that

\[ N^{-1/2} \sum_{i=1}^{N} s_i \hat{g}_i' e_i = N^{-1/2} \sum_{i=1}^{N} s_i g_i' e_i + o_p(1), \tag{2} \]

where $g_i = (x_i, \eta_i)$. Also, by an application of the UWLLN (see Newey and McFadden (1994, Lemma 4.3)),

\[ N^{-1} \sum_{i=1}^{N} s_i \hat{g}_i' \hat{g}_i \xrightarrow{p} E(s_i g_i' g_i) \equiv A, \tag{3} \]

which is nonsingular by the identification assumption that $E(g_i' g_i \mid s_i = 1)$ has rank $K + 1$. Since $E(e_i \mid x_i, \eta_i) = 0$ and $g_i$ depends only on $x_i$ and $T_i$, consistency of $\hat{\theta}$ can be read off from equation (1) together with equations (2) and (3).

When $\rho \neq 0$, the term involving $\rho$ on the right-hand side of equation (1) contributes to the asymptotic variance of $\hat{\theta}$. Let $\gamma = (\alpha, \tau)$ be a $P \times 1$ vector of unknown parameters, where $P = K + 1$.
Then $\hat{\gamma}$, a $\sqrt{N}$-asymptotically normal estimator of $\gamma$, has the representation

\[ \sqrt{N}(\hat{\gamma} - \gamma) = N^{-1/2} H^{-1} \sum_{i=1}^{N} a_i + o_p(1), \tag{4} \]

where $H$ is the $P \times P$ Tobit Hessian and $a_i$ is the score of the censored Tobit log-likelihood for observation $i$. The formulas are given in Wooldridge (2002, section 16.4). From a mean value expansion,

\[ \hat{\eta}_i = \eta_i + \nabla_\gamma \eta_i(\gamma)(\hat{\gamma} - \gamma), \tag{5} \]

where $\nabla_\gamma \eta_i(\gamma)$ is the $1 \times P$ gradient of $\eta_i(\gamma)$. By combining equations (2) to (4), it can be shown that

\[ \sqrt{N}(\hat{\theta} - \theta) \xrightarrow{d} \mathrm{Normal}(0, A^{-1} B A^{-1}), \tag{6} \]

where $B = \mathrm{Var}(p_i) = E(p_i p_i')$, with $p_i = s_i g_i' e_i + \rho\, s_i g_i' r_i$ and $r_i = -\nabla_\gamma \eta_i(\gamma) H^{-1} \sum_{i=1}^{N} a_i$.

To estimate $\mathrm{Avar}(\hat{\theta}) \equiv A^{-1} B A^{-1} / N$, first define

\[ \hat{A} \equiv N^{-1} \sum_{i=1}^{N} s_i \hat{g}_i' \hat{g}_i \quad \text{and} \quad \hat{B} \equiv N^{-1} \sum_{i=1}^{N} \big( s_i \hat{g}_i' \hat{e}_i + \hat{\rho}\, s_i \hat{g}_i' \hat{r}_i \big) \big( s_i \hat{g}_i' \hat{e}_i + \hat{\rho}\, s_i \hat{g}_i' \hat{r}_i \big)', \tag{7} \]

where $\hat{r}_i = r_i(\hat{\gamma}) = -\nabla_\gamma \eta_i(\hat{\gamma}) \hat{H}^{-1} \sum_{i=1}^{N} \hat{a}_i$, in which the gradient of $\eta_i(\gamma)$, the Tobit Hessian, and the score of the censored Tobit likelihood for observation $i$ are evaluated at $\hat{\gamma}$; $\hat{e}_i = y_i - x_i \hat{\beta} - \hat{\rho} \hat{\eta}_i$ (for $s_i = 1$); and $\hat{g}_i = (x_i, \hat{\eta}_i)$. The asymptotic variance of $\hat{\theta}$ is estimated as $\widehat{\mathrm{Avar}}(\hat{\theta}) = \hat{A}^{-1} \hat{B} \hat{A}^{-1} / N$, and the asymptotic standard errors are obtained as the square roots of the diagonal elements of this matrix. When testing exclusion of the generated regressor, one can take $\hat{\rho} \equiv 0$.

Bibliography

[1] Anderson, C., K. Anderson and P. Kragh-Sorensen (2000), Cost function estimation: The choice of a model to apply to dementia, Health Economics 9, 397-409.
[2] Bang, H., and A.A. Tsiatis (2000), Estimating Medical Costs with Censored Data, Biometrika 87, 329-343.
[3] Cox, D.R. (1972), Regression models and life-tables (with discussion), Journal of the Royal Statistical Society Series B 34, 187-220.
[4] Duan, N. (1983), Smearing Estimate: A non-parametric retransformation method, Journal of the American Statistical Association 78, 697-718.
[5] Gold, M.R., J.E. Siegel, L.B. Russell, and M.C. Weinstein (1996), Cost-Effectiveness in Health and Medicine. New York: Oxford University Press.
[6] Hausman, J.A. (1978), Specification Tests in Econometrics, Econometrica 46, 1251-1271.
[7] Hirano, K., G.W. Imbens, and G. Ridder (2000), Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score, mimeo, UCLA Department of Economics.
[8] Horowitz, J.L. and C.F. Manski (1998), Censoring of Outcomes and Regressors Due to Survey Nonresponse: Identification and Estimation Using Weights and Imputations, Journal of Econometrics 84, 37-58.
[9] Horvitz, D. and D. Thompson (1952), A Generalization of Sampling without Replacement from a Finite Population, Journal of the American Statistical Association 47, 663-685.
[10] Hsiao, C. (1999), Analysis of Panel Data. Cambridge: The University Press.
[11] Kaplan, E.L., and P. Meier (1958), Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53, 457-481.
[12] Katz, J.N., L.C. Chung, O. Sango, A.H. Fossel and D.W. Bates (1996), Can comorbidity be measured by questionnaire rather than medical record review?, Medical Care 34, 73-84.
[13] Lin, D.Y., E.J. Feuer, R. Etzioni and Y. Wax (1997), Estimating Medical Costs from Incomplete Follow-up Data, Biometrics 53, 419-434.
[14] Lin, D.Y. (2000a), Linear Regression Analysis of Censored Medical Cost, Biostatistics 1, 35-47.
[15] Lin, D.Y. (2000b), Proportional Means Regression for Censored Medical Cost, Biometrics 56, 775-778.
[16] Manning, W.G. and J. Mullahy (2001), Estimating log models: to transform or not to transform?, Journal of Health Economics 20, 461-494.
[17] Newey, W.K. (1988), Two Step Estimation of Sample Selection Models, manuscript, Princeton University Department of Economics.
[18] Newey, W.K., and D. McFadden (1994), Large Sample Estimation and Hypothesis Testing, in: R.F. Engle and D. McFadden, eds., Handbook of Econometrics, Volume 4 (North-Holland, Amsterdam) 2111-2245.
[19] Newschaffer, C.J., L. Penberthy, C.E.
Desch, et al. (1996), The effect of age and comorbidity in the treatment of elderly women with nonmetastatic breast cancer, Archives of Internal Medicine 156, 85-90.
[20] Powell, J.L. (1987), Semiparametric Estimation of Bivariate Latent Variable Models, Discussion Paper No. 8704, University of Wisconsin Department of Economics.
[21] Robins, J.M., and A. Rotnitzky (1992), Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers, in: N. Jewell, K. Dietz, and V. Farewell, eds., AIDS Epidemiology: Methodological Issues (Boston) 297-331.
[22] Robins, J.M. and A. Rotnitzky (1995), Semiparametric Efficiency in Multivariate Regression Models with Missing Data, Journal of the American Statistical Association 90, 122-129.
[23] Robins, J.M., A. Rotnitzky and L.P. Zhao (1995), Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data, Journal of the American Statistical Association 90, 106-121.
[24] Robinson, P.M. (1988), Root-N Semiparametric Regression, Econometrica 56, 931-954.
[25] Rosenbaum, P.R. (1987), Model-Based Direct Adjustment, Journal of the American Statistical Association 82, 387-394.
[26] Sloan, J.A., S.S. Cha, J.L. Wagner, S.R. Alberts, and J. Lindman (1999), Analyzing oncology patient health care costs using the SAS system, SUGI-24, Paper 284.
[27] Ware, J., K.K. Snow, and M. Kosinski (2000), Health Survey: Manual and Interpretation Guide, Lincoln, RI: Quality Metric, Incorporated.
[28] White, H. (1980), A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica 48, 817-838.
[29] Wooldridge, J.M. (1990), An encompassing approach to conditional mean tests with applications to testing nonnested hypotheses, Journal of Econometrics 6, 1385-1406.
[30] Wooldridge, J.M. (1995), Selection corrections for panel data models under conditional mean independence assumptions, Journal of Econometrics 68, 115-132.
[31] Wooldridge, J.M. (1999), Asymptotic properties of weighted M-estimators for variable probability sampling, Econometrica 6, 1385-1406.
[32] Wooldridge, J.M. (2000), Introductory Econometrics: A Modern Approach. Cincinnati, OH: South-Western.
[33] Wooldridge, J.M. (2001), Asymptotic Properties of weighted M-estimator for standard stratified samples, Econometric Theory 17, 451-470.
[34] Wooldridge, J.M. (2002a), Econometric Analysis of Cross Section and Panel Data, Cambridge, MA: MIT Press.
[35] Wooldridge, J.M. (2002b), Inverse Probability Weighted M-Estimators for sample selection, attrition, and stratification, mimeo, Michigan State University.