Michigan State University

This is to certify that the dissertation entitled

THE DISTRIBUTION OF THE PRODUCT OF TWO DEPENDENT CORRELATION COEFFICIENTS WITH APPLICATIONS IN CAUSAL INFERENCE

presented by Wei Pan has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology, and Special Education.

Major professor
Date: 12-13-01

THE DISTRIBUTION OF THE PRODUCT OF TWO DEPENDENT CORRELATION COEFFICIENTS WITH APPLICATIONS IN CAUSAL INFERENCE

By

Wei Pan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

2001

ABSTRACT

Causal inference is an important, controversial topic in the social sciences, in which it is difficult to statistically control for all possible confounding variables.
To address this concern, Frank (2000) derives an index, a product of two dependent correlation coefficients (between the confounding variable and the predictor of interest and between the confounding variable and the outcome), to express the sensitivity of regression inferences based on linear modeling to the impact of a confounding variable. Frank's index leads to a promising methodology by which we can inform causal knowledge to address the controversy in causal inference. However, little is known about the behavior of the distribution of the product of two dependent correlation coefficients. Frank used a reference distribution generated through an approximation based on the Fisher z transformation, followed by an approximation to the product of two normal variables; this doubly asymptotic result is therefore tenuous. The present study advances Frank's approach and provides a direct and more accurate approximation to the reference distribution with a closed form, the Pearson Type I (Beta) distribution. A simulation study is conducted to assess the accuracy of the approximation. With the more accurate approximation to the reference distribution, we will have more confidence to conclude whether a causal interpretation of a given predictor is robust to confounding variables, that is, whether uncontrolled confounding variables are unlikely to have impacts great enough to alter an inference about a predictor of interest. This study also expresses the robustness on a probability scale, and guidance for interpreting the magnitude of the probability is given. Applications are illustrated with an example pertaining to educational attainment. The methodology discussed in this study would allow for multiple partial causes in the complex social phenomena that we study, informing causal inferences in the social sciences from statistical linear models. The findings in the present study may also be applicable to other methodological issues, such as indirect effects in path analysis.
ACKNOWLEDGEMENTS

The dissertation would not have been possible without the patient and thoughtful guidance of my advisor, Ken Frank. I am deeply grateful for his guidance from the beginning, and for his listening and supporting with rational care as I stood at the crossroads of countless decisions. I wish also to recognize the contributions of my committee members. Betsy Becker, Robert Floden, and James Stapleton offered important comments and suggestions that made my dissertation strong both theoretically and practically. Thanks also to my family. My parents, Yi-Nan Zhou and Zhan-Ping Pan, encouraged me from my childhood to study hard and to be a well-educated person. I am glad that I have fulfilled their wish today. My wife, Li Zhang, who sacrificed her profession in our home country to come to the US with me, and my son, Bill, have given me constant strong support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Causal Inference
  1.2 Confounding and Frank's Index k
  1.3 Purpose of the Study
CHAPTER 2 LITERATURE REVIEW
CHAPTER 3 APPROXIMATION PROCEDURES
  3.1 Moments of r_xc r_yc
  3.2 Pearson Distributions
  3.3 Approximate Distribution of r_xc r_yc
CHAPTER 4 SIMULATION STUDY
  4.1 Simulation Design
  4.2 Simulation Results
CHAPTER 5 APPROXIMATION CORRECTION
CHAPTER 6 DISTRIBUTION COMPARISON
CHAPTER 7 APPLICATIONS
  7.1 An Index of the Robustness of a Causal Inference (IOROCI)
  7.2 An Example
CHAPTER 8 DISCUSSION
  8.1 Conclusions and Limitations
  8.2 Extensions
APPENDIX A
REFERENCES

LIST OF TABLES

Table 1. Coefficients of the Regressions
Table 2. Parameter Specifications for the Simulation Study
Table 3. Simulation Results for the Mean μ'_1 (N = 28)
Table 4. Simulation Results for the Mean μ'_1 (N = 84)
Table 5. Simulation Results for the Mean μ'_1 (N = 783)
Table 6. Simulation Results for the Variance μ_2 (N = 28)
Table 7. Simulation Results for the Variance μ_2 (N = 84)
Table 8. Simulation Results for the Variance μ_2 (N = 783)
Table 9. Simulation Results for the Third Moment μ_3 (N = 28)
Table 10. Simulation Results for the Third Moment μ_3 (N = 84)
Table 11. Simulation Results for the Third Moment μ_3 (N = 783)
Table 12. Simulation Results for the Fourth Moment μ_4 (N = 28)
Table 13. Simulation Results for the Fourth Moment μ_4 (N = 84)
Table 14. Simulation Results for the Fourth Moment μ_4 (N = 783)
Table 15. Estimated Values of the Third Moment Compared with the Simulated Values and the Approximated Values (N = 28)
Table 16. Estimated Values of the Third Moment Compared with the Simulated Values and the Approximated Values (N = 84)
Table 17. Estimated Values of the Fourth Moment Compared with the Simulated Values and the Approximated Values (N = 28)
Table 18. Estimated Values of the Fourth Moment Compared with the Simulated Values and the Approximated Values (N = 84)
Table 19. Detailed Counts and Percentages for tc < -2, -2 < tc < 2, and 2 < tc Based on a Simulated Dataset with N = 84 and ρ = .30

LIST OF FIGURES

Figure 1. Specifications of κ for Distinguishing the Types of Pearson Distributions
Figure 2. The Distributions of the Product of Simulated Correlations r_xc r_yc with Different Numbers of Replications for the Selected Cells
Figure 3. The Trends of the Accuracy of the Approximated and Frank's Moments along One Parameter Holding the Other Parameters Constant, Including the Sample Size
Figure 4. P-P Plots for the Selected Cells
Figure 5. P-P Plots for the Selected Cells Using the Estimated Values for the Third and the Fourth Moments
Figure 6. A P-P Plot with Frank's Distribution for ρ_xc = ρ_yc = ρ_xy = .5 and N = 783
Figure 7. The Relationship of tc and to
Figure 8. The Relationship of tc and to for a Simulated Dataset with N = 84 and ρ = .30
Figure 9. Father's Education as a Potential Confounding Variable for the Causal Relationship Between Father's Occupation and Educational Attainment
Figure 10. Distribution Function of k = r_xc r_yc

Chapter 1

INTRODUCTION

1.1 Causal Inference

Causal inference is an important, controversial issue in most fields of the social sciences, such as educational research, behavioral research, psychometrics, econometrics, and sociology, as well as in epidemiology and biostatistics. In those fields, researchers routinely draw conclusions about causal relationships between dependent variables and independent variables from statistical linear models using data from observational studies (see Jacobs, Finken, Griffin, & Wright, 1998; Lee, 1999; Okagaki & Frensch, 1998; Portes, 1999, for examples).
However, the usual statistical approaches may not lead to valid causal inferences, even if the models are supported by related theories and fully specified (Abbott, 1998; Cook & Campbell, 1979; Holland, 1986, 1988; McKim & Turner, 1997; Pearl, 2000; Rubin, 1974; Sobel, 1996, 1998). The problem mainly comes from the failure to control for all possible confounding variables, the list of which cannot be exhausted. For instance, Okagaki and Frensch (1998) examined the relationship between parenting and children's school performance for different ethnic groups, but did not control for the children's age, gender, or socioeconomic status. Jacobs, Finken, Griffin, and Wright (1998) examined the relationships between parent attitudes, intrinsic values of science, peer support, available activities, and preference for a future science career for science-talented, rural, adolescent females. However, they also failed to control for other demographics, such as age and socioeconomic status. As a third example, Lee (1999) examined the differences in children's views of the world after they personally experienced a natural disaster for various ethnic, socioeconomic status, and gender groups, but failed to control for pre-world-views. Finally, Portes (1999) examined the influence of various factors on immigrant students' school achievement and controlled for many demographic and sociopsychological covariates. Nevertheless, we can still ask: "Did he control for all possible sociopsychological factors?" Therefore, the conclusions in each case might not support causal relationships, although the results are statistically significant.

To accommodate this problem, the literature suggests the following options: (1) Use alternative models, e.g., randomized and well-controlled non-randomized studies (Rubin, 1974); (2) Try causal discovery algorithms which operate on statistical data sets to produce directed causal graphs (Spirtes, Glymour, & Scheines, 1993); (3) Abandon the use of causal language and emphasize the effects of causes rather than the causes of effects (Holland, 1986, 1988; Sobel, 1996); (4) Spend more effort on descriptive work (Abbott, 1998; Sobel, 1998).

If we are still interested in exploring causal relationships in the real world, options (3) and (4) will not work. As for option (1), random assignment is often impractical in the social sciences given logical, ethical, and political concerns. In addition, it is not always possible to measure all confounding variables to be controlled for in statistical analyses. Thus, option (1) would also be inapplicable. In option (2), the causal graphs are generated by calculations of conditional statistical dependence or independence among pairs of variables, but in most cases, the assumptions under which the algorithms operate are not powerful enough to uniquely identify the real causal structure underlying correlational data rather than some set of statistically equivalent but genuinely alternative representations (Woodward, 1997). Thus, the soundness of the methodology of causal graphs is uncertain.

Is there a comparatively simple, feasible way to explore causal relationships using commonly used statistical linear models? There is reason for optimism. Although we are never able to find all possible causes of an outcome, we can statistically characterize the extent to which a causal inference regarding a given predictor is robust to the impacts of other minor causes of the outcome using Frank's (2000) index of the impact of a confounding variable. Before going further to discuss how to implement Frank's index, I will first explain below what Frank's index is as well as the concept of confounding.
1.2 Confounding and Frank's Index k

A confounding variable is one related to both the predictor and the outcome, and it is also assumed to occur causally prior to the predictor (Anderson, Auquier, Hauck, Oakes, Vandaele, & Weisberg, 1980; Cook & Campbell, 1979). If a confounding variable were introduced into a linear model, the effect of a predictor of interest on an outcome might change from statistically significant to not statistically significant. For example, I took 990 schools that have complete data regarding the following three relevant variables from the NELS:88 data (National Center for Education Statistics, 1996) and found, using a simple regression, that Teachers' Morale (BYSC47G) of those schools has a significant effect (p < .001) on their Students' Academic Achievement (mean score of F12XCOMP, the standardized test composite of reading and math). Also, extensive research has shown that School Socioeconomic Status (mean score of BYSES) is related to both Teachers' Morale and Students' Academic Achievement (Chall, Jacobs, & Baldwin, 1991; Miller et al., 1986; Solomon, Battistich, & Horn, 1996; Trusty, Peck, & Mathews, 1994, to name a few). Thus, School Socioeconomic Status is a potential confounding variable for Teachers' Morale on Students' Academic Achievement. After introducing School Socioeconomic Status into the regression model, I found that the effect of Teachers' Morale on their Students' Academic Achievement was no longer statistically significant (p = .448) (see Table 1), supporting that School Socioeconomic Status is a confounding variable for Teachers' Morale on their Students' Academic Achievement.

Table 1
Coefficients of the Regressions[a]

Model  Variable          Unstd. Coefficient (s.e.)  Std. Coefficient  t       p
1      Intercept         47.567 (.842)                                56.493  .000
       Teachers' Morale    .849 (.206)              .130               4.121  .000
2      Intercept         51.226 (.541)                                94.607  .000
       Teachers' Morale    .100 (.132)              .015                .759  .448
       School SES         8.799 (.229)              .776              38.448  .000

Note. N = 990.
[a] Dependent variable: F12XCOMP, standardized test composite (reading and math, school mean).

Usually, we do not have measures of all confounds, and cannot always control for them. However, we can ask: "How large must the impact of a confounding variable be to alter the inference?" Technically, the impact can be obtained by expressing a t-statistic for a regression coefficient in terms of zero-order correlations:

\[
t = \frac{\hat{\beta}_x}{se(\hat{\beta}_x)} = \frac{r_{xy} - r_{xc}r_{yc}}{\sqrt{\dfrac{1 - r_{xy}^2 - r_{xc}^2 - r_{yc}^2 + 2r_{xy}r_{xc}r_{yc}}{n - q - 1}}} \tag{1}
\]

where X is the predictor of interest; Y is the outcome; C is the confounding variable; n is the sample size; q is the number of independent variables, e.g., 2 here (X and C); \hat{\beta}_x is an estimate of the regression coefficient of X; se(\hat{\beta}_x) is the standard error of \hat{\beta}_x; and r_xy, r_xc, and r_yc are the observed correlation coefficients between X and Y, between X and C, and between Y and C, respectively. This formula holds if the regression model has only X and C as independent variables. For the case of more than two independent variables, the formula will be more complex (cf. Frank, 2000, p. 164). If the coefficient \hat{\beta}_x changes from statistically significant to non-significant after including the confounding variable, the t-statistic (1) crosses the threshold of the critical value.

Unfortunately, in many cases, we do not have a measure of a confounding variable. In these cases, there is no absolute rebuttal to challenges of causal inference associated with confounding variables. Nonetheless, Frank (2000) quantifies the impact of a confounding variable on a regression coefficient as a product of the two correlation coefficients r_xc and r_yc: k = r_xc × r_yc.[1] Given this constraint, the value of the t-statistic (1) achieves its minimum value when r_xc^2 = r_yc^2 = k. Substituting these values into (1), the quantity under the square root factors as (1 - r_xy)(1 + r_xy - 2k), so the t-value can be re-expressed as

\[
t_{\min} = \frac{r_{xy} - k}{\sqrt{\dfrac{(1 + r_{xy} - 2k)(1 - r_{xy})}{n - q - 1}}} \tag{2}
\]

which affords the confounding variable the greatest impact on the inference regarding the regression coefficient of X.
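As a rough numerical illustration of expressions (1) and (2), the following Python sketch evaluates both formulas. The function names and the example inputs are illustrative assumptions, not part of the original analysis. Setting r_xc = r_yc = 0 with q = 1 reduces (1) to the simple-regression t, which gives a quick plausibility check against Model 1 of Table 1 (standardized coefficient .130, N = 990).

```python
from math import sqrt


def t_from_correlations(r_xy, r_xc, r_yc, n, q=2):
    """t-statistic for the coefficient of X in terms of zero-order correlations, as in (1)."""
    numerator = r_xy - r_xc * r_yc
    inner = 1 - r_xy**2 - r_xc**2 - r_yc**2 + 2 * r_xy * r_xc * r_yc
    return numerator / sqrt(inner / (n - q - 1))


def t_min(r_xy, k, n, q=2):
    """Minimum t-value for a confound with impact k = r_xc * r_yc, as in (2)."""
    return (r_xy - k) / sqrt((1 + r_xy - 2 * k) * (1 - r_xy) / (n - q - 1))


# Plausibility check: no confound, one predictor (q = 1), r_xy taken from Table 1.
t_simple = t_from_correlations(0.13, 0.0, 0.0, n=990, q=1)

# Hypothetical values: (2) should agree with (1) when r_xc = r_yc = sqrt(k).
k = 0.04
t_a = t_from_correlations(0.30, sqrt(k), sqrt(k), n=84)
t_b = t_min(0.30, k, n=84)
```

Here `t_simple` comes out close to the 4.121 reported for Model 1, and `t_a` and `t_b` coincide up to rounding, which is the factorization behind (2).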
Expression (2) reveals the relationship between the index k and the t-value. That is, we can express k as a function of an estimated correlation r_xy and the minimum t-value, t_min, in (2). Therefore, if we set the minimum t-value equal to a critical value, we can obtain the magnitude of k that is necessary for a potential confounding variable to alter the inference regarding the predictor on the outcome, even if we are not able to measure the confounding variable. In other words, if we had a confounding variable, we would know from the index k how strong the correlational relationships of the confounding variable with the predictor and with the outcome must be to alter the inference for the predictor. Frank refers to the threshold at which the impact of a confounding variable would alter a statistical inference as the impact threshold for a confounding variable (ITCV). That is, if the index k of any potential confounding variable does not exceed the corresponding ITCV of a given predictor that is statistically significant, we can say that the causal inference about the predictor on the outcome is robust to the other causes or confounding variables. Therefore, we may not need to worry about the validity of the predictor as a major cause of the outcome, since we can argue for the validity in terms of the ITCV. Thus, Frank's methodology is a promising attempt to lessen the crisis in causal inference that was mentioned at the beginning of the chapter, by rephrasing the problem in terms of the sensitivity of a statistical inference to the impact of confounding variables. Of course, confounding variables are usually unmeasured or immeasurable.

[1] (a) In the case of more than two independent variables, k will be the product of two partial correlations (cf. Frank, 2000). (b) For the moment, assume k > 0. For the case of k < 0, see Frank (2000).
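The step of setting t_min equal to a critical value and solving for k can be sketched in Python as follows. The closed form below is obtained here by rearranging (2) as a quadratic in k and taking the root with k < r_xy; it is an illustration under the stated assumptions (0 < r_xy < 1, k > 0, two independent variables by default), and the function names and example numbers are hypothetical.

```python
from math import sqrt


def t_min(r_xy, k, n, q=2):
    """Minimum t-value for a confound with impact k, as in (2)."""
    return (r_xy - k) / sqrt((1 + r_xy - 2 * k) * (1 - r_xy) / (n - q - 1))


def impact_threshold(r_xy, t_crit, n, q=2):
    """Solve t_min(r_xy, k) = t_crit for k (assumes 0 < r_xy < 1).

    With c = t_crit^2 / (n - q - 1), equation (2) becomes a quadratic in k
    whose relevant root (the one below r_xy) is
    k = r_xy - (1 - r_xy) * (c + sqrt(c * (1 + c))).
    """
    c = t_crit**2 / (n - q - 1)
    return r_xy - (1 - r_xy) * (c + sqrt(c * (1 + c)))


# Hypothetical example: r_xy = .30, N = 84, two-sided critical value of about 1.96.
k_star = impact_threshold(0.30, 1.96, n=84)
t_check = t_min(0.30, k_star, n=84)  # should reproduce the critical value
```

A non-positive result would indicate that the predictor is not statistically significant at the chosen critical value even without a confound.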
While Frank's index quantifies the impact necessary to alter an inference, how do we know the likelihood that such an impact could or would alter the inference in the presence of the confounding variable? One response is to generate a reference distribution for the impact of the unmeasured confounding variable from the impacts of existing, measured covariates. The reference distribution takes the same form as that of the index k[2]: the product of two dependent correlation coefficients, between the covariate and the predictor and between the covariate and the outcome. With this reference distribution, we can assess the likelihood that the causal interpretation of the predictor could be altered if a confounding variable with comparable impact were measured and controlled in the linear model.

[2] In the case of more than two independent variables, k is the product of two partial correlations (cf. Frank, 2000, p. 166).

Some researchers may be uncomfortable with the strategy of using measured covariates to generate a reference distribution for the impact of an unknown confounding variable. For example, a poorly chosen set of covariates will underestimate the impact of an important confounding variable. Thus, we acknowledge that this use of the reference distribution is only as valid as the set of covariates on which it is based. In this sense, the problem is no different from any other associated with making an inference from a sample that must be representative of the population. In this light, the impact of existing covariates represents important information by which to assess the ITCV. For example, would it not be informative if the ITCV were much larger than the impact of any measured covariate? If one agrees, then the question is not whether to use the impacts of measured covariates, but how to use this information. This becomes the core of the present study.
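To make the idea of using measured covariates concrete, here is a minimal Python sketch that tabulates the impact k = r_xc r_yc of each measured covariate and reports the share of impacts exceeding a given threshold. This is only an empirical tabulation under assumed, made-up data; it is not the Pearson Type I approximation that the study develops, and all names are hypothetical.

```python
from math import sqrt


def pearson_r(u, v):
    """Sample Pearson correlation of two equal-length sequences (non-constant inputs assumed)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def covariate_impacts(x, y, covariates):
    """Map each covariate name to its impact k = r_xc * r_yc."""
    return {name: pearson_r(x, c) * pearson_r(y, c) for name, c in covariates.items()}


def share_exceeding(impacts, threshold):
    """Fraction of measured-covariate impacts that exceed the threshold (e.g., an ITCV)."""
    values = list(impacts.values())
    return sum(k > threshold for k in values) / len(values)


# Made-up toy data, for illustration only.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
covariates = {"c1": [1, 2, 3, 4], "c2": [4, 3, 2, 1], "c3": [1, -1, -1, 1]}
impacts = covariate_impacts(x, y, covariates)
share = share_exceeding(impacts, threshold=0.5)
```

In practice the threshold would be an ITCV, and the set of covariates would need to be chosen with the caveats discussed above in mind.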
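1.3 Purpose of the Study

In order to utilize the reference distribution, we must understand the behavior of the distribution of the product of two dependent correlation coefficients. From Cohen and Cohen (1983, p. 280), we know that the product of two dependent correlation coefficients is constrained by upper and lower limits, rather than just -1 and 1:

\[
r_{xy} - \sqrt{(1 - r_{xc}^2)(1 - r_{yc}^2)} \le r_{xc}r_{yc} \le r_{xy} + \sqrt{(1 - r_{xc}^2)(1 - r_{yc}^2)}. \tag{3}
\]

[...]

\[
\begin{aligned}
\mu'_1 &= E(r_{xc}r_{yc}) = \rho_{xc}\rho_{yc} + \rho_{xc}E(\Delta r_{yc}) + \rho_{yc}E(\Delta r_{xc}) + E(\Delta r_{xc}\Delta r_{yc});\\
\mu'_2 &= E[(r_{xc}r_{yc})^2] = \rho_{xc}^2\rho_{yc}^2 + \rho_{xc}^2E[(\Delta r_{yc})^2] + \rho_{yc}^2E[(\Delta r_{xc})^2] + E[(\Delta r_{xc})^2(\Delta r_{yc})^2]\\
&\quad + 2\rho_{xc}^2\rho_{yc}E(\Delta r_{yc}) + 2\rho_{xc}\rho_{yc}^2E(\Delta r_{xc}) + 2\rho_{yc}E[(\Delta r_{xc})^2\Delta r_{yc}] + 2\rho_{xc}E[\Delta r_{xc}(\Delta r_{yc})^2]\\
&\quad + 4\rho_{xc}\rho_{yc}E(\Delta r_{xc}\Delta r_{yc});\\
\mu'_3 &= E[(r_{xc}r_{yc})^3] = \rho_{xc}^3\rho_{yc}^3 + \rho_{xc}^3E[(\Delta r_{yc})^3] + \rho_{yc}^3E[(\Delta r_{xc})^3] + 3\rho_{xc}^3\rho_{yc}^2E(\Delta r_{yc}) + 3\rho_{xc}^2\rho_{yc}^3E(\Delta r_{xc})\\
&\quad + 3\rho_{xc}^3\rho_{yc}E[(\Delta r_{yc})^2] + 3\rho_{xc}\rho_{yc}^3E[(\Delta r_{xc})^2] + 3\rho_{xc}^2E[\Delta r_{xc}(\Delta r_{yc})^3] + 3\rho_{yc}^2E[(\Delta r_{xc})^3\Delta r_{yc}]\\
&\quad + 9\rho_{xc}^2\rho_{yc}^2E(\Delta r_{xc}\Delta r_{yc}) + 9\rho_{xc}^2\rho_{yc}E[\Delta r_{xc}(\Delta r_{yc})^2] + 9\rho_{xc}\rho_{yc}^2E[(\Delta r_{xc})^2\Delta r_{yc}]\\
&\quad + 9\rho_{xc}\rho_{yc}E[(\Delta r_{xc})^2(\Delta r_{yc})^2];\\
\mu'_4 &= E[(r_{xc}r_{yc})^4] = \rho_{xc}^4\rho_{yc}^4 + \rho_{xc}^4E[(\Delta r_{yc})^4] + \rho_{yc}^4E[(\Delta r_{xc})^4] + 4\rho_{xc}^4\rho_{yc}^3E(\Delta r_{yc}) + 4\rho_{xc}^3\rho_{yc}^4E(\Delta r_{xc})\\
&\quad + 4\rho_{xc}^4\rho_{yc}E[(\Delta r_{yc})^3] + 4\rho_{xc}\rho_{yc}^4E[(\Delta r_{xc})^3] + 6\rho_{xc}^4\rho_{yc}^2E[(\Delta r_{yc})^2] + 6\rho_{xc}^2\rho_{yc}^4E[(\Delta r_{xc})^2]\\
&\quad + 16\rho_{xc}^3\rho_{yc}^3E(\Delta r_{xc}\Delta r_{yc}) + 16\rho_{xc}^3\rho_{yc}E[\Delta r_{xc}(\Delta r_{yc})^3] + 16\rho_{xc}\rho_{yc}^3E[(\Delta r_{xc})^3\Delta r_{yc}]\\
&\quad + 24\rho_{xc}^3\rho_{yc}^2E[\Delta r_{xc}(\Delta r_{yc})^2] + 24\rho_{xc}^2\rho_{yc}^3E[(\Delta r_{xc})^2\Delta r_{yc}] + 36\rho_{xc}^2\rho_{yc}^2E[(\Delta r_{xc})^2(\Delta r_{yc})^2].
\end{aligned}
\tag{4}
\]

In order to obtain closed form expressions for the first four moments, we need to express E[(Δr_xc)^i], i = 1, 2, 3, 4, E[(Δr_yc)^j], j = 1, 2, 3, 4, and E[(Δr_xc)^k (Δr_yc)^l], k, l = 1, 2, or 3, in terms of ρ_xc, ρ_yc, or ρ_xy. Before going further, we need to define some notation and get some preliminary results.

Suppose the three initial variables X, Y, and C follow a trivariate normal distribution. Then, following Ghosh (1966) and Hotelling (1936, 1940), the moments and the covariance of r_xc and r_yc can be approximately expressed as follows:

\[
\begin{aligned}
\mu_{r_\cdot} &= E(r_\cdot) = \rho_\cdot - \frac{\rho_\cdot(1-\rho_\cdot^2)}{2M}\Big\{1 + \frac{9}{4M}(3+\rho_\cdot^2) + \frac{3}{8M^2}(121 + 70\rho_\cdot^2 + 25\rho_\cdot^4)\\
&\qquad + \frac{3}{64M^3}(6479 + 4923\rho_\cdot^2 + 2925\rho_\cdot^4 + 1225\rho_\cdot^6)\\
&\qquad + \frac{3}{128M^4}(86341 + 77260\rho_\cdot^2 + 58270\rho_\cdot^4 + 38220\rho_\cdot^6 + 19845\rho_\cdot^8)\Big\};\\
\sigma^{(2)}_{r_\cdot} &= Var(r_\cdot) = E[(r_\cdot - \mu_{r_\cdot})^2]\\
&= \frac{(1-\rho_\cdot^2)^2}{M}\Big\{1 + \frac{1}{2M}(14 + 11\rho_\cdot^2) + \frac{1}{2M^2}(98 + 130\rho_\cdot^2 + 75\rho_\cdot^4)\\
&\qquad + \frac{1}{8M^3}(2744 + 4645\rho_\cdot^2 + 4422\rho_\cdot^4 + 2565\rho_\cdot^6)\\
&\qquad + \frac{1}{8M^4}(19208 + 37165\rho_\cdot^2 + 44499\rho_\cdot^4 + 40299\rho_\cdot^6 + 26685\rho_\cdot^8)\Big\};\\
\sigma^{(3)}_{r_\cdot} &= E[(r_\cdot - \mu_{r_\cdot})^3]\\
&= -\frac{\rho_\cdot(1-\rho_\cdot^2)^3}{M^2}\Big\{6 + \frac{1}{M}(69 + 88\rho_\cdot^2) + \frac{3}{4M^2}(797 + 1691\rho_\cdot^2 + 1560\rho_\cdot^4)\\
&\qquad + \frac{1}{8M^3}(12325 + 33147\rho_\cdot^2 + 48099\rho_\cdot^4 + 44109\rho_\cdot^6)\Big\};\\
\sigma^{(4)}_{r_\cdot} &= E[(r_\cdot - \mu_{r_\cdot})^4]\\
&= \frac{3(1-\rho_\cdot^2)^4}{M^2}\Big\{1 + \frac{1}{M}(12 + 35\rho_\cdot^2) + \frac{1}{4M^2}(436 + 2028\rho_\cdot^2 + 3025\rho_\cdot^4)\\
&\qquad + \frac{1}{4M^3}(3552 + 20009\rho_\cdot^2 + 46462\rho_\cdot^4 + 59751\rho_\cdot^6)\Big\};\\
\sigma_{r_{xc},r_{yc}} &= Cov(r_{xc}, r_{yc}) = E[(r_{xc} - \mu_{r_{xc}})(r_{yc} - \mu_{r_{yc}})]\\
&= \frac{1}{M}\Big[\rho_{xy}(1 - \rho_{xc}^2 - \rho_{yc}^2) - \frac{1}{2}\rho_{xc}\rho_{yc}(1 - \rho_{xc}^2 - \rho_{yc}^2 - \rho_{xy}^2)\Big],
\end{aligned}
\tag{5}
\]

where M = N + 6, N is the sample size; the subscript "·" represents xc or yc; and the superscripts "(2)", "(3)", and "(4)" represent a variance, a third moment, and a fourth moment, respectively (as distinguished from quadratic, cubic, and quartic powers).

On the other hand, we have the expressions for E[(Δr_·)^i], i = 1, 2, 3, 4, and E(Δr_xc Δr_yc) as follows:

\[
\begin{aligned}
E(\Delta r_\cdot) &= E(r_\cdot - \rho_\cdot) = \mu_{r_\cdot} - \rho_\cdot, \text{ which will be referred to as } b_{r_\cdot};\\
E[(\Delta r_\cdot)^2] &= E[(r_\cdot - \rho_\cdot)^2] = E\{[(r_\cdot - \mu_{r_\cdot}) - (\rho_\cdot - \mu_{r_\cdot})]^2\} = \sigma^{(2)}_{r_\cdot} + b_{r_\cdot}^2;\\
E[(\Delta r_\cdot)^3] &= E[(r_\cdot - \rho_\cdot)^3] = E\{[(r_\cdot - \mu_{r_\cdot}) - (\rho_\cdot - \mu_{r_\cdot})]^3\} = \sigma^{(3)}_{r_\cdot} + 3b_{r_\cdot}\sigma^{(2)}_{r_\cdot} + b_{r_\cdot}^3;\\
E[(\Delta r_\cdot)^4] &= E[(r_\cdot - \rho_\cdot)^4] = E\{[(r_\cdot - \mu_{r_\cdot}) - (\rho_\cdot - \mu_{r_\cdot})]^4\} = \sigma^{(4)}_{r_\cdot} + 4b_{r_\cdot}\sigma^{(3)}_{r_\cdot} + 6b_{r_\cdot}^2\sigma^{(2)}_{r_\cdot} + b_{r_\cdot}^4;\\
E(\Delta r_{xc}\Delta r_{yc}) &= E[(r_{xc} - \rho_{xc})(r_{yc} - \rho_{yc})]\\
&= E\{[(r_{xc} - \mu_{r_{xc}}) - (\rho_{xc} - \mu_{r_{xc}})][(r_{yc} - \mu_{r_{yc}}) - (\rho_{yc} - \mu_{r_{yc}})]\}\\
&= E[(r_{xc} - \mu_{r_{xc}})(r_{yc} - \mu_{r_{yc}})] + (\mu_{r_{xc}} - \rho_{xc})(\mu_{r_{yc}} - \rho_{yc})\\
&= \sigma_{r_{xc},r_{yc}} + b_{r_{xc}}b_{r_{yc}}.
\end{aligned}
\tag{6}
\]

We also need the expressions for the higher order product-moments, E[(Δr_xc)^s (Δr_yc)^t], s, t = 1, 2, or 3. After a few simplifications, we first have:

\[
\begin{aligned}
E(\Delta r_i\Delta r_j\Delta r_k\Delta r_l) &= E[(r_i - \rho_i)(r_j - \rho_j)(r_k - \rho_k)(r_l - \rho_l)]\\
&= E\{[(r_i - \mu_{r_i}) + (\mu_{r_i} - \rho_i)][(r_j - \mu_{r_j}) + (\mu_{r_j} - \rho_j)][(r_k - \mu_{r_k}) + (\mu_{r_k} - \rho_k)][(r_l - \mu_{r_l}) + (\mu_{r_l} - \rho_l)]\}\\
&= E[(r_i - \mu_{r_i})(r_j - \mu_{r_j})(r_k - \mu_{r_k})(r_l - \mu_{r_l})]\\
&\quad + b_{r_l}E[(r_i - \mu_{r_i})(r_j - \mu_{r_j})(r_k - \mu_{r_k})] + b_{r_k}E[(r_i - \mu_{r_i})(r_j - \mu_{r_j})(r_l - \mu_{r_l})]\\
&\quad + b_{r_j}E[(r_i - \mu_{r_i})(r_k - \mu_{r_k})(r_l - \mu_{r_l})] + b_{r_i}E[(r_j - \mu_{r_j})(r_k - \mu_{r_k})(r_l - \mu_{r_l})]\\
&\quad + b_{r_k}b_{r_l}E[(r_i - \mu_{r_i})(r_j - \mu_{r_j})] + b_{r_j}b_{r_l}E[(r_i - \mu_{r_i})(r_k - \mu_{r_k})] + b_{r_j}b_{r_k}E[(r_i - \mu_{r_i})(r_l - \mu_{r_l})]\\
&\quad + b_{r_i}b_{r_l}E[(r_j - \mu_{r_j})(r_k - \mu_{r_k})] + b_{r_i}b_{r_k}E[(r_j - \mu_{r_j})(r_l - \mu_{r_l})] + b_{r_i}b_{r_j}E[(r_k - \mu_{r_k})(r_l - \mu_{r_l})]\\
&\quad + b_{r_i}b_{r_j}b_{r_k}b_{r_l}\\
&= \sigma_{r_i,r_j}\sigma_{r_k,r_l} + \sigma_{r_i,r_k}\sigma_{r_j,r_l} + \sigma_{r_i,r_l}\sigma_{r_j,r_k}\\
&\quad + \sigma_{r_i,r_j}b_{r_k}b_{r_l} + \sigma_{r_i,r_k}b_{r_j}b_{r_l} + \sigma_{r_i,r_l}b_{r_j}b_{r_k} + \sigma_{r_j,r_k}b_{r_i}b_{r_l} + \sigma_{r_j,r_l}b_{r_i}b_{r_k} + \sigma_{r_k,r_l}b_{r_i}b_{r_j}\\
&\quad + b_{r_i}b_{r_j}b_{r_k}b_{r_l},
\end{aligned}
\tag{7}
\]

where i, j, k, and l can be xc or yc, and Anderson's (1958, p. 39, Equations 25 & 26) formulae[4] were applied to the last equality in (7). In the same fashion, we further have

\[
E(\Delta r_i\Delta r_j\Delta r_k) = E[(r_i - \rho_i)(r_j - \rho_j)(r_k - \rho_k)] = \sigma_{r_i,r_j}b_{r_k} + \sigma_{r_i,r_k}b_{r_j} + \sigma_{r_j,r_k}b_{r_i} + b_{r_i}b_{r_j}b_{r_k}. \tag{8}
\]

Then, in (7), letting i = j = k = xc and l = yc, letting i = xc and j = k = l = yc, and letting i = j = xc and k = l = yc, respectively, gives us the desired expressions for some of the higher order product-moments as follows:

\[
\begin{aligned}
E[(\Delta r_{xc})^3(\Delta r_{yc})] &= 3\sigma^{(2)}_{r_{xc}}\sigma_{r_{xc},r_{yc}} + 3\sigma^{(2)}_{r_{xc}}b_{r_{xc}}b_{r_{yc}} + 3\sigma_{r_{xc},r_{yc}}b_{r_{xc}}^2 + b_{r_{xc}}^3b_{r_{yc}};\\
E[(\Delta r_{xc})(\Delta r_{yc})^3] &= 3\sigma^{(2)}_{r_{yc}}\sigma_{r_{xc},r_{yc}} + 3\sigma^{(2)}_{r_{yc}}b_{r_{xc}}b_{r_{yc}} + 3\sigma_{r_{xc},r_{yc}}b_{r_{yc}}^2 + b_{r_{xc}}b_{r_{yc}}^3;\\
E[(\Delta r_{xc})^2(\Delta r_{yc})^2] &= \sigma^{(2)}_{r_{xc}}\sigma^{(2)}_{r_{yc}} + 2\sigma_{r_{xc},r_{yc}}^2 + \sigma^{(2)}_{r_{xc}}b_{r_{yc}}^2 + \sigma^{(2)}_{r_{yc}}b_{r_{xc}}^2 + 4\sigma_{r_{xc},r_{yc}}b_{r_{xc}}b_{r_{yc}} + b_{r_{xc}}^2b_{r_{yc}}^2.
\end{aligned}
\tag{9}
\]

And, in (8), letting i = j = xc and k = yc and letting i = xc and j = k = yc, respectively, gives us the desired expressions for the rest of the higher order product-moments as follows:

\[
\begin{aligned}
E[(\Delta r_{xc})^2(\Delta r_{yc})] &= \sigma^{(2)}_{r_{xc}}b_{r_{yc}} + 2\sigma_{r_{xc},r_{yc}}b_{r_{xc}} + b_{r_{xc}}^2b_{r_{yc}};\\
E[(\Delta r_{xc})(\Delta r_{yc})^2] &= \sigma^{(2)}_{r_{yc}}b_{r_{xc}} + 2\sigma_{r_{xc},r_{yc}}b_{r_{yc}} + b_{r_{xc}}b_{r_{yc}}^2.
\end{aligned}
\tag{10}
\]

[4] Anderson's equations were tactically used only for obtaining the higher order product-moments, although these equations are based on the normal distribution.

Applying (6), (9), and (10) to (4), I obtain the approximate first four non-central moments of r_xc r_yc as follows, in terms of ρ_xc and ρ_yc and the moments and the covariance of r_xc and r_yc that are functions of ρ_xc, ρ_yc, and ρ_xy (cf. Equation 5):

\[
\mu'_1 = E(r_{xc}r_{yc}) = (b_{r_{xc}} + \rho_{xc})(b_{r_{yc}} + \rho_{yc}) + \sigma_{r_{xc},r_{yc}};
\]

\[
\mu'_2 = E(r_{xc}r_{yc})^2 = [(b_{r_{xc}} + \rho_{xc})^2 + \sigma^{(2)}_{r_{xc}}][(b_{r_{yc}} + \rho_{yc})^2 + \sigma^{(2)}_{r_{yc}}]
\]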
rxc ,ch 9 +4(b,xc + pxc )(beC + ch )0JxJW +20 3=E(r r )3=-b3 p3 +b3 (3b —p )p2 +p3 p3 +3p p3 0(32)+3p p 3“ ,VC ryc xc r“ ryc yc ye are ye xc yea 'xc xc yea gory: (2)0. (2) 3 (3 (2 2 +92)...t2..~<7,Ja J +2940; ’+p..0‘ J’+9[py.65 J’+p3.(p§.+0‘J 010 2 2 2(2) +18pxcpyc0'rchyc+3brycpxc(pxcpyc+3pyca( ’xc +3!)ch 'xc’yc) 2 2 2 (2) + 3b,“ pyc [pxc (3b,” + 3b,” pyc + pyc. + 30 "yc )+ 3 pyco'rrr-ne ] +3b p [3222042’+;02 (to2 --0‘2’)+6p p 0' l (11) CW xc yc r)“. are ye 'yc xc yc rxc,ryc 3 2 (2) 2 2 (2) +3b J,{b J29... +3b Jpn/0y. + 3b leicaJ +p..(p,. +0.J )+ 4p..py.0,J,.J] +pyc[— pyCO' ar2)+pxc(pyc+30(2))+6pxcpyca 0-“2- my I}; 4 = E(r-“'r}’c')4 : bgcpz‘ + 4b?” (4brxc — pxc )picpyc + b4 cpyc— —4b:: cpxcpyc + pxcpyc (2) (2) (2)0120 3 +6px2cpyca( ’xc +6pxcpyca ’vc +36pxcpf’firco-CO- ch)+4pxcpch-( 0” c)+4px4cpyc0' 0:”) 0(4) (4) 00 2 +pyc0',“130.60,”+16pxcpycl3pyc0'6‘c)+Prc(pyc +30; c))]0 ’36ch 4 4 2 (2) (3) (4) 3 3 + pxc(bryc + 6bryc0ryc —— 4ber0ryc + 0ch )+16pxcpyc(b&cbryc +0JICJW) ) 20 ”222.42402 .J .J +6123, 20416123, cpic+4bJ p..p§.+6p..0 0.2’+P..(Pic+0‘2’> \ +84..p..a...,.,.1+6bf 245.14%: oif’+pi.+8p..p.0+c v.41 +4b Jp..[- 3p..p..00‘J’+p..(p..+60‘J’)- p.47 mf’+12p..p,0..0,..JJl >012) +4b5’cpxc {4bixcpic +6b:cpxcp}3,c +4b5rcpyc[3pyc0' 0'2) +pxc(pyc +30'(2)) ll-j (2) o,(2)_ 0(3) 2 +9pxcpycarxcfy€]+pxc[6pyco.( r“. +pxc(pyc _3p,o.rVC arw)+12pxcpyco'rxc,ryc The derivations for the formulae (11) and (11’) were done by Mathematica (Wolfram, 1999). We could further substitute (5) into (1 l) and (11’) to obtain direct expressions for the first four moments of rxcryc in terms of pm pyc, and p” as well as the sample size N, but this is not advisable because the formulae would be much messier. For the central moments about the mean, substitution in Kendall and Stuart’s equations (Kendall & Stuart, 1977, Equation 3.9, p. 58): #2 ___ ”.2 _ #12 \ = 2 (2) (2) 2 (2) (er + p“) 0JyC +0,“ [055. + pyc) +0,” ] +202... +4..)(bJ +4.44%... 
2403......2 )(12) #3 = #3 — 3/1'1fl'2 + 2#'13 =-2b3 Jp..-3b3J 20.57052)4125,2040”)-.3bJp..0‘:’0‘J0’+6p..py. £073. ‘2’ 21 -b3 c[b3 +3b,2 ycpyc +2pyc +3(br yc+pyc)0'(:)]+pyca £3c2+pxca 0123c) 22 _3t—2p..a;c2>+a:2’(b3 +212 p..- 2ij 222222.223122...» 4,0160% _ 10...”; .. ’40:...c’3 :C{b pxc(b2 +3a(2)) (2) _ (2) 3 2 3 + [3225. (225.. + 2 pyc ) + or” ]0',mryc } + 3b,“ { a,“ [b’xc + 3b,“ bryc + 2 pyc +(b +pyc)0'(2)]— 2p,c(3b,2 +J£2Hc))0’, r.. — 4(b +pyc)0'2 }; ’xc’ yc p4 = [1'4 - 4fl'i/I'3 + 6fl'lzfl'2 — 3M4 _ 4(2) (2)0, (26) 4 4 3 2 2 212 (410.0 + 317“”, n)+6p.cp..0.,ca + b.0132). +1212.” 12.. + 6b.. p... (4) (4) +412 pfic+8pyc+6(b +0..) 205221+pyc056 +p..0, +413p..p..0‘ , 2200.. 0022—0523)] -20‘2’)- pyca U‘f’+p..(3p..0 +6lp.c(2p.2c- -50f2’) (12') 'xc’yc w—36pxcpyca 0: +904 —4b2 Mpxcapxcpyc 06(2) 2 2 +a‘ 2(— —5p§c +0: €2,510 0,” W” + 2 pinata,” — 302,230.05. ) + 4b,; {pxc [3pjc + 3b,; pyc .. 31,3” (Pic _ 2053.2) — 3p§ca§y2c2 +b,yc (2ij + 3pyco 0522)] + [6b,3 +18b,2 w-pyc 2ij + 3pyco'ry 0(2) +3225. (2p; +agc))]aw }+6b2 [pic(4pfc +0(2))0.(2) +(2pxc + 0(2))02 ] 02W yc 2 4 2 3 2 4 (2) 3 (2) 0(2) 0(2) +6brx[brycpxc—2brwpxcpyc+brxc arx +4b ”pycarx +4pyc% 6—b ”pfcpyca (220(26)+pyc 0’320' (2H)+6b prc(2pyc+br yc—pyc zpyc r yc 'xc 2 (2) 0(2) +b M0" +2brycpycc7 (2) 2 2 (2) 2 4 3 3 3 + aryc )O'rwyc + (8125,. + 1 6br,,. pyc + 2 p ye + cry. )UGchc ] + 4b,“ {by pm + 2b,” ch pyc J, ( 2) 3 + 6b,). 10.. 10.0.. 0(2) +3bf ”p..0‘2’+3b3 p..p..0 .6 -9b2y.p..p§.0.22+6b,10.10.67,“ 22 2 2 2 2 2 0(2 2 013“) 4 (3) +3br J”pal? dire-0') Syc) —,.3b Mpxcpyc Ow) (ya) _6pxcpycar M)0£)— bf ”pica-f _zpyca’xc W -b pic 0‘3’)- 2x520, 0‘33 -b pic 0‘33 -pxp,ra‘3’+3[2b3 pic—6193 pip” 3 0(2) 2 0(2) 0(2) 3 (2) 2 _ (2) (2) +b,yc +3b pyca, -3b pyca, Wyn,” -(b,yc +p,c)(3px 0,“ )0,” 10%,} +3pxc(8b’2yc —"2b yc—pyc —4pvc+0'(:c))0'r: ,ycr +9(bry c+pyc)0'rxc ,ycr }- -r4b ”[prcar (:2 (2) 2 C(30) +3pica 0030' rm, +3pycarxc a W, +6pfcpyc(0'(230'ryc (33+ +20: rye )+ch(Pyc0' +9pyca (70(2) 33330323032305“.- -90'3 )]. 
Note that

$$\mu_1 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (r_{xc}r_{yc} - \mu'_1)\,dF = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} [r_{xc}r_{yc} - E(r_{xc}r_{yc})]\,dF = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} r_{xc}r_{yc}\,dF - E(r_{xc}r_{yc}) = 0. \qquad (12'')$$

3.2 Pearson Distributions

The four moments give only a general idea of the characteristics of the distribution of the product of two dependent correlation coefficients. To better understand this distribution, we need to explore its shape. Since the Pearson distribution family provides approximations to a wide variety of observed distributions using only the first four moments, in this thesis the Pearson distribution family is employed to obtain an approximate distribution for the product of the two dependent correlation coefficients.

There are three main types and some other uncommon types of frequency curves in the Pearson distribution family, which are characterized by the $\beta$-coefficients:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}. \qquad (13)$$

In particular, after obtaining $\beta_1$ and $\beta_2$ via the first four moments, we can plot the couplet $(\beta_1, \beta_2)$ on the $(\beta_1, \beta_2)$ plane illustrated in Pearson and Hartley (1972, p. 78). From the $(\beta_1, \beta_2)$ plane we know the type to which the observed distribution belongs. Instead of referencing the $(\beta_1, \beta_2)$ plane, we can also distinguish the types of Pearson distributions by evaluating the criterion $\kappa$ (kappa) (Kendall & Stuart, 1977, Equation 6.10):

$$\kappa = \frac{\beta_1(\beta_2+3)^2}{4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)}, \qquad (14)$$

with the specifications illustrated in Figure 1 (Elderton & Johnson, 1969, p. 49).

    κ = -∞        Type III
    κ < 0         Type I
    κ = 0         Normal curve (β2 = -3)[5]; Type II or VII (β2 < -3 or > -3)[6]
    0 < κ < 1     Type IV
    κ = 1         Type V
    κ > 1         Type VI
    κ = ∞         Type III

Figure 1. Specifications of $\kappa$ for distinguishing the types of Pearson distributions.
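As a minimal sketch of Equations (13) and (14), the Python functions below compute the $\beta$-coefficients and the criterion $\kappa$ from central moments and map $\kappa$ to the types listed in Figure 1. The function names are illustrative assumptions and are not part of the original study (which used Mathematica and SPSS). The example call uses the approximated moments reported for N = 783 and $\rho_{xc} = \rho_{yc} = \rho_{xy} = .50$ in Tables 8, 11, and 14.

```python
import math


def pearson_criterion(mu2, mu3, mu4):
    """Return (beta1, beta2, kappa) from central moments, following Eqs. (13)-(14)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    if beta1 == 0.0:
        # Symmetric case: the numerator of kappa vanishes.
        return beta1, beta2, 0.0
    denom = 4.0 * (4.0 * beta2 - 3.0 * beta1) * (2.0 * beta2 - 3.0 * beta1 - 6.0)
    if denom == 0.0:
        # Boundary case where kappa is infinite (Type III in Figure 1).
        return beta1, beta2, math.inf
    return beta1, beta2, beta1 * (beta2 + 3.0) ** 2 / denom


def pearson_type(kappa):
    """Map kappa to the Pearson types listed in Figure 1."""
    if math.isinf(kappa):
        return "Type III"
    if kappa < 0:
        return "Type I"
    if kappa == 0:
        return "Normal, Type II, or Type VII"
    if kappa < 1:
        return "Type IV"
    if kappa == 1:
        return "Type V"
    return "Type VI"


# Approximated moments for N = 783, rho_xc = rho_yc = rho_xy = .50.
beta1, beta2, kappa = pearson_criterion(0.000499029, 0.000001235, 0.000000748)
print(beta1, beta2, kappa, pearson_type(kappa))
```

For these inputs the criterion comes out negative, which agrees with the text's statement that $\kappa$ is negative for large N.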
[5], [6] Originally, they are "$\beta_2 = 3$" and "$\beta_2 < 3$ or $> 3$" in Elderton and Johnson (1969, p. 49), but according to expression (14) above (from Kendall & Stuart, 1977, Equation 6.10) and expression (4) in Elderton and Johnson (1969, p. 41), they should be "$\beta_2 = -3$" and "$\beta_2 < -3$ or $> -3$", respectively.

Type I distributions are beta distributions and Type III distributions are gamma distributions. Pearson and Hartley (1972, pp. 261-285) have tables for percentage points of Pearson curves for given $\beta_1$ and $\beta_2$. Elderton and Johnson (1969) provide mathematical expressions for Pearson distributions with formulae for the parameters in terms of the first four moments.

3.3 Approximate Distribution of $r_{xc}r_{yc}$

In order to apply the first four moments of $r_{xc}r_{yc}$ to Pearson distributions, I begin by noting that there are three conditions for the dependent population correlations $\rho_{xc}$, $\rho_{yc}$, and $\rho_{xy}$ in this particular study:

(i) $\rho_{xc}$ and $\rho_{yc}$ are dependent and, from (3), $\rho_{xy}$ is correspondingly constrained by $\rho_{xc}\rho_{yc} \pm \sqrt{(1-\rho_{xc}^2)(1-\rho_{yc}^2)}$;
(ii) $\rho_{xc}$, $\rho_{yc}$, and $\rho_{xy} \neq -1$, $0$, or $1$;
(iii) $\rho_{xc}\rho_{yc}\rho_{xy} > 0$.

Condition (i) is supported by Cohen and Cohen's constraint in equation (3) above. Since $-1$, $0$, and $1$ are extreme and trivial cases, condition (ii) is sensible. As for condition (iii), I will deal only with cases in which this condition holds for issues of confounding and indirect effects. The general interpretation of confounding applies when the impact of the confounding variable, $k = \rho_{xc}\rho_{yc}$, takes the same sign as the relationship between X and Y, $\rho_{xy}$. Therefore, the product of the two components, $\rho_{xc}\rho_{yc}$ and $\rho_{xy}$, must be positive, showing that condition (iii) applies.

Under these three conditions, we might be able to prove theoretically that $\kappa < 0$, but for simplicity, I only evaluate the criterion $\kappa$ numerically. Under the conditions above, I evaluated the value of $\kappa$ for all possible conditional triplets of $\rho_{xc}$, $\rho_{yc}$, and $\rho_{xy}$ with an increment of .10 for each correlation coefficient, and I found that the larger N is, the more values of $\kappa$ are negative. When N > 300[7], all values of $\kappa$ are negative under the three conditions. Note that the values of $\kappa$ are approximate, since the $\beta$-coefficients are evaluated from the approximate moments. As N approaches infinity, the approximate values of $\kappa$ will approach the true value, which will be negative, as observed by examining the numerical trend of $\kappa$ as N increases.

[7] For N < 300, $\rho_{xc}$ or $\rho_{yc}$ must be greater than .10 so that $\kappa$ becomes negative. For smaller correlations, e.g., $\rho_{xc}$ and $\rho_{yc}$ < .10, a rare case in confounding, $0 < \kappa < 1$, and the distribution of the product of two dependent correlation coefficients, $r_{xc}r_{yc}$, can be approximated by a Pearson Type IV distribution.

Thus, by looking at Figure 1, we can conclude that the distribution of the product of two dependent correlation coefficients, $r_{xc}r_{yc}$, can be approximated by a Pearson Type I distribution. Correspondingly, the density function is (Elderton & Johnson, 1969; Kendall & Stuart, 1977):

$$f(k) = f_0\left(1+\frac{k}{a_1}\right)^{m_1}\left(1-\frac{k}{a_2}\right)^{m_2}, \qquad -a_1 \le k \le a_2; \qquad \frac{m_1}{a_1} = \frac{m_2}{a_2},$$

where $k = r_{xc}r_{yc}$, and

$$f_0 = \frac{a_1^{m_1}a_2^{m_2}}{(a_1+a_2)^{m_1+m_2+1}\,B(m_1+1,\,m_2+1)},$$

$$a_1 + a_2 = \frac{1}{2}\sqrt{\mu_2\left[\beta_1(s+2)^2 + 16(s+1)\right]},$$

$$s = \frac{6(\beta_2-\beta_1-1)}{6+3\beta_1-2\beta_2},$$

and $m_1$ and $m_2$ are given by

$$\frac{s-2}{2} \pm \frac{s(s+2)}{2}\sqrt{\frac{\beta_1}{\beta_1(s+2)^2+16(s+1)}},$$

with $m_1 < m_2$ if $\mu_3 > 0$ and $m_1 > m_2$ if $\mu_3 < 0$.

Note that here we obtained an approximate distribution for the product of two dependent zero-order correlation coefficients. The results also hold for the product of two dependent partial correlation coefficients. This is supported by Fisher's (1924) finding that the distribution of the sample partial correlation has the same form as that of the zero-order correlation coefficient.

Chapter 4

SIMULATION STUDY

Equation (4) in Chapter 3 gives us the expressions for the first four moments of $r_{xc}r_{yc}$ in terms of the moments and product-moments of the original correlation coefficients, $r_{xc}$ and $r_{yc}$.
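The Pearson Type I parameter formulae of Section 3.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the density treats $k$ as measured from the mode (my reading of the $m_1/a_1 = m_2/a_2$ form), and the sketch assumes $m_1, m_2 > 0$. The check at the end uses the exact central moments of a Beta(2, 3) distribution on [0, 1], for which the Type I parameters are known in closed form.

```python
import math


def pearson_type1_parameters(mu2, mu3, mu4):
    """Return (a1, a2, m1, m2) of the Pearson Type I density from central moments."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    s = 6.0 * (beta2 - beta1 - 1.0) / (6.0 + 3.0 * beta1 - 2.0 * beta2)
    root = beta1 * (s + 2.0) ** 2 + 16.0 * (s + 1.0)
    a_sum = 0.5 * math.sqrt(mu2 * root)  # a1 + a2
    delta = 0.5 * s * (s + 2.0) * math.sqrt(beta1 / root)
    m_low = 0.5 * (s - 2.0) - delta
    m_high = 0.5 * (s - 2.0) + delta
    # m1 < m2 if mu3 > 0, and m1 > m2 if mu3 < 0.
    m1, m2 = (m_low, m_high) if mu3 > 0 else (m_high, m_low)
    # m1/a1 = m2/a2 together with a1 + a2 determines a1 and a2.
    a1 = a_sum * m1 / (m1 + m2)
    a2 = a_sum * m2 / (m1 + m2)
    return a1, a2, m1, m2


def pearson_type1_density(k, a1, a2, m1, m2):
    """Density f(k) with the origin at the mode; assumes m1 > 0 and m2 > 0."""
    if not (-a1 < k < a2):
        return 0.0
    log_beta = (math.lgamma(m1 + 1.0) + math.lgamma(m2 + 1.0)
                - math.lgamma(m1 + m2 + 2.0))
    log_f0 = (m1 * math.log(a1) + m2 * math.log(a2)
              - (m1 + m2 + 1.0) * math.log(a1 + a2) - log_beta)
    return math.exp(log_f0 + m1 * math.log1p(k / a1) + m2 * math.log1p(-k / a2))


# Beta(2, 3): mu2 = 1/25, mu3 = 2/875, mu4 = 33/8750.
a1, a2, m1, m2 = pearson_type1_parameters(1 / 25, 2 / 875, 33 / 8750)
print(a1, a2, m1, m2, pearson_type1_density(0.0, a1, a2, m1, m2))
```

Working on the log scale with `math.lgamma` avoids overflow of the beta function when $m_1$ and $m_2$ are large, which can happen for the small moments that arise with large N.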
Then, by applying approximations to the moments and product-moments of the original correlation coefficients, the current study obtains approximate expressions for the non-central and central moments of the product, $r_{xc}r_{yc}$. However, we do not know how accurate the approximation is. Therefore, a simulation study is conducted to check the accuracy of the approximate moments against the moments calculated from the simulated data.

4.1 Simulation Design

The parameters in this simulation study are the sample size, N, and the population correlations, $\rho_{xc}$, $\rho_{yc}$, and $\rho_{xy}$. For $\rho_{xc}$, $\rho_{yc}$, and $\rho_{xy}$, I choose .10, .30, and .50 as small, medium, and large correlations (Cohen, 1988); and for the sample size, I select 28, 84, and 783, which correspond to a statistical power of .80 for the large, medium, and small correlations, respectively (Cohen & Cohen, 1983). Table 2 shows the parameter specifications for this simulation study.

Table 2
Parameter Specifications for the Simulation Study

    N              ρxc    ρyc    ρxy
    28, 84, 783    .10    .10    .10, .30, .50
                   .10    .30    .10, .30, .50
                   .10    .50    .10, .30, .50
                   .30    .30    .10, .30, .50
                   .30    .50    .10, .30, .50
                   .50    .50    .10, .30, .50

As can be seen in Table 2, we do not need to include every possible combination of .10, .30, and .50, because $\rho_{xc}$ and $\rho_{yc}$ are symmetric in the mathematical expressions in Chapter 3. Therefore, I have removed the duplicate cases. In addition, under the condition where $\rho_{xc}\rho_{yc}\rho_{xy} > 0$, we can always change the sign of the relevant variable to make all three correlations positive. Thus, only 18 positive correlation triplets are needed for this simulation study. Multiplied by the three magnitudes of N, we have 18 x 3 = 54 cells to simulate.

For each of the 54 cells (i.e., for each set of N, $\rho_{xc}$, $\rho_{yc}$, $\rho_{xy}$), by Cholesky factorization, I generate X, Y, and C with the specified population correlations (cf. Browne, 1968) and with sample size N. For simplicity, I generate (X, Y, C) from a trivariate standard normal distribution given the values of $\rho_{xc}$, $\rho_{yc}$, and $\rho_{xy}$.
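The data-generation step described above can be sketched in Python as follows. This is a hedged illustration, not the author's code (the study's own computations used other software): the function names `cholesky`, `draw_xyc`, and `pearson_r` are hypothetical, and the variable order (X, Y, C) in the correlation matrix is an assumption made for the example.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def draw_xyc(n, rho_xc, rho_yc, rho_xy, rng):
    """Draw n rows (X, Y, C) from a trivariate standard normal distribution."""
    corr = [[1.0, rho_xy, rho_xc],
            [rho_xy, 1.0, rho_yc],
            [rho_xc, rho_yc, 1.0]]
    lower = cholesky(corr)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # Multiply the independent normals by the lower-triangular factor.
        rows.append([sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(3)])
    return rows


def pearson_r(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# One simulated sample for the cell N = 84, rho_xc = .30, rho_yc = .30, rho_xy = .50.
rows = draw_xyc(84, 0.30, 0.30, 0.50, random.Random(2001))
x, y, c = ([row[i] for row in rows] for i in range(3))
product = pearson_r(x, c) * pearson_r(y, c)  # one draw of the product of correlations
print(product)
```

Repeating the last three lines many times per cell would give a simulated distribution of the product for that cell.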
Then, I compute the pair of estimated correlation coefficients, which are denoted $\tilde{r}_{xc}$ and $\tilde{r}_{yc}$. Within each cell, the same procedure is replicated 1000 times, resulting in 1000 pairs of $\tilde{r}_{xc}$ and $\tilde{r}_{yc}$ values. Directly multiplying $\tilde{r}_{xc}$ and $\tilde{r}_{yc}$ gives us a simulated distribution of the product $\tilde{r}_{xc}\tilde{r}_{yc}$, which serves as the true distribution of $r_{xc}r_{yc}$, with which the approximate moments[8] and distribution can be compared. Tables 3 to 14 show the comparisons of the first four moments.

[8] The approximate moments were calculated by SPSS (see Appendix A for the SPSS code). To make the code clearer, I computed Equations (6), (9), and (10) and substituted them into (4), instead of using the messy Equations (12), (12'), and (12'').

To quantify the accuracy of the approximation method in this paper, following Kendall and Stuart (1977, p. 247), the standard errors of the first four moments, $\tilde{\mu}'_1$, $\tilde{\mu}_2$, $\tilde{\mu}_3$, and $\tilde{\mu}_4$, of the simulated distribution of the product $\tilde{r}_{xc}\tilde{r}_{yc}$ are also computed (see the numbers in parentheses in Tables 3 to 14) by the following formulae:

$$\text{s.e.}(\tilde{\mu}'_1) = \sqrt{\frac{\mu_2}{1000}}; \qquad \text{s.e.}(\tilde{\mu}_2) = \sqrt{\frac{1}{1000}\left(\mu_4 - \mu_2^2\right)};$$

$$\text{s.e.}(\tilde{\mu}_3) = \sqrt{\frac{1}{1000}\left(\mu_6 - \mu_3^2 - 6\mu_4\mu_2 + 9\mu_2^3\right)}; \qquad \text{s.e.}(\tilde{\mu}_4) = \sqrt{\frac{1}{1000}\left(\mu_8 - \mu_4^2 - 8\mu_5\mu_3 + 16\mu_2\mu_3^2\right)}. \qquad (15)$$

Rather than estimating the standard errors empirically by generating 1000 values of each moment, here I simply use the formulae (15) above to calculate the standard errors. Actually, the two approaches yield similar results, except for one of the moments, for which the standard errors were slightly overestimated by the formula in (15). For instance, for the cell of N = 84, $\rho_{xc}$ = .30, $\rho_{yc}$ = .30, and $\rho_{xy}$ = .50, by the formulae (15) the standard errors of $\tilde{\mu}'_1$, $\tilde{\mu}_2$, $\tilde{\mu}_3$, and $\tilde{\mu}_4$ are .001632, .000137, .000018, and .000004, respectively, while they are .001542, .000127, .000014, and .000003, respectively, if we generate 1000 of each of those moments. Thus, Tables 3 to 14 display in parentheses the standard errors calculated only by the formulae (15).
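As a minimal sketch of the formulae (15), the Python functions below estimate the central moments of a list of simulated products and plug them into the standard-error expressions. The function names are hypothetical, and the number of replications is taken from the length of the input rather than fixed at 1000. Negative values under the square root, which can arise from rounding, are truncated at zero; that guard is my assumption and is not part of the source.

```python
import math


def central_moments(values, max_order=8):
    """Sample central moments of orders 2..max_order, keyed by order."""
    n = len(values)
    mean = sum(values) / n
    return {r: sum((v - mean) ** r for v in values) / n
            for r in range(2, max_order + 1)}


def moment_standard_errors(values):
    """Standard errors of the mean and of the 2nd-4th central moments, per (15)."""
    n = len(values)
    mu = central_moments(values)
    variances = (
        mu[2],
        mu[4] - mu[2] ** 2,
        mu[6] - mu[3] ** 2 - 6.0 * mu[4] * mu[2] + 9.0 * mu[2] ** 3,
        mu[8] - mu[4] ** 2 - 8.0 * mu[5] * mu[3] + 16.0 * mu[2] * mu[3] ** 2,
    )
    return tuple(math.sqrt(max(v, 0.0) / n) for v in variances)


# Consistency check against the reported cell N = 84, rho_xc = rho_yc = .30,
# rho_xy = .50: the simulated variance .002663610 gives s.e. of the mean ~ .001632.
print(math.sqrt(0.002663610 / 1000))
```

The final line reproduces the first of the four standard errors quoted in the text for that cell, using the simulated variance listed in Table 7.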
In addition, to show how far the approximated values are from the simulated values relative to the standard errors (s.e.) of the simulated values, the standardized differences, Std. Diff. = (Approximated - Simulated) / s.e., are also listed in Tables 3 to 14.

Table 3
Simulation Results for the Mean $\mu'_1$ (N = 28)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .016237540 (.001416792)    .015379623 (-.605534834)
    .10   .10   .30   .022705850 (.001482690)    .021156094 (-1.045232565)
    .10   .10   .50   .029056640 (.001561022)    .026944329 (-1.353158933)
    .10   .30   .10   .035789020 (.002169760)    .034214339 (-.725739594)
    .10   .30   .30   .041859490 (.002253031)    .039543751 (-1.027832596)
    .10   .30   .50   .047860880 (.002344481)    .044908457 (-1.259307843)
    .10   .50   .10   .054815860 (.003166590)    .053042589 (-.559993936)
    .10   .50   .30   .059943050 (.003241347)    .057454354 (-.767796859)
    .10   .50   .50   .065037510 (.003320096)    .061924942 (-.937493265)
    .30   .30   .10   .093832050 (.002612237)    .091340180 (-.953922018)
    .30   .30   .30   .099503540 (.002755509)    .096269592 (-1.173629978)
    .30   .30   .50   .105077490 (.002891617)    .101304886 (-1.304669237)
    .30   .50   .10   .151518190 (.003232683)    .148917726 (-.804428991)
    .30   .50   .30   .156342730 (.003393298)    .152976550 (-.992008421)
    .30   .50   .50   .161155260 (.003549875)    .157211844 (-1.110860634)
    .50   .50   .10   .248737050 (.003346673)    .245704306 (-.906196698)
    .50   .50   .30   .252917640 (.003566791)    .248939600 (-1.115299353)
    .50   .50   .50   .257125970 (.003767061)    .252469012 (-1.236231048)

Table 4
Simulation Results for the Mean $\mu'_1$ (N = 84)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .010398070 (.000618555)    .010916407 (.837980770)
    .10   .10   .30   .012648370 (.000655401)    .013098629 (.686998108)
    .10   .10   .50   .014908830 (.000694075)    .015285296 (.542399656)
    .10   .30   .10   .030208880 (.001164637)    .030509873 (.258443555)
    .10   .30   .30   .032315040 (.001213070)    .032523206 (.171602560)
    .10   .30   .50   .034499610 (.001262882)    .034549873 (.039800247)
    .10   .50   .10   .050552070 (.001821664)    .050096700 (-.249974738)
    .10   .50   .30   .052358840 (.001865880)    .051763367 (-.319137803)
    .10   .50   .50   .054340980 (.001914181)    .053452256 (-.464284133)
    .30   .30   .10   .089548730 (.001436461)    .089522706 (-.018116748)
    .30   .30   .30   .091304370 (.001536766)    .091384928 (.052420470)
    .30   .30   .50   .093066460 (.001632057)    .093287151 (.135222607)
    .30   .50   .10   .149496750 (.001868874)    .148693461 (-.429825126)
    .30   .50   .30   .150922470 (.001980651)    .150226795 (-.351235456)
    .30   .50   .50   .152451530 (.002090734)    .151826795 (-.298811281)
    .50   .50   .10   .248913370 (.001942712)    .247612886 (-.669416767)
    .50   .50   .30   .249983780 (.002086890)    .248835108 (-.550422870)
    .50   .50   .50   .251040100 (.002226773)    .250168441 (-.391444850)

Table 5
Simulation Results for the Mean $\mu'_1$ (N = 783)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .010037140 (.000167601)    .010105409 (.407331195)
    .10   .10   .30   .010288380 (.000181604)    .010354331 (.363158371)
    .10   .10   .50   .010545220 (.000194859)    .010603761 (.300427607)
    .10   .30   .10   .030017610 (.000358608)    .030060721 (.120217478)
    .10   .30   .30   .030253530 (.000376577)    .030290379 (.097852579)
    .10   .30   .50   .030497610 (.000393853)    .030521557 (.060801909)
    .10   .50   .10   .049967830 (.000568102)    .050015047 (.083113581)
    .10   .50   .30   .050166610 (.000584003)    .050205161 (.066011599)
    .10   .50   .50   .050371640 (.000599416)    .050397810 (.043659134)
    .30   .30   .10   .089681690 (.000442595)    .089953037 (.613082460)
    .30   .30   .30   .089872890 (.000482732)    .090165458 (.606067374)
    .30   .30   .50   .090092450 (.000520759)    .090382441 (.556862130)
    .30   .50   .10   .149624530 (.000572730)    .149862675 (.415806530)
    .30   .50   .30   .149788470 (.000616166)    .150037580 (.404290690)
    .30   .50   .50   .149987730 (.000658331)    .150220089 (.352951511)
    .50   .50   .10   .249354340 (.000582864)    .249745934 (.671844984)
    .50   .50   .30   .249408930 (.000638882)    .249885351 (.745710667)
    .50   .50   .50   .249522490 (.000695198)    .250037442 (.740727301)

Table 6
Simulation Results for the Variance $\mu_2$ (N = 28)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .002007300 (.000145714)    .002082915 (.518928191)
    .10   .10   .30   .002198370 (.000168953)    .002259304 (.360657337)
    .10   .10   .50   .002436790 (.000193817)    .002502991 (.341563738)
    .10   .30   .10   .004707860 (.000262925)    .004732986 (.095563454)
    .10   .30   .30   .005076150 (.000288296)    .005094097 (.062251932)
    .10   .30   .50   .005496590 (.000319352)    .005514970 (.057553959)
    .10   .50   .10   .010027290 (.000479899)    .010002292 (-.052090086)
    .10   .50   .30   .010506330 (.000504464)    .010463308 (-.085282518)
    .10   .50   .50   .011023040 (.000542946)    .010970180 (-.097357668)
    .30   .30   .10   .006823780 (.000324890)    .006634122 (-.583760512)
    .30   .30   .30   .007592830 (.000360409)    .007529351 (-.176130254)
    .30   .30   .50   .008361450 (.000398414)    .008493985 (.332656208)
    .30   .50   .10   .010450240 (.000464233)    .010310752 (-.300469693)
    .30   .50   .30   .011514470 (.000494732)    .011511735 (-.005528245)
    .30   .50   .50   .012601610 (.000532917)    .012800062 (.372388203)
    .50   .50   .10   .011200220 (.000472091)    .010941769 (-.547459824)
    .50   .50   .30   .012722000 (.000519374)    .012522676 (-.383777661)
    .50   .50   .50   .014190750 (.000561380)    .014271177 (.143266703)

Table 7
Simulation Results for the Variance $\mu_2$ (N = 84)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000382610 (.000030085)    .000394662 (.400597850)
    .10   .10   .30   .000429550 (.000036380)    .000447068 (.481529421)
    .10   .10   .50   .000481740 (.000042849)    .000509135 (.639344415)
    .10   .30   .10   .001356380 (.000077782)    .001320810 (-.457303157)
    .10   .30   .30   .001471540 (.000088283)    .001447716 (-.269860865)
    .10   .30   .50   .001594870 (.000099699)    .001583651 (-.112528167)
    .10   .50   .10   .003318460 (.000161163)    .003138425 (-1.117101076)
    .10   .50   .30   .003481510 (.000173131)    .003308192 (-1.001078030)
    .10   .50   .50   .003664090 (.000186859)    .003485889 (-.953663857)
    .30   .30   .10   .002063420 (.000104842)    .001977716 (-.817455669)
    .30   .30   .30   .002361650 (.000121537)    .002314607 (-.387068319)
    .30   .30   .50   .002663610 (.000137272)    .002665894 (.016638551)
    .30   .50   .10   .003492690 (.000163370)    .003225462 (-1.635717828)
    .30   .50   .30   .003922980 (.000183839)    .003683807 (-1.300989915)
    .30   .50   .50   .004371170 (.000204267)    .004167092 (-.999076307)
    .50   .50   .10   .003774130 (.000171079)    .003398698 (-2.194491155)
    .50   .50   .30   .004355110 (.000195961)    .004005468 (-1.784244005)
    .50   .50   .50   .004958520 (.000220055)    .004670806 (-1.307466027)

Table 8
Simulation Results for the Variance $\mu_2$ (N = 783)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000028090 (.000001459)    .000028980 (.609886393)
    .10   .10   .30   .000032980 (.000001747)    .000034073 (.625794230)
    .10   .10   .50   .000037970 (.000002035)    .000039300 (.653585335)
    .10   .30   .10   .000128600 (.000006030)    .000130418 (.301481748)
    .10   .30   .30   .000141810 (.000006616)    .000144278 (.373008067)
    .10   .30   .50   .000155120 (.000007225)    .000158337 (.445249186)
    .10   .50   .10   .000322740 (.000014903)    .000328215 (.367377356)
    .10   .50   .30   .000341060 (.000015682)    .000347269 (.395920301)
    .10   .50   .50   .000359300 (.000016454)    .000366650 (.446687221)
    .30   .30   .10   .000195890 (.000008773)    .000202016 (.698282409)
    .30   .30   .30   .000233030 (.000010558)    .000240276 (.686304995)
    .30   .30   .50   .000271190 (.000012356)    .000279452 (.668649487)
    .30   .50   .10   .000328020 (.000014938)    .000336743 (.583938073)
    .30   .50   .30   .000379660 (.000017344)    .000389197 (.549883427)
    .30   .50   .50   .000433400 (.000019839)    .000443997 (.534151916)
    .50   .50   .10   .000339730 (.000015007)    .000353338 (.906795524)
    .50   .50   .30   .000408170 (.000018179)    .000422995 (.815510421)
    .50   .50   .50   .000483300 (.000021619)    .000499029 (.727550288)

Table 9
Simulation Results for the Third Moment $\mu_3$ (N = 28)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000102612 (.000022926)    .000081765 (-.909316933)
    .10   .10   .30   .000158858 (.000030373)    .000086309 (-2.388601719)
    .10   .10   .50   .000213329 (.000037334)    .000088324 (-3.348288423)
    .10   .30   .10   .000229896 (.000040731)    .000240334 (.256266726)
    .10   .30   .30   .000342621 (.000049056)    .000347121 (.091731898)
    .10   .30   .50   .000460741 (.000058132)    .000460429 (-.005367096)
    .10   .50   .10   .000312498 (.000091352)    .000289363 (-.253251160)
    .10   .50   .30   .000528974 (.000102736)    .000538633 (.094017676)
    .10   .50   .50   .000761741 (.000122471)    .000801008 (.320622841)
    .30   .30   .10   .000400153 (.000055523)    .000532525 (2.384093079)
    .30   .30   .30   .000514982 (.000066161)    .000707575 (2.910974743)
    .30   .30   .50   .000637833 (.000078976)    .000911491 (3.465077998)
    .30   .50   .10   .000332125 (.000086245)    .000526148 (2.249672445)
    .30   .50   .30   .000484532 (.000095090)    .000778373 (3.090135661)
    .30   .50   .50   .000656781 (.000107726)    .001073164 (3.865204315)
    .50   .50   .10   .000186486 (.000080016)    .000490178 (3.795390922)
    .50   .50   .30   .000223523 (.000089843)    .000713757 (5.456563116)
    .50   .50   .50   .000281911 (.000100228)    .000994611 (7.110787405)

Table 10
Simulation Results for the Third Moment $\mu_3$ (N = 84)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000010500 (.000002313)    .000009451 (-.453523562)
    .10   .10   .30   .000015611 (.000003300)    .000012091 (-1.066666667)
    .10   .10   .50   .000020994 (.000004420)    .000014954 (-1.366515837)
    .10   .30   .10   .000032053 (.000008501)    .000026099 (-.700388190)
    .10   .30   .30   .000047655 (.000011095)    .000040805 (-.617395223)
    .10   .30   .50   .000064406 (.000014054)    .000056834 (-.538778995)
    .10   .50   .10   .000051326 (.000020603)    .000030707 (-1.000776586)
    .10   .50   .30   .000082326 (.000024813)    .000062103 (-.815016322)
    .10   .50   .50   .000116846 (.000029601)    .000095498 (-.721191852)
    .30   .30   .10   .000052947 (.000011332)    .000055051 (.185668902)
    .30   .30   .30   .000072472 (.000014632)    .000077684 (.356205577)
    .30   .30   .50   .000093117 (.000017853)    .000104548 (.640284546)
    .30   .50   .10   .000054061 (.000019433)    .000051763 (-.118252457)
    .30   .50   .30   .000075321 (.000024430)    .000083043 (.316086779)
    .30   .50   .50   .000098774 (.000029508)    .000120181 (.725464281)
    .50   .50   .10   .000027162 (.000017725)    .000045863 (1.055063470)
    .50   .50   .30   .000031018 (.000022389)    .000072871 (1.869355487)
    .50   .50   .50   .000041201 (.000027059)    .000107492 (2.449868805)

Table 11
Simulation Results for the Third Moment $\mu_3$ (N = 783)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000000108 (.000000020)    .000000111 (.150000000)
    .10   .10   .30   .000000151 (.000000026)    .000000155 (.153846154)
    .10   .10   .50   .000000197 (.000000032)    .000000205 (.250000000)
    .10   .30   .10   .000000127 (.000000123)    .000000299 (1.398373984)
    .10   .30   .30   .000000288 (.000000142)    .000000485 (1.387323944)
    .10   .30   .50   .000000461 (.000000164)    .000000692 (1.408536585)
    .10   .50   .10   -.000000203 (.000000467)   .000000348 (1.179871520)
    .10   .50   .30   .000000128 (.000000503)    .000000732 (1.200795229)
    .10   .50   .50   .000000506 (.000000542)    .000001144 (1.177121771)
    .30   .30   .10   .000000605 (.000000214)    .000000615 (.046728972)
    .30   .30   .30   .000000801 (.000000293)    .000000895 (.320819113)
    .30   .30   .50   .000000992 (.000000373)    .000001231 (.640750670)
    .30   .50   .10   .000000137 (.000000456)    .000000565 (.938596491)
    .30   .50   .30   .000000275 (.000000573)    .000000945 (1.169284468)
    .30   .50   .50   .000000389 (.000000700)    .000001401 (1.445714286)
    .50   .50   .10   .000000545 (.000000449)    .000000490 (-.122494432)
    .50   .50   .30   .000000710 (.000000606)    .000000814 (.171617162)
    .50   .50   .50   .000000774 (.000000798)    .000001235 (.577694236)

Table 12
Simulation Results for the Fourth Moment $\mu_4$ (N = 28)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000025395 (.000005791)    .000000932 (-4.224313590)
    .10   .10   .30   .000033559 (.000008163)    -.000001358 (-4.277471518)
    .10   .10   .50   .000043744 (.000010421)    -.000004857 (-4.663755878)
    .10   .30   .10   .000091691 (.000012082)    .000038147 (-4.431716603)
    .10   .30   .30   .000109363 (.000015133)    .000036404 (-4.821185489)
    .10   .30   .50   .000132794 (.000018452)    .000028269 (-5.664697594)
    .10   .50   .10   .000332066 (.000036967)    .000252585 (-2.150052750)
    .10   .50   .30   .000366214 (.000042815)    .000267453 (-2.306691580)
    .10   .50   .50   .000417883 (.000053603)    .000271646 (-2.728149544)
    .30   .30   .10   .000152673 (.000018407)    .000105481 (-2.563807247)
    .30   .30   .30   .000188228 (.000022664)    .000125898 (-2.750176491)
    .30   .30   .50   .000229483 (.000028618)    .000141015 (-3.091341114)
    .30   .50   .10   .000325797 (.000034324)    .000300419 (-.739366041)
    .30   .50   .30   .000378532 (.000038363)    .000369983 (-.222844929)
    .30   .50   .50   .000444161 (.000044656)    .000440353 (-.085274095)
    .50   .50   .10   .000349379 (.000031879)    .000374644 (.792527996)
    .50   .50   .30   .000432846 (.000036435)    .000486989 (1.486016193)
    .50   .50   .50   .000517932 (.000041225)    .000619684 (2.468211037)

Table 13
Simulation Results for the Fourth Moment $\mu_4$ (N = 84)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000001057 (.000000269)    .000000160 (-3.334572491)
    .10   .10   .30   .000001517 (.000000437)    .000000094 (-3.256292906)
    .10   .10   .50   .000002080 (.000000652)    -.000000048 (-3.263803681)
    .10   .30   .10   .000007925 (.000001627)    .000004301 (-2.227412415)
    .10   .30   .30   .000010005 (.000002306)    .000004813 (-2.251517780)
    .10   .30   .50   .000012543 (.000003136)    .000005068 (-2.383609694)
    .10   .50   .10   .000037124 (.000005280)    .000027981 (-1.731628788)
    .10   .50   .30   .000042258 (.000006742)    .000030662 (-1.719964402)
    .10   .50   .50   .000048534 (.000008487)    .000032974 (-1.833392247)
    .30   .30   .10   .000015310 (.000002288)    .000010897 (-1.928758741)
    .30   .30   .30   .000020430 (.000003221)    .000014492 (-1.843526855)
    .30   .30   .50   .000026042 (.000004214)    .000018342 (-1.827242525)
    .30   .50   .10   .000039028 (.000004820)    .000030707 (-1.726348548)
    .30   .50   .30   .000049363 (.000006550)    .000039828 (-1.455725191)
    .30   .50   .50   .000061049 (.000008456)    .000050142 (-1.289853359)
    .50   .50   .10   .000043661 (.000004284)    .000035330 (-1.944677871)
    .50   .50   .30   .000057562 (.000005858)    .000048855 (-1.486343462)
    .50   .50   .50   .000073253 (.000007544)    .000065800 (-.987937434)

Table 14
Simulation Results for the Fourth Moment $\mu_4$ (N = 783)

    ρxc   ρyc   ρxy   Simulated (s.e.)           Approximated (Std. Diff.)
    .10   .10   .10   .000000003 (.000000001)    .000000002 (-1.000000000)
    .10   .10   .30   .000000004 (.000000001)    .000000003 (-1.000000000)
    .10   .10   .50   .000000006 (.000000001)    .000000004 (-2.000000000)
    .10   .30   .10   .000000053 (.000000006)    .000000050 (-.500000000)
    .10   .30   .30   .000000064 (.000000007)    .000000061 (-.428571429)
    .10   .30   .50   .000000077 (.000000008)    .000000072 (-.625000000)
    .10   .50   .10   .000000327 (.000000033)    .000000321 (-.181818182)
    .10   .50   .30   .000000364 (.000000037)    .000000359 (-.135135135)
    .10   .50   .50   .000000401 (.000000041)    .000000399 (-.048780488)
    .30   .30   .10   .000000116 (.000000012)    .000000122 (.500000000)
    .30   .30   .30   .000000166 (.000000019)    .000000171 (.263157895)
    .30   .30   .50   .000000227 (.000000025)    .000000230 (.120000000)
    .30   .50   .10   .000000332 (.000000033)    .000000340 (.242424242)
    .30   .50   .30   .000000447 (.000000044)    .000000453 (.136363636)
    .30   .50   .50   .000000583 (.000000058)    .000000589 (.103448276)
    .50   .50   .10   .000000342 (.000000032)    .000000375 (1.031250000)
    .50   .50   .30   .000000499 (.000000047)    .000000538 (.829787234)
    .50   .50   .50   .000000703 (.000000069)    .000000748 (.652173913)

4.2 Simulation Results

The simulation results are listed in Tables 3 through 14, and the first set of results is the estimated first four moments of the product of two simulated correlations, $\tilde{r}_{xc}\tilde{r}_{yc}$ (under the "Simulated" column in Tables 3 to 14). To establish whether 1000 replications (the number of simulated correlations per cell) are enough, Figure 2 displays the distributions of the simulated product $\tilde{r}_{xc}\tilde{r}_{yc}$ for selected cells with different numbers of replications. From Figure 2, we can see that there is not much difference in the shapes of the distributions across the different numbers of replications, verifying that 1000 replications are adequate for this simulation design. On the other hand, as the population correlations, $\rho$s, become larger, the distributions are more spread out, while as the sample size N becomes larger, the distributions are less spread out.
Now we turn to the comparison of the approximated values and those obtained via Frank's method for the first four moments against the simulated values. Figure 3 shows the most noticeable fact: the approximated values are much more accurate than Frank's values, which consistently and heavily overestimate all four moments. As for the approximated values themselves, most of them are within or very close to one standard error of the simulated values, except for the third and fourth moments when N is 28 or 84. The inaccuracy might come from the lower-order approximation of $\sigma_{r_{xc},r_{yc}}$ (cf. Equation 5), which is heavily involved in the third and fourth moments. In Chapter 5, I will introduce a regression approach to correct the inaccuracy for the third and fourth moments when N = 28 or 84.

Figure 2. The distributions of the product of simulated correlations $\tilde{r}_{xc}\tilde{r}_{yc}$ with different numbers of replications (500, 1000, and 5000) for the selected cells.
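The "within one standard error" comparison used in this section can be sketched as a two-line computation. The helper names below are hypothetical; the inputs are taken from Table 3 (N = 28, $\rho_{xc} = \rho_{yc} = \rho_{xy} = .10$) and Table 9 (N = 28, $\rho_{xc} = \rho_{yc} = \rho_{xy} = .50$).

```python
def standardized_difference(approximated, simulated, standard_error):
    """(Approximated - Simulated) / s.e., as listed in Tables 3 to 14."""
    return (approximated - simulated) / standard_error


def within_one_standard_error(approximated, simulated, standard_error):
    """True when the approximated value lies within one s.e. of the simulated value."""
    return abs(standardized_difference(approximated, simulated, standard_error)) <= 1.0


# Mean, N = 28, all correlations .10 (Table 3): inside one standard error.
print(within_one_standard_error(0.015379623, 0.016237540, 0.001416792))
# Third moment, N = 28, all correlations .50 (Table 9): far outside one standard error.
print(within_one_standard_error(0.000994611, 0.000281911, 0.000100228))
```

The second case is one of the third-moment cells that motivates the correction introduced in Chapter 5.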
Figure 3. The trends of the accuracy of the approximated and Frank's moments along one parameter (N, $\rho_{xc}$, $\rho_{yc}$, or $\rho_{xy}$), holding the other parameters constant. Each panel plots one of $\mu'_1$, $\mu_2$, $\mu_3$, and $\mu_4$; markers show the approximated value and Frank's value, and lines show the simulated value together with its one-standard-error upper and lower bounds.
Now we examine the trends of the accuracy of the approximated and Frank's values. Figure 3 shows the trends of the accuracy of the approximated and Frank's values of the four moments along one parameter, N, $\rho_{xc}$, $\rho_{yc}$, or $\rho_{xy}$, holding the other parameters constant. As the sample size N becomes larger, all of the approximated values and most of Frank's values become more accurate, except for Frank's mean. This is not an unexpected result for Frank's mean: the mean of the Fisher z, $E[z(r)] = \frac{1}{2}\ln\left[\frac{1+\rho}{1-\rho}\right]$, does not equal the mean of the original correlation, $E(r)$ (cf. Equation 5); therefore, the abnormal trend regarding N for Frank's mean is predictable. It also shows the inadequacy of Frank's approach as an approximation to the distribution of the product of the two dependent correlation coefficients. Frank might argue that he was trying to obtain a p-value through Aroian's procedure, not to approximate the distribution. However, the P-P plots in Chapter 6 will also show that Frank's quantiles are not comparable to the extremely accurate approximated quantiles obtained by the approximation procedure described in the current study (cf. Figure 6).
As for the trends of the approximated and Frank's values regarding the $\rho$s, there is no clear trend in the approximated values, except for the third moment, for which the approximated values increasingly overestimate the simulated values as the $\rho$s become larger. As stated above, I will fix this problem with a regression approach based on the simulated values. Frank's values consistently overestimate all of the moments when any of the $\rho$s becomes large, since the magnitudes of the corresponding Fisher z's increase much more than do those of the original correlations. This is further evidence of the detachment of the distribution of the product of two Fisher z's from that of the two original correlation coefficients.

Chapter 5

APPROXIMATION CORRECTION

As we can see in Tables 3 through 14 and Figure 3, the approximations in this study to the first four moments are very accurate, apart from a small discrepancy for the third and fourth moments when N is small. To account for this discrepancy, we can borrow the principle of regression analysis, in which we use predictors to explain as much variation in an outcome as possible through a linear model. In this light, we can model the deviations of the simulated values from the approximated values of the moments as a function of $\rho_{xc}$, $\rho_{yc}$, $\rho_{xy}$, a term involving the sample size N, and up to 4-way interactions. By the nature of the formulae in Equations (12), (12'), and (12''), a regression model without an intercept is employed.

To keep the work for the approximation correction to a minimum, we need to model the discrepancy only for the third and fourth moments when N = 28 and 84. For the other cases, we do not need to model the deviations, since the approximations are accurate enough and the deviations do not have much variation to be explained in the regression model.
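A regression without an intercept, as described above, can be sketched in Python by solving the normal equations directly. This is an illustration only: the function names are hypothetical, the predictor columns in the toy example are arbitrary numbers rather than the study's actual predictors, and the stepwise selection used in the study is not reproduced.

```python
def fit_through_origin(rows, outcomes):
    """Least-squares coefficients for outcomes ~ rows with no intercept term."""
    p = len(rows[0])
    # Normal equations: (X'X) b = X'y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * y for r, y in zip(rows, outcomes)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = a[r][col] / a[col][col]
            for c in range(col, p):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = (b[i] - tail) / a[i][i]
    return coef


def predict(row, coef):
    """Predicted deviation for one set of predictor values."""
    return sum(x * c for x, c in zip(row, coef))


# Toy data: deviations D = S - A generated exactly as 2*x1 - 3*x2 (hypothetical).
predictors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
deviations = [2.0, -3.0, -1.0, 1.0]
coefficients = fit_through_origin(predictors, deviations)
print(coefficients, predict([3.0, 1.0], coefficients))
```

In the study's setting, each row would hold the predictor values for one simulation cell and each outcome would be that cell's deviation of the simulated moment from the approximated moment.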
For instance, for the very accurate approximation to the variance $\mu_2$, only 41% of the outcome variance could be explained by all predictors in the model, whereas, as shown below for the third moment $\mu_3$, more than 99% of the outcome variance can be explained by the predictors.

Since we only have 54 cells of records, a stepwise regression is used to obtain more statistical power; the stepwise regression results in a simpler model for the third moment when N = 28 and 84. In particular, the model is

D: S- 4" .31(-N— iii—HM yzcg—Hflfi ?)+fl4(-x——- ”)4" fls(—— 9) wtffi-igi”) + a, (16)

where $D$ is the deviation between the simulated value and the approximated value; $S$ is the simulated value; $A$ is the approximated value; $\beta_i$ is the regression coefficient; and $\varepsilon$ is the error. We can now use the predicted value of the deviation, $\hat{D}$, to correct the inaccuracy of the approximation to the third moment. The corrected or estimated value of the approximation, denoted as $\hat{A} = A + \hat{D}$, is theoretically assumed to be more accurate than the approximated value, $A$. The virtue of this methodology is that it utilizes all the available information borne in the simulation data across $\rho_{xc}$, $\rho_{yc}$, $\rho_{xy}$, and N to correct the inaccuracy of the approximation. This method is consistent with the general principle of statistical analysis, and the method works because the range of parameters in the experimental design is limited by theory, constraining the $\rho$s according to what are defined as small, medium, and large effects.

After model (16) was fitted[9] to the data in Tables 9 and 10, I obtained the predictive equation as follows ($R^2$ = .991):

A X D”: = _.176408(7‘;‘—;) + .116374(%1;—) + .447522(.’1;_5;) _ ,550128(p.. 2705c) )0..qu ppm.)
——2—)- ——5—)- — 1.659398( .805672( (17) And, the estimated values, X , for the third moments, along with the simulated values, S, as well as the approximated values, A, are listed in Tables 15 and 16 that show more accurate results for the third moment than those in Tables 9 and 10. For the fourth moments when N = 28 and 84, a stepwise regression ended up with the following model: pry D= S— 4= fl(p—‘§)+ +fl2(— ’5)+ 447’;— ‘5+) fl4(-p—"——— ”EL—HM ) +fls(p—”—— —p—Nx.p“”)+a< ”2”: pt)”. (18) After model (18) was fitted (cf., Footnote 9) to the data in Tables 12 and 13, I obtained the predictive equation as follows (R2: .:981) ___.p_): 15m=097559(-;*;)+.113876(;y; 084159(%— 5;)- .425381( ) X X X X +1090953(5£N—f—5y—)+.564063(515A7€£)—3.620234(p ‘5 ’13”; p5").(19) '91 Here a regular (unweighted) regression was conducted. A WLS regression weighted by the sample sizes was also tried, but the result was not evidently better than that of the unweighted regression. 50 The estimated values, A , for the fourth moments, along with the simulated values, S, as well as the approximated values, A, are listed in Tables 17 and 18 that show more accurate results for the fourth moment than those in Tables 12 and 13. To determine whether the correction is sample dependent, a cross validation is analyzed by fitting the models (16) and (18) to the full sample in Tables 9 through 14. Extremely good approximations were obtained. Specifically, compared with ones in Equations (17) and (19), only two coefficients have differences in 1045. 51 .828, 00880859“ 05 n V 30:?» 022987. 05 u m. ”V Ea m. 8233 302.33 008680 05 u Q 8029» 008838 05 u Q + V .1. V .082 ( ( z A 8888.. V 28888. A8882: V 5888. A 88228. V 8888. 8. 88888.- V 82388. AS88848 V 88:82 A 88888. 88888. 8. 285.82.- V 88:89 88888.8 V 8888. A 8888. 84858. 2. 8. 8. 88888. V ”8888. 88888.8 V 8888. A 8888. 8888. 8. 28888. V 8888. 28838.8 V 8888. A 88888. 88888. 8. A 8888. V 88888. 838888 V ”2888. A 88888. 8888. 2. 88. A8888. V 8888. 
88888.8 V 84:88. A 8888c. 8888. 8. 88888.- V 82888. 888838 V 88888. A $888. 8888c. 8. A8285. V8888. 888.488 V8888. A 88888. «.2888. 2. 8. 8. 88888.- V 8888. 28888. V 88888. A $858. 3:888. 8. A8888. V 8888. A8888. V 88888. A 8888. 8888. 8. 88888.- V 8888. A8588; V «.8888. A 8288. 8888. 2. 8. A8888. V 8888. $8888.- V 8888. A 8888. 3888. 8. A8885. V 8588. 888:8. V 8588. A 8888. 8888. 8. A8888. V 8888. A8888. V8888. A 5888. 8888a. A:. 8. 85888.- V 82888. 88888.8- V 48888. A 88888. 88:88. 8. 88888.- V 8838. 85888.0 V 88888. A 8888. V 88838. 8. 888284.- V 88888. A8888? V 88888. A 88288. V 28888. 2. 2. 2. Earn .28 N Cam 0me V 60.th 00. 80. 640A 3N u g 8308 hSSSRENWV 03 3:0 .330; 380383. 05 A34: 30.804809 30803 P35 030 .8308 3808.23 2 030,—. 52 .829, 3088588“ 05 u V803? 382588 05.1-meva 802509 828303 38on08 05 u Q 8029» 308838 08 u Q + Vu N .082 8828:.- V 98888. 888838 V 88288. A 8888. 8:88. 8. 88888.- V 88888. AS8883 V 5888. A 88888. 28888. 8. A 2588.- V 8588. 888888 V 88888. A 8588. 8888. 2. 8. 88. A8885..- V 88888. 28888. V 5888. A 8888. 8888a. 8. @8888.- V 8888. 8:888. V 88888. A 83888. 88888. 8. 88388.- V 8888. A588:- V 8:88. A 8888. 8888. S. 8. A8858- V 88888. 88888. V 88888. A 88:88. 2888. 8. 88888.- V 8888. 88888. V 88888. A 88388. 8888. 8. 88888.- V 88888. 88888. V 8888. A «88:88. 88888. S. 8. 88. A8888.- V 88888. 888:8.- V 88288. A 8888. 88:88. 8. 8:888.- V 8888. A8888.- V 8888. A 3888. 8888. 8. 88888.- V 8888. 88888.7 V 8888. A 88888. 88888. 2. 8. 88288. V 88888. 88888.- . V 8888. A 88:88. 88888. 8. 88888.- V 88888. 88888.- V 88888. A 88:88. 88888. 8. 88888.- V 5888. 82888.- V 88888. A 8888. 88880. 2. 8. A8888? V 88888. A8888?- V 88888. A 83888. 88888. 8. A 88888; V 88888. AS8883- V 8888. A 8888. V :888. 8. 88:88. V 8388. 88888.- V 8888. A 8888. V 88288. S. S. 2. A85 .EmV N A85 .BmV w Ada-Va nu 9Q 23 9w .1. 2V 80338 308538-AQ3M 03 3:: .8058: 30338.8. 05 5.3» 338889 3028: 30.2? 058 80:30; 30S§8Q 3 2an 53 .8:—g 38:88.83 05 n V ”839» coax—=81.” 05 u m. ”V v5 m. 
coo-Eon Scrum-6w 36:85 2: u Q ”82? 83838 05 M Q ( ‘ if w .302 z $8888... V 88:88. 88:88.” V 8888. A 83.88. V 88888. 8. 88888.- V 88888. A8888: V 88888. A 3888. V 88888. 8. A 88888. V2888. 88888. V3888. A3888. V 88888. E. 8. 8. 88888. V 88888. 88:88.- V 88888. A 88388. $3388. 8. 28888. V 2:888. 883.88.- V 88888. A «8888. V 88888. 8. 88888. V 88888. A 3888? V 28888. A 8888. V 88888. E. 8. 28888.- V 88888. 3:38:88- V 22:88. A 88888. V 88888. 8. 83882. V 8888. ARV-8888- V 88288. A «8888. V 88888. 8. 82828.. V 88388. 888888- V ;8888. A 8888. V 88888. A:. 8. 8. 8888:.- V 28888. $88888- V 82808. A 88888. 88:88. 8. 2888:. V 88888. 888888- V 88888. A 2888. 38888. 8. 8888:.- V 8888. A8882.”- V 88888. A 88888. 88888. S. 88. A8838? V 8828. 38888.8- V 88888. A 8888. 88288. 8. 88:28.- V 88889 88828.? V 8888. A 8588. 88888. 8. $8888.- V 88888. 88228.? V 3888. A 8888. 82888. 2. 8. 88282. 88:88. 88888.? 88888.- A 5888. 3.8888. 88. A8888: V 8888. 885:?- V 82888.- A 8888. 88888. 8. A 8888. V 88888. A8888?- V 88888. A 88888. 88888. 2. A:. 2. A85 .EmV N €5.38 V $.an & am am 9N u 2v .3338 EBENHENRV 2% V83 8538 32:35.5 2: 3.3» huxamsob 23:63 3.59% 2th .3338 32:52.3 2 2an 54 .8:—g 3858058 2:" V8039» woes—=88 ofium. ”Vugwnoogon 828166 36608. 05" Q 803g 38:58 05 n Q + V" N 682 288887 V 82888. $8888.. V 8888. A 8888. V 88888. 8. 88888.7 V 8888. 88888.7 V 88888. A 8888. V 8888. 8. 882888- V8888. 28:88.7 V8888. A 8888. V 28888. 2. 8. 8 88:88.7 V 8888. 88888.7 V 8888. A 8888. V 88888. 8. 2888.7 V 8888. 22888.7 V 88888. A 8888. V 88888. 8. 8828827 V 8888. 88888.7 V 8888. A 8888. V 8888. S. 8. A8888. V 8888. 88888.7 V 88288. A 3888. V 88888. 8. 88888. V 8288. 88888.7 V 8388. A 8888. V 88888. 8. A8888. V 8888. 28888.7 V 8888. A 8888. V 2888. E. 8. 8 28:88.- V 8888. 88888.7 V 8888. A 8888. V 88888. 88. A8888. V 8888. ANS-88:7 V 88888. A 3888. V 88888. 8. 2888:.- V 8888. 88828.7 V 28888. A 8888. V 82888. 2. 88. A8888. V 88288. $8888.”- V 8888. A 8888. V 88288. 88. 88888.2 V 8888. 88:288- V 28888. 
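The correction described in this chapter amounts to a least-squares fit without an intercept, followed by adding the predicted deviation back to the approximated value. The following Python sketch illustrates only that idea. The predictor set in `predictors`, the coefficients in `TRUE_BETA`, and the generated deviations are hypothetical stand-ins; they are not the fitted Equations (17) or (19), and they are not the values in Tables 9 through 14.

```python
import itertools
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_no_intercept(rows, d):
    """Least-squares coefficients of d on the columns of rows, no intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xtd = [sum(r[i] * y for r, y in zip(rows, d)) for i in range(k)]
    return solve(xtx, xtd)

def predictors(rxc, ryc, rxy, n):
    """Hypothetical predictor set: correlations and products, each divided by n."""
    return [rxc / n, ryc / n, rxy / n, rxc * ryc / n, rxc * ryc * rxy / n]

# Three sizes for each of three correlations, times two sample sizes: 54 cells.
CELLS = [(a, b, c, n)
         for a, b, c in itertools.product([0.1, 0.3, 0.5], repeat=3)
         for n in (28, 84)]
TRUE_BETA = [-0.2, 0.1, 0.4, -0.5, 0.8]  # hypothetical, for illustration only

def make_deviations(noise_sd, seed=0):
    """Generate hypothetical deviations D = S - A for every cell."""
    rng = random.Random(seed)
    return [sum(b * x for b, x in zip(TRUE_BETA, predictors(*cell)))
            + rng.gauss(0.0, noise_sd) for cell in CELLS]

rows = [predictors(*cell) for cell in CELLS]
beta_hat = fit_no_intercept(rows, make_deviations(noise_sd=1e-5))
# Predicted deviation for one cell; the corrected value would be A + d_hat.
d_hat = sum(b * x for b, x in zip(beta_hat, predictors(0.3, 0.3, 0.3, 28)))
print([round(b, 3) for b in beta_hat], d_hat)
```

With real data, the rows would come from the design cells and the deviations from the simulated and approximated moments; the normal equations are solved directly here only to keep the sketch free of external libraries.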
Chapter 6

DISTRIBUTION COMPARISON

Up to now, we have compared the approximate distribution of the product of two dependent correlation coefficients with the simulated distribution in terms of the first four moments. However, in analysis we usually use p-values obtained from the shape of a distribution, rather than from its moments. Thus, it is more desirable to compare the shape of the approximate distribution with that of the simulated distribution. Having obtained the distribution function for the approximate distribution of the product of two dependent correlation coefficients as a Pearson Type I distribution, the distribution comparison becomes achievable. A P-P plot is employed for this comparison.

A P-P plot, a probability plot, is a graphical tool for assessing the fit of data to a theoretical distribution (Rice, 1995, p. 321). Specifically, for a given sample $X_1, \ldots, X_n$, we plot

$$X_{(i)} \quad \text{vs.} \quad F^{-1}\!\left(\frac{i}{n+1}\right),$$

where $X_{(i)}$, $i = 1, \ldots, n$, are the order statistics of $X_1, \ldots, X_n$, and F is the cumulative distribution function of the theoretical distribution.

Although we do not know the theoretical distribution of the product of two dependent correlation coefficients, we have the approximated Pearson Type I distribution. We can therefore compare the simulated data with the approximated Pearson Type I distribution to validate the approximation of the distribution of the product of two dependent correlation coefficients by the Pearson Type I distribution.
Thus, here in the P-P plot, I plot the quantiles of the Pearson Type I distribution against those of the simulated distribution of the product of two dependent correlation coefficients. Because calculating every quantile of the Pearson Type I distribution from its mathematical function is complex, I used the 15 common percentiles, .25%, .5%, 1%, 2.5%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, 97.5%, 99%, 99.5%, and 99.75%, which can be looked up in Pearson and Hartley's table (Pearson & Hartley, 1972, Table 32). Setting $i/(n+1)$ equal to each of these percentiles, and noting that n = 1000 in this simulation study, I obtained the corresponding ith order statistic $X_{(i)}$ in the simulated data. Therefore, our P-P plot has only 15 points, instead of 1000 points, each corresponding to one of the 15 percentiles above.

Figure 4 displays P-P plots for selected cells. From the P-P plots, we can see that when the correlations are larger and N is bigger, the simulated data fit the Pearson Type I distribution better, which is consistent with the conclusion in Chapter 3 (Approximation Procedures). I argued there that if the correlations are larger than a small size, say .10, and N is bigger than 300, the distribution of the product of two correlations can be approximated by the Pearson Type I distribution. Figure 4 also shows unmatched lower ends for the cases of smaller correlations, which result from the inaccuracy of the approximations to the fourth moment (cf. Tables 12 and 13), the moment that regulates the tails of the distribution. However, cases with smaller correlations are less interesting to us, in that the impact of the confounding variable would be so small that we would not need to assess its impact on the causal inference about the predictor of interest.
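The simulated side of the 15-point comparison described above can be sketched in Python as follows. The sketch assumes that (C, X, Y) is trivariate normal with unit variances, draws `reps` samples of size `n_obs`, and reads off the order statistics whose ranks satisfy $i/(n+1) \approx p$. The function names, the seed, the rounding of the rank to the nearest integer, and the chosen cell (all three correlations equal to .50, N = 84) are illustrative assumptions. The Pearson Type I quantiles, which the study took from Pearson and Hartley's table, are not computed here.

```python
import math
import random

# The 15 percentiles used in the study, as proportions.
PERCENTILES = [0.0025, 0.005, 0.01, 0.025, 0.05, 0.10, 0.25, 0.50,
               0.75, 0.90, 0.95, 0.975, 0.99, 0.995, 0.9975]

def cholesky3(rxc, ryc, rxy):
    """Lower-triangular factor of the correlation matrix of (C, X, Y)."""
    l21, l31 = rxc, ryc
    l22 = math.sqrt(1 - l21 ** 2)
    l32 = (rxy - l31 * l21) / l22
    l33 = math.sqrt(1 - l31 ** 2 - l32 ** 2)
    return l21, l22, l31, l32, l33

def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def simulate_products(rxc, ryc, rxy, n_obs, reps, seed=0):
    """Return the sorted simulated values of r_xc * r_yc."""
    rng = random.Random(seed)
    l21, l22, l31, l32, l33 = cholesky3(rxc, ryc, rxy)
    out = []
    for _ in range(reps):
        c, x, y = [], [], []
        for _ in range(n_obs):
            z1, z2, z3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
            c.append(z1)
            x.append(l21 * z1 + l22 * z2)
            y.append(l31 * z1 + l32 * z2 + l33 * z3)
        out.append(corr(x, c) * corr(y, c))
    return sorted(out)

def order_statistics(sorted_values, percentiles=PERCENTILES):
    """Pick X_(i) with i/(n + 1) closest to each percentile."""
    n = len(sorted_values)
    picks = []
    for p in percentiles:
        i = min(max(round(p * (n + 1)), 1), n)  # rank kept inside 1..n
        picks.append(sorted_values[i - 1])
    return picks

products = simulate_products(0.5, 0.5, 0.5, n_obs=84, reps=1000)
quantiles = order_statistics(products)
for p, q in zip(PERCENTILES, quantiles):
    print(f"{100 * p:g}%: {q:.3f}")
```

Plotting these 15 values against the tabled Pearson Type I quantiles for the same percentiles would give one panel of the kind shown in Figure 4.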
To address the concern above, I also created P-P plots (see Figure 5) using the estimated values, obtained via the regression approach, for the third and the fourth moments. Figure 5 shows a better fit of the simulated data to the Pearson Type I distribution when the estimated third and fourth moments are used. The improved results come from the better approximation of the estimated values of the moments to the simulated ones (relative to that of the approximated values). Thus, the better results validate the regression approach in Chapter 5 as a helpful methodology for correcting the inaccuracy of the approximated values.

[Figure 4. P-P plots for the selected cells. Panels: $\rho_{xc} = \rho_{yc} = \rho_{xy} = .10$ and $\rho_{xc} = \rho_{yc} = \rho_{xy} = .50$, each for N = 28, N = 84, and N = 783. Horizontal axis: (Approximated) Pearson Type I Quantiles. Vertical axis: Simulated Data. Points are labeled by the 15 percentiles from .25% to 99.75%.]

[Figure 5. P-P plots for the selected cells using the estimated values for the third and the fourth moments. Panels: $\rho_{xc} = \rho_{yc} = \rho_{xy} = .10$ and $\rho_{xc} = \rho_{yc} = \rho_{xy} = .50$, each for N = 28 and N = 84. Horizontal axis: (Estimated) Pearson Type I Quantiles. Vertical axis: Simulated Data.]

Figure 5 also shows that the estimated values for the third and the fourth moments corrected the approximation well, and the P-P plots indicate a better fit of the simulated values to the Pearson Type I distribution for small correlations.

To compare the approximate distribution (the Pearson Type I distribution) with Frank's distribution, Figure 6 presents a P-P plot that includes both distributions. Following Frank's approach, the Fisher z's, $z(r_{xc})$ and $z(r_{yc})$, were obtained from the simulated correlations and standardized by the variances. Then the quantiles of the standardized product $z(r_{xc}) \times z(r_{yc})$ were looked up in Meeker, Cornwell, and Aroian's (1981) tables (see Frank, 2000, p. 174, for the detailed steps). Figure 6 shows that the simulated data fit the approximated distribution much better than they fit Frank's distribution. In addition, Frank's distribution is too conservative.
For example, the quantile of .30 corresponds to the 99th percentile of the approximate distribution, but only to the 50th percentile of Frank's distribution. This provides evidence that the approximation procedures using the Pearson Type I distribution are more advanced and accurate than Frank's approach for approximating the distribution of the product of two dependent correlation coefficients.

[Figure 6. A P-P plot with Frank's distribution for $\rho_{xc} = \rho_{yc} = \rho_{xy} = .5$ and N = 783. Plotted series: Frank's distribution and the (Approximated) Pearson Type I distribution, with points labeled by percentile. Axis: Quantiles, from .18 to .42.]

Chapter 7

APPLICATIONS

In this chapter, through a real example, I illustrate an application in causal inference of the methodology based on the approximate distribution of the product of two dependent correlation coefficients discussed in the current study.

7.1 An Index of the Robustness of a Causal Inference (IOROCI)

Suppose a causal inference about a predictor of interest, X, on an outcome, Y, was made through the following regression model

$$Y = \beta_0 + \beta_x X + \varepsilon, \qquad (20)$$

and the corresponding hypothesis regarding the coefficient of X, $\beta_x$, was $H_0\colon \beta_x = 0$ versus $H_1\colon \beta_x \neq 0$. We know that the t-value under the null hypothesis $H_0$ is

$$t_0 = \frac{r_{xy}}{\sqrt{\dfrac{1 - r_{xy}^2}{n - 2}}}.$$

Suppose we also have a second model (21) that includes a potential confounding variable, C:

$$Y = \beta_0 + \beta_x X + \beta_c C + \varepsilon. \qquad (21)$$

Then the t-value under the null hypothesis $H_0$ with respect to model (21) is

$$t_c = \frac{r_{xy} - r_{xc} r_{yc}}{\sqrt{\dfrac{1 - r_{xy}^2 - r_{xc}^2 - r_{yc}^2 + 2 r_{xy} r_{xc} r_{yc}}{n - 3}}}.$$

We want to know the likelihood that we will retain the primary statistical inference that rejects $H_0$ when C is in model (21). That is, for a particular study with an observed $t_0$ that is larger than $t_\alpha$, we are interested in the following probability function W:

$$W = W(t_\alpha \mid t_0) = P(t_c > t_\alpha \mid T = t_0), \quad \text{for } t_0 > t_\alpha,^{[10]} \qquad (22)$$

where $t_\alpha$ is the t-critical value at level $\alpha$. W is the likelihood that we will reject the null hypothesis $H_0$ when the potential confounding variable is in model (21), for an observed $t_0$ from model (20) that is larger than $t_\alpha$.

[10] (a) Without loss of generality, $t_0 > 0$ is assumed, because the case of $t_0 < 0$ is symmetrical. (b) Different $t_\alpha$'s might be used for the two models (20) and (21), because the degrees of freedom are different: n − 2 for one and n − 3 for the other. But the two $t_\alpha$'s are very close, and when n is fairly large they are almost identical. For simplicity, we use the same symbol for both $t_\alpha$'s.

In practice, we make use of all the information we have for assessing the robustness of a causal inference. In view of the unobserved confounding variable C, the information about $r_{xc}$ and $r_{yc}$ can be acquired from the distribution of observed covariates. We can use information about the distribution of the covariates to estimate W in the following manner.

Suppose we have m covariates $Z_1, \ldots, Z_m$. Models (20) and (21) then become, respectively,

$$Y = \beta_0 + \beta_x X + \beta_{z_1} Z_1 + \cdots + \beta_{z_m} Z_m + \varepsilon, \qquad (23)$$

$$Y = \beta_0 + \beta_x X + \beta_c C + \beta_{z_1} Z_1 + \cdots + \beta_{z_m} Z_m + \varepsilon. \qquad (24)$$

The impact of the confounding variable is the product of two partial correlations, $r_{xc \cdot z_1 \ldots z_m} \times r_{yc \cdot z_1 \ldots z_m}$ (cf. footnote [1], p. 6). If we use the distribution of the covariates to acquire the information about the confounding variable C, then the impact of the confounding variable, $r_{xc \cdot z_1 \ldots z_m} \times r_{yc \cdot z_1 \ldots z_m}$, can be estimated from the impacts of the covariates, $r_{x z_i \cdot z_1 \ldots z_{i-1} z_{i+1} \ldots z_m} \times r_{y z_i \cdot z_1 \ldots z_{i-1} z_{i+1} \ldots z_m}$, $i = 1, \ldots, m$. Consequently, we can estimate the probability function W in (22) as follows:

$$W = W(t_\alpha \mid t_0, \bar{r}_{xz \cdot z}, \bar{r}_{yz \cdot z}) = P(t_c > t_\alpha \mid T = t_0, \rho_{xc} = \bar{r}_{xz \cdot z}, \rho_{yc} = \bar{r}_{yz \cdot z}), \quad \text{for } t_0 > t_\alpha, \qquad (25)$$

where

$$\bar{r}_{xz \cdot z} = \frac{1}{m} \sum_{i=1}^{m} r_{x z_i \cdot z_1 \ldots z_{i-1} z_{i+1} \ldots z_m}, \qquad \bar{r}_{yz \cdot z} = \frac{1}{m} \sum_{i=1}^{m} r_{y z_i \cdot z_1 \ldots z_{i-1} z_{i+1} \ldots z_m}.$$

Intuitively, $t_c$ could be any value when $t_0 > t_\alpha$, but $t_c$ and $t_0$ are related.
For example, in the case of confounding ($\rho_{xc}\rho_{yc}\rho_{xy} > 0$), $|t_c| < t_\alpha$; in the case of suppression, $t_c < -t_\alpha$; and in the robust case, $t_c > t_\alpha$. Note that ITCV is not defined when $t_0 < t_\alpha$. Following Frank's Figure 1 (2000, p. 156), Figure 7 shows the general relationship between $t_c$ and $t_0$.

[Figure 7. The relationship between $t_c$ and $t_0$. Regions: Robust, Confounding, Suppression, and ITCV Not Defined.]

[Figure 8. The relationship between $t_c$ and $t_0$ for a simulated dataset with N = 84 and $\rho_{xy} = .30$. Scatterplots of $t_c$ against $t_0$ in panels by ($\rho_{xc}$, $\rho_{yc}$), including $\rho_{xc} = .30$ & $\rho_{yc} = .30$ and $\rho_{xc} = .50$ & $\rho_{yc} = .50$.]

Table 19. Detailed Counts and Percentages for $t_c < -2$, $-2 < t_c < 2$, and $2 < t_c$ Based on a Simulated Dataset with N = 84 and $\rho_{xy} = .30$. [Table body not recoverable from the source.]

[...] 2 ($\approx t_\alpha$ for N = 84 and $\alpha = .05$), they are never below −2, which is the suppression case (cf. Figure 7). In addition, Figure 8 shows that more points fall into the robust region as $\rho_{xc}$ or $\rho_{yc}$ becomes smaller and that more points fall into the confounding region as $\rho_{xc}$ and $\rho_{yc}$ become larger. This information is also shown in Table 19, with the detailed counts and percentages for $t_c < -2$, $-2 < t_c < 2$, and $2 < t_c$. This emphasizes the importance of using $\rho_{xc}$ and $\rho_{yc}$, which are employed in (25) through their estimates as the means of the partial correlations with respect to the observed covariates.
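The two t-values and the three regions of Figure 7 can be illustrated with a short Python sketch. The formulas are the standard ones for a standardized regression with one and with two predictors. The function names are hypothetical, the cutoff of 2 is the approximate $t_\alpha$ quoted above for N = 84 and $\alpha = .05$, and the input correlations are example values rather than results from the simulated dataset.

```python
import math

def t_without_confounder(rxy, n):
    """t-value for the predictor in model (20): only X predicts Y."""
    return rxy / math.sqrt((1 - rxy ** 2) / (n - 2))

def t_with_confounder(rxy, rxc, ryc, n):
    """t-value for the predictor in model (21): X and C predict Y."""
    residual = 1 - rxy ** 2 - rxc ** 2 - ryc ** 2 + 2 * rxy * rxc * ryc
    return (rxy - rxc * ryc) / math.sqrt(residual / (n - 3))

def region(tc, t_alpha=2.0):
    """Classify t_c into the regions of Figure 7 (assumes t_0 > t_alpha)."""
    if tc > t_alpha:
        return "robust"
    if tc < -t_alpha:
        return "suppression"
    return "confounding"

n, rxy = 84, 0.30
print("t0 =", round(t_without_confounder(rxy, n), 3))
for rxc, ryc in [(0.10, 0.10), (0.30, 0.30), (0.50, 0.50)]:
    tc = t_with_confounder(rxy, rxc, ryc, n)
    print(rxc, ryc, round(tc, 3), region(tc))
```

Holding $r_{xy}$ fixed, larger values of $r_{xc}$ and $r_{yc}$ shrink the numerator $r_{xy} - r_{xc} r_{yc}$, which is why points move from the robust region toward the confounding region in Figure 8.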
When the impacts of covariates are generally large, the primary inferences are less robust, because we anticipate a hypothetical confounding variable comparable to the covariates in terms of its (large) impact on the primary inferences for the predictor of interest. This is based on the assumption that the impacts of the unmeasured covariates behave as do the impacts of the measured covariates.

One way to obtain the value of W in (25) is through the impact threshold of the confounding variable, ITCV. When $t_0 > t_\alpha$, the coefficient of X, $\beta_x$, is significant in model (20) or (23), and the inference is retained as long as the impact of the confounding variable does not exceed the impact threshold of the confounding variable, ITCV. Therefore, the value of W can be obtained in terms of ITCV as follows:

$$W = P(K < \mathrm{ITCV}) = \int_{-\infty}^{\mathrm{ITCV}} f(k)\,dk, \quad \text{for } t_0 > t_\alpha, \qquad (26)$$

where K is the impact of the confounding variable with the reference distribution $f(k)$. Note that the condition $T = t_0$ in (25) is implied by ITCV, because ITCV is a function of $t_0$ (cf. Equation 18 in Frank, 2000, p. 167), and the reference distribution captures $\rho_{xc} = \bar{r}_{xz \cdot z}$ and $\rho_{yc} = \bar{r}_{yz \cdot z}$.

Of course, we cannot always measure confounding variables, but under the assumption that the impacts of existing covariates represent the impact of the confounding variable, we can use the former to represent the latter by estimating the means for $\rho_{xc} = \bar{r}_{xz \cdot z}$ and $\rho_{yc} = \bar{r}_{yz \cdot z}$. It is for this reference distribution that I have obtained the approximated Pearson distribution in the preceding chapters. Thus, we can still obtain the value of W as described in (26) by using the reference distribution generated from the measured covariates for the impact of the unmeasured confounding variable.
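The means $\bar{r}_{xz \cdot z}$ and $\bar{r}_{yz \cdot z}$ can be computed from a correlation matrix by the standard recursion for partial correlations. The Python sketch below is one way to do this under stated assumptions: the function names and the equicorrelated example matrix are hypothetical, and the recursion is chosen for brevity (its cost grows quickly with the number of covariates, so a matrix-inversion approach would be preferable for many covariates).

```python
import math

def partial_corr(r, a, b, given):
    """Partial correlation of variables a and b controlling for `given`,
    computed recursively from the correlation matrix r (list of lists)."""
    if not given:
        return r[a][b]
    c, rest = given[0], given[1:]
    rab = partial_corr(r, a, b, rest)
    rac = partial_corr(r, a, c, rest)
    rbc = partial_corr(r, b, c, rest)
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))

def mean_partial(r, target, covariates):
    """Average over covariates z_i of the partial correlation between
    `target` and z_i, controlling for the remaining covariates."""
    total = 0.0
    for z in covariates:
        others = tuple(v for v in covariates if v != z)
        total += partial_corr(r, target, z, others)
    return total / len(covariates)

# Hypothetical correlation matrix: index 0 = X, 1 = Y, 2..4 = covariates,
# with every off-diagonal correlation equal to .5.
R = [[1.0 if i == j else 0.5 for j in range(5)] for i in range(5)]
rho_xc_hat = mean_partial(R, 0, [2, 3, 4])
rho_yc_hat = mean_partial(R, 1, [2, 3, 4])
print(rho_xc_hat, rho_yc_hat, rho_xc_hat * rho_yc_hat)
```

The two means would then serve as the estimates of $\rho_{xc}$ and $\rho_{yc}$ that define the reference distribution.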
For the impacts of the measured covariates to have a tractable distribution that is representative of the impact of the unmeasured confounding variable, we assume homogeneous impacts of the covariates. When the empirical distribution of the impacts is heterogeneous, researchers need to evaluate the sources of the impacts according to substantive theory. Concerns about the heterogeneity of the impacts of the covariates would be greater if the impacts were obtained from multiple estimated models accounting for different sets of covariates. The heterogeneity could be assessed by generating a P-P plot of the observed impacts against the theoretical distribution.

In addition, one may also be concerned about the influence of small values of the population correlations on obtaining the value of W through (26). The fact that my approximation to the distribution of the product of two dependent correlations is comparatively poor for small values of $\rho_{xc}$ and $\rho_{yc}$ becomes more of a concern if the partial correlations of observed covariates with the outcome and with the predictor of interest are smaller than their zero-order correlations, which usually happens in the social sciences because the covariates are often correlated with one another. On the other hand, in the case of confounding, given the generally negative relationship between the t-value for the predictor of interest and the product of the correlations with respect to the covariates, when the impacts of the covariates are small, the inferences about the predictor of interest through the t-test are more likely to be robust to the small impacts of the covariates. Thus, we are more likely to retain the primary inference when the impacts of covariates are small, although we may have some difficulty in characterizing the distribution of the small, partialled impacts of covariates. In other words, the poor approximation for small correlations would only result in a more conservative decision.
The following is a guideline for interpreting the value of W. If W > .95, the probability of sustaining the original inference is large, and we can say that the statistical inference is very robust with respect to concerns about confounding variables. If .8 < W ≤ .95, the statistical inference is fairly robust, but we may still need to check some possible confounding variables, and we should interpret the causal inference regarding X with caution. If W ≤ .8, we may want to say that the statistical inference is not robust and that we need to consider the possibility of a confounding variable; that is, we cannot make a causal inference regarding X from the linear model. Thus, W serves as an index of the robustness of a causal inference (IOROCI) to a confounding variable. Please note that the values .95 and .8 for IOROCI are arbitrary, as is .05 for the significance level or .2, .5, and .8 for small, medium, and large effect sizes. Researchers can make their own judgments based on what is studied. The example below shows exactly how to implement this technique.

7.2 An Example

In the following text, I use an example pertaining to educational attainment to illustrate the applications of the methodology delineated in this study.

[Figure 9. Father's Education as a potential confounding variable for the causal relationship between Father's Occupation and Educational Attainment. Path diagram: Father's Occupation (X, predictor of interest), Educational Attainment (Y, outcome), and Father's Education (C, confounding variable), with C related to X through $r_{xc}$.]

From a general linear model, Featherman and Hauser (1976) concluded that family background, e.g., Father's Occupation, has an effect on Educational Attainment, but Sobel (1998) argued that both family background and Educational Attainment are affected by Father's Education, which was not controlled for in the analysis.
That is, Father's Education is a potential confounding variable for the causal relationship between Father's Occupation and Educational Attainment (cf. Figure 9). I now use the methodology described in this study to assess how robust Featherman and Hauser's causal inference about Father's Occupation is to the impact of Father's Education. In order to compare my results with Frank's results (Frank, 2000), I use the same data set extracted from Featherman and Hauser (1976) and Duncan, Featherman, and Duncan (1972).

Let X be Father's Occupation, Y be Educational Attainment, and C be any covariate (we have 14 covariates and N = 10,567 in the data). First, we need to obtain the reference distribution from the 14 covariates. We could estimate the $\beta$-coefficients for the reference distribution directly from the sample moments of $r_{xc} r_{yc}$. But we have only 14 covariates, and the sampling error of the sample moments would be too large. Thus, I estimate the $\beta$-coefficients for the reference distribution from Formulae (5), (12), (12'), (12''), and (13) in Chapter 3. First, I obtained the estimated population correlations $\hat\rho_{xc}$, $\hat\rho_{yc}$, and $\hat\rho_{xy}$ as follows:

$$\hat\rho_{xc} = \frac{1}{14} \sum_{i=1}^{14} r_{x c_i \cdot c_1 \ldots c_{i-1} c_{i+1} \ldots c_{14}} = .235;$$

$$\hat\rho_{yc} = \frac{1}{14} \sum_{i=1}^{14} r_{y c_i \cdot c_1 \ldots c_{i-1} c_{i+1} \ldots c_{14}} = .260;$$

$$\hat\rho_{xy} = r_{xy \cdot c_1 \ldots c_{14}} = .325.$$

Substituting those estimated population values into Formulae (5), (12), (12'), (12''), and (13) gives the estimated values of $\beta_1$ and $\beta_2$ as $b_1 = .0073$ and $b_2 = 2.995$.[11] Also, from (14), I obtained $\kappa = -.1726 < 0$, verifying that the reference distribution can be approximated by a Pearson Type I distribution (cf. Figure 1). From the formulae at the end of Chapter 3, I obtained the distribution function $f(k)$ for $r_{xc} r_{yc}$ as follows:

$$f(k) = 33.18 \left(1 + \frac{k}{.14}\right)^{87.28} \left(1 - \frac{k}{.28}\right)^{180.99}, \quad -.14 \le k \le .28.$$

[Figure 10. Distribution function of $k = r_{xc} r_{yc}$. Horizontal axis: k. Vertical axis: f(k).]
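The probability in (26) can be approximated for this density with a simple numerical integration. The Python sketch below uses composite Simpson's rule and the rounded constants printed in $f(k)$ above. Because those constants are rounded, the printed density need not integrate to exactly one, so the sketch divides by the total area; this renormalization, the function names, and the step count are assumptions of the sketch and not part of the original analysis. The threshold passed in the usage line, .228, is the ITCV that Frank (2000) reports for Father's Occupation, and the labels in `interpret` follow the guideline in Section 7.1.

```python
def density(k):
    """Pearson Type I density with the rounded constants printed in the text."""
    if k <= -0.14 or k >= 0.28:
        return 0.0
    return 33.18 * (1 + k / 0.14) ** 87.28 * (1 - k / 0.28) ** 180.99

def simpson(f, a, b, steps=20000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def ioroci(threshold, lower=-0.14, upper=0.28):
    """P(K < threshold), renormalized by the total area under `density`."""
    total = simpson(density, lower, upper)
    return simpson(density, lower, threshold) / total

def interpret(w):
    """Guideline labels from Section 7.1 (cutoffs .95 and .8)."""
    if w > 0.95:
        return "very robust"
    if w > 0.80:
        return "fairly robust"
    return "not robust"

w = ioroci(0.228)
print(w, interpret(w))
```

Almost all of the area under this density lies far below .228, so the result is insensitive to the integration rule; a coarser grid or a different quadrature method would lead to the same interpretation.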
Given the ITCV for Father's Occupation of .228 (Frank, 2000), I obtained[12]

$$\mathrm{IOROCI} = \int_{-\infty}^{\mathrm{ITCV}} f(k)\,dk = \int_{-.14}^{.228} f(k)\,dk = .99999$$

via numerical integration using Mathematica.

[11] It is not necessary to use the corrected $\mu_3$ and $\mu_4$ for this example, because the sample size N is very large, but the corrected $\mu_3$ and $\mu_4$ would be applied for small N, say N < 90: First, one would obtain the estimated mean values of $\rho_{xc}$ and $\rho_{yc}$; second, one would substitute these into Equations (17) and (19) to obtain $\hat{D}_{\mu_3}$ and $\hat{D}_{\mu_4}$; and last, from $\tilde{A}_{\mu_3} = A_{\mu_3} + \hat{D}_{\mu_3}$ and $\tilde{A}_{\mu_4} = A_{\mu_4} + \hat{D}_{\mu_4}$ one would obtain the corrected $\mu_3$ and $\mu_4$.

[12] There are also two other methods of obtaining IOROCI: (a) programming using Bowman and Shenton's (1979) approach; and (b) looking up the probability value in Pearson and Hartley's table (1972) with interpolation.

According to the guidance for interpreting IOROCI in the previous section, we can say that the inference regarding the effect of Father's Occupation on Educational Attainment is very robust and that it is very unlikely for the impact of a covariate to alter the inference. To the extent that the impact of Father's Education is represented by the impacts of the other covariates, we can then conclude that it is very unlikely that the impact of Father's Education will alter our inference about Father's Occupation.

Chapter 8

DISCUSSION

8.1 Conclusions and Limitations

Causal inference is a controversial topic in the social sciences, where we are often not able to conduct a randomized experiment or to statistically control for all possible confounding variables. In the literature, some have attempted to deal with the crisis in causal inference, but most approaches have practical or theoretical limitations. Frank's (2000) index is the soundest approach, and it leads to a very promising methodology.
Frank's index is composed of the product of two dependent correlation coefficients: the correlation between the confounding variable and the predictor of interest and the correlation between the confounding variable and the outcome. The index is most informative when evaluated against a reference distribution defined by the impacts of existing covariates, because confounding variables themselves cannot be measured. Thus, Frank used the reference distribution to statistically assess how robust the causal inference for a given predictor on an outcome is to the impacts of uncontrolled confounding variables. Unfortunately, when he generated the reference distribution, Frank used an approximation based on an approximately normally distributed Fisher z for each r, and then another approximation to the product of two normal variables; therefore, this doubly asymptotic result is very rough.

The current study provided a much more accurate approximation to the reference distribution than Frank's, in closed form: the Pearson Type I (Beta) distribution. With the more accurate approximation to the reference distribution, we can draw a more valid conclusion about whether the causal inference for a given predictor on an outcome is robust to the impacts of other possible confounding variables, that is, whether all uncontrolled confounds are unlikely to have an impact great enough to alter the inference. This methodology allows for multiple partial causes in the complex social phenomena that we study. Therefore, we are able to inform the controversy about causal inference that arises from the use of statistical linear models in the social sciences.

As stated in Chapter 1, some researchers may be uncomfortable with the use of measured covariates to generate a reference distribution for the impact of an unknown confounding variable.
But we acknowledge that this use of the reference distribution is only as valid as the set of covariates on which it is based, which is no different from any other inference from a sample that must be representative of the population. In this light, the impact of existing covariates represents important information by which to assess the ITCV. Ultimately, this approach also allows us to rescale ITCV to a probability scale, IOROCI, that can be represented by the existing covariates.

We hope to have homogeneous impacts of covariates, because they produce a desirable reference distribution that is representative of the unknown confounding variable. We can use a Q-statistic, like the one used in meta-analysis, or a P-P plot to test the homogeneity. In the case of heterogeneous impacts of covariates, it is reasonable, but arguable, to use the maximum impact to get a reference distribution. Concerns about heterogeneity would be greater if the impacts of covariates were obtained from multiple estimated models accounting for different sets of covariates. Researchers should evaluate the likelihood of the observed extreme impacts according to theoretical and statistical criteria. In any case, the solution provided in the current study is technical; it offers only a quantitative discussion and does not solve the problem of causal inference.

A simulation study was conducted to check the accuracy of the approximation method in this study. The simulation results (Tables 3 to 14), along with the line charts (Figure 3) and the P-P plot (Figure 6), indicate that the approximation technique is generally much better than Frank's (2000). We also recognize that the approximation is not very favorable with respect to the third and the fourth moments when N is small. The problem may come from the lower-order approximation to the covariance of $r_{xc}$ and $r_{yc}$ (cf. the covariance expression in Equation 5).
Although the correction of the approximation by the regression approach solved the problem nicely, it would be worth finding a better approximation to the covariance of $r_{xc}$ and $r_{yc}$ through further study.

For researchers in the social sciences, this study provides IOROCI, a statistical index with which to evaluate the robustness of causal inferences drawn from correlational data via statistical linear models.

Note that the current study is based on linear modeling, which assumes that all dependent and independent variables are measured without error. This concern does not apply directly to the confounding variable, which is assumed to be perfectly measured to maximize its impact. But it does apply to the distribution of impacts of covariates used to generate the reference distribution. To the extent that the covariates are unreliably measured, their observed impacts will underestimate their true impacts. When reliabilities are known, a correction for attenuation is recommended; that is, one can conduct all analyses on a correlation matrix that has been adjusted for attenuation. It is especially important to correct for attenuation when the impact of a covariate comprises two partial correlations, because partial correlations produce a small, underestimated impact.

8.2 Extensions

In addition to its application to causal inference in educational research, the distribution of the product of two dependent correlation coefficients may also be applicable to other issues in the social sciences. For example, it could be applied to assessing indirect effects with respect to mediating processes in path analysis, because an indirect effect can be considered a product of two correlation coefficients.
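The behavior of a product of two dependent correlation coefficients can be explored numerically. The following is a minimal Monte Carlo sketch in Python; it is not the SPSS code of Appendix A and not the Pearson Type I approximation developed in this study. It draws repeated samples from a trivariate normal distribution for X, Y, and C and records the product $r_{xc} r_{yc}$ for each sample. The function names, the seed, and the parameter values (all population correlations set to .30, N = 28) are illustrative assumptions.

```python
import math
import random


def pearson_r(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def simulate_products(rho_xy, rho_xc, rho_yc, n, reps, seed=0):
    """Return `reps` simulated values of r_xc * r_yc.

    X, Y, and C are standard trivariate normal with the given
    population correlations (hypothetical helper, for illustration only).
    """
    if abs(rho_xc) >= 1.0:
        raise ValueError("rho_xc must lie strictly between -1 and 1")
    # Triangular (Cholesky-type) construction from independent normals:
    #   C = z1
    #   X = rho_xc * z1 + s * z2
    #   Y = rho_yc * z1 + a * z2 + b * z3
    s = math.sqrt(1.0 - rho_xc ** 2)
    a = (rho_xy - rho_xc * rho_yc) / s
    b_squared = 1.0 - rho_yc ** 2 - a ** 2
    if b_squared <= 0.0:
        raise ValueError("correlations do not form a positive definite matrix")
    b = math.sqrt(b_squared)

    rng = random.Random(seed)
    products = []
    for _ in range(reps):
        c, x, y = [], [], []
        for _ in range(n):
            z1 = rng.gauss(0.0, 1.0)
            z2 = rng.gauss(0.0, 1.0)
            z3 = rng.gauss(0.0, 1.0)
            c.append(z1)
            x.append(rho_xc * z1 + s * z2)
            y.append(rho_yc * z1 + a * z2 + b * z3)
        # Both correlations share the variable C, so they are dependent.
        products.append(pearson_r(x, c) * pearson_r(y, c))
    return products


if __name__ == "__main__":
    draws = sorted(simulate_products(0.3, 0.3, 0.3, n=28, reps=2000, seed=1))
    print("mean of products:", sum(draws) / len(draws))
    print("empirical 95th percentile:", draws[int(0.95 * len(draws))])
```

The empirical quantiles of the simulated products can then be compared with those of any proposed approximate distribution.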
If we know the distribution of the product, we can better understand the likelihood of observing an indirect effect of a given size.

Another task would be to extend the simulation range for the correlation coefficient beyond .50. Although .50 is a large correlation in the social sciences (Cohen & Cohen, 1983), correlation coefficients larger than .50 are often seen in social research on real data. It would therefore be valuable to examine whether the conclusions of the current study still hold for correlation coefficients larger than .50, that is, whether the Pearson Type I distribution remains a reasonable approximate distribution for the product of two dependent correlation coefficients when some population correlations are larger than .50. It would also be valuable to conduct a larger number of replications for small $\rho_{xc}$ or $\rho_{yc}$, where the approximation was poor, or to obtain better estimates of small p-values.

Note that the approximation method in this study is based on the assumption that the three initial variables X, Y, and C follow a trivariate normal distribution. However, predictors and covariates need not be normally distributed in linear models. Hence, as an extension of this study, it would be interesting to find an approximate distribution of the product of two dependent correlation coefficients for non-normal, categorical, or mixed initial variables X, Y, and C.

APPENDIX A

Below is the SPSS code for calculating the approximate moments for N = 28 or M = N + 6 = 34 as an example, where bxc = $b_{r_{xc}}$; s2xc0 = $\sigma^{(2)}_{r_{xc}}$; s3xc0 = $\sigma^{(3)}_{r_{xc}}$; s4xc0 = $\sigma^{(4)}_{r_{xc}}$; byc = $b_{r_{yc}}$; s2yc0 = $\sigma^{(2)}_{r_{yc}}$; s3yc0 = $\sigma^{(3)}_{r_{yc}}$; s4yc0 = $\sigma^{(4)}_{r_{yc}}$; pxc = $\rho_{xc}$; s2xc = $E[(\Delta r_{xc})^2]$; s3xc = $E[(\Delta r_{xc})^3]$; s4xc = $E[(\Delta r_{xc})^4]$; pyc = $\rho_{yc}$; s2yc=E1