Using machine learning to uncover population heterogeneity in longitudinal study
Machine learning has emerged as a data-analytic tool in the quantitative social and behavioral sciences. Among these methods, model-based recursive partitioning (MOB) is a popular comprehensive approach that incorporates a parametric model into a tree-based algorithm. It has gained growing interest as a complementary data-analytic tool for addressing population heterogeneity by detecting parameter instability over candidate covariates. Structural equation model trees (SEM Trees) in particular have shown their value for discovering informative covariates, and their complex interactions, that predict differences in structural parameters with interpretable results, which in turn produces distinct homogeneous subgroups. While previous studies have made important contributions to the use of this approach, the performance of SEM Trees in the presence of interaction effects among covariates of various types (i.e., categorical, ordinal, and continuous) has received little attention, which is the key motivation of this study.
This study has three main purposes. First, it introduces a framework of MOB for educational researchers and offers guidance on when it can be beneficial, with an illustrative example using nationally representative longitudinal data (the High School Longitudinal Study of 2009). A parametric latent growth curve model (LGCM) is used as the template model within MOB. Second, a simulation study for a given LGCM is conducted to investigate the performance of MOB, providing researchers with statistical evidence of how well MOB recovers true subgroups. Simulation conditions include a) effect size (0.2, 0.4, 0.6, 0.8, and 1.0), b) sample size (1,000, 2,000, 5,000, 10,000, and 20,000), c) three different test statistics for ordinal covariates (chi-square, the adapted maximum Lagrange multiplier, and the weighted double maximum), d) a pre-pruning option limiting the minimum sample size per subgroup (250 vs. none), and e) a post-pruning option (BIC vs. none). The main evaluation criteria are a) statistical power to recover the true subgroups, b) overall classification accuracy and precision, c) accuracy of the cut points of ordinal/continuous covariates and the labels of categorical covariates, and d) bias and root mean squared error (RMSE) of the parameter estimates per subgroup. Third, the same simulation is conducted in parallel with growth mixture modeling (GMM), and its results are compared with those of MOB.
The key findings suggest that a medium effect size (0.4 - 0.6) with relatively large sample sizes (5,000, 10,000, and 20,000), or a large effect size (0.8 - 1.0) with an adequate sample size (1,000 or 2,000), is sufficient to distinguish differences in the focal parameters and recover the true number of subgroups. In addition, treating ordinal covariates as either ordinal or categorical makes little difference in recovering the true subgroups. However, the empirical study suggests that using the test statistics for ordinal covariates is desirable when there is an association between the outcome and an ordinal covariate. Combining BIC post-pruning with a minimum sample size per subgroup is also desirable: without BIC post-pruning, MOB tends to over-extract subgroups across conditions. With the same simulated datasets, GMM produced neither accurate subgroups nor reliable parameter estimates. This study sheds light on how to uncover subpopulations using the MOB algorithm with a popular parametric model for longitudinal studies.
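To make the simulation design described above concrete, the following Python sketch generates data from a two-subgroup linear LGCM in which the subgroups are defined by a cut point on an ordinal covariate and differ in their mean latent slope by a chosen effect size. It is a minimal illustration only; all parameter values, the number of waves, and the cut point are assumptions for this sketch, not the population values used in the dissertation.

```python
# Minimal sketch (not the dissertation's code) of a two-subgroup linear LGCM
# data-generating process: an ordinal covariate defines the true split, and
# the subgroups differ in mean latent slope by `effect_size` standard deviations.
import numpy as np

rng = np.random.default_rng(2022)

def simulate_lgcm(n=1000, waves=4, effect_size=0.6, slope_sd=1.0, resid_sd=0.5):
    """Simulate repeated measures; all parameter values are illustrative."""
    # Ordinal covariate with five levels; assumed true split between levels 2 and 3.
    z = rng.integers(1, 6, size=n)
    group = (z >= 3).astype(int)

    # Latent intercepts and slopes: only the slope mean differs between subgroups.
    intercept = rng.normal(10.0, 1.0, size=n)
    slope = rng.normal(1.0 + effect_size * slope_sd * group, slope_sd)

    # Observed repeated measures: y_t = intercept + slope * t + residual.
    t = np.arange(waves)
    y = intercept[:, None] + slope[:, None] * t + rng.normal(0, resid_sd, (n, waves))
    return y, z, group

y, z, group = simulate_lgcm(n=2000, effect_size=0.6)
print(y.shape, group.mean())  # (2000, 4) and the true subgroup proportion
```

In the simulation described above, data like these would be passed to MOB with the LGCM as the template model, and the tree would be expected to recover the split on the ordinal covariate.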
This approach is beneficial for large-scale data, such as samples of more than 10,000 cases with a large number of potential covariates. Limitations and future directions are also discussed. By investigating the performance of MOB with respect to complex covariate effects, the findings lay the groundwork for extending its application to various statistical models.
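Two of the evaluation criteria reported above, classification accuracy and bias/RMSE of a focal parameter, can be illustrated with the short Python sketch below. The function names and example values are assumptions for illustration and are not taken from the dissertation; the sketch also assumes that estimated subgroup labels have already been matched to the true ones.

```python
# Illustrative sketch of two evaluation criteria used in the simulation:
# classification accuracy of recovered subgroup labels, and bias / RMSE of a
# parameter estimate across replications. Not the study's evaluation code.
import numpy as np

def classification_accuracy(true_labels, est_labels):
    """Share of cases assigned to the correct subgroup (labels pre-matched)."""
    return np.mean(np.asarray(true_labels) == np.asarray(est_labels))

def bias_and_rmse(estimates, true_value):
    """bias = mean(estimate) - true_value; RMSE = sqrt(mean((estimate - true_value)^2))."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return bias, rmse

# Made-up replication results for one subgroup's slope mean, for illustration only.
slope_hats = np.array([1.58, 1.62, 1.61, 1.55, 1.64])
print(bias_and_rmse(slope_hats, true_value=1.6))
```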
- In Collections: Electronic Theses & Dissertations
- Copyright Status: Attribution 4.0 International
- Material Type: Theses
- Authors: Lee, Youngjun
- Thesis Advisors: Schmidt, William H.
- Committee Members: Kelly, Kimberly; Raykov, Tenko; Myers, Nicholas
- Date Published: 2022
- Subjects: Educational tests and measurements--Methodology; Statistics; Machine learning--Statistical methods; Structural equation modeling; Pattern recognition systems; Population--Longitudinal studies
- Program of Study: Measurement and Quantitative Methods - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: x, 109 pages
- ISBN: 9798845421142
- Permalink: https://doi.org/doi:10.25335/y2ym-sx15