Contributions to machine learning in biomedical informatics

"With innovations in digital data acquisition devices and increased memory capacity, virtually all commercial and scientific domains have been witnessing an exponential growth in the amount of data they can collect. For instance, healthcare is experiencing a tremendous growth in digital patient information due to the high adaptation rate of electronic health record systems in hospitals. The abundance of data offers many opportunities to develop robust and versatile systems, as long as the underlying salient information in data can be captured. On the other hand, today's data, often named big data, is challenging to analyze due to its large scale and high complexity. For this reason, efficient data-driven techniques are necessary to extract and utilize the valuable information in the data. The field of machine learning essentially develops such techniques to learn effective models directly from the data. Machine learning models have been successfully employed to solve complicated real world problems. However, the big data concept has numerous properties that pose additional challenges in algorithm development. Namely, high dimensionality, class membership imbalance, non-linearity, distributed data, heterogeneity, and temporal nature are some of the big data characteristics that machine learning must address. Biomedical informatics is an interdisciplinary domain where machine learning techniques are used to analyze electronic health records (EHRs). EHR comprises digital patient data with various modalities and depicts an instance of big data. For this reason, analysis of digital patient data is quite challenging although it provides a rich source for clinical research. While the scale of EHR data used in clinical research might not be huge compared to the other domains, such as social media, it is still not feasible for physicians to analyze and interpret longitudinal and heterogeneous data of thousands of patients. Therefore, computational approaches and graphical tools to assist physicians in summarizing the underlying clinical patterns of the EHRs are necessary. The field of biomedical informatics employs machine learning and data mining approaches to provide the essential computational techniques to analyze and interpret complex healthcare data to assist physicians in patient diagnosis and treatment. In this thesis, we propose and develop machine learning algorithms, motivated by prevalent biomedical informatics tasks, to analyze the EHRs. Specifically, we make the following contributions: (i) A convex sparse principal component analysis approach along with variance reduced stochastic proximal gradient descent is proposed for the patient phenotyping task, which is defined as finding clinical representations for patient groups sharing the same set of diseases. (ii) An asynchronous distributed multi-task learning method is introduced to learn predictive models for distributed EHRs. (iii) A modified long-short term memory (LSTM) architecture is designed for the patient subtyping task, where the goal is to cluster patients based on similar progression pathways. The proposed LSTM architecture, T-LSTM, performs a subspace decomposition on the cell memory such that the short term effect in the previous memory is discounted based on the length of the time gap. (iv) An alternative approach to T-LSTM model is proposed with a decoupled memory to capture the short and long term changes. The proposed model, decoupled memory gated recurrent network (DM-GRN), is designed to learn two types of memories focusing on different components of the time series data. In this study, in addition to the healthcare applications, behavior of the proposed model is investigated for traffic speed prediction problem to illustrate its generalization ability. In summary, the aforementioned machine learning approaches have been developed to address complex characteristics of electronic health records in routine biomedical informatics tasks such as computational patient phenotyping and patient subtyping. Proposed models are also applicable to different domains with similar data characteristics as EHRs."--Pages ii-iii.

Read