Multi-view probabilistic matrix factorization for inferring latent disease topics and patients’ risk

Yue Li - Postdoctoral Associate, Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT

Feb. 13, 2018, 10 a.m. - Feb. 13, 2018, 11 a.m.


Title: Multi-view probabilistic matrix factorization for inferring latent disease topics and patients’ risks by mining large-scale heterogeneous electronic health record data



Electronic health records (EHR) are a rich, heterogeneous collection of patient health information. The broad adoption of EHR systems has given clinicians and researchers unprecedented opportunities for health informatics research, which promises to provide unbiased ways to characterize patients’ disease risks and thereby support actionable clinical recommendations for follow-up in precision medicine. However, there are several challenges in modeling EHR data, including noisy, irregular text in clinical notes, arbitrary bias in the billing codes, lab tests that are not missing at random (NMAR), and heterogeneous data types (e.g., clinical notes, billing codes, lab tests, medications). To address these challenges, I developed a Bayesian integrative generative model in the vein of collaborative filtering and latent topic models. Specifically, I propose a multi-view probabilistic matrix factorization framework. In a nutshell, the proposed method factorizes multiple high-dimensional clinical-feature matrices into lower-rank (basis) matrices and a common (loading) matrix that spans the patient dimension, which I interpret as the probabilistic disease mixture memberships for each patient. To jointly model the distribution of the test records (e.g., lymphocyte cell counts) and the test results (e.g., abnormally high, abnormally low, or normal), I make a conditional independence assumption given the patient/test-specific latent topic variable, thereby bypassing the difficulty of directly modeling the NMAR mechanism. To learn the model parameters, I will present an efficient variational inference algorithm and its online stochastic counterpart. I demonstrate the method’s general utility using real-world EHR data.
Qualitative assessment shows that heterogeneous clinical features that tend to co-occur under the same latent topics carry meaningful semantics of known diseases with similar epidemiology, along with relevant medications and treatment procedures. I then leverage the lower-dimensional patient mixture projections to predict the mortality of ICU patients 1-6 months in advance using their early admission records. The proposed approach gives state-of-the-art performance compared to existing methods and reveals several distinct and meaningful disease topics related to the prognostic outcomes.
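
As a rough illustration of the shared-loading idea in the abstract, the sketch below factorizes several patient-by-feature matrices into one common patient loading matrix and one basis matrix per data type. It is not the speaker's model: it swaps the Bayesian generative model and variational inference for plain non-negative matrix factorization with multiplicative updates, it ignores the NMAR lab-test component, and all names (`multiview_nmf`, `patient_mixtures`, the toy `notes` and `codes` matrices) are hypothetical.

```python
import numpy as np


def multiview_nmf(views, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Jointly factorize each view X_v (patients x features_v) as W @ H_v.

    W (patients x topics) is shared across views; each H_v (topics x features_v)
    is view-specific. Assumption: squared-error NMF with multiplicative updates
    stands in for the probabilistic model described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_patients = views[0].shape[0]
    W = rng.random((n_patients, n_topics)) + eps
    Hs = [rng.random((n_topics, X.shape[1])) + eps for X in views]
    errors = []
    for _ in range(n_iter):
        # Update each view-specific basis matrix with W held fixed.
        for v, X in enumerate(views):
            Hs[v] *= (W.T @ X) / (W.T @ W @ Hs[v] + eps)
        # Update the shared loading matrix using evidence from all views.
        numer = sum(X @ H.T for X, H in zip(views, Hs))
        denom = W @ sum(H @ H.T for H in Hs) + eps
        W *= numer / denom
        # Track the total squared reconstruction error across views.
        errors.append(
            sum(np.linalg.norm(X - W @ H) ** 2 for X, H in zip(views, Hs))
        )
    return W, Hs, errors


def patient_mixtures(W, eps=1e-12):
    """Normalize each patient's loadings to sum to one (topic memberships)."""
    return W / (W.sum(axis=1, keepdims=True) + eps)


# Toy data: 30 patients, two hypothetical views (e.g., notes and billing codes).
rng = np.random.default_rng(1)
notes = rng.random((30, 12))
codes = rng.random((30, 8))

W, Hs, errors = multiview_nmf([notes, codes], n_topics=3)
mix = patient_mixtures(W)
```

Each row of `mix` can be read as one patient's membership across the three topics, and such rows could serve as low-dimensional features for a downstream risk classifier, in the spirit of the mortality prediction described above.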


I obtained my Master's and PhD degrees in Computer Science and Computational Biology at the University of Toronto under the supervision of Prof. Zhaolei Zhang. In my PhD research, I was interested in the functional implications of long and short non-coding RNAs and the post-transcriptional regulation of microRNA. I developed several computational methods to infer microRNA-mediated transcriptional regulatory networks and prognostic RNA expression markers for cancers. Currently, I am a postdoctoral associate in Prof. Manolis Kellis's research group at the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, where I have worked on inferring risk mutations implicated in the tissue-specific context of complex genetic diseases. My current research focuses on developing probabilistic generative models to mine massive electronic health record data.