A Comparison of Representation Learning Methods for Medical Concepts in EHR Databases
Liu Z., Wu X., Yang Y., Clifton DA.
This study evaluates four NLP models (LDA, Word2Vec, GloVe, and BERT) for representing medical concepts in Electronic Health Record (EHR) databases, using the MIMIC-IV and eICU-CRD datasets. EHRs contain detailed, coded information on patient diagnoses, procedures, and medications; these codes hold knowledge essential for tasks such as diagnosis prediction and medication recommendation. NLP techniques, which model the codes as words within a sentence-like structure of patient visits, have shown promise in producing vector representations that capture the implicit relationships among codes. However, prior research lacks a comprehensive comparison of these methods on EHR data. Traditional NLP approaches such as Word2Vec and GloVe emphasize distributional semantics, while newer models like BERT offer contextual embeddings that capture more nuanced language patterns. In the setting of clinical code embedding pre-training, the results show that GloVe outperforms the other models in retaining medical concept semantics and improving downstream prediction tasks, suggesting the need for models that capture both global co-occurrence statistics and nuanced relationships in medical data.
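To illustrate the codes-as-words idea described above, the sketch below trains one of the compared methods, Word2Vec, on per-visit code sequences using gensim. The visit data, code identifiers, and hyperparameters are hypothetical placeholders, not taken from the paper; in the study, code sequences would be extracted from MIMIC-IV or eICU-CRD.

```python
# Minimal sketch: treat each patient visit as a "sentence" of clinical
# codes and learn embeddings with Word2Vec. All visits and codes below
# are hypothetical examples for illustration only.
from gensim.models import Word2Vec

# Each inner list is one visit: the diagnosis/procedure/medication codes
# recorded together, analogous to words co-occurring in a sentence.
visits = [
    ["ICD9_4019", "ICD9_2724", "RX_metformin"],    # hypothetical visit 1
    ["ICD9_25000", "RX_metformin", "RX_insulin"],  # hypothetical visit 2
    ["ICD9_4019", "RX_lisinopril"],                # hypothetical visit 3
]

# Skip-gram Word2Vec: codes that co-occur within visits are pushed toward
# nearby vectors, capturing implicit relationships among medical concepts.
model = Word2Vec(
    sentences=visits,
    vector_size=128,  # embedding dimension (illustrative choice)
    window=10,        # wide window so the whole visit acts as context
    min_count=1,      # keep rare codes in this toy corpus
    sg=1,             # use the skip-gram objective
)

# Codes that share visits with "RX_metformin" should rank as its neighbours.
print(model.wv.most_similar("RX_metformin", topn=3))
```

GloVe, the best-performing method in the study, differs in that it factorizes global code co-occurrence counts rather than sliding a local context window, which is consistent with the paper's conclusion that global co-occurrence structure matters for clinical codes.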