Objective: To construct a demand model for electronic medical record (EMR) data quality with respect to the lifecycle of machine learning (ML)-based disease risk prediction, in order to guide the implementation of EMR data quality assessment. Methods: Referring to the lifecycle of an ML-based predictive model, we explored the demands placed on EMR data quality. First, through a literature review, we summarized the key data activities involved in each task of predicting disease risk with ML. Second, we mapped the data activities in each task to their associated requirements. Finally, we clustered those requirements into four dimensions. Results: We constructed a three-layer ring structure to represent the demand model for EMR data quality in ML-based disease risk prediction research. The inner layer shows the seven main tasks in ML-based predictive modeling: data collection, data preprocessing, feature representation, feature selection and extraction, model training, model evaluation and optimization, and model deployment. The middle layer comprises the key data activities in each task, and the outer layer represents four dimensions of data quality requirements: operability, completeness, accuracy, and timeliness. Conclusion: The proposed model can guide real-world EMR data governance, improve EMR quality management, and promote the generation of real-world evidence.
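To make one of the four quality dimensions concrete, the sketch below shows a minimal completeness check over EMR records. The field names, records, and the exact scoring rule are illustrative assumptions, not taken from the study; the paper defines the dimensions at a conceptual level only.

```python
# Minimal sketch of an EMR data-quality check for the "completeness" dimension.
# Field names and example records are hypothetical.
def completeness(records, required_fields):
    """Fraction of required fields that are non-missing across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # vacuously complete
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total

emr = [
    {"age": 67, "sex": "M", "sbp": 148},
    {"age": 72, "sex": None, "sbp": ""},  # two of three fields missing
]
print(completeness(emr, ["age", "sex", "sbp"]))  # 4 of 6 fields filled -> 0.666...
```

Analogous functions could score accuracy (agreement with a reference value), timeliness (record age versus a threshold), and operability (machine-readable coding of free-text fields), yielding one score per dimension for governance reporting.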
Objective: To summarize and explore the application of machine learning models to survival data with non-proportional hazards (NPH), and to provide a methodological reference for analyzing large-scale, high-dimensional survival data. Methods: First, the concept of NPH and the related testing methods were outlined. Then, the advantages and disadvantages of machine learning-based NPH survival analysis methods were summarized from the relevant literature. Finally, a case study was conducted on real-world clinical data, applying two ensemble machine learning models and two deep learning models to survival data with NPH: the risk of death within 30 days among stroke patients in the ICU. Results: Eight commonly used machine learning-based NPH survival analysis methods were identified, including five traditional machine learning models (e.g., random survival forest) and three deep learning models based on artificial neural networks (e.g., DeepHit). In the case study, the random survival forest model performed best (C-index = 0.773, IBS = 0.151), and a permutation importance-based algorithm found that age was the most important feature affecting the risk of death in stroke patients. Conclusion: Survival big data exhibiting NPH are common in the era of precision medicine, and machine learning-based survival analysis can be used when faced with more complex survival data and more demanding analysis needs.
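The C-index reported for the case study (0.773 for the random survival forest) is Harrell's concordance index: among comparable patient pairs, the fraction in which the patient with the higher predicted risk experiences the event earlier. A minimal self-contained sketch of the uncensored-pair computation is below; the data are illustrative, and production work would use a library implementation (e.g., scikit-survival's `concordance_index_censored`) rather than this O(n²) loop.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index. A pair (i, j) is comparable when subject i has an
    observed event strictly before subject j's time; the pair is concordant
    when i also has the higher predicted risk. Risk ties count as 0.5."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else 0.5

# Illustrative data: time to event (days), event indicator (1=death,
# 0=censored), and a model's predicted risk score per patient.
times  = [5, 10, 12, 30]
events = [1, 1, 0, 1]
risks  = [0.9, 0.6, 0.4, 0.2]
print(concordance_index(times, events, risks))  # -> 1.0 (risks perfectly ordered)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the study's 0.773 indicates useful but imperfect discrimination.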