Objective: Current medical questionnaire resources are mainly processed and organized at the document level, which hampers user access and reuse at the level of individual questionnaire items. This study proposes a multi-class classification method for items in medical questionnaires in low-resource scenarios, to support fine-grained organization and provision of medical questionnaire resources. Methods: We introduce a BERT-based prompt learning approach for multi-class classification of items in medical questionnaires. First, we curated a small corpus of lung cancer medical assessment items by collecting relevant clinical assessment questionnaires, extracting function and domain classifications, and manually annotating the items with "function-domain" combination labels. We then applied prompt learning by feeding a customized template into BERT. The model predicted and filled the masked positions, and the filled-in text was then mapped to labels, yielding a multi-class classification of item texts in medical questionnaires. Results: The constructed corpus comprised 347 clinical assessment items for lung cancer across nine "function-domain" labels. The proposed method achieved an average accuracy of 93% on our self-constructed dataset, outperforming the runner-up, GAN-BERT, by approximately 6%. Conclusion: The proposed method maintains robust performance while minimizing the cost of building medical questionnaire item corpora, which supports its wider use in research and practice on medical questionnaire classification.
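The template-and-mapping step described in the Methods can be sketched as follows. This is only an illustration under stated assumptions: the template wording, the label words, and the "function-domain" label names are hypothetical (the abstract does not list them), and the masked-position scores are passed in as a plain dictionary rather than computed by BERT.

```python
from typing import Dict

# Hypothetical prompt template; the abstract does not give the actual wording.
TEMPLATE = "{item} This question is about [MASK]."

# Hypothetical verbalizer: label word that may fill [MASK] -> class label.
# The real study uses nine "function-domain" labels that are not listed here.
VERBALIZER: Dict[str, str] = {
    "pain": "assessment-symptom",
    "mood": "assessment-psychology",
    "smoking": "screening-behavior",
}


def build_prompt(item: str) -> str:
    """Wrap a questionnaire item in the template that a masked LM would receive."""
    return TEMPLATE.format(item=item.strip())


def predict_label(mask_scores: Dict[str, float]) -> str:
    """Map masked-position scores to a class label.

    mask_scores stands in for the scores a masked language model assigns to
    candidate words at [MASK]. Only words in the verbalizer are considered;
    label words missing from mask_scores are treated as minus infinity.
    """
    candidates = {w: mask_scores.get(w, float("-inf")) for w in VERBALIZER}
    best_word = max(candidates, key=lambda w: candidates[w])
    return VERBALIZER[best_word]


if __name__ == "__main__":
    print(build_prompt("Have you felt anxious in the past week?"))
    # Scores below are made up for illustration, not model output.
    print(predict_label({"pain": 0.1, "mood": 0.7, "smoking": 0.2, "the": 0.9}))
```

Restricting the choice to verbalizer words means that a high-scoring but irrelevant token (such as "the" in the example) cannot affect the predicted class.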
Objective: To construct a demand model for electronic medical record (EMR) data quality based on the lifecycle of machine learning (ML)-based disease risk prediction, in order to guide the implementation of EMR data quality assessment. Methods: Following the lifecycle of ML-based predictive models, we explored the requirements for EMR data quality. First, we summarized, through a literature review, the key data activities involved in each task of ML-based disease risk prediction. Second, we mapped the data activities in each task to the associated requirements. Finally, we clustered those requirements into four dimensions. Results: We constructed a three-layer ring structure to represent the demand model for EMR data quality in ML-based disease risk prediction research. The inner layer shows the seven main tasks in ML-based predictive modeling: data collection, data preprocessing, feature representation, feature selection and extraction, model training, model evaluation and optimization, and model deployment. The middle layer shows the key data activities in each task, and the outer layer represents four dimensions of data quality requirements: operability, completeness, accuracy, and timeliness. Conclusion: The proposed model can guide real-world EMR data governance, improve EMR data quality management, and promote the generation of real-world evidence.
Objective: To summarize and explore the application of machine learning models to survival data with non-proportional hazards (NPH), and to provide a methodological reference for analyzing large-scale, high-dimensional survival data. Methods: First, the concept of NPH and related testing methods were outlined. Then, the advantages and disadvantages of machine learning-based NPH survival analysis methods were summarized from the relevant literature. Finally, a case study on real-world clinical data applied two ensemble machine learning models and two deep learning models to survival data with NPH, examining the risk of death within 30 days in stroke patients in the ICU. Results: Eight commonly used machine learning-based NPH survival analysis methods were identified, including five traditional machine learning models (e.g., random survival forest) and three deep learning models based on artificial neural networks (e.g., DeepHit). In the case study, the random survival forest model performed best (C-index = 0.773, IBS = 0.151), and the permutation importance-based algorithm identified age as the most important feature affecting the risk of death in stroke patients. Conclusion: Survival data exhibiting NPH are common in the big-data era of precision medicine, and machine learning-based survival analysis can be used when the survival data are more complex and the analytical demands are higher.
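To make the C-index reported in the Results concrete, the sketch below computes Harrell's concordance index for right-censored data in plain Python. Harrell's estimator is an assumption here, since the abstract does not say which variant was used, and the toy data are invented for illustration only.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[bool],
    risks: Sequence[float],
) -> float:
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. The pair is concordant when the subject who failed
    earlier has the higher predicted risk; ties in risk count as 0.5.
    Pairs with tied times are ignored in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data (not from the study): follow-up days, event flags, risk scores.
    times = [2, 4, 6, 8]
    events = [True, True, False, True]
    risks = [0.9, 0.3, 0.5, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a C-index of 0.773 means that in roughly 77% of comparable pairs the patient who died earlier was assigned the higher risk.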