A research on depression recognition based on voice pre-training model_Journal of Biomedical Engineering

Authors：

HUANG Xiangsheng, LIAO Yilong, ZHANG Wenjing, ZHANG Li

1. School of Biomedical Engineering, South-Central Minzu University, Wuhan 430074, P. R. China

Corresponding author：

ZHANG Li, Email: zhangli1996@163.com

Keywords：

Depression recognition; Voice pre-training model; Voice features

DOI：

10.7507/1001-5515.202304008


To address the growing number of patients with depression, this paper proposes an artificial intelligence method that identifies depression from voice signals, with the aim of improving the efficiency of diagnosis and treatment. A pre-trained model, wav2vec 2.0, is fine-tuned to encode and contextualize speech, yielding high-quality voice features. The model is evaluated on a publicly available dataset, the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ). For the binary classification task of depression recognition, it achieves a precision of 93.96%, a recall of 94.87%, an F1 score of 94.41%, and an overall classification accuracy of 96.48%. For the four-class task of evaluating depression severity, precision is above 92.59%, recall above 92.89%, and F1 score above 93.12% for every class, with an overall classification accuracy of 94.80%. These results indicate that the proposed method improves classification accuracy when data are limited and performs well in both depression identification and severity evaluation. In the future, the method could serve as a supportive tool for depression diagnosis.
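The binary-task figures above can be cross-checked with a few lines of arithmetic, since the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch, not the authors' evaluation code; the function name `f1_score` is an illustrative choice, and the only inputs are the precision and recall quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall.

    Inputs and output share the same unit (here, percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Binary-task precision and recall quoted in the abstract, in percent.
precision, recall = 93.96, 94.87
print(f"F1 = {f1_score(precision, recall):.2f}%")  # prints "F1 = 94.41%"
```

The computed value agrees with the reported F1 score of 94.41%, so the three binary-task figures are mutually consistent.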

Citation： HUANG Xiangsheng, LIAO Yilong, ZHANG Wenjing, ZHANG Li. A research on depression recognition based on voice pre-training model. Journal of Biomedical Engineering, 2024, 41(1): 9-16. doi: 10.7507/1001-5515.202304008


