Sains Malaysiana 50(3)(2021): 753-768
http://doi.org/10.17576/jsm-2021-5003-17
Predicting 30-Day Mortality after an Acute Coronary Syndrome (ACS) using Machine Learning Methods for Feature Selection, Classification and Visualisation
(Meramalkan Kematian 30 Hari selepas Sindrom Koronari Akut (ACS) menggunakan Kaedah Pembelajaran Mesin untuk Pemilihan Ciri, Pengelasan dan Pemvisualan)
NANYONGA AZIIDA1, SORAYYA MALEK1*, FIRDAUS AZIZ1, KHAIRUL SHAFIQ IBRAHIM2 & SAZZLI KASIM2
1Bioinformatics Division, Institute of Biological Sciences, University of Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia
2Department of Cardiology, Faculty of Medicine, Universiti Teknologi MARA (UiTM), Sungai Buloh Campus, Jalan Hospital, 47000 Sungai Buloh, Selangor Darul Ehsan, Malaysia
Received: 23 December 2019/Accepted: 26 August 2020
ABSTRACT
Hybrid combinations of
feature selection, classification and visualisation using machine learning (ML)
methods have the potential for enhanced understanding and 30-day mortality
prediction of patients with cardiovascular disease using population-specific
data. Identifying a feature selection method with a classifier algorithm that
produces high performance in mortality studies is essential and has not been
reported before. Feature selection methods such as Boruta, Random Forest (RF),
Elastic Net (EN), Recursive Feature Elimination (RFE), Learning Vector
Quantization (LVQ), Genetic Algorithm (GA), Cluster Dendrogram (CD), Support
Vector Machine (SVM) and Logistic Regression (LR) were combined with RF, SVM,
LR, and EN classifiers for 30-day mortality prediction. ML models were
constructed using 302 patients and 54 input variables from the Malaysian
National Cardiovascular Disease Database. Validation of the best ML model was
performed against Thrombolysis in Myocardial Infarction (TIMI) using an
additional dataset of 102 patients. The Self-Organising Feature Map (SOM) was
used to visualise mortality-related factors post-ACS. The performance of the ML
models, measured by the area under the curve (AUC), ranged from 0.48 to 0.80. The
best-performing model (AUC = 0.80) was a hybrid combination of the RF variable
importance method, sequential backward selection and the RF classifier
using five predictors (age, triglyceride, creatinine, troponin, and total
cholesterol). On the additional validation dataset, the best ML model
outperformed the TIMI score (AUC = 0.75 vs. AUC = 0.60). The
findings of this study will provide a basis for developing an online ML-based
population-specific risk scoring calculator.
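The AUC figures quoted above can be illustrated with a minimal sketch. It assumes that AUC refers to the area under the receiver operating characteristic curve, which the abstract does not spell out, and computes it through the equivalent Mann-Whitney pairwise comparison. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney pairwise statistic.

    The AUC equals the probability that a randomly chosen positive case
    (label 1, e.g. death within 30 days) receives a higher score than a
    randomly chosen negative case (label 0). Tied scores count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (NOT from the study): 1 = died within 30 days, 0 = survived.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")  # prints "AUC = 0.889"
```

Under this reading, an AUC of 0.5 corresponds to chance-level ranking, so the reported range of 0.48 to 0.80 spans models from roughly uninformative to reasonably discriminative.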
Keywords: Acute coronary
syndrome; feature selection; hybrid model; machine learning; self-organising
maps
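The best-performing model in the abstract pairs a variable ranking with sequential backward selection. The sketch below shows the generic greedy form of backward selection only: start from all candidate features, repeatedly drop the one whose removal gives the best score, and keep the best subset seen. The abstract does not describe how the RF importance ranking and the backward search interact, so this is an assumption-laden illustration. `sequential_backward_selection`, `toy_score` and `TOY_GAIN` are hypothetical names, and the toy scoring function stands in for what would, in practice, be a validated AUC from a trained classifier.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sequential_backward_selection(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Greedy backward selection: remove one feature per round.

    `evaluate` returns a score (higher is better) for a feature subset.
    Each round drops the feature whose removal yields the highest score.
    The best subset seen is returned; ties favour the smaller subset.
    """
    current = list(features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [
            (evaluate([f for f in current if f != drop]), drop)
            for drop in current
        ]
        score, dropped = max(candidates, key=lambda c: c[0])
        current = [f for f in current if f != dropped]
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical per-feature contributions; NOT values from the study.
TOY_GAIN: Dict[str, float] = {
    "age": 0.10, "creatinine": 0.08, "troponin": 0.06,
    "noise_a": 0.0, "noise_b": 0.0,
}


def toy_score(subset: List[str]) -> float:
    # Stand-in for a validated AUC: informative features raise the score,
    # and every feature carries a small complexity penalty.
    return 0.5 + sum(TOY_GAIN[f] for f in subset) - 0.01 * len(subset)


selected, selected_score = sequential_backward_selection(list(TOY_GAIN), toy_score)
print(selected)  # prints ['age', 'creatinine', 'troponin']
```

In this toy setting the two uninformative features are discarded first, after which removing any remaining feature lowers the score, so the search reports the three-feature subset.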
ABSTRACT (translated from Malay)
Hybrid combinations of feature selection, classification and visualisation
using machine learning (ML) methods have the potential to improve understanding
and 30-day mortality prediction for patients with cardiovascular disease using
population-specific data. Identifying a feature selection method paired with a
classifier algorithm that yields high performance in mortality studies is
important and has not been reported before. Feature selection methods such as
Boruta, Random Forest (RF), Elastic Net (EN), Recursive Feature Elimination
(RFE), Learning Vector Quantization (LVQ), Genetic Algorithm (GA), Cluster
Dendrogram (CD), Support Vector Machine (SVM) and Logistic Regression (LR) were
combined with the RF, SVM, LR and EN classification algorithms for 30-day
mortality prediction. The ML models were built using 302 patients and 54 input
variables from the Malaysian National Cardiovascular Disease Database. The best
ML model was validated against Thrombolysis in Myocardial Infarction (TIMI)
using an additional dataset of 102 patients. The self-organising map (SOM) was
used to visualise factors associated with mortality after ACS. Model
performance, measured using the area under the curve (AUC), ranged from 0.48 to
0.80. The best model (AUC = 0.80) was a hybrid combination of the RF variable
importance method, sequential backward selection and the RF classifier using
five predictors (age, triglyceride, creatinine, troponin and total
cholesterol). The best model was compared with TIMI using the additional
dataset, and the ML model outperformed TIMI (AUC = 0.75 vs AUC = 0.60). The
findings of this study will serve as a basis for developing an online ML-based
population-specific risk scoring calculator.
Keywords: Hybrid model; machine learning; feature selection; self-organising
maps; acute coronary syndrome
*Corresponding author; email: sorayya@um.edu.my