Sains Malaysiana 52(5)(2023): 1595-1606

http://doi.org/10.17576/jsm-2023-5205-20

Identifying Multiple Outliers in Linear Functional Relationship Model Using a Robust Clustering Method

(Determining Multiple Outliers in the Linear Functional Relationship Model Using a More Robust Clustering Method)

ADILAH ABDUL GHAPOR¹,*, YONG ZULINA ZUBAIRI², SAYED MD. AL MAMUN³, SITI FATIMAH HASSAN⁴, ELAYARAJA ARUCHUNAN⁵ & NURKHAIRANY AMYRA MOKHTAR⁶

¹Department of Decision Science, Faculty of Business and Economics, Universiti Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia

²Institute of Advanced Studies, Universiti Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia

³Department of Statistics, University of Rajshahi, Bangladesh

⁴Centre for Foundation Studies in Science, Universiti Malaya, Kuala Lumpur, Malaysia

⁵Institute of Mathematical Sciences, Faculty of Science, Universiti Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia

⁶Mathematical Sciences Studies, College of Computing, Informatics and Media, Universiti Teknologi MARA, 85000 Segamat, Johor Darul Takzim, Malaysia

Received: 12 October 2022/Accepted: 10 May 2023

Abstract

Outliers are observations that lie outside the usual pattern of the other observations. Detecting them is essential because anomalous observations can affect the inferences drawn from an analysis. In this study, we propose an efficient clustering procedure to identify multiple outliers in the linear functional relationship model using the single linkage algorithm with the Euclidean distance as the similarity measure. We also propose a new robust cut-off point, based on the median and the median absolute deviation of the tree heights, to classify potential outliers. Results from a simulation study suggest that the proposed method identifies multiple outliers with very small probabilities of swamping and masking. An application to real data also shows that the proposed clustering method successfully detects the outliers in the linear functional relationship model, suggesting that the method is practical for real-world problems.

Keywords: Clustering; linear; measurement error; multiple outliers
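
The procedure summarised in the abstract can be illustrated with a minimal sketch. It assumes that the observations are clustered directly as (x, y) pairs and that the cut-off takes the form median + k × MAD of the tree heights with k = 3. Neither the clustered variables nor the constant k is stated in the abstract, so both are illustrative assumptions, and the function names `single_linkage_edges` and `flag_outliers` are hypothetical. The sketch relies on the fact that single-linkage merge heights coincide with the edge lengths of a minimum spanning tree.

```python
import math
from statistics import median


def single_linkage_edges(points):
    """Return the n-1 single-linkage merges as (height, i, j).

    Uses Prim's algorithm: the minimum spanning tree edge lengths are
    exactly the merge heights of the single-linkage dendrogram.
    """
    n = len(points)
    # best[j] = (shortest Euclidean distance from j to the tree, tree node)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda m: best[m][0])
        height, i = best.pop(j)
        edges.append((height, i, j))
        for m in best:
            d = math.dist(points[j], points[m])
            if d < best[m][0]:
                best[m] = (d, j)
    return edges


def flag_outliers(points, k=3.0):
    """Flag points outside the largest cluster after cutting the tree.

    k is an assumed tuning constant, not a value taken from the paper.
    """
    edges = single_linkage_edges(points)
    heights = [h for h, _, _ in edges]
    med = median(heights)
    mad = median(abs(h - med) for h in heights)
    cutoff = med + k * mad  # assumed form of the robust cut-off

    # Cut the tree: keep only merges at or below the cut-off (union-find).
    parent = list(range(len(points)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for h, i, j in edges:
        if h <= cutoff:
            parent[find(i)] = find(j)

    roots = [find(a) for a in range(len(points))]
    main = max(set(roots), key=roots.count)  # largest cluster = bulk of data
    flagged = [a for a, r in enumerate(roots) if r != main]
    return flagged, cutoff


# Toy data: 20 points on y = 2x + 1 with uneven spacing, plus two
# planted outliers appended at indices 20 and 21.
gaps = [1.0, 1.2, 0.8] * 6 + [1.0]
xs = [0.0]
for g in gaps:
    xs.append(xs[-1] + g)
data = [(x, 2.0 * x + 1.0) for x in xs] + [(5.0, 60.0), (6.0, 62.0)]

flagged, cutoff = flag_outliers(data)
print(flagged, round(cutoff, 3))
```

Treating everything outside the largest remaining cluster as a potential outlier is one simple reading of "classify the potential outliers"; with noisier data the choice of k governs the trade-off between swamping (good points flagged) and masking (outliers missed).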

Abstract (translated from Malay)

Outliers are observations that lie outside the pattern of the other observations. Identifying outliers is important because unusual observations can influence the inferences made from the analysis. In this study, we propose a more robust clustering method to identify multiple outliers in the linear functional relationship model (LFRM) using the single linkage algorithm with the Euclidean distance as the similarity measure. A robust cut-off value, based on the median and the median absolute deviation of the tree heights, is proposed to classify multiple outliers. Simulation results show that the proposed method successfully detects multiple outliers in a data set and performs well, with low masking and swamping values. Application to real data also shows that the proposed clustering method for the LFRM successfully identifies outliers, and the method is therefore recommended for applications to real-world data.

Keywords: Clustering; measurement error; linear; multiple outliers

*Corresponding author; email: adilahghapor@gmail.com
