SAINS MALAYSIANA

Sains Malaysiana 44(3)(2015): 449–456

A Comparison of Various Imputation Methods for Missing Values in Air Quality Data

(Perbandingan Pelbagai Kaedah Imputasi bagi Data Lenyap untuk Data Kualiti Udara)

NURYAZMIN AHMAT ZAINURI¹*, ABDUL AZIZ JEMAIN² & NORA MUDA²

¹Fundamental Studies of Engineering Unit, Faculty of Engineering and Built Environment

Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Darul Ehsan, Malaysia

²School of Mathematical Sciences, Faculty of Science and Technology

Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Darul Ehsan, Malaysia

Received: 30 May 2013/Accepted: 21 August 2014

ABSTRACT

This paper compares several imputation methods for missing values in air quality data from Malaysia. The main objectives were to select the best imputation method and to examine whether the performance of the methods differed between monitoring stations in Peninsular Malaysia. Missing data were simulated at random at rates of 5, 10, 15, 20, 25 and 30%. Six methods were used: mean substitution, median substitution, the expectation-maximization (EM) method, singular value decomposition (SVD), the K-nearest neighbour (KNN) method and the sequential K-nearest neighbour (SKNN) method. The performance of the imputations was compared using three performance indicators: the correlation coefficient (R), the index of agreement (d) and the mean absolute error (MAE). Based on the results obtained, EM, KNN and SKNN are the three best methods. The same result was obtained for all eight monitoring stations used in this study.

Keywords: Imputation techniques; missing data; performance indicators
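
The evaluation procedure summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It covers only the two simplest methods (mean and median substitution) and leaves out EM, SVD, KNN and SKNN. The data are synthetic stand-ins for pollutant readings, the function names (`simulate_missing`, `impute_constant`, `indicators`) are hypothetical, the index of agreement is assumed to take the commonly used Willmott form, and the indicators are assumed to be computed over the full series rather than over the imputed points only.

```python
import numpy as np


def simulate_missing(values, fraction, rng):
    """Return a copy of `values` with `fraction` of its entries set to NaN at random."""
    out = np.asarray(values, dtype=float).copy()
    n_missing = int(round(fraction * out.size))
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out[idx] = np.nan
    return out


def impute_constant(series, how="mean"):
    """Fill NaN entries with the mean or median of the observed entries."""
    fill = np.nanmean(series) if how == "mean" else np.nanmedian(series)
    return np.where(np.isnan(series), fill, series)


def indicators(observed, predicted):
    """Correlation coefficient R, index of agreement d (Willmott form, assumed), and MAE."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(observed, predicted)[0, 1]
    obs_mean = observed.mean()
    denom = np.sum((np.abs(predicted - obs_mean) + np.abs(observed - obs_mean)) ** 2)
    d = 1.0 - np.sum((predicted - observed) ** 2) / denom
    mae = np.mean(np.abs(predicted - observed))
    return {"R": r, "d": d, "MAE": mae}


# Synthetic, right-skewed data standing in for pollutant concentrations (assumption).
rng = np.random.default_rng(0)
complete = rng.gamma(shape=2.0, scale=25.0, size=1000)

results = {}
for fraction in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    with_gaps = simulate_missing(complete, fraction, rng)
    for how in ("mean", "median"):
        imputed = impute_constant(with_gaps, how)
        results[(fraction, how)] = indicators(complete, imputed)

for (fraction, how), scores in results.items():
    print(f"{fraction:.0%} missing, {how:>6}: "
          f"R={scores['R']:.3f}  d={scores['d']:.3f}  MAE={scores['MAE']:.2f}")
```

A higher R and d and a lower MAE indicate imputed values closer to the original observations. The same loop could be extended to other imputation methods by swapping in a different imputation function.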

*Corresponding author; email: yazmin@eng.ukm.my
