Malaysian Journal of Analytical Sciences Vol 21 No 3 (2017): 552 - 559

DOI: https://doi.org/10.17576/mjas-2017-2103-05

 

 

 

MARKOV CHAIN MONTE CARLO METHOD FOR HANDLING MISSING DATA IN AIR QUALITY DATASETS

 

(Kaedah Rantai Markov Monte Carlo Untuk Mengurus Data Hilang Di Dalam Data Kualiti Udara)

 

Norhazlina Suhaimi1, Nurul Adyani Ghazali1*, Muhammad Yazid Nasir1, Muhammad Izwan Zariq Mokhtar1, Nor Azam Ramli2

 

1School of Ocean Engineering,

Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia

2School of Civil Engineering,

Universiti Sains Malaysia, 14300 Nibong Tebal, Pulau Pinang, Malaysia

 

*Corresponding author: nurul.adyani@umt.edu.my

 

 

Received: 27 October 2016; Accepted: 18 April 2017

 

 

Abstract

Missing data are a common problem in raw datasets, especially air quality datasets. Data become incomplete because of machine or instrument failures, changes in the siting of air monitoring stations, calibration, routine maintenance and human error in handling the dataset. Multiple imputation using the Markov Chain Monte Carlo (MCMC) method was applied to fill the missing values in selected air quality data. The expectation-maximization (EM) algorithm was used to compute the maximum likelihood estimate (MLE), assuming a multivariate normal distribution for the data. The air quality monitoring stations selected for this paper were Kemaman, Terengganu and Petaling Jaya, Selangor. The parameters selected were carbon oxide, ground-level ozone, sulphur dioxide, nitrogen oxide, nitric oxide and nitrogen dioxide. The annual hourly data comprise 52,704 observations (8,784 hourly records x 6 parameters). The results show that the coefficient of determination for the annual hourly monitoring data ranges from R2 = 0.49 to 0.91, with small errors (0.0001 – 0.3). Therefore, multiple imputation using the MCMC method provides a well-fitting and unbiased imputation of the missing values in these data.

 

Keywords:  missing data, air quality, multiple imputation, Markov Chain Monte Carlo
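
The abstract names two steps: an EM estimate of the mean and covariance under a normal model, followed by imputation draws for the missing values. The sketch below illustrates that idea on a deliberately reduced case and is not the authors' implementation. It assumes only two variables, with x fully observed and y partly missing, so the conditional distribution of y given x can be written without matrix algebra. The function names em_bivariate_normal and impute_draws are hypothetical. The draws hold the parameters fixed at their EM estimates; a full MCMC data augmentation scheme would also redraw the parameters from their posterior between imputation steps.

```python
import math
import random


def em_bivariate_normal(x, y, n_iter=200, tol=1e-10):
    """EM estimate of the mean and covariance of (x, y); y may contain None.

    Assumption: x is fully observed and (x, y) is bivariate normal.
    Returns ((mu_x, mu_y), (s_xx, s_xy, s_yy)).
    """
    n = len(x)
    obs = [v for v in y if v is not None]
    # x is complete, so its mean and variance are fixed at their MLEs.
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Start from the observed-data moments of y and zero covariance.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((v - mu_y) ** 2 for v in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx              # slope of y on x
        resid_var = s_yy - beta * s_xy  # Var(y | x)
        # E-step: expected y and y^2 for every row.
        ey, eyy = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = mu_y + beta * (xi - mu_x)
                ey.append(m)
                eyy.append(m * m + resid_var)
            else:
                ey.append(yi)
                eyy.append(yi * yi)
        # M-step: update the moments from the expected sufficient statistics.
        new_mu_y = sum(ey) / n
        new_s_yy = sum(eyy) / n - new_mu_y ** 2
        new_s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * new_mu_y
        change = (abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy)
                  + abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return (mu_x, mu_y), (s_xx, s_xy, s_yy)


def impute_draws(x, y, params, m=5, seed=0):
    """Return m completed copies of y, drawing each missing value from y | x.

    Simplification: parameters stay fixed at the EM estimates in every copy.
    """
    rng = random.Random(seed)
    (mu_x, mu_y), (s_xx, s_xy, s_yy) = params
    beta = s_xy / s_xx
    sd = math.sqrt(max(s_yy - beta * s_xy, 0.0))
    completed = []
    for _ in range(m):
        completed.append([
            yi if yi is not None else rng.gauss(mu_y + beta * (xi - mu_x), sd)
            for xi, yi in zip(x, y)
        ])
    return completed


if __name__ == "__main__":
    # Synthetic data for illustration only; not the air quality data.
    rng = random.Random(42)
    x = [rng.gauss(0.0, 1.0) for _ in range(500)]
    y = [2.0 + 0.8 * xi + rng.gauss(0.0, 0.5) for xi in x]
    y_missing = [None if rng.random() < 0.2 else yi for yi in y]
    params = em_bivariate_normal(x, y_missing)
    datasets = impute_draws(x, y_missing, params, m=5)
    print("EM mean estimate:", params[0])
    print("completed datasets:", len(datasets))
```

Each completed dataset can be analysed separately and the results pooled, which is what distinguishes multiple imputation from filling each gap with a single value.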

 

Abstract (translated from the Malay)

Missing data are a common problem in raw data, especially air quality data. Incomplete data are caused by instrument or machine failure, changes in the location of air stations, calibration, routine maintenance and staff errors in handling the data. The multiple imputation technique was used to fill in missing values in selected air quality data using Markov Chain Monte Carlo. The expectation-maximization (EM) algorithm was used to estimate the maximum likelihood, assuming a multivariate normal distribution. In this study, the air quality stations selected were Kemaman, Terengganu and Petaling Jaya, Selangor. The parameters selected were carbon dioxide, ground-level ozone, sulphur dioxide, nitrogen oxide, nitric oxide and nitrogen dioxide. The annual hourly data total 52,704 observations (8784 x 6 parameters). The results show that the coefficient of determination for all annual hourly data is high and the error is small. Therefore, the multiple imputation technique using the MCMC method provides a very good imputation and results free of doubt for the missing values.

Keywords (translated):  missing data, air quality, multiple imputation, Markov Chain Monte Carlo

 


 



