Malaysian Journal of Analytical Sciences Vol 21 No 3 (2017): 552 - 559
DOI: https://doi.org/10.17576/mjas-2017-2103-05
MARKOV CHAIN MONTE CARLO METHOD FOR HANDLING MISSING DATA IN AIR QUALITY DATASETS
(Kaedah Rantai Markov Monte Carlo Untuk Mengurus Data Hilang Di Dalam Data Kualiti Udara)
Norhazlina Suhaimi1, Nurul Adyani Ghazali1*, Muhammad Yazid Nasir1, Muhammad Izwan Zariq Mokhtar1, Nor Azam Ramli2
1School of Ocean Engineering, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia
2School of Civil Engineering, Universiti Sains Malaysia, 14300 Nibong Tebal, Pulau Pinang, Malaysia
*Corresponding author: nurul.adyani@umt.edu.my
Received: 27 October 2016; Accepted: 18 April 2017
Abstract
Missing data are a common problem in raw data, especially in air quality datasets. Data are incomplete because of machine or instrument failures, changes in the siting of air monitoring stations, calibration, routine maintenance and human error in handling the dataset. Multiple imputation using the Markov Chain Monte Carlo (MCMC) method was applied to the missing values in selected air quality data. The expectation-maximization (EM) algorithm was used to compute the maximum likelihood estimate (MLE), assuming a multivariate normal distribution for the data. The air quality monitoring stations selected for this paper were Kemaman, Terengganu and Petaling Jaya, Selangor. The parameters selected were carbon oxide, ground-level ozone, sulphur dioxide, nitrogen oxide, nitric oxide and nitrogen dioxide. The annual hourly data total 52,704 observations (8784 hours x 6 parameters). The results show that the coefficient of determination for all annual hourly monitoring data is consistently high (R² = 0.49 – 0.91) and the error is small (0.0001 – 0.3). Therefore, multiple imputation using the MCMC method provides a well-fitting and unbiased imputation of the missing values in these data.
Keywords: missing data, air quality, multiple imputation, Markov Chain Monte Carlo
Abstrak (English translation)
Missing data are a common problem in raw data, especially air quality data. Data are incomplete because of instrument or machine failures, changes in the location of air monitoring stations, calibration, routine maintenance and staff errors in handling the data. A multiple imputation technique was used to fill in the missing values in selected air quality data using Markov Chain Monte Carlo. The expectation-maximization (EM) algorithm was used to obtain the maximum likelihood estimate, assuming a multivariate normal distribution. In this study, the air quality stations selected were Kemaman, Terengganu and Petaling Jaya, Selangor. The variables selected were carbon dioxide, ground-level ozone, sulphur dioxide, nitrogen oxide, nitric oxide and nitrogen dioxide. The annual hourly data total 52,704 observations (8784 x 6 variables). The results show that the coefficient of determination for all annual hourly data is high and the error is small. Therefore, the multiple imputation technique using the MCMC method provides very good imputation and results free of doubt for the missing values.
Kata kunci (keywords): missing data, air quality, multiple imputation, Markov Chain Monte Carlo
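The abstract describes two steps: EM to obtain maximum likelihood estimates under a multivariate normal model, then imputation of the missing values. The sketch below illustrates both ideas on a deliberately simplified case and is not the authors' implementation. It assumes only two variables, with `x` fully observed and `y` partly missing (marked `None`); the function names `em_bivariate_normal` and `impute_once` and the simulated data are hypothetical. The imputation step draws missing values with the parameters held fixed at the EM estimate, whereas a full MCMC data augmentation scheme would also redraw the parameters from their posterior between imputation steps.

```python
import random


def em_bivariate_normal(x, y, n_iter=500):
    """EM estimate of the mean and covariance of (x, y) when some y are None.

    x is fully observed; y[i] is None where the value is missing.
    Returns (mu_x, mu_y, s_xx, s_xy, s_yy) as maximum-likelihood
    (divide-by-n) moments under a bivariate normal model.
    """
    n = len(x)
    obs = [yi for yi in y if yi is not None]
    # x is complete, so its ML mean and variance never change.
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values for the y-related parameters from the observed y only.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx               # slope of y on x
        resid_var = s_yy - beta * s_xy   # variance of y given x
        # E-step: expected y and y^2 for each row under current parameters.
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = mu_y + beta * (xi - mu_x)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(yi)
                ey2.append(yi * yi)
        # M-step: update the moments from the expected sufficient statistics.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy


def impute_once(x, y, params, rng):
    """Fill each missing y with one draw from its conditional normal given x."""
    mu_x, mu_y, s_xx, s_xy, s_yy = params
    beta = s_xy / s_xx
    sd = (s_yy - beta * s_xy) ** 0.5
    return [
        yi if yi is not None else rng.gauss(mu_y + beta * (xi - mu_x), sd)
        for xi, yi in zip(x, y)
    ]


# Simulated data (hypothetical, not the monitoring data from the paper).
rng = random.Random(0)
x = [rng.gauss(10.0, 2.0) for _ in range(200)]
y_full = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
y = [None if rng.random() < 0.3 else yi for yi in y_full]

params = em_bivariate_normal(x, y)
# Five completed copies of y, in the spirit of multiple imputation.
completed = [impute_once(x, y, params, rng) for _ in range(5)]
```

Each list in `completed` keeps the observed values unchanged and differs only in the imputed entries, so an analysis can be repeated on every copy and the results pooled.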