Sains Malaysiana 43(10)(2014): 1599–1607
Imputing Missing Values in Modelling the PM10 Concentrations
(Mengganti Nilai Hilang dalam Pemodelan Kepekatan PM10)
NURADHIATHY ABD RAZAK1, YONG ZULINA ZUBAIRI2* & ROSSITA M. YUNUS3
1Institute of Graduate Studies, University of Malaya, 50603 Kuala Lumpur, Malaysia
2Centre for Foundation Studies in Science, University of Malaya, 50603 Kuala Lumpur, Malaysia
3Institute of Mathematical Sciences, University of Malaya, 50603 Kuala Lumpur, Malaysia
Submitted: 30 July 2013/Accepted: 13 February 2014
ABSTRACT
Missing values are a persistent problem in data analysis. Most analyses simply
exclude them, which may lead to biased parameter estimates. This paper reports
a simulation study comparing three imputation methods: mean substitution, hot
deck and expectation maximization (EM) imputation. EM imputation is found to be
superior, especially when the percentage of missing values is high, as it
consistently gives a lower root mean square error (RMSE) than the other two
methods. The EM imputation method is then applied to PM10 concentration data
sets containing missing values for the southwest and northeast monsoons in
Petaling Jaya and Seberang Perai, Malaysia. Four distributions, namely the
Weibull, lognormal, gamma and Gumbel distributions, are considered to describe
the PM10 concentrations. The Weibull distribution gives the best fit for the
southwest monsoon data for Petaling Jaya, while the lognormal distribution
outperforms the others in describing the southwest monsoon data for Seberang
Perai. For the northeast monsoon at both locations, the gamma distribution
describes the data best.
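The kind of simulation comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the synthetic lognormal series, the missing-completely-at-random deletion, the random variant of hot deck and all function names are assumptions made for illustration. EM imputation is left out because, for a single series under a normal model, it reduces to the mean, and the abstract does not say which auxiliary variables were used.

```python
import math
import random


def rmse(true_vals, imputed_vals):
    """Root mean square error between withheld values and their imputations."""
    n = len(true_vals)
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n)


def mean_substitution(series):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in series]


def hot_deck(series, rng):
    """Random hot deck (assumed variant): fill each gap with a randomly drawn observed value."""
    observed = [x for x in series if x is not None]
    return [rng.choice(observed) if x is None else x for x in series]


def simulate(missing_fraction, n=1000, seed=0):
    """Delete a fraction of a complete series at random, impute, and score by RMSE."""
    rng = random.Random(seed)
    # Hypothetical complete series standing in for PM10 readings (not real data).
    complete = [rng.lognormvariate(3.9, 0.4) for _ in range(n)]
    missing_idx = rng.sample(range(n), round(missing_fraction * n))
    missing = set(missing_idx)
    with_gaps = [None if i in missing else x for i, x in enumerate(complete)]
    truth = [complete[i] for i in missing_idx]
    filled = {
        "mean substitution": mean_substitution(with_gaps),
        "hot deck": hot_deck(with_gaps, rng),
    }
    # Score each method only on the positions that were deleted.
    return {name: rmse(truth, [vals[i] for i in missing_idx])
            for name, vals in filled.items()}


for fraction in (0.05, 0.15, 0.30):
    print(fraction, simulate(fraction))
```

Scoring only the deleted positions keeps the RMSE a measure of imputation error rather than of the untouched observations.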
Keywords: Expectation maximization; mean imputation; missing
value; PM10; Weibull
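As a rough illustration of how the four candidate distributions named in the abstract might be fitted and ranked, the following Python sketch uses only the standard library. The estimators (closed-form maximum likelihood for the lognormal, bisection-based maximum likelihood for the Weibull, method of moments for the gamma and Gumbel), the log-likelihood ranking criterion, the synthetic sample and all function names are assumptions. The abstract does not state which estimation method or goodness-of-fit measure the authors used.

```python
import math
import random


def fit_lognormal(xs):
    """Closed-form maximum likelihood: mean and standard deviation of log(x)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    loglik = sum(-l - math.log(sigma) - 0.5 * math.log(2 * math.pi)
                 - (l - mu) ** 2 / (2 * sigma ** 2) for l in logs)
    return (mu, sigma), loglik


def _moments(xs):
    mean = sum(xs) / len(xs)
    return mean, sum((x - mean) ** 2 for x in xs) / len(xs)


def fit_gamma(xs):
    """Method-of-moments fit (shape k, scale theta); an assumed shortcut, not MLE."""
    mean, var = _moments(xs)
    k, theta = mean ** 2 / var, var / mean
    loglik = sum((k - 1) * math.log(x) - x / theta - k * math.log(theta)
                 - math.lgamma(k) for x in xs)
    return (k, theta), loglik


def fit_gumbel(xs):
    """Method-of-moments fit (location, scale) for the Gumbel (maximum) distribution."""
    mean, var = _moments(xs)
    scale = math.sqrt(6 * var) / math.pi
    loc = mean - 0.5772156649 * scale  # Euler-Mascheroni constant
    loglik = sum(-math.log(scale) - (x - loc) / scale
                 - math.exp(-(x - loc) / scale) for x in xs)
    return (loc, scale), loglik


def fit_weibull(xs, lo=0.05, hi=50.0, iters=60):
    """Maximum likelihood (shape k, scale lam); the shape equation is solved by bisection."""
    logs = [math.log(x) for x in xs]
    mean_log, xmax = sum(logs) / len(xs), max(xs)

    def score(k):
        # Weights are scaled by xmax so that the powers cannot overflow.
        w = [(x / xmax) ** k for x in xs]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    lam = xmax * (sum((x / xmax) ** k for x in xs) / len(xs)) ** (1.0 / k)
    loglik = sum(math.log(k) - k * math.log(lam) + (k - 1) * l - (x / lam) ** k
                 for x, l in zip(xs, logs))
    return (k, lam), loglik


def rank_fits(xs):
    """Return (name, parameters, log-likelihood) tuples, highest log-likelihood first."""
    fitters = {"Weibull": fit_weibull, "lognormal": fit_lognormal,
               "gamma": fit_gamma, "Gumbel": fit_gumbel}
    fits = [(name, *fit(xs)) for name, fit in fitters.items()]
    return sorted(fits, key=lambda item: item[2], reverse=True)


rng = random.Random(1)
# Hypothetical sample standing in for imputed PM10 concentrations (not real data).
sample = [rng.lognormvariate(3.9, 0.4) for _ in range(2000)]
for name, params, loglik in rank_fits(sample):
    print(name, params, loglik)
```

Because the gamma and Gumbel fits here use moments rather than maximum likelihood, their log-likelihoods are slightly pessimistic, so the ranking should be read as indicative only.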
*Corresponding author; email: yzulina@um.edu.my