Sains Malaysiana 43(10)(2014): 1599–1607

 

Imputing Missing Values in Modelling the PM10 Concentrations

(Mengganti Nilai Hilang dalam Pemodelan Kepekatan PM10)

 

 

NURADHIATHY ABD RAZAK¹, YONG ZULINA ZUBAIRI²* & ROSSITA M. YUNUS³

¹Institute of Graduate Studies, University of Malaya, 50603 Kuala Lumpur, Malaysia

²Centre for Foundation Studies in Science, University of Malaya, 50603 Kuala Lumpur, Malaysia

³Institute of Mathematical Sciences, University of Malaya, 50603 Kuala Lumpur, Malaysia

 

Received: 30 July 2013/Accepted: 13 February 2014

 

ABSTRACT

 

Missing values are a persistent problem in data analysis. Most analysts exclude missing values from the analysis, which may lead to biased parameter estimates. This paper considers several imputation methods: a simulation study is conducted to compare three of them, namely mean substitution, hot deck and expectation maximization (EM) imputation. The EM imputation is found to be superior, especially when the percentage of missing values is high, as it consistently gives a lower root mean square error (RMSE) than the other two methods. The EM imputation method is then applied to PM10 concentration data sets with missing values for the southwest and northeast monsoons in Petaling Jaya and Seberang Perai, Malaysia. Four distributions, namely the Weibull, lognormal, gamma and Gumbel distributions, are considered to describe the PM10 concentrations. The Weibull distribution gives the best fit for the southwest monsoon data in Petaling Jaya, while the lognormal distribution outperforms the others in describing the southwest monsoon data in Seberang Perai. For the northeast monsoon at both locations, the gamma distribution best describes the data.
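The kind of simulation comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the lognormal test data, the missing-value percentages, the random-donor variant of hot deck and all function names are assumptions made for illustration only. EM imputation is left out because it needs a fitted statistical model, which the abstract does not specify.

```python
import math
import random
import statistics


def make_missing(values, fraction, rng):
    """Return a copy of `values` with a random `fraction` of entries set to None."""
    n_missing = int(round(fraction * len(values)))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_substitution(data):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in data if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in data]


def hot_deck(data, rng):
    """Replace every missing entry with a randomly drawn observed value.

    Random-donor hot deck is assumed here; other hot deck variants exist.
    """
    observed = [v for v in data if v is not None]
    return [rng.choice(observed) if v is None else v for v in data]


def rmse(true_values, imputed, data_with_gaps):
    """Root mean square error, computed only over the positions that were missing."""
    squared = [
        (t - p) ** 2
        for t, p, d in zip(true_values, imputed, data_with_gaps)
        if d is None
    ]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical "complete" series; the lognormal parameters are arbitrary.
complete = [rng.lognormvariate(4.0, 0.4) for _ in range(500)]

for fraction in (0.05, 0.15, 0.30):  # assumed missing-value percentages
    gaps = make_missing(complete, fraction, rng)
    rmse_mean = rmse(complete, mean_substitution(gaps), gaps)
    rmse_hot = rmse(complete, hot_deck(gaps, rng), gaps)
    print(f"{fraction:.0%} missing: mean RMSE={rmse_mean:.2f}, hot deck RMSE={rmse_hot:.2f}")
```

Because the true values are known in a simulation, the RMSE at the deleted positions gives a direct measure of how well each method recovers them; a lower value indicates a better imputation.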

 

Keywords: Expectation maximization; mean imputation; missing value; PM10; Weibull

 


 


 

 

*Corresponding author; email: yzulina@um.edu.my

 

 
