E-mail Classification Using the Back-Propagation Method
NOR AZMAN MAT ARIFF & NAZLIA OMAR
ABSTRAK
E-mel merupakan antara perkhidmatan komunikasi yang paling popular dewasa ini. Penggunaan e-mel tidak melibatkan kos yang tinggi serta pantas di dalam menyampaikan maklumat. Namun begitu, lambakan e-mel spam banyak menimbulkan masalah kepada pengguna, organisasi dan penyedia
servis Internet. E-mel spam menyebabkan produktiviti kerja menurun dan kerugian dari segi penggunaan jalur lebar dan storan. Justeru itu, satu kajian telah dilakukan bagi menapis e-mel spam menggunakan rangkaian neural perambat balik. Data bagi kajian diperolehi dari e-mel peribadi
penulis yang dikumpul selama 6 bulan. Perkataan yang wujud pada kandungan e-mel digunakan bagi melatih rangkaian neural. Perkataan terlebih dahulu diekstrak dari e-mel dan melalui pra proses data. Pra proses data melibatkan pembuangan kata henti, cantasan, penjanaan matriks perkataan e-mel dan
umpukan pemberat terhadap perkataan. Perlaksanaan cantasan menggunakan algoritma Porter bagi perkataan bahasa Inggeris dan algoritma Fatimah bagi perkataan bahasa Malaysia. Umpukan pemberat bagi perkataan menggunakan TF-IDF dan teknik khi kuasa dua digunakan bagi memilih perkataan yang akan melatih rangkaian neural. Pemberat TF-IDF perkataan akan ditukar ke nilai 0 hingga 1 menggunakan pernormalan minimum-maksimum sebelum menjadi input kepada rangkaian neural. Kriteria pemilihan model terbaik adalah berdasarkan kepada ketepatan ramalan set latihan tertinggi bagi rangkaian neural. Hasil eksperimen dibandingkan dengan kajian lepas mendapati gabungan pemberat TF-IDF dan khi kuasa dua memberikan keputusan ramalan yang memuaskan.
Katakunci: Pengelasan e-mel spam, pra pemprosesan data, rangkaian neural
ABSTRACT
E-mail is one of the most popular communication services today. It is inexpensive and fast for disseminating information. However, the flood of spam e-mail causes problems for users, organizations and Internet service providers. Spam e-mail lowers work productivity and wastes bandwidth and storage capacity. Hence, a study was conducted to filter spam e-mail using a back-propagation neural network. The data for the study were obtained from the author's personal e-mail, collected over a period of about 6 months. The words in the e-mail content are used to train the neural network. The words are first extracted from the e-mails and passed through data pre-processing, which involves stopword removal, stemming, generation of the e-mail word matrix and assignment of weights to the words. Stemming uses the Porter algorithm for English words and the Fatimah algorithm for Bahasa Malaysia words. Words are weighted using TF-IDF, and the chi-square technique is used to select the words that train the neural network. The TF-IDF weights are scaled to values between 0 and 1 using min-max normalization before being input to the neural network. The best model is selected on the basis of the highest prediction accuracy on the training set. The experimental results, compared with previous studies, show that the combination of TF-IDF weighting and chi-square selection yields satisfactory prediction results.
Keywords: Spam e-mail filtering, data preprocessing, neural network
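The weighting, selection and normalization steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the TF-IDF variant (raw term count times ln(N/df)), the 2x2 chi-square statistic, the per-word min-max scaling and all function names (`tf_idf`, `chi_square`, `min_max`, `build_inputs`) are assumptions made for illustration. Stopword removal and stemming are assumed to have been applied already, and the back-propagation network itself is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are token lists).

    Assumed variant: raw term count multiplied by ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def chi_square(term, docs, labels):
    """Chi-square score of a term from its 2x2 table (label 1 = spam, 0 = non-spam)."""
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
    e = sum(1 for d, y in zip(docs, labels) if term not in d and y == 0)
    n = a + b + c + e
    denom = (a + b) * (c + e) * (a + c) * (b + e)
    return 0.0 if denom == 0 else n * (a * e - b * c) ** 2 / denom


def min_max(values):
    """Scale a list of numbers to the range [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_inputs(docs, labels, k):
    """Keep the k highest-scoring words and return (words, normalized input rows)."""
    vocab = sorted({t for doc in docs for t in doc})
    # Python's sort is stable, so ties keep alphabetical order.
    selected = sorted(
        vocab, key=lambda t: chi_square(t, docs, labels), reverse=True
    )[:k]
    weights = tf_idf(docs)
    # Normalize each selected word's TF-IDF column independently (an assumption).
    columns = [min_max([w.get(t, 0.0) for w in weights]) for t in selected]
    rows = [list(row) for row in zip(*columns)]
    return selected, rows


if __name__ == "__main__":
    # Toy, made-up messages; not data from the study.
    docs = [["free", "offer", "free"], ["meeting", "report"],
            ["free", "prize"], ["report", "meeting", "agenda"]]
    labels = [1, 0, 1, 0]
    words, rows = build_inputs(docs, labels, k=2)
    print(words)
    for row in rows:
        print(row)
```

Each row returned by `build_inputs` has one value in [0, 1] per selected word and could serve as the input vector for one e-mail; the number of selected words `k` would then fix the size of the network's input layer.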