Study on the Effect of Preprocessing Methods for Spam Email Detection
DOI:
https://doi.org/10.21108/INDOJC.2019.4.1.284
Abstract
Email is increasingly used as a communication technology, and with that growth spam has become a serious nuisance for email users. Its negative impact makes effective spam detection techniques indispensable. A spam detection algorithm, or spam classifier, works effectively when it is supported by proper preprocessing steps (noise removal, stop word removal, stemming, lemmatization, term frequency). This research studies the effect of these preprocessing steps on the performance of supervised spam classifiers. Experiments were conducted on two widely used supervised classifiers: Naïve Bayes and Support Vector Machine. The evaluation is performed on the Ling-Spam corpus, with accuracy as the evaluation metric. The experimental results show that different preprocessing steps affect different classifiers differently.
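The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the `toy_stem` suffix stripper (a stand-in for a real stemmer), the four training messages, and all function names are assumptions made for illustration, and lemmatization and the Support Vector Machine are omitted. The sketch feeds term-frequency counts into a small multinomial Naive Bayes classifier with add-one smoothing, and each preprocessing step can be switched on or off.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "of", "the", "to", "you", "your"}

def remove_noise(text):
    """Lowercase the text and replace every run of non-letters with a space."""
    return re.sub(r"[^a-z]+", " ", text.lower())

def toy_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, drop_stop_words=True, stem=True):
    """Return term frequencies after the selected preprocessing steps."""
    tokens = remove_noise(text).split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return Counter(tokens)

def train_naive_bayes(docs, labels, **steps):
    """Fit a multinomial Naive Bayes model on preprocessed term counts."""
    class_counts = Counter(labels)
    term_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        term_counts[label].update(preprocess(doc, **steps))
    vocab = set().union(*term_counts.values())
    return class_counts, term_counts, vocab, steps

def predict(model, text):
    """Return the class with the highest log-posterior score."""
    class_counts, term_counts, vocab, steps = model
    total_docs = sum(class_counts.values())
    tf = preprocess(text, **steps)
    scores = {}
    for c in class_counts:
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(term_counts[c].values()) + len(vocab)
        score = math.log(class_counts[c] / total_docs)
        for term, freq in tf.items():
            if term in vocab:  # terms unseen in training are ignored
                score += freq * math.log((term_counts[c][term] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy messages; the paper itself evaluates on the Ling-Spam corpus.
train_docs = [
    "Win free money now!!!",
    "Free prizes waiting, claim your winnings",
    "Meeting agenda for the linguistics workshop",
    "Call for papers: syntax and semantics conference",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "claim free money"))
print(predict(model, "linguistics conference papers"))
```

Passing `stem=False` or `drop_stop_words=False` to `train_naive_bayes` retrains the model with that step disabled, which is one simple way to compare preprocessing configurations on a held-out set.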
License
- A manuscript submitted to IndoJC must be an original work of the author(s), must contain no plagiarism, and must not have been published or be under consideration for publication in other journals.
- Copyright on any article is retained by the author(s). Regarding copyright transfers please see below.
- Authors grant IndoJC a license to publish the article and identify itself as the original publisher.
- Authors grant IndoJC commercial rights to produce hardcopy volumes of the journal for sale to libraries and individuals.
- Authors grant any third party the right to use the article freely as long as its original authors and citation details are identified.
- The article and any associated published material are distributed under the Creative Commons Attribution 4.0 License.