Implementation of Naïve Bayes and Gini Index for Spam Email Classification
DOI:
https://doi.org/10.34818/INDOJC.2021.6.1.452
Keywords:
Complete Gini-Index Text, Multinomial Naïve Bayes, Email Classification
Abstract
Email remains a widely used medium for exchanging information, but it still suffers from a persistent problem: spam. Spam email is email that can pollute, damage, or disturb the recipient. In this study, we report the performance and accuracy of Multinomial Naïve Bayes (MNNB) and Complete Gini-Index Text (GIT) for spam email filtering. We used 6 cross-validations to test the classifiers we built. We found that, using only 80,000 features chosen by feature selection, the average result can exceed that of Multinomial Naïve Bayes without feature selection by 0.39%. Feature selection also speeds up classification and removes features that are less relevant to the target category.
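The general idea of pairing feature selection with Multinomial Naïve Bayes can be illustrated with a minimal sketch. It is not the implementation used in the study: the scoring function below is a simplified Gini-style purity measure (the sum over classes of P(term | class)^2 * P(class | term)^2, computed from document frequencies) standing in for the Complete Gini-Index Text formula, the six messages are invented toy data, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real pipelines usually add stemming, etc.
    return text.lower().split()


def gini_scores(docs, labels):
    # Simplified Gini-style score per term (an assumption, not the paper's formula):
    # sum over classes of P(term | class)^2 * P(class | term)^2.
    classes = sorted(set(labels))
    docs_per_class = Counter(labels)
    df = {c: Counter() for c in classes}  # document frequency of each term per class
    for tokens, y in zip(docs, labels):
        df[y].update(set(tokens))
    vocab = set().union(*(df[c].keys() for c in classes))
    scores = {}
    for term in vocab:
        total = sum(df[c][term] for c in classes)
        scores[term] = sum(
            (df[c][term] / docs_per_class[c]) ** 2 * (df[c][term] / total) ** 2
            for c in classes
        )
    return scores


def select_features(docs, labels, k):
    # Keep the k highest-scoring terms; ties are broken alphabetically.
    scores = gini_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_mnb(docs, labels, vocab, alpha=1.0):
    # Multinomial Naive Bayes with Laplace (add-alpha) smoothing over the selected vocabulary.
    classes = sorted(set(labels))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        counts[y].update(t for t in tokens if t in vocab)
    log_lik = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_lik


def predict(model, tokens):
    # Pick the class with the highest log posterior; unselected terms are ignored.
    log_prior, log_lik = model

    def score(c):
        return log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])

    return max(log_prior, key=score)


# Invented toy corpus, for illustration only.
raw = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap money offer free", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project report attached for review", "ham"),
    ("lunch on monday after the meeting", "ham"),
]
docs = [tokenize(text) for text, _ in raw]
labels = [label for _, label in raw]

vocab = select_features(docs, labels, k=5)
model = train_mnb(docs, labels, vocab)
print(predict(model, tokenize("claim your free money")))
print(predict(model, tokenize("monday project meeting")))
```

Because the classifier only stores and looks up the selected terms, a smaller vocabulary means less work per message, which is consistent with the speed benefit the abstract attributes to feature selection.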
License
- A manuscript submitted to IndoJC must be an original work of the author(s), contain no element of plagiarism, and must not have been published or be under consideration for publication in other journals.
- Copyright on any article is retained by the author(s). Regarding copyright transfers please see below.
- Authors grant IndoJC a license to publish the article and identify itself as the original publisher.
- Authors grant IndoJC commercial rights to produce hardcopy volumes of the journal for sale to libraries and individuals.
- Authors grant any third party the right to use the article freely as long as its original authors and citation details are identified.
- The article and any associated published material are distributed under the Creative Commons Attribution 4.0 License