Toxic Comment Classification on Social Media Using Support Vector Machine and Chi Square Feature Selection
DOI:
https://doi.org/10.21108/ijoict.v7i1.552
Keywords:
text classification, toxic comment, social media, support vector machine
Abstract
The use of social media continues to grow, and its ease of access and familiarity make it easier for irresponsible users to act unethically, for example by spreading hatred, defamation, radicalism, or pornography. Although regulations govern activity on social media, they are still not working effectively. In this study, we classify toxic comments containing such unethical content using a Support Vector Machine (SVM) with TF-IDF for feature extraction and Chi Square for feature selection. In our experiments, the best performance was obtained by the SVM model with a linear kernel, without Chi Square, and with stemming and stopword removal, yielding an F1-score of 76.57%.
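As a rough illustration of the Chi Square feature selection step named in the abstract, the sketch below scores each term with the standard 2x2 chi-square statistic against a binary label and keeps the highest-scoring terms. It assumes binary toxic/non-toxic labels, whitespace tokenization, and a made-up English toy corpus; the function names `chi_square_scores` and `select_top_k` are hypothetical and not taken from the paper, and the TF-IDF weighting and SVM training used in the study are not shown.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term with the 2x2 chi-square statistic against a binary label.

    docs   -- list of token lists
    labels -- list of 0/1 labels (1 = toxic in this toy setup, an assumption)
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Document frequency of each term inside each class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        target = df_pos if label == 1 else df_neg
        target.update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # toxic docs containing the term
        b = df_neg[term]   # non-toxic docs containing the term
        c = n_pos - a      # toxic docs without the term
        d = n_neg - b      # non-toxic docs without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means the term cannot discriminate at all.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Made-up toy corpus, for illustration only.
docs = [
    "you are stupid".split(),
    "stupid idea from stupid people".split(),
    "have a nice day".split(),
    "what an idea".split(),
]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
top_terms = select_top_k(scores, 1)
```

In this toy corpus, "stupid" appears only in the toxic documents and receives the highest score, while "idea" appears equally in both classes and scores zero. In a full pipeline the selected terms would restrict the TF-IDF feature space before training the classifier; the abstract reports that the best result was obtained without this step.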
License
Manuscripts submitted to IJoICT must be the original work of the author(s), contain no plagiarism, and must not have been published or be under consideration for publication in other journals. Author(s) shall agree to assign all copyright of the published article to IJoICT. Requests for future re-use or re-publication of major or substantial parts of the article must be discussed with the editors of IJoICT.