AUGMENTING ABUSIVE WORD IN SOCIAL MEDIA WITH WORD EMBEDDING

Tan Tamarine Myrna Aphrodite

Abstract


The use of abusive language on social media has risen sharply in recent years. Users direct abusive words at individuals and groups alike, and the abuse takes forms such as sexism and attacks on flaws or disabilities. As a result, activity on social media is now often negative enough to do more harm than good. We combine Word2vec with several classification algorithms to detect abusive words in hate speech on social media and to determine which algorithm works best with Word2vec. We first describe the dataset, which is taken from Kaggle.com. For the implementation, the dataset goes through data preprocessing, including a word embedding step, so that the best possible results can be obtained. The final results are presented as confusion matrix tables; the average F1 score is 86% and the accuracy is also 86%. From these results, the most suitable algorithm for this dataset is XGBoost, while the algorithm that works best with Word2vec is KNearestNeighbor.
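As a rough illustration of how accuracy and an average F1 score can be read off a confusion matrix, the sketch below computes both for a two-class case. It assumes the average is a macro average (the unweighted mean of the per-class F1 scores), which the abstract does not specify. The function name and the toy counts are hypothetical and are not the study's data.

```python
def metrics_from_confusion(matrix):
    """Return (accuracy, macro_f1) for a square confusion matrix.

    matrix[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    accuracy = correct / total

    f1_scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, truly another class
        fn = sum(matrix[c]) - tp                       # truly c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    # Macro average: every class counts equally (an assumption, see above).
    macro_f1 = sum(f1_scores) / n
    return accuracy, macro_f1


# Hypothetical counts: rows are true classes (not abusive, abusive),
# columns are predicted classes in the same order.
toy_matrix = [[40, 10],
              [5, 45]]
accuracy, macro_f1 = metrics_from_confusion(toy_matrix)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

When the classes are roughly balanced, accuracy and macro F1 tend to land close together, as the two reported 86% figures do.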


Keywords


abusive word; XGBoost; KNearestNeighbor; word embeddings

References


Jae Yeon Kim, Carlos Ortiz, Sarah Nam, Sarah Santiago, Vivek Datta, “Intersectional Bias in Hate Speech and Abusive Language Datasets,” in ICWSM 2020 Data Challenge Workshop, 2020. Available: https://arxiv.org/ftp/arxiv/papers/2005/2005.05921.pdf

M. O. Ibrohim and I. Budi, “Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter,” in 2019. Available: https://aclanthology.org/W19-3506.pdf

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, Bo Xu, “Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling,” in 2016, [Online]. Available: https://arxiv.org/pdf/1611.06639.pdf

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, Yi Chang, “Abusive Language Detection in Online User Content” in 2016, [Online]. Available: http://www.yichang-cs.com/yahoo/WWW16_Abusivedetection.pdf

Amit Pandey and Achin Jain, “Comparative Analysis of KNN Algorithms using Various Normalization Techniques” published November 2017 in MECS. Available: https://www.mecs-press.org/ijcnis/ijcnis-v9-n11/IJCNIS-V9-N11-4.pdf

Sebastian Köffer, Dennis M. Riehle, Steffen Höhenberger, and Jörg Becker, “Discussing the Value of Automatic Hate Speech Detection in Online Debates” in 2018, [Online]. Available: https://www.wi.uni-muenster.de/research/publications/131445

Gudbjartur Ingi Sigurbergsson and Leon Derczynski, “Offensive Language and Hate speech Detection for Danish,” proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3498–3508 Marseille, 11–16 May 2020, [Online]. Available: https://aclanthology.org/2020.lrec-1.430.pdf

Yiwen Tang and Nicole Dalzell, “Classifying Hate Speech Using a Two-Layer Model,” Statistics and Public Policy, 6:1, 80-86, 2019, DOI: 10.1080/2330443X.2019.1660285, [Online]. Available: https://www.tandfonline.com/doi/epdf/10.1080/2330443X.2019.1660285?needAccess=true&role=button

Emmanuel Ayo, Olusegun Folorunso, Friday Thomas Ibharalu, Idowu Ademola Osinuga, Adebayo Abayomi Alli, “A probabilistic clustering model for hate speech classification in twitter,” in 2021, [Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S0957417421002037

Asogwa D.C, Chukwuneke C.I, Ngene C.C, Anigbogu G.N, “Hate Speech Classification using SVM and Naive Bayes,” IOSR Journal of Mobile Computing & Application (IOSR-JMCA), e-ISSN: 2394-0050, p-ISSN: 2394-0042, Volume 9, Issue 1 (Jan. – Feb. 2022), pp. 27-34, [Online]. Available: https://arxiv.org/abs/2204.07057




DOI: https://doi.org/10.24167/proxies.v8i1.12476

Copyright (c) 2024 Proxies : Jurnal Informatika


