COMPARISON OF GLOVE AND FASTTEXT ALGORITHMS ON CNN FOR CLASSIFICATION OF INDONESIAN NEWS CATEGORIES

Tjong, Genesius Hartoko

Abstract


Computers cannot understand natural language as humans do, so text must first be converted into a form they can process. Word embedding refers to methods that represent words as vectors so that computers can perform mathematical operations on them. A previous study classified Indonesian news with a CNN but used only the GloVe word embedding algorithm, while another study found that fastText outperformed GloVe in accuracy when classifying English news with a CNN. Because each language has its own characteristics, grammar, and structure, this research investigates whether fastText also outperforms GloVe on Indonesian news data. Two datasets are used. The first consists of Wikipedia articles used to train the fastText and GloVe models; the resulting word vectors serve as the weights of the Embedding layer in the CNN model. The second is an Indonesian news corpus with 8 categories, used for training, validating, and testing the CNN model. Each algorithm is trained on three different numbers of Wikipedia articles (10,000, 50,000, and 100,000) to compare performance at each corpus size. The results indicate that fastText outperforms GloVe in accuracy (average difference of 2.51%), macro precision (average difference of 4.32%), weighted precision (average difference of 2.86%), and weighted recall (average difference of 2.51%). For macro recall, fastText leads only with 10,000 articles, by 11.95%; with 50,000 and 100,000 articles, GloVe leads by an average of 1.96%.
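The distinction between macro and weighted averages matters when reading these results. The following is a minimal sketch, not code from the study: the function names, category labels, and toy predictions are all hypothetical and chosen only for illustration. It shows that a macro average gives every category equal weight, while a weighted average weights each category by its number of true samples. Under that definition, weighted recall equals accuracy by construction, which is consistent with both metrics showing the same 2.51% average difference.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, support)} for every observed label."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # support: true samples per class
    pred_count = Counter(y_pred)   # predictions made per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in labels:
        # A class that was never predicted (or never occurs) scores 0.0.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall, true_count[label])
    return scores


def macro_and_weighted(y_true, y_pred):
    """Compute macro- and support-weighted precision and recall."""
    scores = per_class_scores(y_true, y_pred)
    n_classes = len(scores)
    total = len(y_true)
    return {
        # Macro: plain mean over classes, so small classes count as much as large ones.
        "macro_precision": sum(p for p, _, _ in scores.values()) / n_classes,
        "macro_recall": sum(r for _, r, _ in scores.values()) / n_classes,
        # Weighted: mean over classes weighted by each class's true-sample count.
        "weighted_precision": sum(p * s for p, _, s in scores.values()) / total,
        "weighted_recall": sum(r * s for _, r, s in scores.values()) / total,
    }


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy data with three made-up categories.
y_true = ["sports"] * 6 + ["politics"] * 3 + ["tech"] * 1
y_pred = (
    ["sports"] * 5 + ["politics"]      # one sports article misclassified
    + ["politics"] * 2 + ["sports"]    # one politics article misclassified
    + ["sports"]                       # the single tech article is missed
)

metrics = macro_and_weighted(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"accuracy: {accuracy(y_true, y_pred):.4f}")
```

In this toy example the one missed "tech" article pulls macro recall well below weighted recall, because the rare category counts as a full third of the macro average. An effect of this kind is one possible reason macro recall can move differently from the other metrics, although the abstract itself does not state the cause.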


Keywords


GloVe; fastText; Indonesian news; CNN






DOI: https://doi.org/10.24167/proxies.v7i1.12465

Copyright (c) 2024 Proxies : Jurnal Informatika


