Improving Multi-label Classification Performance on Imbalanced Datasets Through SMOTE Technique and Data Augmentation Using IndoBERT Model

Leno Dwi Cahya(1*), Ardytha Luthfiarta(2), Julius Immanuel Theo Krisna(3), Sri Winarno(4), Adhitya Nugraha(5)
(1) Universitas Dian Nuswantoro
(2) Universitas Dian Nuswantoro
(3) Universitas Dian Nuswantoro
(4) Universitas Dian Nuswantoro
(5) Universitas Dian Nuswantoro
(*) Corresponding Author



Abstract


Sentiment and emotion analysis is a common classification task used to improve the value and comfort a product offers its consumers. However, the collected data is often unevenly distributed across the classes or aspects to be analyzed, a condition known as an imbalanced dataset. Imbalanced datasets are a frequent challenge in machine learning, particularly for text data. This research addresses the imbalance with two oversampling techniques: SMOTE and data augmentation. Because SMOTE operates on numerical features, the text is first converted to a numerical representation using TF-IDF. The classification model is IndoBERT. Both techniques address the imbalance by generating new synthetic data, and the resulting dataset improves the classification model's performance. With augmentation, performance improves by up to 20%, reaching 78% accuracy, 85% precision, 82% recall, and an F1-score of 83%. SMOTE gives the better results of the two techniques, raising accuracy to 82%, with 87% precision, 85% recall, and an F1-score of 86%.
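The SMOTE step on TF-IDF features can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy reviews, the helper names `tfidf_vectors` and `smote`, and the choice of `k` are all hypothetical, and the sketch oversamples a single binary label, which simplifies the multi-label setting of the paper. It uses only the Python standard library so the interpolation itself stays visible.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, dense TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, so a term found in every document keeps a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic vectors by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority samples to `base` (Euclidean), excluding itself.
        neighbours = sorted(
            (v for v in minority if v is not base),
            key=lambda v: math.dist(base, v),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 5 positive reviews (label 1), 3 negative reviews (label 0).
docs = [
    "aplikasi bagus sekali",
    "sangat bagus dan cepat",
    "aplikasi cepat dan bagus",
    "mantap bagus",
    "bagus sekali mantap",
    "aplikasi sering error",
    "error dan lambat",
    "aplikasi lambat sekali",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

vocab, vectors = tfidf_vectors(docs)
minority = [v for v, y in zip(vectors, labels) if y == 0]
n_majority = sum(1 for y in labels if y == 1)

# Generate just enough synthetic minority vectors to balance the two classes.
synthetic = smote(minority, n_new=n_majority - len(minority))
print(len(minority) + len(synthetic), "minority samples after oversampling")
```

Each synthetic vector lies on the line segment between two existing minority vectors, so SMOTE produces new points in TF-IDF feature space rather than new sentences. In practice a library implementation would normally replace the hand-written helpers above.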

Keywords


Imbalanced; SMOTE; Augmentation; Sentiment; NLP








Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

 

Editorial Address:
Department of Information Systems, Faculty of Information Technology
Universitas Andalas
Kampus Limau Manis, Padang 25163, Sumatera Barat

email: teknosi@fti.unand.ac.id
