Hybrid LexRank-LDA-MMR for Indonesian Text Summarization

Nasrul Amin Muis; Yoga Pristyanto; Ika Nur Fajri

doi:10.25077/TEKNOSI.v12i1.2026.97-104

Authors

Nasrul Amin Muis, Information Systems Study Program, Faculty of Computer Science, Amikom University Yogyakarta
Yoga Pristyanto, Information Systems Study Program, Faculty of Computer Science, Amikom University Yogyakarta
Ika Nur Fajri, Information Systems Study Program, Faculty of Computer Science, Amikom University Yogyakarta


Keywords:

Extractive Summarization, LexRank, LDA, Hybrid Approach, ROUGE

Abstract

The rapid growth of digital text creates a clear need for automated tools that summarize documents for fast retrieval. This study combines three extractive techniques, LexRank, Latent Dirichlet Allocation (LDA), and Maximal Marginal Relevance (MMR), to improve the quality of Indonesian text summaries beyond standard LexRank. LexRank selects salient sentences, namely those most central in the sentence-similarity graph; LDA ensures that the selected sentences are topically relevant; and MMR balances relevance against diversity, which reduces redundancy in the summary. The method was evaluated on two publicly available Indonesian-language datasets, IndoSum and Liputan6, at 30% and 50% compression levels and scored with ROUGE (ROUGE-1, ROUGE-2, and ROUGE-L F1). Across 5,000 articles per dataset, the combination of LexRank, LDA, and MMR achieved a higher average ROUGE score than standard LexRank on both datasets and at both compression levels. The improvements are largest for ROUGE-1 and ROUGE-2, indicating that the hybrid approach produces more informative and relevant summaries while preserving sentence-level diversity, which should make the summarized information easier to understand.
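The abstract does not spell out how the three components are wired together, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes that LexRank centrality is computed by power iteration over a cosine-similarity graph of bag-of-words vectors, that a per-sentence topic relevance score in [0, 1] is supplied from outside (for example from a separately trained LDA model), that the two scores are mixed linearly, and that the compression level is read as the fraction of sentences kept. The names `summarize`, `topic_scores`, `alpha`, and `lam`, and the default weights, are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank(sim, damping=0.85, iters=50):
    """Centrality scores from power iteration on a row-normalised graph."""
    n = len(sim)
    rows = []
    for i in range(n):
        weights = [sim[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(weights)
        # A sentence with no neighbours spreads its weight uniformly.
        rows.append([w / total for w in weights] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * rows[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


def summarize(sentences, topic_scores, compression=0.3, alpha=0.5, lam=0.7):
    """Pick sentences by MMR over a mixed centrality/topic relevance score.

    topic_scores: assumed per-sentence topic relevance in [0, 1]
    (hypothetical input, e.g. derived from an LDA model).
    """
    if not sentences:
        return []
    n = len(sentences)
    bows = [Counter(s.lower().split()) for s in sentences]
    sim = [[cosine(bows[i], bows[j]) for j in range(n)] for i in range(n)]
    central = lexrank(sim)
    top = max(central)
    relevance = [
        alpha * c / top + (1 - alpha) * t for c, t in zip(central, topic_scores)
    ]
    # Assumption: "compression" is the fraction of sentences to keep.
    k = min(n, max(1, int(compression * n + 0.5)))
    chosen = []
    while len(chosen) < k:
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: lam * relevance[i]
            - (1 - lam) * max((sim[i][j] for j in chosen), default=0.0),
        )
        chosen.append(best)
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


# Toy input; the topic scores are made-up placeholders, not model output.
doc = [
    "Pemerintah menaikkan harga bahan bakar minggu ini",
    "Harga bahan bakar dinaikkan pemerintah minggu ini",
    "Warga mengeluhkan ongkos transportasi yang ikut naik",
    "Cuaca di Yogyakarta cerah sepanjang hari",
]
print(summarize(doc, [0.9, 0.9, 0.7, 0.1], compression=0.5))
```

In this sketch `lam` trades relevance against redundancy in the MMR step, and `alpha` sets how much weight graph centrality gets relative to topic relevance; both would need tuning on held-out data.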

