Hybrid LexRank-LDA-MMR for Indonesian Text Summarization

Nasrul Amin Muis; Yoga Pristyanto; Ika Nur Fajri

doi:10.25077/TEKNOSI.v12i1.2026.97-104

Authors

Nasrul Amin Muis, Information Systems Study Program, Faculty of Computer Science, Amikom University Yogyakarta
Yoga Pristyanto, Information Systems Study Program, Faculty of Computer Science, Amikom University Yogyakarta
Ika Nur Fajri, Information Systems Study Program, Faculty of Computer Science, Amikom University Yogyakarta

DOI:

https://doi.org/10.25077/TEKNOSI.v12i1.2026.97-104

Keywords:

Extractive Summarization, LexRank, LDA, MMR, Hybrid Approach, ROUGE

Abstract

The rapid growth of digital text creates a need for automated tools that summarize documents for fast retrieval. This study aims to improve the quality of Indonesian extractive summaries beyond standard LexRank by combining three methods: LexRank, Latent Dirichlet Allocation (LDA), and Maximal Marginal Relevance (MMR). LexRank selects important sentences by their centrality in a sentence similarity graph, LDA ensures that the selected sentences are topically relevant, and MMR balances relevance against diversity, which reduces redundancy in the summary. Summaries were generated for two publicly available Indonesian-language datasets, IndoSum and Liputan6, at 30% and 50% compression levels and evaluated with ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. Across 5,000 articles per dataset, the hybrid LexRank-LDA-MMR approach achieved higher average ROUGE scores than standard LexRank on both datasets and at both compression levels, demonstrating that the approach improves summary quality. The largest improvements were in ROUGE-1 and ROUGE-2, indicating that the combined approach produces more informative and relevant summaries while preserving sentence-level diversity, which makes the summarized information easier to understand.
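
The abstract does not state how the three components are combined, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that LexRank centrality and a per-sentence topic score are mixed linearly and that MMR then picks sentences greedily. The names `summarize`, `alpha`, `lam`, and `topic_scores` are hypothetical; `topic_scores` stands in for values in [0, 1] that an LDA model would supply, and whitespace tokenization replaces real Indonesian preprocessing.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexrank(sim, damping=0.85, iters=50):
    """Centrality by power iteration on the row-normalized similarity graph."""
    n = len(sim)
    rows = []
    for i in range(n):
        total = sum(sim[i])
        # A sentence with no similarity mass links uniformly to all sentences.
        rows.append([s / total for s in sim[i]] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * rows[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


def mmr_select(relevance, sim, k, lam=0.7):
    """Greedy MMR: trade relevance against similarity to already chosen sentences."""
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(sorted(candidates), key=mmr)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore document order


def summarize(sentences, topic_scores, ratio=0.3, alpha=0.5, lam=0.7):
    """Assumed hybrid: relevance = alpha * centrality + (1 - alpha) * topic score."""
    bows = [Counter(s.lower().split()) for s in sentences]
    sim = [[cosine(a, b) for b in bows] for a in bows]
    central = lexrank(sim)
    top = max(central) or 1.0  # rescale centrality to [0, 1]
    relevance = [
        alpha * c / top + (1 - alpha) * t for c, t in zip(central, topic_scores)
    ]
    k = max(1, round(ratio * len(sentences)))
    return [sentences[i] for i in mmr_select(relevance, sim, k, lam)]


# Toy input: the first two sentences are duplicates; topic scores are made up.
sentences = [
    "banjir melanda jakarta hari ini",
    "banjir melanda jakarta hari ini",
    "pemerintah menyiapkan bantuan untuk warga",
    "harga beras naik di pasar",
]
topic_scores = [0.9, 0.9, 0.6, 0.2]
summary = summarize(sentences, topic_scores, ratio=0.5)
print(summary)
```

In this toy run the duplicated sentence is chosen only once, because the MMR penalty lowers the score of its copy below that of the third sentence. The `ratio` argument corresponds to the 30% and 50% compression levels mentioned in the abstract.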

