Comparison of Chi-Square and Information Gain Feature Selection Methods for Improving Interpretability and Optimizing TabNet Model Performance

Authors

  • Annisa Ratna Salsabilla, Universitas Dian Nuswantoro
  • Ramadhan Rakhmat Sani, Universitas Dian Nuswantoro
  • Ika Novita Dewi, Universitas Dian Nuswantoro

DOI:

https://doi.org/10.25077/TEKNOSI.v11i3.2025.253-262

Keywords:

Breast Cancer, TabNet, Chi-Square, Information Gain, Optuna

Abstract

Breast cancer is one of the most significant global health issues. Machine learning approaches offer the potential to accurately analyze clinical data and aid in early diagnosis. However, conventional machine learning models are often limited in their ability to capture complex nonlinear relationships in medical data, which can reduce predictive accuracy. This study employs a deep learning architecture because of its ability to model such relationships. Specifically, the TabNet model was chosen because it is designed for tabular data and offers better interpretability. The public Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which has 30 features and an imbalanced class distribution, was used in this study. Feature selection was applied to handle the high-dimensional data, and SMOTE-ENN was used for class balancing. Two feature selection methods, Chi-Square and Information Gain, were compared to determine the more effective approach. Hyperparameter optimization was performed using Optuna and validated with stratified k-fold cross-validation to ensure optimal performance. The experimental results demonstrate that feature selection and optimization significantly improve performance. The base model with Chi-Square feature selection achieved an accuracy of 64.91%, while the Chi-Square model with Optuna optimization increased accuracy to 98.25%. This is 3.51 percentage points higher than the 94.74% achieved by the optimized model without feature selection. In the final comparison, both methods showed distinct advantages: Chi-Square (75% of features) achieved 100% precision with more efficient computation time, while Information Gain (75% of features) was the only method to achieve 100% recall, which is crucial for minimizing false negatives. These results demonstrate that the optimal method depends on the context: Information Gain is best for maximum diagnostic sensitivity, and Chi-Square is best for balanced performance and efficiency.
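The two filter methods compared in the abstract can be sketched with scikit-learn. This is not the authors' code: it is a minimal illustration that ranks the 30 WDBC features with Chi-Square (`chi2`) and Information Gain (`mutual_info_classif`, the mutual-information analogue) and keeps 75% of them, the fraction the paper reports for its best configurations. Min-max scaling is an assumption added here because `chi2` requires non-negative inputs.

```python
# Sketch (not the paper's implementation): Chi-Square vs. Information Gain
# feature selection on the WDBC dataset, keeping 75% of the 30 features.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features
X = MinMaxScaler().fit_transform(X)          # chi2 needs non-negative values

k = int(0.75 * X.shape[1])                   # keep 75% of features -> 22

chi_sel = SelectKBest(chi2, k=k).fit(X, y)
ig_sel = SelectKBest(mutual_info_classif, k=k).fit(X, y)

chi_idx = set(chi_sel.get_support(indices=True))
ig_idx = set(ig_sel.get_support(indices=True))

print(f"features kept by each method: {k}")
print(f"features selected by both methods: {len(chi_idx & ig_idx)}")
```

The reduced feature matrices (`chi_sel.transform(X)`, `ig_sel.transform(X)`) would then feed the downstream SMOTE-ENN balancing, TabNet training, and Optuna tuning described in the abstract, which are omitted here for brevity.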


Submitted

2025-10-18

Accepted

2025-12-18

Published

2025-12-28

How to Cite

[1]
A. R. Salsabilla, R. R. Sani, and I. N. Dewi, “Comparison of Chi-Square and Information Gain Feature Selection Methods for Improving Interpretability and Optimizing TabNet Model Performance”, TEKNOSI, vol. 11, no. 3, pp. 253–262, Dec. 2025.
