A Comparative Analysis of P-Value and Mutual Information Feature Selection Methods for Random Forest-Based Phishing Detection

Authors

  • Fahmi Bahtiar Adi Nugroho, Universitas Dian Nuswantoro
  • Wildanil Ghozi, Universitas Dian Nuswantoro
  • Fauzi Adi Rafrastara, Universitas Dian Nuswantoro

DOI:

https://doi.org/10.25077/TEKNOSI.v11i3.2025.377-386

Keywords:

Phishing, Feature Selection, P-Value, Mutual Information, Random Forest

Abstract

Applying P-Value-based feature selection (the ANOVA F-test) to phishing detection with the Random Forest algorithm shows that a 25-feature configuration yields the fastest inference time, making it suitable for scenarios that demand high computational efficiency and responsiveness. If the primary goal is the highest detection accuracy, however, the 29-feature configuration is preferable because it delivers higher accuracy and more stable predictions. Consequently, neither 25 nor 29 features is definitively the better choice; the two configurations are alternatives that can be matched to the application's requirements. This approach lets users balance detection performance against inference time in a phishing detection system, depending on the implementation context and operational priorities. The study shows that a simple statistical approach such as P-Value is not only competitive but also provides superior results compared to more complex methods, offering a practical and efficient solution for real-world implementation.
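The two scoring rules compared in the study can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy data, the feature names (`has_ip`, `url_len`, `anchor`, `noise`), and the helper functions are all hypothetical, and the features are assumed to be discrete-valued. The sketch ranks features by the one-way ANOVA F statistic and by mutual information with the class label, then keeps the top k. When every feature is scored on the same samples, the degrees of freedom are identical, so ranking by descending F gives the same order as ranking by ascending P-Value.

```python
import math
from collections import Counter


def anova_f(feature, labels):
    """One-way ANOVA F statistic of one feature across the class groups."""
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(x)
    n, k = len(feature), len(groups)
    grand_mean = sum(feature) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ss_within = sum((x - means[y]) ** 2 for y, g in groups.items() for x in g)
    if ss_within == 0:
        # Perfect separation (or a constant feature): avoid dividing by zero.
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def mutual_information(feature, labels):
    """Mutual information (in nats) between a discrete feature and the labels."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px, py = Counter(feature), Counter(labels)
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def top_k(columns, labels, score, k):
    """Names of the k highest-scoring features under the given scoring function."""
    scores = {name: score(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 1 = phishing, -1 = legitimate.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
columns = {
    "has_ip":  [1, 1, 1, 1, -1, -1, -1, -1],   # separates the classes perfectly
    "url_len": [1, 1, 1, -1, -1, -1, -1, 1],   # partly informative
    "anchor":  [0, 0, 0, 0, -1, 1, -1, 1],     # informative, but class means are equal
    "noise":   [1, -1, 1, -1, 1, -1, 1, -1],   # unrelated to the label
}

f_selected = top_k(columns, labels, anova_f, k=2)
mi_selected = top_k(columns, labels, mutual_information, k=2)
print("F-test / P-Value top 2:", f_selected)       # ['has_ip', 'url_len']
print("Mutual information top 2:", mi_selected)    # ['has_ip', 'anchor']
```

In this toy setting the F-test rewards features whose class means differ, while mutual information rewards any statistical dependence on the label. Which rule works better on a real phishing dataset, and at which k, is an empirical question of the kind the study addresses by training a Random Forest on each selected subset.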

Submitted

2025-11-15

Accepted

2025-12-18

Published

2026-01-14

How to Cite

[1]
F. B. Adi Nugroho, W. Ghozi, and F. Adi Rafrastara, “A Comparative Analysis of P-Value and Mutual Information Feature Selection Methods for Random Forest-Based Phishing Detection”, TEKNOSI, vol. 11, no. 3, pp. 377–386, Jan. 2026.
