Predicting Student Academic Performance: A Machine Learning Approach and Feature Analysis

Document Type: Research Paper

Authors

1 Department of Information Technology Management, University of Tehran, Tehran, Iran

2 Department of Industrial Management, Islamic Azad University, Tehran, Iran

10.22059/ijms.2025.362506.676053

Abstract

Predicting student academic performance is a challenging task with significant implications for educators and education policymakers. Using machine learning techniques, this article explores the relationship between academic performance, measured as Grade Point Average (GPA), and features in six categories: demographic factors, personality traits, skills, favorite activities, relationships with others, and out-of-school activities. The data were collected through several surveys conducted at one school in Iran across multiple years and educational levels. Following the CRISP-DM methodology, a predictive model based on the CatBoost Regressor is developed; it achieves an R-squared value of 0.87. Feature-importance analysis shows that positive personality traits such as "Interest in studying," "The quality of homework," "Contentment," and "Self-regulation," together with "Logical thinking and reasoning" skills, are among the most predictive features of students' academic performance. This finding is consistent with, and supported by, well-known psychological theories such as Self-Determination Theory. The contributions of this research are a highly accurate machine learning model for predicting students' GPA and the identification of the most important features that influence it. What distinguishes this study within the field is the breadth of features it incorporates and its data collection across different years and educational stages.
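For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an R-squared value can be computed from actual and predicted GPAs. The function name `r_squared` and all numbers are hypothetical illustrations; they are not the study's data, code, or results, and the abstract does not specify how the authors computed the metric.

```python
# Illustrative sketch only: toy values, not data or results from the study.


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical GPA values and hypothetical model predictions.
actual_gpa = [12.0, 14.0, 16.0, 18.0]
predicted_gpa = [13.0, 14.0, 15.0, 18.0]

score = r_squared(actual_gpa, predicted_gpa)
print(round(score, 2))
```

Under this definition, a value of 1 means the predictions match the observed values exactly, and a value of 0 means the model does no better than always predicting the mean GPA.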


