Enhancing Environmental and Health Statistics through Artificial Intelligence: A Comparative Study of Imputation Techniques
DOI:
https://doi.org/10.71014/sieds.v79i2.377
Abstract
In an increasingly globalized world, addressing health, environmental sustainability and social inequalities is crucial and requires an integrated approach involving national statistical offices. These offices are increasingly called upon to develop statistical frameworks that facilitate informed policy-making. However, incomplete or missing data in questionnaires or registers may compromise the accuracy and reliability of the results.
The main objective of this study is to assess the effectiveness of different imputation methods using machine learning (ML) and artificial intelligence (AI) techniques in dealing with missing data in social surveys. To this end, a comparative analysis of different imputation techniques has been carried out, based on real datasets from the Istat Multi-purpose Household Survey, where missing data are common. Preliminary results suggest that ML/AI-based imputation methods outperform traditional statistical techniques in terms of performance and robustness.
The ultimate aim is to refine imputation techniques in official statistics and thereby improve data quality on critical issues.
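A comparison of this kind is usually run by hiding values that are actually known, imputing them, and scoring each method against the hidden truth. The following is a minimal sketch of that evaluation loop, not the authors' pipeline: the data are synthetic rather than drawn from the Istat survey, mean imputation stands in for a traditional technique, and a k-nearest-neighbour imputer stands in for the ML family. All function names, the 20% missingness rate, and k = 5 are assumptions made for illustration.

```python
import math
import random


def make_data(n=200, seed=0):
    """Build synthetic rows (x1, x2, y) where y depends on x1 and x2 (assumed data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(0, 10)
        x2 = rng.uniform(0, 10)
        y = 2.0 * x1 - 1.0 * x2 + rng.gauss(0, 0.5)
        rows.append((x1, x2, y))
    return rows


def choose_hidden(rows, rate=0.2, seed=1):
    """Pick row indices whose y is treated as missing (completely at random)."""
    rng = random.Random(seed)
    return set(rng.sample(range(len(rows)), int(rate * len(rows))))


def mean_impute(rows, hidden):
    """Baseline: replace every missing y with the mean of the observed y values."""
    observed = [r[2] for i, r in enumerate(rows) if i not in hidden]
    mean = sum(observed) / len(observed)
    return {i: mean for i in hidden}


def knn_impute(rows, hidden, k=5):
    """Replace each missing y with the mean y of the k closest complete rows."""
    donors = [r for i, r in enumerate(rows) if i not in hidden]
    imputed = {}
    for i in hidden:
        x1, x2, _ = rows[i]
        nearest = sorted(
            donors, key=lambda d: (d[0] - x1) ** 2 + (d[1] - x2) ** 2
        )[:k]
        imputed[i] = sum(d[2] for d in nearest) / len(nearest)
    return imputed


def rmse(rows, imputed):
    """Root mean squared error between imputed values and the hidden true values."""
    squared = [(rows[i][2] - value) ** 2 for i, value in imputed.items()]
    return math.sqrt(sum(squared) / len(squared))


rows = make_data()
hidden = choose_hidden(rows)
rmse_mean = rmse(rows, mean_impute(rows, hidden))
rmse_knn = rmse(rows, knn_impute(rows, hidden))
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"k-NN imputation RMSE: {rmse_knn:.3f}")
```

On this synthetic example the neighbour-based imputer has the lower error because the target is strongly related to the covariates. That says nothing about the survey data themselves, where categorical variables and non-random missingness would call for different error measures and masking schemes.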
License
Copyright (c) 2025 Simona Cafieri, Francesco Pugliese, Mauro Sodani

This work is licensed under a Creative Commons Attribution 4.0 International License.