Enhancing Environmental and Health Statistics through Artificial Intelligence: A Comparative Study of Imputation Techniques

Authors

  • Simona Cafieri, Istat
  • Francesco Pugliese, Istat
  • Mauro Sodani, Istat

DOI:

https://doi.org/10.71014/sieds.v79i2.377

Abstract

In an increasingly globalized world, addressing health, environmental sustainability and social inequalities is crucial and requires an integrated approach involving national statistical offices, which are increasingly called upon to develop statistical frameworks that facilitate informed policy-making. However, incomplete or missing data in questionnaires or registers may compromise the accuracy and reliability of results.

The main objective of this study is to assess how effectively imputation methods based on machine learning (ML) and artificial intelligence (AI) handle missing data in social surveys. To this end, a comparative analysis of different imputation techniques was carried out on real datasets from the Istat Multi-purpose Household Survey, where missing data are common. Preliminary results suggest that ML/AI-based imputation methods outperform traditional statistical techniques and are more robust.
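
A comparison of this kind is often set up by hiding values that are actually known, imputing them, and measuring the error on the hidden entries. The sketch below illustrates that idea only; it is not the authors' pipeline. It assumes synthetic data rather than the Istat survey, values missing completely at random, and two stand-in methods chosen for brevity (mean imputation and a k-nearest-neighbour imputer). All function and variable names are hypothetical.

```python
import math
import random


def make_records(n, rng):
    """Synthetic records (assumption): y depends on two observed covariates."""
    records = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = 2.0 * x1 - x2 + rng.gauss(0.0, 0.05)
        records.append((x1, x2, y))
    return records


def mean_impute(donors, target):
    """Baseline: impute the mean of all observed y values, ignoring covariates."""
    return sum(y for _, _, y in donors) / len(donors)


def knn_impute(donors, target, k=5):
    """Impute the mean y of the k donors closest to the target in (x1, x2)."""
    x1, x2 = target
    nearest = sorted(
        donors, key=lambda d: (d[0] - x1) ** 2 + (d[1] - x2) ** 2
    )[:k]
    return sum(y for _, _, y in nearest) / len(nearest)


def rmse(imputer, donors, masked):
    """Root mean squared error of an imputer on the deliberately hidden values."""
    squared = [(imputer(donors, (x1, x2)) - y) ** 2 for x1, x2, y in masked]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)          # fixed seed so the run is repeatable
records = make_records(200, rng)
rng.shuffle(records)

# Hide y for 30% of the records; the rest act as donors.
masked, donors = records[:60], records[60:]

mean_rmse = rmse(mean_impute, donors, masked)
knn_rmse = rmse(knn_impute, donors, masked)

print(f"mean imputation RMSE: {mean_rmse:.3f}")
print(f"k-NN imputation RMSE: {knn_rmse:.3f}")
```

Because the synthetic target is built to depend strongly on the covariates, the neighbour-based imputer has the lower error here by construction. That outcome says nothing about the survey data, where the missingness mechanism and the mix of categorical and numerical variables would need to be handled explicitly.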

The ultimate aim is to strengthen imputation techniques in official statistics and thereby raise the quality of data on critical issues such as health and the environment.

Published

2025-02-28

Section

Articles