A clustering approach for determining stratification variables in SBS surveys

Authors

  • Ilaria Bombelli Istat
  • Giorgia Sacco Istat
  • Alessio Guandalini Istat

DOI:

https://doi.org/10.71014/sieds.v79i1.329

Abstract

Under European regulations, many Structural Business Statistics (SBS) surveys must move from the Legal Unit (LU) to the Enterprise (ENT) as the unit of interest. This transition is not trivial, because many National Statistical Institutes (NSIs) still need to provide estimates at the LU level for comparability over time.

Consequently, adapting the standard LU-based sample design to this shift may require investigating an alternative stratification of the sample. To this end, we propose using a clustering algorithm, namely K-prototypes, to obtain groups of ENT and to assess the importance of each variable in the clustering result.
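As an illustration of the clustering step, the following is a minimal sketch of K-prototypes in plain Python. It is not the implementation used in the study. It assumes the usual K-prototypes dissimilarity (squared Euclidean distance on the numeric variables plus a weight `gamma` times the number of categorical mismatches), and the toy records, the variable meanings, the value of `gamma`, and the choice of initial prototypes are all hypothetical.

```python
from collections import Counter


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma):
    """Squared Euclidean distance on the numeric parts plus gamma times
    the number of categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * mismatches


def k_prototypes(numeric, categorical, init_idx, gamma=1.0, max_iter=20):
    """Cluster mixed-type records; init_idx lists the records that are
    used as initial prototypes (a hypothetical, deterministic choice)."""
    protos = [(list(numeric[i]), list(categorical[i])) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its closest prototype.
        new_labels = [
            min(
                range(len(protos)),
                key=lambda k: mixed_dissimilarity(
                    num, cat, protos[k][0], protos[k][1], gamma
                ),
            )
            for num, cat in zip(numeric, categorical)
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: numeric means and categorical modes per cluster.
        for k in range(len(protos)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:
                continue  # keep the previous prototype for an empty cluster
            mean = [
                sum(numeric[i][j] for i in members) / len(members)
                for j in range(len(numeric[0]))
            ]
            mode = [
                Counter(categorical[i][j] for i in members).most_common(1)[0][0]
                for j in range(len(categorical[0]))
            ]
            protos[k] = (mean, mode)
    return labels, protos


# Hypothetical toy data: two standardized size variables per enterprise,
# plus an activity section and a region as categorical variables.
numeric = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
categorical = [
    ["C", "North"], ["C", "North"], ["C", "South"],
    ["G", "South"], ["G", "South"], ["G", "North"],
]
labels, protos = k_prototypes(numeric, categorical, init_idx=[0, 3], gamma=1.0)
```

In this sketch `gamma` controls how much the categorical variables weigh against the numeric ones, so in practice the numeric variables would be standardized before clustering.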

The algorithm is applied to several input datasets obtained by subsetting the ASIA ENT 2021 register, which includes all enterprises carrying out economic activities. The input datasets comprise ENT operating in different sections of the statistical classification of economic activities in the European Community (NACE) and ENT included in the target population of the Community Innovation Survey (CIS) carried out by Istat. The clustering is applied separately to each of these datasets. From the clustering result, we assess the variables' importance and identify the variables that most influence the resulting partition.
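The abstract does not state how variable importance is measured. One common option for partitional clustering is a permutation approach: shuffle one variable across records, reassign every record to its nearest prototype, and record the share of records whose cluster changes. The sketch below assumes that approach. The two variables (`size`, `region`), the prototypes, and the weight `gamma` are hypothetical.

```python
import random


def assign(record, prototypes, gamma):
    """Index of the closest prototype under a mixed dissimilarity:
    squared difference on 'size' plus gamma for a 'region' mismatch."""
    def dist(proto):
        return (record["size"] - proto["size"]) ** 2 + gamma * (
            record["region"] != proto["region"]
        )
    return min(range(len(prototypes)), key=lambda k: dist(prototypes[k]))


def permutation_importance(records, prototypes, variable,
                           gamma=0.5, repeats=20, seed=0):
    """Average share of records that change cluster when `variable`
    is randomly permuted across records."""
    rng = random.Random(seed)
    baseline = [assign(r, prototypes, gamma) for r in records]
    rates = []
    for _ in range(repeats):
        values = [r[variable] for r in records]
        rng.shuffle(values)
        # Copy each record, replacing only the permuted variable.
        permuted = [dict(r, **{variable: v}) for r, v in zip(records, values)]
        changed = sum(
            assign(p, prototypes, gamma) != b for p, b in zip(permuted, baseline)
        )
        rates.append(changed / len(records))
    return sum(rates) / repeats


# Hypothetical toy records and prototypes (not taken from the study).
records = [
    {"size": 0.10, "region": "North"}, {"size": 0.20, "region": "North"},
    {"size": 0.15, "region": "South"}, {"size": 2.00, "region": "South"},
    {"size": 2.20, "region": "South"}, {"size": 1.90, "region": "North"},
]
prototypes = [{"size": 0.15, "region": "North"}, {"size": 2.0, "region": "South"}]

importance = {
    v: permutation_importance(records, prototypes, v) for v in ("size", "region")
}
```

A variable whose permutation leaves the partition almost unchanged contributes little to the clusters, while a variable with a high misclassification rate is a natural candidate for stratification.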

The most influential variables are then used to define the strata of a new stratification of the ENT. The proposed stratification is used to allocate a sample of the same size as the one selected under the current stratification. From this sample, we estimate some of the survey's target variables and their coefficients of variation (CVs), and compare the CVs with those obtained under the current stratification. The comparison shows that the efficiency of the estimates is preserved. In addition, the new stratification reduces the number of strata and therefore also the processing time.
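To make the CV comparison concrete, the sketch below computes the expected CV of the stratified estimator of a total for a given stratification and allocation, assuming simple random sampling without replacement within each stratum. Neyman allocation is used here only as a simple univariate stand-in, since the abstract does not describe the allocation method used in the study. The stratum sizes, means, and standard deviations are hypothetical toy values.

```python
import math


def proportional_allocation(strata, n):
    """Allocate n units proportionally to the stratum sizes N_h."""
    total = sum(s["N"] for s in strata)
    return [n * s["N"] / total for s in strata]


def neyman_allocation(strata, n):
    """Allocate n units proportionally to N_h * S_h (real-valued, not
    rounded, and not capped at N_h)."""
    weights = [s["N"] * s["S"] for s in strata]
    total = sum(weights)
    return [n * w / total for w in weights]


def cv_of_total(strata, alloc):
    """Expected CV of the stratified estimator of a total:
    Var = sum_h N_h^2 * (1 - n_h / N_h) * S_h^2 / n_h, CV = sqrt(Var) / total."""
    variance = sum(
        s["N"] ** 2 * (1 - n_h / s["N"]) * s["S"] ** 2 / n_h
        for s, n_h in zip(strata, alloc)
    )
    total = sum(s["N"] * s["mean"] for s in strata)
    return math.sqrt(variance) / total


# Hypothetical strata: population size N, mean and standard deviation S
# of one target variable within each stratum.
strata = [
    {"N": 1000, "mean": 10.0, "S": 4.0},
    {"N": 200, "mean": 100.0, "S": 60.0},
    {"N": 50, "mean": 1000.0, "S": 200.0},
]
n = 100
cv_proportional = cv_of_total(strata, proportional_allocation(strata, n))
cv_neyman = cv_of_total(strata, neyman_allocation(strata, n))
```

Running the same computation for two candidate stratifications with the same total sample size gives the kind of CV comparison described above.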

Published

2025-02-13

Section

Articles