Detection of multivariate outliers in business survey data with incomplete information

  • Authors:
  • Valentin Todorov;Matthias Templ;Peter Filzmoser

  • Affiliations:
  • United Nations Industrial Development Organization (UNIDO), Vienna International Centre, Vienna, Austria 1400;Department of Methodology, Statistics Austria, Vienna University of Technology, Vienna, Austria and Department of Statistics and Probability Theory, Vienna University of Technology, Vienna, Austri ...;Department of Statistics and Probability Theory, Vienna University of Technology, Vienna, Austria 1040

  • Venue:
  • Advances in Data Analysis and Classification
  • Year:
  • 2011

Abstract

Many different methods for statistical data editing can be found in the literature, but only a few of them are based on robust estimates (for example BACON-EEM, epidemic algorithms (EA) and transformed rank correlation (TRC) methods of Béguin and Hulliger). However, we can show that outlier detection is only reasonable if robust methods are applied, because the classical estimates are themselves influenced by the outliers. Nevertheless, data editing is essential to check the multivariate data for possible data problems, and it is not deterministic like traditional micro editing, where all records are extensively edited manually using certain rules or constraints. Missing values are the rule rather than the exception in business surveys and pose additional severe challenges for outlier detection. First, we review the available multivariate outlier detection methods that can cope with incomplete data. We then compare several approaches in a simulation study in which a subset of the Austrian Structural Business Statistics is simulated. Robust methods based on the Minimum Covariance Determinant (MCD) estimator, S-estimators and the OGK estimator, as well as BACON-BEM, provide the best results in finding the outliers while keeping the false discovery rate low. Many of the discussed methods are implemented in the R package rrcovNA, which is available from the Comprehensive R Archive Network (CRAN) at http://www.CRAN.R-project.org under the GNU General Public License.
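
The claim that classical estimates are themselves influenced by the outliers (the masking effect) can be illustrated with a minimal sketch. It is not the authors' method and not the rrcovNA interface. It assumes complete two-dimensional data, uses a hypothetical toy data set with three planted outliers, and swaps in a deliberately simple robust alternative (coordinate-wise median and MAD, ignoring correlations) in place of the MCD, S or OGK estimators named in the abstract. The function names and the 97.5% cutoff are choices made for this illustration only.

```python
import math
import statistics

# Hypothetical toy data: 8 regular points around the origin plus 3 planted outliers.
points = [(1, 1), (1, -1), (-1, 1), (-1, -1), (2, 0), (-2, 0), (0, 2), (0, -2),
          (10, 10), (11, 10), (10, 11)]
planted = [8, 9, 10]  # indices of the planted outliers

# Square root of the 97.5% chi-square quantile with 2 degrees of freedom.
# For 2 degrees of freedom the quantile has the closed form -2 * ln(1 - p).
CUTOFF = math.sqrt(-2.0 * math.log(1.0 - 0.975))


def classical_distances(rows):
    """Mahalanobis distances based on the sample mean and sample covariance."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(math.sqrt(d2))
    return dists


def robust_distances(rows):
    """Distances based on coordinate-wise median and MAD (correlation ignored).

    This is a simplified stand-in for a robust estimator, not MCD, S or OGK.
    """
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    med_x, med_y = statistics.median(xs), statistics.median(ys)
    # 1.4826 makes the MAD consistent with the standard deviation under normality.
    mad_x = 1.4826 * statistics.median([abs(x - med_x) for x in xs])
    mad_y = 1.4826 * statistics.median([abs(y - med_y) for y in ys])
    return [math.hypot((x - med_x) / mad_x, (y - med_y) / mad_y) for x, y in rows]


def flag(dists):
    """Indices of observations whose distance exceeds the cutoff."""
    return [i for i, d in enumerate(dists) if d > CUTOFF]


classical_flags = flag(classical_distances(points))
robust_flags = flag(robust_distances(points))
print("classical:", classical_flags)
print("robust:   ", robust_flags)
```

On this toy data the classical rule flags no observation, because the three planted points pull the mean towards themselves and inflate the covariance, while the median/MAD rule flags exactly the three planted points. The sketch does not handle missing values, which is the additional difficulty the paper addresses.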