Attribute Value Extraction and Standardization in Data Integration

  • Authors:
  • Hongjun Lu; Zengping Tian

  • Venue:
  • WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
  • Year:
  • 2001

Abstract

Most database systems and data analysis tools work with relational or otherwise well-structured data. When data is collected from various sources for warehousing or analysis, extracting and formatting the input data into the required form is not as trivial a task as it seems. In this paper we present a pattern-matching-based approach for extracting and standardizing attribute values from input data entries in the form of character strings. The core component of the approach is a powerful pattern language that provides a simple way to specify the semantic features, length limitations, external references, element extraction, and restructuring of attributes. Attribute values can then be extracted from input strings by pattern matching. Constraints on attributes can be enforced so that the attribute values are standardized even when the input data comes from different sources and in different formats. The pattern language and matching algorithms are presented. A prototype system based on the proposed approach is also described.
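
The abstract does not show the paper's pattern language, so the following is only a rough illustration of the general idea, using Python regular expressions as a stand-in rather than the authors' notation. It assumes a single hypothetical attribute (a date) arriving in three source formats. The names `DATE_PATTERNS`, `MONTHS`, and `standardize_date` are invented for this sketch, as is the choice of day/month/year order for slash-separated input.

```python
import re

# Hypothetical patterns, one per source format. Named groups mark the elements
# to extract; quantifiers such as {4} play the role of length limitations.
DATE_PATTERNS = [
    re.compile(r"(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})"),
    re.compile(r"(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})"),
    re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Za-z]{3}) (?P<year>\d{4})"),
]

# Hypothetical external reference table that maps month names to numbers.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def standardize_date(entry):
    """Return the entry as YYYY-MM-DD, or None if it cannot be standardized."""
    text = entry.strip()
    for pattern in DATE_PATTERNS:
        match = pattern.fullmatch(text)
        if match is None:
            continue  # try the next source format
        month_raw = match.group("month")
        if month_raw.isdigit():
            month = int(month_raw)
        else:
            month = MONTHS.get(month_raw.lower())  # external reference lookup
        day = int(match.group("day"))
        year = int(match.group("year"))
        # Simple range constraints on the extracted elements.
        if month is None or not 1 <= month <= 12 or not 1 <= day <= 31:
            return None
        # Restructure the elements into one standard output form.
        return f"{year:04d}-{month:02d}-{day:02d}"
    return None


for raw in ["2001-7-3", "3/7/2001", "3 Jul 2001", "31/13/2001"]:
    print(raw, "->", standardize_date(raw))
```

The first three inputs standardize to the same value, and the last is rejected by the month constraint. A real system along the lines the abstract describes would presumably declare such patterns and constraints in its own language instead of hard-coding them.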