Algorithms for finding patterns in strings
Handbook of theoretical computer science (vol. A)
PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Efficient string matching: an aid to bibliographic search
Communications of the ACM
Introduction To Automata Theory, Languages, And Computation
The Design and Analysis of Computer Algorithms
Most database systems and data analysis tools work with relational or otherwise well-structured data. When data is collected from various sources for warehousing or analysis, extracting the input data and formatting it into the required form is never as trivial a task as it seems. In this paper we present a pattern-matching-based approach for extracting and standardizing attribute values from input data entries that take the form of character strings. The core component of the approach is a powerful pattern language, which provides a simple way to specify the semantic features, length limitations, external references, element extraction, and restructuring of attributes. Attribute values can then be extracted from input strings by pattern matching. Constraints on attributes can be enforced so that the attribute values are standardized even when the input data comes from different sources and in different formats. The pattern language and matching algorithms are presented. A prototype system based on the proposed approach is also described.
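To make the idea of extracting and standardizing attribute values concrete, here is a minimal sketch. It does not use the paper's pattern language, which is not reproduced in this abstract; ordinary Python regular expressions with named groups stand in for it. The phone-number attribute, the names `PHONE_PATTERN`, `KNOWN_AREA_CODES`, `extract_phone`, and `standardize_phone`, and the output format are all assumptions chosen for illustration.

```python
import re
from typing import Dict, Optional

# Stand-in for a pattern specification (an assumption, not the paper's syntax).
# Fixed repetition counts play the role of length limitations, and the named
# groups play the role of element extraction.
PHONE_PATTERN = re.compile(
    r"^\s*\(?(?P<area>\d{3})\)?[\s.-]*(?P<exchange>\d{3})[\s.-]*(?P<line>\d{4})\s*$"
)

# Hypothetical external reference table used as a constraint on one element.
KNOWN_AREA_CODES = {"604", "778", "250"}


def extract_phone(entry: str) -> Optional[Dict[str, str]]:
    """Return the extracted elements, or None if the entry does not match."""
    match = PHONE_PATTERN.match(entry)
    if match is None:
        return None
    parts = match.groupdict()
    # Enforce the constraint against the external reference.
    if parts["area"] not in KNOWN_AREA_CODES:
        return None
    return parts


def standardize_phone(entry: str) -> Optional[str]:
    """Restructure the extracted elements into one standard format."""
    parts = extract_phone(entry)
    if parts is None:
        return None
    return "{area}-{exchange}-{line}".format(**parts)


if __name__ == "__main__":
    # Entries from differently formatted sources map to the same value.
    for raw in ["(604) 555-0123", "604.555.0123", "6045550123", "n/a"]:
        print(repr(raw), "->", standardize_phone(raw))
```

Entries that do not match the pattern or fail the constraint yield `None` rather than a guessed value, so a caller can route them to manual review.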