Record linkage is an important application area of text pattern analysis. In this paper, we propose a new sequence labeling method for extracting entities from a string for record linkage. The method combines a classifier with a Hidden Markov Model (HMM) so that it can use both the textual and the syntactical information in the string. We first describe the underlying model and then discuss how its parameters are estimated. The classifier handles the textual information and is integrated with the HMM statistically by estimating the classifier's error probability. We applied the method to the bibliographic sequence labeling problem, in which bibliographic components are extracted from reference strings. We compared it with methods that use textual or syntactical information alone and showed that the proposed method outperforms them.
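One possible reading of this integration can be sketched in a few lines: treat the classifier's predicted label for each token as the HMM's observation, and use the classifier's estimated confusion probabilities P(predicted label | true label) as the emission probabilities. The sketch below is only an illustration under that assumption, not the paper's actual model. The label set, the rule-based `toy_classifier`, and every probability in `START`, `TRANS`, and `CONFUSION` are hypothetical values chosen for the example.

```python
import math

# Hypothetical label set for bibliographic components.
STATES = ["author", "title", "venue", "year"]

def toy_classifier(token):
    """Stand-in for a trained token classifier (hypothetical rules)."""
    if token.isdigit() and len(token) == 4:
        return "year"
    if token.endswith(".") and len(token) <= 3:
        return "author"  # initials such as "J."
    if token in {"Proc.", "Journal", "Conf."}:
        return "venue"
    return "title"

# Hypothetical HMM parameters encoding the usual field order
# author -> title -> venue -> year (syntactical information).
START = {"author": 0.97, "title": 0.01, "venue": 0.01, "year": 0.01}
TRANS = {
    "author": {"author": 0.60, "title": 0.38, "venue": 0.01, "year": 0.01},
    "title":  {"author": 0.01, "title": 0.70, "venue": 0.28, "year": 0.01},
    "venue":  {"author": 0.01, "title": 0.01, "venue": 0.60, "year": 0.38},
    "year":   {"author": 0.01, "title": 0.01, "venue": 0.01, "year": 0.97},
}

# Hypothetical classifier error model: CONFUSION[true][predicted].
# In practice this would be estimated on held-out labeled data.
CONFUSION = {
    "author": {"author": 0.70, "title": 0.28, "venue": 0.01, "year": 0.01},
    "title":  {"author": 0.05, "title": 0.85, "venue": 0.05, "year": 0.05},
    "venue":  {"author": 0.05, "title": 0.50, "venue": 0.40, "year": 0.05},
    "year":   {"author": 0.01, "title": 0.04, "venue": 0.01, "year": 0.94},
}

def viterbi(predicted):
    """Most likely true label sequence given the classifier's predictions."""
    # delta[s]: best log-probability of any label path ending in state s.
    delta = {s: math.log(START[s]) + math.log(CONFUSION[s][predicted[0]])
             for s in STATES}
    backpointers = []
    for obs in predicted[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES,
                            key=lambda p: delta[p] + math.log(TRANS[p][s]))
            new_delta[s] = (delta[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(CONFUSION[s][obs]))
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def label_reference(tokens):
    """Classify each token, then smooth the labels with the HMM."""
    return viterbi([toy_classifier(t) for t in tokens])

if __name__ == "__main__":
    tokens = ["A.", "B.", "Record", "linkage", "Proc.", "KDD", "2003"]
    print(list(zip(tokens, label_reference(tokens))))
```

With these made-up parameters, the toy classifier labels "KDD" as a title token, while the decoded sequence assigns it to the venue, because a title token between a venue token and a year is unlikely under the transition model and the error model says venue tokens are often misclassified as title. This is the kind of correction that combining the two information sources is meant to allow.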