A large number of web pages contain information about entities in the form of lists of field values. Such implicit semi-structured records are common in textual sources on the web, for example product advertisements, postal addresses, and bibliographic information. Harvesting entity information from these lists is a challenging task because the lists are manually generated, do not follow well-defined templates, and may omit some information. In this paper, we introduce a proximity-based positional model (PPM) to improve the quality of information extraction by text segmentation (IETS). Our model improves on the fixed-positional model proposed in ONDUX, a current state-of-the-art IETS method, which is used to revise the labels of text segments in an input list of field values. Unlike the fixed-positional model of previous work, the key idea of PPM is to define a proximity heuristic for the labels in an input list within a unified language model. The model is estimated from label counts propagated through a proximity-based density function. We propose and study several density functions. Experimental results on different domains show that PPM is effective at revising labels and helps to improve the performance of the current state-of-the-art method.
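To make the idea of propagated label counts concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each observed label at position j contributes a kernel-weighted count to every position i, and that the counts at each position are then normalized into a label distribution. The function names, the Gaussian and triangle kernels, the `sigma` parameter, and the additive smoothing are all illustrative assumptions; the abstract does not specify which density functions were studied.

```python
import math


def gaussian_kernel(i, j, sigma=1.0):
    """Assumed density function: weight decays smoothly with distance |i - j|."""
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))


def triangle_kernel(i, j, sigma=2.0):
    """Assumed alternative: linear decay, zero once the distance exceeds sigma."""
    d = abs(i - j)
    return 1.0 - d / sigma if d <= sigma else 0.0


def propagated_label_model(sequences, labels, length, kernel, smoothing=1e-6):
    """Estimate P(label | position) from kernel-propagated label counts.

    sequences: label sequences, one per already-labeled record (hypothetical input)
    labels:    all possible labels
    length:    number of positions to model
    kernel:    function (i, j) -> weight propagated from position j to position i
    """
    counts = [{label: 0.0 for label in labels} for _ in range(length)]
    for seq in sequences:
        for j, label in enumerate(seq):
            # Spread the single observation at position j over all positions.
            for i in range(length):
                counts[i][label] += kernel(i, j)

    model = []
    for position_counts in counts:
        total = sum(position_counts.values()) + smoothing * len(labels)
        model.append({
            label: (count + smoothing) / total
            for label, count in position_counts.items()
        })
    return model


# Toy data: three bibliographic records, one with title and author swapped.
sequences = [
    ["title", "author", "year"],
    ["title", "author", "year"],
    ["author", "title", "year"],
]
labels = ["title", "author", "year"]
model = propagated_label_model(sequences, labels, 3, gaussian_kernel)
```

In this sketch, a kernel that returns 1 when i equals j and 0 otherwise reduces the estimate to plain per-position counting, while a wider kernel lets a label observed at one position lend support to neighboring positions, which is the intuition behind replacing fixed positions with proximity.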