Heuristic learning of rules for information extraction from web documents

Authors:
Dawei Hu; Huan Li; Tianyong Hao; Enhong Chen; Liu Wenyin
Affiliations:
University of Science & Technology of China, Hefei, China and CityU-USTC Advanced Research Institute, Suzhou, China and City University of Hong Kong, Hong Kong, China; University of Science & Technology of China, Hefei, China and CityU-USTC Advanced Research Institute, Suzhou, China and City University of Hong Kong, Hong Kong, China; City University of Hong Kong, Hong Kong, China; University of Science & Technology of China, Hefei, China and CityU-USTC Advanced Research Institute, Suzhou, China; CityU-USTC Advanced Research Institute, Suzhou, China and City University of Hong Kong, Hong Kong, China
Venue:
Proceedings of the 2nd International Conference on Scalable Information Systems
Year:
2007

Abstract

The efficacy of an information extraction system is largely determined by the quality of its extraction rules. Building these rules by hand is time-consuming and difficult. Hence, we propose a Heuristic Rule Learning (HRL) algorithm that automatically and efficiently acquires high-quality extraction rules from a user-labeled training corpus. Moreover, these extraction rules are maintained at the most suitable generalization level to enhance information extraction efficacy. In HRL, we use a Dynamic tErm eXtraction Technique (DEXT) to construct terms and extraction rules at different generalization levels. A conditional entropy model is used to evaluate the suitability of these generalization levels so that the rules are kept at a high-quality level. Experimental results show the algorithm's efficacy in acquiring extraction rules at different generalization levels and the efficacy of these rules in information extraction tasks.
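
The abstract does not give the exact form of the conditional entropy model, so the following is only a minimal sketch of the general idea, not the HRL algorithm itself. It assumes that each candidate rule, at some generalization level, either matches or does not match each labeled training span, and that a level is preferred when knowing the match outcome leaves the least uncertainty about the label. The names `conditional_entropy`, `select_level`, and the toy level names are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Return H(Y | X) in bits for a sequence of (x, y) observations."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        return 0.0
    x_counts = Counter(x for x, _ in pairs)
    xy_counts = Counter(pairs)
    entropy = 0.0
    for (x, _y), n_xy in xy_counts.items():
        p_xy = n_xy / total                # joint probability p(x, y)
        p_y_given_x = n_xy / x_counts[x]   # conditional probability p(y | x)
        entropy -= p_xy * math.log2(p_y_given_x)
    return entropy


def select_level(matches_by_level, labels):
    """Pick the level whose match outcomes best predict the labels.

    matches_by_level maps a (hypothetical) level name to a list of booleans,
    one per training span, saying whether the rule at that level matched.
    labels holds the user-provided label for each span.
    """
    scores = {
        level: conditional_entropy(zip(matches, labels))
        for level, matches in matches_by_level.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Toy data (assumed, not from the paper): four spans, two are true targets.
labels = [True, True, False, False]
matches_by_level = {
    "too_specific": [True, False, False, False],  # misses one target
    "intermediate": [True, True, False, False],   # matches exactly the targets
    "too_general": [True, True, True, True],      # matches everything
}
best_level, level_scores = select_level(matches_by_level, labels)
```

In this toy setup the over-general rule carries no information about the label, the over-specific rule leaves some uncertainty, and the intermediate rule leaves none, so it is the one selected. A real system would also need to weigh coverage and the size of the training corpus, which this sketch ignores.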