Learning to extract information from semi-structured text using a discriminative context free grammar

Authors:
Paul Viola;Mukund Narasimhan
Affiliations:
Microsoft Research, Redmond, WA;University of Washington, Seattle, WA
Venue:
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2005

Cites (13):
Support-Vector Networks. Machine Learning.
Foundations of statistical natural language processing.
Large Margin Classification Using the Perceptron Algorithm. Machine Learning - The Eleventh Annual Conference on Computational Learning Theory.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning.
Active Hidden Markov Models for Information Extraction. IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis.
Table extraction using conditional random fields. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval.
Proposal for an Interactive Environment for Information Extraction.
Citrine: providing intelligent copy-and-paste. Proceedings of the 17th annual ACM symposium on User interface software and technology.
Shallow parsing with conditional random fields. NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1.
Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10.
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4.
Interactive information extraction with constrained conditional random fields. AAAI'04 Proceedings of the 19th national conference on Artificial intelligence.
Efficiently inducing features of conditional random fields. UAI'03 Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence.

Cited by (18):
Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. The Journal of Machine Learning Research.
Relations, cards, and search templates: user-guided web data integration and layout. Proceedings of the 20th annual ACM symposium on User interface software and technology.
Robust location search from text queries. Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems.
Crosslingual location search. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval.
Relaxation Labeling for Selecting and Exploiting Efficiently Non-local Dependencies in Sequence Labeling. PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases.
Force deployment analysis with generalized grammar. Information Fusion.
Extracting the author of web pages. Proceedings of the 2nd ACM workshop on Information credibility on the web.
An unsupervised method for joint information extraction and feature mining across different Web sites. Data & Knowledge Engineering.
Information Extraction. Foundations and Trends in Databases.
A grammar-based entity representation framework for data cleaning. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data.
Extracting structured information from user queries with semi-supervised conditional random fields. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval.
Learning field compatibilities to extract database records from unstructured text. EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing.
An unsupervised approach for product record normalization across different web sites. AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2.
Identifying Information Sender Configuration of Web Pages. WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01.
Learning with annotation noise. ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1.
Semantic tagging of web search queries. ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2.
Efficient duplicate record detection based on similarity estimation. WAIM'10 Proceedings of the 11th international conference on Web-age information management.
From layout to semantic: a reranking model for mapping web documents to mediated XML representations. Large Scale Semantic Access to Content (Text, Image, Video, and Sound).

Abstract

In recent work, conditional Markov chain models (CMMs), such as the Conditional Random Field [10], have been used to extract information from semi-structured text. Applications range from finding the author and title in research papers to finding the phone number and street address in a web page. The CMM framework combines a priori knowledge encoded as features with a set of labeled training data to learn an efficient extraction process. We show that similar problems can be solved more effectively by learning a discriminative context free grammar from training data. The grammar has several distinct advantages: long-range, even global, constraints can be used to disambiguate entity labels; training data is used more efficiently; and a set of new, more powerful features can be introduced. The grammar-based approach also yields semantic information (encoded in the form of a parse tree) which could be used for IR applications like question answering. The specific problem we consider is that of extracting personal contact, or address, information from unstructured sources such as documents and emails. While linear-chain CMMs perform reasonably well on this task, we show that a statistical parsing approach results in a 50% reduction in error rate. This system also has the advantage of being interactive, similar to the system described in [9]. In cases where there are multiple errors, a single user correction can be propagated to correct multiple errors automatically. Using a discriminatively trained grammar, 93.71% of all tokens are labeled correctly (compared to 88.43% for a CMM) and 72.87% of records have all tokens labeled correctly (compared to 45.29% for the CMM).
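
The accuracies quoted in the abstract can be converted into relative error reductions with a few lines of arithmetic. The sketch below is only a reader-side check on the four reported percentages; the helper name `relative_error_reduction` is hypothetical, and reading the 50% claim as a relative reduction in error rate is an assumption, not something the record states.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative drop in error rate, given two accuracies in percent.

    Assumption: error rate is taken to be 100 minus accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Figures quoted in the abstract: CMM baseline vs. discriminative grammar.
token_reduction = relative_error_reduction(88.43, 93.71)
record_reduction = relative_error_reduction(45.29, 72.87)

print(f"token-level:  {token_reduction:.1%}")
print(f"record-level: {record_reduction:.1%}")
```

Under these assumptions the token-level error falls from 11.57% to 6.29% (about 46% relative) and the record-level error from 54.71% to 27.13% (about 50% relative), so the 50% figure appears to match the record-level result most closely.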