Unsupervised learning of field segmentation models for information extraction

Authors:
Trond Grenager;Dan Klein;Christopher D. Manning
Affiliations:
Stanford University, Stanford, CA;U.C. Berkeley, Berkeley, CA;Stanford University, Stanford, CA
Venue:
ACL '05 Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics
Year:
2005

Abstract

The applicability of many current information extraction techniques is severely limited by the need for supervised training data. We demonstrate that for certain field-structured extraction tasks, such as classified advertisements and bibliographic citations, small amounts of prior knowledge can be used to learn effective models in a primarily unsupervised fashion. Although hidden Markov models (HMMs) provide a suitable generative model for field-structured text, general unsupervised HMM learning fails to learn useful structure in either of our domains. However, one can dramatically improve the quality of the learned structure by exploiting simple prior knowledge of the desired solutions. In both domains, we found that unsupervised methods can attain accuracies with 400 unlabeled examples comparable to those attained by supervised methods on 50 labeled examples, and that semi-supervised methods can make good use of small amounts of labeled data.
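
To make the idea of an HMM over field-structured text concrete, here is a minimal sketch of Viterbi decoding that assigns one field label to each token of a citation-like string. It is not the authors' model or training procedure: the two field names, the hand-set probabilities, the `UNSEEN` floor, and the "sticky" self-transition value are all assumptions chosen for illustration. The sticky transitions stand in for one plausible kind of simple prior knowledge, namely that fields tend to span several consecutive tokens.

```python
import math

# Hypothetical toy setup: two fields and hand-picked probabilities (assumptions).
STATES = ["AUTHOR", "TITLE"]
STAY = 0.9            # assumed bias toward remaining in the same field
UNSEEN = 1e-4         # assumed floor probability for words a field never emitted

log_start = {s: math.log(1.0 / len(STATES)) for s in STATES}
log_trans = {
    p: {s: math.log(STAY if p == s else 1.0 - STAY) for s in STATES}
    for p in STATES
}
emit = {
    "AUTHOR": {"grenager": 0.3, "klein": 0.3, "manning": 0.3, "learning": 0.1},
    "TITLE": {"unsupervised": 0.4, "learning": 0.3, "segmentation": 0.3},
}
log_emit = {s: {w: math.log(p) for w, p in words.items()} for s, words in emit.items()}


def viterbi(tokens, states, log_start, log_trans, log_emit, log_unseen=math.log(UNSEEN)):
    """Return the most likely sequence of states (fields) for the tokens."""
    # best[i][s]: best log-probability of any path that ends in state s at token i
    best = [{s: log_start[s] + log_emit[s].get(tokens[0], log_unseen) for s in states}]
    back = []  # back[i][s]: best predecessor of state s at token i + 1
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s].get(tok, log_unseen)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tokens = ["grenager", "klein", "unsupervised", "learning"]
fields = viterbi(tokens, STATES, log_start, log_trans, log_emit)
print(list(zip(tokens, fields)))
```

In this toy example the word "learning" has some probability under both fields, and the sticky transitions keep it in the same field as the preceding token. In an unsupervised setting the emission and transition tables would be estimated from unlabeled text (for example with EM) rather than written by hand.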