Semi-supervised multi-task learning of structured prediction models for web information extraction

Authors:
Paramveer S. Dhillon;Sundararajan Sellamanickam;Sathiya Keerthi Selvaraj
Affiliations:
University of Pennsylvania, Philadelphia, PA, USA;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Santa Clara, CA, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Extracting information from web pages is an important problem with applications such as improving search results and constructing databases to serve user queries. In this paper we propose a novel structured prediction method that addresses two important aspects of the extraction problem: (1) labeled data is available for only a small number of sites, and (2) a machine-learned global model does not generalize adequately across many websites. To this end, we propose a weight-space-based graph regularization method. This method has several advantages. First, it can use unlabeled data to address the limited labeled data problem, and thus falls in the class of graph-regularization-based semi-supervised learning approaches. Second, to address the inadequate generalization of a global model, it builds a local model for each website. Viewing the construction of a local model for each website as a task, we learn the models for a collection of sites jointly; our method can therefore also be seen as a graph-regularization-based multi-task learning approach. Learning the models jointly is useful in two ways: (1) the local model for a website can be effectively influenced by labeled and unlabeled data from other websites; and (2) even for a website with only unlabeled examples, it is possible to learn a decent local model. We demonstrate the efficacy of our method on several real-life datasets; experimental results show that significant performance improvements can be obtained by combining semi-supervised and multi-task learning in a single framework.
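
The idea of coupling per-site models through a graph in weight space can be illustrated with a small sketch. The abstract does not state the objective, so everything below is an assumption made for illustration: plain linear models with squared loss stand in for the structured prediction models, the site-similarity matrix `A` is taken as given, and the names `fit_site_models`, `lam`, and `ridge` are hypothetical. The sketch minimizes the sum of per-site losses plus a penalty `(lam / 4) * sum_{s,t} A[s, t] * ||w_s - w_t||^2`, so a site with no labels still receives a model through its neighbors.

```python
import numpy as np


def fit_site_models(X_list, y_list, A, lam=1.0, ridge=1e-3, lr=0.05, steps=2000):
    """Gradient descent on a graph-regularized multi-task objective (illustrative only).

    X_list[t], y_list[t]: labeled data for site t, or None if the site has no labels.
    A: symmetric (T, T) site-similarity matrix with zero diagonal (assumed given).
    Returns W of shape (T, d): one weight vector per site.
    """
    T = len(X_list)
    d = next(X.shape[1] for X in X_list if X is not None)
    W = np.zeros((T, d))
    # Graph Laplacian: (lam / 2) * trace(W^T L W) equals the pairwise penalty above.
    L = np.diag(A.sum(axis=1)) - A
    for _ in range(steps):
        # Gradient of the ridge term and the graph penalty for all sites at once.
        G = ridge * W + lam * (L @ W)
        for t in range(T):
            if X_list[t] is None:
                continue  # unlabeled site: only the graph term moves its weights
            X, y = X_list[t], y_list[t]
            G[t] += X.T @ (X @ W[t] - y) / len(y)  # mean squared-loss gradient
        W -= lr * G
    return W


# Toy usage: two labeled sites and one site without any labels.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X0, X1 = rng.normal(size=(50, 3)), rng.normal(size=(50, 3))
y0 = X0 @ w_true + 0.01 * rng.normal(size=50)
y1 = X1 @ w_true + 0.01 * rng.normal(size=50)
A = np.ones((3, 3)) - np.eye(3)  # every pair of sites is treated as similar
W = fit_site_models([X0, X1, None], [y0, y1, None], A)
```

In this toy setup the third site has no labeled examples, yet its weight vector is pulled toward those of the two labeled sites by the graph term. How the graph is actually built from unlabeled data, and how the penalty interacts with structured models, is specific to the paper and is not reproduced here.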