Using latent-structure to detect objects on the web

Authors:
Luciano Barbosa;Juliana Freire
Affiliations:
AT&T Labs - Research;University of Utah
Venue:
Proceedings of the 13th International Workshop on the Web and Databases
Year:
2010

Abstract

An important requirement for emerging applications that aim to locate and integrate content distributed over the Web is to identify pages that are relevant for a given domain or task. In this paper, we address the problem of identifying pages that contain objects with a latent structure, i.e., objects whose structure is only implicitly represented in the page. We propose an algorithm that, given a set of instances of an object type, derives rules by automatically extracting statistically significant patterns present inside the objects. These rules can then be used to detect the presence of these objects in new, unseen pages. Our approach has several advantages over learning-based text classifiers. Because it relies only on positive examples, constructing accurate object detectors is simpler than constructing learning-based classifiers, which require both positive and negative examples. In addition to providing a classification decision for the presence of an object, the derived detectors can pinpoint the location of the object inside a Web page. This enables our algorithm to extract additional object fragments and apply online learning to automatically update the rules as new documents become available. An experimental evaluation over a representative set of domains indicates that our approach is effective: it learns structural patterns and derives detectors that outperform state-of-the-art text classifiers, and the online learning component leads to substantial improvements over the initial detectors.
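
The general idea described in the abstract (learn patterns from positive examples only, then use them both to detect an object and to locate it) can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not specify how patterns are represented or how significance is tested. The sketch assumes token bigrams as patterns and a simple support threshold as a stand-in for a statistical significance test; the names `learn_patterns`, `detect`, `min_support`, `window`, and `min_hits`, as well as the recipe-like data, are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Pattern = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[Pattern]:
    """All contiguous n-token sequences in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def learn_patterns(instances: List[List[str]], n: int = 2,
                   min_support: float = 0.6) -> Set[Pattern]:
    """Keep n-grams that occur in at least min_support of the positive instances.

    Assumption: a support threshold stands in for a real significance test.
    """
    doc_freq: Counter = Counter()
    for tokens in instances:
        doc_freq.update(set(ngrams(tokens, n)))  # count each pattern once per instance
    threshold = min_support * len(instances)
    return {p for p, count in doc_freq.items() if count >= threshold}


def detect(page: List[str], patterns: Set[Pattern], n: int = 2,
           window: int = 7, min_hits: int = 4) -> Optional[int]:
    """Return the start index of the first window that contains at least
    min_hits distinct learned patterns, or None if no window qualifies."""
    for start in range(max(1, len(page) - window + 1)):
        hits = set(ngrams(page[start:start + window], n)) & patterns
        if len(hits) >= min_hits:
            return start
    return None


# Hypothetical positive examples: tokenized instances of one object type.
instances = [
    "title ingredients : NUM cup flour directions : mix".split(),
    "title ingredients : NUM tbsp sugar directions : bake".split(),
    "title ingredients : NUM cup milk directions : stir".split(),
]
patterns = learn_patterns(instances)

# A new, unseen page: the detector reports where the object starts.
page = "home about ingredients : NUM cup rice directions : boil contact".split()
position = detect(page, patterns)
print(position)  # 2, the index of the token "ingredients"
```

No negative examples are needed to build the pattern set, and the returned index shows how a detector of this kind yields a location rather than only a yes/no decision. The online learning step mentioned in the abstract could, under the same assumptions, be approximated by appending newly detected fragments to `instances` and calling `learn_patterns` again.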