Exploiting structural similarity for effective Web information extraction

Authors:
Sergio Flesca;Giuseppe Manco;Elio Masciari;Luigi Pontieri;Andrea Pugliese
Affiliations:
DEIS, Univ. della Calabria, Via P. Bucci 41/C, 87036 Rende, Italy;ICAR-CNR, Via P. Bucci 41/C, 87036 Rende, Italy;ICAR-CNR, Via P. Bucci 41/C, 87036 Rende, Italy;ICAR-CNR, Via P. Bucci 41/C, 87036 Rende, Italy;DEIS, Univ. della Calabria, Via P. Bucci 41/C, 87036 Rende, Italy
Venue:
Data & Knowledge Engineering
Year:
2007

Abstract

In this paper, we propose a technique for classifying Web pages based on the detection of structural similarities among semistructured documents, and devise an architecture that exploits this technique for information extraction. The proposal differs significantly from standard methods based on graph-matching algorithms: it represents the structure of a document as a time series in which each occurrence of a tag corresponds to an impulse. The degree of similarity between two documents is then determined by comparing the frequency content of the Fourier transforms of their time series. Experiments on real data show the effectiveness of the proposed technique.
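
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tag encoding, the alignment of documents of different lengths, or the distance function, so everything below is an assumption made for illustration. In particular, the helper names (`tag_signal`, `dft_magnitudes`, `structural_distance`), the encoding of an opening tag as a positive integer and a closing tag as its negative, the zero-padding to a common length, and the root-mean-square distance between magnitude spectra are all hypothetical choices.

```python
import cmath
import math
import re

# Matches the start of an opening or closing tag and captures its name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_signal(html, vocab):
    """Turn a document into a sequence of impulses, one per tag occurrence.

    Assumed encoding (illustrative only): each distinct tag name gets a
    positive integer id from the shared `vocab`; an opening tag emits +id
    and a closing tag emits -id. Text content is ignored.
    """
    signal = []
    for closing, name in TAG_RE.findall(html):
        tag_id = vocab.setdefault(name.lower(), len(vocab) + 1)
        signal.append(-tag_id if closing else tag_id)
    return signal


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point discrete Fourier transform (zero-padded)."""
    padded = list(signal) + [0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(html_a, html_b):
    """Root-mean-square difference between the two magnitude spectra.

    Zero-padding both signals to the same length is a simplification
    assumed here so that the spectra can be compared bin by bin.
    """
    vocab = {}  # shared so that the same tag maps to the same id in both
    a = tag_signal(html_a, vocab)
    b = tag_signal(html_b, vocab)
    n = max(len(a), len(b), 1)
    mag_a = dft_magnitudes(a, n)
    mag_b = dft_magnitudes(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(mag_a, mag_b)) / n)


if __name__ == "__main__":
    page_1 = "<html><body><p>first text</p></body></html>"
    page_2 = "<html><body><p>other text</p></body></html>"
    page_3 = "<html><body><table><tr><td>x</td></tr></table></body></html>"
    print(structural_distance(page_1, page_2))  # same tag structure
    print(structural_distance(page_1, page_3))  # different tag structure
```

In this sketch, two pages that share the same tag sequence yield identical signals and therefore a distance of zero regardless of their text, while pages with different tag structure yield a positive distance. The direct O(n²) transform is used only to keep the example dependency-free; a fast Fourier transform would be the natural choice for real pages.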