Reformatting web documents via header trees

Authors:
Minoru Yoshida; Hiroshi Nakagawa
Affiliations:
University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan
Venue:
ACLdemo '05 Proceedings of the ACL 2005 on Interactive poster and demonstration sessions
Year:
2005

Abstract

We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.