We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.