Webpage understanding: an integrated approach

Authors:
Jun Zhu;Bo Zhang;Zaiqing Nie;Ji-Rong Wen;Hsiao-Wuen Hon
Affiliations:
Tsinghua University;Tsinghua University;Microsoft Research Asia;Microsoft Research Asia;Microsoft Research Asia
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text content inside HTML elements remains an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are no longer directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text content on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of the page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Integrating page structure understanding with text content understanding thus yields a unified solution to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.