Data-rich Section Extraction from HTML pages

Authors:
Jiying Wang;Frederick H. Lochovsky
Venue:
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Year:
2002

Citing 0
Cited 12

Data extraction and label assignment for web databases
WWW '03 Proceedings of the 12th international conference on World Wide Web

Fine-grain web site structure discovery
WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management

Clustering web pages based on their structure
Data & Knowledge Engineering - Special issue: WIDM 2003

An automatic data grabber for large web sites
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Flint: Google-basing the Web
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology

Wrapper inference for ambiguous web pages
Applied Artificial Intelligence

Finding and using the content texts of HTML pages
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology

Exploiting tree structure of a web page for clustering
International Journal of Knowledge and Web Intelligence

Encapsulating multi-stepped web forms as web services
ICSOC/ServiceWave'09 Proceedings of the 2009 international conference on Service-oriented computing

Ranking search results by web quality dimensions
Journal of Web Engineering

Automatic web information extraction based on rules
WISE'11 Proceedings of the 12th international conference on Web information system engineering

TEX: An efficient and effective unsupervised Web information extractor
Knowledge-Based Systems

Abstract

In this paper, we propose a novel algorithm, DSE (Data-rich Subtree Extraction), to recognize and extract the data-rich section of an HTML page. We apply the DSE algorithm as a pre-processing "clean-up" step for two typical web information retrieval problems: topic distillation and web information extraction. Our experiments show that, for the test data sets we used, the DSE algorithm can correctly identify the data-rich sections of HTML pages with 100% accuracy. Therefore, it can effectively reduce the root set size for the topic distillation problem, thereby improving the precision and accuracy of the HITS algorithm. Furthermore, when applied to the web information extraction problem using the IEPAD algorithm, it can decrease the number of patterns discovered by this algorithm, thus shortening its time cost to generalize a wrapper for HTML pages.