In this paper, we propose a novel algorithm, DSE (Data-rich Subtree Extraction), to recognize and extract the data-rich section of an HTML page. We apply the DSE algorithm as a pre-processing "clean-up" step for two typical web information retrieval problems: topic distillation and web information extraction. Our experiments show that, for the test data sets we used, the DSE algorithm can correctly identify the data-rich sections of HTML pages with 100% accuracy. Therefore, it can effectively reduce the root set size for the topic distillation problem, thereby improving the precision and accuracy of the HITS algorithm. Furthermore, when applied to the web information extraction problem using the IEPAD algorithm, it can decrease the number of patterns discovered by this algorithm, thus shortening its time cost to generalize a wrapper for HTML pages.
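The abstract does not spell out how DSE selects the data-rich subtree, so the following is only an illustrative sketch of the general idea of isolating a data-rich section before further processing. It is not the DSE algorithm. It assumes a simple heuristic: descend from the document root into whichever child holds most of the page's text, and stop once no single child holds at least a given share of it. The names `Node`, `TreeBuilder`, `subtree_text`, `data_rich_subtree`, and the `share` threshold are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or children.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree that records text length per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Script and style bodies are not visible text, so they are skipped.
        if self.current.tag not in ("script", "style"):
            self.current.text_len += len(data.strip())


def subtree_text(node):
    """Total text length in the subtree rooted at node."""
    return node.text_len + sum(subtree_text(c) for c in node.children)


def data_rich_subtree(html, share=0.6):
    """Assumed heuristic (not DSE): return the deepest node whose subtree
    still holds at least `share` of the page's text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    total = subtree_text(builder.root)
    node = builder.root
    while True:
        best = max(node.children, key=subtree_text, default=None)
        if best is None or total == 0 or subtree_text(best) < share * total:
            return node
        node = best
```

A downstream step such as link analysis or pattern discovery would then run only on the returned subtree, which is the kind of pre-processing "clean-up" role the abstract describes for DSE.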