Finding and using the content texts of HTML pages

Authors:
Jun Ma;Zhumin Chen;Li Lian;Lianxia Li
Affiliations:
The College of Computer Science and Technology, Shandong University, Jinan, China;The College of Computer Science and Technology, Shandong University, Jinan, China;The College of Computer Science and Technology, Shandong University, Jinan, China;The College of Computer Science and Technology, Shandong University, Jinan, China
Venue:
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Year:
2008

Citing 9
Cited 0

QProber: A system for automatic classification of hidden-Web databases
ACM Transactions on Information Systems (TOIS)

Data-rich Section Extraction from HTML pages
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering

Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Organizing structured web sources by query schemas: a clustering approach
Proceedings of the thirteenth ACM international conference on Information and knowledge management

Learning important models for web page blocks based on layout and content analysis
ACM SIGKDD Explorations Newsletter

ViPER: augmenting automatic information extraction with visual perceptions
Proceedings of the 14th ACM international conference on Information and knowledge management

Web page cleaning for web mining through feature weighting
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence

Extracting content structure for web pages based on visual representation
APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications

Identifying content blocks from web documents
ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems

Abstract

A novel algorithm for finding the content text of an HTML page is proposed, based on a number of features of the textual blocks in the page. Experiments show that the new algorithm outperforms known ones on two measures: the ratio of correctly removed noise blocks and the ratio of correctly found content blocks. The application of the algorithm to hidden web classification is demonstrated as well.
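
The abstract names two evaluation ratios but does not define them. The sketch below shows one plausible reading, assuming each block carries a ground-truth label of either "content" or "noise" and that each ratio is the fraction of blocks of that class the algorithm labels correctly. The function name `block_ratios` and the label strings are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def block_ratios(predicted: Sequence[str], actual: Sequence[str]) -> Tuple[float, float]:
    """Return (noise_removed_ratio, content_found_ratio).

    Assumed definitions (not confirmed by the paper):
      noise_removed_ratio = noise blocks labelled "noise" / all noise blocks
      content_found_ratio = content blocks labelled "content" / all content blocks
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    noise_total = sum(1 for a in actual if a == "noise")
    content_total = sum(1 for a in actual if a == "content")

    # Count blocks whose predicted label matches the ground truth, per class.
    noise_ok = sum(1 for p, a in zip(predicted, actual) if a == "noise" and p == "noise")
    content_ok = sum(1 for p, a in zip(predicted, actual) if a == "content" and p == "content")

    # Report 0.0 when a class is absent, to avoid dividing by zero.
    noise_ratio = noise_ok / noise_total if noise_total else 0.0
    content_ratio = content_ok / content_total if content_total else 0.0
    return noise_ratio, content_ratio


# Toy page with five blocks; one noise block is wrongly kept as content.
actual = ["content", "noise", "noise", "content", "noise"]
predicted = ["content", "noise", "content", "content", "noise"]
noise_ratio, content_ratio = block_ratios(predicted, actual)
print(noise_ratio, content_ratio)
```

On this toy input, two of the three noise blocks are removed and both content blocks are found. Reporting the two ratios separately keeps an algorithm that discards everything from looking good on noise removal alone.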