An indent shape based approach for web lists mining

Authors:
Yanxu Zhu;Gang Yin;Huaimin Wang;Dianxi Shi;Xiang Li;Lin Yuan
Affiliations:
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Electronic Technology, Information Engineering University, Zhengzhou, Henan, China
Venue:
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Year:
2011

Abstract

Mining repeated patterns from HTML documents is a key step in typical Web information extraction applications, which need efficient pattern-mining techniques to generate wrappers automatically. Existing approaches such as tree matching and string matching detect repeated patterns with high precision, but their efficiency remains a challenge. In this paper, we present a novel approach to Web list mining based on the indent shape of HTML documents. The indent shape is a simplified abstraction of an HTML document in which tandem repeated waves indicate potential repeated patterns. By efficiently identifying the tandem repeated waves with a horizontal line scan along the indent shape, the approach recognizes the repeated patterns in a document, from which the lists of the target Web page can be extracted. Extensive experiments show that our approach achieves better performance and efficiency than existing approaches.
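
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way to read it, not the authors' implementation. It assumes that the indent shape can be approximated by the nesting depth of each opening tag in document order, that a "wave" is the stretch of the shape starting at a point on the scan line and staying strictly deeper until the shape returns to the line or rises above it, and that tandem repeats are adjacent waves with identical (depth, tag) sequences. The names `IndentShape`, `indent_shape`, `tandem_waves`, and `VOID_TAGS` are hypothetical and introduced here for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not deepen the shape.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndentShape(HTMLParser):
    """Records (depth, tag) for every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def indent_shape(html):
    """Return the assumed indent shape of an HTML string."""
    parser = IndentShape()
    parser.feed(html)
    parser.close()
    return parser.shape


def tandem_waves(shape, level, min_repeats=2):
    """Scan a horizontal line at `level` and return runs of adjacent,
    identical waves as (start_index, repeat_count, wave) tuples."""
    waves = []
    for start, (depth, _) in enumerate(shape):
        if depth != level:
            continue
        # A wave extends while the shape stays strictly deeper than the line.
        end = start + 1
        while end < len(shape) and shape[end][0] > level:
            end += 1
        waves.append((start, end, tuple(shape[start:end])))

    runs = []
    i = 0
    while i < len(waves):
        j = i + 1
        # Extend the run while the next wave is identical and directly adjacent.
        while (j < len(waves)
               and waves[j][2] == waves[i][2]
               and waves[j][0] == waves[j - 1][1]):
            j += 1
        if j - i >= min_repeats:
            runs.append((waves[i][0], j - i, waves[i][2]))
        i = j
    return runs


page = "<ul><li><a>x</a></li><li><a>y</a></li><li><a>z</a></li></ul><p>t</p>"
shape = indent_shape(page)
list_runs = tandem_waves(shape, level=1)
```

On this toy page, scanning at depth 1 yields a single run of three identical `li`/`a` waves starting at index 1 of the shape, which corresponds to the three list items. A real extractor would likely need approximate wave matching and a way to choose scan levels, neither of which is described in the abstract.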