Many Web news sites share similar structures and layout styles. Our extensive case studies indicate a potential correlation between Web content layouts and path patterns. Compared with delimiter-based features of Web content, path patterns offer several advantages, including high positioning accuracy, ease of use, and broad applicability. We therefore propose a Web information extraction model whose path patterns are constructed by a path pattern mining algorithm. The experimental data set consists of news pages randomly selected from the CNN website. With a reasonable tolerance threshold, the experimental results show an average precision above 99% and an average recall of 100% when Web information extraction is combined with the path pattern mining algorithm. Path patterns produced by the mining algorithm perform much better than a priori extraction rules configured from domain knowledge.
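To make the idea of path-pattern-based extraction concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes that a path pattern is the sequence of tag names (with the `class` attribute appended) from the document root to a text node, and that a pattern is kept when it reaches labeled content in a sufficient fraction of sample pages. The names `PathCollector`, `mine_path_patterns`, `extract`, `precision_recall`, and the `min_support` parameter are hypothetical; `min_support` is not claimed to correspond to the tolerance threshold mentioned above.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, root first
        self.items = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_paths(html):
    """Return all (path, text) pairs found in one HTML page."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def mine_path_patterns(pages, labeled_texts, min_support=0.5):
    """Keep paths that reach labeled content in at least min_support of the pages.

    Assumption: labeled_texts[i] is the set of strings marked as content in pages[i].
    """
    counts = Counter()
    for html, wanted in zip(pages, labeled_texts):
        counts.update({path for path, text in text_paths(html) if text in wanted})
    return {path for path, n in counts.items() if n / len(pages) >= min_support}


def extract(html, patterns):
    """Return the text nodes whose path matches one of the mined patterns."""
    return [text for path, text in text_paths(html) if path in patterns]


def precision_recall(extracted, expected):
    """Set-based precision and recall of extracted strings against expected ones."""
    extracted, expected = set(extracted), set(expected)
    hits = len(extracted & expected)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall
```

In this sketch, patterns would be mined once from a few labeled sample pages of a site and then applied to unseen pages with `extract`; `precision_recall` compares the extracted strings with a manually prepared expected set. Exact path matching is the simplest possible choice here, and a practical system would likely need some tolerance for small structural differences between pages.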