Subtopic structuring for full-length document access
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Making large-scale support vector machine learning practical
Advances in kernel methods
Learning to remove Internet advertisements
Proceedings of the third annual conference on Autonomous Agents
Statistical Models for Text Segmentation
Machine Learning - Special issue on natural language learning
Authoritative sources in a hyperlinked environment
Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
IntelliClean: a knowledge-based intelligent data cleaner
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Enhanced topic distillation using text, markup tags, and hyperlinks
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
Modern Information Retrieval
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
Entropy-based link analysis for mining web informative structures
Proceedings of the eleventh international conference on Information and knowledge management
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A model of lexical attraction and repulsion
ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Cohesion and collocation: using context vectors in text segmentation
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Statistical models for topic segmentation
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Learning block importance models for web pages
Proceedings of the 13th international conference on World Wide Web
Using link analysis to improve layout on mobile devices
Proceedings of the 13th international conference on World Wide Web
Editorial: special issue on web content mining
ACM SIGKDD Explorations Newsletter
Learning important models for web page blocks based on layout and content analysis
ACM SIGKDD Explorations Newsletter
Browsing fatigue in handhelds: semantic bookmarking spells relief
WWW '05 Proceedings of the 14th international conference on World Wide Web
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Mining key information of web pages: A method and its application
Expert Systems with Applications: An International Journal
Two-phase Web site classification based on Hidden Markov Tree models
Web Intelligence and Agent Systems
Page-level template detection via isotonic smoothing
Proceedings of the 16th international conference on World Wide Web
Enhancing web page classification through image-block importance analysis
Information Processing and Management: an International Journal
Site-Independent Template-Block Detection
PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
Web Contents Extracting for Web-Based Learning
ICWL '08 Proceedings of the 7th international conference on Advances in Web Based Learning
Automated Semantic Analysis of Schematic Data
World Wide Web
Extracting article text from the web with maximum subsequence segmentation
Proceedings of the 18th international conference on World wide web
A fast and simple method for extracting relevant content from news webpages
Proceedings of the 18th ACM conference on Information and knowledge management
Finding and using the content texts of HTML pages
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Automatic selection of print-worthy content for enhanced web page printing experience
Proceedings of the 10th ACM symposium on Document engineering
On text preprocessing for opinion mining outside of laboratory environments
AMT'12 Proceedings of the 8th international conference on Active Media Technology
A hybrid approach for extracting informative content from web pages
Information Processing and Management: an International Journal
Hi-index | 0.00 |
Unlike conventional data or text, Web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. Such irrelevant information (which we call Web page noise) in Web pages can seriously harm Web mining, e.g., clustering and classification. In this paper, we propose a novel feature weighting technique to deal with Web page noise to enhance Web mining. This method first builds a compressed structure tree to capture the common structure and comparable blocks in a set of Web pages. It then uses an information based measure to evaluate the importance of each node in the compressed structure tree. Based on the tree and its node importance values, our method assigns a weight to each word feature in its content block. The resulting weights are used in Web mining. We evaluated the proposed technique with two Web mining tasks, Web page clustering and Web page classification. Experimental results show that our weighting method is able to dramatically improve the mining results.