A commercial Web page typically contains many information blocks. Besides the main content blocks, it usually includes navigation panels, copyright and privacy notices, and advertisements, which serve business purposes or ease user access. We call these blocks, which are not the main content of the page, noisy blocks. We show that the information contained in noisy blocks can seriously harm Web data mining, so eliminating this noise is of great importance. In this paper, we propose a noise elimination technique based on the following observation: within a given Web site, noisy blocks usually share common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called the Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, we build a Style Tree for the site, which we call the Site Style Tree (SST). We then introduce an information-based measure to determine which parts of the SST represent noise and which parts represent the main contents of the site. To detect and eliminate noise in any Web page of the site, the page is mapped to the SST. The proposed technique is evaluated on two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique improves the mining results significantly.
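The observation behind the technique can be illustrated with a minimal sketch. It is not the Style Tree itself: a flat block path stands in for the tree, and normalized Shannon entropy stands in for the information-based measure, which the abstract does not define. The page representation (a list of `(path, content)` pairs), the function names, the sample data, and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def diversity(values):
    """Normalized Shannon entropy of the observed values, in [0, 1].

    0 means every sampled page shows the same value (likely noise);
    values near 1 mean almost every page differs (likely main content).
    """
    n = len(values)
    counts = Counter(values)
    if n <= 1 or len(counts) == 1:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def build_site_model(sampled_pages):
    """Map each block path to the diversity of its content across the sample."""
    by_path = defaultdict(list)
    for page in sampled_pages:
        for path, content in page:
            by_path[path].append(content)
    return {path: diversity(contents) for path, contents in by_path.items()}


def remove_noise(page, model, threshold=0.5):
    """Keep blocks whose path is diverse across the site.

    Paths never seen in the sample are kept (assumed informative).
    """
    return [(path, content) for path, content in page
            if model.get(path, 1.0) >= threshold]


# Hypothetical sample: three pages from one site, as (path, content) pairs.
pages = [
    [("body/div.nav", "Home | About"), ("body/div.article", "Story A"),
     ("body/div.footer", "Copyright notice")],
    [("body/div.nav", "Home | About"), ("body/div.article", "Story B"),
     ("body/div.footer", "Copyright notice")],
    [("body/div.nav", "Home | About"), ("body/div.article", "Story C"),
     ("body/div.footer", "Copyright notice")],
]
model = build_site_model(pages)
cleaned = remove_noise(pages[0], model)
```

In this sample the navigation and footer blocks are identical on every page, so they score zero and are dropped, while the article block differs on every page and is kept. A fuller implementation would compare presentation styles as well as contents and would work on the tree of blocks rather than on flat paths.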