Template detection via data mining and its applications

Authors:
Ziv Bar-Yossef; Sridhar Rajagopalan
Affiliations:
University of California at Berkeley, Berkeley, CA; IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the 11th international conference on World Wide Web
Year:
2002


Abstract

We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
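
The counting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes each page has already been split into text blocks, treats two blocks as the same item only when their normalized text matches exactly, and counts single items rather than larger item sets. The names `fingerprint`, `detect_template_blocks`, `strip_templates`, and the `min_support` threshold are all hypothetical and introduced here for illustration only.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def fingerprint(block: str) -> str:
    """Hash a lowercased, whitespace-normalized block of text."""
    normalized = " ".join(block.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def detect_template_blocks(
    pages: Dict[str, List[str]], min_support: int = 2
) -> Dict[str, Set[str]]:
    """Map each frequent fingerprint to the set of pages it occurs on.

    A block counts as frequent (a template candidate) when it appears on
    at least `min_support` distinct pages. The threshold is an assumption.
    """
    occurrences: Dict[str, Set[str]] = defaultdict(set)
    for page_id, blocks in pages.items():
        for block in blocks:
            occurrences[fingerprint(block)].add(page_id)
    return {fp: ids for fp, ids in occurrences.items() if len(ids) >= min_support}


def strip_templates(
    pages: Dict[str, List[str]], min_support: int = 2
) -> Dict[str, List[str]]:
    """Return the pages with every frequent block removed."""
    frequent = detect_template_blocks(pages, min_support)
    return {
        page_id: [b for b in blocks if fingerprint(b) not in frequent]
        for page_id, blocks in pages.items()
    }


# Toy input: a navigation bar repeated on three pages, plus unique content.
pages = {
    "a.html": ["Home | About | Contact", "Article about cats"],
    "b.html": ["Home  |  About | Contact", "Article about dogs"],
    "c.html": ["home | about | contact", "Copyright 2002"],
}
cleaned = strip_templates(pages, min_support=3)
```

With this toy input, the navigation bar is the only block shared by all three pages, so it is the only block removed. A real system would need a segmentation step and a tolerant similarity measure, neither of which this sketch covers.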