Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation, and branding. We study the nature, evolution, and prevalence of these templates on the web. As part of this work, we develop new randomized algorithms for template extraction that run approximately twenty times faster than existing approaches at similar quality. Our results show that 40 to 50% of the content on the web is template content. Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating. Text, links, and total HTML bytes within templates are all growing as a fraction of total content at a rate of between 6 and 8% per year. We discuss the deleterious implications of this growth for information retrieval and ranking, classification, and link analysis.
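The growth figures above can be related to one another with a short calculation. The sketch below is only an illustration and assumes the 6 to 8% figure is a relative, compounding annual rate (the abstract does not say how the rate is measured); the function names `growth_factor`, `doubling_time`, and `implied_annual_rate` are hypothetical helpers, not part of the paper.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Total multiplicative growth after `years` of compounding at `annual_rate`."""
    return (1.0 + annual_rate) ** years


def doubling_time(annual_rate: float) -> float:
    """Years needed for a quantity to double at a compounding `annual_rate`."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


def implied_annual_rate(factor: float, years: float) -> float:
    """Compounding annual rate that yields `factor` total growth over `years`."""
    return factor ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Assumption: the abstract's 6-8% per year is a relative, compounding rate.
    for rate in (0.06, 0.08):
        print(
            f"rate={rate:.0%}  "
            f"8-year factor={growth_factor(rate, 8):.2f}  "
            f"doubling time={doubling_time(rate):.1f} years"
        )
    # Rate that would correspond to an exact doubling over eight years.
    print(f"rate for 2x in 8 years: {implied_annual_rate(2.0, 8):.1%}")
```

Under this assumption, 6 to 8% per year compounds to a factor of roughly 1.6 to 1.85 over eight years, which is in the same range as the reported doubling of the overall template fraction; the per-component rates and the overall fraction may also be measured differently.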