We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
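
As a rough illustration of detection by frequency counting, the sketch below treats each page as a list of content blocks and flags any block that recurs on several pages as template material. It is a minimal sketch under stated assumptions, not the method of the paper: the abstract does not describe the algorithm, so the block representation, the function names `detect_template_blocks` and `strip_templates`, the `min_support` threshold, and the sample pages are all hypothetical. It also counts only single items, which is the simplest case of frequent item set counting.

```python
from collections import Counter


def detect_template_blocks(pages, min_support=2):
    """Return the blocks that occur on at least `min_support` distinct pages.

    `pages` maps a page identifier to a list of block strings. Both the
    representation and the threshold are assumptions made for illustration.
    """
    support = Counter()
    for blocks in pages.values():
        # Count each block once per page, so support is measured in pages.
        support.update(set(blocks))
    return {block for block, count in support.items() if count >= min_support}


def strip_templates(pages, min_support=2):
    """Remove detected template blocks from every page, keeping block order."""
    templates = detect_template_blocks(pages, min_support)
    return {
        url: [block for block in blocks if block not in templates]
        for url, blocks in pages.items()
    }


# Hypothetical three-page site sharing a navigation bar and a footer.
pages = {
    "/a": ["nav: Home | About", "Article on hubs", "footer: (c) Example"],
    "/b": ["nav: Home | About", "Article on crawling", "footer: (c) Example"],
    "/c": ["nav: Home | About", "Contact details"],
}

templates = detect_template_blocks(pages, min_support=2)
cleaned = strip_templates(pages, min_support=2)
```

Here the navigation bar and the footer each appear on at least two pages, so they are removed, and only page-specific blocks are left for a downstream search or mining step to index.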