Automatic detection of fragments in dynamically generated web pages

Authors:
Lakshmish Ramaswamy;Arun Iyengar;Ling Liu;Fred Douglis
Affiliations:
Georgia Tech, Atlanta, GA;IBM T.J. Watson Research Center, Yorktown Heights, NY;Georgia Tech, Atlanta, GA;IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the 13th international conference on World Wide Web
Year:
2004

Abstract

Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. For a web site to use fragment-based content generation, however, it needs good methods for dividing its pages into fragments. Manual fragmentation of web pages is expensive, error-prone, and does not scale. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in web sites serving dynamic content. We consider a fragment interesting if it is shared among multiple documents or if it has distinct lifetime or personalization characteristics. Our approach has three unique features. First, we propose a hierarchical, fragment-aware model of dynamic web pages and a data structure that is compact and effective for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. We evaluate the proposed scheme through a series of experiments that show the benefits and costs of the algorithms. We also study how adopting the fragments detected by our system affects disk space utilization and network bandwidth consumption.
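
The abstract does not spell out the paper's data structure or algorithms, so the following is only an illustrative sketch of the general idea of detecting maximal shared fragments, not the authors' method. It assumes each page is already parsed into a nested tuple `(tag, child, ...)` with strings as text leaves, hashes every subtree with SHA-1, and treats a subtree as a maximal shared fragment when it occurs in at least `min_docs` pages and its parent does not occur in the same set of pages. The names `annotate`, `maximal_shared`, and `min_docs` are hypothetical.

```python
import hashlib
from collections import defaultdict


def annotate(node, doc_id, seen):
    """Hash a subtree bottom-up and record which documents contain it.

    Returns (hash, node, annotated_children).
    """
    if isinstance(node, str):
        payload, kids = "text:" + node, []
    else:
        tag, *children = node
        kids = [annotate(child, doc_id, seen) for child in children]
        payload = "<" + tag + ">" + "".join(kid[0] for kid in kids)
    h = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    seen[h].add(doc_id)
    return (h, node, kids)


def maximal_shared(pages, min_docs=2):
    """Return (subtree, documents) pairs for maximal shared element subtrees.

    Assumption for this sketch: a subtree is "maximal" if its parent is not
    shared by exactly the same set of documents.
    """
    seen = defaultdict(set)  # subtree hash -> ids of documents containing it
    trees = [annotate(root, doc_id, seen) for doc_id, root in pages.items()]
    found = {}

    def walk(entry, parent_docs):
        h, node, kids = entry
        docs = seen[h]
        is_element = not isinstance(node, str)
        if is_element and len(docs) >= min_docs and docs != parent_docs:
            found[h] = (node, frozenset(docs))
        for kid in kids:
            walk(kid, docs)

    for tree in trees:
        walk(tree, set())
    return sorted(found.values(), key=lambda item: repr(item[0]))


if __name__ == "__main__":
    nav = ("nav", ("ul", "Home", "News"))
    footer = ("footer", "(c) Example")
    pages = {
        "a": ("html", nav, ("div", ("p", "Story A"), footer)),
        "b": ("html", nav, ("div", ("p", "Story B"), footer)),
        "c": ("html", ("div", ("p", "Story C"), footer)),
    }
    for subtree, docs in maximal_shared(pages):
        print(subtree[0], sorted(docs))
```

In this toy input the `nav` subtree is reported for pages `a` and `b`, and the `footer` subtree for all three pages. The inner `ul` is not reported separately because it is shared by exactly the same pages as its parent `nav`. Detecting fragments by lifetime or personalization characteristics would additionally require comparing versions of the same page over time, which this sketch does not attempt.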