Wrapping-oriented classification of web pages

Authors:
Valter Crescenzi;Giansalvatore Mecca;Paolo Merialdo
Affiliations:
Università Roma Tre, Via della Vasca Navale, 79, 00146 Roma, Italy;Università della Basilicata, C.da Macchia Romana, 85100 Potenza, Italy;Università Roma Tre, Via della Vasca Navale, 79, 00146 Roma, Italy
Venue:
Proceedings of the 2002 ACM symposium on Applied computing
Year:
2002

Abstract

Data extraction from HTML Web pages is performed by software programs called wrappers. Writing wrappers is a costly and labor-intensive task; recently, several proposals have attacked the problem of generating wrappers automatically. In this paper, we study a problem related to automating the wrapper generation process: given a portion of a Web site to wrap, we develop techniques to cluster its HTML pages into page classes with homogeneous organization and layout; these classes can then serve as input to the wrapper generation process. Also, once a wrapper library has been generated for a set of Web sites, our techniques can be used to select, for any new page downloaded from these sites, the right wrapper in the library. Based on the proposed techniques, we have developed a software prototype and conducted several experiments on HTML pages from real-life Web sites.
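
The abstract does not say which features or similarity measure the authors use, so the following is only a minimal sketch of the two tasks it names (grouping pages into structurally homogeneous classes, then matching a new page to an existing class). It assumes, purely for illustration, that a page is summarized by its set of root-to-element tag paths and that pages are compared with Jaccard similarity; the names `tag_paths`, `cluster_pages`, `select_class`, and the threshold value are hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature (assumed feature) of one page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.7):
    """Greedy clustering: a page joins the first class whose representative
    signature is at least `threshold` similar, otherwise it starts a new class.
    `pages` maps a page name to its HTML source."""
    classes = []  # list of (representative signature, [page names])
    for name, html in pages.items():
        signature = tag_paths(html)
        for representative, members in classes:
            if jaccard(representative, signature) >= threshold:
                members.append(name)
                break
        else:
            classes.append((signature, [name]))
    return classes


def select_class(html, classes, threshold=0.7):
    """Return the index of the best-matching class for a new page, or None
    if no class is similar enough (no suitable wrapper in the library)."""
    signature = tag_paths(html)
    best_index, best_score = None, threshold
    for index, (representative, _) in enumerate(classes):
        score = jaccard(representative, signature)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index
```

In this sketch each class index would stand in for one entry of the wrapper library: `cluster_pages` produces the classes handed to wrapper generation, and `select_class` picks the class, and hence the wrapper, for a newly downloaded page.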