Data extraction from HTML Web pages is performed by software programs called wrappers. Writing wrappers is a costly and labor-intensive task, and several recent proposals have addressed the problem of generating wrappers automatically. In this paper, we study a problem related to automating the wrapper generation process: given a portion of a Web site to wrap, we develop techniques to cluster its HTML pages into page classes with homogeneous organization and layout; these classes can then serve as input to the wrapper generation process. In addition, once a wrapper library has been generated for a set of Web sites, our techniques can be used to select, for any new page downloaded from these sites, the right wrapper in the library. Based on the proposed techniques, we have developed a software prototype and conducted several experiments on HTML pages from real-life Web sites.
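
To make the two tasks above concrete (grouping pages into structurally homogeneous classes, then assigning a new page to a class), here is a minimal illustrative sketch. It is not the technique developed in the paper, whose details the abstract does not give. It assumes a page's layout can be approximated by its set of root-to-element tag paths, compares pages with Jaccard similarity, and groups them greedily. The names tag_paths, cluster_pages, and select_class and the 0.6 threshold are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed elements: pop back to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Structural signature of a page (assumption: the set of tag paths)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: a page joins the first class whose representative
    signature is similar enough, otherwise it starts a new class.
    Returns a list of (representative_signature, [page names])."""
    clusters = []
    for name, html in pages.items():
        signature = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, signature) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((signature, [name]))
    return clusters


def select_class(html, clusters):
    """Index of the page class most similar to a newly downloaded page."""
    signature = tag_paths(html)
    return max(range(len(clusters)),
               key=lambda i: jaccard(clusters[i][0], signature))


pages = {
    "a": "<html><body><h1>T</h1><table><tr><td>x</td></tr></table></body></html>",
    "b": "<html><body><h1>U</h1><table><tr><td>y</td></tr>"
         "<tr><td>z</td></tr></table></body></html>",
    "c": "<html><body><div><p>news</p><p>more</p></div></body></html>",
}
clusters = cluster_pages(pages)
new_page = "<html><body><div><p>later</p></div></body></html>"
chosen = select_class(new_page, clusters)
```

In this toy input, pages "a" and "b" share the same table-based layout and land in one class, while "c" forms its own; the new page is matched to the class of "c". In a wrapper library, each class index would map to the wrapper generated for that class.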