Most documents on the Web are written in HTML and constitute a huge amount of legacy data. These documents are formatted for visual presentation only, and their styles vary with authorship and purpose, which makes the retrieval and integration of Web content difficult to automate. We contribute to a solution by proposing a structured approach to data reverse engineering of data-intensive Web sites. We focus on data content and on the way that content is structured on the Web. We use a Web data model to describe abstract structural features of HTML pages, and we propose a method for segmenting HTML documents into special blocks that group semantically related Web objects. We have developed a tool based on this method that supports identifying the structure, function, and meaning of data organized in Web object blocks. Using this tool, we demonstrate the feasibility and effectiveness of our approach on a set of real Web sites.