Towards more personalized web: extraction and integration of dynamic content from the web

Authors:
Marek Kowalkiewicz;Maria E. Orlowska;Tomasz Kaczmarek;Witold Abramowicz
Affiliations:
Department of Management Information Systems, Poznan University of Economics, Poznan, Poland;School of Information Technology and Electrical Engineering, The University of Queensland, QLD, Australia;Department of Management Information Systems, Poznan University of Economics, Poznan, Poland;Department of Management Information Systems, Poznan University of Economics, Poznan, Poland
Venue:
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Year:
2006

Abstract

Information and content integration are widely regarded as a possible remedy for information overload on the Internet. This article gives an overview of a simple solution for integrating information and content on the Web. It discusses previous approaches to content extraction and integration, then introduces a novel technology, based on XML processing, that addresses their problems. It also reports lessons learned from handling changing webpage layouts, pages that do not conform to HTML standards, and queries that return multiple results. The method, which applies relative XPath queries over the DOM tree, proves more robust than previous approaches to Web information integration. Furthermore, the prototype implementation demonstrates a simplicity that enables non-professional users to adopt the approach in their day-to-day information management routines.
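
The robustness claim for relative XPath queries can be illustrated with a minimal sketch. This is not the authors' implementation. The two sample pages, the `Price` label, and the function names `extract_absolute` and `extract_relative` are all hypothetical, and the sketch assumes the page has already been converted to well-formed XML. It uses only the limited XPath subset of Python's standard `xml.etree.ElementTree`. A query tied to a fixed position in the document breaks when a block is inserted, while a query anchored on a nearby label still finds the value.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a layout change.
PAGE_V1 = """<html><body>
<div id="banner">Ad</div>
<div id="content"><table>
<tr><td>Price</td><td>19.99</td></tr>
<tr><td>Stock</td><td>7</td></tr>
</table></div>
</body></html>"""

# Hypothetical page after a layout change: a new div is inserted
# and the table rows are reordered.
PAGE_V2 = """<html><body>
<div id="banner">Ad</div>
<div id="news">Breaking</div>
<div id="content"><table>
<tr><td>Stock</td><td>7</td></tr>
<tr><td>Price</td><td>19.99</td></tr>
</table></div>
</body></html>"""

# Position-based path from the document root (fragile by design).
ABSOLUTE_PATH = "./body/div[2]/table/tr[1]/td[2]"


def extract_absolute(page):
    """Return the text at a fixed position in the tree, or None."""
    root = ET.fromstring(page)
    node = root.find(ABSOLUTE_PATH)
    return node.text if node is not None else None


def extract_relative(page, label):
    """Find the row containing a cell whose text equals `label`,
    then read the second cell relative to that row."""
    root = ET.fromstring(page)
    # Anchor: any <tr> that has a <td> child with the given text.
    row = root.find(f".//tr[td='{label}']")
    if row is None:
        return None
    # Step relative to the anchor instead of the document root.
    cell = row.find("./td[2]")
    return cell.text if cell is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, extract_absolute(page), extract_relative(page, "Price"))
```

In this sketch the position-based query returns nothing for the second page, because the second `div` no longer contains the table, whereas the label-anchored query returns the same value for both pages. Real pages that violate HTML standards would first need a tolerant parser or a cleanup step, which is outside the scope of this example.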