STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques

Authors:
Nikolaos K. Papadakis;Dimitrios Skoutas;Konstantinos Raftopoulos;Theodora A. Varvarigou
Affiliations:
-;-;IEEE Computer Society;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need to go beyond the concept of "human browsing." The World Wide Web is today the main repository for all kinds of information and has so far been very successful in disseminating information to humans. Automating the process of information retrieval enables further utilization by targeted applications. The key idea in our novel system is to exploit the format of Web pages to discover their underlying structure, and from that structure to infer and extract pieces of information from the page. Our system first identifies the section of the Web page that contains the information to be extracted and then extracts it using clustering techniques and other statistical tools. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system is the introduction of a signal-wise treatment of the tag structural hierarchy and the use of hierarchical clustering techniques to segment Web pages. Such a treatment is significant because it permits abstracting away from the raw tag-manipulating approach. Experimental results and comparisons with other state-of-the-art systems are presented and discussed in the paper, indicating the high performance of the proposed algorithm.
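
The idea of treating the tag hierarchy as a signal and segmenting it by hierarchical clustering can be illustrated with a minimal sketch. This is not the STAVIES algorithm, whose details the abstract does not give. It assumes that the "signal" is simply the nesting depth of each text node, and it swaps in a basic adjacency-constrained agglomerative clustering that merges neighbouring segments with the closest mean depth. The names DepthSignal, segment, and PAGE are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthSignal(HTMLParser):
    """Record one (depth, text) sample per non-empty text node (assumed signal)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.samples = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.samples.append((self.depth, text))


def segment(samples, k):
    """Merge adjacent segments with the closest mean depth until k remain."""
    k = max(k, 1)
    segments = [[s] for s in samples]

    def mean_depth(seg):
        return sum(depth for depth, _ in seg) / len(seg)

    while len(segments) > k:
        means = [mean_depth(seg) for seg in segments]
        # Index of the adjacent pair whose mean depths differ the least.
        i = min(range(len(segments) - 1),
                key=lambda j: abs(means[j] - means[j + 1]))
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments


# Hypothetical page: a heading, a table of records, and a footer.
PAGE = ("<html><body><h1>Shop</h1>"
        "<table><tr><td>A</td><td>1</td></tr>"
        "<tr><td>B</td><td>2</td></tr></table>"
        "<p>Footer</p></body></html>")

parser = DepthSignal()
parser.feed(PAGE)
parser.close()

segments = segment(parser.samples, 3)
texts = [[text for _, text in seg] for seg in segments]
print(texts)
```

On this toy page the table cells sit deeper in the tag tree than the heading and the footer, so the depth signal alone is enough to isolate the data-bearing region as the middle segment. A real page would need a richer signal and a principled way to choose the number of segments; here k is fixed by hand.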