Web data extraction based on structural similarity

Authors:
Zhao Li;Wee Keong Ng;Aixin Sun
Affiliations:
Nanyang Technological University, Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Avenue, Singapore;Nanyang Technological University, Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Avenue, Singapore;Nanyang Technological University, Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Avenue, Singapore
Venue:
Knowledge and Information Systems
Year:
2005

Abstract

Web data-extraction systems in use today focus mainly on generating extraction rules, i.e., wrapper induction. As a result, they appear ad hoc and are difficult to integrate when a holistic view is taken: the phases of the data-extraction process are disconnected and share no common foundation that would make building a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. This yields improved efficiency and better control over the extraction procedure, which our experimental results confirm. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
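
The idea of representing a document as a vector over structural patterns can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define how document schemata are detected, so the example substitutes a much simpler stand-in, root-to-element tag paths, as hypothetical "structural patterns". The names `PathCollector`, `tag_paths`, `to_vector` and the sample `patterns` list are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Counts the root-to-element tag path of every element in a page.

    Tag paths are only a stand-in for the paper's document schemata.
    """

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return a Counter of tag paths found in the HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def to_vector(html, patterns):
    """Represent a document as occurrence counts over a fixed pattern list."""
    counts = tag_paths(html)
    return [counts.get(p, 0) for p in patterns]


# Hypothetical pattern vocabulary shared by all documents being compared.
patterns = ["html/body/ul", "html/body/ul/li", "html/body/table/tr"]
doc = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
vector = to_vector(doc, patterns)
```

Because every document is mapped onto the same pattern list, the resulting vectors have a fixed length and could in principle be fed to standard classification or clustering code, which is the kind of integration the abstract alludes to.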