Web data-extraction systems in use today focus mainly on the generation of extraction rules, i.e., wrapper induction. As a result, they appear ad hoc and are difficult to integrate when a holistic view is taken. The phases of the data-extraction process are disconnected and share no common foundation that would make building a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction, and document classification) can be easily integrated. The result is improved efficiency and better control over the extraction procedure, which our experimental results confirmed. More importantly, because a document can be represented as a vector of schemata, this representation can be easily incorporated into existing systems as the fabric for integration.
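To make the idea of representing a document as a vector of schemata more concrete, the following is a minimal sketch only. The abstract does not define how schemata are discovered or encoded, so this example assumes, purely for illustration, that a schema is a root-to-element tag path and that a document is encoded as a binary vector marking which of a fixed list of schemata it contains. The names `TagPathCollector`, `tag_paths`, and `schema_vector` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects every root-to-element tag path seen in an HTML document.

    Assumption for illustration: a "schema" is approximated by a tag path
    such as "html/body/table/tr/td". The paper's own definition may differ.
    """

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # all tag paths observed so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in the given HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def schema_vector(html, schemata):
    """Encode a document as a 0/1 vector over a fixed list of schemata."""
    paths = tag_paths(html)
    return [1 if schema in paths else 0 for schema in schemata]


# Hypothetical schemata and documents, chosen only to show the encoding.
schemata = ["html/body/table/tr/td", "html/body/ul/li", "html/body"]
doc_a = ("<html><body><table><tr><td>Title</td><td>Price</td></tr>"
         "</table></body></html>")
doc_b = "<html><body><ul><li>Item</li></ul></body></html>"

vector_a = schema_vector(doc_a, schemata)
vector_b = schema_vector(doc_b, schemata)
print(vector_a)  # [1, 0, 1]
print(vector_b)  # [0, 1, 1]
```

Under these assumptions, documents that share a vector (or lie close to one another in vector space) could be grouped before wrapper induction or passed to a standard classifier, which is one way the phases named above might share a common foundation. Void elements such as `<br>` are not handled specially in this sketch.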