We present a wrapper generation system that extracts the contents of semi-structured documents containing instances of a record. Generation is fully automatic and relies only on general assumptions about the structure of the instances. The system outputs a set of pairs of left and right delimiters that surround the instances of a field. In addition to the input documents, the system receives a set of symbols with which a delimiter must begin or end. Because it treats semi-structured documents simply as strings, it depends on neither the markup language nor the natural language. It requires no training examples showing where instances are located. We report experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML, and written in four natural languages. In addition to the usual contents, the generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
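To make the notion of a wrapper as a set of left/right delimiter pairs concrete, the following minimal sketch shows how such pairs could be applied to a document treated as a plain string. It illustrates only the application of an already generated wrapper, not the generation procedure described in the paper. The function names `extract_field` and `apply_wrapper`, the sample page, and the delimiter pairs are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def extract_field(document: str, left: str, right: str) -> List[str]:
    """Return every substring that lies between `left` and the next `right`.

    The document is scanned as a plain string; no HTML or XML parsing is done.
    """
    instances = []
    pos = 0
    while True:
        start = document.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = document.find(right, start)
        if end == -1:
            break
        instances.append(document[start:end])
        pos = end + len(right)
    return instances


def apply_wrapper(document: str, wrapper: List[Tuple[str, str]]) -> List[List[str]]:
    """Apply each (left, right) delimiter pair and collect the instances per field."""
    return [extract_field(document, left, right) for left, right in wrapper]


# Hypothetical page: one record per line, with an id hidden in a comment
# and field values that include multibyte characters.
page = (
    "<li><!-- id:17 --><b>東京</b> <i>Tokyo</i></li>\n"
    "<li><!-- id:42 --><b>大阪</b> <i>Osaka</i></li>\n"
)

# Hypothetical wrapper: one delimiter pair per field. The first pair
# contains whitespace and reaches into a comment.
wrapper = [("<!-- id:", " -->"), ("<b>", "</b>"), ("<i>", "</i>")]

fields = apply_wrapper(page, wrapper)
for (left, right), values in zip(wrapper, fields):
    print(repr(left), repr(right), values)
```

Because the scan is purely string based, the same code works regardless of the markup or natural language of the page, and a delimiter pair can just as well surround text inside a comment or a tag as visible content.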