Fast algorithms for finding nearest common ancestors
SIAM Journal on Computing
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
A brief survey of web data extraction tools
ACM SIGMOD Record
Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Wrapper induction for information extraction
Wrapper induction for information extraction
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Active learning with strong and weak views: a case study on wrapper induction
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Extracting web data using instance-based learning
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Learning (k,l)-contextual tree languages for information extraction
ECML'05 Proceedings of the 16th European conference on Machine Learning
Schema-Guided Induction of Monadic Queries
ICGI '08 Proceedings of the 9th international colloquium on Grammatical Inference: Algorithms and Applications
Learning queries for relational, semi-structured, and graph databases
Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
Query induction with schema-guided pruning strategies
The Journal of Machine Learning Research
Hi-index | 0.00 |
This paper studies from a machine learning viewpoint the problem of extracting tuples of a target n-ary relation from tree structured data like XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. It is based on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tuples of length i - 1. The algorithm is then set in a friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper.