Information Extraction (IE) is a technique for extracting structured information (records) from unstructured documents such as Web pages. However, existing techniques mainly target simple records, such as binary relations like "(company, location)" or named entities like "(organization)". In this paper, we propose an algorithm for extracting complex records, such as XML data, by building on an existing IE technique. Given a set of seed records in the form of XML data (XML records), we first infer schema information from the XML records. We then transform the XML records into a set of relational records spread over several tables. The resulting relational tables are decomposed into a set of binary relations, which are forwarded to a record extraction system. Finally, we reconstruct XML data from the results returned by the record extraction system. We point out that a naive implementation does not work well, and propose an improved scheme for more efficient XML record extraction. We evaluate the effectiveness of the proposed algorithm experimentally.
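The decompose-and-reconstruct idea described above can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract gives no implementation details, so the seed data, the function names `to_binary_relations` and `reconstruct`, and the assumption that each record is flat and identified by a single key element (`name` here) are all hypothetical. Schema inference and the extraction system itself are omitted; the sketch only shows how XML records might be split into binary relations and joined back on the key.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical seed records; the element names are illustrative only.
SEED = """
<companies>
  <company>
    <name>Acme</name>
    <location>Boston</location>
    <product>Anvil</product>
    <product>Rocket</product>
  </company>
  <company>
    <name>Globex</name>
    <location>Springfield</location>
    <product>Reactor</product>
  </company>
</companies>
"""

def to_binary_relations(xml_text, key_tag):
    """Split each flat record into (key, value) pairs, one relation per child tag."""
    relations = defaultdict(set)
    for record in ET.fromstring(xml_text):
        key = record.findtext(key_tag)  # assumed: every record has this key element
        for child in record:
            if child.tag != key_tag:
                relations[(key_tag, child.tag)].add((key, child.text))
    return relations

def reconstruct(relations, record_tag, key_tag):
    """Join the binary relations on the key and rebuild one XML element per key."""
    grouped = defaultdict(list)
    for (_, value_tag), pairs in relations.items():
        for key, value in sorted(pairs):
            grouped[key].append((value_tag, value))
    records = []
    for key in sorted(grouped):
        record = ET.Element(record_tag)
        ET.SubElement(record, key_tag).text = key
        for value_tag, value in grouped[key]:
            ET.SubElement(record, value_tag).text = value
        records.append(record)
    return records

if __name__ == "__main__":
    rels = to_binary_relations(SEED, "name")
    for rec in reconstruct(rels, "company", "name"):
        print(ET.tostring(rec, encoding="unicode"))
```

In a real pipeline, each binary relation such as (name, location) would be handed to a relation extraction system as seed pairs, and the newly extracted pairs would be passed to the reconstruction step. Joining on a single key is the simplest possible choice and would not handle nested or keyless records.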