This paper describes work towards automatically building on-line structured information resources from sources that consist largely of natural language but follow some structuring conventions. The conversion requires two phases: identifying the regions of each incoming document, and mapping the information those regions contain into a more structured form. We describe a system that uses decision-tree-based machine learning techniques to build a classifier that accurately identifies document regions, and we discuss pattern-discovery methods for extracting information from the identified regions. Experiments demonstrate that this approach works well.
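
As a rough illustration of the first phase (region identification), the sketch below induces a small decision tree that labels each line of a document with a region type. It is not the system described here: the surface features (`all_caps`, `ends_with_colon`, `starts_with_digit`, `short`), the region labels, and the recipe-like training lines are all hypothetical, and the induction is a simplified ID3-style procedure based on information gain, chosen only to keep the example self-contained.

```python
import math
from collections import Counter

def features(line):
    """Hypothetical boolean surface features for one line of text."""
    stripped = line.strip()
    return {
        "all_caps": stripped.isupper(),
        "ends_with_colon": stripped.endswith(":"),
        "starts_with_digit": stripped[:1].isdigit(),
        "short": len(stripped.split()) <= 4,
    }

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels):
    """ID3-style induction; returns a label or (feature, false_subtree, true_subtree)."""
    if len(set(labels)) == 1:
        return labels[0]
    best, best_gain = None, 0.0
    for name in rows[0]:
        split = {True: [], False: []}
        for row, label in zip(rows, labels):
            split[row[name]].append(label)
        if not split[True] or not split[False]:
            continue  # this feature does not separate the examples at all
        remainder = sum(len(s) / len(labels) * entropy(s) for s in split.values())
        gain = entropy(labels) - remainder
        if gain > best_gain:
            best, best_gain = name, gain
    if best is None:
        # No informative feature left: fall back to the majority label.
        return Counter(labels).most_common(1)[0][0]
    branches = {}
    for value in (False, True):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in subset], [l for _, l in subset])
    return (best, branches[False], branches[True])

def classify(tree, row):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        name, if_false, if_true = tree
        tree = if_true if row[name] else if_false
    return tree

# Made-up training lines with made-up region labels.
training = [
    ("INGREDIENTS", "heading"),
    ("2 cups flour", "item"),
    ("1 tsp salt", "item"),
    ("METHOD", "heading"),
    ("Mix the flour and salt together in a large bowl.", "text"),
    ("Bake until golden brown on top and serve warm.", "text"),
]
rows = [features(text) for text, _ in training]
labels = [label for _, label in training]
tree = build_tree(rows, labels)

for line in ["SERVING", "3 eggs", "Leave the dough to rest for about an hour."]:
    print(line, "->", classify(tree, features(line)))
```

Once lines are labelled this way, consecutive lines with the same label can be grouped into regions, which is the input the second phase (pattern discovery over identified regions) would work on.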