Extracting Information from Semistructured Data

Authors:
Liping Ma;John Shepherd;Yanchun Zhang
Venue:
WAIM '02 Proceedings of the Third International Conference on Advances in Web-Age Information Management
Year:
2002

Abstract

This paper describes work towards automatically building on-line structured information resources from sources that consist largely of natural language but follow some structuring conventions. The conversion has two phases: identifying regions within the incoming documents, and mapping the information those regions contain into a more structured form. We describe a system that uses decision-tree-based machine learning techniques to build a classifier that accurately identifies document regions, and we discuss pattern-discovery methods for extracting information from the identified regions. Experiments demonstrate that this approach works well.
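
The two phases named in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the line features, the region labels ("heading", "field", "body"), the toy training lines, and the function names are all assumptions made for illustration. The tree learner is a small ID3-style information-gain splitter rather than C4.5, and the second phase uses one hand-written pattern as a stand-in for pattern discovery.

```python
import math
import re
from collections import Counter

# Hypothetical boolean line features; the paper's real feature set is not stated here.
FEATURE_NAMES = ["has_colon", "has_digit", "is_short", "is_upper"]

def features(line):
    return {
        "has_colon": ":" in line,
        "has_digit": any(ch.isdigit() for ch in line),
        "is_short": len(line.split()) <= 4,
        "is_upper": line.isupper(),
    }

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, names):
    """ID3-style learner: split on the feature with the highest information gain."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not names:
        return majority  # leaf: a region label

    def gain(name):
        parts = {True: [], False: []}
        for row, label in zip(rows, labels):
            parts[row[name]].append(label)
        remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values() if p)
        return entropy(labels) - remainder

    best = max(names, key=gain)
    rest = [n for n in names if n != best]
    branches = {}
    for value in (True, False):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if subset:
            branches[value] = build_tree([r for r, _ in subset], [l for _, l in subset], rest)
        else:
            branches[value] = majority  # no training lines reached this branch
    return (best, branches)

def classify(tree, row):
    # Internal nodes are (feature, branches) tuples; leaves are label strings.
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[row[name]]
    return tree

# Phase 2 stand-in: a fixed "key: value" pattern applied only to "field" regions.
FIELD = re.compile(r"^([^:]+):\s*(.*)$")

def extract(lines, tree):
    record = {}
    for line in lines:
        if classify(tree, features(line)) == "field":
            match = FIELD.match(line)
            if match:
                record[match.group(1).strip()] = match.group(2).strip()
    return record

# Toy labelled lines (invented for this sketch).
TRAIN = [
    ("SEMINAR ANNOUNCEMENT", "heading"),
    ("Speaker: John Smith", "field"),
    ("Time: 3:30 pm", "field"),
    ("Room: 401", "field"),
    ("The talk surveys methods for extracting data from web pages.", "body"),
    ("It also reports experiments on several document collections.", "body"),
    ("ABSTRACT", "heading"),
]
tree = build_tree([features(t) for t, _ in TRAIN], [l for _, l in TRAIN], FEATURE_NAMES)
print(extract(["CONTACT", "Date: 12 May", "Please arrive early for the talk today."], tree))
```

Keeping region classification separate from extraction means the extraction pattern only ever sees lines already labelled as fields, which is the division of labour the abstract describes.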