Information Extraction from the Web: System and Techniques

Authors:
Luo Xiao; Dieter Wissmann; Michael Brown; Stefan Jablonski
Affiliations:
CT SE 5, Siemens AG, Erlangen, Germany. Luo.Xiao@siemens.de; CT SE 5, Siemens AG, Erlangen, Germany. Dieter.Wissmann@siemens.de; Global Transactions, Ltd., Berlin, Germany. Mike@GTCT.com; Department of Computer Sciences VI, University of Erlangen-Nuremberg, Germany. Stefan.Jablonski@informatik.uni-erlangen.de
Venue:
Applied Intelligence
Year:
2004



Abstract

Information Extraction (IE) systems that can exploit the vast source of textual information that is the internet would provide a revolutionary step forward in delivering large volumes of content cheaply and precisely, enabling a wide range of new knowledge-driven applications and services. However, despite this enormous potential, few IE systems have successfully made the transition from laboratory to commercial application. The reason may be a purely practical one: building usable, scalable IE systems requires bringing together a range of different technologies, as well as providing clear and reproducible guidelines on how to collectively configure and deploy those technologies.

This paper is an attempt to address these issues. It focuses on two primary goals. First, we show that an information extraction system for real-world applications and different domains can be built from autonomous, cooperating components (agents). Such a system has several advanced properties: clear separation of the different extraction tasks and steps, portability to multiple application domains, trainability, extensibility, etc. Second, we show that machine learning and, in particular, learning in different ways and at different levels, can be used to build practical IE systems. We show that carefully selecting the right machine learning technique for the right task, together with selective sampling, can reduce the human effort required to annotate examples for building such systems.
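
The abstract does not say which selective sampling strategy the system uses. As an illustration only, the sketch below shows uncertainty sampling, one common form of selective sampling: from a pool of unlabelled sentences, the annotator is asked to label only those on which the current classifier is least sure. All names (`uncertainty`, `select_for_annotation`, `toy_predict_proba`) and the keyword-based scorer are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def uncertainty(prob_positive: float) -> float:
    """Map a predicted probability to an uncertainty score in [0, 1].

    1.0 means the classifier is maximally unsure (p = 0.5),
    0.0 means it is fully confident (p = 0.0 or p = 1.0).
    """
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_for_annotation(
    pool: Sequence[str],
    predict_proba: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Return the `budget` most uncertain items from the unlabelled pool."""
    scored = [(uncertainty(predict_proba(item)), item) for item in pool]
    # Most uncertain first; the sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:budget]]


def toy_predict_proba(sentence: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Estimates the probability that a sentence mentions a price by
    counting cue strings. A real system would use a learned model.
    """
    cues = ("price", "eur", "$")
    hits = sum(cue in sentence.lower() for cue in cues)
    return min(1.0, 0.1 + 0.4 * hits)


pool = [
    "Contact us for details.",
    "The price is 20 EUR.",
    "Shipping in EUR zone only.",
    "Opening hours: 9 to 5.",
]

# Only the sentence the toy classifier is least sure about goes to the annotator.
to_label = select_for_annotation(pool, toy_predict_proba, budget=1)
print(to_label)
```

In this toy pool the sentence with exactly one cue receives a probability of 0.5 and is therefore selected first, while sentences the scorer already classifies confidently are skipped. That is the mechanism by which selective sampling is meant to reduce annotation effort.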