InfoXtract: A customizable intermediate level information extraction engine

Authors:
Rohini K. Srihari; Wei Li; Thomas Cornell; Cheng Niu
Affiliations:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, and State University of New York at Buffalo, e-mail: rohini@janyainc.com; Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, e-mail: wei@janyainc.com; Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, e-mail: cornell@janyainc.com; Microsoft Research China, 5/F, Beijing Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing 100080, P.R.C., e-mail: cniu@microsoft.com
Venue:
Natural Language Engineering
Year:
2008

Abstract

Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Because information discovery involves examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, these applications require IE systems with breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as the synthesis of entity profiles and the extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about a person, organization, location, or other entity, within a document and across documents, into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, which combines grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and for applications that use the engine are presented.
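
To make the two tasks named in the abstract more concrete, the following is a minimal Python sketch of what an entity profile template and a time-expression normalizer might look like. It is not InfoXtract's implementation: the names `EntityProfile`, `normalize_time`, `merge_mention`, and `alias_table`, as well as the sample person and events, are hypothetical and chosen only for illustration. The sketch assumes alias resolution has already been done elsewhere and is available as a simple lookup table.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class EntityProfile:
    """Hypothetical template consolidating what is known about one entity."""
    name: str
    entity_type: str
    aliases: set = field(default_factory=set)
    events: list = field(default_factory=list)


def normalize_time(expression: str, document_date: date) -> date:
    """Resolve a relative time expression against the document's date."""
    offsets = {"today": 0, "yesterday": -1, "tomorrow": 1}
    return document_date + timedelta(days=offsets[expression.lower()])


def merge_mention(profiles: dict, alias_table: dict, mention: str,
                  entity_type: str, event: str) -> EntityProfile:
    """Attach a mention to its canonical profile, creating one if needed."""
    canonical = alias_table.get(mention, mention)
    profile = profiles.setdefault(
        canonical, EntityProfile(canonical, entity_type))
    if mention != canonical:
        profile.aliases.add(mention)
    profile.events.append(event)
    return profile


# Made-up example data: two mentions of the same person in different documents.
profiles = {}
alias_table = {"Smith": "Jane Smith"}  # assumed output of alias resolution
merge_mention(profiles, alias_table, "Jane Smith", "person", "joined Acme Corp")
merge_mention(profiles, alias_table, "Smith", "person", "visited Buffalo")

# "yesterday" is only meaningful relative to the date of the document.
resolved = normalize_time("yesterday", date(2008, 1, 15))
```

In this sketch both mentions end up in a single profile keyed by the canonical name, and the relative expression is anchored to the document date. An ambiguous location such as Buffalo would need a comparable normalization step, which is not shown here.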