The Talent system: TEXTRACT architecture and data model

Authors:
Mary S. Neff; Roy J. Byrd; Branimir K. Boguraev
Affiliations:
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA; e-mail: maryneff@us.ibm.com, roybyrd@us.ibm.com, bran@us.ibm.com
Venue:
Natural Language Engineering
Year:
2004

Abstract

We present the architecture and data model for TEXTRACT, a robust, scalable and configurable document analysis framework. TEXTRACT is engineered as a pipeline architecture, which allows rapid prototyping and application development by freely mixing existing, reusable language analysis plugins with new, custom plugins whose functionality can be tailored. We discuss design issues that arise from requirements of industrial-strength efficiency and scalability, and that are further constrained by plugin interactions, both among the plugins themselves and with a common data model comprising an annotation store, a document vocabulary and a lexical cache. We illustrate some of these issues by focusing on a meta-plugin: an interpreter for annotation-based finite-state transduction, through which many linguistic filters can be implemented as stand-alone plugins. The framework and its component plugins have been extensively deployed in both research and industrial environments, for a broad range of text analysis and mining tasks.
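
The pipeline-and-shared-data-model idea in the abstract can be illustrated with a minimal sketch. This is not TEXTRACT's API: every name below (Annotation, Document, tokenizer, vocabulary_builder, capitalized_sequence_filter, run_pipeline) is hypothetical, the lexical cache is omitted, and the "filter" is a hand-written toy pattern over token annotations standing in for the annotation-based finite-state interpreter the abstract mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical structure)."""
    type: str
    start: int
    end: int


@dataclass
class Document:
    """Shared data model: text, an annotation store, and a document vocabulary."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)
    vocabulary: Dict[str, int] = field(default_factory=dict)


# Assumption: a plugin is any callable that reads and updates the shared Document.
Plugin = Callable[[Document], None]


def tokenizer(doc: Document) -> None:
    """Add one Token annotation per whitespace-delimited word."""
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        end = start + len(word)
        doc.annotations.append(Annotation("Token", start, end))
        pos = end


def vocabulary_builder(doc: Document) -> None:
    """Count lower-cased token strings in the document vocabulary."""
    for ann in doc.annotations:
        if ann.type == "Token":
            key = doc.text[ann.start:ann.end].lower()
            doc.vocabulary[key] = doc.vocabulary.get(key, 0) + 1


def capitalized_sequence_filter(doc: Document) -> None:
    """Toy pattern over annotations: two or more adjacent capitalized
    tokens are covered by a new Name annotation."""
    tokens = [a for a in doc.annotations if a.type == "Token"]
    run: List[Annotation] = []
    for tok in tokens + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and doc.text[tok.start].isupper():
            run.append(tok)
            continue
        if len(run) >= 2:
            doc.annotations.append(Annotation("Name", run[0].start, run[-1].end))
        run = []


def run_pipeline(text: str, plugins: List[Plugin]) -> Document:
    """Apply each plugin in order to one shared Document."""
    doc = Document(text)
    for plugin in plugins:
        plugin(doc)
    return doc


doc = run_pipeline(
    "Mary Neff works at IBM Research in Hawthorne",
    [tokenizer, vocabulary_builder, capitalized_sequence_filter],
)
names = [doc.text[a.start:a.end] for a in doc.annotations if a.type == "Name"]
print(names)
```

Because later plugins only read what earlier ones wrote to the shared store, reordering or swapping plugins changes the analysis without changing any plugin's code, which is the kind of plugin interaction the abstract identifies as a design constraint.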