We present the architecture and data model for TEXTRACT, a robust, scalable, and configurable document analysis framework. TEXTRACT is engineered as a pipeline architecture that supports rapid prototyping and application development by freely mixing reusable, existing language analysis plugins with new, custom plugins of customizable functionality. We discuss design issues that arise from requirements of industrial-strength efficiency and scalability, and that are further constrained by plugin interactions, both among the plugins themselves and with a common data model comprising an annotation store, a document vocabulary, and a lexical cache. We exemplify some of these issues by focusing on a meta-plugin: an interpreter for annotation-based finite state transduction, through which many linguistic filters can be implemented as stand-alone plugins. The framework and its component plugins have been extensively deployed in both research and industrial environments for a broad range of text analysis and mining tasks.
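To make the pipeline idea concrete, the following is a minimal sketch, not TEXTRACT's actual API. It assumes a shared document object holding the three parts of the common data model named above, with plugins modeled as plain functions that read and extend it. All names (`Document`, `Annotation`, `tokenizer`, `shape_tagger`, `name_transducer`, `run_pipeline`) are hypothetical, and the last plugin is only a toy two-state pass over token annotations standing in for annotation-based finite state transduction.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    kind: str   # annotation type, e.g. "token" or "name"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


@dataclass
class Document:
    """Hypothetical shared data model that every plugin reads and extends."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)  # annotation store
    vocabulary: Dict[str, int] = field(default_factory=dict)     # term -> frequency
    lexical_cache: Dict[str, str] = field(default_factory=dict)  # word form -> cached tag


Plugin = Callable[[Document], None]


def tokenizer(doc: Document) -> None:
    """Add one 'token' annotation per word and count terms in the vocabulary."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("token", match.start(), match.end()))
        term = match.group().lower()
        doc.vocabulary[term] = doc.vocabulary.get(term, 0) + 1


def shape_tagger(doc: Document) -> None:
    """Cache a coarse shape tag per word form so later plugins need not recompute it."""
    for ann in [a for a in doc.annotations if a.kind == "token"]:
        word = doc.text[ann.start:ann.end]
        if word not in doc.lexical_cache:
            doc.lexical_cache[word] = "cap" if word[0].isupper() else "low"


def name_transducer(doc: Document) -> None:
    """Toy two-state pass: emit a 'name' annotation over each run of two or
    more capitalized tokens (an assumption made purely for illustration)."""
    tokens = [a for a in doc.annotations if a.kind == "token"]
    run: List[Annotation] = []
    for tok in tokens + [None]:  # the trailing None flushes the final run
        if tok is not None and doc.lexical_cache.get(doc.text[tok.start:tok.end]) == "cap":
            run.append(tok)
            continue
        if len(run) >= 2:
            doc.annotations.append(Annotation("name", run[0].start, run[-1].end))
        run = []


def run_pipeline(text: str, plugins: List[Plugin]) -> Document:
    """Apply the plugins in order to one shared document."""
    doc = Document(text)
    for plugin in plugins:
        plugin(doc)
    return doc


doc = run_pipeline(
    "the parser saw Alan Turing yesterday",
    [tokenizer, shape_tagger, name_transducer],
)
names = [doc.text[a.start:a.end] for a in doc.annotations if a.kind == "name"]
```

In this sketch the plugins interact only through the shared document, so `name_transducer` depends on `tokenizer` and `shape_tagger` having run first. That ordering dependency is a small instance of the plugin interactions the abstract mentions.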