Mixed-initiative development of language processing systems

Authors:
David Day;John Aberdeen;Lynette Hirschman;Robyn Kozierok;Patricia Robinson;Marc Vilain
Affiliations:
The MITRE Corporation, Bedford, Massachusetts;The MITRE Corporation, Bedford, Massachusetts;The MITRE Corporation, Bedford, Massachusetts;The MITRE Corporation, Bedford, Massachusetts;The MITRE Corporation, Bedford, Massachusetts;The MITRE Corporation, Bedford, Massachusetts
Venue:
ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Year:
1997

Abstract

Historically, tailoring language processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to "bootstrapping" the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate "named entities" demonstrate that this approach can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic-NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.
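
The mixed-initiative bootstrapping idea in the abstract can be illustrated with a minimal sketch. This is not the Alembic Workbench implementation: the abstract does not describe how its rules are learned, so the sketch substitutes a simple token-to-tag lexicon for the learned annotation rules. All function names, the tag set, and the toy documents are hypothetical. The machine pre-tags each document, the human corrects the proposal, and the corrections feed the next round of pre-tagging.

```python
from collections import Counter

def build_lexicon(counts):
    """Map each token to its most frequent tag, skipping tokens mostly tagged 'O'."""
    lexicon = {}
    for token, tag_counts in counts.items():
        tag, _ = tag_counts.most_common(1)[0]
        if tag != "O":
            lexicon[token] = tag
    return lexicon

def pretag(tokens, lexicon):
    """Machine initiative: propose a tag per token; 'O' means no entity."""
    return [lexicon.get(token, "O") for token in tokens]

def learn(counts, tokens, corrected_tags):
    """Human initiative: fold one corrected document back into the tag counts."""
    for token, tag in zip(tokens, corrected_tags):
        counts.setdefault(token, Counter())[tag] += 1

def bootstrap(documents):
    """Alternate pre-tagging and correction over (tokens, corrected_tags) pairs.

    Returns the number of manual corrections needed per document and the
    final lexicon (a stand-in for the learned annotation rules).
    """
    counts = {}
    corrections = []
    for tokens, corrected_tags in documents:
        proposed = pretag(tokens, build_lexicon(counts))
        corrections.append(sum(p != c for p, c in zip(proposed, corrected_tags)))
        learn(counts, tokens, corrected_tags)
    return corrections, build_lexicon(counts)

# Toy data (hypothetical): two short documents with human-corrected tags.
documents = [
    (["MITRE", "is", "in", "Bedford"], ["ORG", "O", "O", "LOC"]),
    (["Bedford", "hosts", "MITRE"], ["LOC", "O", "ORG"]),
]
corrections, lexicon = bootstrap(documents)
print(corrections)
print(lexicon)
```

In this toy run, the second document needs fewer manual corrections than the first because the lexicon built from the first document already covers its entities. That is the effect the abstract attributes to bootstrapping, shown here only on contrived data.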