Finding new terminology in very large corpora

Authors:
Joachim Wermter; Udo Hahn
Affiliations:
Jena University Language and Information Engineering (JULIE) Lab, Jena, Germany; Jena University Language and Information Engineering (JULIE) Lab, Jena, Germany
Venue:
Proceedings of the 3rd international conference on Knowledge capture
Year:
2005

Abstract

Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases based on the observation that terms exhibit a much lower degree of distributional variation than non-specific noun phrases. We formalize the limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram, and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-wide curated biomedical terminology as an evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. We also provide empirical evidence that the superiority of our approach, beyond a 10-million-word threshold, is essentially domain- and corpus-size-independent.
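
The abstract does not spell out the scoring formula, so the following is only an illustrative sketch of how limited paradigmatic modifiability might be turned into a score; it is not necessarily the authors' exact measure. The assumption here is that an n-gram is more term-like when its tokens are hard to substitute: for every way of replacing some token slots with a wildcard, the n-gram's frequency is divided by the frequency of all n-grams matching that pattern, and the ratios are multiplied. The names `slot_patterns`, `modifiability_scores`, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"  # placeholder for a token slot that is allowed to vary


def slot_patterns(ngram):
    """Yield every pattern obtained by replacing 1..n token slots with a wildcard."""
    n = len(ngram)
    for k in range(1, n + 1):
        for slots in combinations(range(n), k):
            yield tuple(WILDCARD if i in slots else tok
                        for i, tok in enumerate(ngram))


def modifiability_scores(ngrams):
    """Score each distinct n-gram by how rarely its slots are filled differently.

    Assumed scoring (illustrative only): the product, over all wildcard
    patterns of the n-gram, of freq(n-gram) / freq(pattern). Scores lie in
    (0, 1]; higher means less distributional variation.
    """
    ngram_freq = Counter(ngrams)

    # Frequency of each pattern = summed frequency of all n-grams matching it.
    pattern_freq = Counter()
    for ngram, freq in ngram_freq.items():
        for pattern in slot_patterns(ngram):
            pattern_freq[pattern] += freq

    scores = {}
    for ngram, freq in ngram_freq.items():
        score = 1.0
        for pattern in slot_patterns(ngram):
            score *= freq / pattern_freq[pattern]
        scores[ngram] = score
    return scores


# Toy bigram noun phrases (made-up data, not from the paper's corpus).
toy = [("stem", "cell")] * 3 + [
    ("large", "cell"),
    ("large", "house"),
    ("large", "study"),
]
scores = modifiability_scores(toy)
for ngram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(ngram), round(score, 4))
```

On this toy input, "stem cell" ranks above "large cell" because "large" combines freely with many other nouns, whereas "stem" occurs only with "cell". A real system would first extract noun-phrase candidates from a tagged corpus and would likely apply a frequency cutoff before scoring.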