Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases on the basis of the observation that terms exhibit much less distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-curated biomedical terminology as the evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. We also provide empirical evidence that the superiority of our approach, beyond a 10-million-word threshold, is essentially independent of domain and corpus size.
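The idea of limited paradigmatic modifiability can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the formula from the paper: it assumes that an n-gram's score is the product, over every way of replacing some of its slots with a wildcard, of the share of that wildcard pattern's occurrences accounted for by the n-gram itself. The names `slot_patterns` and `modifiability_scores` are hypothetical, and the toy bigrams are invented for the example.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"


def slot_patterns(ngram, k):
    """Yield copies of ngram in which exactly k slots are replaced by a wildcard."""
    for slots in combinations(range(len(ngram)), k):
        yield tuple(WILDCARD if i in slots else tok for i, tok in enumerate(ngram))


def modifiability_scores(ngrams):
    """Score each distinct n-gram by how rarely its slots are filled differently.

    Assumed simplification: for every wildcard pattern derived from an n-gram,
    take freq(n-gram) / freq(pattern) and multiply these ratios together.
    A score close to 1 means the slots are almost never filled by other tokens.
    """
    freq = Counter(ngrams)

    # Count how often each wildcard pattern is matched across the whole sample.
    pattern_freq = Counter()
    for ngram, count in freq.items():
        for k in range(1, len(ngram) + 1):
            for pattern in slot_patterns(ngram, k):
                pattern_freq[pattern] += count

    scores = {}
    for ngram, count in freq.items():
        score = 1.0
        for k in range(1, len(ngram) + 1):
            for pattern in slot_patterns(ngram, k):
                score *= count / pattern_freq[pattern]
        scores[ngram] = score
    return scores


# Toy data (invented): three bigrams with identical frequency.
sample = (
    [("t", "cell")] * 2
    + [("new", "method")] * 2
    + [("new", "result")] * 2
)
scores = modifiability_scores(sample)
for ngram, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(ngram), round(score, 3))
```

In this toy sample all three bigrams occur equally often, so a pure frequency ranking cannot separate them. The sketch ranks "t cell" above "new method" and "new result" because the slot after "new" is filled by more than one token, which is the kind of distributional variation the abstract associates with non-specific noun phrases.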