The influence of collocation segmentation and top 10 items to keyword assignment performance

Authors:
Vidas Daudaravicius
Affiliations:
Vytautas Magnus University, Kaunas, Lithuania
Venue:
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2010

Abstract

Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing them against other language-independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned to a document. Additionally, filtering out the 10 most frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent on average over the 21 languages tested.
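
The two operations the abstract measures, removing the most frequent items and scoring a fixed number of assigned keywords against manually assigned descriptors, can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the toy documents, and the descriptor labels are hypothetical, and it assumes that "items" are string tokens (single words, bigrams, or collocation segments), that frequency is counted over the whole corpus, and that scores are computed per document.

```python
from collections import Counter


def drop_top_frequent(docs, n=10):
    """Remove the n most frequent items (by total corpus count) from every document.

    Assumption: frequency is counted over the whole corpus; the abstract
    does not say how the top 10 items are selected.
    """
    counts = Counter(item for doc in docs for item in doc)
    top = {item for item, _ in counts.most_common(n)}
    return [[item for item in doc if item not in top] for doc in docs]


def precision_recall_f1(assigned, gold):
    """Score one document's assigned keywords against its gold descriptors."""
    assigned, gold = set(assigned), set(gold)
    hits = len(assigned & gold)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(gold) if gold else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy corpus (hypothetical): each document is a list of items.
docs = [
    ["the", "of", "common", "market"],
    ["the", "of", "fishing", "quota"],
    ["the", "customs", "tariff"],
]
filtered = drop_top_frequent(docs, n=2)  # n=2 only because the toy corpus is tiny

# Five assigned keywords scored against four gold descriptors (hypothetical labels).
assigned = ["fisheries", "quota", "tariff", "trade", "customs"]
gold = ["fisheries", "quota", "catch", "fleet"]
p, r, f = precision_recall_f1(assigned, gold)
print(filtered)
print(round(p, 3), round(r, 3), round(f, 3))
```

With exactly 5 assigned keywords per document, precision is simply the number of correct keywords divided by 5. Corpus-level figures would then be averaged over documents, though the abstract does not specify which averaging scheme was used.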