Identifying appropriate text tokens (words or sequences of words that represent concepts) is one of the most important tasks in text preprocessing and can strongly influence the final results of text analysis. In this paper, we introduce a new approach to discovering compound nouns, including proper compound nouns. Our approach combines data mining methods with shallow lexical analysis. We propose a simple pattern language for specifying the grammatical patterns that extracted compound nouns must satisfy. Our method requires the words to be annotated with part-of-speech tags, so it is language-dependent to that extent. Based on the GSP data mining algorithm, we propose T-GSP, a modification of it for extracting frequent text patterns and, in particular, frequent word sequences that satisfy given grammatical rules. The resulting sequences are regarded as candidates for compound nouns. In our experiments, the method produced candidates of very high quality.
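The general idea of keeping only frequent word sequences whose part-of-speech tags satisfy a grammatical pattern can be illustrated with a minimal sketch. This is not T-GSP or GSP: it simply counts contiguous n-grams, whereas GSP generates candidates level by level. The function name `frequent_tagged_sequences`, the tag names (`DET`, `ADJ`, `NOUN`, `VERB`), the regular expression standing in for the pattern language, and the toy corpus are all assumptions made for illustration, not taken from the paper.

```python
import re
from collections import Counter


def frequent_tagged_sequences(sentences, tag_pattern, min_support=2, max_len=4):
    """Return contiguous word sequences whose tags match tag_pattern.

    sentences   -- list of sentences, each a list of (word, tag) pairs
    tag_pattern -- regex over the space-joined tag sequence (hypothetical
                   stand-in for the paper's pattern language)
    min_support -- minimum number of sentences that must contain a sequence
    max_len     -- longest sequence length (in words) to consider
    """
    pattern = re.compile(tag_pattern)
    support = Counter()
    for sentence in sentences:
        seen = set()  # count each sequence at most once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                window = sentence[i:i + n]
                tags = " ".join(tag for _, tag in window)
                if pattern.fullmatch(tags):
                    seen.add(tuple(word.lower() for word, _ in window))
        support.update(seen)
    return {seq: count for seq, count in support.items() if count >= min_support}


# Toy, hand-tagged corpus (illustrative only).
corpus = [
    [("the", "DET"), ("data", "NOUN"), ("mining", "NOUN"),
     ("algorithm", "NOUN"), ("works", "VERB")],
    [("a", "DET"), ("data", "NOUN"), ("mining", "NOUN"), ("method", "NOUN")],
    [("fast", "ADJ"), ("data", "NOUN"), ("mining", "NOUN")],
]

# Assumed pattern: an optional adjective followed by two or more nouns.
NOUN_COMPOUND = r"(ADJ )?NOUN( NOUN)+"

candidates = frequent_tagged_sequences(corpus, NOUN_COMPOUND, min_support=2)
print(candidates)
```

Support is counted per sentence rather than per occurrence, which mirrors how sequential pattern mining usually defines support. Lowering `min_support` to 1 would also admit sequences seen only once, such as ("fast", "data", "mining").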