A huge number of electronic documents are now stored on the internet. To retrieve information from this data, each document is commonly represented as a set of keywords, and all documents are then analysed on the basis of these discriminative words. In information retrieval, recognizing the words in an article is an essential step; however, unlike English words, Chinese words are not separated by spaces. Many approaches have therefore been devised to parse Chinese words. Most current text segmentation systems use a dictionary-based approach, but general-purpose dictionaries cannot always provide the references needed to parse domain-specific words accurately, especially unknown words. This paper proposes a new method for classifying longer keywords from Chinese documents by incorporating previously unknown keywords into a keyword list, without the effort of building domain-specific dictionaries. Our method first takes the parsed words from existing parsers and filters the keywords by their term frequency-inverse document frequency (TF-IDF) values. Then, based on the parsed words and keywords, a T tree is used to store the candidates for composing unknown words. The candidates are evaluated against an unknown word (UW) coefficient threshold: a newly composed word is deemed a newly discovered unknown word if its UW coefficient is higher than a pre-defined threshold. Finally, the parsed words and the newly composed words are re-filtered to form long keywords. In several experiments comparing our results with those of Google and Yahoo, the proposed method significantly outperforms the other methods in recall, precision and F-measure.
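The TF-IDF filtering step can be illustrated with a minimal sketch. It assumes the documents have already been segmented into words by an existing parser, and it uses a common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency), which is not necessarily the exact weighting used in this paper. The function names `tfidf_scores` and `filter_keywords`, the sample documents and the threshold value are hypothetical. The T tree and the UW coefficient are not sketched, because the abstract does not define them.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per document.

    docs is a list of documents, each a list of already-segmented words.
    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return scores


def filter_keywords(docs, threshold):
    """Keep, per document, the words whose TF-IDF exceeds the threshold."""
    return [
        sorted(w for w, s in doc_scores.items() if s > threshold)
        for doc_scores in tfidf_scores(docs)
    ]


if __name__ == "__main__":
    # Hypothetical parser output for three short documents.
    docs = [
        ["資訊", "檢索", "系統"],
        ["資訊", "系統", "設計"],
        ["資訊", "斷詞", "斷詞"],
    ]
    for keywords in filter_keywords(docs, threshold=0.2):
        print(keywords)
```

A word that occurs in every document, such as 資訊 in the sample data, receives an inverse document frequency of zero and is therefore filtered out, which is the discriminative behaviour the keyword filter relies on.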