A new method to compose long unknown Chinese keywords

Authors:
Yu-Chin Liu; Chun-Wei Lin
Venue:
Journal of Information Science
Year:
2012

Abstract

A huge number of electronic documents are now stored on the internet. To retrieve information from this data, each document is commonly represented as a set of keywords, and all documents are then analysed on the basis of these discriminative words. Recognizing the words in an article is therefore an essential step in information retrieval; however, unlike English words, Chinese words are not separated by spaces, and many approaches have been devised to parse them. Most current text segmentation systems use a dictionary-based approach, but general-purpose dictionaries cannot always provide the references needed to parse domain-specific words accurately, especially unknown words. This paper proposes a new method for identifying longer keywords in Chinese documents by incorporating previously unknown keywords into a keyword list without the effort of building domain-specific dictionaries. The method first takes the words produced by existing parsers and filters the keywords by their term frequency-inverse document frequency (TF-IDF) values. Based on the parsed words and keywords, a T tree is then used to store the candidates for composing unknown words. The candidates are evaluated against an unknown word (UW) coefficient threshold: a newly composed word is deemed a newly discovered unknown word if its UW coefficient is higher than a pre-defined threshold. Finally, the parsed words and newly composed words are re-filtered to form long keywords. Experiments comparing the results with those of Google and Yahoo show that the proposed method significantly outperforms the other methods in recall, precision and F-measure.
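The pipeline described in the abstract (filter parsed words by TF-IDF, then compose adjacent words into longer candidates and keep those above a threshold) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the sample documents, the thresholds, and in particular the scoring ratio, which is a simple co-occurrence stand-in and not the paper's UW coefficient, whose definition the abstract does not give. A plain dictionary also replaces the T tree used to store candidates.

```python
import math
from collections import Counter


def tfidf_keywords(docs, min_score):
    """Return, for each document, the words whose TF-IDF value is >= min_score."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        keywords.append({w for w, s in scores.items() if s >= min_score})
    return keywords


def compose_candidates(docs, keywords, min_ratio):
    """Join adjacent parsed words into longer candidate keywords.

    Stand-in score (NOT the paper's UW coefficient):
    count(a followed by b) / max(count(a), count(b)).
    """
    unigram = Counter(w for doc in docs for w in doc)
    bigram = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    all_keywords = set().union(*keywords)
    composed = {}
    for (a, b), n in bigram.items():
        # Only pairs that involve at least one filtered keyword are considered.
        if a in all_keywords or b in all_keywords:
            ratio = n / max(unigram[a], unigram[b])
            if ratio >= min_ratio:
                composed[a + b] = ratio
    return composed


# Hypothetical output of an existing parser: one list of words per document.
docs = [
    ["雲端", "運算", "平台", "服務"],
    ["雲端", "運算", "技術", "發展"],
    ["市場", "服務", "發展", "報告"],
]
kws = tfidf_keywords(docs, min_score=0.05)
composed = compose_candidates(docs, kws, min_ratio=0.8)
print(composed)
```

In this toy corpus the two parsed words 雲端 and 運算 always appear together, so they are the only pair that passes the threshold and are joined into one longer keyword. A real implementation would substitute the paper's UW coefficient and T tree for the stand-ins used here.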