Khoo, Dai, and Loh examine new statistical methods for identifying two- and three-character words in Chinese text. Some meaningful Chinese words are simple (units of one or more characters that carry independent meaning in a sentence), while others are compounds of two or more simple words. For manual segmentation the authors follow the Modern Chinese Word Segmentation for Application of Information Processing, with some modifications to focus on meaningful words. About 37% of meaningful words are longer than two characters, indicating a need to handle three- and four-character words. Four hundred sentences from news articles were manually broken into overlapping bi-grams and tri-grams. Logistic regression was then used to calculate the log of the odds that each bi-gram or tri-gram was a meaningful word. Variables such as relative frequency, document frequency, local frequency, and contextual and positional information were incorporated in the model only if adding them improved the concordance measure by at least 2%. For two- and three-character words, the relative frequency of adjacent characters and the document frequency of overlapping bi-grams were found to be significant. Results are reported using recall and precision, where the number of correctly segmented words is normalized by the manual segmentation and by the automatic segmentation, respectively. On these measures the contextual information formula for two-character words provides significantly better results than previous formulations, and using the two- and three-character formulations in combination significantly improves the two-character results.
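The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature names, and the weights in the usage note are hypothetical placeholders, not fitted coefficients from the study. The sketch shows three pieces: extracting overlapping character n-grams, computing the log odds from a logistic regression's linear predictor, and scoring a segmentation by recall and precision, with words represented as (start, length) spans.

```python
import math


def overlapping_ngrams(sentence, n):
    """Return every overlapping character n-gram in the sentence, in order."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def log_odds(features, weights, intercept):
    """Linear predictor of a logistic regression, i.e. log(p / (1 - p)).

    `features` and `weights` are dicts keyed by (hypothetical) variable name.
    """
    return intercept + sum(weights[name] * value for name, value in features.items())


def probability(log_odds_value):
    """Convert log odds back to the probability that the n-gram is a word."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))


def recall_precision(auto_spans, manual_spans):
    """Score a segmentation; each word is a (start, length) span.

    Recall normalizes the correct words by the manual segmentation,
    precision normalizes them by the automatic segmentation.
    """
    auto_spans, manual_spans = set(auto_spans), set(manual_spans)
    correct = len(auto_spans & manual_spans)
    recall = correct / len(manual_spans) if manual_spans else 0.0
    precision = correct / len(auto_spans) if auto_spans else 0.0
    return recall, precision
```

For example, `overlapping_ngrams("信息检索系统", 2)` yields five bi-grams starting with `"信息"` and `"息检"`. With an assumed intercept and assumed weights for variables such as relative frequency and document frequency, `probability(log_odds(...))` gives the estimated chance that a candidate bi-gram or tri-gram is a meaningful word; a candidate would be accepted when that probability exceeds a chosen threshold.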