Using statistical and contextual information to identify two- and three-character words in Chinese text

Authors:
Christopher S.G. Khoo;Teck Ee Loh
Affiliations:
Nanyang Technological Univ., Singapore, Republic of Singapore;Data Storage Institute, Singapore
Venue:
Journal of the American Society for Information Science and Technology
Year:
2002

Abstract

Khoo, Dai, and Loh examine new statistical methods for identifying two- and three-character words in Chinese text. Some meaningful Chinese words are simple (independent units of one or more characters that carry independent meaning in a sentence), while others are compounds of two or more simple words. For manual segmentation, they use the Modern Chinese Word Segmentation for Application of Information Processing, with some modifications to focus on meaningful words. About 37% of meaningful words are longer than two characters, indicating a need to handle three- and four-character words. Four hundred sentences from news articles were manually broken into overlapping bi-grams and tri-grams. Using logistic regression, the log of the odds that each bi-gram or tri-gram was a meaningful word was calculated. Variables such as relative frequency, document frequency, local frequency, and contextual and positional information were incorporated in the model only if the concordance measure improved by at least 2% with their addition. For two- and three-character words, relative frequency of adjacent characters and document frequency of overlapping bi-grams were found to be significant. Using measures of recall and precision, in which correct automatic segmentation is normalized by manual segmentation and by automatic segmentation respectively, the contextual information formula for two-character words provides significantly better results than previous formulations, and using the two- and three-character formulations in combination significantly improves the two-character results.
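The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature names, and any weights passed in are hypothetical, and the sketch only shows the general shape of overlapping n-gram extraction, a logistic (log-odds) score, and recall and precision computed over word boundaries.

```python
import math


def overlapping_ngrams(sentence, n):
    """Return every run of n adjacent characters, stepping one character at a time."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def word_probability(features, weights, intercept):
    """Logistic model: the log-odds is a linear combination of the features.

    The feature names and weights are placeholders (assumptions); the
    abstract does not report the fitted coefficients.
    """
    log_odds = intercept + sum(weights[name] * value
                               for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


def word_spans(words):
    """Convert a segmentation (list of words) into a set of (start, end) offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def recall_precision(auto_words, manual_words):
    """Recall divides correct words by the manual count, precision by the automatic count."""
    auto, manual = word_spans(auto_words), word_spans(manual_words)
    correct = len(auto & manual)
    return correct / len(manual), correct / len(auto)


if __name__ == "__main__":
    print(overlapping_ngrams("中文信息处理", 2))
    print(overlapping_ngrams("中文信息处理", 3))
    print(recall_precision(["中文", "信", "息"], ["中文", "信息"]))
```

Comparing character offsets rather than word strings means a word counts as correct only when both of its boundaries match the manual segmentation, which is one common way to score segmenters and is assumed here rather than taken from the paper.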