On the use of words and n-grams for Chinese information retrieval

Authors:
Jian-Yun Nie;Jiangfeng Gao;Jian Zhang;Ming Zhou
Affiliations:
Département d'informatique et de recherche opérationnelle, Université de Montréal;Microsoft Research;Microsoft Research;Microsoft Research
Venue:
IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages
Year:
2000



Abstract

In processing Chinese documents and queries for information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performance. In this study, we carry out further experiments on different ways to segment documents and queries, and to combine words with n-grams. Our experiments show that a combination of the longest-matching algorithm with single characters is the best choice.
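
The indexing units named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy dictionary, the `max_len` cutoff, and the choice to add single characters only for multi-character words are all assumptions made for illustration. It shows a greedy forward longest-matching segmenter, a character n-gram extractor, and one possible way to combine words with single characters.

```python
def longest_match_segment(text, dictionary, max_len=4):
    """Greedy forward longest-matching segmentation (illustrative sketch).

    At each position, take the longest dictionary entry that starts there;
    fall back to a single character when nothing in the dictionary matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink toward one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


def char_ngrams(text, n=2):
    """Overlapping character n-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def words_plus_characters(text, dictionary, max_len=4):
    """Assumed combination: segmented words plus the single characters
    of every multi-character word."""
    words = longest_match_segment(text, dictionary, max_len)
    chars = [ch for w in words if len(w) > 1 for ch in w]
    return words + chars


# Hypothetical toy dictionary, for demonstration only.
toy_dictionary = {"信息", "检索", "信息检索", "系统"}
print(longest_match_segment("信息检索系统", toy_dictionary))
print(char_ngrams("信息检索系统"))
print(words_plus_characters("信息检索系统", toy_dictionary))
```

With this toy dictionary, the segmenter prefers the four-character entry over its two-character parts, while the added single characters still allow partial matches between a query and a document that were segmented differently.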