Information retrieval oriented word segmentation based on character associative strength ranking
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
In Chinese information retrieval, documents are usually segmented into words and then indexed by those words. However, the segmentation granularity problem (SDP) must be considered: small granularity may lead to low precision and efficiency, while large granularity may cause low recall. To address the problem, this paper proposes an intuitive heuristic approach. A two-level index is built for the segmentation dictionary, through which an original query word can be expanded with its weighted overlaid words. This method not only preserves the precision advantage of large granularity but also overcomes its disadvantage in recall. The experimental results show that our approach slightly but consistently outperforms the baseline.
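
The expansion idea can be illustrated with a minimal Python sketch. The abstract does not specify the index layout or the weighting scheme, so everything here is an assumption made for illustration: the first level maps a character to the dictionary words starting with it, the second level maps each word to the shorter dictionary words it contains (taken here to be its "overlaid words"), and each overlaid word is weighted by its length relative to the query word. The function names `build_two_level_index` and `expand_query` are hypothetical.

```python
from collections import defaultdict


def build_two_level_index(dictionary):
    """Build an assumed two-level index over a segmentation dictionary.

    Level 1: first character -> dictionary words starting with it.
    Level 2: word -> shorter dictionary words contained in it.
    """
    vocab = set(dictionary)
    by_first_char = defaultdict(set)
    for word in vocab:
        by_first_char[word[0]].add(word)

    overlaid = {}
    for word in vocab:
        subs = set()
        # At each position, look up candidates by their first character
        # and keep those that actually occur at that position.
        for start in range(len(word)):
            for cand in by_first_char.get(word[start], ()):
                if cand != word and word.startswith(cand, start):
                    subs.add(cand)
        overlaid[word] = sorted(subs)
    return by_first_char, overlaid


def expand_query(word, overlaid, base_weight=1.0):
    """Expand a query word with its overlaid words.

    Assumed weighting: base_weight scaled by the length ratio
    len(sub) / len(word); the paper's actual weights may differ.
    """
    expansion = {word: base_weight}
    for sub in overlaid.get(word, []):
        expansion[sub] = base_weight * len(sub) / len(word)
    return expansion


if __name__ == "__main__":
    words = ["中华人民共和国", "中华", "人民", "共和国", "共和", "华人"]
    _, overlaid_index = build_two_level_index(words)
    for term, weight in expand_query("中华人民共和国", overlaid_index).items():
        print(term, round(weight, 3))
```

Under these assumptions the original large-granularity word keeps the highest weight, which favors precision, while the lower-weighted overlaid words let documents that were indexed at a finer granularity still match, which helps recall.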