Chinese keyword extraction based on max-duplicated strings of the documents

Authors: Wenfeng Yang
Affiliations: Tsinghua University, P.R. China
Venue: SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year: 2002

Abstract

Corpus analysis methods for Chinese keyword extraction treat the corpus as a single sample of a stochastic language process. However, the distribution of keywords across the whole corpus differs greatly from their distribution within each document. Extraction based only on global statistical information can therefore find only the keywords that are significant for the corpus as a whole. Max-duplicated strings, by contrast, contain the locally significant keywords of each document. In this paper, we design an efficient algorithm that extracts the max-duplicated strings by building a PAT-tree for each document, so that keywords can be picked out from the max-duplicated strings according to their SIG values in the corpus.
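
The idea of a max-duplicated string can be illustrated with a small sketch. The abstract does not define the term, so the code below assumes one plausible reading: a substring that occurs at least twice in a document and that cannot be extended by one character on either side without occurring fewer times. The function name `max_duplicated_strings` and the `min_len` parameter are hypothetical, and the brute-force counting stands in for the PAT-tree used in the paper; it is meant only to show the expected output on short inputs, not to be efficient. The SIG scoring step is not shown because the abstract does not specify it.

```python
from collections import Counter


def max_duplicated_strings(text, min_len=2):
    """Return {substring: count} for repeated substrings that are maximal.

    Assumed definition (not taken from the paper): a substring is kept if it
    occurs at least twice and no one-character extension to the left or to
    the right occurs the same number of times.
    """
    n = len(text)
    # Count every occurrence of every substring (overlaps included).
    # This is quadratic in space and only suitable for short texts.
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    repeated = {s: c for s, c in counts.items() if c >= 2}

    # A substring is "absorbed" when a one-character-longer repeated string
    # containing it occurs just as often, so the shorter one adds nothing.
    absorbed = set()
    for s, c in repeated.items():
        if len(s) > 1:
            for shorter in (s[1:], s[:-1]):
                if repeated.get(shorter) == c:
                    absorbed.add(shorter)

    return {
        s: c
        for s, c in repeated.items()
        if s not in absorbed and len(s) >= min_len
    }


if __name__ == "__main__":
    # "关键词" appears twice; its shorter pieces are absorbed by it.
    print(max_duplicated_strings("关键词提取和关键词索引"))  # {'关键词': 2}
```

Under this assumed definition, a string such as "abcabcxbc" yields both "abc" (two occurrences) and "bc" (three occurrences), because "bc" occurs more often than any of its extensions. A candidate list of this kind is what a corpus-level significance score would then rank.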