A corpus-based statistical approach to automatic book indexing

Authors:
Jyun-Sheng Chang;Tsung-Yih Tseng;Ying Cheng;Huey-Chyun Chen;Shun-Der Cheng;Sur-Jin Ker;John S. Liu
Affiliations:
National Tsing Hua University, Hsinchu, Taiwan, ROC;National Tsing Hua University, Hsinchu, Taiwan, ROC;National Tsing Hua University, Hsinchu, Taiwan, ROC;National Tsing Hua University, Hsinchu, Taiwan, ROC;National Tsing Hua University, Hsinchu, Taiwan, ROC;SooChow University;Sampo Research Institute
Venue:
ANLC '92 Proceedings of the third conference on Applied natural language processing
Year:
1992

Abstract

The paper reports on a new approach to the automatic generation of back-of-book indexes for Chinese books. Full sentential parsing is avoided because it is inefficient and because no Chinese grammar with sufficient coverage is available. Instead, a fundamental analysis particular to Chinese text, called word segmentation, is performed to break a string of characters into a sequence of lexical units equivalent to words in English. The sequence of words then goes through part-of-speech tagging and noun phrase analysis. All of these analyses are done using a corpus-based statistical algorithm. Experiments have shown satisfactory results.
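The abstract does not say which statistical model is used for word segmentation, so the following is only a minimal sketch of one common formulation: choose the split that maximizes the product of unigram word probabilities, found by dynamic programming. The lexicon, its probabilities, the unknown-character penalty, and the names `TOY_LEXICON`, `UNKNOWN_LOGPROB`, `MAX_WORD_LEN`, and `segment` are all assumptions made up for illustration and are not taken from the paper.

```python
import math

# Hypothetical toy lexicon: word -> relative frequency (invented for illustration).
TOY_LEXICON = {
    "自動": 0.02,
    "索引": 0.01,
    "自": 0.005,
    "動": 0.005,
    "索": 0.001,
    "引": 0.002,
    "書": 0.01,
}

# Assumed penalty for a single character that is not in the lexicon.
UNKNOWN_LOGPROB = math.log(1e-8)
# Assumed upper bound on word length, in characters.
MAX_WORD_LEN = 4


def segment(text, lexicon=TOY_LEXICON):
    """Return the split of `text` with the highest unigram log-probability."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end]
            if piece in lexicon:
                score = math.log(lexicon[piece])
            elif len(piece) == 1:
                # Fall back to a single unknown character so every position is reachable.
                score = UNKNOWN_LOGPROB
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]
```

In a pipeline like the one described, the resulting word sequence would then be passed to part-of-speech tagging and noun phrase analysis; a real system would estimate the lexicon probabilities from a corpus rather than hard-code them.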