Automatic thesaurus generation for Chinese documents

Authors:
Yuen-Hsien Tseng
Affiliations:
Department of Library & Information Science, Fu Jen Catholic University, 510, Chung Cheng Road, HsinChuang, Taipei, Taiwan, Republic of China
Venue:
Journal of the American Society for Information Science and Technology
Year:
2002

Abstract

This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that, on average, 33% of the keywords identified by this algorithm in each document were unknown to a lexicon of 123,226 terms. Of these unregistered words, only 8.3% are illegal. The keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar level of term relatedness.
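
The accumulation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so this sketch assumes a Dice-style weight computed from sentence-level co-occurrence within each document; the function name `accumulate_associations`, the threshold value, and the sample keywords are all hypothetical. Keyword extraction and filtering are assumed to have been done already.

```python
from collections import defaultdict
from itertools import combinations


def accumulate_associations(documents, threshold=0.5):
    """Accumulate per-document association weights into term-pair similarities.

    documents: a list of documents; each document is a list of sentences;
    each sentence is a set of keywords already extracted and filtered.
    The Dice-style weight below is an assumption, not the paper's formula.
    """
    totals = defaultdict(float)
    for sentences in documents:
        sentence_freq = defaultdict(int)  # sentences containing each term
        pair_freq = defaultdict(int)      # sentences containing both terms
        for keywords in sentences:
            for term in keywords:
                sentence_freq[term] += 1
            for a, b in combinations(sorted(keywords), 2):
                pair_freq[(a, b)] += 1
        for (a, b), co in pair_freq.items():
            weight = 2.0 * co / (sentence_freq[a] + sentence_freq[b])
            # Only weights above the threshold contribute to the final score.
            if weight > threshold:
                totals[(a, b)] += weight
    return dict(totals)


# Hypothetical sample: three tiny documents with pre-extracted keywords.
docs = [
    [{"index", "retrieval"}, {"index", "retrieval", "thesaurus"}, {"retrieval"}],
    [{"index", "retrieval"}, {"index"}, {"retrieval"}, {"retrieval"}],
    [{"index", "retrieval"}],
]
similarities = accumulate_associations(docs, threshold=0.5)
for pair, score in sorted(similarities.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Because each document is processed independently and only its above-threshold pairs are kept, the work per document stays bounded by that document's own keyword set, which is one plausible reading of why such a scheme would be faster than computing associations over the whole collection at once.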