Using a web-based categorization approach to generate thematic metadata from texts

Authors:
Chien-Chung Huang; Shui-Lung Chuang; Lee-Feng Chien
Affiliations:
Academia Sinica, Taiwan; Academia Sinica, Taiwan; Academia Sinica, Taiwan
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2004

Abstract

Conventional tools for automatic metadata creation mostly extract named entities or text segments from texts and annotate them with entity types such as persons, locations, and dates. However, entity-type information alone is often insufficient for machines to understand the facts contained in the texts, which rules out more advanced, intelligent applications such as concept-based search. In this work, we aim to generate more refined thematic metadata inherent in texts. Using Web resource mining, our approach acquires the training corpora needed to describe both the thematic categories and the metadata extracted from the texts. It then determines the correspondence between the two by means of categorization, and thus generates thematic metadata for the textual data. Experimental results confirm the potential and wide adaptability of our approach.
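
The pipeline outlined in the abstract (gather Web text for each thematic category and for each extracted term, then categorize the terms) can be illustrated with a minimal sketch. The abstract does not specify the search engine, feature weighting, or classifier, so everything below is an assumption: hard-coded snippets stand in for Web search results, and TF-IDF vectors with cosine similarity stand in for the categorization step. The names `tokenize`, `tfidf_vectors`, `cosine`, and `assign_categories`, as well as the sample categories and terms, are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """docs maps a name to a token list; returns name -> {token: weight}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each token
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        # The +1 keeps tokens that occur in every document from vanishing.
        vectors[name] = {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_categories(term_snippets, category_snippets):
    """Assign each term the category whose corpus is most similar to the term's."""
    docs = {}
    for cat, snippets in category_snippets.items():
        docs[("cat", cat)] = tokenize(" ".join(snippets))
    for term, snippets in term_snippets.items():
        docs[("term", term)] = tokenize(" ".join(snippets))
    vecs = tfidf_vectors(docs)
    result = {}
    for term in term_snippets:
        result[term] = max(
            category_snippets,
            key=lambda cat: cosine(vecs[("term", term)], vecs[("cat", cat)]),
        )
    return result


# Hypothetical stand-ins for snippets that a Web search would return.
category_snippets = {
    "Physicists": ["physics theory of relativity quantum nobel prize physicist"],
    "Composers": ["symphony opera composer music orchestra piano"],
}
term_snippets = {
    "Einstein": ["einstein developed the theory of relativity and won the nobel prize in physics"],
    "Mozart": ["mozart composed symphony and opera music for piano and orchestra"],
}

if __name__ == "__main__":
    for term, cat in assign_categories(term_snippets, category_snippets).items():
        print(term, "->", cat)
```

In this sketch the assigned category name plays the role of the thematic metadata for a term. A real system would replace the canned snippets with retrieved Web text and would likely need a decision threshold so that terms fitting no category are left unlabeled.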