Exploiting poly-lingual documents for improving text categorization effectiveness

Authors:
Chih-Ping Wei;Chin-Sheng Yang;Ching-Hsien Lee;Huihua Shi;Christopher C. Yang
Affiliations:
Department of Information Management, National Taiwan University, Taipei, Taiwan, ROC;Department of Information Management, Yuan Ze University, Chung-Li, Taiwan, ROC;Department of Information Management, National Taiwan University, Taipei, Taiwan, ROC;Infrastructure & System Department I, Information Technology Division (AUT), AU Optronics Corporation, Hsinchu Science Park, Hsinchu, Taiwan, ROC;College of Computing and Informatics, Drexel University, Philadelphia, PA, USA
Venue:
Decision Support Systems
Year:
2014

Abstract

With the globalization of business environments and the rapid emergence and proliferation of the Internet, organizations and individuals often generate, acquire, and archive documents written in different languages (i.e., poly-lingual documents). A prevalent document management practice is to organize this ever-increasing volume of poly-lingual documents into categories for subsequent search and access. Poly-lingual text categorization (PLTC) refers to the automatic learning of text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced models. Although PLTC can be approached as multiple independent monolingual text categorization problems, this naive approach uses only the training documents of one language to construct the monolingual classifier for that language and thus fails to exploit the opportunity offered by poly-lingual training documents. In this study, we propose a feature-reinforcement-based PLTC (FR-PLTC) technique that takes the training documents of all languages into account when constructing the monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) approach as a performance benchmark, our empirical evaluation shows that the proposed FR-PLTC technique achieves higher classification accuracy than the benchmark. In addition, our empirical results suggest that FR-PLTC outperforms MnTC across a range of training sizes.
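
To make the benchmark concrete, the following is a minimal sketch of the independent monolingual (MnTC) setup described in the abstract: training documents are grouped by language, and each language gets its own classifier trained only on documents in that language. The classifier (a tiny naive Bayes), the function names `train_mntc` and `classify`, and the toy documents are all assumptions made for illustration; the abstract does not say which learning algorithm the study uses. The feature-reinforcement step of FR-PLTC is not sketched, because the abstract does not specify how it works.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore terms never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_mntc(training_set):
    """Train one independent classifier per language (hypothetical helper).

    training_set: iterable of (language, tokens, category) triples.
    Each classifier sees only the training documents of its own language.
    """
    by_lang = defaultdict(lambda: ([], []))
    for lang, tokens, category in training_set:
        by_lang[lang][0].append(tokens)
        by_lang[lang][1].append(category)
    return {
        lang: NaiveBayes().fit(docs, labels)
        for lang, (docs, labels) in by_lang.items()
    }


def classify(models, lang, tokens):
    """Route an unclassified document to the classifier for its language."""
    return models[lang].predict(tokens)


# Toy preclassified poly-lingual training set (made-up data).
training_set = [
    ("en", ["stock", "market", "shares"], "finance"),
    ("en", ["football", "match", "goal"], "sports"),
    ("zh", ["股票", "市场"], "finance"),
    ("zh", ["足球", "比赛"], "sports"),
]
models = train_mntc(training_set)
print(classify(models, "en", ["market", "shares"]))  # finance
print(classify(models, "zh", ["足球"]))  # sports
```

In this setup the English classifier never sees the Chinese training documents and vice versa, which is the limitation the abstract attributes to the naive approach.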