Cross-lingual text categorization: Conquering language boundaries in globalized environments

Authors:
Chih-Ping Wei; Yen-Ting Lin; Christopher C. Yang
Affiliations:
Department of Information Management, College of Management, National Taiwan University, Taipei, Taiwan, ROC; Science & Technology Policy Research and Information Center, National Applied Research Laboratories, Taipei, Taiwan, ROC; College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
Venue:
Information Processing and Management: an International Journal
Year:
2011

Abstract

Text categorization is the automatic learning of a categorization model from a training set of preclassified documents, on the basis of their contents, and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., documents written in the same language) both when learning the model and when assigning (predicting) categories for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate documents in one language and organize them into categories, and subsequently archive documents in different languages into those existing categories, which necessitates cross-lingual text categorization (CLTC). Specifically, CLTC deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents in a different language (e.g., L2). Motivated by this demand, this study aims to design a CLTC technique with two different category assignment methods, namely, individual-based and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation results demonstrate the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.
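
The abstract does not spell out how the two category assignment methods work, so the following is only a toy sketch of the general idea, not the authors' technique. It assumes a hypothetical bilingual dictionary for mapping L2 terms into the L1 vocabulary, a simple centroid classifier with cosine similarity, and single-pass clustering of the L2 documents; all function names, the threshold, and the sample data are made up for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(docs):
    """docs: list of (tokens, category) in L1. Returns category -> pooled term counts."""
    centroids = {}
    for tokens, cat in docs:
        centroids.setdefault(cat, Counter()).update(tokens)
    return centroids

def translate(tokens, dictionary):
    """Map L2 tokens into the L1 vocabulary; tokens without an entry are dropped."""
    return Counter(dictionary[t] for t in tokens if t in dictionary)

def classify(vec, centroids):
    """Most similar category, or None when the vector matches no category at all."""
    best = max(centroids, key=lambda c: cosine(vec, centroids[c]))
    return best if cosine(vec, centroids[best]) > 0 else None

def individual_assign(l2_docs, dictionary, centroids):
    """Assumed 'individual-based' reading: classify each L2 document on its own."""
    return [classify(translate(d, dictionary), centroids) for d in l2_docs]

def cluster_assign(l2_docs, dictionary, centroids, threshold=0.3):
    """Assumed 'cluster-based' reading: cluster L2 documents in their own
    language, classify each cluster's pooled terms, and give every member
    the cluster's category."""
    clusters = []  # each entry: (pooled L2 term counts, member indices)
    for i, doc in enumerate(l2_docs):
        vec = Counter(doc)
        best = max(clusters, key=lambda c: cosine(vec, c[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[0].update(vec)
            best[1].append(i)
        else:
            clusters.append((vec, [i]))
    labels = [None] * len(l2_docs)
    for pooled, members in clusters:
        label = classify(translate(pooled.elements(), dictionary), centroids)
        for i in members:
            labels[i] = label
    return labels

# Toy data (hypothetical): English as L1, Chinese as L2.
train = [(["stock", "market", "bank"], "finance"),
         (["team", "match", "goal"], "sports")]
dictionary = {"股票": "stock", "市场": "market", "球队": "team", "比赛": "match"}
l2_docs = [["股票", "市场"], ["球队", "比赛", "冠军"], ["冠军", "教练"]]

centroids = train_centroids(train)
print(individual_assign(l2_docs, dictionary, centroids))
print(cluster_assign(l2_docs, dictionary, centroids))
```

In this toy run the third document contains no term covered by the dictionary, so the individual-based pass leaves it unassigned, while the cluster-based pass groups it with the second document through a shared L2 term and assigns both to "sports". This illustrates one plausible reason pooling evidence across similar documents could help; it says nothing about the accuracy figures reported in the paper.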