A document is known by the company it keeps: neighborhood consensus for short text categorization

Authors:
Gabriela Ramírez-De-La-Rosa;Manuel Montes-Y-Gómez;Thamar Solorio;Luis Villaseñor-Pineda
Affiliations:
Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, USA;Department of Computational Sciences, National Institute for Astrophysics, Optics and Electronics, Puebla, Mexico;Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, USA;Department of Computational Sciences, National Institute for Astrophysics, Optics and Electronics, Puebla, Mexico
Venue:
Language Resources and Evaluation
Year:
2013

Abstract

Over the last decades the Web has become the largest repository of digital information. To organize this information, several text categorization methods have been developed, achieving accurate results in most cases and across very different domains. With the recent use of the Internet as a communication medium, short texts such as news, tweets, blogs, and product reviews are becoming more common every day. This context poses two main challenges. On the one hand, these documents are short, so word frequencies are not informative enough, which makes text categorization even more difficult than usual. On the other hand, topics change constantly and quickly, so adequate amounts of training data are often lacking. To deal with these two problems, we consider a text classification method based on the idea that similar documents may belong to the same category. Specifically, we propose a neighborhood consensus classification method that classifies documents by considering their own information as well as information about the category assigned to other similar documents from the same target collection. The short texts used in our evaluation are news titles with an average of 8 words. Experimental results are encouraging: they indicate that leveraging information from similar documents helped to improve classification accuracy and that the proposed method is especially useful when labeled training resources are limited.
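
The neighborhood consensus idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration only: a bag-of-words representation, a prototype (summed term counts) per category as the base classifier, cosine similarity, and the hypothetical parameters `k` (number of neighbors) and `lam` (weight given to a document's own evidence). It is not the authors' implementation.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one short text (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(labeled):
    """Sum the term counts of the labeled texts of each category."""
    protos = {}
    for text, label in labeled:
        protos.setdefault(label, Counter()).update(vectorize(text))
    return protos


def neighborhood_consensus(labeled, targets, k=2, lam=0.5):
    """Classify each target text from its own scores plus its neighbors' scores.

    k and lam are hypothetical parameters, not taken from the source.
    """
    protos = build_prototypes(labeled)
    vecs = [vectorize(t) for t in targets]
    # Own evidence: similarity of each target document to each category prototype.
    own = [{c: cosine(v, p) for c, p in protos.items()} for v in vecs]

    predictions = []
    for i, v in enumerate(vecs):
        # The k most similar documents from the same target collection.
        neighbors = sorted(
            ((cosine(v, u), j) for j, u in enumerate(vecs) if j != i),
            reverse=True,
        )[:k]
        total_sim = sum(s for s, _ in neighbors)
        combined = {}
        for c in protos:
            # Similarity-weighted average of the neighbors' scores for category c.
            weighted = sum(s * own[j][c] for s, j in neighbors)
            neighbor_score = weighted / total_sim if total_sim else 0.0
            combined[c] = lam * own[i][c] + (1 - lam) * neighbor_score
        predictions.append(max(combined, key=combined.get))
    return predictions


# Toy data invented for this sketch (not from the paper's evaluation).
labeled = [
    ("stocks fall as markets slide", "business"),
    ("team wins championship final", "sports"),
]
targets = [
    "markets slide after bank report",
    "bank report worries investors",
    "coach praises team after final",
    "coach names squad",
]
predictions = neighborhood_consensus(labeled, targets, k=2, lam=0.5)
```

In this toy collection the second and fourth titles share no words with the labeled examples, so their own scores are zero for every category; any label they receive comes from the scores of their nearest neighbors in the target collection, which is the situation the abstract singles out for short texts with little training data.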