Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Authors:
Xuan-Hieu Phan; Le-Minh Nguyen; Susumu Horiguchi
Affiliations:
Tohoku University, Sendai, Japan; Japan Advanced Institute of Science and Technology, Nomi, Japan; Tohoku University, Sendai, Japan
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Abstract

This paper presents a general framework for building classifiers for short and sparse text & Web segments by exploiting hidden topics discovered from large-scale data collections. The main motivation is that many classification tasks on short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy because of data sparseness. We therefore propose drawing on external knowledge to make the data more closely related and to expand the coverage of classifiers so that they handle future data better. The underlying idea of the framework is that, for each classification task, we collect a large-scale external data collection called the "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is general enough to be applied to different data domains and genres, ranging from Web search results to medical text. We carried out a careful evaluation on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks, "Web search domain disambiguation" and "disease categorization for medical text", and achieved a significant improvement in classification quality.
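The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes that the hidden topics come from latent Dirichlet allocation trained by collapsed Gibbs sampling; the abstract itself does not name the topic model. The function names (`train_lda`, `infer_topics`, `augment`), the hyperparameters, the `cut_off` threshold, and the toy "universal dataset" are all hypothetical and chosen only for illustration. The sketch estimates topics on the external collection, infers a topic distribution for a short snippet, and appends topic tokens to the snippet's words so that any standard classifier could be trained on the enriched features.

```python
import random
from collections import Counter


def train_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total words in topic k
    doc_topic = [[0] * num_topics for _ in docs]         # words of doc d in topic k
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current assignment before resampling it.
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, topic_total, vocab_size


def infer_topics(snippet, model, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Estimate a topic distribution for a new snippet, keeping the model fixed."""
    topic_word, topic_total, vocab_size = model
    rng = random.Random(seed)
    zs = [rng.randrange(num_topics) for _ in snippet]
    counts = [0] * num_topics
    for k in zs:
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(snippet):
            counts[zs[i]] -= 1
            weights = [
                (counts[t] + alpha)
                * (topic_word[t][w] + beta)
                / (topic_total[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            zs[i] = rng.choices(range(num_topics), weights)[0]
            counts[zs[i]] += 1
    total = len(snippet) + num_topics * alpha
    return [(c + alpha) / total for c in counts]


def augment(snippet, theta, cut_off=0.2):
    """Append a token for every topic whose probability reaches cut_off."""
    return list(snippet) + [f"topic:{k}" for k, p in enumerate(theta) if p >= cut_off]


# Toy stand-in for the large external "universal dataset".
universal = [
    ["car", "engine", "wheel", "road", "car"],
    ["engine", "fuel", "car", "speed"],
    ["cat", "animal", "jungle", "jaguar"],
    ["jaguar", "animal", "predator", "jungle"],
]
model = train_lda(universal, num_topics=2)
snippet = ["jaguar", "engine"]
theta = infer_topics(snippet, model, num_topics=2)
features = augment(snippet, theta)
```

The enriched `features` list keeps the snippet's own words and adds topic tokens, so two snippets that share no words can still share features when they fall under the same hidden topic. A real system would replace the toy corpus with a collection on the scale the abstract mentions and feed the features to a trained classifier.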