A Hidden Topic-Based Framework toward Building Applications with Short Web Documents

Authors:
Xuan-Hieu Phan;Cam-Tu Nguyen;Dieu-Thu Le;Le-Minh Nguyen;Susumu Horiguchi;Quang-Thuy Ha
Affiliations:
Tohoku University, Sendai;Tohoku University, Sendai;University of Trento, Italy;Japan Advanced Institute of Science and Technology, Nomi;Tohoku University, Sendai;Vietnam National University, Hanoi
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2011

Citing 0
Cited 10

Metadata enrichment via topic models for author name disambiguation
NLP4DL'09/AT4DL'09 Proceedings of the 2009 international conference on Advanced language technologies for digital libraries

Transferring topical knowledge from auxiliary long texts for short text clustering
Proceedings of the 20th ACM international conference on Information and knowledge management

A keyword-topic model for contextual advertising
Proceedings of the Third Symposium on Information and Communication Technology

Query classification using topic models and support vector machine
ACL '12 Proceedings of ACL 2012 Student Research Workshop

Low frequency keyword and keyphrase extraction from meeting transcripts with sentiment classification using unsupervised framework
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology

Exploiting enriched contextual information for mobile app classification
Proceedings of the 21st ACM international conference on Information and knowledge management

Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon
Knowledge-Based Systems

Distributional term representations for short-text categorization
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2

Learning to detect task boundaries of query session
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

A feature-word-topic model for image annotation and retrieval
ACM Transactions on the Web (TWEB)

Abstract

This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework addresses two main challenges posed by such documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to a lack of shared words and contexts among documents, while the latter are major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea is that common hidden topics discovered from large external data sets (universal data sets), when added to short documents, make them less sparse and more topic-oriented. Hidden topics from universal data sets also help handle unseen data better. The proposed framework can be applied to different natural languages and data domains. We evaluated the framework in two experiments on two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and achieved significant results.
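
The enrichment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOPIC_WORD` table is a hypothetical, hand-written stand-in for a topic model estimated on a universal data set, `infer_topics` is a crude replacement for proper topic inference, and the names `enrich`, `threshold`, and `scale` are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical p(word | topic) values, standing in for a topic model
# estimated on a large external ("universal") data set.
TOPIC_WORD = {
    "topic_finance": {"bank": 0.4, "loan": 0.3, "interest": 0.3},
    "topic_nature": {"river": 0.4, "bank": 0.2, "water": 0.4},
}


def infer_topics(tokens):
    """Crude topic mixture: sum p(word | topic) over tokens, then normalize.

    This is a simplification, not real topic-model inference.
    """
    scores = {
        topic: sum(words.get(tok, 0.0) for tok in tokens)
        for topic, words in TOPIC_WORD.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {topic: 0.0 for topic in scores}
    return {topic: s / total for topic, s in scores.items()}


def enrich(tokens, threshold=0.3, scale=4):
    """Return a bag of words extended with topic pseudo-tokens.

    Topics whose weight reaches `threshold` are added as extra features,
    with a count proportional to their weight.
    """
    bag = Counter(tokens)
    for topic, weight in infer_topics(tokens).items():
        if weight >= threshold:
            bag[topic] += round(weight * scale)
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two short snippets about the same subject that share no words.
snippet_a = ["loan", "interest"]
snippet_b = ["bank", "account"]

print("raw similarity:     ", cosine(Counter(snippet_a), Counter(snippet_b)))
print("enriched similarity:", cosine(enrich(snippet_a), enrich(snippet_b)))
```

The two snippets have no word in common, so their raw bag-of-words vectors do not overlap; after enrichment both carry the shared `topic_finance` pseudo-token, which gives a classifier or a matching function something to compare. In practice the topic table would come from a model trained on the universal data set rather than being written by hand.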