Representation models for text classification: a comparative analysis over three web document types

Authors:
George Giannakopoulos;Petra Mavridi;Georgios Paliouras;George Papadakis;Konstantinos Tserpes
Affiliations:
SKEL - NCSR Demokritos, Athens, Greece;National Technical University of Athens, Greece;SKEL - NCSR Demokritos, Athens, Greece;L3S Research Center, Hanover, Germany, and National Technical University of Athens, Greece;National Technical University of Athens, Greece
Venue:
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Year:
2012

Citing 26
Cited 0

Hierarchical classification of Web content

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Latent dirichlet allocation

The Journal of Machine Learning Research
An extensive empirical study of feature selection metrics for text classification

The Journal of Machine Learning Research
Automatic text categorization in terms of genre and author

Computational Linguistics
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Feature selection methods for text classification

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Finding high-quality content in social media

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Proceedings of the 17th international conference on World Wide Web
Introduction to Information Retrieval

Introduction to Information Retrieval
Summarization system evaluation revisited: N-gram graphs

ACM Transactions on Speech and Language Processing (TSLP)
LIBLINEAR: A Library for Large Linear Classification

The Journal of Machine Learning Research
Evidence of quality of textual features on the web 2.0

Proceedings of the 18th ACM conference on Information and knowledge management
The WEKA data mining software: an update

ACM SIGKDD Explorations Newsletter
Keyword extraction for social snippets

Proceedings of the 19th international conference on World wide web
Short text classification in twitter to improve information filtering

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
A latent variable model for geographic lexical variation

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Patterns of temporal variation in online media

Proceedings of the fourth ACM international conference on Web search and data mining
Empirical study of topic modeling in Twitter

Proceedings of the First Workshop on Social Media Analytics
Topic classification in social media using metadata from hyperlinked objects

ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Local histograms of character N-grams for authorship attribution

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Towards effective short text deep classification

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Improving categorisation in social media using hyperlinks to structured data sources

ESWC'11 Proceedings of the 8th extended semantic web conference on The semanic web: research and applications - Volume Part II
Discovering context: classifying tweets through a semantic transform based on wikipedia

FAC'11 Proceedings of the 6th international conference on Foundations of augmented cognition: directing the future of adaptive systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. To address it, patterns of co-occurring words or characters are typically extracted from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually entails lower levels of noise than the user-generated content of the blog posts and the other Social Media. In this paper, we provide some insight and a preliminary study on a tripartite categorization of Web documents, based on inherent document characteristics. We claim and support that each category calls for different classification settings with respect to the representation model. We verify this claim experimentally, by showing that topic classification on these different document types offers very different results per type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: namely the n-gram graphs. This model goes beyond the established bag-of-words one, representing each document as a graph. Individual graphs can be combined into a class graph and graph similarities are then employed to position and classify documents into the vector space. Accuracy is increased due to the contextual information that is encapsulated in the edges of the n-gram graphs; efficiency, on the other hand, is boosted by reducing the feature space to a limited set of dimensions that depend on the number of classes, rather than the size of the vocabulary. Our experimental study over three large-scale, real-world data sets validates the higher performance of n-gram graphs in all three domains of Web documents.