Understanding temporal aspects in document classification

Authors:
Fernando Mourão; Leonardo Rocha; Renata Araújo; Thierson Couto; Marcos Gonçalves; Wagner Meira, Jr.
Affiliations:
Federal University of Minas Gerais, Belo Horizonte, Brazil (all authors)
Venue:
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Year:
2008


Abstract

Due to the increasing amount of information available on the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually follows a standard supervised learning strategy: a model is first built from preclassified documents and then used to classify new, unseen documents. One major challenge for ADC in many scenarios is that the characteristics of the documents, and of the classes to which they belong, may change over time. However, most current ADC techniques are applied without taking into account the temporal evolution of the document collection. In this work, we perform a detailed study of temporal evolution in ADC and introduce an analysis methodology. We argue that temporal evolution may be explained by three factors: 1) class distribution; 2) term distribution; and 3) class similarity. We employ metrics and experimental strategies capable of isolating each of these factors in order to analyze them separately, using two very different document collections: the ACM Digital Library and the Medline medical collection. Moreover, we present preliminary results on the gains that could be obtained by varying the training set to find the size that minimizes temporal effects. We show that by using just 69% of the ACM database we achieve an accuracy of 89.76%, and with only 25% of Medline an accuracy of 87.57%, which means gains of up to 20% in accuracy with much smaller training sets.
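
As an illustration of the first factor named in the abstract (class distribution), the sketch below measures how much the share of each class changes from one year to the next. The abstract does not say which metrics the authors use, so total variation distance is an assumption made here for illustration, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def class_distribution_by_year(docs):
    """Map each year to {class label: fraction of that year's documents}."""
    counts = defaultdict(Counter)
    for year, label in docs:
        counts[year][label] += 1
    return {
        year: {label: n / sum(c.values()) for label, n in c.items()}
        for year, c in counts.items()
    }


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def drift_between_consecutive_years(docs):
    """Return {(year_a, year_b): distance} for each pair of consecutive years in the data."""
    dist = class_distribution_by_year(docs)
    years = sorted(dist)
    return {(a, b): total_variation(dist[a], dist[b]) for a, b in zip(years, years[1:])}


# Hypothetical toy collection of (publication year, class label) pairs.
docs = [
    (2000, "databases"), (2000, "databases"), (2000, "networks"), (2000, "networks"),
    (2001, "databases"), (2001, "networks"), (2001, "networks"), (2001, "networks"),
]
drift = drift_between_consecutive_years(docs)
print(drift)
```

A large distance between two years would suggest that a model trained on the earlier year sees a class mix different from the one it is later asked to classify. This is one of the effects the abstract attributes to temporal evolution.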