Building a dynamic classifier for large text data collections

Authors:
Pavel Kalinov;Bela Stantic;Abdul Sattar
Affiliations:
Griffith University, Brisbane, Australia;Griffith University, Brisbane, Australia;Griffith University, Brisbane, Australia
Venue:
ADC '10 Proceedings of the Twenty-First Australasian Conference on Database Technologies - Volume 104
Year:
2010

Abstract

Because the web has no built-in navigation tools, people rely on external solutions to find information. The most popular of these are search engines and web directories. Search engines let users locate specific information about a particular topic, whereas web directories support exploration of a broader topic. In recent years, statistical machine learning methods have been applied successfully in search engines, while web directories have remained in a primitive state, which has led to their decline. Exploration, however, answers a different information need and should not be neglected: web directories should offer a user experience of the same quality as search engines. Developing them with machine learning methods is hindered by the noisy nature of the web, which makes text classifiers unreliable when applied to web data. In this paper we propose Stochastic Prior Distribution Adjustment (SPDA), a variation of the Multinomial Naïve Bayes (MNB) classifier that makes it better suited to classifying real-world data. By stochastically adjusting class prior distributions, we achieve a better overall success rate and, more importantly, significantly improve the error distribution across classes, making the classifier equally reliable for all classes and therefore more usable.
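
To make the role of class priors concrete, the following is a minimal sketch of a Multinomial Naïve Bayes classifier in which the log priors are an explicit, adjustable input. The abstract does not describe how SPDA actually adjusts the priors, so the `adjust_priors` function is a hypothetical random search written purely for illustration. It is not the authors' algorithm. The toy documents, the function names, the smoothing parameter `alpha`, and the acceptance rule (keep a perturbation only if the worst per-class error rate drops) are all assumptions.

```python
import math
import random
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    """Return argmax_c [log P(c) + sum_w log P(w|c)]; unknown words are ignored."""
    def score(c):
        return log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(log_prior, key=score)

def per_class_error(docs, labels, log_prior, log_like):
    """Fraction of misclassified documents, computed separately for each class."""
    errs = {}
    for c in log_prior:
        idx = [i for i, y in enumerate(labels) if y == c]
        wrong = sum(predict(docs[i], log_prior, log_like) != c for i in idx)
        errs[c] = wrong / len(idx) if idx else 0.0
    return errs

def adjust_priors(docs, labels, log_prior, log_like, steps=200, scale=0.5, seed=0):
    """Hypothetical random search over log priors (an assumption, not SPDA itself).

    A perturbation is kept only if it lowers the worst per-class error rate.
    The priors are left unnormalized: adding a constant to every log prior
    does not change the argmax in predict().
    """
    rng = random.Random(seed)
    best = dict(log_prior)
    best_worst = max(per_class_error(docs, labels, best, log_like).values())
    for _ in range(steps):
        cand = dict(best)
        c = rng.choice(sorted(cand))
        cand[c] += rng.gauss(0.0, scale)
        worst = max(per_class_error(docs, labels, cand, log_like).values())
        if worst < best_worst:
            best, best_worst = cand, worst
    return best, best_worst

# Toy, deliberately unbalanced data (made up for this sketch).
docs = [["sport", "ball"], ["sport", "game"], ["sport", "ball", "game"], ["news", "vote"]]
labels = ["sports", "sports", "sports", "politics"]

log_prior, log_like = train_mnb(docs, labels)
# Same likelihoods, different priors: the prior alone can flip the decision.
skewed_prior = {"sports": math.log(0.01), "politics": math.log(0.99)}
print(predict(["ball"], log_prior, log_like))
print(predict(["ball"], skewed_prior, log_like))
```

The sketch only illustrates that the prior term can change a decision while the word likelihoods stay fixed, which is the lever the abstract says SPDA uses. A real evaluation would measure per-class error on held-out data, not on the training set.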